Course Web Pages - Fall 2012 - LIBR 246-06/15 Greensheet - Assignments
LIBR 246
Text/Data/Web Mining for LIS
Assignments
Dr. Geoffrey Z. Liu
Individual Assignment of Reading Summaries
Throughout the course, students will write four brief summaries of selected readings, each summarizing one book chapter or one conference/journal article related to the student's chosen track of study focus. The summary is a write-up of the student's digested understanding of the chosen reading material; it is neither an abstract nor an annotated bibliography. The content of the reading chosen for each summary corresponds roughly to the major components of the group research report (see below). Readings for the summaries may therefore be chosen purposefully according to the group study focus, so that the summaries can be integrated into the group project and build toward the final report of group research. Specifically,
- Summary #1: Basic concepts, techniques, or mining tools;
- Summary #2: Case study/report of mining project(s);
- Summary #3: Application of text/data/web mining in LIS;
- Summary #4: Ethical, theoretical, and social issues.
Students are expected to announce their choice of reading for each summary assignment as soon as the decision is made, by posting its bibliographic reference and abstract in a designated discussion forum. The written reading summary is to be added later as an attached Microsoft Word file, either by responding to the posting of the bibliographic reference or by revising the original posting. The same file should also be submitted via the digital dropbox for grading.
Each reading summary should be no more than 5 pages, excluding cover page, appendix, and references (if any), double spaced, using 12-point Times New Roman font. The summary report should start with a complete bibliographic entry of the reviewed chapter/article, following the APA editorial style.
Group Project of Topical Research
Immediately after the orientation, students in this class will be assigned to groups to complete a collaborative project of topical research. Groups are to be formed according to each student's preferred study focus. There are four areas of study focus (referred to as "tracks" from now on) to choose from, namely: NLP-based text mining, statistical text mining, data mining, and (web use) transaction log analysis. These focus areas correspond roughly to four subfields in the general area of data mining, each building on a different set of theoretical concepts and encompassing different techniques, approaches, and tools.
In a nutshell, NLP-based text mining draws on computational linguistics to process texts syntactically and semantically at the sentence level in order to extract embedded information. Statistical text mining uses keyword-indexing techniques developed in information retrieval to discover keyword distribution patterns (themes) in texts. In contrast, data mining draws on descriptive and (mostly) inferential statistical analyses to look for correlational patterns and potentially influential factors in non-textual data. Finally, web use and transaction log analysis aims to extract behavioral patterns of user-system interaction, such as duration time, action sequence, and even query formulation in online searching.
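As a rough illustration of what "keyword distribution patterns" means in statistical text mining, the Python sketch below counts keyword frequencies across a few short texts. It is only a minimal sketch: the stop-word list, the function names, and the sample sentences are all made up for this example and are not part of the course materials or of any mining tool.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real project would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}

def keyword_counts(text):
    """Return a Counter of lowercase keyword frequencies, excluding stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def top_keywords(documents, n=3):
    """Sum keyword counts over all documents and return the n most frequent."""
    total = Counter()
    for doc in documents:
        total.update(keyword_counts(doc))
    return total.most_common(n)

# Made-up sample texts, for illustration only.
docs = [
    "The library catalog records the circulation of books.",
    "Circulation data in the library shows patterns of use.",
    "Mining circulation logs reveals patterns in library use.",
]

for word, count in top_keywords(docs, n=4):
    print(word, count)
```

Keywords that recur across many texts (here "library" and "circulation") hint at a shared theme; an actual mining project would typically add weighting and clustering on top of such raw counts.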
Student groups will conduct topical research on their chosen tracks, collaboratively learning about basic theoretical concepts, major approaches, and related issues in the focus area from lectures, textbook chapters, and conference/journal articles reporting mining projects. The group learning will concentrate on a basic, overview-level understanding of theoretical concepts, major mining approaches/techniques, and issues related to library and information services; mathematics and computational algorithms are not included unless students wish to explore them. Individual assignments of reading summaries (as explained above) may be integrated as part of the group project to build toward the final report of group research.
At the conclusion of the group project, each group will make a 30-minute presentation to the class in Elluminate and submit a written report of its research. Further instructions on the project presentation will be distributed on the D2L class site. The group research report shall cover the following aspects:
- Introduction of basic concepts and theories;
- Description of major techniques/approaches of mining;
- Survey of selected software/tools (one minimum, three maximum);
- Discussion of potential application to/in LIS;
- Analysis of ethical and social issues.
The report should be no more than 15 pages, excluding cover page, references, and appendices, double spaced, using 12-point Times New Roman font.
Individual Data Mining Exercise
This exercise is designed for students to practice basic data manipulation skills, and especially to familiarize themselves with key statistical concepts, major data mining models, and mining procedures. It also serves to check that the software has been installed and works properly.
For those choosing to do NLP-based text mining for the individual mining project, this exercise provides an opportunity to get some exposure to basics of statistical data mining.
Task / Track | Data Set | Software Tool |
Data mining exercise | House spatial data (to be provided), data extracted from the US Census Data, or the self-constructed data set for the mining project | RapidMiner, SPSS |
Although one may choose to use SPSS for the exercise, RapidMiner is strongly recommended. This is because RapidMiner includes not only the statistical analyses available in SPSS, but also more advanced models and algorithms specifically developed for data mining.
Those choosing to do NLP-based text mining may use the provided House Spatial Data set for this exercise. Others may use either the provided data set or the same data set (self-constructed or adopted) that they will use for the individual mining project.
Procedure
After downloading, installing, and configuring the software successfully, follow the steps below.
1. If using the provided House Spatial Data (or data extracted from the US Census Data site), import it into the software.
2. For RapidMiner, add appropriate pre-processing for the texts (web data or logs) in the repository to extract attribute data.
3. For SPSS, after importing, define the attribute variables and convert attribute values from "raw" to internal coding. For numeric data, changing the data type converts the values automatically. For nominal/ordinal data, code the categories with integers and define a value label for each code.
4. Execute appropriate procedures to generate descriptive statistics (frequency for categorical variables, mean and median for numeric variables). For SPSS, these procedures are found under "Analyze" -> "Descriptive Statistics". For RapidMiner, use "Data Transformation" -> "Attribute Set Reduction and Transformation" -> "Generation" -> "Generate Aggregation" or, alternatively, "Data Transformation" -> "Aggregation" -> "Aggregate". In both cases, the proper aggregate function needs to be chosen.
5. Execute appropriate procedures to compute the Pearson correlation for at least one pair of numeric variables, and a cross-tabulation with Chi-square test for at least one pair of categorical variables. These procedures may be found by following the same paths described in step 4.
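To make the statistics behind the steps above concrete, here is a minimal Python sketch of integer coding, descriptive statistics, Pearson correlation, and the Chi-square statistic for a cross-tabulation. It is not how RapidMiner or SPSS implements these procedures; the records, attribute names, and function names are made up for illustration, and the provided data set will have different variables.

```python
import math
from collections import Counter
from statistics import mean, median

# Made-up records for illustration; attribute names are hypothetical.
records = [
    {"rooms": 3, "price": 150, "region": "north", "type": "house"},
    {"rooms": 4, "price": 200, "region": "north", "type": "house"},
    {"rooms": 2, "price": 110, "region": "south", "type": "condo"},
    {"rooms": 5, "price": 260, "region": "south", "type": "house"},
    {"rooms": 2, "price": 120, "region": "north", "type": "condo"},
    {"rooms": 3, "price": 160, "region": "south", "type": "condo"},
]

def integer_codes(values):
    """Code nominal categories with integers (1, 2, ...) in sorted label order."""
    return {label: code for code, label in enumerate(sorted(set(values)), start=1)}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def chi_square(rows, cols):
    """Cross-tabulate two categorical lists; return (table, chi2, dof)."""
    table = Counter(zip(rows, cols))
    row_totals, col_totals = Counter(rows), Counter(cols)
    n = len(rows)
    chi2 = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            chi2 += (table[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return table, chi2, dof

rooms = [rec["rooms"] for rec in records]
prices = [rec["price"] for rec in records]
regions = [rec["region"] for rec in records]
types = [rec["type"] for rec in records]

# Step 3: integer coding of a nominal variable.
print("codes:", integer_codes(types))
# Step 4: descriptive statistics.
print("region frequencies:", dict(Counter(regions)))
print("price mean/median:", mean(prices), median(prices))
# Step 5: correlation and cross-tabulation with the Chi-square statistic.
print("Pearson r (rooms, price):", round(pearson_r(rooms, prices), 3))
table, chi2, dof = chi_square(regions, types)
print("chi-square:", round(chi2, 3), "dof:", dof)
```

The sketch stops at the Chi-square statistic and degrees of freedom; the significance level (p-value) that SPSS and RapidMiner report requires the Chi-square distribution, which is left to the statistical package.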
Products for submission
The submission packet for this exercise should include the following items:
- Summary Statement (5 pages maximum)
  - Brief description of data set
  - List of attribute variables
  - Discussion of process and issues
- Appendices
  - Reports of results from step 4
  - Reports of results from step 5
Individual Mining Project
This individual project is designed for students to gain hands-on experience with at least one kind of practical mining. Students may choose to complete a mining task suitable to their chosen track of study focus, as with the group project of topical research. Depending on which mining task is chosen, students will work with a different mining software tool and data set to complete this assignment. In some cases, students may need to construct/adopt a data set of their preference and write small computer programs/scripts for pre-processing. The following table lists the key components of the mining tasks for the different tracks of study focus.
Students need to complete only one mining task. Although it is possible to choose a mining task different from the track of study focus for the group project, it is to students' advantage to choose a mining task of the same track.
Task / Track | Data Set | Software Tool |
Text mining (NLP) | NYT news articles (to be provided) | CALAIS (free limited license) |
Text mining (Statistical) | NYT news articles, or self-constructed set of short paragraphs in CSV format | RapidMiner (open source application) |
Web content/structure | To be self-constructed from pages of either a single site or the public domain. Programming skills may be needed for pre-processing. | RapidMiner |
Transaction log analysis | Excite 1997 search logs, or one month's worth of OPAC logs (both to be provided). Programming skills may be needed for pre-processing. | Excel, RapidMiner |
Procedure
Specific instructions for each mining task will be distributed on the D2L class site. In general, students follow the procedure below to complete this assignment.
- Decide on a mining task;
- Download, install, and configure the needed software tool on your home computer;
- Construct or obtain the data set (by downloading);
- Preprocess the data set, recoding/converting if necessary, to prepare it for importing into the mining software tool;
- Import/upload the data set into the mining software system;
- Explore the data set to gain some basic understanding of its nature;
- Determine specific questions you want to answer (or patterns to reveal) by mining the data set;
- Identify key variables/factors/elements you want to mine;
- Mine for patterns, facts, relations etc. while at the same time taking good notes of your findings, assumptions, and mining strategies;
- Write up a report of your mining project.
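For the transaction log track, the preprocessing and exploration steps above might look roughly like the Python sketch below, which turns a raw log into one CSV summary row per user that could then be imported into Excel or RapidMiner. This is only a sketch under stated assumptions: the tab-separated layout (user id, timestamp, query), the sample lines, and all function names are made up for illustration, and the provided Excite and OPAC logs may be laid out differently.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Made-up raw log in an ASSUMED layout: user id, timestamp, query (tab-separated).
RAW_LOG = """\
u1\t1997-09-16 10:00:05\tlibrary catalog
u1\t1997-09-16 10:02:35\tlibrary catalog online
u2\t1997-09-16 10:01:00\tdata mining
u1\t1997-09-16 10:07:05\tonline public access catalog
"""

def parse_log(text):
    """Yield (user, datetime, query) tuples, skipping malformed lines."""
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        user, stamp, query = parts
        yield user, datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), query.strip()

def summarize_users(rows):
    """Return one summary dict per user: query count, duration, mean query length."""
    by_user = defaultdict(list)
    for user, when, query in rows:
        by_user[user].append((when, query))
    summary = []
    for user, events in sorted(by_user.items()):
        events.sort()  # chronological order
        duration = (events[-1][0] - events[0][0]).total_seconds()
        terms = [len(q.split()) for _, q in events]
        summary.append({
            "user": user,
            "queries": len(events),
            "duration_seconds": int(duration),
            "mean_terms_per_query": round(sum(terms) / len(terms), 2),
        })
    return summary

def to_csv(summary):
    """Serialize the summary rows as CSV text ready for import into a mining tool."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(summary[0].keys()))
    writer.writeheader()
    writer.writerows(summary)
    return buffer.getvalue()

summary = summarize_users(parse_log(RAW_LOG))
print(to_csv(summary))
```

Treating all of a user's entries as a single session is a simplification; a real analysis would usually split sessions on a period of inactivity before computing durations.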
Product for submission
The only thing you need to submit for this project is the report of your mining project. Specifically, the report shall include the following key components.
- Introduction of mining task;
- Brief discussion of data set and tool used;
- Explanation of preliminary data cleaning (and transformation if applicable);
- Statement of specific questions and key variables/factors/elements/patterns mined;
- Description of mining strategies, tactics, techniques, and process employed;
- Summary of significant findings from mining, i.e., patterns, factors, relations, summary statistics, or lists of extracted entities and relations;
- Conclusion.
The project report should be kept short, no more than 10 pages (excluding references and appendices). Large tables and figures should be left for the appendices.