Course Web Pages - Fall 2012 - LIBR 246-06/15 Greensheet - Assignments

LIBR 246
Text/Data/Web Mining for LIS
Assignments

Dr. Geoffrey Z. Liu


Individual Assignment of Reading Summaries

Throughout the course, each student will write four brief summaries of selected readings. Each summary covers one book chapter or one conference/journal article related to the student's chosen track of study focus. The summary is a write-up of the student's digested understanding of the reading, not an abstract or an annotated bibliography. The readings chosen for the summaries correspond roughly to the major components of the group research report (see below). Readings may therefore be chosen purposefully according to the group's study focus, so that the summaries can be integrated into the group project and build toward the final report of group research. Specifically,

Students are expected to announce their choice of reading for each summary assignment as soon as the decision is made, by posting its bibliographic reference and abstract in a designated discussion forum. The written summary is to be added later as a Microsoft Word attachment, either by replying to the posting of the bibliographic reference or by revising the original posting. The same file should also be submitted via the digital dropbox for grading.

Each reading summary should be no more than 5 pages, excluding the cover page, appendix, and references (if any), double-spaced in 12-point Times New Roman font. The summary should start with a complete bibliographic entry for the reviewed chapter or article, following APA editorial style.

Group Project of Topical Research

Immediately after the orientation, students in this class will be assigned to groups to complete a collaborative project of topical research. Groups are formed according to each student's preferred study focus. There are four areas of study focus (referred to as "tracks" from now on) to choose from: NLP-based text mining, statistical text mining, data mining, and (web use) transaction log analysis. These focus areas correspond roughly to four subfields in the general area of data mining, each building on a different set of theoretical concepts and encompassing different techniques, approaches, and tools.

In a nutshell, NLP-based text mining draws on computational linguistics to process texts syntactically and semantically at the sentence level in order to extract embedded information. Statistical text mining uses keyword-indexing techniques developed in information retrieval to discover keyword distribution patterns (themes) in texts. In contrast, data mining draws on descriptive and (mostly) inferential statistical analyses to look for correlational patterns and potentially influential factors in non-textual data. Finally, web use and transaction log analysis aims to extract behavioral patterns of user-system interaction, such as duration, action sequence, and even query formulation in online searching.
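To make the idea of keyword distribution patterns concrete, here is a minimal Python sketch of the statistical text mining track only. It is not the method used by any of the course tools: the stop-word list, the sample texts, and the function names are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list and sample texts, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "is", "on"}


def keyword_counts(text):
    """Count non-stop-word tokens in one text (a crude keyword index)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def document_frequency(texts):
    """For each keyword, count how many of the texts contain it."""
    df = Counter()
    for text in texts:
        # set() so that a keyword is counted once per text
        df.update(set(keyword_counts(text)))
    return df


docs = [
    "The library catalog supports keyword searching.",
    "Keyword indexing is central to information retrieval.",
    "Users search the library catalog for books.",
]
df = document_frequency(docs)
# Keywords shared by several texts hint at a common theme.
print([word for word, n in df.items() if n > 1])
```

Real keyword indexing would normally add stemming and term weighting; this sketch only counts exact word matches.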

Student groups will conduct topical research on their chosen tracks, collaboratively learning about basic theoretical concepts, major approaches, and related issues in the focus area from lectures, textbook chapters, and conference/journal articles reporting mining projects. The group learning will concentrate on a basic understanding of theoretical concepts, major mining approaches and techniques, and issues related to library and information services. Mathematics and computational algorithms are not included unless students wish to cover them. Individual assignments of reading summaries (as explained above) may be integrated into the group project to build toward the final report of group research.

At the conclusion of the group project, each group will make a 30-minute presentation to the class in Elluminate and submit a written report of its research. Further instructions on the project presentation will be distributed on the D2L class site. The group research report shall cover the following aspects:

The report should be no more than 15 pages, excluding the cover page, references, and appendices, double-spaced in 12-point Times New Roman font.

Individual Data Mining Exercise

This exercise is designed for students to practice basic data manipulation skills and, in particular, to familiarize themselves with key statistical concepts, major data mining models, and mining procedures. It also serves to check that the software is installed and works properly.

For those choosing to do NLP-based text mining for the individual mining project, this exercise provides an opportunity to get some exposure to basics of statistical data mining.

| Task / Track | Data Set | Software Tool |
| Data mining exercise | House spatial data (to be provided), data extracted from the US Census Data, or the self-constructed data set for the mining project | RapidMiner or SPSS |

Although one may choose to use SPSS for the exercise, RapidMiner is strongly recommended, because RapidMiner includes not only the statistical analyses available in SPSS but also more advanced models and algorithms developed specifically for data mining.

Those choosing to do NLP-based text mining may use the provided House Spatial Data set for this exercise. Others may use either the provided data set or the same data set (self-constructed or adopted) that they will use for the individual mining project.

Procedure

After downloading, installing, and configuring the software successfully, follow the steps below.

  1. If using the provided House Spatial Data (or data extracted from the US Census Data site), import it into the software.
  2. For RapidMiner, apply appropriate pre-processing to the texts (web data or logs) in the repository to extract attribute data.
  3. For SPSS, after importing, define the attribute variables and convert attribute values from "raw" to internal coding. For numeric data, changing the data type converts the values automatically. For nominal/ordinal data, code the categories with integers and define proper value labels.
  4. Execute appropriate procedures to generate descriptive statistics (frequency for categorical variables, mean and median for numeric variables). For SPSS, these procedures are self-evident, under "Analyze" -> "Descriptive Statistics". For RapidMiner, use "Data Transformation" -> "Attribute Set Reduction and Transformation" -> "Generation" -> "Generate Aggregation", or alternatively "Data Transformation" -> "Aggregation" -> "Aggregate". In both cases, the proper aggregate function needs to be chosen.
  5. Execute appropriate procedures to compute the Pearson correlation for at least one pair of numeric variables, and a cross-tabulation with a Chi-square test for at least one pair of categorical variables. These procedures may be found by following the same paths described in #4.
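For readers who want to see what steps 3 to 5 compute, the following Python sketch works through them by hand with the standard library only. It is an illustration and not a substitute for RapidMiner or SPSS: the records, attribute names, and function names are hypothetical, and the Chi-square function returns only the test statistic and degrees of freedom, not a p-value.

```python
import math
import statistics
from collections import Counter

# Hypothetical records standing in for an imported data set; the attribute
# names and values are invented for illustration.
records = [
    {"rooms": 2, "price": 150.0, "region": "north", "kind": "condo"},
    {"rooms": 3, "price": 200.0, "region": "north", "kind": "house"},
    {"rooms": 3, "price": 210.0, "region": "south", "kind": "condo"},
    {"rooms": 4, "price": 260.0, "region": "south", "kind": "house"},
    {"rooms": 5, "price": 330.0, "region": "south", "kind": "house"},
    {"rooms": 4, "price": 270.0, "region": "north", "kind": "house"},
]


def code_categories(values):
    """Step 3: code nominal categories with integers, keeping value labels."""
    codes = {label: i + 1 for i, label in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def pearson(xs, ys):
    """Step 5: Pearson correlation coefficient of two numeric variables."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def chi_square(pairs):
    """Step 5: Chi-square statistic and degrees of freedom for a cross-tab.

    `pairs` is a list of (row category, column category) tuples.
    """
    observed = Counter(pairs)
    row_totals = Counter(r for r, _ in pairs)
    col_totals = Counter(c for _, c in pairs)
    n = len(pairs)
    stat = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


rooms = [rec["rooms"] for rec in records]
prices = [rec["price"] for rec in records]
regions = [rec["region"] for rec in records]
kinds = [rec["kind"] for rec in records]

# Step 4: frequency for a categorical variable, mean and median for numeric.
print("region frequency:", Counter(regions))
print("price mean/median:", statistics.mean(prices), statistics.median(prices))

# Step 3: integer coding with value labels.
print("region codes:", code_categories(regions))

# Step 5: correlation and cross-tabulation.
print("rooms vs. price r:", pearson(rooms, prices))
print("region x kind chi-square, dof:", chi_square(list(zip(regions, kinds))))
```

With only six records the expected cell counts are far too small for a meaningful Chi-square test; the sample is sized only to keep the sketch short.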

Products for submission

The submission packet for this exercise should include the following items:

Individual Mining Project

This individual project is designed for students to gain hands-on experience with at least one kind of practical mining. Students may choose to complete a mining task suited to their chosen track of study focus, as with the group project of topical research. Depending on which mining task is chosen, students will work with different mining software tools and data sets to complete this assignment. In some cases, students may need to construct or adopt a data set of their preference and to write small computer programs or scripts for pre-processing. The following table lists key components of the mining tasks for the different tracks of study focus.

Students need to complete only one mining task. Although it is possible to choose a mining task from a track different from the one chosen for the group project, it is to the student's advantage to choose a mining task from the same track.

| Task / Track | Data Set | Software Tool |
| Text mining (NLP) | NYT news articles (to be provided) | CALAIS (free limited license) |
| Text mining (Statistical) | NYT news articles, or a self-constructed set of short paragraphs in CSV format | RapidMiner (open source application) |
| Web content/structure | To be self-constructed, of pages from either one single site or the public domain. Programming skills may be needed for pre-processing. | RapidMiner |
| Transaction log analysis | Excite 1997 search logs, or one month's worth of OPAC logs (both to be provided). Programming skills may be needed for pre-processing. | Excel, RapidMiner |
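As a rough idea of the kind of small pre-processing script that transaction log analysis may call for, the Python sketch below summarizes a log per user before it is loaded into Excel or RapidMiner. The log layout (tab-separated user id, time in seconds, query) and the sample lines are assumptions made for illustration; they do not describe the actual format of the Excite or OPAC logs.

```python
import csv
import io
from collections import defaultdict

# Hypothetical log layout (tab-separated): user id, time in seconds, query.
SAMPLE_LOG = (
    "u1\t100\tlibrary hours\n"
    "u2\t105\tdata mining\n"
    "u1\t160\tlibrary catalog\n"
    "u1\t400\trenew books\n"
)


def summarize_users(log_text):
    """Group log lines by user; report query count and elapsed time per user."""
    times = defaultdict(list)
    reader = csv.reader(io.StringIO(log_text), delimiter="\t")
    for user, seconds, _query in reader:
        times[user].append(int(seconds))
    return {
        user: {"queries": len(ts), "duration": max(ts) - min(ts)}
        for user, ts in times.items()
    }


summary = summarize_users(SAMPLE_LOG)
for user, stats in summary.items():
    print(user, stats)
```

A real analysis would usually also split each user's activity into sessions, for example by starting a new session after a long gap between queries; this sketch treats all of a user's lines as one span.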

Procedure

Specific instructions for each mining task will be distributed on the D2L class site. In general, students follow the procedure below to complete this assignment.

Product for submission

The only thing you need to submit for this project is the report of your mining project. Specifically, the report shall include the following key components.

The project report should be kept short, at no more than 10 pages (excluding references and appendices). Large tables and figures should be placed in the appendices.
