The objective of this assignment is to give you some hands-on experience with the construction of information retrieval systems by building a simple inverted index for a small document set.
The Reuters-21578 text categorization corpus is commonly used by researchers in machine learning and information retrieval as a standard for evaluating the performance of text categorization and retrieval algorithms. It consists of 21,578 articles which appeared on the Reuters newswire during 1987. Each article in the corpus has been hand-annotated with SGML tags identifying its content (e.g. the topic, named entities such as places, organizations, and people). Details about the corpus are available from the README file.
You will construct an inverted index for a small subset (1000 articles) of the Reuters corpus. To simplify the task, the articles have been preprocessed to remove all SGML tags except a boundary tag beginning with "
A sample preprocessed article is shown below in Figure 1. The beginning of each article is indicated by a single line containing a tag of the form
RED LION INNS FILES PLANS OFFERING PORTLAND Ore Feb 26 Red Lion Inns Limited Partnership said it filed a registration statement with the Securities and Exchange Commission covering a proposed offering of 4,790,000 units of limited partnership interests The company said it expects the offering to be priced at 20 dlrs per unit It said proceeds from the offering along with a 102.5 mln dlr mortgage loan will be used to finance its planned acquisition of 10 Red Lion hotels Reuter |
You will need a copy of the preprocessed file to do the assignment. download data
NOTE: There are exactly 1000 documents in the collection, and the docids are not continuous.
Using the preprocessed data, construct an inverted index consisting of: a dictionary entry for each unique term in the document set and an associated posting list for each entry. Each posting list should contain the IDF (inverse document frequency) of the term, the IDs of all documents in which the term appears, the normalized and un-normalized TF (term frequency) and positions it appears (within-document frequency counts).
Because we have a small document collection, it is simplest to store the index in a single file, such as shown in Figure 2. Here, each line in the file represents one dictionary entry (sorted lexicographically), followed by its postings containing the IDF and the pairs of document ID, normalized TF, TF and positions. (Single spaces are used as delimiters between document entries, colons serve as delimiters between a document id, its term frequency and positions, commas are used as delimiters among the positions.) For example, the term "advertisers" occurs twice in document 79, as the 94th and 102nd words. Since it occurs only in one document, its DF equals 1 and the IDF (suppose we have 1000 documents in the collection) is log2 (1000/1) = 9.966.
In this homework, the IDF is defined as log2( N / DF), where N is the total number of documents in the collection). And the normalized TF is defined as TF/maxTF , where maxTF is the maximal term frequency in the same document.
adversaries 8.966
32:0.125:1:24
55:0.25:1:67 adversely 7.966 54:0.25:1:23 356:0.05:1:103 454:0.0625:1:111 634:0.2:1:57 adverserial 9.966 342:0.4:4:23,54,89,45 advertisers 9.966 79:0.25:2:94,102 advertising 8.38 736:0.125:2:45,77 766:0.33:2:4,67 810:0.04:1:26 advice 8.38 178:0.125:1:44 467:0.5:3:12,45,86 687:0.08:2:25,65 |
In order to construct the index, you will need to:
You may assume a "term" is any sequence of consecutive non-whitespace characters (i.e.you do not have to filter out stopwords and do stemming). And you can scan the collection multi-pass to simplify the algorithm (i.e. you create the dictionary at first pass, then build the index at the second pass).
Stopword Removal and Word Stemming (10 points)
Augment your
indexer by including support for stopword removal and word stemming.
The stopword file is available with the current distributiion, and the Porter Stemmers Source Code in C/Java/Perl are also available. Add them into your index construction routines, how much does this reduce the size of the index?
You can check whether your stemmer works correctly by the sampled words and their stems (download).
You are encouraged to use Java to do the assignment, although a language such as Perl, C or C++ is also OK. If you really want to use Perl/C/C++, talk to TA first.
You will do the assignment in groups of 3 people. Each group is expected
to hand in one solution (please include the names of all group members in the
source code comments). Your solution should consist of three parts: (1) source
codes for the indexer (2) a file named reuters.index containing the inverted index itself (3) a written
discription of main data structure and algorithm.
Solutions should be submitted electronically before 7:00pm July 25,
2002 as follows: Source code & Index file for the extra credit problem should be submitted the same
way (in a DIFFERENT file) To make sure the mini-corpus contained articles which were relatively
interesting to index, two additional preprocessing filters were used: articles
containing mostly numbers (> 30%) and any without text in the body were
skipped.
The perl script which does this is also availble.
/afs/andrew/course/20/760/students/yourandrewid/
If you have any questions or problems
accessing/submitting the files, send email to jianzhang@cmu.edu.
OPTIONAL: If you are interested in how we get the data file, the followings are the Preprocessing Rules