Text Concordance
In this assignment you are to implement a concordancer. This program reads words from an input text file (stripping out punctuation marks) and produces a concordance (or cross reference) stored in a binary search tree.
Concordancers are tools developed for conducting searches for words and then listing the occurrences of that word together with the contexts in which the words occur in the source text. The purpose of a concordance is to study how words are used in a language, and to allow us to acquire a deeper understanding of meaning and usage than can be obtained from a dictionary.
Read the Wikipedia page on Concordance.
First, implement the Word class.
public class Word implements Comparable<Word> { private String word; private Setindex; private int frequency;
This class stores a word along with its multiplicity and its location in the input file. A word is a sequence of two or more letters [a..zA..Z] that might include digits [0..9] and the undescore character. All delimiters are thrown away and not kept as part of the concordance. Here are examples of words:
abcd abcd12 abc_ a12cd 15111
Notice, we do not include words consisting of a single letter, but we DO distinguish between upper- and lower-case letters. This distinguishing is supported by passing into a tree a special Comparator object. Because words can occur multiple times in a text, you have to be able to keep track of their occurrences (frequencies). If a word (to be inserted into a tree) is already there, you increase the frequency by one. If a word is inserted into a tree for the first time, you make a new node and set a frequency count to 1.
The index of a word is simply the word's line number in the text. We suggest you use an ordered set (for example, the TreeSet class) for storing line numbers.
This is a general-purpose binary search tree
public class BSTConcordance<AnyType extends Comparable<AnyType>> implements Iterable<AnyType> { private Comparator<AnyType> comparator; private BSTNode<AnyType> root;
private static class BSTNode<AnyType> { private AnyType data; //Word object private BSTNodeleft; private BSTNode right;
Note, the data field here is actually the Word object. However, your implementation must be as general as possible and should not have any references to the Word object and its methods.
public void insert(AnyType toInsert)
The method accepts a comparable object toInsert and recursively inserts it into the BSTConcordance tree based either on
You will implement
public AnyType search(AnyType toSearch)
that recursively searches the BSTConcordance tree. As with insertion, the search method compares objects in the tree either using compareTo() or compare() methods.
To make the BSTConcordance a useful Concordancer we need an in-order iterator that traverses (iteratively - dah!) the elements of the BST in an ordered sequence. One way to implement the iterator is to maintain an internal stack. This stack will represent the current path from the root to a leaf. Notice that the first element that our iterator should produce is the leftmost child of the root, and that furthermore all the nodes between the root and this point should be added to the stack. As a starting point of implementation, I suggest you review a pre-order iterator implemented in class.
See BSTConcordance.java for additional methods to implement.
The purpose of this class is to build a concordance tree in three different ways. In the first buildTree() method you parse an input text file and build a concordance tree using a natural alphabetical ordering among words:
public BSTConcordance<Word> buildTree(String file)
In the second method you parse an input text file and build a concordance tree using a specific ordering between words provided by a specified comparator:
public BSTConcordance<Word> buildTree(String file, Comparator<Word> comparator)
In the third buildTree() method you build a concordance tree from a given list of unique Word objects
public BSTConcordance<Word> buildTree(ArrayList<Word> list, Comparator<Word> comparator)
This method allows us to rebuild a concordance tree using a different ordering criteria specified by a comparator. In particular, we are interested in a concordance tree where all items are organized according to words frequencies, having the most frequently used word placed at the root of the tree.
See the class starter code for additional methods to implement.
public class IgnoreUpperCase implements Comparator<Word>
that orders strings in case-insensitive alphabetical order,
public class Frequency implements Comparator<Word>
that orders words according to their frequencies, and
public class AlphaFrequency implements Comparator<Word>
that orders words according to their frequencies and, in case of a tie, words are compared in case-sensitive alphabetical order.
The following code example will help you with implementation of the above classes.
split("\\W")
. You might want to review the lecture on strings in
which we discussed some file-processing.
For the purpose of testing we provide a sample output that was generated by the MainDriver class. All input files are stored in the src folder
Better still, create your own small data files and use them for additional testing. We plan to create our own test input files for use when grading your submissions.
Once you finish implementation, you must turn in the above seven files.
Your assignment will be graded first by compiling and testing it. After that, we will read your code to determine whether appropriate classes were used, as well as judge the overall style and modularity of your code. Points will be deducted for poor design decisions and un- commented, or un-readable code.
Here is the point breakdown: