15-121: Concordance

Homework Assignment 6
Text Concordance

Overview:

In this assignment you are to implement a concordancer. This program reads words from an input text file (stripping out punctuation marks) and produces a concordance (or cross reference) stored in a binary search tree.

Concordancers are tools developed for conducting searches for words and then listing the occurrences of that word together with the contexts in which the words occur in the source text. The purpose of a concordance is to study how words are used in a language, and to allow us to acquire a deeper understanding of meaning and usage than can be obtained from a dictionary.

Read the Wikipedia page on Concordance.

Objectives:

Gain experience in implementing and using binary search trees.
Gain experience in implementing and using comparators
Gain experience in coding and designing larger and more sophisticated projects

Class Word

First, implement the Word class.

public class Word implements Comparable<Word>
{
   private String word;
   private Set index;
   private int frequency;

This class stores a word along with its multiplicity and its location in the input file. A word is a sequence of two or more letters [a..zA..Z] that might include digits [0..9] and the undescore character. All delimiters are thrown away and not kept as part of the concordance. Here are examples of words:

abcd
abcd12
abc_
a12cd
15111

Notice, we do not include words consisting of a single letter, but we DO distinguish between upper- and lower-case letters. This distinguishing is supported by passing into a tree a special Comparator object. Because words can occur multiple times in a text, you have to be able to keep track of their occurrences (frequencies). If a word (to be inserted into a tree) is already there, you increase the frequency by one. If a word is inserted into a tree for the first time, you make a new node and set a frequency count to 1.

The index of a word is simply the word's line number in the text. We suggest you use an ordered set (for example, the TreeSet class) for storing line numbers.

Class BSTConcordance

This is a general-purpose binary search tree

public class BSTConcordance<AnyType extends Comparable<AnyType>>
                  implements Iterable<AnyType>
{
   private Comparator<AnyType> comparator;
   private BSTNode<AnyType> root;

with the inner private class BSTNode

   private static class BSTNode<AnyType>
   {
      private AnyType data;   //Word object
      private BSTNode left;
      private BSTNode right;

Note, the data field here is actually the Word object. However, your implementation must be as general as possible and should not have any references to the Word object and its methods.

Insert method

public void insert(AnyType toInsert)

The method accepts a comparable object toInsert and recursively inserts it into the BSTConcordance tree based either on

natural ordering, provided by the Comparable interface
or according to the specified Comparator, passed into the tree by a constructor.

Search method

You will implement

public AnyType search(AnyType toSearch)

that recursively searches the BSTConcordance tree. As with insertion, the search method compares objects in the tree either using compareTo() or compare() methods.

Iterator

To make the BSTConcordance a useful Concordancer we need an in-order iterator that traverses (iteratively - dah!) the elements of the BST in an ordered sequence. One way to implement the iterator is to maintain an internal stack. This stack will represent the current path from the root to a leaf. Notice that the first element that our iterator should produce is the leftmost child of the root, and that furthermore all the nodes between the root and this point should be added to the stack. As a starting point of implementation, I suggest you review a pre-order iterator implemented in class.

See BSTConcordance.java for additional methods to implement.

Class Concordance

The purpose of this class is to build a concordance tree in three different ways. In the first buildTree() method you parse an input text file and build a concordance tree using a natural alphabetical ordering among words:

public BSTConcordance<Word> buildTree(String file)

In the second method you parse an input text file and build a concordance tree using a specific ordering between words provided by a specified comparator:

public BSTConcordance<Word> buildTree(String file, Comparator<Word> comparator)

In the third buildTree() method you build a concordance tree from a given list of unique Word objects

public BSTConcordance<Word> buildTree(ArrayList<Word> list, Comparator<Word> comparator)

This method allows us to rebuild a concordance tree using a different ordering criteria specified by a comparator. In particular, we are interested in a concordance tree where all items are organized according to words frequencies, having the most frequently used word placed at the root of the tree.

See the class starter code for additional methods to implement.

Comparator classes

You will have to implement the following Comparator classess

public class IgnoreUpperCase implements Comparator<Word>

that orders strings in case-insensitive alphabetical order,

public class Frequency implements Comparator<Word>

that orders words according to their frequencies, and

public class AlphaFrequency implements Comparator<Word>

that orders words according to their frequencies and, in case of a tie, words are compared in case-sensitive alphabetical order.

The following code example will help you with implementation of the above classes.

Hints

How to read words from a file? There are a few differentt ways to identify words, however the simplest one is to use the split method along with a regular expression, namely split("\\W"). You might want to review the lecture on strings in which we discussed some file-processing.
How to store words and their multiplicities? Before you insert a Word object into a tree, you have to query (search) the tree. The search method should return either null or a Word object, from which you can extract the current frequency.
Your binary search tree implementation must be able to store ANY comparable data types, not just Words objects.
Search and insert operations on a binary search tree must be implemented recursively.
Before you start hacking an in-order iterator, we suggest you look at a pre-order iterator implemented in class.
Start early. Plan your time accordingly.

Testing

For the purpose of testing we provide a sample output that was generated by the MainDriver class. All input files are stored in the src folder

Better still, create your own small data files and use them for additional testing. We plan to create our own test input files for use when grading your submissions.

Starter Code:

Download the starter code for assignment (zip file):

Once you finish implementation, you must turn in the above seven files.

Grading

Your assignment will be graded first by compiling and testing it. After that, we will read your code to determine whether appropriate classes were used, as well as judge the overall style and modularity of your code. Points will be deducted for poor design decisions and un- commented, or un-readable code.

Here is the point breakdown:

BSTConcordance class - 40 points
Word class - 10 points
Concordance class - 20 points
Comparator classes - 10 points
Coding style - 20 points.

Homework Assignment 6 Text Concordance