Overview
One straight-forward way to implement a fast spell checker is to use a hash table. If a word is found, it is assumed to be spelled correctly. Otherwise, it is assumed to be spelled incorrectly.The performance of the spell checker can be further improved by maintaining a separate hash table that maps common misspellings to their correct spelling. This can enable the spell checker to suggest a correct spelling for commonly misspelled words.
The performance can be further improved by correcting common mistakes in words and searching again, for example, "ie" vs "ei" or "as" instead of "sa", &c. If the adjusted spelling is found, it can be suggested.
This assignment asks you to build a spell checker capable of identifying correctly spelled words, suggesting words from a hash table of commonly misspelled words, and making as many other enhancements as you'd like.
The Hash Table
You should build a generic open chain hash table, constructed as an array or ArrayList. It should have an insert() method and a retrieve() method:
- public void insert (Object newItem)
- public Object retrieve (Object matchingItem)
- public int getMaxChainLength()
- public int getMinNonZeroChainLength()
- public int getAvgNonZeroChainLength()
Objects should be inserted and found based on their hashCode().
The Spell Checker, Itself
You should design and implement a SpellChecker class, which contains the hash tables needed to perform spell checking and which provides the appropriate functionality. This class should contain at least one constructor:
- public SpellChecker (String dictionaryFile, String misspellingsFile, int tableSize)
- dictionaryFile is the name of the provided dictionary
- misspellingsFile is the name of the fiel that provides a mapping from commonly misspelled words to the correct spelling. This file is of your own design, but a simple format with one pair per line is suggested.
- documentFile is the name of the text document that should be spell-checked
- tableSize is the number of buckets in the hash table.
Additionally, the SpellChecker must have a method as follows:
- void spellCheck (String documentFile)
- documentFile is the name of the text document that should be spell-checked
This method should output each potentially misspelled word with potentially correct spellings. The output should be readable and nicely formatted.
The Dictionary Hash Table
You have been provided with a dictionary that contains root words. These should be inserted into the dictionary hash table based on the hashCode() of their String representation.
The Misspellings Hash Table
You should create a new class of Objects to represent misspelled words. It should contain the misspelling and the correct spelling and should be able to report each. It should also override the hashCode() method to return the hashCode() of the misspelling. This will enable the retrieve() method of your hash table to find this object and the correct spelling within it, given the misspelling.
Finding Root Words
The provided dictionary only contains root words, not each of their forms. So, in order to find a word, you might need to reduce it to its root form. So, for each word, you should try at least the following:
- The word exactly as it appears in the user-provided text
- If the word ends in -ing, remove the -ing
- If the word ends in -s, remove the -s
- If the word ends in -es, -ly, or -ed, remove the -es, -ly, or -ed
- If the word ends in -ing, remove the -ing and add -e
- If the word ends in -ies, remove the -ies, and add -y
- If the word ends in -es or -ed, remove the --s or --d
Program Main
The SpellChecker class should also have a main method. This method should acept all of the files as command-line arguments, should instantiate a new SpellChecker, and should spellCheck() the requested document file.
Chain Length Plot
Experiment making the hash table different sizes, holding the dictionary size as a constant. When you find the interesting range of sizes, please produce a plot of the minimum, maximum, and average chain sizes as a function of the table size.You'll probably want to write a separate main program to collect and output this data. Then, please use Excel, gnuplot, or the tool of your choice to produce a plot the interesting size range. The interesting size range is the minimum range that shows the table, in practice, being too large through the table being too small.
Useful Files