Java's built-in hashCode and both "Add and Multiply" functions perform roughly the same. The "Add and modulus" hash function, however, is by far the worst one in this group.
Notice that Character.getNumericValue returns a small number for each
letter (10 through 35 for 'a' through 'z'), so the loop is adding up a
few relatively small values. As a result, this hash function always
returns small values. The largest hash value of any word in the
dictionary is only 461 (for immunoelectrophoresis)! This means that all
the buckets after 461 will be empty regardless of the size of the
hashtable. Since about 25,000 words are being inserted into fewer than
500 buckets, we'll have an average of at least 50 words in each
non-empty bucket. Running this on the actual data, we see that the
results are even worse, since only 356 buckets are used:
> Average non-zero size: 70 (25164 words, in 356 buckets)
And on average, nearly 100 comparisons must be made for a lookup:
> Average list search was 97.81132
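The 461 figure can be checked by hand. The following is a minimal sketch of what an "add and modulus" style hash might look like, assuming it sums Character.getNumericValue over the letters and then takes the remainder by the table size. The class name AddAndModulusSketch and the method addAndModulus are hypothetical and are not taken from the assignment code.

```java
// Hypothetical sketch of an "add and modulus" style hash (not the assignment's code).
// Assumes the words contain only letters; getNumericValue returns -1 for
// characters such as apostrophes, which would lower the sum.
class AddAndModulusSketch {
    // Sum the numeric value of every character, then reduce modulo the table size.
    static int addAndModulus(String word, int tableSize) {
        int sum = 0;
        for (int i = 0; i < word.length(); i++) {
            sum += Character.getNumericValue(word.charAt(i));
        }
        return sum % tableSize;
    }

    public static void main(String[] args) {
        String word = "immunoelectrophoresis";
        // The sum is 461, so any table larger than 461 leaves it unchanged.
        System.out.println(addAndModulus(word, 25000)); // prints 461
        System.out.println(addAndModulus(word, 50000)); // prints 461
    }
}
```

Because the sum never exceeds 461 for this dictionary, the modulus does nothing once the table has more than 461 buckets, which is why making the table bigger does not help this function.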
The other hash functions all spread the words evenly around the hashtable; you will very rarely see a bucket with more than half a dozen elements. The results for each of them are nearly identical, clustered around the following values:
Hash table size:    12.5k    25k    50k
Avg. list search:   2.6      1.6    1.4
Clearly these three hash functions do a very good job of distributing the elements, since the averages are very close to the theoretical minimums. The actual values depend on the data set, of course. With a hashtable half the size of the dictionary, we would expect to see about 2 or 3 comparisons: we hope that each bucket contains two elements, so lookups would take at most 3 comparisons in the best case. With a hashtable as big as the dictionary, we hope to have one element in each bucket, so in the best case no lookup would take more than 2 comparisons. And with a table twice as big as the dictionary, most lookups would hopefully take only one comparison.
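The "elements per bucket" figures in this argument are just the number of words divided by the number of buckets. Here is a small sketch of that arithmetic, assuming the dictionary holds the 25164 words reported in the output above and that the table sizes are exactly 12500, 25000 and 50000. The class name LoadFactorSketch and the method loadFactor are hypothetical.

```java
// Hypothetical sketch: average words per bucket under a perfectly even spread.
class LoadFactorSketch {
    // Number of words divided by number of buckets.
    static double loadFactor(int words, int buckets) {
        return (double) words / buckets;
    }

    public static void main(String[] args) {
        int words = 25164;                        // dictionary size from the sample output
        int[] tableSizes = {12500, 25000, 50000}; // assumed exact table sizes
        for (int size : tableSizes) {
            System.out.println("table size " + size + ": "
                    + loadFactor(words, size) + " words per bucket");
        }
    }
}
```

The three ratios come out to roughly 2, 1 and 0.5 words per bucket, which are the bucket sizes the paragraph above hopes for.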
In conclusion, we see that three of the hash functions distribute the data evenly over the hashtable, and the numerical results are near optimal. The "Add and modulus" hash function, however, is very poor: it clusters the data in the first few hundred entries, resulting in woefully long lists and inefficient lookups.
/**************
Here is an example of acceptable code for your two functions:
**************/
public void printHistogramAndAverage(){
    int max = 0;
    //counter to sum the total number of elements (alternatively we can use the dictSize variable)
    int tot = 0;
    //find the size of the largest bucket while summing the elements
    for(int i=0;i<ht.length;i++){
        tot += ht[i].size();
        if(ht[i].size()>max) max = ht[i].size();
    }
    //want counts[i] to be the number of buckets with i words;
    //a new int array is already filled with zeros
    int[] counts = new int[max+1];
    for(int i=0;i<ht.length;i++){
        counts[ht[i].size()]++;
    }
    //print the histogram and count the number of nonempty buckets
    int nonzero = 0;
    for(int i=0;i<=max;i++){
        if(counts[i]>0){
            if(i>0) nonzero += counts[i];
            System.out.println("buckets with "+i+" items: "+counts[i]);
        }
    }
    //avoid dividing by zero when every bucket is empty
    if(nonzero==0){
        System.out.println("Average non-zero size: 0 (0 words, in 0 buckets)");
        return;
    }
    //integer division, so the average is truncated
    System.out.println("Average non-zero size: "+tot/nonzero+" ("+tot+" words, in "+nonzero+" buckets)");
}
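public String toString(){
    //a StringBuilder avoids copying the whole string on every iteration
    StringBuilder s = new StringBuilder();
    for(int i=0;i<ht.length;i++)
        s.append(i).append(": ").append(ht[i].size()).append("\n");
    return s.toString();
}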