Exam and ReExam
Okay -- so, the last exam was too long. And, if most of you are honest with yourselves, you'll admit that you didn't practice enough. I'll make a shorter exam -- and you, practice! Chop! Chop! Get caffeinated and get coding!
The exam will be Thursday, in the cluster, but on paper. There will be two sittings; please pick your favorite: 4:30 PM or 6:30 PM.
Bucket Sort
Picture a bunch of buckets all lined up in a row, facing you.
Now imagine that you have 1 green tennis ball, 1 blue tennis ball, 1 yellow tennis ball, 1 purple tennis ball, and 1 red tennis ball, and that you need to put each ball into its proper bucket. Not too hard to do, is it? You simply put each ball into its corresponding bucket. Red maps to red, blue maps to blue, etc.
Now imagine that you have a collection of numbers from 0-99 and an array of size 100. If you simply put the 0 into index 0, 1 into index 1, 2 into index 2, 3 into index 3, etc., you'll have no trouble finding anything in the array. If you want to find 1, you go to index 1. If you want to find 33, you go to index 33. Simple.
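The idea of storing each number at the index equal to itself can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name DirectAddressDemo and the helper names store and contains are hypothetical, and the values are assumed to lie in the range 0-99.

```java
// Direct addressing: each value from 0 to 99 lives at the index equal to itself.
class DirectAddressDemo {
    // Hypothetical helper: places every value into the slot with the same number.
    static Integer[] store(int[] values) {
        Integer[] slots = new Integer[100];   // one slot per possible value 0..99
        for (int v : values) {
            slots[v] = v;                     // the value itself is the index
        }
        return slots;
    }

    // Lookup is a single array access; no searching is needed.
    static boolean contains(Integer[] slots, int value) {
        return slots[value] != null;
    }

    public static void main(String[] args) {
        Integer[] slots = store(new int[] {0, 1, 33, 99});
        System.out.println(contains(slots, 33)); // prints true
        System.out.println(contains(slots, 34)); // prints false
    }
}
```

Finding 33 takes one step regardless of how many numbers are stored, which is the property hash tables try to keep for data that aren't already small integers.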
But what do you do in the real world, where the data aren't just the numbers 0-99? What if you want to store information about people in a medical database and find their information quickly in an emergency?
If you have a database, based on a hash table, of Person objects, you assign a number, ideally unique, called a key, to the Person object when you insert it. The process we use to map data to keys is called hashing. To hash, say, a Person object to a number, we write a hash function.
Hashing and Hash Functions
If you're hashing Person objects, one simple hash function could be to assign a number to each letter of the person's first name: A=1, B=2, C=3, etc. Then, if you store a Person object with a firstName field of "Fred," you add up the letters, F=6, R=18, E=5, D=4, so 6+18+5+4=33, and store Fred at array index 33.
This simple hash function converted, or hashed, the Person object, Fred, into a number. All inserts, lookups, and removals of Fred will be based on this number. The hash function is a mapping from data (e.g., Person objects) to numbers between 0 and one less than the size of the hash table.
A hash function must map the data to a number, then return that number modulo the size of the hash table (think of a circular hash table).
This process of obliterating the actual item into something you'd never recognize as the real thing -- 33 in the case of Fred -- is called hashing because it comes from the idea of chopping up corned beef into something you'd never recognize as corned beef: corned beef hash. There's no meaningful relationship between the actual data value and the hash key, so there's no practical way to traverse a hash table in order. Hash table items are not in any order. The purpose of hash tables is to provide fast lookups. And the function we used to turn our friend Fred into the number 33 (turning the letters of his first name into numbers and adding them together) is the hash function.
When I want to find Fred in the hash table, I simply use my hash function again (6+18+5+4). Fred is in hash table position 6+18+5+4=33. That's the way a hash table works. You store and find your items with the same hash function so that you can always find what you're looking for, even though, by just eyeballing the hash key 33, you'd have no idea that it came from Fred.
Collisions
So if every first name hashes to a different hash key, we're set. It's just like our buckets. We only had one blue tennis ball, which went into the blue bucket, and so on for each color tennis ball. But the real world isn't like that. What if we try to store a Person object Kim into our hash table using our hash function? K=11, I=9, M=13 is 11+9+13=index 33. Now we've got a problem, because Fred is at 33.
When we try to put an item into a spot in the hash table that's occupied, this is called a collision. We'll talk about how to manage collisions next.
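Here is a minimal Java sketch of the letter-sum hash function described above, taking A=1, B=2, ..., Z=26 and wrapping the sum with modulo. The class name LetterSumHash, the method name hash, and the table size of 100 are assumptions made for illustration. Kim is used as a second name because its letters add up to the same total as Fred's.

```java
// A sketch of the letter-sum hash: add up letter positions, then wrap with modulo.
class LetterSumHash {
    // Sums letter positions (A=1 ... Z=26, ignoring case and non-letters),
    // then maps the sum into the range 0..tableSize-1.
    static int hash(String firstName, int tableSize) {
        int sum = 0;
        for (char c : firstName.toCharArray()) {
            char upper = Character.toUpperCase(c);
            if (upper >= 'A' && upper <= 'Z') {
                sum += upper - 'A' + 1;
            }
        }
        return sum % tableSize;
    }

    public static void main(String[] args) {
        int tableSize = 100; // assumed table size for this illustration
        int fredIndex = hash("Fred", tableSize);
        int kimIndex = hash("Kim", tableSize);
        System.out.println("Fred -> index " + fredIndex);
        System.out.println("Kim  -> index " + kimIndex);
        // Two different names landing on the same index is a collision.
        System.out.println("Collision? " + (fredIndex == kimIndex));
    }
}
```

Because the function only adds letter values, it ignores letter order, so any rearrangement of the same letters collides too. Better hash functions spread names out more evenly, but no hash function can rule out collisions once there are more possible names than table positions.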
(Separate) Chaining, a.k.a. Closed Addressing
One common technique for resolving collisions in in-memory hash tables is known as separate chaining. It works like this: each position in the hash table is a linked list. When we try to insert something into a spot that's already taken, we just make a new node and insert it into that hash table element's list.
This technique is known as separate chaining, because each hash table element is a separate chain (linked list). This is easy to do, and this way, you always have a place for anything you want to put into the table.
But if there are many collisions, it can become less efficient, because you can end up with long linked lists at certain array indices and nothing at others, and finding an item means walking through its list. This problem is called clustering.
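Separate chaining can be sketched in Java roughly as follows. This is a simplified illustration, not a definitive implementation: it stores first names as strings in place of full Person objects, reuses the letter-sum hash (A=1 ... Z=26), and the names ChainedHashTable, indexFor, and chainLength are hypothetical.

```java
import java.util.LinkedList;

// A minimal separate-chaining table that stores first names (a stand-in for Person objects).
class ChainedHashTable {
    private final LinkedList<String>[] chains; // one linked list per table position

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        chains = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            chains[i] = new LinkedList<>();
        }
    }

    // Letter-sum hash (A=1 ... Z=26), wrapped to a valid index with modulo.
    int indexFor(String name) {
        int sum = 0;
        for (char c : name.toCharArray()) {
            char upper = Character.toUpperCase(c);
            if (upper >= 'A' && upper <= 'Z') {
                sum += upper - 'A' + 1;
            }
        }
        return sum % chains.length;
    }

    // A collision just means the chain at that index grows by one node.
    void insert(String name) { chains[indexFor(name)].add(name); }

    // Lookup and removal hash to the chain, then walk only that chain.
    boolean contains(String name) { return chains[indexFor(name)].contains(name); }
    boolean remove(String name) { return chains[indexFor(name)].remove(name); }

    int chainLength(int index) { return chains[index].size(); }

    public static void main(String[] args) {
        ChainedHashTable table = new ChainedHashTable(100); // assumed size
        table.insert("Fred");
        table.insert("Kim"); // same letter sum as Fred, so it joins Fred's chain
        System.out.println(table.contains("Kim"));
        System.out.println(table.chainLength(table.indexFor("Fred")));
    }
}
```

The cost of a lookup is the cost of hashing plus the length of one chain, which is why long chains at a few indices hurt performance even when most of the table is empty.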
Using Hash Tables
We've talked a lot about binary trees this semester. They're great for finding things in O(log2 n) time. If you have a balanced binary tree of 10,000 items, you'll find what you're looking for in no more than 14 steps (the tree will be no more than 14 levels deep, since a 13-level tree holds at most 2^13 - 1 = 8,191 items). But if you want to access something "instantly," hash tables provide the random access you need to find something almost immediately: as long as collisions are rare, a lookup takes about the same small number of steps no matter how many items are stored.
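To make the comparison concrete, here is a small Java sketch. The first method computes how many levels a perfectly balanced binary tree needs for n items, using floor(log2 n) + 1. The second uses java.util.HashMap, the hash table in Java's standard library, for a lookup by name. The class name LookupDemo, the method names, and the record strings are hypothetical placeholders, not real data.

```java
import java.util.HashMap;
import java.util.Map;

class LookupDemo {
    // Levels in a perfectly balanced binary tree holding n >= 1 items:
    // floor(log2 n) + 1, computed from the position of the highest set bit.
    static int balancedTreeLevels(int n) {
        return 32 - Integer.numberOfLeadingZeros(n);
    }

    // Builds a tiny name -> record table; the records are placeholder strings.
    static Map<String, String> buildRecords() {
        Map<String, String> records = new HashMap<>();
        records.put("Fred", "placeholder record for Fred");
        records.put("Kim", "placeholder record for Kim");
        return records;
    }

    public static void main(String[] args) {
        // Worst-case number of nodes visited when searching 10,000 items in a balanced tree.
        System.out.println(balancedTreeLevels(10_000));

        // A hash table lookup hashes the key and goes (almost) straight to the item.
        Map<String, String> records = buildRecords();
        System.out.println(records.get("Fred"));
        System.out.println(records.containsKey("Ned")); // prints false
    }
}
```

HashMap handles hashing and collisions internally, so in practice you usually supply a good key and let the library do the rest.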