Lab #3: Compression


Due: Wednesday, July 25th, 2007 at 11:59PM

For this homework assignment we will implement and experiment with methods for data compression, using two lossless data compression algorithms: Huffman coding and LZW. The primary goal of the assignment is to implement the backend for a program called tez (two-eleven zip) that is able to compress and decompress data files using a variety of compression algorithms.

Goals

Time Management

Due to the size of this assignment, it is normal to feel somewhat overwhelmed. Make sure to get an early start and think before you hack! We recommend that you complete at least one part every two days, in the order suggested. Breaking it up will allow you to treat this homework assignment essentially as three smaller assignments, and also give you the flexibility to work on one part while you're still debugging another part. Don't wait until the last couple days to start on all the parts!

The Parts

This assignment comes in two major parts, the Huffman and LZW sections.

Huffman

Huffman is the first part of this homework assignment and involves implementing plain Huffman compression as discussed in lecture. Before you can start on the Huffman code proper, however, you must first create a priority queue (there are versions of Huffman that do not require priority queues, but ours does). Once this is done, you will build an object-oriented abstraction of the concept of a Huffman code. We have created a simple Huffman compressor that uses this abstraction. We also want you to get practice consuming the Huffman code API, so you will apply Huffman compression to sequences of n bytes, known as N-grams, rather than to single bytes.

LZW

The second part of this assignment involves implementing a more sophisticated method for data compression, based on a version of the Lempel/Ziv/Welch (LZW) algorithm. This is similar to one of the components in DEFLATE compression, used by gzip and zip, two extremely popular compression methods. It is also used in GIF image compression. Details can be found in the lecture notes.

Testing Code

Writing tests is good!

By now you should realize that testing is highly beneficial. Be sure to test the major compress/decompress functionality, and also write unit tests for individual parts like the PQ and the building of the HuffmanTree. The priority queue should be extremely easy to test (just do it like you did for Snake, comparing against the Java version). For the other types, you should try to make sure various files round trip. We've included some helper code in TestUtil, which you should feel free to take advantage of. You should use the JUnit framework. As in previous assignments, the more JUnit tests you write, the more private tests you get to see.

Sample files

To help you ensure that your file format is correct, we have created a repository of example files and their compressed versions. The files are in /afs/andrew.cmu.edu/usr19/merichar/public/15211/hw5. For those of you unlucky enough to be on an operating system without sftp, you might want to read these instructions. Note that if you download the files via FTP, in order to get the exact same file we're using, download the original file as a binary file rather than an ASCII file.

The uncompressed version of each file is in orig, and the various types of compression have been applied to each file and then put in a directory with the name of the compression used (read gen.sh to see exactly what we are doing). Note that not all files could be compressed: for example, LZW sometimes runs out of codes, and one or two cases ran out of heap space. You don't have to worry about these.

You should be able to uncompress all of the files that we compressed. Furthermore, for LZW, excepting the files that ran out of codes, the files you compress should be byte-for-byte identical to our files (try the cmp tool). For Huffman-based formats, the bytes can be a bit different (because of how one creates Huffman trees). However, the file length should be exactly the same (try wc -c or ls -la).

Note that you cannot read files when unit testing using frontdesk. This means you need to somehow test your code without reading from sample files. We suggest a few ways of doing this.

  1. Do some hand cases. You can easily create a few hand test cases to figure out exactly how a compressed file would appear, and test against that.
  2. Test round trips using Java's Random class. An example is provided for you in HuffmanTest.java (a sketch of the same idea appears below). This method fills a byte array with random bytes, compresses and decompresses it, and makes sure the original byte array is equal to the decompressed byte array.
This is a good time to point out that good code coverage does not necessarily equal good grades. The few hand cases, if written correctly, can achieve full code coverage of the compression stage. They are, however, insufficient to thoroughly test your code. You should perform local tests with the provided files. You should put try/catch blocks around all tests that access files to handle the IOExceptions that will occur on frontdesk.
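For instance, a random round-trip test might look roughly like the sketch below. The roundTrip helper here is hypothetical; it stands in for whatever wiring HuffmanTest.java and TestUtil actually use to compress a byte array and decompress the result.

    import java.util.Random;

    import org.junit.Test;
    import static org.junit.Assert.assertArrayEquals;

    public class RandomRoundTripTest {
        // Hypothetical helper: compress a byte array with your compressor,
        // then decompress the output and return it.
        private byte[] roundTrip(byte[] original) {
            // ... use your Compressor and the TestUtil helpers here ...
            return original; // placeholder
        }

        @Test
        public void randomBytesSurviveRoundTrip() {
            Random rand = new Random(15211); // fixed seed so failures are reproducible
            for (int trial = 0; trial < 20; trial++) {
                byte[] original = new byte[1 + rand.nextInt(4096)];
                rand.nextBytes(original);
                byte[] recovered = roundTrip(original);
                assertArrayEquals("round trip changed the data", original, recovered);
            }
        }
    }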

Priority Queue

Your first task in this assignment is to implement a priority queue. You will be using the Queue/AbstractQueue API in Java 5, much like the API that you know and love from the Snake assignment. There isn't much to say about this API, other than to read the javadocs and PLEASE TEST.

As far as implementation goes, you are free to implement a priority queue in any (efficient) way you want. This means that offer and poll must run in O(log n) time. We recommend a skew heap or a binary heap (covered in lecture). More info about skew heaps can be found in a paper by Danny Sleator and Bob Tarjan. The code is all on pages 58 and 59.
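For reference, here is a minimal array-backed binary min-heap sketch, using illustrative names rather than the assignment's actual API, just to show why offer and poll are logarithmic: each operation walks one root-to-leaf path of the implicit tree.

    import java.util.ArrayList;
    import java.util.Comparator;

    // Illustrative sketch only -- not the assignment's class or method names.
    class BinaryHeapSketch<E> {
        private final ArrayList<E> heap = new ArrayList<E>();
        private final Comparator<? super E> cmp;

        BinaryHeapSketch(Comparator<? super E> cmp) { this.cmp = cmp; }

        void offer(E item) {
            heap.add(item);
            int i = heap.size() - 1;
            while (i > 0) {                              // sift the new element up
                int parent = (i - 1) / 2;
                if (cmp.compare(heap.get(i), heap.get(parent)) >= 0) break;
                swap(i, parent);
                i = parent;
            }
        }

        E poll() {
            if (heap.isEmpty()) return null;
            E top = heap.get(0);
            E last = heap.remove(heap.size() - 1);
            if (!heap.isEmpty()) {
                heap.set(0, last);
                int i = 0;                               // sift the new root down
                while (true) {
                    int left = 2 * i + 1, right = left + 1, smallest = i;
                    if (left < heap.size()
                            && cmp.compare(heap.get(left), heap.get(smallest)) < 0) smallest = left;
                    if (right < heap.size()
                            && cmp.compare(heap.get(right), heap.get(smallest)) < 0) smallest = right;
                    if (smallest == i) break;
                    swap(i, smallest);
                    i = smallest;
                }
            }
            return top;
        }

        private void swap(int a, int b) {
            E tmp = heap.get(a);
            heap.set(a, heap.get(b));
            heap.set(b, tmp);
        }
    }

A skew heap achieves the same bounds in the amortized sense using merges instead of sifting; see the Sleator/Tarjan paper above.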

You may not use the Java implementation of priority queue, nor may you hack together a priority queue using some sorting method (which would not be O(log n) time).

Note that the iterator does not have to return the elements in any particular order. Requiring that would make your life painful, and we have no desire to do that. Similar to Java's own implementation of PriorityQueue's iterator, just return all elements in the queue in any order. One more thing: you must write your own iterator. For example, you may not simply create a new ArrayList, insert all the PQ items into the list, and return the list's iterator. Any attempt to do so will result in a loss of all the iterator points.

Bits and Compressors

I'm More Than Just a Number!

Up until now you have been dealing with integers that represent natural numbers. However, there is another (slightly more painful) world where we use integers as carriers of a sequence of bits. Compression is often done in terms of a stream of bits.

We define a BitReader as an interface which allows reading bits from some stream. It has methods for reading one bit, reading a few bits, reading a byte, reading a few bytes, and reading an integer. It also supports getting the length of the stream and resetting to the beginning. InputStreamBitReader is a concrete implementation of BitReader that supports reading from a file or a few other types of streams (including a stream backed by a byte array).

A BitWriter is also defined, which writes bits in the same format that BitReader reads them, with a concrete implementation in OutputStreamBitWriter.
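As a rough mental model only (the real method names and signatures are whatever the javadocs say), the two interfaces look something like this:

    // Illustrative sketch -- consult the provided javadocs for the real API.
    interface BitReaderSketch {
        int readBit();              // next single bit (0 or 1)
        int readBits(int n);        // next n bits packed into an int
        byte readByte();            // next 8 bits as a byte
        byte[] readBytes(int n);    // next n bytes
        int readInt();              // next 32 bits as an int
        long length();              // total length of the stream
        void reset();               // rewind to the beginning
    }

    interface BitWriterSketch {
        void writeBit(int bit);
        void writeBits(int value, int n);
        void writeByte(byte b);
        void writeBytes(byte[] bytes);
        void writeInt(int value);
        void flush();               // push out any partially filled final byte
    }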

In general, we will tell you the exact format in which to write the files. Your files must be bit for bit compatible with our files. You should be able to read our files, and we yours.

WARNING: do not try to convert bytes into strings. Ever. It might work when you are working with ASCII, but due to the beast known as encoding, it will not work on binary files. In this assignment, you are dealing with bytes, not strings.

The byte primitive type in Java is signed (sadly). This may be slightly confusing if you are using the debugger and see a value like -32. Do not be alarmed. The byte's bit pattern is still exactly the same, so you needn't worry about this. If you want readable output from Java, use Integer.toString(byteValue & 0xff). In general your code should work even if you never knew this.
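A quick standalone demo of what is going on:

    public class ByteSignDemo {
        public static void main(String[] args) {
            byte b = (byte) 0xE0;                            // bit pattern 1110 0000
            System.out.println(b);                           // prints -32: byte is signed
            System.out.println(b & 0xff);                    // prints 224: mask kills the sign extension
            System.out.println(Integer.toString(b & 0xff, 16)); // prints e0
        }
    }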

When testing code, you may have to view binary files. The easiest platform to do this on is Linux, where there are lots of quality hex editors. In emacs, use Hexl mode (M-x hexl-mode). In vi, execute :%!xxd to change the file to hex and :%!xxd -r to change it back. There is also the hexdump utility (try it with the -C flag). GUIs also exist, such as ghex2 for GNOME. For Windows and Mac, there are quite a few shareware/freeware hex editors; use Google.

Compressor

We also define an interface called Compressor which represents the abstract concept of compression. In mathematical terms, a compressor is a function f(s) which takes a bit stream s and maps it to another stream s' in an injective manner. The function f(s) has the property that when s has redundant data, |s'| < |s|.

In this assignment you will see three compressors. One will be a simple Huffman coding. Another will apply Huffman coding to sequences of bytes. The last will implement LZW compression.

Please note that your compressors must be stateless. This means we should be able to construct a new instance, c, of your compressor and then call c.compress(x) and c.decompress(y) multiple times on multiple bit streams, without having to construct another compressor object. In particular, you should not do things like store the LZW dictionary as a class member variable.
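A sketch of what this means in practice, with guessed method signatures (check the provided Compressor javadocs for the real ones): any per-call state lives in local variables inside compress and decompress, never in fields.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch only; relies on the assignment's BitReader/BitWriter classes and
    // will not compile outside the project.
    class StatelessCompressorSketch {
        public void compress(BitReader in, BitWriter out) {
            Map<String, Integer> table = new HashMap<String, Integer>(); // rebuilt on every call
            // ... read symbols from in, emit codes to out ...
            out.flush();
        }

        public void decompress(BitReader in, BitWriter out) {
            List<String> table = new ArrayList<String>();                // rebuilt on every call
            // ... read codes from in, emit bytes to out ...
            out.flush();
        }
    }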

Huffman Code

HuffmanCode.java implements a generic Huffman backend. The details of this API are described in the javadocs; this section gives a high-level overview. Feel free to make use of anything you want from java.util and java.lang, but nothing else (you may not use java.util.PriorityQueue; in fact, since the queue you wrote has the same name, Java will not easily let you use it anyway).

Compression

First we consider the process of compression.

[Diagram: the compressor's steps (generate_tree, write_header, read_input, output_code) map onto HuffmanCode's constructor, writeHeader, and encode; writeHeader in turn calls the HuffmanSymbolSerializer's write method.]

The compression sequence starts in an object implementing the Compressor interface. First it reads in the file and builds a table of frequencies for each data item. It passes this table to the HuffmanCode constructor that takes a Map. Then it calls the writeHeader method, which emits the tree recursively, using the HuffmanSymbolSerializer to emit each data item. The compressor then reads each symbol in the file and emits its code word using encode.
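In code form, the flow looks roughly like the sketch below. The constructor and writeHeader shapes are taken from this handout, but the encode signature, the length/reset accounting, and the byteSerializer field (the HuffmanSymbolSerializer<Byte> described later) are assumptions; consult the javadocs and the provided HuffmanCompressor for the real details.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of HuffmanCompressor-style compression; not the provided implementation.
    public void compress(BitReader in, BitWriter out) {
        // 1. Count how often each byte occurs.
        Map<Byte, Integer> frequencies = new HashMap<Byte, Integer>();
        long count = in.length() / 8;          // assuming length() reports bits; check the javadocs
        for (long i = 0; i < count; i++) {
            byte b = in.readByte();
            Integer old = frequencies.get(b);
            frequencies.put(b, old == null ? 1 : old + 1);
        }

        // 2. Build the code, emit the header (the tree), then the symbol count.
        HuffmanCode<Byte> code = new HuffmanCode<Byte>(frequencies);
        code.writeHeader(byteSerializer, out); // serializer writes each leaf's data item
        out.writeInt((int) count);

        // 3. Re-read the input and emit one code word per byte.
        in.reset();
        for (long i = 0; i < count; i++) {
            code.encode(in.readByte(), out);   // assumed shape: encode(symbol, writer)
        }
        out.flush();
    }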

Expansion

We now consider expansion, the reverse of compression.

[Diagram: the expander's steps (read_header, read_code, output_item) map onto HuffmanCode's constructor and decode; the constructor in turn calls the HuffmanSymbolSerializer's read method.]

First, the header is read from the file using the HuffmanCode constructor that takes a BitReader and a HuffmanSymbolSerializer. This constructor reads the header, calling the read method of the HSS whenever it encounters a leaf. The compressor then reads in each Huffman code and outputs the corresponding symbol.
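A matching sketch of the expansion flow, with the same caveats about assumed signatures:

    // Sketch of HuffmanCompressor-style expansion; not the provided implementation.
    public void decompress(BitReader in, BitWriter out) {
        // 1. Rebuild the tree from the header; the serializer's read method is
        //    called once per leaf encountered in the header.
        HuffmanCode<Byte> code = new HuffmanCode<Byte>(byteSerializer, in);

        // 2. Read how many symbols follow, then decode that many code words.
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            out.writeByte(code.decode(in));    // assumed shape: decode(reader) returns the symbol
        }
        out.flush();
    }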

The Huffman Compressor

Once you complete HuffmanCode.java, the provided implementation of HuffmanCompressor.java should work. Note the file format we use when compressing/decompressing files. We first write the Huffman header (the tree, output via writeHeader). Then we write the number of symbols in the file, which corresponds to the number of times we need to call encode/decode (for HuffmanCompressor, this is just the number of bytes in the file).

The implication of this file format is that when the character set consists of just one symbol, we can define its Huffman code to be the empty string (rather than arbitrarily defining it to be 0 or 1). This actually makes things simpler because we don't have to deal with the special case (at least we shouldn't) of the root being a leaf node. Also, a file of all 'a's compresses better under this scheme. Please test for this and make sure you get the appropriate output.

HuffmanCodeVisualizer

Since so many of you liked the TreeVisualizer, we've decided to do it again. Now you can visualize Huffman trees. Yay.

N-Gram Compressor

Write NGramCompressor.java, which implements the Compressor interface using the algorithm described below. Its constructor takes an integer parameter which specifies the block size B to be used.

This file essentially emulates what HuffmanCompressor does, except instead of reading in symbols one byte at a time, a symbol will consist of B bytes. Similar to HuffmanCompressor, it will rely on HuffmanCode; however, instead of HuffmanCode<Byte> it will be of type HuffmanCode<ByteArrayWrapper>.

The algorithm is simple. Let B be the block size. Partition the file into blocks of B bytes. (At the end there may be one block of fewer than B bytes.) Each distinct block that occurs will be a symbol of a Huffman code. So as you read the blocks in, put them into a hash table (use HashMap from java.util). The key is the byte array holding the block's contents, and the value is its frequency of occurrence.
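The counting pass might look roughly like this (readNextBlock is a hypothetical helper that returns the next B bytes, or fewer at end of file, and ByteArrayWrapper's constructor is assumed to take a byte[]; using the wrapper as the key matters because a raw byte[] hashes by identity, not by contents):

    import java.util.HashMap;
    import java.util.Map;

    // Inside compress: in is the BitReader, blockSize is B.
    Map<ByteArrayWrapper, Integer> frequencies = new HashMap<ByteArrayWrapper, Integer>();
    byte[] block = readNextBlock(in, blockSize);   // hypothetical helper
    while (block.length > 0) {
        ByteArrayWrapper key = new ByteArrayWrapper(block);
        Integer old = frequencies.get(key);
        frequencies.put(key, old == null ? 1 : old + 1);
        block = readNextBlock(in, blockSize);
    }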

However, as you may have noticed, we run into a problem when we write out the header (the description of what the tree looks like). In our algorithm, if we encounter an internal node, we emit a 1 and recursively emit the left subtree followed by the right subtree. If we encounter a leaf node, we emit a 0, then the data associated with it. As with normal data output, when we output a generic data item to a file, we have to figure out how to encode it. This is why HuffmanSymbolSerializer exists. In it, we specify two methods: one that writes a symbol and one that reads a symbol back.

By using HuffmanSymbolSerializer, we separate the encoding used to emit and read the Huffman header from the logic in HuffmanCode (good work!). In HuffmanCompressor, we give an example of a HuffmanSymbolSerializer<Byte>. The scheme there is simple: we write the byte out as the value, and we read a byte back in as the value.

For the NGramCompressor, we are not as lucky. The scheme we use, while perhaps not optimal, is still simple: since we don't know how many bytes each block will contain, we first emit that length, followed by the data. For example, if the block size is 5, we'd say: writer.writeInt(5); writer.writeBytes(bytes);. Reading just reverses this process. Note that we could use this same scheme for HuffmanCompressor, but there it would introduce unnecessary overhead, while in NGramCompressor it is necessary. As you implement the HuffmanSymbolSerializer<ByteArrayWrapper> for NGramCompressor, remember that HuffmanSymbolSerializer is only used for the header of a Huffman-compressed file (that is, in HuffmanCode(HuffmanSymbolSerializer<T> hss, BitReader br) and public void writeHeader(HuffmanSymbolSerializer<T> hss, BitWriter writer)), not in the regular encode and decode methods.
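Putting that together, a sketch of the serializer for NGramCompressor might look like the following (the exact method shapes and the getBytes accessor are assumptions; use the real HuffmanSymbolSerializer interface and ByteArrayWrapper API):

    // Sketch of the length-prefixed scheme described above.
    class NGramSerializerSketch /* implements HuffmanSymbolSerializer<ByteArrayWrapper> */ {
        public void write(ByteArrayWrapper symbol, BitWriter writer) {
            byte[] bytes = symbol.getBytes();        // hypothetical accessor
            writer.writeInt(bytes.length);           // first the length...
            writer.writeBytes(bytes);                // ...then the data itself
        }

        public ByteArrayWrapper read(BitReader reader) {
            int length = reader.readInt();           // reverse the process:
            byte[] bytes = reader.readBytes(length); // length first, then that many bytes
            return new ByteArrayWrapper(bytes);
        }
    }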

The strategy in the NGramCompressor's version of compress is the same as in HuffmanCompressor.

  1. Read in the file and count the frequencies
  2. Send this info to HuffmanCode
  3. Emit the header with writeHeader
  4. Emit the number of symbols (not bytes) in the file
  5. Emit the Huffman encoding of the sequence of symbols in the file

The algorithm for expand should be self-evident. One interesting thing to note is that expand does not need to know the value of B. In fact, with the encoding/decoding scheme specified by our HuffmanSymbolSerializer, it does not even assume that all the symbols (blocks) in the Huffman code are the same length. This means that if you are really bored, you could try compressing a file with different block sizes to try to improve the compression ratio.

LZW

LZW is a more sophisticated method of compression that relies on patterns in the file instead of just the frequency of the bytes. It can lead to drastically better compression than the Huffman algorithm. When implemented with a trie, it also runs extremely quickly. You only need to read the file in once, and you compress as you go. You also do not need a header.

In order for us to test that you are encoding and decoding correctly, you need to follow a few LZW conventions:

Be sure to call flush at the end of the compression and decompression methods. Otherwise, your output files will be corrupt, among other potentially bad things.

Note for decompression: Although the compression uses tries, the decompression does not have to. In fact, using a trie for decompression would be a mess. Just use an array or a hash table to store the decompression dictionary.
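For orientation, the general shape of the decode loop with a plain list as the dictionary is sketched below. This assumes the textbook setup (256 single-byte starting entries) and hypothetical readCode/hasMoreCodes helpers for pulling codes off the BitReader; your code width, starting dictionary, and end-of-stream handling must follow the conventions above instead.

    import java.util.ArrayList;
    import java.util.List;

    // Inside decompress: in is the BitReader, out is the BitWriter.
    List<byte[]> dict = new ArrayList<byte[]>();
    for (int i = 0; i < 256; i++) {
        dict.add(new byte[] { (byte) i });       // single-byte entries come first (assumed)
    }

    int prev = readCode(in);                     // hypothetical: read one code
    out.writeBytes(dict.get(prev));
    while (hasMoreCodes(in)) {                   // hypothetical end-of-input test
        int cur = readCode(in);
        byte[] prevBytes = dict.get(prev);
        byte[] entry;
        if (cur < dict.size()) {
            entry = dict.get(cur);
        } else {
            // The one tricky case: cur is the entry the compressor created while
            // emitting prev, so it must be dict[prev] + first byte of dict[prev].
            entry = appendByte(prevBytes, prevBytes[0]);
        }
        out.writeBytes(entry);
        dict.add(appendByte(prevBytes, entry[0])); // new entry: dict[prev] + first byte of entry
        prev = cur;
    }
    out.flush();

    // Helper: copy of prefix with one extra byte on the end.
    static byte[] appendByte(byte[] prefix, byte extra) {
        byte[] result = new byte[prefix.length + 1];
        System.arraycopy(prefix, 0, result, 0, prefix.length);
        result[prefix.length] = extra;
        return result;
    }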

FAQ

Pay special attention to this section. This is a collection of common questions students asked last semester.

Handing it in

As always, hand in just the files you modified. If you hand in files that we gave you (other than the ones containing RuntimeExceptions where you implemented things), we will ignore them. Also, please do not submit tons of random compressed files; filter them out before you zip.