Monday, January 22, 2007

AllanZip revisited

I can't even begin to count the times I've sat in a dark room in front of an old monochrome monitor (in my early childhood) enchanted by the magic of such amazing applications as pkzip, arc, lzh, rar, and so on. Imagine taking a certain amount of data, compressing it to almost nothing without loosing any information and easily being able to extract it again. After some time, the internet came along and I started reading up on lossless compression in various newsgroups. I gazzled at stories of lossless compression programs that could compress any data to 90% of it's former size, so you'd just have to run the program enough times to reduce any amount of data to almost nothing.....Naturally I quickly figured out that it was just hoax - but I was still left trying to figure out how the actual algorithms for lossless compression were constructed.

Eventually I became curious if I would be able to make my own compression program so I decided to give it a try. After checking the books I decided that I could try to implement the Huffman algorithm for fun and see where that got me. Today I came across that old patch of code, and ported it to c# 2.0 for your viewing pleasure :-)

If you, my beloved reader, is a data-dummy, here's a quick intro to the concept of lossless compression:
Most uncompressed data is stored pretty simple. The data consist of bytes, and each byte consist of 8 bit (0's and 1's). A character is often a byte (although some encodings use two bytes per character) making it possible to have 2^8 = 256 different characters. However this way of storing data isn't all that optimal since not all characters occur as frequently as the others. The concept of most compression algorithms is to find out what characters are the most popular and describe those in the fewest bits possible, while then using more than 8 bits to store the most uncommon characters.

My small code demo works in several steps. First of all it breaks up the text to compress into characters, creating an object for each (in this case called a "LeafNode"). Then it counts how popular each character is and assigns it as a weight.

private void BuildTree()
//Prep leaves for tree
LeafNode[] leafs = new LeafNode[rawList.Count];
this.rawList.Values.CopyTo(leafs, 0);

//Make Queue's
Queue<LeafNode> q1 = new Queue<LeafNode>(leafs.Length);
Queue<InternalNode> q2 = new Queue<InternalNode>(leafs.Length - 1);
foreach (LeafNode ln in leafs) q1.Enqueue(ln);

//Build tree
while ((q1.Count > 0) || (q2.Count > 1))
HuffNode hn1;
HuffNode hn2;
//Get the two smallest nodes from either queue.
if ((q1.Count > 0) && (q2.Count > 0))
hn1 = (q2.Peek() < q1.Peek()) ?
(HuffNode)q2.Dequeue() : (HuffNode)q1.Dequeue();
else if (q1.Count > 0) hn1 = q1.Dequeue();
else hn1 = q2.Dequeue();
if ((q1.Count > 0) && (q2.Count > 0))
hn2 = (q2.Peek() < q1.Peek()) ?
(HuffNode)q2.Dequeue() : (HuffNode)q1.Dequeue();
else if (q1.Count > 0) hn2 = q1.Dequeue();
else hn2 = q2.Dequeue();

//Create a new internalnode pointing to the two found and enqueue it.
q2.Enqueue(new InternalNode(hn1, hn2));

TreeRoot = q2.Dequeue();

Afterwards it builds up a binary tree of the characters, based on popularity (the most popular has the shortest distance from the top). From this tree you can extract the bits needed for each character when compressed.

The binary tree showed here is for the text: "Lossless compression is lots of fun". Notice for instance how the "s" character that occurs 8 times has a very short binary signature 00 while "r" that only occurs once has the signature 11101.

When this tree is ready, it's just to look up each LeafNode and find it's bitstring (by recursivly examining the parents) and append the bits to your compressed output. You are effectively looking down-up. When you want to decompress a compressed string, you just need to do the opposite - start from the top, follow the bit-patterns down until you reach a character. Then start from the top with the next bit again.
The encoding code looks like this:

//Build conversiontable
//Make a frequency list of chars


//Temp: output tree

//Encode string
StringBuilder sb = new StringBuilder();
foreach (char c in s)
string myBitString = sb.ToString();

When I execute the included console program, that first compresses and then decompresses the above string I get the following output:

As you can see we managed to compress to 43% of original size (35 bytes went down to 15). However note that in this compression we did not include a description of the tree, which would be necessary to decompress it.

I remember that I played around with this for a while back in the day and I even wrote a small commandline util that would compress and decompress a file. However after I did some compressions tests, comparing my implementation with Zip, it all of a sudden wasn't that fun any more :-)

Download the full C# 2.0 source code sample here.

No comments: