I forget what I was looking for

when I stumbled onto Haruhiko Okumura’s lzss.c, but I was really intrigued.  Every time I’ve seen anything to do with compression, it’s insanely massive.

Except for this.

Including the ‘main’ portion of the source, it’s 180 lines long.  4.3KB.  That’s microscopic by today’s standards.  On OS X I get a 13KB executable, compared to 76KB for gzip, or 1.6MB for 7za.

And googling around I found a few other variations.  So I figured it would be slightly fun to have a ‘bakeoff’ with the ‘traditional’ Calgary Corpus, which includes various types of data.

Unsurprisingly, 7zip is the best of the bunch.

$ ./testcomp.sh
compiling……..
cleaning up
unzipping
running…….The winner (smallest) is :
261057 6 Jun 20:18 book1.7z
53167 6 Jun 20:18 geo.7z
9472 6 Jun 20:18 obj1.7z
17322 6 Jun 20:18 paper1.7z
43596 6 Jun 20:18 pic.7z
15060 6 Jun 20:18 progl.7z
16748 6 Jun 20:18 trans.7z
30716 6 Jun 20:18 bib.7z
169894 6 Jun 20:18 book2.7z
119399 6 Jun 20:18 news.7z
61758 6 Jun 20:18 obj2.7z
27310 6 Jun 20:18 paper2.7z
12605 6 Jun 20:18 progc.7z
10428 6 Jun 20:18 progp.7z

But the source to 7zip is unwieldy at best.  So how did the small lzss and its variants do?

Compression percentage


Honestly I’m surprised gzip put up a good fight.  Bzip2 & 7zip really fought for the top.  The surprise to me was lzhuf leading the old stuff, which has its roots back in 1988/1989.  So let’s look at the data without anything modern in the way.
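For anyone wanting to reproduce a number like the ones charted here, this is a minimal sketch in C.  It assumes ‘compression percentage’ means the compressed size as a share of the original size, which may not be exactly how the chart was calculated.  The helper names compressed_percent and file_size are made up for illustration and are not part of the test package.

```c
#include <stdio.h>

/* Compressed size as a percentage of the original (smaller is better).
   Assumed definition, not necessarily the one used for the charts.
   Returns -1.0 for an empty original to avoid dividing by zero. */
double compressed_percent(long original_bytes, long compressed_bytes)
{
    if (original_bytes <= 0)
        return -1.0;
    return 100.0 * (double)compressed_bytes / (double)original_bytes;
}

/* Size of a file in bytes, or -1 if it cannot be opened or measured. */
long file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    long size;

    if (f == NULL)
        return -1;
    if (fseek(f, 0L, SEEK_END) != 0) {
        fclose(f);
        return -1;
    }
    size = ftell(f);
    fclose(f);
    return size;
}
```

As a rough check, taking the commonly quoted 768771 byte size of book1 (an assumption, it isn’t in the output above) and the 261057 byte book1.7z, compressed_percent(768771, 261057) comes out at about 34%.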

Old Compression only


So from the numbers, we can see that lzs2 and lzs3 perform almost identically, with lzs & lzs4 at the bottom.  Now when we look at time, we get something different.

Compression duration


Both lzs & lzs4 take eight or more seconds!  So they are both out, as I’m shopping for something good/fast, and taking this long is out of the question!  So it comes down to how complicated lzhuf2, lzs2 and lzs3 are.

Source      Size (bytes)
lzs.c       4360
lzs4.c      4632
lzs2.c      8308
lzs3.c      12844
lzhuf.c     18323
lzhuf2.c    22556

While lzs.c is still pretty impressive for the size, for what I’m going to try though, I’m going to use lzs2.c as it’s 8KB, and seems to fit the bill.

For anyone who’s interested in running this on their own, here is the package.  I only tested on OS X; it may run on other UNIX stuff, it may not.  Extract it and run ‘testcomp.sh’.  And it may even work!  The only thing I had to add on OS X was ‘-Wno-return-type’ for compiling, as clang doesn’t like ancient source like this…
