Project for Lossless Data Compression, Fall 2005

The project for the course consists of a sequence of subtasks which ask you to design, implement and evaluate certain lossless codecs. The result of your project should be a report which carefully documents your work on each subtask. The report will be evaluated based on
  1. the quality of your designs and your implementations (measured by the compression rate they achieve and the running time of your programs),
  2. the scientific quality of your evaluations and
  3. the presentational quality of the descriptions and discussions in your report.
Each subtask is weighted as described below. You may carry out the project in groups consisting of up to three students. If you join a group, the group should do all subtasks together and write a single report. The report should be handed in in person to the lecturer no later than October 14th at 12 noon. This is a firm deadline.

Subtask A (10 percent)

We shall use the files of the Canterbury Corpus as benchmark files in the project. The Canterbury Corpus website also contains links to various state-of-the-art and not-so-state-of-the-art but well-known compression software. Select some of these and replicate the experimental studies that are described on the website, i.e., run each program on each file and measure compression rate and compression and decompression time. Note that for text files, the natural measure of the compression rate is bits/letter with the bits being the number of bits in the compressed file and the letters being the number of letters in the original file (For image files, the natural measure would be bits/pixel and for sound files the natural measure would be bits/sample OR bits/second). Next, run simple compression schemes which may described using the command line tool in the ldc framework (without further programming) on the same files. Such compression schemes include arithmetic coding relative to a max likelihood kth order model for various values of k and aritmetic coding relative to adaptive kth order models for various values of k. It also includes simple non-statistical methods, such as applying a move-to-front transformation and Elias-coding the resulting sequence of integers. Plot the results against the results of the public software. Also plot the results against the kth order entropy of the files for various values of k. Is the kth order entropy a reasonable prediction of the compression rate. For which schemes? For which files? For which values of k? What is the overhead of transmitting the model? Could the overhead be reduced by encoding the model differently? On which files do the adaptive models perform better than the non-adaptive? Can you explain why? On which files do the non-adaptive models perform better than the adaptive? Can you explain why?

Subtask B (10 percent)

Repeat subtask A, but instead of using the Canterbury corpus files, use randomly generated data according to some models you select. The models should be of various complexities and may include, for instance, 0th order Markov models, kth order Markov models, finite state models and Hidden Markov models. Try a range of parameters. Compare the average compression rate to the entropy of the model as well as the individual compression rate to the self-entropy of individually generated messages according to the models. How do the conclusions of Subtask A change when considering random data? Can you explain the differences, if any?

Subtask C (50 percent)

As the main design and implementation task of the project, you are asked to design and implement a variant of PPM, as described by Bell, Witten and Cleary and Kieffer. There are several variants to try and you are encouraged to design your own experiments to see which variants and which paramters work well. It may be difficult to get the running time of the more advanced schemes reasonable. You may thus want to implement data structures along the lines suggested by N. Jesper Larsson or you may want to boost the efficiency in other ways of your own design. If you are thirsty for trying more advanced variants of PPM to try, see the paper by Suzanne Bunton. Compare the performance of your PPM variants to the results of subproject A and B.

Subtask D (15 percent)

It has been suggested that by first transforming the text to be compressed using the Burrows-Wheeler transform makes complicated models such as PPM unnecessary and that similar compression ratios can be achieved by using Burrows-Wheeler transform followed, possibly, by a Move-to-front transform and/or a run-length transform and aritmetic coding relative to a simple adaptive model. Can you verify the claim? Experiment with schemes using such ideas and compare their performance to the schemes of previous subtasks.

Subtask E (15 percent)

Using the various technology you have now mastered, give an upper bound, as good as you can, on the length of the Kolmogorov code for alice29.txt relative to the standard Java programming system, including the ldc framework. More precisely, your bound should include the length of the compressed file + the sizes of any new (i.e., not included in the ldc framework) compiled class files you have to use + 8*(the length of the command line used to decompress). Explain how you arrived at your solution.