Project for Lossless Data Compression, Fall 2005
The project for the course consists of a sequence of subtasks which
ask you to design, implement and evaluate certain
lossless codecs.
The result of your project should be a report which carefully
documents your work on each subtask. The report will be evaluated based
on
- the quality of your designs and your implementations (measured
by the compression rate they achieve and the running time of your
programs),
- the scientific quality of your
evaluations and
- the presentational quality of the descriptions and discussions in
your report.
Each subtask is weighted as described below.
You may carry out the project in groups consisting
of up to three students. If you join a group, the group should do
all subtasks together and write a single report. The report should
be handed in in person to the lecturer no later than October
14th at 12 noon. This is a firm deadline.
Subtask A (10 percent)
We shall use the files of the Canterbury Corpus as
benchmark files in the project. The Canterbury Corpus website also
contains links to various state-of-the-art and not-so-state-of-the-art
but well-known compression software. Select some of these and
replicate the experimental studies that are described on the website,
i.e., run each program on each file and measure compression rate
and compression and decompression time.
Note that for text
files, the natural measure of the compression rate is
bits/letter with the bits being the number
of bits in the compressed file and the letters being the number
of letters in the original file (For image files, the natural measure
would be bits/pixel and
for sound files the natural measure would be bits/sample
OR bits/second).
Next, run simple compression schemes which may described
using the command line tool in the ldc framework (without further
programming) on the same files. Such compression schemes include
arithmetic coding relative to a max likelihood kth order model for
various values of k and aritmetic coding relative to adaptive
kth order models for various values of k. It also includes simple
non-statistical methods, such as applying a move-to-front
transformation and Elias-coding the resulting sequence of integers.
Plot the results against the results of the public software. Also
plot the results against the kth order entropy of the files for
various values of k. Is the kth order entropy a reasonable prediction
of the compression rate. For which schemes? For which files?
For which values of k? What
is the overhead of transmitting the model? Could the overhead be
reduced by encoding the model differently? On which files do the
adaptive models perform better than the non-adaptive? Can you explain why?
On which files do the non-adaptive models perform better than the
adaptive? Can you explain why?
Subtask B (10 percent)
Repeat subtask A, but instead of using the Canterbury corpus files,
use randomly generated data according to some models you
select.
The models should be of
various complexities and may include, for instance, 0th order Markov models, kth order Markov models,
finite state models and Hidden Markov models. Try a range of parameters.
Compare the average compression rate to the entropy of the model as
well as the individual compression rate to the self-entropy of
individually generated messages according to the models.
How do the conclusions of Subtask A change when considering random
data? Can you explain the differences, if any?
Subtask C (50 percent)
As the main design and implementation task of the project, you
are asked to design and implement a variant of PPM, as described
by Bell, Witten and Cleary and Kieffer.
There are several variants to try and you are encouraged to design
your own experiments to see which variants and which paramters work
well. It may be difficult to
get the running time of the more advanced schemes reasonable. You may
thus want
to implement
data structures along the lines suggested by N. Jesper Larsson or
you
may want to
boost the efficiency in other ways of your own design.
If you are thirsty for trying more advanced variants of PPM to try, see the
paper by Suzanne
Bunton.
Compare the performance of your PPM variants to the results of
subproject A and B.
Subtask D (15 percent)
It has been suggested that by first transforming the text to be
compressed using the Burrows-Wheeler transform makes complicated
models such as PPM unnecessary and that similar compression ratios
can be achieved by using Burrows-Wheeler transform followed, possibly,
by a Move-to-front
transform and/or a run-length transform and aritmetic coding
relative to a simple adaptive model.
Can you verify the claim?
Experiment with schemes using such ideas and compare their performance
to the schemes of previous subtasks.
Subtask E (15 percent)
Using the various technology you have now mastered, give an
upper bound, as good as you can, on the length of the Kolmogorov code
for alice29.txt
relative to the standard Java
programming system, including the ldc framework.
More precisely, your bound should include
the length of the compressed file + the sizes of any new (i.e., not
included in the ldc framework) compiled
class files you have to use + 8*(the length of the command line used
to decompress). Explain how you arrived at your solution.