, , , , , ,

Let us imagine we wanted to achieve really great compression, and all computers were sold with a copy of the Library of Congress. If for example we wanted to quote Shakespeare to someone, we could just reference the particular play and the necessary act, chapter and verse number. Same with the Bible and with other books. Our email messages would not need to contain the actual quotations themselves, so they would be more compact.

This type of compression is an extreme version of the “codebook” method.

The problem with the above example is that it requires that a large codebook (in this case, the Library of Congress) be stored everywhere. Now, we could use the Internet to refer to the actual Library of Congress itself, but then we need to transmit the codebook lookup data a second time, and download the actual information from the Library so we can read it, thus defeating the purpose.

Another problem is, what happens if you want to say or quote something that is not in the Library? You would have to send that data in its raw form. So this kind of system (preinstalled codebooks) offers great compression but quickly grows very inflexible.

A flexible, future-proof codebook scheme is to send the codebook along with the compressed data. For example, I have a small codebook of the words “the”, “and”, “home”, and “person”, and my data is several paragraphs of text where these words are frequently used. So the sentence “I walked home and a person told me the time” would be coded like this:

I walked 3 2 a 4 told me 1 time

and when I decode, I treat numbers as codebook item numbers.

So I save twelve characters. If this was the only text I was sending, however, I would lose because the codebook itself is sixteen characters. So we need to be careful that the codebook is always a small part of the data we are sending.

If our data contains many different words, we can compress them if we make the codebook larger. For example, most English speakers use only about two thousand different words. But if our codebook is that big, then it would only be useful to compress messages that are several thousand words long or more.

This is a tradeoff that is common in compression schemes. The more clever an algorithm tries to be, it can wind up encoding enough secondary data (codebooks, the references to the codebooks, etc.) that it defeats the purpose.

The core of the problem is that compression seeks to express one kind of data (some repetitive sequence) with another kind of data, but if one is not careful, the second kind of data becomes just as verbose as the data it is trying to replace.

In some compression schemes, the problem moves from space to time. An algorithm may compress efficiently, but it takes a long time to figure out what the compressed data should be. Or it may also take a long time to decode the compressed data back to its original form. Since we use compression mostly to reduce transmission time over the Internet, an algorithm that takes a long time to encode defeats those time savings, and is useful only if the data is sent many times (e.g., broadcast or widely shared). If the algorithm takes a long time to decode, that is worse, since time will be lost on each viewing, quickly overwhelming any time savings in transmission.