Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers (17 page)

BOOK: Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers
11.41Mb size Format: txt, pdf, ePub

Numeric codes that a computer could use for storing symbols.

Meanwhile, let's get back to the shorter-symbol trick. As with many of the supposedly technical ideas described in this book, the shorter-symbol trick is something that humans do all the time without even thinking about it. The basic idea is that if you use something often enough, it's worth having a shorthand abbreviation for it. Everyone knows that “USA” is short for “United States of America”—we all save a lot of effort each time we type or say the 3-letter code “USA” instead of the full 24-letter phrase it stands for. But we don't bother with 3-letter codes for every 24-letter phrase. Do you know an abbreviation for “The sky is blue in color,” which also happens to be a 24-letter phrase? Of course not! But why? What is the difference between “United States of America” and “The sky is blue in color”? The key difference is that one of these phrases is used much more often than the other, and we can save a lot more effort by abbreviating a frequently used phrase instead of one that is rarely used.

Let's try and apply this idea to the coding system shown on the previous page. We already know that we can save the most effort by using abbreviations for things that are used frequently. Well, the letters “e' and “t” are the ones used most often in English, so let's try to use a shorter code for each of those letters. At the moment, “e” is 31 and “t” is 46—so it takes two digits to represent each of these letters. How about cutting them down to only one digit? Let's say “e” is now represented by the single digit 8, and “t” is 9. This is a great idea! Remember how we encoded the phrase “Meet your fiancé there.” earlier, using a total of 46 digits. Now we can do it as follows, using only 40 digits:

M  e  e  t    y  o  u  r    f  i  a  n  c  é    t  h  e  r  e  .
138  8  9  005141474400323527402982009  348  448  66

Unfortunately, there is a fatal flaw in this plan. Remember that the computer does not store the spaces between the individual letters. So the encoding doesn't really look like “13 8 8 9 00 51…44 8 66.” Instead it looks like “138890051…44866.” Can you see the problem yet? Concentrate on just the first five digits, which are 13889. Notice that the code 13 represents “M,” 8 represents “e,” and 9 represents “t,” so one way of decoding the digits 13889 is to split them up as 138-8-9, giving the word “Meet.” But 88 represents the accented symbol “ú,” so the digits 13889 might also be split up as 13-88-9, which represents “Mút.” In fact the situation is even worse, because 89 represents the slightly different accented symbol “ù,” so another possible split of 13889 is 13-8-89, representing “Meù.” There is absolutely no way to tell which of the three possible interpretations is correct.

Disaster! Our cunning plan to use shorter codes for the letters “e” and “t” has led to a coding system that doesn't work at all. Fortunately, it can be fixed with one more trick. The real problem is that whenever we see a digit 8 or 9, there is no way to tell if it is part of a one-digit code (for either “e” or “t”), or one of the two-digit codes that starts with 8 or 9 (for the various accented symbols like “á” and “è”). To solve this problem, we have to make a sacrifice: some of our codes will actually get
longer.
The ambiguous two-digit codes that start with 8 or 9 will become three-digit codes that
do not
start with 8 or 9. The table on page 114 shows one particular way of achieving this. Some of the punctuation characters got affected too, but we finally have a very nice situation: anything starting with a7isa three-digit code, anything starting with an 8 or 9 is a one-digit code, and anything starting with 0, 1, 2, 3, 4, 5 or 6 is the same two-digit code as before. So there is exactly one way to split up the digits 13889 now (13-8-8-9, representing “Meet”)—and this is true for any other correctly coded sequence of digits. All ambiguity has been removed, and our original message can be encoded like this:

M  e  e  t    y  o  u  r    f  i  a  n  c  é    t  h  e  r  e  .
138  8  9  0051414744003235274029782009  348  448  66

The original encoding used 46 digits, and this uses only 41. This might seem like a small saving, but with a longer message the savings can be very significant. For example, the text of this book (that is, just the words, with images excluded) requires nearly 500 kilobytes of storage—that's half a million characters! But when compressed using the two tricks just described, the size is reduced to only 160 kilobytes, or less than one-third of the original.

Summary: Where Did the Free Lunch Come From?

At this point, we understand all the important concepts behind the creation of typical compressed ZIP files on a computer. Here's how it happens:

Step 1.
The original uncompressed file is transformed using the same-as-earlier trick, so that most of the repeated data in the file is replaced by much shorter instructions to go back and copy the data from somewhere else.

Step 2.
The transformed file is examined to see which symbols occur frequently. For example, if the original file was written in English, then the computer will probably discover that “e” and “t” are the two most common symbols. The computer then constructs a table like the one on the following page, in which frequently used symbols are given short numeric codes and rarely used symbols are given longer numeric codes.

Step 3.
The file is transformed again by directly translating into the numeric codes from Step 2.

The table of numeric codes, computed in step 2, is also stored in the ZIP file—otherwise it would be impossible to decode (and hence decompress) the ZIP file later. Note that different uncompressed files will result in different tables of numeric codes. In fact, in a real ZIP file, the original file is broken up into chunks and each chunk can have a different numeric code table. All of this can be done efficiently and automatically, achieving excellent compression on many types of files.

Numeric codes using the shorter-symbol trick. Changes to the previous table on page 111 are shown in bold. The codes for two common letters have been shortened, at the expense of lengthening the codes for a larger number of uncommon symbols. This results in a shorter total length for most messages.

LOSSY COMPRESSION: NOT A FREE LUNCH, BUT A VERY GOOD DEAL

So far, we have been talking about the type of compression known as
lossless
, because you can take a compressed file and reconstruct exactly the same file that you started with, without even one character or one punctuation mark being changed. In contrast, sometimes it is much more useful to use
lossy
compression, which lets you take a compressed file and reconstruct one that is very similar to the original, but not necessarily exactly the same. For example, lossy compression is used very frequently on files that contain images or audio data: as long as a picture
looks
the same to the human eye, it doesn't really matter whether the file that stores that picture on your computer is exactly the same as the file that stores it on your camera. And the same is true for audio data: as long as a song sounds the same to the human ear, it doesn't really matter whether the file storing that song on your digital music player is exactly the same as the file that stores that song on a compact disc.

In fact, sometimes lossy compression is used in a much more extreme way. We have all seen low-quality videos and images on the internet in which the picture is blurry or the sound quality rather bad. This is the result of lossy compression being used in a more aggressive fashion to make the file size of the videos or images very small. The idea here is not that the video looks the same as the original to the human eye, but rather that it is at least recognizable. By tuning just how “lossy” the compression is, website operators can trade off between large, high-quality files that look and sound almost perfect, and low-quality files that have obvious defects but require much less bandwidth to transmit. You may have done the same thing on a digital camera, where you can usually choose different settings for the quality of images and videos. If you choose a high-quality setting, the number of pictures or videos you can store on the camera is smaller than when you choose a lower quality setting. That's because high-quality media files take up more space than low-quality ones. And it's all done by tuning just how “lossy” the compression is. In this section, we will find out a few of the tricks for doing this tuning.

The Leave-It-Out Trick

One simple and useful trick for lossy compression is to simply leave out some of the data. Let's take a look at how this “leave-it-out” trick works in the case of black-and-white pictures. First we need to understand a little about how black-and-white pictures are stored in a computer. A picture consists of a large number of small dots, called “pixels.” Each pixel has exactly one color, which could be black, white, or any shade of gray in between. Of course, we are not generally aware of these pixels because they are so small, but you can see the individual pixels if you look closely enough at a monitor or TV screen.

In a black-and-white picture stored in a computer, each possible pixel color is represented by a number. For this example, let's assume that higher numbers represent whiter colors, with 100 being the highest. So 100 represents white, 0 represents black, 50 represents a medium shade of gray, 90 represents a light gray, and so on. The pixels are arranged in a rectangular array of rows and columns, with each pixel representing the color at some very small part of the picture. The total number of rows and columns tells you the “resolution” of the image. For example, many high-definition TV sets have a resolution of 1920 by 1080—that means there are 1920 columns of pixels and 1080 rows of pixels. The total number of pixels is found by multiplying 1920 by 1080, which gives over 2 million pixels! Digital cameras use the same terminology. A “megapixel” is just a fancy name for a million pixels. So a 5-megapixel camera has enough rows and columns of pixels so that when you multiply the number of rows by the number of columns, you get more than 5 million. When a picture is stored in a computer, it is just a list of numbers, one for each pixel.

The picture of a house with a turret shown at the top left of the figure on the next page has a much lower resolution than a highdefinition TV: only 320 by 240. Nevertheless, the number of pixels is still rather large (320 × 240 = 76,800), and the file that stores this picture in uncompressed form uses over 230 kilobytes of storage space. A kilobyte, by the way, is equivalent to about 1000 characters of text—roughly the size of a one-paragraph e-mail, for instance. Very approximately, then, the top-left picture, when stored as a file, requires the same amount of disk space as around 200 short e-mail messages.

We can compress this file with the following extremely simple technique: ignore, or “leave out,” every second row of pixels and every second column of pixels. The leave-it-out trick really is that simple! In this case, it results in a picture with a smaller resolution of 160 by 120, shown below the original picture in the figure. The size of this file is only one-quarter of the original (about 57 kilobytes). This is because there are only one-quarter as many pixels—we reduced both the width
and
the height of the image by one-half. Effectively, the size of the image was reduced by 50% twice—once horizontally and once vertically—resulting in a final size that is only 25% of the original.

And we can do this trick again. Take the new 160 by 120 image, and leave out every second row and column to get another new image, this time only 80 by 60—the result is shown at the bottom left of the figure. The image size is reduced by 75% again, resulting in a final file size of only 14 kilobytes. That's only about 6% of the original—some very impressive compression.

Other books

Invincible (The Trident Code) by Albertson, Alana
Death of a Raven by Margaret Duffy
Margaret's Ark by Daniel G. Keohane
The Paladins by Julie Reece
The Dracons' Woman by Laura Jo Phillips
What He's Been Missing by Grace Octavia
Grey Wolf: The Escape of Adolf Hitler by Simon Dunstan, Gerrard Williams
Inside the O'Briens by Lisa Genova
A Wild Affair by Gemma Townley