Authors: John MacCormick,Chris Bishop
Compression using the leave-it-out trick. The left column shows the original image, and two smaller, reduced versions of this image. Each reduced image is computed by leaving out half of the rows and columns in the previous one. In the right column, we see the effect of decompressing the reduced images to the same size as the original. The reconstruction is not perfect and there are some noticeable differences between the reconstructions and the original.
But remember, we are using lossy compression, so we don't get a free lunch this time. The lunch is cheap, but we do have to pay for it. Look at what happens when we take one of the compressed files and decompress it back to the original size. Because some of the rows and columns of pixels were deleted, the computer has to guess what the colors of those missing pixels should be. The simplest possible guess is to give any missing pixel the same color as one of its neighbors. Any choice of neighbor would work fine, but the examples shown here choose the pixel immediately above and to the left of the missing pixel.
The result of this decompression scheme is shown in the right-hand column of the figure. You can see that most of the visual features have been retained, but there is some definite loss of quality and detail, especially in complex areas like the tree, the turret's roof, and the fret-work in the gable of the house. Also, especially in the version decompressed from the 80 by 60 image, you can see some rather unpleasant jagged edges, for example, on the diagonal lines of the house's roof. These are what we call “compression artifacts”: not just a loss of detail, but noticeable new features that are introduced by a particular method of lossy compression followed by decompression.
Although it's useful for understanding the basic idea of lossy compression, the leave-it-out trick is rarely used in the simple form described here. Computers do indeed “leave out” information to achieve lossy compression, but they are much more careful about which information they leave out. A common example of this is the JPEG image compression format. JPEG is a carefully designed image compression technique which has far better performance than leaving out every second row and column. Take a look at the figure on the facing page, and compare the quality and size of the images with the previous figure. At the top, we have a JPEG image whose size is 35 kilobytes, and yet it is virtually indistinguishable from the original image. By leaving out more information, but sticking with the JPEG format, we can get down to the 19-kilobyte image in the center, which still has excellent quality although you can see some blurring and loss of detail in the fret-work of the house. But even JPEG suffers from compression artifacts if the compression is too extreme: at the bottom you can see a JPEG image compressed down to only 12 kilobytes, and you'll notice some blocky effects in the sky and some unpleasant blotches in the sky right next to the diagonal line of the house.
Although the details of JPEG's leave-it-out strategy are too technical to be described completely here, the basic flavor of the technique is fairly straightforward. JPEG first divides the whole image into small squares of 8 pixels by 8 pixels. Each of these squares is compressed separately. Note that without any compression, each square would be represented by 8 × 8 = 64 numbers. (We are assuming that the picture is black-and-white—if it is a color image, then there are three different colors and therefore three times as many numbers, but we won't worry about that detail here.) If the square happens to be all one color, the entire square can be represented by a single number, and the computer can “leave out” 63 numbers. If the square is mostly the same color, with a few very slight differences (perhaps a region of sky that is almost all the same shade of gray), the computer can decide to represent the square by a single number anyway, resulting in good compression for that square with only a small amount of error when it gets decompressed later. In the bottom image of the figure on the previous page, you can actually see some of the 8-by-8 blocks in the sky that have been compressed in exactly this way, resulting in small square blocks of uniform color.
With lossy compression schemes, higher compression produces lower quality. The same image is shown compressed at three different JPEG quality levels. At the top is the highest quality, which also requires the most storage. At the bottom is the lowest quality, which requires less than half the storage, but now there are noticeable compression artifacts—especially in the sky and along the border of the roof.
If the 8-by-8 square varies smoothly from one color to another (say, dark gray on the left to light gray on the right), then the 64 numbers might be compressed down to just two: a value for the dark gray and the value for the light gray. The JPEG algorithm does not work exactly like this, but it uses the same ideas: if an 8-by-8 square is close enough to some combination of known patterns like a constant color or a smoothly varying color, then most of the information can be thrown away, and just the level or amount of each pattern is stored.
JPEG works well for pictures, but how about audio and music files? These are also compressed using lossy compression, and they use the same basic philosophy: leave out information that has little effect on the final product. Popular music compression formats, such as MP3 and AAC, generally use the same high-level approach as JPEG. The audio is divided into chunks, and each chunk is compressed separately. As with JPEG, chunks that vary in a predictable way can be described with only a few numbers. However, audio compression formats can also exploit known facts about the human ear. In particular, certain types of sounds have little or no effect on human listeners and can be eliminated by the compression algorithm without reducing the quality of the output.
THE ORIGINS OF COMPRESSION ALGORITHMS
The same-as-earlier trick described in this chapter—one of the main compression methods used in ZIP files—is known to computer scientists as the LZ77 algorithm. It was invented by two Israeli computer scientists, Abraham Lempel and Jacob Ziv, and published in 1977.
To trace the origins of compression algorithms, however, we need to delve three decades further back into scientific history. We have already met Claude Shannon, the Bell Labs scientist who founded the field of information theory with his 1948 paper. Shannon was one of the two main heroes in our story of error-correcting codes (
chapter 5
), but he and his 1948 paper also figure importantly in the emergence of compression algorithms.
This is no coincidence. In fact, error-correcting codes and compression algorithms are two sides of the same coin. It all comes down to the notion of
redundancy
, which featured quite heavily in
chapter 5
. If a file has redundancy, it is longer than necessary. To repeat a simple example from
chapter 5
, the file might use the word “five” instead of the numeral “5.” That way, an error such as “fivq” can be easily recognized and corrected. Thus, error-correcting codes can be viewed as a principled way of
adding
redundancy to a message or file.
Compression algorithms do the opposite: they
remove
redundancy from a message or file. It's easy to imagine a compression algorithm that would notice the frequent use of the word “five” in a file and replace this with a shorter symbol (which might even be the symbol “5”), exactly reversing the error-correction encoding process. In practice, compression and error correction do not cancel each other out like this. Instead, good compression algorithms remove inefficient types of redundancy, while error-correction encoding adds a different, more efficient type of redundancy. Thus, it is very common to first compress a message and then add some error correction to it.
Let's get back to Shannon. His seminal 1948 paper, among its many extraordinary contributions, included a description of one of the earliest compression techniques. An MIT professor, Robert Fano, had also discovered the technique at about the same time, and the approach is now known as Shannon-Fano coding. In fact, Shannon-Fano coding is a particular way of implementing the shorter-symbol trick described earlier in this chapter. As we shall soon see, Shannon-Fano coding was rapidly superseded by another algorithm, but the method is very effective and survives to this day as one of the optional compression methods in the ZIP file format.
Both Shannon and Fano were aware that although their approach was both practical and efficient, it was not the best possible: Shannon had proved mathematically that even better compression techniques must exist, but had not yet discovered how to achieve them. Meanwhile, Fano had started teaching a graduate course in information theory at MIT, and he posed the problem of achieving optimal compression as one of the options for a term paper in the course. Remarkably, one of his students solved the problem, producing a method that yields the best possible compression for each individual symbol. The student was David Huffman, and his technique—now known as Huffman coding—is another example of the shorter-symbol trick. Huffman coding remains a fundamental compression algorithm and is widely used in communication and data storage systems.
1
The solution: the letter A repeated 251 times.
8
Databases: The Quest for Consistency
“Data! data! data!” he cried impatiently. “I can't make bricks without clay.”
—S
HERLOCK
H
OLMES IN
A
RTHUR
C
ONAN
D
OYLE'S
The Adventure of the Copper Beeches
Consider the following arcane ritual. A person takes from a desk a specially printed pad of paper (known as a
checkbook
), writes some numbers on it, and adds a signature with a flourish. The person then tears the top sheet from the pad, puts it in an envelope, and sticks another piece of paper (known as a
stamp)
on the front of the envelope. Finally, the person carries the envelope outside and down the street, to a large box where the envelope is deposited.
Until the turn of the 21st century, this was the monthly ritual of anyone paying a bill: phone bills, electric bills, credit card bills, and so on. Since then, systems of online bill payment and online banking have evolved. The simplicity and convenience of these systems makes the previous paper-based method seem almost ludicrously laborious and inefficient by comparison.
What technologies have enabled this transformation? The most obvious answer is the arrival of the internet, without which online communication of any form would be impossible. Another crucial technology is public key cryptography, which we already discussed in
chapter 4
. Without public key crypto, sensitive financial information could not be securely transmitted over the internet. There is, however, at least one other technology that is essential for online transactions: the
database.
As computer users, most of us are blissfully unaware of it, but virtually all of our online transactions are processed using sophisticated database techniques, developed by computer scientists since the 1970s.
Databases address two major issues in transaction processing: efficiency and reliability. Databases provide efficiency through algorithms that permit thousands of customers to simultaneously conduct transactions without leading to any conflicts or inconsistencies. And databases provide reliability through algorithms that allow data to survive intact despite the failure of computer components like disk drives, which would usually lead to severe data loss. Online banking is a canonical example of an application that requires outstanding efficiency (to serve many customers at once without producing any errors or inconsistencies) and essentially perfect reliability. So to focus our discussions, we will often return to the example of online banking.
In this chapter, we will learn about three of the fundamental—and beautiful—ideas behind databases: write-ahead logging, two-phase commit, and relational databases. These ideas have led to the absolute dominance of database technology for storing certain types of important information. As usual, we'll try to focus on the core insight behind each of these ideas, identifying a single trick that makes it work. Write-ahead logging boils down to the “to-do list trick,” which is tackled first. Then we move on to the two-phase commit protocol, described here via the simple but powerful “prepare-then-commit trick.” Finally, we will take a peek into the world of relational databases by learning about the “virtual table trick.”
But before learning any of these tricks, let's try to clear up the mystery of what a database actually is. In fact, even in the technical computer science literature, the word “database” can mean a lot of different things, so it is impossible to give a single, correct definition. But most experts would agree that the key property of databases, the one that distinguishes them from other ways of storing information, is that the information in a database has a predefined structure.
To understand what “structure” means here, let's first look at its opposite—an example of
unstructured
information:
Rosina is 35, and she's friends with Matt, who is 26. Jingyi is 37 and Sudeep is 31. Matt, Jingyi, and Sudeep are all friends with each other.
This is exactly the type of information that a social networking site, like Facebook or MySpace, would need to store about its members. But, of course, the information would not be stored in this unstructured way. Here's the same information in a structured form:
Computer scientists call this type of structure a
table.
Each row of the table contains information about a single thing (in this case, a person). Each column of the table contains a particular type of information, such as a person's age or name. A database often consists of many tables, but our initial examples will keep things simple and use only a single table.
Obviously, it is vastly more efficient for humans and computers alike to manipulate data in the structured form of a table, rather than the unstructured free text in the example above. But databases have much more going for them than mere ease of use.
Our journey into the world of databases begins with a new concept: consistency. As we will soon discover, database practitioners are obsessed with consistency—and with good reason. In simple terms, “consistency” means that the information in the database doesn't contradict itself. If there is a contradiction in the database, we have the worst nightmare of the database administrator: inconsistency. But how could an inconsistency arise in the first place? Well, imagine that the first two rows in the table above were changed slightly, giving:
Can you spot the problem here? According to the first row, Rosina is friends with Jingyi. But according to the second row, Jingyi is not friends with Rosina. This violates the basic notion of friendship, which is that two people are simultaneously friends with each other. Admittedly, this is a rather benign example of inconsistency.
To imagine a more serious case, suppose that the concept of “friendship” is replaced with “marriage.” Then we would end up with
A
married to B, but
B
married to
C
—a situation that is actually illegal in many countries.
Actually, this type of inconsistency is easy to avoid when new data is added to the database. Computers are great at following rules, so it's easy to set up a database to follow the rule “If
A
is married to
B
, then
B
must be married to
A.”
If someone tries to enter a new row that violates this rule, they will receive an error message and the entry will fail. So ensuring consistency based on simple rules doesn't require any clever trick.
But there are other types of inconsistency that require much more ingenious solutions. We'll look at one of these next.
TRANSACTIONS AND THE TO-DO LIST TRICK
Transactions are probably the most important idea in the world of databases. But to understand what they are, and why they are necessary, we need to accept two facts about computers. The first fact is one that you are probably all too familiar with: computer programs crash—and when a program crashes, it forgets everything it was doing. Only information that was explicitly saved to the computer's file system is preserved. The second fact we need to know is rather obscure, but extremely important: computer storage devices, such as hard drives and flash memory sticks, can write only a small amount of data instantaneously—typically about 500 characters. (If you're interested in technical jargon, I'm referring here to the
sector size
of a hard disk, which is typically 512 bytes. With flash memory, the relevant quantity is the
page size
, which may also be hundreds or thousands of bytes.) As computer users, we never notice this small size limit for instantaneously storing data on a device, because modern drives can execute hundreds of thousands of these 500-character writes every second. But the fact remains that the disk's contents get changed only a few hundred characters at a time.
What on earth does this have to do with databases? It has an extremely important consequence: typically, the computer can update only a single row of a database at any one time. Unfortunately, the very small and simple example above doesn't really demonstrate this. The entire table above contains less than 200 characters, so in this particular case, it
would
be possible for the computer to update two rows at once. But in general, for a database of any reasonable size, altering two different rows does require two separate disk operations.
With these background facts established, we can get to the heart of the matter. It turns out that many seemingly simple changes to a database require two or more rows to be altered. And as we now know, altering two different rows cannot be achieved in a single disk operation, so the database update will result in some sequence of two or more disk operations. But the computer can crash at any time. What will happen if the computer crashes
between
two of these disk operations? The computer can be rebooted, but it will have forgotten about any operations it was planning to perform, so it's possible that some of the necessary changes were never made. In other words, the database might be left in an inconsistent state!
At this stage, the whole problem of inconsistency after a crash might seem rather academic, so we'll look at two examples of this extremely important problem. Let's start with an even simpler database than the one above, say:
This very dull and depressing database lists three lonely people. Now suppose Rosina and Jingyi become friends, and we would like to update the database to reflect this happy event. As you can see, this update will require changes to both the first and second rows of the table—and as we discussed earlier, this will generally require two separate disk operations. Let's suppose that row 1 happens to get updated first. Immediately after that update, and before the computer has had a chance to execute the second disk operation that will update row 2, the database will look like this:
So far, so good. Now the database program just needs to update row 2, and it will be done. But wait: what if the computer crashes before it gets a chance to do that? Then after the computer has restarted, it will have no idea that row 2 still needs to be updated. The database will be left exactly as printed above: Rosina is friends with Jingyi, but Jingyi is
not
friends with Rosina. This is the dreaded inconsistency.