Authors: John MacCormick,Chris Bishop
five two one three point seven five
Let's again suppose that about 20% of the characters in this message get flipped randomly to something else due to a poor communication channel. The message might end up looking something like this:
fiqe kwo one thrxp point sivpn fivq
Although it's a little annoying to read, I think you will agree that anyone who knows English can guess that this corrupted message represents the true bank balance of $5213.75.
The key point is that because we used a
redundant
message, it is possible to reliably detect and correct any single change to the message. If I tell you that the characters “fiqe” represent a number in English and that only one character has been altered, you can be absolutely certain that the original message was “five,” because there is no other English number that can be obtained from “fiqe” by altering only one character. In stark contrast, if I tell you that the digits “367” represent a number but one of the digits has been altered, you have no way of knowing what the original number was, because there is no redundancy in this message.
Although we haven't yet explored exactly how redundancy works, we have already seen that it has something to do with making the message
longer
, and that each part of the message should conform to some kind of well-known
pattern.
In this way, any single change can be first identified (because it does not fit in with a known pattern) and then corrected (by changing the error to fit with the pattern).?
Computer scientists call these known patterns “code words.” In our example, the code words are just numbers written in English, like “one,” “two,” “three,” and so on.
Now it's time to explain exactly how the redundancy trick works. Messages are made up of what computer scientists call “symbols.” In our simple example, the symbols are the numeric digits 0-9 (we'll ignore the dollar sign and decimal point to make things even easier). Each symbol is assigned a code word. In our example, the symbol 1 is assigned the code word “one,” 2 is assigned “two,” and so on.
To transmit a message, you first take each symbol and translate it into its corresponding code word. Then you send the transformed message over the unreliable communication channel. When the message is received, you look at each part of a message and check whether it is a valid code word. If it is valid (e.g., “five”), you just transform it back into the corresponding symbol (e.g., 5). If it is not a valid code word (e.g., “fiqe”), you find out which code word it matches most closely (in this case, “five”), and transform
that
into the corresponding symbol (in this case, 5). Examples of using this code are shown in the figure above.
A code using English words for digits.
That's really all there is to it. Computers actually use this redundancy trick all the time to store and transmit information. Mathematicians have worked out fancier codewords than the English-language ones we were using as an example, but otherwise the workings of reliable computer communication are the same. The figure on the facing page gives a real example. This is the code computer scientists call the (7,4) Hamming code, and it is one of the codes discovered by Richard Hamming at Bell Labs in 1947, in response to the weekend computer crashes described earlier. (Because of Bell's requirement that he patent the codes, Hamming did not publish them until three years later, in 1950.) The most obvious difference to our previous code is that everything is done in terms of zeros and ones. Because every piece of data stored or transmitted by a computer is converted into strings of zeros and ones, any code used in real life is restricted to just these two digits.
Areal code used by computers. Computer scientists call this code the (7,4) Hamming code. Note that the “Encoding” box lists only five of the 16 possible 4-digit inputs. The remaining inputs also have corresponding code words, but they are omitted here.
But apart from that, everything works exactly the same as before. When encoding, each group of four digits has redundancy added to it, generating a code word of seven digits. When decoding, you first look for an exact match for the seven digits you received, and if that fails, you take the closest match. You might be worried that, now we are working with only ones and zeros, there might be more than one equally close match and you could end up choosing the wrong decoding. However, this particular code has been designed cunningly so that any single error in a 7-digit codeword can be corrected unambiguously. There is some beautiful mathematics behind the design of codes with this property, but we won't be pursuing the details here.
It's worth emphasizing why the redundancy trick is preferred to the repetition trick in practice. The main reason is the relative
cost
of the two tricks. Computer scientists measure the cost of error-correction systems in terms of “overhead.” Overhead is just the amount of extra information that needs to be sent to make sure a message is received correctly. The overhead of the repetition trick is enormous, since you have to send several entire copies of the message. The overhead of the redundancy trick depends on the exact set of code words that you use. In the example above that used English words, the redundant message was 35 characters long, whereas the original message consisted of only 6 numeric digits, so the overhead of this particular application of the redundancy trick is also quite large. But mathematicians have worked out sets of code words that have much lower redundancy, yet still get incredibly high performance in terms of the chance of an error going undetected. The low overhead of these code words is the reason that computers use the redundancy trick instead of the repetition trick.
The discussion so far has used examples of
transmitting
information using codes, but everything we have discussed applies equally well to the task of
storing
information. CDs, DVDs, and computer hard drives all rely heavily on error-correcting codes to achieve the superb reliability we observe in practice.
THE CHECKSUM TRICK
So far, we've looked at ways to simultaneously
detect
and
correct
errors in data. The repetition trick and the redundancy trick are both ways of doing this. But there's another possible approach to this whole problem: we can forget about
correcting
errors and concentrate only on
detecting
them. (The 17th-century philosopher John Locke was clearly aware of the distinction between error detection and error correction—as you can see from the opening quotation of this chapter.) For many applications, merely detecting an error is sufficient, because if you detect an error, you just request another copy of the data. And you can keep on requesting new copies, until you get one that has no errors in it. This is a very frequently used strategy. For example, almost all internet connections use this technique. We'll call it the “checksum trick,” for reasons that will become clear very soon.
To understand the checksum trick, it will be more convenient to pretend that all of our messages consist of numbers only. This is a very realistic assumption, since computers store all information in the form of numbers and only translate the numbers into text or images when presenting that information to humans. But, in any case, it is important to understand that any particular choice of symbols for the messages does not affect the techniques described in this chapter. Sometimes it is more convenient to use numeric symbols (the digits 0-9), and sometimes it is more convenient to use alphabetic symbols (the characters a-z). But in either case, we can agree on some translation between these sets of symbols. For example, one obvious translation from alphabetic to numeric symbols would be a —> 01, b —> 02,…, z —> 26. So it really doesn't matter whether we investigate a technique for transmitting numeric messages or alphabetic messages; the technique can later be applied to any type of message by first doing a simple, fixed translation of the symbols.
At this point, we have to learn what a checksum actually is. There are many different types of checksums, but for now we will stick with the least complicated type, which we'll call a “simple checksum.”
Computing the simple checksum of a numeric message is really, really easy: you just take the digits of the message, add them all up, throw away everything in the result except for the last digit, and the remaining digit is your simple checksum. Here's an example: suppose the message is
4 6 7 5 6
Then the sum all the digits is 4 + 6 + 7 + 5 + 6 = 28, but we keep only the last digit, so the simple checksum of this message is 8.
But how are checksums used? That's easy: you just append the checksum of your original message to the end of the message before you send it. Then, when the message is received by someone else, they can calculate the checksum again, compare it with the one you sent, and see if it is correct. In other words, they “check” the “sum” of the message—hence the terminology “checksum.” Let's stick with the above example. The simple checksum of the message “46756” is 8, so we transmit the message and its checksum as
4 6 7 5 6 8
Now, the person receiving the message has to know that you are using the checksum trick. Assuming they do know, they can immediately recognize that the last digit, the 8, is not part of the original message, so they put it to one side and compute the checksum of everything else. If there were no errors in the transmission of the message, they will compute 4 + 6 + 7 + 5 + 6 = 28, keep the last digit (which is 8), check that it is equal to the checksum they put aside earlier (which it is), and therefore conclude that the message was transmitted correctly. On the other hand, what happens if there
was
an error in transmitting the message? Suppose the 7 was randomly changed to a 3. Then you would receive the message
4 6 3 5 6 8
You would set aside the 8 for later comparison and compute the checksum as 4 + 6 + 3 + 5 + 6 = 24, keeping only the last digit (4). This is
not
equal to the 8 that was set aside earlier, so you would be sure that the message was corrupted during transmission. At this point, you request that the message is retransmitted, wait until you receive a new copy, then again compute and compare the checksum. And you can keep doing this until you get a message whose checksum is correct.
All of this seems almost too good to be true. Recall that the “overhead” of an error-correcting system is the amount of extra information you have to send in addition to the message itself. Well, here we seem to have the ultimate low-overhead system, since no matter how long the message is, we only have to add one extra digit (the checksum) to detect an error!
Alas, it turns out that this system of simple checksums
is
too good to be true. Here is the problem: the simple checksum described above can detect at most
one
error in the message. If there are two or more errors, the simple checksum might detect them, but then again it might not. Let's look at some examples of this:
The original message (46756) is the same as before, and so is its checksum (8). In the next line is a message with one error (the first digit is a 1 instead of a 4), and the checksum turns out to be 5. In fact, you can probably convince yourself that changing any
single
digit results in a checksum that differs from 8, so you are guaranteed to detect any single mistake in the message. It's not hard to prove that this is always true: if there is only one error, a simple checksum is absolutely guaranteed to detect it.