The Numbers Behind NUMB3RS (15 page)

BOOK: The Numbers Behind NUMB3RS
4.1Mb size Format: txt, pdf, ePub

At this point, most laypeople are likely to say, “Look, since DNA profiling has an inaccuracy rate of less than one in many trillions (or more), the chances of there being a false match in a database of maybe 3 million entries, such as the CODIS database, is so tiny that no matter which method you use to calculate the odds, a match will surely be definitive proof.” The intuition behind such a conclusion is presumably that the database search has 3 million shots at finding a match, so if the odds against there being a match are 1 in 10 trillion, then the odds against finding a match in the entire database are roughly 1 in 3 million (3 million divided by 3 trillion is roughly 1/3,000,000).

Unfortunately—at least it could be unfortunate for an innocent defendant in the case—this argument is not valid. In fact, notwithstanding an RMP in the “one in many trillions” range, even a fairly small DNA database is likely to contain numerous pairs of accidental matches, where two different people have the same DNA profile. A tiny RMP simply does not mean there won't be accidental matches. This is a more subtle version of the well-known birthday puzzle that says you need only have 23 randomly selected people in a room for there to be a better-than-even chance that two of them will have the same birthday. (The exact calculation is a bit intricate, but you get a sense of what is going on when you realize that with 23 people, there are 23 × 22 = 506 possible pairs of people, each of which might share a birthday, and that turns out to be just enough pairs to tilt the odds to .508 in favor of there being a match.)

For example, the Arizona DNA convicted offender database is a fairly small one, with some 65,000 entries, each being a thirteen-loci profile. Suppose, for simplicity, that the probability of a random match at a single locus is 1/10, a figure that, as we observed earlier, is not unreasonable. Thus, the RMP for a nine-locus match is 1/10
9
, i.e., 1 in 1 billion. You might think that with such long odds against a randomly selected pair of profiles matching at nine loci, it would be highly unlikely that the database contained a pair of entries that were identical on nine loci. Yet, by an argument similar to the one used in the birthday puzzle, the probability of getting two profiles that match on nine loci is around 5 percent, or 1 in 20. For a database of 65,000 entries, that means you would be quite likely to find some matching profiles!

We'll sketch the calculation at the end of the chapter, but the answer becomes less surprising when you realize that for a database of 65,000 entries, there are roughly 65,000
2
—that is, 4,225,000,000—possible pairs of entries, each of which has a chance of yielding a nine-loci match.

In 2005, an actual analysis of the Arizona database uncovered 144 individuals whose DNA profiles matched at nine loci. There were another few that matched at ten loci, one pair that matched at eleven, and one pair that matched at twelve. The eleven and twelve loci matches turned out to be siblings, hence not random. But the others were not, and were, in fact, close to what one should expect from the mathematics when you replace our simplifying 1/10 single-locus match assumption with a realistic figure obtained empirically.

All of which leaves judges and juries facing a mathematical nightmare in reasoning their way to a just decision. On the other hand, even after the mathematical complexities are taken into account, DNA profiling is considerably more reliable than that much older identification standby: fingerprints, which we look at in Chapter 9.

The Database Match Calculation

Here is the calculation we promised earlier. Recall that we have a DNA profile database with 65,000 entries, each entry being a thirteen-loci profile. We suppose that the probability of a random match at a single locus is 1/ 10, so the RMP for a nine-locus match is 1/ 10
9
, that is 1 in a billion.

Now, there are 13!/ [9! × 4!] = [13 × 12 × 11 × 10]/ [4 × 3 × 2 × 1] = 715 possible ways to choose nine loci from thirteen, so the RMP for finding a match on
any
nine loci of the thirteen is 715/ 10
9
.

If you pick any profile in the database, the probability of a second profile not matching on nine loci is roughly 1 – 715/ 10
9
.

Hence, the probability of all 65,000 database entries not matching on nine loci is roughly (1 – 715/ 10
9
)
65,000
. Using the binomial theorem, this is approximately 1 – 65,000 × 715/ 10
9
= 1 – 46,475/ 10
6
, roughly 1 – .05.

The probability of there being a nine-loci match is the difference between 1 and this figure, namely 1 – (1 – 0.05) = 0.05.

CHAPTER
8
Secrets—Making and Breaking Codes

PRIME SUSPECT

In the fifth episode of the first season of
NUMB3RS
, titled “Prime Suspect,” broadcast February 18, 2005, a five-year-old girl is kidnapped. Don asks for Charlie's help when he discovers that the girl's father, Ethan, is also a mathematician. When Charlie sees the mathematics Ethan has scribbled on the whiteboard in his home office, he recognizes that Ethan is working on Riemann's Hypothesis, a famous math problem that has resisted attempts at solution for more than 150 years.

The Riemann problem is one of the so-called Millennium Problems, a list of seven unsolved mathematics problems drawn up by an international panel of experts in the year 2000, for each of which the solver will be awarded a $1 million prize. In the case of the Riemann problem, a solution is likely to lead to more than a $1 million prize. It could also lead to a major breakthrough in how to factor large numbers into primes, and hence provide a method for breaking the security code system used to encrypt Internet communications. If that were to happen, Internet commerce would break down immediately, with major economic consequences.

When Don is able to determine the identity of one of the kidnappers, and learns that the plan is to “unlock the world's biggest financial secret” it becomes clear why Ethan's daughter was kidnapped. The captors want to use Ethan's method to break into a bank's computer and steal millions of dollars. Don's obvious strategy is for Ethan to provide the gang with the key to get into the bank's computer and trace the activity electronically in order to catch the thieves. But when Charlie finds a major error in Ethan's argument, the only hope Don has to rescue Ethan's daughter is to come up with a way to fool the kidnappers into believing that he really can provide the Internet encryption key they are demanding, and use that to trace their location to rescue the daughter.

At one point in the episode, Charlie gives a lecture to the FBI agents on how Internet encryption depends on the difficulty of factoring large numbers into primes. Elsewhere in the story, Charlie and Ethan discuss the feasibility of turning Ethan's solution into an algorithm and Charlie refers to “the expansion of the zero-free region to the critical strip.” Charlie also observes that the kidnappers would need a supercomputer to factor a large number into primes. Amita, his student, notes that it is possible to build a supercomputer with a large number of PCs linked together. As always, these are all mathematically meaningful and realistic statements. So too is the basic premise for the story: a solution to the Riemann problem might very well lead to a collapse of methods currently used to keep Internet communications secure. Ever since the Second World War, message encryption has been the business of mathematicians.

WWW.CYBERCRIME.GOV

These days, you don't need a gun or a knife to steal money. A cheap personal computer and an Internet connection will do. It's called cybercrime; it's a new form of crime; it is substantial; and it is growing. It includes a broad range of illegal activities, such as software piracy, music piracy, credit card fraud (of many kinds), identity theft, manipulation of stocks, corporate espionage, child pornography, and “phishing” (sending a computer user an e-mail that purports to be from a financial institution that seeks to trick the receiver into revealing their bank details and other personal data).

There are no reliable figures on the extent of cybercrime, since many banks and Internet commerce companies keep such information secret, to avoid giving the impression that your money or credit card number is not safe in their hands. It has been suggested, though hotly disputed, that the annual proceeds from cybercrime may be in excess of $100 billion. If that were true, it would exceed the sale of illegal drugs. Regardless of the actual figures, cybercrime is a sufficiently major problem that both the U.S. Department of Justice and the FBI have entire units that focus on such criminal activity, and both have websites devoted to information about it: www.cybercrime.gov and www.fbi.gov/cyberinvest/cyberhome.htm, respectively.

The 2005 FBI computer crime survey, developed and analyzed with the help of leading public and private authorities on cyber security, and based on responses from a cross section of more than 2,000 public and private organizations in four states, reported that:

  • Nearly nine out of ten organizations experienced computer security incidents in the year; 20 percent of them indicated they had experienced twenty or more attacks; viruses (83.7 percent) and spyware (79.5 percent) headed the list.
  • Over 64 percent of the respondents incurred a financial loss. Viruses and worms cost the most, accounting for $12 million of the $32 million in total losses.
  • The attacks came from thirty-six different countries. The United States (26.1 percent) and China (23.9 percent) were the source of more than half of the intrusion attempts, though many attackers route through one or more intermediate computers in different countries, which makes it difficult to get an accurate reading.

Law enforcement agents who focus their energies on cybercrime use mathematics in much of their work. In many cases, they use the same techniques as are described elsewhere in this book. In this chapter, however, we'll focus our attention on one important aspect of the fight against cybercrime that uses different mathematics, namely Internet security. In this area, ingenious use of some sophisticated mathematics has led to major advances, with the result that, if properly used, the systems available today for keeping Internet communications secure are extremely reliable.

KEEPING SECRETS

When you use an ATM to withdraw money from your account, or send your credit card details to an Internet retailer, you want to be sure that only the intended receiver has access to the details you send. This cannot be achieved by preventing unauthorized third parties from “eavesdropping” on the electronic messages that pass between you and the organization you are dealing with. The Internet is what is called an open system, which means that the connections between the millions of computers that make up the network are, to all intents and purposes, public. Security of Internet communications traffic is achieved by means of encryption—“scrambling” the message so that, even if an unauthorized third party picks up the signal transmitted, the eavesdropper will be unable to make sense of it.

The notion of encryption is not new. The idea of using a secret code to keep the contents of a message secret goes back at least as far as the days of the Roman Empire, when Julius Caesar used secret codes to ensure the security of the orders he sent to his generals during the Gallic wars. In what is nowadays called a Caesar cipher, the original message is transformed by taking each letter of each word in turn and replacing it by another letter according to some fixed rule, such as taking the letter three places along in the alphabet, so A is replaced by D, G by J, Y by B, and so on. Thus the word “mathematics” would become “pdwkhpdwlfv”.

A message encrypted using a Caesar cipher may look on the surface to be totally indecipherable without knowing the rule used, but this is by no means the case. For one thing, there are only twenty-five such “shift along” ciphers, and an enemy who suspected you were using one need only try them all in turn until the one used was found.

A slightly more robust approach would be to employ some other, less obvious rule for substituting letters. Unfortunately, any such substitution cipher, which simply replaces one letter by another, is highly vulnerable to being broken by a simple pattern analysis. For instance, there are very definite frequencies with which individual letters occur in English (or in any other language), and by counting the number of occurrences of each letter in your coded text, an enemy can easily deduce just what your substitution rule is—especially when computers are used to speed up the process.

With simple substitution out of the question, what else might you try? Whatever you choose, similar dangers are present. If there is any kind of recognizable pattern to your coded text, a sophisticated statistical analysis can usually crack the code without much difficulty.

To be secure, therefore, an encryption system must destroy any pattern that the enemy could discover in order to break the code. Yet, the transformation performed on the message by your encryption scheme clearly cannot destroy
all
order—the message itself must still be there beneath it all, to allow the intended receiver to recover it. The trick, then, is to design the encryption system so that this hidden order is buried sufficiently deeply to prevent an enemy from discovering it.

All cipher systems employed since the end of the Second World War depend on mathematics, and all use computers. They have to. Because the enemy may be assumed to have powerful computers to analyze your encrypted message, your system needs to be sufficiently complex to resist computer attack.

It takes a lot of time and effort to design and build a secure encryption system. To avoid having constantly to develop new systems, modern encryption systems invariably consist of two components: an encryption procedure and a “key.” The former is, typically, a computer program or possibly a specially designed computer. In order to encrypt a message the system requires not only the message but also the chosen key, usually a secret number. The encryption program will code the message in a way that depends upon the chosen key, so that only by knowing that key will it be possible to decode the ciphered text. Because the security depends on the key, the same encryption program may be used by many people for a long period of time, and this means that a great deal of time and effort can be put into its design.

An obvious analogy is that manufacturers of safes and locks are able to stay in business by designing one type of lock which may be sold to hundreds of users, who rely upon the uniqueness of their own key to provide security. (The “key” in this case could be a physical key or a numerical combination.) Just as an enemy may know how your lock is designed and yet still be unable to break into your safe without having the physical key or knowing the combination, so the enemy may know what encryption system you are using without being able to crack your coded messages—a task for which knowledge of your key is required.

In some key-based encryption systems, the message sender and receiver agree beforehand on some secret key that they then use to send each other messages. As long as they keep this key secret the system, if it is well designed, should be secure. One such system used for many years, though now regarded as a bit too long in the tooth and vulnerable to attack using computers much faster than were available when it was first developed, is the American-designed Data Encryption Standard (DES). The DES requires for its key a number whose binary representation has 56 bits (in other words, a string of 56 zeros and ones operates as the key). Why such a long key? Well, no one made any secret of how the DES system works. All the details were published at the outset. That means that an enemy could crack your coded messages simply by trying all possible keys one after the other until one is found which works. With the DES, there are 2
56
possible keys to be tried, a number that was large enough to render the task virtually impossible in the days when the system was first used.

Encryption systems such as DES have an obvious drawback. Before such a scheme can be used, the sender and receiver have to agree on the key they will use. Since they will not want to transmit that key over any communication channel, they have to meet and choose the key, or at the very least employ a trusted courier to convey the key from one to the other. This is fine for setting up Internet access to your bank account; you can simply go along in person to your local branch and set up the key in person. But it is no use at all to establish secure communication between individuals who have not already met. In particular, it is not suitable for use in Internet commerce, where people want to send secure messages across the world to someone they have never met.

PUBLIC KEY CRYPTOGRAPHY

In 1976, two young researchers at Stanford University, Whitfield Diffie and Martin Hellman, published a landmark paper titled “New Directions in Cryptography,” in which they proposed a new type of cipher system: public key cryptography. In a public key system, the encryption method requires not one but two keys—one for enciphering and the other for deciphering. (This would be like having a lock that requires one key to lock it and another to unlock it.) Such a system would be used as follows, they suggested.

An individual, let's call her Alice, who wishes to use the system, purchases the standard program (or special computer) used by all members of the communication network concerned. Alice then generates two keys. One of these, her deciphering key, she keeps secret. The other key, the one that will be used by anyone else on the network for encoding messages they want to send
to her
, she publishes in a directory of the network users.

If another network user, Bob, wants to send Alice a message, he looks up Alice's public enciphering key, encrypts the message using that key, and sends the encrypted message to Alice. To decode the message, it is of no help knowing (as anyone can) Alice's enciphering key. You need the deciphering key. And only Alice, the intended receiver, knows that. (An intriguing feature of such a system is that once Bob has enciphered his message, he cannot decipher it; so if he wants to refer to it later he'd better keep a copy of the original, unciphered version!)

Diffie and Hellman were not able to come up with a reliable way to construct such a system, but the idea was brilliant, and it was not long before three researchers at MIT, Ronal Rivest, Adi Shamir, and Leonard Adleman, found how to make the suggestion work. Their idea was to exploit the strengths and weaknesses of those very computers whose existence makes the encryption-scheme designer's task so difficult.

Other books

Up All Night by Faye Avalon
While Love Stirs by Lorna Seilstad
The Missing Madonna by Sister Carol Anne O’Marie
Portrait of A Novel by MICHAEL GORRA
Taken Away by Celine Kiernan
Blind Passion by Brannan Black
The Vanishing Sculptor by Donita K. Paul
Zom-B by Darren Shan
Suspicion by Joseph Finder