Data and Goliath

Authors: Bruce Schneier

The primary difference between a computer and a dog is that the computer communicates
with other people and the dog does not—at least, not well enough to matter. Computer
algorithms are written by people, and their output is used by people. And when we
think of computer algorithms surveilling us or analyzing our personal data, we need
to think about the people behind those algorithms. Whether or not anyone actually
looks at our data, the very facts that (1) they could, and (2) they guide the algorithms
that do, make it surveillance.

You know this is true. If you believed what Clapper said, then you wouldn’t object
to a camera in your bedroom—as long as there were rules governing when the police
could look at the footage. You wouldn’t object to being compelled to wear a
government-issued listening device 24/7, as long as your
bureaucratic monitors followed those same rules. If you do object, it’s because you
realize that the privacy harm comes from the automatic collection and algorithmic
analysis, regardless of whether or not a person is directly involved in the process.


We all have experience with identifying ourselves on the Internet. Some websites tie
your online identity to your real identity: banks, websites for some government services,
and so on. Some tie your online identity to a payment system—generally credit cards—and
others to your bank account or cell phone. Some websites don’t care about your real
identity, and allow you to maintain a unique username just for that site. Many more
sites could work that way. Apple’s iTunes, for example, could be so designed that
it doesn’t know who you really are, just that you’re authorized to access a particular
set of audio and video files.

The means to perform identification and authentication include passwords, biometrics,
and tokens. Many people, myself included, have written extensively about the various
systems and their relative strengths and weaknesses. I’ll spare you the details; the
takeaway is that none of these systems is perfect, but all are generally good enough
for their applications. Authentication basically works.

It works because the people involved want to be identified. You want to convince Hotmail
that it’s your account; you want to convince your bank that it’s your money. And while
you might not want AT&T to be able to tie all the Internet browsing you do on your
smartphone to your identity, you do want the phone network to transmit your calls
to you. All of these systems are trying to answer the following question: “Is this
the person she claims to be?” That is why it’s so easy to gather data about us online;
most of it comes from sources where we’ve intentionally identified ourselves.

Attribution of anonymous activity to a particular person is a much harder problem.
In this case, the person doesn’t necessarily want to be identified. He is making an
anonymous comment on a website. Or he’s launching a cyberattack against your
network. In such a case, the systems have to answer the harder
question: “Who is this?”

At a very basic level, we are unable to identify individual pieces of hardware and
software when a malicious adversary is trying to evade detection. We can’t attach
identifying information to data packets zipping around the Internet. We can’t verify
the identity of a person sitting in front of a random keyboard somewhere on the planet.
Solving this problem isn’t a matter of overcoming some engineering challenges; this
inability is inherent in how the Internet works.

This means that we can’t conclusively figure out who left an anonymous comment on
a blog. (It could have been posted using a public computer, or a shared IP address.)
We can’t conclusively identify the sender of an e-mail. (Those headers can be spoofed;
spammers do it all the time.) We can’t conclusively determine who was behind a series
of failed log-ins to your bank account, or a cyberattack against our nation’s infrastructure.

We can’t even be sure whether a particular attack was criminal or military in origin,
or which government was behind it. The 2007 cyberattack against Estonia, often talked
about as the first cyberwar, was conducted either by a group associated with the Russian
government or by a disaffected 22-year-old.

When we do manage to attribute an attack—be it to a mischievous high schooler, a bank
robber, or a team of state-sanctioned cyberwarriors—we usually do so after extensive
forensic analysis or because the attacker gave himself away in some other manner.
It took analysts months to identify China as the definitive source of the New York Times
attacks in 2012, and we didn’t know for sure who was behind Stuxnet until the US
admitted it. This is a very difficult problem, and one we’re not likely to solve anytime
soon.

Over the years, there have been many proposals to eliminate anonymity on the Internet.
The idea is that if everything anyone did was attributable—if all actions could be
traced to their source—then it would be easy to identify criminals, spammers, stalkers,
and Internet trolls. Basically, everyone would get the Internet equivalent of a driver’s
license.

This is an impossible goal. First of all, we don’t have the real-world infrastructure
to provide Internet user credentials based on other identification
systems—passports, national identity cards, driver’s licenses, whatever—which is what
would be needed. We certainly don’t have the infrastructure to do that globally.

Even if we did, it would be impossible to make it secure. Every one of our existing
identity systems is already subverted by teenagers trying to buy alcohol—and that’s
a face-to-face transaction. A new one isn’t going to be any better. And even if it
were, it still wouldn’t work. It is always possible to set up an anonymity service
on top of an identity system. This fact already annoys countries like China that want
to identify everyone using the Internet on their territory.

This might seem to contradict what I wrote in Chapter 3—that it is easy to identify
people on the Internet who are trying to stay anonymous. This can be done if you have
captured enough data streams to correlate and are willing to put in the investigative
time. The only way to effectively reduce anonymity on the Internet is through massive
surveillance. The examples from Chapter 3 all relied on piecing together different
clues, and all took time. It’s much harder to trace a single Internet connection back
to its source: a single e-mail, a single web connection, a single attack.

The open question is whether the process of identification through correlation and
analysis can be automated. Can we build computer systems smart enough to analyze surveillance
information to identify individual people, as in the examples we saw in Chapter 3,
on a large-scale basis? Not yet, but maybe soon.

It’s being worked on. Countries like China and Russia want automatic systems to ferret
out dissident voices on the Internet. The entertainment industry wants similar systems
to identify movie and music pirates. And the US government wants the same systems
to identify people and organizations it feels are threats, ranging from lone individuals
to foreign governments.

In 2012, US Secretary of Defense Leon Panetta said publicly that the US has “made
significant advances in . . . identifying the origins” of cyberattacks. My guess is
that we have not developed some new science or engineering that fundamentally alters
the balance between Internet identifiability and anonymity. Instead, it’s more likely
that we have penetrated our adversaries’ networks so deeply that we can spy on and
understand their planning.

Of course, anonymity cuts both ways, since it can also protect hate speech and criminal
activity. But while identification can be important, anonymity is valuable for all
the reasons I’ve discussed in this chapter. It protects privacy, it empowers individuals,
and it’s fundamental to liberty.



Our security is important. Crime, terrorism, and foreign aggression are threats both
in and out of cyberspace. They’re not the only threats in town, though, and I just
spent the last four chapters delineating others.

We need to defend against a panoply of threats, and this is where we start having
problems. Ignoring the risk of overaggressive police or government tyranny in an effort
to protect ourselves from terrorism makes as little sense as ignoring the risk of
terrorism in an effort to protect ourselves from police overreach.

Unfortunately, as a society we tend to focus on only one threat at a time and minimize
the others. Even worse, we tend to focus on rare and spectacular threats and ignore
the more frequent and pedestrian ones. So we fear flying more than driving, even though
the former is much safer. Or we fear terrorists more than the police, even though
in the US you’re nine times more likely to be killed by a police officer than by a
terrorist.

We let our fears get in the way of smart security. Defending against some threats
at the expense of others is a failing strategy, and we need to find ways of balancing
them all.


The NSA repeatedly uses a connect-the-dots metaphor to justify its surveillance activities.
Again and again—after 9/11, after the Underwear Bomber, after the Boston Marathon
bombings—government is criticized for not connecting the dots.

However, this is a terribly misleading metaphor. Connecting the dots in a coloring
book is easy, because they’re all numbered and visible. In real life, the dots can
only be recognized after the fact.

That doesn’t stop us from demanding to know why the authorities couldn’t connect the
dots. The warning signs left by the Fort Hood shooter, the Boston Marathon bombers,
and the Isla Vista shooter look obvious in hindsight. Nassim Taleb, an expert on risk
engineering, calls this tendency the “narrative fallacy.” Humans are natural storytellers,
and the world of stories is much more tidy, predictable, and coherent than reality.
Millions of people behave strangely enough to attract the FBI’s notice, and almost
all of them are harmless. The TSA’s no-fly list has over 20,000 people on it. The
Terrorist Identities Datamart Environment, also known as the watch list, has 680,000,
40% of whom have “no recognized terrorist group affiliation.”

Data mining is offered as the technique that will enable us to connect those dots.
But while corporations are successfully mining our personal data in order to target
advertising, detect financial fraud, and perform other tasks, three critical issues
make data mining an inappropriate tool for finding terrorists.

The first, and most important, issue is error rates. For advertising, data mining
can be successful even with a large error rate, but finding terrorists requires a
much higher degree of accuracy than data-mining systems can possibly provide.

Data mining works best when you’re searching for a well-defined profile, when there
are a reasonable number of events per year, and when the cost of false alarms is low.
Detecting credit card fraud is one of data mining’s security success stories: all
credit card companies mine their transaction databases for spending patterns that
indicate a stolen card. There are over a billion active credit cards in circulation
in the United States, and nearly 8% of those are fraudulently used each year. Many
credit card thefts share
a pattern—purchases in locations not normally frequented by the cardholder, and purchases
of travel, luxury goods, and easily fenced items—and in many cases data-mining systems
can minimize the losses by preventing fraudulent transactions. The only cost of a
false alarm is a phone call to the cardholder asking her to verify a couple of her
purchases.

Similarly, the IRS uses data mining to identify tax evaders, the police use it to
predict crime hot spots, and banks use it to predict loan defaults. These applications
have had mixed success, based on the data and the application, but they’re all within
the scope of what data mining can accomplish.

Terrorist plots are different, mostly because whereas fraud is common, terrorist attacks
are very rare. This means that even highly accurate terrorism prediction systems will
be so flooded with false alarms that they will be useless.

The reason lies in the mathematics of detection. All detection systems have errors,
and system designers can tune them to minimize either false positives or false negatives.
In a terrorist-detection system, a false positive occurs when the system mistakenly
identifies something harmless as a threat. A false negative occurs when the system
misses an actual attack. Depending on how you “tune” your detection system, you can
increase the number of false positives to assure you are less likely to miss an attack,
or you can reduce the number of false positives at the expense of missing attacks.
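
The tuning trade-off described above can be sketched in a few lines of Python. This is only an illustration: the suspicion scores, the thresholds, and the function name `count_errors` are hypothetical and not taken from any real detection system. Each item gets a score between 0 and 1, and the system raises an alert whenever the score meets the threshold.

```python
# Hypothetical suspicion scores, invented purely for illustration.
harmless_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.60, 0.70]
threat_scores = [0.40, 0.65, 0.90]


def count_errors(threshold):
    """Return (false_positives, false_negatives) for a given alert threshold."""
    # A false positive is a harmless item that still triggers an alert.
    false_positives = sum(1 for s in harmless_scores if s >= threshold)
    # A false negative is a real threat that stays below the alert threshold.
    false_negatives = sum(1 for s in threat_scores if s < threshold)
    return false_positives, false_negatives


for threshold in (0.3, 0.6, 0.8):
    fp, fn = count_errors(threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

With these made-up scores, raising the threshold lowers the count of false positives while the count of missed threats rises; no setting brings both to zero.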

Because terrorist attacks are so rare, false positives completely overwhelm the system,
no matter how well you tune. And I mean completely: millions of people will be
falsely accused for every real terrorist plot the system
finds, if it ever finds any.
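
The arithmetic behind this claim can be made concrete with a rough sketch. Every number here is an assumption chosen for illustration, not a figure from the text: a monitored population of 300 million, a single real plotter, and a system that catches 99% of real threats while wrongly flagging only 1% of innocents. The function name `alarm_breakdown` is likewise hypothetical.

```python
def alarm_breakdown(population, real_threats, true_positive_rate, false_positive_rate):
    """Return expected (true_alarms, false_alarms, precision) for a detector."""
    innocents = population - real_threats
    # Expected number of real threats the system correctly flags.
    true_alarms = real_threats * true_positive_rate
    # Expected number of innocents the system wrongly flags.
    false_alarms = innocents * false_positive_rate
    # Fraction of all alarms that point at a real threat.
    precision = true_alarms / (true_alarms + false_alarms)
    return true_alarms, false_alarms, precision


# All inputs below are illustrative assumptions.
true_alarms, false_alarms, precision = alarm_breakdown(
    population=300_000_000,
    real_threats=1,
    true_positive_rate=0.99,
    false_positive_rate=0.01,
)
print(f"expected true alarms:  {true_alarms:.2f}")
print(f"expected false alarms: {false_alarms:,.0f}")
print(f"chance a given alarm is real: {precision:.8f}")
```

Under these assumptions the system raises roughly three million false alarms for about one real detection. Making the assumed false positive rate a hundred times smaller would still leave tens of thousands of innocents flagged for each real threat.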

We might be able to deal with all of the innocents being flagged by the system if
the cost of false positives were minor. Think about the full-body scanners at airports.
Those alert all the time when scanning people. But a TSA officer can easily check
for a false alarm with a simple pat-down. This doesn’t work for a more general data-based
terrorism-detection system. Each alert requires a lengthy investigation to determine
whether it’s real or not. That takes time and money, and prevents intelligence officers
from doing other productive work. Or, more pithily, when you’re watching everything,
you’re not seeing anything.

The US intelligence community also likens finding a terrorist plot to looking for
a needle in a haystack. And, as former NSA director General Keith Alexander said,
“you need the haystack to find the needle.” That statement perfectly illustrates the
problem with mass surveillance and bulk collection. When you’re looking for the needle,
the last thing you want to do is pile lots more hay on it. More specifically, there
is no scientific rationale for believing that adding irrelevant data about innocent
people makes it easier to find a terrorist attack, and lots of evidence that it does
not. You might be adding slightly more signal, but you’re also adding much more noise.
And despite the NSA’s “collect it all” mentality, its own documents bear this out.
The military intelligence community even talks about the problem of “drinking from
a fire hose”: having so much irrelevant data that it’s impossible to find the important
bits.
