Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers (2 page)

BOOK: Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers
8.73Mb size Format: txt, pdf, ePub

Chapter 6
is a little exceptional. It covers pattern recognition algorithms, which sneak into the list of great computer science ideas despite violating the very first criterion: that ordinary computer users must use them every day. Pattern recognition is the class of techniques whereby computers recognize highly variable information, such as handwriting, speech, and faces. In fact, in the first decade of the 21st century, most everyday computing did not use these techniques. But as I write these words in 2011, the importance of pattern recognition is increasing rapidly: mobile devices with small on-screen keyboards need automatic correction, tablet devices must recognize handwritten input, and all these devices (especially smartphones) are becoming increasingly voice-activated. Some websites even use pattern recognition to determine what kind of advertisements to display to their users. In addition, I have a personal bias toward pattern recognition, which is my own area of research. So
chapter 6
describes three of the most interesting and successful pattern recognition techniques: nearest-neighbor classifiers, decision trees, and neural networks.

Compression algorithms, discussed in
chapter 7
, form another set of great ideas that help transform a computer into a genius at our fingertips. Computer users do sometimes apply compression directly, perhaps to save space on a disk or to reduce the size of a photo before e-mailing it. But compression is used even more often under the covers: without us being aware of it, our downloads or uploads may be compressed to save bandwidth, and data centers often compress customers' data to reduce costs. That 5 GB of space that your e-mail provider allows you probably occupies significantly less than 5 GB of the provider's storage!

Chapter 8
covers some of the fundamental algorithms underlying databases. The chapter emphasizes the clever techniques employed to achieve
consistency
—meaning that the relationships in a database never contradict each other. Without these ingenious techniques, most of our online life (including online shopping and interacting with social networks like Facebook) would collapse in a jumble of computer errors. This chapter explains what the problem of consistency really is and how computer scientists solve it without sacrificing the formidable efficiency we expect from online systems.

In
chapter 9
, we learn about one of the indisputable gems of theoretical computer science: digital signatures. The ability to “sign” an electronic document digitally seems impossible at first glance. Surely, you might think, any such signature must consist of digital information, which can be copied effortlessly by anyone wishing to forge the signature. The resolution of this paradox is one of the most remarkable achievements of computer science.

We take a completely different tack in
chapter 10
: instead of describing a great algorithm that already exists, we will learn about an algorithm that
would
be great if it existed. Astonishingly, we will discover that this particular great algorithm is impossible. This establishes some absolute limits on the power of computers to solve problems, and we will briefly discuss the implications of this result for philosophy and biology.

In the conclusion, we will draw together some common threads from the great algorithms and spend a little time speculating about what the future might hold. Are there more great algorithms out there or have we already found them all?

This is a good time to mention a caveat about the book's style. It's essential for any scientific writing to acknowledge sources clearly, but citations break up the flow of the text and give it an academic flavor. As readability and accessibility are top priorities for this book, there are no citations in the main body of the text. All sources are, however, clearly identified—often with amplifying comments—in the “Sources and Further Reading” section at the end of the book. This section also points to additional material that interested readers can use to find out more about the great algorithms of computer science.

While I'm dealing with caveats, I should also mention that a small amount of poetic license was taken with the book's title. Our
Nine Algorithms That Changed the Future
are—without a doubt—revolutionary, but are there exactly nine of them? This is debatable, and depends on exactly what gets counted as a separate algorithm. So let's see where the “nine” comes from. Excluding the introduction and conclusion, there are nine chapters in the book, each covering algorithms that have revolutionized a different type of computational task, such as cryptography, compression, or pattern recognition. Thus, the “Nine Algorithms” of the book's title really refer to nine classes of algorithms for tackling these nine computational tasks.

WHY SHOULD WE CARE ABOUT THE GREAT ALGORITHMS?

Hopefully, this quick summary of the fascinating ideas to come has left you eager to dive in and find out how they really work. But you may still be wondering: what is the ultimate goal here? So let me make some brief remarks about the true purpose of this book. It is definitely not a how-to manual. After reading the book, you won't be an expert on computer security or artificial intelligence or anything else. It's true that you may pick up some useful skills. For example: you'll be more aware of how to check the credentials of “secure” websites and “signed” software packages; you'll be able to choose judiciously between lossy and lossless compression for different tasks; and you may be able to use search engines more efficiently by understanding some aspects of their indexing and ranking techniques.

These, however, are relatively minor bonuses compared to the book's true objective. After reading the book, you
won't
be a vastly more skilled computer user. But you
will
have a much deeper appreciation of the beauty of the ideas you are constantly using, day in and day out, on all your computing devices.

Why is this a good thing? Let me argue by analogy. I am definitely not an expert on astronomy—in fact, I'm rather ignorant on the topic and wish I knew more. But every time I glance at the night sky, the small amount of astronomy that I do know enhances my enjoyment of this experience. Somehow, my understanding of what I am looking at leads to a feeling of contentment and wonder. It is my fervent hope that after reading this book, you will occasionally achieve this same sense of contentment and wonder while using a computer. You'll have a true appreciation of the most ubiquitous, inscrutable black box of our times: your personal computer, the genius at your fingertips.

2

Search Engine Indexing: Finding Needles in the World's Biggest Haystack

Now, Huck, where we're a-standing you could touch that hole I got out of with a fishing-pole. See if you can find it.

—M
ARK
T
WAIN
,
Tom Sawyer

Search engines have a profound effect on our lives. Most of us issue search queries many times a day, yet we rarely stop to wonder just how this remarkable tool can possibly work. The vast amount of information available and the speed and quality of the results have come to seem so normal that we actually get frustrated if a question can't be answered within a few seconds. We tend to forget that every successful web search extracts a needle from the world's largest haystack: the World Wide Web.

In fact, the superb service provided by search engines is not just the result of throwing a large amount of fancy technology at the problem. Yes, each of the major search engine companies runs an international network of enormous data centers, containing thousands of server computers and advanced networking equipment. But all of this hardware would be useless without the clever algorithms needed to organize and retrieve the information we request. So in this chapter and the one that follows, we'll investigate some of the algorithmic gems that are put to work for us every time we do a web search. As we'll soon see, two of the main tasks for a search engine are
matching
and
ranking.
This chapter covers a clever matching technique: the metaword trick. In the next chapter, we turn to the ranking task and examine Google's celebrated PageRank algorithm.

MATCHING AND RANKING

It will be helpful to begin with a high-level view of what happens when you issue a web search query. As already mentioned, there will be two main phases: matching and ranking. In practice, search engines combine matching and ranking into a single process for efficiency. But the two phases are conceptually separate, so we'll assume that matching is completed before ranking begins. The figure above shows an example, where the query is “London bus timetable.” The matching phase answers the question “which web pages match my query?”—in this case, all pages that mention London bus timetables.

The two phases of web search: matching and ranking. There can be thousands or millions of matches after the first (matching) phase, and these must be sorted by relevance in the second (ranking) stage.

But many queries on real search engines have hundreds, thousands, or even millions of hits. And the users of search engines generally prefer to look through only a handful of results, perhaps five or ten at the most. Therefore, a search engine must be capable of picking the best few from a very large number of hits. A good search engine will not only pick out the best few hits, but display them in the most useful order—with the most suitable page listed first, then the next most suitable, and so on.

The task of picking out the best few hits in the right order is called “ranking.” This is the crucial second phase that follows the initial matching phase. In the cutthroat world of the search industry, search engines live or die by the quality of their ranking systems. Back in 2002, the market share of the top three search engines in the United States was approximately equal, with Google, Yahoo, and MSN each having just under 30% of U.S. searches. (MSN was later rebranded first as Live Search and then as Bing.) In the next few years, Google made a dramatic improvement in its market share, crushing Yahoo and MSN down to under 20% each. It is widely believed that the phenomenal rise of Google to the top of the search industry was due to its ranking algorithms. So it's no exaggeration to say that search engines live or die according to the quality of their ranking algorithms. But as already mentioned, we'll be discussing ranking algorithms in the next chapter. For now, let's focus on the matching phase.?

ALTAVISTA: THE FIRST WEB-SCALE MATCHING ALGORITHM

Where does our story of search engine matching algorithms begin? An obvious—but wrong—answer would be to start with Google, the greatest technology success story of the early 21st century. Indeed, the story of Google's beginnings as the Ph.D. project of two graduate students at Stanford University is both heartwarming and impressive. It was in 1998 that Larry Page and Sergey Brin assembled a ragtag bunch of computer hardware into a new type of search engine. Less than 10 years later, their company had become the greatest digital giant to rise in the internet age.

But the idea of web search had already been around for several years. Among the earliest commercial offerings were Infoseek and Lycos (both launched in 1994), and AltaVista, which launched its search engine in 1995. For a few years in the mid-1990s, AltaVista was the king of the search engines. I was a graduate student in computer science during this period, and I have clear memories of being wowed by the comprehensiveness of AltaVista's results. For the first time, a search engine had fully indexed all of the text on every page of the web—and, even better, results were returned in the blink of an eye. Our journey toward understanding this sensational technological breakthrough begins with a (literally) age-old concept: indexing.

PLAIN OLD INDEXING

The concept of an
index
is the most fundamental idea behind any search engine. But search engines did not invent indexes: in fact, the idea of indexing is almost as old as writing itself. For example, archaeologists have discovered a 5000-year-old Babylonian temple library that cataloged its cuneiform tablets by subject. So indexing has a pretty good claim to being the oldest useful idea in computer science.

These days, the word “index” usually refers to a section at the end of a reference book. All of the concepts you might want to look up are listed in a fixed order (usually alphabetical), and under each concept is a list of locations (usually page numbers) where that concept is referenced. So a book on animals might have an index entry that looks like “cheetah 124, 156,” which means that the word “cheetah” appears on pages 124 and 156. (As a mildly amusing exercise, you could look up the word “index” in the index of this book. You should be brought back to this very page.)

The index for a web search engine works the same way as a book's index. The “pages” of the book are now web pages on the World Wide Web, and search engines assign a different page number to every single web page on the web. (Yes, there are a lot of pages—many billions at the last count—but computers are great at dealing with large numbers.) The figure above gives an example that will make this more concrete. Imagine that the World Wide Web consisted of only the 3 short web pages shown there, where the pages have been assigned page numbers 1,2, and 3.

A simple index with page numbers.

A computer could build up an index of these three web pages by first making a list of all the words that appear in any page and then sorting that list in alphabetical order. Let's call the result a
word
lisf—in this particular case it would be “a, cat, dog, mat, on, sat, stood, the, while.” Then the computer would run through the pages word by word. For each word, it would make a note of the current page number next to the corresponding word in the word list. The final result is shown in the figure above. You can see immediately, for example, that the word “cat” occurs in pages 1 and 3, but not in page 2. And the word “while” appears only in page 3.

With this very simple approach, a search engine can already provide the answers to a lot of simple queries. For example, suppose you enter the query cat. The search engine can quickly jump to the entry for cat in the word list. (Because the word list is in alphabetical order, a computer can quickly find any entry, just like a human can quickly find a word in a dictionary.) And once it finds the entry for cat, the search engine can just give you the list of pages at that entry—in this case, 1 and 3. Modern search engines format the results nicely, with little snippets from each of the pages that were returned, but we will mostly ignore details like that and concentrate on how search engines know which page numbers are “hits” for the query you entered.

As another very simple example, let's check the procedure for the query dog. In this case, the search engine quickly finds the entry for dog and returns the hits 2 and 3. But how about a multiple-word query, like cat dog? This means you are looking for pages that contain both of the words “cat” and “dog.” Again, this is pretty easy for the search engine to do with the existing index. It first looks up the two words individually to find which pages they occur on as individual words. This gives the answer 1, 3 for “cat” and 2, 3 for “dog.” Then, the computer can quickly scan along both of the lists of hits, looking for any page numbers that occur on both lists. In this case, pages 1 and 2 are rejected, but page 3 occurs in both lists, so the final answer is a single hit on page 3. And a very similar strategy works for queries with more than two words. For example, the query cat the sat returns pages 1 and 3 as hits, since they are the common elements of the lists for “cat” (1, 3), “the” (1, 2, 3), and “sat” (1, 3).

So far, it sounds like building a search engine would be pretty easy. The simplest possible indexing technology seems to work just fine, even for multiword queries. Unfortunately, it turns out that this simple approach is completely inadequate for modern search engines. There are quite a few reasons for this, but for now we will concentrate on just one of the problems. This is the problem of how to do
phrase queries.
A phrase query is a query that searches for an exact phrase, rather than just the occurrence of some words anywhere on a page. On most search engines, phrase queries are entered using quotation marks. So, for example, the query “cat sat” has a very different meaning to the query cat sat. The query cat sat looks for pages that contain the two words “cat” and “sat” anywhere, in any order; whereas the query “cat sat” looks for pages that contain the word “cat” immediately followed by the word “sat.” In our simple three-page example, cat sat results in hits on pages 1 and 3, but “cat sat” returns only one hit, on page 1.

How can a search engine efficiently perform a phrase query? Let's stick with the “cat sat” example. It seems like the first step should be to do the same thing as for the ordinary multiword query cat sat: retrieve from the word list the list of pages that each word occurs on, in this case 1, 3 for “cat,” and the same thing—1, 3—for “sat.” But here the search engine is stuck. It knows for sure that both words occur on both pages 1 and 3, but there is no way of telling whether the words occur next to each other in the right order. You might think that at this point the search engine could go back and look at the original web pages to see if the exact phrase is there or not. This would indeed be a possible solution, but it is very, very inefficient. It requires reading through the entire contents of every web page that
might
contain the phrase, and there could be a huge number of such pages. Remember, we are dealing with an extremely small example of only three pages here, but a real search engine has to give correct results on tens of billions of web pages.

THE WORD-LOCATION TRICK

The solution to this problem is the first really ingenious idea that makes modern search engines work well: the index should not store only page
numbers
, but also
locations
within the pages. These locations are nothing mysterious: they just indicate the position of a word within its page. So the third word has location 3, the 29th word has location 29, and so on. Our entire three-page data set is shown in the top figure on the next page, with the word locations added. Below that is the index that results from storing both page numbers and word locations. We'll call this way of building an index the “word-location trick.” Let's look at a couple of examples to make sure we understand the word-location trick. The first line of the index is “a 3-5.” This means the word “a” occurs exactly once in the data set, on page 3, and it is the fifth word on that page. The longest line of the index is “the 1-1 1-5 2-1 2-5 3-1.” This line lets you know the exact locations of all occurrences of the word “the” in the data set. It occurs twice on page 1 (at locations 1 and 5), twice on page 2 (at locations 1 and 5), and once on page 3 (at location 1).

Other books

Genesis by Keith R. A. DeCandido
Snowbound With the Sheriff by Lauri Robinson
Fuckowski - Memorias de un ingeniero by Alfredo de Hoces García-Galán
Forbidden Fruit by Lee, Anna
Nostalgia by M.G. Vassanji
Restless Hearts by Mona Ingram
Penthouse Prince by Nelson, Virginia