The Half-Life of Facts
Authors: Samuel Arbesman
In addition to automated discoveries, there are now even automated scientists, software capable of detecting regularities in data and making more abstract discoveries. A computer program known as Eureqa was developed by Mike Schmidt, a graduate student at Cornell University (and current president of Nutonian, Inc.), and Hod Lipson, a professor at Cornell. It does something a bit different from the other projects mentioned already: Given lots of data, it attempts to find meaning in an otherwise meaningless jumble of facts.
Eureqa takes in a vast quantity of data points. Let’s say you’re studying a bridge and trying to understand why it wobbles. Or an ecosystem, and how the relative amounts of predators and prey change over time. You dump all the data you’ve collected into Eureqa—how many predators there are on each day, as well as the quantity of prey, for example—and it attempts to find meaning.
Eureqa does this by using a simple technique known as evolutionary programming; given enough computing power, this technique is very powerful. Eureqa randomly generates a large variety of equations that could conceivably explain relationships between the changes in data. For example, it will create an equation that attempts to mathematically combine the inputs of your system and show how they can yield the outputs. Of course, if it’s given a random equation, the odds are very good that it will have absolutely no insight into the underlying phenomena it is trying to explain. Instead of explaining the data, it will spit out gibberish.
But what is randomly generated doesn’t have to be satisfactory. Instead, a population of random equations can be evolved. Just as biological evolution can result in a solution—an organism that is well adapted to its environment—the same thing can be done with digital organisms. In this case, the equations are allowed to reproduce, mutate, swap bits of their formulas, and more. And this is all in the service of explaining the data set. The better the equations adhere to the data, the more they are allowed to reproduce.
Doing this over and over results in a population of good and fit equations, formulas that are far cries from the initial, randomly generated ones. Eureqa can even yield equations that can actually generate findings as complex as the concept of the conservation of energy, one of the foundations of thermodynamics.
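The loop described above (random equations, scoring against the data, reproduction with mutation and swapped formula pieces) can be sketched in a few lines of Python. This is a toy illustration only, not Eureqa's actual algorithm: the expression format, the function names, the population size, and the size cap on formulas are all assumptions made for the example.

```python
import random

# Hypothetical toy setup: formulas are nested tuples such as ("+", "x", 2).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def random_expr(rng, depth):
    """Build a random formula over the variable x and small integer constants."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else rng.randint(-2, 2)
    op = rng.choice(list(OPS))
    return (op, random_expr(rng, depth - 1), random_expr(rng, depth - 1))

def evaluate(expr, x):
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if expr == "x" else expr  # variable or numeric constant

def size(expr):
    if isinstance(expr, tuple):
        return 1 + size(expr[1]) + size(expr[2])
    return 1

def error(expr, data):
    """Mean squared error: lower means the formula adheres better to the data."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data) / len(data)

def mutate(expr, rng):
    """Replace one randomly chosen piece of the formula with a fresh random piece."""
    if not isinstance(expr, tuple) or rng.random() < 0.3:
        return random_expr(rng, 2)
    op, left, right = expr
    if rng.random() < 0.5:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))

def random_subtree(expr, rng):
    while isinstance(expr, tuple) and rng.random() < 0.7:
        expr = rng.choice(expr[1:])
    return expr

def crossover(a, b, rng):
    """Copy formula a, swapping in a piece of formula b at a random position."""
    if not isinstance(a, tuple) or rng.random() < 0.3:
        return random_subtree(b, rng)
    op, left, right = a
    if rng.random() < 0.5:
        return (op, crossover(left, b, rng), right)
    return (op, left, crossover(right, b, rng))

def evolve(data, pop_size=60, generations=40, max_nodes=40, seed=0):
    rng = random.Random(seed)
    population = [random_expr(rng, 3) for _ in range(pop_size)]
    history = []  # best error seen in each generation
    for _ in range(generations):
        population.sort(key=lambda e: error(e, data))
        history.append(error(population[0], data))
        survivors = population[: pop_size // 2]  # the fitter half reproduces
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.choice(survivors), rng.choice(survivors)
            child = mutate(crossover(a, b, rng), rng)
            if size(child) <= max_nodes:  # assumed cap to stop formulas bloating
                children.append(child)
        population = survivors + children
    best = min(population, key=lambda e: error(e, data))
    return best, history + [error(best, data)]

# Made-up data generated from y = x*x + x, which the search tries to explain.
data = [(x, x * x + x) for x in range(-5, 6)]
best, history = evolve(data)
print(best, history[-1])
```

Because the fitter half of each generation is carried over unchanged, the best error can only stay the same or shrink from one generation to the next. Whether the search recovers the exact formula depends on the random seed and the settings.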
In the case of these automated-discovery programs, the more knowledge we have available, the more raw materials we have for these programs. The more data, the more new facts these programs can in turn reveal.
So it’s important for us to understand how knowledge is maintained if we want to make sure we can have the maximum amount of data for these automated-discovery programs. Specifically, is most knowledge actually preserved? Or are the raw materials for hidden knowledge that we have available only a remnant of what we might truly know?
. . .
THE Middle Ages, far from being the Dark Ages, as some of us might have been taught, was a time of science and innovation. Europeans developed medical techniques and made advances in such areas as wind energy and gunpowder.
But it was also a time for preservation. That many of the texts written in ancient times, or even in the early Middle Ages, would make it to the modern era was by no means a foregone conclusion.
As mentioned in the previous chapter, prior to the printing press manuscripts had to be copied by hand in order for information to spread.
I have a book on my shelf entitled The Book of Lost Books: An Incomplete History of All the Great Books You’ll Never Read. It’s a discussion by Stuart Kelly of books that have been lost to time, whose names we know or from which excerpts have been passed down, but whose full texts are unknown. There is a long history of such references, even going back to The Book of the Wars of the Lord, a lost book that the Bible itself references when quoting a short description of the location of ancient tribal boundaries in Numbers 21:14–15.
The Book of Lost Books is organized by author, and the list of those whose books we don’t have is astonishing: Alexander Pope, Gottfried Leibniz, William Shakespeare, Charles Dickens, Franz Kafka, Edward Gibbon, and many more. And, of course, there are many ancient and medieval writers. From Ovid and Menander to Ahmad ad-Daqiqi and Widsith the Wide-Traveled, we know of many writers whose works have not been preserved. And then there is the Venerable Bede.
The Venerable Bede was a Christian scholar and monk from England in the late seventh and early eighth centuries. In addition to being a very holy man in the eyes of the Catholic Church (he was made a doctor of the Church in 1899, a title indicating a person’s importance to theological and doctrinal thought, was given his Venerable title only a few decades after his death, and was later canonized), he was also a man of history and science. In fact, he has become known as the father of English history due to his History of the English Church and People. He also wrote books about mathematics, such as De temporum ratione, which discusses how to quickly perform mathematical calculations by hand.
But for all that we have of Bede’s work, even more of it no longer exists. Bede wrote a great deal of English poetry, nearly all of which has been lost. However, we have most of the Venerable Bede’s important scholarly works. In order for knowledge to be discovered, and to be recombined in novel ways for new facts to be unearthed, it first needs to be preserved. For every one of Bede’s books, how many books of others do we not even have a memory? How much knowledge is fated to remain forever hidden?
A Cornell professor of earth and atmospheric sciences named John Cisne decided to tackle this question. Using the works of the Venerable Bede as a guide, he employed the same technique I did in chapter 5: He brought biology to the world of handwritten manuscripts. While I examined mutations and evolution, Cisne wanted to see how often medieval manuscripts go extinct while being copied. So he used population and demographic models to understand how Bede’s works spread.
Just as organisms reproduce and die out, so do manuscripts, in a fashion: They “reproduce” by being copied and can die by being lost or destroyed. Reproduction in this case follows a logistic curve, just like bacteria in a petri dish. Since books cannot grow without bound, a logistic curve is a more realistic function to use than a simple exponential. The logistic allows for growth in which there is a certain carrying capacity—the maximum number of copies of a book that can be made.
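The difference between the two curves can be made concrete with a short Python sketch. The numbers below (one initial copy, a carrying capacity of 100 copies, a growth rate of 0.5 per time step) are invented for illustration and are not Cisne's fitted values.

```python
import math

def exponential_copies(t, initial, rate):
    """Unbounded growth: the number of copies keeps multiplying forever."""
    return initial * math.exp(rate * t)

def logistic_copies(t, initial, capacity, rate):
    """Logistic growth: starts out exponential, then levels off at the carrying capacity."""
    return capacity / (1 + (capacity - initial) / initial * math.exp(-rate * t))

# Hypothetical parameters, chosen only to show the shape of the curves.
for t in (0, 5, 10, 20, 40):
    print(t, round(exponential_copies(t, 1, 0.5), 1), round(logistic_copies(t, 1, 100, 0.5), 1))
```

Early on the two functions track each other closely; later the exponential keeps climbing while the logistic flattens out just below the carrying capacity.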
Cisne used mathematics from the world of population biology to describe this simple state of affairs, and he was able to create a model that fit the number of the Venerable Bede’s technical books that have survived from each century. From this Cisne arrived at the likelihood that a document would be copied. Specifically, he found that documents from the Middle Ages were fifteen to thirty times more likely to be copied than destroyed.
In addition, he calculated the half-lives of these books: how long they would last before the destruction of half of the copies. He found that the half-lives were between four and nine centuries, a surprisingly long time. Cisne was able to conclude that most documents from the early Middle Ages, and perhaps even antiquity, have, in fact, survived.
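To get a feel for what such half-lives mean, here is a minimal Python sketch of simple exponential decay. It considers destruction alone and ignores recopying, so it is a simplification rather than Cisne's full model; the function names are made up for the example.

```python
import math

def surviving_fraction(years, half_life_years):
    """Fraction of copies still in existence after `years`, with no recopying."""
    return 0.5 ** (years / half_life_years)

def decay_rate(half_life_years):
    """Equivalent continuous loss rate per year: ln(2) / half-life."""
    return math.log(2) / half_life_years

# Copies made twelve centuries ago, at both ends of the reported range.
print(surviving_fraction(1200, 400))  # half-life of four centuries
print(surviving_fraction(1200, 900))  # half-life of nine centuries
```

With a half-life of four centuries, twelve centuries amount to three halvings, so one copy in eight survives. With a half-life of nine centuries, roughly 40 percent of the copies survive over the same span.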
We have certainly lost a great deal; time ravages much knowledge. But when it comes to hidden knowledge, it’s heartening to know that many facts aren’t lost to history; they can indeed be discovered.
. . .
SO we now know that facts are seldom lost. And as long as knowledge is preserved, we have the raw materials for unearthing hidden knowledge. As we’ve seen earlier in the chapter, that still doesn’t prevent much of knowledge from remaining hidden (witness everything from Mendelian genetics to the true import of clinical trials), but through modern technology, we now have computational ways of connecting and recombining disparate bits of knowledge to create new facts.
In fact, hidden knowledge and its discovery is no longer the domain of the medieval scholar or information scientist, or even of the robotic mathematician. Tools related to hidden knowledge are being created for everyone, enabling a certain renaissance in the discovery of knowledge; facts can be spread and mixed in novel ways, unburied and shown the sunlight. One of these tools is Mendeley, which is designed for the average scientist.
One of the most annoying and tedious parts of publishing scientific work is in the details—specifically, the details of formatting. Each journal has its own specific rules for fonts, the organization of the paper’s content, and, most maddening, how to format the citations. When you write a paper, you carefully format each reference specifically for the journal to which you are submitting. But woe betide the scientist whose paper is rejected, forcing her to reformat for another journal and a resubmission. And lest you be surprised by this tedium, multiple submissions are more often the rule than the exception.
Into this detail-oriented morass have stepped a number of computational tools to help deal with these issues. The most popular of these is EndNote. This is a computer program that allows references to be imported easily from scientific databases, or to be entered manually. Creating a bibliography for a specific journal becomes as simple as selecting it from a drop-down menu. Want to submit to Nature? Easy. Rejected from Nature and now aiming a paper at Proceedings of the National Academy of Sciences? This too is a simple matter, requiring little more than a single click.
But a new online tool has arrived recently, called Mendeley. In addition to simplifying reference importation, synchronizing one’s bibliography online, and many other wonderful features, it has another: social networking. Instead of the scientist simply working with the set of references they use to write their papers in isolation, it allows them to see their friends’ references; it acts as a sort of social network for scientists.
As Mendeley grows in popularity—and it seems that it’s hitting the critical mass that’s necessary for any social Web site to thrive—it allows for the collaborative exposure of knowledge that each of us individually hasn’t been aware of.
But it provides another important feature: It allows scientists to see articles that are related to ones that they’re already looking at. By automatically finding topic relationships between papers, Mendeley brings undiscovered public knowledge to the scientific masses. Scientists can now find a paper on psychology that can shed light onto network science or a math paper that can help with X-ray crystallography. In doing so it can help create new facts.
These capabilities are even being brought to the everyday user. There are a wide variety of computer tools that allow someone to collect snippets of information—quotations, references, pictures, articles, Web pages, and more—in a simple and searchable place. Some store these notes in the cloud, some on a desktop, and some even allow these notes to be shared with others.
However, one program has an ability that others lack: DEVONthink uses something called semantic and associative data processing. The other programs require searching for certain words or combinations of words. If a note has these words, it shows up; otherwise the program can’t find it. DEVONthink, however, is a bit more clever. It uses a special computational technique to analyze the entire text and find relationships between words. So if the search is for the word “house” and there is a note that is a quotation about the wonderful nature of the home, DEVONthink is likely to find such a relationship. It can also tell which notes are similar to one another, providing cognitive connections that are not always available to us, since we can’t hold thousands, or even hundreds, of notes in our minds at once. But computers can, and they can draw the connections for us, providing the substrate for new facts and bits of knowledge.
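One simple way a program could link “house” to “home” is to count which words tend to appear together in the same notes and then widen a search to include a word's strongest companions. The Python sketch below illustrates that general idea only. DEVONthink's actual method is not described here, and the notes, stopword list, and function names are all invented for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Small assumed stopword list; a real tool would use a far fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "in", "to", "for", "under"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

def build_associations(notes):
    """Count how often each pair of words appears together in the same note."""
    assoc = defaultdict(Counter)
    for note in notes:
        for a, b in combinations(sorted(set(tokenize(note))), 2):
            assoc[a][b] += 1
            assoc[b][a] += 1
    return assoc

def search(query, notes, assoc, expand=3):
    """Rank notes by overlap with the query words plus their strongest associates."""
    terms = set(tokenize(query))
    for word in list(terms):
        terms.update(w for w, _ in assoc[word].most_common(expand))
    scored = [(len(terms & set(tokenize(note))), note) for note in notes]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [note for score, note in ranked if score > 0]

# Made-up notes for illustration.
notes = [
    "every house should feel like a home",
    "our house became a home",
    "the wonderful nature of the home",
    "sewage and waste flow under the city",
]
assoc = build_associations(notes)
print(search("house", notes, assoc))
```

In this toy collection, “home” appears alongside “house” in two notes, so a search for “house” also surfaces the note about the wonderful nature of the home even though it never uses the word “house”. The note about sewage shares no associated words and is left out.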
Steven Johnson, a writer whose books rely on the connectedness of disparate ideas, uses DEVONthink a great deal, and he reports that it has benefited him greatly. In an essay praising this tool’s powers, he gives an example:
This can create almost lyrical connections between ideas. I’m now working on a project that involves the history of the London sewers. The other day I ran a search that included the word “sewage” several times. Because the software knows the word “waste” is often used alongside “sewage” it directed me to a quote that explained the way bones evolved in vertebrate bodies: by repurposing the calcium waste products created by the metabolism of cells.
That might seem like an errant result, but it sent me off on a long and fruitful tangent into the way complex systems—whether cities or bodies—find productive uses for the waste they create.
New facts are all around us. And due to the algorithmic properties of modern technology, we now have the possibility of discovering them.
. . .
DIGGING up hidden knowledge is now far from an impossibility, or even from being solely the domain of the specialist; it has become eminently possible and easy. Knowledge rarely gets lost or destroyed any longer, and even in the past that seems to have happened less often than we used to believe. Facts are now commonly digitized, and are ripe for being combined and turned into new facts. We are in a golden age of revealing hidden knowledge.
When this happens it can sometimes lead to drastic, sudden changes in what we know. These changes—when what we know is fundamentally overhauled in an abrupt and dramatic way—are also subject to quantitative regularity, and are the subject of the next chapter.
IN 1750, Thomas Wright, a British astronomer, published a diagram in his book An Original Theory or New Hypothesis of the Universe. This diagram showed a whole host of stars. But there was more in the diagram than just stars; the stars weren’t alone. Each star was surrounded by a small cloud of orbits, an entire planetary system. What Wright was clearly implying was that our sun was not particularly special: Other planets orbited around every star, much like the ones in our solar system.