Superintelligence: Paths, Dangers, Strategies
Author: Nick Bostrom
Another issue in coding the goal “Maximize the realization of the values described in the envelope” is that even if all the correct values were described in a letter, and even if the AI’s motivation system were successfully keyed to this source, the AI might not interpret the descriptions the way we intended. This would create a risk of perverse instantiation, as discussed in Chapter 8.
To clarify, the difficulty here is not so much how to ensure that the AI can understand human intentions. A superintelligence should easily develop such understanding. Rather, the difficulty is ensuring that the AI will be motivated to pursue the described values in the way we intended. This is not guaranteed by the AI’s ability to understand our intentions: an AI could know exactly what we meant and yet be indifferent to that interpretation of our words (being motivated instead by some other interpretation of the words or being indifferent to our words altogether).
The difficulty is compounded by the desideratum that, for reasons of safety, the correct motivation should ideally be installed in the seed AI before it becomes capable of fully representing human concepts or understanding human intentions. This requires that somehow a cognitive framework be created, with a particular location in that framework designated in the AI’s motivation system as the repository of its final value. But the cognitive framework itself must be revisable, so as to allow the AI to expand its representational capacities as it learns more about the world and grows more intelligent. The AI might undergo the equivalent of scientific revolutions, in which its worldview is shaken up and it perhaps suffers ontological crises in which it discovers that its previous ways of thinking about values were based on confusions and illusions. Yet starting at a sub-human level of development and continuing throughout all its subsequent development into a galactic superintelligence, the AI’s conduct is to be guided by an essentially unchanging final value, a final value that becomes better understood by the AI in direct consequence of its general intellectual progress—and likely quite differently understood by the mature AI than it was by its original programmers, though not different in a random or hostile way but in a benignly appropriate way. How to accomplish this remains an open question.20 (See Box 11.)
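To make this architectural requirement slightly more concrete, here is a minimal illustrative sketch (my own, not from the book) of a motivation system that stores its final value as a fixed reference into a revisable cognitive framework: the pointer stays constant while the representation it designates is refined. All names here (ConceptStore, FINAL_VALUE_REF, and so on) are hypothetical.

```python
# Illustrative sketch only: a final value held as a fixed *reference* into a
# revisable representational framework. All names and structure are hypothetical.

class ConceptStore:
    """A toy 'cognitive framework': concepts keyed by stable identifiers whose
    internal representations may be revised as the agent learns."""
    def __init__(self):
        self._concepts = {}

    def define(self, concept_id, representation):
        self._concepts[concept_id] = representation

    def revise(self, concept_id, new_representation):
        # The representation changes; the identifier (and hence any goal that
        # points at it) stays fixed.
        self._concepts[concept_id] = new_representation

    def lookup(self, concept_id):
        return self._concepts[concept_id]


class Agent:
    # The motivation system stores only a pointer, not an explicit description.
    FINAL_VALUE_REF = "values_described_in_the_envelope"

    def __init__(self, concepts):
        self.concepts = concepts

    def current_goal_content(self):
        # What the goal 'means' at any moment depends on the agent's current,
        # possibly immature, understanding of the referenced concept.
        return self.concepts.lookup(self.FINAL_VALUE_REF)


store = ConceptStore()
store.define(Agent.FINAL_VALUE_REF, "crude early guess at the intended values")
agent = Agent(store)
print(agent.current_goal_content())

# After the equivalent of a scientific revolution the representation is
# overhauled, but the final value (the reference) is unchanged.
store.revise(Agent.FINAL_VALUE_REF, "richer, revised model of the intended values")
print(agent.current_goal_content())
```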
In summary, it is not yet known how to use the value learning approach to install plausible human values (though see Box 12 for some examples of recent ideas). At present, the approach should be viewed as a research program rather than an available technique. If it could be made to work, it might constitute the most ideal solution to the value-loading problem. Among other benefits, it would seem to offer a natural way to prevent mind crime, since a seed AI that makes reasonable guesses about which values its programmers might have installed would anticipate that mind crime is probably negatively evaluated by those values, and thus best avoided, at least until more definitive information has been obtained.
Last, but not least, there is the question of “what to write in the envelope”—or, less metaphorically, the question of which values we should try to get the AI to learn. But this issue is common to all approaches to the AI value-loading problem. We return to it in Chapter 13.
Eliezer Yudkowsky has tried to describe some features of a seed AI architecture intended to enable the kind of behavior described in the text above. In his terminology, the AI would use “external reference semantics.”21
To illustrate the basic idea, let us suppose that we want the system to be “friendly.” The system starts out with the goal of trying to instantiate property F but does not initially know much about what F is. It might just know that F is some abstract property and that when the programmers speak of “friendliness,” they are probably trying to convey information about F. Since the AI’s final goal is to instantiate F, an important instrumental value is to learn more about what F is. As the AI discovers more about F, its behavior is increasingly guided by the actual content of F. Thus, hopefully, the AI becomes increasingly friendly the more it learns and the smarter it gets.
The programmers can help this process along, and reduce the risk of the AI making some catastrophic mistake while its understanding of F is still incomplete, by providing the AI with “programmer affirmations,” hypotheses about the nature and content of F to which an initially high probability is assigned. For instance, the hypothesis “misleading the programmers is unfriendly” can be given a high prior probability. These programmer affirmations, however, are not “true by definition”—they are not unchallengeable axioms about the concept of friendliness. Rather, they are initial hypotheses about friendliness, hypotheses to which a rational AI will assign a high probability at least for as long as it trusts the programmers’ epistemic capacities more than its own.
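One way to picture external reference semantics and programmer affirmations together: the agent keeps a probability distribution over hypotheses about what F requires, seeds it with programmer affirmations at high (but not unit) prior probability, and scores actions by expectation over those hypotheses. The toy Python sketch below illustrates only this bookkeeping; it is not Yudkowsky’s actual architecture, and every identifier and number in it is invented.

```python
# Toy illustration of "external reference semantics": the final goal is to
# realize an abstract property F; beliefs about what F requires are
# probabilistic and revisable. Programmer affirmations enter as hypotheses
# with high, but not unit, prior probability. All values here are invented.

hypotheses = {
    # hypothesis about F -> current probability that it is part of what F requires
    "misleading the programmers is unfriendly": 0.98,     # programmer affirmation
    "acquiring resources at any cost is friendly": 0.02,  # a competing guess
}

def action_rating_under(hypothesis, action):
    # Placeholder evaluation; a real system would need a learned model here.
    if hypothesis == "misleading the programmers is unfriendly":
        return 0.0 if action == "deceive programmers" else 1.0
    return 0.5

def expected_friendliness(action, hypotheses):
    """Score an action by how well it fares under each hypothesis about F,
    weighted by the current probability assigned to that hypothesis."""
    return sum(p * action_rating_under(h, action) for h, p in hypotheses.items())

def update(hypotheses, hypothesis, likelihood_ratio):
    """Crude odds-form Bayesian update: evidence can raise or lower any
    hypothesis, including programmer affirmations (they are not axioms)."""
    p = hypotheses[hypothesis]
    odds = (p / (1 - p)) * likelihood_ratio
    hypotheses[hypothesis] = odds / (1 + odds)

actions = ["deceive programmers", "ask for clarification"]
print(max(actions, key=lambda a: expected_friendliness(a, hypotheses)))
# -> ask for clarification, given the priors above

# Later evidence against the competing guess weakens it without touching the goal:
update(hypotheses, "acquiring resources at any cost is friendly", likelihood_ratio=0.1)
print(hypotheses["acquiring resources at any cost is friendly"])  # ~0.002
```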
Yudkowsky’s proposal also involves the use of what he called “causal validity semantics.” The idea here is that the AI should do not exactly what the programmers told it to do but rather (something like) what they were trying to tell it to do. While the programmers are trying to explain to the seed AI what friendliness is, they might make errors in their explanations. Moreover, the programmers themselves may not fully understand the true nature of friendliness. One would therefore want the AI to have the ability to correct errors in the programmers’ thinking, and to infer the true or intended meaning from whatever imperfect explanations the programmers manage to provide. For example, the AI should be able to represent the causal processes whereby the programmers learn and communicate about friendliness. Thus, to pick a trivial example, the AI should understand that there is a possibility that a programmer might make a typo while inputting information about friendliness, and the AI should then seek to correct the error. More generally, the AI should seek to correct for whatever distortive influences may have corrupted the flow of information about friendliness as it passed from its source through the programmers to the AI (where “distortive” is an epistemic category). Ideally, as the AI matures, it should overcome any cognitive biases and other more fundamental misconceptions that may have prevented its programmers from fully understanding what friendliness is.
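The “correct the distorted channel” idea can be read as noisy-channel inference: treat what the programmers actually typed as a corrupted signal about what they meant, and infer the most probable intended statement given a prior over intentions and an error model. The following toy sketch (my own analogy, limited to the trivial typo case) gestures at that inference pattern.

```python
# Toy noisy-channel analogy: infer what the programmers *meant* from what they
# actually typed, using a prior over intended statements and a simple error
# model. Candidates, priors, and the typo-ridden input are all invented.
from difflib import SequenceMatcher

candidates = {
    # intended statement -> prior plausibility that the programmers meant it
    "misleading the programmers is unfriendly": 0.70,
    "misleading the programmers is friendly": 0.05,   # implausible intention
    "rewarding the programmers is unfriendly": 0.25,
}

observed = "misleadng the programers is unfreindly"   # what was actually typed

def channel_likelihood(intended, observed):
    """Likelihood-like score that 'observed' arose from 'intended' via typos:
    higher string similarity -> higher score."""
    return SequenceMatcher(None, intended, observed).ratio()

def infer_intended(observed, candidates):
    # Unnormalized posterior (prior * likelihood) is enough to pick the argmax.
    return max(candidates, key=lambda c: candidates[c] * channel_likelihood(c, observed))

print(infer_intended(observed, candidates))
# -> misleading the programmers is unfriendly
```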
What we might call the “Hail Mary” approach is based on the hope that elsewhere in the universe there exist (or will come to exist) civilizations that successfully manage the intelligence explosion, and that they end up with values that significantly overlap with our own. We could then try to build our AI so that it is motivated to do what these other superintelligences want it to do.22 The advantage is that this might be easier than building our AI so that it is directly motivated to do what we want.
For this scheme to work it is not necessary that our AI be able to establish communication with any alien superintelligence. Rather, our AI’s actions would be guided by its estimates of what the alien superintelligences would want it to do. Our AI would model the likely outcomes of intelligence explosions elsewhere, and as it becomes superintelligent itself, its estimates should become increasingly accurate. Perfect knowledge is not required. There may be a range of plausible outcomes of intelligence explosions, and our AI would then do its best to accommodate the preferences of the various different kinds of superintelligence that might emerge, weighted by probability.
This version of the Hail Mary approach requires that we construct a final value for our AI that refers to the preferences of other superintelligences. Exactly how to do this is not yet clear. However, superintelligent agents might be structurally distinctive enough that we could write a piece of code to function as a detector: it would scan the world model of our developing AI and designate the representational elements that correspond to the presence of a superintelligence. The detector would then, somehow, extract the preferences of the superintelligence in question (as it is represented within our own AI).23
If we could create such a detector, we could then use it to define our AI’s final values. One challenge is that we may need to create the detector before we know what representational framework our AI will develop. The detector may thus need to query an unknown representational framework and extract the preferences of whatever superintelligence may be represented therein. This looks difficult, but perhaps some clever solution can be found.24
If the basic setup could be made to work, various refinements immediately suggest themselves. For example, rather than aiming to follow (some weighted composition of) the preferences of every alien superintelligence, our AI’s final value could incorporate a filter to select a subset of alien superintelligences for obeisance (with the aim of selecting ones whose values are closer to our own). For instance, we might use criteria pertaining to a superintelligence’s causal origin to determine whether to include it in the obeisance set. Certain properties of its origination (which we might be able to define in structural terms) may correlate with the degree to which the resultant superintelligence could be expected to share our values. Perhaps we wish to place more trust in superintelligences whose causal origins trace back to a whole brain emulation, or to a seed AI that did not make heavy use of evolutionary algorithms or that emerged slowly in a way suggestive of a controlled takeoff. (Taking causal origins into account would also let us avoid over-weighting superintelligences that create multiple copies of themselves—indeed would let us avoid creating an incentive for them to do so.) Many other refinements would also be possible.
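Schematically, the refined Hail Mary value amounts to a probability-weighted average of the inferred preferences of the hypothesized superintelligences that pass an origin filter. The sketch below merely spells out that arithmetic; the probabilities, origins, and preference numbers are placeholders for quantities we do not currently know how to estimate, and nothing here addresses the hard part (the detector).

```python
# Toy arithmetic for a filtered, probability-weighted Hail Mary utility. Every
# number and label below is a placeholder for something we do not yet know how
# to compute (the detector, origin classification, preference estimates).

hypothesized_superintelligences = [
    # probability that this kind of superintelligence emerges, its causal origin,
    # and its estimated preference (in [-1, 1]) for the outcome being evaluated
    {"prob": 0.30, "origin": "whole brain emulation",     "pref": 0.9},
    {"prob": 0.50, "origin": "slow controlled takeoff",   "pref": 0.7},
    {"prob": 0.20, "origin": "heavy evolutionary search", "pref": -0.4},
]

TRUSTED_ORIGINS = {"whole brain emulation", "slow controlled takeoff"}

def hail_mary_utility(superintelligences, trusted_origins):
    """Average the preferences of the hypothesized superintelligences that pass
    the causal-origin filter, weighted by their probability of emerging."""
    included = [s for s in superintelligences if s["origin"] in trusted_origins]
    total_prob = sum(s["prob"] for s in included)
    if total_prob == 0.0:
        return 0.0  # empty obeisance set: no guidance from this term
    return sum(s["prob"] * s["pref"] for s in included) / total_prob

print(hail_mary_utility(hypothesized_superintelligences, TRUSTED_ORIGINS))
# -> 0.775 with the numbers above
```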
The Hail Mary approach requires faith that there are other superintelligences out there that sufficiently share our values.25
This makes the approach non-ideal. However, the technical obstacles facing the Hail Mary approach, though very substantial, might possibly be less formidable than those confronting alternative approaches. Exploring non-ideal but more easily implementable approaches can make sense—not with the intention of using them, but to have something to fall back upon in case an ideal solution should not be ready in time.
Another idea for how to solve the value-loading problem has recently been proposed by Paul Christiano.26
Like the Hail Mary, it is a value learning method that tries to define the value criterion by means of a “trick” rather than through laborious construction. By contrast to the Hail Mary, it does not presuppose the existence of other superintelligent agents that we could point to as role models for our own AI. Christiano’s proposal is somewhat resistant to brief explanation—it involves a series of arcane considerations—but we can try to at least gesture at its main elements.
Suppose we could obtain (a) a mathematically precise specification of a particular human brain and (b) a mathematically well-specified virtual environment that contains an idealized computer with an arbitrarily large amount of memory and CPU power. Given (a) and (b), we could define a utility function U as the output the human brain would produce after interacting with this environment. U would be a mathematically well-defined object, albeit one which (because of computational limitations) we may be unable to describe explicitly. Nevertheless, U could serve as the value criterion for a value learning AI, which could use various heuristics for assigning probabilities to hypotheses about what U implies.
Intuitively, we want U to be the utility function that a suitably prepared human would output if she had the advantage of being able to use an arbitrarily large amount of computing power—enough computing power, for example, to run astronomical numbers of copies of herself to assist her in analyzing how to specify a utility function, or to help her devise a better process for going about this analysis. (We are here foreshadowing a theme, “coherent extrapolated volition,” which will be further explored in Chapter 13.)
It would seem relatively easy to specify the idealized environment: we can give a mathematical description of an abstract computer with arbitrarily large capacity; and in other respects we could use a virtual reality program that gives a mathematical description of, say, a single room with a computer terminal in it (instantiating the abstract computer). But how to obtain a mathematically precise description of a particular human brain? The obvious way would be through whole brain emulation, but what if the technology for emulation is not available in time?
This is where Christiano’s proposal offers a key innovation. Christiano observes that in order to obtain a mathematically well-specified value criterion, we do not need a practically useful computational model of a mind, a model we could run. We just need a (possibly implicit and hopelessly complicated) mathematical definition—and this may be much easier to attain. Using functional neuroimaging and other measurements, we can perhaps collect gigabytes of data about the input–output behavior of a selected human. If we collect a sufficient amount of data, then it might be that the simplest mathematical model that accounts for all this data is in fact an emulation of the particular human in question. Although it would be computationally intractable for us to find this simplest model from the data, it could be perfectly possible for us to define the model, by referring to the data and using a mathematically well-defined simplicity measure (such as some variant of the Kolmogorov complexity, which we encountered in Box 1, Chapter 1).27
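The logical shape of the construction can be written down even though it could never be executed at the relevant scale: enumerate candidate models from simplest to most complex, take the first one that reproduces all the recorded input–output data, and let U be whatever that model outputs in the idealized environment. The fully runnable toy below caricatures this with a four-element model class and trivial “brain data”; the real construction would involve brain-scale data and a Kolmogorov-style simplicity measure, which is not computable.

```python
# Toy, fully runnable caricature of the implicit-definition idea. The "human" is
# a trivial doubling function, the model class is a short hand-written list
# ordered from simplest to most complex, and the "idealized environment" just
# queries the recovered model. Every element is a stand-in: the real construction
# would use brain-scale data and an (uncomputable) Kolmogorov-style measure.

# (a) Recorded input-output data about the selected "human".
brain_data = [(0, 0), (1, 2), (2, 4), (3, 6)]

# Candidate models, listed in order of increasing "description length".
candidate_models = [
    ("constant zero", lambda x: 0),
    ("identity",      lambda x: x),
    ("double",        lambda x: 2 * x),
    ("square",        lambda x: x * x),
]

def simplest_consistent_model(data, candidates):
    """Define (rather than efficiently find) the simplest model that reproduces
    every recorded input-output pair."""
    for name, model in candidates:
        if all(model(inp) == out for inp, out in data):
            return name, model
    raise ValueError("no consistent model in the candidate class")

# (b) The "idealized environment": here it simply runs the recovered model on a
# query standing in for "after deliberating with unlimited resources, what value
# do you assign to this outcome?"
name, model = simplest_consistent_model(brain_data, candidate_models)
U = model  # U is fixed by the data plus the environment, not written by hand
print(name, U(10))  # -> double 20
```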