Superintelligence: Paths, Dangers, Strategies
Nick Bostrom

For these reasons it would be imprudent to rely on information deprivation as the main check on a superintelligence’s power. Some forms of stunting, however, including limiting hardware capacity and preventing the development of specific cognitive faculties, might find use as auxiliary control methods in the developmental phase. They could be used, for instance, to regulate the rate of capacity gain in an intermediary AI and to allow tests to be run at different levels of capability short of full-throttle superintelligence.

Tripwires

A tripwire is a mechanism that performs diagnostic tests on the system (possibly without its knowledge) and effects a shutdown if it detects signs of dangerous activity. Programmers can then scrutinize recordings of the suspicious activity at their leisure, and make changes to the AI’s code base or environment before restarting it. If they find that the problem is sufficiently deep-rooted, they could decide to abort the entire project.

Tripwires differ from incentive methods in that they do not rely on the system being aware of the consequences of engaging in forbidden activities. Tripwires are more closely related to stunting methods. Like stunting, tripwires could be used as a temporary safeguard, providing a degree of protection during the development phase. In principle, tripwires can also be used during the operational phase, particularly for a boxed system. However, the ability of tripwires to constrain a full-fledged superintelligence must remain very much in doubt, since it would be hard for us to assure ourselves that such an agent could not find ways to subvert any tripwire devised by the human intellect.

Table 9  Different kinds of tripwires (type of monitoring, with description and examples)

Behavior: Detectors could be placed around a boxed AI to detect attempts to breach the containment. For example, detectors could intercept attempts at radio communication or at accessing internal computational resources intended to be off limits. An “Ethernet port of Eden” could be installed: an apparent connection to the internet that leads to a shutdown switch.

Ability: Automated capability testing could be performed at frequent intervals to determine the AI’s skill in various domains. If either the rate of improvement is unexpectedly high or the AI attains a level of competence that brings it close to the potential danger zone, the system could be slowed down or paused, in order to allow the programmers to monitor further developments more closely.

Content: The AI’s internal processes could be monitored for suspicious activity. For example, any change to the AI’s representation of its final values might trigger an automatic shutdown and review. The AI’s resource use and algorithmic methods may also be monitored for unexpected patterns. More ambitiously, the AI’s beliefs, plans, subgoals, and justifications may be continuously scrutinized, to ensure they conform to the programmers’ expectations. Such monitoring of the AI’s intentional states would enable a particularly powerful form of content monitoring: the scanning of the AI’s cognitive processes for a conception of deception—that vulnerable moment when an unfriendly AI first forms the intention to conceal its true intentions.[21] (A minimal sketch of ability and content checks of this kind appears just after the table.)
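To make the ability and content rows concrete, here is a minimal sketch of the kind of watchdog process such tripwires suggest. It is only an illustration under invented assumptions: run_capability_suite, read_value_representation, and halt_system are hypothetical hooks standing in for project-specific machinery that the text does not specify, the thresholds are arbitrary, and a real design would need the watchdog to be strongly isolated from the system it monitors.

    import hashlib
    import time

    # Hypothetical hooks (assumptions, not a real API):
    #   run_capability_suite() -> float       score on an automated test battery
    #   read_value_representation() -> bytes  serialized final-goal representation
    #   halt_system(reason: str)              suspends the AI pending human review

    CAPABILITY_CEILING = 0.80      # absolute competence threshold (arbitrary units)
    MAX_GAIN_PER_CHECK = 0.05      # largest tolerated improvement between checks
    CHECK_INTERVAL_SECONDS = 60.0

    def watchdog(run_capability_suite, read_value_representation, halt_system):
        """Ability and content tripwires: halt if capability grows too fast,
        crosses a ceiling, or the stored final-value representation changes."""
        baseline_values = hashlib.sha256(read_value_representation()).hexdigest()
        previous_score = run_capability_suite()

        while True:
            time.sleep(CHECK_INTERVAL_SECONDS)

            # Content tripwire: any change to the value representation is suspicious.
            current_values = hashlib.sha256(read_value_representation()).hexdigest()
            if current_values != baseline_values:
                halt_system("final-value representation modified")
                return

            # Ability tripwire: unexpectedly fast or unexpectedly high capability.
            score = run_capability_suite()
            if score - previous_score > MAX_GAIN_PER_CHECK:
                halt_system(f"capability jumped from {previous_score:.2f} to {score:.2f}")
                return
            if score >= CAPABILITY_CEILING:
                halt_system(f"capability {score:.2f} entered the danger zone")
                return
            previous_score = score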

Content monitoring that requires that the AI’s intentional states be transparent to the programmers or to an automatic monitoring mechanism may not be feasible for all kinds of AI architectures. (Some neural networks, for instance, are opaque, as they represent information holistically and in ways that do not necessarily match up with human concepts.) This may be a reason to avoid using such architectures.

As an adjunct safety measure in the development phase, though, tripwires are important. It would be good practice for the AI’s designers to codify their expectations about how the system is supposed to function, and then to circumscribe the AI with tripwires that would be activated if any of these expectations is violated. To make the tests more stringent, “honeypots” could be strategically placed to create temptations for a malfunctioning AI to commit some easily observable violation. For instance, if an AI has been designed in such a way that it is supposed not to want to access the internet, a fake Ethernet port could be installed (leading to an automatic shutdown switch) just to see if the AI tries to use it. (Some examples of tripwires are given in Table 9.)
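The “Ethernet port of Eden” honeypot just described can likewise be sketched in a few lines, again purely as an assumption-laden illustration: halt_system is the same hypothetical shutdown hook as before, and a real decoy would have to be indistinguishable, from inside the box, from a genuine route to the internet.

    import socket

    DECOY_HOST = "0.0.0.0"   # in practice: the decoy gateway address shown to the boxed AI
    DECOY_PORT = 8080        # unprivileged stand-in for a port the AI might plausibly try

    def ethernet_port_of_eden(halt_system, host=DECOY_HOST, port=DECOY_PORT):
        """Listen on a fake 'route to the internet' and halt on any connection attempt.

        Nothing is ever forwarded; the first connection is treated as evidence
        that the AI is trying to do something it was designed not to want to do."""
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
            listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            listener.bind((host, port))
            listener.listen(1)
            conn, addr = listener.accept()   # blocks until the AI takes the bait
            conn.close()
            halt_system(f"containment breach attempt from {addr}")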

It should be emphasized that the value of a tripwire depends not only on the mechanism itself but also—critically—on how a project reacts when a tripwire is triggered. If the programmers or project managers, impatient to make progress, simply switch the system back on again—or if they do so after making some token modification to prevent the tripwire being triggered on the next run—then no safety has been gained even if the tripwire itself works exactly as intended.

Motivation selection methods

Motivation selection methods seek to prevent undesirable outcomes by shaping what the superintelligence wants to do. By engineering the agent’s motivation system and its final goals, these methods would produce a superintelligence that would not want to exploit a decisive strategic advantage in a harmful way. Since a superintelligent agent is skilled at achieving its ends, if it prefers not to cause harm (in some appropriate sense of “harm”) then it would tend not to cause harm (in that sense of “harm”).

Motivation selection can involve explicitly formulating a goal or set of rules to be followed (direct specification) or setting up the system so that it can discover an appropriate set of values for itself by reference to some implicitly or indirectly formulated criterion (indirect normativity). One option in motivation selection is to try to build the system so that it would have modest, non-ambitious goals (domesticity). An alternative to creating a motivation system from scratch is to select an agent that already has an acceptable motivation system and then augment that agent’s cognitive powers to make it superintelligent, while ensuring that the motivation system does not get corrupted in the process (augmentation). Let us look at these in turn.

Direct specification

Direct specification is the most straightforward approach to the control problem. The approach comes in two versions, rule-based and consequentialist, and involves trying to explicitly define a set of rules or values that will cause even a free-roaming superintelligent AI to act safely and beneficially. Direct specification, however, faces what may be insuperable obstacles, deriving from both the difficulties in determining which rules or values we would wish the AI to be guided by and the difficulties in expressing those rules or values in computer-readable code.

The traditional illustration of the direct rule-based approach is the “three laws of robotics” concept, formulated by science fiction author Isaac Asimov in a short story published in 1942.[22] The three laws were: (1) A robot may not injure a human being or, through inaction, allow a human being to come to harm; (2) A robot must obey any orders given to it by human beings, except where such orders would conflict with the First Law; (3) A robot must protect its own existence as long as such protection does not conflict with the First or Second Law. Embarrassingly for our species, Asimov’s laws remained state-of-the-art for over half a century: this despite obvious problems with the approach, some of which are explored in Asimov’s own writings (Asimov probably having formulated the laws in the first place precisely so that they would fail in interesting ways, providing fertile plot complications for his stories).[23]

Bertrand Russell, who spent many years working on the foundations of mathematics, once remarked that “everything is vague to a degree you do not realize till you have tried to make it precise.”[24] Russell’s dictum applies in spades to the direct specification approach. Consider, for example, how one might explicate Asimov’s first law. Does it mean that the robot should minimize the probability of any human being coming to harm? In that case the other laws become otiose, since it is always possible for the AI to take some action that would have at least some microscopic effect on the probability of a human being coming to harm. How is the robot to balance a large risk of a few humans coming to harm against a small risk of many humans being harmed? How do we define “harm” anyway? How should the harm of physical pain be weighed against the harm of architectural ugliness or social injustice? Is a sadist harmed if he is prevented from tormenting his victim? How do we define “human being”? Why is no consideration given to other morally considerable beings, such as sentient nonhuman animals and digital minds? The more one ponders, the more the questions proliferate.

Perhaps the closest existing analog to a rule set that could govern the actions of a superintelligence operating in the world at large is a legal system. But legal systems have developed through a long process of trial and error, and they regulate relatively slowly-changing human societies. Laws can be revised when necessary. Most importantly, legal systems are administered by judges and juries who generally apply a measure of common sense and human decency to ignore logically possible legal interpretations that are sufficiently obviously unwanted and unintended by the lawgivers. It is probably humanly impossible to explicitly formulate a highly complex set of detailed rules, have them apply across a highly diverse set of circumstances, and get it right on the first implementation.[25]

Problems for the direct consequentialist approach are similar to those for the direct rule-based approach. This is true even if the AI is intended to serve some apparently simple purpose such as implementing a version of classical utilitarianism. For instance, the goal “Maximize the expectation of the balance of pleasure over pain in the world” may appear simple. Yet expressing it in computer code would involve, among other things, specifying how to recognize pleasure and pain. Doing this reliably might require solving an array of persistent problems in the philosophy of mind—even just to obtain a correct account expressed in a natural language, an account which would then, somehow, have to be translated into a programming language.

A small error in either the philosophical account or its translation into code could have catastrophic consequences. Consider an AI that has hedonism as its final goal, and which would therefore like to tile the universe with “hedonium” (matter organized in a configuration that is optimal for the generation of pleasurable experience). To this end, the AI might produce computronium (matter organized in a configuration that is optimal for computation) and use it to implement digital minds in states of euphoria. In order to maximize efficiency, the AI omits from the implementation any mental faculties that are not essential for the experience of pleasure, and exploits any computational shortcuts that according to its definition of pleasure do not vitiate the generation of pleasure. For instance, the AI might confine its simulation to reward circuitry, eliding faculties such as memory, sensory perception, executive function, and language; it might simulate minds at a relatively coarse-grained level of functionality, omitting lower-level neuronal processes; it might replace commonly repeated computations with calls to a lookup table; or it might put in place some arrangement whereby multiple minds would share most parts of their underlying computational machinery (their “supervenience bases” in philosophical parlance). Such tricks could greatly increase the quantity of pleasure producible with a given amount of resources. It is unclear how desirable this would be. Furthermore, if the AI’s criterion for determining whether a physical process generates pleasure is wrong, then the AI’s optimizations might throw the baby out with the bathwater: discarding something which is inessential according to the AI’s criterion yet essential according to the criteria implicit in our human values. The universe then gets filled not with exultingly heaving hedonium but with computational processes that are unconscious and completely worthless—the equivalent of a smiley-face sticker xeroxed trillions upon trillions of times and plastered across the galaxies.

Domesticity

One special type of final goal which might be more amenable to direct specification than the examples given above is the goal of self-limitation. While it seems extremely difficult to specify how one would want a superintelligence to behave in the world in general—since this would require us to account for all the trade-offs in all the situations that could arise—it might be feasible to specify how a superintelligence should behave in one particular situation. We could therefore seek to motivate the system to confine itself to acting on a small scale, within a narrow context, and through a limited set of action modes. We will refer to this approach of giving the AI final goals aimed at limiting the scope of its ambitions and activities as “domesticity.”

For example, one could try to design an AI such that it would function as a question-answering device (an “oracle,” to anticipate the terminology that we will introduce in the next chapter). Simply giving the AI the final goal of producing maximally accurate answers to any question posed to it would be unsafe—recall the “Riemann hypothesis catastrophe” described in Chapter 8. (Reflect, also, that this goal would incentivize the AI to take actions to ensure that it is asked easy questions.) To achieve domesticity, one might try to define a final goal that would somehow overcome these difficulties: perhaps a goal that combined the desiderata of answering questions correctly and minimizing the AI’s impact on the world except whatever impact results as an incidental consequence of giving accurate and non-manipulative answers to the questions it is asked.[26]
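To see how much work the phrase “except whatever impact results as an incidental consequence” is doing, it may help to write the combined goal down in a purely illustrative notation that the book itself does not use. Let Q(a) score the accuracy and non-manipulativeness of the answer produced by taking action a, let I(a) measure the AI’s total impact on the world, and let I_answer(a) be the impact attributable solely to delivering that answer. A domesticated oracle might then be asked to maximize

    U(a) = Q(a) - \lambda \cdot \max\bigl(0,\; I(a) - I_{\text{answer}}(a)\bigr)

for some large penalty weight \lambda, so that any impact beyond the incidental consequences of answering is heavily disfavored. Every term here, not least the impact measure I itself, conceals exactly the specification difficulties this section has been describing.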
