purpose CPU.

During the decades following the 8080, the number of transistors that

could be packed onto a single chip increased at a stunning pace. As CPU

designers had more and more transistors to work with when designing new

chips, they began to think up novel ways of using those transistors to increase computing performance on application code. One of the first things that

occurred to designers was that they could put more than one ALU on a chip

and have both ALUs working in parallel to process code faster. Since these

designs could do more than one scalar (or integer, for our purposes) operation at once, they were called superscalar computers. The RS/6000 from IBM was released in 1990 and was the world’s first commercially available superscalar

CPU. Intel followed in 1993 with the Pentium, which, with its two ALUs,

brought the x86 world into the superscalar era.

For illustrative purposes, I’ll now introduce a two-way superscalar version of the DLW-1, called the DLW-2 and illustrated in Figure 4-1. The DLW-2

has two ALUs, so it’s able to execute two arithmetic instructions in parallel

(hence the term superscalar). These two ALUs share a single register file, a situation that in terms of our file clerk analogy would correspond to

the file clerk sharing his personal filing cabinet with a second file clerk.

As you can probably guess from looking at Figure 4-1, superscalar

processing adds a bit of complexity to the DLW-2’s design, because it needs

new circuitry that enables it to reorder the linear instruction stream so that

some of the stream’s instructions can execute in parallel. This circuitry has to ensure that it’s “safe” to dispatch two instructions in parallel to the two execution units. But before I go on to discuss some reasons why it might not be

safe to execute two instructions in parallel, I should define the term I just



Chapter 4

Figure 4-1: The superscalar DLW-2 (diagram label: Main Memory)

Notice that in Figure 4-2 I’ve renamed the second pipeline stage decode/dispatch. This is because attached to the latter part of the decode stage is a bit of dispatch circuitry whose job it is to determine whether or not two

instructions can be executed in parallel, in other words, on the same clock

cycle. If they can be executed in parallel, the dispatch unit sends one instruction

to the first integer ALU and one to the second integer ALU. If they can’t

be dispatched in parallel, the dispatch unit sends them in program order to

the first of the two ALUs. There are a few reasons why the dispatcher might

decide that two instructions can’t be executed in parallel, and we’ll cover

those in the following sections.
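The dispatch decision described above can be sketched in a few lines of Python. This is only an illustration, not the DLW-2's actual logic: the `Instr` type, the `can_pair` check, and the register names are all hypothetical, and the single safety rule used here (the second instruction must not touch the register that the first one writes) is an assumption standing in for the reasons covered in later sections.

```python
from typing import Dict, List, NamedTuple


class Instr(NamedTuple):
    """A hypothetical two-source, one-destination arithmetic instruction."""
    op: str
    src1: str
    src2: str
    dest: str


def can_pair(first: Instr, second: Instr) -> bool:
    # Assumed safety rule: the second instruction may neither read nor
    # overwrite the register that the first instruction writes.
    return first.dest not in (second.src1, second.src2, second.dest)


def dispatch(first: Instr, second: Instr) -> List[Dict[str, Instr]]:
    """Return one dict per clock cycle, mapping an ALU name to its instruction."""
    if can_pair(first, second):
        # Safe: both instructions execute on the same clock cycle.
        return [{"ALU1": first, "ALU2": second}]
    # Not safe: send both to the first ALU in program order, one per cycle.
    return [{"ALU1": first}, {"ALU1": second}]


if __name__ == "__main__":
    a = Instr("add", "A", "B", "C")
    b = Instr("sub", "A", "B", "D")   # independent of a
    c = Instr("add", "C", "A", "D")   # reads a's result
    print(len(dispatch(a, b)), "cycle(s) for the independent pair")
    print(len(dispatch(a, c)), "cycle(s) for the dependent pair")
```

Returning a list of per-cycle assignments keeps the two outcomes easy to compare: one entry means the pair shared a clock cycle, two entries mean the pair was serialized on the first ALU.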

It’s important to note that even though the processor has multiple ALUs,

the programming model does not change. The programmer still writes to the

same interface, even though that interface now represents a fundamentally

different type of machine than the processor actually is; the interface represents

a sequential execution machine, but the processor is actually a parallel

execution machine. So even though the superscalar CPU executes instructions

in parallel, the illusion of sequential execution absolutely must be

maintained for the sake of the programmer. We’ll see some reasons why

this is so later on, but for now the important thing to remember is that main

memory still sees one sequential code stream, one data stream, and one

results stream, even though the code and data streams are carved up inside

the computer and pushed through the two ALUs in parallel.

Superscalar Execution


Figure 4-2: The pipeline of the superscalar DLW-2 (diagram labels: Front End, Back End)

If the processor is to execute multiple instructions at once, it must be

able to fetch and decode multiple instructions at once. A two-way superscalar

processor like the DLW-2 can fetch two instructions at once from memory on

each clock cycle, and it can also decode and dispatch two instructions each

clock cycle. So the DLW-2 fetches instructions from memory in groups of

two, starting at the memory address that marks the beginning of the current

program’s code segment and incrementing the program counter to point

four bytes ahead each time a new pair of instructions is fetched. (Remember, the

DLW-2’s instructions are two bytes wide.)
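As a rough sketch of this fetch pattern, the generator below walks a block of code two instructions at a time and advances the program counter by four bytes per fetch. The function name `fetch_pairs` and the constants are hypothetical, and the code segment is modeled as a plain `bytes` object rather than real memory.

```python
from typing import Iterator, List, Tuple

INSTRUCTION_BYTES = 2   # the DLW-2's instructions are two bytes wide
FETCH_WIDTH = 2         # two-way superscalar: two instructions per fetch


def fetch_pairs(code: bytes, start: int = 0) -> Iterator[Tuple[int, List[bytes]]]:
    """Yield (program_counter, [instruction, instruction]) for each fetch."""
    step = INSTRUCTION_BYTES * FETCH_WIDTH   # four bytes per fetch
    pc = start
    while pc < len(code):
        end = min(pc + step, len(code))
        group = [code[addr:addr + INSTRUCTION_BYTES]
                 for addr in range(pc, end, INSTRUCTION_BYTES)]
        yield pc, group
        pc += step   # point four bytes ahead for the next pair


if __name__ == "__main__":
    segment = bytes(range(8))   # four made-up two-byte instructions
    for pc, group in fetch_pairs(segment):
        print(pc, [instr.hex() for instr in group])
```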

As you might guess, fetching and decoding two instructions at a time

complicates the way the DLW-2 deals with branch instructions. What if the

first instruction in a fetched pair happens to be a branch instruction that has

the processor jump directly to another part of memory? In this case, the

second instruction in the pair has to be discarded. This wastes fetch bandwidth

and introduces a bubble into the pipeline. There are other issues

relating to superscalar execution and branch instructions, and I’ll say more

about them in the section on control hazards.
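The wasted fetch slot can be seen in a small simulation. This is a simplified sketch under several assumptions: instructions are addressed by index rather than by byte, every jump is unconditional and always taken, and fetching simply resumes at the jump target. The `run_fetch` function and the tuple encoding of instructions are hypothetical.

```python
from typing import List, Tuple


def run_fetch(program: List[tuple], max_fetches: int = 100) -> Tuple[List[tuple], int]:
    """Fetch instructions in pairs; return (instructions kept, slots discarded)."""
    pc = 0
    kept: List[tuple] = []
    discarded = 0
    for _ in range(max_fetches):          # guard against endless loops
        if pc >= len(program):
            break
        pair = program[pc:pc + 2]
        first = pair[0]
        kept.append(first)
        if first[0] == "jump":
            # The second slot of this fetch was filled for nothing.
            discarded += len(pair) - 1
            pc = first[1]
            continue
        if len(pair) == 2:
            second = pair[1]
            kept.append(second)
            if second[0] == "jump":
                # A jump in the second slot wastes nothing in this fetch.
                pc = second[1]
                continue
        pc += 2
    return kept, discarded


if __name__ == "__main__":
    program = [("add",), ("sub",), ("jump", 5), ("mul",),
               ("or",), ("and",), ("xor",)]
    kept, discarded = run_fetch(program)
    print("kept:", kept)
    print("discarded fetch slots:", discarded)
```

In this run the `mul` instruction is fetched alongside the jump and then thrown away, which is the lost bandwidth the text describes.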

Superscalar Computing and IPC

Superscalar computing allows a microprocessor to increase the number

of instructions per clock that it completes beyond one instruction per clock.

Recall that one instruction per clock was the maximum theoretical instruction

throughput for a pipelined processor, as described in “Instruction Throughput” on page 53. Because a superscalar machine can have multiple instructions

in multiple write stages on each clock cycle, the superscalar machine can

complete multiple instructions per cycle. If we adapt Chapter 3’s pipeline

diagrams to take account of superscalar execution, they look like Figure 4-3.

















Figure 4-3: Superscalar execution and pipelining combined

In Figure 4-3, two instructions are added to the Completed Instructions

box on each cycle once the pipeline is full. The more ALU pipelines that a

processor has operating in parallel, the more instructions it can add to that

box on each cycle. Thus superscalar computing allows you to increase a processor’s

IPC by adding more hardware. There are some practical limits to how

many instructions can be executed in parallel, and we’ll discuss those later.
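The arithmetic behind this claim can be sketched as follows. The model is idealized: it assumes a four-stage pipeline (the `depth` default is an assumption, not something this section states), no stalls or bubbles, and that every ALU pipeline is kept busy on every cycle. The function names are hypothetical.

```python
def completed_instructions(cycles: int, width: int, depth: int = 4) -> int:
    """Instructions completed after `cycles` clock cycles in an ideal pipeline.

    width: number of ALU pipelines working in parallel
    depth: number of pipeline stages (assumed to be 4 here)
    """
    # Nothing completes until the pipeline has filled; after that,
    # `width` instructions complete on every cycle.
    full_cycles = max(0, cycles - depth + 1)
    return full_cycles * width


def ipc(cycles: int, width: int, depth: int = 4) -> float:
    """Average instructions per clock over the whole run."""
    return completed_instructions(cycles, width, depth) / cycles


if __name__ == "__main__":
    for width in (1, 2, 4):
        print(f"width={width}: IPC over 1000 cycles = {ipc(1000, width):.3f}")
```

Over a long run the pipeline-fill cost fades and the average IPC approaches the number of parallel pipelines, which is the theoretical ceiling this section refers to.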

Expanding Superscalar Processing with Execution Units

Most modern processors do more with superscalar execution than just add a second ALU. Rather, they distribute the work of handling different types of instructions among different types of execution units. An execution unit is a block of circuitry in the processor’s back end that executes a certain category of instruction. For instance, you’ve already met the arithmetic logic unit (ALU), an execution unit that performs arithmetic and logical operations on integers. In this section we’ll take a closer look at the ALU, and you’ll learn about some other types of execution units for non-integer arithmetic operations, memory accesses, and branch instructions.



Basic Number Formats and Computer Arithmetic

The kinds of numbers on which modern microprocessors operate can be

divided into two main types: integers (aka fixed-point numbers) and floating-point numbers. Integers are simply whole numbers of the type with which

you first learn to count in grade school. An integer can be positive, negative,

or zero, but it cannot, of course, be a fraction. Integers are also called fixed-point numbers because an integer’s decimal point does not move. Examples of integers are 1, 0, 500, 27, and 42. Arithmetic and logical operations involving integers are among the simplest and fastest operations that a microprocessor performs. Applications like compilers, databases, and word processors

make heavy use of integer operations, because the numbers they deal with

are usually whole numbers.

A floating-point number is a decimal number that represents a fraction. Examples of floating-point numbers are 56.5, 901.688, and 41.9999. As you can see from these three numbers, the decimal point “floats” around and isn’t fixed in one place, hence the name. The number of places behind the decimal point determines a floating-point number’s accuracy, so floating-point numbers are often approximations of fractional values. Arithmetic and logical operations performed on floating-point numbers are more complex

and, hence, slower than their integer counterparts. Because floating-point

numbers are approximations of fractional values, and the real world is kind

of approximate and fractional, floating-point arithmetic is commonly found

in real world–oriented applications like simulations, games, and signal-processing applications.
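The approximate nature of floating-point values is easy to observe. The short sketch below assumes Python's `float` type, which is a binary (not decimal) floating-point format on common platforms, so it illustrates the general idea rather than the exact decimal picture painted above. The variable names are made up for the example.

```python
from fractions import Fraction
import math

# Integer arithmetic is exact.
int_sum_exact = (1 + 2 == 3)

# Neither 0.1 nor 0.2 can be stored exactly in binary floating point,
# so their sum carries a tiny error and does not compare equal to 0.3.
float_sum_exact = (0.1 + 0.2 == 0.3)

# Fraction(0.1) exposes the exact value that is actually stored for 0.1;
# it is close to, but not the same as, one tenth.
stored_tenth = Fraction(0.1)

# The usual remedy is to compare floating-point values within a tolerance.
close_enough = math.isclose(0.1 + 0.2, 0.3)

if __name__ == "__main__":
    print("integer sum exact:", int_sum_exact)
    print("float sum exact:  ", float_sum_exact)
    print("stored value == 1/10:", stored_tenth == Fraction(1, 10))
    print("close within tolerance:", close_enough)
```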

Both integer and floating-point numbers can themselves be divided into

one of two types: scalars and vectors. Scalars are values that have only one numerical component, and they’re best understood in contrast with vectors. Briefly, a vector is a multicomponent value, most often seen as an ordered sequence or array of numbers. (Vectors are covered in detail in “The Vector Execution Units” on page 168.) Here are some examples of different types of vectors and scalars:











          Integer                      Floating-Point

Vector    {5, −7, −9, 8}               {0.99, −1.1, 3.31}

          {1,003, 42, 97, 86, 97}      {50.01, 0.002, −1.4, 1.4}

          {234, 7, 6, 1, 3, 10, 11}    {5.6, 22.3, 44.444, 76.01, 9.9}
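To make the scalar/vector distinction concrete, here is a minimal sketch in which a scalar is a single number and a vector is modeled as a Python list. The `vector_add` helper is hypothetical, and a plain loop over list elements stands in for what real vector hardware does on all components at once.

```python
from typing import List, Union

Number = Union[int, float]


def vector_add(a: List[Number], b: List[Number]) -> List[Number]:
    """Add two vectors component by component."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    return [x + y for x, y in zip(a, b)]


# A scalar operation works on single-component values.
scalar_sum = 5 + (-7)

# A vector operation applies the same operation to every component.
int_vector_sum = vector_add([5, -7, -9, 8], [1, 1, 1, 1])
float_vector_sum = vector_add([0.5, -1.5, 3.25], [0.5, 0.5, 0.5])

if __name__ == "__main__":
    print(scalar_sum)
    print(int_vector_sum)
    print(float_vector_sum)
```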

Returning to the code/data distinction, we can say that the data

stream consists of four types of numbers: scalar integers, scalar floating-point

numbers, vector integers, and vector floating-point numbers. (Note

that even memory addresses fall into one of these four categories—scalar

integers.) The code stream, then, consists of instructions that operate on

all four types of numbers.



The kinds of operations that can be performed on the four types of

numbers fall into two main categories: arithmetic operations and logical

operations. When I first introduced arithmetic operations in Chapter 1,

I lumped them together with logical operations for the sake of convenience.

At this point, though, it’s useful to distinguish the two types of operations

from one another:


Arithmetic operations are operations like addition, subtraction,

multiplication, and division, all of which can be performed on any

type of number.


Logical operations are Boolean operations like AND, OR, NOT, and

XOR, along with bit shifts and rotates. Such operations are performed

on scalar and vector integers, as well as on the contents of special-purpose

registers like the processor status word (PSW).
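The two categories of operations above can be sketched on scalar integers as follows. The 16-bit word size is an assumption made for the example (this section does not state a register width), and the helper names are hypothetical. Python integers are unbounded, so a mask is used to imitate a fixed-width register.

```python
WIDTH = 16                  # assumed register width, in bits
MASK = (1 << WIDTH) - 1     # 0xFFFF: keeps results within 16 bits


def bit_not(x: int) -> int:
    """Boolean NOT, confined to a WIDTH-bit word."""
    return ~x & MASK


def shift_left(x: int, n: int) -> int:
    """Shift left; bits pushed past the top of the word are lost."""
    return (x << n) & MASK


def rotate_left(x: int, n: int) -> int:
    """Rotate left; bits pushed past the top re-enter at the bottom."""
    n %= WIDTH
    return ((x << n) | (x >> (WIDTH - n))) & MASK


if __name__ == "__main__":
    a, b = 0b1100, 0b1010

    # Arithmetic operations
    print(a + b, a - b, a * b, a // b)

    # Logical operations
    print(bin(a & b), bin(a | b), bin(a ^ b), hex(bit_not(a)))
    print(hex(shift_left(0x8001, 1)), hex(rotate_left(0x8001, 1)))
```

The difference between the last two calls shows why shifts and rotates are listed separately: the shift discards the top bit, while the rotate carries it around to the bottom of the word.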

The types of operations performed on these types of numbers can be

broken down as illustrated in Figure 4-4.



Figure 4-4: Number formats and operation types (diagram labels: Arithmetic Operations, Logic Operations)

As you make your way through the rest of the book, you may want to

refer back to this section occasionally. Different microprocessors divide these

operations among different execution units in a variety of ways, and things

can easily get confusing.

Arithmetic Logic Units

On early microprocessors, as on the DLW-1 and DLW-2, all integer arithmetic

and logical operations were handled by the ALU. Floating-point operations

were executed by a companion chip, commonly called an arithmetic coprocessor, that was attached to the motherboard and designed to work in conjunction

with the microprocessor. Eventually, floating-point capabilities were integrated

onto the CPU as a separate execution unit alongside the ALU.



Consider the Intel Pentium processor depicted in Figure 4-5, which

contains two integer ALUs and a floating-point ALU, along with some

other units that we’ll describe shortly.

Figure 4-5: The Intel Pentium (diagram labels: Front End, Instruction Fetch, Control Unit, Integer Unit, Back End)

This diagram is a variation on Figure 4-2, with the execute stage replaced

