Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
by labeled white boxes (SIU, CIU, FPU, BU, etc.) that designate the type of
execution unit that’s modifying the code stream during the execution phase.
Notice also that the figure contains a slight shift in terminology that I should clarify before we move on.
Until now, I’ve been using the term ALU as synonymous with integer execution
unit. After the previous section, however, we know that a microprocessor does
arithmetic and logical operations on more than just integer data, so we have
to be more precise in our terminology. From now on, ALU is a general term for
any execution unit that performs arithmetic and logical operations on any type
of data. More specific labels will be used to identify the ALUs that handle
specific types of instructions and numerical data. For instance, an integer
execution unit (IU) is an ALU that executes integer arithmetic and logical
instructions, a floating-point execution unit (FPU) is an ALU that executes
floating-point arithmetic and logical instructions, and so on.
Figure 4-5 shows that the Pentium has two IUs—a simple integer unit (SIU)
and a complex integer unit (CIU)—and a single FPU.
Execution units can be organized logically into functional blocks for
ease of reference, so the two integer execution units can be referred
to collectively as the Pentium’s integer unit. The Pentium’s floating-point
unit consists of only a single FPU, but some processors have more than one FPU;
likewise with the load-store unit (LSU). The floating-point unit can consist
of two FPUs—FPU1 and FPU2—and the load-store unit can consist of LSU1
and LSU2. In both cases, we’ll often refer to “the FPU” or “the LSU” when we
mean all of the execution units in that functional block, taken as a group.
Many modern microprocessors also feature vector execution units, which
perform arithmetic and logical operations on vectors. I won’t describe vector
computing in detail here, however, because that discussion belongs in another
chapter.
Memory-Access Units
In almost all of the processors that we’ll cover in later chapters, you’ll see a pair of execution units that execute memory-access instructions: the load-store unit
and the branch execution unit. The load-store unit (LSU) is responsible for
the execution of load and store instructions, as well as for address generation.
As mentioned in Chapter 1, LSUs have small, stripped-down integer addition
hardware that can quickly perform the addition required to compute an address.
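A minimal sketch of this address-generation step (my own illustration, not tied to any real ISA) shows why only a small adder is needed: the effective address of a load like "load the word at offset 12 from the base register" is just one integer addition, wrapped to the machine's address width.

```python
# Hypothetical sketch of LSU address generation: effective address =
# base register contents + displacement, truncated to the address width.
def effective_address(base_reg_value, displacement, address_bits=16):
    """Compute base + displacement, wrapping to the machine's address width."""
    return (base_reg_value + displacement) % (1 << address_bits)

# If the base register holds 0x0100 and the displacement is 12, the
# load accesses memory location 0x010C.
addr = effective_address(0x0100, 12)
```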
The branch execution unit (BEU) is responsible for executing conditional
and unconditional branch instructions. The BEU of the DLW series reads
the processor status word as described in Chapter 1 and decides whether
or not to replace the program counter with the branch target. The BEU
also often has its own address generation unit for performing quick address
calculations as needed. We’ll talk more about the branch units of real-world
processors later on.
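The BEU's decision can be sketched in a few lines (an illustration of the idea, not code from any real processor): read one flag from the processor status word, then either substitute the branch target for the program counter or fall through to the next instruction.

```python
# Sketch of a BEU's conditional-branch decision. The flag names
# ("zero", "negative") are illustrative, not taken from a real ISA.
def next_pc(psw_flags, condition, branch_target, fall_through):
    """Return the address of the next instruction to execute."""
    taken = psw_flags.get(condition, False)
    return branch_target if taken else fall_through

# A jump-if-zero style branch: taken, because the zero flag is set.
taken_pc = next_pc({"zero": True}, "zero", branch_target=200, fall_through=101)
```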
Microarchitecture and the ISA
In the preceding discussion of superscalar execution, I made a number of
references to the discrepancy between the linear-execution, single-ALU
programming model that the programmer sees and what the superscalar
processor’s hardware actually does. It’s now time to flesh out that distinction
between the programming model and the actual hardware by introducing
some concepts and vocabulary that will allow us to talk with more precision
about the divisions between the apparent and the actual in computer
architecture.
Chapter 1 introduced the concept of the programming model as an
abstract representation of the microprocessor that exposes to the programmer
the microprocessor’s functionality. The DLW-1’s programming model con-
sisted of a single, integer-only ALU, four general-purpose registers, a program
counter, an instruction register, a processor status word, and a control unit.
The DLW-1’s instruction set consisted of a few instructions for working with
different parts of the programming model: arithmetic instructions (e.g., add
and sub) for the ALU and general-purpose registers (GPRs), load and store
instructions for manipulating the control unit and filling the GPRs with data,
and branch instructions for checking the PSW and changing the PC. We can
call this programmer-centric combination of programming model and
instruction set an instruction set architecture (ISA).
The DLW-1’s ISA was a straightforward reflection of its hardware, which
consisted of a single ALU, four GPRs, a PC, a PSW, and a control unit. In
contrast, the successor to the DLW-1, the DLW-2, contained a second ALU
that was invisible to the programmer and accessible only to the DLW-2’s
decode/dispatch logic. The DLW-2’s decode/dispatch logic would examine
pairs of integer arithmetic instructions to determine if they could safely be
executed in parallel (and hence out of sequential program order). If they
could, it would send them off to the two integer ALUs to be executed
simultaneously. Now, the DLW-2 has the same instruction set architecture as
the DLW-1—the instruction set and programming model remain unchanged—but the
DLW-2’s hardware implementation of that ISA is significantly different in
that the DLW-2 is superscalar.
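The pairing test that dispatch logic like the DLW-2's performs can be sketched as follows. This is my own simplified reconstruction, not the book's hardware: two integer instructions can issue together only if the second doesn't read the first's destination register and they don't write the same register.

```python
# Toy model of a dual-issue pairing check. Each instruction is a tuple
# (dest_reg, src_regs), e.g. add r3, r1, r2 -> ("r3", ("r1", "r2")).
def can_pair(first, second):
    """True if the two instructions may safely execute in parallel."""
    dest1, _ = first
    dest2, srcs2 = second
    if dest1 in srcs2:   # second needs first's result (read-after-write)
        return False
    if dest1 == dest2:   # both write the same register (write-after-write)
        return False
    return True

# Independent pair: both ALUs can work simultaneously.
ok = can_pair(("r3", ("r1", "r2")), ("r4", ("r1", "r5")))
# Dependent pair: the second reads r3, so program order must be kept.
not_ok = can_pair(("r3", ("r1", "r2")), ("r4", ("r3", "r5")))
```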
A particular processor’s hardware implementation of an ISA is generally
referred to as that processor’s microarchitecture. We might call the ISA
introduced with the DLW-1 the DLW ISA. Each successive iteration of our
hypothetical DLW line of computers—the DLW-1 and DLW-2—implements the
DLW ISA using a different microarchitecture. The DLW-1 has only one ALU,
while the DLW-2 is a two-way superscalar implementation of the DLW ISA.
Intel’s x86 hardware followed the same sort of evolution, with each
successive generation becoming more complex while the ISA stayed largely
unchanged. Regarding the Pentium’s inclusion of floating-point hardware,
you might be wondering how the programmer was able to use the floating-point
hardware (i.e., the FPU plus a floating-point register file) if the original
x86 ISA didn’t include any floating-point operations or specify any
floating-point registers. The Pentium’s designers had to make the following
changes to the ISA to accommodate the new functionality:
- First, they had to modify the programming model by adding an FPU and
  floating-point–specific registers.
- Second, they had to extend the instruction set by adding a new group of
  floating-point arithmetic instructions.
These types of ISA extensions are fairly common in the computing world.
Intel extended the original x86 instruction set to include the x87 floating-
point extensions. The x87 included an FPU and a stack-based floating-point
register file, but we’ll talk in more detail about the x87’s stack-based
architecture in the next chapter. Intel later extended x86 again with the
introduction of a vector-processing instruction set called MMX (multimedia
extensions), and again with the introduction of the SSE (streaming SIMD
extensions) and SSE2 instruction sets. (SIMD stands for single instruction,
multiple data and is another way of describing vector computing. We’ll cover
this in more detail in “The Vector Execution Units” on page 168.) Similarly,
Apple, Motorola, and IBM added a set of vector extensions to the PowerPC ISA
in the form of AltiVec, as the extensions are called by Motorola, or VMX,
as they’re called by IBM.
A Brief History of the ISA
Back in the early days of computing, computer makers like IBM didn’t build
a whole line of software-compatible computer systems and aim each system
at a different price/performance point. Instead, each of a manufacturer’s
systems was like each of today’s game consoles, at least from a programmer’s
perspective—programmers wrote directly to the machine’s unique hardware,
with the result that a program written for one machine would run neither on
competing machines nor on other machines from a different product line
put out by the same manufacturer. Just as a Nintendo 64 will run neither
PlayStation games nor older SNES games, programs written for one
circa-1960 machine wouldn’t run on any machine but that one particular
product from that one particular manufacturer. The programming model
was different for each machine, and the code was fitted directly to the hard-
ware like a key fits a lock (see Figure 4-6).
Figure 4-6: Software was custom-fitted to each generation of hardware
The problems this situation posed are obvious. Every time a new machine
came out, software developers had to start from scratch. You couldn’t reuse
programs, and programmers had to learn the intricacies of each new piece
of hardware in order to code for it. This cost quite a bit of time and money,
making software development a very expensive undertaking. This situation
presented computer system designers with the following problem: How do
you expose (make available) the functionality of a range of related hardware
systems in a way that allows software to be easily developed for and ported
between those systems? IBM solved this problem in the 1960s with the launch
of the IBM System/360, which ushered in the era of modern computer
architecture. The System/360 introduced the concept of the ISA as a layer
of abstraction—or an interface, if you will—separated from a particular
processor’s microarchitecture (see Figure 4-7). This means that the infor-
mation the programmer needed to know to program the machine was
abstracted from the actual hardware implementation of that machine.
Once the design and specification of the instruction set, or the set of
instructions available to a programmer for writing programs, was separated
from the low-level details of a particular machine’s design, programs written
for a particular ISA could run on any machine that implemented that ISA.
Thus the ISA provided a standardized way to expose the features of a
system’s hardware that allowed manufacturers to innovate and fine-tune that
hardware for performance without worrying about breaking the existing
software base. You could release a first-generation product with a particular
ISA, and then work on speeding up the implementation of that same ISA for
the second-generation product, which would be backward-compatible with
the first generation. We take all this for granted now, but before the IBM
System/360, binary compatibility between different machines of different
generations didn’t exist.
Figure 4-7: The ISA sits between the software and the hardware, providing a
consistent interface to the software across hardware generations.
The blue layer in Figure 4-7 simply represents the ISA as an abstract
model of a machine for which a programmer writes programs. As mentioned
earlier, the technical innovation that made this abstract layer possible was
something called the microcode engine. A microcode engine is sort of like a
CPU within a CPU. It consists of a tiny bit of storage, the microcode ROM,
which holds microcode programs, and an execution unit that executes those
programs. The job of each of these microcode programs is to translate a
particular instruction into a series of commands that controls the internal
parts of the chip. When a System/360 instruction is executed, the microcode
unit reads the instruction in, accesses the portion of the microcode ROM
where that instruction’s corresponding microcode program is located, and
then produces a sequence of machine instructions, in the processor’s internal
instruction format, that orchestrates the dance of memory accesses and
functional unit activations that actually does the number crunching (or
whatever else) the architectural instruction has commanded the machine to do.
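Conceptually, the microcode engine is a lookup-and-playback mechanism, which can be modeled in a few lines. This is a toy illustration of the idea only: the opcode and internal command names below are invented, not drawn from the System/360 or any other real machine.

```python
# Toy model of a microcode engine: a ROM maps each architectural opcode
# to a little program of internal commands, which the engine plays back.
MICROCODE_ROM = {
    # Hypothetical micro-programs; all names here are made up.
    "ADD_MEM": ["gen_address", "read_memory", "alu_add", "write_register"],
    "STORE":   ["gen_address", "read_register", "write_memory"],
}

def decode(architectural_opcode):
    """Return the sequence of internal commands for one ISA instruction."""
    return MICROCODE_ROM[architectural_opcode]

# One ISA-level instruction expands into four internal commands.
steps = decode("ADD_MEM")
```

The hardware behind the ROM can change completely between product generations; as long as the micro-programs are rewritten to match, the architectural opcodes keep working unchanged.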
By decoding instructions this way, all programs are effectively running in
emulation. This means that the ISA represents a sort of idealized model,
emulated by the underlying hardware, on the basis of which programmers can
design applications. This emulation means that between iterations of a
product line, a vendor can change the way their CPU executes a program; all
they have to do is rewrite the microcode programs, and the programmer never
has to be aware of the hardware differences because the ISA hasn’t changed
a bit. Microcode engines still show up in modern
CPUs. AMD’s Athlon processor uses one for the part of its decoding path that