Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
it’s accessed through a special interface that allows the ALU to read from or write to specific registers. This interface consists of a data bus and two types of ports: the read ports and the write ports. In order to read a value from a single register in the register file, the ALU accesses the register file’s read port and requests that the data from a specific register be placed on the special internal data bus that the register file shares with the ALU. Likewise, writing to the register file is done through the file’s write port.
A single read port allows the ALU to access a single register at a time, so
in order for an ALU to read from two registers simultaneously (as in the case
of a three-operand add instruction), the register file must have two read ports.
Likewise, a write port allows the ALU to write to only one register at a time,
so a single ALU needs a single write port in order to be able to write the results of an operation back to a register. Therefore, the register file needs two read
ports and one write port for each ALU. So for the two-ALU superscalar design,
the register file needs a total of four read ports and two write ports.
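The port arithmetic above can be sketched in code. The following is a minimal Python model of a register file with a fixed number of read and write ports; the class and method names are my own illustration, not a real hardware description:

```python
# Minimal sketch of a register file: each read port can deliver one
# register's value per cycle, and each write port can update one
# register per cycle. Names and structure are illustrative only.

class RegisterFile:
    def __init__(self, num_registers, num_read_ports, num_write_ports):
        self.regs = [0] * num_registers
        self.num_read_ports = num_read_ports
        self.num_write_ports = num_write_ports

    def read(self, reg_indices):
        # One register per read port, all read in the same cycle.
        assert len(reg_indices) <= self.num_read_ports, "not enough read ports"
        return [self.regs[i] for i in reg_indices]

    def write(self, writes):
        # One (register, value) pair per write port per cycle.
        assert len(writes) <= self.num_write_ports, "not enough write ports"
        for index, value in writes:
            self.regs[index] = value

# One ALU needs 2 read ports + 1 write port, so a two-ALU
# superscalar design needs 4 read ports and 2 write ports.
rf = RegisterFile(num_registers=4, num_read_ports=4, num_write_ports=2)
rf.write([(0, 7), (1, 5)])   # both ALUs write results in one cycle
a, b = rf.read([0, 1])       # one ALU reads its two source operands
print(a + b)                 # 12
```

A three-operand add would exceed a one-read-port file's limit, which is exactly why the second read port per ALU is needed.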
It so happens that the amount of die space that the register file takes up
increases approximately with the square of the number of ports, so there is a
practical limit on the number of ports that a given register file can support.
This is one of the reasons why modern CPUs use separate register files to
store integer, floating-point, and vector numbers. Since each type of math
(integer, floating-point, and vector) uses a different type of execution unit,
attaching multiple integer, floating-point, and vector execution units to a
single register file would result in quite a large file.
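To put rough numbers on that quadratic growth (the exponent and units here are purely illustrative; the real constant factor varies by design):

```python
# If register file area grows roughly with the square of the port
# count, merging execution units onto one big file gets expensive.
# Area units are arbitrary; this is an illustration, not a model
# of any real process technology.
def relative_area(ports):
    return ports ** 2

# Two separate files with 6 ports each vs. one file with 12 ports:
separate = 2 * relative_area(6)   # 2 * 36 = 72 arbitrary units
combined = relative_area(12)      # 144 arbitrary units
print(combined / separate)        # 2.0: the single large file costs twice the area
```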
There’s also another reason for using multiple register files to accommodate different types of execution units. As the size of the register file increases, so does the amount of time it takes to access it. You might recall from “The File-Clerk Model Revisited and Expanded” on page 9 that we assume that register reads and writes happen instantaneously. If a register file gets too large and the register file access latency gets too high, this can slow down register accesses to the point where such access takes up a noticeable amount of time. So instead of using one massive register file for each type of numerical data, computer architects use two or three register files connected to a few different types of execution units.
Incidentally, if you’ll recall “Opcodes and Machine Language” on page 19, the DLW-1 used a series of binary numbers to designate which of the four registers an instruction was accessing. Well, in the case of a register file read, these numbers are fed into the register file’s interface in order to specify which of the registers should place its data on the data bus. Taking our two-bit register designations as an example, a port on our four-register file would have two lines that would be held at either high or low voltages (depending on whether the bit placed on each line was a 1 or a 0), and these lines would tell the file which of its registers should have its data placed on the data bus.
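As a sketch, assuming the DLW-1’s four registers and two-bit designations (the function name is mine), the read port’s two select lines can be modeled as bits that index the register whose value goes onto the bus:

```python
# Model of a single read port on a four-register file: two select
# lines, each held high (1) or low (0), pick which register
# drives the data bus. Illustrative only.
def read_port(registers, line_hi, line_lo):
    assert len(registers) == 4
    index = (line_hi << 1) | line_lo   # the two-bit register designation
    return registers[index]            # the value placed on the data bus

regs = [10, 20, 30, 40]
print(read_port(regs, 1, 0))  # lines 1,0 -> binary 10 -> register 2 -> 30
```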
Control Hazards
Control hazards, also known as branch hazards, are hazards that arise when the processor arrives at a conditional branch and has to decide which instruction to fetch next.
to fetch next. In more primitive processors, the pipeline stalls while the
branch condition is evaluated and the branch target is calculated. This stall
inserts a few cycles of bubbles into the pipeline, depending on how long it
takes the processor to identify and locate the branch target instruction.
Modern processors use a technique called branch prediction to get around these branch-related stalls. We’ll discuss branch prediction in more detail in the next chapter.
Another potential problem associated with branches lies in the fact that
once the branch condition is evaluated and the address of the next instruc-
tion is loaded into the program counter, it then takes a number of cycles to
actually fetch the next instruction from storage. This instruction load latency is added to the branch condition evaluation latency discussed earlier in this
section. Depending on where the next instruction is located—such as in a
nearby cache, in main memory, or on a hard disk—it can take anywhere from
a few cycles to thousands of cycles to fetch the instruction. The cycles that the processor spends waiting on that instruction to show up are dead, wasted
cycles that show up as bubbles in the processor’s pipeline and kill performance.
Computer architects use instruction caching to alleviate the effects of load latency, and we’ll talk more about this technique in the next chapter.
The Intel Pentium and Pentium Pro
Now that you’ve got the basics of microprocessor architecture down, let’s look at some real hardware to see
how manufacturers implement the two main concepts
covered in the previous two chapters—pipelining and
superscalar execution—and introduce an entirely new
concept: the instruction window. First, we’ll wrap up
our discussion of the fundamentals of microprocessors
by taking a look at the Pentium. Then we’ll explore in
detail the P6 microarchitecture that forms the heart of the Pentium Pro,
Pentium II, and Pentium III. The P6 microarchitecture represents a
fundamental departure from the microprocessor designs we’ve studied so
far, and an understanding of how it works will give you a solid grasp of the
most important concepts in modern microprocessor architecture.
The Original Pentium
The original Pentium is an extremely modest design by today’s standards.
Transistor budgets were smaller when the chip was introduced in 1993, so
the Pentium doesn’t pack nearly as much hardware onto its die as a modern
microprocessor. Table 5-1 summarizes its features.
Table 5-1: Summary of Pentium Features

Introduction Date              March 22, 1993
Manufacturing Process          0.8 micron
Transistor Count               3.1 million
Clock Speed at Introduction    60 and 66 MHz
Cache Sizes                    L1: 8KB instruction, 8KB data
x86 ISA Extensions             MMX added in 1997
A glance at a diagram of the Pentium (see Figure 5-1) shows that it has
two integer ALUs and a floating-point ALU, along with some other units that
I’ll describe later. The Pentium also has a level 1 cache—a component of the microprocessor that you haven’t yet seen. Before moving on, let’s take a moment to look in more detail at this new component, which acts as a code and data storage area for the processor.
Figure 5-1: The basic microarchitecture of the original Intel Pentium. (Diagram omitted; it shows a front end with instruction fetch, branch unit, decode, and control unit, and a back end with two integer units, SIU (V) and CIU (U), a floating-point unit, and a write stage.)
Caches
So far, I’ve talked about code and data as if they were all stored in main
memory. While this may be true in a limited sense, it doesn’t tell the whole
story. Though processor speeds have increased dramatically over the past
two decades, the speed of main memory has not been able to keep pace. In
every computer system currently on the market, there’s a yawning speed gap
between the processor and main memory. It takes such a huge number of
processor clock cycles to transfer code and data between main memory and
the registers and execution units that if no solution were available to alleviate this bottleneck, it would kill most of the performance gains brought on by
the increase in processor clock speeds.
Very fast memory that could close some of the speed gap is indeed available, but it’s too expensive to be widely used in a PC system’s main memory.
In fact, as a general rule of thumb, the faster the memory technology, the
more it costs per unit of storage. As a result, computer system designers fill
the speed gap by placing smaller amounts of faster, more expensive memory,
called cache memory, in between main memory and the registers. These caches, which are depicted in Figure 5-2, hold chunks of frequently used code and data, keeping them within easy reach of the processor’s front end.
In most systems, there are multiple levels of cache between main
memory and the registers. The level 1 cache (called L1 cache or just L1 for short) is the smallest, most expensive bit of cache, so it’s located the closest to the processor’s back end. Most PC systems have another level of cache, called level 2 cache (L2 cache or just L2), located between the L1 cache and main memory, and some systems even have a third cache level, L3 cache, located between the L2 cache and main memory. In fact, as Figure 5-2 shows, main memory itself is really just a cache for the hard disk drive.
When the processor needs a particular piece of code or data, it first
checks the L1 cache to see if the desired item is present. If it is—a situation
called a cache hit—it moves that item directly to either the fetch stage (in the case of code) or the register file (in the case of data). If the item is not present—a cache miss—the processor checks the slower but larger L2 cache.
If the item is present in the L2, it’s copied into the L1 and passed along to
the front end or back end. If there’s a cache miss in the L2, the processor
checks the L3, and so on, until there’s either a cache hit, or the cache miss
propagates all the way out to main memory.
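That hit-or-miss walk down the hierarchy can be sketched as follows. This is a deliberately simplified model: real caches move whole cache lines rather than single items, and the structure here is my own illustration:

```python
# Simplified model of a multilevel cache lookup: check each level
# in order, fastest first, and on a hit, copy the item back into
# every faster level that missed, so the next access hits sooner.
def lookup(address, levels):
    # `levels` is an ordered list of dicts, L1 first; the last
    # entry stands in for main memory and is assumed to hold
    # everything we ask for.
    missed = []
    for level in levels:
        if address in level:
            value = level[address]
            for m in missed:          # fill the faster levels on the way back
                m[address] = value
            return value
        missed.append(level)
    raise KeyError(address)

l1, l2 = {}, {}
memory = {0x40: "add r1, r2, r3"}      # backing store has the instruction
print(lookup(0x40, [l1, l2, memory]))  # misses in L1 and L2, hits in memory
print(0x40 in l1)                      # True: the item is now cached in L1
```

A second lookup of the same address would now hit in L1 and never touch the slower levels.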
One popular way of laying out the L1 cache is to have code and data
stored in separate halves of the cache. The code half of the cache is often
referred to as the instruction cache or I-cache, and the data half of the cache is referred to as the data cache or D-cache. This kind of split cache design has certain performance advantages and is used in all of the processors discussed in this book.
NOTE
The split L1 cache design is often called the Harvard architecture as an homage to
the Harvard Mark I. The Mark I was a relay-based computer designed by IBM and
shipped to Harvard in 1944, and it was the first machine to incorporate the conceptual
split between code and data explicitly into its architecture.
Figure 5-2: The memory hierarchy of a computer system, from the smallest, fastest, and most expensive memory (the register file) to the largest, slowest, and least expensive (the hard disk). (Diagram omitted; it shows the levels in order: processor register file, L1 cache, main memory, hard disk drive.)
Back when transistor budgets were much tighter than they are today, all
caches were located somewhere on the computer’s system bus between the
CPU and main memory. Today, however, the L1 and L2 caches are commonly
integrated onto the CPU die itself, along with the rest of the CPU’s circuitry.
An on-die cache has significant performance advantages over an off-die cache
and is essential for keeping today’s deeply pipelined superscalar machines full
of code and data.
The Pentium’s Pipeline
As you’ve probably already guessed, a superscalar processor doesn’t have
just one pipeline. Because its execute stage is split up among multiple exe-
cution units that operate in parallel, a processor like the Pentium can be
said to have multiple pipelines—one for each execution unit. Figure 5-3
illustrates the Pentium’s multiple pipelines.
Figure 5-3: The Pentium’s pipelines. (Diagram omitted; it shows a front end with the L1 instruction cache feeding fetch, decode-1, and decode-2 stages, and a back end with two single-stage integer pipelines, SIU-1 and CIU-1, a three-stage floating-point pipeline, FPU-1 through FPU-3, and a shared write stage.)
As you can see, each of the Pentium’s pipelines shares four stages in common:

- Fetch
- Decode-1
- Decode-2
- Write
It’s when an instruction reaches the execute phase of its lifecycle that it
enters a more specialized pipeline, specific to the execution unit.
A processor’s various execution units can have different pipeline depths,
with the integer pipeline usually being the shortest and the floating-point pipeline usually being the longest. In Figure 5-3, you can see that the Pentium’s two integer ALUs have single-stage pipelines, while the floating-point unit has a
three-stage pipeline.
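As a rough sketch, with stage names taken from Figure 5-3 (the data structure itself is my own illustration), each pipeline is the shared front-end and write stages wrapped around a unit-specific execute stage:

```python
# Each Pentium pipeline = shared stages plus a unit-specific
# execute stage. Stage names follow Figure 5-3; the dict layout
# is illustrative, not a hardware description.
SHARED_FRONT = ["Fetch", "Decode-1", "Decode-2"]
SHARED_BACK = ["Write"]
EXECUTE_STAGES = {
    "SIU": ["SIU-1"],                    # integer pipeline: 1 stage
    "CIU": ["CIU-1"],                    # integer pipeline: 1 stage
    "FPU": ["FPU-1", "FPU-2", "FPU-3"],  # floating-point pipeline: 3 stages
}

def pipeline(unit):
    return SHARED_FRONT + EXECUTE_STAGES[unit] + SHARED_BACK

print(len(pipeline("SIU")))  # 5 stages total
print(len(pipeline("FPU")))  # 7 stages total
```

The floating-point path is two stages longer end to end, which matches the usual pattern of floating-point pipelines being the deepest.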