Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture

Author: Jon Stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors

it’s accessed through a special interface that allows the ALU to read from or write to specific registers. This interface consists of a data bus and two types of ports: the read ports and the write ports. In order to read a value from a single register in the register file, the ALU accesses the register file’s read port and requests that the data from a specific register be placed on the special internal data bus that the register file shares with the ALU. Likewise, writing to the register file is done through the file’s write port.

A single read port allows the ALU to access a single register at a time, so in order for an ALU to read from two registers simultaneously (as in the case of a three-operand add instruction), the register file must have two read ports. Likewise, a write port allows the ALU to write to only one register at a time, so a single ALU needs a single write port in order to be able to write the results of an operation back to a register. Therefore, the register file needs two read ports and one write port for each ALU. So for the two-ALU superscalar design, the register file needs a total of four read ports and two write ports.
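
The port arithmetic in this paragraph can be written out as a minimal sketch. The function and constant names below are illustrative only and do not come from the book; the per-ALU counts are the ones stated in the text.

```python
# Port counts per ALU as stated in the text: two read ports, one write port.
READ_PORTS_PER_ALU = 2
WRITE_PORTS_PER_ALU = 1


def register_file_ports(num_alus):
    """Return (read_ports, write_ports) for a register file serving num_alus ALUs."""
    return num_alus * READ_PORTS_PER_ALU, num_alus * WRITE_PORTS_PER_ALU


for alus in (1, 2):
    reads, writes = register_file_ports(alus)
    print(f"{alus} ALU(s): {reads} read ports, {writes} write ports")
```

For the two-ALU design, this gives the four read ports and two write ports mentioned above.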

It so happens that the amount of die space that the register file takes up increases approximately with the square of the number of ports, so there is a practical limit on the number of ports that a given register file can support. This is one of the reasons why modern CPUs use separate register files to store integer, floating-point, and vector numbers. Since each type of math (integer, floating-point, and vector) uses a different type of execution unit, attaching multiple integer, floating-point, and vector execution units to a single register file would result in quite a large file.
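
To get a rough feel for why splitting the register file helps, here is a small sketch that compares one shared file against separate files. It assumes the idealized rule that area grows exactly with the square of the port count, and it ignores differences in register count and width. The mix of execution units is hypothetical and not taken from any real processor.

```python
PORTS_PER_UNIT = 3  # two read ports + one write port per execution unit


def area_units(num_units):
    """Relative die area of one register file, assuming area ~ (port count) squared."""
    ports = num_units * PORTS_PER_UNIT
    return ports ** 2


# Hypothetical mix of execution units (an assumption for illustration).
units = {"integer": 2, "floating-point": 1, "vector": 1}

shared = area_units(sum(units.values()))             # one file serving all units
split = sum(area_units(n) for n in units.values())   # one file per data type

print(f"one shared file: {shared} area units")
print(f"separate files:  {split} area units")
```

Under these assumptions the shared file comes out at 144 area units against 54 for the three separate files, which is the effect the paragraph describes.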

There’s also another reason for using multiple register files to accommodate different types of execution units. As the size of the register file increases, so does the amount of time it takes to access it. You might recall from “The File-Clerk Model Revisited and Expanded” on page 9 that we assume that register reads and writes happen instantaneously. If a register file gets too large and the register file access latency gets too high, this can slow down register accesses to the point where such access takes up a noticeable amount of time. So instead of using one massive register file shared by all types of numerical data, computer architects use two or three register files connected to a few different types of execution units.

Incidentally, if you’ll recall “Opcodes and Machine Language” on page 19, the DLW-1 used a series of binary numbers to designate which of the four registers an instruction was accessing. Well, in the case of a register file read, these numbers are fed into the register file’s interface in order to specify which of the registers should place its data on the data bus. Taking our two-bit register designations as an example, a port on our four-register file would have two lines that would be held at either high or low voltages (depending on whether the bit placed on each line was a 1 or a 0), and these lines would tell the file which of its registers should have its data placed on the data bus.
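
The selection mechanism can be modeled in a few lines. This is only a toy sketch of a four-register file with one read port; the class and method names are hypothetical, and the register contents are arbitrary sample values.

```python
class RegisterFile:
    """Toy four-register file with a single read port (names are illustrative)."""

    def __init__(self, values):
        if len(values) != 4:
            raise ValueError("this sketch models exactly four registers")
        self.registers = list(values)

    def read(self, line1, line0):
        """Return the register selected by two select lines (each 0 or 1)."""
        index = (line1 << 1) | line0  # two bits pick one of four registers
        return self.registers[index]


rf = RegisterFile([10, 20, 30, 40])
print(rf.read(0, 0))  # select lines 00 -> first register: 10
print(rf.read(1, 0))  # select lines 10 -> third register: 30
```

Each argument to `read` stands in for one of the two lines held at a high or low voltage, and the returned value stands in for the data placed on the data bus.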

Control Hazards

Control hazards, also known as branch hazards, are hazards that arise when the processor arrives at a conditional branch and has to decide which instruction to fetch next. In more primitive processors, the pipeline stalls while the branch condition is evaluated and the branch target is calculated. This stall inserts a few cycles of bubbles into the pipeline, depending on how long it takes the processor to identify and locate the branch target instruction. Modern processors use a technique called branch prediction to get around these branch-related stalls. We’ll discuss branch prediction in more detail in the next chapter.
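
As a back-of-the-envelope sketch of what these stalls cost, the following function counts cycles for a simple pipeline that otherwise completes one instruction per cycle. The instruction count, branch count, and stall length are made-up numbers chosen only to make the arithmetic easy to follow.

```python
def cycles_with_branch_stalls(num_instructions, num_branches, stall_cycles):
    """Cycles to complete a program on an ideal one-instruction-per-cycle
    pipeline that stalls stall_cycles on every conditional branch.
    Pipeline fill time is ignored to keep the arithmetic simple."""
    return num_instructions + num_branches * stall_cycles


# Hypothetical workload: 1,000 instructions, 200 of them conditional branches,
# each branch inserting 3 bubble cycles.
total = cycles_with_branch_stalls(1000, 200, 3)
print(total)         # 1600
print(total / 1000)  # 1.6 cycles per instruction instead of 1.0
```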

Another potential problem associated with branches lies in the fact that once the branch condition is evaluated and the address of the next instruction is loaded into the program counter, it then takes a number of cycles to actually fetch the next instruction from storage. This instruction load latency is added to the branch condition evaluation latency discussed earlier in this section. Depending on where the next instruction is located (such as in a nearby cache, in main memory, or on a hard disk), it can take anywhere from a few cycles to thousands of cycles to fetch the instruction. The cycles that the processor spends waiting on that instruction to show up are dead, wasted cycles that show up as bubbles in the processor’s pipeline and kill performance. Computer architects use instruction caching to alleviate the effects of load latency, and we’ll talk more about this technique in the next chapter.
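
The way the two latencies add up can be sketched as follows. Every cycle count here is an assumed placeholder chosen to span "a few cycles to thousands of cycles"; none of them is a measurement of real hardware.

```python
BRANCH_EVAL_CYCLES = 3  # assumed stall while the branch condition is evaluated

# Assumed instruction load latencies, in cycles, by where the target lives.
LOAD_LATENCY_CYCLES = {
    "nearby cache": 2,
    "main memory": 100,
    "hard disk": 10_000,
}


def branch_bubble_cycles(location):
    """Total bubble cycles: evaluation latency plus instruction load latency."""
    return BRANCH_EVAL_CYCLES + LOAD_LATENCY_CYCLES[location]


for location in LOAD_LATENCY_CYCLES:
    print(f"{location}: {branch_bubble_cycles(location)} bubble cycles")
```

With these placeholder values, the load latency dominates the total as soon as the target instruction is not in a nearby cache.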

THE INTEL PENTIUM AND PENTIUM PRO

Now that you’ve got the basics of microprocessor architecture down, let’s look at some real hardware to see how manufacturers implement the two main concepts covered in the previous two chapters, pipelining and superscalar execution, and introduce an entirely new concept: the instruction window. First, we’ll wrap up our discussion of the fundamentals of microprocessors by taking a look at the Pentium. Then we’ll explore in detail the P6 microarchitecture that forms the heart of the Pentium Pro, Pentium II, and Pentium III. The P6 microarchitecture represents a fundamental departure from the microprocessor designs we’ve studied so far, and an understanding of how it works will give you a solid grasp of the most important concepts in modern microprocessor architecture.

The Original Pentium

The original Pentium is an extremely modest design by today’s standards. Transistor budgets were smaller when the chip was introduced in 1993, so the Pentium doesn’t pack nearly as much hardware onto its die as a modern microprocessor. Table 5-1 summarizes its features.

Table 5-1: Summary of Pentium Features

Introduction Date              March 22, 1993
Manufacturing Process          0.8 micron
Transistor Count               3.1 million
Clock Speed at Introduction    60 and 66 MHz
Cache Sizes                    L1: 8KB instruction, 8KB data
x86 ISA Extensions             MMX added in 1997

A glance at a diagram of the Pentium (see Figure 5-1) shows that it has two integer ALUs and a floating-point ALU, along with some other units that I’ll describe later. The Pentium also has a level 1 cache, a component of the microprocessor that you haven’t yet seen. Before moving on, let’s take a moment to look in more detail at this new component, which acts as a code and data storage area for the processor.

[Figure 5-1: The basic microarchitecture of the original Intel Pentium. Blocks shown: Instruction Fetch, Branch Unit (BU), Decode, and Control Unit in the front end; the Integer Unit (SIU (V) and CIU (U)), the Floating-Point Unit (FPU), and Write in the back end.]

Caches

So far, I’ve talked about code and data as if they were all stored in main memory. While this may be true in a limited sense, it doesn’t tell the whole story. Though processor speeds have increased dramatically over the past two decades, the speed of main memory has not been able to keep pace. In every computer system currently on the market, there’s a yawning speed gap between the processor and main memory. It takes such a huge number of processor clock cycles to transfer code and data between main memory and the registers and execution units that if no solution were available to alleviate this bottleneck, it would kill most of the performance gains brought on by the increase in processor clock speeds.

Very fast memory that could close some of the speed gap is indeed available, but it’s too expensive to be widely used in a PC system’s main memory. In fact, as a general rule of thumb, the faster the memory technology, the more it costs per unit of storage. As a result, computer system designers fill the speed gap by placing smaller amounts of faster, more expensive memory, called cache memory, in between main memory and the registers. These caches, which are depicted in Figure 5-2, hold chunks of frequently used code and data, keeping them within easy reach of the processor’s front end.

In most systems, there are multiple levels of cache between main memory and the registers. The level 1 cache (called L1 cache or just L1 for short) is the smallest, most expensive bit of cache, so it’s located the closest to the processor’s back end. Most PC systems have another level of cache, called level 2 cache (L2 cache or just L2), located between the L1 cache and main memory, and some systems even have a third cache level, L3 cache, located between the L2 cache and main memory. In fact, as Figure 5-2 shows, main memory itself is really just a cache for the hard disk drive.

When the processor needs a particular piece of code or data, it first checks the L1 cache to see if the desired item is present. If it is (a situation called a cache hit), it moves that item directly to either the fetch stage (in the case of code) or the register file (in the case of data). If the item is not present (a cache miss), the processor checks the slower but larger L2 cache. If the item is present in the L2, it’s copied into the L1 and passed along to the front end or back end. If there’s a cache miss in the L2, the processor checks the L3, and so on, until there’s either a cache hit, or the cache miss propagates all the way out to main memory.
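
The lookup sequence described above can be sketched with plain dictionaries standing in for each cache level. This is a simplified model and not how real cache hardware is organized: it has no capacity limits or eviction, and the choice to fill every cache level after a trip to main memory is an assumption. The function name, addresses, and stored values are all hypothetical.

```python
def lookup(address, caches, main_memory):
    """Search each cache level in order; on a hit, copy the item into the
    faster levels that missed and return (value, where_it_was_found)."""
    for depth, (name, cache) in enumerate(caches):
        if address in cache:                      # cache hit at this level
            value = cache[address]
            for _, faster in caches[:depth]:      # fill the levels that missed
                faster[address] = value
            return value, name
    value = main_memory[address]                  # miss in every cache level
    for _, cache in caches:                       # assumed fill policy
        cache[address] = value
    return value, "main memory"


l1, l2 = {}, {0x40: "add"}
memory = {0x40: "add", 0x44: "sub"}
caches = [("L1", l1), ("L2", l2)]

print(lookup(0x40, caches, memory))  # ('add', 'L2'): L1 miss, L2 hit
print(lookup(0x40, caches, memory))  # ('add', 'L1'): now present in L1
print(lookup(0x44, caches, memory))  # ('sub', 'main memory')
```

The second call shows the payoff of copying the item into the L1: the same address is now found at the fastest level.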

One popular way of laying out the L1 cache is to have code and data stored in separate halves of the cache. The code half of the cache is often referred to as the instruction cache or I-cache, and the data half of the cache is referred to as the data cache or D-cache. This kind of split cache design has certain performance advantages and is used in all of the processors discussed in this book.

NOTE

The split L1 cache design is often called the Harvard architecture as an homage to the Harvard Mark I. The Mark I was a relay-based computer designed by IBM and shipped to Harvard in 1944, and it was the first machine to incorporate the conceptual split between code and data explicitly into its architecture.

[Figure 5-2: The memory hierarchy of a computer system, from the smallest, fastest, and most expensive memory (the register file) to the largest, slowest, and least expensive (the hard disk). Levels shown: Processor Register File, L1 Cache, Main Memory, Hard Disk Drive.]

Back when transistor budgets were much tighter than they are today, all caches were located somewhere on the computer’s system bus between the CPU and main memory. Today, however, the L1 and L2 caches are commonly integrated onto the CPU die itself, along with the rest of the CPU’s circuitry. An on-die cache has significant performance advantages over an off-die cache and is essential for keeping today’s deeply pipelined superscalar machines full of code and data.

The Pentium’s Pipeline

As you’ve probably already guessed, a superscalar processor doesn’t have just one pipeline. Because its execute stage is split up among multiple execution units that operate in parallel, a processor like the Pentium can be said to have multiple pipelines, one for each execution unit. Figure 5-3 illustrates the Pentium’s multiple pipelines.

[Figure 5-3: The Pentium’s pipelines. Front end: L1 Instruction Cache, Instruction Fetch, Decode-1, Decode-2. Back end: Floating-Point Unit (FPU-1, FPU-2, FPU-3), Integer Unit (SIU-1, CIU-1), Write.]

As you can see, each of the Pentium’s pipelines shares four stages in common:

• Fetch
• Decode-1
• Decode-2
• Write

It’s when an instruction reaches the execute phase of its lifecycle that it enters a more specialized pipeline, specific to the execution unit.

A processor’s various execution units can have different pipeline depths, with the integer pipeline usually being the shortest and the floating-point pipeline usually being the longest. In Figure 5-3, you can see that the Pentium’s two integer ALUs have single-stage pipelines, while the floating-point unit has a three-stage pipeline.
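
Putting the shared stages and the unit-specific execute stages together gives the full path an instruction takes. The sketch below uses only the stage labels that appear in Figure 5-3; the variable and function names are illustrative, and the model says nothing about timing or about stages the figure does not show.

```python
COMMON_FRONT = ["Fetch", "Decode-1", "Decode-2"]
COMMON_BACK = ["Write"]

# Execute stages per unit, using the stage labels from Figure 5-3.
EXECUTE_STAGES = {
    "SIU": ["SIU-1"],
    "CIU": ["CIU-1"],
    "FPU": ["FPU-1", "FPU-2", "FPU-3"],
}


def pipeline(unit):
    """Full list of stages for an instruction executed by the given unit."""
    return COMMON_FRONT + EXECUTE_STAGES[unit] + COMMON_BACK


for unit in EXECUTE_STAGES:
    stages = pipeline(unit)
    print(f"{unit}: {len(stages)} stages: {' -> '.join(stages)}")
```

Counted this way, an integer instruction passes through five stages and a floating-point instruction through seven.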
