Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (58 page)

Read Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture Online

Authors: jon stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors

BOOK: Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture
12.7Mb size Format: txt, pdf, ePub

each clock in which there’s a bubble in the write stage, the pipeline’s instruc-

tion throughput is 0 instructions/clock, so its average instruction throughput

for the whole period continues to decline. After the last bubble has worked

its way out of the write stage, then the pipeline begins completing new instruc-

tions again at a rate of one instruction/cycle and its average instruction

throughput begins to climb. And when the processor’s instruction through-

put begins to climb, so does its completion rate and its performance on

programs.

56

Chapter 3

1

0.8

Average

Instruction

0.6

Throughput

(instructions/clock)

0.4

0.2

20

40

60

80

100

Clock Cycles

Figure 3-12: Average instruction throughput of a four-stage pipeline with a 10-cycle stall
Now, 10 or 20 cycles worth of stalls here and there may not seem like

much, but they do add up. Even more important, though, is the fact that the

numbers in the preceding examples would be increased by a factor of 30 or

more in real-world execution scenarios. As of this writing, a processor can

spend from 50 to 120 nanoseconds waiting on data from main memory. For a

3 GHz processor that has a clock cycle time of a fraction of a nanosecond, a

100 ns main memory access translates into a few thousand clock cycles worth

of bubbles—and that’s just for one main memory access out of the many

millions that a program might make over the course of its execution.

In later chapters, we’ll look at the causes of pipeline stalls and the many

tricks that computer architects use to overcome them.

Instruction Latency and Pipeline Stalls

Before closing out our discussion of pipeline stalls, I should introduce

another term that you’ll be seeing periodically throughout the rest of the

book:
instruction latency
. An instruction’s latency is the number of clock cycles it takes for the instruction to pass through the pipeline. For a single-cycle

processor, all instructions have a latency of one clock cycle. In contrast, for

the simple four-stage pipeline described so far, all instructions have a latency of four cycles. To get a visual image of this, take one more look at the blue

instruction in Figure 3-6 earlier in this chapter; this instruction takes four

clock cycles to advance, at a rate of one clock cycle per stage, through each of the four stages of the pipeline. Likewise, instructions have a latency of eight

cycles on an eight-stage pipeline, 12 cycles on a 12-stage pipeline, and so on.

In real-world processors, instruction latency is not necessarily a fixed

number that’s equal to the number of pipeline stages. Because instructions

can get hung up in one or more pipeline stages for multiple cycles, each extra

cycle that they spend waiting in a pipeline stage adds one more cycle to their

latency. So the instruction latencies given in the previous paragraph (i.e., four cycles for a four-stage pipeline, eight cycles for an eight-stage pipeline, and

Pipelined Execution

57

so on) represent
minimum
instruction latencies. Actual instruction latencies in pipelines of any length can be longer than the depth of the pipeline,

depending on whether or not the instruction stalls in one or more stages.

Limits to Pipelining

As you can probably guess, there are some practical limits to how deeply you

can pipeline an assembly line or a processor before the actual speedup in

completion rate that you gain from pipelining starts to become significantly

less than the ideal speedup that you might expect. In the real world, the

different phases of an instruction’s lifecycle don’t easily break down into an

arbitrarily high number of shorter stages of perfectly equal duration. Some

stages are inherently more complex and take longer than others.

But because each pipeline stage must take exactly one clock cycle to

complete, the clock pulse that coordinates all the stages can be no faster

than the pipeline’s slowest stage. In other words, the amount of time it takes

for the slowest stage in the pipeline to complete will determine the length of

the CPU’s clock cycle and thus the length of every pipeline stage. This means

that the pipeline’s slowest stage will spend the entire clock cycle working,

while the faster stages will spend part of the clock cycle idle. Not only does

this waste resources, but it increases each instruction’s overall execution time by dragging out some phases of the lifecycle to take up more time than they

would if the processor was not pipelined—all of the other stages must wait a

little extra time each cycle while the slowest stage plays catch-up.

So, as you slice the pipeline more finely in order to add stages and increase

throughput, the individual stages get less and less uniform in length and

complexity, with the result that the processor’s overall instruction execution

time gets longer. Because of this feature of pipelining, one of the most diffi-

cult and important challenges that the CPU designer faces is that of balancing

the pipeline so that no one stage has to do more work to do than any other.

The designer must distribute the work of processing an instruction evenly to

each stage, so that no one stage takes up too much time and thus slows down

the entire pipeline.

Clock Period and Completion Rate

If the pipelined processor’s clock cycle time, or
clock period
, is longer than its ideal length (i.e., non-pipelined instruction execution time/pipeline depth),

and it always is, then the processor’s completion rate will suffer. If the instruction throughput stays fixed at, say, one instruction/clock, then as the clock

period increases, the completion rate decreases. Because new instructions

can be completed only at the end of each clock cycle, a longer clock cycle translates into fewer instructions completed per nanosecond, which in turn

translates into longer program execution times.

To get a better feel for the relationship between completion rate, instruc-

tion throughput, and clock cycle time, let’s take the eight-stage pipeline from

Figure 3-8 and increase its clock cycle time to 1 ns instead of 0.5 ns. Its first 9 ns of execution would then look as in Figure 3-13.

58

Chapter 3

1

2

3

4

5

6

7

8

9

Stored

Instructions

CPU

Fetch1

Fetch2

Decode

Execute

Data1

Data2

Tag Ch.

Write

Completed

Instructions

Figure 3-13: An eight-stage pipeline with a 1 ns clock period

As you can see, the instruction execution time has now increased from

an original time of 4 ns to a new time of 8 ns, which means that the eight-

stage pipeline does not complete its first instruction until the end of the

eighth nanosecond. Once the pipeline is full, the processor pictured in

Figure 3-13 begins completing instructions at a rate of one instruction per

nanosecond. This completion rate is half the completion rate of the ideal

eight-stage pipeline with the 0.5 ns clock cycle time. It’s also the exact same

completion rate as the one instruction/ns completion rate of the ideal four-

stage pipeline. In short, the longer clock cycle time of the new eight-stage

pipeline has robbed the deeper pipeline of its completion rate advantage.

Furthermore, the eight-stage pipeline now takes twice as long to fill.

Take a look at the graph in Figure 3-14 to see what this doubled execu-

tion time does to the eight-stage pipeline’s average completion rate curve

versus the same curve for a four-stage pipeline.

It takes longer for the slower eight-stage pipeline to fill up, which means

that its average completion rate—and hence its performance—ramps up

more slowly when the pipeline is first filled with instructions. There are many

situations in which a processor that’s executing a program must flush its

pipeline entirely and then begin refilling it from a different point in the

code stream. In such instances, that slower-ramping completion rate curve

causes a performance hit.

Pipelined Execution

59

1

0.8

Average

Completion

0.6

Rate

(instructions/ns)

4-stage pipeline

0.4

8-stage pipeline

0.2

20

40

60

80

100

Time (ns)

Figure 3-14: Average instruction completion rate for four- and eight-stage

pipelines with a 1 ns clock period

In the end, the performance gains brought about by pipelining depend

on two things:

1.

Pipeline stalls must be avoided. As you’ve seen earlier, pipeline stalls

cause the processor’s completion rate and performance to drop. Main

memory accesses are a major cause of pipeline stalls, but this problem

can be alleviated significantly by the use of caching. We’ll cover caching

in detail in Chapter 11. Other causes of stalls, like the various types of

hazards, will be covered at the end of Chapter 4.

2.

Pipeline refills must be avoided. Flushing the pipeline and refilling it

again takes a serious toll on both completion rate and performance.

Once the pipeline is full, it must remain full for as long as possible if the

average completion rate is to be kept up.

When we look more closely at real-world pipelined processors in later

chapters, you’ll see these two issues come up again and again. In fact, much

of the rest of the book will be about how the different architectures under

discussion work to keep their pipelines full by preventing stalls and ensuring

a continuous and uninterrupted flow of instructions into the processor from

the code storage area.

The Cost of Pipelining

In addition to the inherent limits to performance improvement that we’ve

just looked at, pipelining requires a nontrivial amount of extra bookkeeping

and buffering logic to implement, so it incurs an overhead cost in transis-

tors and die space. Furthermore, this overhead cost increases with pipeline

depth, so that a processor with a very deep pipeline (for example, Intel’s

Pentium 4) spends a significant amount of its transistor budget on pipeline-

related logic. These overhead costs place some practical constraints on how

deeply you can pipeline a processor. I’ll say a lot more about such constraints

in the chapters covering the Pentium line and the Pentium 4.

60

Chapter 3

S U P E R S C A L A R E X E C U T I O N

Chapters 1 and 2 described the processor as it is visible

to the programmer. The register files, the processor

status word (PSW), the arithmetic logic unit (ALU),

and other parts of the programming model are all

there to provide a means for the programmer to

manipulate the processor and make it do useful work.

In other words, the programming model is essentially

a user interface for the CPU.

Much like the graphical user interfaces on modern computer systems,

there’s a lot more going on under the hood of a microprocessor than the

simplicity of the programming model would imply. In Chapter 12, I’ll talk

about the various ways in which the operating system and processor collab-

orate to fool the user into thinking that he or she is executing multiple pro-

grams at once. There’s a similar sort of trickery that goes on beneath the

programming model in a modern microprocessor, but it’s intended to fool

the programmer into thinking that there’s only one thing going on at a time,

when really there are multiple things happening simultaneously. Let me

explain.

Back in the days when computer designers could fit relatively few

transistors on a single piece of silicon, many parts of the programming

model actually resided on separate chips attached to a single circuit board.

For instance, one chip contained the ALU, another chip contained the

control unit, still another chip contained the registers, and so on. Such

computers were relatively slow, and the fact that they were made of multiple

chips made them expensive. Each chip had its own manufacturing and

packaging costs, so the more chips you put on a board, the more expensive

the overall system was. (Note that this is still true today. The cost of pro-

ducing systems and components can be drastically reduced by packing the

functionality of multiple chips into a single chip.)

With the advent of the Intel 4004 in 1971, all of that changed. The 4004

was the world’s first microprocessor on a chip. Designed to be the brains of

a calculator manufactured by a now defunct company named Busicom, the

4004 had 16 four-bit registers, an ALU, and decoding and control logic all

packed onto a single, 2,300-transistor chip. The 4004 was quite a feat for its

day, and it paved the way for the PC revolution. However, it wasn’t until Intel

released the 8080 four years later that the world saw the first true general-

Other books

Better Not Love Me by Kolbet, Dan
Absolution by Jambrea Jo Jones
The Fearless by Emma Pass
Tenth of December by George Saunders
A Touch of Dead by Charlaine Harris
The Doctor's Proposal by Marion Lennox
Amanda's Beau by Shirley Raye Redmond
Bobby's Diner by Wingate, Susan