Author: Jon Stokes
Figure 3-5: The lifecycle of an SUV in a pipelined factory
So as the assembly line begins to fill up with SUVs in various stages of
production, more of the crews are put to work simultaneously until all of the
crews are working on a different vehicle in a different stage of production.
(Of course, this is how most of us nowadays in the post-Ford era expect a
good, efficient assembly line to work.) If we can keep the assembly line full
and keep all five crews working at once, we can produce one SUV every hour:
a fivefold improvement over the previous completion rate of one SUV every five hours. That, in a nutshell, is pipelining.
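To put numbers on this, here is a small sketch of my own (the five crews and one hour per stage come from the analogy; the order size of ten SUVs is made up for illustration):

```python
# Compare a non-pipelined factory (one crew builds each SUV start to
# finish) with a five-stage pipelined assembly line.

STAGES = 5       # five crews, one per production stage
STAGE_HOURS = 1  # each stage takes one hour
ORDER = 10       # hypothetical order size

# Non-pipelined: SUVs are built one at a time, five hours apiece.
non_pipelined = ORDER * STAGES * STAGE_HOURS

# Pipelined: five hours to fill the line with the first SUV, then
# one SUV rolls off the line every hour after that.
pipelined = STAGES * STAGE_HOURS + (ORDER - 1) * STAGE_HOURS

print(non_pipelined, pipelined)  # 50 hours vs. 14 hours
```

Note that the larger the order, the closer the pipelined factory gets to its ideal of one SUV per hour.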
Chapter 3
While the total amount of time that each individual SUV spends in production has not changed from the original five hours, the rate at which the
factory as a whole completes SUVs has increased drastically. Furthermore,
the rate at which the factory can fulfill the Army’s orders for batches of
SUVs has increased drastically, as well. Pipelining works its magic by making
optimal use of already existing resources. We don’t need to speed up each
individual stage of the production process, nor do we need to drastically
increase the amount of resources that we throw at the problem; all that’s
necessary is that we get more work out of resources that are already there.
WHY THE SUV FACTORY?
The preceding discussion uses a factory analogy to explain pipelining. Other books use simpler analogies, like doing laundry, for instance, to explain this technique, but there are a few reasons why I chose a more elaborate and lengthy analogy to
illustrate what is a relatively simple concept. First, I use factory analogies throughout this book, because assembly line-based factories are easy for readers to visualize and there’s plenty of room for filling out the mental image in interesting ways in order to make a variety of related points. Second, and perhaps even more importantly, the many scheduling-, queuing- and resource management–related problems
that factory designers face have direct analogies in computer architecture. In many cases, the problems and solutions are exactly the same, simply translated into a different domain. (Similar queuing-related problem/solution pairs also crop up in the service industry, which is why analogies involving supermarkets and fast food restaurants are also favorites of mine.)
Applying the Analogy
Bringing our discussion back to microprocessors, it should be easy to see
how this concept applies to the four phases of an instruction’s lifecycle. Just
as the owners of the factory in our analogy wanted to increase the number of
SUVs that the factory could finish in a given period of time, microprocessor
designers are always looking for ways to increase the number of instructions
that a CPU can complete in a given period of time. When you recall that a
program is an ordered sequence of instructions, it becomes clear that
increasing the number of instructions executed per unit time is one way
to decrease the total amount of time that it takes to execute a program.
(The other way to decrease a program’s execution time is to decrease the
number of instructions in the program, but this chapter won’t address that
approach until later.) In terms of our analogy, a program is like an order
of SUVs from the military; just like increasing our factory’s output of SUVs
per hour enabled us to fill orders faster, increasing a processor's instruction completion rate (the number of instructions completed per unit time) enables it to run programs faster.
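As a rough sketch (the instruction count below is made up for illustration), the relationship between completion rate and program execution time is just a division:

```python
# A program is an ordered sequence of instructions, so (ignoring
# everything else) its execution time is instruction count divided
# by the processor's completion rate. Hypothetical numbers.

instructions = 1_000_000   # instructions in the program
rate = 0.25                # instructions completed per ns

execution_time_ns = instructions / rate
print(execution_time_ns)   # 4000000.0 ns, i.e., 4 ms
```

Doubling the completion rate halves the execution time, which is why designers chase it so hard.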
A Non-Pipelined Processor
The previous chapter briefly described how simple processors like the DLW-1 use the clock to time their internal operations. These
non-pipelined processors work on one instruction at a time, moving each instruction through all four phases of its lifecycle during the course of one clock cycle. Thus non-pipelined processors are also called single-cycle processors, because all instructions take exactly one clock cycle to execute fully (i.e., to pass through all four phases of their lifecycles).
Because the processor completes instructions at a rate of one per clock
cycle, you want the CPU’s clock to run as fast as possible so that the processor’s instruction completion rate can be as high as possible.
Thus you need to calculate the maximum amount of time that it takes
to complete an instruction and make the clock cycle time equivalent to that
length of time. It just so happens that on the hypothetical example CPU, the
four phases of the instruction’s lifecycle take a total of 4 ns to complete. Therefore, you should set the duration of the CPU clock cycle to 4 ns, so that the
CPU can complete the instruction's lifecycle—from fetch to write-back—in a single clock. (A CPU clock cycle is often just called a clock for short.) In Figure 3-6, the blue instruction leaves the code storage area, enters
the processor, and then advances through the phases of its lifecycle over the
course of the 4 ns clock period, until at the end of the fourth nanosecond, it
completes the last phase and its lifecycle is over. The end of the fourth nanosecond is also the end of the first clock cycle, so now that the first clock cycle is finished and the blue instruction has completed its execution, the red
instruction can enter the processor at the start of a new clock cycle and go
through the same process. This 4 ns sequence of steps is repeated until, after
a total of 16 ns (or four clock cycles), the processor has completed all four
instructions at a completion rate of 0.25 instructions/ns (= 4 instructions/
16 ns).
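The single-cycle numbers above can be checked with a short sketch (the four color-named instructions stand in for the ones in Figure 3-6):

```python
# Model the single-cycle processor: each instruction occupies the
# entire processor for one full 4 ns clock cycle, passing through
# fetch, decode, execute, and write before the next one can enter.

CYCLE_NS = 4
program = ["blue", "red", "green", "purple"]

elapsed_ns = 0
completed = []
for instr in program:
    elapsed_ns += CYCLE_NS   # one full lifecycle per clock cycle
    completed.append(instr)

rate = len(completed) / elapsed_ns
print(elapsed_ns, rate)      # 16 ns, 0.25 instructions/ns
```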
Figure 3-6: A single-cycle processor
Single-cycle processors like the one in Figure 3-6 are simple to design, but
they waste a lot of hardware resources. All of that white space in the diagram
represents processor hardware that’s sitting idle while it waits for the instruction that’s currently in the processor to finish executing. By pipelining the
processor in this figure, you can put more of that hardware to work every
nanosecond, thereby increasing the processor's efficiency and its performance when executing programs.
Before moving on, I should clarify a few concepts illustrated in Figure 3-6.
At the bottom is a region labeled “Completed Instructions.” Completed
instructions don’t actually go anywhere when they’re finished executing;
once they’ve done their job of telling the processor how to modify the
data stream, they’re simply deleted from the processor. So the “Completed
Instructions” box does not represent a real part of the computer, which
is why I’ve placed a dotted line around it. This area is just a place for you
to keep track of how many instructions the processor has completed in
a certain amount of time, or the processor’s instruction completion rate
(or completion rate, for short), so that when you compare different types of processors, you'll have a place where you can quickly see which processor
performs better. The more instructions a processor completes in a set
amount of time, the better it performs on programs, which are an ordered
sequence of instructions. Think of the “Completed Instructions” box as a
sort of scoreboard for tracking each processor’s completion rate, and check
the box in each of the subsequent figures to see how long it takes for the
processor to populate this box.
Following on the preceding point, you may be curious as to why the
blue instruction that has completed in the fourth nanosecond does not
appear in the “Completed Instructions” box until the fifth nanosecond.
The reason is straightforward and stems from the nature of the diagram.
Because an instruction spends one complete nanosecond, from start to finish, in each stage of execution, the blue instruction enters the write phase at the beginning of the fourth nanosecond and exits the write phase at the end of the fourth nanosecond. This means that the fifth nanosecond is the first
full nanosecond in which the blue instruction stands completed. Thus at
the beginning of the fifth nanosecond (which coincides with the end of the
fourth nanosecond), the processor has completed one instruction.
A Pipelined Processor
Pipelining a processor means breaking down its instruction execution
process—what I've been calling the instruction's lifecycle—into a series of discrete pipeline stages that can be completed in sequence by specialized hardware. Recall the way that we broke down the SUV assembly process into five discrete steps—with one dedicated crew assigned to complete each step—and you'll get the idea.
Because an instruction’s lifecycle consists of four fairly distinct phases, you
can start by breaking down the single-cycle processor’s instruction execution
process into a sequence of four discrete pipeline stages, where each pipeline
stage corresponds to a phase in the standard instruction lifecycle:
Stage 1: Fetch the instruction from code storage.
Stage 2: Decode the instruction.
Stage 3: Execute the instruction.
Stage 4: Write the results of the instruction back to the register file.
Note that the number of pipeline stages is called the pipeline depth. So the four-stage pipeline has a pipeline depth of four.
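In code, the four stages and the pipeline depth might be sketched as nothing more than a list and its length:

```python
# The four pipeline stages from the text; the pipeline depth is
# simply the number of stages.
PIPELINE_STAGES = ["fetch", "decode", "execute", "write"]
PIPELINE_DEPTH = len(PIPELINE_STAGES)

print(PIPELINE_DEPTH)  # 4
```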
For convenience’s sake, let’s say that each of these four pipeline stages
takes exactly 1 ns to finish its work on an instruction, just like each crew in
our assembly line analogy takes one hour to finish its portion of the work on
an SUV. So the original single-cycle processor’s 4 ns execution process is now
broken down into four discrete, sequential pipeline stages of 1 ns each in
length.
Now let’s step through another diagram together to see how a pipelined
CPU would execute the four instructions depicted in Figure 3-7.
Figure 3-7: A four-stage pipeline
At the beginning of the first nanosecond, the blue instruction enters
the fetch stage. After that nanosecond is complete, the second nanosecond
begins and the blue instruction moves on to the decode stage, while the
next instruction, the red one, starts to make its way from code storage to the
processor (i.e., it enters the fetch stage). At the start of the third nanosecond, the blue instruction advances to the execute stage, the red instruction
advances to the decode stage, and the green instruction enters the fetch
stage. At the start of the fourth nanosecond, the blue instruction advances to the write stage, the red instruction advances to the execute stage, the green instruction advances to the decode stage, and the purple instruction advances to
the fetch stage. After the fourth nanosecond has fully elapsed and the fifth
nanosecond starts, the blue instruction has passed from the pipeline and is
now finished executing. Thus we can say that at the end of 4 ns (= four clock
cycles), the pipelined processor depicted in Figure 3-7 has completed one
instruction.
At the start of the fifth nanosecond, the pipeline is now full and the processor
can begin completing instructions at a rate of one instruction per nanosecond.
This one instruction/ns completion rate is a fourfold improvement over
the single-cycle processor’s completion rate of 0.25 instructions/ns (or four
instructions every 16 ns).
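A minimal cycle-by-cycle simulation (my own sketch, not anything from the book) reproduces the schedule in Figure 3-7: the blue instruction completes at the end of the fourth nanosecond, and one instruction completes every nanosecond after that:

```python
# Simulate a four-stage pipeline with a 1 ns clock. Each cycle, every
# in-flight instruction advances one stage and the fetch stage takes
# the next waiting instruction; whatever sits in the write stage
# finishes at the end of that cycle.

STAGES = ["fetch", "decode", "execute", "write"]

def simulate(program):
    pipeline = [None] * len(STAGES)   # what each stage holds this cycle
    waiting = list(program)
    completed = []                    # (instruction, ns when it finished)
    ns = 0
    while waiting or any(stage is not None for stage in pipeline):
        # Clock edge: everything shifts one stage; fetch takes a new
        # instruction if one is waiting.
        pipeline = [waiting.pop(0) if waiting else None] + pipeline[:-1]
        ns += 1
        if pipeline[-1] is not None:              # in write this cycle
            completed.append((pipeline[-1], ns))  # done at end of cycle
            pipeline[-1] = None
    return completed

print(simulate(["blue", "red", "green", "purple"]))
# [('blue', 4), ('red', 5), ('green', 6), ('purple', 7)]
```

The four-instruction program finishes in 7 ns instead of the single-cycle processor's 16 ns, and every instruction after the first completes just one nanosecond behind its predecessor.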
Shrinking the Clock
You can see from Figure 3-7 that the role of the CPU clock changes slightly
in a pipelined processor, compared to the single-cycle processor shown in
Figure 3-6. Because all of the pipeline stages must now work together
simultaneously and be ready at the start of each new nanosecond to hand
over the results of their work to the next pipeline stage, the clock is needed
to coordinate the activity of the whole pipeline. The way this is done is
simple: Shrink the clock cycle time to match the time it takes each stage to
complete its work so that at the start of each clock cycle, each pipeline stage
hands off the instruction it was working on to the next stage in the pipeline.
Because each pipeline stage in the example processor takes 1 ns to complete
its work, you can set the clock cycle to be 1 ns in duration.
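The clocking rule described above can be sketched as follows (the stage delays are the chapter's uniform 1 ns; in a real design they differ, and the slowest stage sets the clock):

```python
# The clock period must be long enough for the slowest pipeline stage,
# since every stage hands its instruction to the next stage at each
# clock edge. Here all four stages take 1 ns, as in the text.

stage_delays_ns = {"fetch": 1.0, "decode": 1.0, "execute": 1.0, "write": 1.0}

clock_period_ns = max(stage_delays_ns.values())  # 1.0 ns clock cycle
frequency_ghz = 1.0 / clock_period_ns            # 1.0 GHz

# Latency vs. throughput: each instruction still spends the whole
# pipeline depth in flight, but one completes per clock once the
# pipeline is full.
latency_ns = clock_period_ns * len(stage_delays_ns)  # 4.0 ns in flight
throughput_per_ns = 1.0 / clock_period_ns            # 1.0 instruction/ns

print(clock_period_ns, latency_ns, throughput_per_ns)
```

This captures the trade the chapter is making: the clock shrinks from 4 ns to 1 ns, each instruction's own latency stays at 4 ns, but the completion rate quadruples.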
This new method of clocking the processor means that a new instruction