Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
will not necessarily be completed at the close of each clock cycle, as was the
case with the single-cycle processor. Instead, a new instruction will be completed at the close of only those clock cycles in which the write stage has been working on an instruction. Any clock cycle with an empty write stage will add
no new instructions to the “Completed Instructions” box, and any clock cycle
with an active write stage will add one new instruction to the box. Of course,
this means that when the pipeline first starts to work on a program, there will
be a few clock cycles—three to be exact—during which no instructions are
completed. But once the fourth clock cycle starts, the first instruction enters
the write stage and the pipeline can then begin completing new instructions
on each clock cycle, which, because each clock cycle is 1 ns, translates into a
completion rate of one instruction per nanosecond.
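To make that bookkeeping concrete, here is a minimal sketch of the completion count for the four-stage pipeline; the function name and the max-based formula are our own modeling choices, not anything taken from real hardware:

```python
# Toy model of the four-stage pipeline described above. Each stage takes
# one 1 ns clock cycle, so the first instruction completes at the end of
# the fourth cycle, and one more instruction completes every cycle after.

def completed_after(ns: int, stages: int = 4) -> int:
    """Instructions completed after `ns` cycles of 1 ns each,
    assuming the pipeline is fed a new instruction every cycle."""
    # During the first (stages - 1) cycles the pipeline is still filling,
    # so nothing has reached the write stage yet.
    return max(0, ns - stages + 1)

for ns in range(1, 9):
    print(f"after {ns} ns: {completed_after(ns)} completed")
```

Note that `completed_after(8)` is 5, which matches the count you can read off the "Completed Instructions" box of Figure 3-7.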
Shrinking Program Execution Time
Note that the total execution time for each individual instruction is not
changed by pipelining. It still takes an instruction 4 ns to make it all the way through the processor; that 4 ns can be split up into four clock cycles of 1 ns
each, or it can cover one longer clock cycle, but it's still the same 4 ns. Thus pipelining doesn't speed up instruction execution time, but it does speed up program execution time (the number of nanoseconds that it takes to execute an entire program) by increasing the number of instructions finished per unit of time. Just like pipelining our hypothetical SUV assembly line allowed us to
fill the Army’s orders in a shorter span of time, even though each individual
SUV still spent a total of five hours in the assembly line, so does pipelining
allow a processor to execute programs in a shorter amount of time, even
though each individual instruction still spends the same amount of time
traveling through the CPU. Pipelining makes more efficient use of the CPU’s
existing resources by putting all of its units to work simultaneously, thereby
allowing it to do more total work each nanosecond.
The Speedup from Pipelining
In general, the speedup in completion rate versus a single-cycle implementation that's gained from pipelining is ideally equal to the number of pipeline stages. A four-stage pipeline yields a fourfold speedup in the completion rate versus a single-cycle design, a five-stage pipeline yields a fivefold speedup, a twelve-stage pipeline yields a twelvefold speedup, and so on. This speedup is possible because the more pipeline stages there are in a processor, the more instructions the processor can work on simultaneously, and the more instructions it can complete in a given period of time. So the more finely you can slice those four phases of the instruction's lifecycle, the more of the hardware that's used to implement those phases you can put to work at any given moment.
To return to our assembly line analogy, let’s say that each crew is made
up of six workers, and that each of the hour-long tasks that each crew per-
forms can be readily subdivided into two shorter, 30-minute tasks. So we can
double our factory’s throughput by splitting each crew into two smaller, more
specialized crews of three workers each, and then having each smaller crew
perform one of the shorter tasks on one SUV per 30 minutes.
Stage 1: Build the chassis.
  - Crew 1a: Fit the parts of the chassis together and spot-weld the joints.
  - Crew 1b: Fully weld all the parts of the chassis.
Stage 2: Drop the engine into the chassis.
  - Crew 2a: Place the engine into the chassis and mount it in place.
  - Crew 2b: Connect the engine to the moving parts of the car.
Stage 3: Put the doors, a hood, and coverings on the chassis.
  - Crew 3a: Put the doors and hood on the chassis.
  - Crew 3b: Put the other coverings on the chassis.
Stage 4: Attach the wheels.
  - Crew 4a: Attach the two front wheels.
  - Crew 4b: Attach the two rear wheels.
Stage 5: Paint the SUV.
  - Crew 5a: Paint the sides of the SUV.
  - Crew 5b: Paint the top of the SUV.
After the modifications described here, the 10 smaller crews in our
factory now have a collective total of 10 SUVs in progress during the course
of any given 30-minute period. Furthermore, our factory can now complete
a new SUV every 30 minutes, a tenfold improvement over our first factory’s
completion rate of one SUV every five hours. So by pipelining our assembly
line even more deeply, we've put even more of its workers to work concurrently, thereby increasing the number of SUVs that can be worked on
simultaneously and increasing the number of SUVs that can be completed
within a given period of time.
Deepening the pipeline of the four-stage processor works on similar
principles and has a similar effect on completion rates. Just as the five
stages in our SUV assembly line could be broken down further into a
longer sequence of more specialized stages, the execution process that
each instruction goes through can be broken down into a series of many more than just four discrete stages. By breaking the processor's four-stage
pipeline down into a longer series of shorter, more specialized stages,
even more of the processor’s specialized hardware can work simultaneously
on more instructions and thereby increase the number of instructions that
the pipeline completes each nanosecond.
We first moved from a single-cycle processor to a pipelined processor
by taking the 4 ns time period that the instruction spent traveling through
the processor and slicing it into four discrete pipeline stages of 1 ns each in
length. These four discrete pipeline stages corresponded to the four phases
of an instruction's lifecycle. A processor's pipeline stages aren't always going to correspond exactly to the four phases of an instruction's lifecycle, though.
Some processors have a five-stage pipeline, some have a six-stage pipeline,
and many have pipelines deeper than 10 or 20 stages. In such cases, the CPU
designer must slice up the instruction’s lifecycle into the desired number of
stages in such a way that all the stages are equal in length.
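A sketch of that slicing constraint, under the idealized assumption that the instruction latency divides evenly into the desired number of stages (the helper name is ours):

```python
# Slicing a fixed instruction latency into n equal pipeline stages.
# All stages must be the same length, because one stage = one clock cycle.

def stage_length_ns(latency_ns: float, stages: int) -> float:
    """Length of each pipeline stage (and thus the clock cycle), in ns."""
    return latency_ns / stages

print(stage_length_ns(4.0, 4))  # four-stage pipeline: 1.0 ns clock
print(stage_length_ns(4.0, 8))  # eight-stage pipeline: 0.5 ns clock
```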
Now let’s take that 4 ns execution process and slice it into eight discrete
stages. Because all eight pipeline stages must be of exactly the same duration
for pipelining to work, the eight pipeline stages must each be 4 ns ÷ 8 = 0.5 ns in length. Since we’re presently working with an idealized example, let’s
pretend that splitting up the processor’s four-phase lifecycle into eight
equally long (0.5 ns) pipeline stages is a trivial matter, and that the results
look like what you see in Figure 3-8. (In reality, this task is not trivial and
involves a number of trade-offs. As a concession to that reality, I’ve chosen
to use the eight stages of a real-world pipeline—the MIPS pipeline—in
Figure 3-8, instead of just splitting each of the four traditional stages in two.)

Because pipelining requires that each pipeline stage take exactly one
clock cycle to complete, the clock cycle can now be shortened to 0.5 ns
in order to fit the lengths of the eight pipeline stages. At the bottom of
Figure 3-8, you can see the impact that this increased number of pipeline
stages has on the number of instructions completed per unit time.
[Figure: two consecutive 4 ns windows, in 0.5 ns clock cycles, showing instructions flowing through the stages Fetch1, Fetch2, Decode, Execute, Data1, Data2, Tag Ch., and Write, with "Stored Instructions" above the CPU and "Completed Instructions" below.]
Figure 3-8: An eight-stage pipeline
The single-cycle processor can complete one instruction every 4 ns, for
a completion rate of 0.25 instructions/ns, and the four-stage pipelined processor can complete one instruction every nanosecond, for a completion rate of one instruction/ns. The eight-stage processor depicted in Figure 3-8
improves on both of these by completing one instruction every 0.5 ns, for a
completion rate of two instructions/ns. Note that because each instruction
still takes 4 ns to execute, the first 4 ns of the eight-stage processor are still dedicated to filling up the pipeline. But once the pipeline is full, the processor can begin completing instructions twice as fast as the four-stage processor
and eight times as fast as the single-cycle processor.
This eightfold increase in completion rate versus a single-cycle design
means that the eight-stage processor can execute programs much faster than
either a single-cycle or a four-stage processor. But does the eightfold increase in completion rate translate into an eightfold increase in processor performance? Not exactly.
Program Execution Time and Completion Rate
If the program that the single-cycle processor in Figure 3-6 is running
consisted of only the four instructions depicted, that program would have
a program execution time of 16 ns, or 4 instructions ÷ 0.25 instructions/ns.
If the program consisted of, say, seven instructions, it would have a program
execution time of 7 instructions ÷ 0.25 instructions/ns = 28 ns. In general,
a program’s execution time is equal to the total number of instructions in
the program divided by the processor’s instruction completion rate (number
of instructions completed per nanosecond), as in the following equation:
program execution time = number of instructions in program / instruction
completion rate
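The equation can be written directly as a function (the names are ours; rates are in instructions per nanosecond, as in the text):

```python
# Program execution time from the equation above:
#   execution time = number of instructions / completion rate

def execution_time_ns(instruction_count: int, completion_rate: float) -> float:
    """Execution time in ns, given a rate in instructions per ns."""
    return instruction_count / completion_rate

# The chapter's single-cycle example runs at 0.25 instructions/ns.
print(execution_time_ns(4, 0.25))   # 16.0 ns
print(execution_time_ns(7, 0.25))   # 28.0 ns
```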
Most of the time, when I talk about processor performance in this book, I'm talking about program execution time. One processor performs better
than another if it executes all of a program’s instructions in a shorter
amount of time, so reducing program execution time is the key to increasing
processor performance.
From the preceding equation, it should be clear that program execution
time can be reduced in one of two ways: by a reduction in the number of
instructions per program or by an increase in the processor’s completion
rate. For now, let’s assume that the number of instructions in a program is
fixed and that there’s nothing that can be done about this term of the equa-
tion. As such, our focus in this chapter will be on increasing instruction
completion rates.
In the case of a non-pipelined, single-cycle processor, the instruction completion rate (x instructions per 1 ns) is simply the inverse of the instruction execution time (y ns per 1 instruction), where x and y have different numerical values. Because the relationship between completion rate and instruction execution time is simple and direct in a single-cycle processor,
an n-fold improvement in one is an n-fold improvement in the other. So improving the performance of a single-cycle processor is really about reducing instruction execution times.
With pipelined processors, the relationship between instruction execution
time and completion rate is more complex. As discussed previously, pipelined
processors allow you to increase the processor’s completion rate without
altering the instruction execution time. Of course, a reduction in instruction
execution time still translates into a completion rate improvement, but the
reverse is not necessarily true. In fact, as you'll learn later on, pipelining's improvements to completion rate often come at the price of increased instruction execution times. This means that for pipelining to improve performance,
the processor’s completion rate must be as high as possible over the course
of a program’s execution.
The Relationship Between Completion Rate and Program Execution Time
If you look at the “Completed Instructions” box of the four-stage processor
back in Figure 3-7, you'll see that a total of five instructions have been completed at the start of the ninth nanosecond. In contrast, the non-pipelined
processor illustrated in Figure 3-6 sports two completed instructions at the
start of the ninth nanosecond. Five completed instructions in the span of 8 ns
is obviously not a fourfold improvement over two completed instructions in
the same time period, so what gives?
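A quick sanity check of those numbers, using an averaging helper that is our own illustration rather than anything from the text:

```python
# Averaged over the first 8 ns, the pipeline's completion rate is
# dragged down by the 4 ns it spends filling up.

def average_rate(completed: int, elapsed_ns: float) -> float:
    """Average completion rate, in instructions per ns."""
    return completed / elapsed_ns

print(average_rate(5, 8.0))   # pipelined so far: 0.625 instructions/ns
print(average_rate(2, 8.0))   # single-cycle: 0.25 instructions/ns
print(average_rate(5, 8.0) / average_rate(2, 8.0))  # only 2.5x, not 4x
```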
Remember that it took the pipelined processor 4 ns initially to fill up
with instructions; the pipelined processor did not complete its first instruc-
tion until the end of the fourth nanosecond. Therefore, it completed fewer