Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture (72 page)

Authors: jon stokes

Tags: #Computers, #Systems Architecture, #General, #Microprocessors

the appropriate execution unit’s reservation station so that the instructions

behind it in the instruction queue can move up and be dispatched.

The small size of the 604’s reservation stations compared to similar struc-

tures on the P6 is due to the fact that the 604’s pipeline is relatively short.

Pipeline stalls aren’t quite as devastating for performance on a machine with

a 6-stage pipeline as they are on a machine with a 12-stage pipeline, so the

604 doesn’t need as large an instruction window as its super-pipelined

counterparts.

The Four Rules of Instruction Dispatch

Here are the four most important rules governing instruction dispatch on

the 604:

The in-order dispatch rule

Before an instruction can dispatch, all of the instructions preceding that

instruction must have dispatched. In other words, instructions dispatch

from the instruction queue in program order. It is not until instructions

have arrived at the reservation stations, where they may issue out of order

to the execution units, that the original program order is disrupted.

The issue buffer/execution unit availability rule

Before the dispatch logic can send an instruction to an execution unit’s

reservation station, that reservation station must have an entry available.

If an instruction doesn’t need to go to a reservation station because its

inputs are available at the time of dispatch, the required execution unit

must have a pipeline slot available, and the unit’s reservation station

must be empty (i.e., there are no older instructions waiting to execute)

before the instruction can be sent to the execution unit. (This rule is

modified on the PowerPC 7450, aka the G4e, and we’ll cover the modification in “The PowerPC 7400 (aka the G4)” on page 133.)

PowerPC Processors: 600 Series, 700 Series, and 7400

The completion buffer availability rule

For an instruction to dispatch, there must be space available in the

completion queue so that a new entry can be created for the instruction.

Remember, the completion queue (or ROB) keeps track of the program

order of each in-flight instruction, so any instruction that enters the out-

of-order back end must be logged in the completion queue first.

The rename register availability rule

There must be enough rename registers available to temporarily store

the results for each register that the instruction will modify.

If an instruction meets the requirements imposed by these

rules, and if it meets the other more instruction-specific dispatch rules not

listed here, it can dispatch from the instruction queue to the back end.

All of the PowerPC processors discussed in this chapter that have reserva-

tion stations are subject to (at least) these four dispatch rules, so keep these rules in mind as we talk about instruction dispatch throughout the rest of this

chapter. Note that all of the processors—including the 604—have additional

rules that govern the dispatch of specific types of instructions, but these four general dispatch rules are the most important.
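
The four general rules can be read as one eligibility check. The Python below is a minimal sketch that follows the wording of the rules above, not a model of the real 604 dispatch logic. The names `Instruction`, `UnitState`, `BackEnd`, and `can_dispatch`, the "IU" unit label, and the resource counts in the example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    unit: str               # target execution unit (hypothetical label, e.g. "IU")
    operands_ready: bool    # all inputs available at dispatch time?
    dest_registers: int     # how many registers the instruction will modify


@dataclass
class UnitState:
    rs_capacity: int        # reservation station entries for this unit
    rs_used: int            # entries currently holding waiting instructions
    pipeline_slot_free: bool


@dataclass
class BackEnd:
    units: dict             # unit name -> UnitState
    rob_free: int           # free completion queue (ROB) entries
    rename_free: int        # free rename registers


def can_dispatch(instr, older_all_dispatched, back_end):
    """Check the four general dispatch rules for one instruction."""
    unit = back_end.units[instr.unit]

    # Rule 1: in-order dispatch.
    in_order = older_all_dispatched

    # Rule 2: issue buffer/execution unit availability.
    if instr.operands_ready:
        # Going straight to the unit: it needs a free pipeline slot and
        # an empty reservation station (no older instructions waiting).
        unit_ok = unit.pipeline_slot_free and unit.rs_used == 0
    else:
        # Otherwise the reservation station must have a free entry.
        unit_ok = unit.rs_used < unit.rs_capacity

    # Rule 3: completion buffer availability.
    rob_ok = back_end.rob_free >= 1

    # Rule 4: rename register availability.
    rename_ok = back_end.rename_free >= instr.dest_registers

    return in_order and unit_ok and rob_ok and rename_ok


# Hypothetical back-end state: one integer unit whose reservation station is full.
back_end = BackEnd(
    units={"IU": UnitState(rs_capacity=2, rs_used=2, pipeline_slot_free=True)},
    rob_free=4,
    rename_free=3,
)
waiting_add = Instruction(unit="IU", operands_ready=False, dest_registers=1)
print(can_dispatch(waiting_add, True, back_end))  # False: reservation station is full
```

Because dispatch is in order, a `False` result for the oldest instruction in the queue also holds up every instruction behind it.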

The Completion Phase: The 604’s Reorder Buffer

As with the P6 microarchitecture, the reservation stations aren’t the only

structures that make up the 604’s instruction window. The 604 has a 16-entry

reorder buffer (ROB) that performs the same function as the P6 micro-

architecture’s much larger 40-entry ROB.

The ROB corresponds to the simpler completion queue on older PPC

processors. In the dispatch stage, not only are instructions sent to the back

end’s reservation stations, but each dispatched instruction is also

allocated an entry in the ROB and a set of rename registers. In the completion

stage, the instructions are put back in program order so that their

results can be written back to the register file in the subsequent write-back

stage. The completion stage corresponds to what I’ve called the completion

phase of an instruction’s lifecycle, and the write-back stage corresponds to

what I’ve called the commit phase.
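
The ROB's job of restoring program order can be sketched in a few lines. This is a toy model, not the 604's actual completion logic: entries are allocated at dispatch in program order, marked finished in whatever order the execution units produce results, and retired only from the oldest end. The `ReorderBuffer` class and its method names are hypothetical; the default capacity of 16 matches the 604's ROB size given above.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: allocate in program order, finish in any order, retire in order."""

    def __init__(self, capacity=16):    # 16 entries, as on the 604
        self.capacity = capacity
        self.entries = deque()          # oldest instruction at the left

    def allocate(self, name):
        """Called at dispatch. Returns False if the ROB is full (dispatch stalls)."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"name": name, "finished": False})
        return True

    def mark_finished(self, name):
        """Called when an execution unit produces a result, possibly out of order."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["finished"] = True
                return
        raise KeyError(name)

    def complete(self):
        """Retire finished instructions from the head, stopping at the first unfinished one."""
        retired = []
        while self.entries and self.entries[0]["finished"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
for name in ["load", "add", "mul"]:
    rob.allocate(name)

rob.mark_finished("mul")       # youngest instruction finishes first
print(rob.complete())          # [] because "load" is still in flight
rob.mark_finished("load")
rob.mark_finished("add")
print(rob.complete())          # ['load', 'add', 'mul'] in program order
```

The `allocate` method returning `False` corresponds to the completion buffer availability rule: a full ROB stalls dispatch.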

The 604’s ROB is much smaller than the P6’s ROB for the same reason

that the 604’s reservation stations are fewer: the 604 has a much shallower

pipeline, which means that it needs a much smaller instruction window

for tracking fewer in-flight instructions in order to achieve the same

performance.

The trade-off for this lack of complexity and lower pipeline depth is a

lower clock speed. The 6-stage 604 debuted in May 1995 at 120 MHz, while

the 12-stage Pentium Pro debuted later that year (November 1995) at speeds

ranging from 150 to 200 MHz.

Chapter 6

Summary: The 604 in Historical Context

With a 32KB split L1 cache, the 604 had a much heftier cache than its prede-

cessors, which it needed to help keep its deeper pipeline fed. The larger cache, higher dispatch and issue rate, wider back end, and deeper pipeline made for a

solid RISC performer that was easily able to keep pace with its x86 competitors.

Still, the Pentium Pro was no slouch, and its performance was scaling

well with improvements in processor manufacturing techniques. Apple

needed more power from AIM to keep the pace, and more power is what

they got with a minor microarchitectural revision that came to be called

the 604e.

The PowerPC 604e

The 604e built on gains made by the 604 with a few core changes that

included a doubling of the L1 cache size (to 32KB instruction/32KB data)

and the addition of a new independent execution unit: the condition register unit (CRU).

The previous 600-series processors had moved the responsibility for

handling condition register logical operations back and forth among various

units (the integer unit in the 601, the system unit in the 603/603e, and the

branch unit in the 604). Now with the 604e, these operations got an execu-

tion unit of their own. The 604e sported a functional block in its back end

that was dedicated to handling condition register logical operations, which

meant that these not uncommon operations didn’t tie up other execution

units—like the integer unit or the branch unit—that had more serious

work to do.

The 604e’s branch unit, now that it was free from having to handle CR

logical operations, got a few expanded capabilities that I won’t detail here.

The 604e’s caches, in addition to being enlarged, also got additional copy-

back buffers and a handful of other enhancements.

The 604e was ultimately able to scale up to 350 MHz once it moved from

a 0.35 to a 0.25 micron manufacturing process, making it a successful part for

Apple’s budding RISC media workstation line.

The PowerPC 750 (aka the G3)

The PowerPC 750—known to Apple users as the G3—is a design based heavily

on the 603/603e. Its four-stage pipeline is the same as that of the 603/603e,

and many of the features of its front end and back end will be familiar to you

from our discussion of the older processor. Nonetheless, the 750 sports a few

very powerful improvements over the 603e that make it faster than even the

604e, as you can see in Table 6-4.

Table 6-4: Features of the PowerPC 750

Introduction Date             September 1997
Process                       0.25 micron
Transistor Count              6.35 million
Die Size                      67 mm²
Clock Speed at Introduction   200–300 MHz
Cache Sizes                   64KB split L1, 1MB L2
First Appeared In             Power Macintosh G3

The 750’s significant improvement in performance over the 603/603e is

the result of a number of factors, not the least of which are the improvements

that IBM made to the 750’s integer and floating-point capabilities.

A quick glance at the 750’s layout (see Figure 6-4) reveals that its back end

is wider than that of the 603. More specifically, where the 603 has a single

integer unit, the 750 has two—a simple integer unit (SIU) and complex inte-

ger unit (CIU). The 750’s complex integer unit handles all integer instructions, while the simple integer unit handles all integer instructions except multiply

and divide. Most of the integer instructions that execute in the SIU are

single-cycle instructions.
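
The division of labor between the two integer units amounts to a small lookup. The sketch below is illustrative only: the `eligible_units` helper is hypothetical, and the mnemonic set is a small sample of PowerPC multiply and divide instructions rather than a complete list. Real dispatch also depends on the dispatch rules and on which units are free.

```python
# Sample multiply and divide mnemonics used for illustration; this is
# not the full PowerPC integer instruction set.
MULTIPLY_DIVIDE = {"mullw", "mulhw", "mulli", "divw", "divwu"}


def eligible_units(mnemonic):
    """Return the 750 integer units that could execute an integer instruction."""
    if mnemonic in MULTIPLY_DIVIDE:
        return ["CIU"]            # only the complex integer unit
    return ["CIU", "SIU"]         # either unit can handle it


print(eligible_units("add"))      # ['CIU', 'SIU']
print(eligible_units("divw"))     # ['CIU']
```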

Like the 603 (and the 604), the 750’s floating-point unit can execute all

single-precision floating-point operations—including multiply—with a latency

of three cycles. And like the 603, early versions of the 750 had to insert a

pipeline bubble after every third floating-point instruction in its pipeline;

this was fixed in later IBM-produced versions of the 750. Double-precision

floating-point operations, with the exception of operations involving mul-

tiplication, also take three cycles on the 750. Double-precision multiply and

multiply-add operations take four cycles, because the 750 doesn’t have a full

double-precision FPU.

The 750’s load-store unit and system register unit perform the same

functions described in the preceding section for the 603, so they don’t merit

further comment.

The 750’s Front End, Instruction Window, and Branch Instruction

The 750 fetches up to four instructions per cycle into its six-entry instruction queue, and it dispatches up to two non-branch instructions per cycle from

the IQ’s two bottom entries. The dispatch logic follows the four dispatch rules

described earlier when deciding when an instruction is eligible to dispatch,

and each dispatched instruction is assigned an entry in the 750’s six-entry

ROB (compare the 603’s five-entry ROB).

[Figure 6-4: Microarchitecture of the PowerPC 750. The diagram shows the front end (instruction fetch, branch unit, instruction queue, decode/dispatch), the back end (reservation stations feeding the floating-point unit, the two integer units, and the load-store unit), and the commit unit (completion queue, write).]

As on the 603 and 604, newly dispatched instructions enter the reserva-

tion station of the execution unit to which they have been dispatched, where

they wait for their operands to become available so that they can issue. The

750’s reservation station configuration is similar to that of the 603 in that, with the exception of the two-entry reservation station attached to the 750’s LSU,

all of the execution units have single-entry reservation stations. And like the

603, the 750’s branch unit has no reservation station.

Because the 750’s instruction window is so small, it has half the rename

registers of the 604. Nonetheless, the 750’s six general-purpose and six floating-point rename registers still put it ahead of the 603’s number of rename registers (five GPRs and four FPRs). Like the 603, the 750 has one rename register

each for the CR, LR, and CTR.

You would think that the 750’s smaller reservation stations and shorter

ROB would put it at a disadvantage with respect to the 604, which has a larger

instruction window. But the 750’s pipeline is shorter than that of the 604, so

it needs fewer buffers to track fewer in-flight instructions. More importantly,

though, the 750 has one very clever trick up its sleeve that it uses to keep its pipeline full.

Recall that standard dynamic branch prediction schemes generally use

a branch history table (BHT) in combination with a branch target buffer (BTB)

to speculate on the outcome of branch instructions and to redirect the

processor’s front end to a different point in the code stream based on this

speculation. The BHT stores information on the past behavior (taken or not

taken) of the most recently executed branch instructions, so that the processor

can determine whether or not it should take these branches if it encounters

them again. The target addresses of recently taken branches are stored in the

BTB, so that when the branch prediction hardware decides to speculatively

take a branch, it has immediate access to that branch’s target address without

having to recalculate it. The target address of the speculatively taken branch

is loaded from the BTB into the instruction register, so that on the next fetch

cycle, the processor can begin fetching and speculatively executing instruc-

tions from the target address.
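
The standard BHT-plus-BTB scheme described above can be sketched as follows. This is a toy model under stated assumptions: the BHT uses 2-bit saturating counters (a common choice, not something the text specifies), both tables are unbounded Python dictionaries keyed by branch address, and the `BranchPredictor` class and its method names are hypothetical.

```python
class BranchPredictor:
    """Toy dynamic predictor: a BHT of 2-bit counters plus a BTB of target addresses."""

    def __init__(self):
        self.bht = {}   # branch address -> counter 0..3 (>= 2 means "predict taken")
        self.btb = {}   # branch address -> target address of a recently taken branch

    def predict(self, branch_addr, fallthrough_addr):
        """Return the address the front end should fetch from next."""
        predicted_taken = self.bht.get(branch_addr, 1) >= 2
        if predicted_taken and branch_addr in self.btb:
            return self.btb[branch_addr]   # target is available without recalculation
        return fallthrough_addr

    def update(self, branch_addr, taken, target_addr):
        """Record the branch's actual outcome once it has executed."""
        counter = self.bht.get(branch_addr, 1)
        if taken:
            self.bht[branch_addr] = min(counter + 1, 3)
            self.btb[branch_addr] = target_addr
        else:
            self.bht[branch_addr] = max(counter - 1, 0)


bp = BranchPredictor()
print(hex(bp.predict(0x100, 0x104)))   # 0x104: no history yet, so predict not taken
bp.update(0x100, taken=True, target_addr=0x200)
print(hex(bp.predict(0x100, 0x104)))   # 0x200: history now says taken
```

Note that the BTB here supplies only an address, so the instructions at that address must still be fetched before they can execute.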

The 750 improves on this standard scheme in a very clever way. Instead

of storing only the target addresses of recently taken branches in a BTB, the

750’s 64-entry branch target instruction cache (BTIC) stores the instruction that is located at the branch’s target address. When the 750’s branch prediction
