Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
four dispatch slots, and which slot it goes into in turn dictates which of the
970’s two identical FPUs executes it. As I explained in the previous section,
if the fadd goes into dispatch slot 0 or 3, it is dispatched to the logical issue queue
associated with FPU1; if it goes into dispatch slot 1 or 2, it is dispatched to the logical issue queue associated with FPU2. This means that the FPU instruction
scheduling hardware is restricted in ways that it wouldn’t be if both FPUs were
fed from a common issue queue, because half the instructions are forced
into one FPU and half the instructions are forced into the other FPU. Or at
least this 50/50 split is how it’s supposed to work out under optimal circum-
stances, when the code is scheduled properly so that it dispatches IOPs
evenly into both logical issue queues.
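If you’d like to see that mapping spelled out, the minimal C sketch below encodes it and nothing else. The rule (slots 0 and 3 feed FPU1’s logical issue queue, slots 1 and 2 feed FPU2’s) comes straight from the preceding paragraph; the function name and the idea of returning a queue number are mine, purely for illustration.

```c
#include <stdio.h>

/* Slots 0 and 3 feed FPU1's logical issue queue; slots 1 and 2 feed FPU2's.
   This is only the mapping described in the text, nothing more. */
static int fpu_queue_for_slot(int dispatch_slot)
{
    return (dispatch_slot == 0 || dispatch_slot == 3) ? 1 : 2;
}

int main(void)
{
    for (int slot = 0; slot < 4; slot++)
        printf("fadd in dispatch slot %d -> FPU%d issue queue\n",
               slot, fpu_queue_for_slot(slot));
    return 0;
}
```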
Because of the grouping scheme and the two separate logical issue
queues, it seems that keeping both FPUs busy by splitting the computation
load between them is very much a matter of scheduling instructions for
dispatch so that no single FPU happens to get overworked while the other
goes underutilized. Normally, this kind of load balancing among execution
units would happen at the issue queue level, but in the 970’s case, it’s con-
strained by the structure of the issue queues themselves.
This load balancing at the dispatch level isn’t quite as simple as it may
sound, because group formation takes place according to a specific set of rules
that ultimately constrain dispatch bandwidth and subsequent instruction issue
in very specific and peculiar ways. For instance, an integer instruction that’s
preceded by, say, a CR logical instruction may have to move over a slot to
make room, because the CR logical instruction can go only in slots 0 and 1.
Likewise, depending on whether an instruction near the integer IOP in the
instruction stream is cracked or millicoded, the integer IOP may have to
move over a certain number of slots; if the millicoded instruction breaks
down into a long string of instructions, that integer IOP may even get
bumped over into a later dispatch group. The overall result is that which
queue an integer IOP goes into very much depends on the other (possibly
non-integer) instructions that surround it.
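The toy C sketch below illustrates this knock-on effect under some heavy simplifying assumptions: it models a bare four-slot group, ignores the branch slot, cracking, and millicoding entirely, and applies only the one placement rule quoted above (CR logical IOPs fit only in slots 0 and 1). The instruction stream and all of the names are invented for the example.

```c
#include <stdio.h>

enum kind { CR_LOGICAL, INTEGER, FP };
static const char *name[] = { "cr-logical", "integer", "fp" };

int main(void)
{
    /* A made-up stream: two FP IOPs, then a CR logical, then an integer IOP. */
    enum kind stream[] = { FP, FP, CR_LOGICAL, INTEGER };
    int n = sizeof stream / sizeof stream[0];
    int group = 0, slot = 0;

    for (int i = 0; i < n; i++) {
        /* The one rule we model: a CR logical IOP that can't land in
           slot 0 or 1 forces the current group to be closed early. */
        if (stream[i] == CR_LOGICAL && slot > 1) {
            group++;
            slot = 0;
        }
        printf("%-10s -> group %d, slot %d\n", name[stream[i]], group, slot);
        if (++slot == 4) {
            group++;
            slot = 0;
        }
    }
    return 0;
}
```

With this particular stream, the CR logical IOP can’t take slot 2, so the group is closed early and the integer IOP that follows it lands in the next group, in a different slot (and therefore a different logical issue queue) than it otherwise would have.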
The take-home message here is that PowerPC code that’s optimized
specifically for the 970 performs significantly better on the processor than
legacy code that’s optimized for other PowerPC processors like the G4e.
Of course, no one should get the impression that legacy code runs poorly
on the 970. It’s just that the full potential of the chip can’t be unlocked without properly scheduled code. Furthermore, in addition to the mitigating factors
mentioned in the section on integer performance (for example, deep OOOE
capabilities and a high-bandwidth FSB), the fact that quantitative studies have
shown the amount of ILP inherent in most RISC code to be around two
instructions per clock means that the degenerate case described in the
FPU example should be exceedingly rare.
Conclusions
While the 970’s group dispatch scheme does suffer from some of the draw-
backs described in the preceding section, it must be judged a success in
terms of its impact on the processor’s performance per watt. That this dis-
patch scheme has a significant positive impact on performance per watt is
evidenced by the fact that Intel’s Pentium M processor also uses a similar
grouping mechanism to achieve greater power efficiency. Furthermore,
Intel continues to employ this grouping mechanism more extensively with
each successive revision of the Pentium M, as the company seeks to mini-
mize power consumption without sacrificing number-crunching capabilities.
Thus such grouping mechanisms will only become more widespread as micro-
processor designers become ever more sensitive to the need to balance
performance and power consumption.
Because the 970 can track more instructions with less power-hungry book-
keeping logic, it can spend more transistors on execution units, branch
prediction resources, and cache. This last item—cache—is an especially
important performance-enhancing element in modern processors, for
reasons that will be covered in Chapter 11.
Understanding Caching and Performance
This chapter is intended as a general introduction to
CPU caching and performance. Because cache is critical
to keeping the processors described so far fed with code
and data, you can’t understand how computer systems
function without first understanding the structure and
functioning of the cache memory hierarchy. To that end, this chapter covers
fundamental cache concepts like spatial and temporal locality, set associa-
tivity, how different types of applications use the cache, the general layout
and function of the memory hierarchy, and other cache-related issues.
Caching Basics
In order to really understand the role of caching in system design, think of
the CPU and memory subsystem as operating on a consumer-producer model
(or client-server model): The CPU consumes information provided to it by the
hard disks and RAM, which act as producers.
Driven by innovations in process technology and processor design, CPUs
have increased their ability to consume at a significantly higher rate than the
memory subsystem has increased its ability to produce. The problem is that
CPU clock cycles have gotten shorter at a faster rate than memory and bus
clock cycles, so the number of CPU clock cycles that the processor has to wait
before main memory can fulfill its requests for data has increased. With each
CPU clockspeed increase, memory is getting farther and farther away from
the CPU in terms of the number of CPU clock cycles.
Figures 11-1 and 11-2 illustrate how CPU clock cycles have gotten shorter
relative to memory clock cycles.
Figure 11-1: Slower CPU clock
Figure 11-2: Faster CPU clock
To visualize the effect that this widening speed gap has on overall system
performance, imagine the CPU as a downtown furniture maker’s workshop
and the main memory as a lumberyard that keeps getting moved farther and
farther out into the suburbs. Even if you start using bigger trucks to cart all
the wood, it’s still going to take longer from the time the workshop places an
order to the time that order gets filled.
NOTE
I’m not the first person to use a workshop and warehouse analogy to explain caching.
The most famous example of such an analogy is the Thing King game, which is widely
available on the Internet.
Sticking with the furniture workshop analogy, one solution to this
problem would be to rent out a small warehouse in town and store the
most commonly requested types of lumber there. This smaller, closer
warehouse would act as a cache that sits between the lumberyard and the
workshop, and you could keep a driver on hand at the workshop who could
run out at a moment’s notice and quickly pick up whatever you need from
the warehouse.
Of course, the bigger your warehouse, the better, because it allows you
to store more types of wood, thereby increasing the likelihood that the raw
materials for any particular order will be on hand when you need them. In
the event that you need a type of wood that isn’t in the nearby warehouse,
you’ll have to drive all the way out of town to get it from your big, suburban
lumberyard. This is bad news, because unless your furniture workers have
another task to work on while they’re waiting for your driver to return with
the lumber, they’re going to sit around in the break room smoking and
watching The Oprah Winfrey Show. And you hate paying people to watch
The Oprah Winfrey Show.
The Level 1 Cache
I’m sure you’ve figured it out already, but the smaller, closer warehouse in
this analogy is the level 1 cache (L1 cache or L1, for short). The L1 can be
accessed very quickly by the CPU, so it’s a good place to keep the code and
data that the CPU is most likely to request. (In a moment, we’ll talk in more
detail about how the L1 can “predict” what the CPU will probably want.) The
L1’s quick access time is a result of the fact that it’s made of the fastest and
most expensive type of static RAM, or SRAM. Since each SRAM memory cell
is made up of four to six transistors (compared to the one-transistor-per-cell
configuration of DRAM), its cost per bit is quite high. This high cost per bit
means that you generally can’t afford to have a very large L1 unless you really
want to drive up the total cost of the system.
In modern CPUs, the L1 sits on the same piece of silicon as the rest of
the processor. In terms of the warehouse analogy, this is a bit like having the
warehouse on the same block as the workshop. This has the advantage of
giving the CPU some very fast, very close storage, but the disadvantage is that
now the main memory (the suburban lumberyard) is just as far away from the
L1 as it is from the processor. If data that the CPU needs is not in the L1
(a situation called a cache miss), it’s going to take quite a while to retrieve
that data from memory. Furthermore, remember that as the processor gets faster,
the main memory gets “farther” away all the time. So while your warehouse
may be on the same block as your workshop, the lumberyard has now moved
not just out of town but out of the state. For an ultra–high-clock-rate processor like the P4, being forced to wait for data to load from main memory in order
to complete an operation is like your workshop having to wait a few days for
lumber to ship in from out of state.
Check out Table 11-1, which shows common latency and size information
for the various levels of the memory hierarchy. (The numbers in this table
are shrinking all the time, so if they look a bit large to you, that’s probably
because by the time you read this, they’re dated.)
Table 11-1: A Comparison of Different Types of Data Storage

Level                      Access Time                Typical Size   Technology    Managed By
Registers                  1–3 ns                     1KB            Custom CMOS   Compiler
Level 1 Cache (on-chip)    2–8 ns                     8KB–128KB      SRAM          Hardware
Level 2 Cache (off-chip)   5–12 ns                    0.5MB–8MB      SRAM          Hardware
Main Memory                10–60 ns                   64MB–1GB       DRAM          Operating system
Hard Disk                  3,000,000–10,000,000 ns    20GB–100GB     Magnetic      Operating system/user
Notice the large gap in access times between the L1 and the main
memory. For a 1 GHz CPU, a 50 ns wait means 50 wasted clock cycles. Ouch!
To see the kind of effect such stalls have on a modern, hyperpipelined
processor, see “Instruction Throughput and Pipeline Stalls” on page 53.
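A quick back-of-the-envelope calculation shows why this gap hurts more as clock rates climb. The sketch below just multiplies a fixed 50 ns trip to main memory by a few sample clock rates; the rates themselves are arbitrary illustrations.

```c
#include <stdio.h>

int main(void)
{
    const double mem_latency_ns = 50.0;              /* fixed trip to main memory */
    const double clock_ghz[]    = { 1.0, 2.0, 3.2 }; /* arbitrary sample clocks   */

    for (int i = 0; i < 3; i++) {
        /* GHz is cycles per nanosecond, so cycles wasted = ns * GHz. */
        double wasted = mem_latency_ns * clock_ghz[i];
        printf("%.1f GHz CPU: ~%.0f clock cycles per %.0f ns memory access\n",
               clock_ghz[i], wasted, mem_latency_ns);
    }
    return 0;
}
```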
The Level 2 Cache
The solution to this dilemma is to add more cache. At first you might think you
could get more cache by enlarging the L1, but as I said earlier, cost considera-
tions are a major factor limiting L1 cache size. In terms of the workshop ana-
logy, you could say that rents are much higher in town than in the suburbs, so
you can’t afford much in-town warehouse space without the cost of rent eating
into your bottom line, to the point where the added costs of the warehouse
space would outweigh the benefits of increased worker productivity. You have
to fine-tune the amount of warehouse space that you rent by weighing all the
costs and benefits so that you get the maximum output for the least cost.
A better solution than adding more in-town warehouse space would be
to rent some cheaper, larger warehouse space right outside of town to act as
a cache for the in-town warehouse. Similarly, processors like the P4 and G4e
have a level 2 cache (L2 cache or L2) that sits between the L1 and main memory.
The L2 usually contains all of the data that’s in the L1 plus some extra. The
common way to describe this situation is to say that the L1 subsets the L2,
because the L1 contains a subset of the data in the L2.
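You can sketch the level-by-level search that this arrangement implies in a few lines of C. The model below is a deliberate oversimplification, with made-up hit flags and rough latencies in the spirit of Table 11-1: a request is tried at each level in turn and is satisfied by the first level that holds the data.

```c
#include <stdbool.h>
#include <stdio.h>

struct level {
    const char *name;
    double      latency_ns;  /* rough access time, in the spirit of Table 11-1 */
    bool        hit;         /* does this level happen to hold the data?       */
};

int main(void)
{
    /* The L1 holds a subset of the L2, which holds a subset of main memory,
       so a hit at one level implies a hit at every level below it. */
    struct level hierarchy[] = {
        { "L1 cache",     3.0, false },
        { "L2 cache",    10.0, true  },
        { "main memory", 60.0, true  },
    };
    double total_ns = 0.0;

    for (int i = 0; i < 3; i++) {
        total_ns += hierarchy[i].latency_ns;
        if (hierarchy[i].hit) {
            printf("found in %s after ~%.0f ns\n", hierarchy[i].name, total_ns);
            break;
        }
        printf("%s miss, trying the next level down\n", hierarchy[i].name);
    }
    return 0;
}
```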
A series of caches, starting with the page file on the hard disk (the lumber-
yard) and going all the way up to the registers on the CPU (the workshop’s
work benches), is called a cache hierarchy. As you go up the cache hierarchy
towards the CPU, the caches get smaller, faster, and more expensive to
implement; conversely, as you go down the cache hierarchy, the caches get
larger, cheaper, and much slower. The data contained in each level of the
hierarchy is usually mirrored in the level below it, so for a piece of data
that’s in the L1, there are usually copies of that same data in the L2, main memory,