
four dispatch slots, and which slot it goes into in turn dictates which of the 970’s two identical FPUs executes it. As I explained in the previous section, if the fadd goes into slots 0 or 3, it is dispatched to the logical issue queue associated with FPU1; if it goes into dispatch slot 1 or 2, it is dispatched to the logical issue queue associated with FPU2. This means that the FPU instruction scheduling hardware is restricted in ways that it wouldn’t be if both FPUs were fed from a common issue queue, because half the instructions are forced into one FPU and half the instructions are forced into the other FPU. Or at least this 50/50 split is how it’s supposed to work out under optimal circumstances, when the code is scheduled properly so that it dispatches IOPs evenly into both logical issue queues.
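
To make the routing rule concrete, here is a minimal sketch in C of the slot-to-queue mapping just described. The type and function names are invented purely for illustration; they don’t correspond to anything in IBM’s hardware or documentation.

    /* Illustrative sketch of the slot-to-queue routing described above: a
     * floating-point IOP dispatched into slot 0 or 3 goes to the logical issue
     * queue feeding FPU1, while slot 1 or 2 goes to the queue feeding FPU2.
     * The names here are invented for illustration, not IBM's. */
    #include <stdio.h>

    enum fpu_queue { FPU1_QUEUE = 1, FPU2_QUEUE = 2 };

    static enum fpu_queue queue_for_slot(int slot)
    {
        /* Slots 0 and 3 feed FPU1's queue; slots 1 and 2 feed FPU2's. */
        return (slot == 0 || slot == 3) ? FPU1_QUEUE : FPU2_QUEUE;
    }

    int main(void)
    {
        for (int slot = 0; slot < 4; slot++)
            printf("fadd in dispatch slot %d -> FPU%d's issue queue\n",
                   slot, queue_for_slot(slot));
        return 0;
    }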

Because of the grouping scheme and the two separate logical issue queues, it seems that keeping both FPUs busy by splitting the computation load between them is very much a matter of scheduling instructions for dispatch so that no single FPU happens to get overworked while the other goes underutilized. Normally, this kind of load balancing among execution units would happen at the issue queue level, but in the 970’s case, it’s constrained by the structure of the issue queues themselves.


This load balancing at the dispatch level isn’t quite as simple as it may sound, because group formation takes place according to a specific set of rules that ultimately constrain dispatch bandwidth and subsequent instruction issue in very specific and peculiar ways. For instance, an integer instruction that’s preceded by, say, a CR logical instruction may have to move over a slot to make room, because the CR logical instruction can go only in slots 0 and 1. Likewise, depending on whether an instruction near the integer IOP in the instruction stream is cracked or millicoded, the integer IOP may have to move over a certain number of slots; if the millicoded instruction breaks down into a long string of instructions, that integer IOP may even get bumped over into a later dispatch group. The overall result is that which queue an integer IOP goes into very much depends on the other (possibly non-integer) instructions that surround it.
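
As a rough illustration of that slot-shifting effect (and only that; the real grouping rules are far more involved), the following C sketch fills a four-slot dispatch group in program order, with a hypothetical cracked instruction consuming two slots and pushing the integer IOP behind it into a later slot than it would otherwise occupy.

    /* Simplified sketch of slot shifting during group formation: IOPs fill a
     * four-slot dispatch group in program order, and an instruction that is
     * cracked into two IOPs consumes two slots, bumping later IOPs over (or
     * into a later group). The real 970 applies many more rules than this. */
    #include <stdio.h>

    int main(void)
    {
        /* A hypothetical instruction stream: the second instruction is
         * cracked into two IOPs; the others map to a single IOP each. */
        const char *names[]         = { "fadd", "cracked load", "add" };
        const int   iops_per_insn[] = { 1, 2, 1 };
        int slot = 0;   /* running slot index across successive groups */

        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < iops_per_insn[i]; j++) {
                printf("%-12s IOP %d -> group %d, slot %d\n",
                       names[i], j, slot / 4, slot % 4);
                slot++;
            }
        }
        /* Without the cracking, the add would land in slot 2; here it is
         * pushed to slot 3, which can change the issue queue it feeds. */
        return 0;
    }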

The take-home message here is that PowerPC code that’s optimized specifically for the 970 performs significantly better on the processor than legacy code that’s optimized for other PowerPC processors like the G4e. Of course, no one should get the impression that legacy code runs poorly on the 970. It’s just that the full potential of the chip can’t be unlocked without properly scheduled code. Furthermore, in addition to the mitigating factors mentioned in the section on integer performance (for example, deep OOOE capabilities or a high-bandwidth FSB), the fact that quantitative studies have shown the amount of ILP inherent in most RISC code to be around two instructions per clock means that the degenerate case described in the FPU example should be exceedingly rare.

Conclusions

While the 970’s group dispatch scheme does suffer from some of the drawbacks described in the preceding section, it must be judged a success in terms of its impact on the processor’s performance per watt. That this dispatch scheme has a significant positive impact on performance per watt is evidenced by the fact that Intel’s Pentium M processor also uses a similar grouping mechanism to achieve greater power efficiency. Furthermore, Intel continues to employ this grouping mechanism more extensively with each successive revision of the Pentium M, as the company seeks to minimize power consumption without sacrificing number-crunching capabilities. Thus such grouping mechanisms will only become more widespread as microprocessor designers become ever more sensitive to the need to balance performance and power consumption.

Because the 970 can track more instructions with less power-hungry bookkeeping logic, it can spend more transistors on execution units, branch prediction resources, and cache. This last item—cache—is an especially important performance-enhancing element in modern processors, for reasons that will be covered in Chapter 11.


Understanding Caching and Performance

This chapter is intended as a general introduction to CPU caching and performance. Because cache is critical to keeping the processors described so far fed with code and data, you can’t understand how computer systems function without first understanding the structure and functioning of the cache memory hierarchy. To that end, this chapter covers fundamental cache concepts like spatial and temporal locality, set associativity, how different types of applications use the cache, the general layout and function of the memory hierarchy, and other cache-related issues.

Caching Basics

In order to really understand the role of caching in system design, think of the CPU and memory subsystem as operating on a consumer-producer model (or client-server model): The CPU consumes information provided to it by the hard disks and RAM, which act as producers.

Driven by innovations in process technology and processor design, CPUs have increased their ability to consume at a significantly higher rate than the memory subsystem has increased its ability to produce. The problem is that CPU clock cycles have gotten shorter at a faster rate than memory and bus clock cycles, so the number of CPU clock cycles that the processor has to wait before main memory can fulfill its requests for data has increased. With each CPU clockspeed increase, memory is getting farther and farther away from the CPU in terms of the number of CPU clock cycles.

Figures 11-1 and 11-2 illustrate how CPU clock cycles have gotten shorter relative to memory clock cycles.

Figure 11-1: Slower CPU clock

Figure 11-2: Faster CPU clock

To visualize the effect that this widening speed gap has on overall system performance, imagine the CPU as a downtown furniture maker’s workshop and the main memory as a lumberyard that keeps getting moved farther and farther out into the suburbs. Even if you start using bigger trucks to cart all the wood, it’s still going to take longer from the time the workshop places an order to the time that order gets filled.

NOTE    I’m not the first person to use a workshop and warehouse analogy to explain caching. The most famous example of such an analogy is the Thing King game, which is widely available on the Internet.

Sticking with the furniture workshop analogy, one solution to this problem would be to rent out a small warehouse in town and store the most commonly requested types of lumber there. This smaller, closer warehouse would act as a cache that sits between the lumberyard and the workshop, and you could keep a driver on hand at the workshop who could run out at a moment’s notice and quickly pick up whatever you need from the warehouse.

Of course, the bigger your warehouse, the better, because it allows you to store more types of wood, thereby increasing the likelihood that the raw materials for any particular order will be on hand when you need them. In the event that you need a type of wood that isn’t in the nearby warehouse, you’ll have to drive all the way out of town to get it from your big, suburban lumberyard. This is bad news, because unless your furniture workers have another task to work on while they’re waiting for your driver to return with the lumber, they’re going to sit around in the break room smoking and watching The Oprah Winfrey Show. And you hate paying people to watch The Oprah Winfrey Show.

The Level 1 Cache

I’m sure you’ve figured it out already, but the smaller, closer warehouse in this analogy is the level 1 cache (L1 cache or L1, for short). The L1 can be accessed very quickly by the CPU, so it’s a good place to keep the code and data that the CPU is most likely to request. (In a moment, we’ll talk in more detail about how the L1 can “predict” what the CPU will probably want.) The L1’s quick access time is a result of the fact that it’s made of the fastest and most expensive type of static RAM, or SRAM. Since each SRAM memory cell is made up of four to six transistors (compared to the one-transistor-per-cell configuration of DRAM), its cost per bit is quite high. This high cost per bit means that you generally can’t afford to have a very large L1 unless you really want to drive up the total cost of the system.
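
To put a rough number on that cost difference, here’s a quick back-of-the-envelope calculation in C. The 64KB cache size and the six-transistors-per-cell figure are just illustrative picks from the ranges mentioned above.

    /* Back-of-the-envelope comparison of SRAM vs. DRAM transistor counts for
     * the same capacity. Assumes a 6-transistor SRAM cell and a 1-transistor
     * DRAM cell; the 64KB size is an arbitrary, illustrative L1 capacity. */
    #include <stdio.h>

    int main(void)
    {
        const long size_bytes = 64L * 1024;        /* hypothetical 64KB cache */
        const long bits = size_bytes * 8;
        const long sram_transistors = bits * 6;    /* ~6T per SRAM cell */
        const long dram_transistors = bits * 1;    /* 1T per DRAM cell  */

        printf("64KB as SRAM: ~%ld transistors\n", sram_transistors);
        printf("64KB as DRAM: ~%ld transistors\n", dram_transistors);
        return 0;
    }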

In modern CPUs, the L1 sits on the same piece of silicon as the rest of the processor. In terms of the warehouse analogy, this is a bit like having the warehouse on the same block as the workshop. This has the advantage of giving the CPU some very fast, very close storage, but the disadvantage is that now the main memory (the suburban lumberyard) is just as far away from the L1 as it is from the processor. If data that the CPU needs is not in the L1—a situation called a cache miss—it’s going to take quite a while to retrieve that data from memory. Furthermore, remember that as the processor gets faster, the main memory gets “farther” away all the time. So while your warehouse may be on the same block as your workshop, the lumberyard has now moved not just out of town but out of the state. For an ultra–high-clock-rate processor like the P4, being forced to wait for data to load from main memory in order to complete an operation is like your workshop having to wait a few days for lumber to ship in from out of state.

Check out Table 11-1, which shows common latency and size information for the various levels of the memory hierarchy. (The numbers in this table are shrinking all the time, so if they look a bit large to you, that’s probably because by the time you read this, they’re dated.)

Table 11-1: A Comparison of Different Types of Data Storage

Level                      Access Time                Typical Size   Technology    Managed By
Registers                  1–3 ns                     1KB            Custom CMOS   Compiler
Level 1 Cache (on-chip)    2–8 ns                     8KB–128KB      SRAM          Hardware
Level 2 Cache (off-chip)   5–12 ns                    0.5MB–8MB      SRAM          Hardware
Main Memory                10–60 ns                   64MB–1GB       DRAM          Operating system
Hard Disk                  3,000,000–10,000,000 ns    20GB–100GB     Magnetic      Operating system/user


Notice the large gap in access times between the L1 and the main memory. For a 1 GHz CPU, a 50 ns wait means 50 wasted clock cycles. Ouch! To see the kind of effect such stalls have on a modern, hyperpipelined processor, see “Instruction Throughput and Pipeline Stalls” on page 53.
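
The arithmetic behind that “50 wasted clock cycles” figure is simple enough to sketch in a few lines of C; the clock rates below are just examples, but they show how the same 50 ns stall costs more cycles as the clock gets faster.

    /* Stall cycles = memory latency / CPU clock period. A fixed 50 ns memory
     * latency costs more clock cycles as the CPU clock rate climbs; the clock
     * rates used here are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        const double latency_ns  = 50.0;               /* main memory latency */
        const double clock_ghz[] = { 1.0, 2.0, 3.0 };  /* example clock rates */

        for (int i = 0; i < 3; i++) {
            double period_ns = 1.0 / clock_ghz[i];     /* one clock cycle, in ns */
            printf("%.1f GHz CPU: %.0f ns stall = %.0f wasted cycles\n",
                   clock_ghz[i], latency_ns, latency_ns / period_ns);
        }
        return 0;
    }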

The Level 2 Cache

The solution to this dilemma is to add more cache. At first you might think you could get more cache by enlarging the L1, but as I said earlier, cost considerations are a major factor limiting L1 cache size. In terms of the workshop analogy, you could say that rents are much higher in town than in the suburbs, so you can’t afford much in-town warehouse space without the cost of rent eating into your bottom line, to the point where the added costs of the warehouse space would outweigh the benefits of increased worker productivity. You have to fine-tune the amount of warehouse space that you rent by weighing all the costs and benefits so that you get the maximum output for the least cost.

A better solution than adding more in-town warehouse space would be to rent some cheaper, larger warehouse space right outside of town to act as a cache for the in-town warehouse. Similarly, processors like the P4 and G4e have a level 2 cache (L2 cache or L2) that sits between the L1 and main memory. The L2 usually contains all of the data that’s in the L1 plus some extra. The common way to describe this situation is to say that the L1 subsets the L2, because the L1 contains a subset of the data in the L2.

A series of caches, starting with the page file on the hard disk (the lumberyard) and going all the way up to the registers on the CPU (the workshop’s workbenches), is called a cache hierarchy. As you go up the cache hierarchy towards the CPU, the caches get smaller, faster, and more expensive to implement; conversely, as you go down the cache hierarchy, the caches get larger, cheaper, and much slower. The data contained in each level of the hierarchy is usually mirrored in the level below it, so for a piece of data that’s in the L1, there are usually copies of that same data in the L2, main memory, and on the hard disk.
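
To tie the hierarchy’s layers together quantitatively, here is a short C sketch that estimates an average access time from latencies in the range of Table 11-1. The hit rates are assumptions chosen purely for illustration, and the formula is the standard average-memory-access-time calculation rather than anything specific to a particular processor.

    /* Estimate average memory access time for a two-level cache hierarchy.
     * Latencies are picked from the ranges in Table 11-1; the hit rates are
     * assumed values for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        const double l1_latency_ns  = 3.0;    /* L1 access time          */
        const double l2_latency_ns  = 10.0;   /* L2 access time          */
        const double mem_latency_ns = 50.0;   /* main memory access time */
        const double l1_hit_rate    = 0.95;   /* assumed                 */
        const double l2_hit_rate    = 0.90;   /* of accesses that miss the L1 */

        /* L1 hit time, plus the L2 penalty paid on L1 misses, plus the main
         * memory penalty paid on accesses that miss both caches. */
        double avg_ns = l1_latency_ns
                      + (1.0 - l1_hit_rate) * (l2_latency_ns
                      + (1.0 - l2_hit_rate) * mem_latency_ns);

        printf("Estimated average access time: %.2f ns\n", avg_ns);
        return 0;
    }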
