and hard disk’s page file. Each level in the hierarchy subsets the level below
it. We’ll talk later about how all of those copies are kept in sync.
In Figure 11-3, the red cells are the code and data for the program that
the CPU is currently running. The blue cells are unrelated to the currently
running program. This figure, which will become even clearer once you read
“Locality of Reference” on page 220,
shows how each level of the cache hierarchy subsets the level below it.
As Table 11-1 indicates, each level of the hierarchy depicted in Figure 11-3 is controlled by a different part of the system. Data is promoted up or demoted down the hierarchy based on a number of different criteria; in the remainder of this chapter we’ll concern ourselves only with the top levels of the hierarchy.

[Figure 11-3: Code and data in the cache hierarchy. Registers, the L1 cache, and the L2 cache sit on the CPU; main memory (RAM) sits below them.]
Example: A Byte’s Brief Journey Through the Memory Hierarchy
For the sake of example, let’s say the CPU issues a load instruction that tells
the memory subsystem to load a piece of data (in this case, a single byte)
into one of its registers. First, the request goes out to the L1, which is
checked to see if it contains the requested data. If the L1 does not contain the data and therefore cannot fulfill the request—in other words, a cache miss
occurs—the request propagates down to the L2. If the L2 does not contain the
desired byte, the request begins the relatively long trip out to main memory.
If main memory doesn’t contain the data, you’re in big trouble, because then it has to be paged in from the hard disk, an act that can take a relative eternity in CPU time.
Let’s assume that the requested byte is found in main memory. Once located, the byte is copied from main memory, along with a bunch of its neighboring bytes in the form of a cache block or cache line, into the L2 and L1. When the CPU requests this same byte again, it will be waiting for it there in the L1, a situation called a cache hit.
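To make that journey concrete, the following minimal C sketch models the lookup cascade just described. The ToyCache type, the FIFO replacement, and the capacities are invented for illustration; real caches use the organizations and replacement policies covered later in this chapter.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u   /* bytes fetched per cache line (a typical size) */

/* A toy cache: a FIFO list of the line addresses it currently holds. */
typedef struct {
    uint64_t tags[64];
    unsigned capacity, size, next;
} ToyCache;

static bool lookup(ToyCache *c, uint64_t line) {
    for (unsigned i = 0; i < c->size; i++)
        if (c->tags[i] == line) return true;   /* hit */
    return false;                              /* miss */
}

static void fill(ToyCache *c, uint64_t line) {
    c->tags[c->next] = line;                   /* overwrite the oldest entry */
    c->next = (c->next + 1) % c->capacity;
    if (c->size < c->capacity) c->size++;
}

/* Walk the hierarchy for one byte-sized load, reporting where it hit. */
static void load_byte(ToyCache *l1, ToyCache *l2, uint64_t addr) {
    uint64_t line = addr / LINE_SIZE;          /* line-align the address */
    if (lookup(l1, line)) { puts("L1 hit"); return; }
    if (lookup(l2, line)) {
        puts("L1 miss, L2 hit");
    } else {
        puts("L1 miss, L2 miss: long trip out to main memory");
        fill(l2, line);                        /* copy the line into the L2... */
    }
    fill(l1, line);                            /* ...and into the L1 */
}

int main(void) {
    ToyCache l1 = { .capacity = 8 }, l2 = { .capacity = 64 };
    load_byte(&l1, &l2, 0x1000);   /* first touch: fetched from main memory */
    load_byte(&l1, &l2, 0x1001);   /* neighbor on the same line: L1 hit     */
    load_byte(&l1, &l2, 0x1000);   /* requested again: L1 hit               */
    return 0;
}
```

Note how the second load hits in the L1 even though that byte was never requested before: it rode along when the whole line was fetched, which is the point of the cache block.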
Cache Misses
Computer architects usually divide cache misses up into three different types
depending on the situation that brought about the miss. I’ll introduce these
three types of misses at appropriate points over the course of the chapter,
but I can talk about the first one right now.
A compulsory miss is a cache miss that occurs because the desired data was never in the cache and therefore must be loaded into it for the first time in a program’s execution. It’s called a compulsory miss because, barring the use of certain specialized tricks like data prefetching, it’s the one type of miss that just can’t be avoided. All cached data must be brought into the cache for the very first time at some point, and the occasion for doing so is normally a compulsory miss.
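One way to see why such misses are unavoidable is to model an idealized cache with unlimited capacity and no evictions: any miss that remains is, by definition, compulsory. A brief sketch, with an invented touch() helper and an assumed 64-byte line size:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64u
#define MAX_LINES 4096

/* Idealized infinite cache: nothing is ever evicted, so the only
   misses left are compulsory (first-touch) misses. */
static uint64_t seen[MAX_LINES];
static unsigned nseen;

static bool touch(uint64_t addr) {
    uint64_t line = addr / LINE_SIZE;
    for (unsigned i = 0; i < nseen; i++)
        if (seen[i] == line) return true;   /* hit: line was fetched before */
    seen[nseen++] = line;                   /* compulsory miss: first touch */
    return false;
}

int main(void) {
    unsigned misses = 0;
    /* Read 1,024 consecutive bytes. Only the first access to each
       64-byte line misses, so we expect 1024 / 64 = 16 misses. */
    for (uint64_t addr = 0; addr < 1024; addr++)
        if (!touch(addr)) misses++;
    printf("compulsory misses: %u\n", misses);   /* prints 16 */
    return 0;
}
```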
The two other types of cache misses result when the CPU requests data that was previously in the cache but has been evicted for some reason or other. We’ll discuss evictions in “Temporal and Spatial Locality Revisited: Replacement/Eviction Policies and Block Sizes” on page 230, and I’ll cover these two types of misses as they come up over the course of this chapter.
Locality of Reference
Caching works because of a very simple property exhibited to one degree
or another by all types of code and data: locality of reference. We generally
find it useful to talk about two types of locality of reference: spatial locality and temporal locality.
Spatial locality
Spatial locality is a fancy label for the general rule that if the CPU needs an item from memory at any given moment, it’s likely to need that item’s neighbors next.
Temporal locality
Temporal locality is the name we give to the general rule that if an item in memory was accessed once, it’s likely to be accessed again in the near future.
Depending on the type of application, both code and data streams can
exhibit spatial and temporal locality.
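A loop as small as an array sum exhibits both at once; the following C fragment is purely illustrative:

```c
#include <stdio.h>

static long sum_array(const long *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];    /* spatial locality: a[i + 1] sits right next to a[i];
                           temporal locality: sum and the loop code itself are
                           touched again on every iteration */
    return sum;
}

int main(void) {
    long data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%ld\n", sum_array(data, 8));   /* prints 36 */
    return 0;
}
```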
Spatial Locality of Data
Spatial locality of data is the easiest type of locality to understand, because
most of you have used media applications like MP3 players, DVD players, and
other types of applications whose datasets consist of large, ordered files. Consider an MP3 file, which has a bunch of blocks of data that are consumed by the processor in sequence from the file’s beginning to its end. If the CPU is running iTunes and it has just requested second 1:23 of a five-minute MP3, you can be reasonably certain that next it’s going to want seconds 1:24, 1:25, and so on. The same is true of a DVD file, and of many other types of media files like images, AutoCAD drawings, and 3D game levels. All of these
applications operate on large arrays of sequentially ordered data that get
ground through in sequence by the CPU again and again.
Business applications like word processors also have great spatial locality
for data. If you think about it, few people open six or seven documents in a
word processor and quickly alternate among them, typing one or two words
in each. Most of us just open up one or two relatively modest-sized files and
work in them for a while without skipping around too much within the same
file. These files are stored in contiguous regions of memory, where they can
be brought quickly into the cache in a few large batches.
Ultimately, spatial locality is just a way of saying that related chunks of
data tend to clump together in memory, and since they’re related, they also
tend to be processed together in batches by the CPU.
In Figure 11-4, as in Figure 11-3, the red cells are related chunks of data
stored in the memory array. This figure shows a program with fairly good
spatial locality, since the red cells are clumped closely together. In an appli-
cation with poor spatial locality, the red cells would be more randomly
distributed among the unrelated blue cells.

[Figure 11-4: Spatial locality]
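A quick way to feel the difference is to traverse one large array two different ways. In the sketch below (sizes arbitrary, 64-byte lines assumed), the sequential pass uses every byte of each line the cache fetches, while the strided pass wastes almost all of them; timing the two loops on real hardware will show the gap.

```c
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)   /* 4M ints (16 MB): much larger than a typical cache */

int main(void) {
    int *a = calloc(N, sizeof *a);   /* one big, contiguous array */
    if (!a) return 1;
    long sum = 0;

    /* Good spatial locality: each 64-byte line fetched into the cache
       serves 16 consecutive ints before the next line is needed. */
    for (int i = 0; i < N; i++)
        sum += a[i];

    /* Poor spatial locality: striding 4KB between accesses uses one int
       per line fetched; by the time the loop wraps around to a line's
       neighbors, that line has long since been evicted. */
    for (int j = 0; j < 1024; j++)
        for (int i = j; i < N; i += 1024)
            sum += a[i];

    printf("%ld\n", sum);
    free(a);
    return 0;
}
```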
Spatial Locality of Code
Spatial locality applies to code just like it does to data—most well-written
code tries to avoid jumps and branches so that the processor can execute
through large contiguous blocks uninterrupted. Games, simulations, and
media processing applications tend to have decent spatial locality for code,
because such applications often feature small blocks of code (called kernels) operating repeatedly on very large datasets.
When it comes to spatial locality of code for business applications, the
picture is mixed. If you think about the way that you use a word processor,
it’s easy to see what’s going on. As you create a document, you’re
constantly pressing different formatting buttons and invoking different
menu options. For instance, you might format one word in italics, change
the paragraph spacing, and then save the file, all in sequence. Each of these
actions invokes a very different part of the code in a large application like
Microsoft Word; it’s not likely that the code for the FileSave menu option
is stored right next to the code that formats a font in italics. The way you
use a word processor forces the CPU to jump around from place to place
in memory in order to retrieve the correct code. However, the segment of
the code stream that implements each individual action (i.e., saving a file,
formatting a font, and so on) usually exists in memory as a fairly large, spatially localized chunk—very much like a little subprogram within the larger application. While the code for the FileSave menu action might be quite far
away from the code for the italics formatting option, both of these chunks
of code exhibit good spatial locality as small programs in their own right.
What this means for a cache designer is that business applications need
large instruction caches to be able to collect all of the most frequently used
clumps of code that correspond to the most frequently executed actions and
pack them together in the cache. If the instruction cache is too small, the
different clumps get swapped out as different actions are performed. If the instruction cache is large enough, many of these subprograms will fit and
there’s little swapping needed. Incidentally, this is why business applications performed so poorly on Intel’s original Celeron processor, which shipped without an L2 cache.
Temporal Locality of Code and Data
Consider a simple Photoshop filter that inverts an image to produce a
negative; there’s a small piece of code that performs the same inversion on
each pixel, starting at one corner and going in sequence all the way across
and down to the opposite corner. This code is just a small loop that gets
executed repeatedly, once on each pixel, so it’s an example of code that is
reused again and again. Media applications, games, and simulations, because
they use lots of small loops that iterate through very large datasets, have
excellent temporal locality for code.
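A bare-bones C version of such a filter might look like the following. The one-byte-per-pixel grayscale format and the tiny image are invented for illustration; what matters is the shape of the loop, whose few instructions are executed over and over:

```c
#include <stdint.h>
#include <stdio.h>

/* Invert an 8-bit grayscale image to produce a negative. The loop body is
   a tiny block of code run once per pixel, so the instructions (and the
   loop counter) have near-perfect temporal locality, while each pixel
   itself is touched exactly once. */
static void invert(uint8_t *pixels, int width, int height) {
    for (int i = 0; i < width * height; i++)
        pixels[i] = 255 - pixels[i];
}

int main(void) {
    uint8_t img[4] = {0, 64, 128, 255};   /* a 2x2 stand-in image */
    invert(img, 2, 2);
    printf("%d %d %d %d\n", img[0], img[1], img[2], img[3]); /* 255 191 127 0 */
    return 0;
}
```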
The same large, homogeneous datasets that give media applications and the like good temporal locality for code also give them extremely poor
temporal locality for data. Returning to the MP3 example, a music file is
usually played through once in sequence and none of its parts are repeated.
This being the case, it’s actually a waste to store any of that file in the data cache, since it’s only going to stop off there temporarily before passing
through to the CPU.
When an application like the aforementioned MP3 player fills up the cache with data that doesn’t really need to be cached, because it won’t be used again, and in doing so bumps out of the cache data that will be reused, that application is said to “pollute” the cache. Media applications,
games, and the like are big cache polluters, which is why they weren’t too
affected by the original Celeron’s lack of an L2 cache. Because these applications’
data wasn’t going to be needed again any time soon, the fact that it wasn’t in
a readily accessible cache didn’t really matter.
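This chapter doesn’t pursue remedies, but it’s worth knowing that x86 processors eventually gained non-temporal store instructions for exactly this situation: stores that bypass the cache so that streaming writes don’t evict reusable data. The sketch below uses Intel’s SSE2 intrinsics and compiles only for x86/x86-64; the buffer and its size are invented, and whether bypassing the cache actually wins must be measured case by case.

```c
#include <emmintrin.h>   /* SSE2 intrinsics (x86/x86-64 only) */
#include <stdalign.h>    /* alignas (C11) */
#include <stdio.h>

/* Fill a large buffer with non-temporal stores. The data flows through
   write-combining buffers to main memory without being allocated in the
   cache, so a big streaming write won't evict data that will be reused. */
static void fill_streaming(char *dst, size_t n, char value) {
    __m128i v = _mm_set1_epi8(value);
    for (size_t i = 0; i < n; i += 16)              /* dst must be 16-byte  */
        _mm_stream_si128((__m128i *)(dst + i), v);  /* aligned, n a         */
                                                    /* multiple of 16       */
    _mm_sfence();   /* make the streaming stores globally visible */
}

int main(void) {
    static alignas(16) char buf[1 << 20];   /* 1MB scratch buffer */
    fill_streaming(buf, sizeof buf, 42);
    printf("%d\n", buf[0]);                 /* prints 42 */
    return 0;
}
```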
The primary way in which caches take advantage of temporal locality is
probably obvious by this point: Caches provide a place to store code and data
that the CPU is currently working with. By working with, I mean that the CPU has used it once and is likely to use it again. A group of related blocks of code and/or data that the CPU uses and reuses over a period of time in order to complete a task is called a working set. Depending on the size of a task’s working set and the number of operations it takes the CPU to complete
that task, spatial and temporal locality—and with them the cache hierarchy—
will afford a greater or lesser performance increase on that task.
Locality: Conclusions
One point that should be apparent from the preceding discussion is that
temporal locality implies spatial locality, but not vice versa. That is to say,
data that is reused is always related (and therefore collects into spatially
localized clumps in memory), but data that is related is not always reused.
An open text file is an example of reusable data that occupies a localized
region of memory, and an MP3 file is an example of non-reusable (or
streaming) data that also occupies a localized region of memory.
The other thing you probably noticed in this section is that memory access patterns for code and memory access patterns for data
are often very different within the same application. For instance, media
applications have excellent temporal locality for code but poor temporal
locality for data. This fact has inspired many cache designers to split the L1
into two regions—one for code and one for data. The code half of the cache
is called the instruction cache, or I-cache, while the data half of the cache is called the data cache, or D-cache. This partitioning can result in significant
performance gains, depending on the size of the cache, the types of applications normally run on the system, and a variety of other factors.
Cache Organization: Blocks and Block Frames
One way that caches take advantage of locality of reference is by loading data
from memory in large chunks. When the CPU requests a particular piece of
data from the memory subsystem, that piece gets fetched and loaded into the
L1 along with some of its nearest neighbors. The actual piece of data that was requested is called the critical word, and the surrounding group of bytes that gets fetched along with it is called a cache line or cache block. By fetching
not only the critical word but also a group of its neighbors and loading
them into the cache, the CPU is prepared to take advantage of the fact that