Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture
Jon Stokes

detect loops that have a very low number of iterations. Core Duo's loop detector can detect loops with smaller iteration counts, a feature that saves power and improves performance by lowering the number of instruction fetches and BTB accesses.

SSE3

Core Duo introduces a new member into the SSE family of ISA extensions: SSE3. The SSE3 instruction set consists of 13 new instructions, which Intel's Software Developer's Manual summarizes as follows:

- One x87 FPU instruction used in integer conversion
- One SIMD integer instruction that addresses unaligned data loads
- Two SIMD floating-point packed ADD/SUB instructions
- Four SIMD floating-point horizontal ADD/SUB instructions
- Three SIMD floating-point LOAD/MOVE/DUPLICATE instructions
- Two thread-synchronization instructions

These new instructions fill in some gaps left in the SSE family and in the x87 extensions, mostly in the areas of byte shuffling and floating-point inter-element arithmetic. These are the areas in which the SSE family has been weakest when compared with AltiVec.

Floating-Point Improvement

When programmers use the x87 floating-point instructions to perform floating-point computations, they have multiple options available to them for dealing with the more complicated aspects of floating-point math, like number formats and rounding behavior. The x87 FPU has a special register called the floating-point control word (FPCW), which programmers can use to tell the FPU how they'd like it to handle these issues. In short, the FPCW holds configuration data for the floating-point unit, and programmers write new data into that register whenever they'd like to change the FPU's configuration.

All Intel designs prior to Core Duo have assumed that programmers very rarely write to the FPCW. Because of this assumption, Intel's chip architects have never associated any rename registers with the FPCW. As it turns out, however, some types of programs contain code that writes to the FPCW fairly frequently, most often to change the FPU's rounding control options. For such programs, a single copy of the FPCW is a significant bottleneck, because the entire floating-point pipeline must stall until that one register is finished being updated.

Core Duo is the first Intel processor to feature a set of microarchitectural rename registers for the FPCW. These four new rename registers enable Core Duo to extract more parallelism from floating-point code by eliminating false register name conflicts associated with the FPCW. (For more on false register name conflicts, data hazards, and register renaming, see Chapter 4.)

Integer Divide Improvement

Integer divisions are rare in most code, but when they do occur, they stall the complex integer unit for many cycles. The CIU must grind through the large number of computations and bit shifts that it takes to produce a division result; no other instructions can enter the CIU's pipeline during this time.

Core Duo's complex integer unit tries to shorten integer division's long latencies by examining each x86 integer divide instruction (idiv) that it encounters in order to see if it can exit the division process early. For idiv instructions that have smaller data sizes and need fewer iterations inside the ALU hardware to produce a valid result, the integer unit stops the division once the required number of iterations has completed. This technique reduces average idiv latencies because the ALU no longer forces every idiv, regardless of data size, to go through the same number of iterations. In some cases, an idiv that would take 12 cycles on Dothan takes only 4 cycles on Core Duo, and in others the latency can be reduced from 20 cycles (Dothan) to 12 cycles (Core Duo).

Virtualization Technology

The SSE3 instructions aren't the only new extensions added to the x86 ISA. Intel also used Core Duo to introduce its Virtualization Technology, called VT-x, along with a set of supporting ISA extensions called Virtual Machine Extensions (VMX).

VT-x is worthy of its own chapter, but I'll summarize it very briefly here. In a nutshell, VT-x enables a single processor to run multiple operating system/application stacks simultaneously, with each stack thinking that it has complete control of the processor. VT-x accomplishes this by presenting a virtual processor to each operating system instance. A virtual machine monitor (VMM) then runs at a level beneath the operating systems, closest to the processor hardware, and manages the multiple operating system instances running on the virtual processors.

With virtualization technology, a single, possibly underutilized multicore processor can be made to do the work of multiple computers, thereby keeping more of its execution hardware busy during each cycle. Indeed, VT-x can be thought of as a way to increase power efficiency simply by giving the processor more work to do, so that fewer execution slots per cycle are wasted due to idleness.


Summary: Core Duo in Historical Context

Core Duo's improvements on the Dothan design enabled Intel to offer a dual-core part with the power dissipation characteristics of previous single-core parts. Because it integrated two cores onto a single die, Core Duo could also offer a significant speedup for workloads involving multiple instruction streams (or threads of execution, in computer science parlance). However, more radical changes to the microarchitecture were needed if Intel was to meet its goal of dramatically increasing performance on single instruction streams without also increasing clockspeed and power consumption.

Core 2 Duo

The Intel Core microarchitecture introduced in the Core 2 Duo line of processors represents Intel's most ambitious attempt since the Pentium Pro to increase single-threaded performance independently of clockspeeds. Because its designers took a "more hardware" instead of "more clockspeed" approach to performance, Core is bigger and wider than just about any mass-market design that has come before it (see Table 12-3). Indeed, this "more of everything" approach is readily apparent with a glance at the diagram of the new microarchitecture in Figure 12-10.

In every phase of Core's 14-stage pipeline, there is more of just about anything you could think of: more decoding logic, more re-order buffer space, more reservation station entries, more issue ports, more execution hardware, more memory buffer space, and so on. In short, Core's designers took everything that has already been proven to work and added more of it, along with a few new tricks and tweaks.

Table 12-3: Features of the Core 2 Duo/Solo

Introduction Date               July 27, 2006
Process                         65 nanometer
Transistor Count                291 million
Clock Speed at Introduction     1.86 to 2.93 GHz
L1 Cache Size                   32KB instruction, 32KB data
L2 Cache Size (on-die)          2MB or 4MB
x86 ISA Extensions              EM64T for 64-bit support

Core is wider in the decode, dispatch, issue, and commit pipeline phases than every processor covered in this book except the PowerPC 970. Core's instruction window, which consists of a 96-entry reorder buffer and a 32-entry reservation station, is bigger than that of any previous Intel microarchitecture except for Netburst. However, as I've mentioned before, bigger doesn't automatically mean better. There are real-world limits on the number of instructions that can be executed in parallel, so the wider the machine, the more execution slots per cycle that can potentially go unused because of limits to instruction-level parallelism (ILP). Furthermore, Chapter 3 described how memory latency can starve a wide machine for code and data, resulting in a waste of execution resources. Core has a number of features that are there solely to address ILP and memory latency issues and to ensure that the processor is able to keep its execution units full.

[Figure 12-10 diagram: in the front end, instruction fetch and the branch prediction unit (BPU) feed the x86 translate/decode hardware; decoded micro-ops flow through the reorder buffer (ROB) and reservation station (RS), which issue through ports 0-5 to the floating-point/MMX/SSE vector and scalar ALUs, the integer units (CIU1, CIU2, SIU), the branch unit, and the load-store unit's memory access units; in the back end, completed instructions retire through the re-order buffer and commitment unit.]

Figure 12-10: The Intel Core microarchitecture

NOTE

The Intel Core microarchitecture family actually consists of three nearly identical
microarchitectural variants, each of which is known by its code name. Merom is the
low-power mobile microarchitecture, Conroe is the desktop microarchitecture, and
Woodcrest is the server microarchitecture.

In the front end, micro-ops fusion and a new trick called macro-fusion work together to keep code moving into the back end; and in the back end, a greatly enlarged instruction window ensures that more instructions can reach the execution units on each cycle. Intel has also fixed an important SSE bottleneck that existed in previous designs, thereby massively improving Core's vector performance over that of its predecessors.

In the remainder of this chapter, I'll talk about all of these improvements and many more, placing each of Core's new features in the context of Intel's overall focus on balancing performance, scalability, and power consumption.


The Fetch Phase

As I'll discuss in more detail later, Core has a higher decode rate than any of its predecessors. This higher decode rate means that more radical design changes were needed in the fetch phase to prevent the decoder from being starved for instructions. A simple increase in the size of the fetch buffer wouldn't cut it this time, so Intel tried a different approach.

Core's fetch buffer is only 32 bytes—the size of the fetch buffer on the original P6 core. In place of an expanded fetch buffer, Core sports an entirely new structure that sits in between the fetch buffer and the decoders: a bona fide instruction queue.

Core's 18-entry IQ, depicted in Figure 12-11, holds about the same number of x86 instructions as the Pentium M's 64-byte fetch buffer. The predecode hardware can move up to six x86 instructions per cycle from the fetch buffer into the IQ, where a new feature called macro-fusion is used to prepare between four and five x86 instructions each cycle for transfer from the IQ to the decode hardware.

[Figure 12-11 diagram: instructions flow from the L1 instruction cache through a 2 x 16-byte fetch buffer into the 18-entry instruction queue, and from there to three simple decoders and one complex decoder backed by the translate/microcode engine; the resulting micro-ops feed the reorder buffer (ROB). The diagram distinguishes four paths: the x86 instruction path, the macro-fused instruction path, the micro-op instruction path, and the fused micro-op instruction path.]

Figure 12-11: Core's fetch and decode hardware


NOTE

Core’s instruction queue also takes over the hardware loop buffer function of previous
designs’ fetch buffers.

Macro-Fusion

A major new feature of Core's front end hardware is its ability to fuse pairs of x86 instructions together in the predecode phase and send them through a single decoder to be translated into a single micro-op. This feature, called macro-fusion, can be used only on certain types of instructions; specifically, compare and test instructions can be macro-fused with branch instructions.

Core's predecode phase can send one macro-fused x86 instruction per cycle to any one of the front end's four decoders. (As we'll see later, Core has four instruction decoders, one more than its predecessors.) In turn, the decode
