Authors: Jon Stokes
Tags: #Computers, #Systems Architecture, #General, #Microprocessors
phase as a whole can translate one macro-fused x86 instruction into a
macro-fused micro-op on each cycle. (No more than one such macro-fused
micro-op can be generated per cycle.)
All told, macro-fusion allows the predecode phase to send to the decode
phase a maximum of either:

- four normal x86 instructions per cycle, or
- three normal x86 instructions plus one macro-fused instruction, for a
total of five x86 instructions per cycle.
Moving five instructions per cycle into the decode phase is a huge
improvement over the throughput of three instructions per cycle in previous
designs. By enabling the front end to combine two x86 instructions per
cycle into a single micro-op, macro-fusion effectively enlarges Core’s decode,
dispatch, and retire bandwidth, all without the need for extra ROB and RS
entries. Ultimately, less bookkeeping hardware means better power efficiency
per x86 instruction for the processor as a whole, which is why it’s important
for Core to approach the goal of one micro-op per x86 instruction as closely
as possible.
The Decode Phase
Core’s widened back end can grind through micro-ops at an unprecedented
rate, so Intel needed to dramatically increase the new microarchitecture’s
decode rate compared with previous designs so that more micro-ops per
cycle could reach the back end. Core’s designers did a few things to achieve
this goal.
I’ve already talked about one innovation that Core uses to increase its
decode rate: macro-fusion. This new capability has the effect of giving Core
an extra decoder for “free,” but remember that this free decoder that macro-
fusion affords is only good for certain instruction types. Also, the decode
phase as a whole can translate only one macro-fused x86 instruction into a
macro-fused micro-op
on each cycle. (No more than one such macro-fused
micro-op can be generated per cycle.)
Intel’s Pentium M, Core Duo, and Core 2 Duo
Intel also expanded the decode phase’s total throughput by adding a
brand new simple/fast decoding unit, bringing Core’s number of simple/
fast decoders up to three. The three simple/fast decoders combine with the
complex/slow decoder to enable Core’s decoding hardware to send up to
seven micro-ops per cycle into the micro-op queue, from which up to four
micro-ops per cycle can pass into the ROB. The newly expanded decoding
unit is depicted in Figure 12-11.
Finally, Intel has increased Core’s decode rate by making a change to the
back end (described later) that now permits 128-bit SSE instructions to be
decoded into a single micro-op instead of a fused micro-op pair, as in previous
designs. Thus Core’s new front end design brings the processor much closer
to the goal of one micro-op per x86 instruction.
Core’s Pipeline
Core’s 14-stage pipeline is two stages longer than the original 12-stage P6
pipeline. Both of Core’s new stages were added in the processor’s front end.
The first new stage was added in the fetch/predecode phase to accommodate
the instruction queue and macro-fusion, and the second stage was added to
help out with 64-bit address translation.
Intel has not yet made available a detailed breakdown of Core’s pipeline
stages, so the precise locations of the two new stages are still unknown.
Core’s Back End
One of the most distinctive features of the older P6 design is its back end’s
issue port structure, described in Chapter 5. Core uses a similar structure in
its back end, although there are some major differences between the issue
port and reservation station (RS) combination of Core and that of the P6.
To get a sense of the historical development of the issue port scheme,
let’s take a look at the back end of the original Pentium Pro.
As you can see from Figure 12-12, ports 0 and 1 host the arithmetic hard-
ware, while ports 2, 3, and 4 host the memory access hardware. The P6 core’s
reservation station is capable of issuing up to five instructions per cycle to the execution units—one instruction per issue port per cycle.
As the P6 core developed through the Pentium II and Pentium III, Intel
began adding execution units to handle integer and floating-point vector
arithmetic. This new vector execution hardware was added on ports 0 and 1,
with the result that by the time the PIII was introduced, the P6 back end
looked like Figure 12-13.
The PIII’s core is fairly wide, but the distribution of arithmetic execution
resources between only two of the five issue ports means that its performance
can sometimes be bottlenecked by a lack of issue bandwidth (among other
things). All of the code stream’s vector and scalar arithmetic instructions are
contending with each other for two ports, a fact that, when combined with
the two-cycle SSE limitation that I’ll outline in a moment, means the PIII’s
vector performance could never really reach the heights of a cleaner design
like Core.
Chapter 12
[Figure omitted: diagram of the Pentium Pro’s back end, showing the
reservation station (RS) issuing to the scalar ALUs (CIU, SIU, FPU) and
branch unit (BU) on ports 0 and 1, and to the memory access units (load,
store address, and store data) on ports 2, 3, and 4.]
Figure 12-12: The Pentium Pro’s back end
Almost nothing is known about the back ends of the Pentium M and
Core Duo processors because Intel has declined to release that infor-
mation. Both are rumored to be quite similar in organization to the back
end of the Pentium III, but that rumor cannot be confirmed based on
publicly available information.
[Figure omitted: diagram of the Pentium III’s back end, in which the vector
ALUs (MMX 0, MMX 1, VFADD, VFMUL, VSHUFF, VRECIP) share issue
ports 0 and 1 with the scalar integer, floating-point, and branch units,
while the memory access units remain on ports 2, 3, and 4.]
Figure 12-13: The Pentium III’s back end
For Core, Intel’s architects added a new issue port for handling
arithmetic operations. They also changed the distribution of labor on
issue ports 1 and 2 to provide more balance and accommodate more
execution hardware. The final result is the much wider back end that is
shown in Figure 12-14.
Each of Core’s three arithmetic issue ports (0, 1, and 5) now con-
tains a scalar integer ALU, a vector integer ALU, and hardware to perform
floating-point, vector move, and logic operations (the F/VMOV label in
Figure 12-14). Let’s take a brief look at Core’s integer and floating-point
pipelines before moving on to look at the vector hardware in more detail.
[Figure omitted: diagram of Core’s back end, in which each of issue ports 0,
1, and 5 hosts scalar integer (CIU1, CIU2, SIU), vector integer (MMX0,
MMX1, MMX5), and floating-point/vector move (F/VMOV) hardware, with
the FADD/VFADD, FMUL/VFMUL, VSHUF, and branch units distributed
among those ports; ports 2, 3, and 4 host the load, store address, and
store data units.]
Figure 12-14: The back end of the Intel Core microarchitecture
Integer Units
Core’s back end features three scalar 64-bit integer units: one complex
integer unit that’s capable of handling 64-bit multiplication (port 0); one
complex integer unit that’s capable of handling shift instructions, rotate
instructions, and 32-bit multiplication (port 1); and one simple integer
unit (port 5).
The new processor’s back end also has three vector integer units that
handle MMX instructions, one each on ports 0, 1, and 5. This abundance of
scalar and vector integer hardware means that Core can issue three vector
or scalar integer operations per cycle.
Floating-Point Units
In the P6-derived processors leading up to Core, there was a mix of
floating-point hardware of different types on the issue ports. Specifically,
the Pentium III added vector floating-point multiplication to its back end
by modifying the existing FPU on port 0 to support this function. Vector
floating-point addition was added as a separate VFADD (or PFADD, for
packed floating-point addition) unit on port 1. Thus, the floating-point
arithmetic capabilities were unevenly divided among the Pentium III’s two
issue ports as follows:
Port 0
- Scalar addition (x87 and SSE family)
- Scalar multiplication (x87 and SSE family)
- Vector multiplication

Port 1
- Vector addition
Core cleans up this arrangement, which the Pentium M and Core Duo
probably also inherited from the Pentium III, by consolidating all floating-
point multiplication functions (both scalar and vector) into a single VFMUL
unit on port 0; similarly, all vector and scalar floating-point addition functions are brought together in a single VFADD unit on port 1.
Core’s distribution of floating-point labor therefore looks as follows:
Port 0
- Scalar multiplication (single- and double-precision, x87 and SSE family)
- Vector multiplication (four single-precision or two double-precision)

Port 1
- Scalar addition (single- and double-precision, x87 and SSE family)
- Vector addition (four single-precision or two double-precision)
The Core 2 Duo is the first x86 processor from Intel to support double-
precision floating-point operations with a single-cycle throughput. Thus,
Core’s floating-point unit can complete up to four double-precision or
eight single-precision floating-point operations on every cycle. To see just
how much of an improvement Core’s floating-point hardware offers over
its predecessors, take a look at Table 12-4, which compares the throughputs
(measured in cycles per instruction, where lower is better) of scalar and
vector floating-point instructions on four generations of Intel hardware.
Table 12-4: Throughput numbers (cycles/instruction) for vector and
scalar floating-point instructions on five different Intel processors

| Instruction | Pentium III | Pentium 4 | Pentium M/Core Duo | Core 2 Duo |
|-------------|-------------|-----------|--------------------|------------|
| fadd¹       | 1           | 1         | 1                  | 1          |
| fmul¹       | 2           | 2         | 2                  | 2          |
| addss       | —           | 2         | 1                  | 1          |
| addsd       | —           | 2         | 1                  | 1          |
| addps       | 2           | 2         | 2                  | 1          |
| addpd       | —           | 2         | 2                  | 1          |
| mulss       | 1           | 2         | 1                  | 1          |
| mulsd       | —           | 2         | 2                  | 1          |
| mulps       | 2           | 2         | 2                  | 1          |
| mulpd       | —           | 2         | 4                  | 1          |

¹ x87 instruction
In Table 12-4, the packed instructions (addps, addpd, mulps, mulpd)
denote vector operations, while the remaining instructions denote scalar
operations; the sd and pd rows are the double-precision operations. With
the exception of the fadd and fmul instructions, all of the instructions
listed belong to the SSE family. Here are a few SSE instructions interpreted
for you, so that you can figure out which operations the instructions
perform:
- addss: scalar, single-precision addition
- addsd: scalar, double-precision addition
- mulps: packed (vector), single-precision multiplication