Pentium 4 Deeply pipelined processor supporting multiple issue

Aug 24, 2022

SLIDE 1

Section 2.10 1

Pentium 4

Deeply pipelined processor supporting multiple issue with speculation and multi-threading

– 2004 version: 31 clock cycles from fetch to retire, 3.2 GHz clock rate (deep pipeline allows higher clock rate)

Front-end decoder translates each IA-32 instruction into a series of RISC-like micro-operations called uops

Uops executed by dynamically scheduled speculative pipeline

SLIDE 2

Pentium 4, continued

Uops are stored in an execution trace cache

– Stores sequences of instructions to be executed, including nonadjacent instructions
– Accessed using branch prediction bits and address of first instruction in trace
– Has its own branch target buffer for predicting the outcome of uop branches
– Very high hit rate; IA-32 instruction fetch rarely needed
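
The trace cache lookup can be sketched as a table keyed by the address of a trace's first instruction together with the predicted branch outcomes along it. This is a toy model and not the Pentium 4's actual organization: the class name `TraceCache`, the uop strings, and the unbounded dictionary are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Key: (address of first instruction in the trace, predicted branch outcomes).
TraceKey = Tuple[int, Tuple[bool, ...]]


class TraceCache:
    """Toy trace cache: maps a start address plus branch predictions to uops."""

    def __init__(self) -> None:
        self._traces: Dict[TraceKey, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def fill(self, start: int, predictions: Tuple[bool, ...], uops: List[str]) -> None:
        # Called after the front end has decoded IA-32 instructions into uops.
        self._traces[(start, predictions)] = list(uops)

    def lookup(self, start: int, predictions: Tuple[bool, ...]) -> Optional[List[str]]:
        trace = self._traces.get((start, predictions))
        if trace is None:
            self.misses += 1  # a miss falls back to IA-32 fetch and decode
            return None
        self.hits += 1
        return trace


cache = TraceCache()
# Hypothetical trace: the branch is predicted taken, so a uop from the
# nonadjacent target block is stored right after it in the same trace.
cache.fill(0x1000, (True,), ["load r1", "cmp r1, 0", "br taken", "add r2, r1"])

hit = cache.lookup(0x1000, (True,))    # same start address, same prediction
miss = cache.lookup(0x1000, (False,))  # same start address, other prediction
```

Keying on the prediction bits as well as the address is what lets two different paths through the same starting instruction live in the cache as separate traces.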

SLIDE 3

Pentium 4, continued

Uops executed by an out-of-order speculative pipeline that uses register renaming rather than a reorder buffer

Up to three uops per clock cycle can be renamed and dispatched to the functional unit queue

Up to three uops can commit each clock cycle

Up to six uops can be dispatched to functional units each clock cycle
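
These width limits imply a simple lower bound on how many cycles a stream of uops needs. The sketch below is a rough bound only: it ignores dependences, cache misses, and mispredictions, and the function name `min_cycles` is hypothetical.

```python
import math

RENAME_WIDTH = 3    # uops renamed and queued per cycle (from the slide)
COMMIT_WIDTH = 3    # uops committed per cycle (from the slide)
DISPATCH_WIDTH = 6  # uops sent to functional units per cycle (from the slide)


def min_cycles(num_uops: int) -> int:
    """Lower bound on cycles to move num_uops through rename, dispatch, commit."""
    # The narrowest stage sets the sustained rate.
    sustained = min(RENAME_WIDTH, DISPATCH_WIDTH, COMMIT_WIDTH)
    return math.ceil(num_uops / sustained)


cycles_for_12 = min_cycles(12)
cycles_for_13 = min_cycles(13)
```

The six-wide dispatch lets queued uops drain in bursts, but the sustained rate is capped by the three-wide rename and commit stages.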

SLIDE 4

Figure 2.26

SLIDE 5

About figure 2.26

Front-end BTB – predicts next IA-32 instruction to fetch; only accessed on a miss in the execution trace cache

Execution trace cache – holds uops

Trace cache BTB – predicts the next uop

Registers for renaming – 128; supports 128 uops executing simultaneously

Functional units – 7 (simple ones run at twice the clock rate and accept up to two uops every clock cycle)
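
To illustrate why the number of renaming registers bounds the number of uops in flight, here is a minimal renaming sketch. It is a toy model under stated assumptions: the slides do not describe the Pentium 4's allocation or release policy, so the free list, the `Renamer` class, and the explicit `release` call are all hypothetical.

```python
from collections import deque
from typing import Dict, Optional

NUM_PHYSICAL = 128  # renaming registers, from the slide


class Renamer:
    """Toy renamer: every destination write gets a fresh physical register."""

    def __init__(self, num_physical: int = NUM_PHYSICAL) -> None:
        self.free = deque(range(num_physical))
        self.map: Dict[str, int] = {}  # architectural name -> physical register

    def rename_dest(self, arch_reg: str) -> Optional[int]:
        if not self.free:
            return None  # no free register: renaming stalls
        phys = self.free.popleft()
        self.map[arch_reg] = phys
        return phys

    def source(self, arch_reg: str) -> int:
        # Readers see the most recent mapping of the architectural register.
        return self.map[arch_reg]

    def release(self, phys: int) -> None:
        # Assumed to happen once the older value is no longer needed.
        self.free.append(phys)


r = Renamer(num_physical=2)      # tiny pool so the stall is easy to see
first = r.rename_dest("eax")
second = r.rename_dest("eax")    # second write to eax gets a different register
stalled = r.rename_dest("ebx")   # pool exhausted
r.release(first)
resumed = r.rename_dest("ebx")   # proceeds once a register is freed
```

With the full pool of 128, renaming stalls only when 128 destination registers are already allocated to uops that have not yet released them.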

SLIDE 6

About figure 2.26

L1 data cache – supports up to 8 outstanding misses; integer load latency is 4 cycles; FP load latency is 12 cycles

L2 cache – 18-cycle access time

SLIDE 7

Pentium 4

Deep pipeline makes speculation and branch prediction very important for high performance

Cost of a cache miss is also very high, as queues will fill while waiting for the miss to be handled
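
The link between pipeline depth and the cost of a misprediction can be made concrete with a back-of-the-envelope estimate: the CPI added by mispredictions is the misprediction frequency times the recovery penalty. All numbers below are hypothetical, and the penalty is an assumed input rather than a figure taken from the slides.

```python
def mispredict_cpi(mispredicts_per_1000: float, penalty_cycles: float) -> float:
    """CPI added by branch mispredictions (simple additive model)."""
    return mispredicts_per_1000 / 1000.0 * penalty_cycles


# Hypothetical workload: 10 mispredictions per 1000 instructions.
shallow = mispredict_cpi(10, penalty_cycles=10)  # assumed shallow-pipeline penalty
deep = mispredict_cpi(10, penalty_cycles=30)     # assumed deep-pipeline penalty
```

With the same predictor accuracy, tripling the assumed penalty triples the CPI lost to mispredictions, which is why a deep pipeline depends so heavily on good prediction.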

SLIDE 8

Pentium 4: Branch misprediction

Figure 2.28 (next slide) shows the branch-misprediction rate per 1000 instructions

– Top five are integer benchmarks (average 186 branches per 1000 instructions)
– Bottom five are FP benchmarks (48 branches per 1000 instructions)
– Misprediction rate for integer benchmarks is 8 times higher than for FP benchmarks
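
Part of the gap comes simply from integer code having more branches. The short calculation below separates the two effects. It assumes the factor of 8 refers to mispredictions per 1000 instructions (the metric the figure plots); if it instead refers to mispredictions per branch, this split does not apply.

```python
INT_BRANCHES_PER_1000 = 186       # from the slide
FP_BRANCHES_PER_1000 = 48         # from the slide
MISPREDICT_RATIO_PER_INSTR = 8.0  # from the slide, assumed per 1000 instructions

# How much more often integer code branches than FP code.
branch_freq_ratio = INT_BRANCHES_PER_1000 / FP_BRANCHES_PER_1000

# What remains is the ratio of mispredictions per branch.
per_branch_ratio = MISPREDICT_RATIO_PER_INSTR / branch_freq_ratio
```

Under that assumption, integer code has roughly 3.9 times as many branches, and each integer branch is roughly twice as likely to be mispredicted.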

SLIDE 9

Figure 2.28

SLIDE 10

Pentium 4: Misspeculation

Misprediction causes wrong instructions to be executed (misspeculated instructions), which requires recovery time and wastes energy

Figure 2.29 (next slide) shows the percentage of uops issued that are misspeculated

Note that Figure 2.29 closely matches Figure 2.28

SLIDE 11

Figure 2.29

SLIDE 12

Pentium 4: cache misses

Trace cache miss rates are almost negligible for SPEC benchmarks

L1 and L2 miss rates are more significant

Figure 2.30 (next slide) shows misses per 1000 instructions for the L1 and L2 caches

L1 misses are more frequent, but the L2 miss penalty is higher, so both affect performance
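
A rough stall estimate shows how a frequent but cheap miss and a rare but expensive miss can cost about the same. The sketch treats an L1 miss as costing the 18-cycle L2 access time given earlier in the deck, which is a simplification; the miss counts and the 200-cycle memory penalty are hypothetical, and overlap from out-of-order execution is ignored.

```python
from typing import Tuple

L2_ACCESS_CYCLES = 18  # L2 access time, from the earlier slide
MEMORY_CYCLES = 200    # hypothetical penalty for an L2 miss


def stall_cpi(l1_misses_per_1000: float,
              l2_misses_per_1000: float,
              l2_cycles: float = L2_ACCESS_CYCLES,
              memory_cycles: float = MEMORY_CYCLES) -> Tuple[float, float]:
    """Return the CPI added by L1 misses and by L2 misses (no overlap)."""
    l1_part = l1_misses_per_1000 / 1000.0 * l2_cycles
    l2_part = l2_misses_per_1000 / 1000.0 * memory_cycles
    return l1_part, l2_part


# Hypothetical benchmark: 20 L1 misses and 2 L2 misses per 1000 instructions.
l1_part, l2_part = stall_cpi(20, 2)
```

Here the L1 misses are ten times as frequent, yet the two contributions come out at similar size because each L2 miss is assumed to cost about eleven times as much.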

SLIDE 13

Figure 2.30

SLIDE 14

Pentium 4: CPI

Figure 2.31 (next slide) shows cycles per instruction for these same 10 SPEC benchmarks

Note that mcf has the worst misspeculation rate and the worst L1 and L2 miss rates, and also has the highest CPI

Note that swim has high L1 and L2 miss rates and is the lowest-performing FP benchmark

SLIDE 15

Figure 2.31

SLIDE 16

Comparing Pentium 4 to AMD Opteron

Both use a dynamically scheduled, speculative pipeline capable of issuing three IA-32 instructions per clock cycle

Both have two levels of on-chip cache, but the Opteron L1 instruction cache is not a trace cache

Biggest difference is that the Pentium 4 is more deeply pipelined

Pentium 4 has higher CPI (Figure 2.32), but this makes sense given its deeper pipeline

SLIDE 17

Figure 2.32

SLIDE 18

Comparing Pentium 4 to AMD Opteron

Deeper pipelining allows an increase in clock rate

– Will this increase make up for the increase in CPI?

Figure 2.33 (next slide) compares a 2.8 GHz AMD Opteron with a 3.8 GHz Intel Pentium 4

Note that the AMD has higher performance, so the higher clock rate is insufficient to overcome the higher CPI
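
The trade-off can be written as a break-even condition. Since execution time is instruction count times CPI divided by clock rate, the Pentium 4 is faster only if its CPI exceeds the Opteron's by less than the clock-rate ratio. The sketch assumes both machines execute the same instruction count, and the CPI values passed in are hypothetical, not read from Figure 2.32.

```python
P4_GHZ = 3.8       # from the slide
OPTERON_GHZ = 2.8  # from the slide


def relative_performance(cpi_p4: float, cpi_opteron: float) -> float:
    """Pentium 4 performance divided by Opteron performance (>1: P4 faster)."""
    return (P4_GHZ / cpi_p4) / (OPTERON_GHZ / cpi_opteron)


# The Pentium 4 wins only if cpi_p4 / cpi_opteron is below this ratio.
break_even_cpi_ratio = P4_GHZ / OPTERON_GHZ

# Hypothetical CPI pairs on either side of the break-even point.
p4_wins = relative_performance(cpi_p4=1.2, cpi_opteron=1.0)
p4_loses = relative_performance(cpi_p4=1.5, cpi_opteron=1.0)
```

The break-even ratio is about 1.36, so a CPI more than roughly 36% higher than the Opteron's cancels the Pentium 4's clock-rate advantage.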

SLIDE 19

Figure 2.33

SLIDE 20

Comparing Pentium 4 to IBM Power5

Sophisticated multiple-issue pipelines usually have slower clock rates than simple pipelines

Faster clock rate will win in the presence of limited ILP

IBM Power5 designed for high-performance integer and FP (two processor cores, each capable of sustaining four instructions per clock cycle); 1.9 GHz clock rate

SLIDE 21

Comparing Pentium 4 to IBM Power5

Pentium 4 – single processor with multithreading; very deep pipeline; can sustain three instructions per clock cycle; higher clock rate (3.8 GHz)

Figure 2.34 (next slide) compares the performance of these machines

– Note that the Power5 often does better on the FP benchmarks (fewer branches, more parallelism)
– Pentium 4 does better on integer benchmarks (higher clock rate)
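
For orientation, the quoted clock rates and sustained issue widths give the following peak instruction rates. This is an upper bound only, sketched under the assumption that every cycle issues at full width; the two machines run different instruction sets, so the numbers are not directly comparable, and the function name is hypothetical.

```python
def peak_ginstr_per_sec(clock_ghz: float, instr_per_cycle: int, cores: int = 1) -> float:
    """Peak billions of instructions per second at full issue width."""
    return clock_ghz * instr_per_cycle * cores


p4_peak = peak_ginstr_per_sec(3.8, 3)                    # Pentium 4, single core
power5_core_peak = peak_ginstr_per_sec(1.9, 4)           # one Power5 core
power5_chip_peak = peak_ginstr_per_sec(1.9, 4, cores=2)  # both Power5 cores
```

A single Pentium 4 thread has the higher peak, which is consistent with it winning where ILP is limited, while the wider Power5 core benefits when code exposes enough parallelism to fill four issue slots.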

SLIDE 22

Figure 2.34