Pentium 4: Deeply pipelined processor supporting multiple issue

SLIDE 1

Section 2.10 1

Pentium 4

  • Deeply pipelined processor supporting multiple issue with speculation and multithreading
    – 2004 version: 31 clock cycles from fetch to retire, 3.2 GHz clock rate (the deep pipeline allows a higher clock rate)
  • Front-end decoder translates each IA-32 instruction into a series of RISC-like micro-operations called uops
  • Uops are executed by a dynamically scheduled speculative pipeline
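
The two figures on this slide can be combined into a quick back-of-the-envelope calculation. The sketch below is only illustrative arithmetic: the function names are made up for this example, and it treats "31 clock cycles from fetch to retire" as the minimum latency of a single instruction with no stalls.

```python
PIPELINE_CYCLES = 31   # fetch to retire, from the slide
CLOCK_HZ = 3.2e9       # 3.2 GHz, from the slide


def cycle_time_ns(clock_hz: float) -> float:
    """Length of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def fetch_to_retire_ns(cycles: int, clock_hz: float) -> float:
    """Minimum time for one instruction to pass through the whole pipeline."""
    return cycles * cycle_time_ns(clock_hz)


if __name__ == "__main__":
    print(f"cycle time: {cycle_time_ns(CLOCK_HZ):.4f} ns")
    print(f"fetch to retire: {fetch_to_retire_ns(PIPELINE_CYCLES, CLOCK_HZ):.4f} ns")
```

At 3.2 GHz one cycle lasts 0.3125 ns, so an instruction needs roughly 9.7 ns to travel from fetch to retire, even though the pipeline can complete several instructions per cycle once it is full.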

SLIDE 2

Pentium 4, continued

  • Uops are stored in an execution trace cache
    – Stores sequences of instructions to be executed, including nonadjacent instructions
    – Accessed using branch prediction bits and address of first instruction in trace
    – Has its own branch target buffer for predicting the outcome of uop branches
    – Very high hit rate, so IA-32 instruction fetch is rarely needed
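
The lookup described above (first-instruction address plus branch prediction bits) can be sketched as a small dictionary-backed model. This is a toy illustration, not the Pentium 4's actual organization: the class name `TraceCache`, the string uops, and the unbounded capacity are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Key: (address of the first instruction in the trace, predicted branch outcomes).
TraceKey = Tuple[int, Tuple[bool, ...]]


class TraceCache:
    """Toy trace cache: maps a start address plus predicted branch
    directions to a previously decoded sequence of uops."""

    def __init__(self) -> None:
        self._traces: Dict[TraceKey, List[str]] = {}
        self.hits = 0
        self.misses = 0

    def fill(self, start_addr: int, predictions: Tuple[bool, ...],
             uops: List[str]) -> None:
        """Record a decoded trace; it may span nonadjacent instructions."""
        self._traces[(start_addr, predictions)] = list(uops)

    def lookup(self, start_addr: int,
               predictions: Tuple[bool, ...]) -> Optional[List[str]]:
        """Return the stored uops, or None if the front end must decode."""
        trace = self._traces.get((start_addr, predictions))
        if trace is None:
            self.misses += 1  # a real design would fall back to IA-32 fetch/decode
        else:
            self.hits += 1
        return trace


if __name__ == "__main__":
    tc = TraceCache()
    # Hypothetical trace: the taken branch makes the uops nonadjacent in memory.
    tc.fill(0x1000, (True,), ["load", "cmp", "branch", "add"])
    print(tc.lookup(0x1000, (True,)))   # same start and prediction: hit
    print(tc.lookup(0x1000, (False,)))  # different prediction: miss
```

Keying on the prediction bits as well as the address is what lets two different paths through the same starting instruction coexist as separate traces.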

SLIDE 3

Pentium 4, continued

  • Uops are executed by an out-of-order speculative pipeline that uses register renaming rather than a reorder buffer
  • Up to three uops per clock cycle can be renamed and dispatched to the functional unit queue
  • Up to three uops can commit each clock cycle
  • Up to six uops can be dispatched to functional units each clock cycle
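
Register renaming, mentioned in the first bullet, can be illustrated with a minimal map-table model. This is a sketch under simplifying assumptions, not the Pentium 4's implementation: the `Renamer` class and register names are hypothetical, physical registers are never returned to the free list at commit, and the default of 128 physical registers is borrowed from the count the deck gives for Figure 2.26.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class Renamer:
    """Toy rename stage: architectural register -> physical register."""

    def __init__(self, arch_regs: List[str], num_physical: int = 128) -> None:
        if num_physical < len(arch_regs):
            raise ValueError("need at least one physical register per architectural register")
        self.free: Deque[int] = deque(range(num_physical))
        # Each architectural register starts out mapped to its own physical register.
        self.map: Dict[str, int] = {r: self.free.popleft() for r in arch_regs}

    def rename(self, dest: str, srcs: List[str]) -> Tuple[int, List[int]]:
        """Rename one uop: read the sources through the current map,
        then give the destination a fresh physical register."""
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        phys_dest = self.free.popleft()
        self.map[dest] = phys_dest
        return phys_dest, phys_srcs


if __name__ == "__main__":
    r = Renamer(["eax", "ebx", "ecx"])
    print(r.rename("eax", ["ebx", "ecx"]))  # eax gets a new physical register
    print(r.rename("ebx", ["eax"]))         # reads the renamed eax
    print(r.rename("eax", ["ecx"]))         # second write to eax gets yet another register
```

Because the two writes to `eax` land in different physical registers, the second write no longer has to wait for the first one or for its readers, which is how renaming removes write-after-write and write-after-read hazards.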

SLIDE 4

Figure 2.26

SLIDE 5

About Figure 2.26

  • Front-end BTB – predicts the next IA-32 instruction to fetch; only accessed on a miss in the execution trace cache
  • Execution trace cache – holds uops
  • Trace cache BTB – predicts the next uop
  • Registers for renaming – 128; supports 128 uops executing simultaneously
  • Functional units – 7 (the simple ones run at twice the clock rate and accept up to two uops every clock cycle)

SLIDE 6

About Figure 2.26, continued

  • L1 data cache – supports up to 8 outstanding misses; integer load latency is 4 cycles; FP load latency is 12 cycles
  • L2 cache – 18-cycle access time

SLIDE 7

Pentium 4

  • Deep pipeline makes speculation and branch prediction very important for high performance
  • Cost of a cache miss is also very high, as queues will fill while waiting for the miss to be handled

SLIDE 8

Pentium 4: Branch misprediction

  • Figure 2.28 (next slide) shows the branch-misprediction rate per 1000 instructions
    – Top five are integer benchmarks (average 186 branches per 1000 instructions)
    – Bottom five are FP benchmarks (48 branches per 1000 instructions)
    – Misprediction rate for integer benchmarks is 8 times higher than for FP benchmarks
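
The integer benchmarks also execute far more branches, so part of the 8x gap comes from branch frequency rather than from worse prediction. The sketch below separates the two effects. It assumes the "8 times higher" figure refers to mispredictions per 1000 instructions (the metric the figure plots); the function name is hypothetical.

```python
INT_BRANCHES_PER_1000 = 186    # from the slide
FP_BRANCHES_PER_1000 = 48      # from the slide
PER_INSTRUCTION_RATIO = 8.0    # assumed to be a per-1000-instructions ratio


def per_branch_ratio(per_instruction_ratio: float,
                     int_branches_per_1000: float,
                     fp_branches_per_1000: float) -> float:
    """Integer-to-FP ratio of mispredictions per branch.

    With m = mispredictions per 1000 instructions and b = branches per
    1000 instructions, the per-branch rate is m / b, so
    (m_int / b_int) / (m_fp / b_fp) = (m_int / m_fp) * (b_fp / b_int).
    """
    return per_instruction_ratio * fp_branches_per_1000 / int_branches_per_1000


if __name__ == "__main__":
    freq_ratio = INT_BRANCHES_PER_1000 / FP_BRANCHES_PER_1000
    ratio = per_branch_ratio(PER_INSTRUCTION_RATIO,
                             INT_BRANCHES_PER_1000, FP_BRANCHES_PER_1000)
    print(f"integer code has {freq_ratio:.2f}x as many branches")
    print(f"each integer branch is mispredicted about {ratio:.2f}x as often")
```

Under that assumption, roughly a factor of 3.9 of the gap is explained by branch frequency, and the remaining factor of about 2 reflects integer branches being harder to predict.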

SLIDE 9

Figure 2.28

SLIDE 10

Pentium 4: Misspeculation

  • Misprediction causes wrong instructions to be executed (misspeculated instructions), which requires recovery time and wastes energy
  • Figure 2.29 (next slide) shows the percentage of uops issued that are misspeculated
  • Note that Figure 2.29 closely matches Figure 2.28

SLIDE 11

Figure 2.29

SLIDE 12

Pentium 4: cache misses

  • Trace cache miss rates are almost negligible for the SPEC benchmarks
  • L1 and L2 miss rates are more significant
  • Figure 2.30 (next slide) shows misses per 1000 instructions for the L1 and L2 caches
  • L1 misses are more frequent, but the miss penalty for L2 is higher, so both affect performance
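
The last point is a product of two quantities: how often a miss happens and how much each miss costs. The sketch below works one case through. Every input is a labeled placeholder rather than a measurement from Figure 2.30: the miss counts are hypothetical, the L1 penalty reuses the 18-cycle L2 access time from the earlier slide as an approximation, and the 200-cycle L2 penalty is invented. It also ignores the overlap an out-of-order pipeline can achieve, so it is an upper bound.

```python
def stall_cycles_per_1000(misses_per_1000: float, penalty_cycles: float) -> float:
    """Stall cycles per 1000 instructions if no miss latency is overlapped."""
    return misses_per_1000 * penalty_cycles


# Hypothetical inputs, chosen only to illustrate the trade-off.
L1_MISSES_PER_1000 = 20.0    # hypothetical
L1_PENALTY_CYCLES = 18.0     # assumption: the L2 access time quoted earlier
L2_MISSES_PER_1000 = 2.0     # hypothetical
L2_PENALTY_CYCLES = 200.0    # hypothetical main-memory latency

if __name__ == "__main__":
    l1 = stall_cycles_per_1000(L1_MISSES_PER_1000, L1_PENALTY_CYCLES)
    l2 = stall_cycles_per_1000(L2_MISSES_PER_1000, L2_PENALTY_CYCLES)
    print(f"L1 stalls: {l1:.0f} cycles per 1000 instructions")
    print(f"L2 stalls: {l2:.0f} cycles per 1000 instructions")
    print(f"CPI contribution: {(l1 + l2) / 1000:.2f}")
```

With these placeholder values the L2 misses are ten times rarer than the L1 misses yet cost about as many cycles in total, which is the sense in which both levels matter.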

SLIDE 13

Figure 2.30

SLIDE 14

Pentium 4: CPI

  • Figure 2.31 (next slide) shows cycles per instruction for these same 10 SPEC benchmarks
  • Note that mcf has the worst misspeculation rate and the worst L1 and L2 miss rates, and also the highest CPI
  • Note that swim has high L1 and L2 miss rates and is the lowest-performing FP benchmark

SLIDE 15

Figure 2.31

SLIDE 16

Comparing Pentium 4 to AMD Opteron

  • Both use a dynamically scheduled, speculative pipeline capable of issuing three IA-32 instructions per clock cycle
  • Both have two levels of on-chip cache, but the Opteron L1 instruction cache is not a trace cache
  • Biggest difference is that the Pentium 4 is more deeply pipelined
  • Pentium 4 has a higher CPI (Figure 2.32), but this makes sense given the deeper pipeline

SLIDE 17

Figure 2.32

SLIDE 18

Comparing Pentium 4 to AMD Opteron

  • Deeper pipelining allows an increase in clock rate
    – Will this increase make up for the increase in CPI?
  • Figure 2.33 (next slide) compares a 2.8 GHz AMD Opteron with a 3.8 GHz Intel Pentium 4
  • Note that the AMD has higher performance, so the higher clock rate is insufficient to overcome the higher CPI
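
The question on this slide has a simple break-even point. Assuming both processors run the same IA-32 binaries (so the instruction counts match), execution time is proportional to CPI divided by clock rate. The sketch below uses the two clock rates from the slide; the function names and the sample CPI values are hypothetical and are not read from Figures 2.32 or 2.33.

```python
P4_CLOCK_HZ = 3.8e9        # from the slide
OPTERON_CLOCK_HZ = 2.8e9   # from the slide


def instructions_per_second(clock_hz: float, cpi: float) -> float:
    """Throughput for a given clock rate and cycles per instruction."""
    return clock_hz / cpi


def break_even_cpi_ratio(fast_clock_hz: float, slow_clock_hz: float) -> float:
    """Largest CPI ratio (fast / slow) at which the faster clock still ties."""
    return fast_clock_hz / slow_clock_hz


if __name__ == "__main__":
    limit = break_even_cpi_ratio(P4_CLOCK_HZ, OPTERON_CLOCK_HZ)
    print(f"Pentium 4 wins only if its CPI is under {limit:.3f}x the Opteron's")

    # Hypothetical CPI values, one on each side of the break-even point.
    opteron = instructions_per_second(OPTERON_CLOCK_HZ, cpi=1.0)
    for p4_cpi in (1.2, 1.5):
        p4 = instructions_per_second(P4_CLOCK_HZ, cpi=p4_cpi)
        print(f"P4 CPI {p4_cpi}: {'faster' if p4 > opteron else 'slower'} than Opteron")
```

The clock-rate advantage is a factor of about 1.36, so the observation that the Opteron performs better implies that the Pentium 4's CPI on those benchmarks is more than about 1.36 times the Opteron's.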

SLIDE 19

Figure 2.33

SLIDE 20

Comparing Pentium 4 to IBM Power5

  • Sophisticated multiple-issue pipelines usually have slower clock rates than simple pipelines
  • A faster clock rate will win in the presence of limited ILP
  • IBM Power5 is designed for high-performance integer and FP (two processor cores, each capable of sustaining four instructions per clock cycle); 1.9 GHz clock rate

SLIDE 21

Comparing Pentium 4 to IBM Power5

  • Pentium 4 – single processor with multithreading; very deep pipeline; can sustain three instructions per clock cycle; higher clock rate (3.8 GHz)
  • Figure 2.34 (next slide) compares the performance of these machines
    – Note that the Power5 often does better on the FP benchmarks (fewer branches, more parallelism)
    – Pentium 4 does better on the integer benchmarks (higher clock rate)
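
The issue widths and clock rates quoted for the two machines give a rough upper bound on instruction throughput. The sketch below multiplies them out; the function name is hypothetical, and the result is a peak figure that real programs reach only when enough ILP is available.

```python
def peak_instr_per_sec(ipc: float, clock_hz: float, cores: int = 1) -> float:
    """Peak sustained instructions per second: IPC x clock rate x cores."""
    return ipc * clock_hz * cores


if __name__ == "__main__":
    p4 = peak_instr_per_sec(ipc=3, clock_hz=3.8e9)             # Pentium 4
    p5_core = peak_instr_per_sec(ipc=4, clock_hz=1.9e9)        # one Power5 core
    p5_chip = peak_instr_per_sec(ipc=4, clock_hz=1.9e9, cores=2)
    print(f"Pentium 4:        {p4 / 1e9:.1f} G instr/s")
    print(f"Power5, one core: {p5_core / 1e9:.1f} G instr/s")
    print(f"Power5, two cores: {p5_chip / 1e9:.1f} G instr/s")
```

For a single thread the Pentium 4's peak is 1.5 times that of one Power5 core, which fits its advantage on integer code with limited ILP; the Power5's wider issue pays off only where the code exposes enough parallelism, as the FP benchmarks tend to.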

SLIDE 22

Figure 2.34