Dual Path Instruction Processing

International Conference on Supercomputing (ICS 2002), New York City, USA, June 2002


SLIDE 1

GACOP

International Conference on Supercomputing (ICS 2002)
New York City, USA, June 2002

Dual Path Instruction Processing

Juan L. Aragón (1), José González (1,*), Antonio González (2,*) and James E. Smith (3)

(1) Dept. Ing. y Tecnología de Computadores, Universidad de Murcia
(2) Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya
(3) Dept. Electrical and Computing Eng., University of Wisconsin-Madison
(*) Currently at Intel Barcelona Research Center

e-mail: jlaragon@ditec.um.es

SLIDE 2

Motivation

• Two ways of reducing performance degradation due to branch mispredictions
  • Improving prediction accuracy
  • Reducing the branch misprediction penalty
• Branch misprediction penalty
  • Deeper pipelines cause higher misprediction penalties
    – Pentium 4 (20 stages); Power 4 (14 stages)
    – Example: 22% IPC slowdown for a 20-stage pipeline compared with a 10-stage pipeline, using a 32 KB gshare predictor (go)

SLIDE 3

Motivation

• Causes of performance degradation after a branch misprediction
  • The pipeline must be squashed
  • Many cycles pass until new instructions can be issued
    – Front-end length
  • The instruction window is not full for many cycles
    – ILP cannot be fully exploited
  • Correct instructions cannot be scheduled ahead of a mispredicted branch

SLIDE 4

Outline

• Misprediction Penalty Analysis
• Proposal
• Dual Path Instruction Processing (DPIP)
• Experimental Results
• Sensitivity Analysis
• Conclusions

SLIDE 5

Misprediction Penalty Analysis

• Three components
  • Pipeline-fill penalty
    – Delay between the misprediction and the moment the first correct instruction enters the window
    – Depends on
      · Pipeline length, recovery actions
  • Window-fill penalty
    – The window is empty for many cycles after the misprediction
    – ILP cannot be fully exploited
  • Serialization penalty
    – Correct instructions cannot be scheduled ahead of the mispredicted branch

SLIDE 6

Misprediction Penalty Analysis

• Analysis of each component

[Figure: IPC (axis 1 to 4), average of 10 selected benchmarks. Legend: Perfect Branch Pred.; Complete IW fill; Instant. F/D 1st group; Real pred., pipe 6; Real pred., pipe 10; Real pred., pipe 14.]

              Overall loss   Pipeline-fill penalty   Window-fill penalty   Serialization penalty
pipeline 6    25%            25%                     10%                   65%
pipeline 10   33%            44%                     7%                    49%
pipeline 14   39%            54%                     6%                    40%
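The arithmetic behind the table can be made concrete with a short sketch. It assumes that the three component columns are shares of the overall loss (each row sums to 100%), so multiplying a share by the overall loss gives that component's contribution in percentage points. The names `LOSS_TABLE` and `absolute_contributions` are illustrative and do not come from the slides.

```python
# Rows copied from the slide's table, all values in percent.
# Assumption: the three component columns are shares of "overall".
LOSS_TABLE = {
    6:  {"overall": 25, "pipeline_fill": 25, "window_fill": 10, "serialization": 65},
    10: {"overall": 33, "pipeline_fill": 44, "window_fill": 7,  "serialization": 49},
    14: {"overall": 39, "pipeline_fill": 54, "window_fill": 6,  "serialization": 40},
}


def absolute_contributions(row):
    """Convert component shares into percentage points of overall loss."""
    overall = row["overall"]
    return {
        name: round(overall * share / 100, 2)
        for name, share in row.items()
        if name != "overall"
    }


for depth, row in LOSS_TABLE.items():
    print(f"pipeline {depth}: {absolute_contributions(row)}")
```

Under this reading, the pipeline-fill component grows from about 6 to about 21 percentage points as the pipeline deepens from 6 to 14 stages, while the serialization component stays near 16 points.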


SLIDE 8

Proposal

• Reduce the pipeline-fill and window-fill penalties
• Dual Path Instruction Processing (DPIP)
  • Fetches, decodes and renames both paths
    – Reduces the pipeline-fill penalty
    – Hides the front-end stages
  • Alternative path instructions are pre-scheduled in an estimated execution order
    – Reduces the window-fill penalty
    – Similar effect to filling the window completely
• Confidence estimation
  • Used to select the branches that are forked
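The slides do not say which confidence estimator DPIP uses. As an illustration only, the sketch below assumes a table of resetting saturating counters, one common style of estimator: a counter is incremented on each correct prediction and reset on a misprediction, and a branch is forked only while its counter is below a threshold. The class name, table size, indexing, and threshold are all hypothetical.

```python
class ResettingConfidenceEstimator:
    """Hypothetical confidence estimator; not taken from the slides."""

    def __init__(self, entries=1024, max_count=15, threshold=8):
        self.entries = entries
        self.max_count = max_count
        self.threshold = threshold
        self.counters = [0] * entries  # every branch starts at low confidence

    def _index(self, pc):
        # Assumption: 4-byte aligned instructions, simple modulo indexing.
        return (pc >> 2) % self.entries

    def should_fork(self, pc):
        """Fork only branches whose prediction has low confidence."""
        return self.counters[self._index(pc)] < self.threshold

    def update(self, pc, prediction_correct):
        i = self._index(pc)
        if prediction_correct:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = 0  # a misprediction resets confidence


est = ResettingConfidenceEstimator(threshold=2)
pc = 0x400100
print(est.should_fork(pc))  # counter is 0, below the threshold
est.update(pc, True)
est.update(pc, True)
print(est.should_fork(pc))  # counter has reached the threshold
est.update(pc, False)
print(est.should_fork(pc))  # counter was reset by the misprediction
```

A lower threshold forks fewer branches, which matters because only one alternative path can be in flight at a time.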

SLIDE 9

Related work

• Multiple path execution (MPE)
  • Fetch, decode and execute instructions from multiple paths
    – Selective Dual Path Execution (Heil & Smith, Tech. Report '97)
    – PolyPath (Klauser et al., ISCA '98)
    – Threaded Multiple Path Execution (Wallace et al., ISCA '98)
  • Too expensive (drawbacks)
    – Aggressive fetch engines (allowing up to 8 different paths)
    – Bigger register files, instruction windows and ROBs
    – Complexity of selective flush
    – Resource contention: more functional units, memory ports, ...
    – Energy consumption: resources used by useless instructions

DPIP does not execute alternative path instructions, which gives a balance between complexity, cost, and performance


SLIDE 11

Dual Path Instruction Processing

• DPIP block diagram

[Figure: DPIP block diagram. Labels: I-cache, Fetch Unit, Decode Unit, Free list1/RMT1, Free list2/RMT2, ROB1, ROB2, LSQ1, LSQ2, Instruction Window, issue logic, Funct. Units, and an Alternative Path Buffer holding alternative path instructions in program order.]

DPIP can only manage two paths at the same time
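The two-path limit can be illustrated with a toy controller. This is a sketch under stated assumptions, not the hardware described in the talk: it assumes that a low-confidence branch is forked only when no other fork is active, that alternative path instructions are buffered in program order, and that the buffer is handed over on a misprediction and discarded otherwise. The class and method names are hypothetical.

```python
class DualPathController:
    """Toy model of the two-path limit; resolution behavior is assumed."""

    def __init__(self):
        self.forked_branch = None   # at most one forked branch at a time
        self.alt_path_buffer = []   # alternative path, program order

    def on_branch(self, branch_id, low_confidence):
        """Fork only if the branch is low-confidence and no fork is active."""
        if low_confidence and self.forked_branch is None:
            self.forked_branch = branch_id
            return True
        return False

    def on_alt_instruction(self, inst):
        """Buffer an alternative path instruction while a fork is active."""
        if self.forked_branch is not None:
            self.alt_path_buffer.append(inst)

    def on_resolve(self, branch_id, mispredicted):
        """Return the buffered instructions on a misprediction, else []."""
        if branch_id != self.forked_branch:
            return []
        insts = self.alt_path_buffer if mispredicted else []
        self.forked_branch = None
        self.alt_path_buffer = []
        return insts


ctrl = DualPathController()
print(ctrl.on_branch("b1", low_confidence=True))   # fork starts
print(ctrl.on_branch("b2", low_confidence=True))   # refused: already forked
ctrl.on_alt_instruction("A")
ctrl.on_alt_instruction("B")
print(ctrl.on_resolve("b1", mispredicted=True))    # buffered path is reused
```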

SLIDE 12

DPIP

• Pre-scheduling alternative path instructions

[Figure: pre-scheduling datapath. Labels: Decode Unit, Free list1/RMT1, Free list2/RMT2, pre-scheduling, Pre-schedule Buffer, Instruction Window, issue logic; predicted path; alternative path in data-flow order.]

Canal & Gonzalez, ICS 2000; Michaud & Seznec, HPCA 2001

SLIDE 13

DPIP

• Pre-scheduling example

Alternative path instructions:
  A: r2 ← r1 + r0
  B: store r3, 0(r2)
  C: load r2, 0(r6)
  D: r4 ← r2 + r0
  E: r4 ← r3 + r3

[Figure: Register Availability Table indexed by logical register (r0 to r6; entries as extracted: 0 1 0 2 1) and a Pre-schedule Buffer holding instructions A to E in lines 1 to 6, with the active line, line width, and pre-sched. buffer size marked.]

schedule_line = max( reg_availability(input reg1), reg_availability(input reg2) )
reg_availability(output register) = schedule_line + execution_latency
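The two rules above can be applied to the five example instructions in a few lines of code. This is a minimal sketch under several assumptions that the slide text does not confirm: every register starts available at line 0, lines are numbered from 0, ALU and store operations take 1 line, the load takes 2, and the line width and buffer size limits are ignored. The resulting placement is therefore not guaranteed to match the slide's figure. `Inst` and `pre_schedule` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Inst:
    name: str
    srcs: Tuple[str, ...]   # input logical registers
    dst: Optional[str]      # output logical register (None for a store)
    latency: int            # assumed execution latency, in lines


def pre_schedule(insts, num_regs=7):
    """Assign each instruction a line using the slide's two rules."""
    avail = {f"r{i}": 0 for i in range(num_regs)}  # assumed initial state
    lines = {}
    for inst in insts:  # walk the alternative path in program order
        line = max(avail[r] for r in inst.srcs)
        lines[inst.name] = line
        if inst.dst is not None:
            avail[inst.dst] = line + inst.latency
    return lines, avail


ALT_PATH = [
    Inst("A", ("r1", "r0"), "r2", 1),
    Inst("B", ("r3", "r2"), None, 1),
    Inst("C", ("r6",), "r2", 2),      # load latency of 2 is an assumption
    Inst("D", ("r2", "r0"), "r4", 1),
    Inst("E", ("r3", "r3"), "r4", 1),
]

lines, avail = pre_schedule(ALT_PATH)
print(lines)
print(avail)
```

With these assumed latencies, A, C and E land in line 0, B in line 1 and D in line 2. B waits for A's result in r2, while D waits for the load C, which redefines r2.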


SLIDE 15

Results

• Out-of-order superscalar simulator (sim-outorder)
• Configuration
  • Fetch/decode/issue/commit up to 8 inst/cycle
  • L1 cache: 64 KB I-cache, 64 KB D-cache (2-way)
  • L2 cache: 512 KB, 4-way
  • 8 Int ALUs, 2 Int Mult
  • 8 FP ALUs, 2 FP Mult
  • 64-entry Instruction Window
  • 128-entry Reorder Buffer
  • 14-stage pipeline (IBM Power 4-like)
• Evaluated programs
  • SpecInt95 and SpecInt2000
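For reference, the baseline configuration above can be captured as a plain record. This is only an illustrative sketch: the field names are hypothetical and are not sim-outorder option names, and the associativity is stored once for L1 because the slide lists "2-way" without saying whether it covers both L1 caches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaselineConfig:
    """Baseline machine parameters as listed on the slide."""
    width: int = 8               # fetch/decode/issue/commit per cycle
    l1_icache_kb: int = 64
    l1_dcache_kb: int = 64
    l1_assoc: int = 2            # "2-way" as listed for L1
    l2_cache_kb: int = 512
    l2_assoc: int = 4
    int_alus: int = 8
    int_mults: int = 2
    fp_alus: int = 8
    fp_mults: int = 2
    window_entries: int = 64     # instruction window
    rob_entries: int = 128       # reorder buffer
    pipeline_stages: int = 14    # IBM Power 4-like


cfg = BaselineConfig()
print(cfg)
```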

SLIDE 16

Results

• DPIP performance
  • 8% improvement for DPIP (with pre-scheduling)
  • 10% improvement for DPIP + branch prediction reversal
  • 17% for oracle estimation (still work to be done)

[Figure: IPC (axis 1.0 to 4.5, one off-scale value labeled 5.7) for compress, gcc, go, ijpeg, bzip2, crafty, gzip, mcf, twolf, vpr and Average. Series: gshare (single-path) 16KB; gs+DPIP (8+8)KB; gs+DPIP+preSched (8+8); gs+BPRU+DPIP (8+8)KB; gs+DPIP(oracle) 8KB; perfect branch prediction.]

SLIDE 17

Results

• How much does pre-scheduling influence DPIP performance?
  • 16% of the improvement is provided by pre-scheduling (31% for go)
  • Pre-scheduling provides additional benefits.

[Figure: speedup breakdown (%), axis 20 to 100, for compress, gcc, go, ijpeg, bzip2, crafty, gzip, mcf, twolf, vpr and Average. Series: pre-fetching+decoding+renaming; pre-scheduling. Labeled values: 84% and 16%.]


SLIDE 19

Sensitivity Study

• Alpha 21264 branch predictor
  • 5% average speedup (up to 8% for bzip2)
  • 15% for oracle estimation

[Figure: IPC (axis 1 to 6) for compress, gcc, go, ijpeg, bzip2, crafty, gzip, mcf, twolf, vpr and Average. Series: 21264 (single-path) 16KB; 21264+DPIP (8+8)KB; 21264+DPIP(oracle) 8KB; perfect branch prediction.]

SLIDE 20

Sensitivity Study

• I-Window size
  • Improvements remain constant as the window grows
• Pipeline depth
  • 10 stages: 4% average speedup
  • 20 stages: 12% average speedup

[Figure: IPC (axis 2.2 to 3.0) versus Instruction Window (Reorder Buffer) size: 32 (64), 64 (128), 128 (256), 256 (512). Series: gshare+DPIP(oracle); gshare+DPIP; gshare (single-path).]

[Figure: IPC (axis 2.0 to 3.0) versus Pipeline Depth: 10, 12, 14, 16, 18, 20. Series: gshare+DPIP(oracle); gshare+DPIP; gshare (single-path).]


SLIDE 22

Conclusions

• Categorized the branch misprediction penalty
  • Pipeline-fill penalty
  • Window-fill penalty
  • Serialization penalty
• Dual Path Instruction Processing reduces the penalties of mispredicted branches
  • Fetches, decodes, renames and pre-schedules alternative path instructions
  • Similar effect to filling the window completely
• Balance between complexity/cost/performance
  • Simpler than Multiple Path Execution schemes
• 12% speedup for 20-stage OoO processors
  • 25% for oracle estimation

contribution to the overall performance loss