 
International Conference on Supercomputing (ICS 2002)
New York City, USA, June 2002

Dual Path Instruction Processing

Juan L. Aragón 1, José González 1,*, Antonio González 2,* and James E. Smith 3

1 Dept. Ing. y Tecnología de Computadores, Universidad de Murcia
2 Dept. d’Arquitectura de Computadors, Universitat Politècnica de Catalunya
3 Dept. Electrical and Computing Eng., University of Wisconsin-Madison
* Currently at Intel Barcelona Research Center

e-mail: jlaragon@ditec.um.es

GACOP

Motivation

• Two ways of reducing the performance degradation caused by branch mispredictions
  • Improving prediction accuracy
  • Reducing the branch misprediction penalty
• Branch misprediction penalty
  • Deeper pipelines cause higher misprediction penalties
    – Pentium 4 (20 stages); Power 4 (14 stages)
    – Example: 22% IPC slowdown for go with a 32 KB gshare when the pipeline grows from 10 to 20 stages

Motivation

• Causes of performance degradation after a branch misprediction
  • The pipeline must be squashed
  • Many cycles pass until new instructions can be issued
    – Front-end length
  • The instruction window is not full for many cycles
    – ILP cannot be fully exploited
  • Correct instructions cannot be scheduled ahead of a mispredicted branch

Outline

• Misprediction Penalty Analysis
• Proposal
• Dual Path Instruction Processing (DPIP)
• Experimental Results
• Sensitivity Analysis
• Conclusions

Misprediction Penalty Analysis

• Three components
  • Pipeline-fill penalty
    – Delay between the misprediction and the moment the first correct instruction enters the window
    – Depends on pipeline length and recovery actions
  • Window-fill penalty
    – The window is not full for many cycles after the misprediction
    – ILP cannot be fully exploited
  • Serialization penalty
    – Correct instructions cannot be scheduled ahead of the mispredicted branch

Misprediction Penalty Analysis

• Analysis of each component

[Figure: IPC (0 to 4), average of 10 selected benchmarks. Configurations: Perfect Branch Pred.; Complete IW fill; Instant. F/D 1st group; Real pred., pipe 6; Real pred., pipe 10; Real pred., pipe 14]

               Overall   Pipeline-fill   Window-fill   Serialization
               loss      penalty         penalty       penalty
pipeline 6     25%       25%             10%           65%
pipeline 10    33%       44%             7%            49%
pipeline 14    39%       54%             6%            40%

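The table can be read as an overall IPC loss per pipeline depth, split into three shares that add up to 100%. The sketch below works through that arithmetic. It assumes that reading: the first percentage in each row is the overall loss and the remaining three are fractions of it. The names `PENALTY_TABLE` and `absolute_losses` are illustrative and do not come from the slides.

```python
# Values transcribed from the slide's table. The split into "overall loss"
# plus three component shares is an assumed reading of the table.
PENALTY_TABLE = {
    6:  {"overall_loss": 25, "pipeline_fill": 25, "window_fill": 10, "serialization": 65},
    10: {"overall_loss": 33, "pipeline_fill": 44, "window_fill": 7,  "serialization": 49},
    14: {"overall_loss": 39, "pipeline_fill": 54, "window_fill": 6,  "serialization": 40},
}
COMPONENTS = ("pipeline_fill", "window_fill", "serialization")


def absolute_losses(row):
    """Convert each component share (% of the loss) into percentage points of IPC lost."""
    return {name: row["overall_loss"] * row[name] / 100 for name in COMPONENTS}


for depth, row in PENALTY_TABLE.items():
    # Under the assumed reading, the three shares of each row sum to 100%.
    assert sum(row[name] for name in COMPONENTS) == 100
    parts = absolute_losses(row)
    print(f"pipeline {depth}:", {name: f"{value:.2f}" for name, value in parts.items()})
```

Under this reading, the pipeline-fill share grows with pipeline depth while the serialization share shrinks, which is consistent with the proposal's focus on the pipeline-fill and window-fill penalties.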
Proposal

• Reduce the pipeline-fill and window-fill penalties
• Dual Path Instruction Processing (DPIP)
  • Fetches, decodes and renames both paths
    – Reduces the pipeline-fill penalty
    – Hides the front-end stages
  • Alternative path instructions are pre-scheduled in an estimated execution order
    – Reduces the window-fill penalty
    – Similar effect to filling the window completely
• Confidence estimation
  • Used to select the branches that must be forked

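The slides do not say which confidence estimator is used, so the following is only a minimal sketch of how one could gate forking. It assumes a table of resetting counters indexed by branch PC: a branch is treated as low confidence until it has been predicted correctly `threshold` times in a row. The class name, table size, threshold and the `should_fork` helper are all hypothetical.

```python
class ResettingCounterEstimator:
    """Hypothetical confidence estimator (not taken from the slides).

    Each entry counts consecutive correct predictions and is reset to zero
    on a misprediction. A count below `threshold` means low confidence.
    """

    def __init__(self, entries=1024, threshold=8, max_count=15):
        self.entries = entries
        self.threshold = threshold
        self.max_count = max_count
        self.counters = [0] * entries

    def _index(self, pc):
        # Simple direct-mapped indexing by branch address (an assumption).
        return pc % self.entries

    def is_low_confidence(self, pc):
        return self.counters[self._index(pc)] < self.threshold

    def update(self, pc, prediction_correct):
        i = self._index(pc)
        if prediction_correct:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = 0


def should_fork(estimator, pc, alternative_path_active):
    """Fork only low-confidence branches, and only if no second path is in flight."""
    return (not alternative_path_active) and estimator.is_low_confidence(pc)


est = ResettingCounterEstimator(entries=16, threshold=2, max_count=3)
print(should_fork(est, 0x40, alternative_path_active=False))  # untrained entry
```

The `alternative_path_active` check reflects the constraint that only two paths are handled at once; a low-confidence branch met while an alternative path is already being processed is simply not forked in this sketch.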
Related work

• Multiple path execution (MPE)
  • Fetch, decode and execute instructions from multiple paths
    – Selective Dual Path Execution (Heil & Smith, Tech. Report ’97)
    – PolyPath (Klauser et al., ISCA ’98)
    – Threaded Multiple Path Execution (Wallace et al., ISCA ’98)
  • Too expensive (drawbacks)
    – Aggressive fetch engines (allowing up to 8 different paths)
    – Bigger register files, instruction windows and ROBs
    – Complexity of selective flush
    – Resource contention: more functional units, memory ports, ...
    – Energy consumption: resources used by useless instructions

DPIP does not execute alternative path instructions: a balance between complexity, cost, and performance

Dual Path Instruction Processing

• DPIP block diagram

[Block diagram: I-cache; Fetch Unit; Decode Unit; RMT 1 and RMT 2; Free list 1 and Free list 2; Instruction Window (program order); Alternative Path Buffer (alternative path instructions); LSQ 1 and ROB 1; LSQ 2 and ROB 2; issue logic; Funct. Units]

DPIP can only manage two paths at the same time

DPIP

• Pre-scheduling alternative path instructions

[Block diagram: Decode Unit; RMT 1 and RMT 2; Free list 1 and Free list 2; predicted path into the Instruction Window; alternative path through pre-scheduling into the Pre-schedule Buffer (data-flow order); issue logic]

Canal & Gonzalez, ICS 2000
Michaud & Seznec, HPCA 2001

DPIP

• Pre-scheduling example

  schedule_line = max(reg_availability(input reg1), reg_availability(input reg2))
  reg_availability(output register) = schedule_line + execution_latency

  Alternative path instructions:
    A   r2 ← r1 + r0
    B   store r3, 0(r2)
    C   load r2, 0(r6)
    D   r4 ← r2 + r0
    E   r4 ← r3 + r3

  Register Availability Table (indexed by logical register):
    r6   0
    r5   0
    r4   0 → 2 → 1
    r3   0
    r2   0 → 1
    r1   0
    r0   0

  Pre-schedule Buffer (lines 0 to 6; line 0 is the active line):
    line 1   B  D
    line 0   A  C  E

[Diagram labels: line width; pre-sched. buffer size]

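The two rules of the example can be traced with a short script. This is a sketch under stated assumptions rather than the hardware algorithm: every instruction is given a latency of one cycle (the slide does not list latencies, but this value reproduces the placement shown), a store has no output register, and the limits on line width and buffer size are ignored. The function name `pre_schedule` and the tuple layout are illustrative.

```python
from collections import defaultdict


def pre_schedule(instructions, latency=1):
    """Place each instruction in a buffer line using the slide's two rules.

    instructions: list of (name, output_register_or_None, [input_registers]).
    latency: assumed uniform execution latency (not given on the slide).
    """
    reg_availability = defaultdict(int)   # every register starts available at line 0
    buffer_lines = defaultdict(list)      # line number -> instruction names

    for name, output_reg, input_regs in instructions:
        # schedule_line = max(reg_availability(input registers))
        schedule_line = max((reg_availability[r] for r in input_regs), default=0)
        buffer_lines[schedule_line].append(name)
        # reg_availability(output register) = schedule_line + execution_latency
        if output_reg is not None:
            reg_availability[output_reg] = schedule_line + latency

    return dict(buffer_lines), dict(reg_availability)


# Alternative path instructions A to E from the slide.
alternative_path = [
    ("A", "r2", ["r1", "r0"]),   # r2 <- r1 + r0
    ("B", None, ["r3", "r2"]),   # store r3, 0(r2)
    ("C", "r2", ["r6"]),         # load r2, 0(r6)
    ("D", "r4", ["r2", "r0"]),   # r4 <- r2 + r0
    ("E", "r4", ["r3", "r3"]),   # r4 <- r3 + r3
]

buffer_lines, reg_availability = pre_schedule(alternative_path)
print(buffer_lines)            # {0: ['A', 'C', 'E'], 1: ['B', 'D']}
print(reg_availability["r4"])  # 1
```

With these assumptions the script places A, C and E in line 0 and B and D in line 1. The entry for r4 is first set to 2 by D and then overwritten with 1 by E, matching the table's 0 → 2 → 1 sequence.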
Results

• OoO superscalar simulator (sim-outorder)
• Configuration
  • Fetch/decode/issue/commit up to 8 inst/cycle
  • L1 cache: 64 KB I-cache, 64 KB D-cache (2-way)
  • L2 cache: 512 KB, 4-way
  • 8 Int ALUs, 2 Int Mult
  • 8 FP ALUs, 2 FP Mult
  • 64-entry Instruction Window
  • 128-entry Reorder Buffer
  • 14-stage pipeline (IBM Power 4-like)
• Evaluated programs
  • SpecInt95 and SpecInt2000

Results

• DPIP performance

[Figure: IPC per benchmark (compress, gcc, go, ijpeg, bzip2, crafty, gzip, mcf, twolf, vpr, Average). Configurations: gshare (single-path) 16KB; gs+DPIP (8+8)KB; gs+DPIP+preSched (8+8); gs+BPRU+DPIP (8+8)KB; gs+DPIP(oracle) 8KB; perfect branch prediction]

• 8% improvement for DPIP (with pre-scheduling)
• 10% improvement for DPIP + branch prediction reversal
• 17% for oracle estimation (still work to be done)

Results

• How much does pre-scheduling influence DPIP performance?

[Figure: speedup breakdown (%) per benchmark (compress, gcc, go, ijpeg, bzip2, crafty, gzip, mcf, twolf, vpr, Average). Components: pre-fetching+decoding+renaming (84% on average); pre-scheduling (16% on average)]

• 16% of the improvement is provided by pre-scheduling (31% for go)
• Pre-scheduling provides additional benefits

Sensitivity Study

• Alpha 21264 branch predictor

[Figure: IPC per benchmark (compress, gcc, go, ijpeg, bzip2, crafty, gzip, mcf, twolf, vpr, Average). Configurations: 21264 (single-path) 16KB; 21264+DPIP (8+8)KB; 21264+DPIP(oracle) 8KB; perfect branch prediction]

• 5% average speedup (up to 8% for bzip2)
• 15% for oracle estimation