Reducing the Branch Delay

  1. Reducing the branch delay
[Figure: pipelined datapath with IF.Flush, hazard detection unit, forwarding unit, ID/EX, EX/MEM, and MEM/WB registers, and an equality comparator in the decode stage; instructions in flight: bne $2,$3,foo and addu $2,$4,$5]

  2. Branch bypassing – the easy case shaves one cycle off the branch penalty
[Figure: same datapath; forwarding into the decode-stage comparator for bne $2,$3,foo following addu $2,$4,$5]

  3. Branch bypassing – back-to-back dependences
[Figure: same datapath; instructions in flight: bne $2,$3,foo; subu $3,$4,$5; ld $2,$4]

  4. Branch handling in decode ⇒ lots of ugly forwarding paths; might as well execute the branch in the EX stage and use better branch prediction
[Figure: same datapath; instructions in flight: bne $2,$3,foo; subu $3,$4,$5; ld $2,$4]

  5. Branch prediction from 10,000 ft
[Figure: a front end (instruction cache, instruction FIFO, PC FIFO) feeding a back end (dcache, dequeue, branch-resolution logic); a misprediction is detected when the "guessed" PC != the architectural PC, and the back end signals "restart at address X"]
Invariant: is_correct(PC) ⇒ is_correct(Instr[PC]). On a restart (branch misprediction) the machine must: (a) kill all incorrectly fetched instructions (to ensure correct execution), and (b) refill the pipeline (which takes a number of cycles equal to the pipeline latency up to the execute stage).

  6. Aside: decoupled execution – buffering smooths execution and improves cycle time by reducing stall propagation
[Figure: cycle-by-cycle timing diagram of a front end and back end connected by a FIFO (F = fetch, S = stall, C = cache miss, E = execute)]
With decoupling, the front end runs ahead, so stalls and cache misses are overlapped. Without decoupling, stalls and cache misses are not overlapped.

  7. Pipelined front end
[Figure: front-end pipeline with a branch direction predictor, instruction cache, instruction FIFO, PC FIFO, branch-immediate effective-address (EA) checker, +4 PC incrementer, and guess PC (GPC); the back end supplies restart / new pc; example instruction: bne $2,$3,foo]

  8. Branch predicted-taken penalty
[Figure: same front end; on a predicted-taken branch, the speculatively fetched instructions that follow the branch are squashed]

  9. Branch misprediction penalty
[Figure: the 10,000-ft diagram again, with all in-flight instructions squashed on "restart at address X"]
Misprediction penalties: Pentium 3 – ~10 cycles; Pentium 4 – ~20 cycles ("The Microarchitecture of the Pentium 4", Intel Technology Journal, Q1 2001).

  10. Since the misprediction penalty is larger, we first focus on branch (direction) prediction
• Static strategies:
  - #1: predict taken (34% mispredict rate)
  - #2: predict backwards taken, forwards not taken (10% / 50% mispredict rate) – same backwards behavior as #1, better forwards behavior (forward branches are roughly 50/50)
Penalty: #1 – taken 2 cycles, not taken 20 cycles; #2 (forward) – taken 20 cycles, not taken 0 cycles.
  #1 forward-branch average execution time = 50% * 2 + 50% * 20 = 11 cycles
  #2 forward-branch average execution time = 50% * 20 + 50% * 0 = 10 cycles
[Figure: backward branches (e.g., a JZ back to a loop top) are taken ~90% of the time; forward branches ~50%]
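The averages above follow from a simple expected-value calculation. Here is a minimal C sketch that reproduces it; the helper name and structure are ours, and the cycle counts and 50% taken-probability are just the slide's numbers:

```c
#include <stdio.h>

/* Expected per-branch cost of a static prediction policy.
 * p_taken           : probability the branch is actually taken
 * cost_if_taken     : cycles paid when the branch is taken
 * cost_if_not_taken : cycles paid when the branch is not taken
 */
static double expected_cost(double p_taken,
                            double cost_if_taken,
                            double cost_if_not_taken)
{
    return p_taken * cost_if_taken + (1.0 - p_taken) * cost_if_not_taken;
}

int main(void)
{
    /* Forward branches: ~50% taken (from the slide).
     * Policy #1 (predict taken):     2 cycles if taken, 20 if not taken.
     * Policy #2 (predict not taken): 20 cycles if taken, 0 if not taken. */
    printf("#1 forward: %.1f cycles\n", expected_cost(0.5, 2.0, 20.0));  /* 11.0 */
    printf("#2 forward: %.1f cycles\n", expected_cost(0.5, 20.0, 0.0));  /* 10.0 */
    return 0;
}
```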

  11. Since the misprediction penalty is larger, we first focus on branch (direction) prediction
• Static strategies (continued):
  - #3: profile – choose a single prediction for each branch and encode it in the instruction; some studies show that sample runs are fairly representative of inputs in general; downside: extra programmer burden (see the next slide for misprediction rates)

  12. Profiling-based static prediction
[Figure: per-benchmark misprediction rates – 15% average on SPECint92, 9% average on SPECfp92]
Each branch is permanently assigned a probable direction. To do better we would need to change the prediction as the program runs!

  13. A note on prediction/misprediction rates
Qualitatively, the ratio of misprediction rates is a better indicator of predictor improvement (this assumes the misprediction probability is independent between branches). Bernoulli process: p^k = 0.5, so k = lg(0.5) / lg(p), where k is the number of consecutive branches predicted correctly with 50% probability.
    Prediction   Misprediction   # consecutive branches predicted
    rate (p)     rate            correctly (w/ 50% prob)
    50%          50%              1
    78%          22%              2.78
    85%          15%              4.26
    91%           9%              7.34
    95%           5%             13.5
    96%           4%             16.98
    98%           2%             34.3
A 2% change in misprediction rate makes a huge difference here.
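The right-hand column follows directly from the Bernoulli-process formula k = lg(0.5)/lg(p). A short C sketch (illustrative only; the loop and names are ours) regenerates the table:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Prediction rates from the table above. */
    const double p[] = { 0.50, 0.78, 0.85, 0.91, 0.95, 0.96, 0.98 };

    for (int i = 0; i < 7; i++) {
        /* k such that p^k = 0.5: the number of consecutive branches
         * predicted correctly with 50% probability, assuming
         * mispredictions are independent between branches. */
        double k = log(0.5) / log(p[i]);
        printf("p = %2.0f%%  ->  k = %.2f\n", 100.0 * p[i], k);
    }
    return 0;
}
```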

  14. The compiler can also take advantage of static prediction / profiling / knowledge
• Static strategies:
  - #1: predict taken (34% mispredict rate)
  - #2: predict backwards taken, forwards not taken (10% / 50% mispredict rate)
  - #3: profile (see the previous slides)
  - #4: delayed branches – always execute the instruction(s) after the branch, which avoids the need to flush the pipeline after a branch

  15. Observation: static prediction is limited because it only uses the instructions as input and has a fixed prediction per branch
[Figure: the pipelined front end again – branch direction predictor, instruction cache, instruction FIFO, branch-immediate EA, +4, GPC, PC FIFO, restart / new pc]

  16. Dynamic prediction: more inputs allow it to adjust the branch direction prediction over time
[Figure: the front end with a branch (direction) predictor that receives feedback – branch info, mispredict signals, the fetched instruction, and the PC – alongside the instruction cache, instruction FIFO, EA, +4, GPC, PC FIFO, and restart / new pc]

  17. Dynamic prediction: a more detailed view
[Figure: as before, with a branch-descriptor FIFO carrying branch-info from the predictor toward branch resolution, so mispredict feedback can be matched with the original prediction]

  18. Dynamic branch prediction – track changing per-branch behavior
• Store 2 bits per branch
• Change the prediction after two consecutive mistakes!
[Figure: 4-state transition diagram over the states 00, 01, 10, 11, with edges labeled taken / ¬taken and next-state logic for the two bits]
BP state: (next prediction taken / ¬taken) x (last branch actually taken / ¬taken). Note: this is not strictly a saturating up-down counter.
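A minimal C sketch of the state machine as described on this slide – one prediction bit plus one last-outcome bit, with the prediction flipping only after two consecutive mispredictions. The struct and function names are ours, and the loop pattern in main is just an example:

```c
#include <stdbool.h>
#include <stdio.h>

/* 2-bit predictor state: one bit for the current prediction, one bit for
 * the last actual outcome. The prediction flips only after two consecutive
 * mispredictions (note: this is NOT the classic saturating up/down counter). */
typedef struct {
    bool pred;  /* current prediction: true = taken */
    bool last;  /* last actual outcome: true = taken */
} bp_state;

/* Update the state with the actual outcome; return whether the prediction
 * made before the update was correct. */
static bool bp_update(bp_state *s, bool taken)
{
    bool correct = (s->pred == taken);
    /* Two consecutive outcomes disagreeing with the prediction -> flip it. */
    if (!correct && s->last != s->pred)
        s->pred = taken;
    s->last = taken;
    return correct;
}

int main(void)
{
    /* A loop that runs 4 iterations, executed twice: T T T N  T T T N */
    bool outcomes[] = { true, true, true, false, true, true, true, false };
    bp_state s = { .pred = true, .last = true };  /* assume "predict taken" init */
    int mispredicts = 0;

    for (int i = 0; i < 8; i++)
        if (!bp_update(&s, outcomes[i]))
            mispredicts++;

    printf("mispredictions: %d\n", mispredicts);  /* 2: one per loop exit */
    return 0;
}
```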

  19. Why two bits?
• One bit is wishy-washy on loops: two mispredictions per loop execution with single-bit prediction – one when the loop exits (predicted taken, actually not taken) and another when the loop is next entered (predicted not taken, actually taken).
    top:  add
          add
          beq top
[Table: single-bit predictor analysis – prediction vs. outcome trace for the loop branch]
(No data initially – either use whatever is left over from before, or initialize on instruction fill with "predict taken" for backward branches.)

  20. Why two bits?
• With two-bit prediction there is only one misprediction per loop execution – the predictor keeps predicting taken across the single not-taken outcome at loop exit, as the simulation below shows.
    top:  add
          add
          beq top
[Table: two-bit predictor analysis – the prediction row stays taken throughout; only the loop-exit outcome mispredicts]
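A small comparison of the two schemes, under the assumption that the loop runs four iterations and is executed three times (an illustrative sketch; the pattern and names are ours). After warm-up, the 1-bit scheme misses twice per loop execution and the 2-bit scheme once:

```c
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    bool outcome[12];
    for (int i = 0; i < 12; i++)
        outcome[i] = ((i % 4) != 3);      /* T T T N, repeated three times */

    /* 1-bit: the prediction is simply the last outcome. */
    bool p1 = true;
    /* 2-bit (slide 18): prediction + last outcome, flip after two misses. */
    bool p2 = true, last2 = true;

    int miss1 = 0, miss2 = 0;
    for (int i = 0; i < 12; i++) {
        bool a = outcome[i];

        if (p1 != a) miss1++;
        p1 = a;                            /* 1-bit always adopts the outcome */

        if (p2 != a) {
            miss2++;
            if (last2 != p2) p2 = a;       /* flip only after 2 consecutive misses */
        }
        last2 = a;
    }
    /* Prints "1-bit: 5 mispredictions, 2-bit: 3" for this pattern. */
    printf("1-bit: %d mispredictions, 2-bit: %d\n", miss1, miss2);
    return 0;
}
```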

  21. n-bit implementation
Blindly write into this hash table; branches may alias, but that's "ok".
[Figure: the PC is hashed to index a Branch History Table (BHT) of 4K n-bit counters; the read side produces the "guess" (prediction) and an "is conditional branch?" check alongside the ICache; many cycles later, branch info (pc / descriptor / outcome) arrives, the state transition is computed, and the new n-bit state is written back]
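A software sketch of this table, assuming 2-bit entries that use the slide-18 update rule and a simple shift-and-mask hash of the PC (the hash function and the example address are assumptions, not from the slide):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BHT_ENTRIES 4096   /* 4K entries, as on the slide */

/* Each entry packs the 2-bit state from slide 18:
 * bit 1 = current prediction, bit 0 = last outcome. */
static uint8_t bht[BHT_ENTRIES];

/* The slide only says the PC is hashed; dropping the low (byte-offset)
 * bits and masking is one simple choice, assumed here. */
static unsigned bht_index(uint32_t pc)
{
    return (pc >> 2) & (BHT_ENTRIES - 1);
}

/* Read side (fetch): produce a prediction for this PC.
 * Aliasing between branches is ignored -- the table is "blindly" indexed. */
bool bht_predict(uint32_t pc)
{
    return (bht[bht_index(pc)] >> 1) & 1;
}

/* Write side (many cycles later, at branch resolution):
 * compute the state transition and write the entry back. */
void bht_update(uint32_t pc, bool taken)
{
    unsigned i = bht_index(pc);
    bool pred = (bht[i] >> 1) & 1;
    bool last = bht[i] & 1;

    if (pred != taken && last != pred)
        pred = taken;               /* flip only after two consecutive misses */
    bht[i] = (uint8_t)((pred << 1) | taken);
}

int main(void)
{
    uint32_t pc = 0x00400080;       /* hypothetical branch address */
    printf("initial prediction: %d\n", bht_predict(pc));   /* 0 */
    bht_update(pc, true);
    bht_update(pc, true);
    printf("after two taken outcomes: %d\n", bht_predict(pc));  /* 1 */
    return 0;
}
```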

  22. Accuracy of a simple dynamic branch predictor: 4096-entry 2-bit predictor on SPEC89
These are somewhat old benchmarks – you would probably need slightly larger predictors to do this well on current benchmarks.
[Figure: per-benchmark misprediction rates (e.g., 11%, 18%, 22%, 12% on some benchmarks), shown vs. profiling-based static prediction]

  23. Limits of 2-bit prediction
- An infinite table does not help much on SPEC89.
- Reportedly, more bits per counter do not help significantly either.

  24. Exploiting spatial correlation (Yeh and Patt, 1992)
    if (x[i] < 7) then y += 1;
    if (x[i] < 5) then c -= 4;
If the first condition is false, the second condition is also false.
History bit: H records the direction of the last branch executed by the processor. Keep two sets of BHT bits (BHT0 & BHT1) per branch instruction:
  H = 0 (not taken) ⇒ consult BHT0
  H = 1 (taken) ⇒ consult BHT1
Adapted from Arvind and Asanovic's MIT course 6.823, Lecture 6.
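A minimal sketch of the scheme, assuming the BHT entries are ordinary 2-bit saturating counters and a 1024-entry table (neither the sizes nor the counter type is given on the slide); in a real front end the history bit used at prediction time would be carried along with the branch so the update touches the same set:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 1024    /* table size: an assumption, not from the slide */

/* Two sets of 2-bit counters per branch slot; a single global history
 * bit H (direction of the last branch) selects which set to consult. */
static uint8_t bht0[ENTRIES];   /* used when H = 0 (last branch not taken) */
static uint8_t bht1[ENTRIES];   /* used when H = 1 (last branch taken) */
static bool H;                  /* global history bit */

static unsigned idx(uint32_t pc) { return (pc >> 2) & (ENTRIES - 1); }

bool predict(uint32_t pc)
{
    uint8_t *bht = H ? bht1 : bht0;
    return bht[idx(pc)] >= 2;             /* MSB of the counter is the prediction */
}

void update(uint32_t pc, bool taken)
{
    uint8_t *bht = H ? bht1 : bht0;       /* update the set that made the guess */
    uint8_t *c = &bht[idx(pc)];
    if (taken)  { if (*c < 3) (*c)++; }   /* saturating up/down counter */
    else        { if (*c > 0) (*c)--; }
    H = taken;                            /* record the direction of this branch */
}

int main(void)
{
    uint32_t pc = 0x2000;                 /* hypothetical branch address */
    for (int i = 0; i < 4; i++)
        update(pc, true);                 /* train: always taken */
    printf("prediction (H = 1): %d\n", predict(pc));   /* 1 = taken */
    return 0;
}
```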

  25. Accuracy with 2 bits of global history
[Figure: per-benchmark misprediction rates]
Less storage than the 4K x 2-bit table, but better accuracy (for these benchmarks).

  26. Two-level branch predictor
The Pentium Pro (1995) uses the result from the last two branches to select one of four sets of BHT bits (~95% correct).
[Figure: the fetch PC indexes the table while a 2-bit global branch history shift register – shifting in the taken/¬taken result of each branch – selects among the four sets, producing the taken/¬taken prediction]
Adapted from Arvind and Asanovic's MIT course 6.823, Lecture 6.
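A sketch of the two-level idea: a 2-bit global history register selects one of four sets of counters indexed by the fetch PC. The 2-bit saturating counters, the 1024-entry index, and the training pattern are illustrative assumptions; the Pentium Pro's actual organization differs in detail:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS     4        /* selected by the 2-bit global history */
#define ENTRIES  1024     /* entries per set: an assumption, not from the slide */

static uint8_t  bht[SETS][ENTRIES];   /* 2-bit saturating counters (assumed) */
static unsigned ghr;                  /* 2-bit global branch history register */

static unsigned idx(uint32_t pc) { return (pc >> 2) & (ENTRIES - 1); }

bool predict(uint32_t fetch_pc)
{
    return bht[ghr & 0x3][idx(fetch_pc)] >= 2;   /* MSB of the counter */
}

void update(uint32_t fetch_pc, bool taken)
{
    uint8_t *c = &bht[ghr & 0x3][idx(fetch_pc)];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    ghr = ((ghr << 1) | (taken ? 1 : 0)) & 0x3;  /* shift in the new result */
}

int main(void)
{
    uint32_t pc = 0x1000;             /* hypothetical branch address */
    /* Train on an alternating pattern; the global history lets the predictor
     * learn "taken follows not-taken" and vice versa. */
    for (int i = 0; i < 32; i++)
        update(pc, i & 1);
    /* Last two outcomes were not-taken, taken -> predicts not taken (0),
     * which matches the next step of the alternating pattern. */
    printf("next prediction: %d (0 = not taken)\n", predict(pc));
    return 0;
}
```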
