Reducing the branch delay


SLIDE 1

Reducing the branch delay

[Diagram: five-stage pipeline datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with the branch comparator ("=") and shift-left-2 target adder moved into the decode stage, plus a hazard detection unit, forwarding unit, and IF.Flush to squash the instruction fetched after the branch.]

bne $2,$3,foo
addu $2,$4,$5

SLIDE 2

Branch bypassing – easy case: shaves one cycle off the branch penalty

[Diagram: same pipeline datapath; the forwarding paths feed the decode-stage branch comparator directly.]

bne $2,$3,foo
addu $2,$4,$5

SLIDE 3

Branch bypassing – back-to-back deps

[Diagram: same pipeline datapath, now with the branch's source registers produced by the immediately preceding instructions.]

bne $2,$3,foo
subu $3,$4,$5
ld $2,$4

SLIDE 4

Branch handling in decode needs lots of ugly paths – might as well execute the branch in the EXE stage and use better branch prediction instead

[Diagram: same pipeline datapath, highlighting the extra paths needed to resolve the branch in decode.]

bne $2,$3,foo
ld $2,$4
subu $3,$4,$5

SLIDE 5

Branch Prediction from 10,000 ft

[Diagram: decoupled front end and back end connected by a PC FIFO and an Instr FIFO; the front end "guesses" the next PC and fetches from the instruction cache, while the back end's branch resolution logic dequeues instructions and, on a mispredict (guessed PC != architectural PC), sends "restart at address X" back to the front end.]

On a restart (branch misprediction) the machine must:

  • a. kill all incorrectly fetched instructions (to ensure correct execution)
  • b. refill the pipeline (this takes a number of cycles equal to the latency of the pipeline up to the execute stage) – the misprediction penalty

Invariant: is_correct(PC) implies is_correct(Instr[PC])

SLIDE 6

Aside: Decoupled Execution

[Diagram: front end and back end connected by a FIFO, with cycle-by-cycle timelines (F = fetch, S = stall, C = cache miss, E = execute).]

Buffering smooths execution and improves cycle time by reducing stall propagation.

With decoupling, the front end runs ahead: its stalls and cache misses are overlapped with back-end execution. Without decoupling, stalls and cache misses are not overlapped.

SLIDE 7

Pipelined Front End

[Diagram: pipelined front end – the guess PC (GPC) feeds a PC FIFO and the instruction cache; "+4" and branch-immediate (br. imm) adders compute candidate next PCs; a branch direction predictor and a checker select the new PC, and fetched instructions (e.g. bne+ $2,$3,foo) flow through the Instr FIFO to the back end, which can force a restart.]

SLIDE 8

Branch Predicted-Taken Penalty

[Diagram: same pipelined front end; the fetch slots behind the taken branch ("X X") are squashed.]

Squash the speculatively fetched instructions that follow the branch.

SLIDE 9

Branch Misprediction Penalty

[Diagram: the decoupled front end / back end from Slide 5 again; on a mispredict, all in-flight instructions behind the branch are squashed and fetch restarts at the correct PC.]

Pentium 4 – ~20 cycles; Pentium III – ~10 cycles

"The Microarchitecture of the Pentium 4", Intel Technology Journal, Q1 2001

SLIDE 10

Since the misprediction penalty is larger, we first focus on branch (direction) prediction

  • Static Strategies:
  • #1 predict taken (34% mispredict rate)
  • #2 predict backwards taken, forwards not taken (10%, 50% mispredict rate)
      – same backwards behavior as #1
      – better forwards behavior (50%-50% branches)

[Figure: example code with a backward branch (taken ~90% of the time) and a forward branch (taken ~50% of the time), drawn as JZ instructions.]

Penalty: #1 – taken 2 cycles, ¬taken 20 cycles; #2 – taken 20 cycles, ¬taken 0 cycles (forward branches).

  #1 forward-branch average execution time = 50% * 2 + 50% * 20 = 11 cycles
  #2 forward-branch average execution time = 50% * 20 + 50% * 0 = 10 cycles
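To make the comparison concrete, here is a minimal C sketch (not from the slides) of strategy #2's decision rule and of the expected-cost arithmetic above. The 2-cycle and 20-cycle penalties are the values assumed on this slide; the PCs are made up.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    /* Static strategy #2: predict backward branches (target below the branch PC)
       taken, forward branches not taken. */
    static bool predict_btfn(uint32_t branch_pc, uint32_t target_pc) {
        return target_pc < branch_pc;
    }

    int main(void) {
        printf("backward branch predicted %s\n",
               predict_btfn(0x400100, 0x4000c0) ? "taken" : "not taken");

        /* Expected cost of a 50%-taken forward branch, with the penalties
           assumed on this slide: 2 cycles for a correctly predicted taken
           branch, 20 cycles for a misprediction, 0 for a correct not-taken. */
        double p_taken = 0.5;
        double s1 = p_taken * 2.0  + (1.0 - p_taken) * 20.0;  /* #1: predict taken     */
        double s2 = p_taken * 20.0 + (1.0 - p_taken) *  0.0;  /* #2: predict not taken */
        printf("strategy #1 forward-branch average: %.0f cycles\n", s1);  /* 11 */
        printf("strategy #2 forward-branch average: %.0f cycles\n", s2);  /* 10 */
        return 0;
    }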

SLIDE 11

Since the misprediction penalty is larger, we first focus on branch (direction) prediction

  • Static Strategies:
  • #3 profile (see the next slide for misprediction rates)
      – choose a single prediction for each branch and encode it in the instruction
      – some studies show that sample runs are fairly representative of inputs in general
      – negative: extra programmer burden

SLIDE 12

Profiling-Based Static Prediction

15% ave. (SPECint92), 9% ave. (SPECfp92) misprediction rate

Each branch is permanently assigned a probable direction. To do better we would need to change the prediction as the program runs!

SLIDE 13

A note on prediction/misprediction rates

15% ave. (SPECint92), 9% ave. (SPECfp92) misprediction rate. Qualitatively, the ratio of misprediction rates is a better indicator of predictor improvement (this assumes misprediction probability is independent between branches).

Bernoulli process: p^k = 0.5, so k = lg(0.5) / lg(p) is the number of consecutive branches predicted correctly with 50% probability.

  Misprediction rate   Prediction rate (p)   # consecutive branches predicted correctly (w/ 50% prob)
  50%                  50%                    1
  22%                  78%                    2.78
  15%                  85%                    4.26
   9%                  91%                    7.34
   5%                  95%                   13.5
   4%                  96%                   16.98
   2%                  98%                   34.3

2% makes a huge difference here.
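A small C sketch (link with -lm) that reproduces the right-hand column of the table from the Bernoulli model on this slide:

    #include <stdio.h>
    #include <math.h>

    /* Model each prediction as an independent Bernoulli trial with success
       probability p (the prediction rate).  Solving p^k = 0.5 gives
       k = log(0.5)/log(p), the number of consecutive branches predicted
       correctly with 50% probability. */
    int main(void) {
        const double rates[] = { 0.50, 0.78, 0.85, 0.91, 0.95, 0.96, 0.98 };
        for (int i = 0; i < 7; i++)
            printf("p = %.2f  ->  k = %5.2f branches\n",
                   rates[i], log(0.5) / log(rates[i]));
        return 0;   /* prints 1.00, 2.78, 4.26, 7.34, 13.51, 16.98, 34.31 */
    }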

SLIDE 14

The compiler can also take advantage of static prediction / profiling / knowledge

  • Static Strategies:
  • #1 predict taken (34% mispredict rate)
  • #2 predict backwards taken, forwards not taken (10%, 50% mispredict rate)
  • #3 profile (see the previous slide)
  • #4 delayed branches – always execute the instructions after the branch; avoids the need to flush the pipeline after the branch

SLIDE 15

Observation: static prediction is limited because it only uses the instructions as input and has a fixed prediction per branch

[Diagram: the pipelined front end from Slide 7 – GPC, PC FIFO, instruction cache, "+4" and branch-immediate adders, branch direction predictor, restart / new PC from the back end.]

SLIDE 16

Dynamic Prediction: more inputs allow it to adjust the branch direction prediction over time

[Diagram: the same front end, with new predictor inputs – the instruction PC, branch info from decode, and mispredict feedback from the back end.]

SLIDE 17

Dynamic Prediction: more detail

[Diagram: the same front end, with a branch-descriptor FIFO (br. descr FIFO) added between the predictor and the back end's mispredict feedback path.]

SLIDE 18

Dynamic Branch Prediction – Track Changing Per-Branch Behavior

  • Store 2 bits per branch
  • Change the prediction only after two consecutive mistakes!

[State diagram: four states (00, 01, 10, 11) with taken / ¬taken transitions. The BP state is (next prediction: taken/¬taken) x (last branch: taken/¬taken), with next-state logic in terms of p0, p1, and the actual taken/not-taken outcome. Note: this is not strictly a saturating up/down counter.]
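A minimal C sketch of the two-bit scheme described here. The bit-level encoding below differs from the slide's (prediction x last outcome) state, but the behavior is the same: the prediction flips only after two consecutive mispredictions.

    #include <stdio.h>
    #include <stdbool.h>

    /* Two-bit branch direction predictor, hysteresis style.
       (As the slide notes, this is not the same as a saturating up/down counter.) */
    typedef struct {
        bool pred;      /* current prediction: true = taken                          */
        bool mispred;   /* true if the previous prediction for this branch was wrong */
    } bp2_t;

    static bool bp2_predict(const bp2_t *s) { return s->pred; }

    static void bp2_update(bp2_t *s, bool taken) {
        if (taken == s->pred) {
            s->mispred = false;                  /* correct: clear the strike        */
        } else if (s->mispred) {
            s->pred = taken; s->mispred = false; /* second miss in a row: flip       */
        } else {
            s->mispred = true;                   /* first miss: keep the prediction  */
        }
    }

    int main(void) {
        bp2_t s = { .pred = true, .mispred = false };
        bool outcomes[] = { true, true, true, false, true, true, true, false };
        int misses = 0;
        for (int i = 0; i < 8; i++) {
            if (bp2_predict(&s) != outcomes[i]) misses++;
            bp2_update(&s, outcomes[i]);
        }
        /* Loop-like pattern (two 4-iteration loop executions): one miss per loop execution. */
        printf("mispredictions: %d\n", misses);
        return 0;
    }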

SLIDE 19

Why two bits?

  • One bit is wishy-washy on loops

Single-Bit Predictor Analysis

    top: add
         add
         beq top

    Prediction:  T T T T T   N T T T T   N T T T T
    Outcome:     T T T T N   T T T T N   T T T T N

Two mispredictions per loop execution with single-bit prediction.

(No data – either use whatever is left over from before, or initialize on i-cache fill with "predict taken" for backwards branches.)

SLIDE 20

Why two bits?

  • One bit is wishy-washy on loops

Two-Bit Predictor Analysis

    top: add
         add
         beq top

    Prediction:  T T T T T   T T T T T   T T T T T
    Outcome:     T T T T N   T T T T N   T T T T N

One misprediction per loop execution with two-bit prediction (only the final, not-taken iteration is mispredicted).

SLIDE 21

n-bit implementation

[Diagram: Branch (direction) Predictor – the "guess" PC is hashed to index a Branch History Table (BHT) of 4K n-bit counters, read in parallel with the I-cache's "is this a branch?" check to produce a prediction; many cycles later, branch info (PC / descriptor / outcome) arrives, the state transition is computed, and the result is written back. Branch descriptors are blindly written into this hash table; branches may alias, but that's "ok".]
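A minimal C sketch of this organization (2-bit saturating counters are used here for concreteness; the slide's counters are n-bit, and Slide 18's variant uses a different update rule): 4K untagged counters indexed by a hash of the PC, read at prediction time and written when the outcome arrives.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define BHT_ENTRIES 4096                 /* 4K counters, as on the slide */

    static uint8_t bht[BHT_ENTRIES];         /* each entry: a 2-bit counter, value 0..3 */

    /* Untagged index: just hash the PC.  Different branches may alias into the
       same counter -- as the slide says, that's "ok"; it only costs accuracy. */
    static unsigned bht_index(uint32_t pc) { return (pc >> 2) & (BHT_ENTRIES - 1); }

    static bool bht_predict(uint32_t pc) {   /* read at fetch time */
        return bht[bht_index(pc)] >= 2;      /* 2 or 3 => predict taken */
    }

    static void bht_update(uint32_t pc, bool taken) {  /* write, many cycles later */
        uint8_t *c = &bht[bht_index(pc)];
        if (taken  && *c < 3) (*c)++;        /* saturating increment */
        if (!taken && *c > 0) (*c)--;        /* saturating decrement */
    }

    int main(void) {
        uint32_t pc = 0x00400180;            /* hypothetical branch address */
        for (int i = 0; i < 3; i++) bht_update(pc, true);
        printf("prediction after three taken outcomes: %s\n",
               bht_predict(pc) ? "taken" : "not taken");
        return 0;
    }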

SLIDE 22

Accuracy of a simple dynamic branch predictor: 4096-entry 2-bit predictor on SPEC89

Somewhat old benchmarks – we would probably need slightly larger predictors to do this well on current benchmarks.

[Chart: per-benchmark misprediction rates (values such as 22%, 18%, 12%, 11%) compared against profiling.]

SLIDE 23

Limits of 2-bit prediction

  • an infinite table does not help much on SPEC89
  • reportedly, more bits per counter do not help significantly either

SLIDE 24

Exploiting Spatial Correlation

Yeh and Patt, 1992

  • History bit H records the direction of the last branch executed by the processor
  • Two sets of BHT bits (BHT0 & BHT1) per branch instruction
  • H = 0 (not taken) ⇒ consult BHT0
  • H = 1 (taken) ⇒ consult BHT1

    if (x[i] < 7) then y += 1;
    if (x[i] < 5) then c -= 4;

If the first condition is false, the second condition is also false.

Adapted from Arvind and Asanovic's MIT course 6.823, Lecture 6
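A rough C sketch of the mechanism (not the Yeh-Patt hardware; table size, indexing, and the toy trace are illustrative): one global bit H holds the last branch's direction, and each lookup consults BHT0 or BHT1 depending on H, so the second branch's prediction can depend on the first branch's outcome.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define ENTRIES 1024                       /* table size is illustrative */

    static uint8_t bht0[ENTRIES];              /* consulted when H = 0 (last branch not taken) */
    static uint8_t bht1[ENTRIES];              /* consulted when H = 1 (last branch taken)     */
    static bool    H;                          /* direction of the last branch executed        */

    static unsigned idx(uint32_t pc) { return (pc >> 2) & (ENTRIES - 1); }

    static bool predict(uint32_t pc) {
        uint8_t c = H ? bht1[idx(pc)] : bht0[idx(pc)];
        return c >= 2;                         /* 2-bit counter: 2 or 3 => taken */
    }

    static void update(uint32_t pc, bool taken) {
        uint8_t *c = H ? &bht1[idx(pc)] : &bht0[idx(pc)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        H = taken;                             /* remember this branch's direction */
    }

    int main(void) {
        /* Toy trace of two correlated branches ("taken" stands for "condition true"):
           the second branch always goes the same way as the first. */
        uint32_t b1 = 0x1000, b2 = 0x1010;     /* hypothetical branch PCs */
        for (int i = 0; i < 20; i++) {
            bool first  = (i % 3 != 0);        /* varying direction for branch 1     */
            bool second = first;               /* perfectly correlated with branch 1 */
            predict(b1); update(b1, first);
            printf("branch 2 predicted %s, actually %s\n",
                   predict(b2) ? "taken" : "not taken",
                   second ? "taken" : "not taken");
            update(b2, second);
        }
        return 0;
    }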

slide-25
SLIDE 25

Accuracy with 2 bits of global history

Less storage than 4k x 2bit but better accuracy (for these benchmarks)

SLIDE 26

Two-Level Branch Predictor

Pentium Pro (1995) uses the result from the last two branches to select one of four sets of BHT bits (~95% correct).

[Diagram: the fetch PC indexes within the BHT sets; a 2-bit global branch history shift register, which shifts in the taken/¬taken result of each branch, selects which set supplies the taken/¬taken prediction.]

Adapted from Arvind and Asanovic's MIT course 6.823, Lecture 6
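A minimal C sketch of this two-level organization (k = 2 history bits as on the slide; the per-set table size and indexing are illustrative): a global history shift register selects one of four counter sets, and the fetch PC indexes within the set.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define HIST_BITS 2                            /* 2-bit global history, as on the slide */
    #define SETS      (1 << HIST_BITS)             /* 4 sets of BHT bits                     */
    #define ENTRIES   1024                         /* per-set size is illustrative           */

    static uint8_t  bht[SETS][ENTRIES];            /* 2-bit counters                         */
    static unsigned ghr;                           /* global history shift register          */

    static unsigned idx(uint32_t pc) { return (pc >> 2) & (ENTRIES - 1); }

    static bool predict(uint32_t pc) {
        return bht[ghr][idx(pc)] >= 2;             /* history picks the set, PC picks entry  */
    }

    static void update(uint32_t pc, bool taken) {
        uint8_t *c = &bht[ghr][idx(pc)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = ((ghr << 1) | taken) & (SETS - 1);   /* shift in this branch's outcome         */
    }

    int main(void) {
        uint32_t pc = 0x2000;                      /* hypothetical branch PC */
        bool pattern[] = { true, true, false, false };   /* period-4 T T N N pattern */
        int miss = 0;
        for (int i = 0; i < 400; i++) {
            bool t = pattern[i % 4];
            if (predict(pc) != t) miss++;
            update(pc, t);
        }
        /* Each 2-bit history value maps to a unique next outcome for this pattern,
           so only the warm-up iterations mispredict. */
        printf("mispredictions on T T N N pattern: %d\n", miss);
        return 0;
    }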

SLIDE 27

Benefit of longer histories for fixed-iteration loops with small iteration counts

  • Unary encoding of branch patterns

    top: add
         add
         beq top

    Prediction:  T T T T N   T T T T N   T T T T N
    Outcome:     T T T T N   T T T T N   T T T T N

No mispredictions per (5-iteration) loop execution with >= 5 bits of history.

Doesn't work for many-iteration loops – but the relative error is smaller!

History Table / Prediction:
    0 1 1 1 1 -> N
    1 1 1 1 0 -> T
    1 1 1 0 1 -> T
    1 1 0 1 1 -> T
    1 0 1 1 1 -> T
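A small C sketch of the idea for a single branch (the loop pattern is this slide's 5-iteration example; the table size and counter width are illustrative): the last five outcomes of the branch index a pattern table, so the "four taken, then not taken" pattern is predicted perfectly once trained.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define HIST_BITS 5                              /* enough to capture a 5-iteration loop   */
    #define PATTERNS  (1 << HIST_BITS)

    static unsigned hist;                            /* this branch's last 5 outcomes          */
    static uint8_t  pht[PATTERNS];                   /* 2-bit counters indexed by that history */

    static bool predict(void) { return pht[hist] >= 2; }

    static void update(bool taken) {
        uint8_t *c = &pht[hist];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        hist = ((hist << 1) | taken) & (PATTERNS - 1);
    }

    int main(void) {
        int miss = 0;
        for (int trip = 0; trip < 100; trip++)       /* 100 executions of a 5-iteration loop */
            for (int i = 0; i < 5; i++) {
                bool taken = (i < 4);                /* backward branch: taken 4x, then falls out */
                if (predict() != taken) miss++;
                update(taken);
            }
        /* Each 5-outcome history maps to a unique next outcome, so after warm-up
           there are no further mispredictions. */
        printf("mispredictions over 100 loop trips: %d\n", miss);
        return 0;
    }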

SLIDE 28

Alpha 21264 Tournament Predictor

"Predictor-predictor": 4K 2-bit counters indexed by branch address

  • chooses between two predictors:
  • A. global predictor: 4K 2-bit counters indexed by 12-bit global history
  • B. local predictor: 1024 10-bit entries containing the history for that branch; this history indexes into 1K 3-bit saturating counters

[Diagram: the "guess" PC indexes the predictor-predictor (4K 2-bit counters), which selects between the global-history predictor (4K 2-bit counters) and the local-history predictor (1K 10-bit history entries feeding 1K 3-bit counters); the result is combined with the I-cache's "is this a branch?" check to produce the prediction.]
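A rough C sketch of this tournament organization (indexing and update details are simplified and the real 21264 differs in specifics): a chooser table indexed by the branch address picks between a global-history component and a local-history component, and is nudged toward whichever component was right when they disagree. Table sizes follow the slide.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define CHOOSER_ENTRIES 4096        /* 4K 2-bit counters indexed by branch address */
    #define GLOBAL_ENTRIES  4096        /* 4K 2-bit counters indexed by 12-bit history */
    #define LOCAL_HISTS     1024        /* 1K 10-bit local history entries             */
    #define LOCAL_ENTRIES   1024        /* 1K 3-bit counters indexed by local history  */

    static uint8_t  chooser[CHOOSER_ENTRIES];   /* >= 2 means "trust the global predictor" */
    static uint8_t  gtable[GLOBAL_ENTRIES];     /* 2-bit counters                          */
    static unsigned ghist;                      /* 12-bit global history                   */
    static uint16_t lhist[LOCAL_HISTS];         /* per-branch 10-bit histories             */
    static uint8_t  ltable[LOCAL_ENTRIES];      /* 3-bit counters                          */

    static unsigned pc_idx(uint32_t pc, unsigned n) { return (pc >> 2) & (n - 1); }

    static bool predict(uint32_t pc) {
        bool g = gtable[ghist & (GLOBAL_ENTRIES - 1)] >= 2;
        bool l = ltable[lhist[pc_idx(pc, LOCAL_HISTS)] & (LOCAL_ENTRIES - 1)] >= 4;
        return (chooser[pc_idx(pc, CHOOSER_ENTRIES)] >= 2) ? g : l;
    }

    static void bump(uint8_t *c, bool up, uint8_t max) {
        if (up  && *c < max) (*c)++;
        if (!up && *c > 0)   (*c)--;
    }

    static void update(uint32_t pc, bool taken) {
        unsigned gi = ghist & (GLOBAL_ENTRIES - 1);
        unsigned li = lhist[pc_idx(pc, LOCAL_HISTS)] & (LOCAL_ENTRIES - 1);
        bool g_right = (gtable[gi] >= 2) == taken;
        bool l_right = (ltable[li] >= 4) == taken;

        if (g_right != l_right)                          /* train chooser toward the winner */
            bump(&chooser[pc_idx(pc, CHOOSER_ENTRIES)], g_right, 3);
        bump(&gtable[gi], taken, 3);                     /* 2-bit counter                   */
        bump(&ltable[li], taken, 7);                     /* 3-bit counter                   */

        ghist = ((ghist << 1) | taken) & 0xFFF;          /* 12-bit global history           */
        unsigned h = pc_idx(pc, LOCAL_HISTS);
        lhist[h] = (uint16_t)(((lhist[h] << 1) | taken) & 0x3FF);  /* 10-bit local history  */
    }

    int main(void) {
        uint32_t pc = 0x3000;                            /* hypothetical branch PC */
        for (int i = 0; i < 50; i++) update(pc, i % 2);  /* warm up on an alternating branch */
        printf("prediction after warm-up: %s\n", predict(pc) ? "taken" : "not taken");
        return 0;
    }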

SLIDE 29

Tournament, Correlating, Local Predictor Performance

SPEC89 (predictor size presumably in Kbit). Consecutively correctly predicted branches with 50% probability: 2-bit ~10, correlating ~18, tournament ~25.

SLIDE 30

Pentium 4 (3.2 GHz) SPEC2000 Misprediction Rates

note: the metric is slightly different here, but P4 has some of the best branch prediction because it needs it – extremely long pipeline

SLIDE 31

Predicted-Taken Penalty

[Diagram: the pipelined front end again; even with a correct taken prediction, the fetch slots behind the branch ("X X") are squashed while the new PC is produced.]

SLIDE 32

Top N List of Ways to Avoid Branch-Taken Penalties

  • 1. Unroll thy loops

Before:
    top: ld
         add
         beq top

After (compiler unrolls by 2):
    top: ld
         add
         bne out
         ld
         add
         beq top
    out:

Unrolling loops reduces the number of backwards-taken branches in the program, and thus many of the predicted-taken branches. Matters most when loop bodies are small. (Red arcs = common case in this example.)

Positives/Negatives?
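The same transformation at the C level, as a sketch (the function and loop are made up for illustration): unrolling by two halves the number of backward-taken branches executed for the same amount of work.

    /* Original loop: one backward-taken branch per element. */
    void sum_orig(const int *x, int n, long *out) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += x[i];
        *out = s;
    }

    /* Unrolled by 2: one backward-taken branch per two elements.
       (Assumes n is even; a real compiler adds fix-up code for the remainder.) */
    void sum_unrolled(const int *x, int n, long *out) {
        long s = 0;
        for (int i = 0; i < n; i += 2) {
            s += x[i];
            s += x[i + 1];
        }
        *out = s;
    }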

SLIDE 33

Top N List of Ways to Avoid Branch-Taken Penalties

  • 2. Unroll+ : reorder code into common paths and off-paths

Before:
    top:  ld
          add
          beq skip
          or
    skip: beq top

After (compiler):
    top:       ld
               add
               bne anti-skip
               bne out
               ld
               add
               bne anti-skip
               beq top
               jmp out
    anti-skip: or
               j patch
    out:
    patch:

  • Often need profiling to get this kind of information.
  • + Avoid branch-taken penalties, with the same accuracy limits as static branch prediction.
  • Often more instructions added to the off-paths.

Positives/Negatives?

SLIDE 34

Top N List of Ways to Avoid Branch-Taken Penalties

  • 3. Delayed Branches

Before:
    top: a
         b
         c
         d
         bne b, top
         e

After (compiler fills the delay slots):
    top: a
         b
         bne b, top
         c
         d
         e

  • Requires extra work that is independent of the branch and can be scheduled into the slots – often not available.
  • Architecturally fixed number of delay slots.
  • Messy semantics – branches within branch delay slots? Exceptions?

Positives/Negatives?

SLIDE 35

Top N List of Ways to Avoid Branch-Taken Penalties

  • 4. Annulled Branches

Before:
    top: a
         b
         c
         d
         bne b, top
         e

After (compiler):
         a
         b
    top: c
         d
         bne b, top
         a        (killed if branch not taken!)
         b        (killed if branch not taken!)
         e

  • + Filler instructions are automatically independent of the branch because they come from the next iteration of the loop. It is easier to fill these slots than standard delay slots.
  • Architecturally fixed number of delay slots.
  • Messy semantics – branches within branch delay slots? Exceptions?

Positives/Negatives?

SLIDE 36

Top N List of Ways to Avoid Branch-Taken Penalties

  • 5. Fetch Ahead (So as Not to Fall Behind)

+ The fetch unit can fetch more instructions per cycle than the back end can consume (e.g. 2 fetched vs. 1 consumed), filling the FIFO more quickly. Then the front end can afford to spend a few cycles on each taken branch.

[Diagram: front end feeding the Instr FIFO faster than the back end drains it.]

Positives/Negatives?

SLIDE 37

Top N List of Ways to Avoid Branch-Taken Penalties

  • 6. Branch Target Buffer
SLIDE 38

Branch Target Buffer

[Diagram: the front end from Slide 17 with a BTB added; the BTB can override the "+4" guess PC ("btb override"), and is trained from branch info and mispredict feedback ("fix BTB guess?").]

Positives/Negatives?

SLIDE 39

BTB Design #1

[Diagram: a BTB placed in the fetch path; its target guess can override the sequential PC that goes to the i-cache.]

Positives/Negatives?

SLIDE 40

Simple, Fast "Next-Ptr" BTB Design – a la Alpha 21264

[Diagram: an SRAM holding one next-ptr entry per fetch block sits next to the GPC; its "next block" output feeds the i-cache, and it can be overridden by a BTB misprediction signal from the branch predictor or by a misprediction from the back end.]

Compared to the I-cache, the BTB SRAM is smaller (e.g. 512 x 9b versus 512 x 256b, or 1024 x 10b versus 1024 x 128b) and should have a smaller access time and/or lower latency than the i-cache.

The BTB selects the next fetch block to access. The update mechanism (not shown) may include some hysteresis a la the 2-bit predictor, and does not need to be on the critical path.

(The red line is the critical path – in the Raw tile, this was the critical path of the design – which can be optimized down to the latency through the SRAM, a mux, and a latch.)

Positives/Negatives?
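A tiny C sketch of the next-ptr idea (field widths follow the slide's 512 x 9b example; everything else is illustrative): each fetch block stores only a short index of the block to fetch next, so the array is narrow and its read, a mux, and a latch fit on the fetch critical path.

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCKS 512                         /* 512 fetch blocks, matching the 512 x 9b example */

    /* One narrow "next block" pointer per fetch block: no tag, no full target PC.
       A wrong guess is caught later (by the branch predictor or the back end),
       which restarts fetch and rewrites the entry. */
    static uint16_t next_ptr[BLOCKS];          /* only the low 9 bits are meaningful */

    static unsigned btb_next_block(unsigned cur_block) {
        return next_ptr[cur_block % BLOCKS] & 0x1FF;    /* guess which block to fetch next */
    }

    static void btb_repair(unsigned cur_block, unsigned correct_next) {
        next_ptr[cur_block % BLOCKS] = (uint16_t)(correct_next & 0x1FF);  /* fix a wrong guess */
    }

    int main(void) {
        /* A real design would initialize entries to "sequential" (cur_block + 1)
           and might add hysteresis before rewriting, per the slide. */
        btb_repair(7, 42);                     /* back end reports: block 7 really jumps to block 42 */
        printf("next fetch block after block 7: %u\n", btb_next_block(7));
        return 0;
    }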