
Overview Instruction level parallelism Dynamic Scheduling - PDF document

Chapter 2 overview: instruction-level parallelism, dynamic scheduling techniques (scoreboarding and Tomasulo's algorithm), and reducing branch cost with dynamic hardware prediction.


  1. Chapter 2: Instruction-Level Parallelism and Its Exploitation

     Overview
     • Instruction level parallelism
     • Dynamic Scheduling Techniques
       – Scoreboarding
       – Tomasulo’s Algorithm
     • Reducing Branch Cost with Dynamic Hardware Prediction
       – Basic Branch Prediction and Branch-Prediction Buffers
       – Branch Target Buffers
     • Overview of Superscalar and VLIW processors

     CPI Equation
     Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls

     Technique                                    Reduces
     Loop unrolling                               Control stalls
     Basic pipeline scheduling                    RAW stalls
     Dynamic scheduling with scoreboarding        RAW stalls
     Dynamic scheduling with register renaming    WAR and WAW stalls
     Dynamic branch prediction                    Control stalls
     Issuing multiple instructions per cycle      Ideal CPI
     Compiler dependence analysis                 Ideal CPI and data stalls
     Software pipelining and trace scheduling     Ideal CPI and data stalls
     Speculation                                  All data and control stalls
     Dynamic memory disambiguation                RAW stalls involving memory

     Instruction Level Parallelism
     • Potential overlap among instructions
     • Few possibilities in a basic block
       – Blocks are small (6-7 instructions)
       – Instructions are dependent
     • Exploit ILP across multiple basic blocks
       – Iterations of a loop
           for (i = 1000; i > 0; i=i-1)
               x[i] = x[i] + s;
       – Alternative to vector instructions

     Basic Pipeline Scheduling
     • Find sequences of unrelated instructions
     • Compiler’s ability to schedule
       – Amount of ILP available in the program
       – Latencies of the functional units
     • Latency assumptions for the examples
       – Standard MIPS integer pipeline
       – No structural hazards (fully pipelined or duplicated units)
       – Latencies of FP operations (in clock cycles):

         Instruction producing result   Instruction using result   Latency
         FP ALU op                      FP ALU op                  3
         FP ALU op                      SD                         2
         LD                             FP ALU op                  1
         LD                             SD                         0

     Sample Pipeline
     [Figure: IF and ID stages feed either the integer EX stage or a four-stage FP unit (FP1 FP2 FP3 FP4), followed by DM and WB.]

     FP ALU op followed by a dependent FP ALU op (3 stalls):
       FP ALU   IF  ID  FP1  FP2    FP3    FP4    DM   WB
       FP ALU       IF  ID   stall  stall  stall  FP1  FP2  FP3

     FP ALU op followed by a dependent SD (2 stalls):
       FP ALU   IF  ID  FP1  FP2    FP3    FP4  DM  WB
       SD           IF  ID   EX     stall  stall  DM  WB
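The latency table above can be turned into a small stall counter. The following is a minimal sketch, assuming a single-issue in-order pipeline in which a dependent instruction must issue at least `latency + 1` cycles after its producer; the function names `issue_cycles` and `stall_count` and the tuple encoding of instructions are illustrative, not part of the slides.

```python
# Latencies from the slide: (producer kind, consumer kind) -> stall cycles needed.
LATENCY = {
    ("FP ALU op", "FP ALU op"): 3,
    ("FP ALU op", "SD"): 2,
    ("LD", "FP ALU op"): 1,
    ("LD", "SD"): 0,
}


def issue_cycles(program):
    """Return the issue cycle of each instruction.

    program is a list of (kind, dest, sources) tuples. Pairs that are not
    in LATENCY are assumed to need no stall (a simplification).
    """
    cycles = []
    produced = {}  # register -> (producer kind, producer issue cycle)
    for kind, dest, sources in program:
        cycle = cycles[-1] + 1 if cycles else 1  # in-order, one per cycle
        for reg in sources:
            if reg in produced:
                p_kind, p_cycle = produced[reg]
                gap = LATENCY.get((p_kind, kind), 0)
                cycle = max(cycle, p_cycle + 1 + gap)
        cycles.append(cycle)
        if dest is not None:
            produced[dest] = (kind, cycle)
    return cycles


def stall_count(program):
    """Stall cycles = last issue cycle minus the number of instructions."""
    return issue_cycles(program)[-1] - len(program)


# LD F0, 0(R1); ADDD F4, F0, F2; SD 0(R1), F4
body = [
    ("LD", "F0", ["R1"]),
    ("FP ALU op", "F4", ["F0", "F2"]),
    ("SD", None, ["F4", "R1"]),
]
print(issue_cycles(body))  # [1, 3, 6]
print(stall_count(body))   # 3
```

Under these assumptions the three instructions issue in cycles 1, 3 and 6, so three of the six cycles are RAW stalls; these are the stalls that basic pipeline scheduling tries to fill with unrelated instructions.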

  2. Basic Scheduling

     Sequential MIPS Assembly Code for
         for (i = 1000; i > 0; i=i-1)
             x[i] = x[i] + s;

     Loop: LD    F0, 0(R1)
           ADDD  F4, F0, F2
           SD    0(R1), F4
           SUBI  R1, R1, #8
           BNEZ  R1, Loop

     Pipelined execution:                     Cycle
     Loop: LD    F0, 0(R1)                    1
           stall                              2
           ADDD  F4, F0, F2                   3
           stall                              4
           stall                              5
           SD    0(R1), F4                    6
           SUBI  R1, R1, #8                   7
           stall                              8
           BNEZ  R1, Loop                     9
           stall                              10

     Scheduled pipelined execution:           Cycle
     Loop: LD    F0, 0(R1)                    1
           SUBI  R1, R1, #8                   2
           ADDD  F4, F0, F2                   3
           stall                              4
           BNEZ  R1, Loop                     5
           SD    8(R1), F4                    6

     Loop Unrolling

     Unrolled loop (four copies):
     Loop: LD    F0, 0(R1)
           ADDD  F4, F0, F2
           SD    0(R1), F4
           LD    F6, -8(R1)
           ADDD  F8, F6, F2
           SD    -8(R1), F8
           LD    F10, -16(R1)
           ADDD  F12, F10, F2
           SD    -16(R1), F12
           LD    F14, -24(R1)
           ADDD  F16, F14, F2
           SD    -24(R1), F16
           SUBI  R1, R1, #32
           BNEZ  R1, Loop

     Scheduled unrolled loop:
     Loop: LD    F0, 0(R1)
           LD    F6, -8(R1)
           LD    F10, -16(R1)
           LD    F14, -24(R1)
           ADDD  F4, F0, F2
           ADDD  F8, F6, F2
           ADDD  F12, F10, F2
           ADDD  F16, F14, F2
           SD    0(R1), F4
           SD    -8(R1), F8
           SUBI  R1, R1, #32
           SD    16(R1), F12
           BNEZ  R1, Loop
           SD    8(R1), F16

     Dynamic Scheduling
     • Scheduling separates dependent instructions
       – Static – performed by the compiler
       – Dynamic – performed by the hardware
     • Advantages of dynamic scheduling
       – Handles dependences unknown at compile time
       – Simplifies the compiler
       – Optimization is done at run time
     • Disadvantages
       – Cannot eliminate true data dependences

     Out-of-Order Execution (1/2)
     • Central idea of dynamic scheduling
       – In-order execution:
           DIVD  F0, F2, F4      IF  ID  DIV  …..
           ADDD  F10, F0, F8         IF  ID   stall  stall  stall  …
           SUBD  F12, F8, F14            IF   stall  stall  …..
       – Out-of-order execution:
           DIVD  F0, F2, F4      IF  ID  DIV  …..
           SUBD  F12, F8, F14        IF  ID   A1  A2  A3  A4  …
           ADDD  F10, F0, F8             IF   ID  stall  …..

     Out-of-Order Execution (2/2)
     • Separate issue process in ID:
       – Issue
         • decode instruction
         • check structural hazards
         • in-order execution
       – Read operands
         • Wait until no data hazards
         • Read operands
     • Out-of-order execution/completion
       – Exception handling problems
       – WAR hazards

     Dynamic Scheduling with a Scoreboard
     • Details in Appendix A.7
     • Allows out-of-order execution
       – Sufficient resources
       – No data dependencies
     • Responsible for issue, execution and hazards
     • Functional units with long delays
       – Duplicated
       – Fully pipelined
     • CDC 6600 – 16 functional units
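The DIVD/ADDD/SUBD example from the out-of-order execution slide can be replayed with a toy timing model. This is a sketch under stated assumptions: one instruction is issued per cycle, structural, WAR and WAW hazards are ignored, and the latencies (40 cycles for DIVD, 4 for ADDD and SUBD) are hypothetical values chosen for illustration, since the slides do not give them. The names `Instr` and `start_cycles` are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int  # execution cycles (assumed, not from the slides)


def start_cycles(program, in_order):
    """Cycle in which each instruction begins execution.

    With in_order=True an instruction cannot start before the previous
    one has started; otherwise it starts as soon as its operands exist.
    """
    ready = {}       # register -> cycle its new value becomes available
    starts = []
    prev_start = 0
    for i, ins in enumerate(program):
        earliest = i + 1  # issued one per cycle, in program order
        for reg in ins.srcs:
            earliest = max(earliest, ready.get(reg, 0))  # RAW wait
        if in_order:
            earliest = max(earliest, prev_start + 1)
        starts.append(earliest)
        ready[ins.dest] = earliest + ins.latency
        prev_start = earliest
    return starts


program = [
    Instr("DIVD", "F0", ("F2", "F4"), 40),
    Instr("ADDD", "F10", ("F0", "F8"), 4),
    Instr("SUBD", "F12", ("F8", "F14"), 4),
]
print(start_cycles(program, in_order=True))   # [1, 41, 42]
print(start_cycles(program, in_order=False))  # [1, 41, 3]
```

In the in-order case SUBD waits behind ADDD even though it does not depend on DIVD; with out-of-order execution it starts in cycle 3, while ADDD still waits for F0 because the true data dependence cannot be removed.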

  3. MIPS with Scoreboard
     [Figure: MIPS processor organization with a scoreboard.]

     Scoreboard Operation
     • Scoreboard centralizes hazard management
       – Every instruction goes through the scoreboard
       – Scoreboard determines when the instruction can read its operands and begin execution
       – Monitors changes in hardware and decides when a stalled instruction can execute
       – Controls when instructions can write results
     • New pipeline
         ID: Issue, Read Regs
         EX: Execution
         WB: Write

     Execution Process
     • Issue
       – Functional unit is free (structural)
       – Active instructions do not have same Rd (WAW)
     • Read Operands
       – Checks availability of source operands
       – Resolves RAW hazards dynamically (out-of-order execution)
     • Execution
       – Functional unit begins execution when operands arrive
       – Notifies the scoreboard when it has completed execution
     • Write result
       – Scoreboard checks WAR hazards
       – Stalls the completing instruction if necessary

     Scoreboard Data Structure
     • Instruction status – indicates pipeline stage
     • Functional unit status
         Busy – functional unit is busy or not
         Op – operation to perform in the unit (+, -, etc.)
         Fi – destination register
         Fj, Fk – source register numbers
         Qj, Qk – functional unit producing Fj, Fk
         Rj, Rk – flags indicating when Fj, Fk are ready
     • Register result status – FU that will write registers

     Scoreboard Data Structure (1/3)

     Instruction           Issue   Read operands   Execution completed   Write
     LD    F6, 34(R2)      Y       Y               Y                     Y
     LD    F2, 45(R3)      Y       Y               Y
     MULTD F0, F2, F4      Y
     SUBD  F8, F6, F2      Y
     DIVD  F10, F0, F6     Y
     ADDD  F6, F8, F2

     Scoreboard Data Structure (2/3)

     Name      Busy   Op     Fi    Fj   Fk   Qj        Qk        Rj   Rk
     Integer   Y      Load   F2         R3                            N
     Mult1     Y      Mult   F0    F2   F4   Integer             N    Y
     Mult2     N
     Add       Y      Sub    F8    F6   F2             Integer   Y    N
     Divide    Y      Div    F10   F0   F6   Mult1               N    Y

     Register          F0      F2    F4    F6    F8    F10   F12   . . .   F30
     Functional unit   Mult1   Int               Add   Div
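The functional unit status and register result status tables can be modelled directly. The sketch below is a simplified illustration, not the full scoreboard algorithm: it implements only the issue check, the read-operands check and the wake-up on write, omits the WAR check at write result, and does not clear Rj/Rk after operands are read. The class and method names (`FUStatus`, `Scoreboard`, `can_issue`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing Fj, Fk
    qk: Optional[str] = None
    rj: bool = False           # operand-ready flags
    rk: bool = False


class Scoreboard:
    def __init__(self, unit_names):
        self.units = {name: FUStatus() for name in unit_names}
        self.result = {}  # register result status: register -> unit

    def can_issue(self, unit, dest):
        # Structural check (unit free) and WAW check (no pending writer).
        return not self.units[unit].busy and dest not in self.result

    def issue(self, unit, op, dest, src_j, src_k):
        if not self.can_issue(unit, dest):
            return False
        qj = self.result.get(src_j)
        qk = self.result.get(src_k)
        self.units[unit] = FUStatus(True, op, dest, src_j, src_k,
                                    qj, qk, qj is None, qk is None)
        self.result[dest] = unit
        return True

    def can_read_operands(self, unit):
        u = self.units[unit]
        return u.busy and u.rj and u.rk  # RAW hazards resolved

    def write_result(self, unit):
        # Simplification: the WAR check before writing is omitted here.
        dest = self.units[unit].fi
        for other in self.units.values():
            if other.qj == unit:
                other.qj, other.rj = None, True
            if other.qk == unit:
                other.qk, other.rk = None, True
        del self.result[dest]
        self.units[unit] = FUStatus()


sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
sb.issue("Integer", "Load", "F2", None, "R3")   # LD    F2, 45(R3)
sb.issue("Mult1", "Mult", "F0", "F2", "F4")     # MULTD F0, F2, F4
sb.issue("Add", "Sub", "F8", "F6", "F2")        # SUBD  F8, F6, F2
sb.issue("Divide", "Div", "F10", "F0", "F6")    # DIVD  F10, F0, F6
print(sb.issue("Add", "Add", "F6", "F8", "F2")) # False: Add unit is busy
```

After these issues the Qj/Qk fields match the table above (Mult1 waits on Integer, Divide waits on Mult1), and ADDD cannot issue because of a structural hazard on the Add unit. Calling `sb.write_result("Integer")` then makes both Mult1 and Add ready to read their operands.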

  4. Scoreboard Data Structure (3/3)
     [Figure not recoverable from the extracted text.]

     Scoreboard Algorithm
     [Figure not recoverable from the extracted text.]

     Scoreboard Limitations
     • Amount of available ILP
     • Number of scoreboard entries
       – Limited to a basic block
       – Extended beyond a branch
     • Number and types of functional units
       – Structural hazards can increase with dynamic scheduling
     • Presence of anti- and output dependences
       – Lead to WAR and WAW stalls

     Tomasulo Approach
     • Another approach to eliminate stalls
       – Combines scoreboarding with register renaming (to avoid WAR and WAW)
     • Designed for the IBM 360/91
       – High FP performance for the whole 360 family
       – Four double precision FP registers
       – Long memory access and long FP delays
     • Can support overlapped execution of multiple iterations of a loop

     Tomasulo Approach
     [Figure not recoverable from the extracted text.]

     Stages
     • Issue
       – Empty reservation station or buffer
       – Send operands to the reservation station
       – Use name of reservation station for operands
     • Execute
       – Execute operation if operands are available
       – Monitor CDB for availability of operands
     • Write result
       – When result is available, write it to the CDB
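The three stages above can be sketched with reservation stations that hold either an operand value (Vj, Vk) or the name of the station that will produce it (Qj, Qk). This is a minimal illustration under assumptions: timing, load/store buffers and memory are not modelled, and the names `Station`, `Tomasulo`, `issue`, `ready` and `complete` as well as the register values are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

OPS = {
    "ADDD": lambda a, b: a + b,
    "SUBD": lambda a, b: a - b,
    "MULTD": lambda a, b: a * b,
    "DIVD": lambda a, b: a / b,
}


@dataclass
class Station:
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # producing stations, while pending
    qk: Optional[str] = None


class Tomasulo:
    def __init__(self, station_names, regs):
        self.rs = {name: Station() for name in station_names}
        self.regs = dict(regs)
        self.qi = {}  # register -> station that will write it

    def issue(self, tag, op, dest, src_j, src_k):
        if self.rs[tag].busy:
            return False  # no empty reservation station
        st = Station(busy=True, op=op)
        if src_j in self.qi:
            st.qj = self.qi[src_j]      # use the station name as operand
        else:
            st.vj = self.regs[src_j]    # copy the value now (renaming)
        if src_k in self.qi:
            st.qk = self.qi[src_k]
        else:
            st.vk = self.regs[src_k]
        self.rs[tag] = st
        self.qi[dest] = tag
        return True

    def ready(self, tag):
        st = self.rs[tag]
        return st.busy and st.qj is None and st.qk is None

    def complete(self, tag):
        """Execute the operation and broadcast the result on the CDB."""
        st = self.rs[tag]
        value = OPS[st.op](st.vj, st.vk)
        for other in self.rs.values():
            if other.qj == tag:
                other.vj, other.qj = value, None
            if other.qk == tag:
                other.vk, other.qk = value, None
        for reg, producer in list(self.qi.items()):
            if producer == tag:
                self.regs[reg] = value
                del self.qi[reg]
        self.rs[tag] = Station()
        return value


m = Tomasulo(["Div1", "Add1", "Add2"],
             {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F6": 0.0,
              "F8": 1.0, "F10": 5.0, "F14": 3.0})
m.issue("Div1", "DIVD", "F0", "F2", "F4")
m.issue("Add1", "ADDD", "F6", "F0", "F8")    # reads F8 (WAR with SUBD)
m.issue("Add2", "SUBD", "F8", "F10", "F14")  # writes F8
m.complete("Add2")   # SUBD finishes first and overwrites F8
m.complete("Div1")
m.complete("Add1")   # still uses the old F8 captured at issue
print(m.regs["F6"], m.regs["F8"])  # 5.0 2.0
```

Because ADDD copied the old value of F8 into its reservation station at issue, SUBD can write F8 early without causing a WAR hazard. Overwriting `qi[dest]` at issue plays the same role for WAW hazards: only the most recent writer updates the register.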

  5. Example (1/2)
     [Figure not recoverable from the extracted text.]

     Example (2/2)
     [Figure not recoverable from the extracted text.]

     Loop Iterations

     Loop: LD    F0, 0(R1)
           MULTD F4, F0, F2
           SD    0(R1), F4
           SUBI  R1, R1, #8
           BNEZ  R1, Loop

     Tomasulo’s Algorithm
     An enhanced and detailed design is given in Fig. 2.12 of the textbook.

     Dynamic Hardware Prediction
     • Importance of control dependences
       – Branches and jumps are frequent
       – Limiting factor as ILP increases (Amdahl’s law)
     • Schemes to attack control dependences
       – Static
         • Basic (stall the pipeline)
         • Predict-not-taken and predict-taken
         • Delayed branch and canceling branch
       – Dynamic predictors
     • Effectiveness of dynamic prediction schemes
       – Accuracy
       – Cost

     Basic Branch Prediction Buffers
     • a.k.a. Branch History Table (BHT): small direct-mapped cache of T/NT bits
     [Figure: the PC of the branch instruction in IR indexes the BHT. An entry of T (predict taken) selects the branch target; NT (predict not-taken) selects PC + 4.]
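A branch history table of this kind can be sketched in a few lines. The example below assumes a 1-bit entry per slot, 4-byte instructions (so the low two PC bits are dropped before indexing), and entries initialised to not-taken; the table size, the class name `BranchHistoryTable` and the helper `count_mispredictions` are illustrative choices, not taken from the slides.

```python
class BranchHistoryTable:
    """Direct-mapped table of 1-bit taken/not-taken predictions, no tags."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.bits = [False] * entries  # False = NT, True = T

    def _index(self, pc):
        # Drop the byte offset, then keep the low-order bits of the PC.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.bits[self._index(pc)]

    def update(self, pc, taken):
        # A 1-bit predictor simply remembers the last outcome.
        self.bits[self._index(pc)] = taken


def count_mispredictions(bht, pc, outcomes):
    wrong = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            wrong += 1
        bht.update(pc, taken)
    return wrong


# A loop branch taken 9 times and then not taken, executed twice.
outcomes = ([True] * 9 + [False]) * 2
bht = BranchHistoryTable(entries=16)
print(count_mispredictions(bht, 0x40, outcomes))  # 4
```

With one bit of history the loop branch is mispredicted on the first and last iteration of each execution of the loop, giving 4 mispredictions out of 20 here. Because the table has no tags, two branches whose PCs map to the same index share an entry and can disturb each other's prediction.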
