Forwarding idea



  2. forwarding idea: an instruction reads the wrong value (e.g. from a register); the correct value is already computed elsewhere in the pipeline, maybe even after the old value was read; substitute the correct value for the wrong one using a MUX

  3. quiz question: forwarding in IRMOVQ

          cycle #            0 1 2 3 4 5 6 7 8
          irmovq $50, %r8    F D E M W
          addq %r11, %r8       F D E M W

     options: output of decode/execute regs (irmovq) (unchanged during execute stage); input of execute/memory regs (irmovq); input of decode/execute regs (addq)

  5. forwarding logic [figure: datapath split by fetch/decode, decode/execute, and execute/writeback pipeline registers, showing PC, instruction memory, register file (srcA, srcB, dstE, dstM; R[srcA], R[srcB]; next R[dstE], next R[dstM]), adders, and forwarding MUXes]

  6. some forwarding paths

          cycle #                 0 1 2 3 4 5 6 7 8
          addq %r8, %r9           F D E M W
          subq %r9, %r11            F D E M W
          mrmovq 4(%r11), %r10        F D E M W
          rmmovq %r9, 8(%r11)           F D E M W
          xorq %r10, %r9                  F D E M W

  7. forwarding in HCL

          register dE {
              valA : 64 = 0;
              dstE : 4 = 0;
          };
          ...
          /* was: d_valA = reg_outputA; */
          d_valA = [
              reg_srcA == e_dstE : e_valE;
              ...
              1 : reg_outputA;
          ];
          d_dstE = ...;

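  10. unsolved problem: a value loaded by mrmovq and used by the next instruction still needs a stall

          cycle #                 0 1 2 3 4 5 6 7 8
          mrmovq 0(%rax), %rbx    F D E M W
          subq %rbx, %rcx           F D E M W      (no stall: value not ready)
          subq %rbx, %rcx           F F D E M W    (stall)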

  12. multiple forwarding paths

          cycle #            0 1 2 3 4 5 6 7 8
          addq %r10, %r8     F D E M W
          addq %r11, %r8       F D E M W
          addq %r12, %r8         F D E M W

  14. multiple forwarding HCL

          d_valA = [
              ...
              reg_srcA == e_dstE : e_valE;
              reg_srcA == m_dstE : m_valE;
              ...
              1 : reg_outputA;
          ];
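The priority order of this case expression can be sketched in Python. This is a minimal model rather than the course's HCL: `select_val_a` and its parameter names are hypothetical stand-ins for the signals on the slide, and the `REG_NONE` guard is an assumption about what the elided cases handle.

```python
REG_NONE = 0xF  # Y86-64 encoding for "no register"


def select_val_a(reg_src_a, reg_output_a, e_dst_e, e_val_e, m_dst_e, m_val_e):
    """Model the d_valA MUX: the first matching case wins, as in an HCL case expression."""
    if reg_src_a != REG_NONE and reg_src_a == e_dst_e:
        return e_val_e       # newest in-flight result (instruction in execute)
    if reg_src_a != REG_NONE and reg_src_a == m_dst_e:
        return m_val_e       # older in-flight result (instruction in memory)
    return reg_output_a      # no in-flight writer: the register file value is current


# Two earlier instructions both write register 8; the younger one (execute) must win.
newest = select_val_a(8, 1, e_dst_e=8, e_val_e=70, m_dst_e=8, m_val_e=60)
```

Checking the execute stage before the memory stage matters because the instruction in execute is the more recent writer.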

  15. multiple forwarding paths (2)

          cycle #            0 1 2 3 4 5 6 7 8
          addq %r10, %r8     F D E M W
          addq %r11, %r12      F D E M W
          addq %r12, %r8         F D E M W

  18. after forwarding/prediction, where do we still need to stall? memory output needed in fetch: ret followed by anything; memory output needed in execute: mrmovq or popq + use (in the immediately following instruction)
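These two remaining cases can be sketched as a small Python check. The tuple layout `(opcode, registers_read, register_loaded_from_memory)` and the function name `stall_cycles` are hypothetical; the penalties used (1 cycle for load/use, 3 cycles for ret) are the ones this deck gives for its five-stage pipeline.

```python
LOADS = {"mrmovq", "popq"}  # instructions whose result comes out of the memory stage


def stall_cycles(program):
    """Count stall cycles left after forwarding; program is a list of
    (opcode, registers_read, register_loaded_from_memory) tuples."""
    stalls = 0
    for i, (op, reads, _loaded) in enumerate(program):
        if op == "ret":
            stalls += 3  # return address comes from memory and is needed in fetch
        if i > 0:
            prev_op, _prev_reads, prev_loaded = program[i - 1]
            if prev_op in LOADS and prev_loaded in reads:
                stalls += 1  # load/use: the loaded value arrives one cycle too late
    return stalls


program = [
    ("mrmovq", ["%rax"], "%rbx"),
    ("subq", ["%rbx", "%rcx"], None),  # uses %rbx right after the load
    ("ret", ["%rsp"], None),
]
total = stall_cycles(program)
```

Only the register loaded from memory is tracked for loads; for popq the updated %rsp is an ALU result and can be forwarded like any other.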

  19. overall CPU: 5-stage pipeline; 1 instruction completes every cycle, except for hazards. most data hazards: solved by forwarding; load/use hazard: 1 cycle of stalling; jXX control hazard: branch prediction + squashing, 2-cycle penalty for misprediction; ret control hazard: 3 cycles of stalling
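As a rough sketch of how these penalties add up, the cycle count for a program can be estimated in Python. The function name and the example counts are hypothetical; the per-hazard penalties are the ones listed above, and the extra `stages - 1` cycles account for the first instruction filling the pipeline.

```python
def total_cycles(instructions, load_use=0, mispredicted=0, rets=0, stages=5):
    """Estimate cycles: one instruction completes per cycle, plus hazard penalties."""
    fill = stages - 1  # cycles before the first instruction completes
    return instructions + fill + 1 * load_use + 2 * mispredicted + 3 * rets


# Hypothetical program: 100 instructions, 10 load/use hazards,
# 5 mispredicted jumps, 2 rets.
cycles = total_cycles(100, load_use=10, mispredicted=5, rets=2)
cpi = cycles / 100
```

With these made-up counts the estimate is 130 cycles, or 1.3 cycles per instruction.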

  20. pipelined control costs: how much faster than a single-cycle processor? at most five times faster. depends on hardware details: does added logic make the clock cycle slower? depends on what programs we run: how many mispredicted jumps? how many rets? how many load/use hazards?

  21. hazards versus dependencies. dependency: X needs result of instruction Y? hazard: will it not work in some pipeline? (before extra work is done to “resolve” hazards, like forwarding or stalling or branch prediction)
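Dependencies are a property of the program alone, so they can be found without modeling any pipeline. A minimal Python sketch, assuming each instruction is given as a hypothetical `(text, registers_read, registers_written)` tuple:

```python
def raw_dependencies(program):
    """Return (producer_index, consumer_index, register) for each read of a
    register whose most recent writer is an earlier instruction."""
    last_writer = {}
    deps = []
    for i, (_text, reads, writes) in enumerate(program):
        for reg in reads:
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        for reg in writes:
            last_writer[reg] = i
    return deps


program = [
    ("addq %r10, %r8", ["%r10", "%r8"], ["%r8"]),
    ("addq %r11, %r8", ["%r11", "%r8"], ["%r8"]),
    ("addq %r12, %r8", ["%r12", "%r8"], ["%r8"]),
]
deps = raw_dependencies(program)
```

Whether each dependency is also a hazard depends on the pipeline: how far apart the two instructions are and when the producer's result becomes available.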

  22. ex.: dependencies and hazards (1): where are dependencies? which are hazards in our pipeline? which are resolved with forwarding?

          addq %rax, %rbx
          subq %rax, %rcx
          irmovq $100, %rcx
          addq %rcx, %r10
          addq %rbx, %r10

  26. ex.: dependencies and hazards (2): where are dependencies? which are hazards in our pipeline? which are resolved with forwarding?

          mrmovq 0(%rax), %rbx
          addq %rbx, %rcx
          jne foo
          foo: addq %rcx, %rdx
          mrmovq (%rdx), %rcx

  27. pipeline with different hazards. example: 4-stage pipeline: fetch/decode/execute+memory/writeback

                             // 4 stage   // 5 stage
          addq %rax, %r8     //           // W
          subq %rax, %r9     // W         // M
          xorq %rax, %r10    // EM        // E
          andq %r8, %r11     // D         // D

      addq/andq is a hazard with the 5-stage pipeline; addq/andq is not a hazard with the 4-stage pipeline
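The same comparison can be written as a rule of thumb in Python. This is a sketch under stated assumptions: no forwarding, registers are read in decode and written at the end of writeback, and a read in the same cycle as the write sees the old value. The function name and the stage-list representation are hypothetical.

```python
def is_hazard_without_forwarding(stages, distance):
    """True if a consumer `distance` instructions after its producer would read
    a stale register value when decode reads before writeback has finished."""
    decode = stages.index("D")
    writeback = stages.index("W")
    # The consumer decodes `distance` cycles after the producer decodes; it sees
    # the new value only if that is strictly after the producer's writeback cycle.
    return distance <= writeback - decode


five_stage = ["F", "D", "E", "M", "W"]
four_stage = ["F", "D", "EM", "W"]

# addq %rax, %r8 and andq %r8, %r11 are three instructions apart.
hazard_5 = is_hazard_without_forwarding(five_stage, 3)
hazard_4 = is_hazard_without_forwarding(four_stage, 3)
```

The dependency is the same in both cases; only the pipeline differs, which is why it is a hazard in one and not in the other.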

  29. exercise: different pipeline. split execute into two stages: F/D/E1/E2/M/W

          cycle #            0  1  2  3  4  5  6  7  8
          addq %rcx, %r9     F  D  E1 E2 M  W
          addq %r9, %rbx        F  D  E1 E2 M  W        (without stalling)
          addq %r9, %rbx        F  D  D  E1 E2 M  W     (with stalling)
          addq %rax, %r9           F  D  E1 E2 M  W     (without stalling)
          addq %rax, %r9           F  F  D  E1 E2 M  W  (with stalling)

  33. exercise: forwarding paths [pipeline diagram, cycles 0 to 8, each instruction passing through F D E M W; instructions shown: addq %rcx, %r9; rmmovq %r9, 8(%r8); mrmovq 8(%r9), %r11; pushq %r11; popq %r10]

  36. exercise: forwarding paths (alt pipe): suppose a four-stage pipeline: fetch/decode+execute/memory/writeback [pipeline diagram, cycles 0 to 8, each instruction passing through F DE M W; instructions shown: addq %rcx, %r9; rmmovq %r9, 8(%r8); mrmovq 8(%r9), %r11; pushq %r11; popq %r10]

  39. pipelined control costs: how much faster than a single-cycle processor? at most five times faster. depends on HW details: how expensive is forwarding logic? (new MUXes on critical path) how well balanced are the stages? depends on what programs we run: how many mispredicted jumps? how many rets? how many load/use hazards?
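A back-of-the-envelope model makes the "at most five times" bound concrete. This Python sketch assumes a simple timing model that is not stated in the deck: the single-cycle clock period is the sum of the stage delays, the pipelined period is the slowest stage plus a fixed pipeline-register overhead, and hazards are folded into an average cycles-per-instruction figure. All names and numbers are hypothetical.

```python
def speedup(stage_delays_ps, register_overhead_ps, cpi):
    """Speedup of a pipelined design over a single-cycle design, per instruction."""
    single_cycle_period = sum(stage_delays_ps)  # all stages in one long cycle
    pipelined_period = max(stage_delays_ps) + register_overhead_ps
    return single_cycle_period / (pipelined_period * cpi)


# Ideal case: perfectly balanced stages, free pipeline registers, no hazards.
ideal = speedup([100, 100, 100, 100, 100], register_overhead_ps=0, cpi=1.0)

# Less ideal: one slow stage, some register overhead, CPI above 1 from hazards.
realistic = speedup([100, 100, 200, 100, 100], register_overhead_ps=20, cpi=1.3)
```

Unbalanced stages, extra logic on the critical path, and hazard penalties each pull the result below the ideal factor of five.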

  40. HCL2D pipeline registers

          /* Fetch+PC Update */
          register xF {
              pc : 64 = 0;
          };
          /* Decode */
          register fD {
              rA : 4 = REG_NONE;
              rB : 4 = REG_NONE;
          };
          /* Execute */
          register dE {
              valA : 64 = 0;
              valB : 64 = 0;
              dstE : 4 = REG_NONE;
          }
          /* Writeback */
          register eW {
              valE : 64 = 0;
              dstE : 4 = REG_NONE;
          }

  41. HCL2D: Fetch/Decode

      unpipelined:

          /* Fetch+PC Update */
          pc = F_pc;
          x_pc = pc + 2;
          rA = i10bytes[12..16];
          rB = i10bytes[8..12];
          /* Decode */
          reg_srcA = rA;
          reg_srcB = rB;
          dstE = rB;
          valA = reg_outputA;
          valB = reg_outputB;

      pipelined:

          /* Fetch+PC Update */
          pc = F_pc;
          x_pc = pc + 2;
          f_rA = i10bytes[12..16];
          f_rB = i10bytes[8..12];
          /* Decode */
          reg_srcA = D_rA;
          reg_srcB = D_rB;
          dstE = D_rB;
          d_valA = reg_outputA;
          d_valB = reg_outputB;
