Modern processor design
Hung-Wei Tseng
Outline
Achieving CPI < 1
Improving instruction level parallelism
SuperScalar
Dynamic scheduling/Out-of-order execution
Simultaneous multithreading

Instruction level parallelism
Static instructions:
LOOP: lw   $t1, 0($a0)
      add  $v0, $v0, $t1
      addi $a0, $a0, 4
      bne  $a0, $t0, LOOP
      lw   $t0, 0($sp)
      lw   $t1, 4($sp)

Dynamic instructions:
      lw   $t1, 0($a0)
      add  $v0, $v0, $t1
      addi $a0, $a0, 4
      bne  $a0, $t0, LOOP
      lw   $t1, 0($a0)
      add  $v0, $v0, $t1
      addi $a0, $a0, 4
      bne  $a0, $t0, LOOP
      . . .
If the current value of $a0 is 0x10000000 and $t0 is 0x10001000, what are the dynamic instructions that the processor will execute?
lw   $t1, 0($a0)
add  $v0, $v0, $t1
addi $a0, $a0, 4
bne  $a0, $t0, LOOP
lw   $t1, 0($a0)
add  $v0, $v0, $t1
addi $a0, $a0, 4
bne  $a0, $t0, LOOP
7 cycles per loop on average (if there are many iterations)
[Pipeline diagram: the two loop iterations above flowing through the IF, ID, EXE, MEM, and WB stages]
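The question above can be checked with a small sketch that replays the loop's control flow. It assumes only what the slide states (the four-instruction loop body, `$a0` advancing by 4, and `bne` repeating while `$a0 != $t0`). The names `dynamic_trace` and `max_iterations` are hypothetical helpers, not part of the lecture.

```python
LOOP_BODY = [
    "lw   $t1, 0($a0)",
    "add  $v0, $v0, $t1",
    "addi $a0, $a0, 4",
    "bne  $a0, $t0, LOOP",
]

def dynamic_trace(a0, t0, max_iterations=1_000_000):
    """Return the dynamic instructions executed by the loop.

    Only $a0 matters for control flow: addi advances it by 4 and
    bne repeats the loop while $a0 != $t0.
    """
    trace = []
    for _ in range(max_iterations):
        trace.extend(LOOP_BODY)
        a0 += 4              # effect of: addi $a0, $a0, 4
        if a0 == t0:         # bne falls through when $a0 == $t0
            return trace
    raise RuntimeError("loop did not terminate within max_iterations")

trace = dynamic_trace(0x10000000, 0x10001000)
iterations = len(trace) // len(LOOP_BODY)
print(iterations, len(trace))   # 1024 4096
```

Under these assumptions the four static loop instructions expand to 4096 dynamic instructions, so at 7 cycles per loop the loop takes roughly 7 × 1024 = 7168 cycles.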
short as possible
instruction level parallelism (ILP)
multiple instructions at the same cycle
we still can only achieve CPI = 1 in the best case.
stage
add  $t1, $a0, $a1
addi $a1, $a1, -1
add  $t2, $a0, $t1
bne  $a1, $zero, LOOP
add  $t1, $a0, $a1
addi $a1, $a1, -1
add  $t2, $a0, $t1
bne  $a1, $zero, LOOP
4 cycles per loop; 2 cycles per loop with perfect branch prediction. For comparison, the pipeline without superscalar issue takes 6 cycles per loop.
[Pipeline diagram: the two loop iterations above, with two instructions in each of the IF, ID, EXE, MEM, and WB stages per cycle]
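The cycle counts quoted for this loop translate directly into CPI, which is how the "CPI < 1" goal from the outline can be checked. The sketch below is plain arithmetic on the slide's numbers (4 instructions per loop; 4, 2, and 6 cycles per loop). The function name `cpi` is a hypothetical helper.

```python
def cpi(cycles_per_loop, instructions_per_loop=4):
    """Cycles per instruction for one loop iteration."""
    return cycles_per_loop / instructions_per_loop

# Cycle counts as quoted on the slide for the 4-instruction loop.
cases = [
    ("4 cycles per loop", 4),
    ("2 cycles per loop (perfect prediction)", 2),
    ("6 cycles per loop (pipeline)", 6),
]
for label, cycles in cases:
    print(f"{label}: CPI = {cpi(cycles):.2f}")
```

Only the 2-cycle case drops below CPI = 1, which requires completing more than one instruction per cycle.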
lw   $t1, 0($a0)
add  $v0, $v0, $t1
addi $a0, $a0, 4
bne  $a0, $t0, LOOP
lw   $t1, 0($a0)
add  $v0, $v0, $t1
addi $a0, $a0, 4
bne  $a0, $t0, LOOP
7 cycles per loop in the worst case, 4 cycles if the branch predictor predicts perfectly. Not very impressive...
[Pipeline diagram: the two loop iterations above flowing through the IF, ID, EXE, MEM, and WB stages]
lw   $t1, 0($a0)
addi $a0, $a0, 4
add  $v0, $v0, $t1
bne  $a0, $t0, LOOP
lw   $t1, 0($a0)
addi $a0, $a0, 4
add  $v0, $v0, $t1
bne  $a0, $t0, LOOP
instruction sequence
5 cycles per loop in the worst case, 2 cycles if branch prediction is perfect
[Pipeline diagram: the reordered loop with two instructions in each of the IF, ID, EXE, MEM, and WB stages per cycle]
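Why the reordered sequence issues better can be illustrated with a rough sketch of in-order, 2-wide issue grouping. It is a simplification under stated assumptions: an instruction joins the current issue group only if the group has room and it has no read-after-write or write-after-write conflict with an instruction already in that group. Load-use latency and branch penalties are ignored, so the result counts issue groups, not the slide's cycle counts. `Instr`, `conflicts`, and `issue_groups` are hypothetical names.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    text: str
    dest: Optional[str]   # register written, if any
    srcs: tuple           # registers read

ORIGINAL = [
    Instr("lw $t1, 0($a0)", "$t1", ("$a0",)),
    Instr("add $v0, $v0, $t1", "$v0", ("$v0", "$t1")),
    Instr("addi $a0, $a0, 4", "$a0", ("$a0",)),
    Instr("bne $a0, $t0, LOOP", None, ("$a0", "$t0")),
]
# Same instructions with addi hoisted above add, as on the slide.
REORDERED = [ORIGINAL[0], ORIGINAL[2], ORIGINAL[1], ORIGINAL[3]]

def conflicts(later, earlier):
    """True if `later` cannot issue in the same group as `earlier`."""
    if earlier.dest is None:
        return False
    raw = earlier.dest in later.srcs      # later reads what earlier writes
    waw = earlier.dest == later.dest      # both write the same register
    return raw or waw

def issue_groups(instrs, width=2):
    """Greedily pack instructions, in program order, into issue groups."""
    groups = []
    for ins in instrs:
        current = groups[-1] if groups else None
        if (current is not None and len(current) < width
                and not any(conflicts(ins, e) for e in current)):
            current.append(ins)
        else:
            groups.append([ins])
    return groups

print(len(issue_groups(ORIGINAL)))    # 3
print(len(issue_groups(REORDERED)))   # 2
```

In this simplified model the original order needs three issue groups because `add` must wait for `lw`, while the reordered sequence pairs `lw` with `addi` and `add` with `bne`.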
that can be executed concurrently
contain n instructions.
word
instruction depends on the other.
parallel or out-of-order
1: lw   $t1, 0($a0)
2: add  $v0, $v0, $t1
3: addi $a0, $a0, 4
4: bne  $a0, $t0, LOOP
5: lw   $t1, 0($a0)
6: add  $v0, $v0, $t1
7: addi $a0, $a0, 4
8: bne  $a0, $t0, LOOP
[Data dependency graph over instructions 1-8]
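The arrows of such a data dependency graph can be derived mechanically. The following sketch computes the true (read-after-write) dependencies for the eight numbered instructions by tracking the most recent writer of each register. The tuple encoding of the instructions and the name `raw_edges` are assumptions made for illustration.

```python
# (number, text, destination register or None, source registers)
TRACE = [
    (1, "lw $t1, 0($a0)", "$t1", ["$a0"]),
    (2, "add $v0, $v0, $t1", "$v0", ["$v0", "$t1"]),
    (3, "addi $a0, $a0, 4", "$a0", ["$a0"]),
    (4, "bne $a0, $t0, LOOP", None, ["$a0", "$t0"]),
    (5, "lw $t1, 0($a0)", "$t1", ["$a0"]),
    (6, "add $v0, $v0, $t1", "$v0", ["$v0", "$t1"]),
    (7, "addi $a0, $a0, 4", "$a0", ["$a0"]),
    (8, "bne $a0, $t0, LOOP", None, ["$a0", "$t0"]),
]

def raw_edges(trace):
    """Return (producer, consumer) pairs for read-after-write dependencies."""
    last_writer = {}   # register -> number of the latest instruction writing it
    edges = set()
    for num, _text, dest, srcs in trace:
        for reg in srcs:
            if reg in last_writer:
                edges.add((last_writer[reg], num))
        if dest is not None:
            last_writer[dest] = num
    return edges

print(sorted(raw_edges(TRACE)))
# [(1, 2), (2, 6), (3, 4), (3, 5), (3, 7), (5, 6), (7, 8)]
```

Instructions with no path between them in this graph, such as 1 and 3, are candidates for parallel or out-of-order execution.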
dependencies due to limited number of registers
Static instructions:
LOOP: lw   $t1, 0($a0)
      add  $v0, $v0, $t1
      addi $a0, $a0, 4
      bne  $a0, $t0, LOOP
      lw   $t0, 0($sp)
      lw   $t1, 4($sp)

Dynamic instructions:
1: lw   $t1, 0($a0)
2: add  $v0, $v0, $t1
3: addi $a0, $a0, 4
4: bne  $a0, $t0, LOOP
5: lw   $t1, 0($a0)
6: add  $v0, $v0, $t1
7: addi $a0, $a0, 4
8: bne  $a0, $t0, LOOP
[Data dependency graph over instructions 1-8]
don’t have an arrow in data dependency graph
source of an earlier one
1: lw   $t1, 0($a0)
2: add  $v0, $v0, $t1
3: addi $a0, $a0, 4
4: bne  $a0, $t0, LOOP
5: lw   $t1, 0($a0)
6: add  $v0, $v0, $t1
7: addi $a0, $a0, 4
8: bne  $a0, $t0, LOOP

[Data dependency graph over instructions 1-8]
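False dependencies arise from register reuse rather than from data flow. As a hedged illustration, the sketch below lists write-after-read (WAR) pairs, where a later instruction overwrites a register that an earlier one reads, and write-after-write (WAW) pairs, where two instructions write the same register. It uses a simple pairwise definition, so some pairs (for example 2 and 6) are also linked by a true dependency. The tuple encoding and the name `false_dependencies` are assumptions.

```python
# (number, destination register or None, source registers)
TRACE = [
    (1, "$t1", ["$a0"]),          # lw   $t1, 0($a0)
    (2, "$v0", ["$v0", "$t1"]),   # add  $v0, $v0, $t1
    (3, "$a0", ["$a0"]),          # addi $a0, $a0, 4
    (4, None, ["$a0", "$t0"]),    # bne  $a0, $t0, LOOP
    (5, "$t1", ["$a0"]),          # lw   $t1, 0($a0)
    (6, "$v0", ["$v0", "$t1"]),   # add  $v0, $v0, $t1
    (7, "$a0", ["$a0"]),          # addi $a0, $a0, 4
    (8, None, ["$a0", "$t0"]),    # bne  $a0, $t0, LOOP
]

def false_dependencies(trace):
    """Return (WAR pairs, WAW pairs) as sets of (earlier, later) numbers."""
    war, waw = set(), set()
    for j, (num_j, dest_j, _srcs_j) in enumerate(trace):
        if dest_j is None:
            continue
        for num_i, dest_i, srcs_i in trace[:j]:
            if dest_j in srcs_i:      # later write to a register read earlier
                war.add((num_i, num_j))
            if dest_j == dest_i:      # both write the same register
                waw.add((num_i, num_j))
    return war, waw

war, waw = false_dependencies(TRACE)
print(sorted(waw))   # [(1, 5), (2, 6), (3, 7)]
```

Pairs such as (1, 5) exist only because both iterations reuse `$t1`; giving each write its own register would remove them.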
instructions into “instruction window”
dependencies of instructions
ready
extract
branches for every 4-5 instructions.
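A minimal sketch of scheduling from an instruction window, under loudly stated assumptions: all eight dynamic instructions from the earlier example are already in the window, only true dependencies constrain issue (false dependencies are assumed to be removed), every instruction has unit latency, and branches are predicted perfectly. The dependency table `DEPS` and the function `schedule` are hypothetical names, not the lecture's design.

```python
# True dependencies of the 8 dynamic instructions: consumer -> producers.
DEPS = {1: [], 2: [1], 3: [], 4: [3], 5: [3], 6: [2, 5], 7: [3], 8: [7]}

def schedule(deps, width=None):
    """Issue ready instructions each cycle, oldest first.

    An instruction is ready once all of its producers issued in an
    earlier cycle. `width` caps instructions per cycle (None = no cap).
    Returns a dict mapping instruction number to its issue cycle.
    """
    issue_cycle = {}
    pending = sorted(deps)
    cycle = 0
    while pending:
        cycle += 1
        ready = [n for n in pending
                 if all(p in issue_cycle and issue_cycle[p] < cycle
                        for p in deps[n])]
        issued = ready if width is None else ready[:width]
        if not issued:
            raise ValueError("dependency cycle: nothing can issue")
        for n in issued:
            issue_cycle[n] = cycle
        pending = [n for n in pending if n not in issue_cycle]
    return issue_cycle

print(max(schedule(DEPS).values()))            # 3
print(max(schedule(DEPS, width=2).values()))   # 4
```

With no issue limit the eight instructions finish issuing in three cycles in this model; limiting issue to two per cycle stretches that to four.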
each new output in a different register
“architectural” registers
visible to compilers and programmers
execution
1: lw   $t1, 0($a0)
2: add  $v0, $v0, $t1
3: addi $a0, $a0, 4
4: bne  $a0, $t0, LOOP
5: lw   $t1, 0($a0)
6: add  $v0, $v0, $t1
7: addi $a0, $a0, 4
8: bne  $a0, $t0, LOOP
Register map (architectural register -> physical register) after each instruction:

         $a0   $t0   $t1   $v0
initial  p1    p2    p3    p4
1        p1    p2    p5    p4
2        p1    p2    p5    p6
3        p7    p2    p5    p6
4        p7    p2    p5    p6
5        p7    p2    p8    p6
6        p7    p2    p8    p9
7        p10   p2    p8    p9
8        p10   p2    p8    p9

Renamed instructions:
1: lw   $p5,  0($p1)
2: add  $p6,  $p4, $p5
3: addi $p7,  $p1, 4
4: bne  $p7,  $p2, LOOP
5: lw   $p8,  0($p7)
6: add  $p9,  $p6, $p8
7: addi $p10, $p7, 4
8: bne  $p10, $p2, LOOP
[Two data dependency graphs over instructions 1-8]
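The renaming example above can be reproduced with a short sketch. It assumes the initial map shown on the slide (`$a0`→p1, `$t0`→p2, `$t1`→p3, `$v0`→p4) and that free physical registers are handed out in order starting at p5. The function `rename` and its parameter names are hypothetical; a real renamer also has to recycle physical registers, which is omitted here.

```python
# (opcode, destination architectural register or None, source registers)
TRACE = [
    ("lw", "$t1", ["$a0"]),
    ("add", "$v0", ["$v0", "$t1"]),
    ("addi", "$a0", ["$a0"]),
    ("bne", None, ["$a0", "$t0"]),
    ("lw", "$t1", ["$a0"]),
    ("add", "$v0", ["$v0", "$t1"]),
    ("addi", "$a0", ["$a0"]),
    ("bne", None, ["$a0", "$t0"]),
]
INITIAL_MAP = {"$a0": "p1", "$t0": "p2", "$t1": "p3", "$v0": "p4"}

def rename(trace, initial_map, first_free=5):
    """Give every destination a fresh physical register.

    Returns the renamed instructions and the register map after each one.
    """
    reg_map = dict(initial_map)
    next_phys = first_free
    renamed, history = [], []
    for op, dest, srcs in trace:
        # Sources are looked up before the destination mapping changes.
        new_srcs = [reg_map[s] for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = f"p{next_phys}"
            next_phys += 1
            reg_map[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
        history.append(dict(reg_map))
    return renamed, history

renamed, history = rename(TRACE, INITIAL_MAP)
for number, (op, dest, srcs) in enumerate(renamed, start=1):
    print(number, op, dest, srcs)
```

Because every write now targets a distinct physical register, the renamed sequence has no WAW or WAR conflicts and only the true dependencies remain.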
instructions with the help of branch prediction
processor know if we need to execute or not
depending physical registers are generated)
the instruction is going to be executed or not.
number
previous instructions finish.
later reorder buffer indexes and clear the occupied physical registers
instruction window or the register map.
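The reorder buffer behavior hinted at above can be sketched as follows. This is a minimal illustration under assumptions, not the lecture's exact design: entries are allocated in program order, may finish in any order, commit only from the oldest entry once it is done, and are squashed when they are younger than a mispredicted branch. Releasing the squashed instructions' physical registers is not modeled. The class `ReorderBuffer` and its method names are hypothetical.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # program order, oldest entry on the left

    def allocate(self, num):
        """Add an instruction (by program-order number) at the tail."""
        self.entries.append({"num": num, "done": False})

    def mark_done(self, num):
        """Record that an instruction finished executing (any order)."""
        for entry in self.entries:
            if entry["num"] == num:
                entry["done"] = True

    def commit(self):
        """Retire finished instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            committed.append(self.entries.popleft()["num"])
        return committed

    def squash_after(self, num):
        """Drop every entry younger than instruction `num`."""
        squashed = []
        while self.entries and self.entries[-1]["num"] > num:
            squashed.append(self.entries.pop()["num"])
        return squashed[::-1]

rob = ReorderBuffer()
for n in range(1, 9):
    rob.allocate(n)
rob.mark_done(3)
rob.mark_done(1)
print(rob.commit())          # [1]  (3 is done but must wait for 2)
rob.mark_done(2)
print(rob.commit())          # [2, 3]
print(rob.squash_after(4))   # [5, 6, 7, 8]
```

In the usage above, instruction 3 finishes early but commits only after instruction 2, and a misprediction at branch 4 discards instructions 5 to 8 without touching committed state.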
[Pipeline block diagram: Instruction Fetch -> Instruction Decode -> Register renaming logic -> Schedule -> Execution Units -> Data Cache -> Reorder Buffer/Commit]
issue pipeline
instructions into/from instruction window each cycle
1: lw   $p5,  0($p1)
2: add  $p6,  $p4, $p5
3: addi $p7,  $p1, 4
4: bne  $p7,  $p2, LOOP
5: lw   $p8,  0($p7)
6: add  $p9,  $p6, $p8
7: addi $p10, $p7, 4
8: bne  $p10, $p2, LOOP
[Pipeline diagram: renamed instructions 1-8 flowing through the IF, ID, Ren (rename), Sch (schedule), EXE, MEM, and C (commit) stages]
Components: Instruction Cache, Branch predictor, Instruction prefetcher, Register renaming logic, Issue queue, Register file, Execute units, Data cache
Pipeline stages: fetch, slot, rename, issue, register read, execute, memory