Advanced Topics on Heterogeneous System Architectures
Politecnico di Milano, Seminar Room @ DEIB, 30 November 2017
Antonio R. Miele, Marco D. Santambrogio (Politecnico di Milano)
Outline: Processors and Instruction Sets; Review of pipelining
Pipelining
RISC architectures are based on the concept of executing only simple instructions in a reduced basic cycle, to improve on the performance of CISC CPUs.
ALU operands come from the CPU general-purpose registers; they cannot come directly from memory. Dedicated instructions are necessary to:
– load data from memory to registers
– store data from registers to memory
Pipelining: a performance optimization technique based on overlapping the execution of multiple instructions derived from a sequential execution flow.
– no indirection
– Defines set of operations, instruction format, hardware supported data types, named storage, addressing modes, sequencing
architected registers and memory
– Architected storage mapped to actual storage
– Function units to do all the required operations
– Possible additional storage (e.g. MAR, MBR, …)
– Interconnect to move information among regs and FUs
diagram (STD)
[Figure: 32-bit instruction formats]
Register-Register:   Op [31-26] | Rs1 [25-21] | Rs2 [20-16] | Rd [15-11] | Opx (low-order bits)
Register-Immediate:  Op [31-26] | Rs1 [25-21] | Rd [20-16] | immediate [15-0]
Branch:              Op [31-26] | Rs1 [25-21] | Rs2/Opx [20-16] | immediate [15-0]
Jump / Call:         Op [31-26] | target [25-0]
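The fixed field positions are what make decoding simple. As a minimal sketch, assuming the Register-Immediate layout shown in the figure (Op in bits 31-26, Rs1 in 25-21, Rd in 20-16, immediate in 15-0), the fields can be extracted with shifts and masks. The function name and the example opcode value 8 are hypothetical and chosen only for illustration.

```python
def decode_register_immediate(word):
    """Split a 32-bit Register-Immediate instruction word into its fields."""
    op = (word >> 26) & 0x3F      # bits 31-26
    rs1 = (word >> 21) & 0x1F     # bits 25-21
    rd = (word >> 16) & 0x1F      # bits 20-16
    imm = word & 0xFFFF           # bits 15-0
    if imm & 0x8000:              # sign-extend the 16-bit immediate
        imm -= 0x10000
    return {"op": op, "rs1": rs1, "rd": rd, "imm": imm}


# Hypothetical encoding: opcode 8, Rs1 = 17, Rd = 18, immediate = -4
example_word = (8 << 26) | (17 << 21) | (18 << 16) | 0xFFFC
decoded = decode_register_immediate(example_word)
```

Because every field sits at a fixed position, the register numbers can be sent to the register file while the opcode is still being decoded.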
desired functions
– Inputs are Control Points
– Outputs are signals
– Based on desired function and signals
[Figure: CPU block diagram showing the control unit, the data path (registers, ALU), the PC, IR and PSW registers, and the memory (instructions and data), connected through the data bus, address bus and control bus.]
[Figure sequence: CPU block diagram (control unit, data path, PC, IR, PSW, registers R00 to R05, ALU) and memory, connected by the data, address and control buses, animated over the steps listed below.]

Instruction memory:
    0789  load  R02,4000
    0790  load  R03,4004
    0791  add   R01,R02,R03
    0792  load  R02,4008
    0793  add   R01,R01,R02
    0794  store R01,4000

Data memory:
    4000  1492
    4004  1918
    4008  2006

Steps (initially PC = 0789):
 1. Fetch:   IR ← load R02,4000; PC ← PC + 1 = 0790
 2. Execute: read address 4000; R02 ← 1492
 3. Fetch:   IR ← load R03,4004; PC ← PC + 1 = 0791
 4. Execute: read address 4004; R03 ← 1918
 5. Fetch:   IR ← add R01,R02,R03; PC ← PC + 1 = 0792
 6. Execute: ALU add, 1492 + 1918; R01 ← 3410
 7. Fetch:   IR ← load R02,4008; PC ← PC + 1 = 0793
 8. Execute: read address 4008; R02 ← 2006
 9. Fetch:   IR ← add R01,R01,R02; PC ← PC + 1 = 0794
10. Execute: ALU add, 3410 + 2006; R01 ← 5416
11. Fetch:   IR ← store R01,4000; PC ← PC + 1 = 0795
12. Execute: write 5416 to address 4000
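The fetch/execute sequence traced in these slides can be reproduced with a few lines of code. The following is a minimal sketch, not a model of any real CPU: it assumes the three instructions used in the example (load, add, store), word-addressed instruction memory with PC incremented by 1 as in the slides, and a hypothetical tuple encoding of the instructions.

```python
def run(program, data, pc):
    """Fetch and execute instructions until PC leaves the program."""
    regs = {}
    memory = dict(data)
    while pc in program:
        op, args = program[pc]   # fetch: IR <- instruction at PC
        pc += 1                  # PC <- PC + 1
        if op == "load":         # register <- memory[address]
            reg, addr = args
            regs[reg] = memory[addr]
        elif op == "add":        # destination <- source1 + source2
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        elif op == "store":      # memory[address] <- register
            reg, addr = args
            memory[addr] = regs[reg]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs, memory, pc


# Addresses 0789..0794 from the slides, written without the leading zero.
program = {
    789: ("load", ("R02", 4000)),
    790: ("load", ("R03", 4004)),
    791: ("add", ("R01", "R02", "R03")),
    792: ("load", ("R02", 4008)),
    793: ("add", ("R01", "R01", "R02")),
    794: ("store", ("R01", 4000)),
}
data = {4000: 1492, 4004: 1918, 4008: 2006}

regs, memory, pc = run(program, data, 789)
```

After the run, address 4000 holds 5416 (1492 + 1918 + 2006) and PC is 795, matching the last step of the slide sequence.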
ALU instructions:
    add  $s1, $s2, $s3      # $s1 ← $s2 + $s3
    addi $s1, $s1, 4        # $s1 ← $s1 + 4
Load/store instructions:
    lw   $s1, offset($s2)   # $s1 ← M[$s2+offset]
    sw   $s1, offset($s2)   # M[$s2+offset] ← $s1
Branch instructions:
– Conditional branches: the branch is taken only if the condition is true. Examples: beq and bne
    beq  $s1, $s2, L1       # go to L1 if ($s1 == $s2)
    bne  $s1, $s2, L1       # go to L1 if ($s1 != $s2)
– Unconditional jumps: the branch is always taken. Examples: j (jump) and jr (jump register)
    j    L1                 # go to L1
    jr   $s1                # go to the address contained in $s1
Instruction Fetch (IF):
– Send the content of the Program Counter register to the Instruction Memory and fetch the current instruction from the Instruction Memory. Update the PC to the next sequential address by adding 4 to the PC (since each instruction is 4 bytes).
Instruction Decode (ID):
– Decode the current instruction (fixed-field decoding) and read from the Register File the one or two registers specified in the instruction fields.
– Sign-extend the offset field of the instruction, in case it is needed.
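Execution (EX):
– Register-Register ALU instructions: the ALU executes the specified operation on the operands read from the RF.
– Register-Immediate ALU instructions: the ALU executes the specified operation on the first operand read from the RF and the sign-extended immediate operand.
– Memory reference: the ALU adds the base register and the offset to calculate the effective address.
– Conditional branches: the ALU compares the two registers read from the RF, and the possible branch target address is computed by adding the sign-extended offset to the incremented PC.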
Memory Access (ME):
– Load instructions require a read access to the Data Memory using the effective address.
– Store instructions require a write access to the Data Memory using the effective address, to write the data from the source register read from the RF.
– Conditional branches can update the content of the PC with the branch target address, if the conditional test yielded true.
Write Back (WB):
– Load instructions write the data read from memory into the destination register of the RF.
– ALU instructions write the ALU results into the destination register of the RF.
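The address arithmetic performed in these phases can be written out explicitly. This is a minimal sketch that follows the wording of the slides: the effective address is base plus sign-extended offset, and the branch target is the incremented PC plus the sign-extended offset. The function names are hypothetical, and the 2-bit left shift that the datapath figure applies to the branch offset is deliberately left out, so the offset is treated here as a byte offset.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def next_sequential_pc(pc):
    """PC update done during instruction fetch (instructions are 4 bytes)."""
    return pc + 4


def effective_address(base, offset16):
    """Address used by lw/sw: base register plus sign-extended offset."""
    return base + sign_extend16(offset16)


def branch_target(pc, offset16):
    """Possible branch target: incremented PC plus sign-extended offset."""
    return next_sequential_pc(pc) + sign_extend16(offset16)


def next_pc_after_beq(pc, x, y, offset16):
    """beq: take the branch only if the ALU result x - y is zero."""
    if x - y == 0:
        return branch_target(pc, offset16)
    return next_sequential_pc(pc)
```

For example, `next_pc_after_beq(0x100, 5, 5, 8)` yields 0x10C (branch taken), while `next_pc_after_beq(0x100, 5, 6, 8)` yields 0x104 (next sequential instruction).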
[Figure: execution phases for each instruction class]
ALU instructions, op $x,$y,$z:
    instr. fetch & PC increm. → read of source regs. $y and $z → ALU op. ($y op $z) → write back of destination reg. $x
Conditional branch, beq $x,$y,offset:
    instr. fetch & PC increm. → read of source regs. $x and $y → ALU op. ($x-$y) & (PC+4+offset) → write PC
Load instructions, lw $x,offset($y):
    instr. fetch & PC increm. → read of base reg. $y → ALU op. ($y+offset) → read mem. M($y+offset) → write back of destination reg. $x
Store instructions, sw $x,offset($y):
    instr. fetch & PC increm. → read of base reg. $y & source $x → ALU op. ($y+offset) → write mem. M($y+offset)
[Figure: datapath divided into the Instruction Fetch, Execute, Memory Access and Write Back sections. Components: PC with adder (+4), instruction memory, register file (RS1, RS2, RD, Imm), sign extend, ALU with Zero? test, data memory, LMD, multiplexers. Signals: Next SEQ PC, Next PC, Address, Inst, WB Data.]
    IR <= mem[PC]
    PC <= PC + 4
    Reg[IRrd] <= Reg[IRrs] op(IRop) Reg[IRrt]
Latency of each instruction class (values in ns):

Instruction Type   Instruct. Mem.   Register Read   ALU Op.   Data Memory   Write Back   Total Latency
ALU Instr.         2                1               2         -             1            6 ns
Load               2                1               2         2             1            8 ns
Store              2                1               2         2             -            7 ns
Cond. Branch       2                1               2         -             -            5 ns
Jump               2                -               -         -             -            2 ns
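The totals in the latency table are just the sums of the modules each instruction class uses, and in a single-cycle implementation the clock period has to fit the slowest class. A small sketch of that arithmetic follows. It assumes the per-module values read from the table, takes the row with a 5 ns total to be the conditional branch, and uses 0 for modules an instruction does not use.

```python
# Columns: instruction memory, register read, ALU op., data memory, write back (ns)
latencies_ns = {
    "ALU Instr.":   [2, 1, 2, 0, 1],
    "Load":         [2, 1, 2, 2, 1],
    "Store":        [2, 1, 2, 2, 0],
    "Cond. Branch": [2, 1, 2, 0, 0],   # row label assumed
    "Jump":         [2, 0, 0, 0, 0],
}

# Total latency of each instruction class
totals_ns = {name: sum(values) for name, values in latencies_ns.items()}

# A single-cycle clock must be long enough for the slowest instruction
single_cycle_period_ns = max(totals_ns.values())
slowest = max(totals_ns, key=totals_ns.get)
```

With these values the load is the slowest instruction, so every instruction, including a 2 ns jump, would occupy an 8 ns cycle.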
– Each module can be used only once in a clock cycle
– The modules used more than once in a cycle must be duplicated.
– Some modules are shared among different instruction flows: a multiplexer enables multiple inputs to a module and selects one of the inputs based on the configuration of the control lines.
[Figure: single-cycle datapath and control. Components: PC with +4 adder, instruction memory, register file, sign extension (16 bit to 32 bit), 2-bit left shifter and adder for the branch target, ALU with Zero output, data memory, multiplexers. Instruction fields: OP [31-26], register fields [25-21], [20-16], [15-11], immediate [15-0]. The control unit takes OP [31-26] and drives RegWR, MemWR, MemRD, Branch, MemToReg, ALU_opB and the destination register selection; the ALU control unit takes ALU_op and bits [5-0].]
– Each phase of the instruction execution requires a clock cycle
– Each module can be used more than once per instruction in different clock cycles: possible sharing of modules
– We need internal registers to store the values to be used in the next clock cycles.
Performance optimization technique based on overlapping the execution of multiple instructions deriving from a sequential execution flow.
sequential instruction stream.
The execution of an instruction is divided into different phases (pipeline stages), each requiring a fraction of the time necessary to complete the instruction.
Instructions enter the pipeline at one end, progress through the stages, and exit from the other end, as in an assembly line.
[Figure: sequential vs. pipelined execution over time. Each stage (IF, ID, EX, MEM, WB) takes 2 ns. Sequential: I1 and I2 each occupy 10 ns, one after the other. Pipelined: I1 to I5 start 2 ns apart and overlap.]
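The timing shown in the figure can be checked with a short calculation. This is a sketch for an ideal pipeline only: it assumes 5 stages of 2 ns each, as in the figure, and no stalls. The function names are hypothetical.

```python
def sequential_time_ns(n_instructions, stages=5, stage_ns=2):
    """Each instruction runs all its stages before the next one starts."""
    return n_instructions * stages * stage_ns


def pipelined_time_ns(n_instructions, stages=5, stage_ns=2):
    """The first instruction fills the pipeline; then one completes per stage time."""
    return (stages + n_instructions - 1) * stage_ns


n = 5
speedup = sequential_time_ns(n) / pipelined_time_ns(n)
```

For the five instructions I1 to I5, sequential execution takes 50 ns against 18 ns in the pipeline. The latency of a single instruction is still 10 ns: pipelining improves throughput, not the time of an individual instruction, and the speedup approaches the number of stages only for long instruction sequences.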
[Figure: instruction phases mapped to the pipeline stages]
ALU instructions, op $x,$y,$z:
    IF: instr. fetch & PC increm.; ID: read of source regs.; EX: ALU op. ($y op $z); WB: write back
Conditional branches, beq $x,$y,offset:
    IF: instr. fetch & PC increm.; ID: read of source regs.; EX: ALU op. ($x-$y) & (PC+4+offset); ME: write PC
Load instructions, lw $x,offset($y):
    IF: instr. fetch & PC increm.; ID: read of base reg.; EX: ALU op. ($y+offset); ME: read mem. M($y+offset); WB: write back
Store instructions, sw $x,offset($y):
    IF: instr. fetch & PC increm.; ID: read of base reg. $y & source $x; EX: ALU op. ($y+offset); ME: write mem. M($y+offset)
Legend: IF = Instruction Fetch; ID = Instruction Decode; EX = Execution; ME = Memory Access; WB = Write Back
[Figure: pipelined datapath. The single-cycle datapath (PC with +4 adder, instruction memory, register file RF, sign extension, 2-bit left shifter and adder, ALU with Zero output, data memory, multiplexers) is split by the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB into the stages IF (Instruction Fetch), ID (Instruction Decode), EX (Execution), MEM (Memory Access) and WB (Write Back).]
[Figure: pipeline diagram, instruction order vs. time (clock cycles 1 to 7). Each instruction goes through Ifetch, Reg, ALU, DMem, Reg; successive instructions start one clock cycle apart.]
– What happens if read and write refer to the same register in the same clock cycle?
[Figure: pipelined datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the Instruction Fetch, Execute, Memory Access and Write Back sections. Components: adder, instruction memory, register file (RS1, RS2, Imm), sign extend, ALU with Zero? test, data memory, multiplexers. RD and Next SEQ PC are carried along the pipeline registers. Figure A.3, Page A-9.]
    IF:  IR <= mem[PC]; PC <= PC + 4
    ID:  A <= Reg[IRrs]; B <= Reg[IRrt]
    EX:  rslt <= A op(IRop) B
    MEM: WB <= rslt
    WB:  Reg[IRrd] <= WB
Data stationary control: local decode for each instruction phase / pipeline stage
– Structural hazards. Example: single memory for instructions and data
– Data hazards. Example: an instruction depending on a result of a previous instruction still in the pipeline
– Control hazards. Example: conditional branch execution
[Figure: pipeline diagram of instructions I1 to I5 over time, 2 ns per stage. Each instruction uses IM, REG, ALU, DM, REG in successive stages.]
    sub $2, $1, $3       # Reg. $2 written by sub
    and $12, $2, $5      # 1st operand ($2) depends on sub
    add $14, $2, $2      # 1st ($2) & 2nd ($2) operands depend on sub
    sw  $15, 100($2)     # Base reg. ($2) depends on sub
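The dependencies on $2 in the sub/and/add/sw sequence can be found mechanically by comparing the register written by each instruction with the registers read by the following ones. The sketch below assumes a hypothetical tuple encoding (mnemonic, destination register, source registers) and looks only at registers, not at memory.

```python
def raw_dependencies(program):
    """Return (producer index, consumer index, register) for each read-after-write pair."""
    deps = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None:                 # e.g. sw writes no register
            continue
        for j in range(i + 1, len(program)):
            _, later_dest, later_sources = program[j]
            if dest in later_sources:    # instruction j reads what i writes
                deps.append((i, j, dest))
            if later_dest == dest:       # register redefined: later readers depend on j
                break
    return deps


# (mnemonic, destination register, source registers)
program = [
    ("sub", "$2", ("$1", "$3")),
    ("and", "$12", ("$2", "$5")),
    ("add", "$14", ("$2", "$2")),
    ("sw", None, ("$15", "$2")),
]
deps = raw_dependencies(program)
```

Here `deps` contains three entries, all with producer index 0: the and, add and sw instructions each read the $2 written by sub.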
[Figure: pipeline diagram, time (2 ns per stage) vs. instruction order, for sub $2, $1, $3; and $12, $2, $5; add $14, $2, $2; sw $15,100($2). Each instruction uses IM, REG, ALU, DM, REG in successive stages.]
among correlating instructions
instructions, it inserts nops.
[Figure: pipeline diagram, time (2 ns per stage) vs. instruction execution order, for sub, and, add, sw. Each instruction uses IM, REG, ALU, DM, REG in successive stages.]
[Figure: pipeline diagram of sub $2, $1, $3; and $12, $2, $5; add $14, $2, $2; sw $15,100($2) with nops inserted after sub. Each row, including the nops, goes through IF, ID, EX, ME, WB.]
[Figure: pipeline diagram with stalls (bubbles), time in clock cycles CK1 to CK12, for sub $2, $1, $3; and $12, $2, $5; add $14, $2, $2; sw $15,100($2). The content of $2 is shown for each clock cycle.]
Original order            Rescheduled order
sub $2, $1, $3            sub $2, $1, $3
and $12, $2, $5           add $4, $10, $11
add $14, $2, $2           and $7, $8, $9
sw  $15, 100($2)          lw  $16, 100($18)
add $4, $10, $11          lw  $17, 200($19)
and $7, $8, $9            and $12, $2, $5
lw  $16, 100($18)         add $14, $2, $2
lw  $17, 200($19)         sw  $15, 100($2)
[Figure: pipeline diagram (IF, ID, EX, ME, WB) of sub $2, $1, $3; and $12, $2, $5; add $14, $2, $2; sw $15,100($2) with the forwarding paths: EX/EX path, MEM/EX path, MEM/ID path.]
[Figure: pipelined datapath with forwarding unit. Components: PC, instruction memory, registers, ALU, data memory, multiplexers, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Forwarding unit inputs: EX/MEM.RegisterRd, MEM/WB.RegisterRd, Rs, Rt (from IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd). Paths shown: EX/EX path, MEM/EX path, MEM/ID path, WB path.]
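The choice made by a forwarding unit for one ALU operand can be sketched as a priority check on the destination registers held in the pipeline registers. This is a hedged illustration, not the exact logic of the datapath in the figure: it assumes that the EX/EX path takes the value from EX/MEM and the MEM/EX path takes it from MEM/WB, that register 0 is hardwired to zero (as in MIPS) and is never forwarded, and that a register-write flag is available for each stage. The function and parameter names are hypothetical.

```python
def select_operand_source(src_reg, ex_mem_rd, ex_mem_reg_write,
                          mem_wb_rd, mem_wb_reg_write):
    """Pick where an ALU operand in EX comes from: a forwarding path or the RF."""
    # Most recent producer first: result computed in the previous cycle.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/EX path"
    # Otherwise an older result that is about to be written back.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/EX path"
    # No pending write to this register: the value read from the RF is valid.
    return "register file"


# and $12,$2,$5 in EX while sub $2,$1,$3 is in MEM
and_operand = select_operand_source(2, ex_mem_rd=2, ex_mem_reg_write=True,
                                    mem_wb_rd=0, mem_wb_reg_write=False)

# add $14,$2,$2 in EX while and ($12) is in MEM and sub ($2) is in WB
add_operand = select_operand_source(2, ex_mem_rd=12, ex_mem_reg_write=True,
                                    mem_wb_rd=2, mem_wb_reg_write=True)
```

Checking EX/MEM before MEM/WB matters when both stages hold a write to the same register: the younger result is the one the instruction in EX must see.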
    L1: lw  $s0, 4($t1)       # $s0 <- M[4 + $t1]
    L2: add $s5, $s0, $s1     # 1st operand depends on L1
[Figure: pipeline diagram over clock cycles CK1 to CK7 of lw $s0, 4($t1) followed by add $s5,$s0,$s1, each going through IF, ID, EX, MEM, WB.]
    L1: lw $s0, 4($t1)        # $s0 <- M[4 + $t1]
    L2: sw $s0, 4($t2)        # M[4 + $t2] <- $s0
[Figure: pipeline diagram over clock cycles CK1 to CK7 of lw $s0, 4($t1) followed by sw $s0, 4($t2), each going through IF, ID, EX, MEM, WB. The content of $s0 is shown for each clock cycle: 10 before the write back of the load, 20 after it.]
Without forwarding: 3 stalls
memory (in MEM/WB) to the memory’s input for the store.
[Figure: pipeline diagram over clock cycles CK1 to CK7 of lw $s0, 4($t1) followed by sw $s0, 4($t2), each going through IF, ID, EX, MEM, WB.]
[Figure: pipelined datapath with forwarding unit (PC, instruction memory, registers, ALU, data memory, multiplexers, pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB), including an additional multiplexer for the MEM/MEM path. Paths shown: EX/EX path, MEM/EX path, MEM/ID path, WB path, MEM/MEM path.]
Forwarding paths:
– EX/EX path
– MEM/EX path
– MEM/ID path
– MEM/MEM path
[Figure: pipeline diagram, clock cycles CK1 to CK7]
                          CK1  CK2  CK3  CK4   CK5   CK6  CK7
    lw  $r1, 0($r2)       IF   ID   EX   MEM1  MEM2  WB
    add $r1,$r2,$r3            IF   ID   EX    WB
[Figure: pipeline diagram, clock cycles CK1 to CK8]
                          CK1  CK2  CK3   CK4   CK5   CK6   CK7  CK8
    mul $f6,$f2,$f2       IF   ID   MUL1  MUL2  MUL3  MUL4  MEM  WB
    add $f6,$f2,$f2            IF   ID    AD1   AD2   MEM   WB
– All instructions take 5 stages, and
– Writes are always in stage 5
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
[Figure: pipeline diagram, clock cycles CK1 to CK7]
                          CK1  CK2  CK3  CK4   CK5   CK6  CK7
    sw  $r1, 0($r2)       IF   ID   EX   MEM1  MEM2  WB
    add $r2, $r3, $r4          IF   ID   EX    WB
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
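The two I/J/K sequences above differ only in which registers sub reads and writes, and that changes the kind of dependency between I and J. A minimal sketch of the comparison follows, using the standard names read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW), which are assumed here since the slide text does not spell them out. The tuple encoding (destination, sources) is hypothetical.

```python
def classify(first, second):
    """Name the register dependencies of `second` on the earlier `first`."""
    first_dest, first_sources = first
    second_dest, second_sources = second
    kinds = []
    if first_dest in second_sources:   # second reads what first writes
        kinds.append("RAW")
    if second_dest in first_sources:   # second writes what first reads
        kinds.append("WAR")
    if second_dest == first_dest:      # both write the same register
        kinds.append("WAW")
    return kinds


# First sequence: I: sub r1,r4,r3   J: add r1,r2,r3   K: mul r6,r1,r7
i_waw = ("r1", ("r4", "r3"))
# Second sequence: I: sub r4,r1,r3  J: add r1,r2,r3   K: mul r6,r1,r7
i_war = ("r4", ("r1", "r3"))
j = ("r1", ("r2", "r3"))
k = ("r6", ("r1", "r7"))

first_sequence = classify(i_waw, j)    # I and J both write r1
second_sequence = classify(i_war, j)   # J writes r1, which I reads
j_to_k = classify(j, k)                # K reads the r1 written by J
```

In a pipeline where every instruction reads in stage 2 and writes in stage 5, program order is preserved for both reads and writes, which is why the conditions listed above matter; the longer pipelines in the preceding diagrams (MEM1/MEM2, MUL1 to MUL4) no longer satisfy them.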