The Vector-Thread Architecture
Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris, Jared Casper, Krste Asanovic
MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA
ISCA 2004
[Figure: processing elements PE0, PE1, PE2, ..., PEN attached to Memory, shown once under vector control and once under thread control. Labels: Control Processor, fetch.]
[Figure: virtual processor (VP) model. Each VP has Registers and ALUs; the VPs connect to Memory, in one variant through a Vector Memory Unit.]
[Figure: execution diagrams for vector-fetch, vector-load, and vector-store commands, showing the per-VP operation sequences they trigger (ld, <<, x, +, st, ==, br).]
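[Figure: VPs striped across four lanes. Each lane has its own ALU; the lanes connect to Memory through the Vector Memory Unit and are driven by the Control Processor.]
Lane 0: VP0 VP4 VP8 VP12
Lane 1: VP1 VP5 VP9 VP13
Lane 2: VP2 VP6 VP10 VP14
Lane 3: VP3 VP7 VP11 VP15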
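The striping in the lane figure follows a simple modulo pattern: VP i sits on lane i mod 4. A minimal sketch of that mapping, assuming 16 VPs and 4 lanes as drawn (the function names are illustrative, not part of the architecture):

```python
NUM_LANES = 4   # Lane 0 .. Lane 3, as drawn in the figure
NUM_VPS = 16    # VP0 .. VP15, as drawn in the figure


def lane_of(vp: int, num_lanes: int = NUM_LANES) -> int:
    """Return the lane that a given VP index is striped onto."""
    return vp % num_lanes


def vps_on_lane(lane: int, num_vps: int = NUM_VPS,
                num_lanes: int = NUM_LANES) -> list[int]:
    """Return the VP indices that are time-multiplexed on one lane."""
    return list(range(lane, num_vps, num_lanes))


if __name__ == "__main__":
    for lane in range(NUM_LANES):
        print(f"Lane {lane}:", " ".join(f"VP{v}" for v in vps_on_lane(lane)))
```

Running the script prints the same four rows as the figure, one lane per line.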
instruction fetch energy (on par with register read)
executes the AIB for the requesting VP
executes the AIB for every VP
fetch address tag-check, even for non-vectorizable loops
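The two fetch behaviors named above, one that executes the AIB for every VP (vector-fetch) and one that executes it only for the requesting VP (thread-fetch), can be contrasted in a small software model. This is a sketch under stated assumptions, not the hardware's behavior: the `VP` class, the `countdown` AIB, and the sequential execution order are all hypothetical simplifications, and a real machine would interleave VPs across lanes.

```python
class VP:
    """Hypothetical virtual processor: a loop counter plus an execution tally."""

    def __init__(self, index: int, counter: int):
        self.index = index
        self.counter = counter   # per-VP loop trip count (illustrative)
        self.executed = 0        # number of AIB executions on this VP


def countdown(vp: VP):
    """Example AIB: decrement the counter and thread-fetch itself while > 0.

    Returning an AIB models a thread-fetch by this VP;
    returning None means the VP thread stops.
    """
    vp.executed += 1
    if vp.counter > 0:
        vp.counter -= 1
        return countdown
    return None


def thread_fetch_loop(vp: VP, aib) -> None:
    """Thread-fetch: the AIB runs only on the VP that requested it."""
    while aib is not None:
        aib = aib(vp)


def vector_fetch(vps: list, aib) -> None:
    """Vector-fetch: the control processor runs the AIB on every VP."""
    for vp in vps:
        thread_fetch_loop(vp, aib)


if __name__ == "__main__":
    vps = [VP(i, counter=i) for i in range(4)]
    vector_fetch(vps, countdown)
    for vp in vps:
        print(f"VP{vp.index} executed the AIB {vp.executed} time(s)")
```

One vector-fetch starts all four VPs, and each VP then iterates a different number of times through its own thread-fetches. This is the data-dependent control flow that a purely vector model cannot express directly.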
[Figure: AIB fetch path in a lane. A vector-fetch or thread-fetch supplies an AIB address that is checked against the AIB tags. A miss sends the miss address to AIB fill; otherwise an execute directive is issued and the lane's ALU reads instructions from the AIB cache.]
[Figure: VPs time-multiplexed on Lanes 0 to 3 (VP0 to VP15, striped as above). The timeline shows AIB execution triggered by vector-fetch and thread-fetch commands.]
fetches, c3 has integer mult/div
[Figure: SCALE organization: Control Processor, L1 Cache, AIB Fill Unit, and Lanes 0 to 3, each made up of clusters c0, c1, c2, c3.]
SCALE VP
[Figure: example VP register configurations for cluster c0 of one lane (VP0, VP4, VP8, ... VP24), showing shared registers and chain registers cr0, cr1.]
4 VPs with 0 shared regs, 8 private regs
7 VPs with 4 shared regs, 4 private regs
25 VPs with 7 shared regs, 1 private reg
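All three configurations are consistent with a fixed register budget per cluster in each lane: 0 + 4×8 = 32, 4 + 7×4 = 32, and 7 + 25×1 = 32. A minimal sketch of that arithmetic follows. The 32-register budget is an assumption inferred from these three data points, not stated on the slide, and `max_vps` is an illustrative name.

```python
REGS_PER_CLUSTER = 32  # assumption: inferred from the three slide configurations


def max_vps(shared: int, private: int, total: int = REGS_PER_CLUSTER) -> int:
    """VPs that fit per lane when shared registers are allocated once
    and each VP gets its own copy of the private registers."""
    if private <= 0:
        raise ValueError("each VP needs at least one private register")
    if not 0 <= shared <= total:
        raise ValueError("shared registers must fit in the register file")
    return (total - shared) // private


if __name__ == "__main__":
    for shared, private in [(0, 8), (4, 4), (7, 1)]:
        print(f"{max_vps(shared, private)} VPs with {shared} shared regs, "
              f"{private} private regs")
```

Under this model, moving values into shared registers shrinks each VP's private footprint, which is what allows more VPs to be mapped onto each lane.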
[Figure: an AIB compiled into cluster micro-op bundles; Cluster 3 not shown. Each cluster holds its AIBs in an AIB cache and has VP Regs, an ALU, and writeback (wb), compute, and transport (tp) stages.]

cluster   operation       destinations
c0        xor pr0, pr1    → c1/cr0, c2/cr0
c1        sll cr0, 2      → c2/cr1
c2        add cr0, cr1    → pr0

cluster micro-op bundle (wb | compute | tp)
Cluster 0:  (none)          | xor pr0,pr1      | →c1,c2
Cluster 1:  c0→cr0          | sll cr0,2        | →c2
Cluster 2:  c0→cr0, c1→cr1  | add cr0,cr1→pr0  | (none)
[Figure: chip floorplan with dimension labels 4 mm and 2.5 mm. Blocks: Control Processor (shftr, ALU, MD, CP0, L/S, RF, ctrl, byp, PC), repeated lane cluster datapaths (RF, ctrl, shftr, ALU, latch, mux/IQC), load data queues (ctrl, LDQ), Mult Div, Memory Unit, Memory Interface / Cache Control, Cache Tags, and four Cache Banks (8KB each).]
allowed for some benchmarks, these are often unpublished
continue to improve
average 7 loop iterations execute in parallel, 8x speedup
[Figure: bar charts comparing 1, 2, 4, and 8 lane configurations, with the prototype marked, for the benchmarks adpcm.dec, sha, rgbcmy, rgbyiq, hpg, dither, lookup, qsort, and pntrch.]