Simty:
Generalized SIMT execution on RISC-V
CARRV 2017 Sylvain Collange
INRIA Rennes / IRISA
sylvain.collange@inria.fr
From CPU-GPU to heterogeneous multi-core
Yesterday (2000-2010): homogeneous multi-core, discrete components
Central Processing Unit (CPU): latency-optimized cores
Graphics Processing Unit (GPU): throughput-optimized cores
Heterogeneous multi-core chip Hardware accelerators
SPMD program: threads, vector instructions
  add r1, r3
  mul r2, r1
(figure: the add/mul sequence replicated several times)
Control-flow instructions by instruction set:
RISC-V: jal jalr bXX ecall ebreak Xret
Intel GMA Gen4 (2006): jmpi if iff else endif do while break cont halt msave mrest push pop
Intel GMA SB (2011): jmpi if else endif case while break cont halt call return fork
AMD Cayman (2011): push push_else pop push_wqm pop_wqm else_wqm jump_any reactivate reactivate_wqm loop_start loop_start_no_al loop_start_dx10 loop_end loop_continue loop_break jump else call call_fs return return_fs alu alu_push_before alu_pop_after alu_pop2_after alu_continue alu_break alu_else_after
AMD R600 (2007): push push_else pop loop_start loop_start_no_al loop_start_dx10 loop_end loop_continue loop_break jump else call call_fs return return_fs alu alu_push_before alu_pop_after alu_pop2_after alu_continue alu_break alu_else_after
AMD R500 (2005): jump loop endloop rep endrep breakloop breakrep continue
NVIDIA Tesla (2007): bar bra brk brkpt cal cont kil pbk pret ret ssy trap .s
NVIDIA Fermi (2010): bar bpt bra brk brx cal cont exit jcal jmx kil pbk pret ret ssy .s
Master PC
Per-thread PCs PC0 PC1 PC2 PC3 (figure: tid = 1 2 3 1)
Match → active
No match → inactive
Pipeline (figure): Vote → Instruction Fetch → Broadcast of (Insn, MPC) → in each thread i = 0..n: MPC = PCi? → Exec → Update PCi
Match: execute instruction, update PC
No match: discard instruction, do not change PC
Example in the figure: MPC++ at fetch; PC0++ and PCn++ in threads 0 and n
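The fetch, broadcast, and match rule above can be sketched in a few lines of Python. This is a minimal software model, not the Simty hardware: the function `simt_step`, the toy instructions `add_one` and `branch_if_odd`, and the register name `x` are all hypothetical names introduced for illustration.

```python
def simt_step(program, pcs, regs, mpc):
    """One warp step: fetch at mpc, broadcast, execute in matching threads only."""
    insn = program[mpc]                  # one fetch shared by all threads
    active = []
    for tid, pc in enumerate(pcs):
        if pc == mpc:                    # match: execute and update this thread's PC
            pcs[tid] = insn(regs[tid], pc)
            active.append(tid)
        # no match: instruction discarded, PC left unchanged
    return active


def add_one(r, pc):
    r["x"] += 1
    return pc + 1                        # sequential: next PC is pc + 1


def branch_if_odd(r, pc):
    return 0 if r["x"] % 2 else pc + 1   # taken branch jumps back to 0


program = [add_one, branch_if_odd, add_one]   # hypothetical toy program
pcs = [0, 0, 2, 0]                            # thread 2 is at another PC
regs = [{"x": 0}, {"x": 1}, {"x": 5}, {"x": 2}]
active = simt_step(program, pcs, regs, mpc=0)
```

With the master PC at 0, threads 0, 1, and 3 execute `add_one` and advance, while thread 2 sees the broadcast instruction, finds no match, and keeps both its PC and its register state.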
Master PC selection (figure): per-thread PCs PC0..PC7 = 12 17 3 17 17 3 3 17, reduced by seven min operators to the master PC = 3
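As a sketch of this selection, the snippet below reduces the eight per-thread PCs from the figure with two-input min operators. The pairwise tree arrangement and the name `master_pc` are assumptions made for illustration; the slide text only shows that seven min operators are involved. The input is assumed to be non-empty.

```python
def master_pc(pcs):
    """Reduce per-thread PCs to their minimum with two-input min operators."""
    level = list(pcs)
    while len(level) > 1:
        # Combine neighbours pairwise, as one level of a reduction tree.
        nxt = [min(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:               # odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


pcs = [12, 17, 3, 17, 17, 3, 3, 17]      # values from the figure
mpc = master_pc(pcs)
active_mask = [int(pc == mpc) for pc in pcs]
```

Here `mpc` is 3, and the threads whose PC equals it (T2, T5, T6) form the active set for the next fetch.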
Per-thread PCs (T0..T7): PC0..PC7 = 12 17 3 17 17 3 3 17
Sorted context table (common PC, thread mask T0..T7):
CPC1 = 17, mask 0 1 0 1 1 0 0 1
CPC2 = 3, mask 0 0 1 0 0 1 1 0
CPC3 = 12, mask 1 0 0 0 0 0 0 0
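The relation between the two representations can be illustrated by grouping threads that share a PC into one entry with a thread mask. This is only a sketch of the data layout: the function name `context_table` is hypothetical, and the ordering policy of the table is not recoverable from the slide text, so the sketch returns an unordered mapping from common PC to mask.

```python
def context_table(pcs):
    """Group threads by PC: one (common PC -> thread mask) entry per distinct PC."""
    table = {}
    for tid, pc in enumerate(pcs):
        mask = table.setdefault(pc, [0] * len(pcs))
        mask[tid] = 1                    # thread tid belongs to this context
    return table


pcs = [12, 17, 3, 17, 17, 3, 3, 17]      # per-thread PCs from the figure
table = context_table(pcs)
```

Eight per-thread PCs collapse into three entries, and the masks obtained for PCs 17, 3, and 12 are the same as those listed for CPC1, CPC2, and CPC3.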
Threads Vector instruction Warp
Figure: four panels, each showing threads T1, T2, ..., Tn with Memory and Registers, labeled Easy, Hardest, Easy, Hard
Pipeline with memory access (figure): Vote → Instruction Fetch → Broadcast of (Insn, MPC) → in each thread i = 0..n: MPC = PCi? → Mem → Update PCi
No match or bank conflict: discard instruction, do not update PC
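The "do not update PC" rule means a thread that loses a bank conflict simply sees the same instruction again on a later pass. The sketch below models that behaviour under assumptions that are not stated on the slide: banks are word-interleaved (`addr % num_banks`), the first requester of a bank wins, and threads reading the same address share the access. The name `mem_step` is hypothetical.

```python
def mem_step(pcs, addrs, mpc, num_banks):
    """Threads at mpc access memory; threads hitting a bank conflict keep their PC."""
    granted = {}                         # bank -> address served this step (assumed policy)
    done = []
    for tid, pc in enumerate(pcs):
        if pc != mpc:
            continue                     # no match: discard instruction
        bank = addrs[tid] % num_banks    # assumed word-interleaved banking
        if granted.setdefault(bank, addrs[tid]) != addrs[tid]:
            continue                     # bank conflict: discard, do not update PC
        pcs[tid] = pc + 1                # access served: update PC
        done.append(tid)
    return done


pcs = [4, 4, 4, 7]
addrs = [0, 1, 4, 2]                     # threads 0 and 2 both map to bank 0
first = mem_step(pcs, addrs, mpc=4, num_banks=4)
second = mem_step(pcs, addrs, mpc=4, num_banks=4)
```

On the first pass threads 0 and 1 complete, and thread 2 stays at PC 4 because of the conflict on bank 0. On the second pass thread 2 is the only one still matching and completes without any dedicated replay logic.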
Figure labels: logic area (LEs), memory area (M9Ks), frequency (MHz); latency hiding: multithreading depth; throughput: SIMD width