HetCore: TFET-CMOS Hetero-Device Architecture for CPUs and GPUs
Bhargava Gopireddy, Dimitrios Skarlatos, Wenjuan Zhu, Josep Torrellas
University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu ISCA 2018 Wednesday, 11:20am Session 9B: GPUs
[Figure: current Id (log scale) versus gate voltage VG for an ideal switch, a MOSFET, and a TFET, with supply voltages Vdd-CMOS and Vdd-TFET marked. Annotation: lower Vdd.]
[Figure: delay per operation (ps) and dynamic energy per operation (fJ) for CMOS and TFET.]
Vdd at 15nm: TFET 0.4V, CMOS 0.73V
TFET relative to CMOS: 4x lower energy, 2x slower, 8x lower dynamic power, 125x lower leakage power
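The device-level ratios above are mutually consistent: power is energy per operation divided by time per operation, so 4x lower energy at 2x the delay gives 8x lower dynamic power. A minimal sketch of that arithmetic follows; the function name is chosen here for illustration and is not from the talk.

```python
from fractions import Fraction

def dynamic_power_ratio(energy_ratio, delay_ratio):
    """Return the TFET/CMOS dynamic power ratio from TFET/CMOS energy and delay ratios.

    Power = energy per operation / delay per operation, so the ratios divide.
    """
    return Fraction(energy_ratio) / Fraction(delay_ratio)

# Ratios from the slide: TFET uses 1/4 the energy and takes 2x the delay.
ratio = dynamic_power_ratio(Fraction(1, 4), 2)
print(ratio)  # 1/8, i.e. 8x lower dynamic power
```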
TFET and CMOS manufacturing processes are compatible → Share same chip
▪ As energy efficient as TFET
▪ As fast as CMOS

▪ Power consuming
▪ Amenable to pipelining or not very latency sensitive
▪ Customizes known microarchitecture optimizations
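[Figure: an all-CMOS pipeline (CMOS Stage 1, CMOS Stage 2, CMOS Stage 3) next to its hetero-device version (CMOS Stage 1, TFET Stage 2a, TFET Stage 2b, CMOS Stage 3), with separate supplies VTFET and VCMOS.]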
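The pipelining idea can be checked with a small timing model. If a TFET stage is about 2x slower than its CMOS equivalent (the device-level figure quoted earlier), splitting it into two sub-stages keeps every stage within the original cycle time, at the cost of one extra cycle of latency. The sketch below assumes the logic divides evenly and ignores latch overhead and any voltage-level conversion between the VTFET and VCMOS domains; the delays are illustrative units, not measurements.

```python
def cycle_time(stage_delays):
    """The clock period is set by the slowest pipeline stage."""
    return max(stage_delays)

def to_tfet(delay, slowdown=2.0, substages=2):
    """Model moving one stage to TFET: total delay grows by `slowdown`,
    then the logic is split evenly into `substages` pipeline stages."""
    return [delay * slowdown / substages] * substages

# Illustrative all-CMOS pipeline: three stages of 1.0 time unit each.
cmos = [1.0, 1.0, 1.0]
# Hetero pipeline: Stage 2 becomes TFET Stage 2a and TFET Stage 2b.
hetero = [cmos[0]] + to_tfet(cmos[1]) + [cmos[2]]

print(cycle_time(cmos), len(cmos))      # 1.0 3
print(cycle_time(hetero), len(hetero))  # 1.0 4
```

In this model the clock period is unchanged and only the pipeline depth grows, which is why units that tolerate extra latency are the natural candidates.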
[Figure: four-core CPU (Core 0 to Core 3). Each core has IL1, DL1, and L2; all cores share a Last Level Cache. Blocks are color-coded as TFET or CMOS.]
L2 and LLC primarily consume leakage power → TFETs can reduce leakage power substantially
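To see why leakage-dominated structures benefit most, the sketch below scales a block's dynamic and leakage power by the device-level ratios quoted earlier (8x lower dynamic power, 125x lower leakage power). Applying device-level ratios to a whole block is a simplification, and the leakage fractions used are hypothetical, chosen only to contrast a leakage-dominated cache with a dynamic-dominated block.

```python
def tfet_power_fraction(leakage_fraction, dyn_reduction=8.0, leak_reduction=125.0):
    """Fraction of a CMOS block's power that remains after moving the block to TFET.

    leakage_fraction: share of the CMOS block's power that is leakage (0..1).
    The reduction factors default to the device-level ratios from the slides.
    """
    dynamic_fraction = 1.0 - leakage_fraction
    return dynamic_fraction / dyn_reduction + leakage_fraction / leak_reduction

# Hypothetical blocks: one with 90% leakage (cache-like), one with 10% leakage.
for name, leak in [("leakage-dominated", 0.9), ("dynamic-dominated", 0.1)]:
    print(f"{name}: {tfet_power_fraction(leak):.4f} of CMOS power")
```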
[Figure: Core 0 with IL1, DL1, FPU, and ALU. Blocks are color-coded as TFET or CMOS.]
[Figure: GPU with four SIMD FPUs, each paired with a register file (RF). Blocks are color-coded as TFET or CMOS.]
SIMD FPU can be pipelined
RF consumes high energy
[Figure: overall design. CPU: four cores (Core 0 to Core 3), each with ALU, FPU, IL1, DL1, and L2, plus a shared Last Level Cache. GPU: four SIMD FPU and RF pairs. Blocks are color-coded as TFET or CMOS.]
– Asymmetric DL1 cache
– Dual cluster ALU
– Register file cache
[Figure: all-TFET DL1 with TFET Tags 0 to 7 and TFET Data Ways 0 to 7. The index address and tag address feed a CAM match; a hit sends data to the core, and a miss goes to L2.]
Check CMOS way before accessing TFET ways
CMOS way holds MRU cache line and can respond in 1 cycle
[Figure: asymmetric DL1. Tag 0 and Data Way 0 are CMOS; Tags 1 to 7 and Data Ways 1 to 7 are TFET. A comparator checks the CMOS tag, and a hit there sends data to the core. On a miss in the CMOS way, a CAM match over the TFET tags drives the data select; a hit sends data to the core, and a miss goes to L2. Separate supplies VTFET and VCMOS.]
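The lookup order described above can be sketched as a toy model of one cache set. The 1-cycle CMOS hit comes from the slide; everything else is an assumption made for illustration: the TFET hit latency of 3 cycles, moving a line into the CMOS way when it hits in a TFET way, and LRU eviction on a miss. The class and method names are hypothetical.

```python
class AsymmetricSet:
    """One set of an 8-way cache: way 0 is CMOS (fast), ways 1..7 are TFET (slow)."""

    CMOS_HIT_CYCLES = 1  # from the slide: the CMOS way can respond in 1 cycle
    TFET_HIT_CYCLES = 3  # hypothetical: CMOS check plus a slower TFET access

    def __init__(self, num_ways=8):
        # ways[0] is the CMOS way; the list is kept in MRU-to-LRU order.
        self.ways = [None] * num_ways

    def access(self, tag):
        """Return (hit, cycles) and keep the MRU line in the CMOS way."""
        if self.ways[0] == tag:
            return True, self.CMOS_HIT_CYCLES
        if tag in self.ways:
            # Hit in a TFET way: move the line to the CMOS way (assumed policy).
            self.ways.remove(tag)
            self.ways.insert(0, tag)
            return True, self.TFET_HIT_CYCLES
        # Miss: evict the LRU line (last entry) and fill into the CMOS way.
        self.ways.pop()
        self.ways.insert(0, tag)
        return False, None  # miss latency depends on L2 and is not modeled

cache_set = AsymmetricSet()
for tag in ["A", "A", "B", "A", "A"]:
    print(tag, cache_set.access(tag))
```

Repeated accesses to the same line hit in the CMOS way after the first touch, which is the case the MRU placement is meant to make fast.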
TFET ALU doubles the latency of most common operations
Prevents back-to-back issue of dependent instructions
Increases misprediction penalty
[Figure: four TFET ALUs (TFET ALU 0 to TFET ALU 3).]
In the dispatch stage, identify producer-consumer pairs in a small window, and steer the producer to the CMOS ALU.
Steering algorithm: minimize bubbles, maximize power saving and balance
Mis-steering a producer is acceptable, as the penalty is only one cycle for the consumer.
[Figure: dual-cluster ALU with one CMOS ALU (CMOS ALU 0) and three TFET ALUs (TFET ALU 1 to TFET ALU 3).]
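A minimal sketch of the producer-steering idea, under stated assumptions: an instruction counts as a producer if one of the next `window` instructions reads its destination register, producers go to the CMOS ALU, and everything else is spread round-robin over the TFET ALUs. The window size, the round-robin choice, and the instruction encoding are illustrative; the slide's bubble-minimizing and balancing heuristics are not modeled, and register redefinition inside the window is ignored.

```python
from itertools import cycle

def steer(instrs, window=2, tfet_alus=("TFET ALU 1", "TFET ALU 2", "TFET ALU 3")):
    """Assign each instruction to an ALU.

    instrs: list of (dest_reg, [src_regs]) in program order.
    An instruction is treated as a producer if one of the next `window`
    instructions reads its destination register; producers go to the CMOS ALU.
    """
    next_tfet = cycle(tfet_alus)  # simple round-robin over the TFET ALUs
    assignment = []
    for i, (dest, _srcs) in enumerate(instrs):
        consumers = instrs[i + 1 : i + 1 + window]
        is_producer = any(dest in srcs for _d, srcs in consumers)
        assignment.append("CMOS ALU 0" if is_producer else next(next_tfet))
    return assignment

program = [
    ("r1", ["r2", "r3"]),   # r1 is read by the next instruction: producer
    ("r4", ["r1", "r5"]),   # r4 is not read within the window
    ("r6", ["r7", "r8"]),   # r6 is read two instructions later: producer
    ("r9", ["r2", "r2"]),   # r9 is read by the next instruction: producer
    ("r10", ["r6", "r9"]),  # last instruction: no consumer in the window
]
print(steer(program))
```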
▪ Proposed earlier to reduce energy consumption [Gebhart et al.]
▪ We use it to reduce the access latency by having the register file
[Figure: average execution time, average energy, and average ED², normalized to BaseCMOS, for BaseCMOS, BaseTFET, BaseHetCore, AdvHetCore, and AdvHetCore-2X. Annotations shown across the slide builds: 1.95; 39%, 36%, 28%; 26%, 39%, 10%; 34%, 68%, 32%.]
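ED² is the energy-delay-squared product, a metric that weights delay more heavily than energy. A minimal sketch of how a normalized ED² follows from normalized energy and delay; as an illustration it uses the device-level TFET ratios quoted earlier (4x lower energy, 2x slower), not the measured values from the chart. The function name is chosen here for illustration.

```python
def normalized_ed2(energy, delay):
    """ED^2 relative to a baseline, given energy and delay normalized to that baseline."""
    return energy * delay ** 2

# Device-level TFET ratios from the earlier slide: 0.25x energy, 2x delay.
print(normalized_ed2(0.25, 2.0))  # 1.0: the energy gain is cancelled by the squared delay
```

Under this metric a design only comes out ahead if it saves energy without giving back too much speed, since delay is squared.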