SLIDE 1

HetCore: TFET-CMOS Hetero-Device Architecture for CPUs and GPUs

Bhargava Gopireddy, Dimitrios Skarlatos, Wenjuan Zhu, Josep Torrellas

University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
ISCA 2018, Wednesday, 11:20am, Session 9B: GPUs

SLIDE 2

Ideal Switch

[Figure: drain current Id (log scale) versus gate voltage VG for an ideal switch, with Vdd marked]

SLIDE 3

Ideal Switch vs Si-MOSFET

[Figure: drain current Id (log scale) versus gate voltage VG for the ideal switch and a Si-MOSFET, with Vdd and Vdd-CMOS marked]

SLIDE 4

TFET vs MOSFET

[Figure: drain current Id (log scale) versus gate voltage VG for the ideal switch, a MOSFET, and a TFET, with Vdd-CMOS and Vdd-TFET marked]

Lower Vdd for TFET

SLIDE 5

TFET vs CMOS: Energy and Delay

[Figure: dynamic energy per operation (fJ) and delay per operation (ps) for CMOS and TFET]

TFET compared with CMOS:

▪ 4x lower energy
▪ 2x slower
▪ 8x lower dynamic power
▪ 125x lower leakage power

Vdd at 15nm: TFET 0.4V, CMOS 0.73V

TFET and CMOS manufacturing processes are compatible → Share same chip
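The energy, delay, and dynamic power ratios on this slide are linked by power = energy / time. A minimal sketch of that arithmetic follows; the function name is illustrative and not from the talk.

```python
def dynamic_power_reduction(energy_reduction: float, slowdown: float) -> float:
    """Return how many times lower the dynamic power is.

    Power is energy per unit time, so
    P_tfet / P_cmos = (E_tfet / E_cmos) / (t_tfet / t_cmos).
    """
    energy_ratio = 1.0 / energy_reduction  # E_tfet / E_cmos
    delay_ratio = slowdown                 # t_tfet / t_cmos
    power_ratio = energy_ratio / delay_ratio
    return 1.0 / power_ratio


# Slide values: 4x lower energy per operation, 2x slower per operation.
print(dynamic_power_reduction(energy_reduction=4.0, slowdown=2.0))  # 8.0
```

With the slide's values, the 4x energy reduction and the 2x slowdown together give the quoted 8x lower dynamic power.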

SLIDE 6

Goal: Energy Efficient Core Design with TFETs

  • Design a core that is
    ▪ As energy efficient as TFET
    ▪ As fast as CMOS
  • Approach: Use both CMOS and TFET devices within the core
  • How: Selectively replace CMOS units with TFET ones that are
    ▪ Power consuming
    ▪ Amenable to pipelining or not very latency sensitive

SLIDE 7

Contributions

  • Propose the concept of a hetero-device TFET-CMOS core architecture, called HetCore
  • Design of an “Advanced HetCore” for CPUs and GPUs
    ▪ Customizes known microarchitecture optimizations
  • At iso-power, an 8-core HetCore CPU has a 68% lower ED² and is 32% faster than a 4-core CMOS CPU
  • Similar results are obtained for GPUs

SLIDE 8

Replacing CMOS Units with TFET in Pipeline

  • Pipeline twice as deep while maintaining the same frequency

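[Figure: a three-stage CMOS pipeline (CMOS Stage 1, CMOS Stage 2, CMOS Stage 3) and its HetCore counterpart (CMOS Stage 1, TFET Stage 2a, TFET Stage 2b, CMOS Stage 3), with separate supply voltages VTFET and VCMOS]

Selected units must be: Amenable to pipelining and/or not very latency sensitive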
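The stage split can be sketched with a toy model. It assumes, following the earlier energy and delay slide, that TFET logic is about 2x slower than CMOS, so a unit moved to TFET is divided into two sub-stages to fit the unchanged clock period. The function and parameter names are illustrative.

```python
import math


def hetcore_stages(cmos_stages, tfet_units, tfet_slowdown=2.0):
    """Return the stage list after moving the named units to TFET.

    Each CMOS stage fits in one clock cycle. A unit moved to TFET is
    `tfet_slowdown` times slower, so it is split into ceil(tfet_slowdown)
    sub-stages to keep the clock period unchanged (assumed model).
    """
    n_sub = math.ceil(tfet_slowdown)
    out = []
    for name in cmos_stages:
        if name in tfet_units:
            # Sub-stages are suffixed a, b, ... as in the slide's figure.
            out.extend(f"TFET {name}{chr(ord('a') + i)}" for i in range(n_sub))
        else:
            out.append(f"CMOS {name}")
    return out


stages = hetcore_stages(["Stage 1", "Stage 2", "Stage 3"], tfet_units={"Stage 2"})
print(stages)       # ['CMOS Stage 1', 'TFET Stage 2a', 'TFET Stage 2b', 'CMOS Stage 3']
print(len(stages))  # 4
```

The frequency stays the same, but an operation now takes four cycles instead of three, which is why the selected units need to tolerate extra latency.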

SLIDE 10

Baseline HetCore Design

L2 and LLC primarily consume leakage power → TFETs can reduce leakage power substantially

[Figure: four-core CPU (Core 0 to Core 3); each core has IL1, DL1, and its own L2, and all cores share a Last Level Cache; the legend distinguishes TFET and CMOS units]

SLIDE 12

Baseline HetCore Design

[Figure: Core 0 with its DL1 and IL1 caches; the legend distinguishes TFET and CMOS units]

DL1 and IL1 consume high dynamic as well as leakage power
DL1 latency can be partially hidden in an out-of-order machine

SLIDE 13

Baseline HetCore Design

Both FPU and ALU consume significant power and can be pipelined
FPU: Pipeline deeper and exploit ILP
ALU: Some impact on performance, but the energy savings justify its placement in TFET

[Figure: Core 0 with DL1, IL1, FPU, and ALU; the legend distinguishes TFET and CMOS units]

SLIDE 16

Baseline HetCore GPU Design

[Figure: GPU with four SIMD FPUs, each paired with a register file (RF); the legend distinguishes TFET and CMOS units]

SIMD FPU can be pipelined
RF consumes high energy

SLIDE 18

Baseline HetCore with CPU and GPU

[Figure: combined chip with a four-core CPU (Core 0 to Core 3, each with ALU, FPU, IL1, DL1, and L2, sharing a Last Level Cache) and a GPU with four SIMD FPU and RF pairs; the legend distinguishes TFET and CMOS units]

Base HetCore saves energy compared to CMOS but it degrades performance

SLIDE 19

Advanced HetCore Design

  • New opportunities for micro-architectural optimization
    – Base HetCore is an unbalanced design
    – A small power penalty may be a good tradeoff for large gains in performance
  • For CPU:
    – Asymmetric DL1 cache
    – Dual cluster ALU
  • For GPU:
    – Register file cache

SLIDE 20

DL1 Cache in TFET

[Figure: DL1 cache with all tags (Tag 0 to Tag 7) and all data ways (Way 0 to Way 7) in TFET; inputs are the index address and the tag address, which feeds a CAM match; a hit sends data to the core, a miss goes to L2]

SLIDE 21

Asymmetric DL1 Cache

Check CMOS way before accessing TFET ways
CMOS way holds MRU cacheline and can respond in 1 cycle

[Figure: asymmetric DL1 cache; Tag 0 and Data Way 0 are in CMOS (VCMOS), Tags 1 to 7 and Data Ways 1 to 7 are in TFET (VTFET); a comparator checks the CMOS tag and a CAM match with data select covers the TFET ways; a hit on either path sends data to the core, and a miss goes to L2]
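The lookup order can be illustrated with a small behavioral model of one cache set. This is a sketch under stated assumptions, not the authors' implementation: the 1-cycle CMOS response comes from the slide, while the extra TFET cycles, the promotion of the accessed line into the CMOS way, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

CMOS_CYCLES = 1        # from the slide: the CMOS way responds in 1 cycle
TFET_EXTRA_CYCLES = 2  # assumed extra cycles to search the TFET ways


@dataclass
class AsymmetricSet:
    """One DL1 set: way 0 is CMOS and holds the MRU line, the other ways are TFET."""
    tfet_ways: int = 7
    cmos_tag: Optional[int] = None
    tfet_tags: List[int] = field(default_factory=list)  # ordered MRU -> LRU

    def access(self, tag: int) -> Tuple[str, int]:
        """Return (outcome, cycles spent in DL1) for an access to `tag`."""
        if tag == self.cmos_tag:
            return "cmos_hit", CMOS_CYCLES
        # The CMOS way missed, so the TFET ways are searched next.
        cycles = CMOS_CYCLES + TFET_EXTRA_CYCLES
        if tag in self.tfet_tags:
            self.tfet_tags.remove(tag)
            outcome = "tfet_hit"
        else:
            outcome = "miss"  # the request continues to L2 (not modeled)
        # Assumed policy: the accessed line becomes MRU and moves to the CMOS
        # way; the line it displaces becomes the most recent TFET line.
        if self.cmos_tag is not None:
            self.tfet_tags.insert(0, self.cmos_tag)
            del self.tfet_tags[self.tfet_ways:]  # drop the LRU line if over capacity
        self.cmos_tag = tag
        return outcome, cycles


s = AsymmetricSet()
for tag in (0xA, 0xA, 0xB, 0xA):
    print(s.access(tag))
```

In this trace the second access to the same line hits in the CMOS way in one cycle, and a line that was displaced into the TFET ways is found there at the longer latency before being promoted again.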

SLIDE 22

Performance Impact of TFET ALU

TFET ALU doubles the latency of most common operations
Prevents back-to-back issue of dependent instructions
Increases misprediction penalty

[Figure: four ALUs (ALU 0 to ALU 3), all in TFET]

SLIDE 23

Dual Speed ALU Cluster

In the dispatch stage, identify the producer-consumer pairs in a small window, and steer the producer to the CMOS ALU.
Steering algorithm: minimize bubbles, maximize power saving, and balance overall utilization [Baniasadi et al.]
Mis-steering a producer is okay, as the penalty is only one cycle for the consumer

[Figure: one CMOS ALU (ALU 0) and three TFET ALUs (ALU 1 to ALU 3)]
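The producer detection step can be sketched as follows. This is a deliberately simplified illustration, not the steering algorithm of Baniasadi et al.: it only marks producers within one dispatch window and ignores utilization balancing and contention for the single CMOS ALU. The instruction encoding is an assumption.

```python
def steer(window):
    """Assign each instruction in a dispatch window to 'CMOS' or 'TFET'.

    `window` is a list of (dest_reg, src_regs) tuples in program order.
    An instruction counts as a producer if a later instruction in the same
    window reads its destination register before that register is overwritten.
    """
    assignment = []
    for i, (dest, _) in enumerate(window):
        is_producer = False
        for later_dest, later_srcs in window[i + 1:]:
            if dest in later_srcs:
                is_producer = True
                break
            if later_dest == dest:  # value overwritten before any use
                break
        assignment.append("CMOS" if is_producer else "TFET")
    return assignment


window = [
    ("r1", ("r2", "r3")),  # r1 is read by the next instruction
    ("r4", ("r1", "r5")),
    ("r6", ("r7", "r8")),
    ("r9", ("r10",)),
]
print(steer(window))  # ['CMOS', 'TFET', 'TFET', 'TFET']
```

Only the first instruction has a consumer inside the window, so it alone is sent to the fast ALU; the rest run on the TFET ALUs where the longer latency is less likely to stall a dependent instruction.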

SLIDE 24

Register File Cache in GPU

  • TFET register file introduces additional cycles in critical path
  • Use: Register file cache, similar to an asymmetric cache, to hold a few registers closer to the FPU
    ▪ Proposed earlier to reduce energy consumption [Gebhart et al.]
    ▪ We use it to reduce the access latency by having the register file cache in CMOS
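A register file cache of this kind can be modeled as a small LRU structure in front of the slower main register file. The sketch below is illustrative only: the entry count, both latencies, the LRU replacement, and the allocate-on-write policy are assumptions, not details given in the talk.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Small LRU cache of registers in front of a slower main register file."""

    def __init__(self, entries=4, cache_cycles=1, main_rf_cycles=2):
        self.entries = entries
        self.cache_cycles = cache_cycles      # assumed CMOS cache latency
        self.main_rf_cycles = main_rf_cycles  # assumed TFET register file latency
        self.cached = OrderedDict()           # register id -> True, LRU first

    def _touch(self, reg):
        """Mark `reg` as most recently used, evicting the LRU entry if full."""
        self.cached[reg] = True
        self.cached.move_to_end(reg)
        if len(self.cached) > self.entries:
            self.cached.popitem(last=False)

    def read(self, reg):
        """Return the read latency in cycles and update the cache."""
        hit = reg in self.cached
        self._touch(reg)
        return self.cache_cycles if hit else self.main_rf_cycles

    def write(self, reg):
        """Allocate results in the cache so a following consumer hits."""
        self._touch(reg)


rfc = RegisterFileCache(entries=2)
print(rfc.read("v0"))  # first read misses and goes to the main register file
print(rfc.read("v0"))  # second read hits in the cache
```

Recently produced or recently read registers are served at the cache latency, so the additional TFET register file cycles are paid only on a cache miss.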

SLIDE 25

Evaluation Methodology

4 out-of-order cores in CPU, 8 Compute Units in GPU (AMD Southern Islands)
Multi2Sim simulator

  • CPU: SPLASH2 and Parsec
  • GPU: AMD-SDK-APP benchmark suite

Configurations:

  • BaseCMOS, BaseTFET
  • Base HetCore
  • Adv HetCore → Base HetCore with previous mitigations
  • Adv HetCore-2X → Twice as many cores within the same power budget as BaseCMOS

SLIDE 27

HetCore – CPU Results

[Figure: bar chart of average execution time, average energy, and average ED², normalized to BaseCMOS, for BaseCMOS, BaseTFET, BaseHetCore, AdvHetCore, and AdvHetCore-2X; the y-axis runs from 0.2 to 1.4, with one off-scale value labeled 1.95]

Very slow!

SLIDE 28

Base HetCore – CPU Results

[Figure: bar chart of average execution time, average energy, and average ED², normalized to BaseCMOS, for BaseCMOS, BaseTFET, BaseHetCore, AdvHetCore, and AdvHetCore-2X; chart annotations: 1.95, 39%, 36%, 28%]

Still too slow

SLIDE 29

Adv HetCore – CPU Results

[Figure: bar chart of average execution time, average energy, and average ED², normalized to BaseCMOS, for BaseCMOS, BaseTFET, BaseHetCore, AdvHetCore, and AdvHetCore-2X; chart annotations: 26%, 1.95, 39%, 10%]

High energy efficiency with mild slowdown

SLIDE 30

Adv HetCore-2X at Iso-power to BaseCMOS

[Figure: bar chart of average execution time, average energy, and average ED², normalized to BaseCMOS, for BaseCMOS, BaseTFET, BaseHetCore, AdvHetCore, and AdvHetCore-2X; chart annotations: 34%, 68%, 32%]

Adv HetCore enables 2X cores in the same power budget!

SLIDE 31

Adv HetCore GPU

  • Adv HetCore-GPU
    – 40% lower energy
    – 20% slowdown
  • Adv HetCore-GPU with 2X EUs at iso-power
    – 60% lower ED²
    – 30% faster
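For reference, ED² is the energy-delay-squared product, energy × delay², where lower is better. The sketch below applies that definition to the GPU figures on this slide, normalized to the CMOS baseline. It assumes that "30% faster" means 30% lower execution time; the slide's numbers are rounded averages, so the derived values are approximate.

```python
def ed2(energy, delay):
    """Energy-delay-squared product: ED^2 = E * D^2 (lower is better)."""
    return energy * delay ** 2


# Adv HetCore-GPU, normalized to the CMOS baseline:
# 40% lower energy -> E = 0.60, 20% slowdown -> D = 1.20
adv = ed2(energy=0.60, delay=1.20)
print(f"ED^2 relative to baseline: {adv:.3f}")  # 0.864

# 2X-EU configuration: 60% lower ED^2 (0.40) at 30% lower execution time
# (D = 0.70, assumed reading of "30% faster") implies this relative energy.
implied_energy = 0.40 / 0.70 ** 2
print(f"implied relative energy: {implied_energy:.3f}")  # 0.816
```

Because delay is squared, the 20% slowdown offsets much of the 40% energy saving in the first configuration, while the iso-power configuration gains on both terms.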

SLIDE 32

Conclusion

  • Proposed the concept of a hetero-device TFET-CMOS core architecture for high performance and energy efficiency
  • Designed an Advanced HetCore for CPUs and GPUs
    – Customizes known microarchitecture optimizations
  • At iso-power, an 8-core HetCore CPU has a 68% lower ED² and is 32% faster than a 4-core CMOS CPU
  • Similar results are obtained for GPUs
