

SLIDE 1

Lifeng Nai*† Ramyad Hadidi† He Xiao† Hyojong Kim† Jaewoong Sim‡ Hyesoon Kim†

†Georgia Institute of Technology

* Google, ‡Intel Labs IPDPS-32 | May 2018 Disclaimer: This work does not relate to Google/Intel Labs

SLIDE 2

2/24

Processing-in-memory (PIM) is regaining attention for energy efficient computing

  • Graph Workloads: Data-Intensive, Little Data Reuse

Basic Concept: Offload compute to memory

  • Reduce costly energy consumption of data movement
  • Enable using large internal memory bandwidth
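The data-movement saving behind this concept can be sketched with a toy packet-accounting model. All packet sizes below are assumptions for illustration, not numbers from the deck: a host-side read-modify-write crosses the off-chip link four times, while a PIM offload needs only a request and an acknowledgment.

```python
FLIT = 16  # bytes per HMC flit (assumed; header + tail fit in one flit)

def link_bytes_conventional():
    # Host-side read-modify-write: READ req/resp plus WRITE req/ack.
    read_req  = 1 * FLIT  # header + tail
    read_resp = 2 * FLIT  # header + tail + 16B data payload
    write_req = 2 * FLIT  # header + tail + 16B data payload
    write_ack = 1 * FLIT
    return read_req + read_resp + write_req + write_ack

def link_bytes_pim():
    # The same update as one PIM packet: the operand rides in the request.
    pim_req = 2 * FLIT  # header + operand + tail
    ack     = 1 * FLIT
    return pim_req + ack

print(link_bytes_conventional(), link_bytes_pim())  # 96 48
```

Under these assumed sizes offloading halves the link traffic; the real saving also includes avoiding the data's round trip through the host's cache hierarchy.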

[Figure: Conventional Data Processing vs. Processing-in-Memory — conventionally, data (0xf0f0, 0xf0f4) moves to the host for the ADD; with PIM, the ADD instruction moves to memory]

SLIDE 3

3/24

PIM could increase memory temperature beyond normal operating temperature (85°C)

  • High bandwidth (hundreds of GB/s to TB/s) from 3D-stacked memory
  • Less effective heat transfer compared to DIMMs
  • PIM would make these thermal problems worse!

Too Hot Memory Stack?

  • Slower processing for memory requests
  • Decreasing overall system performance

[Figure: same offloading diagram (DATA: 0xf0f0, DATA: 0xf0f4, INST: ADD); annotation: conventional processing rarely exceeds 85°C]

CoolPIM keeps the memory "Cool" to achieve better PIM performance

SLIDE 4

4/24

Introduction

Hybrid Memory Cube

  • Background
  • Thermal Measurements & Thermal Modeling of Future HMC

CoolPIM

  • Software-Based Throttling
  • Hardware-Based Throttling

Evaluation

Conclusion

SLIDE 5

5/24

A Hybrid Memory Cube (HMC) from Micron

  • Multiple 3D-stacked DRAM layers + one logic layer with TSVs
  • Vaults: equivalent to memory channels
  • Full-duplex serial links between the host and HMC

No PIM functionality for existing HMC products yet

[Figure: HMC structure — multiple DRAM layers over a logic layer, connected by TSVs and partitioned into vaults; packets travel over the external serial links]

SLIDE 6

6/24

Instruction-level PIM supported in future HMC (HMC 2.0)

  • Perform Read-Modify-Write (RMW) operations atomically
  • Similar to READ/WRITE packets; just different CMD in the Header
  • No HMC 2.0 product yet!

PIM-ADD(addr, imm) → packet: Header (PIM-ADD) | addr, imm | Tail

Type       | HMC 2.0 PIM Instruction
Arithmetic | Signed add
Bitwise    | Swap, bit write
Boolean    | AND/NAND/OR/NOR/XOR
Comparison | CAS-equal/greater
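The packet-level view can be sketched as follows. The opcode values and field widths here are invented for illustration, not taken from the HMC 2.0 specification — the point the slide makes is that a PIM-ADD request differs from a WRITE only in the CMD field of the header.

```python
import struct

CMD_WRITE, CMD_PIM_ADD = 0x08, 0x12  # assumed opcode values

def make_packet(cmd, addr, imm):
    # header(cmd, addr) + payload(imm) + tail (CRC omitted in this sketch)
    header  = struct.pack("<BQ", cmd, addr)
    payload = struct.pack("<q", imm)
    tail    = struct.pack("<I", 0)
    return header + payload + tail

pim_pkt   = make_packet(CMD_PIM_ADD, 0xF0F0, 1)
write_pkt = make_packet(CMD_WRITE, 0xF0F0, 1)
```

Under these assumptions the two packets are identical except for the command byte, which is why instruction-level PIM slots into the existing READ/WRITE protocol so easily.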

Q: Can we offload all the PIM operations to HMC? What is the thermal impact of PIM in future HMC?

[Figure: the logic layer performs the operation on the DRAM layers and returns an ACK]

SLIDE 7

7/24

Experiment Platform (Pico SC-6 Mini System)

  • Intel Core i7 + FPGA Compute Modules (AC-510)

} AC-510: 4GB HMC 1.1, Kintex Ultrascale

Measure the temperature on the heat sink

  • Controlling memory BW via FPGA
  • Applying three different cooling methods

} High-End Active Heat Sink
} Low-End Active Heat Sink
} Passive Heat Sink

HMC 1.1 has no PIM functionality!

[Figure: AC-510 module — FPGA running the BW-control RTL attached to the HMC]

SLIDE 8

8/24

[Figure: thermal-camera images of the HMC and FPGA, idle vs. busy, under passive, low-end active, and high-end active heat sinks; labels: HMC, FPGA shutdown]

SLIDE 9

9/24

Thermal modeling for HMC 2.0 with commodity-server active cooling

  • HMC 2.0 (w/o PIM) would reach 81°C at full external bandwidth (320 GB/s)

} We validated our thermal model against the measurements on HMC 1.1

[Figure: peak DRAM temperature (40–120°C) vs. data bandwidth (40–320 GB/s) for passive, low-end, commodity, and high-end heat sinks; HMC operating temperature: 0°C–105°C; higher bandwidth requires better cooling]

We need at least commodity-server cooling to benefit from PIM!

SLIDE 10

10/24

PIM increases memory temperature due to power consumption of logic and DRAM layers.

  • In our modeling, the maximum PIM offloading rate is 6.5 PIM ops/ns
  • A high offloading rate could reduce memory performance for cool down

[Figure: peak DRAM temperature (70–110°C) vs. PIM rate (1–7 op/ns); 0°C–85°C is the desirable range, while the 85°C–95°C and 95°C–105°C phases are too hot and reduce memory performance]

SLIDE 11

11/24

Higher PIM offloading rate → higher bandwidth benefits → better performance
Higher PIM offloading rate → higher DRAM temperature → lower memory performance

PIM intensity needs to be controlled!!

SLIDE 12

12/24

CoolPIM

Controls PIM Intensity with Thermal Consideration

SLIDE 13

13/24

We propose two methods for GPU/HMC

1) A SW mechanism with no hardware changes
2) A HW mechanism with changes in GPU architectures

Dynamic source throttling based on thermal warning messages from HMC

  • Thermal warning -> lowers PIM intensity -> reduces internal temperature of HMC

[Figure: the HMC sends a thermal warning back to the GPU; the SW method updates the # of PIM-enabled CUDA blocks via the GPU runtime, while the HW method updates the # of PIM-enabled warps in the GPU hardware, lowering the PIM offloading intensity]
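The feedback loop on this slide can be sketched with a toy thermal model. The linear temperature response and every constant below are assumptions, not measurements: each thermal warning throttles the source one step until the steady-state temperature falls back to the warning threshold.

```python
THRESHOLD = 85.0  # °C, the normal-operating-temperature limit from the deck

def steady_temp(intensity):
    # Assumed linear thermal response to PIM offloading intensity.
    return 70.0 + 3.0 * intensity

intensity = 8  # e.g., # of PIM-enabled blocks/warps
while steady_temp(intensity) > THRESHOLD:  # HMC raises a thermal warning
    intensity -= 1                         # source-throttling step
print(intensity, steady_temp(intensity))   # 5 85.0
```

The real mechanism is event-driven rather than a closed-form loop, but the fixed point is the same: the highest intensity whose steady-state temperature stays inside the normal operating region.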

SLIDE 14

14/24

The GPU runtime implements the following components to control PIM intensity

  • PIM Token Pool (PTP)

} # of maximum thread blocks that are allowed to use PIM functionality

  • Thread Block Manager

} Check PTP and launch PIM code if tokens are available

  • Initialization

} Estimate the initial PTP size based on static analysis at compile time

[Figure: GPU SMs issue memory requests and PIM offloading to the HMC vaults over the links; the HMC's thermal warning reaches the GPU runtime as a thermal interrupt over PCI-E; the interrupt handler updates the PIM token pool, and the thread-block manager launches blocks accordingly]
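A minimal sketch of these runtime pieces (the class and function names are mine, not the paper's): the token pool caps how many thread blocks may run the PIM-enabled kernel, and the thermal interrupt handler shrinks it.

```python
class PIMTokenPool:
    """Caps how many thread blocks may use PIM functionality."""
    def __init__(self, initial_size):
        self.tokens = initial_size  # seeded by compile-time static analysis

    def acquire(self):
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

    def on_thermal_interrupt(self):
        # Thermal warning from the HMC: shrink the pool by one token.
        self.tokens = max(0, self.tokens - 1)

def launch_block(pool):
    # Thread-block manager: launch the PIM kernel if a token is free,
    # otherwise fall back to the shadow non-PIM kernel.
    return "pim_kernel" if pool.acquire() else "non_pim_kernel"

pool = PIMTokenPool(initial_size=2)
print([launch_block(pool) for _ in range(3)])
# ['pim_kernel', 'pim_kernel', 'non_pim_kernel']
```

For brevity this sketch never returns tokens; in a real runtime a finishing thread block would release its token back to the pool.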

SLIDE 15

15/24

The GPU compiler generates PIM-enabled and non-PIM kernels at compile time

  • Source-to-source translation
  • IR-to-IR translation

Original PIM code:

void cuda_kernel(arg_list) {
  for (int i = 0; i < end; i++) {
    uint addr = addrArray[i];
    PIM_Add(addr, 1);
  }
}

Shadow non-PIM code:

void cuda_kernel_np(arg_list) {
  for (int i = 0; i < end; i++) {
    uint addr = addrArray[i];
    atomicAdd(addr, 1);
  }
}
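The source-to-source pass can be sketched as a textual rewrite (the intrinsic-to-atomic mapping follows the deck's table; the helper name and the naive string replacement are my simplification of a real compiler pass): clone the kernel under a new name and swap each PIM intrinsic for its CUDA atomic equivalent.

```python
# Mapping from PIM intrinsics to CUDA atomics (from the deck's table).
PIM_TO_ATOMIC = {"PIM_Add": "atomicAdd"}

def make_shadow_kernel(src, name):
    # Clone the kernel under a "_np" suffix, then rewrite the intrinsics.
    shadow = src.replace(name, name + "_np", 1)
    for pim, atomic in PIM_TO_ATOMIC.items():
        shadow = shadow.replace(pim, atomic)
    return shadow

src = "void cuda_kernel(uint *a) { PIM_Add(a, 1); }"
print(make_shadow_kernel(src, "cuda_kernel"))
# void cuda_kernel_np(uint *a) { atomicAdd(a, 1); }
```

Because both kernel versions exist after compilation, the runtime can pick either one per thread block at launch time with no recompilation.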

SLIDE 16

16/24

PIM Control Unit

  • Controls # of PIM-enabled warps
  • Performs dynamic binary translation
  • See the paper for detail!

[Figure: the PIM control unit sits between the GPU SMs and the HMC links, receiving thermal warnings and controlling the number of PIM-enabled warps]

Type       | PIM Instruction   | Non-PIM Equivalent
Arithmetic | Signed add        | atomicAdd
Bitwise    | Swap, bit write   | atomicExch
Boolean    | AND, OR           | atomicAnd/atomicOr
Comparison | CAS-equal/greater | atomicCAS/atomicMax

SLIDE 17

17/24

Evaluation

SLIDE 18

18/24

Thermal Evaluation

  • Temp Measurement: Real HMC 1.1 Platform
  • Thermal Modeling: HMC 2.0 using 3D-ICE
  • Power & Area: Synopsys (28nm/50nm CMOS)

Performance Evaluation

  • MacSim w/ VaultSim

Benchmark

  • GraphBIG benchmark with LDBC dataset

} BFS, SSSP, PageRank, etc…

[Figure: evaluation flow — Verilog from the HMC spec feeds Synopsys power and area estimation; thermal modeling is validated against FPGA/HMC measurements (BW-control RTL, thermal camera); benchmarks drive the GPU/HMC timing simulation]

SLIDE 19

19/24

Speedup over baseline (Non-Offloading)

  • Naïve/SW/HW: using a commodity-server active heat sink
  • Ideal Thermal: with unlimited cooling

On average, CoolPIM (SW/HW) improves performance by 1.21x/1.25x!

[Figure: speedup over non-offloading (0.0–2.0) for bfs-dtc, bfs-dwc, bfs-ta, bfs-ttc, bfs-twc, dc, kcore, sssp-dtc, sssp-dwc, sssp-ttc, sssp-twc, pagerank, and GMean; series: Non-Offloading, Naïve-Offloading, CoolPIM (SW), CoolPIM (HW), Ideal Thermal]

SLIDE 20

20/24

PIM Offloading Rate

  • Naïve: 3–4 op/ns → temperature goes beyond the normal operating region
  • CoolPIM: 1.3 op/ns → no memory performance slowdown

CoolPIM maintains peak DRAM temperature within normal operating temp!

[Figure: peak DRAM temperature (75–100°C) for bfs-dtc, bfs-dwc, bfs-ta, bfs-ttc, bfs-twc, dc, kcore, sssp-dtc, sssp-dwc, sssp-ttc, sssp-twc, and pagerank; series: Naïve Offloading, CoolPIM (SW), CoolPIM (HW)]

SLIDE 21

21/24

Conclusion

SLIDE 22

22/24

Observation: PIM integration requires careful thermal consideration

  • Naïve PIM offloading may cause thermal issues and degrade overall system performance

CoolPIM: Source throttling techniques to control PIM intensity

  • Keeps the HMC "Cool" to avoid thermal-triggered memory performance degradation

Results: CoolPIM improves performance by 1.37x over naïve offloading

  • 1.2x over non-offloading on average

SLIDE 23

23/24

Thank You

SLIDE 24

24/24

Backup

SLIDE 25

25/24

Type                              | Thermal Resistance | Cooling Power*
Passive heat sink                 | 4.0 °C/W           | —
Low-end active heat sink          | 2.0 °C/W           | 1x
Commodity-server active heat sink | 0.5 °C/W           | 104x
High-end heat sink                | 0.2 °C/W           | 380x

* We assume the same plate-fin heat sink model for all configurations.
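These thermal resistances translate to die temperature via T_die ≈ T_ambient + R_θ × P. The ambient temperature and power draw below are assumed round numbers, not measurements from the deck:

```python
R_THETA = {  # thermal resistance in °C/W, per cooling configuration
    "passive": 4.0, "low_end_active": 2.0,
    "commodity_active": 0.5, "high_end": 0.2,
}

def die_temp(cooling, power_w, ambient_c=30.0):
    # T_die ≈ T_ambient + R_theta * P
    return ambient_c + R_THETA[cooling] * power_w

for name in R_THETA:
    print(name, die_temp(name, power_w=15.0))
```

Even with these made-up inputs the ordering matches the deck's message: at the same power, only the better heat sinks keep the die comfortably below the 85°C limit.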

SLIDE 26

26/24

Validate our thermal evaluation environment

  • Model HMC 1.1 temperature and compare with measurements

[Figure: HMC temperature (20–80°C) under low-end and high-end cooling — measured surface temperature vs. estimated and modeled die temperatures]

SLIDE 27

27/24

Component | Configuration
Host GPU  | 16 PTX SMs, 32 threads/warp, 1.4 GHz; 16KB private L1D and 1MB 16-way L2 cache
HMC       | 8 GB cube, 1 logic die, 8 DRAM dies; 32 vaults, 512 DRAM banks; tCL=tRCD=tRP=13.75ns, tRAS=27.5ns; 4 links per package, 120 GB/s per link (80 GB/s data bandwidth per link)
DRAM      | Temp. phases: 0–85 °C, 85–95 °C, 95–105 °C; 20% DRAM freq. reduction in the high-temp phases
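The temperature phases in this configuration can be sketched as a bandwidth penalty function. Treating the 20% frequency reduction as a flat 20% bandwidth cut, and reusing the 320 GB/s full external bandwidth from earlier slides, is my simplification:

```python
def dram_bw(temp_c, peak_gbs=320.0):
    # 0-105 °C operating range; the >85 °C phases run DRAM 20% slower.
    if temp_c > 105.0:
        raise ValueError("beyond the 0-105 C operating range")
    return peak_gbs * (0.8 if temp_c > 85.0 else 1.0)

print(dram_bw(80.0), dram_bw(90.0))  # 320.0 256.0
```

This cliff at 85°C is exactly the slowdown CoolPIM's throttling is designed to avoid.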

SLIDE 28

28/24

Bandwidth consumption normalized to baseline (non-offloading)

[Figure: normalized bandwidth (0.2–1.0) for bfs-dtc, bfs-dwc, bfs-ta, bfs-ttc, bfs-twc, dc, kcore, sssp-dtc, sssp-dwc, sssp-ttc, sssp-twc, pagerank, and GMean; series: Non-Offloading, Naïve-Offloading, CoolPIM (SW), CoolPIM (HW)]

SLIDE 29

29/24

[Figure: SW throttling flow — on a thermal warning, the interrupt handler reduces the PIM token pool size; the CUDA block manager checks for an issued token, selects the PIM or non-PIM code accordingly, and launches CUDA blocks on the GPU SMs, which offload to the HMC]

SLIDE 30

30/24

Type                | Software-Based | Hardware-Based
Control Granularity | Thread Blocks  | Warps
Control Delay       | Long           | Short
Design Complexity   | Low            | High

SLIDE 31

31/24

[Figure: PIM rate (0.5–2.5 op/ns) over time (2–12 ms) for Naïve Offloading, CoolPIM (SW), and CoolPIM (HW)]