Lifeng Nai*† Ramyad Hadidi† He Xiao† Hyojong Kim† Jaewoong Sim‡ Hyesoon Kim†
†Georgia Institute of Technology
* Google, ‡Intel Labs IPDPS-32 | May 2018 Disclaimer: This work does not relate to Google/Intel Labs
[Figure: Conventional data processing vs. processing-in-memory — instead of moving data (0xf0f0, 0xf0f4) to the processor, PIM sends the instruction (ADD) to the memory.]
3/24
[Figure: In conventional data processing, DRAM temperature rarely exceeds 85°C.]
4/24
5/24
[Figure: Hybrid Memory Cube (HMC) organization — stacked DRAM layers over a logic layer, partitioned into vaults connected by TSVs; the cube communicates with the host via packets over external serial links.]
6/24
PIM-ADD (addr, imm) is sent as a packet: Header (PIM-ADD) | addr, imm | Tail. The logic layer performs the operation in the DRAM layers and returns an ACK.

HMC 2.0 PIM instructions:

Type         HMC 2.0 PIM Instruction
Arithmetic   Signed Add
Bitwise      Swap, bit write
Boolean      AND/NAND/OR/NOR/XOR
Comparison   CAS-equal/greater
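The Header/addr,imm/Tail layout above can be sketched as a host-side struct. This is illustrative only: the real HMC request packet has spec-defined header and tail fields (tag, CUB ID, CRC, ...) that are omitted here, and the field widths are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical PIM request packet, modeled on the slide's
// Header (PIM-ADD) | addr, imm | Tail layout.
enum class PimOp : uint8_t {
    Add, Swap, BitWrite, And, Nand, Or, Nor, Xor, CasEqual, CasGreater
};

struct PimRequest {
    PimOp    op;    // header: which PIM command (e.g., PIM-ADD)
    uint64_t addr;  // payload: target DRAM address
    uint32_t imm;   // payload: immediate operand
    // tail fields (CRC, sequence number, ...) omitted
};

// Build a PIM-ADD request; the host sends it over the serial links,
// the logic layer executes the add near the DRAM dies and returns an ACK.
PimRequest make_pim_add(uint64_t addr, uint32_t imm) {
    return PimRequest{PimOp::Add, addr, imm};
}
```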
7/24
- AC-510 board: 4GB HMC 1.1, Xilinx Kintex UltraScale FPGA running BW-control RTL
- Three cooling configurations: high-end active, low-end active, and passive heat sinks
[Figure: Measurement setup — HMC and FPGA on the AC-510 board.]
8/24
[Figure: Thermal measurements of the HMC and FPGA, idle vs. busy, under passive, low-end active, and high-end active heat sinks; the busy case under weak cooling leads to shutdown.]
9/24
- We validated our thermal model against the measurements on HMC 1.1
[Figure: Modeled peak DRAM temperature (°C) vs. data bandwidth (GB/s, up to 320) for passive, low-end, commodity, and high-end cooling. HMC operating temperature is 0°C-105°C; higher bandwidth demands better cooling.]
We need at least commodity-server cooling to benefit from PIM!
10/24
[Figure: Peak DRAM temperature (°C) vs. PIM offloading rate (op/ns). 0°C-85°C is the desirable range; the 85°C-95°C and 95°C-105°C phases come with reduced memory performance; beyond that is too hot.]
11/24
Higher BW benefits → better performance. Higher DRAM temperature → lower memory performance. The PIM offloading rate must balance the two.
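This tension can be captured in a toy model (all constants here are assumptions for illustration, not numbers from the paper): benefit grows with the offloading rate, but past a thermal threshold the DRAM enters a throttled phase and overall performance drops.

```cpp
#include <algorithm>
#include <cassert>

// Toy performance model of PIM offloading (illustrative constants).
// Offloading saves data movement, but beyond an assumed 4 op/ns the
// DRAM heats into a throttled phase (modeled as a 20% frequency cut).
double relative_performance(double pim_rate /* op/ns */) {
    double offload_benefit = 1.0 + 0.1 * std::min(pim_rate, 5.0); // saturating BW benefit
    double thermal_penalty = (pim_rate > 4.0) ? 0.8 : 1.0;        // throttled when too hot
    return offload_benefit * thermal_penalty;
}
```

Under these assumed constants, a moderate rate outperforms both a low rate (too little offloading benefit) and a maximal rate (thermal throttling), which is the motivation for source throttling.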
12/24
13/24
1) A SW mechanism with no hardware changes
2) A HW mechanism with changes in the GPU architecture
[Figure: Overview — the HMC raises a thermal warning to the GPU, which updates the offloading intensity: the SW method adjusts the number of PIM-enabled CUDA blocks via the GPU runtime, while the HW method adjusts the number of PIM-enabled warps in the GPU hardware.]
14/24
- PIM Token Pool (PTP): the maximum number of thread blocks allowed to use PIM functionality
- The thread-block manager checks the PTP and launches PIM code only if tokens are available
- The initial PTP size is estimated by static analysis at compile time
[Figure: SW mechanism — the HMC forwards a thermal interrupt over the links (via PCI-E); the GPU runtime's interrupt handler shrinks the PIM token pool, and the thread-block manager launches blocks on the SMs accordingly.]
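The token-pool logic can be sketched as a host-side model (class and method names are illustrative, not the paper's runtime API):

```cpp
#include <cassert>

// Sketch of the PIM Token Pool (PTP). The real runtime sizes the pool
// from compile-time static analysis and shrinks it from the thermal
// interrupt handler; this model only mirrors that behavior.
class PimTokenPool {
    int tokens_;  // # of thread blocks currently allowed to use PIM
public:
    explicit PimTokenPool(int initial_size) : tokens_(initial_size) {}

    // Thread-block manager: take a token before launching the PIM version;
    // fall back to the shadow non-PIM code when none are available.
    bool try_acquire() {
        if (tokens_ > 0) { --tokens_; return true; }
        return false;
    }
    // Return the token when the thread block finishes.
    void release() { ++tokens_; }

    // Thermal interrupt handler: reduce offloading intensity.
    void on_thermal_warning() { if (tokens_ > 0) --tokens_; }

    int size() const { return tokens_; }
};
```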
15/24
Original PIM code and its shadow non-PIM code:

    void cuda_kernel(arg_list) {       // Original PIM Code
      for (int i = 0; i < end; i++) {
        uint addr = addrArray[i];
        PIM_Add(addr, 1);
      }
    }

    void cuda_kernel_np(arg_list) {    // Shadow Non-PIM Code
      for (int i = 0; i < end; i++) {
        uint addr = addrArray[i];
        atomicAdd((uint *)addr, 1);    // CUDA atomic on the same address
      }
    }
16/24
[Figure: HW mechanism — the HMC's thermal warning goes to a PIM control unit on the GPU, which controls the number of PIM-enabled warps across the SMs.]

PIM-to-non-PIM instruction mapping:

Type         PIM Instruction      Non-PIM
Arithmetic   Signed Add           atomicAdd
Bitwise      Swap, bit write      atomicExch
Boolean      AND, OR              atomicAnd/atomicOr
Comparison   CAS-equal/greater    atomicCAS/atomicMax
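A minimal sketch of that per-warp decision, assuming a simple warp-ID cap (names and policy are illustrative, not the actual microarchitecture):

```cpp
#include <cassert>

// Sketch of the PIM control unit: it caps how many warps may offload.
// Warps above the cap execute the non-PIM CUDA-atomic equivalent
// (e.g., atomicAdd instead of PIM signed add).
struct PimControlUnit {
    int max_pim_warps;  // current # of PIM-enabled warps

    // Thermal warning from the HMC: throttle by one warp.
    void on_thermal_warning() {
        if (max_pim_warps > 0) --max_pim_warps;
    }
    // Illustrative policy: the first max_pim_warps warp IDs may offload.
    bool may_offload(int warp_id) const {
        return warp_id < max_pim_warps;
    }
};
```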
17/24
18/24
- Graph benchmarks: BFS, SSSP, PageRank, etc.
[Figure: Methodology — GPU/HMC timing simulation coupled with thermal modeling; power and area from Synopsys synthesis of Verilog against the HMC spec; the thermal model validated on the FPGA + HMC platform with BW-control RTL and a thermal camera.]
19/24
[Figure: Speedup over non-offloading (0.0-2.0) for Naïve-Offloading, CoolPIM (SW), CoolPIM (HW), and an Ideal-Thermal configuration across the graph workloads (bfs, dc, kcore, sssp, pagerank on various inputs), with the geometric mean.]
20/24
CoolPIM maintains peak DRAM temperature within normal operating temp!
[Figure: Peak DRAM temperature (°C) per workload for Naïve-Offloading, CoolPIM (SW), and CoolPIM (HW).]
21/24
22/24
23/24
24/24
25/24
Cooling configurations:

Type                               Thermal Resistance   Cooling Power*
Passive heat sink                  4.0 °C/W             -
Low-end active heat sink           2.0 °C/W             1x
Commodity-server active heat sink  0.5 °C/W             104x
High-end heat sink                 0.2 °C/W             380x

* We assume the same plate-fin heat sink model for all configurations.
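With a lumped steady-state model, thermal resistance translates power into a die-temperature rise: T_die ≈ T_ambient + R_th × P. A quick check with assumed values for ambient temperature (45°C) and cube power (15W), which are illustrative and not from the table:

```cpp
#include <cassert>

// Lumped steady-state thermal model: T_die = T_ambient + R_th * P.
double die_temperature(double t_ambient_c, double r_th_c_per_w, double power_w) {
    return t_ambient_c + r_th_c_per_w * power_w;
}
```

Under these assumptions, the passive heat sink (4.0 °C/W) lands at 105°C, the very top of the HMC operating range, while commodity-server cooling (0.5 °C/W) stays near 52.5°C, which is why stronger cooling buys thermal headroom for PIM.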
26/24
[Figure: Validation — surface (measured), die (estimated), and die (modeled) temperatures (°C) for the low-end and high-end heat sinks.]
27/24
Component    Configuration
Host GPU     16 PTX SMs, 32 threads/warp, 1.4GHz;
             16KB private L1D and 1MB 16-way L2 cache
HMC          8GB cube, 1 logic die, 8 DRAM dies;
             32 vaults, 512 DRAM banks;
             tCL=tRCD=tRP=13.75ns, tRAS=27.5ns;
             4 links per package, 120 GB/s per link,
             80 GB/s data bandwidth per link
DRAM Temp.   phases: 0-85°C, 85-95°C, 95-105°C;
             20% DRAM freq reduction in the high-temp phases
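The link parameters imply the cube's aggregate data bandwidth, a one-line check (the helper name is ours):

```cpp
#include <cassert>

// 4 links per package x 80 GB/s data bandwidth per link.
int aggregate_data_bw_gbs(int links, int per_link_gbs) {
    return links * per_link_gbs;
}
```

This gives 320 GB/s per cube, matching the upper end of the bandwidth range swept in the thermal-model study.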
28/24
[Figure: Memory bandwidth normalized to non-offloading (0.2-1.0) for Non-Offloading, Naïve-Offloading, CoolPIM (SW), and CoolPIM (HW) across the graph workloads, with the geometric mean.]
29/24
[Figure: SW mechanism flow — on a thermal warning from the HMC, the interrupt handler reduces the PIM token pool size; the CUDA block manager issues tokens, selects PIM or non-PIM code per block, and launches CUDA blocks on the GPU SMs, which offload to the HMC.]
30/24
31/24
[Figure: PIM offloading rate (op/ns) over time (ms) for Naïve-Offloading, CoolPIM (SW), and CoolPIM (HW).]