

SLIDE 1

Lifeng Nai*† Ramyad Hadidi† He Xiao† Hyojong Kim† Jaewoong Sim‡ Hyesoon Kim†

†Georgia Institute of Technology

* Google, ‡Intel Labs IPDPS-32 | May 2018 Disclaimer: This work does not relate to Google/Intel Labs

SLIDE 2

2/24

Processing-in-memory (PIM) is regaining attention for energy efficient computing

  • Graph Workloads: Data-Intensive, Little Data Reuse

Basic Concept: Offload compute to memory

  • Reduce costly energy consumption of data movement
  • Enable using large internal memory bandwidth
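The data-movement saving behind this concept can be sketched with a toy packet-accounting model. All packet sizes below are assumptions for illustration, not numbers from the deck: a host-side read-modify-write crosses the off-chip link four times, while a PIM offload needs only a request and an acknowledgment.

```python
FLIT = 16  # bytes per HMC flit (assumed; header + tail fit in one flit)

def link_bytes_conventional():
    # Host-side read-modify-write: READ req/resp plus WRITE req/ack.
    read_req  = 1 * FLIT  # header + tail
    read_resp = 2 * FLIT  # header + tail + 16B data payload
    write_req = 2 * FLIT  # header + tail + 16B data payload
    write_ack = 1 * FLIT
    return read_req + read_resp + write_req + write_ack

def link_bytes_pim():
    # The same update as one PIM packet: the operand rides in the request.
    pim_req = 2 * FLIT  # header + operand + tail
    ack     = 1 * FLIT
    return pim_req + ack

print(link_bytes_conventional(), link_bytes_pim())  # 96 48
```

Under these assumed sizes offloading halves the link traffic; the real saving also includes avoiding the data's round trip through the host's cache hierarchy.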

[Figure: Conventional Data Processing vs. Processing-in-Memory — conventionally, data (0xf0f0, 0xf0f4) moves to the host for the ADD; with PIM, the ADD instruction moves to memory]

SLIDE 3

3/24

PIM could increase memory temperature beyond normal operating temperature (85°C)

  • High bandwidth (hundreds of GB/s to TB/s) from 3D-stacked memory
  • Less effective heat transfer compared to DIMMs
  • PIM would make these thermal problems worse!

Too Hot Memory Stack?

  • Slower processing for memory requests
  • Decreasing overall system performance

[Figure: same offloading diagram (DATA: 0xf0f0, DATA: 0xf0f4, INST: ADD); annotation: conventional processing rarely exceeds 85°C]

CoolPIM keeps the memory "Cool" to achieve better PIM performance

SLIDE 4

4/24

Introduction

Hybrid Memory Cube

  • Background
  • Thermal Measurements & Thermal Modeling of Future HMC

CoolPIM

  • Software-Based Throttling
  • Hardware-Based Throttling

Evaluation

Conclusion

SLIDE 5

5/24

A Hybrid Memory Cube (HMC) from Micron

  • Multiple 3D-stacked DRAM layers + one logic layer with TSVs
  • Vaults: equivalent to memory channels
  • Full-duplex serial links between the host and HMC

No PIM functionality for existing HMC products yet

[Figure: HMC structure — multiple DRAM layers over a logic layer, connected by TSVs and partitioned into vaults; packets travel over the external serial links]

SLIDE 6

6/24

Instruction-level PIM supported in future HMC (HMC 2.0)

  • Perform Read-Modify-Write (RMW) operations atomically
  • Similar to READ/WRITE packets; just different CMD in the Header
  • No HMC 2.0 product yet!

PIM-ADD(addr, imm) → packet: Header (PIM-ADD) | addr, imm | Tail

Type       | HMC 2.0 PIM Instruction
Arithmetic | Signed add
Bitwise    | Swap, bit write
Boolean    | AND/NAND/OR/NOR/XOR
Comparison | CAS-equal/greater
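The packet-level view can be sketched as follows. The opcode values and field widths here are invented for illustration, not taken from the HMC 2.0 specification — the point the slide makes is that a PIM-ADD request differs from a WRITE only in the CMD field of the header.

```python
import struct

CMD_WRITE, CMD_PIM_ADD = 0x08, 0x12  # assumed opcode values

def make_packet(cmd, addr, imm):
    # header(cmd, addr) + payload(imm) + tail (CRC omitted in this sketch)
    header  = struct.pack("<BQ", cmd, addr)
    payload = struct.pack("<q", imm)
    tail    = struct.pack("<I", 0)
    return header + payload + tail

pim_pkt   = make_packet(CMD_PIM_ADD, 0xF0F0, 1)
write_pkt = make_packet(CMD_WRITE, 0xF0F0, 1)
```

Under these assumptions the two packets are identical except for the command byte, which is why instruction-level PIM slots into the existing READ/WRITE protocol so easily.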

Q: Can we offload all the PIM operations to HMC? What is the thermal impact of PIM in future HMC?

[Figure: the logic layer performs the operation on the DRAM layers and returns an ACK]

SLIDE 7

7/24

Experiment Platform (Pico SC-6 Mini System)

  • Intel Core i7 + FPGA Compute Modules (AC-510)

} AC-510: 4GB HMC 1.1, Kintex Ultrascale

Measure the temperature on the heat sink

  • Controlling memory BW via FPGA
  • Applying three different cooling methods

} High-End Active Heat Sink
} Low-End Active Heat Sink
} Passive Heat Sink

HMC 1.1 has no PIM functionality!

[Figure: AC-510 module — FPGA running the BW-control RTL attached to the HMC]

SLIDE 8

8/24

[Figure: thermal-camera images of the HMC and FPGA, idle vs. busy, under passive, low-end active, and high-end active heat sinks; labels: HMC, FPGA shutdown]

SLIDE 9

9/24

Thermal modeling for HMC 2.0 with commodity-server active cooling

  • HMC 2.0 (w/o PIM) would reach 81°C at full external bandwidth (320 GB/s)

} We validated our thermal model against the measurements on HMC 1.1

[Figure: peak DRAM temperature (40–120°C) vs. data bandwidth (40–320 GB/s) for passive, low-end, commodity, and high-end heat sinks; HMC operating temperature: 0°C–105°C; higher bandwidth requires better cooling]

We need at least commodity-server cooling to benefit from PIM!

SLIDE 10

10/24

PIM increases memory temperature due to power consumption of logic and DRAM layers.

  • In our modeling, the maximum PIM offloading rate is 6.5 PIM ops/ns
  • A high offloading rate could reduce memory performance for cool down

[Figure: peak DRAM temperature (70–110°C) vs. PIM rate (1–7 op/ns); 0°C–85°C is the desirable range, while the 85°C–95°C and 95°C–105°C phases are too hot and reduce memory performance]

SLIDE 11

11/24

Higher PIM offloading rate → higher bandwidth benefits → better performance
Higher PIM offloading rate → higher DRAM temperature → lower memory performance

PIM intensity needs to be controlled!!

SLIDE 12

12/24

CoolPIM

Controls PIM Intensity with Thermal Consideration

SLIDE 13

13/24

We propose two methods for GPU/HMC

1) A SW mechanism with no hardware changes
2) A HW mechanism with changes in GPU architectures

Dynamic source throttling based on thermal warning messages from HMC

  • Thermal warning -> lowers PIM intensity -> reduces internal temperature of HMC

[Figure: the HMC sends a thermal warning back to the GPU; the SW method updates the # of PIM-enabled CUDA blocks via the GPU runtime, while the HW method updates the # of PIM-enabled warps in the GPU hardware, lowering the PIM offloading intensity]
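The feedback loop on this slide can be sketched with a toy thermal model. The linear temperature response and every constant below are assumptions, not measurements: each thermal warning throttles the source one step until the steady-state temperature falls back to the warning threshold.

```python
THRESHOLD = 85.0  # °C, the normal-operating-temperature limit from the deck

def steady_temp(intensity):
    # Assumed linear thermal response to PIM offloading intensity.
    return 70.0 + 3.0 * intensity

intensity = 8  # e.g., # of PIM-enabled blocks/warps
while steady_temp(intensity) > THRESHOLD:  # HMC raises a thermal warning
    intensity -= 1                         # source-throttling step
print(intensity, steady_temp(intensity))   # 5 85.0
```

The real mechanism is event-driven rather than a closed-form loop, but the fixed point is the same: the highest intensity whose steady-state temperature stays inside the normal operating region.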

SLIDE 14

14/24

The GPU runtime implements the following components to control PIM intensity

  • PIM Token Pool (PTP)

} # of maximum thread blocks that are allowed to use PIM functionality

  • Thread Block Manager

} Check PTP and launch PIM code if tokens are available

  • Initialization

} Estimate the initial PTP size based on static analysis at compile time

[Figure: GPU SMs issue memory requests and PIM offloading to the HMC vaults over the links; the HMC's thermal warning reaches the GPU runtime as a thermal interrupt over PCI-E; the interrupt handler updates the PIM token pool, and the thread-block manager launches blocks accordingly]
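A minimal sketch of these runtime pieces (the class and function names are mine, not the paper's): the token pool caps how many thread blocks may run the PIM-enabled kernel, and the thermal interrupt handler shrinks it.

```python
class PIMTokenPool:
    """Caps how many thread blocks may use PIM functionality."""
    def __init__(self, initial_size):
        self.tokens = initial_size  # seeded by compile-time static analysis

    def acquire(self):
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

    def on_thermal_interrupt(self):
        # Thermal warning from the HMC: shrink the pool by one token.
        self.tokens = max(0, self.tokens - 1)

def launch_block(pool):
    # Thread-block manager: launch the PIM kernel if a token is free,
    # otherwise fall back to the shadow non-PIM kernel.
    return "pim_kernel" if pool.acquire() else "non_pim_kernel"

pool = PIMTokenPool(initial_size=2)
print([launch_block(pool) for _ in range(3)])
# ['pim_kernel', 'pim_kernel', 'non_pim_kernel']
```

For brevity this sketch never returns tokens; in a real runtime a finishing thread block would release its token back to the pool.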

SLIDE 15

15/24

The GPU compiler generates PIM-enabled and non-PIM kernels at compile time

  • Source-to-source translation
  • IR-to-IR translation

Original PIM code:

void cuda_kernel(arg_list) {
  for (int i = 0; i < end; i++) {
    uint addr = addrArray[i];
    PIM_Add(addr, 1);
  }
}

Shadow non-PIM code:

void cuda_kernel_np(arg_list) {
  for (int i = 0; i < end; i++) {
    uint addr = addrArray[i];
    atomicAdd(addr, 1);
  }
}
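The source-to-source pass can be sketched as a textual rewrite (the intrinsic-to-atomic mapping follows the deck's table; the helper name and the naive string replacement are my simplification of a real compiler pass): clone the kernel under a new name and swap each PIM intrinsic for its CUDA atomic equivalent.

```python
# Mapping from PIM intrinsics to CUDA atomics (from the deck's table).
PIM_TO_ATOMIC = {"PIM_Add": "atomicAdd"}

def make_shadow_kernel(src, name):
    # Clone the kernel under a "_np" suffix, then rewrite the intrinsics.
    shadow = src.replace(name, name + "_np", 1)
    for pim, atomic in PIM_TO_ATOMIC.items():
        shadow = shadow.replace(pim, atomic)
    return shadow

src = "void cuda_kernel(uint *a) { PIM_Add(a, 1); }"
print(make_shadow_kernel(src, "cuda_kernel"))
# void cuda_kernel_np(uint *a) { atomicAdd(a, 1); }
```

Because both kernel versions exist after compilation, the runtime can pick either one per thread block at launch time with no recompilation.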

SLIDE 16

16/24

PIM Control Unit

  • Controls # of PIM-enabled warps
  • Performs dynamic binary translation
  • See the paper for detail!

[Figure: the PIM control unit sits between the GPU SMs and the HMC links, receiving thermal warnings and controlling the number of PIM-enabled warps]

Type       | PIM Instruction   | Non-PIM Equivalent
Arithmetic | Signed add        | atomicAdd
Bitwise    | Swap, bit write   | atomicExch
Boolean    | AND, OR           | atomicAnd/atomicOr
Comparison | CAS-equal/greater | atomicCAS/atomicMax

SLIDE 17

17/24

Evaluation

SLIDE 18

18/24

Thermal Evaluation

  • Temp Measurement: Real HMC 1.1 Platform
  • Thermal Modeling: HMC 2.0 using 3D-ICE
  • Power & Area: Synopsys (28nm/50nm CMOS)

Performance Evaluation

  • MacSim w/ VaultSim

Benchmark

  • GraphBIG benchmark with LDBC dataset

} BFS, SSSP, PageRank, etc…

[Figure: evaluation flow — Verilog from the HMC spec feeds Synopsys power and area estimation; thermal modeling is validated against FPGA/HMC measurements (BW-control RTL, thermal camera); benchmarks drive the GPU/HMC timing simulation]

SLIDE 19

19/24

Speedup over baseline (Non-Offloading)

  • Naïve/SW/HW: using a commodity-server active heat sink
  • Ideal Thermal: with unlimited cooling

On average, CoolPIM (SW/HW) improves performance by 1.21x/1.25x!

[Figure: speedup over non-offloading (0.0–2.0) for bfs-dtc, bfs-dwc, bfs-ta, bfs-ttc, bfs-twc, dc, kcore, sssp-dtc, sssp-dwc, sssp-ttc, sssp-twc, pagerank, and GMean; series: Non-Offloading, Naïve-Offloading, CoolPIM (SW), CoolPIM (HW), Ideal Thermal]

SLIDE 20

20/24

PIM Offloading Rate

  • Naïve: 3–4 op/ns → temperature goes beyond the normal operating region
  • CoolPIM: 1.3 op/ns → no memory performance slowdown

CoolPIM maintains peak DRAM temperature within normal operating temp!

[Figure: peak DRAM temperature (75–100°C) for bfs-dtc, bfs-dwc, bfs-ta, bfs-ttc, bfs-twc, dc, kcore, sssp-dtc, sssp-dwc, sssp-ttc, sssp-twc, and pagerank; series: Naïve Offloading, CoolPIM (SW), CoolPIM (HW)]

SLIDE 21

21/24

Conclusion

SLIDE 22

22/24

Observation: PIM integration requires careful thermal consideration

  • Naïve PIM offloading may cause thermal issues and degrade overall system performance

CoolPIM: Source throttling techniques to control PIM intensity

  • Keeps the HMC "Cool" to avoid thermal-triggered memory performance degradation

Results: CoolPIM improves performance by 1.37x over naïve offloading

  • 1.2x over non-offloading on average

SLIDE 23

23/24

Thank You

SLIDE 24

24/24

Backup

SLIDE 25

25/24

Type                              | Thermal Resistance | Cooling Power*
Passive heat sink                 | 4.0 °C/W           | —
Low-end active heat sink          | 2.0 °C/W           | 1x
Commodity-server active heat sink | 0.5 °C/W           | 104x
High-end heat sink                | 0.2 °C/W           | 380x

* We assume the same plate-fin heat sink model for all configurations.
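These thermal resistances translate to die temperature via T_die ≈ T_ambient + R_θ × P. The ambient temperature and power draw below are assumed round numbers, not measurements from the deck:

```python
R_THETA = {  # thermal resistance in °C/W, per cooling configuration
    "passive": 4.0, "low_end_active": 2.0,
    "commodity_active": 0.5, "high_end": 0.2,
}

def die_temp(cooling, power_w, ambient_c=30.0):
    # T_die ≈ T_ambient + R_theta * P
    return ambient_c + R_THETA[cooling] * power_w

for name in R_THETA:
    print(name, die_temp(name, power_w=15.0))
```

Even with these made-up inputs the ordering matches the deck's message: at the same power, only the better heat sinks keep the die comfortably below the 85°C limit.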

SLIDE 26

26/24

Validate our thermal evaluation environment

  • Model HMC 1.1 temperature and compare with measurements

[Figure: HMC temperature (20–80°C) under low-end and high-end cooling — measured surface temperature vs. estimated and modeled die temperatures]

SLIDE 27

27/24

Component | Configuration
Host GPU  | 16 PTX SMs, 32 threads/warp, 1.4 GHz; 16KB private L1D and 1MB 16-way L2 cache
HMC       | 8 GB cube, 1 logic die, 8 DRAM dies; 32 vaults, 512 DRAM banks; tCL=tRCD=tRP=13.75ns, tRAS=27.5ns; 4 links per package, 120 GB/s per link (80 GB/s data bandwidth per link)
DRAM      | Temp. phases: 0–85 °C, 85–95 °C, 95–105 °C; 20% DRAM freq. reduction in the high-temp phases
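The temperature phases in this configuration can be sketched as a bandwidth penalty function. Treating the 20% frequency reduction as a flat 20% bandwidth cut, and reusing the 320 GB/s full external bandwidth from earlier slides, is my simplification:

```python
def dram_bw(temp_c, peak_gbs=320.0):
    # 0-105 °C operating range; the >85 °C phases run DRAM 20% slower.
    if temp_c > 105.0:
        raise ValueError("beyond the 0-105 C operating range")
    return peak_gbs * (0.8 if temp_c > 85.0 else 1.0)

print(dram_bw(80.0), dram_bw(90.0))  # 320.0 256.0
```

This cliff at 85°C is exactly the slowdown CoolPIM's throttling is designed to avoid.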

SLIDE 28

28/24

Bandwidth consumption normalized to baseline (non-offloading)

[Figure: normalized bandwidth (0.2–1.0) for bfs-dtc, bfs-dwc, bfs-ta, bfs-ttc, bfs-twc, dc, kcore, sssp-dtc, sssp-dwc, sssp-ttc, sssp-twc, pagerank, and GMean; series: Non-Offloading, Naïve-Offloading, CoolPIM (SW), CoolPIM (HW)]

SLIDE 29

29/24

[Figure: SW throttling flow — on a thermal warning, the interrupt handler reduces the PIM token pool size; the CUDA block manager checks for an issued token, selects the PIM or non-PIM code accordingly, and launches CUDA blocks on the GPU SMs, which offload to the HMC]

SLIDE 30

30/24

Type                | Software-Based | Hardware-Based
Control Granularity | Thread Blocks  | Warps
Control Delay       | Long           | Short
Design Complexity   | Low            | High

SLIDE 31

31/24

[Figure: PIM rate (0.5–2.5 op/ns) over time (2–12 ms) for Naïve Offloading, CoolPIM (SW), and CoolPIM (HW)]