


Compressing DMA Engine: Leveraging Activation Sparsity For Training Deep Neural Networks

Minsoo Rhu✝, Mike O’Connor*, Niladrish Chatterjee*, Jeff Pool*, Youngeun Kwon✝, and Stephen W. Keckler*

POSTECH✝ and NVIDIA*


Motivation


ML trends: deeper & larger DNN models

From AlexNet to ResNet

[AlexNet*]

* Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012

7 convolutional layers (2012)


ML trends: deeper & larger DNN models

From AlexNet to ResNet

[ResNet*]

* He et al., “Deep Residual Learning for Image Recognition”, CVPR-2016

153 convolutional layers (2016)


Memory “capacity” limits in DNN training

Training large & deep DNNs incurs large memory allocations

[Quoted: The Next Platform, “Baidu eyes deep learning strategy in wake of new GPU options”, April 26th 2016]


Prior solution: virtualized DNN (vDNN)

Expose both CPU and GPU memory for allocating DNN training data

[Diagram: CPU memory and GPU memory connected over PCIe; training data is spilled to CPU memory and later migrated back to GPU memory when it is needed again]

* Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, MICRO-2016
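To make the spill/migrate idea concrete, here is a minimal sketch of the data movement only; it is an illustration, not vDNN's implementation, and it assumes PyTorch with a CUDA-capable GPU.

```python
# A minimal sketch of vDNN-style data movement (illustration only): spill an
# activation tensor to pinned CPU memory over PCIe, then migrate it back to
# GPU memory when it is needed again.
import torch

def spill_to_cpu(act_gpu: torch.Tensor) -> torch.Tensor:
    """Copy a GPU activation into pinned CPU memory so the GPU copy can be freed."""
    buf = torch.empty(act_gpu.shape, dtype=act_gpu.dtype,
                      device="cpu", pin_memory=True)
    buf.copy_(act_gpu, non_blocking=True)
    return buf

def migrate_to_gpu(act_cpu: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Copy a spilled activation back into GPU memory."""
    return act_cpu.to(device, non_blocking=True)

if torch.cuda.is_available():
    act = torch.relu(torch.randn(96, 55, 55, device="cuda"))  # e.g. one conv output
    cpu_copy = spill_to_cpu(act)      # spill to CPU memory
    del act                           # the GPU copy can now be released
    act = migrate_to_gpu(cpu_copy)    # migrate back to GPU memory
```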


Large Model Support (LMS) with PowerAI

Expose both CPU and GPU memory for allocating DNN training data

* https://developer.ibm.com/linuxonpower/2017/09/22/realizing-value-large-model-support-lms-powerai-ibm-caffe/


HPC system node for deep learning

Multiple GPUs (4 to 8) connected under a PCIe root complex

[Diagram: CPUs linked by QuickPath Interconnect (QPI), each with high-capacity, low-bandwidth memory (DDR4); the GPUs use low-capacity, high-bandwidth stacked memory (HBM). Big data and deeper & wider neural networks drive GPU-CPU migration traffic across PCIe]

Challenge: PCIe channel bandwidth becomes a performance bottleneck!


Opportunity: “sparse” data structures

Amplify effective PCIe bandwidth by compressing the CPU-migrated data

[Diagram: activations spilled from GPU memory to CPU memory over PCIe are compressed before the transfer, so fewer bytes cross the link]


Key contributions of this work

  • Application characterization study on sparsity when training convolutional neural networks
  • Architectural support for leveraging activation sparsity in virtualized DNNs


  • Q. How much sparsity do DNNs exhibit during training?


Case study) AlexNet

Characterizing the changes in layer density during training

[Figure: a test image is fed through AlexNet at six training checkpoints: Trained (0%), (20%), (40%), (60%), (80%), and (100%). The conv0 feature maps, 96 channels of (55x55) 2D images, i.e. shape (96, 55, 55), are visualized at each checkpoint]

Average layer density of conv0: 49% (51% of activations are 0-valued)

* Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012
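A minimal sketch of how such a density number can be measured, under the assumption that a layer's post-ReLU activations are available as a NumPy array (this is an illustration, not the paper's tooling):

```python
# Layer density = fraction of activation values that are non-zero.
import numpy as np

def layer_density(activations: np.ndarray) -> float:
    return np.count_nonzero(activations) / activations.size

# Example with a fake conv0-shaped output (96, 55, 55) after a ReLU;
# real AlexNet conv0 activations average ~0.49, per the slide above.
acts = np.maximum(np.random.randn(96, 55, 55), 0.0)
print(f"density = {layer_density(acts):.2f}")
```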


Case study) AlexNet

Characterizing the changes in layer density during training

[Figure: conv1 (256, 27, 27) feature maps at the same six training checkpoints]

Average layer density of conv1: 36% (64% of activations are 0-valued)


Case study) AlexNet

Characterizing the changes in layer density during training

[Figure: conv4 (256, 13, 13) feature maps at the same six training checkpoints]

Average layer density of conv4: 22% (78% of activations are 0-valued)


Case study) AlexNet

Putting everything together

[Figure: per-layer activation density of AlexNet over training time (0% to 100%)]


Observation #1: The first CONV layer consistently exhibits around 50% layer density across the entire training process.

Observation #2: Pooling layers always increase overall activation density.

Observation #3: Within each layer, activation density decreases rapidly during the initial training period; once training reaches the fine-tuning stage, density gradually climbs back up.

Observation #4: Later layers are generally sparser than earlier layers.
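Curves like the ones summarized above can be gathered by logging per-layer density at each checkpoint. A minimal, hypothetical sketch (assuming PyTorch; `model` is a placeholder for whichever network is being trained, and this is not the paper's instrumentation):

```python
# Record the post-ReLU activation density of every layer during a forward pass.
import torch

densities = {}

def make_hook(name):
    def hook(module, inputs, output):
        densities[name] = output.count_nonzero().item() / output.numel()
    return hook

# Hypothetical usage at each training checkpoint:
#   for name, m in model.named_modules():
#       if isinstance(m, torch.nn.ReLU):
#           m.register_forward_hook(make_hook(name))
#   model(test_image)      # one forward pass
#   print(densities)       # one density value per layer, as plotted above
```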


Case study) VGG-16

Putting everything together

[Figure: per-layer activation density of VGG-16; deeper layers are sparser]


What causes such behavior in DNNs?

Discussed in much more detail in our paper.


What causes such behavior in DNNs?

Observation #4: Sparsity increases as you go deeper into the network.

[Figure: example input images and the activations they trigger at shallow and deep layers*]

First few layers: filters are trained to respond to “class-invariant” features
  • Corners
  • Edges
  • Colors

Deeper layers: filters respond to more “class-specific” features (e.g., textures)

For “deep” neural networks, there exists significant sparsity in activations (40% to 90% layer-wise sparsity).

* Zeiler et al., “Visualizing and Understanding Convolutional Networks”, arXiv.org, 2013


Compressing DMA Engine (cDMA)


Baseline CPU-GPU system interconnect

  • Max. 16 GB/sec communication channel between the CPU and the GPU

[Diagram: the CPU, with its memory controller and CPU DRAM, connects to the GPU over a 16 GB/s PCIe link; inside the GPU, the DMA engine and the SMs share a crossbar to the MC/L2 partitions, which provide 336 GB/s to GPU DRAM]


Compressing DMA architecture

Goal: Saturate the PCIe channel with compressed activation maps

[Diagram: the same baseline system, with compressed data now flowing from the GPU memory system through the DMA engine and across the 16 GB/s PCIe link]

  • Q. How should the memory subsystem interact with the DMA engine?

Compressing DMA architecture

DRAM read-BW should be high enough to generate compressed data


Requirement: DRAM read throughput ≥ (compression rate × PCIe bandwidth)
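A back-of-the-envelope check of this condition with the numbers on the slide (16 GB/s PCIe, 336 GB/s GPU DRAM); the compression rate here is an assumed, illustrative value:

```python
# To keep a 16 GB/s PCIe link saturated with data that compresses by
# `compression_rate`, the engine must read uncompressed activations from
# DRAM at compression_rate x 16 GB/s.
pcie_bw_gbs = 16.0
dram_read_bw_gbs = 336.0
compression_rate = 4.0                               # illustrative value only
required_read_bw = compression_rate * pcie_bw_gbs    # 64 GB/s for this example
assert dram_read_bw_gbs >= required_read_bw          # easily met by 336 GB/s
```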


Compressing DMA architecture

Challenge: GPU crossbar bandwidth would need to be amplified proportionally, since the DMA engine would have to pull the uncompressed data through the crossbar at that higher rate


Compressing DMA architecture

Solution: Compress data “before” routing it through the crossbar

[Diagram: a compression unit (C) is added at each MC/L2 partition, and a buffer (B) aggregates the compressed data from all MCs before the cDMA engine sends it over the 16 GB/s PCIe link]

C : Compression unit
B : Buffer to aggregate compressed data from all MCs


Compression algorithms

  • 1. Run-length encoding
    + Simple to implement, well-suited for high-throughput compression
    - Compression rate is good only when zero values are clustered
  • 2. Zlib compression
    + Exhibits a good compression rate for a variety of data patterns
    - Designing high-throughput compression hardware is challenging
      (e.g., dedicated ASIC/FPGA solutions provide roughly 2.5 GB/sec)
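As a concrete illustration of the run-length idea above, a minimal sketch follows; it assumes a flat list of activation values and is not the hardware encoder:

```python
# Collapse each run of zeros into a single (0, run_length) pair; non-zero
# values pass through unchanged. The rate is only good when zeros cluster.
def rle_zeros(values):
    out, run = [], 0
    for v in values:
        if v == 0:
            run += 1
            continue
        if run:
            out.append((0, run))
            run = 0
        out.append(v)
    if run:
        out.append((0, run))
    return out

print(rle_zeros([3, 0, 0, 0, 7, 0, 5]))   # [3, (0, 3), 7, (0, 1), 5]
```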


Proposed compression algorithm

Frequent-value compression (encoding sparseness)

[Figure: zero-value compression. Uncompressed: N data elements (a, b, c, …), many of them zero (“has zeros”). Compressed: an N-bit metadata bitmask, one bit per element marking the non-zero positions, followed by only the non-zero data values]
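A minimal software sketch of this format, assuming NumPy; it illustrates the encoding only, not the hardware design:

```python
# Zero-value compression: keep an N-bit mask of non-zero positions plus the
# non-zero values; decompression scatters the values back into place.
import numpy as np

def zvc_compress(block: np.ndarray):
    mask = block != 0            # metadata: one bit per element
    return mask, block[mask]     # data: non-zero values only

def zvc_decompress(mask: np.ndarray, values: np.ndarray) -> np.ndarray:
    out = np.zeros(mask.shape, dtype=values.dtype)
    out[mask] = values
    return out

block = np.array([0, 1.5, 0, 0, 2.0, 0, 0, 3.0], dtype=np.float32)
mask, vals = zvc_compress(block)
assert np.array_equal(zvc_decompress(mask, vals), block)
```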


Compression microarchitecture

Frequent-value compression (encoding sparseness)

[Diagram: input data (32B x 4) passes through per-element ≠0 comparators, a prefix sum, and a bubble-collapsing shifter; a shift-and-append stage packs the mask segments and the surviving values into a compressed 128B buffer tracked by a length register]

Area overhead (FreePDK + CACTI): 1.5 mm2 in a 28 nm process
(Note) GV100 die size: 800 mm2
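A small software model of the datapath above, for intuition only; it assumes NumPy and operates on a single input block:

```python
# Per-element "!= 0" flags, a prefix sum to assign each surviving value its
# output slot, and a bubble-collapsing scatter that packs the non-zeros.
import numpy as np

def compact_block(block: np.ndarray):
    flags = (block != 0).astype(np.int32)          # the != 0 comparators
    slots = np.cumsum(flags) - 1                   # prefix sum -> output index
    packed = np.zeros(int(flags.sum()), dtype=block.dtype)
    packed[slots[flags == 1]] = block[flags == 1]  # bubble-collapsing shift
    return flags, packed                           # mask segment + packed data

flags, packed = compact_block(np.array([0, 4, 0, 7, 0, 0, 9]))
print(flags.tolist(), packed.tolist())             # [0, 1, 0, 1, 0, 0, 1] [4, 7, 9]
```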

Results


Evaluation

Methodology

Application characterization & datasets
  • Model: trained from scratch using Caffe
  • Activations: collected at training time and fed into the compression module

Performance evaluation (hybrid approach)
  • Real GPU: measured using vDNN* with the CPU-migrated data properly compressed
  • Analytical model: penalize performance when cDMA’s DRAM bandwidth pressure is high

* Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, MICRO-2016
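For the characterization part, the per-layer compression ratio of the zero-value scheme can be estimated directly from the collected activations. A minimal sketch, assuming FP32 activations in a NumPy array; this is illustrative, not the paper's toolchain:

```python
# ZV compression ratio = uncompressed bits / compressed bits, where the
# compressed form stores a 1-bit mask per element plus the non-zero values.
import numpy as np

def zv_ratio(acts: np.ndarray, bits_per_value: int = 32) -> float:
    n = acts.size
    nonzero = int(np.count_nonzero(acts))
    uncompressed = n * bits_per_value
    compressed = n + nonzero * bits_per_value
    return uncompressed / compressed

acts = np.maximum(np.random.randn(256, 13, 13), 0)  # a fake conv4-like output
print(f"ZV compression ratio ~ {zv_ratio(acts):.2f}x")
```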


Avg/Max compression rate

Higher is better

[Chart: compression ratio (1x to 16x) achieved by RL, ZV, and ZL on AlexNet, OverFeat, NiN, VGG, SqueezeNet, and GoogLeNet, showing the average over the network and the maximum over any single layer]

RL (run-length encoding), ZV (zero-value compression), ZL (Zlib compression)


CPU-GPU data traffic size

Lower is better

[Chart: offload size of cDMA with RL, ZV, and ZL, normalized to vDNN, on AlexNet, OverFeat, NiN, VGG, SqueezeNet, and GoogLeNet]

RL (run-length encoding), ZV (zero-value compression), ZL (Zlib compression)


Performance

Higher is better

[Chart: performance of vDNN and of cDMA with RL, ZV, and ZL (normalized) on AlexNet, OverFeat, NiN, VGG, SqueezeNet, and GoogLeNet]

RL (run-length encoding), ZV (zero-value compression), ZL (Zlib compression)


Conclusions

Compressing DMA engine: architectural support for sparse CNN training
  • Avg 2.6x (max 13.8x) compression rate
  • Avg 53% (max 79%) speedup on a Pascal Titan Xp


Backup


Training vs. inference

Deep learning for image classification

Inference: the DNN model is fixed (so activations stay constant for the same input set)

Training: the DNN model gets constantly updated over the course of training (so the activation map values also change accordingly)


Case study) AlexNet

Characterizing the changes in layer density during training

[Figure: fc1 (4096, 1, 1) activations at the same six training checkpoints]

Average layer density of fc1: 31% (69% of activations are 0-valued)