Slide 1
Compressing DMA Engine: Leveraging Activation Sparsity For Training Deep Neural Networks
Minsoo Rhu✝, Mike O’Connor*, Niladrish Chatterjee*, Jeff Pool*, Youngeun Kwon✝, and Stephen W. Keckler*
POSTECH✝ and NVIDIA*
Slide 3
[AlexNet*]
* Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012
Slide 4
[ResNet*]
* He et al., “Deep Residual Learning for Image Recognition”, CVPR-2016
Slide 5
— The Next Platform, “Baidu eyes deep learning strategy in wake of new GPU options”, April 26th 2016
Slides 6-8
[Diagram (vDNN*): CPU memory and GPU memory. Activations are spilled to CPU memory and later migrated back to GPU memory.]
* Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, MICRO-2016
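The spill/migrate flow above maps naturally onto asynchronous DMA copies over PCIe. A minimal sketch of the idea, assuming a pinned host staging buffer and a dedicated CUDA stream (the function names are illustrative, not vDNN's actual API):

#include <cuda_runtime.h>

// Sketch: offload a layer's activations to CPU memory, then prefetch
// them back into GPU memory before they are needed again.
// host_buf must be pinned (cudaMallocHost) so the DMA engine can stream
// it over PCIe without staging through pageable memory.
void offload(const float* dev_acts, float* host_buf, size_t bytes, cudaStream_t s) {
    cudaMemcpyAsync(host_buf, dev_acts, bytes, cudaMemcpyDeviceToHost, s);
}

void prefetch(float* dev_acts, const float* host_buf, size_t bytes, cudaStream_t s) {
    cudaMemcpyAsync(dev_acts, host_buf, bytes, cudaMemcpyHostToDevice, s);
}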
Slide 9
* https://developer.ibm.com/linuxonpower/2017/09/22/realizing-value-large-model-support-lms-powerai-ibm-caffe/
Slides 10-12
[Diagram: CPU with high-capacity, low-bandwidth memory (DDR4) connected over QuickPath Interconnect (QPI) to a GPU with low-capacity, high-bandwidth stacked memory (HBM). Big data and deeper & wider neural networks drive growing GPU-CPU migration traffic.]
Slide 13
[Diagram: CPU memory and GPU memory; data is spilled from GPU memory to CPU memory.]
Slide 17
[AlexNet*]
* Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012
Slide 18
[AlexNet*] Test image
* Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012
Slides 19-24
[Figure: conv0 (96, 55, 55) feature maps for a test image, visualized at 0%, 20%, 40%, 60%, 80%, and 100% of training; each of the 96 channels is a (55x55) 2D image.]
Slide 25
[Figure: conv1 (256, 27, 27) feature maps, visualized at 0% to 100% of training.]
Slide 26
[Figure: conv4 (256, 13, 13) feature maps, visualized at 0% to 100% of training.]
Slides 27-28
[Chart: activation density over training time (0% to 100%).]
Slide 29
Observation #1: The first CONV layer consistently exhibits around 50% activation density across the entire training process.
Slide 30
Observation #2: Pooling layers always increase overall activation density.
Slide 31
Observation #3: Within each layer, activation density decreases rapidly during the initial training period; once training reaches the fine-tuning stage, density gradually crawls back up.
Slide 32
Observation #4: Later layers are generally sparser than earlier layers.
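These density numbers come from counting zeros in each layer's activations; post-ReLU activations are exactly zero wherever the pre-activation was negative. A minimal sketch of the measurement:

#include <cstddef>

// Fraction of non-zero values in an activation buffer.
double activation_density(const float* acts, size_t n) {
    size_t nonzero = 0;
    for (size_t i = 0; i < n; ++i)
        if (acts[i] != 0.0f) ++nonzero;
    return static_cast<double>(nonzero) / static_cast<double>(n);
}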
Slides 33-35
Deeper ⇒ Sparser
Slides 36-41
[Figure: input images and their corresponding activations*.]
* Zeiler et al., “Visualizing and Understanding Convolutional Networks”, arXiv.org, 2013
Slides 43-46
[Diagram: six SMs connected through a crossbar to six L2/MC partitions backed by GPU DRAM (336 GB/s); a DMA engine connects over PCIe (16 GB/s) to the CPU memory controller (CPU MC) and CPU DRAM. The DMA engine transfers compressed data over PCIe.]
Requirement: DRAM read throughput ≥ (compression ratio × PCIe bandwidth)
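To make the requirement concrete, a worked example with the bandwidth numbers on the slide (the 4x compression ratio is an assumed, illustrative value):

keeping PCIe busy at 16 GB/s with 4x-compressed data requires reading
16 GB/s × 4 = 64 GB/s of uncompressed activations from GPU DRAM,
well under the 336 GB/s of DRAM bandwidth but no longer negligible
when concurrently running kernels also contend for it.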
Slides 47-48
[Diagram: the same system with the DMA engine replaced by a cDMA engine; a compression unit (C) sits at each L2/MC partition, and a buffer (B) aggregates the compressed data from all MCs.]
C: compression unit
B: buffer to aggregate compressed data from all MCs
Slide 49
+ Simple to implement, well-suited for high-throughput compression
+ Exhibits a good compression rate for a variety of data patterns
e.g., dedicated ASIC/FPGA solutions provide roughly 2.5 GB/sec of data compression throughput
50
(C) Minsoo Rhu
has zeros
a b c d e f g h
Data
i j k l m n
Metadata Data has zeros
a b c d e f g h i j k l m n
< Uncompressed > < Compressed >
Slide 51
[Figure: zero-value compression. Uncompressed: N data elements, some of them zero. Compressed: an N-bit metadata bitmask (one bit per element, 1 = non-zero) followed by only the non-zero elements, e.g., a b c d e f.]
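In software terms, the format above amounts to the following. A minimal functional sketch of zero-value compression for FP32 data (a model of the format, not of the hardware pipeline):

#include <cstddef>
#include <cstdint>
#include <vector>

struct ZVCompressed {
    std::vector<uint8_t> mask;     // metadata: 1 bit per element, 1 = non-zero
    std::vector<float>   nonzeros; // data: only the non-zero elements, in order
};

ZVCompressed zv_compress(const std::vector<float>& in) {
    ZVCompressed out;
    out.mask.assign((in.size() + 7) / 8, 0);
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] != 0.0f) {
            out.mask[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
            out.nonzeros.push_back(in[i]);
        }
    }
    return out;
}

std::vector<float> zv_decompress(const ZVCompressed& c, size_t n) {
    std::vector<float> out(n, 0.0f);
    size_t next = 0; // index into the dense non-zero stream
    for (size_t i = 0; i < n; ++i)
        if (c.mask[i / 8] & (1u << (i % 8)))
            out[i] = c.nonzeros[next++];
    return out;
}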
52
(C) Minsoo Rhu
≠ 0? ≠ 0? ≠ 0? ≠ 0? ≠ 0? ≠ 0? ≠ 0? ≠ 0? Prefix Sum Bubble-collapsing Shifter
10011010 10011010 4
Shift-And-Append
+
10011010
Shift-And-Append
Buffer Length Reg. Mask Segment Mask
Compressed 128B Buffer Input Data (32B x 4)
<Area overhead>
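The bubble-collapsing shifter is, functionally, a compaction driven by a prefix sum over the non-zero mask: each non-zero element's output offset equals the count of non-zeros before it. A small sketch of that indexing (a functional model, not the RTL):

#include <cstddef>

// Compact the non-zero elements of one input chunk into 'out',
// appending at 'base' (the current compressed-buffer length).
// The write offset of each element is an exclusive prefix sum of
// the non-zero indicator; returns the new buffer length.
size_t bubble_collapse(const float* chunk, size_t n, float* out, size_t base) {
    size_t offset = base;
    for (size_t i = 0; i < n; ++i)
        if (chunk[i] != 0.0f)
            out[offset++] = chunk[i]; // offset == base + (#non-zeros before i)
    return offset; // becomes 'base' for the next chunk (shift-and-append)
}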
Slides 54-56
Application characterization & datasets
- Model: trained from scratch using Caffe
- Activations: collected at training time and fed into the compression module
Performance evaluation (hybrid approach)
- Real GPU: measured using vDNN* with the CPU-migrated data compressed
- Analytical model: penalizes performance when cDMA's DRAM bandwidth pressure is high
* Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, MICRO-2016
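Given a measured activation density d, the expected ZV compression ratio follows directly from the format (assuming FP32 activations, i.e., 32 bits per element plus a 1-bit mask per element; this derivation is ours, not from the slides):

ratio(d) = 32N / (N + 32·d·N) = 32 / (1 + 32d)

e.g., d = 0.5 gives about 1.9x, and d = 0.25 gives about 3.6x.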
Slides 57-58
[Chart: compression ratio (1x to 16x) achieved by each algorithm on AlexNet, OverFeat, NiN, VGG, SqueezeNet, and GoogLeNet, showing the network-wide average and the per-layer maximum.]
Compression algorithms: RL (run-length encoding), ZV (zero-value compression), and ZL (Zlib compression)
59
(C) Minsoo Rhu
0.2 0.4 0.6 0.8 1 RL ZV ZL RL ZV ZL RL ZV ZL RL ZV ZL RL ZV ZL RL ZV ZL vDNN cDMA
cDMA
cDMA
cDMA
cDMA
cDMA AlexNet OverFeat NiN VGG SqueezeNet GoogLeNet Offload size (normalized)
: different compression algorithm è RL (run-length encoding), ZV (zero-value compression), and ZL (Zlib compression)
Slide 60
[Chart: normalized performance (0.2 to 1.0) for vDNN and cDMA with RL, ZV, and ZL on AlexNet, OverFeat, NiN, VGG, SqueezeNet, and GoogLeNet.]
Compression algorithms: RL (run-length encoding), ZV (zero-value compression), and ZL (Zlib compression)
Slides 64-65
Inference: the DNN model is fixed (so activations stay constant for the same input sets)
Training: the DNN model is constantly updated over the course of training (so activation map values also change accordingly)
Slide 66
[Figure: fc1 (4096, 1, 1) activations, visualized at 0%, 20%, 40%, 60%, 80%, and 100% of training.]