Slide 1
Compressing DMA Engine: Leveraging Activation Sparsity For Training Deep Neural Networks
Minsoo Rhu✝, Mike O’Connor*, Niladrish Chatterjee*, Jeff Pool*, Youngeun Kwon✝, and Stephen W. Keckler*
POSTECH✝ and NVIDIA*
Slide 3
[AlexNet*]
* Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012
Slide 4
[ResNet*]
* He et al., “Deep Residual Learning for Image Recognition”, CVPR-2016
Slide 5
— The Next Platform, “Baidu eyes deep learning strategy in wake of new GPU options”, April 26th 2016
Slides 6-8
[Diagram (vDNN*): CPU memory and GPU memory. Activations are spilled to CPU memory and later migrated back to GPU memory.]
* Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, MICRO-2016
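The spill/migrate flow above maps naturally onto asynchronous DMA copies over PCIe. A minimal sketch of the idea, assuming a pinned host staging buffer and a dedicated CUDA stream (the function names are illustrative, not vDNN's actual API):

#include <cuda_runtime.h>

// Sketch: offload a layer's activations to CPU memory, then prefetch
// them back into GPU memory before they are needed again.
// host_buf must be pinned (cudaMallocHost) so the DMA engine can stream
// it over PCIe without staging through pageable memory.
void offload(const float* dev_acts, float* host_buf, size_t bytes, cudaStream_t s) {
    cudaMemcpyAsync(host_buf, dev_acts, bytes, cudaMemcpyDeviceToHost, s);
}

void prefetch(float* dev_acts, const float* host_buf, size_t bytes, cudaStream_t s) {
    cudaMemcpyAsync(dev_acts, host_buf, bytes, cudaMemcpyHostToDevice, s);
}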
Slide 9
* https://developer.ibm.com/linuxonpower/2017/09/22/realizing-value-large-model-support-lms-powerai-ibm-caffe/
Slides 10-12
[Diagram: CPU with high-capacity, low-bandwidth memory (DDR4) connected over QuickPath Interconnect (QPI) to a GPU with low-capacity, high-bandwidth stacked memory (HBM). Big data and deeper & wider neural networks drive growing GPU-CPU migration traffic.]
Slide 13
[Diagram: CPU memory and GPU memory; data is spilled from GPU memory to CPU memory.]
Slide 17
[AlexNet*]
* Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012
Slide 18
[AlexNet*] Test image
* Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS-2012
Slides 19-24
[Figure: conv0 (96, 55, 55) feature maps for a test image, visualized at 0%, 20%, 40%, 60%, 80%, and 100% of training; each of the 96 channels is a (55x55) 2D image.]
Slide 25
[Figure: conv1 (256, 27, 27) feature maps, visualized at 0% to 100% of training.]
Slide 26
[Figure: conv4 (256, 13, 13) feature maps, visualized at 0% to 100% of training.]
Slides 27-28
[Chart: activation density over training time (0% to 100%).]
Slide 29
Observation #1: The first CONV layer consistently exhibits around 50% activation density across the entire training process.
Slide 30
Observation #2: Pooling layers always increase overall activation density.
Slide 31
Observation #3: Within each layer, activation density decreases rapidly during the initial training period; once training reaches the fine-tuning stage, density gradually crawls back up.
Slide 32
Observation #4: Later layers are generally sparser than earlier layers.
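These density numbers come from counting zeros in each layer's activations; post-ReLU activations are exactly zero wherever the pre-activation was negative. A minimal sketch of the measurement:

#include <cstddef>

// Fraction of non-zero values in an activation buffer.
double activation_density(const float* acts, size_t n) {
    size_t nonzero = 0;
    for (size_t i = 0; i < n; ++i)
        if (acts[i] != 0.0f) ++nonzero;
    return static_cast<double>(nonzero) / static_cast<double>(n);
}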
Slides 33-35
Deeper ⇒ Sparser
Slides 36-41
[Figure: input images and their corresponding activations*.]
* Zeiler et al., “Visualizing and Understanding Convolutional Networks”, arXiv.org, 2013
Slides 43-46
[Diagram: six SMs connected through a crossbar to six L2/MC partitions backed by GPU DRAM (336 GB/s); a DMA engine connects over PCIe (16 GB/s) to the CPU memory controller (CPU MC) and CPU DRAM. The DMA engine transfers compressed data over PCIe.]
Requirement: DRAM read throughput ≥ (compression ratio × PCIe bandwidth)
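To make the requirement concrete, a worked example with the bandwidth numbers on the slide (the 4x compression ratio is an assumed, illustrative value):

keeping PCIe busy at 16 GB/s with 4x-compressed data requires reading
16 GB/s × 4 = 64 GB/s of uncompressed activations from GPU DRAM,
well under the 336 GB/s of DRAM bandwidth but no longer negligible
when concurrently running kernels also contend for it.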
Slides 47-48
[Diagram: the same system with the DMA engine replaced by a cDMA engine; a compression unit (C) sits at each L2/MC partition, and a buffer (B) aggregates the compressed data from all MCs.]
C: compression unit
B: buffer to aggregate compressed data from all MCs
Slide 49
+ Simple to implement, well-suited for high-throughput compression
+ Exhibits a good compression rate for a variety of data patterns
e.g., dedicated ASIC/FPGA solutions provide roughly 2.5 GB/sec of data compression throughput
50
(C) Minsoo Rhu
has zeros
a b c d e f g h
Data
i j k l m n
Metadata Data has zeros
a b c d e f g h i j k l m n
< Uncompressed > < Compressed >
Slide 51
[Figure: zero-value compression. Uncompressed: N data elements, some of them zero. Compressed: an N-bit metadata bitmask (one bit per element, 1 = non-zero) followed by only the non-zero elements, e.g., a b c d e f.]
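In software terms, the format above amounts to the following. A minimal functional sketch of zero-value compression for FP32 data (a model of the format, not of the hardware pipeline):

#include <cstddef>
#include <cstdint>
#include <vector>

struct ZVCompressed {
    std::vector<uint8_t> mask;     // metadata: 1 bit per element, 1 = non-zero
    std::vector<float>   nonzeros; // data: only the non-zero elements, in order
};

ZVCompressed zv_compress(const std::vector<float>& in) {
    ZVCompressed out;
    out.mask.assign((in.size() + 7) / 8, 0);
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] != 0.0f) {
            out.mask[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
            out.nonzeros.push_back(in[i]);
        }
    }
    return out;
}

std::vector<float> zv_decompress(const ZVCompressed& c, size_t n) {
    std::vector<float> out(n, 0.0f);
    size_t next = 0; // index into the dense non-zero stream
    for (size_t i = 0; i < n; ++i)
        if (c.mask[i / 8] & (1u << (i % 8)))
            out[i] = c.nonzeros[next++];
    return out;
}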
52
(C) Minsoo Rhu
≠ 0? ≠ 0? ≠ 0? ≠ 0? ≠ 0? ≠ 0? ≠ 0? ≠ 0? Prefix Sum Bubble-collapsing Shifter
10011010 10011010 4
Shift-And-Append
+
10011010
Shift-And-Append
Buffer Length Reg. Mask Segment Mask
Compressed 128B Buffer Input Data (32B x 4)
<Area overhead>
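The bubble-collapsing shifter is, functionally, a compaction driven by a prefix sum over the non-zero mask: each non-zero element's output offset equals the count of non-zeros before it. A small sketch of that indexing (a functional model, not the RTL):

#include <cstddef>

// Compact the non-zero elements of one input chunk into 'out',
// appending at 'base' (the current compressed-buffer length).
// The write offset of each element is an exclusive prefix sum of
// the non-zero indicator; returns the new buffer length.
size_t bubble_collapse(const float* chunk, size_t n, float* out, size_t base) {
    size_t offset = base;
    for (size_t i = 0; i < n; ++i)
        if (chunk[i] != 0.0f)
            out[offset++] = chunk[i]; // offset == base + (#non-zeros before i)
    return offset; // becomes 'base' for the next chunk (shift-and-append)
}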
Slides 54-56
Application characterization & datasets
- Model: trained from scratch using Caffe
- Activations: collected at training time and fed into the compression module
Performance evaluation (hybrid approach)
- Real GPU: measured using vDNN* with the CPU-migrated data compressed
- Analytical model: penalizes performance when cDMA's DRAM bandwidth pressure is high
* Rhu et al., “vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design”, MICRO-2016
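Given a measured activation density d, the expected ZV compression ratio follows directly from the format (assuming FP32 activations, i.e., 32 bits per element plus a 1-bit mask per element; this derivation is ours, not from the slides):

ratio(d) = 32N / (N + 32·d·N) = 32 / (1 + 32d)

e.g., d = 0.5 gives about 1.9x, and d = 0.25 gives about 3.6x.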
Slides 57-58
[Chart: compression ratio (1x to 16x) achieved by each algorithm on AlexNet, OverFeat, NiN, VGG, SqueezeNet, and GoogLeNet, showing the network-wide average and the per-layer maximum.]
Compression algorithms: RL (run-length encoding), ZV (zero-value compression), and ZL (Zlib compression)
59
(C) Minsoo Rhu
0.2 0.4 0.6 0.8 1 RL ZV ZL RL ZV ZL RL ZV ZL RL ZV ZL RL ZV ZL RL ZV ZL vDNN cDMA
cDMA
cDMA
cDMA
cDMA
cDMA AlexNet OverFeat NiN VGG SqueezeNet GoogLeNet Offload size (normalized)
: different compression algorithm è RL (run-length encoding), ZV (zero-value compression), and ZL (Zlib compression)
Slide 60
[Chart: normalized performance (0.2 to 1.0) for vDNN and cDMA with RL, ZV, and ZL on AlexNet, OverFeat, NiN, VGG, SqueezeNet, and GoogLeNet.]
Compression algorithms: RL (run-length encoding), ZV (zero-value compression), and ZL (Zlib compression)
Slides 64-65
Inference: the DNN model is fixed (so activations stay constant for the same input sets)
Training: the DNN model is constantly updated over the course of training (so activation map values also change accordingly)
Slide 66
[Figure: fc1 (4096, 1, 1) activations, visualized at 0%, 20%, 40%, 60%, 80%, and 100% of training.]