SLIDE 1

Model Compression

Seminar: Advanced Machine Learning, SS 2016
Markus Beuckelmann (markus.beuckelmann@stud.uni-heidelberg.de)

July 19, 2016

SLIDE 2

Outline

1 Overview & Motivation
    ◇ Why do we need model compression?
    ◇ Embedded & Mobile devices
    ◇ DRAM vs. SRAM
2 Recap: Neural Networks for Prediction
3 Neural Network Compression & Model Compression
    ◇ Neural Network Pruning: OBD and OBS
    ◇ Knowledge Distillation
    ◇ Deep Compression
4 Summary

SLIDE 3

1 Overview & Motivation

SLIDE 4

Success of Neural Networks

  • Image recognition
  • Image classification
  • Speech recognition
  • Natural Language Processing

(Han et al., 2015) (Tensorflow)
SLIDE 5

Problem: Predictive Performance is Not Enough

  • There are different metrics when it comes to evaluating a model
  • Usually there is some kind of trade–off, so the choice is governed by deployment requirements

How good is your model in terms of...?

  • Predictive performance
  • Speed (time complexity) in training/testing
  • Memory complexity in training/testing
  • Energy consumption in training/testing

SLIDE 6

AlexNet: Millions of Parameters

AlexNet (Krizhevsky et al., 2012)

  • Trained on ImageNet (15 · 10^6 training images, 22 · 10^3 categories)
  • Number of neurons: 650 · 10^3
  • Number of free parameters: 61 · 10^6
  • ≈ 233 MiB (32-bit float)

(Krizhevsky et al., 2012)

  • Having this many parameters is expensive in memory, time and energy.
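As a quick back-of-the-envelope check of the memory figure above (not from the slides, just arithmetic):

```python
# 61 million free parameters, 4 bytes each as 32-bit floats, converted to MiB
params = 61e6
size_mib = params * 4 / 2**20
print(round(size_mib, 1))  # ~232.7 MiB, matching the ~233 MiB quoted above
```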

SLIDE 7

Mobile & Embedded Devices

  • Smartphones
  • Hearing implants
  • Credit cards, etc. ...

Smartphone Hardware (2016)

  • CPU: 2× 1.7 GHz
  • DRAM: 2 GiB
  • SRAM: MiB
  • Battery: 2000 mAh

(Micriµm, Embedded Software)

  • Limitations: storage, battery, computational power, network bandwidth

Model Compression: Find a minimum topology of the model.

SLIDE 8

Minimizing Energy Consumption: SRAM & DRAM

  • DRAM: Slower, higher energy consumption, cheaper
  • SRAM: Faster, less energy consumption, more expensive, usually used as cache memory

(Han et al., 2015)
SLIDE 9

Minimizing Energy Consumption: SRAM & DRAM

  • DRAM: Slower, higher energy consumption, cheaper
  • SRAM: Faster, less energy consumption, more expensive, usually used as cache memory

(Han et al., 2015)

  • If we can fit the whole model into SRAM, we will consume drastically less energy and gain significant speedups!

SLIDE 10

2 Neural Networks

SLIDE 11

Neural Networks: Basics

Feed–Forward Networks

  • $a^{(i+1)} = (W^{(i+1)})^\top z^{(i)}$, $\quad z^{(0)} := x$
  • $z^{(i+1)} = g^{(i+1)}(a^{(i+1)})$
  • $f(x) = g^{(N)}\big(\cdots\, g^{(1)}(W^{(1)} x) \cdots\big)$
  • $\hat{y} = \arg\max f(x)$

  • Training: GD, Backpropagation
  • Powerful, (non–linear) classification/regression
  • Keep in mind: there are more complex architectures!
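A minimal NumPy sketch of this forward pass, following the notation above (the layer sizes and the tanh/softmax choices are illustrative assumptions, not from the slides):

```python
import numpy as np

def forward(x, weights, activations):
    """Feed-forward pass: a = W^T z, z = g(a), layer by layer."""
    z = x
    for W, g in zip(weights, activations):
        a = W.T @ z
        z = g(a)
    return z

softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 3))]   # two toy layers
activations = [np.tanh, softmax]

f_x = forward(rng.normal(size=4), weights, activations)        # f(x)
y_hat = int(np.argmax(f_x))                                    # predicted class
```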

(Rajesh Rai, AI lecture) (http://deepdish.io)
SLIDE 12

Neural Networks: Prediction

(Zeiler, 2013)

Loss functions

  • Regression: $\mathcal{L}(\theta \mid X, y) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$
  • Multiclass classification: $\mathcal{L}(\theta \mid X, y) = -\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log\big(P(\hat{y}_{ik})\big)$
  • The last layer is usually a softmax layer: $p = z^{(l)} = \exp\big(a^{(l)}\big) \,/\, \sum_{k=1}^{K}\exp\big(a^{(l)}_k\big)$
  • In the end, we get a posterior probability distribution over the classes
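A small NumPy sketch of the two loss functions above (the toy targets and predictions are made up for illustration):

```python
import numpy as np

def mse_loss(y, y_hat):
    """Regression: L = 1/(2N) * sum_i (y_i - y_hat_i)^2."""
    return 0.5 * np.mean((y - y_hat) ** 2)

def cross_entropy_loss(Y, P):
    """Multiclass: L = -sum_i sum_k Y_ik * log(P_ik), with one-hot Y and softmax outputs P."""
    return -np.sum(Y * np.log(P + 1e-12))

Y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])    # one-hot targets, N = 2, K = 3
P = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]])    # softmax posteriors
print(mse_loss(np.array([1.0, 2.0]), np.array([0.9, 2.2])), cross_entropy_loss(Y, P))
```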

SLIDE 13

3 Neural Network Compression & Model Compression

SLIDE 14

Pruning: Overview

  • Selectively removing weights / neurons
  • Compression: 2× to 4×
  • Usually combined with retraining

(Ben Lorica, O’Reilly Media)
SLIDE 15

Pruning: Overview

  • Selectively removing weights / neurons
  • Compression: 2× to 4×
  • Usually combined with retraining

(Ben Lorica, O’Reilly Media)

Important Questions

  • Which weights should we remove first?
  • How many weights can we remove?
  • What about the order of removal?

SLIDE 16

Motivation: Synaptic Pruning

  • In humans, we have synaptic pruning
  • This removes redundant connections in the brain

(Seeman et al., 1987)
SLIDE 17

Pruning: How do we find the least important weight(s)?

  • Brute–force Pruning
    ◇ $\mathcal{O}(N W^2)$ with $W$ weights and $N$ training samples
    ◇ Not feasible for large neural networks

SLIDE 18

Pruning: How do we find the least important weight(s)?

  • Brute–force Pruning
    ◇ $\mathcal{O}(N W^2)$ with $W$ weights and $N$ training samples
    ◇ Not feasible for large neural networks

  • Simple Heuristics
    ◇ Magnitude–Based Damage: look at $\|w\|_p$
    ◇ Variance–Based Damage

SLIDE 19

Pruning: How do we find the least important weight(s)?

  • Brute–force Pruning
    ◇ $\mathcal{O}(N W^2)$ with $W$ weights and $N$ training samples
    ◇ Not feasible for large neural networks

  • Simple Heuristics
    ◇ Magnitude–Based Damage: look at $\|w\|_p$
    ◇ Variance–Based Damage

  • More Rigorous Approaches
    ◇ Optimal Brain Damage (OBD) (LeCun et al., 1990)
    ◇ Optimal Brain Surgeon (OBS) (Hassibi et al., 1993)

SLIDE 20

Optimal Brain Damage (OBD)

  • Small perturbation: $\delta w \;\Rightarrow\; \delta \mathcal{L} = \mathcal{L}(w + \delta w) - \mathcal{L}(w)$
  • Taylor expansion:
    $\delta \mathcal{L} \approx \left(\frac{\partial \mathcal{L}}{\partial w}\right)^{\!\top} \delta w + \frac{1}{2}\,\delta w^\top H\, \delta w + \mathcal{O}(\|\delta w\|^3)$
    $\Rightarrow\; \delta \mathcal{L} \approx \sum_i \frac{\partial \mathcal{L}}{\partial w_i}\,\delta w_i + \frac{1}{2}\sum_{i,j} \delta w_i (H)_{ij}\,\delta w_j + \mathcal{O}(\|\delta w\|^3)$
  • With the Hessian: $(H)_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i\, \partial w_j}$

SLIDE 21

Optimal Brain Damage (OBD)

  • We need to deal with:
    $\delta \mathcal{L} \approx \sum_i \frac{\partial \mathcal{L}}{\partial w_i}\,\delta w_i + \underbrace{\frac{1}{2}\sum_{i,j} \delta w_i (H)_{ij}\,\delta w_j}_{\frac{1}{2}\sum_i (H)_{ii}\,\delta w_i^2 \;+\; \frac{1}{2}\sum_{i \neq j} \delta w_i (H)_{ij}\,\delta w_j} + \mathcal{O}(\|\delta w\|^3)$

Approximations

  • Extremal assumption: local optimum (training has converged)
  • Diagonal assumption: H is diagonal
  • Quadratic approximation: L is approximately quadratic

SLIDE 22

Optimal Brain Damage (OBD)

  • We need to deal with:
    $\delta \mathcal{L} \approx \sum_i \frac{\partial \mathcal{L}}{\partial w_i}\,\delta w_i + \underbrace{\frac{1}{2}\sum_{i,j} \delta w_i (H)_{ij}\,\delta w_j}_{\frac{1}{2}\sum_i (H)_{ii}\,\delta w_i^2 \;+\; \frac{1}{2}\sum_{i \neq j} \delta w_i (H)_{ij}\,\delta w_j} + \mathcal{O}(\|\delta w\|^3)$

Approximations

  • Extremal assumption: local optimum (training has converged)
  • Diagonal assumption: H is diagonal
  • Quadratic approximation: L is approximately quadratic
  • Now we are left with:
    $\delta \mathcal{L} \approx \frac{1}{2}\sum_i (H)_{ii}\,\delta w_i^2 \;\;\rightarrow\;\; s_k = \frac{1}{2}(H)_{kk}\, w_k^2$

SLIDE 23

OBD: The Algorithm

1 Choose a reasonable network architecture
2 Train the network until a reasonable local minimum is obtained
3 Compute the diagonal of the Hessian, i.e. $(H)_{kk}$
4 Compute the saliencies given by $s_k = \frac{1}{2}(H)_{kk}\, w_k^2$ for each parameter
5 Sort the parameters by $s_k$
6 Delete parameters with low saliency
7 (Optional: Iterate to step 2)
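A minimal sketch of steps 3-7, assuming we already have a flat weight vector and an estimate of the Hessian diagonal (the random toy values below are stand-ins, not LeCun et al.'s procedure):

```python
import numpy as np

def obd_prune(weights, hessian_diag, prune_fraction=0.1):
    """One OBD pruning step: rank parameters by saliency s_k = 0.5 * H_kk * w_k^2
    and zero out the lowest-saliency fraction."""
    saliency = 0.5 * hessian_diag * weights**2            # step 4
    n_prune = int(prune_fraction * weights.size)
    prune_idx = np.argsort(saliency)[:n_prune]            # steps 5-6: lowest saliency first
    pruned = weights.copy()
    pruned[prune_idx] = 0.0
    return pruned, prune_idx

# toy example with random numbers standing in for a trained network (step 2)
rng = np.random.default_rng(0)
w = rng.normal(size=10)
h_diag = np.abs(rng.normal(size=10))                      # stand-in for (H)_kk from step 3
w_pruned, removed = obd_prune(w, h_diag, prune_fraction=0.3)
# step 7: retrain the remaining weights and repeat if desired
```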

SLIDE 24

OBD: Experimental Results

  • Data: MNIST (handwritten digit recognition)
  • Left panel (a): Comparison to magnitude–based pruning
  • Right panel (b): Comparison to saliencies

(Le Cun et al., 1990)
SLIDE 25

OBD: Experimental Results – With Retraining

  • This is what it looks like with retraining.
  • Left panel (a): Retraining (training data)
  • Right panel (b): Retraining (test data)

(Le Cun et al., 1990)
SLIDE 26

Optimal Brain Surgeon (OBS)

  • Now: Use the full Hessian $H$
  • We want to set one of the weights to zero: $\underbrace{\delta w^\top \hat{e}_k}_{\delta w_k} + w_k = 0$
  • Solve the optimization problem
    $\min_k \Big\{ \min_{\delta w} \tfrac{1}{2}\,\delta w^\top H\, \delta w \;\Big|\; \delta w^\top \hat{e}_k + w_k = 0 \Big\}$
  • Lagrangian:
    $\Lambda = \tfrac{1}{2}\,\delta w^\top H\, \delta w + \lambda \big( \delta w^\top \hat{e}_k + w_k \big)$

SLIDE 27

OBS: Solving the optimization problem

  • Generalized saliency: $s_k = \dfrac{1}{2}\,\dfrac{w_k^2}{(H^{-1})_{kk}}$
  • Note: If $H^{-1}$ is diagonal, $(H^{-1})_{kk} = \big((H)_{kk}\big)^{-1} \;\Rightarrow\; s_k = \tfrac{1}{2}(H)_{kk}\, w_k^2$
  • Optimal weight change: $\delta w = -\dfrac{w_k}{(H^{-1})_{kk}}\; H^{-1}\, \hat{e}_k$

The obvious drawback...

  • However: We need $H^{-1}$
  • Fortunately: It is possible to recursively calculate $H^{-1}$ (see Hassibi et al., 1993)
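A minimal sketch of one OBS step, assuming the network is small enough that $H^{-1}$ can simply be obtained with np.linalg.inv rather than the recursive computation from Hassibi et al. (the toy Hessian and weights below are made up):

```python
import numpy as np

def obs_step(w, H_inv):
    """One Optimal Brain Surgeon step: pick the weight with the smallest
    generalized saliency s_k = 0.5 * w_k^2 / (H^-1)_kk and update all weights."""
    saliency = 0.5 * w**2 / np.diag(H_inv)
    k = int(np.argmin(saliency))                      # weight to remove
    delta_w = -(w[k] / H_inv[k, k]) * H_inv[:, k]     # optimal change for *all* weights
    return w + delta_w, k

# toy example: a random positive-definite matrix stands in for the real Hessian
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + 5 * np.eye(5)
w = rng.normal(size=5)
w_new, removed = obs_step(w, np.linalg.inv(H))
print(removed, np.round(w_new, 3))                    # w_new[removed] is (numerically) zero
```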

SLIDE 28

Knowledge Distillation: General Idea

  • Try to approximate bigger/more complex neural nets (models) with smaller neural nets (models) of similar generalization performance.
  • General Idea of Distillation: Have a student model mimic the teacher’s function directly.

Motivation: An Analogy to Insects

  • Larval form: Optimized for extracting energy and nutrients
  • Adult form: Optimized for traveling and reproduction

Similarly: There are different requirements for training and testing

  • Training: extract knowledge from training data, not time–critical
  • Testing: fast (real–time) prediction, energy efficiency

SLIDE 29

Caruana et al. (2006) – Model Compression

  • Transfer learning: Try to match the logits produced by the teacher net
  • Logit: The input to the softmax layer
  • This is just a regression problem (regressing the logits with an ℓ2 loss)!

(Yangyang, 2014)

The Algorithm

1 Feed teacher with data
2 Obtain logits from teacher (transfer training set)
3 Train student on these logits (ℓ2-regression)

  • Note that we can use unlabeled data now to train the student!
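A minimal NumPy sketch of these steps, with a random linear map standing in for the teacher and a linear student fitted by gradient descent (these stand-ins are assumptions for illustration; the real teacher and student would be neural networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-in for the trained teacher: any map from inputs to logits
W_teacher = rng.normal(size=(20, 5))
def teacher_logits(X):
    return X @ W_teacher                      # logit = input to the softmax layer

X_transfer = rng.normal(size=(1000, 20))      # transfer set: can be unlabeled data
Z = teacher_logits(X_transfer)                # step 2: logits from the teacher

# step 3: fit a small (here: linear) student to the logits with an l2 loss
W_student = np.zeros((20, 5))
lr = 0.1
for _ in range(300):
    residual = X_transfer @ W_student - Z
    W_student -= lr * X_transfer.T @ residual / len(X_transfer)
print(np.mean((X_transfer @ W_student - Z) ** 2))   # l2 matching error, close to zero
```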

SLIDE 30

Hinton et al. (2015) – Dark Knowledge

  • Transfer learning: Try to match soft targets produced by the teacher net
  • Softmax layer with temperature $T$: $p = z^{(l)} = \exp\big(a^{(l)}/T\big) \,/\, \sum_{k=1}^{K} \exp\big(a^{(l)}_k/T\big)$
  • This will soften the posterior distribution
  • For $T \to \infty$, this is equivalent to the Caruana approach (assuming logits are zero–meaned)

(Yangyang, 2014)

The Algorithm

1 Feed teacher with data ($T_1$)
2 Obtain soft targets from teacher ($T_1$) (transfer training set)
3 Train student on soft targets ($T_1$) with cross–entropy loss
4 Use student with $T < T_1$
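A minimal NumPy sketch of steps 2-3 with toy logits (the temperature value and the logits are made up; gradients and the optimizer are omitted):

```python
import numpy as np

def softmax(a, T=1.0):
    """Temperature softmax: p_k = exp(a_k / T) / sum_j exp(a_j / T)."""
    e = np.exp((a - a.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

T1 = 5.0
teacher_logits = np.array([[9.0, 5.0, 1.0]])
student_logits = np.array([[7.0, 6.0, 2.0]])          # whatever the student currently outputs

soft_targets = softmax(teacher_logits, T=T1)           # step 2: much "softer" than at T = 1
student_probs = softmax(student_logits, T=T1)          # student trained at the same T1
loss = -np.sum(soft_targets * np.log(student_probs))   # step 3: cross-entropy on soft targets
print(soft_targets, loss)                              # step 4: deploy the student with T < T1
```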

SLIDE 31

Knowledge Distillation: Dark Knowledge

Why does this work?

  • Dark Knowledge (Geoffrey Hinton)
    ◇ Knowledge Distillation works because most of the knowledge in the learned model is in the relative probabilities of extremely improbable wrong answers.

SLIDE 32

Knowledge Distillation: Dark Knowledge

Why does this work?

  • Dark Knowledge (Geoffrey Hinton)
    ◇ Knowledge Distillation works because most of the knowledge in the learned model is in the relative probabilities of extremely improbable wrong answers.

  • Let’s look at an example:
    ◇ Truth: $y = (0\;\ 1\;\ 0\;\ 0)^\top$ over $(P_{\text{cow}}\;\ P_{\text{dog}}\;\ P_{\text{cat}}\;\ P_{\text{boat}})^\top$
    ◇ Teacher output: $\hat{y} = (10^{-6}\;\ 0.9\;\ 0.1\;\ 10^{-9})^\top$
    ◇ Softened output: $\tilde{y} = (0.05\;\ 0.4\;\ 0.3\;\ 10^{-3})^\top$

SLIDE 33

Deep Compression: Overview

  • Overall compression: up to 49×
  • More focused on reducing memory and battery footprint
  • EIE: Efficient Inference Engine on Compressed Deep Neural Network

(Han et al., 2015)
SLIDE 34

Deep Compression: Weight Quantization / Sharing

(Han et al., 2015)
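The cited figures illustrate the weight-sharing idea of Han et al. (2015): weights are clustered with k-means and each weight stores only a small index into a codebook of shared centroid values (which the paper then fine-tunes). A minimal sketch of the clustering step (the cluster count and the plain k-means loop below are simplifications for illustration):

```python
import numpy as np

def quantize_weights(w, n_clusters=16, n_iter=20):
    """Weight sharing: k-means over the weight values; store per-weight cluster
    indices (4 bits for 16 clusters) plus a small codebook of centroids."""
    centroids = np.linspace(w.min(), w.max(), n_clusters)   # linear initialization
    for _ in range(n_iter):
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    return idx.astype(np.uint8), centroids                   # indices + codebook

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
idx, codebook = quantize_weights(w)
w_hat = codebook[idx]                                        # reconstructed (shared) weights
print(np.mean((w - w_hat) ** 2))                             # small quantization error
```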

SLIDE 35

Deep Compression: Huffman Coding

  • Lossless compression: Optimal prefix code
  • General idea: Represent more common symbols with fewer bits than less common symbols.
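A compact sketch of building such an optimal prefix code over, e.g., quantized weight indices, using Python's heapq (the toy symbol counts are made up):

```python
import heapq
from collections import Counter

def huffman_code(counts):
    """Build an optimal prefix code: repeatedly merge the two least frequent nodes."""
    heap = [[freq, i, {sym: ""}] for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        lo[2] = {s: "0" + c for s, c in lo[2].items()}   # left branch gets a leading 0
        hi[2] = {s: "1" + c for s, c in hi[2].items()}   # right branch gets a leading 1
        heapq.heappush(heap, [lo[0] + hi[0], tiebreak, {**lo[2], **hi[2]}])
        tiebreak += 1
    return heap[0][2]

# e.g. cluster indices after weight quantization, with a skewed distribution
symbols = Counter({0: 60, 1: 25, 2: 10, 3: 5})
code = huffman_code(symbols)
print(code)   # frequent symbols get shorter bit strings
```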

(Han et al., 2015)
SLIDE 36

Deep Compression: Experimental Results

(Han et al., 2015)

  • Deep Compression on AlexNet:

(Han et al., 2015)
SLIDE 37

4 Summary

SLIDE 38

Summary

Model Compression: Why?

  • We can improve speed in training/testing
  • We can reduce the memory footprint
  • We can reduce energy consumption
  • We can (sometimes) improve predictive performance

Model Compression: Different Approaches

1 Pruning: Selectively removing weights (by saliency for OBD & OBS)
2 Knowledge Distillation: Try to distill the model’s function $f(x)$ directly
3 Deep Compression: Pruning – Quantization – Encoding

SLIDE 39

Reading / Resources

  • Knowledge Distillation
    ◇ [1] Distilling the Knowledge in a Neural Network, Hinton et al. (2015)
    ◇ [2] Model Compression, Caruana et al. (2006)
    ◇ [3] Do Deep Nets Really Need to be Deep?, Caruana et al. (2014)
  • Pruning
    ◇ Overview: [4] Pruning algorithms – a survey, Reed (1993)
    ◇ [5] Optimal Brain Damage, Le Cun et al. (1990)
    ◇ [6] Optimal Brain Surgeon, Hassibi et al. (1993)
    ◇ [7] Learning both Weights and Connections for Efficient Neural Networks, Han et al. (2015)
  • Deep Compression
    ◇ [8] Deep Compression, Han et al. (2016)
    ◇ [9] Efficient Inference Engine, Han et al. (2016)

SLIDE 40


Thank you!

SLIDE 41

5 Extra slides

SLIDE 42

Caruana vs. Hinton

  • What’s the connection between matching logits and minimizing cross–entropy with soft targets?

  • Note: $e^x \approx 1 + x$ for small $x$
  • $T\,\dfrac{\partial \mathcal{L}}{\partial z_i} = q_i - p_i = \dfrac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \dfrac{e^{v_i/T}}{\sum_j e^{v_j/T}}$
  • $T\,\dfrac{\partial \mathcal{L}}{\partial z_i} \approx \dfrac{1 + z_i/T}{N + \sum_j z_j/T} - \dfrac{1 + v_i/T}{N + \sum_j v_j/T} = \dfrac{1}{N T}\,(z_i - v_i)$
  • ...assuming that both sets of logits are zero–meaned. Here $N$ is the number of classes ($z$: student logits, $v$: teacher logits).

  • ⇒ Matching the logits of the cumbersome model is a special case of distillation.
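A small numeric check of this limit (a sketch with arbitrary zero-mean toy logits): the exact soft-target gradient $\frac{1}{T}(q_i - p_i)$ approaches $\frac{1}{N T^2}(z_i - v_i)$, i.e. a scaled ℓ2 logit-matching gradient, as $T$ grows.

```python
import numpy as np

def softmax(a, T):
    e = np.exp((a - a.max()) / T)
    return e / e.sum()

z = np.array([2.0, -0.5, -1.5])          # student logits (zero mean)
v = np.array([1.0, 0.5, -1.5])           # teacher logits (zero mean)
N = len(z)

for T in [1, 10, 100]:
    exact = (softmax(z, T) - softmax(v, T)) / T     # d(cross-entropy)/dz_i at temperature T
    approx = (z - v) / (N * T**2)                   # the high-temperature limit from above
    print(T, np.max(np.abs(exact - approx)))        # the difference shrinks as T grows
```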

SLIDE 43

OBD: How do we compute the second derivatives?

  • This can be done similarly to backpropagation
  • In general: $y_i = g(b_i)$, $\; b_i = \sum_j w_{ij}\, y_j$
  • Diagonal of the Hessian:
    $h_{kk} = (H)_{kk} = \sum_{(i,j)} \frac{\partial^2 \mathcal{L}}{\partial w_{ij}^2} = \sum_{(i,j)} \frac{\partial^2 \mathcal{L}}{\partial b_i^2}\; y_j^2$
  • Then:
    $\frac{\partial^2 \mathcal{L}}{\partial b_i^2} = g'(b_i)^2 \sum_l w_{li}^2\; \frac{\partial^2 \mathcal{L}}{\partial b_l^2} + g''(b_i)\, \frac{\partial \mathcal{L}}{\partial y_i}$
