Model Compression Seminar: Advanced Machine Learning, SS 2016

  1. Model Compression
     Seminar: Advanced Machine Learning, SS 2016
     Markus Beuckelmann (markus.beuckelmann@stud.uni-heidelberg.de)
     July 19, 2016

  2. Outline
     1 Overview & Motivation
       ◇ Why do we need model compression?
       ◇ Embedded & mobile devices
       ◇ DRAM vs. SRAM
     2 Recap: Neural Networks for Prediction
     3 Neural Network Compression & Model Compression
       ◇ Neural Network Pruning: OBD and OBS
       ◇ Knowledge Distillation
       ◇ Deep Compression
     4 Summary

  3. 1 Overview & Motivation

  4. Success of Neural Networks
     • Image recognition
     • Image classification
     • Speech recognition
     • Natural Language Processing
     (Han et al., 2015) (TensorFlow)

  5. Problem: Predictive Performance is Not Enough
     • There are different metrics when it comes to evaluating a model
     • Usually there is some kind of trade-off, so the choice is governed by deployment requirements
     How good is your model in terms of ...?
     • Predictive performance
     • Speed (time complexity) in training/testing
     • Memory complexity in training/testing
     • Energy consumption in training/testing

  6. AlexNet: Millions of Parameters
     AlexNet (Krizhevsky et al., 2012)
     • Trained on ImageNet (15 · 10^6 training images, 22 · 10^3 categories)
     • Number of neurons: 650 · 10^3
     • Number of free parameters: 61 · 10^6
     • ≈ 233 MiB (32-bit float)
     • Having this many parameters is expensive in memory, time and energy.
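A quick back-of-the-envelope check of the 233 MiB figure; a minimal Python sketch where only the parameter count comes from the slide, the rest is plain arithmetic:

```python
# Memory footprint of AlexNet's weights stored as 32-bit floats.
n_parameters = 61e6          # free parameters (Krizhevsky et al., 2012)
bytes_per_float32 = 4        # 32 bits = 4 bytes

total_bytes = n_parameters * bytes_per_float32
print(f"{total_bytes / 2**20:.0f} MiB")   # -> 233 MiB
```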

  7. Mobile & Embedded Devices
     • Smartphones
     • Hearing implants
     • Credit cards, etc.
     Smartphone Hardware (2016)
     • CPU: 2 × 1.7 GHz
     • DRAM: 2 GiB
     • SRAM: MiB
     • Battery: 2000 mAh
     (Micriµm, Embedded Software)
     • Limitations: storage, battery, computational power, network bandwidth
     Model Compression: find a minimal topology for the model.

  8.–9. Minimizing Energy Consumption: SRAM & DRAM
     • DRAM: slower, higher energy consumption, cheaper
     • SRAM: faster, lower energy consumption, more expensive, usually used as cache memory
     (Han et al., 2015)
     • If we can fit the whole model into SRAM, we will consume drastically less energy and gain significant speedups!

  10. 2 Neural Networks

  11. Neural Networks: Basics
     Feed-Forward Networks
     • $a^{(i+1)} = (W^{(i+1)})^\top z^{(i)}$, with $z^{(0)} := x$
     • $z^{(i+1)} = g^{(i+1)}(a^{(i+1)})$
     • $\varphi(x) = g^{(N)}\big(W^{(N)} \cdots\, g^{(1)}(W^{(1)} x) \cdots\big)$
     • $\hat{y} = \arg\max \varphi(x)$
     • Training: gradient descent, backpropagation (Rajesh Rai, AI lecture)
     • Powerful (non-linear) classification/regression
     • Keep in mind: there are more complex architectures! (http://deepdish.io)
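A minimal NumPy sketch of this forward pass; the layer sizes, the tanh/softmax choices and the random weights are illustrative assumptions, not from the slides:

```python
import numpy as np

def forward(x, weights, activations):
    """Feed-forward pass: a = W^T z, z = g(a), starting from z^(0) = x."""
    z = x
    for W, g in zip(weights, activations):
        a = W.T @ z    # pre-activation  a^(i+1) = (W^(i+1))^T z^(i)
        z = g(a)       # activation      z^(i+1) = g^(i+1)(a^(i+1))
    return z           # network output  phi(x)

# Toy 4 -> 8 -> 3 network with a tanh hidden layer and a softmax output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 3))]
activations = [np.tanh, lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()]

x = rng.normal(size=4)
phi = forward(x, weights, activations)
y_hat = np.argmax(phi)   # predicted class
```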

  12. Neural Networks: Prediction
     (Zeiler, 2013)
     Loss functions
     • Regression: $L(\theta \mid X, y) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
     • Multiclass classification: $L(\theta \mid X, y) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \cdot \log P(\hat{y}_{ik})$
     • The last layer is usually a softmax layer: $p = z^{(l)} = \exp(a^{(l)}) \,/\, \sum_{k=1}^{K} \exp(a_k^{(l)})$
     • In the end, we get a posterior probability distribution over the classes
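A small NumPy sketch of these two pieces; subtracting the maximum inside the softmax is a numerical-stability detail added here, not something the slides discuss:

```python
import numpy as np

def softmax(a):
    """p_k = exp(a_k) / sum_j exp(a_j), shifted by max(a) for stability."""
    e = np.exp(a - a.max())
    return e / e.sum()

def cross_entropy(y_onehot, p):
    """Multiclass loss for a single sample: -sum_k y_k * log(p_k)."""
    return -np.sum(y_onehot * np.log(p))

a = np.array([2.0, 1.0, 0.1])    # last-layer pre-activations a^(l)
p = softmax(a)                   # posterior distribution over classes
y = np.array([1.0, 0.0, 0.0])    # true class as a one-hot vector
loss = cross_entropy(y, p)
```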

  13. 3 Neural Network Compression & Model Compression

  14.–15. Pruning: Overview
     • Selectively removing weights/neurons
     • Compression: 2× to 4×
     • Usually combined with retraining
     (Ben Lorica, O'Reilly Media)
     Important Questions
     • Which weights should we remove first?
     • How many weights can we remove?
     • What about the order of removal?

  16. Motivation: Synaptic Pruning
     • In humans, the brain performs synaptic pruning
     • This removes redundant connections in the brain (Seeman et al., 1987)

  17.–19. Pruning: How do we find the least important weight(s)?
     • Brute-force pruning
       ◇ $\mathcal{O}(NW^2)$ with $W$ weights and $N$ training samples
       ◇ Not feasible for large neural networks
     • Simple heuristics
       ◇ Magnitude-based damage: look at $\lVert w \rVert_p$ (see the sketch after this list)
       ◇ Variance-based damage
     • More rigorous approaches
       ◇ Optimal Brain Damage (OBD) (LeCun et al., 1990)
       ◇ Optimal Brain Surgeon (OBS) (Hassibi et al., 1993)
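A minimal sketch of the magnitude-based heuristic, assuming we simply zero out the fraction of weights with the smallest absolute value; the quantile threshold and the boolean mask representation are assumptions made here:

```python
import numpy as np

def magnitude_prune(W, fraction=0.5):
    """Zero out the `fraction` of weights with the smallest |w|."""
    threshold = np.quantile(np.abs(W), fraction)
    mask = np.abs(W) >= threshold    # keep only large-magnitude weights
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
W_pruned, mask = magnitude_prune(W, fraction=0.5)
print(f"kept {mask.mean():.0%} of the weights")
```

In practice the pruned network is then retrained with the mask held fixed, as the overview slide notes.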

  20. Optimal Brain Damage (OBD)
     • Small perturbation: $\delta w \;\Rightarrow\; \delta L = L(w + \delta w) - L(w)$
     • Taylor expansion:
       $\delta L \approx \left(\frac{\partial L}{\partial w}\right)^{\top} \delta w + \frac{1}{2}\, \delta w^\top H\, \delta w + \mathcal{O}(\lVert \delta w \rVert^3)$
       $\Rightarrow\; \delta L \approx \sum_i \frac{\partial L}{\partial w_i}\, \delta w_i + \frac{1}{2} \sum_{i,j} \delta w_i\, (H)_{ij}\, \delta w_j + \mathcal{O}(\lVert \delta w \rVert^3)$
     • With the Hessian $(H)_{ij} = \frac{\partial^2 L}{\partial w_i\, \partial w_j}$
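To make the expansion concrete, a small numerical check on a toy loss; the loss function, weights and step are arbitrary illustrative choices:

```python
import numpy as np

# Toy loss L(w) = sum_i w_i^4, whose Hessian is exactly diagonal: (H)_ii = 12 w_i^2.
L = lambda w: np.sum(w ** 4)
grad = lambda w: 4 * w ** 3
hess_diag = lambda w: 12 * w ** 2

w = np.array([1.0, -0.5, 2.0])
dw = 1e-2 * np.array([1.0, 1.0, -1.0])

exact = L(w + dw) - L(w)
second_order = grad(w) @ dw + 0.5 * np.sum(hess_diag(w) * dw ** 2)
print(exact, second_order)   # agree up to O(||dw||^3)
```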

  21.–22. Optimal Brain Damage (OBD)
     • We need to deal with:
       $\delta L \approx \sum_i \frac{\partial L}{\partial w_i}\, \delta w_i + \frac{1}{2} \sum_i (H)_{ii}\, \delta w_i^2 + \frac{1}{2} \sum_{i \neq j} \delta w_i\, (H)_{ij}\, \delta w_j + \mathcal{O}(\lVert \delta w \rVert^3)$
     Approximations
     • Extremal assumption: we are in a local optimum (training has converged), so the first-order term vanishes
     • Diagonal assumption: $H$ is diagonal, so the cross terms ($i \neq j$) vanish
     • Quadratic approximation: $L$ is approximately quadratic, so the higher-order terms are negligible
     • Now we are left with: $\delta L \approx \frac{1}{2} \sum_i (H)_{ii}\, \delta w_i^2$
     • Deleting weight $k$ sets $\delta w_k = -w_k$, which gives the saliency $s_k = \frac{1}{2} (H)_{kk}\, w_k^2$
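A sketch of the saliency computation under these assumptions. The slides do not say how to obtain the Hessian diagonal; here it is approximated by averaged squared per-sample gradients (a Fisher/Gauss-Newton style approximation, which is an assumption here; LeCun et al. (1990) instead derive an exact backpropagation rule for the diagonal second derivatives). The `grad_fn` helper is hypothetical:

```python
import numpy as np

def obd_saliencies(w, grad_fn, n_samples):
    """Saliencies s_k = 1/2 * (H)_kk * w_k^2 for a flat weight vector w.

    (H)_kk is approximated by the average squared per-sample gradient;
    this approximation is an assumption, not part of the OBD paper.
    """
    h_diag = np.zeros_like(w)
    for i in range(n_samples):
        g = grad_fn(w, i)       # gradient of the loss on training sample i
        h_diag += g ** 2
    h_diag /= n_samples
    return 0.5 * h_diag * w ** 2
```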

  23. OBD: The Algorithm
     1 Choose a reasonable network architecture
     2 Train the network until a reasonable local minimum is obtained
     3 Compute the diagonal of the Hessian, i.e. $(H)_{kk}$
     4 Compute the saliencies $s_k = \frac{1}{2} (H)_{kk}\, w_k^2$ for each parameter
     5 Sort the parameters by $s_k$
     6 Delete parameters with low saliency
     7 (Optional: iterate back to step 2)
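Putting steps 3–7 together as a sketch; the pruning fraction, the boolean mask representation and the saliency helper from above are all illustrative assumptions:

```python
import numpy as np

def obd_prune_round(w, mask, saliencies, prune_fraction=0.1):
    """One OBD round on a flat weight vector: drop the lowest-saliency weights."""
    s = saliencies.copy()
    s[~mask] = np.inf                        # ignore already-pruned weights
    n_prune = int(prune_fraction * mask.sum())
    idx = np.argsort(s)[:n_prune]            # indices of the least salient weights
    mask[idx] = False                        # step 6: delete them
    return w * mask, mask

# Outer loop (step 7): retrain the masked network, recompute saliencies, repeat.
```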

  24. OBD: Experimental Results
     • Data: MNIST (handwritten digit recognition)
     • Left panel (a): comparison to magnitude-based pruning
     • Right panel (b): comparison to saliencies
     (LeCun et al., 1990)

  25. OBD: Experimental Results – With Retraining
     • This is what it looks like with retraining.
     • Left panel (a): retraining (training data)
     • Right panel (b): retraining (test data)
     (LeCun et al., 1990)
