SLIDE 1

Model Compression

Seminar: Advanced Machine Learning, SS 2016
Markus Beuckelmann (markus.beuckelmann@stud.uni-heidelberg.de)

July 19, 2016

SLIDE 2

Outline

1 Overview & Motivation
    ◇ Why do we need model compression?
    ◇ Embedded & Mobile devices
    ◇ DRAM vs. SRAM
2 Recap: Neural Networks for Prediction
3 Neural Network Compression & Model Compression
    ◇ Neural Network Pruning: OBD and OBS
    ◇ Knowledge Distillation
    ◇ Deep Compression
4 Summary

SLIDE 3

1 Overview & Motivation

SLIDE 4

Success of Neural Networks

  • Image recognition
  • Image classification
  • Speech recognition
  • Natural Language Processing

(Han et al., 2015) (Tensorflow)
SLIDE 5

Problem: Predictive Performance is Not Enough

  • There are different metrics when it comes to evaluating a model
  • Usually there is some kind of trade–off, so the choice is governed by deployment requirements

How good is your model in terms of...?

  • Predictive performance
  • Speed (time complexity) in training/testing
  • Memory complexity in training/testing
  • Energy consumption in training/testing

SLIDE 6

AlexNet: Millions of Parameters

AlexNet (Krizhevsky et al., 2012)

  • Trained on ImageNet (15 · 10^6 training images, 22 · 10^3 categories)
  • Number of neurons: 650 · 10^3
  • Number of free parameters: 61 · 10^6
  • ≈ 233 MiB (32-bit float)

(Krizhevsky et al., 2012)

  • Having this many parameters is expensive in memory, time and energy.
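As a quick back-of-the-envelope check of the memory figure above (not from the slides, just arithmetic):

```python
# 61 million free parameters, 4 bytes each as 32-bit floats, converted to MiB
params = 61e6
size_mib = params * 4 / 2**20
print(round(size_mib, 1))  # ~232.7 MiB, matching the ~233 MiB quoted above
```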

SLIDE 7

Mobile & Embedded Devices

  • Smartphones
  • Hearing implants
  • Credit cards, etc. ...

Smartphone Hardware (2016)

  • CPU: 2× 1.7 GHz
  • DRAM: 2 GiB
  • SRAM: MiB
  • Battery: 2000 mAh

(Micriµm, Embedded Software)

  • Limitations: storage, battery, computational power, network bandwidth

Model Compression: Find a minimum topology of the model.

SLIDE 8

Minimizing Energy Consumption: SRAM & DRAM

  • DRAM: Slower, higher energy consumption, cheaper
  • SRAM: Faster, less energy consumption, more expensive, usually used as cache memory

(Han et al., 2015)
SLIDE 9

Minimizing Energy Consumption: SRAM & DRAM

  • DRAM: Slower, higher energy consumption, cheaper
  • SRAM: Faster, less energy consumption, more expensive, usually used as cache memory

(Han et al., 2015)

  • If we can fit the whole model into SRAM, we will consume drastically less energy and gain significant speedups!

SLIDE 10

2 Neural Networks

SLIDE 11

Neural Networks: Basics

Feed–Forward Networks

  • $a^{(i+1)} = (W^{(i+1)})^\top z^{(i)}$, $\quad z^{(0)} := x$
  • $z^{(i+1)} = g^{(i+1)}(a^{(i+1)})$
  • $f(x) = g^{(N)}\big(\cdots\, g^{(1)}(W^{(1)} x) \cdots\big)$
  • $\hat{y} = \arg\max f(x)$

  • Training: GD, Backpropagation
  • Powerful, (non–linear) classification/regression
  • Keep in mind: there are more complex architectures!
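A minimal NumPy sketch of this forward pass, following the notation above (the layer sizes and the tanh/softmax choices are illustrative assumptions, not from the slides):

```python
import numpy as np

def forward(x, weights, activations):
    """Feed-forward pass: a = W^T z, z = g(a), layer by layer."""
    z = x
    for W, g in zip(weights, activations):
        a = W.T @ z
        z = g(a)
    return z

softmax = lambda a: np.exp(a - a.max()) / np.exp(a - a.max()).sum()
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 3))]   # two toy layers
activations = [np.tanh, softmax]

f_x = forward(rng.normal(size=4), weights, activations)        # f(x)
y_hat = int(np.argmax(f_x))                                    # predicted class
```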

(Rajesh Rai, AI lecture) (http://deepdish.io)
SLIDE 12

Neural Networks: Prediction

(Zeiler, 2013)

Loss functions

  • Regression: $\mathcal{L}(\theta \mid X, y) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$
  • Multiclass classification: $\mathcal{L}(\theta \mid X, y) = -\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log\big(P(\hat{y}_{ik})\big)$
  • The last layer is usually a softmax layer: $p = z^{(l)} = \exp\big(a^{(l)}\big) \,/\, \sum_{k=1}^{K}\exp\big(a^{(l)}_k\big)$
  • In the end, we get a posterior probability distribution over the classes
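A small NumPy sketch of the two loss functions above (the toy targets and predictions are made up for illustration):

```python
import numpy as np

def mse_loss(y, y_hat):
    """Regression: L = 1/(2N) * sum_i (y_i - y_hat_i)^2."""
    return 0.5 * np.mean((y - y_hat) ** 2)

def cross_entropy_loss(Y, P):
    """Multiclass: L = -sum_i sum_k Y_ik * log(P_ik), with one-hot Y and softmax outputs P."""
    return -np.sum(Y * np.log(P + 1e-12))

Y = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])    # one-hot targets, N = 2, K = 3
P = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]])    # softmax posteriors
print(mse_loss(np.array([1.0, 2.0]), np.array([0.9, 2.2])), cross_entropy_loss(Y, P))
```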

SLIDE 13

3 Neural Network Compression & Model Compression

SLIDE 14

Pruning: Overview

  • Selectively removing weights / neurons
  • Compression: 2× to 4×
  • Usually combined with retraining

(Ben Lorica, O’Reilly Media)
SLIDE 15

Pruning: Overview

  • Selectively removing weights / neurons
  • Compression: 2× to 4×
  • Usually combined with retraining

(Ben Lorica, O’Reilly Media)

Important Questions

  • Which weights should we remove first?
  • How many weights can we remove?
  • What about the order of removal?

SLIDE 16

Motivation: Synaptic Pruning

  • In humans, we have synaptic pruning
  • This removes redundant connections in the brain

(Seeman et al., 1987)
SLIDE 17

Pruning: How do we find the least important weight(s)?

  • Brute–force Pruning
    ◇ $\mathcal{O}(N W^2)$ with $W$ weights and $N$ training samples
    ◇ Not feasible for large neural networks

SLIDE 18

Pruning: How do we find the least important weight(s)?

  • Brute–force Pruning
    ◇ $\mathcal{O}(N W^2)$ with $W$ weights and $N$ training samples
    ◇ Not feasible for large neural networks

  • Simple Heuristics
    ◇ Magnitude–Based Damage: look at $\|w\|_p$
    ◇ Variance–Based Damage

SLIDE 19

Pruning: How do we find the least important weight(s)?

  • Brute–force Pruning
    ◇ $\mathcal{O}(N W^2)$ with $W$ weights and $N$ training samples
    ◇ Not feasible for large neural networks

  • Simple Heuristics
    ◇ Magnitude–Based Damage: look at $\|w\|_p$
    ◇ Variance–Based Damage

  • More Rigorous Approaches
    ◇ Optimal Brain Damage (OBD) (LeCun et al., 1990)
    ◇ Optimal Brain Surgeon (OBS) (Hassibi et al., 1993)

SLIDE 20

Optimal Brain Damage (OBD)

  • Small perturbation: $\delta w \;\Rightarrow\; \delta \mathcal{L} = \mathcal{L}(w + \delta w) - \mathcal{L}(w)$
  • Taylor expansion:
    $\delta \mathcal{L} \approx \left(\frac{\partial \mathcal{L}}{\partial w}\right)^{\!\top} \delta w + \frac{1}{2}\,\delta w^\top H\, \delta w + \mathcal{O}(\|\delta w\|^3)$
    $\Rightarrow\; \delta \mathcal{L} \approx \sum_i \frac{\partial \mathcal{L}}{\partial w_i}\,\delta w_i + \frac{1}{2}\sum_{i,j} \delta w_i (H)_{ij}\,\delta w_j + \mathcal{O}(\|\delta w\|^3)$
  • With the Hessian: $(H)_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i\, \partial w_j}$

SLIDE 21

Optimal Brain Damage (OBD)

  • We need to deal with:
    $\delta \mathcal{L} \approx \sum_i \frac{\partial \mathcal{L}}{\partial w_i}\,\delta w_i + \underbrace{\frac{1}{2}\sum_{i,j} \delta w_i (H)_{ij}\,\delta w_j}_{\frac{1}{2}\sum_i (H)_{ii}\,\delta w_i^2 \;+\; \frac{1}{2}\sum_{i \neq j} \delta w_i (H)_{ij}\,\delta w_j} + \mathcal{O}(\|\delta w\|^3)$

Approximations

  • Extremal assumption: local optimum (training has converged)
  • Diagonal assumption: H is diagonal
  • Quadratic approximation: L is approximately quadratic

SLIDE 22

Optimal Brain Damage (OBD)

  • We need to deal with:
    $\delta \mathcal{L} \approx \sum_i \frac{\partial \mathcal{L}}{\partial w_i}\,\delta w_i + \underbrace{\frac{1}{2}\sum_{i,j} \delta w_i (H)_{ij}\,\delta w_j}_{\frac{1}{2}\sum_i (H)_{ii}\,\delta w_i^2 \;+\; \frac{1}{2}\sum_{i \neq j} \delta w_i (H)_{ij}\,\delta w_j} + \mathcal{O}(\|\delta w\|^3)$

Approximations

  • Extremal assumption: local optimum (training has converged)
  • Diagonal assumption: H is diagonal
  • Quadratic approximation: L is approximately quadratic
  • Now we are left with:
    $\delta \mathcal{L} \approx \frac{1}{2}\sum_i (H)_{ii}\,\delta w_i^2 \;\;\rightarrow\;\; s_k = \frac{1}{2}(H)_{kk}\, w_k^2$

SLIDE 23

OBD: The Algorithm

1 Choose a reasonable network architecture
2 Train the network until a reasonable local minimum is obtained
3 Compute the diagonal of the Hessian, i.e. $(H)_{kk}$
4 Compute the saliencies given by $s_k = \frac{1}{2}(H)_{kk}\, w_k^2$ for each parameter
5 Sort the parameters by $s_k$
6 Delete parameters with low saliency
7 (Optional: Iterate to step 2)
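A minimal sketch of steps 3-7, assuming we already have a flat weight vector and an estimate of the Hessian diagonal (the random toy values below are stand-ins, not LeCun et al.'s procedure):

```python
import numpy as np

def obd_prune(weights, hessian_diag, prune_fraction=0.1):
    """One OBD pruning step: rank parameters by saliency s_k = 0.5 * H_kk * w_k^2
    and zero out the lowest-saliency fraction."""
    saliency = 0.5 * hessian_diag * weights**2            # step 4
    n_prune = int(prune_fraction * weights.size)
    prune_idx = np.argsort(saliency)[:n_prune]            # steps 5-6: lowest saliency first
    pruned = weights.copy()
    pruned[prune_idx] = 0.0
    return pruned, prune_idx

# toy example with random numbers standing in for a trained network (step 2)
rng = np.random.default_rng(0)
w = rng.normal(size=10)
h_diag = np.abs(rng.normal(size=10))                      # stand-in for (H)_kk from step 3
w_pruned, removed = obd_prune(w, h_diag, prune_fraction=0.3)
# step 7: retrain the remaining weights and repeat if desired
```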

SLIDE 24

OBD: Experimental Results

  • Data: MNIST (handwritten digit recognition)
  • Left panel (a): Comparison to magnitude–based pruning
  • Right panel (b): Comparison to saliencies

(Le Cun et al., 1990)
SLIDE 25

OBD: Experimental Results – With Retraining

  • This is what it looks like with retraining.
  • Left panel (a): Retraining (training data)
  • Right panel (b): Retraining (test data)

(Le Cun et al., 1990)
SLIDE 26

Optimal Brain Surgeon (OBS)

  • Now: Use the full Hessian $H$
  • We want to set one of the weights to zero: $\underbrace{\delta w^\top \hat{e}_k}_{\delta w_k} + w_k = 0$
  • Solve the optimization problem
    $\min_k \Big\{ \min_{\delta w} \tfrac{1}{2}\,\delta w^\top H\, \delta w \;\Big|\; \delta w^\top \hat{e}_k + w_k = 0 \Big\}$
  • Lagrangian:
    $\Lambda = \tfrac{1}{2}\,\delta w^\top H\, \delta w + \lambda \big( \delta w^\top \hat{e}_k + w_k \big)$

SLIDE 27

OBS: Solving the optimization problem

  • Generalized saliency: $s_k = \dfrac{1}{2}\,\dfrac{w_k^2}{(H^{-1})_{kk}}$
  • Note: If $H^{-1}$ is diagonal, $(H^{-1})_{kk} = \big((H)_{kk}\big)^{-1} \;\Rightarrow\; s_k = \tfrac{1}{2}(H)_{kk}\, w_k^2$
  • Optimal weight change: $\delta w = -\dfrac{w_k}{(H^{-1})_{kk}}\; H^{-1}\, \hat{e}_k$

The obvious drawback...

  • However: We need $H^{-1}$
  • Fortunately: It is possible to recursively calculate $H^{-1}$ (see Hassibi et al., 1993)
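A minimal sketch of one OBS step, assuming the network is small enough that $H^{-1}$ can simply be obtained with np.linalg.inv rather than the recursive computation from Hassibi et al. (the toy Hessian and weights below are made up):

```python
import numpy as np

def obs_step(w, H_inv):
    """One Optimal Brain Surgeon step: pick the weight with the smallest
    generalized saliency s_k = 0.5 * w_k^2 / (H^-1)_kk and update all weights."""
    saliency = 0.5 * w**2 / np.diag(H_inv)
    k = int(np.argmin(saliency))                      # weight to remove
    delta_w = -(w[k] / H_inv[k, k]) * H_inv[:, k]     # optimal change for *all* weights
    return w + delta_w, k

# toy example: a random positive-definite matrix stands in for the real Hessian
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + 5 * np.eye(5)
w = rng.normal(size=5)
w_new, removed = obs_step(w, np.linalg.inv(H))
print(removed, np.round(w_new, 3))                    # w_new[removed] is (numerically) zero
```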

SLIDE 28

Knowledge Distillation: General Idea

  • Try to approximate bigger/more complex neural nets (models) with smaller neural nets (models) of similar generalization performance.
  • General Idea of Distillation: Have a student model mimic the teacher’s function directly.

Motivation: An Analogy to Insects

  • Larval form: Optimized for extracting energy and nutrients
  • Adult form: Optimized for traveling and reproduction

Similarly: There are different requirements for training and testing

  • Training: extract knowledge from training data, not time–critical
  • Testing: fast (real–time) prediction, energy efficiency

SLIDE 29

Caruana et al. (2006) – Model Compression

  • Transfer learning: Try to match the logits produced by the teacher net
  • Logit: The input to the softmax layer
  • This is just a regression problem (regressing the logits with an ℓ2 loss)!

(Yangyang, 2014)

The Algorithm

1 Feed teacher with data
2 Obtain logits from teacher (transfer training set)
3 Train student on these logits (ℓ2-regression)

  • Note that we can use unlabeled data now to train the student!
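A minimal NumPy sketch of these steps, with a random linear map standing in for the teacher and a linear student fitted by gradient descent (these stand-ins are assumptions for illustration; the real teacher and student would be neural networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical stand-in for the trained teacher: any map from inputs to logits
W_teacher = rng.normal(size=(20, 5))
def teacher_logits(X):
    return X @ W_teacher                      # logit = input to the softmax layer

X_transfer = rng.normal(size=(1000, 20))      # transfer set: can be unlabeled data
Z = teacher_logits(X_transfer)                # step 2: logits from the teacher

# step 3: fit a small (here: linear) student to the logits with an l2 loss
W_student = np.zeros((20, 5))
lr = 0.1
for _ in range(300):
    residual = X_transfer @ W_student - Z
    W_student -= lr * X_transfer.T @ residual / len(X_transfer)
print(np.mean((X_transfer @ W_student - Z) ** 2))   # l2 matching error, close to zero
```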

SLIDE 30

Hinton et al. (2015) – Dark Knowledge

  • Transfer learning: Try to match soft targets produced by the teacher net
  • Softmax layer with temperature $T$: $p = z^{(l)} = \exp\big(a^{(l)}/T\big) \,/\, \sum_{k=1}^{K} \exp\big(a^{(l)}_k/T\big)$
  • This will soften the posterior distribution
  • For $T \to \infty$, this is equivalent to the Caruana approach (assuming logits are zero–meaned)

(Yangyang, 2014)

The Algorithm

1 Feed teacher with data ($T_1$)
2 Obtain soft targets from teacher ($T_1$) (transfer training set)
3 Train student on soft targets ($T_1$) with cross–entropy loss
4 Use student with $T < T_1$
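A minimal NumPy sketch of steps 2-3 with toy logits (the temperature value and the logits are made up; gradients and the optimizer are omitted):

```python
import numpy as np

def softmax(a, T=1.0):
    """Temperature softmax: p_k = exp(a_k / T) / sum_j exp(a_j / T)."""
    e = np.exp((a - a.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

T1 = 5.0
teacher_logits = np.array([[9.0, 5.0, 1.0]])
student_logits = np.array([[7.0, 6.0, 2.0]])          # whatever the student currently outputs

soft_targets = softmax(teacher_logits, T=T1)           # step 2: much "softer" than at T = 1
student_probs = softmax(student_logits, T=T1)          # student trained at the same T1
loss = -np.sum(soft_targets * np.log(student_probs))   # step 3: cross-entropy on soft targets
print(soft_targets, loss)                              # step 4: deploy the student with T < T1
```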

SLIDE 31

Knowledge Distillation: Dark Knowledge

Why does this work?

  • Dark Knowledge (Geoffrey Hinton)
    ◇ Knowledge Distillation works because most of the knowledge in the learned model is in the relative probabilities of extremely improbable wrong answers.

SLIDE 32

Knowledge Distillation: Dark Knowledge

Why does this work?

  • Dark Knowledge (Geoffrey Hinton)
    ◇ Knowledge Distillation works because most of the knowledge in the learned model is in the relative probabilities of extremely improbable wrong answers.

  • Let’s look at an example:
    ◇ Truth: $y = (0\;\ 1\;\ 0\;\ 0)^\top$ over $(P_{\text{cow}}\;\ P_{\text{dog}}\;\ P_{\text{cat}}\;\ P_{\text{boat}})^\top$
    ◇ Teacher output: $\hat{y} = (10^{-6}\;\ 0.9\;\ 0.1\;\ 10^{-9})^\top$
    ◇ Softened output: $\tilde{y} = (0.05\;\ 0.4\;\ 0.3\;\ 10^{-3})^\top$

SLIDE 33

Deep Compression: Overview

  • Overall compression: up to 49×
  • More focused on reducing memory and battery footprint
  • EIE: Efficient Inference Engine on Compressed Deep Neural Network

(Han et al., 2015)
SLIDE 34

Deep Compression: Weight Quantization / Sharing

(Han et al., 2015)
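The cited figures illustrate the weight-sharing idea of Han et al. (2015): weights are clustered with k-means and each weight stores only a small index into a codebook of shared centroid values (which the paper then fine-tunes). A minimal sketch of the clustering step (the cluster count and the plain k-means loop below are simplifications for illustration):

```python
import numpy as np

def quantize_weights(w, n_clusters=16, n_iter=20):
    """Weight sharing: k-means over the weight values; store per-weight cluster
    indices (4 bits for 16 clusters) plus a small codebook of centroids."""
    centroids = np.linspace(w.min(), w.max(), n_clusters)   # linear initialization
    for _ in range(n_iter):
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = w[idx == k].mean()
    return idx.astype(np.uint8), centroids                   # indices + codebook

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
idx, codebook = quantize_weights(w)
w_hat = codebook[idx]                                        # reconstructed (shared) weights
print(np.mean((w - w_hat) ** 2))                             # small quantization error
```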

SLIDE 35

Deep Compression: Huffman Coding

  • Lossless compression: Optimal prefix code
  • General idea: Represent more common symbols with fewer bits than less common symbols.
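A compact sketch of building such an optimal prefix code over, e.g., quantized weight indices, using Python's heapq (the toy symbol counts are made up):

```python
import heapq
from collections import Counter

def huffman_code(counts):
    """Build an optimal prefix code: repeatedly merge the two least frequent nodes."""
    heap = [[freq, i, {sym: ""}] for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        lo[2] = {s: "0" + c for s, c in lo[2].items()}   # left branch gets a leading 0
        hi[2] = {s: "1" + c for s, c in hi[2].items()}   # right branch gets a leading 1
        heapq.heappush(heap, [lo[0] + hi[0], tiebreak, {**lo[2], **hi[2]}])
        tiebreak += 1
    return heap[0][2]

# e.g. cluster indices after weight quantization, with a skewed distribution
symbols = Counter({0: 60, 1: 25, 2: 10, 3: 5})
code = huffman_code(symbols)
print(code)   # frequent symbols get shorter bit strings
```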

(Han et al., 2015)
SLIDE 36

Deep Compression: Experimental Results

(Han et al., 2015)

  • Deep Compression on AlexNet:

(Han et al., 2015)
SLIDE 37

4 Summary

SLIDE 38

Summary

Model Compression: Why?

  • We can improve speed in training/testing
  • We can reduce the memory footprint
  • We can reduce energy consumption
  • We can (sometimes) improve predictive performance

Model Compression: Different Approaches

1 Pruning: Selectively removing weights (by saliency for OBD & OBS)
2 Knowledge Distillation: Try to distill the model’s function $f(x)$ directly
3 Deep Compression: Pruning – Quantization – Encoding

SLIDE 39

Reading / Resources

  • Knowledge Distillation
    ◇ [1] Distilling the Knowledge in a Neural Network, Hinton et al. (2015)
    ◇ [2] Model Compression, Caruana et al. (2006)
    ◇ [3] Do Deep Nets Really Need to be Deep?, Caruana et al. (2014)
  • Pruning
    ◇ Overview: [4] Pruning algorithms – a survey, Reed (1993)
    ◇ [5] Optimal Brain Damage, Le Cun et al. (1990)
    ◇ [6] Optimal Brain Surgeon, Hassibi et al. (1993)
    ◇ [7] Learning both Weights and Connections for Efficient Neural Networks, Han et al. (2015)
  • Deep Compression
    ◇ [8] Deep Compression, Han et al. (2016)
    ◇ [9] Efficient Inference Engine, Han et al. (2016)

SLIDE 40


Thank you!

SLIDE 41

5 Extra slides

SLIDE 42

Caruana vs. Hinton

  • What’s the connection between matching logits and minimizing cross–entropy with soft targets?

  • Note: $e^x \approx 1 + x$ for small $x$
  • $T\,\dfrac{\partial \mathcal{L}}{\partial z_i} = q_i - p_i = \dfrac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \dfrac{e^{v_i/T}}{\sum_j e^{v_j/T}}$
  • $T\,\dfrac{\partial \mathcal{L}}{\partial z_i} \approx \dfrac{1 + z_i/T}{N + \sum_j z_j/T} - \dfrac{1 + v_i/T}{N + \sum_j v_j/T} = \dfrac{1}{N T}\,(z_i - v_i)$
  • ...assuming that both sets of logits are zero–meaned. Here $N$ is the number of classes ($z$: student logits, $v$: teacher logits).

  • ⇒ Matching the logits of the cumbersome model is a special case of distillation.
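A small numeric check of this limit (a sketch with arbitrary zero-mean toy logits): the exact soft-target gradient $\frac{1}{T}(q_i - p_i)$ approaches $\frac{1}{N T^2}(z_i - v_i)$, i.e. a scaled ℓ2 logit-matching gradient, as $T$ grows.

```python
import numpy as np

def softmax(a, T):
    e = np.exp((a - a.max()) / T)
    return e / e.sum()

z = np.array([2.0, -0.5, -1.5])          # student logits (zero mean)
v = np.array([1.0, 0.5, -1.5])           # teacher logits (zero mean)
N = len(z)

for T in [1, 10, 100]:
    exact = (softmax(z, T) - softmax(v, T)) / T     # d(cross-entropy)/dz_i at temperature T
    approx = (z - v) / (N * T**2)                   # the high-temperature limit from above
    print(T, np.max(np.abs(exact - approx)))        # the difference shrinks as T grows
```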

SLIDE 43

OBD: How do we compute the second derivatives?

  • This can be done similarly to backpropagation
  • In general: $y_i = g(b_i)$, $\; b_i = \sum_j w_{ij}\, y_j$
  • Diagonal of the Hessian:
    $h_{kk} = (H)_{kk} = \sum_{(i,j)} \frac{\partial^2 \mathcal{L}}{\partial w_{ij}^2} = \sum_{(i,j)} \frac{\partial^2 \mathcal{L}}{\partial b_i^2}\; y_j^2$
  • Then:
    $\frac{\partial^2 \mathcal{L}}{\partial b_i^2} = g'(b_i)^2 \sum_l w_{li}^2\; \frac{\partial^2 \mathcal{L}}{\partial b_l^2} + g''(b_i)\, \frac{\partial \mathcal{L}}{\partial y_i}$
