Rate Distortion for Model Compression: From Theory To Practice


SLIDE 1

Rate Distortion for Model Compression: From Theory To Practice

Weihao Gao∗, Yu-Han Liu†, Chong Wang‡ and Sewoong Oh§

∗UIUC, †Google, ‡Bytedance, §Univ of Washington

June 10, 2019



SLIDE 7

Motivation

Neural networks are becoming more and more powerful.
They are also becoming larger and larger:

LeNet 40K, AlexNet 62M, BERT 110M (base) / 340M (large) parameters

Compressing models is necessary to save:

training and inference time
storage space, e.g., for mobile apps

Two fundamental questions about model compression:

1. Is there any theoretical understanding of the fundamental limit of model compression algorithms?

2. How can theoretical understanding help us improve practical compression algorithms?



SLIDE 10

Fundamental limit for model compression

Trade-off between compression ratio and quality of compressed model

Figure 1: Trade-off between compression ratio and cross entropy loss (curves: uncompressed, baseline, proposed).

Fundamental question: given a pretrained model $f_w(x)$, how well can we compress the model at a given compression ratio?



SLIDE 15

Rate distortion for model compression

We bring the tool of rate distortion theory from information theory.

Rate: average number of bits to represent the parameters

Distortion: difference between the compressed model and the original model

For regression: $d(w, \hat{w}) = \mathbb{E}_X\left[\|f_w(X) - f_{\hat{w}}(X)\|^2\right]$

For classification: $d(w, \hat{w}) = \mathbb{E}_X\left[D_{\mathrm{KL}}\left(f_{\hat{w}}(X) \,\|\, f_w(X)\right)\right]$

Rate-distortion theorem for model compression:
$$R(D) = \min_{P_{\hat{W}|W}:\ \mathbb{E}[d(W, \hat{W})] \le D} I(W; \hat{W})$$

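Both distortions are expectations over the data, so they can be estimated by Monte Carlo sampling. A minimal NumPy sketch for the classification case, assuming `f_orig` and `f_comp` are callables returning class probabilities (the helper names are ours, not from the talk):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) along the last axis, for batches of categorical distributions.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def estimate_distortion(f_orig, f_comp, X):
    # Monte Carlo estimate of d(w, w_hat) = E_X[ D_KL(f_what(X) || f_w(X)) ];
    # note the compressed model is the first argument of the KL divergence.
    return float(np.mean(kl_divergence(f_comp(X), f_orig(X))))
```

For regression, the same estimator applies with the squared error $\|f_w(X) - f_{\hat{w}}(X)\|^2$ in place of the KL term.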


SLIDE 21

Our contributions

Generally, it is intractable to evaluate R(D) due to:

the high dimensionality of the parameters
complicated non-linearities

In this talk, our contributions are:

For the linear regression model, we give a lower bound on R(D) and an algorithm achieving the lower bound.
Inspired by the optimal algorithm, we propose two “golden rules” for model compression.
We prove the optimality of the proposed “golden rules” for a one-layer ReLU network.
We show that an algorithm following the “golden rules” performs better on real models.



SLIDE 24

Linear regression

Consider the linear regression model $f_w(x) = w^T x$ and the following assumptions:

Weights $W$ are drawn from $\mathcal{N}(0, \Sigma_W)$.
Data $X$ has zero mean, $\mathbb{E}[X_i^2] = \lambda_{x,i}$, and $\mathbb{E}[X_i X_j] = 0$ for $i \neq j$.

Theorem: the rate distortion function is lower bounded by
$$R(D) \ \ge\ \underline{R}(D) = \frac{1}{2}\log\det(\Sigma_W) - \sum_{i=1}^{m} \frac{1}{2}\log(D_i),$$
where
$$D_i = \begin{cases} \mu/\lambda_{x,i} & \text{if } \mu < \lambda_{x,i}\,\mathbb{E}_W[W_i^2], \\ \mathbb{E}_W[W_i^2] & \text{if } \mu \ge \lambda_{x,i}\,\mathbb{E}_W[W_i^2], \end{cases}$$
and $\mu$ is chosen such that $\sum_{i=1}^{m} \lambda_{x,i} D_i = D$.

The lower bound is tight for linear regression.

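Since the bound has a reverse water-filling form, it is straightforward to evaluate numerically. A minimal NumPy sketch, assuming a diagonal $\Sigma_W$ (so the $\log\det$ term splits per coordinate) and bisecting on the water level $\mu$; the function and argument names are ours:

```python
import numpy as np

def rd_lower_bound(var_w, lam_x, D, tol=1e-10):
    # Evaluate R_(D) = 1/2 sum_i log(var_w[i]) - 1/2 sum_i log(D_i) in nats,
    # with D_i = min(mu / lam_x[i], var_w[i]) and mu chosen so that
    # sum_i lam_x[i] * D_i = D (reverse water-filling).
    # Requires 0 < D <= sum(lam_x * var_w).
    def total_distortion(mu):
        return np.sum(lam_x * np.minimum(mu / lam_x, var_w))

    lo, hi = 0.0, np.max(lam_x * var_w)  # total_distortion(hi) is the maximum
    while hi - lo > tol:                 # total_distortion is increasing in mu
        mu = (lo + hi) / 2
        if total_distortion(mu) < D:
            lo = mu
        else:
            hi = mu
    D_i = np.minimum(lo / lam_x, var_w)
    return 0.5 * np.sum(np.log(var_w)) - 0.5 * np.sum(np.log(D_i))

# e.g. rd_lower_bound(var_w=np.array([1.0, 0.5, 0.25]), lam_x=np.ones(3), D=0.5)
```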


SLIDE 27

From theory to practice

Two “golden rules” of the optimal compressor:

1. Orthogonality: $\mathbb{E}_{W,\hat{W}}\left[\hat{W}^T \Sigma_X (W - \hat{W})\right] = 0$

2. Minimization: $\mathbb{E}_{W,\hat{W}}\left[(W - \hat{W})^T \Sigma_X (W - \hat{W})\right]$ should be minimized, given a certain rate.

Modified “golden rules” for practice:

1. Orthogonality: $\hat{w}^T I_w (w - \hat{w}) = 0$

2. Minimization: $(w - \hat{w})^T I_w (w - \hat{w})$ is minimized, given certain constraints.

Here $I_w$ is the weight importance matrix:

For regression, $I_w = \mathbb{E}_X\left[\nabla_w f_w(X)\,(\nabla_w f_w(X))^T\right]$

For classification, $I_w = \mathbb{E}_X\left[(\nabla_w f_w(X))\,\mathrm{diag}\!\left[f_w^{-1}(X)\right](\nabla_w f_w(X))^T\right]$

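In practice the expectation over $X$ is estimated on a finite sample, and (as in the experiments later) only the diagonal of $I_w$ is kept. A sketch for the classification case; `jacobian_fn` and `predict_fn` are hypothetical stand-ins for the model's per-input Jacobian and output probabilities:

```python
import numpy as np

def importance_diag(jacobian_fn, predict_fn, X):
    # Estimates diag(I_w)_j ~ E_X[ sum_k (d f_k / d w_j)^2 / f_k(X) ]
    # for a classifier with K outputs and m parameters.
    acc = None
    for x in X:
        J = jacobian_fn(x)                    # shape (K, m)
        p = np.maximum(predict_fn(x), 1e-12)  # shape (K,), guard tiny probs
        contrib = np.sum(J**2 / p[:, None], axis=0)
        acc = contrib if acc is None else acc + contrib
    return acc / len(X)
```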


SLIDE 31

Optimality of “golden rules”

One-layer ReLU model $f_w(x) = \mathrm{ReLU}(w^T x)$.

Data $X$ has zero mean, $\mathbb{E}[X_i^2] = \lambda_{x,i}$, and $\mathbb{E}[X_i X_j] = 0$ for $i \neq j$.

For pruning and quantization algorithms, if a compressor minimizes $(w - \hat{w})^T I_w (w - \hat{w})$, it automatically satisfies orthogonality: $\hat{w}^T I_w (\hat{w} - w) = 0$. Hence, for pruning and quantization, minimizing the objective $(w - \hat{w})^T I_w (w - \hat{w})$ is equivalent to minimizing the MSE loss.

For practical models, we test the objective on real data.



SLIDE 36

Real data experiment

CIFAR-10 with 5 conv layers + 3 fc layers (more experiments in the full paper)

Algorithms:

Pruning: the same prune ratio for all conv and fc layers
Quantization: the same number of clusters for all conv and fc layers

Recall that for the classification problem, $I_w = \mathbb{E}_X\left[(\nabla_w f_w(X))\,\mathrm{diag}\!\left[f_w^{-1}(X)\right](\nabla_w f_w(X))^T\right]$

We drop the off-diagonal terms of $I_w$ and compare with the baseline $I_w$ = identity.

Table 1: Comparison of unsupervised compression objectives.

Baseline: $\sum_{i=1}^m (w_i - \hat{w}_i)^2$
Proposed: $\sum_{i=1}^m \mathbb{E}_X\!\left[\frac{(\nabla_{w_i} f_w(X))^2}{f_w(X)}\right](w_i - \hat{w}_i)^2$

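With a diagonal importance, both algorithms reduce to simple weighted problems, as in the sketch below (our helper names; `importance` is the estimated diagonal of $I_w$, and all-ones recovers the baseline):

```python
import numpy as np

def prune_by_importance(w, importance, keep_ratio):
    # Pruning w[i] to zero costs importance[i] * w[i]**2 in the objective
    # sum_i importance[i] * (w[i] - w_hat[i])**2, so keep the largest costs.
    cost = importance * w**2
    k = int(np.ceil(keep_ratio * w.size))
    keep = np.argsort(cost)[-k:]
    w_hat = np.zeros_like(w)
    w_hat[keep] = w[keep]
    return w_hat

def quantize_by_importance(w, importance, n_clusters, n_iters=50):
    # Weighted 1-D k-means: locally minimizes
    # sum_i importance[i] * (w[i] - centers[assign[i]])**2.
    centers = np.quantile(w, np.linspace(0.0, 1.0, n_clusters))
    for _ in range(n_iters):
        assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        for c in range(n_clusters):
            mask = assign == c
            if mask.any():
                centers[c] = np.average(w[mask], weights=importance[mask] + 1e-12)
    return centers[assign]
```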

SLIDE 37

Real data experiment

Figure 2: Results for the unsupervised experiment: accuracy and cross entropy vs. compression ratio (uncompressed, baseline, proposed). Left: pruning. Right: quantization.



SLIDE 40

Real data experiment

In the previous experiments, we did not use the training labels.

To use the training labels, treat the loss function $L_w(x, y) = L(f_w(x), y)$ as the function to be compressed and define $I_w = \mathbb{E}\left[\nabla_w L_w(X, Y)\,(\nabla_w L_w(X, Y))^T\right]$

By first- and second-order approximations of $L$, we propose:

Table 2: Comparison of supervised compression objectives.

Baseline: $\sum_{i=1}^m (w_i - \hat{w}_i)^2$
Gradient (1st-order approx. of $L$): $\sum_{i=1}^m \mathbb{E}[(\nabla_{w_i} L_w(X, Y))^2](w_i - \hat{w}_i)^2$
Hessian ([LeCun '90]): $\sum_{i=1}^m \mathbb{E}[\nabla^2_{w_i} L_w(X, Y)](w_i - \hat{w}_i)^2$
Gradient+Hessian (2nd-order approx. of $L$): $\sum_{i=1}^m \mathbb{E}[(\nabla_{w_i} L_w(X, Y))^2](w_i - \hat{w}_i)^2 + \frac{1}{4}\sum_{i=1}^m \mathbb{E}[(\nabla^2_{w_i} L_w(X, Y))^2](w_i - \hat{w}_i)^4$

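A sketch of evaluating the Gradient+Hessian objective for a candidate $\hat{w}$ (our function name; `grad_sq` and `hess_sq` are the per-weight moments $\mathbb{E}[(\nabla_{w_i} L)^2]$ and $\mathbb{E}[(\nabla^2_{w_i} L)^2]$ estimated on training batches):

```python
import numpy as np

def gradient_hessian_objective(w, w_hat, grad_sq, hess_sq):
    # sum_i E[(dL/dw_i)^2] * (w_i - w_hat_i)^2
    #   + (1/4) sum_i E[(d^2 L/dw_i^2)^2] * (w_i - w_hat_i)^4
    d = w - w_hat
    return float(np.sum(grad_sq * d**2) + 0.25 * np.sum(hess_sq * d**4))
```

Dropping the quartic term recovers the Gradient objective; replacing `grad_sq` with the mean second derivative $\mathbb{E}[\nabla^2_{w_i} L]$ and dropping the quartic term recovers the Hessian objective of [LeCun '90].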

SLIDE 41

Real data experiment

Figure 3: Results for the supervised experiment: accuracy and cross entropy vs. compression ratio (uncompressed, baseline, gradient, hessian, gradient+hessian). Left: pruning. Right: quantization.


SLIDE 42

Thanks

Thank you for your attention! Please visit our poster (#169) tonight.
