EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis


SLIDE 1

EigenDamage: Structured Pruning in the Kronecker-Factored Eigenbasis

Chaoqi Wang, Roger Grosse, Sanja Fidler and Guodong Zhang

University of Toronto, Vector Institute

Jun 12, 2019


SLIDE 2

Structured Pruning

Structured pruning removes entire filters or neurons rather than individual weights, so the pruned network stays dense and GPU-friendly.

Figure 1: An illustration of structured pruning.
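To make this concrete, here is a minimal PyTorch sketch of structured pruning (PyTorch is assumed since the paper's released code is in PyTorch; the layer sizes and the kept-filter indices are illustrative): whole filters are removed from one convolution together with the matching input channels of the next, leaving smaller dense layers that need no sparse masks.

import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)

keep = [0, 2, 5, 7, 11]  # indices of conv1 filters to keep (illustrative)

pruned1 = nn.Conv2d(3, len(keep), kernel_size=3, padding=1)
pruned2 = nn.Conv2d(len(keep), 32, kernel_size=3, padding=1)
with torch.no_grad():
    pruned1.weight.copy_(conv1.weight[keep])     # drop whole output filters
    pruned1.bias.copy_(conv1.bias[keep])
    pruned2.weight.copy_(conv2.weight[:, keep])  # drop the matching input channels
    pruned2.bias.copy_(conv2.bias)

x = torch.randn(1, 3, 32, 32)
print(pruned2(torch.relu(pruned1(x))).shape)     # torch.Size([1, 32, 32, 32])

Because entire channels disappear, the pruned model is an ordinary smaller dense network, which is why structured pruning maps well onto GPUs.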


SLIDE 3

Background: Hessian-Based Pruning Methods

Hessian-based methods have three appealing properties:

1. The pruning criterion is calibrated across layers.
2. The network structure is determined automatically.
3. Few hyperparameters are required (only the pruning ratio).

They rely on a Taylor expansion of the loss around the minimum θ∗, which directly approximates the effect on the loss of removing a given weight, i.e. of a perturbation ∆θ:

∆L = (∂L/∂θ)⊤∆θ + ½ ∆θ⊤H∆θ + O(‖∆θ‖³)    (1)

where the first-order term is ≈ 0 at the minimum and the higher-order terms are dropped.
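As a sanity check of Eq. (1), a minimal numpy sketch (the quadratic toy loss and all names here are illustrative, not the paper's code): at a minimum the gradient term vanishes, so zeroing out one weight costs approximately ½ ∆θ⊤H∆θ.

import numpy as np

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)             # positive-definite Hessian at the minimum
theta_star = rng.standard_normal(d)

def loss(theta):
    # toy quadratic loss with minimum theta_star and Hessian H
    delta = theta - theta_star
    return 0.5 * delta @ H @ delta

# "prune" weight q: set it to zero, i.e. dtheta = -theta_star[q] * e_q
q = 2
dtheta = np.zeros(d)
dtheta[q] = -theta_star[q]
exact = loss(theta_star + dtheta) - loss(theta_star)
approx = 0.5 * dtheta @ H @ dtheta  # Eq. (1) with the gradient term dropped
print(exact, approx)                # identical here since the toy loss is quadratic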


SLIDE 4

Background: Hessian-Based Pruning Methods

Two representative methods are Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS).

OBD uses a diagonal Hessian for fast computation, and scores each weight θ∗_q by:

∆L_OBD = ½ (θ∗_q)² H_qq    (2)

OBS uses the full Hessian to account for correlations between weights, and scores each weight θ∗_q by:

∆L_OBS = (θ∗_q)² / (2 [H⁻¹]_qq)    (3)
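A minimal numpy sketch of Eqs. (2) and (3), assuming a small exactly known toy Hessian (real networks need the approximations introduced on the following slides):

import numpy as np

rng = np.random.default_rng(0)
d = 6
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)          # positive-definite toy Hessian
theta = rng.standard_normal(d)   # trained weights theta*

obd = 0.5 * theta**2 * np.diag(H)                 # Eq. (2): diagonal of H
obs = 0.5 * theta**2 / np.diag(np.linalg.inv(H))  # Eq. (3): diagonal of H^{-1}

# Both rank weights by estimated loss increase, but the rankings can disagree.
print(np.argsort(obd))
print(np.argsort(obs))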


SLIDE 5

Is OBS always better than OBD?

In the original paper, OBS is guaranteed to be better than OBD when pruning weights one at a time (i.e., recomputing the Hessian after each pruning step). In practice, however, we prune multiple weights at a time.


SLIDE 6

Is OBS always better than OBD?

We would like to ask: is OBS always better than OBD when pruning multiple weights at a time? At first glance... yes? After all, OBS uses the full Hessian, while OBD uses only its diagonal.


SLIDES 7–8

Bayesian Interpretations

Surprisingly, no, even if we can compute the exact Hessian!

Bayesian interpretations of OBD and OBS:

Figure: a highly coupled weight posterior (c) and its factorial Gaussian approximations under forward KL (a) and reverse KL (b).

Both OBD and OBS use a factorial Gaussian to approximate the highly coupled weight posterior (c), but under different objectives:

OBD: reverse KL divergence (b), which is too pessimistic.
OBS: forward KL divergence (a), which is too optimistic.

Neither is necessarily better than the other. More details in the paper and at Poster #22!
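To make the two objectives concrete, a minimal numpy sketch (toy posterior, illustrative names): fitting a factorial Gaussian to the correlated Gaussian posterior N(0, H⁻¹) under reverse KL gives per-weight variance 1/H_qq (the quantity behind OBD), while forward KL moment-matches and gives [H⁻¹]_qq (the quantity behind OBS).

import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)            # posterior precision (the Hessian)
Sigma = np.linalg.inv(H)           # posterior covariance

var_reverse_kl = 1.0 / np.diag(H)  # reverse KL: underestimates the marginals (OBD, pessimistic)
var_forward_kl = np.diag(Sigma)    # forward KL: matches the marginals (OBS, optimistic)

print(var_reverse_kl)
print(var_forward_kl)              # elementwise >= the reverse-KL variances

Since [H⁻¹]_qq ≥ 1/H_qq for any positive-definite H, OBD always predicts at least as much damage as OBS for the same weight, yet neither prediction is exact once several correlated weights are removed together.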


SLIDES 9–11

Method

OBD and OBS rely on the diagonal of the Hessian and the diagonal of its inverse, respectively; both fail to capture correlations when pruning multiple weights at a time.

Solution: prune in a new coordinate system (i.e., a new basis) in which the posterior is closer to factorial. Ideally, the new basis would be the eigenbasis of the Hessian; a sketch follows this list. But there are issues:

1. The exact Hessian is intractable for large neural networks.
2. The new basis introduces extra parameters.
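A minimal numpy sketch of this idea (illustrative; it uses the exact eigenbasis, which, as just noted, is intractable at scale): in the eigenbasis the Hessian is exactly diagonal, so a per-coordinate OBD-style score accounts for all correlations.

import numpy as np

rng = np.random.default_rng(0)
d = 6
M = rng.standard_normal((d, d))
H = M @ M.T + np.eye(d)
theta = rng.standard_normal(d)

lam, Q = np.linalg.eigh(H)  # H = Q diag(lam) Q^T
theta_rot = Q.T @ theta     # weights expressed in the eigenbasis

# In the eigenbasis the Hessian is diag(lam), so the OBD score is exact there:
scores = 0.5 * theta_rot**2 * lam

# Prune (zero out) the k least important rotated coordinates, then rotate back.
k = 2
theta_rot[np.argsort(scores)[:k]] = 0.0
theta_pruned = Q @ theta_rot
print(theta_pruned)

Note the two issues above in miniature: computing Q required the full Hessian, and the pruned model must now store Q in addition to the surviving coordinates.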


SLIDES 12–14

Approximating Hessian with K-FAC Fisher

1. The Fisher information matrix (FIM) is commonly adopted as an approximation to the Hessian:

F = E[∇θ log p(y|x; θ) ∇θ log p(y|x; θ)⊤]    (4)

2. K-FAC decomposes the FIM of a neural network into the Kronecker product of two small matrices under an independence assumption:

F = E[DsDs⊤ ⊗ aa⊤] ≈ E[DsDs⊤] ⊗ E[aa⊤] = S ⊗ A    (5)

3. Eigendecomposing the two small factors yields the Kronecker-factored eigenbasis (KFE), Q_S ⊗ Q_A:

F ≈ (Q_S Λ_S Q_S⊤) ⊗ (Q_A Λ_A Q_A⊤) = (Q_S ⊗ Q_A)(Λ_S ⊗ Λ_A)(Q_S ⊗ Q_A)⊤    (6)
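A minimal PyTorch sketch of Eqs. (4)–(6) for a single fully connected layer, assuming per-sample layer inputs a and pre-activation gradients Ds are given (both names, and the sizes, are illustrative):

import torch

torch.manual_seed(0)
n, d_in, d_out = 1024, 20, 10
a = torch.randn(n, d_in)   # layer inputs
Ds = torch.randn(n, d_out) # gradients w.r.t. the pre-activations

# Eq. (5): Kronecker factors of this layer's Fisher block
A = a.T @ a / n            # A = E[a a^T]      (d_in  x d_in)
S = Ds.T @ Ds / n          # S = E[Ds Ds^T]    (d_out x d_out)

# Eq. (6): eigendecompose the two small factors to obtain the KFE
Lam_A, Q_A = torch.linalg.eigh(A)
Lam_S, Q_S = torch.linalg.eigh(S)

# In the KFE the Fisher is (approximately) diagonal, with per-weight scales
# Lam_S[i] * Lam_A[j]; the full (d_in*d_out)^2 Fisher is never formed.
lam = torch.outer(Lam_S, Lam_A)
print(lam.shape)           # (d_out, d_in), the same shape as the weight matrix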




SLIDES 15–17

EigenDamage: Structured Pruning in the KFE

Figure: (a) pruning in weight space vs. (b) pruning in the Kronecker-factored eigenspace, where the rotations turn the layer into a bottleneck. Original layer: 3×3 conv, 512→512 (100% params, 100% FLOPs). Bottleneck before pruning: 1×1 conv 512→512 (11%), 3×3 conv 512→512 (100%), 1×1 conv 512→512 (11%). Bottleneck after pruning: 1×1 conv 512→32 (0.7%), 3×3 conv 32→32 (0.4% params, 0.3% FLOPs), 1×1 conv 32→512 (0.7%); input 32×32×512 throughout.

Rotate the weights into the KFE by:

vec(W) = (Q_S ⊗ Q_A)⊤ vec(W′) = vec(Q_A⊤ W′ Q_S)    (7)

where W′ denotes the weights in the original coordinates and W their representation in the KFE.
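A minimal PyTorch sketch of Eq. (7) and the bottleneck it induces. The orthogonal bases and eigenvalues below are random stand-ins for the K-FAC eigenpairs of the previous slides, and the row/column selection is an illustrative simplification of the paper's pruning procedure (in the code, W plays the role of W′ above, and W_kfe the role of the rotated W):

import torch

torch.manual_seed(0)
d_in, d_out, k = 512, 512, 32
W = torch.randn(d_out, d_in)                      # original layer weights
Q_A = torch.linalg.qr(torch.randn(d_in, d_in)).Q  # stand-in orthogonal bases
Q_S = torch.linalg.qr(torch.randn(d_out, d_out)).Q
Lam_A, Lam_S = torch.rand(d_in), torch.rand(d_out)

# Eq. (7): the weights expressed in the KFE
W_kfe = Q_S.T @ W @ Q_A

# OBD-style score of each KFE weight: 1/2 * w^2 * (its diagonal Fisher scale)
scores = 0.5 * W_kfe**2 * torch.outer(Lam_S, Lam_A)

# Structured step: keep only the k most important input/output eigen-directions.
keep_in = scores.sum(dim=0).topk(k).indices
keep_out = scores.sum(dim=1).topk(k).indices
B = W_kfe[keep_out][:, keep_in]                   # k x k core

# The pruned layer is the bottleneck x -> Q_A[:, keep_in]^T x -> B -> Q_S[:, keep_out],
# i.e. two thin rotations around a small core (the 1x1 convs in the figure above).
x = torch.randn(8, d_in)
y = (x @ Q_A[:, keep_in]) @ B.T @ Q_S[:, keep_out].T
print(y.shape)                                    # torch.Size([8, 512])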


SLIDE 18

EigenDamage: Structured Pruning in the KFE

Figure 2: the Fisher matrix in the original weight coordinates and in the KFE.

Compared with the Fisher in the original coordinates, the Fisher in the KFE is approximately diagonal, so the weights are closer to being independent of each other.
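A minimal PyTorch sketch of this claim for a toy layer (illustrative sizes; the per-sample gradients are built so that the layer's exact Fisher is only approximately Kronecker-factored):

import torch

torch.manual_seed(0)
n, d_in, d_out = 8192, 6, 4
a = torch.randn(n, d_in) @ (0.5 * torch.randn(d_in, d_in) + torch.eye(d_in))
Ds = torch.randn(n, d_out) @ (0.5 * torch.randn(d_out, d_out) + torch.eye(d_out))

# Exact empirical Fisher of vec(dL/dW), where dL/dW = Ds a^T per sample.
g = (Ds.unsqueeze(2) * a.unsqueeze(1)).reshape(n, -1)
F = g.T @ g / n

# KFE built only from the two small Kronecker factors, as on the previous slides.
A, S = a.T @ a / n, Ds.T @ Ds / n
Q_A = torch.linalg.eigh(A).eigenvectors
Q_S = torch.linalg.eigh(S).eigenvectors
Q = torch.kron(Q_S, Q_A)          # row-major vec matches the kron(S, A) ordering
F_kfe = Q.T @ F @ Q

def offdiag_mass(M):
    # fraction of absolute mass lying off the diagonal
    return ((M - torch.diag(torch.diag(M))).abs().sum() / M.abs().sum()).item()

print(offdiag_mass(F))      # substantial off-diagonal mass in the original basis
print(offdiag_mass(F_kfe))  # much smaller: approximately diagonal in the KFE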


SLIDE 19

Pruning in the KFE is More Accurate

Figure: train loss vs. reduction in weights (%) for VGG19 on CIFAR-10 and CIFAR-100 (one-pass). Methods compared: C-OBD, C-OBS, Kron-OBD, Kron-OBS, EigenDamage.

The network pruned by EigenDamage achieves significantly lower training loss than the other methods, even without fine-tuning!


SLIDE 20

One-pass Pruning Results

Figure: test accuracy (%) vs. reduction in weights (%) and vs. reduction in FLOPs (%) for VGG19 on Tiny-ImageNet (one-pass). Methods compared: NN Slimming, C-OBD, C-OBS, Kron-OBD, Kron-OBS, EigenDamage, and the unpruned baseline.

EigenDamage outperforms the other methods by a significant margin, especially at high pruning ratios (e.g., ≥ 90%).


SLIDE 21

Iterative Pruning Results

Figure: test accuracy (%) vs. reduction in weights (%) and vs. reduction in FLOPs (%) for ResNet32 on CIFAR-100 (iterative). Methods compared: C-OBD, C-OBS, Kron-OBD, Kron-OBS, EigenDamage, and the unpruned baseline.

Iterative pruning yields more accurate pruning decisions, and EigenDamage again outperforms the other methods, by an even larger margin.


SLIDE 22

Poster Session

Poster Session: Today 06:30 – 09:00 PM Pacific Ballroom #22

Code available at: https://github.com/alecwangcq/EigenDamage-Pytorch
