
SLIDE 1

LEARNING SPARSE NEURAL NETWORKS THROUGH L0 REGULARIZATION

Christos Louizos, Max Welling, Diederik P. Kingma

STA 4273 Paper Presentation

Daniel Flam-Shepherd, Armaan Farhadi & Zhaoyu Guo

March 2nd, 2018


SLIDE 2

Neural Networks: the good and the bad

Neural networks:

1. are flexible function approximators that scale really well
2. are overparameterized and prone to overfitting and memorization

So what can we do about this?

Model compression and sparsification!

Consider the empirical risk minimization problem

$$\min_{\theta} \; \mathcal{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i; \theta), y_i\big) + \lambda \lVert \theta \rVert_p$$

where

1. $\{(x_i, y_i)\}_{i=1}^{N}$ is the i.i.d. dataset of input-output pairs
2. $f(x; \theta)$ is the NN using parameters $\theta$
3. $\lVert \theta \rVert_p$ is the $L_p$ norm
4. $\mathcal{L}(\cdot)$ is the loss function
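As a concrete reference, here is a minimal PyTorch sketch of this penalized objective; `model`, `lam`, and `p` are placeholder names, and cross-entropy is the loss purely for illustration:

```python
import torch
import torch.nn.functional as F

def penalized_risk(model, x, y, lam, p):
    # (1/N) sum_i L(f(x_i; theta), y_i): mean loss over the batch
    data_term = F.cross_entropy(model(x), y)
    # lambda * ||theta||_p: the Lp norm over all parameters, as on the slide
    reg_term = sum(w.abs().pow(p).sum() for w in model.parameters()) ** (1.0 / p)
    return data_term + lam * reg_term
```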


SLIDE 3

Lp Norms

Figure: $L_p$ norm penalties for parameter $\theta$, from Louizos et al.

The $L_0$ "norm" is just the number of nonzero parameters:

$$\lVert \theta \rVert_0 = \sum_{j=1}^{|\theta|} \mathbb{I}[\theta_j \neq 0]$$

This does not impose shrinkage on large $\theta_j$; rather, it directly penalizes the number of nonzero parameters.
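For intuition, a tiny sketch of this count; the result is piecewise constant in $\theta$, which is exactly why gradient descent cannot use it directly:

```python
import torch

def l0_norm(theta: torch.Tensor) -> torch.Tensor:
    # sum_j I[theta_j != 0]: count the nonzero entries. The count is
    # piecewise constant in theta, so its gradient is zero almost everywhere.
    return (theta != 0.0).sum()
```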


SLIDE 4

Reparameterizing

If we use the $L_p$ norm, $\mathcal{R}(\theta)$ is non-differentiable at 0. How can we relax this optimization while still allowing exact zeros in $\theta$?

First, reparameterize by putting binary gates $z_j$ on each $\theta_j$:

$$\theta_j = \tilde{\theta}_j z_j, \qquad z_j \in \{0, 1\}, \qquad \tilde{\theta}_j \neq 0, \qquad \lVert \theta \rVert_0 = \sum_{j=1}^{|\theta|} z_j$$

Let $z_j \sim \mathrm{Bern}(\pi_j)$ with pmf $q(z_j \mid \pi_j)$, and we can formulate the problem as:

$$\min_{\tilde{\theta}, \pi} \; \mathcal{R}(\tilde{\theta}, \pi) = \mathbb{E}_{q(z \mid \pi)}\!\left[ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i; \tilde{\theta} \odot z), y_i\big) \right] + \lambda \sum_{j=1}^{|\theta|} \pi_j$$

The penalty term is now smooth in $\pi$, but we cannot optimize the first term: $z$ is discrete, so gradients with respect to $\pi$ cannot flow through the samples.
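A minimal sketch of a single-sample Monte Carlo estimate of this objective, assuming a hypothetical `model` that takes explicit parameters; the `torch.bernoulli` call is exactly where gradients to $\pi$ are lost:

```python
import torch

def gated_risk(model, theta_tilde, log_pi, x, y, lam, loss_fn):
    pi = torch.sigmoid(log_pi)         # gate probabilities pi_j
    z = torch.bernoulli(pi)            # discrete sample: blocks gradients to pi
    y_hat = model(x, theta_tilde * z)  # f(x; theta_tilde * z), elementwise gating
    l0_term = pi.sum()                 # E_q[||theta||_0] = sum_j pi_j
    return loss_fn(y_hat, y) + lam * l0_term
```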


SLIDE 5

Smooth the objective so we can optimize it!

Let the gates $z$ be given by a hard-sigmoid rectification of $s$, as follows:

$$z = g(s) = \min(1, \max(0, s)), \qquad s \sim q_\phi(s)$$

The probability of a gate being active is $q_\phi(z \neq 0) = 1 - Q_\phi(s \leq 0)$, where $Q_\phi$ is the CDF of $s$. Then, using the reparameterization trick $s = f(\phi, \epsilon)$, so that $z = g(f(\phi, \epsilon))$:

$$\min_{\tilde{\theta}, \phi} \; \mathbb{E}_{p(\epsilon)}\!\left[ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f(x_i; \tilde{\theta} \odot g(f(\phi, \epsilon))), y_i\big) \right] + \lambda \sum_{j=1}^{|\theta|} \big(1 - Q_\phi(s_j \leq 0)\big)$$

Okay, but which distribution $q_\phi(s)$ should we use?
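Before the answer, a sketch of the recipe so far; the Gaussian choice of $q_\phi(s)$ here is purely a placeholder (the paper's actual choice follows on the next slide):

```python
import torch

def hard_sigmoid(s: torch.Tensor) -> torch.Tensor:
    # z = g(s) = min(1, max(0, s)): exact zeros and ones,
    # with a differentiable linear region in between
    return torch.clamp(s, 0.0, 1.0)

# Reparameterization trick: s = f(phi, eps), here with phi = (mu, log_sigma).
mu = torch.zeros(10, requires_grad=True)
log_sigma = torch.zeros(10, requires_grad=True)
eps = torch.randn(10)           # noise independent of phi
s = mu + log_sigma.exp() * eps  # differentiable in phi
z = hard_sigmoid(s)             # gradients flow back to phi through s
```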


SLIDE 6

Hard Concrete Distribution

An appropriate smoothing distribution $q(s)$ is the binary concrete random variable $s$:

$$u \sim \mathcal{U}(0, 1), \qquad s = \mathrm{Sigmoid}\big((\log u - \log(1 - u) + \log \alpha)/\beta\big)$$

$$\bar{s} = s(\zeta - \gamma) + \gamma, \qquad z = \min(1, \max(0, \bar{s}))$$

1. $s$ is binary-concrete distributed
2. $\alpha$ is the location parameter
3. $\beta$ is the temperature parameter
4. $z$ follows the hard concrete distribution
5. we stretch $s$ to $\bar{s}$ over the interval $(\gamma, \zeta)$, where $\gamma < 0$ and $\zeta > 1$
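A minimal sampler for the hard concrete gate; the defaults $\beta = 2/3$, $\gamma = -0.1$, $\zeta = 1.1$ follow the values used in Louizos et al.:

```python
import torch

def hard_concrete_sample(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    u = torch.rand_like(log_alpha)  # u ~ U(0, 1)
    # binary concrete: s = Sigmoid((log u - log(1 - u) + log alpha) / beta)
    s = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma    # stretch to (gamma, zeta)
    return torch.clamp(s_bar, 0.0, 1.0)   # hard-sigmoid rectification
```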


SLIDE 7

Figure: Figure 2 from Louizos et al.


SLIDE 8

Hard Concrete Distribution

From earlier, we had $1 - Q_\phi(s \leq 0)$ as the $L_0$ complexity term of the objective. Now, if the random variable is hard concrete, we can say:

$$1 - Q_\phi(s \leq 0) = \mathrm{Sigmoid}\!\left(\log \alpha - \beta \log \frac{-\gamma}{\zeta}\right)$$

At test time, the authors use the following deterministic estimator for the gate:

$$\hat{z} = \min\big(1, \max(0, \mathrm{Sigmoid}(\log \alpha)(\zeta - \gamma) + \gamma)\big), \qquad \theta^* = \tilde{\theta}^* \odot \hat{z}$$
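Both quantities are one-liners; a sketch with the same hypothetical defaults as before:

```python
import math
import torch

def l0_penalty(log_alpha, beta=2/3, gamma=-0.1, zeta=1.1):
    # 1 - Q(s <= 0) = Sigmoid(log alpha - beta * log(-gamma / zeta)),
    # summed over all gates to give the expected L0 complexity term
    return torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta)).sum()

def test_time_gate(log_alpha, gamma=-0.1, zeta=1.1):
    # z_hat = min(1, max(0, Sigmoid(log alpha) * (zeta - gamma) + gamma))
    return torch.clamp(torch.sigmoid(log_alpha) * (zeta - gamma) + gamma, 0.0, 1.0)
```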


SLIDE 9

Experiments - MNIST Classification and Sparsification


SLIDE 10

Experiments - MNIST Classification

Figure: Expected FLOPs. Left is the MLP. Right is the LeNet-5


SLIDE 11

Experiments - CIFAR Classification


SLIDE 12

Experiments - CIFAR Classification

Figure: Expected FLOPs of the WRN on CIFAR-10 (left) and CIFAR-100 (right)


SLIDE 13

Discussion & Future Work

Discussion

1. The $L_0$ penalty can save memory and computation
2. $L_0$ regularization leads to competitive predictive accuracy and stability

Future Work

1. Adopt a fully Bayesian treatment over the parameters $\theta$


SLIDE 14

Thank You . . .
