slide-1
SLIDE 1

Trimming the ℓ1 Regularizer:

Statistical Analysis, Optimization, and Applications to Deep Learning

Jihun Yun1, Peng Zheng2, Eunho Yang1,3, Aurélie C. Lozano4, Aleksandr Aravkin2

1KAIST 2University of Washington 3AITRICS 4IBM T.J. Watson Research Center

arcprime@kaist.ac.kr

International Conference on Machine Learning, June 12, 2019

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 1 / 40

slide-2
SLIDE 2

Table of Contents

1. Introduction and Setup
2. Statistical Analysis
3. Optimization
4. Experiments & Applications to Deep Learning

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 2 / 40

slide-3
SLIDE 3

Table of Contents

1. Introduction and Setup
2. Statistical Analysis
3. Optimization
4. Experiments & Applications to Deep Learning

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 3 / 40

slide-4
SLIDE 4

ℓ1 Regularization is Popular

High-dimensional data with ℓ1 regularization (n ≪ p)

Genomic Data, Matrix Completion, Deep Learning, etc.

(a) Sparse linear models (b) Sparse graphical models (c) Matrix Completion (d) Sparse neural networks

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 4 / 40

slide-5
SLIDE 5

Concrete Example 1

Lasso

Example 1: Lasso∗ (Sparse Linear Regression)

θ̂ ∈ argmin_{θ ∈ Ω}  (1/(2n)) ‖y − Xθ‖_2² + λ_n ‖θ‖_1

∗R. Tibshirani. Regression shrinkage and selection via the lasso. JRSS, Series B, 1996.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 5 / 40
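A minimal NumPy sketch of this objective on toy data (the data and the value of λ_n here are illustrative placeholders, not the paper's setup):

```python
import numpy as np

def lasso_objective(theta, X, y, lam):
    """(1/2n) * ||y - X theta||_2^2 + lam * ||theta||_1"""
    n = X.shape[0]
    residual = y - X @ theta
    return 0.5 / n * residual @ residual + lam * np.abs(theta).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))                 # n = 50 samples, p = 100 features
theta_true = np.zeros(100)
theta_true[:5] = 1.0                               # 5 non-zero coefficients
y = X @ theta_true + 0.1 * rng.standard_normal(50)
print(lasso_objective(theta_true, X, y, lam=0.1))
```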

slide-6
SLIDE 6

Concrete Example 2

Graphical Lasso

Example 2: Graphical Lasso∗ (Sparse Concentration Matrix)

Θ̂ ∈ argmin_{Θ ∈ S^p_{++}}  trace(Σ̂Θ) − log det(Θ) + λ_n ‖Θ‖_{1,off}

where Σ̂ is the sample covariance matrix, S^p_{++} the set of symmetric, strictly positive definite matrices, and ‖Θ‖_{1,off} the ℓ1-norm of the off-diagonal elements of Θ.

∗P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. EJS, 2011.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 6 / 40

slide-7
SLIDE 7

Concrete Example 3

Group ℓ1 on Network Pruning Task

Example 3: Group ℓ1∗ (Structured Sparsity of Weight Parameters)

θ̂ ∈ argmin_{θ ∈ Ω}  L(θ; D) + λ_n ‖θ‖_G

where θ is the collection of weight parameters of a neural network, L the neural network loss (e.g., softmax), and ‖θ‖_G the group sparsity regularizer.

Figure: Encouraging group sparsity (pruning synapses / pruning neurons; before / after pruning). For example, ‖θ‖_G = Σ_{g∈G} ‖θ_g‖_2 with groups g.

∗W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning Structured Sparsity in Deep Neural Networks. NIPS, 2016.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 7 / 40
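A minimal NumPy sketch of this group sparsity regularizer, assuming each row of a fully-connected weight matrix forms one group (the layer shape is an illustrative placeholder):

```python
import numpy as np

def group_l1(weight):
    """Sum of row-wise l2 norms: each row (all outgoing weights of one input unit) is one group."""
    return np.linalg.norm(weight, axis=1).sum()

W = np.random.default_rng(1).standard_normal((300, 100))   # e.g. a 300 x 100 fully-connected layer
print(group_l1(W))
```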

slide-8
SLIDE 8

Shrinkage Bias of Standard ℓ1 Penalty

The ℓ1 penalty is proportional to the magnitude of each parameter, so the larger a parameter is, the larger its shrinkage bias.

Despite the popularity of the ℓ1 penalty (and its strong statistical guarantees), is it really good enough?

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 8 / 40

slide-9
SLIDE 9

Non-convex Regularizers

Previous Work

For amenable non-convex regularizers (such as SCAD∗ and MCP∗∗),

⊲ Amenable regularizer: resembles ℓ1 at the origin and has a vanishing derivative in the tail; it is coordinate-wise decomposable.
⊲ Loh & Wainwright∗∗∗ provide the statistical analysis of amenable regularizers.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 9 / 40

slide-10
SLIDE 10

Non-convex Regularizers

Previous Work

For amenable non-convex regularizers (such as SCAD∗ and MCP∗∗),

⊲ Amenable regularizer: resembles ℓ1 at the origin and has a vanishing derivative in the tail; it is coordinate-wise decomposable.
⊲ Loh & Wainwright∗∗∗ provide the statistical analysis of amenable regularizers.

What about more structurally complex regularizer?

∗J. Fan and R. Li. Variable selection via non-concave penalized likelihood and its oracle properties. Jour. Amer. Stat. Ass., 96(456):1348–1360, December 2001.
∗∗Cun-Hui Zhang et al. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
∗∗∗P. Loh and M. J. Wainwright. Regularized M-estimators with non-convexity: statistical and algorithmic theory for local optima. JMLR, 2015.
∗∗∗P. Loh and M. J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 2017.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 9 / 40

slide-11
SLIDE 11

Trimmed ℓ1 Penalty

Definition

In this paper, we study the Trimmed ℓ1 penalty.

New class of regularizers.


Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 10 / 40

slide-12
SLIDE 12

Trimmed ℓ1 Penalty

Definition

In this paper, we study the Trimmed ℓ1 penalty.

New class of regularizers.

Definition: for a parameter vector θ ∈ R^p, we ℓ1-penalize every entry except the h largest in absolute value (we call h the trimming parameter).


Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 10 / 40

slide-13
SLIDE 13

Trimmed ℓ1 Penalty

Definition

In this paper, we study the Trimmed ℓ1 penalty.

New class of regularizers.

Definition: for a parameter vector θ ∈ R^p, we ℓ1-penalize every entry except the h largest in absolute value (we call h the trimming parameter).

Figure: parameter vector (the darker the color, the larger the value). The largest h entries are penalty-free; we only penalize the smallest p − h entries.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 10 / 40

slide-14
SLIDE 14

Trimmed ℓ1 Penalty

First Formulation

Figure: the largest h entries are penalty-free; we only penalize the smallest p − h entries.

We can formalize this via the order statistics of the parameter vector, |θ_(1)| ≥ |θ_(2)| ≥ ··· ≥ |θ_(p)|. The M-estimation problem with the Trimmed ℓ1 penalty is

minimize_{θ ∈ Ω}  L(θ; D) + λ_n R(θ; h)

where the regularizer R(θ; h) = Σ_{j=h+1}^{p} |θ_(j)| (the sum of the smallest p − h entries in absolute value).

Importantly, the Trimmed ℓ1 penalty is neither amenable nor coordinate-wise separable.
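A minimal NumPy sketch of R(θ; h) computed directly from the order statistics (the example vector and h are illustrative placeholders):

```python
import numpy as np

def trimmed_l1(theta, h):
    """R(theta; h): sum of the p - h smallest absolute entries (the h largest are penalty-free)."""
    abs_desc = np.sort(np.abs(theta))[::-1]   # |theta_(1)| >= |theta_(2)| >= ... >= |theta_(p)|
    return abs_desc[h:].sum()

theta = np.array([3.0, -0.1, 0.5, -2.0, 0.05])
print(trimmed_l1(theta, h=0))   # standard l1 norm: 5.65
print(trimmed_l1(theta, h=2))   # 3.0 and -2.0 are penalty-free: 0.65
```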

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 11 / 40

slide-15
SLIDE 15

M-estimation with the Trimmed ℓ1 penalty

Second Formulation

We can rewrite the M-estimation problem with the Trimmed ℓ1 penalty by introducing an additional variable w:

minimize_{θ ∈ Ω, w ∈ [0,1]^p}  F(θ, w) := L(θ; D) + λ_n Σ_{j=1}^{p} w_j |θ_j|   such that   1ᵀw ≥ p − h

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 12 / 40

slide-16
SLIDE 16

M-estimation with the Trimmed ℓ1 penalty

Second Formulation

We can rewrite the M-estimation problem with the Trimmed ℓ1 penalty by introducing an additional variable w:

minimize_{θ ∈ Ω, w ∈ [0,1]^p}  F(θ, w) := L(θ; D) + λ_n Σ_{j=1}^{p} w_j |θ_j|   such that   1ᵀw ≥ p − h

The variable w encodes the sparsity pattern and order information of θ. In the ideal case,

w_j = 0 for the h largest entries, and w_j = 1 for the smallest p − h entries.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 12 / 40

slide-17
SLIDE 17

M-estimation with the Trimmed ℓ1 penalty

Second Formulation

We can rewrite the M-estimation problem with the Trimmed ℓ1 penalty by introducing an additional variable w:

minimize_{θ ∈ Ω, w ∈ [0,1]^p}  F(θ, w) := L(θ; D) + λ_n Σ_{j=1}^{p} w_j |θ_j|   such that   1ᵀw ≥ p − h

The variable w encodes the sparsity pattern and order information of θ. In the ideal case,

w_j = 0 for the h largest entries, and w_j = 1 for the smallest p − h entries.

If we set the trimming parameter h = 0, this reduces to the standard ℓ1 penalty.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 12 / 40

slide-18
SLIDE 18

M-estimation with the Trimmed ℓ1 penalty

Second Formulation: Important Properties

minimize_{θ ∈ Ω, w ∈ [0,1]^p}  F(θ, w) := L(θ; D) + λ_n Σ_{j=1}^{p} w_j |θ_j|   such that   1ᵀw ≥ p − h

The objective function F is
weighted-ℓ1-regularized if we fix w, and
linear in w if we fix θ.
However, F is jointly non-convex in (θ, w) because of the coupling between θ and w.

We use this second formulation for optimization, since it avoids sorting the parameters.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 13 / 40

slide-19
SLIDE 19

Trimmed ℓ1 Penalty

Unit Balls Visualization

Unit balls of the Trimmed ℓ1 penalty for θ = (θ1, θ2, θ3) in three-dimensional space.

Figure: unit balls for h = 0, h = 1, and h = 2.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 14 / 40


slide-23
SLIDE 23

Trimmed ℓ1 Penalty

Unit Balls Visualization

Unit balls of the Trimmed ℓ1 penalty for θ = (θ1, θ2, θ3) in three-dimensional space.

Figure: unit balls for h = 0, h = 1, and h = 2.

For h = 0, the shape is the same as the standard ℓ1 unit ball. For h > 0, the penalty can be unbounded: since the largest h entries are not penalized, the unit ball extends to infinity along those directions, and its geometry becomes more complex as h increases.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 14 / 40

slide-24
SLIDE 24

Table of Contents

1. Introduction and Setup
2. Statistical Analysis
3. Optimization
4. Experiments & Applications to Deep Learning

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 15 / 40

slide-25
SLIDE 25

Statistical Analysis: Key Assumptions and Quantity

Assumptions:
(C1) The loss L is differentiable and convex.
(C2) Restricted strong convexity (RSC): let D be the set of all possible error vectors Δ = θ − θ*. Then, for all Δ ∈ D,

⟨∇L(θ* + Δ) − ∇L(θ*), Δ⟩ ≥ κ_l ‖Δ‖_2² − τ_1 (log p / n) ‖Δ‖_1²,

where κ_l is a "curvature" parameter and τ_1 a "tolerance" parameter.

RSC lets a small loss difference be translated into a small error θ − θ*, and is a standard condition in this line of work.

Quantity: let Q̂ = ∫_0^1 ∇²L(θ* + t(θ̂ − θ*)) dt.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 16 / 40

slide-26
SLIDE 26

Statistical Analysis

Theorem 1: General ℓ∞-error Bound and Variable Selection

Consider an M-estimation problem with the Trimmed ℓ1 penalty. Under (C1), (C2), and standard conditions, any local minimum θ̂ satisfies the following (here S is the support of the true parameter θ*, k = |S|, and S^c its complement):

1. For every pair j1 ∈ S, j2 ∈ S^c, we have |θ̂_{j1}| > |θ̂_{j2}|.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 17 / 40

slide-27
SLIDE 27

Statistical Analysis

Theorem 1: General ℓ∞-error Bound and Variable Selection

Consider an M-estimation problem with the Trimmed ℓ1 penalty. Under (C1), (C2), and standard conditions, any local minimum θ̂ satisfies:

1. For every pair j1 ∈ S, j2 ∈ S^c, we have |θ̂_{j1}| > |θ̂_{j2}|.

Figure: true parameter vs. estimated parameter.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 17 / 40

slide-28
SLIDE 28

Statistical Analysis

Theorem 1: General ℓ∞-error Bound and Variable Selection

1. For every pair j1 ∈ S, j2 ∈ S^c, we have |θ̂_{j1}| > |θ̂_{j2}|.

2. If h < k, all j ∈ S^c are successfully estimated as zero and

‖θ̂ − θ*‖_∞ ≤ ‖(Q̂_SS)^{-1} ∇L(θ*)_S‖_∞ + λ_n ‖(Q̂_SS)^{-1}‖_∞

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 18 / 40

slide-29
SLIDE 29

Statistical Analysis

Theorem 1: General ℓ∞-error Bound and Variable Selection

1. For every pair j1 ∈ S, j2 ∈ S^c, we have |θ̂_{j1}| > |θ̂_{j2}|.

2. If h < k, all j ∈ S^c are successfully estimated as zero and

‖θ̂ − θ*‖_∞ ≤ ‖(Q̂_SS)^{-1} ∇L(θ*)_S‖_∞ + λ_n ‖(Q̂_SS)^{-1}‖_∞

Figure: true parameter vs. estimated parameter; entries in S^c are estimated as exactly zero.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 18 / 40

slide-30
SLIDE 30

Statistical Analysis

Theorem 1: General ℓ∞-error Bound and Variable Selection

1. For every pair j1 ∈ S, j2 ∈ S^c, we have |θ̂_{j1}| > |θ̂_{j2}|.

2. If h < k, all j ∈ S^c are successfully estimated as zero and

‖θ̂ − θ*‖_∞ ≤ ‖(Q̂_SS)^{-1} ∇L(θ*)_S‖_∞ + λ_n ‖(Q̂_SS)^{-1}‖_∞

3. If h ≥ k, at least the p − h smallest (in absolute value) entries in S^c are exactly zero and

‖θ̂ − θ*‖_∞ ≤ ‖(Q̂_UU)^{-1} ∇L(θ*)_U‖_∞

where U is the set of the h largest-magnitude entries of θ̂ together with S.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 19 / 40

slide-31
SLIDE 31

Statistical Analysis

Theorem 1: General ℓ∞-error Bound and Variable Selection

1. For every pair j1 ∈ S, j2 ∈ S^c, we have |θ̂_{j1}| > |θ̂_{j2}|.

2. If h < k, all j ∈ S^c are successfully estimated as zero and

‖θ̂ − θ*‖_∞ ≤ ‖(Q̂_SS)^{-1} ∇L(θ*)_S‖_∞ + λ_n ‖(Q̂_SS)^{-1}‖_∞

3. If h ≥ k, at least the p − h smallest (in absolute value) entries in S^c are exactly zero and

‖θ̂ − θ*‖_∞ ≤ ‖(Q̂_UU)^{-1} ∇L(θ*)_U‖_∞

where U is the set of the h largest-magnitude entries of θ̂ together with S.

Figure: true parameter vs. estimated parameter; the smallest entries are estimated as exactly zero.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 19 / 40

slide-32
SLIDE 32

Statistical Analysis

Theorem 2: General ℓ2-error Bound

Theorem 2

Consider an M-estimation problem with Trimmed ℓ1 regularization where all conditions in Theorem 1 hold. For any local minimum θ, the parameter estimation error in terms of ℓ2-norm is upper bounded as:

‖θ̂ − θ*‖_2 ≤ Cλ_n (√k/2 + √(k − h))   if h < k,   and   ‖θ̂ − θ*‖_2 ≤ Cλ_n √h/2   otherwise.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 20 / 40

slide-33
SLIDE 33

Statistical Analysis

Theorem 2: General ℓ2-error Bound

Theorem 2

Consider an M-estimation problem with Trimmed ℓ1 regularization where all conditions in Theorem 1 hold. For any local minimum θ, the parameter estimation error in terms of ℓ2-norm is upper bounded as:

‖θ̂ − θ*‖_2 ≤ Cλ_n (√k/2 + √(k − h))   if h < k,   and   ‖θ̂ − θ*‖_2 ≤ Cλ_n √h/2   otherwise.

From our bound, h = k is the best case! We can choose h ≍ k via cross-validation.

Table: ℓ2-error bound for different h values.

              h < k                       h = k          h > k
‖θ̂ − θ*‖_2    Cλ_n(√k/2 + √(k − h))       Cλ_n √k/2      Cλ_n √h/2

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 20 / 40

slide-34
SLIDE 34

Statistical Analysis

Remarks: Other alternative penalties vs. Trimmed ℓ1

‖θ̂ − θ*‖_2 ≤ Cλ_n (√k/2 + √(k − h))   if h < k,   and   ‖θ̂ − θ*‖_2 ≤ Cλ_n √h/2   otherwise.

ρ_λ(t) is (µ, γ)-amenable if ρ_λ(t) + (µ/2)t² is convex and ρ′_λ(t) = 0 for |t| > γ.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 21 / 40

slide-35
SLIDE 35

Statistical Analysis

Remarks: Other alternative penalties vs. Trimmed ℓ1

‖θ̂ − θ*‖_2 ≤ Cλ_n (√k/2 + √(k − h))   if h < k,   and   ‖θ̂ − θ*‖_2 ≤ Cλ_n √h/2   otherwise.

ρ_λ(t) is (µ, γ)-amenable if ρ_λ(t) + (µ/2)t² is convex and ρ′_λ(t) = 0 for |t| > γ.

Table: ℓ2-error bound comparison with universal constant c_0 in sub-Gaussian tail bounds.

              Standard ℓ1 (h = 0)      (µ, γ)-amenable                Trimmed ℓ1 (h = k)
‖θ̂ − θ*‖_2    (3c_0/κ_l) λ_n √k/2      (c_0/(κ_l − 3µ/2)) λ_n √k/2    (c_0/κ_l) λ_n √k/2

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 21 / 40

slide-36
SLIDE 36

Statistical Analysis

Remarks: Other alternative penalties vs. Trimmed ℓ1

‖θ̂ − θ*‖_2 ≤ Cλ_n (√k/2 + √(k − h))   if h < k,   and   ‖θ̂ − θ*‖_2 ≤ Cλ_n √h/2   otherwise.

ρ_λ(t) is (µ, γ)-amenable if ρ_λ(t) + (µ/2)t² is convex and ρ′_λ(t) = 0 for |t| > γ.

Table: ℓ2-error bound comparison with universal constant c_0 in sub-Gaussian tail bounds.

              Standard ℓ1 (h = 0)      (µ, γ)-amenable                Trimmed ℓ1 (h = k)
‖θ̂ − θ*‖_2    (3c_0/κ_l) λ_n √k/2      (c_0/(κ_l − 3µ/2)) λ_n √k/2    (c_0/κ_l) λ_n √k/2

Trimmed ℓ1 can achieve a bound three times smaller than the standard ℓ1.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 21 / 40

slide-37
SLIDE 37

Statistical Analysis

Remarks: Other alternative penalties vs. Trimmed ℓ1

‖θ̂ − θ*‖_2 ≤ Cλ_n (√k/2 + √(k − h))   if h < k,   and   ‖θ̂ − θ*‖_2 ≤ Cλ_n √h/2   otherwise.

ρ_λ(t) is (µ, γ)-amenable if ρ_λ(t) + (µ/2)t² is convex and ρ′_λ(t) = 0 for |t| > γ.

Table: ℓ2-error bound comparison with universal constant c_0 in sub-Gaussian tail bounds.

              Standard ℓ1 (h = 0)      (µ, γ)-amenable                Trimmed ℓ1 (h = k)
‖θ̂ − θ*‖_2    (3c_0/κ_l) λ_n √k/2      (c_0/(κ_l − 3µ/2)) λ_n √k/2    (c_0/κ_l) λ_n √k/2

We also obtain a smaller bound than other non-convex regularizers, since (µ, γ)-amenable regularizers have a (possibly large) µ in the denominator.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 21 / 40

slide-38
SLIDE 38

Statistical Analysis

Corollary 1: General ℓ∞-error Bound for Linear Regression

Consider a linear regression problem with sub-Gaussian error ε. Under standard conditions as in Theorem 1 and an incoherence condition on the sample covariance, with high probability any local minimum θ̂ satisfies:

① The absolute values of relevant features are always larger than those of non-relevant features.
② If we set the trimming parameter h smaller than the true sparsity level k, all non-relevant parameters are estimated as zero.
③ If we set h larger than the true sparsity level k, at least the smallest p − h entries are estimated as zero.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 22 / 40

slide-39
SLIDE 39

Table of Contents

1. Introduction and Setup
2. Statistical Analysis
3. Optimization
4. Experiments & Applications to Deep Learning

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 23 / 40

slide-40
SLIDE 40

Optimization for Trimmed ℓ1 Regularized Program

For optimization, we use our second formulation of the trimmed regularization problem:

minimize_{θ ∈ Ω, w ∈ [0,1]^p}  F(θ, w) := L(θ; D) + λ Σ_{j=1}^{p} w_j |θ_j|   s.t.   1ᵀw ≥ p − h

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 24 / 40

slide-41
SLIDE 41

Optimization for Trimmed ℓ1 Regularized Program

For optimization, we use our second formulation of the trimmed regularization problem:

minimize_{θ ∈ Ω, w ∈ [0,1]^p}  F(θ, w) := L(θ; D) + λ Σ_{j=1}^{p} w_j |θ_j|   s.t.   1ᵀw ≥ p − h

We update (θ, w) in an alternating manner:

θ^{k+1} ← prox_{ηλR(·, w^k)}[θ^k − η ∇_θ L(θ^k)]
w^{k+1} ← proj_S[w^k − τ r(θ^{k+1})]

With w fixed, the prox operator is that of a weighted ℓ1 norm (coordinate-wise soft-thresholding). With θ fixed, the objective F is linear in w, r(θ) is the vector with entries λ|θ_j| (the gradient of F with respect to w), and proj_S is the projection onto the constraint set S = {w ∈ [0, 1]^p | 1ᵀw = p − h}.
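A minimal sketch of these alternating updates for a least-squares loss. As one simplification (an assumption, not the paper's exact scheme), the w-subproblem is solved exactly rather than by the projected-gradient step above: since F is linear in w, the minimizer over S simply sets w_j = 0 on the h currently largest |θ_j| and w_j = 1 elsewhere.

```python
import numpy as np

def prox_weighted_l1(theta, thresh):
    """Prox of sum_j thresh_j * |theta_j|: coordinate-wise soft-thresholding."""
    return np.sign(theta) * np.maximum(np.abs(theta) - thresh, 0.0)

def update_w(theta, h):
    """Exact minimizer of the linear-in-w subproblem: zero out the h largest |theta_j|."""
    w = np.ones_like(theta)
    w[np.argsort(-np.abs(theta))[:h]] = 0.0
    return w

def trimmed_l1_descent(X, y, lam, h, iters=500):
    n, p = X.shape
    eta = n / np.linalg.norm(X, 2) ** 2          # step size ~ 1/L for the least-squares loss
    theta, w = np.zeros(p), np.ones(p)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / n         # gradient of (1/2n)||y - X theta||_2^2
        theta = prox_weighted_l1(theta - eta * grad, eta * lam * w)
        w = update_w(theta, h)                   # w-step: exact minimization instead of projected gradient
    return theta, w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))
beta_true = np.zeros(50)
beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.standard_normal(100)
theta_hat, w_hat = trimmed_l1_descent(X, y, lam=0.1, h=5)
print(np.flatnonzero(np.abs(theta_hat) > 1e-8))  # indices kept non-zero
```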

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 24 / 40

slide-42
SLIDE 42

Optimization: Comparison with DC-based Approach

Convergence history of our algorithm vs. Algorithm 2 of (Khamaru & Wainwright, 2018)∗.

Algorithm 2 of (Khamaru & Wainwright, 2018) is an optimization method for (non-convex and non-smooth) objective functions written as a difference of convex functions (f := g + φ − h); the trimmed regularized problem can be formulated as such a DC program.

Figure: Algorithm comparison (F − F* vs. iteration) with λ ∈ {0.5, 5, 10, 20}.

∗K. Khamaru and M. J. Wainwright. Convergence guarantees for a class of non-convex and non-smooth optimization problems. ICML, 2018.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 25 / 40

slide-43
SLIDE 43

Table of Contents

1. Introduction and Setup
2. Statistical Analysis
3. Optimization
4. Experiments & Applications to Deep Learning

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 26 / 40

slide-44
SLIDE 44

Simulation Experiments

Incoherent Case: Support Recovery

Scenario 1: Incoherence condition is satisfied

Figure: Probability of successful support recovery vs. sample size n for the Trimmed ℓ1, SCAD, MCP, and standard ℓ1 penalties with (p, k) = (128, 8), (256, 16), (512, 32).

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 27 / 40

slide-45
SLIDE 45

Simulation Experiments

Incoherent Case: Stationary & log ℓ2-error Comparison

Scenario 1: Incoherence condition is satisfied

Figure: (Left) objective trajectories from 50 random initializations for a setting with (n, p, k) = (160, 256, 16). (Right) log ℓ2-error (log‖β^t − β*‖_2 vs. iteration) comparison for Trimmed ℓ1, SCAD, MCP, and standard ℓ1.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 28 / 40

slide-46
SLIDE 46

Simulation Experiments

Nonincoherent Case: Support Recovery

Scenario 2: Incoherence condition violated

Note that Corollary 1 requires an incoherence condition. Interestingly, the Trimmed Lasso outperforms all the other regularizers even in this non-incoherent regime.

Figure: Probability of successful support recovery vs. sample size n for the Trimmed ℓ1, SCAD, MCP, and standard ℓ1 penalties with (p, k) = (128, 8), (256, 16), (512, 32).

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 29 / 40

slide-47
SLIDE 47

Simulation Experiments

Nonincoherent Case: Stationary & log ℓ2-error Comparison

Scenario 2: Incoherence condition violated

Figure: (Left) objective trajectories from 50 random initializations for a setting with (n, p, k) = (160, 256, 16). (Right) log ℓ2-error (log‖β^t − β*‖_2 vs. iteration) comparison for Trimmed ℓ1, SCAD, and MCP.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 30 / 40

slide-48
SLIDE 48

Simulation Experiments

Nonincoherent Case: Stationary

Scenario 3

(Left) Small regime: both the true signals and the regularization parameter λ are small. (Middle, Right) Investigating the choice of the trimming parameter h (Middle: incoherent case; Right: non-incoherent case).

Figure: (Left) support recovery probability vs. log λ for (n, p, k) = (1000, 128, 8), Trimmed ℓ1 vs. standard ℓ1. (Middle, Right) support recovery probability vs. (p − h)/p for (n, p, k) = (160, 256, 16), comparing Trimmed ℓ1, SCAD, MCP, and standard ℓ1.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 31 / 40

slide-49
SLIDE 49

Applications to Deep Learning 1

Input Structure Recovery of Compact Neural Networks

We apply trimmed regularization to recover the weight structure of neural networks as parameter support recovery. Motivated by the recent work of Oymak (2018)∗, we consider

The regression model is y_i = oᵀ ReLU(W* x_i) with o = 1 (all-ones output weights), input dimension 80, and hidden dimension 20. Each hidden node is connected to only 4 input features.
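A minimal sketch of generating data from this model under the stated structure; the dimensions and sample size are illustrative placeholders rather than the paper's exact experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, n = 80, 20, 2000

# Sparse true weights: each hidden node connects to exactly 4 input features.
W_true = np.zeros((d_hidden, d_in))
for j in range(d_hidden):
    idx = rng.choice(d_in, size=4, replace=False)
    W_true[j, idx] = rng.standard_normal(4)

o = np.ones(d_hidden)                     # output weights fixed to all ones
X = rng.standard_normal((n, d_in))
y = np.maximum(X @ W_true.T, 0.0) @ o     # y_i = o^T ReLU(W* x_i)
print(W_true.shape, X.shape, y.shape)
```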

∗Samet Oymak. Learning Compact Neural Networks with Regularization. ICML, 2018.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 32 / 40

slide-50
SLIDE 50

Applications to Deep Learning 1

Input Structure Recovery of Compact Neural Networks: Results

Figure: results with good initialization (a small perturbation of the true weights) and with random initialization.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 33 / 40

slide-51
SLIDE 51

Applications to Deep Learning 2: Pruning Deep Networks

Figure: pruning synapses vs. pruning neurons (before and after pruning).

Pruning neurons is more computationally efficient than edge-wise pruning.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 34 / 40

slide-52
SLIDE 52

Applications to Deep Learning: Pruning Deep Networks

Trimmed Group ℓ1 Regularization on Deep Networks

To encourage group sparsity in neural networks, we consider two cases:

Neuron sparsity (for fully-connected layers): let θ_l ∈ R^{n_in × n_out} be a weight matrix. We can enforce group-wise sparsity via the Trimmed group ℓ1 penalty

R_l(θ_l, w) = λ_l Σ_{j=1}^{n_in} w_j √(θ_{j,1}² + θ_{j,2}² + ··· + θ_{j,n_out}²)

Activation map sparsity (for convolutional layers): similarly, let θ_l ∈ R^{C_out × C_in × H × W} be a weight tensor. Then

R_l(θ_l, w) = λ_l Σ_{j=1}^{C_out} w_j √(Σ_{m,n,k} θ_{j,m,n,k}²)

where the inner sum runs over all possible indices (m, n, k), with the constraint 1ᵀw = n_in − h_l or C_out − h_l, respectively.
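A minimal NumPy sketch of the fully-connected (neuron sparsity) case, with w set to its ideal value (zero on the h_l rows with the largest group norms); the layer shape, λ_l, and h_l are illustrative placeholders:

```python
import numpy as np

def trimmed_group_l1(theta_l, h_l, lam_l=1.0):
    """R_l(theta_l, w) with the ideal w: all but the h_l largest row-norm groups are penalized."""
    group_norms = np.linalg.norm(theta_l, axis=1)      # one group per input neuron (row)
    w = np.ones_like(group_norms)
    w[np.argsort(-group_norms)[:h_l]] = 0.0            # the h_l largest groups are penalty-free
    return lam_l * np.sum(w * group_norms)

theta_l = np.random.default_rng(0).standard_normal((300, 100))   # e.g. n_in = 300, n_out = 100
print(trimmed_group_l1(theta_l, h_l=50))
```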

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 35 / 40

slide-53
SLIDE 53

Applications to Deep Learning: Pruning Deep Networks

Results on MNIST dataset

Comparison of the vanilla group ℓ1 penalty vs. the Trimmed group ℓ1 penalty on the LeNet-300-100 architecture.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 36 / 40

slide-54
SLIDE 54

Applications to Deep Learning: Pruning Deep Networks

Bayesian Neural Networks with Trimmed ℓ1 Regularization

Most modern network-pruning algorithms are based on a Bayesian variational framework. We propose a Bayesian neural network with Trimmed ℓ1 regularization, treating only θ as Bayesian.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 37 / 40

slide-55
SLIDE 55

Applications to Deep Learning: Pruning Deep Networks

Bayesian Neural Networks with Trimmed ℓ1 Regularization

Most modern network-pruning algorithms are based on a Bayesian variational framework. We propose a Bayesian neural network with Trimmed ℓ1 regularization, treating only θ as Bayesian. Using the relationship between Bayesian neural networks and variational dropout, we choose q_{φ,α}(θ_{i,j}) = N(φ_{i,j}, α_{i,j} φ_{i,j}²) as the variational distribution.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 37 / 40

slide-56
SLIDE 56

Applications to Deep Learning: Pruning Deep Networks

Bayesian Neural Networks with Trimmed ℓ1 Regularization

Most modern network-pruning algorithms are based on a Bayesian variational framework. We propose a Bayesian neural network with Trimmed ℓ1 regularization, treating only θ as Bayesian. Using the relationship between Bayesian neural networks and variational dropout, we choose q_{φ,α}(θ_{i,j}) = N(φ_{i,j}, α_{i,j} φ_{i,j}²) as the variational distribution. Combined with Trimmed ℓ1 regularization, the objective is

E_{q_{φ,α}(W)}[−L(W; D)] + KL(q_{φ,α}(W) ‖ p(W))   (ELBO term)
+ E_{q_{φ,α}(W)}[ Σ_{l=1}^{L+1} λ_l R_l(θ_l, w_l) ]   (expected Trimmed group ℓ1 penalty)
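A minimal sketch of drawing a weight sample from the variational distribution q_{φ,α}(θ) = N(φ, α φ²) via the reparameterization θ = φ(1 + √α ε), ε ~ N(0, 1); the shapes and values are illustrative placeholders, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.1 * rng.standard_normal((300, 100))   # variational means phi_{i,j}
alpha = np.exp(np.full((300, 100), -3.0))     # per-weight variance multipliers alpha_{i,j}

# One Monte-Carlo weight sample: theta = phi * (1 + sqrt(alpha) * eps)  =>  theta ~ N(phi, alpha * phi^2)
eps = rng.standard_normal(phi.shape)
theta_sample = phi * (1.0 + np.sqrt(alpha) * eps)
print(theta_sample.shape)
```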

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 37 / 40

slide-57
SLIDE 57

Applications to Deep Learning: Pruning Deep Networks

Results on MNIST dataset (cont'd)

With Bayesian extensions on LeNet-300-100

We compare with the smoothed ℓ0-norm approach under a Bayesian variational framework proposed by Louizos et al. (2018)∗.

With Bayesian extensions on LeNet-5-Caffe

∗Louizos et al. Learning Sparse Neural Networks through ℓ0 Regularization. ICLR, 2018.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 38 / 40

slide-58
SLIDE 58

Concluding Remarks

High-dimensional M-estimators with the Trimmed ℓ1 penalty:

Alleviates the bias incurred by the vanilla ℓ1 penalty by leaving the h largest parameter entries penalty-free.
Theoretical results on support recovery and ℓ2-error hold for any local optimum and are competitive with other non-convex regularizers.
Simulation experiments demonstrate the value of the approach compared to the Lasso and non-convex penalties.

Future work:

Trimming for other standard regularizers beyond sparsity.
Bypassing the incoherence condition in the corollaries.
More experiments and theory when RSC does not hold.
Investigating the use of trimmed regularization in deep models.

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 39 / 40

slide-59
SLIDE 59

THANK YOU! Any Questions? Poster Session at Pacific Ballroom #186 6:30pm – 9:00pm

Jihun Yun (KAIST) Trimmed ℓ1 Penalty June 12, 2019 40 / 40