SLIDE 1

On Dropout and Nuclear Norm Regularization

Poorya Mianjy and Raman Arora

Johns Hopkins University

June 10, 2019

SLIDE 2

Motivation

◮ Algorithmic approaches endow deep learning systems with certain inductive biases that help generalization.

◮ In this paper we study dropout, one of the most popular algorithmic heuristics for training deep neural nets.

SLIDE 3

Problem Setup

◮ Deep linear networks with k hidden layers:

f_w : x ↦ W_{k+1} ⋯ W_1 x,  W_i ∈ ℝ^{d_i × d_{i−1}},

where w = {W_i}_{i=1}^{k+1} is the set of weight matrices.

[Figure: fully connected linear network from input layer x[1], x[2], …, x[d_0] to output layer y[1], y[2], …, y[d_{k+1}].]

SLIDE 4

Problem Setup

◮ Deep linear networks with k hidden layers:

f_w : x ↦ W_{k+1} ⋯ W_1 x,  W_i ∈ ℝ^{d_i × d_{i−1}},

where w = {W_i}_{i=1}^{k+1} is the set of weight matrices.

◮ x ∈ ℝ^{d_0}, y ∈ ℝ^{d_{k+1}}, (x, y) ∼ D. Assume E[xx⊤] = I.

[Figure: fully connected linear network from input layer x[1], x[2], …, x[d_0] to output layer y[1], y[2], …, y[d_{k+1}].]

SLIDE 5

Problem Setup

◮ Deep linear networks with k hidden layers:

f_w : x ↦ W_{k+1} ⋯ W_1 x,  W_i ∈ ℝ^{d_i × d_{i−1}},

where w = {W_i}_{i=1}^{k+1} is the set of weight matrices.

◮ x ∈ ℝ^{d_0}, y ∈ ℝ^{d_{k+1}}, (x, y) ∼ D. Assume E[xx⊤] = I.

◮ Learning problem: minimize the population risk

L(w) := E_{(x,y)∼D}[‖y − f_w(x)‖²]

based on i.i.d. samples from the distribution.

[Figure: fully connected linear network from input layer x[1], x[2], …, x[d_0] to output layer y[1], y[2], …, y[d_{k+1}].]
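The setup above can be sketched in a few lines of NumPy: a deep linear network is just a product of matrices, so evaluating f_w layer by layer agrees with applying the single end-to-end matrix M = W_{k+1} ⋯ W_1. The dimensions below are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: k = 2 hidden layers, widths (d0, d1, d2, d3).
d = [4, 5, 5, 3]
k = len(d) - 2

# Weight matrices W_i in R^{d_i x d_{i-1}}, i = 1, ..., k+1.
W = [rng.standard_normal((d[i + 1], d[i])) for i in range(k + 1)]

def f_w(x, W):
    """Deep linear network f_w(x) = W_{k+1} ... W_1 x, applied layer by layer."""
    for Wi in W:
        x = Wi @ x
    return x

# The end-to-end map is the single matrix M = W_{k+1} ... W_1.
M = np.linalg.multi_dot(W[::-1])
x = rng.standard_normal(d[0])
# f_w(x, W) and M @ x agree up to floating-point error.
```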

SLIDE 6

Problem Setup

◮ Network perturbed by dropping hidden nodes at random, computing

f̄_w(x) = W_{k+1} B_k W_k ⋯ B_1 W_1 x,

where B_i(j, j) = 0 with probability 1 − θ, and B_i(j, j) = 1/θ with probability θ.

[Figure: the same network, with dropout applied to every hidden node between the input layer x[1], …, x[d_0] and the output layer y[1], …, y[d_{k+1}].]

SLIDE 7

Problem Setup

◮ Network perturbed by dropping hidden nodes at random, computing

f̄_w(x) = W_{k+1} B_k W_k ⋯ B_1 W_1 x,

where B_i(j, j) = 0 with probability 1 − θ, and B_i(j, j) = 1/θ with probability θ.

◮ Dropout boils down to SGD on the dropout objective

L_θ(w) := E_{{B_i},(x,y)}[‖y − f̄_w(x)‖²]

[Figure: the same network, with dropout applied to every hidden node between the input layer x[1], …, x[d_0] and the output layer y[1], …, y[d_{k+1}].]
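A minimal sketch of one dropout forward pass under the setup above, with hypothetical dimensions. The 1/θ scaling makes each mask matrix satisfy E[B_i] = I, so E[f̄_w(x)] = f_w(x); in particular, with θ = 1 nothing is dropped and the clean network is recovered exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.8               # keep probability; the drop rate is 1 - theta

d = [4, 5, 5, 3]          # hypothetical widths: 2 hidden layers
W = [rng.standard_normal((d[i + 1], d[i])) for i in range(3)]

def f_bar(x, W, theta, rng):
    """One dropout pass: B_i(j,j) = 1/theta w.p. theta, else 0."""
    h = W[0] @ x
    for Wi in W[1:]:
        mask = rng.random(h.shape) < theta   # which hidden units survive
        h = Wi @ (h * mask / theta)          # 1/theta scaling keeps E[B_i] = I
    return h

x = rng.standard_normal(d[0])
clean = np.linalg.multi_dot(W[::-1]) @ x     # the unperturbed f_w(x)
# f_bar(x, W, 1.0, rng) equals clean, since every mask entry is kept.
```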

SLIDE 8–19

Empirical Observation

◮ 3-layer network with width/input/output dimensionality = 20.

[Figure, built up incrementally over slides 8–19: singular values (y-axis, roughly 5–35) plotted against the index of the singular values (x-axis, 1–16), one curve per model: True Model, SGD, and dropout with drop rates 1 − θ ∈ {0.05, 0.10, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85}.]
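The quantity on the plot's axes can be sketched directly: take the end-to-end matrix of a 3-layer, width-20 linear network and read off its singular values. The random weights below are a hypothetical stand-in for the trained models in the figure, intended only to show what is being plotted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in: end-to-end map M = W_3 W_2 W_1 of a 3-layer
# linear network with width/input/output dimensionality 20.
d = 20
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
M = np.linalg.multi_dot(W[::-1])

# Singular values in decreasing order: the y-axis of the plot,
# against their index (1, 2, ..., d) on the x-axis.
sv = np.linalg.svd(M, compute_uv=False)
```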

SLIDE 20

Main Results

Explicit Regularizer

We give a full characterization of R(w) := L_θ(w) − L(w).

[Figure: a single path through the network, passing through hidden node i_1 at pivot layer j_1, node i_2 at pivot j_2, and node i_3 at pivot j_3, with segment maps α_{j_1,i_1} := W_{j_1→1}(i_1, :), β_1 := W_{j_2→j_1+1}(i_2, i_1), β_2 := W_{j_3→j_2+1}(i_3, i_2), γ_{j_3,i_3} := W_{k+1→j_3+1}(:, i_3).]
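One step worth making explicit (a standard bias–variance decomposition, not spelled out on the slide): because the masks make f̄_w unbiased, E_B[f̄_w(x)] = f_w(x), the cross term vanishes and the gap R(w) = L_θ(w) − L(w) equals the mask-induced variance E ‖f̄_w(x) − f_w(x)‖². The sketch below estimates it by Monte Carlo, with hypothetical dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.5               # keep probability
d = [3, 4, 4, 2]          # hypothetical widths
W = [rng.standard_normal((d[i + 1], d[i])) for i in range(3)]
M = np.linalg.multi_dot(W[::-1])   # clean end-to-end map f_w

def f_bar(x):
    """One dropout pass with B_i(j,j) in {0, 1/theta}."""
    h = W[0] @ x
    for Wi in W[1:]:
        mask = rng.random(h.shape) < theta
        h = Wi @ (h * mask / theta)
    return h

# Monte Carlo estimate of R(w) = E ||f_bar(x) - f_w(x)||^2: the variance
# that the random masks add on top of the clean loss L(w).
n = 20000
R_hat = np.mean([np.sum((f_bar(x) - M @ x) ** 2)
                 for x in rng.standard_normal((n, d[0]))])
```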

SLIDE 21–24

Main Results

Explicit Regularizer

We give a full characterization of R(w) := L_θ(w) − L(w).

Induced Regularizer

Θ(M) := min_{w : f_w = M} R(w)

◮ Multi-dimensional output: Θ**(f_w) = ν_{{d_i}} ‖f_w‖_*²

◮ One-dimensional output: Θ(f_w) = Θ**(f_w) = ν_{{d_i}} ‖f_w‖_*²

Effective Regularization Parameter

◮ ν_{{d_i}} increases with depth and decreases with width: deeper and narrower networks are more biased towards low-rank solutions.
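The nuclear norm in the induced regularizer is easy to compute. The sketch below also checks the standard variational fact that ‖M‖_* equals the minimum of (‖U‖_F² + ‖V‖_F²)/2 over factorizations M = UV⊤, attained at the balanced factorization from the SVD; this is offered only as an illustration of how penalizing factors induces a nuclear-norm penalty on the end-to-end map, not as the paper's proof.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 4))   # a hypothetical end-to-end map

# Nuclear norm: sum of the singular values.
nuc = np.linalg.norm(M, ord='nuc')

# Balanced factorization M = U V^T built from the SVD: scale the columns
# of U0 and of Vt^T by sqrt of the singular values.
U0, s, Vt = np.linalg.svd(M, full_matrices=False)
U = U0 * np.sqrt(s)
V = Vt.T * np.sqrt(s)

# U @ V.T reconstructs M, and (||U||_F^2 + ||V||_F^2)/2 equals ||M||_*.
```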

SLIDE 25

Thanks for your attention!

Stop by Poster 79 for more information.