On Dropout and Nuclear Norm Regularization
Poorya Mianjy and Raman Arora
Johns Hopkins University
June 10, 2019

Motivation

◮ Algorithmic approaches endow deep learning systems with certain inductive biases that help generalization.
◮ In this paper we study dropout, one of the most popular algorithmic regularization techniques in deep learning.
◮ Deep linear networks with k hidden layers:
  f_w : x ↦ W_{k+1} · · · W_1 x,  W_i ∈ R^{d_i × d_{i−1}},
  where w = {W_i}_{i=1}^{k+1} is the set of weight matrices.
◮ x ∈ R^{d_0}, y ∈ R^{d_{k+1}}, (x, y) ∼ D. Assume E[xx⊤] = I.
◮ Learning problem: minimize the population risk
  L(w) := E_{(x,y)∼D}[‖y − f_w(x)‖²]
  based on i.i.d. samples from the distribution.

[Figure: a fully connected deep linear network with input layer x[1], x[2], …, x[d_0] and output layer y[1], y[2], …, y[d_{k+1}].]
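As a sanity check on the model class, the end-to-end map of a deep linear network collapses to a single matrix product. A minimal NumPy sketch (function name and dimensions are ours, for illustration only):

```python
import numpy as np

def linear_net_forward(weights, x):
    """Forward pass of a deep linear network f_w(x) = W_{k+1} ... W_1 x.

    weights: list [W_1, ..., W_{k+1}] with W_i of shape (d_i, d_{i-1}).
    """
    h = x
    for W in weights:
        h = W @ h
    return h

rng = np.random.default_rng(0)
dims = [20, 20, 20, 20]  # d_0, ..., d_3: k = 2 hidden layers of width 20
weights = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
x = rng.standard_normal(dims[0])

# The layer-by-layer pass equals multiplying by the end-to-end matrix.
M = weights[2] @ weights[1] @ weights[0]
assert np.allclose(linear_net_forward(weights, x), M @ x)
```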
◮ Network perturbed by dropping hidden nodes at random, computing
  f̄_w(x) = W_{k+1} B_k W_k · · · B_1 W_1 x,
  where B_i(j, j) = 0 with probability 1 − θ, and 1/θ with probability θ.
◮ Dropout boils down to SGD on the dropout objective
  L_θ(w) := E_{{B_i},(x,y)}[‖y − f̄_w(x)‖²].

[Figure: the same network with dropout applied to every hidden layer.]
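The masks are scaled by 1/θ so that E[B_i] = I, making the perturbed pass unbiased: E[f̄_w(x)] = f_w(x). A minimal NumPy sketch of one stochastic pass (names are ours, not the paper's):

```python
import numpy as np

def dropout_forward(weights, x, theta, rng):
    """One stochastic pass f_bar(x) = W_{k+1} B_k W_k ... B_1 W_1 x.

    Each diagonal entry of B_i is 1/theta with probability theta and 0
    otherwise, so E[B_i] = I and the pass is unbiased in expectation."""
    h = weights[0] @ x                  # first layer: no mask on the input
    for W in weights[1:]:
        mask = (rng.random(h.shape) < theta) / theta  # diagonal of B_i
        h = W @ (mask * h)
    return h

rng = np.random.default_rng(1)
weights = [rng.standard_normal((5, 5)) for _ in range(3)]
x = rng.standard_normal(5)

sample = dropout_forward(weights, x, 0.5, rng)  # one random draw
# With theta = 1 every mask entry is exactly 1, recovering f_w(x).
assert np.allclose(dropout_forward(weights, x, 1.0, rng),
                   weights[2] @ weights[1] @ weights[0] @ x)
```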
◮ 3-layer network with width/input/output dimensionality = 20.

[Figure: singular values (y-axis, 5 to 35) against singular value index (x-axis, 1 to 16), with one curve added per slide: the true model, plain SGD, and dropout with rates 1 − θ ∈ {0.05, 0.10, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85}.]
Goal: give a full characterization of the explicit regularizer R(w) := L_θ(w) − L(w) and of the induced regularizer

  Θ(M) := min_{w : f_w = M} R(w).

[Figure: a path through hidden nodes i1, i2, i3 at pivot layers j1, j2, j3, with subproducts
  α_{j1,i1} := W_{j1→1}(i1, :),  β_1 := W_{j2→j1+1}(i2, i1),  β_2 := W_{j3→j2+1}(i3, i2),  γ_{j3,i3} := W_{k+1→j3+1}(:, i3).]

◮ Multi-dimensional output:
  Θ**(f_w) = ν_{d_i} ‖f_w‖²_*
◮ One-dimensional output:
  Θ(f_w) = Θ**(f_w) = ν_{d_i} ‖f_w‖²
◮ ν_{d_i} increases with depth and decreases with width: deeper and narrower networks are more biased towards low-rank solutions.
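For intuition, in the shallow case (k = 1, dropout on the single hidden layer, y = Mx, E[xx⊤] = I) the explicit regularizer has a simple closed form, R(w) = ((1 − θ)/θ) Σ_i ‖W2(:, i)‖² ‖W1(i, :)‖², coming from the variance of the independent mask entries. The slide above does not state this formula; we take it as an assumption and verify it numerically by enumerating all 2^d dropout masks exactly:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, theta = 4, 0.6
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
M = rng.standard_normal((d, d))   # target map, y = M x

# With E[xx^T] = I:  E_x ||M x - A x||^2 = ||M - A||_F^2.
L = np.sum((M - W2 @ W1) ** 2)

# Exact L_theta: enumerate every mask pattern with its probability.
L_theta = 0.0
for bits in itertools.product([0, 1], repeat=d):
    p = np.prod([theta if b else 1 - theta for b in bits])
    B = np.diag(np.array(bits) / theta)   # kept units scaled by 1/theta
    L_theta += p * np.sum((M - W2 @ B @ W1) ** 2)

R = L_theta - L
# Assumed closed form (shallow case): variance (1-theta)/theta per unit.
closed_form = (1 - theta) / theta * sum(
    np.sum(W2[:, i] ** 2) * np.sum(W1[i, :] ** 2) for i in range(d)
)
assert np.isclose(R, closed_form)
```

Each product term ‖W2(:, i)‖² ‖W1(i, :)‖² weights one hidden unit's contribution, which is what ties R(w) to a (squared) nuclear-norm-like penalty on the end-to-end map.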