SLIDE 1 Towards Demystifying Overparameterization in Deep Learning
Mahdi Soltanolkotabi, Department of Electrical and Computer Engineering
April 4, 2019, Mathematics of Imaging Workshop #3, Henri Poincaré Institute
SLIDE 2
Collaborators: Samet Oymak and Mingchen Li
SLIDE 3
Motivation (Theory)
SLIDE 4
Many success stories
Neural networks very effective at learning from data
SLIDE 5
Lots of hype
SLIDE 6
Some failures
SLIDE 7
Need more principled understanding
Deep learning-based AI is increasingly used in human-facing services.
Challenges:
Optimization: Why can they fit?
Generalization: Why can they predict?
Architecture: Which neural nets?
SLIDE 8
This talk: Overparameterization without overfitting
Mystery # of parameters >> # training data
SLIDE 9
Surprising experiment I (stolen from B. Recht)
p parameters, n = 50,000 training samples, d = 3072 feature size, and 10 classes
SLIDE 10
Surprising experiment II-Overfitting to corruption
Add corruption:
Corrupt a fraction of the training labels by replacing each with another label chosen at random
No corruption on test labels
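For concreteness, a minimal sketch of this corruption procedure (the NumPy implementation, the function name corrupt_labels, and the fraction rho are illustrative assumptions, not the experiment's actual code):

```python
import numpy as np

def corrupt_labels(y, num_classes, rho, seed=0):
    """Replace a fraction rho of the integer labels y with a different random label."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    n = len(y)
    idx = rng.choice(n, size=int(rho * n), replace=False)   # which labels to corrupt
    shift = rng.integers(1, num_classes, size=len(idx))      # guarantees a *different* label
    y[idx] = (y[idx] + shift) % num_classes
    return y

# e.g. corrupt_labels(y_train, 10, rho=0.3) corrupts 30% of CIFAR-10 training labels;
# the test labels are left untouched.
```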
SLIDE 11
Surprising experiment III-Robustness
Repeat the same experiment but stop early
SLIDE 12
Surprising experiment III-Robustness
Repeat the same experiment but stop early
SLIDE 13
Benefits of overparameterization for neural networks
Benefit I: Tractable nonconvex optimization Benefit II: Robustness to corruption with early stopping
SLIDE 14
Benefit I: Tractable nonconvex optimization
SLIDE 15
One-hidden layer
y_i = vᵀ φ(W x_i)
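A minimal NumPy sketch of this one-hidden-layer model and the least-squares objective used in the rest of the talk (the softplus activation and the function names are illustrative choices):

```python
import numpy as np

def phi(z):                      # smooth activation (softplus), illustrative choice
    return np.logaddexp(0.0, z)

def predict(W, v, X):
    """f(W, x_i) = v^T phi(W x_i) for every row x_i of X."""
    return phi(X @ W.T) @ v      # shape (n,)

def loss(W, v, X, y):
    """L(W) = 1/2 * sum_i (v^T phi(W x_i) - y_i)^2."""
    r = predict(W, v, X) - y
    return 0.5 * np.dot(r, r)
```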
SLIDE 16 Theory for smooth activations
Data set {(x_i, y_i)}_{i=1}^n with ‖x_i‖_ℓ2 = 1
min_W L(W) := (1/2) Σ_{i=1}^n (vᵀ φ(W x_i) − y_i)²
[plot: a smooth activation φ]
SLIDE 17 Theory for smooth activations
Data set {(x_i, y_i)}_{i=1}^n with ‖x_i‖_ℓ2 = 1
min_W L(W) := (1/2) Σ_{i=1}^n (vᵀ φ(W x_i) − y_i)²
[plot: a smooth activation φ]
Set v at random or balanced (half +, half −)
Run gradient descent W_{τ+1} = W_τ − µ_τ ∇L(W_τ) with random initialization
Theorem (Oymak and Soltanolkotabi 2019)
Assume
Smooth activation: |φ′(z)| ≤ B and |φ′′(z)| ≤ B
Overparameterization: √(kd) ≳ κ(X) n
Initialization W_0 with i.i.d. N(0, 1) entries
Then, with high probability
Zero training error: L(W_τ) ≤ (1 − c λ(X)/n)^{2τ} L(W_0)
Iterates remain close to initialization: ‖W_τ − W_0‖_F / ‖W_0‖_F ≲ √(n/(kd))
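A small numerical sketch of the gradient descent recursion above for this one-hidden-layer model, tracking the training loss and the relative distance from initialization (the softplus activation, step size, and problem sizes are illustrative assumptions, not the constants of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 100, 400                    # kd = 40,000 parameters >> n = 50 samples
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.standard_normal(n)
v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)]) / np.sqrt(k)   # balanced output layer
W0 = rng.standard_normal((k, d))          # i.i.d. N(0, 1) initialization

phi  = lambda z: np.logaddexp(0.0, z)     # softplus: smooth, |phi'| and |phi''| bounded
dphi = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(W):
    Z = X @ W.T                           # (n, k) pre-activations
    r = phi(Z) @ v - y                    # residual f(W) - y
    G = (r[:, None] * dphi(Z) * v[None, :]).T @ X   # dL/dW, shape (k, d)
    return 0.5 * r @ r, G

W, mu = W0.copy(), 0.25
for tau in range(4000):
    L, G = loss_and_grad(W)
    W = W - mu * G
print("final training loss:", L)
print("relative distance from init:", np.linalg.norm(W - W0) / np.linalg.norm(W0))
```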
SLIDE 18 Dependence on data?
Diversity of input data is important...
X := [x_1ᵀ; x_2ᵀ; ... ; x_nᵀ] ∈ R^{n×d},   κ(X) := n ‖X‖ / λ(X)
Definition (Neural network covariance matrix and eigenvalue)
Neural net covariance matrix Σ(X) := (1/k) E_{W_0}[ (φ′(Xw) φ′(Xw)ᵀ) ⊙ (XXᵀ) ]
Eigenvalue λ(X) := λ_min(Σ(X))
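A Monte Carlo sketch of how Σ(X) and λ(X) can be estimated from this definition; here the average over the k random rows of W_0 is replaced by an average over w ~ N(0, I_d), which has the same expectation (the activation and sample sizes are illustrative assumptions):

```python
import numpy as np

def neural_net_covariance(X, dphi, num_samples=2000, seed=0):
    """Estimate Sigma(X) = E_w[(phi'(Xw) phi'(Xw)^T) o (X X^T)] by Monte Carlo, w ~ N(0, I_d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    S = np.zeros((n, n))
    for _ in range(num_samples):
        a = dphi(X @ rng.standard_normal(d))     # phi'(Xw), shape (n,)
        S += np.outer(a, a)
    S /= num_samples
    return S * (X @ X.T)                          # Hadamard product with X X^T

rng = np.random.default_rng(1)
n, d = 50, 30
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
dphi = lambda z: 1.0 / (1.0 + np.exp(-z))         # phi' for the softplus activation
Sigma = neural_net_covariance(X, dphi)
print("lambda(X) =", np.linalg.eigvalsh(Sigma).min())
```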
SLIDE 19 Hermite expansion
Lemma: Let {µ_r(φ′)}_{r=0}^∞ be the Hermite coefficients of φ′. Then,
Σ(X) = Σ_{r=0}^{+∞} µ_r²(φ′) (XXᵀ) ⊙ ... ⊙ (XXᵀ)   [(r+1)-fold Hadamard product]
In particular, Σ(X) ⪰ µ_1²(φ′) (XXᵀ) ⊙ (XXᵀ) = (E[φ′′(g)])² (XXᵀ) ⊙ (XXᵀ)
arbitrary activation ⇔ quadratic activation
Conclusion: For generic data, e.g. x_i i.i.d. uniform on the unit sphere, κ(X) scales like a constant.
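A sketch of how the Hermite coefficients of φ′ can be computed numerically and the expansion checked on a pair of unit-norm inputs with correlation ρ (Gauss–Hermite quadrature and the softplus activation are illustrative choices):

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He   # probabilists' Hermite polynomials

dphi = lambda z: 1.0 / (1.0 + np.exp(-z))      # phi' for the softplus activation

# Hermite coefficients mu_r(phi') = E[phi'(g) He_r(g)] / sqrt(r!), g ~ N(0, 1),
# via Gauss-Hermite quadrature (weight exp(-x^2/2), total mass sqrt(2*pi)).
nodes, weights = He.hermegauss(80)
def mu(r):
    He_r = He.hermeval(nodes, [0.0] * r + [1.0])
    return weights @ (dphi(nodes) * He_r) / (np.sqrt(2 * np.pi) * np.sqrt(factorial(r)))

# For unit-norm x_i, x_j with correlation rho = <x_i, x_j>:
# E_w[phi'(<x_i, w>) phi'(<x_j, w>)] = sum_r mu_r(phi')^2 * rho^r.
rho = 0.3
series = sum(mu(r) ** 2 * rho ** r for r in range(12))

rng = np.random.default_rng(0)
g = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=200_000)
mc = np.mean(dphi(g[:, 0]) * dphi(g[:, 1]))
print(series, mc)   # the two estimates should closely agree
```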
SLIDE 20 Theory for ReLU activations
Data set {(x_i, y_i)}_{i=1}^n with ‖x_i‖_ℓ2 = 1
min_W L(W) := (1/2) Σ_{i=1}^n (vᵀ φ(W x_i) − y_i)²
[plot: the ReLU activation]
Set v at random or balanced (half +, half −)
Run gradient descent W_{τ+1} = W_τ − µ_τ ∇L(W_τ) with random initialization
Theorem (Oymak and Soltanolkotabi 2019)
Assume
ReLU activation: φ(z) = ReLU(z) = max(0, z)
Overparameterization: √(kd) ≳ κ³(X) n
Initialization W_0 with i.i.d. N(0, 1) entries
Then, with high probability
Zero training error: L(W_τ) ≤ (1 − c λ(X)/n)^{2τ} L(W_0)
Iterates remain close to initialization: ‖W_τ − W_0‖_F / ‖W_0‖_F ≲ √(n/(kd))
SLIDE 21 Theory for SGD
Data set {(x_i, y_i)}_{i=1}^n with ‖x_i‖_ℓ2 = 1
min_W L(W) := (1/2) Σ_{i=1}^n (vᵀ φ(W x_i) − y_i)²
Set v at random or balanced (half +, half −)
Run gradient descent W_{τ+1} = W_τ − µ_τ ∇L(W_τ) with random initialization
Theorem (Oymak and Soltanolkotabi 2019)
Assume
Smooth activation: |φ′(z)| ≤ B and |φ′′(z)| ≤ B
Overparameterization: √(kd) ≳ κ(X) n
Initialization W_0 with i.i.d. N(0, 1) entries
Then, with high probability
Zero training error: E[L(W_τ)] ≤ (1 − c λ(X)/n²)^{2τ} L(W_0)
Iterates remain close to initialization: ‖W_τ − W_0‖_F / ‖W_0‖_F ≲ √(n/(kd))
SLIDE 22
Proof Sketch
SLIDE 23 Prelude: over-parametrized linear least-squares
min_{θ∈R^p} L(θ) := (1/2) ‖Xθ − y‖²_ℓ2
with X ∈ R^{n×p} and n ≤ p. Gradient descent starting from θ_0 has three properties:
Global convergence
Converges to the global optimum closest to θ_0
Total gradient path length is relatively short
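A small numerical illustration of these three properties on an underdetermined least-squares problem; the closest global optimum is computed via the pseudoinverse for comparison (sizes and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 200                                    # n <= p: overparameterized
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
theta0 = rng.standard_normal(p)

# Global optimum closest to theta0: theta0 + min-norm solution of X(theta - theta0) = y - X theta0
theta_star = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)

theta, eta = theta0.copy(), 1.0 / np.linalg.norm(X, 2) ** 2
path_length = 0.0
for _ in range(5000):
    step = eta * X.T @ (X @ theta - y)
    path_length += np.linalg.norm(step)
    theta -= step

print("residual:", np.linalg.norm(X @ theta - y))                           # ~ 0: global convergence
print("distance to closest optimum:", np.linalg.norm(theta - theta_star))   # ~ 0
print("path length vs ||theta* - theta0||:", path_length, np.linalg.norm(theta_star - theta0))
```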
SLIDE 24 Over-parametrized nonlinear least-squares
min_{θ∈R^p} L(θ) := (1/2) ‖f(θ) − y‖²_ℓ2,
where y := [y_1, y_2, ..., y_n]ᵀ ∈ R^n, f(θ) := [f(x_1; θ), f(x_2; θ), ..., f(x_n; θ)]ᵀ ∈ R^n, and n ≤ p.
Gradient descent: start from some initial parameter θ_0 and run
θ_{τ+1} = θ_τ − η_τ ∇L(θ_τ),   ∇L(θ) = J(θ)ᵀ (f(θ) − y).
Here, J(θ) ∈ R^{n×p} is the Jacobian matrix with entries J_{ij} = ∂f(x_i; θ)/∂θ_j.
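A sketch of forming the n × p Jacobian numerically for a generic model and checking the identity ∇L(θ) = J(θ)ᵀ(f(θ) − y); the finite-difference Jacobian and the tanh model are illustrative assumptions:

```python
import numpy as np

def jacobian(f, theta, eps=1e-6):
    """Finite-difference Jacobian J(theta) in R^{n x p}; row i = d f(x_i; theta) / d theta."""
    f0 = f(theta)
    J = np.zeros((len(f0), len(theta)))
    for j in range(len(theta)):
        e = np.zeros_like(theta); e[j] = eps
        J[:, j] = (f(theta + e) - f0) / eps
    return J

# Illustrative nonlinear model: f_i(theta) = tanh(a_i . theta)
rng = np.random.default_rng(0)
n, p = 10, 40
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
f = lambda theta: np.tanh(A @ theta)

theta = rng.standard_normal(p)
J = jacobian(f, theta)
grad_via_jacobian = J.T @ (f(theta) - y)                                    # nabla L = J^T (f - y)
grad_analytic = A.T @ ((1 - np.tanh(A @ theta) ** 2) * (f(theta) - y))
print(np.max(np.abs(grad_via_jacobian - grad_analytic)))                    # small: the two agree
```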
SLIDE 25 Key lemma
Lemma: Suppose the following hold over the ball B(θ_0, R) with R := 4 ‖f(θ_0) − y‖_ℓ2 / α:
Jacobian at initialization: σ_min(J(θ_0)) ≥ 2α
Bounded Jacobian spectrum: ‖J(θ)‖ ≤ β
Lipschitz Jacobian: ‖J(θ̃) − J(θ)‖ ≤ L ‖θ̃ − θ‖_ℓ2
Small initial residual: ‖f(θ_0) − y‖_ℓ2 ≤ α²/(4L)
Then, using step size η ≤ 1/(2β²):
Global geometric convergence: ‖f(θ_τ) − y‖²_ℓ2 ≤ (1 − η α²/2)^τ ‖f(θ_0) − y‖²_ℓ2
Iterates stay close to init.: ‖θ_τ − θ_0‖_ℓ2 ≤ (4/α) ‖f(θ_0) − y‖_ℓ2 ≤ 4 (β/α) ‖θ* − θ_0‖_ℓ2
Total gradient path bounded: Σ_{τ=0}^∞ ‖θ_{τ+1} − θ_τ‖_ℓ2 ≤ (4/α) ‖f(θ_0) − y‖_ℓ2
Key idea: Track the dynamics of a potential that couples the residual and the path length, of the form
V_τ := ‖r_τ‖_ℓ2 + (1/2) Σ_{t=0}^{τ−1} (1 − η β²)^{τ−1−t} ‖θ_{t+1} − θ_t‖_ℓ2.
SLIDE 26 Proof sketch (SGD)
Challenge: show that SGD remains in the local neighborhood
Attempt I: Show ‖θ_τ − θ_0‖_ℓ2 is a super-martingale (see also [Tan and Vershynin 2017])
Attempt II: Show that ‖f(θ_τ) − y‖_ℓ2 + λ ‖θ_τ − θ_0‖_ℓ2 is a super-martingale
Final attempt: Show that (1/K) Σ_{i=1}^K ‖θ_τ − v_i‖_ℓ2 + (3η/n) (···) is a super-martingale. Here, {v_i}_{i=1}^K is a very fine cover of B(θ_0, R).
SLIDE 27 Over-parametrized nonlinear least-squares for neural nets
min_{W∈R^{k×d}} L(W) := (1/2) ‖f(W) − y‖²_ℓ2,
where y := [y_1, y_2, ..., y_n]ᵀ ∈ R^n, f(W) := [f(W, x_1), f(W, x_2), ..., f(W, x_n)]ᵀ ∈ R^n, and n ≤ kd.
Linearization via the Jacobian: J(W) = X ∗ (φ′(XWᵀ) diag(v)), with ∗ the row-wise Khatri–Rao product.
SLIDE 28 Key Techniques
Hadamard product: J(W) J(W)ᵀ = (XXᵀ) ⊙ (φ′(XWᵀ) diag(v)² φ′(XWᵀ)ᵀ)
Theorem (Schur 1913) For two PSD matrices A, B ∈ R^{n×n}:
λ_min(A ⊙ B) ≥ λ_min(A) · min_i B_ii
λ_max(A ⊙ B) ≤ λ_max(A) · max_i B_ii
Random matrix theory: at a random initialization W_0,
J(W_0) J(W_0)ᵀ ≈ k · E_w[ (φ′(Xw) φ′(Xw)ᵀ) ⊙ (XXᵀ) ] = k Σ(X)
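A quick numerical check of both ingredients: the row-wise Khatri–Rao structure of J turns J Jᵀ into a Hadamard product of two PSD Gram matrices, and Schur's theorem bounds its extreme eigenvalues (the softplus activation, a ±1 output layer, and the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 40, 15, 60
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((k, d))
v = rng.choice([-1.0, 1.0], size=k)
dphi = lambda z: 1.0 / (1.0 + np.exp(-z))      # phi' for softplus

# J(W): row i is the row-wise Kronecker product of x_i with (phi'(W x_i) * v)
A = dphi(X @ W.T) * v                           # (n, k)
J = np.einsum('ij,ik->ijk', X, A).reshape(n, d * k)

lhs = J @ J.T
rhs = (X @ X.T) * (A @ A.T)                     # Hadamard product of the two Gram matrices
print("identity error:", np.max(np.abs(lhs - rhs)))

# Schur 1913: lambda_min(A1 o B1) >= lambda_min(A1) * min_i B1_ii (and similarly for lambda_max)
A1, B1 = A @ A.T, X @ X.T                       # both PSD; diag(B1) = 1 since ||x_i|| = 1
print(np.linalg.eigvalsh(lhs).min(), ">=", np.linalg.eigvalsh(A1).min() * np.diag(B1).min())
print(np.linalg.eigvalsh(lhs).max(), "<=", np.linalg.eigvalsh(A1).max() * np.diag(B1).max())
```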
SLIDE 29 Side corollary: Nonconvex matrix recovery
Features: A_1, A_2, ..., A_n ∈ R^{d×d}. Labels: y_1, y_2, ..., y_n.
Solve the nonconvex matrix factorization
min_{U∈R^{d×r}} (1/2) Σ_{i=1}^n (⟨A_i, U Uᵀ⟩ − y_i)²
Theorem (Oymak and Soltanolkotabi 2018)
Assume
i.i.d. Gaussian A_i, arbitrary labels y_i
Initialization at a well-conditioned matrix U_0
Then, gradient descent iterates U_τ converge at a geometric rate to a nearby global optimum as soon as n ≲ dr.
Burer–Monteiro and many others: r ≳ √n. For Gaussian A_i we allow r ≳ n/d.
When n ≈ d r_0: Burer–Monteiro requires r ≳ √(d r_0); ours requires r ≳ r_0.
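A sketch of this nonconvex recovery problem at a modest size with Gaussian measurements and arbitrary labels; the step size is set from the Jacobian at the well-conditioned initialization (all sizes and constants are illustrative assumptions, not those of the theorem):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 20, 15, 100                        # number of parameters d*r = 300 >= n
A = rng.standard_normal((n, d, d))           # i.i.d. Gaussian measurement matrices A_i
As = A + A.transpose(0, 2, 1)                # A_i + A_i^T, used by the gradient
y = rng.standard_normal(n)                   # arbitrary labels

U = np.linalg.qr(rng.standard_normal((d, r)))[0]     # well-conditioned initialization U_0

def residual(U):
    return np.einsum('nij,ij->n', A, U @ U.T) - y    # <A_i, U U^T> - y_i

# conservative step size from the Jacobian at initialization (rows = vec((A_i + A_i^T) U_0))
J0 = np.stack([(Ai @ U).ravel() for Ai in As])
eta = 0.25 / np.linalg.norm(J0, 2) ** 2

for _ in range(5000):
    res = residual(U)
    G = np.einsum('n,nij->ij', res, As) @ U          # gradient of 1/2 * sum_i res_i^2
    U = U - eta * G

print("final training loss:", 0.5 * np.sum(residual(U) ** 2))   # should approach 0 when d*r >~ n
```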
SLIDE 30
Previous work
Quadratic activations (unrealistic): [Soltanolkotabi, Javanmard, Lee 2018] and [Venturi, Bandeira, Bruna, ...]
Smooth activations: [Du, Lee, Li, Wang, Zhai 2018] — kd ≳ n² versus k ≳ n⁴.
ReLU activation: [Du et al. 2018] — k ≳ n⁴/d³ versus k ≳ n⁶.
Separation: [Li and Liang 2018], [Allen-Zhu, Li, Song 2018] — k ≳ n¹²/δ⁴ versus k ≳ n²⁵????
Begin to move beyond "lazy training" [Chizat & Bach, 2018]; faster convergence rate.
Deep: [Du, Lee, Li, Wang, Zhai 2018] and [Allen-Zhu, Li, Song 2018]
Mean field analysis for infinitely wide networks: [Mei et al., 2018]; [Chizat & Bach, 2018]; [Sirignano & Spiliopoulos, 2018]; [Rotskoff & Vanden-Eijnden, 2018]; [Wei et al., 2018].
SLIDE 31
Related recent literature
Approximation capability: [Barron 1994], [Telgarsky 2016], [Bölcskei, Grohs, Kutyniok, and Petersen 2017]
More over-parameterization (n ≤ ck): [Poston, Lee, Choie, and Kwon 1991], [Haeffele and Vidal 2015], [Nguyen and Hein 2017]
Under-parameterized with resampling: [Oymak 2018], [Ge, Ma, Lee 2017], [Zhong, Song, Jain, Bartlett, and Dhillon 2017], [Brutzkus and Globerson 2017], [Li and Yuan 2017]
Other learning methods (tensors, kernels, etc.): [Janzamin, Sedghi, and Anandkumar 2015], [Goel and Klivans 2017]
Generalization: [Hardt, Recht, Singer 2016], [Brutzkus, Globerson, Malach, and Shalev-Shwartz 2017], [Golowich, Rakhlin, Shamir 2017], [Dziugaite and Roy 2017], [Bartlett, Foster, Telgarsky 2017], [Neyshabur, Bhojanapalli, McAllester, Srebro 2017], [Arora, Ge, Neyshabur, and Zhang 2018], [Arora, Cohen, Hazan 2018], [Azizan, Hassibi 2018]
Interface with statistical physics: [Choromanska, Henaff, Mathieu, Arous, LeCun 2015], [Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein 2018], [Novak, Bahri, Abolafia, Pennington, Sohl-Dickstein 2018]
Many others...
SLIDE 32 The need for overparameterization beyond width
Simple exercise: initialize W at random and fit only the output layer weights
L(v) := (1/2) Σ_{i=1}^n (vᵀ φ(W x_i) − y_i)² = (1/2) ‖Φ v − y‖²_ℓ2,
a simple least-squares problem with solution v̂ := Φᵀ (ΦΦᵀ)⁻¹ y, where Φ := φ(XWᵀ) ∈ R^{n×k}.
Theorem (Oymak and Soltanolkotabi 2019) Fitting the output layer perfectly interpolates the data w.h.p. as soon as k ≳ n.
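A sketch of this exercise: with k ≳ n random hidden units the feature matrix Φ = φ(XWᵀ) has full row rank w.h.p., so the least-squares output layer interpolates the training data (the ReLU choice and the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 30, 400                      # k >= n hidden units
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((k, d))             # random hidden layer, never trained
y = rng.standard_normal(n)

Phi = np.maximum(X @ W.T, 0)                # Phi = phi(X W^T) with phi = ReLU, shape (n, k)
v_hat = Phi.T @ np.linalg.solve(Phi @ Phi.T, y)   # v_hat = Phi^T (Phi Phi^T)^{-1} y
print("training error:", np.linalg.norm(Phi @ v_hat - y))   # ~ 0: perfect interpolation
```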
SLIDE 33
There is still a huge gap!
SLIDE 34
Benefit II: Robustness to corruption
SLIDE 35
Surprising experiment III-Robustness
Repeat the same experiment but stop early
SLIDE 36 Model (without corruption)
Clean data: (ε₀, δ)-clusterable input/label pairs {(x_i, y_i)}_{i=1}^n ∈ R^d × [−1, 1]
L clusters and K classes
[Figure: clusters of radius ε₀ with centers c_1, ..., c_6 grouped into 3 classes; class labels α_1, α_2, α_3 ∈ [−1, 1] with separation δ = 0.9]
SLIDE 37 Robustness to corruption
Clean data points {(x_i, ȳ_i)}_{i=1}^n; corrupt s := ρn labels to get the corrupted data {(x_i, y_i)}_{i=1}^n.
Fit L(W) := (1/2) Σ_{i=1}^n (f(W, x_i) − y_i)² via gradient descent
Theorem (Oymak and Soltanolkotabi 2019)
Assume
Corruption level ρ < 1/16
Cluster radius ε₀ ≲ 1/L²
Overparameterization k × d ≳ κ²(C) L⁴
Starting from random initialization, after τ ∼ L log(1/ρ) iterations, gradient descent finds a model with perfect accuracy, i.e. the label closest to f(W_τ, x_i) is the true label ȳ_i.
SLIDE 38
Learning versus overfitting
Key insight: distance from initialization
Theorem (Oymak and Soltanolkotabi 2019) With early stopping (τ ∼ L log(1/ρ)) the distance is bounded: ‖W_τ − W_0‖_F ≲ √L. To overfit to the corruption you have to travel far: ‖W − W_0‖_F ∝ √s.
SLIDE 39
Proof Sketch
SLIDE 40
High-level intuition
Intuition I: Network should learn when there is no corruption Intuition II: Network should not fit to the corruption
SLIDE 41 Key Idea I
Reminder: Gradient ∇L(θ) = J(θ) (f(θ, X) − y), Jacobian J(θ) = [∂f(θ, x_1)/∂θ  ∂f(θ, x_2)/∂θ  ...  ∂f(θ, x_n)/∂θ]
Key idea I: If ε₀ = 0, there are only L distinct inputs ⇒ J has exactly rank L.
SLIDE 42 Key Idea I
Reminder: Gradient ∇L(θ) = J(θ) (f(θ, X) − y), Jacobian J(θ) = [∂f(θ, x_1)/∂θ  ∂f(θ, x_2)/∂θ  ...  ∂f(θ, x_n)/∂θ]
Key idea I: If ε₀ is small, the inputs are close to L distinct points ⇒ J has approximately rank L.
SLIDE 43 Key Idea II
Key idea II: Two complementary subspaces
Fast (data) subspace F: subspace associated with the top L right singular vectors of J
Slow (noise) subspace S: complement of F
Interaction of Jacobian and residual in the gradient ∇L(θ) = J(θ) (f(θ, X) − y):
the residual decomposes into two terms
r(θ) := f(θ, X) − y = (f(θ, X) − ȳ) [residual w.r.t. true labels] + (ȳ − y) [corruption]
The residual w.r.t. the true labels falls mostly on F and quickly goes to zero
The corruption ȳ − y falls mostly on S and goes to zero very slowly
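A sketch illustrating Key ideas I and II on synthetic clusterable data: the per-sample-gradient Gram matrix JᵀJ has roughly L dominant eigenvalues, the residual w.r.t. the true labels concentrates on the corresponding top-L subspace F, and the corruption concentrates on its complement S (the cluster construction, the softplus activation, and all sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
L_clusters, per, d, k, eps = 5, 40, 30, 200, 0.05
n = L_clusters * per
centers = rng.standard_normal((L_clusters, d))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
X = np.repeat(centers, per, axis=0) + eps * rng.standard_normal((n, d))   # clusterable inputs
y_true = np.repeat(np.linspace(-1.0, 1.0, L_clusters), per)               # one clean label per cluster
y = y_true.copy()
corrupt = rng.choice(n, size=n // 10, replace=False)
y[corrupt] = rng.uniform(-1.0, 1.0, size=len(corrupt))                    # corrupted labels

# one-hidden-layer network at random initialization
W0 = rng.standard_normal((k, d))
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
phi  = lambda z: np.logaddexp(0.0, z)        # softplus
dphi = lambda z: 1.0 / (1.0 + np.exp(-z))

# Gram matrix of per-sample gradients: J^T J = (X X^T) o (A A^T)
A = dphi(X @ W0.T) * v
G = (X @ X.T) * (A @ A.T)
eigvals, eigvecs = np.linalg.eigh(G)
print("top L+1 eigenvalues:", eigvals[-(L_clusters + 1):])   # roughly L dominant ones

F = eigvecs[:, -L_clusters:]                 # fast (data) subspace: top-L eigenvectors
energy_in_F = lambda r: np.linalg.norm(F.T @ r) ** 2 / np.linalg.norm(r) ** 2

r_clean = phi(X @ W0.T) @ v - y_true         # residual w.r.t. true labels
corruption = y_true - y
print("clean residual energy in F:", energy_in_F(r_clean))     # close to 1
print("corruption energy in F:    ", energy_in_F(corruption))  # small
```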
SLIDE 44
What about real data?
SLIDE 45
Dataset: CIFAR-10. Model: ResNet20. Task: binary classification (airplane vs. truck). n = 10,000 and p = 270,000.
SLIDE 46
Conclusion
Provable benefits of overparameterization More tractable optimization Robustness to corruption
SLIDE 47
Mandatory Postdoc Announcement
SLIDE 48
References
Theoretical Insights into the Optimization Landscape of Over-parameterized Shallow Neural Networks. M. Soltanolkotabi, A. Javanmard, and J. D. Lee, 2017.
Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path? S. Oymak and M. Soltanolkotabi.
Gradient Descent is Provably Robust to Label Noise for Overparameterized Neural Networks. S. Oymak and M. Soltanolkotabi.
Towards Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks. S. Oymak and M. Soltanolkotabi.
Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks. S. Oymak and M. Soltanolkotabi.
SLIDE 49
Thanks!
Funding acknowledgment