Towards Demystifying Overparameterization in Deep Learning


SLIDE 1

Towards Demystifying Overparameterization in Deep Learning

Mahdi Soltanolkotabi, Department of Electrical and Computer Engineering

April 4, 2019, Mathematics of Imaging Workshop #3, Henri Poincaré Institute

SLIDE 2

Collaborators: Samet Oymak and Mingchen Li

SLIDE 3

Motivation (Theory)

SLIDE 4

Many success stories

Neural networks very effective at learning from data

SLIDE 5

Lots of hype

SLIDE 6

Some failures

SLIDE 7

Need more principled understanding

Deep learning-based AI is increasingly used in human-facing services. Challenges:
- Optimization: why can they fit?
- Generalization: why can they predict?
- Architecture: which neural nets?

SLIDE 8

This talk: Overparameterization without overfitting

Mystery: # of parameters ≫ # of training data

SLIDE 9

Surprising experiment I (stolen from B. Recht)

p parameters, n = 50,000 training samples, feature size d = 3072, and 10 classes

SLIDE 10

Surprising experiment II: Overfitting to corruption

Add corruption: corrupt a fraction of the training labels by replacing each with another random label. No corruption on test labels.

SLIDE 11

Surprising experiment III: Robustness

Repeat the same experiment but stop early

SLIDE 12

Surprising experiment III: Robustness

Repeat the same experiment but stop early

SLIDE 13

Benefits of overparameterization for neural networks

Benefit I: Tractable nonconvex optimization

Benefit II: Robustness to corruption with early stopping

SLIDE 14

Benefit I: Tractable nonconvex optimization

SLIDE 15

One-hidden layer

$$y_i = v^T \varphi(W x_i)$$
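In code the model is one line. A minimal numpy sketch (tanh stands in for a generic activation $\varphi$; all sizes here are illustrative, not from the talk):

```python
import numpy as np

def forward(W, v, x, phi=np.tanh):
    """One-hidden-layer net: y = v^T phi(W x), with W of shape (k, d), v of shape (k,)."""
    return v @ phi(W @ x)

# Example: k = 5 hidden units, d = 3 input dimensions
rng = np.random.default_rng(0)
W, v, x = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)
print(forward(W, v, x))
```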

SLIDE 16

Theory for smooth activations

Data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$:

$$\min_W \; L(W) := \sum_{i=1}^n \left(v^T \varphi(W x_i) - y_i\right)^2$$

[Figure: graph of a smooth activation function $\varphi$.]

SLIDE 17

Theory for smooth activations

Data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$:

$$\min_W \; L(W) := \sum_{i=1}^n \left(v^T \varphi(W x_i) - y_i\right)^2$$

Set $v$ at random or balanced (half $+$, half $-$) and run gradient descent $W_{\tau+1} = W_\tau - \mu_\tau \nabla L(W_\tau)$ with random initialization.

Theorem (Oymak and Soltanolkotabi 2019). Assume:
- Smooth activation: $|\varphi'(z)| \le B$ and $|\varphi''(z)| \le B$
- Overparameterization: $\sqrt{kd} \gtrsim \kappa(X)\, n$
- Initialization with i.i.d. $\mathcal{N}(0, 1)$ entries

Then, with high probability:
- Zero training error: $L(W_\tau) \le \left(1 - c\,\frac{d}{n}\right)^{2\tau} L(W_0)$
- Iterates remain close to initialization: $\frac{\|W_\tau - W_0\|_F}{\|W_0\|_F} \lesssim \frac{\sqrt{n}}{\sqrt{kd}}$
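Both conclusions are easy to see in a toy run. The sketch below is illustrative only (softplus as the smooth activation, hand-picked sizes and step size, none of the theorem's constants): gradient descent on a heavily overparameterized one-hidden-layer net drives the training loss to (numerically) zero while $W$ stays relatively close to $W_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 20, 10, 400                           # kd = 4000 >> n: overparameterized
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs, as assumed
y = rng.normal(size=n)

# Balanced output layer (half +, half -), fixed during training
v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)]) / np.sqrt(k)
W0 = rng.normal(size=(k, d))                    # i.i.d. N(0, 1) initialization
W = W0.copy()

phi = lambda z: np.logaddexp(0, z)              # softplus: a smooth activation
dphi = lambda z: 1 / (1 + np.exp(-z))           # its derivative (sigmoid)

mu = 0.2                                        # hand-tuned step size
for _ in range(30000):
    r = phi(X @ W.T) @ v - y                    # residuals f(W) - y
    W -= mu * (dphi(X @ W.T) * np.outer(r, v)).T @ X   # gradient of 1/2 ||r||^2

r = phi(X @ W.T) @ v - y
print("training loss :", 0.5 * r @ r)           # ~ 0: interpolation
print("relative move :", np.linalg.norm(W - W0) / np.linalg.norm(W0))  # well below 1
```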

SLIDE 18

Dependence on data?

Diversity of the input data is important:

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}, \qquad \kappa(X) := \frac{d}{n}\,\frac{\|X\|^2}{\lambda(X)}$$

Definition (Neural network covariance matrix and eigenvalue). The neural net covariance matrix is

$$\Sigma(X) := \frac{1}{k}\,\mathbb{E}_{W_0}\!\left[\mathcal{J}(W_0)\,\mathcal{J}^T(W_0)\right] = \mathbb{E}_w\!\left[\left(\varphi'(Xw)\varphi'(Xw)^T\right) \odot \left(XX^T\right)\right],$$

and its eigenvalue is $\lambda(X) := \lambda_{\min}\left(\Sigma(X)\right)$.
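The definition is easy to probe numerically: estimate $\Sigma(X)$ by Monte Carlo over $w \sim \mathcal{N}(0, I_d)$ and read off $\lambda(X)$. The sketch below uses a sigmoid-style $\varphi'$ and arbitrary sizes as illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 30
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # rows x_i on the unit sphere

dphi = lambda z: 1 / (1 + np.exp(-z))             # phi' for a softplus activation

# Monte Carlo: Sigma(X) = E_w[ phi'(Xw) phi'(Xw)^T ] Hadamard-multiplied by XX^T
T = 20000
A = dphi(X @ rng.normal(size=(d, T)))             # n x T, columns phi'(X w_t)
Sigma = (A @ A.T / T) * (X @ X.T)                 # * is the entrywise product
lam = np.linalg.eigvalsh(Sigma)[0]
print("lambda(X) =", lam)                                     # bounded away from 0
print("kappa(X) ~", d / n * np.linalg.norm(X, 2) ** 2 / lam)  # O(1) for generic data
```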

SLIDE 19

Hermite expansion

Lemma. Let $\{\mu_r(\varphi')\}_{r=0}^{\infty}$ be the Hermite coefficients of $\varphi'$. Then

$$\Sigma(X) = \sum_{r=0}^{+\infty} \mu_r^2(\varphi')\,\underbrace{\left(XX^T\right) \odot \cdots \odot \left(XX^T\right)}_{r+1\ \text{times}} \;\succeq\; \mu_1^2(\varphi')\left(XX^T\right) \odot \left(XX^T\right),$$

where $\mu_1(\varphi') = \mathbb{E}[\varphi''(g)]$ for $g \sim \mathcal{N}(0, 1)$.

arbitrary activation ⇔ quadratic activation

Conclusion: for generic data, e.g. $x_i$ i.i.d. uniform on the unit sphere, $\kappa(X)$ scales like a constant.
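The expansion itself can be checked numerically: compute the Hermite coefficients of $\varphi'$ by Gauss-Hermite quadrature, form the truncated series, and compare against a Monte Carlo estimate of $\Sigma(X)$. A sketch assuming a sigmoid-style $\varphi'$ (all sizes illustrative):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial

rng = np.random.default_rng(1)
n, d = 8, 5
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
dphi = lambda z: 1 / (1 + np.exp(-z))            # phi' for a softplus activation

# Hermite coefficients mu_r = E[phi'(g) He_r(g)] / sqrt(r!), g ~ N(0, 1)
nodes, weights = hermegauss(80)
weights = weights / np.sqrt(2 * np.pi)           # quadrature for the N(0,1) measure
def mu(r):
    basis = np.zeros(r + 1); basis[r] = 1.0      # coefficients selecting He_r
    return (weights * dphi(nodes) * hermeval(nodes, basis)).sum() / np.sqrt(factorial(r))

# Truncated series: sum_r mu_r^2 (XX^T)^{Hadamard power r+1}
G = X @ X.T
H, series = np.ones((n, n)), np.zeros((n, n))
for r in range(30):
    H = H * G                                    # now (XX^T)^{Hadamard (r+1)}
    series += mu(r) ** 2 * H

# Monte Carlo estimate of Sigma(X) for comparison
T = 200000
A = dphi(X @ rng.normal(size=(d, T)))
Sigma = (A @ A.T / T) * G
print("max entrywise gap:", np.abs(series - Sigma).max())   # ~ Monte Carlo error
```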

SLIDE 20

Theory for ReLU activations

Data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$:

$$\min_W \; L(W) := \sum_{i=1}^n \left(v^T \varphi(W x_i) - y_i\right)^2$$

Set $v$ at random or balanced (half $+$, half $-$) and run gradient descent $W_{\tau+1} = W_\tau - \mu_\tau \nabla L(W_\tau)$ with random initialization.

Theorem (Oymak and Soltanolkotabi 2019). Assume:
- ReLU activation: $\varphi(z) = \mathrm{ReLU}(z) = \max(0, z)$
- Overparameterization: $\sqrt{kd} \gtrsim \kappa^3(X)\, n$
- Initialization with i.i.d. $\mathcal{N}(0, 1)$ entries

Then, with high probability:
- Zero training error: $L(W_\tau) \le \left(1 - c\,\frac{d}{n}\right)^{2\tau} L(W_0)$
- Iterates remain close to initialization: $\frac{\|W_\tau - W_0\|_F}{\|W_0\|_F} \lesssim \frac{\sqrt{n}}{\sqrt{kd}}$

SLIDE 21

Theory for SGD

Data set $\{(x_i, y_i)\}_{i=1}^n$ with $\|x_i\|_{\ell_2} = 1$:

$$\min_W \; L(W) := \sum_{i=1}^n \left(v^T \varphi(W x_i) - y_i\right)^2$$

Set $v$ at random or balanced (half $+$, half $-$) and run stochastic gradient descent (one random training example per step) from a random initialization.

Theorem (Oymak and Soltanolkotabi 2019). Assume:
- Smooth activation: $|\varphi'(z)| \le B$ and $|\varphi''(z)| \le B$
- Overparameterization: $\sqrt{kd} \gtrsim \kappa(X)\, n$
- Initialization with i.i.d. $\mathcal{N}(0, 1)$ entries

Then, with high probability:
- Zero training error: $\mathbb{E}[L(W_\tau)] \le \left(1 - c\,\frac{d}{n^2}\right)^{2\tau} L(W_0)$
- Iterates remain close to initialization: $\frac{\|W_\tau - W_0\|_F}{\|W_0\|_F} \lesssim \frac{\sqrt{n}}{\sqrt{kd}}$

SLIDE 22

Proof Sketch

SLIDE 23

Prelude: over-parametrized linear least-squares

$$\min_{\theta \in \mathbb{R}^p} \; L(\theta) := \frac{1}{2}\left\|X\theta - y\right\|_{\ell_2}^2$$

with $X \in \mathbb{R}^{n \times p}$ and $n \le p$. Gradient descent starting from $\theta_0$ has three properties:
- Global convergence
- Converges to the global optimum which is closest to $\theta_0$
- Total gradient path length is relatively short
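The first two properties can be verified in a few lines (sizes and step size are arbitrary choices): gradient descent interpolates and lands on the optimum closest to $\theta_0$, which for linear least squares is $\theta_0 + X^\dagger(y - X\theta_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                                 # overparameterized: p > n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta0 = rng.normal(size=p)
theta = theta0.copy()
eta = 1.0 / np.linalg.norm(X, 2) ** 2          # step size 1 / ||X||^2
for _ in range(5000):
    theta -= eta * X.T @ (X @ theta - y)

# The global optimum closest to theta0: theta0 + X^+ (y - X theta0)
closest = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)
print("residual                    :", np.linalg.norm(X @ theta - y))    # ~ 0
print("distance to closest optimum :", np.linalg.norm(theta - closest))  # ~ 0
```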

SLIDE 24

Over-parametrized nonlinear least-squares

$$\min_{\theta \in \mathbb{R}^p} \; L(\theta) := \frac{1}{2}\left\|f(\theta) - y\right\|_{\ell_2}^2,$$

where

$$y := \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n, \qquad f(\theta) := \begin{bmatrix} f(x_1; \theta) \\ f(x_2; \theta) \\ \vdots \\ f(x_n; \theta) \end{bmatrix} \in \mathbb{R}^n,$$

and $n \le p$. Gradient descent: start from some initial parameter $\theta_0$ and run

$$\theta_{\tau+1} = \theta_\tau - \eta_\tau \nabla L(\theta_\tau), \qquad \nabla L(\theta) = \mathcal{J}(\theta)^T \left(f(\theta) - y\right).$$

Here, $\mathcal{J}(\theta) \in \mathbb{R}^{n \times p}$ is the Jacobian matrix with entries $\mathcal{J}_{ij} = \frac{\partial f(x_i; \theta)}{\partial \theta_j}$.

SLIDE 25

Key lemma

Lemma. Make the following assumptions on the ball $\mathcal{B}(\theta_0, R)$ with $R := \frac{4\|f(\theta_0) - y\|_{\ell_2}}{\alpha}$:
- Jacobian at initialization: $\sigma_{\min}\left(\mathcal{J}(\theta_0)\right) \ge 2\alpha$
- Bounded Jacobian spectrum: $\|\mathcal{J}(\theta)\| \le \beta$
- Lipschitz Jacobian: $\|\mathcal{J}(\tilde{\theta}) - \mathcal{J}(\theta)\| \le L\,\|\tilde{\theta} - \theta\|_F$
- Small initial residual: $\|f(\theta_0) - y\|_{\ell_2} \le \frac{\alpha^2}{4L}$

Then, using step size $\eta \le \frac{2}{\beta^2}$:
- Global geometric convergence: $\|f(\theta_\tau) - y\|_{\ell_2}^2 \le \left(1 - \frac{\eta\alpha^2}{2}\right)^{\tau} \|f(\theta_0) - y\|_{\ell_2}^2$
- Iterates stay close to init: $\|\theta_\tau - \theta_0\|_{\ell_2} \le \frac{4}{\alpha}\|f(\theta_0) - y\|_{\ell_2} \le 4\frac{\beta}{\alpha}\|\theta^* - \theta_0\|_{\ell_2}$
- Total gradient path bounded: $\sum_{\tau=0}^{\infty} \|\theta_{\tau+1} - \theta_\tau\|_{\ell_2} \le \frac{4}{\alpha}\|f(\theta_0) - y\|_{\ell_2}$

Key idea: track the dynamics of $\mathcal{V}_\tau := \|r_\tau\|_{\ell_2} + \frac{1 - \eta\beta^2}{2}\sum_{t=0}^{\tau-1}\|\theta_{t+1} - \theta_t\|_{\ell_2}$.
SLIDE 26

Proof sketch (SGD)

Challenge: show that SGD remains in the local neighborhood.
- Attempt I: show that $\|\theta_\tau - \theta_0\|_{\ell_2}$ is a supermartingale (see also [Tan and Vershynin 2017])
- Attempt II: show that $\|f(\theta_\tau) - y\|_{\ell_2} + \lambda\,\|\theta_\tau - \theta_0\|_{\ell_2}$ is a supermartingale
- Final attempt: show that

$$\frac{1}{K}\sum_{j=1}^{K}\|\theta_\tau - v_j\|_{\ell_2} + \frac{3\eta}{n}\left\|\mathcal{J}^T(\theta_\tau)\left(f(\theta_\tau) - y\right)\right\|_{\ell_2}$$

is a supermartingale. Here, $\{v_j\}_{j=1}^K$ is a very fine cover of $\mathcal{B}(\theta_0, R)$.

SLIDE 27

Over-parametrized nonlinear least-squares for neural nets

$$\min_{W \in \mathbb{R}^{k \times d}} \; L(W) := \frac{1}{2}\left\|f(W) - y\right\|_{\ell_2}^2,$$

where

$$y := \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n, \qquad f(W) := \begin{bmatrix} f(W, x_1) \\ f(W, x_2) \\ \vdots \\ f(W, x_n) \end{bmatrix} \in \mathbb{R}^n,$$

and $n \le kd$. Linearization via the Jacobian:

$$\mathcal{J}(W) = X \ast \left(\varphi'\!\left(XW^T\right)\mathrm{diag}(v)\right),$$

where $\ast$ denotes the (row-wise) Khatri-Rao product.
SLIDE 28

Key Techniques

Hadamard product:

$$\mathcal{J}(W)\mathcal{J}^T(W) = \left(\varphi'\!\left(XW^T\right)\varphi'\!\left(WX^T\right)\right) \odot \left(XX^T\right)$$

Theorem (Schur 1913). For two PSD matrices $A, B \in \mathbb{R}^{n \times n}$:

$$\lambda_{\min}(A \odot B) \ge \left(\min_i B_{ii}\right)\lambda_{\min}(A), \qquad \lambda_{\max}(A \odot B) \le \left(\max_i B_{ii}\right)\lambda_{\max}(A)$$

Random matrix theory:

$$\mathcal{J}(W)\mathcal{J}^T(W) = \sum_{\ell=1}^{k}\left(\varphi'(Xw_\ell)\varphi'(Xw_\ell)^T\right) \odot \left(XX^T\right)$$
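A quick numerical check of Schur's bounds on random PSD matrices (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
M1, M2 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
A, B = M1 @ M1.T, M2 @ M2.T                     # random PSD matrices

eig = np.linalg.eigvalsh                        # eigenvalues in ascending order
lo = eig(A * B)[0] >= B.diagonal().min() * eig(A)[0] - 1e-9
hi = eig(A * B)[-1] <= B.diagonal().max() * eig(A)[-1] + 1e-9
print("Schur lower bound holds:", lo, "| upper bound holds:", hi)
```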
SLIDE 29

Side corollary: Nonconvex matrix recovery

Features: $A_1, A_2, \ldots, A_n \in \mathbb{R}^{d \times d}$. Labels: $y_1, y_2, \ldots, y_n$. Solve the nonconvex matrix factorization problem

$$\min_{U \in \mathbb{R}^{d \times r}} \; \frac{1}{2}\sum_{i=1}^n \left(y_i - \left\langle A_i, UU^T\right\rangle\right)^2$$

Theorem (Oymak and Soltanolkotabi 2018). Assume i.i.d. Gaussian $A_i$, arbitrary labels $y_i$, and initialization at a well-conditioned matrix $U_0$. Then the gradient descent iterates $U_\tau$ converge at a geometric rate to a nearby global optimum as soon as $n \lesssim dr$.

Burer-Monteiro and many others require $r \gtrsim \sqrt{n}$; for Gaussian $A_i$ we allow $r \gtrsim \frac{n}{d}$. When $n \approx dr_0$: Burer-Monteiro needs $r \gtrsim \sqrt{dr_0}$, ours needs $r \gtrsim r_0$.
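A small simulation in the spirit of this corollary; the dimensions, initialization, step size, and iteration budget are ad hoc choices, and convergence is only expected with high probability, not certain:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 10, 5, 30                          # overparameterized regime: n < d*r = 50
A = rng.normal(size=(n, d, d))               # i.i.d. Gaussian measurement matrices
y = rng.normal(size=n)                       # arbitrary labels, as in the corollary

U = np.eye(d)[:, :r].copy()                  # well-conditioned initialization
Asym = A + A.transpose(0, 2, 1)              # the gradient involves A_i + A_i^T
eta = 5e-5                                   # conservative hand-tuned step size
for _ in range(100000):
    res = np.einsum('nij,ij->n', A, U @ U.T) - y         # <A_i, U U^T> - y_i
    U -= eta * np.einsum('n,nij->ij', res, Asym) @ U     # gradient step

final = np.linalg.norm(np.einsum('nij,ij->n', A, U @ U.T) - y)
print("final misfit:", final)                # should be near zero on a typical draw
```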

SLIDE 30

Previous work

- Unrealistic quadratic: [Soltanolkotabi, Javanmard, Lee 2018] and [Venturi, Bandeira, Bruna, ...]
- Smooth activations: [Du, Lee, Li, Wang, Zhai 2018]: $kd \gtrsim n^2$ versus $k \gtrsim n^4$
- ReLU activation: [Du et al. 2018]: $k \gtrsim \frac{n^4}{d^3}$ versus $k \gtrsim n^6$
- Separation: [Li and Liang 2018], [Allen-Zhu, Li, Song 2018]: $k \gtrsim \frac{n^{12}}{\delta^4}$ versus $k \gtrsim n^{25}$(!)
- Begin to move beyond "lazy training" [Chizat & Bach 2018]; faster convergence rate
- Deep: [Du, Lee, Li, Wang, Zhai 2018] and [Allen-Zhu, Li, Song 2018]
- Mean field analysis for infinitely wide networks: [Mei et al. 2018]; [Chizat & Bach 2018]; [Sirignano & Spiliopoulos 2018]; [Rotskoff & Vanden-Eijnden 2018]; [Wei et al. 2018]

SLIDE 31

Related recent literature

- Approximation capability: [Barron 1994], [Telgarsky 2016], [Bolcskei, Grohs, Kutyniok, and Petersen 2017]
- More over-parameterization ($n \le ck$): [Poston, Lee, Choie, and Kwon 1991], [Haeffele and Vidal 2015], [Nguyen and Hein 2017]
- Under-parameterized with resampling: [Oymak 2018], [Ge, Ma, Lee 2017], [Zhong, Song, Jain, Bartlett, and Dhillon 2017], [Brutzkus and Globerson 2017], [Li and Yuan 2017]
- Other learning methods (tensors, kernels, etc.): [Janzamin, Sedghi, and Anandkumar 2015], [Goel and Klivans 2017]
- Generalization: [Hardt, Recht, Singer 2016], [Brutzkus, Globerson, Malach, and Shalev-Shwartz 2017], [Golowich, Rakhlin, Shamir 2017], [Dziugaite and Roy 2017], [Bartlett, Foster, Telgarsky 2017], [Neyshabur, Bhojanapalli, McAllester, Srebro 2017], [Arora, Ge, Neyshabur, and Zhang 2018], [Arora, Cohen, Hazan 2018], [Azizan, Hassibi 2018]
- Interface with statistical physics: [Choromanska, Henaff, Mathieu, Arous, LeCun 2015], [Lee, Bahri, Novak, Schoenholz, Pennington, Sohl-Dickstein 2018], [Novak, Bahri, Abolafia, Pennington, Sohl-Dickstein 2018]
- Many others...

SLIDE 32

The need for overparameterization beyond width

Simple exercise: initialize $W$ at random and just fit the output layer weights:

$$L(v) := \frac{1}{2}\sum_{i=1}^n \left(v^T\varphi(Wx_i) - y_i\right)^2 = \frac{1}{2}\left\|\varphi\!\left(XW^T\right)v - y\right\|_{\ell_2}^2$$

This is a simple least-squares problem: $\hat{v} := \Phi^T\left(\Phi\Phi^T\right)^{-1}y$, where $\Phi := \varphi\!\left(XW^T\right)$.

Theorem (Oymak and Soltanolkotabi 2019). Fitting the output layer perfectly interpolates the data w.h.p. as soon as $k \gtrsim n$.
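The exercise in code: a random, untrained hidden layer followed by a least-squares fit of $v$ interpolates once $k \ge n$ (ReLU and the specific sizes here are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 120                        # k >= n hidden units
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.normal(size=n)

W = rng.normal(size=(k, d))                   # random, untrained hidden layer
Phi = np.maximum(X @ W.T, 0)                  # Phi = phi(X W^T) with ReLU

# v = Phi^T (Phi Phi^T)^{-1} y, the min-norm least-squares solution
v = Phi.T @ np.linalg.solve(Phi @ Phi.T, y)
print("training error:", np.linalg.norm(Phi @ v - y))   # ~ 0: perfect interpolation
```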

SLIDE 33

There is still a huge gap!

SLIDE 34

Benefit II: Robustness to corruption

SLIDE 35

Surprising experiment III: Robustness

Repeat the same experiment but stop early

SLIDE 36

Model (without corruption)

Clean data: $(\epsilon_0, \delta)$-clusterable input/label pairs $\{(x_i, y_i)\}_{i=1}^n \in \mathbb{R}^d \times [-1, 1]$, with $L$ clusters and $K$ classes.

[Figure: six clusters of radius $\epsilon_0$ around centers $c_1, \ldots, c_6$, grouped into three classes; the class labels $\alpha_1, \alpha_2, \alpha_3 \in [-1, 1]$ are separated by $\delta = 0.9$.]

SLIDE 37

Robustness to corruption

Clean data points $\{(x_i, \bar{y}_i)\}_{i=1}^n$; corrupt $s := \rho n$ of the labels to obtain the corrupted data $\{(x_i, y_i)\}_{i=1}^n$. Fit

$$L(W) := \frac{1}{2}\sum_{i=1}^n \left(f(W, x_i) - y_i\right)^2$$

via gradient descent.

Theorem (Oymak and Soltanolkotabi 2019). Assume:
- Corruption level: $\rho < \frac{1}{16}$
- Cluster radius: $\epsilon \lesssim \frac{1}{L^2}$
- Overparameterization: $kd \gtrsim \kappa^2(C)\,L^4$ (with $C$ the matrix of cluster centers)

Starting from a random initialization, after $\tau \sim L\log(1/\rho)$ iterations gradient descent finds a model with perfect accuracy, i.e., the label closest to $f(W_\tau, x_i)$ is the true label $\bar{y}_i$.
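A self-contained simulation in the spirit of this theorem (a crude stand-in, not its exact setting: tanh network, scalar $\pm 1$ labels, made-up sizes and step size). Early in training the model already predicts the clean labels even though it is trained on corrupted ones:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, per = 20, 4, 50                        # L clusters, 50 points each
n = L * per
C = rng.normal(size=(L, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)         # unit-norm cluster centers
X = np.repeat(C, per, axis=0) + 0.02 * rng.normal(size=(n, d))
y_clean = np.repeat([1.0, -1.0, 1.0, -1.0], per)      # class label per cluster

rho = 0.1                                    # corrupt 10% of the training labels
y = y_clean.copy()
flip = rng.choice(n, size=int(rho * n), replace=False)
y[flip] = -y[flip]

k = 200
v = np.concatenate([np.ones(k // 2), -np.ones(k // 2)]) / np.sqrt(k)
W = rng.normal(size=(k, d))
phi, dphi = np.tanh, lambda z: 1 - np.tanh(z) ** 2

mu = 0.02
for it in range(1, 20001):
    r = phi(X @ W.T) @ v - y                 # residual vs the CORRUPTED labels
    W -= mu * (dphi(X @ W.T) * np.outer(r, v)).T @ X
    if it in (200, 20000):
        pred = np.sign(phi(X @ W.T) @ v)
        print(f"iter {it:5d}: agreement with clean labels = "
              f"{np.mean(pred == y_clean):.2f}, loss vs corrupted = {0.5 * r @ r:.3f}")
# The cluster structure is fit quickly; the residual on corrupted points lies in
# the slow subspace, so fitting the corruption takes far longer (and a much
# larger move away from the initialization).
```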

SLIDE 38

Learning versus overfitting

Key insight: distance from initialization.

Theorem (Oymak and Soltanolkotabi 2019). With early stopping ($\tau \sim L\log(1/\rho)$) the distance is bounded: $\|W - W_0\|_F \lesssim \sqrt{L}$. To overfit to the corruption you have to travel far: $\|W - W_0\|_F \propto \sqrt{s}$.

SLIDE 39

Proof Sketch

SLIDE 40

High-level intuition

Intuition I: the network should learn when there is no corruption.

Intuition II: the network should not fit to the corruption.

SLIDE 41

Key Idea I

Reminder:
- Gradient: $\nabla L(\theta) = \mathcal{J}(\theta)\left(f(\theta, X) - y\right)$
- Jacobian: $\mathcal{J}(\theta) = \left[\frac{\partial f(\theta, x_1)}{\partial \theta},\ \frac{\partial f(\theta, x_2)}{\partial \theta},\ \ldots,\ \frac{\partial f(\theta, x_n)}{\partial \theta}\right] \in \mathbb{R}^{p \times n}$

Key idea I: if $\epsilon = 0$, there are only $L$ distinct inputs, so $\mathcal{J}$ has rank exactly $L$.

SLIDE 42

Key Idea I

Reminder:
- Gradient: $\nabla L(\theta) = \mathcal{J}(\theta)\left(f(\theta, X) - y\right)$
- Jacobian: $\mathcal{J}(\theta) = \left[\frac{\partial f(\theta, x_1)}{\partial \theta},\ \frac{\partial f(\theta, x_2)}{\partial \theta},\ \ldots,\ \frac{\partial f(\theta, x_n)}{\partial \theta}\right] \in \mathbb{R}^{p \times n}$

Key idea I: if $\epsilon$ is small, there are only $L$ essentially distinct inputs, so $\mathcal{J}$ has approximately rank $L$.

SLIDE 43

Key Idea II

Key idea II: two complementary subspaces.
- Fast (data) subspace $\mathcal{F}$: the subspace associated with the top $L$ right singular vectors of $\mathcal{J}$
- Slow (noise) subspace $\mathcal{S}$: the complement of $\mathcal{F}$

Interaction of the Jacobian and the residual in the gradient $\nabla L(\theta) = \mathcal{J}(\theta)\left(f(\theta, X) - y\right)$: the residual decomposes into two terms,

$$r(\theta) := f(\theta, X) - y = \underbrace{f(\theta, X) - \bar{y}}_{\text{residual w.r.t. true labels}} + \underbrace{\bar{y} - y}_{\text{corruption}}.$$

The residual w.r.t. the true labels falls mostly onto $\mathcal{F}$ and quickly goes to zero; the corruption $\bar{y} - y$ falls mostly onto $\mathcal{S}$ and goes to zero very slowly.
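Both claims can be visualized on synthetic clusterable data: the Jacobian at a random initialization has $L$ dominant singular directions in residual space, and a random label-flip vector has little energy on them (tanh network; everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, per = 10, 3, 30                        # L = 3 tight clusters of near-duplicates
n = L * per
C = rng.normal(size=(L, d))
C /= np.linalg.norm(C, axis=1, keepdims=True)
X = np.repeat(C, per, axis=0) + 0.01 * rng.normal(size=(n, d))

# Jacobian of a random one-hidden-layer net at initialization
k = 100
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
W = rng.normal(size=(k, d))
dphi = lambda z: 1 - np.tanh(z) ** 2
D = dphi(X @ W.T) * v
J = (D[:, :, None] * X[:, None, :]).reshape(n, k * d)

U, s, _ = np.linalg.svd(J, full_matrices=False)
print("top singular values:", np.round(s[:6], 3))    # sharp drop after the L-th

# Corruption vector (a few label flips) vs the fast subspace U[:, :L]
e = np.zeros(n); e[rng.choice(n, 9, replace=False)] = 1.0
frac_fast = np.linalg.norm(U[:, :L].T @ e) ** 2 / np.linalg.norm(e) ** 2
print("corruption energy in fast subspace:", round(frac_fast, 3))  # small fraction
```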

SLIDE 44

What about real data?

SLIDE 45

Dataset: CIFAR10. Model: ResNet20. Task: binary classification (airplane vs. truck). n = 10,000 and p = 270,000.

SLIDE 46

Conclusion

Provable benefits of overparameterization:
- More tractable optimization
- Robustness to corruption

SLIDE 47

Mandatory Postdoc Announcement

SLIDE 48

References

- Theoretical Insights into the Optimization Landscape of Over-parameterized Shallow Neural Networks. M. Soltanolkotabi, A. Javanmard, and J. D. Lee, 2017.
- Over-parametrized Nonlinear Learning: Gradient Descent Takes the Shortest Path? S. Oymak and M. Soltanolkotabi.
- Towards Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks. S. Oymak and M. Soltanolkotabi.
- Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks. S. Oymak and M. Soltanolkotabi.

SLIDE 49

Thanks!

Funding acknowledgment