
Over-parameterized nonlinear learning: Gradient descent follows the shortest path?

Samet Oymak and Mahdi Soltanolkotabi
Department of Electrical and Computer Engineering
June 2019


Motivation

Modern learning (e.g. deep learning) involves fitting nonlinear models.

Mystery: # of parameters >> # of training data

Challenges:
- Optimization: why can you find a global optimum despite nonconvexity?
- Generalization: why is the global optimum any good for prediction?


Prelude: over-parametrized linear least-squares

$$\min_{\theta \in \mathbb{R}^p} \; L(\theta) := \frac{1}{2}\,\|X\theta - y\|_{\ell_2}^2,$$

with $X \in \mathbb{R}^{n \times p}$ and $n \le p$. Gradient descent starting from $\theta_0$ has three properties:
- Global convergence
- Converges to the global optimum closest to $\theta_0$
- Follows a direct trajectory
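As a quick numerical illustration of these three properties (not part of the slides; the problem sizes and variable names below are illustrative), one can run gradient descent on a random over-parameterized least-squares instance and compare the result against the global minimizer closest to the initialization:

```python
import numpy as np

# Random over-parameterized least-squares instance (n < p).
rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
theta0 = rng.standard_normal(p)

# The global minimizer closest to theta0 in Euclidean distance.
theta_closest = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)

# Gradient descent with step size 1 / ||X||^2.
eta = 1.0 / np.linalg.norm(X, 2) ** 2
theta = theta0.copy()
for _ in range(20000):
    theta -= eta * X.T @ (X @ theta - y)

print("loss:", 0.5 * np.linalg.norm(X @ theta - y) ** 2)                  # ~0: global convergence
print("gap to closest optimum:", np.linalg.norm(theta - theta_closest))   # ~0: converges to the optimum nearest theta0
```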


Over-parametrized nonlinear least-squares

$$\min_{\theta \in \mathbb{R}^p} \; L(\theta) := \frac{1}{2}\,\|f(\theta) - y\|_{\ell_2}^2,$$

where $y := [y_1, y_2, \ldots, y_n]^T \in \mathbb{R}^n$, $f(\theta) := [f(x_1; \theta), f(x_2; \theta), \ldots, f(x_n; \theta)]^T \in \mathbb{R}^n$, and $n \le p$.

Run gradient descent: $\theta_{\tau+1} = \theta_\tau - \eta_\tau \nabla L(\theta_\tau)$

Gradient and Jacobian: $\nabla L(\theta) = \mathcal{J}(\theta)^T (f(\theta) - y)$, where $\mathcal{J}(\theta) = \frac{\partial f(\theta)}{\partial \theta} \in \mathbb{R}^{n \times p}$ is the Jacobian matrix.

Intuition: the Jacobian replaces the feature matrix $X$.
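To make the update concrete, here is a minimal sketch of a Jacobian-based gradient step using a placeholder nonlinear model $f(x_i; \theta) = \tanh(x_i^T \theta)$. The model choice, sizes, and names are illustrative assumptions, not from the slides; only the update rule $\nabla L(\theta) = \mathcal{J}(\theta)^T(f(\theta) - y)$ is:

```python
import numpy as np

# Placeholder nonlinear model: f(x_i; theta) = tanh(<x_i, theta>).
def f(theta, X):
    return np.tanh(X @ theta)

def jacobian(theta, X):
    # n x p Jacobian of f at theta for the tanh model above.
    return (1.0 - np.tanh(X @ theta) ** 2)[:, None] * X

def gd_step(theta, X, y, eta):
    residual = f(theta, X) - y                 # f(theta) - y
    grad = jacobian(theta, X).T @ residual     # J(theta)^T (f(theta) - y)
    return theta - eta * grad

rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p) / np.sqrt(p)
y = f(theta_true, X)                           # planted labels so zero loss is attainable
theta = rng.standard_normal(p) / np.sqrt(p)

eta = 1.0 / np.linalg.norm(X, 2) ** 2          # ||J(theta)|| <= ||X|| for this model
for _ in range(5000):
    theta = gd_step(theta, X, y, eta)
print("loss:", 0.5 * np.linalg.norm(f(theta, X) - y) ** 2)  # should be small; see the conditions on the next slide
```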

Gradient descent trajectory

Assumptions:
- Minimum singular value at initialization: $\sigma_{\min}(\mathcal{J}(\theta_0)) \ge 2\alpha$
- Maximum singular value: $\|\mathcal{J}(\theta)\| \le \beta$
- Jacobian smoothness: $\|\mathcal{J}(\theta_2) - \mathcal{J}(\theta_1)\| \le L\,\|\theta_2 - \theta_1\|_{\ell_2}$
- Initial error: $\|f(\theta_0) - y\|_{\ell_2} \le \frac{\alpha^2}{4L}$

Theorem (Oymak and Soltanolkotabi 2018). Assume the above hold over a ball of radius $R = \frac{\|f(\theta_0) - y\|_{\ell_2}}{\alpha}$ around $\theta_0$, and set $\eta = \frac{1}{\beta^2}$. Then gradient descent satisfies:
- Global convergence: $\|f(\theta_\tau) - y\|_{\ell_2}^2 \le \left(1 - \frac{1}{2}\frac{\alpha^2}{\beta^2}\right)^{\tau} \|f(\theta_0) - y\|_{\ell_2}^2$
- Converges to near the closest global minimum to initialization: $\|\theta_\tau - \theta_0\|_{\ell_2} \le \frac{\beta}{\alpha}\,\|\theta^* - \theta_0\|_{\ell_2}$
- Takes an approximately direct route
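A short consequence worth spelling out (not stated on the slide): the geometric decay translates into an explicit iteration count. Using $\log(1 - x) \le -x$, the training error drops below a target $\varepsilon$ once

$$\left(1 - \tfrac{1}{2}\tfrac{\alpha^2}{\beta^2}\right)^{\tau} \|f(\theta_0) - y\|_{\ell_2}^2 \le \varepsilon \quad \text{whenever} \quad \tau \ \ge\ \frac{2\beta^2}{\alpha^2}\,\log\frac{\|f(\theta_0) - y\|_{\ell_2}^2}{\varepsilon},$$

so the squared Jacobian condition ratio $\beta^2/\alpha^2$ controls how many iterations are needed.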


Concrete example: One-hidden-layer neural net

Training data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

Loss: $L(v, W) := \sum_{i=1}^{n} \left(v^T \phi(W x_i) - y_i\right)^2$

Algorithm: gradient descent with random Gaussian initialization

Theorem (Oymak and Soltanolkotabi 2019). As long as #parameters $\gtrsim$ (# of training data)$^2$, then, with high probability:
- Zero training error: $L(v_\tau, W_\tau) \le (1 - \rho)^{\tau}\, L(v_0, W_0)$
- Iterates remain close to initialization
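This setting is easy to reproduce in a few lines. The sketch below is illustrative only (the activation, sizes, step size, and iteration count are assumptions, not from the paper): it trains $v^T\phi(Wx)$ with $\phi = \tanh$ from a random Gaussian initialization and reports how far the training loss falls and how far the iterates move.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 30, 10, 200                  # samples, input dim, hidden units; k*(d+1) > n^2
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

W = rng.standard_normal((k, d)) / np.sqrt(d)   # random Gaussian initialization
v = rng.standard_normal(k) / np.sqrt(k)
W0, v0 = W.copy(), v.copy()

phi = np.tanh
dphi = lambda z: 1.0 - np.tanh(z) ** 2

def loss(v, W):
    return np.sum((phi(X @ W.T) @ v - y) ** 2)

eta = 1e-3
for _ in range(10000):
    Z = X @ W.T                        # n x k pre-activations
    r = phi(Z) @ v - y                 # residuals v^T phi(W x_i) - y_i
    # Gradients of sum_i (v^T phi(W x_i) - y_i)^2
    grad_v = 2.0 * phi(Z).T @ r
    grad_W = 2.0 * ((r[:, None] * dphi(Z)) * v[None, :]).T @ X
    v -= eta * grad_v
    W -= eta * grad_W

print("initial / final training loss:", loss(v0, W0), loss(v, W))
print("movement from init:", np.linalg.norm(W - W0), np.linalg.norm(v - v0))
```

With these over-parameterized sizes the final loss should be orders of magnitude below the initial one, while the iterates typically stay close to the initialization relative to its norm, mirroring the two claims of the theorem.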


Further results and applications

Extensions to SGD and other loss functions

Theoretical justification for:
- Early stopping
- Robustness to label noise
- Generalization

Other applications:
- Fitting generalized linear models
- Low-rank matrix recovery


Conclusion

(Stochastic) gradient descent has three intriguing properties:
- Global convergence
- Converges to near the closest global optimum to initialization
- Follows a direct trajectory


Thanks!

Poster: Thursday, 6:30 PM, #95

References:
- Over-parametrized nonlinear learning: Gradient descent follows the shortest path? S. Oymak and M. Soltanolkotabi.
- Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. S. Oymak and M. Soltanolkotabi.
- Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks. M. Li, M. Soltanolkotabi, and S. Oymak.
- Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian. S. Oymak, Z. Fabian, M. Li, and M. Soltanolkotabi.