SLIDE 1 Over-parameterized nonlinear learning: Gradient descent follows the shortest path?
Samet Oymak and Mahdi Soltanolkotabi
Department of Electrical and Computer Engineering
June 2019
SLIDE 6 Motivation
Modern learning (e.g. deep learning) involves fitting nonlinear models.
Mystery: # of parameters >> # of training data
Challenges:
- Optimization: Why can you find a global optimum despite nonconvexity?
- Generalization: Why is the global optimum any good for prediction?
SLIDE 11 Prelude: over-parametrized linear least-squares
$$\min_{\theta \in \mathbb{R}^p} \mathcal{L}(\theta) := \frac{1}{2}\|X\theta - y\|_{\ell_2}^2$$
with $X \in \mathbb{R}^{n \times p}$ and $n \le p$.
Gradient descent starting from $\theta_0$ has three properties:
- Global convergence
- Converges to the closest global optimum to $\theta_0$
- Follows a direct trajectory
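To make these three properties concrete, here is a minimal NumPy sketch (not from the slides; the random problem instance and step size are illustrative choices). For linear least squares the global optimum closest to $\theta_0$ has the closed form $\theta_0 + X^{\dagger}(y - X\theta_0)$, so we can check both claims directly:

```python
# Gradient descent on an over-parameterized linear least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                            # n <= p: over-parameterized
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

theta0 = rng.standard_normal(p)           # arbitrary initialization
theta = theta0.copy()
eta = 1.0 / np.linalg.norm(X, 2) ** 2     # step size 1 / sigma_max(X)^2

for _ in range(5000):
    theta -= eta * X.T @ (X @ theta - y)  # gradient of (1/2)||X theta - y||^2

# Closest global optimum to theta0 in Euclidean distance.
theta_star = theta0 + np.linalg.pinv(X) @ (y - X @ theta0)

print("training error:", np.linalg.norm(X @ theta - y))               # ~0
print("dist to closest optimum:", np.linalg.norm(theta - theta_star)) # ~0
```

The iterates stay in the affine space $\theta_0 + \text{rowspace}(X)$, which is why gradient descent lands at the projection of $\theta_0$ onto the solution set rather than some other global optimum.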
SLIDE 14 Over-parametrized nonlinear least-squares
$$\min_{\theta \in \mathbb{R}^p} \mathcal{L}(\theta) := \frac{1}{2}\|f(\theta) - y\|_{\ell_2}^2,$$
where $y := [y_1\; y_2\; \dots\; y_n]^T \in \mathbb{R}^n$, $f(\theta) := [f(x_1;\theta)\; f(x_2;\theta)\; \dots\; f(x_n;\theta)]^T \in \mathbb{R}^n$, and $n \le p$.
Run gradient descent: $\theta_{\tau+1} = \theta_\tau - \eta_\tau \nabla\mathcal{L}(\theta_\tau)$
Gradient and Jacobian: $\nabla\mathcal{L}(\theta) = \mathcal{J}(\theta)^T (f(\theta) - y)$, where $\mathcal{J}(\theta) = \frac{\partial f(\theta)}{\partial \theta} \in \mathbb{R}^{n \times p}$ is the Jacobian matrix.
Intuition: the Jacobian replaces the feature matrix $X$.
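The gradient formula above is all that is needed to implement the method. Below is a minimal sketch assuming the model $f(x; \theta) = \tanh(x^T \theta)$, a generalized linear model chosen purely for illustration (the slides do not fix a model here); the Jacobian then has rows $\tanh'(x_i^T\theta)\,x_i^T$, and since $\|\mathcal{J}(\theta)\| \le \|X\|$ for this model, the step size $\eta = 1/\|X\|^2$ is consistent with the $\eta = 1/\beta^2$ choice in the theorem on the next slide:

```python
# Gradient descent on a nonlinear least-squares fit via the Jacobian.
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 100                            # n <= p: over-parameterized
X = rng.standard_normal((n, p))
y = np.tanh(X @ (rng.standard_normal(p) / np.sqrt(p)))   # realizable labels

def f(theta):                             # f(theta) in R^n
    return np.tanh(X @ theta)

def jacobian(theta):                      # J(theta) in R^{n x p}
    # Row i is tanh'(x_i^T theta) * x_i^T.
    return (1.0 - np.tanh(X @ theta) ** 2)[:, None] * X

theta = np.zeros(p)                       # theta_0
eta = 1.0 / np.linalg.norm(X, 2) ** 2     # eta = 1/||X||^2 <= 1/beta^2

for _ in range(5000):
    r = f(theta) - y                      # residual f(theta) - y
    theta -= eta * jacobian(theta).T @ r  # grad L(theta) = J(theta)^T (f(theta) - y)

print("training error:", np.linalg.norm(f(theta) - y))   # should shrink toward 0
```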
SLIDE 16 Gradient descent trajectory
Assumptions:
- Minimum singular value at initialization: $\sigma_{\min}(\mathcal{J}(\theta_0)) \ge 2\alpha$
- Maximum singular value: $\|\mathcal{J}(\theta)\| \le \beta$
- Jacobian smoothness: $\|\mathcal{J}(\theta_2) - \mathcal{J}(\theta_1)\| \le L\,\|\theta_2 - \theta_1\|_{\ell_2}$
- Initial error: $\|f(\theta_0) - y\|_{\ell_2} \le \frac{\alpha^2}{4L}$

Theorem (Oymak and Soltanolkotabi 2018). Assume the above hold over a ball of radius $R = \frac{\|f(\theta_0) - y\|_{\ell_2}}{\alpha}$ around $\theta_0$ and set $\eta = \frac{1}{\beta^2}$. Then:
- Global convergence: $\|f(\theta_\tau) - y\|_{\ell_2}^2 \le \left(1 - \frac{\alpha^2}{2\beta^2}\right)^{\tau} \|f(\theta_0) - y\|_{\ell_2}^2$
- Converges to near the closest global minimum $\theta^*$ to initialization: $\|\theta_\tau - \theta_0\|_{\ell_2} \le \frac{\beta}{\alpha}\,\|\theta^* - \theta_0\|_{\ell_2}$
- Takes an approximately direct route
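To see where a contraction of this form comes from, here is a heuristic one-step calculation (a sketch of the intuition only, not the proof; it drops the linearization error, which the smoothness and initial-error assumptions control at the cost of constant factors in the rate):

```latex
% Write r_\tau := f(\theta_\tau) - y and J := \mathcal{J}(\theta_\tau).
% Linearizing f around \theta_\tau along the gradient step gives
\begin{align*}
r_{\tau+1} &= f\big(\theta_\tau - \eta J^T r_\tau\big) - y
  \;\approx\; r_\tau - \eta J J^T r_\tau
  \;=\; \big(I - \eta J J^T\big)\, r_\tau, \\
\|r_{\tau+1}\|_{\ell_2} &\le \big\|I - \eta J J^T\big\| \, \|r_\tau\|_{\ell_2}
  \;\le\; \Big(1 - \frac{\alpha^2}{\beta^2}\Big) \|r_\tau\|_{\ell_2},
\end{align*}
% using \eta = 1/\beta^2 and \alpha \le \sigma_{\min}(J) \le \|J\| \le \beta, which
% hold on the radius-R ball: the smoothness bound L R \le \alpha/4 keeps
% \sigma_{\min}(\mathcal{J}(\theta)) \ge 2\alpha - \alpha/4 \ge \alpha throughout.
```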
SLIDE 17 Concrete example: One-hidden layer neural net
Training data: $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
Loss: $\mathcal{L}(v, W) := \frac{1}{2}\sum_{i=1}^{n} \left(y_i - v^T \phi(W x_i)\right)^2$
Algorithm: gradient descent with random Gaussian initialization
Theorem (Oymak and Soltanolkotabi 2019). As long as # parameters ≳ (# of training data)², with high probability:
- Zero training error: $\mathcal{L}(v_\tau, W_\tau) \le (1 - \rho)^{\tau}\, \mathcal{L}(v_0, W_0)$
- Iterates remain close to initialization
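A minimal training sketch of this setting follows (assumptions flagged: the $\tanh$ activation, the initialization scaling, and the step-size heuristic are illustrative choices of mine; the slides only specify gradient descent from random Gaussian initialization). It fits $n$ points with far more than $n^2$ parameters and reports both the training loss and how far the weights moved from initialization:

```python
# One-hidden-layer net x -> v^T phi(W x) trained by gradient descent.
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 10, 5, 200                      # k*(d+1) = 1200 params >> n^2 = 100

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

phi = np.tanh                             # smooth activation (an assumption)
def dphi(z):
    return 1.0 - np.tanh(z) ** 2

W = rng.standard_normal((k, d)) / np.sqrt(d)   # random Gaussian initialization
v = rng.standard_normal(k) / np.sqrt(k)
W0, v0 = W.copy(), v.copy()

def loss(v, W):
    return 0.5 * np.sum((phi(X @ W.T) @ v - y) ** 2)

eta = 0.5 / np.linalg.norm(phi(X @ W0.T), 2) ** 2   # conservative step size

for _ in range(5000):
    Z = X @ W.T                           # (n, k) pre-activations
    r = phi(Z) @ v - y                    # residuals f(v, W) - y
    grad_v = phi(Z).T @ r                 # dL/dv, shape (k,)
    grad_W = (dphi(Z) * np.outer(r, v)).T @ X   # dL/dW, shape (k, d)
    v -= eta * grad_v
    W -= eta * grad_W

print("training loss:", loss(v, W))            # should be near zero
print("||W - W0||_F:", np.linalg.norm(W - W0)) # should stay small
print("||v - v0||:", np.linalg.norm(v - v0))   # should stay small
```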
SLIDE 18 Further results and applications
- Extensions to SGD and other loss functions
- Theoretical justification for: early stopping, robustness to label noise, generalization
- Other applications: fitting generalized linear models, low-rank matrix recovery
SLIDE 19 Conclusion
(Stochastic) gradient descent has three intriguing properties:
- Global convergence
- Converges to near the closest global optimum to initialization
- Follows a direct trajectory
SLIDE 20 Thanks!
Poster: Thursday, 6:30 PM, #95
References:
- Over-parametrized nonlinear learning: Gradient descent follows the shortest path? S. Oymak and M. Soltanolkotabi.
- Towards moderate overparameterization: Global convergence guarantees for training shallow neural networks. S. Oymak and M. Soltanolkotabi.
- Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. M. Li, M. Soltanolkotabi, and S. Oymak.
- Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian. S. Oymak, Z. Fabian, M. Li, and M. Soltanolkotabi.