Analyzing Optimization in Deep Learning via Trajectories


SLIDE 1

Analyzing Optimization in Deep Learning via Trajectories

Nadav Cohen
Institute for Advanced Study

Institute for Computational and Experimental Research in Mathematics (ICERM)
Workshop on Theory and Practice in Machine Learning and Computer Vision
19 February 2019

SLIDE 2

Deep Learning

[Image. Source: NVIDIA (www.slideshare.net/openomics/the-revolution-of-deep-learning)]

SLIDE 3

Limited Formal Understanding

SLIDE 4

DL Theory: Expressiveness, Optimization & Generalization

Outline

1. Deep Learning Theory: Expressiveness, Optimization and Generalization
2. Analyzing Optimization via Trajectories
3. Trajectories of Gradient Descent for Deep Linear Neural Networks
   - Convergence to Global Optimum
   - Acceleration by Depth
4. Conclusion

SLIDES 5–11

DL Theory: Expressiveness, Optimization & Generalization

Statistical Learning Setup

X — instance space (e.g. R^{100×100} for 100-by-100 grayscale images)
Y — label space (e.g. R for regression or {1, …, k} for classification)
D — distribution over X × Y (unknown)
ℓ : Y × Y → R_{≥0} — loss func (e.g. ℓ(y, ŷ) = (y − ŷ)² for Y = R)

Task: Given training set S = {(X_i, y_i)}_{i=1}^m drawn i.i.d. from D, return hypothesis (predictor) h : X → Y that minimizes the population loss:
L_D(h) := E_{(X,y)∼D}[ℓ(y, h(X))]

Approach: Predetermine hypotheses space H ⊂ Y^X, and return hypothesis h ∈ H that minimizes the empirical loss:
L_S(h) := (1/m) ∑_{i=1}^m ℓ(y_i, h(X_i))
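A minimal numerical sketch of this setup (the concrete choices here, linear hypotheses h_w(x) = ⟨w, x⟩, squared loss and the dimensions, are illustrative assumptions rather than anything from the talk):

```python
import numpy as np

# Sketch: empirical loss L_S(h) for a linear hypothesis h_w(x) = <w, x>
# with squared loss l(y, yhat) = (y - yhat)^2. Data is synthetic.
rng = np.random.default_rng(0)
m, d = 100, 5
X = rng.standard_normal((m, d))                      # instances X_i
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(m)        # noisy labels y_i

def empirical_loss(w):
    """L_S(h_w) = (1/m) * sum_i (y_i - <w, X_i>)^2."""
    return np.mean((y - X @ w) ** 2)

print(empirical_loss(np.zeros(d)))   # loss of the zero predictor
print(empirical_loss(w_true))        # near the noise floor (~0.01)
```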

SLIDES 12–15

DL Theory: Expressiveness, Optimization & Generalization

Three Pillars of Statistical Learning Theory: Expressiveness, Generalization and Optimization

[Diagram: hypotheses space H nested inside the space of all functions Y^X, marking f*_D, h*_D, h*_S and h̄, with arrows for the three error terms below]

f*_D — ground truth (minimizer of population loss over Y^X)
h*_D — optimal hypothesis (minimizer of population loss over H)
h*_S — empirically optimal hypothesis (minimizer of empirical loss over H)
h̄ — returned hypothesis

Approximation Error (Expressiveness): between f*_D and h*_D
Estimation Error (Generalization): between h*_D and h*_S
Training Error (Optimization): between h*_S and h̄

SLIDES 16–19

DL Theory: Expressiveness, Optimization & Generalization

Classical Machine Learning

[Diagram: as on the previous slide, showing approximation, estimation and training errors over H inside Y^X]

Optimization: Empirical loss minimization is a convex program: h̄ ≈ h*_S (training err ≈ 0)

Expressiveness & Generalization: Bias-variance trade-off:
- H expands ⟹ approximation err shrinks, estimation err grows
- H shrinks ⟹ approximation err grows, estimation err shrinks

Well developed theory

SLIDES 20–28

DL Theory: Expressiveness, Optimization & Generalization

Deep Learning

[Diagram: as before, showing approximation, estimation and training errors over H inside Y^X]

Optimization: Empirical loss minimization is a non-convex program:
- h*_S is not unique — many hypotheses have low training err
- Gradient descent (GD) somehow reaches one of these

Expressiveness & Generalization: Vast difference from classical ML:
- Some low training err hypotheses generalize well, others don't
- W/typical data, solution returned by GD often generalizes well
- Expanding H reduces approximation err, but also estimation err!

Not well understood

SLIDE 29

Analyzing Optimization via Trajectories

Outline

1. Deep Learning Theory: Expressiveness, Optimization and Generalization
2. Analyzing Optimization via Trajectories
3. Trajectories of Gradient Descent for Deep Linear Neural Networks
   - Convergence to Global Optimum
   - Acceleration by Depth
4. Conclusion

SLIDE 30

Analyzing Optimization via Trajectories

Optimization

[Diagram: hypotheses space H inside all functions Y^X, highlighting Training Error (Optimization)]

f*_D — ground truth
h*_D — optimal hypothesis
h*_S — empirically optimal hypothesis
h̄ — returned hypothesis

SLIDES 31–33

Analyzing Optimization via Trajectories

Approach: Convergence via Critical Points

A prominent approach for analyzing optimization in DL is via critical points (∇ = 0) in the loss landscape

[Diagram: loss landscape with a non-strict saddle, a good local minimum (≈ global minimum), a poor local minimum and a strict saddle; (1) and (2) mark the conditions below]

Result (cf. Ge et al. 2015; Lee et al. 2016): If (1) there are no poor local minima, and (2) all saddle points are strict, then gradient descent (GD) converges to global minimum

Motivated by this, many works¹ studied the validity of (1) and/or (2)

¹ e.g. Haeffele & Vidal 2015; Kawaguchi 2016; Soudry & Carmon 2016; Safran & Shamir 2018

SLIDES 34–37

Analyzing Optimization via Trajectories

Limitations

Convergence of GD to global min was proven via critical points only for problems involving shallow (2 layer) models

The approach is insufficient when treating deep (≥ 3 layer) models:
- (2) is violated — ∃ non-strict saddles, e.g. when all weights = 0 (see the sketch after this slide)
- Algorithmic aspects essential for convergence w/deep models, e.g. proper initialization, are ignored
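To make the non-strict saddle concrete, here is a small self-contained sketch (an illustrative scalar version of the slide's "all weights = 0" example, not code from the talk): with loss (w_N ··· w_1 − 1)² over scalar layers, the origin is a saddle whose Hessian has a negative eigenvalue at depth 2 (strict) but vanishes entirely at depth 3 (non-strict).

```python
import numpy as np

# Scalar "network" w_N * ... * w_1 with loss (prod(w) - 1)^2: at depth 2 the
# all-zeros point is a strict saddle, at depth 3 its Hessian is identically 0.
def loss(w):
    return (np.prod(w) - 1.0) ** 2

def hessian(w, eps=1e-4):
    """Central finite-difference Hessian of loss at w."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (loss(w + e_i + e_j) - loss(w + e_i - e_j)
                       - loss(w - e_i + e_j) + loss(w - e_i - e_j)) / (4 * eps ** 2)
    return H

for n in (2, 3):
    H = hessian(np.zeros(n))
    # depth 2: eigenvalues ~ (-2, 2) -> strict saddle; depth 3: all ~0 -> non-strict
    print(f"depth {n}: Hessian eigenvalues at 0 =", np.round(np.linalg.eigvalsh(H), 6))
```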

SLIDES 38–39

Analyzing Optimization via Trajectories

Optimizer Trajectories Matter

Different optimization trajectories may lead to qualitatively different results
⟹ details of algorithm and init should be taken into account!

SLIDES 40–43

Analyzing Optimization via Trajectories

Existing Trajectory Analyses

The trajectory approach led to successful analyses of shallow models:
Brutzkus & Globerson 2017; Li & Yuan 2017; Zhong et al. 2017; Tian 2017; Brutzkus et al. 2018; Li et al. 2018; Du et al. 2018; Oymak & Soltanolkotabi 2018

It also allowed treating prohibitively large deep models:
Du et al. 2018; Allen-Zhu et al. 2018; Zou et al. 2018

For deep linear residual networks, trajectories were used to show efficient convergence of GD to global min (Bartlett et al. 2018)

SLIDE 44

Trajectories of GD for Deep LNNs

Outline

1. Deep Learning Theory: Expressiveness, Optimization and Generalization
2. Analyzing Optimization via Trajectories
3. Trajectories of Gradient Descent for Deep Linear Neural Networks
   - Convergence to Global Optimum
   - Acceleration by Depth
4. Conclusion

SLIDE 45

Trajectories of GD for Deep LNNs

Sources

"On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization"
Arora, Cohen, Hazan (alphabetical order)
International Conference on Machine Learning (ICML) 2018

"A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks"
Arora, Cohen, Golowich, Hu (alphabetical order)
To appear: International Conference on Learning Representations (ICLR) 2019

SLIDE 46

Trajectories of GD for Deep LNNs

Collaborators

Sanjeev Arora, Elad Hazan, Wei Hu, Noah Golowich

SLIDE 47

Trajectories of GD for Deep LNNs: Convergence to Global Optimum

Outline

1. Deep Learning Theory: Expressiveness, Optimization and Generalization
2. Analyzing Optimization via Trajectories
3. Trajectories of Gradient Descent for Deep Linear Neural Networks
   - Convergence to Global Optimum
   - Acceleration by Depth
4. Conclusion

SLIDES 48–51

Trajectories of GD for Deep LNNs: Convergence to Global Optimum

Linear Neural Networks

Linear neural networks (LNN) are fully-connected neural networks w/linear (no) activation:
x ↦ y = W_N ··· W_2 W_1 x

As surrogate for optimization in DL, GD over LNN (highly non-convex problem) is studied extensively¹

Existing Result (Bartlett et al. 2018): W/linear residual networks (a special case: the W_j are square and init to Id), for ℓ2 loss on certain data, GD efficiently converges to global min
↑ Only existing proof of efficient convergence to global min for GD training a deep model

¹ e.g. Saxe et al. 2014; Kawaguchi 2016; Hardt & Ma 2017; Laurent & Brecht 2018
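A quick sketch of the object under study (dimensions are arbitrary assumptions): the layer-by-layer LNN map always collapses to a single end-to-end matrix, so depth adds no expressiveness here.

```python
import numpy as np

# Sketch: an LNN computes x -> W_N ... W_2 W_1 x, which equals a single
# linear map W_{1:N} x. Dimensions below are arbitrary.
rng = np.random.default_rng(0)
dims = [5, 4, 4, 3]                                  # input, two hidden, output
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

def lnn(x):
    for W in Ws:                                     # layer-by-layer forward pass
        x = W @ x
    return x

W_end_to_end = Ws[2] @ Ws[1] @ Ws[0]                 # W_{1:N} = W_N ... W_1
x = rng.standard_normal(dims[0])
print(np.allclose(lnn(x), W_end_to_end @ x))         # True
```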

SLIDES 52–53

Trajectories of GD for Deep LNNs: Convergence to Global Optimum

Gradient Flow

Gradient flow (GF) is a continuous version of GD (learning rate → 0):
(d/dt) α(t) = −∇f(α(t)),  t ∈ R_{>0}

[Diagram: gradient descent (discrete steps) vs gradient flow (smooth curve)]

Admits use of theoretical tools from differential geometry/equations
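A small sketch of the GF/GD relationship on the toy objective f(α) = α² (an illustrative example, not from the talk): each GD update is an Euler step of the flow, and as the learning rate shrinks, the discrete endpoint approaches the continuous solution α(t) = α(0)e^{−2t}.

```python
import numpy as np

# Sketch: GD is the Euler discretization of gradient flow. For f(a) = a^2,
# the flow d/dt a(t) = -2 a(t) has the closed form a(t) = a(0) * exp(-2t).
def grad_f(a):
    return 2.0 * a

def gd_endpoint(a0, eta, T):
    a = a0
    for _ in range(round(T / eta)):            # Euler steps of d/dt a = -grad_f(a)
        a = a - eta * grad_f(a)
    return a

for eta in (0.1, 0.01, 0.001):
    print(eta, gd_endpoint(1.0, eta, T=1.0))   # -> exp(-2) ~ 0.1353 as eta -> 0
```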

SLIDES 54–58

Trajectories of GD for Deep LNNs: Convergence to Global Optimum

Trajectories of Gradient Flow

x ↦ y = W_N ··· W_2 W_1 x

A loss ℓ(·) for the linear model induces an overparameterized objective for the LNN:
φ(W_1, …, W_N) := ℓ(W_N ··· W_2 W_1)

Definition: Weights W_1, …, W_N are balanced if W_{j+1}^⊤ W_{j+1} = W_j W_j^⊤, ∀j
↑ Holds approximately under ≈ 0 init, exactly under residual (Id) init

Claim: Trajectories of GF over LNN preserve balancedness: if W_1, …, W_N are balanced at init, they remain that way throughout GF optimization
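A numerical check of the claim (an illustrative construction; dimensions, data and learning rate are assumptions): run GD with a tiny learning rate as a proxy for GF, starting from the exactly balanced identity init, and watch the balancedness residual stay near zero.

```python
import numpy as np

# Proxy check: GD with a tiny step size approximates GF, so balancedness
# should drift only negligibly along the trajectory.
rng = np.random.default_rng(0)
N, d, m, eta = 3, 4, 32, 1e-3
X = rng.standard_normal((d, m))
Y = rng.standard_normal((d, m))
W = [np.eye(d) for _ in range(N)]            # identity init: exactly balanced

def prod(Ws):                                # product W_b ... W_a of a layer slice
    P = np.eye(d)
    for Wj in Ws:
        P = Wj @ P
    return P

def balancedness(W):                         # max_j ||W_{j+1}^T W_{j+1} - W_j W_j^T||_F
    return max(np.linalg.norm(W[j + 1].T @ W[j + 1] - W[j] @ W[j].T)
               for j in range(N - 1))

for _ in range(2000):
    G = (2.0 / m) * (prod(W) @ X - Y) @ X.T  # grad of l2 loss at W_{1:N}
    # chain rule per factor: dphi/dW_j = (W_N...W_{j+1})^T G (W_{j-1}...W_1)^T
    W = [W[j] - eta * prod(W[j + 1:]).T @ G @ prod(W[:j]).T for j in range(N)]

print("balancedness after training:", balancedness(W))   # ~0 for small eta
```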

SLIDES 59–61

Trajectories of GD for Deep LNNs: Convergence to Global Optimum

Implicit Preconditioning

Question: How does the end-to-end matrix W_{1:N} := W_N ··· W_1 move on GF trajectories?

[Diagram: gradient flow over (W_1, …, W_N) on the LNN corresponds to preconditioned gradient flow over W_{1:N} on the equivalent linear model]

Theorem: If W_1, …, W_N are balanced at init, W_{1:N} follows the end-to-end dynamics:
(d/dt) vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]
where P_{W_{1:N}(t)} is a preconditioner (PSD matrix) that "reinforces" W_{1:N}(t)

Adding (redundant) linear layers to a classic linear model induces a preconditioner promoting movement in directions already taken!
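The slide leaves P_{W_{1:N}} implicit. The ICML'18 source paper derives the dynamics in closed form; the following restatement is reconstructed from that paper and should be treated as a reference sketch rather than the slide's own statement (fractional matrix powers are defined via the SVD of W_{1:N}):

```latex
% End-to-end dynamics with the preconditioner written out
% (reconstruction of the ICML'18 result):
\frac{\mathrm{d}}{\mathrm{d}t}\, W_{1:N}(t)
  = -\sum_{j=1}^{N}
    \left[ W_{1:N}(t)\, W_{1:N}(t)^{\top} \right]^{\frac{N-j}{N}}
    \nabla \ell\!\left( W_{1:N}(t) \right)
    \left[ W_{1:N}(t)^{\top}\, W_{1:N}(t) \right]^{\frac{j-1}{N}}
```

Each summand multiplies the gradient on the left and right by PSD powers of W_{1:N} W_{1:N}^⊤ and W_{1:N}^⊤ W_{1:N}, which is the "reinforcing" preconditioning referred to above.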

SLIDES 62–66

Trajectories of GD for Deep LNNs: Convergence to Global Optimum

Convergence to Global Optimum

(d/dt) vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

P_{W_{1:N}(t)} ≻ 0 when W_{1:N}(t) has full rank
⟹ loss decreases until: (1) ∇ℓ(W_{1:N}(t)) = 0, or (2) W_{1:N}(t) is singular

ℓ(·) is typically convex ⟹ (1) means global min was reached

Corollary: Assume ℓ(·) is convex and the LNN is init such that:
- W_1, …, W_N are balanced
- ℓ(W_{1:N}) < ℓ(W) for any singular W
Then GF converges to global min

SLIDES 67–74

Trajectories of GD for Deep LNNs: Convergence to Global Optimum

From Gradient Flow to Gradient Descent

Our convergence result for GF made two assumptions on init:

1. Weights are balanced:
   ‖W_{j+1}^⊤ W_{j+1} − W_j W_j^⊤‖_F = 0, ∀j
2. Loss is smaller than that of any singular solution:
   ℓ(W_{1:N}) < ℓ(W), ∀W s.t. σ_min(W) = 0

For translating to GD, we define discrete forms of these conditions:

Definition: For δ ≥ 0, weights W_1, …, W_N are δ-balanced if:
‖W_{j+1}^⊤ W_{j+1} − W_j W_j^⊤‖_F ≤ δ, ∀j

Definition: For c > 0, weights W_1, …, W_N have deficiency margin c if:
ℓ(W_{1:N}) ≤ ℓ(W), ∀W s.t. σ_min(W) ≤ c
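Both discrete conditions can be probed at init. A small helper sketch (illustrative code with arbitrary dimensions, not from the paper): δ-balancedness is computed directly, and for the deficiency margin the code reports σ_min(W_{1:N}), the quantity the margin condition constrains.

```python
import numpy as np

# Helper sketch for the two discrete init conditions.
def delta_balancedness(W):
    """Smallest delta for which W_1, ..., W_N are delta-balanced."""
    return max(np.linalg.norm(W[j + 1].T @ W[j + 1] - W[j] @ W[j].T)
               for j in range(len(W) - 1))

def end_to_end(W):
    P = W[0]
    for Wj in W[1:]:
        P = Wj @ P                      # W_{1:N} = W_N ... W_1
    return P

rng = np.random.default_rng(1)
W = [np.eye(4) + 1e-2 * rng.standard_normal((4, 4)) for _ in range(3)]
print("delta:", delta_balancedness(W))                 # ~O(1e-2): near-balanced
print("sigma_min(W_{1:N}):",                           # enters the margin condition
      np.linalg.svd(end_to_end(W), compute_uv=False).min())
```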

SLIDES 75–80

Trajectories of GD for Deep LNNs: Convergence to Global Optimum

Convergence to Global Optimum for Gradient Descent

Suppose ℓ(·) = ℓ2 loss, i.e. ℓ(W) = (1/m) ∑_{i=1}^m ‖W x_i − y_i‖₂²

Theorem: Assume GD over LNN is init s.t. W_1, …, W_N have deficiency margin c > 0 and are δ-balanced w/δ ≤ O(c²). Then, for any learning rate η ≤ O(c⁴):
loss(iteration t) ≤ e^{−Ω(c²ηt)}

Claim: The assumptions on init — deficiency margin and δ-balancedness:
- Are necessary (violating any of them can lead to divergence)
- For output dim 1, hold w/const prob under random "balanced" init

Guarantee of efficient (linear rate) convergence to global min! Most general guarantee to date for GD efficiently training a deep net.
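A toy corroboration of the linear rate (an illustrative construction, not the paper's experiment): with whitened ℓ2 loss ℓ(W) = ‖W − Φ‖²_F, exactly balanced identity init (δ = 0) and a target Φ close to Id (so the init loss is below that of nearby singular matrices), the per-iteration loss ratio settles near a constant below 1, matching e^{−Ω(c²ηt)} behavior.

```python
import numpy as np

# Toy check of the linear convergence rate (whitened l2 loss).
rng = np.random.default_rng(0)
N, d, eta = 3, 4, 5e-3
Phi = np.eye(d) + 0.1 * rng.standard_normal((d, d))   # target close to Id
W = [np.eye(d) for _ in range(N)]                     # balanced init, delta = 0

def prod(Ws):
    P = np.eye(d)
    for Wj in Ws:
        P = Wj @ P
    return P

losses = []
for _ in range(500):
    E = prod(W) - Phi                                 # l(W_{1:N}) = ||E||_F^2, grad = 2E
    losses.append(np.sum(E ** 2))
    W = [W[j] - eta * prod(W[j + 1:]).T @ (2 * E) @ prod(W[:j]).T for j in range(N)]

ratios = [losses[t + 1] / losses[t] for t in (10, 100, 400)]
print("per-step loss ratios:", np.round(ratios, 4))   # roughly constant, < 1
```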

SLIDE 81

Trajectories of GD for Deep LNNs: Acceleration by Depth

Outline

1. Deep Learning Theory: Expressiveness, Optimization and Generalization
2. Analyzing Optimization via Trajectories
3. Trajectories of Gradient Descent for Deep Linear Neural Networks
   - Convergence to Global Optimum
   - Acceleration by Depth
4. Conclusion

SLIDES 82–85

Trajectories of GD for Deep LNNs: Acceleration by Depth

The Effect of Depth

Conventional wisdom: depth boosts expressiveness

[Diagram: features learned across input, early, intermediate and deep layers]

But complicates optimization

We will see: not always true...

SLIDES 86–92

Trajectories of GD for Deep LNNs: Acceleration by Depth

Effect of Depth for Linear Neural Networks

For LNN, we derived the end-to-end dynamics:
(d/dt) vec[W_{1:N}(t)] = −P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

Consider a discrete version:
vec[W_{1:N}(t+1)] ← vec[W_{1:N}(t)] − η · P_{W_{1:N}(t)} · vec[∇ℓ(W_{1:N}(t))]

Claim: For any p > 2, there exist settings where ℓ(·) = ℓp loss,
ℓ(W) = (1/m) ∑_{i=1}^m ‖W x_i − y_i‖_p^p   ← convex
and the discrete end-to-end dynamics reach global min arbitrarily faster than GD

[Diagram: trajectories in the (w_1, w_2) plane: gradient descent vs end-to-end dynamics]

SLIDES 93–99

Trajectories of GD for Deep LNNs: Acceleration by Depth

Experiments

Linear neural networks: regression problem from UCI ML Repository; ℓ4 loss

[Plots: training curves for different depths]

Depth can speed up GD, even w/o any gain in expressiveness, and despite introducing non-convexity!

This speed-up can outperform popular acceleration methods designed for convex problems!
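A sketch in the spirit of this experiment (synthetic data stands in for the UCI dataset; dimensions, scales and the learning rate are assumptions): GD with ℓ4 loss on a plain linear model vs. the same model factored into N matrices, all runs starting from the same end-to-end matrix under an exactly balanced init. Whether and how much depth helps depends on the setting; the talk's claim is that settings exist where the gap is arbitrarily large.

```python
import numpy as np

# GD with l4 loss, plain linear model (N = 1) vs. depth-N factorization,
# all runs started from the same end-to-end matrix 0.5*Id (balanced init).
rng = np.random.default_rng(0)
d, m, T, eta = 3, 64, 3000, 1e-3
X = rng.standard_normal((d, m))
Y = (0.3 * rng.standard_normal((d, d))) @ X          # realizable linear targets

def loss_and_grad(W):                                # l(W) = (1/m) sum_i ||W x_i - y_i||_4^4
    R = W @ X - Y
    return np.mean(np.sum(R ** 4, axis=0)), (4.0 / m) * (R ** 3) @ X.T

def prod(Ws):
    P = np.eye(d)
    for Wj in Ws:
        P = Wj @ P
    return P

def train(N):
    W = [(0.5 ** (1.0 / N)) * np.eye(d) for _ in range(N)]   # end-to-end = 0.5*Id
    for _ in range(T):
        loss, G = loss_and_grad(prod(W))
        W = [W[j] - eta * prod(W[j + 1:]).T @ G @ prod(W[:j]).T for j in range(N)]
    return loss

for N in (1, 2, 3):
    print(f"depth {N}: final l4 loss = {train(N):.3e}")
```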

SLIDES 100–104

Trajectories of GD for Deep LNNs: Acceleration by Depth

Experiments (cont')

Non-linear neural networks: TensorFlow convolutional network tutorial for MNIST:¹
- Arch: (conv → ReLU → max pool) × 2 → dense → ReLU → dense
- Training: stochastic GD w/momentum, dropout

We overparameterized by adding a linear layer after each dense layer

[Plot: batch loss (log scale, 10⁻¹ to 10¹) vs iteration (1000 to 8000); legend: original vs overparameterized]

Adding depth, w/o any gain in expressiveness, and only +15% in params, accelerated the non-linear net by orders of magnitude!

¹ https://github.com/tensorflow/models/tree/master/tutorials/image/mnist
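A sketch of the overparameterization trick in Keras (an illustrative rendering, not the tutorial's actual code; the helper name dense_overparameterized is made up, and the placement of the extra linear layer relative to the nonlinearity is an assumption): each dense layer is replaced by two stacked linear layers, W ↦ W₂W₁, leaving expressiveness unchanged while adding depth.

```python
import tensorflow as tf

# Each dense layer becomes two stacked linear layers (no activation in
# between), i.e. W is replaced by W2 @ W1 before the nonlinearity.
def dense_overparameterized(units, activation=None):
    layers = [tf.keras.layers.Dense(units, use_bias=False),   # extra linear layer
              tf.keras.layers.Dense(units)]
    if activation is not None:
        layers.append(tf.keras.layers.Activation(activation))
    return tf.keras.Sequential(layers)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 5, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    dense_overparameterized(512, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    dense_overparameterized(10),                               # logits
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```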

SLIDE 105

Conclusion

Outline

1. Deep Learning Theory: Expressiveness, Optimization and Generalization
2. Analyzing Optimization via Trajectories
3. Trajectories of Gradient Descent for Deep Linear Neural Networks
   - Convergence to Global Optimum
   - Acceleration by Depth
4. Conclusion

SLIDES 106–112

Conclusion

Recap

Understanding DL calls for addressing three fundamental Qs: Expressiveness, Optimization, Generalization

Optimization:
- Deep (≥ 3 layer) models can't be treated via geometry alone ⟹ specific optimizer trajectories should be taken into account
- We analyzed trajectories of GD over deep linear neural nets:
  - Derived guarantee for convergence to global min at linear rate (most general guarantee to date for GD efficiently training a deep model)
  - Depth induces a preconditioner that can accelerate convergence, w/o any gain in expressiveness, and despite introducing non-convexity

SLIDE 113

Conclusion

Next Step: Analyzing Generalization via Trajectories

[Diagram: hypotheses space H inside all functions Y^X, highlighting Estimation Error (Generalization)]

SLIDE 114

Outline

1. Deep Learning Theory: Expressiveness, Optimization and Generalization
2. Analyzing Optimization via Trajectories
3. Trajectories of Gradient Descent for Deep Linear Neural Networks
   - Convergence to Global Optimum
   - Acceleration by Depth
4. Conclusion

SLIDE 115

Thank You