SLIDE 1

ON THE OPTIMIZATION LANDSCAPE OF NEURAL NETWORKS

JOAN BRUNA, CIMS + CDS, NYU

in collaboration with D. Freeman (UC Berkeley), Luca Venturi & Afonso Bandeira (NYU)

SLIDE 2

MOTIVATION

➤ We consider the standard Empirical Risk Minimization setup:

E(Θ) = E_{(X,Y)∼P} ℓ(Φ(X; Θ), Y) ,  ℓ(z) convex.

P̂ = (1/n) Σ_{i≤n} δ_{(x_i, y_i)} ,  R(Θ): regularization.

Ê(Θ) = E_{(X,Y)∼P̂} ℓ(Φ(X; Θ), Y) + R(Θ) .

SLIDE 3

MOTIVATION

➤ We consider the standard Empirical Risk Minimization setup:
➤ Population loss decomposition (aka "fundamental theorem of ML"):
➤ Long history of techniques to provably control the generalization error via appropriate regularization.
➤ Generalization error and optimization are entangled [Bottou & Bousquet].

E(Θ) = E_{(X,Y)∼P} ℓ(Φ(X; Θ), Y) ,  ℓ(z) convex.

E(Θ*) = Ê(Θ*)  [training error]  +  ( E(Θ*) − Ê(Θ*) )  [generalization gap] .

Ê(Θ) = E_{(X,Y)∼P̂} ℓ(Φ(X; Θ), Y) + R(Θ) ,  R(Θ): regularization ,  P̂ = (1/L) Σ_{l≤L} δ_{(x_l, y_l)} .
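
Below is a minimal sketch (mine, not from the slides) of the empirical objective Ê(Θ) above, assuming a squared loss for ℓ, a ridge penalty for R(Θ), and a generic model `phi`; every name in it is a placeholder.

```python
# Minimal sketch of E_hat(Theta): squared loss as the convex loss l, ridge as R(Theta).
# `phi`, `theta`, `lam` are placeholders, not names from the talk.
import numpy as np

def empirical_risk(phi, theta, xs, ys, lam=1e-3):
    """E_hat(Theta) = (1/L) sum_{l<=L} l(phi(x_l; Theta), y_l) + R(Theta)."""
    data_term = np.mean([np.sum((phi(x, theta) - y) ** 2) for x, y in zip(xs, ys)])
    reg_term = lam * sum(np.sum(w ** 2) for w in theta)  # R(Theta) = lam * ||Theta||^2
    return data_term + reg_term

# Example with a linear model phi(x; Theta) = W x:
W = np.zeros((2, 3))
xs, ys = [np.ones(3)] * 5, [np.ones(2)] * 5
print(empirical_risk(lambda x, th: th[0] @ x, (W,), xs, ys))
```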

SLIDE 4

MOTIVATION

➤ However, when Φ(X; Θ) is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:
➤ Stochastic Optimization
➤ "During training, it adds the sampling noise that corresponds to the empirical-population mismatch" [Léon Bottou].
➤ Make the model convolutional and very large.
➤ See e.g. "Understanding Deep Learning Requires Rethinking Generalization" [Ch. Zhang et al, ICLR'17].

SLIDE 5

MOTIVATION

➤ However, when Φ(X; Θ) is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:
➤ Stochastic Optimization
➤ Make the model convolutional and as large as possible.
➤ We first address how overparametrization affects the energy landscapes E(Θ), Ê(Θ).
➤ Goal 1: Study simple topological properties of these landscapes for half-rectified neural networks.
➤ Goal 2: Estimate simple geometric properties with efficient, scalable algorithms. Diagnostic tool.

SLIDE 6

OUTLINE

➤ Topology of Neural Network Energy Landscapes
➤ Geometry of Neural Network Energy Landscapes

[Figure: loss-surface visualizations (a) without skip connections, (b) with skip connections; from Li et al.'17]

SLIDE 7

PRIOR RELATED WORK

➤ Models from statistical physics have been considered as possible approximations [Dauphin et al.'14, Choromanska et al.'15, Sagun et al.'15].
➤ Tensor factorization models capture some of the non-convexity essence [Anandkumar et al.'15, Cohen et al.'15, Haeffele et al.'15].

SLIDE 8

PRIOR RELATED WORK

➤ Models from statistical physics have been considered as possible approximations [Dauphin et al.'14, Choromanska et al.'15, Sagun et al.'15].
➤ Tensor factorization models capture some of the non-convexity essence [Anandkumar et al.'15, Cohen et al.'15, Haeffele et al.'15].
➤ [Shafran and Shamir '15] studies basins of attraction in neural networks in the overparametrized regime.
➤ [Soudry'16, Song et al.'16] study Empirical Risk Minimization in two-layer ReLU networks, also in the over-parametrized regime.

SLIDE 9

PRIOR RELATED WORK

➤ Models from statistical physics have been considered as possible approximations [Dauphin et al.'14, Choromanska et al.'15, Sagun et al.'15].
➤ Tensor factorization models capture some of the non-convexity essence [Anandkumar et al.'15, Cohen et al.'15, Haeffele et al.'15].
➤ [Shafran and Shamir '15] studies basins of attraction in neural networks in the overparametrized regime.
➤ [Soudry'16, Song et al.'16] study Empirical Risk Minimization in two-layer ReLU networks, also in the over-parametrized regime.
➤ [Tian'17] studies learning dynamics in a Gaussian generative setting.
➤ [Chaudhari et al.'17] studies local smoothing of the energy landscape using the local entropy method from statistical physics.
➤ [Pennington & Bahri'17]: Hessian analysis using Random Matrix Theory.
➤ [Soltanolkotabi, Javanmard & Lee'17]: layer-wise quadratic NNs.

SLIDE 10

NON-CONVEXITY ≠ NOT OPTIMIZABLE

➤ We can perturb any convex function in such a way that it is no longer convex, but such that gradient descent still converges.

➤ E.g. quasi-convex functions.

SLIDE 11

NON-CONVEXITY ≠ NOT OPTIMIZABLE

➤ We can perturb any convex function in such a way that it is no longer convex, but such that gradient descent still converges.
➤ E.g. quasi-convex functions.
➤ In particular, deep models have internal symmetries:

F(θ) = F(g·θ) ,  g ∈ G compact.
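
For instance, permuting the hidden units of a two-layer ReLU network leaves the loss unchanged. The tiny check below is my own illustration (the dimensions, data and choice of squared loss are assumptions), not code from the talk.

```python
# Tiny check of an internal symmetry F(theta) = F(g.theta): permuting the hidden
# units of a two-layer ReLU network leaves the empirical loss unchanged.
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 4, 8, 100
X = rng.standard_normal((n, N))
Y = rng.standard_normal((1, N))
W1, W2 = rng.standard_normal((m, n)), rng.standard_normal((1, m))

loss = lambda W1, W2: np.mean((W2 @ np.maximum(W1 @ X, 0.0) - Y) ** 2)

perm = rng.permutation(m)                       # g = hidden-unit permutation
assert np.isclose(loss(W1, W2), loss(W1[perm], W2[:, perm]))
```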

SLIDE 12

ANALYSIS OF NON-CONVEX LOSS SURFACES

➤ Given a loss E(θ), θ ∈ R^d, we consider its representation in terms of level sets:

E(θ) = ∫_0^∞ 1(θ ∈ Ω_u) du ,  Ω_u = {y ∈ R^d ; E(y) ≤ u} .

SLIDE 13

ANALYSIS OF NON-CONVEX LOSS SURFACES

➤ Given a loss E(θ), θ ∈ R^d, we consider its representation in terms of level sets:

E(θ) = ∫_0^∞ 1(θ ∈ Ω_u) du ,  Ω_u = {y ∈ R^d ; E(y) ≤ u} .

➤ A first notion we address is the topology of the level sets Ω_u.
➤ In particular, we ask how connected they are, i.e. how many connected components N_u there are at each energy level u.

SLIDE 14

ANALYSIS OF NON-CONVEX LOSS SURFACES

➤ Given a loss E(θ), θ ∈ R^d, we consider its representation in terms of level sets:

E(θ) = ∫_0^∞ 1(θ ∈ Ω_u) du ,  Ω_u = {y ∈ R^d ; E(y) ≤ u} .

➤ A first notion we address is the topology of the level sets Ω_u.
➤ In particular, we ask how connected they are, i.e. how many connected components N_u there are at each energy level u.
➤ Related to the presence of poor local minima:

Proposition: If N_u = 1 for all u, then E has no poor local minima
(i.e., no local minimum y* such that E(y*) > min_y E(y)).
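
On a toy two-dimensional loss, N_u can be estimated directly by thresholding E on a grid and counting connected components of the mask. This is a small sketch of mine for intuition; the toy loss, grid range and resolution are assumptions.

```python
# Sketch: estimate N_u for a toy 2-D loss by thresholding E on a grid and counting
# connected components of the resulting boolean mask.
import numpy as np
from scipy import ndimage

def count_components(E, grid, u):
    """Number of connected components of {theta : E(theta) <= u} on a 2-D grid."""
    t1, t2 = np.meshgrid(grid, grid, indexing="ij")
    mask = E(t1, t2) <= u                     # boolean image of Omega_u
    _, n_components = ndimage.label(mask)     # 4-connectivity by default
    return n_components

E = lambda a, b: (a ** 2 - 1.0) ** 2 + b ** 2   # two basins separated by a barrier at a = 0
grid = np.linspace(-2.0, 2.0, 400)
for u in [0.05, 0.5, 2.0]:
    print(u, count_components(E, grid, u))      # N_u = 2 below the barrier, 1 above it
```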

SLIDE 15

LINEAR VS NON-LINEAR DEEP MODELS

➤ Some authors have considered linear "deep" models as a first step towards understanding nonlinear deep models:

E(W_1, …, W_K) = E_{(X,Y)∼P} ‖W_K ⋯ W_1 X − Y‖² ,  X ∈ R^n , Y ∈ R^m , W_k ∈ R^{n_k × n_{k−1}} .

SLIDE 16

LINEAR VS NON-LINEAR DEEP MODELS

➤ Some authors have considered linear "deep" models as a first step towards understanding nonlinear deep models:

E(W_1, …, W_K) = E_{(X,Y)∼P} ‖W_K ⋯ W_1 X − Y‖² ,  X ∈ R^n , Y ∈ R^m , W_k ∈ R^{n_k × n_{k−1}} .

Theorem [Kawaguchi'16]: If Σ = E(XXᵀ) and E(XYᵀ) are full-rank and Σ has distinct eigenvalues, then E(Θ) has no poor local minima.

  • Proof by studying critical points.
  • Later generalized in [Hardt & Ma'16, Lu & Kawaguchi'17].

SLIDE 17

LINEAR VS NON-LINEAR DEEP MODELS

E(W_1, …, W_K) = E_{(X,Y)∼P} ‖W_K ⋯ W_1 X − Y‖² .

Proposition [BF'16]:
  • 1. If n_k > min(n, m) for 0 < k < K, then N_u = 1 for all u.
  • 2. (2-layer case, ridge regression) E(W_1, W_2) = E_{(X,Y)∼P} ‖W_2 W_1 X − Y‖² + λ(‖W_1‖² + ‖W_2‖²) satisfies N_u = 1 ∀ u if n_1 > min(n, m).

➤ We pay an extra redundancy price to get simple topology.

SLIDE 18

LINEAR VS NON-LINEAR DEEP MODELS

E(W_1, …, W_K) = E_{(X,Y)∼P} ‖W_K ⋯ W_1 X − Y‖² .

Proposition [BF'16]:
  • 1. If n_k > min(n, m) for 0 < k < K, then N_u = 1 for all u.
  • 2. (2-layer case, ridge regression) E(W_1, W_2) = E_{(X,Y)∼P} ‖W_2 W_1 X − Y‖² + λ(‖W_1‖² + ‖W_2‖²) satisfies N_u = 1 ∀ u if n_1 > min(n, m).

➤ We pay an extra redundancy price to get simple topology.
➤ This simple topology is an "artifact" of the linearity of the network:

Proposition [BF'16]: For any architecture (choice of internal dimensions), there exists a distribution P(X, Y) such that N_u > 1 in the ReLU case, ρ(z) = max(0, z).

SLIDE 19

PROOF SKETCH

➤ Goal: Given Θ^A = (W_1^A, …, W_K^A) and Θ^B = (W_1^B, …, W_K^B), construct a path γ(t) that connects Θ^A with Θ^B such that

E(γ(t)) ≤ max(E(Θ^A), E(Θ^B)) .
SLIDE 20

PROOF SKETCH

➤ Goal: Given Θ^A = (W_1^A, …, W_K^A) and Θ^B = (W_1^B, …, W_K^B), construct a path γ(t) that connects Θ^A with Θ^B such that E(γ(t)) ≤ max(E(Θ^A), E(Θ^B)).

➤ Main idea:
  • 1. Induction on K.
  • 2. Lift the parameter space to W̃ = W_1 W_2: the problem is convex ⇒ there exists a (linear) path γ̃(t) that connects Θ^A and Θ^B.
  • 3. Write the path in terms of the original coordinates by factorizing γ̃(t).

➤ Simple fact: If M_0, M_1 ∈ R^{n×n′} with n′ > n, then there exists a path γ(t), t ∈ [0, 1], with γ(0) = M_0, γ(1) = M_1 and M_0, M_1 ∈ span(γ(t)) for all t ∈ (0, 1).
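
The sketch below is my own numerical illustration of step 2 for a two-layer linear network: along the straight line in the lifted variable P = W_2 W_1, the energy never exceeds the endpoint energies, and each point can be refactorized by SVD. The data, dimensions and seed are assumptions; connecting the refactorizations back to the original factors (step 3 and the simple fact above) is omitted.

```python
# Lifted linear path for a two-layer linear network: the loss is convex in P = W2 W1,
# so the straight line between lifted endpoints stays below the endpoint energies.
import numpy as np

rng = np.random.default_rng(0)
n, m, n1, N = 5, 3, 6, 200          # input dim, output dim, hidden width n1 > min(n, m)
X = rng.standard_normal((n, N))
Y = rng.standard_normal((m, N))

def energy(W1, W2):
    return np.mean(np.sum((W2 @ W1 @ X - Y) ** 2, axis=0))

def refactor(P):
    """Split P (m x n) back into W2 (m x n1), W1 (n1 x n) with W2 @ W1 = P."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    r = len(s)
    W2, W1 = np.zeros((m, n1)), np.zeros((n1, n))
    W2[:, :r] = U * np.sqrt(s)
    W1[:r, :] = np.sqrt(s)[:, None] * Vt
    return W1, W2

W1A, W2A = rng.standard_normal((n1, n)), rng.standard_normal((m, n1))
W1B, W2B = rng.standard_normal((n1, n)), rng.standard_normal((m, n1))
PA, PB = W2A @ W1A, W2B @ W1B
top = max(energy(W1A, W2A), energy(W1B, W2B))

for t in np.linspace(0.0, 1.0, 11):
    W1t, W2t = refactor((1 - t) * PA + t * PB)   # lifted linear path, refactored
    assert energy(W1t, W2t) <= top + 1e-9        # never exceeds the endpoint energies
```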

SLIDE 21

MODEL SYMMETRIES

➤ How much extra redundancy are we paying to achieve N_u = 1 instead of simply no poor local minima?  [with L. Venturi, A. Bandeira, '17]

SLIDE 22

MODEL SYMMETRIES

➤ How much extra redundancy are we paying to achieve N_u = 1 instead of simply no poor local minima?  [with L. Venturi, A. Bandeira, '17]
➤ In the multilinear case, we don't need n_k > min(n, m).

(W_1, W_2, …, W_K) ∼ (W̃_1, …, W̃_K)  ⇔  W̃_k = U_k W_k U_{k−1}^{−1} ,  U_k ∈ GL(R^{n_k × n_k}) .

SLIDE 23

MODEL SYMMETRIES

➤ How much extra redundancy are we paying to achieve N_u = 1 instead of simply no poor local minima?  [with L. Venturi, A. Bandeira, '17]
➤ In the multilinear case, we don't need n_k > min(n, m).
➤ We do the same analysis in the quotient space defined by the equivalence relationship

(W_1, W_2, …, W_K) ∼ (W̃_1, …, W̃_K)  ⇔  W̃_k = U_k W_k U_{k−1}^{−1} ,  U_k ∈ GL(R^{n_k × n_k}) .

➤ Construct paths on the Grassmannian manifold of linear subspaces.
➤ Generalizes the best known results for the multilinear case (no assumptions on the covariance).

Theorem [LBB'17]: The multilinear regression E_{(X,Y)∼P} ‖W_K ⋯ W_1 X − Y‖² has no poor local minima.

SLIDE 24

BETWEEN LINEAR AND RELU: POLYNOMIAL NETS

➤ Quadratic nonlinearities ρ(z) = z² are a simple extension of the linear case, by lifting or "kernelizing":

ρ(Wx) = A_W X ,  X = xxᵀ ,  A_W = (W_k W_kᵀ)_{k≤M} .

SLIDE 25

BETWEEN LINEAR AND RELU: POLYNOMIAL NETS

➤ Quadratic nonlinearities ρ(z) = z² are a simple extension of the linear case, by lifting or "kernelizing":

ρ(Wx) = A_W X ,  X = xxᵀ ,  A_W = (W_k W_kᵀ)_{k≤M} .

➤ Level sets are connected with sufficient overparametrisation:

Proposition: If M_k ≥ 3 N_k² ∀ k ≤ K, then the landscape of the K-layer quadratic network is simple: N_u = 1 ∀ u.

SLIDE 26

BETWEEN LINEAR AND RELU: POLYNOMIAL NETS

➤ Quadratic nonlinearities ρ(z) = z² are a simple extension of the linear case, by lifting or "kernelizing":

ρ(Wx) = A_W X ,  X = xxᵀ ,  A_W = (W_k W_kᵀ)_{k≤M} .

➤ Level sets are connected with sufficient overparametrisation:

Proposition: If M_k ≥ 3 N_k² ∀ k ≤ K, then the landscape of the K-layer quadratic network is simple: N_u = 1 ∀ u.

➤ No poor local minima with much better bounds in the scalar-output two-layer case:

Theorem [LBB'17]: The two-layer quadratic network optimization L(U, W) = E_{(X,Y)∼P} ‖U(WX)² − Y‖² has no poor local minima if M ≥ 2N.

SLIDE 27

ASYMPTOTIC CONNECTEDNESS OF RELU

➤ Good behavior is recovered with nonlinear ReLU networks, provided they are sufficiently overparametrized.
➤ Setup: two-layer ReLU network:

Φ(X; Θ) = W_2 ρ(W_1 X) ,  ρ(z) = max(0, z) ,  W_1 ∈ R^{m×n} , W_2 ∈ R^m .

SLIDE 28

ASYMPTOTIC CONNECTEDNESS OF RELU

➤ Good behavior is recovered with nonlinear ReLU networks, provided they are sufficiently overparametrized.
➤ Setup: two-layer ReLU network:

Φ(X; Θ) = W_2 ρ(W_1 X) ,  ρ(z) = max(0, z) ,  W_1 ∈ R^{m×n} , W_2 ∈ R^m .

➤ Overparametrisation "wipes out" local minima (and group symmetries).
➤ The bound is cursed by dimensionality, i.e. exponential in n.
➤ The result is based on a local linearization of the ReLU kernel (hence the exponential price).

Theorem [BF'16]: For any Θ^A, Θ^B ∈ R^{m×n} × R^m with E(Θ^{A,B}) ≤ λ, there exists a path γ(t) from Θ^A to Θ^B such that ∀ t, E(γ(t)) ≤ max(λ, ε) with ε ∼ m^{−1/n}.

SLIDE 29

KERNELS ARE BACK?

➤ The underlying technique we described consists in "convexifying" the problem, by mapping neural parameters Θ to canonical parameters β = A(Θ):

Φ(X; Θ) = W_K ρ(W_{K−1} ⋯ ρ(W_1 X)) ,  Θ = (W_1, …, W_K) ,  Φ(X; Θ) = ⟨Ψ(X), A(Θ)⟩ .

[Diagram: parameters Θ^A, Θ^B in Θ-space mapped to canonical parameters A(Θ^A), A(Θ^B)]

SLIDE 30

KERNELS ARE BACK?

➤ The underlying technique we described consists in "convexifying" the problem, by mapping neural parameters Θ to canonical parameters β = A(Θ):

Φ(X; Θ) = W_K ρ(W_{K−1} ⋯ ρ(W_1 X)) ,  Θ = (W_1, …, W_K) ,  Φ(X; Θ) = ⟨Ψ(X), A(Θ)⟩ .

➤ This includes Empirical Risk Minimization (since the RKHS is only queried on a finite number of datapoints).
➤ See [Bietti & Mairal'17, Zhang et al.'17, Bach'17] for related work.

Corollary [BBV'17]: If dim{A(w), w ∈ R^n} = q < ∞ and M ≥ 2q, then E(W, U) = E|Uρ(WX) − Y|², W ∈ R^{M×N}, has no poor local minima.

SLIDE 31

PARAMETRIC VS MANIFOLD OPTIMIZATION

➤ This suggests thinking about the problem in the functional space generated by the model:

F_Φ = {ϕ : R^n → R^m ; ϕ(x) = Φ(x; Θ) for some Θ} ,

g* : x ↦ E(Y | x) ,  min_{ϕ ∈ F_Φ} ‖ϕ − g*‖_P ,  ⟨f, g⟩_P := E{f(X) g(X)} .
SLIDE 32

PARAMETRIC VS MANIFOLD OPTIMIZATION

➤ This suggests thinking about the problem in the functional space generated by the model:

F_Φ = {ϕ : R^n → R^m ; ϕ(x) = Φ(x; Θ) for some Θ} ,

g* : x ↦ E(Y | x) ,  min_{ϕ ∈ F_Φ} ‖ϕ − g*‖_P ,  ⟨f, g⟩_P := E{f(X) g(X)} .

➤ Sufficient conditions for success so far: F_Φ convex and sufficiently large, so that we can move freely within it.
➤ What happens when the model is not overparametrised?
SLIDE 33

FROM SIMPLE LANDSCAPES TO ENERGY BARRIER

➤ The energy landscapes of several prototypical models in statistical physics exhibit a so-called energy barrier, e.g. spherical spin glasses:

H_{N,p}(σ) = N^{−(p−1)/2} Σ_{i_1,…,i_p=1}^{N} J_{i_1,…,i_p} σ_{i_1} ⋯ σ_{i_p} ,  σ ∈ S^{N−1}(√N) ,  J_{i_1,…,i_p} ∼ N(0, 1) ,

lim_{N→∞} (1/N) log E Crt_{N,k}(u) = Θ_{k,p}(u) .

[Figure: the functions Θ_{k,p} for p = 3 and k = 0 (solid), k = 1 (dashed), k = 2 (dash-dotted), k = 10, k = 100 (both dotted); all agree for u ≥ −E_∞. From Auffinger, Ben Arous, Cerny '11]
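
For readers who want to play with this object, the snippet below samples the spherical p-spin Hamiltonian above for one random σ; it is my own illustration, and the values of N and p are arbitrary choices.

```python
# Evaluate H_{N,p}(sigma) for one random coupling tensor J and one sigma on the
# sphere of radius sqrt(N).
import numpy as np

rng = np.random.default_rng(0)
N, p = 30, 3
J = rng.standard_normal((N,) * p)              # i.i.d. N(0, 1) couplings
sigma = rng.standard_normal(N)
sigma *= np.sqrt(N) / np.linalg.norm(sigma)    # sigma in S^{N-1}(sqrt(N))

H = J
for _ in range(p):                             # contract J with sigma along every axis
    H = np.tensordot(H, sigma, axes=([0], [0]))
H *= N ** (-(p - 1) / 2)
print(float(H))
```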

SLIDE 34

FROM SIMPLE LANDSCAPES TO ENERGY BARRIER?

➤ Does a similar macroscopic picture arise in our setting?
➤ Given ρ(z) homogeneous, assume ρ̃(⟨w, X⟩) = ⟨A_w, ψ(X)⟩ , with dim(ψ(X)) = f(N) .
➤ Define

β(M, N) = inf_{S ; dim(S) = f^{−1}(M)}  inf_{U ∈ R^{m×M}, W ∈ R^{M×f^{−1}(M)}}  sup_{E‖Z‖ ≤ N − f^{−1}(M), P_S Z = 0}  E‖U ρ(W P_S X + Z) − Y‖² .

➤ Best loss obtained by first projecting the data onto the best possible subspace of dimension f^{−1}(M) and adding bounded noise in the complement.
➤ β(M, N) decreases with M, and β(f(N), N) = min_{U,W} E(U, W) .

SLIDE 35

FROM SIMPLE LANDSCAPES TO ENERGY BARRIER

➤ Does a similar macroscopic picture arise in our setting?
➤ Given ρ(z) homogeneous, assume ρ̃(⟨w, X⟩) = ⟨A_w, ψ(X)⟩ , with dim(ψ(X)) = f(N) .
➤ Define

β(M, N) = inf_{S ; dim(S) = f^{−1}(M)}  inf_{U ∈ R^{m×M}, W ∈ R^{M×f^{−1}(M)}}  sup_{E‖Z‖ ≤ N − f^{−1}(M), P_S Z = 0}  E‖U ρ(W P_S X + Z) − Y‖² .

➤ Best loss obtained by first projecting the data onto the best possible subspace of dimension f^{−1}(M) and adding bounded noise in the complement.
➤ β(M, N) decreases with M, and β(f(N), N) = min_{U,W} E(U, W) .

Conjecture [LBB'18]: The loss L(U, W) = E‖U ρ(WX) − Y‖² has no poor local minima above the energy barrier β(M, N).
SLIDE 36

FROM TOPOLOGY TO GEOMETRY

➤ The next question we are interested in is conditioning for descent.
➤ Even if level sets are connected, how easy is it to navigate through them?
➤ How "large" and regular are they?

[Figure: (left) easy to move from one energy level to a lower one; (right) hard to move from one energy level to a lower one]

SLIDE 37

FROM TOPOLOGY TO GEOMETRY

➤ The next question we are interested in is conditioning for descent.
➤ Even if level sets are connected, how easy is it to navigate through them?
➤ We estimate level-set geodesics and measure their length.

[Figure: paths between θ_A and θ_B within a level set; (left) easy to move to a lower energy level, (right) hard]

SLIDE 38

FINDING CONNECTED COMPONENTS

➤ Suppose θ_1, θ_2 are such that E(θ_1) = E(θ_2) = u_0.
➤ They are in the same connected component of Ω_{u_0} iff there is a path γ(t), γ(0) = θ_1, γ(1) = θ_2, such that ∀ t ∈ (0, 1), E(γ(t)) ≤ u_0.
➤ Moreover, we penalize the length of the path:

∀ t ∈ (0, 1) , E(γ(t)) ≤ u_0  and  ∫ ‖γ̇(t)‖ dt ≤ M .

SLIDE 39

FINDING CONNECTED COMPONENTS

➤ Suppose θ_1, θ_2 are such that E(θ_1) = E(θ_2) = u_0.
➤ They are in the same connected component of Ω_{u_0} iff there is a path γ(t), γ(0) = θ_1, γ(1) = θ_2, such that ∀ t ∈ (0, 1), E(γ(t)) ≤ u_0.
➤ Moreover, we penalize the length of the path:

∀ t ∈ (0, 1) , E(γ(t)) ≤ u_0  and  ∫ ‖γ̇(t)‖ dt ≤ M .

➤ Dynamic programming approach: [diagram with endpoints θ_1, θ_2]

SLIDE 40

FINDING CONNECTED COMPONENTS

➤ Suppose θ_1, θ_2 are such that E(θ_1) = E(θ_2) = u_0.
➤ They are in the same connected component of Ω_{u_0} iff there is a path γ(t), γ(0) = θ_1, γ(1) = θ_2, such that ∀ t ∈ (0, 1), E(γ(t)) ≤ u_0.
➤ Moreover, we penalize the length of the path:

∀ t ∈ (0, 1) , E(γ(t)) ≤ u_0  and  ∫ ‖γ̇(t)‖ dt ≤ M .

➤ Dynamic programming approach: take the midpoint and project it back into the level set,

θ_m = (θ_1 + θ_2)/2 ,  θ_3 = argmin_{θ ∈ H ; E(θ) ≤ u_0} ‖θ − θ_m‖ ,

then recurse on the pairs (θ_1, θ_3) and (θ_3, θ_2).

SLIDE 41

FINDING CONNECTED COMPONENTS

➤ Suppose θ_1, θ_2 are such that E(θ_1) = E(θ_2) = u_0.
➤ They are in the same connected component of Ω_{u_0} iff there is a path γ(t), γ(0) = θ_1, γ(1) = θ_2, such that ∀ t ∈ (0, 1), E(γ(t)) ≤ u_0.
➤ Moreover, we penalize the length of the path:

∀ t ∈ (0, 1) , E(γ(t)) ≤ u_0  and  ∫ ‖γ̇(t)‖ dt ≤ M .

➤ Dynamic programming approach: take the midpoint and project it back into the level set,

θ_m = (θ_1 + θ_2)/2 ,  θ_3 = argmin_{θ ∈ H ; E(θ) ≤ u_0} ‖θ − θ_m‖ ,

then recurse on the pairs (θ_1, θ_3) and (θ_3, θ_2).

SLIDE 42

FINDING CONNECTED COMPONENTS

➤ Suppose θ_1, θ_2 are such that E(θ_1) = E(θ_2) = u_0.
➤ They are in the same connected component of Ω_{u_0} iff there is a path γ(t), γ(0) = θ_1, γ(1) = θ_2, such that ∀ t ∈ (0, 1), E(γ(t)) ≤ u_0.
➤ Moreover, we penalize the length of the path:

∀ t ∈ (0, 1) , E(γ(t)) ≤ u_0  and  ∫ ‖γ̇(t)‖ dt ≤ M .

➤ Dynamic programming approach: take the midpoint and project it back into the level set,

θ_m = (θ_1 + θ_2)/2 ,  θ_3 = argmin_{θ ∈ H ; E(θ) ≤ u_0} ‖θ − θ_m‖ ,

then recurse on the pairs (θ_1, θ_3) and (θ_3, θ_2).
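
A rough sketch of this bisection idea is below; it is my own paraphrase, not the authors' code, and it replaces the constrained projection argmin ‖θ − θ_m‖ s.t. E(θ) ≤ u_0 by plain gradient descent on E. The toy loss, step size and recursion depth are assumptions.

```python
# Recursive bisection: connect theta_1 and theta_2 inside {E <= u0} by taking the
# midpoint and pushing it back below the level u0.
import numpy as np

def project_to_sublevel(E, gradE, theta, u0, lr=0.1, steps=200):
    """Move theta to a nearby point with E(theta) <= u0 by descending E."""
    theta = theta.copy()
    for _ in range(steps):
        if E(theta) <= u0:
            break
        theta -= lr * gradE(theta)
    return theta

def connect(E, gradE, t1, t2, u0, depth=6):
    """Polygonal path from t1 to t2 that tries to stay inside {E <= u0}."""
    if depth == 0:
        return [t1, t2]
    tm = project_to_sublevel(E, gradE, 0.5 * (t1 + t2), u0)   # theta_3 from theta_m
    left = connect(E, gradE, t1, tm, u0, depth - 1)
    right = connect(E, gradE, tm, t2, u0, depth - 1)
    return left + right[1:]

# Toy example: two zero-energy points of a non-convex loss, connected below u0.
E = lambda th: (th[0] ** 2 - 1.0) ** 2 + th[1] ** 2
gradE = lambda th: np.array([4.0 * th[0] * (th[0] ** 2 - 1.0), 2.0 * th[1]])
path = connect(E, gradE, np.array([-1.0, 0.0]), np.array([1.0, 0.0]), u0=1.1)
print(max(E(p) for p in path))    # stays at or below u0 in this toy case
```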

SLIDE 43

NUMERICAL EXPERIMENTS

➤ Compute the length of the geodesic in Ω_u obtained by the algorithm and normalize it by the Euclidean distance: a measure of the curviness of the level sets.

[Figure: normalized geodesic length vs. energy level; (left) cubic polynomial, (right) CNN/MNIST]
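
The normalization itself is simple; a minimal helper of my own is sketched below (the `path` argument is a polygonal geodesic such as the one produced by the `connect` sketch above).

```python
# Length of the polygonal geodesic divided by the Euclidean distance between its
# endpoints: values near 1 mean the level set is easy to traverse, larger values
# mean a more curved path.
import numpy as np

def normalized_geodesic_length(path):
    path = np.asarray(path, dtype=float)
    arc = np.sum(np.linalg.norm(np.diff(path, axis=0), axis=1))
    chord = np.linalg.norm(path[-1] - path[0])
    return arc / chord

# e.g. with the polygonal `path` returned by the `connect` sketch above:
# print(normalized_geodesic_length(path))
```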

SLIDE 44

NUMERICAL EXPERIMENTS

➤ Compute the length of the geodesic in Ω_u obtained by the algorithm and normalize it by the Euclidean distance: a measure of the curviness of the level sets.

[Figure: normalized geodesic length vs. energy level; (left) CNN/CIFAR-10, (right) LSTM/Penn]

SLIDE 45

ANALYSIS AND PERSPECTIVES

➤ The number of components does not increase: no poor local minima detected so far when using typical datasets and typical architectures (at the energy levels explored by SGD).
➤ Level sets become more irregular as energy decreases.
➤ Presence of an "energy barrier"? Extend to truncated Taylor expansions?
➤ Kernels are back? CNN RKHS.
➤ Open: "sweet spot" between overparametrisation and overfitting?
➤ Open: role of Stochastic Optimization in this story?

SLIDE 46

THANKS!