ON THE OPTIMIZATION LANDSCAPE OF NEURAL NETWORKS
JOAN BRUNA, CIMS + CDS, NYU
in collaboration with D. Freeman (UC Berkeley), Luca Venturi & Afonso Bandeira (NYU)
MOTIVATION
➤ We consider the standard Empirical Risk Minimization setup:
E(Θ) = E_{(X,Y)∼P} ℓ(Φ(X;Θ), Y) ,  with ℓ(z) convex,
Ê(Θ) = E_{(X,Y)∼P̂} ℓ(Φ(X;Θ), Y) + R(Θ) ,  P̂ = (1/L) Σ_{l≤L} δ_{(x_l,y_l)} ,  R(Θ): regularization.
➤ Population loss decomposition (aka "fundamental theorem of ML"):
E(Θ*) = Ê(Θ*) + ( E(Θ*) − Ê(Θ*) ) ,
       (training error)   (generalization gap)
➤ Long history of techniques to provably control the generalization error via appropriate regularization.
➤ Generalization error and optimization are entangled [Bottou & Bousquet].
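For concreteness, here is a minimal numerical sketch of this setup (our own illustration: a linear model Φ, squared loss ℓ, and ridge regularizer R are assumed for simplicity; none of these choices is prescribed by the slides):

```python
import numpy as np

def empirical_risk(theta, X, Y, loss, reg, lam=1e-3):
    """Regularized empirical risk: average loss over the sample plus lam * R(theta)."""
    preds = X @ theta                       # Phi(X; theta): linear model, for illustration
    return np.mean(loss(preds, Y)) + lam * reg(theta)

sq_loss = lambda z, y: (z - y) ** 2         # convex loss ell(z)
ridge   = lambda th: np.sum(th ** 2)        # regularizer R(theta)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # empirical measure: 200 samples (x_l, y_l)
theta_true = rng.normal(size=5)
Y = X @ theta_true + 0.1 * rng.normal(size=200)
print(empirical_risk(np.zeros(5), X, Y, sq_loss, ridge))
```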
MOTIVATION
➤ However, when Φ(X;Θ) is a large, deep network, the current best mechanism to control the generalization gap has two key ingredients:
➤ Stochastic optimization: "During training, it adds the sampling noise that corresponds to the empirical-population mismatch" [Léon Bottou].
➤ Make the model convolutional and as large as possible; see e.g. "Understanding Deep Learning Requires Rethinking Generalization" [C. Zhang et al., ICLR'17].
➤ We first address how overparametrization affects the energy landscapes E(Θ), Ê(Θ).
➤ Goal 1: Study simple topological properties of these landscapes for half-rectified neural networks.
➤ Goal 2: Estimate simple geometric properties with efficient, scalable algorithms. A diagnostic tool.
OUTLINE
➤ Topology of Neural Network Energy Landscapes
➤ Geometry of Neural Network Energy Landscapes
[Figure: loss-surface visualizations (a) without skip connections, (b) with skip connections; from Li et al.'17]
PRIOR RELATED WORK
➤ Models from statistical physics have been considered as possible approximations [Dauphin et al.'14, Choromanska et al.'15, Sagun et al.'15].
➤ Tensor factorization models capture some of the essence of the non-convexity [Anandkumar et al.'15, Cohen et al.'15, Haeffele et al.'15].
➤ [Shafran and Shamir'15] studies basins of attraction in neural networks in the overparametrized regime.
➤ [Soudry'16, Song et al.'16] study Empirical Risk Minimization in two-layer ReLU networks, also in the overparametrized regime.
➤ [Tian'17] studies learning dynamics in a Gaussian generative setting.
➤ [Chaudhari et al.'17] studies local smoothing of the energy landscape using the local-entropy method from statistical physics.
➤ [Pennington & Bahri'17]: Hessian analysis using Random Matrix Theory.
➤ [Soltanolkotabi, Javanmard & Lee'17]: layer-wise quadratic NNs.
NON-CONVEXITY ≠ NOT OPTIMIZABLE
➤ We can perturb any convex function in such a way that it is no longer convex, but gradient descent still converges.
➤ E.g. quasi-convex functions.
➤ In particular, deep models have internal symmetries:
F(θ) = F(g.θ) ,  g ∈ G compact.
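A toy illustration of the first point (our own example, not from the slides): f(x) = log(1 + x²) is non-convex yet quasi-convex, and gradient descent still finds its global minimum:

```python
import numpy as np

# f(x) = log(1 + x^2) is concave for |x| > 1, hence non-convex, but quasi-convex:
# every sublevel set is an interval, and gradient descent converges anyway.
f      = lambda x: np.log(1.0 + x ** 2)
grad_f = lambda x: 2.0 * x / (1.0 + x ** 2)

x = 5.0                        # start deep in the concave region
for _ in range(200):
    x -= 0.5 * grad_f(x)       # plain gradient descent with a fixed step
print(x, f(x))                 # -> x ~ 0.0, f(x) ~ 0.0: the global minimum
```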
ANALYSIS OF NON-CONVEX LOSS SURFACES
➤ Given a loss E(θ), θ ∈ R^d, we consider its representation in terms of level sets:
E(θ) = ∫_0^∞ 1(θ ∈ Ω_u) du ,  Ω_u = {y ∈ R^d ; E(y) ≤ u} .
➤ A first notion we address is the topology of the level sets Ω_u.
➤ In particular, we ask how connected they are, i.e. how many connected components N_u are there at each energy level u?
➤ Related to the presence of poor local minima:
Proposition: If N_u = 1 for all u, then E has no poor local minima
(i.e. no local minima y* s.t. E(y*) > min_y E(y)).
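In low dimensions, N_u can be inspected directly on a grid; a sketch of our own, using scipy's connected-component labelling as a stand-in for the true topology:

```python
import numpy as np
from scipy.ndimage import label

def count_components(E, u, xs):
    """Estimate N_u: connected components of the sublevel set {E <= u},
    computed on a 2-D grid with 4-connectivity."""
    X, Y = np.meshgrid(xs, xs)
    _, n = label(E(X, Y) <= u)
    return n

# Toy double-well landscape: two minima separated by a barrier of height 1.
E = lambda x, y: (x ** 2 - 1) ** 2 + y ** 2
xs = np.linspace(-2.0, 2.0, 400)
for u in [0.1, 0.5, 1.5]:
    print(u, count_components(E, u, xs))   # N_u = 2 below the barrier, 1 above it
```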
LINEAR VS NON-LINEAR DEEP MODELS
➤ Some authors have considered linear "deep" models as a first step towards understanding nonlinear deep models:
E(W_1, …, W_K) = E_{(X,Y)∼P} ‖W_K ⋯ W_1 X − Y‖² ,  X ∈ R^n , Y ∈ R^m , W_k ∈ R^{n_k×n_{k−1}} .
Theorem [Kawaguchi'16]: If Σ = E(XX^T) and E(XY^T) are full-rank and Σ has distinct eigenvalues, then E(Θ) has no poor local minima.
Proposition [BF'16]: E(W_1, W_2) = E_{(X,Y)∼P} ‖W_2 W_1 X − Y‖² + λ(‖W_1‖² + ‖W_2‖²) satisfies N_u = 1 ∀u if n_1 > min(n, m).
➤ We pay an extra redundancy price to get simple topology.
➤ This simple topology is an "artifact" of the linearity of the network:
Proposition [BF'16]: For any architecture (choice of internal dimensions), there exists a distribution P(X,Y) such that N_u > 1 in the ReLU case ρ(z) = max(0, z).
PROOF SKETCH
➤ Goal: Given Θ^A = (W_1^A, …, W_K^A) and Θ^B = (W_1^B, …, W_K^B), construct a path γ(t) that connects Θ^A with Θ^B s.t. E(γ(t)) ≤ max(E(Θ^A), E(Θ^B)).
➤ Main idea: in terms of the end-to-end product W = W_2 W_1, the problem is convex ⇒ there exists a (linear) path γ̃(t) that connects Θ^A and Θ^B.
➤ Simple fact: If M_0, M_1 ∈ R^{n×n'} with n' > n, then there exists a path γ(t), t ∈ [0, 1], with γ(0) = M_0, γ(1) = M_1 and M_0, M_1 ∈ span(γ(t)) for all t ∈ (0, 1).
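A numerical illustration of the main idea (a sketch under our own simplifications, not the actual proof construction): along the linear path in the product variable W, convexity bounds the energy by the endpoint energies, and any W lifts back to two-layer parameters when the hidden width allows it:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, n1 = 4, 3, 8                            # hidden width n1 > min(n, m)
X, Y = rng.normal(size=(n, 500)), rng.normal(size=(m, 500))
E = lambda W: np.mean(np.sum((W @ X - Y) ** 2, axis=0))   # loss in the product variable

def factor(W, n1):
    """One of many factorizations W = W2 @ W1 with inner dimension n1 (via SVD)."""
    U, s, Vt = np.linalg.svd(W)
    r = len(s)
    W1 = np.zeros((n1, W.shape[1])); W1[:r] = np.sqrt(s)[:, None] * Vt[:r]
    W2 = np.zeros((W.shape[0], n1)); W2[:, :r] = U[:, :r] * np.sqrt(s)
    return W2, W1

WA, WB = rng.normal(size=(m, n)), rng.normal(size=(m, n))
energies = [E((1 - t) * WA + t * WB) for t in np.linspace(0, 1, 11)]
print(max(energies) <= max(E(WA), E(WB)) + 1e-12)   # True: convexity in W
W2, W1 = factor(WA, n1)
print(np.allclose(W2 @ W1, WA))                     # lift back to (W1, W2)
```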
MODEL SYMMETRIES   [with L. Venturi, A. Bandeira, '17]
➤ How much extra redundancy are we paying to achieve N_u = 1 instead of simply no poor local minima?
➤ In the multilinear case, we don't need n_k > min(n, m).
➤ We do the same analysis in the quotient space defined by the equivalence relationship
(W_1, W_2, …, W_K) ∼ (W̃_1, …, W̃_K) ⇔ W̃_k = U_k W_k U_{k−1}^{−1} ,  U_k ∈ GL(R^{n_k×n_k}) .
➤ Construct paths on the Grassmannian manifold of linear subspaces.
➤ Generalizes the best known results for the multilinear case (no assumptions on the data distribution):
Theorem [LBB'17]: The multilinear regression E_{(X,Y)∼P} ‖W_K ⋯ W_1 X − Y‖² has no poor local minima.
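The invariance behind this equivalence relationship is easy to verify numerically (our sketch; the GL elements are drawn as random well-conditioned matrices, with U_0 and U_K fixed to the identity so the end-to-end map is preserved):

```python
import numpy as np

rng = np.random.default_rng(2)
dims = [4, 6, 5, 3]                                   # n_0 (input), n_1, n_2, n_3 (output)
Ws = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(3)]

# Gauge transforms U_k in GL(n_k); identity at both ends.
Us = ([np.eye(dims[0])]
      + [rng.normal(size=(d, d)) + 3 * np.eye(d) for d in dims[1:-1]]
      + [np.eye(dims[-1])])
Wt = [Us[k + 1] @ Ws[k] @ np.linalg.inv(Us[k]) for k in range(3)]

prod = lambda Ms: np.linalg.multi_dot(Ms[::-1])       # W_K ... W_1
print(np.allclose(prod(Ws), prod(Wt)))                # True: same function, different parameters
```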
BETWEEN LINEAR AND RELU: POLYNOMIAL NETS
➤ Quadratic nonlinearities ρ(z) = z² are a simple extension of the linear case, by lifting or "kernelizing":
ρ(Wx) = A_W X ,  X = xx^T ,  A_W = (W_k W_k^T)_{k≤M} .
➤ Level sets are connected with sufficient overparametrisation:
Proposition: If M_k ≥ 3N^{2k} ∀ k ≤ K, then the landscape of the K-layer quadratic network is simple: N_u = 1 ∀u.
➤ No poor local minima, with much better bounds, in the scalar case:
Theorem [LBB'17]: The two-layer quadratic network optimization L(U, W) = E_{(X,Y)∼P} ‖U(WX)² − Y‖² has no poor local minima if M ≥ 2N.
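The lifting identity is a one-liner to check numerically (our sketch; the rows w_k of W play the role of the M filters):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 5, 7
W, x = rng.normal(size=(M, N)), rng.normal(size=N)

lhs = (W @ x) ** 2                               # rho(W x), with rho(z) = z^2
X_lift = np.outer(x, x)                          # X = x x^T
A_W = np.stack([np.outer(w, w) for w in W])      # A_W = (w_k w_k^T)_{k <= M}
rhs = np.einsum('kij,ij->k', A_W, X_lift)        # <w_k w_k^T, x x^T>
print(np.allclose(lhs, rhs))                     # True: the model is linear in X
```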
ASYMPTOTIC CONNECTEDNESS OF RELU
➤ Good behavior is recovered with nonlinear ReLU networks, provided they are sufficiently overparametrized.
➤ Setup: two-layer ReLU network:
Φ(X;Θ) = W_2 ρ(W_1 X) ,  ρ(z) = max(0, z) ,  W_1 ∈ R^{m×n} , W_2 ∈ R^m .
➤ Overparametrisation "wipes out" local minima (and group symmetries).
➤ The bound is cursed by dimensionality, i.e. exponential in n.
➤ The result is based on a local linearization of the ReLU kernel (hence the exponential price).
Theorem [BF'16]: For any Θ^A, Θ^B ∈ R^{m×n} × R^m with E(Θ^{A,B}) ≤ λ, there exists a path γ(t) from Θ^A to Θ^B such that ∀t, E(γ(t)) ≤ max(λ, ε) with ε ∼ m^{−1/n}.
KERNELS ARE BACK?
➤ The underlying technique we described consists in "convexifying" the problem, by mapping neural parameters Θ to canonical parameters β = A(Θ):
Φ(X;Θ) = W_K ρ(W_{K−1} … ρ(W_1 X)) ,  Θ = (W_1, …, W_K) ,  Φ(X;Θ) = ⟨Ψ(X), A(Θ)⟩ .
➤ This includes Empirical Risk Minimization (since the RKHS is only queried on a finite # of datapoints).
➤ See [Bietti & Mairal'17, Zhang et al.'17, Bach'17] for related work.
Corollary [BBV'17]: If dim{A(w) ; w ∈ R^n} = q < ∞, then E(W, U) = E|Uρ(WX) − Y|², W ∈ R^{M×N}, has no poor local minima if M ≥ 2q.
PARAMETRIC VS MANIFOLD OPTIMIZATION
➤ This suggests thinking about the problem in the functional space generated by the model:
F_Φ = {φ : R^n → R^m ; φ(x) = Φ(x;Θ) for some Θ} ,
g* : x ↦ E(Y | x) ,  min_{φ∈F_Φ} ‖φ − g*‖_P ,  with ⟨f, g⟩_P := E{f(X) g(X)} .
➤ Sufficient conditions for success so far: F_Φ convex and sufficiently large, so that we can move freely within it.
➤ What happens when the model is not overparametrised?
➤ The energy landscape of several prototypical models in
statistical physics exhibit a so-called energy barrier, e.g. spherical spin glasses:
HN,p(σ) = N −(p−1)/2
N
X
i1,...ip=1
Ji1,...,ipσi1 · · · σip , σ ∈ SN−1( √ N) , Ji ∼ N(0, 1).
≥ ≥ lim
N→∞1 N log E CrtN,k(u) = Θk,p(u).
1.66 1.65 1.64 1.63 0.02 0.01 0.01 0.02u −E0 −E1 −E10 −Ec
Figure 1. The functions Θk,p for p = 3 and k = 0 (solid), k = 1 (dashed), k = 2 (dash-dotted), k = 10, k = 100 (both dotted). All these functions agree for u ≥ −E∞.
[Auffinger, Ben Arous, Cerny,’11]
FROM SIMPLE LANDSCAPES TO ENERGY BARRIER?
➤ Does a similar macroscopic picture arise in our setting?
➤ Given ρ(z) homogeneous, assume ρ̃(⟨w, X⟩) = ⟨A_w, ψ(X)⟩ , with dim(ψ(X)) = f(N) .
➤ Define
β(M, N) = inf_{S ; dim(S)=f^{−1}(M)}  inf_{U ∈ R^{m×M}, W ∈ R^{M×f^{−1}(M)}}  sup_{E‖Z‖ ≤ N − f^{−1}(M), P_S Z = 0}  E‖Uρ(W P_S X + Z) − Y‖² ,
so that β(f(N), N) = min_{U,W} E(U, W) .
➤ Best loss obtained by first projecting the data onto the best possible subspace of dimension f^{−1}(M) and adding bounded noise in the complement.
➤ β(M, N) decreases with M and N.
Conjecture [LBB'18]: The loss L(U, W) = E‖Uρ(WX) − Y‖² has no poor local minima above the energy barrier β(M, N).
➤ The next question we are interested in is conditioning for
descent.
➤ Even if level sets are connected, how easy it is to navigate
through them?
➤ How “large” and regular are they?
easy to move from one energy level to lower one hard to move from one energy level to lower one
FROM TOPOLOGY TO GEOMETRY
➤ The next question we are interested in is conditioning for
descent.
➤ Even if level sets are connected, how easy it is to navigate
through them?
➤ We estimate level set geodesics and measure their length.
easy to move from one energy level to lower one hard to move from one energy level to lower one
θA θB
θA
θB
FINDING CONNECTED COMPONENTS
➤ Suppose θ_1, θ_2 are such that E(θ_1) = E(θ_2) = u_0 .
➤ They are in the same connected component of Ω_{u_0} iff there is a path γ(t), γ(0) = θ_1, γ(1) = θ_2, such that ∀t ∈ (0, 1) , E(γ(t)) ≤ u_0 .
➤ Moreover, we penalize the length of the path:
∀t ∈ (0, 1) , E(γ(t)) ≤ u_0  and  ∫‖γ̇(t)‖ dt ≤ M .
➤ Dynamic programming approach: take the midpoint θ_m = (θ_1 + θ_2)/2, project it back into the level set within the bisecting hyperplane H,
θ_3 = argmin_{θ∈H ; E(θ)≤u_0} ‖θ − θ_m‖ ,
and recurse on the pairs (θ_1, θ_3) and (θ_3, θ_2).
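A minimal sketch of this midpoint-refinement scheme (our own rendering; the constrained argmin is approximated by a few gradient-descent steps on E, and the hyperplane constraint is dropped for simplicity):

```python
import numpy as np

def project_to_levelset(theta, E, gradE, u0, steps=200, lr=0.05):
    """Approximately move theta into {E <= u0} by descending E."""
    th = theta.copy()
    for _ in range(steps):
        if E(th) <= u0:
            break
        th -= lr * gradE(th)
    return th

def connect(theta1, theta2, E, gradE, u0, depth=6):
    """Recursively bisect the segment, pushing midpoints below level u0.
    Returns a polygonal path approximating a geodesic inside Omega_{u0}."""
    if depth == 0:
        return [theta1, theta2]
    mid = project_to_levelset(0.5 * (theta1 + theta2), E, gradE, u0)
    left = connect(theta1, mid, E, gradE, u0, depth - 1)
    return left[:-1] + connect(mid, theta2, E, gradE, u0, depth - 1)

# Toy landscape: a ring of minima at x^2 + y^2 = 1; connect two points on the ring.
E     = lambda th: (th[0] ** 2 + th[1] ** 2 - 1) ** 2
gradE = lambda th: 4 * (th[0] ** 2 + th[1] ** 2 - 1) * th
path = connect(np.array([-1.0, 0.0]), np.array([0.0, 1.0]), E, gradE, u0=0.2)
print(max(E(p) for p in path))   # every vertex sits (approximately) below u0
```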
NUMERICAL EXPERIMENTS
➤ Compute the length of the geodesic in Ω_u obtained by the algorithm, and normalize it by the Euclidean distance between the endpoints: a measure of the curviness of the level sets.
[Plots: normalized geodesic length vs. energy level for a cubic polynomial, CNN/MNIST, CNN/CIFAR-10, and LSTM/Penn]
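The normalized length is straightforward to compute from the polygonal path returned by the sketch above (`path` as produced by `connect`):

```python
import numpy as np

def curviness(path):
    """Geodesic length divided by straight-line distance between the endpoints;
    1.0 means the connection is straight, larger values mean curvier level sets."""
    pts = np.asarray(path)
    length = np.linalg.norm(np.diff(pts, axis=0), axis=1).sum()
    return length / np.linalg.norm(pts[-1] - pts[0])

# For the ring example, the geodesic hugs the unit circle: arc ~ pi/2 over a
# chord of sqrt(2), so curviness should come out near 1.11.
```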
ANALYSIS AND PERSPECTIVES
➤ # of components does not increase: no poor local minima detected so far when using typical datasets and typical architectures (at the energy levels explored by SGD).
➤ Level sets become more irregular as energy decreases.
➤ Presence of an "energy barrier"? Extend to truncated Taylor expansions?
➤ Kernels are back? CNN RKHS.
➤ Open: a "sweet spot" between overparametrisation and overfitting?
➤ Open: the role of Stochastic Optimization in this story?