Survey of Overparametrization and Optimization
Jason D. Lee University of Southern California September 25, 2019
Outline
Overparametrization and Architecture Design
1. Geometric Results on Overparametrization
2. Review of Non-convex Optimization
1. Optimization
2. Statistical
Setup:
1. fθ(x) is the prediction function (neural network).
2. ℓ(ŷ, y) is the loss function; the training objective is L(θ) = (1/n) Σᵢ ℓ(fθ(xᵢ), yᵢ).
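A minimal sketch of this setup in Python (hypothetical two-layer ReLU network and squared loss; an illustration, not code from the talk):

import numpy as np

def f_theta(theta, x):
    # two-layer network: f_theta(x) = a^T relu(W x)
    W, a = theta
    return a @ np.maximum(W @ x, 0.0)

def empirical_risk(theta, X, y):
    # L(theta) = (1/n) * sum_i loss(f_theta(x_i), y_i), squared loss here
    preds = np.array([f_theta(theta, x) for x in X])
    return np.mean((preds - y) ** 2)

# toy data: n = 8 samples in d = 3 dimensions, m = 16 hidden units
rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)
theta = (rng.normal(size=(16, 3)), rng.normal(size=16))
print(empirical_risk(theta, X, y))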
[Figure: objective value vs. iterations (×10⁴), two panels.]
Gradient descent converges to stationary points:
lim_{k→∞} ‖∇L(θk)‖ = 0.
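A toy numeric illustration of the gradient norm vanishing along the gradient-descent path (hypothetical non-convex objective, not from the slides):

import numpy as np

def L(theta):
    # toy non-convex objective: saddle at the origin, minima at theta_1 = ±sqrt(2)
    return 0.25 * theta[0]**4 - theta[0]**2 + theta[1]**2

def grad_L(theta):
    return np.array([theta[0]**3 - 2.0 * theta[0], 2.0 * theta[1]])

theta, eta = np.array([0.1, 1.0]), 0.05
for k in range(2000):
    theta = theta - eta * grad_L(theta)
print(theta, np.linalg.norm(grad_L(theta)))  # gradient norm ~ 0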
A second-order stationary point θ satisfies:
1. ∇L(θ) = 0
2. ∇²L(θ) ⪰ 0.
Every local minimizer θ∗ satisfies:
1. ∇L(θ∗) = 0,
2. ∇²L(θ∗) ⪰ 0.
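These conditions can be checked numerically via the eigenvalues of the Hessian. A small sketch, reusing the toy objective above (the eigenvalue test is standard, not specific to the talk):

import numpy as np

def hessian_L(theta):
    # Hessian of the toy objective L above
    return np.array([[3.0 * theta[0]**2 - 2.0, 0.0],
                     [0.0,                     2.0]])

def classify_stationary_point(theta, tol=1e-8):
    eigs = np.linalg.eigvalsh(hessian_L(theta))
    if eigs.min() >= -tol:
        return "second-order stationary (Hessian PSD)"
    return "strict saddle (negative curvature direction)"

print(classify_stationary_point(np.array([0.0, 0.0])))          # saddle
print(classify_stationary_point(np.array([np.sqrt(2.0), 0.0]))) # local min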
Non-convex problems where all local minima are global:
1. Matrix completion (GLM16, GJZ17, ...)
2. Rank-k approximation (classical)
3. Matrix sensing (BNS16)
4. Phase retrieval (SQW16)
5. Orthogonal tensor decomposition (AGHKT12, GHJY15)
6. Dictionary learning (SQW15)
7. Max-cut via Burer-Monteiro (BBV16, Montanari 16)
8. Overparametrized networks with quadratic activation (DL18)
9. ReLU network with two neurons (LWL17)
10. ReLU networks via landscape design (GLM18)
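As a hedged illustration of this phenomenon for rank-k approximation (a sketch under the standard formulation L(U) = ¼‖UUᵀ − M‖²_F; the dimensions and step size are made up):

import numpy as np

rng = np.random.default_rng(1)
d, k = 10, 3
U_star = rng.normal(size=(d, k))
M = U_star @ U_star.T          # ground-truth rank-k PSD matrix

# gradient descent on L(U) = (1/4) ||U U^T - M||_F^2 from random init
U = rng.normal(size=(d, k))
eta = 0.005
for _ in range(20000):
    U = U - eta * (U @ U.T - M) @ U

print(np.linalg.norm(U @ U.T - M))  # ~ 0: global minimum reached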
Assumptions:
1. Gaussian input x
2. No negative output weights
3. k ≤ d
Strategy:
1. Assume the target f⋆ ∈ H(Kφ), or is approximable by H(Kφ).
2. Show that SGD learns a predictor as competitive as the best in H(Kφ).
1. Write K(x, y) = g(ρ) = Σᵢ cᵢρⁱ, where ρ = ⟨x, y⟩.
2. Thus φ(x)ᵢ = √cᵢ xⁱ is a feature map.
3. Using this, we can write p(x) = Σⱼ pⱼxʲ = ⟨w, φ(x)⟩ for wⱼ = pⱼ/√cⱼ.
4. Thus if the cⱼ decay quickly, then ‖w‖² won't be too huge: the RKHS norm is ‖p‖²_K = Σⱼ pⱼ²/cⱼ.
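A quick numeric check of this feature-map identity in one dimension (hypothetical choice cᵢ = 1/i!, i.e. g(ρ) = e^ρ, truncated at degree 20; a sketch, not from the talk):

import numpy as np
from math import factorial

deg = 20
c = np.array([1.0 / factorial(i) for i in range(deg)])  # g(rho) = exp(rho)

def phi(x):
    # feature map: phi(x)_i = sqrt(c_i) * x^i
    return np.sqrt(c) * x ** np.arange(deg)

x, y = 0.7, -0.3
print(phi(x) @ phi(y), np.exp(x * y))   # both ~ K(x, y) = g(xy)

# RKHS norm of p(x) = x^3: w_j = p_j / sqrt(c_j), so ||w||^2 = 1/c_3 = 3! = 6
p = np.zeros(deg)
p[3] = 1.0
w = p / np.sqrt(c)
print(w @ w)                            # 6.0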
¹ Probably need f₀ = o(√m), and this is the only place the neural network structure is used.
² It can be significantly improved with data assumptions, to m ≳ ‖f‖²_K.
³ If the kernel has a nullspace, then this should be ‖f‖²_K ≤ Σⱼ |aⱼ|²/cⱼ.
Limitations:
1. If f⋆(x) = σ(w⊤x) (a single ReLU), then exponentially many (in d) features are needed.
2. With m = dᵏ features, one can only learn as well as fitting a degree-k polynomial.
3. For a simple distribution realizable by four ReLUs, NTK needs n ≳ d² samples.
⁴ NTK is not a stationary point of this.