SLIDE 1 Recent Advances and Challenges in Non-Convex Optimization
Anima Anandkumar
U.C. Irvine
NIPS workshop on non-convex optimization 2015
SLIDE 2 Optimization for Learning
Most learning problems can be cast as optimization.
Unsupervised Learning
Clustering: k-means, hierarchical, ...
Maximum likelihood estimation: probabilistic latent variable models
Supervised Learning
Optimizing a neural network with respect to a loss function
[Figure: neural network with input, neuron, and output layers]
SLIDE 3
Convex vs. Non-convex Optimization
Mostly convex analysis... but non-convex is trending!
Images taken from https://www.facebook.com/nonconvex
SLIDE 4
Convex vs. Nonconvex Optimization
Convex: unique optimum (global = local). Non-convex: multiple local optima.
Guaranteed approaches for non-convex problems?
SLIDE 5
Non-convex Optimization in High Dimensions
Critical/stationary points: {x : ∇ₓf(x) = 0}. Three types: local maxima, local minima, saddle points.
Curse of dimensionality: exponential number of critical points.
SLIDE 6
Outline
1. Introduction
2. Spectral Optimization
3. Other Approaches
4. Conclusion
SLIDE 7 Matrix Eigen-analysis
Top eigenvector: max_v ⟨v, Mv⟩ s.t. ‖v‖ = 1, v ∈ R^d.
- No. of isolated critical points ≤ d.
- No. of local optima: 1
Local optimum ≡ Global optimum!
Algorithmic implication
Gradient descent (power method) converges to global optimum! Saddle points avoided by random initialization!
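A minimal sketch of this (not from the slides): power iteration with random initialization for the top eigenvector of a symmetric matrix. Function and variable names are illustrative.

```python
import numpy as np

def power_method(M, num_iters=100, seed=0):
    """Power iteration for the top eigenvector of a symmetric matrix M."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])  # random init: avoids saddle points almost surely
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = M @ v                        # ascent step for <v, Mv> on the sphere
        v /= np.linalg.norm(v)           # project back to the unit sphere
    return v
```

Strictly speaking this converges to the eigenvector of the largest-magnitude eigenvalue, which matches the maximization above when M is positive semi-definite.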
SLIDE 8
From Matrices to Tensors
Matrix: Pairwise Moments
E[x ⊗ x] ∈ R^{d×d} is a second-order tensor: E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. For matrices: E[x ⊗ x] = E[xx⊤].
Tensor: Higher order Moments
E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor: E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
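A hedged illustration of forming the empirical versions of these moments from n samples stacked as the rows of X (names are my own, not from the slides):

```python
import numpy as np

def empirical_moments(X):
    """Empirical second- and third-order moments of the rows of X (shape n x d)."""
    n = X.shape[0]
    M2 = np.einsum('ni,nj->ij', X, X) / n         # estimates E[x ⊗ x], shape (d, d)
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n  # estimates E[x ⊗ x ⊗ x], shape (d, d, d)
    return M2, M3
```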
SLIDE 10 Tensor Norm Maximization Problem
Computationally hard for general tensors. Orthogonal tensors: T = Σ_{i∈[k]} u_i ⊗ u_i ⊗ u_i with u_i ⊥ u_j for i ≠ j.
Top eigenvector: max_v T(v, v, v) s.t. ‖v‖ = 1, v ∈ R^d.
- No. of critical points: exp(d)!
- No. of local optima: k.
Local optima: {u_i}.
Multiple local optima, but they correspond to components!
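A minimal sketch of the tensor power update for this objective (illustrative, not the slides' exact algorithm); which component u_i it converges to depends on the random start:

```python
import numpy as np

def tensor_power_iteration(T, num_iters=100, seed=0):
    """Tensor power method for a symmetric third-order tensor T:
    repeat v <- T(I, v, v) / ||T(I, v, v)||."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = np.einsum('ijk,j,k->i', T, v, v)  # contraction T(I, v, v)
        v /= np.linalg.norm(v)
    return v
```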
SLIDE 11 Implication: Guaranteed Tensor Decomposition
Orthogonal Tensor Decomposition: T = Σ_{i∈[k]} u_i ⊗ u_i ⊗ u_i
Gradient descent (power method) recovers a local optimum {ui}. Find all components {ui} by deflation!
Non-orthogonal: T = Σ_i λ_i a_i ⊗ a_i ⊗ a_i
Orthogonalization via multilinear transform W. W computed using SVD on tensor slices. Requires linear independence of the a_i's.
[Figure: W maps non-orthogonal components a1, a2, a3 to orthogonal u1, u2, u3]
Recovery of Tensor Factorization under Mild Conditions!
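A hedged sketch of power iteration with deflation for an (approximately) orthogonal tensor; the whitening transform W for the non-orthogonal case is omitted, and all names are illustrative:

```python
import numpy as np

def decompose_by_deflation(T, k, num_iters=100, seed=0):
    """Recover k components of T ≈ sum_i lambda_i u_i ⊗ u_i ⊗ u_i
    (orthogonal u_i) by repeated tensor power iteration plus deflation."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    components, weights = [], []
    for _ in range(k):
        v = rng.standard_normal(T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(num_iters):
            v = np.einsum('ijk,j,k->i', T, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)      # eigenvalue T(v, v, v)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate the found component
        components.append(v)
        weights.append(lam)
    return np.array(components), np.array(weights)
```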
SLIDE 12
Implementations of Spectral Methods
Spark Implementation
https://github.com/FurongHuang/SpectralLDA-TensorSpark
Single machine topic model implementation
https://bitbucket.org/megaDataLab/tensormethodsforml/overview
Single machine community detection implementation
https://github.com/FurongHuang/Fast-Detection-of-Overlapping-
Randomized Sketching for Tensors (Thursday Spotlight)
http://yining-wang.com/fftlda-code.zip
Extended BLAS kernels
Exploit BLAS extensions on CPU/GPU: work in progress.
SLIDE 13 Implications for Learning: Unsupervised Setting
GMM, HMM, ICA
[Figures: graphical models for GMM, HMM (hidden states h1, h2, h3; observations x1, x2, x3), and ICA (latent sources h1, ..., hk; observations x1, ..., xd)]
Topic Models: Spectral vs. Variational
[Plot: running time on a log scale (10^0 to 10^5) for variational vs. tensor methods on the Facebook, Yelp, DBLP-sub, and DBLP datasets]
500-fold speedup compared to variational inference. More details at Kevin Chen’s talk at 17:00
SLIDE 14 Reinforcement Learning of POMDPs
Partially observable Markov decision processes (POMDPs).
Memoryless policies with oracle access to planning.
Episodic learning with spectral methods.
First RL method for POMDPs with regret bounds!
[Figure: POMDP graphical model with hidden states x_i, observations y_i, rewards r_i, and actions a_i]
[Plot: average reward vs. number of trials (1000 to 7000) for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy]
Reinforcement Learning of POMDPs using Spectral Methods, by K. Azizzadenesheli, A. Lazaric, A. Anandkumar.
SLIDE 15 Training Neural Networks via Tensor Methods
Unsupervised learning of tensor features S(x).
Train neural networks by tensor decomposition of E[y ⊗ S(x)].
First guaranteed results for training neural networks!
Exploits probabilistic models of the input.
[Diagram: input x is passed through the score function S(x); labeled data gives the empirical cross-moment (1/n) Σ_{i∈[n]} y_i ⊗ S(x_i); CP tensor decomposition of this cross-moment yields rank-1 components equal to the first-layer weights]
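A small sketch of forming the empirical cross-moment from the diagram above; it assumes the third-order score S(x_i) has already been evaluated per sample (computing S itself requires a probabilistic model of the input, which is out of scope here):

```python
import numpy as np

def cross_moment(y, S_x):
    """Empirical cross-moment (1/n) * sum_i y_i * S(x_i).
    y: labels, shape (n,); S_x: per-sample third-order scores, shape (n, d, d, d).
    Per the slide, CP decomposition of the result yields the first-layer weights."""
    return np.einsum('n,nijk->ijk', y, S_x) / y.shape[0]
```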
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods, by M. Janzamin, H. Sedghi, A. Anandkumar.
Neural networks will also be discussed in Andrew Barron's talk at 15:30.
SLIDE 16 Local Optima in Deep Learning?
[Figure: two-layer network with inputs x1, x2, first-layer weights w1, w2, sigmoid units σ(·), and output y ∈ {−1, +1}]
Backprop (quadratic) loss surface
[Surface plot over weights w1(1), w1(2); loss values roughly 200 to 650]
Loss surface for our tensor method
[Surface plot over weights w1(1), w1(2); loss values roughly 20 to 200]
SLIDE 18 Analysis in High Dimensions
Loss function in deep neural networks ≈ random Gaussian polynomials (under strong assumptions): f(x) = Σ_{i1,...,ip} c_{i1...ip} x_{i1} ⋯ x_{ip}, x ∈ R^d, c_{i1...ip} ∼ N(0, 1).
Main result: all local minima have similar values.
Auffinger, A., Arous, G.B.: Complexity of random smooth functions of many variables.
More details at Yann LeCun's talk at 9:10 AM.
Caution: algorithmically still hard to optimize!
Basins of attraction for local optima unknown
◮ Exponential initializations needed for success?
Degenerate saddle points are present
◮ NP-hard to escape them!
SLIDE 19
Outline
1. Introduction
2. Spectral Optimization
3. Other Approaches
4. Conclusion
SLIDE 20
Problem Dependent Initialization
Compute an approximate solution via some polynomial-time method. Use it to initialize gradient descent.
Notable Success Stories
Dictionary learning / sparse coding: initialize with a clustering-based solution.
Robust PCA (matrix and tensor settings): initialize with PCA.
More details at Sanjeev Arora’s talk at 14:30
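One hedged example of such an initialization (illustrative, not the exact scheme from the cited work): cluster normalized data samples and use the centroids as initial dictionary atoms, then refine by local search.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_dictionary_init(Y, k, seed=0):
    """Clustering-based initialization for dictionary learning.
    Y: data matrix with samples as columns, shape (d, n)."""
    Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-12)  # unit-norm samples
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Yn.T)
    D0 = km.cluster_centers_.T                                   # (d, k) initial dictionary
    return D0 / (np.linalg.norm(D0, axis=0, keepdims=True) + 1e-12)  # unit-norm atoms
```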
SLIDE 21 Avoiding Saddle Points without Restarts
Non-degenerate saddle points: Hessian has both positive and negative eigenvalues. A negative-curvature eigenvector gives a direction of escape.
Second-order method: use Hessian information to escape.
◮ Cubic regularization of Newton's method, Nesterov & Polyak
First order method: noisy stochastic gradient descent works!
◮ Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition, R. Ge, F. Huang, C. Jin, Y. Yuan
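A minimal sketch of the noisy-gradient idea from Ge et al. (hyperparameters and names here are illustrative): isotropic noise added to each step pushes the iterate off non-degenerate saddle points.

```python
import numpy as np

def noisy_gradient_descent(grad, x0, lr=0.01, noise_std=0.1, num_iters=10000, seed=0):
    """Gradient descent with isotropic noise, which escapes
    non-degenerate saddle points (where plain GD can stall)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        x -= lr * (grad(x) + noise_std * rng.standard_normal(x.shape))
    return x
```

For instance, on f(x, y) = x^2 − y^2, plain gradient descent started exactly at the saddle (0, 0) never moves, while the noisy version drifts off along the negative-curvature direction.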
SLIDE 22 Convex Envelopes, Smoothing, Annealing...
Convex envelope: achieves the global optimum, but hard to compute (characterized via PDEs).
Smoothing: may not achieve the global optimum.
See Hossein Mobahi’s talk at 15:00
Annealing Methods
Form of stochastic search: sampling-based methods. Challenge: mixing time can be exponential.
See Andrew Barron's talk at 15:30
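A generic simulated-annealing sketch of such sampling-based search (not from the slides; the exponential mixing-time caveat above applies to schedules like this one):

```python
import numpy as np

def simulated_annealing(f, x0, num_iters=10000, step=0.1, T0=1.0, seed=0):
    """Stochastic search: random proposals accepted by the Metropolis
    rule under a slowly decreasing temperature."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    fx = f(x)
    for t in range(1, num_iters + 1):
        temp = T0 / np.log(t + 1)                        # logarithmic cooling
        x_new = x + step * rng.standard_normal(x.shape)  # random proposal
        f_new = f(x_new)
        if f_new < fx or rng.random() < np.exp((fx - f_new) / temp):
            x, fx = x_new, f_new                         # accept the move
    return x, fx
```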
Sum of squares
Higher order semi-definite programs (SDP). Nice theoretical tool, but computationally intensive.
SLIDE 23
Outline
1. Introduction
2. Spectral Optimization
3. Other Approaches
4. Conclusion
SLIDE 24 Conclusion
Many approaches to analyze and deal with non-convexity
◮ Local search methods: gradient descent, trust region, ...
◮ Problem-dependent initialization.
◮ Annealing and smoothing methods.
◮ Sum of squares (higher-order SDPs)
NP-hardness should not deter us from building new theory for non-convex optimization.
Open problems: Numerous!
Provide an explicit characterization of tractable problems. We lack a hierarchy of non-convex problems.
Looking forward to a great workshop!