Recent Advances and Challenges in Non-Convex Optimization


  1. Recent Advances and Challenges in Non-Convex Optimization. Anima Anandkumar, U.C. Irvine. NIPS Workshop on Non-Convex Optimization, 2015.

  2. Optimization for Learning. Most learning problems can be cast as optimization. Unsupervised learning: clustering (k-means, hierarchical, ...); maximum likelihood estimation in probabilistic latent variable models. Supervised learning: optimizing a neural network with respect to a loss function. [Figure: neural network with input, hidden neurons, and output.]

  3. Convex vs. Non-Convex Optimization. Mostly convex analysis so far, but non-convex is trending! (Images taken from https://www.facebook.com/nonconvex)

  4. Convex vs. Non-Convex Optimization. Convex: a unique optimum, so local = global. Non-convex: multiple local optima. Are there guaranteed approaches for non-convex problems?

  5. Non-Convex Optimization in High Dimensions. Critical/stationary points: x : ∇ₓ f(x) = 0. They include local maxima, saddle points, and local minima. Curse of dimensionality: exponentially many critical points.

  6. Outline: 1. Introduction; 2. Spectral Optimization; 3. Other Approaches; 4. Conclusion.

  7. Matrix Eigen-Analysis. Top eigenvector: max_v ⟨v, Mv⟩ s.t. ‖v‖ = 1, v ∈ R^d. Number of isolated critical points ≤ d; number of local optima: 1. Local optimum ≡ global optimum! Algorithmic implication: gradient descent (the power method) converges to the global optimum, and saddle points are avoided by random initialization.
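To make the algorithmic implication on slide 7 concrete, here is a minimal power-iteration sketch (illustrative code, not from the talk; the matrix M in the example is assumed symmetric positive semidefinite so that the dominant eigenvector also maximizes ⟨v, Mv⟩ on the unit sphere):

```python
import numpy as np

def power_iteration(M, num_iters=200, seed=0):
    """Top eigenvector of a symmetric matrix M via power iteration.

    A random unit-norm start avoids the saddle points (the non-dominant
    eigenvectors) with probability 1, so the iteration converges to the
    global optimum of max <v, Mv> over the unit sphere when the dominant
    eigenvalue is positive.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = M @ v                 # power step
        v /= np.linalg.norm(v)    # project back to the unit sphere
    return v, v @ M @ v           # eigenvector and Rayleigh quotient <v, Mv>

# Example on a random PSD matrix (so the top eigenvalue is positive).
A = np.random.default_rng(1).standard_normal((50, 20))
M = A @ A.T
v, top_eigenvalue = power_iteration(M)
```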

  8. From Matrices to Tensors. Matrix (pairwise moments): E[x ⊗ x] ∈ R^{d×d} is a second-order tensor with E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]; for matrices, E[x ⊗ x] = E[xx⊤]. Tensor (higher-order moments): E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor with E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
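For illustration only (not code from the slides), the empirical counterparts of these moment tensors can be formed directly from n samples stacked in an n × d array; einsum is one convenient way to write the averaged outer products:

```python
import numpy as np

def empirical_moments(X):
    """Empirical second- and third-order moments of samples X (shape n x d).

    M2[i1, i2]     ~ E[x_{i1} x_{i2}]         (pairwise moments, E[x x^T])
    M3[i1, i2, i3] ~ E[x_{i1} x_{i2} x_{i3}]  (third-order moment tensor)
    """
    n = X.shape[0]
    M2 = np.einsum('ni,nj->ij', X, X) / n
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n
    return M2, M3
```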

  9. Tensor Norm Maximization Problem. Top eigenvector: max_v T(v, v, v) s.t. ‖v‖ = 1, v ∈ R^d. Computationally hard for general tensors. Orthogonal tensors: T = Σ_{i ∈ [k]} u_i ⊗ u_i ⊗ u_i with u_i ⊥ u_j for i ≠ j.

  10. Tensor Norm Maximization Problem. Top eigenvector: max_v T(v, v, v) s.t. ‖v‖ = 1, v ∈ R^d. Computationally hard for general tensors. Orthogonal tensors: T = Σ_{i ∈ [k]} u_i ⊗ u_i ⊗ u_i with u_i ⊥ u_j for i ≠ j. Number of critical points: exp(d)! Number of local optima: k, and the local optima are exactly {u_i}. Multiple local optima, but they correspond to the components!
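The tensor analogue of the power method is tensor power iteration. Below is a minimal sketch (illustrative only), assuming a symmetric third-order tensor T stored as a d × d × d numpy array; the update v ← T(I, v, v) followed by normalization converges to one of the local optima, i.e. to one of the components u_i when T is orthogonally decomposable:

```python
import numpy as np

def tensor_power_iteration(T, num_iters=100, seed=0):
    """One run of tensor power iteration on a symmetric tensor T (d x d x d).

    For an orthogonally decomposable T, a random start converges to one of
    the components u_i; the returned value T(v, v, v) is the corresponding
    eigenvalue.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = np.einsum('ijk,j,k->i', T, v, v)   # v <- T(I, v, v)
        v /= np.linalg.norm(v)
    return np.einsum('ijk,i,j,k->', T, v, v, v), v
```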

  11. Implication: Guaranteed Tensor Decomposition. Orthogonal tensor decomposition: T = Σ_{i ∈ [k]} u_i ⊗ u_i ⊗ u_i. Gradient descent (the power method) recovers one local optimum u_i; all components {u_i} are found by deflation! Non-orthogonal case: T = Σ_i λ_i a_i ⊗ a_i ⊗ a_i. Orthogonalize via a multilinear transform W, computed using SVD on tensor slices; this requires linear independence of the a_i's. Recovery of the tensor factorization under mild conditions!
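A minimal, self-contained sketch of the deflation idea for an orthogonal tensor T = Σ_i λ_i u_i ⊗ u_i ⊗ u_i (illustrative only; practical implementations add multiple random restarts, and the whitening transform W for the non-orthogonal case is omitted here):

```python
import numpy as np

def decompose_orthogonal_tensor(T, k, num_iters=100, seed=0):
    """Recover k components of T = sum_i lam_i u_i (x) u_i (x) u_i (u_i orthonormal)
    by repeated tensor power iteration followed by deflation."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    T_res = T.copy()
    weights, components = [], []
    for _ in range(k):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        for _ in range(num_iters):
            v = np.einsum('ijk,j,k->i', T_res, v, v)   # power step v <- T(I, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T_res, v, v, v)
        weights.append(lam)
        components.append(v)
        T_res = T_res - lam * np.einsum('i,j,k->ijk', v, v, v)   # deflate
    return np.array(weights), np.array(components)
```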

  12. Implementations of Spectral Methods. Spark implementation: https://github.com/FurongHuang/SpectralLDA-TensorSpark. Single-machine topic model implementation: https://bitbucket.org/megaDataLab/tensormethodsforml/overview. Single-machine community detection implementation: https://github.com/FurongHuang/Fast-Detection-of-Overlapping-. Randomized sketching for tensors (Thursday spotlight): http://yining-wang.com/fftlda-code.zip. Extended BLAS kernels: exploiting BLAS extensions on CPU/GPU, in progress.

  13. Implications for Learning: Unsupervised Setting. GMM, HMM, ICA. [Figures: latent variable models with hidden variables h_1, ..., h_k and observed variables x_1, ..., x_d.] Spectral vs. variational running time on topic models: [bar chart, log-scale running time of variational vs. tensor methods on the Facebook, Yelp, DBLP-sub, and DBLP datasets.] 500-fold speedup compared to variational inference. More details at Kevin Chen's talk at 17:00.

  14. Reinforcement Learning of POMDPs. Partially observable Markov decision processes. Memoryless policies with oracle access to planning; episodic learning with spectral methods. First RL method for POMDPs with regret bounds! [Figure: POMDP graphical model (states x_i, observations y_i, actions a_i, rewards r_i) and a plot of average reward vs. number of trials comparing SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy.] Reinforcement Learning of POMDPs using Spectral Methods, by K. Azizzadenesheli, A. Lazaric, and A. Anandkumar.

  15. Training Neural Networks via Tensor Methods. Unsupervised learning of tensor features S(x) (the score function of the input). Train neural networks by tensor decomposition of the cross-moment E[y ⊗ S(x)]: the first guaranteed results for training neural networks! The method exploits probabilistic models of the input. Estimate the cross-moment from labeled data as (1/n) Σ_{i=1}^n y_i ⊗ S(x_i); its rank-1 (CP) components are the first-layer weights. [Figure: network with input x, score function S(x), and output y.] Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods, by M. Janzamin, H. Sedghi, and A. Anandkumar. Neural networks will also be discussed in Andrew Barron's talk at 15:30.
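For intuition, here is a sketch of forming the cross-moment E[y ⊗ S(x)] from labeled data. It assumes, purely for illustration, a standard Gaussian input so that the third-order score function has the closed Hermite form below; the names gaussian_score3 and cross_moment are hypothetical, and the talk's method then applies CP decomposition to the resulting tensor to read off the first-layer weights.

```python
import numpy as np

def gaussian_score3(x):
    """Third-order score function S_3(x) of a standard Gaussian input
    (the third Hermite tensor):
    S3[i, j, k] = x_i x_j x_k - (x_i d_{jk} + x_j d_{ik} + x_k d_{ij})."""
    eye = np.eye(x.shape[0])
    xxx = np.einsum('i,j,k->ijk', x, x, x)
    sym = (np.einsum('i,jk->ijk', x, eye)
           + np.einsum('j,ik->ijk', x, eye)
           + np.einsum('k,ij->ijk', x, eye))
    return xxx - sym

def cross_moment(X, y):
    """Empirical estimate of E[y * S_3(x)] from samples X (n x d) and scalar labels y."""
    return sum(yi * gaussian_score3(xi) for xi, yi in zip(X, y)) / len(y)
```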

  16. Local Optima in Deep Learning? [Figure: a small network with two sigmoid units σ(·), weights w_1, w_2, inputs x_1, x_2, and labels y = ±1, together with surface plots over (w_1(1), w_1(2)) of the backprop (quadratic) loss surface and the loss surface for our tensor method.]


  18. Analysis in High Dimensions. The loss function of deep neural networks ≈ a random Gaussian polynomial (under strong assumptions): f(x) = Σ_i c_i x^i, x ∈ R^d, a sum over monomials x^i with i.i.d. coefficients c_i ∼ N(0, 1). Main result: all local minima have similar values. (Auffinger, A., Arous, G.B.: Complexity of random smooth functions of many variables.) More details at Yann LeCun's talk at 9:10 AM. Caution: algorithmically still hard to optimize! The basins of attraction of local optima are unknown (exponentially many initializations needed for success?), and degenerate saddle points are present (NP-hard to escape them!).
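As a small numerical illustration of the "similar values" phenomenon (not an experiment from the talk; all parameters below are arbitrary choices), one can sample a random Gaussian homogeneous cubic, run projected gradient descent on the unit sphere from several random starts, and compare the local-minimum values that are found:

```python
import numpy as np

def random_cubic_minima(d=25, num_starts=10, step=0.01, num_iters=500, seed=0):
    """Values of local minima of a random Gaussian cubic on the unit sphere,
    found by projected gradient descent from independent random starts."""
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((d, d, d))        # i.i.d. N(0, 1) coefficients
    f = lambda x: np.einsum('ijk,i,j,k->', C, x, x, x)
    def grad(x):                              # gradient of the cubic form
        return (np.einsum('ijk,j,k->i', C, x, x)
                + np.einsum('ijk,i,k->j', C, x, x)
                + np.einsum('ijk,i,j->k', C, x, x))
    values = []
    for _ in range(num_starts):
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)
        for _ in range(num_iters):
            g = grad(x)
            g -= (g @ x) * x                  # Riemannian gradient on the sphere
            x -= step * g
            x /= np.linalg.norm(x)            # project back to the sphere
        values.append(f(x))
    return np.array(values)                   # these tend to cluster as d grows
```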

  19. Outline: 1. Introduction; 2. Spectral Optimization; 3. Other Approaches; 4. Conclusion.

  20. Problem-Dependent Initialization. Compute an approximate solution via some polynomial-time method and use it to initialize gradient descent. Notable success stories: dictionary learning/sparse coding, initialized with a clustering-based solution; robust PCA (matrix and tensor settings), initialized with PCA. More details at Sanjeev Arora's talk at 14:30.
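A minimal sketch of the robust PCA example (illustrative only; robust_pca_sketch, thresh, and the number of refinement steps are hypothetical choices, not the algorithm from the referenced work): initialize the low-rank part with plain PCA of the observations, then refine with simple alternating updates.

```python
import numpy as np

def robust_pca_sketch(M, rank, thresh, num_iters=20):
    """Decompose M ~ L + S (low-rank plus sparse), initializing L with PCA."""
    # Problem-dependent initialization: truncated SVD (PCA) of the raw matrix.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    for _ in range(num_iters):
        # Sparse part: hard-threshold the residual.
        S = np.where(np.abs(M - L) > thresh, M - L, 0.0)
        # Low-rank part: best rank-r approximation of M - S.
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return L, S
```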

  21. Avoiding Saddle Points without Restarts. Non-degenerate saddle points: the Hessian has both positive and negative eigenvalues, and a negative eigenvector gives a direction of escape. Second-order methods use Hessian information to escape (cubic regularization of Newton's method, Nesterov & Polyak). First-order methods: noisy stochastic gradient descent works! (Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition, R. Ge, F. Huang, C. Jin, Y. Yuan.)
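A minimal sketch of the first-order idea (noise added to every gradient step); the step size, noise level, and the toy function in the example are illustrative choices, not the constants analyzed in the cited paper:

```python
import numpy as np

def noisy_gradient_descent(grad, x0, step=0.01, noise=0.1, num_iters=10_000, seed=0):
    """Gradient descent with isotropic noise injected into every step.

    At a non-degenerate saddle point the gradient vanishes, but the noise
    pushes the iterate into the escape direction given by a negative
    eigenvector of the Hessian, so the iterate does not stay stuck.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        g = grad(x) + noise * rng.standard_normal(x.shape)
        x = x - step * g
    return x

# Example: f(x, y) = x^2 - y^2 + y^4/4 has a saddle at the origin and minima
# at (0, +-sqrt(2)); started exactly at the saddle, plain gradient descent
# stays there forever while the noisy version escapes to one of the minima.
x = noisy_gradient_descent(lambda p: np.array([2 * p[0], -2 * p[1] + p[1] ** 3]),
                           [0.0, 0.0])
```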

  22. Convex Envelopes, Smoothing, Annealing... Convex envelope: achieves the global optimum but is hard to compute (via PDEs). Smoothing: may not reach the global optimum but is tractable; see Hossein Mobahi's talk at 15:00. Annealing methods: a form of stochastic search (sampling-based); the challenge is that mixing times can be exponential; see Andrew Barron's talk at 15:30. Sum of squares: higher-order semidefinite programs (SDPs), a nice theoretical tool but computationally intensive.
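As one concrete instance of the smoothing/continuation idea (an illustrative sketch only, not the method from the referenced talks): replace f by its Gaussian smoothing f_σ(x) = E[f(x + σz)], estimate the smoothed gradient by sampling, and anneal σ toward zero.

```python
import numpy as np

def smoothed_descent(f, x0, sigmas=(1.0, 0.3, 0.1, 0.03), step=0.05,
                     iters_per_level=200, num_samples=64, seed=0):
    """Gradient descent on Gaussian smoothings of f with a decreasing
    smoothing level sigma (a simple coarse-to-fine schedule).

    Uses the score-function identity
        grad f_sigma(x) = E[f(x + sigma * z) * z] / sigma,  z ~ N(0, I),
    so only evaluations of f are needed.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for sigma in sigmas:                       # anneal the smoothing level
        for _ in range(iters_per_level):
            z = rng.standard_normal((num_samples, x.size))
            fx = np.array([f(x + sigma * zi) for zi in z])
            g = (fx[:, None] * z).mean(axis=0) / sigma
            x = x - step * g
    return x
```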

  23. Outline: 1. Introduction; 2. Spectral Optimization; 3. Other Approaches; 4. Conclusion.

  24. Conclusion. Many approaches to analyze and deal with non-convexity: local search methods (gradient descent, trust region, ...), problem-dependent initialization, annealing and smoothing methods, and sum of squares (higher-order SDPs). NP-hardness should not deter us from building new theory for non-convex optimization. Open problems: numerous! Providing an explicit characterization of tractable problems; we lack a hierarchy of non-convex problems. Looking forward to a great workshop!
