SLIDE 1 Recent Advances and Challenges in Non-Convex Optimization
Anima Anandkumar
U.C. Irvine
NIPS workshop on non-convex optimization 2015
SLIDE 2 Optimization for Learning
Most learning problems can be cast as optimization.
Unsupervised Learning
Clustering: k-means, hierarchical, ...
Maximum likelihood estimation: probabilistic latent variable models
Supervised Learning
Optimizing a neural network with respect to a loss function
[Figure: neural network with input, neuron, and output layers]
SLIDE 3
Convex vs. Non-convex Optimization
Mostly convex analysis... but non-convex is trending!
Images taken from https://www.facebook.com/nonconvex
SLIDE 4
Convex vs. Nonconvex Optimization
Convex: unique optimum (global = local). Non-convex: multiple local optima.
Guaranteed approaches for non-convex problems?
SLIDE 5
Non-convex Optimization in High Dimensions
Critical/stationary points: {x : ∇ₓf(x) = 0}. Three types: local maxima, local minima, saddle points.
Curse of dimensionality: exponential number of critical points.
SLIDE 6
Outline
1. Introduction
2. Spectral Optimization
3. Other Approaches
4. Conclusion
SLIDE 7 Matrix Eigen-analysis
Top eigenvector: max_v ⟨v, Mv⟩ s.t. ‖v‖ = 1, v ∈ R^d.
- No. of isolated critical points ≤ d.
- No. of local optima: 1
Local optimum ≡ Global optimum!
Algorithmic implication
Gradient descent (power method) converges to global optimum! Saddle points avoided by random initialization!
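A minimal sketch of this (not from the slides): power iteration with random initialization for the top eigenvector of a symmetric matrix. Function and variable names are illustrative.

```python
import numpy as np

def power_method(M, num_iters=100, seed=0):
    """Power iteration for the top eigenvector of a symmetric matrix M."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])  # random init: avoids saddle points almost surely
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = M @ v                        # ascent step for <v, Mv> on the sphere
        v /= np.linalg.norm(v)           # project back to the unit sphere
    return v
```

Strictly speaking this converges to the eigenvector of the largest-magnitude eigenvalue, which matches the maximization above when M is positive semi-definite.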
SLIDE 8
From Matrices to Tensors
Matrix: Pairwise Moments
E[x ⊗ x] ∈ R^{d×d} is a second-order tensor: E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. For matrices: E[x ⊗ x] = E[xx⊤].
Tensor: Higher order Moments
E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor: E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
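A hedged illustration of forming the empirical versions of these moments from n samples stacked as the rows of X (names are my own, not from the slides):

```python
import numpy as np

def empirical_moments(X):
    """Empirical second- and third-order moments of the rows of X (shape n x d)."""
    n = X.shape[0]
    M2 = np.einsum('ni,nj->ij', X, X) / n         # estimates E[x ⊗ x], shape (d, d)
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n  # estimates E[x ⊗ x ⊗ x], shape (d, d, d)
    return M2, M3
```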
SLIDE 10 Tensor Norm Maximization Problem
Computationally hard for general tensors. Orthogonal tensors: T = Σ_{i∈[k]} u_i ⊗ u_i ⊗ u_i with u_i ⊥ u_j for i ≠ j.
Top eigenvector: max_v T(v, v, v) s.t. ‖v‖ = 1, v ∈ R^d.
- No. of critical points: exp(d)!
- No. of local optima: k.
Local optima: {u_i}.
Multiple local optima, but they correspond to components!
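A minimal sketch of the tensor power update for this objective (illustrative, not the slides' exact algorithm); which component u_i it converges to depends on the random start:

```python
import numpy as np

def tensor_power_iteration(T, num_iters=100, seed=0):
    """Tensor power method for a symmetric third-order tensor T:
    repeat v <- T(I, v, v) / ||T(I, v, v)||."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = np.einsum('ijk,j,k->i', T, v, v)  # contraction T(I, v, v)
        v /= np.linalg.norm(v)
    return v
```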
SLIDE 11 Implication: Guaranteed Tensor Decomposition
Orthogonal Tensor Decomposition: T = Σ_{i∈[k]} u_i ⊗ u_i ⊗ u_i
Gradient descent (power method) recovers a local optimum {ui}. Find all components {ui} by deflation!
Non-orthogonal: T = Σ_i λ_i a_i ⊗ a_i ⊗ a_i
Orthogonalization via multilinear transform W. W computed using SVD on tensor slices. Requires linear independence of the a_i's.
[Figure: W maps non-orthogonal components a1, a2, a3 to orthogonal u1, u2, u3]
Recovery of Tensor Factorization under Mild Conditions!
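A hedged sketch of power iteration with deflation for an (approximately) orthogonal tensor; the whitening transform W for the non-orthogonal case is omitted, and all names are illustrative:

```python
import numpy as np

def decompose_by_deflation(T, k, num_iters=100, seed=0):
    """Recover k components of T ≈ sum_i lambda_i u_i ⊗ u_i ⊗ u_i
    (orthogonal u_i) by repeated tensor power iteration plus deflation."""
    rng = np.random.default_rng(seed)
    T = T.copy()
    components, weights = [], []
    for _ in range(k):
        v = rng.standard_normal(T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(num_iters):
            v = np.einsum('ijk,j,k->i', T, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)      # eigenvalue T(v, v, v)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate the found component
        components.append(v)
        weights.append(lam)
    return np.array(components), np.array(weights)
```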
SLIDE 12
Implementations of Spectral Methods
Spark Implementation
https://github.com/FurongHuang/SpectralLDA-TensorSpark
Single machine topic model implementation
https://bitbucket.org/megaDataLab/tensormethodsforml/overview
Single machine community detection implementation
https://github.com/FurongHuang/Fast-Detection-of-Overlapping-
Randomized Sketching for Tensors (Thursday Spotlight)
http://yining-wang.com/fftlda-code.zip
Extended BLAS kernels
Exploit BLAS extensions on CPU/GPU: work in progress.
SLIDE 13 Implications for Learning: Unsupervised Setting
GMM, HMM, ICA
[Figures: graphical models for GMM, HMM (hidden states h1, h2, h3; observations x1, x2, x3), and ICA (latent sources h1, ..., hk; observations x1, ..., xd)]
Topic Models: Spectral vs. Variational
[Plot: running time on a log scale (10^0 to 10^5) for variational vs. tensor methods on the Facebook, Yelp, DBLP-sub, and DBLP datasets]
500-fold speedup compared to variational inference. More details at Kevin Chen’s talk at 17:00
SLIDE 14 Reinforcement Learning of POMDPs
Partially observable Markov decision processes (POMDPs).
Memoryless policies with oracle access to planning.
Episodic learning with spectral methods.
First RL method for POMDPs with regret bounds!
[Figure: POMDP graphical model with hidden states x_i, observations y_i, rewards r_i, and actions a_i]
[Plot: average reward vs. number of trials (1000 to 7000) for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy]
Reinforcement Learning of POMDPs using Spectral Methods, by K. Azizzadenesheli, A. Lazaric, A. Anandkumar.
SLIDE 15 Training Neural Networks via Tensor Methods
Unsupervised learning of tensor features S(x).
Train neural networks by tensor decomposition of E[y ⊗ S(x)].
First guaranteed results for training neural networks!
Exploits probabilistic models of the input.
[Diagram: input x is passed through the score function S(x); labeled data gives the empirical cross-moment (1/n) Σ_{i∈[n]} y_i ⊗ S(x_i); CP tensor decomposition of this cross-moment yields rank-1 components equal to the first-layer weights]
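A small sketch of forming the empirical cross-moment from the diagram above; it assumes the third-order score S(x_i) has already been evaluated per sample (computing S itself requires a probabilistic model of the input, which is out of scope here):

```python
import numpy as np

def cross_moment(y, S_x):
    """Empirical cross-moment (1/n) * sum_i y_i * S(x_i).
    y: labels, shape (n,); S_x: per-sample third-order scores, shape (n, d, d, d).
    Per the slide, CP decomposition of the result yields the first-layer weights."""
    return np.einsum('n,nijk->ijk', y, S_x) / y.shape[0]
```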
Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods, by M. Janzamin, H. Sedghi, A. Anandkumar.
Neural networks will also be discussed in Andrew Barron's talk at 15:30.
SLIDE 16 Local Optima in Deep Learning?
[Figure: two-layer network with inputs x1, x2, first-layer weights w1, w2, sigmoid units σ(·), and output y ∈ {−1, +1}]
Backprop (quadratic) loss surface
[Surface plot over weights w1(1), w1(2); loss values roughly 200 to 650]
Loss surface for our tensor method
[Surface plot over weights w1(1), w1(2); loss values roughly 20 to 200]
SLIDE 18 Analysis in High Dimensions
Loss function in deep neural networks ≈ random Gaussian polynomials (under strong assumptions): f(x) = Σ_{i1,...,ip} c_{i1...ip} x_{i1} ⋯ x_{ip}, x ∈ R^d, c_{i1...ip} ∼ N(0, 1).
Main result: all local minima have similar values.
Auffinger, A., Arous, G.B.: Complexity of random smooth functions of many variables.
More details at Yann LeCun's talk at 9:10 AM.
Caution: algorithmically still hard to optimize!
Basins of attraction for local optima unknown
◮ Exponential initializations needed for success?
Degenerate saddle points are present
◮ NP-hard to escape them!
SLIDE 19
Outline
1. Introduction
2. Spectral Optimization
3. Other Approaches
4. Conclusion
SLIDE 20
Problem Dependent Initialization
Compute an approximate solution via some polynomial-time method. Use it to initialize gradient descent.
Notable Success Stories
Dictionary learning / sparse coding: initialize with a clustering-based solution.
Robust PCA (matrix and tensor settings): initialize with PCA.
More details at Sanjeev Arora’s talk at 14:30
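One hedged example of such an initialization (illustrative, not the exact scheme from the cited work): cluster normalized data samples and use the centroids as initial dictionary atoms, then refine by local search.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_dictionary_init(Y, k, seed=0):
    """Clustering-based initialization for dictionary learning.
    Y: data matrix with samples as columns, shape (d, n)."""
    Yn = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-12)  # unit-norm samples
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Yn.T)
    D0 = km.cluster_centers_.T                                   # (d, k) initial dictionary
    return D0 / (np.linalg.norm(D0, axis=0, keepdims=True) + 1e-12)  # unit-norm atoms
```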
SLIDE 21 Avoiding Saddle Points without Restarts
Non-degenerate saddle points: Hessian has both positive and negative eigenvalues. A negative-curvature eigenvector gives a direction of escape.
Second-order method: use Hessian information to escape.
◮ Cubic regularization of Newton's method, Nesterov & Polyak
First order method: noisy stochastic gradient descent works!
◮ Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition, R. Ge, F. Huang, C. Jin, Y. Yuan
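A minimal sketch of the noisy-gradient idea from Ge et al. (hyperparameters and names here are illustrative): isotropic noise added to each step pushes the iterate off non-degenerate saddle points.

```python
import numpy as np

def noisy_gradient_descent(grad, x0, lr=0.01, noise_std=0.1, num_iters=10000, seed=0):
    """Gradient descent with isotropic noise, which escapes
    non-degenerate saddle points (where plain GD can stall)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(num_iters):
        x -= lr * (grad(x) + noise_std * rng.standard_normal(x.shape))
    return x
```

For instance, on f(x, y) = x^2 − y^2, plain gradient descent started exactly at the saddle (0, 0) never moves, while the noisy version drifts off along the negative-curvature direction.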
SLIDE 22 Convex Envelopes, Smoothing, Annealing...
Convex envelope: achieves the global optimum, but hard to compute (characterized via PDEs).
Smoothing: may not achieve the global optimum.
See Hossein Mobahi’s talk at 15:00
Annealing Methods
Form of stochastic search: sampling-based methods. Challenge: mixing time can be exponential.
See Andrew Barron's talk at 15:30
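A generic simulated-annealing sketch of such sampling-based search (not from the slides; the exponential mixing-time caveat above applies to schedules like this one):

```python
import numpy as np

def simulated_annealing(f, x0, num_iters=10000, step=0.1, T0=1.0, seed=0):
    """Stochastic search: random proposals accepted by the Metropolis
    rule under a slowly decreasing temperature."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    fx = f(x)
    for t in range(1, num_iters + 1):
        temp = T0 / np.log(t + 1)                        # logarithmic cooling
        x_new = x + step * rng.standard_normal(x.shape)  # random proposal
        f_new = f(x_new)
        if f_new < fx or rng.random() < np.exp((fx - f_new) / temp):
            x, fx = x_new, f_new                         # accept the move
    return x, fx
```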
Sum of squares
Higher order semi-definite programs (SDP). Nice theoretical tool, but computationally intensive.
SLIDE 23
Outline
1. Introduction
2. Spectral Optimization
3. Other Approaches
4. Conclusion
SLIDE 24 Conclusion
Many approaches to analyze and deal with non-convexity
◮ Local search methods: gradient descent, trust region, ...
◮ Problem-dependent initialization.
◮ Annealing and smoothing methods.
◮ Sum of squares (higher-order SDPs)
NP-hardness should not deter us from building new theory for non-convex optimization.
Open problems: Numerous!
Provide an explicit characterization of tractable problems. We lack a hierarchy of non-convex problems.
Looking forward to a great workshop!