

SLIDE 1

Recent Advances and Challenges in Non-Convex Optimization

Anima Anandkumar, U.C. Irvine

NIPS Workshop on Non-Convex Optimization, 2015

SLIDE 2

Optimization for Learning

Most learning problems can be cast as optimization.

Unsupervised Learning

  • Clustering: k-means, hierarchical, . . .
  • Maximum likelihood estimation in probabilistic latent variable models

Supervised Learning

Optimizing a neural network with respect to a loss function

[Diagram: neural network mapping inputs through neurons to an output]

SLIDE 3

Convex vs. Non-convex Optimization

Mostly convex analysis so far, but non-convex is trending!

Images taken from https://www.facebook.com/nonconvex

SLIDE 4

Convex vs. Non-convex Optimization

  • Convex: unique optimum, global = local.
  • Non-convex: multiple local optima.

Guaranteed approaches for non-convex problems?

SLIDE 5

Non-convex Optimization in High Dimensions

Critical/stationary points: {x : ∇x f(x) = 0}. These include:

  • local maxima
  • local minima
  • saddle points

Curse of dimensionality: exponentially many critical points.
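
A small added illustration (not from the slides): the sign pattern of the Hessian's eigenvalues at a critical point distinguishes minima, maxima, and saddles. Here f(x, y) = x^2 - y^2, whose only critical point is the saddle at the origin.

```python
# Sketch: classify the critical point of f(x, y) = x^2 - y^2 at the origin.
import numpy as np

def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])            # ∇f = (2x, -2y)

def hessian(p):
    return np.array([[2.0, 0.0], [0.0, -2.0]])  # constant Hessian of f

p = np.zeros(2)
print("gradient:", grad(p))                     # zero vector: p is critical
eigs = np.linalg.eigvalsh(hessian(p))
print("Hessian eigenvalues:", eigs)             # mixed signs => saddle point
```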

SLIDE 6

Outline

1. Introduction
2. Spectral Optimization
3. Other Approaches
4. Conclusion

SLIDE 7

Matrix Eigen-analysis

Top eigenvector: max_v ⟨v, Mv⟩ s.t. ‖v‖ = 1, v ∈ R^d.

  • No. of isolated critical points ≤ d.
  • No. of local optima: 1


Local optimum ≡ Global optimum!

Algorithmic implication

Gradient descent (power method) converges to global optimum! Saddle points avoided by random initialization!
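
As a concrete sketch (an added example, not the speaker's code): the power method is projected gradient ascent on ⟨v, Mv⟩ over the unit sphere, i.e., repeated matrix-vector products followed by normalization.

```python
# Power-method sketch: finds the top eigenvector of a PSD matrix M.
import numpy as np

def power_method(M, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)              # random init avoids saddle points
    for _ in range(iters):
        v = M @ v                       # ascent step on <v, Mv>
        v /= np.linalg.norm(v)          # project back onto the unit sphere
    return v

A = np.random.default_rng(1).standard_normal((5, 5))
M = A @ A.T                             # symmetric PSD test matrix
v = power_method(M)
print("estimated top eigenvalue:", v @ M @ v)
print("reference (numpy):       ", np.linalg.eigvalsh(M)[-1])
```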

SLIDE 8

From Matrices to Tensors

Matrix: Pairwise Moments

E[x ⊗ x] ∈ R^{d×d} is a second-order tensor, with entries E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. For matrices: E[x ⊗ x] = E[xx⊤].

Tensor: Higher order Moments

E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor, with entries E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
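
For concreteness, an added sketch (with synthetic data) of forming the empirical versions of these moment tensors:

```python
# Sketch: empirical moment tensors from n samples of x in R^d.
import numpy as np

X = np.random.default_rng(0).standard_normal((1000, 4))    # n = 1000, d = 4

M2 = np.einsum('ni,nj->ij', X, X) / len(X)         # E[x ⊗ x], d x d
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / len(X)  # E[x ⊗ x ⊗ x], d x d x d

print(M2.shape, M3.shape)                          # (4, 4) (4, 4, 4)
```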


SLIDE 10

Tensor Norm Maximization Problem

Computationally hard for general tensors. Orthogonal tensors:
T = Σ_{i∈[k]} u_i ⊗ u_i ⊗ u_i, with u_i ⊥ u_j for i ≠ j.

Top eigenvector: max_v T(v, v, v) s.t. ‖v‖ = 1, v ∈ R^d.

  • No. of critical points: exp(d)!
  • No. of local optima: k.

Local optima: {u_i}.

Multiple local optima, but they correspond to the components!
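
A minimal sketch of this (an added example, simplified): the tensor power iteration v ← T(I, v, v) / ‖T(I, v, v)‖ converges to one of the components u_i.

```python
# Tensor power iteration sketch for orthogonal T = Σ_i u_i ⊗ u_i ⊗ u_i.
import numpy as np

def tensor_power(T, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = np.einsum('ijk,j,k->i', T, v, v)   # the contraction T(I, v, v)
        v /= np.linalg.norm(v)
    return v

# Rank-2 orthogonal tensor built from basis vectors e1, e2 in R^3.
U = np.eye(3)[:, :2]
T = sum(np.einsum('i,j,k->ijk', U[:, r], U[:, r], U[:, r]) for r in range(2))
print(np.round(tensor_power(T), 3))        # ≈ e1 or e2: one component u_i
```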

SLIDE 11

Implication: Guaranteed Tensor Decomposition

Orthogonal tensor decomposition: T = Σ_{i∈[k]} u_i ⊗ u_i ⊗ u_i.

Gradient descent (power method) recovers a local optimum {u_i}. Find all components {u_i} by deflation!

Non-orthogonal: T = Σ_i λ_i a_i ⊗ a_i ⊗ a_i. Orthogonalization via a multilinear transform W, computed using SVD on tensor slices. Requires linear independence of the a_i's.

[Diagram: W maps the non-orthogonal components a_1, a_2, a_3 to orthogonal u_1, u_2, u_3; T is depicted as a sum of rank-1 terms]

Recovery of Tensor Factorization under Mild Conditions!
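
Deflation can be sketched as follows (an added example reusing the tensor_power sketch above; it ignores estimation noise):

```python
# Deflation sketch: recover all k components of T one at a time.
import numpy as np

def decompose(T, k, power_iter):
    components = []
    for _ in range(k):
        u = power_iter(T)                              # e.g. tensor_power above
        lam = np.einsum('ijk,i,j,k->', T, u, u, u)     # eigenvalue T(u, u, u)
        components.append((lam, u))
        T = T - lam * np.einsum('i,j,k->ijk', u, u, u) # subtract rank-1 term
    return components
```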

SLIDE 12

Implementations of Spectral Methods

Spark Implementation

https://github.com/FurongHuang/SpectralLDA-TensorSpark

Single machine topic model implementation

https://bitbucket.org/megaDataLab/tensormethodsforml/overview

Single machine community detection implementation

https://github.com/FurongHuang/Fast-Detection-of-Overlapping-

Randomized Sketching for Tensors (Thursday Spotlight)

http://yining-wang.com/fftlda-code.zip

Extended BLAS kernels

Exploit BLAS extensions on CPU/GPU: work in progress.

SLIDE 13

Implications for Learning: Unsupervised Setting

GMM, HMM:

[Diagram: HMM graphical model with hidden states h1, h2, h3 emitting observations x1, x2, x3]

ICA:

[Diagram: ICA with sources h1, . . . , hk generating observations x1, . . . , xd]

Topic Models: Spectral vs. Variational

[Plot: running time on a log scale (10^0 to 10^5) for the tensor vs. variational methods on the Facebook, Yelp, DBLP-sub, and DBLP datasets]

500-fold speedup compared to variational inference. More details at Kevin Chen’s talk at 17:00

SLIDE 14

Reinforcement Learning of POMDPs

  • Partially observable Markov decision processes (POMDPs).
  • Memoryless policies with oracle access to planning.
  • Episodic learning with spectral methods.
  • First RL method for POMDPs with regret bounds!

[Diagram: POMDP with hidden states x_i, observations y_i, rewards r_i, and actions a_i]

[Plot: average reward vs. number of trials (1000 to 7000) for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy]

Reinforcement Learning of POMDPs using Spectral Methods by K. Azizzadenesheli, A. Lazaric, and A. Anandkumar.

SLIDE 15

Training Neural Networks via Tensor Methods

  • Unsupervised learning of tensor features S(x).
  • Train neural networks by tensor decomposition of E[y ⊗ S(x)].
  • First guaranteed results for training neural networks!
  • Exploits probabilistic models of the input.

Cross-moment: estimate (1/n) Σ_{i=1}^n y_i ⊗ S(x_i) from labeled data, where S(x) is the score function of the input. CP tensor decomposition of this cross-moment yields rank-1 components that are the first-layer weights.

Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods by M. Janzamin, H. Sedghi, and A. Anandkumar. Neural networks will also be discussed in Andrew Barron’s talk at 15:30.
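
A schematic of the cross-moment step (an added sketch; the score function S is left abstract here, since its form depends on the assumed input distribution):

```python
# Sketch: empirical cross-moment (1/n) Σ_i y_i ⊗ S(x_i) for scalar labels
# y_i and third-order score features S(x_i) in R^{d x d x d}.
import numpy as np

def cross_moment(y, S):
    # y: (n,) labels; S: (n, d, d, d) per-sample score tensors.
    return np.einsum('n,nijk->ijk', y, S) / len(y)

# CP decomposition of the returned d x d x d tensor (e.g. via the power
# iteration and deflation sketches above) yields the first-layer weights.
```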

SLIDE 16

Local Optima in Deep Learning?

[Diagram: a small two-layer network with inputs x1, x2, sigmoid units σ(·), weights w1, w2, and output labels y = ±1]

[Surface plots over the weights (w1(1), w1(2)): the backprop (quadratic) loss surface vs. the loss surface for our tensor method]


SLIDE 18

Analysis in High Dimensions

Loss function in deep neural networks ≈ random Gaussian polynomials (under strong assumptions):

f(x) = Σ_{i1,...,ip} c_{i1,...,ip} x_{i1} · · · x_{ip},  x ∈ R^d,  c_{i1,...,ip} ∼ N(0, 1).

Main result: all local minima have similar values.

Auffinger, A., Ben Arous, G.: Complexity of random smooth functions of many variables.

More details at Yann LeCun’s talk at 9:10AM. Caution: algorithmically still hard to optimize!

  • Basin of attraction for local optima unknown
    ◮ Exponentially many initializations needed for success?
  • Degenerate saddle points are present
    ◮ NP-hard to escape them!

SLIDE 19

Outline

1. Introduction
2. Spectral Optimization
3. Other Approaches
4. Conclusion

SLIDE 20

Problem Dependent Initialization

Compute an approximate solution via some polynomial-time method, then use it to initialize gradient descent.

Notable Success Stories

  • Dictionary learning / sparse coding: initialize with a clustering-based solution.
  • Robust PCA (matrix and tensor settings): initialize with PCA, as sketched below.

More details at Sanjeev Arora’s talk at 14:30
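
For the robust PCA case, an added sketch (PCA initialization with alternating hard thresholding, a simplification rather than the exact published algorithm):

```python
# Sketch: decompose M ≈ L + S with L rank-r and S sparse. L starts from a
# truncated SVD of M (the PCA initialization); the two steps then alternate.
import numpy as np

def robust_pca(M, r, thresh=0.1, iters=20):
    S = np.zeros_like(M)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = (U[:, :r] * s[:r]) @ Vt[:r]                   # rank-r projection
        resid = M - L
        S = np.where(np.abs(resid) > thresh, resid, 0.0)  # keep large entries
    return L, S
```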

SLIDE 21

Avoiding Saddle Points without Restarts

Non-degenerate saddle points: the Hessian has both positive and negative eigenvalues. An eigenvector with negative eigenvalue gives a direction of escape.

Second-order methods: use Hessian information to escape.

◮ Cubic regularization of Newton’s method, Nesterov & Polyak

First-order methods: noisy stochastic gradient descent works, as sketched below!

◮ Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition, R. Ge, F. Huang, C. Jin, Y. Yuan
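
A toy sketch in that spirit (an added example, not the algorithm from the paper): injected isotropic noise pushes gradient descent off the saddle of f(x, y) = x^2 - y^2 at the origin, where the plain gradient step would stay stuck.

```python
# Sketch: noise-injected gradient descent escaping a saddle point.
import numpy as np

def noisy_gd(p, lr=0.05, noise=0.01, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        g = np.array([2 * p[0], -2 * p[1]])              # ∇f at p
        p = p - lr * (g + noise * rng.standard_normal(2))
    return p

# Starting exactly at the saddle, the y-coordinate grows geometrically:
# the iterate escapes along the negative-curvature direction.
print(noisy_gd(np.zeros(2)))
```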

SLIDE 22

Convex Envelopes, Smoothing, Annealing...

  • Convex envelope: achieves the global optimum, but hard to compute (involves solving PDEs).
  • Smoothing: may not achieve the global optimum, but tractable; see the sketch below.

See Hossein Mobahi’s talk at 15:00
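
A small added sketch of the smoothing idea (Gaussian smoothing, as one instance of the family discussed here): replace f by f_σ(x) = E_z[f(x + σz)], whose gradient can be estimated by sampling.

```python
# Sketch: Monte Carlo estimate of the gradient of the Gaussian-smoothed
# f_sigma(x) = E[f(x + sigma * z)], z ~ N(0, I), using the identity
# ∇f_sigma(x) = E[f(x + sigma * z) * z] / sigma.
import numpy as np

def smoothed_grad(f, x, sigma=0.5, n=2000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, x.size))
    fx = np.array([f(x + sigma * zi) for zi in z])
    return (fx[:, None] * z).mean(axis=0) / sigma

# Example: a wiggly 1-D function whose small local minima smooth away.
f = lambda v: float(v[0] ** 2 + 0.5 * np.sin(8 * v[0]))
print(smoothed_grad(f, np.array([1.0])))
```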

Annealing Methods

A form of stochastic search via sampling-based methods. Challenge: mixing time can be exponential. See Andrew Barron’s talk at 15:30.

Sum of squares

Higher order semi-definite programs (SDP). Nice theoretical tool, but computationally intensive.

SLIDE 23

Outline

1. Introduction
2. Spectral Optimization
3. Other Approaches
4. Conclusion

SLIDE 24

Conclusion

Many approaches to analyze and deal with non-convexity

◮ Local search methods: gradient descent, trust region, . . .
◮ Problem-dependent initialization.
◮ Annealing and smoothing methods.
◮ Sum of squares (higher-order SDPs).

NP-hardness should not deter us from building new theory for non-convex optimization.

Open problems: numerous!

  • Providing an explicit characterization of tractable problems.
  • We lack a hierarchy of non-convex problems.

Looking forward to a great workshop!