Beating the Perils of Non-Convexity: Machine Learning using Tensor Methods


SLIDE 1

Beating the Perils of Non-Convexity: Machine Learning using Tensor Methods

Anima Anandkumar


Joint work with Majid Janzamin and Hanie Sedghi.

U.C. Irvine

SLIDES 2–4

Learning with Big Data

Learning is finding needle in a haystack.
High dimensional regime: as data grows, more variables!
Useful information: low-dimensional structures.
Learning with big data: ill-posed problem.
Learning with big data: statistically and computationally challenging!

SLIDE 5

Optimization for Learning

Most learning problems can be cast as optimization.

Unsupervised Learning

Clustering: k-means, hierarchical, . . .
Maximum Likelihood Estimator: probabilistic latent variable models

Supervised Learning

Optimizing a neural network with respect to a loss function

[Figure: neural network schematic with input, neuron, and output layers]

SLIDES 6–7

Convex vs. Non-convex Optimization

Progress is only the tip of the iceberg..
Real world is mostly non-convex!

Images taken from https://www.facebook.com/nonconvex

SLIDES 8–10

Convex vs. Nonconvex Optimization

Convex: unique optimum (global = local).
Nonconvex: multiple local optima; in high dimensions, possibly exponentially many local optima.
How to deal with non-convexity?

SLIDE 11

Outline

1. Introduction
2. Guaranteed Training of Neural Networks
3. Overview of Other Results on Tensors
4. Conclusion

SLIDE 12

Training Neural Networks

Tremendous practical impact with deep learning.
Algorithm: backpropagation.
Highly non-convex optimization.

SLIDES 13–15

Toy Example: Failure of Backpropagation

Labeled input samples x = (x1, x2) with labels y = 1 and y = −1. Goal: binary classification.

[Figure: two-hidden-unit network σ(·), σ(·) with inputs x1, x2, weights w1, w2, and output y]

Our method: guaranteed risk bounds for training neural networks

SLIDES 16–17

Backpropagation vs. Our Method

Weights w2 randomly drawn and fixed.

[Plot: backprop (quadratic) loss surface as a function of w1(1) and w1(2)]

[Plot: loss surface for our method as a function of w1(1) and w1(2)]

SLIDES 18–19

Overcoming Hardness of Training

In general, training a neural network is NP-hard. How does knowledge of the input distribution help?

SLIDE 20

Generative vs. Discriminative Models

[Plot: generative model p(x, y) over input data x, for classes y = 1 and y = 0]

[Plot: discriminative model p(y|x) over input data x, for classes y = 1 and y = 0]

Generative models: encode domain knowledge.
Discriminative models: good classification performance.
A neural network is a discriminative model.
Do generative models help in discriminative tasks?

SLIDES 21–22

Feature Transformation for Training Neural Networks

Feature learning: learn φ(·) from input data. How to use φ(·) to train neural networks?

[Diagram: x → φ(x) → y]

Multivariate Moments: many possibilities, . . .

E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], . . .

SLIDE 23

Tensor Notation for Higher Order Moments

Multi-variate higher order moments form tensors.
Are there spectral operations on tensors akin to PCA on matrices?

Matrix

E[x ⊗ y] ∈ R^(d×d) is a second order tensor.
E[x ⊗ y]_(i1,i2) = E[x_i1 · y_i2].
For matrices: E[x ⊗ y] = E[x y⊤].

Tensor

E[x ⊗ x ⊗ y] ∈ R^(d×d×d) is a third order tensor.
E[x ⊗ x ⊗ y]_(i1,i2,i3) = E[x_i1 · x_i2 · y_i3].
In general, E[φ(x) ⊗ y] is a tensor.
What class of φ(·) is useful for training neural networks?
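Empirically, these cross-moments are just averages of per-sample outer products. A minimal numpy sketch (my own illustration, not from the slides), assuming a scalar label y and synthetic Gaussian inputs:

```python
import numpy as np

# Hypothetical synthetic data: n samples of x in R^d with a scalar label y.
rng = np.random.default_rng(0)
n, d = 1000, 5
x = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Empirical cross-moments as averages of per-sample outer products.
M1 = np.einsum('ni,n->i', x, y) / n                   # E[y · x],          shape (d,)
M2 = np.einsum('ni,nj,n->ij', x, x, y) / n            # E[y · x ⊗ x],      shape (d, d)
M3 = np.einsum('ni,nj,nk,n->ijk', x, x, x, y) / n     # E[y · x ⊗ x ⊗ x],  shape (d, d, d)
print(M1.shape, M2.shape, M3.shape)
```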

SLIDES 24–30

Score Function Transformations

Score function for x ∈ R^d with pdf p(·):

S1(x) := −∇x log p(x)

mth-order score function:

Sm(x) := (−1)^m · ∇^(m) p(x) / p(x)

Input: x ∈ R^d. Outputs: S1(x) ∈ R^d, S2(x) ∈ R^(d×d), S3(x) ∈ R^(d×d×d).
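For a standard Gaussian input these score functions reduce to Hermite-type polynomials in x (a standard fact about Gaussian score functions, not stated on the slides). A minimal numpy sketch, assuming x ~ N(0, I):

```python
import numpy as np

def gaussian_scores(x):
    """S1, S2, S3 for a standard Gaussian input x in R^d (p = N(0, I)):
    S1(x) = x,  S2(x) = x x^T - I,
    S3(x)_ijk = x_i x_j x_k - x_i d_jk - x_j d_ik - x_k d_ij  (d_.. = Kronecker delta)."""
    d = x.shape[0]
    I = np.eye(d)
    S1 = x
    S2 = np.outer(x, x) - I
    S3 = (np.einsum('i,j,k->ijk', x, x, x)
          - np.einsum('i,jk->ijk', x, I)
          - np.einsum('j,ik->ijk', x, I)
          - np.einsum('k,ij->ijk', x, I))
    return S1, S2, S3

S1, S2, S3 = gaussian_scores(np.array([0.5, -1.0, 2.0]))
print(S1.shape, S2.shape, S3.shape)   # (3,) (3, 3) (3, 3, 3)
```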

SLIDES 31–40

Moments of a Neural Network

Label function of a one-hidden-layer network:

E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2

[Figure: network with k sigmoidal hidden units σ(·), inputs x1, . . . , xd, first-layer weights A1, output-layer weights a2, and output y]

Given labeled examples {(xi, yi)}, the score functions act as differentiation operators (a Stein-type identity):

E[y · Sm(x)] = E[∇^(m) f(x)]

The resulting cross-moments are sums of rank-1 terms in the columns (A1)j of A1:

M1 = E[y · S1(x)] = Σ_{j∈[k]} λ1,j · (A1)j
M2 = E[y · S2(x)] = Σ_{j∈[k]} λ2,j · (A1)j ⊗ (A1)j
M3 = E[y · S3(x)] = Σ_{j∈[k]} λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
                  = λ3,1 · (A1)1 ⊗ (A1)1 ⊗ (A1)1 + λ3,2 · (A1)2 ⊗ (A1)2 ⊗ (A1)2 + . . .

Why are tensors required?
Matrix decomposition recovers only the subspace spanned by the weight vectors, not the actual weights.
Tensor decomposition uniquely recovers the weights under non-degeneracy conditions.

Guaranteed learning of weights of first layer via tensor decomposition. Learning the other parameters via a Fourier technique.
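A small numpy sanity check of the first-order identity (my own illustration, not from the slides): for Gaussian input S1(x) = x, so M1 = E[y · S1(x)] should match Σ_j λ1,j (A1)j with λ1,j = a2,j · E[σ′(⟨(A1)j, x⟩ + b1,j)].

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 4, 3, 200_000
A1 = rng.normal(size=(d, k))          # columns are the first-layer weight vectors (A1)_j
b1 = rng.normal(size=k)
a2 = rng.normal(size=k)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.normal(size=(n, d))           # Gaussian input, so S1(x) = x
y = sigmoid(x @ A1 + b1) @ a2         # noiseless labels f(x); b2 = 0 for simplicity

# Empirical M1 = E[y * S1(x)]
M1_hat = (y[:, None] * x).mean(axis=0)

# Predicted M1 = sum_j lambda_{1,j} (A1)_j, with lambda_{1,j} = a2_j * E[sigma'(<(A1)_j, x> + b1_j)]
s = sigmoid(x @ A1 + b1)
lam1 = a2 * (s * (1.0 - s)).mean(axis=0)
M1_model = A1 @ lam1

print(np.max(np.abs(M1_hat - M1_model)))   # small; only Monte Carlo error remains
```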

SLIDES 41–44

NN-LiFT: Neural Network LearnIng using Feature Tensors

Input: x ∈ R^d, with third-order score function S3(x) ∈ R^(d×d×d).

Estimating the cross-moment M3 using labeled data {(xi, yi)}:

M3 ≈ (1/n) Σ_{i=1}^{n} yi ⊗ S3(xi)

CP tensor decomposition: the rank-1 components are the estimates of the columns of A1 (a minimal sketch follows below).

Fourier technique ⇒ a2, b1, b2
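A minimal sketch of the CP step via the symmetric tensor power method with deflation, one standard guaranteed route in the cited line of work (AGHKT'14). This is my own illustration, not the authors' implementation, and it assumes the tensor is (near-)orthogonally decomposable, e.g. after whitening:

```python
import numpy as np

def tensor_power_method(T, n_components, n_iter=100, n_restarts=10, seed=0):
    """Recover rank-1 components of a (near-)orthogonally decomposable
    symmetric 3rd-order tensor T by power iteration with deflation."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    weights, vectors = [], []
    T_work = T.copy()
    for _ in range(n_components):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            v = rng.normal(size=d)
            v /= np.linalg.norm(v)
            for _ in range(n_iter):
                v_new = np.einsum('ijk,j,k->i', T_work, v, v)   # contraction T(I, v, v)
                v = v_new / np.linalg.norm(v_new)
            lam = np.einsum('ijk,i,j,k->', T_work, v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        weights.append(best_lam)
        vectors.append(best_v)
        # Deflate: subtract the recovered rank-1 component
        T_work = T_work - best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return np.array(weights), np.stack(vectors, axis=1)

# Usage on an exactly orthogonal rank-2 symmetric tensor
U, _ = np.linalg.qr(np.random.default_rng(1).normal(size=(4, 4)))
u1, u2 = U[:, 0], U[:, 1]
T = 3.0 * np.einsum('i,j,k->ijk', u1, u1, u1) + 1.5 * np.einsum('i,j,k->ijk', u2, u2, u2)
lams, V = tensor_power_method(T, n_components=2)
print(np.round(lams, 3))   # approximately [3.0, 1.5]
```

In NN-LiFT the recovered unit vectors play the role of the (normalized) columns of A1.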

SLIDES 45–47

Estimation error bound

Guaranteed learning of weights of first layer via tensor decomposition:

M3 = E[y ⊗ S3(x)] = Σ_{j∈[k]} λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j

Full column rank assumption on the weight matrix A1.
Guaranteed tensor decomposition (AGHKT'14, AGJ'14).
Learning the other parameters via a Fourier technique.

Theorem (JSA'14)
For number of samples n = poly(d, k), we have w.h.p. |f(x) − f̂(x)|² ≤ Õ(1/n).

"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. Sedghi and A. Anandkumar, June 2015.

SLIDES 48–49

Our Main Result: Risk Bounds

Approximating an arbitrary function f(x) with bounded

Cf := ∫_{R^d} ‖ω‖_2 · |F(ω)| dω,   where F(ω) is the Fourier transform of f.

n samples, d input dimension, k number of neurons.

Theorem (JSA'14)
Assume Cf is small. Then E[|f(x) − f̂(x)|²] ≤ O(Cf²/k) + O(1/n).

Polynomial sample complexity n in terms of the dimensions d, k.
Computational complexity same as SGD with enough parallel processors.

"Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods" by M. Janzamin, H. Sedghi and A. Anandkumar, June 2015.

SLIDE 50

Outline

1. Introduction
2. Guaranteed Training of Neural Networks
3. Overview of Other Results on Tensors
4. Conclusion

SLIDE 51

Tractable Learning for LVMs

GMM, HMM, ICA, multiview and topic models.

[Diagrams: HMM with hidden states h1, h2, h3 and observations x1, x2, x3; ICA with sources h1, . . . , hk and observations x1, . . . , xd]
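The slide only names the model classes; as one concrete instance of a tensor-method moment identity in this family (stated here from the spectral LVM literature, e.g. AGHKT'14, so treat the exact form as my assumption rather than something on the slide), a spherical Gaussian mixture has a third-order moment that is exactly a sum of rank-1 terms in the component means. The check below uses exact population moments, no sampling:

```python
import numpy as np

# Spherical Gaussian mixture: x = mu_i + sigma * z with probability w_i, z ~ N(0, I).
# Claimed identity (from the spectral LVM literature, stated from memory):
#   M3 := E[x ⊗ x ⊗ x] - sigma^2 * sym(m ⊗ I) = sum_i w_i mu_i^⊗3,   where m := E[x].
d, k, sigma = 4, 3, 0.7
rng = np.random.default_rng(0)
w = rng.dirichlet(np.ones(k))
mu = rng.normal(size=(k, d))
I = np.eye(d)

def sym_outer(v):
    """Symmetrization of v ⊗ I over the three mode placements."""
    return (np.einsum('i,jk->ijk', v, I)
            + np.einsum('j,ik->ijk', v, I)
            + np.einsum('k,ij->ijk', v, I))

# Exact E[x ⊗ x ⊗ x]: odd moments of z vanish, E[z ⊗ z] = I.
Ex3 = sum(w[i] * (np.einsum('a,b,c->abc', mu[i], mu[i], mu[i]) + sigma**2 * sym_outer(mu[i]))
          for i in range(k))
m = w @ mu
M3 = Ex3 - sigma**2 * sym_outer(m)

target = sum(w[i] * np.einsum('a,b,c->abc', mu[i], mu[i], mu[i]) for i in range(k))
print(np.allclose(M3, target))   # True
```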

SLIDES 52–53

At Scale Tensor Computations

Randomized Tensor Sketches
Naive computation scales exponentially in the order of the tensor.
Propose randomized FFT sketches.
Computational complexity independent of tensor order.
Linear scaling in input dimension and number of samples.

Tensor Contractions with Extended BLAS Kernels on CPU and GPU
BLAS: Basic Linear Algebra Subprograms, highly optimized libraries.
Use extended BLAS to minimize data permutation and I/O calls.

(1) Fast and Guaranteed Tensor Decomposition via Sketching by Yining Wang, Hsiao-Yu Tung, Alex Smola, A. Anandkumar, NIPS 2015.
(2) Tensor Contractions with Extended BLAS Kernels on CPU and GPU by Y. Shi, U.N. Niranjan, C. Cecka, A. Mowli, A. Anandkumar.
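A toy numpy illustration of the FFT idea behind randomized tensor sketches (my own sketch of the general technique, not the authors' code): the count sketch of a rank-1 tensor u ⊗ v ⊗ w equals the circular convolution of the per-mode count sketches, so it can be computed with FFTs in O(b log b) without ever forming the d³ tensor.

```python
import numpy as np

def count_sketch(vec, h, s, b):
    """Count sketch of a vector: bucket entries by hash h, multiply by signs s."""
    c = np.zeros(b)
    np.add.at(c, h, s * vec)
    return c

rng = np.random.default_rng(0)
d, b = 50, 64                                   # input dimension, sketch length
u, v, w = rng.normal(size=(3, d))
h = rng.integers(0, b, size=(3, d))             # independent hashes per tensor mode
s = rng.choice([-1.0, 1.0], size=(3, d))        # independent signs per tensor mode

# Sketch of the rank-1 tensor u ⊗ v ⊗ w WITHOUT forming the d^3 tensor:
# circular convolution of the per-mode count sketches, via FFT.
fft, ifft = np.fft.rfft, np.fft.irfft
sketch_fast = ifft(fft(count_sketch(u, h[0], s[0], b))
                   * fft(count_sketch(v, h[1], s[1], b))
                   * fft(count_sketch(w, h[2], s[2], b)), n=b)

# Reference: sketch the explicit d x d x d tensor entry by entry.
T = np.einsum('i,j,k->ijk', u, v, w)
sign = np.einsum('i,j,k->ijk', s[0], s[1], s[2])
bucket = (h[0][:, None, None] + h[1][None, :, None] + h[2][None, None, :]) % b
sketch_slow = np.zeros(b)
np.add.at(sketch_slow, bucket, sign * T)

print(np.allclose(sketch_fast, sketch_slow))    # True
```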

SLIDE 54

Preliminary Results on Spark

In-memory processing of Spark: ideal for iterative tensor methods.

Alternating Least Squares (ALS) for tensor decomposition:

min over w, A, B, C of ‖T − Σ_{i=1}^{k} λi · A(:, i) ⊗ B(:, i) ⊗ C(:, i)‖²_F

Update rows independently (a single-machine ALS sketch follows below).

[Diagram: tensor slices and factors B, C distributed across worker 1, worker 2, . . . , worker k]

Results on the NYtimes corpus (3×10^5 documents, 10^8 words): Spark 26 mins vs. Map-Reduce 4 hrs.

Topic Modeling at Lightning Speeds via Tensor Factorization on Spark by F. Huang, A. Anandkumar, under preparation.
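A compact single-machine numpy sketch of the ALS updates for the objective above (an illustration of the algorithm, not the Spark implementation): each factor update is a linear least-squares solve against an unfolding of T.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Khatri-Rao product: column r is the flattened outer product B[:, r] ⊗ C[:, r]."""
    return np.einsum('ir,jr->ijr', B, C).reshape(-1, B.shape[1])

def cp_als(T, rank, n_iter=50, seed=0):
    """Rank-k CP decomposition of a 3rd-order tensor T by alternating least squares."""
    rng = np.random.default_rng(seed)
    d1, d2, d3 = T.shape
    A = rng.normal(size=(d1, rank))
    B = rng.normal(size=(d2, rank))
    C = rng.normal(size=(d3, rank))
    for _ in range(n_iter):
        # Each update is a least-squares solve against a mode unfolding of T.
        A = np.linalg.lstsq(khatri_rao(B, C), T.reshape(d1, -1).T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C), np.moveaxis(T, 1, 0).reshape(d2, -1).T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B), np.moveaxis(T, 2, 0).reshape(d3, -1).T, rcond=None)[0].T
    lam = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0) * np.linalg.norm(C, axis=0)
    return lam, A / np.linalg.norm(A, axis=0), B / np.linalg.norm(B, axis=0), C / np.linalg.norm(C, axis=0)

# Usage: decompose a synthetic rank-2 tensor and report the relative fit error.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.normal(size=(5, 2)), rng.normal(size=(6, 2)), rng.normal(size=(7, 2))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
lam, A, B, C = cp_als(T, rank=2)
T_hat = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))   # should be small
```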

SLIDE 55

Convolutional Tensor Decomposition

Convolutional dictionary model: x = Σ_i f*_i ∗ w*_i, reformulated as x = F* w*.

[Figure: (a) convolutional dictionary model, (b) reformulated model]

Cumulant = λ1 · (F*_1)^⊗3 + λ2 · (F*_2)^⊗3 + . . .

Efficient methods for tensor decomposition with circulant constraints.

Convolutional Dictionary Learning through Tensor Factorization by F. Huang, A. Anandkumar, June 2015.

SLIDE 56

Reinforcement Learning (RL) of POMDPs

Partially observable Markov decision processes.

Proposed Method
Consider memoryless policies.
Episodic learning: indirect exploration.
Tensor methods: careful conditioning required for learning.
First RL method for POMDPs with logarithmic regret bounds.

[Figure: POMDP graphical model with states xi, xi+1, xi+2, observations yi, yi+1, actions ai, ai+1, and rewards ri, ri+1]

[Plot: average reward vs. number of trials for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy]

Logarithmic Regret Bounds for POMDPs using Spectral Methods by K. Azzizade, A. Lazaric, A. Anandkumar, under preparation.

SLIDE 57

Outline

1. Introduction
2. Guaranteed Training of Neural Networks
3. Overview of Other Results on Tensors
4. Conclusion

SLIDES 58–59

Summary and Outlook

Summary
Tensor methods: a powerful paradigm for guaranteed large-scale machine learning.
First methods to provide provable bounds for training neural networks, many latent variable models (e.g. HMM, LDA), and POMDPs!

Outlook
Training multi-layer neural networks, models with invariances, reinforcement learning using neural networks, . . .
A unified framework for tractable non-convex methods with guaranteed convergence to global optima?

SLIDE 60

My Research Group and Resources

Podcast/lectures/papers/software available at http://newport.eecs.uci.edu/anandkumar/