SLIDE 1 Beating the Perils of Non-Convexity: Machine Learning using Tensor Methods
Anima Anandkumar
Joint work with Majid Janzamin and Hanie Sedghi.
U.C. Irvine
SLIDE 2
Learning with Big Data
Learning is finding a needle in a haystack.
SLIDE 3
Learning with Big Data
Learning is finding a needle in a haystack.
High-dimensional regime: as data grows, so does the number of variables!
Useful information: low-dimensional structures.
Learning with big data: an ill-posed problem.
SLIDE 4
Learning with Big Data
Learning is finding a needle in a haystack.
High-dimensional regime: as data grows, so does the number of variables!
Useful information: low-dimensional structures.
Learning with big data: an ill-posed problem.
Learning with big data: statistically and computationally challenging!
SLIDE 5 Optimization for Learning
Most learning problems can be cast as optimization.
Unsupervised Learning
Clustering: k-means, hierarchical, . . .
Maximum likelihood estimation in probabilistic latent variable models
Supervised Learning
Optimizing a neural network with respect to a loss function
[Diagram: network with input, hidden neurons, and output.]
SLIDE 6 Convex vs. Non-convex Optimization
Progress is only the tip of the iceberg...
Images taken from https://www.facebook.com/nonconvex
SLIDE 7 Convex vs. Non-convex Optimization
Progress is only the tip of the iceberg... The real world is mostly non-convex!
Images taken from https://www.facebook.com/nonconvex
SLIDE 8
Convex vs. Nonconvex Optimization
Convex: unique optimum (global = local). Non-convex: multiple local optima.
SLIDE 9
Convex vs. Nonconvex Optimization
Convex: unique optimum (global = local). Non-convex: multiple local optima. In high dimensions, possibly exponentially many local optima.
SLIDE 10
Convex vs. Nonconvex Optimization
Convex: unique optimum (global = local). Non-convex: multiple local optima. In high dimensions, possibly exponentially many local optima. How do we deal with non-convexity?
SLIDE 11 Outline
1. Introduction
2. Guaranteed Training of Neural Networks
3. Overview of Other Results on Tensors
4. Conclusion
SLIDE 12
Training Neural Networks
Tremendous practical impact with deep learning. Algorithm: backpropagation. Highly non-convex optimization.
SLIDE 13
Toy Example: Failure of Backpropagation
[Figure: labeled input samples in (x1, x2) with y = 1 and y = −1; a one-hidden-layer network with units σ(·) and weights w1, w2.]
Goal: binary classification.
Our method: guaranteed risk bounds for training neural networks
SLIDE 16 Backpropagation vs. Our Method
Weights w2 randomly drawn and fixed.
Backprop (quadratic) loss surface:
[Surface plot of the loss over w1(1) and w1(2).]
SLIDE 17 Backpropagation vs. Our Method
Weights w2 randomly drawn and fixed.
Backprop (quadratic) loss surface:
[Surface plot of the backprop loss over w1(1) and w1(2).]
Loss surface for our method:
[Surface plot of our method's loss over w1(1) and w1(2).]
SLIDE 18
Overcoming Hardness of Training
In general, training a neural network is NP-hard. How does knowledge of the input distribution help?
SLIDE 20 Generative vs. Discriminative Models
[Plots: a generative model p(x, y) and a discriminative model p(y|x) over input data x, for classes y = 0 and y = 1.]
Generative models: encode domain knowledge.
Discriminative models: good classification performance.
A neural network is a discriminative model.
Do generative models help in discriminative tasks?
SLIDE 21
Feature Transformation for Training Neural Networks
Feature learning: learn φ(·) from input data.
How to use φ(·) to train neural networks?
[Diagram: x → φ(x) → y]
SLIDE 22
Feature Transformation for Training Neural Networks
Feature learning: learn φ(·) from input data.
How to use φ(·) to train neural networks?
[Diagram: x → φ(x) → y]
Multivariate Moments: Many possibilities, . . .
E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], . . .
SLIDE 23
Tensor Notation for Higher Order Moments
Multivariate higher-order moments form tensors. Are there spectral operations on tensors akin to PCA on matrices?
Matrix
E[x ⊗ y] ∈ Rd×d is a second-order tensor. E[x ⊗ y]_{i1,i2} = E[x_{i1} y_{i2}]. For matrices: E[x ⊗ y] = E[xy⊤].
Tensor
E[x ⊗ x ⊗ y] ∈ Rd×d×d is a third-order tensor. E[x ⊗ x ⊗ y]_{i1,i2,i3} = E[x_{i1} x_{i2} y_{i3}]. In general, E[φ(x) ⊗ y] is a tensor. What class of φ(·) is useful for training neural networks?
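These moment tensors are easy to estimate empirically. The following sketch (added for illustration, with synthetic data and hypothetical sizes) forms the sample versions of E[x ⊗ y] and E[x ⊗ x ⊗ y] with numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, dy = 1000, 5, 3                 # hypothetical sizes: samples, input dim, label dim
X = rng.normal(size=(n, d))           # inputs x_i in R^d (synthetic)
Y = rng.normal(size=(n, dy))          # labels y_i in R^dy (synthetic)

# Empirical second-order moment E[x ⊗ y] = E[x y^T]: a d x dy matrix.
M2 = np.einsum('ni,nj->ij', X, Y) / n            # shape (d, dy)

# Empirical third-order moment E[x ⊗ x ⊗ y], with entries E[x_{i1} x_{i2} y_{i3}].
M3 = np.einsum('ni,nj,nk->ijk', X, X, Y) / n     # shape (d, d, dy)
```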
SLIDE 24
Score Function Transformations
Score function for x ∈ Rd with pdf p(·): S1(x) := −∇x log p(x)
[Diagram: input x ∈ Rd ↦ S1(x) ∈ Rd]
SLIDE 27
Score Function Transformations
Score function for x ∈ Rd with pdf p(·): S1(x) := −∇x log p(x)
mth-order score function:
[Diagram: input x ∈ Rd ↦ S1(x) ∈ Rd]
SLIDE 28
Score Function Transformations
Score function for x ∈ Rd with pdf p(·): S1(x) := −∇x log p(x)
mth-order score function: Sm(x) := (−1)^m ∇^(m) p(x) / p(x)
[Diagram: input x ∈ Rd ↦ S1(x) ∈ Rd]
SLIDE 29
Score Function Transformations
Score function for x ∈ Rd with pdf p(·): S1(x) := −∇x log p(x)
mth-order score function: Sm(x) := (−1)^m ∇^(m) p(x) / p(x)
[Diagram: input x ∈ Rd ↦ S2(x) ∈ Rd×d]
SLIDE 30
Score Function Transformations
Score function for x ∈ Rd with pdf p(·): S1(x) := −∇x log p(x)
mth-order score function: Sm(x) := (−1)^m ∇^(m) p(x) / p(x)
[Diagram: input x ∈ Rd ↦ S3(x) ∈ Rd×d×d]
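For concreteness, when the input distribution is Gaussian the score functions have closed forms: S1(x) = Σ⁻¹(x − μ), and the higher orders are Hermite-like polynomials. The sketch below (an illustration assuming Gaussian inputs, not part of the original slides) computes S1 and S2 for a general Gaussian and S3 for a standard Gaussian:

```python
import numpy as np

def gaussian_scores(x, mu, Sigma):
    """1st and 2nd order score functions for x ~ N(mu, Sigma).

    S1(x) = -grad log p(x) = Sigma^{-1} (x - mu)
    S2(x) = grad^2 p(x) / p(x) = S1(x) S1(x)^T - Sigma^{-1}
    """
    prec = np.linalg.inv(Sigma)
    s1 = prec @ (x - mu)
    s2 = np.outer(s1, s1) - prec
    return s1, s2

def gaussian_score3_standard(x):
    """3rd-order score for x ~ N(0, I): the 3rd Hermite tensor
    S3(x)_{ijk} = x_i x_j x_k - x_i d_{jk} - x_j d_{ik} - x_k d_{ij}."""
    d = x.shape[0]
    eye = np.eye(d)
    s3 = np.einsum('i,j,k->ijk', x, x, x)
    s3 -= np.einsum('i,jk->ijk', x, eye)
    s3 -= np.einsum('j,ik->ijk', x, eye)
    s3 -= np.einsum('k,ij->ijk', x, eye)
    return s3

# Example: standard Gaussian input in R^4
x = np.random.default_rng(1).normal(size=4)
s1, s2 = gaussian_scores(x, np.zeros(4), np.eye(4))
s3 = gaussian_score3_standard(x)
```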
SLIDE 31 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
[Figure: one-hidden-layer network with input x = (x1, . . . , xd), k hidden units σ(·), first-layer weights A1, and output weights a2.]
SLIDE 32 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M1 = E[y · S1(x)] = ∑j λ1,j · uj
SLIDE 33 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M1 = E[y · S1(x)] = ∑j λ1,j · (A1)j
SLIDE 34 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M1 = E[y · S1(x)] = ∑j λ1,j · (A1)j = λ1,1 (A1)1 + λ1,2 (A1)2 + · · ·
SLIDE 35 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M2 = E[y · S2(x)] = ∑j λ2,j · (A1)j ⊗ (A1)j
SLIDE 36 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M2 = E[y · S2(x)] = ∑j λ2,j · (A1)j ⊗ (A1)j = λ2,1 (A1)1 ⊗ (A1)1 + λ2,2 (A1)2 ⊗ (A1)2 + · · ·
SLIDE 37 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M3 = E[y · S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
SLIDE 38 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M3 = E[y · S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j = λ3,1 (A1)1 ⊗ (A1)1 ⊗ (A1)1 + λ3,2 (A1)2 ⊗ (A1)2 ⊗ (A1)2 + · · ·
SLIDE 39 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M3 = E[y · S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
Why are tensors required?
Matrix decomposition recovers only the subspace spanned by the weights, not the actual weights. Tensor decomposition uniquely recovers the weights under non-degeneracy conditions.
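A quick way to see the first point: any orthogonal rotation of the (weight-scaled) components leaves the second-order moment unchanged, while the third-order tensor generically changes. The toy numpy sketch below, with made-up components, illustrates this.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
A = rng.normal(size=(d, k))                 # planted components (columns)
lam = rng.uniform(1.0, 2.0, size=k)         # positive weights

M2 = (A * lam) @ A.T                        # sum_j lam_j a_j a_j^T

# Rotate the weight-scaled components by any orthogonal Q:
Q, _ = np.linalg.qr(rng.normal(size=(k, k)))
B = A @ np.diag(np.sqrt(lam)) @ Q           # alternative components, unit weights

print(np.allclose(M2, B @ B.T))             # True: the matrix cannot tell A from B

# The third-order tensors of the two parameterizations differ:
M3     = np.einsum('j,ij,kj,lj->ikl', lam, A, A, A)
M3_alt = np.einsum('ij,kj,lj->ikl', B, B, B)
print(np.allclose(M3, M3_alt))              # False (generically)
```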
SLIDE 40 Moments of a Neural Network
E[y|x] = f(x) = a2⊤ σ(A1⊤ x + b1) + b2
Given labeled examples {(xi, yi)}: E[y · Sm(x)] = E[∇^(m) f(x)]
M3 = E[y · S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
Guaranteed learning of the first-layer weights via tensor decomposition. The other parameters are learned via a Fourier technique.
SLIDE 41
NN-LiFT: Neural Network LearnIng using Feature Tensors
[Diagram: input x ∈ Rd ↦ S3(x) ∈ Rd×d×d]
SLIDE 42 NN-LiFT: Neural Network LearnIng using Feature Tensors
[Diagram: input x ∈ Rd ↦ S3(x) ∈ Rd×d×d]
Estimating M3 using labeled data {(xi, yi)}: cross-moment M̂3 = (1/n) ∑_{i=1}^n yi ⊗ S3(xi)
SLIDE 43 NN-LiFT: Neural Network LearnIng using Feature Tensors
[Diagram: input x ∈ Rd ↦ S3(x) ∈ Rd×d×d]
Estimating M3 using labeled data {(xi, yi)}: cross-moment M̂3 = (1/n) ∑_{i=1}^n yi ⊗ S3(xi)
CP tensor decomposition: M̂3 ≈ sum of rank-1 components.
The rank-1 components are the estimates of the columns of A1.
SLIDE 44 NN-LiFT: Neural Network LearnIng using Feature Tensors
[Diagram: input x ∈ Rd ↦ S3(x) ∈ Rd×d×d]
Estimating M3 using labeled data {(xi, yi)}: cross-moment M̂3 = (1/n) ∑_{i=1}^n yi ⊗ S3(xi)
CP tensor decomposition: M̂3 ≈ sum of rank-1 components.
The rank-1 components are the estimates of the columns of A1.
Fourier technique ⇒ a2, b1, b2
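Below is a minimal sketch of the tensor stage of this pipeline under simplifying assumptions: the λ3,j and A1 are planted synthetically, the columns of A1 are taken orthonormal so no whitening step is needed, and the CP decomposition is computed with a plain tensor power iteration plus deflation rather than the robust, whitened method of AGHKT'14 used by NN-LiFT. In practice M3 would be the empirical cross-moment (1/n) ∑ yi ⊗ S3(xi) from the previous slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3

# Hypothetical first-layer weights with orthonormal columns (an assumption made
# purely to keep this sketch short) and stand-in lambda_{3,j} coefficients.
A1, _ = np.linalg.qr(rng.normal(size=(d, k)))
lam = np.array([3.0, 2.0, 1.0])

# Suppose M3 has already been estimated; here we build its population value.
M3 = np.einsum('j,ij,kj,lj->ikl', lam, A1, A1, A1)

def tensor_power(T, n_iter=100, seed=0):
    """One component of a symmetric tensor via u <- T(I, u, u) / ||T(I, u, u)||."""
    u = np.random.default_rng(seed).normal(size=T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        u = np.einsum('ikl,k,l->i', T, u, u)
        u /= np.linalg.norm(u)
    weight = np.einsum('ikl,i,k,l->', T, u, u, u)
    return weight, u

# Recover the k rank-1 components by power iteration plus deflation.
T = M3.copy()
for _ in range(k):
    w, u = tensor_power(T)
    print(w, np.abs(A1.T @ u).max())         # weight and |cosine| with closest column
    T -= w * np.einsum('i,k,l->ikl', u, u, u)
```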
SLIDE 45 Estimation error bound
Guaranteed learning of the first-layer weights via tensor decomposition.
M3 = E[y ⊗ S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
Full-column-rank assumption on the weight matrix A1.
Guaranteed tensor decomposition (AGHKT’14, AGJ’14).
SLIDE 46 Estimation error bound
Guaranteed learning of the first-layer weights via tensor decomposition.
M3 = E[y ⊗ S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
Full-column-rank assumption on the weight matrix A1.
Guaranteed tensor decomposition (AGHKT’14, AGJ’14).
Learning the other parameters via a Fourier technique.
SLIDE 47 Estimation error bound
Guaranteed learning of the first-layer weights via tensor decomposition.
M3 = E[y ⊗ S3(x)] = ∑j λ3,j · (A1)j ⊗ (A1)j ⊗ (A1)j
Full-column-rank assumption on the weight matrix A1.
Guaranteed tensor decomposition (AGHKT’14, AGJ’14).
Learning the other parameters via a Fourier technique.
Theorem (JSA’14)
For number of samples n = poly(d, k), we have w.h.p. |f(x) − f̂(x)|^2 ≤ Õ(1/n).
“Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods” by M. Janzamin, H. Sedghi and A., June 2015.
SLIDE 48 Our Main Result: Risk Bounds
Approximating an arbitrary function f(x) with bounded Cf.
n: number of samples; d: input dimension; k: number of neurons.
SLIDE 49 Our Main Result: Risk Bounds
Approximating an arbitrary function f(x) with bounded Cf.
n: number of samples; d: input dimension; k: number of neurons.
Theorem (JSA’14)
Assume Cf is small. Then E[|f(x) − f̂(x)|^2] ≤ O(Cf^2/k) + O(1/n).
Polynomial sample complexity n in terms of the dimensions d, k. Computational complexity: same as SGD with enough parallel processors.
“Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods” by M. Janzamin, H. Sedghi and A., June 2015.
SLIDE 50 Outline
1. Introduction
2. Guaranteed Training of Neural Networks
3. Overview of Other Results on Tensors
4. Conclusion
SLIDE 51
Tractable Learning for LVMs
[Graphical model diagrams: GMM; HMM with hidden states h1, h2, h3 and observations x1, x2, x3; ICA with sources h1, . . . , hk and observations x1, . . . , xd.]
Multiview and Topic Models
SLIDE 52 At Scale Tensor Computations
Randomized Tensor Sketches
Naive computation scales exponentially in the order of the tensor.
Propose randomized FFT sketches.
Computational complexity independent of the tensor order.
Linear scaling in input dimension and number of samples.
(1) Fast and Guaranteed Tensor Decomposition via Sketching by Yining Wang, Hsiao-Yu Tung, Alex Smola, A., NIPS 2015. (2) Tensor Contractions with Extended BLAS Kernels on CPU and GPU by Y. Shi, U.N. Niranjan, C. Cecka, A. Mowli, A.
SLIDE 53 At Scale Tensor Computations
Randomized Tensor Sketches
Naive computation scales exponentially in the order of the tensor.
Propose randomized FFT sketches.
Computational complexity independent of the tensor order.
Linear scaling in input dimension and number of samples.
Tensor Contractions with Extended BLAS Kernels on CPU and GPU
BLAS: Basic Linear Algebra Subprograms, highly optimized libraries. Use extended BLAS to minimize data permutation and I/O calls.
(1) Fast and Guaranteed Tensor Decomposition via Sketching by Yining Wang, Hsiao-Yu Tung, Alex Smola, A., NIPS 2015. (2) Tensor Contractions with Extended BLAS Kernels on CPU and GPU by Y. Shi, U.N. Niranjan, C. Cecka, A. Mowli, A.
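As a rough illustration of the FFT-sketch idea (a standard count-sketch construction for rank-1 tensors, not the implementation from the cited papers), the sketch of u ⊗ v ⊗ w can be computed by circularly convolving the count sketches of u, v, and w, so the d³ tensor is never materialized. All names and sizes below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 10, 64                      # input dimension, sketch length

# One independent count-sketch (hash + sign) per tensor mode.
h = [rng.integers(0, b, size=d) for _ in range(3)]
s = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]

def count_sketch(v, mode):
    out = np.zeros(b)
    np.add.at(out, h[mode], s[mode] * v)
    return out

def sketch_rank1(u, v, w):
    """FFT-based sketch of u ⊗ v ⊗ w: circular convolution of the factor sketches."""
    f = np.fft.rfft(count_sketch(u, 0)) \
        * np.fft.rfft(count_sketch(v, 1)) \
        * np.fft.rfft(count_sketch(w, 2))
    return np.fft.irfft(f, n=b)

# Sanity check against sketching the explicit d x d x d tensor entry by entry.
u, v, w = rng.normal(size=(3, d))
T = np.einsum('i,j,k->ijk', u, v, w)
direct = np.zeros(b)
for i in range(d):
    for j in range(d):
        for k in range(d):
            idx = (h[0][i] + h[1][j] + h[2][k]) % b
            direct[idx] += s[0][i] * s[1][j] * s[2][k] * T[i, j, k]

print(np.allclose(direct, sketch_rank1(u, v, w)))   # True
```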
SLIDE 54 Preliminary Results on Spark
In-memory processing in Spark: ideal for iterative tensor methods.
Alternating Least Squares for tensor decomposition:
min_{w,A,B,C} ‖T − ∑_{i=1}^k λi A(:, i) ⊗ B(:, i) ⊗ C(:, i)‖_F
Update rows independently.
[Diagram: tensor slices and factor matrices B, C distributed across workers 1, . . . , k.]
Results on the NYTimes corpus (3 × 10^5 documents, 10^8 words): Spark: 26 mins; Map-Reduce: 4 hrs.
Topic Modeling at Lightning Speeds via Tensor Factorization on Spark by F. Huang, A., under preparation.
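As an illustration of one ALS update (independent of the Spark implementation cited above), the sketch below solves the least-squares problem for factor A against the mode-1 unfolding of the tensor and the Khatri–Rao product of B and C; the rows of the solution are independent, which is what the per-worker row updates exploit. The data is synthetic.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (d2*d3) x k."""
    return np.einsum('ir,jr->ijr', B, C).reshape(-1, B.shape[1])

def als_step(T, B, C):
    """Update A in min ||T - sum_r A(:,r) ⊗ B(:,r) ⊗ C(:,r)||_F with B, C fixed."""
    T1 = T.reshape(T.shape[0], -1)               # mode-1 unfolding, d1 x (d2*d3)
    KR = khatri_rao(B, C)                        # (d2*d3) x k
    A_new, *_ = np.linalg.lstsq(KR, T1.T, rcond=None)
    return A_new.T                               # d1 x k; each row solved independently

# Tiny synthetic check: with B, C fixed at the truth, one step recovers a valid A.
rng = np.random.default_rng(0)
d, k = 6, 2
A0, B0, C0 = rng.normal(size=(3, d, k))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A = als_step(T, B0, C0)
print(np.allclose(np.einsum('ir,jr,kr->ijk', A, B0, C0), T))  # True
```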
SLIDE 55 Convolutional Tensor Decomposition
[Figure: (a) convolutional dictionary model x = ∑_i F*_i ∗ w*_i; (b) reformulated model.]
Cumulant = λ1 (F*_1)^⊗3 + λ2 (F*_2)^⊗3 + · · ·
Efficient methods for tensor decomposition with circulant constraints.
Convolutional Dictionary Learning through Tensor Factorization by F. Huang, A., June 2015.
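The reformulation rests on the identity that circular convolution with a filter equals multiplication by the circulant matrix built from that filter. A small numpy check of this identity (with a made-up filter and coefficients, unrelated to the actual algorithm in the cited paper) is below.

```python
import numpy as np

def circulant(f, n):
    """n x n circulant matrix whose first column is the zero-padded filter f."""
    col = np.zeros(n)
    col[:len(f)] = f
    return np.stack([np.roll(col, shift) for shift in range(n)], axis=1)

rng = np.random.default_rng(0)
n = 8
f = rng.normal(size=3)          # stand-in filter F*
w = rng.normal(size=n)          # stand-in coefficient sequence w*

# x = f (*) w (circular convolution) can equivalently be written as Circ(f) @ w.
fp = np.pad(f, (0, n - len(f)))
x_conv = np.real(np.fft.ifft(np.fft.fft(fp) * np.fft.fft(w)))
x_mat = circulant(f, n) @ w
print(np.allclose(x_conv, x_mat))   # True
```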
SLIDE 56 Reinforcement Learning (RL) of POMDPs
Partially observable Markov decision processes.
Proposed Method
Consider memoryless policies.
Episodic learning: indirect exploration.
Tensor methods: careful conditioning required for learning.
First RL method for POMDPs with logarithmic regret bounds.
[Diagram: POMDP with hidden states xi, xi+1, xi+2, observations yi, yi+1, rewards ri, ri+1, and actions ai, ai+1.]
[Plot: average reward vs. number of trials for SM-UCRL-POMDP, UCRL-MDP, Q-learning, and a random policy.]
Logarithmic Regret Bounds for POMDPs using Spectral Methods by K. Azzizade, A. Lazaric, A., under preparation.
SLIDE 57 Outline
1. Introduction
2. Guaranteed Training of Neural Networks
3. Overview of Other Results on Tensors
4. Conclusion
SLIDE 58
Summary and Outlook
Summary
Tensor methods: a powerful paradigm for guaranteed large-scale machine learning.
First methods to provide provable bounds for training neural networks, many latent variable models (e.g., HMM, LDA), and POMDPs!
SLIDE 59
Summary and Outlook
Summary
Tensor methods: a powerful paradigm for guaranteed large-scale machine learning.
First methods to provide provable bounds for training neural networks, many latent variable models (e.g., HMM, LDA), and POMDPs!
Outlook
Training multi-layer neural networks, models with invariances, reinforcement learning using neural networks, . . .
A unified framework for tractable non-convex methods with guaranteed convergence to global optima?
SLIDE 60
My Research Group and Resources
Podcast/lectures/papers/software available at http://newport.eecs.uci.edu/anandkumar/