SLIDE 1

Tensor Methods for Feature Learning

Anima Anandkumar

U.C. Irvine

SLIDE 2

Feature Learning For Efficient Classification

Find good transformations of input for improved classification

Figures used are attributed to Fei-Fei Li, Rob Fergus, Antonio Torralba, et al.

SLIDES 3–6

Principles Behind Feature Learning

Classification/regression tasks: predict y given x. Find a feature transform φ(x) that better predicts y. Feature learning: learn φ(·) from data.

Learning φ(x) from Labeled vs. Unlabeled Samples

Labeled samples {xi, yi} and unlabeled samples {xi}. Labeled samples should lead to better feature learning of φ(·) but are harder to obtain. Learn features φ(x) through latent variables related to x and y.

SLIDES 7–12

Conditional Latent Variable Models: Two Cases

Multi-layer Neural Networks

E[y|x] = σ(Ad σ(Ad−1 σ(· · · A2 σ(A1 x))))

Mixture of Classifiers or GLMs

G(x) := E[y|x, h] = σ(⟨Uh, x⟩ + ⟨b, h⟩), with hidden choice variable h.
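To make the two parameterizations concrete, here is a minimal numpy sketch of a forward pass through both conditional models; the dimensions, random weights, and the sigmoid choice for σ are illustrative assumptions, not values from the talk.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, k, r = 16, 8, 3
rng = np.random.default_rng(0)
x = rng.standard_normal(d)

# Multi-layer neural network: E[y|x] = sigma(A2 sigma(A1 x))
# (two layers shown; deeper stacks just repeat the pattern).
A1 = rng.standard_normal((k, d))
A2 = rng.standard_normal((1, k))
y_nn = sigmoid(A2 @ sigmoid(A1 @ x))

# Mixture of GLMs: h is a one-hot choice over r classifiers;
# E[y|x, h] = sigma(<U h, x> + <b, h>) selects column h of U and bias b_h.
U = rng.standard_normal((d, r))
b = rng.standard_normal(r)
h = np.eye(r)[rng.integers(r)]          # one-hot hidden variable
y_mix = sigmoid((U @ h) @ x + b @ h)
```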

SLIDE 13

Challenges in Learning LVMs

Challenge: Identifiability Conditions

When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?

Computational Challenges

Maximum-likelihood estimation is NP-hard in most scenarios. In practice, local search approaches such as back-propagation, EM, and variational Bayes have no consistency guarantees.

Sample Complexity

Sample complexity needs to be low in the high-dimensional regime.

Guaranteed and efficient learning through tensor methods

SLIDE 14

Outline

1. Introduction
2. Spectral and Tensor Methods
3. Generative Models for Feature Learning
4. Proposed Framework
5. Conclusion

SLIDE 15

Classical Spectral Methods: Matrix PCA and CCA

Unsupervised Setting: PCA

For centered samples {xi}, find the projection P with Rank(P) = k solving min_P (1/n) Σ_{i∈[n]} ‖xi − P xi‖². Result: eigen-decomposition of S = Cov(x).

Supervised Setting: CCA

For centered samples {xi, yi}, find max_{a,b} a⊤ Ê[xy⊤] b / √(a⊤ Ê[xx⊤] a · b⊤ Ê[yy⊤] b). Result: generalized eigen-decomposition, yielding the correlated projections ⟨a, x⟩ and ⟨b, y⟩.
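As a concrete reference for both decompositions, here is a minimal numpy/scipy sketch; the synthetic data, shapes, and regularization-free formulation are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
n, dx, dy, k = 5000, 10, 6, 3
X = rng.standard_normal((n, dx)) @ rng.standard_normal((dx, dx))
Y = X[:, :dy] + 0.1 * rng.standard_normal((n, dy))
X -= X.mean(0); Y -= Y.mean(0)           # center the samples

# PCA: rank-k projection from the top eigenvectors of S = Cov(x).
S = X.T @ X / n
evals, evecs = np.linalg.eigh(S)
Uk = evecs[:, -k:]                       # top-k eigenvectors
P = Uk @ Uk.T                            # minimizes (1/n) sum ||xi - P xi||^2

# CCA: generalized eigen-decomposition of the cross-covariance
# against the auto-covariances.
Cxy = X.T @ Y / n
Cxx = X.T @ X / n
Cyy = Y.T @ Y / n
# Solve (Cxy Cyy^{-1} Cyx) a = rho^2 Cxx a as a generalized eigenproblem.
M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
rho2, A = eigh(M, Cxx)                   # generalized eigenpairs, ascending
a = A[:, -1]                             # top canonical direction for x
b = np.linalg.solve(Cyy, Cxy.T @ a)      # matching direction for y (up to scale)
```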

SLIDE 16

Beyond SVD: Spectral Methods on Tensors

How can we learn mixture models without separation constraints?

◮ PCA uses covariance matrix of data. Are higher order moments helpful?

Unified framework?

◮ Moment-based estimation of probabilistic latent variable models?

SVD gives spectral decomposition of matrices.

◮ What are the analogues for tensors?

SLIDE 17

Moment Matrices and Tensors

Multivariate Moments

M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].

Matrix

E[x ⊗ x] ∈ R^{d×d} is a second-order tensor, with E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. As a matrix: E[x ⊗ x] = E[xx⊤].

Tensor

E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor, with E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
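A minimal numpy sketch of the empirical versions of these moments (the sample count and dimension are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10000, 5
X = rng.standard_normal((n, d))

M1 = X.mean(axis=0)                                  # E[x],         shape (d,)
M2 = np.einsum('ni,nj->ij', X, X) / n                # E[x ⊗ x],     shape (d, d)
M3 = np.einsum('ni,nj,nk->ijk', X, X, X,
               optimize=True) / n                    # E[x ⊗ x ⊗ x], shape (d, d, d)

assert np.allclose(M2, X.T @ X / n)                  # E[x ⊗ x] = E[xx^T] as a matrix
```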

SLIDE 18

Spectral Decomposition of Tensors

M2 = Σ_i λi ui ⊗ vi

(Diagram: matrix M2 = λ1 u1 ⊗ v1 + λ2 u2 ⊗ v2 + · · · )

M3 = Σ_i λi ui ⊗ vi ⊗ wi

(Diagram: tensor M3 = λ1 u1 ⊗ v1 ⊗ w1 + λ2 u2 ⊗ v2 ⊗ w2 + · · · )

u ⊗ v ⊗ w is a rank-1 tensor since its (i1, i2, i3)-th entry is u_{i1} v_{i2} w_{i3}. Guaranteed recovery (Anandkumar et al. 2012, Zhang & Golub 2001).
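For the symmetric case M3 = Σ_i λi ui ⊗ ui ⊗ ui that recurs throughout the talk, the decomposition can be computed by tensor power iteration with deflation. A minimal numpy sketch, assuming an orthogonally decomposable tensor (the whitening step needed in general is omitted):

```python
import numpy as np

def tensor_power_method(T, n_components, n_iter=100, seed=0):
    """Recover (lambda_i, u_i) from T ≈ sum_i lambda_i u_i⊗u_i⊗u_i."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    lams, us = [], []
    for _ in range(n_components):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        for _ in range(n_iter):
            v = np.einsum('ijk,j,k->i', T, v, v)    # contraction T(I, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)  # eigenvalue T(v, v, v)
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate the component
        lams.append(lam); us.append(v)
    return np.array(lams), np.array(us)

# Sanity check on a synthetic orthogonal tensor:
d, k = 6, 3
U, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((d, k)))
T = sum(np.einsum('i,j,k->ijk', U[:, i], U[:, i], U[:, i]) for i in range(k))
lams, us = tensor_power_method(T, k)
print(lams)                                         # each ≈ 1
```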

SLIDES 19–20

Moment Tensors for Conditional Models

Multivariate Moments: Many possibilities...

E[x ⊗ y], E[x ⊗ x ⊗ y], E[φ(x) ⊗ y], . . .

Feature Transformations of the Input: x → φ(x)

How to exploit them? Are moments E[φ(x) ⊗ y] useful? If φ(x) is a matrix/tensor, we have matrix/tensor moments and can carry out spectral decomposition of the moments. Construct φ(x) based on the input distribution?

SLIDES 22–23

Score Function of Input Distribution

Score function S(x) := −∇ log p(x)

(Figures: (a) a 1-d PDF p(x) = (1/Z) exp(−E(x)); (b) its 1-d score ∂/∂x log p(x) = −(∂/∂x) E(x); and a 2-d score field. Figures from Alain and Bengio 2014.)
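To ground the definition, here is a minimal numpy sketch computing the score in closed form for a Gaussian density, checked against a finite-difference gradient; the particular density and test point are illustrative assumptions.

```python
import numpy as np

def score_gaussian(x, mu, Sigma):
    """S(x) = -grad log p(x) = Sigma^{-1} (x - mu) for a Gaussian density."""
    return np.linalg.solve(Sigma, x - mu)

d = 3
mu, Sigma = np.zeros(d), np.eye(d)
x = np.array([0.5, -1.0, 2.0])

def log_p(x):   # unnormalized log-density; the constant Z drops out of the gradient
    return -0.5 * (x - mu) @ np.linalg.solve(Sigma, x - mu)

eps = 1e-6
fd = np.array([(log_p(x + eps*e) - log_p(x - eps*e)) / (2*eps) for e in np.eye(d)])
assert np.allclose(score_gaussian(x, mu, Sigma), -fd, atol=1e-4)
```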

SLIDES 24–27

Why Score Function Features?

S(x) := −∇ log p(x). Utilizes generative models of the input. Can be learnt from unlabeled data. Score matching methods work for non-normalized models.

Approximation of the score function using denoising auto-encoders:

∇ log p(x) ≈ (r∗(x + n) − x) / σ²

Recall our goal: construct moments E[y ⊗ φ(x)]. Beyond vector features?
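A minimal numpy illustration of this approximation in one dimension, where the optimal denoiser r∗ has a closed form for a Gaussian input; the prior variance, noise level, and averaging over noise draws are illustrative assumptions, not part of the talk.

```python
import numpy as np

# x ~ N(0, s^2), corrupted by n ~ N(0, sigma^2). The optimal (MMSE) denoiser is
# r*(x_tilde) = s^2 / (s^2 + sigma^2) * x_tilde, and for this density the true
# score is grad log p(x) = -x / s^2.
s2, sigma2 = 4.0, 0.01
rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 7)                       # a few clean inputs
n = rng.normal(0.0, np.sqrt(sigma2), size=(1000000, 1))

r_star = (s2 / (s2 + sigma2)) * (xs + n)         # optimal denoiser on noisy input
score_approx = ((r_star - xs) / sigma2).mean(0)  # average residual over the noise
score_true = -xs / s2
print(np.c_[score_approx, score_true])           # columns agree closely
```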

SLIDE 28

Matrix and Tensor-valued Features

Higher order score functions

Sm(x) := (−1)^m ∇(m)p(x) / p(x). Can be a matrix or a tensor instead of a vector. Can be used to construct matrix and tensor moments E[y ⊗ φ(x)].
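For a standard Gaussian input these higher-order scores have closed forms (the Hermite polynomials in x); a minimal numpy sketch of S1, S2, S3 under that assumed input distribution:

```python
import numpy as np

def gaussian_scores(x):
    """S1, S2, S3 for x ~ N(0, I): S_m(x) = (-1)^m grad^(m) p(x) / p(x)."""
    d = x.shape[0]
    I = np.eye(d)
    S1 = x                                           # vector
    S2 = np.outer(x, x) - I                          # matrix
    S3 = (np.einsum('i,j,k->ijk', x, x, x)           # third-order tensor
          - np.einsum('i,jk->ijk', x, I)
          - np.einsum('j,ik->ijk', x, I)
          - np.einsum('k,ij->ijk', x, I))
    return S1, S2, S3

S1, S2, S3 = gaussian_scores(np.array([1.0, -2.0, 0.5]))
print(S2.shape, S3.shape)    # (3, 3) and (3, 3, 3)
```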

SLIDES 30–34

Operations on Score Function Features

Form the cross-moments: E[y ⊗ Sm(x)].

Our result

E[y ⊗ Sm(x)] = E[∇(m) G(x)], where G(x) := E[y|x]. This is an extension of Stein’s lemma.

Extract discriminative directions through spectral decomposition:

E[y ⊗ Sm(x)] = E[∇(m) G(x)] = Σ_{j∈[k]} λj · uj ⊗ uj ⊗ · · · ⊗ uj (m times).

Construct features σ(uj⊤ x) for some nonlinearity σ.
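A minimal numpy check of the m = 1 case of this identity for a standard Gaussian input, where S1(x) = x; the particular G and the Monte Carlo setup are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 1000000
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

G = lambda X: np.tanh(X @ u)                 # G(x) = E[y|x]; here y = G(x) + noise
X = rng.standard_normal((n, d))              # x ~ N(0, I), so S1(x) = x
y = G(X) + 0.1 * rng.standard_normal(n)      # label noise averages out of the moment

lhs = (y[:, None] * X).mean(0)               # E[y ⊗ S1(x)]
sech2 = 1.0 - np.tanh(X @ u) ** 2            # d/dz tanh(z) = sech^2(z)
rhs = (sech2[:, None] * u).mean(0)           # E[grad G(x)] = E[sech^2(<u,x>)] u
print(np.max(np.abs(lhs - rhs)))             # small: the two moment estimates agree
```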

SLIDE 35

Automated Extraction of Discriminative Features

SLIDE 36

Learning Mixtures of Classifiers/GLMs

A mixture of r classifiers, with hidden choice variable h ∈ {e1, . . . , er}:

E[y|x, h] = g(⟨Uh, x⟩ + ⟨b, h⟩)

∗ U = [u1|u2| . . . |ur] are the weight vectors of the GLMs.
∗ b is the vector of biases.

M3 = E[y · S3(x)] = Σ_{i∈[r]} λi · ui ⊗ ui ⊗ ui.

First results for learning non-linear mixtures using spectral methods.
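An end-to-end numpy check of this moment structure, under assumptions not fixed by the slide: a standard Gaussian input (so S3 is the Hermite form shown earlier), a tanh link g, zero biases, and orthonormal ui.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 8, 3, 500000
U, _ = np.linalg.qr(rng.standard_normal((d, r)))   # orthonormal weight vectors u_i
X = rng.standard_normal((n, d))                    # x ~ N(0, I)
h = rng.integers(0, r, size=n)                     # hidden choice of classifier
y = np.tanh(np.einsum('ni,in->n', X, U[:, h]))     # y = g(<u_h, x>), zero biases

# Empirical M3 = E[y * S3(x)], using the standard-Gaussian (Hermite) score S3.
I = np.eye(d)
yx = (y[:, None] * X).mean(0)
M3 = np.einsum('n,ni,nj,nk->ijk', y, X, X, X, optimize=True) / n
M3 -= (np.einsum('i,jk->ijk', yx, I) + np.einsum('j,ik->ijk', yx, I)
       + np.einsum('k,ij->ijk', yx, I))

# Verify the claimed structure: M3 ≈ sum_i lambda_i u_i ⊗ u_i ⊗ u_i.
lams = np.einsum('ijk,in,jn,kn->n', M3, U, U, U)   # lambda_i = M3(u_i, u_i, u_i)
M3_fit = np.einsum('n,in,jn,kn->ijk', lams, U, U, U)
print(lams)                                        # near-equal, nonzero lambdas
print(np.linalg.norm(M3 - M3_fit) / np.linalg.norm(M3))  # shrinks as n grows
```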

SLIDE 37

Learning Multi-layer Neural Networks

E[y|x] = ⟨a2, σ(A1⊤ x)⟩.

(Diagram: input x, hidden layer h with first-layer weights A1, output y with weights a2.)

Our result

M3 = E[y · S3(x)] = Σ_{i∈[r]} λi · A1,i ⊗ A1,i ⊗ A1,i,

where A1,i is the i-th column of A1.

SLIDE 38

Framework Applied to MNIST

Pipeline (a runnable sketch follows):

1. Unlabeled data {xi} → auto-encoder → reconstruction r(x).
2. Compute the score function Sm(x) using r(x).
3. Labeled data {(xi, yi)}: form the cross-moment E[y · Sm(x)].
4. Spectral/tensor method: decompose the tensor T = u1⊗3 + u2⊗3 + · · · to obtain the uj’s.
5. Train an SVM with the features σ(⟨xi, uj⟩).
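A compact, runnable sketch of the pipeline on synthetic data standing in for MNIST; the Gaussian input (which makes the score available in closed form instead of via an auto-encoder), the label model, and all dimensions are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d, k, n = 10, 2, 100000

# Synthetic stand-in for MNIST: Gaussian inputs, labels from a hidden mixture of
# classifiers, y = 1{ g(<u_h, x>) + noise > 1/2 } with g the sigmoid.
U_true, _ = np.linalg.qr(rng.standard_normal((d, k)))
X = rng.standard_normal((n, d))
h = rng.integers(0, k, size=n)
g = 1.0 / (1.0 + np.exp(-np.einsum('ni,in->n', X, U_true[:, h])))
y = (g + 0.1 * rng.standard_normal(n) > 0.5).astype(float)

# Steps 1-2 collapse here: the input is Gaussian by construction, so S3(x) is the
# closed-form Hermite score; with real data these steps use an auto-encoder r(x).
I = np.eye(d)
yx = (y[:, None] * X).mean(0)
T = np.einsum('n,ni,nj,nk->ijk', y, X, X, X, optimize=True) / n
T -= (np.einsum('i,jk->ijk', yx, I) + np.einsum('j,ik->ijk', yx, I)
      + np.einsum('k,ij->ijk', yx, I))           # step 3: T = E[y * S3(x)]

# Step 4: tensor power iteration with deflation extracts the hidden directions u_j.
us = []
for _ in range(k):
    v = rng.standard_normal(d); v /= np.linalg.norm(v)
    for _ in range(200):
        v = np.einsum('ijk,j,k->i', T, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, v, v, v)
    T -= lam * np.einsum('i,j,k->ijk', v, v, v)
    us.append(v)
W = np.stack(us, axis=1)
print(np.abs(U_true.T @ W).max(0))               # each column aligns with some u_j

# Step 5: train an SVM on the learned features sigma(<x, u_j>).
feats = np.tanh(X @ W)
clf = LinearSVC().fit(feats[:80000], y[:80000])
print(clf.score(feats[80000:], y[80000:]))       # held-out accuracy of the pipeline
```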

SLIDE 40

Conclusion: Learning Conditional Models using Tensor Methods

Tensor Decomposition

Efficient sample and computational complexities. Better performance compared to EM, variational Bayes, etc. Scalable and embarrassingly parallel: handles large datasets.

Score function features

Score function features are crucial for learning conditional models.

Related: Guaranteed Non-convex Methods

Overcomplete dictionary learning/sparse coding: decompose data into a sparse combination of unknown dictionary elements. Non-convex robust PCA: same guarantees as convex relaxation methods, with lower computational complexity. Extensions to the tensor setting.

SLIDE 41

Co-authors and Resources

Majid Janzamin, Hanie Sedghi, Niranjan UN

Papers available at http://newport.eecs.uci.edu/anandkumar/