Learning Overcomplete Latent Variable Models through Tensor Methods

SLIDE 1

Learning Overcomplete Latent Variable Models through Tensor Methods

Anima Anandkumar (UC Irvine)

Joint work with Majid Janzamin (UC Irvine) and Rong Ge (Microsoft Research)

SLIDE 2

Latent Variable Probabilistic Models

Latent (hidden) variable h ∈ R^k, observed variable x ∈ R^d.

SLIDE 3

Latent Variable Probabilistic Models

Latent (hidden) variable h ∈ R^k, observed variable x ∈ R^d.

Multiview linear mixture models
  • Categorical hidden variable h.
  • Views: conditionally independent given h.
  • Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Diagram: hidden variable h with conditionally independent views x1, x2, x3, ...]

SLIDE 4

Latent Variable Probabilistic Models

Latent (hidden) variable h ∈ R^k, observed variable x ∈ R^d.

Multiview linear mixture models
  • Categorical hidden variable h.
  • Views: conditionally independent given h.
  • Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Diagram: hidden variable h with conditionally independent views x1, x2, x3, ...]

Gaussian mixture
  • Categorical hidden variable h.
  • x|h ∼ N(µ_h, Σ_h).

SLIDE 5

Latent Variable Probabilistic Models

Latent (hidden) variable h ∈ R^k, observed variable x ∈ R^d.

Multiview linear mixture models
  • Categorical hidden variable h.
  • Views: conditionally independent given h.
  • Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Diagram: hidden variable h with conditionally independent views x1, x2, x3, ...]

Gaussian mixture
  • Categorical hidden variable h.
  • x|h ∼ N(µ_h, Σ_h).

ICA, Sparse Coding, HMM, Topic modeling, ...

SLIDE 6

Latent Variable Probabilistic Models

Latent (hidden) variable h ∈ R^k, observed variable x ∈ R^d.

Multiview linear mixture models
  • Categorical hidden variable h.
  • Views: conditionally independent given h.
  • Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Diagram: hidden variable h with conditionally independent views x1, x2, x3, ...]

Gaussian mixture
  • Categorical hidden variable h.
  • x|h ∼ N(µ_h, Σ_h).

ICA, Sparse Coding, HMM, Topic modeling, ...

Efficient learning of the parameters a_h, µ_h, ... ?

SLIDE 7

Method-of-Moments (Spectral methods)

Multi-variate observed moments

M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].

SLIDE 8

Method-of-Moments (Spectral methods)

Multi-variate observed moments

M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].

Matrix

E[x ⊗ x] ∈ R^{d×d} is a second-order tensor: E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. In matrix notation, E[x ⊗ x] = E[xx⊤].

SLIDE 9

Method-of-Moments (Spectral methods)

Multi-variate observed moments

M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].

Matrix

E[x ⊗ x] ∈ R^{d×d} is a second-order tensor: E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. In matrix notation, E[x ⊗ x] = E[xx⊤].

Tensor

E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor: E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].

SLIDE 10

Method-of-Moments (Spectral methods)

Multi-variate observed moments

M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].

Matrix

E[x ⊗ x] ∈ R^{d×d} is a second-order tensor: E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. In matrix notation, E[x ⊗ x] = E[xx⊤].

Tensor

E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor: E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].

Information in moments for learning LVMs?
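A minimal numpy sketch (an illustration, not part of the slides) of how the empirical versions of these moment tensors can be formed from data; the sample matrix X, the sample size n, and the dimension d below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.standard_normal((n, d))                # placeholder i.i.d. sample, one observation per row

M1 = X.mean(axis=0)                            # estimate of E[x]
M2 = np.einsum('ni,nj->ij', X, X) / n          # estimate of E[x ⊗ x], a d×d matrix
M3 = np.einsum('ni,nj,nl->ijl', X, X, X) / n   # estimate of E[x ⊗ x ⊗ x], a d×d×d tensor

# Entrywise, M2[i1, i2] ≈ E[x_{i1} x_{i2}] and M3[i1, i2, i3] ≈ E[x_{i1} x_{i2} x_{i3}].
```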

SLIDE 11

Multiview Mixture Model

[k] := {1, . . . , k}.

Multiview linear mixture models
  • Categorical hidden variable h ∈ [k], with w_j := Pr[h = j].
  • Views: conditionally independent given h.
  • Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Diagram: hidden variable h with conditionally independent views x1, x2, x3, ...]

SLIDE 12

Multiview Mixture Model

[k] := {1, . . . , k}.

Multiview linear mixture models
  • Categorical hidden variable h ∈ [k], with w_j := Pr[h = j].
  • Views: conditionally independent given h.
  • Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Diagram: hidden variable h with conditionally independent views x1, x2, x3, ...]

E[x1 ⊗ x2] (= E[x1 x2⊤]) = E_h[ E[x1 ⊗ x2 | h] ] = E_h[a_h ⊗ b_h] = Σ_{j∈[k]} w_j a_j ⊗ b_j.

SLIDE 13

Multiview Mixture Model

[k] := {1, . . . , k}.

Multiview linear mixture models
  • Categorical hidden variable h ∈ [k], with w_j := Pr[h = j].
  • Views: conditionally independent given h.
  • Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Diagram: hidden variable h with conditionally independent views x1, x2, x3, ...]

E[x1 ⊗ x2] = Σ_{j∈[k]} w_j a_j ⊗ b_j.

SLIDE 14

Multiview Mixture Model

[k] := {1, . . . , k}.

Multiview linear mixture models
  • Categorical hidden variable h ∈ [k], with w_j := Pr[h = j].
  • Views: conditionally independent given h.
  • Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Diagram: hidden variable h with conditionally independent views x1, x2, x3, ...]

E[x1 ⊗ x2] = Σ_{j∈[k]} w_j a_j ⊗ b_j,   E[x1 ⊗ x2 ⊗ x3] = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j.

SLIDE 15

Multiview Mixture Model

[k] := {1, . . . , k}.

Multiview linear mixture models
  • Categorical hidden variable h ∈ [k], with w_j := Pr[h = j].
  • Views: conditionally independent given h.
  • Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Diagram: hidden variable h with conditionally independent views x1, x2, x3, ...]

E[x1 ⊗ x2] = Σ_{j∈[k]} w_j a_j ⊗ b_j,   E[x1 ⊗ x2 ⊗ x3] = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j.

Tensor (matrix) factorization for learning LVMs.
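As an aside (not on the slides), a small simulation can confirm the factorization E[x1 ⊗ x2 ⊗ x3] = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j for the multiview mixture; the mixing weights, component matrices, noise level, and sample size below are made-up choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 6, 3, 200_000
w = np.array([0.5, 0.3, 0.2])                               # w_j = Pr[h = j]
A, B, C = (rng.standard_normal((d, k)) for _ in range(3))   # columns are a_j, b_j, c_j

h = rng.choice(k, size=n, p=w)                              # hidden label per sample
noise = lambda: 0.1 * rng.standard_normal((n, d))           # independent zero-mean noise per view
x1, x2, x3 = A[:, h].T + noise(), B[:, h].T + noise(), C[:, h].T + noise()

T_hat = np.einsum('ni,nj,nl->ijl', x1, x2, x3) / n          # empirical E[x1 ⊗ x2 ⊗ x3]
T_exact = np.einsum('j,ij,kj,lj->ikl', w, A, B, C)          # Σ_j w_j a_j ⊗ b_j ⊗ c_j

print(np.linalg.norm(T_hat - T_exact) / np.linalg.norm(T_exact))   # relative error → 0 as n grows
```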

SLIDE 16

Tensor Rank and Tensor Decomposition

Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).

SLIDE 17

Tensor Rank and Tensor Decomposition

Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).

CANDECOMP/PARAFAC (CP) Decomposition

T = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j ∈ R^{d×d×d},   a_j, b_j, c_j ∈ S^{d−1}.

[Figure: tensor T drawn as the sum w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + . . .]

SLIDE 18

Tensor Rank and Tensor Decomposition

Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).

CANDECOMP/PARAFAC (CP) Decomposition

T = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j ∈ R^{d×d×d},   a_j, b_j, c_j ∈ S^{d−1}.

[Figure: tensor T drawn as the sum w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + . . .]

k: tensor rank, d: ambient dimension. k ≤ d: undercomplete; k > d: overcomplete.

SLIDE 19

Tensor Rank and Tensor Decomposition

Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).

CANDECOMP/PARAFAC (CP) Decomposition

T = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j ∈ R^{d×d×d},   a_j, b_j, c_j ∈ S^{d−1}.

[Figure: tensor T drawn as the sum w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + . . .]

k: tensor rank, d: ambient dimension. k ≤ d: undercomplete; k > d: overcomplete.

This talk: guarantees for overcomplete tensor decomposition.
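A tiny sketch of the notation (illustrative only): a rank-1 tensor built from outer products matches the entrywise definition above, and a CP tensor is a sum of k such terms, possibly with k > d.

```python
import numpy as np

rng = np.random.default_rng(2)
d, w = 4, 2.0
a, b, c = (v / np.linalg.norm(v) for v in rng.standard_normal((3, d)))   # unit vectors in S^{d-1}

T = w * np.einsum('i,j,l->ijl', a, b, c)                 # rank-1 tensor w · a ⊗ b ⊗ c
assert np.isclose(T[1, 2, 3], w * a[1] * b[2] * c[3])    # T(i, j, l) = w · a(i) · b(j) · c(l)

# A rank-k CP tensor is the sum of k such rank-1 terms; k may exceed d (overcomplete).
```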

SLIDE 20

Challenges in Tensor Decomposition

Symmetric tensor T ∈ R^{d×d×d}: T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.

Challenges in tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.

SLIDE 21

Challenges in Tensor Decomposition

Symmetric tensor T ∈ R^{d×d×d}: T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.

Challenges in tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.

Tractable case: orthogonal tensor decomposition (⟨v_i, v_j⟩ = 0 for i ≠ j)

Algorithm: tensor power method: v → T(I, v, v) / ‖T(I, v, v)‖.

  • {vi}’s are the only robust fixed points.
SLIDE 22

Challenges in Tensor Decomposition

Symmetric tensor T ∈ R^{d×d×d}: T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.

Challenges in tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.

Tractable case: orthogonal tensor decomposition (⟨v_i, v_j⟩ = 0 for i ≠ j)

Algorithm: tensor power method: v → T(I, v, v) / ‖T(I, v, v)‖.

  • {vi}’s are the only robust fixed points.
  • All other eigenvectors are saddle points.
SLIDE 23

Challenges in Tensor Decomposition

Symmetric tensor T ∈ R^{d×d×d}: T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.

Challenges in tensors

Decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.

Tractable case: orthogonal tensor decomposition (⟨v_i, v_j⟩ = 0 for i ≠ j)

Algorithm: tensor power method: v → T(I, v, v) / ‖T(I, v, v)‖.

  • {vi}’s are the only robust fixed points.
  • All other eigenvectors are saddle points.

For an orthogonal tensor, no spurious local optima!
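A hedged numpy sketch of the tensor power method on an orthogonally decomposable tensor, with deflation to recover all components; the problem sizes, iteration count, and random starts are choices of the sketch rather than anything prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 8, 5
lam = rng.uniform(1, 2, size=k)                            # positive weights λ_i
V, _ = np.linalg.qr(rng.standard_normal((d, k)))           # orthonormal components v_i
T = np.einsum('i,ai,bi,ci->abc', lam, V, V, V)             # T = Σ_i λ_i v_i ⊗ v_i ⊗ v_i

def tensor_power_method(T, n_iter=100):
    v = rng.standard_normal(T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        Tv = np.einsum('abc,b,c->a', T, v, v)              # T(I, v, v)
        v = Tv / np.linalg.norm(Tv)                        # v ← T(I, v, v) / ||T(I, v, v)||
    return np.einsum('abc,a,b,c->', T, v, v, v), v         # eigenvalue T(v, v, v) and eigenvector

# Deflation: extract one robust fixed point at a time and subtract its rank-1 term.
for _ in range(k):
    lam_hat, v = tensor_power_method(T)
    print(round(lam_hat, 3))                               # ≈ one of the λ_i (order depends on the starts)
    T = T - lam_hat * np.einsum('a,b,c->abc', v, v, v)
```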

SLIDE 24

Beyond Orthogonal Tensor Decomposition

Limitations

Not ALL tensors have orthogonal decomposition (unlike matrices).

SLIDE 25

Beyond Orthogonal Tensor Decomposition

Limitations

Not ALL tensors have orthogonal decomposition (unlike matrices).

Undercomplete tensors (k ≤ d) with full rank components

Non-orthogonal decomposition T1 = Σ_i w_i a_i ⊗ a_i ⊗ a_i.

Whitening matrix W; multilinear transform: T2 = T1(W, W, W).

[Figure: whitening W maps the components a1, a2, a3 of tensor T1 to orthogonal components v1, v2, v3 of tensor T2]

SLIDE 26

Beyond Orthogonal Tensor Decomposition

Limitations

Not ALL tensors have orthogonal decomposition (unlike matrices).

Undercomplete tensors (k ≤ d) with full rank components

Non-orthogonal decomposition T1 = Σ_i w_i a_i ⊗ a_i ⊗ a_i.

Whitening matrix W; multilinear transform: T2 = T1(W, W, W).

[Figure: whitening W maps the components a1, a2, a3 of tensor T1 to orthogonal components v1, v2, v3 of tensor T2]

This talk: guarantees for overcomplete tensor decomposition
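A sketch of the whitening reduction for the undercomplete case, under the assumption that the exact second and third moments M2 = Σ_i w_i a_i ⊗ a_i and T1 = Σ_i w_i a_i ⊗ a_i ⊗ a_i are available; W is built from the top-k eigenpairs of M2 so that W⊤ M2 W = I, which makes the transformed components orthonormal:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 8, 4                                                # undercomplete: k ≤ d
w = rng.uniform(0.5, 1.5, size=k)
A = rng.standard_normal((d, k))                            # full-column-rank components a_i

M2 = np.einsum('i,ai,bi->ab', w, A, A)                     # Σ_i w_i a_i ⊗ a_i
T1 = np.einsum('i,ai,bi,ci->abc', w, A, A, A)              # Σ_i w_i a_i ⊗ a_i ⊗ a_i

# Whitening matrix W ∈ R^{d×k} with W⊤ M2 W = I, from the top-k eigenpairs of M2.
eigval, eigvec = np.linalg.eigh(M2)
W = eigvec[:, -k:] / np.sqrt(eigval[-k:])

# Multilinear transform T2 = T1(W, W, W); the vectors v_i = √w_i · W⊤ a_i are orthonormal,
# so T2 = Σ_i (1/√w_i) v_i ⊗ v_i ⊗ v_i is an orthogonal decomposition.
T2 = np.einsum('abc,ax,by,cz->xyz', T1, W, W, W)
V = np.sqrt(w) * (W.T @ A)
print(np.allclose(V.T @ V, np.eye(k)))                     # True
```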

SLIDE 27

Outline

1. Introduction
2. Overcomplete tensor decomposition
3. Sample Complexity Analysis
4. Conclusion

SLIDE 28

Our Setup

So far

General tensor decomposition: NP-hard. Orthogonal tensors: too limiting. Tractable cases? Covers overcomplete tensors?

SLIDE 29

Our Setup

So far

General tensor decomposition: NP-hard. Orthogonal tensors: too limiting. Tractable cases? Covers overcomplete tensors?

Our framework: Incoherent Components

|⟨a_i, a_j⟩| = O(1/√d) for i ≠ j. Similarly for the b's and c's.

Can handle overcomplete tensors. Satisfied by random vectors. Guaranteed recovery for alternating minimization?
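A quick numerical sanity check (an illustration, not from the slides) that independent random unit vectors are incoherent in this sense, even when k > d; the dimensions below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 500, 1500                                   # overcomplete: k = 3d
A = rng.standard_normal((d, k))
A /= np.linalg.norm(A, axis=0)                     # columns drawn uniformly from S^{d-1}

G = A.T @ A                                        # pairwise inner products <a_i, a_j>
off_diag = np.abs(G - np.diag(np.diag(G)))
print(off_diag.max(), 1 / np.sqrt(d))              # max |<a_i, a_j>| is a small (log-factor) multiple of 1/√d
```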

SLIDE 30

Alternating minimization

min_{a,b,c ∈ S^{d−1}, w ∈ R} ‖T − w · a ⊗ b ⊗ c‖_F.

Rank-1 ALS iteration (power iteration)

Initialization: a^(0), b^(0), c^(0). Update in step t: fix a^(t), b^(t) and set c^(t+1) ∝ T(a^(t), b^(t), I). After (approx.) convergence, restart.

SLIDE 31

Alternating minimization

min_{a,b,c ∈ S^{d−1}, w ∈ R} ‖T − w · a ⊗ b ⊗ c‖_F.

Rank-1 ALS iteration (power iteration)

Initialization: a^(0), b^(0), c^(0). Update in step t: fix a^(t), b^(t) and set c^(t+1) ∝ T(a^(t), b^(t), I). After (approx.) convergence, restart.

Simple update: trivially parallelizable and hence scalable. Computation is linear in the dimension, the rank, and the number of runs.

SLIDE 32

Alternating minimization

min_{a,b,c ∈ S^{d−1}, w ∈ R} ‖T − w · a ⊗ b ⊗ c‖_F.

Rank-1 ALS iteration (power iteration)

Initialization: a^(0), b^(0), c^(0). Update in step t: fix a^(t), b^(t) and set c^(t+1) ∝ T(a^(t), b^(t), I). After (approx.) convergence, restart.

Simple update: trivially parallelizable and hence scalable. Computation is linear in the dimension, the rank, and the number of runs.

Rank-1 ALS iteration ≡ asymmetric power iteration.
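A hedged sketch of one run of the rank-1 ALS / asymmetric power iteration on a noiseless overcomplete tensor. The initialization here is simply random (the SVD-based initialization and restart scheme come later in the talk), so a single run is not guaranteed to land on a component; the sizes and iteration count are choices of the sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
d, k = 50, 80                                              # overcomplete, with k well below d^1.5
w = rng.uniform(1, 2, size=k)
A, B, C = [rng.standard_normal((d, k)) for _ in range(3)]
for M in (A, B, C):
    M /= np.linalg.norm(M, axis=0)                         # incoherent unit-norm components
T = np.einsum('j,ij,kj,lj->ikl', w, A, B, C)

def rank1_als(T, n_iter=50):
    a, b, c = [rng.standard_normal(T.shape[0]) for _ in range(3)]
    for _ in range(n_iter):
        a = np.einsum('ijl,j,l->i', T, b, c); a /= np.linalg.norm(a)   # a ∝ T(I, b, c)
        b = np.einsum('ijl,i,l->j', T, a, c); b /= np.linalg.norm(b)   # b ∝ T(a, I, c)
        c = np.einsum('ijl,i,j->l', T, a, b); c /= np.linalg.norm(c)   # c ∝ T(a, b, I)
    return np.einsum('ijl,i,j,l->', T, a, b, c), a, b, c               # weight estimate and directions

w_hat, a, b, c = rank1_als(T)
j = np.argmax(np.abs(A.T @ a))                             # best-matching true component
print(abs(A[:, j] @ a), abs(B[:, j] @ b), abs(C[:, j] @ c))            # close to 1 when the run succeeds
```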

SLIDE 33

Main Result: Local Convergence

Initialization: max{‖a1 − â^(0)‖, ‖b1 − b̂^(0)‖} ≤ ε0, with ε0 < some constant. Noise: T̂ := T + E, with ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5).

SLIDE 34

Main Result: Local Convergence

Initialization: max{‖a1 − â^(0)‖, ‖b1 − b̂^(0)‖} ≤ ε0, with ε0 < some constant. Noise: T̂ := T + E, with ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5).

Theorem (Local Convergence)[AGJ2014]

After N = O(log(1/‖E‖)) steps of alternating rank-1 updates, ‖a1 − â^(N)‖ = O(‖E‖).

SLIDE 35

Main Result: Local Convergence

Initialization: max{‖a1 − â^(0)‖, ‖b1 − b̂^(0)‖} ≤ ε0, with ε0 < some constant. Noise: T̂ := T + E, with ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5).

Theorem (Local Convergence)[AGJ2014]

After N = O(log(1/‖E‖)) steps of alternating rank-1 updates, ‖a1 − â^(N)‖ = O(‖E‖).

  • Linear convergence, up to the approximation error.
  • Guarantees for overcomplete tensors: k = o(d^1.5), and k = o(d^{p/2}) for pth-order tensors.
  • Requires a good initialization. What about global convergence?

SLIDE 36

Global Convergence k = O(d)

SVD Initialization

Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.

SLIDE 37

Global Convergence k = O(d)

SVD Initialization

Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.

Assumptions

  • Number of initializations: L ≥ k^{Ω((k/d)^2)}.
  • Tensor rank: k = O(d).
  • No. of iterations: N = Θ(log(1/E)). Recall E: the recovery error.
SLIDE 38

Global Convergence k = O(d)

SVD Initialization

Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials.

Assumptions

  • Number of initializations: L ≥ k^{Ω((k/d)^2)}.
  • Tensor rank: k = O(d).
  • No. of iterations: N = Θ(log(1/E)). Recall E: the recovery error.

Theorem (Global Convergence) [AGJ2014]: ‖a1 − â^(N)‖ ≤ O(ε_R).
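A sketch of the SVD initialization step: contract the tensor with a random Gaussian vector θ, take the top singular-vector pair of the resulting d×d matrix as a candidate (a^(0), b^(0)), and repeat for L trials. The toy tensor built here and the trial count are illustrative assumptions; the subsequent selection among candidates (running the rank-1 updates from each and keeping the best fit) is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(7)
d, k = 30, 40
w = rng.uniform(1, 2, size=k)
A, B, C = [rng.standard_normal((d, k)) for _ in range(3)]
for M in (A, B, C):
    M /= np.linalg.norm(M, axis=0)
T = np.einsum('j,ij,kj,lj->ikl', w, A, B, C)               # toy overcomplete tensor

def svd_init(T, L=200):
    """L candidate initializations from random contractions T(I, I, θ), θ ~ N(0, I)."""
    cands = []
    for _ in range(L):
        theta = rng.standard_normal(T.shape[2])
        M = np.einsum('ijl,l->ij', T, theta)               # T(I, I, θ) is a d×d matrix
        U, s, Vt = np.linalg.svd(M)
        cands.append((U[:, 0], Vt[0]))                     # top singular-vector pair -> (a^(0), b^(0))
    return cands

# Each candidate would seed the alternating rank-1 updates; here we just check how well
# the best candidate correlates with some true component.
best = max(np.abs(A.T @ a0).max() for a0, _ in svd_init(T))
print(best)                                                # noticeably above a random direction's ~1/√d
```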

SLIDE 39

Outline

1. Introduction
2. Overcomplete tensor decomposition
3. Sample Complexity Analysis
4. Conclusion

SLIDE 40

High-level Intuition for Sample Bounds

Multi-view model: x1 = Ah + z1, where z1 is noise.

Exact moment: T = Σ_i w_i a_i ⊗ b_i ⊗ c_i.

Sample moment: T̂ = (1/n) Σ_i x1^i ⊗ x2^i ⊗ x3^i.

Naïve idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, apply the matrix Bernstein inequality.

SLIDE 41

High-level Intuition for Sample Bounds

Multi-view model: x1 = Ah + z1, where z1 is noise.

Exact moment: T = Σ_i w_i a_i ⊗ b_i ⊗ c_i.

Sample moment: T̂ = (1/n) Σ_i x1^i ⊗ x2^i ⊗ x3^i.

Naïve idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖, apply the matrix Bernstein inequality.

Our idea: careful ε-net covering for ‖T̂ − T‖.
  • T̂ − T has many terms, e.g., the all-noise term (1/n) Σ_i z1^i ⊗ z2^i ⊗ z3^i and signal-noise cross terms.
  • Need to bound (1/n) Σ_i ⟨z1^i, u⟩ ⟨z2^i, v⟩ ⟨z3^i, w⟩ for all u, v, w ∈ S^{d−1}.
  • Classify the inner products into buckets and bound them separately.

Tight sample bounds for a range of latent variable models.
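An illustrative Monte Carlo (a toy, not the proof) of one term this analysis controls: for a fixed triple of unit vectors u, v, w, the all-noise average (1/n) Σ_i ⟨z1^i, u⟩⟨z2^i, v⟩⟨z3^i, w⟩ concentrates at the 1/√n rate; the hard part sketched on the slide is making such bounds hold uniformly over all u, v, w on the sphere via the ε-net.

```python
import numpy as np

rng = np.random.default_rng(8)
d = 100
u, v, w = (x / np.linalg.norm(x) for x in rng.standard_normal((3, d)))   # fixed unit vectors

for n in (1_000, 10_000, 100_000):
    z1, z2, z3 = (rng.standard_normal((n, d)) for _ in range(3))         # i.i.d. Gaussian noise per view
    term = np.mean((z1 @ u) * (z2 @ v) * (z3 @ w))     # (1/n) Σ_i <z1_i, u><z2_i, v><z3_i, w>
    print(n, abs(term), 1 / np.sqrt(n))                # |term| is on the order of 1/√n
```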

SLIDE 42

Unsupervised Learning of Gaussian Mixtures

  • No. of mixture components: k = C · d.
  • No. of unlabeled samples: n = Ω̃(k · d).
  • Computational complexity: Õ(k^{C^2}).

Our result: achieved error with n unlabeled samples

max_j ‖â_j − a_j‖ = Õ(√(k/n)).   Linear convergence.

Error: same as before, for the semi-supervised setting. Computational complexity: polynomial when k = Θ(d).

SLIDE 43

Semi-supervised Learning of Gaussian Mixtures

n unlabeled samples; m_j: labeled samples for component j.

  • No. of mixture components: k = o(d^1.5).
  • No. of labeled samples: m_j = Ω̃(1).
  • No. of unlabeled samples: n = Ω̃(k).

Our result: achieved error with n unlabeled samples

max_j ‖â_j − a_j‖ = Õ(√(k/n)).   Linear convergence.

Can handle (polynomially) overcomplete mixtures. Extremely small number of labeled samples: polylog(d). Sample complexity is tight: need Ω̃(k) samples!

SLIDE 44

Outline

1. Introduction
2. Overcomplete tensor decomposition
3. Sample Complexity Analysis
4. Conclusion

SLIDE 45

Conclusion

Learning overcomplete latent variable models.

⋆ Method-of-moments. ⋆ Tensor power iteration.

Robustness to noise. Sample complexity bounds for a range of LVMs.

⋆ Unsupervised setting. ⋆ Semi-supervised setting.

SLIDE 46

Conclusion

Learning overcomplete latent variable models.

⋆ Method-of-moments. ⋆ Tensor power iteration.

Robustness to noise. Sample complexity bounds for a range of LVMs.

⋆ Unsupervised setting. ⋆ Semi-supervised setting.

Latest result: improved initialization for tensors with Gaussian components.

SLIDE 47

Conclusion

Learning overcomplete latent variable models.

⋆ Method-of-moments. ⋆ Tensor power iteration.

Robustness to noise. Sample complexity bounds for a range of LVMs.

⋆ Unsupervised setting. ⋆ Semi-supervised setting.

Latest result: improved initialization for tensors with Gaussian components.

Thank you!