Learning Overcomplete Latent Variable Models through Tensor Methods

SLIDE 1

Learning Overcomplete Latent Variable Models through Tensor Methods

Majid Janzamin (UC Irvine)

Joint work with Anima Anandkumar (UC Irvine) and Rong Ge (Microsoft Research)

SLIDE 2

Latent Variable Modeling

Goal: Discover hidden effects from observed measurements

Document modeling

Observed: words. Hidden: topics.

Nursing Home Is Faulted Over Care After Storm. By MICHAEL POWELL and SHERI FINK. Amid the worst hurricane to hit New York City in nearly 80 years, officials have claimed that the Promenade Rehabilitation and Health Care Center failed to provide the most basic care to its patients.

In One Day, 11,000 Flee Syria as War and Hardship Worsen. By RICK GLADSTONE and NEIL MacFARQUHAR. The United Nations reported that 11,000 Syrians fled on Friday, the vast majority of them clambering for safety over the Turkish border.

Obama to Insist on Tax Increase for the Wealthy. By HELENE COOPER and JONATHAN WEISMAN. Amid talk of compromise, President Obama and Speaker John A. Boehner both indicated unchanged stances on this issue, long a point of contention.

Hurricane Exposed Flaws in Protection of Tunnels. By ELISABETH ROSENTHAL. Nearly two weeks after Hurricane Sandy struck, the vital arteries that bring cars, trucks and subways into New York City's transportation network have recovered, with one major exception: the Brooklyn-Battery Tunnel remains closed.

Behind New York Gas Lines, Warnings and Crossed Fingers. By DAVID W. CHEN, WINNIE HU and CLIFFORD KRAUSS. The return of 1970s-era gas lines to the five boroughs of New York City was not the result of a single miscalculation, but a combination of ignored warnings and indecisiveness.

Social Network Modeling

Observed: social interactions. Hidden: communities, relationships.

Recommendation Systems

Observed: recommendations (e.g., reviews). Hidden: user and business attributes. Applications in speech, vision, . . .

SLIDES 3-4

Latent Variable Modeling

Feature Learning

Learn good features/representations for classification tasks, e.g., image and speech recognition.

Sparse Coding, Dictionary Learning

Sparse representations, low-dimensional hidden structures. A few dictionary elements make complicated shapes.

(Image from Sanjeev Arora's slides.)

SLIDES 5-9

Learning Latent Variable Models

Goal: Discover hidden effects from observed measurements. Unsupervised learning: no labeled samples. Semi-supervised learning: few labeled samples.

Challenge: Conditions for Identifiability

When can the model be identified (given infinite computation and data)? Does identifiability also lead to tractable algorithms?

Challenge: Efficient Learning of Latent Variable Models

Maximum likelihood is NP-hard in most cases. Practice: EM and Variational Bayes, but these have no consistency guarantees. Scalable guaranteed learning algorithms?

⋆ Low computational and statistical complexity

This talk: guaranteed and efficient learning through spectral methods.

SLIDES 10-14

LVMs as Probabilistic Models

Latent (hidden) variable h ∈ R^k, observed variable x ∈ R^d.

Multiview linear mixture models: Categorical hidden variable h. Views: conditionally independent given h. Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Graphical model: hidden h with observed views x1, x2, x3, . . .]

Gaussian mixture: Categorical hidden variable h. x|h ∼ N(µ_h, Σ_h).

ICA, Sparse Coding, HMM, Topic modeling, . . .

Efficient learning of the parameters a_h, µ_h, . . . ?

SLIDES 15-18

Method-of-Moments (Spectral Methods)

Multivariate observed moments:

M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].

Matrix: E[x ⊗ x] ∈ R^{d×d} is a second-order tensor, with E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]. For matrices: E[x ⊗ x] = E[x x^⊤].

Tensor: E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor, with E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].

Information in moments for learning LVMs? (A sketch of the empirical moments follows.)
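A minimal NumPy sketch of these empirical moments. The sample array X and the sizes n, d are illustrative assumptions, not from the talk:

```python
# Sketch: empirical estimates of the observed moments M1, M2, M3 from
# n i.i.d. samples stored as the rows of a (n, d) array X.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))                    # placeholder samples

M1 = X.mean(axis=0)                            # E[x]          -> shape (d,)
M2 = np.einsum('ni,nj->ij', X, X) / n          # E[x (x) x]    -> (d, d)
M3 = np.einsum('ni,nj,nl->ijl', X, X, X) / n   # E[x (x) x (x) x] -> (d, d, d)

# Entry-wise definitions from the slide:
#   M2[i1, i2]     = E[x_{i1} x_{i2}]
#   M3[i1, i2, i3] = E[x_{i1} x_{i2} x_{i3}]
assert np.allclose(M2, M2.T)                   # the second moment is symmetric
```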

SLIDES 19-23

Multiview Mixture Model

[k] := {1, . . . , k}. Multiview linear mixture models: Categorical hidden variable h ∈ [k] with w_j := Pr[h = j]. Views: conditionally independent given h. Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h.

[Graphical model: hidden h with observed views x1, x2, x3, . . .]

By conditional independence of the views:

E[x1 ⊗ x2] = E_h[E[x1 ⊗ x2 | h]] = E_h[a_h ⊗ b_h] = Σ_{j∈[k]} w_j a_j ⊗ b_j,

E[x1 ⊗ x2 ⊗ x3] = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j.

Tensor (matrix) factorization for learning LVMs.
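These moment forms can also be built directly from the model parameters; a sketch, where the factor matrices A, B, C (components as columns) and the uniform weights are illustrative assumptions:

```python
# Sketch: forming the population moments of the multiview mixture,
#   E[x1 (x) x2]        = sum_j w_j a_j (x) b_j,
#   E[x1 (x) x2 (x) x3] = sum_j w_j a_j (x) b_j (x) c_j.
import numpy as np

rng = np.random.default_rng(1)
d, k = 6, 4
A, B, C = (rng.normal(size=(d, k)) for _ in range(3))
A, B, C = (M / np.linalg.norm(M, axis=0) for M in (A, B, C))  # unit columns
w = np.full(k, 1.0 / k)                                       # w_j = Pr[h = j]

M2 = np.einsum('j,ij,lj->il', w, A, B)          # sum_j w_j a_j b_j^T
M3 = np.einsum('j,ij,lj,mj->ilm', w, A, B, C)   # third-order analogue
print(M2.shape, M3.shape)                       # (6, 6) (6, 6, 6)
```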

SLIDES 24-26

Matrix vs. Tensor Decomposition

Uniqueness of the decomposition.

Matrix decomposition: Requires distinct weights and orthogonal components, i.e., ⟨a_i, a_j⟩ = 0 for i ≠ j. Too limiting. Otherwise, only learning up to a subspace is possible.

Tensor decomposition: Allows same weights and non-orthogonal components ⇒ overcomplete models. More general models.

Focus on tensor decomposition for learning LVMs.

SLIDES 27-29

Overcomplete Latent Variable Models

Overcomplete Latent Representations

Latent dimensionality > observed dimensionality, i.e., k > d. Flexible modeling, robust to noise. Applicable in speech and image modeling. Large amount of unlabeled samples. Possible to learn when using higher (e.g., 3rd) order tensor moments.

Example: T ∈ R^{2×2×2} with rank 3 (d = 2, k = 3), with frontal slices

T(:, :, 1) = [1 0; 0 1],  T(:, :, 2) = [0 1; −1 0],

admits the rank-3 decomposition (with e1 = (1, 0), e2 = (0, 1))

T = e1 ⊗ e1 ⊗ (e1 − e2) + e2 ⊗ e2 ⊗ (e1 + e2) + (e1 − e2) ⊗ (e1 + e2) ⊗ e2.
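A quick numerical check of this decomposition. The slide's exact factors were garbled in extraction, so the vectors below should be read as one standard valid choice for this tensor (e.g., as in Kolda & Bader 2009), not necessarily the slide's:

```python
# Sketch: verify that three rank-1 terms reproduce the 2x2x2 tensor with
# frontal slices [[1,0],[0,1]] and [[0,1],[-1,0]], so its rank is <= 3.
import numpy as np

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

def rank1(a, b, c):
    """Outer product a (x) b (x) c as a 2x2x2 array."""
    return np.einsum('i,j,l->ijl', a, b, c)

T = (rank1(e1, e1, e1 - e2)
     + rank1(e2, e2, e1 + e2)
     + rank1(e1 - e2, e1 + e2, e2))

print(T[:, :, 0])   # [[1. 0.] [0. 1.]]
print(T[:, :, 1])   # [[ 0. 1.] [-1. 0.]]
```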
SLIDE 30

Overcomplete Latent Variable Models

So far: Learning LVMs. Spectral methods (method-of-moments). Overcomplete LVMs. This work: theoretical guarantees for the above.

SLIDE 31

Outline

1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion
SLIDE 32

Spherical Gaussian Mixtures

Assumptions

k components, d: observed dimension. Component means ai incoherent: randomly drawn from the sphere. Spherical variance σ2

d I (assume known).

slide-33
SLIDE 33

Spherical Gaussian Mixtures

Assumptions

k components, d: observed dimension. Component means ai incoherent: randomly drawn from the sphere. Spherical variance σ2

d I (assume known).

In this talk: special case

Noise norm σ2 = 1: same as signal. Uniform probability of components.

slide-34
SLIDE 34

Spherical Gaussian Mixtures

Assumptions

k components, d: observed dimension. Component means ai incoherent: randomly drawn from the sphere. Spherical variance σ2

d I (assume known).

In this talk: special case

Noise norm σ2 = 1: same as signal. Uniform probability of components.

Tensor For Learning (Hsu, Kakade 2012)

M3 := E[x⊗3] − σ2

i∈[d]

(E[x] ⊗ ei ⊗ ei + · · · ) ⇒ M3 =

  • j∈[k]

wjaj ⊗ aj ⊗ aj.

slide-35
SLIDE 35

Semi-supervised Learning of Gaussian Mixtures

n unlabeled samples, mj: samples for component j.

  • No. of mixture components: k = o(d1.5)
  • No. of labeled samples: mj = ˜

Ω(1).

  • No. of unlabeled samples: n = ˜

Ω(k).

Our result: achieved error with n unlabeled samples

max

j

  • aj − aj = ˜

O

  • k

n

  • + ˜

O √ k d

  • Linear convergence.

Can handle (polynomially) overcomplete mixtures. Extremely small number of labeled samples: polylog(d). Sample complexity is tight: need ˜ Ω(k) samples! Approximation error: decaying in high dimensions.

SLIDE 36

Unsupervised Learning of Gaussian Mixtures

  • No. of mixture components: k = C · d.
  • No. of unlabeled samples: n = Ω̃(k · d). Computational complexity: Õ(k^{C²}).

Our result: achieved error with n unlabeled samples

max_j ‖â_j − a_j‖ ≤ Õ(√(k/n)) + Õ(√k / d).

  • Linear convergence.

Error: same as before, for the semi-supervised setting. Sample complexity: worse than semi-supervised, but better than previous works (no dependence on the condition number of A). Computational complexity: polynomial when k = Θ(d).

SLIDES 37-38

Multi-view Mixture Models

[Graphical model: hidden h with observed views x1, x2, x3, . . .]

A = [a1 a2 · · · ak] ∈ R^{d×k}; similarly B and C. Linear model: x1 = A h + z1, x2 = B h + z2, x3 = C h + z3.

Incoherence: Component means a_i are incoherent (randomly drawn from the unit sphere); similarly the b_i's and c_i's. The zero-mean noise terms z_l satisfy RIP, e.g., Gaussian, Bernoulli.

Same results as for Gaussian mixtures.

SLIDES 39-41

Independent Component Analysis

x = A h: independent sources, unknown mixing. Blind source separation of speech, image, video.

[Bipartite graph: sources h1, . . . , hk mixed through A into observations x1, . . . , xd.]

Sources h are sub-Gaussian (but not Gaussian). Columns of A are incoherent. Form the cumulant tensor M4 := E[x^⊗4] − · · · . n samples, k sources, d dimensions.

Learning Result

Semi-supervised: k = o(d²), n ≥ Ω̃(max(k², k⁴/d³)). Unsupervised: k = O(d), n ≥ Ω̃(k³).

max_j min_{f∈{−1,1}} ‖f·â_j − a_j‖ ≤ Õ( k² / min(n, √(d³ n)) ) + Õ(√k / d^1.5).
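A sketch of the elided fourth-order cumulant, assuming zero-mean observations; the subtracted term is the standard Gaussian (pairwise-covariance) part of the fourth moment:

```python
# Sketch: the fourth-order cumulant tensor used for ICA. For zero-mean
# x = Ah with independent non-Gaussian sources, this tensor equals
# sum_j kappa_j a_j^{(x)4}, where kappa_j is the excess kurtosis of h_j.
import numpy as np

def cumulant4(X):
    """Empirical M4 := E[x^(x)4] minus the Gaussian (pairwise) part."""
    n, d = X.shape
    E4 = np.einsum('ni,nj,nk,nl->ijkl', X, X, X, X) / n
    C = X.T @ X / n                                   # E[x x^T]
    gauss = (np.einsum('ij,kl->ijkl', C, C)
             + np.einsum('ik,jl->ijkl', C, C)
             + np.einsum('il,jk->ijkl', C, C))
    return E4 - gauss
```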

SLIDE 42

Sparse Coding

Sparse representations, low-dimensional hidden structures. A few dictionary elements make complicated shapes.

SLIDES 43-45

Sparse Coding

x = A h: sparse coefficients, unknown dictionary. Image compression, feature learning, . . .

Coefficients h are independent Bernoulli-Gaussian: sparse ICA. Columns of A are incoherent. Form the cumulant tensor M4 := E[x^⊗4] − · · · . n samples, k dictionary elements, d dimensions, s average sparsity.

Learning Result

Semi-supervised: k = o(d²), n ≥ Ω̃(max(sk, s²k²/d³)). Unsupervised: k = O(d), n ≥ Ω̃(sk²).

max_j min_{f∈{−1,1}} ‖f·â_j − a_j‖ ≤ Õ( sk / min(n, √(d³ n)) ) + Õ(√k / d^1.5).

SLIDE 46

Outline

1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion

SLIDES 47-53

Recap of Orthogonal Matrix Eigen-Analysis

Symmetric M ∈ R^{d×d}. Eigenvectors are fixed points: M v = λ v. Eigen-decomposition: M = Σ_i λ_i v_i v_i^⊤. Orthogonal: ⟨v_i, v_j⟩ = 0 for i ≠ j.

Uniqueness (identifiability):

  • Iff the λ_i's are distinct.

Algorithm: power method: v → M v / ‖M v‖.

Convergence properties

Let λ1 > λ2 > · · · > λd. Only the v_i's are fixed points of the power iteration: M v_i = λ_i v_i.

  • v1 is the only robust fixed point.
  • All other v_i's are saddle points.

The power method recovers v1 whenever the initialization v satisfies ⟨v, v1⟩ ≠ 0. A minimal sketch follows.
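A minimal sketch of this power method; a random initialization makes ⟨v, v1⟩ ≠ 0 hold almost surely:

```python
# Minimal sketch of the matrix power method v -> Mv / ||Mv||.
import numpy as np

def matrix_power_method(M, num_iters=200, seed=0):
    """Return (an approximation of) the top eigenvector of symmetric M."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=M.shape[0])
    v /= np.linalg.norm(v)            # random init: <v, v1> != 0 a.s.
    for _ in range(num_iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v
```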

SLIDES 54-57

Tensor Rank and Tensor Decomposition

Rank-1 tensor: T = w · a ⊗ b ⊗ c ⇔ T(i, j, l) = w · a(i) · b(j) · c(l).

CANDECOMP/PARAFAC (CP) Decomposition

T = Σ_{j∈[k]} w_j a_j ⊗ b_j ⊗ c_j ∈ R^{d×d×d},  a_j, b_j, c_j ∈ S^{d−1}.

[Figure: tensor T drawn as w1 · a1 ⊗ b1 ⊗ c1 + w2 · a2 ⊗ b2 ⊗ c2 + · · · .]

k: tensor rank; d: ambient dimension. k ≤ d: undercomplete; k > d: overcomplete. This talk: guarantees for overcomplete tensor decomposition.

SLIDE 58

Background on Tensor Decomposition

T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i,  a_i, b_i, c_i ∈ S^{d−1}.

Theoretical Guarantees

Tensor decompositions in psychometrics (Cattell '44). CP tensor decomposition (Harshman '70; Carroll & Chang '70). Identifiability of CP tensor decomposition (Kruskal '76). Orthogonal decomposition (Zhang & Golub '01; Kolda '01; Anandkumar et al. '12). Tensor decomposition through (lifted) linear equations (De Lathauwer '07): works for overcomplete tensors. Tensor decomposition through simultaneous diagonalization: perturbation analysis (Goyal et al. '13; Bhaskara et al. '13).

SLIDE 59

Background on Tensor Decompositions (contd.)

T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i,  a_i, b_i, c_i ∈ S^{d−1}.

Practice: Alternating Least Squares (ALS)

Let A = [a1 | a2 | · · · | ak], and similarly B, C. Fix the estimates of two of the modes (say A and B) and re-estimate the third. Iterative updates, low computational complexity. No theoretical guarantees. In this talk: analysis of alternating minimization.

SLIDES 60-62

Tensors as Multilinear Transformations

Tensor T ∈ R^{d×d×d}, vectors v, w ∈ R^d:

T(I, v, w) := Σ_{j,l∈[d]} v_j w_l T(:, j, l) ∈ R^d.

For a matrix M ∈ R^{d×d}: M(I, w) = M w = Σ_{l∈[d]} w_l M(:, l) ∈ R^d.
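A one-line NumPy sketch of this contraction, with the matrix special case checked alongside (all names illustrative):

```python
# Sketch: the multilinear transformation from the slide,
# T(I, v, w) = sum_{j,l} v_j w_l T(:, j, l), a vector in R^d.
import numpy as np

def contract(T, v, w):
    """T(I, v, w) for a third-order tensor T."""
    return np.einsum('ijl,j,l->i', T, v, w)

d = 4
rng = np.random.default_rng(2)
T, v, w = rng.normal(size=(d, d, d)), rng.normal(size=d), rng.normal(size=d)
M = rng.normal(size=(d, d))

# Matrix special case for comparison: M(I, w) = M w.
assert np.allclose(np.einsum('il,l->i', M, w), M @ w)
print(contract(T, v, w).shape)   # (4,)
```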

SLIDES 63-66

Challenges in Tensor Decomposition

Symmetric tensor T ∈ R^{d×d×d}: T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.

Challenges in tensors

A decomposition may not always exist for general tensors. Finding the decomposition is NP-hard in general.

Tractable case: orthogonal tensor decomposition (⟨v_i, v_j⟩ = 0 for i ≠ j)

The {v_i} are eigenvectors: T(I, v_i, v_i) = λ_i v_i. Bad news: there can be other eigenvectors (unlike the matrix case). For λ_i ≡ 1, v = (v1 + v2)/√2 satisfies T(I, v, v) = v/√2. How do we avoid such spurious solutions (not part of the decomposition)?

SLIDES 67-72

Orthogonal Tensor Power Method

Symmetric orthogonal tensor T ∈ R^{d×d×d}: T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.

Recall the matrix power method: v → M(I, v) / ‖M(I, v)‖.

Algorithm: tensor power method: v → T(I, v, v) / ‖T(I, v, v)‖.

  • The {v_i}'s are the only robust fixed points.
  • All other eigenvectors are saddle points.

For an orthogonal tensor, no spurious local optima! A minimal sketch follows.
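A minimal sketch of the tensor power method, checked on a toy orthogonally decomposable tensor built from the standard basis (the toy setup is an assumption for illustration):

```python
# Minimal sketch of the tensor power method
# v -> T(I, v, v) / ||T(I, v, v)|| for a symmetric, orthogonally
# decomposable tensor T = sum_i lambda_i v_i^(x)3.
import numpy as np

def tensor_power_method(T, num_iters=100, seed=0):
    """Recover one robust fixed point (a component v_i) of T."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=T.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = np.einsum('ijl,j,l->i', T, v, v)    # T(I, v, v)
        v /= np.linalg.norm(v)
    lam = np.einsum('ijl,i,j,l->', T, v, v, v)  # eigenvalue lambda_i
    return lam, v

# Toy check: T = sum_i lambda_i e_i^(x)3 with the standard basis;
# the method should return one of the e_i (which one depends on the init).
d = 3
lams = np.array([1.0, 2.0, 3.0])
T = np.einsum('i,ij,ik,il->jkl', lams, np.eye(d), np.eye(d), np.eye(d))
print(tensor_power_method(T))
```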

SLIDES 73-76

Matrix vs. Tensor Power Iteration

Matrix power iteration:

1. Requires a gap between the largest and second-largest eigenvalue. A property of the matrix only.
2. Converges to the top eigenvector.
3. Linear convergence: needs O(log(1/ǫ)) iterations.

Tensor power iteration:

1. Requires a gap between the largest and second-largest λ_i |c_i|, where the initialization vector is v = Σ_i c_i v_i. A property of the tensor and of the initialization v.
2. Converges to the v_i for which λ_i |c_i| is largest: could be any of them.
3. Quadratic convergence: needs O(log log(1/ǫ)) iterations.

SLIDES 77-79

Beyond Orthogonal Tensor Decomposition

Limitations

Not ALL tensors have an orthogonal decomposition (unlike matrices). Orthogonal forms cannot handle overcomplete tensors (k > d). Overcomplete representations: redundancy leads to flexible modeling and noise resistance, with no domain knowledge needed.

Undercomplete tensors (k ≤ d) with full-rank components

Non-orthogonal decomposition T1 = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Whitening matrix W; multilinear transform: T2 = T1(W, W, W). Limitations: depends on the condition number, sensitive to noise. A sketch of the whitening step follows.

[Figure: W maps the components a1, a2, a3 of T1 to orthogonal components v1, v2, v3 of T2.]

This talk: guarantees for overcomplete tensor decomposition.
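A sketch of the whitening step under the stated undercomplete assumption (k ≤ d), assuming the second moment M2 = Σ_i w_i a_i a_i^⊤ is available alongside T1:

```python
# Sketch: whitening for the undercomplete case. Build W with
# W^T M2 W = I_k, so that T2 = T1(W, W, W) = sum_i lambda_i v_i^(x)3
# with orthonormal v_i = sqrt(w_i) W^T a_i and lambda_i = w_i^{-1/2}.
import numpy as np

def whiten(M2, T1, k):
    """Return the orthogonalized tensor T2 and the whitening matrix W."""
    vals, vecs = np.linalg.eigh(M2)
    vals, vecs = vals[-k:], vecs[:, -k:]          # top-k eigenpairs of M2
    W = vecs / np.sqrt(vals)                      # W = U D^{-1/2}, shape (d, k)
    T2 = np.einsum('abc,ai,bj,cl->ijl', T1, W, W, W)
    return T2, W
```

The conditioning caveat on the slide shows up here directly: dividing by the square roots of the eigenvalues of M2 amplifies noise when M2 is ill-conditioned.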

SLIDE 80

Outline

1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion

SLIDE 81

Non-orthogonal Tensor Decomposition

Multiview linear mixture model. Linear model: E[x1|h] = a_h, E[x2|h] = b_h, E[x3|h] = c_h. E[x1 ⊗ x2 ⊗ x3] = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i.

[Graphical model: hidden h with observed views x1, x2, x3, . . .]

SLIDES 82-84

Non-orthogonal Tensor Decomposition

T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i,  a_i, b_i, c_i ∈ S^{d−1}.

Practice: Alternating Least Squares (ALS)

Many spurious local optima. No theoretical guarantee.

Rank-1 ALS (Best Rank-1 Approximation)

min_{a,b,c∈S^{d−1}, w∈R} ‖T − w · a ⊗ b ⊗ c‖_F.

Fix a^(t), b^(t) and update ⇒ c^(t+1) ∝ T(a^(t), b^(t), I). The rank-1 ALS iteration ≡ asymmetric power iteration.

SLIDES 85-88

Alternating Minimization

Rank-1 ALS iteration (power iteration)

Initialization: a^(0), b^(0), c^(0). Update in step t: fix a^(t), b^(t) and set c^(t+1) ∝ T(a^(t), b^(t), I). After (approximate) convergence, restart. Simple update: trivially parallelizable and hence scalable. Computation linear in the dimension, the rank, and the number of different runs. A sketch appears after the list of challenges below.

Challenges

Optimization problem: non-convex, multiple local optima. Does alternating minimization improve the objective in each step? Recovery of the a_i, b_i, c_i's? Not true in general. Noisy tensor decomposition. Natural conditions under which Alt-Min has guarantees?
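A minimal sketch of one run of the alternating rank-1 update; restarting with fresh seeds plays the role of the multiple random initializations discussed later in the talk:

```python
# Sketch: the alternating rank-1 update (asymmetric power iteration).
# Fix two modes, update the third via the contraction T(a, b, I),
# normalize, and cycle; restart from a fresh initialization afterwards.
import numpy as np

def rank1_als(T, num_iters=50, seed=0):
    """One run: estimate a single rank-1 component (w, a, b, c) of T."""
    rng = np.random.default_rng(seed)
    a, b, c = (rng.normal(size=s) for s in T.shape)
    for _ in range(num_iters):
        a = np.einsum('ijl,j,l->i', T, b, c); a /= np.linalg.norm(a)
        b = np.einsum('ijl,i,l->j', T, a, c); b /= np.linalg.norm(b)
        c = np.einsum('ijl,i,j->l', T, a, b); c /= np.linalg.norm(c)
    w = np.einsum('ijl,i,j,l->', T, a, b, c)   # best rank-1 weight
    return w, a, b, c
```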

SLIDES 89-91

Special Case: Orthogonal Setting

T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i,  a_i, b_i, c_i ∈ S^{d−1}, with ⟨a_i, a_j⟩ = 0 for i ≠ j; similarly for b, c.

Alternating updates: c^(t+1) ∝ T(a^(t), b^(t), I) = Σ_{i∈[k]} w_i ⟨a_i, a^(t)⟩ ⟨b_i, b^(t)⟩ c_i.

The (a_i, b_i, c_i) are stationary points, and the ONLY local optima of the best rank-1 approximation problem. Guaranteed recovery through alternating minimization. Perturbation analysis [AGH+2012]: under poly(d) random initializations and bounded-noise conditions.

SLIDES 92-93

Our Setup

So far

General tensor decomposition: NP-hard. Orthogonal tensors: too limiting. Tractable cases? Covering overcomplete tensors?

Our framework: Incoherent Components

|⟨a_i, a_j⟩| = O(1/√d) for i ≠ j; similarly for b, c. Can handle overcomplete tensors. Satisfied by random (generic) vectors. Guaranteed recovery for alternating minimization?

"Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates" by A. Anandkumar, R. Ge and M. Janzamin, Feb. 2014.

SLIDES 94-96

Analysis of One-Step Update

T = Σ_{i∈[k]} w_i a_i ⊗ b_i ⊗ c_i,  a_i, b_i, c_i ∈ S^{d−1}.

Basic Intuition

Let â, b̂ be "close to" a1, b1. Alternating update:

ĉ ∝ T(â, b̂, I) = Σ_{i∈[k]} w_i ⟨a_i, â⟩ ⟨b_i, b̂⟩ c_i = w1 ⟨a1, â⟩ ⟨b1, b̂⟩ c1 + T_{−1}(â, b̂, I),

where T_{−1} collects the terms with i ≠ 1. T_{−1}(â, b̂, I) = 0 in the orthogonal case when â = a1, b̂ = b1. Can it be controlled for incoherent (random) vectors?

SLIDES 97-98

Results for One-Step Update

Incoherence: |⟨a_i, a_j⟩| = O(1/√d) for i ≠ j; similarly for b, c. Spectral norm: ‖A‖, ‖B‖, ‖C‖ ≤ 1 + O(√(k/d)); ‖T‖ ≤ 1 + o(1). Tensor rank: k = o(d^1.5). Weights: for simplicity, w_i ≡ 1.

Lemma [AGJ2014]

For small enough ǫ such that max{‖a1 − â‖, ‖b1 − b̂‖} ≤ ǫ, after one step

‖c1 − ĉ‖ ≤ O(√k / d) + O(max{1/√d, k/d^1.5}) · (ǫ + ǫ²).

√k/d: approximation error; the remaining terms: error contraction.

SLIDES 99-101

Main Result: Local Convergence

Initialization: max{‖a1 − â^(0)‖, ‖b1 − b̂^(0)‖} ≤ ǫ0, with ǫ0 < some constant. Noise: T̂ := T + E, with ‖E‖ ≤ 1/polylog(d). Rank: k = o(d^1.5). Recovery error: ǫ_R := ‖E‖ + Õ(√k / d).

Theorem (Local Convergence) [AGJ2014]

After N = O(log(1/ǫ_R)) steps of alternating rank-1 updates, ‖a1 − â^(N)‖ = O(ǫ_R).

Linear convergence: up to the approximation error. Guarantees for overcomplete tensors: k = o(d^1.5), and for pth-order tensors k = o(d^{p/2}). Requires a good initialization. What about global convergence?

SLIDES 102-104

Global Convergence, k = O(d)

SVD Initialization

Find the top singular vectors of T(I, I, θ) for θ ∼ N(0, I). Use them for initialization. L trials. (A sketch follows below.)

Assumptions

Number of initializations: L ≥ k^{Ω((k/d)²)}. Tensor rank: k = O(d).

  • No. of iterations: N = Θ(log(1/ǫ_R)). Recall ǫ_R: recovery error.

Theorem (Global Convergence) [AGJ2014]: ‖a1 − â^(N)‖ ≤ O(ǫ_R).
SLIDE 105

Global Convergence, k = O(d)

Corollary: Differing Dimensions

If a_i, b_i ∈ R^{d_u} and c_i ∈ R^{d_o}, with d_u ≥ k ≥ d_o: k = O(√(d_u d_o)) for incoherent vectors; k = O(d_u) if A, B are orthogonal. Same guarantees. Can handle one overcomplete mode.

SLIDE 106

Latest Result: Global Convergence

Assume Gaussian component means a_i. Improved initialization requirement for convergence of the third-order tensor power iteration: |⟨a1, â^(0)⟩| ≥ d^β · √k/d, for β > (log d)^{−c}.

Spherical Gaussian Mixture or Multiview Mixture Model

Initialize with samples whose noise norm is bounded by √d·σ, with σ = o(√(d/k)).

"Analyzing Tensor Power Method Dynamics: Applications to Learning Overcomplete Latent Variable Models" by A. Anandkumar, R. Ge and M. Janzamin, Nov. 2014.

SLIDE 107

Outline

1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion

SLIDES 108-109

High-level Intuition for Sample Bounds

Multi-view model: x1 = A h + z1, where z1 is noise. Exact moment: T = Σ_i w_i a_i ⊗ b_i ⊗ c_i. Sample moment: T̂ = (1/n) Σ_i x1^(i) ⊗ x2^(i) ⊗ x3^(i).

Naive idea: ‖T̂ − T‖ ≤ ‖mat(T̂) − mat(T)‖; apply the matrix Bernstein inequality.

Our idea: careful ǫ-net covering for ‖T̂ − T‖. T̂ − T has many terms, e.g., the all-noise term (1/n) Σ_i z1^(i) ⊗ z2^(i) ⊗ z3^(i) and the signal-noise cross terms. Need to bound (1/n) Σ_i ⟨z1^(i), u⟩ ⟨z2^(i), v⟩ ⟨z3^(i), w⟩ for all u, v, w ∈ S^{d−1}. Classify the inner products into buckets and bound each bucket separately. Tight sample bounds for a range of latent variable models.

"Provable Learning of Overcomplete Latent Variable Models: Semi-supervised and Unsupervised Settings" by A. Anandkumar, R. Ge and M. Janzamin, Aug. 2014.

SLIDE 110

Outline

1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion

SLIDE 111

Synthetic Experiments

Learning a multiview Gaussian mixture. Random mixture components. d = 100, k ∈ {10, 20, 50, 100, 200, 500}, n = 1000. Random initialization.

[Plot: ratio of recovered components (recovery rate of the algorithm) vs. number of initializations, for d = 100 and each value of k.]

SLIDE 112

Outline

1. Introduction
2. Summary of Results
3. Recap of Orthogonal Matrix and Tensor Decomposition
4. Overcomplete (Non-Orthogonal) Tensor Decomposition
5. Sample Complexity Analysis
6. Numerical Results
7. Conclusion

SLIDES 113-115

Conclusion

Learning overcomplete latent variable models.

⋆ Method-of-moments. ⋆ Tensor power iteration.

Robustness to noise. Sample complexity bounds for a range of LVMs.

⋆ Unsupervised setting. ⋆ Semi-supervised setting.

Coming: removing the approximation error Õ(√k / d).

Thank you!