SLIDE 1

Learning Mixtures of Spherical Gaussians:
Moment Methods and Spectral Decompositions

Daniel Hsu and Sham M. Kakade
Microsoft Research, New England

Also based on work with Anima Anandkumar (UCI), Rong Ge (Princeton), Matus Telgarsky (UCSD).

SLIDE 3

Unsupervised machine learning

◮ Many applications in machine learning and statistics:
◮ Lots of high-dimensional data, but mostly unlabeled.
◮ Unsupervised learning: discover interesting structure of the population from unlabeled data.
◮ This talk: learn about sub-populations in a data source.

SLIDE 6

Learning mixtures of Gaussians

Mixture of Gaussians:  ∑_{i=1}^k w_i N(µ_i, Σ_i)

k sub-populations, each modeled as a multivariate Gaussian N(µ_i, Σ_i) together with a mixing weight w_i.

Goal: efficient algorithm that approximately recovers the parameters from samples.

(Alternative goal: density estimation. Not in this talk.)
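To make the generative model concrete, here is a minimal sketch in Python/NumPy of the spherical special case used throughout the talk (the dimensions, weights, and seed are illustrative choices, not values from the talk):

```python
import numpy as np

def sample_spherical_gmm(n, weights, means, sigmas, rng):
    """Draw n points from the mixture sum_i w_i N(mu_i, sigma_i^2 I)."""
    d = means.shape[1]
    # Hidden component assignments ("labels"), drawn with probabilities w_i.
    labels = rng.choice(len(weights), size=n, p=weights)
    # Spherical Gaussian noise around the chosen component's mean.
    X = means[labels] + sigmas[labels, None] * rng.standard_normal((n, d))
    return X, labels

rng = np.random.default_rng(0)
k, d = 3, 10
weights = np.array([0.5, 0.3, 0.2])
means = rng.standard_normal((k, d))   # mu_1, ..., mu_k as rows
sigmas = np.array([1.0, 0.5, 1.5])    # sigma_1, ..., sigma_k
X, labels = sample_spherical_gmm(5000, weights, means, sigmas, rng)
```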

SLIDE 11

Learning setup

◮ Input: i.i.d. sample S ⊂ R^d from an unknown mixture of Gaussians with parameters θ⋆ := {(µ_i⋆, Σ_i⋆, w_i⋆) : i ∈ [k]}.
◮ Each data point is drawn from one of the k Gaussians N(µ_i⋆, Σ_i⋆) (component i is chosen with probability w_i⋆).
◮ But the "labels" are not observed.
◮ Goal: estimate parameters θ = {(µ_i, Σ_i, w_i) : i ∈ [k]} such that θ ≈ θ⋆.
◮ In practice: local search for maximum-likelihood parameters (the E-M algorithm); a sketch follows below.
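As a point of reference for the last bullet, a hedged sketch of that E-M baseline using scikit-learn's off-the-shelf implementation (this is the local-search approach, not the moment method developed in this talk; X is any (n, d) data matrix, e.g., the sample drawn in the earlier sketch):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data; in practice X would be the observed sample S.
X = np.random.default_rng(0).standard_normal((5000, 10))

# E-M local search for maximum-likelihood spherical-mixture parameters;
# several random restarts (n_init), since E-M only finds a local optimum.
em = GaussianMixture(n_components=3, covariance_type="spherical", n_init=10)
em.fit(X)
# Estimates: em.weights_ (w_i), em.means_ (mu_i), em.covariances_ (sigma_i^2).
```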

SLIDE 13

When are there efficient algorithms?

Well-separated mixtures: estimation is easier if there is a large minimum separation between the component means (Dasgupta, '99):

    sep := min_{i≠j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}.

◮ sep = Ω(d^c) or sep = Ω(k^c): simple clustering methods, perhaps after dimension reduction.
  (Dasgupta, '99; Vempala-Wang, '02; and many more.)

Recent developments:
◮ No minimum separation requirement, but current methods require exp(Ω(k)) running time / sample size.
  (Kalai-Moitra-Valiant, '10; Belkin-Sinha, '10; Moitra-Valiant, '10)
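A small sketch of the separation statistic itself (the function name is mine, for illustration):

```python
import numpy as np

def separation(means, sigmas):
    """sep = min over i != j of ||mu_i - mu_j|| / max(sigma_i, sigma_j)."""
    k = len(means)
    return min(
        np.linalg.norm(means[i] - means[j]) / max(sigmas[i], sigmas[j])
        for i in range(k) for j in range(i + 1, k)
    )
```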

SLIDE 16

Overcoming barriers to efficient estimation

Information-theoretic barrier: Gaussian mixtures in R^1 can require exp(Ω(k)) samples to estimate parameters, even when the components are well-separated (Moitra-Valiant, '10).

These hard instances are degenerate in high dimensions!

Our result: efficient algorithms for non-degenerate models in high dimensions (d ≥ k) with spherical covariances.

SLIDE 17

Main result

Theorem (H-Kakade, '13)
Assume {µ_1⋆, µ_2⋆, . . . , µ_k⋆} are linearly independent, w_i⋆ > 0 for all i ∈ [k], and Σ_i⋆ = σ_i⋆² I for all i ∈ [k].
There is an algorithm that, given independent draws from a mixture of k spherical Gaussians, returns ε-accurate parameters (up to permutation, under the ℓ2 metric) w.h.p. The running time and sample complexity are poly(d, k, 1/ε, 1/w_min, 1/λ_min), where λ_min := k-th largest singular value of [µ_1⋆ | µ_2⋆ | · · · | µ_k⋆].

(Also using new techniques from Anandkumar-Ge-H-Kakade-Telgarsky, '12.)
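The non-degeneracy quantity λ_min in the theorem is just a smallest (k-th largest) singular value; a sketch of computing it for given means (assuming d ≥ k):

```python
import numpy as np

def lambda_min(means):
    """k-th largest singular value of the d x k matrix [mu_1 | ... | mu_k]."""
    A = np.asarray(means).T   # columns are the means; shape (d, k)
    # numpy returns singular values in descending order, so with d >= k
    # the last of the k values is the k-th largest.
    return np.linalg.svd(A, compute_uv=False)[-1]
```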

SLIDE 18

2. Learning algorithm

Outline: Introduction · Learning algorithm (Method-of-moments; Choice of moments; Solving the moment equations) · Concluding remarks

SLIDE 21

Method-of-moments

Let S ⊂ R^d be an i.i.d. sample from an unknown mixture of spherical Gaussians:

    x ∼ ∑_{i=1}^k w_i⋆ N(µ_i⋆, σ_i⋆² I).

Estimation via method-of-moments (Pearson, 1894): find parameters θ such that

    E_θ[p(x)] ≈ Ê_{x∈S}[p(x)]

for some functions p : R^d → R (typically multivariate polynomials).

Q1 Which moments to use?
Q2 How to (approximately) solve the moment equations?
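The right-hand sides of the moment equations are plain sample averages; a minimal sketch (the test functions p here are illustrative):

```python
import numpy as np

def empirical_moment(X, p):
    """Estimate E[p(x)] by averaging p over the rows of the sample X."""
    return np.mean([p(x) for x in X], axis=0)

X = np.random.default_rng(0).standard_normal((5000, 10))   # placeholder sample
mean_hat = empirical_moment(X, lambda x: x)                 # 1st-order: E[x]
second_hat = empirical_moment(X, lambda x: np.outer(x, x))  # 2nd-order: E[x ⊗ x]
```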

slide-30
SLIDE 30

Which moments to use?

    moment order    reliable estimates?    unique solution?
    1st, 2nd        ✓                      ✗
    Ω(k)th          ✗                      ✓

1st- and 2nd-order moments (e.g., mean, covariance):
◮ Fairly easy to get reliable estimates: Ê_{x∈S}[x ⊗ x] ≈ E_θ⋆[x ⊗ x].
◮ But the moment equations can have multiple solutions:
  E_θ1[x ⊗ x] ≈ Ê_{x∈S}[x ⊗ x] ≈ E_θ2[x ⊗ x] with θ1 ≠ θ2.

Ω(k)th-order moments (e.g., E_θ[degree-k poly(x)]):
◮ Uniquely pin down the solution.
◮ But empirical estimates are very unreliable.

Can we get the best of both worlds? Yes! In high dimensions (d ≥ k), low-order multivariate moments suffice (1st-, 2nd-, and 3rd-order moments).

[Figure: moment orders on a line — 1st/2nd used by Achlioptas-McSherry '05, Vempala-Wang '02, Chaudhuri-Rao '08; Ω(k)th used by Prony 1795, Lindsay '89, Belkin-Sinha '10, Moitra-Valiant '10; this work uses low-order moments.]

SLIDE 34

Structure of low-order multivariate moments

Second- and third-order multivariate moments:

    E_θ[x ⊗ x] = ∑_{i=1}^k w_i µ_i ⊗ µ_i + some sparse matrix;
    E_θ[x ⊗ x ⊗ x] = ∑_{i=1}^k w_i µ_i ⊗ µ_i ⊗ µ_i + some sparse tensor.

Trick: the "sparse stuff" can be estimated and thus removed.

Upshot: the following can be readily estimated (by estimates M̂, T̂):

    M_θ⋆ := ∑_{i=1}^k w_i⋆ µ_i⋆ ⊗ µ_i⋆   and   T_θ⋆ := ∑_{i=1}^k w_i⋆ µ_i⋆ ⊗ µ_i⋆ ⊗ µ_i⋆.

Claim: {(µ_i, w_i)} are uniquely determined by M_θ and T_θ.
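To make the structure concrete, a sketch building the population quantities M_θ and T_θ directly from parameters (einsum just spells out the weighted sums of outer products):

```python
import numpy as np

def moment_matrix_and_tensor(weights, means):
    """M = sum_i w_i mu_i ⊗ mu_i (d x d); T = sum_i w_i mu_i ⊗ mu_i ⊗ mu_i (d x d x d)."""
    M = np.einsum("i,ia,ib->ab", weights, means, means)
    T = np.einsum("i,ia,ib,ic->abc", weights, means, means, means)
    return M, T
```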

SLIDE 37

Variational argument for parameter uniqueness

View M_θ : R^d × R^d → R and T_θ : R^d × R^d × R^d → R as bilinear and trilinear functions.

Lemma
If {µ_i} are linearly independent and all w_i > 0, then each of the k distinct, isolated local maximizers u∗ of

    max_{u ∈ R^d} T_θ(u, u, u)  s.t.  M_θ(u, u) ≤ 1

satisfies, for some i ∈ [k],

    M_θ(·, u∗) = √w_i µ_i,    T_θ(u∗, u∗, u∗) = 1/√w_i.

∴ {(µ_i, w_i) : i ∈ [k]} are uniquely determined by M_θ and T_θ.

SLIDE 41

Main idea for variational lemma

    max_{u ∈ R^d} ∑_{i=1}^k w_i ⟨µ_i, u⟩³  s.t.  ∑_{i=1}^k w_i ⟨µ_i, u⟩² ≤ 1

Maximizers are directions u∗ orthogonal to all but one µ_j.

[Figure: vectors µ_1, µ_2, µ_3, with u∗ orthogonal to µ_2 and µ_3 but not µ_1.]

Combine with the constraint w_j ⟨µ_j, u∗⟩² ≤ 1 to get

    M u∗ = (∑_{i=1}^k w_i µ_i ⊗ µ_i) u∗ = ∑_{i=1}^k w_i µ_i ⟨µ_i, u∗⟩ = ±√w_j µ_j.
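A quick numeric sanity check of this picture under random parameters (an assumed toy setup, not from the talk): build u∗ orthogonal to all but µ_1, scale it so the constraint is tight, and compare against the lemma's claimed values.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 3, 6
w = np.array([0.5, 0.3, 0.2])
mus = rng.standard_normal((k, d))

M = np.einsum("i,ia,ib->ab", w, mus, mus)
T = np.einsum("i,ia,ib,ic->abc", w, mus, mus, mus)

# u orthogonal to mu_2 and mu_3: project mu_1 off the span of {mu_2, mu_3}.
Q, _ = np.linalg.qr(mus[1:].T)   # orthonormal basis of span{mu_2, mu_3}
u = mus[0] - Q @ (Q.T @ mus[0])
u /= np.sqrt(u @ M @ u)          # scale so that M(u, u) = 1

print(np.allclose(M @ u, np.sqrt(w[0]) * mus[0]))    # M(., u*) = sqrt(w_1) mu_1
print(np.isclose(np.einsum("abc,a,b,c->", T, u, u, u),
                 1 / np.sqrt(w[0])))                 # T(u*,u*,u*) = 1/sqrt(w_1)
```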

slide-47
SLIDE 47

How to solve the moment equations?

Effectively want to solve

    min_θ ‖T_θ − T̂‖²  s.t.  M_θ = M̂.    (†)

Not convex in the parameters θ = {(µ_i, w_i)}.

What we do: find one component (µ_i, w_i) at a time, using local optimization of a related (also non-convex) objective function:

    max_{u ∈ R^d} T̂(u, u, u)  s.t.  M̂(u, u) ≤ 1.    (‡)

(In coordinates: max_u ∑_{i,j,k} T̂_{i,j,k} u_i u_j u_k  s.t.  ∑_{i,j} M̂_{i,j} u_i u_j ≤ 1.)

[Figure: the local maximizers u∗_1, u∗_2, u∗_3 of (‡), each corresponding to a component (µ_1⋆, w_1⋆), (µ_2⋆, w_2⋆), (µ_3⋆, w_3⋆).]

New robust algorithm for "tensor eigen-decomposition" efficiently approximates all local optima, each corresponding to a component → near-optimal solution to (†).
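A minimal sketch of one standard way to realize this: whiten T with M and run tensor power iteration with deflation (my simplification of the robust tensor eigen-decomposition; it assumes exact moments, uses a fixed iteration budget, and omits the restart/robustness checks discussed on the next slide):

```python
import numpy as np

def tensor_power_decomposition(M, T, k, n_iters=100, rng=None):
    """Recover {(w_i, mu_i)} from M = sum_i w_i mu_i^(⊗2), T = sum_i w_i mu_i^(⊗3)."""
    rng = rng or np.random.default_rng(0)
    # Whitening: W^T M W = I_k, using the top-k eigenpairs of M.
    evals, evecs = np.linalg.eigh(M)                 # ascending eigenvalues
    U, D = evecs[:, -k:], evals[-k:]
    W = U / np.sqrt(D)                               # d x k
    Tw = np.einsum("abc,ai,bj,ck->ijk", T, W, W, W)  # whitened k x k x k tensor

    weights, means = [], []
    for _ in range(k):
        v = rng.standard_normal(k)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):                     # power step: v <- Tw(I, v, v)
            v = np.einsum("ijk,j,k->i", Tw, v, v)
            v /= np.linalg.norm(v)
        lam = np.einsum("ijk,i,j,k->", Tw, v, v, v)  # eigenvalue = 1/sqrt(w_i)
        weights.append(1.0 / lam**2)
        means.append(lam * (U * np.sqrt(D)) @ v)     # un-whiten: mu_i = lam (W^T)^+ v
        Tw = Tw - lam * np.einsum("i,j,k->ijk", v, v, v)  # deflate the found component
    return np.array(weights), np.array(means)
```

On exact M_θ⋆ and T_θ⋆ (e.g., from the sketch after slide 34) this recovers the components up to permutation; with estimated M̂ and T̂ one needs the initialization checks described next.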

SLIDE 52

Local optimization

Want to find all local maximizers of

    max_{u ∈ R^d} T̂(u, u, u)  s.t.  M̂(u, u) ≤ 1.    (‡)

Must address initialization and convergence issues. Crucially uses the special tensor structure of T̂ ≈ T_θ⋆, together with the non-linearity of u ↦ T̂(·, u, u):

◮ Random initialization is good with significant probability.
  ("Good" ⇒ simple iteration will quickly converge to some local max.)
◮ Can check whether an initialization was good by checking the objective value after a few steps.
◮ If the value is large enough: the initialization was good; improve by taking a few more steps.
◮ Else: abandon and restart (see the sketch below).
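A sketch of this restart-and-check loop around the power step from the previous sketch (the threshold and step counts are illustrative placeholders, not the paper's settings):

```python
import numpy as np

def power_steps(Tw, v, n_steps):
    """A few normalized tensor power steps v <- Tw(I, v, v)."""
    for _ in range(n_steps):
        v = np.einsum("ijk,j,k->i", Tw, v, v)
        v /= np.linalg.norm(v)
    return v

def find_one_maximizer(Tw, threshold, rng, n_trial=10, n_final=100):
    """Restart from random initializations until one looks good, then polish."""
    while True:
        v = rng.standard_normal(Tw.shape[0])
        v /= np.linalg.norm(v)
        v = power_steps(Tw, v, n_trial)                # a few steps from random init
        value = np.einsum("ijk,i,j,k->", Tw, v, v, v)  # objective after a few steps
        if value >= threshold:                         # large value => init was good
            return power_steps(Tw, v, n_final)         # improve with a few more steps
        # else: abandon this initialization and restart
```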

SLIDE 53

3. Concluding remarks

Outline: Introduction · Learning algorithm · Concluding remarks (Open problems and summary)

SLIDE 57

Some open problems

◮ Can also handle mixtures of Gaussians with somewhat more general covariances, under incoherence conditions:

    E_θ[x ⊗ x] = ∑_{i=1}^k w_i µ_i ⊗ µ_i (low-rank) + some sparse matrix

◮ Question #1: What about mixtures of Gaussians with arbitrary covariances?
◮ Question #2: How to handle degenerate cases / k ≫ d?
  (Practical relevance: automatic speech recognition.)

SLIDE 62

Summary

◮ Learning mixtures of spherical Gaussians: worst-case (information-theoretically) hard, but non-degenerate cases are easy.
◮ Structure in low-order multivariate moments uniquely determines the model parameters under a natural non-degeneracy condition; this permits a computationally efficient estimation algorithm.
◮ Similar story for many other statistical models (e.g., HMMs (Mossel-Roch, '06; H-Kakade-Zhang, '09), topic models (Arora-Ge-Moitra, '12; Anandkumar et al., '12), ICA (Arora et al., '12)).
◮ Open problem: efficient estimators for highly over-complete and general mixture models (k ≫ d).

SLIDE 63

Thanks!

Related survey/overview paper:
◮ Tensor decompositions for latent variable models (with Anandkumar, Ge, Kakade, and Telgarsky): http://arxiv.org/abs/1210.7559

SLIDE 64

Structure of low-order moments

◮ First-order moments:

    E[x] = ∑_{i=1}^k w_i µ_i.

◮ Second-order moments:

    E[x ⊗ x] = ∑_{i=1}^k w_i µ_i ⊗ µ_i + σ̄² I,  where σ̄² := ∑_{i=1}^k w_i σ_i².

Fact: σ̄² is the smallest eigenvalue of Cov(x) = E[x ⊗ x] − E[x] ⊗ E[x].
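This fact immediately suggests an estimator; a minimal sketch (valid when d ≥ k, so that Cov(x) really attains the eigenvalue σ̄²):

```python
import numpy as np

def estimate_avg_variance(X):
    """Estimate sigma_bar^2 as the smallest eigenvalue of the sample covariance."""
    cov = np.cov(X, rowvar=False)      # empirical E[x ⊗ x] - E[x] ⊗ E[x]
    return np.linalg.eigvalsh(cov)[0]  # eigvalsh returns ascending eigenvalues
```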

SLIDE 65

Structure of low-order moments

◮ Third-order moments:

    E[x ⊗ x ⊗ x] = ∑_{i=1}^k w_i µ_i ⊗ µ_i ⊗ µ_i + ∑_{i=1}^d (m ⊗ e_i ⊗ e_i + e_i ⊗ m ⊗ e_i + e_i ⊗ e_i ⊗ m),

  where m := ∑_{i=1}^k w_i σ_i² µ_i.

Fact: m = E[(u⊤(x − E[x]))² x] for any unit-norm eigenvector u of Cov(x) corresponding to the eigenvalue σ̄².
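Putting slides 64-65 together, a sketch of how one could assemble the corrected estimates M̂ ≈ ∑ w_i µ_i ⊗ µ_i and T̂ ≈ ∑ w_i µ_i ⊗ µ_i ⊗ µ_i from a sample (my direct transcription of the stated formulas; a careful implementation would handle eigenvalue multiplicity and finite-sample error):

```python
import numpy as np

def corrected_moments(X):
    """Form M_hat and T_hat by subtracting the 'sparse stuff' from raw moments."""
    n, d = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    sigma2 = evals[0]            # sigma_bar^2: smallest eigenvalue of Cov(x)
    u = evecs[:, 0]              # a unit eigenvector for that eigenvalue

    # M_hat = E[x ⊗ x] - sigma_bar^2 I
    M = X.T @ X / n - sigma2 * np.eye(d)

    # m = E[(u^T (x - E[x]))^2 x]
    proj2 = ((X - mean) @ u) ** 2
    m = (proj2[:, None] * X).mean(axis=0)

    # T_hat = E[x⊗x⊗x] - sum_i (m⊗e_i⊗e_i + e_i⊗m⊗e_i + e_i⊗e_i⊗m)
    T = np.einsum("na,nb,nc->abc", X, X, X) / n
    eye = np.eye(d)
    T -= (np.einsum("a,bc->abc", m, eye)
          + np.einsum("b,ac->abc", m, eye)
          + np.einsum("c,ab->abc", m, eye))
    return M, T
```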

slide-70
SLIDE 70

Proof idea for optimization lemma

    max_{u ∈ R^d} T(u, u, u)  s.t.  M(u, u) ≤ 1

⇔   max_{u ∈ R^d} ∑_{i=1}^k w_i ⟨µ_i, u⟩³  s.t.  ∑_{i=1}^k w_i ⟨µ_i, u⟩² ≤ 1

⇔   max_{θ ∈ R^k} ∑_{i=1}^k (1/√w_i) θ_i³  s.t.  ∑_{i=1}^k θ_i² ≤ 1    (θ_i := √w_i ⟨µ_i, u⟩.)

Isolated local maxima are 1/√w_1, 1/√w_2, . . . , achieved at (1, 0, 0, . . . ), (0, 1, 0, . . . ), . . .

This translates to directions u∗ orthogonal to all but one µ_j.

[Figure: vectors µ_1, µ_2, µ_3, with u∗ orthogonal to µ_2 and µ_3 but not µ_1.]