SLIDE 1

Efficient algorithms for estimating multi-view mixture models

Daniel Hsu

Microsoft Research, New England

SLIDE 2

Outline

◮ Multi-view mixture models
◮ Multi-view method-of-moments
◮ Some applications and open questions
◮ Concluding remarks

SLIDE 3

Part 1. Multi-view mixture models

◮ Multi-view mixture models
  ◮ Unsupervised learning and mixture models
  ◮ Multi-view mixture models
  ◮ Complexity barriers
◮ Multi-view method-of-moments
◮ Some applications and open questions
◮ Concluding remarks

SLIDE 5

Unsupervised learning

◮ Many modern applications of machine learning:
  ◮ high-dimensional data from many diverse sources,
  ◮ but mostly unlabeled.
◮ Unsupervised learning: extract useful information from this data.
  ◮ Disentangle sub-populations in the data source.
  ◮ Discover useful representations for downstream stages of the learning pipeline (e.g., supervised learning).

SLIDE 7

Mixture models

Simple latent variable model: mixture model. (Graphical model: h → x.)

◮ h ∈ [k] := {1, 2, . . . , k} (hidden);
◮ x ∈ R^d (observed);
◮ Pr[h = j] = w_j;
◮ x | h ∼ P_h;

so x has the mixture distribution

P(x) = w_1 P_1(x) + w_2 P_2(x) + · · · + w_k P_k(x).

Typical use: learn about constituent sub-populations (e.g., clusters) in the data source.

SLIDE 10

Multi-view mixture models

Can we take advantage of diverse sources of information? (Graphical model: h → x_1, x_2, · · · , x_ℓ.)

◮ h ∈ [k];
◮ x_1 ∈ R^d_1, x_2 ∈ R^d_2, . . . , x_ℓ ∈ R^d_ℓ.

k = # components, ℓ = # views (e.g., audio, video, text).

(Illustration: View 1: x_1 ∈ R^d_1, View 2: x_2 ∈ R^d_2, View 3: x_3 ∈ R^d_3.)

SLIDE 11

Multi-view mixture models

Multi-view assumption: views are conditionally independent given the component.

◮ Larger k (# components): more sub-populations to disentangle.
◮ Larger ℓ (# views): more non-redundant sources of information.

SLIDE 13

Semi-parametric estimation task

"Parameters" of the component distributions:

◮ Mixing weights w_j := Pr[h = j], j ∈ [k];
◮ Conditional means µ_v,j := E[x_v | h = j] ∈ R^d_v, j ∈ [k], v ∈ [ℓ].

Goal: estimate the mixing weights and conditional means from independent copies of (x_1, x_2, . . . , x_ℓ).

Questions:

1. How do we estimate {w_j} and {µ_v,j} without observing h?
2. How many views ℓ are sufficient to learn with poly(k) computational / sample complexity?
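To make the estimation task concrete, here is a minimal NumPy sketch (not from the talk) that generates independent copies of (x_1, . . . , x_ℓ) from a multi-view mixture model. The function name and the Gaussian choice for each view's conditional distribution are illustrative assumptions; the model itself only constrains the mixing weights and conditional means.

```python
# Hypothetical data generator for a multi-view mixture model (illustration only).
# Assumption: each view's conditional distribution is Gaussian around mus[v][:, j];
# the model only specifies the conditional means and mixing weights.
import numpy as np

def sample_multiview(w, mus, n_samples, noise=0.1, seed=0):
    """w: (k,) mixing weights; mus: list of (d_v, k) conditional-mean matrices,
    one per view. Returns (views, h) with views[v] of shape (n_samples, d_v)."""
    rng = np.random.default_rng(seed)
    k = len(w)
    h = rng.choice(k, size=n_samples, p=w)          # hidden component per sample
    views = [mu[:, h].T + noise * rng.normal(size=(n_samples, mu.shape[0]))
             for mu in mus]                          # conditionally independent given h
    return views, h
```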

SLIDE 17

Some barriers to efficient estimation

Challenge: many difficult parametric estimation tasks reduce to this estimation problem.

◮ Cryptographic barrier: discrete HMM parameter estimation is as hard as learning parity functions with noise (Mossel-Roch, '06).
◮ Statistical barrier: Gaussian mixtures in R^1 can require exp(Ω(k)) samples to estimate parameters, even if the components are well-separated (Moitra-Valiant, '10).

In practice: resort to local search (e.g., EM), often subject to slow convergence and inaccurate local optima.

SLIDE 20

Making progress: Gaussian mixture model

Gaussian mixture model: the problem becomes easier if we assume some large minimum separation between component means (Dasgupta, '99):

sep := min_{i ≠ j} ‖µ_i − µ_j‖ / max{σ_i, σ_j}.

◮ sep = Ω(d^c): interpoint distance-based methods / EM (Dasgupta, '99; Dasgupta-Schulman, '00; Arora-Kannan, '00)
◮ sep = Ω(k^c): first use PCA to project to k dimensions (Vempala-Wang, '02; Kannan-Salmasian-Vempala, '05; Achlioptas-McSherry, '05)
  ◮ Also works for mixtures of log-concave distributions.
◮ No minimum separation requirement: method-of-moments, but exp(Ω(k)) running time / sample size (Kalai-Moitra-Valiant, '10; Belkin-Sinha, '10; Moitra-Valiant, '10)

SLIDE 22

Making progress: discrete hidden Markov models

Hardness reductions create HMMs with degenerate output and next-state distributions.

(Figure: the output distribution of state 1 is a mixture of those of states 2 and 3, i.e., Pr[x_t = · | h_t = 1] = 0.6 Pr[x_t = · | h_t = 2] + 0.4 Pr[x_t = · | h_t = 3].)

These instances are avoided by assuming the parameter matrices are full-rank (Mossel-Roch, '06; Hsu-Kakade-Zhang, '09).

SLIDE 26

What we do

This work: given ≥ 3 views, mild non-degeneracy conditions imply efficient algorithms for estimation.

◮ Non-degeneracy condition for the multi-view mixture model: the conditional means {µ_v,1, µ_v,2, . . . , µ_v,k} are linearly independent for each view v ∈ [ℓ], and w_j > 0 for all j ∈ [k]. Requires high-dimensional observations (d_v ≥ k)!
◮ New efficient learning guarantees for parametric models (e.g., mixtures of Gaussians, general HMMs).
◮ General tensor decomposition framework applicable to a wide variety of estimation problems.

SLIDE 27

Part 2. Multi-view method-of-moments

◮ Multi-view mixture models
◮ Multi-view method-of-moments
  ◮ Overview
  ◮ Structure of moments
  ◮ Uniqueness of decomposition
  ◮ Computing the decomposition
  ◮ Asymmetric views
◮ Some applications and open questions
◮ Concluding remarks

SLIDE 29

The plan

◮ First, assume views are (conditionally) exchangeable, and derive the basic algorithm.
◮ Then, provide a reduction from the general multi-view setting to the exchangeable case.

SLIDE 31

Simpler case: exchangeable views

(Conditionally) exchangeable views: assume the views have the same conditional means, i.e.,

E[x_v | h = j] ≡ µ_j, j ∈ [k], v ∈ [ℓ].

Motivating setting: bag-of-words model, with x_1, x_2, . . . , x_ℓ ≡ ℓ exchangeable words in a document.

One-hot encoding: x_v = e_i ⇔ the v-th word in the document is the i-th word in the vocabulary (where e_i ∈ {0, 1}^d has a 1 in the i-th position and 0 elsewhere). Then

(µ_j)_i = E[(x_v)_i | h = j] = Pr[x_v = e_i | h = j], i ∈ [d], j ∈ [k].
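As a concrete illustration of the bag-of-words setting, here is a small sketch (not from the talk) that turns the first three words of each document into one-hot views x_1, x_2, x_3. The function name and the word-index input format are assumptions made for the example.

```python
# Illustrative sketch: one-hot encode the first three words of each document
# to obtain three exchangeable views x1, x2, x3 (bag-of-words setting).
import numpy as np

def one_hot_views(docs, vocab_size, n_views=3):
    """docs: list of word-index sequences (each of length >= n_views).
    Returns a list of (n_docs, vocab_size) one-hot view matrices."""
    views = [np.zeros((len(docs), vocab_size)) for _ in range(n_views)]
    for s, doc in enumerate(docs):
        for v in range(n_views):
            views[v][s, doc[v]] = 1.0   # x_v = e_i when the v-th word is word i
    return views

# Example: two tiny documents over a 5-word vocabulary.
x1, x2, x3 = one_hot_views([[0, 3, 3, 1], [2, 2, 4]], vocab_size=5)
```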

SLIDE 32

Key ideas

1. Method-of-moments: conditional means are revealed by appropriate low-rank decompositions of moment matrices and tensors.
2. Third-order tensor decomposition is uniquely determined by directions of (locally) maximum skew.
3. The required local optimization can be performed efficiently in poly time.

SLIDE 35

Algebraic structure in moments

Recall: E[x_v | h = j] = µ_j. By conditional independence and exchangeability of x_1, x_2, . . . , x_ℓ given h,

Pairs := E[x_1 ⊗ x_2] = E[ E[x_1 | h] ⊗ E[x_2 | h] ] = E[µ_h ⊗ µ_h] = Σ_{i=1}^k w_i µ_i ⊗ µ_i ∈ R^{d×d},

Triples := E[x_1 ⊗ x_2 ⊗ x_3] = Σ_{i=1}^k w_i µ_i ⊗ µ_i ⊗ µ_i ∈ R^{d×d×d}, etc.

(If only we could extract these “low-rank” decompositions . . . )
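Given i.i.d. copies of (x_1, x_2, x_3), Pairs and Triples can be estimated by averaging outer products; the plain sample-average estimator below is the obvious choice for a sketch and is not something the slides prescribe (the function name is hypothetical).

```python
# Sketch: empirical Pairs and Triples from samples of three exchangeable views.
# X1, X2, X3 are (n_samples, d) arrays; averaging outer products estimates
# Pairs = E[x1 (x) x2] and Triples = E[x1 (x) x2 (x) x3].
import numpy as np

def empirical_moments(X1, X2, X3):
    n = X1.shape[0]
    pairs = np.einsum('si,sj->ij', X1, X2) / n             # d x d
    triples = np.einsum('si,sj,sl->ijl', X1, X2, X3) / n   # d x d x d
    return pairs, triples
```

Since the views are exchangeable, averaging the estimates over permutations of the view order would reduce variance, but the simple version above already matches the population formulas in expectation.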

SLIDE 41

2nd moment: subspace spanned by conditional means

Non-degeneracy assumption ({µ_i} linearly independent)

⟹ Pairs = Σ_{i=1}^k w_i µ_i ⊗ µ_i is symmetric psd and has rank k

⟹ Pairs equips the k-dimensional subspace span{µ_1, µ_2, . . . , µ_k} with the inner product Pairs(x, y) := x^⊤ Pairs y.

However, {µ_i} is not generally determined by Pairs alone (e.g., the {µ_i} are not necessarily orthogonal).

Must look at higher-order moments?
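One standard way to exploit this rank-k structure, not spelled out on the slide and stated here as an assumption about implementation, is to whiten with the top-k eigenpairs of Pairs: a matrix W with W^⊤ Pairs W = I_k turns the constraint Pairs(η, η) ≤ 1 into an ordinary Euclidean ball in k dimensions. A minimal sketch (function name hypothetical):

```python
# Sketch: whitening with the rank-k part of Pairs, so that W.T @ pairs @ W ~= I_k.
# The slides only state the rank-k structure; the eigendecomposition route below
# is one standard way to use it, assumed here for illustration.
import numpy as np

def whitening_matrix(pairs, k):
    evals, evecs = np.linalg.eigh(pairs)          # ascending eigenvalues
    top = np.argsort(evals)[::-1][:k]             # top-k eigenpairs span the means
    return evecs[:, top] / np.sqrt(evals[top])    # d x k whitening matrix
```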

SLIDE 44

3rd moment: (cross) skew maximizers

Claim: up to third-order moments (i.e., 3 views) suffice. View Triples : R^d × R^d × R^d → R as a trilinear form.

Theorem. Each isolated local maximizer η* of

max_{η ∈ R^d} Triples(η, η, η) s.t. Pairs(η, η) ≤ 1

satisfies, for some i ∈ [k],

Pairs η* = √w_i µ_i, Triples(η*, η*, η*) = 1/√w_i.

Also: these maximizers can be found efficiently and robustly.

SLIDE 52

Variational analysis

max_{η ∈ R^d} Triples(η, η, η) s.t. Pairs(η, η) ≤ 1

Substitute Pairs = Σ_{i=1}^k w_i µ_i ⊗ µ_i and Triples = Σ_{i=1}^k w_i µ_i ⊗ µ_i ⊗ µ_i:

max_{η ∈ R^d} Σ_{i=1}^k w_i (η^⊤ µ_i)^3 s.t. Σ_{i=1}^k w_i (η^⊤ µ_i)^2 ≤ 1,

i.e.,

max_{η ∈ R^d} Σ_{i=1}^k (1/√w_i)(√w_i η^⊤ µ_i)^3 s.t. Σ_{i=1}^k (√w_i η^⊤ µ_i)^2 ≤ 1.

Let θ_i := √w_i (η^⊤ µ_i) for i ∈ [k]:

max_{θ ∈ R^k} Σ_{i=1}^k (1/√w_i) θ_i^3 s.t. Σ_{i=1}^k θ_i^2 ≤ 1.

Isolated local maximizers θ* (found via gradient ascent) are e_1 = (1, 0, 0, . . . ), e_2 = (0, 1, 0, . . . ), etc., which means that each η* satisfies, for some i ∈ [k],

√w_j (η*^⊤ µ_j) = 1 if j = i, and 0 if j ≠ i.

Therefore

Pairs η* = Σ_{j=1}^k w_j µ_j (η*^⊤ µ_j) = √w_i µ_i.
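Combined with the theorem on SLIDE 44, this gives a simple recovery rule: if η* is an isolated local maximizer with value λ* = Triples(η*, η*, η*) = 1/√w_i, then w_i = 1/λ*² and µ_i = λ* · Pairs η*. As a tiny sketch (function and variable names hypothetical):

```python
# Sketch: turn a local maximizer (eta_star, lam_star) into a component estimate,
# using Pairs @ eta_star = sqrt(w_i) * mu_i and lam_star = 1 / sqrt(w_i).
import numpy as np

def recover_component(pairs, eta_star, lam_star):
    w_i = 1.0 / lam_star**2
    mu_i = lam_star * (pairs @ eta_star)
    return w_i, mu_i
```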

SLIDE 56

Extracting all isolated local maximizers

1. Start with T := Triples.
2. Find an isolated local maximizer of T(η, η, η) s.t. Pairs(η, η) ≤ 1 via gradient ascent from a random η ∈ range(Pairs). Say the maximum is λ* and the maximizer is η*.
3. Deflation: replace T with T − λ* η* ⊗ η* ⊗ η*. Go to step 2.

A variant of this runs in polynomial time (w.h.p.), and is robust to perturbations to Pairs and Triples.

SLIDE 58

General case: asymmetric views

Each view v has a different set of conditional means {µ_v,1, µ_v,2, . . . , µ_v,k} ⊂ R^d_v.

Reduction: transform x_1 and x_2 to "look like" x_3 via linear transformations.

SLIDE 61

Asymmetric cross moments

Define the asymmetric cross moment Pairs_u,v := E[x_u ⊗ x_v].

Transforming view v to view 3 (with u the remaining view):

C_v→3 := E[x_3 ⊗ x_u] E[x_v ⊗ x_u]^† ∈ R^{d_3 × d_v},

where † denotes the Moore-Penrose pseudoinverse.

A simple exercise shows E[C_v→3 x_v | h = j] = µ_3,j, so C_v→3 x_v behaves like x_3 (as far as our algorithm can tell).

SLIDE 62

Part 3. Some applications and open questions

◮ Multi-view mixture models
◮ Multi-view method-of-moments
◮ Some applications and open questions
  ◮ Mixtures of Gaussians
  ◮ Hidden Markov models and other models
  ◮ Topic models
  ◮ Open questions
◮ Concluding remarks

SLIDE 65

Mixtures of axis-aligned Gaussians

Mixture of axis-aligned Gaussians in R^n, with component means µ_1, µ_2, . . . , µ_k ∈ R^n; no minimum separation requirement. (Graphical model: h → x_1, x_2, · · · , x_n.)

Assumptions:

◮ Non-degeneracy: the component means span a k-dimensional subspace.
◮ Weak incoherence condition: the component means are not perfectly aligned with the coordinate axes (similar to the spreading condition of Chaudhuri-Rao, '08).

Then, randomly partitioning the coordinates into ℓ ≥ 3 views guarantees (w.h.p.) that non-degeneracy holds in all ℓ views.

SLIDE 68

Hidden Markov models and others

(Figure: an HMM with hidden chain h_1 → h_2 → h_3 and observations x_1, x_2, x_3 is reduced to a multi-view mixture model with a single hidden variable h and views x_1, x_2, x_3; a code sketch of this reduction follows the list below.)

Other models:

1. Mixtures of Gaussians (Hsu-Kakade, ITCS'13)
2. HMMs (Anandkumar-Hsu-Kakade, COLT'12)
3. Latent Dirichlet Allocation (Anandkumar-Foster-Hsu-Kakade-Liu, NIPS'12)
4. Latent parse trees (Hsu-Kakade-Liang, NIPS'12)
5. Independent Component Analysis (Arora-Ge-Moitra-Sachdeva, NIPS'12; Hsu-Kakade, ITCS'13)
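To make the HMM reduction in the figure concrete: three consecutive observations of an HMM are conditionally independent given the middle hidden state, so triples (x_1, x_2, x_3) form a (asymmetric) 3-view mixture with h taken to be that middle state, following the reduction in (Anandkumar-Hsu-Kakade, COLT'12). A hypothetical sampler, for illustration only:

```python
# Illustrative sketch of the HMM -> 3-view reduction: sample length-3 windows and
# treat the three one-hot observations as views; the middle hidden state plays the
# role of the mixture's hidden component h.
import numpy as np

def sample_hmm_triples(pi, A, O, n_samples, seed=0):
    """pi: (k,) initial distribution; A: (k,k) transitions, A[i,j] = Pr[h_{t+1}=j | h_t=i];
    O: (d,k) emissions, O[:,j] = Pr[x_t = . | h_t=j]."""
    rng = np.random.default_rng(seed)
    k, d = len(pi), O.shape[0]
    views = [np.zeros((n_samples, d)) for _ in range(3)]
    middle = np.zeros(n_samples, dtype=int)
    for s in range(n_samples):
        h = rng.choice(k, p=pi)
        for t in range(3):
            if t == 1:
                middle[s] = h                      # hidden component of the mixture
            x = rng.choice(d, p=O[:, h])           # emit observation from state h
            views[t][s, x] = 1.0
            h = rng.choice(k, p=A[h])              # next hidden state
    return views, middle
```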

SLIDE 73

Bag-of-words clustering model

(µ_j)_i = Pr[ see word i in document | document topic is j ].

◮ Corpus: New York Times (from UCI), 300,000 articles.
◮ Vocabulary size: d = 102,660 words.
◮ Chose k = 50.
◮ For each topic j, show the top 10 words i.

Top-10 word lists for a sample of the estimated topics (the original slides arrange each block below as columns of 10 words per topic):

sales run school drug player economic inning student patient tiger_wood consumer hit teacher million won major game program company shot home season official doctor play indicator home public companies round weekly right children percent win order games high cost tournament claim dodger education program tour scheduled left district health right

palestinian tax cup point yard israel cut minutes game game israeli percent oil team play yasser_arafat bush water shot season peace billion add play team israeli plan tablespoon laker touchdown israelis bill food season quarterback leader taxes teaspoon half coach official million pepper lead defense attack congress sugar games quarter

percent al_gore car book taliban stock campaign race children attack market president driver ages afghanistan fund george_bush team author official investor bush won read military companies clinton win newspaper u_s analyst vice racing web united_states money presidential track writer terrorist investment million season written war economy democratic lap sales bin

com court show film music www case network movie song site law season director group web lawyer nbc play part sites federal cb character new_york information government program actor company online decision television show million mail trial series movies band internet microsoft night million show telegram right new_york part album etc.

SLIDE 79

Some open questions

What if k > d_v? (relevant to overcomplete dictionary learning)

◮ Apply some non-linear transformations x_v ↦ f_v(x_v)?
◮ Combine views, e.g., via tensor products: x̃_1,2 := x_1 ⊗ x_2, x̃_3,4 := x_3 ⊗ x_4, x̃_5,6 := x_5 ⊗ x_6, etc.?

Can we relax the multi-view assumption?

◮ Allow for richer hidden state? (e.g., independent component analysis)
◮ "Gaussianization" via random projection?

SLIDE 80

Part 4. Concluding remarks

◮ Multi-view mixture models
◮ Multi-view method-of-moments
◮ Some applications and open questions
◮ Concluding remarks

SLIDE 84

Concluding remarks

Take-home messages:

◮ Power of multiple views: can take advantage of diverse / non-redundant sources of information in unsupervised learning.
◮ Overcoming complexity barriers: some provably hard estimation problems become easy after ruling out "degenerate" cases.
◮ "Blessing of dimensionality" for estimators based on the method-of-moments.

SLIDE 85

Thanks!

(Co-authors: Anima Anandkumar, Dean Foster, Rong Ge, Sham Kakade, Yi-Kai Liu, Matus Telgarsky)

http://arxiv.org/abs/1210.7559