SLIDE 1

Spectral Experts for Estimating Mixtures of Linear Regressions

Arun Tejasvi Chaganty, Percy Liang

Stanford University

January 28, 2016


SLIDE 2

Introduction

Latent Variable Models

Generative Models

◮ Gaussian Mixture Models
◮ Hidden Markov Models
◮ Latent Dirichlet Allocation
◮ PCFGs
◮ ...

Discriminative Models

◮ Mixture of Experts
◮ Latent CRFs
◮ Discriminative LDA
◮ ...
◮ Easy to include features and tend to be more accurate.

[Figure: graphical models: generative, h → x; discriminative, (x, h) → y]

SLIDE 3

Introduction

Parameter Estimation is Hard

[Figure: a non-convex negative log-likelihood −log p_θ(x) plotted over θ, with the global optimum θ_MLE and two local optima θ_EM marked]

◮ The log-likelihood function is non-convex.
◮ The MLE is consistent but intractable.
◮ Local methods (EM, gradient descent, etc.) are tractable but inconsistent.
◮ Can we build an efficient and consistent estimator?

SLIDE 4

Introduction

Related Work

◮ Method of Moments [Pearson, 1894]
◮ Observable operators
  ◮ Control Theory [Ljung, 1987]
  ◮ Observable operator models [Jaeger, 2000; Littman/Sutton/Singh, 2004]
  ◮ Hidden Markov models [Hsu/Kakade/Zhang, 2009]
  ◮ Low-treewidth graphs [Parikh et al., 2012]
  ◮ Weighted finite state automata [Balle & Mohri, 2012]
◮ Parameter Estimation
  ◮ Mixture of Gaussians [Kalai/Moitra/Valiant, 2010]
  ◮ Mixture models, HMMs [Anandkumar/Hsu/Kakade, 2012]
  ◮ Latent Dirichlet Allocation [Anandkumar/Hsu/Kakade, 2012]
  ◮ Stochastic block models [Anandkumar/Ge/Hsu/Kakade, 2012]
  ◮ Linear Bayesian networks [Anandkumar/Hsu/Javanmard/Kakade, 2012]

SLIDE 5

Introduction

Outline

◮ Introduction
◮ Tensor Factorization for a Generative Model
◮ Tensor Factorization for a Discriminative Model
◮ Experimental Insights
◮ Conclusions

SLIDE 6

Tensor Factorization for a Generative Model

Aside: Tensor Operations

◮ Tensor product: x⊗3 = x ⊗ x ⊗ x, with entries (x⊗3)_ijk = x_i x_j x_k.
◮ Inner product: ⟨A, B⟩ = Σ_ijk A_ijk B_ijk = ⟨vec A, vec B⟩.

[Figure: pictorial examples of a rank-1 tensor as a product of three vectors, and of a tensor inner product evaluating to 0.5]
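To make the notation concrete, here is a minimal numpy sketch of the two operations above (numpy is our choice of illustration language; the slide uses pictures):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Tensor product: (x⊗3)_ijk = x_i x_j x_k, a 3 × 3 × 3 array.
x3 = np.einsum('i,j,k->ijk', x, x, x)

# Inner product: <A, B> = Σ_ijk A_ijk B_ijk, which equals <vec A, vec B>.
A = x3
B = np.random.rand(3, 3, 3)
inner = np.sum(A * B)
assert np.isclose(inner, A.ravel() @ B.ravel())   # same value on vectorized copies
```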

SLIDE 7

Tensor Factorization for a Generative Model

Example: Gaussian Mixture Model [Anandkumar/Hsu/Kakade, 2012]

◮ Generative process:

  h ∼ Mult([π_1, π_2, ..., π_k])
  x ∼ N(β_h, σ²I).

◮ Moments:

  E[x | h] = β_h
  E[x] = Σ_h π_h β_h
  E[x⊗2] = Σ_h π_h (β_h β_hᵀ) + σ²I = Σ_h π_h β_h⊗2 + σ²I
  E[x⊗3] = Σ_h π_h β_h⊗3 + bias.

[Figure: graphical model h → x with samples x_1, x_2; E[x⊗2] is a d × d matrix and E[x⊗3] a d × d × d tensor]
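The moment identities are easy to verify by Monte Carlo; the particular π, β_h, and σ below are made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, sigma, n = 3, 4, 0.1, 200_000
pi = np.array([0.5, 0.3, 0.2])
beta = rng.normal(size=(k, d))                 # rows are the component means β_h

h = rng.choice(k, size=n, p=pi)                # h ~ Mult(π)
x = beta[h] + sigma * rng.normal(size=(n, d))  # x ~ N(β_h, σ²I)

m1_hat = x.mean(axis=0)                        # ≈ E[x] = Σ_h π_h β_h
m2_hat = x.T @ x / n                           # ≈ E[x⊗2]
m2 = np.einsum('h,hi,hj->ij', pi, beta, beta) + sigma**2 * np.eye(d)
print(np.abs(m1_hat - pi @ beta).max(), np.abs(m2_hat - m2).max())  # both small
```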

SLIDE 8

Tensor Factorization for a Generative Model

Solution: Tensor Factorization [Anandkumar/Ge/Hsu et al., 2012]

◮ E[x⊗3] = Σ_{h=1}^k π_h β_h⊗3.
◮ If the β_h are orthogonal, they are eigenvectors: E[x⊗3](β_h, β_h) = π_h β_h.
◮ In general, whiten E[x⊗3] first.

[Figure: E[x⊗3] decomposed as a sum of k rank-1 tensors]
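One concrete way to extract these eigenvectors is the tensor power method of [Anandkumar/Ge/Hsu et al., 2012]. Below is a sketch for the already-orthogonal case (in general, one whitens first); the restart and iteration counts are arbitrary choices of ours:

```python
import numpy as np

def tensor_power(T, n_restarts=10, n_iters=100):
    """Top 'eigenpair' (λ, v) of a symmetric 3-tensor: T(I, v, v) = λ v."""
    rng = np.random.default_rng(0)
    best = (-np.inf, None)
    for _ in range(n_restarts):
        u = rng.normal(size=T.shape[0])
        u /= np.linalg.norm(u)
        for _ in range(n_iters):
            u = np.einsum('ijk,j,k->i', T, u, u)    # u ← T(I, u, u)
            u /= np.linalg.norm(u)
        lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # λ = T(u, u, u)
        if lam > best[0]:
            best = (lam, u)
    return best

# With orthonormal β_h, each β_h is an eigenvector with eigenvalue π_h.
pi = np.array([0.6, 0.4])
beta = np.eye(3)[:2]                                # β_1 = e_1, β_2 = e_2
T = np.einsum('h,hi,hj,hk->ijk', pi, beta, beta, beta)
lam1, v1 = tensor_power(T)
lam2, v2 = tensor_power(T - lam1 * np.einsum('i,j,k->ijk', v1, v1, v1))  # deflate
print(lam1, lam2)                                   # ≈ 0.6, 0.4
```

Deflating (subtracting λ v⊗3) and repeating pulls out the remaining components one by one.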

SLIDE 9

Tensor Factorization for a Generative Model

[Figure: transition slide contrasting generative models, h → x, with discriminative models, (x, h) → y]

SLIDE 10

Tensor Factorization for a Discriminative Model

Mixture of Linear Regressions

◮ Given x:
  ◮ h ∼ Mult([π_1, π_2, ..., π_k]).
  ◮ y = β_hᵀ x + ε.

[Figure: graphical model (x, h) → y; scatter plot of (x, y) showing the k linear components, as in the sampling sketch below]
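Sampling from this model takes only a few lines; the π, B, and noise scale below are placeholders, not values from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n = 3, 4, 100_000
pi = np.array([0.4, 0.35, 0.25])
B = rng.normal(size=(d, k))                # columns are β_1, ..., β_k

x = rng.normal(size=(n, d))                # any x distribution; h ⟂ x here
h = rng.choice(k, size=n, p=pi)            # h ~ Mult(π)
y = np.einsum('nd,dn->n', x, B[:, h]) + 0.1 * rng.normal(size=n)   # y = β_hᵀx + ε
```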

SLIDE 11

Tensor Factorization for a Discriminative Model

[Figure: from samples (x, y), can we recover the mixture weights (π_1, π_2, ..., π_k) and the regressors B = [β_1 β_2 ... β_k]?]

SLIDE 12

Tensor Factorization for a Discriminative Model

Finding Tensor Structure

y = ⟨β_h, x⟩ + ε        (h is random)
  = ⟨E[β_h], x⟩ + ⟨β_h − E[β_h], x⟩ + ε

The first term is a linear measurement of E[β_h] = Σ_h π_h β_h; the remaining terms are zero-mean noise.

SLIDE 13

Tensor Factorization for a Discriminative Model

Finding Tensor Structure

y  = ⟨E[β_h], x⟩ + (β_h − E[β_h])ᵀx + ε        (linear measurement + noise)

y² = (⟨β_h, x⟩ + ε)² = ⟨M2, x⊗2⟩ + bias₂ + noise₂,   where M2 := E[β_h⊗2]

y³ = ⟨M3, x⊗3⟩ + bias₃ + noise₃,   where M3 := E[β_h⊗3]
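The y² identity says that regressing y² on the d² entries of x⊗2, with an intercept absorbing the constant bias₂ term (E[ε²] when ε is independent of x and h), estimates M2. A self-contained sketch using plain least squares, ignoring the low-rank structure for now (that comes on slide 16); the model parameters are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
k, d, n = 3, 4, 200_000
pi = np.array([0.4, 0.35, 0.25])
B = rng.normal(size=(d, k))                        # columns β_h (placeholders)
x = rng.normal(size=(n, d))
h = rng.choice(k, size=n, p=pi)
y = np.einsum('nd,dn->n', x, B[:, h]) + 0.1 * rng.normal(size=n)

# Features: the entries of x⊗2; the constant column absorbs bias₂.
Phi = np.column_stack([np.einsum('ni,nj->nij', x, x).reshape(n, -1), np.ones(n)])
w, *_ = np.linalg.lstsq(Phi, y**2, rcond=None)
M2_hat = w[:-1].reshape(d, d)
M2 = (B * pi) @ B.T                                # Σ_h π_h β_h β_hᵀ
print(np.abs(M2_hat - M2).max())                   # → 0 as n grows
```

The same recipe with y³ and the entries of x⊗3 estimates M3.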

SLIDE 14

Tensor Factorization for a Discriminative Model

Recovering Parameters

◮ M3 := E[β_h⊗3] = Σ_{h=1}^k π_h β_h⊗3
◮ Apply tensor factorization!

[Figure: M3 decomposed as a sum of k rank-1 tensors]

SLIDE 15

Tensor Factorization for a Discriminative Model

Overview: Spectral Experts

[Diagram: pipeline. From data (x, y) ∈ D, regress y² on x⊗2 and y³ on x⊗3 to obtain M2 and M3; tensor factorization then yields π, B]

Assumptions:
◮ Ê[vec(x⊗2)⊗2] ≻ 0 and Ê[vec(x⊗3)⊗2] ≻ 0.
◮ π ≻ 0 and rank(B) = k ≤ d.
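Given M2 and M3, the factorization step runs exactly as on slide 8: whiten with M2, run the tensor power method, and undo the whitening. A schematic sketch under the slide's assumptions (the function names and the exact-moment test below are ours):

```python
import numpy as np

def whiten(M2, k):
    """W (d × k) with Wᵀ M2 W = I_k, for PSD M2 = B diag(π) Bᵀ of rank k."""
    vals, vecs = np.linalg.eigh(M2)
    return vecs[:, -k:] / np.sqrt(vals[-k:])         # top-k eigenpairs

def tensor_power(T, n_restarts=20, n_iters=200):
    """Top eigenpair (λ, v) of a symmetric 3-tensor via power iteration."""
    rng = np.random.default_rng(0)
    best = (-np.inf, None)
    for _ in range(n_restarts):
        u = rng.normal(size=T.shape[0]); u /= np.linalg.norm(u)
        for _ in range(n_iters):
            u = np.einsum('ijk,j,k->i', T, u, u); u /= np.linalg.norm(u)
        lam = np.einsum('ijk,i,j,k->', T, u, u, u)
        if lam > best[0]: best = (lam, u)
    return best

def recover(M2, M3, k):
    """π, B from M2 = Σ_h π_h β_h⊗2 and M3 = Σ_h π_h β_h⊗3 (exact moments)."""
    W = whiten(M2, k)
    T = np.einsum('ijk,ia,jb,kc->abc', M3, W, W, W)  # = Σ_h π_h^{-1/2} v_h⊗3
    unwhiten = np.linalg.pinv(W.T)
    pis, betas = [], []
    for _ in range(k):
        lam, v = tensor_power(T)                     # λ_h = π_h^{-1/2}, v_h orthonormal
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)   # deflate
        pis.append(1.0 / lam**2)
        betas.append(lam * (unwhiten @ v))           # β_h = λ_h (Wᵀ)⁺ v_h
    return np.array(pis), np.column_stack(betas)

# Exact-moment check with made-up parameters (recovery is up to permutation):
rng = np.random.default_rng(2)
k, d = 2, 4
pi = np.array([0.7, 0.3])
B = rng.normal(size=(d, k))
M2 = (B * pi) @ B.T
M3 = np.einsum('h,ih,jh,kh->ijk', pi, B, B, B)
print(recover(M2, M3, k))
```

With estimated moments from the regression sketch above, the same routine gives approximate parameters.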

SLIDE 16

Tensor Factorization for a Discriminative Model

Exploiting Low-Rank Structure [Fazel, 2002; Tomioka et al., 2010]

M̂2 = argmin_M Σ_{(x,y)∈D} (y² − ⟨M, x⊗2⟩ − bias₂)² + ‖M‖_*,   where ‖M‖_* = Σ_i σ_i(M)

M̂3 = argmin_M Σ_{(x,y)∈D} (y³ − ⟨M, x⊗3⟩ − bias₃)² + ‖M‖_*

[Figure: the estimate is a sum of k rank-1 terms]
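Because the proximal operator of the nuclear norm is soft-thresholding of singular values, the M̂2 problem can be attacked with proximal gradient descent. A dependency-free sketch for the matrix case; the step size, regularization weight λ, and intercept handling are our choices, and for M̂3 one would work with an unfolding of the tensor:

```python
import numpy as np

def nuclear_prox(M, tau):
    """Prox of τ‖·‖_*: soft-threshold the singular values of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def low_rank_regression(x, y2, lam=1e-2, n_iters=2000):
    """min_M (1/n) Σ_n (y²_n − <M, x_n⊗2> − b)² + λ‖M‖_*  (b = intercept)."""
    n, d = x.shape
    M, b = np.zeros((d, d)), 0.0
    step = 0.5 / np.linalg.norm(x, axis=1).max() ** 4   # crude 1/L estimate
    for _ in range(n_iters):
        r = np.einsum('ni,ij,nj->n', x, M, x) + b - y2  # residuals
        grad = (2.0 / n) * np.einsum('n,ni,nj->ij', r, x, x)
        M = nuclear_prox(M - step * grad, step * lam)   # gradient step + prox
        b -= 0.5 * r.mean()                             # damped intercept update
    return M, b
```

This is slow but self-contained; any off-the-shelf convex solver for nuclear-norm regularized least squares would do the same job.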

SLIDE 17

Tensor Factorization for a Discriminative Model

Sample Complexity [Negahban/Wainwright, 2009; Tomioka et al., 2011; Anandkumar/Ge/Hsu, 2012]

[Diagram: the same pipeline, with the regression step labeled "low-rank regression"]

◮ Low-rank regression: O(k ‖x‖¹² ‖β‖⁶ E[ε²]⁶) samples.
◮ Tensor factorization: O(k π_max² / σ_k(M2)⁵) samples.

SLIDE 18

Experimental Insights

Experimental Insights

y = β_hᵀ x + ε,   with x = (1, t, t⁴, t⁷),   k = 3, d = 4, n = 10⁵

[Figure: data (t, y) with the fits recovered by EM, Spectral, and Spectral+EM; bar chart of parameter error across runs for the three methods]
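A sketch of this synthetic setup: the slide fixes the feature map and k, d, n, while the true β_h, the mixing weights, and the noise scale below are placeholders:

```python
import numpy as np

rng = np.random.default_rng(3)
k, d, n = 3, 4, 100_000
t = rng.uniform(-1.0, 1.0, size=n)
x = np.stack([np.ones(n), t, t**4, t**7], axis=1)   # x = (1, t, t⁴, t⁷)
B = rng.normal(size=(d, k))                         # placeholder β_h
h = rng.choice(k, size=n)                           # uniform mixing (placeholder)
y = np.einsum('nd,dn->n', x, B[:, h]) + 0.1 * rng.normal(size=n)
```

The resulting (x, y) pairs can be fed to the regression and factorization sketches above.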

SLIDE 19

Experimental Insights

[Figure: parameter error of EM, Spectral, and Spectral+EM across four settings: d = 4, k = 2; d = 5, k = 2; d = 5, k = 3; d = 6, k = 2]

SLIDE 20

Experimental Insights

On Initialization (Cartoon)

[Figure: the non-convex objective −log p_θ(x); EM from a random start lands at a local optimum θ̂_EM, the spectral estimate θ̂_spec lands near the global optimum, and EM initialized at θ̂_spec reaches θ̂_spec+EM]

SLIDE 21

Conclusions

Conclusions

◮ A consistent estimator for the mixture of linear regressions.
◮ Key Idea: Expose tensor factorization structure through regression.
◮ Theory: Polynomial sample and computational complexity.
◮ Experiments: Method-of-moments estimates can be a good initialization for EM.
◮ Future Work: How can we handle other discriminative models?
  ◮ Dependencies between h and x (mixture of experts).
  ◮ Non-linear link functions (hidden-variable logistic regression).

SLIDE 22

Conclusions

Thank you!