
slide-1
SLIDE 1

MLE part 2

18 / 70

slide-3
SLIDE 3

Gaussian Mixture Model

◮ Suppose data is drawn from k Gaussians, meaning Y = j ∼ Discrete(π1, . . . , πk), X = x | Y = j ∼ N(µj, Σj), and the parameters are θ = ((π1, µ1, Σ1), . . . , (πk, µk, Σk)). (Note: this is a generative model, and we have a way to sample.)

◮ The probability density (with parameters θ = ((πj, µj, Σj))_{j=1}^k) at a given x is

p_θ(x) = ∑_{j=1}^k p_θ(x | y = j) p_θ(y = j) = ∑_{j=1}^k p_{µj,Σj}(x | Y = j) πj,

and the likelihood problem is

L(θ) = ∑_{i=1}^n ln ∑_{j=1}^k ( πj / √((2π)^d |Σj|) ) exp( −(1/2) (xi − µj)^T Σj^{-1} (xi − µj) ).

The ln and the exp are no longer next to each other; we can't just take the derivative and set the answer to 0.

19 / 70
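As a concrete companion to the density and likelihood above, here is a minimal Python sketch (not from the slides; the component count and parameter values are made up for illustration) that samples from a GMM exactly as the generative story describes and evaluates L(θ) directly.

```python
# Hedged sketch (not from the slides): sampling from a GMM as the generative
# model describes, and evaluating the log-likelihood L(theta) directly.
# Parameter values below are made up for illustration.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# theta = ((pi_j, mu_j, Sigma_j))_{j=1}^k with k = 2, d = 2
pis = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

def sample_gmm(n):
    """Draw Y ~ Discrete(pi), then X | Y = j ~ N(mu_j, Sigma_j)."""
    ys = rng.choice(len(pis), size=n, p=pis)
    xs = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in ys])
    return xs, ys

def log_likelihood(xs):
    """L(theta) = sum_i ln sum_j pi_j N(x_i; mu_j, Sigma_j)."""
    dens = np.column_stack([
        pi * multivariate_normal(mean=mu, cov=S).pdf(xs)
        for pi, mu, S in zip(pis, mus, Sigmas)
    ])                      # shape (n, k): pi_j * p_{mu_j,Sigma_j}(x_i)
    return np.sum(np.log(dens.sum(axis=1)))

xs, _ = sample_gmm(200)
print(log_likelihood(xs))
```

Maximizing this log-likelihood directly runs into exactly the issue the slide points out: the ln sits outside the sum over components, so setting the gradient to 0 has no closed-form solution.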

slide-6
SLIDE 6

Lloyd’s method for k-means

Original k-means formulation:

φ((µ1, . . . , µk)) = ∑_{i=1}^n min_j ‖xi − µj‖².

To make an algorithm, we introduced assignment matrix A ∈ A_{n,k}:

φ((µ1, . . . , µk); A) = ∑_{i=1}^n ∑_{j=1}^k Aij ‖xi − µj‖².

Let's do the same thing with Gaussians!

20 / 70
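For reference, a minimal sketch of one Lloyd iteration written to mirror φ(·; A): choose A by nearest centers, then re-optimize each µj as the mean of its assigned points. This is an illustrative sketch under stated assumptions (numpy arrays X of shape (n, d) and mus of shape (k, d)), not the course's reference code.

```python
# Hedged sketch (not from the slides): one pass of Lloyd's method, written to
# mirror phi((mu_1..mu_k); A): assign each x_i to its nearest center (choose A),
# then set each mu_j to the mean of its assigned points (optimize the centers).
import numpy as np

def lloyd_step(X, mus):
    # dists[i, j] = ||x_i - mu_j||^2
    dists = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)                 # argmin_j, i.e. the rows of A
    new_mus = np.array([
        X[assign == j].mean(axis=0) if np.any(assign == j) else mus[j]
        for j in range(len(mus))
    ])
    phi = dists.min(axis=1).sum()                 # objective before the update
    return new_mus, assign, phi
```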

slide-8
SLIDE 8

Gaussian mixture likelihood with responsibility matrix R

Let's replace ∑_{i=1}^n ln ∑_{j=1}^k πj p_{µj,Σj}(xi) with

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) ),

where R ∈ R_{n,k} := {R ∈ [0, 1]^{n×k} : R 1_k = 1_n} is a responsibility matrix.

Holding R fixed and optimizing θ gives

πj := (∑_{i=1}^n Rij) / (∑_{i=1}^n ∑_{l=1}^k Ril) = (∑_{i=1}^n Rij) / n;

µj := (∑_{i=1}^n Rij xi) / (∑_{i=1}^n Rij) = (∑_{i=1}^n Rij xi) / (n πj);

Σj := (∑_{i=1}^n Rij (xi − µj)(xi − µj)^T) / (n πj).

(Should use the new mean in Σj so that all derivatives are 0.)

21 / 70
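A minimal sketch of these closed-form updates with R held fixed (assuming numpy arrays X of shape (n, d) and R of shape (n, k) with rows summing to 1; not the course's code):

```python
# Hedged sketch (not from the slides): the closed-form updates for theta with R
# held fixed, exactly as in the formulas above.
import numpy as np

def m_step(X, R):
    n, d = X.shape
    Nj = R.sum(axis=0)                    # sum_i R_ij, one value per component
    pis = Nj / n                          # pi_j = sum_i R_ij / n
    mus = (R.T @ X) / Nj[:, None]         # mu_j = sum_i R_ij x_i / (n pi_j)
    Sigmas = []
    for j in range(R.shape[1]):
        diff = X - mus[j]                 # uses the *new* mean, as the slide notes
        Sigmas.append((R[:, j, None] * diff).T @ diff / Nj[j])
    return pis, mus, np.array(Sigmas)
```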

slide-10
SLIDE 10

Updating µj

Recall our new likelihood with responsibilities R:

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) )

(In the literature, this quantity is "expected complete data likelihood".)

Taking derivative and setting to 0:

0 = ∑_{i=1}^n Rij ∇_{µj} [ ln exp( −(1/2)(xi − µj)^T Σj^{-1} (xi − µj) ) + terms w/o µj ] = ∑_{i=1}^n Rij Σj^{-1} (xi − µj).

Rearranging, µj = (∑_{i=1}^n Rij xi) / (n πj).

22 / 70

slide-13
SLIDE 13

Updating π

Recall our new likelihood with responsibilities R:

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) )

Taking derivative and setting to 0:

0 = ∑_{i=1}^n Rij / πj ; oops?

Fix: we forgot the constraints on π!

23 / 70

slide-16
SLIDE 16

Updating π

Include constraint ∑_{j=1}^k πj = 1 with a Lagrangian:

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) ) + λ ( 1 − ∑_{j=1}^k πj ).

Differentiating and setting this Lagrangian to 0, we get

λ = ∑_{i=1}^n Rij / πj, and ∑_j πj = 1.

Together, πj = ∑_{i=1}^n Rij / λ, and

1 = ∑_{j=1}^k πj = ∑_{j=1}^k ∑_{i=1}^n Rij / λ = n / λ,

so λ = n and πj = ∑_{i=1}^n Rij / n.

24 / 70

slide-19
SLIDE 19

Updating Σj

Starting again from likelihood with responsibilities R:

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) ).

Taking derivative and setting to 0,

0 = ∑_{i=1}^n Rij ∇_{Σj} [ −(1/2)(xi − µj)^T Σj^{-1} (xi − µj) − (1/2) ln |Σj| + other stuff ].

By magic matrix derivative rules, Σj = ∑_{i=1}^n Rij (xi − µj)(xi − µj)^T / (n πj).

25 / 70

slide-21
SLIDE 21

Summary of θ optimization

Replace ∑_{i=1}^n ln ∑_{j=1}^k πj p_{µj,Σj}(xi) with

∑_{i=1}^n ∑_{j=1}^k Rij ln( πj p_{µj,Σj}(xi) ).

Hold R fixed and optimize θ:

πj := (∑_{i=1}^n Rij) / (∑_{i=1}^n ∑_{l=1}^k Ril) = (∑_{i=1}^n Rij) / n;
µj := (∑_{i=1}^n Rij xi) / (∑_{i=1}^n Rij) = (∑_{i=1}^n Rij xi) / (n πj);
Σj := (∑_{i=1}^n Rij (xi − µj)(xi − µj)^T) / (n πj).

How to optimize Rij?

◮ Likelihood lacks the min_j from the k-means cost.

◮ We'll now develop the E-M method, which picks R in a way that guarantees likelihood increases.

26 / 70

slide-22
SLIDE 22

E-M (Expectation-Maximization)

27 / 70

slide-25
SLIDE 25

Generalizing the assignment matrix to GMMs

We introduced an assignment matrix A ∈ {0, 1}^{n×k}:

◮ For each xi, define µ(xi) to be a closest center: ‖xi − µ(xi)‖ = min_j ‖xi − µj‖.

◮ For each i, set Aij = 1[µ(xi) = µj].

◮ Key property: by this choice,

φ(C; A) = ∑_{i=1}^n ∑_{j=1}^k Aij ‖xi − µj‖² = ∑_{i=1}^n min_j ‖xi − µj‖² = φ(C);

therefore we can decrease φ(C) = φ(C; A) first by optimizing C to get φ(C′; A) ≤ φ(C; A), then setting A′ as above to get φ(C′) = φ(C′; A′) ≤ φ(C′; A) ≤ φ(C; A) = φ(C). In other words: we minimize φ(C) via φ(C; A).

What fulfills the same role for L?

28 / 70

slide-27
SLIDE 27

Latent variable models.

Since 1 = ∑_{j=1}^k pθ(yi = j | xi) and pθ(yi = j | xi) = pθ(yi = j, xi) / pθ(xi), then

L(θ) = ∑_{i=1}^n ln pθ(xi) = ∑_{i=1}^n 1 · ln pθ(xi)
     = ∑_{i=1}^n ∑_{j=1}^k pθ(yi = j | xi) ln pθ(xi)
     = ∑_{i=1}^n ∑_{j=1}^k pθ(yi = j | xi) ln [ pθ(xi, yi = j) / pθ(yi = j | xi) ].

Therefore: define augmented likelihood

L(θ; R) := ∑_{i=1}^n ∑_{j=1}^k Rij ln [ pθ(xi, yi = j) / Rij ];

note that Rij := pθ(yi = j | xi) implies L(θ; R) = L(θ).

29 / 70
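The identity Rij := pθ(yi = j | xi) ⇒ L(θ; R) = L(θ) can also be checked numerically; here is a small sketch (made-up data and parameters, purely illustrative):

```python
# Hedged sketch (not from the slides): numerically checking that the augmented
# likelihood matches, i.e. L(theta; R) = L(theta) when R_ij = p_theta(y_i=j|x_i).
# The data and parameters below are made up for illustration.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
xs = rng.normal(size=(50, 2))                           # any data will do
pis = np.array([0.4, 0.6])
mus = [np.zeros(2), np.array([3.0, 0.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]

# J[i, j] = p_theta(x_i, y_i = j) = pi_j * N(x_i; mu_j, Sigma_j)
J = np.column_stack([pi * multivariate_normal(mean=mu, cov=S).pdf(xs)
                     for pi, mu, S in zip(pis, mus, Sigmas)])
R = J / J.sum(axis=1, keepdims=True)                    # posterior responsibilities
L_aug = np.sum(R * np.log(J / R))                       # L(theta; R)
L_plain = np.sum(np.log(J.sum(axis=1)))                 # L(theta)
print(np.isclose(L_aug, L_plain))                       # True
```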

slide-29
SLIDE 29

E-M method for latent variable models

Define augmented likelihood

L(θ; R) := ∑_{i=1}^n ∑_{j=1}^k Rij ln [ pθ(xi, yi = j) / Rij ],

with responsibility matrix R ∈ R_{n,k} := {R ∈ [0, 1]^{n×k} : R 1_k = 1_n}.

Alternate two steps:

◮ E-step: set (Rt)ij := p_{θ_{t−1}}(yi = j | xi).

◮ M-step: set θt = arg max_{θ∈Θ} L(θ; Rt).

Soon: we'll see this gives nondecreasing likelihood!

30 / 70

slide-30
SLIDE 30

E-M for Gaussian mixtures

Initialization: a standard choice is πj = 1/k, Σj = I, and (µj)_{j=1}^k given by k-means.

◮ E-step: Set Rij = pθ(yi = j | xi), meaning

Rij = pθ(yi = j | xi) = pθ(yi = j, xi) / pθ(xi) = πj p_{µj,Σj}(xi) / ∑_{l=1}^k πl p_{µl,Σl}(xi).

◮ M-step: solve arg max_{θ∈Θ} L(θ; R), meaning

πj := (∑_{i=1}^n Rij) / (∑_{i=1}^n ∑_{l=1}^k Ril) = (∑_{i=1}^n Rij) / n,
µj := (∑_{i=1}^n Rij xi) / (∑_{i=1}^n Rij) = (∑_{i=1}^n Rij xi) / (n πj),
Σj := (∑_{i=1}^n Rij (xi − µj)(xi − µj)^T) / (n πj).

(These are as before.)

31 / 70
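Putting the E-step and M-step together, a minimal E-M loop for a GMM might look like the following sketch (assumptions: numpy array X of shape (n, d); random data points in place of the k-means initialization mentioned above; no convergence check and no safeguards against the singularities discussed later):

```python
# Hedged sketch (not from the slides): an E-M loop for a GMM following the
# updates above. Initialization here uses random data points, not k-means.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pis = np.full(k, 1.0 / k)                        # standard initialization
    mus = X[rng.choice(n, size=k, replace=False)]
    Sigmas = np.array([np.eye(d) for _ in range(k)])
    for _ in range(iters):
        # E-step: R_ij = pi_j N(x_i; mu_j, Sigma_j) / sum_l pi_l N(x_i; mu_l, Sigma_l)
        J = np.column_stack([pis[j] * multivariate_normal(mus[j], Sigmas[j]).pdf(X)
                             for j in range(k)])
        R = J / J.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the previous slides
        Nj = R.sum(axis=0)
        pis = Nj / n
        mus = (R.T @ X) / Nj[:, None]
        Sigmas = np.array([((R[:, j, None] * (X - mus[j])).T @ (X - mus[j])) / Nj[j]
                           for j in range(k)])
    return pis, mus, Sigmas, R
```

In practice one would also track L(θt) across iterations, which the theorem below guarantees is nondecreasing.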

slide-31
SLIDE 31

Demo: spherical clusters

[demo figure omitted]

(Initialized with k-means, thus not so dramatic.)

32 / 70

slide-41
SLIDE 41

Demo: elliptical clusters

◮ E. . . M. . . E. . . M. . . (alternating E and M steps)

[demo figure omitted]

33 / 70

slide-70
SLIDE 70

Theorem. Suppose (R0, θ0) ∈ R_{n,k} × Θ is arbitrary, and thereafter (Rt, θt) are given by E-M:

(Rt)ij := p_{θ_{t−1}}(yi = j | xi) and θt := arg max_{θ∈Θ} L(θ; Rt).

Then

L(θt; Rt) ≤ max_{R∈R_{n,k}} L(θt; R) = L(θt; Rt+1) = L(θt) ≤ L(θt+1; Rt+1).

In particular, L(θt) ≤ L(θt+1).

Remarks.

◮ We proved a similar guarantee for k-means, which is also an alternating minimization scheme.

◮ Similarly, MLE for Gaussian mixtures is NP-hard; it is also known to need exponentially many samples in k to information-theoretically recover the parameters.

34 / 70

slide-72
SLIDE 72
Proof. We've already shown:

◮ L(θt; Rt+1) = L(θt);

◮ L(θt; Rt+1) ≤ max_{θ∈Θ} L(θ; Rt+1) = L(θt+1; Rt+1), by definition of θt+1.

We still need to show: L(θt; Rt+1) = max_{R∈R_{n,k}} L(θt; R). We'll give two proofs.

By concavity of ln ("Jensen's inequality" in convexity lectures), for any R ∈ R_{n,k},

L(θt; R) = ∑_{i=1}^n ∑_{j=1}^k Rij ln [ p_{θt}(xi, yi = j) / Rij ]
         ≤ ∑_{i=1}^n ln ( ∑_{j=1}^k Rij · p_{θt}(xi, yi = j) / Rij )
         = ∑_{i=1}^n ln p_{θt}(xi) = L(θt) = L(θt; Rt+1).

Since R was arbitrary, max_{R∈R_{n,k}} L(θt; R) = L(θt; Rt+1).

35 / 70

slide-73
SLIDE 73

Proof (continued). Here's a second proof of that missing fact. To evaluate arg max_{R∈R_{n,k}} L(θ; R), consider Lagrangian

∑_{i=1}^n [ ∑_{j=1}^k Rij ln pθ(xi, y = j) − ∑_{j=1}^k Rij ln Rij + λi ( ∑_{j=1}^k Rij − 1 ) ].

Fixing i and taking the gradient with respect to Rij for any j,

0 = ln pθ(xi, yi = j) − ln Rij − 1 + λi,

giving Rij = pθ(xi, y = j) exp(λi − 1). Since moreover

1 = ∑_j Rij = exp(λi − 1) ∑_j pθ(xi, y = j) = exp(λi − 1) pθ(xi),

it follows that exp(λi − 1) = 1/pθ(xi), and the optimal R satisfies Rij = pθ(xi, y = j) / pθ(xi) = pθ(y = j | xi).

36 / 70
slide-74
SLIDE 74

Related issues.

37 / 70

slide-77
SLIDE 77

Parameter constraints.

E-M for GMMs still works if we freeze or constrain some parameters. Examples:

◮ No weights: initialize π = (1/k, . . . , 1/k) and never update it.

◮ Diagonal covariance matrices: update everything as before, except Σj := diag((σj)²_1, . . . , (σj)²_d) where

(σj)²_l := (∑_{i=1}^n Rij (xi − µj)²_l) / (n πj);

that is: we use coordinate-wise sample variances weighted by R.

Why is this a good idea? Computation (of inverse), sample complexity, . . .

38 / 70
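A sketch of the diagonal-covariance M-step (same assumptions as the earlier M-step sketch: numpy arrays X of shape (n, d) and R of shape (n, k)); each covariance now needs only d numbers and its inverse is trivial:

```python
# Hedged sketch (not from the slides): the diagonal-covariance variant of the
# M-step, using coordinate-wise variances weighted by R as described above.
import numpy as np

def m_step_diag(X, R):
    n, d = X.shape
    Nj = R.sum(axis=0)
    pis = Nj / n
    mus = (R.T @ X) / Nj[:, None]
    # (sigma_j)^2_l = sum_i R_ij (x_i - mu_j)^2_l / (n pi_j), one scalar per coordinate
    vars_ = np.stack([(R[:, j, None] * (X - mus[j]) ** 2).sum(axis=0) / Nj[j]
                      for j in range(R.shape[1])])
    Sigmas = np.array([np.diag(v) for v in vars_])     # d x d diagonal matrices
    return pis, mus, Sigmas
```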

slide-78
SLIDE 78

Gaussian Mixture Model with diagonal covariances.

[demo figure omitted]

39 / 70

slide-110
SLIDE 110

Singularities

E-M with GMMs suffers from singularities: trivial situations where the likelihood goes to ∞ but the solution is bad.

◮ Suppose: d = 1, k = 2, πj = 1/2, n = 3 with x1 = −1 and x2 = +1 and x3 = +3. Initialize with µ1 = 0 and σ1 = 1, but µ2 = +3 = x3 and σ2 = 1/100. Then σ2 → 0 and L ↑ ∞.

40 / 70
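The 1-d example above can be checked numerically; a small sketch (illustrative only):

```python
# Hedged sketch (not from the slides): the slide's 1-d singularity example in
# numbers -- as sigma_2 shrinks toward 0 with mu_2 sitting exactly on x_3, the
# mixture log-likelihood grows without bound.
import numpy as np

xs = np.array([-1.0, 1.0, 3.0])
mu1, s1, mu2 = 0.0, 1.0, 3.0

def normal_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for s2 in [1e-2, 1e-4, 1e-8]:
    mix = 0.5 * normal_pdf(xs, mu1, s1) + 0.5 * normal_pdf(xs, mu2, s2)
    print(s2, np.log(mix).sum())     # log-likelihood increases as s2 -> 0
```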

slide-113
SLIDE 113

Interpolating between k-means and GMM E-M

Same M-step: fix π = (1/k, . . . , 1/k) and Σj = cI for a fixed c > 0.

Same E-step: define qij := (1/2)‖xi − µj‖²; the E-step chooses

Rij := pθ(yi = j | xi) = pθ(yi = j, xi) / pθ(xi) = pθ(yi = j, xi) / ∑_{l=1}^k pθ(yi = l, xi)
     = πj p_{µj,Σj}(xi) / ∑_{l=1}^k πl p_{µl,Σl}(xi) = exp(−qij/c) / ∑_{l=1}^k exp(−qil/c).

Fix i ∈ {1, . . . , n} and suppose the minimum qi := min_j qij is unique:

lim_{c↓0} Rij = lim_{c↓0} exp(−qij/c) / ∑_{l=1}^k exp(−qil/c) = lim_{c↓0} exp((qi − qij)/c) / ∑_{l=1}^k exp((qi − qil)/c)
             = 1 if qij = qi, and 0 if qij ≠ qi.

That is, R becomes hard assignment A as c ↓ 0.

41 / 70
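The E-step on this slide is a softmax over −qij with temperature c; here is a small sketch (made-up points and centers) showing the responsibilities hardening as c ↓ 0, using the same shift by qi = min_j qij as in the limit argument:

```python
# Hedged sketch (not from the slides): responsibilities as a softmax over
# -q_ij / c, computed stably by subtracting q_i = min_j q_ij. As c shrinks,
# each row of R approaches the hard k-means assignment.
import numpy as np

def responsibilities(X, mus, c):
    q = 0.5 * ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # q_ij
    shifted = -(q - q.min(axis=1, keepdims=True)) / c                # (q_i - q_ij)/c
    w = np.exp(shifted)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[0.0, 0.0], [2.0, 0.1], [0.2, 1.9]])
mus = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
for c in [5.0, 0.5, 0.01]:
    print(c, np.round(responsibilities(X, mus, c), 3))   # rows approach one-hot
```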

slide-114
SLIDE 114

Interpolating between k-means and GMM E-M (part 2)

We can interpolate algorithmically, meaning we can create algorithms that have elements of both. Here’s something like k-means but with weights and covariances.

[demo figure omitted]

42 / 70

slide-138
SLIDE 138

Summary of MLE part 2

43 / 70