
SLIDE 1

CS 6316 Machine Learning

Generative Models

Yangfeng Ji

Department of Computer Science, University of Virginia

SLIDE 2

Basic Definition

SLIDE 3

Data generation process

An idealized process to illustrate the relations among the domain set X, the label set Y, and the training set S:

  • 1. assume a probability distribution D over the domain set X
  • 2. sample an instance x ∈ X according to D
  • 3. annotate it using the labeling function f as y = f(x)

[From Lecture 02]

SLIDE 4

Example

Here is a data generation model

   p(x) = 0.6 · N(x; µ+, Σ+) + 0.4 · N(x; µ-, Σ-)   (1)

where the first component corresponds to y = +1 and the second to y = −1, with

◮ µ+ = [2, 0]ᵀ and Σ+ = [1.0 0.8; 0.8 2.0]
◮ µ- = [−2, 0]ᵀ and Σ- = [2.0 0.6; 0.6 1.0]
SLIDE 5

Example (II)

The data generation model can also be represented with the following components

   p(y = +1) = 0.6                    (2)
   p(y = −1) = 1 − p(y = +1) = 0.4    (3)
   p(x | y = +1) = N(x; µ+, Σ+)       (4)
   p(x | y = −1) = N(x; µ-, Σ-)       (5)

SLIDES 6-8

Data Generation

The specific data generation process: for each data point

  • 1. Randomly select a value of y ∈ {+1, −1} based on

       p(y = +1) = 0.6,  p(y = −1) = 0.4   (6)

  • 2. Sample x from the corresponding component based on the value of y

       p(x | y) = N(x; µ+, Σ+) if y = +1,  N(x; µ-, Σ-) if y = −1   (7)

  • 3. Add (x, y) to S, go to step 1
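To make the three steps concrete, here is a minimal numpy sketch of this generation process, using the parameters from equation (1); the names (generate, mu_pos, and so on) are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters from equation (1)
mu_pos, Sigma_pos = np.array([2.0, 0.0]), np.array([[1.0, 0.8], [0.8, 2.0]])
mu_neg, Sigma_neg = np.array([-2.0, 0.0]), np.array([[2.0, 0.6], [0.6, 1.0]])

def generate(m):
    """Draw m labeled samples following the three-step process above."""
    S = []
    for _ in range(m):
        # Step 1: sample the label y with p(y = +1) = 0.6
        y = +1 if rng.random() < 0.6 else -1
        # Step 2: sample x from the Gaussian component selected by y
        mu, Sigma = (mu_pos, Sigma_pos) if y == +1 else (mu_neg, Sigma_neg)
        x = rng.multivariate_normal(mu, Sigma)
        # Step 3: add (x, y) to S
        S.append((x, y))
    return S

S = generate(1000)
```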

SLIDE 9

Illustration

With N = 1000 samples drawn from this process: 588 positive samples and 412 negative samples [scatter plot not reproduced]

SLIDES 10-11

Discriminative Models for Classification

◮ Discriminative models directly give predictions on the target variable (e.g., y)

◮ Example: logistic regression

   p(y | x) = σ(y⟨w, x⟩) = 1 / (1 + e^(−y⟨w, x⟩))   (8)

   where w is the model parameter

◮ Other examples
   ◮ AdaBoost (lecture 05)
   ◮ SVMs (lecture 07)
   ◮ Feed-forward neural networks (lecture 08)
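For contrast with the generative approach developed next, here is a minimal sketch of evaluating the logistic-regression probability in equation (8); the weight vector below is an arbitrary placeholder, as if already trained.

```python
import numpy as np

def logistic_prob(w, x, y):
    """p(y | x) = 1 / (1 + exp(-y * <w, x>)), as in equation (8)."""
    return 1.0 / (1.0 + np.exp(-y * np.dot(w, x)))

w = np.array([1.0, -0.5])       # placeholder model parameter
x = np.array([2.0, 0.3])
print(logistic_prob(w, x, +1))  # probability of the positive label
```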

SLIDES 12-14

Generative Models for Classification

◮ Basic idea: building a classifier by simulating the data generation process

◮ For the binary classification problem, recall the basic components of the data generation process
   ◮ p(y) where y ∈ {−1, +1}
   ◮ p(x | y = +1) where x ∈ R^d
   ◮ p(x | y = −1) where x ∈ R^d

◮ Challenge in machine learning: we do not know any of them; instead we have the samples S from this distribution
   ◮ This has always been our assumption in machine learning: we have no idea about the true data distribution

SLIDE 15

Generative Models for Classification (II)

We use a set of distributions q(·) to approximate the true distribution p(·)

   Data Generation Model    Generative Model
   p(y)                     q(y)
   p(x | y = +1)            q(x | y = +1)
   p(x | y = −1)            q(x | y = −1)

SLIDE 16

Learning with Generative Models

  • 1. Define distributions for all components
  • 2. Estimate the parameters for each component distribution

SLIDES 17-20

Defining Distributions

A typical way of defining distributions for generative models is based on our understanding of the problem

◮ Output domain y ∈ {+1, −1}: Bernoulli distribution

   p(y) = Bern(y; α) = α^δ(y=+1) (1 − α)^δ(y=−1)   (9)

   where α ∈ (0, 1) is the parameter

◮ Input domain x ∈ R^d: Gaussian distribution

   p(x | y = +1) = N(x; µ+, Σ+)   (10)

   where µ+ and Σ+ are the parameters

◮ Similarly, for p(x | y = −1)

   p(x | y = −1) = N(x; µ-, Σ-)   (11)

   where µ- and Σ- are the parameters
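A minimal sketch of these modeling choices, with placeholder parameter values: the Bernoulli mass function of equation (9), where δ(·) is the indicator, and the class-conditional Gaussian densities of equations (10)-(11) via scipy.

```python
from scipy.stats import multivariate_normal

alpha = 0.6   # Bernoulli parameter, placeholder value

def q_y(y):
    """Bernoulli over {+1, -1}, equation (9)."""
    return alpha if y == +1 else 1.0 - alpha

def q_x_given_y(x, y, mu_pos, Sigma_pos, mu_neg, Sigma_neg):
    """Class-conditional Gaussian density, equations (10)-(11)."""
    mu, Sigma = (mu_pos, Sigma_pos) if y == +1 else (mu_neg, Sigma_neg)
    return multivariate_normal.pdf(x, mean=mu, cov=Sigma)
```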

SLIDES 21-22

Parameter Estimation

◮ The collection of the parameters

   θ = {α, µ+, Σ+, µ-, Σ-}   (12)

◮ Training data S = {(x1, y1), . . . , (xm, ym)}

◮ Learning algorithm: Maximum Likelihood Estimation (MLE)

SLIDES 23-25

Maximum Likelihood Estimation (MLE)

MLE defined on the whole distribution q(x, y)

   θ ← argmax_θ′ ∑_{i=1}^m log q(xi, yi; θ′)   (13)

Based on the chain rule of probability

   q(x, y; θ) = q(y; α) q(x | y; µ_y, Σ_y)   (14)

Therefore

   θ̂ ← argmax_θ [ ∑_{i=1}^m log q(yi; α) + ∑_{i=1}^m log q(xi | yi; µ_{yi}, Σ_{yi}) ]

where the last term has two components, depending on the value of y

SLIDES 26-28

MLE: Bernoulli Distribution

Recalling the definition of the Bernoulli distribution, we have

   ∑_{i=1}^m log q(yi; α) = ∑_{i=1}^m { δ(yi = +1) log α + δ(yi = −1) log(1 − α) }   (15)

Then the value of α can be estimated by setting the derivative to zero

   d/dα ∑_{i=1}^m log q(yi; α) = ∑_{i=1}^m δ(yi = +1) / α − ∑_{i=1}^m δ(yi = −1) / (1 − α) = 0   (16)

therefore

   α = (∑_{i=1}^m δ(yi = +1)) / m   (17)

SLIDES 29-32

MLE: Gaussian Distribution

The definition of the multivariate Gaussian distribution

   q(x | y; µ, Σ) = 1 / √((2π)^d |Σ|) · exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))   (18)

◮ For y = +1, MLE on µ+ and Σ+ will only consider the samples x with y = +1 (call this subset S+)

◮ MLE on µ+

   µ+ = (1/|S+|) ∑_{xi ∈ S+} xi   (19)

◮ MLE on Σ+

   Σ+ = (1/|S+|) ∑_{xi ∈ S+} (xi − µ+)(xi − µ+)ᵀ   (20)

◮ Exercise: prove equations 19 and 20 with d = 1
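Here is a minimal numpy sketch of the estimators in equations (17), (19), and (20), applied to an array of labeled samples; the function name and array shapes are illustrative.

```python
import numpy as np

def mle_estimates(X, y):
    """X: (m, d) array of inputs; y: (m,) array of labels in {+1, -1}."""
    # Equation (17): alpha is the fraction of positive labels
    alpha = np.mean(y == +1)
    # Equations (19)-(20), computed per class
    X_pos, X_neg = X[y == +1], X[y == -1]
    mu_pos = X_pos.mean(axis=0)
    Sigma_pos = (X_pos - mu_pos).T @ (X_pos - mu_pos) / len(X_pos)
    mu_neg = X_neg.mean(axis=0)
    Sigma_neg = (X_neg - mu_neg).T @ (X_neg - mu_neg) / len(X_neg)
    return alpha, mu_pos, Sigma_pos, mu_neg, Sigma_neg
```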

SLIDE 33

Example: Parameter Estimation

Given N = 1000 samples, here are the parameters

   Parameter   p(·)                 q(·)
   µ+          [2, 0]ᵀ              [1.95, −0.11]ᵀ
   Σ+          [1.0 0.8; 0.8 2.0]   [0.88 0.74; 0.74 1.97]
   µ-          [−2, 0]ᵀ             [−2.08, 0.08]ᵀ
   Σ-          [2.0 0.6; 0.6 1.0]   [1.88 0.55; 0.55 1.07]
SLIDES 34-36

Prediction

◮ For a new data point x′, the prediction is given as

   q(y′ | x′) = q(y′) q(x′ | y′) / q(x′) ∝ q(y′) q(x′ | y′)   (21)

   There is no need to compute q(x′)

◮ Prediction rule

   y′ = +1 if q(y′ = +1 | x′) > q(y′ = −1 | x′), and y′ = −1 if q(y′ = +1 | x′) < q(y′ = −1 | x′)   (22)

◮ Although equation 22 looks like the one used in the Bayes optimal predictor, the prediction power is limited by how well

   q(y′ | x′) ≈ p(y | x)   (23)

   Again, we do not know p(·)
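A minimal sketch of the prediction rule in equations (21)-(22), reusing the parameters returned by the MLE sketch above; multivariate_normal is from scipy.

```python
from scipy.stats import multivariate_normal

def predict(x_new, alpha, mu_pos, Sigma_pos, mu_neg, Sigma_neg):
    """Return +1 or -1 by comparing q(y') q(x' | y'), equation (22)."""
    score_pos = alpha * multivariate_normal.pdf(x_new, mu_pos, Sigma_pos)
    score_neg = (1 - alpha) * multivariate_normal.pdf(x_new, mu_neg, Sigma_neg)
    return +1 if score_pos > score_neg else -1   # q(x') cancels, equation (21)
```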

SLIDE 37

Naive Bayes Classifiers

SLIDE 38

Number of Parameters

Assume x = (x·,1, . . . , x·,d) ∈ R^d; then the number of parameters in q(x, y) is

◮ q(y): 1 (α)
◮ q(x | y = +1):
   ◮ µ+ ∈ R^d: d parameters
   ◮ Σ+ ∈ R^{d×d}: d² parameters
◮ q(x | y = −1): d² + d parameters

In total, we have 2d² + 2d + 1 parameters

SLIDES 39-40

Challenge of Parameter Estimation

◮ When d = 100, we have 2d² + 2d + 1 = 20201 parameters

◮ A closer look at the covariance matrix Σ in a multivariate Gaussian distribution

   Σ = [σ²_{1,1} · · · σ²_{1,d}; . . . ; σ²_{d,1} · · · σ²_{d,d}]   (24)

◮ To reduce the number of parameters, we assume

   σ²_{i,j} = 0 if i ≠ j   (25)

SLIDE 41

Diagonal Covariance Matrix

With the diagonal covariance matrix

   Σ = [σ²_{1,1} · · · 0; . . . ; 0 · · · σ²_{d,d}]   (26)

the multivariate Gaussian distribution can be rewritten with

   |Σ| = ∏_{j=1}^d σ²_{j,j}   (27)

   (x − µ)ᵀ Σ⁻¹ (x − µ) = ∑_{j=1}^d (x·,j − µj)² / σ²_{j,j}   (28)

SLIDES 42-46

Diagonal Covariance Matrix (II)

In other words,

   q(x | y; µ, Σ) = ∏_{j=1}^d q(x·,j | y; µj, σ²_{j,j})   (29)

◮ Conditional independence: equation 29 means that, given y, each component xj is independent of the other components

◮ This is a strong and naive assumption about q(x | ·)

◮ Together with q(y), this generative model is called the Naive Bayes classifier

◮ Parameter estimation can be done per dimension
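A minimal sketch of per-dimension estimation under the conditional-independence assumption (29): each class keeps only a mean and a variance per feature rather than a full covariance matrix; names and shapes are illustrative.

```python
import numpy as np

def naive_bayes_estimates(X, y):
    """Per-dimension Gaussian Naive Bayes estimates for labels in {+1, -1}."""
    params = {}
    for label in (+1, -1):
        Xc = X[y == label]
        # Only d means and d variances per class; off-diagonal terms are assumed 0
        params[label] = (Xc.mean(axis=0), Xc.var(axis=0))
    alpha = np.mean(y == +1)
    return alpha, params
```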

SLIDE 47

Example: Parameter Estimation

Given N = 1000 samples, here are the parameters

   Parameter   p(·)                 q(·)                     Naive Bayes
   µ+          [2, 0]ᵀ              [1.95, −0.11]ᵀ           [1.95, −0.11]ᵀ
   Σ+          [1.0 0.8; 0.8 2.0]   [0.88 0.74; 0.74 1.97]   [0.88 0; 0 1.97]
   µ-          [−2, 0]ᵀ             [−2.08, 0.08]ᵀ           [−2.08, 0.08]ᵀ
   Σ-          [2.0 0.6; 0.6 1.0]   [1.88 0.55; 0.55 1.07]   [1.88 0; 0 1.07]
SLIDE 48

Latent Variable Models

SLIDES 49-51

Data Generation Model, Revisited

Consider the following model again, without any label information

   p(x) = α · N(x; µ1, Σ1) + (1 − α) · N(x; µ2, Σ2)   (30)

   where the first term is component c = 1 and the second is component c = 2

◮ No labeling information
◮ Instead of having two classes, it now has two components c ∈ {1, 2}
◮ It is a specific case of Gaussian mixture models
   ◮ A mixture model with two Gaussian components

SLIDES 52-54

Data Generation

The data generation process: for each data point

  • 1. Randomly select a component c based on

       p(c = 1) = α,  p(c = 2) = 1 − α   (31)

  • 2. Sample x from the corresponding component c

       p(x | c) = N(x; µ1, Σ1) if c = 1,  N(x; µ2, Σ2) if c = 2   (32)

  • 3. Add x to S, go to step 1

SLIDE 55

Illustration

Here is an example data set S with 1,000 samples; no label information is available [scatter plot not reproduced]

SLIDES 56-59

The Learning Problem

Consider using the following distribution to fit the data S

   q(x) = α · N(x; µ1, Σ1) + (1 − α) · N(x; µ2, Σ2)   (33)

◮ This is a density estimation problem, one of the unsupervised learning problems

◮ The number of components in q(x) is part of the assumption, based on our understanding of the data

◮ Without knowing the true data distribution, the number of components is treated as a hyper-parameter (predetermined before learning)

SLIDE 60

Parameter Estimation

◮ Based on the general form of GMMs, the parameters are θ = {α, µ1, Σ1, µ2, Σ2}

◮ Given a set of training examples S = {x1, . . . , xm}, the straightforward method is MLE

   L(θ) = ∑_{i=1}^m log q(xi; θ) = ∑_{i=1}^m log [ α · N(xi; µ1, Σ1) + (1 − α) · N(xi; µ2, Σ2) ]   (34)

◮ Learning: θ ← argmax_θ′ L(θ′)
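A minimal sketch of the log-likelihood in equation (34) for a two-component GMM; a numerically safer version would use log-sum-exp, but the direct form mirrors the equation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, alpha, mu1, Sigma1, mu2, Sigma2):
    """L(theta) = sum_i log(alpha * N(x_i; mu1, Sigma1) + (1 - alpha) * N(x_i; mu2, Sigma2))."""
    p1 = multivariate_normal.pdf(X, mu1, Sigma1)   # vectorized over the rows of X
    p2 = multivariate_normal.pdf(X, mu2, Sigma2)
    return np.sum(np.log(alpha * p1 + (1 - alpha) * p2))
```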

SLIDES 61-63

Singularity in GMM Parameter Estimation

Singularity happens when one of the mixture components only captures a single data point, which eventually leads the (log-)likelihood to ∞

◮ It is easy to overfit the training set using GMMs, for example when K = m

◮ This issue does not exist when estimating parameters for a single Gaussian distribution

SLIDE 64

Gradient-based Learning

Recall the definition of L(θ)

   L(θ) = ∑_{i=1}^m log [ α · N(xi; µ1, Σ1) + (1 − α) · N(xi; µ2, Σ2) ]   (35)

◮ There is no closed-form solution of ∇L(θ) = 0
   ◮ E.g., the value of α depends on {µc, Σc}²_{c=1}, and vice versa

◮ Gradient-based learning is still feasible as

   θ^(new) ← θ^(old) + η · ∇L(θ^(old))

SLIDES 65-67

Latent Variable Models

To rewrite equation 33 into a full probabilistic form, we introduce a random variable z ∈ {1, 2}, with

   q(z = 1) = α,  q(z = 2) = 1 − α   (36)

or

   q(z) = α^δ(z=1) (1 − α)^δ(z=2)   (37)

◮ z is a random variable that indicates the mixture component for x (a similar role as y in the classification problem)

◮ z is not directly observed in the data; therefore it is a latent (random) variable

SLIDES 68-69

GMM with Latent Variable

With the latent variable z, we can rewrite the probabilistic model as a joint distribution over x and z

   q(x, z) = q(z) q(x | z)
           = α^δ(z=1) · N(x; µ1, Σ1)^δ(z=1) · (1 − α)^δ(z=2) · N(x; µ2, Σ2)^δ(z=2)   (38)

And the marginal probability q(x) is the same as in equation 33

   q(x) = q(z = 1) q(x | z = 1) + q(z = 2) q(x | z = 2)
        = α · N(x; µ1, Σ1) + (1 − α) · N(x; µ2, Σ2)   (39)

SLIDES 70-71

Parameter Estimation: MLE?

For each xi, we introduce a latent variable zi as the mixture component indicator; then the log-likelihood is defined as

   ℓ(θ) = ∑_{i=1}^m log q(xi, zi)
        = ∑_{i=1}^m log [ α^δ(zi=1) · N(xi; µ1, Σ1)^δ(zi=1) · (1 − α)^δ(zi=2) · N(xi; µ2, Σ2)^δ(zi=2) ]   (40)
        = ∑_{i=1}^m [ δ(zi = 1) log α + δ(zi = 1) log N(xi; µ1, Σ1) + δ(zi = 2) log(1 − α) + δ(zi = 2) log N(xi; µ2, Σ2) ]

Question: we already know that zi is a random variable, but is E[δ(zi = 1)] = α?

SLIDE 72

EM Algorithm

SLIDES 73-75

Basic Idea

◮ The key challenge of GMM with latent variables is that we do not know the distributions of {zi}

◮ The basic idea of the EM algorithm is to alternately address the challenge between

   {zi}_{i=1}^m ⇔ θ = {α, µ1, Σ1, µ2, Σ2}   (41)

◮ Basic procedure
  • 1. Fix θ, estimate the distributions of {zi}_{i=1}^m
  • 2. Fix the distributions of {zi}_{i=1}^m, estimate the value of θ
  • 3. Go back to step 1

SLIDE 76

How to Estimate zi?

Fixing θ, we can estimate the distribution of each zi as (with equations 38 and 39)

   q(zi | xi) = q(xi, zi) / q(xi)   (42)

In particular, we have

   q(zi = 1 | xi) = α · N(xi; µ1, Σ1) / [ α · N(xi; µ1, Σ1) + (1 − α) · N(xi; µ2, Σ2) ]   (43)
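A minimal sketch of the posterior responsibility in equation (43); gamma[i] below is q(zi = 1 | xi) for each row of X.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alpha, mu1, Sigma1, mu2, Sigma2):
    """Equation (43): posterior responsibility of component 1 for each x_i."""
    p1 = alpha * multivariate_normal.pdf(X, mu1, Sigma1)
    p2 = (1 - alpha) * multivariate_normal.pdf(X, mu2, Sigma2)
    return p1 / (p1 + p2)   # gamma_i = q(z_i = 1 | x_i)
```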

SLIDES 77-79

Expectation

Let γi be the expectation of zi under the distribution q(zi | xi)

   E[zi] = γi   (44)

◮ Since zi is a Bernoulli random variable, we also have q(zi = 1 | xi) = γi

◮ Furthermore, the expectation of δ(zi = 1) under the distribution q(zi | xi) is

   E[δ(zi = 1)] = 1 · q(zi = 1 | xi) + 0 · q(zi = 2 | xi) = q(zi = 1 | xi) = γi   (45)

SLIDES 80-82

Parameter Estimation (I)

Given

   ℓ(θ) = ∑_{i=1}^m [ δ(zi = 1) log α + δ(zi = 1) log N(xi; µ1, Σ1) + δ(zi = 2) log(1 − α) + δ(zi = 2) log N(xi; µ2, Σ2) ]   (46)

To maximize ℓ(θ) with respect to α, we set the derivative to zero

   ∑_{i=1}^m [ δ(zi = 1)/α − δ(zi = 2)/(1 − α) ] = 0   (47)

and

   α | z = ∑_{i=1}^m δ(zi = 1) / ∑_{i=1}^m (δ(zi = 1) + δ(zi = 2)) = (∑_{i=1}^m δ(zi = 1)) / m   (48)

which is similar to the classification example, except that zi is a random variable

SLIDES 83-84

Parameter Estimation (II)

Without going through the details, the estimates of the mean and covariance take similar forms. For example, for the first component we have

   µ1 | z = (1/m) ∑_{i=1}^m δ(zi = 1) xi   (49)

   Σ1 | z = (1/m) ∑_{i=1}^m δ(zi = 1)(xi − µ1)(xi − µ1)ᵀ   (50)

Question: how do we eliminate the randomness in α, µ1, Σ1 (and similarly in µ2, Σ2)?

SLIDES 85-86

Expectation (II)

With E[δ(zi = 1)] = γi, we have

   α = E[α | z] = (1/m) ∑_{i=1}^m E[δ(zi = 1)] = (1/m) ∑_{i=1}^m γi   (51)

Similarly, we have

   µ1 = (1/m) ∑_{i=1}^m γi xi                        µ2 = (1/m) ∑_{i=1}^m (1 − γi) xi

   Σ1 = (1/m) ∑_{i=1}^m γi (xi − µ1)(xi − µ1)ᵀ       Σ2 = (1/m) ∑_{i=1}^m (1 − γi)(xi − µ2)(xi − µ2)ᵀ   (52)

SLIDE 87

The EM Algorithm, Review

The algorithm iteratively runs the following two steps:

E-step: Given θ, for each xi, estimate the distribution of the corresponding latent variable zi

   q(zi | xi) = q(xi, zi) / q(xi)   (53)

   and its expectation γi

M-step: Given the expectations {γi}_{i=1}^m, maximize the log-likelihood function ℓ(θ) and estimate the parameter θ
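Putting the steps together, here is a minimal EM sketch for the two-component GMM. The initialization and iteration count are placeholder choices, and the M-step here normalizes the means and covariances by each component's total responsibility ∑γi (the standard EM update), rather than by m as in the compact forms of equations (49)-(52).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm2(X, n_iters=100):
    """EM for a two-component GMM, following the updates on slides 76-86."""
    m, d = X.shape
    rng = np.random.default_rng(0)
    # Placeholder initialization: two random data points as means, identity covariances
    alpha = 0.5
    mu1, mu2 = X[rng.integers(m)], X[rng.integers(m)]
    Sigma1 = Sigma2 = np.eye(d)
    for _ in range(n_iters):
        # E-step, equation (43)
        p1 = alpha * multivariate_normal.pdf(X, mu1, Sigma1)
        p2 = (1 - alpha) * multivariate_normal.pdf(X, mu2, Sigma2)
        gamma = p1 / (p1 + p2)
        # M-step: equation (51) for alpha; responsibility-weighted means and covariances
        alpha = gamma.mean()
        w1, w2 = gamma.sum(), (1 - gamma).sum()
        mu1 = (gamma[:, None] * X).sum(axis=0) / w1
        mu2 = ((1 - gamma)[:, None] * X).sum(axis=0) / w2
        Sigma1 = ((gamma[:, None] * (X - mu1)).T @ (X - mu1)) / w1
        Sigma2 = (((1 - gamma)[:, None] * (X - mu2)).T @ (X - mu2)) / w2
    return alpha, mu1, Sigma1, mu2, Sigma2
```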

SLIDE 88

Illustration

[EM illustration from Bishop, 2006, page 437; figure not reproduced]

SLIDE 89

Variational Inference (Optional)

SLIDE 90

The Computation of q(z | x)

◮ In the previous example, we were able to compute the analytic solution of q(z | x) as

   q(z | x) = q(x, z) / q(x)   (54)

   where q(x) = ∑_z q(x, z)

◮ Challenge: unlike the simple case in GMMs, usually q(x) is difficult to compute

   q(x) = ∑_z q(x, z)        (discrete)     (55)
   q(x) = ∫_z q(x, z) dz     (continuous)   (56)

SLIDES 91-92

Solution

◮ Instead of computing q(x) and then q(z | x), we propose another distribution q′(z | x) to approximate q(z | x)

   q′(z | x) ≈ q(z | x)   (57)

   where q′(z | x) should be simple enough to facilitate the computation

◮ The objective for finding a good approximation is the Kullback-Leibler (KL) divergence

   KL(q′ ∥ q) = ∑_z q′(z | x) log [ q′(z | x) / q(z | x) ]        (discrete)
   KL(q′ ∥ q) = ∫_z q′(z | x) log [ q′(z | x) / q(z | x) ] dz     (continuous)

SLIDES 93-96

KL Divergence

◮ KL(q′ ∥ q) ≥ 0, and the equality holds if and only if q′ = q

◮ Consider the continuous case for visualization purposes:

   KL(q′ ∥ q) = ∫_z q′(z | x) log [ q′(z | x) / q(z | x) ] dz   (58)

◮ Regardless of what q(z | x) looks like, we are free to define q′(z | x) for simplicity

◮ Because of the q(z | x) term in equation 58, the challenge still exists

SLIDES 97-100

ELBo

The learning objective for q′(z | x) is

   KL(q′ ∥ q) = ∫_z q′(z | x) log [ q′(z | x) / q(z | x) ] dz
              = ∫_z q′(z | x) log [ q′(z | x) q(x) / q(z, x) ] dz
              = ∫_z q′(z | x) log [ q′(z | x) q(x) / (q(x | z) q(z)) ] dz
              = ∫_z q′(z | x) [ −log q(x | z) + log (q′(z | x) / q(z)) + log q(x) ] dz
              = −E[log q(x | z)] + KL(q′(z | x) ∥ q(z)) + log q(x)
              = −ELBo + log q(x)

where ELBo = E[log q(x | z)] − KL(q′(z | x) ∥ q(z)). Since log q(x) does not depend on q′, minimizing KL(q′ ∥ q) is equivalent to maximizing the Evidence Lower Bound (ELBo)
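As a numerical sanity check on the identity KL(q′ ∥ q) = −ELBo + log q(x), here is a minimal discrete example with z taking two values; all distribution values are arbitrary placeholders.

```python
import numpy as np

# Placeholder joint q(x, z) for one fixed x and z in {0, 1}
q_joint = np.array([0.12, 0.28])    # q(x, z = 0), q(x, z = 1)
q_x = q_joint.sum()                 # q(x) = sum_z q(x, z)
q_z_given_x = q_joint / q_x         # exact posterior q(z | x)
q_z = np.array([0.5, 0.5])          # prior q(z), placeholder
q_x_given_z = q_joint / q_z         # q(x | z) = q(x, z) / q(z)

q_prime = np.array([0.4, 0.6])      # an arbitrary approximation q'(z | x)

kl = np.sum(q_prime * np.log(q_prime / q_z_given_x))
elbo = np.sum(q_prime * np.log(q_x_given_z)) - np.sum(q_prime * np.log(q_prime / q_z))

print(np.isclose(kl, -elbo + np.log(q_x)))   # True: the identity holds
```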

SLIDE 101

Reference

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.