SLIDE 1

Unsupervised Learning

2018 EE448, Big Data Mining, Lecture 7

Weinan Zhang, Shanghai Jiao Tong University

http://wnzhang.net

http://wnzhang.net/teaching/ee448/index.html
SLIDE 2

ML Problem Setting

  • Unsupervised learning
    • First build and learn p(x), then infer the conditional dependence p(xt|xi)
    • Each dimension of x is treated equally
  • Supervised learning
    • Directly learn the conditional dependence p(xt|xi)
    • xt is the label to predict
SLIDE 3

Definition of Unsupervised Learning

  • Given the training dataset $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data
  • Probabilistic density function (p.d.f.) estimation: $p(x)$
  • Latent variables: $z \to x$
  • Good data representation (used for discrimination): $\phi(x)$
SLIDE 4

Uses of Unsupervised Learning

  • Data structure discovery, data science
  • Data compression
  • Outlier detection
  • Input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards)
  • A theory of biological learning and perception

Slide credit: Maneesh Sahani
SLIDE 5

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
SLIDE 6

K-Means Clustering

SLIDE 7

K-Means Clustering

SLIDE 8

K-Means Clustering

  • Provide the number of desired clusters, k
  • Randomly choose k instances as seeds, one per cluster, i.e. the centroid for each cluster
  • Iterate
    • Assign each instance to the cluster with the closest centroid
    • Re-estimate the centroid of each cluster
  • Stop when the clustering converges
    • Or after a fixed number of iterations

Slide credit: Ray Mooney
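As a concrete illustration of the loop above, here is a minimal NumPy sketch of K-means; the function name `kmeans` and the toy data are illustrative choices of this write-up, not from the slides:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means with Euclidean distance. X is an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Randomly choose k instances as the initial centroids (seeds)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each instance to the cluster with the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate the centroid of each cluster (keep old one if a cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # clustering has converged
            break
        centroids = new_centroids
    return centroids, labels

# toy usage with two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
```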

SLIDE 9

K-Means Clustering: Centroid

  • Assume instances are real-valued vectors $x \in \mathbb{R}^d$
  • Clusters are based on centroids: the center of gravity, or mean, of the points in a cluster $C_k$:

$$\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$$

Slide credit: Ray Mooney
SLIDE 10

K-Means Clustering: Distance

  • Distance to a centroid: $L(x, \mu_k)$
  • Euclidean distance (L2 norm):

$$L_2(x, \mu_k) = \|x - \mu_k\| = \sqrt{\sum_{m=1}^{d} \big(x_m - \mu_{k,m}\big)^2}$$

  • Manhattan distance (L1 norm):

$$L_1(x, \mu_k) = |x - \mu_k| = \sum_{m=1}^{d} \big|x_m - \mu_{k,m}\big|$$

  • Cosine distance:

$$L_{\cos}(x, \mu_k) = 1 - \frac{x^\top \mu_k}{|x| \cdot |\mu_k|}$$

Slide credit: Ray Mooney
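A small NumPy sketch of these three distance measures; the function names are illustrative, not from the slides:

```python
import numpy as np

def l2_distance(x, mu):
    return np.sqrt(np.sum((x - mu) ** 2))   # Euclidean (L2) distance

def l1_distance(x, mu):
    return np.sum(np.abs(x - mu))           # Manhattan (L1) distance

def cosine_distance(x, mu):
    return 1.0 - x @ mu / (np.linalg.norm(x) * np.linalg.norm(mu))
```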

SLIDE 11

K-Means Example (K=2)

Figure: pick seeds, assign points to clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged.

Slide credit: Ray Mooney
SLIDE 12

K-Means Time Complexity

  • Assume computing the distance between two instances is O(d), where d is the dimensionality of the vectors
  • Reassigning clusters: O(knd) distance computations
  • Computing centroids: each instance vector gets added once to some centroid: O(nd)
  • Assume these two steps are each done once for I iterations: O(Iknd)

Slide credit: Ray Mooney
SLIDE 13

K-Means Clustering Objective

  • The objective of K-means is to minimize the total sum of the squared distances of every point to its corresponding cluster centroid:

$$\min_{\{\mu_k\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2, \qquad \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$$

  • Finding the global optimum is NP-hard.
  • The K-means algorithm is guaranteed to converge to a local optimum.
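For reference, a tiny NumPy sketch of this objective, assuming `centroids` and `labels` come from a clustering run such as the `kmeans` sketch earlier (names are illustrative):

```python
import numpy as np

def kmeans_objective(X, centroids, labels):
    # total squared distance of every point to its assigned centroid
    return np.sum((X - centroids[labels]) ** 2)
```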

SLIDE 14

Seed Choice

  • Results can vary based on random seed selection.
  • Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
  • Select good seeds using a heuristic or the results of another method.
SLIDE 15

Clustering Applications

  • Text mining
    • Cluster documents for related search
    • Cluster words for query suggestion
  • Recommender systems and advertising
    • Cluster users for item/ad recommendation
    • Cluster items for related item suggestion
  • Image search
    • Cluster images for similar image search and duplication detection
  • Speech recognition or separation
    • Cluster phonetic features
SLIDE 16

Principal Component Analysis (PCA)

  • An example of 2-dimensional data
    • x1: the piloting skill of a pilot
    • x2: how much he/she enjoys flying
  • Main components
    • u1: the intrinsic piloting "karma" of a person
    • u2: some noise

Example credit: Andrew Ng
SLIDE 17

Principal Component Analysis (PCA)

  • PCA tries to identify the subspace in which the data approximately lies
  • PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
  • The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations:

$$\mathbb{R}^d \to \mathbb{R}^k, \qquad k \ll d$$
SLIDE 18

PCA Data Preprocessing

  • Typically we first pre-process the data to normalize its mean and variance
  • Given the dataset $D = \{x^{(i)}\}_{i=1}^{m}$
  • 1. Move the center of the data set to 0:

$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad x^{(i)} \leftarrow x^{(i)} - \mu$$

  • 2. Unify the variance of each variable:

$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \big(x_j^{(i)}\big)^2, \qquad x_j^{(i)} \leftarrow x_j^{(i)} / \sigma_j$$
SLIDE 19

PCA Data Preprocessing

  • Zero out the mean of the data
  • Rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same "scale"
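A minimal NumPy sketch of these two preprocessing steps; the helper name `preprocess` is an illustrative choice, not from the slides:

```python
import numpy as np

def preprocess(X):
    """Center each coordinate, then rescale it to unit variance. X is (m, d)."""
    X = X - X.mean(axis=0)                   # step 1: move the center of the data to 0
    sigma = np.sqrt((X ** 2).mean(axis=0))   # per-coordinate standard deviation after centering
    return X / sigma                         # step 2: unit variance for each coordinate
```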

SLIDE 20

PCA Solution

  • PCA finds the directions of largest variance in the data
  • These correspond to the eigenvectors of the matrix $X^\top X$ with the largest eigenvalues
SLIDE 21

PCA Solution: Data Projection

  • The projection of each point $x^{(i)}$ onto a direction $u$ (with $\|u\| = 1$) is $x^{(i)\top} u$
  • The variance of the projection is

$$\frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)\top} u\big)^2 = \frac{1}{m}\sum_{i=1}^{m} u^\top x^{(i)} x^{(i)\top} u = u^\top \Big(\frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)\top}\Big) u \equiv u^\top \Sigma u$$
SLIDE 22


PCA Solution: Largest Eigenvalues

  • Finding the k principal components of the data amounts to finding the k principal eigenvectors of Σ, i.e. the top-k eigenvectors with the largest eigenvalues:

$$\max_{u} \; u^\top \Sigma u \quad \text{s.t. } \|u\| = 1, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} x^{(i)\top}$$

  • Projected vector for $x^{(i)}$:

$$y^{(i)} = \begin{bmatrix} u_1^\top x^{(i)} \\ u_2^\top x^{(i)} \\ \vdots \\ u_k^\top x^{(i)} \end{bmatrix} \in \mathbb{R}^k$$
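A minimal NumPy sketch of this procedure, assuming the data has already been centered and rescaled as above; the function name `pca` and the use of `np.linalg.eigh` for the eigendecomposition are choices of this write-up:

```python
import numpy as np

def pca(X, k):
    """Project preprocessed data X (m, d) onto its top-k principal components."""
    m = X.shape[0]
    Sigma = X.T @ X / m                      # (1/m) * sum_i x_i x_i^T
    w, U = np.linalg.eigh(Sigma)             # eigenvalues ascending, orthonormal eigenvectors
    top = U[:, np.argsort(w)[::-1][:k]]      # top-k eigenvectors with the largest eigenvalues
    return X @ top                           # rows are y_i = [u_1^T x_i, ..., u_k^T x_i]
```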

SLIDE 23

Eigendecomposition Revisit

  • For a positive semi-definite square matrix $\Sigma_{d \times d}$, suppose $u$ is an eigenvector with scalar eigenvalue $w$:

$$\Sigma u = w u, \qquad \|u\| = 1$$

  • There are d eigenvector-eigenvalue pairs $(u_i, w_i)$
  • These d eigenvectors are orthogonal, thus they form an orthonormal basis:

$$\sum_{i=1}^{d} u_i u_i^\top = I$$

  • Thus any vector v can be written as

$$v = \Big(\sum_{i=1}^{d} u_i u_i^\top\Big) v = \sum_{i=1}^{d} \big(u_i^\top v\big) u_i = \sum_{i=1}^{d} v_{(i)} u_i$$

  • $\Sigma_{d \times d}$ can be written as

$$\Sigma = \sum_{i=1}^{d} w_i u_i u_i^\top = U W U^\top, \qquad U = [u_1, u_2, \ldots, u_d], \qquad W = \mathrm{diag}(w_1, w_2, \ldots, w_d)$$
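A quick NumPy check of these identities on a random positive semi-definite matrix (illustrative only; the matrix is generated just for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma = A.T @ A                                      # positive semi-definite by construction
w, U = np.linalg.eigh(Sigma)                         # Sigma u_i = w_i u_i
assert np.allclose(U @ np.diag(w) @ U.T, Sigma)      # Sigma = U W U^T
assert np.allclose(U @ U.T, np.eye(5))               # orthonormal basis: sum_i u_i u_i^T = I
```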

SLIDE 24

Eigendecomposition Revisit

  • Given the data

$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix}$$

and its covariance matrix $\Sigma = X^\top X$ (here we may drop the factor $1/m$ for simplicity)

  • The variance in any direction v is

$$\|Xv\|^2 = \Big\|X \Big(\sum_{i=1}^{d} v_{(i)} u_i\Big)\Big\|^2 = \sum_{i,j} v_{(i)} \, u_i^\top \Sigma u_j \, v_{(j)} = \sum_{i=1}^{d} v_{(i)}^2 w_i$$

where $v_{(i)}$ is the projection length of v on $u_i$

  • The variance in the direction $u_i$ is

$$\|X u_i\|^2 = u_i^\top X^\top X u_i = u_i^\top \Sigma u_i = u_i^\top w_i u_i = w_i$$

  • If $v^\top v = 1$, then

$$\arg\max_{\|v\|=1} \|Xv\|^2 = u_{(\max)}$$

  • The direction of greatest variance is the eigenvector with the largest eigenvalue
SLIDE 25

PCA Discussion

  • PCA can also be derived by picking the basis that minimizes the approximation error arising from projecting the data onto the k-dimensional subspace spanned by it.
SLIDE 26

PCA Visualization

http://setosa.io/ev/principal-component-analysis/

SLIDE 27

PCA Visualization

http://setosa.io/ev/principal-component-analysis/

SLIDE 28

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
SLIDE 29

Mixture Gaussian

SLIDE 30

Mixture Gaussian

SLIDE 31

Graphical Model for Mixture Gaussian

  • Given a training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$
  • Model the data by specifying a joint distribution

$$p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)}) \, p(z^{(i)})$$

$$z^{(i)} \sim \mathrm{Multinomial}(\phi), \qquad p(z^{(i)} = j) = \phi_j, \qquad x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$$

  • z is the latent variable: the Gaussian cluster ID, indicating which Gaussian each x comes from
  • x are the observed data points; φ are the parameters of the latent variable distribution
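A small NumPy sketch of sampling from this generative model; the particular values of φ, μ, and Σ below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.3, 0.7])                           # p(z = j) = phi_j
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]   # mu_j (illustrative values)
Sigmas = [np.eye(2), 0.5 * np.eye(2)]                # Sigma_j (illustrative values)

m = 500
z = rng.choice(len(phi), size=m, p=phi)              # z^(i) ~ Multinomial(phi)
X = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])  # x^(i) | z^(i)=j ~ N(mu_j, Sigma_j)
```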

SLIDE 32

Data Likelihood

  • We want to maximize the data log-likelihood

$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \sum_{z^{(i)}=1}^{k} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi) = \sum_{i=1}^{m} \log \sum_{j=1}^{k} \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j) \, \phi_j$$

  • There is no closed-form solution obtained by simply setting

$$\frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \phi} = 0, \qquad \frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \mu} = 0, \qquad \frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \Sigma} = 0$$
SLIDE 33

Data Likelihood Maximization

  • For each data point $x^{(i)}$, the latent variable $z^{(i)}$ indicates which Gaussian it comes from
  • If we knew $z^{(i)}$, the data likelihood would be

$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \big[ p(x^{(i)} \mid z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi) \big] = \sum_{i=1}^{m} \Big[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \Big]$$
SLIDE 34

Data Likelihood Maximization

  • Given $z^{(i)}$, maximize the data likelihood

$$\max_{\phi, \mu, \Sigma} \; \ell(\phi, \mu, \Sigma) = \max_{\phi, \mu, \Sigma} \sum_{i=1}^{m} \Big[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \Big]$$

  • It is easy to get the solution:

$$\phi_j = \frac{1}{m} \sum_{i=1}^{m} 1\{z^{(i)} = j\}, \qquad \mu_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\} \, x^{(i)}}{\sum_{i=1}^{m} 1\{z^{(i)} = j\}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\} \, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} 1\{z^{(i)} = j\}}$$
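A minimal NumPy sketch of these closed-form estimates when the labels z are observed; the helper name `mle_given_z` is an illustrative choice, and it assumes every cluster label appears at least once:

```python
import numpy as np

def mle_given_z(X, z, k):
    """Closed-form MLE of (phi, mu, Sigma) when the cluster labels z are observed."""
    phi = np.array([(z == j).mean() for j in range(k)])            # fraction of points with z = j
    mu = np.array([X[z == j].mean(axis=0) for j in range(k)])      # per-cluster mean
    Sigma = []
    for j in range(k):
        diff = X[z == j] - mu[j]
        Sigma.append(diff.T @ diff / (z == j).sum())               # per-cluster covariance
    return phi, mu, np.array(Sigma)
```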

SLIDE 35

Latent Variable Inference

  • Given the parameters μ, Σ, φ, it is not hard to infer the posterior of the latent variable $z^{(i)}$ for each instance:

$$p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(z^{(i)} = j, x^{(i)}; \phi, \mu, \Sigma)}{p(x^{(i)}; \phi, \mu, \Sigma)} = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma) \, p(z^{(i)} = j; \phi)}{\sum_{l=1}^{k} p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma) \, p(z^{(i)} = l; \phi)}$$

where
  • the prior of $z^{(i)}$ is $p(z^{(i)} = j; \phi)$
  • the likelihood is $p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)$
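A minimal sketch of this posterior computation using SciPy's Gaussian density; the helper name `responsibilities` is an illustrative choice, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, phi, mu, Sigma):
    """Posterior p(z = j | x) for every point in X; each row sums to 1."""
    k = len(phi)
    # unnormalized posterior: likelihood N(x | mu_j, Sigma_j) times prior phi_j
    w = np.column_stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(k)])
    return w / w.sum(axis=1, keepdims=True)   # normalize over j
```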

SLIDE 36

Expectation Maximization Methods

  • E-step: infer the posterior distribution of the latent variables given the model parameters
  • M-step: tune parameters to maximize the data likelihood given the latent variable distribution
  • EM methods
    • Iteratively execute the E-step and M-step until convergence
SLIDE 37

EM Methods for Mixture Gaussians

  • Mixture Gaussian example

Repeat until convergence: {

  (E-step) For each i, j, set

$$w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$$

  (M-step) Update the parameters

$$\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} w_j^{(i)}}$$

}
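Putting the E-step and M-step together, a compact NumPy/SciPy sketch of EM for a Gaussian mixture; the initialization strategy and the small regularization term added to Σ are choices of this write-up, not from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """EM for a mixture of Gaussians (sketch). X is an (m, d) data matrix."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)                                 # uniform mixing weights to start
    mu = X[rng.choice(m, size=k, replace=False)]              # initialize means from random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: w[i, j] = p(z_i = j | x_i; phi, mu, Sigma)
        w = np.column_stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                             for j in range(k)])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters with the soft counts
        Nj = w.sum(axis=0)
        phi = Nj / m
        mu = (w.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return phi, mu, Sigma
```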

SLIDE 38

General EM Methods

  • Claims:
    • 1. After each E-M step, the data likelihood will not decrease.
    • 2. The EM algorithm finds a (local) maximum of a latent variable model's likelihood.
  • Now let's discuss the general EM method and verify that it improves the data likelihood and that it converges.
SLIDE 39

Jensen’s Inequality

  • Theorem. Let f be a convex function, and let X be a random variable. Then:

$$E[f(X)] \geq f(E[X])$$

  • Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).
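A quick numerical illustration of the inequality with the convex function f(x) = x² (the sampled random variable is made up for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(100_000)        # samples of a random variable X
f = np.square                           # f(x) = x^2 is convex
print(f(X).mean(), f(X.mean()))         # E[f(X)] >= f(E[X]) holds empirically (~1.0 vs ~0.0)
```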

SLIDE 40

Jensen’s Inequality

$$E[f(X)] \geq f(E[X])$$

Figure credit: Andrew Ng
SLIDE 41

Jensen’s Inequality

Figure credit: Maneesh Sahani

SLIDE 42

General EM Methods: Problem

  • Given the training dataset $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data
  • Assume latent variables $z \to x$
  • We wish to fit the parameters θ of a model p(x, z) to the data, where the log-likelihood is

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x; \theta) = \sum_{i=1}^{N} \log \sum_{z} p(x, z; \theta)$$
SLIDE 43

General EM Methods: Problems

  • EM methods solve problems where explicitly finding the maximum likelihood estimate (MLE)

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$$

is hard,

  • but where, given $z^{(i)}$ observed, the MLE is easy:

$$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x^{(i)} \mid z^{(i)}; \theta)$$

  • EM methods give an efficient solution for the MLE by iteratively doing
    • E-step: construct a (good) lower bound of the log-likelihood
    • M-step: optimize that lower bound
SLIDE 44

General EM Methods: Lower Bound

  • For each instance i, let $q_i$ be some distribution over $z^{(i)}$:

$$\sum_{z} q_i(z) = 1, \qquad q_i(z) \geq 0$$

  • Thus the data log-likelihood

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})} \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$

  • The last step follows from Jensen's inequality: log(x) is a concave function, so $\log E[X] \geq E[\log X]$
  • The right-hand side is a lower bound of ℓ(θ)
SLIDE 45

General EM Methods: Lower Bound

  • Then what $q_i(z)$ should we choose?

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
SLIDE 46

Jensen’s Inequality

  • Theorem. Let f be a convex function, and let X be a random variable. Then:

$$E[f(X)] \geq f(E[X])$$

  • Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).

REVIEW
SLIDE 47

General EM Methods: Lower Bound

  • Then what $q_i(z)$ should we choose?

$$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$

  • In order to make the above inequality tight (i.e. hold with equality), it is sufficient that

$$p(x^{(i)}, z^{(i)}; \theta) = q_i(z^{(i)}) \cdot c$$

for some constant c that does not depend on $z^{(i)}$

  • We can then derive

$$\log p(x^{(i)}; \theta) = \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \log \sum_{z^{(i)}} q_i(z^{(i)}) \, c = \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$

  • As such, $q_i(z)$ is the posterior distribution:

$$q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$
SLIDE 48

General EM Methods

Repeat until convergence: {

  (E-step) For each i, set

$$q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta)$$

  (M-step) Update the parameters

$$\theta = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$

}
SLIDE 49

Convergence of EM

  • Denote θ(t) and θ(t+1) as the parameters of two successive iterations of EM. We prove that

$$\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$$

which shows EM always monotonically improves the log-likelihood, and thus ensures EM will at least converge to a local optimum.
SLIDE 50

Proof of EM Convergence

  • Starting from θ(t), we choose the posterior of the latent variable

$$q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$

  • This choice ensures Jensen's inequality holds with equality:

$$\ell(\theta^{(t)}) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})}$$

  • The parameters θ(t+1) are then obtained by maximizing the right-hand side of the above equation
  • Thus

$$\ell(\theta^{(t+1)}) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{q_i^{(t)}(z^{(i)})} \quad \text{[lower bound]}$$

$$\geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} \quad \text{[parameter optimization]}$$

$$= \ell(\theta^{(t)})$$
SLIDE 51

Remark of EM Convergence

  • If we define

$$J(q, \theta) = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$

then we know $\ell(\theta) \geq J(q, \theta)$

  • EM can also be viewed as coordinate ascent on J
    • The E-step maximizes it w.r.t. q
    • The M-step maximizes it w.r.t. θ
SLIDE 52

Coordinate Ascent in EM

$$q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$

$$\theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i^{(t)}(z^{(i)})}$$

Figure credit: Maneesh Sahani