CS480/680 Machine Learning Lecture 12: February 13th, 2020 - PowerPoint PPT Presentation



SLIDE 1

CS480/680 Machine Learning Lecture 12: February 13th, 2020

Expectation-Maximization

Zahra Sheikhbahaee

University of Waterloo


SLIDE 2

Outline

  • K-means clustering
  • Gaussian mixture model
  • EM for Gaussian mixture model
  • EM algorithm


SLIDE 3

K-means clustering

  • The organization of unlabeled data into similarity groups, called clusters.
  • A cluster is a collection of data items which are similar to one another and dissimilar to data items in other clusters.


SLIDE 4

K-means clustering

  • K-means clustering has been used for image segmentation.
  • In image segmentation, one partitions an image into regions, each of which has a reasonably homogeneous visual appearance or which corresponds to objects or parts of objects.
  • In data compression, for an RGB image with $N$ pixels, each of the three colour values is stored with 8 bits of precision, so transmitting the original image costs $24N$ bits. Transmitting the identity of the nearest centroid for each pixel costs $N \log_2 K$ bits, and transmitting the $K$ centroid vectors costs a further $24K$ bits, so the compressed image costs $24K + N \log_2 K$ bits.
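As a quick sanity check of the arithmetic above, here is a minimal Python sketch comparing the two transmission costs; the image size and the number of clusters are made-up values, not figures from the lecture:

```python
import math

# Hypothetical image: 1024 x 768 RGB pixels, compressed with K = 16 clusters.
N = 1024 * 768          # number of pixels
K = 16                  # number of centroids

original_bits = 24 * N                          # 8 bits per R, G, B value
compressed_bits = 24 * K + N * math.log2(K)     # centroid vectors + per-pixel cluster ids

print(f"original:   {original_bits:,.0f} bits")
print(f"compressed: {compressed_bits:,.0f} bits")
print(f"ratio:      {original_bits / compressed_bits:.1f}x")
```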


SLIDE 5

K-means clustering

  • Let a data set be $x_1, \dots, x_N$, i.e. $N$ observations of a random $D$-dimensional Euclidean variable $x$.
  • The K-means algorithm partitions the given data into $K$ clusters ($K$ is known):
    – Each cluster has a cluster center, called a centroid ($\mu_k$, where $k = 1, \dots, K$).
    – The sum of the squares of the distances of each data point to its closest centroid $\mu_k$ is a minimum.
    – Each data point $x_n$ has a corresponding set of binary indicator variables $r_{nk}$ which represent whether data point $x_n$ belongs to cluster $k$ or not:

$$r_{nk} = \begin{cases} 1 & \text{if } x_n \text{ is assigned to cluster } k \\ 0 & \text{otherwise} \end{cases}$$


SLIDE 6

K-means clustering

  • Let a data set be $x_1, \dots, x_N$, i.e. $N$ observations of a random $D$-dimensional Euclidean variable $x$.
  • The K-means algorithm partitions the given data into $K$ clusters ($K$ is known):
    – Each cluster has a cluster center, called a centroid ($\mu_k$, where $k = 1, \dots, K$).
    – The sum of the squares of the distances of each data point to its closest centroid $\mu_k$ is a minimum.
    – Each data point $x_n$ has a corresponding set of binary indicator variables $r_{nk}$ which represent whether data point $x_n$ belongs to cluster $k$ or not. The objective to be minimized is

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \| x_n - \mu_k \|^2$$
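A minimal numpy sketch of this objective; the array shapes are assumptions of this write-up (X is N x D, mu is K x D, and R is an N x K one-hot assignment matrix):

```python
import numpy as np

def kmeans_objective(X, mu, R):
    """Distortion J = sum_n sum_k r_nk * ||x_n - mu_k||^2."""
    # sq_dists[n, k] = ||x_n - mu_k||^2
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    return float((R * sq_dists).sum())
```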


SLIDE 7

K-means clustering


Algorithm:

Initialize $\mu_1, \dots, \mu_K$
Iterations:
  • We minimize $J$ with respect to the $r_{nk}$, keeping the $\mu_k$ fixed, by assigning each data point to the closest centroid.
  • We minimize $J$ with respect to the $\mu_k$, keeping the $r_{nk}$ fixed, by recomputing the centroids using the current cluster memberships.
Repeat until convergence.

SLIDE 8

K-means clustering


Algorithm:

Initialize $\mu_1, \dots, \mu_K$
Iterations:
  • We minimize $J$ with respect to the $r_{nk}$, keeping the $\mu_k$ fixed, by assigning each data point to the closest centroid:

$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}$$

  • We minimize $J$ with respect to the $\mu_k$, keeping the $r_{nk}$ fixed, by recomputing the centroids using the current cluster memberships:

$$\frac{\partial J}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk}\,(x_n - \mu_k) = 0 \;\Longrightarrow\; \mu_k = \frac{\sum_n r_{nk}\, x_n}{\sum_n r_{nk}}$$

Repeat until convergence.
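The two alternating steps translate almost line for line into numpy. The following is a minimal sketch, not the lecture's reference implementation; the initialization strategy (sampling K data points) and the convergence test on J are assumptions:

```python
import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-6, seed=0):
    """Plain K-means on an (N, D) data matrix X with K clusters."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)  # initial centroids

    prev_J = np.inf
    for _ in range(max_iters):
        # Step 1: assign each point to the closest centroid (updates r_nk).
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        labels = sq_dists.argmin(axis=1)

        # Step 2: recompute each centroid as the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):
                mu[k] = X[labels == k].mean(axis=0)

        # Convergence check on the distortion J.
        J = sq_dists[np.arange(len(X)), labels].sum()
        if abs(prev_J - J) < tol:
            break
        prev_J = J
    return mu, labels
```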

SLIDE 9

K-means clustering (Cons)


  • The K-means algorithm may converge to a local rather than a global minimum of $J$.
  • The K-means algorithm is based on the use of squared Euclidean distance as the measure of dissimilarity between a data point and a prototype vector.
  • In the K-means algorithm, every data point is assigned uniquely to one, and only one, of the clusters (hard assignment to the nearest cluster).

SLIDE 10

Gaussian Mixture Model

  • Let a data set be $X = \{x_1, \dots, x_N\}$ of $N$ observations. A linear superposition of Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k: \text{mixing coefficient}$$

  • Let $z$ be a binary random variable which has $K$ dimensions and satisfies the following condition:

$$\sum_{k=1}^{K} z_k = 1, \quad \text{where } z_k \in \{0, 1\}$$
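To make the generative view concrete, here is a minimal sketch of ancestral sampling from a mixture: draw the component indicator z from the mixing coefficients, then draw x from the selected Gaussian. The 1-D, two-component parameters below are made-up values, not parameters from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D mixture: two components with mixing coefficients pi.
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

def sample_gmm(n):
    z = rng.choice(len(pi), size=n, p=pi)        # component indicator per sample
    x = rng.normal(loc=mu[z], scale=sigma[z])    # draw from the selected Gaussian
    return x, z

samples, components = sample_gmm(1000)
print(samples[:5], components[:5])
```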


SLIDE 11

Gaussian Mixture Model

  • The joint distribution of the observed $x$ and the hidden variable $z$ is $p(x, z) = p(x \mid z)\, p(z)$.

The marginal distribution over $z$ is $p(z_k = 1) = \pi_k$, where $0 \le \pi_k \le 1$, so

$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k} = \mathrm{Cat}(z \mid \pi)$$


SLIDE 12

Gaussian Mixture Model

  • The joint distribution of the observed $x$ and the hidden variable $z$ is $p(x, z) = p(x \mid z)\, p(z)$.

The marginal distribution over $z$:

$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k} = \mathrm{Cat}(z \mid \pi)$$

The conditional distribution of $x$ given a particular value for $z$:

$$p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$


SLIDE 13

Gaussian Mixture Model

  • The joint distribution of the observed $x$ and the hidden variable $z$ is $p(x, z) = p(x \mid z)\, p(z)$.

The marginal distribution over $z$:

$$p(z) = \prod_{k=1}^{K} \pi_k^{z_k} = \mathrm{Cat}(z \mid \pi)$$

The conditional distribution of $x$ given $z$:

$$p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}$$


SLIDE 14

Gaussian Mixture Model

  • The joint distribution of the observed $x$ and the hidden variable $z$ is $p(x, z) = p(x \mid z)\, p(z)$.

The marginal distribution over $x$:

$$p(x) = \sum_{z} p(x \mid z)\, p(z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$


SLIDE 15

Gaussian Mixture Model

  • The posterior distribution of $z$:

$$p(z_k = 1 \mid x) = \frac{p(x \mid z_k = 1)\, p(z_k = 1)}{\sum_{j=1}^{K} p(x \mid z_j = 1)\, p(z_j = 1)} = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$

Let us assume we have an i.i.d. data set. The log-likelihood function is

$$\ln p(X \mid \pi, \mu, \Sigma) = \ln \prod_{n=1}^{N} \sum_{z_n} p(x_n \mid z_n)\, p(z_n) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$


SLIDE 16

Gaussian Mixture Model

The log-likelihood function is

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

  • Problems:
  • Singularities: arbitrarily large likelihood when a Gaussian explains a single point (whenever one of the Gaussian components collapses onto a specific data point).
  • Identifiability: the solution is invariant to permutations. There are a total of $K!$ equivalent solutions, because of the $K!$ ways of assigning $K$ sets of parameters to $K$ components.
  • Non-convex.


SLIDE 17

Expectation-Maximization For GMM

  • Assume that we do not have access to the complete data set $\{X, Z\}$. The actually observed $X$ is then considered incomplete data, so we cannot use the complete-data log-likelihood

$$\mathcal{L}(\theta \mid X, Z) = \ln p(X, Z \mid \theta)$$

Instead we consider the expected value of the complete-data log-likelihood under the posterior distribution of the latent variables,

$$\mathbb{E}_{p(Z \mid X, \theta^{old})}\!\left[\mathcal{L}(\theta \mid X, Z)\right] = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta),$$

which is the Expectation step of the EM algorithm. In the Maximization step, we maximize this expectation.


SLIDE 18

Expectation-Maximization For GMM


Algorithm:

Initialize $\theta^{0}$
Iterations:
E step: Evaluate the posterior distribution of the latent variables $Z$ and compute

$$\mathcal{Q}(\theta, \theta^{old}) = \sum_{Z} p(Z \mid X, \theta^{old}) \ln p(X, Z \mid \theta)$$

M step: Evaluate $\theta^{new} = \arg\max_{\theta} \mathcal{Q}(\theta, \theta^{old})$

Check for the convergence of either the log-likelihood or the parameter values; otherwise set $\theta^{old} \leftarrow \theta^{new}$.

SLIDE 19

Expectation-Maximization For GMM

  • The likelihood function of the complete data:

$$p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$$

The log-likelihood:

$$\mathcal{L}(\theta \mid X, Z) = \ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk}\, \{\ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\}$$

The summation over $k$ and the logarithm have been interchanged.


SLIDE 20

Expectation-Maximization For GMM

  • In the E-step, we first compute the posterior distribution of $z$ given the data:

$$p(z_k = 1 \mid x_n, \theta) = \frac{p(x_n \mid z_k = 1, \theta)\, p(z_k = 1)}{\sum_{j=1}^{K} p(x_n \mid z_j = 1, \theta)\, p(z_j = 1)} = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} = \gamma(z_{nk})$$

$\gamma(z_{nk})$: the responsibility that component $k$ takes for "explaining" the observation $x_n$.
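A minimal numpy sketch of this E-step, computed in log space for numerical stability; the parameter shapes follow the earlier log-likelihood sketch and are assumptions of this write-up:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """Return gamma with gamma[n, k] = p(z_k = 1 | x_n, theta)."""
    N, K = len(X), len(pi)
    log_resp = np.empty((N, K))
    for k in range(K):
        log_resp[:, k] = np.log(pi[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
    # Normalize over components: subtract the log of the denominator.
    log_resp -= logsumexp(log_resp, axis=1, keepdims=True)
    return np.exp(log_resp)
```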

SLIDE 21

Expectation-Maximization For GMM

  • The likelihood function of the complete data:

$$p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$$

The log-likelihood:

$$\mathcal{L}(\theta \mid X, Z) = \ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k - \frac{1}{2} \left[ D \ln 2\pi + \ln |\Sigma_k| + (x_n - \mu_k)^{\top} \Sigma_k^{-1} (x_n - \mu_k) \right] \right\}$$

$$\frac{\partial \mathcal{L}}{\partial \mu_k} = \sum_{n=1}^{N} z_{nk}\, \Sigma_k^{-1} (x_n - \mu_k) = 0 \;\Longrightarrow\; \mu_k = \frac{\sum_n z_{nk}\, x_n}{\sum_n z_{nk}}$$
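In the EM iterations the indicator $z_{nk}$ is replaced by its expected value, the responsibility $\gamma(z_{nk})$ from the E-step, so the mean update becomes a weighted average. A minimal numpy sketch, where gamma is the (N, K) responsibility matrix from the e_step sketch above:

```python
import numpy as np

def update_means(X, gamma):
    """mu_k = sum_n gamma_nk * x_n / sum_n gamma_nk, for every component k."""
    Nk = gamma.sum(axis=0)                 # effective number of points per component, (K,)
    return (gamma.T @ X) / Nk[:, None]     # (K, D)
```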


SLIDE 22

Expectation-Maximization For GMM

  • The likelihood function of the complete data:

$$p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$$

The log-likelihood:

$$\mathcal{L}(\theta \mid X, Z) = \ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k - \frac{1}{2} \left[ D \ln 2\pi + \ln |\Sigma_k| + (x_n - \mu_k)^{\top} \Sigma_k^{-1} (x_n - \mu_k) \right] \right\}$$

$$\frac{\partial \mathcal{L}}{\partial \Sigma_k} = -\frac{1}{2} \sum_{n=1}^{N} z_{nk} \left\{ \Sigma_k^{-1} - \Sigma_k^{-1} (x_n - \mu_k)(x_n - \mu_k)^{\top} \Sigma_k^{-1} \right\} = 0 \;\Longrightarrow\; \Sigma_k = \frac{\sum_n z_{nk}\, (x_n - \mu_k)(x_n - \mu_k)^{\top}}{\sum_n z_{nk}}$$
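A matching sketch of the covariance update, again with responsibilities in place of $z_{nk}$; mus are the freshly updated means and the array names follow the earlier sketches:

```python
import numpy as np

def update_covariances(X, gamma, mus):
    """Sigma_k = sum_n gamma_nk (x_n - mu_k)(x_n - mu_k)^T / sum_n gamma_nk."""
    N, D = X.shape
    K = gamma.shape[1]
    Nk = gamma.sum(axis=0)
    Sigmas = np.empty((K, D, D))
    for k in range(K):
        diff = X - mus[k]                                      # (N, D)
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return Sigmas
```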


SLIDE 23

Expectation-Maximization For GMM

  • The likelihood function of the complete data:

$$p(X, Z \mid \mu, \Sigma, \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_{nk}}$$

For computing $\pi_k$, we add a constraint to the log-likelihood using a Lagrange multiplier:

$$\mathcal{L}(\theta \mid X, Z) = \ln p(X, Z \mid \mu, \Sigma, \pi) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$$

$$\frac{\partial \mathcal{L}}{\partial \pi_k} = 0 \;\Longrightarrow\; \pi_k = \frac{\sum_n z_{nk}}{N}$$
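With responsibilities substituted for $z_{nk}$, the mixing-coefficient update is simply a column average of the responsibility matrix; a one-line sketch consistent with the helpers above:

```python
def update_mixing_coefficients(gamma):
    """pi_k = (sum_n gamma_nk) / N, where gamma is the (N, K) responsibility matrix."""
    return gamma.mean(axis=0)
```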


SLIDE 24

Expectation-Maximization For GMM


Algorithm:

Initialize $\theta^{0}$
Iterations:
E step: Evaluate $\gamma(z_{nk}) = \mathbb{E}[z_{nk}]$ and compute

$$\mathbb{E}_{Z}\!\left[\ln p(X, Z \mid \theta)\right] = \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \left\{ \ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$$

M step: Evaluate $\theta^{new}$, where $\theta = \{\mu_k, \Sigma_k, \pi_k\}$

Check for the convergence of either the log-likelihood or the parameter values; otherwise set $\theta^{old} \leftarrow \theta^{new}$.
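Putting the pieces together, the following is a minimal end-to-end sketch of EM for a GMM that inlines the E-step and the closed-form M-step updates from the previous slides. It is an illustration under stated assumptions, not the lecture's reference code; the initialization scheme, the small diagonal jitter added to the covariances, and the convergence test on the log-likelihood are all choices made here:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def fit_gmm(X, K, max_iters=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture model on an (N, D) data matrix X."""
    rng = np.random.default_rng(seed)
    N, D = X.shape

    # Initialize theta^0: random data points as means, shared covariance, uniform weights.
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    pi = np.full(K, 1.0 / K)

    prev_ll = -np.inf
    for _ in range(max_iters):
        # E step: responsibilities gamma[n, k] = p(z_k = 1 | x_n, theta_old).
        log_terms = np.column_stack([
            np.log(pi[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
            for k in range(K)
        ])
        log_norm = logsumexp(log_terms, axis=1, keepdims=True)
        gamma = np.exp(log_terms - log_norm)

        # M step: closed-form updates of mu_k, Sigma_k, pi_k.
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N

        # Convergence check on the log-likelihood ln p(X | theta).
        ll = log_norm.sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mus, Sigmas, gamma
```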

SLIDE 25

Expectation-Maximization Method

  • Our goal is to maximize the likelihood function

$$p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)$$

  • Typically the direct optimization of $p(X \mid \theta)$ is difficult, but optimizing $p(X, Z \mid \theta)$ is much easier.
  • Let $q(Z)$ be a distribution over the latent variables. Then

$$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}\!\left(q(Z)\,\|\,p(Z \mid X, \theta)\right)$$

$$\mathcal{L}(q, \theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \theta)}{q(Z)}$$


SLIDE 26

Expectation-Maximization Method

  • Since 𝐸&,(π‘Ÿ||π‘ž) β‰₯ 0 therefore we have ln π‘ž π‘Œ πœ„ β‰₯ β„’ π‘Ÿ, πœ„ .

So β„’ π‘Ÿ, πœ„ is a lower bound for ln 𝑄 π‘Œ πœ„ .


SLIDE 27

Expectation-Maximization Method

  • Since 𝐸&,(π‘Ÿ||π‘ž) β‰₯ 0 therefore we have

ln π‘ž π‘Œ πœ„ β‰₯ β„’ π‘Ÿ, πœ„ . So β„’ π‘Ÿ, πœ„ is a lower bound for ln 𝑄 π‘Œ πœ„ . EM algorithm : Initialize πœ„+ Iterations: E step: The lower bound β„’(π‘Ÿ, πœ„-./) is maximized w.r.t. π‘Ÿ(π‘Ž) while holding πœ„-./ fixed. It achieves when π‘Ÿ π‘Ž β‰ˆ π‘ž(π‘Ž|π‘Œ, πœ„)


SLIDE 28

Expectation-Maximization Method

  • Since 𝐸&,(π‘Ÿ||π‘ž) β‰₯ 0 therefore we have ln π‘ž π‘Œ πœ„ β‰₯ β„’ π‘Ÿ, πœ„ .

So β„’ π‘Ÿ, πœ„ is a lower bound for ln 𝑄 π‘Œ πœ„ .

EM algorithm :

Initialize πœ„+ Iterations:

E step: The lower bound β„’(π‘Ÿ, πœ„BCD) is maximized w.r.t. π‘Ÿ(π‘Ž) while holding πœ„BCD fixed. It achieves when π‘Ÿ π‘Ž β‰ˆ π‘ž(π‘Ž|π‘Œ, πœ„) M step: The distribution π‘Ÿ(π‘Ž) is held fixed and β„’(π‘Ÿ, πœ„) is maximized w.r.t. πœ„. Evaluate πœ„FGH where πœ„ = {𝜈J, Ξ£J, 𝜌J} Check for the convergence of either the log-likelihood or the parameter values,

  • therwise

πœ„BCD ← πœ„FGH


After the M step the lower bound increases, but now $q(Z) \neq p(Z \mid X, \theta^{new})$. Since the KL divergence is nonnegative, the log-likelihood $\ln p(X \mid \theta)$ increases by at least as much as the lower bound does.

SLIDE 29

Expectation-Maximization Method

  • From the previous lecture: the free energy $\mathcal{F}(q(z), \theta)$ is a lower bound on $\mathcal{L}(\theta)$.

EM algorithm:

Initialize $\theta^{0}$
Iterations:
E step: Infer the posterior distributions over the hidden variables given the current parameter setting:

$$q_j^{(t+1)} \leftarrow \arg\max_{q_j} \mathcal{F}\!\left(q_j(z), \theta^{(t)}\right), \quad \forall j \in \{1, \dots, n\} \qquad \Longrightarrow \qquad q_j^{(t+1)} = p(z_j \mid x_j, \theta^{(t)})$$

M step: Maximize $\mathcal{L}(\theta)$ with respect to $\theta$:

$$\theta^{(t+1)} \leftarrow \arg\max_{\theta} \mathcal{F}\!\left(q^{(t+1)}(z), \theta\right)$$

$$\theta^{(t+1)} \leftarrow \arg\max_{\theta} \sum_{j} \int \! dz_j\; p(z_j \mid x_j, \theta^{(t)}) \ln p(z_j, x_j \mid \theta)$$


SLIDE 30

Pros and Cons of EM Algorithm

  • Pros
  • It is guaranteed that $p(X \mid \theta^{new}) \ge p(X \mid \theta^{old})$.
  • The algorithm works well in practice.
  • Cons
  • Not guaranteed to find the maximum-likelihood solution, because it might get stuck in a local optimum.
  • MLE may overfit, but we can instead compute the MAP estimate.
  • Convergence can be slow.
  • Specialized for exponential-family distributions.
