Expectation maximization
Subhransu Maji
CMPSCI 689: Machine Learning
14 April 2015
Suppose you are building a naive Bayes spam classifier. After you are done, your boss tells you that there is no money to label the data. You don't have any labels. Can you still do something?
Expectation maximization (EM) lets us estimate the hidden labels simultaneously along with the parameters of the model.
We will apply this idea to Gaussian mixture models and naive Bayes classification and learn why EM works.
Suppose data comes from a Gaussian Mixture Model (GMM): you have K clusters, and the data from cluster k is drawn from a Gaussian with mean $\mu_k$ and variance $\sigma^2_k$. We will assume that the data comes with labels (we will soon remove this assumption). Generative story of the data:
➡ Choose a label $y_n \sim \text{Mult}(\theta_1, \theta_2, \ldots, \theta_K)$
➡ Choose an example $x_n \sim \mathcal{N}(\mu_{y_n}, \sigma^2_{y_n})$
Likelihood of the data:
$$p(D) = \prod_{n=1}^{N} p(y_n)\, p(x_n \mid y_n) = \prod_{n=1}^{N} \theta_{y_n}\, \mathcal{N}(x_n;\, \mu_{y_n}, \sigma^2_{y_n}) = \prod_{n=1}^{N} \theta_{y_n} \left(2\pi\sigma^2_{y_n}\right)^{-D/2} \exp\!\left(\frac{-\|x_n - \mu_{y_n}\|^2}{2\sigma^2_{y_n}}\right)$$
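For illustration, here is a minimal NumPy sketch of this generative story (not from the slides; the parameter values, K = 3 clusters, and 1-D data are assumptions):

import numpy as np

# Hypothetical GMM parameters for K = 3 clusters (illustrative values).
theta = np.array([0.5, 0.3, 0.2])     # mixing weights theta_k, sum to 1
mu = np.array([-2.0, 0.0, 3.0])       # cluster means mu_k
sigma2 = np.array([0.5, 1.0, 0.25])   # cluster variances sigma^2_k

rng = np.random.default_rng(0)
N = 1000

# Generative story: choose a label, then choose an example from that cluster.
y = rng.choice(len(theta), size=N, p=theta)   # y_n ~ Mult(theta_1, ..., theta_K)
x = rng.normal(mu[y], np.sqrt(sigma2[y]))     # x_n ~ N(mu_{y_n}, sigma^2_{y_n})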
Likelihood of the data:
$$p(D) = \prod_{n=1}^{N} \theta_{y_n} \left(2\pi\sigma^2_{y_n}\right)^{-D/2} \exp\!\left(\frac{-\|x_n - \mu_{y_n}\|^2}{2\sigma^2_{y_n}}\right)$$
Given the labels, maximum likelihood estimation of the parameters is easy:
$$\theta_k = \frac{1}{N}\sum_n [y_n = k] \qquad \text{// fraction of examples with label } k$$
$$\mu_k = \frac{\sum_n [y_n = k]\, x_n}{\sum_n [y_n = k]} \qquad \text{// mean of all the examples with label } k$$
$$\sigma^2_k = \frac{\sum_n [y_n = k]\, \|x_n - \mu_k\|^2}{\sum_n [y_n = k]} \qquad \text{// variance of all the examples with label } k$$
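A minimal sketch of these labeled-data estimates in NumPy (assuming 1-D data, integer cluster labels 0, ..., K-1, and arrays like those produced by the sampling sketch above):

import numpy as np

def mle_gmm_params(x, y, K):
    """Maximum likelihood estimates of (theta, mu, sigma2) from labeled 1-D data."""
    N = len(x)
    theta, mu, sigma2 = np.zeros(K), np.zeros(K), np.zeros(K)
    for k in range(K):
        mask = (y == k)                              # indicator [y_n = k]
        theta[k] = mask.sum() / N                    # fraction of examples with label k
        mu[k] = x[mask].mean()                       # mean of the examples with label k
        sigma2[k] = ((x[mask] - mu[k]) ** 2).mean()  # variance of the examples with label k
    return theta, mu, sigma2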
Now suppose you didn’t have labels yn. Analogous to k-means, one solution is to iterate. Start by guessing the parameters and then repeat the two steps:
➡ Step 1: estimate the cluster assignments of the points given the current parameters
➡ Step 2: estimate the parameters given the current assignments
K-means uses hard assignment (point 10 goes to cluster 2). In expectation maximization (EM) we will use soft assignment (point 10 goes half to cluster 2 and half to cluster 5).
Here $z_n$ is the (soft) assignment vector for the nth point.
Formally, $z_{n,k}$ is the probability that the nth point goes to cluster k:
$$z_{n,k} = p(y_n = k \mid x_n) = \frac{p(y_n = k,\, x_n)}{p(x_n)} \propto p(y_n = k)\, p(x_n \mid y_n = k) = \theta_k\, \mathcal{N}(x_n;\, \mu_k, \sigma^2_k)$$
Given $z_{n,k}$, we can update the parameters $(\theta_k, \mu_k, \sigma^2_k)$ as:
$$\theta_k = \frac{1}{N}\sum_n z_{n,k} \qquad \text{// fraction of examples with label } k$$
$$\mu_k = \frac{\sum_n z_{n,k}\, x_n}{\sum_n z_{n,k}} \qquad \text{// mean of all the fractional examples with label } k$$
$$\sigma^2_k = \frac{\sum_n z_{n,k}\, \|x_n - \mu_k\|^2}{\sum_n z_{n,k}} \qquad \text{// variance of all the fractional examples with label } k$$
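Putting the two steps together, here is a minimal sketch of one EM iteration for the 1-D GMM (not from the slides; it assumes the same array shapes as the earlier sketches):

import numpy as np

def em_step(x, theta, mu, sigma2):
    """One EM iteration for a 1-D Gaussian mixture model."""
    N, K = len(x), len(theta)

    # E step: z[n, k] proportional to theta_k * N(x_n; mu_k, sigma^2_k), then normalize.
    z = np.zeros((N, K))
    for k in range(K):
        gauss = np.exp(-(x - mu[k]) ** 2 / (2 * sigma2[k])) / np.sqrt(2 * np.pi * sigma2[k])
        z[:, k] = theta[k] * gauss
    z /= z.sum(axis=1, keepdims=True)

    # M step: re-estimate the parameters from the fractional assignments.
    Nk = z.sum(axis=0)                                      # effective count per cluster
    theta = Nk / N                                          # fraction of examples in cluster k
    mu = (z * x[:, None]).sum(axis=0) / Nk                  # weighted mean
    sigma2 = (z * (x[:, None] - mu) ** 2).sum(axis=0) / Nk  # weighted variance
    return theta, mu, sigma2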
We have replaced the indicator variable [y_n = k] with p(y_n = k), which is the expectation of [y_n = k]. This is our guess of the labels. Just like k-means, EM is susceptible to local optima. Clustering example:
http://nbviewer.ipython.org/github/NICTA/MLSS/tree/master/clustering/
[Figure: clustering the same data with k-means vs. a GMM, from the notebook above]
We have data with observations $x_n$ and hidden variables $y_n$, and would like to estimate parameters $\theta$.
The likelihood of the data and hidden variables:
$$p(D) = \prod_n p(x_n, y_n \mid \theta)$$
The likelihood of the observed data, marginalizing out the $y_n$:
$$p(X \mid \theta) = \prod_n \sum_{y_n} p(x_n, y_n \mid \theta)$$
The maximum likelihood estimate is hard to compute since the sum is inside the log:
$$\theta_{ML} \leftarrow \arg\max_\theta \sum_n \log\!\left(\sum_{y_n} p(x_n, y_n \mid \theta)\right)$$
Given a concave function $f$ and a set of weights $\lambda_i \geq 0$ with $\sum_i \lambda_i = 1$, Jensen's inequality states that
$$f\!\left(\sum_i \lambda_i x_i\right) \geq \sum_i \lambda_i f(x_i)$$
This is a direct consequence of concavity.
[Figure: concavity of f. For weights a, b ≥ 0 with a + b = 1, the chord value a f(x) + b f(y) lies below f(ax + by)]
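As a concrete check (not on the slides), take the concave function $f = \log$ with equal weights $\lambda_1 = \lambda_2 = 1/2$ and points $x_1 = 1$, $x_2 = 100$:
$$\log\!\left(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 100\right) = \log(50.5) \approx 3.92 \;\geq\; \tfrac{1}{2}\log(1) + \tfrac{1}{2}\log(100) \approx 2.30$$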
Construct a lower bound on the log-likelihood using Jensen's inequality:
$$L(X \mid \theta) = \sum_n \log\!\left(\sum_{y_n} p(x_n, y_n \mid \theta)\right) = \sum_n \log\!\left(\sum_{y_n} q(y_n)\, \frac{p(x_n, y_n \mid \theta)}{q(y_n)}\right)$$
$$\geq \sum_n \sum_{y_n} q(y_n) \log\!\left(\frac{p(x_n, y_n \mid \theta)}{q(y_n)}\right) \qquad \text{// Jensen's inequality}$$
$$= \sum_n \sum_{y_n} \left[ q(y_n) \log p(x_n, y_n \mid \theta) - q(y_n) \log q(y_n) \right] \triangleq \hat{L}(X \mid \theta)$$
The term $-q(y_n) \log q(y_n)$ is independent of $\theta$, so maximizing the lower bound over $\theta$ amounts to:
$$\theta \leftarrow \arg\max_\theta \sum_n \sum_{y_n} q(y_n) \log p(x_n, y_n \mid \theta)$$
Maximizing the lower bound increases the value of the original function if the lower bound touches the function at the current value
[Figure: the lower bound $\hat{L}(X \mid \theta)$ touches $L(X \mid \theta)$ at $\theta_t$; maximizing the bound gives $\theta_{t+1}$ with a higher $L(X \mid \theta_{t+1})$]
Any choice of the probability distribution $q(y_n)$ is valid as long as the lower bound touches the function at the current estimate of $\theta$:
$$L(X \mid \theta_t) = \hat{L}(X \mid \theta_t)$$
A choice that achieves this is the posterior over the hidden variables given the data and the current estimate of the parameters:
$$q(y_n) \leftarrow p(y_n \mid x_n, \theta_t)$$
This choice also maximizes the lower bound over $q$ for fixed $\theta$:
$$\arg\max_q \sum_{y_n} \left[ q(y_n) \log p(x_n, y_n \mid \theta) - q(y_n) \log q(y_n) \right]$$
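A quick check (not on the slides) that this choice makes the bound tight: substituting $q(y_n) = p(y_n \mid x_n, \theta_t)$ into one term of $\hat{L}$ gives
$$\sum_{y_n} p(y_n \mid x_n, \theta_t) \log\frac{p(x_n, y_n \mid \theta_t)}{p(y_n \mid x_n, \theta_t)} = \sum_{y_n} p(y_n \mid x_n, \theta_t) \log p(x_n \mid \theta_t) = \log p(x_n \mid \theta_t),$$
so summing over $n$ recovers $L(X \mid \theta_t)$ exactly.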
We have data with observations $x_n$ and hidden variables $y_n$, and would like to estimate parameters $\theta$ of the distribution $p(x \mid \theta)$. The likelihood $p(x \mid \theta)$ cannot be easily optimized over $\theta$ directly.
EM algorithm:
➡ E step: Compute the probability distribution over the hidden variables: $q(y_n) \leftarrow p(y_n \mid x_n, \theta)$
➡ M step: Estimate the parameters given the memberships: $\theta \leftarrow \arg\max_\theta \sum_n \sum_{y_n} q(y_n) \log p(x_n, y_n \mid \theta)$
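A generic sketch of this loop (illustrative; e_step, m_step, and log_likelihood stand in for the model-specific computations, such as the GMM updates above):

def em(x, theta, e_step, m_step, log_likelihood, tol=1e-6, max_iter=100):
    """Generic EM loop: alternate E and M steps until the log-likelihood stops improving."""
    prev = -float("inf")
    for _ in range(max_iter):
        q = e_step(x, theta)            # E step: q(y_n) <- p(y_n | x_n, theta)
        theta = m_step(x, q)            # M step: maximize sum_n sum_y q(y_n) log p(x_n, y_n | theta)
        cur = log_likelihood(x, theta)  # guaranteed not to decrease across iterations
        if cur - prev < tol:
            break
        prev = cur
    return theta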
Consider the binary prediction problem. Let the data be distributed according to a probability distribution:
$$p_\theta(y, \mathbf{x}) = p_\theta(y, x_1, x_2, \ldots, x_D)$$
By the chain rule:
$$p_\theta(y, \mathbf{x}) = p_\theta(y)\, p_\theta(x_1 \mid y)\, p_\theta(x_2 \mid x_1, y) \cdots p_\theta(x_D \mid x_1, x_2, \ldots, x_{D-1}, y) = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid x_1, x_2, \ldots, x_{d-1}, y)$$
The naive Bayes assumption is that the features are conditionally independent given the label:
$$p_\theta(x_d \mid x_{d'}, y) = p_\theta(x_d \mid y), \quad \forall\, d' \neq d$$
Case: binary labels and binary features
$$p_\theta(y) = \text{Bernoulli}(\theta_0)$$
$$p_\theta(x_d \mid y = +1) = \text{Bernoulli}(\theta^+_d) \qquad \text{// label +1}$$
$$p_\theta(x_d \mid y = -1) = \text{Bernoulli}(\theta^-_d) \qquad \text{// label -1}$$
This gives 1 + 2D parameters in total.
$$p_\theta(y, \mathbf{x}) = p_\theta(y) \prod_{d=1}^{D} p_\theta(x_d \mid y) = \theta_0^{[y=+1]} (1 - \theta_0)^{[y=-1]} \times \prod_{d=1}^{D} \left(\theta^+_d\right)^{[x_d=1,\, y=+1]} \left(1 - \theta^+_d\right)^{[x_d=0,\, y=+1]} \times \prod_{d=1}^{D} \left(\theta^-_d\right)^{[x_d=1,\, y=-1]} \left(1 - \theta^-_d\right)^{[x_d=0,\, y=-1]}$$
Given data, we can estimate the parameters by maximizing the data likelihood. The maximum likelihood estimates are:
$$\hat{\theta}_0 = \frac{\sum_n [y_n = +1]}{N} \qquad \text{// fraction of the data with label +1}$$
$$\hat{\theta}^+_d = \frac{\sum_n [x_{d,n} = 1,\, y_n = +1]}{\sum_n [y_n = +1]} \qquad \text{// fraction of the instances with } x_d = 1 \text{ among label +1}$$
$$\hat{\theta}^-_d = \frac{\sum_n [x_{d,n} = 1,\, y_n = -1]}{\sum_n [y_n = -1]} \qquad \text{// fraction of the instances with } x_d = 1 \text{ among label -1}$$
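A minimal NumPy sketch of these counts (illustrative; it assumes X is an N×D binary array and y is an array of labels in {+1, -1}):

import numpy as np

def nb_mle(X, y):
    """Maximum likelihood estimates for the binary naive Bayes model."""
    N, D = X.shape
    pos, neg = (y == +1), (y == -1)
    theta0 = pos.sum() / N                      # fraction of data with label +1
    theta_pos = X[pos].sum(axis=0) / pos.sum()  # fraction of x_d = 1 among label +1
    theta_neg = X[neg].sum(axis=0) / neg.sum()  # fraction of x_d = 1 among label -1
    return theta0, theta_pos, theta_neg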
Now suppose you don't have labels $y_n$. Initialize the parameters $\theta$ randomly.
E step: compute the distribution over the hidden variables $q(y_n)$:
$$q(y_n = 1) = p(y_n = +1 \mid x_n, \theta) \propto \theta_0 \prod_{d=1}^{D} \left(\theta^+_d\right)^{[x_{d,n}=1]} \left(1 - \theta^+_d\right)^{[x_{d,n}=0]}$$
M step: update the parameters using the soft labels:
$$\theta_0 = \frac{\sum_n q(y_n = 1)}{N} \qquad \text{// fraction of the data with label +1}$$
$$\theta^+_d = \frac{\sum_n [x_{d,n} = 1]\, q(y_n = 1)}{\sum_n q(y_n = 1)} \qquad \text{// fraction of the instances with } x_d = 1 \text{ among +1}$$
$$\theta^-_d = \frac{\sum_n [x_{d,n} = 1]\, q(y_n = -1)}{\sum_n q(y_n = -1)} \qquad \text{// fraction of the instances with } x_d = 1 \text{ among -1}$$
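A minimal sketch of these two steps in NumPy (illustrative; it assumes the same X layout as above, and a practical version would add smoothing and a convergence check):

import numpy as np

def nb_em_step(X, theta0, theta_pos, theta_neg):
    """One EM iteration for naive Bayes with binary features and no labels."""
    # E step: q(y_n = +1) from the current parameters (Bayes rule, then normalize).
    like_pos = theta0 * np.prod(np.where(X == 1, theta_pos, 1 - theta_pos), axis=1)
    like_neg = (1 - theta0) * np.prod(np.where(X == 1, theta_neg, 1 - theta_neg), axis=1)
    q_pos = like_pos / (like_pos + like_neg)
    q_neg = 1.0 - q_pos

    # M step: re-estimate the parameters from the soft labels.
    theta0 = q_pos.mean()                                       # fraction of data with label +1
    theta_pos = (X * q_pos[:, None]).sum(axis=0) / q_pos.sum()  # fraction of x_d = 1 among +1
    theta_neg = (X * q_neg[:, None]).sum(axis=0) / q_neg.sum()  # fraction of x_d = 1 among -1
    return theta0, theta_pos, theta_neg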
Expectation maximization:
➡ A technique for estimating model parameters when some observations are hidden
➡ Maximizes a lower bound on the log-likelihood; we used Jensen's inequality to switch the log-sum to sum-log
➡ EM can be used for learning: Gaussian mixture models and naive Bayes classifiers without labels
Some of the slides are based on the CIML book by Hal Daumé III.
The figure for the EM lower bound is based on https://cxwangyi.wordpress.com/2008/11/
The clustering example (k-means vs. GMM) is from http://nbviewer.ipython.org/github/NICTA/MLSS/tree/master/clustering/