Unsupervised Learning
Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net 2018 EE448, Big Data Mining, Lecture 7
http://wnzhang.net/teaching/ee448/index.html
ML problem setting: first build and learn $p(x)$ and then infer the conditional dependence $p(x_t \mid x_i)$.
Unsupervised learning: given the data $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data, for example:
the data distribution $p(x)$;
the latent generating process $z \to x$;
a good representation $\phi(x)$ (which may be more simply related to outputs or rewards).
Slide credit: Maneesh Sahani
K-means represents each cluster by the mean of the points in each cluster, i.e. the centroid for each cluster.
Slide credit: Ray Mooney
Instances are real-valued vectors $x \in \mathbb{R}^d$.
Clusters are based on centroids, i.e. the mean of the points in a cluster $C_k$:
$$\mu_k = \frac{1}{|C_k|}\sum_{x \in C_k} x$$
Slide credit: Ray Mooney
Distance metric $L(x, \mu_k)$ between a point $x$ and a centroid $\mu_k$:

Euclidean (L2) distance:
$$L_2(x, \mu_k) = \|x - \mu_k\| = \sqrt{\sum_{m=1}^d (x_m - \mu_{k,m})^2}$$

Manhattan (L1) distance:
$$L_1(x, \mu_k) = |x - \mu_k| = \sum_{m=1}^d |x_m - \mu_{k,m}|$$

Cosine distance:
$$L_{\cos}(x, \mu_k) = 1 - \frac{x^\top \mu_k}{|x| \cdot |\mu_k|}$$
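As a quick illustration, here is a minimal NumPy sketch of these three distance measures (function and variable names are my own, not from the slides):

```python
import numpy as np

def l2_distance(x, mu):
    """Euclidean (L2) distance between a point x and a centroid mu."""
    return np.sqrt(np.sum((x - mu) ** 2))

def l1_distance(x, mu):
    """Manhattan (L1) distance."""
    return np.sum(np.abs(x - mu))

def cosine_distance(x, mu):
    """Cosine distance: one minus the cosine similarity."""
    return 1.0 - (x @ mu) / (np.linalg.norm(x) * np.linalg.norm(mu))

x = np.array([1.0, 2.0, 3.0])
mu = np.array([0.0, 2.0, 4.0])
print(l2_distance(x, mu), l1_distance(x, mu), cosine_distance(x, mu))
```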
K-means illustration: pick seeds as initial centroids, assign points to clusters, compute the centroids, reassign clusters, recompute the centroids, and repeat until converged.
Slide credit: Ray Mooney
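A minimal NumPy sketch of the loop just illustrated, using Euclidean distance (the initialization and convergence test are my own choices, not prescribed by the slides):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Naive K-means clustering. X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points.
        new_centroids = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assign
```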
Time complexity:
Computing the distance between two instances is O(d), where d is the dimensionality of the vectors.
Reassigning clusters takes O(kn) distance computations, i.e. O(knd).
Computing centroids: each instance is added once to some centroid: O(nd).
Assuming these two steps are each done once per iteration, for I iterations: O(Iknd).
Slide credit: Ray Mooney
The objective of K-means is to minimize the sum of the squared distance of every point to its corresponding cluster centroid:
$$\min_{\{\mu_k\}_{k=1}^K} \sum_{k=1}^K \sum_{x \in C_k} L(x - \mu_k)$$
Re-computation of the centroids: each centroid is set to the mean of the points currently assigned to its cluster,
$$\mu_k = \frac{1}{|C_k|}\sum_{x \in C_k} x$$
K-means is guaranteed to converge, but only to a local optimum.
Results can vary based on random seed selection; some seeds can result in convergence to sub-optimal clusterings.
Good seeds can be selected using a heuristic or the results of another method.
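Since the result depends on the random seeds, a common remedy is to run K-means several times and keep the clustering with the lowest objective. For instance, scikit-learn's KMeans does this via its n_init parameter (a usage sketch, assuming scikit-learn is available; the data here is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))
# n_init=10 runs 10 random initializations and keeps the best run,
# i.e. the one with the smallest within-cluster sum of squared distances.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)          # the K-means objective of the best run
print(km.cluster_centers_)  # its centroids
```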
Dimensionality reduction: detect the latent low-dimensional structure in high-dimensional data.
Example: how skilled a person is and how much the person enjoys flying are often highly correlated; both may reflect a single underlying dimension, the "karma" of a person.
Example credit: Andrew Ng
The data approximately lies on a lower-dimensional subspace.
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations.
PCA performs dimensionality reduction $\mathbb{R}^d \to \mathbb{R}^k$ with $k \ll d$.
Data preprocessing: given data $D = \{x^{(i)}\}_{i=1}^m$, first normalize its mean and variance.
1. Zero out the mean:
$$\mu = \frac{1}{m}\sum_{i=1}^m x^{(i)}, \qquad x^{(i)} \leftarrow x^{(i)} - \mu$$
2. Normalize each coordinate to unit variance:
$$\sigma_j^2 = \frac{1}{m}\sum_{i=1}^m \big(x_j^{(i)}\big)^2, \qquad x_j^{(i)} \leftarrow x_j^{(i)} / \sigma_j$$
so that different attributes are all treated on the same "scale".
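A minimal NumPy sketch of this normalization step (names are my own):

```python
import numpy as np

def normalize(X):
    """Zero-mean, unit-variance normalization of each attribute. X has shape (m, d)."""
    mu = X.mean(axis=0)                       # per-attribute mean
    Xc = X - mu                               # zero out the mean
    sigma = np.sqrt((Xc ** 2).mean(axis=0))   # per-attribute std (1/m convention)
    return Xc / sigma, mu, sigma
```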
PCA finds the directions of maximal variance of the data, i.e. the eigenvectors of the data covariance matrix with the largest eigenvalues.
Consider the projection of a data point $x^{(i)}$ onto a unit direction $u$ ($\|u\| = 1$): the projection length is $x^{(i)\top}u$.
The variance of the projections (for zero-mean data) is
$$\frac{1}{m}\sum_{i=1}^m \big(x^{(i)\top}u\big)^2 = \frac{1}{m}\sum_{i=1}^m u^\top x^{(i)}x^{(i)\top}u = u^\top\Big(\frac{1}{m}\sum_{i=1}^m x^{(i)}x^{(i)\top}\Big)u \equiv u^\top\Sigma u$$
To project the data onto a $k$-dimensional subspace, the way to preserve maximal variance of the data is to find the $k$ principal eigenvectors of $\Sigma$, i.e. the eigenvectors with the $k$ largest eigenvalues.
For a single direction, the problem is
$$\max_{u}\; u^\top\Sigma u \quad \text{s.t.}\ \|u\| = 1, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^m x^{(i)}x^{(i)\top}$$
The data is then projected onto the subspace spanned by $u_1, \ldots, u_k$:
$$y^{(i)} = \begin{bmatrix} u_1^\top x^{(i)} \\ u_2^\top x^{(i)} \\ \vdots \\ u_k^\top x^{(i)} \end{bmatrix} \in \mathbb{R}^k$$
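Putting the pieces together, a minimal NumPy sketch of PCA via eigendecomposition of the sample covariance (the data is assumed already normalized as above; function names are my own):

```python
import numpy as np

def pca(X, k):
    """Project normalized data X (m, d) onto its k principal components."""
    m = X.shape[0]
    Sigma = (X.T @ X) / m             # covariance matrix (data assumed zero-mean)
    w, U = np.linalg.eigh(Sigma)      # eigenvalues ascending, eigenvectors as columns
    top = np.argsort(w)[::-1][:k]     # indices of the k largest eigenvalues
    U_k = U[:, top]                   # principal directions u_1, ..., u_k
    Y = X @ U_k                       # y^(i) = [u_1^T x^(i), ..., u_k^T x^(i)]
    return Y, U_k, w[top]
```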
Eigendecomposition of the covariance matrix: each (unit) eigenvector $u$ of $\Sigma$ with eigenvalue $w$ satisfies
$$\Sigma u = w u \qquad (\|u\| = 1)$$
The $d$ orthonormal eigenvectors form a basis of $\mathbb{R}^d$:
$$\sum_{i=1}^d u_i u_i^\top = I$$
so any vector $v$ can be decomposed in this basis:
$$v = \Big(\sum_{i=1}^d u_i u_i^\top\Big)v = \sum_{i=1}^d (u_i^\top v)\,u_i = \sum_{i=1}^d v_{(i)} u_i$$
and the covariance matrix itself decomposes as
$$\Sigma = \sum_{i=1}^d w_i u_i u_i^\top = U W U^\top$$
where
$$U = [u_1, u_2, \ldots, u_d], \qquad W = \mathrm{diag}(w_1, w_2, \ldots, w_d)$$
Consider the data design matrix (one data point per row)
$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix}$$
and its covariance matrix $\Sigma = X^\top X$.
For any unit vector $v$, the variance of the data projected onto $v$ is
$$\|Xv\|^2 = \Big\|X\Big(\sum_{i=1}^d v_{(i)} u_i\Big)\Big\|^2 = \sum_{i,j} v_{(i)} v_{(j)}\, u_i^\top \Sigma u_j = \sum_{i=1}^d v_{(i)}^2 w_i$$
where $v_{(i)}$ is the projection length of $v$ on $u_i$, and in particular
$$\|Xu_i\|^2 = u_i^\top X^\top X u_i = u_i^\top \Sigma u_i = u_i^\top w_i u_i = w_i$$
Since $\sum_i v_{(i)}^2 = 1$, the maximum is attained at the principal eigenvector:
$$\arg\max_{\|v\|=1} \|Xv\|^2 = u_{(\max)}$$
The direction of greatest variance is the eigenvector with the largest eigenvalue (here we drop the 1/m factor for simplicity).
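A small numerical sanity check of this conclusion (synthetic data; entirely my own construction): no random unit direction should give a larger projected variance than the top eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic data
Sigma = X.T @ X

w, U = np.linalg.eigh(Sigma)
u_max = U[:, np.argmax(w)]               # eigenvector with the largest eigenvalue

dirs = rng.normal(size=(1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)         # random unit directions
best_random = (np.linalg.norm(X @ dirs.T, axis=0) ** 2).max()
print(np.linalg.norm(X @ u_max) ** 2 >= best_random)        # expected: True
```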
Equivalently, PCA minimizes the approximation error arising from projecting the data onto the lower-dimensional subspace.
http://setosa.io/ev/principal-component-analysis/
Mixture of Gaussians: a latent-variable model $\phi \to z \to x$ with Gaussian parameters $\mu, \Sigma$.
Given a training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, model the data by
$$p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)})\, p(z^{(i)})$$
$$z^{(i)} \sim \mathrm{Multinomial}(\phi), \qquad p(z^{(i)} = j) = \phi_j, \qquad x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$$
The latent variable $z^{(i)}$ is the Gaussian cluster ID: it indicates which Gaussian each observed data point $x^{(i)}$ comes from; $\phi$ is the parameter of the latent variable distribution.
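To make the generative process concrete, here is a small NumPy sketch that samples data from this model (the specific values of $\phi$, $\mu_j$, $\Sigma_j$ are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([0.5, 0.3, 0.2])                  # mixing weights, p(z = j) = phi_j
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
Sigmas = np.array([np.eye(2) * s for s in (0.5, 1.0, 2.0)])

m = 1000
z = rng.choice(len(phi), size=m, p=phi)          # z^(i) ~ Multinomial(phi)
x = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])  # x^(i) | z^(i)=j ~ N(mu_j, Sigma_j)
```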
The data log-likelihood:
$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^m \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^m \log \sum_{z^{(i)}=1}^k p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi) = \sum_{i=1}^m \log \sum_{j=1}^k \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)\, \phi_j$$
Directly setting the derivatives to zero,
$$\frac{\partial \ell(\phi,\mu,\Sigma)}{\partial \phi} = 0, \qquad \frac{\partial \ell(\phi,\mu,\Sigma)}{\partial \mu} = 0, \qquad \frac{\partial \ell(\phi,\mu,\Sigma)}{\partial \Sigma} = 0$$
does not yield a closed-form solution, because of the sum inside the logarithm.
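For reference, the log-likelihood above can be evaluated numerically, e.g. with SciPy's multivariate normal density (a sketch assuming SciPy is available; the function name is mine):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(x, phi, mus, Sigmas):
    """l(phi, mu, Sigma) = sum_i log sum_j N(x_i | mu_j, Sigma_j) * phi_j."""
    # densities[i, j] = N(x_i | mu_j, Sigma_j)
    densities = np.column_stack([
        multivariate_normal.pdf(x, mean=mus[j], cov=Sigmas[j])
        for j in range(len(phi))
    ])
    return np.sum(np.log(densities @ phi))
```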
It would be much easier if we knew, for each data point, which Gaussian it comes from, i.e. if $z^{(i)}$ were observed. Then the log-likelihood becomes
$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^m \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^m \log p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi) = \sum_{i=1}^m \Big[\log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi)\Big]$$
Maximizing
$$\max_{\phi,\mu,\Sigma}\; \ell(\phi, \mu, \Sigma) = \max_{\phi,\mu,\Sigma} \sum_{i=1}^m \Big[\log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi)\Big]$$
yields the closed-form maximum likelihood estimates
$$\phi_j = \frac{1}{m}\sum_{i=1}^m \mathbf{1}\{z^{(i)}=j\}, \qquad \mu_j = \frac{\sum_{i=1}^m \mathbf{1}\{z^{(i)}=j\}\, x^{(i)}}{\sum_{i=1}^m \mathbf{1}\{z^{(i)}=j\}}, \qquad \Sigma_j = \frac{\sum_{i=1}^m \mathbf{1}\{z^{(i)}=j\}\, (x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^\top}{\sum_{i=1}^m \mathbf{1}\{z^{(i)}=j\}}$$
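When $z$ is observed, these estimates reduce to per-cluster frequencies, means, and covariances; a NumPy sketch (names my own; each cluster is assumed non-empty):

```python
import numpy as np

def gmm_mle_observed(x, z, k):
    """ML estimates of (phi, mu, Sigma) when the cluster labels z are observed."""
    phi = np.array([(z == j).mean() for j in range(k)])          # cluster frequencies
    mus = np.array([x[z == j].mean(axis=0) for j in range(k)])   # per-cluster means
    Sigmas = np.array([
        (x[z == j] - mus[j]).T @ (x[z == j] - mus[j]) / (z == j).sum()
        for j in range(k)
    ])                                                           # per-cluster covariances
    return phi, mus, Sigmas
```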
Since $z^{(i)}$ is not observed, we instead infer the posterior of the latent variable $z^{(i)}$ for each instance:
$$p(z^{(i)}=j \mid x^{(i)}; \phi,\mu,\Sigma) = \frac{p(z^{(i)}=j,\, x^{(i)}; \phi,\mu,\Sigma)}{p(x^{(i)}; \phi,\mu,\Sigma)} = \frac{p(x^{(i)} \mid z^{(i)}=j; \mu,\Sigma)\, p(z^{(i)}=j; \phi)}{\sum_{l=1}^k p(x^{(i)} \mid z^{(i)}=l; \mu,\Sigma)\, p(z^{(i)}=l; \phi)}$$
where $p(z^{(i)}=j; \phi)$ is the prior distribution of the latent variables given the model parameters, and $p(x^{(i)} \mid z^{(i)}=j; \mu,\Sigma)$ is the data likelihood given the latent variable distribution.
EM for the Gaussian mixture model. Repeat until convergence:
(E-step) For each $i, j$, set
$$w_j^{(i)} = p(z^{(i)}=j \mid x^{(i)}; \phi,\mu,\Sigma)$$
(M-step) Update the parameters
$$\phi_j = \frac{1}{m}\sum_{i=1}^m w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^m w_j^{(i)} x^{(i)}}{\sum_{i=1}^m w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^m w_j^{(i)} (x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^\top}{\sum_{i=1}^m w_j^{(i)}}$$
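A compact NumPy/SciPy sketch of these two steps (the initialization, fixed iteration count, and the small covariance ridge are my own choices, not prescribed by the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(x, k, n_iters=100, seed=0):
    """EM for a Gaussian mixture model. x has shape (m, d)."""
    m, d = x.shape
    rng = np.random.default_rng(seed)
    # Initialization: uniform weights, random data points as means, identity covariances.
    phi = np.full(k, 1.0 / k)
    mus = x[rng.choice(m, size=k, replace=False)].copy()
    Sigmas = np.array([np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: responsibilities w[i, j] = p(z_i = j | x_i; phi, mu, Sigma).
        dens = np.column_stack([
            phi[j] * multivariate_normal.pdf(x, mean=mus[j], cov=Sigmas[j])
            for j in range(k)
        ])
        w = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities.
        Nj = w.sum(axis=0)                        # effective cluster sizes
        phi = Nj / m
        mus = (w.T @ x) / Nj[:, None]
        Sigmas = np.array([
            ((x - mus[j]) * w[:, j:j+1]).T @ (x - mus[j]) / Nj[j] + 1e-6 * np.eye(d)
            for j in range(k)
        ])                                        # small ridge for numerical stability
    return phi, mus, Sigmas, w
```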
During the EM iterations, the data likelihood will never decrease.
EM is a general algorithm for maximizing a latent variable model likelihood.
We now derive the general EM algorithm and verify its effectiveness of improving the data likelihood and its convergence.
Jensen's inequality: let $f$ be a convex function and $X$ be a random variable. Then
$$\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$$
Moreover, if $f$ is strictly convex, then $\mathbb{E}[f(X)] = f(\mathbb{E}[X])$ holds true if and only if $X = \mathbb{E}[X]$ with probability 1 (i.e., $X$ is a constant).
Figure credit: Andrew Ng
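A tiny numerical illustration of the inequality with the convex function $f(x) = x^2$ (the distribution and values are arbitrary, chosen only for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=100_000)   # some random variable X
f = lambda t: t ** 2                               # a convex function

print(np.mean(f(X)))   # E[f(X)] ~ 1 + 4 = 5
print(f(np.mean(X)))   # f(E[X]) ~ 1
# Jensen's inequality: E[f(X)] >= f(E[X]) for convex f.
```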
Figure credit: Maneesh Sahani
Recap: given the data $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying data patterns, e.g. a latent variable model $z \to x$, by maximizing the data likelihood, where the log-likelihood is
$$\ell(\theta) = \sum_{i=1}^N \log p(x^{(i)}; \theta) = \sum_{i=1}^N \log \sum_{z} p(x^{(i)}, z; \theta)$$
Maximizing the marginal likelihood
$$\theta^* = \arg\max_{\theta} \sum_{i=1}^N \log \sum_{z} p(x^{(i)}, z^{(i)}; \theta)$$
is hard, whereas the fully-observed objective
$$\theta^* = \arg\max_{\theta} \sum_{i=1}^N \log p(x^{(i)} \mid z^{(i)}; \theta)$$
would be easy to maximize if $z^{(i)}$ were observed.
EM maximizes the likelihood by iteratively constructing and optimizing a lower bound. For each instance $i$, let $q_i$ be some distribution over the latent variable $z$:
$$\sum_z q_i(z) = 1, \qquad q_i(z) \ge 0$$
$$\ell(\theta) = \sum_{i=1}^N \log p(x^{(i)}; \theta) = \sum_{i=1}^N \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^N \log \sum_{z^{(i)}} q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})} \ge \sum_{i=1}^N \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
The last step follows from Jensen's inequality (log is concave), and the right-hand side is a lower bound of $\ell(\theta)$.
$$\ell(\theta) = \sum_{i=1}^N \log p(x^{(i)}; \theta) \ge \sum_{i=1}^N \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
REVIEW (Jensen's inequality): let $f$ be a convex function and $X$ be a random variable. Then $\mathbb{E}[f(X)] \ge f(\mathbb{E}[X])$, and if $f$ is strictly convex, $\mathbb{E}[f(X)] = f(\mathbb{E}[X])$ holds true if and only if $X = \mathbb{E}[X]$ with probability 1 (i.e., $X$ is a constant).
For the lower bound
$$\ell(\theta) = \sum_{i=1}^N \log p(x^{(i)}; \theta) \ge \sum_{i=1}^N \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
to hold with equality, it is sufficient that
$$p(x^{(i)}, z^{(i)}; \theta) = q_i(z^{(i)}) \cdot c$$
for some constant $c$ not depending on $z^{(i)}$, since then
$$\log p(x^{(i)}; \theta) = \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \log \sum_{z^{(i)}} q_i(z^{(i)})\, c = \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
Because $q_i$ is a distribution, this choice is
$$q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$
i.e. the posterior of the latent variable given the data and the current parameters.
The general EM algorithm. Repeat until convergence:
(E-step) For each $i$, set
$$q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta)$$
(M-step) Update the parameters
$$\theta = \arg\max_{\theta} \sum_{i=1}^N \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
Considering the parameters $\theta^{(t)}$ and $\theta^{(t+1)}$ from two successive iterations of EM, we prove that
$$\ell(\theta^{(t)}) \le \ell(\theta^{(t+1)})$$
which shows EM always monotonically improves the log-likelihood, and thus ensures that EM will at least converge to a local optimum.
Start from $\theta^{(t)}$ and choose the posteriors
$$q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$
With this choice, Jensen's inequality holds with equality:
$$\ell(\theta^{(t)}) = \sum_{i=1}^N \log \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \sum_{i=1}^N \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})}$$
The parameters $\theta^{(t+1)}$ are then obtained by maximizing the right-hand side of the above equation. Thus
$$\ell(\theta^{(t+1)}) \ge \sum_{i=1}^N \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{q_i^{(t)}(z^{(i)})} \ge \sum_{i=1}^N \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \ell(\theta^{(t)})$$
where the first inequality is the lower bound (which holds for any $\theta$) and the second follows from the parameter optimization in the M-step.
Define the lower bound
$$J(q, \theta) = \sum_{i=1}^N \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
Then we know
$$\ell(\theta) \ge J(q, \theta)$$
and EM can be viewed as alternately maximizing $J(q, \theta)$ with respect to $q$ (E-step) and with respect to $\theta$ (M-step).
Figure credit: Maneesh Sahani
(E-step)
$$q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$
(M-step)
$$\theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^N \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i^{(t)}(z^{(i)})}$$