Unsupervised Learning
Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net 2019 CS420, Machine Learning, Lecture 9
http://wnzhang.net/teaching/cs420/index.html
What is Data Science?
Physics: the goal is to discover the underlying principle of the world, i.e., to understand the world from observations.
Data science: the goal is to discover the underlying principle of the data, i.e., to understand the data from observations.
Physics example (law of gravitation):
$$F = \frac{G m_1 m_2}{r^2}$$
Data science example (a probabilistic model of the data):
$$p(x) = \frac{e^{f(x)}}{\sum_{x'} e^{f(x')}}$$
The central object is the data distribution $p(x)$, or conditional distributions such as $p(x_2 \mid x_1)$.
Example: univariate Gaussian
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Example: multivariate Gaussian
$$p(x) = \frac{1}{\sqrt{|2\pi\Sigma|}} \, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)}$$
Example user data:

Interest   Gender   Age   BBC Sports   PubMed   Bloomberg Business   Spotify
Finance    Male     29    Yes          No       Yes                  No
Sports     Male     21    Yes          No       No                   Yes
Medicine   Female   32    No           Yes      No                   No
Music      Female   25    No           No       No                   Yes
Medicine   Male     40    Yes          Yes      Yes                  No
Joint distribution: p(Interest=Finance, Gender=Male, Age=29, Browsing={BBC Sports, Bloomberg Business})
Conditional distributions: p(Interest=Finance | Browsing={BBC Sports, Bloomberg Business}) and p(Gender=Male | Browsing={BBC Sports, Bloomberg Business})
More generally, such models capture the conditional dependence $p(x_t \mid x_i)$ between attributes.
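As a quick illustration (counting over just the five rows in the toy table above, so the number is purely illustrative): two of the five users browse both BBC Sports and Bloomberg Business, and one of them is interested in finance, so
$$\hat{p}(\text{Interest}=\text{Finance} \mid \text{BBC Sports}=\text{Yes}, \text{Bloomberg Business}=\text{Yes}) = \frac{1}{2}.$$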
Unsupervised learning: given only the data, let the machine learn the underlying patterns of the data.
Data: $D = \{x_i\}_{i=1,2,\ldots,N}$
Typical goals:
- Density estimation: model $p(x)$
- Latent variable modeling: $z \to x$
- Representation learning: features $\phi(x)$ (which may be more simply related to outputs or rewards)
Slide credit: Maneesh Sahani
K-means clustering: clusters are based on the mean of the points in each cluster, i.e. the centroid for each cluster.
Slide credit: Ray Mooney
Each instance is a vector $x \in \mathbb{R}^d$. The centroid of a cluster $C_k$ is the mean of its points:
$$\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$$
Slide credit: Ray Mooney
Distance measure $L(x, \mu_k)$ between a point and a centroid:
Euclidean (L2) distance:
$$L_2(x, \mu_k) = \|x - \mu_k\| = \sqrt{\sum_{m=1}^{d} (x_m - \mu_{k,m})^2}$$
Manhattan (L1) distance:
$$L_1(x, \mu_k) = |x - \mu_k| = \sum_{m=1}^{d} |x_m - \mu_{k,m}|$$
Cosine distance:
$$L_{\cos}(x, \mu_k) = 1 - \frac{x^\top \mu_k}{\|x\| \cdot \|\mu_k\|}$$
K-means iteration example (figure): pick seeds, assign points to clusters, compute centroids, reassign clusters, recompute centroids, reassign clusters, converged.
Slide credit: Ray Mooney
Time complexity:
- Computing the distance between two instances is O(d), where d is the dimensionality of the vectors.
- Reassigning clusters: O(kn) distance computations, i.e. O(knd).
- Computing centroids: each instance is added once to some centroid: O(nd).
- With both steps done once per iteration, I iterations cost O(Iknd).
Slide credit: Ray Mooney
The K-means objective is to minimize the sum of the squared distance of every point to its corresponding cluster centroid:
$$\min_{\{\mu_k\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{x \in C_k} L(x - \mu_k)$$
where each centroid is the mean of the points assigned to its cluster:
$$\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$$
K-means is guaranteed to converge, but only to a local optimum.
The result depends on the initial seeds: bad seed choices can lead to convergence to sub-optimal clusterings.
A common remedy is to select good seeds with a heuristic or with the results of another method.
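A minimal NumPy sketch of the K-means loop described above (random-point initialization and Euclidean distance are illustrative choices; names are hypothetical):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Pick k random data points as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid (L2 distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels

# Toy usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centroids, labels = kmeans(X, k=2)
```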
Why dimensionality reduction? For compressing and visualizing high-dimensional data, and for noise or outlier detection.
Motivating example: suppose x1 measures a pilot's skill and x2 measures how much the pilot enjoys flying. The two are highly correlated because both are driven by a single latent factor, the piloting "karma" of a person.
Example credit: Andrew Ng
In such cases the data approximately lies along a lower-dimensional subspace.
Principal component analysis (PCA) uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations.
PCA maps $\mathbb{R}^d \to \mathbb{R}^k$ with $k \ll d$.
Data preprocessing: normalize each attribute by its mean and variance. Given $D = \{x^{(i)}\}_{i=1}^{m}$:
1. Compute the mean and center the data:
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad x^{(i)} \leftarrow x^{(i)} - \mu$$
2. Rescale each coordinate to unit variance:
$$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \big(x_j^{(i)}\big)^2, \qquad x_j^{(i)} \leftarrow x_j^{(i)} / \sigma_j$$
so that different attributes are all treated on the same "scale".
PCA finds the directions of maximum variance, i.e. the eigenvectors of the data covariance matrix with the largest eigenvalues.
Project each point $x^{(i)}$ onto a unit direction $u$ ($\|u\| = 1$); the projection length is $x^{(i)\top} u$. The variance of the projections is
$$\frac{1}{m} \sum_{i=1}^{m} \big(x^{(i)\top} u\big)^2 = \frac{1}{m} \sum_{i=1}^{m} u^\top x^{(i)} x^{(i)\top} u = u^\top \Big( \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)\top} \Big) u \equiv u^\top \Sigma u$$
So summarizing the data with k directions amounts to finding the k principal eigenvectors of Σ, i.e. those with the largest eigenvalues.
$$\max_{u} \; u^\top \Sigma u \quad \text{s.t. } \|u\| = 1, \qquad \Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)\top}$$
The k-dimensional representation of $x^{(i)}$ is
$$y^{(i)} = \begin{bmatrix} u_1^\top x^{(i)} \\ u_2^\top x^{(i)} \\ \vdots \\ u_k^\top x^{(i)} \end{bmatrix} \in \mathbb{R}^k$$
Eigen-decomposition of Σ: an eigenvector u with eigenvalue w satisfies $\Sigma u = w u$ (with $\|u\| = 1$), and the eigenvectors $\{u_i\}_{i=1}^{d}$ form an orthonormal basis:
$$\sum_{i=1}^{d} u_i u_i^\top = I$$
Any vector v can be expanded in this basis:
$$v = \Big( \sum_{i=1}^{d} u_i u_i^\top \Big) v = \sum_{i=1}^{d} (u_i^\top v) \, u_i = \sum_{i=1}^{d} v_{(i)} u_i$$
The covariance matrix itself decomposes as
$$\Sigma = \sum_{i=1}^{d} w_i u_i u_i^\top = U W U^\top, \qquad U = [u_1, u_2, \ldots, u_d], \quad W = \mathrm{diag}(w_1, w_2, \ldots, w_d)$$
Consider the data matrix and its covariance matrix $\Sigma = X^\top X$, where
$$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix}$$
For any unit vector v,
$$\|Xv\|^2 = \Big\| X \Big( \sum_{i=1}^{d} v_{(i)} u_i \Big) \Big\|^2 = \sum_{i,j} v_{(i)} v_{(j)} \, u_i^\top \Sigma u_j = \sum_{i=1}^{d} v_{(i)}^2 \, w_i$$
$$\|X u_i\|^2 = u_i^\top X^\top X u_i = u_i^\top \Sigma u_i = u_i^\top w_i u_i = w_i$$
where $v_{(i)} = u_i^\top v$ is the projection length of v onto $u_i$.
Therefore
$$\arg\max_{\|v\|=1} \|Xv\|^2 = u_{(\max)}$$
That is, the direction of greatest variance is the eigenvector with the largest eigenvalue (here we may drop the factor 1/m for simplicity).
Equivalently, the principal directions minimize the approximation error arising from projecting the data onto the chosen subspace.
Interactive demo: http://setosa.io/ev/principal-component-analysis/
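A compact NumPy sketch of the procedure above (centering, scaling, eigendecomposition of the covariance, projection onto the top-k eigenvectors); function and variable names are illustrative:

```python
import numpy as np

def pca(X, k):
    """Project rows of X (shape m x d) onto the top-k principal components."""
    # 1. Normalize each attribute by its mean and variance.
    X = X - X.mean(axis=0)
    std = X.std(axis=0)
    X = X / np.where(std > 0, std, 1.0)
    # 2. Covariance matrix Sigma = (1/m) sum_i x_i x_i^T.
    m = X.shape[0]
    Sigma = X.T @ X / m
    # 3. Eigendecomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1][:k]
    U_k = eigvecs[:, order]            # d x k matrix of top-k eigenvectors
    # 4. y_i = [u_1^T x_i, ..., u_k^T x_i]
    return X @ U_k

Y = pca(np.random.randn(200, 10), k=2)   # 200 points reduced from 10-D to 2-D
```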
Mixture of Gaussians: a latent variable model (graphical model $\phi \to z \to x$, with Gaussian parameters $\mu, \Sigma$).
Given a training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$, model the data by
$$p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)}) \, p(z^{(i)}), \qquad z^{(i)} \sim \text{Multinomial}(\phi), \qquad x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j), \qquad p(z^{(i)} = j) = \phi_j$$
The latent variable $z^{(i)}$ is the Gaussian cluster ID: it indicates which Gaussian each observed data point comes from, and $\phi$ parameterizes the latent variable distribution.
The log-likelihood of the data is
$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \sum_{z^{(i)}=1}^{k} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi) = \sum_{i=1}^{m} \log \sum_{j=1}^{k} \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j) \, \phi_j$$
Setting the derivatives to zero,
$$\frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \phi} = 0, \qquad \frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \mu} = 0, \qquad \frac{\partial \ell(\phi, \mu, \Sigma)}{\partial \Sigma} = 0$$
yields no closed-form solution, because of the sum inside the logarithm.
Suppose instead that, for each data point, we knew which Gaussian it comes from, i.e. $z^{(i)}$ were observed. Then the log-likelihood becomes
$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \big[ p(x^{(i)} \mid z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi) \big] = \sum_{i=1}^{m} \Big[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \Big]$$
and we can directly solve
$$\max_{\phi, \mu, \Sigma} \; \ell(\phi, \mu, \Sigma) = \max_{\phi, \mu, \Sigma} \sum_{i=1}^{m} \Big[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \Big]$$
The maximum likelihood solutions are
$$\phi_j = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{z^{(i)} = j\}, \qquad \mu_j = \frac{\sum_{i=1}^{m} \mathbb{1}\{z^{(i)} = j\} \, x^{(i)}}{\sum_{i=1}^{m} \mathbb{1}\{z^{(i)} = j\}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} \mathbb{1}\{z^{(i)} = j\} \, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} \mathbb{1}\{z^{(i)} = j\}}$$
Since $z^{(i)}$ is in fact unobserved, we instead compute the posterior of the latent variable $z^{(i)}$ for each instance:
$$p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(z^{(i)} = j, x^{(i)}; \phi, \mu, \Sigma)}{p(x^{(i)}; \phi, \mu, \Sigma)} = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma) \, p(z^{(i)} = j; \phi)}{\sum_{l=1}^{k} p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma) \, p(z^{(i)} = l; \phi)}$$
where $p(z^{(i)} = j; \phi)$ is the prior of the latent variable and $p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)$ is the Gaussian likelihood (graphical model $\phi \to z \to x$ with parameters $\mu, \Sigma$).
The EM idea: alternately estimate the posterior of the latent variables given the model parameters (E-step), and maximize the data likelihood given the latent variable distribution (M-step).
EM for the Gaussian mixture model. Repeat until convergence: {
(E-step) For each i, j, set
$$w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$$
(M-step) Update the parameters
$$\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} w_j^{(i)}}$$
}
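A minimal NumPy/SciPy sketch of these E/M updates (random responsibility initialization and the small covariance ridge are illustrative choices, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iters=100, seed=0):
    """EM for a Gaussian mixture model on data X of shape (m, d)."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialize responsibilities w (m x k) at random, then start with an M-step.
    w = rng.dirichlet(np.ones(k), size=m)
    for _ in range(n_iters):
        # M-step: update phi, mu, Sigma from the current responsibilities.
        Nk = w.sum(axis=0)                       # effective counts per component
        phi = Nk / m
        mu = (w.T @ X) / Nk[:, None]
        Sigma = np.zeros((k, d, d))
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nk[j]
            Sigma[j] += 1e-6 * np.eye(d)         # ridge for numerical stability
        # E-step: recompute posteriors w_j^(i) = p(z=j | x^(i)).
        dens = np.column_stack([
            phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
            for j in range(k)
        ])
        w = dens / dens.sum(axis=1, keepdims=True)
    return phi, mu, Sigma, w

phi, mu, Sigma, w = gmm_em(np.random.randn(300, 2), k=3)
```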
A key property of EM is that each iteration never decreases the latent variable model likelihood. Next we verify its effectiveness in improving the data likelihood and its convergence.
Jensen's inequality. Let f be a convex function and X be a random variable. Then:
$$E[f(X)] \geq f(E[X])$$
Moreover, $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).
Figure credit: Andrew Ng
Figure credit: Maneesh Sahani
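A one-line sanity check of the inequality with the convex function $f(x) = x^2$ (an illustrative example, not from the slides):
$$E[X^2] \geq (E[X])^2 \quad \Longleftrightarrow \quad \mathrm{Var}(X) = E[X^2] - (E[X])^2 \geq 0,$$
with equality exactly when X is a constant.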
General setting: given only the data $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data with a latent variable model $z \to x$,
where the log-likelihood is
$$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z} p(x^{(i)}, z; \theta)$$
Directly solving
$$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log \sum_{z} p(x^{(i)}, z^{(i)}; \theta)$$
is hard because of the sum inside the logarithm, whereas the complete-data problem
$$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x^{(i)} \mid z^{(i)}; \theta)$$
would be easy if the latent variables were observed.
EM attacks the problem by iteratively building and optimizing a lower bound. Introduce, for each i, a distribution $q_i(z)$ over the latent variable:
$$\sum_{z} q_i(z) = 1, \qquad q_i(z) \geq 0$$
Then
$$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})} \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
where the last step uses Jensen's inequality (log is concave), giving a lower bound on the log-likelihood.
To summarize, for any choice of the distributions $q_i$,
$$\ell(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
REVIEW. Jensen's inequality: let f be a convex function and X be a random variable. Then $E[f(X)] \geq f(E[X])$, and $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).
For the lower bound above to be tight for a particular value of θ (i.e., to hold with equality), it is sufficient that
$$p(x^{(i)}, z^{(i)}; \theta) = q_i(z^{(i)}) \cdot c$$
for some constant c not depending on $z^{(i)}$. In that case
$$\log p(x^{(i)}; \theta) = \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \log \sum_{z^{(i)}} q_i(z^{(i)}) \, c = \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
This is achieved by choosing $q_i$ as the posterior:
$$q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$$
The general EM algorithm. Repeat until convergence: {
(E-step) For each i, set
$$q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta)$$
(M-step) Update the parameters
$$\theta = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
}
Across successive iterations of EM, we prove that
$$\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$$
which shows EM always monotonically improves the log-likelihood, and thus ensures EM will at least converge to a local optimum. Start from the E-step choice
$$q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$
This choice makes the lower bound tight, so
$$\ell(\theta^{(t)}) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})}$$
Since $\theta^{(t+1)}$ is chosen to maximize the right hand side of the above equation, we have
$$\ell(\theta^{(t+1)}) \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{q_i^{(t)}(z^{(i)})} \geq \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \ell(\theta^{(t)})$$
where the first inequality is the lower bound (valid for any θ) and the second follows from the parameter optimization in the M-step.
Define the lower bound
$$J(q, \theta) = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$$
Then we know
$$\ell(\theta) \geq J(q, \theta)$$
(Figure: alternating updates over q and θ.)
Figure credit: Maneesh Sahani
EM can thus be viewed as coordinate ascent on $J(q, \theta)$: the E-step maximizes J with respect to q (making the bound tight) by setting
$$q_i^{(t)}(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$$
and the M-step maximizes J with respect to θ:
$$\theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i^{(t)}(z^{(i)})}$$
Restricted Boltzmann machine (RBM): a generative stochastic neural network that can learn a probability distribution over its set of inputs, with a layer of visible units v and a layer of hidden units h.
"Restricted": units within the same layer are not connected to each other; connections exist only between visible and hidden units.
Energy function and joint distribution:
$$E(v, h) = -\sum_{i} b_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i w_{i,j} h_j, \qquad p(v, h) = \frac{1}{Z} e^{-E(v, h)}$$
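Because of the restricted (bipartite) structure, the conditionals factorize; for binary units they are logistic in the opposite layer. A small NumPy sketch of one alternating Gibbs-sampling step under the energy above (weight shapes and the helper names are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b_v, b_h, rng):
    """One alternating Gibbs step for a binary RBM with
    E(v, h) = -b_v.v - b_h.h - v^T W h."""
    # p(h_j = 1 | v) = sigmoid(b_h_j + sum_i v_i w_ij)
    p_h = sigmoid(b_h + v @ W)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # p(v_i = 1 | h) = sigmoid(b_v_i + sum_j w_ij h_j)
    p_v = sigmoid(b_v + h @ W.T)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)
    return v_new, h

rng = np.random.default_rng(0)
n_vis, n_hid = 6, 3
W = 0.1 * rng.standard_normal((n_vis, n_hid))
v = rng.integers(0, 2, size=n_vis).astype(float)
v, h = gibbs_step(v, W, np.zeros(n_vis), np.zeros(n_hid), rng)
```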
Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." science 313.5786 (2006): 504-507.
Figure: 2-D document codes produced by latent semantic analysis (based on PCA) vs. a 2000-500-250-125-2 autoencoder pretrained as a deep belief network (DBN).
An autoencoder is a neural network used for unsupervised learning of efficient codings, typically for the purpose of dimensionality reduction.
A simple one-hidden-layer autoencoder:
$$z = \sigma(W_1 x + b_1), \qquad \tilde{x} = \sigma(W_2 z + b_2)$$
where z is regarded as the low-dimensional latent factor representation of x, and $\tilde{x}$ is the reconstruction of x.
Training minimizes the reconstruction error (with a linear output layer here):
$$J(W_1, b_1, W_2, b_2) = \sum_{i=1}^{m} (\tilde{x}^{(i)} - x^{(i)})^2 = \sum_{i=1}^{m} (W_2 z^{(i)} + b_2 - x^{(i)})^2 = \sum_{i=1}^{m} \Big( W_2 \, \sigma(W_1 x^{(i)} + b_1) + b_2 - x^{(i)} \Big)^2$$
The parameters are updated by gradient descent,
$$\theta \leftarrow \theta - \eta \frac{\partial J}{\partial \theta}$$
i.e. the network is trained by back-propagation just as in a supervised fashion, with the input itself as the target.
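A minimal PyTorch sketch of such an autoencoder trained with gradient descent on the squared reconstruction error (layer sizes, optimizer, learning rate, and the random toy data are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_latent), nn.Sigmoid())  # z = sigma(W1 x + b1)
        self.decoder = nn.Linear(d_latent, d_in)                               # x_tilde = W2 z + b2

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = AutoEncoder()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
X = torch.rand(256, 784)                         # toy data in place of real inputs

for epoch in range(10):
    x_tilde, z = model(X)
    loss = ((x_tilde - X) ** 2).sum(dim=1).mean()  # squared reconstruction error
    opt.zero_grad()
    loss.backward()                                # theta <- theta - eta * dJ/dtheta
    opt.step()
```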
Denoising autoencoder: first corrupt the input, $\tilde{x} \sim q_D(\tilde{x} \mid x)$ (e.g. by adding Gaussian noise), then encode $z = f_\theta(\tilde{x})$, decode $\hat{x} = g_{\theta'}(z)$, and train the network to reconstruct the clean x by minimizing the loss $L(x, \hat{x})$.
(Pipeline: $x \to q_D \to \tilde{x} \to f_\theta \to z \to g_{\theta'} \to \hat{x}$, with loss $L(x, \hat{x})$.)
Stacked denoising autoencoders are trained layer by layer: the next level is trained to reconstruct the code $z_1$, the one after that to reconstruct $z_2$, and so on ($z_1, z_2, z_3, \ldots$).
Figure: original, corrupted, and reconstructed images.
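Reusing the hypothetical AutoEncoder, model, opt, and X from the sketch above, the denoising variant only changes the input that is fed forward; the Gaussian corruption level below is an illustrative assumption:

```python
noise_std = 0.2                                          # corruption strength (illustrative)
for epoch in range(10):
    x_tilde_in = X + noise_std * torch.randn_like(X)     # x_tilde ~ q_D(x_tilde | x), Gaussian noise
    x_hat, z = model(x_tilde_in)
    loss = ((x_hat - X) ** 2).sum(dim=1).mean()          # compare against the clean x
    opt.zero_grad()
    loss.backward()
    opt.step()
```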
[Goodfellow, I., et al. 2014. Generative adversarial nets. In NIPS 2014.]
Generative modeling: given a dataset $D = \{x\}$ drawn from an underlying distribution $p(x)$, learn a model distribution $q_\theta(x)$ that fits the true one.
Maximum likelihood estimation:
$$\max_{\theta} \frac{1}{|D|} \sum_{x \in D} \log q_\theta(x) \;\simeq\; \max_{\theta} \mathbb{E}_{x \sim p(x)}[\log q_\theta(x)]$$
Intuitively, maximum likelihood requires that wherever the true data has a high mass density, the learned model should also place a high mass density.
Two possible directions for matching the model to the data:
$$\max_{\theta} \mathbb{E}_{x \sim p(x)}[\log q_\theta(x)] \qquad \text{vs.} \qquad \max_{\theta} \mathbb{E}_{x \sim q_\theta(x)}[\log p(x)]$$
The first (maximum likelihood) is what we use for training and evaluation; it can be estimated directly from data as $\max_\theta \frac{1}{|D|} \sum_{x \in D} \log q_\theta(x)$.
The second asks that the model-generated data be considered as true (realistic) as possible, which is what matters at use time; but it is hard or impossible to directly calculate $p(x)$ for a generated sample.
Since $p(x)$ cannot be evaluated directly, can we instead build a model to judge whether a data instance is true or fake (artificially generated)?
Discriminative models are good at exactly this: train a discriminator to distinguish the true data from the fake, model-generated data.
If a well-trained discriminator cannot distinguish the generated data from the true data, the generator G nicely fits the true underlying data distribution.
GAN architecture (figure): real-world data and generator samples are both fed to the discriminator.
Generator G: a multilayer perceptron that maps random noise z to a data sample (it typically takes z as its direct input, but need not do so).
Discriminator D: a multilayer perceptron that outputs a probabilistic prediction of whether its input is real data or generated data.
The value function:
$$J^{(D)} = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))]$$
The discriminator solves $\max_D J^{(D)}$, and the generator solves the minimax problem $\min_G \max_D J^{(D)}$.
Training alternates the two players on the same objective
$$J^{(D)} = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log (1 - D(G(z)))]$$
Training the discriminator: fix G and maximize $J^{(D)}$ over D, so that D learns to distinguish the true and generated data.
Training the generator: fix D and minimize $\max_D J^{(D)}$ over G, pushing the generated samples towards the perfect (true) data distribution.
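A compact PyTorch sketch of this alternating optimization on a toy 1-D data distribution (network sizes, optimizers, and the N(2, 0.5) toy data are illustrative assumptions, not the original experiment):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))                # generator: z -> x
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # discriminator: x -> [0,1]
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = 2.0 + 0.5 * torch.randn(64, 1)     # samples from p_data (toy Gaussian)
    z = torch.randn(64, 8)                    # noise z ~ p_z
    fake = G(z)

    # Discriminator step: maximize J^(D), i.e. minimize BCE with labels real=1, fake=0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: the common non-saturating variant maximizes log D(G(z))
    # instead of minimizing log(1 - D(G(z))).
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```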
With G fixed, the optimal discriminator for any $p_{\text{data}}(x)$ and $p_G(x)$ is always
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$$
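A one-step justification (the standard pointwise argument, added here for completeness): write the objective as
$$J^{(D)} = \int_x \big[ p_{\text{data}}(x) \log D(x) + p_G(x) \log (1 - D(x)) \big] \, dx,$$
and maximize the integrand for each x separately. For fixed $a = p_{\text{data}}(x)$ and $b = p_G(x)$, the function $a \log D + b \log(1 - D)$ has derivative $\frac{a}{D} - \frac{b}{1 - D}$, which vanishes at $D = \frac{a}{a + b}$, giving $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$.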
Plugging the optimal discriminator into the objective:
$$J^{(D)} = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{x \sim p_G}[\log(1 - D(x))]$$
$$= \mathbb{E}_{x \sim p_{\text{data}}}\Big[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}\Big] + \mathbb{E}_{x \sim p_G}\Big[\log \frac{p_G(x)}{p_{\text{data}}(x) + p_G(x)}\Big] = -\log 4 + \mathrm{KL}\Big(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_G}{2}\Big) + \mathrm{KL}\Big(p_G \,\Big\|\, \frac{p_{\text{data}} + p_G}{2}\Big)$$
So at the inner maximum, $\max_D J^{(D)} = -\log 4 + 2\,\mathrm{JSD}(p_{\text{data}} \| p_G)$, and the outer problem $\min_G \max_D J^{(D)}$ makes the generator minimize the Jensen-Shannon divergence between the data and model distributions.
This divergence, optimized by G against D, is something between the two KL directions discussed earlier, $\mathrm{KL}(p_{\text{data}} \| p_G)$ (maximum likelihood) and $\mathrm{KL}(p_G \| p_{\text{data}})$.
[Huszár, Ferenc. "How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?." arXiv (2015).]
Note that the generated data x = G(z) needs to be continuous for the gradients of the minimax objective $\min_G \max_D J(G, D)$ to propagate from the discriminator back to the generator. (Figure: noise z, generated x, and the distributions p over x.)
For each generated sample, the nearest training example is shown next to the generated one and clearly differs from it, which means GAN does not simply memorize training instances.
Two imaginary celebrities that were dreamed up by a random number generator.
Tero Karras et al. Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR 2018.
Ledig, Christian, et al. "Photo-realistic single image super-resolution using a generative adversarial network." CVPR 2017.
Super-resolution: a deep residual generative adversarial network optimized for a loss more sensitive to human perception [4x upscaling].
Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." CVPR 2017.
Yun Cao, Weinan Zhang, et al. "Unsupervised Diverse Colorization via Generative Adversarial Networks." ECML-PKDD 2017.
Figure: ground truth images vs. generated colorizations after converting to grayscale.
Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs", arXiv preprint arXiv:1711.11585.