SLIDE 1

Unsupervised Learning

Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net 2019 CS420, Machine Learning, Lecture 9

http://wnzhang.net/teaching/cs420/index.html

SLIDE 2

What is Data Science

  • Physics
    • Goal: discover the underlying principles of the world
    • Solution: build a model of the world from observations
  • Data Science
    • Goal: discover the underlying principles of the data
    • Solution: build a model of the data from observations

$F = \frac{G m_1 m_2}{r^2}$

$p(x) = \frac{e^{f(x)}}{\sum_{x'} e^{f(x')}}$

SLIDE 3

Data Science

  • Mathematically
    • Find the joint data distribution $p(x)$
    • Then the conditional distribution $p(x_2 | x_1)$
  • Gaussian distribution
    • Univariate: $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
    • Multivariate: $p(x) = \frac{1}{\sqrt{|2\pi\Sigma|}} \, e^{-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)}$

SLIDE 4

A Simple Example in User Behavior Modelling

  Interest   Gender   Age   BBC Sports   PubMed   Bloomberg Business   Spotify
  Finance    Male     29    Yes          No       Yes                  No
  Sports     Male     21    Yes          No       No                   Yes
  Medicine   Female   32    No           Yes      No                   No
  Music      Female   25    No           No       No                   Yes
  Medicine   Male     40    Yes          Yes      Yes                  No

  • Joint data distribution

p(Interest=Finance, Gender=Male, Age=29, Browsing=BBC Sports,Bloomberg Business)

  • Conditional data distribution

p(Interest=Finance | Browsing=BBC Sports,Bloomberg Business) p(Gender=Male | Browsing=BBC Sports,Bloomberg Business)
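Purely as an illustration of these two queries, the sketch below estimates them from the five-row toy table by simple counting; the record layout and the p_conditional helper are my own assumptions, not part of the slides.

```python
# Hypothetical sketch: empirical conditional probabilities from the toy table above.
records = [
    {"interest": "Finance",  "gender": "Male",   "age": 29, "browsing": {"BBC Sports", "Bloomberg Business"}},
    {"interest": "Sports",   "gender": "Male",   "age": 21, "browsing": {"BBC Sports", "Spotify"}},
    {"interest": "Medicine", "gender": "Female", "age": 32, "browsing": {"PubMed"}},
    {"interest": "Music",    "gender": "Female", "age": 25, "browsing": {"Spotify"}},
    {"interest": "Medicine", "gender": "Male",   "age": 40, "browsing": {"BBC Sports", "PubMed", "Bloomberg Business"}},
]

def p_conditional(target_key, target_value, browsed):
    """Empirical p(target = value | user browsed all sites in `browsed`)."""
    matching = [r for r in records if browsed <= r["browsing"]]
    if not matching:
        return 0.0
    hits = [r for r in matching if r[target_key] == target_value]
    return len(hits) / len(matching)

evidence = {"BBC Sports", "Bloomberg Business"}
print(p_conditional("interest", "Finance", evidence))  # 0.5 on this 5-row table
print(p_conditional("gender", "Male", evidence))       # 1.0 on this 5-row table
```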

SLIDE 5

Problem Setting

  • First build and learn p(x), then infer the conditional dependence p(x_t | x_i)
    • Unsupervised learning
    • Each dimension of x is equally treated
  • Directly learn the conditional dependence p(x_t | x_i)
    • Supervised learning
    • x_t is the label to predict

SLIDE 6

Definition of Unsupervised Learning

  • Given the training dataset $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data
  • Probability density function (p.d.f.) estimation: $p(x)$
  • Latent variables: $z \to x$
  • Good data representation (used for discrimination): $\phi(x)$

SLIDE 7

Uses of Unsupervised Learning

  • Data structure discovery, data science
  • Data compression
  • Outlier detection
  • Input to supervised/reinforcement algorithms (causes may be more simply related to outputs or rewards)

  • A theory of biological learning and perception

Slide credit: Maneesh Sahani

SLIDE 8

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
  • Deep Unsupervised Learning
  • Auto-encoders
  • Generative adversarial nets
SLIDE 9

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
  • Deep Unsupervised Learning
  • Auto-encoders
  • Generative adversarial nets
SLIDE 10

K-Means Clustering

SLIDE 11

K-Means Clustering

SLIDE 12

K-Means Clustering

  • Provide the number of desired clusters k
  • Randomly choose k instances as seeds, one per cluster, i.e. the centroid for each cluster
  • Iterate
    • Assign each instance to the cluster with the closest centroid
    • Re-estimate the centroid of each cluster
  • Stop when the clustering converges, or after a fixed number of iterations

Slide credit: Ray Mooney
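The procedure above translates almost directly into a few lines of NumPy. The sketch below is illustrative only (the function name, random initialization and convergence test are my choices, not from the slides), and it assumes Euclidean distance.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: X is an (n, d) data matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly choose k instances as the initial centroids (seeds).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each instance to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, k)
        assign = dists.argmin(axis=1)
        # Re-estimate the centroid of each cluster as the mean of its points.
        new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when clustering converges
            break
        centroids = new_centroids
    return centroids, assign

# Toy usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, assign = kmeans(X, k=2)
```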

SLIDE 13

K-Means Clustering: Centroid

  • Assume instances are real-valued vectors $x \in \mathbb{R}^d$
  • Clusters are based on centroids, i.e. the center of gravity or mean of the points in a cluster $C_k$:

$\mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$

Slide credit: Ray Mooney

SLIDE 14

K-Means Clustering: Distance

  • Distance to a centroid: $L(x, \mu_k)$
  • Euclidean distance (L2 norm):

$L_2(x, \mu_k) = \|x - \mu_k\| = \sqrt{\sum_{m=1}^{d} (x_m - \mu_{k,m})^2}$

  • Manhattan distance (L1 norm):

$L_1(x, \mu_k) = |x - \mu_k| = \sum_{m=1}^{d} |x_m - \mu_{k,m}|$

  • Cosine distance:

$L_{\cos}(x, \mu_k) = 1 - \frac{x^\top \mu_k}{|x| \cdot |\mu_k|}$

Slide credit: Ray Mooney
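As a small hedged sketch (my own code, not from the slides), the three distance measures can be written as NumPy helpers:

```python
import numpy as np

def l2_distance(x, mu):
    # Euclidean (L2) distance between a point and a centroid.
    return np.sqrt(np.sum((x - mu) ** 2))

def l1_distance(x, mu):
    # L1 (Manhattan) distance: sum of absolute coordinate differences.
    return np.sum(np.abs(x - mu))

def cosine_distance(x, mu):
    # Cosine distance: 1 minus the cosine similarity of the two vectors.
    return 1.0 - x @ mu / (np.linalg.norm(x) * np.linalg.norm(mu))
```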

SLIDE 15

K-Means Example (K=2)

[Figure: K-means with K=2: pick seeds, assign points to clusters, compute centroids, re-assign clusters, compute centroids, re-assign clusters, converged]

Slide credit: Ray Mooney

SLIDE 16

K-Means Time Complexity

  • Assume computing the distance between two instances is O(d), where d is the dimensionality of the vectors
  • Reassigning clusters: O(knd) distance computations
  • Computing centroids: each instance vector gets added once to some centroid: O(nd)
  • Assume these two steps are each done once for I iterations: O(Iknd)

Slide credit: Ray Mooney

SLIDE 17

K-Means Clustering Objective

  • The objective of K-means is to minimize the total sum of the squared distance of every point to its corresponding cluster centroid:

$\min_{\{\mu_k\}_{k=1}^{K}} \sum_{k=1}^{K} \sum_{x \in C_k} L(x - \mu_k), \qquad \mu_k = \frac{1}{|C_k|} \sum_{x \in C_k} x$

  • Finding the global optimum is NP-hard.
  • The K-means algorithm is guaranteed to converge to a local optimum.

SLIDE 18

Seed Choice

  • Results can vary based on random seed selection.
  • Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
  • Select good seeds using a heuristic or the results of another method.

SLIDE 19

Clustering Applications

  • Text mining
    • Cluster documents for related search
    • Cluster words for query suggestion
  • Recommender systems and advertising
    • Cluster users for item/ad recommendation
    • Cluster items for related item suggestion
  • Image search
    • Cluster images for similar image search and duplication detection
  • Speech recognition or separation
    • Cluster phonetic features
SLIDE 20

Principal Component Analysis (PCA)

  • An example of 2-dimensional data
    • x1: the piloting skill of the pilot
    • x2: how much he/she enjoys flying
  • Main components
    • u1: the intrinsic piloting “karma” of a person
    • u2: some noise

Example credit: Andrew Ng

SLIDE 21

Principal Component Analysis (PCA)

  • PCA tries to identify the subspace in which the data approximately lies
  • PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
  • The number of principal components is less than or equal to the smaller of the number of original variables or the number of observations.

$\mathbb{R}^d \to \mathbb{R}^k, \quad k \ll d$

SLIDE 22

PCA Data Preprocessing

  • Typically we first pre-process the data to normalize its mean and variance
  • Given the dataset $D = \{x^{(i)}\}_{i=1}^{m}$
  • 1. Move the center of the data set to 0:

$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}, \qquad x^{(i)} \leftarrow x^{(i)} - \mu$

  • 2. Unify the variance of each variable:

$\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} (x_j^{(i)})^2, \qquad x_j^{(i)} \leftarrow x_j^{(i)} / \sigma_j$
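A minimal sketch of these two preprocessing steps in NumPy (the function and variable names are mine):

```python
import numpy as np

def pca_preprocess(X):
    """Center each feature at 0 and rescale it to unit variance; X has shape (m, d)."""
    mu = X.mean(axis=0)                              # step 1: mean of each coordinate
    X_centered = X - mu                              # move the center of the data set to 0
    sigma = np.sqrt((X_centered ** 2).mean(axis=0))  # per-coordinate standard deviation
    sigma[sigma == 0] = 1.0                          # guard against constant features
    return X_centered / sigma                        # step 2: unify the variance of each variable
```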

SLIDE 23

PCA Data Preprocessing

  • Zero out the mean of the data
  • Rescale each coordinate to have unit variance, which ensures that different attributes are all treated on the same “scale”.

SLIDE 24

PCA Solution

  • PCA finds the directions along which the data has the largest variance
  • These correspond to the eigenvectors of the matrix $X^\top X$ with the largest eigenvalues

SLIDE 25

PCA Solution: Data Projection

  • The projection of each point $x^{(i)}$ onto a direction $u$ (with $\|u\| = 1$) is $x^{(i)\top} u$
  • The variance of the projection:

$\frac{1}{m} \sum_{i=1}^{m} (x^{(i)\top} u)^2 = \frac{1}{m} \sum_{i=1}^{m} u^\top x^{(i)} x^{(i)\top} u = u^\top \Big( \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)\top} \Big) u \equiv u^\top \Sigma u$

SLIDE 26


PCA Solution: Largest Eigenvalues

  • Finding the k principal components of the data means finding the k principal eigenvectors of $\Sigma$, i.e. the top-k eigenvectors with the largest eigenvalues:

$\max_{u} \; u^\top \Sigma u \quad \text{s.t.} \; \|u\| = 1, \qquad \Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)\top}$

  • Projected vector for $x^{(i)}$:

$y^{(i)} = \begin{bmatrix} u_1^\top x^{(i)} \\ u_2^\top x^{(i)} \\ \vdots \\ u_k^\top x^{(i)} \end{bmatrix} \in \mathbb{R}^k$
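Putting the last two slides together, a hedged NumPy sketch of PCA via eigendecomposition of the covariance matrix (it assumes the data has already been preprocessed as above; the names are my own):

```python
import numpy as np

def pca(X, k):
    """Project preprocessed data X of shape (m, d) onto its top-k principal components."""
    m = X.shape[0]
    Sigma = X.T @ X / m                 # empirical covariance (data already centered)
    w, U = np.linalg.eigh(Sigma)        # eigenvalues in ascending order, eigenvectors as columns
    top = np.argsort(w)[::-1][:k]       # indices of the k largest eigenvalues
    U_k = U[:, top]                     # (d, k) matrix of principal eigenvectors
    Y = X @ U_k                         # y^(i) = [u_1^T x^(i), ..., u_k^T x^(i)]
    return Y, U_k, w[top]
```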

SLIDE 27

Eigendecomposition Revisit

  • For a positive semi-definite square matrix $\Sigma_{d \times d}$, suppose $u$ is an eigenvector with scalar eigenvalue $w$:

$\Sigma u = w u, \qquad \|u\| = 1$

  • There are d eigenvector-eigenvalue pairs $(u_i, w_i)$
  • These d eigenvectors are orthogonal, thus they form an orthonormal basis:

$\sum_{i=1}^{d} u_i u_i^\top = I$

  • Thus any vector $v$ can be written as

$v = \Big( \sum_{i=1}^{d} u_i u_i^\top \Big) v = \sum_{i=1}^{d} (u_i^\top v) u_i = \sum_{i=1}^{d} v_{(i)} u_i$

  • $\Sigma_{d \times d}$ can be written as

$\Sigma = \sum_{i=1}^{d} w_i u_i u_i^\top = U W U^\top, \qquad U = [u_1, u_2, \ldots, u_d], \quad W = \mathrm{diag}(w_1, w_2, \ldots, w_d)$

SLIDE 28

Eigendecomposition Revisit

  • Given the data

$X = \begin{bmatrix} x_1^\top \\ x_2^\top \\ \vdots \\ x_n^\top \end{bmatrix}$

and its covariance matrix $\Sigma = X^\top X$ (here we may drop the factor 1/m for simplicity)

  • The variance in any direction $v$ is

$\|X v\|^2 = \Big\| X \Big( \sum_{i=1}^{d} v_{(i)} u_i \Big) \Big\|^2 = \sum_{i,j} v_{(i)} v_{(j)} u_i^\top \Sigma u_j = \sum_{i=1}^{d} v_{(i)}^2 w_i$

where $v_{(i)}$ is the projection length of $v$ on $u_i$

  • The variance in direction $u_i$ is

$\|X u_i\|^2 = u_i^\top X^\top X u_i = u_i^\top \Sigma u_i = u_i^\top w_i u_i = w_i$

  • If $v^\top v = 1$, then

$\arg\max_{\|v\|=1} \|X v\|^2 = u_{(\max)}$

i.e. the direction of greatest variance is the eigenvector with the largest eigenvalue.

SLIDE 29

PCA Discussion

  • PCA can also be derived by picking the basis that minimizes the approximation error arising from projecting the data onto the k-dimensional subspace spanned by that basis.

SLIDE 30

PCA Visualization

http://setosa.io/ev/principal-component-analysis/

SLIDE 31

PCA Visualization

http://setosa.io/ev/principal-component-analysis/

SLIDE 32

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
  • Deep Unsupervised Learning
  • Auto-encoders
  • Generative adversarial nets
SLIDE 33

Mixture Gaussian

SLIDE 34

Mixture Gaussian

SLIDE 35

Graphical Model for Mixture Gaussian

  • Given a training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$
  • Model the data by specifying a joint distribution

$p(x^{(i)}, z^{(i)}) = p(x^{(i)} | z^{(i)}) \, p(z^{(i)}), \qquad z^{(i)} \sim \mathrm{Multinomial}(\phi), \quad p(z^{(i)} = j) = \phi_j, \qquad x^{(i)} | z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$

  • z is the latent variable: the Gaussian cluster ID, indicating which Gaussian each x comes from
  • x are the observed data points
  • φ holds the parameters of the latent variable distribution
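The generative story (draw a cluster ID from the multinomial, then a point from that Gaussian) can be simulated directly; a small illustrative sketch with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mixture parameters, purely for illustration (k = 2 Gaussians in 2-D).
phi = np.array([0.3, 0.7])                       # p(z = j) = phi_j
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def sample_mixture(n):
    zs = rng.choice(len(phi), size=n, p=phi)     # z^(i) ~ Multinomial(phi)
    xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
    return xs, zs                                # x^(i) | z^(i) = j ~ N(mu_j, Sigma_j)

X, Z = sample_mixture(500)
```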

SLIDE 36

Data Likelihood

  • We want to maximize the data log-likelihood

$l(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \sum_{z^{(i)}=1}^{k} p(x^{(i)} | z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi) = \sum_{i=1}^{m} \log \sum_{j=1}^{k} \mathcal{N}(x^{(i)} | \mu_j, \Sigma_j) \, \phi_j$

  • There is no closed-form solution obtained by simply setting

$\frac{\partial l(\phi, \mu, \Sigma)}{\partial \phi} = 0, \qquad \frac{\partial l(\phi, \mu, \Sigma)}{\partial \mu} = 0, \qquad \frac{\partial l(\phi, \mu, \Sigma)}{\partial \Sigma} = 0$

SLIDE 37

Data Likelihood Maximization

  • For each data point x(i), the latent variable z(i) indicates which Gaussian it comes from
  • If we knew z(i), the data likelihood would be

$l(\phi, \mu, \Sigma) = \sum_{i=1}^{m} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{m} \log \big[ p(x^{(i)} | z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi) \big] = \sum_{i=1}^{m} \Big[ \log \mathcal{N}(x^{(i)} | \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \Big]$

SLIDE 38

Data Likelihood Maximization

  • Given z(i), maximize the data likelihood

$\max_{\phi, \mu, \Sigma} l(\phi, \mu, \Sigma) = \max_{\phi, \mu, \Sigma} \sum_{i=1}^{m} \Big[ \log \mathcal{N}(x^{(i)} | \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log p(z^{(i)}; \phi) \Big]$

  • It is easy to get the solution

$\phi_j = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}, \qquad \mu_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\} \, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\} \, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} \mathbf{1}\{z^{(i)} = j\}}$

SLIDE 39

Latent Variable Inference

  • Given the parameters μ, Σ, ϕ, it is not hard to infer the posterior of the latent variable z(i) for each instance:

$p(z^{(i)} = j \,|\, x^{(i)}; \phi, \mu, \Sigma) = \frac{p(z^{(i)} = j, x^{(i)}; \phi, \mu, \Sigma)}{p(x^{(i)}; \phi, \mu, \Sigma)} = \frac{p(x^{(i)} | z^{(i)} = j; \mu, \Sigma) \, p(z^{(i)} = j; \phi)}{\sum_{l=1}^{k} p(x^{(i)} | z^{(i)} = l; \mu, \Sigma) \, p(z^{(i)} = l; \phi)}$

where

  • the prior of z(i) is $p(z^{(i)} = j; \phi)$
  • the likelihood is $p(x^{(i)} | z^{(i)} = j; \mu, \Sigma)$

  • Then update the parameters μ, Σ, ϕ based on our guess of the z(i)’s

SLIDE 40

Expectation Maximization Methods

  • E-step: infer the posterior distribution of the latent variables given the model parameters
  • M-step: tune the parameters to maximize the data likelihood given the latent variable distribution
  • EM methods: iteratively execute the E-step and M-step until convergence

SLIDE 41

EM Methods for Mixture Gaussians

  • Mixture Gaussian example

Repeat until convergence: {

  (E-step) For each i, j, set

  $w_j^{(i)} = p(z^{(i)} = j \,|\, x^{(i)}; \phi, \mu, \Sigma)$

  (M-step) Update the parameters

  $\phi_j = \frac{1}{m} \sum_{i=1}^{m} w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^{m} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{m} w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{m} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{m} w_j^{(i)}}$

}
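The E-step and M-step above map almost line-for-line onto NumPy. The sketch below is my own (it uses scipy's Gaussian density and adds a small covariance regularizer, details the slides do not specify):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iters=50, seed=0):
    """EM for a mixture of k Gaussians on data X of shape (m, d)."""
    m, d = X.shape
    rng = np.random.default_rng(seed)
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)]              # initialize means at random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: w[i, j] = p(z^(i) = j | x^(i); phi, mu, Sigma)
        w = np.array([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                      for j in range(k)]).T                   # shape (m, k)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate phi, mu, Sigma with the soft assignments w
        Nk = w.sum(axis=0)                                    # effective cluster sizes
        phi = Nk / m
        mu = (w.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
    return phi, mu, Sigma, w
```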

SLIDE 42

General EM Methods

  • Claims:
    • 1. After each E-M step, the data likelihood will not decrease
    • 2. The EM algorithm finds a (local) maximum of a latent variable model likelihood
  • Now let’s discuss the general EM method and verify its effectiveness in improving the data likelihood, as well as its convergence

SLIDE 43

Jensen’s Inequality

  • Theorem. Let f be a convex function, and let X be a random variable. Then:

$E[f(X)] \ge f(E[X])$

  • Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).

SLIDE 44

Jensen’s Inequality

$E[f(X)] \ge f(E[X])$

Figure credit: Andrew Ng

SLIDE 45

Jensen’s Inequality

Figure credit: Maneesh Sahani

SLIDE 46

General EM Methods: Problem

  • Given the training dataset $D = \{x_i\}_{i=1,2,\ldots,N}$, let the machine learn the underlying patterns of the data
  • Assume latent variables $z \to x$
  • We wish to fit the parameters of a model p(x, z) to the data, where the log-likelihood is

$l(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z} p(x^{(i)}, z; \theta)$

SLIDE 47

General EM Methods: Problems

  • EM methods solve problems where explicitly finding the maximum likelihood estimate (MLE) is hard:

$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)$

  • But given z(i) observed, the MLE is easy:

$\theta^* = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x^{(i)} | z^{(i)}; \theta)$

  • EM methods give an efficient solution for MLE by iteratively doing
    • E-step: construct a (good) lower bound of the log-likelihood
    • M-step: optimize that lower bound

SLIDE 48

General EM Methods: Lower Bound

  • For each instance i, let qi be some distribution over z(i):

$\sum_{z} q_i(z) = 1, \qquad q_i(z) \ge 0$

  • Thus the data log-likelihood

$l(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})} \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

  • The last step is Jensen’s inequality: log(x) is a concave function, so $\log E[X] \ge E[\log X]$
  • The right-hand side is a lower bound of l(θ)

SLIDE 49

General EM Methods: Lower Bound

  • Then what qi(z) should we choose?

$l(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

SLIDE 50

Jensen’s Inequality

  • Theorem. Let f be a convex function, and let X be a random variable. Then:

$E[f(X)] \ge f(E[X])$

  • Moreover, if f is strictly convex, then $E[f(X)] = f(E[X])$ holds true if and only if $X = E[X]$ with probability 1 (i.e., if X is a constant).

REVIEW

SLIDE 51

General EM Methods: Lower Bound

  • Then what qi(z) should we choose?

$l(\theta) = \sum_{i=1}^{N} \log p(x^{(i)}; \theta) \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

  • In order to make the above inequality tight (to hold with equality), it is sufficient that

$p(x^{(i)}, z^{(i)}; \theta) = q_i(z^{(i)}) \cdot c$

  • We can derive

$\log p(x^{(i)}; \theta) = \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta) = \log \sum_{z^{(i)}} q_i(z^{(i)}) \, c = \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

  • As such, qi(z) is exactly the posterior distribution:

$q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} | x^{(i)}; \theta)$

SLIDE 52

General EM Methods

Repeat until convergence: {

  (E-step) For each i, set

  $q_i(z^{(i)}) = p(z^{(i)} | x^{(i)}; \theta)$

  (M-step) Update the parameters

  $\theta = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

}
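Schematically, the general recipe is just an alternation between two functions; a minimal sketch in which e_step and m_step are placeholders for any latent-variable model (for example, the mixture-of-Gaussians updates above):

```python
def run_em(theta, e_step, m_step, n_iters=100):
    """Generic EM driver: e_step returns q(z | x; theta), m_step maximizes the lower bound."""
    for _ in range(n_iters):
        q = e_step(theta)   # E-step: posterior of the latent variables under the current theta
        theta = m_step(q)   # M-step: argmax over theta of the q-weighted expected log p(x, z; theta)
    return theta
```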

SLIDE 53

Convergence of EM

  • Denote θ(t) and θ(t+1) as the parameters of two successive iterations of EM. We prove that

$l(\theta^{(t)}) \le l(\theta^{(t+1)})$

which shows that EM always monotonically improves the log-likelihood, and thus ensures that EM will at least converge to a local optimum.

SLIDE 54

Proof of EM Convergence

  • Starting from θ(t), we choose the posterior of the latent variable

$q_i^{(t)}(z^{(i)}) = p(z^{(i)} | x^{(i)}; \theta^{(t)})$

  • This choice ensures that Jensen’s inequality holds with equality:

$l(\theta^{(t)}) = \sum_{i=1}^{N} \log \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})}$

  • The parameters θ(t+1) are then obtained by maximizing the right-hand side of the above equation
  • Thus

$l(\theta^{(t+1)}) \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{q_i^{(t)}(z^{(i)})} \ge \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{q_i^{(t)}(z^{(i)})} = l(\theta^{(t)})$

where the first inequality is the lower bound (Jensen) and the second follows from the parameter optimization in the M-step.

SLIDE 55

Remark of EM Convergence

  • If we define

$J(q, \theta) = \sum_{i=1}^{N} \sum_{z^{(i)}} q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i(z^{(i)})}$

then we know $l(\theta) \ge J(q, \theta)$

  • EM can also be viewed as coordinate ascent on J
    • The E-step maximizes it w.r.t. q
    • The M-step maximizes it w.r.t. θ

SLIDE 56

Coordinate Ascent in EM

[Figure: coordinate ascent alternately maximizing J(q, θ) over q and over θ. Figure credit: Maneesh Sahani]

$q_i^{(t)}(z^{(i)}) = p(z^{(i)} | x^{(i)}; \theta^{(t)}), \qquad \theta^{(t+1)} = \arg\max_{\theta} \sum_{i=1}^{N} \sum_{z^{(i)}} q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{q_i^{(t)}(z^{(i)})}$

SLIDE 57

Content

  • Fundamentals of Unsupervised Learning
  • K-means clustering
  • Principal component analysis
  • Probabilistic Unsupervised Learning
  • Mixture Gaussians
  • EM Methods
  • Deep Unsupervised Learning
  • Auto-encoders
  • Generative adversarial nets
SLIDE 58

Neural Nets for Unsupervised Learning

  • Basic idea: use neural networks to recover the data
  • Restricted Boltzmann Machine

SLIDE 59

Restricted Boltzmann Machine

  • An RBM is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs
  • Undirected graphical model over visible units v and hidden units h
  • Restricted: visible (hidden) units are not connected to each other
  • Energy function and joint distribution:

$E(v, h) = -\sum_{i} b_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i w_{i,j} h_j, \qquad p(v, h) = \frac{1}{Z} e^{-E(v, h)}$
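The energy and the unnormalized joint probability can be evaluated directly; a small NumPy sketch for binary units (the names are mine; the partition function Z is left out because it sums over all configurations):

```python
import numpy as np

def rbm_energy(v, h, b_v, b_h, W):
    """E(v, h) = -sum_i b_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j."""
    return -(b_v @ v) - (b_h @ h) - v @ W @ h

def unnormalized_p(v, h, b_v, b_h, W):
    # p(v, h) = exp(-E(v, h)) / Z; Z is omitted here, so this is only proportional to p.
    return np.exp(-rbm_energy(v, h, b_v, b_h, W))
```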

SLIDE 60

Deep Belief Networks

Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." science 313.5786 (2006): 504-507.

SLIDE 61

Performance of Latent Factor Analysis

[Figure: 2-D codes from latent semantic analysis based on PCA vs. a 2000-500-250-125-2 autoencoder trained as a DBN]

Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." science 313.5786 (2006): 504-507.

SLIDE 62

Auto-encoder

  • An auto-encoder is an artificial neural net used for unsupervised learning of efficient codings
  • It learns a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction

$z = \sigma(W_1 x + b_1), \qquad \tilde{x} = \sigma(W_2 z + b_2)$

  • z is regarded as the low-dimensional latent factor representation of x

SLIDE 63

Learning Auto-encoder

  • Objective: squared difference between $\tilde{x}$ and $x$

$J(W_1, b_1, W_2, b_2) = \sum_{i=1}^{m} (\tilde{x}^{(i)} - x^{(i)})^2 = \sum_{i=1}^{m} \big( W_2 z^{(i)} + b_2 - x^{(i)} \big)^2 = \sum_{i=1}^{m} \big( W_2 \sigma(W_1 x^{(i)} + b_1) + b_2 - x^{(i)} \big)^2$

  • Trained by gradient descent: $\theta \leftarrow \theta - \eta \frac{\partial J}{\partial \theta}$
  • An auto-encoder is an unsupervised learning model trained in a supervised fashion
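A minimal training sketch of this one-hidden-layer auto-encoder, written with PyTorch (the slides do not specify a framework, so that choice is an assumption of mine), using the squared reconstruction error as the objective:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)   # W1, b1
        self.dec = nn.Linear(d_latent, d_in)   # W2, b2

    def forward(self, x):
        z = torch.sigmoid(self.enc(x))         # z = sigma(W1 x + b1)
        return self.dec(z), z                  # x_tilde = W2 z + b2 (linear output, as in J)

model = AutoEncoder(d_in=784, d_latent=32)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.rand(64, 784)                        # a toy batch standing in for real data

for _ in range(100):
    x_tilde, z = model(x)
    loss = ((x_tilde - x) ** 2).sum()          # J = sum_i (x_tilde - x)^2
    opt.zero_grad()
    loss.backward()                            # theta <- theta - eta * dJ/dtheta
    opt.step()
```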

SLIDE 64

Denoising Auto-encoder

  • The clean input x is partially destroyed, yielding a corrupted input $\tilde{x} \sim q_D(\tilde{x} | x)$, e.g. by Gaussian noise
  • The corrupted input $\tilde{x}$ is mapped to a hidden representation $z = f_\theta(\tilde{x})$
  • From z, reconstruct the data $\hat{x} = g_{\theta'}(z)$, with reconstruction loss $L(x, \hat{x})$
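The corruption step is typically a single line; a sketch for the Gaussian-noise case mentioned on the slide (the noise level is an arbitrary choice of mine):

```python
import torch

def corrupt(x, noise_std=0.1):
    # q_D(x_tilde | x): add isotropic Gaussian noise to the clean input.
    return x + noise_std * torch.randn_like(x)

# Denoising training then reconstructs the *clean* x from corrupt(x):
# x_hat, _ = model(corrupt(x)); loss = ((x_hat - x) ** 2).sum()
```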

SLIDE 65

Stacked Auto-encoder

  • Layer-by-layer training
    • 1. Train the first layer to use z1 to reconstruct x
    • 2. Train the second layer to use z2 to reconstruct z1
    • 3. Train the third layer to use z3 to reconstruct z2

SLIDE 66

Some Denoising AE Examples

[Figure: original, corrupted, and reconstructed image examples]

SLIDE 67

Generative Adversarial Networks (GANs)

[Goodfellow, I., et al. 2014. Generative adversarial nets. In NIPS 2014.]

SLIDE 68

Problem Definition

  • Given a dataset $D = \{x\}$, build a model $q_\theta(x)$ of the data distribution that fits the true one $p(x)$
  • Traditional objective: maximum likelihood estimation (MLE)

$\max_{\theta} \frac{1}{|D|} \sum_{x \in D} \log q_\theta(x) \simeq \max_{\theta} \mathbb{E}_{x \sim p(x)}[\log q_\theta(x)]$

  • This checks whether true data lies in regions of high probability density under the learned model

SLIDE 69

Inconsistency of Evaluation and Use

  • Training/evaluation:

$\max_{\theta} \mathbb{E}_{x \sim p(x)}[\log q_\theta(x)], \quad \text{approximated by} \quad \max_{\theta} \frac{1}{|D|} \sum_{x \in D} \log q_\theta(x)$

    • Check whether true data has a high probability density under the learned model

  • Use:

$\max_{\theta} \mathbb{E}_{x \sim q_\theta(x)}[\log p(x)]$

    • Check whether model-generated data is considered as true as possible, given a generator q with a certain generalization ability
    • More straightforward, but it is hard or impossible to directly calculate p(x)

SLIDE 70

Generative Adversarial Nets (GANs)

  • What we really want:

$\max_{\theta} \mathbb{E}_{x \sim q_\theta(x)}[\log p(x)]$

  • But we cannot directly calculate p(x)
  • Idea: what if we build a discriminator to judge whether a data instance is true or fake (artificially generated)?
  • Leverage the strong power of deep learning based discriminative models

SLIDE 71

Generative Adversarial Nets (GANs)

  • The discriminator tries to correctly distinguish the true data from the fake model-generated data
  • The generator tries to generate high-quality data to fool the discriminator
  • G & D can be implemented via neural networks
  • Ideally, when D cannot distinguish the true and generated data, G nicely fits the true underlying data distribution

[Figure: the generator G produces data, the discriminator D compares it against real-world data]

SLIDE 72

Generator Network

  • Must be differentiable
  • No invertibility requirement
  • Trainable for any size of z
  • Can make x conditionally Gaussian given z, but need not do so
    • e.g. Variational Auto-Encoder
  • Popular implementation: multi-layer perceptron

$x = G(z; \theta^{(G)})$

SLIDE 73

Discriminator Network

  • Can be implemented by any neural network with a probabilistic prediction
  • For example
    • Multi-layer perceptron with logistic output
    • AlexNet, etc.

$P(\mathrm{real} \,|\, x) = D(x; \theta^{(D)})$

SLIDE 74

GAN: A Minimax Game

[Figure: real-world data and the generator G both feed the discriminator D]

$J^{(D)} = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

  • Discriminator: $\max_{D} J^{(D)}$
  • Generator: $\min_{G} \max_{D} J^{(D)}$
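A hedged sketch of the minimax game as alternating gradient steps in PyTorch; the tiny MLPs, the 1-D toy data and the hyper-parameters are illustrative assumptions of mine, not from the slides:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))                 # generator x = G(z)
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # D(x) = P(real | x)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

def real_batch(n=64):
    return torch.randn(n, 1) * 0.5 + 3.0        # stand-in "true" data distribution

for _ in range(1000):
    # Discriminator step: maximize J(D) = E[log D(x)] + E[log(1 - D(G(z)))]
    x = real_batch()
    z = torch.randn(64, 8)
    d_loss = -(torch.log(D(x) + 1e-8) + torch.log(1 - D(G(z).detach()) + 1e-8)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: minimize E[log(1 - D(G(z)))] (the min_G part of the same objective)
    z = torch.randn(64, 8)
    g_loss = torch.log(1 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```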

SLIDE 75

Illustration of GANs

[Figure: the data distribution, the generator distribution, and the discriminator output during training]

$J^{(D)} = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$

  • Discriminator: $\max_{D} J^{(D)}$
  • Generator: $\min_{G} \max_{D} J^{(D)}$

SLIDE 76

Ideal Final Equilibrium

  • The generator generates the perfect data distribution
  • The discriminator cannot distinguish the true and generated data

SLIDE 77

Training GANs

Training discriminator

SLIDE 78

Training GANs

Training generator

SLIDE 79

Optimal Strategy for Discriminator

  • The optimal D(x) for any $p_{\text{data}}(x)$ and $p_G(x)$ is always

$D(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$

SLIDE 80

Reformulate the Minimax Game

$\begin{aligned} \max_D J^{(D)} &= \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \\ &= \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{x \sim p_G(x)}[\log(1 - D(x))] \\ &= \mathbb{E}_{x \sim p_{\text{data}}(x)}\Big[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}\Big] + \mathbb{E}_{x \sim p_G(x)}\Big[\log \frac{p_G(x)}{p_{\text{data}}(x) + p_G(x)}\Big] \\ &= -\log 4 + \mathrm{KL}\Big(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}} + p_G}{2}\Big) + \mathrm{KL}\Big(p_G \,\Big\|\, \frac{p_{\text{data}} + p_G}{2}\Big) \end{aligned}$

  • G: $\min_{G} \max_{D} J^{(D)}$;  D: $\max_{D} J^{(D)}$
  • $\frac{p_{\text{data}} + p_G}{2}$ is something between $p_{\text{data}}$ and $p_G$

[Huszár, Ferenc. "How (not) to Train your Generative Model: Scheduled Sampling, Likelihood, Adversary?" arXiv (2015).]

SLIDE 81
GANs for Continuous Data

  • In order to take the gradient with respect to the generator parameters, x has to be continuous
    • 1. Generation
    • 2. Discrimination
    • 3. Gradient on the generated data
    • 4. Further gradient on the generator

$\min_{G} \max_{D} J(G, D), \qquad \max_{D} J(G, D)$

SLIDE 82

Case Study of GANs

  • The rightmost images in each row are the closest training images to the neighboring generated ones, which means GAN does not simply memorize training instances

SLIDE 83

High Resolution and Quality Images

  • Progressive Growing of GANs

Two imaginary celebrities that were dreamed up by a random number generator.

Tero Karras et al. Progressive Growing of GANs for Improved Quality, Stability, and Variation. ICLR 2018.

SLIDE 84

Single Image Super-Resolution

Ledig, Christian, et al. "Photo-realistic single image super-resolution using a generative adversarial network." CVPR 2017.

deep residual generative adversarial network optimized for a loss more sensitive to human perception [4× upscaling]

SLIDE 85

Image to Image Translation

Isola, Phillip, et al. "Image-to-image translation with conditional adversarial networks." CVPR 2017.

SLIDE 86

Grayscale Image Colorization

Yun Cao, Weinan Zhang etc. Unsupervised Diverse Colorization via Generative Adversarial Networks. ECML-PKDD 2017.

[Figure: ground-truth images alongside generated colorizations of their grayscale versions]

SLIDE 87

High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs", arXiv preprint arXiv:1711.11585.