SLIDE 1

Unsupervised Learning

Shan-Hung Wu

shwu@cs.nthu.edu.tw

Department of Computer Science, National Tsing Hua University, Taiwan

Machine Learning

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 1 / 81

SLIDE 2

Outline

1. Unsupervised Learning
2. Self-Supervised Learning
3. Autoencoders & Manifold Learning
4. Generative Adversarial Networks: The Basics; Challenges; More GANs

SLIDE 4

Unsupervised Learning

Dataset: X = {x^(i)}_i, where the x^(i)'s are i.i.d. samples of x

No supervision (e.g., labels)

What can we learn?

SLIDE 6

Clustering I

Goal: to divide the x^(i)'s into K groups/clusters

Based on some pairwise similarity/distance measure

SLIDE 7

Clustering II

K-means algorithm (K fixed)

Hierarchical clustering (variable K)
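A minimal sketch of the K-means loop described above, under illustrative assumptions (toy 2-D data, Euclidean distance as the pairwise measure, a fixed random seed); not the lecture's reference implementation:

```python
# Minimal K-means sketch: alternate assignment and center-update steps.
import numpy as np

def kmeans(X, K, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers as K distinct data points
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Update step: each center becomes the mean of its cluster
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

# Two well-separated toy blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centers, labels = kmeans(X, K=2)
```

On well-separated data like this, the two blobs end up in different clusters; hierarchical clustering would instead build a dendrogram and let K be chosen afterwards.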

SLIDE 9

Factorization and Recommendation

Goal: to uncover the factors behind X

Commonly used in recommender systems

Let X, with X_{i,:} = x^(i)⊤, be a rating matrix

Non-negative matrix factorization (NMF) [12, 13]: argmin_{W≥0, H≥0} ‖X − WH‖²_F

X* = W*H* is a dense matrix and can be used to predict user interests
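A sketch of the NMF objective above, solved with the standard Lee–Seung multiplicative updates (an assumption; the slide does not fix a solver), on a hypothetical low-rank "rating" matrix:

```python
# NMF sketch: minimize ||X - WH||_F^2 subject to W >= 0, H >= 0
# using multiplicative updates, which preserve non-negativity.
import numpy as np

def nmf(X, r, n_iters=1000, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, r)) + 0.1
    H = rng.random((r, m)) + 0.1
    for _ in range(n_iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # keeps H >= 0
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # keeps W >= 0
    return W, H

# A rank-2 non-negative matrix is recovered almost exactly
rng = np.random.default_rng(1)
Wt, Ht = rng.random((6, 2)), rng.random((2, 5))
X = Wt @ Ht
W, H = nmf(X, r=2)
err = np.linalg.norm(X - W @ H)
```

The reconstructed W@H is dense even if X had missing/zero entries, which is what makes the factorization usable for predicting unobserved user–item ratings.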

SLIDE 12

Dimension Reduction

Goal: to learn a low-dimensional representation z of x

E.g., PCA
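A sketch of PCA as the dimension-reduction example mentioned above (assumptions: centering plus SVD as the solver, and toy 3-D data that lie near a 1-D subspace):

```python
# PCA sketch: project x onto the top-k principal directions to get z.
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                      # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                            # low-dimensional code z
    return Z, Vt[:k]                             # codes and principal directions

rng = np.random.default_rng(0)
# 3-D points that actually live near a 1-D line, plus small noise
t = rng.normal(size=(100, 1))
X = t @ np.array([[1.0, 2.0, 3.0]]) + 0.01 * rng.normal(size=(100, 3))
Z, components = pca(X, k=1)
```

Here a single coordinate z recovers the centered data almost exactly, illustrating why one low-dimensional representation can suffice when data concentrate near a subspace.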

SLIDE 13

Self-Supervised Learning

Goal: to learn a model that is able to "fill in the blanks"

Also called "predictive learning"

Links unsupervised tasks with supervised models:
For supervised tasks: easier data collection
For unsupervised tasks: better representations of x and/or y, thanks to DL models

SLIDE 16

Manifold Learning

Goal: to learn the underlying manifold of x

E.g., given a point x^(i), output the tangent vectors at x^(i)

SLIDE 17

Data Synthesis/Generation I

Goal: to generate new samples of x

Generative adversarial networks (GANs)

SLIDE 19

Data Synthesis/Generation II

Conditional GANs, e.g., text-to-image synthesis: "This bird is completely red with black wings and pointy beak."


SLIDE 21

Self-Supervised Learning

Goal: to learn a model for blank filling

E.g., word2vec [17, 16]: "... the cat sat on ..." (CBOW and Skip-Gram)

The latent representation h encodes the semantics of a word

No need for a synonym dictionary; big data already tell us that

SLIDE 24

Doc2Vec

How to encode a document?

Bag of words (TF-IDF), averaged word2vec, etc.

These do not capture the semantics carried by sentence/paragraph/document structure:
"John likes Mary" ≠ "Mary likes John"

Self-supervised learning for docs? Doc2vec [10]: captures the context not explained by the words alone

SLIDE 28

Filling Images

How? PixelRNN [25], PixelCNN [24]

SLIDE 30

More

Predicting the future by watching unlabeled videos [14, 7, 27]

Better representations/predictions without the need for labels


SLIDE 33

Autoencoders I

Encoder: to learn a low-dimensional representation c (called the code) of input x
Decoder: to reconstruct x from c

Cost function: argmin_Θ −log P(X|Θ) = argmin_Θ −∑_n log P(x^(n)|Θ)

Sigmoid output units a^(L)_j = ρ̂_j for x_j ∼ Bernoulli(ρ_j):
P(x^(n)_j|Θ) = (a^(L)_j)^{x^(n)_j} (1 − a^(L)_j)^{1 − x^(n)_j}

Linear output units a^(L) = z^(L) = μ̂ for x ∼ N(μ, Σ):
−log P(x^(n)|Θ) ∝ ‖x^(n) − a^(L)‖²
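The two output-unit choices above correspond to familiar reconstruction losses. A small sketch (with arbitrary toy vectors, not real network outputs): the Bernoulli negative log-likelihood is exactly the cross-entropy, and the Gaussian one is, up to constants, the squared error:

```python
import numpy as np

def bernoulli_nll(x, a):
    # -log P(x|Θ) for sigmoid outputs a = ρ̂: the cross-entropy loss
    return -np.sum(x * np.log(a) + (1 - x) * np.log(1 - a))

def gaussian_nll_prop(x, a):
    # For linear outputs a = μ̂ with fixed covariance, -log P(x|Θ) is
    # the squared error up to an additive/multiplicative constant
    return np.sum((x - a) ** 2)

x = np.array([1.0, 0.0, 1.0])      # e.g., binarized pixels
a = np.array([0.9, 0.2, 0.8])      # decoder outputs
ce = bernoulli_nll(x, a)
se = gaussian_nll_prop(x, a)
```

This is why minimizing −log P(X|Θ) with linear outputs reduces to minimizing a mean-squared reconstruction error, a fact the "Why Blurry Images?" slides return to later.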

SLIDE 37

Convolutional Autoencoders

Convolution + deconvolution layers

Decoder is a simplified DeconvNet [28] trained from scratch:
Unpooling → upsampling (no need to remember max positions)
Deconvolution → convolution

SLIDE 41

Codes & Reconstructed x

A 32-bit code can roughly represent a 32×32 (1024-dimensional) MNIST image

SLIDE 42

Manifolds I

In many applications, data concentrate around one or more low-dimensional manifolds

A manifold is a topological space that is locally linear

SLIDE 44

Manifolds II

For each point x on a manifold, we have its tangent space, spanned by tangent vectors

These local directions specify how one can change x infinitesimally while staying on the manifold

SLIDE 45

Learning Manifolds I

How to make the code c produced by an autoencoder denote a coordinate on a low-dimensional manifold?

Contractive autoencoder [20]: regularizes the code c such that it is invariant to local changes of x:
Ω(c) = ∑_n ‖∂c^(n)/∂x^(n)‖²_F,
where ∂c^(n)/∂x^(n) is a Jacobian matrix

Hence, c represents only the variations needed to reconstruct x; i.e., c changes most along tangent vectors
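The penalty Ω(c) can be made concrete with a toy stand-in encoder (an assumption here, tanh of a linear map, not a trained network) and a finite-difference estimate of the Jacobian:

```python
# Sketch: estimate Ω(c) = ||∂c/∂x||_F^2 for a toy encoder by finite differences.
import numpy as np

W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])          # hypothetical 3-D input -> 2-D code

def encoder(x):
    return np.tanh(W @ x)

def jacobian_fd(f, x, h=1e-6):
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - fx) / h     # one column per input coordinate
    return J

x = np.array([0.0, 0.0, 0.0])
J = jacobian_fd(encoder, x)
penalty = np.sum(J ** 2)                  # squared Frobenius norm of ∂c/∂x
```

At x = 0 this encoder's Jacobian is just W (tanh'(0) = 1), so the penalty equals ‖W‖²_F = 2; training with this term pushes the code to be flat in most input directions, leaving sensitivity only along the manifold.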

SLIDE 48

Learning Manifolds II

In practice, it is easier to train a denoising autoencoder [26]:

Encoder: to encode x corrupted with random noise
Decoder: to reconstruct x without the noise

SLIDE 50

Getting Tangent Vectors I

The code c represents a coordinate on a low-dimensional manifold (e.g., the blue line)

How to get the tangent vectors at a given c?

SLIDE 51

Getting Tangent Vectors II

Recall: directions in the input space that change c most should be tangent vectors

Given a point x, let c be the code of x and J(x) = ∂c/∂x be the Jacobian matrix of c at x

J(x) summarizes how c changes in terms of x

1. Decompose J(x) using SVD such that J(x) = UDV⊤
2. Let the tangent vectors be the right-singular vectors (columns of V) corresponding to the largest singular values in D
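The two-step recipe above can be sketched directly with numpy (the Jacobian here is a made-up matrix standing in for ∂c/∂x of a trained encoder; note `np.linalg.svd` returns V⊤ with singular values sorted in descending order):

```python
# Sketch: tangent vectors from the SVD of the code Jacobian J(x) = U D V^T.
import numpy as np

# Hypothetical Jacobian of a 2-D code over a 4-D input: the code is very
# sensitive along input direction e1, and barely sensitive elsewhere.
J = np.array([[3.0, 0.0,  0.0, 0.0],
              [0.0, 1e-3, 0.0, 2.0]])

U, D, Vt = np.linalg.svd(J, full_matrices=False)
k = 1                       # keep directions with the largest singular values
tangents = Vt[:k]           # each row is a tangent vector in input space
```

Here the dominant singular value (3.0) picks out e1 as the tangent direction, matching the intuition that tangent vectors are the input directions along which c changes most.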

SLIDE 55

Getting Tangent Vectors III

In practice, J(x) usually has only a few large singular values

Tangent vectors found by contractive/denoising autoencoders can be used by Tangent Prop [23]:
Let {v^(i,j)}_j be the tangent vectors of each example x^(i)
Train an NN classifier f with the cost penalty Ω[f] = ∑_{i,j} (∇_x f(x^(i))⊤ v^(i,j))²

Points on the same manifold share the same label


SLIDE 58

Decoder as Data Generator

The decoder of an autoencoder can be used to generate data points, even from synthetic codes

Problems:
Same c, same output → dropout layers, variational autoencoders [9]
Blurry images

SLIDE 62

Why Blurry Images?

Cost function: argmin_Θ −log P(X|Θ) = argmin_Θ −∑_n log P(x^(n)|Θ)

Image generation uses linear output units a^(L) = z^(L) = μ̂ for x ∼ N(μ, Σ), so −log P(x^(n)|Θ) ∝ ‖x^(n) − a^(L)‖²

A better assumed distribution for x? P(x) may be very complex

A better "goodness" measure? Why not use an NN to tell whether a generated image is of good quality?


SLIDE 67

Generative Adversarial Networks (GANs)

Generative adversarial network (GAN) [4]:

Generator g: to generate data points from random codes
No need for an "encoder," since the task is data synthesis

Discriminator f: to separate generated points from real ones
The weights for x and x̂ are tied
A binary classifier with sigmoid output unit a^(L) = ρ̂ modeling P(y = true point | x), y ∼ Bernoulli(ρ)

Goal: to train a g that tricks f into believing g(c) is real

SLIDE 71

Cost Function

Given N real training points and N generated points:

argmin_{Θg} max_{Θf} log P(X|Θg, Θf)
= argmin_{Θg} max_{Θf} ∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m))))
= argmin_{Θg} max_{Θf} ∑_{n=1}^{N} log ρ̂^(n) + ∑_{m=1}^{N} log(1 − ρ̂^(m))

Recall that f maximizes the log likelihood:
log P(X|Θ) ∝ ∑_n log P(y^(n)|x^(n), Θ) = ∑_n log[(ρ̂^(n))^{y^(n)} (1 − ρ̂^(n))^{1 − y^(n)}]

Inner max first, then outer min

ρ̂^(n) depends on Θf only; ρ̂^(m) depends on both Θf and Θg

SLIDE 75

Training: Alternating SGD

argmin_{Θg} max_{Θf} ∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m))))

Initialize Θg for g and Θf for f. At each SGD step/iteration:

1. Repeat K times (with fixed Θg):
   1. Sample N real points {x^(n)}_n from X and N codes from c ∼ N(0, I)
   2. Θf ← Θf + η ∇_{Θf} [∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m))))]
2. Execute once (with fixed Θf):
   1. Sample N codes from c ∼ N(0, I)
   2. Θg ← Θg − η ∇_{Θg} [∑_m log(1 − f(g(c^(m))))]

Why limit the number of steps (K) when updating Θf?

f may overfit the data and give very different values once g is updated; limiting K prevents g from being updated toward a "wrong" target
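A minimal runnable instance of this alternating loop, under strong simplifying assumptions (1-D real data x ∼ N(1, 1), a linear generator g(c) = θ1·c + θ0, a logistic discriminator f(x) = σ(w·x + u), hand-derived gradients, illustrative hyperparameters):

```python
# Tiny GAN via alternating SGD: the generator learns to shift its samples
# from mean 0 toward the real data mean of 1.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.0, 1.0])   # generator params: [θ0 (shift), θ1 (scale)]
wf = np.array([0.0, 0.0])      # discriminator params: [w, u]
eta, K, N = 0.05, 3, 64

for step in range(1000):
    # 1. Repeat K times with fixed Θg: ascend the discriminator objective
    for _ in range(K):
        x = rng.normal(1.0, 1.0, N)              # real points
        c = rng.normal(0.0, 1.0, N)              # codes
        xh = theta[1] * c + theta[0]             # generated points
        fr = sigmoid(wf[0] * x + wf[1])
        ff = sigmoid(wf[0] * xh + wf[1])
        grad_w = np.mean((1 - fr) * x) - np.mean(ff * xh)
        grad_u = np.mean(1 - fr) - np.mean(ff)
        wf += eta * np.array([grad_w, grad_u])
    # 2. Once with fixed Θf: descend Σ_m log(1 − f(g(c)))
    c = rng.normal(0.0, 1.0, N)
    xh = theta[1] * c + theta[0]
    ff = sigmoid(wf[0] * xh + wf[1])
    # d/dθ log(1 − f(x̂)) = −f(x̂)·w·∂x̂/∂θ, with ∂x̂/∂θ = [1, c]
    theta -= eta * np.array([np.mean(-ff * wf[0]),
                             np.mean(-ff * wf[0] * c)])
```

After training, the generator's mean θ0 has moved from 0 toward the real mean 1, at which point f can no longer separate real from fake and w decays toward 0.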

SLIDE 80

Results

Domain-specific architecture, e.g., DC-GAN [18]

SLIDE 82

GANs Are Hard to Train!

See, e.g.: "Tips for Training Stable GANs"; "Keep Calm and Train a GAN: Pitfalls and Tips"; "10 Lessons I Learned Training GANs for One Year"; and the "GAN hacks" list on GitHub


SLIDE 84

Challenge: Non-Convergence

GAN training may not converge

The goal of a GAN is to find a saddle point:
argmin_{Θg} max_{Θf} ∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m))))

The updates to Θf and Θg may cancel each other's progress

This requires human monitoring and termination

SLIDE 88

Mode Collapsing

Even worse: mode collapsing

g may oscillate from generating one kind of point to generating another

When K is small, alternating SGD does not distinguish between
min_{Θg} max_{Θf} and max_{Θf} min_{Θg} of ∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m))))

max_{Θf} min_{Θg}? g is encouraged to map every code to the single "mode" that f believes is most likely to be real

SLIDE 91

Solutions

Minibatch discrimination [22]:
In the max_{Θf} min_{Θg} case, g collapses because the gradients ∇_{Θf} C are computed independently for each point
Why not augment each x^(n)/x̂^(n) with batch features?
If g collapses, f can tell this from the batch features and reject the fake points
Now g needs to generate dissimilar points to fool f
(Figure: samples generated without vs. with minibatch discrimination)

Unrolled GANs [15]: back-propagate through several max steps when computing ∇_{Θg} C

SLIDE 95

Challenge: Balance between g and f

argmin_{Θg} max_{Θf} ∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m))))

Alternating SGD:
Θf ← Θf + η ∇_{Θf} [∑_n log f(x^(n)) + ∑_m log(1 − f(g(c^(m))))], repeated K times
Θg ← Θg − η ∇_{Θg} [∑_m log(1 − f(g(c^(m))))]

Why limit K when updating Θf?

Too large a K:
f may overfit the data, so g is updated toward a "wrong" target f
Vanishing gradients: ∇_{Θg} [∑_m log(1 − f(g(c^(m))))] becomes too small to learn from

Too small a K:
g is updated toward a "meaningless" f

SLIDE 100

Solution: Wasserstein GAN [1]

Let f be a regressor without the sigmoid output layer

Cost function: argmin_{Θg} max_{Θf} ∑_n f(x^(n)) − ∑_m f(g(c^(m)))

Initialize Θg for g and Θf for f. At each SGD step/iteration:

1. Repeat K times (with fixed Θg):
   1. Sample N real points {x^(n)}_n from X and N codes from c ∼ N(0, I)
   2. Θf ← Θf + η clip(∇_{Θf} [∑_n f(x^(n)) − ∑_m f(g(c^(m)))])
2. Execute once (with fixed Θf):
   1. Sample N codes from c ∼ N(0, I)
   2. Θg ← Θg − η ∇_{Θg} [−∑_m f(g(c^(m)))]


slide-109
SLIDE 109

GANs from Information Theory Perspective

Review Information Theory first! Let Pdata / Pg be the distribution of x / x̂ = g(c)

A way to find g: argmin_Θg DKL(Pdata ‖ Pg)
Why not argmin_Θg DKL(Pg ‖ Pdata)?

GAN: argmin_Θg max_Θf ∑_n log f(x(n)) + ∑_m log(1 − f(g(c(m))))
Actually, the max term measures the Jensen–Shannon divergence (a.k.a. symmetric KL divergence):
DJS(Pdata ‖ Pg) = (1/2) DKL(Pdata ‖ Q) + (1/2) DKL(Pg ‖ Q), where Q = (1/2)(Pg + Pdata)

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 44 / 81


slide-115
SLIDE 115

Why Jensen-Shannon Divergence? I

Given a fixed g, we have
C∗ = max_Θf ∑_n log f(x(n)) + ∑_m log(1 − f(g(c(m))))
   = max_Θf (1/N) ∑_n log f(x(n)) + (1/N) ∑_m log(1 − f(x̂(m)))
   ≈ max_Θf E_{x∼Pdata}[log f(x)] + E_{x∼Pg}[log(1 − f(x))]
   = max_Θf ∫_x Pdata(x) log f(x) dx + ∫_x Pg(x) log(1 − f(x)) dx
   = max_Θf ∫_x [Pdata(x) log f(x) + Pg(x) log(1 − f(x))] dx

To attain C∗, we can find the f maximizing Pdata(x) log f(x) + Pg(x) log(1 − f(x)) for each x
  Assuming that f has infinite capacity

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 45 / 81


slide-120
SLIDE 120

Why Jensen-Shannon Divergence? II

Given Pdata, Pg, and x, what is the f(x) that maximizes Pdata(x) log f(x) + Pg(x) log(1 − f(x))?
f∗(x) = Pdata(x) / (Pdata(x) + Pg(x)) ∈ [0, 1] [Proof]

That is,
C∗ = max_Θf ∫_x [Pdata(x) log f(x) + Pg(x) log(1 − f(x))] dx
   = ∫_x Pdata(x) log [Pdata(x) / (Pdata(x) + Pg(x))] dx + ∫_x Pg(x) log [Pg(x) / (Pdata(x) + Pg(x))] dx
   = −2 log 2 + ∫_x Pdata(x) log [Pdata(x) / ((Pdata(x) + Pg(x))/2)] dx + ∫_x Pg(x) log [Pg(x) / ((Pdata(x) + Pg(x))/2)] dx
   = −2 log 2 + 2 DJS(Pdata ‖ Pg)

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 46 / 81
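The optimal-discriminator formula can be checked numerically. Below is an illustrative sketch (not from the slides): for fixed density values p = Pdata(x) and q = Pg(x), a grid search over f ∈ (0, 1) should land on p/(p + q).

```python
import numpy as np

# Grid-search check (illustrative) that f*(x) = Pdata(x) / (Pdata(x) + Pg(x))
# maximizes p*log(f) + q*log(1-f) for fixed density values p, q (assumed here)
p, q = 0.7, 0.2
f = np.linspace(1e-4, 1.0 - 1e-4, 100_000)
objective = p * np.log(f) + q * np.log(1.0 - f)

f_best = f[np.argmax(objective)]   # grid maximizer
f_star = p / (p + q)               # closed-form optimum from the slide
print(f_best, f_star)              # both ≈ 0.7778
```

The objective is concave in f on (0, 1), so the grid maximizer matches the closed form up to the grid spacing.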


slide-123
SLIDE 123

Balance between g and f, Revisited I

Cost function of GAN:
argmin_Θg max_Θf ∑_n log f(x(n)) + ∑_m log(1 − f(g(c(m)))) = argmin_Θg −2 log 2 + 2 DJS(Pdata ‖ Pg)

However, no matter how g changes, DJS(Pdata ‖ Pg) remains high during the GAN training process
There’s something wrong with the design of the inner max problem!

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 47 / 81


slide-131
SLIDE 131

Balance between g and f, Revisited II

GAN: argmin_Θg DJS(Pdata ‖ Pg)
0 ≤ DJS(Pdata ‖ Pg) ≤ log 2 ≈ 0.69
When does DJS(Pdata ‖ Pg) reach its maximum value?
Let’s see the case where Pg and Pdata are “disjoint”

Suppose Pg(x) ≠ 0 ⇒ Pdata(x) = 0 and Pdata(x) ≠ 0 ⇒ Pg(x) = 0 (disjoint supports). We have
DJS(Pg ‖ Pdata) = (1/2) DKL(Pg ‖ (Pg + Pdata)/2) + (1/2) DKL(Pdata ‖ (Pg + Pdata)/2)
  = (1/2) ∫_x Pg(x) log [2 Pg(x) / (Pg(x) + Pdata(x))] dx + (1/2) ∫_x Pdata(x) log [2 Pdata(x) / (Pg(x) + Pdata(x))] dx
  = (1/2) log 2 + (1/2) log 2 = log 2

Are Pg and Pdata really disjoint during the GAN training?

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 48 / 81
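A quick numeric sketch (illustrative, using small discrete distributions) confirms that the JS divergence saturates at log 2 as soon as the two supports are disjoint:

```python
import numpy as np

def kl(p, q):
    # DKL(p || q) for discrete distributions, with the 0 * log 0 = 0 convention
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    # DJS(p || q) = (1/2) DKL(p || m) + (1/2) DKL(q || m), with m = (p + q)/2
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

overlap_p = np.array([0.5, 0.5, 0.0])
overlap_q = np.array([0.0, 0.5, 0.5])
disjoint_p = np.array([0.5, 0.5, 0.0, 0.0])
disjoint_q = np.array([0.0, 0.0, 0.5, 0.5])

print(js(overlap_p, overlap_q))    # strictly between 0 and log 2
print(js(disjoint_p, disjoint_q))  # exactly log 2 ≈ 0.693
```

Once the supports are disjoint, moving Pg around (while keeping it disjoint from Pdata) leaves DJS pinned at log 2, so the divergence gives g no useful gradient.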


slide-134
SLIDE 134

Disjoining Pg and Pdata

In a high-dimensional space, x and g(c) may reside in low-dimensional manifolds
  Pg and Pdata may have nonzero values only on the manifolds

Pg and Pdata can be very different initially during GAN training
The intersections where Pg(x) ≠ 0 and Pdata(x) ≠ 0 can be neglected
  Pg(x) ≠ 0 ⇒ Pdata(x) = 0 and Pdata(x) ≠ 0 ⇒ Pg(x) = 0 almost surely
  Maximum JS divergence at all times

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 49 / 81


slide-138
SLIDE 138

Better Divergence Measure?

GAN: argmin_Θg DJS(Pdata ‖ Pg)
DJS(Pdata ‖ Pg) does not bring Pg and Pdata closer during GAN training

Wasserstein (or earth-mover) distance:
W(Pdata, Pg) = inf_{Q∈Γ(Pdata,Pg)} E_{(x,x̂)∼Q}[‖x − x̂‖] = inf_{Q∈Γ(Pdata,Pg)} ∫_{(x,x̂)} Q(x, x̂) ‖x − x̂‖ d(x, x̂)

Intuitively, the minimal “cost” to change Pdata into Pg
W(Pdata, Pg) measures the “divergence” between Pg and Pdata even when they are disjoint

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 50 / 81
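In one dimension the Wasserstein-1 distance between two equal-size empirical samples reduces to the mean absolute difference of the sorted samples (a standard fact about 1-D optimal transport), which makes the contrast with DJS easy to see. An illustrative sketch:

```python
import numpy as np

def w1(a, b):
    # 1-D Wasserstein-1 distance between equal-size empirical samples:
    # optimal transport in 1-D matches quantiles, so it is the mean
    # absolute difference of the sorted samples
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 10_000)   # stand-in for Pdata samples (assumed)
b = a + 3.0                        # same shape, shifted far enough to be disjoint in practice

print(w1(a, b))        # ≈ 3.0: grows with the shift ...
print(w1(a, a + 6.0))  # ≈ 6.0 ... while DJS would sit near log 2 for both shifts
```

Unlike DJS, the distance keeps increasing with the shift, so it still tells g which direction reduces the gap even when the supports do not overlap.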


slide-140
SLIDE 140

Wasserstein GAN I

W-GAN: argmin_Θg W(Pdata, Pg)
W(Pdata, Pg) = inf_{Q∈Γ(Pdata,Pg)} E_{(x,x̂)∼Q}[‖x − x̂‖]

Unfortunately, W(Pdata, Pg) is hard to solve directly

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 51 / 81


slide-143
SLIDE 143

Wasserstein GAN II

Theorem. Consider f’s that are Lipschitz continuous with constant 1, i.e., |f(x) − f(x̂)| ≤ 1 · ‖x − x̂‖, ∀x, x̂. Then
W(Pdata, Pg) = sup_f E_{x∼Pdata}[f(x)] − E_{x∼Pg}[f(x)] = sup_f ∫_x (Pdata(x) − Pg(x)) f(x) dx.
(See https://vincentherrmann.github.io/blog/wasserstein/)

W-GAN [1]: argmin_Θg max_Θf E_{x∼Pdata}[f(x)] − E_{x∼Pg}[f(x)]

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 52 / 81


slide-145
SLIDE 145

Alternate SGD for W-GAN

argmin_Θg max_Θf E_{x∼Pdata}[f(x)] − E_{x∼Pg}[f(x)] = argmin_Θg max_Θf ∑_n f(x(n)) − ∑_m f(g(c(m)))
  f is a regressor without the sigmoid output layer

Initialize Θg for g and Θf for f. At each SGD step/iteration:

1. Repeat K times (with fixed Θg):
   1. Sample N real points {x(n)}_n from X and N codes from c ∼ N(0, I)
   2. Θf ← Θf + η clip(∇Θf [∑_n f(x(n)) − ∑_m f(g(c(m)))])
2. Execute once (with fixed Θf):
   1. Sample N codes from c ∼ N(0, I)
   2. Θg ← Θg − η ∇Θg [−∑_m f(g(c(m)))]

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 53 / 81
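The alternating scheme can be sketched in NumPy on a toy 1-D problem. Everything here — the linear critic f(x) = w_f·x, the shift-only generator g(c) = c + b_g, and all hyperparameter values — is an illustrative assumption, not the lecture’s actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(3.0, 1.0, 1000)   # toy "real" data with mean 3
w_f, b_g = 0.1, 0.0              # critic f(x) = w_f * x; generator g(c) = c + b_g
eta, tau, K, N = 0.01, 0.5, 5, 64

def clip(w, tau):
    # Weight clipping: clip(w) = max(min(w, tau), -tau)
    return max(min(w, tau), -tau)

for step in range(3000):
    # 1. Repeat K times (fixed b_g): ascend the critic objective
    for _ in range(K):
        x = rng.choice(X, N)
        c = rng.normal(size=N)
        grad_f = x.mean() - (c + b_g).mean()  # d/dw_f of the (averaged) objective
        w_f = clip(w_f + eta * grad_f, tau)
    # 2. Execute once (fixed w_f): descend -mean f(g(c))
    grad_g = -w_f                             # d/db_g of -mean w_f * (c + b_g)
    b_g = b_g - eta * grad_g

print(b_g)  # the generated mean drifts toward the real mean 3
```

The critic’s weight stays inside [−τ, τ] by construction, and its sign tells the generator which way to shift its output distribution.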


slide-148
SLIDE 148

Why Gradient Clipping?

Update rule for Θf : Θf ← Θf + η clip(∇Θf [∑_n f(x(n)) − ∑_m f(g(c(m)))])
Gradient clipping: ∀w ∈ Θf , clip(w) = max(min(w, τ), −τ) for some threshold τ > 0
Why?

In W-GAN, we have
argmin_Θg W(Pdata, Pg) = argmin_Θg max_Θf ∑_n f(x(n)) − ∑_m f(g(c(m)))
only if f is 1-Lipschitz continuous

Clipping is a heuristic for making f 1-Lipschitz

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 54 / 81
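A small sketch of why clipping bounds the Lipschitz constant, for the linear case only (illustrative; the τ√d bound below does not hold for deep critics): f(x) = w·x has Lipschitz constant ‖w‖, and clipping every coordinate of w to [−τ, τ] caps ‖w‖ at τ√d.

```python
import numpy as np

def clip(w, tau):
    # Elementwise weight clipping: clip(w) = max(min(w, tau), -tau)
    return np.maximum(np.minimum(w, tau), -tau)

tau, d = 0.01, 100
w = np.concatenate([[5.0, -3.0], np.zeros(d - 2)])  # an un-clipped weight vector
wc = clip(w, tau)

# For f(x) = w.x, |f(x) - f(x')| <= ||w|| * ||x - x'||, so after clipping
# the Lipschitz constant is at most tau * sqrt(d)
print(np.linalg.norm(wc))  # well below tau * sqrt(d) = 0.1
```

This also hints at why the heuristic is crude: the bound depends on the dimension d and the architecture, so τ controls the Lipschitz constant only indirectly.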

slide-149
SLIDE 149

Advantages of W-GAN

“Corrects” the inner max problem:
  The Wasserstein distance guides g even when Pg and Pdata are “disjoint”
  Training is less sensitive to K (balance between g and f)
  The max value can be used as a “stop” indicator
  f is a regressor, which avoids vanishing gradients for g

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 55 / 81


slide-152
SLIDE 152

Improved W-GAN I

In practice, W-GAN training converges slowly and is sensitive to τ
W-GAN uses a small τ to make f 1-Lipschitz continuous
However, too small a τ severely limits the capacity of f, so that it cannot actually solve
max_Θf E_{x∼Pdata}[f(x)] − E_{x∼Pg}[f(x)]
  Then g is not updated toward minimizing W(Pdata, Pg)

Distribution of weight values of f (τ = 0.01): exploding and vanishing gradients

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 56 / 81


slide-155
SLIDE 155

Improved W-GAN II

If f is 1-Lipschitz, then ‖∇f(x)‖ ≤ 1 for all x
Why not just penalize ‖∇f(x)‖ > 1 for all x? Cost function:
argmin_Θg max_Θf E_{x∼Pdata}[f(x)] − E_{x∼Pg}[f(x)] − λ E_{x∼Ppenalty}[max(0, ‖∇f(x)‖ − 1)]
Ppenalty?

W-GAN-GP [5]:
argmin_Θg max_Θf ∑_n f(x(n)) − ∑_m f(g(c(m))) − λ ∑_p (‖∇f(x(p))‖ − 1)²

The larger ‖∇f(x(p))‖, the better (subject to ‖∇f(x(p))‖ ≤ 1)
Faster convergence

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 57 / 81
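The penalty term can be sketched for a linear critic, whose gradient is constant everywhere (illustrative; in the actual W-GAN-GP, Ppenalty samples the x(p) along lines between real and generated points, a detail omitted here):

```python
import numpy as np

def grad_penalty(w, xs, lam=10.0):
    # Two-sided gradient penalty lam * sum_p (||grad f(x_p)|| - 1)^2.
    # For the linear critic f(x) = w.x, grad f(x) = w at every x, so the
    # per-point penalty is the same for all sample points xs.
    gnorm = np.linalg.norm(w)
    return lam * sum((gnorm - 1.0) ** 2 for _ in xs)

xs = [np.zeros(2)] * 5                          # 5 penalty points (values unused for a linear f)
print(grad_penalty(np.array([0.6, 0.8]), xs))   # ||w|| = 1 → penalty ≈ 0
print(grad_penalty(np.array([1.2, 1.6]), xs))   # ||w|| = 2 → penalty ≈ 10 * 5 * 1 = 50
```

The penalty is zero exactly when the gradient norm is 1, so it nudges f toward the 1-Lipschitz regime without the hard weight clipping of the original W-GAN.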

slide-156
SLIDE 156

Results

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 58 / 81


slide-160
SLIDE 160

Challenge: Global Coherence

Large images generated by GANs usually lack global coherence: counting, perspective, shape
A CNN, when used as f, detects the existence of patterns more than their relative positions
  f loses track of the position of a pattern after several pooling layers
  Relative positions of patterns in X may change due to different view angles

Solutions? A better f, such as the CapsuleNet [21]?

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 59 / 81


slide-162
SLIDE 162

Progressive Growing of GANs [8]

Incrementally adds new layers in sequential GAN trainings
  Convolution + upsampling for g each time
  Convolution + downsampling (pooling) for f each time
Real images are downscaled to match the current resolution of g
Goal: to let new layers add details without ruining the context

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 60 / 81

slide-163
SLIDE 163

Transition when Adding a New Layer

Gradually increases α

The new convolution layer in g/f learns to generate/detect details first, then learns the “context”

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 61 / 81
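The fade-in can be sketched as an α-blend between the upsampled old output and the new layer’s output (an illustrative reading of the scheme; the map shapes and nearest-neighbor upsampling are assumptions):

```python
import numpy as np

def upsample2x(img):
    # Nearest-neighbor 2x upsampling of a 2-D feature map
    return img.repeat(2, axis=0).repeat(2, axis=1)

old = np.full((4, 4), 0.2)   # previous-resolution output of g
new = np.full((8, 8), 1.0)   # output of the newly added layer

def fade_in(old, new, alpha):
    # out = (1 - alpha) * upsample(old) + alpha * new; alpha grows from 0 to 1
    return (1.0 - alpha) * upsample2x(old) + alpha * new

print(fade_in(old, new, 0.0)[0, 0])  # 0.2: new layer contributes nothing yet
print(fade_in(old, new, 1.0)[0, 0])  # 1.0: fully transitioned to the new layer
```

At α = 0 the network behaves exactly like the previous, lower-resolution network, so the new layer can start learning without disturbing what is already trained.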

slide-164
SLIDE 164

Results

Minibatch discrimination [22] + W-GAN-GP [5] + progressive growing [8] + other tricks; 2 ∼ 6 times faster training

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 62 / 81

slide-165
SLIDE 165

Outline

1

Unsupervised Learning

2

Self-Supervised Learning

3

Autoencoders & Manifold Learning

4

Generative Adversarial Networks The Basics Challenges More GANs

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 63 / 81


slide-167
SLIDE 167

Code Space Arithmetics

DC-GAN [18] can learn to use codes in meaningful ways:
Finding codes for images with constraints [29, 2]:
argmin_c ‖mask(g(c)) − constraint‖_F
Demo 1 Demo 2

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 64 / 81

slide-168
SLIDE 168

Conditional GAN I

Text-to-image synthesis [19]: X = {(x(n), φ(n))}_n
“This bird is completely red with black wings and pointy beak.”
How?

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 65 / 81



slide-171
SLIDE 171

Conditional GAN II

Pitfall: g and f can choose to ignore the condition φ altogether. Solution?
Conditioned labeling:
  (x(n), φ(n)) ⇒ true
  (x(n), φ′) ⇒ false, ∀φ′ ≠ φ(n)
  (x̂(m), φ(m)) ⇒ false

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 66 / 81


slide-174
SLIDE 174

Super Resolution [11]

g : low-res img → high-res img
Training: the c’s are downscaled images
No “creativity”; f acts as a better loss metric

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 67 / 81

slide-175
SLIDE 175

Image-to-Image Translation [6] I

Given an image xsrc in source domain, generate image(s) xtarget in target domain

xsrc and xtarget are semantically aligned

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 68 / 81


slide-177
SLIDE 177

Image-to-Image Translation [6] II

Based on conditional GAN:
  c = xsrc; g(c) = xtarget
  Conditioned labels
  Uses dropout layers to create diversity (if needed)

Requires paired examples X = {(x(n)_src, x(n)_target)}_n

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 69 / 81

slide-178
SLIDE 178

Unpaired Image-to-Image Translation I

What if the images in different domains are unpaired?
X = {x(n)_src}_n ∪ {x(n)_target}_n

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 70 / 81


slide-180
SLIDE 180

Unpaired Image-to-Image Translation II

Cycle GAN [30]: train two generators gsrc2target and gtarget2src simultaneously in two GANs
Add the loss terms ∑_n ‖x(n)_src − gtarget2src(gsrc2target(x(n)_src))‖_F and ∑_n ‖x(n)_target − gsrc2target(gtarget2src(x(n)_target))‖_F for gsrc2target and gtarget2src

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 71 / 81
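The cycle-consistency idea can be sketched with toy “generators” that happen to be exact inverses (purely illustrative; real Cycle GAN generators are CNNs, and the loss only pushes them toward being inverses):

```python
import numpy as np

# Toy "generators": exact inverse affine maps between two 1-D "domains"
g_s2t = lambda x: 2.0 * x + 1.0        # source -> target
g_t2s = lambda y: (y - 1.0) / 2.0      # target -> source

x_src = np.array([0.0, 1.0, 2.0])

# Cycle-consistency loss ||x - g_t2s(g_s2t(x))||: zero iff the round trip
# reconstructs the input exactly
cycle_loss = float(np.linalg.norm(x_src - g_t2s(g_s2t(x_src))))
print(cycle_loss)  # 0.0 for exact inverses

bad_t2s = lambda y: y / 2.0            # mismatched inverse leaves a residual
bad_loss = float(np.linalg.norm(x_src - bad_t2s(g_s2t(x_src))))
print(bad_loss)    # > 0
```

Minimizing this term encourages each generator to preserve the information the other needs for reconstruction, which is what ties the two GANs together without paired data.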


slide-184
SLIDE 184

How Does Cycle GAN Work?

Ideal: gsrc2target and gtarget2src learn to translate images
Reality: gsrc2target and gtarget2src learn to hide information [3]
Unsupervised DNN models, including GANs, may not work as one may expect

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 72 / 81

slide-185
SLIDE 185

Reference I

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
[2] Andrew Brock, Theodore Lim, JM Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
[3] Casey Chu, Andrey Zhmoginov, and Mark Sandler. CycleGAN, a master of steganography. arXiv preprint arXiv:1712.02950, 2017.
[4] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 73 / 81

slide-186
SLIDE 186

Reference II

[5] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
[6] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[7] Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
[8] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 74 / 81

slide-187
SLIDE 187

Reference III

[9] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[10] Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014.
[11] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[12] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 75 / 81

slide-188
SLIDE 188

Reference IV

[13] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562, 2001.
[14] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
[15] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
[16] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 76 / 81

slide-189
SLIDE 189

Reference V

[17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[18] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[19] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 77 / 81

slide-190
SLIDE 190

Reference VI

[20] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840, 2011.
[21] Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3857–3867, 2017.
[22] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 78 / 81

slide-191
SLIDE 191

Reference VII

[23] Patrice Simard, Bernard Victorri, Yann LeCun, and John S Denker. Tangent prop — a formalism for specifying selected invariances in an adaptive network. In NIPS, volume 91, pages 895–903, 1991.
[24] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
[25] Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756, 2016.

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 79 / 81

slide-192
SLIDE 192

Reference VIII

[26] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM, 2008.
[27] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.
[28] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 80 / 81

slide-193
SLIDE 193

Reference IX

[29] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A Efros. Generative visual manipulation on the natural image manifold. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[30] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

Shan-Hung Wu (CS, NTHU) Unsupervised Learning Machine Learning 81 / 81