Deep Generative Models for Clustering: A Semi-supervised and Unsupervised Approach


slide-1
SLIDE 1

Deep Generative Models for Clustering: A Semi-supervised and Unsupervised Approach

Jhosimar George Arias Figueroa
Advisor: Gerberth Adín Ramírez Rivera
Master's Thesis Defense

February 19, 2018

slide-2
SLIDE 2

Motivation

slide-3
SLIDE 3

Huge amount of unlabeled data!

Image Credit: Ruslan Salakhutdinov

Labeling huge amounts of data is hard and expensive.

1

slide-4
SLIDE 4

Huge amount of unlabeled data!

Image Credit: Ruslan Salakhutdinov

Labeling huge amounts of data is hard and expensive. Can we discover the underlying hidden structure of the data in a semi-supervised or unsupervised way?

1

slide-5
SLIDE 5

What is Clustering?

Goal: Find distinct groups such that similar elements belong to the same group.

2

slide-6
SLIDE 6

What if we have a small amount of labeled data?

Large amount of Unlabeled data (ImageNet) Small amount of Labeled data (CIFAR-10)

Can we learn good representations and cluster data in a semi-supervised way?

3

slide-7
SLIDE 7

Two types of Semi-supervised Clustering:

Class labels (seeded points); pairwise constraints (must-link or cannot-link)

Our work focuses on the first type of semi-supervised clustering: Use of labels as seeds

4

slide-8
SLIDE 8

Semi-supervised Clustering: Related Works

slide-9
SLIDE 9

Intuition

5

slide-10
SLIDE 10

Learning Representations: Neural Networks

How to learn feature representations? Train such that features can be used to perform classification (supervised learning).

6

slide-11
SLIDE 11

Auxiliary Task: Semi-supervised Embedding

  • Deep Learning via Semi-Supervised Embedding, Weston et al, ICML’08

7

slide-12
SLIDE 12

Learning Representations: Autoencoder

How to learn feature representations? Train such that the features can be used to reconstruct the original data. Autoencoding: encoding the data itself.

8

slide-13
SLIDE 13

Auxiliary Task: Clean Encoder and Denoising Decoder

  • Semi-Supervised Learning with Ladder Networks, Rasmus et al. NIPS’15
  • Deep Embedded Regularized Clustering (DEPICT), Dizaji et al. ICCV’17

9

slide-14
SLIDE 14

Variational Autoencoder

Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 10

slide-15
SLIDE 15

Variational Autoencoder

The posterior distribution $p_\theta(z|x)$ is intractable because of $p_\theta(x)$:

$$p_\theta(z|x) = \frac{p_\theta(x|z)\,p_\theta(z)}{p_\theta(x)}, \qquad p_\theta(x) = \int_z p_\theta(x, z)\,dz$$

Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 10

slide-16
SLIDE 16

Variational Autoencoder

Approximate pθ(z|x) with a tractable distribution qφ(z|x)

Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 10

slide-17
SLIDE 17

Variational Autoencoder - Lower Bound

Variational Lower Bound:
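The bound itself is not reproduced in the extracted slide text; a standard statement of it, consistent with the derivation in the appendix (Slide 104), is:

$$\log p_\theta(x) \ge \mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z)\right)$$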

Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 11

slide-18
SLIDE 18

Generative Models for Semi-supervised learning

Inference Model Generative Model Probabilistic graphical models of M1+M2, Kingma et al, NIPS14 Inference Model Generative Model Auxiliary Deep Generative Models, Maaloe et al, ICML’16 12

slide-19
SLIDE 19

Stochastic Continuous Nodes: Reparameterization Trick

We can 'externalize' the randomness in z by re-parameterizing it as deterministic: z = μ + σ ⊙ ε.

13
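As an illustration (not code from the thesis), a minimal PyTorch-style sketch of the reparameterization trick; the tensor shapes are placeholders:

```python
import torch

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives in eps, so gradients flow through mu and sigma.
    """
    std = torch.exp(0.5 * log_var)   # sigma from log-variance
    eps = torch.randn_like(std)      # external noise, no gradient needed
    return mu + std * eps

# usage: mu, log_var are encoder outputs of shape (batch, latent_dim)
mu = torch.zeros(4, 2)
log_var = torch.zeros(4, 2)
z = reparameterize(mu, log_var)      # differentiable w.r.t. mu and log_var
```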

slide-20
SLIDE 20

Stochastic Discrete Nodes: Gumbel-Max Trick

We can sample from a categorical distribution with the Gumbel-Max trick. 14
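Since the slide figure is not reproduced in the text, here is the standard statement of the trick (general fact, not taken from the slide image): perturb the log-probabilities with i.i.d. Gumbel noise and take an arg max,

$$c = \arg\max_k \left(\log \pi_k + g_k\right), \qquad g_k = -\log(-\log(u_k)), \quad u_k \sim \mathrm{Uniform}(0, 1).$$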

slide-21
SLIDE 21

Stochastic Discrete Nodes: Gumbel-Softmax distribution

We can approximate arg max with a softmax. 15
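A minimal sketch of Gumbel-Softmax sampling in PyTorch; it is illustrative only, and the temperature handling and shapes are assumptions rather than the thesis code:

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0, eps=1e-20):
    """Differentiable approximation of a one-hot categorical sample.

    logits: unnormalized log-probabilities, shape (batch, K).
    tau:    temperature; small tau approaches one-hot, large tau approaches uniform.
    """
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + eps) + eps)   # Gumbel(0, 1) noise
    return F.softmax((logits + gumbel) / tau, dim=-1)

# usage
logits = torch.log(torch.tensor([[0.7, 0.2, 0.1]]))
y = gumbel_softmax_sample(logits, tau=0.5)           # soft, nearly one-hot sample
```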

slide-22
SLIDE 22

Avoiding marginalization over discrete variables

Marginalizing out c over all categories is an expensive process which requires multiple gradient estimations. Gumbel-Softmax allows us to backpropagate through a single-sample gradient estimate.

16

slide-23
SLIDE 23

Semi-supervised Clustering: Proposed Method

slide-24
SLIDE 24

Intuition

17

slide-25
SLIDE 25

Proposed Probabilistic Model

Inference Model Generative Model

Proposed probabilistic model

18

slide-26
SLIDE 26

Proposed Probabilistic Model

Inference Model Generative Model

Variational Lower Bound:

19

slide-27
SLIDE 27

Proposed Loss Function

Variational Lower Bound: Use of Auxiliary Tasks:

20

slide-28
SLIDE 28

Proposed Architecture

Inference Model Generative Model

Proposed Architecture:

21

slide-29
SLIDE 29

Reconstruction Loss

$$\mathcal{L}_R = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|x_i|} \left(x_i - \tilde{x}_i\right)^2$$
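A sketch of this dimension-normalized MSE in PyTorch, assuming x and x_tilde are flattened per-sample tensors (not the thesis's exact code):

```python
import torch

def reconstruction_loss(x, x_tilde):
    """Squared error averaged over pixels (1/|x_i|) and over the batch (1/N)."""
    per_sample = ((x - x_tilde) ** 2).mean(dim=1)   # 1/|x_i| * sum of squared errors
    return per_sample.mean()                         # 1/N over the batch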

22

slide-30
SLIDE 30

Clustering Loss

$$\mathcal{L}_C = \mathrm{KL}\left(q_{ik}\,\|\,U(0,1)\right) = \frac{1}{N \log K} \sum_{i=1}^{N} \sum_{k=1}^{K} q_{ik} \log\left(K\,q_{ik}\right)$$
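A sketch matching the formula as reconstructed above; the 1/log K normalization follows the slide fragments and should be treated as an assumption:

```python
import math
import torch

def clustering_loss(q, eps=1e-10):
    """KL(q || Uniform) for cluster responsibilities q of shape (N, K)."""
    n, k = q.shape
    kl = (q * torch.log(k * q + eps)).sum()
    return kl / (n * math.log(k))
```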

23

slide-31
SLIDE 31

Assignment Process

24

slide-32
SLIDE 32

Assignment Process

24

slide-33
SLIDE 33

Assignment Process

24

slide-34
SLIDE 34

Assignment Loss

25

slide-35
SLIDE 35

Assignment Loss

25

slide-36
SLIDE 36

Assignment Loss

$$\mathrm{NLL} = -\log p(c|x)$$

$$\mathcal{L}_A = \sum_{i=1}^{N} \left[ -\log p(c_i|x_i) + \sum_{d \in C} \log p(d|x_i) \right]$$

25
slide-37
SLIDE 37

Assignment Loss

$$\mathcal{L}_A = \sum_{i=1}^{N} \left[ -\log p(c_i|x_i) + \sum_{d \in C} \log p(d|x_i) \right]$$

Normalized Loss:

$$\mathcal{L}_A = \frac{1}{2N} \sum_{i=1}^{N} \left[ \tanh\left(-\log p(c_i|x_i)\right) + \sum_{d \in C} \left(\tanh\left(\log p(d|x_i)\right) + 1\right) \right]$$

25
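A sketch of the normalized assignment loss as reconstructed above, applied to the seeded (labeled) examples; the exact grouping of the +1 term and the range of the inner sum are assumptions recovered from the slide fragments:

```python
import torch

def assignment_loss(log_p, seed_labels):
    """Normalized assignment loss for the labeled (seed) samples.

    log_p:       log p(c|x), shape (N, K), for the N labeled samples.
    seed_labels: ground-truth cluster indices, shape (N,).
    """
    n, k = log_p.shape
    correct = -log_p[torch.arange(n), seed_labels]   # -log p(c_i | x_i)
    others = torch.tanh(log_p) + 1.0                 # tanh(log p(d | x_i)) + 1
    loss = torch.tanh(correct) + others.sum(dim=1)
    return loss.sum() / (2 * n)
```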
slide-38
SLIDE 38

Feature Loss

26

slide-39
SLIDE 39

Feature Loss

26

slide-40
SLIDE 40

Feature Loss

$$D(f, r) = \frac{1}{\sqrt{2}\,|r|} \sum_{l=1}^{|r|} \left\| f - r_l \right\|$$

27

slide-41
SLIDE 41

Feature Loss

$$D(f, r) = \frac{1}{\sqrt{2}\,|r|} \sum_{l=1}^{|r|} \left\| f - r_l \right\|$$

27

slide-42
SLIDE 42

Feature Loss

$$D(f, r) = \frac{1}{\sqrt{2}\,|r|} \sum_{l=1}^{|r|} \left\| f - r_l \right\|$$

$$\mathcal{L}_F = \frac{\alpha}{L} \sum_{i}^{L} D(f_i, r_i) + \frac{1-\alpha}{U} \sum_{j}^{U} D(f_j, r_j)$$

α weights the importance of labeled distances

27

slide-43
SLIDE 43

Feature Loss

$$D(f, r) = \frac{1}{\sqrt{2}\,|r|} \sum_{l=1}^{|r|} \left\| f - r_l \right\|$$

$$\mathcal{L}_F = \frac{\alpha}{L} \sum_{i}^{L} D(f_i, r_i) + \frac{1-\alpha}{U} \sum_{j}^{U} D(f_j, r_j)$$

α weights the importance of labeled distances

27
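A sketch of the feature loss under the reconstruction above; the norm used in D, the per-sample reference sets r, and the labeled/unlabeled averaging are assumptions recovered from the slide fragments (α = 0.6 is the value from Slide 47):

```python
import math
import torch

def pairwise_distance(f, r):
    """D(f, r): average distance from feature f (dim,) to reference features r (|r|, dim)."""
    return torch.norm(f.unsqueeze(0) - r, dim=1).sum() / (math.sqrt(2.0) * r.shape[0])

def feature_loss(f_lab, r_lab, f_unlab, r_unlab, alpha=0.6):
    """Weighted average of labeled and unlabeled feature distances."""
    lab = torch.stack([pairwise_distance(f, r) for f, r in zip(f_lab, r_lab)]).mean()
    unlab = torch.stack([pairwise_distance(f, r) for f, r in zip(f_unlab, r_unlab)]).mean()
    return alpha * lab + (1.0 - alpha) * unlab
```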

slide-44
SLIDE 44

Experiments and Results

slide-45
SLIDE 45

Datasets

Synthetic data; MNIST (70,000 images, 10 classes, 28×28); SVHN (99,289 images, 10 classes, 32×32)

28

slide-46
SLIDE 46

Analysis of Normalized Loss Functions

[Figure] Training loss (%) vs. iteration number for L_A, L_R, L_C, and L_F: training loss at different epochs.

[Figure] Probabilities (π) vs. iteration number for (a) w_A = 0, (b) w_A = 1, (c) w_A = 5: effect of the assignment loss (L_A) over the categorical loss (L_C) at different epochs.

29

slide-47
SLIDE 47

Hyperparameter Selection

[Figure] ACC (%) vs. iteration number and NMI (%) vs. iteration number for different weight settings (w_A, w_C): performance of weights (w_A, w_C) at different iterations.

Selected hyperparameters of our model:

Hyperparameter   f_sz   η       τ     α     κ   w_A
Value            150    0.001   1.0   0.6   1   10

30

slide-48
SLIDE 48

Synthetic Data - Circles

few labeled samples iteration = 1 iteration = 5 iteration = 8 iteration = 10 iteration = 12

31

slide-49
SLIDE 49

Synthetic Data - Moons

few labeled samples iteration = 1 iteration = 3 iteration = 5 iteration = 10 iteration = 20

32

slide-50
SLIDE 50

Synthetic Data - Moons

few labeled samples iteration = 1 iteration = 3 iteration = 5 iteration = 10 iteration = 20

33

slide-51
SLIDE 51

Clustering: Performance

Clustering Accuracy (ACC) and Normalized Mutual Information (NMI), on MNIST for different unsupervised algorithms.

Model                                     NMI      ACC
GMVAE, Dilokthanakul et al, arXiv'16      –        0.778
VaDE, Jiang et al, IJCAI'17               –        0.945
JULE-SF, Yang et al, CVPR'16              0.876    0.940
JULE-RC, Yang et al, CVPR'16              0.915    0.961
DEPICT, Dizaji et al, arXiv'17            0.916    0.965
Proposed                                  0.954    0.984

Note that our results are not directly comparable with the unsupervised methods; however, we want to show our model's clustering results. Larger values of ACC and NMI indicate better performance.

34

slide-52
SLIDE 52

Classification: MNIST Performance

Semi-supervised test error (%) benchmarks on MNIST for 100 randomly and evenly distributed labeled data.

Model                                     100 labeled examples
SWWAE, Zhao et al, ICLR'16                8.71 (± 0.34)
EmbedCNN, Weston et al, ICML'08           7.75
Small-CNN, Rasmus et al, NIPS'15          6.43 (± 0.84)
M1+M2, Kingma et al, NIPS'14              3.33 (± 0.14)
DEPICT, Dizaji et al, arXiv'17            2.65 (± 0.35)
Conv-CatGAN, Springenberg, ICLR'16        1.39 (± 0.28)
SDGM, Maaløe et al, ICML'16               1.32 (± 0.07)
ADGM, Maaløe et al, ICML'16               0.96 (± 0.02)
Improved GAN, Salimans et al, NIPS'16     0.93 (± 0.65)
Conv-Ladder, Rasmus et al, NIPS'15        0.89 (± 0.50)
Proposed                                  1.65 (± 0.10)

Smaller values for test error indicate better performance. All the results of the related works are reported from the original papers.

35

slide-53
SLIDE 53

Classification: MNIST Performance

Semi-supervised test error (%) benchmarks on MNIST for 100 randomly and evenly distributed labeled data.

Model                                     100 labeled examples
SWWAE, Zhao et al, ICLR'16                8.71 (± 0.34)
EmbedCNN, Weston et al, ICML'08           7.75
Small-CNN, Rasmus et al, NIPS'15          6.43 (± 0.84)
M1+M2, Kingma et al, NIPS'14              3.33 (± 0.14)
DEPICT, Dizaji et al, arXiv'17            2.65 (± 0.35)
Conv-CatGAN, Springenberg, ICLR'16        1.39 (± 0.28)
SDGM, Maaløe et al, ICML'16               1.32 (± 0.07)
ADGM, Maaløe et al, ICML'16               0.96 (± 0.02)
Improved GAN, Salimans et al, NIPS'16     0.93 (± 0.65)
Conv-Ladder, Rasmus et al, NIPS'15        0.89 (± 0.50)
Proposed                                  1.65 (± 0.10)

Smaller values for test error indicate better performance. Colored rows denote Bayesian methods.

36

slide-54
SLIDE 54

Classification: SVHN Performance

Semi-supervised test error (%) benchmarks on SVHN for 1000 randomly and evenly distributed labeled data.

Model                                     1000 labeled examples
M1+TSVM, Kingma et al, NIPS'14            55.33 (± 0.11)
M1+M2, Kingma et al, NIPS'14              36.02 (± 0.10)
SWWAE, Zhao et al, ICLR'16                23.56
ADGM, Maaløe et al, ICML'16               22.86
SDGM, Maaløe et al, ICML'16               16.61 (± 0.24)
Improved GAN, Salimans et al, NIPS'16     8.11 (± 1.3)
Proposed                                  21.74 (± 0.41)

Smaller values for test error indicate better performance. All the results of the related works are reported from the original papers.

37

slide-55
SLIDE 55

Clustering: Visualization MNIST

(a) Epoch 1 (b) Epoch 5 (c) Epoch 20 (d) Epoch 50 (e) Epoch 80 (f) Epoch 100

38

slide-56
SLIDE 56

Image Generation

Use feature vector obtained from gφ(x) and vary the category c (one-hot).

39

slide-57
SLIDE 57

Unsupervised Clustering

slide-58
SLIDE 58

What if we have no labeled data?

Large amount of Unlabeled data (ImageNet) No Labeled data (CIFAR-10)

Can we learn good representations and cluster data in an unsupervised way?

40

slide-59
SLIDE 59

Unsupervised Clustering: Related Works

slide-60
SLIDE 60

Intuition

41

slide-61
SLIDE 61

Use of pretrained features

42

slide-62
SLIDE 62

Fine-tuning

  • Deep Embedding Clustering (DEC), Xie et al. ICML’16
  • Deep Clustering Network (DCN), Yang et al. ICML’17
  • Improved Deep Embedding Clustering (IDEC), Guo et al. IJCAI’17

43

slide-63
SLIDE 63

End-To-End

  • Joint Unsupervised Learning (JULE), Yang et al. CVPR’16
  • Deep Embedded Regularized Clustering (DEPICT), Dizaji et al. ICCV’17

44

slide-64
SLIDE 64

Complex Structure: Generative Models

  • Variational Deep Embedding, Jiang et al. IJCAI'17
  • Gaussian Mixture Variational Autoencoders, Dilokthanakul et al. arXiv’17.

45

slide-65
SLIDE 65

Unsupervised Clustering: Proposed Method

slide-66
SLIDE 66

Intuition

46

slide-67
SLIDE 67

Our Probabilistic Model

Inference Model Generative Model 47

slide-68
SLIDE 68

Stacked M1+M2 generative model

Inference Model Generative Model M1 model Semi-supervised Learning with Deep Generative Models, Kingma et al, NIPS’14 48

slide-69
SLIDE 69

Stacked M1+M2 generative model

Inference Model Generative Model M1 model Inference Model Generative Model M2 model Semi-supervised Learning with Deep Generative Models, Kingma et al, NIPS’14 48

slide-70
SLIDE 70

Stacked M1+M2 generative model

Inference Model Generative Model M1 model Inference Model Generative Model M2 model Inference Model Generative Model Probabilistic graphical models of M1+M2 Semi-supervised Learning with Deep Generative Models, Kingma et al, NIPS’14 48

slide-71
SLIDE 71

Stacked M1+M2: Training

Train M1 model and use its feature representations to train M2 model separately.

49

slide-72
SLIDE 72

Problem with hierarchical stochastic latent variables

Inactive Stochastic Units:

Ladder Variational Autoencoders , Sonderby et al, NIPS’16 50

slide-73
SLIDE 73

Problem with hierarchical stochastic latent variables

Inactive Stochastic Units:

Ladder Variational Autoencoders , Sonderby et al, NIPS’16

Solutions require complex models:

Inference Model Generative Model Auxiliary Deep Generative Models, Maaloe et al, ICML’16 51

slide-74
SLIDE 74

Avoiding the problem of hierarchical stochastic variables

Replace the stochastic layer that produces x with a deterministic one, x̂ = g(x).

52

slide-75
SLIDE 75

Other Differences with M1+M2 model

  • Use of Gumbel-Softmax instead of marginalization over stochastic discrete variables.

53

slide-76
SLIDE 76

Other Differences with M1+M2 model

  • Use of Gumbel-Softmax instead of marginalization over stochastic discrete variables.

  • Training end-to-end instead of pre-training.

53

slide-77
SLIDE 77

Other Differences with M1+M2 model

  • Use of Gumbel-Softmax instead of marginalization over stochastic discrete variables.

  • Training end-to-end instead of pre-training.
  • Unsupervised model: Labels are not required.

53

slide-78
SLIDE 78

Variational Lower Bound

Inference Model Generative Model

Loss Function:

Ltotal = LR + LC + LG

54

slide-79
SLIDE 79

Proposed Model

55

slide-80
SLIDE 80

Reconstruction Loss

$$\mathcal{L}_{\text{total}} = \mathcal{L}_R + \mathcal{L}_C + \mathcal{L}_G$$
$$\mathcal{L}_{BCE} = -\left(x \log(\tilde{x}) + (1 - x)\log(1 - \tilde{x})\right)$$
$$\mathcal{L}_{MSE} = \left\|x - \tilde{x}\right\|^2$$

56

slide-81
SLIDE 81

Gaussian and Categorical Regularizers

$$\mathcal{L}_{\text{total}} = \mathcal{L}_R + \mathcal{L}_C + \mathcal{L}_G$$

$$\mathcal{L}_G = \mathrm{KL}\left(\mathcal{N}(\mu(x), \sigma(x)) \,\|\, \mathcal{N}(0, 1)\right) = -\frac{1}{2} \sum_{k=1}^{K} \left(1 + \log \sigma_k^2 - \mu_k^2 - \sigma_k^2\right)$$

$$\mathcal{L}_C = \mathrm{KL}\left(\mathrm{Cat}(\pi) \,\|\, U(0, 1)\right) = \sum_{k=1}^{K} \pi_k \log\left(K \pi_k\right)$$
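A sketch of these two regularizers in PyTorch; the shapes and reductions are assumptions, since the slides do not show implementation details:

```python
import torch

def gaussian_kl(mu, log_var):
    """KL(N(mu, sigma) || N(0, I)), summed over latent dimensions."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1)

def categorical_kl(pi, eps=1e-10):
    """KL(Cat(pi) || Uniform(K)) = sum_k pi_k log(K * pi_k)."""
    k = pi.shape[-1]
    return torch.sum(pi * torch.log(k * pi + eps), dim=-1)
```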

57

slide-82
SLIDE 82

Experiments

slide-83
SLIDE 83

Datasets

MNIST (70,000 images, 10 classes, 28×28); USPS (9,298 images, 10 classes, 16×16); REUTERS-10K (10,000 documents, 4 classes)

58

slide-84
SLIDE 84

Analysis of Clustering Performance

[Figure] Performance (%) vs. iteration number for ACC and NMI: clustering performance at each epoch, considering all loss weights equal to 1.

59

slide-85
SLIDE 85

Analysis of loss functions weights

$$\mathcal{L}_{\text{total}} = \mathcal{L}_R + \mathcal{L}_C + w_G \mathcal{L}_G$$

[Figure] ACC (%) and NMI (%) as a function of the loss-function weight (w∗) for L_R, L_C, and L_G.

60

slide-86
SLIDE 86

Quantitative Results: Clustering Performance

Clustering performance, ACC (%) and NMI (%), on all datasets.

Method        MNIST                        USPS                         REUTERS-10K
              ACC          NMI             ACC          NMI             ACC          NMI
k-means       53.24        –               66.82        –               51.62        –
GMM           53.73        –               –            –               54.72        –
AE+k-means    81.82        74.73           69.31        66.20           70.52        39.79
AE+GMM        82.18        –               –            –               70.13        –
GMVAE         82.31 (± 4)  –               –            –               –            –
DCN           83.00        81.00           –            –               –            –
DEC           86.55        83.72           74.08        75.29           73.68        49.76
IDEC          88.06        86.72           76.05        78.46           75.64        49.81
VaDE          94.46        –               –            –               79.83        –
Proposed      85.75 (± 8)  82.13 (± 5)     72.58 (± 3)  67.01 (± 2)     80.41 (± 5)  52.13 (± 5)

(– denotes values not reported by the original works.)

Larger values for ACC and NMI indicate better performance. Colored rows denote methods that require pre-training. All the results of the related works are reported from the original papers.

61

slide-87
SLIDE 87

Quantitative Results: Classification - MNIST Performance

MNIST test error-rate (%) for kNN.

Method      k = 3    k = 5    k = 10
VAE         18.43    15.69    14.19
DLGMM       9.14     8.38     8.42
VaDE        2.20     2.14     2.22
Proposed    3.46     3.30     3.44

Smaller values for test error indicate better performance.

62

slide-88
SLIDE 88

Qualitative Results: Image Generation

10 clusters

Fix the category c (one-hot) and vary the latent variable z.

63

slide-89
SLIDE 89

Qualitative Results: Image Generation

7 clusters 14 clusters

64

slide-90
SLIDE 90

Qualitative Results: Style Generation

Input a test image x (first column) through qφ(z|x̂).

65

slide-91
SLIDE 91

Qualitative Results: Style Generation

Use the vector obtained from qφ(z|x̂) and vary the category c (one-hot).

65

slide-92
SLIDE 92

Qualitative Results: Clustering Visualization

(a) Epoch 1 (b) Epoch 5 (c) Epoch 20 (d) Epoch 50 (e) Epoch 150 (f) Epoch 300

Visualization of the feature representations on MNIST data set at different epochs.

66

slide-93
SLIDE 93

Conclusions and Future Work

slide-94
SLIDE 94

Contributions

For semi-supervised clustering our contributions were:

  • a semi-supervised auxiliary task which aims to define clustering assignments.

67

slide-95
SLIDE 95

Contributions

For semi-supervised clustering our contributions were:

  • a semi-supervised auxiliary task which aims to define clustering assignments.

  • a regularization on the feature representations of the data.

67

slide-96
SLIDE 96

Contributions

For semi-supervised clustering our contributions were:

  • a semi-supervised auxiliary task which aims to define clustering assignments.
  • a regularization on the feature representations of the data.
  • a loss function that combines a variational loss with our auxiliary task to guide the learning process.

67

slide-97
SLIDE 97

Contributions

For unsupervised clustering our contributions were:

  • a combination of deterministic and stochastic layers to solve the problem of hierarchical stochastic variables, allowing end-to-end learning.

68

slide-98
SLIDE 98

Contributions

For unsupervised clustering our contributions were:

  • a combination of deterministic and stochastic layers to solve the problem of hierarchical stochastic variables, allowing end-to-end learning.
  • a simple deep generative model represented by the combination of a simple Gaussian and a categorical distribution.

68

slide-99
SLIDE 99

Future Works

  • Use of clustering algorithms (e.g., k-means, DBSCAN, agglomerative clustering, etc.) over the feature representations to improve the learning process.

69

slide-100
SLIDE 100

Future Works

  • Use of clustering algorithms (e.g., k-means, DBSCAN, agglomerative clustering, etc.) over the feature representations to improve the learning process.
  • Improvements of our probabilistic generative model can be made by using generative adversarial networks (GANs).

69

slide-101
SLIDE 101

Publications

  • J. Arias and A. Ramírez. Learning to Cluster with Auxiliary Tasks: A Semi-Supervised Approach. In Conference on Graphics, Patterns and Images (SIBGRAPI), Niterói, 2017.
    Source code: https://gitlab.com/mipl/clustering-sibgrapi-2017

  • J. Arias and A. Ramírez. Is Simple Better?: Revisiting Simple Generative Models for Unsupervised Clustering. In Second Workshop on Bayesian Deep Learning (NIPS), Long Beach, 2017.
    Source code: https://gitlab.com/mipl/simple-vae-clustering

70

slide-102
SLIDE 102

Thanks!

71

slide-103
SLIDE 103

Appendix

slide-104
SLIDE 104

Variational Autoencoder - Lower Bound

$$\begin{aligned}
\mathrm{KL}(q(z|x)\,\|\,p(z|x)) &= \int_z q(z|x) \log \frac{q(z|x)}{p(z|x)} = \mathbb{E}_{z\sim q(z|x)}\left[\log \frac{q(z|x)}{p(z|x)}\right] \\
&= \mathbb{E}_{z\sim q(z|x)}\left[\log q(z|x) - \log p(z|x)\right] \\
&= \mathbb{E}_{z\sim q(z|x)}\left[\log q(z|x) - \log p(x, z) + \log p(x)\right] \\
&= \mathbb{E}_{z\sim q(z|x)}\left[\log q(z|x) - \log p(x, z)\right] + \log p(x) \\
&= -\mathbb{E}_{z\sim q(z|x)}\left[\log p(x, z) - \log q(z|x)\right] + \log p(x) \\
&= -\mathbb{E}_{z\sim q(z|x)}\left[\log p(x|z) + \log p(z) - \log q(z|x)\right] + \log p(x) \\
&= -\mathbb{E}_{z\sim q(z|x)}\left[\log p(x|z)\right] + \mathbb{E}_{z\sim q(z|x)}\left[\log q(z|x) - \log p(z)\right] + \log p(x)
\end{aligned}$$

Then, reordering the terms,

$$\log p(x) = \underbrace{\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z))}_{\mathcal{L}(\theta,\phi)} + \underbrace{\mathrm{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))}_{\geq 0}$$

$$\log p_\theta(x) \geq \mathcal{L}(\theta, \phi) \qquad \text{(Variational Lower Bound, ELBO)}$$

$$\theta^*, \phi^* = \arg\max_{\theta,\phi} \mathcal{L}(\theta, \phi)$$

Training: maximize the lower bound.

72

slide-105
SLIDE 105

Semi-supervised Clustering - Evidence Lower Bound

$$\begin{aligned}
\mathrm{KL}(q(c, f|x)\,\|\,p(c, f|x)) &= \sum_{c,f} q(c, f|x) \log \frac{q(c, f|x)}{p(c, f|x)} = \mathbb{E}_{c,f\sim q(c,f|x)}\left[\log \frac{q(c, f|x)}{p(c, f|x)}\right] \\
&= \mathbb{E}_{c,f\sim q(c,f|x)}\left[\log q(c, f|x) - \log p(c, f|x)\right] \\
&= \mathbb{E}_{c,f\sim q(c,f|x)}\left[\log q(c, f|x) - \log p(c, f, x) + \log p(x)\right] \\
&= \mathbb{E}_{c,f\sim q(c,f|x)}\left[\log q(c, f|x) - \log p(c, f, x)\right] + \log p(x) \\
&= -\mathbb{E}_{c,f\sim q(c,f|x)}\left[\log p(c, f, x) - \log q(c, f|x)\right] + \log p(x)
\end{aligned}$$

Then, reordering the terms,

$$\log p(x) = \underbrace{\mathbb{E}_{c,f\sim q(c,f|x)}\left[\log p(c, f, x) - \log q(c, f|x)\right]}_{\mathcal{L}(\theta,\psi,\phi)=\mathcal{L}} + \underbrace{\mathrm{KL}(q(c, f|x)\,\|\,p(c, f|x))}_{\geq 0}$$

73

slide-106
SLIDE 106

Semi-supervised Clustering - Evidence Lower Bound

The variational lower bound, L, also called evidence lower bound (ELBO), can be expanded:

$$\begin{aligned}
\mathcal{L} &= \mathbb{E}_{q(c,f|x)}\left[\log p(c, f, x) - \log q(c, f|x)\right] \\
&= \mathbb{E}_{q(c,f|x)}\left[\log p(x|c, f) + \log p(c, f) - \log q(c, f|x)\right] \\
&= \mathbb{E}_{q(c,f|x)}\left[\log p(x|c, f)\right] - \mathbb{E}_{q(c,f|x)}\left[\log q(c, f|x) - \log p(c, f)\right] \\
&= \mathbb{E}_{q(c,f|x)}\left[\log p(x|c, f)\right] - \mathbb{E}_{q(f|x)}\left[\mathbb{E}_{q(c|f)}\left[\log q(c|f) + \log q(f|x) - \log p(c, f)\right]\right] \\
&= \mathbb{E}_{q(c,f|x)}\left[\log p(x|c, f)\right] - \mathbb{E}_{q(f|x)}\left[\mathbb{E}_{q(c|f)}\left[\log \frac{q(c|f)}{p(c|f)}\right] + \log q(f|x) - \log p(f)\right] \\
&= \mathbb{E}_{q(c,f|x)}\left[\log p(x|c, f)\right] - \mathbb{E}_{q(f|x)}\left[\mathrm{KL}(q(c|f)\,\|\,p(c|f)) + \log q(f|x) - \log p(f)\right] \\
&= \mathbb{E}_{q(c,f|x)}\left[\log p(x|c, f)\right] - \mathbb{E}_{q(f|x)}\left[\mathrm{KL}(q(c|f)\,\|\,p(c|f))\right] - \mathbb{E}_{q(f|x)}\left[\log \frac{q(f|x)}{p(f)}\right] \\
&= \mathbb{E}_{q(c,f|x)}\left[\log p(x|c, f)\right] - \mathbb{E}_{q(f|x)}\left[\mathrm{KL}(q(c|f)\,\|\,p(c|f))\right] - \mathrm{KL}(q(f|x)\,\|\,p(f))
\end{aligned}$$

74

slide-107
SLIDE 107

Unsupervised Clustering - Evidence Lower Bound

$$\begin{aligned}
\mathrm{KL}(q(z, c|\hat{x})\,\|\,p(z, c|x)) &= \sum_{z,c} q(z, c|\hat{x}) \log \frac{q(z, c|\hat{x})}{p(z, c|x)} = \mathbb{E}_{z,c\sim q(z,c|\hat{x})}\left[\log \frac{q(z, c|\hat{x})}{p(z, c|x)}\right] \\
&= \mathbb{E}_{z,c\sim q(z,c|\hat{x})}\left[\log q(z, c|\hat{x}) - \log p(z, c|x)\right] \\
&= \mathbb{E}_{z,c\sim q(z,c|\hat{x})}\left[\log q(z, c|\hat{x}) - \log p(z, c, x) + \log p(x)\right] \\
&= \mathbb{E}_{z,c\sim q(z,c|\hat{x})}\left[\log q(z, c|\hat{x}) - \log p(z, c, x)\right] + \log p(x) \\
&= -\mathbb{E}_{z,c\sim q(z,c|\hat{x})}\left[\log p(z, c, x) - \log q(z, c|\hat{x})\right] + \log p(x)
\end{aligned}$$

Then, reordering the terms,

$$\log p(x) = \underbrace{\mathbb{E}_{z,c\sim q(z,c|\hat{x})}\left[\log p(z, c, x) - \log q(z, c|\hat{x})\right]}_{\mathcal{L}(\theta,\phi)=\mathcal{L}} + \underbrace{\mathrm{KL}(q(z, c|\hat{x})\,\|\,p(z, c|x))}_{\geq 0}$$

75

slide-108
SLIDE 108

Unsupervised Clustering - Evidence Lower Bound

The variational lower bound, L, also called evidence lower bound (ELBO), can be expanded:

$$\begin{aligned}
\mathcal{L} &= \mathbb{E}_{q(z,c|\hat{x})}\left[\log p(z, c, x) - \log q(z, c|\hat{x})\right] \\
&= \mathbb{E}_{q(z,c|\hat{x})}\left[\log p(x|z, c) + \log p(z, c) - \log q(z, c|\hat{x})\right] \\
&= \mathbb{E}_{q(z,c|\hat{x})}\left[\log p(x|z, c)\right] - \mathbb{E}_{q(z,c|\hat{x})}\left[\log q(z, c|\hat{x}) - \log p(z, c)\right] \\
&= \mathbb{E}_{q(z,c|\hat{x})}\left[\log p(x|z, c)\right] - \mathbb{E}_{q(z|\hat{x})}\left[\mathbb{E}_{q(c|\hat{x})}\left[\log q(c|\hat{x}) + \log q(z|\hat{x}) - \log p(z, c)\right]\right] \\
&= \mathbb{E}_{q(z,c|\hat{x})}\left[\log p(x|z, c)\right] - \mathbb{E}_{q(z|\hat{x})}\left[\mathbb{E}_{q(c|\hat{x})}\left[\log \frac{q(c|\hat{x})}{p(c)}\right] + \log q(z|\hat{x}) - \log p(z)\right] \\
&= \mathbb{E}_{q(z,c|\hat{x})}\left[\log p(x|z, c)\right] - \mathbb{E}_{q(z|\hat{x})}\left[\mathrm{KL}(q(c|\hat{x})\,\|\,p(c)) + \log q(z|\hat{x}) - \log p(z)\right] \\
&= \mathbb{E}_{q(z,c|\hat{x})}\left[\log p(x|z, c)\right] - \mathbb{E}_{q(z|\hat{x})}\left[\mathrm{KL}(q(c|\hat{x})\,\|\,p(c))\right] - \mathbb{E}_{q(z|\hat{x})}\left[\log \frac{q(z|\hat{x})}{p(z)}\right] \\
&= \mathbb{E}_{q(z,c|\hat{x})}\left[\log p(x|z, c)\right] - \mathrm{KL}(q(c|\hat{x})\,\|\,p(c)) - \mathrm{KL}(q(z|\hat{x})\,\|\,p(z))
\end{aligned}$$

76

slide-109
SLIDE 109

Problems with stochastic latent variables

Gradients are difficult to obtain:

$$\nabla_\phi \mathcal{L}_{\theta,\phi}(x) = \nabla_\phi \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right] = \int \nabla_\phi\, q_\phi(z|x)\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right] dz$$

Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 77

slide-110
SLIDE 110

Continuous stochastic variables: Reparameterization Trick

We can 'externalize' the randomness in z by re-parameterizing the variable as a deterministic function: z = μ + σ ⊙ ε.

$$\nabla_\phi \mathcal{L}_{\theta,\phi}(x) = \nabla_\phi \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right] = \nabla_\phi \mathbb{E}_{p(\epsilon)}\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right]$$

Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 78

slide-111
SLIDE 111

Problems with stochastic latent variables

Gradients are difficult to obtain:

$$\nabla_\phi \mathcal{L}_{\theta,\phi}(x) = \nabla_\phi \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right] = \int \nabla_\phi\, q_\phi(z|x)\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right] dz$$

Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 79

slide-112
SLIDE 112

Discrete stochastic variables: Gumbel-Max Trick

Gradients are difficult to obtain:

Image Credit: UofG Machine Learning Research Group 80

slide-113
SLIDE 113

Discrete stochastic variables: Gumbel-Softmax

We can approximate arg max with a softmax.

As τ → 0 we obtain a one-hot vector; as τ → +∞ we obtain a uniform distribution.

Categorical Reparameterization with Gumbel-Softmax, Jang et al, ICLR'17
A Continuous Relaxation of Discrete Random Variables, Maddison et al, ICLR'17

81

slide-114
SLIDE 114

Clustering Metrics: Clustering Accuracy (ACC)

For a set of N input elements, this metric is defined as:

$$\mathrm{ACC} = \frac{\sum_{i=1}^{N} \mathbb{1}\{l_i = \mathrm{map}(c_i)\}}{N},$$

where:

  • li is the ground truth label of i-th input,
  • ci is the cluster assignment produced by the algorithm,
  • map(ci) is the optimal mapping function.
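A sketch of ACC using SciPy's Hungarian (Kuhn-Munkres) solver for the optimal mapping; this is an illustrative implementation, not the thesis code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels, clusters):
    """ACC = fraction of samples whose cluster, under the optimal cluster-to-label
    mapping (Kuhn-Munkres on the contingency table), matches the ground truth."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    k = max(labels.max(), clusters.max()) + 1
    contingency = np.zeros((k, k), dtype=np.int64)
    for l, c in zip(labels, clusters):
        contingency[c, l] += 1
    row, col = linear_sum_assignment(-contingency)   # negate to maximize matches
    return contingency[row, col].sum() / labels.size
```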

82

slide-115
SLIDE 115

Clustering Accuracy (ACC): Hungarian Algorithm

Predicted clusters (rows) vs. ground truth (columns)

  1. Calculate the contingency table between the clusters defined by the algorithm and the true categories.

         T1   T2   T3
    C1    4    1    1
    C2    2    5
    C3    1    5    2

83

slide-116
SLIDE 116

Clustering Accuracy (ACC): Hungarian Algorithm

Predicted clusters (rows) vs. ground truth (columns)

  2. Create a bipartite graph from the contingency table.

         T1   T2   T3
    C1    4    1    1
    C2    2    5
    C3    1    5    2

83

slide-117
SLIDE 117

Clustering Accuracy (ACC): Hungarian Algorithm

Predicted clusters (rows) vs. ground truth (columns)

  3. Apply the Kuhn-Munkres algorithm.

         T1   T2   T3
    C1    4    1    1
    C2    2    5
    C3    1    5    2

83

slide-118
SLIDE 118

Clustering Accuracy (ACC): Simple Approach

Probabilities q(c|x) after training.

    C1     C2     C3     T
    0.6    0.2    0.2    2
    0.1    0.4    0.5    1
    0.8    0.1    0.1    1
    0.3    0.6    0.1    3
    0.1    0.75   0.15   2
    0.7    0.1    0.2    1
    0.2    0.1    0.7    2
    0.05   0.05   0.9    2
    0.7    0.2    0.1    3
    0.1    0.8    0.1    3
    0.1    0.6    0.3    1

For each cluster k, we find the validation example xi that maximizes q(ck|xi) and assign the label of xi to all the elements that were assigned to cluster k.
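A sketch of this simple mapping rule (illustrative only; variable names are placeholders, not from the thesis):

```python
import numpy as np

def simple_cluster_to_label_map(q, labels):
    """For each cluster k, take the label of the validation example that
    maximizes q(c_k | x_i) and assign it to every element of cluster k.

    q:      responsibilities q(c|x), shape (N, K).
    labels: ground-truth labels of the validation examples, shape (N,).
    """
    mapping = {}
    for k in range(q.shape[1]):
        i_best = int(np.argmax(q[:, k]))     # most confident example for cluster k
        mapping[k] = int(labels[i_best])
    return mapping

# usage: the predicted label of sample i is mapping[np.argmax(q[i])]
```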

84

slide-119
SLIDE 119

Clustering Accuracy (ACC): Simple Approach

Cluster 1 Assignments

    C1     C2     C3     T
    0.6    0.2    0.2    2
    0.1    0.4    0.5    1
    0.8    0.1    0.1    1
    0.3    0.6    0.1    3
    0.1    0.75   0.15   2
    0.7    0.1    0.2    1
    0.2    0.1    0.7    2
    0.05   0.05   0.9    2
    0.7    0.2    0.1    3
    0.1    0.8    0.1    3
    0.1    0.6    0.3    1

85

slide-120
SLIDE 120

Clustering Accuracy (ACC): Simple Approach

Cluster 1 is mapped to Ground Truth 1

    C1     C2     C3     T
    0.6    0.2    0.2    2
    0.1    0.4    0.5    1
    0.8    0.1    0.1    1
    0.3    0.6    0.1    3
    0.1    0.75   0.15   2
    0.7    0.1    0.2    1
    0.2    0.1    0.7    2
    0.05   0.05   0.9    2
    0.7    0.2    0.1    3
    0.1    0.8    0.1    3
    0.1    0.6    0.3    1

85

slide-121
SLIDE 121

Clustering Accuracy (ACC): Simple Approach

Table 1: Assignment of labels to clusters based on maximum q(ck|xi)

    C1     C2     C3     T
    0.6    0.2    0.2    2
    0.1    0.4    0.5    1
    0.8    0.1    0.1    1
    0.3    0.6    0.1    3
    0.1    0.75   0.15   2
    0.7    0.1    0.2    1
    0.2    0.1    0.7    2
    0.05   0.05   0.9    2
    0.7    0.2    0.1    3
    0.1    0.8    0.1    3
    0.1    0.6    0.3    1

85

slide-122
SLIDE 122

Clustering Metrics: Normalized Mutual Information (NMI)

For two arbitrary variables T and C, representing the ground truth labels and cluster labels respectively, NMI is defined as follows:

$$\mathrm{NMI}(T, C) = \frac{I(T, C)}{\sqrt{H(T)\,H(C)}}, \qquad (1)$$

where:

  • I(T, C) denotes the mutual information between T and C,
  • H(∗) denotes the entropy.
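A sketch of NMI with the geometric-mean normalization reconstructed above (an assumption; scikit-learn's normalized_mutual_info_score with average_method='geometric' gives the same quantity):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def nmi(true_labels, cluster_labels):
    """NMI(T, C) = I(T, C) / sqrt(H(T) * H(C)), computed in nats."""
    def entropy(x):
        _, counts = np.unique(x, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))
    i_tc = mutual_info_score(true_labels, cluster_labels)   # mutual information in nats
    return i_tc / np.sqrt(entropy(true_labels) * entropy(cluster_labels))
```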

86

slide-123
SLIDE 123

Mutual Information (MI)

Mutual information quantifies the information shared by the two clusterings. MI tells us the reduction in the entropy of the class labels that we get if we know the cluster labels (similar to information gain in decision trees): I(T, C) = H(T) − H(T|C)

87

slide-124
SLIDE 124

Mutual Information (MI)

Perfect Correlation (NMI=1) Independent (NMI = 0)

88

slide-125
SLIDE 125

Semi-supervised Clustering: SVHN Image Generation

Use feature vector obtained from gφ(x) and vary the category c (one-hot).

89