Deep Generative Models for Clustering: A Semi-supervised and Unsupervised Approach
Jhosimar George Arias Figueroa. Advisor: Gerberth Adín Ramírez Rivera. Master's Thesis Defense
February 19, 2018
Motivation
Huge amount of unlabeled data!
Image Credit: Ruslan Salakhutdinov
Labeling huge amounts of data is hard and expensive. Can we discover the underlying hidden structure of the data in a semi-supervised or unsupervised way?
1
What is Clustering?
Goal: Find distinct groups such that similar elements belong to the same group.
2
What if we have a small amount of labeled data?
Large amount of Unlabeled data (ImageNet) Small amount of Labeled data (CIFAR-10)
Can we learn good representations and cluster data in a semi-supervised way?
3
Two types of Semi-supervised Clustering:
Class labels (seeded points); pairwise constraints (must-link or cannot-link)
Our work focuses on the first type of semi-supervised clustering: Use of labels as seeds
4
Semi-supervised Clustering: Related Works
Intuition
5
Learning Representations: Neural Networks
How to learn feature representations? Train such that features can be used to perform classification (supervised learning).
6
Auxiliary Task: Semi-supervised Embedding
7
Learning Representations: Autoencoder
How to learn feature representations? Train such that the features can be used to reconstruct the original data. Autoencoding: encoding the data itself.
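A minimal autoencoder sketch in PyTorch; the layer sizes, activations, and MSE objective here are illustrative assumptions, not the thesis architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoEncoder(nn.Module):
    def __init__(self, in_dim=784, feat_dim=150):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 500), nn.ReLU(),
                                     nn.Linear(500, feat_dim))
        self.decoder = nn.Sequential(nn.Linear(feat_dim, 500), nn.ReLU(),
                                     nn.Linear(500, in_dim), nn.Sigmoid())

    def forward(self, x):
        f = self.encoder(x)           # feature representation
        return self.decoder(f), f     # reconstruction and features

model = AutoEncoder()
x = torch.rand(32, 784)               # toy batch of flattened images
x_rec, f = model(x)
loss = F.mse_loss(x_rec, x)           # learn features by reconstructing the input
loss.backward()
```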
8
Auxiliary Task: Clean Encoder and Denoising Decoder
9
Variational Autoencoder
Posterior distribution pθ(z|x) is intractable because of pθ(x):
pθ(z|x) = pθ(x|z) pθ(z) / pθ(x),   pθ(x) = ∫ pθ(x, z) dz
Approximate pθ(z|x) with a tractable distribution qφ(z|x)
Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 10
Variational Autoencoder - Lower Bound
Variational Lower Bound:
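Restating the bound (it appears as an image on the slide; the form below follows the appendix derivation):

log pθ(x) ≥ L(θ, φ) = E_{qφ(z|x)} [log pθ(x|z)] − KL(qφ(z|x) || pθ(z))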
Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 11
Generative Models for Semi-supervised learning
Probabilistic graphical models (inference and generative) of M1+M2, Kingma et al., NIPS'14, and of Auxiliary Deep Generative Models, Maaløe et al., ICML'16. 12
Stochastic Continuous Nodes: Reparameterization Trick
We can 'externalize' the randomness in z by re-parameterizing it as a deterministic function: z = µ + σ ⊙ ǫ, with ǫ ∼ N(0, 1). 13
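A one-line sketch of the trick in PyTorch (the tensor shapes and the log-variance parameterization are assumptions):

```python
import torch

mu, log_var = torch.zeros(32, 20), torch.zeros(32, 20)   # encoder outputs (illustrative shapes)
eps = torch.randn_like(mu)                                # ǫ ~ N(0, I): all randomness lives here
z = mu + torch.exp(0.5 * log_var) * eps                   # z = µ + σ ⊙ ǫ is deterministic given (µ, σ)
```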
Stochastic Discrete Nodes: Gumbel-Max Trick
We can sample from a categorical distribution with the Gumbel-Max trick. 14
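A sketch of Gumbel-Max sampling (the class probabilities are illustrative); note that the argmax makes this version non-differentiable:

```python
import torch

logits = torch.log(torch.tensor([0.2, 0.5, 0.3]))          # log-probabilities of a 3-way categorical
u = torch.rand_like(logits).clamp_min(1e-20)
gumbel = -torch.log(-torch.log(u))                         # g_k ~ Gumbel(0, 1)
sample = torch.argmax(logits + gumbel)                     # argmax_k (log π_k + g_k) is a sample from Cat(π)
```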
Stochastic Discrete Nodes: Gumbel-Softmax distribution
We can approximate arg max with a softmax. 15
Avoiding marginalization over discrete variables
Marginalizing out c over all categories is an expensive process that requires multiple gradient estimations. Gumbel-Softmax allows us to backpropagate through a single-sample gradient estimate.
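A short sketch using PyTorch's built-in Gumbel-Softmax relaxation (batch size, number of classes, and temperature are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(32, 10, requires_grad=True)           # unnormalized class scores for a batch
c_soft = F.gumbel_softmax(logits, tau=1.0)                  # one relaxed sample per example, no marginalization
c_hard = F.gumbel_softmax(logits, tau=1.0, hard=True)       # one-hot forward pass, straight-through gradients
c_soft.sum().backward()                                     # gradients flow through the single sample
```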
16
Semi-supervised Clustering: Proposed Method
Intuition
17
Proposed Probabilistic Model
Inference Model Generative Model
Proposed probabilistic model
18
Proposed Probabilistic Model
Inference Model Generative Model
Variational Lower Bound:
19
Proposed Loss Function
Variational Lower Bound: Use of Auxiliary Tasks:
20
Proposed Architecture
Inference Model Generative Model
Proposed Architecture:
21
Reconstruction Loss
LR = (1/N) Σ_{i=1}^{N} (1/|xi|) ‖xi − x̃i‖²
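A direct translation of LR as written above (a batch of flattened inputs is assumed):

```python
import torch

def reconstruction_loss(x, x_rec):
    # LR = (1/N) Σ_i (1/|xi|) ||xi − x̃i||²: per-sample squared error normalized by input dimensionality
    return ((x - x_rec) ** 2).sum(dim=1).div(x.size(1)).mean()

x, x_rec = torch.rand(8, 784), torch.rand(8, 784)           # toy batch
print(reconstruction_loss(x, x_rec))
```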
22
Clustering Loss
LC = KL(qik || U(0, 1)) = (1/(N log K)) Σ_{i=1}^{N} Σ_{k=1}^{K} qik log(K qik)
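A sketch of LC as reconstructed above: the KL divergence between the per-sample cluster posteriors qik and a uniform prior over the K clusters (the log K normalization follows the reconstruction above and may differ from the thesis):

```python
import math
import torch

def clustering_loss(q, eps=1e-8):
    # q: (N, K) cluster posteriors; LC = 1/(N log K) Σ_i Σ_k qik log(K qik)
    K = q.size(1)
    return (q * (K * q + eps).log()).sum(dim=1).mean() / math.log(K)

q = torch.softmax(torch.randn(8, 10), dim=1)                # toy posteriors
print(clustering_loss(q))
```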
23
Assignment Process
24
Assignment Loss
NLL = −log p(c|x)
LA = −(1/N) Σ_{i=1}^{N} log p(di | xi)
In its final form, the per-sample terms are passed through a tanh:
LA = (1/(2N)) Σ_{i=1}^{N} tanh(…)
25
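A sketch of the negative log-likelihood form of LA, applied only to the seeded (labeled) samples; the tanh variant is not reproduced, and the shapes and helper name are assumptions:

```python
import torch
import torch.nn.functional as F

def assignment_loss(q_labeled, seeds):
    # q_labeled: (Nl, K) cluster probabilities of the seeded samples; seeds: (Nl,) their cluster indices d_i
    # LA = -(1/Nl) Σ_i log p(d_i | x_i)
    return F.nll_loss(torch.log(q_labeled + 1e-8), seeds)

q = torch.softmax(torch.randn(4, 10), dim=1)
d = torch.tensor([0, 3, 3, 7])                              # toy seed assignments
print(assignment_loss(q, d))
```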
Feature Loss
D(f, r) = (1/(√2 |r|)) Σ_{l=1}^{|r|} ‖f − rl‖
LF = (α/L) Σ_{i=1}^{L} D(fi, ri) + ((1 − α)/U) Σ_{j=1}^{U} D(fj, rj)
α weights the importance of the labeled distances. 27
Experiments and Results
Datasets
Synthetic data; MNIST (70,000 images, 10 classes, 28×28); SVHN (99,289 images, 10 classes, 32×32)
28
Analysis of Normalized Loss Functions
[Plot: training loss (%) of LA, LR, LC, and LF vs. iteration number]
Training loss at different epochs
[Plots: probabilities (π) vs. iteration number for (a) wA = 0, (b) wA = 1, (c) wA = 5]
Effect of the assignment loss (LA) on the categorical loss (LC) at different epochs 29
Hyperparameter Selection
[Plots: ACC (%) and NMI (%) vs. iteration number for the weights wC and wA]
Performance of the weights (wA, wC) at different iterations
Selected hyperparameters of our model.
Hyperparameters: fsz = 150, η = 0.001, τ = 1.0, α = 0.6, κ = 1, wA = 10
30
Synthetic Data - Circles
few labeled samples iteration = 1 iteration = 5 iteration = 8 iteration = 10 iteration = 12
31
Synthetic Data - Moons
few labeled samples iteration = 1 iteration = 3 iteration = 5 iteration = 10 iteration = 20
32
Synthetic Data - Moons
few labeled samples iteration = 1 iteration = 3 iteration = 5 iteration = 10 iteration = 20
33
Clustering: Performance
Clustering Accuracy (ACC) and Normalized Mutual Information (NMI), on MNIST for different unsupervised algorithms.
Model                                     NMI      ACC
GMVAE, Dilokthanakul et al., arXiv'16     –        –
VaDE, Jiang et al., IJCAI'17              –        –
JULE-SF, Yang et al., CVPR'16             0.876    0.940
JULE-RC, Yang et al., CVPR'16             0.915    0.961
DEPICT, Dizaji et al., arXiv'17           0.916    0.965
Proposed                                  0.954    0.984
Note that our results are not directly comparable with unsupervised methods. However, we want to show our model's clustering results. Larger values for ACC and NMI indicate better performance.
34
Classification: MNIST Performance
Semi-supervised test error (%) benchmarks on MNIST for 100 randomly and evenly distributed labeled examples.
Model                                     Test error (100 labels)
SWWAE, Zhao et al., ICLR'16               8.71 (± 0.34)
EmbedCNN, Weston et al., ICML'08          7.75
Small-CNN, Rasmus et al., NIPS'15         6.43 (± 0.84)
M1+M2, Kingma et al., NIPS'14             3.33 (± 0.14)
DEPICT, Dizaji et al., arXiv'17           2.65 (± 0.35)
Conv-CatGAN, Springenberg, ICLR'16        1.39 (± 0.28)
SDGM, Maaløe et al., ICML'16              1.32 (± 0.07)
ADGM, Maaløe et al., ICML'16              0.96 (± 0.02)
Improved GAN, Salimans et al., NIPS'16    0.93 (± 0.65)
Conv-Ladder, Rasmus et al., NIPS'15       0.89 (± 0.50)
Proposed                                  1.65 (± 0.10)
Smaller values for test error indicate better performance. All the results of the related works are reported from the original papers. Colored rows denote Bayesian methods.
35
Classification: SVHN Performance
Semi-supervised test error (%) benchmarks on SVHN for 1000 randomly and evenly distributed labeled examples.
Model                                     Test error (1000 labels)
M1+TSVM, Kingma et al., NIPS'14           55.33 (± 0.11)
M1+M2, Kingma et al., NIPS'14             36.02 (± 0.10)
SWWAE, Zhao et al., ICLR'16               23.56
ADGM, Maaløe et al., ICML'16              22.86
SDGM, Maaløe et al., ICML'16              16.61 (± 0.24)
Improved GAN, Salimans et al., NIPS'16    8.11 (± 1.3)
Proposed                                  21.74 (± 0.41)
Smaller values for test error indicate better performance. All the results of the related works are reported from the original papers.
37
Clustering: Visualization MNIST
(a) Epoch 1 (b) Epoch 5 (c) Epoch 20 (d) Epoch 50 (e) Epoch 80 (f) Epoch 100
38
Image Generation
Use the feature vector obtained from gφ(x) and vary the category c (one-hot).
39
Unsupervised Clustering
What if we have no labeled data?
Large amount of Unlabeled data (ImageNet) No Labeled data (CIFAR-10)
Can we learn good representations and cluster data in an unsupervised way?
40
Unsupervised Clustering: Related Works
Intuition
41
Use of pretrained features
42
Fine-tuning
43
End-To-End
44
Complex Structure: Generative Models
45
Unsupervised Clustering: Proposed Method
Intuition
46
Our Probabilistic Model
Inference Model Generative Model 47
Stacked M1+M2 generative model
Probabilistic graphical models (inference and generative) of the M1 model, the M2 model, and the stacked M1+M2. Semi-supervised Learning with Deep Generative Models, Kingma et al., NIPS'14. 48
Stacked M1+M2: Training
Train the M1 model and use its feature representations to train the M2 model separately.
49
Problem with hierarchical stochastic latent variables
Inactive Stochastic Units:
Ladder Variational Autoencoders, Sønderby et al., NIPS'16
Solutions require complex models:
Probabilistic graphical model (inference and generative) of Auxiliary Deep Generative Models, Maaløe et al., ICML'16. 51
Avoiding the problem of hierarchical stochastic variables
Replace the stochastic layer that produces x with a deterministic one, x̂ = g(x).
52
Other Differences with M1+M2 model
discrete variables.
53
Variational Lower Bound
Inference Model Generative Model
Loss Function:
Ltotal = LR + LC + LG
54
Proposed Model
55
Reconstruction Loss
Ltotal = LR + LC + LG
LBCE = −(x log x̃ + (1 − x) log(1 − x̃))
LMSE = ‖x − x̃‖²
56
Gaussian and Categorical Regularizers
Ltotal = LR + LC + LG
LG = KL(N(µ(x), σ(x)) || N(0, 1)) = −(1/2) Σ_{k=1}^{K} (1 + log σk² − µk² − σk²)
LC = KL(Cat(π) || U(0, 1)) = Σ_{k=1}^{K} πk log(K πk)
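A sketch of both regularizers exactly as written above; averaging over the batch is an assumption:

```python
import torch

def gaussian_regularizer(mu, log_var):
    # LG = KL(N(µ(x), σ(x)) || N(0, 1)) = -1/2 Σ_k (1 + log σ_k² − µ_k² − σ_k²), averaged over the batch
    return (-0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(dim=1)).mean()

def categorical_regularizer(pi, eps=1e-8):
    # LC = KL(Cat(π) || U(0, 1)) = Σ_k π_k log(K π_k), averaged over the batch
    K = pi.size(1)
    return (pi * (K * pi + eps).log()).sum(dim=1).mean()

mu, log_var = torch.randn(8, 20), torch.randn(8, 20)        # toy encoder outputs
pi = torch.softmax(torch.randn(8, 10), dim=1)
print(gaussian_regularizer(mu, log_var), categorical_regularizer(pi))
```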
57
Experiments
Datasets
MNIST (70,000 images, 10 classes, 28×28); USPS (9,298 images, 10 classes, 16×16); REUTERS-10K (10,000 documents, 4 classes)
58
Analysis of Clustering Performance
[Plot: ACC and NMI (%) vs. iteration number]
Clustering performance at each epoch, considering all loss weights equal to 1
59
Analysis of loss function weights
Ltotal = LR + LC + wG LG
[Plots: ACC (%) and NMI (%) vs. loss function weight (w∗) for the weights of LR, LC, and LG]
60
Quantitative Results: Clustering Performance
Clustering performance, ACC (%) and NMI (%), on all datasets.
Method      MNIST ACC    MNIST NMI    USPS ACC     USPS NMI     REUTERS-10K ACC   REUTERS-10K NMI
k-means     53.24        53.73        –            –            –                 –
–           81.82        74.73        69.31        66.20        70.52             39.79
AE+GMM      82.18        82.31 (±4)   83.00        81.00        –                 –
–           86.55        83.72        74.08        75.29        73.68             49.76
IDEC        88.06        86.72        76.05        78.46        75.64             49.81
VaDE        94.46        –            –            –            –                 –
–           85.75 (±8)   82.13 (±5)   72.58 (±3)   67.01 (±2)   80.41 (±5)        52.13 (±5)
Larger values for ACC and NMI indicate better performance. Colored rows denote methods that require pre-training. All the results of the related works are reported from the original papers.
61
Quantitative Results: Classification - MNIST Performance
MNIST test error-rate (%) for kNN.
Method      k = 3    k = 5    k = 10
VAE         18.43    15.69    14.19
DLGMM       9.14     8.38     8.42
VaDE        2.20     2.14     2.22
Proposed    3.46     3.30     3.44
Smaller values for test error indicate better performance.
62
Qualitative Results: Image Generation
10 clusters
Fix the category c (one-hot) and vary the latent variable z.
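A sketch of this generation procedure; the decoder interface (concatenating z and the one-hot c) and the dimensions are assumptions:

```python
import torch
import torch.nn as nn

def generate(decoder, category, n_samples=8, z_dim=20, n_classes=10):
    # Fix the category c as a one-hot vector and vary the Gaussian latent z to obtain different styles.
    c = torch.zeros(n_samples, n_classes)
    c[:, category] = 1.0
    z = torch.randn(n_samples, z_dim)
    return decoder(torch.cat([z, c], dim=1))                  # assumed interface: decoder([z, c]) -> images

toy_decoder = nn.Linear(30, 784)                              # stand-in for the trained decoder network
images = generate(toy_decoder, category=3)
print(images.shape)                                           # torch.Size([8, 784])
```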
63
Qualitative Results: Image Generation
7 clusters 14 clusters
64
Qualitative Results: Style Generation
Input a test image x (first column) through qφ(z|x̂).
Use the vector obtained from qφ(z|x̂) and vary the category c (one-hot).
65
Qualitative Results: Clustering Visualization
(a) Epoch 1 (b) Epoch 5 (c) Epoch 20 (d) Epoch 50 (e) Epoch 150 (f) Epoch 300
Visualization of the feature representations on MNIST data set at different epochs.
66
Conclusions and Future Work
Contributions
For semi-supervised clustering our contributions were:
assignments.
task to guide the learning process.
67
Contributions
For unsupervised clustering our contributions were:
problem of hierarchical stochastic variables, allowing end-to-end learning.
simple Gaussian and categorical distribution.
68
Future Work
agglomerative clustering, etc.) over the feature representations to improve the learning process.
performed by using generative adversarial models (GANs).
69
Publications
Arias Figueroa, J. and Ramírez Rivera, A. Learning to Cluster with Auxiliary Tasks: A Semi-Supervised Approach. In Conference on Graphics, Patterns and Images (SIBGRAPI), Niterói, Brazil, 2017.
Source code: https://gitlab.com/mipl/clustering-sibgrapi-2017
Arias Figueroa, J. and Ramírez Rivera, A. Is Simple Better?: Revisiting Simple Generative Models for Unsupervised Clustering. In Second Workshop on Bayesian Deep Learning (NIPS 2017).
Source code: https://gitlab.com/mipl/simple-vae-clustering
70
71
Appendix
Variational Autoencoder - Lower Bound
KL(q(z|x) || p(z|x)) = ∫ q(z|x) log [q(z|x) / p(z|x)] dz = E_{z∼q(z|x)} [log q(z|x) − log p(z|x)]
= E_{z∼q(z|x)} [log q(z|x) − log p(x, z) + log p(x)]
= E_{z∼q(z|x)} [log q(z|x) − log p(x, z)] + log p(x)
= −E_{z∼q(z|x)} [log p(x, z) − log q(z|x)] + log p(x)
= −E_{z∼q(z|x)} [log p(x|z) + log p(z) − log q(z|x)] + log p(x)
= −E_{z∼q(z|x)} [log p(x|z)] + E_{z∼q(z|x)} [log q(z|x) − log p(z)] + log p(x)
Then, reordering the terms,
log p(x) = E_{qφ(z|x)} [log pθ(x|z)] − KL(qφ(z|x) || pθ(z)) + KL(qφ(z|x) || pθ(z|x))
log pθ(x) ≥ L(θ, φ)   (Variational Lower Bound, ELBO)
θ∗, φ∗ = arg max_{θ,φ} L(θ, φ)
Training: maximize the lower bound.
72
Semi-supervised Clustering - Evidence Lower Bound
KL(q(c, f|x) || p(c, f|x)) = ∫ q(c, f|x) log [q(c, f|x) / p(c, f|x)] = E_{c,f∼q(c,f|x)} [log q(c, f|x) − log p(c, f|x)]
= E_{c,f∼q(c,f|x)} [log q(c, f|x) − log p(c, f, x) + log p(x)]
= E_{c,f∼q(c,f|x)} [log q(c, f|x) − log p(c, f, x)] + log p(x)
= −E_{c,f∼q(c,f|x)} [log p(c, f, x) − log q(c, f|x)] + log p(x)
Then, reordering the terms,
log p(x) = E_{c,f∼q(c,f|x)} [log p(c, f, x) − log q(c, f|x)] + KL(q(c, f|x) || p(c, f|x))
73
Semi-supervised Clustering - Evidence Lower Bound
The variational lower bound, L, also called the evidence lower bound (ELBO), can be expanded:
L = E_{q(c,f|x)} [log p(c, f, x) − log q(c, f|x)]
= E_{q(c,f|x)} [log p(x|c, f) + log p(c, f) − log q(c, f|x)]
= E_{q(c,f|x)} [log p(x|c, f)] − E_{q(c,f|x)} [log q(c, f|x) − log p(c, f)]
= E_{q(c,f|x)} [log p(x|c, f)] − E_{q(f|x)} [KL(q(c|f) || p(c|f))] − E_{q(f|x)} [log q(f|x) − log p(f)]
= E_{q(c,f|x)} [log p(x|c, f)] − E_{q(f|x)} [KL(q(c|f) || p(c|f))] − KL(q(f|x) || p(f))
74
Unsupervised Clustering - Evidence Lower Bound
KL(q(z, c|x̂) || p(z, c|x)) = ∫ q(z, c|x̂) log [q(z, c|x̂) / p(z, c|x)] = E_{z,c∼q(z,c|x̂)} [log q(z, c|x̂) − log p(z, c|x)]
= E_{z,c∼q(z,c|x̂)} [log q(z, c|x̂) − log p(z, c, x) + log p(x)]
= E_{z,c∼q(z,c|x̂)} [log q(z, c|x̂) − log p(z, c, x)] + log p(x)
= −E_{z,c∼q(z,c|x̂)} [log p(z, c, x) − log q(z, c|x̂)] + log p(x)
Then, reordering the terms,
log p(x) = E_{z,c∼q(z,c|x̂)} [log p(z, c, x) − log q(z, c|x̂)] + KL(q(z, c|x̂) || p(z, c|x))
75
Unsupervised Clustering - Evidence Lower Bound
The variational lower bound, L, also called the evidence lower bound (ELBO), can be expanded:
L = E_{q(z,c|x̂)} [log p(z, c, x) − log q(z, c|x̂)]
= E_{q(z,c|x̂)} [log p(x|z, c) + log p(z, c) − log q(z, c|x̂)]
= E_{q(z,c|x̂)} [log p(x|z, c)] − E_{q(z,c|x̂)} [log q(z, c|x̂) − log p(z, c)]
= E_{q(z,c|x̂)} [log p(x|z, c)] − E_{q(z|x̂)} [E_{q(c|x̂)} [log q(c|x̂) − log p(c)] + log q(z|x̂) − log p(z)]
= E_{q(z,c|x̂)} [log p(x|z, c)] − E_{q(z|x̂)} [KL(q(c|x̂) || p(c)) + log q(z|x̂) − log p(z)]
= E_{q(z,c|x̂)} [log p(x|z, c)] − E_{q(z|x̂)} [KL(q(c|x̂) || p(c))] − E_{q(z|x̂)} [log q(z|x̂) − log p(z)]
= E_{q(z,c|x̂)} [log p(x|z, c)] − KL(q(c|x̂) || p(c)) − KL(q(z|x̂) || p(z))
76
Problems with stochastic latent variables
Gradients are difficult to obtain: ∇φ Lθ,φ(x) = ∇φ E_{qφ(z|x)} [log pθ(x, z) − log qφ(z|x)], and the expectation is taken with respect to a distribution that itself depends on φ.
Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 77
Continuous stochastic variables: Reparameterization Trick
We can 'externalize' the randomness in z by re-parameterizing the variable as a deterministic function: z = µ + σ ⊙ ǫ, with ǫ ∼ N(0, 1).
∇φ Lθ,φ(x) = ∇φ E_{qφ(z|x)} [log pθ(x, z) − log qφ(z|x)] = ∇φ E_{p(ǫ)} [log pθ(x, z) − log qφ(z|x)]
Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 78
Problems with stochastic latent variables
Gradients are difficult to obtain: ∇φ Lθ,φ(x) = ∇φ E_{qφ(z|x)} [log pθ(x, z) − log qφ(z|x)], and the expectation is taken with respect to a distribution that itself depends on φ.
Auto-Encoding Variational Bayes, Kingma et al, ICLR’14 79
Discrete stochastic variables: Gumbel-Max Trick
Gradients are difficult to obtain:
Image Credit: UofG Machine Learning Research Group 80
Discrete stochastic variables: Gumbel-Softmax
We can approximate arg max with a softmax.
As τ → 0 we obtain a one-hot vector; as τ → +∞ we obtain a uniform distribution. Categorical Reparameterization with Gumbel-Softmax, Jang et al., ICLR'17. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, Maddison et al., ICLR'17. 81
Clustering Metrics: Clustering Accuracy (ACC)
For a set of N input elements, this metric is defined as:
ACC = (1/N) Σ_{i=1}^{N} 1{li = map(ci)},
where li is the ground-truth label of element i, ci is its cluster assignment, and map(·) is the best one-to-one mapping from clusters to ground-truth labels.
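A sketch of ACC with the cluster-to-label mapping found by the Hungarian algorithm; integer labels in 0..K−1 are assumed:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(labels, clusters):
    # Build the contingency matrix and find the cluster-to-label mapping that maximizes agreement.
    K = max(labels.max(), clusters.max()) + 1
    counts = np.zeros((K, K), dtype=np.int64)
    for l, c in zip(labels, clusters):
        counts[c, l] += 1
    rows, cols = linear_sum_assignment(-counts)               # Hungarian algorithm on negated counts
    mapping = dict(zip(rows, cols))
    return np.mean([mapping[c] == l for l, c in zip(labels, clusters)])

labels = np.array([0, 0, 1, 1, 2, 2])
clusters = np.array([1, 1, 0, 0, 2, 2])                       # permuted but otherwise perfect clustering
print(clustering_accuracy(labels, clusters))                  # 1.0
```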
82
Clustering Accuracy (ACC): Hungarian Algorithm
Contingency table between the clusters predicted by the algorithm and the true categories:
      T1   T2   T3
C1     4    1    1
C2     2    5
C3     1    5    2
83
Clustering Accuracy (ACC): Simple Approach
Probabilities q(c|x) after training.
C1     C2     C3     T
0.60   0.20   0.20   2
0.10   0.40   0.50   1
0.80   0.10   0.10   1
0.30   0.60   0.10   3
0.10   0.75   0.15   2
0.70   0.10   0.20   1
0.20   0.10   0.70   2
0.05   0.05   0.90   2
0.70   0.20   0.10   3
0.10   0.80   0.10   3
0.10   0.60   0.30   1
For each cluster k, we find the validation example xi that maximizes q(ck|xi) and assign the label of xi to all the elements that were assigned to cluster k.
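A sketch of this simple mapping (array shapes are assumptions; the toy values are the first three rows of the table above):

```python
import numpy as np

def map_clusters_by_max_probability(q, val_labels):
    # q: (N, K) validation probabilities q(ck|xi); val_labels: (N,) ground-truth labels.
    # For each cluster k, take the label of the validation example with the highest q(ck|xi).
    best_example = q.argmax(axis=0)                           # most confident example per cluster
    return val_labels[best_example]                           # label assigned to each cluster k

q = np.array([[0.6, 0.2, 0.2],
              [0.1, 0.4, 0.5],
              [0.8, 0.1, 0.1]])
labels = np.array([2, 1, 1])
print(map_clusters_by_max_probability(q, labels))             # label mapped to clusters C1, C2, C3
```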
84
Cluster 1 assignments: cluster 1 is mapped to ground-truth class 1.
Table 1: Assignment of labels to clusters based on maximum q(ck|xi).
85
Clustering Metrics: Normalized Mutual Information (NMI)
For two arbitrary variables T and C, representing the ground-truth labels and cluster labels respectively, NMI is defined as follows:
NMI(T, C) = I(T, C) / max(H(T), H(C)),   (1)
where I(T, C) is the mutual information between T and C, and H(·) denotes entropy.
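NMI is readily available in scikit-learn; the normalization of I(T, C) (max, geometric, or arithmetic mean of the entropies) varies across papers, so the variant is passed explicitly in this sketch:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

labels = np.array([0, 0, 1, 1, 2, 2])
clusters = np.array([1, 1, 0, 0, 2, 2])
# NMI is invariant to relabeling of the clusters.
print(normalized_mutual_info_score(labels, clusters, average_method='max'))   # 1.0
```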
86
Mutual Information (MI)
Mutual Information quantifies the information shared by the two labelings. MI tells us the reduction in the entropy of the class labels that we get if we know the cluster labels (similar to information gain in decision trees): I(T, C) = H(T) − H(T|C)
87
Mutual Information (MI)
Perfect Correlation (NMI=1) Independent (NMI = 0)
88
Semi-supervised Clustering: SVHN Image Generation
Use the feature vector obtained from gφ(x) and vary the category c (one-hot).
89