Unsupervised Learning Jointly with Image Clustering
Jianwei Yang, Devi Parikh, Dhruv Batra
Virginia Tech
https://filebox.ece.vt.edu/~jw2yang/
Huge amount of images available! We want to learn from them without annotation effort. What do we need to learn?
Jain, Anil K., M. Narasimha Murty, and Patrick J. Flynn. "Data clustering: a review." ACM Computing Surveys (CSUR) 31.3 (1999): 264-323.

Classical clustering approaches:
- K-means (image credit: Jesse Johnson)
- Hierarchical clustering
- Spectral clustering, Manor et al., NIPS’04
- Graph cut, Shi et al., TPAMI’00
- DBSCAN, Ester et al., KDD’96 (image credit: Jesse Johnson)
- EM algorithm, Dempster et al., JRSS’77
- NMF, Xu et al., SIGIR’03 (image credit: Conrad Lee)
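As a quick illustration of the contrast between the partitional and hierarchical methods above, here is a minimal sketch on toy data; the dataset and parameters are illustrative, using scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

# Toy 2-D data with four blobs.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# K-means: partitional, needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Agglomerative: hierarchical, repeatedly merges the closest pair of clusters.
agglo_labels = AgglomerativeClustering(n_clusters=4, linkage="average").fit_predict(X)

print(np.bincount(kmeans_labels), np.bincount(agglo_labels))
```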
Classical representation learning approaches:
- PCA (image credit: Jesse Johnson)
- ICA (image credit: Shylaja et al.)
- t-SNE, Maaten et al., JMLR’08
- Subspace clustering, Vidal et al.
- Sparse coding, Olshausen et al., Vision Research’97
Yoshua Bengio, Aaron Courville, and Pierre Vincent. "Representation learning: a review and new perspectives." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8 (2013): 1798-1828.

Deep unsupervised representation learning:
- Autoencoder, Hinton et al., Science’06 (image credit: Jesse Johnson)
- DBN, Hinton et al., Science’06
- DBM, Salakhutdinov et al., AISTATS’09
- Bengio et al., TPAMI’13
- VAE, Kingma et al., arXiv’13 (image credit: Fast Forward Labs)
- GAN, Goodfellow et al., NIPS’14
- DCGAN, Radford et al., arXiv’15 (image credit: Mike Swarbrick Jones)
Self-supervised learning from context:
- Spatial context, Doersch et al., ICCV’15
- Temporal context, Wang et al., ICCV’15
- Solving jigsaw puzzles, Noroozi et al., ECCV’16
- Context encoder, Pathak et al., CVPR’16
- Ego-motion, Jayaraman et al., ICCV’15
Joint deep learning and clustering:
- Visual concept clustering, Huang et al., CVPR’16
- Graph constraint, Li et al., ECCV’16
- TAGnet, Wang et al., SDM’16
- Deep embedding, Xie et al., ICML’16
Joint Unsupervised Learning (JULE)
Key observations: meaningful clusters can provide supervisory signals to learn image representations, and good representations in turn help to get meaningful clusters.
Three possible strategies:
- Cluster images first, and then learn representations.
- Learn representations first, and then cluster images.
- Cluster images and learn representations progressively (our approach).
This sets up a reinforcing cycle: good representations yield good clusters, and good clusters yield good representations; conversely, poor clusters and poor representations reinforce each other.
Formulation: alternate between agglomerative clustering (updating cluster labels y) and CNN representation learning (updating network parameters θ), given images I. The overall objective is

$$\operatorname*{arg\,min}_{y,\,\theta}\; \mathcal{L}(y, \theta \mid I),$$

decomposed into two alternating sub-problems: agglomerative clustering with the CNN fixed,

$$\operatorname*{arg\,min}_{y}\; \mathcal{L}(y \mid \theta, I),$$

and representation learning with the cluster labels fixed,

$$\operatorname*{arg\,min}_{\theta}\; \mathcal{L}(\theta \mid y, I).$$
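To make the alternation concrete, here is a minimal runnable sketch of the recurrent structure. The linear feature map, the average-linkage merge, and the no-op training step are illustrative stand-ins for the real CNN, the affinity-based merging criterion, and the triplet-loss update; none of this is the authors' code:

```python
import numpy as np

# Stand-ins for the real components: a linear map instead of a CNN, a plain
# average-linkage merge instead of the affinity-based criterion, and a no-op
# instead of the triplet-loss CNN update.
def extract_features(theta, images):
    return images @ theta

def merge_once(clusters, feats):
    # Merge the two clusters whose centroids are closest.
    cents = np.array([feats[c].mean(axis=0) for c in clusters])
    d = np.linalg.norm(cents[:, None] - cents[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    i, j = np.unravel_index(np.argmin(d), d.shape)
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [clusters[i] + clusters[j]]

def train_step(theta, images, clusters):
    return theta  # placeholder for argmin_theta L(theta | y, I)

def jule_sketch(images, theta, n_target, merges_per_period=2):
    clusters = [[i] for i in range(len(images))]  # start over-clustered
    while len(clusters) > n_target:
        feats = extract_features(theta, images)   # forward: update y given theta
        for _ in range(min(merges_per_period, len(clusters) - n_target)):
            clusters = merge_once(clusters, feats)
        theta = train_step(theta, images, clusters)  # backward: update theta given y
    return theta, clusters

rng = np.random.default_rng(0)
_, cl = jule_sketch(rng.normal(size=(20, 8)), rng.normal(size=(8, 4)), n_target=5)
print(sorted(len(c) for c in cl))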
Problem: a backward pass (CNN update) at every merging time-step is time-consuming and prone to over-fitting. How about updating the CNN once for multiple time-steps?
Partial unrolling: divide all T time-steps into P periods. In each period, we merge clusters multiple times and update the CNN parameters once, at the end of the period. P is determined by a hyper-parameter introduced later.
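A small sketch of one way to realize such a schedule: split the T merge steps into periods whose lengths shrink with the number of remaining clusters. The unrolling rate eta below is a hypothetical stand-in for the paper's hyper-parameter:

```python
# A minimal sketch of partial unrolling, assuming the merges in each period
# are a fixed fraction of the current cluster count (eta is hypothetical here;
# the paper derives P from its own unrolling hyper-parameter).
def schedule_periods(n_start, n_target, eta=0.2):
    """Split the T = n_start - n_target merge steps into periods."""
    periods, n = [], n_start
    while n > n_target:
        steps = max(1, min(int(eta * n), n - n_target))  # merges in this period
        periods.append(steps)
        n -= steps                    # each merge removes exactly one cluster
    return periods                    # CNN is updated once at the end of each period

print(schedule_periods(100, 10))      # [20, 16, 12, 10, 8, ...]
```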
Overall loss (recap):

$$\operatorname*{arg\,min}_{y,\,\theta}\; \mathcal{L}(y, \theta \mid I), \quad \text{decomposed into} \quad \operatorname*{arg\,min}_{y}\; \mathcal{L}(y \mid \theta, I) \quad \text{and} \quad \operatorname*{arg\,min}_{\theta}\; \mathcal{L}(\theta \mid y, I).$$
Loss at time-step t: conventional vs. proposed agglomerative clustering strategy. Unlike the conventional strategy, the proposed one scores each candidate merge with an affinity measure that uses both the affinity between the i-th cluster and its nearest-neighbour cluster and the differences between that affinity and the affinities to the other K_c nearest-neighbour clusters of the i-th cluster; the two clusters with the best score are merged (see the sketch after the forward-pass description below).
Loss in the forward pass in period p (merge clusters): the CNN parameters are fixed. Loss in the backward pass in period p (update the CNN): the cluster labels are fixed.
Forward pass: a simple greedy algorithm. At each time-step, merge the two clusters that minimize the loss.
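Here is a hedged sketch of one greedy step under the criterion described above, given a precomputed symmetric cluster-affinity matrix A. The lambda-weighted difference term is an illustrative reading of the "differences between cluster affinities" component, not the paper's exact formula:

```python
import numpy as np

def greedy_merge_step(A, K_c=5, lam=1.0):
    """Pick the pair of clusters to merge at this time-step."""
    n = A.shape[0]
    best_score, best_pair = -np.inf, None
    for i in range(n):
        # K_c nearest-neighbour clusters of the i-th cluster, by affinity.
        nbrs = [k for k in np.argsort(A[i])[::-1] if k != i][:K_c]
        j = nbrs[0]  # the nearest neighbour of cluster i
        # Affinity to the NN, plus how much it stands out from the other
        # neighbours (illustrative weighting of the "difference" term).
        diff = np.mean([A[i, j] - A[i, k] for k in nbrs[1:]]) if len(nbrs) > 1 else 0.0
        score = A[i, j] + lam * diff
        if score > best_score:
            best_score, best_pair = score, (i, j)
    return best_pair  # merge these two clusters

A = np.random.default_rng(0).random((8, 8))
A = (A + A.T) / 2  # symmetric toy affinities
print(greedy_merge_step(A, K_c=3))
```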
Backward pass: consider all previous periods. The cluster-based loss is not suitable for batch optimization, so we use an approximation: recalling the cluster-based loss, we convert it to a sample-based loss built from intra-sample and inter-sample affinities, which takes the form of a weighted triplet loss.
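To illustrate the sample-based form, here is a hedged PyTorch sketch of a weighted triplet loss, assuming cosine similarity as the sample affinity; the per-triplet weights and the margin are illustrative, not the paper's exact choices:

```python
import torch
import torch.nn.functional as F

# A hedged sketch of a weighted triplet loss over CNN features: the affinity
# of a same-cluster pair should exceed that of a cross-cluster pair by a margin.
def weighted_triplet_loss(anchor, positive, negative, weight, margin=0.2):
    sim_pos = F.cosine_similarity(anchor, positive)   # same-cluster pair
    sim_neg = F.cosine_similarity(anchor, negative)   # cross-cluster pair
    return (weight * F.relu(sim_neg - sim_pos + margin)).mean()

a, p, n = torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64)
print(weighted_triplet_loss(a, p, n, weight=torch.ones(16)))
```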
Training recipe: start from raw image data, and assume the target number of clusters is known. Randomly initialize the CNN parameters, and initialize the clusters with 4 samples per cluster on average. Train the CNN for about 20 epochs. We can go back and retrain the model, but it improves only slightly.
Datasets (images / classes / resolution):
  MNIST          70000 images, 10 classes,  28x28
  USPS           11000 images, 10 classes,  16x16
  COIL20          1440 images, 20 classes,  128x128
  COIL100         7200 images, 100 classes, 128x128
  UMist            575 images, 20 classes,  112x92
  FRGC            2462 images, 20 classes,  32x32
  CMU-PIE         2856 images, 68 classes,  32x32
  Youtube Face    1000 images, 41 classes,  55x55
Two important hyper-parameters. Rule of thumb: set the number of layers so that the output feature map is about 10x10.
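A small helper illustrating this rule of thumb: count how many stride-2 stages a given input size admits before the feature map drops below roughly 10x10. The stride-2 downsampling scheme is an assumption here; the per-dataset architectures are the authors' choices and are not reproduced:

```python
# Pick the number of stride-2 pooling stages so the output
# feature map stays roughly 10x10 (assumed downsampling scheme).
def num_pool_stages(input_size, target=10):
    stages, size = 0, input_size
    while size // 2 >= target:
        size //= 2
        stages += 1
    return stages, size

for s in (28, 16, 128, 55):          # MNIST, USPS, COIL, YTF input sizes
    print(s, "->", num_pool_stages(s))
```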
Results, averaged over all datasets: +6.43% on NMI and +12.76% on AC (accuracy) over the best performance of existing approaches; further comparisons give average gains of +21.5% and +25.7% on NMI.

Additional experiments: our clustering performance vs. that of existing clustering approaches on raw image data, and clustering performance when our learnt representation is fed to existing clustering algorithms. (Qualitative panels on COIL-20, COIL-100, USPS, and MNIST-test.)
Generalization: testing how well our learnt (unsupervised) representation transfers to LFW face verification, and evaluating representation learning on CIFAR-10 classification.
Conclusions:
- We cast joint representation learning and image clustering into a recurrent optimization problem;
- Image clustering is conducted in the forward pass, and representation learning is conducted during the backward pass;
- Our approach outperforms existing approaches on multiple datasets.
Code: https://github.com/jwyang/joint-unsupervised-learning