UNSUPERVISED LEARNING, CLUSTERING
UNSUPERVISED LEARNING
▸ Supervised learning: X–y pairs, f(x) function approximation
▸ Unsupervised learning: only X, no y
▸ Exploring the space of X measurements, understanding the data, identifying populations, problems, outliers (before modelling)
▸ Dimension reduction, important when working with high-dimensional data
▸ Usually part of exploratory data analysis, which may lead to measuring the “supervising” signal when interesting structure is found in the X data
▸ Not a well-defined problem
UNSUPERVISED LEARNING
DATA EXPLORATION, DIMENSIONALITY REDUCTION
▸ High-dimensional datasets (N_dim often >> N_data)
▸ Impossible to “visually” find structure, clusters, outliers, batch effects, etc.
▸ One way to explore the data is to embed it into a few dimensions that humans are capable of inspecting visually (1, 2, 3?)
▸ It is very important to know the internal structure of your data!
▸ Usually the first step with high-dimensional data is dimensionality reduction (in parallel with opening your data in a spreadsheet and just eyeballing it for a few hours :) )
UNSUPERVISED LEARNING
PCA - PRINCIPAL COMPONENT ANALYSIS
▸ PCA is a linear basis transformation from the original basis to a new basis dictated by the variation in the data itself
▸ The 1st component direction is along the largest variance in the data
▸ The 2nd component is the orthogonal direction to the 1st with the largest variance, and so on …
▸ The number of components is min(n_features, n_data)
▸ The projections of the original data points give the scores
▸ Projected data points (scores) are uncorrelated in PCA space
▸ The first components capture the largest variation in the data, the interesting things! We can reveal some structure of the data using only a few dimensions (see the code sketch below).
▸ (Figure: toy 2D data in x–y with the signal variance σ²_signal and the noise variance σ²_noise; image from Shlens)
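▸ A minimal PCA sketch in Python with scikit-learn (the random matrix X is just a made-up stand-in for real measurements):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # toy stand-in: 100 points, 5 features

pca = PCA()                               # default: min(n_features, n_data) components
scores = pca.fit_transform(X)             # fit_transform centres X and returns the scores
print(np.round(np.corrcoef(scores, rowvar=False), 2))   # ~identity matrix: scores are uncorrelated
print(pca.explained_variance_ratio_)      # variance captured by each component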
UNSUPERVISED LEARNING
PCA - PRINCIPAL COMPONENT ANALYSIS
▸ Standard use: 2D plots of projections
▸ Original basis directions may be useful to plot as well (biplot)
▸ (Figure: biplot of the first two principal components, with loading directions for Murder, Assault, UrbanPop and Rape)
▸ Outliers: sometimes components correspond to individual data points, i.e. outliers. These should be inspected and removed, and PCA repeated without the outliers.
UNSUPERVISED LEARNING
PCA - PRINCIPAL COMPONENT ANALYSIS
▸ How many components do you need? Look at the proportion of variance explained.
▸ Zero mean per dimension is assumed, do it! (Fitting an ellipse around the origin)
▸ If different quantities are measured, the units may not be comparable (number of fingers or height in cm?). In this case, normalise the original dimensions to have variance = 1 (see the sketch below).
▸ Only the line of direction is defined: sign flips (multiplication by -1) might occur!
▸ (Figure: biplots of the scaled vs. unscaled variables Murder, Assault, UrbanPop, Rape, and the proportion of variance explained per principal component)
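▸ A hedged sketch of scaling to variance 1 before PCA and reading off the proportion of variance explained (the two toy columns simulate incomparable units):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(0, 100, size=200),   # e.g. a quantity measured in the hundreds
                     rng.normal(0, 1, size=200)])    # e.g. a quantity of order one

X_scaled = StandardScaler().fit_transform(X)         # zero mean, variance 1 per dimension
pca = PCA().fit(X_scaled)
print(pca.explained_variance_ratio_)                 # proportion of variance explained per component
print(np.cumsum(pca.explained_variance_ratio_))      # cumulative curve: pick "enough" components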
UNSUPERVISED LEARNING
MORE DIMENSION REDUCTION, EMBEDDING
▸ MDS, multidimensional scaling (embed the points in low dimension, given their measured distances)
▸ t-SNE, t-distributed stochastic neighbour embedding (local embedding, usually works best with complex data)
▸ UMAP, Uniform Manifold Approximation and Projection (way, way faster than t-SNE)
▸ ICA, independent component analysis (PCA: uncorrelated, ICA: independent, e.g. EEG)
▸ NMF, non-negative matrix factorisation (e.g. mutations)
▸ And more; left figure source: http://scikit-learn.org/stable/modules/manifold.html
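▸ A rough t-SNE sketch with scikit-learn, using the digits dataset purely as a convenient example (UMAP, from the separate umap-learn package, is used the same way):

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)                      # 64-dimensional images of digits
emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X)
print(emb.shape)                                         # (n_samples, 2): coordinates to plot, coloured by y

# UMAP has a very similar interface:
# import umap; emb = umap.UMAP(n_components=2).fit_transform(X)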
CLUSTERING
CLUSTERING
▸ Data points can be meaningfully categorised: clusters
▸ Classification: we have labels (y) for the groups
▸ Clustering: labels are not measured, they are inferred from the (X) data
▸ Not a well-defined problem
▸ The clusters inferred should be validated (with measurements, new data)
CLUSTERING
K-MEANS CLUSTERING
▸ A priori fix the number of clusters
▸ Minimise the sum of intra-cluster distances
▸ Algorithm:
▸ 1. Randomly assign each data point to a cluster
▸ 2. Calculate the cluster centroids, reassign each data point to the closest centroid, repeat until convergence
▸ The distance metric is generally Euclidean
▸ A local minimum is found; repeat multiple times for the best solution and an assessment of stability
▸ Left: possible failure modes. Source: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html
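▸ A minimal k-means sketch with scikit-learn (the choice n_clusters=3 and the blob data are arbitrary illustrations; n_init repeats the algorithm from several random starts):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data with 3 blobs
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # restarts help escape poor local minima
print(km.labels_[:10])                                        # inferred cluster of the first points
print(km.cluster_centers_)                                    # the centroids
print(km.inertia_)                                            # sum of squared distances to the closest centroid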
CLUSTERING
HIERARCHICAL CLUSTERING
▸ Number of clusters not fixed
▸ Iteratively agglomerate clusters from individual observations
▸ Algorithm:
▸ 1. Assign each data point to its own cluster
▸ 2. Join the two closest clusters, repeat
▸ The cluster distance metric (linkage) is super important
▸ Single (smallest pairwise distance), average, complete (maximal pairwise distance)
▸ The result is not a clustering, it is a dendrogram. A horizontal cut defines a clustering. Where to cut? Well…
▸ (Figure: dendrograms under average, complete and single linkage)
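▸ An agglomerative clustering sketch using SciPy (the average linkage and the cut into 3 clusters are arbitrary choices for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                       # toy data

Z = linkage(pdist(X), method="average")            # also: "single", "complete"
labels = fcluster(Z, t=3, criterion="maxclust")    # one possible horizontal cut: keep 3 clusters
# dendrogram(Z)                                    # draws the full tree (needs matplotlib)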
CLUSTERING
MORE CLUSTERING
▸ DBSCAN: density thresholds define clusters
▸ Spectral clustering: using the eigenvectors of the pairwise distance matrix
▸ Gaussian mixture models
▸ And more; left figure source: http://scikit-learn.org/stable/modules/clustering.html
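▸ DBSCAN and a Gaussian mixture, sketched with scikit-learn on a toy dataset (eps, min_samples and n_components are data-dependent guesses here):

from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)  # two interleaved half-circles
db = DBSCAN(eps=0.3, min_samples=5).fit(X)                    # label -1 marks low-density "noise" points
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(db.labels_[:10], gmm.predict(X)[:10])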
SEMI-SUPERVISED LEARNING
SEMI-SUPERVISED LEARNING
▸ Few data points have labels, most others do not
▸ Exploit the data structure of the unlabelled examples for more effective supervised learning
▸ Use unsupervised learning to explore the data structure and clusters, and use the few labelled points to assign labels to the clusters
▸ Hot topic, as data labelling is often much more expensive than unlabelled data collection
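▸ One concrete flavour of this, sketched with scikit-learn's label propagation (masking labels with -1 below just simulates the "few labelled points" situation):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelPropagation

X, y = make_blobs(n_samples=200, centers=3, random_state=0)
y_partial = np.full_like(y, -1)            # -1 marks an unlabelled point
y_partial[:10] = y[:10]                    # pretend only 10 labels were actually measured

model = LabelPropagation().fit(X, y_partial)
print((model.transduction_ == y).mean())   # fraction of points whose inferred label matches the true one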
SELF-SUPERVISED LEARNING
SELF-SUPERVISED LEARNING
▸ Unsupervised learning, where a part of the data is predicted from another part of the data
▸ Examples explain it best:
▸ Future video frame prediction
▸ Grayscale image colorisation
▸ Inpainting
▸ Jigsaw puzzle solving
▸ Motion direction prediction
▸ etc.
▸ Orders of magnitude more unsupervised data can be collected (images, videos)
▸ Human visual learning is supposedly unsupervised (maybe it is self-supervised)
▸ Images: Lotter et al., Zhang et al., Noroozi and Favaro, Walker et al.
REFERENCES
REFERENCES
▸ ISLR, chapter 10.
▸ ESL, chapter 14.
▸ http://scikit-learn.org/stable/modules/decomposition.html#decompositions
▸ http://scikit-learn.org/stable/modules/manifold.html
▸ http://scikit-learn.org/stable/modules/clustering.html#clustering
▸ https://umap-learn.readthedocs.io/en/latest/
▸ Shlens, J., 2014. A Tutorial on Principal Component Analysis. arXiv:1404.1100 [cs, stat].
▸ Walker, J., Gupta, A., Hebert, M., 2015. Dense Optical Flow Prediction from a Static Image. arXiv:1505.00295 [cs].
▸ Lotter, W., Kreiman, G., Cox, D., 2016. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. arXiv:1605.08104 [cs, q-bio].
▸ Zhang, R., Isola, P., Efros, A.A., 2016. Colorful Image Colorization. arXiv:1603.08511 [cs].
▸ Noroozi, M., Favaro, P., 2016. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. arXiv:1603.09246 [cs].