SLIDE 1

UNSUPERVISED LEARNING, CLUSTERING

SLIDE 2

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING

▸ Supervised learning: X - y pairs, f(x) function approximation
▸ Unsupervised learning: only X, no y
▸ Exploring the space of X measurements, understanding data, identifying populations, problems, outliers (before modelling)
▸ Dimension reduction, important when working with high-dimensional data
▸ Usually part of exploratory data analysis, which may lead to measuring the “supervising” signal when interesting structure is found in the X data
▸ Not a well-defined problem

SLIDE 3

UNSUPERVISED LEARNING

DATA EXPLORATION, DIMENSIONALITY REDUCTION

▸ High-dimensional datasets (N dimensions often >> N data points)
▸ Impossible to “visually” find structure, clusters, outliers, batch effects, etc.
▸ One way to explore the data is to embed it into a few dimensions that humans can inspect visually (1, 2, 3?)
▸ It is very important to know the internal structure of your data!
▸ Usually the first step with high-dimensional data is dimensionality reduction (in parallel with opening your data in a spreadsheet and just eyeballing it for a few hours :) )

SLIDE 4

UNSUPERVISED LEARNING

PCA - PRINCIPAL COMPONENT ANALYSIS

▸ PCA is a linear basis transformation from the original basis to a new basis dictated by the variation in the data itself
▸ The 1st component direction is along the largest variance in the data
▸ The 2nd component is the direction orthogonal to the 1st with the largest variance, and so on …
▸ The number of components is min(n_features, n_data)
▸ The projections of the original data points give the scores
▸ Projected data points (scores) are uncorrelated in PCA space
▸ The first components capture the largest variation in the data, the interesting things! We can reveal some structure of the data using only a few dimensions.

[Figure: σ²_signal vs. σ²_noise along the x and y measurement axes. Image from Shlens.]
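A minimal sketch of the above with scikit-learn's PCA (the toy data and variable names are illustrative, not from the slides):

    import numpy as np
    from sklearn.decomposition import PCA

    # Toy data: 200 points in 5 dimensions
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))

    pca = PCA()                      # keeps min(n_features, n_data) components by default
    scores = pca.fit_transform(X)    # projections of the data points onto the new basis

    print(pca.components_.shape)           # (n_components, n_features): the new basis vectors
    print(pca.explained_variance_ratio_)   # variance captured per component, in decreasing order

    # The scores are (near) uncorrelated in PCA space
    print(np.round(np.corrcoef(scores, rowvar=False), 3))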

SLIDE 5

UNSUPERVISED LEARNING

PCA - PRINCIPAL COMPONENT ANALYSIS

▸ Standard use: 2D plots of projections
▸ Original basis directions may be useful to plot as well

[Figure: biplot of the first two principal components, with the original variable directions (Murder, Assault, UrbanPop, Rape) overlaid on the projected data points.]

▸ Outliers: sometimes components correspond to individual data points, i.e. outliers. These should be inspected and removed, and the PCA repeated without them.
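One possible, hedged illustration of such a check (assuming a scores array from a fitted PCA as in the earlier sketch; the 3-sigma threshold is an arbitrary choice):

    import numpy as np

    # Flag points with an extreme score on either of the first two components
    z = scores[:, :2] / scores[:, :2].std(axis=0)
    outliers = np.where(np.abs(z).max(axis=1) > 3)[0]
    print("candidate outliers:", outliers)
    # Inspect these points; if they are genuine outliers, drop them and rerun the PCA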

SLIDE 6

UNSUPERVISED LEARNING

PCA - PRINCIPAL COMPONENT ANALYSIS

▸ How many components do you need? Proportion of variance explained.
▸ Zero mean per dimension is assumed, so do it! (Fitting an ellipse around the origin)
▸ If different quantities are measured, the units may not be comparable (number of fingers or height in cm?). In this case, normalise the original dimensions to have variance = 1.
▸ Only the line of direction is defined: sign flips (multiplying a component by -1) might occur!

[Figure: biplots of the first two principal components with scaled vs. unscaled variables (Murder, Assault, UrbanPop, Rape), and plots of the proportion and cumulative proportion of variance explained per principal component.]
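A minimal sketch of centring/scaling before PCA and of checking the proportion of variance explained (toy data is illustrative):

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])  # incomparable units

    # Centre each dimension; also scale to unit variance when the units are not comparable
    X_scaled = StandardScaler().fit_transform(X)

    pca = PCA().fit(X_scaled)
    print(pca.explained_variance_ratio_)             # proportion of variance per component
    print(np.cumsum(pca.explained_variance_ratio_))  # cumulative proportion, used to pick n_components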
SLIDE 7

UNSUPERVISED LEARNING

MORE DIMENSION REDUCTION, EMBEDDING

▸ MDS, multidimensional scaling (embed the points in a low dimension, given their measured distances)
▸ t-SNE, t-distributed stochastic neighbour embedding (local embedding, usually works best with complex data)
▸ UMAP, Uniform Manifold Approximation and Projection (way, way faster than t-SNE)
▸ ICA, independent component analysis (PCA: uncorrelated, ICA: independent, e.g. EEG)
▸ NMF, non-negative matrix factorisation (e.g. mutations)
▸ And more; source: http://scikit-learn.org/stable/modules/manifold.html
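A minimal sketch of a non-linear 2D embedding with scikit-learn's t-SNE (the toy blobs are illustrative; UMAP, from the separate umap-learn package, offers a similar fit_transform interface):

    import numpy as np
    from sklearn.manifold import TSNE

    # Toy data: two well-separated groups in 50 dimensions
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0.0, 1.0, size=(100, 50)),
                   rng.normal(5.0, 1.0, size=(100, 50))])

    # Embed into 2 dimensions for visual inspection
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
    print(emb.shape)  # (200, 2); plot emb[:, 0] vs emb[:, 1] to look for structure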

SLIDE 8

CLUSTERING

CLUSTERING

▸ Data points can be meaningfully categorised: clusters
▸ Classification: we have labels (y) for groups
▸ Clustering: labels are not measured, they are inferred from the (X) data
▸ Not a well-defined problem
▸ Clusters inferred should be validated (with measurements, new data)

SLIDE 9

CLUSTERING

K-MEANS CLUSTERING

▸ Fix the number of clusters a priori
▸ Minimise the sum of intra-cluster distances
▸ Algorithm:
▸ 1. Randomly assign each data point to a cluster
▸ 2. Calculate the cluster centroids, reassign each data point to the closest centroid, and repeat until convergence
▸ The distance metric is generally Euclidean
▸ A local minimum is found; repeat multiple times for the best solution and to assess stability
▸ Possible failure modes: see http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html
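A minimal sketch with scikit-learn's KMeans (the toy blobs are illustrative); n_init controls how many random restarts are used to pick the best local minimum:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Toy data: 3 blobs in 2 dimensions
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Fix the number of clusters a priori; n_init restarts guard against bad local minima
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(km.cluster_centers_)  # the centroids
    print(km.labels_[:10])      # cluster assignment of the first 10 points
    print(km.inertia_)          # sum of squared distances to the closest centroid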

SLIDE 10

CLUSTERING

HIERARCHICAL CLUSTERING

▸ Number of clusters not fixed
▸ Iteratively agglomerate clusters from individual observations
▸ Algorithm:
▸ 1. Assign each data point to its own cluster
▸ 2. Join the two closest clusters, and repeat
▸ The cluster distance metric (linkage) is super important
▸ Single (smallest pairwise distance), Average, Complete (maximal distance)
▸ The result is not a clustering, it is a dendrogram. A horizontal cut defines a clustering. Where to cut? Well.
[Figure: dendrograms of the same data with average, complete and single linkage.]
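A minimal sketch of agglomerative clustering with SciPy (toy data is illustrative); the linkage method is the cluster distance choice discussed above:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, size=(20, 2)), rng.normal(6, 1, size=(20, 2))])

    # Build the full merge tree; try "single", "average" or "complete" linkage
    Z = linkage(X, method="average", metric="euclidean")

    # A horizontal cut of the dendrogram defines a clustering, e.g. cut into 2 clusters
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)

    # dendrogram(Z) draws the tree (requires matplotlib)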

SLIDE 11

CLUSTERING

MORE CLUSTERING

▸ DBSCAN: density thresholds define clusters
▸ Spectral clustering: uses the eigenvectors of the pairwise distance matrix
▸ Gaussian mixture models
▸ And more; source: http://scikit-learn.org/stable/modules/clustering.html
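A minimal sketch of DBSCAN and a Gaussian mixture model with scikit-learn (the parameters are illustrative and data dependent):

    from sklearn.cluster import DBSCAN
    from sklearn.mixture import GaussianMixture
    from sklearn.datasets import make_moons

    # Toy data: two interleaved half-moons, a case where k-means struggles
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # DBSCAN: points in dense regions (>= min_samples neighbours within eps) form clusters,
    # everything else is labelled -1 (noise)
    db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
    print(set(db_labels))

    # Gaussian mixture: soft clustering with per-cluster covariances
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(gmm.predict_proba(X)[:3])  # membership probabilities of the first 3 points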

SLIDE 12

SEMI-SUPERVISED LEARNING

SEMI-SUPERVISED LEARNING

▸ Few data points have labels, most others do not
▸ Exploit the data structure of the unlabelled examples for more effective supervised learning
▸ Use unsupervised learning to explore the data structure (clusters), then use the few labelled points to assign labels to the clusters
▸ Hot topic, as data labelling is often much more expensive than unlabelled data collection
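A minimal sketch of one semi-supervised approach available in scikit-learn, label spreading (not named on the slide, used here because it fits in a few lines); only a handful of points carry labels, the rest are marked -1:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.semi_supervised import LabelSpreading

    X, y_true = make_blobs(n_samples=200, centers=2, random_state=0)

    # Pretend only 5 points per cluster are labelled; the rest are marked -1 (unlabelled)
    y = np.full_like(y_true, -1)
    rng = np.random.default_rng(4)
    for c in (0, 1):
        idx = rng.choice(np.where(y_true == c)[0], size=5, replace=False)
        y[idx] = c

    model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
    print((model.transduction_ == y_true).mean())  # fraction of points labelled correctly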

SLIDE 13

SELF-SUPERVISED LEARNING

SELF-SUPERVISED LEARNING

▸ Unsupervised learning, where one part of the data is predicted from another part of the data
▸ Examples explain it best:
▸ Future video frame prediction
▸ Grayscale image colorisation
▸ Inpainting
▸ Jigsaw puzzle solving
▸ Motion direction prediction
▸ etc.
▸ Orders of magnitude more unsupervised data is collected (images, videos)
▸ Human visual learning is supposedly unsupervised (maybe it is self-supervised)


Images: Lotter et al., Zhang et al., Noroozi and Favaro, Walker et al.
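A minimal sketch of how a pretext task turns unlabelled data into a supervised problem, using the jigsaw idea from the list above (numpy only, and deliberately tiny: a 2x2 grid instead of the 3x3 patches of Noroozi and Favaro; a real setup would train a network on these pairs):

    import numpy as np

    rng = np.random.default_rng(5)
    image = rng.random(size=(32, 32))  # one unlabelled "image" (illustrative)

    # Split into a 2x2 grid of patches, shuffle them, and use the permutation
    # as the label to predict; no human annotation is needed.
    patches = [image[i:i + 16, j:j + 16] for i in (0, 16) for j in (0, 16)]
    perm = rng.permutation(4)
    x_pretext = np.stack([patches[p] for p in perm])  # shuffled patches = model input
    y_pretext = perm                                  # the permutation = prediction target
    print(x_pretext.shape, y_pretext)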

SLIDE 14

REFERENCES

REFERENCES

▸ ISLR, chapter 10.
▸ ESL, chapter 14.
▸ http://scikit-learn.org/stable/modules/decomposition.html#decompositions
▸ http://scikit-learn.org/stable/modules/manifold.html
▸ http://scikit-learn.org/stable/modules/clustering.html#clustering
▸ https://umap-learn.readthedocs.io/en/latest/
▸ Shlens, J., 2014. A Tutorial on Principal Component Analysis. arXiv:1404.1100 [cs, stat].
▸ Walker, J., Gupta, A., Hebert, M., 2015. Dense Optical Flow Prediction from a Static Image. arXiv:1505.00295 [cs].
▸ Lotter, W., Kreiman, G., Cox, D., 2016. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. arXiv:1605.08104 [cs, q-bio].
▸ Zhang, R., Isola, P., Efros, A.A., 2016. Colorful Image Colorization. arXiv:1603.08511 [cs].
▸ Noroozi, M., Favaro, P., 2016. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. arXiv:1603.09246 [cs].