kernel CCA, kernel K-Means, Spectral Clustering – MACHINE LEARNING 2012
slide-1
SLIDE 1

MACHINE LEARNING – 2012 1

MACHINE LEARNING: kernel CCA, kernel K-Means, Spectral Clustering

slide-2
SLIDE 2

MACHINE LEARNING – 2012 2

Change in timetable: we have a practical session next week!

slide-3
SLIDE 3

MACHINE LEARNING – 2012 3

Structure of today’s and next week’s class

1) Briefly go through some extensions or variants of the principle of kernel PCA, namely kernel CCA.
2) Look at one particular set of clustering algorithms for structure discovery: kernel K-Means.
3) Describe the general concept of Spectral Clustering, highlighting the equivalence between kernel PCA, ISOMAP, etc., and their use for clustering.
4) Introduce the notions of unsupervised and semi-supervised learning and how these can be used to evaluate clustering methods.
5) Compare kernel K-Means and the Gaussian Mixture Model for unsupervised and semi-supervised clustering.

slide-4
SLIDE 4

MACHINE LEARNING – 2012 4

Canonical Correlation Analysis (CCA)

Two separate descriptions of the same datapoints, e.g. a video description and an audio description:
$$(x_1, y_1),\ (x_2, y_2),\ \dots \qquad x \in \mathbb{R}^N,\ y \in \mathbb{R}^P$$

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T x,\ w_y^T y\right)$$

Determine features in two (or more) separate descriptions of the dataset that best explain each datapoint.

Extract hidden structure that maximizes the correlation across two different projections.

slide-5
SLIDE 5

MACHINE LEARNING – 2012 5

Canonical Correlation Analysis (CCA)

Pair of multidimensional zero-mean variables; we have $M$ instances of the pairs $(x_1, y_1), (x_2, y_2), \dots$:
$$X = \left\{x^i\right\}_{i=1}^{M} \in \mathbb{R}^{N \times M}, \qquad Y = \left\{y^i\right\}_{i=1}^{M} \in \mathbb{R}^{q \times M}$$

Search for two projections $w_x$ and $w_y$, $X' = w_x^T X$ and $Y' = w_y^T Y$, solutions of:
$$\max_{w_x, w_y} \operatorname{corr}\left(X', Y'\right)$$

slide-6
SLIDE 6

MACHINE LEARNING – 2012 6

Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y} \operatorname{corr}\left(X', Y'\right) = \max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\ \sqrt{w_y^T C_{yy} w_y}}$$

Covariance matrices: $C_{xx} = E\left[XX^T\right]$ ($N \times N$) and $C_{yy} = E\left[YY^T\right]$ ($q \times q$).

The cross-covariance matrix $C_{xy} = E\left[XY^T\right]$ is $N \times q$; it measures the cross-correlation between $X$ and $Y$.

slide-7
SLIDE 7

MACHINE LEARNING – 2012 7

Canonical Correlation Analysis (CCA)

The correlation is not affected by rescaling the norm of the vectors, so we can require $w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$. The problem then becomes:
$$\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$$

(equivalent to the unconstrained ratio of the previous slide, $\max_{w_x, w_y} \operatorname{corr}(X',Y') = \max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\sqrt{w_y^T C_{yy} w_y}}$)

slide-8
SLIDE 8

MACHINE LEARNING – 2012 8

Canonical Correlation Analysis (CCA)

To determine the optimum (maximum) of $\max_{w_x, w_y} w_x^T C_{xy} w_y$ u.c. $w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$, solve by Lagrange multipliers; this yields:
$$C_{xy}\, C_{yy}^{-1}\, C_{yx}\, w_x = \lambda^2\, C_{xx}\, w_x$$

Generalized Eigenvalue Problem; can be reduced to a classical eigenvalue problem if Cxx is invertible
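The following is a minimal NumPy/SciPy sketch of linear CCA solved through this generalized eigenvalue problem. It is illustrative only (the course demo is the MATLAB file demo_CCA-KCCA.m); the small regularization constant and the recovery of w_y from w_x are assumptions of this sketch.

```python
# Hedged sketch: linear CCA via C_xy C_yy^{-1} C_yx w_x = lambda^2 C_xx w_x
import numpy as np
from scipy.linalg import eigh

def linear_cca(X, Y, reg=1e-8):
    """X: (N, M), Y: (q, M) with zero-mean columns. Returns (w_x, w_y, rho) for the first pair."""
    M = X.shape[1]
    Cxx = X @ X.T / M + reg * np.eye(X.shape[0])   # N x N covariance
    Cyy = Y @ Y.T / M + reg * np.eye(Y.shape[0])   # q x q covariance
    Cxy = X @ Y.T / M                              # N x q cross-covariance
    A = Cxy @ np.linalg.solve(Cyy, Cxy.T)          # C_xy C_yy^{-1} C_yx
    vals, vecs = eigh(A, Cxx)                      # generalized symmetric eigenproblem, ascending
    w_x = vecs[:, -1]                              # eigenvector of the largest eigenvalue
    rho = np.sqrt(max(vals[-1], 0.0))              # canonical correlation
    w_y = np.linalg.solve(Cyy, Cxy.T @ w_x)        # w_y proportional to C_yy^{-1} C_yx w_x
    w_y /= np.sqrt(w_y @ Cyy @ w_y)                # enforce w_y^T C_yy w_y = 1
    return w_x, w_y, rho
```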

slide-9
SLIDE 9

MACHINE LEARNING – 2012 9

Kernel Canonical Correlation Analysis

  • CCA finds basis vectors such that the correlation between the projections is mutually maximized → a generalized version of PCA for two or more multi-dimensional datasets.
  • CCA depends on the coordinate system in which the variables are described. Even with a strong relationship between the variables, depending on the coordinate system used this relationship might not be visible as a correlation → Kernel CCA.

slide-10
SLIDE 10

MACHINE LEARNING – 2012 10

Principle of Kernel Methods

Determine a metric which brings out features of the data so as to make subsequent computation easier

[Figure: original space (x1, x2) vs. the data after lifting into feature space. The data becomes linearly separable when using an RBF kernel and projecting onto the first 2 principal components of kernel PCA.]

slide-11
SLIDE 11

MACHINE LEARNING – 2012 11

[Figure: original space with axes x1, x2 and feature-space axes e1, e2]

Idea: send the data $X$ into a feature space $H$ through a nonlinear map $\phi$:
$$X = \left\{x^i\right\}_{i=1\dots M} \in \mathbb{R}^{N \times M} \ \longrightarrow\ \phi(X) = \left[\phi\left(x^1\right), \dots, \phi\left(x^M\right)\right], \qquad \phi : \mathbb{R}^N \to H$$
Then perform a linear transformation in feature space.

Principle of Kernel Methods (Recall)

In feature space, perform classical linear computation

slide-12
SLIDE 12

MACHINE LEARNING – 2012 12

Kernel CCA

Data: $X = \left\{x^i\right\}_{i=1}^{M} \in \mathbb{R}^{N \times M}$, $Y = \left\{y^i\right\}_{i=1}^{M} \in \mathbb{R}^{q \times M}$.

Project into feature spaces: $\phi_x\left(x^i\right)$ and $\phi_y\left(y^i\right)$, $i = 1 \dots M$, with $\sum_i \phi_x\left(x^i\right) = 0$ and $\sum_i \phi_y\left(y^i\right) = 0$.

Construct the associated kernel matrices
$$K_x = \Phi_x^T \Phi_x, \qquad K_y = \Phi_y^T \Phi_y,$$
where the columns of $\Phi_x$, $\Phi_y$ are $\phi_x\left(x^i\right)$, $\phi_y\left(y^i\right)$.

The projection vectors can be expressed as linear combinations in feature space:
$$w_x = \Phi_x \alpha_x \quad \text{and} \quad w_y = \Phi_y \alpha_y.$$

slide-13
SLIDE 13

MACHINE LEARNING – 2012 13

Kernel CCA

Kernel CCA then becomes an optimization problem of the form:
$$\max_{\alpha_x, \alpha_y} \frac{\alpha_x^T K_x K_y \alpha_y}{\left(\alpha_x^T K_x^2 \alpha_x\right)^{1/2} \left(\alpha_y^T K_y^2 \alpha_y\right)^{1/2}} = \max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y \alpha_y \quad \text{u.c.} \quad \alpha_x^T K_x^2 \alpha_x = \alpha_y^T K_y^2 \alpha_y = 1$$

In linear CCA, we were solving:
$$\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$$

slide-14
SLIDE 14

MACHINE LEARNING – 2012 14

Kernel CCA

Kernel CCA optimization problem (as before):
$$\max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y \alpha_y \quad \text{u.c.} \quad \alpha_x^T K_x^2 \alpha_x = \alpha_y^T K_y^2 \alpha_y = 1$$

In practice, the intersection between the spaces spanned by $K_x$ and $K_y$ is non-zero; the problem then has a trivial solution, since $\rho \sim \cos\left(K_x \alpha_x,\ K_y \alpha_y\right) = 1$.

slide-15
SLIDE 15

MACHINE LEARNING – 2012 15

Kernel CCA

Generalized eigenvalue problem:
$$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix} \begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix} = \lambda \begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix} \begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}$$

Add a regularization term to increase the rank of the matrix and make it invertible:
$$K_x^2 \ \rightarrow\ \left(K_x + \tfrac{M\kappa}{2} I\right)^2$$

Several methods have been proposed to choose carefully the regularizing term so as to get projections that are as close as possible to the “true” projections.

slide-16
SLIDE 16

MACHINE LEARNING – 2012 16

Kernel CCA

$$A = \begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} \left(K_x + \tfrac{M\kappa}{2} I\right)^2 & 0 \\ 0 & \left(K_y + \tfrac{M\kappa}{2} I\right)^2 \end{pmatrix}$$

This becomes a classical eigenvalue problem: set $B = C C^T$ and $\alpha = C^{-T} \beta$, so that
$$C^{-1} A\, C^{-T} \beta = \lambda \beta.$$
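Below is a minimal Python sketch of regularized kernel CCA following the A/B formulation above. The RBF kernel, its widths, the regularization constant kappa, and the kernel centering are assumptions of this sketch (the course demo is demo_CCA-KCCA.m in MATLAB); a generalized symmetric eigensolver replaces the explicit Cholesky step.

```python
# Hedged sketch of regularized kernel CCA.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def rbf_kernel(A, B, width):
    return np.exp(-cdist(A, B, 'sqeuclidean') / (2 * width**2))

def center_kernel(K):
    M = K.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M
    return J @ K @ J                       # zero-mean data in feature space

def kernel_cca(X, Y, width_x=1.0, width_y=1.0, kappa=0.1):
    """X: (M, N), Y: (M, q). Returns alpha_x, alpha_y and the top eigenvalue."""
    M = X.shape[0]
    Kx = center_kernel(rbf_kernel(X, X, width_x))
    Ky = center_kernel(rbf_kernel(Y, Y, width_y))
    Rx = Kx + (M * kappa / 2) * np.eye(M)  # regularized blocks (K + M*kappa/2 I)
    Ry = Ky + (M * kappa / 2) * np.eye(M)
    A = np.block([[np.zeros((M, M)), Kx @ Ky],
                  [Ky @ Kx, np.zeros((M, M))]])
    B = np.block([[Rx @ Rx, np.zeros((M, M))],
                  [np.zeros((M, M)), Ry @ Ry]])
    vals, vecs = eigh(A, B)                # generalized symmetric eigenproblem
    v = vecs[:, -1]                        # eigenvector of the largest eigenvalue
    return v[:M], v[M:], vals[-1]
```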

slide-17
SLIDE 17

MACHINE LEARNING – 2012 17

Kernel CCA

Can be extended to multiple datasets.

Two-datasets case: $X = \left\{x^i\right\}_{i=1}^{M} \in \mathbb{R}^{N \times M}$, $Y = \left\{y^i\right\}_{i=1}^{M} \in \mathbb{R}^{q \times M}$.

$L$ datasets: $X^1, \dots, X^L$ with $M$ observations each and dimensions $N_1, \dots, N_L$, i.e. $X^i : N_i \times M$.

Applying a non-linear transformation $\phi$ to $X^1, \dots, X^L$, construct the Gram matrices $K^1, \dots, K^L$.

slide-18
SLIDE 18

MACHINE LEARNING – 2012 18

Kernel CCA

Kernel CCA was formulated as a generalized eigenvalue problem (two-datasets case: $A = \begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}$, $B = \begin{pmatrix} (K_x + \tfrac{M\kappa}{2} I)^2 & 0 \\ 0 & (K_y + \tfrac{M\kappa}{2} I)^2 \end{pmatrix}$) and can be extended to multiple datasets:

$$
\begin{pmatrix}
0 & K^1 K^2 & \cdots & K^1 K^L \\
K^2 K^1 & 0 & \cdots & K^2 K^L \\
\vdots & & \ddots & \vdots \\
K^L K^1 & K^L K^2 & \cdots & 0
\end{pmatrix}
\begin{pmatrix}\alpha^1 \\ \vdots \\ \alpha^L\end{pmatrix}
= \lambda
\begin{pmatrix}
\left(K^1 + \tfrac{M\kappa}{2} I\right)^2 & & \\
& \ddots & \\
& & \left(K^L + \tfrac{M\kappa}{2} I\right)^2
\end{pmatrix}
\begin{pmatrix}\alpha^1 \\ \vdots \\ \alpha^L\end{pmatrix}
$$

slide-19
SLIDE 19

MACHINE LEARNING – 2012 19

Kernel CCA

  • M. Kuss and T. Graepel, The Geometry of Kernel Canonical Correlation Analysis, Tech. Report, Max Planck Institute, 2003.
slide-20
SLIDE 20

MACHINE LEARNING – 2012 20

Matlab Example File: demo_CCA-KCCA.m

Kernel CCA

slide-21
SLIDE 21

MACHINE LEARNING – 2012 21

Applications of Kernel CCA

Kernel matrices K1, K2 and K3 correspond to gene-gene similarities in pathways, genome position, and microarray expression data, respectively. An RBF kernel with fixed kernel width is used. Goal: to measure the correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes.

Y Yamanishi, JP Vert, A Nakaya, M Kanehisa - Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis, Bioinformatics, 2003

Correlation scores in MKCCA: pathway vs. genome vs. expression.

slide-22
SLIDE 22

MACHINE LEARNING – 2012 22

Applications of Kernel CCA

Goal: To measure correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes

Y Yamanishi, JP Vert, A Nakaya, M Kanehisa - Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis, Bioinformatics, 2003

Correlation scores in MKCCA: pathway vs. genome vs. expression.

[Figure annotations: pairwise correlation between K1 and K2; pairwise correlation between K1 and K3]

A readout of the entries with equal projection onto the first canonical vectors gives the genes which belong to each cluster. Two clusters correspond to genes close to each other with respect to their positions in the pathways, in the genome, and to their expression.

slide-23
SLIDE 23

MACHINE LEARNING – 2012 23

Applications of Kernel CCA

Goal: To construct appearance models for estimating an object’s pose from raw brightness images

  • T. Melzer, M. Reiter and H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recognition 36 (2003), pp. 1961–1971.

Example of two image datapoints with different poses

X: set of images. Y: pose parameters (pan and tilt angle of the object w.r.t. the camera, in degrees). A linear kernel is used on X and an RBF kernel on Y, and performance is compared to applying PCA on the (X, Y) dataset directly.

slide-24
SLIDE 24

MACHINE LEARNING – 2012 24

Kernel CCA performs better than PCA, especially for a small testing/training ratio k (i.e., for larger training sets). The kernel-CCA estimators tend to produce fewer outliers, i.e., gross errors, and consequently yield a smaller standard deviation of the pose estimation error than their PCA-based counterparts.

Applications of Kernel CCA

Goal: To construct appearance models for estimating an object’s pose from raw brightness images

  • T. Melzer, M. Reiter and H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recognition 36 (2003), pp. 1961–1971.

For very small training sets, the performance of both approaches becomes similar

Testing/training ratio

slide-25
SLIDE 25

MACHINE LEARNING – 2012 26

Kernel K-means Spectral Clustering

slide-26
SLIDE 26

MACHINE LEARNING – 2012 27

Structure Discovery: Clustering

Groups pairs of points according to how similar they are. Density-based clustering methods (soft K-means, kernel K-means, Gaussian Mixture Models) compare the relative distributions.

slide-27
SLIDE 27

MACHINE LEARNING – 2012 28

K-means

[Figure: K-means partition with three centroids m1, m2, m3]

K-means is a hard partitioning of the space through K clusters, equidistant according to the norm-2 measure. The distribution of data within each cluster is encapsulated in a sphere.

slide-28
SLIDE 28

MACHINE LEARNING – 2012 29

K-means Algorithm

1. Initialization: pick $K$ centroids $m^k$, $k = 1 \dots K$.

slide-29
SLIDE 29

MACHINE LEARNING – 2012 30

  • 2. Calculate the distance from each data point to each centroid:
$$d_p\left(x^j, m^k\right) = \left\|x^j - m^k\right\|_p$$
  • 3. Assignment Step: Assign each data point to its "closest" centroid (E-step):
$$\arg\min_k d_p\left(x^j, m^k\right)$$
If a tie happens (i.e., two centroids are equidistant from a data point), one assigns the data point to the winning centroid with the smallest index.

K-means Algorithm

Iterative Method (variant on Expectation-Maximization)

slide-30
SLIDE 30

MACHINE LEARNING – 2012 31

K-means Algorithm: Iterative Method (variant on Expectation-Maximization)

  • 2. Calculate the distance $d_p\left(x^j, m^k\right) = \left\|x^j - m^k\right\|_p$ from each data point to each centroid.
  • 3. Assignment Step: Assign each data point to its "closest" centroid, $\arg\min_k d_p\left(x^j, m^k\right)$ (E-step). If a tie happens (i.e., two centroids are equidistant from a data point), one assigns the data point to the winning centroid with the smallest index.
  • 4. Update Step: Adjust the centroids to be the means of all data points assigned to them (M-step).
  • 5. Go back to step 2 and repeat the process until the clusters are stable.

The sketch below illustrates this loop.
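A minimal NumPy sketch of the loop above (norm-2 case, p = 2). The random initialization and the stopping rule are illustrative assumptions; the course demos use MLDemos rather than Python.

```python
# Hedged sketch of the K-means E-M loop.
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """X: (M, N) data matrix. Returns centroids (K, N) and labels (M,)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # 1. initialization
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # 2.-3. E-step: distance to each centroid, assign to the closest
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)             # ties resolved to the smallest index
        if np.array_equal(new_labels, labels):    # 5. stop when the clusters are stable
            break
        labels = new_labels
        # 4. M-step: move each centroid to the mean of its assigned points
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels
```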

slide-31
SLIDE 31

MACHINE LEARNING – 2012

32

Two hyperparameters: the number of clusters K and the power p of the metric. Very sensitive to the choice of the number of clusters K and to the initialization.

K-means Clustering: Weaknesses

slide-32
SLIDE 32

MACHINE LEARNING – 2012

33

Two hyperparameters: the number of clusters K and the power p of the metric. The choice of the power determines the form of the decision boundaries.

[Figure: decision boundaries for p = 1, 2, 3, 4]

slide-33
SLIDE 33

MACHINE LEARNING – 2012 34

Kernel K-means

The K-means algorithm consists of the minimization of:
$$J\left(m^1, \dots, m^K\right) = \sum_{k=1}^{K} \sum_{x^j \in C^k} \left\|x^j - m^k\right\|^p, \qquad m^k = \frac{1}{\left|C^k\right|} \sum_{x^j \in C^k} x^j,$$
with $\left|C^k\right|$ the number of datapoints in cluster $k$.

Project the data into a feature space through $\phi\left(x^i\right)$, $i = 1 \dots M$:
$$J\left(m^1, \dots, m^K\right) = \sum_{k=1}^{K} \sum_{x^j \in C^k} \left\|\phi\left(x^j\right) - m^k\right\|^2$$

We cannot observe the mean in feature space → construct it from the images of the points in the same cluster:
$$m^k = \frac{1}{\left|C^k\right|} \sum_{x^j \in C^k} \phi\left(x^j\right)$$

slide-34
SLIDE 34

MACHINE LEARNING – 2012 35

Kernel K-means

Expanding the feature-space norm using the kernel $k\left(x, x'\right) = \left\langle \phi(x), \phi\left(x'\right) \right\rangle$:
$$J\left(m^1, \dots, m^K\right) = \sum_{k=1}^{K} \sum_{x^j \in C^k} \left( k\left(x^j, x^j\right) - \frac{2}{\left|C^k\right|} \sum_{x^i \in C^k} k\left(x^j, x^i\right) + \frac{1}{\left|C^k\right|^2} \sum_{x^i, x^l \in C^k} k\left(x^i, x^l\right) \right)$$

slide-35
SLIDE 35

MACHINE LEARNING – 2012 36

Kernel K-means

Kernel K-means algorithm is also an iterative procedure:

  • 1. Initialization: pick K clusters.
  • 2. Assignment Step: Assign each data point to its "closest" centroid (E-step) by computing the distance in feature space. If a tie happens (i.e., two centroids are equidistant from a data point), one assigns the data point to the winning centroid with the smallest index:
$$\min_k d\left(x^i, C^k\right) = \min_k \left\|\phi\left(x^i\right) - m^k\right\|^2 = \min_k \left( k\left(x^i, x^i\right) - \frac{2}{\left|C^k\right|} \sum_{x^j \in C^k} k\left(x^i, x^j\right) + \frac{1}{\left|C^k\right|^2} \sum_{x^j, x^l \in C^k} k\left(x^j, x^l\right) \right)$$
  • 3. Update Step: Update the list of points belonging to each centroid (M-step).
  • 4. Go back to step 2 and repeat the process until the clusters are stable. (A sketch of this procedure follows.)
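A minimal Python sketch of kernel K-means using the feature-space distance above. The RBF kernel and its width are assumptions; any positive semi-definite kernel could be used, and the random initialization is illustrative.

```python
# Hedged sketch of kernel K-means on a precomputed Gram matrix.
import numpy as np

def rbf_gram(X, width):
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    return np.exp(-sq / (2 * width**2))

def kernel_kmeans(K, n_clusters, n_iter=100, seed=0):
    """K: (M, M) Gram matrix. Returns cluster labels (M,)."""
    rng = np.random.default_rng(seed)
    M = K.shape[0]
    labels = rng.integers(n_clusters, size=M)          # 1. random initial clusters
    for _ in range(n_iter):
        dist = np.zeros((M, n_clusters))
        for k in range(n_clusters):
            idx = np.flatnonzero(labels == k)
            if len(idx) == 0:
                dist[:, k] = np.inf
                continue
            # ||phi(x_i) - m_k||^2 = K_ii - 2/|C_k| sum_j K_ij + 1/|C_k|^2 sum_jl K_jl
            dist[:, k] = (np.diag(K)
                          - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new_labels = dist.argmin(axis=1)                # 2. E-step in feature space
        if np.array_equal(new_labels, labels):          # 4. stop when stable
            break
        labels = new_labels                             # 3. M-step: update memberships
    return labels

# Example usage: labels = kernel_kmeans(rbf_gram(X, width=0.5), n_clusters=2)
```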

 

            

 

slide-36
SLIDE 36

MACHINE LEARNING – 2012 37

Kernel K-means

Recall the assignment rule:
$$\min_k d\left(x^i, C^k\right) = \min_k \left( k\left(x^i, x^i\right) - \frac{2}{\left|C^k\right|} \sum_{x^j \in C^k} k\left(x^i, x^j\right) + \frac{1}{\left|C^k\right|^2} \sum_{x^j, x^l \in C^k} k\left(x^j, x^l\right) \right)$$

With an RBF kernel: the first term $k\left(x^i, x^i\right)$ is a constant of value 1; the middle sum is close to 1 if $x^i$ is close to all points in cluster $k$; the last sum is close to 1 if the points are well grouped in cluster $k$.

What about a homogeneous polynomial kernel?

slide-37
SLIDE 37

MACHINE LEARNING – 2012 38

Kernel K-means

With a polynomial kernel (same assignment rule as above): some of the terms change sign depending on the points' positions with respect to the origin. The first term is a positive value; the sums are maximal when the points are aligned in the same quadrant.

slide-38
SLIDE 38

MACHINE LEARNING – 2012 39

Kernel K-means: examples

Rbf Kernel, 2 Clusters

slide-39
SLIDE 39

MACHINE LEARNING – 2012 40

Kernel K-means: examples

Rbf Kernel, 2 Clusters

slide-40
SLIDE 40

MACHINE LEARNING – 2012 41

Kernel K-means: examples

Rbf Kernel, 2 Clusters

Kernel width: 0.5 Kernel width: 0.05

slide-41
SLIDE 41

MACHINE LEARNING – 2012 42

Kernel K-means: examples

Polynomial Kernel, 2 Clusters

slide-42
SLIDE 42

MACHINE LEARNING – 2012 43

Kernel K-means: examples

Polynomial Kernel (p=8), 2 Clusters

slide-43
SLIDE 43

MACHINE LEARNING – 2012 44

Kernel K-means: examples

Polynomial Kernel, 2 Clusters

Order 2 Order 4 Order 6

slide-44
SLIDE 44

MACHINE LEARNING – 2012 45

Kernel K-means: examples

Polynomial Kernel, 2 Clusters

The separating line will always be perpendicular to the line passing through the origin (which is located at the mean of the datapoints) and parallel to the axis of the ordinates (because of the change in sign of the cosine in the inner product).

→ No better than linear K-means!

slide-45
SLIDE 45

MACHINE LEARNING – 2012 46

Kernel K-means: examples

Polynomial Kernel, 4 Clusters

Can only group datapoints that do not overlap across quadrants with respect to the origin (careful: the data are centered!).

→ No better than linear K-means (except that it is less sensitive to random initialization)!

Solutions found with Kernel K-means Solutions found with K-means

slide-46
SLIDE 46

MACHINE LEARNING – 2012 47

Choice of number of Clusters in Kernel K-means is important

Kernel K-means: Limitations

slide-47
SLIDE 47

MACHINE LEARNING – 2012 48

Choice of number of Clusters in Kernel K-means is important

Kernel K-means: Limitations

slide-48
SLIDE 48

MACHINE LEARNING – 2012 49

Choice of number of Clusters in Kernel K-means is important

Kernel K-means: Limitations

slide-49
SLIDE 49

MACHINE LEARNING – 2012

50

Limitations of kernel K-means

Raw Data

slide-50
SLIDE 50

MACHINE LEARNING – 2012

51

Limitations of kernel K-means

kernel K-means with K=3, RBF kernel

slide-51
SLIDE 51

MACHINE LEARNING – 2012 52

From Non-Linear Manifolds (Laplacian Eigenmaps, Isomap) to Spectral Clustering

slide-52
SLIDE 52

MACHINE LEARNING – 2012 53

Non-Linear Manifolds

PCA and Kernel PCA belong to a more general class of methods to create non-linear manifolds based on spectral decomposition.

(Spectral decomposition of matrices is more frequently referred to as an eigenvalue decomposition.)

Depending on which matrix we decompose, we get a different set of projections.

  • PCA decomposes the covariance matrix of the dataset  generate

rotation and projection in the original space

  • kernel PCA decomposes the Gram matrix  partition or regroup the

datapoints

  • The Laplacian matrix is a matrix representation of a graph. Its spectral

decomposition can be used for clustering.

slide-53
SLIDE 53

MACHINE LEARNING – 2012 54

Embed Data in a Graph

  • Build a similarity graph
  • Each vertex on the graph is a datapoint

Original dataset Graph representation of the dataset

slide-54
SLIDE 54

MACHINE LEARNING – 2012 55

Measure Distances in Graph

Construct the similarity matrix S to denote whether points are close to or far away from each other, and to weight the edges of the graph, e.g.:
$$S = \begin{pmatrix} 0.9 & \cdots & 0.8 & \cdots & 0.2 & \cdots & 0.2 \\ \vdots & & & & & & \vdots \\ 0.2 & \cdots & 0.2 & \cdots & 0.7 & \cdots & 0.6 \end{pmatrix}$$

slide-55
SLIDE 55

MACHINE LEARNING – 2012 56

Disconnected Graphs

$$S = \begin{pmatrix} 1 & \cdots & 1 & \cdots & 0 & \cdots & 0 \\ \vdots & & & & & & \vdots \\ 0 & \cdots & 0 & \cdots & 1 & \cdots & 1 \end{pmatrix}$$

Disconnected graph: two data points are connected if a) the similarity between them is higher than a threshold, or b) they are k-nearest neighbors (according to the similarity metric).

slide-56
SLIDE 56

MACHINE LEARNING – 2012 57

Graph Laplacian

Given the similarity matrix $S$, construct the diagonal degree matrix $D$ composed of the sum of each row of $S$:
$$D = \begin{pmatrix} \sum_i S_{1i} & & 0 \\ & \ddots & \\ 0 & & \sum_i S_{Mi} \end{pmatrix}$$
and then build the Laplacian matrix:
$$L = D - S.$$
$L$ is positive semi-definite → a spectral decomposition is possible.
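A minimal Python sketch of building S, the degree matrix D and the graph Laplacian L = D − S. The RBF similarity, its width, and the optional k-NN sparsification rule are assumptions of this sketch.

```python
# Hedged sketch: similarity graph and unnormalized graph Laplacian.
import numpy as np

def similarity_matrix(X, width=1.0, knn=None):
    """X: (M, N). Returns an (M, M) similarity matrix (RBF, optionally k-NN binarized)."""
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    S = np.exp(-sq / (2 * width**2))
    if knn is not None:
        keep = np.zeros_like(S)
        nn = np.argsort(-S, axis=1)[:, 1:knn + 1]   # knn strongest links, excluding self
        for i, js in enumerate(nn):
            keep[i, js] = 1.0                        # symmetrized 0/1 graph
            keep[js, i] = 1.0
        S = keep
    return S

def graph_laplacian(S):
    D = np.diag(S.sum(axis=1))   # degree matrix: row sums of S
    return D - S                 # unnormalized graph Laplacian

# Example usage: L = graph_laplacian(similarity_matrix(X, width=0.5, knn=5))
```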

 

slide-57
SLIDE 57

MACHINE LEARNING – 2012 58

Graph Laplacian

Eigenvalue decomposition of the Laplacian matrix: $L = U \Lambda U^T$.
All eigenvalues of $L$ are non-negative and the smallest eigenvalue of $L$ is zero. Order the eigenvalues by increasing order: $\lambda_1 \le \lambda_2 \le \dots \le \lambda_M$.
If the graph has $k$ connected components, then the eigenvalue $\lambda = 0$ has multiplicity $k$.

slide-58
SLIDE 58

MACHINE LEARNING – 2012 59

Spectral Clustering

The multiplicity of the eigenvalue 0 determines the number of connected components in a graph. The associated eigenvectors identify these connected components: for an eigenvalue $\lambda_i = 0$, the corresponding eigenvector $e^i$ has the same value for all vertices in one component and a different value for each of the other components.

Identifying the clusters is then trivial when the similarity matrix is composed of zeros and ones (as when using k-nearest neighbor). What happens when the similarity matrix is full?

slide-59
SLIDE 59

MACHINE LEARNING – 2012 60

Spectral Clustering

Similarity map $S : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ (e.g. the full-valued matrix $S$ shown before, with entries 0.9, 0.8, ..., 0.2, ..., 0.7, 0.6). $S$ can either be binary (k-nearest neighbour) or continuous, e.g. with a Gaussian kernel:
$$S\left(x^i, x^j\right) = e^{-\frac{\left\|x^i - x^j\right\|^2}{2\sigma^2}}$$

slide-60
SLIDE 60

MACHINE LEARNING – 2012 61

Spectral Clustering

Similarity map $S : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$: either binary (k-nearest neighbour) or continuous with a Gaussian kernel, $S\left(x^i, x^j\right) = e^{-\frac{\left\|x^i - x^j\right\|^2}{2\sigma^2}}$.

1) Build the Laplacian matrix: $L = D - S$.
2) Do the eigenvalue decomposition of the Laplacian matrix: $L = U \Lambda U^T$.
3) Order the eigenvalues by increasing order: $\lambda_1 \le \lambda_2 \le \dots \le \lambda_M$.

The first eigenvalue is still zero, but with multiplicity 1 only (fully connected graph)! Idea: the smallest eigenvalues are close to zero and hence also provide information on the partitioning of the graph (see exercise session).

slide-61
SLIDE 61

MACHINE LEARNING – 2012 62

Spectral Clustering

Eigenvalue decomposition of the Laplacian matrix: $L = U \Lambda U^T$, with $U = \left[e^1\ e^2\ \dots\ e^M\right]$ the matrix of eigenvectors.

Construct an embedding of each of the $M$ datapoints through
$$x^i \ \rightarrow\ y^i = \begin{pmatrix} e^1_i \\ \vdots \\ e^K_i \end{pmatrix},$$
reducing the dimensionality by picking $K < M$ projections.

With a clear partitioning of the graph, the entries in $y$ are split into sets of equal values. Each group of points with the same value belongs to the same partition (cluster).

slide-62
SLIDE 62

MACHINE LEARNING – 2012 63

Spectral Clustering

Eigenvalue decomposition of the Laplacian matrix, $L = U \Lambda U^T$ with $U = \left[e^1\ e^2\ \dots\ e^M\right]$, and embedding $x^i \rightarrow y^i = \left(e^1_i, \dots, e^K_i\right)^T$ with $K < M$, as before.

When we have a fully connected graph, the entries in $y$ take any real value.

slide-63
SLIDE 63

MACHINE LEARNING – 2012 64

Spectral Clustering

Example: 3 datapoints ($x^1$, $x^2$, $x^3$) in a graph composed of 2 partitions. The similarity matrix is
$$S = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix};$$
$L$ has eigenvalue $\lambda = 0$ with multiplicity two.

One solution for the two associated eigenvectors is $e^1 = (1, 1, 0)^T$, $e^2 = (0, 0, 1)^T$; another solution is $e^1 = (0.33, 0.33, 0.88)^T$, $e^2 = (-0.1, -0.1, 0.99)^T$.

The entries in the eigenvectors for the two first datapoints are equal:
$$y^1 = y^2 = \begin{pmatrix}1\\0\end{pmatrix} \text{ for the 1st set of eigenvectors}, \qquad y^1 = y^2 = \begin{pmatrix}0.33\\-0.1\end{pmatrix} \text{ for the 2nd set}.$$

slide-64
SLIDE 64

MACHINE LEARNING – 2012 65

Spectral Clustering

Example: 3 datapoints ($x^1$, $x^2$, $x^3$) in a fully connected graph. The similarity matrix is
$$S = \begin{pmatrix} 1 & 0.9 & 0.02 \\ 0.9 & 1 & 0.02 \\ 0.01 & 0.02 & 1 \end{pmatrix};$$
$L$ has eigenvalue $\lambda_1 = 0$ with multiplicity 1. The second eigenvalue is small, $\lambda_2 \approx 0.04$, whereas the 3rd one is large, $\lambda_3 \approx 1.81$, with associated eigenvectors:
$$e^1 = \begin{pmatrix}1\\1\\1\end{pmatrix}, \qquad e^2 \approx \begin{pmatrix}0.411\\0.404\\-0.81\end{pmatrix}, \qquad e^3 \approx \begin{pmatrix}0.8\\-0.7\\0.0\end{pmatrix}$$

The entries in the 2nd eigenvector for the two first datapoints are almost equal; the first two points have almost the same coordinates in the $y$ embedding:
$$y^1 = \begin{pmatrix}1\\0.41\end{pmatrix}, \qquad y^2 = \begin{pmatrix}1\\0.40\end{pmatrix}$$

Reduce the dimensionality by considering the smallest eigenvalues.

slide-65
SLIDE 65

MACHINE LEARNING – 2012 66

Spectral Clustering

Example: 3 datapoints ($x^1$, $x^2$, $x^3$) in a fully connected graph. The similarity matrix is
$$S = \begin{pmatrix} 1 & 0.9 & 0.8 \\ 0.9 & 1 & 0.7 \\ 0.8 & 0.7 & 1 \end{pmatrix};$$
$L$ has eigenvalue $\lambda_1 = 0$ with multiplicity 1. The second and third eigenvalues are both large, $\lambda_2 \approx 2.23$, $\lambda_3 \approx 2.57$, with associated eigenvectors:
$$e^1 = \begin{pmatrix}1\\1\\1\end{pmatrix}, \qquad e^2 \approx \begin{pmatrix}-0.21\\-0.57\\0.79\end{pmatrix}, \qquad e^3 \approx \begin{pmatrix}-0.78\\0.57\\0.21\end{pmatrix}$$

The entries in the 2nd eigenvector for the two first datapoints are no longer equal; the first two points no longer have the same coordinates in the $y$ embedding:
$$y^1 = \begin{pmatrix}1\\-0.21\end{pmatrix}, \qquad y^2 = \begin{pmatrix}1\\-0.57\end{pmatrix}$$

The 3rd point is now closer to the two other points.

slide-66
SLIDE 66

MACHINE LEARNING – 2012 67

Spectral Clustering

[Figure: graph with datapoints x1 ... x6, edge weights w12, w21, ..., and the corresponding embedded points y1 ... y6]

Step 1: Embedding in y. Idea: points close to one another have almost the same coordinates on the eigenvectors of L with small eigenvalues. Do an eigenvalue decomposition of the Laplacian matrix L and project the datapoints onto the first K eigenvectors with smallest eigenvalues (hence reducing the dimensionality).

slide-67
SLIDE 67

MACHINE LEARNING – 2012 68

Spectral Clustering

Step 2: Perform K-means on the set of vectors $y^1, \dots, y^M$. Cluster the datapoints $x$ according to their clustering in $y$.

[Figure: same graph with datapoints x1 ... x6 and embedded points y1 ... y6, now grouped by K-means]
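A minimal Python sketch of the two-step spectral clustering procedure above, reusing the hypothetical similarity_matrix(), graph_laplacian() and kmeans() helpers from the earlier sketches; the kernel width and number of retained eigenvectors are illustrative choices.

```python
# Hedged sketch of spectral clustering: Laplacian embedding followed by K-means.
import numpy as np

def spectral_clustering(X, n_clusters, width=0.5):
    L = graph_laplacian(similarity_matrix(X, width=width))
    vals, vecs = np.linalg.eigh(L)         # eigenvalues returned in increasing order
    Y = vecs[:, :n_clusters]               # step 1: embed x^i as y^i (first K eigenvectors)
    _, labels = kmeans(Y, n_clusters)      # step 2: K-means in the embedded space
    return labels
```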

slide-68
SLIDE 68

MACHINE LEARNING – 2012 70

Equivalency to other non-linear Embeddings

Spectral decomposition of the similarity matrix (which is already positive semi-definite) gives a similar embedding:
$$x^i \ \rightarrow\ y^i = \begin{pmatrix} \lambda_1 e^1_i \\ \vdots \\ \lambda_K e^K_i \end{pmatrix}$$

In Isomap, the embedding is normalized by the eigenvalues and uses geodesic distance to build the similarity matrix, see supplementary material.

slide-69
SLIDE 69

MACHINE LEARNING – 2012 71

Laplacian Eigenmaps

The vectors $y^i$, $i = 1 \dots M$, form an embedding of the datapoints.

[Figure: Swiss-roll example and projections on each Laplacian eigenvector. Image courtesy of A. Singh]

Solve the generalized eigenvalue problem $L y = \lambda D y$, the solution of the optimization problem
$$\min_y y^T L y \quad \text{such that} \quad y^T D y = 1,$$
which ensures minimal distortion while preventing arbitrary scaling.

slide-70
SLIDE 70

MACHINE LEARNING – 2012 72

Equivalency to other non-linear Embeddings

Kernel PCA: eigenvalue decomposition of the similarity matrix $S = U D U^T$.

The choice of parameters in kernel K-Means can be initialized by doing a readout of the Gram matrix after kernel PCA.

slide-71
SLIDE 71

MACHINE LEARNING – 2012 73

Kernel K-means and Kernel PCA

The optimization problem of kernel K-means is equivalent to:
$$\max_H \operatorname{tr}\left(H^T K H\right), \qquad H = Y D^{-1/2}, \qquad \operatorname{tr}\left(H^T K H\right) \le \sum_{i=1}^{K} \lambda_i,$$
with $\lambda_1, \dots, \lambda_M$ the $M$ eigenvalues resulting from the eigenvalue decomposition of the Gram matrix $K$.

$Y : M \times K$, $D : K \times K$. Each entry of $Y$ is 1 if the datapoint belongs to cluster $k$ and zero otherwise. $D$ is diagonal; the element on the diagonal is the number of datapoints in cluster $k$.

See the paper by M. Welling, supplementary document on the website.

Look at the eigenvalues to determine the optimal number of clusters.
slide-72
SLIDE 72

MACHINE LEARNING – 2012 74

Kernel PCA projections can also help determine the kernel width. From top to bottom: kernel widths of 0.8, 1.5, 2.5.

slide-73
SLIDE 73

MACHINE LEARNING – 2012 75

Kernel PCA projections can help determine the kernel width

The sum of the eigenvalues grows as we get a better clustering.

slide-74
SLIDE 74

MACHINE LEARNING – 2012

76

Quick Recap of Gaussian Mixture Model

slide-75
SLIDE 75

MACHINE LEARNING – 2012

77

Clustering with Mixture of Gaussians

Alternative to K-means; soft partitioning with elliptic clusters instead of spheres

Clustering with Mixtures of Gaussians using spherical Gaussians (left) and non spherical Gaussians (i.e. with full covariance matrix) (right). Notice how the clusters become elongated along the direction of the clusters (the grey circles represent the first and second variances of the distributions).

slide-76
SLIDE 76

MACHINE LEARNING – 2012 78

Gaussian Mixture Model (GMM)

Using a set of $M$ $N$-dimensional training datapoints $X = \left\{x^i\right\}_{i=1\dots M}$, $x^i \in \mathbb{R}^N$, the pdf of $X$ is modeled through a mixture of $K$ Gaussians:
$$p(X \mid \theta) = \sum_{i=1}^{K} \pi_i\, p\left(X \mid \mu_i, \Sigma_i\right), \qquad p\left(X \mid \mu_i, \Sigma_i\right) = \mathcal{N}\left(\mu_i, \Sigma_i\right),$$
with $\mu_i$, $\Sigma_i$ the mean and covariance matrix of Gaussian $i$.

Mixing coefficients: $\sum_{i=1}^{K} \pi_i = 1$. Probability that the data was explained by Gaussian $i$:
$$p(i) = \frac{1}{M} \sum_{j=1}^{M} p\left(i \mid x^j\right)$$

slide-77
SLIDE 77

MACHINE LEARNING – 2012

79

Gaussian Mixture Modeling

The parameters of a GMM are the means, the covariance matrices and the prior pdf:
$$\theta = \left\{\mu_1, \dots, \mu_K,\ \Sigma_1, \dots, \Sigma_K,\ \pi_1, \dots, \pi_K\right\}$$
Estimation of all the parameters can be done through Expectation-Maximization (E-M). E-M tries to find the optimum of the likelihood of the model given the data, i.e.:
$$\max_\theta L(\theta \mid X) = \max_\theta p(X \mid \theta)$$

See lecture notes for details
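A compact Python sketch of the E-M iterations for a GMM with full covariances, following the description above. The initialization, the small covariance regularizer and the fixed iteration count are assumptions of this sketch, not the course's reference implementation.

```python
# Hedged sketch of E-M for a Gaussian Mixture Model.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """X: (M, N). Returns means (K, N), covariances (K, N, N), priors (K,)."""
    M, N = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(M, size=K, replace=False)]
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(N) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities p(i | x^j)
        dens = np.array([pi[i] * multivariate_normal.pdf(X, mu[i], cov[i])
                         for i in range(K)]).T            # (M, K)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update priors, means and covariances
        Nk = resp.sum(axis=0)
        pi = Nk / M
        mu = (resp.T @ X) / Nk[:, None]
        for i in range(K):
            d = X - mu[i]
            cov[i] = (resp[:, i, None] * d).T @ d / Nk[i] + 1e-6 * np.eye(N)
    return mu, cov, pi
```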

slide-78
SLIDE 78

MACHINE LEARNING – 2012 80

Gaussian Mixture Model

slide-79
SLIDE 79

MACHINE LEARNING – 2012 81

Gaussian Mixture Model

GMM using 4 Gaussians with random initialization

slide-80
SLIDE 80

MACHINE LEARNING – 2012 82

Gaussian Mixture Model

Expectation Maximization is very sensitive to initial conditions:

GMM using 4 Gaussians with new random initialization

slide-81
SLIDE 81

MACHINE LEARNING – 2012 83

Gaussian Mixture Model

Very sensitive to choice of number of Gaussians. Number of Gaussians can be optimized iteratively using AIC or BIC, like for K-means:

Here, GMM using 8 Gaussians

slide-82
SLIDE 82

MACHINE LEARNING – 2012

84

Evaluation of Clustering Methods

slide-83
SLIDE 83

MACHINE LEARNING – 2012

85

Evaluation of Clustering Methods

Clustering methods rely on hyper parameters

  • Number of clusters
  • Kernel parameters

→ Need to determine the goodness of these choices. Clustering is unsupervised classification → we do not know the real number of clusters or the data labels → difficult to evaluate these choices without ground truth.

slide-84
SLIDE 84

MACHINE LEARNING – 2012

86

Evaluation of Clustering Methods

Two types of measures: internal versus external measures. Internal measures rely on a measure of similarity (e.g., intra-cluster distance versus inter-cluster distance). For example, the Residual Sum of Squares (RSS) is an internal measure (available in mldemos); it gives the squared distance of each vector from its centroid, summed over all vectors:
$$\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x \in C^k} \left\|x - m^k\right\|^2$$
→ Internal measures are problematic, as the metric of similarity is often already optimized by the clustering algorithm.

slide-85
SLIDE 85

MACHINE LEARNING – 2012

87

K-means, soft K-means and GMM have several hyperparameters (fixed number of clusters, beta, number of Gaussian functions) → we need a measure to determine how well the choice of hyperparameters fits the dataset (a maximum-likelihood measure).

$X$: dataset; $M$: number of datapoints; $B$: number of free parameters; $L$: maximum likelihood of the model given $B$ parameters.

  • Akaike Information Criterion: $\mathrm{AIC} = -2 \ln L + 2B$
  • Bayesian Information Criterion: $\mathrm{BIC} = -2 \ln L + B \ln M$

Lower BIC implies either fewer explanatory variables, better fit, or both. As the number of datapoints (observations) increases, BIC assigns more weight to simpler models than AIC.

Choosing AIC versus BIC depends on the application: Is the purpose of the analysis to make predictions, or to decide which model best represents reality? AIC may have better predictive ability than BIC, but BIC finds a computationally more efficient solution.

Evaluation of Clustering Methods

The added terms ($2B$ and $B \ln M$) are a penalty for the increase in model complexity and computational cost.
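The following sketch computes AIC and BIC for a GMM fitted with the hypothetical gmm_em() helper sketched earlier; the free-parameter count B assumes full covariance matrices and is an illustrative choice.

```python
# Hedged sketch: AIC / BIC for a fitted GMM.
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, mu, cov, pi):
    dens = np.array([pi[i] * multivariate_normal.pdf(X, mu[i], cov[i])
                     for i in range(len(pi))])
    return np.sum(np.log(dens.sum(axis=0)))        # ln L = sum_j ln sum_i pi_i N(x_j)

def gmm_aic_bic(X, mu, cov, pi):
    M, N = X.shape
    K = len(pi)
    B = (K - 1) + K * N + K * N * (N + 1) // 2      # priors + means + full covariances
    logL = gmm_log_likelihood(X, mu, cov, pi)
    return -2 * logL + 2 * B, -2 * logL + B * np.log(M)

# Model selection: fit for several K and keep the K with the lowest BIC (or AIC).
```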

slide-86
SLIDE 86

MACHINE LEARNING – 2012

88

Evaluation of Clustering Methods

Two types of measures: internal versus external measures. External measures assume that a subset of datapoints have class labels and measure how well these datapoints are clustered. → Needs an idea of the classes and some labeled datapoints. → Interesting mainly in cases where labeling is highly time-consuming and the data is very large (e.g., in speech recognition).

slide-87
SLIDE 87

MACHINE LEARNING – 2012

89

Evaluation of Clustering Methods

Raw Data

slide-88
SLIDE 88

MACHINE LEARNING – 2012

90

     

 

             

$M$: number of datapoints; $C$: the set of classes; $K$: number of clusters; $n_{ik}$: number of members of class $c_i$ in cluster $k$.

$$F(C, K) = \sum_{c_i \in C} \frac{\left|c_i\right|}{M} \max_k F\left(c_i, k\right), \qquad F\left(c_i, k\right) = \frac{2\, R\left(c_i, k\right)\, P\left(c_i, k\right)}{R\left(c_i, k\right) + P\left(c_i, k\right)},$$
$$R\left(c_i, k\right) = \frac{n_{ik}}{\left|c_i\right|}, \qquad P\left(c_i, k\right) = \frac{n_{ik}}{n_k}.$$

Semi-Supervised Learning

Clustering F-Measure:

(careful: similar but not the same F-measure as the F-measure we will see for classification!)

Tradeoff between clustering correctly all datapoints of the same class in the same cluster and making sure that each cluster contains points of only one class.

For each class, pick the cluster with the maximal number of its datapoints. Recall: proportion of the class's datapoints correctly clustered. Precision: proportion of datapoints of the same class in the cluster.
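A minimal Python sketch of the clustering F-measure defined above; the array-based interface is an illustrative assumption.

```python
# Hedged sketch of the clustering F-measure F(C, K).
import numpy as np

def clustering_f_measure(true_classes, cluster_labels):
    """Both arguments: length-M integer arrays. Returns F(C, K) in [0, 1]."""
    true_classes = np.asarray(true_classes)
    cluster_labels = np.asarray(cluster_labels)
    M = len(true_classes)
    total = 0.0
    for c in np.unique(true_classes):
        in_class = (true_classes == c)
        best = 0.0
        for k in np.unique(cluster_labels):
            in_cluster = (cluster_labels == k)
            n_ik = np.sum(in_class & in_cluster)
            if n_ik == 0:
                continue
            recall = n_ik / in_class.sum()        # R(c_i, k) = n_ik / |c_i|
            precision = n_ik / in_cluster.sum()   # P(c_i, k) = n_ik / n_k
            best = max(best, 2 * recall * precision / (recall + precision))
        total += in_class.sum() / M * best        # weight by class size |c_i| / M
    return total
```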

slide-89
SLIDE 89

MACHINE LEARNING – 2012

91

Evaluation of Clustering Methods

RSS with K-means can find the true optimal number of clusters but is very sensitive to random initialization (left and right: two different runs). RSS finds an optimum at K=4 and at K=5 for the right run.

slide-90
SLIDE 90

MACHINE LEARNING – 2012

92

Evaluation of Clustering Methods

BIC (left) and AIC (right) perform very poorly here, splitting some clusters into two halves.

slide-91
SLIDE 91

MACHINE LEARNING – 2012

93

Evaluation of Clustering Methods

BIC (left) and AIC (right) perform much better for picking the right number of clusters in GMM.

slide-92
SLIDE 92

MACHINE LEARNING – 2012

94

Evaluation of Clustering Methods

Raw Data

slide-93
SLIDE 93

MACHINE LEARNING – 2012

95

Evaluation of Clustering Methods

Optimization with BIC using K-means

slide-94
SLIDE 94

MACHINE LEARNING – 2012

96

Evaluation of Clustering Methods

Optimization with AIC using K-means. AIC tends to find more clusters.

slide-95
SLIDE 95

MACHINE LEARNING – 2012

97

Evaluation of Clustering Methods

Raw Data

slide-96
SLIDE 96

MACHINE LEARNING – 2012

98

Evaluation of Clustering Methods

Optimization with AIC using kernel K-means with RBF

slide-97
SLIDE 97

MACHINE LEARNING – 2012

99

Evaluation of Clustering Methods

Optimization with BIC using kernel K-means with RBF

slide-98
SLIDE 98

MACHINE LEARNING – 2012

100

Semi-Supervised Learning

Raw Data: 3 classes

slide-99
SLIDE 99

MACHINE LEARNING – 2012

101

Semi-Supervised Learning

Clustering with RBF kernel K-Means after optimization with BIC

slide-100
SLIDE 100

MACHINE LEARNING – 2012

102

Semi-Supervised Learning

After semi-supervised learning

slide-101
SLIDE 101

MACHINE LEARNING – 2012 103

Summary

We have seen several methods for extracting structure in data using the notion of a kernel, and discussed their similarities, differences and complementarity:

  • Kernel CCA is a generalization of kernel PCA to determine partial grouping in different dimensions.
  • Kernel PCA can be used to bootstrap the choice of hyperparameters in kernel K-means.

We have compared the geometrical division of space yielded by RBF and polynomial kernels. We have seen that simpler techniques than kernel K-means, such as K-means with norm-p and mixtures of Gaussians, can yield complex non-linear clustering.

slide-102
SLIDE 102

MACHINE LEARNING – 2012 104

Summary

When to use what? E.g., K-means versus kernel K-means versus GMM. When using any machine learning algorithm, you have to balance a number of factors:

  • Computing time at training and at testing
  • Number of open parameters
  • Curse of dimensionality (order of growth with the number of datapoints, dimension, etc.)
  • Robustness to initial conditions, optimality of the solution