ADVANCED MACHINE LEARNING: Kernel Canonical Correlation Analysis


SLIDE 1

MACHINE LEARNING Kernel Canonical Correlation Analysis

SLIDE 2

Structure of today’s and next week’s class

1) Briefly go through one extension of principal component analysis, namely Canonical Correlation Analysis (CCA).
2) Derive the non-linear version of CCA, kernel CCA (kCCA).
3) Do an exercise to understand the modulation of the space generated by CCA and kCCA.

SLIDE 3

Canonical Correlation Analysis (CCA)

[Figure: two descriptions of the same objects, a video description and an audio description, giving paired datapoints $(x^1, y^1), (x^2, y^2), \ldots$ with $x \in \mathbb{R}^{N_x}$ and $y \in \mathbb{R}^{N_y}$.]

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T x,\; w_y^T y\right)$$

Determine features in two (or more) separate descriptions of the dataset that best explain each datapoint.

Extract the hidden structure that maximizes correlation across two different projections.

SLIDE 4

Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T x,\; w_y^T y\right)$$

Pair of multidimensional zero-mean variables $(x^1, y^1), (x^2, y^2), \ldots$; we have $M$ instances of the pairs:

$$X = \left\{x^i\right\}_{i=1}^{M} \subset \mathbb{R}^{N_x}, \qquad Y = \left\{y^i\right\}_{i=1}^{M} \subset \mathbb{R}^{N_y}$$

Search for two projections $w_x$ and $w_y$, $z_x = w_x^T X$ and $z_y = w_y^T Y$, solutions of:

$$\max_{w_x, w_y} \operatorname{corr}\left(z_x, z_y\right)$$

SLIDE 5

Canonical Correlation Analysis (CCA)

With $X$ and $Y$ zero mean, i.e. $E[X] = E[Y] = 0$:

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T x,\; w_y^T y\right) = \max_{w_x, w_y} \frac{w_x^T E\left[X Y^T\right] w_y}{\sqrt{w_x^T C_{xx} w_x}\,\sqrt{w_y^T C_{yy} w_y}} = \max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\,\sqrt{w_y^T C_{yy} w_y}}$$

Search for two projections $w_x$ and $w_y$, $z_x = w_x^T X$ and $z_y = w_y^T Y$, solutions of:

$$\max_{w_x, w_y} \operatorname{corr}\left(z_x, z_y\right)$$

SLIDE 6

Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\,\sqrt{w_y^T C_{yy} w_y}}$$

With $X$ and $Y$ zero mean, i.e. $E[X] = E[Y] = 0$.

Covariance matrices: $C_{xx} = E\left[X X^T\right]$ of size $N_x \times N_x$, and $C_{yy} = E\left[Y Y^T\right]$ of size $N_y \times N_y$.

The cross-covariance matrix $C_{xy}$, of size $N_x \times N_y$, measures the cross-correlation between $X$ and $Y$.

SLIDE 7

Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x}\,\sqrt{w_y^T C_{yy} w_y}}$$

The correlation is not affected by rescaling the norm of the vectors, so we can require that $w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$:

$$\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$$

SLIDE 8

Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$$

To determine the optimum (maximum), solve by Lagrange:

$$L\left(\lambda_x, \lambda_y, w_x, w_y\right) = w_x^T C_{xy} w_y - \frac{\lambda_x}{2}\left(w_x^T C_{xx} w_x - 1\right) - \frac{\lambda_y}{2}\left(w_y^T C_{yy} w_y - 1\right)$$

and take the partial derivatives over $w_x$ and $w_y$.
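Writing the derivatives out (a short step the slide compresses): $\partial L / \partial w_x = C_{xy} w_y - \lambda_x C_{xx} w_x = 0$ and $\partial L / \partial w_y = C_{yx} w_x - \lambda_y C_{yy} w_y = 0$. Left-multiplying these by $w_x^T$ and $w_y^T$ respectively, and using the constraints $w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$, gives $\lambda_x = w_x^T C_{xy} w_y = w_y^T C_{yx} w_x = \lambda_y$, so a single multiplier $\lambda := \lambda_x = \lambda_y$ suffices.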

SLIDE 9

Canonical Correlation Analysis (CCA)

Replacing $\lambda := \lambda_x = \lambda_y$ and writing out the set of equations gives:

$$C_{xy} w_y = \lambda\, C_{xx} w_x, \qquad C_{yx} w_x = \lambda\, C_{yy} w_y,$$

which can be rewritten as:

$$C_{yx} C_{xx}^{-1} C_{xy} w_y = \lambda^2\, C_{yy} w_y$$

This is a generalized eigenvalue problem; it can be reduced to a classical eigenvalue problem if $C_{xx}$ is invertible. Solving for $w_y$ gives the equation above; if $C_{yy}$ is also invertible, it becomes a classical eigenvalue problem, and likewise for $w_x$.

These two eigenvalue problems yield $q$ pairs of vectors $\left(w_x^i, w_y^i\right)$, $i = 1, \ldots, q$, with $q \le \min\left(N_x, N_y\right)$.
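To make the two reduced eigenvalue problems concrete, here is a minimal numpy sketch of linear CCA (an illustration, not part of the original slides; function and variable names are mine, and the scaling of the returned vectors is left unnormalized):

```python
import numpy as np

def linear_cca(X, Y, n_components=1):
    """Linear CCA. X: (Nx, M) and Y: (Ny, M), one datapoint per column."""
    M = X.shape[1]
    X = X - X.mean(axis=1, keepdims=True)  # enforce the zero-mean assumption
    Y = Y - Y.mean(axis=1, keepdims=True)
    Cxx, Cyy, Cxy = X @ X.T / M, Y @ Y.T / M, X @ Y.T / M
    # Reduced problem: C_yy^{-1} C_yx C_xx^{-1} C_xy w_y = lambda^2 w_y
    Mat = np.linalg.solve(Cyy, Cxy.T) @ np.linalg.solve(Cxx, Cxy)
    vals, vecs = np.linalg.eig(Mat)
    order = np.argsort(-vals.real)[:n_components]
    lam = np.sqrt(np.clip(vals.real[order], 0.0, None))  # canonical correlations
    Wy = vecs[:, order].real
    # Recover w_x from C_xy w_y = lambda C_xx w_x
    Wx = np.linalg.solve(Cxx, Cxy @ Wy) / np.maximum(lam, 1e-12)
    return Wx, Wy, lam
```

With finite data, Cxx or Cyy may be ill-conditioned; in practice one adds a small ridge term before inverting, which foreshadows the regularization used for kernel CCA below.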

SLIDE 10

CCA: Exercise I

[Figure: four points X1, X2, X3, X4 plotted in the X space and their counterparts Y1, Y2, Y3, Y4 in the Y space.]

Consider the example below of a dataset of 4 points with 2-dimensional coordinates in both X and Y.

  • Determine by hand the directions found by CCA in each space.
  • Contrast to the directions found by PCA.
SLIDE 11

Kernel Canonical Correlation Analysis

CCA finds basis vectors such that the correlation between the projections (of all datapoints in X and Y) is mutually maximized. CCA is a generalized version of PCA for two or more multi-dimensional datasets, but unlike PCA it does not have the constraint of finding orthogonal vectors. CCA assumes a linear correlation; if the correlation is non-linear → kernel CCA.

SLIDE 12

Kernel Canonical Correlation Analysis (kCCA)

[Figure: paired data from two descriptions (video and audio), $(x^1, y^1), (x^2, y^2), \ldots$ with $x \in \mathbb{R}^{N_x}$ and $y \in \mathbb{R}^{N_y}$.]

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T \phi_x(x),\; w_y^T \phi_y(y)\right)$$

Assume two transformations $\phi_x$ and $\phi_y$, and then perform the correlation analysis across the two feature spaces.

SLIDE 13

From CCA to Kernel CCA

$$X = \left\{x^i\right\}_{i=1}^{M} \subset \mathbb{R}^{N_x}, \qquad Y = \left\{y^i\right\}_{i=1}^{M} \subset \mathbb{R}^{N_y}$$

Send the data into two separate feature spaces, $\left\{\phi_x\left(x^i\right)\right\}_{i=1}^{M}$ for the data in X and $\left\{\phi_y\left(y^i\right)\right\}_{i=1}^{M}$ for the data in Y, with $\sum_{i=1}^{M} \phi_x\left(x^i\right) = 0$ and $\sum_{i=1}^{M} \phi_y\left(y^i\right) = 0$ (zero mean in feature space).

Construct the associated kernel matrices:

$$K_x = F_x^T F_x, \qquad K_y = F_y^T F_y,$$

where the columns of $F_x$, $F_y$ are the $\phi_x\left(x^i\right)$, $\phi_y\left(y^i\right)$.
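As a concrete illustration (a sketch, not from the slides; the RBF kernel and the width sigma are example choices), the Gram matrices with the zero-mean-in-feature-space assumption enforced by centering:

```python
import numpy as np

def rbf_gram(A, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - a_j||^2 / (2 sigma^2)), rows of A: (M, d)."""
    sq = np.sum(A**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (A @ A.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def center_gram(K):
    """Centering in feature space: K <- H K H with H = I - (1/M) 1 1^T."""
    M = K.shape[0]
    H = np.eye(M) - np.ones((M, M)) / M
    return H @ K @ H

# Kx = center_gram(rbf_gram(X, sigma=0.5))  # X: (M, Nx), one datapoint per row
# Ky = center_gram(rbf_gram(Y, sigma=0.5))  # Y: (M, Ny)
```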

SLIDE 14

From CCA to Kernel CCA

Express the projection vectors as a linear combination of the images of the datapoints in feature space (as in kPCA):

$$w_x = F_x \alpha_x = \sum_{i=1}^{M} \alpha_x^i\, \phi_x\left(x^i\right) \qquad \text{and} \qquad w_y = F_y \alpha_y = \sum_{i=1}^{M} \alpha_y^i\, \phi_y\left(y^i\right)$$

In linear CCA, we were solving for:

$$\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1$$

Replace the covariance and cross-covariance matrices by the products of the projection matrices in feature space (as in kPCA):

$$C_{xx} = F_x F_x^T, \qquad C_{yy} = F_y F_y^T, \qquad C_{xy} = F_x F_y^T$$

In kernel CCA, we then solve for:

$$\max_{\alpha_x, \alpha_y} \alpha_x^T \underbrace{F_x^T F_x}_{K_x} \underbrace{F_y^T F_y}_{K_y} \alpha_y \quad \text{u.c.} \quad \alpha_x^T \underbrace{F_x^T F_x}_{K_x} \underbrace{F_x^T F_x}_{K_x} \alpha_x = \alpha_y^T \underbrace{F_y^T F_y}_{K_y} \underbrace{F_y^T F_y}_{K_y} \alpha_y = 1$$
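One line of algebra shows why only the kernel matrices are needed: substituting $w_x = F_x \alpha_x$ and $w_y = F_y \alpha_y$ into the linear objective gives

$$w_x^T C_{xy} w_y = \alpha_x^T F_x^T \left(F_x F_y^T\right) F_y\, \alpha_y = \alpha_x^T \left(F_x^T F_x\right)\left(F_y^T F_y\right) \alpha_y = \alpha_x^T K_x K_y\, \alpha_y,$$

and similarly $w_x^T C_{xx} w_x = \alpha_x^T K_x^2\, \alpha_x$ and $w_y^T C_{yy} w_y = \alpha_y^T K_y^2\, \alpha_y$.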

SLIDE 15

Kernel CCA

In summary, in kernel CCA we search for the projection vectors $w_x, w_y$ (which live in feature space) so as to maximize:

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T \phi_x(x),\; w_y^T \phi_y(y)\right)$$

$$\max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y\, \alpha_y \quad \text{u.c.} \quad \alpha_x^T K_x^2\, \alpha_x = \alpha_y^T K_y^2\, \alpha_y = 1$$

This is again a generalized eigenvalue problem, with $\alpha_x, \alpha_y$ the dual eigenvectors (as the dual vectors in kPCA); see the documentation in the annexes for the derivation.

Generalized eigenvalue problem:

$$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix} = \lambda \begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}$$

SLIDE 16

Kernel CCA

Generalized eigenvalue problem:

$$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix} = \lambda \begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}$$

$$\max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y\, \alpha_y \quad \text{u.c.} \quad \alpha_x^T K_x^2\, \alpha_x = \alpha_y^T K_y^2\, \alpha_y = 1$$

If the intersection between the spaces spanned by $K_x \alpha_x$ and $K_y \alpha_y$ is non-zero, then the problem has a trivial solution, as $\lambda \sim \cos\left(K_x \alpha_x, K_y \alpha_y\right) = 1$ (see the solution to the exercises). Intuitively, when $K_x$ is invertible (e.g. an RBF Gram matrix on distinct points), any $\alpha_y$ admits $\alpha_x = K_x^{-1} K_y \alpha_y$ with $K_x \alpha_x = K_y \alpha_y$, i.e. perfectly correlated projections.

SLIDE 17

Kernel CCA

Generalized eigenvalue problem:

$$\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix} = \lambda \begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}$$

Add a regularization term to increase the rank of the matrix and make it invertible (to avoid the trivial solution):

$$K_x^2 \;\rightarrow\; \left(K_x + \frac{M\epsilon}{2} I\right)^2, \qquad K_y^2 \;\rightarrow\; \left(K_y + \frac{M\epsilon}{2} I\right)^2$$

Several methods have been proposed to choose carefully the regularization term so as to get projections that are as close as possible to the “true” projections.

SLIDE 18

Kernel CCA

$$\underbrace{\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}}_{A}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix} = \lambda \underbrace{\begin{pmatrix} \left(K_x + \frac{M\epsilon}{2} I\right)^2 & 0 \\ 0 & \left(K_y + \frac{M\epsilon}{2} I\right)^2 \end{pmatrix}}_{B}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}$$

This becomes a classical eigenvalue problem: set $B = C C^T$ (e.g. by Cholesky decomposition) and $\beta = C^T \alpha$, which gives

$$C^{-1} A\, C^{-T} \beta = \lambda\, \beta$$
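A minimal numpy/scipy sketch of this reduction (an illustration, not from the slides; eps stands for the regularization constant $\epsilon$ above, and the Gram matrices are assumed centered):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular, eigh

def kcca(Kx, Ky, eps=0.1, n_components=1):
    """Regularized two-view kernel CCA. Kx, Ky: centered (M, M) Gram matrices."""
    M = Kx.shape[0]
    R = (M * eps / 2.0) * np.eye(M)
    A = np.block([[np.zeros((M, M)), Kx @ Ky],
                  [Ky @ Kx, np.zeros((M, M))]])
    B = np.block([[(Kx + R) @ (Kx + R), np.zeros((M, M))],
                  [np.zeros((M, M)), (Ky + R) @ (Ky + R)]])
    # B = C C^T (Cholesky), beta = C^T alpha  =>  C^{-1} A C^{-T} beta = lambda beta
    C = cholesky(B, lower=True)
    T = solve_triangular(C, A, lower=True)         # C^{-1} A
    S = solve_triangular(C, T.T, lower=True).T     # C^{-1} A C^{-T} (symmetric)
    vals, betas = eigh((S + S.T) / 2.0)            # classical symmetric eigenproblem
    top = betas[:, ::-1][:, :n_components]         # eigh sorts eigenvalues ascending
    alphas = solve_triangular(C.T, top, lower=False)   # alpha = C^{-T} beta
    return vals[::-1][:n_components], alphas[:M], alphas[M:]
```

Equivalently, scipy.linalg.eigh(A, B) solves the generalized problem directly; the explicit Cholesky route is shown because it mirrors the reduction on the slide.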

SLIDE 19

Kernel CCA

Can be extended to multiple datasets.

Two-dataset case: $X = \left\{x^i\right\}_{i=1}^{M} \subset \mathbb{R}^{N_x}$, $Y = \left\{y^i\right\}_{i=1}^{M} \subset \mathbb{R}^{N_y}$.

$L$ datasets $X^1, \ldots, X^L$ with $M$ observations each; dimensions $N_1, \ldots, N_L$, i.e. $X^i: N_i \times M$.

Applying non-linear transformations $\phi_1, \ldots, \phi_L$ to $X^1, \ldots, X^L$, construct the Gram matrices $K_1, \ldots, K_L$.

SLIDE 20

Kernel CCA

Can be extended to multiple datasets.

$L$ datasets $X^1, \ldots, X^L$ with $M$ observations each; dimensions $N_1, \ldots, N_L$, i.e. $X^i: N_i \times M$. The regularized generalized eigenvalue problem becomes:

$$\begin{pmatrix} 0 & K_1 K_2 & \cdots & K_1 K_L \\ K_2 K_1 & 0 & \cdots & K_2 K_L \\ \vdots & & \ddots & \vdots \\ K_L K_1 & K_L K_2 & \cdots & 0 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_L \end{pmatrix} = \lambda \begin{pmatrix} \left(K_1 + \frac{M\epsilon}{2} I\right)^2 & & 0 \\ & \ddots & \\ 0 & & \left(K_L + \frac{M\epsilon}{2} I\right)^2 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_L \end{pmatrix}$$

(For the two-dataset case, $X = \left\{x^i\right\}_{i=1}^{M}$, $Y = \left\{y^i\right\}_{i=1}^{M}$, this reduces to the problem of the previous slide.)

SLIDE 21

Kernel CCA

Can be extended to multiple datasets (MKCCA).

The two-dataset case, $X = \left\{x^i\right\}_{i=1}^{M}$, $Y = \left\{y^i\right\}_{i=1}^{M}$, had as solution the generalized eigenvalue problem:

$$\underbrace{\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}}_{A}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix} = \lambda \underbrace{\begin{pmatrix} \left(K_x + \frac{M\epsilon}{2} I\right)^2 & 0 \\ 0 & \left(K_y + \frac{M\epsilon}{2} I\right)^2 \end{pmatrix}}_{B}\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}$$

For $L$ datasets, this generalizes to:

$$\begin{pmatrix} 0 & K_1 K_2 & \cdots & K_1 K_L \\ K_2 K_1 & 0 & \cdots & K_2 K_L \\ \vdots & & \ddots & \vdots \\ K_L K_1 & K_L K_2 & \cdots & 0 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_L \end{pmatrix} = \lambda \begin{pmatrix} \left(K_1 + \frac{M\epsilon}{2} I\right)^2 & & 0 \\ & \ddots & \\ 0 & & \left(K_L + \frac{M\epsilon}{2} I\right)^2 \end{pmatrix}\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_L \end{pmatrix}$$
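A sketch of the corresponding solver for $L$ centered Gram matrices (an illustration, not from the slides; it reuses the same regularization constant eps):

```python
import numpy as np
from scipy.linalg import eigh

def mkcca(Ks, eps=0.1, n_components=1):
    """Multiple-set kernel CCA. Ks: list of L centered (M, M) Gram matrices."""
    L, M = len(Ks), Ks[0].shape[0]
    R = (M * eps / 2.0) * np.eye(M)
    A = np.zeros((L * M, L * M))
    B = np.zeros((L * M, L * M))
    for i in range(L):
        B[i*M:(i+1)*M, i*M:(i+1)*M] = (Ks[i] + R) @ (Ks[i] + R)
        for j in range(L):
            if i != j:
                A[i*M:(i+1)*M, j*M:(j+1)*M] = Ks[i] @ Ks[j]
    vals, vecs = eigh(A, B)                 # generalized symmetric eigenproblem
    idx = np.argsort(vals)[::-1][:n_components]
    # Each eigenvector stacks the L dual vectors alpha_1, ..., alpha_L
    return vals[idx], [vecs[l*M:(l+1)*M, idx] for l in range(L)]
```

With L = 2 this reduces to the two-view problem of the previous slides.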

SLIDE 22

CCA: Exercise II

Consider the following kernel matrices:

$$K_x = \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad K_y = \begin{pmatrix} 1 & 0.8 & 0 \\ 0.8 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

a) How many datapoints do you have? What are the dimensions of the two feature spaces?

b) Assume an RBF kernel with the same kernel width for $K_x$ and $K_y$. Draw the distribution of points and give the shape of the dual vectors $\alpha_x$ and $\alpha_y$, solutions of $\max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y\, \alpha_y$ with $\alpha_x^T K_x^2\, \alpha_x = \alpha_y^T K_y^2\, \alpha_y = 1$.

c) What is the effect of changing the kernel width on $K_x$ and $K_y$, and on $\alpha_x$ and $\alpha_y$?

d) Do (b) when considering a polynomial kernel, assuming the same distribution of points as for the RBF kernel. What are the Gram matrices?

SLIDE 23

CCA: Exercise III

[Figure: the same four points X1, X2, X3, X4 and Y1, Y2, Y3, Y4 as in Exercise I.]

Consider the example below of a dataset of 4 points with 2-dimensional coordinates in both X and Y.

  • Give the shape of the kernel matrices and of the dual eigenvectors, and draw the isolines, when considering an RBF kernel.
  • Do the same with a polynomial kernel.
SLIDE 24

Applications of Kernel CCA

Kernel matrices K1, K2 and K3 correspond to gene-gene similarities in pathways, genome position, and microarray expression data, respectively. An RBF kernel with fixed kernel width is used. Goal: to measure the correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes.

Y. Yamanishi, J.-P. Vert, A. Nakaya and M. Kanehisa, Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis, Bioinformatics, 2003.

Correlation scores in MKCCA: pathway vs. genome vs. expression.

SLIDE 25

Applications of Kernel CCA

Goal: To measure correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes

Y. Yamanishi, J.-P. Vert, A. Nakaya and M. Kanehisa, Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis, Bioinformatics, 2003.

Correlation scores in MKCCA: pathway vs. genome vs. expression.

[Plot annotations: the pairwise correlation between K1 and K2, and the pairwise correlation between K1 and K3.]

A readout of the entries with equal projection onto the first canonical vectors gives the genes which belong to each cluster. Two clusters correspond to genes close to each other with respect to their positions in the pathways, in the genome, and to their expression.

SLIDE 26

Applications of Kernel CCA

Goal: To construct appearance models for estimating an object’s pose from raw brightness images.

  • T. Melzer, M. Reiter and H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recognition 36 (2003), pp. 1961–1971.

Example of two image datapoints with different poses

X: set of images. Y: pose parameters (pan and tilt angle of the object w.r.t. the camera, in degrees). Method: a linear kernel on X and an RBF kernel on Y, with performance compared to applying PCA on the joint (X, Y) dataset directly.

SLIDE 27

Kernel CCA performs better than kPCA, especially for small testing/training ratios (i.e., for larger training sets). The kernel CCA estimators tend to produce fewer outliers, i.e., gross errors, and consequently yield a smaller standard deviation of the pose estimation error than their PCA-based counterparts.

Applications of Kernel CCA

Goal: To construct appearance models for estimating an object’s pose from raw brightness images

  • T. Melzer, M. Reiter and H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recognition 36 (2003), pp. 1961–1971.

For very small training sets, the performance of both approaches becomes similar.

[Plot: pose estimation error (in degrees) vs. testing/training ratio.]

SLIDE 28

Summary

  • CCA is an excellent means to discover appropriate projections when your data is multi-modal.

  • In each modality (separately), CCA finds projections that highlight features common to the datapoints as a whole.

  • It generates projections that are different from performing PCA on each modality separately.

  • The non-linear version of CCA, kernel CCA, generates sets of projections different from linear CCA and from kPCA.

  • These projections highlight sets of modalities that are common to groups of datapoints. Kernel CCA is a good pre-processing method before clustering.