ADVANCED MACHINE LEARNING
Kernel Canonical Correlation Analysis
Structure of today's and next week's class:
1) Briefly go through one extension of principal component analysis, namely Canonical Correlation Analysis (CCA).
2) Derive the non-linear version of CCA, kernel CCA (kCCA).
3) Make an exercise to understand the modulation of the space generated by CCA and kCCA.
Canonical Correlation Analysis (CCA)

[Figure: paired observations, e.g. a video description $x \in \mathbb{R}^{N_x}$ and an audio description $y \in \mathbb{R}^{N_y}$, with pairs $(x^1, y^1), (x^2, y^2), \dots$]

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T x,\; w_y^T y\right)$$

Determine features in two (or more) separate descriptions of the dataset that best explain each datapoint. Extract the hidden structure that maximizes the correlation across the two different projections.
Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T x,\; w_y^T y\right)$$

Pair of multidimensional zero-mean variables $(x, y)$; we have $M$ instances of the pairs:

$$X = \{x^i\}_{i=1}^{M},\; x^i \in \mathbb{R}^{N_x}, \qquad Y = \{y^i\}_{i=1}^{M},\; y^i \in \mathbb{R}^{N_y}$$

Search for two projections $w_x$ and $w_y$, with $z_x = w_x^T X$ and $z_y = w_y^T Y$, solutions of:

$$\max_{w_x, w_y} \operatorname{corr}\left(z_x, z_y\right)$$
Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T x,\; w_y^T y\right) = \max_{w_x, w_y} \frac{w_x^T\, E\!\left[XY^T\right] w_y}{\sqrt{w_x^T\, E\!\left[XX^T\right] w_x}\;\sqrt{w_y^T\, E\!\left[YY^T\right] w_y}} = \max_{w_x, w_y} \frac{w_x^T C_{xy}\, w_y}{\sqrt{w_x^T C_{xx}\, w_x}\;\sqrt{w_y^T C_{yy}\, w_y}}$$

with $X$ and $Y$ zero mean, i.e. $E[X] = E[Y] = 0$.

Search for two projections $w_x$ and $w_y$, with $z_x = w_x^T X$ and $z_y = w_y^T Y$, solutions of $\max_{w_x, w_y} \operatorname{corr}\left(z_x, z_y\right)$.
Canonical Correlation Analysis (CCA)

$$\max_{w_x, w_y} \frac{w_x^T C_{xy}\, w_y}{\sqrt{w_x^T C_{xx}\, w_x}\;\sqrt{w_y^T C_{yy}\, w_y}}$$

Covariance matrices: $C_{xx} = E\!\left[XX^T\right]$ of size $N_x \times N_x$, and $C_{yy} = E\!\left[YY^T\right]$ of size $N_y \times N_y$.

Cross-covariance matrix: $C_{xy}$ is of size $N_x \times N_y$; it measures the cross-correlation between $X$ and $Y$.

With $X$ and $Y$ zero mean, i.e. $E[X] = E[Y] = 0$.
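The empirical versions of these matrices are straightforward to compute. A minimal numpy sketch, assuming data stored with one observation per column as in the slides (the shapes and data are illustrative):

```python
import numpy as np

# Sketch: empirical (cross-)covariance matrices for zero-mean data.
# X is N_x x M, Y is N_y x M: one column per observation (M pairs).
rng = np.random.default_rng(0)
M = 200
X = rng.normal(size=(3, M))
Y = rng.normal(size=(2, M))

# Center each variable (rows) so that E[X] = E[Y] = 0 empirically.
X = X - X.mean(axis=1, keepdims=True)
Y = Y - Y.mean(axis=1, keepdims=True)

C_xx = X @ X.T / M   # N_x x N_x
C_yy = Y @ Y.T / M   # N_y x N_y
C_xy = X @ Y.T / M   # N_x x N_y, cross-covariance

print(C_xx.shape, C_yy.shape, C_xy.shape)
```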
Canonical Correlation Analysis (CCA)

The correlation is not affected by rescaling the norm of the vectors, so we can ask that

$$w_x^T C_{xx}\, w_x = 1, \qquad w_y^T C_{yy}\, w_y = 1.$$

The problem then becomes

$$\max_{w_x, w_y} w_x^T C_{xy}\, w_y \quad \text{u.c.} \quad w_x^T C_{xx}\, w_x = w_y^T C_{yy}\, w_y = 1.$$
Canonical Correlation Analysis (CCA)

To determine the optimum (maximum) of this constrained problem, solve by Lagrange:

$$L\left(\lambda_x, \lambda_y, w_x, w_y\right) = w_x^T C_{xy}\, w_y - \frac{\lambda_x}{2}\left(w_x^T C_{xx}\, w_x - 1\right) - \frac{\lambda_y}{2}\left(w_y^T C_{yy}\, w_y - 1\right)$$

Taking the partial derivatives over $w_x$ and $w_y$ and setting them to zero gives:

$$C_{xy}\, w_y = \lambda_x\, C_{xx}\, w_x, \qquad C_{yx}\, w_x = \lambda_y\, C_{yy}\, w_y.$$
Canonical Correlation Analysis (CCA)

One can show that $\lambda_x = \lambda_y = \lambda$. Replacing and writing the set of equations gives:

$$C_{xy}\, w_y = \lambda\, C_{xx}\, w_x, \qquad C_{yx}\, w_x = \lambda\, C_{yy}\, w_y,$$

which can be rewritten as

$$C_{xy}\, C_{yy}^{-1}\, C_{yx}\, w_x = \lambda^2\, C_{xx}\, w_x.$$

This is a generalized eigenvalue problem; it can be reduced to a classical eigenvalue problem if $C_{xx}$ is invertible.

Solving for $w_y$ gives

$$C_{yx}\, C_{xx}^{-1}\, C_{xy}\, w_y = \lambda^2\, C_{yy}\, w_y,$$

which, if $C_{yy}$ is invertible, also becomes an eigenvalue problem, as for $w_x$.

These two eigenvalue problems yield pairs of vectors $\left(w_x^i, w_y^i\right)$, $i = 1 \dots q$, where $q \le \min\left(N_x, N_y\right)$.
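The eigenvalue formulation above can be sketched numerically. A minimal numpy version on synthetic data (the data-generating choices are illustrative; it assumes $C_{xx}$ and $C_{yy}$ are invertible):

```python
import numpy as np

# Sketch: linear CCA via the eigenvalue problem
#   C_xx^{-1} C_xy C_yy^{-1} C_yx w_x = lambda^2 w_x.
rng = np.random.default_rng(1)
M = 500
latent = rng.normal(size=(1, M))              # shared structure
X = np.vstack([latent + 0.1 * rng.normal(size=(1, M)),
               rng.normal(size=(1, M))])      # 2 x M
Y = np.vstack([-latent + 0.1 * rng.normal(size=(1, M)),
               rng.normal(size=(1, M))])      # 2 x M
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)

C_xx, C_yy = X @ X.T / M, Y @ Y.T / M
C_xy = X @ Y.T / M

# A = C_xx^{-1} C_xy C_yy^{-1} C_yx; its eigenvalues are lambda^2.
A = np.linalg.solve(C_xx, C_xy) @ np.linalg.solve(C_yy, C_xy.T)
vals, vecs = np.linalg.eig(A)
order = np.argsort(-vals.real)
w_x = vecs[:, order[0]].real
rho = np.sqrt(vals.real[order[0]])            # canonical correlation
w_y = np.linalg.solve(C_yy, C_xy.T) @ w_x     # from C_yx w_x = lambda C_yy w_y
w_y /= np.sqrt(w_y @ C_yy @ w_y)              # enforce w_y^T C_yy w_y = 1

z_x, z_y = w_x @ X, w_y @ Y
print(round(rho, 3), round(abs(np.corrcoef(z_x, z_y)[0, 1]), 3))
```

The correlation of the projected data recovers the canonical correlation (up to sign, since flipping a projection vector flips the sign of the correlation).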
CCA: Exercise I

[Figure: four datapoints X1–X4 in the X space and Y1–Y4 in the Y space, each with 2-dimensional coordinates.]

Consider this example of a dataset of 4 points with 2-dimensional coordinates in both X and Y.
- Determine by hand the directions found by CCA in each space.
- Contrast them to the directions found by PCA.
Kernel Canonical Correlation Analysis

CCA finds basis vectors such that the correlation between the projections (of all datapoints in X and Y) is mutually maximized. CCA is a generalized version of PCA for two or more multi-dimensional datasets, but unlike PCA it does not constrain the projection vectors to be orthogonal. It assumes a linear correlation; if the correlation is non-linear, use kernel CCA.
Kernel Canonical Correlation Analysis (kCCA)

[Figure: paired observations, e.g. a video description $x \in \mathbb{R}^{N_x}$ and an audio description $y \in \mathbb{R}^{N_y}$, with pairs $(x^1, y^1), (x^2, y^2), \dots$]

Assume two transformations $\phi_x(x)$ and $\phi_y(y)$, and then perform correlation analysis across the two feature spaces:

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T \phi_x(x),\; w_y^T \phi_y(y)\right)$$
From CCA to Kernel CCA

$$X = \{x^i\}_{i=1}^{M},\; x^i \in \mathbb{R}^{N_x}, \qquad Y = \{y^i\}_{i=1}^{M},\; y^i \in \mathbb{R}^{N_y}$$

Send the data into two separate feature spaces, $\phi_x$ for data in X and $\phi_y$ for data in Y, with

$$\sum_{i=1}^{M} \phi_x\!\left(x^i\right) = 0 \quad \text{and} \quad \sum_{i=1}^{M} \phi_y\!\left(y^i\right) = 0.$$

Construct the associated kernel matrices

$$K_x = F_x^T F_x, \qquad K_y = F_y^T F_y,$$

where the columns of $F_x$ and $F_y$ are $\phi_x\!\left(x^i\right)$ and $\phi_y\!\left(y^i\right)$.
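A minimal sketch of building such a Gram matrix with an RBF kernel, including the double-centering that enforces the zero-mean condition in feature space (the kernel width `sigma` is an illustrative choice):

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # X: N x M, one datapoint per column. Returns the M x M Gram matrix.
    sq = np.sum(X**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    return np.exp(-d2 / (2.0 * sigma**2))

def center_gram(K):
    # Double-centering H K H with H = I - 11^T / M: equivalent to
    # subtracting the feature-space mean, as in kPCA.
    M = K.shape[0]
    H = np.eye(M) - np.ones((M, M)) / M
    return H @ K @ H

rng = np.random.default_rng(2)
X = rng.normal(size=(2, 5))
K_x = center_gram(rbf_gram(X, sigma=0.8))
print(np.allclose(K_x.sum(axis=0), 0.0))  # centered: columns sum to zero
```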
From CCA to Kernel CCA

Express the projection vectors as a linear combination of the images of the datapoints in feature space (as in kPCA):

$$w_x = F_x \alpha_x = \sum_{i=1}^{M} \alpha_x^i\, \phi_x\!\left(x^i\right) \quad \text{and} \quad w_y = F_y \alpha_y = \sum_{i=1}^{M} \alpha_y^i\, \phi_y\!\left(y^i\right)$$

In linear CCA, we were solving for:

$$\max_{w_x, w_y} w_x^T C_{xy}\, w_y \quad \text{u.c.} \quad w_x^T C_{xx}\, w_x = w_y^T C_{yy}\, w_y = 1$$

Replace the covariance and cross-covariance matrices by the products of the projection matrices in feature space (as in kPCA):

$$C_{xx} = F_x F_x^T, \qquad C_{yy} = F_y F_y^T, \qquad C_{xy} = F_x F_y^T$$

In kernel CCA, we then solve for:

$$\max_{\alpha_x, \alpha_y} \alpha_x^T F_x^T F_x F_y^T F_y\, \alpha_y \quad \text{u.c.} \quad \alpha_x^T F_x^T F_x F_x^T F_x\, \alpha_x = \alpha_y^T F_y^T F_y F_y^T F_y\, \alpha_y = 1,$$

i.e., in terms of the kernel matrices $K_x = F_x^T F_x$ and $K_y = F_y^T F_y$:

$$\max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y\, \alpha_y \quad \text{u.c.} \quad \alpha_x^T K_x^2\, \alpha_x = \alpha_y^T K_y^2\, \alpha_y = 1.$$
Kernel CCA

In summary, in kernel CCA we search for the projection vectors $w_x = F_x \alpha_x$ and $w_y = F_y \alpha_y$ (that live in feature space) so as to maximize

$$\max_{w_x, w_y} \operatorname{corr}\left(w_x^T \phi_x(x),\; w_y^T \phi_y(y)\right) = \max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y\, \alpha_y \quad \text{u.c.} \quad \alpha_x^T K_x^2\, \alpha_x = \alpha_y^T K_y^2\, \alpha_y = 1.$$

This is again a generalized eigenvalue problem in $\alpha_x, \alpha_y$, the dual eigenvectors (as the dual vectors in kPCA); see the documentation in the annexes for the derivation:

$$K_x K_y\, \alpha_y = \lambda\, K_x^2\, \alpha_x, \qquad K_y K_x\, \alpha_x = \lambda\, K_y^2\, \alpha_y.$$
Kernel CCA

$$K_x K_y\, \alpha_y = \lambda\, K_x^2\, \alpha_x, \qquad K_y K_x\, \alpha_x = \lambda\, K_y^2\, \alpha_y$$

If the intersection between the spaces spanned by $K_x \alpha_x$ and $K_y \alpha_y$ is non-zero, then the problem has a trivial solution, as $\rho \sim \cos\left(K_x \alpha_x,\, K_y \alpha_y\right) = 1$ (see the solution to the exercises).
Kernel CCA

Add a regularization term to increase the rank of the matrix and make it invertible (to avoid the trivial solution):

$$K_x^2 \;\longrightarrow\; \left(K_x + \frac{\varepsilon M}{2} I\right)^{2}, \qquad K_y^2 \;\longrightarrow\; \left(K_y + \frac{\varepsilon M}{2} I\right)^{2}$$

Several methods have been proposed to choose the regularization term $\varepsilon$ carefully, so as to get projections that are as close as possible to the "true" projections.
Kernel CCA

The regularized generalized eigenvalue problem

$$\underbrace{\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}}_{A} \begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix} = \lambda\, \underbrace{\begin{pmatrix} \left(K_x + \frac{\varepsilon M}{2} I\right)^{2} & 0 \\ 0 & \left(K_y + \frac{\varepsilon M}{2} I\right)^{2} \end{pmatrix}}_{B} \begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}$$

becomes a classical eigenvalue problem: set $B = C C^{T}$ (Cholesky decomposition) and solve the symmetric eigenvalue problem for $C^{-1} A\, C^{-T}$.
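A minimal numpy sketch of this reduction, on illustrative random positive semi-definite Gram matrices (the regularization weight `eps` follows the slides' convention):

```python
import numpy as np

# With B = C C^T (Cholesky), A alpha = lambda B alpha becomes the
# classical symmetric problem (C^{-1} A C^{-T}) beta = lambda beta,
# with alpha recovered as alpha = C^{-T} beta.
rng = np.random.default_rng(3)
M, eps = 6, 0.1
S = rng.normal(size=(M, M)); K_x = S @ S.T   # illustrative PSD Gram matrix
S = rng.normal(size=(M, M)); K_y = S @ S.T

Z = np.zeros((M, M))
A = np.block([[Z, K_x @ K_y], [K_y @ K_x, Z]])
R_x = K_x + (eps * M / 2.0) * np.eye(M)      # regularized blocks
R_y = K_y + (eps * M / 2.0) * np.eye(M)
B = np.block([[R_x @ R_x, Z], [Z, R_y @ R_y]])

C = np.linalg.cholesky(B)
Ci = np.linalg.inv(C)
vals, betas = np.linalg.eigh(Ci @ A @ Ci.T)  # symmetric -> real spectrum
alpha = np.linalg.solve(C.T, betas[:, -1])   # top generalized eigenvector
alpha_x, alpha_y = alpha[:M], alpha[M:]
print(vals[-1])                              # leading eigenvalue
```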
Kernel CCA

Kernel CCA can be extended to multiple datasets. In the two-dataset case we had $X = \{x^i\}_{i=1}^{M}$, $x^i \in \mathbb{R}^{N_x}$, and $Y = \{y^i\}_{i=1}^{M}$, $y^i \in \mathbb{R}^{N_y}$. Now take $L$ datasets $X^1, \dots, X^L$, with $M$ observations each and dimensions $N^1, \dots, N^L$, i.e. $X^l : N^l \times M$.

Applying non-linear transformations $\phi^1, \dots, \phi^L$ to $X^1, \dots, X^L$, construct the Gram matrices $K^1, \dots, K^L$.
Kernel CCA

For the $L$ datasets $X^1, \dots, X^L$ with $M$ observations each, the generalized eigenvalue problem becomes

$$\begin{pmatrix} 0 & K^1 K^2 & \cdots & K^1 K^L \\ K^2 K^1 & 0 & \cdots & K^2 K^L \\ \vdots & & \ddots & \vdots \\ K^L K^1 & K^L K^2 & \cdots & 0 \end{pmatrix} \alpha = \lambda \begin{pmatrix} \left(K^1 + \frac{\varepsilon M}{2} I\right)^{2} & & 0 \\ & \ddots & \\ 0 & & \left(K^L + \frac{\varepsilon M}{2} I\right)^{2} \end{pmatrix} \alpha$$
Kernel CCA

The extension to multiple datasets (MKCCA) generalizes the two-dataset problem $A \alpha = \lambda B \alpha$: the off-diagonal blocks of $A$ are the products $K^i K^j$, $i \neq j$, and $B$ is block-diagonal with blocks $\left(K^l + \frac{\varepsilon M}{2} I\right)^{2}$, $l = 1, \dots, L$.
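The block structure can be assembled mechanically. A minimal numpy sketch, using illustrative random positive semi-definite matrices in place of real Gram matrices:

```python
import numpy as np

# Sketch: assembling the L-dataset MKCCA generalized eigenvalue problem
# A alpha = lambda B alpha from Gram matrices K^1..K^L.
rng = np.random.default_rng(4)
M, L, eps = 5, 3, 0.1
Ks = []
for _ in range(L):
    S = rng.normal(size=(M, M))
    Ks.append(S @ S.T)                       # illustrative PSD Gram matrix

A = np.zeros((L * M, L * M))
B = np.zeros((L * M, L * M))
for i in range(L):
    for j in range(L):
        if i != j:                           # off-diagonal blocks K^i K^j
            A[i*M:(i+1)*M, j*M:(j+1)*M] = Ks[i] @ Ks[j]
    R = Ks[i] + (eps * M / 2.0) * np.eye(M)  # regularized diagonal block
    B[i*M:(i+1)*M, i*M:(i+1)*M] = R @ R

vals = np.linalg.eigvals(np.linalg.solve(B, A))
print(sorted(vals.real)[-1])                 # leading generalized eigenvalue
```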
CCA: Exercise II

Consider the following kernel matrices:

$$K_x = \begin{pmatrix} 1 & 0.5 & 0 \\ 0.5 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad K_y = \begin{pmatrix} 1 & 0.8 & 0 \\ 0.8 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

a) How many datapoints do you have? What are the dimensions of the two feature spaces?
b) Assume an RBF kernel with the same kernel width for $K_x$ and $K_y$; draw the distribution of points and give the shape of the dual vectors $\alpha_x$ and $\alpha_y$, solutions of $\max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y\, \alpha_y$ with $\alpha_x^T K_x^2\, \alpha_x = \alpha_y^T K_y^2\, \alpha_y = 1$.
c) What is the effect of changing the kernel width on $K_x$ and $K_y$, and on $\alpha_x$ and $\alpha_y$?
d) Do (b) when considering a polynomial kernel. Assume then the same distribution of points as for the RBF kernel. What are the Gram matrices?
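As a numerical companion to the exercise (a sketch to check hand calculations, not the hand solution itself), one can set up the regularized dual generalized eigenvalue problem for the given matrices and inspect the resulting dual vectors:

```python
import numpy as np

# Exercise II kernel matrices, copied from the statement.
K_x = np.array([[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]])
K_y = np.array([[1.0, 0.8, 0.0], [0.8, 1.0, 0.0], [0.0, 0.0, 1.0]])

M, eps = 3, 0.1  # small regularizer to avoid the trivial solution
Z = np.zeros((M, M))
A = np.block([[Z, K_x @ K_y], [K_y @ K_x, Z]])
R_x = K_x + (eps * M / 2.0) * np.eye(M)
R_y = K_y + (eps * M / 2.0) * np.eye(M)
B = np.block([[R_x @ R_x, Z], [Z, R_y @ R_y]])

vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
top = np.argmax(vals.real)
alpha_x, alpha_y = vecs[:M, top].real, vecs[M:, top].real
print(alpha_x, alpha_y)  # shapes of the dual vectors
```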
CCA: Exercise III

[Figure: four datapoints X1–X4 in the X space and Y1–Y4 in the Y space, each with 2-dimensional coordinates.]

Consider this example of a dataset of 4 points with 2-dimensional coordinates in both X and Y.
- What is the shape of the kernel matrices and dual eigenvectors? Draw the isolines when considering an RBF kernel.
- Do the same with a polynomial kernel.
Applications of Kernel CCA

Goal: to measure the correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes. Kernel matrices K1, K2 and K3 correspond to gene-gene similarities in pathways, genome position, and microarray expression data, respectively; an RBF kernel with fixed kernel width is used.

Y. Yamanishi, J.-P. Vert, A. Nakaya and M. Kanehisa, "Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis", Bioinformatics, 2003.

[Figure: correlation scores in MKCCA, pathway vs. genome vs. expression.]
Applications of Kernel CCA

[Figure: correlation scores in MKCCA; one plot gives the pairwise correlation between K1 and K2, the other between K1 and K3.]

A readout of the entries with equal projection onto the first canonical vectors gives the genes which belong to each cluster. Two clusters correspond to genes close to each other with respect to their positions in the pathways, in the genome, and to their expression.
Applications of Kernel CCA

Goal: to construct appearance models for estimating an object's pose from raw brightness images.

T. Melzer, M. Reiter and H. Bischof, "Appearance models based on kernel canonical correlation analysis", Pattern Recognition 36 (2003), pp. 1961–1971.

[Figure: example of two image datapoints with different poses.]

X: set of images. Y: pose parameters (pan and tilt angle of the object w.r.t. the camera, in degrees). Method: a linear kernel on X and an RBF kernel on Y, with performance compared to applying PCA on the (X, Y) dataset directly.
Applications of Kernel CCA

Kernel CCA performs better than kPCA, especially for a small testing/training ratio (i.e., for larger training sets). The kernel-CCA estimators tend to produce fewer outliers, i.e., gross errors, and consequently yield a smaller standard deviation of the pose estimation error than their PCA-based counterparts. For very small training sets, the performance of both approaches becomes similar.

[Figure: pose estimation error (in degrees) as a function of the testing/training ratio.]
Summary

- CCA is an excellent means to discover appropriate projections when your data is multi-modal.
- In each modality (separately), CCA finds projections that highlight features common to the datapoints as a whole.
- It generates projections that are different from performing PCA on each modality separately.
- The non-linear version of CCA, kernel CCA, generates sets of projections different from linear CCA and from kPCA.
- These projections highlight sets of modalities that are common