On the Eigenspectrum of the Gram Matrix and the Generalisation Error of Kernel PCA
(Shawe-Taylor et al., 2005)
Presenter: Ameet
Dimensionality reduction: mapping x → x′ turns an intractable learning problem A(x) into a tractable learning problem A(x′).
Covariance matrix: $C_X = \frac{1}{m}\sum_{i=1}^m x_i x_i^T = \frac{1}{m}XX^T$
PCA chooses a basis that:
Maximizes signal (variance)
Minimizes redundancy of signal (covariance)
Select the variance-maximizing direction in input space
Find the next variance-maximizing direction that is orthogonal to all previously selected directions
Repeat k-1 times
Goal: find P, where Y = PX, such that $C_Y$ is diagonalized
Select $P = E^T$, a matrix whose rows are the eigenvectors of $C_X$
(Inverse = transpose for an orthonormal matrix)
Then $C_Y$ is diagonalized
The principal components are the eigenvectors of $C_X$
The i-th diagonal value of $C_Y$ is the variance of X along $p_i$
$C_Y = \frac{1}{m}YY^T = \frac{1}{m}(PX)(PX)^T = P\left(\frac{1}{m}XX^T\right)P^T = PAP^T$,
where $A = \frac{1}{m}XX^T = EDE^T$
(note: the columns of E are the eigenvectors of A, and they are orthonormal: $e_i^T e_j = \delta_{ij}$)
With $P = E^T$: $C_Y = E^T(EDE^T)E = D$, which is diagonal
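A minimal numerical sketch of this diagonalization (NumPy; the function name and shapes are illustrative, not from the slides):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix.

    X: d x m data matrix (columns are zero-mean samples).
    Returns Y = P X, where the k rows of P are the top
    eigenvectors of C_X, so that C_Y is (nearly) diagonal.
    """
    m = X.shape[1]
    C_x = (X @ X.T) / m                  # C_X = (1/m) X X^T
    evals, E = np.linalg.eigh(C_x)       # A = E D E^T (eigh returns ascending order)
    order = np.argsort(evals)[::-1]      # sort eigenvalues descending
    P = E[:, order[:k]].T                # rows of P = top-k eigenvectors
    return P @ X, P, evals[order[:k]]

# usage: columns are samples; center the data first
X = np.random.randn(5, 100)
X = X - X.mean(axis=1, keepdims=True)
Y, P, var = pca(X, k=2)
C_y = (Y @ Y.T) / X.shape[1]             # approximately diag(var)
```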
Gram matrix: m × m, real, symmetric
Positive semi-definite
A "similarity matrix": $K_{ij} = x_i^T x_j$ for each pair (i, j)
Relation to Euclidean distance: $D_{ij}^2 = K_{ii} + K_{jj} - 2K_{ij}$
$K = EDE^T = (ED^{1/2})(D^{1/2}E^T) = (ED^{1/2})(ED^{1/2})^T$
So we can take $X = (ED^{1/2})^T = D^{1/2}E^T$: the coordinates are recovered from the Gram matrix (up to rotation)
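A sketch of this recovery, assuming a positive semi-definite Gram matrix K is given (NumPy; the helper name is illustrative):

```python
import numpy as np

def coords_from_gram(K, k):
    """Recover k-dimensional coordinates from an m x m Gram matrix.

    K = E D E^T (PSD), so K = (E D^{1/2})(E D^{1/2})^T and we may
    take X = D^{1/2} E^T, unique up to rotation.
    """
    evals, E = np.linalg.eigh(K)
    order = np.argsort(evals)[::-1][:k]
    d_half = np.sqrt(np.clip(evals[order], 0.0, None))  # clip tiny negatives
    return (E[:, order] * d_half).T      # k x m coordinate matrix

# sanity check: recovered coordinates reproduce the Gram matrix
X = np.random.randn(3, 50)
K = X.T @ X
X_rec = coords_from_gram(K, k=3)
assert np.allclose(X_rec.T @ X_rec, K, atol=1e-6)
```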
A small Euclidean distance can correspond to a large geodesic distance (points close in space, but far apart along the manifold)
k-nearest neighbors: find the k nearest points to a given point
ε-ball: find all points that lie within ε of a given point
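Both neighborhood rules in a short sketch (NumPy; the parameter values are arbitrary):

```python
import numpy as np

def knn(points, x, k):
    """Indices of the k nearest points to x (Euclidean); includes x itself if present."""
    d = np.linalg.norm(points - x, axis=1)
    return np.argsort(d)[:k]

def eps_ball(points, x, eps):
    """Indices of all points within distance eps of x."""
    d = np.linalg.norm(points - x, axis=1)
    return np.where(d <= eps)[0]

pts = np.random.rand(200, 2)
print(knn(pts, pts[0], k=5))
print(eps_ball(pts, pts[0], eps=0.1))
```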
Kernel function: $\kappa(x_1, x_2) = \Phi(x_1)\cdot\Phi(x_2)$ for inputs $x_1, x_2$ and feature map $\Phi$
Efficiency: κ may be much more efficient to compute than the explicit mapping and dot product in high-dimensional space
Flexibility: κ can be chosen arbitrarily, so long as it is positive definite symmetric
Kernel (Gram) matrix: $K_{ij} = \kappa(x_i, x_j)$
PCA: the eigenvectors of the covariance matrix are the principal components
Can rewrite solely with dot products
Kernel PCA: work in feature space F with covariance
$C = \frac{1}{m}\sum_{i=1}^m \Phi(x_i)\Phi(x_i)^T$, and solve $\lambda V = CV$ (*)
Every solution lies in the span of the mapped data: $V = \sum_{i=1}^m \alpha_i \Phi(x_i)$
Multiply (*) by $\Phi(y)$ for all $y \in \{x_1, \ldots, x_m\}$: $\lambda\,(\Phi(y)\cdot V) = \Phi(y)\cdot CV$
Substituting $V = \sum_i \alpha_i \Phi(x_i)$ and $K_{ij} = \Phi(x_i)\cdot\Phi(x_j)$ gives
$m\lambda K\alpha = K^2\alpha$, which reduces to $m\lambda\,\alpha = K\alpha$
Kernel Matrix
K is the m × m kernel (Gram) matrix
Use the eigendecomposition of K to find the eigenvectors
Project test points in F onto a subset of the eigenvectors (dimension reduction)
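A minimal kernel PCA sketch following these steps, using an RBF kernel for concreteness (NumPy; the kernel choice and names are illustrative; feature-space centering is included even though the slides do not discuss it):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """kappa(x, y) = exp(-gamma ||x - y||^2); X: m x d, Y: n x d."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_pca_fit(X, k, gamma=1.0):
    """Solve m*lambda*alpha = K*alpha and keep the top-k eigenvectors of K."""
    m = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one    # center data in feature space
    evals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(evals)[::-1][:k]
    lam, alpha = evals[order], vecs[:, order]     # top-k eigenpairs of K
    alpha = alpha / np.sqrt(np.clip(lam, 1e-12, None))  # so that ||V^j|| = 1
    return alpha, lam

X = np.random.randn(100, 3)
alpha, lam = kernel_pca_fit(X, k=2)
```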
Projection onto eigenvectors
For the j-th eigenvector $V^j = \sum_{i=1}^m \alpha_i^j \Phi(x_i)$, the projection of a point $x_k$ requires only kernel evaluations:
$V^j \cdot \Phi(x_k) = \sum_{i=1}^m \alpha_i^j\,(\Phi(x_i)\cdot\Phi(x_k)) = \sum_{i=1}^m \alpha_i^j\,\kappa(x_i, x_k)$
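Continuing the sketch above, a hypothetical transform step that projects test points using only kernel evaluations (it reuses rbf_kernel and alpha from the fitting sketch; centering of the test kernel is omitted for brevity):

```python
# reuses rbf_kernel(...) and alpha from the fitting sketch above
def kernel_pca_transform(X_train, X_test, alpha, gamma=1.0):
    """Project test points: V^j . Phi(x) = sum_i alpha_i^j kappa(x_i, x).

    Returns an n x k matrix of projections onto V^1, ..., V^k.
    """
    K_test = rbf_kernel(X_test, X_train, gamma)   # n x m kernel evaluations
    return K_test @ alpha

X_new = np.random.randn(10, 3)
Z = kernel_pca_transform(X, X_new, alpha)          # 10 x 2 embedded test points
```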
Projections
Given the sample S, let $\hat V_k$ = span of the first k empirical eigenvectors, and $V_k$ = span of the first k process eigenvectors
For a subspace V, $P_V(\Phi(x))$ is the projection of $\Phi(x)$ onto V, and $P_V^\perp(\Phi(x)) = \Phi(x) - P_V(\Phi(x))$ is the residual
$\|P_{V_k}(\Phi(x))\|^2 = \sum_{1 \le i \le k}(\Phi(x)\cdot v_i)^2$ and $\|P_{V_k}^\perp(\Phi(x))\|^2 = \sum_{i > k}(\Phi(x)\cdot v_i)^2$
$\|P_V^\perp(\Phi(x))\|$ is the distance between the original point and its projection
Eigenvalues as expected squared projections
The q-th process eigenvalue equals the expected squared projection onto the q-th eigenvector: $\lambda_q = E[(\Phi(x)\cdot v_q)^2]$
Equivalently, $\lambda_q = \max_{v \in F,\ \|v\|=1,\ v \perp v_1,\ldots,v_{q-1}} E[(\Phi(x)\cdot v)^2]$
Summing over the top k: $E[\|P_{V_k}(\Phi(x))\|^2] = \sum_{i=1}^k \lambda_i$
Empirical analogue: $\hat E[\|P_{\hat V_k}(\Phi(x))\|^2] = \frac{1}{m}\sum_{i=1}^k \hat\lambda_i$, where $\hat\lambda_i$ are the eigenvalues of K
$\hat V_k$ is the best solution for the empirical data S
$V_k$ is the best solution for the underlying process
Goal of paper: show that the chain of inequalities below is accurate, and bound the difference between the first and last terms
$\hat E[\|P_{\hat V_k}(\Phi(x))\|^2] \;\ge\; \hat E[\|P_{V_k}(\Phi(x))\|^2] \;\approx\; E[\|P_{V_k}(\Phi(x))\|^2] \;\ge\; E[\|P_{\hat V_k}(\Phi(x))\|^2]$
The first term is computable from the data: it is $\frac{1}{m}$ times the sum of the largest k eigenvalues of K
Theorem (lower bound on captured variance): with probability at least $1-\delta$ over a random m-sample S,
$E\left[\|P_{\hat V_k}(\Phi(x))\|^2\right] \;\ge\; \max_{1\le l\le k}\left[\frac{1}{m}\sum_{i=1}^{l}\hat\lambda_i \;-\;\left(1+\sqrt{\tfrac{k}{l}}\right)\sqrt{\frac{2}{m}\,\hat E[\kappa(x,x)^2]}\right] \;-\; R^2\sqrt{\frac{18}{m}\ln\frac{2m}{\delta}}$
Includes dependencies on the confidence parameter δ and the distribution radius R
Tradeoff between terms within the max: as l increases, the captured variance increases, but so does the ratio l/m
For "well-behaved" kernels (those for which the dot product is bounded), the square-root term should be a constant
A matching result bounds the expected squared residual: with probability at least $1-\delta$,
$E\left[\|P^{\perp}_{\hat V_k}(\Phi(x))\|^2\right] \;\le\; \min_{1\le l\le k}\left[\frac{1}{m}\sum_{i>l}\hat\lambda_i \;+\;\left(1+\sqrt{\tfrac{k}{l}}\right)\sqrt{\frac{2}{m}\,\hat E[\kappa(x,x)^2]}\right] \;+\; R^2\sqrt{\frac{19}{m}\ln\frac{2m}{\delta}}$
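To make the quantities concrete, a sketch that evaluates the empirical pieces of the lower bound from a kernel matrix, taking the theorem as reproduced above at face value (NumPy; R and delta must be supplied by the user):

```python
import numpy as np

def projection_bound(K, k, delta, R):
    """Lower bound on E[||P_{V_k-hat} Phi(x)||^2], per the theorem above.

    K: m x m kernel matrix, k: projection dimension,
    delta: confidence parameter, R: support radius (sup_x ||Phi(x)||).
    """
    m = K.shape[0]
    lam = np.sort(np.linalg.eigvalsh(K))[::-1]    # eigenvalues of K, descending
    kxx2 = np.mean(np.diag(K) ** 2)               # empirical E[kappa(x,x)^2]
    conf = R**2 * np.sqrt(18.0 / m * np.log(2 * m / delta))
    best = -np.inf
    for l in range(1, k + 1):                     # trade off over 1 <= l <= k
        captured = lam[:l].sum() / m              # (1/m) sum_{i<=l} lambda_i-hat
        penalty = (1 + np.sqrt(k / l)) * np.sqrt(2.0 / m * kxx2)
        best = max(best, captured - penalty)
    return best - conf
```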