On the Eigenspectrum of the Gram Matrix and the Generalisation Error of Kernel PCA (Shawe-Taylor, et al. 2005), Ameet Talwalkar


SLIDE 1

On the Eigenspectrum of the Gram Matrix and the Generalisation Error of Kernel PCA
(Shawe-Taylor, et al. 2005)

Ameet Talwalkar
02/13/07

SLIDE 2

Outline

  • Background
  • Motivation
  • PCA, MDS (Isomap)
  • Kernel PCA
  • Generalisation Error of Kernel PCA

SLIDE 3

Dimensional Reduction: Motivation

Lossy
  • Computational efficiency
  • Visualization of data requires 2D or 3D representations
  • Curse of Dimensionality: learning algorithms require “reasonably” good sampling

Lossless – “Manifold Learning”
  • Assumes existence of an “intrinsic dimension,” or a reduced representation containing all independent variables

[Diagram: intractable learning problem A(x) on x  ->  dimensional reduction x -> x'  ->  tractable learning problem A(x')]

SLIDE 4

Linear Dimensional Reduction

  • Assumes input data is a linear function of the independent variables
  • Common Methods:
    • Principal Component Analysis (PCA)
    • Multidimensional Scaling (MDS)

SLIDE 5

PCA – Big Picture

  • Linearly transform input data in a way that:
    • Maximizes signal (variance)
    • Minimizes redundancy of signal (covariance)

SLIDE 6

PCA – Simple Example

  • Original data points
    • E.g. shoe size measured in ft, cm
  • y = x provides a good approx of the data

SLIDE 7

PCA – Simple Example (cont)

  • Original data restored using only the first principal component

SLIDE 8

PCA – Covariance

\mathrm{cov}(x, y) = E[(x - \bar{x})(y - \bar{y})]

  • Covariance is a measure of how much two variables vary together
  • If x and y are independent, then cov(x, y) = 0
  • cov(x, x) = var(x)

SLIDE 9

PCA – Covariance Matrix

C_X = E[(X - E[X])(X - E[X])^T] = \frac{1}{m} X X^T = \frac{1}{m} \sum_{i=1}^{m} x_i x_i^T

  • Start with m column vector observations of n variables
  • Stores pairwise covariance of variables
  • Diagonals are variances
  • Symmetric, positive semi-definite
  • Covariance is an n x n matrix
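A minimal numpy sketch of this computation (an illustrative assumption used throughout these sketches: the columns of X are the m observations and the data have already been centered):

```python
import numpy as np

# Hypothetical data: m = 200 observations of n = 3 variables, stored as columns of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200))

# Center each variable (row) so that (1/m) X X^T matches the slide's formula for C_X.
X = X - X.mean(axis=1, keepdims=True)

m = X.shape[1]
C_X = (X @ X.T) / m          # n x n covariance matrix

# Sanity checks: symmetric, diagonals are the variances.
assert np.allclose(C_X, C_X.T)
assert np.allclose(np.diag(C_X), X.var(axis=1))
```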

SLIDE 10

Eigendecomposition

  • Eigenvectors (v) and eigenvalues (λ) for an n x n matrix A are pairs (v, λ) such that:

    A v = \lambda v

  • If A is a real symmetric matrix, it can be diagonalized into A = E D E^T
    • E = A's orthonormal eigenvectors
    • D = diagonal matrix of A's eigenvalues
  • A is positive semi-definite => eigenvalues are non-negative
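As a sketch, numpy's symmetric eigensolver returns exactly this E and D (assuming a real symmetric A, such as a covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = B @ B.T                      # real, symmetric, positive semi-definite

eigvals, E = np.linalg.eigh(A)   # columns of E are orthonormal eigenvectors
D = np.diag(eigvals)

assert np.allclose(A, E @ D @ E.T)    # A = E D E^T
assert np.allclose(E.T @ E, np.eye(4))
assert np.all(eigvals >= -1e-10)      # PSD => non-negative eigenvalues
```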

SLIDE 11

PCA – Goal (x3)

  • Linearly transform input data in a way that:
    • Maximizes signal (variance)
    • Minimizes redundancy of signal (covariance)
  • Algorithm:
    • Select the variance-maximizing direction in input space
    • Find the next variance-maximizing direction that is orthogonal to all previously selected directions
    • Repeat k-1 times
  • Find a transformation P such that Y = PX and C_Y is diagonalized
  • Solution: project data onto the eigenvectors of C_X

SLIDE 12

PCA – Algorithm

  • Goal: find P where Y = PX s.t. C_Y is diagonalized
  • Select P = E^T, i.e. a matrix where each row is an eigenvector of C_X
    • Inverse = transpose for an orthonormal matrix
  • C_Y is diagonalized:

    C_Y = \frac{1}{m} Y Y^T = \frac{1}{m} (PX)(PX)^T = P \left( \frac{1}{m} X X^T \right) P^T = P A P^T,
    \quad \text{where } A = \frac{1}{m} X X^T = E D E^T

    C_Y = P A P^T = P (P^T D P) P^T = D
    (note: the eigenvectors in E are orthonormal, so P P^T = I)

  • PCs are the eigenvectors of C_X
  • The i-th diagonal value of C_Y is the variance of X along p_i
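A minimal sketch of this algorithm in numpy (hypothetical centered data, variables as rows and observations as columns, as above):

```python
import numpy as np

def pca(X, k):
    """Project the columns of X (n variables x m observations, assumed centered)
    onto the top-k principal components."""
    m = X.shape[1]
    C_X = (X @ X.T) / m                      # covariance matrix
    eigvals, E = np.linalg.eigh(C_X)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort descending
    E = E[:, order]
    P = E[:, :k].T                           # rows are the top-k eigenvectors of C_X
    return P @ X, eigvals[order]             # Y = PX, plus all variances

# Usage sketch on random data.
rng = np.random.default_rng(2)
X = rng.normal(size=(5, 300))
X = X - X.mean(axis=1, keepdims=True)
Y, variances = pca(X, k=2)

# C_Y is diagonal, with the top eigenvalues of C_X on its diagonal.
C_Y = (Y @ Y.T) / X.shape[1]
assert np.allclose(C_Y, np.diag(variances[:2]), atol=1e-8)
```
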
SLIDE 13

Gram Matrix (Kernel Matrix)

K = X^T X, \qquad K_{ij} = x_i \cdot x_j

  • Given X, a collection of m column vector observations of n variables
  • Gram Matrix of X: the matrix of dot products of the inputs
  • m x m, real, symmetric
  • Positive semi-definite
  • A “similarity matrix”
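A small numpy sketch (same hypothetical layout as before: variables as rows, observations as columns):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 50))     # n = 5 variables, m = 50 observations

K = X.T @ X                      # m x m Gram matrix, K_ij = x_i . x_j

assert K.shape == (50, 50)
assert np.allclose(K, K.T)                        # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-8)     # positive semi-definite
```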

SLIDE 14

Classical Multidimensional Scaling

  • Given m objects and a dissimilarity δ_ij for each pair, find a space in which δ_ij ≈ Euclidean distance
  • If the δ_ij are Euclidean distances (so δ_ij² are squared distances):
    • Can convert the dissimilarity matrix to a Gram matrix (or we can just start with the Gram matrix)
    • MDS yields the same answer as PCA

SLIDE 15

Classical Multidimensional Scaling

  • Convert the dissimilarity matrix to a Gram matrix (K)
  • Eigendecomposition of K:

    K = E D E^T = E D^{1/2} D^{1/2} E^T = (E D^{1/2})(E D^{1/2})^T

  • K = X^T X, so X = (E D^{1/2})^T
  • Reduce dimension:
    • Construct X from a subset of eigenvectors/eigenvalues
  • Identical to PCA
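A minimal numpy sketch of these steps, assuming we start from a matrix of squared Euclidean distances and use the standard double-centering to obtain the Gram matrix (the centering step is implied rather than spelled out on the slide):

```python
import numpy as np

def classical_mds(D2, d):
    """Embed m objects in d dimensions from an m x m matrix of squared distances."""
    m = D2.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * J @ D2 @ J                      # double-centered Gram matrix
    eigvals, E = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1][:d]      # keep the top-d eigenpairs
    # X = (E D^{1/2})^T restricted to the top-d eigenvectors/eigenvalues
    return (E[:, order] * np.sqrt(np.maximum(eigvals[order], 0))).T

# Usage sketch: recover a 2-D configuration from pairwise distances.
rng = np.random.default_rng(4)
pts = rng.normal(size=(2, 30))                               # hypothetical 2-D points
D2 = ((pts[:, :, None] - pts[:, None, :]) ** 2).sum(axis=0)  # squared distances
X_hat = classical_mds(D2, d=2)    # same configuration up to rotation/reflection
```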

SLIDE 16

Limitations of Linear Methods

  • Cannot account for non-linear relationship of data in input space
  • Data may still have linear relationship in some feature space
  • Isomap: use geodesic distance to recover manifold
    • Length of shortest curve on a manifold connecting two points on the manifold

[Figure: two points with a small Euclidean distance but a large geodesic distance]

SLIDE 17

Local Estimation of Manifolds

  • Small patches on a non-linear manifold look linear
  • Locally linear neighborhoods defined in two ways:
    • k-nearest neighbors: find the k nearest points to a given point
    • ε-ball: find all points that lie within ε of a given point
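A small numpy sketch of both neighborhood definitions (points, k, and eps are illustrative assumptions):

```python
import numpy as np

def knn_neighbors(points, i, k):
    """Indices of the k nearest points to points[i] (points is m x n)."""
    d = np.linalg.norm(points - points[i], axis=1)
    d[i] = np.inf                      # exclude the point itself
    return np.argsort(d)[:k]

def eps_ball_neighbors(points, i, eps):
    """Indices of all points that lie within eps of points[i]."""
    d = np.linalg.norm(points - points[i], axis=1)
    return np.flatnonzero((d <= eps) & (np.arange(len(points)) != i))

rng = np.random.default_rng(5)
points = rng.normal(size=(100, 3))
print(knn_neighbors(points, 0, k=5))
print(eps_ball_neighbors(points, 0, eps=1.0))
```
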
SLIDE 18

Isomap idea

  • Create weighted graph
    • vertices = datapoints
    • edges between “neighbors,” weighted by Euclidean distance
  • Distance matrix = pairwise shortest paths
  • Construct d-dimensional embedding
    • Perform MDS and “eyeball” residual variance
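A rough, self-contained sketch of this pipeline using scipy's shortest-path routine on a k-NN graph, followed by classical MDS as above (k and d are illustrative assumptions, not values from the slides):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(points, k, d):
    """Isomap-style embedding: k-NN graph -> geodesic distances -> classical MDS."""
    m = points.shape[0]
    euclid = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Weighted k-NN graph: keep only edges to each point's k nearest neighbors.
    graph = np.full((m, m), np.inf)
    for i in range(m):
        nbrs = np.argsort(euclid[i])[1:k + 1]
        graph[i, nbrs] = euclid[i, nbrs]
    # Pairwise shortest-path ("graph geodesic") distance matrix.
    geo = shortest_path(graph, method='D', directed=False)
    # Classical MDS on the squared geodesic distances (double-centering as before).
    J = np.eye(m) - np.ones((m, m)) / m
    K = -0.5 * J @ (geo ** 2) @ J
    eigvals, E = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1][:d]
    return (E[:, order] * np.sqrt(np.maximum(eigvals[order], 0))).T

# Usage sketch: points sampled near a circle have intrinsic dimension ~1.
rng = np.random.default_rng(6)
theta = rng.uniform(0, 2 * np.pi, size=200)
points = np.stack([np.cos(theta), np.sin(theta)], axis=1) + 0.01 * rng.normal(size=(200, 2))
embedding = isomap(points, k=8, d=1)
```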

SLIDE 19

“Eyeballing” Intrinsic Dimension

SLIDE 20

Isomap – Convergence

  • Guaranteed to asymptotically recover convex Euclidean manifolds
  • For a sufficiently high density of data points, given arbitrarily small values λ_1, λ_2 and µ, then with probability at least 1 - µ:

    1 - \lambda_1 \;\le\; \frac{\text{graph distance}}{\text{geodesic distance}} \;\le\; 1 + \lambda_2

  • Rate of convergence dependent on density of points and properties of underlying manifold (radius of curvature, branch separation)

SLIDE 21

Kernel Functions

  • Kernel function: similarity measure between two vectors
  • Define non-linear mapping from input space to high-dimensional feature space:

    \Phi : X \rightarrow F

  • Define k such that:

    \Phi(x) \cdot \Phi(y) = k(x, y)

  • Efficiency: k may be much more efficient to compute than the mapping and dot product in the high-dimensional space
  • Flexibility: k can be chosen arbitrarily so long as it is “positive definite symmetric”
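As an illustrative sketch (the slides do not commit to any particular kernel), two common choices that compute a feature-space dot product without forming Φ explicitly; sigma and degree are assumed hyperparameters:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    Equals Phi(x) . Phi(y) for an infinite-dimensional feature map Phi."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(x, y, degree=2):
    """Polynomial kernel: k(x, y) = (x . y + 1)^degree, a finite-dimensional feature map."""
    return (np.dot(x, y) + 1.0) ** degree

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(rbf_kernel(x, y), poly_kernel(x, y))
```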

SLIDE 22

Positive Definite Symmetric (PDS) Kernels

K_{ij} = k(x_i, x_j)

  • Given m column vector observations of n variables
  • Kernel Matrix: m x m matrix in which K_ij = k(x_i, x_j)
  • Kernel (k) is PDS if K is symmetric and positive semi-definite
  • If K is positive semi-definite then k is the dot product in some dot product space (feature space)
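A small sketch that builds K from the rbf_kernel above and checks the two conditions empirically (this checks them for one sample; it is not a proof that the kernel is PDS):

```python
import numpy as np

def kernel_matrix(X, k):
    """m x m matrix with K[i, j] = k(x_i, x_j); here the rows of X are the observations."""
    m = X.shape[0]
    return np.array([[k(X[i], X[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 3))          # m = 20 observations of n = 3 variables

K = kernel_matrix(X, lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0))  # RBF, sigma = 1

assert np.allclose(K, K.T)                        # symmetric
assert np.all(np.linalg.eigvalsh(K) >= -1e-8)     # positive semi-definite
```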

SLIDE 23

Kernel Trick

  • For any algorithm relying solely on dot-products, we can replace the dot-product with a positive-definite kernel
  • Allows for non-linearity
  • Example: PCA

SLIDE 24

Kernel PCA

  • PCA: eigenvectors of the covariance matrix are the principal components
  • Can rewrite solely with dot-products
  • Kernel PCA:

    C^{\Phi} V^j = \frac{1}{m} \sum_{i=1}^{m} \Phi(x_i) \Phi(x_i)^T V^j = \lambda_j V^j \quad (*)
    \;\;\Rightarrow\;\; V^j = \frac{1}{m \lambda_j} \sum_{i=1}^{m} \left( \Phi(x_i) \cdot V^j \right) \Phi(x_i)

    \forall\, y \in [x_1 \ldots x_m], \text{ multiply } (*) \text{ by } \Phi(y)^T:
    \quad \lambda_j \left( \Phi(y) \cdot V^j \right) = \frac{1}{m}\, \Phi(y)^T \sum_{i=1}^{m} \Phi(x_i) \Phi(x_i)^T V^j

SLIDE 25

Kernel PCA

  • Stacking that equation for every y = x_1, ..., x_m:

    \lambda_j \begin{bmatrix} \Phi(x_1) \cdot V^j \\ \vdots \\ \Phi(x_m) \cdot V^j \end{bmatrix}
    = \frac{1}{m} \begin{bmatrix} \Phi(x_1)^T \\ \vdots \\ \Phi(x_m)^T \end{bmatrix}
      \sum_{i=1}^{m} \Phi(x_i) \Phi(x_i)^T V^j

  • The dot products Φ(x_a)·Φ(x_i) that appear are exactly the entries of the Kernel Matrix, so the eigenproblem can be written entirely in terms of K:

    K \cdot V^j = \lambda_j V^j

SLIDE 26

Kernel PCA

  • K is the m x m kernel (Gram) matrix
  • Use eigendecomposition on K to find eigenvectors
  • Project test points in F onto a subset of eigenvectors (dimension reduction)
  • PCA: eigenvectors of the covariance matrix are the principal components
  • Can rewrite solely with dot-products
  • Kernel PCA (same derivation as the previous two slides):

    C^{\Phi} V^j = \frac{1}{m} \sum_{i=1}^{m} \Phi(x_i) \Phi(x_i)^T V^j = \lambda_j V^j
    \quad\Rightarrow\quad K \cdot V^j = \lambda_j V^j

  • Projection of a test point Φ(x_k) onto the j-th eigenvector:

    \Phi(x_k)^T V^j
    = \frac{1}{\lambda_j m} \sum_{i=1}^{m} \left( \Phi(x_i) \cdot V^j \right) \Phi(x_i)^T \Phi(x_k)
    = \frac{1}{\lambda_j} \sum_{i=1}^{m} V^j_i\, \kappa(x_k, x_i)

SLIDE 27

Theory behind dimensional reduction?

  • Dimensional reduction has gained “popularity” since Isomap and LLE were published
  • But there is not much theory behind it (Isomap is an exception)
  • Assuming the existence of an underlying manifold:
    • Do the various dim red algorithms converge to the correct manifold?
    • What is the rate of convergence, i.e., given an input X of m points, how close is dim_red(X) to the underlying manifold?

SLIDE 28

Why focus on KPCA?

  • Generalization of dimensional reduction
    • LLE and Isomap are forms of KPCA
  • Residual Variance is an intuitive measurement of accuracy
    • The limit is clear (and provable): given an underlying manifold with dimension k, as m approaches infinity, residual variance approaches 0
    • The paper also uses RV to measure dim red accuracy in the finite case

SLIDE 29

What we're interested in

CV = \lambda^{\le k} = \sum_{i=1}^{k} \lambda_i,
\qquad
\lambda^{> k} = \sum_{i=k+1}^{n} \lambda_i

  • Residual Variance = 1 – Captured Variance
  • This paper provides bounds for the sums of these process eigenvalues as a function of the empirical eigenvalues
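A tiny bookkeeping sketch of this split, assuming we already have an eigenvalue vector (here taken from an arbitrary PSD matrix):

```python
import numpy as np

def captured_and_residual(eigvals, k):
    """Split a descending eigenvalue vector into captured and residual fractions."""
    eigvals = np.sort(eigvals)[::-1]
    captured = eigvals[:k].sum()            # lambda^{<=k}
    residual = eigvals[k:].sum()            # lambda^{>k}
    total = eigvals.sum()
    # "Residual Variance = 1 - Captured Variance" once normalized by the total.
    return captured / total, residual / total

rng = np.random.default_rng(9)
B = rng.normal(size=(6, 6))
eigvals = np.linalg.eigvalsh(B @ B.T)
print(captured_and_residual(eigvals, k=2))
```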

SLIDE 30

Empirical eigenvalues

  • Perform PCA on a sample, S, of m points
  • Empirical eigenproblem:

    C_S V^j = \frac{1}{m} \sum_{i=1}^{m} \Phi(x_i) \Phi(x_i)^T V^j = \mu_j V^j \quad (*)

    \forall\, y \in [x_1 \ldots x_m], \text{ multiply } (*) \text{ by } \Phi(y)^T:
    \quad \frac{1}{m} \sum_{i=1}^{m} \kappa(y, x_i)\, \Phi(x_i)^T V^j = \mu_j\, \Phi(y)^T V^j,
    \qquad
    \sum_{i=1}^{m} \kappa(y, x_i)\, \Phi(x_i)^T V^j = \hat{\lambda}_j\, \Phi(y)^T V^j

  • Note: the µ_j are the eigenvalues of C_S, and \hat{\lambda}_j = m \mu_j
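A quick numerical check of the λ̂_j = m·µ_j relationship for the linear kernel, where both C_S and K can be formed explicitly (a sketch, assuming centered data as before):

```python
import numpy as np

rng = np.random.default_rng(10)
m, n = 40, 5
X = rng.normal(size=(n, m))                 # n variables x m observations
X = X - X.mean(axis=1, keepdims=True)

C_S = (X @ X.T) / m                         # n x n sample covariance
K = X.T @ X                                 # m x m Gram matrix (linear kernel)

mu = np.sort(np.linalg.eigvalsh(C_S))[::-1]        # mu_j: eigenvalues of C_S
lam_hat = np.sort(np.linalg.eigvalsh(K))[::-1]     # lam_hat_j: eigenvalues of K

# The non-zero eigenvalues agree up to the factor m: lam_hat_j = m * mu_j.
assert np.allclose(lam_hat[:n], m * mu, atol=1e-8)
```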

SLIDE 31

Process eigenvalues

  • Empirical eigenproblem:

    \forall\, y \in [x_1 \ldots x_m]:
    \quad \frac{1}{m} \sum_{i=1}^{m} \kappa(y, x_i)\, \Phi(x_i)^T V^j = \mu_j\, \Phi(y)^T V^j

  • As m approaches infinity, this becomes, for a given kernel function and density p(x) on a space X:

    \int_{\mathcal{X}} \kappa(x, y)\, p(x)\, \Phi(x)^T V^j \, dx = \lambda_j\, \Phi(y)^T V^j

  • µ_j is an estimate for λ_j (the process eigenvalue)

SLIDE 32

Projections onto Subspaces

  • P_V(Φ(x)): projection of Φ(x) onto the subspace V
  • P_V^⊥(Φ(x)): projection onto the orthogonal complement of V
    • the residual of the projection onto V
    • its norm is the distance between the original point and its projection

SLIDE 33

Eigenvalues and Projections

\lambda_1(K_q) = \max_{v \in F} \mathbb{E}_q\!\left[ \| P_v(\Phi(x)) \|^2 \right]
= \mathbb{E}_q\!\left[ \| \Phi(x) \|^2 \right] - \min_{v \in F} \mathbb{E}_q\!\left[ \| P_v^{\perp}(\Phi(x)) \|^2 \right]

\sum_{i=1}^{k} \lambda_i(K_q) = \mathbb{E}_q\!\left[ \| P_{V_k}(\Phi(x)) \|^2 \right]

  • The equations are maximized when v = the 1st eigenvector of K_q
  • The 1st eigenvalue of the operator K_q equals the expected squared norm of the projection onto the 1st eigenvector of K_q
  • Intuition: the first eigenvector is the direction for which the expected square of the residual is minimal
  • q defines the distribution of K (the general formula is applicable to both the empirical and the process cases)

SLIDE 34

Empirical/Process Expectations of Empirical/Process Subspaces

\mathbb{E}\!\left[ \| P_{V_k}(\Phi(x)) \|^2 \right] = \sum_{i=1}^{k} \lambda_i
\qquad
\hat{\mathbb{E}}\!\left[ \| P_{\hat{V}_k}(\Phi(x)) \|^2 \right] = \sum_{i=1}^{k} \mu_i
\qquad
\mathbb{E}\!\left[ \| P_{\hat{V}_k}(\Phi(x)) \|^2 \right]
\qquad
\hat{\mathbb{E}}\!\left[ \| P_{V_k}(\Phi(x)) \|^2 \right]

  • The first two equations follow from the last slide
  • \mathbb{E}[\| P_{\hat{V}_k}(\Phi(x)) \|^2]: average residual over the entire distribution of the projection onto the first k empirical eigenvectors (agreed?)
  • \hat{\mathbb{E}}[\| P_{V_k}(\Phi(x)) \|^2]: empirical average of the squared norm for the m points in S projected onto the first k process eigenvectors

SLIDE 35

Two simple inequalities

\mathbb{E}\!\left[ \| P_{V_k}(\Phi(x)) \|^2 \right] = \sum_{i=1}^{k} \lambda_i
\;\ge\; \mathbb{E}\!\left[ \| P_{\hat{V}_k}(\Phi(x)) \|^2 \right]
\qquad\qquad
\hat{\mathbb{E}}\!\left[ \| P_{\hat{V}_k}(\Phi(x)) \|^2 \right] = \sum_{i=1}^{k} \mu_i
\;\ge\; \hat{\mathbb{E}}\!\left[ \| P_{V_k}(\Phi(x)) \|^2 \right]

  • \hat{V}_k is the best solution for the empirical data S
  • V_k is the best solution for the underlying process
  • Goal of the paper: show that the chain of inequalities below is accurate and bound the difference between the first and last terms

\hat{\mathbb{E}}\!\left[ \| P_{\hat{V}_k}(\Phi(x)) \|^2 \right]
\;\ge\; \hat{\mathbb{E}}\!\left[ \| P_{V_k}(\Phi(x)) \|^2 \right]
\;\approx\; \mathbb{E}\!\left[ \| P_{V_k}(\Phi(x)) \|^2 \right]
\;\ge\; \mathbb{E}\!\left[ \| P_{\hat{V}_k}(\Phi(x)) \|^2 \right]

SLIDE 36

What we're interested in

CV = \lambda^{\le k} = \sum_{i=1}^{k} \lambda_i = \mathbb{E}\!\left[ \| P_{V_k}(\Phi(x)) \|^2 \right],
\qquad
\lambda^{> k} = \sum_{i=k+1}^{n} \lambda_i
= \mathbb{E}\!\left[ \| P_{V_k}^{\perp}(\Phi(x)) \|^2 \right]
= \mathbb{E}\!\left[ \| \Phi(x) \|^2 \right] - \mathbb{E}\!\left[ \| P_{V_k}(\Phi(x)) \|^2 \right]

  • Residual Variance = 1 – Captured Variance
  • This paper provides bounds for the sums of these process eigenvalues as a function of the empirical eigenvalues

SLIDE 37

And now…a first Bound

  • If we perform PCA in the feature space defined by κ(x,y), then with probability greater than 1-δ over random m-samples S, if new data is projected onto V̂_k, the sum of the largest k process eigenvalues (the captured variance) is bounded by:

    \sum_{i=1}^{k} \lambda_i
    = \mathbb{E}\!\left[ \| P_{V_k}(\Phi(x)) \|^2 \right]
    \ge \mathbb{E}\!\left[ \| P_{\hat{V}_k}(\Phi(x)) \|^2 \right]
    \ge \max_{1 \le l \le k} \left[ \mu^{\le l}(S) - \frac{1 + \sqrt{l}}{\sqrt{m}} \sqrt{ \frac{2}{m} \sum_{i=1}^{m} \kappa(x_i, x_i)^2 } \right]
    - R^2 \sqrt{ \frac{19}{m} \ln \frac{2(m+1)}{\delta} }

    where µ^{≤l}(S) is the sum of the first l empirical eigenvalues of C_S, and the support of the distribution is in a ball of radius R in feature space

SLIDE 38

And now…a first Bound

  • First term (the max over l):

    \max_{1 \le l \le k} \left[ \mu^{\le l}(S) - \frac{1 + \sqrt{l}}{\sqrt{m}} \sqrt{ \frac{2}{m} \sum_{i=1}^{m} \kappa(x_i, x_i)^2 } \right]

    • Tradeoff between the terms inside the max: as l increases, the captured variance increases, but so does the ratio l/m
    • For “well-behaved” kernels (those for which the dot product is bounded), the square-root term should be a constant

  • Second term:

    R^2 \sqrt{ \frac{19}{m} \ln \frac{2(m+1)}{\delta} }

    • Includes the dependencies on the confidence parameter δ and the distribution radius R

SLIDE 39

The second bound

  • If we perform PCA in the feature space defined by κ(x,y), then with probability greater than 1-δ over random m-samples S, if new data is projected onto V̂_k, the expected squared residual is bounded by:

    \sum_{i > k} \lambda_i
    = \mathbb{E}\!\left[ \| P_{V_k}^{\perp}(\Phi(x)) \|^2 \right]
    \le \mathbb{E}\!\left[ \| P_{\hat{V}_k}^{\perp}(\Phi(x)) \|^2 \right]
    \le \min_{1 \le l \le k} \left[ \mu^{> l}(S) + \frac{1 + \sqrt{l}}{\sqrt{m}} \sqrt{ \frac{2}{m} \sum_{i=1}^{m} \kappa(x_i, x_i)^2 } \right]
    + R^2 \sqrt{ \frac{18}{m} \ln \frac{2m}{\delta} }

    where µ^{>l}(S) is the sum of the remaining empirical eigenvalues of C_S, and the support of the distribution is in a ball of radius R in feature space

SLIDE 40

Next steps

  • How tight are these bounds? Can we do better?
  • Can we use these bounds to compare existing dimensional reduction algorithms?
  • Can we construct a kernel that maximizes the tightness of this bound?