

slide-1
SLIDE 1

Least-Squares

Statistical Geometry Processing

Winter Semester 2011/2012

slide-2
SLIDE 2

Least-Squares Fitting

slide-3
SLIDE 3

3

Approximation

Common Situation:

  • We have many data points, they might be noisy
  • Example: Scanned data
  • Want to approximate the data with a smooth curve / surface

What we need:

  • Criterion – what is a good approximation?
  • Methods to compute this approximation
slide-4
SLIDE 4

4

Approximation Techniques

Agenda:

  • Least-squares approximation (and why/when this makes sense)
  • Total least-squares linear approximation (get rid of the coordinate system)
  • Iteratively reweighted least-squares (for nasty noise distributions)

slide-5
SLIDE 5

5

Least-Squares

We assume the following scenario:

  • Given: function values yi at positions xi (1D → 1D for now).
  • Independent variables xi are known exactly.
  • Dependent variables yi carry some error.
  • Error is Gaussian, i.i.d.:
    • normally distributed
    • independent
    • same distribution at every point
  • We know the class of functions to fit.
slide-6
SLIDE 6

6

Situation

Situation:

  • Original sample points taken at xi from the original f.
  • Unknown Gaussian i.i.d. noise added to each yi.
  • Want to estimate a reconstructed f̃.

(Figure: samples (x1, y1), ..., (xn, yn) of the original f with noise, and the reconstructed curve f̃.)

slide-7
SLIDE 7

7

Summary

The statistical model yields the least-squares criterion:

  \tilde f = \arg\min_{\tilde f} \sum_{i=1}^{n} \bigl( \tilde f(x_i) - y_i \bigr)^{2}

The linear function space leads to a quadratic objective:

  \tilde f(x) := \sum_{j=1}^{k} \lambda_j b_j(x),
  \qquad
  \arg\min_{\boldsymbol\lambda} \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{k} \lambda_j b_j(x_i) - y_i \Bigr)^{2}

Critical point: linear system

  \begin{pmatrix} \langle b_1, b_1 \rangle & \cdots & \langle b_1, b_k \rangle \\ \vdots & \ddots & \vdots \\ \langle b_k, b_1 \rangle & \cdots & \langle b_k, b_k \rangle \end{pmatrix}
  \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_k \end{pmatrix}
  =
  \begin{pmatrix} \langle b_1, y \rangle \\ \vdots \\ \langle b_k, y \rangle \end{pmatrix}

with:

  \langle b_i, b_j \rangle := \sum_{t=1}^{n} b_i(x_t)\, b_j(x_t),
  \qquad
  \langle b_i, y \rangle := \sum_{t=1}^{n} b_i(x_t)\, y_t

slide-8
SLIDE 8

8

Maximum Likelihood Estimation

Goal:

  • Maximize the probability that the data originated from the reconstructed curve f̃.
  • "Maximum likelihood estimation"

Gaussian normal distribution:

  p_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\Bigl( -\frac{(x-\mu)^{2}}{2\sigma^{2}} \Bigr)

slide-9
SLIDE 9

9

Maximum Likelihood Estimation

  \begin{aligned}
  \tilde f &= \arg\max_{\tilde f} \prod_{i=1}^{n} N_{\tilde f(x_i),\,\sigma}(y_i)
   = \arg\max_{\tilde f} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
     \exp\!\Bigl( -\frac{(y_i - \tilde f(x_i))^{2}}{2\sigma^{2}} \Bigr) \\
  &= \arg\max_{\tilde f} \ln \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
     \exp\!\Bigl( -\frac{(y_i - \tilde f(x_i))^{2}}{2\sigma^{2}} \Bigr)
   = \arg\max_{\tilde f} \sum_{i=1}^{n} \Bigl( \ln \frac{1}{\sqrt{2\pi}\,\sigma}
     - \frac{(y_i - \tilde f(x_i))^{2}}{2\sigma^{2}} \Bigr) \\
  &= \arg\min_{\tilde f} \sum_{i=1}^{n} \frac{(y_i - \tilde f(x_i))^{2}}{2\sigma^{2}}
   = \arg\min_{\tilde f} \sum_{i=1}^{n} \bigl( y_i - \tilde f(x_i) \bigr)^{2}
  \end{aligned}


slide-11
SLIDE 11

11

Least-Squares Approximation

This shows:

  • The maximum likelihood estimate minimizes the sum of squared errors

Next: Compute the optimal coefficients

  • Linear ansatz:

  \tilde f(x) := \sum_{j=1}^{k} \lambda_j b_j(x)

  • Determine the optimal λi

slide-12
SLIDE 12

12

Maximum Likelihood Estimation

with:

  \boldsymbol\lambda := (\lambda_1, \ldots, \lambda_k)^{\mathsf T} \ (k \text{ entries}),
  \qquad
  \mathbf b(x_i) := (b_1(x_i), \ldots, b_k(x_i))^{\mathsf T} \ (k \text{ entries}),

  \mathbf b_i := (b_i(x_1), \ldots, b_i(x_n))^{\mathsf T} \ (n \text{ entries}),
  \qquad
  \mathbf y := (y_1, \ldots, y_n)^{\mathsf T} \ (n \text{ entries})

  \begin{aligned}
  \arg\min_{\boldsymbol\lambda} \sum_{i=1}^{n} \bigl( \tilde f(x_i) - y_i \bigr)^{2}
  &= \arg\min_{\boldsymbol\lambda} \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{k} \lambda_j b_j(x_i) - y_i \Bigr)^{2}
   = \arg\min_{\boldsymbol\lambda} \sum_{i=1}^{n} \bigl( \boldsymbol\lambda^{\mathsf T}\mathbf b(x_i) - y_i \bigr)^{2} \\
  &= \arg\min_{\boldsymbol\lambda} \sum_{i=1}^{n} \Bigl( \boldsymbol\lambda^{\mathsf T}\mathbf b(x_i)\mathbf b(x_i)^{\mathsf T}\boldsymbol\lambda
     \;-\; 2\, y_i\, \boldsymbol\lambda^{\mathsf T}\mathbf b(x_i) \;+\; y_i^{2} \Bigr)
  \end{aligned}

This has the form xᵀAx + bᵀx + c → quadratic optimization problem

slide-13
SLIDE 13

13

Critical Point

  \nabla_{\boldsymbol\lambda} \Bigl[ \sum_{i=1}^{n} \Bigl( \boldsymbol\lambda^{\mathsf T}\mathbf b(x_i)\mathbf b(x_i)^{\mathsf T}\boldsymbol\lambda
     - 2\, y_i\, \boldsymbol\lambda^{\mathsf T}\mathbf b(x_i) + y_i^{2} \Bigr) \Bigr]
  = 2 \Bigl( \sum_{i=1}^{n} \mathbf b(x_i)\mathbf b(x_i)^{\mathsf T} \Bigr) \boldsymbol\lambda
    - 2 \sum_{i=1}^{n} y_i\, \mathbf b(x_i) \;\overset{!}{=}\; \mathbf 0

We obtain a linear system of equations:

  \Bigl( \sum_{i=1}^{n} \mathbf b(x_i)\mathbf b(x_i)^{\mathsf T} \Bigr) \boldsymbol\lambda
  \;=\; \sum_{i=1}^{n} y_i\, \mathbf b(x_i)

with λ, b(x_i), b_i, y as defined before (λ and b(x_i) have k entries, b_i and y have n entries).

slide-14
SLIDE 14

14

Critical Point

This can also be written as:

  \begin{pmatrix} \langle b_1, b_1 \rangle & \cdots & \langle b_1, b_k \rangle \\ \vdots & \ddots & \vdots \\ \langle b_k, b_1 \rangle & \cdots & \langle b_k, b_k \rangle \end{pmatrix}
  \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_k \end{pmatrix}
  =
  \begin{pmatrix} \langle b_1, y \rangle \\ \vdots \\ \langle b_k, y \rangle \end{pmatrix}

with:

  \langle b_i, b_j \rangle := \sum_{t=1}^{n} b_i(x_t)\, b_j(x_t),
  \qquad
  \langle b_i, y \rangle := \sum_{t=1}^{n} b_i(x_t)\, y_t

slide-15
SLIDE 15

15

Summary (again)

The statistical model yields the least-squares criterion:

  \tilde f = \arg\min_{\tilde f} \sum_{i=1}^{n} \bigl( \tilde f(x_i) - y_i \bigr)^{2}

The linear function space leads to a quadratic objective:

  \tilde f(x) := \sum_{j=1}^{k} \lambda_j b_j(x),
  \qquad
  \arg\min_{\boldsymbol\lambda} \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{k} \lambda_j b_j(x_i) - y_i \Bigr)^{2}

Critical point: linear system

  \begin{pmatrix} \langle b_1, b_1 \rangle & \cdots & \langle b_1, b_k \rangle \\ \vdots & \ddots & \vdots \\ \langle b_k, b_1 \rangle & \cdots & \langle b_k, b_k \rangle \end{pmatrix}
  \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_k \end{pmatrix}
  =
  \begin{pmatrix} \langle b_1, y \rangle \\ \vdots \\ \langle b_k, y \rangle \end{pmatrix}

with:

  \langle b_i, b_j \rangle := \sum_{t=1}^{n} b_i(x_t)\, b_j(x_t),
  \qquad
  \langle b_i, y \rangle := \sum_{t=1}^{n} b_i(x_t)\, y_t

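A minimal numpy sketch of this construction, assuming a monomial basis b_j(x) = x^j (any other basis works the same way) and synthetic noisy samples:

# Sketch: least-squares fit from a basis via the linear system above.
# The monomial basis and the test data are illustrative assumptions.
import numpy as np

def fit_least_squares(x, y, k):
    """Fit f~(x) = sum_j lambda_j * x**j (j = 0..k-1) to samples (x_i, y_i)."""
    B = np.vander(x, N=k, increasing=True)   # B[i, j] = b_j(x_i) = x_i**j
    A = B.T @ B                              # A[i, j] = <b_i, b_j>
    rhs = B.T @ y                            # rhs[i]  = <b_i, y>
    return np.linalg.solve(A, rhs)           # critical point: A @ lambda = rhs

# Example: noisy samples of a quadratic
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.05, size=x.size)
print(fit_least_squares(x, y, k=3))          # roughly [1, 2, -3]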
slide-16
SLIDE 16

16

Variants

Weighted least squares:

  • In case the data points' noise has different standard deviations σi at the different data points
  • This gives a weighted least-squares problem
  • Noisier points have smaller influence
slide-17
SLIDE 17

17

Same procedure as on the previous slides...

  \begin{aligned}
  \tilde f &= \arg\max_{\tilde f} \prod_{i=1}^{n} N_{\tilde f(x_i),\,\sigma_i}(y_i)
   = \arg\max_{\tilde f} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i}
     \exp\!\Bigl( -\frac{(y_i - \tilde f(x_i))^{2}}{2\sigma_i^{2}} \Bigr) \\
  &= \arg\max_{\tilde f} \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i}
     \exp\!\Bigl( -\frac{(y_i - \tilde f(x_i))^{2}}{2\sigma_i^{2}} \Bigr)
   = \arg\max_{\tilde f} \sum_{i=1}^{n} \Bigl( \log \frac{1}{\sqrt{2\pi}\,\sigma_i}
     - \frac{(y_i - \tilde f(x_i))^{2}}{2\sigma_i^{2}} \Bigr) \\
  &= \arg\min_{\tilde f} \sum_{i=1}^{n} \frac{(y_i - \tilde f(x_i))^{2}}{2\sigma_i^{2}}
   = \arg\min_{\tilde f} \sum_{i=1}^{n} \underbrace{\frac{1}{\sigma_i^{2}}}_{\text{weights}}
     \bigl( y_i - \tilde f(x_i) \bigr)^{2}
  \end{aligned}

slide-18
SLIDE 18

18

Result

Linear system for the general case (larger weight ω → larger influence of the data point):

  \begin{pmatrix} \langle b_1, b_1 \rangle & \cdots & \langle b_1, b_k \rangle \\ \vdots & \ddots & \vdots \\ \langle b_k, b_1 \rangle & \cdots & \langle b_k, b_k \rangle \end{pmatrix}
  \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_k \end{pmatrix}
  =
  \begin{pmatrix} \langle b_1, y \rangle \\ \vdots \\ \langle b_k, y \rangle \end{pmatrix}

with:

  \langle b_i, b_j \rangle := \sum_{l=1}^{n} \omega(x_l)\, b_i(x_l)\, b_j(x_l),
  \qquad
  \langle b_i, y \rangle := \sum_{l=1}^{n} \omega(x_l)\, b_i(x_l)\, y_l,
  \qquad
  \omega(x_i) := \frac{1}{\sigma_i^{2}}

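The same sketch with weights ω_i = 1/σ_i², assuming the per-point standard deviations σ_i are known:

# Sketch of the weighted variant: noisier points get smaller weights.
# Basis choice and noise model are illustrative assumptions.
import numpy as np

def fit_weighted_least_squares(x, y, sigma, k):
    B = np.vander(x, N=k, increasing=True)    # B[i, j] = b_j(x_i)
    w = 1.0 / sigma**2                        # omega(x_i) = 1 / sigma_i^2
    A = B.T @ (w[:, None] * B)                # <b_i, b_j> = sum_l w_l b_i(x_l) b_j(x_l)
    rhs = B.T @ (w * y)                       # <b_i, y>   = sum_l w_l b_i(x_l) y_l
    return np.linalg.solve(A, rhs)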
slide-19
SLIDE 19

19

Least-Squares Linear Systems

Remark:

  • We get the same result if we solve an overdetermined system for the interpolation problem in a least-squares sense
  • Least-squares solution to a linear system:
  • "System of normal equations"

  \arg\min_{\mathbf x} \|\mathbf A\mathbf x - \mathbf b\|^{2}
  = \arg\min_{\mathbf x} \bigl( \mathbf x^{\mathsf T}\mathbf A^{\mathsf T}\mathbf A\mathbf x
    - 2\,\mathbf b^{\mathsf T}\mathbf A\mathbf x + \mathbf b^{\mathsf T}\mathbf b \bigr)

  compute gradient:
  \quad 2\,\mathbf A^{\mathsf T}\mathbf A\mathbf x - 2\,\mathbf A^{\mathsf T}\mathbf b \overset{!}{=} 0,
  \quad\text{i.e.,}\quad
  \mathbf A^{\mathsf T}\mathbf A\,\mathbf x = \mathbf A^{\mathsf T}\mathbf b

slide-20
SLIDE 20

20

SVD

Problem with normal equations:

  • Condition number of the normal equations is the square of that of A itself
  • Proof:

  \text{SVD: } \mathbf A = \mathbf U\mathbf D\mathbf V^{\mathsf T}
  \;\Rightarrow\;
  \mathbf A^{\mathsf T}\mathbf A
  = \mathbf V\mathbf D\,\mathbf U^{\mathsf T}\mathbf U\,\mathbf D\mathbf V^{\mathsf T}
  = \mathbf V\mathbf D^{2}\mathbf V^{\mathsf T}

  • For "evil" (i.e., ill-conditioned) problems, the normal equations are not the best way to solve the problem
  • In that case, we can use the SVD to solve the problem...

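A small numerical illustration of this conditioning issue, assuming an ill-conditioned monomial basis; the SVD-based solve (np.linalg.lstsq) is compared against the normal equations:

# Sketch: solving an overdetermined system A x ~ b via the normal equations
# versus via the SVD. For ill-conditioned A the normal equations square the
# condition number. Test problem is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
A = np.vander(x, N=12, increasing=True)        # ill-conditioned monomial basis
b = np.sin(2.0 * np.pi * x) + rng.normal(scale=1e-3, size=x.size)

print(np.linalg.cond(A), np.linalg.cond(A.T @ A))   # cond(A^T A) = cond(A)**2

x_normal = np.linalg.solve(A.T @ A, A.T @ b)        # normal equations
x_svd, *_ = np.linalg.lstsq(A, b, rcond=None)       # SVD-based solve
print(np.abs(x_normal - x_svd).max())               # discrepancy grows with cond(A)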
slide-21
SLIDE 21

21

One more Variant...

Function Approximation

  • Given the following problem:
  • We know a function f: ℝⁿ → ℝ
  • We want to approximate f in a linear subspace:

  \tilde f(x) := \sum_{j=1}^{k} \lambda_j b_j(x), \qquad \tilde f \approx f

  • How to choose the λj?
  • Difference: a continuous function as "data" to be matched.
  • Solution: Almost the same as before...

slide-22
SLIDE 22

22

Function Approximation

Objective function:

  \bigl\| \tilde f - f \bigr\|^{2} = \int \bigl( \tilde f(x) - f(x) \bigr)^{2}\, dx \;\to\; \min

  • We obtain:

  \Bigl\| \sum_{j=1}^{k} \lambda_j b_j - f \Bigr\|^{2}
  = \Bigl\langle \sum_{j=1}^{k} \lambda_j b_j - f,\; \sum_{j=1}^{k} \lambda_j b_j - f \Bigr\rangle
  = \boldsymbol\lambda^{\mathsf T}
    \begin{pmatrix} \langle b_1, b_1 \rangle & \cdots & \langle b_1, b_k \rangle \\ \vdots & \ddots & \vdots \\ \langle b_k, b_1 \rangle & \cdots & \langle b_k, b_k \rangle \end{pmatrix}
    \boldsymbol\lambda
    \;-\; 2 \sum_{j=1}^{k} \lambda_j \langle b_j, f \rangle
    \;+\; \langle f, f \rangle

slide-23
SLIDE 23

23

Function Approximation

Critical point (i.e., solution):

  \begin{pmatrix} \langle b_1, b_1 \rangle & \cdots & \langle b_1, b_k \rangle \\ \vdots & \ddots & \vdots \\ \langle b_k, b_1 \rangle & \cdots & \langle b_k, b_k \rangle \end{pmatrix}
  \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_k \end{pmatrix}
  =
  \begin{pmatrix} \langle b_1, f \rangle \\ \vdots \\ \langle b_k, f \rangle \end{pmatrix}

with:

  \langle f, g \rangle := \int f(x)\, g(x)\, dx \quad \text{(unweighted version)}
  \qquad
  \langle f, g \rangle := \int f(x)\, g(x)\, \omega(x)\, dx \quad \text{(weighted version)}

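A sketch of the continuous case with the inner products evaluated by numerical quadrature; the interval [0, 1], the target f = cos, and the monomial basis are illustrative assumptions:

# Sketch: function approximation in a subspace. Matrix entries <b_i, b_j> and
# right-hand side <b_i, f> are integrals, evaluated numerically here.
import numpy as np
from scipy.integrate import quad

f = np.cos                                        # function to approximate (assumed)
basis = [lambda x, j=j: x**j for j in range(4)]   # b_j(x) = x**j (assumed basis)
k = len(basis)

G = np.array([[quad(lambda x: basis[i](x) * basis[j](x), 0.0, 1.0)[0]
               for j in range(k)] for i in range(k)])
rhs = np.array([quad(lambda x: basis[i](x) * f(x), 0.0, 1.0)[0] for i in range(k)])
lam = np.linalg.solve(G, rhs)    # coefficients of the best L2 approximation of f
print(lam)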
slide-24
SLIDE 24

24

Galerkin Approximation

The least-squares criterion is equivalent to:

  \forall i \in \{1..k\}:\quad
  \Bigl\langle \sum_{j=1}^{k} \lambda_j b_j - f,\; b_i \Bigr\rangle = 0
  \qquad\Longleftrightarrow\qquad
  \forall i \in \{1..k\}:\quad
  \Bigl\langle \sum_{j=1}^{k} \lambda_j b_j,\; b_i \Bigr\rangle = \langle f,\, b_i \rangle

i.e., the same linear system as before:

  \begin{pmatrix} \langle b_1, b_1 \rangle & \cdots & \langle b_1, b_k \rangle \\ \vdots & \ddots & \vdots \\ \langle b_k, b_1 \rangle & \cdots & \langle b_k, b_k \rangle \end{pmatrix}
  \begin{pmatrix} \lambda_1 \\ \vdots \\ \lambda_k \end{pmatrix}
  =
  \begin{pmatrix} \langle b_1, f \rangle \\ \vdots \\ \langle b_k, f \rangle \end{pmatrix}

(Figure: the residual vector is orthogonal to each basis function of the linear subspace; the best approximation is the projection of f onto the subspace.)
slide-25
SLIDE 25

25

Summary

What we can do so far:

  • Least-squares approximation:
  • Given more data points than basis functions, we can fit an approximating function from a basis function set to the data
  • Variants:
  • We can solve linear systems in a least-squares sense
  • Given a function, we can fit the most similar approximation from a subspace
  • Extensions:
  • Any known uncertainty in the data can be modeled by weights
  • The multi-dimensional case is similar
slide-26
SLIDE 26

26

Remaining problems

What is missing:

  • Any error in x-direction is ignored so far (only y-direction)
  • We will look at that problem next (total least-squares)...
  • Noise must be Gaussian
  • Can be generalized using iteratively reweighted least-squares (M-estimators)

slide-27
SLIDE 27

Total Least Squares

slide-28
SLIDE 28

28

Statistical Model

Generative Model:

  • Original curve / surface
  • Noisy sample points

slide-29
SLIDE 29

29

Statistical Model

Generative Model:

  • 1. Determine sample point (uniform)
  • 2. Add noise (Gaussian)

(Figure: sampling → Gaussian noise → many samples → distribution in space.)

slide-30
SLIDE 30

30

Squared Distance Function

Result:

  • Gaussian distribution convolved with object
  • No analytical density

Approximation:

  • 1D Gaussian → minimize squared residual
  • This case → minimize squared distance function
slide-31
SLIDE 31

31

General Total Least Squares

General Total Least Squares:

  • Given a class of objects obj_λ with parameters λ ∈ ℝᵏ.
  • A set of n sample points (Gaussian, i.i.d., isotropic covariance) dᵢ ∈ ℝᵐ.
  • The total least squares solution minimizes:

  \arg\min_{\boldsymbol\lambda \in \mathbb R^{k}} \sum_{i=1}^{n}
  \mathrm{dist}^{2}\bigl( \mathrm{obj}_{\boldsymbol\lambda},\; \mathbf d_i \bigr)

  • In general: a non-linear, possibly constrained (restrictions on admissible λs) optimization problem
  • Special cases can be solved exactly

slide-32
SLIDE 32

32

Fitting Affine Subspaces

The following problem can be solved exactly:

  • Best fitting line to a set of 2D or 3D points
  • Best fitting plane to a set of 3D points
  • In general: an affine subspace of ℝᵐ with dimension d ≤ m that best approximates a set of data points dᵢ ∈ ℝᵐ.

Solution: principal component analysis (PCA)

slide-33
SLIDE 33

33

Start: 0-dim Subspaces

Easy Start: The optimal 0-dimensional affine subspace

  • Given a set D of n data points dᵢ ∈ ℝᵐ, what is the point x₀ with minimum least-squares error to all data points?
  • Answer: just the sample mean (average):

  \mathbf x_0 = \mathrm m(D) := \frac{1}{n} \sum_{i=1}^{n} \mathbf d_i

  • Proof: minimize (next slide...)

  E(\mathbf x_0) = \sum_{i=1}^{n} \bigl\| \mathbf x_0 - \mathbf d_i \bigr\|^{2}

slide-34
SLIDE 34

34

Proof

sample mean:

  \mathrm m(D) := \frac{1}{n} \sum_{i=1}^{n} \mathbf d_i

  \begin{aligned}
  E(\mathbf x_0) &= \sum_{i=1}^{n} \|\mathbf x_0 - \mathbf d_i\|^{2}
   = \sum_{i=1}^{n} \bigl\| (\mathbf x_0 - \mathrm m(D)) + (\mathrm m(D) - \mathbf d_i) \bigr\|^{2} \\
  &= \sum_{i=1}^{n} \Bigl( \|\mathbf x_0 - \mathrm m(D)\|^{2}
     + 2\,\bigl\langle \mathbf x_0 - \mathrm m(D),\, \mathrm m(D) - \mathbf d_i \bigr\rangle
     + \|\mathrm m(D) - \mathbf d_i\|^{2} \Bigr) \\
  &= n\,\|\mathbf x_0 - \mathrm m(D)\|^{2}
     + 2\,\Bigl\langle \mathbf x_0 - \mathrm m(D),\, \underbrace{\textstyle\sum_{i=1}^{n} (\mathrm m(D) - \mathbf d_i)}_{=\,0} \Bigr\rangle
     + \sum_{i=1}^{n} \|\mathrm m(D) - \mathbf d_i\|^{2} \\
  &= \underbrace{n\,\|\mathbf x_0 - \mathrm m(D)\|^{2}}_{\text{minimal for } \mathbf x_0 = \mathrm m(D)}
     \;+\; \underbrace{\sum_{i=1}^{n} \|\mathrm m(D) - \mathbf d_i\|^{2}}_{\text{independent of } \mathbf x_0}
  \end{aligned}


slide-36
SLIDE 36

36

One Dimensional Subspaces...

Next:

  • What is the optimal line (1D subspace) that approximates a set of data points dᵢ?

  • Two questions:
  • Optimum origin (point on the line)?

– This is still the average (proof omitted)

  • Optimum direction?

– We will look at that next...

  • Parametric line equation:

  \mathbf x(t) = \mathbf x_0 + t\,\mathbf r,
  \qquad \mathbf x_0, \mathbf r \in \mathbb R^{m},\ \ \|\mathbf r\| = 1

slide-37
SLIDE 37

37

Best Fitting Line

Line equation:

  \mathbf x(t) = \mathbf x_0 + t\,\mathbf r,
  \qquad \mathbf x_0, \mathbf r \in \mathbb R^{m},\ \ \|\mathbf r\| = 1

Objective Function:

  \sum_{i=1}^{n} \mathrm{dist}^{2}(\text{line}, \mathbf d_i)
  = \sum_{i=1}^{n} \bigl\| \mathbf x_0 + t_i\,\mathbf r - \mathbf d_i \bigr\|^{2}

Best projection of dᵢ on any line (optimal parameter tᵢ):

  t_i = \bigl\langle \mathbf d_i - \mathbf x_0,\; \mathbf r \bigr\rangle

(Figure: line through x₀ with unit-length direction r, data point dᵢ projected onto the line.)

slide-38
SLIDE 38

38

Best Fitting Line

Optimal parameters tᵢ:

  t_i = \bigl\langle \mathbf d_i - \mathbf x_0,\; \mathbf r \bigr\rangle

Objective Function:

  \begin{aligned}
  \sum_{i=1}^{n} \mathrm{dist}^{2}(\text{line}, \mathbf d_i)
  &= \sum_{i=1}^{n} \bigl\| \mathbf x_0 + t_i\,\mathbf r - \mathbf d_i \bigr\|^{2}
   = \sum_{i=1}^{n} \Bigl( \|\mathbf d_i - \mathbf x_0\|^{2}
     - 2\,t_i\,\langle \mathbf d_i - \mathbf x_0, \mathbf r \rangle + t_i^{2}\,\|\mathbf r\|^{2} \Bigr) \\
  &= \sum_{i=1}^{n} \|\mathbf d_i - \mathbf x_0\|^{2}
     - \sum_{i=1}^{n} \langle \mathbf d_i - \mathbf x_0, \mathbf r \rangle^{2}
   = \underbrace{\sum_{i=1}^{n} \|\mathbf d_i - \mathbf x_0\|^{2}}_{\text{const. w.r.t. } \mathbf r}
     \;-\; \mathbf r^{\mathsf T}\mathbf S\,\mathbf r
  \end{aligned}

  \text{with the matrix }
  \mathbf S := \sum_{i=1}^{n} (\mathbf d_i - \mathbf x_0)(\mathbf d_i - \mathbf x_0)^{\mathsf T}


slide-40
SLIDE 40

40

Best Fitting Line

Result: Eigenvalue Problem

  • rᵀSr is a Rayleigh quotient
  • Minimizing the energy means maximizing this quotient
  • Solution: eigenvector with largest eigenvalue

  \sum_{i=1}^{n} \mathrm{dist}^{2}(\text{line}, \mathbf d_i)
  = \text{const.} - \mathbf r^{\mathsf T}\mathbf S\,\mathbf r,
  \qquad\text{with }
  \mathbf S := \sum_{i=1}^{n} (\mathbf d_i - \mathbf x_0)(\mathbf d_i - \mathbf x_0)^{\mathsf T},
  \ \ \|\mathbf r\| = 1

slide-41
SLIDE 41

41

General Case

Fitting a d-dimensional affine subspace:

  • d = 1: line
  • d = 2: plane
  • d = 3: 3D subspace
  • ...

Simple rule:

  • Use the d eigenvectors with the largest eigenvalues from the spectrum of S.
  • This gives the (total) least-squares optimal subspace that approximates the data set D.

slide-42
SLIDE 42

42

General Case

Procedure: Principal Component Analysis (PCA)

  • Compute the average x₀ = m(D)
  • Compute the "scatter matrix"

  \mathbf S := \sum_{i=1}^{n} (\mathbf d_i - \mathbf x_0)(\mathbf d_i - \mathbf x_0)^{\mathsf T}

  • Let (λ₁, v₁), ..., (λₙ, vₙ) be the sorted eigenvalue/eigenvector pairs of S, where λ₁ is the largest and the vᵢ are of unit length.
  • The subspace spanned by

  \mathbf p(t_1, \ldots, t_d) = \mathbf x_0 + \sum_{i=1}^{d} t_i\,\mathbf v_i

  approximates the data optimally in terms of squared distances to a point in the subspace.
  • Stronger: projecting the data into this subspace is the best d-dimensional (affine subspace) data approximation.

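A compact sketch of this procedure, assuming the data points are the rows of an array D:

# Sketch of the PCA procedure above: center the data, build the scatter
# matrix S, and take the eigenvectors with the largest eigenvalues.
import numpy as np

def pca_subspace(D, d):
    """D: (n, m) array of data points (rows). Returns (x0, V) with V = (v_1, ..., v_d)."""
    x0 = D.mean(axis=0)                  # x0 = m(D)
    C = D - x0
    S = C.T @ C                          # scatter matrix, (m, m)
    lam, V = np.linalg.eigh(S)           # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]        # sort descending
    return x0, V[:, order[:d]]           # the d principal directions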
slide-43
SLIDE 43

43

Statistical Interpretation

Observation:

  • S/(n−1) is the covariance matrix of the data set D = {dᵢ}
  • PCA can be interpreted as fitting a Gaussian distribution and computing the main axes

  \tilde{\boldsymbol\mu} = \frac{1}{n} \sum_{i=1}^{n} \mathbf d_i = \mathbf x_0,
  \qquad
  \tilde{\boldsymbol\Sigma} = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf d_i - \mathbf x_0)(\mathbf d_i - \mathbf x_0)^{\mathsf T}
  = \frac{1}{n-1}\,\mathbf S

  N_{\boldsymbol\mu,\boldsymbol\Sigma}(\mathbf x)
  = \frac{1}{\sqrt{(2\pi)^{d}\,\det(\boldsymbol\Sigma)}}
    \exp\!\Bigl( -\tfrac{1}{2}\,(\mathbf x - \boldsymbol\mu)^{\mathsf T} \boldsymbol\Sigma^{-1} (\mathbf x - \boldsymbol\mu) \Bigr)

(Figure: fitted Gaussian with center x₀ and main axes √λ₁·v₁ and √λ₂·v₂.)
slide-44
SLIDE 44

44

Applications

Fitting a line to a point cloud in ℝ²:

  • Sample mean and the direction of maximum eigenvalue

  \mathbf x(t) = \mathbf x_0 + t\,\mathbf v_1

Plane fitting in ℝ³:

  • Sample mean and the two directions of maximum eigenvalues
  • Smallest eigenvalue: its eigenvector points in the normal direction
  • Aspect ratio (λ₃ / λ₂) is a measure of "flatness" (quality of fit)

(Figure: line fits with (λ₂ / λ₁) small vs. (λ₂ / λ₁) larger.)

slide-45
SLIDE 45

45

Applications

Application: Normal estimation in point clouds

  • Given a set of points pᵢ ∈ ℝ³ that form a smooth surface.
  • We want to estimate:
  • Surface normals
  • Sample spacing

Algorithm (a code sketch follows below):

  • For each point, compute the k nearest neighbors (k ≈ 20)
  • Compute a PCA (average, main axes) of these points
  • Eigenvector with smallest eigenvalue → normal direction
  • The other two eigenvectors → tangent vectors
  • Tangent eigenvalues give sample spacing estimate
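
A sketch of this algorithm, assuming scipy's cKDTree for the neighbor queries; sign consistency of the normals is not handled here:

# Sketch: normal estimation in a point cloud via k-NN + local PCA.
import numpy as np
from scipy.spatial import cKDTree

def estimate_normals(points, k=20):
    """points: (n, 3) float array. Returns (n, 3) unit normals (up to sign)."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)        # k nearest neighbors per point
    normals = np.empty_like(points)
    for i, nb in enumerate(idx):
        nbrs = points[nb]
        C = nbrs - nbrs.mean(axis=0)
        S = C.T @ C                         # local 3x3 scatter matrix
        lam, V = np.linalg.eigh(S)          # ascending eigenvalues
        normals[i] = V[:, 0]                # smallest eigenvalue -> normal direction
    return normals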
slide-46
SLIDE 46

46

Example

(Figure: points, normals, tangential frames, elliptic splats with shading.)

slide-47
SLIDE 47

47

Applications

Another Application: Coordinate frame estimation

  • More general total least-squares (non-linear) is difficult
  • However: We can tweak it
  • Compute coordinate frames using PCA
  • Smallest eigenvalue = normal direction
  • Form height field in normal direction
  • Then: use ordinary least-squares in this coordinate system
slide-48
SLIDE 48

48

Example

Example:

  • k-nearest neighbors
  • PCA coordinate frames at each point
  • Quadratic monomials (bivariate, local coords.)
  • Least-squares fit
slide-49
SLIDE 49

49

Curvature Estimation

Estimating curvature in point clouds:

  • Quadratic fitting (as in the example)
  • PCA coordinate frame (u, v, n)
  • Height field in normal direction: (u, v) → n
  • Fit quadratic polynomials {1, u, v, uv, u², v²}
  • Eigenanalysis of the quadratic terms
  • Hessian matrix from the fitted polynomial κ_uu·u² + 2·κ_uv·uv + κ_vv·v² + (lower-order terms):

  \mathbf H = \begin{pmatrix} \kappa_{uu} & \kappa_{uv} \\ \kappa_{uv} & \kappa_{vv} \end{pmatrix}

  • Compute principal directions
  • Mean / Gaussian / normal curvature

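A simplified sketch for a single point, assuming the neighbors are already expressed in the local PCA frame (u, v, n) and ignoring the first-order terms of the height field:

# Sketch: curvature estimation from a local quadratic height-field fit.
import numpy as np

def principal_curvatures(u, v, n):
    """u, v, n: arrays of local coordinates of the neighboring points."""
    B = np.column_stack([np.ones_like(u), u, v, u * v, u**2, v**2])
    c = np.linalg.lstsq(B, n, rcond=None)[0]   # c = (c0, cu, cv, cuv, cuu, cvv)
    H = np.array([[2.0 * c[4], c[3]],          # Hessian of the quadratic part
                  [c[3], 2.0 * c[5]]])
    k1, k2 = np.linalg.eigvalsh(H)             # approximate principal curvatures
    return k1, k2                              # mean: (k1+k2)/2, Gaussian: k1*k2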
slide-50
SLIDE 50

50

Bunny Curvature

(Figure: Stanford Bunny (dense point cloud): principal curvature 1, principal curvature 2, mean curvature, Gaussian curvature. [courtesy of Martin Bokeloh])

slide-51
SLIDE 51

PCA and MDS

slide-52
SLIDE 52

52

Notation

Data matrix

  \tilde{\mathbf X} = \begin{pmatrix} \tilde{\mathbf x}_1^{\mathsf T} \\ \vdots \\ \tilde{\mathbf x}_n^{\mathsf T} \end{pmatrix}
  \qquad\text{(d-dimensional input vectors } \tilde{\mathbf x}_i = (x_{i,1}, \ldots, x_{i,d})^{\mathsf T}\text{, one per row)}

Centered data matrix

  \mathbf X = \tilde{\mathbf X} - \mathbf 1\,\Bigl( \tfrac{1}{n} \sum_{i=1}^{n} \tilde{\mathbf x}_i \Bigr)^{\mathsf T}
  = \Bigl( \mathbf I - \tfrac{1}{n}\,\mathbf 1\mathbf 1^{\mathsf T} \Bigr)\, \tilde{\mathbf X}

slide-53
SLIDE 53

53

MDS

Multi-Dimensional Scaling

  • Input: n × n pairwise distance matrix D
  • Compute the pairwise scalar product matrix of the centered vectors:

  \mathbf K = -\tfrac{1}{2}\,
  \Bigl( \mathbf I - \tfrac{1}{n}\mathbf 1\mathbf 1^{\mathsf T} \Bigr)\,
  \tilde{\mathbf D}\,
  \Bigl( \mathbf I - \tfrac{1}{n}\mathbf 1\mathbf 1^{\mathsf T} \Bigr),
  \qquad \tilde{\mathbf D}_{ij} := d_{ij}^{2}

  • By this construction:

  \mathbf K = \mathbf X\mathbf X^{\mathsf T}
  \qquad\text{(PCA: } \mathbf X^{\mathsf T}\mathbf X\text{)}

slide-54
SLIDE 54

54

MDS (2)

Multi-Dimensional Scaling

  • Next: Compute the eigenstructure of K = XXᵀ:

  \mathbf K = \mathbf U\boldsymbol\Lambda\mathbf U^{\mathsf T}

  • Thus, a candidate reconstruction of X is given by

  \mathbf X = \mathbf U\boldsymbol\Lambda^{1/2}

  • Reminder: X – data vectors in rows
  • Hence:

  \mathrm{emb}(\tilde{\mathbf x}_i)
  = \bigl( \sqrt{\lambda_1}\,u_1^{(i)},\; \ldots,\; \sqrt{\lambda_k}\,u_k^{(i)} \bigr),
  \qquad k \le n

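A sketch of classical MDS as described above, assuming a Euclidean distance matrix D:

# Sketch: classical MDS. Double-center the squared distance matrix to get K,
# then embed with the top eigenvectors scaled by sqrt(eigenvalue).
import numpy as np

def classical_mds(D, k):
    """D: (n, n) matrix of pairwise Euclidean distances. Returns (n, k) embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix I - (1/n) 1 1^T
    K = -0.5 * J @ (D**2) @ J                    # scalar products of centered points
    lam, U = np.linalg.eigh(K)                   # ascending eigenvalues
    order = np.argsort(lam)[::-1][:k]
    lam, U = lam[order], U[:, order]
    return U * np.sqrt(np.maximum(lam, 0.0))     # emb_i = (sqrt(l_1) u_1^(i), ...)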
slide-55
SLIDE 55

55

MDS (3)

Properties (assuming Euclidean distances):

  • Recovers the points up to a global translation and orthogonal mapping
  • The reduced version (k-dim.) preserves distances in a least-squares sense (dimension reduction)
  • The result is the same as for a PCA embedding of the point coordinates (next slide)
slide-56
SLIDE 56

56

MDS is PCA

Equivalence of MDS and PCA, via the SVD of the centered data matrix:

  \mathbf X = \mathbf U\boldsymbol\Lambda\mathbf V^{\mathsf T},
  \qquad
  \boldsymbol\Lambda = \mathrm{diag}\bigl(\lambda_1, \ldots, \lambda_{\min(d,n)}\bigr)

  \text{PCA: } \mathbf X^{\mathsf T}\mathbf X = \mathbf V\boldsymbol\Lambda^{2}\mathbf V^{\mathsf T}
  \qquad\qquad
  \text{MDS: } \mathbf X\mathbf X^{\mathsf T} = \mathbf U\boldsymbol\Lambda^{2}\mathbf U^{\mathsf T}

  \text{PCA: } \mathrm{emb}_{\mathrm{PCA}}(\mathbf X) = \mathbf X\mathbf V
  = \mathbf U\boldsymbol\Lambda\mathbf V^{\mathsf T}\mathbf V = \mathbf U\boldsymbol\Lambda
  \qquad\qquad
  \text{MDS: } \mathrm{emb}_{\mathrm{MDS}}(\mathbf X) = \mathbf U\boldsymbol\Lambda

slide-57
SLIDE 57

57

So Where is the Difference?

                       PCA                                       MDS
  input                points                                    pairwise distances / scalar products
  complexity           d × d eigenvalue problem                  n × n eigenvalue problem
                       (low dim., large data sets)               (high dim., small data sets)
  result               data embedding, principal variances,      data embedding, principal variances,
                       principal axes                            no principal axes
  subspace projection  can easily embed additional vectors       not obvious (yes: Nyström method)

slide-58
SLIDE 58

58

Nyström Projection

Embedding further Vectors

  • Recompute everything
    • Expensive
    • Inconsistent for some applications (new coordinates)
  • "Nyström Formula"
    • Compute the embedding by a linear combination of the computed eigenvectors
    • Uses projections onto the input data set (scalar products only)
    • Assumes knowledge of point positions (later: measure distances only)

slide-59
SLIDE 59

59

Nyström Projection

  • Reminder:

  \mathbf X = \mathbf U\boldsymbol\Lambda\mathbf V^{\mathsf T},
  \qquad
  \mathbf K = \mathbf X\mathbf X^{\mathsf T} = \mathbf U\boldsymbol\Lambda^{2}\mathbf U^{\mathsf T},
  \qquad
  \mathrm{emb}_{\mathrm{MDS}}(\mathbf X) = \mathbf U\boldsymbol\Lambda

  • Project a vector x onto the principal axes:

  \mathrm{emb}_{\mathrm{PCA}}(\mathbf x) = \mathbf V^{\mathsf T}\mathbf x
  = \boldsymbol\Lambda^{-1}\mathbf U^{\mathsf T}\mathbf X\,\mathbf x
  = \Bigl( \tfrac{1}{\lambda_1} \sum_{i=1}^{n} u_1^{(i)} \langle \mathbf x_i, \mathbf x \rangle,\;
    \ldots,\;
    \tfrac{1}{\lambda_n} \sum_{i=1}^{n} u_n^{(i)} \langle \mathbf x_i, \mathbf x \rangle \Bigr)

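A sketch of this projection, assuming the rows of X are the (centered) training vectors and that the top-k eigenvalues of K are positive:

# Sketch: Nystroem embedding of a new vector x using only its scalar products
# with the training data and the eigenstructure of K = X X^T. Here lam are
# eigenvalues of K (squared singular values of X), hence the 1/sqrt(lam).
import numpy as np

def nystroem_embed(X, x, k):
    K = X @ X.T                                  # Gram matrix of training data
    lam, U = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:k]
    lam, U = lam[order], U[:, order]
    s = X @ x                                    # <x_i, x> for all training points
    return (U.T @ s) / np.sqrt(lam)              # one coordinate per eigenvector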
slide-60
SLIDE 60

Kernel PCA

slide-61
SLIDE 61

61

Kernel PCA

"Kernel PCA is classical scaling in feature space" [Williams 2002]

Main Idea:

  • MDS can be easily "kernelized" – just replace the scalar product matrix K with the kernel matrix
  • No need to deal with feature space explicitly (which might be intractable)
  • Will yield PCA anyway (but no eigenvectors)
slide-62
SLIDE 62

62

Algorithm

Kernel PCA – Step 1/3

Compute the kernel (Gram) matrix:

  \mathbf K = \begin{pmatrix}
    \langle \phi(\mathbf x_1), \phi(\mathbf x_1) \rangle & \cdots & \langle \phi(\mathbf x_1), \phi(\mathbf x_n) \rangle \\
    \vdots & \ddots & \vdots \\
    \langle \phi(\mathbf x_n), \phi(\mathbf x_1) \rangle & \cdots & \langle \phi(\mathbf x_n), \phi(\mathbf x_n) \rangle
  \end{pmatrix}
  = \begin{pmatrix}
    k(\mathbf x_1, \mathbf x_1) & \cdots & k(\mathbf x_1, \mathbf x_n) \\
    \vdots & \ddots & \vdots \\
    k(\mathbf x_n, \mathbf x_1) & \cdots & k(\mathbf x_n, \mathbf x_n)
  \end{pmatrix}

slide-63
SLIDE 63

63

Algorithm

Kernel PCA – Step 2/3: Center Data

  • Cannot compute the centered data matrix directly, because the data now lives in feature space
  • Center the kernel matrix instead (exactly the same construction as in MDS):

  \mathbf K_{\mathrm{centered}}
  = \Bigl( \mathbf I - \tfrac{1}{n}\mathbf 1\mathbf 1^{\mathsf T} \Bigr)\,
    \mathbf K\,
    \Bigl( \mathbf I - \tfrac{1}{n}\mathbf 1\mathbf 1^{\mathsf T} \Bigr)
  = \mathbf K
    - \tfrac{1}{n}\,\mathbf 1\mathbf 1^{\mathsf T}\mathbf K
    - \tfrac{1}{n}\,\mathbf K\,\mathbf 1\mathbf 1^{\mathsf T}
    + \tfrac{1}{n^{2}}\,\mathbf 1\mathbf 1^{\mathsf T}\mathbf K\,\mathbf 1\mathbf 1^{\mathsf T}

  • Proof: expand the scalar products of the centered feature vectors

  (\mathbf K_{\mathrm{centered}})_{i,j}
  = \Bigl\langle \phi(\mathbf x_i) - \tfrac{1}{n}\textstyle\sum_{k=1}^{n}\phi(\mathbf x_k),\;
                 \phi(\mathbf x_j) - \tfrac{1}{n}\textstyle\sum_{k=1}^{n}\phi(\mathbf x_k) \Bigr\rangle

slide-64
SLIDE 64

64

Algorithm

Kernel PCA – Step 3/3: Compute Embedding

  • Eigenspace:

  \mathbf K_{\mathrm{centered}} = \mathbf U\boldsymbol\Lambda\mathbf U^{\mathsf T}

  • Embedding:

  \mathrm{emb}(\mathbf x_i)
  = \bigl( \sqrt{\lambda_1}\,u_1^{(i)},\; \ldots,\; \sqrt{\lambda_k}\,u_k^{(i)} \bigr),
  \qquad k \le n

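A sketch of the three steps, assuming a Gaussian kernel as an illustrative choice; only kernel evaluations are used, never explicit feature-space coordinates:

# Sketch: kernel PCA = kernel matrix -> centering -> eigen-embedding.
import numpy as np

def kernel_pca(X, k, sigma=1.0):
    """X: (n, d) input points. Returns the (n, k) kernel PCA embedding."""
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma**2))            # step 1: kernel (Gram) matrix
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                # step 2: center the kernel matrix
    lam, U = np.linalg.eigh(Kc)                   # step 3: eigenstructure
    order = np.argsort(lam)[::-1][:k]
    lam, U = lam[order], U[:, order]
    return U * np.sqrt(np.maximum(lam, 0.0))      # emb(x_i) = (sqrt(l_j) u_j^(i))_j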
slide-65
SLIDE 65

65

Embedding Further Points

Project Additional Points:

  • Use the Nyström formula:

  \mathrm{emb}(\mathbf x)
  = \Bigl( \tfrac{1}{\sqrt{\lambda_1}} \sum_{i=1}^{n} u_1^{(i)}\, k(\mathbf x_i, \mathbf x),\;
    \ldots,\;
    \tfrac{1}{\sqrt{\lambda_n}} \sum_{i=1}^{n} u_n^{(i)}\, k(\mathbf x_i, \mathbf x) \Bigr)

  • This uses only kernel computations in feature space

slide-66
SLIDE 66

66

Kernel PCA

Summary:

  • Kernel PCA performs PCA/MDS in feature space using dot products (i.e., kernel evaluations) only
  • It gives the same result as MDS
slide-67
SLIDE 67

67

Kernel PCA

Remarks:

  • Unlike "real" PCA, it does not output principal axes vectors
  • They exist only in feature space, i.e., they are usually inaccessible
  • Preimages in the (low-dim.) input space need not exist (there are approximation techniques)
  • Even if they exist, they are difficult to compute... and the space is non-linear anyway (so they would not really help)

slide-68
SLIDE 68

68

Kernel PCA

Complexity:

  • Need to solve an n × n eigenvalue problem
  • Memory and time are at least quadratic in n
  • Does not scale to large data sets
  • Can use approximation techniques
  • Idea: Landmark MDS
    • Compute the embedding on a small subset of landmark points (e.g., a random subset)
    • Use the Nyström formula to embed the other points
slide-69
SLIDE 69

Examples

slide-70
SLIDE 70

70

From the Paper

[Schölkopf et al. 1996]

slide-71
SLIDE 71

71

Where is the Swiss Roll?

(Figure: the roll; exp. kernel σ = 0.30·D; exp. kernel σ = 0.35·D; exp. kernel σ = 0.35·D (centered); poly. kernel (5th order), σ = 0.35·D.)

slide-72
SLIDE 72

72

What Else Can You Do?

Image Denoising via PCA:

slide-73
SLIDE 73

73

What Else Can You Do?

Does not work without correspondences:

slide-74
SLIDE 74

74

What Else Can You Do?

Shift invariant comparison kernel:

slide-75
SLIDE 75

75

MDS (Kernel PCA)

Shift invariant comparison kernel:

slide-76
SLIDE 76

76

References

  • B. Schölkopf, A. J. Smola, K.-R. Müller: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. In: Neural Computation, 10:1299-1319, 1998.
  • B. Schölkopf, S. Mika, C. J. C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, A. J. Smola: Input Space versus Feature Space in Kernel-Based Methods. In: IEEE Trans. on Neural Networks, 10:1000-1017, 1999.
  • C. Williams: On a Connection between Kernel PCA and Metric Multidimensional Scaling. In: Machine Learning, 46:11-19, 2002.
  • K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf: An Introduction to Kernel-Based Learning Algorithms. In: IEEE Trans. on Neural Networks, 12:181-201, 2001.
  • J. Shawe-Taylor, N. Cristianini: Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
  • T. Cox, M. Cox: Multi-Dimensional Scaling. Chapman & Hall, 1994.
slide-77
SLIDE 77

Iteratively Reweighted Least Squares

slide-78
SLIDE 78

78

General Error Distributions

Problem with least-squares:

  • Quadratic error measure
  • The farther away, the stronger the attraction force
  • Outliers are disastrous

(Figure: Gaussian likelihood, its negative log-likelihood, and the derivative, with an outlier marked.)
slide-79
SLIDE 79

79

Outliers

Problem:

  • Outliers are rather common
  • 3D scanners: weird outlier points at random locations (shiny surfaces, transmission errors, etc.)
  • Least squares does not work well

(Figure: least-squares line fits to samples of y = x − 0.5 plus noise; left: uniform noise, right: 2 outliers out of 41 points.)

slide-80
SLIDE 80

80

Robust Estimators

Problem:

  • Least squares criterion too strict
  • More robust: absolute distance (“l1-norm”)
  • “M-estimator”: Truncated squared/absolute distance

(Figure: error potentials: squared distance ("l2-norm error"), absolute distance ("l1-norm error"), truncated distance.)

slide-81
SLIDE 81

81

Implementation

Implementation: Iteratively reweighted least-squares

  • Reminder: weighted least-squares
  • Iteratively reweighted least-squares:
  • Compute least-squares fit
  • Compute weights (dependent on solution)
  • Recompute / iterate until convergence
  • L1 norm (absolute distance) weights:

  \tilde f = \arg\min_{\tilde f} \sum_{i=1}^{n} \mathrm{weight}_i \,\bigl( \tilde f(x_i) - y_i \bigr)^{2}

  \tilde f^{(k+1)} = \arg\min_{\tilde f} \sum_{i=1}^{n} \mathrm{weight}_i^{(k)} \,\bigl( \tilde f(x_i) - y_i \bigr)^{2},
  \qquad
  \mathrm{weight}_i^{(k)} := \frac{1}{\bigl| \tilde f^{(k)}(x_i) - y_i \bigr|}
  \qquad \text{(iteration } k\text{)}

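A sketch of this iteration for an (approximately) L1-optimal fit, assuming a monomial basis; eps is a small guard against division by zero:

# Sketch: iteratively reweighted least squares. Alternate between a weighted
# least-squares solve and recomputing the weights from the current residuals.
import numpy as np

def irls_l1_fit(x, y, k, iters=20, eps=1e-6):
    B = np.vander(x, N=k, increasing=True)        # basis functions b_j(x_i)
    w = np.ones_like(y)                           # iteration 0: plain least squares
    for _ in range(iters):
        A = B.T @ (w[:, None] * B)                # weighted normal equations
        lam = np.linalg.solve(A, B.T @ (w * y))
        r = B @ lam - y                           # residuals of the current fit
        w = 1.0 / np.maximum(np.abs(r), eps)      # L1 weights: 1 / |residual|
    return lam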
slide-82
SLIDE 82

82

L1-Norm Fitting

L1-Norm Fitting:

  • Convergence guaranteed (convex objective function)
  • A mixture of l1 (far) and l2 (near) norms also works

General M-Estimators:

  • No convergence guarantees for non-convex potentials
  • In particular: truncated potentials may converge to a local extremum only (depends on initialization)

  \arg\min_{\tilde f^{(k)}} \sum_{i=1}^{n}
  \frac{1}{\bigl|\tilde f^{(k-1)}(x_i) - y_i\bigr|}\,
  \bigl( \tilde f^{(k)}(x_i) - y_i \bigr)^{2}
  \;\approx\;
  \arg\min_{\tilde f^{(k)}} \sum_{i=1}^{n} \bigl| \tilde f^{(k)}(x_i) - y_i \bigr|

slide-83
SLIDE 83

83

Example (Schematic)

(Figure: robust fit of a line to samples of y = x − 0.5 plus noise and outliers. First iteration: least squares, then truncation. Second iteration: improved solution.)