Lecture 25: Autoencoders and Kernel PCA (Aykut Erdem, January 2017)



SLIDE 1

Lecture 25:

− Autoencoders
− Kernel PCA

Aykut Erdem

January 2017 Hacettepe University

SLIDE 2

Today

  • Motivation
  • PCA algorithms
  • Applications
  • PCA shortcomings
  • Autoencoders
  • Kernel PCA

SLIDE 3

Autoencoders

SLIDE 4

Relation to Neural Networks

  • PCA is closely related to a particular form of neural network

  • An autoencoder is a neural network whose target outputs are its own inputs

  • The goal is to minimize reconstruction error

slide by Sanja Fidler

SLIDE 5

Autoencoders

  • Define
    $z = f(Wx), \quad \hat{x} = g(Vz)$

slide by Sanja Fidler

SLIDE 6

Autoencoders

  • Define
    $z = f(Wx), \quad \hat{x} = g(Vz)$

  • Goal:
    $\min_{W,V} \; \frac{1}{2N} \sum_{n=1}^{N} \big\| x^{(n)} - \hat{x}^{(n)} \big\|^2$

slide by Sanja Fidler

SLIDE 7

Autoencoders

  • Define
    $z = f(Wx), \quad \hat{x} = g(Vz)$

  • Goal:
    $\min_{W,V} \; \frac{1}{2N} \sum_{n=1}^{N} \big\| x^{(n)} - \hat{x}^{(n)} \big\|^2$

  • If g and f are linear:
    $\min_{W,V} \; \frac{1}{2N} \sum_{n=1}^{N} \big\| x^{(n)} - V W x^{(n)} \big\|^2$

slide by Sanja Fidler

SLIDE 8

Autoencoders

  • Define
    $z = f(Wx), \quad \hat{x} = g(Vz)$

  • Goal:
    $\min_{W,V} \; \frac{1}{2N} \sum_{n=1}^{N} \big\| x^{(n)} - \hat{x}^{(n)} \big\|^2$

  • If g and f are linear:
    $\min_{W,V} \; \frac{1}{2N} \sum_{n=1}^{N} \big\| x^{(n)} - V W x^{(n)} \big\|^2$

  • In other words, the optimal solution is PCA (see the numerical sketch below)

slide by Sanja Fidler
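
A minimal numerical sketch of this equivalence (my own illustration, not from the slides; the data, dimensions, and the plain gradient-descent loop are assumptions): train a linear autoencoder on the reconstruction error and check that the learned decoder spans the same subspace as the top principal components.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: N points in D dimensions with a decaying spectrum, mean-centered
    N, D, K = 500, 10, 3
    X = rng.normal(size=(N, D)) * np.linspace(3.0, 0.5, D)
    X -= X.mean(axis=0)

    # Linear autoencoder z = Wx, x_hat = Vz, trained by gradient descent
    W = 0.1 * rng.normal(size=(K, D))
    V = 0.1 * rng.normal(size=(D, K))
    lr = 0.05
    for _ in range(2000):
        Z = X @ W.T                # encode all points
        R = Z @ V.T - X            # reconstruction residuals
        gV = R.T @ Z / N           # gradient of (1/2N) sum ||x - VWx||^2 w.r.t. V
        gW = V.T @ R.T @ X / N     # gradient w.r.t. W
        V -= lr * gV
        W -= lr * gW

    # PCA: top-K right singular vectors of the centered data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)

    # Cosines of the principal angles between the decoder's column space and the
    # PCA subspace; values near 1 mean the two subspaces coincide
    Q1, _ = np.linalg.qr(V)
    Q2, _ = np.linalg.qr(Vt[:K].T)
    print(np.linalg.svd(Q1.T @ Q2, compute_uv=False))

If the subspaces match, the three printed values should all be close to 1; the individual columns of V need not equal the eigenvectors, since VW is only determined up to an invertible reparameterization of the code.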

SLIDE 9

Autoencoders: Nonlinear PCA

  • What if g(·) is not linear?
  • Then we are basically doing nonlinear PCA
  • There are some subtleties, but in general this is an accurate description (a small sketch follows below)

slide by Sanja Fidler
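
To make the connection concrete (again a hypothetical sketch, not from the slides), the only change relative to the linear case above is a nonlinearity inside the network; the reconstruction objective itself is unchanged. Here the encoder uses tanh and the decoder stays linear, an arbitrary illustrative choice.

    import numpy as np

    def forward(X, W, V):
        # Nonlinear encoder, linear decoder: z = tanh(Wx), x_hat = Vz
        # (any differentiable nonlinear f and g can be used the same way)
        Z = np.tanh(X @ W.T)
        return Z @ V.T

    def reconstruction_loss(X, W, V):
        # Same objective as before: (1/2N) sum_n ||x_n - x_hat_n||^2
        Xhat = forward(X, W, V)
        return 0.5 * np.mean(np.sum((X - Xhat) ** 2, axis=1))

Training proceeds exactly as before, by gradient descent (in practice via automatic differentiation); removing the nonlinearity recovers the linear case and hence PCA.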

SLIDE 10

Comparing Reconstructions

[Figure: reconstructions compared for real data, a 30-d deep autoencoder, 30-d logistic PCA, and 30-d PCA]

slide by Sanja Fidler

SLIDE 11

Kernel PCA

SLIDE 12

Dimensionality Reduction

  • Data representation
    • Inputs are real-valued vectors in a high-dimensional space.

  • Linear structure
    • Does the data live in a low-dimensional subspace?

  • Nonlinear structure
    • Does the data live on a low-dimensional submanifold?

slide by Rita Osadchy

SLIDE 13

The “magic” of high dimensions

  • Given some problem, how do we know what classes of functions are capable of solving that problem?

  • VC (Vapnik-Chervonenkis) theory tells us that often mappings which take us into a higher-dimensional space than the dimension of the input space provide us with greater classification power.

slide by Rita Osadchy

SLIDE 14

Example in $\mathbb{R}^2$

These classes are linearly inseparable in the input space.

slide by Rita Osadchy

SLIDE 15

Example: High-Dimensional Mapping

We can make the problem linearly separable by a simple mapping (demonstrated in the snippet below):

$\Phi : \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto \big( x_1,\; x_2,\; x_1^2 + x_2^2 \big)$

slide by Rita Osadchy
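
A small demonstration of this idea (illustrative only; the two-ring data and the separating threshold are my assumptions, not taken from the slide): points inside and outside a circle are not linearly separable in the plane, but after adding the coordinate x1^2 + x2^2 a single plane separates them.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two classes in R^2: an inner disk and an outer ring (not linearly separable)
    n = 200
    r = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
    theta = rng.uniform(0.0, 2 * np.pi, 2 * n)
    X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    y = np.concatenate([np.zeros(n), np.ones(n)])   # 0 = inner, 1 = outer

    # Map to R^3: (x1, x2) -> (x1, x2, x1^2 + x2^2)
    Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

    # In the mapped space the third coordinate alone separates the classes:
    # the plane z = 1.5^2 classifies every point correctly here.
    pred = (Phi[:, 2] > 1.5 ** 2).astype(float)
    print("separable by a plane:", bool(np.all(pred == y)))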

SLIDE 16

Kernel Trick

  • High-dimensional mapping can seriously increase computation time.

  • Can we get around this problem and still get the benefit of high dimensions?

  • Yes! The kernel trick (see the numerical check below):
    $K(x_i, x_j) = \phi(x_i)^{T} \phi(x_j)$

  • Given any algorithm that can be expressed solely in terms of dot products, this trick allows us to construct different nonlinear versions of it.

slide by Rita Osadchy
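
A quick numerical check of the trick for one concrete case (an assumption for illustration; the slide does not fix a particular kernel): the degree-2 polynomial kernel K(x, y) = (x^T y)^2 on R^2 equals the ordinary dot product of the explicit feature map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so the kernel delivers the high-dimensional inner product without ever forming phi.

    import numpy as np

    def phi(x):
        # Explicit feature map of the degree-2 polynomial kernel on R^2
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    def poly_kernel(x, y):
        # The same inner product computed directly in the input space
        return float(x @ y) ** 2

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=2), rng.normal(size=2)
    print(phi(x) @ phi(y))     # inner product in feature space
    print(poly_kernel(x, y))   # identical value via the kernel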

SLIDE 17

Popular Kernels


slide by Rita Osadchy
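
The kernel table on this slide is an image that did not survive extraction; the functions below are a standard set of commonly used kernels (my own listing with the usual parameterizations, included for reference rather than taken from the slide).

    import numpy as np

    def linear_kernel(x, y):
        return x @ y

    def polynomial_kernel(x, y, degree=3, c=1.0):
        # (x^T y + c)^degree
        return (x @ y + c) ** degree

    def rbf_kernel(x, y, sigma=1.0):
        # Gaussian / radial basis function kernel
        return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

    def sigmoid_kernel(x, y, alpha=1.0, c=0.0):
        # tanh kernel (not positive semi-definite for all parameter choices)
        return np.tanh(alpha * (x @ y) + c)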

SLIDE 18

Kernel Principal Component Analysis

  • Extends conventional principal component analysis (PCA) to a high-dimensional feature space using the “kernel trick”.

  • Can extract up to n (number of samples) nonlinear principal components without expensive computations.

slide by Rita Osadchy

SLIDE 19

Making PCA Non-Linear

  • Suppose that instead of using the points $x_i$ we would first map them to some nonlinear feature space $\phi(x_i)$.

  • E.g. using polar coordinates instead of Cartesian coordinates would help us deal with the circle.

  • Extract principal components in that space (PCA).

  • The result will be non-linear in the original data space!

slide by Rita Osadchy

SLIDE 20

Derivation

  • Suppose that the mean of the data in the feature space is zero:
    $\mu = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) = 0$

  • Covariance:
    $C = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i)\, \phi(x_i)^{T}$

  • Eigenvectors:
    $C v = \lambda v$

slide by Rita Osadchy

SLIDE 21

Derivation

  • Eigenvectors can be expressed as a linear combination of features:
    $v = \sum_{i=1}^{n} \alpha_i\, \phi(x_i)$

  • Proof:
    $\lambda v = C v = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i)\, \phi(x_i)^{T} v$

    thus

    $v = \frac{1}{\lambda n} \sum_{i=1}^{n} \phi(x_i)\, \big( \phi(x_i)^{T} v \big) = \frac{1}{\lambda n} \sum_{i=1}^{n} \big( \phi(x_i)^{T} v \big)\, \phi(x_i)$,

    i.e. $v$ is a linear combination of the $\phi(x_i)$ with coefficients $\alpha_i = \frac{1}{\lambda n}\, \phi(x_i)^{T} v$.

slide by Rita Osadchy

SLIDE 22

Showing that

$x_t\, x_t^{T} v = \big( x_t^{T} v \big) \cdot x_t$

(the inner product $x_t^{T} v$ is a scalar, so it can be pulled out in front of $x_t$)

slide by Rita Osadchy

SLIDE 23

Showing that

$x_t\, x_t^{T} v = \big( x_t^{T} v \big) \cdot x_t$

slide by Rita Osadchy

SLIDE 24

Derivation

  • So, from before we had
    $v = \frac{1}{\lambda n} \sum_{i=1}^{n} \phi(x_i)\, \big( \phi(x_i)^{T} v \big) = \frac{1}{\lambda n} \sum_{i=1}^{n} \big( \phi(x_i)^{T} v \big)\, \phi(x_i)$
    where $\phi(x_i)^{T} v$ is just a scalar.

  • This means that all solutions $v$ with $\lambda \neq 0$ lie in the span of $\phi(x_1), \ldots, \phi(x_n)$, i.e.
    $v = \sum_{i=1}^{n} \alpha_i\, \phi(x_i)$

  • Finding the eigenvectors is equivalent to finding the coefficients $\alpha_i$

slide by Rita Osadchy

SLIDE 25

Derivation

  • By substituting $v_j = \sum_{l=1}^{n} \alpha_{jl}\, \phi(x_l)$ back into the equation we get:
    $\frac{1}{n} \sum_{i=1}^{n} \phi(x_i)\, \phi(x_i)^{T} \Big( \sum_{l=1}^{n} \alpha_{jl}\, \phi(x_l) \Big) = \lambda_j \sum_{l=1}^{n} \alpha_{jl}\, \phi(x_l)$

  • We can rewrite it as
    $\frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \sum_{l=1}^{n} \alpha_{jl}\, K(x_i, x_l) = \lambda_j \sum_{l=1}^{n} \alpha_{jl}\, \phi(x_l)$

  • Multiply this by $\phi(x_k)^{T}$ from the left (see the note below):
    $\frac{1}{n} \sum_{i=1}^{n} \phi(x_k)^{T} \phi(x_i) \sum_{l=1}^{n} \alpha_{jl}\, K(x_i, x_l) = \lambda_j \sum_{l=1}^{n} \alpha_{jl}\, \phi(x_k)^{T} \phi(x_l)$

slide by Rita Osadchy
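
A bridging step the deck leaves implicit (my addition): since $\phi(x_k)^{T}\phi(x_i) = K(x_k, x_i)$, the last line reads, for each $k$,

$$\frac{1}{n} \sum_{i=1}^{n} K(x_k, x_i) \sum_{l=1}^{n} \alpha_{jl}\, K(x_i, x_l) = \lambda_j \sum_{l=1}^{n} \alpha_{jl}\, K(x_k, x_l),$$

and stacking these equations for $k = 1, \ldots, n$ gives exactly the matrix equation $K^2 \alpha_j = n \lambda_j K \alpha_j$ of the next slide.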

SLIDE 26

Derivation

  • By plugging in the kernel and rearranging we get:
    $K^2 \alpha_j = n \lambda_j K \alpha_j$

  • We can remove a factor of $K$ from both sides (this only affects eigenvectors with zero eigenvalue, which will not be a principal component anyway):
    $K \alpha_j = n \lambda_j \alpha_j$

  • We have a normalization condition for the $\alpha_j$ vectors:
    $v_j^{T} v_j = 1 \;\Rightarrow\; \sum_{k=1}^{n} \sum_{l=1}^{n} \alpha_{jl}\, \alpha_{jk}\, \phi(x_l)^{T} \phi(x_k) = 1 \;\Rightarrow\; \alpha_j^{T} K \alpha_j = 1$

slide by Rita Osadchy

SLIDE 27

Derivation

  • By multiplying $K \alpha_j = n \lambda_j \alpha_j$ by $\alpha_j^{T}$ and using the normalization condition we get:
    $\lambda_j\, n\, \alpha_j^{T} \alpha_j = 1, \quad \forall j$

  • For a new point $x$, its projection onto the principal components is (see the code sketch below):
    $\phi(x)^{T} v_j = \sum_{i=1}^{n} \alpha_{ji}\, \phi(x)^{T} \phi(x_i) = \sum_{i=1}^{n} \alpha_{ji}\, K(x, x_i)$

slide by Rita Osadchy
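
In code these two steps amount to rescaling the eigenvectors of K and then evaluating kernel values against the training points. A minimal sketch of just this step, assuming a kernel matrix K for n training points and a kernel(x, y) function are already available (both are placeholders, and the data is assumed centered in feature space):

    import numpy as np

    def kpca_components(K, num_components):
        # Eigendecomposition of the (already centered) kernel matrix:
        # K alpha_j = n lambda_j alpha_j, so each eigenvalue mu_j below equals n*lambda_j
        mu, A = np.linalg.eigh(K)
        order = np.argsort(mu)[::-1][:num_components]
        mu, A = mu[order], A[:, order]        # assumes the kept eigenvalues are positive
        # Enforce lambda_j * n * alpha_j^T alpha_j = 1  =>  alpha_j = a_j / sqrt(mu_j)
        alphas = A / np.sqrt(mu)
        return alphas, mu

    def project(x, X_train, alphas, kernel):
        # Projection of a point x onto component j: sum_i alpha_{ji} K(x, x_i)
        # (centering of these kernel values against the training data, which the
        # next slides discuss, is omitted here for brevity)
        k = np.array([kernel(x, xi) for xi in X_train])
        return alphas.T @ k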

SLIDE 28

Normalizing the feature space

  • In general, $\phi(x_i)$ may not be zero mean.

  • Centered features:
    $\tilde{\phi}(x_i) = \phi(x_i) - \frac{1}{n} \sum_{k=1}^{n} \phi(x_k)$

  • The corresponding kernel is:
    $\tilde{K}(x_i, x_j) = \tilde{\phi}(x_i)^{T} \tilde{\phi}(x_j) = \Big( \phi(x_i) - \frac{1}{n} \sum_{k=1}^{n} \phi(x_k) \Big)^{T} \Big( \phi(x_j) - \frac{1}{n} \sum_{k=1}^{n} \phi(x_k) \Big)$

    $= K(x_i, x_j) - \frac{1}{n} \sum_{k=1}^{n} K(x_i, x_k) - \frac{1}{n} \sum_{k=1}^{n} K(x_j, x_k) + \frac{1}{n^2} \sum_{k,l=1}^{n} K(x_l, x_k)$

slide by Rita Osadchy

SLIDE 29

Normalizing the feature space

  • From the previous slide,
    $\tilde{K}(x_i, x_j) = K(x_i, x_j) - \frac{1}{n} \sum_{k} K(x_i, x_k) - \frac{1}{n} \sum_{k} K(x_j, x_k) + \frac{1}{n^2} \sum_{k,l} K(x_l, x_k)$

  • In matrix form (verified numerically below):
    $\tilde{K} = K - \mathbf{1}_{1/n} K - K\, \mathbf{1}_{1/n} + \mathbf{1}_{1/n} K\, \mathbf{1}_{1/n}$

    where $\mathbf{1}_{1/n}$ is the $n \times n$ matrix with all elements equal to $1/n$.

slide by Rita Osadchy
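
A small numerical check of the centering formula (my own illustration; the random data is an assumption): for the linear kernel the feature map is the identity, so centering the kernel matrix with the expression above must agree with the Gram matrix of explicitly centered features.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 6, 3
    X = rng.normal(size=(n, d))

    # Linear kernel, so phi(x) = x and features can be centered explicitly
    K = X @ X.T
    one_n = np.full((n, n), 1.0 / n)                       # all entries equal to 1/n
    K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    Xc = X - X.mean(axis=0)                                # explicitly centered features
    print(np.allclose(K_tilde, Xc @ Xc.T))                 # expected: True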

SLIDE 30

Summary of Kernel PCA

  • Pick a kernel $K(\cdot,\cdot)$ (an end-to-end sketch of the full recipe follows after this slide)

  • Construct the normalized (centered) kernel matrix of the data (dimension $n \times n$):
    $\tilde{K} = K - \mathbf{1}_{1/n} K - K\, \mathbf{1}_{1/n} + \mathbf{1}_{1/n} K\, \mathbf{1}_{1/n}$

  • Solve the eigenvalue problem:
    $\tilde{K} \alpha_i = \lambda_i \alpha_i$

  • For any data point (new or old), we can represent it as
    $y_j = \sum_{i=1}^{n} \alpha_{ji}\, K(x, x_i), \quad j = 1, \ldots, d$

slide by Rita Osadchy
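
Putting the whole recipe together as one function: a minimal sketch under my own choices (Gaussian kernel, toy two-ring data); the slides do not fix these details, and a library implementation such as scikit-learn's KernelPCA handles the numerical corner cases more carefully.

    import numpy as np

    def rbf_kernel_matrix(X, Y, sigma=1.0):
        # Pairwise Gaussian kernel values between the rows of X and the rows of Y
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    def kernel_pca(X, num_components=2, sigma=1.0):
        n = X.shape[0]
        K = rbf_kernel_matrix(X, X, sigma)
        # Center the kernel matrix
        one_n = np.full((n, n), 1.0 / n)
        K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n
        # Eigenvalue problem K~ alpha = lambda alpha; keep the largest components
        lam, A = np.linalg.eigh(K_tilde)
        order = np.argsort(lam)[::-1][:num_components]
        lam, A = lam[order], A[:, order]
        # Scale alphas so the corresponding feature-space eigenvectors have unit norm
        alphas = A / np.sqrt(lam)
        # Projections of the training points onto the kernel principal components
        return K_tilde @ alphas, alphas, lam

    # Illustrative usage on toy two-ring data (assumed, not from the slides)
    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 200)
    radius = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)]) + rng.normal(0, 0.05, 200)
    X = np.stack([radius * np.cos(theta), radius * np.sin(theta)], axis=1)
    Y, alphas, lam = kernel_pca(X, num_components=2, sigma=1.0)
    print(Y[:3])   # the first component should separate the two rings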

SLIDE 31

Input points before kernel PCA


http://en.wikipedia.org/wiki/Kernel_principal_component_analysis

slide by Rita Osadchy

SLIDE 32

Output after kernel PCA


The three groups are distinguishable using the first component only

slide by Rita Osadchy

SLIDE 33

Example: De-noising images

slide by Rita Osadchy

SLIDE 34

Properties of KPCA

  • Kernel PCA can give a good re-encoding of the data when it lies along a non-linear manifold.

  • The kernel matrix is $n \times n$, so kernel PCA will have difficulties if we have lots of data points.

slide by Rita Osadchy