slide-1
SLIDE 1

MACHINE LEARNING - Doctoral Class - EDOC

MACHINE LEARNING GPLVM, LWPR

Slides for GPLVM are adapted from Neil Lawrence's lectures; slides for LWPR are adapted from Stefan Schaal's lectures

slide-2
SLIDE 2

MACHINE LEARNING - Doctoral Class - EDOC

Why reduce data dimensionality?

Raw data: trying to find some structure in the data…

slide-3
SLIDE 3

MACHINE LEARNING - Doctoral Class - EDOC

Why reduce data dimensionality?

Reduce the dimensionality of the dataset at hand so that subsequent computation is more tractable.

Idea: only a few of the dimensions matter; the projections of the data along the remaining dimensions convey no structure of the data (already a form of generalization).

slide-4
SLIDE 4

MACHINE LEARNING - Doctoral Class - EDOC

Revision of PCA and kPCA

We have already seen PCA and kernel PCA in the previous lectures.

PCA
  • finds a linear projected space
  • maximizes the variance in the projected space / minimizes the squared reconstruction error

kPCA
  • finds a non-linear projected space
  • maximizes the variance in the projected space / minimizes the squared reconstruction error

BUT… PCA was extended with:

slide-5
SLIDE 5

MACHINE LEARNING - Doctoral Class - EDOC

Two-sided projection: what if we need to manipulate data in the projected space and then reconstruct it back to the original space?

Why Probabilistic Dimensionality Reduction?

kPCA does not define the projection sending points back from feature space to the original data space.

slide-6
SLIDE 6

MACHINE LEARNING - Doctoral Class - EDOC

Why Probabilistic Dimensionality Reduction?

Non-probabilistic treatment -> no systematic way to process missing data. One usually either discards the incomplete data (time consuming with huge amounts of data) or completes the missing values using a variety of interpolation methods. Probabilistic PCA allows handling missing data through an E-M approach to computing the principal projections. Missing data: real-world data are often incomplete (e.g. the state vector of a robot consists of several measurements delivered by the sensors; due to network problems, some sensor readings are lost).

slide-7
SLIDE 7

MACHINE LEARNING - Doctoral Class - EDOC

Notation

q : dimension of the latent / embedded space
N : dimension of the data space
M : number of data points

Y = [y_1, …, y_M]^T = [y_{:,1}, …, y_{:,N}] ∈ R^{M×N} : centered original data
X = [x_1, …, x_M]^T = [x_{:,1}, …, x_{:,q}] ∈ R^{M×q} : projected (latent) data
W ∈ R^{N×q} : mapping matrix
slide-8
SLIDE 8

MACHINE LEARNING - Doctoral Class - EDOC

Linear Dimensionality Reduction: Recap

Represent the data Y with a lower dimensional set of latent variables X. Assume a linear relationship of the form

y_{i,:} = W x_{i,:} + η_{i,:},   where η_{i,:} ~ N(0, σ² I)

slide-9
SLIDE 9

MACHINE LEARNING - Doctoral Class - EDOC

Linear Dimensionality Reduction: Recap

Represent the data Y with a lower dimensional set of latent variables X. Assume a linear relationship of the form

y_{i,:} = W x_{i,:} + η_{i,:},   where η_{i,:} ~ N(0, σ² I)

W is the mapping matrix.

slide-10
SLIDE 10

MACHINE LEARNING - Doctoral Class - EDOC

Linear Dimensionality Reduction: Recap

Represent the data Y with a lower dimensional set of latent variables X. Assume a linear relationship of the form

y_{i,:} = W x_{i,:} + η_{i,:},   where η_{i,:} ~ N(0, σ² I)

η_{i,:} is the Gaussian noise term.

slide-11
SLIDE 11

MACHINE LEARNING - Doctoral Class - EDOC

Probabilistic Regression: Recap

A statistical approach to classical linear regression estimates the relationship between zero-mean variables y and x by building a linear model of the form:

y = f(x, w) = x^T w,   w, x ∈ R^N

If one assumes that the observed values of y differ from f(x) by an additive noise ε that follows a zero-mean Gaussian distribution (such an assumption amounts to putting a prior distribution over the noise), then:

y = f(x, w) + ε = x^T w + ε,   with ε ~ N(0, σ²)

slide-12
SLIDE 12

MACHINE LEARNING - Doctoral Class - EDOC

Probabilistic Regression: Recap

Consider a training set of M pairs of data points {x_i, y_i}_{i=1}^M, with each pair independently and identically distributed (i.i.d.) according to a Gaussian distribution. The likelihood of the regressive model y = x^T w + ε, ε ~ N(0, σ²), is given by computing the probability density of each training pair given the parameters of the model (w, σ):

p(ŷ | x, w, σ) = ∏_{i=1}^{M} p(y_i | x_i, w, σ) = ∏_{i=1}^{M} (1 / √(2πσ²)) exp( −(y_i − x_i^T w)² / (2σ²) )
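To make this likelihood concrete, here is a minimal numpy sketch that evaluates the Gaussian log-likelihood of a linear model on toy data; the function name, the toy data and the noise level are illustrative assumptions, not part of the original lecture.

# Minimal sketch: log-likelihood of a linear model under i.i.d. Gaussian noise.
# Assumes y_i = w^T x_i + eps, eps ~ N(0, sigma^2); names are illustrative.
import numpy as np

def linear_log_likelihood(X, y, w, sigma):
    """X: (M, N) inputs, y: (M,) targets, w: (N,) weights, sigma: noise std."""
    residuals = y - X @ w                      # y_i - x_i^T w for every pair
    log_probs = -0.5 * np.log(2 * np.pi * sigma**2) - residuals**2 / (2 * sigma**2)
    return log_probs.sum()                     # sum of log N(y_i | x_i^T w, sigma^2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(linear_log_likelihood(X, y, w_true, 0.1))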

slide-13
SLIDE 13

MACHINE LEARNING - Doctoral Class - EDOC

Probabilistic Regression: Recap

In the Bayesian formalism, one would also specify a prior over the parameter w. A typical choice is a zero-mean Gaussian prior with fixed covariance:

p(w) = N(w | 0, Σ_w)

One can then compute the posterior distribution over the parameter w using Bayes' theorem:

p(w | y, x) = p(y | x, w) p(w) / p(y | x)       (posterior = likelihood × prior / marginal likelihood)

The value of w that maximizes the posterior distribution is called the maximum a posteriori (MAP) estimate of w.

slide-14
SLIDE 14

MACHINE LEARNING - Doctoral Class - EDOC

Probabilistic PCA: Recap

Assumptions: all datapoints in Y are i.i.d., and the relationship between the data and the latent variables is Gaussian:

p(Y | X, W) = ∏_{i=1}^{M} N(y_{i,:} | W x_{i,:}, σ² I)

slide-15
SLIDE 15

MACHINE LEARNING - Doctoral Class - EDOC

Probabilistic PCA: Recap

The marginal likelihood can be obtained in two ways: marginalizing over the latent variables (PPCA) or marginalizing over the parameters (Dual PPCA).

p(Y | X, W) = ∏_{i=1}^{M} N(y_{i,:} | W x_{i,:}, σ² I)

Define a Gaussian prior over the latent space X:

p(X) = ∏_{i=1}^{M} N(x_{i,:} | 0, I)

Integrate out the latent variables:

p(Y | W) = ∏_{i=1}^{M} N(y_{i,:} | 0, W W^T + σ² I)

Parameter optimization: maximize p(Y | W) with respect to W and σ².
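A minimal sketch of how this marginal likelihood can be evaluated numerically, assuming centered data and using scipy's multivariate normal; the toy data and parameter values are illustrative only.

# Sketch: evaluate the PPCA marginal likelihood p(Y|W) = prod_i N(y_i | 0, W W^T + sigma^2 I)
# for centered data Y (M x N). Names and toy data are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

def ppca_log_marginal(Y, W, sigma2):
    N = Y.shape[1]
    C = W @ W.T + sigma2 * np.eye(N)           # N x N marginal covariance
    return multivariate_normal(mean=np.zeros(N), cov=C).logpdf(Y).sum()

rng = np.random.default_rng(1)
M, N, q = 200, 5, 2
X_lat = rng.standard_normal((M, q))
W = rng.standard_normal((N, q))
Y = X_lat @ W.T + 0.1 * rng.standard_normal((M, N))
Y -= Y.mean(axis=0)                            # PPCA assumes centered data
print(ppca_log_marginal(Y, W, 0.01))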

slide-16
SLIDE 16

MACHINE LEARNING - Doctoral Class - EDOC

E-M for PCA

Probabilistic PCA: Recap

During the E-step of the EM algorithm, if the data matrix is incomplete, one can replace the missing elements with the mean (which affects the estimate very little) or with an extreme value (which increases the uncertainty of the model).

slide-17
SLIDE 17

MACHINE LEARNING - Doctoral Class - EDOC

E-M for PCA

Probabilistic PCA: Revision

This is OK only if the number of missing values is small. If the number of missing values for Y is large, one instead finds the projection y* which minimizes the MSE while ensuring that the solution lies in the sub-space spanned by the known entries of y.

slide-18
SLIDE 18

MACHINE LEARNING - Doctoral Class - EDOC

Dual Probabilistic PCA

p(Y | X, W) = ∏_{i=1}^{M} N(y_{i,:} | W x_{i,:}, σ² I)

Define a Gaussian prior over the parameters W:

p(W) = ∏_{i=1}^{N} N(w_{i,:} | 0, I)

Integrate out the parameters:

p(Y | X) = ∏_{i=1}^{N} N(y_{:,i} | 0, X X^T + σ² I)

Latent variables optimization: maximize p(Y | X) with respect to X.

slide-19
SLIDE 19

MACHINE LEARNING - Doctoral Class - EDOC

Dual Probabilistic PCA II

The marginalized likelihood takes the form:

p(Y | X, σ) = ∏_{i=1}^{N} N(y_{:,i} | 0, X X^T + σ² I)

To find the latent variables, optimize the log-likelihood

L = −(MN/2) ln 2π − (N/2) ln|K| − (1/2) tr(K^{-1} Y Y^T),   with K = X X^T + σ² I

The gradient of L w.r.t. the latent variables:

∂L/∂X = K^{-1} Y Y^T K^{-1} X − N K^{-1} X
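A small numpy sketch of the log-likelihood and gradient above, written directly from the formulas on this slide; the toy data are illustrative.

# Sketch: dual PPCA log-likelihood L and its gradient w.r.t. the latent X,
# with K = X X^T + sigma^2 I. Names and toy data are illustrative.
import numpy as np

def dual_ppca_loglik_and_grad(X, Y, sigma2):
    M, N = Y.shape
    K = X @ X.T + sigma2 * np.eye(M)           # M x M covariance over data points
    Kinv = np.linalg.inv(K)
    YYt = Y @ Y.T
    _, logdet = np.linalg.slogdet(K)
    L = -0.5 * M * N * np.log(2 * np.pi) - 0.5 * N * logdet - 0.5 * np.trace(Kinv @ YYt)
    dL_dX = Kinv @ YYt @ Kinv @ X - N * (Kinv @ X)
    return L, dL_dX

rng = np.random.default_rng(2)
Y = rng.standard_normal((30, 5))
X = rng.standard_normal((30, 2))
L, g = dual_ppca_loglik_and_grad(X, Y, 0.1)
print(L, g.shape)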

slide-20
SLIDE 20

MACHINE LEARNING - Doctoral Class - EDOC

Dual Probabilistic PCA II

Introducing the substitution S = N^{-1} Y Y^T, we rewrite the gradient of the likelihood as

∂L/∂X = N ( K^{-1} S K^{-1} X − K^{-1} X )

or, setting the gradient to zero at a stationary point,

S (X X^T + σ² I)^{-1} X = X

slide-21
SLIDE 21

MACHINE LEARNING - Doctoral Class - EDOC

Dual Probabilistic PCA II

Introducing the substitution S = N^{-1} Y Y^T, we rewrite the gradient of the likelihood as

∂L/∂X = N ( K^{-1} S K^{-1} X − K^{-1} X ),   or   S (X X^T + σ² I)^{-1} X = X

Let us consider the singular value decomposition X = U L V^T; substituting it, we can rewrite the equation for the latent X as

S U [σ² L^{-1} + L]^{-1} V^T = U L V^T,   which gives   S U = U (σ² I + L²)

The solution is invariant to V.

slide-22
SLIDE 22

MACHINE LEARNING - Doctoral Class - EDOC

Dual Probabilistic PCA II

S U = U (σ² I + L²)

S is the covariance matrix of the observable data, U contains the eigenvectors of S, and Λ is the matrix of eigenvalues of S. This implies that the elements on the diagonal of L are given by

l_i = (λ_i − σ²)^{1/2}

We still need to find the matrices in the decomposition X = U L V^T.

slide-23
SLIDE 23

MACHINE LEARNING - Doctoral Class - EDOC

Equivalence of PPCA and Dual PPCA

Solution for PPCA:       Y^T Y U_W = U_W Λ,   W = U_W L V^T
Solution for Dual PPCA:  Y Y^T U = U Λ,       X = U L V^T
Equivalence is of the form:   U_W = Y^T U Λ^{-1/2}

If one knows a solution of PPCA, then the solution of Dual PPCA can be computed directly through this equivalence. Marginalization over the latent variables and marginalization over the parameters are equivalent, but marginalization over the parameters allows for interesting extensions; see next.
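A quick numerical check of this equivalence on toy data (illustrative only): the eigenvectors of Y Y^T, rotated by Y^T U Λ^{-1/2}, are unit-norm eigenvectors of Y^T Y with the same eigenvalues.

# Sketch: numerical check of the PPCA / dual PPCA equivalence on toy data.
import numpy as np

rng = np.random.default_rng(3)
M, N, q = 50, 4, 2
Y = rng.standard_normal((M, N))
Y -= Y.mean(axis=0)

# dual PPCA side: eigen-decomposition of the M x M matrix Y Y^T
lam, U = np.linalg.eigh(Y @ Y.T)
lam, U = lam[::-1][:q], U[:, ::-1][:, :q]      # keep the q largest eigenpairs

# equivalence: rotate back to the PPCA (N x N) eigenproblem
U_W = Y.T @ U @ np.diag(lam ** -0.5)

# U_W should contain unit-norm eigenvectors of Y^T Y with the same eigenvalues
print(np.allclose((Y.T @ Y) @ U_W, U_W @ np.diag(lam)))   # True
print(np.allclose(np.linalg.norm(U_W, axis=0), 1.0))      # True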

slide-24
SLIDE 24

MACHINE LEARNING - Doctoral Class - EDOC

Gaussian Processes I

To recall, the marginal likelihood in Dual PPCA is given by:

p(Y | X) = ∏_{i=1}^{N} N(y_{:,i} | 0, X X^T + σ² I)

slide-25
SLIDE 25

MACHINE LEARNING - Doctoral Class - EDOC

Gaussian Processes I

To recall, the marginal likelihood in Dual PPCA is given by:

p(Y | X) = ∏_{i=1}^{N} N(y_{:,i} | 0, X X^T + σ² I)

Note that the following distribution is known as a Gaussian process:

p(y | X) = N(y | 0, K(X))

Hence the marginal likelihood in Dual PPCA is a product of independent Gaussian processes, one per data dimension.

slide-26
SLIDE 26

MACHINE LEARNING - Doctoral Class - EDOC

Gaussian Processes I

Let's have another look at the marginal likelihood in Dual PPCA:

p(Y | X) = ∏_{i=1}^{N} N(y_{:,i} | 0, X X^T + σ² I)

Note that the following distribution is known as a Gaussian process:

p(y | X) = N(y | 0, K(X))

Gaussian processes have many useful properties (e.g. for modeling non-linear functional dependencies); see the next lecture. For now, we just need to pay attention to the covariance function.

slide-27
SLIDE 27

MACHINE LEARNING - Doctoral Class - EDOC

Gaussian Processes I

Let's have another look at the marginal likelihood in Dual PPCA:

p(Y | X) = ∏_{i=1}^{N} N(y_{:,i} | 0, X X^T + σ² I)

Note that the following distribution is known as a Gaussian process:

p(y | X) = N(y | 0, K(X))

where K(X) is a matrix function known as the covariance matrix. Covariance functions form a special class of functions satisfying a number of constraints (again, we will discuss these constraints next time).

slide-28
SLIDE 28

MACHINE LEARNING - Doctoral Class - EDOC

Gaussian Processes III

For today’s lecture we need to know that:

  • There are several ‘popular’ covariance functions
  • New covariance functions can be defined as sums of other covariance functions
slide-29
SLIDE 29

MACHINE LEARNING - Doctoral Class - EDOC

Gaussian Processes III

Some ‘popular’ covariance functions:

Linear covariance:   K(X) = X X^T

RBF covariance:   k(x, x') = α₁ exp( −(α₂/2) ||x − x'||² )

Random functions generated by Gaussian processes with different covariance functions.
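A minimal sketch of the two covariance functions above and of drawing random functions from the corresponding zero-mean Gaussian processes; the parameter values and the jitter term are illustrative choices.

# Sketch: linear and RBF covariance functions, plus random GP samples.
import numpy as np

def linear_cov(X):
    return X @ X.T                                        # K(X) = X X^T

def rbf_cov(X, alpha1=1.0, alpha2=10.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    return alpha1 * np.exp(-0.5 * alpha2 * d2)            # alpha1 exp(-alpha2/2 ||x-x'||^2)

rng = np.random.default_rng(4)
X = np.linspace(-1, 1, 100)[:, None]
for K in (linear_cov(X), rbf_cov(X)):
    K_jittered = K + 1e-8 * np.eye(len(X))                # numerical stability
    sample = rng.multivariate_normal(np.zeros(len(X)), K_jittered)
    print(sample[:3])                                     # one random function per kernel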

slide-30
SLIDE 30

MACHINE LEARNING - Doctoral Class - EDOC

Non-linear Latent Variables Model

In the marginal likelihood of Dual PPCA,

p(Y | X) = ∏_{i=1}^{N} N(y_{:,i} | 0, X X^T + σ² I)

the covariance is a linear covariance function plus the noise term.

slide-31
SLIDE 31

MACHINE LEARNING - Doctoral Class - EDOC

Non-linear Latent Variables Model

p(Y | X) = ∏_{i=1}^{N} N(y_{:,i} | 0, K)

The marginal likelihood is a product of Gaussian processes. What if we use a non-linear covariance function (e.g. RBF)?

slide-32
SLIDE 32

MACHINE LEARNING - Doctoral Class - EDOC

GP Latent Variables Model

  • Consider a Gaussian kernel:   k(x, x') = α₁ exp( −(α₂/2) ||x − x'||² )
  • It is no longer possible to find a closed-form solution when optimizing for X, nor to simply proceed to a singular value decomposition.
  • Instead, find the gradients w.r.t. X, α₁, α₂, σ² and optimize using conjugate gradients.

We get a non-linear mapping into a low dimensional manifold: the Gaussian Process Latent Variables Model.

slide-33
SLIDE 33

MACHINE LEARNING - Doctoral Class - EDOC

GP Latent Variables Model

Optimization of the non-linear model is done by gradient descent on

∂L/∂K = K^{-1} Y Y^T K^{-1} − N K^{-1},   with each element of K given by k(x, x') = α₁ exp( −(α₂/2) ||x − x'||² )

This is computationally heavy; a sparse technique: pick a subset of datapoints according to how much they reduce the posterior process entropy.

We get a non-linear mapping into a low dimensional manifold: the Gaussian Process Latent Variables Model.
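A toy sketch of GP-LVM optimization under these formulas. The lecture uses analytic gradients and conjugate gradients over X and the kernel parameters; this simplified version keeps α₁, α₂, σ² fixed and lets scipy approximate the gradient of the negative log-likelihood numerically. All data and settings are illustrative.

# Sketch: GP-LVM negative log-likelihood with an RBF kernel, optimized over X.
import numpy as np
from scipy.optimize import minimize

def rbf_K(X, alpha1, alpha2, sigma2):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return alpha1 * np.exp(-0.5 * alpha2 * d2) + sigma2 * np.eye(len(X))

def neg_log_lik(x_flat, Y, q, alpha1=1.0, alpha2=1.0, sigma2=0.01):
    M, N = Y.shape
    X = x_flat.reshape(M, q)
    K = rbf_K(X, alpha1, alpha2, sigma2)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * N * logdet + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))

rng = np.random.default_rng(5)
Y = rng.standard_normal((20, 4))
Y -= Y.mean(axis=0)
q = 2
X0 = np.linalg.svd(Y, full_matrices=False)[0][:, :q]       # PCA-style initialization
res = minimize(neg_log_lik, X0.ravel(), args=(Y, q), method="L-BFGS-B")
X_latent = res.x.reshape(-1, q)
print(res.fun, X_latent.shape)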

slide-34
SLIDE 34

MACHINE LEARNING - Doctoral Class - EDOC

Initialization of GP-LVM

In contrast to linear models such as PCA and PPCA, which allow for a closed-form solution for the latent variables (or projections in feature space), GP-LVM has to proceed iteratively through gradient-based optimization. Therefore, a smart initialization is important. Usually initialization with PCA works fine…

slide-35
SLIDE 35

MACHINE LEARNING - Doctoral Class - EDOC

Initialization of GP-LVM

But not always… We have already seen the ‘swiss roll’ dataset when we discussed kPCA.

For this data the true structure is known: the manifold is a two dimensional square twisted into a spiral along one of its dimensions and living in a three dimensional space.

slide-36
SLIDE 36

MACHINE LEARNING - Doctoral Class - EDOC

Initialization of GP-LVM

GP-LVM with PCA initialization gives poor results. Initialization with Isomap allows the original structure of the data to be recovered.

slide-37
SLIDE 37

MACHINE LEARNING - Doctoral Class - EDOC

In this example, one combines the strengths of two different approaches: Isomap provides a unique solution which can recover the structure of the manifold on which the data lies;

GP-LVM provides an underlying probabilistic model and an easy way to compute the mapping from the latent to the observed space. Due to the probabilistic nature of the GP-LVM we can also compare the resulting models through their log-likelihood. The log-likelihood of the model when initialized with Isomap is much higher than that of the model when initialized with PCA (−45 vs −534).

Initialization of GP-LVM

slide-38
SLIDE 38

MACHINE LEARNING - Doctoral Class - EDOC

Demo 2: Oil Data

Red crosses, green circles and blue + signs represent stratified, annular and homogeneous flows respectively. The grayscale in the right figure indicates the precision with which the manifold was expressed in data-space for that latent point.

Multi-phase oil flow data : 12 dimensional data set containing data of three known classes corresponding to the phase of flow in an oil pipeline: stratified, annular and homogeneous.

PCA (GP-LVM with the linear kernel) GP-LVM with the RBF kernel

slide-39
SLIDE 39

MACHINE LEARNING - Doctoral Class - EDOC

Demo 2: GP-LVM vs PCA

Red crosses, green circles and blue plus signs represent stratified, annular and homogeneous flows respectively. The grayscale in the plots indicates the precision with which the manifold was expressed in data-space for that latent point.

Multi-phase oil flow data : 12 dimensional data set containing data of three known classes corresponding to the phase of flow in an oil pipeline: stratified, annular and homogeneous.

PCA (GP-LVM with the linear kernel) GP-LVM with the RBF kernel

PCA: overlap of stratified and annular flows. GP-LVM: flows are separated.

slide-40
SLIDE 40

MACHINE LEARNING - Doctoral Class - EDOC

Demo 2: GP-LVM vs kPCA

Red crosses, green circles and blue plus signs represent stratified, annular and homogeneous flows respectively. The grayscale in the plots indicates the precision with which the manifold was expressed in data-space for that latent point.

Multi-phase oil flow data : 12 dimensional data set containing data of three known classes corresponding to the phase of flow in an oil pipeline: stratified, annular and homogeneous.

kPCA GP-LVM with the RBF kernel

Flows are separated Overlap of stratified and annular flows

slide-41
SLIDE 41

MACHINE LEARNING - Doctoral Class - EDOC

Back Constraints I

Most dimensionality reduction techniques preserve local distances. kPCA maps smoothly from the data space to the latent space: points close in the data space are close in the latent space.

slide-42
SLIDE 42

MACHINE LEARNING - Doctoral Class - EDOC

Back Constraints I

Most dimensionality reduction techniques preserve local distances. kPCA maps smoothly from the original space to the latent space: points close in the data space are close in the latent space, but points close in the latent space may not be close in the data space.

slide-43
SLIDE 43

MACHINE LEARNING - Doctoral Class - EDOC

Back Constraints I

Most dimensionality reduction techniques preserve local distances. kPCA maps smoothly from the original space to the latent space: points close in the data space are close in the latent space, but points close in the latent space may not be close in the data space.

GP-LVM does not preserve local distances, but points close in the latent space are close in the data space.

slide-44
SLIDE 44

MACHINE LEARNING - Doctoral Class - EDOC

Back Constraints II: Runner Dataset

Motion capture data of human walking. The paths of the sequences in the latent space are shown as solid lines. The dimension of the original data is 102 (34 markers x 3 coordinates).

slide-45
SLIDE 45

MACHINE LEARNING - Doctoral Class - EDOC

The data were obtained from a subject breaking into a run from standing (a cyclic motion); neighbors in time are connected with lines.

Back Constraints II: Runner Dataset

slide-46
SLIDE 46

MACHINE LEARNING - Doctoral Class - EDOC

We are supposed to see a smooth periodic pattern in the latent space. But, instead …

Non-smooth mapping

Back Constraints II: Runner Dataset

slide-47
SLIDE 47

MACHINE LEARNING - Doctoral Class - EDOC

Back Constraints III

Lowe and Tipping [1997] made the latent positions a function of the data. The function was either a multi-layer perceptron or a radial basis function network:

x_{ij} = f_j(y_{i,:}, w)

slide-48
SLIDE 48

MACHINE LEARNING - Doctoral Class - EDOC

Back Constraints III

The same idea can be used to force the GP-LVM to respect local distances [Lawrence and Candela, 2006]. By constraining each x_i to be a smooth mapping from y_i, local distances can be respected. This works because in GP-LVM one maximizes w.r.t. the latent variables and does not integrate them out.

slide-49
SLIDE 49

MACHINE LEARNING - Doctoral Class - EDOC

Back Constraints III

GP-LVM normally proceeds by optimizing L(X) = log p(Y | X) with respect to X using ∂L/∂X.

The back constraints are of the form

x_{ij} = f_j(y_{i,:}, B)

where B is a set of unknown parameters. We can compute ∂L/∂B via the chain rule and optimize with respect to B.

slide-50
SLIDE 50

MACHINE LEARNING - Doctoral Class - EDOC

Back Constraints IV: Runner Dataset

GP-LVM with back constraints applied to the runner dataset that we have seen before.

Smooth cyclical pattern

slide-51
SLIDE 51

MACHINE LEARNING - Doctoral Class - EDOC

Back Constraints IV: Runner Dataset

GP-LVM with back constraints applied to the runner dataset that we have seen before.

slide-52
SLIDE 52

MACHINE LEARNING - Doctoral Class - EDOC

Dynamics in GP-LVM

Observable data Y are generated by a low-dimensional dynamical process (e.g. in full-body motion, a dynamical law controlling all body joints can be described in a low-dimensional space).

How can we learn the latent dynamics?

slide-53
SLIDE 53

MACHINE LEARNING - Doctoral Class - EDOC

Dynamics in GP-LVM

y x

η B x y η A x x + = + =

) , ( ) ; (

1 t t t t

g f

Observable data are generated by a low-dimensional dynamical process (e.g. in the full-body motion, a dynamical law controlling all

body joints can be described in a low-dimensional space).

How can we learn the latent dynamics? Assume the latent-variable mapping with a first-order Markov dynamics where and are the matrices of parameters (equivalent to over which we marginalized in DPPCA)

Y A B W

Isotropic Gaussian noise

slide-54
SLIDE 54

MACHINE LEARNING - Doctoral Class - EDOC

Dynamics in GP-LVM

y x

η B x y η A x x + = + =

) , ( ) ; (

1 t t t t

g f

Observable data are generated by a low-dimensional dynamical process. Assume the latent-variable mapping with the first-order Markov dynamics where and are the matrices of parameters (equivalent to over which we marginalized in DPPCA)

Y A B W

standard GP-LVM mapping

slide-55
SLIDE 55

MACHINE LEARNING - Doctoral Class - EDOC

Dynamics in GP-LVM

y x

η B x y η A x x + = + =

) , ( ) ; (

1 t t t t

g f

Observable data are generated by a low-dimensional dynamical process. Assume the latent-variable mapping with the first-order Markov dynamics where and are the matrices of parameters (equivalent to over which we marginalized in DPPCA)

Y A B W

standard GP-LVM mapping

) ' 2 exp( ) ' , (

2 2 1

x x x x − − = β β

Y

k

Use the RBF kernel:

slide-56
SLIDE 56

MACHINE LEARNING - Doctoral Class - EDOC

Dynamics in GP-LVM

y x

η B x y η A x x + = + =

) , ( ) ; (

1 t t t t

g f

Observable data are generated by a low-dimensional dynamical process. Assume the latent-variable mapping with the first-order Markov dynamics where and are the matrices of parameters (equivalent to over which we marginalized in DPPCA)

Y A B W

novel, auto- regressive part

slide-57
SLIDE 57

MACHINE LEARNING - Doctoral Class - EDOC

Dynamics in GP-LVM

y x

η B x y η A x x + = + =

) , ( ) ; (

1 t t t t

g f

Observable data are generated by a low-dimensional dynamical process. Assume the latent-variable mapping with the first-order Markov dynamics where and are the matrices of parameters (equivalent to over which we marginalized in DPPCA)

Y A B W

novel, auto- regressive part

Use the RBF + linear kernel, the marginal distribution is getting More complicated

' ) ' 2 exp( ) ' , (

3 2 2 1

x x x x x x

T X

k α α α + − − =

slide-58
SLIDE 58

MACHINE LEARNING - Doctoral Class - EDOC

Dynamics in GP-LVM

A A A X X d p p p ) | ( ) , | ( ) | ( α α α

=

1 1 2

( | ) ( ) ( | , , ) ( | )

T t t t

p p p p d α α α

− =

=

∏ ∫

X x x x A A A

1 1 ( 1)

1 1 ( | ) ( ) exp( ( )) 2 (2 )

T X

  • ut
  • ut

N T N X

p p tr α π

− −

= − X x K X X K

2

[ ,..., ]T

  • ut

T

= X x x

( 1) ( 1) T T X

R

− × −

∈ K

  • Marginalize over the parameters
  • Incorporate the Markov Prior
  • With the isotropic Gaussian prior on

A

) (A p

where

slide-59
SLIDE 59

MACHINE LEARNING - Doctoral Class - EDOC

Dynamics in GP-LVM

  • The joint distribution of the latent variables is not Gaussian
  • Therefore, the log-likelihood is not quadratic in X
  • However, one can compute maximum a posteriori (MAP) estimates of the solution by minimizing the negative log-posterior:

L = −ln p(X, α, β | Y)
  = (q/2) ln|K_X| + (1/2) tr(K_X^{−1} X_out X_out^T) + Σ_i ln α_i
  + (N/2) ln|K_Y| + (1/2) tr(K_Y^{−1} Y Y^T) + Σ_i ln β_i + const

slide-60
SLIDE 60

MACHINE LEARNING - Doctoral Class - EDOC

GPDM: Results

Each pose was defined by 56 Euler angles for the joints, 3 global pose angles, and 3 global translational velocities for the torso. For learning, the data were mean-subtracted and the latent coordinates were initialized with PCA.

Original Walk

slide-61
SLIDE 61

MACHINE LEARNING - Doctoral Class - EDOC

GPDM: Results

Models of the latent dynamics learned from a walking sequence of 2.5 gait cycles.

GP-LVM GPDM

Non-smooth mapping (GP-LVM without back constraints)

slide-62
SLIDE 62

MACHINE LEARNING - Doctoral Class - EDOC

GPDM: Results

(d) Random trajectories drawn from the model using Monte Carlo. (e) A GPDM of walk data learned with RBF+linear kernel dynamics. The poses were reconstructed from points on the trajectory.

slide-63
SLIDE 63

MACHINE LEARNING - Doctoral Class - EDOC

GPDM: Missing Data

50 missing frames in the 157-frame walk sequence

missing data

slide-64
SLIDE 64

MACHINE LEARNING - Doctoral Class - EDOC

GPDM: Missing Data

The latent coordinates for missing data are initialized by cubic spline interpolation from the 3D PCA initialization of observations. Problem: the difficulty of initializing the missing latent positions sufficiently close to the training data.

spline data

slide-65
SLIDE 65

MACHINE LEARNING - Doctoral Class - EDOC

GPDM: Missing Data

  • Learn a model with a subsampled data sequence: this oversmooths the data and generates a more uncertain but smoother distribution over the latent variables.
  • Restart the learning with the entire data set, but with the kernel hyperparameters fixed (optimize w.r.t. the latent variables X).
  • The dynamics terms in the objective function then exert more influence over the latent coordinates of the training data, and a smooth model is learned.

slide-66
SLIDE 66

MACHINE LEARNING - Doctoral Class - EDOC

GPDM: Missing Data

slide-67
SLIDE 67

MACHINE LEARNING - Doctoral Class - EDOC

References on GPLVM

1. N. Lawrence. Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models. Journal of Machine Learning Research, 2005.
2. M. Tipping and C. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, 1999.
3. N. D. Lawrence and J. Quinonero-Candela. Local Distance Preservation in the GP-LVM through Back Constraints.
4. J. M. Wang, D. J. Fleet and A. Hertzmann. Gaussian Process Dynamical Models. In Proceedings of the International Conference on Neural Information Processing Systems, 2005.
5. The tutorial and the demo software from the website http://www.cs.man.ac.uk/~neill/
6. The demos from the website http://www.dgp.toronto.edu/~jmwang/gpdm/
slide-68
SLIDE 68

MACHINE LEARNING - Doctoral Class - EDOC

Non-linear function approximation

  • We saw that SVR offers a powerful way of estimating arbitrary non-linear functions.
  • It has, however, a cost: computation grows with the number of support vectors.
  • While penalties can be imposed to keep this number low, how to actually set this penalty is not easy.
  • Besides, SVR does not allow the algorithm to be retrained easily by adding new data incrementally, as this may change drastically which datapoints are used as SVs (neural networks with Hebbian learning provide good alternatives for incremental learning, but they are not guaranteed to find an optimal solution).

slide-69
SLIDE 69

MACHINE LEARNING - Doctoral Class - EDOC

Nonlinear function approximation with high-dimensional input data remains a nontrivial problem, especially in incremental and real-time formulations.

Two broad classes of function approximation methods:
1. Methods that fit nonlinear functions globally, typically by input space expansions (feature space) with predefined or parameterized basis functions and subsequent linear combinations of the expanded inputs (e.g. SVR, GPLVM (today's lecture), GPR (next week)).
2. Methods that fit nonlinear functions locally, usually by using spatially localized simple models in the original input space and automatically adjusting the complexity (e.g., the number of local models and their locality) to accurately account for the nonlinearities and distributions of the target function (e.g. LWPR (today's lecture), GMR (next week)).

Non-linear function approximation

slide-70
SLIDE 70

MACHINE LEARNING - Doctoral Class - EDOC

Locally weighted learning

  • Classical regression relies on minimizing the MSE.
  • We discussed the importance of weighted regression.
  • Such a method still assumes that a single linear dependency applies everywhere. This is not true for data sets with local dependencies.

It would be useful to design a regression method that best estimates the linear dependencies locally: locally weighted learning (LWR, Atkeson et al. 1997).

slide-71
SLIDE 71

MACHINE LEARNING - Doctoral Class - EDOC

Locally weighted learning

{ } ( )

( )

( )

1 2 1

For a set of M multi-dimensional pair of datapoints , , a linear model of the form , = , where is the slope of the function, if optimizes by minimizing MSE, The data s

i i i i i i i

M i T M T i

x y y f x x J y x β β β β β

= =

= = −

( )

( )

( )

( )

2 1

et can be tailored to the query point x' by emphasizing nearby points in the regression. One can do this by weighting the training criterion , ' where K( ) is the weighting or kernel

i i

M T i i

J y x K d x x β β

=

= −

function and ( , ') is the distance between the data point and the query point '.

i i

d x x x x

slide-72
SLIDE 72

MACHINE LEARNING - Doctoral Class - EDOC

Locally weighted learning

{ } ( )

( )

( )

1 2 1

For a set of M multi-dimensional pair of datapoints , , a linear model of the form , = , where is the slope of the function, if optimizes by minimizing MSE, The data s

i i i i i i i

M i T M T i

x y y f x x J y x β β β β β

= =

= = −

( )

( )

( )

( )

2 1

et can be tailored to the query point x' by emphasizing nearby points in the regression. One can do this by weighting the training criterion , ' where K( ) is the weighting or kernel

i i

M T i i

J y x K d x x β β

=

= −

function and ( , ') is the distance between the data point and the query point '.

i i

d x x x x

( )

( )

Using this criterion, each , ' become a local model and can have a different set of parameters for each query point '.

i

f x x x β

slide-73
SLIDE 73

MACHINE LEARNING - Doctoral Class - EDOC

Locally weighted learning

( ) ( )

( ) ( )

2 1 1

If we assume each local model to be linear in the parameters, i.e. , = , finding the optimal projection with unweighted projection is done by minimizing MSE: . In weighted

T M T i i i T T

f x x J x y X X X y β β β β β

= −

= − ⇒ =

( ) ( )

( )

( )

regression, if we set w ' = , ' , When optimizing for a given query point x', we set W a diagonal matrix with entries w , , v , one can get an estimator for ˆ at the query point: ' '

i i i

x K d x x Z WX Wy y y x x = = =

( )

1 T T T

Z Z Z v
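A minimal sketch of this locally weighted estimator at a single query point, assuming a Gaussian weighting kernel with bandwidth h and a bias term appended to the inputs (the slides assume zero-mean data and omit the bias); names, data and parameter values are illustrative.

# Sketch: locally weighted regression at one query point (Z = W X, v = W y).
import numpy as np

def lwr_predict(X, y, x_query, h=0.3):
    w = np.exp(-0.5 * np.sum((X - x_query) ** 2, axis=1) / h**2)   # kernel weights
    Xb = np.hstack([X, np.ones((len(X), 1))])                      # append bias term
    Z = w[:, None] * Xb                                            # Z = W X
    v = w * y                                                      # v = W y
    beta = np.linalg.lstsq(Z, v, rcond=None)[0]                    # beta = (Z^T Z)^{-1} Z^T v
    return np.append(x_query, 1.0) @ beta                          # y_hat(x') = x'^T beta

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(200)
print(lwr_predict(X, y, np.array([0.5])))      # compare to sin(1.5) ~ 0.997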

slide-74
SLIDE 74

MACHINE LEARNING - Doctoral Class - EDOC

Locally weighted learning

Unweighted Regression Weighted Regression

slide-75
SLIDE 75

MACHINE LEARNING - Doctoral Class - EDOC

Locally weighted projected regression (LWPR)

  • LWL is computationally expensive.
  • Computation costs increase quadratically with the dimensionality of the data.
  • The method is numerically brittle due to incremental NxN matrix inversion.
  • Too many "manual tuning parameters".

Idea: find an optimal local projection to reduce the dimensionality locally, and then proceed to LWR in each local subspace.

slide-76
SLIDE 76

MACHINE LEARNING - Doctoral Class - EDOC

Non-linear function approximation – Local Dimensionality Reduction

  • Globally high dimensional data can be locally low dimensional
  • Use local dimensionality reduction techniques:
  • PCA
  • Factor Analysis
  • Partial Least Squares Regression
slide-77
SLIDE 77

MACHINE LEARNING - Doctoral Class - EDOC

Methods for Real-Time Function Approximation?

Classical neural networks (slow, structure?), mixture models (expensive, local minima), support vector machines and Gaussian process regression (O(N²) - O(N³)) are global optimization processes (computationally expensive, slow):

J = Σ_{i=1}^{M} ( y_i − Σ_{k=1}^{K} β_k φ_k(x_i) )²

Nonparametric statistics (a.k.a. locally weighted learning) instead uses a kernel-weighted criterion:

J = Σ_{i=1}^{M} Σ_{k=1}^{K} w_{k,i} ( y_i − β_k^T x_i )²

K independent local linear optimization problems!

slide-78
SLIDE 78

MACHINE LEARNING - Doctoral Class - EDOC

Choosing the actual number of local models is often difficult, as it can lead to overfitting. This is not a problem when the local models are learned purely from local data: then, an increasing number of local models does not overfit!

Locally weighted projected regression (LWPR)

slide-79
SLIDE 79

MACHINE LEARNING - Doctoral Class - EDOC

Nonparametric Statistics (a.k.a. Locally Weighted Learning)

J(W, β) = Σ_{i=1}^{M} Σ_{k=1}^{K} w_{k,i} ( y_i − β_k^T x_i )²

K independent local linear optimization problems, with

w_{k,i} = exp( −(1/2) (x_i − x'_k)^T D_k (x_i − x'_k) ),   β_k = (X^T W_k X)^{-1} X^T W_k Y

Locally weighted projected regression (LWPR)

slide-80
SLIDE 80

MACHINE LEARNING - Doctoral Class - EDOC

Locally weighted projected regression (LWPR)

w_{k,i} = exp( −(1/2) (x_i − x'_k)^T D_k (x_i − x'_k) ),   β_k = (X^T W_k X)^{-1} X^T W_k Y

slide-81
SLIDE 81

MACHINE LEARNING - Doctoral Class - EDOC

Approximate non-linear functions with a combination of multiple weighted linear models:

w_{k,i} = exp( −(1/2) (x_i − x'_k)^T D_k (x_i − x'_k) )
β_k = (X^T W_k X)^{-1} X^T W_k Y
ŷ_k = x'^T β_k
ŷ = Σ_k w_k ŷ_k / Σ_k w_k

Solve this problem for high dimensional spaces: LWPR.

Sethu Vijayakumar, Aaron D'Souza and Stefan Schaal, Online Learning in High Dimensions, Neural Computation, vol. 17, pp. 2602-34 (2005). Sethu Vijayakumar @ Univ. of Edinburgh

Locally weighted projected regression (LWPR)
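A small sketch of this prediction rule, combining K hand-built local linear models with Gaussian receptive-field weights. The centers, the metric D and the local fits are illustrative assumptions; a full LWPR implementation would additionally learn each D_k and the local PLS projections online.

# Sketch: blending K local linear models, y_hat = sum_k w_k y_hat_k / sum_k w_k.
import numpy as np

def local_weights(x, centers, D):
    diff = centers - x                                    # (K, d)
    return np.exp(-0.5 * np.einsum('kd,de,ke->k', diff, D, diff))

def lwpr_style_predict(x, centers, betas, D):
    w = local_weights(x, centers, D)                      # w_k(x)
    x_tilde = np.append(x, 1.0)                           # augmented input [x^T 1]^T
    y_local = betas @ x_tilde                             # y_hat_k = beta_k^T x_tilde
    return (w @ y_local) / w.sum()                        # normalized weighted average

centers = np.linspace(-2, 2, 9)[:, None]                  # K local models on a 1-D input
D = np.array([[4.0]])                                     # shared distance metric D = M^T M
# local linear fits of sin(x) around each center: slope cos(c), offset sin(c) - c cos(c)
betas = np.stack([[np.cos(c), np.sin(c) - c * np.cos(c)] for c in centers[:, 0]])
print(lwpr_style_predict(np.array([0.7]), centers, betas, D), np.sin(0.7))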

slide-82
SLIDE 82

MACHINE LEARNING - Doctoral Class - EDOC

Locally weighted projected regression (LWPR)

  • The linear model:   y = β_x^T x + β_0 = β^T x̃,   where x̃ = [x^T 1]^T
  • The kernel function:   w = exp( −(1/2) (x − c)^T D (x − c) ),   where D = M^T M
  • The prediction:   ŷ = Σ_{k=1}^{K} w_k ŷ_k / Σ_{k=1}^{K} w_k

Open parameters: β, D

slide-83
SLIDE 83

MACHINE LEARNING - Doctoral Class - EDOC

Locally weighted projected regression (LWPR)

  • The linear model:   y = β_x^T x + β_0 = β^T x̃,   where x̃ = [x^T 1]^T
  • The kernel function:   w = exp( −(1/2) (x − c)^T D (x − c) ),   where D = M^T M
  • The prediction:   ŷ = Σ_{k=1}^{K} w_k ŷ_k / Σ_{k=1}^{K} w_k

D is usually a diagonal matrix. The centers c of the local models are fixed, not learned.

slide-84
SLIDE 84

MACHINE LEARNING - Doctoral Class - EDOC

For learning the linear models Ψ(x), LWPR employs an online formulation of weighted partial least squares (PLS) regression.

Locally weighted projected regression (LWPR)

slide-85
SLIDE 85

MACHINE LEARNING - Doctoral Class - EDOC

Partial Least Square Method and Regression

Partial Least Squares (PLS) refers to a wide class of methods for modeling relations between sets of observed variables by means of latent variables. It comprises regression and classification tasks as well as dimension reduction techniques and modeling tools.

X = {x_i}_{i=1}^M,   Y = {y_i}_{i=1}^M

Least-squares regression looks for a mapping w that sends X onto Y, such that:

min_w (1/2) Σ_{i=1}^{M} ( w^T x_i − y_i )²

slide-86
SLIDE 86

MACHINE LEARNING - Doctoral Class - EDOC

Partial Least Square Method and Regression

PCA, and CCA by extension, can be viewed as regression problems, whereby one set of variables Y is expressed in terms of a linear combination of a second set of variables X.

PCA regression does not take the response variable into account when constructing the principal components or latent variables: it is an unsupervised problem.

PLS incorporates information about the response in the model by using latent variables: it creates orthogonal score vectors (also called latent vectors or components) by maximizing the covariance between the different sets of variables.

PLS is similar to CCA in its principle; it differs, however, in the algorithm.
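A toy sketch contrasting the first PCA direction (largest variance, response ignored) with the first PLS direction (maximal covariance with the response, which for a single output is proportional to X^T y); multi-component PLS with deflation (e.g. NIPALS) is omitted, and the data are illustrative.

# Sketch: first PCA direction vs first PLS direction on toy data.
import numpy as np

rng = np.random.default_rng(7)
M = 500
X = rng.standard_normal((M, 3)) * np.array([5.0, 1.0, 1.0])   # dominant variance on dim 0
y = X[:, 1] + 0.1 * rng.standard_normal(M)                    # response depends on dim 1 only
Xc, yc = X - X.mean(0), y - y.mean()

# PCA: dominant eigenvector of X^T X (largest-variance direction)
w_pca = np.linalg.eigh(Xc.T @ Xc)[1][:, -1]

# PLS: direction maximizing cov(X w, y) -> proportional to X^T y
w_pls = Xc.T @ yc
w_pls /= np.linalg.norm(w_pls)

print(np.round(np.abs(w_pca), 2))   # ~[1, 0, 0]: picks the high-variance, irrelevant direction
print(np.round(np.abs(w_pls), 2))   # ~[0, 1, 0]: picks the direction predictive of y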

slide-87
SLIDE 87

MACHINE LEARNING - Doctoral Class - EDOC

PLS and CCA

PLS represents a form of CCA, where the criterion of maximal correlation is balanced with the requirement to explain as much variance as possible in both the X and Y spaces:

max_{w_x, w_y} cov(X w_x, Y w_y)² / ( [ (1 − γ_X) var(X w_x) + γ_X ] [ (1 − γ_Y) var(Y w_y) + γ_Y ] )

  • CCA: γ_X = γ_Y = 0
  • PLS: γ_X = γ_Y = 1
slide-88
SLIDE 88

MACHINE LEARNING - Doctoral Class - EDOC

PLS and CCA

What about γ_X = 1, γ_Y = 0?

slide-89
SLIDE 89

MACHINE LEARNING - Doctoral Class - EDOC

For learning the linear models Ψ(x), LWPR employs an online formulation of weighted partial least squares (PLS) regression. Within each local model, the input x is projected along selected directions u_i, yielding "latent" variables s_i.

Locally weighted projected regression (LWPR)

slide-90
SLIDE 90

MACHINE LEARNING - Doctoral Class - EDOC

LWPR: A Basic Incremental Algorithm

Given a new sample (x, y), for all K local models update the regression parameters by weighted recursive least squares with forgetting factor λ:

β_k^{n+1} = β_k^n + w P_k^{n+1} x̃ ( y − x̃^T β_k^n )

P_k^{n+1} = (1/λ) ( P_k^n − ( P_k^n x̃ x̃^T P_k^n ) / ( λ/w + x̃^T P_k^n x̃ ) )
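A minimal sketch of this weighted recursive least squares update for a single local model, with an illustrative receptive-field weight and toy data; the forgetting factor λ and the initial P are assumptions.

# Sketch: weighted recursive least squares for one local model.
import numpy as np

def rls_update(beta, P, x_tilde, y, w, lam=0.999):
    """One incremental update of (beta, P) with a new weighted sample."""
    denom = lam / w + x_tilde @ P @ x_tilde
    P_new = (P - np.outer(P @ x_tilde, x_tilde @ P) / denom) / lam
    beta_new = beta + w * (P_new @ x_tilde) * (y - x_tilde @ beta)
    return beta_new, P_new

rng = np.random.default_rng(8)
beta = np.zeros(3)                      # [slope_1, slope_2, offset]
P = np.eye(3) * 100.0                   # large initial covariance
for _ in range(2000):
    x = rng.uniform(-1, 1, size=2)
    x_tilde = np.append(x, 1.0)         # augmented input
    y = 2.0 * x[0] - 1.0 * x[1] + 0.5 + 0.01 * rng.standard_normal()
    w = np.exp(-0.5 * (x @ x) / 0.5)    # receptive-field weight (illustrative)
    beta, P = rls_update(beta, P, x_tilde, y, w)
print(np.round(beta, 2))                # should approach [2.0, -1.0, 0.5]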

slide-91
SLIDE 91

MACHINE LEARNING - Doctoral Class - EDOC

LWPR: A Basic Incremental Algorithm

The distance metric D, and hence the locality of the receptive fields, can be learned for each local model individually by stochastic gradient descent on a penalized leave-one-out cross-validation cost function:

J = ( 1 / Σ_{i=1}^{M} w_{k,i} ) Σ_{i=1}^{M} w_{k,i} ( y_i − ŷ_{k,i,−i} )² + γ Σ_{i,j=1}^{n} D_{k,ij}²

M_k^{n+1} = M_k^n − α ∂J/∂M,   with   D_k^{n+1} = (M_k^{n+1})^T M_k^{n+1}

The first term is the leave-one-out estimate of the prediction error of y; the second is a penalty term that keeps the receptive fields from shrinking indefinitely as the amount of data grows.

slide-92
SLIDE 92

MACHINE LEARNING - Doctoral Class - EDOC

LWPR: A Basic Incremental Algorithm

Given (x, y), for all K local models:

  • Recursive least squares (with forgetting factor λ):

β_k^{n+1} = β_k^n + w P_k^{n+1} x̃ ( y − x̃^T β_k^n )
P_k^{n+1} = (1/λ) ( P_k^n − ( P_k^n x̃ x̃^T P_k^n ) / ( λ/w + x̃^T P_k^n x̃ ) )

  • Stochastic leave-one-out cross validation for the distance metric:

J = ( 1 / Σ_{i=1}^{M} w_{k,i} ) Σ_{i=1}^{M} w_{k,i} ( y_i − ŷ_{k,i,−i} )² + γ Σ_{i,j=1}^{n} D_{k,ij}²
M_k^{n+1} = M_k^n − α ∂J/∂M,   D_k^{n+1} = (M_k^{n+1})^T M_k^{n+1}

  • Automatic structure determination; create a new model:

if min_k w_k(x) < w_gen, create a new receptive field at c_{K+1} = x

slide-93
SLIDE 93

MACHINE LEARNING - Doctoral Class - EDOC

[Figure: a) global function fitting with a sigmoidal neural network; b) local function fitting with receptive fields; c) learned organization of receptive fields. Legend: original training data, new training data, true y, predicted y, predicted y after new training data.]

Locally weighted projected regression (LWPR)

slide-94
SLIDE 94

MACHINE LEARNING - Doctoral Class - EDOC

Increasing the number of components leads to a better fit of the local linearities.

Locally weighted projected regression (LWPR)

slide-95
SLIDE 95

MACHINE LEARNING - Doctoral Class - EDOC

Empirical Evaluations (Cross Data)

Learned function / Target function
Input dimensionality = 2 (+ 8 or 18 redundant dimensions). Noise ~ N(0, 0.01). Number of training data = 500.
Initial receptive fields / Learned receptive fields

Sethu Vijayakumar @ Univ. of Edinburgh

slide-96
SLIDE 96

MACHINE LEARNING - Doctoral Class - EDOC

Summary: class so far

DIMENSIONALITY REDUCTION and FEATURE ANALYSIS:

  • Importance of methods for reduction of dimensionality (PCA)
  • Advantage of a probabilistic phrasing of PCA (missing data)
  • Advantage of non-linear transformations to infer regularities in the data in feature space (kPCA)

FUNCTION ESTIMATION:

  • Estimation of a function through regression
  • Advantage of formulating the problem statistically (probabilistic regression): deal with missing data, handle noise
  • Non-linear transformations useful again to estimate the function through latent variables (SVR)
  • Usefulness of being able to project back from feature space to the original space (GPLVM)
  • Incremental estimation of a function (LWPR)