MACHINE LEARNING - Doctoral Class - EDOC
MACHINE LEARNING GPLVM, LWPR
Slides for GPLVM are adapted from Neil Lawrence's lectures; slides for LWPR are adapted from Stefan Schaal's lectures.
MACHINE LEARNING - Doctoral Class - EDOC
Why reduce the dimensionality? Raw data: trying to find some structure in the data…
MACHINE LEARNING - Doctoral Class - EDOC
Idea: only a few of the dimensions matter; the projections of the data along the residual dimensions convey no structure in the data (this is already a form of generalization).
MACHINE LEARNING - Doctoral Class - EDOC
We have already seen PCA and kernel PCA in the previous lectures.
PCA: maximizes the variance in the projected space, minimizes the squared reconstruction error.
kPCA: maximizes the variance in the projected (feature) space, minimizes the squared reconstruction error.
BUT… PCA was extended with:
MACHINE LEARNING - Doctoral Class - EDOC
Two-sided projection: what if we need to manipulate data in the projected space and then reconstruct them back to the original? kPCA does not provide an explicit mapping back from the projected space to the original data-space.
MACHINE LEARNING - Doctoral Class - EDOC
Non-probabilistic treatment → no systematic way to process missing data. One usually either discards the incomplete data (wasteful if a huge amount of data is affected) or completes the missing values using a variety of interpolation methods. Probabilistic PCA allows one to handle missing data, using an E-M approach to computing the principal projections. Missing data: real-world data are often incomplete (e.g. the state vector of a robot consists of several measurements delivered by sensors; due to network problems, some sensor readings are lost).
MACHINE LEARNING - Doctoral Class - EDOC
Centered data matrix: $\mathbf{Y} = [\mathbf{y}_{1,:}, \ldots, \mathbf{y}_{M,:}]^T \in \mathbb{R}^{M \times N}$
Projected (latent) data: $\mathbf{X} = [\mathbf{x}_{1,:}, \ldots, \mathbf{x}_{M,:}]^T \in \mathbb{R}^{M \times q}$
Mapping matrix: $\mathbf{W} \in \mathbb{R}^{N \times q}$
MACHINE LEARNING - Doctoral Class - EDOC
Represent the data $\mathbf{y}_{i,:}$, $i = 1, \ldots, M$, with a lower-dimensional set of latent variables $\mathbf{x}_{i,:}$. Assume a linear relationship of the form
$$\mathbf{y}_{i,:} = \mathbf{W}\mathbf{x}_{i,:} + \boldsymbol{\eta}_{i,:}, \qquad \boldsymbol{\eta}_{i,:} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}),$$
where $\mathbf{W}$ is the mapping matrix and $\boldsymbol{\eta}_{i,:}$ is Gaussian noise.
MACHINE LEARNING - Doctoral Class - EDOC
A statistical approach to classical linear regression estimates the relationship between zero-mean variables $y$ and $\mathbf{x}$ by building a linear model of the form
$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x}.$$
If one assumes that the observed values of $y$ differ from $f(\mathbf{x})$ by an additive noise $\varepsilon$ that follows a zero-mean Gaussian distribution (such an assumption amounts to putting a prior distribution over the noise), then:
$$y = \mathbf{w}^T\mathbf{x} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$
MACHINE LEARNING - Doctoral Class - EDOC
The likelihood of the observations under this model is
$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{i=1}^{M} \mathcal{N}\!\left(y_i \mid \mathbf{w}^T\mathbf{x}_i, \sigma^2\right) \propto \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{M}\left(y_i - \mathbf{w}^T\mathbf{x}_i\right)^2\right),$$
so maximizing the likelihood is equivalent to minimizing the squared error.
MACHINE LEARNING - Doctoral Class - EDOC
In the Bayesian formalism, one would also specify a prior over the parameter $\mathbf{w}$. A typical choice is a zero-mean Gaussian prior with fixed covariance matrix:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \sigma_w^2\mathbf{I})$$
One can then compute the posterior distribution over the parameter $\mathbf{w}$ using Bayes' theorem:
$$\underbrace{p(\mathbf{w} \mid \mathbf{y}, \mathbf{X})}_{\text{posterior}} = \frac{\overbrace{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})}^{\text{likelihood}}\;\overbrace{p(\mathbf{w})}^{\text{prior}}}{\underbrace{p(\mathbf{y} \mid \mathbf{X})}_{\text{marginal likelihood}}}$$
The value of $\mathbf{w}$ that maximizes the posterior distribution is called the maximum a posteriori (MAP) estimate of $\mathbf{w}$.
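A minimal numpy sketch of this posterior computation (the data, noise variance and prior variance below are illustrative assumptions, not values from the lecture): the posterior over $\mathbf{w}$ is Gaussian, and its mean is the MAP estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 50, 3                        # M datapoints, d input dimensions
X = rng.normal(size=(M, d))         # inputs, one row per datapoint
w_true = np.array([1.0, -2.0, 0.5])
sigma2, sigma2_w = 0.1, 1.0         # noise variance and prior variance (assumed)
y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=M)

# Posterior p(w | y, X) is Gaussian with
#   covariance S = (X^T X / sigma2 + I / sigma2_w)^{-1}
#   mean       m = S X^T y / sigma2   (the mean/mode is the MAP estimate)
S = np.linalg.inv(X.T @ X / sigma2 + np.eye(d) / sigma2_w)
m = S @ X.T @ y / sigma2
print("MAP estimate of w:", m)
```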
MACHINE LEARNING - Doctoral Class - EDOC
Assumptions: all datapoints in $\mathbf{Y}$ are i.i.d., and the relationship between the data and the latent variables is Gaussian:
$$p(\mathbf{y}_{i,:} \mid \mathbf{x}_{i,:}, \mathbf{W}) = \mathcal{N}\!\left(\mathbf{y}_{i,:} \mid \mathbf{W}\mathbf{x}_{i,:}, \sigma^2\mathbf{I}\right)$$
MACHINE LEARNING - Doctoral Class - EDOC
The marginal likelihood can be obtained in two ways: marginalizing over the latent variables (PPCA) or marginalizing over the parameters (Dual PPCA).

Likelihood:
$$p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W}) = \prod_{i=1}^{M} \mathcal{N}\!\left(\mathbf{y}_{i,:} \mid \mathbf{W}\mathbf{x}_{i,:}, \sigma^2\mathbf{I}\right)$$

Define a Gaussian prior over the latent variables:
$$p(\mathbf{X}) = \prod_{i=1}^{M} \mathcal{N}\!\left(\mathbf{x}_{i,:} \mid \mathbf{0}, \mathbf{I}\right)$$

Integrate out the latent variables:
$$p(\mathbf{Y} \mid \mathbf{W}) = \prod_{i=1}^{M} \mathcal{N}\!\left(\mathbf{y}_{i,:} \mid \mathbf{0}, \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}\right)$$

Parameter optimization: maximize this marginal likelihood with respect to $\mathbf{W}$ (and $\sigma^2$).
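When the data are complete, this parameter optimization has a closed-form solution (Tipping & Bishop). Below is a minimal numpy sketch of that solution, assuming Y is an M×N centred data matrix and q the chosen latent dimension (not code from the lecture):

```python
import numpy as np

def ppca_ml(Y, q):
    """Closed-form ML solution of PPCA: W = U_q (Lambda_q - sigma^2 I)^{1/2}."""
    M, N = Y.shape
    S = Y.T @ Y / M                          # N x N sample covariance of the data
    eigval, eigvec = np.linalg.eigh(S)       # eigenvalues in ascending order
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]
    sigma2 = eigval[q:].mean()               # ML noise variance: mean of discarded eigenvalues
    W = eigvec[:, :q] * np.sqrt(eigval[:q] - sigma2)   # N x q mapping matrix
    return W, sigma2

Y = np.random.default_rng(1).normal(size=(100, 5))
Y -= Y.mean(axis=0)                          # centre the data
W, sigma2 = ppca_ml(Y, q=2)
print(W.shape, sigma2)
```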
MACHINE LEARNING - Doctoral Class - EDOC
During the E-step of the EM algorithm, if the data matrix is incomplete, one can replace the missing elements with the mean (which affects the estimate very little) or with an extreme value (which increases the uncertainty of the model).
MACHINE LEARNING - Doctoral Class - EDOC
This is OK only if the number of missing values is small. If the number of missing values is large, one instead looks for the projection y* which minimizes the MSE while ensuring that the solution lies in the sub-space spanned by the known entries of y.
MACHINE LEARNING - Doctoral Class - EDOC
Likelihood:
$$p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W}) = \prod_{i=1}^{M} \mathcal{N}\!\left(\mathbf{y}_{i,:} \mid \mathbf{W}\mathbf{x}_{i,:}, \sigma^2\mathbf{I}\right)$$

Define a Gaussian prior over the parameters:
$$p(\mathbf{W}) = \prod_{i=1}^{N} \mathcal{N}\!\left(\mathbf{w}_{i,:} \mid \mathbf{0}, \mathbf{I}\right)$$

Integrate out the parameters:
$$p(\mathbf{Y} \mid \mathbf{X}) = \prod_{j=1}^{N} \mathcal{N}\!\left(\mathbf{y}_{:,j} \mid \mathbf{0}, \mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}\right)$$

Latent-variable optimization: maximize this marginal likelihood with respect to $\mathbf{X}$.
MACHINE LEARNING - Doctoral Class - EDOC
The marginalized likelihood takes the form:
$$p(\mathbf{Y} \mid \mathbf{X}) = \prod_{j=1}^{N} \mathcal{N}\!\left(\mathbf{y}_{:,j} \mid \mathbf{0}, \mathbf{K}\right), \qquad \mathbf{K} = \mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}$$
To find the latent variables, optimize the log-likelihood:
$$L = -\frac{MN}{2}\ln 2\pi - \frac{N}{2}\ln|\mathbf{K}| - \frac{1}{2}\,\mathrm{tr}\!\left(\mathbf{K}^{-1}\mathbf{Y}\mathbf{Y}^T\right)$$
The gradient of $L$ w.r.t. the latent variables:
$$\frac{\partial L}{\partial \mathbf{X}} = \left(\mathbf{K}^{-1}\mathbf{Y}\mathbf{Y}^T\mathbf{K}^{-1} - N\mathbf{K}^{-1}\right)\mathbf{X}$$
MACHINE LEARNING - Doctoral Class - EDOC
Introducing the substitution $\mathbf{S} = \frac{1}{N}\mathbf{Y}\mathbf{Y}^T$, we rewrite the stationary point of the likelihood (gradient set to zero) as:
$$\mathbf{S}\mathbf{K}^{-1}\mathbf{X} = \mathbf{X}, \qquad \mathbf{K} = \mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}$$
MACHINE LEARNING - Doctoral Class - EDOC
Let us consider the singular value decomposition of the latent matrix, $\mathbf{X} = \mathbf{U}\mathbf{L}\mathbf{V}^T$. Substituting it into the stationarity condition, we can rewrite the equation for the latent variables as an eigenvalue problem:
$$\mathbf{S}\mathbf{U} = \mathbf{U}\left(\mathbf{L}^2 + \sigma^2\mathbf{I}\right)$$
The solution is invariant to $\mathbf{V}$ (an arbitrary rotation in the latent space).
MACHINE LEARNING - Doctoral Class - EDOC
Here $\mathbf{S} = \frac{1}{N}\mathbf{Y}\mathbf{Y}^T$ is the covariance matrix of the observable data, and the columns of $\mathbf{U}$ are eigenvectors of $\mathbf{S}$. This implies that the elements of the diagonal of $\mathbf{L}$ are given by
$$l_i = \left(\lambda_i - \sigma^2\right)^{1/2},$$
where $\lambda_i$ is the eigenvalue of $\mathbf{S}$ associated with the $i$-th eigenvector. We thus know how to find the matrices in the decomposition $\mathbf{X} = \mathbf{U}\mathbf{L}\mathbf{V}^T$.
MACHINE LEARNING - Doctoral Class - EDOC
Solution for PPCA: $\mathbf{W} = \mathbf{U}_q\mathbf{L}\mathbf{V}^T$, where the columns of $\mathbf{U}_q$ are the first $q$ eigenvectors of $\frac{1}{M}\mathbf{Y}^T\mathbf{Y}$.
Solution for Dual PPCA: $\mathbf{X} = \mathbf{U}'_q\mathbf{L}\mathbf{V}^T$, where the columns of $\mathbf{U}'_q$ are the first $q$ eigenvectors of $\frac{1}{N}\mathbf{Y}\mathbf{Y}^T$.
The equivalence is of the form $\mathbf{U}'_q \propto \mathbf{Y}\mathbf{U}_q\boldsymbol{\Lambda}_q^{-1/2}$: each dual eigenvector is obtained by mapping the corresponding primal eigenvector through $\mathbf{Y}$ and renormalizing.
If one knows a solution of PPCA, then the solution of Dual PPCA can be computed directly through this equivalence. Marginalizing over the latent variables and marginalizing over the parameters are thus equivalent, but marginalizing over the parameters allows for interesting extensions; see next.
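A minimal numpy sketch of the Dual PPCA solution just described, assuming the noise variance σ² is given: the latent matrix X is read off the eigendecomposition of YYᵀ, with V set to the identity (the solution is invariant to that rotation).

```python
import numpy as np

def dual_ppca(Y, q, sigma2):
    M, N = Y.shape
    S = Y @ Y.T / N                            # M x M inner-product ("dual") matrix
    eigval, eigvec = np.linalg.eigh(S)
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # sort in descending order
    L = np.sqrt(np.maximum(eigval[:q] - sigma2, 0.0))
    return eigvec[:, :q] * L                   # X = U'_q L V^T with V = I

Y = np.random.default_rng(2).normal(size=(30, 12))
Y -= Y.mean(axis=0)
X = dual_ppca(Y, q=2, sigma2=0.1)
print(X.shape)                                 # (30, 2): one latent point per datapoint
```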
MACHINE LEARNING - Doctoral Class - EDOC
To recall, the marginal likelihood in Dual PPCA is given by:
$$p(\mathbf{Y} \mid \mathbf{X}) = \prod_{j=1}^{N} \mathcal{N}\!\left(\mathbf{y}_{:,j} \mid \mathbf{0}, \mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}\right)$$
MACHINE LEARNING - Doctoral Class - EDOC
To recall, the marginal likelihood in Dual PPCA is given by:
$$p(\mathbf{Y} \mid \mathbf{X}) = \prod_{j=1}^{N} \mathcal{N}\!\left(\mathbf{y}_{:,j} \mid \mathbf{0}, \mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}\right)$$
Note that each factor of this form is a distribution known as a Gaussian Process. Hence the marginal likelihood in Dual PPCA is the combination of N independent Gaussian Processes, one for each data dimension.
MACHINE LEARNING - Doctoral Class - EDOC
Let's have another look at the marginal likelihood in Dual PPCA:
$$p(\mathbf{Y} \mid \mathbf{X}) = \prod_{j=1}^{N} \mathcal{N}\!\left(\mathbf{y}_{:,j} \mid \mathbf{0}, \mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}\right)$$
Gaussian Processes have many useful properties (e.g. for modeling non-linear functional dependencies); see the next lecture. For now, we just need to pay attention to the covariance function.
MACHINE LEARNING - Doctoral Class - EDOC
Let's have another look at the marginal likelihood in Dual PPCA:
$$p(\mathbf{Y} \mid \mathbf{X}) = \prod_{j=1}^{N} \mathcal{N}\!\left(\mathbf{y}_{:,j} \mid \mathbf{0}, \mathbf{K}\right), \qquad \mathbf{K} = \mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I},$$
where $\mathbf{K}$, built by evaluating a covariance function $k(\mathbf{x}, \mathbf{x}')$ on all pairs of latent points, is known as the covariance matrix. Covariance functions form a special class of functions that must satisfy certain constraints (they must yield positive semi-definite covariance matrices; we will discuss these constraints next time).
MACHINE LEARNING - Doctoral Class - EDOC
For today's lecture, we only need to know that we are free to choose among different covariance functions.
MACHINE LEARNING - Doctoral Class - EDOC
Some 'popular' covariance functions:
Linear covariance: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$
RBF covariance: $k(\mathbf{x}, \mathbf{x}') = \alpha \exp\!\left(-\frac{\gamma}{2}\|\mathbf{x} - \mathbf{x}'\|^2\right)$
(Figure: random functions generated by Gaussian Processes with different covariance functions.)
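Minimal numpy implementations of these two covariance functions (the hyperparameter names alpha and gamma are illustrative assumptions):

```python
import numpy as np

def linear_cov(X1, X2):
    """Linear covariance k(x, x') = x^T x'."""
    return X1 @ X2.T

def rbf_cov(X1, X2, alpha=1.0, gamma=1.0):
    """RBF covariance k(x, x') = alpha * exp(-gamma/2 * ||x - x'||^2)."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-0.5 * gamma * sq)

X = np.random.default_rng(3).normal(size=(5, 2))
print(linear_cov(X, X).shape, rbf_cov(X, X).shape)   # (5, 5) (5, 5)
```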
MACHINE LEARNING - Doctoral Class - EDOC
In the marginal likelihood of Dual PPCA,
$$p(\mathbf{Y} \mid \mathbf{X}) = \prod_{j=1}^{N} \mathcal{N}\!\left(\mathbf{y}_{:,j} \mid \mathbf{0}, \mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}\right),$$
the covariance $\mathbf{X}\mathbf{X}^T + \sigma^2\mathbf{I}$ is the linear covariance function with an additive noise term.
MACHINE LEARNING - Doctoral Class - EDOC
The marginal likelihood is a product of independent Gaussian Processes:
$$p(\mathbf{Y} \mid \mathbf{X}) = \prod_{j=1}^{N} \mathcal{N}\!\left(\mathbf{y}_{:,j} \mid \mathbf{0}, \mathbf{K}\right)$$
What if we use a non-linear covariance function (e.g. RBF) to build $\mathbf{K}$?
MACHINE LEARNING - Doctoral Class - EDOC
With a non-linear covariance function it is no longer possible to simply apply a singular value decomposition to solve for X; the latent variables have to be found by gradient-based optimization of the log-likelihood (e.g. with conjugate gradients), using a kernel such as
$$k(\mathbf{x}_i, \mathbf{x}_j) = \alpha \exp\!\left(-\frac{\gamma}{2}\|\mathbf{x}_i - \mathbf{x}_j\|^2\right) + \sigma^2\delta_{ij}.$$
We get a non-linear mapping into a low-dimensional manifold: the Gaussian Process Latent Variable Model (GP-LVM).
MACHINE LEARNING - Doctoral Class - EDOC
The optimization is computationally heavy. A sparse technique can be used: pick a subset of the datapoints (an active set) according to how much they reduce the posterior process entropy.
We get a non-linear mapping into a low-dimensional manifold: the Gaussian Process Latent Variable Model (GP-LVM).
MACHINE LEARNING - Doctoral Class - EDOC
In contrast to linear models such as PCA and PPCA, which allow for a closed-form solution for the latent variables (or projections in feature space), GP-LVM requires proceeding iteratively through gradient-based optimization. Therefore, a smart initialization is important. Usually, initialization with PCA works fine…
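A toy GP-LVM sketch of this procedure (assumed, not the reference implementation): the latent positions are initialized with PCA and refined by minimizing the negative log marginal likelihood with an RBF kernel. Finite-difference gradients through scipy keep the example short; a practical implementation would use analytic gradients and conjugate gradients.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
Y = rng.normal(size=(20, 5))          # M=20 points, N=5 data dimensions
Y -= Y.mean(axis=0)
M, N = Y.shape
q, sigma2 = 2, 0.1                    # latent dimension and (fixed) noise variance

def rbf(X, gamma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * gamma * sq)

def neg_log_lik(x_flat):
    """Negative log marginal likelihood: N/2 log|K| + 1/2 tr(K^{-1} Y Y^T) + const."""
    X = x_flat.reshape(M, q)
    K = rbf(X) + sigma2 * np.eye(M)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * N * logdet + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))

# PCA initialization of the latent coordinates
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
X0 = Y @ Vt[:q].T

res = minimize(neg_log_lik, X0.ravel(), method="L-BFGS-B")
X_latent = res.x.reshape(M, q)
print("NLL after optimization:", res.fun)
```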
MACHINE LEARNING - Doctoral Class - EDOC
But not always… We have already seen the ‘swiss roll’ dataset when we discussed kPCA.
For this data the true structure is known: the manifold is a two dimensional square twisted into a spiral along one of its dimensions and living in a three dimensional space.
MACHINE LEARNING - Doctoral Class - EDOC
GP-LVM with PCA initialization gives poor results; initialization with Isomap recovers the original structure of the data.
MACHINE LEARNING - Doctoral Class - EDOC
In this example, one combines the strengths of two different approaches: Isomap provides a unique solution which can recover the structure of the manifold, while GP-LVM provides an underlying probabilistic model and an easy way to compute the mapping from the latent to the observed space. Due to the probabilistic nature of the GP-LVM, we can also compare the resulting models through their log-likelihood: the log-likelihood of the model initialized with Isomap is much higher than that of the model initialized with PCA (−45 vs. …).
MACHINE LEARNING - Doctoral Class - EDOC
Red crosses, green circles and blue + signs represent stratified, annular and homogeneous flows respectively. The grayscale in the right figure indicates the precision with which the manifold was expressed in data-space for that latent point.
Multi-phase oil flow data : 12 dimensional data set containing data of three known classes corresponding to the phase of flow in an oil pipeline: stratified, annular and homogeneous.
PCA (GP-LVM with the linear kernel) GP-LVM with the RBF kernel
MACHINE LEARNING - Doctoral Class - EDOC
Red crosses, green circles and blue plus signs represent stratified, annular and homogeneous flows respectively. The grayscale in the plot indicates the precision with which the manifold was expressed in data-space for that latent point.
Multi-phase oil flow data: a 12-dimensional data set containing data of three known classes corresponding to the phase of flow in an oil pipeline: stratified, annular and homogeneous.
PCA (GP-LVM with the linear kernel): overlap of stratified and annular flows. GP-LVM with the RBF kernel: flows are separated.
MACHINE LEARNING - Doctoral Class - EDOC
Red crosses, green circles and blue plus signs represent stratified, annular and homogeneous flows respectively. The grayscale in the plot indicates the precision with which the manifold was expressed in data-space for that latent point.
Multi-phase oil flow data: a 12-dimensional data set containing data of three known classes corresponding to the phase of flow in an oil pipeline: stratified, annular and homogeneous.
kPCA: overlap of stratified and annular flows. GP-LVM with the RBF kernel: flows are separated.
MACHINE LEARNING - Doctoral Class - EDOC
Most dimensionality reduction techniques preserve local distances: kPCA maps smoothly from the data space to the latent space, so points close in the data space are close in the latent space.
MACHINE LEARNING - Doctoral Class - EDOC
Most dimensionality reduction techniques preserve local distances: kPCA maps smoothly from the original to the latent space, so points close in the data space are close in the latent space, but points close in the latent space may not be close in the data space.
MACHINE LEARNING - Doctoral Class - EDOC
Most dimensionality reduction techniques preserve local distances: kPCA maps smoothly from the original to the latent space, so points close in the data space are close in the latent space, but points close in the latent space may not be close in the data space. GP-LVM, in contrast, maps smoothly from the latent to the data space: points close in the latent space are close in the data space.
MACHINE LEARNING - Doctoral Class - EDOC
Motion capture data of human walking. The paths of the sequences in the latent space are shown as solid lines. The dimension of the original data is 102 (34 markers x 3 coordinates).
MACHINE LEARNING - Doctoral Class - EDOC
The data were obtained from a subject breaking into a run from standing (a cyclic motion); neighbors in time are connected with lines.
MACHINE LEARNING - Doctoral Class - EDOC
We are supposed to see a smooth periodic pattern in the latent space. But, instead …
Non-smooth mapping
MACHINE LEARNING - Doctoral Class - EDOC
Lowe and Tipping [1997] made the latent positions a function of the data. The function was either a multi-layer perceptron or a radial basis function network:
$$x_{ij} = g_j\!\left(\mathbf{y}_{i,:}; \mathbf{B}\right)$$
MACHINE LEARNING - Doctoral Class - EDOC
The same idea can be used to force the GP-LVM to respect local distances [Lawrence and Quiñonero-Candela, 2006]. By constraining each latent point $\mathbf{x}_{i,:}$ to be a smooth mapping of the corresponding data point $\mathbf{y}_{i,:}$, local distances can be respected. This works because in GP-LVM one maximizes w.r.t. the latent variables and does not integrate them out.
MACHINE LEARNING - Doctoral Class - EDOC
GP-LVM normally proceeds by optimizing $L$ with respect to $\mathbf{X}$ using $\frac{\partial L}{\partial \mathbf{X}}$. The back constraints are of the form
$$x_{ij} = g_j\!\left(\mathbf{y}_{i,:}; \mathbf{B}\right),$$
where $\mathbf{B}$ is a set of unknown parameters. We can compute $\frac{\partial L}{\partial \mathbf{B}}$ via the chain rule and optimize with respect to $\mathbf{B}$ instead of $\mathbf{X}$.
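A minimal sketch of this idea, reusing the toy setting above (the kernel widths and the initialization are assumptions): the latent positions are parameterized as X = K_y B, with K_y an RBF kernel matrix computed on the data Y, and the optimization is carried out over B rather than over X.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
Y = rng.normal(size=(20, 5))
Y -= Y.mean(axis=0)
M, N = Y.shape
q, sigma2 = 2, 0.1

def rbf(A, B_, gamma=1.0):
    sq = ((A[:, None, :] - B_[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * gamma * sq)

K_y = rbf(Y, Y, gamma=0.1)              # back-constraint kernel evaluated on the data

def neg_log_lik_B(b_flat):
    X = K_y @ b_flat.reshape(M, q)      # back constraint: X is a smooth function of Y
    K = rbf(X, X) + sigma2 * np.eye(M)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * N * logdet + 0.5 * np.trace(np.linalg.solve(K, Y @ Y.T))

# initialize B so that K_y B approximates the PCA projection of Y
Vt = np.linalg.svd(Y, full_matrices=False)[2]
B0 = np.linalg.lstsq(K_y, Y @ Vt[:q].T, rcond=None)[0]

res = minimize(neg_log_lik_B, B0.ravel(), method="L-BFGS-B")
print("NLL with back constraints:", res.fun)
```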
MACHINE LEARNING - Doctoral Class - EDOC
GP-LVM with back constraints applied to the runner dataset that we have seen before: a smooth cyclical pattern emerges.
MACHINE LEARNING - Doctoral Class - EDOC
Observable data are generated by a low-dimensional dynamical process (e.g. in full-body motion, a dynamical law controlling all body joints can be described in a low-dimensional space).
How can we learn the latent dynamics?
MACHINE LEARNING - Doctoral Class - EDOC
Observable data are generated by a low-dimensional dynamical process (e.g. in full-body motion, a dynamical law controlling all body joints can be described in a low-dimensional space). How can we learn the latent dynamics? Assume the latent-variable mapping together with first-order Markov dynamics:
$$\mathbf{y}_t = g(\mathbf{x}_t; \mathbf{B}) + \mathbf{n}_{y,t}$$
$$\mathbf{x}_t = f(\mathbf{x}_{t-1}; \mathbf{A}) + \mathbf{n}_{x,t}$$
where $\mathbf{A}$ and $\mathbf{B}$ are matrices of parameters (equivalent to $\mathbf{W}$, over which we marginalized in Dual PPCA), and $\mathbf{n}_{y,t}$, $\mathbf{n}_{x,t}$ are isotropic Gaussian noise terms.
MACHINE LEARNING - Doctoral Class - EDOC
The mapping $\mathbf{y}_t = g(\mathbf{x}_t; \mathbf{B}) + \mathbf{n}_{y,t}$ is the standard GP-LVM mapping from the latent to the observed space.
MACHINE LEARNING - Doctoral Class - EDOC
For the standard GP-LVM mapping, marginalize over $\mathbf{B}$ and use the RBF kernel:
$$k_Y(\mathbf{x}, \mathbf{x}') = \alpha \exp\!\left(-\frac{\gamma}{2}\|\mathbf{x} - \mathbf{x}'\|^2\right)$$
MACHINE LEARNING - Doctoral Class - EDOC
The dynamics $\mathbf{x}_t = f(\mathbf{x}_{t-1}; \mathbf{A}) + \mathbf{n}_{x,t}$ is the novel, auto-regressive part of the model.
MACHINE LEARNING - Doctoral Class - EDOC
For the dynamics, marginalize over $\mathbf{A}$ and use the RBF + linear kernel; the marginal distribution becomes more complicated:
$$k_X(\mathbf{x}, \mathbf{x}') = \alpha \exp\!\left(-\frac{\gamma}{2}\|\mathbf{x} - \mathbf{x}'\|^2\right) + \beta\,\mathbf{x}^T\mathbf{x}'$$
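A small numpy sketch of such an RBF + linear covariance for the dynamics (the hyperparameter names are illustrative assumptions); in a GPDM it is evaluated on the latent points x₁ … x_{T−1}:

```python
import numpy as np

def rbf_plus_linear(X1, X2, alpha=1.0, gamma=1.0, beta=1.0):
    """k(x, x') = alpha * exp(-gamma/2 ||x - x'||^2) + beta * x^T x'."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-0.5 * gamma * sq) + beta * (X1 @ X2.T)

X = np.random.default_rng(6).normal(size=(10, 3))   # a latent trajectory of length T=10
K_X = rbf_plus_linear(X[:-1], X[:-1])               # (T-1) x (T-1) dynamics kernel
print(K_X.shape)
```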
MACHINE LEARNING - Doctoral Class - EDOC
Marginalizing over the dynamics parameters gives
$$p(\mathbf{X} \mid \bar{\alpha}) = \frac{p(\mathbf{x}_1)}{(2\pi)^{\frac{(T-1)q}{2}}\,|\mathbf{K}_X|^{\frac{q}{2}}} \exp\!\left(-\frac{1}{2}\,\mathrm{tr}\!\left(\mathbf{K}_X^{-1}\mathbf{X}_{2:T}\mathbf{X}_{2:T}^T\right)\right),$$
where $\mathbf{X}_{2:T} = [\mathbf{x}_2, \ldots, \mathbf{x}_T]^T$ and $\mathbf{K}_X$ is the $(T-1)\times(T-1)$ kernel matrix constructed from $[\mathbf{x}_1, \ldots, \mathbf{x}_{T-1}]$.
MACHINE LEARNING - Doctoral Class - EDOC
Learning the GPDM amounts to minimizing the joint negative log-posterior, which combines the data term and the dynamics term:
$$L = \frac{N}{2}\ln|\mathbf{K}_Y| + \frac{1}{2}\,\mathrm{tr}\!\left(\mathbf{K}_Y^{-1}\mathbf{Y}\mathbf{Y}^T\right) + \frac{q}{2}\ln|\mathbf{K}_X| + \frac{1}{2}\,\mathrm{tr}\!\left(\mathbf{K}_X^{-1}\mathbf{X}_{2:T}\mathbf{X}_{2:T}^T\right) + \text{const},$$
minimized with respect to the latent variables $\mathbf{X}$ and the kernel hyperparameters.
MACHINE LEARNING - Doctoral Class - EDOC
Each pose was defined by 56 Euler angles for the joints, 3 global pose angles, and 3 global translational velocities for the torso. For learning, the data were mean-subtracted and the latent coordinates were initialized with PCA.
MACHINE LEARNING - Doctoral Class - EDOC
Models of the latent dynamics learned from a walking sequence of 2.5 gait cycles.
GP-LVM vs. GPDM: the GP-LVM (without back constraints) yields a non-smooth mapping.
MACHINE LEARNING - Doctoral Class - EDOC
(d) Random trajectories drawn from the model using Monte Carlo. (e) A GPDM of walk data learned with RBF+linear kernel dynamics. The poses were reconstructed from points on the trajectory.
MACHINE LEARNING - Doctoral Class - EDOC
50 missing frames in the 157-frame walk sequence
MACHINE LEARNING - Doctoral Class - EDOC
The latent coordinates for missing data are initialized by cubic spline interpolation from the 3D PCA initialization of observations. Problem: the difficulty of initializing the missing latent positions sufficiently close to the training data.
MACHINE LEARNING - Doctoral Class - EDOC
One option is to learn the model on a reduced data sequence, oversmoothing the data and generating a more uncertain but smoother distribution over the latent variables. Another is to keep the hyperparameters fixed and optimize only w.r.t. the latent variables: the missing coordinates are then drawn toward the latent coordinates of the training data, and a smooth model is learned.
MACHINE LEARNING - Doctoral Class - EDOC
1. N. Lawrence. Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models. Journal of Machine Learning Research, 2005.
2. M. Tipping and C. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, 1999.
3. N. Lawrence and J. Quiñonero-Candela. Local Distance Preservation in the GP-LVM through Back Constraints. ICML, 2006.
4. J. Wang, D. Fleet and A. Hertzmann. Gaussian Process Dynamical Models. Advances in Neural Information Processing Systems, 2005.
5. The tutorial and the demo software from the website http://www.cs.man.ac.uk/~neill/
MACHINE LEARNING - Doctoral Class - EDOC
Limitations of the global methods seen previously (e.g. SVR):
- … choice of the kernel function.
- … number of support vectors.
- … setting this penalty is not easy.
- It is difficult to train on data incrementally, as new data may change drastically which datapoints are used as SVs (neural networks with Hebbian learning provide good alternatives for incremental learning, but they are not ensured to find an optimal solution).
MACHINE LEARNING - Doctoral Class - EDOC
Function approximation remains a nontrivial problem, especially in incremental and real-time formulations. There are two broad classes of function approximation methods:
1. Methods that fit nonlinear functions globally, typically by input-space expansions (feature space) with predefined or parameterized basis functions and subsequent linear combinations of the expanded inputs (e.g. SVR, GPLVM (today's lecture), GPR (next week)).
2. Methods that fit nonlinear functions locally, usually by using spatially localized simple models in the original input space and automatically adjusting the complexity (e.g. number of local models and their locality) to accurately account for the nonlinearities and distributions of the target function (e.g. LWPR (today's lecture), GMR (next week)).
MACHINE LEARNING - Doctoral Class - EDOC
It would be useful to design a regression method that best estimates the linear dependencies locally: locally weighted regression (LWR, Atkeson et al., 1997).
MACHINE LEARNING - Doctoral Class - EDOC
For a set of $M$ multi-dimensional pairs of datapoints $(\mathbf{x}_i, y_i)$, a linear model of the form
$$y_i = f(\mathbf{x}_i) = \boldsymbol{\beta}^T\mathbf{x}_i,$$
where $\boldsymbol{\beta}$ is the slope of the function, is optimized by minimizing the MSE:
$$J = \sum_{i=1}^{M}\left(y_i - \boldsymbol{\beta}^T\mathbf{x}_i\right)^2$$
The data set can be tailored to the query point $\mathbf{x}'$ by emphasizing nearby points in the regression. One can do this by weighting the training criterion:
$$J = \sum_{i=1}^{M}\left(y_i - \boldsymbol{\beta}^T\mathbf{x}_i\right)^2 K\!\left(d(\mathbf{x}_i, \mathbf{x}')\right),$$
where $K(\cdot)$ is the weighting or kernel function and $d(\mathbf{x}_i, \mathbf{x}')$ is the distance between the data point $\mathbf{x}_i$ and the query point $\mathbf{x}'$.
MACHINE LEARNING - Doctoral Class - EDOC
Using this criterion, each $f(\mathbf{x}, \mathbf{x}')$ becomes a local model, and a different set of parameters $\boldsymbol{\beta}$ can be obtained for each query point $\mathbf{x}'$.
MACHINE LEARNING - Doctoral Class - EDOC
If we assume each local model to be linear in the parameters, i.e. $f(\mathbf{x}) = \boldsymbol{\beta}^T\mathbf{x}$, finding the optimal parameters in unweighted regression is done by minimizing the MSE:
$$J = \sum_{i=1}^{M}\left(y_i - \boldsymbol{\beta}^T\mathbf{x}_i\right)^2 \;\Rightarrow\; \boldsymbol{\beta} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}$$
In weighted regression, we set $w_i = K\!\left(d(\mathbf{x}_i, \mathbf{x}')\right)$. When optimizing for a given query point $\mathbf{x}'$, let $\mathbf{W}$ be the diagonal matrix with entries $\sqrt{w_i}$ and define $\mathbf{Z} = \mathbf{W}\mathbf{X}$, $\mathbf{v} = \mathbf{W}\mathbf{y}$; one then gets an estimator for $\hat{y}$ at the query point:
$$\hat{y}(\mathbf{x}') = \mathbf{x}'^T\hat{\boldsymbol{\beta}}, \qquad \hat{\boldsymbol{\beta}} = \left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{v}$$
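A minimal sketch of this estimator (the Gaussian kernel and the bandwidth h are assumptions): weights are computed around the query point, the inputs and targets are reweighted, and the local coefficients follow from (ZᵀZ)⁻¹Zᵀv.

```python
import numpy as np

def lwr_predict(X, y, x_query, h=0.3):
    d = np.linalg.norm(X - x_query, axis=1)        # distances d(x_i, x') to the query point
    w = np.exp(-0.5 * (d / h) ** 2)                # kernel weights K(d(x_i, x'))
    sw = np.sqrt(w)
    Z, v = X * sw[:, None], y * sw                 # weighted inputs and targets
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)   # beta_hat = (Z^T Z)^{-1} Z^T v
    return x_query @ beta                          # y_hat(x') = x'^T beta_hat

rng = np.random.default_rng(7)
X = np.hstack([rng.uniform(-2, 2, size=(200, 1)), np.ones((200, 1))])  # input plus bias term
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=200)                   # non-linear target
print(lwr_predict(X, y, np.array([0.5, 1.0])))     # locally linear estimate near x' = 0.5
```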
MACHINE LEARNING - Doctoral Class - EDOC
Unweighted Regression Weighted Regression
MACHINE LEARNING - Doctoral Class - EDOC
Global function approximation methods and their drawbacks:
- Classical neural networks: slow; how to choose the structure?
- Mixture models: expensive; local minima.
- Support vector machines, Gaussian process regression: O(N²)–O(N³).

Nonparametric statistics (a.k.a. locally weighted learning): instead of one global optimization process (computationally expensive, slow) over
$$J = \sum_{k=1}^{K}\sum_{i=1}^{M} w_{k,i}\left(y_i - \boldsymbol{\beta}_k^T\mathbf{x}_i\right)^2,$$
solve K independent local linear optimization problems,
$$J_k = \sum_{i=1}^{M} w_{k,i}\left(y_i - \boldsymbol{\beta}_k^T\mathbf{x}_i\right)^2,$$
where the weights $w_{k,i}$ are given by a kernel.
MACHINE LEARNING - Doctoral Class - EDOC
Choosing the actual number of local models is often difficult, as a poor choice can lead to over- or under-fitting.
MACHINE LEARNING - Doctoral Class - EDOC
Nonparametric statistics (a.k.a. locally weighted learning): K independent local linear optimization problems,
$$J_k = \sum_{i=1}^{M} w_{k,i}\left(y_i - \boldsymbol{\beta}_k^T\mathbf{x}_i\right)^2,$$
with Gaussian receptive-field weights and a weighted least-squares solution for each local model:
$$w_{k,i} = \exp\!\left(-\frac{1}{2}\left(\mathbf{x}_i - \mathbf{x}'\right)^T\mathbf{D}_k\left(\mathbf{x}_i - \mathbf{x}'\right)\right), \qquad \boldsymbol{\beta}_k = \left(\mathbf{X}^T\mathbf{W}_k\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{W}_k\mathbf{Y}$$
MACHINE LEARNING - Doctoral Class - EDOC
Approximate non-linear functions with a combination of local models:
$$w_{k,i} = \exp\!\left(-\frac{1}{2}\left(\mathbf{x}_i - \mathbf{x}'\right)^T\mathbf{D}_k\left(\mathbf{x}_i - \mathbf{x}'\right)\right), \qquad \boldsymbol{\beta}_k = \left(\mathbf{X}^T\mathbf{W}_k\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{W}_k\mathbf{Y}$$
$$\hat{y}_k = \mathbf{x}'^T\boldsymbol{\beta}_k, \qquad \hat{y} = \frac{\sum_k w_k\,\hat{y}_k}{\sum_k w_k}$$
Solving this problem in high-dimensional spaces leads to LWPR: Sethu Vijayakumar, Aaron D'Souza and Stefan Schaal, Online Learning in High Dimensions, Neural Computation, vol. 17, pp. 2602-2634 (2005).
MACHINE LEARNING - Doctoral Class - EDOC
The LWPR model:
$$y = \boldsymbol{\beta}^T\mathbf{x} + \beta_0 = \boldsymbol{\beta}^T\tilde{\mathbf{x}}, \qquad \tilde{\mathbf{x}} = \begin{bmatrix}\mathbf{x}^T & 1\end{bmatrix}^T$$
$$w = \exp\!\left(-\frac{1}{2}\left(\mathbf{x} - \mathbf{c}\right)^T\mathbf{D}\left(\mathbf{x} - \mathbf{c}\right)\right), \qquad \mathbf{D} = \mathbf{M}^T\mathbf{M}$$
$$\hat{y} = \frac{\sum_{k=1}^{K} w_k\,\hat{y}_k}{\sum_{k=1}^{K} w_k}$$
Open parameters: the local regression coefficients $\boldsymbol{\beta}_k$ and the distance metrics $\mathbf{D}_k$ (through $\mathbf{M}_k$).
MACHINE LEARNING - Doctoral Class - EDOC
$\mathbf{D}$ is usually a diagonal matrix, and the centers $\mathbf{c}_k$ of the local models are fixed, not learned.
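A simplified batch sketch of prediction with such a set of local models (the fixed centres, the scalar distance metric and the batch weighted least-squares fit are assumptions; real LWPR fits each model online with partial least squares and adapts D_k):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * x[:, 0]) + 0.05 * rng.normal(size=300)
X = np.hstack([x, np.ones((300, 1))])            # augmented input [x, 1]

centres = np.linspace(-3, 3, 9)                  # fixed centres of the receptive fields
D = 4.0                                          # scalar distance metric (an assumption)

def weights(xs, c):
    """Gaussian receptive-field activation w = exp(-D/2 (x - c)^2)."""
    return np.exp(-0.5 * D * (xs - c) ** 2)

betas = []
for c in centres:                                # K independent weighted regressions
    sw = np.sqrt(weights(x[:, 0], c))[:, None]
    beta, *_ = np.linalg.lstsq(X * sw, y * sw[:, 0], rcond=None)
    betas.append(beta)

def predict(x_query):
    xq = np.array([x_query, 1.0])
    w = weights(x_query, centres)                # activation of each receptive field
    y_k = np.array([xq @ b for b in betas])      # local linear predictions y_hat_k
    return (w * y_k).sum() / w.sum()             # normalized blend of the local models

print(predict(0.7), np.sin(1.4))                 # prediction vs. the true value
```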
MACHINE LEARNING - Doctoral Class - EDOC
For learning the linear models Ψ(x), LWPR employs an online formulation of weighted partial least squares (PLS) regression.
MACHINE LEARNING - Doctoral Class - EDOC
Partial Least Squares (PLS) refers to a wide class of methods for modeling relations between sets of observed variables by means of latent variables. It comprises regression and classification tasks as well as dimension-reduction techniques and modeling tools.
Given the datasets $\{\mathbf{x}_i\}_{i=1}^{M}$ and $\{\mathbf{y}_i\}_{i=1}^{M}$, least-squares regression looks for a mapping that sends X onto Y, such that:
$$J = \sum_{i=1}^{M}\left\|\mathbf{y}_i - \mathbf{W}^T\mathbf{x}_i\right\|^2$$
MACHINE LEARNING - Doctoral Class - EDOC
PCA, and CCA by extension, can be viewed as regression problems, whereby one set of variables Y is expressed in terms of a linear combination of a second set of variables X. PCA regression does not take the response variable into account when constructing the principal components or latent variables (an unsupervised problem). PLS incorporates information about the response into the model by using latent variables: it creates orthogonal score vectors (also called latent vectors or components) by maximizing the covariance between the different sets. PLS is similar to CCA in its principle; it however differs in the algorithm.
MACHINE LEARNING - Doctoral Class - EDOC
PLS represents a form of CCA, where the criterion of maximal correlation is balanced with the requirement to explain as much variance as possible in both the X and Y spaces:
$$\left(\mathbf{w}_x, \mathbf{w}_y\right) = \arg\max_{\|\mathbf{w}_x\| = \|\mathbf{w}_y\| = 1} \left[\mathrm{cov}\!\left(\mathbf{X}\mathbf{w}_x, \mathbf{Y}\mathbf{w}_y\right)\right]^2$$
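A minimal sketch of the first PLS direction under this criterion: for centred X and Y, the weight pair maximizing the covariance of the scores is given by the leading singular vectors of XᵀY (the data below are synthetic, purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 6)); X -= X.mean(axis=0)
Y = X[:, :2] @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(100, 3))
Y -= Y.mean(axis=0)

U, s, Vt = np.linalg.svd(X.T @ Y)       # cross-covariance matrix (up to a 1/M factor)
w_x, w_y = U[:, 0], Vt[0]               # first pair of PLS weight vectors
t, u = X @ w_x, Y @ w_y                 # latent score vectors
print("covariance of the first pair of scores:", (t @ u) / len(t))
```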
MACHINE LEARNING - Doctoral Class - EDOC
MACHINE LEARNING - Doctoral Class - EDOC
For learning the linear models Ψ(x), LWPR employs an online formulation of weighted partial least squares (PLS) regression. Within each local model, the input x is projected along selected directions $\mathbf{u}_i$, yielding "latent" variables $s_i = \mathbf{u}_i^T\mathbf{x}$ that are regressed against the output one at a time.
MACHINE LEARNING - Doctoral Class - EDOC
$$\mathbf{P}_k^{n+1} = \frac{1}{\lambda}\left(\mathbf{P}_k^{n} - \frac{\mathbf{P}_k^{n}\mathbf{x}\mathbf{x}^T\mathbf{P}_k^{n}}{\frac{\lambda}{w} + \mathbf{x}^T\mathbf{P}_k^{n}\mathbf{x}}\right), \qquad \boldsymbol{\beta}_k^{n+1} = \boldsymbol{\beta}_k^{n} + w\left(y - \mathbf{x}^T\boldsymbol{\beta}_k^{n}\right)\mathbf{P}_k^{n+1}\mathbf{x}$$
MACHINE LEARNING - Doctoral Class - EDOC
The distance metric D and, hence, the locality of the receptive fields can be learned for each local model individually by stochastic gradient descent on a penalized leave-one-out cross-validation cost function:
$$J = \frac{1}{\sum_{i=1}^{M} w_{k,i}} \sum_{i=1}^{M} w_{k,i}\left(y_i - \hat{y}_{k,i,-i}\right)^2 + \gamma \sum_{i,j=1}^{n} D_{k,ij}^2,$$
where $\hat{y}_{k,i,-i}$ is the leave-one-out estimate of $y_i$. The penalty term (weighted by $\gamma$) prevents the receptive fields from shrinking as the amount of data grows:
$$\mathbf{M}_k^{n+1} = \mathbf{M}_k^{n} - \alpha \frac{\partial J}{\partial \mathbf{M}_k}, \qquad \mathbf{D}_k = \mathbf{M}_k^T\mathbf{M}_k$$
MACHINE LEARNING - Doctoral Class - EDOC
Recursive least squares (with forgetting factor $\lambda$):
$$\boldsymbol{\beta}_k^{n+1} = \boldsymbol{\beta}_k^{n} + w\left(y - \mathbf{x}^T\boldsymbol{\beta}_k^{n}\right)\mathbf{P}_k^{n+1}\mathbf{x}, \qquad \mathbf{P}_k^{n+1} = \frac{1}{\lambda}\left(\mathbf{P}_k^{n} - \frac{\mathbf{P}_k^{n}\mathbf{x}\mathbf{x}^T\mathbf{P}_k^{n}}{\frac{\lambda}{w} + \mathbf{x}^T\mathbf{P}_k^{n}\mathbf{x}}\right)$$
Stochastic leave-one-out cross-validation for the distance metric:
$$\mathbf{M}_k^{n+1} = \mathbf{M}_k^{n} - \alpha \frac{\partial J}{\partial \mathbf{M}_k}, \qquad \mathbf{D}_k = \mathbf{M}_k^T\mathbf{M}_k$$
$$J = \frac{1}{\sum_{i=1}^{N} w_{k,i}} \sum_{i=1}^{N} w_{k,i}\left(y_i - \hat{y}_{k,i,-i}\right)^2 + \gamma \sum_{i,j=1}^{n} D_{k,ij}^2$$
Automatic structure determination: if $\min_k w_k$ falls below a threshold for a new datapoint, a new local model (receptive field) is added.
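A minimal sketch of a weighted recursive least-squares update of this kind (the exact bookkeeping inside LWPR differs, e.g. it runs along the PLS projection directions; the forgetting factor and initialization below are assumptions):

```python
import numpy as np

def rls_update(P, beta, x, y, w=1.0, lam=0.999):
    """One weighted RLS step: update inverse covariance P and coefficients beta."""
    Px = P @ x
    P_new = (P - np.outer(Px, Px) / (lam / w + x @ Px)) / lam
    err = y - x @ beta                        # prediction error on the new sample
    beta_new = beta + w * err * (P_new @ x)   # weighted correction of the coefficients
    return P_new, beta_new

# incremental fit of y = 2*x0 - x1 + 1 using the augmented input [x0, x1, 1]
rng = np.random.default_rng(10)
P, beta = np.eye(3) * 100.0, np.zeros(3)
for _ in range(500):
    x = np.append(rng.normal(size=2), 1.0)
    y = 2 * x[0] - x[1] + 1 + 0.01 * rng.normal()
    P, beta = rls_update(P, beta, x, y)
print(beta)    # should approach [2, -1, 1]
```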
MACHINE LEARNING - Doctoral Class - EDOC
(Figure: a) global function fitting with a sigmoidal neural network; b) local function fitting with receptive fields; c) learned organization of the receptive fields. Legend: new training data, true y, predicted y, predicted y after new training data.)
MACHINE LEARNING - Doctoral Class - EDOC
Increasing the number of components leads to a better fit of the local linearities.
MACHINE LEARNING - Doctoral Class - EDOC
Sethu Vijayakumar @ Univ. of Edinburgh
MACHINE LEARNING - Doctoral Class - EDOC
DIMENSIONALITY REDUCTION and FEATURE ANALYSIS: … data in feature space (kPCA), … through latent variables.
FUNCTION ESTIMATION (regression): deal with missing data, handle noise, … (SVR).