MACHINE LEARNING – 2012
Kernel Methods for Regression
Support Vector Regression, Gaussian Mixture Regression, Gaussian Process Regression
Problem

Estimate $f$ such that $y = f(x)$ best predicts the set of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$.

[Figure: training pairs $(x^1, y^1), \dots, (x^4, y^4)$ plotted in the $(x, y)$ plane.]
Non-linear regression: fit the data with a function that is not linear in the parameters.
Non-parametric regression: use the data themselves to determine the parameters of the regression problem.
Kernel trick: send the data into a feature space through a non-linear function $\phi$ and perform linear regression in feature space, $y^i = f(x^i) = \langle w, \phi(x^i) \rangle + b$.
Good prediction depends on the choice of datapoints. The more datapoints, the better the fit; but the computational cost increases dramatically with the number of datapoints.

[Figure: blue curve is the true function, red curve the estimated function.]
Several methods exist in ML for performing non-linear regression. They differ in the objective function and in the number of parameters:
- Gaussian Process Regression (GPR) uses all the datapoints.
- Support Vector Regression (SVR) picks a subset of the datapoints (the support vectors).
- Gaussian Mixture Regression (GMR) generates a new set of datapoints (the centers of the Gaussian functions).

[Figure: blue curve is the true function, red curve the estimated function.]
Deterministic regressive model: $y = f(x) + \epsilon$, with noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Build an estimate of the noise model and then compute $f$ directly (Support Vector Regression).

Probabilistic regressive model: estimate the nonlinear relationship between $y$ and $x$ through the conditional density $p(y \mid x)$ (Gaussian Mixture Regression, Gaussian Process Regression).

Support Vector Regression
How can the support vector machine framework for classification be generalized to estimate continuous functions from training pairs $\{(x^i, y^i)\}_{i=1,\dots,M}$?
1. Assume a non-linear mapping into feature space and then perform linear regression in feature space.
2. Supervised learning minimizes an error function: first determine a way to measure the error, on the training set, in the linear case!
Assume a linear mapping $f$, such that $y = f(x) = \langle w, x \rangle + b$.

Measure the error on the prediction: $|y - f(x)|$.

($b$ is estimated, as in SVM, through least-squares regression on the support vectors; hence we omit it from the rest of the development.)

How can we estimate $w$ and $b$ to best predict the pairs of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$?
Set an upper bound $\varepsilon$ on the error and consider as correctly "classified" all points such that $|f(x) - y| \le \varepsilon$. Penalize only the datapoints that are not contained in the $\varepsilon$-tube.

[Figure: the $\varepsilon$-insensitive tube of width $\pm\varepsilon$ around $f(x)$.]
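As a minimal illustration (ours, not the slides'; the function name and example values are assumptions), the $\varepsilon$-insensitive loss can be written in a few lines of NumPy:

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """epsilon-insensitive loss: zero inside the eps-tube,
    linear penalty |y - f(x)| - eps outside it."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

# Points within eps of the prediction incur no penalty:
print(eps_insensitive_loss(np.array([1.0, 1.05, 2.0]),
                           np.array([1.0, 1.00, 1.0])))   # [0.  0.  0.9]
```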
The $\varepsilon$-margin is a measure of the width of the $\varepsilon$-insensitive tube, and hence of the precision of the regression. A small $\|w\|$ corresponds to a small slope for $f$: in the linear case $y = \langle w, x \rangle + b$, $f$ is more horizontal.
A large $\|w\|$ corresponds to a large slope for $f$: in the linear case, $f$ is more vertical. The flatter the slope of $f$, the larger the margin. To maximize the margin, we must minimize the norm of $w$.
This can be rephrased as a constraint-based optimization problem:

minimize $\frac{1}{2}\|w\|^2$

subject to $\langle w, x^i \rangle + b - y^i \le \varepsilon$ and $y^i - \langle w, x^i \rangle - b \le \varepsilon$, for $i = 1, \dots, M$.

These hard constraints may be infeasible: we need to penalize the points that lie outside the $\varepsilon$-insensitive tube.
Introduce slack variables $\xi_i, \xi_i^* \ge 0$:

minimize $\frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$

subject to $\langle w, x^i \rangle + b - y^i \le \varepsilon + \xi_i$, $\;y^i - \langle w, x^i \rangle - b \le \varepsilon + \xi_i^*$, $\;\xi_i, \xi_i^* \ge 0$, for $i = 1, \dots, M$.
All the points outside the $\varepsilon$-tube, i.e. those with a non-zero slack $\xi_i$ or $\xi_i^*$, become support vectors.
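As a sketch of this primal (ours, not the slides'; it assumes the cvxpy library, and the data and parameter values are made up), the linear $\varepsilon$-SVR can be handed directly to an off-the-shelf convex solver:

```python
import cvxpy as cp
import numpy as np

def linear_eps_svr(X, y, C=10.0, eps=0.1):
    """Solve the linear eps-SVR primal:
    min 1/2 ||w||^2 + C/M * sum(xi + xi_star)
    s.t. residuals stay within the eps-tube up to the slacks."""
    M, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(M, nonneg=True)        # slack above the tube
    xi_star = cp.Variable(M, nonneg=True)   # slack below the tube
    objective = cp.Minimize(0.5 * cp.sum_squares(w)
                            + (C / M) * cp.sum(xi + xi_star))
    constraints = [X @ w + b - y <= eps + xi,
                   y - X @ w - b <= eps + xi_star]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# Noisy linear data; the recovered slope should be close to 2.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (50, 1))
y = 2.0 * X[:, 0] + 0.05 * rng.standard_normal(50)
print(linear_eps_svr(X, y))
```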
We now have the solution to the linear regression problem. How can this be generalized to the nonlinear case?
Lift $x$ into feature space and then perform linear regression in feature space:

Linear case: $y = f(x) = \langle w, x \rangle + b$
Non-linear case: $x \to \phi(x)$, $\;y = f(x) = \langle w, \phi(x) \rangle + b$, where $w$ now lives in feature space!
The primal problem keeps the same form, with $x^i$ replaced by $\phi(x^i)$: minimize $\frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$, subject to the same $\varepsilon$-tube constraints on $\langle w, \phi(x^i) \rangle + b$.
Lagrangian = objective function + Lagrange multipliers $\times$ constraints:

$L(w, b, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*) - \sum_{i=1}^{M}\alpha_i\left(\varepsilon + \xi_i - y^i + \langle w, x^i \rangle + b\right) - \sum_{i=1}^{M}\alpha_i^*\left(\varepsilon + \xi_i^* + y^i - \langle w, x^i \rangle - b\right) - \sum_{i=1}^{M}\left(\eta_i \xi_i + \eta_i^* \xi_i^*\right)$

with Lagrange multipliers $\alpha_i, \alpha_i^*, \eta_i, \eta_i^* \ge 0$.
Requiring that the partial derivatives are all zero,

$\frac{\partial L}{\partial b} = \sum_{i=1}^{M}(\alpha_i - \alpha_i^*) = 0, \qquad \frac{\partial L}{\partial w} = w - \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\,x^i = 0 \;\Rightarrow\; w = \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\,x^i$

and replacing in the primal Lagrangian, we get the dual optimization problem:

$\max_{\alpha, \alpha^*} \; -\frac{1}{2}\sum_{i,j=1}^{M}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,k(x^i, x^j) \;-\; \varepsilon\sum_{i=1}^{M}(\alpha_i + \alpha_i^*) \;+\; \sum_{i=1}^{M} y^i(\alpha_i - \alpha_i^*)$

subject to $\sum_{i=1}^{M}(\alpha_i - \alpha_i^*) = 0$ and $\alpha_i, \alpha_i^* \in \left[0, \frac{C}{M}\right]$.
The solution is given by:

$f(x) = \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\,k(x^i, x) + b$

The $(\alpha_i - \alpha_i^*)$ are linear coefficients (the Lagrange multipliers of the constraints). With a Gaussian kernel, this places up to $M$ Gaussians, one centered on each training datapoint.
With a Gaussian kernel, the solution places a Gauss function on each support vector.

[Figure: the fitted curve as a sum of Gaussian bumps centered on the support vectors.]
The Lagrange multipliers define the importance of each Gaussian function.

[Figure: six support vectors $x^1, \dots, x^6$ with example coefficient values such as $\alpha_1 - \alpha_1^* = 1.5$; away from the data, the effect of the support vectors vanishes and $f$ converges to $b$.]
The solution we just derived is referred to as $\varepsilon$-SVR: minimize $\frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$ subject to the $\varepsilon$-tube constraints. It has two hyperparameters: $C$ controls the penalty on a poor fit, and $\varepsilon$ determines the minimal required precision.
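A quick way to see the role of these hyperparameters is a sketch with scikit-learn's SVR, which solves this $\varepsilon$-SVR dual (note that its C is not normalized by M); the synthetic data and parameter values are ours:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine wave as a 1-D regression problem.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 100)).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)

# RBF kernel: gamma = 1 / (2 * width^2) plays the role of the kernel width.
svr = SVR(kernel="rbf", C=1000.0, epsilon=0.01, gamma=1.0 / (2 * 0.1**2))
svr.fit(X, y)

# Only the points outside the eps-tube become support vectors.
print("support vectors:", len(svr.support_), "of", len(X))
y_pred = svr.predict(X)
```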
Effect of the RBF kernel width on the fit. Here the fit uses $C = 1000$, $\varepsilon = 0.01$, kernel width $= 0.1$.
With kernel width $= 0.01$ (same $C = 1000$, $\varepsilon = 0.01$), the fit overfits.
With $C = 100$, $\varepsilon = 0.03$, kernel width $= 0.1$: choosing appropriate hyperparameters reduces the effect of the kernel width on the fit.
As the number of datapoints grows, so does the number of support vectors. Introduce a new parameter $\nu$, as in $\nu$-SVM: $\nu$-SVR puts an upper bound on the fraction of support vectors and fits the $\varepsilon$-tube automatically (see the sketch below):

minimize $\frac{1}{2}\|w\|^2 + C\left(\nu\varepsilon + \frac{1}{M}\sum_{j=1}^{M}(\xi_j + \xi_j^*)\right)$, where $\varepsilon$ is now a variable of the optimization.
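A minimal sketch with scikit-learn's NuSVR (ours; synthetic data, with nu bounding the fraction of support vectors while $\varepsilon$ is adapted automatically):

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 2 * np.pi, 200)).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# nu upper-bounds the fraction of points outside the tube and
# lower-bounds the fraction of support vectors.
nusvr = NuSVR(kernel="rbf", C=100.0, nu=0.2, gamma=1.0)
nusvr.fit(X, y)
print("fraction of SVs:", len(nusvr.support_) / len(X))
```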
Effect of the automatic adaptation of $\varepsilon$ when using $\nu$-SVR.
Effect of the automatic adaptation of $\varepsilon$ when using $\nu$-SVR, with added noise on the data.
Gaussian Mixture Regression
Deterministic regressive model: $y = f(x) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

Probabilistic regressive model: build a probabilistic estimate of the nonlinear relationship between $y$ and $x$ through the conditional density $p(y \mid x)$ (this estimates both the noise model and $f$), and then compute the estimate by taking the expectation over the conditional density: $\hat{y} = E\{p(y \mid x)\}$.
1) Estimate the joint density $p(x, y)$ across pairs of datapoints using a Gaussian Mixture Model (GMM):

$p(x, y) = \sum_{i=1}^{K} \pi_i\, \mathcal{N}(x, y;\ \mu_i, \Sigma_i)$

with $\mu_i, \Sigma_i$ the mean and covariance matrix of Gaussian $i$.

[Figure: 2D projection of a Gauss function; the ellipse contour corresponds to roughly 2 standard deviations.]
The parameters $\{\pi_i, \mu_i, \Sigma_i\}_{i=1}^{K}$ are learned through Expectation-Maximization (EM), an iterative procedure that starts from a random initialization.
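A sketch of this estimation step (ours; scikit-learn's GaussianMixture runs EM on the stacked $(x, y)$ pairs, and the data are synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Joint density estimation: stack inputs and outputs column-wise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 300))
y = np.sin(x) + 0.1 * rng.standard_normal(300)
XY = np.column_stack([x, y])                    # shape (M, 2)

# Fit p(x, y) with K Gaussians via EM; n_init restarts guard against bad init.
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      n_init=5, random_state=0).fit(XY)
print(gmm.weights_)   # mixing coefficients pi_i
print(gmm.means_)     # mu_i, one row per Gaussian
# gmm.covariances_ holds the Sigma_i
```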
The mixing coefficients $\pi_i$, with $\sum_{i=1}^{K}\pi_i = 1$, give the average probability, over the $M$ datapoints, that a datapoint was generated by Gaussian $i$; in EM they are estimated as

$\pi_i = \frac{1}{M}\sum_{j=1}^{M} p(i \mid x^j, y^j)$
2) Compute the regressive signal by taking $p(y \mid x)$:

$p(y \mid x) = \sum_{i=1}^{K} \gamma_i(x)\, \mathcal{N}\!\left(y;\ \tilde{\mu}_i(x), \tilde{\Sigma}_i\right), \qquad \gamma_i(x) = \frac{\pi_i\, \mathcal{N}(x;\ \mu_i^x, \Sigma_i^{xx})}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x;\ \mu_j^x, \Sigma_j^{xx})}$

Each conditional Gaussian has mean $\tilde{\mu}_i(x) = \mu_i^y + \Sigma_i^{yx}(\Sigma_i^{xx})^{-1}(x - \mu_i^x)$ and variance $\tilde{\Sigma}_i = \Sigma_i^{yy} - \Sigma_i^{yx}(\Sigma_i^{xx})^{-1}\Sigma_i^{xy}$. The variance of $p(y \mid x)$ changes depending on the query point.
The factors $\gamma_i(x)$ give the relative importance of each of the $K$ local regressive models; they are recomputed at each query point, yielding a weighted regression.

[Figure: at a query point, the influence of each marginal is modulated by $\gamma_i(x)$.]
3) The regressive signal is then obtained by computing the expectation of $p(y \mid x)$:

$E\{p(y \mid x)\} = \sum_{i=1}^{K} \gamma_i(x)\, \tilde{\mu}_i(x)$

a linear combination of $K$ local regressive models.
Computing the variance of $p(y \mid x)$ provides information on the uncertainty of the prediction:

$\mathrm{var}\{p(y \mid x)\} = \sum_{i=1}^{K} \gamma_i(x)\left(\tilde{\mu}_i(x)^2 + \tilde{\Sigma}_i\right) - \left(\sum_{i=1}^{K} \gamma_i(x)\, \tilde{\mu}_i(x)\right)^{2}$

Careful: this is not the uncertainty of the model. Use the likelihood to compute the uncertainty of the predictor! (A sketch combining steps 2 and 3 follows below.)
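Putting steps 2) and 3) together, a compact NumPy sketch of GMR conditioning (ours; it reuses the fitted gmm from the GMM snippet above, with $x$ in the first column and $y$ in the second):

```python
import numpy as np
from scipy.stats import norm

def gmr_predict(gmm, x_query):
    """Gaussian Mixture Regression: condition a fitted 2-D GMM
    p(x, y) on x to get E{p(y|x)} and var{p(y|x)}."""
    mu_x = gmm.means_[:, 0]                  # mu_i^x
    mu_y = gmm.means_[:, 1]                  # mu_i^y
    s_xx = gmm.covariances_[:, 0, 0]
    s_xy = gmm.covariances_[:, 0, 1]
    s_yy = gmm.covariances_[:, 1, 1]

    # gamma_i(x): relative importance of each Gaussian at the query point.
    g = gmm.weights_ * norm.pdf(x_query, mu_x, np.sqrt(s_xx))
    g = g / g.sum()

    # Local linear models: conditional means and variances.
    mu_cond = mu_y + s_xy / s_xx * (x_query - mu_x)
    var_cond = s_yy - s_xy**2 / s_xx

    mean = np.sum(g * mu_cond)                            # E{p(y|x)}
    var = np.sum(g * (mu_cond**2 + var_cond)) - mean**2   # var{p(y|x)}
    return mean, var

print(gmr_predict(gmm, np.pi / 2))   # near (1, small) for the sine data
```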
[Figure: $E\{p(y \mid x)\}$ with the envelope $\pm\sqrt{\mathrm{var}\{p(y \mid x)\}}$; the variance provides information on the uncertainty of the prediction.]
Fit with 4 Gaussians, uniform initialization.
Fit with 4 Gaussians, random initialization.
Fit with 10 Gaussians, random initialization.
Gaussian Process Regression

Probabilistic regressive model: estimate the conditional density $p(y \mid x)$ and compute the estimate $\hat{y} = E\{p(y \mid x)\}$. Gaussian Mixture Regression (GMR) first computes $p(x, y)$ and then derives $p(y \mid x)$; Gaussian Process Regression (GPR) computes $p(y \mid x)$ directly.
A signal $y$ can be estimated through regression $y = f(x)$ by taking the expectation over the conditional density $p(y \mid x)$, for a given choice of the parameters of $p$. The simplest way to estimate $p(y \mid x)$ is through probabilistic regression (PR), which estimates a linear regressive model.
PR is a statistical (Bayesian) approach to classical linear regression. It estimates the relationship between the zero-mean variables $y$ and $x$ by building a linear model $f(x) = w^T x$. If one assumes that the observed values of $y$ differ from $f(x)$ by an additive noise $\epsilon$ that follows a zero-mean Gaussian distribution, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ (such an assumption amounts to putting a prior distribution over the noise), then:

$y = w^T x + \epsilon$
The likelihood of the observations factorizes over the $M$ training pairs:

$p(y \mid X, w) = \prod_{i=1}^{M} p(y^i \mid x^i, w) = \prod_{i=1}^{M} \mathcal{N}\!\left(y^i;\ w^T x^i, \sigma^2\right)$

with $w$ and $\sigma^2$ the parameters of the model.
Prior model on the distribution of the parameters $w$:

$w \sim \mathcal{N}(0, \Sigma_w), \qquad p(w) \propto \exp\!\left(-\tfrac{1}{2}\, w^T \Sigma_w^{-1} w\right)$

The hyperparameters $\Sigma_w$ are given by the user.
Combining the likelihood and the prior with Bayes' rule yields a Gaussian posterior on $w$, and hence a Gaussian predictive distribution at a testing point $x^*$ given the training datapoints $X, y$:

$p(w \mid X, y) \sim \mathcal{N}\!\left(\bar{w}, A^{-1}\right), \qquad \bar{w} = \frac{1}{\sigma^2} A^{-1} X y, \qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$

$p(f^* \mid x^*, X, y) = \mathcal{N}\!\left((x^*)^T \bar{w},\ (x^*)^T A^{-1} x^*\right)$
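A direct NumPy transcription of these formulas (ours; the training inputs are the columns of $X$, as in the derivation, and the example data are made up):

```python
import numpy as np

def bayes_linreg(X, y, sigma2=0.01, Sigma_w=None):
    """Posterior and predictive of Bayesian linear regression.
    X: (d, M) training inputs as columns; y: (M,) targets."""
    d = X.shape[0]
    if Sigma_w is None:
        Sigma_w = np.eye(d)                       # prior covariance on w
    A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)
    w_bar = np.linalg.solve(A, X @ y) / sigma2    # posterior mean of w

    def predict(x_star):
        mean = x_star @ w_bar                     # predictive mean
        var = x_star @ np.linalg.solve(A, x_star) # predictive variance
        return mean, var
    return w_bar, predict

# 1-D example: y ~ 1.7 x with small noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (1, 30))
y = 1.7 * X[0] + 0.1 * rng.standard_normal(30)
w_bar, predict = bayes_linreg(X, y)
print(w_bar, predict(np.array([0.5])))
```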
How can this simple linear Bayesian regressive model be extended to nonlinear regression, such that the nonlinear problem becomes linear again? Replace $x$ by a nonlinear transformation $\phi(x)$:

$y = w^T \phi(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2)$
Non-linear transformation: the same formulas hold with $X$ replaced by $\Phi = [\phi(x^1), \dots, \phi(x^M)]$:

$\bar{w} = \frac{1}{\sigma^2} A^{-1} \Phi\, y, \qquad A = \frac{1}{\sigma^2} \Phi \Phi^T + \Sigma_w^{-1}$

$p(f^* \mid x^*, X, y) = \mathcal{N}\!\left(\phi(x^*)^T \bar{w},\ \phi(x^*)^T A^{-1} \phi(x^*)\right)$
The predictive distribution can be rewritten so that $\phi$ only ever appears through inner products in feature space. Take as kernel

$k(x, x') = \phi(x)^T \Sigma_w\, \phi(x')$

and the predictive mean becomes a weighted sum of kernels evaluated at the $M$ training datapoints:

$f(x^*) = \sum_{i=1}^{M} \alpha_i\, k(x^i, x^*), \qquad \alpha = (K + \sigma^2 I)^{-1} y$
$f(x^*) = \sum_{i=1}^{M} \alpha_i\, k(x^i, x^*)$: all the datapoints are used in the computation!
The kernel and its hyperparameters are given by the user. They can also be optimized by maximizing the marginal likelihood, i.e. $p(y \mid X; \text{parameters})$.
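The whole predictor fits in a few lines of NumPy (a sketch of the equations above, ours; the Gaussian kernel uses lengthscale $l$ and noise variance $\sigma^2$, and the data are synthetic):

```python
import numpy as np

def rbf(a, b, l=0.5):
    """Gaussian (RBF / squared exponential) kernel exp(-(x - x')^2 / l^2)."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / l**2)

def gpr_fit(x_train, y_train, l=0.5, sigma2=0.05):
    """GPR on 1-D data: alpha = (K + sigma^2 I)^{-1} y."""
    K = rbf(x_train, x_train, l) + sigma2 * np.eye(len(x_train))
    alpha = np.linalg.solve(K, y_train)

    def predict(x_star):
        Ks = rbf(x_star, x_train, l)            # k(x*, x^i)
        mean = Ks @ alpha                       # sum_i alpha_i k(x^i, x*)
        v = np.linalg.solve(K, Ks.T)            # (K + sigma^2 I)^{-1} k*
        var = 1.0 - np.sum(Ks * v.T, axis=1)    # k(x*, x*) = 1 for this kernel
        return mean, var
    return predict

# Noisy sine data on [0, 2*pi].
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 50))
y = np.sin(x) + 0.1 * rng.standard_normal(50)
predict = gpr_fit(x, y)
mean, var = predict(np.linspace(0, 2 * np.pi, 5))
```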
Sensitivity to the choice of the kernel width $l$ (called the lengthscale in most textbooks) when using Gaussian kernels (also called RBF or squared exponential), $k(x, x') = \exp\!\left(-\|x - x'\|^2 / l^2\right)$. Kernel width $= 0.1$.
Kernel width $= 0.5$.
The value of the noise $\sigma^2$ in $\alpha = (K + \sigma^2 I)^{-1} y$ needs to be pre-set by hand. The larger the noise, the more uncertainty. Here the noise is $\le 1$.
Low noise: $\sigma^2 = 0.05$.
High noise: $\sigma^2 = 0.2$.
Generalization: prediction away from the datapoints. GPR predicts $y = 0$ away from the datapoints, since $f(x^*) = \sum_{i=1}^{M} \alpha_i\, k(x^i, x^*)$ vanishes when all the kernels do. SVR predicts $y = b$ away from the datapoints, since $f(x) = \sum_{i=1}^{M} (\alpha_i - \alpha_i^*)\, k(x^i, x) + b$.
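This is easy to check with the gpr_fit sketch above (the training data covered $[0, 2\pi]$):

```python
# Far from the training interval every k(x^i, x*) is ~0,
# so the mean reverts to 0 and the variance to k(x*, x*) = 1.
mean_far, var_far = predict(np.array([20.0]))
print(mean_far, var_far)   # ~[0.], ~[1.]
```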
GMR, in contrast, predicts the trend away from the data.
But the GMR prediction depends on the initialization and on the solution found during training.

The variance of $p(y \mid x)$ in GMR does not represent the uncertainty of the predictive model; the variance of $p(y \mid x)$ in GPR does. In SVR, the "variance" is the $\varepsilon$-tube, which does not represent the uncertainty of the model either: there is no measure of uncertainty in SVR!
As predictors, all three yield Gaussian distributions (true only when using Gaussian kernels for GPR and SVR). GPR can predict only a uni-dimensional output $y$.
SVR, GMR and GPR are based on the same probabilistic regressive model, but they do not optimize the same objective function and hence find different solutions. Each is ensured to find an optimal estimate with respect to its own objective, but not a unique common solution.
SVR, GMR and GPR all depend on hyperparameters that need to be determined beforehand: $\varepsilon$ (or $\nu$), $C$ and the kernel parameters for SVR; the number of Gaussians $K$ for GMR; the kernel lengthscale and the noise for GPR.

The hyperparameters can be optimized separately; e.g. the number of Gaussians in GMR can be estimated using BIC, the lengthscale and noise of GPR can be estimated through maximum likelihood, and the kernel parameters of SVR can be optimized through grid search (see the sketch below).
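A sketch of two of these procedures (ours; synthetic data as before, scikit-learn's GridSearchCV for SVR and the bic method of GaussianMixture for GMR):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + 0.1 * rng.standard_normal(200)
X = x.reshape(-1, 1)

# SVR: grid search over C, epsilon and the RBF kernel width (gamma).
grid = GridSearchCV(SVR(kernel="rbf"),
                    {"C": [1, 10, 100, 1000],
                     "epsilon": [0.01, 0.03, 0.1],
                     "gamma": [0.1, 1.0, 10.0]},
                    cv=5)
grid.fit(X, y)
print("best SVR params:", grid.best_params_)

# GMR: pick the number of Gaussians K that minimizes the BIC.
XY = np.column_stack([x, y])
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(XY).bic(XY)
        for k in range(1, 11)}
print("best K:", min(bics, key=bics.get))
```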
There is no easy way to determine which regression technique best fits your problem:

- SVR: keeps few support vectors, a small fraction of the original data; convex optimization.
- GMR: keeps a few Gaussian functions (new datapoints); EM, an iterative technique that needs several runs.
- GPR: keeps all the datapoints; analytical solution, one shot, but uses all the datapoints.