

Kernel Methods for Regression

Support Vector Regression, Gaussian Mixture Regression, Gaussian Process Regression

Problem Statement

Predict output $y$ given input $x$ through a non-linear function $f$: $y = f(x)$.

Estimate the $f$ that best predicts the set of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$.

[Figure: training datapoints $(x^1, y^1), (x^2, y^2), (x^3, y^3), (x^4, y^4)$ in the $(x, y)$ plane.]

Non-linear regression and the Kernel Trick

Non-linear regression: fit the data with a function that is not linear in the parameters.

Non-parametric regression: use the data to determine the parameters of the function, so that the problem can again be phrased as a linear regression problem.

Kernel trick: send the data into feature space with a non-linear function and perform linear regression in feature space:

$y = f(x; \theta)$, with $\theta$ the parameters of the function

$y = \sum_i \theta_i\, k(x, x^i)$, with $x^i$ the datapoints and $k$ the kernel function
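To make this kernel form concrete, here is a minimal sketch (not from the slides) of a non-parametric regressor of the form $y = \sum_i \theta_i\, k(x, x^i)$, using kernel ridge regression with a Gaussian kernel; the toy dataset and the ridge term lam are assumptions:

```python
import numpy as np

def rbf_kernel(a, b, width=0.1):
    """Gaussian (RBF) kernel matrix between two sets of 1-D points."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

# Toy training data (an assumption, not the lecture's dataset).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 20))
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(20)

# Kernel ridge regression: solve (K + lam*I) theta = y so that
# the prediction is y(x) = sum_i theta_i k(x, x^i).
K = rbf_kernel(X, X)
lam = 1e-3  # small ridge term for numerical stability (an assumption)
theta = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(0, 1, 200)
y_pred = rbf_kernel(X_test, X) @ theta
```

SVR, GMR and GPR, discussed next, differ precisely in how the coefficients $\theta_i$ (and the set of retained datapoints) are chosen.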

Data-driven Regression

Good prediction depends on the choice of datapoints.

[Figure: blue: true function; red: estimated function.]

Data-driven Regression

Good prediction depends on the choice of datapoints. The more datapoints, the better the fit, but the computational costs increase dramatically with the number of datapoints.

[Figure: blue: true function; red: estimated function.]

Kernel methods for regression

Several methods in ML perform non-linear regression; they differ in the objective function and in the number of parameters.

Gaussian Process Regression (GPR) uses all datapoints.

[Figure: blue: true function; red: estimated function.]

Kernel methods for regression

Several methods in ML perform non-linear regression; they differ in the objective function and in the number of parameters.

Gaussian Process Regression (GPR) uses all datapoints. Support Vector Regression (SVR) picks a subset of datapoints (the support vectors).

[Figure: blue: true function; red: estimated function.]

Kernel methods for regression

Several methods in ML perform non-linear regression; they differ in the objective function and in the number of parameters.

Gaussian Process Regression (GPR) uses all datapoints. Support Vector Regression (SVR) picks a subset of datapoints (the support vectors). Gaussian Mixture Regression (GMR) generates a new set of datapoints (the centers of Gaussian functions).

[Figure: blue: true function; red: estimated function.]

Kernel methods for regression

Deterministic regressive model:

$y = f(x), \quad x \in \mathbb{R}^N, \; y \in \mathbb{R}$

Probabilistic regressive model:

$y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$

Build an estimate of the noise model and then compute $f$ directly (Support Vector Regression).

Support Vector Regression

Support Vector Regression (SVR)

Assume a nonlinear mapping $f$, s.t. $y = f(x)$. How to estimate $f$ to best predict the pairs of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$?

How to generalize the support vector machine framework for classification to estimate continuous functions?

1. Assume a non-linear mapping through feature space, and then perform linear regression in feature space.
2. Supervised learning minimizes an error function, so first determine a way to measure the error on the testing set in the linear case!

Support Vector Regression

Assume a linear mapping $f$, s.t. $y = f(x) = w^T x + b$.

Measure the error on the prediction $y - f(x)$. ($b$ is estimated in SVR through least-squares regression on the support vectors; hence we omit it from the rest of the development.)

How to estimate $w$ and $b$ to best predict the pairs of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$?

Support Vector Regression

Set an upper bound $\varepsilon$ on the error and consider as correctly classified all points such that $|f(x) - y| \le \varepsilon$. Penalize only datapoints that are not contained in the $\varepsilon$-tube.

[Figure: the $\varepsilon$-insensitive tube around $f(x)$.]

Support Vector Regression

The $\varepsilon$-margin is a measure of the width of the $\varepsilon$-insensitive tube, and hence of the precision of the regression. A small $\|w\|$ corresponds to a small slope for $f$: in the linear case $y = wx + b$, $f$ is more horizontal.

[Figure: a flat line $y = wx + b$ with a wide $\varepsilon$-margin.]

Support Vector Regression

A large $\|w\|$ corresponds to a large slope for $f$: in the linear case $y = wx + b$, $f$ is more vertical. The flatter the slope of the function $f$, the larger the margin. To maximize the margin, we must minimize the norm of $w$.

[Figure: a steep line $y = wx + b$ with a narrow $\varepsilon$-margin.]

Support Vector Regression

This can be rephrased as a constraint-based optimization problem of the form:

$\min_w \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad \begin{cases} y^i - \langle w, x^i\rangle - b \le \varepsilon \\ \langle w, x^i\rangle + b - y^i \le \varepsilon \end{cases} \quad i = 1,\dots,M$

Need to penalize points outside the $\varepsilon$-insensitive tube.

Support Vector Regression

Need to penalize points outside the $\varepsilon$-insensitive tube. Introduce slack variables $\xi_i, \xi_i^* \ge 0$:

$\min_w \; \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right) \quad \text{subject to} \quad \begin{cases} y^i - \langle w, x^i\rangle - b \le \varepsilon + \xi_i \\ \langle w, x^i\rangle + b - y^i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases}$

[Figure: slacks $\xi_i$ and $\xi_i^*$ for points above and below the $\varepsilon$-tube.]

Support Vector Regression

All points outside the $\varepsilon$-tube become support vectors (same optimization problem as above, with slack variables $\xi_i, \xi_i^*$).

We now have the solution to the linear regression problem. How to generalize this to the nonlinear case?

Support Vector Regression

Lift $x$ into feature space and then perform linear regression in feature space:

Linear case: $y = f(x) = \langle w, x\rangle + b$

Non-linear case: $x \to \phi(x), \quad y = f(x) = \langle w, \phi(x)\rangle + b$

$w$ lives in feature space!

Support Vector Regression

In feature space, we obtain the same constrained optimization problem:

$\min_w \; \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right) \quad \text{subject to} \quad \begin{cases} y^i - \langle w, \phi(x^i)\rangle - b \le \varepsilon + \xi_i \\ \langle w, \phi(x^i)\rangle + b - y^i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases}$

Support Vector Regression

Again, we can solve this quadratic problem by introducing sets of Lagrange multipliers $\alpha_i, \alpha_i^*, \eta_i, \eta_i^* \ge 0$ and writing the Lagrangian (Lagrangian = objective function + multipliers $\times$ constraints):

$L(w, b, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right) - \sum_{i=1}^{M}\alpha_i\left(\varepsilon + \xi_i - y^i + \langle w, \phi(x^i)\rangle + b\right) - \sum_{i=1}^{M}\alpha_i^*\left(\varepsilon + \xi_i^* + y^i - \langle w, \phi(x^i)\rangle - b\right) - \sum_{i=1}^{M}\left(\eta_i \xi_i + \eta_i^* \xi_i^*\right)$

Support Vector Regression

Requiring that the partial derivatives are all zero:

$\frac{\partial L}{\partial b} = \sum_{i=1}^{M}\left(\alpha_i^* - \alpha_i\right) = 0$

$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right)\phi(x^i) = 0$

$\frac{\partial L}{\partial \xi_i^{(*)}} = \frac{C}{M} - \alpha_i^{(*)} - \eta_i^{(*)} = 0$

and replacing in the primal Lagrangian, we get the dual optimization problem:

$\max_{\alpha, \alpha^*} \; -\frac{1}{2}\sum_{i,j=1}^{M}\left(\alpha_i - \alpha_i^*\right)\left(\alpha_j - \alpha_j^*\right) k(x^i, x^j) - \varepsilon\sum_{i=1}^{M}\left(\alpha_i + \alpha_i^*\right) + \sum_{i=1}^{M} y^i\left(\alpha_i - \alpha_i^*\right)$

subject to $\sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) = 0$ and $\alpha_i, \alpha_i^* \in \left[0, \frac{C}{M}\right]$.

Support Vector Regression

The solution is given by:

$y = f(x) = \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) k(x^i, x) + b$

The $\left(\alpha_i - \alpha_i^*\right)$ are linear coefficients (the Lagrange multipliers for each constraint). With a Gaussian kernel, the solution is a sum of $M$ Gaussians centered on the training datapoints.
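As a hedged, library-based illustration (not the lecture's own code), scikit-learn's SVR exposes exactly these quantities: the dual coefficients $(\alpha_i - \alpha_i^*)$, the support vectors, and the intercept $b$; the toy data and hyperparameter values are assumptions:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + 0.1 * rng.standard_normal(40)

# epsilon-SVR with a Gaussian (RBF) kernel.
svr = SVR(kernel="rbf", C=1000.0, epsilon=0.1, gamma=10.0).fit(X, y)

# svr.dual_coef_ holds (alpha_i - alpha_i^*) for each support vector, so
# the prediction is f(x) = sum_i (alpha_i - alpha_i^*) k(x^i, x) + b.
print("number of support vectors:", len(svr.support_))
print("intercept b:", svr.intercept_)
y_pred = svr.predict(X)
```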

The kernel places a Gauss function on each support vector.

[Figure: Gaussian bumps $k(x^i, \cdot)$ centered on the support vectors along $x$.]

The Lagrange multipliers define the importance of each Gaussian function. Away from the support vectors, the prediction converges to $b$ as the effect of the support vectors vanishes.

[Figure: six support vectors $x^1, \dots, x^6$ with multipliers $\alpha_1^* = 1.5$, $\alpha_2 = 2$, $\alpha_3^* = 1.5$, $\alpha_4 = 3$, $\alpha_5^* = 1$, $\alpha_6 = 2.5$.]

SVR: Hyperparameters

$\min_w \; \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right) \quad \text{subject to} \quad \begin{cases} y^i - \langle w, \phi(x^i)\rangle - b \le \varepsilon + \xi_i \\ \langle w, \phi(x^i)\rangle + b - y^i \le \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \ge 0 \end{cases}$

The solution to SVR we just saw is referred to as $\varepsilon$-SVR. It has two hyperparameters: $C$ controls the penalty term on a poor fit, and $\varepsilon$ determines the minimal required precision.

SVR: Effect of Hyperparameters

Effect of the RBF kernel width on the fit. Here the fit uses $C = 1000$, $\varepsilon = 0.01$, kernel width $= 0.1$.

Here the fit uses $C = 1000$, $\varepsilon = 0.01$, kernel width $= 0.01$: overfitting.


Here the fit uses $C = 100$, $\varepsilon = 0.03$, kernel width $= 0.1$: choosing appropriate hyperparameters reduces the effect of the kernel width on the fit.
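In practice, one hedged way to choose $C$, $\varepsilon$ and the kernel width is cross-validated grid search (the candidate grids below are illustrative assumptions):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "C": [1.0, 100.0, 1000.0],
    "epsilon": [0.01, 0.03, 0.1],
    "gamma": [0.1, 1.0, 10.0],  # inverse kernel width of the RBF kernel
}
# X, y as in the earlier SVR snippet.
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```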

Support Vector Regression: $\nu$-SVR

As the number of datapoints grows, so does the number of support vectors. Introduce a new parameter $\nu \in (0, 1]$, as in $\nu$-SVM: $\nu$-SVR bounds the fraction of support vectors ($\nu$ is an upper bound on the fraction of points outside the tube and a lower bound on the fraction of support vectors), and it allows the $\varepsilon$-tube to be fitted automatically!

$\min_{w, \varepsilon, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + C\left(\nu\varepsilon + \frac{1}{M}\sum_{j=1}^{M}\left(\xi_j + \xi_j^*\right)\right)$
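scikit-learn implements this variant as NuSVR; a brief hedged sketch (the parameter values are assumptions):

```python
from sklearn.svm import NuSVR

# nu in (0, 1] trades the tube width epsilon off against the number of
# support vectors; epsilon itself is adapted automatically.
nusvr = NuSVR(kernel="rbf", C=100.0, nu=0.3, gamma=10.0).fit(X, y)
print("support vectors:", len(nusvr.support_))
```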
slide-31
SLIDE 31

MACHINE LEARNING – 2012

32

MACHINE LEARNING MACHINE LEARNING

Support Vector Regression: $\nu$-SVR (Example)

Effect of the automatic adaptation of $\varepsilon$ using $\nu$-SVR.


Effect of the automatic adaptation of $\varepsilon$ using $\nu$-SVR, with added noise on the data.


Kernel methods for regression

Deterministic regressive model:

$y = f(x), \quad x \in \mathbb{R}^N, \; y \in \mathbb{R}$

Probabilistic regressive model:

$y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$

Build an estimate of the noise model and then compute $f$ directly (Support Vector Regression).

Kernel methods for regression

Deterministic regressive model: $y = f(x)$, $x \in \mathbb{R}^N$, $y \in \mathbb{R}$.

Probabilistic regressive model: $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Build a probabilistic estimate of the nonlinear relationship between $y$ and $x$ through the conditional density $p(y\,|\,x)$ (this estimates both the noise model and $f$), and then compute the estimate by taking the expectation over the conditional density:

$\hat{y} = f(x) = E\{p(y\,|\,x)\}$


Gaussian Mixture Regression (GMR)

1) Estimate the joint density $p(x, y)$ across pairs of datapoints using a Gaussian Mixture Model (GMM):

$p(x, y) = \sum_{i=1}^{K} \pi_i\, p_i(x, y), \quad \text{with } p_i(x, y) = \mathcal{N}\left(x, y;\, \mu^i, \Sigma^i\right)$

$\mu^i, \Sigma^i$: mean and covariance matrix of Gaussian $i$.

[Figure: 2-D projection of a Gauss function; the ellipse contour is about 2 standard deviations.]


Gaussian Mixture Regression (GMR)

1) Estimate the joint density $p(x, y) = \sum_{i=1}^{K} \pi_i\, \mathcal{N}\left(x, y;\, \mu^i, \Sigma^i\right)$ using a GMM.

The parameters are learned through Expectation-Maximization (EM), an iterative procedure that starts from a random initialization.
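A hedged sketch of this first step with scikit-learn's EM-based GaussianMixture (the toy dataset and the choice of 4 components are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)

# Step 1: fit a GMM to the joint samples (x, y) with EM.
data = np.column_stack([x, y])
gmm = GaussianMixture(n_components=4, covariance_type="full",
                      random_state=0).fit(data)
# gmm.weights_, gmm.means_, gmm.covariances_ are pi_i, mu^i, Sigma^i.
```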


Gaussian Mixture Regression (GMR)

1) Estimate the joint density $p(x, y) = \sum_{i=1}^{K} \pi_i\, \mathcal{N}\left(x, y;\, \mu^i, \Sigma^i\right)$ using a GMM, with mixing coefficients $\pi_i$ such that $\sum_{i=1}^{K}\pi_i = 1$.

$\pi_i$ is the probability that the $M$ datapoints were generated by Gaussian $i$:

$\pi_i = \frac{1}{M}\sum_{j=1}^{M} p\left(i \,\middle|\, x^j\right)$


Gaussian Mixture Regression (GMR)

1) Estimate the joint density $p(x, y)$ using a GMM. 2) Compute the regressive signal by taking $p(y\,|\,x)$:

$p(y\,|\,x) = \sum_{i=1}^{K} \gamma_i(x)\, p_i(y\,|\,x), \quad \text{with } \gamma_i(x) = \frac{\pi_i\, \mathcal{N}\left(x;\, \mu_X^i, \Sigma_{XX}^i\right)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\left(x;\, \mu_X^j, \Sigma_{XX}^j\right)}$

Each $p_i(y\,|\,x)$ is a Gauss function whose variance changes depending on the query point.


Gaussian Mixture Regression (GMR)

The factors $\gamma_i(x)$ give a measure of the relative importance of each of the $K$ regressive models. They are computed at each query point, yielding a weighted regression: the influence of each marginal is modulated by $\gamma_i$.

[Figure: $\gamma_1(x)$ and $\gamma_2(x)$ evaluated at a query point $x$.]


Gaussian Mixture Regression (GMR)

3) The regressive signal is then obtained by computing $E\{p(y\,|\,x)\}$:

$E\{p(y\,|\,x)\} = \sum_{i=1}^{K} \gamma_i(x)\, \tilde{\mu}^i(x), \quad \text{with } \tilde{\mu}^i(x) = \mu_Y^i + \Sigma_{YX}^i \left(\Sigma_{XX}^i\right)^{-1}\left(x - \mu_X^i\right)$

This is a linear combination of $K$ local regressive models.

[Figure: local models $\tilde{\mu}^1(x)$, $\tilde{\mu}^2(x)$ weighted by $\gamma_1(x)$, $\gamma_2(x)$.]
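Continuing the GMM snippet above, here is a hedged sketch of steps 2 and 3 for a 1-D input and 1-D output (the helper gmr_predict is an illustrative name, not a library function):

```python
from scipy.stats import norm

def gmr_predict(gmm, x_query):
    """Conditional mean E{p(y|x)} of a GMM fitted on (x, y) pairs."""
    mu, S, w = gmm.means_, gmm.covariances_, gmm.weights_
    # Responsibilities gamma_i(x), from the Gaussian marginals over x.
    px = w * norm.pdf(x_query, loc=mu[:, 0], scale=np.sqrt(S[:, 0, 0]))
    gamma = px / px.sum()
    # Local linear models: mu_Y + Sigma_YX (Sigma_XX)^-1 (x - mu_X).
    local = mu[:, 1] + S[:, 1, 0] / S[:, 0, 0] * (x_query - mu[:, 0])
    return gamma @ local

y_hat = gmr_predict(gmm, 0.25)
```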


Gaussian Mixture Regression (GMR)

Computing the variance $\mathrm{var}\{p(y\,|\,x)\}$ provides information on the uncertainty of the prediction computed from the conditional distribution:

$\mathrm{var}\{p(y\,|\,x)\} = \sum_{i=1}^{K} \gamma_i(x)^2\, \tilde{\Sigma}^i, \quad \text{with } \tilde{\Sigma}^i = \Sigma_{YY}^i - \Sigma_{YX}^i \left(\Sigma_{XX}^i\right)^{-1} \Sigma_{XY}^i$

Careful: this is not the uncertainty of the model. Use the likelihood to compute the uncertainty of the predictor!

Gaussian Mixture Regression (GMR)

Computing the variance $\mathrm{var}\{p(y\,|\,x)\}$ provides information on the uncertainty of the prediction computed from the conditional distribution.

[Figure: the regression curve $E\{p(y\,|\,x)\}$ with an uncertainty envelope given by $\mathrm{var}\{p(y\,|\,x)\}$.]

GMR: Sensitivity to Choice of K and Initialization


Fit with 4 Gaussians, uniform initialization.


Fit with 4 Gaussians, random initialization.


Fit with 10 Gaussians, random initialization.


Kernel methods for regression

Deterministic regressive model: $y = f(x)$, $x \in \mathbb{R}^N$, $y \in \mathbb{R}$.

Probabilistic regressive model: $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Build a probabilistic estimate of the nonlinear relationship between $y$ and $x$ through the conditional density $p(y\,|\,x)$ (this estimates the noise model and $f$), and then compute the estimate by taking the expectation over the conditional density: $\hat{y} = E\{p(y\,|\,x)\}$.

Gaussian Mixture Regression (GMR) first computes $p(x, y)$ and then derives $p(y\,|\,x)$. Gaussian Process Regression (GPR) computes $p(y\,|\,x)$ directly.


Probability Density Function and Regression

A signal $y$ can be estimated through regression $y = f(x)$ by taking the expectation over the conditional probability of $y$ given $x$, for a choice of parameters of $p$:

$\hat{y} = E\{p(y\,|\,x)\}$

The simplest way to estimate $p(y\,|\,x)$ is through Probabilistic Regression, which estimates a linear regressive model.

Probabilistic Regression (PR)

PR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables $y$ and $x$ by building a linear model of the form:

$y = f(x, w) = w^T x, \quad w, x \in \mathbb{R}^N$

If one assumes that the observed values of $y$ differ from $f(x)$ by an additive noise $\epsilon$ that follows a zero-mean Gaussian distribution (such an assumption amounts to putting a prior distribution over the noise), then:

$y = w^T x + \epsilon, \quad \epsilon \sim \mathcal{N}\left(0, \sigma^2\right)$


Probabilistic Regression

Training set of $M$ pairs of datapoints: $X = \left\{x^i\right\}_{i=1}^{M}$, $\mathbf{y} = \left\{y^i\right\}_{i=1}^{M}$. Parameters of the model: $w$, $\sigma$.

Likelihood of the regressive model: $\mathbf{y} = w^T X + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, i.e. $\mathbf{y} \sim p(\mathbf{y}\,|\,X, w, \sigma)$.

The datapoints are independently and identically distributed (i.i.d.):

$p(\mathbf{y}\,|\,X, w, \sigma) = \prod_{i=1}^{M} p\left(y^i \,\middle|\, x^i, w, \sigma\right) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(y^i - w^T x^i\right)^2}{2\sigma^2}\right)$

Probabilistic Regression

Training set of pairs of datapoints $(X, \mathbf{y})$; likelihood of the regressive model $\mathbf{y} \sim p(\mathbf{y}\,|\,X, w, \sigma)$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$.

Prior model on the distribution of the parameter $w$:

$p(w) = \mathcal{N}\left(0, \Sigma_w\right) \propto \exp\left(-\frac{1}{2}\, w^T \Sigma_w^{-1} w\right)$

The hyperparameters ($\Sigma_w$) are given by the user.


Probabilistic Regression

Posterior distribution over the output $y$ for a testing point $x$, given the training datapoints $(X, \mathbf{y})$:

$p(y\,|\,x, X, \mathbf{y}) = \mathcal{N}\left(\frac{1}{\sigma^2}\, x^T A^{-1} X\, \mathbf{y},\;\; x^T A^{-1} x\right), \quad \text{with } A = \frac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$
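A hedged numpy sketch of this predictive distribution (the data, the noise variance and the prior covariance are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 2, 50                        # input dimension, number of datapoints
X = rng.standard_normal((N, M))     # columns are the training inputs x^i
w_true = np.array([1.0, -0.5])
y = w_true @ X + 0.1 * rng.standard_normal(M)

sigma2 = 0.01                       # noise variance (assumed known)
Sigma_w = np.eye(N)                 # prior covariance of w
A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)
A_inv = np.linalg.inv(A)

x_star = np.array([0.3, -0.2])      # testing point
mean = x_star @ A_inv @ X @ y / sigma2   # (1/sigma^2) x^T A^-1 X y
var = x_star @ A_inv @ x_star            # x^T A^-1 x
```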


From Probabilistic Regression to Gaussian Process Regression

How to extend the simple linear Bayesian regressive model to nonlinear regression, such that the nonlinear problem becomes linear again?

$y = w^T x + \epsilon, \; \epsilon \sim \mathcal{N}(0, \sigma^2) \quad \longrightarrow \quad x \to \phi(x): \quad y = w^T \phi(x) + \epsilon, \; \epsilon \sim \mathcal{N}(0, \sigma^2)$

Gaussian Process Regression

Apply the non-linear transformation $x \to \phi(x)$ to the posterior of probabilistic regression:

$p(y\,|\,x, X, \mathbf{y}) = \mathcal{N}\left(\frac{1}{\sigma^2}\, \phi(x)^T A^{-1} \Phi(X)\, \mathbf{y},\;\; \phi(x)^T A^{-1} \phi(x)\right), \quad \text{with } A = \frac{1}{\sigma^2}\, \Phi(X) \Phi(X)^T + \Sigma_w^{-1}$

This extends the simple linear Bayesian regressive model to nonlinear regression while keeping the problem linear.


Gaussian Process Regression

Take as kernel the inner product in feature space:

$k(x, x') = \phi(x)^T\, \Sigma_w\, \phi(x')$

The predictive mean then becomes:

$\hat{y} = E\{y \,|\, x, X, \mathbf{y}\} = \sum_{i=1}^{M} \alpha_i\, k\left(x, x^i\right), \quad \text{with } \alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} \mathbf{y}$


Gaussian Process Regression

$\hat{y} = \sum_{i=1}^{M} \alpha_i\, k\left(x, x^i\right), \quad \text{with } \alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} \mathbf{y}, \;\; \sigma^2 > 0$

All datapoints are used in the computation!


Gaussian Process Regression

The kernel and its hyperparameters are given by the user. They can be optimized by maximizing the marginal likelihood $p(\mathbf{y}\,|\,X; \text{hyperparameters})$.
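A hedged numpy sketch of the GPR mean prediction above (the kernel width and noise level are assumptions):

```python
import numpy as np

def rbf(a, b, length_scale=0.1):
    """Squared-exponential kernel k(x, x') between 1-D point sets."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * length_scale ** 2))

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30))
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(30)

sigma2 = 0.01                                    # assumed noise variance
alpha = np.linalg.solve(rbf(X, X) + sigma2 * np.eye(len(X)), y)

X_test = np.linspace(0, 1, 200)
y_mean = rbf(X_test, X) @ alpha                  # E{y | x, X, y}
```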


Gaussian Process Regression

Sensitivity to the choice of kernel width (called the length-scale in most books) when using Gaussian kernels (also called RBF or squared exponential):

$k(x, x') = e^{-\frac{\|x - x'\|^2}{l}}$

Kernel width $= 0.1$.


Kernel width $= 0.5$.


Gaussian Process Regression

$\hat{y} = \sum_{i=1}^{M} \alpha_i\, k\left(x, x^i\right), \quad \text{with } \alpha = \left(K(X, X) + \sigma^2 I\right)^{-1} \mathbf{y}$

The value of the noise $\sigma^2$ needs to be pre-set by hand; the larger the noise, the more uncertainty (here the noise is $\le 1$). The predictive covariance is:

$\mathrm{cov}\{p(y\,|\,x)\} = K(x, x) - K(x, X)\left(K(X, X) + \sigma^2 I\right)^{-1} K(X, x)$
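Continuing the numpy GP sketch above, the predictive variance can be evaluated as follows (a hedged sketch):

```python
# Predictive variance, continuing the GP mean sketch above:
# K(x,x) - K(x,X) (K(X,X) + sigma^2 I)^-1 K(X,x).
K_inv = np.linalg.inv(rbf(X, X) + sigma2 * np.eye(len(X)))
K_star = rbf(X_test, X)
y_var = rbf(X_test, X_test).diagonal() - np.einsum(
    "ij,jk,ik->i", K_star, K_inv, K_star)
```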


Low noise: 0.05.


High noise: 0.2.


Comparison Across Methods

Generalization: prediction away from the datapoints.

GPR predicts $y = 0$ away from the datapoints: $\hat{y} = E\{y\,|\,x, X, \mathbf{y}\} = \sum_{i=1}^{M} \alpha_i\, k\left(x, x^i\right)$

SVR predicts $y = b$ away from the datapoints: $y = \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) k\left(x^i, x\right) + b$


Comparison Across Methods

Generalization: prediction away from the datapoints. GMR predicts the trend away from the data.


Comparison Across Methods

Generalization: prediction away from the datapoints. But the prediction depends on the initialization and on the solution found during training.


Comparison Across Methods

The variance of $p(y\,|\,x)$ in GMR represents the uncertainty of the predictive model. The variance of $p(y\,|\,x)$ in GPR likewise represents the uncertainty of the predictive model.


Comparison Across Methods

The variance in SVR represents the $\varepsilon$-tube and does not represent the uncertainty of the model either: there is no measure of uncertainty in SVR!


SVR, GPR, GMR: Similarities

  • SVR, GMR and GPR are based on the same regressive model: $y = f(x) + \epsilon, \; \epsilon \sim \mathcal{N}(0, \sigma^2)$
  • GMR and GPR are Gaussian conditional distributions.
  • SVR, GMR and GPR compute a weighted combination of local predictors.
  • SVR, GMR and GPR separate the input space into regions modeled by Gaussian distributions (true only when using Gaussian kernels for GPR and SVR).
  • GMR allows one to predict multi-dimensional outputs, while SVR and GPR can predict only a uni-dimensional output $y$.


SVR, GPR, GMR: Differences

SVR, GMR and GPR are based on the same probabilistic regressive model, but they do not optimize the same objective function, and therefore find different solutions.

  • SVR:
  • minimizes the reconstruction error through convex optimization: it is ensured to find the optimal estimate, but not a unique solution
  • usually finds a number of models smaller than or equal to the number of datapoints (the support vectors)
  • GMR:
  • learns $p(x, y)$ through maximum likelihood, which finds a local optimum
  • computes a generative model $p(x, y)$ from which it derives $p(y\,|\,x)$
  • starts with a small number of models, much smaller than the number of datapoints
  • GPR:
  • no optimization; analytical (optimal) solution
  • expresses $p(y\,|\,x)$ as a full density model
  • number of models = number of datapoints!

Hyperparameters of SVR, GPR, GMR

SVR, GMR and GPR all depend on hyperparameters that need to be determined beforehand. These are:

  • SVR:
  • choice of the error margin $\varepsilon$, which can be replaced by the choice of $\nu$ in $\nu$-SVR
  • choice of kernel and associated kernel parameters
  • GMR:
  • choice of the number of Gaussians
  • choice of initialization (affects convergence to a local optimum)
  • GPR:
  • choice of the noise parameter
  • choice of kernel width (length-scale)

The hyperparameters can be optimized separately: e.g. the number of Gaussians in GMR can be estimated using BIC (a sketch follows below), the length-scale and noise of GPR can be estimated through maximum likelihood, and the kernel parameters of SVR can be optimized through grid search.
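As one hedged example, selecting the number of Gaussians for GMR by BIC with scikit-learn (the candidate range is an assumption):

```python
from sklearn.mixture import GaussianMixture

# data = joint (x, y) samples, as in the earlier GMM snippet.
bics = {k: GaussianMixture(n_components=k, random_state=0)
            .fit(data).bic(data)
        for k in range(1, 11)}
best_K = min(bics, key=bics.get)  # the lowest BIC wins
```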


Conclusion

There is no easy way to determine which regression technique best fits your problem.

Training:
  • SVR: convex optimization
  • GMR: EM, an iterative technique that needs several runs
  • GPR: analytical solution; one shot, but uses all datapoints

Testing:
  • SVR and GMR: few support vectors or Gaussian functions, a small fraction of the original data
  • GPR: keeps all the datapoints