

SLIDE 1
ADVANCED MACHINE LEARNING
Non-linear regression techniques (SVR and extensions, GPR, Gradient Boosting)

SLIDE 2
Regression: Principle

Map an N-dimensional input $x \in \mathbb{R}^N$ to a continuous output $y \in \mathbb{R}$: learn a function of the type $f: \mathbb{R}^N \to \mathbb{R}$, with $y = f(x)$.

Estimate the $f$ that best predicts the set of training points $\{x^i, y^i\}_{i=1,\dots,M}$.

[Figure: training pairs $(x^1, y^1), \dots, (x^4, y^4)$ with the true function and its estimate.]

SLIDE 3
Regression: Issues

Map an N-dimensional input $x \in \mathbb{R}^N$ to a continuous output $y \in \mathbb{R}$: learn a function of the type $f: \mathbb{R}^N \to \mathbb{R}$, with $y = f(x)$.

Estimate the $f$ that best predicts the set of training points $\{x^i, y^i\}_{i=1,\dots,M}$.

[Figure: training pairs $(x^1, y^1), \dots, (x^4, y^4)$ with the true function and its estimate.]

The fit is strongly influenced by the choice of:
  • datapoints for training
  • complexity of the model (interpolation)

SLIDE 4
Regression Algorithms in this Course

Classification techniques seen so far: Support Vector Machine, Relevance Vector Machine, Boosting – random projections, Boosting – random gaussians, Random forest, Gaussian Process.

Regression techniques covered here: Support vector regression, Relevance vector regression, Gaussian process regression, Gradient boosting, Locally weighted projected regression (not covered in class!!).

SLIDE 5
Regression Algorithms in this Course

First up: Support Vector Machine → Support vector regression, and Relevance Vector Machine → Relevance vector regression.

SLIDE 6
Support Vector Regression

SLIDE 7
Support Vector Regression

Assume a nonlinear mapping $f$, s.t. $y = f(x)$. How to estimate $f$ to best predict the pairs of training points $\{x^i, y^i\}_{i=1,\dots,M}$?

How to generalize the support vector machine framework for classification to estimate continuous functions?
  1. Assume a non-linear mapping through feature space and then perform linear regression in feature space.
  2. Supervised learning minimizes an error function → first determine a way to measure the error on the testing set in the linear case!

SLIDE 8
Support Vector Regression

Assume a linear mapping $f$, s.t. $y = f(x) = w^T x + b$. How to estimate $w$ and $b$ to best predict the pairs of training points $\{x^i, y^i\}_{i=1,\dots,M}$? Measure the error on the prediction.

$b$ is estimated in SVR through least-squares regression on the support vectors; hence we omit it from the rest of the developments.

[Figure: datapoints in the $(x, y)$ plane and the linear fit $y = f(x) = w^T x + b$.]

SLIDE 9
Support Vector Regression

Set an upper bound $\varepsilon$ on the error and consider as correctly classified all points such that $|f(x) - y| \le \varepsilon$. Penalize only the datapoints that are not contained in the $\varepsilon$-tube.

[Figure: the linear fit $y = f(x) = w^T x + b$ surrounded by its $\varepsilon$-tube.]

SLIDE 10
Support Vector Regression

The ε-margin is a measure of the width of the ε-insensitive tube, and hence of the precision of the regression. A small ||w|| corresponds to a small slope for f: in the linear case, f is more horizontal.

SLIDE 11
Support Vector Regression

A large ||w|| corresponds to a large slope for f: in the linear case, f is more vertical. The flatter the slope of f, the larger the ε-margin. To maximize the margin, we must therefore minimize the norm of w.

SLIDE 12
Support Vector Regression

This can be rephrased as a constraint-based optimization problem of the form:

$$\min_{w} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad
\begin{cases} \langle w, x^i \rangle + b - y^i \le \varepsilon \\ y^i - \langle w, x^i \rangle - b \le \varepsilon \end{cases}
\quad \forall i = 1, \dots, M.$$

We still need to penalize points outside the ε-insensitive tube.

SLIDE 13
Support Vector Regression

To penalize points outside the ε-insensitive tube, introduce slack variables $\xi_i, \xi_i^* \ge 0$ and a constant $C$:

$$\min_{w, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right)
\quad \text{subject to} \quad
\begin{cases} \langle w, x^i \rangle + b - y^i \le \varepsilon + \xi_i \\ y^i - \langle w, x^i \rangle - b \le \varepsilon + \xi_i^* \\ \xi_i \ge 0, \; \xi_i^* \ge 0 \end{cases}$$
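
To make the role of the slack variables concrete, here is a minimal NumPy sketch (not from the slides) of the ε-insensitive penalty that $\xi_i$ and $\xi_i^*$ measure; the data and the value of ε are invented for the example.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the eps-tube, linear in the residual outside it
    (this is exactly what the slack variables measure)."""
    residual = np.abs(y_true - y_pred)
    return np.maximum(0.0, residual - eps)

# Toy check: only the third point lies outside the eps-tube.
y_true = np.array([0.0, 0.5, 1.0])
y_pred = np.array([0.05, 0.45, 1.4])
print(eps_insensitive_loss(y_true, y_pred))  # -> [0.  0.  0.3]
```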

SLIDE 14
Support Vector Regression

All points outside the ε-tube become support vectors. With the slack variables $\xi_i, \xi_i^* \ge 0$ the problem reads:

$$\min_{w, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right)
\quad \text{subject to} \quad
\begin{cases} \langle w, x^i \rangle + b - y^i \le \varepsilon + \xi_i \\ y^i - \langle w, x^i \rangle - b \le \varepsilon + \xi_i^* \\ \xi_i \ge 0, \; \xi_i^* \ge 0 \end{cases}$$

We now have the solution to the linear regression problem. How to generalize this to the nonlinear case?

SLIDE 15
Support Vector Regression

Lift $x$ into feature space, $x \to \varphi(x)$, and then perform linear regression in feature space:

Linear case: $y = f(x) = \langle w, x \rangle + b$
Non-linear case: $y = f(x) = \langle w, \varphi(x) \rangle + b$, where $w$ now lives in feature space!

SLIDE 16
Support Vector Regression

In feature space, we obtain the same constrained optimization problem:

$$\min_{w, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right)
\quad \text{subject to} \quad
\begin{cases} \langle w, \varphi(x^i) \rangle + b - y^i \le \varepsilon + \xi_i \\ y^i - \langle w, \varphi(x^i) \rangle - b \le \varepsilon + \xi_i^* \\ \xi_i \ge 0, \; \xi_i^* \ge 0 \end{cases}$$

SLIDE 17
Support Vector Regression

Again, we can solve this quadratic problem by introducing sets of Lagrange multipliers and writing the Lagrangian (Lagrangian = objective function + multipliers × constraints):

$$L(w, b, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right)
- \sum_{i=1}^{M}\left(\eta_i \xi_i + \eta_i^* \xi_i^*\right)
- \sum_{i=1}^{M}\alpha_i\left(\varepsilon + \xi_i - y^i + \langle w, \varphi(x^i) \rangle + b\right)
- \sum_{i=1}^{M}\alpha_i^*\left(\varepsilon + \xi_i^* + y^i - \langle w, \varphi(x^i) \rangle - b\right)$$

SLIDE 18
Support Vector Regression

$$L(w, b, \xi, \xi^*) = \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right)
- \sum_{i=1}^{M}\left(\eta_i \xi_i + \eta_i^* \xi_i^*\right)
- \sum_{i=1}^{M}\alpha_i\left(\varepsilon + \xi_i - y^i + \langle w, \varphi(x^i) \rangle + b\right)
- \sum_{i=1}^{M}\alpha_i^*\left(\varepsilon + \xi_i^* + y^i - \langle w, \varphi(x^i) \rangle - b\right)$$

The slacks $\xi_i$ and $\xi_i^*$ correspond to the constraints on points lying on either side of the ε-tube. The multipliers satisfy $\alpha_i > 0$ or $\alpha_i^* > 0$ only for the points that do not satisfy the constraints → the points outside the ε-tube.

SLIDE 19
Support Vector Regression

$\alpha_i > 0$ or $\alpha_i^* > 0$ only for points that do not satisfy the constraints → points outside the ε-tube.

Requiring that the partial derivatives are all zero:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right)\varphi(x^i) = 0
\;\;\Rightarrow\;\; w = \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right)\varphi(x^i)$$

→ a linear combination of the support vectors.

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) = 0
\;\;\Rightarrow\;\; \sum_{i=1}^{M}\alpha_i = \sum_{i=1}^{M}\alpha_i^*$$

→ rebalancing the effect of the support vectors on both sides of the ε-tube.

SLIDE 20
Support Vector Regression

Replacing these expressions in the primal Lagrangian, we get the dual optimization problem:

$$\max_{\alpha, \alpha^*} \;
-\frac{1}{2}\sum_{i,j=1}^{M}\left(\alpha_i - \alpha_i^*\right)\left(\alpha_j - \alpha_j^*\right) k\!\left(x^i, x^j\right)
- \varepsilon\sum_{i=1}^{M}\left(\alpha_i + \alpha_i^*\right)
+ \sum_{i=1}^{M} y^i\left(\alpha_i - \alpha_i^*\right)$$

$$\text{subject to} \quad \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) = 0
\quad \text{and} \quad \alpha_i, \alpha_i^* \in \left[0, \tfrac{C}{M}\right],$$

with the kernel trick: $k\!\left(x^i, x^j\right) = \left\langle \varphi(x^i), \varphi(x^j) \right\rangle$.

SLIDE 21
Support Vector Regression

The solution is given by:

$$y = f(x) = \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) k\!\left(x^i, x\right) + b$$

The $(\alpha_i - \alpha_i^*)$ are linear coefficients (the Lagrange multipliers of each constraint). If one uses an RBF kernel, the solution is a sum of M un-normalized isotropic Gaussians centered on each training datapoint.
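
A minimal NumPy sketch of this prediction formula with an RBF kernel. The multipliers alpha, alpha_star, the offset b and the training points would normally come from the trained SVR; the values below are invented purely for illustration.

```python
import numpy as np

def rbf_kernel(x, xi, width=0.1):
    """Un-normalized isotropic Gaussian centered on the training point xi."""
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * width ** 2))

def svr_predict(x, X_train, alpha, alpha_star, b, width=0.1):
    """f(x) = sum_i (alpha_i - alpha_i*) k(x^i, x) + b"""
    coeffs = alpha - alpha_star                      # one coefficient per training point
    k = np.array([rbf_kernel(x, xi, width) for xi in X_train])
    return coeffs @ k + b

# Hypothetical trained quantities (not taken from the slides).
X_train    = np.array([[0.0], [0.5], [1.0]])
alpha      = np.array([0.0, 2.0, 0.0])
alpha_star = np.array([1.5, 0.0, 0.0])
print(svr_predict(np.array([0.25]), X_train, alpha, alpha_star, b=0.1))
```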

SLIDE 22
Support Vector Regression

The solution is given by:

$$y = f(x) = \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) k\!\left(x^i, x\right) + b$$

The kernel places a Gaussian function on each support vector. [Figure: the resulting fit of y over x, with a Gaussian bump on each support vector.]

SLIDE 23
Support Vector Regression

The solution is given by:

$$y = f(x) = \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) k\!\left(x^i, x\right) + b$$

The Lagrange multipliers define the importance of each Gaussian function.

[Figure: example fit $y = f(x)$ through six datapoints $x^1, \dots, x^6$, each support vector annotated with its multiplier (values between 1 and 4 in the example); the prediction converges to $b$ where the support-vector effect vanishes.]

SLIDE 24
Support Vector Regression: Exercise I

SVR gives the following estimate for each pair of datapoints $\left(x^j, y^j\right)$, $j = 1, \dots, M$:

$$y^j = \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) k\!\left(x^i, x^j\right) + b$$

a) Compute an estimate of $b$ using the above:

$$\Rightarrow \; b = \frac{1}{M}\sum_{j=1}^{M}\left( y^j - \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) k\!\left(x^i, x^j\right) \right)$$

b) Plot the result for a choice of 3-datapoint dataset.
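
A NumPy sketch of the estimate of b from part (a), averaging the residual over all training points; here K[i, j] = k(x^i, x^j), and all numerical values are placeholders.

```python
import numpy as np

def estimate_b(K, y, alpha, alpha_star):
    """b = (1/M) * sum_j ( y^j - sum_i (alpha_i - alpha_i*) k(x^i, x^j) )"""
    coeffs = alpha - alpha_star
    residuals = y - K.T @ coeffs        # K[i, j] = k(x^i, x^j)
    return residuals.mean()

# 3-datapoint toy example with made-up kernel values and multipliers.
K = np.array([[1.0, 0.6, 0.1],
              [0.6, 1.0, 0.6],
              [0.1, 0.6, 1.0]])
y = np.array([0.2, 0.8, 0.3])
alpha      = np.array([0.0, 1.0, 0.0])
alpha_star = np.array([0.5, 0.0, 0.2])
print(estimate_b(K, y, alpha, alpha_star))
```
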
SLIDE 25
Support Vector Regression: Exercise II

Recall the solution to SVR:

$$y = f(x) = \sum_{i=1}^{M}\left(\alpha_i - \alpha_i^*\right) k\!\left(x^i, x\right) + b$$

a) What type of function $f$ can you model with a homogeneous polynomial kernel?
b) What minimum order of a homogeneous polynomial kernel do you need to achieve good regression on the set of 3 points below?

SLIDE 26
ε-SVR: Hyperparameters

$$\min_{w, \xi, \xi^*} \; \frac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}\left(\xi_i + \xi_i^*\right)
\quad \text{subject to} \quad
\begin{cases} \langle w, x^i \rangle + b - y^i \le \varepsilon + \xi_i \\ y^i - \langle w, x^i \rangle - b \le \varepsilon + \xi_i^* \\ \xi_i \ge 0, \; \xi_i^* \ge 0 \end{cases}$$

The solution to SVR we just saw is referred to as ε-SVR. It has two hyperparameters:
  • C controls the penalty term on a poor fit
  • ε determines the minimal required precision
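
The next slides illustrate these hyperparameters with MLDemos; a comparable experiment can be sketched with scikit-learn's SVR on synthetic data (where gamma plays the role of the inverse RBF kernel width). The data and values below are illustrative only.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.05 * rng.normal(size=60)

# C penalizes points outside the eps-tube, epsilon sets the required precision,
# gamma ~ 1 / (2 * width^2) controls the RBF kernel width.
model = SVR(kernel="rbf", C=100.0, epsilon=0.1, gamma=50.0)
model.fit(X, y)

print("number of support vectors:", len(model.support_))
print("prediction at x=0.5:", model.predict([[0.5]]))
```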

SLIDE 27
ε-SVR: Effect of Hyperparameters

Effect of the RBF kernel width on the fit. Here the fit uses C=100, ε=0.1, kernel width=0.01.

SLIDE 28
ε-SVR: Effect of Hyperparameters

Effect of the RBF kernel width on the fit. Here the fit uses C=100, ε=0.01, kernel width=0.01 → overfitting.

SLIDE 29
ε-SVR: Effect of Hyperparameters

Effect of the RBF kernel width on the fit. Here the fit uses C=100, ε=0.05, kernel width=0.01: choosing appropriate hyperparameters reduces the effect of the kernel width on the fit.

SLIDE 30
ε-SVR: Effect of Hyperparameters

Note: MLDemos does not display the support vectors if there is more than one point for the same x!

SLIDE 31
ε-SVR: Hyperparameters

The solution to SVR we just saw is referred to as ε-SVR (optimization problem as on Slide 26). Its two hyperparameters are C, which controls the penalty term on a poor fit, and ε, which determines the minimal required precision.

SLIDE 32
Extensions of SVR

As in the classification case, the optimization framework used for support vector regression is extended with:
  • ν-SVR, which yields a sparser version of SVR and relaxes the constraint of choosing C, the unbounded hyperparameter.
  • Relevance Vector Regression, the regression version of RVM, which also provides a sparser solution than SVR and offers a probabilistic interpretation of the solution.

(See Tipping 2011, supplementary material to the class.)

SLIDE 33
Support Vector Regression: ν-SVR

As the number of datapoints grows, so does the number of support vectors. ν-SVR puts a lower bound on the fraction of support vectors (see the earlier discussion of ν-SVM), with $\nu \in [0, 1]$.

SLIDE 34
Support Vector Regression: ν-SVR

As for ν-SVM, one can rewrite the problem as a convex optimization problem:

$$\min_{w, \xi, \varepsilon} \; \frac{1}{2}\|w\|^2 + C\left(\nu\varepsilon + \frac{1}{M}\sum_{j=1}^{M}\left(\xi_j + \xi_j^*\right)\right)$$

under the constraints

$$\langle w, x^j \rangle + b - y^j \le \varepsilon + \xi_j, \quad
y^j - \langle w, x^j \rangle - b \le \varepsilon + \xi_j^*, \quad
\xi_j \ge 0, \; \xi_j^* \ge 0, \; \varepsilon \ge 0, \; 0 \le \nu \le 1.$$

The margin error is given by all the datapoints for which $\xi_j > 0$. ν is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
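
A comparable sketch with scikit-learn's NuSVR, where ν replaces ε and ε is adapted automatically; the data and hyperparameter values are invented for the example.

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 80)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.normal(size=80)

# nu in (0, 1]: upper bound on the fraction of margin errors and
# lower bound on the fraction of support vectors.
model = NuSVR(kernel="rbf", C=100.0, nu=0.1, gamma=50.0)
model.fit(X, y)
print("fraction of support vectors:", len(model.support_) / len(X))
```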

SLIDE 35
ν-SVR: Example

Effect of the automatic adaptation of ε when using ν-SVR.

SLIDE 36
ν-SVR: Example

Effect of the automatic adaptation of ε when using ν-SVR, with added noise on the data.

SLIDE 37
Relevance Vector Regression (RVR)

Same principle as described for RVM (see the slides on SVM and extensions); the derivation of the parameters, however, differs (see Tipping 2011 for details). To recall, we start from the solution of SVM and rewrite it as a linear combination over M basis functions:

$$y(x) = f(x) = \sum_{i=1}^{M}\alpha_i\, k\!\left(x, x^i\right) + b = \alpha^T \Psi(x),
\qquad \Psi(x) = \left[\psi_1(x), \dots, \psi_M(x)\right]^T,
\qquad \alpha = \left[\alpha_1, \dots, \alpha_M\right]^T.$$

In the (binary) classification case $y \in [0, 1]$; in the regression case $y \in \mathbb{R}$.

A sparse solution has a majority of entries of $\alpha$ equal to zero.

SLIDE 38
Comparison of ε-SVR, ν-SVR, RVR

Solution with ε-SVR: RBF kernel, C=3000, ε=0.08, σ=0.05; 37 support vectors.

SLIDE 39
Comparison of ε-SVR, ν-SVR, RVR

Solution with ν-SVR: RBF kernel, C=3000, ν=0.04, σ=0.001; 17 support vectors.

SLIDE 40
Comparison of ε-SVR, ν-SVR, RVR

Solution with RVR: RBF kernel, ε=0.08, σ=0.05; 7 support vectors.

SLIDE 41
Regression Algorithms in this Course

Classification techniques seen so far: Support Vector Machine, Relevance Vector Machine, Boosting – random projections, Boosting – random gaussians, Random forest, Gaussian Process. Regression techniques: Support vector regression, Relevance vector regression, Gaussian process regression, Gradient boosting, Locally weighted projected regression.

SLIDE 42
Regression Algorithms in this Course

Next: Random forest and Gaussian Process → Gaussian process regression.

SLIDE 43
Probabilistic Regression (PR)

PR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables $y$ and $x$ by building a linear model of the form:

$$y = f(x) = w^T x, \qquad w, x \in \mathbb{R}^N.$$

If one assumes that the observed values of $y$ differ from $f(x)$ by an additive noise $\varepsilon$ that follows a zero-mean Gaussian distribution (such an assumption amounts to putting a prior distribution over the noise), then:

$$y = w^T x + \varepsilon, \qquad \varepsilon \sim N\!\left(0, \sigma^2\right).$$

Where have we seen this before? Answer: RVM / RVR.

SLIDE 44
Probabilistic Regression

Training set of M pairs of datapoints $X = \left\{x^i, y^i\right\}_{i=1}^{M}$. The likelihood of the regressive model $\mathbf{y} = w^T X + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, is $p(\mathbf{y} \mid X, w, \sigma)$, where $w$ and $\sigma$ are the parameters of the model.

Assuming the datapoints are independently and identically distributed (i.i.d.):

$$p(\mathbf{y} \mid X, w, \sigma) = \prod_{i=1}^{M} p\!\left(y^i \mid x^i, w, \sigma\right)
\sim \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\left(y^i - w^T x^i\right)^2}{2\sigma^2}\right).$$

SLIDE 45
Probabilistic Regression

Training set of M pairs of datapoints $X = \left\{x^i, y^i\right\}_{i=1}^{M}$, with the likelihood $p(\mathbf{y} \mid X, w, \sigma)$ of the regressive model as above.

Prior model on the distribution of the parameter $w$:

$$p(w) = N\!\left(0, \Sigma_w\right) \propto \exp\!\left(-\frac{1}{2}\, w^T \Sigma_w^{-1} w\right)$$

The hyperparameters (the prior covariance $\Sigma_w$) are given by the user; $w$ and $\sigma$ are the parameters of the model.

SLIDE 46
Probabilistic Regression

Estimate the conditional distribution of $w$ given the data using Bayes' rule (dropping $\sigma$, which is not treated as a variable):

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}:
\qquad p(w \mid X, \mathbf{y}) = \frac{p(\mathbf{y} \mid X, w)\, p(w)}{p(\mathbf{y} \mid X)}.$$

With the Gaussian prior $p(w) = N(0, \Sigma_w) \propto \exp\!\left(-\tfrac{1}{2} w^T \Sigma_w^{-1} w\right)$, this gives

$$p(w \mid X, \mathbf{y}) \propto N\!\left(\frac{1}{\sigma^2}\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1} X \mathbf{y},\;
\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1}\right).$$

The posterior distribution on $w$ is Gaussian.

SLIDE 47
Probabilistic Regression

$$p(w \mid X, \mathbf{y}) \propto N\!\left(\frac{1}{\sigma^2}\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1} X \mathbf{y},\;
\left(\frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}\right)^{-1}\right)$$

The posterior distribution on $w$ is Gaussian: the conditional distribution of a Gaussian distribution is also Gaussian (image from Wikipedia).

SLIDE 48
Probabilistic Regression

The expectation over the posterior distribution gives the best estimate; this is called the maximum a posteriori (MAP) estimate of $w$:

$$E\{p(w \mid X, \mathbf{y})\} = \frac{1}{\sigma^2}\, A^{-1} X \mathbf{y},
\qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}.$$

SLIDE 49
Probabilistic Regression

We can now compute the posterior distribution on $y$ for a query point $x$:

$$p(y \mid x, X, \mathbf{y}) = \int p(y \mid x, w)\, p(w \mid X, \mathbf{y})\, dw$$

$$p(y \mid x, X, \mathbf{y}) = N\!\left(\frac{1}{\sigma^2}\, x^T A^{-1} X \mathbf{y},\; x^T A^{-1} x\right),
\qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}.$$

SLIDE 50
Probabilistic Regression

$$p(y \mid x, X, \mathbf{y}) = N\!\left(\frac{1}{\sigma^2}\, x^T A^{-1} X \mathbf{y},\; x^T A^{-1} x\right),
\qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$$

The estimate of $y$ given a test point $x$ and the training datapoints $X, \mathbf{y}$ is given by:

$$\hat{y} = E\{p(y \mid x, X, \mathbf{y})\} = \frac{1}{\sigma^2}\, x^T A^{-1} X \mathbf{y}.$$

SLIDE 51
Probabilistic Regression

$$\hat{y} = E\{p(y \mid x, X, \mathbf{y})\} = \frac{1}{\sigma^2}\, x^T A^{-1} X \mathbf{y},
\qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$$

The variance gives a measure of the uncertainty of the prediction:

$$\operatorname{var}\{p(y \mid x, X, \mathbf{y})\} = x^T A^{-1} x.$$

SLIDE 52
Gaussian Process Regression

How to extend the simple linear Bayesian regressive model $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, to nonlinear regression?

SLIDE 53
Gaussian Process Regression

How to extend the simple linear Bayesian regressive model $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, to nonlinear regression? Map the input through a feature transformation $\varphi(x)$:

$$y = w^T \varphi(x) + \varepsilon, \qquad \varepsilon \sim N\!\left(0, \sigma^2\right).$$

SLIDE 54
Gaussian Process Regression

$$y = w^T \varphi(x) + \varepsilon, \qquad \varepsilon \sim N\!\left(0, \sigma^2\right)$$

This defines a distribution over functions.

SLIDE 55
Gaussian Process Regression

Applying the non-linear transformation $\varphi(x)$ to the linear predictive distribution

$$p(y \mid x, X, \mathbf{y}) = N\!\left(\frac{1}{\sigma^2}\, x^T A^{-1} X \mathbf{y},\; x^T A^{-1} x\right),
\qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1},$$

gives

$$p(y \mid x, X, \mathbf{y}) = N\!\left(\frac{1}{\sigma^2}\, \varphi(x)^T A^{-1} \Phi(X)\, \mathbf{y},\; \varphi(x)^T A^{-1} \varphi(x)\right),
\qquad A = \frac{1}{\sigma^2} \Phi(X)\, \Phi(X)^T + \Sigma_w^{-1}.$$

SLIDE 56
Gaussian Process Regression

$$p(y \mid x, X, \mathbf{y}) = N\!\left(\frac{1}{\sigma^2}\, \varphi(x)^T A^{-1} \Phi(X)\, \mathbf{y},\; \varphi(x)^T A^{-1} \varphi(x)\right),
\qquad A = \frac{1}{\sigma^2} \Phi(X)\, \Phi(X)^T + \Sigma_w^{-1}$$

Again, a Gaussian distribution.

SLIDE 57
Gaussian Process Regression

Define the kernel as the inner product in feature space:

$$k(x, x') = \varphi(x)^T \Sigma_w\, \varphi(x').$$

The predictive mean can then be rewritten as (see the class supplement for the steps):

$$\hat{y} = E\{y \mid x, X, \mathbf{y}\} = \sum_{i=1}^{M}\alpha_i\, k\!\left(x^i, x\right),
\qquad \boldsymbol{\alpha} = \left(K(X, X) + \sigma^2 I\right)^{-1}\mathbf{y}.$$
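
A NumPy sketch of the kernel-form prediction above, using a squared-exponential (RBF) kernel: alpha = (K(X,X) + sigma^2 I)^(-1) y and yhat(x) = sum_i alpha_i k(x^i, x). Data and hyperparameters are invented for the example.

```python
import numpy as np

def rbf(a, b, width=0.1):
    """Squared-exponential (RBF) kernel for 1-D inputs."""
    return np.exp(-(a - b) ** 2 / (2.0 * width ** 2))

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 1, 30))                     # training inputs (1-D)
y = np.sin(2 * np.pi * X) + 0.05 * rng.normal(size=30)
sigma = 0.05                                           # assumed noise level

K = rbf(X[:, None], X[None, :])                        # K(X, X)
alpha = np.linalg.solve(K + sigma**2 * np.eye(len(X)), y)

x_star = 0.42
y_hat = rbf(X, x_star) @ alpha                         # sum_i alpha_i k(x^i, x*)
print(y_hat)
```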

SLIDE 58
Gaussian Process Regression

$$\hat{y} = E\{y \mid x, X, \mathbf{y}\} = \sum_{i=1}^{M}\alpha_i\, k\!\left(x^i, x\right),
\qquad \boldsymbol{\alpha} = \left(K(X, X) + \sigma^2 I\right)^{-1}\mathbf{y}$$

In general $\alpha_i \neq 0$ for every training point → all datapoints are used in the computation!

SLIDE 59
Gaussian Process Regression

$$\hat{y} = E\{y \mid x, X, \mathbf{y}\} = \sum_{i=1}^{M}\alpha_i\, k\!\left(x^i, x\right),
\qquad \boldsymbol{\alpha} = \left(K(X, X) + \sigma^2 I\right)^{-1}\mathbf{y}$$

The kernel and its hyperparameters are given by the user. They can be optimized by maximizing the marginal likelihood (see the class supplement).

[Figures: fits with an RBF kernel of width 0.1 and of width 0.5.]

SLIDE 60
Gaussian Process Regression

Sensitivity to the choice of kernel width (called the lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential):

$$k(x, x') = e^{-\frac{\left\|x - x'\right\|}{l}}$$

Kernel width = 0.1.

SLIDE 61
Gaussian Process Regression

Sensitivity to the choice of kernel width (called the lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential):

$$k(x, x') = e^{-\frac{\left\|x - x'\right\|}{l}}$$

Kernel width = 0.5.
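
The same width-sensitivity experiment can be sketched with scikit-learn's GaussianProcessRegressor, where length_scale is the kernel width and alpha the assumed noise variance; synthetic data, illustrative values only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.05 * rng.normal(size=40)

for width in (0.1, 0.5):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=width),
                                  alpha=0.05**2,   # noise variance on the diagonal
                                  optimizer=None)  # keep the width fixed, as on the slides
    gp.fit(X, y)
    mean, std = gp.predict([[0.42]], return_std=True)
    print("width", width, "-> mean", float(mean[0]), "std", float(std[0]))
```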

SLIDE 62
Gaussian Process Regression

$$\hat{y} = E\{y \mid x, X, \mathbf{y}\} = \sum_{i=1}^{M}\alpha_i\, k\!\left(x^i, x\right),
\qquad \boldsymbol{\alpha} = \left(K(X, X) + \sigma^2 I\right)^{-1}\mathbf{y}$$

The value of the noise $\sigma$ needs to be pre-set by hand (with $\sigma \le 1$); the larger the noise, the more uncertainty. [Figures: fits with sigma = 0.05 and sigma = 0.01.]

$$\operatorname{cov}\!\left(p(y \mid x, X, \mathbf{y})\right) = K(x, x) - K(x, X)\left[K(X, X) + \sigma^2 I\right]^{-1} K(X, x)$$

SLIDE 63
Gaussian Process Regression

Low noise: σ = 0.05.

SLIDE 64
Gaussian Process Regression

High noise: σ = 0.2.

SLIDE 65
Gaussian Process Regression

$$\hat{y} = E\{y \mid x, X, \mathbf{y}\} = \sum_{i=1}^{M}\alpha_i\, k\!\left(x^i, x\right),
\qquad \boldsymbol{\alpha} = \left(K(X, X) + \sigma^2 I\right)^{-1}\mathbf{y}$$

The kernel is usually a Gaussian kernel with a stationary covariance function. Non-stationary covariance functions can encapsulate local variations in the density of the datapoints, e.g. Gibbs' non-stationary covariance function, whose length-scale is a function of $x$:

$$k(x, x') = \prod_{i=1}^{N}\left(\frac{2\, l_i(x)\, l_i(x')}{l_i(x)^2 + l_i(x')^2}\right)^{1/2}
\exp\!\left(-\sum_{i=1}^{N}\frac{\left(x_i - x_i'\right)^2}{l_i(x)^2 + l_i(x')^2}\right).$$
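
A NumPy sketch of Gibbs' non-stationary covariance function as written above, for 1-D inputs and with a made-up length-scale function l(x):

```python
import numpy as np

def lengthscale(x):
    """Hypothetical input-dependent length-scale l(x), chosen only for illustration."""
    return 0.1 + 0.4 * np.abs(x)

def gibbs_kernel(x, x_prime):
    """Gibbs' non-stationary covariance for 1-D inputs (N = 1)."""
    lx, lxp = lengthscale(x), lengthscale(x_prime)
    prefactor = np.sqrt(2.0 * lx * lxp / (lx**2 + lxp**2))
    return prefactor * np.exp(-(x - x_prime) ** 2 / (lx**2 + lxp**2))

print(gibbs_kernel(0.1, 0.2), gibbs_kernel(0.8, 0.9))
```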

SLIDE 66
Gaussian Process Regression

Linear model: $y = w^T x + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$.
Non-linear model: $y = w^T \varphi(x) + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$.

Both models follow a zero-mean Gaussian distribution:

$$E\{y\} = E\!\left\{w^T x\right\} + E\!\left\{N(0, \sigma^2)\right\} = E\!\left\{w^T x\right\} = 0,
\qquad E\{y\} = E\!\left\{w^T \varphi(x)\right\} + E\!\left\{N(0, \sigma^2)\right\} = E\!\left\{w^T \varphi(x)\right\} = 0.$$

Hence GPR predicts y = 0 away from the datapoints, whereas SVR predicts y = b away from the datapoints (see the exercise session).

SLIDE 67
Regression Algorithms in this Course

Classification techniques seen so far: Support Vector Machine, Relevance Vector Machine, Boosting – random projections, Boosting – random gaussians, Random forest, Gaussian Process. Regression techniques: Support vector regression, Relevance vector regression, Gaussian process regression, Gradient boosting, Locally weighted projected regression.

SLIDE 68
Regression Algorithms in this Course

Next: Relevance Vector Machine and Boosting – random gaussians → Gradient boosting.

SLIDE 69
Gradient Boosting

Choose some regressive technique (any we have seen so far) and apply boosting to train and combine the set of estimates $\hat{f}_1, \hat{f}_2, \dots, \hat{f}_m$.

SLIDE 70
Gradient Boosting

Aggregate to get the final estimate:

$$\hat{f} = \frac{1}{m}\sum_{i=1}^{m}\hat{f}_i.$$
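
A hedged sketch with scikit-learn's GradientBoostingRegressor on synthetic data. Note that standard gradient boosting fits each new estimator to the residuals of the current ensemble and sums the learning-rate-scaled estimates, which is a more specific recipe than the simple averaging written above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X).ravel() + 0.1 * rng.normal(size=200)

# m = n_estimators weak regressors (shallow trees), combined stage-wise.
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=2)
model.fit(X, y)
print("prediction at x=0.25:", model.predict([[0.25]]))
```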

SLIDE 71
Regression: Example

Wang & Zhu, "Financial market forecasting using a two-step kernel learning method for the support vector regression", Annals of Operations Research, 2010.

SLIDE 72
Pattern Recognition: Similar Trend

The actual Nikkei 225 opening cash index and its predicted values from the random walk, SVR and ICA–SVR models, using the last 50 data points of the Nikkei 225 index.

Wang & Zhu, "Financial market forecasting using a two-step kernel learning method for the support vector regression", Annals of Operations Research, 2010.

SLIDE 73
Regression: Example

SVR for predicting the cumulative log return over a period of 2500 days. The paper contrasted two methods for automatically determining the optimal features (i.e. moving averages), and found that short-term (daily and weekly) trends had a bigger impact than long-term (monthly and quarterly) trends in predicting the next-day return.

Wang & Zhu, "Financial market forecasting using a two-step kernel learning method for the support vector regression", Annals of Operations Research, 2010.

SLIDE 74
Regression Algorithms in this Course

Next: Locally weighted projected regression.

SLIDE 75
Locally Weighted Projected Regression (LWPR)

Choosing the actual number of local models is often difficult, as it can lead to overfitting. This is not a problem when the local models are learned purely from local data: then an increasing number of local models does not overfit!

SLIDE 76
Locally Weighted Projected Regression (LWPR)

Each local model $k$ weights the datapoints with a Gaussian receptive field and fits a weighted linear regression:

$$w_k^i = \exp\!\left(-\frac{1}{2}\left(x^i - x_k\right)^T D_k \left(x^i - x_k\right)\right),
\qquad \beta_k = \left(X^T W_k X\right)^{-1} X^T W_k Y.$$

SLIDE 77
Locally Weighted Projected Regression (LWPR)

Approximate non-linear functions with a combination of multiple weighted linear models:

$$w_k^i = \exp\!\left(-\frac{1}{2}\left(x^i - x_k\right)^T D_k \left(x^i - x_k\right)\right),
\qquad \beta_k = \left(X^T W_k X\right)^{-1} X^T W_k Y,
\qquad \hat{y}_k = \beta_k^T x,
\qquad \hat{y} = \frac{\sum_k w_k\, \hat{y}_k}{\sum_k w_k}.$$

LWPR solves this problem for high-dimensional spaces.

Sethu Vijayakumar, Aaron D'Souza and Stefan Schaal, "Online Learning in High Dimensions", Neural Computation, vol. 17, pp. 2602–2634 (2005).

SLIDE 78
Locally Weighted Projected Regression (LWPR)

Open parameters of each local model:

  • The linear model: $y = \beta^T x + \beta_0 = \tilde{\beta}^T \tilde{x}$, where $\tilde{x} = \left[x^T \; 1\right]^T$.
  • The kernel function: $w = \exp\!\left(-\frac{1}{2}(x - c)^T D (x - c)\right)$, where $D = M^T M$.
  • The prediction: $\hat{y} = \dfrac{\sum_{i=1}^{K} w_i\, \hat{y}_i}{\sum_{i=1}^{K} w_i}$.

SLIDE 79
Locally Weighted Projected Regression (LWPR)

  • The linear model: $y = \beta^T x + \beta_0 = \tilde{\beta}^T \tilde{x}$, where $\tilde{x} = \left[x^T \; 1\right]^T$.
  • The kernel function: $w = \exp\!\left(-\frac{1}{2}(x - c)^T D (x - c)\right)$, where $D = M^T M$ (usually a diagonal matrix); the centers $c$ of the local models are fixed, not learned.
  • The prediction: $\hat{y} = \dfrac{\sum_{i=1}^{K} w_i\, \hat{y}_i}{\sum_{i=1}^{K} w_i}$.
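
A NumPy sketch of the prediction rule above: K local linear models with Gaussian receptive fields, combined through their normalized weights. The centers, distance metrics and local coefficients would normally be learned; here they are fixed by hand for illustration.

```python
import numpy as np

# K = 3 hypothetical local models for a 1-D input: center c_k, metric D_k, parameters beta_k.
centers = np.array([0.2, 0.5, 0.8])
D = np.array([50.0, 50.0, 50.0])            # "distance metric" (a scalar in the 1-D case)
betas = np.array([[1.0, 0.0],               # each row: [slope, offset] of y = b1*x + b0
                  [0.0, 0.5],
                  [-1.0, 1.3]])

def lwpr_predict(x):
    w = np.exp(-0.5 * D * (x - centers) ** 2)      # receptive-field weights w_k(x)
    y_local = betas[:, 0] * x + betas[:, 1]        # local linear predictions
    return np.sum(w * y_local) / np.sum(w)         # yhat = sum_k w_k y_k / sum_k w_k

print(lwpr_predict(0.35))
```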

SLIDE 80
Locally Weighted Projected Regression (LWPR)

For learning each local model, LWPR employs an online formulation of weighted partial least squares (PLS) regression. Within each local model, the input x is projected along selected directions u, yielding "latent" variables s.

SLIDE 81
PLS and CCA

PLS represents a form of CCA, where the criterion of maximal correlation is balanced with the requirement to explain as much variance as possible in both the X and Y spaces:

$$\max_{w_x, w_y} \;
\frac{\operatorname{cov}\!\left(X w_x,\, Y w_y\right)^2}
{\left[\left(1 - \gamma_X\right)\operatorname{var}\!\left(X w_x\right) + \gamma_X\right]
\left[\left(1 - \gamma_Y\right)\operatorname{var}\!\left(Y w_y\right) + \gamma_Y\right]}$$

$\gamma_X = \gamma_Y = 0$ → CCA;  $\gamma_X = \gamma_Y = 1$ → PLS.

SLIDE 82
PLS and CCA

$\gamma_X = \gamma_Y = 0$ → CCA;  $\gamma_X = \gamma_Y = 1$ → PLS.

SLIDE 83
Partial Least Squares

SLIDE 84
LWPR: A Basic Incremental Algorithm

Given a new sample $(x, y)$, update all K local models with recursive least squares and stochastic leave-one-out cross-validation:

$$\beta_k^{n+1} = \beta_k^{n} + w\, P_k^{n+1}\tilde{x}\left(y - \tilde{x}^T \beta_k^{n}\right),
\qquad
P_k^{n+1} = \frac{1}{\lambda}\left(P_k^{n} - \frac{P_k^{n}\,\tilde{x}\,\tilde{x}^T P_k^{n}}{\frac{\lambda}{w} + \tilde{x}^T P_k^{n}\tilde{x}}\right)$$

Update the distance metric by gradient descent on the leave-one-out cost:

$$M_k^{n+1} = M_k^{n} - \alpha\,\frac{\partial J}{\partial M},
\qquad D_k^{n+1} = \left(M_k^{n+1}\right)^T M_k^{n+1},$$

$$J = \frac{1}{\sum_{i=1}^{N} w_{k,i}}\sum_{i=1}^{N} w_{k,i}\left(y_i - \hat{y}_{k,i,-i}\right)^2
+ \gamma\sum_{i,j=1}^{n} D_{k,ij}^2.$$

Automatic structure determination: if $\min_k w_k(x) < w_{gen}$, create a new receptive field at $c_{K+1} = x$.
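
One plausible reading of the recursive least-squares update above, as a NumPy sketch of standard weighted RLS with a forgetting factor lambda for a single local model; the full LWPR bookkeeping (per-projection PLS directions, the gradient step on D, receptive-field creation) is omitted, so treat this as an assumption-laden illustration rather than the algorithm itself.

```python
import numpy as np

def weighted_rls_update(beta, P, x_tilde, y, w, lam=0.999):
    """One weighted recursive least-squares step for a local model.
    x_tilde is the input with a trailing 1 (for the offset), w the receptive-field weight."""
    Px = P @ x_tilde
    P_new = (P - np.outer(Px, Px) / (lam / w + x_tilde @ Px)) / lam
    beta_new = beta + w * (P_new @ x_tilde) * (y - x_tilde @ beta)
    return beta_new, P_new

# Toy usage: one update of a 1-D local model (values invented).
beta = np.zeros(2)                  # [slope, offset]
P = np.eye(2) * 100.0
beta, P = weighted_rls_update(beta, P, np.array([0.3, 1.0]), y=0.7, w=0.9)
print(beta)
```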

SLIDE 85
Locally Weighted Projected Regression (LWPR)

SLIDE 86
Locally Weighted Projected Regression (LWPR)

Increasing the number of components leads to a better fit of the local linearities.

SLIDE 87
Empirical Evaluations (Cross Data)

Learned function vs. target function. Input dimensionality = 2 (plus 8 or 18 redundant dimensions), noise ~ N(0, 0.01), number of training data = 500. [Figure: initial receptive fields vs. learned receptive fields.]

Sethu Vijayakumar @ Univ. of Edinburgh

SLIDE 88
Locally Weighted Partial Least-squares (LWPR)

SLIDE 89
Summary

We have seen several techniques for performing non-linear regression in machine learning. The techniques differ in their algorithms and in their number of hyperparameters. Some techniques (GPR, RVR) provide a measure of the uncertainty of the model, which can be used to determine when inference is trustworthy. Some techniques (ν-SVR, RVR, LWPR) are designed to be computationally cheap at retrieval (very few support vectors, few local models). Other techniques (GPR) are meant to provide very accurate estimates of the data, at the cost of retaining all datapoints for retrieval. Others offer the ability to perform incremental learning (LWPR).
We have seen a few different techniques to perform non-linear regression in machine learning. The techniques differ in their algorithm and in the number of hyperparameters. Some techniques (GP, RVR) provide an metric of uncertainty of the model, which can be used to determine when inference is trustable. Some techniques (ν-SVR, RVR, LWPR) are designed to be computationally cheap at retrieval (very few support vectors, few models). Other techniques (GP) are meant to provide very accurate estimate of the data, at the cost of retaining all datapoints for retrieval. Or offer the ability to perform incremental learning (LWPR)