ADVANCED MACHINE LEARNING
Non-linear regression techniques (SVR and extensions, GPR, Gradient Boosting)
Regression: Principle

Map an N-dimensional input to a continuous output: estimate the function $f: \mathbb{R}^N \to \mathbb{R}$ that best predicts the set of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$.

[Figure: true function vs. estimate, with training points $(x^1, y^1), \dots, (x^4, y^4)$]
Roadmap: regression counterparts of the classification techniques seen so far.
- Support Vector Machine → Support Vector Regression
- Relevance Vector Machine → Relevance Vector Regression
- Gaussian Process → Gaussian Process Regression
- Boosting (random projections, random Gaussians), Random Forest → Gradient Boosting
- Locally Weighted Projected Regression
Not covered in class!!
This section: Support Vector Regression and Relevance Vector Regression.
How do we generalize the support vector machine framework for classification to estimate continuous functions from training points $\{(x^i, y^i)\}_{i=1,\dots,M}$?
1. Assume a non-linear mapping through feature space, then perform linear regression in feature space.
2. Supervised learning minimizes an error function. We first need a way to measure the error in the linear case!
Assume a linear mapping $f$, s.t. $y = f(x) = w^T x + b$.

How do we estimate $w$ and $b$ to best predict the pairs of training points $\{(x^i, y^i)\}_{i=1,\dots,M}$? Measure the error on the prediction. ($b$ is estimated through least-squares regression on the support vectors, as in SVM; hence we omit it from the rest.)

[Figure: training points and the linear fit $y = f(x) = w^T x + b$]
Set an upper bound $\varepsilon$ on the error, and consider as correctly predicted all points such that $|f(x) - y| \le \varepsilon$. Penalize only datapoints that are not contained in the $\varepsilon$-tube.

[Figure: $\varepsilon$-insensitive tube around the fit $y = f(x) = w^T x + b$]
The $\varepsilon$-margin is a measure of the width of the $\varepsilon$-insensitive tube, and hence of the precision of the regression. A small $\|w\|$ corresponds to a small slope for $f$: in the linear case, $f$ is more horizontal.

[Figure: wide $\varepsilon$-margin around a nearly horizontal fit]
A large $\|w\|$ corresponds to a large slope for $f$: in the linear case, $f$ is more vertical. The flatter the slope of the function $f$, the larger the $\varepsilon$-margin. To maximize the margin, we must minimize the norm of $w$.

[Figure: narrow $\varepsilon$-margin around a steep fit]
This can be rephrased as a constraint-based optimization problem:

$$\min_w \ \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad \begin{cases} y^i - \langle w, x^i \rangle - b \le \varepsilon \\ \langle w, x^i \rangle + b - y^i \le \varepsilon \end{cases} \quad \forall i = 1,\dots,M$$
Need to penalize points outside the $\varepsilon$-insensitive tube. Introduce slack variables $\xi_i, \xi_i^* \ge 0$:

$$\min_{w,\,\xi,\,\xi^*} \ \tfrac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*)$$
$$\text{subject to} \quad y^i - \langle w, x^i \rangle - b \le \varepsilon + \xi_i, \quad \langle w, x^i \rangle + b - y^i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0$$
All points outside the $\varepsilon$-tube (those with $\xi_i > 0$ or $\xi_i^* > 0$) become support vectors.

We now have the solution to the linear regression problem. How do we generalize this to the nonlinear case?
Lift $x$ into feature space, $x \to \phi(x)$, and then perform linear regression in feature space.

Linear case: $y = f(x) = \langle w, x \rangle + b$. Non-linear case: $y = f(x) = \langle w, \phi(x) \rangle + b$, where $w$ now lives in feature space!
The optimization problem keeps the same form in feature space, with $\phi(x^i)$ replacing $x^i$.
Again, we can solve this quadratic problem by introducing sets of Lagrange multipliers $\alpha_i, \alpha_i^*, \eta_i, \eta_i^* \ge 0$ and writing the Lagrangian (Lagrangian = objective function + multipliers $\times$ constraints):

$$L(w, b, \xi, \xi^*) = \tfrac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*) - \sum_{i=1}^{M}(\eta_i \xi_i + \eta_i^* \xi_i^*)$$
$$\qquad - \sum_{i=1}^{M} \alpha_i \big(\varepsilon + \xi_i - y^i + \langle w, \phi(x^i) \rangle + b\big) - \sum_{i=1}^{M} \alpha_i^* \big(\varepsilon + \xi_i^* + y^i - \langle w, \phi(x^i) \rangle - b\big)$$
The slack variables $\xi_i, \xi_i^*$ correspond to points lying on either side of the $\varepsilon$-tube. At the optimum, $\alpha_i > 0$ or $\alpha_i^* > 0$ only for points that do not satisfy the constraints → points outside the $\varepsilon$-tube; $\alpha_i = \alpha_i^* = 0$ for all other points.
Requiring that the partial derivatives are all zero:
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\,\phi(x^i) = 0 \;\Rightarrow\; w = \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\,\phi(x^i)$$
→ $w$ is a linear combination of the support vectors.

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^{M}(\alpha_i - \alpha_i^*) = 0 \;\Rightarrow\; \sum_{i=1}^{M}\alpha_i = \sum_{i=1}^{M}\alpha_i^*$$
→ rebalancing the effect of the support vectors on both sides of the $\varepsilon$-tube.
And replacing in the primal Lagrangian, we get the dual optimization problem:

$$\max_{\alpha, \alpha^*} \; -\tfrac{1}{2}\sum_{i,j=1}^{M}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, k(x^i, x^j) \;-\; \varepsilon \sum_{i=1}^{M}(\alpha_i + \alpha_i^*) \;+\; \sum_{i=1}^{M} y^i (\alpha_i - \alpha_i^*)$$
$$\text{subject to} \quad \sum_{i=1}^{M}(\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in \left[0, \tfrac{C}{M}\right]$$

Kernel trick: $k(x^i, x^j) = \langle \phi(x^i), \phi(x^j) \rangle$.
The solution is given by:

$$y = f(x) = \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\, k(x^i, x) + b$$

The $(\alpha_i - \alpha_i^*)$ are linear coefficients (Lagrange multipliers, one pair per constraint). If one uses an RBF kernel, $f$ is a sum of $M$ un-normalized isotropic Gaussians centered on each training datapoint.
[Figure: the kernel places a Gaussian function on each support vector]
The Lagrange multipliers define the importance of each Gaussian function.

[Figure: $y = f(x)$ built from Gaussians centered on support vectors $x^1, \dots, x^6$, with multipliers of different magnitudes (e.g. $\alpha_1^* = 1.5$, $\alpha_2 = 2$); where the effect of the support vectors vanishes, $f$ converges to $b$.]
Exercise. Recall the solution to SVR:

$$y = f(x) = \sum_{i=1}^{M}(\alpha_i - \alpha_i^*)\, k(x^i, x) + b$$

a) What type of function can you model with a homogeneous polynomial kernel?
b) What minimum order of a homogeneous polynomial kernel do you need to achieve good regression on the set of 3 points below?

[Figure: 3 training points]
The solution to SVR we just saw is referred to as $\varepsilon$-SVR:

$$\min_{w,\,\xi,\,\xi^*} \ \tfrac{1}{2}\|w\|^2 + \frac{C}{M}\sum_{i=1}^{M}(\xi_i + \xi_i^*) \quad \text{s.t.} \quad y^i - \langle w, x^i \rangle - b \le \varepsilon + \xi_i, \;\; \langle w, x^i \rangle + b - y^i \le \varepsilon + \xi_i^*, \;\; \xi_i, \xi_i^* \ge 0$$

Two hyperparameters:
- $C$ controls the penalty term on a poor fit.
- $\varepsilon$ determines the minimal required precision.
Effect of the RBF kernel width on the fit. Here fit using C=100, ε=0.1, kernel width=0.01.
Effect of the RBF kernel width on the fit. Here fit using C=100, ε=0.01, kernel width=0.01: overfitting.
Effect of the RBF kernel width on the fit. Here fit using C=100, ε=0.05, kernel width=0.01: the effect of the kernel width is reduced by choosing appropriate hyperparameters.
Note: MLDemos does not display the support vectors if there is more than one point for the same x!
As the number of datapoints grows, so does the number of support vectors. As for ν-SVM, one can introduce a parameter ν that bounds the fraction of support vectors (see the corresponding slides for SVM).
As for ν-SVM, one can rewrite the problem as a convex optimization problem in which $\varepsilon$ itself is optimized:

$$\min_{w,\,\xi,\,\xi^*,\,\varepsilon} \ \tfrac{1}{2}\|w\|^2 + C\left(\nu\varepsilon + \frac{1}{M}\sum_{j=1}^{M}(\xi_j + \xi_j^*)\right)$$
$$\text{subject to} \quad (\langle w, x^j \rangle + b) - y^j \le \varepsilon + \xi_j, \quad y^j - (\langle w, x^j \rangle + b) \le \varepsilon + \xi_j^*, \quad \xi_j, \xi_j^* \ge 0, \quad \varepsilon \ge 0, \quad 0 \le \nu \le 1$$

The margin errors are the datapoints for which $\xi_j > 0$ (or $\xi_j^* > 0$). $\nu$ is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
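A corresponding scikit-learn sketch with NuSVR, where ν replaces ε and ε is adapted automatically (synthetic data, arbitrary parameter values):

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 50))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + 0.1 * rng.randn(50)

# nu upper-bounds the fraction of margin errors and lower-bounds
# the fraction of support vectors; epsilon is adapted automatically.
model = NuSVR(kernel="rbf", C=100.0, nu=0.1, gamma=50.0)
model.fit(X, y)
print("fraction of support vectors:", len(model.support_) / len(X))
```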
Effect of the automatic adaptation of ε using ν-SVR
Effect of the automatic adaptation of ε using ν-SVR, with added noise on the data.
Relevance vector regression follows the same principle as described for RVM (see the slides on SVM and extensions); the derivation of the parameters, however, differs (see Tipping 2001 for details). To recall, we start from the solution of SVM and rewrite it as a linear combination over $M$ basis functions:

$$y(x) = f(x) = \sum_{i=1}^{M}\alpha_i\, k(x, x^i) + b = \alpha^T \psi(x), \qquad \psi_i(x) = k(x, x^i), \; i = 1, \dots, M$$

In the (binary) classification case, $y \in [0;1]$; in the regression case, $y \in \mathbb{R}$. A sparse solution has a majority of entries with $\alpha_i = 0$.
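A small numpy sketch of this basis-function view, with an assumed RBF kernel and a hand-made sparse coefficient vector for illustration (not a fitted RVR model):

```python
import numpy as np

def psi(x, X_train, width=0.1):
    """Basis-function vector psi_i(x) = k(x, x_i), one entry per training point."""
    return np.exp(-((x - X_train) ** 2).sum(axis=1) / (2 * width ** 2))

X_train = np.linspace(0, 1, 20)[:, None]
alpha = np.zeros(20)
alpha[[3, 9, 15]] = [1.2, -0.7, 0.5]   # a sparse solution: most alphas are zero

f = lambda x: alpha @ psi(np.atleast_1d(x), X_train)
print(f(np.array([0.45])))
```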
Comparison on the same dataset:
- Solution with ε-SVR: RBF kernel, C=3000, ε=0.08, σ=0.05 → 37 support vectors
- Solution with ν-SVR: RBF kernel, C=3000, ν=0.04, σ=0.001 → 17 support vectors
- Solution with RVR: RBF kernel, ε=0.08, σ=0.05 → 7 support vectors
Next: Gaussian Process Regression.
GPR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables $y$ and $x$ by building a linear model:

$$f(x) = w^T x, \qquad x \in \mathbb{R}^N$$

If one assumes that the observed values of $y$ differ from $f(x)$ by an additive noise $\varepsilon$ that follows a zero-mean Gaussian distribution (such an assumption amounts to putting a prior distribution over the noise), then:

$$y = w^T x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)$$

Where have we seen this before? Answer: RVM / RVR.
The likelihood of the observations factorizes over the training datapoints:

$$p(\mathbf{y} \mid X, w) = \prod_{i=1}^{M} p(y^i \mid x^i, w) = \prod_{i=1}^{M} \mathcal{N}\!\left(y^i;\; w^T x^i,\; \sigma^2\right)$$

$w$: parameters of the model.
Prior model on the distribution of the parameters $w$:

$$p(w) = \mathcal{N}(0, \Sigma_w) \propto \exp\!\left(-\tfrac{1}{2}\, w^T \Sigma_w^{-1} w\right)$$

The hyperparameters ($\Sigma_w$, $\sigma$) are given by the user; $w$ are the parameters of the model.
By Bayes' rule, the posterior on $w$ is proportional to likelihood $\times$ prior (we drop the normalizing term, which does not depend on $w$), and it is again Gaussian:

$$p(w \mid X, \mathbf{y}) \sim \mathcal{N}\!\left(\frac{1}{\sigma^2} A^{-1} X \mathbf{y},\; A^{-1}\right), \qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$$
The conditional distribution of a Gaussian distribution is also Gaussian. [Image from Wikipedia]
The expectation over the posterior distribution gives the best estimate, called the maximum a posteriori (MAP) estimate of $w$:

$$\bar{w} = \frac{1}{\sigma^2} A^{-1} X \mathbf{y}, \qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$$
To predict the output at a testing point $x_*$ given the training datapoints $(X, \mathbf{y})$, we average the linear model over the posterior on $w$. The predictive distribution is again Gaussian:

$$p(y_* \mid x_*, X, \mathbf{y}) = \mathcal{N}\!\left(\frac{1}{\sigma^2}\, x_*^T A^{-1} X \mathbf{y},\;\; x_*^T A^{-1} x_*\right), \qquad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$$
Its expectation gives the point estimate:

$$E\{y_*\} = E\big\{p(y_* \mid x_*, X, \mathbf{y})\big\} = \frac{1}{\sigma^2}\, x_*^T A^{-1} X \mathbf{y}$$
Non-linear transformation: lift $x$ into feature space, $x \to \phi(x)$, and perform the same Bayesian linear regression in feature space. With $\Phi = [\phi(x^1), \dots, \phi(x^M)]$ replacing $X$:

$$p(y_* \mid x_*, X, \mathbf{y}) = \mathcal{N}\!\left(\frac{1}{\sigma^2}\, \phi(x_*)^T A^{-1} \Phi\, \mathbf{y},\;\; \phi(x_*)^T A^{-1} \phi(x_*)\right), \qquad A = \frac{1}{\sigma^2} \Phi \Phi^T + \Sigma_w^{-1}$$
Again, a Gaussian distribution.
All terms involve $w$ and $\phi$ only through inner products in feature space, so the kernel trick applies here too. The predictive mean becomes a weighted sum of kernels centered on the training datapoints:

$$E\{y_*\} = \sum_{i=1}^{M} \alpha_i\, k(x^i, x_*), \qquad \alpha = (K + \sigma^2 I)^{-1}\, \mathbf{y}$$

(See the supplement for the steps.)
All datapoints are used in the computation of $\alpha = (K + \sigma^2 I)^{-1}\mathbf{y}$!
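A minimal numpy sketch of the kernelized prediction; the squared-exponential kernel, lengthscale and noise level are illustrative assumptions:

```python
import numpy as np

def k(a, b, l=0.2):
    """Squared-exponential kernel matrix between point sets a (p,d) and b (q,d)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * l**2))

rng = np.random.RandomState(1)
X = rng.uniform(0, 1, (30, 1))                     # training inputs
y = np.sin(2 * np.pi * X.ravel()) + 0.05 * rng.randn(30)
sigma = 0.05                                        # assumed noise level

K = k(X, X)
alpha = np.linalg.solve(K + sigma**2 * np.eye(30), y)   # (K + sigma^2 I)^{-1} y

X_star = np.array([[0.25], [0.5]])                      # testing points
mean = k(X_star, X) @ alpha                             # E{y*} = sum_i alpha_i k(x_i, x*)
print(mean)
```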
The kernel and its hyperparameters are given by the user. They can be optimized by maximizing the marginal likelihood (see the class supplement).

[Figure: fits with an RBF kernel of width 0.1 vs. width 0.5]
Sensitivity to the choice of kernel width $l$ (called lengthscale in most books) when using Gaussian kernels (also called RBF or squared exponential), of the form $k(x, x') = \exp\!\left(-\|x - x'\|^2 / l^2\right)$.

[Figure: kernel width = 0.1]
[Figure: same data, kernel width = 0.5]
The value of the noise $\sigma$ needs to be pre-set by hand. The larger the noise, the more uncertainty in the prediction. (Here the noise is $\le 1$.)

[Figure: fits with $\sigma = 0.05$ vs. $\sigma = 0.01$]
Low noise: σ=0.05
High noise: σ=0.2
The kernel is usually a Gaussian kernel with a stationary covariance function. Non-stationary covariance functions can encapsulate local variations in the density of the datapoints, e.g. Gibbs' non-stationary covariance function, where the lengthscale is a function of $x$:

$$k(x, x') = \prod_{i=1}^{N}\left(\frac{2\, l_i(x)\, l_i(x')}{l_i(x)^2 + l_i(x')^2}\right)^{1/2} \exp\!\left(-\sum_{i=1}^{N}\frac{(x_i - x'_i)^2}{l_i(x)^2 + l_i(x')^2}\right)$$
Linear model: $y = w^T x + \varepsilon$. Non-linear model: $y = w^T \phi(x) + \varepsilon$. Both models follow a zero-mean Gaussian distribution! As a result, GPR predicts $y = 0$ away from the datapoints, whereas SVR predicts $y = b$ away from the datapoints (see exercise session).
Next: Gradient Boosting.

Gradient boosting builds the regression function as a sum of $m$ weak learners, $f(x) = \sum_{i=1}^{m} f_i(x)$, each new learner being fit to correct the error of the current ensemble.

[Figures: boosting illustrated with weak learners $f_1, f_2, \dots, f_m$]
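A minimal sketch of the residual-fitting idea behind gradient boosting for squared loss, using depth-1 trees as weak learners (all values illustrative; this is the generic algorithm, not necessarily the exact variant from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, (100, 1))
y = np.sin(2 * np.pi * X.ravel()) + 0.1 * rng.randn(100)

learners, lr = [], 0.1
residual = y.copy()
for _ in range(50):                         # m weak learners
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    learners.append(stump)
    residual -= lr * stump.predict(X)       # fit the next learner to what is left

f = lambda X: lr * sum(s.predict(X) for s in learners)
print(np.mean((f(X) - y) ** 2))             # training error of the ensemble
```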
Application: Wang & Zhu, Financial market forecasting using a two-step kernel learning method for the support vector regression, Annals of Operations Research, 2010.
[Figure: the actual Nikkei 225 opening cash index and its predicted values from the random walk, SVR and ICA-SVR models, using the last 50 data points of the Nikkei 225 index (Wang & Zhu, 2010)]
SVR was used to predict the cumulative log return over a period of 2500 days, contrasting two methods to automatically determine the optimal features (i.e., moving averages). The study found that short-term (daily and weekly) trends had a bigger impact than long-term (monthly and quarterly) trends in predicting the next-day return (Wang & Zhu, 2010).
Next: Locally Weighted Projected Regression (LWPR).
Choosing the actual number of local models is often difficult, as a poor choice can lead to under- or over-fitting.
Approximate non-linear functions with a combination of multiple weighted linear models. Each local model $k$ weighs the datapoints with a Gaussian kernel and is fit by weighted least squares:

$$w_k^i = \exp\!\left(-\tfrac{1}{2}\,(x^i - c_k)^T D_k\, (x^i - c_k)\right), \qquad \beta_k = (X^T W_k X)^{-1} X^T W_k Y$$

The prediction is the weighted average of the local linear predictions $\hat{y}_k = \beta_k^T x$:

$$\hat{y} = \frac{\sum_k w_k\, \hat{y}_k}{\sum_k w_k}$$

LWPR solves this problem for high-dimensional spaces.

Sethu Vijayakumar, Aaron D'Souza and Stefan Schaal, Online Learning in High Dimensions, Neural Computation, vol. 17, pp. 2602-34 (2005)
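A numpy sketch of locally weighted regression with fixed centers, following the formulas above; the centers, scalar metric D and data are illustrative assumptions:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, (100, 1))
y = np.sin(2 * np.pi * X.ravel()) + 0.05 * rng.randn(100)

centers = np.linspace(0, 1, 8)[:, None]   # fixed centers, not learned
D = 100.0                                  # scalar distance metric (1-D input)
Xt = np.hstack([X, np.ones((100, 1))])     # x_tilde = [x, 1] to include an offset

def predict(x):
    x_t = np.array([x, 1.0])
    num = den = 0.0
    for c in centers:
        w = np.exp(-0.5 * D * (X.ravel() - c[0]) ** 2)       # weights of datapoints
        beta = np.linalg.solve(Xt.T * w @ Xt, Xt.T * w @ y)  # weighted least squares
        wk = np.exp(-0.5 * D * (x - c[0]) ** 2)              # activation at the query
        num += wk * (beta @ x_t)
        den += wk
    return num / den                                          # weighted average

print(predict(0.3))
```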
Each local model is linear with an offset:

$$y = \beta^T x + \beta_0 = \tilde{\beta}^T \tilde{x}, \qquad \tilde{x} = [x^T\; 1]^T$$

with Gaussian weights

$$w = \exp\!\left(-\tfrac{1}{2}\,(x - c)^T D\, (x - c)\right), \qquad D = M^T M$$

Open parameters: $\tilde{\beta}$, $c$, $D$. The overall prediction combines the $K$ local models:

$$\hat{y} = \frac{\sum_{i=1}^{K} w_i\, \hat{y}_i}{\sum_{i=1}^{K} w_i}$$
In practice, $D$ is usually a diagonal matrix, and the centers of the local models are fixed, not learned.
For learning each local model, LWPR employs an online formulation of weighted partial least squares (PLS) regression. Within each local model the input x is projected along selected directions u yielding “latent” variables s.
PLS represents a form of CCA, where the criterion of maximal correlation is balanced with the requirement to explain as much variance as possible in both the $X$ and $Y$ spaces:

$$\max_{w_x, w_y}\; \operatorname{cov}(X w_x,\, Y w_y)^2 = \operatorname{var}(X w_x) \cdot \operatorname{corr}(X w_x, Y w_y)^2 \cdot \operatorname{var}(Y w_y)$$
LWPR updates each local model incrementally with recursive least squares (with forgetting factor $\lambda$), using stochastic leave-one-out cross-validation:

$$\beta_k^{n+1} = \beta_k^n + w_k\, \mathbf{P}_k^{n+1} \mathbf{x} \left(y - \beta_k^{nT}\mathbf{x}\right), \qquad \mathbf{P}_k^{n+1} = \frac{1}{\lambda}\left(\mathbf{P}_k^n - \frac{\mathbf{P}_k^n\, \mathbf{x}\, \mathbf{x}^T \mathbf{P}_k^n}{\frac{\lambda}{w_k} + \mathbf{x}^T \mathbf{P}_k^n\, \mathbf{x}}\right)$$

The distance metric is adapted by gradient descent on the leave-one-out cost $J$:

$$M_k^{n+1} = M_k^n - \alpha \frac{\partial J}{\partial M} \quad \text{and} \quad D_k^{n+1} = \left(M_k^{n+1}\right)^T M_k^{n+1}$$

$$J = \frac{1}{\sum_{i=1}^{N} w_{k,i}} \sum_{i=1}^{N} w_{k,i} \left(y^i - \hat{y}_{k,i,-i}\right)^2 + \gamma \sum_{i,j=1}^{n} D_{k,ij}^2$$

Automatic structure determination: a new local model is created if $\min_k w_k$ falls below a threshold, i.e., if no existing local model is sufficiently activated by the new datapoint.
Increasing the number of components leads to a better fit of the local linearities.
Sethu Vijayakumar @ Univ. of Edinburgh
Summary: We have seen several techniques for performing non-linear regression in machine learning. The techniques differ in their algorithms and in their number of hyperparameters. Some techniques (GPR, RVR) provide a measure of the uncertainty of the model, which can be used to determine when inference is trustworthy. Some techniques (ν-SVR, RVR, LWPR) are designed to be computationally cheap at retrieval (very few support vectors, few local models). Other techniques (GPR) are meant to provide a very accurate estimate of the data, at the cost of retaining all datapoints for retrieval, or offer the ability to perform incremental learning (LWPR).