APPLIED MACHINE LEARNING
Linear and Weighted Regression, Support Vector Regression
Classification (reminder)
Classification maps an N-dimensional input x ∈ ℝ^N to discrete values y.
E.g.: x = [Length, Color] → "Banana" or "Apple".
[Figure: scatter of bananas vs. apples in the (Length, Color) plane.]
Regression maps an N-dimensional input x ∈ ℝ^N to continuous values y ∈ ℝ.
E.g.: x = income (GDP) → y = continuous value of life satisfaction.
[Figure: life satisfaction vs. income (GDP 2003, log scale) for Bangladesh, India, Nigeria, Cambodia, China, Italy, Japan and the US.]
Given a query point, e.g. Russia with GDP = 30 000, the regression gives an estimation of life satisfaction ≈ 6.5.
School of Engineering – Section Microtechnique © 2004 A. Billard – Adapted from Blei 99 and Dorr & Montz 2004
Example: predict the number of diplomas that will be awarded in the next ten years across the two EPFs; the number of diplomas follows a non-linear curve as a function of time.
Example: predict the velocity of the robot given its position (x*: the target).
Linear regression searches for a linear mapping between the input x and the output y:

    y = w^T x + b
One can omit the intercept b by centering the data: with x' = x - x̄ and y' = y - ȳ (x̄, ȳ: the means of x and y), the mapping becomes

    y' = w^T x'

and the least-square estimate of the intercept is recovered as b = ȳ - w^T x̄.
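As a minimal sketch, here is this centering trick in NumPy; the data and variable names are illustrative, not from the slides:

    import numpy as np

    # Made-up data: y = w^T x + b plus noise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=100)

    x_bar, y_bar = X.mean(axis=0), y.mean()
    Xc, yc = X - x_bar, y - y_bar                  # x' = x - x_bar, y' = y - y_bar
    w, *_ = np.linalg.lstsq(Xc, yc, rcond=None)    # fit y' = w^T x' (no intercept)
    b = y_bar - w @ x_bar                          # recover the intercept
    print(w, b)                                    # ~ [1.0, -2.0, 0.5] and ~ 4.0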
Find the optimal parameter w through least-square regression over the training pairs {(x^i, y^i)}_{i=1,...,M}, x^i ∈ ℝ^N, y^i ∈ ℝ:

    w^* = argmin_w Σ_{i=1}^M (y^i - w^T x^i)^2

Partial differentiation yields the analytical solution:

    w^* = (X^T X)^{-1} X^T y
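A minimal NumPy sketch of this analytical solution, on hypothetical data:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))                  # M=200 points, N=2 dimensions
    y = X @ np.array([0.7, -1.2]) + 0.05 * rng.normal(size=200)

    # w* = (X^T X)^{-1} X^T y; solve() avoids forming an explicit inverse.
    w_star = np.linalg.solve(X.T @ X, X.T @ y)
    print(w_star)                                  # ~ [0.7, -1.2]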
Regression through weighted Least Square

In standard linear regression, all points have equal weight:

    w^* = argmin_w Σ_{i=1}^M (y^i - w^T x^i)^2
Weighted least square gives each datapoint its own weight β^i (points in red in the figure have large weights):

    w^* = argmin_w Σ_{i=1}^M β^i (y^i - w^T x^i)^2
Assuming a set of weights β^1, ..., β^M for all datapoints, set B to be the diagonal matrix with entries β^1, ..., β^M. With the change of variables Z = √B X and v = √B y, minimizing the weighted cost gives

    ŷ(x) = ŵ^T x,  with  ŵ = (Z^T Z)^{-1} Z^T v.

Contrast this with the solution for un-weighted linear regression, w^* = (X^T X)^{-1} X^T y.
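A sketch of the weighted solution via this change of variables, with made-up weights:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(150, 2))
    y = X @ np.array([0.7, -1.2]) + 0.05 * rng.normal(size=150)
    beta = rng.uniform(0.1, 1.0, size=150)         # hypothetical per-point weights

    s = np.sqrt(beta)
    Z, v = X * s[:, None], y * s                   # Z = sqrt(B) X, v = sqrt(B) y
    w_hat = np.linalg.solve(Z.T @ Z, Z.T @ v)      # (Z^T Z)^{-1} Z^T v
    print(w_hat)                                   # equals (X^T B X)^{-1} X^T B y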
Regression through weighted Least Square,

    w^* = argmin_w Σ_{i=1}^M β^i (y^i - w^T x^i)^2,

assumes that a single linear dependency applies everywhere. This is not true for data sets with local dependencies. It would be useful to design a regression method that best estimates the linear dependencies locally.
The estimate ŷ(x) is determined through the local influence of each group of datapoints around the query point x:

    ŷ(x) = Σ_{i=1}^M β^i(x) y^i / Σ_{j=1}^M β^j(x),    β^i(x): weights, now a function of x

with the weights given by a kernel on the distance to the query point:

    β^i(x) = K(d(x, x^i)),  e.g.  K(d(x, x^i)) = e^{-d(x, x^i)}  with  d(x, x^i) = (x - x^i)^T (x - x^i).
This is model-free regression: there is no longer an explicit model of the form y = w^T x + b. The regression is computed anew at each query point and depends on the training points.
This estimate is in fact the optimal solution to a local cost function at the query point x,

    J(x) = min_ŷ Σ_{i=1}^M (ŷ - y^i)^2 K(d(x, x^i)):

setting the derivative with respect to ŷ to zero recovers ŷ(x) = Σ_{i=1}^M β^i(x) y^i / Σ_{j=1}^M β^j(x) with β^i(x) = K(d(x, x^i)).
Which training points? Which kernel?
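A sketch of the kernel-weighted estimate above; the Gaussian bandwidth h and the test data are illustrative choices, not prescribed by the slides:

    import numpy as np

    # y_hat(x) = sum_i beta^i(x) y^i / sum_j beta^j(x),
    # with beta^i(x) = exp(-||x - x^i||^2 / h^2) as the kernel weight.
    def weighted_estimate(x_query, X_train, y_train, h=0.5):
        d = np.sum((X_train - x_query) ** 2, axis=1) / h ** 2
        beta = np.exp(-d)
        return beta @ y_train / beta.sum()

    X_train = np.linspace(0.0, 2.0 * np.pi, 50)[:, None]
    y_train = np.sin(X_train).ravel()
    print(weighted_estimate(np.array([1.0]), X_train, y_train))  # ~ sin(1.0)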
[Figure: blue: true function; red: estimated function.]
Good prediction depends on the choice of datapoints.
The more datapoints, the better the fit, but the computational costs increase dramatically with the number of datapoints.
Several methods exist in ML for performing non-linear regression. They differ in the objective function and in the number of parameters:
- Gaussian Process Regression (GPR) uses all datapoints (model-free). (GPR is not covered in class and is not examined in the final exam.)
- Support Vector Regression (SVR) picks a subset of the datapoints (the support vectors).
- Gaussian Mixture Regression (GMR) generates a new set of datapoints (the centers of the Gaussian functions).
An estimate of the noise is important to measure the goodness of fit.
Support Vector Regression (SVR) assumes an estimate of the noise model (the ε-tube) and then computes f directly within that noise tolerance.
Gaussian Mixture Regression (GMR) builds a local estimate of the noise model through the variance of the system.
Given a training set {(x^i, y^i)}_{i=1,...,M}, how can the support vector machine framework for classification be generalized to estimate continuous functions?
1. Assume a non-linear mapping through feature space and then perform linear regression in feature space.
2. Supervised learning minimizes an error function, so we first determine a way to measure the error on the testing set in the linear case.
Assume a linear mapping f, s.t.

    y = f(x) = ⟨w, x⟩ + b.

How do we estimate w and b to best predict the pairs of training points {(x^i, y^i)}_{i=1,...,M}? We need a way to measure the error on prediction. (b is estimated, as in SVM, through least-square regression on the support vectors; hence we ignore it for the rest of the development.)
Set an upper bound ε on the error and consider as correctly classified all points such that

    |f(x^i) - y^i| ≤ ε.

Penalize only the datapoints that are not contained in the ε-tube.
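A one-line sketch of this ε-insensitive penalty (the function name is an illustrative choice):

    import numpy as np

    # Zero inside the eps-tube, linear penalty outside it.
    def eps_insensitive_loss(y_true, y_pred, eps=0.1):
        return np.maximum(0.0, np.abs(y_true - y_pred) - eps)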
The ε-margin is a measure of the width of the ε-insensitive tube, and hence of the precision of the regression. A small ||w|| corresponds to a small slope for f: in the linear case, f is more horizontal.
Conversely, a large ||w|| corresponds to a large slope for f: in the linear case, f is more vertical. The flatter the slope of the function f, the larger the ε-margin. To maximize the margin, we must therefore minimize the norm of w.
This can be rephrased as a constraint-based optimization problem:

    minimize    (1/2) ||w||^2
    subject to  y^i - ⟨w, x^i⟩ - b ≤ ε
                ⟨w, x^i⟩ + b - y^i ≤ ε,    i = 1, ..., M

We still need to penalize points outside the ε-insensitive tube.
To do so, introduce slack variables ξ_i, ξ_i^* ≥ 0:

    minimize    (1/2) ||w||^2 + (C/M) Σ_{i=1}^M (ξ_i + ξ_i^*)
    subject to  y^i - ⟨w, x^i⟩ - b ≤ ε + ξ_i
                ⟨w, x^i⟩ + b - y^i ≤ ε + ξ_i^*
                ξ_i, ξ_i^* ≥ 0
All points outside the ε-tube become support vectors. We now have the solution to the linear regression problem. How do we generalize this to the non-linear case?
Lift x into feature space and then perform linear regression in feature space:

    Linear case:      y = f(x) = ⟨w, x⟩ + b
    Non-linear case:  x → φ(x),  y = f(x) = ⟨w, φ(x)⟩ + b,  where w now lives in feature space!
To solve the constrained problem, build the Lagrangian (objective function plus multipliers times the constraints):

    L(w, b, ξ, ξ^*) = (1/2) ||w||^2 + (C/M) Σ_{i=1}^M (ξ_i + ξ_i^*)
                      - Σ_{i=1}^M α_i (ε + ξ_i - y^i + ⟨w, x^i⟩ + b)
                      - Σ_{i=1}^M α_i^* (ε + ξ_i^* + y^i - ⟨w, x^i⟩ - b)
                      - Σ_{i=1}^M (η_i ξ_i + η_i^* ξ_i^*)
The multipliers α_i and α_i^* correspond to the constraints on points lying on either side of the ε-tube. They are non-zero only for the points that do not satisfy the constraints, i.e. the points outside the ε-tube.
Requiring that the partial derivatives are all zero:

    ∂L/∂w = w - Σ_{i=1}^M (α_i - α_i^*) x^i = 0   ⇒   w = Σ_{i=1}^M (α_i - α_i^*) x^i,

a linear combination of the support vectors, and

    ∂L/∂b = Σ_{i=1}^M (α_i^* - α_i) = 0,

which rebalances the effect of the support vectors on both sides of the ε-tube.
Substituting back into the primal Lagrangian, we get the dual optimization problem:

    max_{α, α^*}  -(1/2) Σ_{i,j=1}^M (α_i - α_i^*)(α_j - α_j^*) ⟨x^i, x^j⟩ - ε Σ_{i=1}^M (α_i + α_i^*) + Σ_{i=1}^M y^i (α_i - α_i^*)
    subject to    Σ_{i=1}^M (α_i - α_i^*) = 0   and   α_i, α_i^* ∈ [0, C/M]

Kernel trick: replace the inner product with a kernel,

    k(x^i, x^j) = ⟨φ(x^i), φ(x^j)⟩.
The solution is given by:

    f(x) = Σ_{i=1}^M (α_i - α_i^*) k(x^i, x) + b

The linear coefficients (α_i - α_i^*) are the Lagrange multipliers for each constraint. If one uses an RBF kernel, f is a sum of M un-normalized isotropic Gaussians centered on each training datapoint.
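A sketch of evaluating this solution with an RBF kernel; the support vectors, coefficients and gamma below are made-up values for illustration (in practice they come from solving the dual):

    import numpy as np

    def rbf(a, b, gamma=10.0):
        return np.exp(-gamma * np.sum((a - b) ** 2))

    X_sv = np.array([[0.1], [0.4], [0.7]])     # hypothetical support vectors
    coef = np.array([1.5, -2.0, 0.8])          # alpha_i - alpha_i^*
    b = 0.2

    # f(x) = sum_i (alpha_i - alpha_i^*) k(x^i, x) + b
    def f(x):
        return sum(c * rbf(sv, x) for c, sv in zip(coef, X_sv)) + b

    print(f(np.array([0.5])))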
The kernel hence places a Gaussian function on each support vector.
The Lagrange multipliers define the importance of each Gaussian function.

[Figure: example fit y = f(x) with support vectors x^1, ..., x^6 and their Lagrange multipliers α_i, α_i^*; away from the support vectors, f(x) converges to b as their effect vanishes.]
The offset b can then be computed from any support vector x^j lying exactly on the boundary of the ε-tube (i.e. with 0 < α_j < C/M):

    b = y^j - Σ_{i=1}^M (α_i - α_i^*) k(x^i, x^j) - ε
The solution to SVR we just saw,

    minimize    (1/2) ||w||^2 + (C/M) Σ_{i=1}^M (ξ_i + ξ_i^*)
    subject to  y^i - ⟨w, x^i⟩ - b ≤ ε + ξ_i,   ⟨w, x^i⟩ + b - y^i ≤ ε + ξ_i^*,   ξ_i, ξ_i^* ≥ 0,

is referred to as ε-SVR. It has two hyperparameters: C controls the penalty term on a poor fit, and ε determines the minimal required precision.
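As a sketch, ε-SVR with these two hyperparameters in scikit-learn; the data are made up, and gamma (scikit-learn's RBF parameter, which scales inversely with the squared kernel width) is an assumed value:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(3)
    X = rng.uniform(0.0, 1.0, size=(80, 1))
    y = np.sin(2.0 * np.pi * X).ravel() + 0.05 * rng.normal(size=80)

    # C penalizes poor fits; epsilon sets the required precision (tube width).
    svr = SVR(kernel="rbf", C=100.0, epsilon=0.1, gamma=50.0)
    svr.fit(X, y)
    print(len(svr.support_))   # support vectors = points on or outside the eps-tube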
Effect of the RBF kernel width on the fit: here the fit uses C = 100, ε = 0.1, kernel width = 0.01.
With C = 100, ε = 0.01 and kernel width = 0.01, the fit overfits.
With C = 100, ε = 0.05 and kernel width = 0.01, the effect of the kernel width on the fit is reduced by this appropriate choice of hyperparameters.
Note that MLDemos does not display the support vectors if there is more than one point for the same x!
In summary: linear regression can be solved through least-square estimation and yields an optimal analytical solution. Weighted regression offers the possibility to perform a local regression and also yields an optimal analytical solution; the estimate is no longer global and is computed around each group of datapoints. Support Vector Regression performs regression on a non-linear function by lifting the data into feature space; its estimate is globally optimal.
Next: examples of applications of SVR.
Application: build a model of the dynamics using Support Vector Regression, compute its derivative in closed form, and use the model in an Extended Kalman Filter for real-time tracking. The learned dynamics take the form

    ẋ = f(x) = Σ_{i=1}^M (α_i - α_i^*) k(x^i, x) + b.
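A sketch of that closed-form derivative, assuming an RBF kernel k(x^i, x) = exp(-gamma ||x - x^i||^2); the function name and gamma are illustrative:

    import numpy as np

    # f(x) = sum_i c_i exp(-gamma ||x - x^i||^2) + b, with c_i = alpha_i - alpha_i^*.
    # Its gradient, df/dx = sum_i c_i (-2 gamma)(x - x^i) exp(-gamma ||x - x^i||^2),
    # supplies the Jacobian needed by the Extended Kalman Filter.
    def svr_value_and_grad(x, X_sv, coef, b, gamma=10.0):
        diff = x - X_sv                                 # shape (M, N)
        k = np.exp(-gamma * np.sum(diff ** 2, axis=1))  # k(x^i, x)
        value = coef @ k + b
        grad = -2.0 * gamma * ((coef * k) @ diff)
        return value, grad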
Lorenzo Piccardi, Jean-Baptiste Keller, Martin Duvanel, Olivier Barbey, Karim Benmachiche, Dario Poggiali, Dave Bergomi, Basilio Noris
96° x 96° field of view, 25Hz / 50Hz, 180g
www.pomelo-technologies.com
Image Understanding, 2011
We normalize the images through high-pass filtering and collect images of the eyes together with the directions of the gaze. We then learn the mapping Eye Image → Position in Image through Support Vector Regression (SVR).
Different elements give different cues: the pupil, iris and cornea; the wrinkles, eyelids and eyelashes. These cues are fed to the Support Vector Regression.
www.pomelo-technologies.com
Gaze tracking uses SVR; object detection uses SVM.