MACHINE LEARNING Overview

Oral Presentations of Projects
The presentations start at 9h15 and last until 12h30.
Exam Format

The exam lasts a total of 40 minutes:
- Upon entering the room, you pick questions at random.
- You answer the questions you have picked.
- When needed, make a schematic or prepare an example.

The exam is closed book, but you can bring one A4 page with personal notes written recto-verso.
Exam questions will entail two parts: one conceptual and one algorithmic. For example:
i. Explain SVM and give an example in which it could be applied.
ii. Discuss the different terms in the objective function of SVM.
Exam questions may also tackle fundamental topics of ML
This overview is meant to cover solely some of the key concepts that we expect to be known, and to highlight similarities and differences across the different methods presented in class. Exam material encompasses everything presented in class (see the class website).
Fundamentals:
- Formalism
- Taxonomy
- Principles of evaluation
To assess the validity of a Machine Learning algorithm, one measures its performance against the training, validation and testing sets. These sets are built by partitioning the data set at hand into a Training Set, a Validation Set and a Testing Set; crossvalidation rotates this partition.

N-fold crossvalidation: a typical choice is 10-fold crossvalidation.
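As an illustration, here is a minimal sketch of 10-fold crossvalidation with scikit-learn; the dataset and classifier are placeholders, not the ones used in class:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data and classifier for the sake of the example.
X, y = load_iris(return_X_y=True)
clf = SVC(kernel="rbf", C=1.0)

# cv=10 partitions the data into 10 folds: train on 9, test on the
# held-out fold, and rotate so that each fold serves once as test set.
scores = cross_val_score(clf, X, y, cv=10)
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```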
Mathematical notions of probability distribution function, cumulative distribution function, marginal, maximum likelihood, MAP, etc.
Joint density $p(X, Y)$ of two Gaussian variables: the covariance matrix decomposes as $\Sigma = V \Lambda V^T$, where the columns of $V$ are the eigenvectors and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2)$.

The lengths of the ellipse's axes are equal to $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$, with $\lambda_1 \ge \lambda_2$ (the 1st eigenvector points along the direction of maximal variance, the 2nd eigenvector is orthogonal to it). Each contour line corresponds to a multiple of the standard deviation along the eigenvectors.
The conditional and marginal of a multi-dimensional Gaussian distribution are also Gaussians.
Kernel Methods: determine a metric which brings out features of the data so as to make subsequent computation easier.

[Figure: data in the original space (axes x1, x2) vs. after lifting the data into feature space; the data becomes linearly separable when using an RBF kernel and projecting onto the first 2 PC of kernel PCA.]
Kernel Methods in Machine Learning

Kernel methods are based on the observation that the associated linear method relies on computing an inner product across variables. This inner product can be replaced by the kernel function, if known; the problem then becomes linear in feature space.

$$k: X \times X \to \mathbb{R}, \qquad k(x_i, x_j) = \left\langle \phi(x_i), \phi(x_j) \right\rangle$$

The kernel is a metric of similarity across datapoints.
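A minimal numpy sketch of such a similarity metric, assuming the common RBF (Gaussian) kernel with a hand-picked width sigma:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)), computed for every
    # pair of rows via the expansion ||a - b||^2 = a.a + b.b - 2 a.b.
    d2 = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-d2 / (2.0 * sigma**2))

X = np.random.randn(5, 3)   # 5 datapoints in 3 dimensions
K = rbf_kernel(X, X)        # 5x5 Gram matrix of pairwise similarities
print(K.shape, np.allclose(np.diag(K), 1.0))  # self-similarity is 1
```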
For each algorithm, be able to explain:
- what it can do: classification, regression, structure discovery / reduction of dimensionality
- what one should be careful about (limitations of the algorithm, choice of hyperparameters)
- the key steps of the algorithm, its hyperparameters, the variables it takes as input and the variables it outputs
Example: SVM
- what it can do: performs binary classification; can be extended to multi-class classification; can be extended to regression (SVR)
- what one should be careful about: e.g. choice of kernel; too small a kernel width in Gaussian kernels may lead to overfitting; one can proceed to iterative estimation of the kernel parameters
- the key steps of the algorithm, its hyperparameters, the variables it takes as input and the variables it outputs

In red: what you should know; in blue: what would be good to know / bonus.
This class has presented groups of methods for doing classification, regression, structure discovery, and estimation of time series. Note that several algorithms do more than one of these types of computation.

Classification / Clustering: Kernel K-means, GMM, Decision Trees + boosting/bagging, SVM
Regression: SVR, GMR, GPR
Structure Discovery: Linear / Kernel PCA, CCA
Time Series: RL
Structure discovery methods (seen in class): which to use when? Pros and cons.
PCA is a linear mapping $y = Ax$, $A \in \mathbb{R}^{q \times N}$ with $q \le N$: a reduction of the dimensionality. The 1st axis is aligned with the maximal variance and determines the correlation across dimensions of the variables! The 2nd, 3rd, ... axes are orthogonal! All projections are uncorrelated!

[Figure: raw 2D dataset vs. the same data projected onto the two first principal components.]
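A minimal numpy sketch of this mapping through eigendecomposition of the covariance matrix (function and variable names are mine):

```python
import numpy as np

def pca(X, q):
    # Center the data, then diagonalize its covariance matrix.
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigval)[::-1]      # largest variance first
    A = eigvec[:, order[:q]]              # N x q projection matrix
    return Xc @ A, A

X = np.random.randn(200, 5)
Y, A = pca(X, q=2)
# Off-diagonal terms are ~0: the projections are uncorrelated.
print(np.round(np.cov(Y, rowvar=False), 3))
```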
Pros:
- optimal linear projection (minimum error at reconstruction)
Cons:
- only decorrelates the data; separating independent sources requires statistical independence (ICA; NOT COVERED IN CLASS!)

PCA remains a very powerful method; worth trying it out on your data before using any other method!
kPCA differs from PCA: the eigenvectors are M-dimensional (M = the number of datapoints). Projecting onto the eigenvectors after kPCA finds structure in the data.

[Figure: circles and elliptic contour lines with RBF kernels.]
[Figure: hyperbolas and intersecting lines when using a polynomial kernel.]
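For reference, both settings can be reproduced with scikit-learn's KernelPCA; the gamma and degree values below are arbitrary choices, not those from the slides:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# RBF kernel: project onto the first 2 principal components in feature space.
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

# Polynomial kernel of degree 2: yields the conic-section-like projections.
Z_poly = KernelPCA(n_components=2, kernel="poly", degree=2).fit_transform(X)
```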
CCA: given two views of the same data, e.g. a video description $x \in \mathbb{R}^N$ and an audio description $y \in \mathbb{R}^P$, extract the hidden structure that maximizes correlation across the two projections:

$$\left(w_x^*, w_y^*\right) = \arg\max_{w_x, w_y} \; \mathrm{corr}\left(w_x^T x, \; w_y^T y\right)$$
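A small sketch with scikit-learn's CCA on two synthetic "views" sharing one latent signal (stand-ins for the video and audio descriptions above):

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.standard_normal((500, 1))      # shared hidden structure
X = latent @ rng.standard_normal((1, 6)) + 0.1 * rng.standard_normal((500, 6))
Y = latent @ rng.standard_normal((1, 4)) + 0.1 * rng.standard_normal((500, 4))

cca = CCA(n_components=1)
Xc, Yc = cca.fit_transform(X, Y)            # the projections w_x^T x, w_y^T y
print(np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1])  # correlation close to 1
```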
Classification methods (seen in class): which to use when? Pros and cons.
Find the separating plane with maximal margin between the class with label y = -1 and the class with label y = +1 (axes x1, x2). This is a constraint-based optimization; a convex problem, hence a global optimum, but not a unique solution!

$$\min_{w, b} \; \frac{1}{2}\left\|w\right\|^2 \quad \text{s.t.} \quad y_i\left(\left\langle w, x_i \right\rangle + b\right) \ge 1, \quad i = 1, 2, \ldots, M$$

i.e. $\langle w, x_i \rangle + b \ge +1$ when $y_i = +1$, and $\langle w, x_i \rangle + b \le -1$ when $y_i = -1$.
Dual formulation:

$$\max_{\alpha} \; \sum_{i=1}^{M} \alpha_i \; - \; \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i \alpha_j y_i y_j \left\langle x_i, x_j \right\rangle \quad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{M} \alpha_i y_i = 0$$

Non-linear separation is achieved using the kernel trick: replace the inner product $\langle x_i, x_j \rangle$ by $k(x_i, x_j)$.
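A minimal usage sketch; scikit-learn's SVC solves this dual, and the data and hyperparameters below are arbitrary:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# C bounds the alphas (soft margin); gamma is the RBF kernel width.
clf = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)

# Only the support vectors (alpha_i > 0) enter the decision function.
print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
```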
Multi-class classification, e.g. K = 3 classes (children, female adults, male adults): train one classifier $f_k$, $k = 1, \ldots, K$, per class, and label a point according to

$$y = \arg\max_{k = 1, \ldots, K} \; f_k(x)$$

It is sufficient to compute only K-1 classifiers for K classes, but computing the K-th classifier may provide tighter bounds on the K-th class.
How does it work? What does it learn?
- It learns classifiers on each class.
Why is it good?
- Good generalization; can compete with complex classifiers (SVM).

[Figure: decision boundaries with 8 mixtures per class.]
How does it work? What does it learn? Why is it good?
- Good generalization, even in high dimension.

[Figure: decision boundaries with 2 mixtures per class.]
How does it work? What does it learn? Why is it good?
- Combining several linear units is a way to get a non-linear classifier.
- It learns hyperplanes that are "combined" together (in the hidden layer).

[Figure: network with neuron inputs x, hidden-layer units n = 1, ..., 4, and output y.]
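A sketch of such a network with scikit-learn, using 4 hidden neurons as in the figure; the dataset and settings are placeholders:

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# One hidden layer of 4 units: each learns a hyperplane (plus squashing);
# the output layer combines them into a non-linear decision boundary.
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    max_iter=5000, random_state=0).fit(X, y)
print("training accuracy:", mlp.score(X, y))
```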
[Figure: training vs. testing performance of SVM, GMM, MLP, GP, Boosting, KNN, RVM, Bagging and RANSAC.]

WARNING: most of these algorithms require a certain amount of tweaking of the hyperparameters to get optimal results.
Several criteria (application-dependent):
- the cost of misclassifying the positive versus the negative class
- the need for a measure of confidence (rendered by the likelihood; not always available)
- does the linear method already give good results? Then, yes, it may be a good idea to run a non-linear version of the method (PCA vs. kPCA, linear SVM vs. kernel SVM)
Regression methods (seen in class): which to use when? Pros and cons.
Deterministic regressive model: $y = f(x)$, $x \in \mathbb{R}^N$.

Probabilistic regressive model: $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Build an estimate of the noise model and then compute f directly (Support Vector Regression).
Alternatively, build a probabilistic estimate of the nonlinear relationship between y and x through the conditional density $p(y \mid x)$ (this estimates the noise model and f), and then compute the estimate by taking the expectation over the conditional density:

$$\hat{y} = E\left\{p(y \mid x)\right\}$$

Gaussian Mixture Regression (GMR) computes first p(x, y) and then derives p(y|x). Gaussian Process Regression (GPR) computes directly p(y|x).
SVR, GMR and GPR are based on the same probabilistic regressive model, but they do not optimize the same objective function and hence find different solutions. Each is ensured to find the optimal estimate of its own objective, but not a unique solution.
1) Estimate the joint density, p(x, y), across pairs of datapoints using GMM:

$$p(x, y) = \sum_{i=1}^{K} \pi_i \, p_i(x, y), \quad \text{with} \quad p_i(x, y) = \mathcal{N}\left((x, y); \, \mu^i, \Sigma^i\right)$$

$\mu^i, \Sigma^i$: mean and covariance matrix of Gaussian $i$.

[Figure: 2D projection of a Gauss function; the ellipse contour is ~2 std deviations.]
The parameters are learned through Expectation-Maximization (EM), an iterative procedure that starts from a random initialization.
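As a sketch, scikit-learn's GaussianMixture runs this EM procedure internally; here it fits p(x, y) on synthetic pairs (the data and the choice K = 5 are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0 * np.pi, 400)
y = np.sin(x) + 0.1 * rng.standard_normal(400)
XY = np.column_stack([x, y])        # each row is one (x, y) datapoint

# K = 5 Gaussians; n_init restarts EM from several initializations.
gmm = GaussianMixture(n_components=5, covariance_type="full",
                      n_init=3, random_state=0).fit(XY)
print(gmm.weights_)                 # mixing coefficients pi_i
print(gmm.means_.shape, gmm.covariances_.shape)  # mu^i and Sigma^i
```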
Mixing coefficients: $\pi_i$ is the probability that the M datapoints were generated by Gaussian $i$:

$$\pi_i = \frac{1}{M} \sum_{j=1}^{M} p\left(i \mid x_j\right), \qquad \sum_{i=1}^{K} \pi_i = 1$$
2) Compute the regressive signal by taking p(y|x):

$$p(y \mid x) = \sum_{i=1}^{K} w_i(x) \, p_i(y \mid x), \quad \text{with} \quad w_i(x) = \frac{\pi_i \, \mathcal{N}\left(x; \mu_x^i, \Sigma_{xx}^i\right)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}\left(x; \mu_x^j, \Sigma_{xx}^j\right)}$$

Each conditional $p_i(y \mid x)$ is a Gauss function; the variance changes depending on the query point.
The influence of each marginal is modulated by the weight $w_i(x)$, which depends on the query point $x$.
3) The regressive signal is then obtained by computing E{p(y|x)}:

$$E\left\{p(y \mid x)\right\} = \sum_{i=1}^{K} w_i(x) \left( \mu_y^i + \Sigma_{yx}^i \left(\Sigma_{xx}^i\right)^{-1} \left(x - \mu_x^i\right) \right)$$

This is a linear combination of K local regressive models.
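A numpy sketch of this expectation for scalar x and y, reusing the gmm fitted in the GMM sketch above; the helper name gmr_predict is mine:

```python
import numpy as np
from scipy.stats import norm

def gmr_predict(gmm, x_query):
    mu, cov, pi = gmm.means_, gmm.covariances_, gmm.weights_
    K = len(pi)
    # Weights w_i(x): pi_i N(x; mu_x^i, Sigma_xx^i), normalized over i.
    w = np.array([pi[i] * norm.pdf(x_query, mu[i, 0], np.sqrt(cov[i, 0, 0]))
                  for i in range(K)])
    w /= w.sum()
    # Local linear models: mu_y^i + Sigma_yx^i (Sigma_xx^i)^-1 (x - mu_x^i).
    local = np.array([mu[i, 1] + cov[i, 1, 0] / cov[i, 0, 0]
                      * (x_query - mu[i, 0]) for i in range(K)])
    return float(w @ local)         # E{p(y|x)}: weighted combination

print(gmr_predict(gmm, 2.0))        # regressive estimate at x = 2
```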
Computing the variance var{p(y|x)} provides information on the uncertainty of the prediction computed from the conditional distribution:

$$\mathrm{var}\left\{p(y \mid x)\right\} = \sum_{i=1}^{K} w_i(x)^2 \left( \Sigma_{yy}^i - \Sigma_{yx}^i \left(\Sigma_{xx}^i\right)^{-1} \Sigma_{xy}^i \right)$$

Careful: this is not the uncertainty of the model. Use the likelihood to compute the uncertainty of the predictor!
[Figure: regressive signal $E\{p(y \mid x)\}$ with the uncertainty envelope $\mathrm{var}\{p(y \mid x)\}$ around it.]
Recall the probabilistic regressive model: $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, $x \in \mathbb{R}^N$. Build a probabilistic estimate of the nonlinear relationship between y and x through the conditional density $p(y \mid x)$ (this estimates the noise model and f), and then compute the estimate by taking the expectation over the conditional density.
A signal y can be estimated through regression y = f(x) by taking the expectation over the conditional probability $p(y \mid x)$, for a choice of parameters of $p$. The simplest way to estimate p(y|x) is through Probabilistic Regression (PR), which estimates a linear regressive model.
PR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables y and x by building a linear model:

$$f(x) = w^T x, \quad x \in \mathbb{R}^N$$

If one assumes that the observed values of y differ from f(x) by an additive noise that follows a zero-mean Gaussian distribution (such an assumption amounts to putting a prior distribution over the noise), then:

$$y = w^T x + \epsilon, \quad \epsilon \sim \mathcal{N}\left(0, \sigma^2\right)$$
Likelihood of the M training datapoints:

$$p(y \mid X, w) = \prod_{i=1}^{M} p\left(y^i \mid x^i, w\right) = \prod_{i=1}^{M} \mathcal{N}\left(y^i; \, w^T x^i, \, \sigma^2\right)$$

Parameters of the model: $w$ (and the noise variance $\sigma^2$).
Prior model on the distribution of the parameter w:

$$p(w) = \mathcal{N}\left(0, \Sigma_w\right) \propto \exp\left(-\frac{1}{2} \, w^T \Sigma_w^{-1} w\right)$$

The hyperparameters ($\Sigma_w$ and the noise variance $\sigma^2$) are given by the user.
Posterior predictive distribution at a testing point $x^*$, given the training datapoints $(X, y)$:

$$p\left(y^* \mid x^*, X, y\right) = \mathcal{N}\left(\frac{1}{\sigma^2} \left(x^*\right)^T A^{-1} X y, \;\; \left(x^*\right)^T A^{-1} x^*\right), \quad \text{with} \quad A = \frac{1}{\sigma^2} X X^T + \Sigma_w^{-1}$$
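A numpy sketch of these two formulas; the dimensions, prior and noise level below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 2, 50
w_true = np.array([1.5, -0.7])
X = rng.standard_normal((N, M))          # training inputs, one per column
y = w_true @ X + 0.1 * rng.standard_normal(M)

sigma2, Sigma_w = 0.01, np.eye(N)        # hyperparameters given by the user
A = X @ X.T / sigma2 + np.linalg.inv(Sigma_w)
A_inv = np.linalg.inv(A)

x_star = np.array([0.5, 2.0])            # testing point
mean = x_star @ A_inv @ X @ y / sigma2   # predictive mean
var = x_star @ A_inv @ x_star            # predictive variance
print(mean, var, w_true @ x_star)        # mean should be close to w^T x*
```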
How to extend the simple linear Bayesian regressive model for nonlinear regression, such that the non-linear problem becomes linear again? Lift the data into a feature space through a non-linear map $\phi$ and build the linear model there:

$$y = w^T \phi(x) + \epsilon, \quad \epsilon \sim \mathcal{N}\left(0, \sigma^2\right)$$
Non-linear transformation: with $\Phi = \left[\phi(x^1), \ldots, \phi(x^M)\right]$, the predictive distribution keeps the same form:

$$p\left(y^* \mid x^*, X, y\right) = \mathcal{N}\left(\frac{1}{\sigma^2} \, \phi(x^*)^T A^{-1} \Phi \, y, \;\; \phi(x^*)^T A^{-1} \phi(x^*)\right), \quad \text{with} \quad A = \frac{1}{\sigma^2} \Phi \Phi^T + \Sigma_w^{-1}$$
Rewriting the predictive distribution, every term of the form $\phi(x)^T \Sigma_w \phi(x')$ is an inner product in feature space. Take it as kernel:

$$k\left(x, x'\right) = \phi(x)^T \Sigma_w \, \phi\left(x'\right)$$

The predictive mean then becomes a kernel expansion over the M training datapoints:

$$\hat{y}\left(x^*\right) = \sum_{i=1}^{M} \alpha_i \, k\left(x^i, x^*\right)$$
All datapoints are used in the computation!
The kernel and its hyperparameters are given by the user. They can be optimized by maximizing the marginal likelihood, i.e. $p(y \mid X; \text{parameters})$.
This course covered a variety of topics that are core to Machine Learning. It gives you the basis to go and read recent advances in each of these topics. We hope that you will find this material useful and that you will use some of these methods in your own work. If you do so, drop us a note, and we would be glad to include your application in future lectures as an example!