MACHINE LEARNING Overview


SLIDE 1

MACHINE LEARNING Overview

SLIDE 2

Oral Presentations of Projects

Start at 9:15 am and last until 12:30 pm.

SLIDE 3

Exam Format

The exam lasts a total of 40 minutes:

  • Upon entering the room, you pick 3 questions at random. You can leave one out!
  • Spend 20 minutes in the back of the room preparing answers to the two questions you have picked. When needed, make a schematic or prepare an example.
  • Present your answers for 20 minutes on the blackboard.

The exam is closed book, but you can bring one A4 page with personal notes written on both sides (recto-verso).

SLIDE 4

Example of exam question - I

Exam questions will entail two parts: one conceptual and one algorithmic.

i. Explain SVM and give an example in which it could be applied.
ii. Discuss the different terms in the objective function of SVM.

SLIDE 5

Example of exam question - II

Exam questions will entail two parts: one conceptual and one algorithmic

i. What are the pros and cons of GPR compared to SVR?
ii. How can we derive GPR from linear probabilistic regression?
SLIDE 6

Example of exam question - III

Exam questions may also tackle fundamental topics of ML

i. Give the formal definition of a pdf.
ii. What is good evaluation practice in ML?
SLIDE 7

Class Overview

This overview covers only some of the key concepts that we expect to be known, and highlights similarities and differences across the methods presented in class. Exam material encompasses:

  • The lecture notes (selected chapters/sections highlighted on the website)
  • Slides
  • Solutions to the exercises
  • Solutions to the practicals.
SLIDE 8

Formalism:

  • Be capable of giving the formal definitions of a pdf, a marginal, a likelihood
  • Be capable of stating the principle of each ML algorithm seen in class

Taxonomy:

  • Know the difference between supervised, unsupervised and reinforcement learning
  • Be able to discuss concepts such as generative vs. discriminative methods

Principles of evaluation:

  • Know the basic principles of evaluation of ML techniques:
  • training vs. testing sets,
  • crossvalidation,
  • ground truth.

Basic Concepts in ML

SLIDE 9

Basic Concepts in ML

To assess the validity of a Machine Learning algorithm, one measures its performance against the training, validation and testing sets. These sets are built by partitioning the data set at hand.

[Diagram: the data set is partitioned into a Training Set, a Validation Set and a Testing Set; crossvalidation is performed over these partitions.]

N-fold crossvalidation: a typical choice is 10-fold crossvalidation.
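As an illustration (ours, not from the slides), a minimal 10-fold cross-validation run with scikit-learn on a synthetic dataset; the classifier and all parameter values are arbitrary choices:

# Hypothetical example: 10-fold cross-validation of a classifier on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = SVC(kernel="rbf", C=1.0)

# cv=10 -> 10-fold cross-validation: train on 9 folds, test on the held-out fold.
scores = cross_val_score(clf, X, y, cv=10)
print("per-fold accuracy:", scores)
print("mean accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))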

SLIDE 10

Mathematical notions of probability density function (pdf), cumulative distribution function, marginal, maximum likelihood, MAP, etc.

PDF: $p(x) \ge 0$ and $\int p(x)\,dx = 1$

CDF: $D(x) = \int_{-\infty}^{x} p(x')\,dx'$

Marginal probability of $x$ given the joint distribution: $p(x) = \int p(x, y)\,dy$

Likelihood function: $L(\theta \mid x, y) = p(x, y \mid \theta)$

Maximum likelihood: $\hat{\theta} = \arg\max_{\theta} L(\theta \mid x, y)$

Maximum a posteriori: $\hat{\theta} = \arg\max_{\theta} p(\theta \mid x)$

Basic Concepts in ML

SLIDE 11

Joint density of two variables $X$ and $Y$:

$p(X, Y) = \mathcal{N}(X, Y \mid \mu, \Sigma)$, with $\mu$: 2-dimensional mean vector, $\Sigma$: $2 \times 2$ covariance matrix.

The eigendecomposition $\Sigma = V \Lambda V^T$ gives the 1st and 2nd eigenvectors. The lengths of the ellipse's axes are equal to $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$, the square roots of the eigenvalues of $\Sigma$. Each contour line corresponds to a multiple of the standard deviation along the eigenvectors.

Basic Concepts in ML
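To make the geometry concrete, a small numpy sketch (ours, not from the slides; the covariance values are made up) that extracts the eigenvectors and the ellipse axis lengths:

# Hypothetical example: eigendecomposition of a 2x2 covariance matrix.
# The ellipse axes of the contour lines lie along the eigenvectors and have
# lengths proportional to the square roots of the eigenvalues.
import numpy as np

Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])             # example 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # columns of eigvecs are eigenvectors

order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("1st eigenvector:", eigvecs[:, 0], " axis length ~", np.sqrt(eigvals[0]))
print("2nd eigenvector:", eigvecs[:, 1], " axis length ~", np.sqrt(eigvals[1]))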

SLIDE 12

The conditional and marginal of a multi-dimensional Gaussian distribution are also Gaussians.

Basic Concepts in ML
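For reference (a standard textbook result, not written out on the slide), with the usual block partition of a joint Gaussian:

$\begin{pmatrix} x_a \\ x_b \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \right)$

the marginal and the conditional are again Gaussian:

$p(x_a) = \mathcal{N}\left(x_a \mid \mu_a, \Sigma_{aa}\right), \qquad p(x_a \mid x_b) = \mathcal{N}\left( x_a \mid \mu_a + \Sigma_{ab} \Sigma_{bb}^{-1} (x_b - \mu_b), \;\; \Sigma_{aa} - \Sigma_{ab} \Sigma_{bb}^{-1} \Sigma_{ba} \right)$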

SLIDE 13

Kernel Methods: Determine a metric which brings out features of the data so as to make subsequent computation easier

[Figure: left, the data in the original space (axes x1, x2); right, after lifting the data into feature space, the data becomes linearly separable when using an RBF kernel and projecting onto the first 2 principal components of kernel PCA.]

Basic Concepts in ML
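A minimal sketch of this idea (ours, not from the slides), assuming scikit-learn and a synthetic concentric-circles dataset; the kernel width (gamma) is an arbitrary illustrative choice:

# Hypothetical example: kernel PCA with an RBF kernel makes circle data separable.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=5.0)
Z = kpca.fit_transform(X)   # project onto the first 2 kernel principal components

# A linear classifier is near chance level in the original space, and typically
# much better after lifting the data into the kernel PCA space.
print("linear SVM, original space  :", LinearSVC(max_iter=10000).fit(X, y).score(X, y))
print("linear SVM, kernel PCA space:", LinearSVC(max_iter=10000).fit(Z, y).score(Z, y))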

SLIDE 14

Kernel Methods in Machine Learning

  • allow modeling of non-linear relationships across the data
  • exploit the Kernel Trick:

The kernel trick is based on the observation that the associated linear method relies on computing an inner product across variables. This inner product can be replaced by the kernel function, when known. The problem then becomes linear in feature space.

$k: X \times X \to \mathbb{R}, \qquad k(x_i, x_j) = \left\langle \phi(x_i), \phi(x_j) \right\rangle$

The kernel is a metric of similarity across datapoints.

Basic Concepts in ML
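As a small check of the identity above (ours, not from the slides): for the homogeneous polynomial kernel of degree 2 in 2D, the kernel value equals an explicit inner product in feature space; phi below is the standard feature map for this kernel:

# Hypothetical example: kernel trick for k(x, x') = (x . x')^2 in 2D.
import numpy as np

def phi(x):
    # Explicit feature map for the degree-2 homogeneous polynomial kernel.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xp):
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(k(x, xp), np.dot(phi(x), phi(xp)))   # both print 30.25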

SLIDE 15

For each algorithm, be able to explain:
– what it can do: classification, regression, structure discovery / reduction of dimensionality
– what one should be careful about (limitations of the algorithm, choice of hyperparameters) and how this choice influences the results
– the key steps of the algorithm, its hyperparameters, the variables it takes as input and the variables it outputs

Preparation for the Exam

SLIDE 16

  • For each algorithm, be able to explain:

SVM

– what it can do: classification, regression, structure discovery / reduction of dimensionality
Performs binary classification; can be extended to multi-class classification; can be extended to regression (SVR).
– what one should be careful about (limitations of the algorithm, choice of hyperparameters)
e.g. choice of kernel; a too small kernel width in Gaussian kernels may lead to overfitting; one can proceed to iterative estimation of the kernel parameters.
– the key steps of the algorithm, its hyperparameters, the variables it takes as input and the variables it outputs

In red: what you should know; in blue, what would be good to know / bonus.

Preparation for the Exam: example
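To illustrate the kernel-width remark above (our example, not from the slides): in scikit-learn the RBF width is controlled through gamma (large gamma means small width), and a very small width typically overfits; the dataset and values are arbitrary:

# Hypothetical example: effect of the RBF kernel width on SVM train/test accuracy.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in [0.1, 1.0, 100.0]:            # large gamma == small kernel width
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_tr, y_tr)
    print("gamma=%6.1f  train=%.2f  test=%.2f"
          % (gamma, clf.score(X_tr, y_tr), clf.score(X_te, y_te)))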

SLIDE 17

Class Overview

This class has presented groups of methods for doing classification, regression, structure discovery, estimation of time series. Note that several algorithms do more than one of these types of computation.

Classification / Clustering: Kernel K-means, GMM, Decision Trees + boosting/bagging, SVM

Regression: SVR, GMR, GPR

Structure Discovery: Linear / Kernel PCA, CCA

Time Series: RL

SLIDE 18

Topics Requested

  • Comparison between PCA, CCA and kernel PCA (ICA: not covered in class); which to use when?
  • SVM, Boosting (Neural Networks: not covered in class): pros and cons
  • GMR and probabilistic regression
SLIDE 19

PCA

Linear mapping; reduction of the dimensionality. The 1st axis is aligned with the direction of maximal variance and determines the correlation across the dimensions of the variables; the 2nd, 3rd, ... axes are orthogonal → all projections are uncorrelated!

$A: \mathbb{R}^N \to \mathbb{R}^q, \quad q \le N, \qquad Y = A X$

[Figure: raw 2D dataset (left) and the same data projected onto the first two principal components (right).]
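A minimal sketch (ours, not from the slides) with scikit-learn on synthetic correlated 2D data; the mixing matrix is an arbitrary choice:

# Hypothetical example: PCA on correlated 2D data; projections are uncorrelated.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0],
                                          [1.5, 0.5]])   # correlated dataset

pca = PCA(n_components=2)
Y = pca.fit_transform(X)                  # projection Y = A (X - mean)

print("principal axes (rows of A):\n", pca.components_)
print("explained variance:", pca.explained_variance_)
print("correlation between projections (~0):", np.corrcoef(Y.T)[0, 1])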

SLIDE 20

PCA

Pros:

  • Easy to implement (batch and incremental versions possible)
  • Gives easy-to-interpret projections of the data
  • Extracts the main correlations across the data
  • Optimal reduction of dimensionality (loses minimum information; minimum error at reconstruction)

Cons:

  • Extracts only linear correlations → kernel PCA
  • Very sensitive to noise (outliers) → probabilistic PCA
  • Cannot deal with incomplete data → probabilistic PCA
  • Forces the projection to be orthogonal and to decorrelate the data → ICA (requires statistical independence) – ICA NOT COVERED IN CLASS!

PCA remains a very powerful method; worth trying it out on your data before using any other method!

SLIDE 21

Kernel PCA

kPCA differs from PCA: the eigenvectors are M-dimensional (M = number of datapoints). Projecting onto the eigenvectors after kPCA finds structure in the data.

Circles and elliptic contour lines with RBF kernels

SLIDE 22

Kernel PCA

Hyperbolas and intersecting lines when using a polynomial kernel

kPCA differs from PCA: the eigenvectors are M-dimensional (M = number of datapoints). Projecting onto the eigenvectors after kPCA finds structure in the data.

SLIDE 23

CCA

Two descriptions of the same datapoints, e.g. a video description $x \in \mathbb{R}^N$ and an audio description $y \in \mathbb{R}^P$. CCA finds pairs of projections $(w_x^1, w_y^1), (w_x^2, w_y^2), \ldots$ that solve

$\max_{w_x, w_y} \; \operatorname{corr}\left( w_x^T x, \; w_y^T y \right)$

It extracts hidden structure that maximizes the correlation across the two different projections.
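A minimal sketch (ours, not from the slides) with scikit-learn on two synthetic "views" sharing a hidden latent variable; all sizes and noise levels are arbitrary:

# Hypothetical example: CCA recovers the shared structure between two views.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))                         # shared hidden structure
X = np.hstack([latent + 0.1 * rng.normal(size=(300, 1)),   # "video" view (2D)
               rng.normal(size=(300, 1))])
Y = np.hstack([rng.normal(size=(300, 1)),
               latent + 0.1 * rng.normal(size=(300, 1))])  # "audio" view (2D)

cca = CCA(n_components=1)
Xc, Yc = cca.fit_transform(X, Y)          # projections w_x^T x and w_y^T y
print("canonical correlation:", np.corrcoef(Xc[:, 0], Yc[:, 0])[0, 1])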

SLIDE 24

PCA versus CCA (see solutions exercise 2)

SLIDE 25

Topics Requested

  • Comparison between PCA, CCA and kernel PCA (ICA: not covered in class); which to use when?
  • SVM, Boosting (Neural Networks: not covered in class): pros and cons
  • GMR and probabilistic regression
SLIDE 26

SVM

[Figure: two classes, with labels y = -1 and y = +1, in a 2D space (axes x1, x2).]

$\min_{w, b} \; \frac{1}{2} \|w\|^2 \qquad \text{subject to } \; y^i \left( \langle w, x^i \rangle + b \right) \ge 1, \quad i = 1, 2, \ldots, M$

with $y^i = +1$ when $\langle w, x^i \rangle + b \ge 1$ and $y^i = -1$ when $\langle w, x^i \rangle + b \le -1$.

Constraint-based optimization. Convex problem → global optimum, but not a unique solution! Finds the separating plane with maximal margin.
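A toy sketch (ours, not from the slides) of the maximal-margin plane on a small separable dataset; the points and the large C value are arbitrary choices:

# Hypothetical example: linear SVM, separating plane w.x + b = 0, margin 2/||w||.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],          # class y = +1
              [-1.0, -1.0], [-2.0, -1.5], [-1.5, -2.0]])   # class y = -1
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard-margin behaviour
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)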

SLIDE 27

SVM

The decision function is expressed in terms of the training datapoints:

$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{M} \alpha_i\, y^i \left\langle x, x^i \right\rangle + b \right)$

Replacing the inner product $\langle x, x^i \rangle$ by a kernel $k(x, x^i)$ gives

$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{M} \alpha_i\, y^i\, k(x, x^i) + b \right)$

Non-linear separation is achieved using the kernel trick

SLIDE 28

Multi-Class SVM

[Figure: three classes — children, female adults, male adults — separated by classifiers f^1, f^2, f^3.]

Construct a set of K binary classifiers $f^1, \ldots, f^K$, each trained to separate one class from the rest.

Compute the class label in a winner-take-all approach:

$j^* = \arg\max_{j = 1, \ldots, K} \left( \sum_{i=1}^{M} \alpha_i^j\, y^i\, k(x, x^i) + b^j \right)$

It is sufficient to compute only K-1 classifiers for K classes, but computing the K-th classifier may provide tighter bounds on the K-th class.
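A minimal sketch (ours, not from the slides) of the one-vs-rest scheme described above, using scikit-learn on synthetic 3-class data; all parameter values are arbitrary:

# Hypothetical example: K binary SVMs combined in a winner-take-all fashion.
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_blobs(n_samples=300, centers=3, random_state=0)    # 3 classes

ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.5)).fit(X, y)
print("number of binary classifiers:", len(ovr.estimators_))   # K = 3
print("winner-take-all predictions:", ovr.predict(X[:5]))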

SLIDE 29

Boosting: Mixture of Weak Classifiers

How does it work?
  • Fit a mixture of simple classifiers on each class
  • Weighted combination of votes

What does it learn?
  • The optimal combination of the classifiers

Why is it good?
  • Small number of parameters for good generalization
  • Very easy to implement
  • Can be extended to combining complex classifiers (SVM)

[Figure: 8 mixtures per class.]
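A minimal sketch (ours, not from the slides) of boosting with scikit-learn; the dataset and the number of weak classifiers are arbitrary choices:

# Hypothetical example: AdaBoost = weighted combination of weak classifiers.
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

# The default weak learner in scikit-learn is a depth-1 decision tree (a "stump").
boost = AdaBoostClassifier(n_estimators=50).fit(X, y)

print("training accuracy:", boost.score(X, y))
print("weights of the first weak classifiers:", boost.estimator_weights_[:5])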

SLIDE 30

Gaussian Mixture Models (GMM) + Bayes

How does it work?
  • Fit a GMM model on each class
  • Compare the pdf of each class

What does it learn?
  • A density model for each class

Why is it good?
  • Small number of parameters for good generalization
  • Learns the importance of each dimension

[Figure: 2 mixtures per class.]
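A minimal sketch (ours, not from the slides) of this classifier: fit one GMM per class and apply Bayes' rule; the dataset and the number of components are arbitrary choices:

# Hypothetical example: GMM + Bayes classification, p(class|x) ~ p(x|class) p(class).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

gmms, priors = [], []
for c in (0, 1):
    gmms.append(GaussianMixture(n_components=2, random_state=0).fit(X[y == c]))
    priors.append(np.mean(y == c))

# log p(x|class) + log p(class) for each class, then pick the argmax.
log_post = np.stack([g.score_samples(X) + np.log(p) for g, p in zip(gmms, priors)], axis=1)
y_pred = np.argmax(log_post, axis=1)
print("training accuracy:", np.mean(y_pred == y))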

SLIDE 31

Multi-Layer Perceptron (with Back-Propagation)

How does it work?
  • Each neuron “cuts a plane”
  • We combine n neurons together to get a non-linear classifier

What does it learn?
  • Cuts the space into hyperplanes that are “combined” together

Why is it good?
  • Fixed size of the model (size of the hidden layer)
  • Extremely fast at testing time

[Figure: network diagram — input, “hidden layer”, output neuron — with example decision boundaries for n = 1, 2, 3, 4 hidden neurons.]
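A minimal sketch (ours, not from the slides) of a small MLP; the dataset and hidden-layer size are arbitrary choices:

# Hypothetical example: MLP with a fixed-size hidden layer, trained by back-propagation.
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(4,),    # n = 4 hidden neurons (fixed model size)
                    max_iter=5000, random_state=0).fit(X, y)

print("training accuracy:", mlp.score(X, y))    # prediction is very fast at test time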

SLIDE 32

A comparison across classifiers

[Figure: training and testing performance compared across SVM, GMM, MLP, GP, Boosting, KNN, RVM, Bagging and RANSAC.]

WARNING: most of these algorithms require a certain amount of tweaking of the hyperparameters to get optimal results.

SLIDE 33

What to use when?

Several criteria (application-dependent):

  • Do you care about being quick at training?
  • Do you care about being quick at testing?
  • Is stack memory an issue for your application?
  • Is precision at testing crucial?
  • In classification, distinguish between precision for the positive versus the negative class
  • Do you care about generalization away from the data?
  • Do you need a notion of worthiness of the prediction (usually rendered by a likelihood; not always available)?
  • Have you run the equivalent linear method and it did not yield good results? Then, yes, it may be a good idea to run a non-linear version of the method (PCA vs. kernel PCA, linear SVM vs. kernel SVM).

SLIDE 34

Topics Requested

  • Comparison between PCA, CCA and kernel PCA (ICA: not covered in class); which to use when?
  • SVM, Boosting (Neural Networks: not covered in class): pros and cons
  • GMR and probabilistic regression
SLIDE 35

Kernel methods for regression

Deterministic regressive model: $y = f(x), \quad x \in \mathbb{R}^N, \; y \in \mathbb{R}$

Probabilistic regressive model: $y = f(x) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$

Build an estimate of the noise model and then compute $f$ directly (Support Vector Regression).

SLIDE 36

Kernel methods for regression

Deterministic regressive model: $y = f(x), \quad x \in \mathbb{R}^N, \; y \in \mathbb{R}$

Probabilistic regressive model: $y = f(x) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$

Probabilistic estimate of the nonlinear relationship between $y$ and $x$ through the conditional density $p(y \mid x)$ (this estimates the noise model and $f$); the estimate is then computed by taking the expectation over the conditional density:

$\hat{y} = E\{f(x)\} = E\{p(y \mid x)\}$

Gaussian Mixture Regression (GMR) first computes p(x, y) and then derives p(y|x). Gaussian Process Regression (GPR) computes p(y|x) directly.

SLIDE 37

SVR, GPR, GMR: Differences

SVR, GMR and GPR are based on the same probabilistic regressive model, but they do not optimize the same objective function → they find different solutions.

  • SVR:
    – minimizes the reconstruction error through convex optimization → ensured to find the optimal estimate, but not a unique solution
    – usually finds a number of models <= number of datapoints (the support vectors)
  • GMR:
    – learns p(x,y) through maximum likelihood → finds a local optimum
    – computes a generative model p(x,y) from which it derives p(y|x)
    – starts with a low number of models << number of datapoints
  • GPR:
    – no optimization; analytical (optimal) solution
    – expresses p(y|x) as a full density model
    – number of models = number of datapoints!
SLIDE 38

Gaussian Mixture Regression (GMR)

1) Estimate the joint density, p(x,y), across pairs of datapoints using GMM:

$p(x, y) = \sum_{i=1}^{K} \alpha_i \, p(x, y \mid i), \qquad p(x, y \mid i) = \mathcal{N}\left(x, y;\, \mu^i, \Sigma^i\right)$

$\mu^i, \Sigma^i$: mean and covariance matrix of Gaussian $i$.

[Figure: 2D projection of a Gauss function; the ellipse contour corresponds to about 2 standard deviations (axes x, y).]

SLIDE 39

Gaussian Mixture Regression (GMR)

1) Estimate the joint density, p(x,y), across pairs of datapoints using GMM (same model as above).

The parameters are learned through Expectation-Maximization, an iterative procedure starting from a random initialization.

SLIDE 40

1) Estimate the joint density, p(x,y), across pairs of datapoints using GMM:

$p(x, y) = \sum_{i=1}^{K} \alpha_i \, \mathcal{N}\left(x, y;\, \mu^i, \Sigma^i\right)$

Mixing coefficients $\alpha_i$, with $\sum_{i=1}^{K} \alpha_i = 1$: the probability that the M datapoints were generated by Gaussian $i$,

$\alpha_i = p(i) = \frac{1}{M} \sum_{j=1}^{M} p(i \mid x^j)$

Gaussian Mixture Regression (GMR)

SLIDE 41

1) Estimate the joint density, p(x,y), across pairs of datapoints using GMM. 2) Compute the regressive signal, by taking p(y|x)

$p(y \mid x) = \sum_{i=1}^{K} \beta_i(x)\, p(y \mid x, i), \qquad \beta_i(x) = \frac{\alpha_i\, \mathcal{N}\left(x;\, \mu_X^i, \Sigma_{XX}^i\right)}{\sum_{j=1}^{K} \alpha_j\, \mathcal{N}\left(x;\, \mu_X^j, \Sigma_{XX}^j\right)}$

Each $p(y \mid x, i)$ is a Gauss function.

The variance changes depending on the query point

Gaussian Mixture Regression (GMR)

SLIDE 42

1) Estimate the joint density, p(x,y), across pairs of datapoints using GMM. 2) Compute the regressive signal, by taking p(y|x)

$p(y \mid x) = \sum_{i=1}^{K} \beta_i(x)\, p(y \mid x, i), \qquad \beta_i(x) = \frac{\alpha_i\, \mathcal{N}\left(x;\, \mu_X^i, \Sigma_{XX}^i\right)}{\sum_{j=1}^{K} \alpha_j\, \mathcal{N}\left(x;\, \mu_X^j, \Sigma_{XX}^j\right)}$

The influence of each marginal at the query point is modulated by $\beta_i(x)$ (e.g. $\beta_1(x)$ and $\beta_2(x)$ in the figure).

Gaussian Mixture Regression (GMR)

SLIDE 43

3) The regressive signal is then obtained by computing E{p(y|x)}:

$E\{p(y \mid x)\} = \sum_{i=1}^{K} \beta_i(x)\, \tilde{\mu}^i(x), \qquad \tilde{\mu}^i(x) = \mu_Y^i + \Sigma_{YX}^i \left(\Sigma_{XX}^i\right)^{-1} \left(x - \mu_X^i\right)$

This is a linear combination of K local regressive models, with weights $\beta_1(x), \beta_2(x), \ldots$ and local estimates $\tilde{\mu}^1(x), \tilde{\mu}^2(x), \ldots$

Gaussian Mixture Regression (GMR)
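A minimal sketch of GMR (ours, not from the slides): fit a GMM on the joint (x, y) samples and evaluate the conditional mean with the formula above; the data and the number of components are arbitrary choices:

# Hypothetical example: Gaussian Mixture Regression for 1D x and 1D y.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 300))
y = np.sin(x) + 0.1 * rng.normal(size=300)

gmm = GaussianMixture(n_components=4, random_state=0).fit(np.column_stack([x, y]))

def gmr_mean(xq):
    # beta_i(xq): responsibility of each Gaussian for the query point.
    w = np.array([a * multivariate_normal.pdf(xq, m[0], C[0, 0])
                  for a, m, C in zip(gmm.weights_, gmm.means_, gmm.covariances_)])
    beta = w / w.sum()
    # Local regressive models mu_y + S_yx / S_xx * (xq - mu_x).
    local = np.array([m[1] + C[1, 0] / C[0, 0] * (xq - m[0])
                      for m, C in zip(gmm.means_, gmm.covariances_)])
    return np.dot(beta, local)

print("E{y | x=1.0} ~", gmr_mean(1.0), "  (true sin(1.0) =", np.sin(1.0), ")")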

SLIDE 44

Computing the variance var{p(y|x)} provides information on the uncertainty of the prediction computed from the conditional distribution. Careful: this is not the uncertainty of the model. Use the likelihood to compute the uncertainty of the predictor!

$\operatorname{var}\{p(y \mid x)\} = \sum_{i=1}^{K} \beta_i(x) \left( \tilde{\mu}^i(x)^2 + \tilde{\sigma}_i^2 \right) - \left( \sum_{i=1}^{K} \beta_i(x)\, \tilde{\mu}^i(x) \right)^2$

Gaussian Mixture Regression (GMR)

SLIDE 45

Gaussian Mixture Regression (GMR)

[Figure: regressive signal $E\{p(y \mid x)\}$ plotted with an envelope given by $\operatorname{var}\{p(y \mid x)\}$.]

Computing the variance var{p(y|x)} provides information on the uncertainty of the prediction computed from the conditional distribution.

SLIDE 46

Kernel methods for regression

Deterministic regressive model: $y = f(x), \quad x \in \mathbb{R}^N, \; y \in \mathbb{R}$

Probabilistic regressive model: $y = f(x) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$

Probabilistic estimate of the nonlinear relationship between $y$ and $x$ through the conditional density $p(y \mid x)$ (this estimates the noise model and $f$), and then compute the estimate by taking the expectation over the conditional density:

$\hat{y} = E\{f(x)\} = E\{p(y \mid x)\}$

SLIDE 47

Probability Density Function and Regression

A signal y can be estimated through regression, y = f(x), by taking the expectation over the conditional probability p(y | x), for a choice of parameters of p:

$\hat{y} = E\{p(y \mid x;\, \theta)\}$

The simplest way to estimate p(y|x) is through Probabilistic Regression, which estimates a linear regressive model.

SLIDE 48

Probabilistic Regression (PR)

PR is a statistical approach to classical linear regression that estimates the relationship between zero-mean variables y and x by building a linear model of the form:

$y = f(x; w) = w^T x, \qquad w, x \in \mathbb{R}^N$

If one assumes that the observed values of y differ from f(x) by an additive noise $\epsilon$ that follows a zero-mean Gaussian distribution (such an assumption consists of putting a prior distribution over the noise), then:

$y = w^T x + \epsilon, \quad \text{with } \epsilon \sim \mathcal{N}(0, \sigma^2)$

SLIDE 49

Probabilistic Regression

Training set of M pairs of datapoints $\left\{ (x^i, y^i) \right\}_{i=1}^{M}$, collected in $X$ and $\mathbf{y}$.

Likelihood of the regressive model, with parameters $w$ and $\sigma$: $\mathbf{y} = w^T X + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, i.e. $\mathbf{y} \sim p(\mathbf{y} \mid X, w, \sigma)$.

The datapoints are independently and identically distributed (i.i.d.):

$p(\mathbf{y} \mid X, w, \sigma) = \prod_{i=1}^{M} p\left(y^i \mid x^i, w, \sigma\right) = \prod_{i=1}^{M} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( - \frac{\left(y^i - w^T x^i\right)^2}{2 \sigma^2} \right)$

SLIDE 50

Probabilistic Regression

Training set of M pairs of datapoints $\left\{ (x^i, y^i) \right\}_{i=1}^{M}$; likelihood of the regressive model $\mathbf{y} \sim p(\mathbf{y} \mid X, w, \sigma)$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, as before.

Prior model on the distribution of the parameter w:

$p(w) = \mathcal{N}(0, \Sigma_w) \;\propto\; \exp\left( - \tfrac{1}{2}\, w^T \Sigma_w^{-1} w \right)$

The hyperparameters ($\Sigma_w$, $\sigma$) are given by the user.

SLIDE 51

Probabilistic Regression

Predictive distribution for a testing point $x_*$, given the training datapoints $X, \mathbf{y}$:

$p(y_* \mid x_*, X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2}\, x_*^T A^{-1} X\, \mathbf{y}, \;\; x_*^T A^{-1} x_* \right), \qquad A = \frac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$
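A small numpy sketch (ours, not from the slides) of this predictive distribution; the data, the noise level and the prior covariance are arbitrary assumptions:

# Hypothetical example: Bayesian linear regression predictive mean and variance,
# p(y*|x*, X, y) = N(sigma^-2 x*^T A^-1 X y, x*^T A^-1 x*),  A = sigma^-2 X X^T + Sigma_w^-1.
import numpy as np

rng = np.random.default_rng(0)
N, M = 2, 50                                   # input dimension, number of datapoints
w_true = np.array([1.5, -0.7])
sigma = 0.1                                    # noise standard deviation (assumed known)
Sigma_w = np.eye(N)                            # prior covariance on w (user choice)

X = rng.normal(size=(N, M))                    # training inputs, one per column
y = w_true @ X + sigma * rng.normal(size=M)    # noisy training targets

A = X @ X.T / sigma**2 + np.linalg.inv(Sigma_w)
A_inv = np.linalg.inv(A)

x_star = np.array([0.5, 2.0])                  # testing point
mean = x_star @ A_inv @ X @ y / sigma**2
var = x_star @ A_inv @ x_star
print("predictive mean:", mean, "  true value:", w_true @ x_star)
print("predictive variance:", var)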

SLIDE 52

How to extend the simple linear Bayesian regressive model for nonlinear regression, such that the non-linear problem becomes linear again?

From Probabilistic Regression to Gaussian Process Regression

Linear model: $y = w^T x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$

Apply a non-linear transformation $\phi(x)$ to the inputs:

$y = w^T \phi(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$

[Figure: a non-linear relationship between x and y.]

SLIDE 53

Gaussian Process Regression

Start from the predictive distribution of the linear model:

$p(y_* \mid x_*, X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2}\, x_*^T A^{-1} X\, \mathbf{y}, \;\; x_*^T A^{-1} x_* \right), \qquad A = \frac{1}{\sigma^2}\, X X^T + \Sigma_w^{-1}$

Non-linear transformation: replace the inputs $x$ by features $\phi(x)$, so that $y = w^T \phi(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. With $\Phi = \phi(X)$, the non-linear problem becomes linear again in feature space:

$p(y_* \mid x_*, X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2}\, \phi(x_*)^T A^{-1} \Phi\, \mathbf{y}, \;\; \phi(x_*)^T A^{-1} \phi(x_*) \right), \qquad A = \frac{1}{\sigma^2}\, \Phi\, \Phi^T + \Sigma_w^{-1}$

SLIDE 54

Gaussian Process Regression

$p(y_* \mid x_*, X, \mathbf{y}) = \mathcal{N}\left( \frac{1}{\sigma^2}\, \phi(x_*)^T A^{-1} \Phi\, \mathbf{y}, \;\; \phi(x_*)^T A^{-1} \phi(x_*) \right), \qquad A = \frac{1}{\sigma^2}\, \Phi\, \Phi^T + \Sigma_w^{-1}$

The solution depends on the features only through inner products in feature space. Take as kernel:

$k(x, x') = \phi(x)^T\, \Sigma_w\, \phi(x')$

The predictive mean can then be written as

$\hat{y} = E\{y_* \mid x_*, X, \mathbf{y}\} = \sum_{i=1}^{M} \alpha_i\, k(x_*, x^i), \qquad \boldsymbol{\alpha} = \left( K(X, X) + \sigma^2 I \right)^{-1} \mathbf{y}$

SLIDE 55

Gaussian Process Regression

$\hat{y} = E\{y_* \mid x_*, X, \mathbf{y}\} = \sum_{i=1}^{M} \alpha_i\, k(x_*, x^i), \qquad \boldsymbol{\alpha} = \left( K(X, X) + \sigma^2 I \right)^{-1} \mathbf{y}$

In general the weights $\alpha_i$ are non-zero → all datapoints are used in the computation!

SLIDE 56

Gaussian Process Regression

$\hat{y} = E\{y_* \mid x_*, X, \mathbf{y}\} = \sum_{i=1}^{M} \alpha_i\, k(x_*, x^i), \qquad \boldsymbol{\alpha} = \left( K(X, X) + \sigma^2 I \right)^{-1} \mathbf{y}$

The kernel and its hyperparameters are given by the user. They can be optimized by maximizing the marginal likelihood, i.e. p(y | X; hyperparameters).
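A minimal sketch (ours, not from the slides) with scikit-learn, where fit() maximizes the marginal likelihood to tune the kernel hyperparameters; the data and kernel choice are arbitrary:

# Hypothetical example: GPR with an RBF kernel plus a noise term.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X, y)   # optimizes the marginal likelihood

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)
print("optimized kernel:", gpr.kernel_)
print("predictive mean:", mean)
print("predictive std :", std)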

SLIDE 57

Overview of Topics Covered

This course covered a variety of topics that are core to Machine Learning. It gives you the basis to go and read recent advances in each of these topics. We hope that you will find this material useful and that you will use some of these algorithms in the future.

If you do so, drop us a note and we would be glad to include your application in future lectures as examples!