

SLIDE 1

MACHINE LEARNING: Introduction

Lecturer: Prof. Aude Billard (aude.billard@epfl.ch) Assistants: Dr. Basilio Noris, Nicolas Sommer (basilio.noris@epfl.ch; n.sommer@epfl.ch)

SLIDE 2

Practicalities

The schedule alternates between:

  • Lectures: 9h15-11h00 + Exercises: 11h15-13h00 (in room MEB331)
  • Practicals: 9h15-13h00 (in room GRC02)
SLIDE 3

Class Timetable

http://lasa.epfl.ch/teaching/lectures/ML_Phd/index.php

SLIDE 4

Practicalities

Website of the class: http://lasa.epfl.ch/teaching/lectures/ML_Phd

Lecture notes: "Machine Learning Techniques", available at the Librairie Polytechnique. The course covers selected chapters of the lecture notes; see the website.

SLIDE 5

Grading

50% of the grade is based on personal work. Choice between:

1. A mini-project implementing an algorithm and evaluating its performance and sensitivity to parameter choices (to be done individually), OR
2. A literature survey on a topic chosen from a list provided in class (can be done in a team of two).

Either option represents ~25-30 hours of personal work, i.e. count one week of work.

50% is based on the final oral exam:

20 minutes of preparation, then 20 minutes answering at the blackboard (closed book, but you may bring one recto-verso A4 page of personal notes).

SLIDE 6

Prerequisites

Linear algebra, probability and statistics. Basics in ML are an advantage (otherwise, catch up with the lecture notes, available on the website):

http://lasa.epfl.ch/teaching/lectures/ML_Phd/

SLIDE 7

Syllabus

Compulsory reading of background chapters before class!

SLIDE 8

Today’s class format

  • Examples of ML applications
  • Taxonomy and basic concepts of ML
  • Brief recap of basic maths for the class
  • Overview of practicals
SLIDE 9

What is Machine Learning to you? What do you think it is used for? Why are you taking this class?!

SLIDE 10

Machine Learning, a definition

Machine Learning is the field of scientific study that concentrates on induction algorithms and on other algorithms that can be said to "learn."

Machine Learning Journal, Kluwer Academic

Machine Learning is an area of artificial intelligence involving developing techniques to allow computers to "learn". More specifically, machine learning is a method for creating computer programs by the analysis of data sets, rather than the intuition of engineers. Machine learning overlaps heavily with statistics, since both fields study the analysis of data.

Webster Dictionary

Machine learning is a branch of statistics and computer science, which studies algorithms and architectures that learn from data sets.

WordIQ

SLIDE 11

What is Machine Learning?

Machine Learning encompasses a large set of algorithms that aim at inferring information hidden in the data.

  • A. M. Bronstein, M. M. Bronstein, M. Zibulevsky, "On separation of semitransparent dynamic images from static background", Proc. Intl. Conf. on Independent Component Analysis and Blind Signal Separation, pp. 934-940, 2006.

SLIDE 12

What is Machine Learning?

Recognizing human speech. Here is the waveform produced when uttering the word "alright". The strength of ML algorithms is that they apply to arbitrary sets of data: they can recognize patterns in data from various sources.

SLIDE 13

What is Machine Learning?

A piano note, and the same note played by an oboe.

SLIDE 14

What is Machine Learning?

What is sometimes impossible for humans to see is easy for ML to pick up. Demo: Eyes-No-Gaze. Demo: Eyes-With-Gaze.

SLIDE 15

What is Machine Learning?

ML algorithms make inferences by analyzing a set of signals or data points. Demo: PCA.

[Figure: estimating gaze despite wrinkles, eyelids and eyelashes, using Support Vector Regression]

Noris et al., 2011, Computer Vision and Image Understanding.

SLIDE 16

What is Machine Learning?

Conversely, things that seem evident to humans may require more than one ML tool, and also some intuition for encoding the data.

There is an ambiguity: the two sets of images can be differentiated by both orientation and color. Orientation is spurious information coming from a poor choice of training data. Color is the feature we try to teach the algorithm.

SLIDE 17

What is Machine Learning?

A good training set must provide enough information for the algorithm to do proper inference. Here, one must provide images of the two pens in the same set of orientations. Conversely, things that seem evident to humans may require more than one ML tool, and also some intuition for encoding the data.
SLIDE 18

Learning versus Memorization

Learning implies generalizing. Generalizing consists of extracting key features from the data, matching those across data (to find resemblances) and storing a generalized representation of the data features that accounts best (according to a given metric) for all the small differences across data. Classification and clustering techniques are examples of methods that generalize by categorizing the data. Generalizing is the opposite of memorizing, and often one must find a tradeoff between over-generalizing, hence losing information about the data, and overfitting, i.e. keeping more information than required. Generalization is particularly important in order to reduce the influence of noise, introduced in the variability of the data.
SLIDE 19

Taxonomy in ML

  • Supervised learning: the algorithm learns a function or model that best maps a set of inputs to a set of desired outputs.
  • Reinforcement learning: the algorithm learns a policy or model of the set of transitions across a discrete set of input-output states (Markovian world) in order to maximize a reward value (external reinforcement).
  • Unsupervised learning: the algorithm learns a model that best represents a set of inputs without any feedback (no desired output, no external reinforcement).
  • Learning to learn: the algorithm learns its own inductive bias based on previous experiences.

SLIDE 20

Examples of ML Applications

SLIDE 21

Structure Discovery

Raw data. Trying to find some structure in the data…

SLIDE 22

Structure Discovery: example

Methods for spectral analysis, such as linear/kernel PCA, CCA and ICA, aim at finding hidden structure in the data.

Linear PCA vs. kernel PCA projections of handwritten digits: kernel PCA extracts some of the texture better and is less sensitive to noise than linear PCA, which boosts the reconstruction and recognition of digits (Mika et al., NIPS 2000).
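The contrast between the two projections can be sketched in a few lines. Below is a minimal illustration (not the Mika et al. pipeline), assuming scikit-learn and its bundled digits dataset; the RBF kernel and gamma value are illustrative choices:

```python
# Minimal sketch: linear vs. kernel PCA projections of the digits data.
# The RBF kernel and gamma value are illustrative assumptions.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, KernelPCA

X, _ = load_digits(return_X_y=True)

linear_proj = PCA(n_components=2).fit_transform(X)    # linear projection
kernel_proj = KernelPCA(n_components=2, kernel="rbf",
                        gamma=1e-3).fit_transform(X)  # non-linear projection

print(linear_proj.shape, kernel_proj.shape)           # (1797, 2) (1797, 2)
```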

SLIDE 23

Structure Discovery: example

Methods for spectral analysis, such as linear/kernel PCA, CCA and ICA, aim at finding hidden structure in the data.

Person re-identification task. Top row: query image and 10 candidates in the gallery set. Bottom row: projections of the query image onto the appearance manifold of the 10 candidates, pre-learned through kernel PCA.

Yang et al., "Person Reidentification by Kernel PCA Based Appearance Learning", Canadian Conf. on Computer and Robot Vision (2011).

SLIDE 24

Structure Discovery

Spectral analysis proceeds by either projecting or lifting the data into a lower, respectively higher, dimensional space. In each projection, groups of data points appear more similar than in the original space. Looking at each projection separately allows one to determine which feature each group of data points shares.

[Figure: feature space x and projections in feature space, F(x), F1(x)]

This can be used in different ways:

  • To discard outliers by selecting only the data points that have most features in common.
  • To group data points according to shared features.
  • To rank features according to how frequently they appear.

SLIDE 25

Structure Discovery

In this class, we will briefly review some of the key novel algorithms for spectral analysis, including:

  • Kernel PCA, whose non-linear projections find wide application in a variety of domains;
  • Kernel CCA (Canonical Correlation Analysis): a generalization of kernel PCA to comparisons across domains, e.g. combining visual and auditory information;
  • Kernel ICA, which attempts to solve more complex blind source decomposition using non-linear projections.

SLIDE 26

Clustering

Clustering encompasses a large set of methods that try to find patterns that are similar in some way. Hierarchical clustering builds a tree-like structure by pairing data points according to increasing levels of similarity.
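As a minimal sketch of this pairing process, assuming SciPy; the toy 2-D points and the choice of Ward linkage are illustrative:

```python
# Minimal sketch: agglomerative (hierarchical) clustering with SciPy.
# The toy points and Ward linkage are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1]])

Z = linkage(X, method="ward")                    # tree built by pairing closest groups
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```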

SLIDE 27

Clustering: example

Hierarchical clustering can be used with arbitrary sets of data. Example: hierarchical clustering to discover similar temporal patterns of crimes across districts in India.

Chandra et al, “A Multivariate Time Series Clustering Approach for Crime Trends Prediction”, IEEE SMC 2008.

SLIDE 28

Clustering: example

Clustering is used in computer vision for pre-processing and post-processing of images.

Multispectral medical image segmentation (left: MRI image from one channel; right: classification from 9-cluster semi-supervised learning). Clusters should identify patterns such as cerebro-spinal fluid, white matter, striated muscle, and tumor (Lundervold et al., 1996).

SLIDE 29

Clustering: example

Clustering assumes groups of points are somewhat similar according to the same metric of similarity. All current clustering techniques fail at clustering the above seven groups of points.

Jain, 2010, Data clustering: 50 years beyond K-means, Pattern Recognition Letters

SLIDE 30

Clustering

Different techniques or heuristics can be developed to help the algorithm determine the right boundaries across clusters:

Jain, 2010, Data clustering: 50 years beyond K-means, Pattern Recognition Letters

SLIDE 31

Clustering

We will point out some of the emerging and useful research directions to tackle key issues in designing clustering algorithms:

  • semi-supervised clustering,
  • ensemble clustering,
  • simultaneous feature selection during data clustering,
  • large scale data clustering.

In this class, we will briefly review some algorithms for spectral clustering, starting with K-means and moving to advanced methods such as Kernel K-means.
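A minimal K-means sketch, assuming scikit-learn; the two synthetic blobs and the choice k = 2 are illustrative:

```python
# Minimal sketch: K-means on two synthetic blobs; data and k are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),   # blob around (0, 0)
               rng.normal(4.0, 0.5, (50, 2))])  # blob around (4, 4)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # learned cluster centers
print(km.labels_[:5])       # assignments of the first five points
```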

SLIDE 32

Classification

Classification is a supervised clustering process and is usually multi-class. Given a set of known classes, the algorithm learns to extract combinations of data features that best predict the true class of the data.

[Figure: original data, and the same data after 4-class classification using SVM]
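A minimal multi-class SVM sketch, assuming scikit-learn and its bundled iris dataset; the kernel and C value are illustrative choices:

```python
# Minimal sketch: multi-class classification with an SVM.
# Dataset (iris) and kernel settings are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)  # multi-class handled internally
print(clf.score(X_te, y_te))                    # accuracy on held-out data
```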

SLIDE 33

Classification: example

Classification of financial data to assess solvability using a Support Vector Machine (SVM).

Five classes of insolvency risk: excellent, good, satisfactory, passable, poor.

Swiderski et al., "Multistage classification by using logistic regression and neural networks for assessment of financial condition of company", Decision Support Systems, 2012.

SLIDE 34

Classification: issues

A recurrent problem when applying classification to real-life problems is that classes are often very unbalanced. This can drastically affect classification performance, as classes with many data points have more influence on the error measure during training. In this class, you will get the chance to practice this on real datasets during the computer-based practical sessions, and to discuss what one must do when the datasets for each class are unbalanced.

The data from Swiderski et al. have more positive examples than negative examples.
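One common remedy, sketched below only as an illustration (not necessarily the approach used in the practicals), is to reweight the per-class errors; the synthetic 9:1 imbalance and the SVM are assumptions:

```python
# Minimal sketch: countering class imbalance by weighting errors per class.
# The synthetic 9:1 imbalance and the SVM are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

plain = SVC().fit(X, y)                            # majority class dominates the loss
balanced = SVC(class_weight="balanced").fit(X, y)  # errors reweighted by class frequency
```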

SLIDE 35

Regression

Regression is a supervised machine learning technique. Non-linear regression techniques, such as Support Vector Regression and Gaussian Process Regression, model the non-linear relationships across the data.

Given a set of training points $\{(x_i, y_i)\}_{i=1,\dots,M}$, estimate the function $\hat{f}$ that best predicts this set, i.e. predict $y$ given input $x$ through a non-linear function $f$:

$$y = f(x)$$

[Figure: training pairs $(x_1, y_1), \dots, (x_4, y_4)$ and the fitted curve]
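A minimal sketch of fitting such a non-linear f with Support Vector Regression, assuming scikit-learn; the noisy sine-shaped training points and kernel settings are illustrative:

```python
# Minimal sketch: learning y = f(x) with SVR from noisy training points.
# The sine-shaped data and kernel settings are illustrative assumptions.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0.0, 0.1, 100)  # training pairs (x_i, y_i)

f = SVR(kernel="rbf", C=10.0).fit(x, y)            # estimate f from the M points
print(f.predict([[np.pi / 2]]))                    # predict y for a new input x
```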

SLIDE 36

Regression: example

SVR for predicting the cumulative log-return over a period of 2500 days. Contrasts two methods to automatically determine the optimal features (i.e. moving averages). Found that short-term (daily and weekly) trends had a bigger impact than long-term (monthly and quarterly) trends in predicting the next day's return.

Wang & Zhu, "Financial market forecasting using a two-step kernel learning method for the support vector regression", Annals of Operations Research, 2010.

SLIDE 37

Regression: example

SVR for predicting the optimal position and orientation of the golf club to hit the ball in a golf experiment. Contrasts the predictions of two methods, Gaussian Process Regression (GPR) and Gaussian Mixture Regression (GMR), in terms of precision and generalization.

Kronander, Khansari and Billard, JTSC award, IEEE Int. Conf. on Intelligent Robots and Systems (IROS) 2011.

SLIDE 38

Regression

Machine learning techniques for non-linear regression are model-free: they estimate both the function and its parameters (a density-based estimate of the data distribution). In this class, we will:

  • Compare three of the major non-linear regression techniques.
  • Show their similarities (same mathematical framework).
  • Discuss their differences (parameter estimation, objective function).
  • Determine which technique is best suited when.
SLIDE 39

Machine Learning in Practice

The choice of the dataset used for training the algorithm is crucial and strongly biases performance.

SLIDE 40

Regression minimizing the Mean Square Error

Linear model: $Y = aX + b$

$$MSE(\hat{y}) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}(x_i) \right)^2$$

Estimating $\hat{y}$ from sampling the data points.

[Figure: a sampling of (x, y) points and the fitted line]
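A minimal sketch of this estimate, assuming NumPy; the synthetic line Y = 2X + 1 plus noise is an illustrative choice:

```python
# Minimal sketch: fitting y = a*x + b by minimizing the mean square error.
# The synthetic sample of Y = 2X + 1 plus noise is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 30)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, 30)  # sampled data points

a, b = np.polyfit(x, y, deg=1)                # least-squares line fit
mse = np.mean((y - (a * x + b)) ** 2)         # MSE of the estimate y_hat
print(a, b, mse)
```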

SLIDE 41

Regression minimizing the Mean Square Error

Linear model: $Y = aX + b$

$$MSE(\hat{y}) = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}(x_i) \right)^2$$

Estimating $\hat{y}$ from sampling the data points: different samplings yield different estimates $\hat{y}$. The choice of training data (training set) is crucial.

Crossvalidation

SLIDE 42

ML in Practice: Training and Evaluation

Best practice to assess the validity of a machine learning algorithm is to measure its performance against the training, validation and testing sets. These sets are built by partitioning the data set at hand:

Training Set | Validation Set | Testing Set

The training and validation sets are used to determine the sensitivity of the learning to the choice of hyperparameters (i.e. parameters not learned during training). Values for the hyperparameters are set through a grid search (crossvalidation). Once the optimal hyperparameters have been picked, the model is trained on the complete training + validation set and tested on the testing set. In practice, one often uses solely training and testing sets and performs crossvalidation directly on these.
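A minimal sketch of this workflow, assuming scikit-learn; the iris data and the SVM hyperparameter grid are illustrative:

```python
# Minimal sketch: grid search over hyperparameters via crossvalidation on the
# training data, then a final score on the untouched testing set.
# Dataset and parameter grid are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), grid, cv=5).fit(X_tr, y_tr)

print(search.best_params_)       # hyperparameters picked by crossvalidation
print(search.score(X_te, y_te))  # performance on the testing set
```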

SLIDE 43

ML in Practice: Training and Evaluation

Choice of training/testing ratio: avoid overfitting. Train the classifier with a small sample of all data points and test it with the remaining data points. A typical choice of training/testing set ratio is 2/3 training, 1/3 testing. The smaller the ratio, the more robust the classification.

N-fold crossvalidation: repeat the procedure N times, picking points randomly from the dataset to create the training set. A typical choice is 10-fold crossvalidation, although this should depend on the number of data points you have!
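A minimal 10-fold crossvalidation sketch, assuming scikit-learn; dataset and classifier are illustrative:

```python
# Minimal sketch: 10-fold crossvalidation; dataset and classifier are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
scores = cross_val_score(SVC(), X, y, cv=10)  # 10 train/test splits

print(scores.mean(), scores.std())            # average performance and its spread
```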

SLIDE 44

Performance measures in ML

Time performance: how long can it take before an acceptable level of performance is achieved? What would be an optimal learning curve? When is "good enough" achieved?

Progress in a machine's performance must be measurable and must be significant. A machine must eventually reach a minimal level of performance ("good enough") within an acceptable time frame.

SLIDE 45

Performance measures in ML

These vary and depend entirely on the algorithm and the function you wish to optimize. Performance measures for supervised learning algorithms are well defined and relate directly to a distance measure between the desired output and the estimated one. In classification, people tend to compute the performance in terms of the percentage of items correctly classified. This can be very misleading if instances of each class are not well balanced and if there is a lot of variation in classification across classes. The old-fashioned measures of mean, median and standard deviation remain very reliable measures of performance.

SLIDE 46

Performance measures in ML

Performance often depends on choosing parameters well, e.g. the threshold in classification, for instance in naïve Bayes classification.

Bayes rule for binary classification: $x$ has class label $+1$ if $P(y = 1 \mid x) \ge \theta$, else $x$ has class label $-1$, where $\theta$ is the classification threshold.

[Figure: $P(y = 1 \mid x)$ plotted over the input space X]
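A minimal sketch of moving this threshold, assuming scikit-learn's Gaussian naïve Bayes; the synthetic data and the 0.7 threshold are illustrative:

```python
# Minimal sketch: shifting the decision threshold on P(y=1|x) in naive Bayes.
# The synthetic data and the 0.7 threshold are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)
nb = GaussianNB().fit(X, y)

p = nb.predict_proba(X)[:, 1]    # P(y = 1 | x) for each point
labels = (p >= 0.7).astype(int)  # stricter threshold than the default 0.5
print((labels == 1).sum(), "points labeled positive")
```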

SLIDE 47

Performance measures in ML: Ground truth

1. Comparing the performance of a novel method to existing ones or to trivial baselines is crucial. Try to have the "ground truth"; this often means hand-coded solutions (assuming humans outperform the machine).

2. Using the testing set, even in a way that seems reasonable, is always dangerous: it is extremely hard to predict how much it artificially improves the estimated performance.

SLIDE 48

Some Machine Learning Resources

http://www.machinelearning.org/index.html

  • http://www.pascal-network.org/ (Network of Excellence on Pattern Recognition, Statistical Modelling and Computational Learning; summer schools and workshops)

Databases:

  • http://expdb.cs.kuleuven.be/expdb/index.php
  • http://archive.ics.uci.edu/ml/

Journals:

  • Machine Learning Journal, Kluwer Publisher
  • IEEE Transactions on Signal Processing
  • IEEE Transactions on Pattern Analysis
  • IEEE Transactions on Pattern Recognition
  • The Journal of Machine Learning Research

Conferences:

  • ICML: Int. Conf. on Machine Learning
  • NIPS: Neural Information Processing Systems conference; on-line repository of all research papers at www.nips.org

SLIDE 49

Topics for Literature survey and Mini-Projects

Topics for the survey will entail:

  • Survey of clustering methods applied to finance
  • Survey of classification methods applied to biometric data

Topics for the mini-project will entail implementing one of these:

  • Clustering techniques (DBSCAN, FLAME, K-means++)
  • Regression (Gradient Boosting, Locally Weighted Regression)

The exact list of topics for the literature survey and mini-projects will be posted by March 8.

SLIDE 50

Overview of Practicals

SLIDE 51

Brief recap of basic maths for the class

SLIDE 52

Math Background Needed for this Class

  • Probability, statistics: covariance, pdf, …
  • Linear algebra: formal notation, matrix inversion, …
  • Derivatives: partial derivatives, gradient, Jacobian, …
  • Optimization: normalized/weighted mean-square error, Lagrange multipliers, …

SLIDE 53

Discrete Probabilities

Consider two variables x and y taking discrete values over the intervals [x1, …, xM] and [y1, …, yN] respectively.

$P(x = x_i)$: the probability that the variable $x$ takes value $x_i$, with $0 \le P(x = x_i) \le 1$, $i = 1, \dots, M$, and $\sum_{i=1}^{M} P(x = x_i) = 1$. Idem for $P(y = y_j)$, $j = 1, \dots, N$.

SLIDE 54

Discrete Probabilities

The joint probability that the two events A (variable x takes value $x_i$) and B (variable y takes value $y_j$) occur is expressed as:

$$P(A \cap B) = P(x = x_i, y = y_j)$$

$P(A \mid B)$ is the conditional probability that event A will take place if event B already took place:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

SLIDE 55

Discrete Probabilities

The so-called marginal probability that variable x will take value $x_i$ is given by:

$$P_x(x = x_i) := \sum_{j=1}^{N} P(x = x_i, y = y_j)$$
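A minimal numeric sketch of joint, marginal and conditional probabilities, assuming NumPy; the 2x3 joint table is an illustrative assumption:

```python
# Minimal sketch: marginals and a conditional from a discrete joint table.
# The 2x3 joint distribution is an illustrative assumption.
import numpy as np

P_xy = np.array([[0.10, 0.20, 0.10],   # rows: values of x (M = 2)
                 [0.30, 0.20, 0.10]])  # columns: values of y (N = 3)

P_x = P_xy.sum(axis=1)              # P(x = x_i) = sum_j P(x = x_i, y = y_j)
P_y = P_xy.sum(axis=0)              # marginal over x
P_x_given_y1 = P_xy[:, 0] / P_y[0]  # P(x | y = y_1) = P(x, y_1) / P(y_1)

print(P_x, P_y, P_x_given_y1)
```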

SLIDE 56

Probability Distributions, Density Functions

$$p(x) \ge 0 \ \ \forall x, \qquad \int_{-\infty}^{+\infty} p(x)\, dx = 1$$

p(x), a continuous function, is the probability density function or probability distribution function (pdf) of variable x (sometimes also called probability distribution or simply density). The pdf is not bounded by 1: it can grow unbounded, depending on the value taken by x.

[Figure: a pdf p(x) plotted against x]

SLIDE 57

Probability Distributions, Density Functions

The probability that the variable x takes a value in the subinterval [a, b] is given by:

$$P(a \le x \le b) = \int_{a}^{b} p(x)\, dx = D(b) - D(a)$$

The cumulative distribution function (or simply distribution function) of x is:

$$D(x) = \int_{-\infty}^{x} p(x')\, dx'$$

$p(x)\,dx$ ~ probability of x falling within an infinitesimal interval $[x, x + dx]$, so that:

$$p(x) = \frac{d}{dx} D(x)$$

[Figure: D(x) and p(x) plotted against x]

SLIDE 58

Parametric PDF

The Gaussian function is entirely determined by its mean and variance. For this reason, it is often referred to as a parametric distribution. For other pdfs, the variance represents a notion of dispersion around the expected value.

SLIDE 59

Expectation

The expectation of the probability P(x) (in the discrete case) and of the pdf p(x) (in the continuous case), also called the expected value or mean, is the average value weighted by p(x):

When x takes discrete values: $E[x] = \sum_i x_i\, P(x_i)$

For continuous distributions: $E[x] = \int x\, p(x)\, dx$

SLIDE 60

Variance

$\sigma^2$, the variance of a distribution, measures the amount of spread of the distribution around its mean:

$$\mathrm{Var}(x) = \sigma^2 = E\left[\left(x - E[x]\right)^2\right] = E[x^2] - \left(E[x]\right)^2$$

$\sigma$ is the standard deviation of x.
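A minimal numeric sketch of both definitions, assuming NumPy; the discrete distribution over {0, 1, 2} is illustrative:

```python
# Minimal sketch: expectation and variance from their definitions, checked
# against a random sample. The distribution is an illustrative assumption.
import numpy as np

values = np.array([0.0, 1.0, 2.0])
probs = np.array([0.2, 0.5, 0.3])

mean = np.sum(values * probs)              # E[x] = sum_i x_i P(x_i)
var = np.sum(values**2 * probs) - mean**2  # Var(x) = E[x^2] - (E[x])^2

sample = np.random.default_rng(0).choice(values, size=100_000, p=probs)
print(mean, var, sample.mean(), sample.var())  # definitions vs. sample estimates
```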

SLIDE 61

Mean and variance in PDF

For pdfs other than the Gaussian distribution, the variance represents a notion of dispersion around the expected value.

[Figure: three Gaussian distributions and the resulting distribution when superposing them; the expectation is marked, std = 1.38]

SLIDE 62

Probability Distributions, Density Functions

The uni-dimensional Gaussian or Normal distribution is a distribution with pdf given by:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad \mu: \text{mean}, \ \sigma^2: \text{variance}$$

The Gaussian function is entirely determined by its mean and variance. For this reason, it is often referred to as a parametric distribution.

SLIDE 63

Marginal, Likelihood

Consider two random variables x and y with joint distribution p(x, y); the marginal probability of x is then:

$$p_x(x) = \int p(x, y)\, dy$$

SLIDE 64

Marginal, Likelihood

Consider two random variables x and y with joint distribution p(x, y); the marginal probability of x is then:

$$p_x(x) = \int p(x, y)\, dy$$

Consider that the pdf of x, y is parametrized by $\theta_x, \theta_y$, s.t. one can compute the conditional $p(x, y \mid \theta_x, \theta_y)$. Then the likelihood function (short: likelihood) of the model parameters is given by:

$$L(\theta_x, \theta_y \mid x, y) := p(x, y \mid \theta_x, \theta_y)$$

SLIDE 65

Maximum Likelihood

Machine learning techniques often assume that the form of the distribution function is known and that only its parameters must be optimized to best fit a set of observed data points. One then proceeds to determine these parameters through maximum likelihood optimization. The principle of maximum likelihood consists of finding the optimal parameters of a given distribution by maximizing the likelihood function of these parameters, or equivalently by maximizing the probability of the data given the model and its parameters, e.g.:

$$\max_{\mu, \sigma} L(\mu, \sigma \mid x) = \max_{\mu, \sigma} p(x \mid \mu, \sigma), \quad \text{i.e.} \quad \frac{\partial}{\partial \mu} p(x \mid \mu, \sigma) = 0 \ \text{ and } \ \frac{\partial}{\partial \sigma} p(x \mid \mu, \sigma) = 0$$

If p is the Gaussian function, then the above has an analytical solution (assuming that one has enough observations of x to draw from).
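A minimal sketch of that analytical solution, assuming NumPy; the true parameters used to generate the observations are illustrative:

```python
# Minimal sketch: maximum likelihood estimates for a Gaussian. Maximizing
# L(mu, sigma | x) yields mu = sample mean, sigma = sample (ML) std.
# The true parameters (3, 2) are illustrative assumptions.
import numpy as np

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=10_000)

mu_ml = x.mean()                               # maximizer w.r.t. mu
sigma_ml = np.sqrt(np.mean((x - mu_ml) ** 2))  # maximizer w.r.t. sigma (biased)
print(mu_ml, sigma_ml)                         # close to the true 3.0 and 2.0
```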

SLIDE 66

ML in Practice : Caveats on Statistical Measures

A large number of algorithms we will see in class require knowing the mean and covariance of the probability distribution function of the data. In practice, the class means and covariances are not known. They can, however, be estimated from the training set: either the maximum likelihood estimate or the maximum a posteriori estimate may be used in place of the exact value. Several of the algorithms used to estimate these assume that the underlying distribution is a normal distribution. This is usually not true. Thus, one should keep in mind that, although the estimates of the covariance may be considered optimal, this does not mean that the resulting computation obtained by substituting these values is optimal, even if the assumption of normally distributed classes is correct.

SLIDE 67

ML in Practice: Caveats on Statistical Measures

Another complication that you will often encounter when dealing with algorithms that require computing the inverse of the covariance matrix of the data is that, with real data, the number of observations per sample (the dimension of the data) may exceed the number of samples. In this case, the covariance estimates do not have full rank, i.e. they cannot be inverted. There are a number of ways to deal with this. One is to use the pseudoinverse of the covariance matrix. Another is to proceed through singular value decomposition (SVD).
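A minimal sketch of the rank problem and the pseudoinverse remedy, assuming NumPy; the 5 samples in 10 dimensions are illustrative:

```python
# Minimal sketch: a rank-deficient covariance (fewer samples than dimensions)
# and its Moore-Penrose pseudoinverse. The sizes are illustrative assumptions.
import numpy as np

X = np.random.default_rng(0).normal(size=(5, 10))  # 5 samples, 10 dimensions
cov = np.cov(X, rowvar=False)                      # 10x10, rank at most 4

print(np.linalg.matrix_rank(cov))                  # < 10: cov cannot be inverted
cov_pinv = np.linalg.pinv(cov)                     # pseudoinverse still exists
```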

SLIDE 68

Recall that Sections 2.1-2.2 of the Lecture Notes must be read before coming to class next week