MSc Course MACHINE LEARNING TECHNIQUES AND APPLICATIONS



SLIDE 1

APPLIED MACHINE LEARNING

MSc Course

MACHINE LEARNING TECHNIQUES AND APPLICATIONS Classification with GMM + Bayes

SLIDE 2

Clustering, semi-supervised clustering and classification

(Figure: points with labels for class 1, labels for class 2, and unlabeled points.)

Clustering: no labels for the points! Group points according to the geometrical distribution of the points.

Semi-supervised clustering: labels for a fraction of the points. Use the labels to choose the hyperparameters of clustering (e.g. via the F1-measure).

Classification: all points are labelled. Use the labels to determine the boundary between the two classes.

SLIDE 3

From Clustering to Classification

Binary classification problem

SLIDE 4

From Clustering to Classification

Need to decide to which class each point belongs. What if the probability of belonging to several classes is not zero?

Solution of GMM clustering with two Gaussian functions with isotropic/spherical covariance

SLIDE 5

Gaussian Maximum Likelihood (ML) Discriminant Rule

p(x | y = 1) ~ N(μ_1, Σ_1)
p(x | y = 2) ~ N(μ_2, Σ_2)

Boundary: all points x such that p(x | y = 1) = p(x | y = 2)

SLIDE 6

Gaussian ML Discriminant Rule

  • 2-class problem, conditional densities to belong to classes y = 1 and y = 2:

p(x | y = 1) ~ N(μ_1, Σ_1) = (2π)^(−N/2) |Σ_1|^(−1/2) exp( −(1/2) (x − μ_1)^T Σ_1^(−1) (x − μ_1) )
p(x | y = 2) ~ N(μ_2, Σ_2) = (2π)^(−N/2) |Σ_2|^(−1/2) exp( −(1/2) (x − μ_2)^T Σ_2^(−1) (x − μ_2) )

  • To determine the class label, compute the likelihood ratio (optimal Bayes classifier). A new point x belongs to class 1 if:

p(y = 1 | x) ≥ p(y = 2 | x)    (1)

SLIDE 7

Gaussian ML Discriminant Rule

p(y = 1 | x) ≥ p(y = 2 | x)    (1)

By Bayes: p(y = i | x) = p(x | y = i) p(y = i) / p(x), i = 1, 2.

Assuming equal class distributions, p(y = 1) = p(y = 2), and replacing in (1):

p(x | y = 1) ≥ p(x | y = 2)
⇔ ln [ p(x | y = 1) / p(x | y = 2) ] ≥ 0
⇔ (x − μ_2)^T Σ_2^(−1) (x − μ_2) − (x − μ_1)^T Σ_1^(−1) (x − μ_1) + log|Σ_2| − log|Σ_1| ≥ 0

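This discriminant rule can be sketched in a few lines of NumPy. A minimal sketch, assuming made-up means and identity covariances (the helper names `gaussian_log_likelihood` and `classify` are illustrative, not part of the lecture code):

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    """Log-density of a multivariate Gaussian N(mu, sigma) evaluated at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)           # stable log|Sigma|
    quad = diff @ np.linalg.solve(sigma, diff)     # (x-mu)^T Sigma^{-1} (x-mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def classify(x, mu1, s1, mu2, s2):
    """Optimal Bayes rule with equal priors: compare the two log-likelihoods."""
    ll1 = gaussian_log_likelihood(x, mu1, s1)
    ll2 = gaussian_log_likelihood(x, mu2, s2)
    return 1 if ll1 >= ll2 else 2

# Two made-up classes with equal spherical covariance: the boundary is the
# hyperplane equidistant from the two means (here x1 = 2).
mu1, s1 = np.array([0.0, 0.0]), np.eye(2)
mu2, s2 = np.array([4.0, 0.0]), np.eye(2)
print(classify(np.array([1.0, 0.0]), mu1, s1, mu2, s2))  # -> 1
print(classify(np.array([3.0, 0.0]), mu1, s1, mu2, s2))  # -> 2
```

Comparing log-likelihoods rather than forming the likelihood ratio directly is numerically safer and is exactly the log-ratio test above.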
SLIDE 8

From Clustering to Classification with GMM

Example of binary classification using GMM + Bayes rule (isotropic Gaussian functions). Train each Gaussian separately, using the dataset of class 1 for Gaussian 1 and the dataset of class 2 for Gaussian 2.

SLIDE 9

From Clustering to Classification with GMM

Example of binary classification using GMM + Bayes rule (diagonal Gaussian functions). Train each Gaussian separately, using the dataset of class 1 for Gaussian 1 and the dataset of class 2 for Gaussian 2.

SLIDE 10

From Clustering to Classification with GMM

Example of binary classification using GMM + Bayes rule (full covariance Gaussian functions). Train each Gaussian separately, using the dataset of class 1 for Gaussian 1 and the dataset of class 2 for Gaussian 2.

SLIDE 11

Maximum Likelihood Discriminant Rule

  • A maximum likelihood classifier chooses the class label that is the most likely.

  • The conditional density that a data point x has associated class label y = k is:

p_k(x) = p(x | y = k)

  • The maximum likelihood (ML) discriminant rule predicts the class of an observation x using:

c(x) = argmax_k p_k(x)

SLIDE 12

Gaussian ML Discriminant Rules

  • Multiclass problem with k = 1…K classes; the conditional density for each class is a multivariate Gaussian:

p(x | y = k) ~ N(μ_k, Σ_k)

  • The ML discriminant rule is the minimum of minus the log-likelihood (equivalent to maximizing the likelihood):

C(x) = argmin_k [ (x − μ_k)^T Σ_k^(−1) (x − μ_k) + log|Σ_k| ]

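The multiclass ML rule is a one-liner once the per-class negative log-likelihoods are available. A minimal sketch with four made-up Gaussian classes (function names are illustrative):

```python
import numpy as np

def neg_log_likelihood(x, mu, sigma):
    """(x - mu)^T Sigma^{-1} (x - mu) + log|Sigma|, dropping constant terms."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return diff @ np.linalg.solve(sigma, diff) + logdet

def ml_discriminant(x, means, covs):
    """ML discriminant rule: argmin_k of minus the Gaussian log-likelihood."""
    scores = [neg_log_likelihood(x, mu, sigma) for mu, sigma in zip(means, covs)]
    return int(np.argmin(scores))

# Four made-up classes on a grid, identity covariances
means = [np.array([0.0, 0.0]), np.array([5.0, 0.0]),
         np.array([0.0, 5.0]), np.array([5.0, 5.0])]
covs = [np.eye(2)] * 4
print(ml_discriminant(np.array([4.5, 4.0]), means, covs))  # -> 3 (nearest mean)
```

With equal identity covariances the rule reduces to picking the nearest class mean; unequal covariances bend the boundaries into quadrics.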
SLIDE 13

Gaussian ML Discriminant Rules

Example of 4-class classification using four Gaussian distributions

SLIDE 14

Gaussian ML Discriminant Rules

Example of 4-class classification using four Gaussian distributions

SLIDE 15

Gaussian ML Discriminant Rules

Example of 2-class classification using 2 Gaussian distributions with equal covariance matrices

SLIDE 16

Classification with GMMs

Multi-class problem with l = 1…L classes, where each class is modeled with a GMM composed of K_l multivariate Gaussian functions:

p(x | y = l) ~ Σ_{k=1…K_l} α_k N(μ_k^l, Σ_k^l)

The ML discriminant rule is the minimum of minus the log-likelihood (equivalent to maximizing the likelihood):

c(x) = argmin_l [ −log p(x | y = l) ]

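The GMM discriminant rule only changes the per-class density to a mixture. A minimal sketch with hand-set (made-up) mixture parameters rather than EM-trained ones; only the scoring step is shown:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    quad = diff @ np.linalg.solve(sigma, diff)
    return np.exp(-0.5 * (d * np.log(2 * np.pi) + logdet + quad))

def gmm_log_likelihood(x, weights, means, covs):
    """log p(x | y=l) for a mixture with weights alpha_k summing to 1."""
    return np.log(sum(w * gaussian_pdf(x, mu, s)
                      for w, mu, s in zip(weights, means, covs)))

def classify(x, class_gmms):
    """argmin_l of minus the log-likelihood (== argmax of the likelihood)."""
    scores = [-gmm_log_likelihood(x, *gmm) for gmm in class_gmms]
    return int(np.argmin(scores))

# Two made-up classes, each modelled by a 2-component GMM
gmm1 = ([0.5, 0.5], [np.array([0.0, 0.0]), np.array([1.0, 1.0])], [np.eye(2)] * 2)
gmm2 = ([0.5, 0.5], [np.array([5.0, 5.0]), np.array([6.0, 6.0])], [np.eye(2)] * 2)
print(classify(np.array([0.5, 0.5]), [gmm1, gmm2]))  # -> 0
print(classify(np.array([5.5, 5.5]), [gmm1, gmm2]))  # -> 1
```

In practice each GMM would first be fit by EM on its own class's data, as the surrounding slides describe.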
SLIDE 17

Classification with GMMs

Example of binary classification using two Gaussian Mixture Models. Train each GMM separately, using the dataset of class 1 for GMM 1 and the dataset of class 2 for GMM 2 (3 Gaussians for each GMM).

SLIDE 18

Practical Issues

  • In practice, the population mean vectors μ_k and covariance matrices Σ_k are estimated from the training set. This leads to numerical imprecision in the maximum likelihood discriminant rule.

X^k = {x_i^k}, i = 1…n_k: the training set, composed of n_k data points per class k = 1…K.

Estimated mean: μ̂_k = (1/n_k) Σ_{i=1…n_k} x_i^k

Estimated covariance: Σ̂_k = (1/n_k) S_k, with S_k = Σ_{i=1…n_k} (x_i^k − μ̂_k)(x_i^k − μ̂_k)^T

S_k is also called the scatter matrix.

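These estimators are two lines of NumPy. A minimal sketch on a made-up class dataset (rows are datapoints):

```python
import numpy as np

def class_estimates(X):
    """Empirical mean and ML covariance (scatter matrix / n) of one class."""
    n = X.shape[0]
    mu = X.mean(axis=0)                # hat{mu}_k = (1/n_k) sum_i x_i^k
    centered = X - mu
    S = centered.T @ centered          # scatter matrix S_k
    sigma = S / n                      # hat{Sigma}_k = S_k / n_k
    return mu, sigma

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
mu, sigma = class_estimates(X)
print(mu)     # -> [1. 1.]
print(sigma)  # -> identity matrix
```

Note this is the biased (1/n_k) maximum-likelihood estimate matching the formula on the slide; NumPy's `np.cov` uses 1/(n-1) by default.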
SLIDE 19

Classification with two GMMs, one per class; each GMM has 2 Gaussians. Train each GMM (composed of 2 Gaussians each) separately, using the dataset of class 1 for the first GMM and the dataset of class 2 for the second.

SLIDE 20

From clustering to classification

Clustering with GMM vs. classification with GMM using naïve Bayes: clustering does not have the class labels and hence ends up merging the classes.

SLIDE 21

Evaluating the classification

SLIDE 22

Estimating from sampling the datapoints

If one trains the algorithm with all datapoints, one cannot test whether the algorithm can predict well. To test the ability of the model to correctly predict the class labels, one trains the model using only a subset of datapoints sampled at random, and one tests the model's predictions on the datapoints not used during training.

(Figure: datapoints of Class 1 and Class 2.)

SLIDE 23

Estimating from sampling the datapoints

1) Sample the datapoints
2) Train the algorithm on the sampled points
3) Test the prediction of the learned model on the rest of the points

(Figure legend: Class 1 | Class 2 | Sampled datapoints used for training | Learned boundary between the classes | Misclassified datapoint)

SLIDE 24

Estimating from sampling the datapoints

1) Pick another sample of datapoints
2) Train the algorithm on the new sampled points
3) Test the prediction of the learned model on the rest of the points

(Figure legend: Class 1 | Class 2 | Sampled datapoints used for training | Learned boundary between the classes | Misclassified datapoint)

Crossvalidation: repeat training/testing procedure several times and compute average performance.
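The repeated train/test procedure can be sketched end-to-end. A minimal sketch on made-up data, using a toy nearest-class-mean classifier as a stand-in for GMM + Bayes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2-class dataset: two well-separated Gaussian blobs
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def nearest_mean_classify(Xtr, ytr, Xte):
    """Toy classifier: assign each test point to the nearest class mean."""
    means = np.array([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(Xte[:, None, :] - means[None, :, :], axis=2)
    return dists.argmin(axis=1)

accs = []
for fold in range(10):                    # repeat the random split 10 times
    idx = rng.permutation(len(X))
    n_train = int(2 / 3 * len(X))         # 2/3 training, 1/3 testing
    tr, te = idx[:n_train], idx[n_train:]
    pred = nearest_mean_classify(X[tr], y[tr], X[te])
    accs.append(float((pred == y[te]).mean()))

print(f"mean accuracy over folds: {np.mean(accs):.2f} +/- {np.std(accs):.2f}")
```

The variance across folds (the +/- term) is the quantity a later slide uses to diagnose overfitting.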

SLIDE 25

ML in Practice: Training and Evaluation

The best practice to assess the validity of a machine learning algorithm is to measure its performance against the training and testing sets. These sets are built by partitioning the dataset at hand.

(Diagram: Training Set | Testing Set, with crossvalidation.)

SLIDE 26

Training and validation sets are used to determine the sensitivity of the learning to the choice of hyperparameters (i.e. parameters not learned during training). Values for the hyperparameters are set through a grid search. Once the optimal hyperparameters have been picked, the model is trained on the combined training + validation set and tested on the testing set. In practice, one often uses solely training and testing sets and performs crossvalidation directly on these.

ML in Practice: Training and Evaluation

(Diagram: Training Set | Validation Set | Testing Set, with crossvalidation over the first two.)

SLIDE 27

Definition: “Cross validation is the practice of confirming an experimental finding by repeating the experiment using an independent assay technique.”

(Diagram: the whole dataset is randomly split into train data and test data, once per fold f = 1, 2, …, F.)

f-fold cross validation:

  • Constant train/test ratio.
  • At each iteration:

1) Random split of the data between train and test
2) Repetition of the classification

  • Averaging of the results across the folds.

Crossvalidation

SLIDE 28

Crossvalidation

Choice of training/testing ratio: to avoid overfitting (i.e. fitting all datapoints too well, including noise), train the classifier with a small sample of all datapoints and test it with the remaining datapoints. A typical choice of training/testing ratio is 2/3 training, 1/3 testing. The smaller the ratio, the more robust the classification.

Several-fold crossvalidation: a typical choice is 10-fold crossvalidation. However, this depends on how many datapoints you have in your dataset!

SLIDE 29

Classification F-Measure (careful: similar to, but not the same as, the F-measure we saw for clustering!)

True Positives (TP): number of datapoints of class 1 that are correctly classified.
False Negatives (FN): number of datapoints of class 1 that are incorrectly classified.
False Positives (FP): number of datapoints of class 2 that are incorrectly classified.

Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F = 2 · Precision · Recall / (Precision + Recall)

There is a tradeoff between classifying correctly all datapoints of the same class and making sure that each class contains points of only one class. Recall: proportion of datapoints of class 1 correctly classified. Precision: proportion of datapoints of class 1 correctly classified over all datapoints classified in class 1.

Performance Measures
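The precision/recall/F computation is easy to get wrong on unbalanced data, so it is worth writing out. A minimal sketch in pure Python (the helper name `f_measure` is illustrative):

```python
def f_measure(y_true, y_pred, positive=1):
    """Precision, recall and F-measure for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# 8 points: one class-1 point missed (FN) and one class-2 point grabbed (FP)
y_true = [1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 1, 1, 2, 1, 2, 2, 2]
p, r, f = f_measure(y_true, y_pred)
print(p, r, f)  # -> 0.75 0.75 0.75
```

A real implementation should also guard the divisions when TP + FP or TP + FN is zero.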

SLIDE 30

Estimating from sampling the datapoints

(Figure legend: Class 1 | Class 2 | Sampled datapoints used for training | Learned boundary between the classes | Misclassified datapoint | True Positive | False Negative | False Positive)

SLIDE 31

Crossvalidation

  • The variance of the classification performance across the different folds of crossvalidation measures the sensitivity of the classifier to the choice of training/testing ratio and hyperparameters.
  • Large variance on the testing set → overfitting!
  • A small variance with a small training/testing set ratio indicates high robustness of the classification.

SLIDE 32

In classification, performance = % of items correctly classified. This can lead to poor classification of one class if the instances of each class are not well balanced.

Performance measures in ML

Model learned with SVM (see next week)

SLIDE 33

In classification, performance = % of items correctly classified. This can lead to poor classification of one class if the instances of each class are not well balanced.

Performance measures in ML

The situation is worse on the testing dataset!

Model learned with SVM (see next week)

SLIDE 34

GMM + Bayes is less sensitive to unbalanced data, as it trains one model per class separately.

In classification, performance = % of items correctly classified. This can lead to poor classification of one class if the instances of each class are not well balanced.

Performance measures in ML

Model learned with GMM + Bayes, 2 Gauss functions each

SLIDE 35

In classification, performance = % of items correctly classified.

Performance measures in ML

This depends on choosing the parameters well, e.g. the threshold in naïve Bayes.

Bayes rule for binary classification: x has class label 1 if p(y = 1 | x) ≥ p(y = 2 | x), which for equal priors reduces to comparing the class-conditional densities. With a threshold ρ on the ratio, x has class label 1 if

p(x | y = 1) ≥ ρ · p(x | y = 2),

else x has class label 2.

SLIDE 36

To determine the correct hyperparameters

The ROC (Receiver Operating Characteristic) curve plots the fraction of true positives against the fraction of false positives over the total number of samples (for binary classification only). Each point on the curve corresponds to a different value of the classifier’s hyperparameter (e.g. a threshold on Bayes’ classification).

(Figure: ROC curve, p(TP) vs. p(FP), each from 0% to 100%; performance improves toward the top-left corner (perfect classification) and drops toward the "always negative" and "always positive" corners.)

True Positives (TP): number of datapoints of class 1 that are correctly classified.
False Positives (FP): number of datapoints of class 2 that are incorrectly classified.
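An ROC curve is traced by sweeping the decision threshold and recording the true- and false-positive rates. A minimal sketch on made-up classifier scores (e.g. log-likelihood ratios):

```python
import numpy as np

def roc_points(scores_pos, scores_neg, thresholds):
    """One (FP rate, TP rate) point per threshold: predict class 1 iff score >= t."""
    points = []
    for t in thresholds:
        tpr = float(np.mean(scores_pos >= t))   # class-1 points classified as 1
        fpr = float(np.mean(scores_neg >= t))   # class-2 points classified as 1
        points.append((fpr, tpr))
    return points

rng = np.random.default_rng(1)
scores_pos = rng.normal(2.0, 1.0, 500)   # made-up scores for true class-1 samples
scores_neg = rng.normal(0.0, 1.0, 500)   # made-up scores for true class-2 samples
for fpr, tpr in roc_points(scores_pos, scores_neg, [-2.0, 0.0, 1.0, 2.0, 4.0]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Raising the threshold moves the operating point toward (0%, 0%), i.e. "always negative"; lowering it moves toward (100%, 100%), i.e. "always positive".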

SLIDE 37

To determine the correct hyperparameters

SLIDE 38

Example of Over-Fitting

Outliers
SLIDE 39

Not perfect classification but no overfitting

Outliers

Does not classify well one of the outliers but generally a good fit

SLIDE 40

Over-Fitting

Outliers

Classifies well all datapoints, including the outliers, but requires 4 Gaussians for each model, and the overall shape of the density is not well encapsulated.

SLIDE 41

Quantification of over-fitting

When performing crossvalidation, overfitting is clearly visible in the variance on the testing set.

SLIDE 42

Curse of Dimensionality

N: number of dimensions; M: number of datapoints.

Computational costs may grow as a function of the number of dimensions or of the number of datapoints, e.g. O(N, M) or O(N², M²).
SLIDE 43

Example: Classification with 2 GMMs

(1 Gaussian per model, spherical covariance matrix)

Counting the number of parameters shows the increase in computation time for both training and testing.
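Counting parameters makes the cost comparison concrete. A minimal sketch; the counting convention (K − 1 free mixing weights, N(N+1)/2 entries for a symmetric full covariance) is standard, but the helper itself is illustrative:

```python
def gmm_param_count(n_dims, n_gaussians, covariance="full"):
    """Free parameters of one GMM: K means, K covariances, K-1 mixing weights."""
    cov_params = {"spherical": 1,                      # single shared variance
                  "diagonal": n_dims,                  # one variance per axis
                  "full": n_dims * (n_dims + 1) // 2,  # symmetric matrix
                  }[covariance]
    return n_gaussians * (n_dims + cov_params) + (n_gaussians - 1)

for cov in ("spherical", "diagonal", "full"):
    print(cov, gmm_param_count(n_dims=10, n_gaussians=1, covariance=cov))
# spherical 11, diagonal 20, full 65: the full case grows quadratically with N
```

This is why the spherical and diagonal variants are attractive in high dimensions despite their weaker boundaries.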

SLIDE 44

A GMM with full covariance grows quadratically with N and linearly with M at training, and quadratically with N at testing.

Example: Classification with 2 GMMs

(1 Gaussian per model, full covariance matrix)

SLIDE 45

  • We have seen how to interpret the result of clustering with GMM using Bayes’ rule.
  • We have stressed the differences in principle between clustering, semi-supervised clustering and classification.
  • We have seen how to use Bayes’ rule to perform classification with GMMs, training one GMM on each class separately.
  • We have then seen the key components to evaluate classification:
    • F-measure
    • Training/testing sets and training/testing ratio
    • Crossvalidation
    • ROC curve
  • We have discussed important issues in classification:
    • Overfitting
    • Unbalanced classes
    • Tradeoff between computational costs and performance

Summary