

SLIDE 1

Machine Learning

Classification: Introduction

Hamid R. Rabiee

Jafar Muhammadi, Nima Pourdamghani Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/

SLIDE 2

Agenda

- Introduction
- Classification: A Two-Step Process
- Evaluating Classification Methods
- Classifier Performance
- Performance Measures
- Partitioning Methods

SLIDE 3

Introduction

- Classification
  - Predicts categorical class labels (discrete or nominal)
  - Constructs a model from the training set and its class labels, and uses that model to classify new data
- Typical applications
  - Credit approval
  - Target marketing
  - Medical diagnosis
  - Fraud detection

SLIDE 4

Classification: A Two-Step Process

- Model construction
  - Each sample is assumed to belong to a predefined class, as determined by its class label
  - The set of samples used for model construction is called the "training set"
  - The model may be represented as classification rules, decision trees, a probabilistic model, mathematical formulae, etc.
- Model usage: classifying future or unknown objects (see the sketch below)
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the label predicted by the model
    - The accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set must be independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data samples whose class labels are not known
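As a concrete illustration of the two-step process, here is a minimal sketch using scikit-learn and its bundled iris data; the decision-tree model and the 70/30 split are illustrative choices, not part of the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set (labels are known).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on an independent test set,
# then classify new, unlabeled samples.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
new_sample = [[5.1, 3.5, 1.4, 0.2]]
print("predicted class:", model.predict(new_sample))
```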

SLIDE 5

Evaluating Classification Methods

- Performance
  - Classifier performance in predicting class labels
  - Accuracy, {true positive, true negative}, {false positive, false negative}, ...
- Time complexity
  - Time to construct the model (training time): the model is constructed once, so this can be large
  - Time to use the model (classification time): must be tolerable; may require good data structures
- Robustness
  - Handling noise and missing values
  - Handling incorrect training data

SLIDE 6

Evaluating Classification Methods

- Scalability
  - Efficiency on disk-resident databases
- Interpretability
  - Understanding and insight provided by the model
- Other measures: goodness or compactness of classification rules
  - Rule of thumb: the more compact the rules, the better the generalization

SLIDE 7

Performance Measures

- Accuracy is not always a good measure of classifier performance (Why?)
  - Consider a "cancer detection" problem: a classifier that always predicts "healthy" has high accuracy but is useless
- Presentation of classifier performance
  - Use a confusion matrix (rows: real class P/N, columns: predicted class P/N) or a receiver operating characteristic (ROC) curve
  - Several performance measures can be extracted from this matrix (or curve)

SLIDE 8

Performance Measures

- ROC example: four classifiers in ROC space (100 positives, 100 negatives, 200 samples in total)

  Classifier   TP   FP   FN   TN   Accuracy
  A            63   28   37   72   0.68
  B            77   77   23   23   0.50
  C            24   88   76   12   0.18
  C'           76   12   24   88   0.82

SLIDE 9

Performance Measures

- Performance measures (see the sketch below)
  - Accuracy: (TP + TN) / (#data)
  - Specificity: TN / (FP + TN)
  - Sensitivity: TP / (FN + TP)
  - Index of merit: (Specificity + Sensitivity) / 2 = (TP% + TN%) / 2
    - Also known as "percentage correct classifications"
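A short Python sketch of these formulas, applied to classifier A from the ROC example on the previous slide:

```python
def performance_measures(tp, fp, fn, tn):
    """Accuracy, specificity, sensitivity and index of merit from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    specificity = tn / (fp + tn)
    sensitivity = tp / (fn + tp)
    index_of_merit = (specificity + sensitivity) / 2
    return accuracy, specificity, sensitivity, index_of_merit

# Classifier A from the ROC example: TP=63, FP=28, FN=37, TN=72.
print(performance_measures(tp=63, fp=28, fn=37, tn=72))
# (0.675, 0.72, 0.63, 0.675)
```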

- Performance is measured using test-set results
  - The test set should be distinct from the training (learning) set
  - Several methods are available to partition the data into separate training and test sets; they yield different estimates of the "true" index of merit

SLIDE 10

Data Partitioning

- Goal: validating the classifier and its parameters
  - Choosing the best parameter set
- Idea: use a part of the training data as a validation set
  - The validation set must be a good representative of the whole data
  - How should we partition the training data?

SLIDE 11

Data Partitioning Methods

- Holdout method: random sampling
  - The data ("All examples") is randomly partitioned into two independent sets, a training set and a test set
  - Typically, the training set is twice the size of the test set (2/3 for training, 1/3 for testing)
  - Assumption: the data is uniformly distributed
  - If the split is repeated, the true error estimate is obtained as the average of the separate estimates E_i
- Holdout method: bootstrap (see the sketch below)
  - Resample, with replacement, n samples of the original data as the training set
  - Some samples of the original data may appear several times in the bootstrap sample (on average, about 63.2% of the samples are distinct)
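A quick numerical check of the 63.2% figure, as a minimal numpy sketch (the sample size n is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
bootstrap_idx = rng.integers(0, n, size=n)             # resample n indices with replacement
distinct = len(np.unique(bootstrap_idx)) / n
print(f"fraction of distinct samples: {distinct:.3f}")  # ~0.632, i.e. 1 - 1/e
```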

SLIDE 12

Data Partitioning Methods

- Holdout method: multiple train-and-test experiments
  - Repeat the random split several times (Experiment #1, #2, #3, ...), each time holding out a different test set from the total number of examples
- Holdout method drawbacks
  - In problems with a sparse dataset, we may not be able to afford the "luxury" of setting aside a portion of the dataset for testing
  - Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an "unfortunate" split

SLIDE 13

Data Partitioning Methods

- Cross-validation (k-fold, where k = 10 is most popular); see the sketch below
  - Randomly partition the data into k mutually exclusive subsets D_1, ..., D_k, each of approximately equal size
  - At the i-th iteration, use D_i as the test set and the remaining folds as the training set (each of Experiments #1, ..., #k holds out a different fold)
  - The mean of the measures obtained over the k iterations is used as the output performance measure
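A minimal k-fold cross-validation sketch with scikit-learn; the logistic-regression model and the iris data are illustrative choices (StratifiedKFold would do the same while preserving class proportions, as described on a later slide).

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k mutually exclusive folds; each iteration tests on one fold and trains on the rest.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean())   # the mean over the k iterations is the reported measure
```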

SLIDE 14

Data Partitioning Methods

- Cross-validation (k-fold, where k = 10 is most popular): divide the total dataset into three subsets
  - Training data is used for learning the parameters of the model
  - Validation data is not used for learning, but for deciding what type of model and what amount of regularization works best
  - Test data is used to get a final, unbiased estimate of how well the model works; we expect this estimate to be worse than on the validation data
- As before, the true error is estimated as the average error rate over the k iterations: E = (1/k) Σ_{i=1}^{k} E_i

SLIDE 15

Data Partitioning Methods

- Leave-one-out
  - k-fold cross-validation with k = number of samples; useful for small datasets
  - As usual, the true error is estimated as the average error rate on the held-out test examples (Experiment #1, ..., #k each leaves out one sample)

SLIDE 16

Data Partitioning Methods

- Stratified cross-validation
  - Folds are stratified so that the class distribution in each fold is approximately the same as in the initial data

SLIDE 17

How Many Folds Are Needed?

- With a large number of folds:
  - (+) The bias of the true-error-rate estimator will be small (the estimator will be very accurate)
  - (-) The variance of the true-error-rate estimator will be large
  - (-) The computational time will be very large as well (many experiments)
- With a small number of folds:
  - (+) The number of experiments, and therefore the computation time, is reduced
  - (+) The variance of the estimator will be small
  - (-) The bias of the estimator will be large (conservative, i.e., higher than the true error rate)
- In practice, the choice of the number of folds depends on the size of the dataset
  - For large datasets, even 3-fold cross-validation will be quite accurate
  - For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible

SLIDE 18

Three-Way Data Splits

- If model selection and true error estimates are to be computed simultaneously, the data needs to be divided into three disjoint sets (see the sketch after this list)
  - Training set: examples used for learning, i.e., to fit the parameters of the classifier
  - Validation set: examples used to tune the parameters (e.g., model type, regularization) of the classifier
  - Test set: examples used only to assess the performance of a fully trained classifier
- Why separate test and validation sets?
  - The error-rate estimate of the final model on the validation data will be biased (smaller than the true error rate), since the validation set is used to select the final model
  - After assessing the final model with the test set, YOU MUST NOT tune the model any further
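A minimal sketch of a three-way split with scikit-learn; the 60/20/20 proportions are an arbitrary illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% gives a 60/20/20 train/validation/test split.
# Fit on X_train, tune on X_val, and touch X_test only once, for the final estimate.
```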

SLIDE 19

Three-Way Data Splits (procedure diagram)

Adapted from slides of Ricardo Gutierrez-Osuna

SLIDE 20

Any Question?

End of Lecture 5. Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1/

SLIDE 21

Machine Learning

Classification I

Hamid R. Rabiee

Jafar Muhammadi, Alireza Ghassemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/

SLIDE 22

Agenda

- Bayesian Decision Theory
  - Prior Probabilities
  - Class-Conditional Probabilities
  - Posterior Probabilities
  - Probability of Error
  - Conditional Risk
  - Min-Error-Rate Classification
  - Probabilistic Discriminant Functions
- Discriminant Functions: Gaussian Density
- Minimax Classification
- Neyman-Pearson

SLIDE 23

Bayesian Decision Theory

- Bayesian decision theory is a fundamental statistical approach that quantifies the tradeoffs between various decisions, using the probabilities and costs that accompany those decisions.
- First, we will assume that all probabilities are known.
- Then, we will study the cases where the probabilistic structure is not completely known.

SLIDE 24

Bayesian Decision Theory

- We use the fish sorting example to illustrate these topics.
- Fish sorting example revisited
  - The state of nature is a random variable.
  - Define w as the type of fish we observe (state of nature, class), where
    - w = w1 for sea bass,
    - w = w2 for salmon.
  - P(w1) is the a priori probability that the next fish is a sea bass.
  - P(w2) is the a priori probability that the next fish is a salmon.

SLIDE 25

Prior Probabilities

- Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it.
- How can we choose P(w1) and P(w2)?
  - Set P(w1) = P(w2) if they are equiprobable (uniform priors).
  - We may use different values depending on the fishing area, time of the year, etc.
- Assume there are no other types of fish (exclusivity and exhaustivity):
  P(w1) + P(w2) = 1

SLIDE 26

Prior Probabilities

- How can we make a decision with only the prior information?
  - Decide w1 if P(w1) > P(w2); otherwise decide w2.
- What is the probability of error for this decision?
  - P(error) = min{P(w1), P(w2)}
SLIDE 27

Class-Conditional Probabilities

- Let's try to improve the decision using the lightness measurement x.
  - Let x be a continuous random variable.
  - Define P(x|wj) as the class-conditional probability density (the probability density of x given that the state of nature is wj, for j = 1, 2).
  - P(x|w1) and P(x|w2) describe the difference in lightness between the populations of sea bass and salmon.
  - (The slide shows hypothetical class-conditional probability density functions for the two classes.)

SLIDE 28

Class-Conditional Probabilities

- How can we make a decision with only the class-conditional probabilities?
  - Decide w1 if P(x|w1) > P(x|w2); otherwise decide w2.
- Looks good, but the prior information is not used, which may degrade decision performance.
  - E.g., what happens if we know a priori that 99% of the fish are sea bass?
- Deciding based only on the class-conditional densities is also known as "maximum likelihood" classification.
SLIDE 29

Posterior Probabilities

- Suppose we know P(wj) and p(x|wj) for j = 1, 2, and we measure the lightness of a fish as the value x.
  - Define P(wj|x) as the a posteriori probability (the probability of the state of nature being wj given the measured feature value x).
  - We can use the Bayes formula to convert the prior probability to the posterior probability (see the numeric sketch below):
    P(wj|x) = p(x|wj) P(wj) / p(x),   where   p(x) = Σ_{j=1,2} p(x|wj) P(wj)
  - Here p(x|wj) is called the likelihood and p(x) is called the evidence.
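A small numeric sketch of the Bayes formula for the fish example; the priors and the Gaussian class-conditional densities below are made-up illustrative values, not from the slides.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical priors and class-conditional densities of the lightness feature.
priors = {"sea_bass": 0.6, "salmon": 0.4}                 # P(w1), P(w2)
params = {"sea_bass": (6.0, 1.0), "salmon": (4.0, 1.0)}   # (mean, std) of lightness

def posterior(x):
    """P(wj | x) via the Bayes formula: likelihood * prior / evidence."""
    joint = {w: normal_pdf(x, *params[w]) * priors[w] for w in priors}   # p(x|wj) P(wj)
    evidence = sum(joint.values())                                       # p(x)
    return {w: v / evidence for w, v in joint.items()}

post = posterior(5.2)
print(post, max(post, key=post.get))   # decide the class with the larger posterior
```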

SLIDE 30

Posterior Probabilities

- How can we make a decision after observing the value of x?
  - Decide w1 if P(w1|x) > P(w2|x); otherwise decide w2.
- Rewriting the rule gives
  - Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); otherwise decide w2.
- Note that, at every x, P(w1|x) + P(w2|x) = 1.
SLIDE 31

Probability of Error

- What is the probability of error for this decision?
  P(error|x) = P(w1|x) if we decide w2;   P(error|x) = P(w2|x) if we decide w1
- What is the average probability of error?
  P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx
- The Bayes decision rule minimizes this error because, at every x,
  P(error|x) = min{ P(w1|x), P(w2|x) }

SLIDE 32

Bayesian Decision Theory

- How can we generalize to
  - More than one feature? (replace the scalar x by the feature vector x)
  - More than two states of nature? (just a difference in notation)
  - Allowing actions other than just decisions? (allow the possibility of rejection)
  - Different risks in the decision? (define how costly each action is)
- Notation for the generalization
  - Let {w1, ..., wc} be the finite set of c states of nature (classes, categories).
  - Let {α1, ..., αa} be the finite set of a possible actions.
  - Let λ(αi|wj) be the loss incurred for taking action αi when the state of nature is wj.
  - Let x be the d-dimensional vector-valued random variable called the feature vector.

SLIDE 33

Conditional Risk

- Suppose we observe x and take action αi.
  - If the true state of nature is wj, we incur the loss λ(αi|wj).
  - The expected loss of taking action αi, also called the conditional risk, is (see the sketch below)
    R(αi|x) = Σ_{j=1}^{c} λ(αi|wj) P(wj|x)
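A minimal sketch of this computation; the loss matrix and the posteriors are made-up illustrative numbers.

```python
import numpy as np

# Conditional risk R(a_i | x) = sum_j loss[i, j] * posterior[j], as defined above.
loss = np.array([[0.0, 10.0],    # loss[i, j] = lambda(alpha_i | w_j)
                 [1.0,  0.0]])
posterior = np.array([0.3, 0.7])           # P(w_j | x) for the observed x

risk = loss @ posterior                     # R(alpha_i | x) for every action
best_action = int(np.argmin(risk))          # Bayes rule: take the minimum-risk action
print(risk, best_action)                    # [7.0, 0.3] -> action 1
```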

SLIDE 34

Conditional Risk

- We want to find the decision rule α(x) that minimizes the overall risk
  R = ∫ R(α(x)|x) p(x) dx
- The Bayesian decision rule minimizes the overall risk by selecting, for each x, the action αi for which R(αi|x) is minimum.
- The resulting minimum overall risk is called the Bayesian risk and is the best performance that can be achieved.

SLIDE 35

Conditional Risk

- Two-category classification example
  - Define
    - α1: deciding w1
    - α2: deciding w2
    - λij = λ(αi|wj)
  - The conditional risks can be written as
    R(α1|x) = λ11 P(w1|x) + λ12 P(w2|x)
    R(α2|x) = λ21 P(w1|x) + λ22 P(w2|x)

SLIDE 36

Conditional Risk

- Two-category classification example
  - The minimum-risk decision rule becomes
    Decide w1 if (λ21 - λ11) P(w1|x) > (λ12 - λ22) P(w2|x); otherwise decide w2.
  - This corresponds to deciding w1 if
    p(x|w1) / p(x|w2) > [(λ12 - λ22) / (λ21 - λ11)] [P(w2) / P(w1)],
    i.e., comparing the likelihood ratio to a threshold that is independent of the observation x.

SLIDE 37

Min-Error-Rate Classification

- Problem definition
  - Actions are decisions on classes (αi is deciding wi).
  - If action αi is taken and the true state of nature is wj, then the decision is correct if i = j and in error if i ≠ j.
  - We want to find a decision rule that minimizes the probability of error.
- Define the zero-one loss function (all errors are equally costly):
  λ(αi|wj) = 0 if i = j, and 1 if i ≠ j,   for i, j = 1, ..., c
- The conditional risk becomes
  R(αi|x) = Σ_{j=1}^{c} λ(αi|wj) P(wj|x) = Σ_{j≠i} P(wj|x) = 1 - P(wi|x)

SLIDE 38

Min-Error-Rate Classification

- Minimizing the risk requires maximizing P(wi|x) and results in the minimum-error decision rule:
  - Decide wi if P(wi|x) > P(wj|x) for all j ≠ i.
- The resulting error is called the Bayesian error.
  - This is the best performance that can be achieved.

SLIDE 39

Probabilistic Discriminant Functions

- Discriminant functions: a useful way of representing classifiers
  - gi(x), i = 1, ..., c
  - The classifier assigns a feature vector x to class wi if gi(x) > gj(x) for all j ≠ i.
  - For the classifier that minimizes conditional risk: gi(x) = -R(αi|x).
  - For the classifier that minimizes error: gi(x) = P(wi|x).

SLIDE 40

Probabilistic Discriminant Functions

- These functions divide the feature space into c decision regions R1, ..., Rc, separated by decision boundaries.
- Note that the results do not change if we replace every gi(x) by f(gi(x)), where f(·) is a monotonically increasing function (e.g., the logarithm).
  - This may lead to significant analytical and computational simplifications.

SLIDE 41

Discriminant Functions: Gaussian Density

- Discriminant functions for the Gaussian density, in the case of min-error-rate classification, can be written as (why?)
  gi(x) = ln p(x|wi) + ln P(wi),   with   p(x|wi) = N(μi, Σi),
  or equivalently (see the sketch below)
  gi(x) = -(1/2)(x - μi)^T Σi^{-1} (x - μi) - (d/2) ln 2π - (1/2) ln|Σi| + ln P(wi)
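A minimal numpy sketch of this discriminant; the means, covariances, and priors are made up for illustration.

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """g_i(x) = ln p(x|w_i) + ln P(w_i) for a multivariate Gaussian class model."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Illustrative two-class example.
x = np.array([1.0, 2.0])
classes = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),   # (mean, covariance, prior) of class 1
    (np.array([2.0, 2.0]), np.eye(2), 0.5),   # (mean, covariance, prior) of class 2
]
scores = [gaussian_discriminant(x, mu, sigma, p) for mu, sigma, p in classes]
print(int(np.argmax(scores)))   # assign x to the class with the largest g_i(x)
```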

SLIDE 42

Discriminant Functions: Gaussian Density

- Case 1: Σi = σ²I
  - Discriminant functions are linear:
    gi(x) = wi^T x + wi0,   where   wi = μi / σ²   and   wi0 = -(1/(2σ²)) μi^T μi + ln P(wi)
    (wi0 is the threshold or bias for the i-th category).
  - Decision boundaries are the hyperplanes gi(x) = gj(x), and can be written as
    wij^T (x - x0^(ij)) = 0,
    where   wij = μi - μj
    and   x0^(ij) = (1/2)(μi + μj) - [σ² / ||μi - μj||²] ln[P(wi)/P(wj)] (μi - μj)
  - The hyperplane separating Ri and Rj passes through the point x0^(ij) and is orthogonal to the vector wij.

SLIDE 43

Discriminant Functions: Gaussian Density

- Case 1: Σi = σ²I (figure)

SLIDE 44

Discriminant Functions: Gaussian Density

- Case 1: Σi = σ²I
  - A special case, when the priors P(wi) are the same for i = 1, ..., c, is the minimum-distance classifier, which uses the decision rule:
    assign x to wi*, where i* = arg min_{i=1,...,c} ||x - μi||

SLIDE 45

Discriminant Functions: Gaussian Density

- Case 2: Σi = Σ (shared covariance)
  - Discriminant functions are linear:
    gi(x) = wi^T x + wi0,   where   wi = Σ^{-1} μi   and   wi0 = -(1/2) μi^T Σ^{-1} μi + ln P(wi)
  - Decision boundaries can be written as
    wij^T (x - x0^(ij)) = 0,
    where   wij = Σ^{-1}(μi - μj)
    and   x0^(ij) = (1/2)(μi + μj) - [ln(P(wi)/P(wj)) / ((μi - μj)^T Σ^{-1} (μi - μj))] (μi - μj)
  - The hyperplane passes through x0^(ij) but is not necessarily orthogonal to the line between the means.

SLIDE 46

Discriminant Functions: Gaussian Density

- Case 2: Σi = Σ (figure)

SLIDE 47

Discriminant Functions: Gaussian Density

- Case 3: Σi arbitrary
  - Discriminant functions are quadratic:
    gi(x) = x^T Wi x + wi^T x + wi0,
    where   Wi = -(1/2) Σi^{-1},   wi = Σi^{-1} μi,
    and   wi0 = -(1/2) μi^T Σi^{-1} μi - (1/2) ln|Σi| + ln P(wi)
  - Decision boundaries are hyperquadrics.

SLIDE 48

Discriminant Functions: Gaussian Density

- Case 3: Σi arbitrary (figure)

SLIDE 49

Discriminant Functions: Gaussian Density

- Case 3: Σi arbitrary (figure)

SLIDE 50

Minimax Classification

- In many real-life applications, the prior probabilities may be unknown or time-varying, so we cannot perform Bayes-optimal classification.
- However, one may wish to minimize the maximum possible overall risk.
- The overall risk is
  R = ∫_{R1} [λ11 P(w1) p(x|w1) + λ12 P(w2) p(x|w2)] dx + ∫_{R2} [λ21 P(w1) p(x|w1) + λ22 P(w2) p(x|w2)] dx
- Using P(w2) = 1 - P(w1) and ∫_{R1} p(x|w1) dx = 1 - ∫_{R2} p(x|w1) dx, the risk can be written as
  R(P(w1), R1) = λ22 + (λ12 - λ22) ∫_{R1} p(x|w2) dx
                 + P(w1) [ (λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|w1) dx - (λ12 - λ22) ∫_{R1} p(x|w2) dx ]

SLIDE 51

Minimax Classification

- For a fixed R1, the overall risk is a linear function of P(w1), and its maximum occurs at P(w1) = 0 or P(w1) = 1.
  - Why should the line be tangent to R(P(w1), R1)?
- Over all possible regions R1, we are looking for the one that minimizes this maximum risk, i.e.,
  R1* = argmin_{R1} max_{P(w1)} R(P(w1), R1)

SLIDE 52

Minimax Derivation

- Another way to find R1 in minimax is to set the coefficient of P(w1) in R(P(w1), R1) to zero:
  (λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|w1) dx - (λ12 - λ22) ∫_{R1} p(x|w2) dx = 0
- If you get multiple solutions, choose the one that gives you the minimum risk.
- The resulting risk, R_mm = λ22 + (λ12 - λ22) ∫_{R1} p(x|w2) dx, is the minimax risk.

SLIDE 53

Neyman-Pearson Criterion

- If we do not know the prior probabilities, Bayes-optimal classification is not possible.
- Suppose the goal is to maximize the probability of detection while constraining the probability of false alarm to be less than or equal to a certain value.
  - E.g., in a radar system, a false alarm (assuming an enemy aircraft is approaching when it is not) may be acceptable, but it is very important to maximize the probability of detecting a real attack.
  - Based on this constraint (the Neyman-Pearson criterion) we can design a classifier.
  - Typically the decision boundaries must be adjusted numerically (for some distributions, such as the Gaussian, analytical solutions do exist).

SLIDE 54

Any Question?

End of Lecture 6. Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1/

SLIDE 55

Machine Learning

Classification II

Hamid R. Rabiee

Jafar Muhammadi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/

SLIDE 56

Agenda

- Linear Discriminant Functions (LDF)
- Multi-class problems
  - Linear machine
  - Complete linear separation
  - Pairwise linear separation
- Linear discriminant function design
  - Least Mean Squared Error method
  - Sum of Squared Error method
  - Ho-Kashyap method
  - Probabilistic methods

SLIDE 57

Linear Discriminant Functions (LDF)

- Definition
  - An LDF is a function that is a linear combination of the components of x:
    g(x) = w^T x + w0,
    where w is the weight vector and w0 is the bias, or threshold weight.
- A two-category classifier with a discriminant function of the above form uses the following rule (see the sketch below):
  - Decide w1 if g(x) > 0 and w2 if g(x) < 0
    (equivalently, decide w1 if w^T x > -w0 and w2 otherwise)
  - The value g(x) of the function at a certain point x is called the functional margin
  - If g(x) = 0, then x may be assigned to either class
  - The equation g(x) = 0 defines the decision surface that separates points assigned to category w1 from points assigned to category w2
  - When g(x) is linear, the decision surface is a hyperplane.
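A tiny sketch of this decision rule; the weight vector and bias below are arbitrary illustrative values (in practice they come from one of the learning methods discussed later).

```python
import numpy as np

# Hypothetical weight vector and bias.
w = np.array([2.0, -1.0])
w0 = 0.5

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return w @ x + w0

def classify(x):
    # Decide w1 if g(x) > 0, w2 if g(x) < 0 (g(x) = 0 may go to either class).
    return "w1" if g(x) > 0 else "w2"

print(classify(np.array([1.0, 1.0])))   # g = 2 - 1 + 0.5 = 1.5 > 0 -> w1
```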

SLIDE 58

Linear Discriminant Functions

- In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface
  - The decision boundary g(x) = 0 corresponds to a (d-1)-dimensional hyperplane in the d-dimensional x-space
  - The orientation of the surface is determined by the normal vector w, and its location is determined by the bias w0
- The Fisher method (LDA) can also be seen as a linear discriminant function.

SLIDE 59

Multi-Class Problems

- Suppose we have an n-class classification problem and we want to separate the classes with linear discriminant functions
  - Any ideas on how to use discriminant functions in this case?
  - There are many ways to do this.
- Using linear discriminant functions in multi-class problems:
  - Linear machines (one versus one)
  - Complete linear separation (one versus the rest)
  - Pairwise linear separation
- We introduce the above methods through illustrative examples in the next slides.

SLIDE 60

Case 1: Linear Machine

- Suppose a 3-class classification problem with three linear discriminant functions g1(x), g2(x), g3(x) of the features x1 and x2 (coefficients as given on the original slide), and use the following rule for classification (the linear machine rule; see the sketch below):
  x ∈ Ci  if  gi(x) > gj(x)  for all j ≠ i
- How do these classes partition the space?
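A sketch of the linear machine rule; since the slide's discriminant functions are not fully legible here, the three linear functions below are hypothetical stand-ins, not the slide's own.

```python
import numpy as np

# Hypothetical linear discriminant functions of x = (x1, x2).
def g1(x): return x[0] - x[1]
def g2(x): return x[0] + x[1] - 1.0
def g3(x): return -x[1]

def linear_machine(x):
    """Assign x to the class whose discriminant function is largest."""
    scores = [g1(x), g2(x), g3(x)]
    return int(np.argmax(scores)) + 1      # class index in {1, 2, 3}

print(linear_machine(np.array([0.2, 0.1])))
```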

SLIDE 61

Case 1: Linear Machine

- Each class partition can be obtained by solving two equations (gi(x) = gj(x) against the other two classes).
- The resulting partition of the space is shown in the figure.

SLIDE 62

More on Linear Machines

- In some texts, this is called one versus one (one against one).
- How many functions do we need for n classes? (n)
- The decision regions of a linear machine are convex, and this restriction limits the flexibility of the classifier.

SLIDE 63

Case 2: Complete Linear Separation

- Suppose a 3-class classification problem with three linear discriminant functions g1(x), g2(x), g3(x) of the features x1 and x2 (coefficients as given on the original slide), and use the following rule for classification (the complete linear separation rule):
  if gi(x) > 0 then x ∈ Ci,   and if gi(x) < 0 then x ∉ Ci
- How do these classes partition the space? Determine the undecided sub-spaces.

SLIDE 64

Case 2: Complete Linear Separation

- Each class partition can be obtained by solving three equations.
- The resulting partition of the space is shown in the figure.

SLIDE 65

More on Complete Linear Separation

- In some texts, this is called one versus the rest (one against all).
- If we have n classes, how many functions are needed? (n)
- Are the decision regions convex?
- Compare the undecided sub-spaces in the two cases.

SLIDE 66

Case 3: Pairwise Linear Separation

- Suppose a 3-class classification problem with pairwise linear discriminant functions g12(x), g13(x), g23(x) of the features x1 and x2 (coefficients as given on the original slide), and use the following rule for classification:
  x ∈ Ci  if  gij(x) > 0  for all j ≠ i,   with   gij(x) = -gji(x)
- How do these classes partition the space? Determine the undecided sub-spaces.

SLIDE 67

Case 3: Pairwise Linear Separation

- Each class partition can be obtained by solving two equations.
- The resulting partition of the space is shown in the figure.

SLIDE 68

More on Pairwise Linear Separation

- If we have n classes, how many functions are needed? (C(n, 2) = n(n-1)/2)
- Are the decision regions convex?
- Definition: a region Ri is convex iff
  y, z ∈ Ri  ⇒  λy + (1-λ)z ∈ Ri   for every λ ∈ [0, 1]
- Note that, since the discriminant functions are linear,
  gi(y) > gj(y) and gi(z) > gj(z) for all j ≠ i  ⇒  gi(λy + (1-λ)z) > gj(λy + (1-λ)z),
  so the decision regions are convex.

SLIDE 69

Linear Discriminant Functions

- Main problem
  - How do we create the discriminant functions for each class (how do we obtain w)?
- Many methods exist for this purpose, such as:
  - Error minimization methods
    - Least Mean Squared Error method (discussed in the next slides)
    - Sum of Squared Error method (discussed in the next slides)
    - Ho-Kashyap method (discussed in the next slides)
  - Fisher Linear Discriminant method (discussed in lecture 3)
  - Perceptron method (discussed in lecture 9)
  - Probabilistic methods (discussed in lecture 6)
  - etc.

SLIDE 70

Least Mean Squared Error

- We want to choose the w that minimizes the mean-squared-error criterion function:
  J(w) = E[ (y^(i) - w^T x^(i))^2 ]
- Setting the gradient to zero,
  ∂J(w)/∂w = -2 E[ x^(i) (y^(i) - w^T x^(i)) ] = 0
  ⇒  E[ x^(i) x^(i)T ] w = E[ x^(i) y^(i) ]
  ⇒  ŵ = E[ x^(i) x^(i)T ]^{-1} E[ x^(i) y^(i) ],
  where R_xx = E[ x^(i) x^(i)T ] and R_xy = E[ x^(i) y^(i) ] are estimated from the samples.
- Note: x_j^(i) means the j-th feature of the i-th sample x^(i); here we assume x^(i) has n different features.
- We can also use the gradient descent rule for updating w instead of solving analytically.

SLIDE 71

Sum of Squared Error

- SSE uses the sum of squared errors as the objective function:
  J(w) = Σ_{i=1}^{n} (w^T x^(i) - b^(i))^2
- Setting the gradient to zero,
  ∂J(w)/∂w = 2 Σ_{i=1}^{n} x^(i) (w^T x^(i) - b^(i)) = 0
  ⇒  (X^T X) w = X^T b
  ⇒  w = (X^T X)^{-1} X^T b
- Also known as the pseudo-inverse matrix method.
- Note: here we have n samples.

SLIDE 72

Sum of Squared Error

- Example
  - Find the SSE boundary for the given data points:
    c1: [(1, 2), (2, 0)],   c2: [(3, 1), (2, 3)]
  - Use augmented samples x = (1, x1, x2)^T so that g(x) = w^T x, and assume targets y = (1, 1, -1, -1)^T (+1 for c1, -1 for c2).
  - The pseudo-inverse solution (checked numerically below) is
    w = (X^T X)^{-1} X^T y = X^† y = (11/3, -4/3, -2/3)^T,
  - so the boundary is
    g(x) = 11/3 - (4/3) x1 - (2/3) x2 = 0
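The same example can be checked numerically with numpy's pseudo-inverse:

```python
import numpy as np

# SSE / pseudo-inverse solution for the worked example above.
X = np.array([[1, 1, 2],    # augmented samples [1, x1, x2]: class c1
              [1, 2, 0],
              [1, 3, 1],    # class c2
              [1, 2, 3]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)   # targets: +1 for c1, -1 for c2

w = np.linalg.pinv(X) @ y                    # w = (X^T X)^{-1} X^T y
print(w)                                     # ~ [ 3.667 -1.333 -0.667] = [11/3, -4/3, -2/3]
```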

SLIDE 73

Ho-Kashyap Method

- The main limitation of the SSE method is the lack of a guarantee that a separating hyperplane will be found in the linearly separable case
  - The SSE rule tries to minimize ||Xw - b||^2 (the sum of (w^T x^(i) - b^(i))^2 over the samples)
  - Finding a separating hyperplane depends on how suitably the target outputs b are selected
- If the two classes are linearly separable, there must exist vectors w and b such that Xw = b > 0
  - If b were known, the SSE solution for the separating hyperplane would be w = X^† b
  - Nevertheless, since b is unknown, one must solve the equation for both w and b
- A possible algorithm is the Ho-Kashyap procedure:
  1. Find the target values b with gradient descent
  2. Compute the weight vector w from the SSE solution
  3. Repeat 1 and 2 until convergence

SLIDE 74

Ho-Kashyap Method

- g(x) > 0 can be rewritten as g(x) = b, with b > 0
- How can we determine b?
- The objective function in this case is
  J(w, b) = ||Xw - b||^2
- The Ho-Kashyap method offers an iterative procedure for obtaining w and b, using the following steps (see the sketch below):
  - Keep b constant and optimize J with respect to w (using the b obtained in the last step); using the SSE solution,
    w(t+1) = X^† b(t)
  - Keep w constant and optimize J with respect to b (using the w obtained in the last step); the objective is to minimize
    J = ||Xw - b||^2 with respect to b
  - Using the gradient descent method,
    b(t+1) = b(t) + η (Xw(t) - b(t))
  - To maintain the constraint b > 0, we set the components of (Xw - b) in this rule to zero whenever they are negative; the rule then becomes
    b(t+1) = b(t) + η [ (Xw(t) - b(t)) + |Xw(t) - b(t)| ]
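A minimal sketch of this procedure under the assumptions above (normalized augmented samples, with class-2 rows negated, and a fixed learning rate η); the function and variable names are hypothetical.

```python
import numpy as np

def ho_kashyap(X, eta=0.5, n_iter=1000, tol=1e-3):
    """Ho-Kashyap sketch: alternate the SSE solution for w and a gradient step on b."""
    n = X.shape[0]
    b = np.ones(n)                       # initial targets, b > 0
    w = np.linalg.pinv(X) @ b
    for _ in range(n_iter):
        e = X @ w - b                    # error vector Xw - b
        e_plus = 0.5 * (e + np.abs(e))   # positive part: keeps b from shrinking below 0
        b = b + 2 * eta * e_plus
        w = np.linalg.pinv(X) @ b        # SSE (pseudo-inverse) solution for the new b
        if np.all(np.abs(e) < tol):      # converged: Xw ~ b > 0, a separating hyperplane
            break
    return w, b

# Toy usage on the SSE example above (class c2 samples are multiplied by -1).
X = np.array([[1, 1, 2], [1, 2, 0], [-1, -3, -1], [-1, -2, -3]], dtype=float)
w, b = ho_kashyap(X)
print(X @ w > 0)    # all True if a separating hyperplane was found
```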

SLIDE 75

Probabilistic Methods

- Maximum likelihood
  - gi(x) = p(x|wi)
- Bayesian classifier
  - gi(x) = P(wi|x)
  - gi(x) = p(x|wi) P(wi)
  - gi(x) = ln p(x|wi) + ln P(wi)
- Expected loss (conditional risk)
  - Uses the loss function λ(ai|wj): the loss incurred for taking action ai when the state of nature is wj
  - R(ai|x) = Σ_j λ(ai|wj) P(wj|x)
  - We choose the class (action) that minimizes R.

SLIDE 76

Any Question?

End of Lecture 7. Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1/