Machine Learning
Classification: Introduction
Hamid R. Rabiee
Jafar Muhammadi, Nima Pourdamghani Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/
Sharif University of Technology, Computer Engineering Department, Machine Learning Course
Agenda
Introduction
Classification: A Two-Step Process
Evaluating Classification Methods
Classifier Performance
Performance Measures
Partitioning Methods
Introduction
Classification
predicts categorical class labels (discrete or nominal)
constructs a model from the training set and its class labels, and uses it to classify new data
Typical applications
Credit approval Target marketing Medical diagnosis Fraud detection
Classification: A Two-Step Process
Model construction
Each sample is assumed to belong to a predefined class, as determined by its class label.
The set of samples used for model construction is called the "training set".
The model is represented as classification rules, decision trees, a probabilistic model, mathematical formulae, etc.
Model usage
for classifying future or unknown objects
Estimate the accuracy of the model: the known label of each test sample is compared with the model's prediction; the accuracy rate is the percentage of test-set samples that are correctly classified by the model.
The test set must be independent of the training set, otherwise over-fitting will occur.
If the accuracy is acceptable, use the model to classify data samples whose class labels are not known.
Evaluating Classification Methods
Performance
classifier performance in predicting the class label: accuracy, {true positive, true negative}, {false positive, false negative}, …
Time Complexity
time to construct the model (training time): the model is constructed only once, so this can be large
time to use the model (classification time): must be tolerable; requires good data structures
Robustness
handling noise and missing values handling incorrect training data
Evaluating Classification Methods
Scalability
efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures: goodness of rules or compactness of classification rules
rule of thumb: the more compact the rules, the better the generalization
Performance Measures
Accuracy is not always a good measure of classifier performance (why?)
Suppose a "cancer detection" problem: if the vast majority of patients are healthy, a classifier that always predicts "healthy" reaches a very high accuracy yet never detects a single cancer case.
Presentation of Classifier Performance
Use a confusion matrix or a receiver-operating-characteristic (ROC) curve.
We can extract several performance measures from such a matrix (or curve):

              Predicted
              P    N
Real    P    TP   FN
        N    FP   TN
Performance Measures
ROC Example: ROC Space
A: Acc = 0.68   B: Acc = 0.50   C: Acc = 0.18   C': Acc = 0.82
Confusion matrices (each with 100 P, 100 N, 200 samples in total):
A:  TP = 63, FN = 37, FP = 28, TN = 72
B:  TP = 77, FN = 23, FP = 77, TN = 23
C:  TP = 24, FN = 76, FP = 88, TN = 12
C': TP = 76, FN = 24, FP = 12, TN = 88
Performance Measures
Accuracy: (TP + TN) / (#data)
Specificity: TN / (FP + TN)
Sensitivity: TP / (FN + TP)
Index of Merit: (Specificity + Sensitivity) / 2 = (TP% + TN%) / 2; also known as "percentage correct classifications"
Performance measured using test set results
The test set should be distinct from the training (learning) set.
Several methods are available to partition the data into separate training and testing sets, resulting in different estimates of the "true" index of merit.
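As a quick sketch, these formulas can be computed directly from the four confusion-matrix counts; the counts below are those of classifier A from the ROC-space example:

```python
# Confusion-matrix counts of classifier A from the ROC-space example
TP, FN, FP, TN = 63, 37, 28, 72

accuracy = (TP + TN) / (TP + TN + FP + FN)
specificity = TN / (FP + TN)
sensitivity = TP / (FN + TP)
# with balanced classes (100 P, 100 N) this equals the accuracy
index_of_merit = (specificity + sensitivity) / 2
```

Here accuracy = 0.675, specificity = 0.72, and sensitivity = 0.63.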
Data Partitioning
Goal: validating the classifier and its parameters
Choose the best parameter set
Idea: use a part of the training data as the validation set.
The validation set must be a good representative of the whole data.
How should we partition the training data?
Data Partitioning Methods
Holdout methods: Random Sampling
The data is randomly partitioned into two independent sets.
Typically the training set is twice the size of the test set (a 2/3-1/3 split).
Assumption: the data is uniformly distributed.
When the split is repeated, the true error estimate is obtained as the average of the separate estimates.
Holdout methods: Bootstrap
Resample with replacement n samples of the original data as the training set.
Some samples of the original data may be included several times in the bootstrap sample (on average about 63.2% of the samples are distinct).
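The bootstrap resampling above can be sketched as follows; treating the never-drawn (out-of-bag) samples as the test set is an assumption of this sketch, and the function name and seed are illustrative:

```python
import random

def bootstrap_split(data, seed=0):
    # draw n samples with replacement as the training set;
    # the never-drawn (out-of-bag) samples form the test set
    rng = random.Random(seed)
    n = len(data)
    drawn = [rng.randrange(n) for _ in range(n)]
    in_bag = set(drawn)
    train = [data[i] for i in drawn]
    test = [data[i] for i in range(n) if i not in in_bag]
    return train, test

train, test = bootstrap_split(list(range(1000)))
distinct_fraction = len(set(train)) / 1000   # close to 0.632 on average
```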
Data Partitioning Methods
Holdout methods: Drawbacks
In problems where we have a sparse dataset, we may not be able to afford the "luxury" of setting aside a portion of the dataset for testing.
Since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an "unfortunate" split.
Data Partitioning Methods
Cross-validation (k-fold; k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size.
At the i-th iteration, use Di as the test set and the remaining folds as the training set.
The mean of the measures obtained over the k iterations is used as the output performance measure.
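The k-fold procedure can be sketched as follows; `train_and_score` is a hypothetical callback (not from the slides) that trains on one index set and returns a score on the other:

```python
import random

def k_fold_indices(n, k, seed=0):
    # randomly partition indices 0..n-1 into k mutually
    # exclusive folds of approximately equal size
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, k, train_and_score):
    # at iteration i, fold D_i is the test set and the rest the training
    # set; the mean of the k scores is the reported performance measure
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / k
```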
Data Partitioning Methods
Divide the total dataset into three subsets:
Training data is used for learning the parameters of the model.
Validation data is not used for learning but for deciding what type of model and what amount of regularization works best.
Test data is used to get a final, unbiased estimate of how well the model works; we expect this estimate to be worse than the one on the validation data.
As before, the true error is estimated as the average error rate over the k runs: E = (1/k) Σ_{i=1}^{k} E_i
Data Partitioning Methods
Leave-one-out
k-fold cross-validation with k = the number of samples; suitable for small datasets.
As usual, the true error is estimated as the average error rate on the test examples: E = (1/N) Σ_{i=1}^{N} E_i
Data Partitioning Methods
Stratified cross-validation
folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
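A minimal sketch of stratified fold assignment; dealing each class's samples round-robin across folds is one simple way (an implementation choice, not prescribed by the slides) to keep per-fold class distributions close to the overall one:

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    # deal each class's indices round-robin across the k folds, so each
    # fold's class distribution approximates the overall one
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds
```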
How many fol
ds are need needed? ed?
With a large number of folds
+ The bias of the true error rate estimator will be small (the estimator will be very accurate)
With small number of folds
+ The number of experiments and, therefore, the computation time are reduced
+ The variance of the estimator will be small
In practice, the choice of the number of folds depends on the size of the dataset
For large datasets, even 3-Fold Cross Validation will be quite accurate For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible
Three-way data splits
If model selection and true error estimates are to be computed simultaneously, the data needs to be divided into three disjoint sets
Training set: a set of examples used for learning: to fit the parameters of the classifier Validation set: a set of examples used to tune the parameters of a classifier Test set: a set of examples used only to assess the performance of a fully-trained classifier
Why separate test and validation sets?
The error-rate estimate of the final model on the validation data will be biased (smaller than the true error rate), since the validation set is used to select the final model.
After assessing the final model with the test set, YOU MUST NOT tune the model any further.
Three-way data splits
Adapted from slides of Ricardo Gutierrez-Osuna
Any Question?
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1/
Bayesian Decision Theory
Hamid R. Rabiee
Jafar Muhammadi, Alireza Ghassemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/
Agenda
Bayesian Decision Theory
Prior Probabilities
Class-Conditional Probabilities
Posterior Probabilities
Probability of Error
Conditional Risk
Min-Error-Rate Classification
Probabilistic Discriminant Functions
Discriminant Functions: Gaussian Density
Minimax Classification
Neyman-Pearson Criterion
Bayesian Decision Theory
Bayesian Decision Theory is a fundamental statistical approach that quantifies the tradeoffs between various decisions using probabilities and costs that accompany such decisions.
First, we will assume that all probabilities are known. Then, we will study the cases where the probabilistic structure is not completely known.
Bayesian Decision Theory
We use the fish sorting example to illustrate these topics.
Fish sorting example revisited:
State of nature is a random variable. Define w as the type of fish we observe (state of nature, class) where w = w1 for sea bass, w = w2 for salmon. P(w1) is the a priori probability that the next fish is a sea bass. P(w2) is the a priori probability that the next fish is a salmon.
Prior Probabilities
Prior probabilities reflect our knowledge of how likely each type of fish will appear before we actually see it. How can we choose P(w1) and P(w2)?
Set P(w1) = P(w2) if they are equiprobable (uniform priors). May use different values depending on the fishing area, time of the year, etc.
Assume there are no other types of fish (exclusivity and exhaustivity):
P(w1) + P(w2) = 1
Prior Probabilities
How can we make a decision with only the prior information?
Decide w1 if P(w1) > P(w2); otherwise decide w2.
What is the probability of error for this decision?
P(error) = min{P(w1), P(w2)}
Class-Conditional Probabilities
Let’s try to improve the decision using the lightness measurement x.
Let x be a continuous random variable. Define p(x|wj) as the class-conditional probability density (the probability density of x given that the state of nature is wj, for j = 1, 2).
p(x|w1) and p(x|w2) describe the difference in lightness between the populations of sea bass and salmon.
(Figure: hypothetical class-conditional probability density functions for the two classes.)
Class-Conditional Probabilities
How can we make a decision with only the class-conditional probabilities?
Decide w1 if p(x|w1) > p(x|w2); otherwise decide w2.
This looks reasonable, but the prior information is not used, which may degrade decision performance:
e.g., what happens if we know a priori that 99% of the fish are sea bass?
This rule is also known as the maximum likelihood decision rule.
Posterior Probabilities
Suppose we know P(wj) and p(x|wj) for j = 1, 2, and measure the lightness x.
Define P(wj|x) as the a posteriori probability (the probability of the state of nature being wj given the measurement of feature value x).
We can use the Bayes formula to convert the prior probability to the posterior probability:
P(wj|x) = p(x|wj) P(wj) / p(x),  where  p(x) = Σ_{j=1}^{2} p(x|wj) P(wj)
Here p(x|wj) is called the likelihood and p(x) the evidence.
Posterior Probabilities
How can we make a decision after observing the value of x?
Decide w1 if P(w1|x) > P(w2|x); otherwise decide w2.
Rewriting the rule with the Bayes formula (the evidence p(x) cancels) gives:
Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2); otherwise decide w2.
Note that, at every x, P(w1|x) + P(w2|x) = 1.
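The posterior-based rule can be sketched numerically; the Gaussian lightness densities and priors below are made-up stand-ins for the fish example, not values from the slides:

```python
import math

def gaussian(x, mu, sigma):
    # 1-D normal density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Made-up class-conditional densities p(x|w) and priors P(w):
priors = {"sea bass": 0.6, "salmon": 0.4}
likelihood = {
    "sea bass": lambda x: gaussian(x, 7.0, 1.0),
    "salmon": lambda x: gaussian(x, 4.0, 1.0),
}

def posterior(x):
    # Bayes formula: P(w|x) = p(x|w) P(w) / p(x), p(x) = sum_w p(x|w) P(w)
    joint = {w: likelihood[w](x) * priors[w] for w in priors}
    evidence = sum(joint.values())
    return {w: joint[w] / evidence for w in joint}

def decide(x):
    # decide the state of nature with the larger posterior
    post = posterior(x)
    return max(post, key=post.get)
```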
Probability of Error
What is the probability of error for this decision?
P(error|x) = P(w1|x) if we decide w2, and P(w2|x) if we decide w1.
What is the average probability of error?
P(error) = ∫ P(error, x) dx = ∫ P(error|x) p(x) dx
The Bayes decision rule minimizes this error because P(error|x) = min{P(w1|x), P(w2|x)}.
Bayesian Decision Theory
How can we generalize to
More than one feature? (replace the scalar x by the feature vector x)
More than two states of nature? (just a difference in notation)
Allowing actions other than just decisions? (allow the possibility of rejection)
Different risks in the decision? (define how costly each action is)
Notation for the generalization:
Let {w1, …, wc} be the finite set of c states of nature (classes, categories).
Let {α1, …, αa} be the finite set of a possible actions.
Let λ(αi|wj) be the loss incurred for taking action αi when the state of nature is wj.
Let x be the d-dimensional vector-valued random variable called the feature vector.
Conditional Risk
Suppose we observe x and take action αi.
If the true state of nature is wj, we incur the loss λ(αi|wj). The expected loss of taking action αi is
R(αi|x) = Σ_{j=1}^{c} λ(αi|wj) P(wj|x)
It is also called the conditional risk.
Conditional Risk
We want to find the decision rule α(x) that minimizes the overall risk
R = ∫ R(α(x)|x) p(x) dx
The Bayesian decision rule minimizes the overall risk by selecting, for each x, the action αi for which R(αi|x) is minimum.
The resulting minimum overall risk is called the Bayesian risk and is the best performance that can be achieved.
Conditional Risk
Two-category classification example
Define:
α1: deciding w1; α2: deciding w2; λij = λ(αi|wj)
The conditional risks can be written as
R(α1|x) = λ11 P(w1|x) + λ12 P(w2|x)
R(α2|x) = λ21 P(w1|x) + λ22 P(w2|x)
Conditional Risk
Two-category classification example
The minimum-risk decision rule becomes:
Decide w1 if (λ21 − λ11) P(w1|x) > (λ12 − λ22) P(w2|x); otherwise decide w2.
This corresponds to deciding w1 if the likelihood ratio exceeds a threshold that is independent of the observation x:
p(x|w1) / p(x|w2) > [(λ12 − λ22) P(w2)] / [(λ21 − λ11) P(w1)]
Min-Error-Rate Classification
Problem definition:
Actions are decisions on classes (αi is deciding wi). If action αi is taken and the true state of nature is wj , then the decision is correct if i = j and in error if i≠j. We want to find a decision rule that minimizes the probability of error.
Define the zero-one loss function (all errors are equally costly):
λ(αi|wj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, c
The conditional risk becomes
R(αi|x) = Σ_{j=1}^{c} λ(αi|wj) P(wj|x) = Σ_{j≠i} P(wj|x) = 1 − P(wi|x)
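A small numeric sketch of conditional risk: with the zero-one loss, minimizing R(αi|x) picks the class with the largest posterior, while an asymmetric loss (the 10x figure below is made up) can overturn that choice:

```python
def conditional_risk(loss, posteriors):
    # R(a_i|x) = sum_j loss[i][j] * P(w_j|x)
    return [sum(loss[i][j] * p for j, p in enumerate(posteriors))
            for i in range(len(loss))]

def min_risk_action(loss, posteriors):
    risks = conditional_risk(loss, posteriors)
    return min(range(len(risks)), key=risks.__getitem__)

zero_one = [[0, 1], [1, 0]]        # all errors equally costly
asymmetric = [[0, 10], [1, 0]]     # made-up loss: missing w2 costs 10x
```

With posteriors (0.7, 0.3), the zero-one loss decides w1, but the asymmetric loss decides w2.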
Min-Error-Rate Classification
Minimizing the risk requires maximizing P(wi|x) and results in the minimum-error decision rule
Decide wi if P(wi|x) > P(wj |x) for all j≠i.
The resulting error is called the Bayesian error
This is the best performance that can be achieved.
Probabilistic Discriminant Functions
Discriminant functions: a useful way of representing classifiers
gi(x), i = 1, …, c
The classifier assigns a feature vector x to class wi if gi(x) > gj(x) for all j ≠ i.
For the classifier that minimizes the conditional risk: gi(x) = −R(αi|x).
For the classifier that minimizes the error: gi(x) = P(wi|x).
Probabilistic Discriminant Functions
These functions divide the feature space into c decision regions separated by decision boundaries (R1, . . . , Rc).
Note that the results do not change even if we replace every gi(x) by f(gi(x)) where f(·) is a monotonically increasing function (e.g., logarithm). This may lead to significant analytical and computational simplifications.
Discriminant Functions: Gaussian Density
The discriminant functions for the Gaussian density, in the case of min-error-rate classification, can be written as (why?):
gi(x) = ln p(x|wi) + ln P(wi), with p(x|wi) = N(μi, Σi), or
gi(x) = −(1/2)(x − μi)^t Σi⁻¹ (x − μi) − (d/2) ln 2π − (1/2) ln|Σi| + ln P(wi)
Discriminant Functions: Gaussian Density
Case 1: Σi = σ²I
The discriminant functions are linear:
gi(x) = wi^t x + wi0
where wi = μi / σ² and wi0 = −μi^t μi / (2σ²) + ln P(wi)  (wi0 is the threshold or bias for the i'th category).
The decision boundaries are the hyperplanes gi(x) = gj(x), which can be written as
wij^t (x − x0^(ij)) = 0
where wij = μi − μj and
x0^(ij) = (1/2)(μi + μj) − [σ² / ||μi − μj||²] ln[P(wi)/P(wj)] (μi − μj)
The hyperplane separating Ri and Rj passes through the point x0^(ij) and is orthogonal to the vector wij.
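A minimal sketch of the Case 1 linear discriminant; the means, priors, and σ² below are illustrative values, not from the slides:

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def make_g(mu, prior, sigma2):
    # g_i(x) = w_i . x + w_i0 with w_i = mu_i / sigma^2,
    # w_i0 = -mu_i . mu_i / (2 sigma^2) + ln P(w_i)
    w = [m / sigma2 for m in mu]
    w0 = -dot(mu, mu) / (2 * sigma2) + math.log(prior)
    return lambda x: dot(w, x) + w0

# Illustrative means and equal priors (made up for the sketch)
mus = [[0.0, 0.0], [4.0, 4.0]]
priors = [0.5, 0.5]
gs = [make_g(mu, p, 1.0) for mu, p in zip(mus, priors)]

def classify(x):
    # pick the class with the largest discriminant value
    return max(range(len(gs)), key=lambda i: gs[i](x))
```

With equal priors the decision switches at the midpoint between the means, i.e. the minimum-distance behaviour of the next slide.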
Discriminant Functions: Gaussian Density
Case 1: Σi = σ²I
The special case when the priors P(wi) are the same for i = 1, …, c is the minimum-distance classifier, which uses the decision rule: assign x to wi* where i* = arg min_i ||x − μi||, i = 1, …, c.
Discriminant Functions: Gaussian Density
Case 2: Σi = Σ
Discriminant functions are Where and Decision boundaries can be written as wij
T (x − x0 (ij)) = 0
Where and Hyperplane passes through x0
(ij) but is not necessarily orthogonal to the line
between the means.
T i i i0
g (x) w x w
1 i i
w
T 1 i0 i i i
w lnP(w )
ij i j
w
(ij) i i j i j T 1 j i j i j
P(w ) 1 1 x ( ) ln ( ) 2 P(w ) ( ) ( )
Discriminant Functions: Gaussian Density
Case 3: Σi arbitrary
The discriminant functions are quadratic:
gi(x) = x^t Wi x + wi^t x + wi0
where Wi = −(1/2) Σi⁻¹, wi = Σi⁻¹ μi, and
wi0 = −(1/2) μi^t Σi⁻¹ μi − (1/2) ln|Σi| + ln P(wi)
The decision boundaries are hyperquadrics.
Minimax Classification
In many real-life applications, prior probabilities may be unknown or time-varying, so we cannot have a Bayes-optimal classification. However, one may wish to minimize the maximum possible overall risk.
The overall risk is
R = ∫_{R1} [λ11 P(w1) p(x|w1) + λ12 P(w2) p(x|w2)] dx + ∫_{R2} [λ21 P(w1) p(x|w1) + λ22 P(w2) p(x|w2)] dx
Using P(w2) = 1 − P(w1) and ∫_{R1} p(x|w1) dx = 1 − ∫_{R2} p(x|w1) dx, the risk becomes a function of P(w1) and the decision region R1:
R(P(w1), R1) = λ22 + (λ12 − λ22) ∫_{R1} p(x|w2) dx
  + P(w1) [ (λ11 − λ22) + (λ21 − λ11) ∫_{R2} p(x|w1) dx − (λ12 − λ22) ∫_{R1} p(x|w2) dx ]
Minimax Classification
For a fixed R1, the overall risk is a linear function of P(w1), so the maximum error occurs at P(w1) = 0 or P(w1) = 1.
Why should the line be a tangent to R(P(w1), R1)?
Over all possible regions R1, we look for the one that minimizes this maximum error, i.e.,
R1* = argmin_{R1} max_{P(w1)} R(P(w1), R1)
Minimax Derivation
Another way to solve for R1 in minimax is to make the coefficient of P(w1) vanish, so that the risk becomes independent of the prior:
(λ11 − λ22) + (λ21 − λ11) ∫_{R2} p(x|w1) dx − (λ12 − λ22) ∫_{R1} p(x|w2) dx = 0
The resulting risk, R_mm = λ22 + (λ12 − λ22) ∫_{R1} p(x|w2) dx, is the minimax risk.
If you get multiple solutions, choose the one that gives you the minimum risk.
Neyman-Pearson Criterion
If we do not know the prior probabilities, Bayesian optimum classification is not possible.
Suppose the goal is to maximize the probability of detection while constraining the probability of false alarm to be at most a certain value.
E.g., in a radar system a false alarm (assuming an enemy aircraft is approaching while this is not the case) may be acceptable, but it is very important to maximize the probability of detecting a real attack.
Based on this constraint (the Neyman-Pearson criterion) we can design a classifier.
Typically the decision boundaries must be adjusted numerically (for some distributions, such as the Gaussian, analytical solutions exist).
Any Question?
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1/
Linear Discriminant Functions
Hamid R. Rabiee
Jafar Muhammadi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/
Agenda
Linear Discriminant Functions (LDF)
Multi-class problems
Linear machine
Complete linear separation
Pairwise linear separation
Linear Discriminant Function Design
Least Mean Squared Error Method
Sum of Squared Error Method
Ho-Kashyap Method
Probabilistic Methods
Linear Discriminant Functions (LDF)
Definition:
An LDF is a function that is a linear combination of the components of x:
g(x) = w^t x + w0
where w is the weight vector and w0 is the bias, or threshold weight.
A two-category classifier with a discriminant function of the above form uses the following rule:
Decide w1 if g(x) > 0 and w2 if g(x) < 0; equivalently, decide w1 if w^t x > −w0 and w2 otherwise.
The value g(x) for a given point x is called the functional margin.
If g(x) = 0, then x can be assigned to either class.
The equation g(x) = 0 defines the decision surface that separates points assigned to category w1 from points assigned to category w2.
When g(x) is linear, the decision surface is a hyperplane.
Linear Discriminant Functions
In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface
The decision boundary g(x) = 0 corresponds to a (d−1)-dimensional hyperplane in the d-dimensional x-space.
The orientation of the surface is determined by the normal vector w and the location of the surface is determined by the bias
We can see Fisher method (LDA) as a linear discriminant function, too.
Multi-class problems
Suppose we have an n-classes classification problem, and we want to separate them with linear discriminant functions
Do you have any idea how to use discriminant functions in this case?
We have many ways to do this.
Using linear discriminant functions in multi-class problems:
Linear machines (one versus one)
Complete linear separation (one versus the rest)
Pairwise linear separation
We introduce the above methods through illustrative examples in the next slides.
Case 1: Linear Machine
Suppose a 3-class classification problem in the (x1, x2) feature space with three linear discriminant functions g1(x), g2(x), g3(x), and use the following rule for classification (the linear machine rule):
x ∈ Ci iff gi(x) > gj(x) for all j ≠ i
How do these classes partition the space?
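A sketch of the linear machine rule; the slide's exact coefficients were lost in conversion, so the three discriminant functions below are hypothetical:

```python
# Three hypothetical linear discriminant functions over (x1, x2)
g = [
    lambda x1, x2: x1 - x2,
    lambda x1, x2: x1 + x2 - 1.0,
    lambda x1, x2: -x2,
]

def linear_machine(x1, x2):
    # assign x to C_i iff g_i(x) > g_j(x) for all j != i
    scores = [gi(x1, x2) for gi in g]
    return 1 + scores.index(max(scores))   # class label in {1, 2, 3}
```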
Case 1: Linear Machine
Each class region can be obtained by solving two equations. The result:
More on Linear Machines
In some texts, this is called one versus one (one against one).
How many functions do we need for n classes? (n)
The decision regions of a linear machine are convex, and this restriction limits the flexibility of the classifier.
Case 2: Complete Linear Separation
Suppose a 3-class classification problem with three linear discriminant functions g1(x), g2(x), g3(x), and use the following rule for classification (the complete linear separation rule):
x ∈ Ci iff gi(x) > 0 (and x ∉ Ci if gi(x) < 0)
How do these classes partition the space? Determine the undecided sub-spaces.
Case 2: Complete Linear Separation
Each class region can be obtained by solving three equations. The result:
More on Complete Linear Separation
In some texts, this is called one versus the rest (one against all).
If we have n classes, what is the number of needed functions? (n)
Are the decision regions convex?
Compare the undecided sub-spaces in the two cases.
Case 3: Pairwise Linear Separation
Suppose a 3-class classification problem with pairwise linear discriminant functions gij(x), where gij(x) = −gji(x), and use the following rule for classification:
x ∈ Ci iff gij(x) > 0 for all j ≠ i
How do these classes partition the space? Determine the undecided sub-spaces.
Case 3: Pairwise Linear Separation
Each class region can be obtained by solving two equations. The result:
More on Pairwise Linear Separation
If we have n classes, what is the number of needed functions? (C(n, 2))
Are the decision regions convex?
Definition: a region Ri is convex iff for every y, z ∈ Ri and every λ ∈ [0, 1], λy + (1 − λ)z ∈ Ri.
For a linear machine this holds: if gi(y) ≥ gj(y) and gi(z) ≥ gj(z) for all j ≠ i, then by linearity gi(λy + (1 − λ)z) ≥ gj(λy + (1 − λ)z), so the regions are convex.
Linear Discriminant Functions
Main problem
How do we create the discriminant function for each class (i.e., how do we obtain w)?
Many methods exist for this purpose, such as:
Error minimization methods:
Least Mean Squared Error Method (discussed in the next slides)
Sum of Squared Error Method (discussed in the next slides)
Ho-Kashyap Method (discussed in the next slides)
Fisher Linear Discriminant Method (discussed in lecture 3)
Perceptron Method (discussed in lecture 9)
Probabilistic Methods (discussed in lecture 6)
etc.
Least Mean Squared Error
We want to choose the w that minimizes the mean-squared-error criterion function:
J(w) = E[(y − g(x))²] = E[(y − w^t x)²]
Setting the gradient to zero,
∇J(w) = −2 E[x (y − w^t x)] = −2 E[x y] + 2 E[x x^t] w = 0
gives the analytical solution
ŵ = E[x x^t]⁻¹ E[x y]
Note: x^(i)_j denotes the j'th feature of the i'th sample, and each sample x^(i) is assumed to have n features; in practice the expectations E[x x^t] (the autocorrelation matrix Rx) and E[x y] (the cross-correlation vector Rxy) are estimated from the training samples.
We can also use the gradient descent rule to update w instead of solving analytically.
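As a sketch of the gradient-descent alternative, the per-sample update w ← w + η (y − w^t x) x can be run on the small dataset of the SSE example later in these slides; η and the epoch count are arbitrary choices:

```python
def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def lms_gd(samples, targets, eta=0.05, epochs=500):
    # per-sample gradient descent on J(w) = E[(y - w.x)^2]:
    # w <- w + eta * (y - w.x) * x
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            err = y - dot(w, x)
            w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

# Augmented samples [1, x1, x2] with +1/-1 targets
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 0.0], [1.0, 3.0, 1.0], [1.0, 2.0, 3.0]]
y = [1.0, 1.0, -1.0, -1.0]
w = lms_gd(X, y)
```

On this (consistent) system the iteration approaches the same solution as the analytical least-squares formula.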
Sum of Squared Error
SSE uses the sum of squared errors as the objective function:
J(w) = Σ_{i=1}^{n} (w^t x^(i) − b_i)² = ||Xw − b||²
Setting the gradient to zero,
∇J(w) = 2 X^t (Xw − b) = 0  ⇒  X^t X w = X^t b  ⇒  w = (X^t X)⁻¹ X^t b = X⁺ b
This is also known as the pseudo-inverse matrix method.
Note: here we have n samples, stacked as the rows of X.
Sum of Squared Error
Example
Find the SSE boundary for the given data points, with g(x) = w^t [1  x1  x2]^t:
c1: [(1,2), (2,0)]    c2: [(3,1), (2,3)]
The augmented data matrix (a leading 1 for the bias term) is
X = [[1,1,2], [1,2,0], [1,3,1], [1,2,3]]
Assuming the targets y = [1, 1, −1, −1]^t,
w = X⁺ y = (X^t X)⁻¹ X^t y = [11/3, −4/3, −2/3]^t
so the decision boundary is g(x) = 11/3 − (4/3) x1 − (2/3) x2.
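The worked example can be checked numerically; the sketch below solves the normal equations X^t X w = X^t y with a small Gaussian-elimination routine, used here in place of a library pseudo-inverse:

```python
def solve(A, b):
    # solve A w = b by Gaussian elimination with partial pivoting
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

X = [[1.0, 1.0, 2.0], [1.0, 2.0, 0.0], [1.0, 3.0, 1.0], [1.0, 2.0, 3.0]]
y = [1.0, 1.0, -1.0, -1.0]
XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
Xty = [sum(X[k][i] * y[k] for k in range(4)) for i in range(3)]
w = solve(XtX, Xty)   # approximately [11/3, -4/3, -2/3]
```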
Ho-Kashyap Method
The main limitation of SSE is the lack of any guarantee that a separating hyperplane will be found in the linearly separable case.
The SSE rule tries to minimize ||Xw − b||²; whether a separating hyperplane is found depends on how suitably the outputs b are selected.
If the two classes are linearly separable, there must exist vectors w and b such that Xw = b > 0.
If b were known, the SSE solution for the separating hyperplane would be w = X⁺b. Nevertheless, since b is unknown, one must solve for both w and b.
A possible algorithm is the Ho-Kashyap procedure:
Ho-Kashyap Method
g(x) > 0 can be rewritten as g(x) = b with b > 0.
How can we determine b?
The objective function in this case is J(w, b) = ||Xw − b||², minimized over both w and b subject to b > 0.
The Ho-Kashyap method obtains w and b iteratively, using the following steps:
Keeping b constant, optimize J with respect to w (using the b obtained in the last step). Using the pseudo-inverse solution from before: w(t+1) = X⁺ b(t).
Keeping w constant, optimize J with respect to b (using the w obtained in the last step). The gradient is ∇b J = −2(Xw − b), so the gradient descent rule gives b(t+1) = b(t) + η (X w(t) − b(t)).
To hold the constraint b > 0, the components of (Xw − b) that become negative are set to zero in this rule, so b never decreases:
b(t+1) = b(t) + 2η (X w(t) − b(t))⁺
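A sketch of the iteration on the earlier toy data; negating the class-2 rows so that separation means Xw > 0, and the step size η and iteration count, are choices made for this sketch:

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting (tiny helper)
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def ho_kashyap(X, eta=0.2, iters=100):
    d, n = len(X[0]), len(X)
    b = [1.0] * n                          # start with any b > 0
    w = [0.0] * d
    for _ in range(iters):
        # least-squares step: w = X+ b (via the normal equations)
        XtX = [[sum(r[i] * r[j] for r in X) for j in range(d)] for i in range(d)]
        Xtb = [sum(X[k][i] * b[k] for k in range(n)) for i in range(d)]
        w = solve(XtX, Xtb)
        # gradient step on b; only positive error components update b,
        # so b stays positive
        e = [sum(wi * xi for wi, xi in zip(w, X[k])) - b[k] for k in range(n)]
        b = [bk + 2 * eta * max(ek, 0.0) for bk, ek in zip(b, e)]
    return w, b

# Toy data: class-2 rows negated, so "Xw > 0" means linear separation
X = [[1.0, 1.0, 2.0], [1.0, 2.0, 0.0], [-1.0, -3.0, -1.0], [-1.0, -2.0, -3.0]]
w, b = ho_kashyap(X)
```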
Probabilistic Methods
Maximum likelihood
gi(x) = p(x|wi)
Bayesian Classifier
gi(x) = P(wi|x)
gi(x) = p(x|wi) P(wi)
gi(x) = ln p(x|wi) + ln P(wi)
Expected Loss (Conditional Risk)
Uses the loss function λ(αi|wj): the loss incurred for taking action αi when the state of nature is wj.
R(αi|x) = Σ_j λ(αi|wj) P(wj|x)
We must minimize R over the actions for each x.
Any Question?
Spring 2015
http://ce.sharif.edu/courses/93-94/2/ce717-1/