 
Measuring performance of a classifier

1. The performance of a hypothesis is measured by partitioning the data into:
   - a training set,
   - a validation set, disjoint from the training set,
   - a test set, disjoint from both the training and validation sets.
2. Problems with this approach:
   - The training and validation sets may be small and may contain exceptional instances, such as noise, that can mislead us.
   - The learning algorithm may depend on other random factors that affect accuracy, such as the initial weights of a neural network trained with backpropagation (BP). We must therefore train and test several times and average the results.
3. Important points:
   - The performance of a hypothesis estimated on the training set is conditioned on the data set used and cannot be used to compare algorithms in a domain-independent way.
   - The validation set is used for model selection, for comparing two algorithms, and for deciding when to stop learning.
   - To report the expected performance, we should use a separate test set that was not used during learning.

Hamid Beigy (Sharif University of Technology), Data Mining, Fall 1396

Error of a classifier

Definition (Sample error). The sample error, denoted $E_E(h)$, of hypothesis $h$ with respect to target concept $c$ and data sample $S$ of size $N$ is
$$E_E(h) = \frac{1}{N} \sum_{x \in S} I[c(x) \neq h(x)]$$
where $I[\cdot]$ is the indicator function (1 when its argument is true, 0 otherwise).

Definition (True error). The true error, denoted $E(h)$, of hypothesis $h$ with respect to target concept $c$ and distribution $\mathcal{D}$ is the probability that $h$ will misclassify an instance drawn at random according to $\mathcal{D}$:
$$E(h) = P_{x \sim \mathcal{D}}[c(x) \neq h(x)]$$

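The sample-error definition can be sketched in a few lines of Python. The target concept `c`, the hypothesis `h`, and the sample `S` below are hypothetical stand-ins chosen only for illustration; they do not come from the slides.

```python
def sample_error(h, c, S):
    """Fraction of instances in S on which hypothesis h disagrees with target c."""
    return sum(1 for x in S if h(x) != c(x)) / len(S)


# Hypothetical target concept and hypothesis over small integers.
c = lambda x: x >= 5   # target concept: positive for x >= 5
h = lambda x: x >= 7   # hypothesis: disagrees with c on x = 5 and x = 6
S = list(range(10))    # sample of size N = 10

print(sample_error(h, c, S))  # 0.2
```

The true error cannot be computed this way, because it depends on the unknown distribution over instances rather than on a finite sample.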
Notions of errors

1. The true error is the probability, under $\mathcal{D}$, of the region of the instance space $X$ where $c$ and $h$ disagree.
   [Figure: instance space X with the regions of target concept c and hypothesis h]
2. Our concern:
   - How can we estimate the true error $E(h)$ of a hypothesis from its sample error $E_E(h)$?
   - Can we bound the true error of $h$ given the sample error of $h$?

Outline

1. Introduction
2. Some performance measures of classifiers
3. Evaluating the performance of a classifier
4. Estimating true error
5. Confidence intervals
6. Paired t Test
7. ROC Curves

Some performance measures of classifiers

1. Error rate. The error rate is the fraction of incorrect predictions made by the classifier over the test set:
   $$E_E(h) = \frac{1}{N} \sum_{x \in S} I[c(x) \neq h(x)]$$
   The error rate is an estimate of the probability of misclassification.
2. Accuracy. The accuracy of a classifier is the fraction of correct predictions over the test set:
   $$\mathrm{Accuracy}(h) = \frac{1}{N} \sum_{x \in S} I[c(x) = h(x)] = 1 - E_E(h)$$
   Accuracy is an estimate of the probability of a correct prediction.

Some performance measures of classifiers

1. What can you say about an accuracy of 90%, or an error of 10%?
2. For example, if only 3-4% of the examples are from the negative class, an accuracy of 90% is clearly not acceptable: a classifier that always predicts the positive class already reaches 96-97% accuracy.
3. Confusion matrix:

                            Actual label
                            (+)     (-)
   Predicted label   (+)    TP      FP
                     (-)    FN      TN

   Here TP, FP, FN and TN are the numbers of true positives, false positives, false negatives and true negatives.
4. Given $C$ classes, the confusion matrix is a $C \times C$ table.

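To make the imbalance argument concrete, the following sketch counts the four confusion-matrix cells for a trivial classifier that always predicts the positive class. The label encoding (+1 / -1), the function name `confusion_counts`, and the 96/4 class split are assumptions chosen for illustration.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels encoded as +1 / -1."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == -1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == -1)
    tn = sum(1 for a, p in pairs if a == -1 and p == -1)
    return tp, fp, fn, tn


# Hypothetical test set: 96 positives, 4 negatives.
actual = [1] * 96 + [-1] * 4
# Trivial classifier: always predicts the positive class.
predicted = [1] * 100

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)

print(tp, fp, fn, tn)  # 96 4 0 0
print(accuracy)        # 0.96
```

The accuracy looks high even though no negative example is ever recognized (TN = 0), which is why the measures on the following slides look at the individual cells.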
Some performance measures of classifiers

1. Precision (positive predictive value). Precision is the proportion of predicted positives that are actually positive:
   $$\mathrm{Precision}(h) = \frac{TP}{TP + FP}$$
2. Recall (sensitivity). Recall is the proportion of actual positives that are predicted positive:
   $$\mathrm{Recall}(h) = \frac{TP}{TP + FN}$$
3. Specificity. Specificity is the proportion of actual negatives that are predicted negative:
   $$\mathrm{Specificity}(h) = \frac{TN}{TN + FP}$$

Some performance measures of classifiers

1. Balanced classification rate (BCR). The balanced classification rate is the average of recall (sensitivity) and specificity. It gives a more precise picture of classifier effectiveness than accuracy alone when the classes are imbalanced:
   $$\mathrm{BCR}(h) = \frac{1}{2} \left[ \mathrm{Specificity}(h) + \mathrm{Recall}(h) \right]$$
2. F-measure. The F-measure is the harmonic mean of precision and recall:
   $$\mathrm{F\text{-}Measure}(h) = \frac{2 \times \mathrm{Precision}(h) \times \mathrm{Recall}(h)}{\mathrm{Precision}(h) + \mathrm{Recall}(h)}$$

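The five measures defined on the last two slides can be computed directly from the confusion-matrix counts. This is a minimal sketch: the function names and the example counts are illustrative assumptions, and the code assumes every denominator is non-zero.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def bcr(tp, fp, fn, tn):
    """Balanced classification rate: mean of specificity and recall."""
    return 0.5 * (specificity(tn, fp) + recall(tp, fn))


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


# Hypothetical counts: TP = 40, FP = 10, FN = 20, TN = 30.
tp, fp, fn, tn = 40, 10, 20, 30

print(round(precision(tp, fp), 4))      # 0.8
print(round(recall(tp, fn), 4))         # 0.6667
print(round(specificity(tn, fp), 4))    # 0.75
print(round(bcr(tp, fp, fn, tn), 4))    # 0.7083
print(round(f_measure(tp, fp, fn), 4))  # 0.7273
```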
Evaluating the performance of a classifier

1. Hold-out method
2. Random sub-sampling
3. Cross validation method
4. Leave-one-out method
5. 5 × 2 cross validation method
6. Bootstrapping method

Hold-out method

1. Hold-out. Hold-out partitions the given data into two independent sets: a training set and a test set.
   [Figure: the total set of examples split into a training set and a test set]
   - Typically two-thirds of the data are allocated to the training set and the remaining one-third to the test set.
   - The training set is used to derive the model.
   - The test set is used to estimate the accuracy of the model.

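A hold-out split can be sketched as follows. The function name `hold_out_split`, the fixed seed, and the toy data are assumptions for illustration; the two-thirds default follows the typical proportion mentioned above.

```python
import random


def hold_out_split(data, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of data and split it into disjoint (train, test) lists."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


data = list(range(30))  # hypothetical dataset of 30 examples
train, test = hold_out_split(data)

print(len(train), len(test))  # 20 10
```

Shuffling before cutting matters when the data are stored in some meaningful order (for example, sorted by class), since an unshuffled cut would then give unrepresentative sets.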
Random sub-sampling method

1. Random sub-sampling. Random sub-sampling is a variation of the hold-out method in which hold-out is repeated $k$ times.
   [Figure: the total set of examples, with a different random selection of test examples in Experiments 1, 2 and 3]
   - The estimated error rate is the average of the error rates of the classifiers derived for the independently and randomly generated test partitions.
   - Random sub-sampling can produce better error estimates than a single train-and-test partition.

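The repetition-and-averaging step can be sketched like this. The learner `majority_error` (which simply predicts the most common training label) and the toy data are assumptions made so that the example is self-contained; any function that trains on one list and returns an error rate on another could be passed instead.

```python
import random


def majority_error(train, test):
    """Toy learner (illustrative assumption): predict the most common training label
    and return the resulting error rate on the test set."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y != majority) / len(test)


def random_subsampling(data, error_fn, k=10, train_fraction=2 / 3, seed=0):
    """Repeat a random hold-out split k times and average the test error rates."""
    rng = random.Random(seed)
    errors = []
    for _ in range(k):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        errors.append(error_fn(shuffled[:cut], shuffled[cut:]))
    return sum(errors) / k


# Hypothetical data: 70 examples of class 1 and 30 of class 0, as (x, y) pairs.
data = [(i, 1) for i in range(70)] + [(i, 0) for i in range(30)]

estimate = random_subsampling(data, majority_error)
print(estimate)
```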
Cross validation method

1. K-fold cross validation. The initial data are randomly partitioned into $K$ mutually exclusive subsets or folds, $S_1, S_2, \ldots, S_K$, each of approximately equal size.
   [Figure: the total set of examples, with a different fold used as the test examples in each of Experiments 1 to 4]
   - K-fold cross validation is similar to random sub-sampling.
   - Training and testing are performed $K$ times.
   - In iteration $k$, partition $S_k$ is used for testing and the remaining partitions are collectively used for training.
   - The accuracy is the percentage of test examples, pooled over all $K$ iterations, that are classified correctly.
   - The advantage of K-fold cross validation is that all the examples in the dataset are eventually used for both training and testing.

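The K-fold procedure can be sketched as below. The helper names, the threshold learner `train_threshold`, and the one-dimensional toy data are illustrative assumptions, not part of the slides; the learner assumes both classes appear in every training set.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k disjoint folds of near-equal size."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def k_fold_accuracy(data, train_fn, k=4, seed=0):
    """Pooled accuracy: correctly classified test examples over all folds / N."""
    folds = k_fold_indices(len(data), k, seed)
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [data[j] for j in range(len(data)) if j not in held_out]
        model = train_fn(train)
        correct += sum(1 for j in fold if model(data[j][0]) == data[j][1])
    return correct / len(data)


def train_threshold(train):
    """Toy learner (illustrative assumption): threshold halfway between class means."""
    xs0 = [x for x, y in train if y == 0]
    xs1 = [x for x, y in train if y == 1]
    t = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > t else 0


# Hypothetical data: label is 1 exactly when x >= 10.
data = [(x, 1 if x >= 10 else 0) for x in range(20)]

acc = k_fold_accuracy(data, train_threshold, k=4)
print(acc)
```

Each example is tested exactly once, so the pooled count divided by N is a valid accuracy over the whole dataset.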
Leave-one-out method

1. Leave-one-out. Leave-one-out is a special case of K-fold cross validation in which $K$ is set to the number of examples in the dataset.
   [Figure: the total set of examples, with a single test example held out in each of Experiments 1 to N]
   - As usual, the true error is estimated as the average error rate.
   - For a dataset with $N$ examples, perform $N$ experiments.
   - For each experiment, use $N - 1$ examples for training and the remaining example for testing.

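A leave-one-out loop is short enough to write directly. The 1-nearest-neighbour learner `train_1nn` and the six toy examples are assumptions chosen to keep the sketch self-contained.

```python
def train_1nn(train):
    """Toy learner (illustrative assumption): 1-nearest-neighbour on a single feature."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def leave_one_out_error(data, train_fn):
    """Train on N - 1 examples, test on the remaining one, repeat for every example."""
    errors = 0
    for i, (x, y) in enumerate(data):
        model = train_fn(data[:i] + data[i + 1:])
        if model(x) != y:
            errors += 1
    return errors / len(data)


# Hypothetical data: (feature, label) pairs.
data = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1)]

loo = leave_one_out_error(data, train_1nn)
print(loo)
```

With N examples this trains N models, so the method is usually reserved for small datasets or learners that are cheap to retrain.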
5 × 2 Cross validation method

1. The 5 × 2 cross validation method repeats 2-fold cross validation five times.
2. Training and testing are performed 10 times.
3. The estimated error rate is the average of the error rates of the classifiers derived for the independently and randomly generated test partitions.
[Figure: five replications; in each replication the data are split into two halves, and each half is used once for training and once for testing]

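The ten train/test runs of 5 × 2 cross validation can be sketched as follows. The toy learner `majority_error` and the data are illustrative assumptions; the function returns all ten error rates so that they can be averaged or reused by a later statistical test.

```python
import random


def majority_error(train, test):
    """Toy learner (illustrative assumption): predict the most common training label
    and return the resulting error rate on the test set."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y != majority) / len(test)


def five_by_two_cv(data, error_fn, seed=0):
    """Five replications of 2-fold cross validation: returns 10 error rates."""
    rng = random.Random(seed)
    errors = []
    for _ in range(5):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first, second = shuffled[:half], shuffled[half:]
        errors.append(error_fn(first, second))  # train on first half, test on second
        errors.append(error_fn(second, first))  # swap the roles of the two halves
    return errors


# Hypothetical data: 70 examples of class 1 and 30 of class 0.
data = [(i, 1) for i in range(70)] + [(i, 0) for i in range(30)]

errors = five_by_two_cv(data, majority_error)
print(len(errors))                # 10
print(sum(errors) / len(errors))  # averaged error estimate
```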
Bootstrapping method

1. Bootstrapping. The bootstrap uses sampling with replacement to form the training set.
   - Sample a dataset of $N$ instances $N$ times with replacement to form a new dataset of $N$ instances.
   - Use this new dataset as the training set. An instance may occur more than once in the training set.
   - Use the instances from the original dataset that do not occur in the new training set for testing.
   - This method trains the classifier on only about 63% of the distinct instances: the probability that a given instance is never picked in $N$ draws is $(1 - 1/N)^N \approx e^{-1} \approx 0.368$.

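The 63% figure can be checked empirically. In this sketch the function name `bootstrap_split`, the seed, and the dataset of 10,000 integers are assumptions for illustration.

```python
import random


def bootstrap_split(data, seed=0):
    """Draw N instances with replacement for training; unseen instances form the test set."""
    rng = random.Random(seed)
    n = len(data)
    chosen = [rng.randrange(n) for _ in range(n)]
    chosen_set = set(chosen)
    train = [data[i] for i in chosen]
    test = [data[i] for i in range(n) if i not in chosen_set]
    return train, test


data = list(range(10000))  # hypothetical dataset of distinct instances
train, test = bootstrap_split(data)

distinct_fraction = len(set(train)) / len(data)
print(len(train))         # 10000
print(distinct_fraction)  # close to 1 - 1/e, i.e. about 0.632
```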
Estimating true error

1. How well does $E_E(h)$ estimate $E(h)$?
   - Bias in the estimate. If the training/test set is small, the accuracy of the resulting hypothesis on that set is a poor estimator of its accuracy over future examples:
     $$\mathrm{bias} = \mathbb{E}[E_E(h)] - E(h)$$
     For an unbiased estimate, $h$ and $S$ must be chosen independently.
   - Variance in the estimate. Even with an unbiased $S$, $E_E(h)$ may still vary from $E(h)$. A smaller test set results in a greater expected variance.
2. Hypothesis $h$ misclassifies 12 of the 40 examples in $S$. What is $E(h)$?
3. We use the following experiment:
   - Choose a sample $S$ of size $N$ according to distribution $\mathcal{D}$.
   - Measure $E_E(h)$.
   - $E_E(h)$ is a random variable (i.e., the result of an experiment).
   - $E_E(h)$ is an unbiased estimator of $E(h)$.
4. Given the observed $E_E(h)$, what can we conclude about $E(h)$?

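The experiment described above can be simulated to see both properties at once. This is a sketch under stated assumptions: the true error is fixed at a hypothetical value p = 0.3, each test instance is misclassified independently with that probability, and the helper names are illustrative.

```python
import random


def simulate_sample_errors(p, n, trials, seed=0):
    """Repeat the experiment: draw n independent test instances, each misclassified
    with probability p, and record the sample error r / n of each trial."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(trials)]


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


p = 0.3  # hypothetical true error E(h)
small = simulate_sample_errors(p, n=40, trials=2000)
large = simulate_sample_errors(p, n=400, trials=2000)

# Both averages should land near p (unbiasedness) ...
print(mean(small), mean(large))
# ... while the spread should shrink as the test set grows.
print(variance(small), variance(large))
```

A single measurement such as 12 errors in 40 examples is one draw from the first of these two distributions, which is why it has to be reported together with a measure of its uncertainty.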
Distribution of error

1. $E_E(h)$ is a random variable governed by a binomial distribution: over repeated experiments with different randomly drawn samples $S$ of size $N$, the probability of observing $r$ misclassified examples is
   $$P(r) = \frac{N!}{r!\,(N - r)!} \, E(h)^r \, [1 - E(h)]^{N - r}$$
2. For example, for $N = 40$ and $E(h) = p = 0.2$:
   [Figure: binomial probability P(r) plotted against r for r = 0 to 40]

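The distribution in the example can be tabulated directly from the formula. The function name `binomial_pmf` is an illustrative choice; `math.comb` computes the binomial coefficient N! / (r! (N - r)!).

```python
from math import comb


def binomial_pmf(r, n, p):
    """Probability of observing exactly r misclassifications among n test examples
    when each is misclassified independently with probability p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)


n, p = 40, 0.2
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]

# The most probable number of errors sits at r = N * p.
mode = max(range(n + 1), key=lambda r: pmf[r])
print(mode)  # 8
```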
Distribution of error

1. For a binomial distribution, we have
   $$\mathbb{E}[r] = Np, \qquad \mathrm{Var}(r) = Np(1 - p)$$
   We have shown that the number of errors $r$, and hence $E_E(h) = r/N$, is governed by a binomial distribution.
2. Let $p$ be the probability that the result of a trial is a success (here, a misclassification).
3. $E_E(h)$ and $E(h)$ are
   $$E_E(h) = \frac{r}{N}, \qquad E(h) = p$$
   where
   - $N$ is the number of instances in the sample $S$,
   - $r$ is the number of instances from $S$ misclassified by $h$, and
   - $p$ is the probability of misclassifying a single instance drawn from $\mathcal{D}$.

Distribution of error

1. It can be shown that $E_E(h)$ is an unbiased estimator of $E(h)$.
2. Since $r$ is binomially distributed, its variance is $Np(1 - p)$, so $\mathrm{Var}[E_E(h)] = \mathrm{Var}(r)/N^2 = p(1 - p)/N$.
3. Unfortunately $p$ is unknown, but we can substitute our estimate $r/N$ for $p$.
4. In general, given $r$ errors in a sample of $N$ independently drawn test examples, the standard deviation of $E_E(h)$ is given by
   $$\sqrt{\mathrm{Var}[E_E(h)]} = \sqrt{\frac{p(1 - p)}{N}} \simeq \sqrt{\frac{E_E(h)\,(1 - E_E(h))}{N}}$$

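Applying the approximation to the earlier example, in which h misclassifies 12 of 40 examples, gives a concrete number. The function name `sample_error_std` is an illustrative choice.

```python
from math import sqrt


def sample_error_std(r, n):
    """Return the sample error r / n and its approximate standard deviation,
    substituting the sample error for the unknown true error p."""
    e = r / n
    return e, sqrt(e * (1 - e) / n)


e, sd = sample_error_std(12, 40)
print(round(e, 3), round(sd, 4))  # 0.3 0.0725
```

So the observed error of 30% comes with a standard deviation of roughly 7 percentage points, which is large enough that the true error could plausibly differ noticeably from 0.3.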
Confidence intervals

1. One common way to describe the uncertainty associated with an estimate is to give an interval within which the true value is expected to fall, along with the probability with which it is expected to fall into this interval.
2. How can we derive confidence intervals for $E(h)$ from the observed $E_E(h)$?