Data Mining Classification: Alternative Techniques (Imbalanced Class Problem)


  1. Data Mining Classification: Alternative Techniques
     Imbalanced Class Problem
     Introduction to Data Mining, 2nd Edition, by Tan, Steinbach, Karpatne, Kumar (10/05/2020)

     Class Imbalance Problem: many classification problems have skewed classes (more records from one class than another), for example:
     – Credit card fraud
     – Intrusion detection
     – Defective products on a manufacturing assembly line
     – COVID-19 test results on a random sample

  2. Challenges
     – Evaluation measures such as accuracy are not well suited for imbalanced classes.
     – Detecting the rare class is like finding a needle in a haystack.

     Confusion Matrix:

                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes       a           b
                    Class=No        c           d

     a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

  3. Accuracy, the most widely used metric:

         Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

     Problem with Accuracy: consider a 2-class problem with
     – Number of Class=No examples = 990
     – Number of Class=Yes examples = 10

  4. Problem with Accuracy
     With 990 Class=No and 10 Class=Yes examples, a model that predicts everything to be Class=No gives:

                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes       0           10
                    Class=No        0          990

     Accuracy = 990/1000 = 99%.
     – This is misleading because the model does not detect a single Class=Yes example.
     – Detecting the rare class is usually what matters (e.g., frauds, intrusions, defects).
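The numbers above can be checked with a few lines of Python (a minimal sketch; the variable names are mine):

```python
# The slide's scenario: 990 Class=No and 10 Class=Yes examples,
# with a trivial model that predicts Class=No for everything.
y_true = ["No"] * 990 + ["Yes"] * 10
y_pred = ["No"] * 1000  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == "Yes" and p == "Yes"
                     for t, p in zip(y_true, y_pred))

print(accuracy)        # 0.99 -- looks excellent
print(true_positives)  # 0 -- yet no rare-class example is detected
```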

  5. Which model is better?

     Model A:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes       0           10
                    Class=No        0          990

     Model B:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      10            0
                    Class=No       90          900

     And a second comparison:

     Model A:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes       5            5
                    Class=No        0          990

     Model B:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      10            0
                    Class=No       90          900

  6. Alternative Measures

         Precision (p) = a / (a + c)
         Recall (r)    = a / (a + b)
         F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

     Example:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      10            0
                    Class=No       10          980

         Precision (p) = 10 / (10 + 10) = 0.5
         Recall (r)    = 10 / (10 + 0)  = 1
         F-measure (F) = (2 × 1 × 0.5) / (1 + 0.5) ≈ 0.67
         Accuracy      = 990 / 1000 = 0.99
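The same formulas written directly in terms of the confusion-matrix cells (a minimal sketch; the function name is mine):

```python
def prf(a, b, c, d):
    """Precision, recall, and F-measure from confusion-matrix
    cells a=TP, b=FN, c=FP, d=TN (the slide's notation)."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * a / (2 * a + b + c)  # identical to 2*r*p/(r+p)
    return precision, recall, f_measure

# The slide's example: a=10, b=0, c=10, d=980
p, r, f = prf(10, 0, 10, 980)
print(p, r, round(f, 2))  # 0.5 1.0 0.67
```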

  7. Alternative Measures (continued)
     A second example:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes       1            9
                    Class=No        0          990

         Precision (p) = 1 / (1 + 0) = 1
         Recall (r)    = 1 / (1 + 9) = 0.1
         F-measure (F) = (2 × 0.1 × 1) / (1 + 0.1) ≈ 0.18
         Accuracy      = 991 / 1000 = 0.991

     A balanced example:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      40           10
                    Class=No       10           40

         Precision = Recall = F-measure = Accuracy = 0.8

  8. Alternative Measures (continued)

     Model A:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      40           10
                    Class=No       10           40

         Precision = Recall = F-measure = Accuracy = 0.8

     Model B:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      40           10
                    Class=No     1000         4000

         Precision ≈ 0.04, Recall = 0.8, F-measure ≈ 0.08, Accuracy ≈ 0.8

     Measures of Classification Performance:
     – α is the probability that we reject the null hypothesis when it is true. This is a Type I error, or a false positive (FP).
     – β is the probability that we accept the null hypothesis when it is false. This is a Type II error, or a false negative (FN).

  9. Alternative Measures (continued)

     Model A:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      40           10
                    Class=No       10           40

         Precision (p) = 0.8, TPR = Recall (r) = 0.8, FPR = 0.2,
         F-measure (F) = 0.8, Accuracy = 0.8, TPR/FPR = 4

     Model B:
                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      40           10
                    Class=No     1000         4000

         Precision (p) ≈ 0.038, TPR = Recall (r) = 0.8, FPR = 0.2,
         F-measure (F) ≈ 0.07, Accuracy ≈ 0.8, TPR/FPR = 4

     Three matrices with the same precision (0.5) but different TPR and FPR:

                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      10           40
                    Class=No       10           40
         Precision = 0.5, TPR = Recall = 0.2, FPR = 0.2, F-measure ≈ 0.28

                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      25           25
                    Class=No       25           25
         Precision = 0.5, TPR = Recall = 0.5, FPR = 0.5, F-measure = 0.5

                                PREDICTED CLASS
                                Class=Yes   Class=No
     ACTUAL CLASS   Class=Yes      40           10
                    Class=No       40           10
         Precision = 0.5, TPR = Recall = 0.8, FPR = 0.8, F-measure ≈ 0.61
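A sketch that computes these measures for models A and B above (the function name is mine), showing how the two models share the same TPR/FPR ratio while precision and F-measure separate them:

```python
def measures(tp, fn, fp, tn):
    """The slide's measures for one confusion matrix."""
    return {
        "precision": tp / (tp + fp),
        "tpr": tp / (tp + fn),        # TPR is the same as recall
        "fpr": fp / (fp + tn),
        "f": 2 * tp / (2 * tp + fn + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

A = measures(tp=40, fn=10, fp=10, tn=40)
B = measures(tp=40, fn=10, fp=1000, tn=4000)

# Both models share TPR = 0.8 and FPR = 0.2 (so TPR/FPR = 4),
# but precision and F-measure expose how much worse B really is.
print(round(A["precision"], 3), round(B["precision"], 3))  # 0.8 0.038
print(round(A["f"], 2), round(B["f"], 2))                  # 0.8 0.07
```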

  10. ROC (Receiver Operating Characteristic)
      – A graphical approach for displaying the trade-off between detection rate and false alarm rate.
      – Developed in the 1950s in signal detection theory to analyze noisy signals.
      – An ROC curve plots TPR against FPR:
        · The performance of a model is represented as a point on the ROC curve.
        · Changing the classifier's threshold parameter changes the location of the point.

      Notable (TPR, FPR) points on an ROC curve:
      – (0, 0): declare everything to be the negative class
      – (1, 1): declare everything to be the positive class
      – (1, 0): ideal
      – Diagonal line: random guessing
      – Below the diagonal line: the prediction is the opposite of the true class

  11. ROC (Receiver Operating Characteristic) (continued)
      To draw an ROC curve, the classifier must produce a continuous-valued output:
      – The outputs are used to rank test records, from the most likely positive-class record to the least likely.
      – Many classifiers produce only discrete outputs (i.e., the predicted class). How can continuous-valued outputs be obtained from decision trees, rule-based classifiers, neural networks, Bayesian classifiers, k-nearest neighbors, or SVMs?

      Example: Decision Trees
      [Figure: a decision tree whose leaves provide continuous-valued outputs]
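One common way to get such scores from a tree (my own toy illustration, not an example from the slides) is to let each leaf output the fraction of positive training records that landed in it. A one-split "stump" shows the idea:

```python
def fit_stump(xs, ys, split):
    """Fit a one-split decision stump: each leaf stores the fraction
    of positive (label 1) training records that fell into it."""
    left = [y for x, y in zip(xs, ys) if x <= split]
    right = [y for x, y in zip(xs, ys) if x > split]
    return (sum(left) / len(left), sum(right) / len(right))

def score(x, split, leaves):
    """Continuous-valued output: the positive fraction of x's leaf."""
    return leaves[0] if x <= split else leaves[1]

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 1, 0, 1, 1, 1]  # 1 = positive class
leaves = fit_stump(xs, ys, split=4)
print(leaves)  # per-leaf positive fractions, usable as ranking scores
```

Test records can then be ranked by `score(x, 4, leaves)`, which is exactly what an ROC curve needs.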

  12. ROC Curve Example
      [Figure: class distributions along x with threshold t, and the resulting ROC curve]
      – A 1-dimensional data set containing 2 classes (positive and negative).
      – Any point located at x > t is classified as positive.
      – At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88.

  13. Using ROC for Model Comparison
      – No model consistently outperforms the other: M1 is better for small FPR, while M2 is better for large FPR.
      – Area Under the ROC Curve (AUC): ideal model → area = 1; random guessing → area = 0.5.

      How to construct an ROC curve:
      • Use a classifier that produces a continuous-valued score for each instance; the more likely the instance is to be in the + class, the higher the score.
      • Sort the instances in decreasing order of score.
      • Apply a threshold at each unique value of the score.
      • Count TP, FP, TN, FN at each threshold, then
        TPR = TP / (TP + FN) and FPR = FP / (FP + TN).

      Instance   Score   True Class
          1      0.95        +
          2      0.93        +
          3      0.87        -
          4      0.85        -
          5      0.85        -
          6      0.85        +
          7      0.76        -
          8      0.53        +
          9      0.43        -
         10      0.25        +
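The procedure above, applied to the slide's ten instances (a sketch; '+' is encoded as 1 and '-' as 0):

```python
# The slide's ten (score, true class) pairs.
data = [(0.95, 1), (0.93, 1), (0.87, 0), (0.85, 0), (0.85, 0),
        (0.85, 1), (0.76, 0), (0.53, 1), (0.43, 0), (0.25, 1)]

P = sum(y for _, y in data)   # total positives
N = len(data) - P             # total negatives

# Apply a threshold at each unique score, predicting '+' when score >= t.
roc = []
for t in sorted({s for s, _ in data}, reverse=True):
    tp = sum(1 for s, y in data if s >= t and y == 1)
    fp = sum(1 for s, y in data if s >= t and y == 0)
    roc.append((fp / N, tp / P))  # one (FPR, TPR) point per threshold

print(roc)  # starts at (0.0, 0.2), ends at (1.0, 1.0)
```

Note that the three instances tied at score 0.85 produce a single threshold, so the curve jumps from (0.2, 0.4) to (0.6, 0.6) in one step.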
