
DATA MINING LECTURE 9 Classification Basic Concepts Decision Trees Evaluation - PowerPoint PPT Presentation



  1. DATA MINING LECTURE 9 Classification Basic Concepts Decision Trees Evaluation

  2. What is a hipster? • Examples of hipster look • A hipster is defined by facial hair

  3. Hipster or Hippie? Facial hair alone is not enough to characterize hipsters

  4. How to be a hipster There is a big set of features that defines a hipster

  5. Classification
• The problem of discriminating between different classes of objects
• In our case: Hipster vs. Non-Hipster
• Classification process:
  • Find examples for which you know the class (the training set)
  • Find a set of features that discriminate examples inside the class from those outside it
  • Create a function that, given the features, decides the class
  • Apply the function to new examples
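The four steps above can be sketched with a toy classifier in plain Python. The feature (beard length) and the threshold rule are invented for illustration; real classifiers learn far richer decision functions.

```python
# Toy illustration of the classification workflow:
# labeled examples -> discriminating feature -> decision function -> new examples.

# 1. Examples for which we know the class (the training set):
#    (beard_length_cm, label) pairs — a made-up feature for illustration.
training_set = [(0.0, "non-hipster"), (0.5, "non-hipster"),
                (3.0, "hipster"), (4.0, "hipster")]

# 2./3. Build a decision function from the feature: here, a single
#       threshold placed at the mean of the training feature values.
threshold = sum(x for x, _ in training_set) / len(training_set)

def classify(beard_length_cm):
    return "hipster" if beard_length_cm > threshold else "non-hipster"

# 4. Apply the function to new, unlabeled examples.
print(classify(3.5))   # -> hipster
print(classify(0.2))   # -> non-hipster
```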

  6. Catching tax-evasion
An instance of the classification problem: learn a method for discriminating between records of different classes (cheaters vs. non-cheaters).

Tax-return data for year 2011:

 Tid  Refund  Marital Status  Taxable Income  Cheat
  1   Yes     Single          125K            No
  2   No      Married         100K            No
  3   No      Single          70K             No
  4   Yes     Married         120K            No
  5   No      Divorced        95K             Yes
  6   No      Married         60K             No
  7   Yes     Divorced        220K            No
  8   No      Single          85K             Yes
  9   No      Married         75K             No
 10   No      Single          90K             Yes

A new tax return for 2012. Is this a cheating tax return?

 Refund  Marital Status  Taxable Income  Cheat
 No      Married         80K             ?

  7. What is classification?
• Classification is the task of learning a target function f that maps an attribute set x to one of the predefined class labels y
• One of the attributes is the class attribute; in this case: Cheat
• Two class labels (or classes): Yes (1), No (0)

(The slide shows the same tax-return table as slide 6.)

  8. Why classification? • The target function f is known as a classification model • Descriptive modeling: Explanatory tool to distinguish between objects of different classes (e.g., understand why people cheat on their taxes, or what makes a hipster) • Predictive modeling: Predict a class of a previously unseen record

  9. Examples of Classification Tasks • Predicting tumor cells as benign or malignant • Classifying credit card transactions as legitimate or fraudulent • Categorizing news stories as finance, weather, entertainment, sports, etc. • Identifying spam email, spam web pages, adult content • Understanding whether a web query has commercial intent or not Classification is everywhere in data science. Big data has the answers to all questions.

  10. General approach to classification • Training set consists of records with known class labels • Training set is used to build a classification model • A labeled test set of previously unseen data records is used to evaluate the quality of the model. • The classification model is applied to new records with unknown class labels

  11. Illustrating Classification Task

Training Set:

 Tid  Attrib1  Attrib2  Attrib3  Class
  1   Yes      Large    125K     No
  2   No       Medium   100K     No
  3   No       Small    70K      No
  4   Yes      Medium   120K     No
  5   No       Large    95K      Yes
  6   No       Medium   60K      No
  7   Yes      Large    220K     No
  8   No       Small    85K      Yes
  9   No       Medium   75K      No
 10   No       Small    90K      Yes

Induction: a learning algorithm is applied to the training set to learn a model.

Test Set:

 Tid  Attrib1  Attrib2  Attrib3  Class
 11   No       Small    55K      ?
 12   Yes      Medium   80K      ?
 13   Yes      Large    110K     ?
 14   No       Small    95K      ?
 15   No       Large    67K      ?

Deduction: the model is applied to the test set to predict the unknown class labels.

  12. Evaluation of classification models
• Counts of test records that are correctly (or incorrectly) predicted by the classification model
• Confusion matrix:

                          Predicted Class
                          Class = 1   Class = 0
 Actual Class  Class = 1     f11         f10
               Class = 0     f01         f00

 Accuracy   = # correct predictions / total # of predictions = (f11 + f00) / (f11 + f10 + f01 + f00)
 Error rate = # wrong predictions / total # of predictions   = (f10 + f01) / (f11 + f10 + f01 + f00)
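The confusion-matrix counts and the two metrics above can be computed directly from paired actual/predicted labels. The label vectors below are made up for illustration:

```python
# Count the four confusion-matrix cells from actual vs. predicted labels
# (1 = positive class, 0 = negative class), then derive the metrics.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]   # made-up test labels
predicted = [1, 0, 1, 0, 1, 0, 1, 0]   # made-up model predictions

f11 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
f10 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
f01 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
f00 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

total = f11 + f10 + f01 + f00
accuracy = (f11 + f00) / total      # correct predictions / all predictions
error_rate = (f10 + f01) / total    # wrong predictions / all predictions

print(accuracy, error_rate)         # 0.75 0.25
```

Note that accuracy and error rate always sum to 1, since every prediction is either correct or wrong.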

  13. Classification Techniques • Decision Tree based Methods • Rule-based Methods • Memory based reasoning • Neural Networks • Naïve Bayes and Bayesian Belief Networks • Support Vector Machines


  15. Decision Trees • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of the test • Leaf nodes represent class labels or class distribution

  16. Example of a Decision Tree

Model: Decision Tree. The splitting attributes are the internal nodes, each branch is a test outcome, and the leaves carry the class labels:

 Refund?
 ├─ Yes → NO
 └─ No → MarSt?
    ├─ Married → NO
    └─ Single, Divorced → TaxInc?
       ├─ < 80K → NO
       └─ > 80K → YES

Training data: the tax-return table from slide 6 (Tid, Refund, Marital Status, Taxable Income, Cheat).
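The tree on this slide translates directly into nested conditionals: each internal node becomes a test, each leaf a return value. This is a hand-translation for illustration; the slide leaves the TaxInc = 80K boundary unspecified, so 80K and above is mapped to YES here:

```python
# The slide's decision tree written as nested conditionals.
def cheat(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"                    # Refund = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                    # MarSt = Married -> leaf NO
    # MarSt = Single or Divorced: test Taxable Income.
    # Assumption: the slide shows "< 80K" vs "> 80K"; exactly 80K
    # is routed to YES here.
    return "No" if taxable_income < 80_000 else "Yes"

# Training record 5 (Refund=No, Divorced, 95K) is routed to leaf YES:
print(cheat("No", "Divorced", 95_000))   # -> Yes
```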

  17. Another Example of Decision Tree

 MarSt?
 ├─ Married → NO
 └─ Single, Divorced → Refund?
    ├─ Yes → NO
    └─ No → TaxInc?
       ├─ < 80K → NO
       └─ > 80K → YES

Trained on the same tax-return table as slide 6. There could be more than one tree that fits the same data!

  18. Decision Tree Classification Task
The same workflow as slide 11, with the model now a decision tree:
• Induction: a tree induction algorithm learns a decision tree from the training set (Tids 1-10).
• Deduction: the learned decision tree is applied to the test set (Tids 11-15) to predict the unknown class labels.

  19. Apply Model to Test Data
Start from the root of the tree.

Test data:

 Refund  Marital Status  Taxable Income  Cheat
 No      Married         80K             ?

 Refund?
 ├─ Yes → NO
 └─ No → MarSt?
    ├─ Married → NO
    └─ Single, Divorced → TaxInc?
       ├─ < 80K → NO
       └─ > 80K → YES


  24. Apply Model to Test Data

Test data:

 Refund  Marital Status  Taxable Income  Cheat
 No      Married         80K             ?

 Refund?
 ├─ Yes → NO
 └─ No → MarSt?            ← Refund = No: take this branch
    ├─ Married → NO        ← MarSt = Married: reach this leaf
    └─ Single, Divorced → TaxInc?
       ├─ < 80K → NO
       └─ > 80K → YES

Assign Cheat to “No”.
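The walkthrough on slides 19-24 can be traced in code: route the 2012 record through the tree and record each test taken before a leaf is reached. The record never reaches the TaxInc test, because MarSt = Married already decides it:

```python
# Trace the slide's walkthrough for the 2012 record
# (Refund=No, MarSt=Married, TaxInc=80K) through the decision tree.
def cheat_with_trace(refund, marital_status, taxable_income):
    path = [f"Refund = {refund}"]
    if refund == "Yes":
        return "No", path
    path.append(f"MarSt = {marital_status}")
    if marital_status == "Married":
        return "No", path
    path.append(f"TaxInc = {taxable_income}")
    return ("No", path) if taxable_income < 80_000 else ("Yes", path)

label, path = cheat_with_trace("No", "Married", 80_000)
print(path)    # ['Refund = No', 'MarSt = Married']
print(label)   # No
```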

  25. Decision Tree Classification Task
The full workflow once more: a tree induction algorithm learns a decision tree model from the training set (induction), and the decision tree is then applied to the test set to deduce the unknown class labels (deduction).

  26. Tree Induction
• Goal: find the tree that has low classification error on the training data (training error)
• Finding the best decision tree (lowest training error) is NP-hard
• Greedy strategy: split the records based on an attribute test that optimizes a certain criterion
• Many algorithms:
  • Hunt’s Algorithm (one of the earliest)
  • CART
  • ID3, C4.5
  • SLIQ, SPRINT
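The greedy step can be sketched as follows: at a node, try each candidate attribute, split the records on it, label each branch with its majority class, and count the resulting training errors. This is a minimal sketch of the idea only; real algorithms such as CART or C4.5 use impurity measures like Gini or entropy rather than raw training error:

```python
from collections import Counter

# Records from the slide-6 table, keeping two candidate split attributes:
# (Refund, Marital Status, Cheat).
records = [
    ("Yes", "Single", "No"),   ("No", "Married", "No"),
    ("No", "Single", "No"),    ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"),
    ("No", "Married", "No"),   ("No", "Single", "Yes"),
]

def split_error(attr_index):
    """Training errors if we split on one attribute and label each
    branch with its majority class."""
    branches = {}
    for rec in records:
        branches.setdefault(rec[attr_index], []).append(rec[-1])
    errors = 0
    for labels in branches.values():
        majority_count = Counter(labels).most_common(1)[0][1]
        errors += len(labels) - majority_count
    return errors

# Greedy step: evaluate each candidate attribute (0 = Refund, 1 = MarSt).
# On this data both splits happen to leave 3 training errors, so the
# tie would be broken arbitrarily (or by a finer criterion).
print(split_error(0), split_error(1))   # 3 3
```

A full induction algorithm applies this choice recursively: pick the best split, partition the records, and repeat on each branch until a stopping condition holds (e.g., all records in a branch share one class).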
