DATA MINING LECTURE 9 Classification Basic Concepts Decision - PowerPoint PPT Presentation

DATA MINING LECTURE 9 Classification Basic Concepts Decision Trees Evaluation

What is a hipster? • Examples of hipster look • A hipster is defined by facial hair

Hipster or Hippie? Facial hair alone is not enough to characterize hipsters

How to be a hipster There is a big set of features that defines a hipster

Classification • The problem of discriminating between different classes of objects • In our case: Hipster vs. Non-Hipster • Classification process: • Find examples for which you know the class (training set) • Find a set of features that discriminate between the examples within the class and outside the class • Create a function that given the features decides the class • Apply the function to new examples.

Catching tax-evasion
Tax-return data for year 2011:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A new tax return for 2012:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Is this a cheating tax return? An instance of the classification problem: learn a method for discriminating between records of different classes (cheaters vs non-cheaters)

What is classification? • Classification is the task of learning a target function f that maps attribute set x to one of the predefined class labels y • One of the attributes is the class attribute. In this case: Cheat • Two class labels (or classes): Yes (1), No (0)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Why classification? • The target function f is known as a classification model • Descriptive modeling: Explanatory tool to distinguish between objects of different classes (e.g., understand why people cheat on their taxes, or what makes a hipster) • Predictive modeling: Predict a class of a previously unseen record

Examples of Classification Tasks • Predicting tumor cells as benign or malignant • Classifying credit card transactions as legitimate or fraudulent • Categorizing news stories as finance, weather, entertainment, sports, etc. • Identifying spam email, spam web pages, adult content • Understanding if a web query has commercial intent or not • Classification is everywhere in data science • Big data has the answers to all questions.

General approach to classification • Training set consists of records with known class labels • Training set is used to build a classification model • A labeled test set of previously unseen data records is used to evaluate the quality of the model. • The classification model is applied to new records with unknown class labels
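The four steps above can be sketched as follows. This is a minimal illustration, not the lecture's method: the split (first 7 records for training, last 3 for testing) is an arbitrary assumption, and the "model" is a hypothetical majority-class baseline standing in for a real classifier. The helper names `fit_majority` and `evaluate` are invented for this sketch.

```python
from collections import Counter

# Tax-return records from the lecture: (refund, marital_status, income_k, cheat)
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Assumed split, for illustration only: first 7 records train, last 3 test.
train, test = records[:7], records[7:]


def fit_majority(train_records):
    """Build a trivial model that always predicts the most common training label."""
    majority = Counter(r[-1] for r in train_records).most_common(1)[0][0]
    return lambda attributes: majority


def evaluate(model, labeled_records):
    """Return the fraction of labeled records whose class the model predicts correctly."""
    correct = sum(model(r[:-1]) == r[-1] for r in labeled_records)
    return correct / len(labeled_records)


model = fit_majority(train)                      # build the model from the training set
test_accuracy = evaluate(model, test)            # evaluate on the labeled test set
new_prediction = model(("No", "Married", 80))    # apply to a record with unknown class

print(test_accuracy, new_prediction)
```

The baseline ignores the attributes entirely, so its test accuracy mainly shows why a held-out labeled test set is needed: a model can look reasonable on the training data and still miss most of the cheaters it has not seen.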

Illustrating Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

• Induction: the learning algorithm learns a model from the training set • Deduction: the model is applied to the test set

Evaluation of classification models • Counts of test records that are correctly (or incorrectly) predicted by the classification model • Confusion matrix

                   Predicted Class = 1   Predicted Class = 0
Actual Class = 1   f_11                  f_10
Actual Class = 0   f_01                  f_00

$$\text{Accuracy} = \frac{\text{\# correct predictions}}{\text{total \# of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

$$\text{Error rate} = \frac{\text{\# wrong predictions}}{\text{total \# of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
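As a small worked example of the two formulas, the sketch below counts the four confusion-matrix cells and derives accuracy and error rate from them. The `actual` and `predicted` label lists are made up for illustration, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count f_ij = number of records with actual class i predicted as class j."""
    counts = {"f11": 0, "f10": 0, "f01": 0, "f00": 0}
    for a, p in zip(actual, predicted):
        counts[f"f{a}{p}"] += 1
    return counts


def accuracy(c):
    """(f11 + f00) / (f11 + f10 + f01 + f00)"""
    return (c["f11"] + c["f00"]) / sum(c.values())


def error_rate(c):
    """(f10 + f01) / (f11 + f10 + f01 + f00)"""
    return (c["f10"] + c["f01"]) / sum(c.values())


# Hypothetical test labels (1 = positive class, 0 = negative class).
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts), error_rate(counts))
```

Accuracy and error rate always sum to 1, since every prediction falls in exactly one of the four cells.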

Classification Techniques • Decision Tree based Methods • Rule-based Methods • Memory based reasoning • Neural Networks • Naïve Bayes and Bayesian Belief Networks • Support Vector Machines

Decision Trees • Decision tree • A flow-chart-like tree structure • Internal node denotes a test on an attribute • Branch represents an outcome of the test • Leaf nodes represent class labels or class distribution

Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree

Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES

• Internal nodes (Refund, MarSt, TaxInc) are the splitting attributes • Branches are test outcomes • Leaves (YES, NO) are class labels

Another Example of Decision Tree

Training data: the same ten tax-return records (Tid, Refund, Marital Status, Taxable Income, Cheat).

MarSt
  Married: NO
  Single, Divorced: Refund
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
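The claim that more than one tree fits the same data can be checked directly. The sketch below hand-codes both example trees as plain functions (the names `tree_a` and `tree_b` are hypothetical) and counts their errors on the ten training records. One assumption is flagged in the code: the slides label the income branches "< 80K" and "> 80K" and do not say where exactly 80K goes, so this sketch sends it to the upper branch.

```python
# Training records from the lecture: (refund, marital_status, income_k, cheat)
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def income_leaf(income_k):
    # Assumption: exactly 80K is treated like "> 80K"; the slides leave it open.
    return "No" if income_k < 80 else "Yes"


def tree_a(refund, marital, income_k):
    """First example tree: Refund at the root, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital == "Married":
        return "No"
    return income_leaf(income_k)


def tree_b(refund, marital, income_k):
    """Second example tree: MarSt at the root, then Refund, then TaxInc."""
    if marital == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return income_leaf(income_k)


errors_a = sum(tree_a(*r[:3]) != r[3] for r in records)
errors_b = sum(tree_b(*r[:3]) != r[3] for r in records)
print(errors_a, errors_b)
```

Both trees classify all ten training records correctly, so the training data alone does not single out one of them.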

Decision Tree Classification Task • Induction: a tree induction algorithm learns a decision tree (the model) from the training set (Tid 1 to 10, with attributes Attrib1, Attrib2, Attrib3 and a known Class) • Deduction: the decision tree is applied to the test set (Tid 11 to 15, Class unknown)

Apply Model to Test Data

Test Data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Decision tree:

Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES

• Start from the root of the tree. • Refund = No: follow the No branch to the MarSt node • MarSt = Married: follow the Married branch to the leaf NO • Assign Cheat to “No”
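The traversal can be written as a short loop over a tree stored as nested nodes. This is one possible representation, not the lecture's: the dictionary layout, the `outcome` functions, and the name `classify` are assumptions made for this sketch. The slides do not specify which branch an income of exactly 80K takes, so the code sends it to an assumed ">= 80K" branch; the test record never reaches that node.

```python
def income_outcome(value_k):
    # Assumption: exactly 80K goes to the upper branch; the slides leave it open.
    return "< 80K" if value_k < 80 else ">= 80K"


def identity(value):
    return value


# Leaves are plain class-label strings; internal nodes are dictionaries.
TAX_INC = {
    "attribute": "TaxInc",
    "outcome": income_outcome,
    "branches": {"< 80K": "No", ">= 80K": "Yes"},
}
MAR_ST = {
    "attribute": "MarSt",
    "outcome": identity,
    "branches": {"Married": "No", "Single": TAX_INC, "Divorced": TAX_INC},
}
ROOT = {
    "attribute": "Refund",
    "outcome": identity,
    "branches": {"Yes": "No", "No": MAR_ST},
}


def classify(node, record):
    """Walk from the root to a leaf, recording each test outcome on the way."""
    path = []
    while isinstance(node, dict):
        outcome = node["outcome"](record[node["attribute"]])
        path.append(f"{node['attribute']} = {outcome}")
        node = node["branches"][outcome]
    return node, path


label, path = classify(ROOT, {"Refund": "No", "MarSt": "Married", "TaxInc": 80})
print(label, path)
```

For the test record, the loop performs two tests (Refund, then MarSt) and stops at a leaf without ever looking at the income.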

Decision Tree Classification Task • Induction: a tree induction algorithm learns a decision tree (the model) from the training set • Deduction: the decision tree is applied to the test set

Tree Induction • Goal: Find a tree that has low classification error on the training data (training error) • Finding the best decision tree (lowest training error) is NP-hard • Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. • Many algorithms: • Hunt’s Algorithm (one of the earliest) • CART • ID3, C4.5 • SLIQ, SPRINT
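A single greedy step can be sketched as follows: score each candidate attribute test on the current records and keep the best one. The slide does not name the criterion, so this sketch assumes weighted Gini impurity, one common choice. The candidate tests, including the 80K income threshold borrowed from the example tree, are also assumptions. A full induction algorithm would repeat this step recursively on each resulting group.

```python
from collections import Counter, defaultdict

# Training records from the lecture's tax-return example.
records = [
    {"Refund": "Yes", "MarSt": "Single", "TaxInc": 125, "Cheat": "No"},
    {"Refund": "No", "MarSt": "Married", "TaxInc": 100, "Cheat": "No"},
    {"Refund": "No", "MarSt": "Single", "TaxInc": 70, "Cheat": "No"},
    {"Refund": "Yes", "MarSt": "Married", "TaxInc": 120, "Cheat": "No"},
    {"Refund": "No", "MarSt": "Divorced", "TaxInc": 95, "Cheat": "Yes"},
    {"Refund": "No", "MarSt": "Married", "TaxInc": 60, "Cheat": "No"},
    {"Refund": "Yes", "MarSt": "Divorced", "TaxInc": 220, "Cheat": "No"},
    {"Refund": "No", "MarSt": "Single", "TaxInc": 85, "Cheat": "Yes"},
    {"Refund": "No", "MarSt": "Married", "TaxInc": 75, "Cheat": "No"},
    {"Refund": "No", "MarSt": "Single", "TaxInc": 90, "Cheat": "Yes"},
]


def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, test):
    """Average impurity of the groups produced by a test, weighted by group size."""
    groups = defaultdict(list)
    for row in rows:
        groups[test(row)].append(row["Cheat"])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


# Assumed candidate tests; a real algorithm would also search over thresholds.
candidate_tests = {
    "Refund": lambda row: row["Refund"],
    "MarSt": lambda row: row["MarSt"],
    "TaxInc < 80K": lambda row: row["TaxInc"] < 80,
}

scores = {name: weighted_gini(records, test) for name, test in candidate_tests.items()}
best = min(scores, key=scores.get)  # greedy choice: lowest impurity after the split
print(scores, best)
```

Under this assumed criterion the three-way split on MarSt scores lowest, which matches the root of the second example tree. A different criterion or a different candidate set could pick another attribute, which is one reason several trees can fit the same data.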
