Data Mining Lecture 03: Introduction to Classification, Linear Classifiers - PowerPoint PPT Presentation



SLIDE 1

CISC 4631 Data Mining

Lecture 03:

  • Introduction to classification
  • Linear classifier

These slides are based on the slides by

  • Tan, Steinbach and Kumar (textbook authors)
  • Eamonn Keogh (UC Riverside)

SLIDE 2

Classification: Definition

  • Given a collection of records (training set)

– Each record contains a set of attributes; one of the attributes is the class.

  • Find a model for the class attribute as a function of the values of the other attributes.
  • Goal: previously unseen records should be assigned a class as accurately as possible.

– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
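The split-then-validate methodology above can be sketched in a few lines of Python. This is an illustrative sketch, not code from the lecture; `train_test_split` is a hypothetical helper defined here (not the scikit-learn function of the same name), and the records are the insect measurements that appear later in the deck.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle the records and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# (abdomen length, antennae length) -> class, from the insect table later in the deck
records = [((2.7, 5.5), "Grasshopper"), ((8.0, 9.1), "Katydid"),
           ((0.9, 4.7), "Grasshopper"), ((1.1, 3.1), "Grasshopper"),
           ((5.4, 8.5), "Katydid"),     ((2.9, 1.9), "Grasshopper"),
           ((6.1, 6.6), "Katydid"),     ((0.5, 1.0), "Grasshopper"),
           ((8.3, 6.6), "Katydid"),     ((8.1, 4.7), "Katydid")]

train, test = train_test_split(records)
print(len(train), len(test))  # 7 3 -> build the model on 7 records, validate on 3
```

The model is then built only on `train`, and its accuracy is estimated on `test`, which it never saw during learning.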

SLIDE 3

Illustrating Classification Task

[Diagram: Training Set → Learning algorithm → Learn Model (Induction) → Model → Apply Model (Deduction) → Test Set]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

SLIDE 4

Examples of Classification Task

  • Predicting tumor cells as benign or malignant
  • Classifying credit card transactions as legitimate or fraudulent
  • Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
  • Categorizing news stories as finance, weather, entertainment, sports, etc.

SLIDE 5

Classification Techniques

  • Decision Tree based Methods
  • Rule-based Methods
  • Memory based reasoning
  • Neural Networks
  • Naïve Bayes and Bayesian Belief Networks
  • Support Vector Machines
  • We will start with a simple linear classifier

SLIDE 6

Grasshoppers vs. Katydids

The Classification Problem

(informal definition)

Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.

Katydid or Grasshopper?

SLIDE 7

For any domain of interest, we can measure features:

  • Thorax Length
  • Abdomen Length
  • Antennae Length
  • Mandible Size
  • Spiracle Diameter
  • Leg Length
  • Color {Green, Brown, Gray, Other}
  • Has Wings?

SLIDE 8

We can store features in a database (My_Collection):

Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid
11         5.1             7.0              ???????

The classification problem can now be expressed as:

  • Given a training database (My_Collection), predict the class label of a previously unseen instance (insect 11 above).

SLIDE 9

[Scatter plot: Antenna Length vs. Abdomen Length, showing the Grasshopper and Katydid instances]

SLIDE 10

[Scatter plot: Antenna Length vs. Abdomen Length, showing the Grasshopper and Katydid instances]

Each of these data objects is called a(n)…

  • exemplar
  • (training) example
  • instance
  • tuple
SLIDE 11

We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game.

SLIDE 12

Examples of class A (left bar, right bar): (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B (left bar, right bar): (5, 2.5), (5, 2), (8, 3), (4.5, 3)

Problem 1

SLIDE 13

Examples of class A (left bar, right bar): (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B (left bar, right bar): (5, 2.5), (5, 2), (8, 3), (4.5, 3)

What class is this object, (8, 1.5)?

What about this one, (4.5, 7), A or B?

Problem 1

SLIDE 14

Examples of class A (left bar, right bar): (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B (left bar, right bar): (5, 2.5), (2, 5), (5, 3), (2.5, 3)

What class is this object, (8, 1.5)?

Problem 2

Oh! This one's hard!

SLIDE 15

Examples of class A (left bar, right bar): (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B (left bar, right bar): (5, 6), (7, 5), (4, 8), (7, 7)

What class is this object, (6, 6)?

Problem 3

This one is really hard! What is this, A or B?

SLIDE 16

Why did we spend so much time with this game? Because we wanted to show that almost all classification problems have a geometric interpretation, check out the next 3 slides…

SLIDE 17

Examples of class A (left bar, right bar): (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B (left bar, right bar): (5, 2.5), (5, 2), (8, 3), (4.5, 3)

Problem 1

Here is the rule again. If the left bar is smaller than the right bar, it is an A; otherwise it is a B.

[Scatter plot: Left Bar vs. Right Bar for Problem 1]
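The Problem 1 rule translates directly into code. This is an illustrative sketch (`classify_problem1` is a name chosen here, not from the lecture); the (left, right) pairs are the class examples from the slide.

```python
def classify_problem1(left, right):
    """Rule from the slide: if the left bar is smaller than the
    right bar, it is an A; otherwise it is a B."""
    return "A" if left < right else "B"

# Training examples from the slide, as (left bar, right bar) pairs
class_a = [(3, 4), (1.5, 5), (6, 8), (2.5, 5)]
class_b = [(5, 2.5), (5, 2), (8, 3), (4.5, 3)]

# The rule classifies every training example correctly
print(all(classify_problem1(l, r) == "A" for l, r in class_a))  # True
print(all(classify_problem1(l, r) == "B" for l, r in class_b))  # True
```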

SLIDE 18

Examples of class A (left bar, right bar): (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B (left bar, right bar): (5, 2.5), (2, 5), (5, 3), (2.5, 3)

Problem 2

[Scatter plot: Left Bar vs. Right Bar for Problem 2]

Let me look it up… here it is: the rule is, if the two bars are of equal size, it is an A. Otherwise it is a B.

SLIDE 19

Examples of class A (left bar, right bar): (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B (left bar, right bar): (5, 6), (7, 5), (4, 8), (7, 7)

Problem 3

[Scatter plot: Left Bar vs. Right Bar for Problem 3, axes from 0 to 100]

The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.
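The Problem 3 rule is nonlinear, but it is still a one-liner in code. This is an illustrative sketch (`classify_problem3` is a name chosen here); the example calls use pairs from the slide.

```python
def classify_problem3(left, right):
    """Rule from the slide: if the square of the sum of the two bars
    is less than or equal to 100, it is an A; otherwise it is a B."""
    return "A" if (left + right) ** 2 <= 100 else "B"

print(classify_problem3(4, 4))  # A: (4 + 4)^2 = 64  <= 100
print(classify_problem3(7, 7))  # B: (7 + 7)^2 = 196 > 100
print(classify_problem3(6, 6))  # B: (6 + 6)^2 = 144 > 100
```

Note that this boundary is the line left + right = 10 in the (left, right) plane, which is why the game still has a geometric interpretation.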

SLIDE 20

[Scatter plot: Antenna Length vs. Abdomen Length, showing the Grasshopper and Katydid instances]

SLIDE 21

[Scatter plot: Antenna Length vs. Abdomen Length, with the previously unseen instance projected among the Katydids and Grasshoppers]

  • We can “project” the previously unseen instance into the same space as the database.
  • We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.

previously unseen instance = (insect 11: abdomen 5.1, antennae 7.0, class ???????)

SLIDE 22

Simple Linear Classifier

If the previously unseen instance is above the line, then its class is Katydid; else its class is Grasshopper.

[Scatter plot: a line separating the Katydids (above) from the Grasshoppers (below)]

R.A. Fisher 1890-1962
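The slide does not give the equation of the separating line, so as a sketch we can learn one from the insect data with a perceptron. This is a different training procedure from Fisher's, used here only to illustrate what a linear decision boundary is; the function and variable names are chosen for this example.

```python
def train_perceptron(points, labels, epochs=1000, lr=0.1):
    """Learn weights (w1, w2, b) so that sign(w1*x + w2*y + b)
    matches the +1/-1 labels, by correcting each mistake in turn."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x, y), t in zip(points, labels):
            pred = 1 if w1 * x + w2 * y + b > 0 else -1
            if pred != t:
                w1 += lr * t * x
                w2 += lr * t * y
                b += lr * t
                errors += 1
        if errors == 0:  # every training point is on the right side of the line
            break
    return w1, w2, b

# (abdomen length, antennae length); +1 = Katydid, -1 = Grasshopper
points = [(2.7, 5.5), (8.0, 9.1), (0.9, 4.7), (1.1, 3.1), (5.4, 8.5),
          (2.9, 1.9), (6.1, 6.6), (0.5, 1.0), (8.3, 6.6), (8.1, 4.7)]
labels = [-1, 1, -1, -1, 1, -1, 1, -1, 1, 1]
w1, w2, b = train_perceptron(points, labels)

def classify(x, y):
    return "Katydid" if w1 * x + w2 * y + b > 0 else "Grasshopper"

print(classify(5.1, 7.0))  # classify the previously unseen instance
```

Because the two classes here are linearly separable, the perceptron is guaranteed to find a line that classifies all ten training instances correctly.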

SLIDE 23

Classification Accuracy

                                   Predicted class
                                   Class = Katydid (1)   Class = Grasshopper (0)
Actual   Class = Katydid (1)               f11                    f10
Class    Class = Grasshopper (0)           f01                    f00

             Number of correct predictions          f11 + f00
Accuracy   = ------------------------------ = ---------------------
             Total number of predictions      f11 + f10 + f01 + f00

             Number of wrong predictions           f10 + f01
Error rate = ------------------------------ = ---------------------
             Total number of predictions      f11 + f10 + f01 + f00
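The accuracy and error-rate formulas translate directly to code. The counts passed in below are hypothetical, chosen only to illustrate the computation.

```python
def accuracy_and_error(f11, f10, f01, f00):
    """Accuracy and error rate from the four confusion counts:
    f11 + f00 are the correct predictions, f10 + f01 the wrong ones."""
    total = f11 + f10 + f01 + f00
    return (f11 + f00) / total, (f10 + f01) / total

# hypothetical counts, for illustration only
acc, err = accuracy_and_error(f11=40, f10=10, f01=5, f00=45)
print(acc, err)  # 0.85 0.15
```

Note that accuracy and error rate always sum to 1, since every prediction is either correct or wrong.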

SLIDE 24

Confusion Matrix

  • In a binary decision problem, a classifier labels examples as either positive or negative.
  • Classifiers produce a confusion/contingency matrix, which shows four entities: TP (true positive), TN (true negative), FP (false positive), FN (false negative).

                        Positive (+)   Negative (-)
Predicted positive (Y)      TP             FP
Predicted negative (N)      FN             TN

Confusion Matrix
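A minimal sketch of building the four counts from paired actual/predicted labels; the function name and the label lists below are hypothetical, chosen only to illustrate the bookkeeping.

```python
def confusion_matrix(actual, predicted, positive="+"):
    """Count TP, FP, FN, TN from paired actual/predicted labels."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:          # predicted positive (Y)
            if a == positive:
                tp += 1            # actually positive  -> true positive
            else:
                fp += 1            # actually negative  -> false positive
        else:                      # predicted negative (N)
            if a == positive:
                fn += 1            # actually positive  -> false negative
            else:
                tn += 1            # actually negative  -> true negative
    return tp, fp, fn, tn

# hypothetical labels, for illustration
actual    = ["+", "+", "-", "-", "+", "-"]
predicted = ["+", "-", "-", "+", "+", "-"]
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 2)
```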

SLIDE 25

The simple linear classifier is defined for higher dimensional spaces…

SLIDE 26

… we can visualize it as being an n-dimensional hyperplane

SLIDE 27

It is interesting to think about what would happen in this example if we did not have the 3rd dimension…

SLIDE 28

We can no longer get perfect accuracy with the simple linear classifier… We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier. However, as we will later see, this is probably a bad idea…

SLIDE 29

[Three scatter plots: Problems 1, 2, and 3]

Which of the “Problems” can be solved by the Simple Linear Classifier?

1) Perfect 2) Useless 3) Pretty Good

Problems that can be solved by a linear classifier are called linearly separable.

SLIDE 30

A Famous Problem

  • R. A. Fisher’s Iris Dataset.
  • 3 classes
  • 50 of each class

The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width: Iris Setosa, Iris Versicolor, or Iris Virginica.

SLIDE 31

Setosa Versicolor Virginica

We can generalize the piecewise linear classifier to N classes by fitting N-1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned to approximately discriminate between Virginica and Versicolor.

If petal width > 3.272 – (0.325 * petal length) then class = Virginica
Elseif petal width…
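The first branch of the piecewise rule can be written directly in code. The slide truncates the remaining Elseif branches, so this sketch returns None rather than inventing them; `classify_iris` is a name chosen for this example.

```python
def classify_iris(petal_length, petal_width):
    """First branch of the slide's piecewise linear rule; the
    remaining Elseif branches are truncated on the slide."""
    if petal_width > 3.272 - (0.325 * petal_length):
        return "Virginica"
    return None  # the slide elides the rest of the rule

print(classify_iris(6.0, 2.0))  # Virginica (2.0 > 3.272 - 1.95 = 1.322)
```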

SLIDE 32
  • Predictive accuracy
  • Speed and scalability

– time to construct the model
– time to use the model
– efficiency in disk-resident databases

  • Robustness

– handling noise, missing values and irrelevant features, streaming data

  • Interpretability:

– understanding and insight provided by the model

We have now seen one classification algorithm, and we are about to see more. How should we compare them?
