CS570 Introduction to Data Mining: Classification and Prediction


slide-1
SLIDE 1

CS570 Introduction to Data Mining

Classification and Prediction

Partial slide credits: Han and Kamber; Tan, Steinbach, Kumar

1

slide-2
SLIDE 2
  • Overview
  • Classification algorithms and methods
      Decision tree induction
      Bayesian classification
      kNN classification
      Support Vector Machines (SVM)
      Neural Networks
  • Regression
  • Evaluation and measures
  • Ensemble methods

2

slide-3
SLIDE 3

Example (Li Xiong): a training set of creatures described by Skin, Color, Size, and Flesh, labeled with a Conclusion (Safe / Dangerous). Recoverable rows:

Skin     Color   Size    Flesh   Conclusion
Hairy    Brown   Large   Hard    Safe
Hairy    Green   Large   Hard    Safe
Hairy    Green   Large   Soft    Safe
Smooth   …       Small   Hard    Dangerous
Smooth   Red     …       Soft    Dangerous
…        Red     Large   …       ?

3

slide-4
SLIDE 4

Classification
    predicts categorical class labels
    constructs a model based on the training set and uses it in classifying new data

Prediction (Regression)
    models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications
    Credit approval, target marketing, medical diagnosis, fraud detection

4

slide-5
SLIDE 5

Name     Age   Income   …   Credit
Clark    35    High     …   Excellent
Milton   38    High     …   Excellent
Neo      25    Medium   …   Fair
…        …     …        …   …

  • Classification rule:
      If age = “31...40” and income = high then credit_rating = excellent

  • Future customers
      Paul: age = 35, income = high ⇒ excellent credit rating
      John: age = 20, income = medium ⇒ fair credit rating

5
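To make the rule concrete, here is a minimal sketch (not from the slides) of applying the mined rule to the two future customers; the function name and the fallback "fair" rating are illustrative assumptions.

```python
# Hedged sketch: applying the mined classification rule to new customers.
# The rule and the two customers come from the slide; the function name and
# the fallback "fair" rating are assumptions made for illustration.

def predict_credit_rating(age, income):
    """Apply: if age = 31...40 and income = high then credit_rating = excellent."""
    if 31 <= age <= 40 and income == "high":
        return "excellent"
    return "fair"  # assumed default for customers the rule does not cover

print(predict_credit_rating(35, "high"))    # Paul -> excellent
print(predict_credit_rating(20, "medium"))  # John -> fair
```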

slide-6
SLIDE 6

Classification: A Two-Step Process

  • Model construction: describing a set of predetermined classes
      Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
      The set of tuples used for model construction is the training set
      The model is represented as classification rules, decision trees, or mathematical formulae

  • Model usage: for classifying future or unknown objects
      Estimate the accuracy of the model
          The known label of each test sample is compared with the classified result from the model
          The accuracy rate is the percentage of test set samples that are correctly classified by the model
          The test set is independent of the training set, otherwise over-fitting will occur
      If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

6
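A minimal sketch of the two-step process; scikit-learn and its iris dataset are illustrative assumptions, not prescribed by the slides. A model is constructed on the training set, then its accuracy is estimated on an independent test set.

```python
# Sketch of model construction + model usage with accuracy estimation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage -- compare known test labels with the model's predictions;
# the test set is kept independent of the training set
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```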

slide-7
SLIDE 7

Process (1): Model Construction

(Figure: a model is constructed from the labeled training set by a learning algorithm.)

7
slide-8
SLIDE 8

Process (2): Using the Model in Prediction

(Figure: the learned model is applied to test/new data.)

8
slide-9
SLIDE 9

Supervised vs. Unsupervised Learning

Supervised learning (classification)
    Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    New data is classified based on the training set

Unsupervised learning (clustering)
    The class labels of the training data are unknown
    Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

9

slide-10
SLIDE 10

Evaluating Classification Methods

Accuracy
Speed
    time to construct the model (training time)
    time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
    understanding and insight provided by the model
Other measures, e.g., goodness of rules, decision tree size, or compactness of classification rules

10

slide-11
SLIDE 11
  • Overview
  • Classification algorithms and methods
      Decision tree
      Bayesian classification
      kNN classification
      Support Vector Machines (SVM)
      Others
  • Evaluation and measures
  • Ensemble methods

11

slide-12
SLIDE 12

Training Dataset

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

12

slide-13
SLIDE 13

Output: A Decision Tree for buys_computer

(Figure: the induced tree splits on age at the root; age <= 30 leads to a test on student (no → no, yes → yes), age 31…40 predicts yes, and age > 40 leads to a test on credit_rating (excellent → no, fair → yes).)

13
slide-14
SLIDE 14

Algorithm for Decision Tree Induction

  • ID3 (Iterative Dichotomiser), C4.5, by Quinlan
  • CART (Classification and Regression Trees)
  • Basic algorithm (a greedy algorithm) – the tree is constructed by top-down recursive partitioning (see the sketch below)
      At start, all the training examples are at the root
      A test attribute is selected that “best” separates the data into partitions
      Samples are partitioned recursively based on the selected attributes
  • Conditions for stopping partitioning
      All samples for a given node belong to the same class
      There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
      There are no samples left

14
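The sketch below illustrates the greedy, top-down recursive partitioning and the stopping conditions; the `best_attribute` selection function and the (attribute-dict, label) data layout are assumptions for illustration, not the textbook's code.

```python
# Illustrative skeleton of top-down recursive partitioning with the stopping
# conditions from the slide. `best_attribute` stands in for any attribute
# selection measure (information gain, gain ratio, Gini index).
from collections import Counter

def build_tree(samples, attributes, best_attribute):
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:               # stop: all samples belong to one class
        return labels[0]
    if not attributes:                      # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(samples, attributes)     # pick the "best" split attribute
    node = {"attribute": attr, "children": {}}
    for value in {x[attr] for x, _ in samples}:    # partition on the chosen attribute
        subset = [(x, c) for x, c in samples if x[attr] == value]
        remaining = [a for a in attributes if a != attr]
        # the "no samples left" case cannot arise here, because branches are
        # created only for attribute values that actually occur in `samples`
        node["children"][value] = build_tree(subset, remaining, best_attribute)
    return node
```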

slide-15
SLIDE 15

Attribute Selection Measures

Idea: select the attribute that partitions the samples into the most homogeneous groups

Measures
    Information gain (ID3)
    Gain ratio (C4.5)
    Gini index (CART)

15

slide-16
SLIDE 16

Attribute Selection Measure: Information Gain (ID3/C4.5)

  • Select the attribute with the highest information gain
  • Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|
  • Information (entropy) needed to classify a tuple in D (before the split):

      Info(D) = − Σ_{i=1..m} pi log2(pi)

  • Information needed (after using A to split D into v partitions) to classify D:

      Info_A(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj)

  • Information gain – the difference between before and after splitting on attribute A:

      Gain(A) = Info(D) − Info_A(D)

16
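A small sketch of these formulas in code; the function names and the (attribute-dict, label) dataset layout are assumptions for illustration.

```python
# Entropy / information gain exactly as defined above, with log base 2.
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(samples, attr):
    """Info_A(D) = sum_j |D_j|/|D| * Info(D_j) for a split on attribute attr."""
    n = len(samples)
    partitions = {}
    for x, label in samples:                     # samples: (attribute-dict, label)
        partitions.setdefault(x[attr], []).append(label)
    return sum(len(p) / n * info(p) for p in partitions.values())

def gain(samples, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info([label for _, label in samples]) - info_after_split(samples, attr)
```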

slide-17
SLIDE 17

Attribute Selection: Information Gain (Example)

Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples), using the training data shown earlier.

Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Partitioning on age:

age      pi   ni   I(pi, ni)
<=30     2    3    0.971
31…40    4    0    0
>40      3    2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.

17
slide-18
SLIDE 18

Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute; we must determine the best split point for A
    Sort the values of A in increasing order
    Typically, the midpoint between each pair of adjacent values is considered as a possible split point
        (ai + ai+1)/2 is the midpoint between the values of ai and ai+1
    The point with the minimum expected information requirement for A is selected as the split-point for A

Split:
    D1 is the set of tuples in D satisfying A ≤ split-point, and
    D2 is the set of tuples in D satisfying A > split-point

18
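A sketch of the midpoint search for a continuous attribute; helper names are assumptions for illustration.

```python
# For a continuous attribute A: sort the values, evaluate the midpoint of
# every adjacent pair, and keep the split point with the minimum expected
# information requirement Info_A(D).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    best_info, best_split = float("inf"), None
    for i in range(len(pairs) - 1):
        split = (pairs[i][0] + pairs[i + 1][0]) / 2       # (a_i + a_{i+1}) / 2
        d1 = [c for v, c in pairs if v <= split]          # D1: A <= split-point
        d2 = [c for v, c in pairs if v > split]           # D2: A >  split-point
        if not d1 or not d2:                              # skip degenerate midpoints
            continue
        expected = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(pairs)
        if expected < best_info:
            best_info, best_split = expected, split
    return best_split, best_info
```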

slide-19
SLIDE 19

Attribute Selection Measure: Gain Ratio (C4.5)

The information gain measure is biased towards attributes with a large number of values (number of splits)

C4.5 uses gain ratio to overcome the problem (a normalization of information gain):

    SplitInfo_A(D) = − Σ_{j=1..v} (|Dj|/|D|) × log2(|Dj|/|D|)

    GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex. For income (4 low, 6 medium, 4 high tuples):

    SplitInfo_income(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14)

    gain_ratio(income) = 0.029/0.926 = 0.031

The attribute with the maximum gain ratio is selected as the splitting attribute

19
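A sketch of the gain-ratio normalization, building on the information-gain functions sketched earlier; names and data layout are assumptions.

```python
# C4.5 gain ratio: normalise information gain by the split information.
from collections import Counter
from math import log2

def split_info(samples, attr):
    """SplitInfo_A(D) = -sum_j |D_j|/|D| * log2(|D_j|/|D|)."""
    n = len(samples)
    sizes = Counter(x[attr] for x, _ in samples).values()
    return -sum((s / n) * log2(s / n) for s in sizes)

def gain_ratio(samples, attr, gain):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); `gain` is an information-gain
    function such as the one sketched earlier."""
    return gain(samples, attr) / split_info(samples, attr)
```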
slide-20
SLIDE 20

Attribute Selection Measure: Gini Index (CART)

  • If a data set D contains examples from n classes, the gini index gini(D) is defined as

      gini(D) = 1 − Σ_{j=1..n} pj²

    where pj is the relative frequency of class j in D
  • If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as

      gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

  • Reduction in impurity:

      Δgini(A) = gini(D) − gini_A(D)

  • The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node

20
slide-21
SLIDE 21

Gini Index: Example

  • Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”:

      gini(D) = 1 − (9/14)² − (5/14)² = 0.459

  • Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

      gini_{income ∈ {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2) = gini_{income ∈ {high}}(D)

    but gini_{income ∈ {medium,high}} is 0.30 and thus the best since it is the lowest

21
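A sketch reproducing the Gini arithmetic for this example; the 7/3 and 2/2 class counts inside the two income partitions are read off the training data shown earlier.

```python
# Gini index of D (9 "yes", 5 "no") and of the binary income split
# D1 = {low, medium} (10 tuples) vs D2 = {high} (4 tuples).
def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1 - sum((k / n) ** 2 for k in counts.values())

def gini_split(d1, d2):
    """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)."""
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

D = ["yes"] * 9 + ["no"] * 5
D1 = ["yes"] * 7 + ["no"] * 3          # income in {low, medium}
D2 = ["yes"] * 2 + ["no"] * 2          # income = high
print(round(gini(D), 3))               # 0.459
print(round(gini_split(D1, D2), 3))    # 0.443
```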

slide-22
SLIDE 22

Comparing Attribute Selection Measures

The three measures, in general, return good results, but:

Information gain:
    biased towards multivalued attributes

Gain ratio:
    tends to prefer unbalanced splits in which one partition is much smaller than the others

Gini index:
    biased towards multivalued attributes
    tends to favor tests that result in equal-sized partitions and purity in both partitions

22

slide-23
SLIDE 23

Other Attribute Selection Measures

  • CHAID: a popular decision tree algorithm, measure based on the χ2 test for independence
  • C-SEP: performs better than information gain and gini index in certain cases
  • G-statistic: has a close approximation to the χ2 distribution
  • MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
      The best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree
  • Multivariate splits (partition based on multiple variable combinations)
      CART: finds multivariate splits based on a linear combination of attributes
  • Which attribute selection measure is the best?
      Most give good results; none is significantly superior to the others

23

slide-24
SLIDE 24

Overfitting

Overfitting: an induced tree may overfit the training data
    Too many branches, some of which may reflect anomalies and noise

  • Tan, Steinbach, Kumar

24

slide-25
SLIDE 25

25

slide-26
SLIDE 26
  • Two approaches to avoid overfitting (see the pruning sketch below)

      Prepruning: halt tree construction early – do not split a node if this would result in the goodness measure falling below a threshold
          Difficult to choose an appropriate threshold

      Postpruning: remove branches from a “fully grown” tree
          Use a set of data different from the training data to decide which is the “best pruned tree”

      Occam's razor: prefer smaller decision trees (simpler theories) over larger ones

26
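As one concrete post-pruning illustration (an assumption, not the specific procedure on the slide), scikit-learn's cost-complexity pruning grows a full tree, enumerates pruned candidates, and lets held-out data pick the best pruned tree:

```python
# Post-pruning sketch: grow a full tree, enumerate cost-complexity pruned
# candidates, and choose the one that scores best on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
candidates = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
              for a in path.ccp_alphas]
best = max(candidates, key=lambda t: t.score(X_val, y_val))   # validation data decides
print("leaves in the chosen pruned tree:", best.get_n_leaves())
```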

slide-27
SLIDE 27

Enhancements to Basic Decision Tree Induction

Allow for continuous-valued attributes
    Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

Handle missing attribute values
    Assign the most common value of the attribute
    Assign a probability to each of the possible values

Attribute construction
    Create new attributes based on existing ones that are sparsely represented
    This reduces fragmentation, repetition, and replication

27

slide-28
SLIDE 28

Scalable Decision Tree Induction Methods

SLIQ (EDBT’96 — Mehta et al.)
    Builds an index for each attribute; only the class list and the current attribute list reside in memory

SPRINT (VLDB’96 — J. Shafer et al.)
    Constructs an attribute list data structure

PUBLIC (VLDB’98 — Rastogi & Shim)
    Integrates tree splitting and tree pruning: stop growing the tree earlier

RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)
    Builds an AVC-list (attribute, value, class label)

BOAT (PODS’99 — Gehrke, Ganti, Ramakrishnan & Loh)
    Uses bootstrapping to create several small samples

28

slide-29
SLIDE 29

RainForest

Separates the scalability aspects from the criteria that determine the quality of the tree

Builds an AVC-list (Attribute, Value, Class label)

AVC-set (of an attribute)
    Projection of the training dataset onto the attribute and the class label, where counts of individual class labels are aggregated

AVC-group (of a node)
    Set of AVC-sets of all predictor attributes at the node

29

slide-30
SLIDE 30

RainForest: Training Set and Its AVC Sets

Training examples: the 14-tuple buys_computer dataset shown earlier.

AVC-set on age:
    age       buys_computer: yes   no
    <=30                     2     3
    31…40                    4     0
    >40                      3     2

AVC-set on income:
    income    buys_computer: yes   no
    high                     2     2
    medium                   4     2
    low                      3     1

AVC-set on student:
    student   buys_computer: yes   no
    yes                      6     1
    no                       3     4

AVC-set on credit_rating:
    credit_rating   buys_computer: yes   no
    fair                           6     2
    excellent                      3     3

30

slide-31
SLIDE 31

BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)

Uses a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory

Each subset is used to create a tree, resulting in several trees

These trees are examined and used to construct a new tree, which turns out to be very close to the tree that would be generated using the whole data set

Adv: requires only two scans of the DB; an incremental algorithm

31

slide-32
SLIDE 32

Why Decision Tree Induction?

Relatively fast learning speed (compared with other classification methods)

Convertible to simple and easy-to-understand classification rules

Comparable classification accuracy with other methods

32

slide-33
SLIDE 33
  • Overview
  • Classification algorithms and methods
      Decision tree induction
      Bayesian classification
      kNN classification
      Support Vector Machines (SVM)
      Others
  • Evaluation and measures
  • Ensemble methods

33

slide-34
SLIDE 34

Bayesian Classification

A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities

Foundation: based on Bayes’ theorem

Naïve Bayesian classifier: independence assumption

Bayesian networks
    Concept
    Using a Bayesian network
    Training/learning a Bayesian network

34

slide-35
SLIDE 35

Bayes' Theorem

Bayes' theorem/rule/law relates the conditional and marginal probabilities of stochastic events:

    P(H|X) = P(X|H) P(H) / P(X)

    P(H) is the prior probability of H
    P(H|X) is the conditional (posterior) probability of H given X
    P(X|H) is the conditional probability of X given H
    P(X) is the prior probability of X

  • Cookie example:
      Bowl A: 10 chocolate + 30 plain; Bowl B: 20 chocolate + 20 plain
      Pick a bowl at random, and then pick a cookie
      If it's a plain cookie, what's the probability the cookie was picked out of bowl A?

      P(A | plain) = P(plain | A) P(A) / P(plain) = (3/4 × 1/2) / (5/8) = 0.6

35
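A short check of the cookie example, with the counts taken from the slide:

```python
# P(Bowl A | plain) = P(plain | A) * P(A) / P(plain)
p_A = p_B = 0.5                      # a bowl is picked at random
p_plain_A = 30 / 40                  # Bowl A: 10 chocolate + 30 plain
p_plain_B = 20 / 40                  # Bowl B: 20 chocolate + 20 plain
p_plain = p_plain_A * p_A + p_plain_B * p_B
print(p_plain_A * p_A / p_plain)     # 0.6
```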

slide-36
SLIDE 36

Naïve Bayes Classifier

Naïve Bayesian / idiot Bayesian / simple Bayesian

Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn)

Suppose there are m classes C1, C2, …, Cm

Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)

By Bayes' theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, it suffices to maximize P(X|Ci) P(Ci)

36
slide-37
SLIDE 37

Derivation of the Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):

    P(X|Ci) = Π_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)

If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D)

If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

    g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))

    P(xk|Ci) = g(xk, μ_Ci, σ_Ci)

37
slide-38
SLIDE 38

Naïve Bayes Classifier: Training Dataset

Classes: C1: buys_computer = ‘yes’; C2: buys_computer = ‘no’

Training data: the 14-tuple buys_computer dataset shown earlier

Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

38

slide-39
SLIDE 39

Naïve Bayes Classifier: An Example

  • P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
           P(buys_computer = “no”) = 5/14 = 0.357

  • Compute P(X|Ci) for each class:
      P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
      P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
      P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
      P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
      P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
      P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
      P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
      P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

  • X = (age <= 30, income = medium, student = yes, credit_rating = fair)

      P(X|Ci):  P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
                P(X | buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

      P(X|Ci) * P(Ci):  P(X | buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
                        P(X | buys_computer = “no”) * P(buys_computer = “no”) = 0.007

      Therefore, X belongs to class “buys_computer = yes”

39
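The same arithmetic as a short script, with all probabilities taken from the slide:

```python
# Naive Bayes posterior comparison for
# X = (age <= 30, income = medium, student = yes, credit_rating = fair).
p_yes, p_no = 9 / 14, 5 / 14
like_yes = (2 / 9) * (4 / 9) * (6 / 9) * (6 / 9)     # P(X | yes) ~ 0.044
like_no = (3 / 5) * (2 / 5) * (1 / 5) * (2 / 5)      # P(X | no)  ~ 0.019
print(round(like_yes * p_yes, 3))                    # 0.028
print(round(like_no * p_no, 3))                      # 0.007 -> predict "yes"
```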

slide-40
SLIDE 40

Naïve Bayes Classifier: Comments

Advantages
    Fast to train and use
    Can be highly effective in most cases

Disadvantages
    Based on a false assumption: class conditional independence – in practice, dependencies exist among variables

Idiot's Bayesian, not so stupid after all? David J. Hand, Keming Yu, International Statistical Review, 2001

How to deal with dependencies? Bayesian belief networks

40

slide-41
SLIDE 41

Bayesian Belief Networks

(Figure: an example belief network.)

41

slide-42
SLIDE 42

Bayesian Belief Networks

  • Bayesian belief networks (belief networks, Bayesian networks, probabilistic networks): a graphical model that represents a set of variables and their probabilistic independencies

  • One of the most significant contributions in AI

  • Trained Bayesian networks can be used for classification and reasoning

  • Many applications: spam filtering, speech recognition, diagnostic systems

42

slide-43
SLIDE 43

Bayesian Network: An Example

(Figure: a network over Boolean variables A, B, C, D with edges A → B, B → C, and B → D; each node stores a conditional probability table.)

P(A)
    A = false   0.6
    A = true    0.4

P(B | A)
    A = false:  B = false  0.01    B = true  0.99
    A = true:   B = false  0.7     B = true  0.3

P(C | B)
    B = false:  C = false  0.4     C = true  0.6
    B = true:   C = false  0.9     C = true  0.1

P(D | B)
    B = false:  D = false  0.02    D = true  0.98
    B = true:   D = false  0.05    D = true  0.95

43

slide-44
SLIDE 44


44

slide-45
SLIDE 45

Conditional Probability Tables

Each variable has a conditional probability table (CPT) giving its distribution conditioned on its parents; the CPTs for A, B, C, D are shown on the previous slide.

  • Each row of a CPT adds up to 1

  • For a Boolean variable with k Boolean parents, how many probabilities need to be stored?

45
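A tiny sketch answering the storage question: each CPT row corresponds to one configuration of the k Boolean parents, so 2^k independent probabilities suffice (the P(X = false | …) entries follow because each row sums to 1).

```python
# Number of stored probabilities for a Boolean node with k Boolean parents.
for k in range(4):
    rows = 2 ** k                 # one row per parent configuration
    print(f"k = {k}: {rows} independent probabilities "
          f"({2 * rows} table entries, rows sum to 1)")
```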

slide-46
SLIDE 46

A Bayesian network:

1. Encodes the conditional independence relationships between the variables in the graph structure

2. Is a compact representation of the joint probability distribution over the variables

46

slide-47
SLIDE 47
  • The Markov condition: given its parents (P1, P2), a node (X) is conditionally independent of its non-descendants (ND1, ND2)

(Figure: node X with parents P1, P2 and non-descendants ND1, ND2.)

47

slide-48
SLIDE 48

Computing the Joint Probability Distribution

Due to the Markov condition, we can compute the joint probability distribution over all the variables X1, …, Xn in the Bayesian net using the formula:

    P(X1 = x1, …, Xn = xn) = Π_{i=1..n} P(Xi = xi | Parents(Xi))

  • Example:
      P(A = true, B = true, C = true, D = true)
        = P(A = true) * P(B = true | A = true) * P(C = true | B = true) * P(D = true | B = true)
        = (0.4) * (0.3) * (0.1) * (0.95)

48
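The same product, evaluated (CPT values from the example slide):

```python
# P(A=true, B=true, C=true, D=true) = P(A) * P(B|A) * P(C|B) * P(D|B)
p = 0.4 * 0.3 * 0.1 * 0.95
print(p)   # 0.0114
```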
slide-49
SLIDE 49

Bayesian Belief Network: An Example

(Figure: a belief network whose variables include Smoker, LungCancer, PXRay, and Dyspnea; the conditional probability table (CPT) for the variable LungCancer lists P(LungCancer | parents) for each combination of its parents' values.)

Using the Bayesian network: P(LungCancer | Smoker, PXRay, Dyspnea)?

49

slide-50
SLIDE 50

Inference in Bayesian Networks

Using a Bayesian network to compute probabilities is called inference

General form: P(X | E), where X is the query variable(s) and E is the evidence variable(s)

Exact inference is feasible in small to medium-sized networks

Exact inference in large networks takes a very long time

Approximate inference techniques are much faster and give pretty good results

(A small enumeration-based inference sketch follows below.)

50
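A minimal exact-inference sketch by enumeration, using the small A → B → C, B → D network and the CPTs from the earlier example slide; the query P(B | D = true) is chosen only for illustration.

```python
# Exact inference by enumeration on the A -> B -> C, B -> D example network.
from itertools import product

def p_a(a):      return 0.4 if a else 0.6
def p_b_a(b, a): return {(True, True): 0.3,  (True, False): 0.99,
                         (False, True): 0.7, (False, False): 0.01}[(b, a)]
def p_c_b(c, b): return {(True, True): 0.1,  (True, False): 0.6,
                         (False, True): 0.9, (False, False): 0.4}[(c, b)]
def p_d_b(d, b): return {(True, True): 0.95, (True, False): 0.98,
                         (False, True): 0.05, (False, False): 0.02}[(d, b)]

def joint(a, b, c, d):
    return p_a(a) * p_b_a(b, a) * p_c_b(c, b) * p_d_b(d, b)

# General form P(X | E): query X = B, evidence E = {D = true}.
num = sum(joint(a, True, c, True) for a, c in product([True, False], repeat=2))
den = sum(joint(a, b, c, True) for a, b, c in product([True, False], repeat=3))
print("P(B = true | D = true) =", round(num / den, 3))
```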

slide-51
SLIDE 51

Joint probability: P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)

Suppose the grass is wet, which is more likely?

51

slide-52
SLIDE 52

Training Bayesian Networks

Several scenarios:

    Given both the network structure and all variables observable: learn only the CPTs

    Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning

    Network structure unknown, all variables observable: search through the model space to reconstruct the network topology

    Unknown structure, all hidden variables: no good algorithms known for this purpose

  • Ref. D. Heckerman: Bayesian networks for data mining

52

slide-53
SLIDE 53

Graphical Models

Bayesian networks (directed graphical models)
Markov networks (undirected graphical models)
    Conditional random fields

Applications: sequential data
    Natural language text
    Protein sequences

53

slide-54
SLIDE 54
  • Overview
  • Classification algorithms and methods
      Decision tree induction
      Bayesian classification
      kNN classification
      Support Vector Machines (SVM)
      Neural Networks
  • Regression
  • Evaluation and measures
  • Ensemble methods

54