Data Mining Techniques: Statistical Decision Theory Nearest - PDF document

Classification and Prediction Overview • Introduction • Decision Trees Data Mining Techniques: • Statistical Decision Theory • Nearest Neighbor Classification and Prediction • Bayesian Classification • Artificial Neural Networks Mirek Riedewald • Support Vector Machines (SVMs) • Prediction Some slides based on presentations by • Accuracy and Error Measures Han/Kamber, Tan/Steinbach/Kumar, and Andrew Moore • Ensemble Methods 2 Classification vs. Prediction Induction: Model Construction • Assumption: after data preparation, have single data Classification set where each record has attributes X 1 ,…, X n , and Y. Algorithm Training • Goal: learn a function f:(X 1 ,…, X n )  Y, then use this Data function to predict y for a given input record (x 1 ,…, x n ). – Classification: Y is a discrete attribute, called the class label Model • Usually a categorical attribute with small domain NAME RANK YEARS TENURED – Prediction: Y is a continuous attribute (Function) Mike Assistant Prof 3 no • Called supervised learning, because true labels (Y- Mary Assistant Prof 7 yes values) are known for the initially provided data Bill Professor 2 yes IF rank = ‘professor’ • Typical applications: credit approval, target marketing, Jim Associate Prof 7 yes OR years > 6 medical diagnosis, fraud detection Dave Assistant Prof 6 no THEN tenured = ‘yes’ Anne Associate Prof 3 no 3 4 Deduction: Using the Model Classification and Prediction Overview • Introduction Model • Decision Trees (Function) • Statistical Decision Theory • Bayesian Classification • Artificial Neural Networks Test Unseen Data Data • Support Vector Machines (SVMs) • Nearest Neighbor (Jeff, Professor, 4) • Prediction NAME RANK YEARS TENURED • Accuracy and Error Measures Tenured? Tom Assistant Prof 2 no • Ensemble Methods Merlisa Associate Prof 7 no George Professor 5 yes Joseph Assistant Prof 7 yes 5 6 1

Example of a Decision Tree Another Example of Decision Tree Single, Splitting Attributes MarSt Tid Refund Marital Taxable Married Divorced Cheat Tid Refund Marital Taxable Status Income Cheat Status Income NO Refund 1 Yes Single 125K No 1 Yes Single 125K No No Refund Yes 2 No Married 100K No 2 No Married 100K No Yes No 3 No Single 70K No 3 No Single 70K No NO TaxInc 4 Yes Married 120K No NO MarSt No < 80K > 80K 4 Yes Married 120K Yes 5 No Divorced 95K Married Single, Divorced 5 No Divorced 95K Yes NO YES 6 No Married 60K No 6 No Married 60K No TaxInc NO 7 Yes Divorced 220K No 7 Yes Divorced 220K No < 80K > 80K 8 No Single 85K Yes 8 No Single 85K Yes 9 No Married 75K No NO YES There could be more than one tree that 9 No Married 75K No 10 No Single 90K Yes fits the same data! 10 No Single 90K Yes 10 10 Model: Decision Tree Training Data 7 8 Apply Model to Test Data Apply Model to Test Data Test Data Test Data Start from the root of tree. Refund Marital Taxable Refund Marital Taxable Status Income Cheat Status Income Cheat No Married 80K ? No Married 80K ? Refund Refund 10 10 Yes No Yes No NO MarSt NO MarSt Married Married Single, Divorced Single, Divorced TaxInc TaxInc NO NO < 80K > 80K < 80K > 80K NO YES NO YES 9 10 Apply Model to Test Data Apply Model to Test Data Test Data Test Data Refund Marital Taxable Refund Marital Taxable Cheat Cheat Status Income Status Income No Married 80K ? No Married 80K ? Refund Refund 10 10 Yes No Yes No NO MarSt NO MarSt Married Married Single, Divorced Single, Divorced TaxInc NO TaxInc NO < 80K > 80K < 80K > 80K NO YES NO YES 11 12 2

Apply Model to Test Data Apply Model to Test Data Test Data Test Data Refund Marital Taxable Refund Marital Taxable Cheat Cheat Status Income Status Income No Married 80K ? No Married 80K ? Refund Refund 10 10 Yes No Yes No NO MarSt NO MarSt Assign Cheat to “No” Married Married Single, Divorced Single, Divorced TaxInc TaxInc NO NO < 80K > 80K < 80K > 80K NO YES NO YES 13 14 Decision Tree Induction Decision Boundary 1 x 2 • Basic greedy algorithm 0.9 X 1 – Top-down, recursive divide-and-conquer < 0.43? 0.8 – At start, all the training records are at the root 0.7 Yes No – Training records partitioned recursively based on split attributes 0.6 X 2 – Split attributes selected based on a heuristic or statistical X 2 0.5 < 0.33? < 0.47? measure (e.g., information gain) 0.4 • Conditions for stopping partitioning Yes No Yes No 0.3 Refund – Pure node (all records belong Yes No 0.2 : 4 : 0 : 0 : 4 to same class) : 0 : 4 : 3 : 0 0.1 – No remaining attributes for NO MarSt 0 further partitioning Married Single, Divorced 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 x 1 • Majority voting for classifying the leaf TaxInc NO – No cases left Decision boundary = border between two neighboring regions of different classes. < 80K > 80K For trees that split on a single attribute at a time, the decision boundary is parallel NO YES to the axes. 15 16 How to Specify Split Condition? Splitting Nominal Attributes • Depends on attribute types • Multi-way split: use as many partitions as distinct values. – Nominal CarType – Ordinal Family Luxury Sports – Numeric (continuous) • Binary split: divides values into two subsets; • Depends on number of ways to split need to find optimal partitioning. – 2-way split CarType CarType – Multi-way split {Sports, OR {Family, {Family} {Sports} Luxury} Luxury} 17 18 3

Splitting Ordinal Attributes Splitting Continuous Attributes • Multi-way split: • Different options Size – Discretization to form an ordinal categorical Small Large attribute Medium • Static – discretize once at the beginning • Binary split: • Dynamic – ranges found by equal interval bucketing, Size Size OR {Small, {Medium, equal frequency bucketing (percentiles), or clustering. {Large} {Small} Medium} Large} – Binary Decision: (A < v) or (A  v) • What about this split? Size • Consider all possible splits, choose best one {Small, {Medium} Large} 19 20 Splitting Continuous Attributes How to Determine Best Split Before Splitting: 10 records of class 0, 10 records of class 1 Taxable Taxable Income Income? > 80K? Own Car Student Car? Type? ID? < 10K > 80K Family Luxury c 1 c 20 Yes No Yes No c 10 c 11 Sports [10K,25K) [25K,50K) [50K,80K) C0: 6 C0: 4 C0: 1 C0: 8 C0: 1 C0: 1 ... C0: 1 C0: 0 ... C0: 0 C1: 4 C1: 6 C1: 3 C1: 0 C1: 7 C1: 0 C1: 0 C1: 1 C1: 1 (i) Binary split (ii) Multi-way split Which test condition is the best? 21 22 Attribute Selection Measure: How to Determine Best Split Information Gain • Select attribute with highest information gain • Greedy approach: • p i = probability that an arbitrary record in D belongs to class – Nodes with homogeneous class distribution are C i , i =1,…,m • Expected information (entropy) needed to classify a record preferred in D: m  • Need a measure of node impurity:   Info( D ) p log ( p ) i 2 i  i 1 • Information needed after using attribute A to split D into v C0: 5 C0: 9 partitions D 1 ,…, D v : C1: 5 C1: 1 | D | v   j Info ( ) Info( ) D D A j | D | Non-homogeneous, Homogeneous,  1 j High degree of impurity Low degree of impurity • Information gained by splitting on attribute A:   Gain (D) Info (D) Info (D) A A 23 24 4

Data Mining Techniques: Statistical Decision Theory Nearest - PDF document

Classification and Prediction Overview Introduction Decision Trees Data Mining Techniques: Statistical Decision Theory Nearest Neighbor Classification and Prediction Bayesian Classification Artificial Neural Networks Mirek

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Data Mining: Concepts and Techniques Chapter 1 Introduction 1 August 19, 2013

Introduction What is data mining? to Data mining functionalities Data Mining Major

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

DATA MINING LECTURE 1 Introduction What is data mining? After years of data mining there is

Data Mining: Concepts and Techniques Web Mining Li Xiong Slides credits: Jiawei Han and

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

CS6220: DATA MINING TECHNIQUES Chapter 7: Advanced Pattern Mining Instructor: Yizhou Sun

Web MINING Web MINING Overview Overview Dr Ahmed Rafea Rafea Dr Ahmed 1 Web Mining Outline

LECTURE 1: INTRODUCTION TO DATA MINING Dr. Dhaval Patel CSE, IIT-Roorkee What is data mining?

Web Mining Web Mining to automatically discover and extract information from Web

Web Mining Web Mining to automatically discover and extract information from Web

Data Mining: Concepts and Techniques Chap 8. Data Streams, Time Series Data, and Sequential

Scalable and Memory-Efficient Clustering of Large-Scale Social Networks Joyce Jiyoung Whang, Xin

Physics Analysis with Advanced Data Mining Techniques Hai-Jun Yang University of Michigan, Ann

Data Mining meets Football (soccer) Ulf Brefeld Knowledge Mining & Assessment TU Darmstadt

An Overview of CS512 @Spring 2020 JIAWEI HAN COMPUTER SCIENCE UNIVERSITY OF ILLINOIS AT

Data Preprocessing Data Mining and Exploration: Preprocessing Data preparation is a big issue for

Data Replication and Power Consumption in Data Grids Karl Smith, Susan Vrbsky, Ming Lei, Jeff

Distributed Databases 1 19.1 Distributed Database System A distributed database system

A Cloud-native Architecture for Replicated Data Services Hemant Saxena, Jeffery Pound University

Data Mining Techniques: Statistical Decision Theory Nearest - PDF document

Classification and Prediction Overview Introduction Decision Trees Data Mining Techniques: Statistical Decision Theory Nearest Neighbor Classification and Prediction Bayesian Classification Artificial Neural Networks Mirek

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Data Mining: Concepts and Techniques Chapter 1 Introduction 1 August 19, 2013

Introduction What is data mining? to Data mining functionalities Data Mining Major

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

DATA MINING LECTURE 1 Introduction What is data mining? After years of data mining there is

Data Mining: Concepts and Techniques Web Mining Li Xiong Slides credits: Jiawei Han and

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

CS6220: DATA MINING TECHNIQUES Chapter 7: Advanced Pattern Mining Instructor: Yizhou Sun

Web MINING Web MINING Overview Overview Dr Ahmed Rafea Rafea Dr Ahmed 1 Web Mining Outline

LECTURE 1: INTRODUCTION TO DATA MINING Dr. Dhaval Patel CSE, IIT-Roorkee What is data mining?

Web Mining Web Mining to automatically discover and extract information from Web

Web Mining Web Mining to automatically discover and extract information from Web

Data Mining: Concepts and Techniques Chap 8. Data Streams, Time Series Data, and Sequential

Scalable and Memory-Efficient Clustering of Large-Scale Social Networks Joyce Jiyoung Whang, Xin

Physics Analysis with Advanced Data Mining Techniques Hai-Jun Yang University of Michigan, Ann

Data Mining meets Football (soccer) Ulf Brefeld Knowledge Mining &amp; Assessment TU Darmstadt

An Overview of CS512 @Spring 2020 JIAWEI HAN COMPUTER SCIENCE UNIVERSITY OF ILLINOIS AT

Data Preprocessing Data Mining and Exploration: Preprocessing Data preparation is a big issue for

Data Replication and Power Consumption in Data Grids Karl Smith, Susan Vrbsky, Ming Lei, Jeff

Distributed Databases 1 19.1 Distributed Database System A distributed database system

A Cloud-native Architecture for Replicated Data Services Hemant Saxena, Jeffery Pound University

Data Mining meets Football (soccer) Ulf Brefeld Knowledge Mining & Assessment TU Darmstadt