
Data Mining with Weka

Ian H. Witten
Computer Science Department, Waikato University, New Zealand
http://www.cs.waikato.ac.nz/~ihw
http://www.cs.waikato.ac.nz/ml/weka

The problem

Classification (“supervised”)
 Given: a set of classified examples (“instances”)
 Produce: a way of classifying new examples
 Instances: described by a fixed set of features (“attributes”)
 Classes: discrete (“classification”) or continuous (“regression”)
 Interested in: the results (classifying new instances)? the model (how the decision is made)?

Association rules
 Look for rules that relate features to other features

Clustering (“unsupervised”)
 There are no classes

Simplicity first!

 Simple algorithms often work very well!
 There are many kinds of simple structure, e.g.:
   One attribute does all the work
   All attributes contribute equally and independently
   A decision tree involving tests on a few attributes
   Rules that assign instances to classes
   Distance in instance space from a few class “prototypes”
   Result depends on a linear combination of attributes

 Success of method depends on the domain

Agenda

 A very simple strategy

 Overfitting, evaluation

 Statistical modeling

 Bayes rule

 Constructing decision trees
 Constructing rules

 + Association rules

 Linear models

 Regression, perceptrons, neural nets, SVMs, model trees

 Instance-based learning and clustering

 Hierarchical, probabilistic clustering

 Engineering the input and output

 Attribute selection, data transformations, PCA
 Bagging, boosting, stacking, co-training

One attribute does all the work

 Learn a 1-level decision tree

 i.e., rules that all test one particular attribute

 Basic version

 One branch for each value
 Each branch assigns the most frequent class
 Error rate: proportion of instances that don’t belong to the majority class of their corresponding branch
 Choose the attribute with the smallest error rate

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of this attribute’s rules
Choose the attribute with the smallest error rate
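The pseudocode above translates almost line for line into code. A minimal Python sketch (the row-list data format and the function name are illustrative assumptions, not Weka’s implementation):

    # 1R: pick the single attribute whose one-level rules make the fewest errors.
    from collections import Counter, defaultdict

    def one_r(instances, n_attributes):
        best = None                                # (errors, attribute, rules)
        for a in range(n_attributes):
            counts = defaultdict(Counter)          # attribute value -> class counts
            for row in instances:
                counts[row[a]][row[-1]] += 1       # class is the last column
            # each value predicts its most frequent class
            rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
            # errors: instances outside the majority class of their branch
            errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
            if best is None or errors < best[0]:
                best = (errors, a, rules)
        return best

On the weather data in the example that follows, one_r(weather, 4) returns 4 errors for attribute 0 (outlook), matching the 4/14 total in the table.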

Example

Attribute   Rules             Errors   Total errors
Outlook     Sunny → No        2/5      4/14
            Overcast → Yes    0/4
            Rainy → Yes       2/5
Temp        Hot → No*         2/4      5/14
            Mild → Yes        2/6
            Cool → Yes        1/4
Humidity    High → No         3/7      4/14
            Normal → Yes      1/7
Wind        False → Yes       2/8      5/14
            True → No*        3/6

* indicates a tie

The weather data (14 instances):

Outlook    Temp   Humidity   Wind    Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No


Complications: Missing values

 Omit instances where the attribute value is missing
 Treat “missing” as a separate possible value
 “Missing” means what? Unknown? Unrecorded? Irrelevant?
 Is there significance in the fact that a value is missing?
 Nominal vs numeric values for attributes

Complications: Overfitting

Weather data with a numeric temperature attribute:

Outlook    Temp   Humidity   Wind    Play
Sunny      85     85         False   No
Sunny      80     90         True    No
Overcast   83     86         False   Yes
Rainy      75     80         False   Yes
…          …      …          …       …

1R rules for this attribute:

Attribute   Rules       Errors   Total errors
Temp        85 → No     0/1      0/14
            80 → No     0/1
            83 → Yes    0/1
            75 → Yes    0/1
            …           …

 Memorization vs generalization
 Do not evaluate rules on the training data
 Here, independent test data shows poor performance
 To fix, use:
   Training data: to form rules
   Validation data: to decide on the best rule
   Test data: to determine system performance

 Evaluate on the training set? NO!
 Options:
   Independent test set
   Cross-validation
     Stratified cross-validation
     Stratified 10-fold cross-validation, repeated 10 times
   Leave-one-out
   The “bootstrap”

Evaluating the result

 This incredibly simple method was described in a 1993 paper

 An experimental evaluation on 16 datasets
 Used cross-validation so that results were representative of performance on new data
 Simple rules often outperformed far more complex methods

 Simplicity first pays off!

“Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”
Robert C. Holte, Computer Science Department, University of Ottawa

Agenda (recap): next is Statistical modeling

Statistical modeling

 Opposite strategy: use all the attributes
 Two assumptions: attributes are
   equally important a priori
   statistically independent (given the class value)
     i.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
 The independence assumption is never correct!
 But … it often works well in practice


Bayes’s rule

 Probability of event H given evidence E:

    Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]

 A priori probability of H: Pr[H]
   Probability of the event before evidence is seen
 A posteriori probability of H: Pr[H | E]
   Probability of the event after evidence is seen

Thomas Bayes
British mathematician and Presbyterian minister
Born 1702, died 1761

    Pr[H | E] = Pr[E1 | H] Pr[E2 | H] … Pr[En | H] Pr[H] / Pr[E]

 “Naïve” assumption:
   The evidence splits into parts (one per attribute) that are independent given the class

Weather data: probabilities

Relative frequencies from the weather data:

Outlook      Yes   No       Temperature   Yes   No
Sunny        2/9   3/5      Hot           2/9   2/5
Overcast     4/9   0/5      Mild          4/9   2/5
Rainy        3/9   2/5      Cool          3/9   1/5

Humidity     Yes   No       Wind          Yes   No
High         3/9   4/5      False         6/9   2/5
Normal       6/9   1/5      True          3/9   3/5

Play: Yes 9/14, No 5/14

 A new day:

Outlook   Temp.   Humidity   Wind   Play
Sunny     Cool    High       True   ?

Likelihood of the two classes:
  For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
  For “no”  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into probabilities by normalization:
  P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
  P(“no”)  = 0.0206 / (0.0053 + 0.0206) = 0.795
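As a cross-check on this arithmetic, here is a hedged Python sketch that reproduces the calculation directly from the instance counts (the row-list format is an illustrative assumption; None marks a missing value, anticipating the missing-values discussion below):

    # Naive Bayes over nominal attributes: multiply the per-attribute relative
    # frequencies by the class prior, then normalize over the classes.
    from collections import Counter

    def naive_bayes(instances, new_instance):
        class_counts = Counter(row[-1] for row in instances)
        scores = {}
        for cls, n_cls in class_counts.items():
            score = n_cls / len(instances)             # prior, e.g. 9/14
            for a, value in enumerate(new_instance):
                if value is None:                      # missing: omit the attribute
                    continue
                n = sum(1 for row in instances
                        if row[a] == value and row[-1] == cls)
                score *= n / n_cls                     # e.g. Pr[Sunny | yes] = 2/9
            scores[cls] = score
        total = sum(scores.values())
        return {cls: s / total for cls, s in scores.items()}

    # naive_bayes(weather, ["Sunny", "Cool", "High", "True"])
    # -> {"Yes": 0.205, "No": 0.795}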

The same calculation written out via Bayes’s rule, with E the evidence for the new day:

Pr[yes | E] = Pr[Outlook = Sunny | yes]
            × Pr[Temperature = Cool | yes]
            × Pr[Humidity = High | yes]
            × Pr[Windy = True | yes]
            × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]

Missing values

 Training: do not include the instance in the frequency count for that attribute value–class combination
 Classification: omit the attribute from the calculation
 Example: a new day with Outlook missing:

Outlook   Temp.   Humidity   Wind   Play
?         Cool    High       True   ?

Likelihood of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of “no”  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
P(“no”)  = 0.0343 / (0.0238 + 0.0343) = 59%

Complications

Zero frequencies

 An attribute value doesn’t occur with every class
   Its estimated probability, e.g. Pr[Humidity = High | yes], will then be zero, and so will the whole product!

Numeric attributes

 Often assume attributes have a Gaussian distribution (given the class)
 Its probability density function is defined by two parameters:
   Sample mean:  μ = (1/n) Σi xi
   Standard deviation:  σ = sqrt( (1/(n−1)) Σi (xi − μ)² )
 The density function is

    f(x) = 1 / (√(2π) σ) · e^( −(x − μ)² / (2σ²) )

Carl Friedrich Gauss
German mathematician and scientist, “the prince of mathematicians”
Born 1777, died 1855


 A new day:

Outlook   Temp.   Humidity   Wind   Play
Sunny     66      90         true   ?

Likelihood of “yes” = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of “no”  = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
P(“yes”) = 0.000036 / (0.000036 + 0.000136) = 20.9%
P(“no”)  = 0.000136 / (0.000036 + 0.000136) = 79.1%
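The densities 0.0340 and 0.0221 come straight from the formula above. A small Python sketch (the nine “yes” temperatures are taken from the numeric version of the weather data; treat them as an assumption if your copy differs):

    # Gaussian attribute in Naive Bayes: estimate mean and standard deviation
    # per class, then use the normal density in place of a relative frequency.
    import math

    def gaussian_density(x, mu, sigma):
        return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

    def mean_and_sd(values):
        n = len(values)
        mu = sum(values) / n
        var = sum((v - mu)**2 for v in values) / (n - 1)   # sample variance
        return mu, math.sqrt(var)

    temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]       # assumed "yes" days
    mu, sd = mean_and_sd(temps_yes)                        # mu = 73.0, sd ~ 6.2
    print(gaussian_density(66, mu, sd))                    # ~0.0340, as above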

“Naïve” statistical model

 Naïve = assume attributes are independent
 Naïve Bayes works surprisingly well

 even if independence assumption is clearly violated

 Why?

 Because classification doesn’t require accurate probability estimates  so long as the greatest probability is assigned to the correct class

 But: adding redundant attributes causes problems

 e.g. identical attributes

 And: numeric attributes may not be normally distributed

 → kernel density estimators

Agenda (recap): next is Constructing decision trees

Constructing decision trees

 Strategy: top-down, in recursive divide-and-conquer fashion

 First: select an attribute for the root node; create a branch for each possible attribute value
 Then: split the instances into subsets, one for each branch extending from the node
 Finally: repeat recursively for each branch, using only the instances that reach the branch

 Stop if all instances have the same class
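This recursion is compact enough to sketch directly. A hedged Python skeleton (row-list data format assumed; the attribute-selection function select is left pluggable, and an information-gain version of it appears after the next slide):

    # Top-down decision-tree induction: pick an attribute, branch on its
    # values, recurse on each subset, and stop when a subset is pure.
    from collections import Counter

    def build_tree(instances, attributes, select):
        classes = [row[-1] for row in instances]
        if len(set(classes)) == 1 or not attributes:      # pure, or nothing to test
            return Counter(classes).most_common(1)[0][0]  # leaf: majority class
        a = select(instances, attributes)                 # e.g. highest info gain
        tree = {"attribute": a, "branches": {}}
        for value in set(row[a] for row in instances):
            subset = [row for row in instances if row[a] == value]
            rest = [b for b in attributes if b != a]
            tree["branches"][value] = build_tree(subset, rest, select)
        return tree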

Which attribute to select?

 Criterion: we want the smallest tree
 Heuristic:
   choose the attribute that produces the “purest” nodes
   i.e., the greatest information gain

 Information theory: measure information in bits
 Information gain:
   Amount of information gained by knowing the value of the attribute
   = entropy of the class distribution before the split − entropy of the distribution after it

    entropy(p1, p2, …, pn) = −p1 log p1 − p2 log p2 … − pn log pn

Claude Shannon
American mathematician and scientist, “the father of information theory”
Born 1916, died 2001
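The entropy and gain computations, as a minimal Python sketch that plugs into the build_tree skeleton above (function names are illustrative):

    # Information gain = entropy of the class distribution before the split
    # minus the weighted entropy of the subsets after it.
    import math
    from collections import Counter

    def entropy(classes):
        n = len(classes)
        return -sum((c / n) * math.log2(c / n) for c in Counter(classes).values())

    def info_gain(instances, a):
        before = entropy([row[-1] for row in instances])
        after = 0.0
        for value in set(row[a] for row in instances):
            subset = [row[-1] for row in instances if row[a] == value]
            after += len(subset) / len(instances) * entropy(subset)
        return before - after

    def select(instances, attributes):                    # for build_tree above
        return max(attributes, key=lambda a: info_gain(instances, a))

    # info_gain(weather, 0) -> 0.247 bits for outlook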


Which attribute to select?

gain(outlook) = 0.247 bits
gain(humidity) = 0.152 bits
gain(windy) = 0.048 bits
gain(temperature) = 0.029 bits

Continuing to split

gain(temperature ) = 0.571 bits gain(humidity ) = 0.971 bits gain(windy ) = 0.020 bits

Complications

 Highly-branching attributes

 Extreme case: ID code

[Table: the weather data with an extra “ID code” attribute whose values a, b, c, …, n are all different, one per instance]

Info gain is maximal (0.940 bits)

Complications (continued)

 Highly-branching attributes
   Extreme case: ID code
 Overfitting: need to prune
   Example: the labor negotiations data

[Table: labor negotiations data — contracts described by attributes such as duration, wage increases in the first/second/third year, cost of living adjustment, working hours per week, pension, standby pay, shift-work supplement, education allowance, statutory holidays, vacation, long-term disability assistance, dental plan contribution, bereavement assistance, and health plan contribution; class = acceptability of contract (good/bad)]

 Prepruning vs postpruning
 Missing values
   During training
   During testing: “fractional instances”
 Numeric attributes
   Choose the best “split point” for the attribute, e.g. temp < 25


Constructing decision trees: discussion

Top-down induction of decision trees is the most extensively studied method of machine learning used in data mining

 Different criteria for attribute selection
   rarely make a large difference
 Different pruning methods
   mainly change the size of the pruned tree
 Univariate vs multivariate decision trees
   Single vs compound tests at the nodes
 C4.5 and CART

Ross Quinlan
Australian computer scientist, University of Sydney

Agenda (recap): next is Constructing rules

Constructing rules

 Convert a (top-down) decision tree into a rule set
   Straightforward, but the rule set is overly complex
   More effective conversions are not trivial
 Alternative: a (bottom-up) covering method
   for each class in turn, find the rule set that covers all instances in it (excluding instances not in the class)
 Separate-and-conquer method:
   First identify a useful rule
   Then separate out all the instances it covers
   Finally “conquer” the remaining instances
 Cf. divide-and-conquer methods:
   No need to explore the subset covered by a rule any further

Generating a rule

[Figure: instances of classes “a” and “b” in the x–y plane; a rule for class “a” is grown by adding the test x > 1.2, then refining it with y > 2.6]

The rule grows one test at a time:

If true then class = a
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a

 Possible rule set for class “b”:

If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b

 Could add more rules to get a “perfect” rule set

Rules vs. trees

 The corresponding decision tree produces exactly the same predictions
 Rule sets can be more perspicuous
   E.g. when decision trees contain replicated subtrees
 Also: in multiclass situations,
   a covering algorithm concentrates on one class at a time
   a decision tree learner takes all classes into account

Constructing rules: the covering algorithm

For each class C
    Initialize E to the instance set
    While E contains instances in class C
        Create a rule R that predicts class C (with an empty left-hand side)
        Until R is perfect (or there are no more attributes to use):
            For each attribute A not mentioned in R, and each value v,
                consider adding the condition A = v to the left-hand side of R
            Select A and v to maximize the accuracy p/t
                (break ties by choosing the condition with the largest p)
            Add A = v to R
        Remove the instances covered by R from E
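A hedged Python sketch of this separate-and-conquer loop (essentially the PRISM procedure in the book’s terminology; the row-list format and helper names are assumptions):

    # Covering: grow one rule for class cls by greedily adding the test A = v
    # that maximizes accuracy p/t, then remove the instances the rule covers.
    def learn_one_rule(instances, cls, n_attributes):
        conditions, covered = {}, instances
        while True:
            pos = [r for r in covered if r[-1] == cls]
            if not pos or len(pos) == len(covered) or len(conditions) == n_attributes:
                return conditions            # rule is perfect, or no tests left
            best = None                      # (accuracy p/t, p, attribute, value)
            for a in range(n_attributes):
                if a in conditions:
                    continue
                for v in set(r[a] for r in covered):
                    t = [r for r in covered if r[a] == v]
                    p = sum(1 for r in t if r[-1] == cls)
                    cand = (p / len(t), p, a, v)      # ties broken by larger p
                    if best is None or cand > best:
                        best = cand
            conditions[best[2]] = best[3]
            covered = [r for r in covered if r[best[2]] == best[3]]

    def covering(instances, cls, n_attributes):
        rules, E = [], list(instances)
        while any(r[-1] == cls for r in E):           # instances of cls remain
            rule = learn_one_rule(E, cls, n_attributes)
            rules.append(rule)
            E = [r for r in E if any(r[a] != v for a, v in rule.items())]
        return rules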


More about rules

 Rules are order-dependent

 Two rules might assign different classes to an instance

 Work through the classes in turn

 generating rules for that class

 For each class a “decision list” is generated

 Subsequent rules are designed for instances that are not covered by previous rules
 But: the order doesn’t matter, because all rules in one list predict the same class

 Problems: overlapping rules
 For better rules: global optimization

Association rules

 … can predict any attribute, and combinations of attributes
 … are not intended to be used together as a set

 Problem: immense number of possible associations

 Output needs to be restricted to show only the most predictive associations

 Define

 Support: number of instances predicted correctly
 Confidence: correct predictions as a percentage of instances covered
 Specify minimum support and confidence
   e.g. 58 rules with support ≥ 2 and confidence ≥ 95%
 Examples:

If temperature = cool then humidity = normal
    (support = 4, confidence = 100%)
If wind = false and play = no then outlook = sunny and humidity = high
    (support = 2, confidence = 100%)

(both from the weather data shown earlier)

Constructing association rules

 To find association rules:
   Use separate-and-conquer
   Treat every possible combination of attribute values as a separate class
 Two problems:
   Computational complexity
   Huge number of rules (which would need pruning on the basis of support and confidence)
 But: we can look for high-support rules directly!
   Generate frequent “item sets”
   From them, generate and test possible rules

A frequent item set in the weather data:

    Temperature = Cool, Humidity = Normal, Wind = False, Play = Yes (2)

Rules generated from it:

    Temperature = Cool, Wind = False ⇒ Humidity = Normal, Play = Yes
    Temperature = Cool, Wind = False, Humidity = Normal ⇒ Play = Yes
    Temperature = Cool, Wind = False, Play = Yes ⇒ Humidity = Normal

(all have support 2, confidence = 100%)
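The item sets themselves can be found level-wise, using the fact that a set can only be frequent if all of its subsets are. A minimal Python sketch (Apriori-style; the pair encoding of attribute values is an assumption):

    # Frequent item sets: grow (k+1)-item candidates from frequent k-item
    # sets and keep those whose support meets the threshold.
    def frequent_item_sets(instances, attribute_names, min_support):
        rows = [frozenset(zip(attribute_names, row)) for row in instances]
        items = {frozenset([pair]) for row in rows for pair in row}
        level = {s for s in items
                 if sum(1 for row in rows if s <= row) >= min_support}
        frequent = set(level)
        while level:
            size = len(next(iter(level))) + 1
            candidates = {a | b for a in level for b in level if len(a | b) == size}
            level = {c for c in candidates
                     if sum(1 for row in rows if c <= row) >= min_support}
            frequent |= level
        return frequent

    # frequent_item_sets(weather, ["Outlook","Temp","Humidity","Wind","Play"], 2)

Each frequent set is then split every possible way into an antecedent and a consequent, keeping the splits whose confidence is high enough.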

Example association rules

 Rules with support ≥ 2 and confidence 100%:
   support = 4: 3 rules
   support = 3: 5 rules
   support = 2: 50 rules
   total: 58 rules

      Association rule                                      Sup.   Conf.
 1    Humidity=Normal, Wind=False ⇒ Play=Yes                 4     100%
 2    Temperature=Cool ⇒ Humidity=Normal                     4     100%
 3    Outlook=Overcast ⇒ Play=Yes                            4     100%
 4    Temperature=Cool, Play=Yes ⇒ Humidity=Normal           3     100%
 …    …                                                      …     …
 58   Outlook=Sunny, Temperature=Hot ⇒ Humidity=High         2     100%

Association rules: discussion

 Market basket analysis: huge data sets
   May not fit in main memory
   Different algorithms necessary
   Minimize passes through the data
 Practical issue: generating a certain number of rules
   e.g. by incrementally reducing the minimum support
 Confidence is not necessarily the best measure
   e.g. milk occurs in almost every supermarket transaction
   Other measures have been devised (e.g. lift)

Buy beer ⇒ buy chips
Day = Thursday, buy beer ⇒ buy diapers

Agenda (recap): next is Linear models


Linear models

 Standard technique: linear regression
   Works most naturally with numeric attributes
   The outcome is a linear combination of the attributes:

    x = w0 + w1 a1 + w2 a2 + … + wk ak

 Calculate the weights from the training data
 Predicted value for the first training instance a(1) (taking a0 = 1):

    w0 a0(1) + w1 a1(1) + w2 a2(1) + … + wk ak(1) = Σ j=0..k  wj aj(1)

 Choose the weights to minimize the squared error on the training data:

    Σ i=1..n  ( x(i) − Σ j=0..k  wj aj(i) )²

 A standard matrix problem
   Works if there are more instances than attributes (roughly speaking)

“Regression” = predicting a numeric quantity
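With a0 = 1 absorbed as a column of ones, the weight calculation is one call to a least-squares solver. A hedged sketch using NumPy (array shapes and names are assumptions):

    # Least-squares linear regression: the error-minimizing weights solve a
    # standard matrix problem over the n x k attribute matrix.
    import numpy as np

    def linear_regression(A, x):
        """A: n x k matrix of attribute values; x: n target values."""
        A1 = np.column_stack([np.ones(len(A)), A])    # prepend a0 = 1
        w, *_ = np.linalg.lstsq(A1, x, rcond=None)    # minimizes squared error
        return w                                      # w[0] is the bias w0

    def predict(w, a):
        return w[0] + np.dot(w[1:], a)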

Classification by regression

 Method 1: Multi-response linear regression
   Training: perform a regression for each class; set the output to 1 for training instances that belong to the class, 0 for those that don’t
   Prediction: predict the class that produces the largest output
 Method 2: Pairwise linear regression
   Find a regression function for every pair of classes, using only instances from those two classes; assign an output of +1 to one class, −1 to the other
   Prediction: use voting
     The class that receives the most votes is predicted
     Alternative: “don’t know” if there is no agreement
 Method 3: Logistic regression
   An alternative to linear regression, designed for classification
   Tries to estimate the class probabilities directly

Advanced linear models

 A linear model is inappropriate if the data exhibits non-linear dependencies
 But: linear models can serve as building blocks for more complex schemes
 Support vector machine
   Resilient to overfitting
   Learns a particular kind of decision boundary

 Multilayer perceptron

 A network of linear classifiers can approximate any target concept
 An example of an artificial neural network

 Model tree

 Decision tree with linear model at the nodes

Support vector machine

The support vectors define the maximum margin hyperplane; all other instances can be deleted without changing it!

[Figure: maximum margin hyperplane and its support vectors]

Multilayer perceptron

 Network of linear classifiers

 Input layer, hidden layer(s), and output layer

 Parameters are found by backpropagation
   Minimize the error using “gradient descent”
 Can get excellent results
   but involves experimentation

[Figure: network with input, hidden, and output layers]

Trees for numeric prediction

 Regression tree
   each leaf predicts a numeric quantity
   the prediction is the average value of the training instances at the leaf
 Model tree
   each leaf has a linear regression model
   linear patches approximate a continuous function


Discussion of linear models

 Linear regression is a well-founded mathematical technique
 Can be used for classification in situations that are “linearly separable”
   … but it is very susceptible to noise
 Support vector machines yield excellent performance

 particularly in situations with many redundant attributes

 Multilayer perceptrons (“neural nets”) can work well

 but often require much experimentation

 Regression/model trees grew out of decision trees

 Regression trees were introduced in CART  Model trees were developed by Quinlan

Agenda (recap): next is Instance-based learning and clustering

Instance-based learning

 Search the training set for the instance that’s most like the new one
   The instances themselves represent the “knowledge”
   Noise will be a problem
 The similarity function defines what’s “learned”
   Euclidean distance
   Nominal attributes? Distance is 1 if the values differ, 0 if they are the same
   Weight the attributes?
 Lazy learning: do nothing until you have to
 Methods:
   nearest-neighbor
   k-nearest-neighbor

“Rote learning” = simplest form of learning

 Often very accurate … but slow:
   scan the entire training data to make each prediction?
   sophisticated data structures can make this much faster
 Assumes all attributes are equally important
   Remedy: attribute selection or weights
 Remedies against noisy instances:
   Majority vote over the k nearest neighbors
   Weight instances according to their prediction accuracy
   Identify reliable “prototypes” for each class
 Statisticians have used k-NN since the 1950s

 If n → ∞ and k/n → 0, error approaches minimum
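A minimal Python sketch of k-nearest-neighbor classification with the mixed distance function described above (row-list format assumed):

    # k-NN: store the training set; classify a new instance by a majority
    # vote over the k stored instances closest to it.
    import math
    from collections import Counter

    def distance(a, b):
        return math.sqrt(sum(
            (x - y)**2 if isinstance(x, (int, float))    # numeric: Euclidean
            else (0 if x == y else 1)                    # nominal: 0/1 mismatch
            for x, y in zip(a, b)))

    def knn_classify(training, new_instance, k=3):
        nearest = sorted(training,
                         key=lambda row: distance(row[:-1], new_instance))[:k]
        return Counter(row[-1] for row in nearest).most_common(1)[0][0]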

Clustering

 No target value to predict
 Differences between models/algorithms:
   Exclusive vs. overlapping
   Hierarchical vs. flat
   Incremental vs. batch learning
   Deterministic vs. probabilistic
 Evaluation?
   Usually by inspection
   Clusters-to-classes evaluation?
   Probabilistic density estimation can be evaluated on test data

Unsupervised vs supervised learning (classification)

Hierarchical clustering

 Bottom up
   Start with single-instance clusters
   At each step, join the two closest clusters
   How to define the distance between clusters?
     Distance between the two closest instances?
     Distance between the means?
 Top down
   Start with one universal cluster
   Find two clusters
   Proceed recursively on each subset

The k-means algorithm

Iterative clustering with a fixed number of clusters. To cluster data into k groups (k is predefined):

1. Choose k cluster centers (“seeds”), e.g. at random
2. Assign instances to clusters, based on distance to the cluster centroids
3. Compute the centroids of the clusters
4. Go back to step 2, until convergence

 Results can depend strongly on the initial seeds
 Can get trapped in a local minimum
   Rerun with different seeds?
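A minimal Python sketch of this loop (points are tuples of numbers; the convergence test on the assignment is one common choice among several):

    # k-means: alternate between assigning instances to the nearest centroid
    # and recomputing each centroid, until the assignment stops changing.
    import random

    def k_means(points, k, seed=0):
        rng = random.Random(seed)
        centroids = [list(p) for p in rng.sample(points, k)]   # step 1: seeds
        assignment = None
        while True:
            new_assignment = [
                min(range(k),
                    key=lambda c: sum((pi - ci)**2
                                      for pi, ci in zip(p, centroids[c])))
                for p in points]                               # step 2: assign
            if new_assignment == assignment:                   # converged
                return centroids, assignment
            assignment = new_assignment
            for c in range(k):                                 # step 3: centroids
                members = [p for p, a in zip(points, assignment) if a == c]
                if members:
                    centroids[c] = [sum(dim) / len(members)
                                    for dim in zip(*members)]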

[Example data for the mixture model below: one-dimensional values, each labelled with the cluster (A or B) that generated it, e.g. A 51, A 43, B 62, B 64, …]

Probabilistic clustering

 Model data using a mixture of normal distributions  One cluster, one distribution

 governs probabilities of attribute values in that cluster

 Finite mixtures : finite number of clusters

µA=50, σA =5, pA=0.6 µB=65, σB =2, pB=0.4

 Learn the clusters ⇒

 determine their parameters, i.e. mean and standard deviation

 Performance criterion:

 likelihood of training data given the clusters

 Iterative Expectation-Maximization (EM) algorithm

 E step: Calculate cluster probability for each instance  M step: Estimate distribution parameters from cluster probabilities

 Finds a local maximum of the likelihood

Using the mixture model

 Probability that instance x belongs to cluster A (by Bayes’s rule):

    Pr[A | x] = Pr[x | A] Pr[A] / Pr[x] = f(x; μA, σA) pA / Pr[x]

 where f is the normal density

    f(x; μ, σ) = 1 / (√(2π) σ) · e^( −(x − μ)² / (2σ²) )

 Likelihood of an instance given the clusters:

    Pr[x | the distributions] = Σi Pr[x | cluster i] Pr[cluster i]
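For the one-dimensional two-cluster case above, EM fits on a few lines. A hedged Python sketch (initial parameters and the fixed iteration count are assumptions; in practice one iterates until the log-likelihood stops improving):

    # EM for a two-component 1-D Gaussian mixture. E step: cluster membership
    # probabilities per instance; M step: re-estimate mu, sigma, and weight.
    import math

    def density(x, mu, sigma):
        return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

    def em(xs, a, b, iterations=50):
        (mu_a, sd_a, p_a), (mu_b, sd_b, p_b) = a, b
        for _ in range(iterations):
            # E step: Pr[A | x] by Bayes's rule, as in the formula above
            w = [p_a * density(x, mu_a, sd_a) /
                 (p_a * density(x, mu_a, sd_a) + p_b * density(x, mu_b, sd_b))
                 for x in xs]
            # M step: weighted mean, standard deviation, and mixing weight
            def reestimate(weights):
                n = sum(weights)
                mu = sum(wi * x for wi, x in zip(weights, xs)) / n
                var = sum(wi * (x - mu)**2 for wi, x in zip(weights, xs)) / n
                return mu, math.sqrt(var), n / len(xs)
            mu_a, sd_a, p_a = reestimate(w)
            mu_b, sd_b, p_b = reestimate([1 - wi for wi in w])
        return (mu_a, sd_a, p_a), (mu_b, sd_b, p_b)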

Extending the mixture model

 More than two distributions: easy
 Several attributes: easy (assuming independence!)
 Correlated attributes: difficult
   Joint model: bivariate normal distribution with a (symmetric) covariance matrix
   n attributes: need to estimate n + n(n+1)/2 parameters
 Nominal attributes: easy (if independent)
 Missing values: easy
 Can use distributions other than the normal:
   “log-normal” if a predetermined minimum is given
   “log-odds” if bounded from above and below
   Poisson for attributes that are integer counts
 Unknown number of clusters:
   Use cross-validation to estimate k

Bayesian clustering

 Problem: many parameters ⇒ EM overfits
 Bayesian approach: give every parameter a prior probability distribution
   Incorporate the prior into the overall likelihood figure
   Penalizes the introduction of parameters
   E.g. the Laplace estimator for nominal attributes
 Can also have a prior on the number of clusters!
 Implementation: NASA’s AUTOCLASS

Agenda (recap): next is Engineering the input and output


Engineering the input & output

 Attribute selection

 Scheme-independent, scheme-specific

 Attribute discretization

 Unsupervised, supervised

 Data transformations

 Ad hoc, Principal component analysis

 Dirty data

 Data cleansing, robust regression, anomaly detection

 Combining multiple models

 Bagging, randomization, boosting, stacking

 Using unlabeled data

 Co-training

Just apply a learner? – NO!

Attribute selection

 Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5’s performance

 Problem: attribute selection based on smaller and smaller amounts of data

 IBL very susceptible to irrelevant attributes

 Number of training instances required increases exponentially with number of irrelevant attributes

 Naïve Bayes doesn’t have this problem
 Relevant attributes can also be harmful

Data transformations

 Simple transformations can often make a large difference in performance
 Example transformations (not necessarily for performance improvement):
   Difference of two date attributes
   Ratio of two numeric (ratio-scale) attributes
   Concatenating the values of nominal attributes
   Encoding cluster membership
   Adding noise to data
   Removing data randomly or selectively
   Obfuscating the data

 Principal component analysis

Principal component analysis

 Method for identifying the important “directions” in the data
 Can rotate the data into a (reduced) coordinate system given by those directions
 Algorithm:
   1. Find the direction (axis) of greatest variance
   2. Find the direction of greatest variance that is perpendicular to the previous direction, and repeat

 Implementation: find eigenvectors of covariance matrix by diagonalization

 Eigenvectors (sorted by eigenvalues) are the directions
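That is the whole algorithm in linear-algebra terms, and it is short in code too. A hedged NumPy sketch (X is an n x k data matrix; this is one of several equivalent formulations, e.g. via the SVD):

    # PCA: the principal directions are the eigenvectors of the covariance
    # matrix, sorted by eigenvalue (the variance along each direction).
    import numpy as np

    def pca(X, n_components):
        centered = X - X.mean(axis=0)
        cov = np.cov(centered, rowvar=False)
        eigenvalues, eigenvectors = np.linalg.eigh(cov)   # symmetric matrix
        order = np.argsort(eigenvalues)[::-1]             # largest variance first
        components = eigenvectors[:, order[:n_components]]
        return centered @ components                      # rotated, reduced data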

Combining multiple models

 Basic idea: build different “experts,” let them vote
 Advantage:
   often improves predictive performance
 Disadvantage:
   usually produces output that is very hard to analyze
   but: there are approaches that aim to produce a single comprehensible structure
 Methods:
   Bagging
   Randomization
   Boosting
   Stacking

Bagging

 Combining predictions by voting/averaging
   Simplest way
   Each model receives equal weight
 “Idealized” version:
   Sample several training sets of size n (instead of just having one training set of size n)
   Build a classifier for each training set
   Combine the classifiers’ predictions

 If the learning scheme is unstable, bagging almost always improves performance
   A small change in the training data can make a big change in the model (e.g. decision trees)
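In practice, the several training sets are simulated by bootstrap sampling (sampling with replacement) from the one set available. A minimal Python sketch (learner is any function from a training set to a callable model; both names are assumptions):

    # Bagging: train each model on a bootstrap sample of the training set,
    # then combine their predictions by majority vote.
    import random
    from collections import Counter

    def bagging_fit(instances, learner, n_models=10, seed=0):
        rng = random.Random(seed)
        return [learner(rng.choices(instances, k=len(instances)))
                for _ in range(n_models)]

    def bagging_predict(models, new_instance):
        votes = Counter(m(new_instance) for m in models)
        return votes.most_common(1)[0][0]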


Randomization

 Can randomize the learning algorithm instead of the input
 Some algorithms already have a random component: e.g. initial weights in a neural net
 Most algorithms can be randomized, e.g. greedy algorithms:
   Pick from the N best options at random instead of always picking the best
   E.g.: attribute selection in decision trees
 More generally applicable than bagging: e.g. random subsets in a nearest-neighbor scheme
 Can be combined with bagging

Boosting

 Also uses voting/averaging
 Weights models according to performance
 Iterative: new models are influenced by the performance of previously built ones
   Encourage the new model to become an “expert” for instances misclassified by the earlier models
   Intuitive justification: models should be experts that complement each other

 Several variants

Stacking

 To combine the predictions of base learners, don’t vote, use a meta learner
   Base learners: level-0 models
   Meta learner: level-1 model
   Predictions of the base learners are input to the meta learner
 Base learners are usually different schemes
 Can’t use predictions on the training data to generate data for the level-1 model!
   Instead use a cross-validation-like scheme

 Hard to analyze theoretically: “black magic”
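A hedged sketch of that cross-validation-like scheme (base_learners map a training set to a callable model, meta_learner does the same for level-1 data; all names are illustrative assumptions):

    # Stacking: level-1 training data must come from predictions on instances
    # the level-0 models did not see, so it is built fold by fold.
    import random

    def stacking_fit(instances, base_learners, meta_learner, folds=10, seed=0):
        data = instances[:]
        random.Random(seed).shuffle(data)
        level1 = []
        for i in range(folds):
            test = data[i::folds]                               # held-out fold
            train = [r for j, r in enumerate(data) if j % folds != i]
            models = [fit(train) for fit in base_learners]
            for row in test:
                # meta-instance: base predictions as attributes, true class last
                level1.append([m(row[:-1]) for m in models] + [row[-1]])
        final_models = [fit(data) for fit in base_learners]     # refit on all data
        return final_models, meta_learner(level1)

    def stacking_predict(final_models, meta_model, new_instance):
        return meta_model([m(new_instance) for m in final_models])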

Using unlabeled data

 Semisupervised learning: attempts to use unlabeled data as well as labeled data

 The aim is to improve classification performance

 Why try to do this? Unlabeled data is often plentiful and labeling data can be expensive

 Web mining: classifying web pages
 Text mining: identifying names in text
 Video mining: classifying people in the news

 Leveraging the large pool of unlabeled examples would be very attractive

Co-training

 Method for learning from multiple views (multiple sets of attributes), e.g.:
   First set of attributes describes the content of a web page
   Second set of attributes describes the links that point to the web page
 Step 1: build a model from each view
 Step 2: use the models to assign labels to unlabeled data
 Step 3: select the unlabeled examples that were most confidently predicted (ideally, preserving the ratio of classes)
 Step 4: add those examples to the training set
 Step 5: go to Step 1 until the data is exhausted
 Assumption: the views are independent

Agenda (recap)


Data mining with Weka

 There is no magic in data mining

 Instead, a huge array of alternative techniques

 There is no single universal “best method”

 Experiment! Which ones work best on your problem?

 The WEKA machine learning workbench

 http://www.cs.waikato.ac.nz/ml/weka

 Data Mining: Practical Machine Learning Tools and Techniques, by Ian H. Witten and Eibe Frank, 2005