SLIDE 1

An Exercise in Machine Learning

http://www.cs.iastate.edu/~cs573x/bbsilab.html

  • Machine Learning Software
  • Preparing Data
  • Building Classifiers
  • Interpreting Results
SLIDE 2

Machine Learning Software

  • Suites (General Purpose)
    • WEKA (Source: Java)
    • MLC++ (Source: C++)
    • SAS
    • List from KDNuggets (Various)
  • Specific
    • Classification: C5.0, SVMlight
    • Association Rule Mining
    • Bayesian Net …
  • Commercial vs. Free vs. Programming

SLIDE 3

What does WEKA do?

  • Implementation of state-of-the-art learning algorithms
  • Main strengths are in classification
  • Regression, association rule and clustering algorithms
  • Extensible to try new learning schemes
  • Large variety of handy tools (transforming datasets, filters, visualization, etc.)

SLIDE 4

WEKA resources

  • API Documentation, Tutorial, Source code
  • WEKA mailing list
  • Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
  • Weka-related projects:
    • Weka-Parallel – parallel processing for Weka
    • RWeka – linking R and Weka
    • YALE – Yet Another Learning Environment
    • Many others…

SLIDE 5

Getting Started

  • Installation (Java runtime + WEKA)
  • Setting up the environment (CLASSPATH)
  • Reference book and online API documentation
  • Preparing data sets
  • Running WEKA to build classifiers
  • Interpreting results

SLIDE 6

ARFF Data Format

  • Attribute-Relation File Format
  • Header – describes the attribute types
  • Data – the instances (examples), as comma-separated lists
  • Use the right data format: Filestem, CSV → ARFF format
  • Use C45Loader and CSVLoader to convert (a minimal ARFF sketch follows below)
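As a concrete reference point, here is a minimal ARFF sketch in the spirit of WEKA's classic weather toy data set (the same outlook/humidity/windy/play attributes appear in the decision-tree output later in these slides); the attribute order and data rows are illustrative rather than copied from the distributed file:

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,85,85,FALSE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes

The header declares each attribute and its type (nominal with an explicit value set, or numeric), and every line after @data is one comma-separated instance.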

SLIDE 7

Launching WEKA

SLIDE 8

Load Dataset into WEKA

SLIDE 9

Data Filters

  • Useful support for data preprocessing
  • Removing or adding attributes, resampling the dataset, removing examples, etc.
  • Creates stratified cross-validation folds of the given dataset; class distributions are approximately retained within each fold
  • Typically split data as 2/3 for training and 1/3 for testing

SLIDE 10

Building Classifiers

  • A classifier model is a mapping from dataset attributes to the class (target) attribute; how it is created and what form it takes differ between learners
  • Decision Tree and Naïve Bayes classifiers (a minimal WEKA API sketch follows below)
  • Which one is better?
  • No Free Lunch!
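To make the workflow concrete, a minimal sketch of building a classifier through the WEKA Java API rather than the GUI (class and method names are standard WEKA 3.x; weather.arff is a placeholder file name):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class BuildTree {
        public static void main(String[] args) throws Exception {
            // Load an ARFF data set and mark the last attribute as the class
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // Build a C4.5-style decision tree with default options
            J48 tree = new J48();
            tree.buildClassifier(data);

            // Print the tree in the same textual form shown on the output slides
            System.out.println(tree);
        }
    }

This is the programmatic equivalent of loading the data set in the Explorer, choosing trees > J48 on the Classify tab, and pressing Start.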

SLIDE 11

Building Classifier

SLIDE 12

(1) weka.classifiers.rules.ZeroR

  • Builds and uses a 0-R classifier; predicts the mean (for a numeric class) or the mode (for a nominal class)

(2) weka.classifiers.bayes.NaiveBayes

  • Class for building a Naive Bayesian classifier

SLIDE 13

(3) weka.classifiers.trees.J48

  • Class for generating an unpruned or a pruned C4.5 decision tree

SLIDE 14

Test Options

  • Percentage split (2/3 training; 1/3 testing)
  • Cross-validation
  • Estimates the generalization error by resampling when data are limited; the error estimate is averaged over the folds
  • Stratified 10-fold
  • Leave-one-out (LOO)
  • 10-fold vs. LOO (a cross-validation sketch follows below)
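A minimal sketch of running these test options with weka.classifiers.Evaluation; setting the number of folds equal to the number of instances turns cross-validation into leave-one-out (the file name is a placeholder):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // Stratified 10-fold cross-validation of a J48 tree
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toClassDetailsString());
            System.out.println(eval.toMatrixString());

            // Leave-one-out: one fold per instance
            Evaluation loo = new Evaluation(data);
            loo.crossValidateModel(new J48(), data, data.numInstances(), new Random(1));
            System.out.println(loo.toSummaryString());
        }
    }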

SLIDE 15

Understanding Output

SLIDE 16

Decision Tree Output (1)

    === Error on training data ===

    Correctly Classified Instances      14    100 %
    Incorrectly Classified Instances     0      0 %
    Kappa statistic                      1
    Mean absolute error                  0
    Root mean squared error              0
    Relative absolute error              0 %
    Root relative squared error          0 %
    Total Number of Instances           14

    === Detailed Accuracy By Class ===

    TP Rate  FP Rate  Precision  Recall  F-Measure  Class
      1        0         1         1        1       yes
      1        0         1         1        1       no

    === Confusion Matrix ===

     a b   <-- classified as
     9 0 | a = yes
     0 5 | b = no

    J48 pruned tree
    ---------------

    outlook = sunny
    |   humidity <= 75: yes (2.0)
    |   humidity > 75: no (3.0)
    outlook = overcast: yes (4.0)
    outlook = rainy
    |   windy = TRUE: no (2.0)
    |   windy = FALSE: yes (3.0)

    Number of Leaves  : 5
    Size of the tree  : 8

SLIDE 17

Decision Tree Output (2)

    === Stratified cross-validation ===

    Correctly Classified Instances       9    64.2857 %
    Incorrectly Classified Instances     5    35.7143 %
    Kappa statistic                      0.186
    Mean absolute error                  0.2857
    Root mean squared error              0.4818
    Relative absolute error             60 %
    Root relative squared error         97.6586 %
    Total Number of Instances           14

    === Detailed Accuracy By Class ===

    TP Rate  FP Rate  Precision  Recall  F-Measure  Class
     0.778    0.6       0.7      0.778    0.737     yes
     0.4      0.222     0.5      0.4      0.444     no

    === Confusion Matrix ===

     a b   <-- classified as
     7 2 | a = yes
     3 2 | b = no

SLIDE 18

Performance Measures

  • Accuracy & error rate
  • Mean absolute error
  • Root mean-squared error (square root of the average quadratic loss)
  • Confusion matrix – contingency table
  • True Positive rate & False Positive rate
  • Precision & Recall
  • F-Measure = 2 * Precision * Recall / (Precision + Recall) (worked example below)
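To tie these measures to the cross-validation output on the previous slide, a worked calculation for the "yes" class from its confusion matrix row (7 true positives, 2 false negatives) and column (3 false positives):

    Precision = 7 / (7 + 3) = 0.7
    Recall    = 7 / (7 + 2) ≈ 0.778
    F-Measure = 2 * 0.7 * 0.778 / (0.7 + 0.778) ≈ 0.737

These match the 0.7 / 0.778 / 0.737 values reported in the "Detailed Accuracy By Class" table.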

SLIDE 19

Decision Tree Pruning

  • Overcomes over-fitting
  • Pre-pruning and post-pruning
  • Reduced error pruning
  • Subtree raising with different confidence factors
  • Comparing tree size and accuracy (the corresponding J48 options are sketched below)
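A minimal sketch of how these pruning choices map onto J48 options in the WEKA Java API (the same flags can be set in the Explorer's classifier configuration dialog); the numeric values are placeholders, not recommendations:

    import weka.classifiers.trees.J48;

    public class PruningOptions {
        public static void main(String[] args) throws Exception {
            // No pruning at all (command-line flag -U)
            J48 unpruned = new J48();
            unpruned.setUnpruned(true);

            // Reduced error pruning on internal held-out folds (-R, -N)
            J48 rep = new J48();
            rep.setReducedErrorPruning(true);
            rep.setNumFolds(3);                  // placeholder fold count

            // Default C4.5 pruning with subtree raising and a confidence factor (-C)
            J48 raised = new J48();
            raised.setSubtreeRaising(true);      // on by default; -S would turn it off
            raised.setConfidenceFactor(0.25f);   // smaller values prune more aggressively

            // Build each variant on the same data set and compare tree size vs. accuracy
        }
    }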

SLIDE 20

Subtree Replacement

  • Bottom-up: a tree is considered for replacement once all its subtrees have been considered

SLIDE 21

Subtree Raising

  • Deletes a node and redistributes its instances
  • Slower than subtree replacement

SLIDE 22

Naïve Bayesian Classifier

  • Outputs a CPT and the same set of performance measures
  • By default, uses a normal distribution to model numeric attributes
  • A kernel density estimator can improve performance when the normality assumption is incorrect (-K option; see the sketch below)
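A minimal sketch of enabling the kernel density estimator programmatically, equivalent to passing the kernel-estimator flag to WEKA's standard NaiveBayes class (the file name is a placeholder):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;

    public class KernelNaiveBayes {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("surface.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            NaiveBayes nb = new NaiveBayes();
            nb.setUseKernelEstimator(true);   // kernel densities instead of the normality assumption
            nb.buildClassifier(data);
            System.out.println(nb);           // prints the per-class estimators / CPT
        }
    }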

SLIDE 23

Data Sets to work on

  • Data sets were preprocessed into ARFF format
  • Three data sets from the UCI repository
  • Two data sets from computational biology:
    • Protein Function Prediction
    • Surface Residue Prediction

SLIDE 24

Protein Function Prediction

  • Build a Decision Tree classifier that assigns protein sequences to functional families based on their characteristic motif compositions
  • Each attribute (motif) has a Prosite accession number: PS####
  • Class labels use the Prosite Doc ID: PDOC####
  • 73 attributes (binary) & 10 classes (PDOC)
  • Suggested method: use 10-fold CV and prune the tree using the subtree raising method

SLIDE 25

Surface Residue Prediction

  • Prediction is based on the identity of the target residue and its 4 sequence neighbors
  • Window size = 5: X1 X2 X3 X4 X5
  • Is the target residue on the surface or not?
  • 5 attributes and binary classes
  • Suggested method: use a Naïve Bayesian classifier with no kernels

SLIDE 26

To Do List

  • Generate a Decision Tree classifier for each data set using 10-fold CV with no pruning
  • Generate a Decision Tree classifier for each data set using 10-fold CV with reduced error pruning
  • Generate a Decision Tree classifier for each data set using 10-fold CV with subtree raising pruning
  • Generate a Decision Tree classifier for each data set using 10-fold CV with binary splits
  • Generate a Naïve Bayes classifier for each data set (a combined sketch follows below)
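Putting the to-do list together, a minimal sketch that runs the four J48 variants plus Naive Bayes under 10-fold CV on one ARFF file; the file name and the exact option strings are illustrative and should be adapted per data set:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.Utils;

    public class ToDoRuns {
        public static void main(String[] args) throws Exception {
            Instances data = new Instances(new BufferedReader(new FileReader("protein.arff")));
            data.setClassIndex(data.numAttributes() - 1);

            // J48 option strings for the four required decision-tree runs
            String[] treeOptions = {
                "-U",   // no pruning
                "-R",   // reduced error pruning
                "",     // default C4.5 pruning; subtree raising is on by default
                "-B"    // binary splits
            };
            for (String opts : treeOptions) {
                J48 tree = new J48();
                tree.setOptions(Utils.splitOptions(opts));
                evaluate(tree, data, "J48 " + opts);
            }
            evaluate(new NaiveBayes(), data, "NaiveBayes");
        }

        static void evaluate(Classifier c, Instances data, String label) throws Exception {
            // Stratified 10-fold cross-validation, as required for every run
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.println(label);
            System.out.println(eval.toSummaryString());
        }
    }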