An Exercise in Machine Learning
http://www.cs.iastate.edu/~cs573x/bbsilab.html
- Machine Learning Software
- Preparing Data
- Building Classifiers
- Interpreting Results
Machine Learning Software
WEKA (Source: Java)
MLC++ (Source: C++)
SAS
List from KDNuggets (Various)
Classification: C5.0, SVMlight
Association Rule Mining
Bayesian Net …
Weka-Parallel - parallel processing for Weka
RWeka - linking R and Weka
YALE - Yet Another Learning Environment
Many others…
CLASSPATH)
Attribute-Relation File Format
Header: describes the attribute types
Data: the instances (examples), as a comma-separated list
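The header/data split described above can be illustrated with a small reader. This is a minimal sketch in plain Python, not Weka's own loader: the function name `parse_arff` and the sample file contents are assumptions made for illustration, and quoted values, sparse instances, and date attributes are deliberately not handled.

```python
def parse_arff(text):
    """Parse a minimal ARFF string into (relation, attributes, rows)."""
    relation = None
    attributes = []  # (name, type_spec) pairs read from the header
    rows = []        # one list of string values per instance
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif keyword == "@data":
            in_data = True  # everything after this line is an instance
    return relation, attributes, rows


# Illustrative sample only; the values are made up for this sketch.
SAMPLE = """\
% toy file
@relation weather
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
70, TRUE, yes
90, FALSE, no
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, attributes, rows)
```

The header lines (`@relation`, `@attribute`) carry the attribute types; every line after `@data` is one instance.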
Use the right data format:
Filestem, CSV to ARFF format
Use C45Loader and CSVLoader to convert
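To show what such a conversion involves, here is a rough sketch of turning CSV text into ARFF. It does not reproduce Weka's CSVLoader; the helper names `csv_to_arff` and `_is_number` are hypothetical, and the type inference (numeric if every value parses as a number, nominal otherwise) is a simplifying assumption.

```python
import csv
import io


def _is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text (first row = column names) into an ARFF string."""
    records = list(csv.reader(io.StringIO(csv_text)))
    header, data = records[0], records[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data]
        if all(_is_number(v) for v in values):
            type_spec = "numeric"
        else:
            # Nominal attribute: list distinct values in first-seen order.
            distinct = list(dict.fromkeys(values))
            type_spec = "{" + ",".join(distinct) + "}"
        lines.append("@attribute " + name + " " + type_spec)
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Illustrative input; the column names and values are made up.
arff = csv_to_arff("humidity,windy,play\n70,TRUE,yes\n90,FALSE,no\n", "weather")
print(arff)
```

In practice the last column is usually treated as the class attribute, so it is worth checking that it was inferred as nominal.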
No Free Lunch!
Class for generating an unpruned or a pruned C4.5 decision tree.
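C4.5 chooses splits on nominal attributes using the gain ratio, which is the information gain divided by the split information. The sketch below computes both for one candidate split. It is not Weka's J48 code: the function names and the class counts per branch are illustrative assumptions, and continuous attributes, missing values, and pruning are left out.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain_ratio(branches):
    """Return (information gain, gain ratio) for a candidate split.

    `branches` holds one list of class counts per branch of the split.
    Assumes at least two non-empty branches, so split info is non-zero.
    """
    total = sum(sum(b) for b in branches)
    # Class counts before splitting: add the branches column by column.
    parent = [sum(col) for col in zip(*branches)]
    remainder = sum(sum(b) / total * entropy(b) for b in branches)
    gain = entropy(parent) - remainder
    split_info = entropy([sum(b) for b in branches])
    return gain, gain / split_info


# Illustrative counts: 14 instances (9 yes, 5 no) split into three branches.
gain, ratio = gain_ratio([[2, 3], [4, 0], [3, 2]])
print(round(gain, 4), round(ratio, 4))
```

Dividing by the split information penalizes attributes that fragment the data into many small branches.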
Estimate the generalization error by resampling when data are limited; report the error estimate averaged over the resamples.
Stratified 10-fold
Leave-one-out (Loo)
10-fold vs. Loo
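The two resampling schemes listed above differ only in the number of folds: leave-one-out is k-fold cross-validation with k equal to the number of instances. A minimal sketch of stratified fold assignment follows. The function names and the `seed` parameter are assumptions for illustration, and this is not how Weka implements stratification internally.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each instance index to one of k folds, stratified by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a similar class mix.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validation_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 14 instances with 9 "yes" and 5 "no", as in the outputs shown in this deck.
labels = ["yes"] * 9 + ["no"] * 5
splits = list(cross_validation_splits(labels, k=10))
loo_splits = list(cross_validation_splits(labels, k=len(labels)))
print(len(splits), len(loo_splits))
```

The error estimate is then the average of the per-fold test errors. With only 14 instances, 10 folds leave one or two test instances per fold, so individual fold errors are very noisy.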
=== Error on training data ===
Correctly Classified Instances      14    100 %
Incorrectly Classified Instances     0      0 %
Kappa statistic                      1
Mean absolute error                  0
Root mean squared error              0
Relative absolute error              0 %
Root relative squared error          0 %
Total Number of Instances           14

=== Detailed Accuracy By Class ===
TP   FP   Precision   Recall   F-Measure   Class
1    0    1           1        1           yes
1    0    1           1        1           no

=== Confusion Matrix ===
a b   <-- classified as
9 0 | a = yes
0 5 | b = no
J48 pruned tree
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)
Number of Leaves : 5
Size of the tree : 8
=== Stratified cross-validation ===
Correctly Classified Instances       9    64.2857 %
Incorrectly Classified Instances     5    35.7143 %
Kappa statistic                      0.186
Mean absolute error                  0.2857
Root mean squared error              0.4818
Relative absolute error             60 %
Root relative squared error         97.6586 %
Total Number of Instances           14

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.778     0.6       0.7         0.778    0.737       yes
0.4       0.222     0.5         0.4      0.444       no

=== Confusion Matrix ===
a b   <-- classified as
7 2 | a = yes
3 2 | b = no
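Every summary figure in the cross-validation report can be derived from the confusion matrix alone. The sketch below recomputes accuracy, the kappa statistic, and the per-class rates from the matrix shown above. The function name `summarize` is an assumption for illustration, and the code assumes every class is predicted at least once, since precision is otherwise undefined.

```python
def summarize(matrix, class_names):
    """Compute accuracy, kappa and per-class rates from a confusion matrix.

    matrix[i][j] = number of instances of actual class i predicted as class j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(row_sums[i] * col_sums[i] for i in range(n)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    per_class = {}
    for i, name in enumerate(class_names):
        tp = matrix[i][i]
        fp = col_sums[i] - tp
        tp_rate = tp / row_sums[i]            # identical to recall
        fp_rate = fp / (total - row_sums[i])
        precision = tp / col_sums[i]
        f_measure = 2 * precision * tp_rate / (precision + tp_rate)
        per_class[name] = (tp_rate, fp_rate, precision, f_measure)
    return accuracy, kappa, per_class


# Confusion matrix from the stratified cross-validation output.
accuracy, kappa, per_class = summarize([[7, 2], [3, 2]], ["yes", "no"])
print(round(accuracy * 100, 4), round(kappa, 3))
for name, rates in per_class.items():
    print(name, [round(r, 3) for r in rates])
```

The kappa of 0.186 shows that the cross-validated tree is only slightly better than chance agreement, in contrast to the perfect score on the training data.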
Protein Function Prediction
Surface Residue Prediction
Build a decision tree classifier that assigns protein sequences to functional families based on characteristic motif compositions.
Each attribute (motif) has a Prosite access number: PS####
Class labels use the Prosite Doc ID: PDOC####
73 attributes (binary) and 10 classes (PDOC).
Suggested method: use 10-fold CV and prune the tree using the subtree raising method.
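The binary attributes in this exercise record whether each motif occurs in a sequence. As a rough sketch of that encoding, the snippet below builds one row per sequence with a 0/1 value per motif and the family label in the last column. The function name and all identifiers (`PS_A`, `PDOC_X`, and so on) are hypothetical placeholders, not real Prosite entries, and the actual data set has 73 motif attributes rather than the three shown.

```python
def encode_motifs(sequences, motif_ids):
    """Turn per-sequence motif sets into binary attribute rows plus a class label.

    `sequences` is a list of (set_of_motifs_found, family_label) pairs.
    """
    rows = []
    for motifs_found, family in sequences:
        row = [1 if motif in motifs_found else 0 for motif in motif_ids]
        rows.append(row + [family])
    return rows


# Hypothetical placeholder identifiers, used only to show the layout.
motif_ids = ["PS_A", "PS_B", "PS_C"]
sequences = [
    ({"PS_A", "PS_C"}, "PDOC_X"),
    ({"PS_B"}, "PDOC_Y"),
]
rows = encode_motifs(sequences, motif_ids)
for row in rows:
    print(row)
```

Each row can then be written out as one line of the data section of an ARFF file, with the motif columns declared as nominal `{0,1}` attributes and the last column as the class.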
X1 X2 X3 X4 X5