Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 5 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Credibility: Evaluating what’s been learned
Issues: training, testing, tuning; predicting performance
♦ Error on the training data is not a good indicator of performance on future data; otherwise 1-NN would be the optimum classifier!
♦ Simple solution if lots of (labeled) data is available: split data into training and test set
♦ However, (labeled) data is usually limited, so more sophisticated techniques need to be used
♦ Possible performance measures: number of correct classifications, accuracy of probability estimates, error in numeric predictions
♦ Costs assigned to different types of errors: many practical applications involve costs
♦ Success: instance’s class is predicted correctly
♦ Error: instance’s class is predicted incorrectly
♦ Error rate: proportion of errors made over the whole set of instances
♦ Assumption: both training data and test data are representative samples of the underlying problem
♦ Test and training data may differ in nature; example: classifiers built using customer data from two different towns A and B
♦ To estimate the performance of the classifier from town A in a completely new town, test it on data from B
♦ How close is an estimated error rate to the true error rate? Depends on the amount of test data
♦ Prediction is like tossing a biased coin: “head” is a “success”, “tail” is an “error”
♦ Statistical theory provides us with confidence intervals for the true underlying proportion
♦ A c% confidence interval [−z, z] for a random variable X satisfies: Pr[−z ≤ X ≤ z] = c
♦ With a symmetric distribution: Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]
♦ Confidence limits for the normal distribution with 0 mean and a variance of 1:

  Pr[X ≥ z]   z
  0.1%        3.09
  0.5%        2.58
  1%          2.33
  5%          1.65
  10%         1.28
  20%         0.84
  40%         0.25

♦ Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
♦ To use this, the random variable must first be reduced to have 0 mean and unit variance
♦ Transformed value for f: (f − p) / √(p(1 − p)/N)
  (i.e. subtract the mean and divide by the standard deviation)
♦ Resulting equation:
  Pr[−z ≤ (f − p) / √(p(1 − p)/N) ≤ z] = c
♦ Solving for p:
  p = ( f + z²/(2N) ± z × √(f/N − f²/N + z²/(4N²)) ) / (1 + z²/N)
♦ f = 75%, N = 1000, c = 80% (so z = 1.28): p ∈ [0.732, 0.767]
♦ f = 75%, N = 100, c = 80%: p ∈ [0.691, 0.801]
♦ Note: the normal-distribution assumption is only valid for large N (i.e. N > 100)
♦ f = 75%, N = 10, c = 80%: p ∈ [0.549, 0.881] (should be taken with a grain of salt); see the sketch below
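The interval above can be computed directly. A minimal Python sketch (the book’s examples use WEKA, not Python; the function name wilson_interval is our own):

```python
import math

def wilson_interval(f, n, z=1.28):
    """Confidence interval for the true success rate p given an observed
    success rate f over n trials; z = 1.28 corresponds to c = 80%
    (i.e. Pr[X >= z] = 10% in each tail)."""
    center = f + z * z / (2 * n)
    spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - spread) / denom, (center + spread) / denom

print(wilson_interval(0.75, 1000))  # ~ (0.732, 0.767), matching the example
```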
♦ The holdout method reserves a certain amount of data for testing and uses the remainder for training
♦ Usually: one third for testing, the rest for training
♦ Problem: the samples might not be representative; example: a class might be missing in the test data
♦ Advanced version uses stratification, which ensures that each class is represented with approximately equal proportions in both subsets (see the sketch below)
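A minimal sketch of a stratified holdout split, using scikit-learn rather than the book’s WEKA (the iris data is just a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# One third for testing, the rest for training; stratify=y keeps the
# class proportions approximately equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
```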
♦ The holdout estimate can be made more reliable by repeating the process with different subsamples
♦ In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
♦ The error rates on the different iterations are averaged to yield an overall error rate
♦ Still not optimal: the different test sets overlap; can we prevent overlapping?
♦ First step: split data into k subsets of equal size
♦ Second step: use each subset in turn for testing, the remainder for training
♦ Standard method for evaluation: stratified ten-fold cross-validation
♦ Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate, and there is also some theoretical evidence for this
♦ Even better: repeated stratified cross-validation, e.g. ten-fold cross-validation is repeated ten times and the results are averaged (reduces the variance); see the sketch below
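A minimal sketch of repeated stratified ten-fold cross-validation with scikit-learn (the classifier and dataset are placeholders, not the book’s choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Ten folds, repeated ten times; averaging the 100 estimates reduces variance.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())
```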
♦ Leave-One-Out is a particular form of cross-validation: set the number of folds to the number of training instances
♦ I.e., for n training instances, build the classifier n times
♦ Makes best use of the data but is computationally expensive (exception: NN)
♦ Disadvantage: stratification is not possible; Leave-One-Out guarantees a non-stratified sample because there is only one instance in the test set!
♦ Extreme example: a completely random dataset split equally into two classes
♦ The best inducer predicts the majority class, giving 50% accuracy on fresh data, yet the Leave-One-Out-CV estimate is 100% error!
♦ Cross-validation uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set
♦ The bootstrap uses sampling with replacement to form the training set: sample a dataset of n instances n times with replacement to form a new dataset of n instances
♦ Use this data as the training set and use the instances from the original dataset that don’t occur in the new training set for testing
♦ This is also called the 0.632 bootstrap: a particular instance has a probability of 1 − 1/n of not being picked in a single draw
♦ Thus its probability of ending up in the test data is:
  (1 − 1/n)^n ≈ e⁻¹ ≈ 0.368
♦ This means the training data will contain approximately 63.2% of the distinct instances
♦ The error estimate on the test data will be very pessimistic, since the classifier is trained on just ~63% of the instances
♦ Therefore, combine it with the resubstitution error (see the sketch below):
  err = 0.632 × e_test_instances + 0.368 × e_training_instances
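A minimal sketch of the 0.632 bootstrap estimate (the function name bootstrap_632 is our own; X and y are assumed to be NumPy arrays, and the classifier is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_632(X, y, n_reps=50, seed=0):
    rng = np.random.default_rng(seed)
    n, errs = len(y), []
    for _ in range(n_reps):
        train = rng.integers(0, n, size=n)           # n draws with replacement
        test = np.setdiff1d(np.arange(n), train)     # never-picked instances (~36.8%)
        clf = DecisionTreeClassifier().fit(X[train], y[train])
        e_test = np.mean(clf.predict(X[test]) != y[test])
        e_train = np.mean(clf.predict(X[train]) != y[train])  # resubstitution error
        errs.append(0.632 * e_test + 0.368 * e_train)
    return np.mean(errs)  # average over several replacement samples
```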
♦ Consider the random dataset from above: a perfect memorizer will achieve 0% resubstitution error and ~50% error on test data
♦ Bootstrap estimate for this classifier:
  err = 0.632 × 50% + 0.368 × 0% = 31.6%
♦ True expected error: 50%, so the bootstrap estimate is misleadingly optimistic here
♦ Need to show convincingly that a particular method works better than another
♦ Want to show that scheme A is better than scheme B in a particular domain
♦ For a given amount of training data, on average, across all possible training sets
♦ Assume we had an infinite amount of data from the domain:
♦ Sample infinitely many datasets of a specified size
♦ Obtain a cross-validation estimate on each dataset for each scheme
♦ Check if the mean accuracy for scheme A is better than the mean accuracy for scheme B
♦ In practice we have limited data and a limited number of estimates for computing the mean
♦ Student’s t-test tells us whether the means of two samples are significantly different
♦ In our case the samples are cross-validation estimates for different datasets from the domain
♦ Use a paired t-test because the individual samples are paired: the same CV is applied twice
William Gosset
Born: 1876 in Canterbury; Died: 1937 in Beaconsfield, England Obtained a post as a chemist in the Guinness brewery in Dublin in 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".
♦ x1, x2, …, xk and y1, y2, …, yk are the 2k samples for the k different datasets; mx and my are their means
♦ With enough samples, the mean of a set of independent samples is normally distributed
♦ The estimated variances of the means are σx²/k and σy²/k
♦ If μx and μy are the true means, then
  (mx − μx) / √(σx²/k)   and   (my − μy) / √(σy²/k)
  are approximately normally distributed with mean 0, variance 1
♦ With small samples (k < 100), the mean follows Student’s distribution with k − 1 degrees of freedom
♦ Confidence limits, assuming we have 10 estimates (9 degrees of freedom), compared with the normal distribution:

  Pr[X ≥ z]   z (9 degrees of freedom)   z (normal distribution)
  0.1%        4.30                       3.09
  0.5%        3.25                       2.58
  1%          2.82                       2.33
  5%          1.83                       1.65
  10%         1.38                       1.28
  20%         0.88                       0.84
♦ Let md = mx − my; the difference of the means also has a Student’s distribution with k − 1 degrees of freedom
♦ Let σd² be the variance of the difference
♦ The standardized version of md is called the t-statistic:
  t = md / √(σd²/k)
♦ We use t to perform the t-test
♦ Fix a significance level: if a difference is significant at the α% level, there is a (100 − α)% chance that the true means really differ
♦ Because the test is two-tailed, look up the z value corresponding to α/2; if t ≤ −z or t ≥ z, the difference is significant, i.e. the null hypothesis that the difference is zero can be rejected (see the sketch below)
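A minimal sketch of the paired t-test using scipy (the ten per-fold accuracies are invented for illustration; the book itself uses WEKA, not Python):

```python
from scipy import stats

acc_A = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
acc_B = [0.77, 0.76, 0.80, 0.78, 0.79, 0.75, 0.78, 0.77, 0.81, 0.76]
t, p = stats.ttest_rel(acc_A, acc_B)  # paired test, k - 1 = 9 degrees of freedom
print(t, p)  # reject the null hypothesis if p falls below the significance level
```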
♦ If the observations are unpaired (e.g. k estimates for one scheme and j estimates for the other), use an unpaired t-test with min(k, j) − 1 degrees of freedom; the statistic becomes:
  t = (mx − my) / √(σx²/k + σy²/j)
♦ If data has to be re-used, the samples become dependent, e.g. when running cross-validations with different randomizations on the same data; insignificant differences can then appear significant
♦ A heuristic fix is the corrected resampled t-test: assume we use the repeated hold-out method, with n1 instances for training and n2 for testing
♦ The new test statistic is (see the sketch below):
  t = md / √((1/k + n2/n1) × σd²)
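A minimal sketch of this statistic (the function name is our own; diffs holds the k observed differences between the two schemes):

```python
import math

def corrected_resampled_t(diffs, n1, n2):
    """Corrected resampled t-statistic for k repeated holdout runs with
    n1 training and n2 test instances."""
    k = len(diffs)
    md = sum(diffs) / k
    var = sum((d - md) ** 2 for d in diffs) / (k - 1)  # sample variance of differences
    return md / math.sqrt((1 / k + n2 / n1) * var)
```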
♦ p1 … pk are the probability estimates for an instance; c is the index of the instance’s actual class; a1 … ak = 0, except for ac, which is 1
♦ The quadratic loss is:
  ∑j (pj − aj)² = ∑j≠c pj² + (1 − pc)²
♦ We want to minimize the expected loss E[∑j (pj − aj)²]
♦ This is minimized when pj = pj*, the true probabilities
♦ The informational loss function is −log₂ pc, where c is the index of the instance’s actual class: the number of bits required to communicate the actual class
♦ Let p1* … pk* be the true class probabilities; the expected loss
  −p1* × log₂ p1 − … − pk* × log₂ pk
  is minimized when pj = pj*
♦ Both encourage honesty
♦ The quadratic loss function takes into account all class probability estimates for an instance
♦ The informational loss focuses only on the probability estimate for the actual class
♦ Quadratic loss is bounded by 1 + ∑j pj², so it can never exceed 2
♦ Informational loss can be infinite
♦ A sketch of both losses follows below
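A minimal sketch of both loss functions for a single instance (the function names are our own; p is the list of predicted class probabilities, c the actual class index):

```python
import math

def quadratic_loss(p, c):
    a = [1.0 if j == c else 0.0 for j in range(len(p))]
    return sum((pj - aj) ** 2 for pj, aj in zip(p, a))  # never exceeds 2

def informational_loss(p, c):
    return -math.log2(p[c])  # diverges to infinity as p[c] approaches 0

print(quadratic_loss([0.7, 0.2, 0.1], 0))      # 0.14
print(informational_loss([0.7, 0.2, 0.1], 0))  # ~0.515 bits
```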
♦ In practice, different types of classification errors often incur different costs. Examples:
♦ Terrorist profiling
♦ Loan decisions
♦ Oil-slick detection
♦ Fault diagnosis
♦ Promotional mailing
♦ Confusion matrix for a two-class problem:

                        Predicted class
                        No                No | Yes
  Actual class   No     True negative     False positive
                 Yes    False negative    True positive
♦ [Figure omitted: success counts for the actual predictor (left) vs. a random predictor (right)]
♦ The Kappa statistic measures the relative improvement over a random predictor (see the sketch below):
  κ = (Dobserved − Drandom) / (Dperfect − Drandom)
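A minimal sketch of the Kappa computation from a confusion matrix (the matrix values are illustrative):

```python
import numpy as np

cm = np.array([[88, 10, 2],    # rows: actual class, columns: predicted class
               [14, 40, 6],
               [18, 10, 12]])
total = cm.sum()
d_observed = np.trace(cm)      # successes of the actual predictor
# Expected successes of a random predictor with the same row/column totals:
d_random = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total
kappa = (d_observed - d_random) / (total - d_random)  # D_perfect = total
print(kappa)  # ~0.49 for these numbers
```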
♦ The cost of an error is given by the appropriate entry in the cost matrix
♦ Basic idea: only predict a high-cost class when very confident about the prediction
♦ Normally we just predict the most likely class; here, we should make the prediction that minimizes the expected cost
♦ The expected cost is the dot product of the vector of class probabilities and the appropriate column in the cost matrix; choose the class that minimizes it (see the sketch below)
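A minimal sketch of the expected-cost computation (the probabilities and cost matrix are made up; cost[i][j] is the cost of predicting class j when the actual class is i):

```python
import numpy as np

probs = np.array([0.3, 0.7])   # estimated class probabilities for one instance
cost = np.array([[0, 10],      # actual class 0: predicting class 1 costs 10
                 [1,  0]])     # actual class 1: predicting class 0 costs 1
expected = probs @ cost        # dot product with each column of the cost matrix
print(expected)                # [0.7, 3.0]
print(expected.argmin())       # predict class 0 despite class 1 being more likely
```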
♦ Most learning schemes do not perform cost-sensitive learning: they generate the same classifier no matter what costs are assigned to the different classes
♦ Example: promotional mailout to 1,000,000 households; mailing to all of them, 0.1% respond (1000)
♦ A data mining tool identifies the 100,000 most promising, 0.4% of these respond (400): 40% of the responses for 10% of the cost may pay off
♦ Identifying the 400,000 most promising, 0.2% respond (800)
♦ A lift chart allows a visual comparison of such scenarios
♦ To generate a lift chart, sort the instances by predicted probability of being positive:

  Rank   Predicted probability   Actual class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  …      …                       …

♦ The x axis is the sample size; the y axis is the number of true positives
♦ [Figure omitted: a hypothetical lift chart, marking 40% of responses for 10% of cost and 80% of responses for 40% of cost; a sketch for computing lift-chart points follows below]
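A minimal sketch for computing lift-chart points from the sorted table above (the function name is our own):

```python
import numpy as np

def lift_chart_points(y_true, y_prob):
    order = np.argsort(y_prob)[::-1]              # most confident predictions first
    hits = np.cumsum(np.asarray(y_true)[order])   # true positives found so far
    n = len(y_true)
    subset_size = np.arange(1, n + 1) / n         # x axis: fraction of the sample
    return subset_size, hits                      # y axis: number of true positives
```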
♦ ROC curves are similar to lift charts; “ROC” stands for “receiver operating characteristic”
♦ Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
♦ Differences from a lift chart:
♦ The y axis shows the percentage of true positives in the sample rather than the absolute number
♦ The x axis shows the percentage of false positives in the sample rather than the sample size
♦ A simple method of getting an ROC curve using cross-validation: collect the probabilities for instances in the test folds and sort the instances according to these probabilities (see the sketch below)
♦ Another possibility is to generate an ROC curve for each fold and average them
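A minimal sketch of the first method: one sorting pass over the collected probabilities yields the ROC points (ties between probabilities are glossed over here):

```python
import numpy as np

def roc_points(y_true, y_prob):
    order = np.argsort(y_prob)[::-1]     # lower the threshold step by step
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                    # true positives accumulated so far
    fp = np.cumsum(1 - y)                # false positives accumulated so far
    return fp / fp[-1], tp / tp[-1]      # x: FP percentage, y: TP percentage
```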
♦ Given two learning schemes, any point on the straight line between their ROC points can be achieved
♦ If scheme 1 (TP and FP rates t1, f1) is used for a fraction q of the cases and scheme 2 (rates t2, f2) for the rest, the combined scheme has:
  TP rate = q × t1 + (1 − q) × t2
  FP rate = q × f1 + (1 − q) × f2
♦ Precision: proportion of retrieved documents that are relevant: precision = TP/(TP+FP)
♦ Recall: proportion of relevant documents that are retrieved: recall = TP/(TP+FN)
♦ Summary measure: average precision at 20%, 50% and 80% recall (three-point average recall)
♦ Area under the ROC curve (AUC): the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one (see the sketch below)
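A minimal sketch computing the AUC directly from that probabilistic definition (quadratic in the number of instances, so for illustration only):

```python
import numpy as np

def auc(y_true, y_prob):
    y, s = np.asarray(y_true), np.asarray(y_prob)
    pos, neg = s[y == 1], s[y == 0]
    # Compare every positive with every negative; ties count half.
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))
```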
♦ Summary of graphical evaluation techniques:

  Plot                     Domain                  y axis                  x axis
  Lift chart               Marketing               TP                      Subset size: (TP+FP)/(TP+FP+TN+FN)
  ROC curve                Communications          TP rate: TP/(TP+FN)     FP rate: FP/(FP+TN)
  Recall-precision curve   Information retrieval   Recall: TP/(TP+FN)      Precision: TP/(TP+FP)
♦ Cost curves plot the normalized expected cost against the probability cost function:
  Normalized expected cost = fn × pc[+] + fp × (1 − pc[+])
  Probability cost function: pc[+] = p[+] × C[+|−] / (p[+] × C[+|−] + p[−] × C[−|+])
  where fn and fp are the false negative and false positive rates
♦ Actual target values: a1, a2, …, an; predicted target values: p1, p2, …, pn
♦ Most popular measure: mean-squared error:
  ((p1 − a1)² + … + (pn − an)²) / n
♦ Root mean-squared error:
  √(((p1 − a1)² + … + (pn − an)²) / n)
♦ Mean absolute error, which is less sensitive to outliers:
  (|p1 − a1| + … + |pn − an|) / n
♦ Relative measures ask how much the scheme improves on simply predicting the average ā
♦ Relative squared error:
  ((p1 − a1)² + … + (pn − an)²) / ((ā − a1)² + … + (ā − an)²)
♦ Relative absolute error:
  (|p1 − a1| + … + |pn − an|) / (|ā − a1| + … + |ā − an|)
♦ The correlation coefficient measures the statistical correlation between the predicted and actual values (see the sketch below):
  SPA / √(SP × SA)
  where
  SP = ∑i (pi − p̄)² / (n − 1)
  SA = ∑i (ai − ā)² / (n − 1)
  SPA = ∑i (pi − p̄)(ai − ā) / (n − 1)
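A minimal sketch computing the error measures from this and the previous slides (the function name is our own; p = predictions, a = actual values):

```python
import numpy as np

def numeric_measures(p, a):
    p, a = np.asarray(p, float), np.asarray(a, float)
    rmse = np.sqrt(np.mean((p - a) ** 2))         # root mean-squared error
    mae = np.mean(np.abs(p - a))                  # mean absolute error
    # Root relative squared error and relative absolute error, both
    # relative to simply predicting the mean of the actual values:
    rrse = np.sqrt(np.sum((p - a) ** 2) / np.sum((a.mean() - a) ** 2))
    rae = np.sum(np.abs(p - a)) / np.sum(np.abs(a.mean() - a))
    corr = np.corrcoef(p, a)[0, 1]                # correlation coefficient
    return rmse, mae, rrse, rae, corr
```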
♦ Example: performance measures for four numeric prediction schemes A–D:

                                 A       B       C       D
  Root mean-squared error        67.8    91.7    63.3    57.4
  Mean absolute error            41.3    38.5    33.4    29.2
  Root relative squared error    42.2%   57.2%   39.4%   35.8%
  Relative absolute error        43.1%   40.1%   34.8%   30.4%
  Correlation coefficient        0.88    0.88    0.89    0.91
♦ The minimum description length (MDL) principle: the description length is defined as the space required to describe a theory + the space required to describe the theory’s mistakes
♦ In our case the theory is the classifier and the mistakes are the errors on the training data; we seek the classifier with minimal description length
♦ Model selection criteria attempt to find a good compromise between:
♦ The complexity of a model
♦ How well the model achieves high accuracy on the given data
♦ Occam’s Razor: the best theory is the smallest one that describes all the facts
William of Ockham, born in the village of Ockham in Surrey (England) about 1285, was the most influential philosopher of the 14th century and a controversial theologian.
♦ Classical example: Kepler’s three laws of planetary motion
♦ Less accurate than Copernicus’s latest refinement of the Ptolemaic theory of epicycles, but far simpler
♦ The MDL principle relates to data compression: the best theory is the one that compresses the data the most
♦ I.e., to compress a dataset we generate a model and then store the model and its mistakes
♦ Bayes’s theorem gives the a posteriori probability of a theory T given the evidence E:
  Pr[T|E] = Pr[E|T] × Pr[T] / Pr[E]
♦ Taking negative logarithms:
  −log Pr[T|E] = −log Pr[E|T] − log Pr[T] + log Pr[E]
  where log Pr[E] is constant
♦ Finding the maximum a posteriori (MAP) theory corresponds to finding the MDL theory
♦ The difficult bit in applying the MAP principle is determining the prior probability Pr[T] of the theory
♦ This corresponds to the difficult part of the MDL principle: the coding scheme for the theory
♦ I.e., if we know a priori that a particular theory is more likely, we need fewer bits to encode it
♦ Advantage: the MDL principle makes full use of the training data when selecting a model
♦ Disadvantage: the appropriate coding scheme / prior probabilities for theories are crucial
♦ Alternative (Epicurus’s principle of multiple explanations): keep all theories that are consistent with the data
♦ Description length of the theory: the bits needed to encode the clusters, e.g. cluster centers
♦ Description length of the data given the theory: encode cluster membership and position relative to the cluster, e.g. distance to cluster center