EXPLAINING DATASETS THROUGH HIGH-ACCURACY REGIONS
Ina Fiterau, Carnegie Mellon University Artur Dubrawski, Carnegie Mellon University
Women in Machine Learning Workshop 12th of December 2011
¹ Work under review at the SIAM Data Mining Conference
Border control: vehicles are scanned; a human in the loop interprets the results.
[Pipeline: vehicle → scan → prediction → feedback]
Accurate, but hard to interpret
How is the prediction derived from the input? (Image obtained with the AdaBoost applet.)
An interpretable alternative: a shallow decision tree over physically meaningful features, with yes/no branches ending in "Threat" or "Clear":
- Radiation > x%?
- Payload type = ceramics?
- Uranium level > max. admissible for ceramics?
- Consider the balance of uranium and Co60
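A minimal sketch of how a rule of this shape could be encoded. The thresholds and the exact branching below are hypothetical illustrations, not values from the talk:

```python
# Hypothetical encoding of the interpretable screening rule sketched above.
# All thresholds are illustrative placeholders, not values from the talk.

RADIATION_THRESHOLD = 0.05    # the "x%" in the slide; hypothetical value
MAX_URANIUM_CERAMICS = 0.01   # max. admissible uranium level for ceramics

def screen_vehicle(radiation, payload_type, uranium_level, co60_level):
    """Return 'Threat' or 'Clear' for a scanned vehicle."""
    if radiation <= RADIATION_THRESHOLD:
        return "Clear"
    if payload_type == "ceramics":
        # Ceramics emit some radiation naturally: compare against the
        # maximum admissible uranium level for this payload type.
        return "Threat" if uranium_level > MAX_URANIUM_CERAMICS else "Clear"
    # For other payloads, consider the balance of uranium and Co60.
    return "Threat" if uranium_level > co60_level else "Clear"
```

Each prediction can be justified by reading off the branch taken, which is exactly the kind of transparency a black-box ensemble lacks.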
[Figure: synthetic data, one class drawn from 2 Gaussians and the other from a uniform cube; (X,Y) scatter plots]
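A sketch of how such synthetic data can be generated. The means, widths, and cube bounds below are hypothetical; the talk only specifies the mixture-of-Gaussians-vs-uniform-cube setup:

```python
import random

def make_synthetic(n, dim=2, seed=0):
    """One class sampled from a mixture of 2 Gaussians, the other from a
    uniform hypercube (the setup in the talk; parameters are illustrative)."""
    rng = random.Random(seed)
    centers = [[-2.0] * dim, [2.0] * dim]
    X, y = [], []
    for _ in range(n):
        if rng.random() < 0.5:             # positive class: 2 Gaussians
            c = rng.choice(centers)
            X.append([rng.gauss(mu, 1.0) for mu in c])
            y.append(1)
        else:                              # negative class: uniform cube
            X.append([rng.uniform(-4.0, 4.0) for _ in range(dim)])
            y.append(0)
    return X, y

X, y = make_synthetic(200)
```

The same generator extends to the 10-dimensional version used in the experiments by setting `dim=10`.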
The EOP procedure:
- Enclose points in simple convex shapes (multiple shapes per iteration)
- Grow each contour while the training error inside it is ≤ ε
- Calibrate on a hold-out set: remove shapes that contain no calibration points, or over which the classifier is not accurate
- The shapes are intuitive and visually appealing: hyper-rectangles / hyper-spheres
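The steps above can be sketched in code. This is a simplified reading of the slides, not the paper's implementation: boxes are axis-aligned hyper-rectangles grown uniformly by a fixed step, and `predict` stands in for whatever classifier is being explained:

```python
# Simplified EOP-style step: grow an axis-aligned box around a seed point
# while the classifier's training error inside it stays <= eps, then drop
# boxes that fail on a calibration (hold-out) set.

def inside(box, x):
    return all(lo <= v <= hi for (lo, hi), v in zip(box, x))

def box_error(box, X, y, predict):
    pts = [(x, t) for x, t in zip(X, y) if inside(box, x)]
    if not pts:
        return 0.0
    return sum(predict(x) != t for x, t in pts) / len(pts)

def grow_box(seed, X, y, predict, eps=0.1, step=0.5, rounds=10):
    """Greedily expand a degenerate box around `seed` while the training
    error of `predict` inside it stays <= eps."""
    box = [(v, v) for v in seed]
    for _ in range(rounds):
        grown = [(lo - step, hi + step) for lo, hi in box]
        if box_error(grown, X, y, predict) <= eps:
            box = grown
        else:
            break
    return box

def calibrate(boxes, X_cal, y_cal, predict, eps=0.1):
    """Remove boxes that contain no calibration points, or over which the
    classifier is not accurate enough."""
    kept = []
    for box in boxes:
        pts = [(x, t) for x, t in zip(X_cal, y_cal) if inside(box, x)]
        if pts and sum(predict(x) != t for x, t in pts) / len(pts) <= eps:
            kept.append(box)
    return kept

# Tiny 1-D illustration with a hypothetical threshold classifier.
predict = lambda x: 1 if x[0] > 0 else 0
X = [[-3.0], [-1.0], [1.0], [3.0]]
y = [0, 0, 1, 1]
box = grow_box([1.0], X, y, predict)
boxes = calibrate([box, [(10.0, 11.0)]], X, y, predict)  # empty box is pruned
```

The pruning step is what gives the final regions their "high-accuracy" guarantee on held-out data rather than only on the training set.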
Typical XOR dataset:
- CART: single axis-aligned splits do not leverage the structure of the data
- EOP: the clusters are enclosed directly (Iteration 1, Iteration 2), leveraging the structure of the data
[Figure: XOR data with '+' and 'o' classes; CART partition vs. EOP regions]
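The XOR contrast can be made concrete. In this sketch (cluster locations are hypothetical), two axis-aligned boxes describe the '+' class exactly, while no single axis-aligned split, the decision a CART root node must make, does better than chance:

```python
import random

# XOR-like data: '+' clusters at (0-1,0-1) and (2-3,2-3),
#                'o' clusters at (0-1,2-3) and (2-3,0-1).
rng = random.Random(1)
X = [(rng.uniform(*xr), rng.uniform(*yr))
     for xr, yr in [((0, 1), (0, 1)), ((2, 3), (2, 3)),   # class '+'
                    ((0, 1), (2, 3)), ((2, 3), (0, 1))]   # class 'o'
     for _ in range(25)]
y = ['+'] * 50 + ['o'] * 50

def in_box(p, box):
    (x0, x1), (y0, y1) = box
    return x0 <= p[0] <= x1 and y0 <= p[1] <= y1

# Two boxes, one per '+' cluster, classify every point correctly...
plus_boxes = [((0, 1), (0, 1)), ((2, 3), (2, 3))]
acc_boxes = sum(('+' if any(in_box(p, b) for b in plus_boxes) else 'o') == t
                for p, t in zip(X, y)) / len(X)

# ...while the best single split on x is roughly at chance level.
acc_split = max(sum(('+' if p[0] <= thr else 'o') == t
                    for p, t in zip(X, y)) / len(X)
                for thr in [0.5, 1.5, 2.5])
```

A tree can of course solve XOR with deeper splits, but only at the cost of extra nodes; the region-based description stays compact.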
What is the price of understandability? Why compare against boosting?
- It is an (arguably) good black-box classifier
- It learns an ensemble using any type of base classifier
- It iteratively targets data misclassified in earlier rounds
Evaluation criterion: complexity of the resulting model
Experimental setup:
- Problem: binary classification; 10-dimensional Gaussians / uniform cubes for each class
- Statistical significance: the experiment is repeated over several randomly generated datasets
- Results obtained through 5-fold cross-validation
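For concreteness, plain 5-fold cross-validation as used in the experiments can be sketched as follows; `fit` and `predict_with` are placeholders for whichever learner (EOP, boosting, CART) is evaluated:

```python
import random
from collections import Counter

def five_fold_cv(X, y, fit, predict_with, seed=0):
    """Split the data into 5 folds, train on 4, test on the held-out
    fold, and average the test accuracy."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]
    accs = []
    for k in range(5):
        test = set(folds[k])
        Xtr = [X[i] for i in idx if i not in test]
        ytr = [y[i] for i in idx if i not in test]
        model = fit(Xtr, ytr)
        correct = sum(predict_with(model, X[i]) == y[i] for i in test)
        accs.append(correct / len(test))
    return sum(accs) / 5

# Illustration with a trivial majority-class "learner".
fit = lambda Xtr, ytr: Counter(ytr).most_common(1)[0][0]
predict_with = lambda model, x: model
X = [[i] for i in range(100)]
y = [0] * 70 + [1] * 30
acc = five_fold_cv(X, y, fit, predict_with)
```

With a 70/30 label split, the majority baseline's cross-validated accuracy averages out to the majority-class rate, 0.7.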
EOP is often less accurate than boosting, but not significantly so; the reduction in model complexity, however, is statistically significant.
[Figure: accuracy (0.85–1) and model complexity (100–300) of Boosting vs. EOP (nonparametric) across 10 synthetic datasets]
[Figure: relative accuracy and complexity (10–40) of CART, EOP (nonparametric), and EOP (parametric) on the BCW, MB, V, and BT datasets]
Dataset         # of Features   # of Points
Breast Tissue   10              1006
Vowel           9               990
MiniBOONE       10              5000
Breast Cancer   10              596
1st iteration: the classifier labels everything as spam; the high-confidence regions are characterized by a low incidence of the word 'your' and a long stretch of text in capital letters.
2nd iteration: the threshold on the incidence of 'your' is updated and the required incidence of capitals is increased; the square region on the left also encloses additional points.
3rd iteration: the classifier again marks everything as spam; the frequencies of 'your' and 'hi' determine the regions.
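The kind of output EOP reports on the spam data can be sketched as named interval constraints on features. The feature names follow the slides ('your' frequency, capital-run length, 'hi' frequency), but every threshold below is a hypothetical placeholder:

```python
# Each region is a conjunction of interval constraints on named features.
# Thresholds are hypothetical placeholders, not values from the talk.
regions = {
    "iteration_1": {"freq_your": (0.0, 0.1),
                    "capital_run": (100.0, float("inf"))},
    "iteration_3": {"freq_your": (0.5, float("inf")),
                    "freq_hi": (0.0, 0.05)},
}

def in_region(msg, region):
    """True when every feature of `msg` falls inside the region's interval."""
    return all(lo <= msg.get(f, 0.0) <= hi for f, (lo, hi) in region.items())

def explain(msg):
    """Report which high-confidence regions a message falls into."""
    return [name for name, r in regions.items() if in_region(msg, r)]

msg = {"freq_your": 0.05, "capital_run": 150.0, "freq_hi": 0.0}
```

The explanation for a prediction is then simply the list of region names, each of which reads as a human-checkable rule.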
Conclusions:
- EOP maintains classification accuracy while using a less complex model
- EOP with decision stumps finds less complex models than CART
- EOP yields interpretable high-accuracy regions
- We are currently testing EOP in a range of applications