Explaining Datasets Through High-Accuracy Regions



1. Women in Machine Learning Workshop, 12th of December 2011. Ina Fiterau, Carnegie Mellon University; Artur Dubrawski, Carnegie Mellon University. Explaining Datasets Through High-Accuracy Regions. Work under review at the SIAM Data Mining Conference.

2. Outline: motivation, the need for interpretability; Explanation-Oriented Partitioning (EOP); evaluation of EOP.

3. Example Application: Nuclear Threat Detection. Border control: vehicles are scanned, with a human in the loop interpreting the results (vehicle scan → prediction → feedback).

4. Boosted Decision Stumps. Accurate, but hard to interpret: how is the prediction derived from the input? (Image obtained with the AdaBoost applet.)

5. Decision Tree – More Interpretable. The tree asks a readable sequence of yes/no questions: Radiation > x%? Payload type = ceramics? Uranium level > maximum admissible for ceramics (considering the balance of Th232, Ra226 and Co60)? Leaf labels: Threat / Clear.
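To make the interpretability point concrete, here is a minimal sketch of how a tree like the one on this slide could be written as nested rules. The function name, the threshold value, and the branch order are illustrative assumptions, not the exact tree from the slide.

```python
# Hypothetical rule cascade mirroring a threat-detection decision tree.
# The threshold (x = 5%) and the branch order are illustrative assumptions.

def assess_vehicle(radiation_pct, payload_type, uranium_level, max_admissible):
    """Return 'Threat' or 'Clear' for a scanned vehicle."""
    if radiation_pct <= 0.05:            # "Radiation > x%?" with x assumed to be 5%
        return "Clear"
    if payload_type != "ceramics":       # elevated radiation without a benign payload
        return "Threat"
    # Ceramics naturally emit some radiation (balance of Th232, Ra226 and Co60);
    # flag the vehicle only if the uranium level exceeds the admissible maximum.
    return "Threat" if uranium_level > max_admissible else "Clear"

print(assess_vehicle(0.12, "ceramics", 0.3, max_admissible=0.5))  # -> Clear
```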

6. Motivation. Many users are willing to trade some accuracy to better understand the results the system yields. Needed: a simple, interpretable model and an explanatory prediction process.

7. Explanation-Oriented Partitioning (EOP).

8. Explanation-Oriented Partitioning (EOP): Execution Example – 3D Data. The data mix a uniform cube with 2 Gaussians. (Figure: (X,Y) plot of the dataset.)

9. EOP Execution Example – 3D Data. Step 1: select a projection, (X1, X2).

10. EOP Execution Example – 3D Data. Step 1: select a projection, (X1, X2).

11. EOP Execution Example – 3D Data. Step 2: choose a good classifier on this projection; call it h1.

12. EOP Execution Example – 3D Data. Step 2: choose a good classifier on this projection; call it h1.

13. EOP Execution Example – 3D Data. Step 3: estimate the accuracy of h1 at each point (points marked OK / NOT OK).

14. EOP Execution Example – 3D Data. Step 3: estimate the accuracy of h1 at each point.

15. EOP Execution Example – 3D Data. Step 4: identify high-accuracy regions.

16. EOP Execution Example – 3D Data. Step 4: identify high-accuracy regions.

17. EOP Execution Example – 3D Data. Step 5: the training points covered by the regions are removed from consideration.

18. EOP Execution Example – 3D Data. Step 5: the training points covered by the regions are removed from consideration.

19. EOP Execution Example – 3D Data. First iteration finished.

20. EOP Execution Example – 3D Data. Iterate until all data is accounted for or the error can no longer be decreased; a minimal sketch of the loop is given below.
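The following is a minimal, self-contained sketch of the five-step loop walked through on slides 9-20, assuming axis-aligned 2D projections, decision stumps (depth-1 trees) as base classifiers, and a bounding box over correctly classified points as the high-accuracy region. These concrete choices, and all function and parameter names, are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of an EOP-style loop (steps 1-5), under the assumptions stated above.
from itertools import combinations

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_eop(X, y, max_iters=5):
    """Return a list of (feature_pair, classifier, (lo, hi) box) stages."""
    remaining = np.ones(len(y), dtype=bool)   # training points not yet explained
    stages = []
    for _ in range(max_iters):
        if remaining.sum() < 10:              # all data (nearly) accounted for
            break
        Xr, yr = X[remaining], y[remaining]
        best = None
        # Steps 1-2: try every 2D projection, fit a stump on each, and keep
        # the projection/classifier pair with the lowest training error.
        for pair in combinations(range(X.shape[1]), 2):
            cols = list(pair)
            clf = DecisionTreeClassifier(max_depth=1).fit(Xr[:, cols], yr)
            err = 1.0 - clf.score(Xr[:, cols], yr)
            if best is None or err < best[0]:
                best = (err, cols, clf)
        _, cols, clf = best
        # Step 3: per-point correctness is a crude stand-in for local accuracy.
        correct = clf.predict(Xr[:, cols]) == yr
        if not correct.any():
            break
        # Step 4: the high-accuracy region is the bounding box of correct points.
        lo, hi = Xr[correct][:, cols].min(axis=0), Xr[correct][:, cols].max(axis=0)
        stages.append((cols, clf, (lo, hi)))
        # Step 5: remove points inside the region from further consideration.
        inside = np.all((Xr[:, cols] >= lo) & (Xr[:, cols] <= hi), axis=1)
        idx = np.flatnonzero(remaining)
        remaining[idx[inside]] = False
    return stages
```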

21. Learned Model – Processing a Query [x1 x2 x3]. Is [x1 x2] in R1? If yes, answer with h1(x1, x2). Otherwise, is [x2 x3] in R2? If yes, answer with h2(x2, x3). Otherwise, is [x1 x3] in R3? If yes, answer with h3(x1, x3). Otherwise, return the default value.
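A short companion to the sketch above showing how a query could be routed through the learned stages: the first region that contains the projected query answers it, otherwise a default label is returned. The stage format and the default label are assumptions carried over from the earlier sketch.

```python
import numpy as np

def predict_eop(stages, x, default=0):
    """Route a query through (feature_pair, classifier, box) stages in order."""
    for cols, clf, (lo, hi) in stages:
        proj = np.asarray(x, dtype=float)[cols]
        if np.all((proj >= lo) & (proj <= hi)):           # [x_i x_j] in R_k ?
            return clf.predict(proj.reshape(1, -1))[0]    # answer with h_k
    return default                                        # no region claims the query
```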

22. Parametric Regions of High Confidence (Bounding Polyhedra). Enclose points in simple convex shapes (multiple per iteration); grow the contour while the training error is ≤ ε. (Figure: decision boundary with correctly and incorrectly classified points.)

23. Parametric Regions of High Confidence (Bounding Polyhedra). Enclose points in simple convex shapes (multiple per iteration); grow the contour while the training error is ≤ ε. Calibration on a hold-out set: remove shapes that do not contain calibration points, or over which the classifier is not accurate.

24. Parametric Regions of High Confidence (Bounding Polyhedra). Enclose points in simple convex shapes (multiple per iteration); grow the contour while the training error is ≤ ε. Calibration on a hold-out set: remove shapes that do not contain calibration points, or over which the classifier is not accurate. The shapes are intuitive and visually appealing: hyper-rectangles and spheres. A sketch of the growing and calibration steps follows.
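Below is a rough sketch of the growing and calibration steps described on slides 22-24, restricted to hyper-rectangles. The greedy closest-point expansion order, the error tolerance, and the calibration accuracy cutoff are assumptions made for illustration.

```python
import numpy as np

def grow_box(P, correct, seed, eps=0.05):
    """Grow an axis-aligned box from P[seed], adding the closest points one by
    one while the fraction of misclassified training points inside stays <= eps."""
    lo, hi = P[seed].copy(), P[seed].copy()
    order = np.argsort(np.linalg.norm(P - P[seed], axis=1))   # closest points first
    for i in order[1:]:
        new_lo, new_hi = np.minimum(lo, P[i]), np.maximum(hi, P[i])
        inside = np.all((P >= new_lo) & (P <= new_hi), axis=1)
        if (~correct[inside]).mean() <= eps:   # training error inside the box is ok
            lo, hi = new_lo, new_hi            # keep the growth, otherwise skip point
    return lo, hi

def calibrate(boxes, P_cal, correct_cal, min_acc=0.9):
    """Drop boxes that contain no calibration points or on which the
    classifier is not accurate enough on the hold-out set."""
    kept = []
    for lo, hi in boxes:
        inside = np.all((P_cal >= lo) & (P_cal <= hi), axis=1)
        if inside.any() and correct_cal[inside].mean() >= min_acc:
            kept.append((lo, hi))
    return kept
```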

25. Outline: motivation, the need for interpretability; Explanation-Oriented Partitioning (EOP); evaluation of EOP; summary.

26. Benefits of EOP – Avoiding Needless Complexity. A typical XOR dataset.

27. Benefits of EOP – Avoiding Needless Complexity. On a typical XOR dataset, CART is accurate, but it takes many iterations and does not uncover or leverage the structure of the data.

28. Benefits of EOP – Avoiding Needless Complexity. On the XOR dataset, EOP is equally accurate and uncovers the structure of the data, capturing one region per iteration (iteration 1, then iteration 2); CART, by contrast, is accurate but takes many iterations and does not uncover or leverage that structure.

29. Comparison to Boosting. What is the price of understandability? Why boosting? It is an (arguably) good black-box classifier; it learns an ensemble using any type of base classifier; it iteratively targets data misclassified earlier. Criterion: complexity of the resulting model = number of vector operations needed to make a prediction (one plausible accounting is sketched below).
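One plausible way to instantiate that complexity criterion, assuming each base-classifier evaluation and each region-membership test costs one vector operation; the paper may count operations differently.

```python
def boosting_complexity(n_base_classifiers):
    # An ensemble evaluates every base classifier for every query.
    return n_base_classifiers

def eop_complexity(n_stages):
    # Worst case for a cascade: test every region, then evaluate one classifier.
    return n_stages + 1

print(boosting_complexity(100), eop_complexity(5))   # e.g. 100 vs. 6
```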

30. Comparison to Boosting – Setup. Problem: binary classification, with 10-D Gaussians/uniform cubes for each class. Statistical significance: the experiment is repeated over several datasets and paired t-test p-values are computed. Results are obtained through 5-fold cross-validation.
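A sketch of this kind of setup: synthetic 10-D Gaussian vs. uniform-cube classes, repeated datasets, 5-fold cross-validation, and a paired t-test on the per-dataset accuracies. Since EOP is not reproduced here, CART and AdaBoost stand in as the two compared methods; all distribution parameters and sample sizes are assumptions.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def make_dataset(n=500, d=10, rng=None):
    """One class from a 10-D Gaussian, the other from a 10-D uniform cube."""
    rng = rng or np.random.default_rng()
    pos = rng.normal(loc=0.5, scale=1.0, size=(n, d))
    neg = rng.uniform(low=-2.0, high=2.0, size=(n, d))
    return np.vstack([pos, neg]), np.r_[np.ones(n), np.zeros(n)]

rng = np.random.default_rng(0)
acc_a, acc_b = [], []
for _ in range(8):                                   # repeat over several datasets
    X, y = make_dataset(rng=rng)
    # 5-fold cross-validation accuracy for each of the two compared methods
    acc_a.append(cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean())
    acc_b.append(cross_val_score(AdaBoostClassifier(), X, y, cv=5).mean())
print(stats.ttest_rel(acc_a, acc_b).pvalue)          # paired t-test p-value
```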

31. EOP vs. AdaBoost with SVM base classifiers. EOP is often less accurate, but not significantly so; the reduction in complexity is statistically significant. (Figure: accuracy and complexity of boosting vs. non-parametric EOP across the test datasets; accuracy p-value 0.832, complexity p-value 0.003.)

32. EOP (stumps as base classifiers) vs. CART on data from the UCI repository. CART is the most accurate; parametric EOP yields the simplest models. (Figure: accuracy and complexity of CART, non-parametric EOP, and parametric EOP on the four datasets.)
Dataset        | # of Features | # of Points
Breast Tissue  | 10            | 1006
Vowel          | 9             | 990
MiniBOONE      | 10            | 5000
Breast Cancer  | 10            | 596

33. Explaining Real Data – Spambase. 1st iteration: the classifier labels everything as spam; the high-confidence regions enclose mostly spam, and in them the incidence of the word 'your' is low and the length of text in capital letters is high.

34. Explaining Real Data – Spambase. 2nd iteration: the threshold for the incidence of 'your' is lowered; the required incidence of capitals is increased; the square region on the left also encloses examples that will be marked as 'not spam'.

35. Explaining Real Data – Spambase. 3rd iteration: the classifier marks everything as spam; the frequencies of 'your' and 'hi' determine the regions.

36. Summary. EOP maintains classification accuracy while using less complex models than boosting. EOP with decision stumps finds less complex models than CART, at the price of a small decrease in accuracy. EOP gives interpretable high-accuracy regions. We are currently testing EOP in a range of practical application scenarios.

37. Thank You.

38. Extra Results.

39. Explaining Real Data – Fuel.

