Squares: Supporting Interactive Performance Analysis for Multiclass - PowerPoint PPT Presentation

Squares: Supporting Interactive Performance Analysis for Multiclass Classifiers Donghao Ren 1,2 , Saleema Amershi 2 , Bongshin Lee 2 , Jina Suh 2 and Jason D. Williams 2 1 University of California, Santa Barbara 2 Microsoft Research, Redmond

Performance analysis is critical in machine learning • Data Collection → Feature Creation → Model Building → Performance Analysis

Common ways of performance analysis • Summary statistics: Accuracy, Precision, Recall, Log-Loss, … • Confusion Matrix (Actual Class vs. Predicted Class)
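As a minimal sketch of the measures listed on this slide, the following computes a confusion matrix, accuracy, and log-loss from per-class prediction scores. The function names, the class labels, and the toy data are all hypothetical and are not taken from the talk; the predicted class is assumed to be the highest-scoring one.

```python
import math
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Return counts[a][p]: rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

def log_loss(actual, scores, classes, eps=1e-15):
    """Mean negative log of the score assigned to the true class."""
    index = {c: i for i, c in enumerate(classes)}
    total = 0.0
    for a, row in zip(actual, scores):
        p = min(max(row[index[a]], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(actual)

# Hypothetical toy data: one score row per instance, one column per class.
classes = ["C0", "C1", "C2"]
actual = ["C0", "C0", "C1", "C2", "C2"]
scores = [
    [0.8, 0.1, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
    [0.5, 0.1, 0.4],
]

# Assumption: the predicted class is the one with the highest score.
predicted = [classes[max(range(len(classes)), key=row.__getitem__)] for row in scores]
matrix = confusion_matrix(actual, predicted, classes)
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(actual)
loss = log_loss(actual, scores, classes)
```

Accuracy and the confusion matrix only see the winning class, while log-loss depends on the scores themselves.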

Problems • Disconnected from the underlying data. • Hide important information such as score distribution. • Not trivial to support multiclass classifiers.

Squares

Design Process • Survey of Machine Learning Practices • Design of Squares • Controlled Experiment • Revise Design

Design Goals • G1: Show performance at multiple levels of detail to help practitioners prioritize efforts. • Overall / Class-level / Instance-level • Error severity (errors with higher score on the wrong class are more severe) • G2: Be agnostic to common performance metrics. • Support a wider range of scenarios. • G3: Connect performance to data. • Provide access to data. Use small visual footprint to reserve space for scenario-dependent data access views.

Visualization Design 1. Each class is shown as a column 2. Each instance is shown as a box 3. Instances are binned according to prediction scores • Dataset: Glasses from the UCI Machine Learning Repository
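The three layout rules (a column per class, a box per instance, bins by prediction score) can be sketched as a simple grouping step. This is only an illustration under stated assumptions: the slides do not give the number of bins, so the 10 equal-width bins, the tuple layout, and the name bin_instances are hypothetical, and instances are assumed to be placed in the column of their predicted class.

```python
from collections import defaultdict

def bin_instances(instances, n_bins=10):
    """Group instances into columns[predicted_class][bin_index] lists.

    Each instance is (instance_id, actual_class, predicted_class, score),
    where score is the prediction score in [0, 1] for the predicted class.
    """
    columns = defaultdict(lambda: defaultdict(list))
    for inst_id, actual, predicted, score in instances:
        # Equal-width bins; a score of exactly 1.0 falls into the top bin.
        b = min(int(score * n_bins), n_bins - 1)
        columns[predicted][b].append((inst_id, actual))
    return columns

# Hypothetical instances, not taken from the Glass dataset.
instances = [
    ("a", "C1", "C1", 0.95),
    ("b", "C1", "C1", 0.91),
    ("c", "C2", "C1", 0.55),
    ("d", "C2", "C2", 1.00),
    ("e", "C1", "C2", 0.42),
]
columns = bin_instances(instances)
```

Keeping the actual class alongside each box is what lets a renderer distinguish correct predictions from errors within the same column.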

Visualizing Count-Based Metrics: Overall Accuracy • Accuracy = (# of Correct Predictions) / (Total # of Instances) • [Figure labels: Higher Accuracy, Lower Accuracy]

Visualizing Count-Based Metrics: Class-Level • Class-level precision and recall: Precision = TP / (TP + FP); Recall = TP / (TP + FN) • FPs and FNs are comparably salient: one-to-one correspondence between outlined boxes and striped boxes • [Figure labels: Lower Precision, Lower Recall]
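A small sketch of class-level precision and recall, using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN)). It also checks the one-to-one correspondence noted on the slide: every misclassified instance is a false positive for its predicted class and a false negative for its actual class, so the two totals match. The function name and the toy labels are hypothetical.

```python
def class_metrics(actual, predicted, classes):
    """Return per-class TP/FP/FN counts with precision and recall."""
    metrics = {}
    pairs = list(zip(actual, predicted))
    for c in classes:
        tp = sum(a == c and p == c for a, p in pairs)
        fp = sum(a != c and p == c for a, p in pairs)
        fn = sum(a == c and p != c for a, p in pairs)
        metrics[c] = {
            "tp": tp,
            "fp": fp,
            "fn": fn,
            # None marks an undefined ratio (no predictions / no instances).
            "precision": tp / (tp + fp) if tp + fp else None,
            "recall": tp / (tp + fn) if tp + fn else None,
        }
    return metrics

# Hypothetical labels for illustration only.
classes = ["C1", "C2", "C3"]
actual = ["C1", "C1", "C1", "C2", "C2", "C3"]
predicted = ["C1", "C1", "C2", "C2", "C3", "C3"]

metrics = class_metrics(actual, predicted, classes)
total_fp = sum(m["fp"] for m in metrics.values())
total_fn = sum(m["fn"] for m in metrics.values())
```

In this toy data, C1 has perfect precision but lower recall, while C3 has perfect recall but lower precision.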

Visualizing Score-Based Metrics • Higher-scoring instance (more confident) • Lower-scoring instance (less confident) • Worse score distribution

Help Prioritizing Debugging Efforts • More severe error (confidently wrong) • Less severe error (prediction can flip if scores change slightly)
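One way to turn this notion of severity into a ranking is sketched below: errors are sorted by the score given to the wrong (predicted) class, and the margin over the true class indicates how easily the prediction could flip. This is an assumption-laden illustration, not the tool's implementation; the data layout and the name rank_errors are hypothetical.

```python
def rank_errors(instances):
    """Return misclassified instances, most severe first.

    Each instance is (instance_id, actual_class, scores), where scores maps
    class name to prediction score. Each result is
    (instance_id, predicted_class, score_on_wrong_class, margin_over_true_class).
    """
    errors = []
    for inst_id, actual, scores in instances:
        predicted = max(scores, key=scores.get)
        if predicted != actual:
            margin = scores[predicted] - scores[actual]
            errors.append((inst_id, predicted, scores[predicted], margin))
    # Higher score on the wrong class means a more severe error.
    return sorted(errors, key=lambda e: e[2], reverse=True)

# Hypothetical scores for three instances.
instances = [
    ("x", "C3", {"C3": 0.05, "C5": 0.93, "C8": 0.02}),  # confidently wrong
    ("y", "C5", {"C3": 0.48, "C5": 0.45, "C8": 0.07}),  # near the decision boundary
    ("z", "C8", {"C3": 0.10, "C5": 0.10, "C8": 0.80}),  # correct
]
ranked = rank_errors(instances)
```

Instance "x" comes first because it is confidently wrong; "y" has a small margin, so a slight change in scores would flip its prediction.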

Visualizing Confusion Between Classes • C5 is confused with C3 • Dataset: MNIST Handwritten Digits
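Numerically, confusion between two classes can be read off a confusion matrix by adding the two off-diagonal cells for that pair. The sketch below finds the most confused pair this way. The counts are made up for illustration and are not MNIST results, and most_confused_pair is a hypothetical helper.

```python
from itertools import combinations

def most_confused_pair(matrix, classes):
    """Return (class_a, class_b, count) for the pair with the most mutual errors.

    matrix[i][j] is the number of instances of actual class i predicted as class j.
    """
    def confusion(i, j):
        return matrix[i][j] + matrix[j][i]

    i, j = max(combinations(range(len(classes)), 2), key=lambda ij: confusion(*ij))
    return classes[i], classes[j], confusion(i, j)

# Hypothetical counts: rows are actual classes, columns are predicted classes.
classes = ["C3", "C5", "C8"]
matrix = [
    [90, 4, 6],
    [12, 85, 3],
    [2, 1, 97],
]
pair = most_confused_pair(matrix, classes)
```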

Instance-Level Details • On-hover parallel coordinates for detailed scores • Dataset: MNIST Handwritten Digits

Scalability • Each strip represents 10 boxes • Truncation indicators
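The strip aggregation can be sketched as simple arithmetic per bin. The slide only states that a strip represents 10 boxes and that truncation indicators exist; the cap of 5 strips per bin, the choice to draw a partially filled strip as a full one, and the name layout_bin are assumptions made for this example.

```python
def layout_bin(count, boxes_per_strip=10, max_strips=5):
    """Return (strips_drawn, truncated) for one bin holding `count` instances.

    Assumption: a partially filled strip still counts as one strip, and a
    truncation indicator is shown when more than `max_strips` are needed.
    """
    strips_needed = -(-count // boxes_per_strip)  # ceiling division
    return min(strips_needed, max_strips), strips_needed > max_strips

small = layout_bin(37)    # fits without truncation
large = layout_bin(123)   # exceeds the assumed cap
```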

Scalability • Toggle between 3 levels of aggregation

Evaluation

Controlled Experiment • 24 participants • Part 1: Comparison • Compare Squares against a commonly used Confusion Matrix • Within-subject design • Part 2: (Squares Only) Score Distribution • Evaluate Squares’ ability to convey score distribution

Part 1: Squares vs. Confusion Matrix • Squares with a Sortable Table • Confusion Matrix with a Sortable Table: select/deselect individual cells; select cells of a given row/column.

Part 1: Tasks • T1 – Overall • Select the classifier with the larger number of errors • T2 – Class-level • Select one of the two classes with the most errors • T3 – Instance-level • Select an error with a score of .9 or above in the wrong class

Part 1: Squares Performed Better • Task Time • Squares leads to faster task times (Main Effect: p < 0.001) • Squares scales better in terms of the number of classes (Interaction Effect: p = 0.012)

Part 1: Squares Performed Better • Accuracy • Squares leads to more accurate results (p < 0.001) • [Bar chart: accuracy (%), Squares vs. Confusion Matrix]

Part 1: People Preferred Squares • Squares was more helpful • Squares was preferred • [Charts: Helpfulness and Preference, Squares vs. Confusion Matrix, for T1/5, T1/15, T2/5, T2/15, T3/5, T3/15]

Part 2: (Squares Only) Distribution Tasks • T4 – Overall • Select the classifier with the worst distribution • T5 – Class-level • Select one of the two classes with the worst distribution • T6 – Confusion • Select the two classes most confused with each other

Part 2: Squares was helpful in distribution tasks • [Charts: Task Time (s), Accuracy, and Helpfulness for T4, T5, T6; Small vs. Large]

Freeform Feedback • Positive: • “Granular and at the same time general overview of the classifiers is great.” • “Seeing the distribution of scores is very helpful.” • “Had fun for the first time while classifying!” • Negative: • “I prefer having numbers than pure display.” • “[Confusion Matrix is] more straightforward, lower learning curve.”

Future Work • Further Evaluation • Compare to alternative designs of Confusion Matrix, as well as other visualization designs in the literature • Scalability • Supporting more than 20 classes • Optimizing color assignments • [Figure: Confusion Wheel [B. Alsallakh, VAST '14]]

Squares as a Tool • Deployed along with a machine learning toolkit within Microsoft • [Screenshot: Model Building Interface]

Acknowledgements • We thank the Machine Teaching Group at Microsoft Research for their support and feedback. • We thank the anonymous reviewers for their constructive comments.

Thanks! Questions? • Donghao Ren (donghao.ren@gmail.com), University of California, Santa Barbara

Survey of Machine Learning Practices • Survey within a large software company in July 2015. • 102 respondents • [Bar chart: respondents’ roles in the company (%): Data scientist, Software engineer, Researcher, Program manager, Other]

Number of Classes • How many classes do your classifiers typically deal with (check all that apply)? • Most respondents typically deal with fewer than 20 classes.

Important Tasks • “How difficult” and “how important” ratings of tasks: • Prioritizing efforts is difficult even for expert users. • Understanding instance-level performance is relatively more difficult in common tools.

Integrating into LUIS (Language Understanding Intelligent Service)
