Squares: Supporting Interactive Performance Analysis for Multiclass Classifiers




SLIDE 1

Squares: Supporting Interactive Performance Analysis for Multiclass Classifiers

Donghao Ren (1, 2), Saleema Amershi (2), Bongshin Lee (2), Jina Suh (2), and Jason D. Williams (2)

(1) University of California, Santa Barbara
(2) Microsoft Research, Redmond

SLIDE 2

Performance analysis is critical in machine learning

(Diagram: Data Collection, Feature Creation, Model Building, Performance Analysis)


SLIDE 6

Common ways of performance analysis

  • Summary statistics
      • Accuracy
      • Precision
      • Recall
      • Log-Loss
  • Confusion Matrix (figure axes: Actual Class, Predicted Class)
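The measures listed on this slide can be sketched in a few lines of plain Python. This is only an illustration: the class names, labels, and scores are made-up toy data, and the helper names (`confusion_matrix`, `accuracy`, `precision`, `recall`, `log_loss`) are hypothetical, not taken from the talk or from any particular library.

```python
import math

# Hypothetical toy data (not from the slides): three classes, six instances.
# Each row of `scores` holds the per-class prediction scores of one instance.
CLASSES = ["A", "B", "C"]
actual = ["A", "A", "B", "B", "C", "C"]
scores = [
    [0.8, 0.1, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.5, 0.2, 0.3],
]
INDEX = {c: i for i, c in enumerate(CLASSES)}

# The predicted class is the one with the highest score.
predicted = [CLASSES[row.index(max(row))] for row in scores]


def confusion_matrix(actual, predicted):
    """matrix[i][j] counts instances of actual class i predicted as class j."""
    matrix = [[0] * len(CLASSES) for _ in CLASSES]
    for a, p in zip(actual, predicted):
        matrix[INDEX[a]][INDEX[p]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def precision(matrix, k):
    """Share of instances predicted as class k that really are class k."""
    predicted_k = sum(row[k] for row in matrix)
    return matrix[k][k] / predicted_k if predicted_k else 0.0


def recall(matrix, k):
    """Share of actual class-k instances that were predicted as class k."""
    actual_k = sum(matrix[k])
    return matrix[k][k] / actual_k if actual_k else 0.0


def log_loss(actual, scores, eps=1e-15):
    """Mean negative log of the score given to the true class."""
    total = sum(math.log(max(row[INDEX[a]], eps)) for a, row in zip(actual, scores))
    return -total / len(actual)


matrix = confusion_matrix(actual, predicted)
print(matrix, accuracy(matrix), precision(matrix, 0), recall(matrix, 0))
print(log_loss(actual, scores))
```

Note that accuracy, precision, and recall depend only on the counts in the matrix, while log-loss needs the scores themselves.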

SLIDE 7

Problems

  • Disconnected from the underlying data.
  • Hide important information such as score distribution.
  • Not trivial to support multiclass classifiers.

SLIDE 8

Squares

SLIDE 9

Design Process

(Diagram: Survey of Machine Learning Practices, Design of Squares, Controlled Experiment, Revise Design)


SLIDE 13

Design Goals

  • G1: Show performance at multiple levels of detail to help practitioners prioritize efforts.
      • Overall / Class-level / Instance-level
      • Error severity (errors with a higher score on the wrong class are more severe)
  • G2: Be agnostic to common performance metrics.
      • Support a wider range of scenarios.
  • G3: Connect performance to data.
      • Provide access to data. Use a small visual footprint to reserve space for scenario-dependent data access views.

SLIDE 14

Squares Visualization Design

Dataset: Glasses from the UCI Machine Learning Repository

  • 1. Each class is shown as a column

SLIDE 15

Visualization Design

Dataset: Glasses from the UCI Machine Learning Repository

  • 1. Each class is shown as a column
  • 2. Each instance is shown as a box
SLIDE 16

Visualization Design

Dataset: Glasses from the UCI Machine Learning Repository

  • 1. Each class is shown as a column
  • 2. Each instance is shown as a box
  • 3. Instances are binned according to prediction scores
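
The three layout rules (class as column, instance as box, boxes binned by score) can be sketched as a grouping step. This is a rough illustration under stated assumptions, not the tool's implementation: the bin count of 10, the choice to place a box in the column of its predicted class, and the dictionary keys `id`, `predicted`, and `score` are all hypothetical.

```python
from collections import defaultdict

NUM_BINS = 10  # assumed bin count; the slides do not state one


def score_bin(score, num_bins=NUM_BINS):
    """Map a score in [0, 1] to a bin index; a score of 1.0 goes to the top bin."""
    return min(int(score * num_bins), num_bins - 1)


def layout(instances):
    """Group instances as columns[class][bin] -> list of instance ids.

    Assumption: each instance is a dict with the hypothetical keys "id",
    "predicted" (the column it is drawn in) and "score" (the score the
    classifier gave to that predicted class).
    """
    columns = defaultdict(lambda: defaultdict(list))
    for inst in instances:
        columns[inst["predicted"]][score_bin(inst["score"])].append(inst["id"])
    return columns


# Made-up example data: two boxes land in the top bin of column C1,
# one box lands in a lower bin of column C2.
example = [
    {"id": "a", "predicted": "C1", "score": 0.95},
    {"id": "b", "predicted": "C1", "score": 0.91},
    {"id": "c", "predicted": "C2", "score": 0.35},
]
columns = layout(example)
```

Each inner list corresponds to one row of boxes in a column, so its length is the number of boxes drawn at that score level.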
SLIDE 17

Visualization Design

Dataset: Glasses from the UCI Machine Learning Repository

SLIDE 18

Visualizing Count-Based Metrics: Overall Accuracy

  • Accuracy = Correct Predictions / Total # of Instances

(Figure panels: Higher Accuracy, Lower Accuracy)

SLIDE 19

Visualizing Count-Based Metrics: Class-Level

  • Class-level precision and recall (figure labels: Precision, Recall; panels: Lower Precision, Lower Recall)
  • FPs and FNs are comparably salient: one-to-one correspondence between outlined boxes and striped boxes
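
The one-to-one correspondence follows from counting: every misclassified instance is a false positive for the class it was predicted as and a false negative for its actual class. The sketch below illustrates this with made-up labels. The function names are hypothetical, and the mapping of striped boxes to FPs and outlined boxes to FNs is an assumption that the slide does not spell out.

```python
from collections import Counter


def class_level_counts(actual, predicted):
    """Return per-class Counters of true positives, false positives, false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1
        else:
            fp[p] += 1  # assumed to be drawn as a striped box in column p
            fn[a] += 1  # assumed to be drawn as an outlined box in column a
    return tp, fp, fn


def precision(cls, tp, fp):
    """TP / (TP + FP), or None when nothing was predicted as cls."""
    denom = tp[cls] + fp[cls]
    return tp[cls] / denom if denom else None


def recall(cls, tp, fn):
    """TP / (TP + FN), or None when cls has no actual instances."""
    denom = tp[cls] + fn[cls]
    return tp[cls] / denom if denom else None


# Made-up labels: two of the four instances are misclassified.
tp, fp, fn = class_level_counts(["A", "A", "B", "C"], ["A", "B", "B", "A"])
```

Summed over all classes, the FP and FN counts are always equal, which is the correspondence the slide refers to.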

SLIDE 20

Visualizing Score-Based Metrics

(Figure annotations:)

  • Higher scoring instance (more confident)
  • Lower scoring instance (less confident)
  • Worse score distribution

SLIDE 21

Helping Prioritize Debugging Efforts

(Figure annotations:)

  • More severe error (confidently wrong)
  • Less severe error (prediction can flip if scores change slightly)
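
One way to turn this notion of severity into a work list is to sort the errors by the score given to the wrong class. The sketch below assumes each instance is a dict with the hypothetical keys `id`, `actual`, `predicted`, and `score` (the score of the predicted class). The data is made up, and using that score alone as the severity measure is a simplification rather than the tool's stated method.

```python
def rank_errors(instances):
    """Return the misclassified instances, most confidently wrong first."""
    errors = [inst for inst in instances if inst["predicted"] != inst["actual"]]
    # For an error, "score" is the score on the wrong class, so a higher
    # value means a more severe error.
    return sorted(errors, key=lambda inst: inst["score"], reverse=True)


# Made-up example: instance 2 is confidently wrong, instance 3 is a near miss.
instances = [
    {"id": 1, "actual": "A", "predicted": "A", "score": 0.90},
    {"id": 2, "actual": "A", "predicted": "B", "score": 0.95},
    {"id": 3, "actual": "B", "predicted": "C", "score": 0.40},
]
ranked = rank_errors(instances)
print([inst["id"] for inst in ranked])
```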

SLIDE 22

Visualizing Confusion Between Classes

Dataset: MNIST Handwritten Digits

C5 is confused with C3
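
Numerically, confusion between two classes can be approximated by counting misclassifications between them. The sketch below is an illustration on made-up labels, with a hypothetical helper name, and it assumes that confusion is counted in both directions (C5 predicted as C3 and C3 predicted as C5 both add to the same pair).

```python
from collections import Counter


def most_confused_pair(actual, predicted):
    """Return ((class_a, class_b), count) for the pair with the most mutual errors."""
    pairs = Counter()
    for a, p in zip(actual, predicted):
        if a != p:
            pairs[frozenset((a, p))] += 1  # unordered pair: direction is ignored
    if not pairs:
        return None
    pair, count = pairs.most_common(1)[0]
    return tuple(sorted(pair)), count


# Made-up labels in the style of the MNIST class names on the slide.
actual = ["C5", "C5", "C3", "C1", "C1"]
predicted = ["C3", "C3", "C5", "C1", "C3"]
result = most_confused_pair(actual, predicted)
print(result)
```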

SLIDE 23

Instance-Level Details

Dataset: MNIST Handwritten Digits

On-hover parallel coordinates for detailed scores

SLIDE 24

Scalability

  • Each strip represents 10 boxes
  • Truncation indicators
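
The aggregation arithmetic can be sketched as follows. Only the figure of 10 boxes per strip comes from the slide. Rounding a partly filled strip up to a whole strip, the `max_strips` limit, and the function name are assumptions made for illustration.

```python
import math

BOXES_PER_STRIP = 10  # from the slide: each strip represents 10 boxes


def strips_for_bin(count, max_strips):
    """Return (strips_drawn, truncated) for a bin holding `count` instances.

    Assumptions: a partly filled strip is rounded up to a whole strip, and a
    row can show at most `max_strips` strips before a truncation indicator
    is needed.
    """
    needed = math.ceil(count / BOXES_PER_STRIP)
    return min(needed, max_strips), needed > max_strips


print(strips_for_bin(35, 5))   # fits within the row
print(strips_for_bin(120, 5))  # more strips needed than the row can show
```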

SLIDE 25

Scalability

Toggle between 3 levels of aggregation

SLIDE 26

Evaluation

SLIDE 27

Controlled Experiment

  • 24 participants
  • Part 1: Comparison
      • Compare Squares against a commonly used Confusion Matrix
      • Within-subject design
  • Part 2: (Squares Only) Score Distribution
      • Evaluate Squares’ ability to convey score distribution

SLIDE 28

Part 1: Squares vs. Confusion Matrix

(Figure panels: Squares with a Sortable Table, Confusion Matrix with a Sortable Table)

Select/deselect individual cells. Select cells of a given row/column.

SLIDE 29

Part 1: Tasks

  • T1 – Overall
      • Select the classifier with the larger number of errors
  • T2 – Class-level
      • Select one of the two classes with the most errors
  • T3 – Instance-level
      • Select an error with a score of .9 or above in the wrong class

SLIDE 30

Part 1: Squares Performed Better

  • Task Time
      • Squares led to faster task times (main effect: p < 0.001)
      • Squares scaled better with the number of classes (interaction effect: p = 0.012)

SLIDE 31

Part 1: Squares Performed Better

  • Accuracy
      • Squares led to more accurate results (p < 0.001)

(Figure: accuracy, Squares vs. Confusion Matrix)

SLIDE 32

Part 1: People Preferred Squares

  • Squares was more helpful
  • Squares was preferred

(Figure: Helpfulness ratings on a 1 to 5 scale, Squares vs. Confusion Matrix, for T1/5, T1/15, T2/5, T2/15, T3/5, T3/15)
(Figure: Preference, Squares vs. Confusion Matrix, for T1/5, T1/15, T2/5, T2/15, T3/5, T3/15)

SLIDE 33

Part 2: (Squares Only) Distribution Tasks

  • T4 – Overall
      • Select the classifier with the worst distribution
  • T5 – Class-level
      • Select one of the two classes with the worst distribution
  • T6 – Confusion
      • Select the two classes most confused with each other

SLIDE 34

Part 2: Squares was helpful in distribution tasks

(Figure panels: Task Time (s), Accuracy, and Helpfulness for T4, T5, and T6, each comparing Small and Large)

SLIDE 35

Freeform Feedback

  • Positive:
      • “Granular and at the same time general overview of the classifiers is great.”
      • “Seeing the distribution of scores is very helpful.”
      • “Had fun for the first time while classifying!”
  • Negative:
      • “I prefer having numbers than pure display.”
      • “[Confusion Matrix is] more straightforward, lower learning curve.”

SLIDE 36

Future Work

  • Further Evaluation
      • Compare to alternative designs of Confusion Matrix, as well as other visualization designs in the literature
  • Scalability
      • Supporting more than 20 classes
      • Optimizing color assignments

(Figure: Confusion Wheel [B. Alsallakh, VAST '14])

SLIDE 37

Squares as a Tool

  • Deployed along with a machine learning toolkit within Microsoft

(Figure: Model Building Interface)

SLIDE 38

Acknowledgements

  • We thank the Machine Teaching Group at Microsoft Research for their support and feedback.
  • We thank the anonymous reviewers for their constructive comments.

SLIDE 39

Thanks! Questions?

Donghao Ren (donghao.ren@gmail.com)

University of California, Santa Barbara

SLIDE 41

Survey of Machine Learning Practices

  • Survey within a large software company in July 2015.
  • 102 respondents

(Figure: Respondents’ roles in the company, in %: Data scientist, Software engineer, Researcher, Program manager, Other)

SLIDE 42

Number of Classes

  • How many classes do your classifiers typically deal with (check all that apply)?
  • Most respondents typically deal with fewer than 20 classes.

SLIDE 43

Important Tasks

  • “How difficult” and “how important” ratings of tasks:
      • Prioritizing efforts is difficult even for expert users.
      • Understanding instance-level performance is relatively more difficult in common tools.

SLIDE 44

Integrating into LUIS (Language Understanding Intelligent Service)