Understanding and Visualizing Data Iteration in Machine Learning


Fred Hohman, Georgia Tech, @fredhohman
Kanit Wongsuphasawat, Apple, @kanitw
Mary Beth Kery, Carnegie Mellon University, @mbkery
Kayur Patel, Apple, @foil


SLIDE 1

Fred Hohman, Georgia Tech, @fredhohman
Kanit Wongsuphasawat, Apple, @kanitw
Mary Beth Kery, Carnegie Mellon University, @mbkery
Kayur Patel, Apple, @foil

Understanding and Visualizing Data Iteration in Machine Learning

SLIDE 2

Test accuracy: 0.81

Convolutional Neural Network on MNIST

https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

SLIDE 3

How to improve performance?

Convolutional Neural Network on MNIST

https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Test accuracy: 0.81

SLIDE 4

How to improve performance?

  • A. Different architecture

Convolutional Neural Network on MNIST

https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Test accuracy: 0.81

SLIDE 5

How to improve performance?

  • A. Different architecture
  • B. Tweak hyperparameter

Convolutional Neural Network on MNIST

https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Test accuracy: 0.81

SLIDE 6

How to improve performance?

  • A. Different architecture
  • B. Tweak hyperparameter
  • C. Adjust loss function

Convolutional Neural Network on MNIST

https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Test accuracy: 0.81

SLIDE 7

How to improve performance?

  • A. Different architecture
  • B. Tweak hyperparameter
  • C. Adjust loss function
  • D. Get an ML PhD

Convolutional Neural Network on MNIST

https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Test accuracy: 0.81

SLIDE 8

How to improve performance?

  • A. Different architecture
  • B. Tweak hyperparameter
  • C. Adjust loss function
  • D. Get an ML PhD

Add more data!

Convolutional Neural Network on MNIST

https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Test accuracy: 0.81

SLIDE 9

https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Test accuracy: 0.81 → Add 10x data → Test accuracy: 0.99 ✓

SLIDE 10


https://twitter.com/karpathy/status/1231378194948706306

SLIDE 11


“clean and grow the training set”

“repeated this process until accuracy improved enough”

https://twitter.com/karpathy/status/1231378194948706306

SLIDE 12


“clean and grow the training set”

“repeated this process until accuracy improved enough”

https://twitter.com/karpathy/status/1231378194948706306

This is a data iteration!

SLIDE 13

Diagram: World, Data, Model; Data Iteration, Model Iteration

SLIDE 14

Diagram: World, Data, Model; Data Iteration, Model Iteration

SLIDE 15

Diagram: World, Data, Model; Data Iteration, Model Iteration

SLIDE 16

Contributions

  • Identify common data iterations and challenges through practitioner interviews at Apple
  • CHAMELEON: Interactive visualization for data iteration
  • Case studies on real datasets

Understanding and Visualizing Data Iteration

SLIDE 17

Participant Information
Interviews to Understand Data Iteration Practice

  • Semi-structured interviews with ML researchers, engineers, and managers at Apple
  • 23 practitioners across 13 teams

Domain | Specialization | # of people
Computer vision | Large-scale classification, object detection, video analysis, visual search | 8
Natural language processing | Text classification, question answering, language understanding | 8
Applied ML + Systems | Platform and infrastructure, crowdsourcing, annotation, deployment | 5
Sensors | Activity recognition | 1

SLIDE 18

“Most of the time, we improve performance more by adding additional data or cleaning data rather than changing the model [code].”

— Applied ML practitioner in computer vision

SLIDE 19

Findings Summary
Interviews to Understand Data Iteration Practice

Entangled Iterations

  • Separate model and data iterations to ensure fair comparisons

Data Iteration Frequency

  • Models: monthly → daily
  • Data: monthly → per minute

Why do Data Iteration?

  • Data improves performance
  • Data bootstraps modeling
  • The world changes, so must your data

SLIDE 20

Common Data Iterations
Interviews to Understand Data Iteration Practice

+ Add sampled instances: Gather more data randomly sampled from population
+ Add specific instances: Gather more data intentionally for specific label or feature range
+ Add synthetic instances: Gather more data by creating synthetic data or augmenting existing data
+ Add labels: Add and enrich instance annotations
- Remove instances: Remove noisy and erroneous outliers
~ Modify features, labels: Clean, edit, or fix data
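
The add, remove, and modify iterations listed on this slide can be recovered by diffing two versions of a dataset. The following is a minimal sketch, not code from the talk or from CHAMELEON. It assumes each version is a dict mapping a hypothetical instance id to a (features, label) record; the function name `diff_versions` and the record layout are illustrative only.

```python
def diff_versions(old, new):
    """Summarize how `new` differs from `old`.

    Both arguments are dicts: instance id -> (features, label).
    This layout is an assumption made for illustration.
    """
    old_ids, new_ids = set(old), set(new)
    added = sorted(new_ids - old_ids)
    removed = sorted(old_ids - new_ids)
    # Instances present in both versions whose features or label changed.
    modified = sorted(i for i in old_ids & new_ids if old[i] != new[i])
    return {"added": added, "removed": removed, "modified": modified}


# Two toy versions of a dataset (all values are made up).
v1 = {
    "a": ((0.1, 0.2), "walk"),
    "b": ((0.4, 0.9), "run"),
    "c": ((9.9, 9.9), "run"),   # a noisy outlier, dropped in v2
}
v2 = {
    "a": ((0.1, 0.2), "walk"),
    "b": ((0.4, 0.9), "walk"),  # label fixed
    "d": ((0.3, 0.5), "run"),   # newly collected instance
}

summary = diff_versions(v1, v2)
print(summary)
```

A diff like this cannot tell sampled, specific, and synthetic additions apart; that distinction would have to be recorded when the data is collected.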

SLIDE 21

Challenges of Data Iteration
Interviews to Understand Data Iteration Practice

  • Tracking experimental and iteration history
  • Manual failure case analysis
  • Building data blacklists
  • When to “unfreeze” data versions
  • When to stop collecting data
SLIDE 22

CHAMELEON

  • Retroactively track and explore data iterations and metrics over versions
  • Attribute model metric change to data iterations
  • Understand model sensitivity over data versions

Understanding and Visualizing Data Iteration

SLIDE 23

CHAMELEON

Compare feature distributions by:

  • Training and testing splits
  • Performance (e.g., correct vs. incorrect predictions)
  • Data versions

“Overlaid diverging histogram” per feature (axes: count, binned feature; series: incorrect, correct)

Understanding and Visualizing Data Iteration
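
The counts behind a diverging histogram for one feature can be sketched as follows. This is an illustration under stated assumptions, not CHAMELEON's implementation: the function name `diverging_histogram`, the binning rule, and the convention of negating incorrect counts so they draw below the axis are all choices made here.

```python
from bisect import bisect_right


def diverging_histogram(values, correct, bin_edges):
    """Count correct and incorrect instances per feature bin.

    values    : feature value for each instance
    correct   : True where the model predicted that instance correctly
    bin_edges : ascending edges; bin k covers [edges[k], edges[k+1]),
                and the last bin also includes its right edge
    Returns one (correct_count, -incorrect_count) pair per bin, so the
    incorrect counts can be drawn in the opposite direction.
    """
    n_bins = len(bin_edges) - 1
    up = [0] * n_bins
    down = [0] * n_bins
    for value, ok in zip(values, correct):
        if value < bin_edges[0] or value > bin_edges[-1]:
            continue  # outside the plotted range
        k = min(bisect_right(bin_edges, value) - 1, n_bins - 1)
        if ok:
            up[k] += 1
        else:
            down[k] += 1
    return [(u, -d) for u, d in zip(up, down)]


# Toy data: six instances, two bins (all values are made up).
values = [0.1, 0.4, 0.45, 0.6, 0.9, 1.0]
correct = [True, False, True, True, True, False]
bars = diverging_histogram(values, correct, [0.0, 0.5, 1.0])

for (u, d), left in zip(bars, [0.0, 0.5]):
    print(f"bin from {left}: " + "+" * u + " / " + "-" * -d)
```

Running the same function on the training split and the testing split, or on two data versions, gives the pairs of distributions the slide lists for comparison.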

SLIDE 24

CHAMELEON Visualizations

  • Aggregated Embedding
  • Prediction Change Matrix
  • Sensitivity Histogram

correct instances: 154, incorrect instances: 60, total instances: 214, accuracy: 0.720
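
The summary numbers follow from accuracy = correct / total (154 / 214 ≈ 0.720). The sketch below reproduces that calculation and shows one plausible reading of a prediction change matrix: a count of how per-instance correctness moves between two data versions. That reading, the function names, and the pairing of flags by position are assumptions made for illustration, not a description of CHAMELEON's code.

```python
from collections import Counter


def accuracy(correct_flags):
    """Fraction of instances predicted correctly."""
    return sum(correct_flags) / len(correct_flags)


def prediction_change_matrix(correct_before, correct_after):
    """Count correctness transitions for instances shared by two versions.

    Keys are (was_correct, is_correct) pairs. The two lists are assumed
    to describe the same instances in the same order.
    """
    counts = Counter(zip(correct_before, correct_after))
    return {
        (before, after): counts[(before, after)]
        for before in (True, False)
        for after in (True, False)
    }


# The summary on the slide: 154 correct out of 214 instances.
flags = [True] * 154 + [False] * 60
print(round(accuracy(flags), 3))  # 0.72

# Toy correctness flags for five instances under two data versions.
before = [True, True, False, False, True]
after = [True, False, True, False, True]
matrix = prediction_change_matrix(before, after)
print(matrix[(False, True)])  # instances the newer version gets right
```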

SLIDE 25

Demo

SLIDE 26

Demo

SLIDE 27

Case Study I: Sensor Prediction

  • 64,502 instances
  • Collected over 2 months
  • 20 features

  • Visualization challenges prior data collection beliefs
  • Finding failure cases
  • Interface utility

A feature’s long-tailed, multi-modal distribution shape solidifies over collection time: 1,442 → 64,205 instances

SLIDE 28

Case Study II: Learning from Logs

  • 48,000 instances
  • Collected over 6 months
  • 34 features

  • Inspecting performance across features
  • Capturing data processing changes
  • Encouraging instance-level analysis

correct instances: 125, incorrect instances: 12, total instances: 137, accuracy: 0.912

Filtering across features quickly finds data subsets to compare against global distributions
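
Filtering by feature ranges and comparing the subset against the full dataset can be sketched in a few lines. This is a hedged illustration only: the row layout, the feature names `latency_ms` and `retries`, and both helper functions are hypothetical and do not come from the case study.

```python
def accuracy_summary(rows):
    """Return (correct, incorrect, total, accuracy) for a list of rows."""
    total = len(rows)
    correct = sum(1 for r in rows if r["correct"])
    acc = correct / total if total else float("nan")
    return correct, total - correct, total, acc


def filter_rows(rows, **ranges):
    """Keep rows whose features fall inside every given (low, high) range."""
    return [
        r for r in rows
        if all(lo <= r[name] <= hi for name, (lo, hi) in ranges.items())
    ]


# Toy log-derived instances; "correct" marks a correct model prediction.
rows = [
    {"latency_ms": 12, "retries": 0, "correct": True},
    {"latency_ms": 250, "retries": 3, "correct": False},
    {"latency_ms": 40, "retries": 1, "correct": True},
    {"latency_ms": 300, "retries": 2, "correct": False},
    {"latency_ms": 35, "retries": 0, "correct": True},
    {"latency_ms": 280, "retries": 0, "correct": True},
]

subset = filter_rows(rows, latency_ms=(200, 1000))
print("global:", accuracy_summary(rows))
print("subset:", accuracy_summary(subset))
```

A subset whose accuracy is far from the global figure is a candidate for the instance-level analysis the slide encourages.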

SLIDE 29

Opportunities for Future ML Iteration Tools

SLIDE 30

Opportunities for Future ML Iteration Tools

  • Interfaces for both data and model iteration
SLIDE 31

Opportunities for Future ML Iteration Tools

  • Interfaces for both data and model iteration
  • Data iteration tooling to help experimental handoff
SLIDE 32

Opportunities for Future ML Iteration Tools

  • Interfaces for both data and model iteration
  • Data iteration tooling to help experimental handoff
  • Data as a shared connection across user expertise
SLIDE 33

Opportunities for Future ML Iteration Tools

  • Interfaces for both data and model iteration
  • Data iteration tooling to help experimental handoff
  • Data as a shared connection across user expertise
  • Visualizing probabilistic labels from data programming
SLIDE 34

Opportunities for Future ML Iteration Tools

  • Interfaces for both data and model iteration
  • Data iteration tooling to help experimental handoff
  • Data as a shared connection across user expertise
  • Visualizing probabilistic labels from data programming
  • Visualizations for other data types
SLIDE 35

Fred Hohman, Georgia Tech, @fredhohman
Kanit Wongsuphasawat, Apple, @kanitw
Mary Beth Kery, Carnegie Mellon University, @mbkery
Kayur Patel, Apple, @foil

Understanding and Visualizing Data Iteration in Machine Learning

fredhohman.com/papers/chameleon