Understanding and Visualizing Data Iteration in Machine Learning

Fred Hohman, Georgia Tech, @fredhohman
Kanit Wongsuphasawat, Apple, @kanitw
Mary Beth Kery, Carnegie Mellon University, @mbkery
Kayur Patel, Apple, @foil

Convolutional Neural Network on MNIST
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
Test accuracy: 0.81
How to improve performance?
A. Different architecture
B. Tweak hyperparameters
C. Adjust loss function
D. Get an ML PhD
Add more data!

Convolutional Neural Network on MNIST
https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
Test accuracy: 0.81 → Add 10x data → Test accuracy: 0.99 ✓

“clean and grow the training set”
“repeated this process until accuracy improved enough”
This is a data iteration!
Source: https://twitter.com/karpathy/status/1231378194948706306

[Diagram: World, Data, Model; loops labeled Data Iteration and Model Iteration]

Understanding and Visualizing Data Iteration
Contributions
• Identify common data iterations and challenges through practitioner interviews at Apple
• CHAMELEON: Interactive visualization for data iteration
• Case studies on real datasets

Interviews to Understand Data Iteration Practice
• Semi-structured interviews with ML researchers, engineers, and managers at Apple
• 23 practitioners across 13 teams
Participant Information (domain: specialization, # of people)
• Computer vision: Large-scale classification, object detection, video analysis, visual search (8)
• Natural language processing: Text classification, question answering, language understanding (8)
• Applied ML + Systems: Platform and infrastructure, crowdsourcing, annotation, deployment (5)
• Sensors: Activity recognition (1)

“Most of the time, we improve performance more by adding additional data or cleaning data rather than changing the model [code].” — Applied ML practitioner in computer vision

Interviews to Understand Data Iteration Practice
Findings Summary
Why do Data Iteration?
• Data improves performance
• Data bootstraps modeling
• The world changes, so must your data
Data Iteration Frequency
• Models: monthly → daily
• Data: monthly → per minute
Entangled Iterations
• Separate model and data iterations to ensure fair comparisons

Interviews to Understand Data Iteration Practice
Common Data Iterations
+ Add sampled instances: Gather more data randomly sampled from population
+ Add specific instances: Gather more data intentionally for specific label or feature range
+ Add synthetic instances: Gather more data by creating synthetic data or augmenting existing data
+ Add labels: Add and enrich instance annotations
- Remove instances: Remove noisy and erroneous outliers
~ Modify features, labels: Clean, edit, or fix data
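One way to make the add (+), remove (-), and modify (~) iterations concrete is to diff two dataset versions keyed by instance id. The sketch below is illustrative only: `diff_versions`, the instance ids, and the `x`/`label` fields are hypothetical names, not part of the talk or of CHAMELEON.

```python
from typing import Any, Dict, List

# Hypothetical representation: one instance is a dict of features and a label.
Instance = Dict[str, Any]


def diff_versions(old: Dict[str, Instance],
                  new: Dict[str, Instance]) -> Dict[str, List[str]]:
    """Classify the change between two dataset versions keyed by instance id."""
    added = sorted(k for k in new if k not in old)        # "+" iterations
    removed = sorted(k for k in old if k not in new)      # "-" iterations
    modified = sorted(k for k in new
                      if k in old and new[k] != old[k])   # "~" iterations
    return {"added": added, "removed": removed, "modified": modified}


# Made-up example data: "b" is dropped as an outlier, "c" gains a label,
# and "d" is a newly collected instance.
v1 = {
    "a": {"x": 0.1, "label": "walk"},
    "b": {"x": 9.9, "label": "run"},
    "c": {"x": 0.4, "label": None},
}
v2 = {
    "a": {"x": 0.1, "label": "walk"},
    "c": {"x": 0.4, "label": "sit"},
    "d": {"x": 0.7, "label": "run"},
}
changes = diff_versions(v1, v2)
print(changes)
```

A diff like this does not say why an instance was added (sampled, specific, or synthetic); that provenance would have to be recorded at collection time.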

Interviews to Understand Data Iteration Practice
Challenges of Data Iteration
• Tracking experimental and iteration history
• When to “unfreeze” data versions
• When to stop collecting data
• Manual failure case analysis
• Building data blacklists

CHAMELEON
Understanding and Visualizing Data Iteration
• Retroactively track and explore data iterations and metrics over versions
• Attribute model metric change to data iterations
• Understand model sensitivity over data versions
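Tracking metrics over data versions can be sketched as a small version history with per-step deltas, so that a metric change sits next to the data change that accompanied it. This is an assumption-laden illustration, not CHAMELEON's implementation: `VersionRecord`, `metric_deltas`, and all the numbers are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class VersionRecord(NamedTuple):
    """Hypothetical record of one data version and the metric measured on it."""
    version: str
    n_instances: int
    accuracy: float


def metric_deltas(history: List[VersionRecord]) -> List[Tuple[str, str, int, float]]:
    """Pair each consecutive version step with its instance and accuracy change."""
    deltas = []
    for prev, cur in zip(history, history[1:]):
        deltas.append((
            prev.version,
            cur.version,
            cur.n_instances - prev.n_instances,
            round(cur.accuracy - prev.accuracy, 4),
        ))
    return deltas


# Made-up history for illustration only.
history = [
    VersionRecord("1.1", 1000, 0.70),
    VersionRecord("1.2", 1500, 0.75),
    VersionRecord("1.3", 1400, 0.74),
]
for step in metric_deltas(history):
    print(step)
```

Such a comparison is only fair if the model code is held fixed across the versions, which matches the interview finding about separating model and data iterations.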

CHAMELEON
Understanding and Visualizing Data Iteration
Compare feature distributions by:
• Training and testing splits
• Performance (e.g., correct vs. incorrect predictions)
• Data versions
“Overlaid diverging histogram” per feature
[Figure: count per binned feature, diverging into Correct and Incorrect]
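The counts behind a diverging histogram of correct vs. incorrect predictions can be sketched as follows: bin one feature, then count correct instances upward and incorrect instances downward (as negative numbers). The function name, the equal-width binning, and the sample values are assumptions for illustration; the talk does not specify how CHAMELEON bins features.

```python
from typing import List, Sequence, Tuple


def diverging_histogram(values: Sequence[float],
                        correct: Sequence[bool],
                        lo: float, hi: float,
                        n_bins: int) -> Tuple[List[int], List[int]]:
    """Return (up, down): per-bin counts of correct (>= 0) and incorrect (<= 0)."""
    width = (hi - lo) / n_bins
    up = [0] * n_bins
    down = [0] * n_bins
    for v, ok in zip(values, correct):
        # Clamp so that v == hi (or slightly out-of-range values) lands in an edge bin.
        i = min(max(int((v - lo) / width), 0), n_bins - 1)
        if ok:
            up[i] += 1
        else:
            down[i] -= 1
    return up, down


# Made-up feature values and per-instance correctness flags.
values = [0.1, 0.2, 0.55, 0.6, 0.9, 1.0]
correct = [True, False, True, True, False, True]
up, down = diverging_histogram(values, correct, lo=0.0, hi=1.0, n_bins=2)
print(up, down)
```

Plotting `up` and `down` as bars sharing one x-axis gives the diverging shape; overlaying a second pair for another split or data version gives the overlaid comparison.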

CHAMELEON Visualizations
• Aggregated Embedding
• Prediction Change Matrix
• Sensitivity Histogram
[Figure readout: correct instances: 154, incorrect instances: 60, total instances: 214, accuracy: 0.720]
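The talk does not define the prediction change matrix, so the sketch below is one plausible reading: count, for a fixed set of labeled instances, how many were correct or incorrect under an older and a newer model. The function name and the sample labels and predictions are hypothetical. The last line checks the accuracy readout shown on the slide (154 correct of 214).

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def prediction_change_matrix(labels: Sequence[Hashable],
                             preds_old: Sequence[Hashable],
                             preds_new: Sequence[Hashable]) -> Counter:
    """Count instances by (correct under old model, correct under new model)."""
    counts: Counter = Counter()
    for y, a, b in zip(labels, preds_old, preds_new):
        key: Tuple[bool, bool] = (a == y, b == y)
        counts[key] += 1
    return counts


# Made-up labels and predictions for illustration only.
labels    = [1, 0, 1, 1, 0]
preds_old = [1, 1, 0, 1, 0]
preds_new = [1, 0, 0, 0, 0]
matrix = prediction_change_matrix(labels, preds_old, preds_new)
print(matrix)

# Accuracy readout from the slide: 154 correct out of 214 instances.
slide_accuracy = round(154 / 214, 3)
print(slide_accuracy)
```

The off-diagonal cells, (True, False) and (False, True), are the instances whose outcome changed between versions and are the natural starting point for failure case analysis.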

Case Study I: Sensor Prediction
• 64,502 instances
• Collected over 2 months
• 20 features
• Visualization challenges prior data collection beliefs
• Finding failure cases
• Interface utility
[Figure: histograms of one feature at versions 1.1, 1.5, 1.9, 1.13, 1.18. A feature’s long-tailed, multi-modal distribution shape solidifies over collection time: 1,442 → 64,205 instances]

Case Study II: Learning from Logs
• 48,000 instances
• Collected over 6 months
• 34 features
• Inspecting performance across features
• Capturing data processing changes
• Encouraging instance-level analysis
[Figure: Filtering across features quickly finds data subsets to compare against global distributions. Readout: correct instances: 125, incorrect instances: 12, total instances: 137, accuracy: 0.912]

Opportunities for Future ML Iteration Tools
• Interfaces for both data and model iteration
• Data iteration tooling to help experimental handoff
• Data as a shared connection across user expertise
• Visualizing probabilistic labels from data programming
• Visualizations for other data types

Understanding and Visualizing Data Iteration in Machine Learning
fredhohman.com/papers/chameleon
Fred Hohman, Georgia Tech, @fredhohman
Kanit Wongsuphasawat, Apple, @kanitw
Mary Beth Kery, Carnegie Mellon University, @mbkery
Kayur Patel, Apple, @foil
