Understanding and Visualizing Data Iteration in Machine Learning - PowerPoint PPT Presentation

Understanding and Visualizing Data Iteration in Machine Learning. Fred Hohman, Georgia Tech, @fredhohman; Kanit Wongsuphasawat, Apple, @kanitw; Mary Beth Kery, Carnegie Mellon University, @mbkery; Kayur Patel, Apple, @foil


  1. Understanding and Visualizing Data Iteration in Machine Learning. Fred Hohman, Georgia Tech, @fredhohman; Kanit Wongsuphasawat, Apple, @kanitw; Mary Beth Kery, Carnegie Mellon University, @mbkery; Kayur Patel, Apple, @foil

  2. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py Convolutional Neural Network on MNIST Test accuracy: 0.81

  3. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py Convolutional Neural Network on MNIST How to improve performance? Test accuracy: 0.81

  4. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py Convolutional Neural Network on MNIST How to improve performance? A. Different architecture Test accuracy: 0.81

  5. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py Convolutional Neural Network on MNIST How to improve performance? A. Different architecture B. Tweak hyperparameter Test accuracy: 0.81

  6. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py Convolutional Neural Network on MNIST How to improve performance? A. Different architecture B. Tweak hyperparameter C. Adjust loss function Test accuracy: 0.81

  7. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py Convolutional Neural Network on MNIST How to improve performance? A. Different architecture B. Tweak hyperparameter C. Adjust loss function D. Get an ML PhD Test accuracy: 0.81

  8. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py Convolutional Neural Network on MNIST How to improve performance? A. Different architecture B. Tweak hyperparameter C. Adjust loss function D. Get an ML PhD Add more data! Test accuracy: 0.81

  9. https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py Test accuracy: 0.81 → Add 10x data → Test accuracy: 0.99 ✓

  10. https://twitter.com/karpathy/status/1231378194948706306

  11. “clean and grow the training set” “repeated this process until accuracy improved enough” https://twitter.com/karpathy/status/1231378194948706306

  12. “clean and grow the training set” “repeated this process until accuracy improved enough” This is a data iteration! https://twitter.com/karpathy/status/1231378194948706306

  13. Data Iteration Model Iteration World Data Model

  16. Understanding and Visualizing Data Iteration: Contributions • Identify common data iterations and challenges through practitioner interviews at Apple • CHAMELEON: Interactive visualization for data iteration • Case studies on real datasets

  17. Interviews to Understand Data Iteration Practice • Semi-structured interviews with ML researchers, engineers, and managers at Apple • 23 practitioners across 13 teams
      Participant Information (Domain: Specialization; # of people)
      Computer vision: Large-scale classification, object detection, video analysis, visual search; 8
      Natural language processing: Text classification, question answering, language understanding; 8
      Applied ML + Systems: Platform and infrastructure, crowdsourcing, annotation, deployment; 5
      Sensors: Activity recognition; 1

  18. “Most of the time, we improve performance more by adding additional data or cleaning data rather than changing the model [code].” — Applied ML practitioner in computer vision

  19. Interviews to Understand Data Iteration Practice: Findings Summary
      Why do Data Iteration? • Data improves performance • Data bootstraps modeling • The world changes, so must your data
      Data Iteration Frequency • Models: monthly → daily • Data: monthly → per minute
      Entangled Iterations • Separate model and data iterations to ensure fair comparisons

  20. Interviews to Understand Data Iteration Practice: Common Data Iterations
      + Add sampled instances: Gather more data randomly sampled from population
      + Add specific instances: Gather more data intentionally for specific label or feature range
      + Add synthetic instances: Gather more data by creating synthetic data or augmenting existing data
      + Add labels: Add and enrich instance annotations
      - Remove instances: Remove noisy and erroneous outliers
      ~ Modify features, labels: Clean, edit, or fix data
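
Several of the common data iterations on slide 20 can be read as small operations that turn one dataset version into the next. The sketch below is only an illustration under assumed names: the toy instance layout (`id`, `x`, `label`) and the helper functions are hypothetical and do not come from the talk or from CHAMELEON.

```python
import random

def add_sampled(data, population, k, seed=0):
    """+ Add sampled instances: draw up to k unseen instances at random."""
    rng = random.Random(seed)
    seen = {d["id"] for d in data}
    candidates = [p for p in population if p["id"] not in seen]
    return data + rng.sample(candidates, min(k, len(candidates)))

def add_specific(data, population, wanted):
    """+ Add specific instances: gather unseen instances matching a predicate."""
    seen = {d["id"] for d in data}
    return data + [p for p in population if p["id"] not in seen and wanted(p)]

def remove_instances(data, is_bad):
    """- Remove instances: drop noisy or erroneous instances."""
    return [d for d in data if not is_bad(d)]

def modify(data, fix):
    """~ Modify features, labels: apply a fix to a copy of every instance."""
    return [fix(dict(d)) for d in data]

# Hypothetical toy population and a chain of data versions.
population = [{"id": i, "x": float(i), "label": i % 2} for i in range(100)]
v1 = population[:10]
v2 = add_sampled(v1, population, 5)
v3 = add_specific(v2, population, lambda p: p["x"] >= 95)
v4 = remove_instances(v3, lambda d: d["x"] == 3.0)
v5 = modify(v4, lambda d: {**d, "label": 1} if d["id"] == 0 else d)
```

Each function returns a new list instead of mutating its input, so every earlier version stays available for comparison.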

  21. Interviews to Understand Data Iteration Practice Challenges of Data Iteration • Tracking experimental and iteration history • When to “unfreeze” data versions • When to stop collecting data • Manual failure case analysis • Building data blacklists

  22. CHAMELEON: Understanding and Visualizing Data Iteration • Retroactively track and explore data iterations and metrics over versions • Attribute model metric change to data iterations • Understand model sensitivity over data versions
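
One simple way to picture tracking metrics over versions, as slide 22 describes, is to pair each version's metric change with the instances added or removed since the previous version. This is a minimal sketch, not CHAMELEON's implementation: the `compare_versions` helper, the version names, and all numbers are hypothetical, and a metric change listed next to a data diff is a hint to investigate, not proof of cause.

```python
def compare_versions(versions):
    """versions: list of (name, set_of_instance_ids, accuracy), oldest first.

    Returns one row per consecutive pair of versions.
    """
    rows = []
    for (_, prev_ids, prev_acc), (name, ids, acc) in zip(versions, versions[1:]):
        rows.append({
            "version": name,
            "added": len(ids - prev_ids),      # instances new in this version
            "removed": len(prev_ids - ids),    # instances dropped since last version
            "accuracy_delta": round(acc - prev_acc, 3),
        })
    return rows

# Hypothetical history: grow the data, then remove ten instances.
history = [
    ("v1", set(range(0, 100)), 0.81),
    ("v2", set(range(0, 150)), 0.86),
    ("v3", set(range(10, 150)), 0.85),
]
report = compare_versions(history)
```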

  23. CHAMELEON: Understanding and Visualizing Data Iteration. Compare feature distributions by: • Training and testing splits • Performance (e.g., correct v. incorrect predictions) • Data versions. “Overlaid diverging histogram” per feature [Figure: count against binned feature, with Correct and Incorrect series]
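
The slide names an overlaid diverging histogram without defining it. One plausible reading, assumed here, is that a feature is binned once and the two groups being compared are drawn in opposite directions from a shared axis. The sketch below computes only the per-bin counts; the function name and the sign convention (correct positive, incorrect negative) are assumptions.

```python
def diverging_histogram(values, correct, lo, hi, bins):
    """Bin a feature over [lo, hi] and count correct and incorrect predictions.

    Returns (up, down): per-bin counts of correct predictions (positive)
    and incorrect predictions (negative), ready to draw on one shared axis.
    """
    width = (hi - lo) / bins
    up = [0] * bins
    down = [0] * bins
    for v, ok in zip(values, correct):
        if not lo <= v <= hi:
            continue  # ignore values outside the binned range
        i = min(int((v - lo) / width), bins - 1)  # v == hi goes in the last bin
        if ok:
            up[i] += 1
        else:
            down[i] -= 1
    return up, down

# Hypothetical feature values and whether each prediction was correct.
values = [0.1, 0.2, 0.6, 0.7, 0.9, 1.0]
correct = [True, False, True, True, False, True]
up, down = diverging_histogram(values, correct, lo=0.0, hi=1.0, bins=2)
```

The same function covers the other two comparisons on the slide by swapping the boolean flag, for example train versus test membership or old versus new data version.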

  24. CHAMELEON Visualizations • Aggregated Embedding • Prediction Change Matrix • Sensitivity Histogram [Figure readout: correct instances: 154, incorrect instances: 60, total instances: 214, accuracy: 0.720]
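
The slide lists a prediction change matrix but does not say how it is built. A common construction, assumed here and not confirmed by the source, cross-tabulates whether each instance was predicted correctly by the models trained on two data versions. The function and key names are hypothetical. The second helper reproduces the accuracy readout from the slide's own counts.

```python
from collections import Counter

def prediction_change_matrix(labels, old_preds, new_preds):
    """Count instances by (correct under old version, correct under new version)."""
    counts = Counter(
        (old == y, new == y) for y, old, new in zip(labels, old_preds, new_preds)
    )
    return {
        "stayed_correct": counts[(True, True)],
        "fixed": counts[(False, True)],       # wrong before, right now
        "broken": counts[(True, False)],      # right before, wrong now
        "stayed_incorrect": counts[(False, False)],
    }

def accuracy(correct, total):
    """Fraction of correct instances, rounded to three decimals."""
    return round(correct / total, 3)

# Hypothetical labels and predictions from two data versions.
labels = [0, 1, 1, 0, 1]
old_preds = [0, 0, 1, 1, 1]
new_preds = [0, 1, 0, 1, 1]
matrix = prediction_change_matrix(labels, old_preds, new_preds)

# Counts taken from the slide: 154 correct out of 214 instances.
slide_accuracy = accuracy(154, 214)
```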

  25. Demo

  27. Case Study I: Sensor Prediction • 64,502 instances • Collected over 2 months • 20 features • Visualization challenges prior data collection beliefs • Finding failure cases • Interface utility [Figure: histograms of one feature in panels labeled 1.1, 1.5, 1.9, 1.13, 1.18] A feature’s long-tailed, multi-modal distribution shape solidifies over collection time: 1,442 → 64,205 instances

  28. Case Study II: Learning from Logs • 48,000 instances • Collected over 6 months • 34 features • Inspecting performance across features • Capturing data processing changes • Encouraging instance-level analysis [Figure: two feature filters applied; readout: correct instances: 125, incorrect instances: 12, total instances: 137, accuracy: 0.912] Filtering across features quickly finds data subsets to compare against global distributions
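
The filtering described on slide 28 amounts to selecting instances whose feature values fall inside chosen ranges and comparing the subset's accuracy with the global figure. In the slide's readout, 125 correct out of 137 filtered instances gives the 0.912 shown. The sketch below uses hypothetical feature names (`duration`, `count`) and made-up rows, and is not the tool's actual code.

```python
def accuracy(rows):
    """Fraction of rows predicted correctly, or None for an empty subset."""
    if not rows:
        return None
    return sum(r["correct"] for r in rows) / len(rows)

def filter_rows(rows, **ranges):
    """Keep rows whose features fall inside every given (lo, hi) range."""
    return [
        r for r in rows
        if all(lo <= r[name] <= hi for name, (lo, hi) in ranges.items())
    ]

# Hypothetical log-derived instances with two features.
rows = [
    {"duration": 1.0, "count": 3, "correct": True},
    {"duration": 2.5, "count": 8, "correct": True},
    {"duration": 7.0, "count": 9, "correct": False},
    {"duration": 8.0, "count": 2, "correct": False},
    {"duration": 9.0, "count": 7, "correct": True},
]
global_acc = accuracy(rows)
long_runs = filter_rows(rows, duration=(5.0, 10.0))
long_busy = filter_rows(rows, duration=(5.0, 10.0), count=(5, 10))
```

A subset whose accuracy sits well below the global value, like `long_runs` here, is a candidate for instance-level inspection.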

  29. Opportunities for Future ML Iteration Tools

  30. Opportunities for Future ML Iteration Tools • Interfaces for both data and model iteration

  31. Opportunities for Future ML Iteration Tools • Interfaces for both data and model iteration • Data iteration tooling to help experimental handoff

  32. Opportunities for Future ML Iteration Tools • Interfaces for both data and model iteration • Data iteration tooling to help experimental handoff • Data as a shared connection across user expertise

  33. Opportunities for Future ML Iteration Tools • Interfaces for both data and model iteration • Data iteration tooling to help experimental handoff • Data as a shared connection across user expertise • Visualizing probabilistic labels from data programming

  34. Opportunities for Future ML Iteration Tools • Interfaces for both data and model iteration • Data iteration tooling to help experimental handoff • Data as a shared connection across user expertise • Visualizing probabilistic labels from data programming • Visualizations for other data types

  35. Understanding and Visualizing Data Iteration in Machine Learning. fredhohman.com/papers/chameleon. Fred Hohman, Georgia Tech, @fredhohman; Kanit Wongsuphasawat, Apple, @kanitw; Mary Beth Kery, Carnegie Mellon University, @mbkery; Kayur Patel, Apple, @foil
