SLIDE 1 Scaling Data Products Under Startup Constraints
A Case Study of ML Bias Testing
SLIDE 2 Scaling Data Products Under Startup Constraints
A Case Study of ML Bias Testing
SLIDE 3 Edwin Ong (@edwin)
- Co-Founder, TinyData
- Founded CastTV (acquired by Tribune)
- Founded FileFish (acquired by Oracle)
- Stanford Symbolic Systems
SLIDE 4 TinyData
- Help other companies make data products
- Make our own data products
SLIDE 5 Problem: Testing Machine Learning in Production
- Tools exist for testing machine learning during training
- Fewer tools exist for testing machine learning in production
- Different tools are needed because ML testing differs from traditional software testing
SLIDE 6
Traditional Software Has Deterministic Outcomes
SLIDE 7
Traditional Software Has Deterministic Outcomes
SLIDE 8 ML Has Probabilistic Outcomes
Dog vs Muffin given new user input
SLIDE 9 ML Has Probabilistic Outcomes That Change Over Time
Version 1: Muffin (59%) Version 2: Muffin (66%)
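A minimal sketch of how the change between the two versions above could be flagged automatically. The `Prediction` type, the `compare_versions` helper, and the 0.05 tolerance are assumptions made for illustration, not part of the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    label: str
    confidence: float  # fraction in [0, 1]


def compare_versions(old: Prediction, new: Prediction, tolerance: float = 0.05) -> str:
    """Classify how a prediction changed between two model versions."""
    if old.label != new.label:
        return "label_changed"
    if abs(new.confidence - old.confidence) > tolerance:
        return "confidence_shifted"
    return "stable"


# Values from the slide: same label, different confidence.
v1 = Prediction("muffin", 0.59)
v2 = Prediction("muffin", 0.66)
print(compare_versions(v1, v2))
```

The tolerance is a judgment call: a tighter value surfaces more drift, a looser one keeps the report quiet.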
SLIDE 10 ML Platforms Often End at Deploy
- New user input
- Production testing
- ML chaos engineering
SLIDE 11 Requirements for Production ML Testing Tool
- 1. “Entropy”: Generation of new inputs against model servers
- 2. Recording of outputs from model servers
- 3. Feedback loop for additional training
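The three requirements above can be sketched as one small loop. Everything here is hypothetical: `stub_server` stands in for a real model server, the string "inputs" stand in for images, and the 0.6 confidence threshold is an arbitrary assumption.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a model server: input -> (label, confidence).
ModelServer = Callable[[str], Tuple[str, float]]


def stub_server(item: str) -> Tuple[str, float]:
    # Toy rule used only for this illustration.
    return ("muffin", 0.66) if "muffin" in item else ("dog", 0.55)


def generate_inputs(seeds: List[str], n: int, rng: random.Random) -> List[str]:
    """1. "Entropy": derive new inputs by perturbing seed inputs."""
    variations = ["blurred", "cropped", "rotated"]
    return [f"{rng.choice(seeds)}:{rng.choice(variations)}" for _ in range(n)]


def record_outputs(server: ModelServer, inputs: List[str]) -> List[Dict]:
    """2. Record every output returned by the model server."""
    records = []
    for item in inputs:
        label, confidence = server(item)
        expected = item.split(":")[0]  # the seed name doubles as ground truth
        records.append({"input": item, "expected": expected,
                        "label": label, "confidence": confidence})
    return records


def feedback_candidates(records: List[Dict], min_confidence: float = 0.6) -> List[Dict]:
    """3. Feedback loop: keep wrong or low-confidence cases for retraining."""
    return [r for r in records
            if r["label"] != r["expected"] or r["confidence"] < min_confidence]


rng = random.Random(0)  # seeded so runs are repeatable
inputs = generate_inputs(["muffin", "dog"], 10, rng)
records = record_outputs(stub_server, inputs)
candidates = feedback_candidates(records)
print(f"{len(candidates)} of {len(records)} inputs flagged for additional training")
```

With the stub, every "dog" input is flagged because its confidence falls below the threshold; a real server would need real ground-truth labels.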
SLIDE 12 Challenges for Building as a Startup
- 1. Need access to non-toy model servers
- 2. Need access to generated data for testing model servers
SLIDE 13
Access to Non-Toy Model Servers
SLIDE 14
Non-Toy Model Servers: Commercial Cloud Services
SLIDE 15 Commercial Image Recognition Services
- Opaque systems
- Object and scene detection, facial recognition, facial analysis, NSFW detection, text detection
- Facial analysis includes gender detection
SLIDE 16
GenderShades.org
SLIDE 17 Testing Commercial Systems for Gender Bias
- Testing = Finding cases where trained systems fail
- Hypothesis: Gender labels are trained on traditional images
- What if we generate “non-traditional” images?
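One way to check the hypothesis above is to compare error rates across image groups. This is a sketch only: the group names echo the later slides, and the rows in `results` are made-up placeholders, not measurements from any commercial service.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def error_rate_by_group(results: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute the share of wrong predictions per group.

    Each row is (group, expected_label, predicted_label).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, expected, predicted in results:
        totals[group] += 1
        if predicted != expected:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Placeholder rows for illustration only (hypothetical, not real results).
results = [
    ("man_long_hair", "male", "female"),
    ("man_long_hair", "male", "male"),
    ("woman_short_hair", "female", "male"),
    ("woman_short_hair", "female", "female"),
    ("woman_long_hair", "female", "female"),
    ("woman_long_hair", "female", "female"),
]
rates = error_rate_by_group(results)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} misclassified")
```

A large gap between the "traditional" and "non-traditional" groups would support the hypothesis; similar rates would not.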
SLIDE 18 Training Data vs Test Data
SLIDE 19
A Man with Long Hair
SLIDE 20
A Man with Long Hair
SLIDE 21
A Man with Long Hair
SLIDE 22
A Woman with Short Hair
SLIDE 23
A Woman with Short Hair
SLIDE 24
A Woman with Short Hair
SLIDE 25
A Woman with Short Hair
SLIDE 26
A Woman with Short Hair
SLIDE 27
Woman with Long Hair
SLIDE 28
Woman with Long Hair
SLIDE 29
“Facial Analysis”?
SLIDE 30
Data Generation
SLIDE 31
Data Generation
SLIDE 32
Prototype Data
SLIDE 33
Global Standard
SLIDE 34
Data Generation
SLIDE 35
Data Generation
SLIDE 36
Data Generation
SLIDE 37
Woman with Short Hair
SLIDE 38
Woman with Short Hair
SLIDE 39
Man with Long Hair
SLIDE 40
Man with Long Hair
SLIDE 41
Man with Long Hair
SLIDE 42
Man with Long Hair
SLIDE 43
Man with Makeup
SLIDE 44
Man with Makeup
SLIDE 45
Man with Makeup
SLIDE 46
Man with Makeup
SLIDE 47
Man with Makeup
SLIDE 48
Man with Makeup
SLIDE 49
Automating Data Generation + Testing
SLIDE 50
Automating Data Generation + Testing
SLIDE 51
Tracking Results Over Time
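A minimal sketch of tracking results over time, assuming each automated run stores one row per test image. The `Result` type, the `find_changes` helper, the image ID, and the dates are all hypothetical.

```python
from datetime import date
from typing import Dict, List, NamedTuple


class Result(NamedTuple):
    run_date: date
    image_id: str
    label: str
    confidence: float


def find_changes(history: List[Result]) -> Dict[str, List[Result]]:
    """Return the images whose label differs between consecutive runs."""
    by_image: Dict[str, List[Result]] = {}
    for result in sorted(history, key=lambda r: r.run_date):
        by_image.setdefault(result.image_id, []).append(result)
    return {
        image_id: runs
        for image_id, runs in by_image.items()
        if any(a.label != b.label for a, b in zip(runs, runs[1:]))
    }


# Hypothetical history: one image flips label, the other stays the same.
history = [
    Result(date(2020, 1, 1), "img-001", "male", 0.71),
    Result(date(2020, 2, 1), "img-001", "female", 0.64),
    Result(date(2020, 1, 1), "img-002", "female", 0.93),
    Result(date(2020, 2, 1), "img-002", "female", 0.95),
]
changed = find_changes(history)
print(sorted(changed))
```

Running the same test set on a schedule and diffing the runs this way surfaces silent changes when a service deploys a new model version.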
SLIDE 52 Takeaways
- Even the best-trained commercial ML systems are far from perfect
- Systems return different results over time as new versions get deployed
- Testing is cumbersome and intractable without tools and automation
SLIDE 53 Scaling Data Products as a Startup
- Bootstrap servers with commercial APIs
- Bootstrap data with open web, public & synthetic datasets
- Automation is startups’ best friend
SLIDE 54
Questions / Comments
edwin@tinydata.co Twitter: @edwin