Bighead
Airbnb’s End-to-End Machine Learning Infrastructure
Andrew Hoh ML Infra @ Airbnb
Background
Design Goals
Architecture Deep Dive
Open Source
→ ML models take on average 8 to 12 weeks to build
○ Online and Offline
○ Data size
○ SLA
○ GPU training
○ Scheduled and Ad hoc
○ Prototyping and Production
○ Online and Offline
Architecture overview (diagram)
Stages: Prototyping, Lifecycle Management, Production
Production modes: Real Time Inference, Batch Training + Inference
Components: Redspot, Bighead Service / UI, Deep Thought, ML Automator, Airflow
Execution Management: Bighead Library
Environment Management: Docker Image Service
Feature Data Management: Zipline
What are those?
“Creators need an immediate connection to what they are creating.”
Redspot: a Supercharged Jupyter Notebook Service
Versatile
AWS EC2 instance types, e.g. P3, X1
Dependencies: Docker images, e.g. Py2.7, Py3.6+TensorFlow
Consistent
the exact environment that your model will use in production
Seamless
Bighead Service & Docker Image Service via APIs & UI widgets
faces our users
important as tracking code changes
reproducible to be sustainable
launch models into production is critical
Seamless
visualizations that carry
experience
Consistent
management service
about the state of a model, its dependencies, and what’s deployed
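One way to picture a central record of a model's state, its dependencies, and what is deployed is a small in-memory registry. The sketch below is only an illustration under stated assumptions: `ModelVersion`, `ModelRegistry`, and every field name are hypothetical and are not taken from Bighead Service.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelVersion:
    """One immutable record of a trained model (all fields are hypothetical)."""
    name: str
    artifact_sha256: str          # hash of the serialized model bytes
    docker_image: str             # environment the model was trained in
    dependencies: Dict[str, str]  # package name -> pinned version

    @property
    def version_id(self) -> str:
        # Derive the ID from the content so identical inputs give identical IDs.
        payload = json.dumps(
            [self.name, self.artifact_sha256, self.docker_image, self.dependencies],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class ModelRegistry:
    """In-memory stand-in for a central model management service."""

    def __init__(self) -> None:
        self._versions: Dict[str, ModelVersion] = {}
        self._history: Dict[str, List[str]] = {}   # model name -> version IDs
        self._deployed: Dict[str, str] = {}        # model name -> deployed ID

    def register(self, version: ModelVersion) -> str:
        vid = version.version_id
        if vid not in self._versions:
            self._versions[vid] = version
            self._history.setdefault(version.name, []).append(vid)
        return vid

    def deploy(self, name: str, version_id: str) -> None:
        if version_id not in self._history.get(name, []):
            raise KeyError(f"{version_id} is not a registered version of {name}")
        self._deployed[name] = version_id

    def deployed(self, name: str) -> Optional[ModelVersion]:
        vid = self._deployed.get(name)
        return self._versions[vid] if vid else None
```

Because the version ID is derived from the artifact hash, image, and pinned dependencies, registering the same model twice is a no-op, and any change to a dependency shows up as a new version.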
Frameworks
Training data
Unstructured (image, text)
Environment
Versatile
preprocessing / inference / training / evaluation / visualization
Consistent
training, offline inference, online inference
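The idea of one pipeline shared by training, offline inference, and online inference can be sketched as a single serializable object that bundles preprocessing with the model. This is a toy illustration, not the Bighead Library API: `LinearPipeline`, its methods, and the JSON format are all assumptions made for the example.

```python
import json
from typing import Dict, List, Optional, Sequence


class LinearPipeline:
    """Preprocessing + model in one serializable object (hypothetical)."""

    def __init__(self, params: Optional[Dict[str, float]] = None) -> None:
        self.params: Dict[str, float] = dict(params or {})

    def _preprocess(self, xs: Sequence[float]) -> List[float]:
        # Standardize with statistics learned at training time.
        return [(x - self.params["mean"]) / self.params["scale"] for x in xs]

    def fit(self, xs: Sequence[float], ys: Sequence[float]) -> "LinearPipeline":
        n = len(xs)
        mean = sum(xs) / n
        scale = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5 or 1.0
        self.params.update(mean=mean, scale=scale)
        zs = self._preprocess(xs)
        y_mean = sum(ys) / n
        szz = sum(z * z for z in zs)  # zs are centered, so this is the spread
        slope = sum(z * (y - y_mean) for z, y in zip(zs, ys)) / szz if szz else 0.0
        self.params.update(slope=slope, intercept=y_mean)
        return self

    def predict(self, xs: Sequence[float]) -> List[float]:
        zs = self._preprocess(xs)
        return [self.params["intercept"] + self.params["slope"] * z for z in zs]

    def dumps(self) -> str:
        return json.dumps(self.params, sort_keys=True)

    @classmethod
    def loads(cls, blob: str) -> "LinearPipeline":
        return cls(json.loads(blob))


trained = LinearPipeline().fit([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
blob = trained.dumps()                        # what a training job would store
restored = LinearPipeline.loads(blob)         # what a serving process would load
batch_scores = restored.predict([5.0, 6.0])   # offline inference over a batch
online_score = restored.predict([5.0])[0]     # online inference for one request
```

Since the preprocessing statistics travel with the model parameters, the serving side cannot apply a different transformation from the one used in training.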
Easy to do
launch models without an engineering team
rebuild models
Scalable
requirements vary across models
across time
Consistent with training
dependencies
Seamless
logging, dashboard
Consistent
Library: Same data source, pipeline, environment from training
Scalable
pods can easily scale
across models
Automated training, inference, and evaluation are necessary
Seamless
Airflow: Generate DAGs for training, inference,
resources
for training and scoring data
Consistent
Library: Same data source, pipeline, environment across the stack
Scalable
computing for large datasets
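Generating a DAG for training, inference, and evaluation amounts to emitting tasks and their upstream dependencies from a model description. The sketch below uses only the Python standard library (`graphlib`, Python 3.9+) rather than Airflow itself, and the task names and the `build_dag` helper are hypothetical, chosen only to show the shape of the idea.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def build_dag(model: str, evaluate: bool = True) -> Dict[str, Set[str]]:
    """Return {task: upstream tasks} for one model's scheduled run."""
    dag: Dict[str, Set[str]] = {
        f"{model}.fetch_training_data": set(),
        f"{model}.train": {f"{model}.fetch_training_data"},
        f"{model}.fetch_scoring_data": set(),
        f"{model}.batch_inference": {
            f"{model}.train",
            f"{model}.fetch_scoring_data",
        },
    }
    if evaluate:
        dag[f"{model}.evaluate"] = {f"{model}.batch_inference"}
    return dag


def execution_order(dag: Dict[str, Set[str]]) -> List[str]:
    # TopologicalSorter expects a mapping of node -> predecessors.
    return list(TopologicalSorter(dag).static_order())


dag = build_dag("pricing_model")
for task in execution_order(dag):
    print(task)
```

In a real scheduler each entry would become an operator with its own resource request; here the mapping is kept as plain data so the dependency structure is easy to inspect.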
Seamless
Thought and ML Automator
Consistent
training/scoring
development/production
across features to prevent label leakage
Scalable
Flink to scale Batch and Streaming workloads
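Preventing label leakage comes down to joining each label only with feature values that were already known at the label's timestamp. The following is a minimal sketch of such a point-in-time lookup, not Zipline's implementation: the `FeatureLog` layout, the function names, and the choice to use values strictly before the label time are all assumptions.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple

# entity id -> [(timestamp, value), ...] sorted by timestamp (hypothetical layout)
FeatureLog = Dict[str, List[Tuple[int, float]]]


def value_as_of(log: FeatureLog, entity: str, ts: int) -> Optional[float]:
    """Latest feature value observed strictly before `ts`, or None."""
    history = log.get(entity, [])
    timestamps = [t for t, _ in history]
    i = bisect_left(timestamps, ts)  # number of observations with t < ts
    return history[i - 1][1] if i > 0 else None


def build_training_rows(
    labels: List[Tuple[str, int, int]], log: FeatureLog
) -> List[Tuple[str, Optional[float], int]]:
    """labels: [(entity, label_timestamp, label)] -> [(entity, feature, label)]."""
    return [(e, value_as_of(log, e, ts), y) for e, ts, y in labels]


log: FeatureLog = {"listing_1": [(10, 1.0), (20, 2.0), (30, 3.0)]}
labels = [("listing_1", 25, 1), ("listing_1", 10, 0)]
rows = build_training_rows(labels, log)
```

Here the label at time 25 is paired with the value written at time 20 and never with the one written at time 30, while the label at time 10 gets no feature value because nothing had been observed earlier.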
Zipline Addresses the Consistency Challenge Between Training and Scoring
(Diagram labels: Production Data Stores, Data Warehouse, Zipline, Features, Model Scoring, Model Training)
End-to-End platform to build and deploy ML models to production that is seamless, versatile, consistent, and scalable
Built on open source technology