Bighead: Airbnb's End-to-End Machine Learning Infrastructure - PowerPoint PPT Presentation



SLIDE 1

Bighead

Airbnb’s End-to-End Machine Learning Infrastructure

Andrew Hoh ML Infra @ Airbnb

SLIDE 2

  • Background
  • Design Goals
  • Architecture Deep Dive
  • Open Source

SLIDE 3

Background

SLIDE 4

Airbnb’s Product

A global travel community that offers magical end-to-end trips, including where you stay, what you do and the people you meet.

SLIDE 5

Airbnb is already driven by Machine Learning

  • Search Ranking
  • Smart Pricing
  • Fraud Detection

SLIDE 6

But there are *many* more opportunities for ML

  • Paid Growth - Hosts
  • Classifying / Categorizing Listings
  • Experience Ranking + Personalization
  • Room Type Categorizations
  • Customer Service Ticket Routing
  • Airbnb Plus
  • Listing Photo Quality
  • Object Detection - Amenities
  • ....
SLIDE 7

Intrinsic Complexities with Machine Learning

  • Understanding the business domain
  • Selecting the appropriate Model
  • Selecting the appropriate Features
  • Fine tuning
SLIDE 8

Incidental Complexities with Machine Learning

  • Integrating with Airbnb’s Data Warehouse
  • Scaling model training & serving
  • Keeping consistency between: Prototyping vs Production, Training vs Inference
  • Keeping track of multiple models, versions, experiments
  • Supporting iteration on ML models

→ ML models take on average 8 to 12 weeks to build

→ ML workflows tended to be slow, fragmented, and brittle

SLIDE 9

The ML Infrastructure Team addresses these challenges

Vision: Airbnb routinely ships ML-powered features throughout the product.

Mission: Equip Airbnb with shared technology to build production-ready ML applications with no incidental complexity.

SLIDE 10

Supporting the Full ML Lifecycle

SLIDE 11

Bighead: Design Goals

SLIDE 12

Scalable Seamless Versatile Consistent

SLIDE 13

Seamless

  • Easy to prototype, easy to productionize
  • Same workflow across different frameworks
SLIDE 14

Versatile

  • Supports all major ML frameworks
  • Meets various requirements:
    ○ Online and Offline
    ○ Data size
    ○ SLA
    ○ GPU training
    ○ Scheduled and Ad hoc

SLIDE 15

Consistent

  • Consistent environment across the stack
  • Consistent data transformation
    ○ Prototyping and Production
    ○ Online and Offline

SLIDE 16

Scalable

  • Horizontal
  • Elastic
SLIDE 17

Bighead: Architecture Deep Dive

SLIDE 18

[Architecture diagram: Prototyping (Redspot) → Lifecycle Management (Bighead Service / UI) → Production (Deep Thought for real-time inference; ML Automator + Airflow for batch training + inference), all built on shared layers: Execution Management (Bighead Library), Environment Management (Docker Image Service), and Feature Data Management (Zipline).]

SLIDE 19

[Architecture overview diagram, repeated to introduce Redspot.]

SLIDE 20

Redspot

Prototyping with Jupyter Notebooks

SLIDE 21

Jupyter Notebooks?

What are those?

“Creators need an immediate connection to what they are creating.”

  - Bret Victor
SLIDE 22

The ideal Machine Learning development environment?

  • Interactivity and Feedback
  • Access to Powerful Hardware
  • Access to Data

SLIDE 23

Redspot: a Supercharged Jupyter Notebook Service

  • A fork of the JupyterHub project
  • Integrated with our Data Warehouse
  • Access to specialized hardware (e.g. GPUs)
  • File sharing between users via AWS EFS
  • Packaged in a familiar JupyterHub UI

SLIDE 24

Redspot

SLIDE 25

Redspot: a Supercharged Jupyter Notebook Service

Versatile

  • Customized hardware: AWS EC2 instance types, e.g. P3, X1
  • Customized dependencies: Docker images, e.g. Py2.7, Py3.6 + TensorFlow

Consistent

  • Promotes prototyping in the exact environment that your model will use in production

Seamless

  • Integrated with Bighead Service & Docker Image Service via APIs & UI widgets

SLIDE 26

[Architecture overview diagram, repeated to introduce the Docker Image Service.]

SLIDE 27

Docker Image Service

Environment Customization

SLIDE 28

Docker Image Service - Why

  • ML users have a diverse, heterogeneous set of dependencies
  • Need an easy way to bootstrap their own runtime environments
  • Need to be consistent with the rest of Airbnb’s infrastructure

SLIDE 29

Docker Image Service - Dependency Customization

  • Our configuration management solution
  • A composition layer on top of Docker
  • Includes a customization service that faces our users
  • Promotes Consistency and Versatility
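The "composition layer on top of Docker" can be pictured as a small generator that layers user-selected dependencies onto a blessed base image. The sketch below is illustrative only: the base image names, the function, and the output format are assumptions, not the real (non-public) service API.

```python
# Hypothetical sketch of a Docker image composition layer: users pick a
# blessed base image plus extra pip packages, and the service emits a
# Dockerfile that stays consistent with the standard infrastructure.
BASE_IMAGES = {
    "py36-tensorflow": "ml-base:py36-tf",  # assumed internal image names
    "py27": "ml-base:py27",
}

def compose_dockerfile(base: str, pip_packages: list) -> str:
    """Build a Dockerfile layering user dependencies on a standard base."""
    if base not in BASE_IMAGES:
        raise ValueError(f"unknown base image: {base}")
    lines = [f"FROM {BASE_IMAGES[base]}"]
    if pip_packages:
        lines.append("RUN pip install " + " ".join(sorted(pip_packages)))
    return "\n".join(lines)

print(compose_dockerfile("py36-tensorflow", ["xgboost", "lightgbm"]))
```

The point of the composition layer is that users never write raw Dockerfiles; they declare dependencies, and consistency with the rest of the stack comes from the shared base image.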

SLIDE 30

[Architecture overview diagram, repeated to introduce the Bighead Service.]

SLIDE 31

Bighead Service

Model Lifecycle Management

SLIDE 32

Model Lifecycle Management - Why?

  • Tracking ML model changes is just as important as tracking code changes
  • ML model work needs to be reproducible to be sustainable
  • Comparing experiments before you launch models into production is critical

SLIDE 33

SLIDE 34

SLIDE 35
SLIDE 36

Bighead Service

Seamless

  • Context-aware visualizations that carry over from the prototyping experience

Consistent

  • Central model management service
  • Single source of truth about the state of a model, its dependencies, and what’s deployed
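A "single source of truth" for model state can be made concrete with a minimal registry sketch. Everything here is hypothetical (class names, fields, the idea of pinning dependencies via a Docker image tag); it only illustrates the shape of lifecycle management, not Bighead Service's actual API.

```python
# Hypothetical model registry sketch: one place records every version of a
# model, its metrics and pinned runtime, and which version is deployed.
from dataclasses import dataclass, field

@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    docker_image: str               # pins the runtime dependencies
    metrics: dict = field(default_factory=dict)

class ModelRegistry:
    def __init__(self):
        self._versions = {}         # model name -> list of ModelVersion
        self._deployed = {}         # model name -> deployed version number

    def register(self, name, artifact_uri, docker_image, metrics=None):
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(len(versions) + 1, artifact_uri, docker_image, metrics or {})
        versions.append(mv)
        return mv

    def deploy(self, name, version):
        assert any(v.version == version for v in self._versions[name])
        self._deployed[name] = version

    def deployed(self, name):
        return self._deployed.get(name)

registry = ModelRegistry()
registry.register("listing_quality", "s3://models/lq/1", "ml-base:py36-tf", {"auc": 0.81})
registry.register("listing_quality", "s3://models/lq/2", "ml-base:py36-tf", {"auc": 0.84})
registry.deploy("listing_quality", 2)   # compare experiments, then promote
```

Because deployment is recorded against a specific version with a pinned image, experiments stay comparable and reproducible before anything is promoted to production.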

SLIDE 37

[Architecture overview diagram, repeated to introduce the Bighead Library.]

SLIDE 38

Bighead Library

SLIDE 39

ML Models are highly heterogeneous in:

  • Frameworks
  • Training data: data quality, structured vs. unstructured (image, text)
  • Environment: GPU vs. CPU, dependencies
SLIDE 40

ML Models are hard to keep consistent

  • Data in production is different from data in training
  • Offline pipeline is different from online pipeline
  • Everyone does everything in a different way
SLIDE 41

Bighead Library

Versatile

  • Pipeline on steroids: a compute graph for preprocessing / inference / training / evaluation / visualization
  • Composable, Reusable, Shareable
  • Supports popular frameworks

Consistent

  • Uniform API
  • Serializable: the same pipeline is used in training, offline inference, and online inference
  • Fast primitives for preprocessing
  • Metadata for trained models
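The "uniform API" idea can be sketched in a few lines: every step exposes the same fit/transform interface, and a composed pipeline is itself just another step, so the identical object runs in training, offline inference, and online inference. All class names below are illustrative, not the actual Bighead Library API.

```python
# Minimal sketch of a composable pipeline with a uniform fit/transform API.
class Transformer:
    def fit(self, rows):
        return self
    def transform(self, rows):
        raise NotImplementedError

class Lowercase(Transformer):
    def transform(self, rows):
        return [r.lower() for r in rows]

class Tokenize(Transformer):
    def transform(self, rows):
        return [r.split() for r in rows]

class Pipeline(Transformer):
    """A pipeline is itself a Transformer, so pipelines compose and nest."""
    def __init__(self, steps):
        self.steps = steps
    def fit(self, rows):
        for step in self.steps:
            step.fit(rows)
            rows = step.transform(rows)  # feed each step's output forward
        return self
    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows

pipe = Pipeline([Lowercase(), Tokenize()]).fit(["Cozy Loft in SF"])
print(pipe.transform(["Entire Home"]))   # same code path online and offline
```

The design choice worth noting: because preprocessing lives inside the pipeline object rather than in ad-hoc scripts, there is no separate "online" reimplementation to drift out of sync with training.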
SLIDE 42

Bighead Library: ML Pipeline

SLIDE 43

Visualization - Pipeline

SLIDE 44

Easy to Serialize/Deserialize
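Since the composed pipeline is a single Python object, one plausible way to get identical transform logic in training and serving is to serialize the object itself, e.g. with `pickle`. This is a sketch of the concept; the deck does not specify Bighead's actual serialization format, and the `ScaleBy` step is hypothetical.

```python
import pickle

# A trivially picklable preprocessing step; a real pipeline object holding
# many such steps could be serialized and restored the same way.
class ScaleBy:
    def __init__(self, factor):
        self.factor = factor
    def transform(self, values):
        return [v * self.factor for v in values]

step = ScaleBy(0.5)
blob = pickle.dumps(step)          # persist alongside the trained model
restored = pickle.loads(blob)      # load in the online service

# Training-time and serving-time transforms are the same logic by construction.
assert restored.transform([2.0, 4.0]) == step.transform([2.0, 4.0])
```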

SLIDE 45

Visualization - Training Data

SLIDE 46

Visualization - Transformer

SLIDE 47

[Architecture overview diagram, repeated to introduce Deep Thought.]

SLIDE 48

Deep Thought

Online Inference

SLIDE 49

Hard to make online model serving...

Easy to do

  • Data scientists can’t launch models without an engineering team
  • Engineers often need to rebuild models

Scalable

  • Resource requirements vary across models
  • Throughput fluctuates across time

Consistent with training

  • Different data
  • Different pipeline
  • Different dependencies

SLIDE 50

Deep Thought

Seamless

  • Integration with event logging, dashboards
  • Integration with Zipline

Consistent

  • Docker + Bighead Library: same data source, pipeline, and environment as in training

Scalable

  • Kubernetes: model pods can easily scale
  • Resource segregation across models
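The serving path can be pictured as: each model runs in its own pod with its own image, and a thin handler deserializes the pipeline once at startup, then scores every request through the exact pipeline used in training. The sketch below is framework-free and entirely hypothetical in its names (`ModelServer`, `ThresholdPipeline`, the request shape).

```python
# Hypothetical sketch of an online scoring handler.
class ModelServer:
    def __init__(self, pipeline):
        self.pipeline = pipeline        # deserialized training-time pipeline

    def score(self, request: dict) -> dict:
        features = [request["features"]]
        prediction = self.pipeline.transform(features)[0]
        return {"model": request["model"], "score": prediction}

# Stand-in pipeline: in production this would be loaded from model storage.
class ThresholdPipeline:
    def transform(self, rows):
        return [1.0 if sum(r) > 1.0 else 0.0 for r in rows]

server = ModelServer(ThresholdPipeline())
print(server.score({"model": "fraud_v2", "features": [0.7, 0.6]}))
```

Because the handler only calls the pipeline's uniform transform interface, the same server code can host any model, and Kubernetes can scale pods per model independently.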

SLIDE 51
SLIDE 52

[Architecture overview diagram, repeated to introduce ML Automator.]

SLIDE 53

ML Automator

Offline Training and Batch Inference

SLIDE 54

ML Automator - Why

Automated training, inference, and evaluation are necessary:

  • Scheduling
  • Resource allocation
  • Saving results
  • Dashboards and alerts
  • Orchestration

SLIDE 55

ML Automator

Seamless

  • Automates tasks via Airflow: generates DAGs for training, inference, etc. with appropriate resources
  • Integration with Zipline for training and scoring data

Consistent

  • Docker + Bighead Library: same data source, pipeline, and environment across the stack

Scalable

  • Spark: distributed computing for large datasets
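"Generates DAGs" can be pictured as: a model's declarative config is expanded into an ordered task graph of the kind handed to Airflow. The sketch below deliberately uses no real Airflow API; the task names, executors, and dict format are invented for illustration.

```python
# Hypothetical sketch of DAG generation: expand a model's declarative config
# into a linear task graph (fetch features -> train -> evaluate -> publish).
def generate_training_dag(model_name, schedule, executor="spark"):
    tasks = [
        {"id": f"{model_name}.fetch_features", "run_on": "zipline"},
        {"id": f"{model_name}.train", "run_on": executor},
        {"id": f"{model_name}.evaluate", "run_on": executor},
        {"id": f"{model_name}.publish_metrics", "run_on": "bighead_service"},
    ]
    # Linear dependencies: each task waits on the previous one.
    edges = [(a["id"], b["id"]) for a, b in zip(tasks, tasks[1:])]
    return {"dag_id": f"ml_automator.{model_name}", "schedule": schedule,
            "tasks": tasks, "edges": edges}

dag = generate_training_dag("smart_pricing", "@daily")
```

The benefit of generating DAGs rather than hand-writing them is that every model gets the same scheduling, alerting, and resource-allocation scaffolding for free.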

SLIDE 56

ML Automator

SLIDE 57

[Architecture overview diagram, repeated to introduce Zipline.]

SLIDE 58

Zipline

ML Data Management Framework

SLIDE 59

Feature management is hard

  • Inconsistent offline and online datasets
  • Tricky to correctly generate training sets that depend on time
  • Slow backfills of training sets
  • Inadequate data quality checks or monitoring
  • Unclear feature ownership and sharing
SLIDE 60

Zipline

Seamless

  • Integration with Deep Thought and ML Automator

Consistent

  • Consistent data across training/scoring
  • Consistent data across development/production
  • Point-in-time correctness across features to prevent label leakage

Scalable

  • Leverages Spark and Flink to scale batch and streaming workloads

SLIDE 61

Zipline Addresses the Consistency Challenge Between Training and Scoring

[Diagram: production data stores supply features to model scoring; the data warehouse supplies features to model training; Zipline sits between them so both paths see the same features.]
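Point-in-time correctness means: for each training label at time t, join only feature values observed at or before t, so the training row sees exactly what the online scorer would have seen. The join rule can be sketched directly (the function and data shapes are illustrative, not Zipline's API):

```python
# Sketch of a point-in-time-correct feature join: for each label event, use
# the latest feature value at or before the label's timestamp, never a later
# one (which would leak the future into training).
def point_in_time_join(labels, feature_events):
    """labels: [(key, label_ts, label)]; feature_events: [(key, ts, value)]."""
    rows = []
    for key, label_ts, label in labels:
        visible = [(ts, v) for k, ts, v in feature_events
                   if k == key and ts <= label_ts]
        feature = max(visible)[1] if visible else None  # latest visible value
        rows.append({"key": key, "feature": feature, "label": label})
    return rows

events = [("listing_1", 10, 0.2), ("listing_1", 20, 0.9)]
labels = [("listing_1", 15, 1), ("listing_1", 25, 0)]
print(point_in_time_join(labels, events))
```

Note the first label (at t=15) sees 0.2, not the later 0.9: naively joining on the latest feature value would silently leak post-label information into the training set.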

SLIDE 62

Big Summary

End-to-End platform to build and deploy ML models to production that is seamless, versatile, consistent, and scalable

  • Model lifecycle management
  • Feature generation & management
  • Online & offline inference
  • Pipeline library supporting major frameworks
  • Docker image customization service
  • Multi-tenant training environment

Built on open source technology

  • TensorFlow, PyTorch, Keras, MXNet, Scikit-learn, XGBoost
  • Spark, Jupyter, Kubernetes, Docker, Airflow
SLIDE 63

To be Open Sourced

We are selecting our first couple of private collaborators. If you are interested, please email me at andrew.hoh@airbnb.com

SLIDE 64

Questions?

SLIDE 65

Appendix

SLIDE 66