Machine Learning Pipeline for Real-time Forecasting @Uber Marketplace

Mar 04, 2024

SLIDE 1

Machine Learning Pipeline for Real-time Forecasting @Uber Marketplace

Chong Sun, Danny Yuan

SLIDE 2

Forecasting On A Global Scale

SLIDE 3

SLIDE 4

01.01.17

Cases For Real-Time Forecasting

SLIDE 5

Dynamic Pricing: Every Minute, Everywhere

SLIDE 6

Dynamic Pricing: Every Minute, Everywhere, Every Trip

SLIDE 7

We Forecast Time Series

SLIDE 8

We Forecast Time Series For Given Geo Locations

SLIDE 9

SLIDE 10

A Few Constraints

More recent data has more signals

SLIDE 11

A Few Constraints

Smaller areas have more noise

SLIDE 12

A Few Constraints

Smaller areas have more noise

SLIDE 13

A Few Constraints

More recent data has more signals
Smaller areas have more noise
We were rolling out business city by city with competing models
○ FFT
○ Kalman Filter
○ Regressions
○ LSTM

SLIDE 14

First Pipeline

SLIDE 15

The Training Pipeline

SLIDE 16

The Training Pipeline

SLIDE 17

The Training Pipeline

SLIDE 18

The Training Pipeline

Airflow
PySpark
SciPy

SLIDE 19

The Training Pipeline

Cassandra

SLIDE 20

A Need for Fast Time Series DB

Cassandra
Elasticsearch

SLIDE 21

A Need For Streaming Data

Kafka

SLIDE 22

A Need For Unified Feature Engine

SLIDE 23

A Digression To Feature Engine

SLIDE 24

A Digression To Feature Engine

DataFlow API

SLIDE 25

A Digression To Feature Engine

Flink

SLIDE 26

A Digression To Feature Engine

Reusable functions
Schema driven
Discoverable by metadata
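
The three properties above can be pictured with a small registry. This is only an illustrative sketch, not the Feature Engine's API: `FeatureSpec`, `register_feature`, `find_features`, `compute`, and the `demand_supply_ratio` feature with its `requests` and `drivers` fields are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical description of one reusable feature function."""
    name: str
    inputs: tuple        # input field names the function reads (its schema)
    output_type: str     # declared type of the produced value
    tags: frozenset      # metadata used for discovery
    fn: Callable[[dict], float]


REGISTRY: Dict[str, FeatureSpec] = {}


def register_feature(name, inputs, output_type, tags):
    """Decorator that records a function and its schema in the registry."""
    def wrap(fn):
        if name in REGISTRY:
            raise ValueError(f"feature {name!r} already registered")
        REGISTRY[name] = FeatureSpec(
            name, tuple(inputs), output_type, frozenset(tags), fn)
        return fn
    return wrap


def find_features(tag):
    """Discover feature names by a metadata tag."""
    return sorted(s.name for s in REGISTRY.values() if tag in s.tags)


def compute(name, record):
    """Check the record against the declared inputs, then apply the function."""
    spec = REGISTRY[name]
    missing = [f for f in spec.inputs if f not in record]
    if missing:
        raise KeyError(f"missing fields {missing} for feature {name!r}")
    return spec.fn(record)


# Hypothetical feature; the field names are assumptions for illustration.
@register_feature("demand_supply_ratio", inputs=["requests", "drivers"],
                  output_type="float", tags={"marketplace", "ratio"})
def demand_supply_ratio(record):
    return record["requests"] / max(record["drivers"], 1)
```

Because the schema lives next to the function, the same definition could in principle be applied to a batch of records or to one streaming event at a time.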

SLIDE 27

Inferencing Pipeline

Elasticsearch

SLIDE 28

Inferencing Pipeline

SLIDE 29

Real-time Visualization

SLIDE 30

Real-time Validation

SLIDE 31

A New Challenge: Model Management

SLIDE 32

SLIDE 33

More Signals

SLIDE 34

Scalable Model Evaluation

SLIDE 35

Metrics-as-a-Service

SLIDE 36

Model Lifecycle Management System (MLMS)

SLIDE 37

What if you're supporting 5+ teams and 10+ products, with 4000+ model instances in production?

SLIDE 38

SLIDE 39

SLIDE 40

SLIDE 41

SLIDE 42

Machine Learning Model Lifecycle

SLIDE 43

Machine Learning Model Lifecycle

SLIDE 44

Machine Learning Model Lifecycle

SLIDE 45

Machine Learning Model Lifecycle

SLIDE 46

Machine Learning Model Lifecycle

SLIDE 47

Machine Learning Model Lifecycle

SLIDE 48

Common Questions in the process ...

Where am I going to save and serve my models?
How do I keep track of the model metadata, e.g., training data used?
How can I easily find a previous model for testing and performance comparison?
How can I automatically deploy a large number of models?
When should I trigger model re-training?
How can I make sure I do not overwrite any (production) models?
How do we manage multiple dependent models?
...

SLIDE 49

Common Questions in the process ...

Where am I going to save and serve my models?
How do I keep track of the model metadata, e.g., training data used?
How can I easily find a previous model for testing and performance comparison?
How can I automatically deploy a large number of models?
When should I trigger model re-training?
How can I make sure I do not overwrite any (production) models?
How do we manage multiple dependent models?
...

Model Lifecycle Management System (MLMS)

SLIDE 50

MLMS Design Principles

Immutable Models
Model Neutral
Flexible
Automated Dynamic Orchestration
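
As a rough illustration of the "Immutable Models" and "Model Neutral" principles, the sketch below stores every saved model as a new version and treats the model itself as opaque bytes. `ImmutableModelStore` and its methods are hypothetical names, and the in-memory dictionaries stand in for whatever storage MLMS actually uses, which the slides do not specify.

```python
import hashlib
import json


class ImmutableModelStore:
    """Hypothetical in-memory store: a saved version is never modified."""

    def __init__(self):
        self._versions = {}   # (model_name, version) -> stored record
        self._latest = {}     # model_name -> latest version number

    def save(self, model_name, payload: bytes, metadata: dict) -> int:
        """Store payload as a NEW version and return its version number."""
        version = self._latest.get(model_name, 0) + 1
        self._versions[(model_name, version)] = {
            "payload": payload,   # opaque bytes: any model format fits
            "sha256": hashlib.sha256(payload).hexdigest(),
            # Serialized so later changes to the caller's dict cannot leak in.
            "metadata": json.dumps(metadata, sort_keys=True),
        }
        self._latest[model_name] = version
        return version

    def load(self, model_name, version):
        """Return (payload, metadata) for an exact version."""
        record = self._versions[(model_name, version)]
        return record["payload"], json.loads(record["metadata"])

    def latest(self, model_name):
        return self._latest[model_name]
```

With no update or delete method, retraining can only add a version, so an earlier model stays available for comparison or rollback.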

SLIDE 51

MLMS Architecture

SLIDE 52

MLMS Architecture

SLIDE 53

MLMS Architecture

SLIDE 54

MLMS Architecture

SLIDE 55

MLMS Architecture

SLIDE 56

MLMS Architecture

SLIDE 57

MLMS Architecture

SLIDE 58

Machine Learning Model Lifecycle MLMS

SLIDE 59

Data Science and Engineering Work Flow

SLIDE 60

Data Scientists And Engineers Work In Lockstep

SLIDE 61

Engineers Are Blocked Before Modeling Is Done

SLIDE 62

Time For Productization Is Often Squeezed

SLIDE 63

Rolling Out To All Cities Is Slow And Painful

SLIDE 64

Analysis of Bottlenecks

Model Exploration (DS, Python)
Model Training and Serving Implementation (DS/Eng, Python/Go/Java)
Model Serving Production (Eng, Go/Java)

SLIDE 65

Analysis of Bottlenecks

Model Exploration (DS, Python)
Model Training and Serving Implementation (DS/Eng, Python/Go/Java)
Model Serving Production (Eng, Go/Java)
Restricted Models

SLIDE 66

Analysis of Bottlenecks

Model Exploration (DS, Python)
Model Training and Serving Implementation (DS/Eng, Python/Go/Java)
Model Serving Production (Eng, Go/Java)
DS → Eng Knowledge Transfer
Reimplementing Model

SLIDE 67

Analysis of Bottlenecks

Model Exploration (DS, Python)
Model Training and Serving Implementation (DS/Eng, Python/Go/Java)
Model Serving Production (Eng, Go/Java)
DS/Eng Model Parity

SLIDE 68

Analysis of Bottlenecks

Model Exploration (DS, Python)
Model Training and Serving Implementation (DS/Eng, Python/Go/Java)
Model Serving Production (Eng, Go/Java)
DS/Eng Performance Debug

SLIDE 69

Key Insight: Can We All Enjoy One ML Ecosystem?

SLIDE 70

Unified Framework → Many Benefits

Standardized project structure
Out-of-box support of local and remote deployment
Reusable algorithms and framework
Design review between engineer and DS
Code review between engineer and DS
Who codes, who debugs

SLIDE 71

SLIDE 72

SLIDE 73

SLIDE 74

SLIDE 75

TensorFlow

Model Exploration (DS, Python)
Model Training and Serving Implementation (DS/Eng, Python/Java)
Model Serving Production (Eng, Java)
Restricted Models
DS → Eng Knowledge Transfer
DS/Eng Model Parity
Eng Model Performance Debug
Dev (Python)
Train (Python)
Serve (Python/Java)
TensorFlow Graph (C++)
Client Runtime
Reimplementing Model

SLIDE 76

Enable DS to Write Production-Ready Code

TensorFlow
○ Efficient core
○ DS-friendly API

Engineers focusing on optimization and automation
○ Parallelization of algorithms
○ End-to-end automation
○ Visualization
○ Integration
○ Project scaffolding

SLIDE 77

Example

Build your own FTRL
Use a framework
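
To give a sense of what "build your own" involves, here is a minimal sketch of the per-coordinate FTRL-Proximal update for logistic regression, written from the published algorithm rather than from the speakers' code, which the slides do not show. The class name, the dictionary-based sparse input, and the default hyperparameters are assumptions made for the example.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse inputs."""

    def __init__(self, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}   # accumulated adjusted gradients per feature
        self.n = {}   # accumulated squared gradients per feature

    def _weight(self, i):
        """Closed-form weight for feature i derived from z and n."""
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0   # the L1 term keeps this weight exactly zero
        sign = 1.0 if z > 0 else -1.0
        n = self.n.get(i, 0.0)
        denom = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, x):
        """x maps feature name -> value; returns the estimated P(y = 1)."""
        score = sum(self._weight(i) * v for i, v in x.items())
        score = max(min(score, 35.0), -35.0)   # guard against exp overflow
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, x, y):
        """One online step on example (x, y) with y in {0, 1}."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v   # gradient of the log loss for feature i
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g
        return p
```

Even this short version leaves out batching, distribution, checkpointing, and serving. TensorFlow ships an FTRL optimizer, so the "use a framework" route gets the same update rule without maintaining this code.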

SLIDE 78

Building Tools

Model Lifecycle Management System
Hyperparameter Tuning
Horovod for Distributed TensorFlow Training

SLIDE 79

Conclusion

A fully automated MLMS is key to the success of complex ML systems

A single framework for DS and engineers boosts productivity
Building great tools is crucial to ML projects

SLIDE 80

Q & A

SLIDE 81

SLIDE 82

How do we make the forecasts?

SLIDE 83

Batch forecasting (2015)

Batch Forecast
Data Sources
Forecasts
(ARIMA, FFT)

SLIDE 84

Batch forecasting + Real-time Adjustment

Batch Forecast
Data Sources
Forecasts
(ARIMA, FFT)
Realtime Adjust & Serve
Consumer
(Exponential Smoothing)
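
The slide names exponential smoothing as the real-time adjustment but does not say what is smoothed. One plausible reading, sketched below purely as an assumption, is to smooth the recent error of the batch forecast and add it back to the next batch value. `RealtimeAdjuster`, its methods, and the default `alpha` are hypothetical.

```python
class RealtimeAdjuster:
    """Hypothetical adjuster: smooths recent batch-forecast errors."""

    def __init__(self, alpha=0.3):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha          # weight given to the newest error
        self.smoothed_error = 0.0   # running exponentially smoothed error

    def observe(self, batch_forecast, actual):
        """Fold the latest observed error into the running estimate."""
        error = actual - batch_forecast
        self.smoothed_error = (self.alpha * error
                               + (1.0 - self.alpha) * self.smoothed_error)

    def adjust(self, batch_forecast):
        """Correct a batch forecast with the smoothed error, floored at 0."""
        return max(batch_forecast + self.smoothed_error, 0.0)
```

A larger `alpha` reacts faster to the newest observations, in line with the earlier constraint that more recent data carries more signal, at the cost of passing more noise through.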

SLIDE 85

Issues Observed

Not many ML libraries for Node.js
Real-time component (Node.js) cannot support CPU-intensive computation
Cannot handle large-scale data features in real time
Cannot share code between batch and online processing

SLIDE 86

Second Generation of Forecasting Engine

(Inspired by DataFlow and TensorFlow)
Some interesting design principles:
Both real-time and batch prediction: prediction is minute-level, backtesting/evaluation requires batch processing

SLIDE 87

Machine Learning Model Lifecycle

SLIDE 88

MLMS Architecture

Given model_name=linear_demand_model and city_id=1
When status == 'alerting' and time_sustained > 3 days
Then retrainModel(model_name, city_id, model_version)
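
The Given/When/Then rule above could be evaluated along the following lines. This is a sketch under assumptions: `ModelHealth`, `should_retrain`, and `apply_rule` are hypothetical names, and `retrain_model` is a caller-supplied stand-in for the slide's `retrainModel`, whose real signature and behaviour are not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ModelHealth:
    """Hypothetical snapshot of one deployed model instance."""
    model_name: str
    city_id: int
    model_version: int
    status: str              # e.g. "ok" or "alerting"
    status_since: datetime   # when the current status began


def should_retrain(health, now, model_name="linear_demand_model",
                   city_id=1, sustained=timedelta(days=3)):
    """Given/When part: match the instance, then test the alert duration."""
    matches = (health.model_name == model_name
               and health.city_id == city_id)
    alerting = (health.status == "alerting"
                and now - health.status_since > sustained)
    return matches and alerting


def apply_rule(health, now, retrain_model):
    """Then part: call the supplied retrain function if the rule fires."""
    if should_retrain(health, now):
        retrain_model(health.model_name, health.city_id,
                      health.model_version)
        return True
    return False
```

Keeping the condition separate from the action makes the rule testable without triggering a real training job.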

SLIDE 89