Prediction Serving: what happens after learning? Joseph E. Gonzalez



SLIDE 1

Joseph E. Gonzalez

  • Asst. Professor, UC Berkeley

jegonzal@cs.berkeley.edu

  • Co-founder, Dato Inc.

joseph@dato.com

what happens after learning?

Prediction Serving

SLIDE 2

Outline

Daniel Crankshaw, Xin Wang, Michael Franklin, & Ion Stoica

VELOX Clipper

Create

VW

Caffe

SLIDE 3

Big Data

Big Model

Training

Learning

Timescale: minutes to days
Systems: offline and batch optimized
Heavily studied ... major focus of the AMPLab

SLIDE 4

Big Data

Big Model

Training

Application

Decision Query

?

Learning Inference

SLIDE 5

Big Data Training

Learning Inference

Big Model Application

Decision Query

Timescale: ~10 milliseconds
Systems: online and latency optimized
Less studied …

SLIDE 6

Big Data

Big Model

Training

Application

Decision Query

Learning Inference

Feedback

SLIDE 7

Big Data Training

Application

Decision

Learning Inference

Feedback

Timescale: hours to weeks
Systems: combination of systems
Less studied …

SLIDE 8

Big Data

Big Model

Training

Application

Decision Query

Learning Inference

Feedback

Responsive (~10 ms)
Adaptive (~1 second)

SLIDE 9

Responsive (~10 ms)
Adaptive (~1 second)

VELOX Model Serving System [CIDR’15]

Daniel Crankshaw, Peter Bailis, Haoyuan Li, Zhao Zhang, Joseph Gonzalez, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan

Key Insight:

Decompose models into fast and slow changing components

SLIDE 10

Big Data Training

Application

Decision Query

Learning Inference

Feedback

SLIDE 11

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing Model

SLIDE 12

Hybrid Offline + Online Learning

Update the user weights online:

  • Simple to train + more robust model
  • Address rapidly changing user statistics

Update feature functions offline using batch solvers

  • Leverage high-throughput systems (TensorFlow)
  • Exploit slow change in population statistics

f(x; θ)^T w_u
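The split model f(x; θ)^T w_u can be sketched as follows: the feature function f(x; θ) is trained offline and held fixed, while the per-user weights w_u are updated online from feedback. The feature function (a random projection with a nonlinearity) and the squared-loss SGD update are illustrative stand-ins, not Velox's actual code.

```python
import numpy as np

def features(x, theta):
    # Offline-trained feature function f(x; theta); a fixed random
    # projection plus tanh stands in for a real learned model.
    return np.tanh(theta @ x)

def online_update(w_u, x, y, theta, lr=0.1):
    """One squared-loss SGD step on the fast-changing user weights only."""
    phi = features(x, theta)
    pred = phi @ w_u
    return w_u - lr * (pred - y) * phi

rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 4))   # slow-changing component (fixed online)
w_u = np.zeros(8)                 # fast-changing per-user component

for _ in range(100):
    x = rng.normal(size=4)
    y = features(x, theta) @ np.ones(8)  # synthetic user preference signal
    w_u = online_update(w_u, x, y, theta)
```

Because only the low-dimensional w_u changes online, each update is a few floating-point operations, which is why the slides report sub-millisecond online updates versus seconds for full retraining.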

SLIDE 13

Common modeling structure

f(x; θ)^T w_u

Items Users Matrix Factorization

Input

Deep Learning Ensemble Methods

SLIDE 14

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing Model

SLIDE 15

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing Model per user

SLIDE 16

Velox Online Learning for Recommendations

(20 Newsgroups)

[Figure: test error vs. number of training examples]

Online Updates: 0.4 ms Retraining: 7.1 seconds

>4 orders-of-magnitude faster adaptation given sufficient offline training data

SLIDE 17

Velox Online Learning for Recommendations

(20 Newsgroups)

[Figure: test error vs. number of training examples]

Partial Updates: 0.4 ms Retraining: 7.1 seconds

>4 orders-of-magnitude faster adaptation

SLIDE 18

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing Model per user

SLIDE 19

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing Model per user

Velox

SLIDE 20

B D A S

Tachyon Mesos Spark HDFS, S3, … Spark Streaming Spark SQL BlinkDB GraphX Graph Frames MLLib Keystone ML

Learning

Berkeley Data Analytics Stack

VELOX: the Missing Piece of BDAS

SLIDE 21

B D A S: Berkeley Data Analytics Stack

Tachyon Mesos Spark HDFS, S3, … Spark Streaming Spark SQL BlinkDB GraphX Graph Frames MLLib Keystone ML

Learning

Management and Serving

VELOX: the Missing Piece of BDAS

SLIDE 22

B D A S: Berkeley Data Analytics Stack

Mesos HDFS, S3, … Spark Streaming Spark SQL BlinkDB GraphX Graph Frames

Learning

Management and Serving

VELOX: the Missing Piece of BDAS

Tachyon Spark MLLib Keystone ML

SLIDE 23

VELOX Architecture

Spark MLLib

Single JVM Instance

Velox

Keystone ML

Content Rec. Fraud Detection

SLIDE 24

VELOX Architecture

Spark MLLib

Single JVM Instance

Velox

Keystone ML

Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation

Create

VW Caffe

SLIDE 25

VELOX as a Middle-Layer Architecture?

Spark MLLib

Velox

Keystone ML

Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation

Create

VW Caffe

Generalize?

SLIDE 26

Daniel Crankshaw, Xin Wang, Michael Franklin, Joseph E. Gonzalez, Ion Stoica

A Low-Latency Online Prediction Serving System

Clipper

SLIDE 27

Clipper Generalizes Velox Across ML Frameworks

Clipper

Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation

Create

VW Caffe

SLIDE 28

Clipper

Create

VW Caffe

Key Insight: The challenges of prediction serving can be addressed between end-user applications and machine learning frameworks

As a result, Clipper is able to:

Ø hide complexity by providing a common prediction interface

Ø bound latency and maximize throughput through approximate caching and adaptive batching

Ø enable robust online learning and personalization through generalized split-model correction policies

without modifying machine learning frameworks or end-user applications

SLIDE 29

Clipper Design Goals

Low and bounded latency predictions

Ø interactive applications need reliable latency objectives

Up-to-date and personalized predictions across models and frameworks

Ø generalize the split model decomposition

Optimize throughput for performance under heavy load

Ø single query can trigger many predictions

Simplify deployment

Ø serve models using the original code and systems

SLIDE 30

Clipper Architecture

Clipper

Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation

VW Caffe

Create

SLIDE 31

Clipper Architecture

Clipper

Applications Predict Observe RPC/REST Interface

VW Caffe

Create

SLIDE 32

Clipper Architecture

Clipper Caffe

Applications


Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC

SLIDE 33

Clipper Architecture

Clipper Caffe

Applications Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Model Abstraction Layer

Provide a common interface to models while bounding latency and maximizing throughput.

Correction Layer

Improve accuracy through ensembles, online learning, and personalization.
SLIDE 34

Clipper Architecture

Clipper Caffe

Applications Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Correction Layer

Correction Policy

Model Abstraction Layer

Approximate Caching Adaptive Batching

SLIDE 35

Caffe

Provides a unified generic prediction API across frameworks
Ø Reduce Latency → Approximate Caching
Ø Increase Throughput → Adaptive Batching
Ø Simplify Deployment → RPC + Model Wrapper

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Model Abstraction Layer

Approximate Caching Adaptive Batching

SLIDE 36

Caffe

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Model Abstraction Layer

Approximate Caching Adaptive Batching

common interface

SLIDE 37

Model Abstraction Layer

[Diagram: model wrappers (MW), one hosting Caffe, each behind its own RPC connection]

Approximate Caching Adaptive Batching

Common Interface → Simplifies Deployment:
Ø Evaluate models using original code & systems
Ø Models run in separate processes
Ø Resource isolation
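The common interface can be sketched as a minimal wrapper protocol: each framework-specific model keeps its native code and simply answers batched predict calls. The class and method names below are hypothetical illustrations, not Clipper's actual API.

```python
from abc import ABC, abstractmethod

class ModelWrapper(ABC):
    """Hypothetical common interface each framework process exposes over RPC."""
    @abstractmethod
    def predict_batch(self, inputs):
        """Return one prediction per input, using the original framework code."""

class NativeModelWrapper(ModelWrapper):
    def __init__(self, model):
        self.model = model  # any object with a framework-native predict()
    def predict_batch(self, inputs):
        return [self.model.predict(x) for x in inputs]

class ToyModel:
    def predict(self, x):
        return sum(x)  # stand-in for a real framework model

wrapper = NativeModelWrapper(ToyModel())
print(wrapper.predict_batch([[1, 2], [3, 4]]))  # [3, 7]
```

Because the wrapper runs in its own process behind RPC, a misbehaving model is isolated from the serving tier and can be scaled out independently.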

SLIDE 38

Model Abstraction Layer

Approximate Caching Adaptive Batching

[Diagram: six model wrappers (MW), one hosting Caffe, each in its own process behind RPC]

Common Interface → Simplifies Deployment:
Ø Evaluate models using original code & systems
Ø Models run in separate processes
Ø Resource isolation
Ø Scale-out

Problem: frameworks are optimized for batch processing, not latency

SLIDE 39

A single page load may generate many queries

Adaptive Batching to Improve Throughput

Ø Optimal batch size depends on:
Ø hardware configuration
Ø model and framework
Ø system load

Clipper Solution: be as slow as allowed…
Ø Increase batch size until the latency objective is exceeded (Additive Increase)
Ø If latency exceeds the SLO, cut batch size by a fraction (Multiplicative Decrease)

Why batching helps:
Ø hardware acceleration
Ø amortizes system overhead
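The additive-increase/multiplicative-decrease rule above can be sketched as a tiny controller. The constants, the function name, and the simulated latency model are illustrative assumptions, not Clipper's actual implementation.

```python
def aimd_step(batch_size, observed_latency_ms, slo_ms,
              increase=1, decrease_factor=0.8, min_batch=1):
    """Grow the batch while under the SLO; back off multiplicatively over it."""
    if observed_latency_ms <= slo_ms:
        return batch_size + increase                              # additive increase
    return max(min_batch, int(batch_size * decrease_factor))      # multiplicative decrease

# Simulate against a toy model where latency grows linearly with batch size:
batch, slo = 1, 20.0
for _ in range(50):
    latency = 2.0 + 0.5 * batch   # pretend per-batch latency model
    batch = aimd_step(batch, latency, slo)
# batch settles into an oscillation just under where latency crosses the SLO
```

The controller needs no model of the hardware or framework; it probes for the largest batch the latency objective allows, which is why it adapts across configurations and load levels.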

SLIDE 40

[Figure: throughput (queries per second), latency (ms), and batch size (queries) for a TensorFlow conv-net on a GPU, showing the latency deadline and the optimal batch size]

SLIDE 41

Comparison to TensorFlow Serving

Takeaway: Clipper is able to match the average latency of TensorFlow Serving while reducing tail latency (2x) and improving throughput (2x)

SLIDE 42

Approximate Caching to Reduce Latency

Ø Popular items may be evaluated frequently → opportunity for caching
Ø High-dimensional, continuous-valued queries have low cache hit rates → need for approximation

Clipper Solution: Approximate Caching, i.e., apply locality-sensitive hash functions to queries

[Figure: bag-of-words and image queries illustrating cache hits, cache misses, and cache-hit errors under approximate caching]
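A minimal sketch of approximate caching with a random-hyperplane (SimHash-style) locality-sensitive hash: nearby queries tend to land in the same bucket, trading occasional cache-hit errors for a higher hit rate. This illustrates the idea only; it is not Clipper's cache implementation.

```python
import numpy as np

class ApproximateCache:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
        self.table = {}

    def _key(self, x):
        # Nearby queries usually fall on the same side of each hyperplane,
        # so they hash to the same bucket (at the cost of occasional errors).
        bits = (self.planes @ np.asarray(x)) > 0
        return bits.tobytes()

    def lookup(self, x):
        return self.table.get(self._key(x))

    def insert(self, x, prediction):
        self.table[self._key(x)] = prediction

cache = ApproximateCache(dim=3)
cache.insert([1.0, 2.0, 3.0], "cat")
print(cache.lookup([1.01, 2.0, 3.0]))  # likely a hit, returning "cat"
```

More hash bits make buckets finer (fewer erroneous hits, lower hit rate); fewer bits coarsen the cache, which connects to the coarsening spectrum on the backup slide.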

SLIDE 43

Clipper Architecture

Clipper Caffe

Applications Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Correction Layer

Correction Policy

Model Abstraction Layer

Approximate Caching Adaptive Batching

SLIDE 44

Goal: Maximize accuracy through ensembles, online learning, and personalization

Generalize the split-model insight from Velox to achieve:
Ø robust predictions by combining multiple models & frameworks
Ø online learning and personalization by correcting and personalizing predictions in response to feedback

Clipper

Correction Layer

Correction Policy

SLIDE 45

Big Data

Application

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing User Model

Velox

SLIDE 46

Caffe

Big Data

Application

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing User Model

Clipper

SLIDE 47

Caffe

Slow Changing Model Fast Changing User Model

Clipper

Correction Policy

Improves prediction accuracy by:
Ø incorporating real-time feedback
Ø managing personalization
Ø combining models & frameworks
Ø enabling frameworks to compete
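One way such a correction policy might look is a toy multiplicative-weights ensemble: combine the outputs of several framework models with weights updated online from feedback, so better models earn more influence. Illustrative only; this is not Clipper's built-in policy.

```python
import math

class CorrectionPolicy:
    def __init__(self, n_models, eta=0.5):
        self.weights = [1.0] * n_models
        self.eta = eta

    def predict(self, model_outputs):
        total = sum(self.weights)
        return sum(w * p for w, p in zip(self.weights, model_outputs)) / total

    def feedback(self, model_outputs, label):
        # Down-weight each model in proportion to its squared error.
        for i, p in enumerate(model_outputs):
            self.weights[i] *= math.exp(-self.eta * (p - label) ** 2)

policy = CorrectionPolicy(n_models=2)
for _ in range(20):
    outputs = [1.0, 0.0]      # model 0 is always right, model 1 always wrong
    policy.feedback(outputs, label=1.0)
# After feedback, the ensemble leans almost entirely on model 0.
print(policy.predict([1.0, 0.0]))
```

Letting frameworks "compete" falls out naturally: a framework whose model keeps missing simply decays toward zero weight in the ensemble.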

SLIDE 48

Improved Prediction Accuracy (ImageNet)

System      Model          Error Rate   #Errors
Caffe       VGG            13.05%       6525
Caffe       LeNet          11.52%       5760
Caffe       ResNet          9.02%       4512
TensorFlow  Inception v3    6.18%       3088

sequence of pre-trained state-of-the-art models

SLIDE 49

Improved Prediction Accuracy

System      Model          Error Rate   #Errors
Caffe       VGG            13.05%       6525
Caffe       LeNet          11.52%       5760
Caffe       ResNet          9.02%       4512
TensorFlow  Inception v3    6.18%       3088
Clipper     Ensemble        5.86%       2930

5.2% relative improvement in prediction accuracy!

SLIDE 50

Cost of Ensembles?

Increased Load
Ø Solutions: caching and batching
Ø Load-shedding: the correction policy can prioritize frameworks

Stragglers
Ø e.g., a framework fails to meet its SLO
Ø Solution: anytime predictions
Ø The correction policy must render predictions with missing inputs
Ø e.g., built-in correction policies substitute the expected value
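The expected-value substitution for stragglers can be sketched as follows: if a model misses its deadline, a running mean of its past outputs stands in so the ensemble can still answer on time. The function name and the simple running-mean estimator are assumptions for illustration.

```python
def anytime_predict(model_outputs, running_means, weights):
    """model_outputs[i] is None when model i straggled past the SLO."""
    # Substitute each missing output with that model's expected value.
    filled = [o if o is not None else running_means[i]
              for i, o in enumerate(model_outputs)]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, filled)) / total

# Model 1 straggled; its expected value (0.4) stands in for its output.
print(anytime_predict([0.9, None], running_means=[0.8, 0.4], weights=[1, 1]))
# ≈ 0.65
```

This is what makes the ensemble "anytime": accuracy degrades gracefully with each missing input instead of the whole prediction blocking on the slowest framework.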

SLIDE 51

Evaluation of Throughput Under Heavy Load

[Figure: accuracy vs. throughput (queries per second)]

Takeaway: Clipper is able to gracefully degrade accuracy to maintain availability under heavy load.

SLIDE 52

Conclusion

Clipper

Create

VW Caffe

Clipper sits between applications and ML frameworks to:
Ø simplify deployment
Ø bound latency and increase throughput
Ø enable real-time learning and personalization across machine learning frameworks

SLIDE 53

Big Data

Model

Training

Application

Decision Query

Feedback

VELOX Clipper

Create

VW

Caffe

SLIDE 54

Ongoing & Future Research Directions

Ø Serving and updating RL models
Ø Bandit techniques in correction policies
Ø Splitting inference across the cloud and the client to reduce latency and bandwidth requirements
Ø Secure model evaluation on the client (model DRM)

SLIDE 55

Coarsening + Anytime Predictions

[Figure: spectrum from No Coarsening (more features) to Overly Coarsened (coarser hash), with better and best operating points in between]

Ø Approximate expectation: f_i(x; θ) ≈ E[f_i(x; θ)]
Ø Coarsened query: f_i(x; θ) ≈ f_i(z; θ)