Prediction Serving: what happens after learning? Joseph E. Gonzalez



SLIDE 1

Joseph E. Gonzalez

  • Asst. Professor, UC Berkeley

jegonzal@cs.berkeley.edu

  • Co-founder, Dato Inc.

joseph@dato.com

what happens after learning?

Prediction Serving

SLIDE 2

Outline

Daniel Crankshaw, Xin Wang, Michael Franklin, & Ion Stoica

VELOX Clipper

Create

VW

Caffe

SLIDE 3

Big Data

Big Model

Training

Learning

Timescale: minutes to days
Systems: offline and batch optimized
Heavily studied ... major focus of the AMPLab

SLIDE 4

Big Data

Big Model

Training

Application

Decision Query

?

Learning Inference

SLIDE 5

Big Data Training

Learning Inference

Big Model Application

Decision Query

Timescale: ~10 milliseconds
Systems: online and latency optimized
Less studied …

SLIDE 6

Big Data

Big Model

Training

Application

Decision Query

Learning Inference

Feedback

SLIDE 7

Big Data Training

Application

Decision

Learning Inference

Feedback

Timescale: hours to weeks
Systems: combination of systems
Less studied …

SLIDE 8

Big Data

Big Model

Training

Application

Decision Query

Learning Inference

Feedback

Responsive (~10 ms)
Adaptive (~1 second)

SLIDE 9

Responsive (~10 ms)
Adaptive (~1 second)

VELOX Model Serving System [CIDR’15]

Daniel Crankshaw, Peter Bailis, Haoyuan Li, Zhao Zhang, Joseph Gonzalez, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan

Key Insight:

Decompose models into fast and slow changing components

SLIDE 10

Big Data Training

Application

Decision Query

Learning Inference

Feedback

SLIDE 11

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing Model

SLIDE 12

Hybrid Offline + Online Learning

Update the user weights online:

  • Simple to train + more robust model
  • Address rapidly changing user statistics

Update feature functions offline using batch solvers

  • Leverage high-throughput systems (TensorFlow)
  • Exploit slow change in population statistics

f(x; θ)^T w_u
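The split model f(x; θ)^T w_u can be sketched as follows: the feature function f(x; θ) is trained offline and held fixed, while the per-user weights w_u are updated online from feedback. The feature function (a random projection with a nonlinearity) and the squared-loss SGD update are illustrative stand-ins, not Velox's actual code.

```python
import numpy as np

def features(x, theta):
    # Offline-trained feature function f(x; theta); a fixed random
    # projection plus tanh stands in for a real learned model.
    return np.tanh(theta @ x)

def online_update(w_u, x, y, theta, lr=0.1):
    """One squared-loss SGD step on the fast-changing user weights only."""
    phi = features(x, theta)
    pred = phi @ w_u
    return w_u - lr * (pred - y) * phi

rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 4))   # slow-changing component (fixed online)
w_u = np.zeros(8)                 # fast-changing per-user component

for _ in range(100):
    x = rng.normal(size=4)
    y = features(x, theta) @ np.ones(8)  # synthetic user preference signal
    w_u = online_update(w_u, x, y, theta)
```

Because only the low-dimensional w_u changes online, each update is a few floating-point operations, which is why the slides report sub-millisecond online updates versus seconds for full retraining.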

SLIDE 13

Common modeling structure

f(x; θ)^T w_u

Items Users Matrix Factorization

Input

Deep Learning Ensemble Methods

SLIDE 14

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing Model

SLIDE 15

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing Model per user

SLIDE 16

Velox Online Learning for Recommendations

(20 Newsgroups)

[Figure: test error vs. number of training examples]

Online Updates: 0.4 ms Retraining: 7.1 seconds

>4 orders-of-magnitude faster adaptation given sufficient offline training data

SLIDE 17

Velox Online Learning for Recommendations

(20 Newsgroups)

[Figure: test error vs. number of training examples]

Partial Updates: 0.4 ms Retraining: 7.1 seconds

>4 orders-of-magnitude faster adaptation

SLIDE 18

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing Model per user

SLIDE 19

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing Model per user

Velox

SLIDE 20

B D A S

Tachyon Mesos Spark HDFS, S3, … Spark Streaming Spark SQL BlinkDB GraphX Graph Frames MLLib Keystone ML

Learning

Berkeley Data Analytics Stack

VELOX: the Missing Piece of BDAS

SLIDE 21

B D A S: Berkeley Data Analytics Stack

Tachyon Mesos Spark HDFS, S3, … Spark Streaming Spark SQL BlinkDB GraphX Graph Frames MLLib Keystone ML

Learning

Management and Serving

VELOX: the Missing Piece of BDAS

SLIDE 22

B D A S: Berkeley Data Analytics Stack

Mesos HDFS, S3, … Spark Streaming Spark SQL BlinkDB GraphX Graph Frames

Learning

Management and Serving

VELOX: the Missing Piece of BDAS

Tachyon Spark MLLib Keystone ML

SLIDE 23

VELOX Architecture

Spark MLLib

Single JVM Instance

Velox

Keystone ML

Content Rec. Fraud Detection

SLIDE 24

VELOX Architecture

Spark MLLib

Single JVM Instance

Velox

Keystone ML

Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation

Create

VW Caffe

SLIDE 25

VELOX as a Middle-Layer Architecture?

Spark MLLib

Velox

Keystone ML

Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation

Create

VW Caffe

Generalize?

SLIDE 26

Daniel Crankshaw, Xin Wang, Michael Franklin, Joseph E. Gonzalez, Ion Stoica

A Low-Latency Online Prediction Serving System

Clipper

SLIDE 27

Clipper Generalizes Velox Across ML Frameworks

Clipper

Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation

Create

VW Caffe

SLIDE 28

Clipper

Create

VW Caffe

Key Insight: The challenges of prediction serving can be addressed between end-user applications and machine learning frameworks

As a result, Clipper is able to:

Ø hide complexity by providing a common prediction interface

Ø bound latency and maximize throughput through approximate caching and adaptive batching

Ø enable robust online learning and personalization through generalized split-model correction policies

without modifying machine learning frameworks or end-user applications

SLIDE 29

Clipper Design Goals

Low and bounded latency predictions

Ø interactive applications need reliable latency objectives

Up-to-date and personalized predictions across models and frameworks

Ø generalize the split model decomposition

Optimize throughput for performance under heavy load

Ø single query can trigger many predictions

Simplify deployment

Ø serve models using the original code and systems

SLIDE 30

Clipper Architecture

Clipper

Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation

VW Caffe

Create

SLIDE 31

Clipper Architecture

Clipper

Applications Predict Observe RPC/REST Interface

VW Caffe

Create

SLIDE 32

Clipper Architecture

Clipper Caffe

Applications


Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC

SLIDE 33

Clipper Architecture

Clipper Caffe

Applications Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Model Abstraction Layer

Provide a common interface to models while bounding latency and maximizing throughput.

Correction Layer

Improve accuracy through ensembles, online learning, and personalization.
SLIDE 34

Clipper Architecture

Clipper Caffe

Applications Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Correction Layer

Correction Policy

Model Abstraction Layer

Approximate Caching Adaptive Batching

SLIDE 35

Caffe

Provides a unified generic prediction API across frameworks
Ø Reduce Latency → Approximate Caching
Ø Increase Throughput → Adaptive Batching
Ø Simplify Deployment → RPC + Model Wrapper

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Model Abstraction Layer

Approximate Caching Adaptive Batching

SLIDE 36

Caffe

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Model Abstraction Layer

Approximate Caching Adaptive Batching

common interface

SLIDE 37

Model Abstraction Layer

[Diagram: model wrappers (MW), one hosting Caffe, each behind its own RPC connection]

Approximate Caching Adaptive Batching

Common Interface → Simplifies Deployment:
Ø Evaluate models using original code & systems
Ø Models run in separate processes
Ø Resource isolation
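The common interface can be sketched as a minimal wrapper protocol: each framework-specific model keeps its native code and simply answers batched predict calls. The class and method names below are hypothetical illustrations, not Clipper's actual API.

```python
from abc import ABC, abstractmethod

class ModelWrapper(ABC):
    """Hypothetical common interface each framework process exposes over RPC."""
    @abstractmethod
    def predict_batch(self, inputs):
        """Return one prediction per input, using the original framework code."""

class NativeModelWrapper(ModelWrapper):
    def __init__(self, model):
        self.model = model  # any object with a framework-native predict()
    def predict_batch(self, inputs):
        return [self.model.predict(x) for x in inputs]

class ToyModel:
    def predict(self, x):
        return sum(x)  # stand-in for a real framework model

wrapper = NativeModelWrapper(ToyModel())
print(wrapper.predict_batch([[1, 2], [3, 4]]))  # [3, 7]
```

Because the wrapper runs in its own process behind RPC, a misbehaving model is isolated from the serving tier and can be scaled out independently.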

SLIDE 38

Model Abstraction Layer

Approximate Caching Adaptive Batching

[Diagram: six model wrappers (MW), one hosting Caffe, each in its own process behind RPC]

Common Interface → Simplifies Deployment:
Ø Evaluate models using original code & systems
Ø Models run in separate processes
Ø Resource isolation
Ø Scale-out

Problem: frameworks are optimized for batch processing, not latency

SLIDE 39

A single page load may generate many queries

Adaptive Batching to Improve Throughput

Ø Optimal batch size depends on:
Ø hardware configuration
Ø model and framework
Ø system load

Clipper Solution: be as slow as allowed…
Ø Increase batch size until the latency objective is exceeded (Additive Increase)
Ø If latency exceeds the SLO, cut batch size by a fraction (Multiplicative Decrease)

Why batching helps:
Ø hardware acceleration
Ø amortizes system overhead
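The additive-increase/multiplicative-decrease rule above can be sketched as a tiny controller. The constants, the function name, and the simulated latency model are illustrative assumptions, not Clipper's actual implementation.

```python
def aimd_step(batch_size, observed_latency_ms, slo_ms,
              increase=1, decrease_factor=0.8, min_batch=1):
    """Grow the batch while under the SLO; back off multiplicatively over it."""
    if observed_latency_ms <= slo_ms:
        return batch_size + increase                              # additive increase
    return max(min_batch, int(batch_size * decrease_factor))      # multiplicative decrease

# Simulate against a toy model where latency grows linearly with batch size:
batch, slo = 1, 20.0
for _ in range(50):
    latency = 2.0 + 0.5 * batch   # pretend per-batch latency model
    batch = aimd_step(batch, latency, slo)
# batch settles into an oscillation just under where latency crosses the SLO
```

The controller needs no model of the hardware or framework; it probes for the largest batch the latency objective allows, which is why it adapts across configurations and load levels.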

SLIDE 40

[Figure: throughput (queries per second), latency (ms), and batch size (queries) for a TensorFlow conv-net on a GPU, showing the latency deadline and the optimal batch size]

SLIDE 41

Comparison to TensorFlow Serving

Takeaway: Clipper is able to match the average latency of TensorFlow Serving while reducing tail latency (2x) and improving throughput (2x)

SLIDE 42

Approximate Caching to Reduce Latency

Ø Popular items may be evaluated frequently → opportunity for caching
Ø High-dimensional, continuous-valued queries have low cache hit rates → need for approximation

Clipper Solution: Approximate Caching, i.e., apply locality-sensitive hash functions to queries

[Figure: bag-of-words and image queries illustrating cache hits, cache misses, and cache-hit errors under approximate caching]
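A minimal sketch of approximate caching with a random-hyperplane (SimHash-style) locality-sensitive hash: nearby queries tend to land in the same bucket, trading occasional cache-hit errors for a higher hit rate. This illustrates the idea only; it is not Clipper's cache implementation.

```python
import numpy as np

class ApproximateCache:
    def __init__(self, dim, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
        self.table = {}

    def _key(self, x):
        # Nearby queries usually fall on the same side of each hyperplane,
        # so they hash to the same bucket (at the cost of occasional errors).
        bits = (self.planes @ np.asarray(x)) > 0
        return bits.tobytes()

    def lookup(self, x):
        return self.table.get(self._key(x))

    def insert(self, x, prediction):
        self.table[self._key(x)] = prediction

cache = ApproximateCache(dim=3)
cache.insert([1.0, 2.0, 3.0], "cat")
print(cache.lookup([1.01, 2.0, 3.0]))  # likely a hit, returning "cat"
```

More hash bits make buckets finer (fewer erroneous hits, lower hit rate); fewer bits coarsen the cache, which connects to the coarsening spectrum on the backup slide.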

SLIDE 43

Clipper Architecture

Clipper Caffe

Applications Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Correction Layer

Correction Policy

Model Abstraction Layer

Approximate Caching Adaptive Batching

SLIDE 44

Goal: Maximize accuracy through ensembles, online learning, and personalization

Generalize the split-model insight from Velox to achieve:
Ø robust predictions by combining multiple models & frameworks
Ø online learning and personalization by correcting and personalizing predictions in response to feedback

Clipper

Correction Layer

Correction Policy

SLIDE 45

Big Data

Application

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing User Model

Velox

SLIDE 46

Caffe

Big Data

Application

Learning Inference

Feedback Slow

Slow Changing Model Fast Changing User Model

Clipper

SLIDE 47

Caffe

Slow Changing Model Fast Changing User Model

Clipper

Correction Policy

Improves prediction accuracy by:
Ø incorporating real-time feedback
Ø managing personalization
Ø combining models & frameworks
Ø enabling frameworks to compete
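One way such a correction policy might look is a toy multiplicative-weights ensemble: combine the outputs of several framework models with weights updated online from feedback, so better models earn more influence. Illustrative only; this is not Clipper's built-in policy.

```python
import math

class CorrectionPolicy:
    def __init__(self, n_models, eta=0.5):
        self.weights = [1.0] * n_models
        self.eta = eta

    def predict(self, model_outputs):
        total = sum(self.weights)
        return sum(w * p for w, p in zip(self.weights, model_outputs)) / total

    def feedback(self, model_outputs, label):
        # Down-weight each model in proportion to its squared error.
        for i, p in enumerate(model_outputs):
            self.weights[i] *= math.exp(-self.eta * (p - label) ** 2)

policy = CorrectionPolicy(n_models=2)
for _ in range(20):
    outputs = [1.0, 0.0]      # model 0 is always right, model 1 always wrong
    policy.feedback(outputs, label=1.0)
# After feedback, the ensemble leans almost entirely on model 0.
print(policy.predict([1.0, 0.0]))
```

Letting frameworks "compete" falls out naturally: a framework whose model keeps missing simply decays toward zero weight in the ensemble.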

SLIDE 48

Improved Prediction Accuracy (ImageNet)

System      Model          Error Rate   #Errors
Caffe       VGG            13.05%       6525
Caffe       LeNet          11.52%       5760
Caffe       ResNet          9.02%       4512
TensorFlow  Inception v3    6.18%       3088

sequence of pre-trained state-of-the-art models

SLIDE 49

Improved Prediction Accuracy

System      Model          Error Rate   #Errors
Caffe       VGG            13.05%       6525
Caffe       LeNet          11.52%       5760
Caffe       ResNet          9.02%       4512
TensorFlow  Inception v3    6.18%       3088
Clipper     Ensemble        5.86%       2930

5.2% relative improvement in prediction accuracy!

SLIDE 50

Cost of Ensembles?

Increased Load
Ø Solutions: caching and batching
Ø Load-shedding: the correction policy can prioritize frameworks

Stragglers
Ø e.g., a framework fails to meet its SLO
Ø Solution: anytime predictions
Ø The correction policy must render predictions with missing inputs
Ø e.g., built-in correction policies substitute the expected value
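The expected-value substitution for stragglers can be sketched as follows: if a model misses its deadline, a running mean of its past outputs stands in so the ensemble can still answer on time. The function name and the simple running-mean estimator are assumptions for illustration.

```python
def anytime_predict(model_outputs, running_means, weights):
    """model_outputs[i] is None when model i straggled past the SLO."""
    # Substitute each missing output with that model's expected value.
    filled = [o if o is not None else running_means[i]
              for i, o in enumerate(model_outputs)]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, filled)) / total

# Model 1 straggled; its expected value (0.4) stands in for its output.
print(anytime_predict([0.9, None], running_means=[0.8, 0.4], weights=[1, 1]))
# ≈ 0.65
```

This is what makes the ensemble "anytime": accuracy degrades gracefully with each missing input instead of the whole prediction blocking on the slowest framework.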

SLIDE 51

Evaluation of Throughput Under Heavy Load

[Figure: accuracy vs. throughput (queries per second)]

Takeaway: Clipper is able to gracefully degrade accuracy to maintain availability under heavy load.

SLIDE 52

Conclusion

Clipper

Create

VW Caffe

Clipper sits between applications and ML frameworks to:
Ø simplify deployment
Ø bound latency and increase throughput
Ø enable real-time learning and personalization across machine learning frameworks

SLIDE 53

Big Data

Model

Training

Application

Decision Query

Feedback

VELOX Clipper

Create

VW

Caffe

SLIDE 54

Ongoing & Future Research Directions

Ø Serving and updating RL models
Ø Bandit techniques in correction policies
Ø Splitting inference across the cloud and the client to reduce latency and bandwidth requirements
Ø Secure model evaluation on the client (model DRM)

SLIDE 55

Coarsening + Anytime Predictions

[Figure: spectrum from No Coarsening (more features) to Overly Coarsened (coarser hash), with better and best operating points in between]

Ø Approximate expectation: f_i(x; θ) ≈ E[f_i(x; θ)]
Ø Coarsened query: f_i(x; θ) ≈ f_i(z; θ)