

SLIDE 1

Clipper: A Low-Latency Online Prediction Serving System

Dan Crankshaw
crankshaw@cs.berkeley.edu
http://clipper.ai
https://github.com/ucbrise/clipper
December 8, 2017

SLIDE 2

[Diagram: Big Data → Training → Model; the trained Model answers an Application's Query with a Decision, connecting the Training side to the Serving side]

Prediction-Serving for interactive applications
Timescale: ~10s of milliseconds

SLIDE 3

Prediction-Serving Challenges

Ø Support low-latency, high-throughput serving workloads
Ø Large and growing ecosystem of ML models and frameworks

[Diagram: framework logos (e.g., Caffe, VW) feeding into an unknown serving layer, marked "???"]
SLIDE 4

Prediction-Serving Today

Ø Highly specialized systems for specific problems
Ø Offline scoring with existing frameworks and systems

Clipper aims to unify these approaches.
New class of systems: Prediction-Serving Systems

SLIDE 5

Clipper Decouples Applications and Models

[Diagram: Applications issue Predict calls to Clipper through an RPC/REST interface; Clipper dispatches over RPC to model containers (MC), each wrapping a framework-specific model such as Caffe]

SLIDE 6

[Diagram: Clipper connected over RPC to a row of model containers (MC), including a Caffe container]

Common Interface → Simplifies Deployment:
Ø Evaluate models using original code & systems
Ø Models run in separate processes as Docker containers
  Ø Resource isolation: cutting-edge ML frameworks can be buggy
  Ø Scale-out and deployment on Kubernetes
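To make the common-interface idea concrete, here is a minimal, purely illustrative sketch (the names below are hypothetical, not Clipper's actual container API): every container, whatever framework it wraps, exposes the same batch predict contract to Clipper over RPC.

```python
# Hypothetical sketch of the "common interface" idea: every model container,
# regardless of framework, exposes the same predict contract over RPC.
# The names (ModelContainer, predict_fn) are illustrative, not Clipper's API.
from typing import Callable, List


class ModelContainer:
    """Wraps any framework-specific model behind a uniform predict interface."""

    def __init__(self, predict_fn: Callable[[List[List[float]]], List[str]]):
        # predict_fn runs the original model code (e.g., a Caffe net)
        # inside this container's own process for resource isolation.
        self.predict_fn = predict_fn

    def predict(self, inputs: List[List[float]]) -> List[str]:
        # Clipper sends a batch of inputs over RPC; the container returns
        # one serialized prediction per input.
        return self.predict_fn(inputs)


# Example: a trivial "model" deployed behind the uniform interface.
container = ModelContainer(lambda xs: [str(sum(x)) for x in xs])
print(container.predict([[1.0, 2.0], [3.0, 4.0]]))  # ['3.0', '7.0']
```

Because each container is a separate Docker process, a crash or memory leak inside one framework cannot take down Clipper or the other models.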

SLIDE 7

Clipper Architecture

[Diagram: Applications issue Predict calls to Clipper, which applies caching and latency-aware batching before dispatching over RPC to model containers (MC), including a Caffe container]
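Latency-aware batching amortizes RPC and model-evaluation overhead while respecting each application's latency objective. Below is a rough sketch of the idea, a toy illustration with an assumed fixed per-item latency rather than Clipper's adaptive implementation:

```python
from collections import deque

# Illustrative latency-aware batching sketch (not Clipper's actual code).
# Queued queries are grouped into the largest batch whose estimated
# processing time still fits under the application's latency objective (SLO).

def next_batch(queue, slo_seconds, per_item_seconds):
    """Pop the largest batch whose estimated latency fits under the SLO."""
    # Here per-item latency is a fixed assumption; a real system would
    # estimate it online from observed model behavior.
    max_batch = max(1, int(slo_seconds / per_item_seconds))
    batch = []
    while queue and len(batch) < max_batch:
        batch.append(queue.popleft())
    return batch

queries = deque(range(10))
while queries:
    batch = next_batch(queries, slo_seconds=0.020, per_item_seconds=0.004)
    print(f"dispatching batch of {len(batch)}: {batch}")
```

The real system sizes batches adaptively from observed model latencies rather than a fixed per-item estimate.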

SLIDE 8

Status of the project

Ø First released in May 2017 with a focus on usability
Ø Currently working towards the 0.3 release and actively working with early users
  Ø Focused on performance improvements and better monitoring and stability
Ø Supports native deployments on Kubernetes and a local Docker mode
Ø Goal: Community-owned platform for model deployment and serving
  Ø Post issues and questions on GitHub and subscribe to our mailing list: clipper-dev@googlegroups.com

https://github.com/ucbrise/clipper

SLIDE 9

Simplifying Model Deployment with Clipper

SLIDE 10

Getting Started with Clipper is Easy

Ø Docker images available on DockerHub
Ø Clipper admin is distributed as a pip package: pip install clipper_admin (quickstart sketch below)
Ø Get up and running without cloning or compiling!
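For concreteness, a minimal quickstart along the lines of the clipper_admin API around the 0.2/0.3 releases looks roughly like this (check http://clipper.ai for the exact current interface):

```python
# Rough quickstart sketch using clipper_admin (API as of the ~0.2 releases;
# consult the project docs for the current interface).
from clipper_admin import ClipperConnection, DockerContainerManager

# Start a local Clipper cluster using Docker containers.
clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.start_clipper()

# Register an application: a named REST endpoint with a latency objective.
clipper_conn.register_application(
    name="hello-world",
    input_type="doubles",
    default_output="-1.0",  # returned if the model misses the SLO
    slo_micros=100000,      # 100 ms latency objective
)
```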

SLIDE 11

Clipper Connects Training and Serving

[Diagram: a Spark training cluster (driver program with SparkContext; worker nodes running executors and tasks) on one side, a serving stack (web server, database, cache) on the other, connected through Clipper and a model container (MC)]

SLIDE 12

Problem: Models don’t run in isolation

Must extract model plus pre- and post-processing logic

SLIDE 13

Clipper provides a library of model deployers

Ø Deployer automatically and intelligently saves all prediction code
Ø Captures both framework-specific models and arbitrary serializable code
Ø Replicates the required subset of the training environment and loads the prediction code in a Clipper model container

SLIDE 14

Clipper provides a (growing) library of model deployers

Ø Python (example below)
  Ø Combine framework-specific models with external featurization, post-processing, and business logic
  Ø Currently supports Scikit-Learn, PySpark, and TensorFlow
  Ø PyTorch, Caffe2, and XGBoost coming soon
Ø Scala and Java with Spark
  Ø Both MLlib and Pipelines APIs
Ø Arbitrary R functions
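As an illustration, deploying an arbitrary Python function (here wrapping a toy scikit-learn model with pre- and post-processing) with the Python closure deployer looks roughly like this; the signatures match the clipper_admin docs of the time, but treat the details as a sketch:

```python
# Sketch of the Python closure deployer (clipper_admin.deployers.python);
# see the project docs for the exact current signatures.
from clipper_admin import ClipperConnection, DockerContainerManager
from clipper_admin.deployers import python as python_deployer
from sklearn.linear_model import LogisticRegression

# Connect to the running cluster from the quickstart sketch above.
clipper_conn = ClipperConnection(DockerContainerManager())
clipper_conn.connect()

# A small framework-specific model from the training session.
clf = LogisticRegression().fit([[0.0, 0.0], [1.0, 1.0]], [0, 1])

def predict(inputs):
    # Pre-processing, model inference, and post-processing are all
    # captured and serialized together by the deployer.
    scaled = [[v / 255.0 for v in x] for x in inputs]
    return [str(label) for label in clf.predict(scaled)]

python_deployer.deploy_python_closure(
    clipper_conn,
    name="sklearn-model",
    version=1,
    input_type="doubles",
    func=predict,
)

# Route application traffic to the newly deployed model.
clipper_conn.link_model_to_app(app_name="hello-world", model_name="sklearn-model")
```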

SLIDE 15

Ongoing Research

SLIDE 16

Supporting Modular Multi-Model Pipelines

Ø Ensembles can improve accuracy
Ø Faster inference with prediction cascades: fast model → if confident, return; else fall back to a slow but accurate model (sketched below)
Ø Faster development through model reuse: pre-trained DNN + task-specific model
Ø Model specialization: e.g., an object detector that invokes a face detector only when a face is detected

How to efficiently support serving arbitrary model pipelines?
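A prediction cascade only invokes the slow, accurate model when the fast model is not confident. The sketch below uses hypothetical stand-in models and thresholds, not Clipper's API:

```python
# Illustrative prediction-cascade sketch; the models and the confidence
# threshold are hypothetical stand-ins.
import random

def fast_model(x):
    # Cheap model: returns (prediction, confidence).
    return ("cat", random.uniform(0.5, 1.0))

def slow_accurate_model(x):
    # Expensive model: consulted only when the fast model is unsure.
    return "cat"

def cascade_predict(x, confidence_threshold=0.9):
    prediction, confidence = fast_model(x)
    if confidence >= confidence_threshold:
        return prediction          # confident: skip the slow model
    return slow_accurate_model(x)  # fall back to the accurate model

print(cascade_predict([0.1, 0.2]))
```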

SLIDE 17

Challenges of Serving Model Pipelines

Ø Complex tradeoff space of latency, throughput, and monetary cost
  Ø Many serving workloads are interactive and highly latency-sensitive
  Ø Performance and cost depend on the model, the workload, and the physical resources available
Ø Model composition leads to a combinatorial explosion in the size of the tradeoff space
  Ø Developers must make decisions about how to configure individual models while reasoning about end-to-end pipeline performance

SLIDE 18

Solution: Workload-Aware Optimizer

Ø Exploit structure and properties of inference computation
  Ø Immutable state
  Ø Query-level parallelism
  Ø Compute-intensive
Ø Pipeline definition
  Ø Intermingle arbitrary application code and Clipper-hosted model evaluation for maximum flexibility
Ø Optimizer input
  Ø Pipeline, sample workload, and performance or cost constraints
Ø Optimizer output
  Ø Optimal pipeline configuration that meets the constraints (interface sketched below)
Ø Deployed models use Clipper as the physical execution engine for serving
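To make the optimizer's contract concrete, here is a purely hypothetical sketch of its input/output shape. This mirrors the slide's description of ongoing research; none of these names are Clipper APIs:

```python
# Hypothetical sketch of the workload-aware optimizer's contract; all names
# are illustrative, mirroring the slide's input/output description.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Constraints:
    max_latency_ms: float     # end-to-end latency objective
    max_cost_per_hour: float  # monetary budget


@dataclass
class PipelineConfig:
    # Per-model knobs the optimizer would tune: replication factor,
    # batch size, and hardware choice.
    replicas: Dict[str, int]
    batch_sizes: Dict[str, int]
    hardware: Dict[str, str]


def optimize(pipeline: List[str], sample_workload: List[list],
             constraints: Constraints) -> PipelineConfig:
    """Search the combinatorial configuration space for a configuration
    that meets the constraints; a stub standing in for the real search."""
    return PipelineConfig(
        replicas={m: 1 for m in pipeline},
        batch_sizes={m: 8 for m in pipeline},
        hardware={m: "cpu" for m in pipeline},
    )


config = optimize(["object-detector", "face-detector"], [[0.1, 0.2]],
                  Constraints(max_latency_ms=50.0, max_cost_per_hour=10.0))
print(config)
```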

SLIDE 19

Conclusion

Ø Challenges of serving increasingly complex models trained in a variety of frameworks while meeting strict performance demands
Ø Clipper adopts a container-based architecture and employs prediction caching and latency-aware batching
Ø Clipper's model deployer library makes it easy to deploy both framework-specific models and arbitrary processing code
Ø Ongoing efforts on a workload-aware optimizer to optimize the deployment of complex, multi-model pipelines

http://clipper.ai