Research at the intersection of AI + Systems - Joseph E. Gonzalez (PowerPoint Presentation)



SLIDE 1

Joseph E. Gonzalez

Assistant Professor, UC Berkeley jegonzal@cs.berkeley.edu

Research at the intersection of AI + Systems

SLIDE 2

Looking Back on AI Systems

Going back to when I started graduate school …

SLIDE 3
SLIDE 4

The machine learning community has had an evolving focus on AI Systems (2006 to 2017)

Fast Algorithms ML for Systems Distributed Algorithms Deep Learning Frameworks

Integration of Communities

Machine Learning Frameworks

SLIDE 5

Big Data

Big Model

Training

Learning

The focus of AI Systems research has been on model training.

SLIDE 6

Big Data

Big Model

Training

- Distributed Dataflow Systems
- Stochastic Optimization
- GPU / TPU Acceleration
- Deep Learning (CNN/RNN)
- Domain Specific Languages (TensorFlow)
- Symbolic Methods

Enabling Machine Learning and Systems Innovations

SLIDE 7

Splash

CoCoA

VW

rllab

Big Data

Big Model

Training

SLIDE 8

Big Data

Big Model

Training

Learning

?

SLIDE 9

Big Data

Big Model

Training

Learning

Drive Actions

SLIDE 10

Big Data

Big Model

Training

Application

Decision Query

?

Learning Prediction

SLIDE 11

Big Data Training

Learning Prediction

Big Model Application

Decision Query

Goal: ~10 ms under heavy load. Complicated by Deep Learning → New ML Algorithms and Systems

SLIDE 12

Models getting more complex
- 10s of GFLOPs [1]
- Recurrent nets

Support low-latency, high-throughput serving workloads

Deployed on critical path
- Maintain latency goals under heavy load

Using specialized hardware for predictions

[1] Deep Residual Learning for Image Recognition. He et al. CVPR 2015.

SLIDE 13

Google Translate Serving

- 140 billion words a day [1]
- 82,000 GPUs running 24/7
- Designed New Hardware! Tensor Processing Unit (TPU)

“If each of the world’s Android phones used the new Google voice search for just three minutes a day, these engineers realized, the company would need twice as many data centers.” – Wired

[1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

SLIDE 14

Prediction-Serving Challenges

- Large and growing ecosystem of ML models and frameworks (e.g., VW, Caffe)
- Support low-latency, high-throughput serving workloads

SLIDE 15

Wide range of applications and frameworks

SLIDE 16

Wide range of applications and frameworks (e.g., VW, Caffe)

SLIDE 17

One-Off Systems for High-Value Tasks

Problems:

- Expensive to build and maintain
  - Requires AI + Systems expertise
- Tightly coupled model, framework, and application
  - Difficult to update models and add new frameworks

SLIDE 18

Prediction Serving is an Open Problem

- Computationally challenging
- Need low latency & high throughput
- No standard technology or abstractions for serving models

Clipper: Low Latency Prediction Serving System [NSDI’17]

IDK Prediction Cascades: Learning to make fast predictions [Work in Progress]

SLIDE 19

Daniel Crankshaw, Xin Wang, Ion Stoica, Giulio Zhou, Alexey Tumanov, Corey Zumar, Yika Luo

Clipper

Low Latency Prediction Serving System

SLIDE 20

Wide range of applications and frameworks (e.g., VW, Caffe)

SLIDE 21

Middle layer for prediction serving.

Common Abstraction + System Optimizations across frameworks (e.g., VW, Caffe)
SLIDE 22

Clipper Decouples Applications and Models

Diagram: Applications issue Predict/Observe calls over Clipper’s RPC/REST interface; Clipper forwards them over RPC to framework-specific Model Containers (MC), e.g. a Caffe container.
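The decoupling above can be sketched as a common container interface that every framework-specific model sits behind. This is a hypothetical minimal version for illustration, not Clipper’s actual container API; `FunctionContainer` and the method names are assumptions:

```python
import abc


class ModelContainer(abc.ABC):
    """Common interface every framework-specific container implements.
    The serving layer talks to containers only through this batch
    predict call (over RPC in the real system; in-process here)."""

    @abc.abstractmethod
    def predict_batch(self, inputs):
        """Take a batch of inputs, return a batch of predictions."""


class FunctionContainer(ModelContainer):
    """Wraps any callable model (scikit-learn, Caffe, ...) so the
    serving layer never sees the underlying framework."""

    def __init__(self, model_fn):
        self.model_fn = model_fn

    def predict_batch(self, inputs):
        # Apply the wrapped model to each element of the batch.
        return [self.model_fn(x) for x in inputs]
```

An application then issues Predict calls against the middle layer, which routes them to whichever container is registered, so models can be retrained or swapped without touching application code.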

SLIDE 23

Clipper Architecture

- Applications: Predict / Observe over an RPC/REST interface
- Model Selection Layer: combine predictions across frameworks
- Model Abstraction Layer: provide a common interface and system optimizations
- Model Containers (MC), reached over RPC (e.g., Caffe)

SLIDE 24

Clipper Architecture

- Applications: Predict / Observe over an RPC/REST interface
- Model Selection Layer: combine predictions across frameworks
- Model Abstraction Layer: provide a common interface and system optimizations
- Model Containers (MC), reached over RPC (e.g., Caffe)

Model Abstraction Layer optimizations: Optimized Batching, Caching, Common API, Model Isolation
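Of the optimizations listed, caching is the easiest to sketch: a memo table in front of the model so repeated queries skip evaluation. The LRU eviction policy, capacity, and key scheme here are assumptions for illustration, not Clipper’s documented design:

```python
from collections import OrderedDict


class PredictionCache:
    """LRU cache over (model_name, input) -> prediction, so repeated
    queries are answered without re-running the model."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get_or_predict(self, model_name, x, predict_fn):
        key = (model_name, x)
        if key in self._cache:
            self._cache.move_to_end(key)        # mark as recently used
            return self._cache[key]
        y = predict_fn(x)                       # cache miss: run the model
        self._cache[key] = y
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)     # evict least recently used
        return y
```

This only helps for workloads with repeated inputs (e.g., the same image queried by many page loads); inputs must be hashable to serve as keys.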

SLIDE 25

A single page load may generate many queries.

Batching to Improve Throughput

- Why batching helps: throughput-optimized frameworks deliver higher throughput at larger batch sizes
- Optimal batch depends on: hardware configuration, model and framework, system load

Clipper Solution: Adaptively trade off latency and throughput
- Increase the batch size until the latency objective is exceeded (Additive Increase)
- If latency exceeds the SLO, cut the batch size by a fraction (Multiplicative Decrease)
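The additive-increase/multiplicative-decrease rule can be sketched directly. The step size and backoff fraction below are assumed knobs for illustration, not Clipper’s published constants:

```python
def adapt_batch_size(batch_size, measured_latency_ms, slo_ms,
                     step=4, backoff=0.75, min_batch=1):
    """AIMD batch sizing: grow the batch while the latency objective
    holds; cut it by a fraction when the SLO is exceeded."""
    if measured_latency_ms <= slo_ms:
        return batch_size + step                      # additive increase
    return max(min_batch, int(batch_size * backoff))  # multiplicative decrease
```

Run after each batch completes: `adapt_batch_size(32, 8.0, slo_ms=10.0)` grows the batch to 36, while a 12 ms measurement shrinks it to 24. AIMD converges near the largest batch that still meets the latency objective, and re-adapts as load shifts.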

SLIDE 26

Throughput (QPS)

20000 40000 60000

4 8 3 8 6 2 2 8 5 9 2 9 3 5 4 8 9 3 4 1 9 7 4 7 2 1 9 8 9 6 3 3 1 7 7 2 6 1 9 2 2 3 1 9 2 1 AdaStLve 1R BatFKLng

20 40

2 2 2 2 2 8 2 5 5

1 R

  • 2

S 5 a n d R P ) R r e V t ( 6 K O e a r n ) L L n e a r 6 9 ( 3 y 6 S a r N ) L L n e a r 6 9 ( 6 K L e a r n ) K e r n e O 6 9 ( 6 K L e a r n ) L R g 5 e g r e V V L R n ( 6 K L e a r n ) 1000

1 3 4 4 9 6 3 2 1 3 1 8 3 1 2 2 4

P99 Latency (ms) Batch Size

SLIDE 27

Overhead of decoupled architecture

Diagram: TensorFlow Serving exposes Predict over an RPC interface directly to applications; Clipper interposes a Predict/Feedback RPC/REST interface in front of Model Containers (MC), e.g. TensorFlow and Caffe containers, connected over RPC.

SLIDE 28

Overhead of decoupled architecture

Figure: Throughput (QPS, higher is better) and P99 latency (ms, lower is better). The decoupled design matches the performance of the centralized design.

SLIDE 29

Clipper Architecture

- Applications: Predict / Observe over an RPC/REST interface
- Model Selection Layer: combine predictions across frameworks
- Model Abstraction Layer: provide a common interface and system optimizations
- Model Containers (MC), reached over RPC (e.g., Caffe)

SLIDE 30

Clipper Architecture

- Applications: Predict / Observe over an RPC/REST interface
- Model Abstraction Layer: provide a common interface and system optimizations
- Model Containers (MC), reached over RPC (e.g., Caffe)
- Model Selection Layer: combine predictions across frameworks

SLIDE 31

Model Selection Layer: combine predictions across frameworks

Diagram: the selection layer routes queries across model versions (Version 1, Version 2, Version 3) and frameworks (e.g., Caffe), supporting periodic retraining and experimentation with new models and frameworks.

SLIDE 32

Selection Policy can Calibrate Confidence

Diagram: an ensemble of models (e.g., a Caffe model, Version 2, Version 3) predicts “CAT”, “DJ”, “CAT”, “CAT”; the policy returns “CAT” when enough models agree, and UNSURE otherwise.
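The agreement-based selection policy illustrated above (the “4-agree” / “5-agree” policies evaluated on the next slides) can be sketched as a vote count with a quorum. This minimal version is an assumption about the mechanism, not Clipper’s exact policy code:

```python
from collections import Counter


def agreement_policy(predictions, quorum):
    """Return (label, confident): the majority label of the ensemble,
    plus a flag that is True only when at least `quorum` members
    agree on that label; otherwise the query is marked unsure."""
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes >= quorum
```

With the slide’s example, `agreement_policy(["CAT", "DJ", "CAT", "CAT"], quorum=4)` returns `("CAT", False)`: the majority answer is still reported, but the policy flags the query as UNSURE.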

SLIDE 33

Selection Policy: Estimate confidence

Figure: ImageNet Top-5 error rate (lower is better) for ensemble, 4-agree, and 5-agree policies, split into confident vs. unsure queries (error rates 0.0586, 0.0469, 0.3182, 0.0327, 0.1983).

SLIDE 34

Selection Policy: Estimate confidence

Figure (as on slide 33, with bar width showing the percentage of the query workload in each group): ImageNet Top-5 error rate for ensemble, 4-agree, and 5-agree policies, confident vs. unsure.

SLIDE 35

Open Research Questions

- Efficient execution of complex model compositions
- Optimal batching to achieve end-to-end latency goals
- Automatic model failure identification and correction
  - Use anomaly detection techniques to identify model failures
- Prediction serving on the edge
  - Allowing models to span cloud and edge infrastructure

http://clipper.ai

SLIDE 36

Clipper: Low Latency Prediction Serving System [NSDI’17]

IDK Prediction Cascades: Learning to make fast predictions. [Work in Progress]

SLIDE 37

Clipper: Low Latency Prediction Serving System [NSDI’17]

IDK Prediction Cascades: Learning to make fast predictions. [Work in Progress]

SLIDE 38

Model costs are increasing much faster than gains in accuracy.

Figure: as model complexity grows, accuracy rises from 56.6 to 78.3 (small but significant gains) while relative cost rises from 0.08 to 1.0 (an order of magnitude gap).

SLIDE 39

IDK Prediction Cascades: Simple models for simple tasks

Daniel Crankshaw, Xin Wang, Alexey Tumanov, Yika Luo

Diagram: Query → Simple Model (fast) → Prediction; on “I don’t know”, fall back to the Accurate Model (slow) → Prediction.

Combine fast (inaccurate) models with slow (accurate) models to maximize accuracy while reducing computational costs.

https://arxiv.org/abs/1706.00885
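The cascade reduces to a confidence-gated fallback. In this sketch the models return (label, confidence) pairs and the threshold is an assumed tuning knob, not a value from the paper:

```python
def cascade_predict(query, fast_model, accurate_model, idk_threshold=0.9):
    """IDK prediction cascade: answer with the fast model when it is
    confident; otherwise treat the answer as 'I don't know' and fall
    back to the slow, accurate model."""
    label, confidence = fast_model(query)
    if confidence >= idk_threshold:
        return label                 # fast path: most queries stop here
    label, _ = accurate_model(query)
    return label                     # slow path: hard queries only
```

The expected cost per query is cost(fast) plus the fallback fraction times cost(slow), which is why routing only hard queries to the big model cuts average runtime without hurting accuracy.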

SLIDE 40

Figure: Cascading a simple model with ResNet152 holds accuracy at 78.3 while cutting relative cost from 1.0 to 0.63: a 37% reduction in runtime at no loss in accuracy.

SLIDE 41

Cascades within a Model

Diagram: gates inserted between the Conv blocks of ResNet152 (Conv … Conv, FC) decide per query whether to continue; Query → Simple Model (fast) → Prediction, or “I don’t know” → ResNet152 (slow) → Prediction.

SLIDE 42

Cascades within a Model: Skip Blocks

Diagram: gates between Conv blocks let queries skip individual blocks entirely, so Query → Prediction without executing the full network.
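Skip blocks can be sketched as a learned gate in front of each residual block. Here the gate is a plain scoring function and the threshold is an assumption; the real ConvGate/RNNGate gates are small networks trained jointly with the model:

```python
def gated_residual_block(x, block_fn, gate_fn, threshold=0.5):
    """Execute the residual block only when the gate fires;
    otherwise take the identity shortcut and skip the compute."""
    if gate_fn(x) < threshold:
        return x                  # skip: identity shortcut only
    return x + block_fn(x)        # run the full residual computation


def gated_network(x, blocks_and_gates):
    """Chain of gated blocks: executed depth varies per input."""
    for block_fn, gate_fn in blocks_and_gates:
        x = gated_residual_block(x, block_fn, gate_fn)
    return x
```

Because the identity shortcut is already part of a residual block, skipping leaves the representation usable by later layers, which is what makes per-input depth possible.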

SLIDE 43

Cascading reduces computational cost

Figure: average depth executed on ResNet110 / ResNet74 / ResNet38 with No Gate (110.0 / 74.0 / 38.0), ConvGate (67.08 / 54.16 / 35.82), and RNNGate (67.72 / 54.69 / 34.31, roughly 40%, 28%, and 10% reductions in depth).

Similar gains on larger models

SLIDE 44

Figure: Number of layers skipped per image; easy images skip more layers, difficult images skip fewer.

SLIDE 45

Future Directions for Cascades

- Using reinforcement learning techniques to reduce gating costs
- Query triage during load spikes → forcing fractions of the network to go dark
- Irregular execution → complicates batching, creates issues for parallel execution

SLIDE 46

Clipper: Low Latency Prediction Serving System [NSDI’17]

IDK Prediction Cascades: Simple models for simple tasks [Work in Progress]

Other AI Systems Projects in RISE:
- Jarvis: Managing the Machine Learning Lifecycle
- Ray: Distributed Python for Reinforcement Learning

SLIDE 47

We are developing new technologies that will enable applications to make low-latency intelligent decisions on live data with strong security guarantees.

Joseph E. Gonzalez jegonzal@cs.berkeley.edu