RISE to the Challenges of AI Systems
Joseph E. Gonzalez
Assistant Professor, UC Berkeley
jegonzal@cs.berkeley.edu
Training
Big Data → Big Model
Large-scale parallel and distributed systems
Training systems: Splash, CoCoA, VW
How to do Research in AI Systems
Ø Manage complexity
  Ø Seek parsimony in system design
  Ø Great systems research is often about what features are taken away
  Ø Do a few things well and be composable
Ø Identify trade-offs
  Ø With each design decision, what do you gain and what do you lose?
  Ø Which trade-offs are fundamental?
Ø Evaluate your system
  Ø Positive: how fast and scalable is it, and why?
  Ø Negative: when does it fail, and what are its limitations?
Hemingway*
Modeling Throughput and Convergence for ML Workloads
Ø What is the best algorithm and level of parallelism for an ML task?
Ø Trade-off: parallelism, coordination, and convergence
Ø Research challenge: can we model this trade-off explicitly?
Shivaram Venkataraman, Xinghao Pan, Zi Zheng
*Follow-up work to Shivaram's Ernest system (NSDI'16)

Two models, one per metric:
Ø L(i, p): loss as a function of iterations i and cores p (the ML metric)
Ø I(p): iterations per second as a function of cores p (the systems metric)
We can estimate I from data on many systems, and we can estimate L from data for our problem.
Composing the two models gives the loss as a function of wall-clock time t and cores p:

    loss(t, p) = L(t · I(p), p)

Given a time budget, which algorithm will give the best result?

[Figures: training loss vs. iteration (convergence as a function of iteration), and time per iteration vs. parallelism (system performance as a function of parallelism).]
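To make the approach concrete, here is a minimal sketch of fitting I(p) and L(i, p) from profiling data and choosing the core count that minimizes predicted loss within a time budget. The function names and parametric forms are illustrative assumptions, not the models used in Hemingway or Ernest:

    import numpy as np
    from scipy.optimize import curve_fit

    def iters_per_sec(p, a, b, c):
        # Assumed throughput form: fixed cost, per-core speedup, coordination overhead.
        return 1.0 / (a + b / p + c * np.log(p + 1))

    def loss_model(ip, d, e, f):
        i, p = ip
        # Assumed convergence form: loss ~ 1/sqrt(effective iterations), where
        # higher parallelism p reduces progress per iteration.
        return d / np.sqrt(i / (1.0 + f * p)) + e

    # Fit each model from profiling runs, e.g.:
    #   I_params, _ = curve_fit(iters_per_sec, cores_seen, ips_seen)
    #   L_params, _ = curve_fit(loss_model, (iters_seen, cores_seen), loss_seen)

    def best_cores(time_budget_s, I_params, L_params, candidates):
        def predicted_loss(p):
            i = time_budget_s * iters_per_sec(p, *I_params)  # iterations completed
            return loss_model((i, p), *L_params)             # predicted final loss
        return min(candidates, key=predicted_loss)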
Hemingway: Modeling Distributed Optimization Algorithms. Xinghao Pan, Shivaram Venkataraman, Zizheng Tai, Joseph Gonzalez. NIPS'16 ML-Sys Workshop.

Takeaway: try to decouple system improvements from algorithm improvements, and use data collection plus sparse modeling to understand your system.
Learning
Big Data → Training → Big Model
Learning has traditionally produced conference papers, dashboards, and reports. Increasingly, it must drive actions.

Inference
The trained model is deployed in an application that renders decisions in response to queries:
Big Data → Training → Big Model → Application → Decision ← Query
(Learning covers the path from data to model; inference covers the path from model to decision.)
Inference is often overlooked. Timescale: ~10 milliseconds. Billions of queries a day → costly.

Why is inference challenging? We need to render low-latency (<10 ms) predictions for complex models over rich queries and features (top-K results, feature joins such as SELECT * FROM users JOIN items, click_logs, pages WHERE …) under heavy load and in the presence of system failures.
Inference is moving beyond the cloud
Mobile assistants, augmented reality, home security, home automation, self-driving cars, personal robotics

Opportunities
Ø Reduce latency and improve privacy
Ø Address network partitions

Research challenges
Ø Minimize power consumption
Ø Limited hardware and long life-cycles
Ø Develop new hybrid models that leverage both the cloud and edge devices
Robust inference is critical
Self-"parking" cars, self-"driving" cars, chat AIs
Feedback: Closing the Loop
The application's decisions generate feedback that flows back into the data used for training:
Big Data → Training (Learning) → Big Model → Application (Inference) → Decision ← Query, with feedback from the application back to the data.

Timescale: hours to weeks. Training is often re-run and is sensitive to feedback loops.
Why is closing the loop challenging?
Ø Self-reinforcing feedback loops
Ø Implicit and delayed feedback
Ø The world changes (d/dt) at varying rates
In this loop, learning must be adaptive (~1 second) and inference must be responsive (~10 ms).
Intelligence in Sensitive Contexts
Augmented reality, home monitoring, voice technologies, medical imaging
Ø High-value data is sensitive
Ø Models capture the value in data
Ø Queries can be as sensitive as the data
We must protect the data, the model, and the query.
Opaque: Analytics on Secure Enclaves
Exploit hardware support to enable computing on encrypted data.
Ø Today: a prototype system running in Apache Spark
  Ø Supports SQL queries in an untrusted cloud
  Ø ~50% reduction in performance
Ø Future: enable prediction serving on encrypted queries
[Diagram: the SQL, ML, and Graph APIs sit above Catalyst query optimization and Spark execution, with Opaque integrated into this stack.]
Wenting Zheng et al. (NSDI'17)
Adaptive. Responsive. Secure.
UC Berkeley
Clipper: A Low-Latency Online Prediction Serving System (NSDI'17)
Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, Ion Stoica
Splitting the loop by timescale: slow batch training over big data updates slow-changing parameters, while feedback updates fast-changing parameters online. Both parameter sets feed the application that turns queries into decisions.
Hybrid Offline + Online Learning
Ø Update the user weights w_u online
Ø Update the "feature" functions f(x; θ) offline using batch solvers

Common modeling structure:

    f(x; θ)^T w_u

The same structure, learned features applied to an input and combined with user weights, appears in matrix factorization (items × users), deep learning, and ensemble methods.
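A minimal sketch of this hybrid scheme, with hypothetical names: the feature function is trained offline by a batch solver and held fixed, while the per-user weights are updated online as feedback arrives:

    import numpy as np

    class HybridUserModel:
        def __init__(self, feature_fn, dim, lr=0.1):
            self.f = feature_fn        # f(x; theta): trained offline, held fixed
            self.w = np.zeros(dim)     # w_u: fast-changing per-user weights
            self.lr = lr

        def predict(self, x):
            return float(self.f(x) @ self.w)   # f(x; theta)^T w_u

        def observe(self, x, y):
            # One online SGD step on squared loss; no batch solver involved.
            features = self.f(x)
            err = float(features @ self.w) - y
            self.w -= self.lr * err * features

Because only the small weight vector changes online, each update is far cheaper than re-running the batch solver.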
Clipper Online Learning for Recommendations (simulated news recommendation)
[Plot: prediction error vs. number of feedback examples.]
Partial updates: 0.4 ms. Full retraining: 7.1 seconds. That is more than 4 orders of magnitude faster adaptation.
Clipper Serves Predictions Across ML Frameworks
Clipper sits between end-user applications (content recommendation, fraud detection, personal assistants, robotic control, machine translation) and machine learning frameworks (Caffe, Create, VW, …).
Key insight: the challenges of prediction serving can be addressed in a layer between end-user applications and machine learning frameworks.
As a result, Clipper is able to:
Ø Hide complexity, by providing a common interface to applications
Ø Bound latency and maximize throughput, through caching, adaptive batching, and model replication
Ø Enable robust online learning and personalization, through model selection and ensemble algorithms
all without modifying machine learning frameworks or front-end applications.
Clipper Architecture
Applications issue Predict and Observe calls to Clipper over an RPC/REST interface. Clipper consists of two layers (a hypothetical client interaction is sketched below):
Ø Model Selection Layer: improves accuracy through bandit methods, ensembles, online learning, and personalization, and supports anytime predictions
Ø Model Abstraction Layer: provides a common interface to models while bounding latency and maximizing throughput (caching, adaptive batching)
Beneath the abstraction layer, each model (Caffe and others) runs behind a model wrapper (MW) reached over RPC.
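For illustration, a client interaction with the RPC/REST interface might look like the following. The host, port, endpoint paths, and payload shapes here are hypothetical assumptions, not Clipper's documented API:

    import requests

    CLIPPER = "http://localhost:1337"   # hypothetical address

    # Predict: request a decision for a query.
    resp = requests.post(f"{CLIPPER}/myapp/predict",
                         json={"input": [0.1, 0.4, 0.9]})
    prediction = resp.json()

    # Observe: report feedback so the selection layer can adapt.
    requests.post(f"{CLIPPER}/myapp/observe",
                  json={"input": [0.1, 0.4, 0.9], "label": 1.0})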
Model Abstraction Layer
Caching and adaptive batching sit above the RPC model wrappers.
Ø Models run in separate processes as Docker containers
  Ø Resource isolation
  Ø Scaling under heavy load, by replicating model wrappers
Problem: ML frameworks are optimized for batch processing, not latency, and a single page load may generate many queries.
Adaptive Batching to Improve Throughput
Ø The optimal batch size depends on:
  Ø Hardware configuration
  Ø Model and framework
  Ø System load
Ø Why batching helps: it exploits hardware acceleration and amortizes system overhead
Clipper solution: be as slow as allowed. The application specifies a latency objective, and Clipper uses a TCP-like tuning algorithm to increase the batch size until latency reaches the objective (see the sketch below).
[Plot: throughput (queries per second) and latency (ms) vs. batch size for a TensorFlow conv-net on a GPU; the optimal batch size is the largest that still meets the latency deadline.]
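A minimal sketch of the TCP-like tuning loop as an additive-increase/multiplicative-decrease (AIMD) scheme. The queue and model APIs and the constants are assumptions for illustration, not Clipper's implementation:

    import time

    def serve_loop(queue, model, latency_slo_s):
        # queue.take and model.predict_batch are assumed APIs for this sketch.
        batch_size = 1
        while True:
            batch = queue.take(max_items=batch_size)  # up to batch_size queries
            start = time.perf_counter()
            model.predict_batch(batch)
            elapsed = time.perf_counter() - start
            if elapsed <= latency_slo_s:
                batch_size += 1                              # additive increase
            else:
                batch_size = max(1, int(batch_size * 0.9))   # multiplicative decrease

Each time a batch finishes under the latency objective the batch size grows by one; when it overshoots, the size is cut back multiplicatively, so the system hovers just below the objective.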
Overhead of modularity?
[Plots: throughput (QPS, higher is better) and p99 latency (µs, lower is better) for Clipper vs. TensorFlow-Serving; roughly 20,000-40,000 QPS is good enough.]
The decoupled Clipper architecture can be as fast as the in-process approach adopted by TensorFlow-Serving.
Approximate Caching to Reduce Latency
Ø Opportunity for caching: popular items may be evaluated frequently
Ø Need for approximation: high-dimensional, continuous-valued queries (e.g., bag-of-words text, images) have low exact cache hit rates
Clipper solution: approximate caching. Apply locality-sensitive hash functions so that similar queries hit the same cache entry, accepting some cache-hit error.
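A minimal sketch of what locality-sensitive caching could look like, using random-hyperplane (SimHash) signatures; this is an illustrative design, not Clipper's implementation:

    import numpy as np

    class ApproximateCache:
        def __init__(self, dim, n_bits=16, seed=0):
            rng = np.random.default_rng(seed)
            self.planes = rng.normal(size=(n_bits, dim))  # random hyperplanes
            self.store = {}

        def _key(self, x):
            # SimHash: the sign pattern of projections buckets similar vectors.
            return tuple(bool(v) for v in (self.planes @ x > 0))

        def get(self, x):
            return self.store.get(self._key(x))  # None on a cache miss

        def put(self, x, prediction):
            self.store[self._key(x)] = prediction

Similar inputs share a signature with high probability, so near-duplicate queries can reuse a cached prediction; dissimilar inputs that collide produce the cache-hit errors noted above.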
Model Selection Layer
Goal: maximize accuracy through bandits, ensembles, online learning, and personalization.
The selection policy incorporates feedback in real time to achieve:
Ø Robust predictions, by adaptively combining predictions from multiple models and frameworks
Ø Online learning and personalization, by selecting and personalizing predictions in response to feedback
The model selection policy improves prediction accuracy by:
Ø Combining predictions from multiple frameworks
  Ø Ensemble methods
Ø Incorporating real-time feedback
  Ø Personalized ensembles
  Ø Bandit algorithms (see the sketch below)
Ø Estimating the confidence of predictions
  Ø Agreement between models
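As a concrete illustration of the bandit component, here is a minimal Exp3-style sketch for choosing among models from reward feedback. This is an assumption-laden sketch; the deployed selection policies differ in their details:

    import numpy as np

    class Exp3Selector:
        def __init__(self, n_models, gamma=0.1, seed=0):
            self.w = np.ones(n_models)   # one weight per candidate model
            self.gamma = gamma           # exploration rate
            self.rng = np.random.default_rng(seed)
            self.probs = None
            self.last = None

        def select(self):
            k = len(self.w)
            # Mix the weight distribution with uniform exploration.
            self.probs = (1 - self.gamma) * self.w / self.w.sum() + self.gamma / k
            self.last = int(self.rng.choice(k, p=self.probs))
            return self.last

        def feedback(self, reward):
            # reward in [0, 1] for the model chosen by the last select() call.
            k = len(self.w)
            est = reward / self.probs[self.last]   # importance-weighted estimate
            self.w[self.last] *= np.exp(self.gamma * est / k)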
Ensemble Prediction Accuracy (ImageNet)
Clipper builds an ensemble over a sequence of pre-trained models:

System       Model          Error Rate   # Errors
Caffe        VGG            13.05%       6525
Caffe        LeNet          11.52%       5760
Caffe        ResNet          9.02%       4512
TensorFlow   Inception v3    6.18%       3088
Clipper      Ensemble        5.86%       2930

A 5.2% relative reduction in error over the best single model!
Ensemble Methods Create Stragglers
An ensemble mixing a slow-changing model (e.g., Caffe) with a fast-changing linear model is only as fast as its slowest member: one straggler can blow a 20 ms latency budget for the application.
Solution: anytime predictions. Replace each missing prediction with an estimator of it. For example, if the TensorFlow model straggles, substitute its expected prediction:

    w_scikit · f_scikit(x) + w_Caffe · f_Caffe(x) + w_TF · E_X[f_TF(X)]

Anytime predictions:
Ø Tolerate some loss of models
Ø Accuracy depends heavily on the ensemble
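A minimal sketch of anytime prediction with straggler substitution. The names are hypothetical, and expected[k] is assumed to hold a running estimate of E_X[f_k(X)]:

    import time
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def anytime_predict(pool, models, weights, expected, x, deadline_s):
        # pool: a shared ThreadPoolExecutor; models: callables f_k(x).
        start = time.perf_counter()
        futures = [pool.submit(f, x) for f in models]
        total = 0.0
        for k, fut in enumerate(futures):
            remaining = deadline_s - (time.perf_counter() - start)
            try:
                total += weights[k] * fut.result(timeout=max(0.0, remaining))
            except TimeoutError:
                total += weights[k] * expected[k]   # straggler: use E_X[f_k(X)]
        return total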
Ensembles to Estimate Confidence
Agreement among models across frameworks (Create, VW, Caffe, …) lets Clipper estimate the confidence of a prediction.
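One way agreement can be turned into a confidence score, as an illustrative sketch for a classification ensemble:

    from collections import Counter

    def ensemble_confidence(predictions):
        # predictions: one class label per ensemble member.
        label, votes = Counter(predictions).most_common(1)[0]
        return label, votes / len(predictions)   # agreement fraction as confidence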
Clipper is a prediction serving system that spans multiple ML frameworks and is designed to:
Ø Simplify model serving
Ø Bound latency and increase throughput
Ø Enable real-time learning and personalization across machine learning frameworks
“Clipper: A Low-Latency Online Prediction Serving System” https://github.com/ucbrise/clipper (open source)
Ongoing Clipper Subprojects
Ø Adaptive batching for prediction
  Ø Leverage internal data-parallelism and hardware acceleration
Ø Approximate caching
  Ø Detect "similar" queries and reuse cached predictions
Ø Prediction cascades
  Ø Automatically derive cascades of increasingly GPU-intensive models
Ø RL/control
  Ø Serve and update RL policies based on feedback
Ø Scheduling and resource allocation
  Ø Reduce the need to over-provision for bursty workloads
UC Berkeley
We are developing new technologies that enable applications to make low-latency intelligent decisions on live data with strong security guarantees.
Adaptive. Responsive. Secure.