Learning Systems: Research at the Intersection of Machine Learning & Data Systems - PowerPoint PPT Presentation



SLIDE 1

Joseph E. Gonzalez

  • Asst. Professor, UC Berkeley

jegonzal@cs.berkeley.edu

Learning Systems

Research at the Intersection of

Machine Learning & Data Systems

SLIDE 2

Learning Systems

How can machine learning techniques be used to address systems challenges? How can systems techniques be used to address machine learning challenges?

SLIDE 3

Learning Systems

How can machine learning techniques be used to address systems challenges? How can systems techniques be used to address machine learning challenges?

SLIDE 4

How can machine learning techniques be used to address systems challenges?

Systems are getting increasingly complex:

  • Resource Disaggregation → growing diversity of system configurations and freedom to add resources as needed
  • New Pricing Models → dynamic pricing and potential to bid for different types of resources
  • Data-centric Workloads → performance depends on interaction between system, algorithms, and data

SLIDE 5

Paris

Performance Aware Runtime Inference System

  • What VM type should I use to run my experiment?

Neeraja Yadwadkar Bharath Hariharan Randy Katz

t2.nano t2.micro t2.small t2.medium t2.large m4.large m4.xlarge m4.2xlarge m4.4xlarge m4.10xlarge m3.medium m3.large m3.xlarge m3.2xlarge c4.large c4.xlarge c4.2xlarge c4.4xlarge c4.8xlarge x1.32xlarge r3.large r3.xlarge r3.2xlarge r3.4xlarge r3.8xlarge g2.2xlarge g2.8xlarge

SLIDE 6

Paris

Performance Aware Runtime Inference System

  • What VM type should I use to run my experiment?

Neeraja Yadwadkar Bharath Hariharan Randy Katz

t2.nano t2.micro t2.small t2.medium t2.large m4.large m4.xlarge m4.2xlarge m4.4xlarge m4.10xlarge m3.medium m3.large m3.xlarge m3.2xlarge c4.large c4.xlarge c4.2xlarge c4.4xlarge c4.8xlarge x1.32xlarge r3.large r3.xlarge r3.2xlarge r3.4xlarge r3.8xlarge g2.2xlarge g2.8xlarge

54 Instance Types

SLIDE 7

Paris

Performance Aware Runtime Inference System

  • What VM type should I use to run my experiment?
  • Answer: workload specific and depends on cost & runtime goals

Neeraja Yadwadkar Bharath Hariharan Randy Katz

t2.nano t2.micro t2.small t2.medium t2.large m4.large m4.xlarge m4.2xlarge m4.4xlarge m4.10xlarge m3.medium m3.large m3.xlarge m3.2xlarge c4.large c4.xlarge c4.2xlarge c4.4xlarge c4.8xlarge x1.32xlarge r3.large r3.xlarge r3.2xlarge r3.4xlarge r3.8xlarge g2.2xlarge g2.8xlarge


SLIDE 8

Paris

Performance Aware Runtime Inference System

  • Best VM type depends on workload as well as cost & runtime goals

Neeraja Yadwadkar Bharath Hariharan Randy Katz

[Figure: price vs. runtime for each VM type. Which VM will cost me the least? Is m1.small cheapest?]

SLIDE 9

Paris

Performance Aware Runtime Inference System

  • Best VM type depends on workload as well as cost & runtime goals

Neeraja Yadwadkar Bharath Hariharan Randy Katz

[Figure: job cost as a function of runtime and price]

Requires accurate runtime prediction.

SLIDE 10

Paris

Performance Aware Runtime Inference System

  • Goal: Predict the runtime of workload w on VM type v

  • Challenge: How do we model workloads and VM types?

  • Insight:
      • Extensive benchmarking to model relationships between VM types (costly, but run once for all workloads)
      • Lightweight workload “fingerprinting” on a small set of test VMs
      • Generalize workload performance to other VMs

  • Results: runtime prediction with 17% relative RMSE (vs. 56% baseline)

Neeraja Yadwadkar Bharath Hariharan Randy Katz

[Figure: offline benchmarking across VM types (vm1, vm2, …, vm100) plus per-workload fingerprinting]
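The two-phase idea above can be sketched in a few lines. Everything below is a hypothetical illustration rather than the PARIS implementation: the scaling ratios, VM names, and prices are made-up stand-ins, and PARIS itself uses a richer fingerprint and a learned regression model rather than a single reference ratio.

```python
# Hypothetical sketch of the PARIS idea: predict a workload's runtime on
# every VM type from a cheap "fingerprint" run on a few test VMs, using
# scaling relationships learned once from offline benchmarking.

# Offline benchmarking (run once, for all workloads): average runtime
# ratio of each VM type relative to a reference VM. Numbers illustrative.
scaling_vs_reference = {
    "t2.medium": 1.0,   # reference VM
    "m4.large":  0.7,
    "c4.xlarge": 0.45,
    "r3.large":  0.8,
}

# Hourly prices (illustrative, not real AWS prices).
price_per_hour = {"t2.medium": 0.05, "m4.large": 0.10,
                  "c4.xlarge": 0.20, "r3.large": 0.17}

def predict_runtimes(fingerprint_runtime_s, fingerprint_vm="t2.medium"):
    """Extrapolate runtime on all VM types from one fingerprint run."""
    ref = fingerprint_runtime_s / scaling_vs_reference[fingerprint_vm]
    return {vm: ref * s for vm, s in scaling_vs_reference.items()}

def cheapest_vm(fingerprint_runtime_s, deadline_s):
    """Pick the lowest-cost VM whose predicted runtime meets the deadline."""
    runtimes = predict_runtimes(fingerprint_runtime_s)
    feasible = {vm: (t / 3600.0) * price_per_hour[vm]
                for vm, t in runtimes.items() if t <= deadline_s}
    return min(feasible, key=feasible.get) if feasible else None
```

With a loose deadline the slower-but-cheap VM wins; tightening the deadline pushes the choice toward faster, pricier instances, which is exactly the cost/runtime trade-off the slides describe.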

SLIDE 11

Hemingway*

Modeling Throughput and Convergence for ML Workloads

  • What is the best algorithm and level of parallelism for an ML task?

  • Trade-off: Parallelism, Coordination, & Convergence

  • Research challenge: Can we model this trade-off explicitly?

Shivaram Venkataraman, Xinghao Pan, Zi Zheng

*Follow-up work to Shivaram's Ernest paper

L(i, p): loss as a function of iterations i and cores p
  (ML metric: loss vs. iterations)

I(p): iterations per second as a function of cores p
  (systems metric: iterations/sec vs. cores)

We can estimate I from data on many systems; we can estimate L from data for our problem.

SLIDE 12

Hemingway*

Modeling Throughput and Convergence for ML Workloads

  • What is the best algorithm and level of parallelism for an ML task?

  • Trade-off: Parallelism, Coordination, & Convergence

  • Research challenge: Can we model this trade-off explicitly?

Shivaram Venkataraman, Xinghao Pan, Zi Zheng

L(i, p): loss as a function of iterations i and cores p
I(p): iterations per second as a function of cores p

loss(t, p) = L(t · I(p), p)

  • How long does it take to get to a given loss?
  • Given a time budget and number of cores, which algorithm will give the best result?

*follow-up work to Shivaram’s Ernest paper
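Combining the two fitted curves as loss(t, p) = L(t · I(p), p) makes the questions on the slide computable. The sketch below assumes simple illustrative functional forms for L and I (they are not the models fitted in Hemingway or Ernest): 1/√i convergence with more cores reducing per-iteration progress, and sub-linear iteration throughput.

```python
# Illustrative sketch: combine a fitted convergence curve L(i, p) with a
# fitted throughput curve I(p) to reason about wall-clock time to a
# target loss. Functional forms and constants are assumptions.
import math

def L(i, p):
    # Loss after i iterations on p cores: more cores -> less progress
    # per iteration (e.g. staler / averaged updates).
    return 1.0 / math.sqrt(i / (1.0 + 0.1 * p))

def I(p):
    # Iterations per second on p cores: sub-linear scaling.
    return 5.0 * p / (1.0 + 0.05 * p)

def loss_at_time(t, p):
    return L(t * I(p), p)

def time_to_loss(target, p, step=0.1):
    """How long until the loss drops below `target` on p cores?"""
    t = step
    while loss_at_time(t, p) > target:
        t += step
    return t

def best_parallelism(target, core_options):
    """Which core count reaches the target loss soonest?"""
    return min(core_options, key=lambda p: time_to_loss(target, p))
```

Under these assumed curves the optimum is interior: too few cores are slow in iterations/sec, too many cores waste each iteration, which is precisely the parallelism/coordination/convergence trade-off the slide names.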
SLIDE 13

Deep Code Completion

Neural architectures for reasoning about programs

  • Goals:
      • Smart naming of variables and routines
      • Learn coding styles and patterns
      • Predict large code fragments

  • Char and Symbol LSTMs

  • Programs are more tree-shaped…

Xin Wang, Chang Liu, Dawn Song

  def fib(x):
      if x < 2:
          return x
      else:
          y = fib(x-1)
          return y + fib(x-2)

SLIDE 14

Deep Code Completion

Neural architectures for reasoning about programs

  • Goals:
      • Smart naming of variables and routines
      • Learn coding styles and patterns
      • Predict large code fragments

  • Char and Symbol LSTMs

  • Programs are more tree-shaped…

Xin Wang, Chang Liu, Dawn Song

  def fib(x):
      if x < 2:
          return x
      else:
          y = fib(x-1)
          return y + fib(x-2)

Parse Tree

SLIDE 15

Deep Code Completion

Neural architectures for reasoning about programs

  • Goals:
      • Smart naming of variables and routines
      • Learn coding styles and patterns
      • Predict large code fragments

  • Char and Symbol LSTMs
  • Exploring Tree-LSTMs

  • Issue: dependencies flow in both directions

Xin Wang, Chang Liu, Dawn Song

  def fib(x):
      if x < 2:
          return x
      else:
          y = fib(x-1)
          return y + fib(x-2)

Parse Tree

Kai Sheng Tai, Richard Socher, Christopher D. Manning. “Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks.” (ACL 2015)
SLIDE 16

Deep Code Completion

Neural architectures for reasoning about computer programs

  • Goals:
      • Smart naming of variables and routines
      • Learn coding styles and patterns
      • Predict large code fragments

  • Currently studying Char-LSTM and Tree-LSTM on benchmark C++ and JavaScript code
  • Plan to extend Tree-LSTM with downward information flow

Xin Wang Chang Liu Dawn Song

Vanilla LSTM vs. Tree-LSTM
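As a concrete reference point, the Child-Sum Tree-LSTM (Tai, Socher & Manning 2015, cited on the previous slide) composes a node's state from its children bottom-up; the per-child forget gates are what distinguish it from a sequential LSTM. The sketch below is a scalar (1-dimensional) toy with made-up weights, not a trained model, and it also illustrates the issue noted above: information flows only upward through the tree here.

```python
# A minimal scalar (1-dimensional) Child-Sum Tree-LSTM cell. Real models
# use vector states and learned weight matrices; the scalar weights here
# are illustrative stand-ins.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative scalar parameters (a real cell learns matrices).
W = dict(i=0.5, f=0.5, o=0.5, u=0.5)   # input weights
U = dict(i=0.3, f=0.3, o=0.3, u=0.3)   # recurrent (child) weights
b = dict(i=0.0, f=1.0, o=0.0, u=0.0)   # biases (positive forget bias)

def tree_lstm(x, children):
    """children: list of (h, c) pairs from child nodes."""
    h_sum = sum(h for h, _ in children)
    i = sigmoid(W["i"] * x + U["i"] * h_sum + b["i"])
    o = sigmoid(W["o"] * x + U["o"] * h_sum + b["o"])
    u = math.tanh(W["u"] * x + U["u"] * h_sum + b["u"])
    # One forget gate per child: the key difference from a chain LSTM.
    c = i * u + sum(sigmoid(W["f"] * x + U["f"] * h + b["f"]) * ck
                    for h, ck in children)
    h = o * math.tanh(c)
    return h, c

# Fold a parse tree bottom-up: node = (token_embedding, [subtrees]).
def encode(node):
    x, subtrees = node
    return tree_lstm(x, [encode(t) for t in subtrees])

# Tiny tree for an expression like `x < 2`: one compare node, two leaves.
tree = (0.2, [(1.0, []), (0.5, [])])
h, c = encode(tree)
```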

SLIDE 17

Fun Code Sample Generated by Char-LSTM

[Figure: code prefix and the generated code sample]

For now, the neural network can learn some code patterns, like matching parentheses and if-else blocks, but the variable-naming issue still hasn't been solved. *This model is trained on the LeetCode OJ code submissions from GitHub.

SLIDE 18

Learning Systems

How can machine learning techniques be used to address systems challenges? How can systems techniques be used to address machine learning challenges?

SLIDE 19

Learning Systems

How can machine learning techniques be used to address systems challenges? How can systems techniques be used to address machine learning challenges?

SLIDE 20

Big Data

Big Model

Training

Systems for Machine Learning

Timescale: minutes to days
Systems: offline and batch optimized
Heavily studied… the primary focus of ML research

SLIDE 21

Big Data

Big Model

Training

Splash

CoCoA

Please make a Logo!
SLIDE 22

Big Data

Big Model

Training

Splash

CoCoA

Please make a Logo!

Temgine

SLIDE 23

Temgine

A Scalable Multivariate Time Series Analysis Engine

Francois Belletti, Evan Sparks, Xin Wang

[Figure: three sensors sampled over time. Regularly sampled series (t0, t1, …, t6) are easy to align (requires sorting); irregularly sampled series are difficult to align.]

Challenge:
  • Estimate second-order statistics (e.g. auto-correlation, auto-regressive models, …) for high-dimensional & irregularly sampled time series

SLIDE 24

Temgine

A Scalable Multivariate Time Series Analysis Engine

Challenge:
  • Estimate second-order statistics (e.g. auto-correlation, auto-regressive models, …) for high-dimensional & irregularly sampled time series

Francois Belletti, Evan Sparks, Xin Wang

[Figure: irregularly sampled sensor series (difficult to align!)]

Solution:
  • Project onto a Fourier basis (does not require data alignment)
  • Infer statistics in the frequency domain (equivalent to kernel smoothing; enables analysis of the bias–variance tradeoff)
SLIDE 25

Temgine

A Scalable Multivariate Time Series Analysis Engine

Challenge:
  • Estimate second-order statistics (e.g. auto-correlation, auto-regressive models, …) for high-dimensional & irregularly sampled time series

Francois Belletti, Evan Sparks, Xin Wang

Solution:
  • Project onto a Fourier basis (does not require data alignment)
  • Infer statistics in the frequency domain (equivalent to kernel smoothing; enables analysis of the bias–variance tradeoff)

Define an operator DAG (like TensorFlow) and then rely on query optimization to derive efficient execution.
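The Fourier-projection idea can be sketched as a naive non-uniform DFT: samples are projected onto complex exponentials without any alignment, and a second-order statistic (the power spectrum, whose inverse transform is the autocovariance) is read off in the frequency domain. This is an illustrative sketch, not the Temgine implementation; the smoothing and bias-variance machinery the slides mention is omitted.

```python
# Naive non-uniform DFT sketch: irregular samples need no alignment to
# be projected onto a Fourier basis.
import cmath, math, random

def nudft(times, values, freqs):
    """Project samples (t_k, y_k) onto e^{-2*pi*i*f*t} for each f."""
    return [sum(y * cmath.exp(-2j * math.pi * f * t)
                for t, y in zip(times, values)) / len(times)
            for f in freqs]

def power_spectrum(times, values, freqs):
    """Periodogram estimate of second-order structure."""
    return [abs(c) ** 2 for c in nudft(times, values, freqs)]

# Irregularly sampled 2 Hz sine: the spectrum should peak near f = 2.
random.seed(0)
ts = sorted(random.uniform(0, 10) for _ in range(500))
ys = [math.sin(2 * math.pi * 2.0 * t) for t in ts]
freqs = [0.5, 1.0, 2.0, 3.0, 4.0]
spec = power_spectrum(ts, ys, freqs)
```

Note there is no resampling or interpolation step anywhere: the irregular timestamps enter the basis functions directly, which is the point of the approach.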


SLIDE 26

Big Data

Big Model

Training

Learning

SLIDE 27

Big Data

Big Model

Training

Application

Decision Query


Learning Inference

SLIDE 28

Big Data Training

Learning Inference

Big Model Application

Decision Query

Timescale: ~10 milliseconds
Systems: online and latency optimized
Less studied…

SLIDE 29

Why is this challenging?

Need to render low-latency (< 10 ms) predictions for complex models and queries under heavy load with system failures.

[Figure: inference pipeline: a feature query (SELECT * FROM users JOIN items, click_logs, pages WHERE …) feeds multiple models whose predictions are combined into a top-K result]

SLIDE 30

Big Data Training

Learning Inference

Big Model Application

Decision Query

Timescale: ~10 milliseconds
Systems: online and latency optimized
Less studied…

Claim: the next big area of research in scalable ML systems

SLIDE 31

Big Data

Big Model

Training

Application

Decision Query

Learning Inference

Feedback

SLIDE 32

Big Data Training

Application

Decision

Learning Inference

Feedback

Timescale: hours to weeks
Issues: no standard solutions… implicit feedback, sample bias, …

SLIDE 33

Why is this challenging?

  • Exposes the system to feedback loops
  • Address the explore–exploit trade-off in real time
  • Adversarial feedback
  • Opportunities for multi-task learning and anomaly detection
  • Need to address temporal variation
  • Need to model time directly? When do we forget the past?

Feedback

SLIDE 34

Big Data

Big Model

Training

Application

Decision Query

Learning Inference

Feedback

SLIDE 35

Big Data

Big Model

Training

Application

Decision Query

Learning Inference

Feedback

Responsive (~10 ms), Adaptive (~1 second)

SLIDE 36

Learning Inference

Responsive (~10 ms), Adaptive (~1 second)

Techniques we are studying (or should be …):

Multi-task Learning, Anytime Inference, Adaptive Batching, Approx. Caching, Model Switching, Meta-Policy RL, Load Shedding, Model Compression, Online Ensemble Learning, Inference on the Edge
SLIDE 37

Daniel Crankshaw Xin Wang Michael Franklin Ion Stoica

Prediction Serving

Giulio Zhou

SLIDE 38

Big Data Training

Application

Decision Query

Learning Inference

Feedback

SLIDE 39

Big Data Training

Application

Decision Query

Learning Inference

Feedback Slow

Slow Changing Parameters Fast Changing Parameters

SLIDE 40

Hybrid Offline + Online Learning

Update the user weights online:

  • Simple to train + more robust model
  • Address rapidly changing user statistics

Update feature functions offline using batch solvers

  • Leverage high-throughput systems (TensorFlow)
  • Exploit slow change in population statistics

f(x; θ)^T w_u
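A minimal sketch of this split for a model of the form f(x; θ)ᵀ · w_u: the feature function f (parameters θ) is retrained offline in batch, while the per-user weights w_u get cheap online SGD updates as feedback arrives. The fixed hand-written feature map, learning rate, and toy user data below are all illustrative assumptions, not the actual system.

```python
# Hybrid offline + online learning sketch for y ≈ f(x; θ)ᵀ · w_u.

def features(x):
    # Stand-in for the slowly retrained feature function f(x; θ).
    return [1.0, x, x * x]

def predict(w_u, x):
    return sum(wi * fi for wi, fi in zip(w_u, features(x)))

def online_update(w_u, x, y, lr=0.1):
    """One SGD step on squared error for a single user's weights w_u."""
    err = predict(w_u, x) - y
    return [wi - lr * err * fi for wi, fi in zip(w_u, features(x))]

# A user whose true preference is y = 2x: feedback events adapt w_u
# with sub-millisecond updates, with no retraining of f.
w_u = [0.0, 0.0, 0.0]
for _ in range(3000):
    for x, y in [(0.0, 0.0), (0.5, 1.0), (1.0, 2.0)]:
        w_u = online_update(w_u, x, y)
```

Each `online_update` touches only that user's small weight vector, which is why partial updates can be orders of magnitude faster than batch retraining of θ.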

SLIDE 41

Common modeling structure

f(x; θ)^T w_u

[Figure: this structure arises in matrix factorization (items × users), deep learning, and ensemble methods]

SLIDE 42

Clipper Online Learning for Recommendations

(Simulated News Rec.)

[Figure: error vs. number of examples]

Partial updates: 0.4 ms; retraining: 7.1 seconds

>4 orders-of-magnitude faster adaptation

SLIDE 43

Big Data

Application

Learning Inference

Feedback Slow

Slow Changing Parameters Fast Changing Parameters

SLIDE 44

Caffe

Big Data

Application

Learning Inference

Feedback Slow

Slow Changing Parameters Fast Changing Parameters

Clipper

SLIDE 45

Clipper Serves Predictions across ML Frameworks

Clipper

Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation

Create

VW Caffe

SLIDE 46

Clipper Architecture

Clipper

Applications Predict Observe RPC/REST Interface

VW Caffe

Create

SLIDE 47

Clipper Architecture

Clipper Caffe

Applications Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC

SLIDE 48

Clipper Architecture

Clipper Caffe

Applications Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Model Abstraction Layer

Provide a common interface to models while bounding latency and maximizing throughput.

Model Selection Layer

Improve accuracy through ensembles, online learning, and personalization
SLIDE 49

Clipper Architecture

Clipper Caffe

Applications Predict Observe RPC/REST Interface

Model Wrapper (MW) MW MW MW

RPC RPC RPC RPC Model Selection Layer

Anytime Predictions

Model Abstraction Layer

Approximate Caching Adaptive Batching

SLIDE 50

A single page load may generate many queries

Adaptive Batching to Improve Throughput

  • Optimal batch size depends on:
      • hardware configuration
      • model and framework
      • system load

Clipper solution: be as slow as allowed…
  • Application specifies a latency objective
  • Clipper uses a TCP-like tuning algorithm to increase the batch size (and hence latency) up to the objective
  • Why batching helps: hardware acceleration, and amortizing system overhead
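The TCP-like tuning can be sketched as additive-increase / multiplicative-decrease (AIMD) on the batch size against the application's latency objective. The latency model and constants below are assumptions for illustration, not Clipper's actual tuner.

```python
# AIMD batch-size tuner sketch: grow the batch additively while measured
# latency stays under the SLO; back off multiplicatively when exceeded.

def serve_latency_ms(batch_size):
    # Hypothetical latency model: fixed overhead plus per-query cost
    # (batching amortizes the overhead across queries).
    return 5.0 + 0.4 * batch_size

class AIMDBatcher:
    def __init__(self, slo_ms, increase=1, backoff=0.9):
        self.slo_ms = slo_ms
        self.increase = increase
        self.backoff = backoff
        self.batch_size = 1

    def observe(self, latency_ms):
        if latency_ms <= self.slo_ms:
            # Additive increase while we are under the objective.
            self.batch_size += self.increase
        else:
            # Multiplicative decrease once the objective is exceeded.
            self.batch_size = max(1, int(self.batch_size * self.backoff))

b = AIMDBatcher(slo_ms=20.0)
for _ in range(200):
    b.observe(serve_latency_ms(b.batch_size))
# The batch size settles near the largest value meeting the 20 ms SLO.
```

With these constants the feasible region is batch sizes up to 37 (5 + 0.4·37 = 19.8 ms), and the tuner oscillates just below that boundary, maximizing throughput while honoring the objective.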

SLIDE 51

[Figure: TensorFlow conv. net (GPU): throughput (queries per second) and latency (ms) vs. batch size (queries); the latency deadline determines the optimal batch size]

SLIDE 52

Approximate Caching to Reduce Latency

Popular items may be evaluated frequently: an opportunity for caching.
High-dimensional, continuous-valued queries have low cache hit rates: the need for approximation.

Clipper solution: approximate caching, applying locality-sensitive hash functions.

[Figure: cache hits, cache misses, and cache-hit errors for bag-of-words and image queries]
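A minimal sketch of the idea using random-hyperplane (SimHash-style) hashing: nearby feature vectors tend to collide, so a cached prediction can serve a near-duplicate query, at the cost of occasional cache-hit errors. The dimensions, hash width, and stand-in model are illustrative assumptions, not Clipper's actual scheme.

```python
# Approximate caching via a locality-sensitive hash: similar inputs map
# to the same bucket, letting a cached prediction answer a query that is
# close (but not identical) to one already served.
import random

random.seed(42)
DIM, BITS = 8, 4
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def lsh_key(x):
    """Sign pattern of x against the random hyperplanes."""
    return tuple(int(sum(p * xi for p, xi in zip(plane, x)) >= 0)
                 for plane in planes)

cache = {}

def predict_with_cache(x, model):
    key = lsh_key(x)
    if key in cache:
        return cache[key], True          # (approximate) cache hit
    y = model(x)                         # expensive model evaluation
    cache[key] = y
    return y, False

model = lambda x: sum(x)                 # stand-in for a costly model
x1 = [1.0] * DIM
x2 = [1.01] * DIM                        # near-duplicate query
y1, hit1 = predict_with_cache(x1, model)
y2, hit2 = predict_with_cache(x2, model)
```

The second query hits the cache because it falls in the same hyperplane bucket as the first; the small difference between the true and cached prediction is exactly the "cache hit error" pictured on the slide.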


SLIDE 55

Caffe

Slow Changing Model Fast Changing Linear Model

Clipper

Anytime Predictions

Application

[Figure: only some model predictions (✓ ✓) return within the application's 20 ms budget]

Solution: replace the missing prediction with an estimator E[f(x)]

SLIDE 56

Caffe

Slow Changing Model Fast Changing Model

Anytime Predictions

w_scikit · f_scikit(x) + w_TF · E_X[f_TF(X)] + w_Caffe · f_Caffe(x)
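A sketch of the substitution: any component model that misses the latency deadline contributes its precomputed mean prediction E[f(X)] instead, so the weighted ensemble still returns on time. Model names, weights, and expectations below are illustrative stand-ins, not measured values.

```python
# Anytime-prediction sketch: substitute E[f(X)] for straggler models so
# the ensemble meets its deadline with whatever predictions arrived.

def anytime_ensemble(x, models, weights, expectations, arrived):
    """Combine predictions that arrived in time; use the precomputed
    mean prediction for any model that missed the deadline."""
    total = 0.0
    for name, w in weights.items():
        if name in arrived:
            total += w * models[name](x)
        else:
            total += w * expectations[name]   # fallback: E[f(X)]
    return total

models = {"scikit": lambda x: x + 1, "tf": lambda x: 2 * x,
          "caffe": lambda x: x - 1}
weights = {"scikit": 0.3, "tf": 0.5, "caffe": 0.2}
expectations = {"scikit": 1.0, "tf": 0.0, "caffe": -1.0}  # per-model E[f(X)]

full = anytime_ensemble(2.0, models, weights, expectations,
                        arrived={"scikit", "tf", "caffe"})
degraded = anytime_ensemble(2.0, models, weights, expectations,
                            arrived={"scikit", "caffe"})  # "tf" missed
```

Substituting the mean keeps the combined prediction unbiased with respect to the missing component while guaranteeing a bounded response time.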

SLIDE 57

Comparison to TensorFlow Serving

Takeaway: Clipper is able to match the average latency of TensorFlow Serving while reducing tail latency (2x) and improving throughput (2x)

SLIDE 58

Evaluation of Throughput Under Heavy Load

[Figure: accuracy and throughput (queries per second) under increasing load]

Takeaway: Clipper is able to gracefully degrade accuracy to maintain availability under heavy load.

SLIDE 59

Improved Prediction Accuracy (ImageNet)

System      Model          Error Rate   # Errors
Caffe       VGG            13.05%       6525
Caffe       LeNet          11.52%       5760
Caffe       ResNet         9.02%        4512
TensorFlow  Inception v3   6.18%        3088

sequence of pre-trained models

SLIDE 60

Improved Prediction Accuracy (ImageNet)

System      Model          Error Rate   # Errors
Caffe       VGG            13.05%       6525
Caffe       LeNet          11.52%       5760
Caffe       ResNet         9.02%        4512
TensorFlow  Inception v3   6.18%        3088
Clipper     Ensemble       5.86%        2930

5.2% relative improvement in prediction accuracy!

SLIDE 61

Clipper

Create

VW Caffe

Clipper: a prediction serving system that spans multiple machine learning frameworks and is designed to
  • simplify model serving,
  • bound latency and increase throughput,
  • and enable real-time learning and personalization.

SLIDE 62

Joseph E. Gonzalez

773 Soda Hall jegonzal@cs.berkeley.edu

Learning Systems

Graduate student collaborators on this work:

Daniel Crankshaw, Xin Wang, Ankur Dave, Neeraja Yadwadkar, Xinghao Pan, Wenting Zheng, Francois Belletti

SLIDE 63

RISE

Real-time, Intelligent, and Secure Systems Lab

SLIDE 64

AMP Lab: from batch data to advanced analytics
RISE Lab: from live data to real-time decisions

SLIDE 65

Goal

Real-time decisions:
  • decide in ms
  • on live data: the current state as data arrives
  • with strong security: privacy, confidentiality, and integrity

SLIDE 66

RISE

Real-time, Intelligent, and Secure Systems Lab

Learn More:

  • CS294 Course on RISE Topics

https://ucbrise.github.io/cs294-rise-fa16/

  • Early RISErs Seminar on Mondays at 9:30 AM
SLIDE 67

Security: Protecting Models

Data is a core asset & models capture the value in data:
  • Expensive: many engineering & compute hours to develop
  • Models can reveal private information about the data

How do we protect models from being stolen?
  • Prevent them from being copied from devices (DRM? SGX?)
  • Defend against active learning attacks on decision boundaries

How do we identify when models have been stolen?
  • Watermarks in decision boundaries?