CS 744: CLIPPER (Shivaram Venkataraman, Fall 2020)
SLIDE 1

CS 744: CLIPPER

Shivaram Venkataraman, Fall 2020

Good morning!

SLIDE 2

ADMINISTRIVIA

Course Project Proposals

  • Due on Friday!
  • See Piazza for template
  • Submission instructions soon

Midterm details

  • Open book, open notes
  • Held during class time, 9:30-10:45am Central Time
  • Type answers / upload photos (extra 15 mins)
SLIDE 3

MACHINE LEARNING: INFERENCE

(whiteboard diagram of an inference pipeline)
SLIDE 4

GOALS

  • Interactive latencies (tail latency < 100ms)
  • High throughput to handle load
  • Improved prediction accuracy
  • Generality (?)

Notes: latency here is tail latency, e.g. the 99th / 99.9th percentile, since many users means many requests and the slowest ones dominate user experience. Generality means handling as many ML models / frameworks as possible.
SLIDE 5

ARCHITECTURE

Requests arrive over REST / HTTP. Clipper sits between applications and the models:

  • Model selection layer: improves accuracy (e.g. via ensembles) and handles failures
  • Model abstraction layer: common interface over frameworks such as scikit-learn, Spark, TensorFlow
  • Model containers, which can be replicated
SLIDE 6

MODEL CONTAINERS

Run using Docker containers. Can be replicated across machines.

Notes: the container API / interface is implemented once per framework, e.g. a TensorFlow shim that instantiates the TF model inside the container, so new models in an already-supported framework need no new integration work.
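A sketch of the "implemented once per framework" idea: a hypothetical shim class that adapts any model exposing a scikit-learn-style `predict()` to one common batch interface. The class and method names are illustrative, not Clipper's actual container API, and in the real system this interface sits behind RPC inside a Docker container.

```python
class ModelContainer:
    """Common interface the serving layer calls, whatever the framework."""
    def predict_batch(self, inputs):
        raise NotImplementedError

class SklearnStyleContainer(ModelContainer):
    """Framework shim written once: wraps any object exposing a
    scikit-learn-style .predict(list_of_inputs) method."""
    def __init__(self, model):
        self.model = model

    def predict_batch(self, inputs):
        return list(self.model.predict(inputs))

class ConstantModel:
    """Stand-in for a trained framework model."""
    def predict(self, xs):
        return [1 for _ in xs]

container = SklearnStyleContainer(ConstantModel())
print(container.predict_batch([[0.1], [0.2]]))  # [1, 1]
```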
SLIDE 7

MODEL ABSTRACTION LAYER

Caching

  • Improve performance for frequent queries
  • LRU eviction policy
  • Important for feedback

Notes: the cache maps a (model, datapoint) pair to a prediction, e.g. "predict movies for user-id 1". It serves both Predict calls and the feedback path: when feedback arrives, the corresponding predictions are fetched and used together with the feedback to update model selection.
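The caching bullets above can be sketched as a small LRU prediction cache keyed by (model, input). The `feedback` hook that invalidates an entry is an assumption for illustration of why caching interacts with feedback, not the paper's exact mechanism.

```python
from collections import OrderedDict

class PredictionCache:
    """LRU cache keyed by (model, input)."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, model, x):
        key = (model, x)
        if key in self.entries:
            self.entries.move_to_end(key)   # mark most recently used
            return self.entries[key]
        return None

    def put(self, model, x, prediction):
        self.entries[(model, x)] = prediction
        self.entries.move_to_end((model, x))
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

    def feedback(self, model, x):
        # Invalidate, so the next query re-runs the (updated) model.
        self.entries.pop((model, x), None)

cache = PredictionCache(capacity=2)
cache.put("m1", "user-1", "movie-42")
print(cache.lookup("m1", "user-1"))  # movie-42
```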
SLIDE 8

BATCHING, QUEUING

Goals, Insight

  • Increase latency (within SLO) for improved throughput
  • Reduce RPC overheads
  • GPU / BLAS acceleration

Approach

  • Per-container queues. Why?

Notes: pick the batch size that maximizes throughput while staying within the latency SLO. Batching amortizes the fixed per-RPC cost across requests, and larger batches exploit hardware parallelism (GPU / CPU). Per-container queues let the batch size vary for each model, since the optimum differs per model and per hardware.
SLIDE 9

ADAPTIVE BATCHING

(plot: processing time vs. batch size)

AIMD: Additive Increase, Multiplicative Decrease. Why?
Delayed: wait until a batch exists. Why?

Notes: observe the latency of each batch; while the SLO is met, carefully increase the batch size additively, and on an SLO violation decrease it multiplicatively. Delayed batching collects examples up to a certain batch size before dispatching; the open question is how long to wait when only a few requests have arrived.
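The AIMD rule can be sketched as a tiny controller. The additive step and the multiplicative backoff factor below are illustrative assumptions, not values from the paper.

```python
class AIMDBatchSizer:
    """AIMD control of batch size: add a little while the measured
    batch latency stays under the SLO, multiplicatively back off on
    a violation."""
    def __init__(self, slo_ms, additive_step=1, backoff=0.9):
        self.slo_ms = slo_ms
        self.additive_step = additive_step
        self.backoff = backoff
        self.batch_size = 1

    def observe(self, batch_latency_ms):
        if batch_latency_ms <= self.slo_ms:
            self.batch_size += self.additive_step                # additive increase
        else:
            self.batch_size = max(1, int(self.batch_size * self.backoff))  # multiplicative decrease
        return self.batch_size

sizer = AIMDBatchSizer(slo_ms=100)
for latency in [50, 50, 50, 150]:
    sizer.observe(latency)
print(sizer.batch_size)  # grew 1 -> 4, then backed off to int(4 * 0.9) = 3
```

Multiplicative decrease matters because an SLO violation can mean the batch size has drifted well past the knee of the latency curve; backing off quickly restores the SLO, and additive probing then re-finds the optimum.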

SLIDE 10

MODEL SELECTION

Improve accuracy, e.g. via ensembles.
SLIDE 11

SINGLE MODEL SELECTION

Multi-Arm Bandit formulation

  • Explore vs Exploit
  • Regret: loss from not picking the optimal action
  • Goal: minimize regret

Clipper

  • Exp3 algorithm
  • Single evaluation
  • Scales to more models

Notes: each deployed model (model 1, model 2, ...) is an arm; Exp3 keeps a weight per model and updates the weights based on user feedback.
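A sketch of Exp3 for single-model selection, under the assumption that feedback yields a reward of 1 for a good prediction and 0 otherwise (the class is illustrative, not Clipper's implementation):

```python
import math
import random

class Exp3:
    """Exp3 bandit: each model is an arm; select one model per query
    and reweight it from the observed feedback reward in [0, 1]."""
    def __init__(self, n_models, gamma=0.1):
        self.gamma = gamma
        self.weights = [1.0] * n_models

    def probabilities(self):
        total = sum(self.weights)
        k = len(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / k
                for w in self.weights]

    def select(self):
        # Sample one model according to the Exp3 distribution.
        return random.choices(range(len(self.weights)),
                              weights=self.probabilities())[0]

    def update(self, chosen, reward):
        # Importance-weighted reward estimate keeps the update unbiased
        # even though only the chosen model was evaluated.
        p = self.probabilities()[chosen]
        est = reward / p
        self.weights[chosen] *= math.exp(self.gamma * est / len(self.weights))

bandit = Exp3(n_models=2)
for _ in range(20):
    bandit.update(0, reward=1.0)   # repeated positive feedback for model 0
print(bandit.probabilities())      # probability mass shifts toward model 0
```

"Single evaluation" on the slide shows up in `update`: only the chosen arm's reward is observed, which is why Exp3 scales better with the number of models than evaluating every model on every query.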

SLIDE 12

MULTI MODELS

Ensemble

  • Combine output from models (weighted average)
  • How do we get the weights?

Robust Prediction

  • React to model changes
  • Output confidence score

Notes: the ensemble is a linear combination of model outputs, y = a1*y1 + a2*y2 + ..., with the weights updated from feedback (Exp4-style). Whiteboard example: two binary cat/dog classifiers output scores for "cat" (e.g. 0.25 and 0.4); combine the scores and threshold at 0.5 to pick cat vs. dog.
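The whiteboard example can be sketched directly; the scores, equal weights, and the distance-from-threshold confidence measure below are illustrative assumptions, not the paper's exact formulation.

```python
def ensemble_predict(scores, weights, threshold=0.5):
    """Linearly combine per-model scores (e.g. each classifier's
    P(cat)) with the current weights, then threshold. Confidence is
    reported as distance from the decision boundary."""
    combined = sum(w * s for w, s in zip(weights, scores))
    label = "cat" if combined > threshold else "dog"
    confidence = abs(combined - threshold)
    return label, combined, confidence

# Two classifiers score "cat" at 0.25 and 0.4, equal weights:
label, score, conf = ensemble_predict([0.25, 0.4], weights=[0.5, 0.5])
print(label, round(score, 3))  # dog 0.325
```

A combined score near the threshold yields a low confidence, which is what lets the robust-prediction path flag uncertain answers instead of silently returning them.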

SLIDE 13

STRAGGLER MITIGATION

Why do stragglers occur? Approach?

Notes: when we wait for N model containers to reply, some of them might be slow. Rather than just adding more replicas, Clipper renders an approximate result based on whatever has finished: a better approximation on time beats an exact answer that is late. This is an ML-specific trick, since ensemble outputs degrade gracefully when some members are missing.
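The "use whatever has finished" idea can be sketched as follows; treating each missing straggler as a neutral 0.5 score is one simple assumption for illustration (other fill-in strategies are possible).

```python
def combine_available(replies, n_models, default=0.5):
    """Approximate ensemble result from whichever model containers
    replied before the deadline. `replies` maps container index to
    its score; stragglers contribute a neutral default score."""
    scores = [replies.get(i, default) for i in range(n_models)]
    return sum(scores) / n_models

# Two of three containers replied in time; the third is a straggler.
print(combine_available({0: 0.9, 1: 0.8}, n_models=3))
```

The approximation improves monotonically as more replies arrive, which is the "better approximate than late" trade-off on the slide.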

SLIDE 14

SUMMARY

  • Clipper: ML inference Workloads + Requirements
  • Layered architecture provides generality
  • Caching, Batching, Replication to improve latency, throughput
  • Multi-Arm bandits to improve accuracy
SLIDE 15

DISCUSSION

https://forms.gle/FCVhPURqz7HSbDtg6

SLIDE 16

Consider a scenario where you run a model serving service that hosts a number of different applications. The traffic for some applications is sporadic (e.g. only a few hours where they are used). What are some advantages / disadvantages of using Clipper for such a service?

Advantages

  • Containerization isolates applications from each other
  • A common container / RPC interface is already provided across frameworks

Disadvantages

  • The cache might be contended across applications
  • Adaptive batching adds delay and has parameters that must be tuned per application
  • Keeping multiple replicas up for sporadic traffic wastes resources; no elasticity for infrequently queried models
  • Container cold start is slow for applications that are used only a few hours at a time
SLIDE 17

Discussion notes: ensembles remain reasonably accurate, and with straggler mitigation the latency inflation is very low.