SLIDE 1

CS 744: CLIPPER

Shivaram Venkataraman Fall 2019

SLIDE 2

ADMINISTRIVIA

  • Assignment 2 grading
  • Midterm details
  • Course Project template
SLIDE 3

MACHINE LEARNING SO FAR

SLIDE 4

MACHINE LEARNING: INFERENCE

SLIDE 5

GOALS

  • Interactive latencies (tail latency < 100ms)
  • High throughput to handle load
  • Improved prediction accuracy
  • Generality (?)
SLIDE 6

ARCHITECTURE

SLIDE 7

MODEL CONTAINERS

  • Run using Docker containers
  • Can be replicated across machines
SLIDE 8

MODEL ABSTRACTION LAYER

Caching

  • Improve performance for frequent queries
  • LRU eviction policy
  • Important for feedback
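The caching layer above can be sketched as a small LRU map from (model, query) keys to predictions. This is an illustrative sketch, not Clipper's actual cache; the class name, key structure, and capacity are assumptions:

```python
from collections import OrderedDict

class PredictionCache:
    """LRU cache mapping (model, query) pairs to predictions."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, model, query):
        key = (model, query)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, model, query, prediction):
        key = (model, query)
        self.entries[key] = prediction
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

A cache hit skips the model container entirely, which is what makes frequent queries cheap; the LRU order also gives feedback-driven re-queries a good chance of hitting.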
SLIDE 9

BATCHING, QUEUING

Goals, Insight

  • Increase latency (within SLO) for improved throughput

  • Reduce RPC overheads
  • GPU / BLAS acceleration

Approach

  • Per-container queues.
  • Maximum batch size.
  • Why?
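The approach above can be sketched as one queue per model container, drained up to the maximum batch size per RPC (class and method names are illustrative, not Clipper's actual code):

```python
import queue

class ModelContainerQueue:
    """One pending-request queue per model container."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size  # capped to bound tail latency
        self.pending = queue.Queue()

    def submit(self, request):
        self.pending.put(request)

    def next_batch(self):
        """Drain up to max_batch_size requests for a single RPC."""
        batch = []
        while len(batch) < self.max_batch_size:
            try:
                batch.append(self.pending.get_nowait())
            except queue.Empty:
                break
        return batch
```

Per-container queues let each model batch at its own rate, and the cap is what keeps a large batch from blowing the latency SLO.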
SLIDE 10

ADAPTIVE BATCHING

[Figure: Batch Size vs. Time]

  • AIMD: Additive Increase, Multiplicative Decrease. Why?
  • Delayed: wait until a batch exists. Why?
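The AIMD rule can be sketched as a one-step controller: grow the batch size additively while latency stays within the SLO, and cut it multiplicatively on a miss (the constants `add` and `mult` below are illustrative assumptions):

```python
def aimd_step(batch_size, latency, slo, add=1, mult=0.5, min_size=1):
    """Additive-increase / multiplicative-decrease batch-size update."""
    if latency <= slo:
        return batch_size + add  # probe for more throughput
    return max(min_size, int(batch_size * mult))  # back off on an SLO miss

# Example: grow under the SLO, then halve on a simulated miss.
size = 4
size = aimd_step(size, latency=80, slo=100)   # 5
size = aimd_step(size, latency=120, slo=100)  # 2
```

Additive growth probes gently for the throughput sweet spot; the multiplicative cut reacts fast when a batch blows the latency budget.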

SLIDE 11

MODEL SELECTION

SLIDE 12

SINGLE MODEL SELECTION

Multi-Armed Bandit formulation

  • Explore vs. Exploit
  • Regret: loss from not picking the optimal action
  • Goal: Minimize regret

Clipper

  • Exp3 algorithm
  • Single evaluation
  • Scales to more models
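A minimal sketch of Exp3 as used for single-model selection: each query samples one model from an exploration-smoothed distribution, and only that model's weight is updated with an importance-weighted reward. The exploration rate `gamma` and the reward range [0, 1] are assumptions of this sketch:

```python
import math
import random

class Exp3:
    """Exp3 bandit: pick one model per query, update only that model."""

    def __init__(self, n_arms, gamma=0.1):
        self.gamma = gamma
        self.weights = [1.0] * n_arms

    def probabilities(self):
        total = sum(self.weights)
        k = len(self.weights)
        # Mix the weight distribution with uniform exploration.
        return [(1 - self.gamma) * w / total + self.gamma / k
                for w in self.weights]

    def select(self):
        return random.choices(range(len(self.weights)),
                              weights=self.probabilities())[0]

    def update(self, arm, reward):
        """Reward in [0, 1]; divide by the arm's probability so rarely
        chosen arms are not underestimated."""
        p = self.probabilities()[arm]
        estimated = reward / p
        self.weights[arm] *= math.exp(
            self.gamma * estimated / len(self.weights))
```

Because only the selected arm needs evaluating per query, the cost stays constant as the number of candidate models grows, which is the "scales to more models" point above.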
SLIDE 13

MULTIPLE MODELS

Ensemble

  • Combine output from models (weighted average)
  • How do we get the weights?

Robust Prediction

  • React to model changes
  • Output confidence score
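Both ideas can be sketched together for binary scores: a weighted average over per-model outputs, plus a confidence score based on how much weight agrees with the final decision. The agreement heuristic here is an assumption; Clipper actually learns the ensemble weights online from feedback:

```python
def ensemble_predict(outputs, weights):
    """Weighted average of per-model scores in [0, 1], with a
    confidence score for a binary decision."""
    total = sum(weights)
    prediction = sum(w * o for w, o in zip(weights, outputs)) / total
    # Confidence heuristic (illustrative): fraction of total weight on
    # models whose rounded score matches the ensemble's rounded decision.
    decision = round(prediction)
    agreeing = sum(w for w, o in zip(weights, outputs)
                   if round(o) == decision)
    return prediction, agreeing / total

# Example: three equally weighted binary classifiers on one query.
pred, conf = ensemble_predict([0.9, 0.8, 0.2], [1.0, 1.0, 1.0])
```

When a model changes or degrades, shrinking its weight lets the ensemble react, and a low agreement fraction is one way to surface "the models disagree" to the caller.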
SLIDE 14

STRAGGLER MITIGATION

Why do stragglers occur?

Approach
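One way to sketch straggler mitigation, in the spirit of anytime predictions: send the query to every ensemble member, wait only until the latency deadline, and answer with whichever replies arrived. The helper name and deadline handling are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, wait

def predict_with_deadline(models, query, deadline_s):
    """Anytime ensemble: return the replies that arrive by the deadline."""
    pool = ThreadPoolExecutor(max_workers=len(models))
    futures = [pool.submit(m, query) for m in models]
    done, not_done = wait(futures, timeout=deadline_s)
    for f in not_done:
        f.cancel()  # ignore stragglers rather than blocking the response
    pool.shutdown(wait=False)  # do not wait for straggling workers
    return [f.result() for f in done]
```

The response is computed from the models that made the deadline, so one slow container degrades accuracy slightly instead of blowing the tail-latency SLO.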

SLIDE 15

TAKEAWAYS

  • ML inference: Workloads + Requirements
  • Layered architecture provides generality
  • Caching, Batching, Replication to improve latency, throughput
  • Multi-armed bandits to improve accuracy
SLIDE 16

DISCUSSION

https://forms.gle/pZMuhCWcap2q3LQJ9

SLIDE 17

(Discussion question from last week) Consider AllReduce using MPI as the baseline parallel programming model. Discuss the improvements made by MapReduce and Spark over MPI, and discuss if/how Ray further contributes to the comparison.

SLIDE 18

Consider a scenario where you run a model serving service that hosts a number of different models. The traffic for some models is sporadic (e.g., they are used for only a few hours). What are some advantages / disadvantages of using Clipper for such a service?