Machine Learning Pipelines, Marco Serafini, COMPSCI 532 Lecture 21



Machine Learning Pipelines

Marco Serafini

COMPSCI 532 Lecture 21

Training vs. Inference

  • Training: data → model
    • Computationally expensive
    • No hard real-time requirements (typically)
  • Inference: data + model → prediction
    • Computationally cheaper
    • Real-time requirements (sometimes sub-millisecond)
  • Today we talk about inference
Lifecycle

Challenge: Different Frameworks

  • Different training frameworks, each with its strengths
    • E.g., Caffe for computer vision, HTK for speech recognition
  • Each uses different formats → tailored deployment
  • The best tool may change over time
  • Solution: model abstraction
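The model-abstraction idea can be sketched as a thin framework-agnostic interface with one adapter per training framework; the class and function names below are assumptions for illustration, not any real system's API.

```python
from abc import ABC, abstractmethod

# Hypothetical framework-agnostic interface: each training framework
# (Caffe, HTK, ...) gets an adapter, so the serving layer never
# touches framework-specific model formats.
class Model(ABC):
    @abstractmethod
    def predict_batch(self, inputs):
        """Return one prediction per input."""

# Toy adapter standing in for a real framework-backed model.
class ThresholdModel(Model):
    def __init__(self, threshold):
        self.threshold = threshold

    def predict_batch(self, inputs):
        return [1 if x > self.threshold else 0 for x in inputs]

def serve(model, batch):
    # The serving layer depends only on the abstract interface, so the
    # best framework can change over time without changing deployment.
    return model.predict_batch(batch)
```

Swapping frameworks then means writing a new adapter, not building a new deployment path.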
Challenge: Prediction Latency

  • Many ML models have high prediction latency
  • Some are too slow to use online, e.g., when choosing an ad
  • Combining model outputs makes latency worse
  • Trade-off between accuracy and latency
  • Solutions:
    • Adaptive batching
    • Mixing models of different complexity
    • Straggler mitigation when using multiple models
Challenge: Model Selection

  • How do we decide which models to deploy?
  • Selecting the best model offline is expensive
  • The best model changes over time
    • Concept drift: relationships in the data change over time
    • Feature corruption
  • Combining multiple models can increase accuracy
  • Solution: automatically select among multiple models
Overview

  • Requests flow top to bottom and back
  • We start by reviewing the Model Abstraction Layer (see Project 3)

Caching

  • Stores prediction results
  • Avoids rerunning inference for recently predicted inputs
  • Enables correlating predictions with feedback
  • Useful when selecting one model
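A minimal sketch of such a prediction cache, assuming LRU eviction and hashable inputs (both assumptions; the slide does not specify either):

```python
from collections import OrderedDict

class PredictionCache:
    # LRU cache keyed by (model id, input); the eviction policy is an
    # assumption for illustration.
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, model_id, x):
        key = (model_id, x)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as recently used
        return self.entries[key]

    def put(self, model_id, x, prediction):
        self.entries[(model_id, x)] = prediction
        self.entries.move_to_end((model_id, x))
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

Keeping recent (input, prediction) pairs is also what makes it possible to join a later feedback signal back to the prediction it concerns.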
Batching

  • Maximize batch size given an upper bound on latency
  • Advantages of batching:
    • Fewer RPC requests
    • Data-parallel optimizations (e.g., using GPUs)
  • Different queue/batch size per model container
  • Some systems, like TensorFlow, require static batch sizes
  • Adaptive batch sizing: AIMD (additive increase, multiplicative decrease)
    • Additively increase the batch size until the latency threshold is exceeded
    • Then scale it down by 10%
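The AIMD policy above can be sketched as follows; `measure_latency_ms` and the parameter names are assumptions for illustration, not a real serving API:

```python
def aimd_batch_size(measure_latency_ms, latency_slo_ms,
                    start=1, step=2, backoff=0.9, max_iters=1000):
    # Additive increase: grow the batch until latency exceeds the SLO;
    # multiplicative decrease: then scale down by 10% (backoff=0.9).
    batch = start
    for _ in range(max_iters):
        if measure_latency_ms(batch) > latency_slo_ms:
            return max(1, int(batch * backoff))
        batch += step
    return batch
```

For a model whose latency grows linearly, e.g. 1 ms per queued item, a 20 ms SLO settles near a batch size of 18 with these parameters.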
Benefits of (Adaptive) Batching

Adaptive batching yields up to a 26x throughput increase.

Per-Model Batch Size

  • Different models have different optimal batch sizes
  • Latency grows linearly with batch size, so it is easy to predict with AIMD
Delayed Batching

  • When a batch completes and the next one is not yet full, wait for more requests
  • Not always beneficial
Model Containers

  • Docker containers
  • Containers implement a standard prediction API
  • State (model parameters) is passed during initialization
  • No other state management
  • Clipper replicates containers as needed
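The container-side contract described above might look like this sketch (the class and method names are assumptions, not the actual Clipper container API):

```python
class ModelContainer:
    # All model state (the parameters) arrives once, at initialization;
    # after that the container only answers batched predict calls, so
    # the serving system can replicate it freely.
    def __init__(self, parameters):
        self.weight, self.bias = parameters  # toy linear model state

    def predict_batch(self, inputs):
        return [self.weight * x + self.bias for x in inputs]
```

Because containers hold no other mutable state, replicas need no coordination among themselves.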
Effect of Replication

  • 10 Gbps network: the GPU is the bottleneck, so throughput scales out with replicas
  • 1 Gbps network: the network is the bottleneck, so throughput does not scale out
Model Selection

  • Enables running multiple models
  • Advantages
  • Combine outputs from different models (if run in parallel)
  • Estimate prediction accuracy (through comparison)
  • Switch to better model (when feedback available)
  • Disadvantage of running models in parallel: stragglers
  • They can often be ignored with minimal accuracy loss
  • Context: different model selection state per user or session

Model Selection API

  • S: selection policy state
  • X: input
  • Y: prediction/feedback
  • The API lets the policy incorporate feedback into its state
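The API on this slide can be written out as an interface; the method names below are assumptions chosen to match the S/X/Y vocabulary above, not Clipper's exact signatures.

```python
class SelectionPolicy:
    # S: selection policy state, X: input, Y: prediction or feedback.
    def init_state(self):
        raise NotImplementedError          # -> S

    def select(self, s, x):
        raise NotImplementedError          # S, X -> which models to query

    def combine(self, s, x, predictions):
        raise NotImplementedError          # S, X, [Y] -> final prediction

    def observe(self, s, x, y):
        raise NotImplementedError          # S, X, Y -> updated S (feedback)

# A trivial policy: query every model, average, ignore feedback.
class AverageAllPolicy(SelectionPolicy):
    def __init__(self, n_models):
        self.n_models = n_models

    def init_state(self):
        return None

    def select(self, s, x):
        return list(range(self.n_models))

    def combine(self, s, x, predictions):
        return sum(predictions) / len(predictions)

    def observe(self, s, x, y):
        return s
```

Keeping the state S explicit is what allows per-user or per-session selection state, as noted on the previous slide.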

Single-Model Selection

  • Multi-armed bandit
    • Select one action, observe the outcome
    • Decide whether to explore a new action or exploit the current one
  • Exp3 algorithm
    • Choose an action based on a probability distribution
    • Adjust the probability of the chosen action based on the observed loss
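A compact sketch of Exp3, assuming rewards in [0, 1]; this is the standard importance-weighted formulation, not code from the lecture.

```python
import math
import random

def exp3(n_arms, reward_fn, rounds, gamma=0.1, seed=0):
    # Exp3: keep a weight per arm, mix the weight-based distribution
    # with uniform exploration (gamma), and boost the chosen arm's
    # weight by an importance-weighted reward estimate.
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    for _ in range(rounds):
        total = sum(weights)
        probs = [(1 - gamma) * w / total + gamma / n_arms for w in weights]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(arm)            # observed reward in [0, 1]
        estimate = reward / probs[arm]     # unbiased importance weighting
        weights[arm] *= math.exp(gamma * estimate / n_arms)
    total = sum(weights)
    return [w / total for w in weights]    # learned arm preferences
```

In model selection, each arm is a deployed model and the reward is derived from user feedback on its predictions.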
Multi-Model Ensembles

Ensembles and Changing Accuracy

Ensembles and Stragglers


Personalized Model Selection

  • Model selection can be done per-user
TensorFlow Serving

  • The inference mechanism of TensorFlow
  • Can run TensorFlow models
  • Also uses batching (static batch sizes)
  • Missing features:
    • No latency objectives
    • No support for multiple models
    • No feedback handling