  1. Machine Learning Pipelines
     Marco Serafini, COMPSCI 532, Lecture 21

  2. Training vs. Inference
     • Training: data → model
        • Computationally expensive
        • No hard real-time requirements (typically)
     • Inference: data + model → prediction
        • Computationally cheaper
        • Real-time requirements (sometimes sub-millisecond)
     • Today we talk about inference

  3. Lifecycle

  4. Challenge: Different Frameworks
     • Different training frameworks, each has its strengths
        • E.g.: Caffe for computer vision, HTK for speech recognition
     • Each uses different formats → tailored deployment
     • Best tool may change over time
     • Solution: model abstraction (sketched below)
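
     A minimal sketch of what such an abstraction could look like: a common
     predict() interface with per-framework adapters behind it. The class and
     method names here are illustrative, not Clipper's actual API.

     from abc import ABC, abstractmethod
     from typing import List

     class ModelWrapper(ABC):
         """Framework-agnostic interface: the serving layer only ever calls predict()."""

         @abstractmethod
         def predict(self, inputs: List[list]) -> List[float]:
             """Run inference on a batch of inputs, returning one prediction per input."""

     class SklearnWrapper(ModelWrapper):
         """Adapter for a scikit-learn style model (illustrative)."""
         def __init__(self, model):
             self.model = model

         def predict(self, inputs):
             return list(self.model.predict(inputs))

     class TorchWrapper(ModelWrapper):
         """Adapter for a PyTorch module that accepts a batched tensor (illustrative)."""
         def __init__(self, module, to_tensor):
             self.module = module
             self.to_tensor = to_tensor    # caller-supplied conversion from raw inputs

         def predict(self, inputs):
             return self.module(self.to_tensor(inputs)).tolist()

     Swapping the best tool for a task then only means writing a new adapter,
     not changing the deployment path.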

  5. Challenge: Prediction Latency
     • Many ML models have high prediction latency
        • Some are too slow to use online, e.g., when choosing an ad
        • Combining model outputs makes it worse
     • Trade-off between accuracy and latency
     • Solutions
        • Adaptive batching
        • Enable mixing models with different complexity
        • Straggler mitigation when using multiple models

  6. Challenge: Model Selection
     • How to decide which models to deploy?
     • Selecting the best model offline is expensive
     • Best model changes over time
        • Concept drift: relationships in the data change over time
        • Feature corruption
     • Combining multiple models can increase accuracy
     • Solution: automatically select among multiple models

  7. Overview
     • Requests flow top to bottom and back
     • We start by reviewing the Model Abstraction Layer (Project 3)

  8. Caching
     • Stores prediction results (toy cache sketch below)
     • Avoids rerunning inference on recent predictions
     • Enables correlating prediction with feedback
     • Useful when selecting one model
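
     A toy version of such a cache, assuming a hypothetical LRU keyed by model
     name plus the serialized input (not Clipper's implementation):

     import hashlib
     import json
     from collections import OrderedDict

     class PredictionCache:
         """LRU cache of recent predictions, keyed by (model, input)."""
         def __init__(self, capacity=10_000):
             self.capacity = capacity
             self.entries = OrderedDict()

         def _key(self, model_name, x):
             digest = hashlib.sha1(json.dumps(x, sort_keys=True).encode()).hexdigest()
             return f"{model_name}:{digest}"

         def get(self, model_name, x):
             key = self._key(model_name, x)
             if key not in self.entries:
                 return None
             self.entries.move_to_end(key)          # mark as most recently used
             return self.entries[key]

         def put(self, model_name, x, prediction):
             key = self._key(model_name, x)
             self.entries[key] = prediction
             self.entries.move_to_end(key)
             if len(self.entries) > self.capacity:
                 self.entries.popitem(last=False)   # evict the least recently used entry

     Keeping the entry around after the prediction is served is what lets later
     feedback for the same input be matched against what the model predicted.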

  9. Batching
     • Maximize batch size given an upper bound on latency
     • Advantages of batching
        • Fewer RPC requests
        • Data-parallel optimizations (e.g., using GPUs)
     • Different queue/batch size per model container
     • Some systems, like TensorFlow, require static batch sizes
     • Adaptive batch sizing: AIMD (sketched below)
        • Additively increase the batch size until the latency threshold is exceeded
        • Then scale down by 10%
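
     A rough sketch of that AIMD rule. The additive step of 1 and the 10%
     multiplicative backoff follow the slide; the rest (names, SLO handling)
     is illustrative:

     class AIMDBatchSizer:
         """Additive-increase / multiplicative-decrease control of the batch size."""
         def __init__(self, latency_slo_ms, initial=1, step=1, backoff=0.10):
             self.latency_slo_ms = latency_slo_ms
             self.batch_size = initial
             self.step = step
             self.backoff = backoff

         def update(self, observed_latency_ms):
             if observed_latency_ms > self.latency_slo_ms:
                 # Multiplicative decrease: back off by 10% once the SLO is exceeded.
                 self.batch_size = max(1, int(self.batch_size * (1 - self.backoff)))
             else:
                 # Additive increase: keep probing for a larger batch while under the SLO.
                 self.batch_size += self.step
             return self.batch_size

     Each model container would get its own sizer, so the controller converges
     to a different batch size per model, matching the per-model discussion two
     slides below.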

  10. Benefits of (Adaptive) Batching
     • Up to 26x throughput increase

  11. Per-Model Batch Size
     • Different models have different optimal batch sizes
     • Latency grows linearly with batch size, so it is easy to predict with AIMD

  12. Delayed Batching
     • When a batch is dispatched and the next one is not yet full, wait for more requests (sketch below)
     • Not always beneficial
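
     A sketch of the waiting logic, assuming requests arrive on a standard
     queue.Queue and max_delay_s is the extra time we are willing to spend
     filling the batch (both are assumptions, not Clipper internals):

     import queue
     import time

     def collect_batch(request_queue, batch_size, max_delay_s):
         """Gather up to batch_size requests, waiting at most max_delay_s for more to arrive."""
         batch = [request_queue.get()]                 # block until at least one request exists
         deadline = time.monotonic() + max_delay_s
         while len(batch) < batch_size:
             remaining = deadline - time.monotonic()
             if remaining <= 0:
                 break                                 # deadline hit: ship a partial batch
             try:
                 batch.append(request_queue.get(timeout=remaining))
             except queue.Empty:
                 break
         return batch

     Whether the wait pays off depends on the model: if a larger batch barely
     improves throughput, the added queuing delay is pure cost, which is why
     the slide notes it is not always beneficial.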

  13. Model Containers

  14. Model Containers
     • Docker containers
     • API to be implemented (sketched below)
     • State (parameters) passed during initialization
     • No other state management
     • Clipper replicates containers as needed
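
     In code, the container interface boils down to roughly two calls; this is
     a sketch under assumed names, not the exact Clipper RPC interface:

     class ModelContainer:
         """Skeleton of the API a model container implements."""

         def __init__(self, deserialize):
             # deserialize is a hypothetical framework-specific loader supplied by the model author
             self.deserialize = deserialize
             self.model = None

         def init(self, model_state):
             """Called once at startup: the serialized parameters are the only state passed in."""
             self.model = self.deserialize(model_state)

         def predict_batch(self, inputs):
             """Stateless batched inference; nothing else is persisted between calls."""
             return [self.model.predict(x) for x in inputs]

     Because a container holds no state beyond the model itself, Clipper can
     replicate it freely behind the same queue to add throughput.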

  15. Effect of Replication
     • 10 Gbps network: the GPU is the bottleneck, throughput scales out with more replicas
     • 1 Gbps network: the network is the bottleneck, throughput does not scale out

  16. Model Selection

  17. Model Selection
     • Enables running multiple models
     • Advantages
        • Combine outputs from different models (if run in parallel)
        • Estimate prediction accuracy (through comparison)
        • Switch to better model (when feedback available)
     • Disadvantage of running models in parallel: stragglers
        • They can often be ignored with minimal accuracy loss
     • Context: different model selection state per user or session

  18. Model Selection API
     • S: selection policy state
     • X: input
     • Y: prediction / feedback
     • Feedback is incorporated to update the selection state (interface sketched below)
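
     Read as an interface, the policy is a small state machine over S, X, and
     Y; a sketch using the slide's naming, not the verbatim Clipper signatures:

     from abc import ABC, abstractmethod

     class SelectionPolicy(ABC):
         """S: selection policy state, X: input, Y: prediction or feedback."""

         @abstractmethod
         def init(self):
             """Return the initial selection state S."""

         @abstractmethod
         def select(self, s, x):
             """Choose which model(s) to query for input x, given state s."""

         @abstractmethod
         def combine(self, s, x, model_predictions):
             """Merge per-model predictions into a single prediction y."""

         @abstractmethod
         def observe(self, s, x, y_feedback):
             """Incorporate feedback on input x and return the updated state s."""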

  19. Single-Model Selection
     • Multi-armed bandit
        • Select one action, observe the outcome
        • Decide whether to explore a new action or exploit the current one
     • Exp3 algorithm (sketched below)
        • Choose an action based on a probability distribution
        • Adjust the probability of the current choice based on the observed loss
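
     A compact sketch of Exp3 for picking one model per query. The exploration
     rate gamma and the loss-to-reward scaling are standard textbook choices,
     not values from the slide:

     import math
     import random

     class Exp3:
         """One weight per model; sample from the induced distribution, reweight on feedback."""
         def __init__(self, num_models, gamma=0.1):
             self.gamma = gamma
             self.weights = [1.0] * num_models

         def probabilities(self):
             total = sum(self.weights)
             k = len(self.weights)
             return [(1 - self.gamma) * w / total + self.gamma / k for w in self.weights]

         def select(self):
             probs = self.probabilities()
             chosen = random.choices(range(len(self.weights)), weights=probs)[0]
             return chosen, probs

         def update(self, chosen, probs, loss):
             """loss is assumed to be in [0, 1]; only the chosen arm's weight changes."""
             reward = 1.0 - loss
             estimated_reward = reward / probs[chosen]       # importance-weighted estimate
             self.weights[chosen] *= math.exp(self.gamma * estimated_reward / len(self.weights))

     Exploration comes from the gamma / k term, which keeps every model's
     probability bounded away from zero; exploitation comes from the weights,
     which grow for models that incur low loss.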

  20. Multi-Model Ensembles

  21. Ensembles and Changing Accuracy

  22. Ensembles and Stragglers

  23. Personalized Model Selection
     • Model selection can be done per user

  24. TensorFlow Serving
     • Inference mechanism of TensorFlow
     • Can run TensorFlow models
     • Also uses batching (static, not adaptive)
     • Missing features
        • Latency objectives
        • Support for multiple models
        • Feedback
