Parity Models: Erasure-Coded Resilience for Prediction Serving Systems - PowerPoint PPT Presentation



  1. Parity Models: Erasure-Coded Resilience for Prediction Serving Systems. Jack Kosaian, Rashmi Vinayak, Shivaram Venkataraman

  2. Rashmi Vinayak, Shivaram Venkataraman

  3. Machine learning lifecycle. Training: get a model to desired accuracy; "batch" jobs; hours to weeks. Inference: deploy the model in the target domain; online; milliseconds.

  4. Machine learning inference: queries in, predictions out (e.g., cat 0.8, dog 0.05, bird 0.15).

  5. Prediction serving systems: inference in datacenter/cluster settings, via open-source systems and cloud services.

  6. Prediction serving system architecture: a frontend routes queries to model instances and returns predictions.

  7. Machine learning inference (question answering, translation, ranking) must operate with low, predictable latency.

  8. Unavailability in serving systems.
  • Slowdowns and failures (unavailability): resource contention, hardware failures, runtime slowdowns, ML-specific events.
  • These result in inflated tail latency and cause prediction serving systems to miss SLOs.
  Must alleviate slowdowns and failures.

  9. Redundancy-based resilience.
  • Proactive: send each query to 2+ servers.
  • Reactive: wait for a timeout before duplicating a query.
  [Figure: recovery delay vs. resource overhead (lower is better for both), comparing reactive and proactive approaches.]

  10. Erasure codes: proactive, resource-efficient. Encoding: k data units are encoded into r "parity" units (in (n, k) notation, n = k + r); for example, P = D1 + D2. Decoding: any k out of the (k + r) units recover the original k data units; for example, D2 = P - D1.
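  For concreteness, a minimal numeric sketch of the k = 2, r = 1 code described above (plain Python; the unit values are made up for illustration):

      # k = 2 data units, r = 1 parity unit, so n = k + r = 3
      d1, d2 = 4.0, 7.0
      p = d1 + d2            # encoding: parity unit P = D1 + D2

      # Suppose D2 becomes unavailable. Any k = 2 of the 3 units suffice:
      d2_recovered = p - d1  # decoding: 11.0 - 4.0 = 7.0, the original D2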

  11. Erasure codes: proactive, resource-efficient. Erasure codes are widely used in storage and communication. [Figure: recovery delay vs. resource overhead (lower is better for both), comparing reactive, proactive, and erasure-coded approaches; can erasure codes help prediction serving systems?]

  12. Coded computation. Our goal: use erasure codes to reduce tail latency in prediction serving, preserving the results of computation over queries. Queries X1, X2 are dispatched to model instances F, which produce the predictions F(X1), F(X2).

  13. Coded computation: encode queries. X1 and X2 are encoded into a "parity query," which is dispatched to an additional model instance alongside the original queries.

  14. Coded computation: decode the results of inference over queries. If F(X2) is unavailable, it is decoded from F(X1) and F(P), the prediction for the parity query.

  15. Traditional coding vs. coded computation. Codes for storage: encode D1, D2 into P; decoding recovers D2 directly. Coded computation: encode X1, X2 into P, run F over each unit, and decode F(X2) from F(X1) and F(P). We need to recover the computation over the inputs, not the inputs themselves.

  16. Challenge: non-linear computation.
  Linear computation, example F(X) = 2X: with P = X1 + X2, the decoder computes F(X2) = F(P) - F(X1) = 2(X1 + X2) - 2X1 = 2X2, which matches the actual value.
  Non-linear computation, example F(X) = X^2: the same decoder gives F(P) - F(X1) = (X1 + X2)^2 - X1^2 = X2^2 + 2X1X2, but the actual value is X2^2.

  17. Challenge: non-linear computation.
  Linear computation, example F(X) = 2X: F(X2) = F(P) - F(X1) = 2(X1 + X2) - 2X1 = 2X2.
  Non-linear computation: F(X2) = F(P) - F(X1) = ???
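  A small worked example of the two cases above (plain Python; the query values are arbitrary):

      x1, x2 = 3.0, 5.0
      p = x1 + x2            # parity query P = X1 + X2

      # Linear computation F(X) = 2X: decoding is exact
      f = lambda x: 2 * x
      assert f(p) - f(x1) == f(x2)    # 16 - 6 == 10 == 2 * X2

      # Non-linear computation F(X) = X**2: the same decoder fails
      g = lambda x: x ** 2
      print(g(p) - g(x1))             # 55.0
      print(g(x2))                    # 25.0, the actual value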

  18. Current approaches to coded computation.
  • Lots of great work on linear computations (Huang 1984, Lee 2015, Dutta 2016, Dutta 2017, Mallick 2018, and more).
  • Recent work supports restricted non-linear computations (Yu 2018).
  • At least 2x resource overhead.
  Current approaches are insufficient for neural networks in prediction serving systems.

  19. Our approach: learning-based coded computation.
  Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation (https://arxiv.org/abs/1806.01259).
  Parity Models: Erasure-Coded Resilience for Prediction Serving Systems (to appear in ACM SOSP 2019; https://jackkosaian.github.io).

  20. Learning an erasure code? Design the encoder and decoder as neural networks over X1, X2: accurate. (Learning a Code, https://arxiv.org/abs/1806.01259)

  21. Learning an erasure code? Designing the encoder and decoder as neural networks is accurate, but the encoder/decoder become computationally expensive. (Learning a Code, https://arxiv.org/abs/1806.01259)

  22. Learn the computation over parities. Use simple, fast encoders and decoders, and instead learn the computation over parities: a "parity model" F_P. With P = X1 + X2, reconstruct F(X2) = F_P(P) - F(X1): accurate, with an efficient encoder/decoder. (Parity Models: Erasure-Coded Resilience for Prediction Serving Systems, to appear in ACM SOSP 2019; https://jackkosaian.github.io)

  23. Designing parity models. Goal: transform parities such that the decoder can reconstruct unavailable predictions. With P = X1 + X2 and parity model F_P, the decoder computes F(X2) = F_P(P) - F(X1).

  24. Designing parity models. Learn a parity model such that F_P(P) = F(X1) + F(X2).

  25. Designing parity models. With F_P(P) = F(X1) + F(X2), the decoder F(X2) = F_P(P) - F(X1) recovers the unavailable prediction.

  26. Training a parity model: 1. Sample k inputs and encode them (P = X1 + X2). 2. Perform inference with the parity model to obtain F_P(P). 3. Compute the loss against the desired output F(X1) + F(X2). 4. Backpropagate the loss. 5. Repeat. (A training-loop sketch follows after item 28.)

  27. Training a parity model: a second iteration of the same steps with a newly sampled pair of inputs.

  28. Training a parity model: a third iteration with another sampled pair of inputs.
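  A minimal PyTorch-style sketch of the training loop described in items 26-28, assuming an already-deployed model `model`, a parity model `parity_model` of the same architecture, and a list-like `dataset` of input tensors. The names, the Adam optimizer, and the MSE loss are illustrative assumptions, not details taken from the paper.

      import torch
      import torch.nn.functional as nnf

      def train_parity_model(parity_model, model, dataset, k=2, steps=1000, lr=1e-3):
          """Train F_P so that F_P(X1 + ... + Xk) approximates F(X1) + ... + F(Xk)."""
          opt = torch.optim.Adam(parity_model.parameters(), lr=lr)
          model.eval()  # the deployed model F stays fixed; only F_P is trained
          for _ in range(steps):
              # 1. Sample k inputs and encode them into a parity query
              idx = torch.randint(len(dataset), (k,)).tolist()
              xs = torch.stack([dataset[i] for i in idx])
              parity = xs.sum(dim=0, keepdim=True)          # P = X1 + ... + Xk

              # 2. Perform inference with the parity model
              out = parity_model(parity)

              # 3. Compute loss against the desired output F(X1) + ... + F(Xk)
              with torch.no_grad():
                  target = model(xs).sum(dim=0, keepdim=True)
              loss = nnf.mse_loss(out, target)

              # 4. Backpropagate the loss and update F_P, then 5. repeat
              opt.zero_grad()
              loss.backward()
              opt.step()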

  29. Training a parity model with a higher parameter k: the same steps, but k = 4 inputs are encoded (P = X1 + X2 + X3 + X4) and the desired output is F(X1) + F(X2) + F(X3) + F(X4).

  30. Training a parity model with a different encoder: the same steps apply; only the encoding used to produce the parity query P changes.

  31. Learning results in approximate reconstructions, which is appropriate for machine learning inference: 1. Predictions resulting from inference are themselves approximations. 2. The inaccuracy only comes into play when predictions would otherwise be slow or failed.

  32. Implementing parity models in Clipper: the frontend is extended with an encoder and a decoder, and a parity model is deployed alongside the model instances; decoding is used when a prediction is slow or failed.

  33. Design space in the parity models framework (a sketch of the generic choice follows below).
  • Encoder/decoder: many possibilities. Generic choice: addition/subtraction (P = X1 + X2; F(X2) = F_P(P) - F(X1)); can be specialized to the task.
  • Parity model architecture: again, many possibilities. Using the same architecture as the original model gives the same latency as the original.
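  A sketch of the generic addition/subtraction encoder and decoder for a group of k queries with r = 1 (plain Python over numpy arrays; the function names are mine, not taken from the paper's implementation):

      import numpy as np

      def encode(queries):
          """Addition encoder: sum a group of k queries into one parity query."""
          return np.sum(queries, axis=0)

      def decode(parity_prediction, available_predictions):
          """Subtraction decoder: reconstruct the single unavailable prediction
          from F_P(P) and the k - 1 predictions that did arrive on time."""
          return parity_prediction - np.sum(available_predictions, axis=0)

  Because F_P only approximates F(X1) + ... + F(Xk), the decoded output is an approximate prediction, which item 31 argues is acceptable for machine learning inference.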

  34. Evaluation: 1. How accurate are reconstructions using parity models? 2. How much can parity models help reduce tail latency?

  35. Evaluation of accuracy. Setup: addition/subtraction code; k = 2, r = 1 (P = X1 + X2); 2x less overhead than replication.

  36. Evaluation of accuracy: the parity model only comes into play when predictions are slow or failed (same setup: addition/subtraction code; k = 2, r = 1; 2x less overhead than replication).

  37. Evaluation of accuracy: accuracy results figure (same setup).

  38. Evaluation of overall accuracy: results figure; a 6.1% value is highlighted (same setup).

  39. Evaluation of overall accuracy: results figure; 0.6% and 6.1% values are highlighted (same setup).

  40. Evaluation of overall accuracy: results figure annotated with the expected operating regime (same setup).

  41. Evaluation of accuracy at higher values of k: a tradeoff between resource overhead, resilience, and accuracy (addition/subtraction code).

  42. Evaluation of accuracy on object localization: example images comparing ground truth, available predictions, and parity-model reconstructions.

  43. Evaluation of accuracy with a task-specific encoder: 22% accuracy improvement over addition/subtraction at k = 4. [Figure: four input images are encoded into a single parity image.]
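  Based on the figure, the task-specific encoder appears to downsample the k = 4 input images and tile them into a single parity image. A hedged sketch of such an encoder (PyTorch; the exact resolutions, interpolation mode, and tiling layout are assumptions, not details from the paper):

      import torch
      import torch.nn.functional as nnf

      def concat_encode(images):
          """Hypothetical task-specific encoder for k = 4 images: downsample each
          image by 2x per side and tile the results into a 2x2 grid, producing one
          parity image at the original resolution."""
          assert len(images) == 4  # expects tensors of shape (C, H, W)
          h, w = images[0].shape[-2:]
          small = [nnf.interpolate(img.unsqueeze(0), size=(h // 2, w // 2),
                                   mode="bilinear", align_corners=False).squeeze(0)
                   for img in images]
          top = torch.cat([small[0], small[1]], dim=-1)      # left/right halves
          bottom = torch.cat([small[2], small[3]], dim=-1)
          return torch.cat([top, bottom], dim=-2)            # stack the two rows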

  44. Evaluation of tail latency reduction: setup.
  • Implemented in the Clipper prediction serving system.
  • Evaluated with 18-36 nodes on AWS, varying: inference hardware (GPUs, CPUs), query arrival rates, batch sizes, levels of load imbalance, amounts of redundancy, and baseline approaches.
  • Baseline: an approach with the same number of resources as parity models.
