Parity Models: Erasure-Coded Resilience for Prediction Serving Systems

SLIDE 1

Parity Models: Erasure-Coded Resilience for Prediction Serving Systems

Jack Kosaian Rashmi Vinayak Shivaram Venkataraman

SLIDE 2

Rashmi Vinayak Shivaram Venkataraman

SLIDE 3

Machine learning lifecycle

  • Training: get a model to reach desired accuracy; hours to weeks; “batch” jobs
  • Inference: deploy model in target domain; milliseconds; online

SLIDE 4

Machine learning inference

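[Figure: queries go in and predictions come out; the example prediction is the probability vector (0.15, 0.8, 0.05) over the classes cat, dog, and bird]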

SLIDE 5

Prediction serving systems

Inference in datacenter/cluster settings

[Figure: example prediction serving systems, grouped under “Open Source” and “Cloud Services”]

SLIDE 6

Prediction serving system architectures

[Figure: queries → frontend → model instances → predictions]

SLIDE 7

Machine learning inference

Must operate with low, predictable latency

translation, question-answering, ranking

SLIDE 8

Unavailability in serving systems

  • Slowdowns and failures (unavailability)
      • Resource contention
      • Hardware failures
      • Runtime slowdowns
      • ML-specific events
  • Result in inflated tail latency
      • Cause prediction serving systems to miss SLOs

Must alleviate slowdowns and failures

SLIDE 9

Redundancy-based resilience

  • Proactive: send each query to 2+ servers
  • Reactive: wait for a timeout before duplicating query
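
The difference between the two strategies can be sketched with a small latency model. This is an illustrative assumption, not a measurement from the talk: each server has a fixed response time, and the function names and millisecond values below are hypothetical.

```python
def proactive_latency(primary, backup):
    """Query is sent to two servers at once; the faster response wins."""
    return min(primary, backup)


def reactive_latency(primary, backup, timeout):
    """Query is duplicated only if the primary has not answered by `timeout`."""
    if primary <= timeout:
        return primary
    # The backup is started at `timeout`, so its response arrives at timeout + backup.
    return min(primary, timeout + backup)


# Hypothetical latencies in milliseconds: the primary server is slow.
primary_ms, backup_ms, timeout_ms = 500.0, 20.0, 100.0
proactive_ms = proactive_latency(primary_ms, backup_ms)
reactive_ms = reactive_latency(primary_ms, backup_ms, timeout_ms)
print(proactive_ms, reactive_ms)
```

In this model the proactive approach recovers immediately but always occupies two servers, while the reactive approach uses a second server only after the timeout and pays that timeout as recovery delay.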

[Figure: recovery delay vs. resource overhead (lower is better on both axes), comparing reactive and proactive approaches]

SLIDE 10

Erasure codes: proactive, resource-efficient

  • Encoding: k data units → r “parity” units, e.g. P = D1 + D2
  • Decoding: any k out of (k+r) units → original k data units, e.g. D2 = P – D1
  • Relation to (n, k) notation: n = k + r
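
As a minimal sketch of the addition code on this slide (k = 2, r = 1), assuming data units are equal-length lists of numbers; the function names are hypothetical:

```python
def encode(data_units):
    """Addition code with r = 1: the parity is the element-wise sum of the k data units."""
    return [sum(values) for values in zip(*data_units)]


def decode(parity, available_units):
    """Recover the one missing data unit by subtracting the k - 1 available units."""
    missing = list(parity)
    for unit in available_units:
        missing = [m - u for m, u in zip(missing, unit)]
    return missing


d1 = [1, 2, 3]
d2 = [10, 20, 30]
p = encode([d1, d2])          # P = D1 + D2
recovered = decode(p, [d1])   # D2 = P - D1
print(p, recovered)
```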

SLIDE 11

Erasure codes: proactive, resource-efficient

Storage, Communication, Prediction Serving Systems

[Figure: recovery delay vs. resource overhead (lower is better on both axes), comparing reactive, proactive, and erasure codes]

SLIDE 12

Coded-computation

Our goal: Using erasure codes to reduce tail latency in prediction serving

Goal: preserve results of computation over queries

[Figure: queries X1 and X2 are sent to models F, which return predictions F(X1) and F(X2)]

SLIDE 13

Coded-computation

Encode queries

[Figure: queries X1 and X2 are encoded into a “parity query”; models F compute F(X1) and F(X2)]

SLIDE 14

Coded-computation

Decode results of inference over queries

[Figure: X1 and X2 are encoded into a “parity query” P; models F produce F(X1), F(X2), and F(P), and the decoder reconstructs an unavailable prediction such as F(X2)]

SLIDE 15

Traditional coding vs. coded-computation

Need to recover computation over inputs

  • Codes for storage: D1 and D2 are encoded into P; decoding recovers D2
  • Coded-computation: X1 and X2 are encoded into a parity query; decoding recovers F(X2) from F(X1) and F(P)

SLIDE 16

Challenge: Non-linear computation

  • Linear computation, example F(X) = 2X:
      P = X1 + X2
      F(X2) = F(P) – F(X1) = 2(X1 + X2) – 2X1 = 2X2
  • Non-linear computation, example F(X) = X^2:
      P = X1 + X2
      F(P) – F(X1) = (X1 + X2)^2 – X1^2 = X2^2 + 2X1X2, but the actual F(X2) is X2^2
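
The two examples on this slide can be checked numerically. The sketch below applies the same subtraction decoder to F(X) = 2X and F(X) = X^2; the input values are arbitrary and the function names are hypothetical.

```python
def decode_missing(f, x1, x2):
    """Apply the subtraction decoder F(P) - F(X1) with parity P = X1 + X2."""
    parity = x1 + x2
    return f(parity) - f(x1)


def linear(x):
    return 2 * x


def square(x):
    return x ** 2


x1, x2 = 3, 5
linear_decoded = decode_missing(linear, x1, x2)   # equals linear(x2)
square_decoded = decode_missing(square, x1, x2)   # equals x2**2 + 2*x1*x2
print(linear_decoded, linear(x2))
print(square_decoded, square(x2))
```

For the linear function the decoder is exact; for the square function it is off by the cross term 2·X1·X2.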

SLIDE 17

Challenge: Non-linear computation

  • Linear computation, example F(X) = 2X:
      P = X1 + X2
      F(X2) = F(P) – F(X1) = 2(X1 + X2) – 2X1 = 2X2
  • Non-linear computation:
      P = X1 + X2
      F(X2) = F(P) – F(X1) = ???

SLIDE 18

Current approaches to coded-computation

  • Lots of great work on linear computations
      • Huang 1984, Lee 2015, Dutta 2016, Dutta 2017, Mallick 2018, more…
  • Recent work supports restricted nonlinear computations
      • Yu 2018
      • At least 2x resource overhead

Current approaches insufficient for neural networks in prediction serving systems

SLIDE 19

Our approach: Learning-based coded-computation

  • Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation, https://arxiv.org/abs/1806.01259
  • Parity Models: Erasure-Coded Resilience for Prediction Serving Systems, to appear in ACM SOSP 2019

https://jackkosaian.github.io

SLIDE 20

Learning an erasure code?

Design encoder and decoder as neural networks

[Figure: queries X1 and X2 with a neural-network encoder and decoder]

  • Accurate

Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation
https://arxiv.org/abs/1806.01259

SLIDE 21

Learning an erasure code?

Design encoder and decoder as neural networks

[Figure: queries X1 and X2 with a neural-network encoder and decoder, both marked as computationally expensive]

  • Accurate
  • Expensive encoder/decoder

SLIDE 22

Learn computation over parities: “parity model”

Use simple, fast encoders and decoders

  • Encoder: P = X1 + X2
  • Parity model (FP): learned computation over the parity P
  • Decoder: F(X2) = FP(P) – F(X1)

  • Accurate
  • Efficient encoder/decoder

Parity Models: Erasure-Coded Resilience for Prediction Serving Systems
To appear in ACM SOSP 2019
https://jackkosaian.github.io

SLIDE 23

Designing parity models

Parity model goal

Transform parities such that decoder can reconstruct unavailable predictions

  • Encoder: P = X1 + X2
  • Decoder: F(X2) = FP(P) – F(X1), where FP is the parity model

SLIDE 24

Designing parity models

Parity model goal

Transform parities such that decoder can reconstruct unavailable predictions

  • Encoder: P = X1 + X2
  • Decoder: F(X2) = FP(P) – F(X1)
  • For the decoder to be correct, the parity model (FP) must satisfy FP(P) = F(X1) + F(X2)

Learn a parity model

SLIDE 25

Designing parity models

Parity model goal

Transform parities such that decoder can reconstruct unavailable predictions

  • Encoder: P = X1 + X2
  • Decoder: F(X2) = FP(P) – F(X1)
  • For the decoder to be correct, the parity model (FP) must satisfy FP(P) = F(X1) + F(X2)

SLIDE 26

Training a parity model

  1. Sample k inputs and encode
  2. Perform inference with parity model
  3. Compute loss
  4. Backpropagate loss
  5. Repeat
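
The five steps can be sketched as a small training loop. Everything concrete here is an assumption made for illustration: the deployed model F is a fixed toy softmax classifier, the parity model is a single linear layer, the loss is mean squared error, and k = 2 with the addition encoder. A real parity model would typically be a neural network trained in a deep learning framework.

```python
import math
import random


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical stand-in for the deployed model F: 2 inputs, 3 classes.
F_WEIGHTS = [[2.0, -1.0], [-1.5, 0.5], [0.3, 1.2]]


def F(x):
    return softmax([w[0] * x[0] + w[1] * x[1] for w in F_WEIGHTS])


def train_parity_model(steps=3000, lr=0.05, seed=0):
    rng = random.Random(seed)
    W = [[0.0, 0.0] for _ in range(3)]  # parity model weights
    b = [0.0, 0.0, 0.0]                 # parity model biases
    losses = []
    for _ in range(steps):
        # 1. Sample k = 2 inputs and encode: P = X1 + X2.
        x1 = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        x2 = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        p = [x1[0] + x2[0], x1[1] + x2[1]]
        # 2. Perform inference with the parity model.
        out = [W[j][0] * p[0] + W[j][1] * p[1] + b[j] for j in range(3)]
        # 3. Compute loss against the desired output F(X1) + F(X2).
        target = [a + c for a, c in zip(F(x1), F(x2))]
        err = [o - t for o, t in zip(out, target)]
        losses.append(sum(e * e for e in err) / 3)
        # 4. Backpropagate: gradient of the mean squared error w.r.t. W and b.
        for j in range(3):
            g = 2 * err[j] / 3
            W[j][0] -= lr * g * p[0]
            W[j][1] -= lr * g * p[1]
            b[j] -= lr * g
        # 5. Repeat.
    return W, b, losses


W, b, losses = train_parity_model()
early = sum(losses[:50]) / 50
late = sum(losses[-500:]) / 500
print(early, late)
```

Because the toy parity model is linear and F is not, the loss falls but does not reach zero, which mirrors the point that learned reconstructions are approximate.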

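[Figure: P = X1 + X2 is fed to the parity model; its output FP(P) is compared with the desired output F(X1) + F(X2) to compute the loss. Example prediction vectors shown: (0.8, 0.15, 0.05) and (0.2, 0.7, 0.1)]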

SLIDE 27

Training a parity model

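  1. Sample k inputs and encode
  2. Perform inference with parity model
  3. Compute loss
  4. Backpropagate loss
  5. Repeat

[Figure: P = X1 + X2 is fed to the parity model; its output FP(P) is compared with the desired output F(X1) + F(X2) to compute the loss. Example prediction vectors shown: (0.15, 0.8, 0.05) and (0.3, 0.5, 0.2)]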

SLIDE 28

Training a parity model

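  1. Sample k inputs and encode
  2. Perform inference with parity model
  3. Compute loss
  4. Backpropagate loss
  5. Repeat

[Figure: P = X1 + X2 is fed to the parity model; its output FP(P) is compared with the desired output F(X1) + F(X2) to compute the loss. Example prediction vectors shown: (0.03, 0.02, 0.95) and (0.3, 0.3, 0.4)]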

SLIDE 29

Training a parity model: higher parameter k

P = X1 + X2 + X3 + X4

  1. Sample inputs and encode
  2. Perform inference with parity model
  3. Compute loss
  4. Backpropagate loss
  5. Repeat

Desired output of FP(P): F(X1) + F(X2) + F(X3) + F(X4)

SLIDE 30

Training a parity model: different encoder

[Figure: parity P produced by a different encoder, then fed to the parity model to obtain FP(P)]

  1. Sample inputs and encode
  2. Perform inference with parity model
  3. Compute loss
  4. Backpropagate loss
  5. Repeat

SLIDE 31

Appropriate for machine learning inference

  1. Predictions resulting from inference are approximations
  2. Inaccuracy only comes into play when predictions would otherwise be slow/failed

Learning results in approximate reconstructions

SLIDE 32

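Implementing parity models in Clipper

[Figure: the frontend contains an encoder and a decoder; queries go to model instances and the parity query goes to a parity model; the decoder reconstructs predictions that are slow/failed]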
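
A rough sketch of the frontend flow, with the addition/subtraction code and at most one unavailable prediction per group of k queries. This is not Clipper's API: `serve_group`, its parameters, and the toy linear model are hypothetical, and a linear model is used only so that the reconstruction is exact and easy to check.

```python
def serve_group(queries, model, parity_model, unavailable=None):
    """Serve k queries; reconstruct the prediction at index `unavailable`, if any.

    `unavailable` is the index of the query whose prediction is slow/failed,
    or None when every prediction arrived in time.
    """
    # Encoder: P = X1 + ... + Xk (element-wise).
    parity_query = [sum(values) for values in zip(*queries)]
    # Dispatch each query to a model instance; the slow/failed one yields nothing.
    predictions = [None if i == unavailable else model(q)
                   for i, q in enumerate(queries)]
    if unavailable is None:
        return predictions
    # Decoder: subtract the available predictions from the parity model's output.
    reconstructed = parity_model(parity_query)
    for i, pred in enumerate(predictions):
        if i != unavailable:
            reconstructed = [r - p for r, p in zip(reconstructed, pred)]
    predictions[unavailable] = reconstructed
    return predictions


def toy_model(x):
    # Hypothetical linear model, for which FP = F reconstructs exactly.
    return [2 * v for v in x]


results = serve_group([[1, 2], [3, 4]], toy_model, toy_model, unavailable=1)
print(results)
```

With a learned parity model in place of `toy_model`, the reconstructed entry would be an approximation of the missing prediction rather than an exact copy.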

SLIDE 33

Design space in parity models framework

Encoder/decoder

  • Many possibilities
  • Generic: addition/subtraction
  • Can specialize to task

Parity model architecture

  • Again, many possibilities
  • Same as original model ⇒ same latency as original

[Figure: P = X1 + X2 → parity model (FP) → F(X2) = FP(P) – F(X1)]

SLIDE 34

Evaluation

  1. How accurate are reconstructions using parity models?
  2. How much can parity models help reduce tail latency?

SLIDE 35

Evaluation of Accuracy

  • Addition/subtraction code
  • k = 2, r = 1 (P = X1 + X2)
  • 2x less overhead than replication
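
The "2x less overhead" figure follows from simple arithmetic, sketched below under the assumption that replication is modeled as one extra copy per query (k = 1, r = 1). The helper name is hypothetical.

```python
def redundancy_overhead(k, r):
    """Extra (redundant) units per original unit for a code with k data and r parity units."""
    return r / k


replication = redundancy_overhead(k=1, r=1)   # one duplicate per query
parity = redundancy_overhead(k=2, r=1)        # one parity query per two queries
print(replication, parity, replication / parity)
```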

SLIDE 36

Evaluation of Accuracy

Parity model only comes into play when predictions are slow/failed

  • Addition/subtraction code
  • k = 2, r = 1 (P = X1 + X2)
  • 2x less overhead than replication

SLIDE 37

Evaluation of Accuracy

Parity model only comes into play when predictions are slow/failed

  • Addition/subtraction code
  • k = 2, r = 1 (P = X1 + X2)
  • 2x less overhead than replication

SLIDE 38

Evaluation of Overall Accuracy

Parity model only comes into play when predictions are slow/failed

  • Addition/subtraction code
  • k = 2, r = 1 (P = X1 + X2)
  • 2x less overhead than replication

[Figure: overall accuracy plot; annotated value: 6.1%]
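
One way to read "overall accuracy" is as a weighted mix of the deployed model's accuracy and the accuracy of reconstructions, weighted by how often predictions are unavailable. The sketch below assumes exactly that; the function name and the numbers are hypothetical and are not results from the evaluation.

```python
def overall_accuracy(frac_unavailable, acc_available, acc_reconstructed):
    """Expected accuracy when a fraction of predictions must be reconstructed."""
    return ((1 - frac_unavailable) * acc_available
            + frac_unavailable * acc_reconstructed)


# Hypothetical inputs: 10% of predictions unavailable.
mixed = overall_accuracy(0.10, acc_available=0.90, acc_reconstructed=0.80)
print(round(mixed, 4))
```

Under this reading, the smaller the unavailable fraction, the closer overall accuracy stays to that of the deployed model.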

SLIDE 39

Evaluation of Overall Accuracy

Parity model only comes into play when predictions are slow/failed

  • Addition/subtraction code
  • k = 2, r = 1 (P = X1 + X2)
  • 2x less overhead than replication

[Figure: overall accuracy plot; annotated values: 6.1% and 0.6%]

SLIDE 40

Evaluation of Overall Accuracy

Parity model only comes into play when predictions are slow/failed

  • Addition/subtraction code
  • k = 2, r = 1 (P = X1 + X2)
  • 2x less overhead than replication

[Figure: overall accuracy plot with the expected operating regime highlighted]

SLIDE 41

Evaluation of Accuracy: Higher values of k

Tradeoff between resource-overhead, resilience, and accuracy

  • Addition/subtraction code

SLIDE 42

Evaluation of Accuracy: Object-localization

[Figure: object-localization examples; legend: Ground Truth, Available, Parity Models]

SLIDE 43

Evaluation of Accuracy: Task-specific encoder

[Figure: four input images (dimension labels: 32) are encoded into a single parity image]

22% accuracy improvement over addition/subtraction at k = 4
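
The slide only shows four input images and one parity image, so the encoder below is an assumed reading of the figure: each 32×32 image is downsampled to 16×16 and placed in one quadrant of a 32×32 parity image. The averaging-based resize, the single-channel images, and the function names are all hypothetical choices for illustration.

```python
def downsample(image):
    """Halve each dimension by averaging non-overlapping 2x2 blocks."""
    half = len(image) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4
            for c in range(half)
        ]
        for r in range(half)
    ]


def concat_encode(images):
    """Place four downsampled images in the quadrants of one parity image."""
    a, b, c, d = (downsample(img) for img in images)
    top = [row_a + row_b for row_a, row_b in zip(a, b)]
    bottom = [row_c + row_d for row_c, row_d in zip(c, d)]
    return top + bottom


# Four constant 32x32 single-channel images with pixel values 0, 1, 2, 3.
images = [[[float(i)] * 32 for _ in range(32)] for i in range(4)]
parity_image = concat_encode(images)
print(len(parity_image), len(parity_image[0]))
```

Unlike addition, an encoder of this kind keeps each input's pixels spatially separate, at the cost of lower resolution per input.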

SLIDE 44

Evaluation of Tail Latency Reduction: Setup

  • Implemented in Clipper prediction serving system
  • Evaluate with 18-36 nodes on AWS with varying:
      • Inference hardware (GPUs, CPUs)
      • Query arrival rates
      • Batch sizes
      • Levels of load imbalance
      • Amounts of redundancy
      • Baseline approaches
  • Baseline: approach with same number of resources as parity models

SLIDE 45

Evaluation of Tail Latency Reduction

In presence of resource contention

[Figure: latency plot; annotations: “40%” and “same median”]

SLIDE 46

Limitations of current parity models framework

  • Training a parity model is slow!
      • Dataset with N samples ⇒ parity model dataset with N^k samples
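
The growth in training-set size can be made concrete with a short sketch, assuming every ordered group of k samples (with repetition) is a distinct parity training sample. The dataset, the value N = 50,000, and the helper names are hypothetical.

```python
from itertools import product


def parity_dataset_size(n, k):
    """Number of ordered k-tuples that can be drawn from n samples."""
    return n ** k


def parity_samples(dataset, k):
    """Yield every ordered k-tuple of samples together with its additive parity."""
    for group in product(dataset, repeat=k):
        yield group, sum(group)


tiny_dataset = [1, 2, 3]
pairs = list(parity_samples(tiny_dataset, k=2))
print(len(pairs), parity_dataset_size(50_000, 2))
```

Even at k = 2, a dataset of 50,000 samples yields 2.5 billion possible parity samples, so training has to sample groups rather than enumerate them.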

SLIDE 47

Training a parity model

  1. Sample k inputs and encode
  2. Perform inference with parity model
  3. Compute loss
  4. Backpropagate loss
  5. Repeat

[Figure: P = X1 + X2 is fed to the parity model; its output FP(P) is compared with the desired output F(X1) + F(X2) to compute the loss]

SLIDE 48

Limitations of current parity models framework

  • Training a parity model is slow!
      • Dataset with N samples ⇒ parity model dataset with N^k samples
      • How to efficiently train under this combinatorial explosion?
  • Theoretical understanding?
      • Subject to same problems as existing NNs (e.g., adversarial examples)
      • Can’t bound inaccuracy
  • Potential privacy concerns
      • Combining query A with query B into a parity query might leak info
  • More research needed to tackle the above

SLIDE 49

Landscape of learning in coded-computation

Learn a code

[Figure: queries X1 and X2 with a learned encoder and decoder]

Learning a parity model

[Figure: P = X1 + X2 → parity model (FP) → F(X2) = FP(P) – F(X1)]

SLIDE 50

Landscape of learning in coded-computation

Jointly learn encoders, decoders, and parity models?

[Figure: queries X1 and X2 with an encoder, a parity model (FP), and a decoder]

Balance complexity, execution time across components

SLIDE 51

Parity Models: Erasure-Coded Resilience for Prediction Serving Systems

  • Coded-computation is promising, but current approaches cannot support popular machine learning models like neural networks
  • Parity models: judicious use of learning allows for accurate reconstruction of unavailable ML inference predictions
  • Enables erasure-coded resilience in prediction serving systems

Code available: github.com/Thesys-lab/parity-models