PRETZEL: Opening the Black Box of Machine Learning Prediction Serving (PowerPoint Presentation)



SLIDE 1

PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems

Presented by Qinyuan Sun. Slides are modified from first author Yunseong Lee's slides.

SLIDE 2

Outline

  • Prediction Serving Systems
  • Limitations of Black Box Approaches
  • PRETZEL: White-box Prediction Serving System
  • Evaluation
  • Conclusion
SLIDE 3

Machine Learning Prediction Serving

  • 1. Models are learned from data (training)
  • 2. Models are deployed and served together (prediction serving)

[Diagram: Data → Learn → Model → Deploy → Server handling user prediction requests]

Performance goals: 1) low latency, 2) high throughput, 3) minimal resource usage

SLIDE 4

ML Prediction Serving Systems: State-of-the-art
(e.g., Clipper, TF Serving, ML.Net)

  • Assumption: models are black boxes
  • Re-use the same code as in the training phase
  • Encapsulate all operations into a function call (e.g., predict())
  • Apply external optimizations (e.g., result caching, ensembles, replication, request batching)

[Diagram: a prediction serving system running text-analysis ("Pretzel is tasty") and image-recognition (cat vs. car) models as black boxes]

SLIDE 5

How Do Models Look inside Boxes?

<Example: Sentiment Analysis>

"Pretzel is tasty" (text) → Model → ☺ vs. ☹ (positive vs. negative)

SLIDE 6

How Do Models Look inside Boxes?

<Example: Sentiment Analysis>

DAG of Operators: "Pretzel is tasty" → Tokenizer → Char Ngram / Word Ngram → Concat → Logistic Regression → ☺ vs. ☹

  • Featurizers: Tokenizer, Char Ngram, Word Ngram, Concat
  • Predictor: Logistic Regression

SLIDE 7

How Do Models Look inside Boxes?

<Example: Sentiment Analysis>

DAG of Operators for "Pretzel is tasty":
  • Tokenizer: split text into tokens
  • Char Ngram / Word Ngram: extract N-grams
  • Concat: merge the two vectors
  • Logistic Regression: compute the final score
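To make the DAG concrete, here is a minimal Python sketch of the four operator kinds; every function name and weight below is illustrative, not ML.Net's actual API:

```python
# Illustrative sketch of the sentiment-analysis DAG:
# Tokenizer -> {CharNgram, WordNgram} -> Concat -> LogisticRegression.
import math
import re
from collections import Counter

def tokenize(text):
    """Tokenizer: split text into lowercase tokens."""
    return re.findall(r"\w+", text.lower())

def char_ngrams(tokens, n=3):
    """CharNgram: extract character n-grams from each token."""
    feats = Counter()
    for tok in tokens:
        for i in range(len(tok) - n + 1):
            feats["c:" + tok[i:i + n]] += 1
    return feats

def word_ngrams(tokens, n=2):
    """WordNgram: extract word n-grams."""
    feats = Counter()
    for i in range(len(tokens) - n + 1):
        feats["w:" + " ".join(tokens[i:i + n])] += 1
    return feats

def concat(a, b):
    """Concat: merge two sparse feature vectors into one."""
    merged = Counter(a)
    merged.update(b)
    return merged

def logistic_regression(feats, weights, bias=0.0):
    """Predictor: sigmoid of a sparse dot product."""
    z = bias + sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

tokens = tokenize("Pretzel is tasty")
feats = concat(char_ngrams(tokens), word_ngrams(tokens))
score = logistic_regression(feats, {"w:is tasty": 2.0}, bias=-0.5)
```

A real pipeline stores trained weights per feature; here a single hypothetical weight suffices to show how the featurized vector flows into the predictor.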

SLIDE 8

Many Models Have Similar Structures

  • Many parts of a model can be re-used in other models
  • Customer personalization, templates, transfer learning
  • Identical sets of operators with different parameters

SLIDE 9

Outline

  • Prediction Serving Systems
  • Limitations of Black Box Approaches
  • PRETZEL: White-box Prediction Serving System
  • Evaluation
  • Conclusion
SLIDE 10

Limitation 1: Resource Waste

  • Resources are isolated across black boxes
  • 1. Unable to share memory space
    → wastes memory maintaining duplicate objects (despite similarities between models)
  • 2. No coordination of CPU resources between boxes
    → serving many models can use too many threads on one machine

SLIDE 11

Limitation 2: Inconsideration for Ops' Characteristics

(Pipeline: Tokenizer → Char Ngram / Word Ngram → Concat → Log Reg)

  • 1. Operators have different performance characteristics
  • Concat materializes a vector
  • LogReg takes only 0.3% of the latency (contrary to the training phase)
  • 2. There can be a better plan if such characteristics are considered
  • Re-use the existing vectors
  • Apply in-place updates in LogReg

Latency breakdown (%): CharNgram 23.1, WordNgram 34.2, Concat 32.7, LogReg 0.3, Others 9.6
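The "better plan" can be sketched as follows: instead of Concat materializing a merged vector for LogReg, the linear score is accumulated in place across the featurizers' outputs. This is an illustration (not PRETZEL's code) under the assumption that the two feature segments use disjoint keys:

```python
# Sketch: fold the linear predictor into each featurizer's output and
# accumulate the score in place, so Concat's intermediate vector is never built.
import math

def score_with_concat(char_feats, word_feats, weights, bias):
    """Naive plan: materialize the concatenated vector, then dot product."""
    merged = dict(char_feats)
    merged.update(word_feats)          # Concat materializes a new vector
    z = bias + sum(weights.get(f, 0.0) * v for f, v in merged.items())
    return 1.0 / (1.0 + math.exp(-z))

def score_in_place(char_feats, word_feats, weights, bias):
    """Better plan: no intermediate vector; accumulate z across segments."""
    z = bias
    for feats in (char_feats, word_feats):   # Concat removed
        for f, v in feats.items():
            z += weights.get(f, 0.0) * v     # in-place update of the score
    return 1.0 / (1.0 + math.exp(-z))
```

Both plans compute the same score; the second simply avoids the allocation and copy that dominate the latency breakdown above.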

SLIDE 12

Limitation 3: Lazy Initialization

  • ML.Net initializes code and memory lazily (efficient in the training phase)
  • Run 250 Sentiment Analysis models 100 times
    → cold: first execution / hot: average of the remaining 99
  • Long-tail latency in the cold case (operators run 13x to 444x slower than hot)
  • Causes: code analysis, just-in-time (JIT) compilation, memory allocation, etc.
  • Difficult to provide a strong Service Level Agreement (SLA)

SLIDE 13

Outline

  • (Black-box) Prediction Serving Systems
  • Limitations of Black Box Approaches
  • PRETZEL: White-box Prediction Serving System
  • Evaluation
  • Conclusion
SLIDE 14

PRETZEL: White-box Prediction Serving

  • We analyze models to optimize their internal execution
  • We let models co-exist on the same runtime, sharing computation and memory resources
  • We optimize models in two directions:
  • 1. End-to-end optimizations
  • 2. Multi-model optimizations
SLIDE 15

End-to-End Optimizations

Optimize the execution of individual models from start to end

  • 1. [Ahead-of-time Compilation]
    Compile operators' code in advance → no JIT overhead
  • 2. [Vector Pooling]
    Pre-allocate data structures → no memory allocation on the data path
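Vector pooling can be sketched as follows (illustrative Python, not PRETZEL's implementation): buffers are allocated once at load time, and the data path only acquires, resets, and releases them:

```python
# Minimal vector-pool sketch: pre-allocate buffers so that serving a request
# performs no memory allocation on the data path.
from array import array

class VectorPool:
    def __init__(self, num_buffers, size):
        # Allocate all buffers ahead of time, before serving starts.
        self._free = [array("d", [0.0] * size) for _ in range(num_buffers)]

    def acquire(self):
        buf = self._free.pop()          # reuse a pre-allocated buffer
        for i in range(len(buf)):
            buf[i] = 0.0                # reset contents instead of reallocating
        return buf

    def release(self, buf):
        self._free.append(buf)          # return the buffer for the next request

pool = VectorPool(num_buffers=4, size=8)
vec = pool.acquire()
vec[0] = 1.0                            # an operator writes features here
pool.release(vec)
```

A real pool would also size buffers from the plan's "maximum vector size" statistics and handle exhaustion; this sketch only shows the allocate-once, reuse-forever pattern.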

SLIDE 16

Multi-model Optimizations

Share computation and memory across models

  • 1. [Object Store]
    Share operators' parameters/weights → maintain only one copy
  • 2. [Sub-plan Materialization]
    Reuse intermediate results computed by other models → save computation
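A toy sketch of both ideas, with hypothetical names: an object store that keeps one copy of parameters shared across models, and a cache keyed by (stage, input) so models with a common sub-plan compute it once:

```python
# Illustrative multi-model sharing: one parameter copy, memoized sub-plans.

class ObjectStore:
    """Deduplicates operator parameters/weights across models."""
    def __init__(self):
        self._objects = {}

    def put(self, key, params):
        # Models registering identical parameters get the same shared object.
        return self._objects.setdefault(key, params)

class SubPlanCache:
    """Memoizes intermediate results shared by models with a common prefix."""
    def __init__(self):
        self._cache = {}

    def run(self, stage_id, inp, fn):
        k = (stage_id, inp)
        if k not in self._cache:
            self._cache[k] = fn(inp)    # computed once, reused by other models
        return self._cache[k]

store = ObjectStore()
w1 = store.put("tokenizer-en", {"lower": True})
w2 = store.put("tokenizer-en", {"lower": True})   # second model: same object

cache = SubPlanCache()
calls = []
feats1 = cache.run("S1", "Pretzel is tasty",
                   lambda t: calls.append(t) or t.split())
feats2 = cache.run("S1", "Pretzel is tasty", lambda t: t.split())  # cache hit
```

The point of the sketch: `w1` and `w2` are one object in memory, and the second model's stage S1 never executes because the first model's result is reused.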

SLIDE 17

System Components

  • 1. Flour: Intermediate Representation
    (e.g., var fContext = ...; var Tokenizer = ...; return fPrgm.Plan();)
  • 2. Oven: Compiler/Optimizer
  • 3. Runtime: Executes inference queries (Object Store, Scheduler)
  • 4. FrontEnd: Handles user requests

SLIDE 18

Prediction Serving with PRETZEL

  • 1. Offline
  • Analyze the structural information of models
  • Build a ModelPlan for optimal execution
  • Register the ModelPlan to the Runtime
  • 2. Online
  • Handle prediction requests
  • Coordinate CPU & memory resources

[Diagram: Model → Analyze → Model Plan → Register → Runtime / FrontEnd]

SLIDE 19

System Design: Offline Phase

(Pipeline: Tokenizer → Char Ngram / Word Ngram → Concat → Log Reg)

  • 1. Translate the <Model> into a <Flour Program>:

var fContext = new FlourContext(...);
var tTokenizer = fContext.CSV
                         .FromText(fields, fieldsType, sep)
                         .Tokenize();
var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
var fPrgrm = tCNgram
             .Concat(tWNgram)
             .ClassifierBinaryLinear(cParams);
return fPrgrm.Plan();

SLIDE 20

System Design: Offline Phase

  • 2. The Oven optimizer/compiler builds the <Model Plan> from the <Flour Program>
  • Rule-based optimizer: push the linear predictor down & remove Concat
  • Group operators into stages (Stage 1, Stage 2)
  • Output: a Logical DAG of stages (S1, S2)
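The two steps above (apply rewrite rules, then group operators into stages) can be sketched as a toy rule-based optimizer; the single rule and the stage boundary chosen here are invented for illustration, and Oven's real rules and cost model are richer:

```python
# Toy rule-based plan optimizer: rewrite the operator list, then cut it
# into stages.

def remove_concat_push_predictor(ops):
    """Rule: drop Concat and let the linear predictor read both inputs."""
    return [op for op in ops if op != "Concat"]

def group_into_stages(ops, boundary="WordNgram"):
    """Cut the operator list into stages after the given boundary op."""
    stages, current = [], []
    for op in ops:
        current.append(op)
        if op == boundary:              # stage boundary reached
            stages.append(current)
            current = []
    if current:
        stages.append(current)
    return stages

plan = remove_concat_push_predictor(
    ["Tokenizer", "CharNgram", "WordNgram", "Concat", "LogReg"])
stages = group_into_stages(plan)        # S1 = featurizers, S2 = predictor
```

Applied to the sentiment-analysis pipeline, the rule removes Concat and the grouping yields two stages, matching the S1/S2 split shown on the slide.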

SLIDE 21

System Design: Offline Phase

  • 2. The Oven optimizer/compiler builds the <Model Plan>, which contains:
  • The Logical DAG of stages (S1, S2)
  • Parameters (e.g., dictionary, N-gram length)
  • Statistics (e.g., dense vs. sparse, maximum vector size)

SLIDE 22

System Design: Offline Phase

  • 3. The ModelPlan is registered to the Runtime
  • 1. Store parameters & the mapping between logical stages in the Object Store
  • 2. Find the most efficient physical implementation of each stage using the parameters & statistics

SLIDE 23

System Design: Offline Phase

  • 3. The ModelPlan is registered to the Runtime
  • 1. Store parameters & the mapping between logical stages in the Object Store
  • 2. Find the most efficient physical implementation of each stage using the parameters & statistics
    (e.g., N-gram length 1 vs. 3; sparse vs. dense vectors)
  • 3. Register the selected physical stages to the Catalog

SLIDE 24

System Design: Online Phase

  • 1. A prediction request arrives: <Model1, "Pretzel is tasty">
  • 2. Instantiate the physical stages along with their parameters from the Object Store
  • 3. Execute the stages using thread pools managed by the Scheduler
  • 4. Send the result back to the client
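The four online steps can be sketched as a toy serving loop; the catalog, store contents, and stage bodies below are hypothetical stand-ins:

```python
# Toy online phase: look up the model's stages, bind shared parameters,
# run the stages on a shared worker pool, return the result.
from concurrent.futures import ThreadPoolExecutor

CATALOG = {"Model1": ["S1", "S2"]}                        # logical stages
OBJECT_STORE = {"S1": {"ngram": 2}, "S2": {"bias": -0.5}}  # shared parameters

STAGE_IMPLS = {
    "S1": lambda text, p: text.split(),                   # featurize
    "S2": lambda toks, p: float(len(toks)) + p["bias"],   # score
}

def serve(model, text, pool):
    # 2. Instantiate physical stages with parameters from the object store.
    stages = [(STAGE_IMPLS[s], OBJECT_STORE[s]) for s in CATALOG[model]]
    # 3. Execute the stages on the shared thread pool (the "scheduler").
    data = text
    for fn, params in stages:
        data = pool.submit(fn, data, params).result()
    return data                                           # 4. back to client

with ThreadPoolExecutor(max_workers=4) as pool:
    result = serve("Model1", "Pretzel is tasty", pool)
```

The design point this illustrates: the per-request path does no model loading or compilation, only a catalog lookup and stage execution on pre-warmed threads.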

SLIDE 25

Outline

  • (Black-box) Prediction Serving Systems
  • Limitations of Black Box Approaches
  • PRETZEL: White-box Prediction Serving System
  • Evaluation
  • Conclusion
SLIDE 26

Evaluation

  • Q. How much does PRETZEL improve performance over black-box approaches?
  • In terms of latency, memory, and throughput
  • 500 models from the Microsoft Machine Learning Team
  • 250 Sentiment Analysis (memory-bound)
  • 250 Attendee Count (compute-bound)
  • System configuration:
  • 16-core CPU, 32 GB RAM
  • Windows 10, .Net Core 2.0
SLIDE 27

Evaluation: Latency

  • Micro-benchmark (no server-client communication)
  • Score 250 Sentiment Analysis models, 100 times each
  • Compare ML.Net vs. PRETZEL

Latency (ms):        ML.Net   PRETZEL
  P99 (hot)             0.6       0.2   (3x better)
  P99 (cold)            8.1       0.8   (10x better)
  Worst (cold)        280.2       6.2   (45x better)

[Figure: CDF of latency (ms, log-scaled) for ML.Net and PRETZEL, hot and cold]

SLIDE 28

Evaluation: Memory

  • Measure cumulative memory usage after loading 250 models
  • Attendee Count models (smaller than Sentiment Analysis models)
  • 4 settings for comparison:

  Setting                        Shared Objects   Shared Runtime
  ML.Net + Clipper                     -                -
  ML.Net                               -                ✓
  PRETZEL without Object Store         -                ✓
  PRETZEL                              ✓                ✓

[Figure: cumulative memory usage (log-scaled) vs. number of pipelines (50-250); at 250 pipelines PRETZEL uses 164 MB vs. 2.9 GB, 3.7 GB, and 9.7 GB for the other settings (25x-62x better)]

SLIDE 29

Evaluation: Throughput

  • Micro-benchmark
  • Score 250 Attendee Count models, 1000 times each
  • Request 1000 queries in a batch
  • Compare ML.Net vs. PRETZEL

[Figure: throughput (K QPS) vs. number of CPU cores; PRETZEL delivers 10x the throughput of ML.Net and scales close to ideal]

More results in the paper!

SLIDE 30

Conclusion

  • PRETZEL is the first white-box prediction serving system for ML pipelines
  • By using models' structural information, we enable two types of optimizations:
  • End-to-end optimizations generate an efficient execution plan for each model
  • Multi-model optimizations let models share computation and memory resources
  • Our evaluation shows that PRETZEL can improve performance compared to black-box systems (e.g., ML.Net):
  • Lower latency and memory footprint
  • Higher resource utilization and throughput
  • Many of the external optimizations used by Clipper are orthogonal to PRETZEL
SLIDE 31

Thank you! Questions?