PRETZEL:Opening the Black Box of Machine Learning Prediction Serving - - PowerPoint PPT Presentation
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
Presented by Qinyuan Sun. Slides are modified from first author Yunseong Lee's slides.
Outline
- Prediction Serving Systems
- Limitations of Black Box Approaches
- PRETZEL: White-box Prediction Serving System
- Evaluation
- Conclusion
Machine Learning Prediction Serving
- 1. Models are learned from data (training)
- 2. Models are deployed and served together (prediction serving)
- Performance goals: 1) low latency, 2) high throughput, 3) minimal resource usage
- Assumption: models are black boxes
- Re-use the same code as in the training phase
- Encapsulate all operations into a single function call (e.g., predict())
- Apply external optimizations around the box
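The black-box contract above can be sketched as follows. This is an illustrative Python sketch, not the actual Clipper/ML.Net API: `BlackBoxModel`, `serve`, and the caching layer are assumed names. The point is that the serving layer sees only `predict()`, so it can optimize around the model (e.g., result caching) but never inside it.

```python
from functools import lru_cache

class BlackBoxModel:
    """Wraps training-time code behind a single predict() entry point."""
    def __init__(self, pipeline):
        self._pipeline = pipeline  # opaque: the server cannot inspect or rewrite it

    def predict(self, text):
        return self._pipeline(text)

# hypothetical model; the serving layer only ever calls predict()
model = BlackBoxModel(lambda text: "positive" if "tasty" in text else "negative")

@lru_cache(maxsize=1024)  # external optimization: result caching around the box
def serve(text):
    return model.predict(text)

print(serve("Pretzel is tasty"))  # → positive
```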
ML Prediction Serving Systems: State-of-the-art
- Examples: Clipper, TensorFlow Serving, ML.Net
- A prediction serving system hosts black-box models, e.g., text analysis ("Pretzel is tasty" → positive) or image recognition (cat vs. car)
- External optimizations: result caching, ensembles, replication, request batching
How do Models Look inside Boxes?
<Example: Sentiment Analysis>
- Input: "Pretzel is tasty" (text) → Model → positive vs. negative
- A model is a DAG of operators:
- Featurizers: Tokenizer → Char Ngram, Word Ngram → Concat
- Predictor: Logistic Regression
- Operator roles: Tokenizer splits text into tokens; Char/Word Ngram extract N-grams; Concat merges the two vectors; Logistic Regression computes the final score
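The DAG above can be sketched end-to-end in a few lines. This is a hypothetical Python rendering of the slide's pipeline; the operator names mirror the slide, not any real ML.Net API, and the toy weights are made up for illustration.

```python
import math

def tokenizer(text):                      # split text into tokens
    return text.lower().split()

def char_ngrams(tokens, n=3):             # extract character N-grams
    joined = " ".join(tokens)
    return [joined[i:i + n] for i in range(len(joined) - n + 1)]

def word_ngrams(tokens, n=2):             # extract word N-grams
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concat(a, b):                         # merge two feature vectors
    return a + b

def logistic_regression(features, weights, bias=0.0):  # compute final score
    score = bias + sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))

tokens = tokenizer("Pretzel is tasty")
features = concat(char_ngrams(tokens), word_ngrams(tokens))
prob = logistic_regression(features, weights={"is tasty": 2.0})  # toy weight
print("positive" if prob > 0.5 else "negative")  # → positive
```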
Many Models Have Similar Structures
- Many parts of a model can be re-used in other models
- e.g., customer personalization, templates, transfer learning
- Identical sets of operators, only with different parameters
Outline
- Prediction Serving Systems
- Limitations of Black Box Approaches
- PRETZEL: White-box Prediction Serving System
- Evaluation
- Conclusion
Limitation 1: Resource Waste
- Resources are isolated across black boxes
- 1. Unable to share memory space → wastes memory maintaining duplicate objects (despite similarities between models)
- 2. No coordination of CPU resources between boxes → serving many models on one machine can use too many threads
Limitation 2: Disregard for Operators' Characteristics
- 1. Operators have different performance characteristics
- Concat materializes a vector
- LogReg takes only 0.3% of the latency (contrary to the training phase)
- 2. A better plan exists if such characteristics are considered
- Re-use the existing vectors
- Apply in-place updates in LogReg
- Latency breakdown: CharNgram 23.1%, WordNgram 34.2%, Concat 32.7%, LogReg 0.3%, Others 9.6%
Limitation 3: Lazy Initialization
- ML.Net initializes code and memory lazily (efficient in the training phase)
- Experiment: run 250 Sentiment Analysis models 100 times each (cold: first execution; hot: average of the rest)
- Long-tail latency in the cold case: operators run 13x to 444x slower than hot
- Causes: code analysis, just-in-time (JIT) compilation, memory allocation, etc.
- Difficult to provide strong Service-Level Agreements (SLAs)
Outline
- (Black-box) Prediction Serving Systems
- Limitations of Black Box Approaches
- PRETZEL: White-box Prediction Serving System
- Evaluation
- Conclusion
PRETZEL: White-box Prediction Serving
- We analyze models to optimize their internal execution
- We let models co-exist on the same runtime, sharing computation and memory resources
- We optimize models in two directions:
- 1. End-to-end optimizations
- 2. Multi-model optimizations
End-to-End Optimizations
Optimize the execution of individual models from start to end
- 1. [Ahead-of-time Compilation] Compile operators' code in advance → no JIT overhead
- 2. [Vector Pooling] Pre-allocate data structures → no memory allocation on the data path
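The vector-pooling idea can be sketched as follows. This is a minimal illustrative sketch with an assumed interface (`VectorPool`, `acquire`, `release`), not PRETZEL's actual .NET implementation: all buffers are allocated once up front, so the hot data path only pops and pushes recycled buffers.

```python
class VectorPool:
    def __init__(self, num_vectors, size):
        # all allocation happens up front, before serving starts
        self._free = [bytearray(size) for _ in range(num_vectors)]

    def acquire(self):
        return self._free.pop()          # O(1), no allocation on the data path

    def release(self, vec):
        for i in range(len(vec)):        # scrub the buffer before recycling it
            vec[i] = 0
        self._free.append(vec)

pool = VectorPool(num_vectors=4, size=1024)
buf = pool.acquire()                     # used as an operator's output vector
buf[0] = 42
pool.release(buf)                        # returned to the pool, not the allocator
```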
Multi-model Optimizations
Share computation and memory across models
- 1. [Object Store] Share operators' parameters/weights → maintain only one copy
- 2. [Sub-plan Materialization] Reuse intermediate results computed by other models → save computation
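Sub-plan materialization can be sketched conceptually as a cache keyed by (stage, input): when two models share a prefix of stages, the second model reuses the intermediate result the first already computed. All names below (`featurize`, `run_stage`, the cache layout) are illustrative, not PRETZEL's actual design.

```python
call_count = {"featurize": 0}

def featurize(text):                       # shared stage (e.g., Tokenizer + N-grams)
    call_count["featurize"] += 1
    return tuple(text.lower().split())

materialized = {}                          # (stage_id, input) -> cached output

def run_stage(stage_fn, stage_id, inp):
    key = (stage_id, inp)
    if key not in materialized:            # compute once...
        materialized[key] = stage_fn(inp)
    return materialized[key]               # ...reuse across models

def model_a(text):                         # two models sharing the same prefix
    return len(run_stage(featurize, "feat", text))

def model_b(text):
    return run_stage(featurize, "feat", text)[:1]

model_a("Pretzel is tasty")
model_b("Pretzel is tasty")
print(call_count["featurize"])             # → 1: computed once, shared twice
```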
System Components
- 1. Flour: intermediate representation (e.g., var fContext = ...; var tTokenizer = ...; return fPrgrm.Plan();)
- 2. Oven: compiler/optimizer
- 3. Runtime: executes inference queries (Object Store, Scheduler)
- 4. FrontEnd: handles user requests
Prediction Serving with PRETZEL
- 1. Offline phase
- Analyze structural information of models
- Build a ModelPlan for optimal execution
- Register the ModelPlan to the Runtime
- 2. Online phase
- Handle prediction requests
- Coordinate CPU & memory resources
System Design: Offline Phase
- 1. Translate the model (Tokenizer → Char Ngram, Word Ngram → Concat → Log Reg) into a Flour program

<Flour Program>
    var fContext = new FlourContext(...);
    var tTokenizer = fContext.CSV
        .FromText(fields, fieldsType, sep)
        .Tokenize();
    var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);
    var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);
    var fPrgrm = tCNgram
        .Concat(tWNgram)
        .ClassifierBinaryLinear(cParams);
    return fPrgrm.Plan();
System Design: Offline Phase
- 2. The Oven optimizer/compiler builds a ModelPlan from the Flour program
- A rule-based optimizer rewrites the logical DAG, e.g., push the linear predictor and remove Concat
- Operators are grouped into stages (Stage 1, Stage 2)
- The resulting ModelPlan contains:
- Logical DAG (the staged operator graph)
- Parameters (e.g., dictionary, N-gram length)
- Statistics (e.g., dense vs. sparse, maximum vector size)
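The "push linear predictor & remove Concat" rule exploits linearity: a linear model over concat(a, b) equals the sum of two partial dot products, so the concatenated vector never needs to be materialized. A minimal worked sketch (illustrative numbers and helper names, not Oven's actual code):

```python
# w · concat(a, b) == w_a · a + w_b · b, so Concat need not materialize a vector.
def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

a, b = [1.0, 2.0], [3.0]             # outputs of two featurizer branches
w = [0.5, 0.5, 2.0]                  # weights over the concatenated vector

naive = dot(w, a + b)                # materializes concat(a, b) first

w_a, w_b = w[:len(a)], w[len(a):]    # the rewrite: split the weights instead
fused = dot(w_a, a) + dot(w_b, b)    # no intermediate vector is built

assert naive == fused                # both give 0.5 + 1.0 + 6.0 = 7.5
```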
System Design: Offline Phase
- 3. The ModelPlan is registered to the Runtime
- 1. Store parameters and the mapping between logical stages in the Object Store
- 2. Find the most efficient physical implementation using parameters & statistics (e.g., N-gram length 1 vs. 3, sparse vs. dense vectors)
- 3. Register the selected physical stages to the Catalog
System Design: Online Phase
- 1. A prediction request arrives, e.g., <Model1, "Pretzel is tasty">
- 2. Instantiate physical stages along with parameters from the Object Store
- 3. Execute the stages using thread pools managed by the Scheduler
- 4. Send the result back to the client
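The online phase above can be sketched as a lookup-then-execute loop. Everything here is an assumed, simplified layout (a dict-based object store and catalog, toy parameters): a real runtime would schedule the stages on thread pools rather than run them inline.

```python
object_store = {                        # written during the offline phase
    ("model1", "S1"): {"vocab": {"pretzel", "tasty"}},
    ("model1", "S2"): {"weights": {"tasty": 2.0}, "bias": -1.0},
}

catalog = {                             # logical stage -> selected physical impl
    "S1": lambda text, p: [t for t in text.lower().split() if t in p["vocab"]],
    "S2": lambda feats, p: p["bias"] + sum(p["weights"].get(f, 0.0) for f in feats),
}

def serve(model_id, text, stages=("S1", "S2")):
    data = text
    for s in stages:                    # execute stages in DAG order; a real
        params = object_store[(model_id, s)]   # runtime schedules these on
        data = catalog[s](data, params)        # shared thread pools
    return data

score = serve("model1", "Pretzel is tasty")
print("positive" if score > 0 else "negative")  # → positive
```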
Outline
- (Black-box) Prediction Serving Systems
- Limitations of Black Box Approaches
- PRETZEL: White-box Prediction Serving System
- Evaluation
- Conclusion
Evaluation
- Q: How does PRETZEL improve performance over black-box approaches, in terms of latency, memory, and throughput?
- 500 models from a Microsoft machine learning team:
- 250 Sentiment Analysis models (memory-bound)
- 250 Attendee Count models (compute-bound)
- System configuration: 16-core CPU, 32 GB RAM, Windows 10, .NET Core 2.0
Evaluation: Latency
- Micro-benchmark (no server-client communication)
- Score 250 Sentiment Analysis models, 100 times each
- Compare ML.Net vs. PRETZEL
- Results (latency CDF, log-scaled, in ms):
- P99 (hot): ML.Net 0.6 vs. PRETZEL 0.2 (3x better)
- P99 (cold): ML.Net 8.1 vs. PRETZEL 0.8 (10x better)
- Worst (cold): ML.Net 280.2 vs. PRETZEL 6.2 (45x better)
Evaluation: Memory
- Measure cumulative memory usage after loading 250 Attendee Count models (smaller than Sentiment Analysis)
- 4 settings for comparison (shared runtime / shared objects):
- ML.Net + Clipper (nothing shared): 9.7 GB
- ML.Net (shared runtime): 3.7 GB
- PRETZEL without Object Store (shared runtime): 2.9 GB
- PRETZEL (shared runtime + shared objects): 164 MB
- PRETZEL uses 25x less memory than ML.Net and 62x less than ML.Net + Clipper
Evaluation: Throughput
- Micro-benchmark
- Score 250 Attendee Count models, 1000 times each
- Requests sent in batches of 1000 queries
- Compare ML.Net vs. PRETZEL while scaling the number of CPU cores
- PRETZEL achieves up to 10x higher throughput and scales close to ideally with the number of cores
- More results in the paper!
Conclusion
- PRETZEL is the first white-box prediction serving system for ML pipelines
- By using models' structural information, we enable two types of optimizations:
- End-to-end optimizations generate efficient execution plans for a model
- Multi-model optimizations let models share computation and memory resources
- Our evaluation shows that PRETZEL improves performance compared to black-box systems (e.g., ML.Net):
- Decreased latency and memory footprint
- Increased resource utilization and throughput
- Many external optimizations used by Clipper are orthogonal to PRETZEL and could be applied on top