
SLIDE 1

Willump: A Statistically-Aware End-to-end Optimizer for ML Inference

Peter Kraft, Daniel Kang, Deepak Narayanan, Shoumik Palkar, Peter Bailis, Matei Zaharia


SLIDE 2

Problem: ML Inference

  • Often performance-critical.
  • Recent focus on tools for ML prediction serving.


SLIDE 3

A Common Bottleneck: Feature Computation

  • Many applications are bottlenecked by feature computation.
  • A pipeline of transformations computes numerical features from data for the model.

Receive Raw Data → Compute Features → Predict With Model

SLIDE 4

A Common Bottleneck: Feature Computation

  • Feature computation is the bottleneck when models are inexpensive—boosted trees, not DNNs.
  • Common on tabular/structured data!
SLIDE 5

A Common Bottleneck: Feature Computation

Source: Pretzel (OSDI ‘18)

[Chart: production Microsoft sentiment analysis pipeline. Feature computation takes >99% of the time; model run time is a sliver.]

SLIDE 6

Current State-of-the-art

  • Apply traditional serving optimizations, e.g. caching (Clipper), compiler optimizations (Pretzel).
  • Neglect unique statistical properties of ML apps.

SLIDE 7

Statistical Properties of ML

Amenability to approximation


SLIDE 8

Statistical Properties of ML

Amenability to approximation


Easy input: Definitely not a dog. Hard input: Maybe a dog?

SLIDE 9

Statistical Properties of ML

Amenability to approximation

Existing Systems: Use Expensive Model for Both

Easy input: Definitely not a dog. Hard input: Maybe a dog?

SLIDE 10

Statistical Properties of ML

Amenability to approximation

Statistically-Aware Systems: Use the cheap model on the bucket, the expensive model on the cat.

Easy input: Definitely not a dog. Hard input: Maybe a dog?

SLIDE 11

Statistical Properties of ML

  • Model is often part of a bigger app (e.g. top-K query)


SLIDE 12

Statistical Properties of ML

  • Model is often part of a bigger app (e.g. top-K query)

Artist              Score  Rank
Beatles             9.7    1
Bruce Springsteen   9.5    2
…                   …      …
Justin Bieber       5.6    999
Nickelback          4.1    1000

Problem: Return top 10 artists.

SLIDE 13

Statistical Properties of ML

  • Model is often part of a bigger app (e.g. top-K query)

Artist              Score  Rank
Beatles             9.7    1
Bruce Springsteen   9.5    2
…                   …      …
Justin Bieber       5.6    999
Nickelback          4.1    1000

Existing Systems: Use expensive model for everything!

SLIDE 14

Statistical Properties of ML

  • Model is often part of a bigger app (e.g. top-K query)

Artist              Score  Rank
Beatles             9.7    1
Bruce Springsteen   9.5    2
…                   …      …
Justin Bieber       5.6    999
Nickelback          4.1    1000

Statistically-aware Systems:
High-value: Rank precisely, return. Low-value: Approximate, discard.

SLIDE 15

Prior Work: Statistically-Aware Optimizations

  • Statistically-aware optimizations exist in literature.
  • Always application-specific and custom-built.
  • Never automatic!

Source: Cheng et al. (DLRS ’16), Kang et al. (VLDB ’17)

SLIDE 16

ML Inference Dilemma

  • ML inference systems: easy to use, but slow.
  • Statistically-aware systems: fast, but require a lot of work to implement.

SLIDE 17

Can an ML inference system be fast and easy to use?

SLIDE 18

Willump: Overview

  • Statistically-aware optimizer for ML inference.
  • Targets feature computation!
  • Automatic, model-agnostic, statistically-aware optimizations.
  • 10x throughput and latency improvements.

SLIDE 19

Outline

  • System Overview
  • Optimization 1: End-to-end Cascades
  • Optimization 2: Top-K Query Approximation
  • Evaluation
SLIDE 20

Willump: Goals

  • Automatically maximize performance of ML inference applications whose performance bottleneck is feature computation.

SLIDE 21

System Overview

Input Pipeline:

def pipeline(x1, x2):
    input = lib.transform(x1, x2)
    preds = model.predict(input)
    return preds

SLIDE 22

System Overview

Input Pipeline:

def pipeline(x1, x2):
    input = lib.transform(x1, x2)
    preds = model.predict(input)
    return preds

Willump Optimization:
  Infer Transformation Graph

SLIDE 23

System Overview

Input Pipeline:

def pipeline(x1, x2):
    input = lib.transform(x1, x2)
    preds = model.predict(input)
    return preds

Willump Optimization:
  Infer Transformation Graph
  Statistically-Aware Optimizations:
  • 1. End-To-End Cascades
  • 2. Top-K Query Approximation

SLIDE 24

System Overview

Input Pipeline:

def pipeline(x1, x2):
    input = lib.transform(x1, x2)
    preds = model.predict(input)
    return preds

Willump Optimization:
  Infer Transformation Graph
  Statistically-Aware Optimizations:
  • 1. End-To-End Cascades
  • 2. Top-K Query Approximation
  Compiler Optimizations (Weld—Palkar et al. VLDB ‘18)
SLIDE 25

System Overview

Input Pipeline:

def pipeline(x1, x2):
    input = lib.transform(x1, x2)
    preds = model.predict(input)
    return preds

Willump Optimization:
  Infer Transformation Graph
  Statistically-Aware Optimizations:
  • 1. End-To-End Cascades
  • 2. Top-K Query Approximation
  Compiler Optimizations (Weld—Palkar et al. VLDB ‘18)

Optimized Pipeline:

def willump_pipeline(x1, x2):
    preds = compiled_code(x1, x2)
    return preds
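To make the graph-inference step concrete, here is a minimal sketch of what a transformation graph could look like; the class and helper names are our own illustration, not Willump's actual internals.

# Illustrative sketch only (hypothetical names, not Willump's API):
# each node computes one transformation; the graph runs in topological
# order, so the optimizer can reason about per-node costs and execute
# or skip subsets of nodes.
class TransformNode:
    def __init__(self, name, fn, inputs):
        self.name = name        # output this node produces
        self.fn = fn            # the transformation function
        self.inputs = inputs    # names of upstream outputs / raw inputs

def execute_graph(nodes, raw_inputs):
    """Run nodes (assumed topologically sorted) and collect results."""
    results = dict(raw_inputs)
    for node in nodes:
        args = [results[name] for name in node.inputs]
        results[node.name] = node.fn(*args)
    return results

Once the pipeline is in this form, the cascades optimization can execute only a subset of nodes, and the compiler can fuse the rest.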
SLIDE 26

Outline

  • System Overview
  • Optimization 1: End-to-end Cascades
  • Optimization 2: Top-K Query Approximation
  • Evaluation
SLIDE 27

Background: Model Cascades

  • Classify “easy” inputs with cheap model.
  • Cascade to expensive model for “hard” inputs.


Easy input: Definitely not a dog. Hard input: Maybe a dog?

SLIDE 28

Background: Model Cascades

  • Used for image classification, object detection.
  • Existing systems application-specific and custom-built.

Source: Viola-Jones (CVPR ’01), Kang et al. (VLDB ’17)

SLIDE 29

Our Optimization: End-to-end Cascades

  • Compute only some features for “easy” data inputs; cascade to computing all for “hard” inputs.
  • Automatic and model-agnostic, unlike prior work:
    ○ Estimates for runtime performance & accuracy of a feature set
    ○ Efficient search process for tuning parameters

SLIDE 30

End-to-end Cascades: Original Model

Compute All Features → Model → Prediction

SLIDE 31

Cascades Optimization

End-to-end Cascades: Approximate Model

Original:    Compute All Features → Model → Prediction
Approximate: Compute Selected Features → Approximate Model → Prediction

SLIDE 32

Cascades Optimization

End-to-end Cascades: Confidence

Original:    Compute All Features → Model → Prediction
Approximate: Compute Selected Features → Approximate Model → Confidence > Threshold? Yes → Prediction

SLIDE 33

Cascades Optimization

End-to-end Cascades: Final Pipeline

Compute Selected Features → Approximate Model → Confidence > Threshold?
  Yes → Prediction
  No  → Compute Remaining Features → Original Model → Prediction
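In code, the final pipeline behaves roughly like this minimal sketch; it assumes a scikit-learn-style classifier with predict_proba, and compute_selected_features / compute_remaining_features are hypothetical stand-ins for code Willump generates.

import numpy as np

def cascade_predict(x, approx_model, full_model, threshold):
    # Cheap path: compute only the selected feature subset S.
    selected = compute_selected_features(x)      # hypothetical helper
    probs = approx_model.predict_proba(selected)[0]
    if probs.max() > threshold:                  # confident: "easy" input
        return probs.argmax()
    # Cascade: compute the remaining features, use the original model.
    remaining = compute_remaining_features(x)    # hypothetical helper
    features = np.concatenate([selected, remaining], axis=1)
    return full_model.predict(features)[0]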

SLIDE 34

End-to-end Cascades: Constructing Cascades

  • Construct cascades during model training.
  • Need model training set and an accuracy target.
SLIDE 35

End-to-end Cascades: Selecting Features

Compute Selected Features → Approximate Model → Confidence > Threshold?
  Yes → Prediction
  No  → Compute Remaining Features → Original Model → Prediction

Key question: Which features to select?

SLIDE 36

End-to-end Cascades: Selecting Features

  • Goal: Select features that minimize expected query time given an accuracy target.

SLIDE 37

End-to-end Cascades: Selecting Features

Compute Selected Features → Approximate Model → Confidence > Threshold?
  Yes → Prediction (can approximate query)
  No  → Compute Remaining Features → Original Model → Prediction (can’t approximate query)

Two possibilities for a query: can approximate it or not.

SLIDE 38

End-to-end Cascades: Selecting Features

Compute Selected Features (S) → Approximate Model → Confidence > Threshold? Yes → Prediction

P(Yes) = P(approx); cost of this branch = cost(S)

min_S  P(approx)·cost(S) + P(~approx)·cost(F)    (F = all features)

SLIDE 39

End-to-end Cascades: Selecting Features

Compute Selected Features (S) → Approximate Model → Confidence > Threshold? No → Compute Remaining Features → Original Model → Prediction

P(No) = P(~approx); cost of this branch = cost(F)

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

SLIDE 40

End-to-end Cascades: Selecting Features

Compute Selected Features (S) → Approximate Model → Confidence > Threshold?
  Yes → Prediction (P(Yes) = P(approx), cost = cost(S))
  No  → Compute Remaining Features → Original Model → Prediction (P(No) = P(~approx), cost = cost(F))

min_S  P(approx)·cost(S) + P(~approx)·cost(F)
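For intuition, with illustrative numbers (not from the paper): if P(approx) = 0.8, cost(S) = 1 ms, and cost(F) = 10 ms, the expected query time is 0.8 · 1 + 0.2 · 10 = 2.8 ms, versus 10 ms for always computing all features, roughly a 3.6x speedup. Note that the No branch pays the full cost(F) because S ⊆ F: the selected features are computed first, then the remaining ones.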

SLIDE 41

End-to-end Cascades: Selecting Features

  • Goal: Select feature set S that minimizes query time:

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

SLIDE 42

End-to-end Cascades: Selecting Features

  • Goal: Select feature set S that minimizes query time:

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

  • Approach (search loop sketched below):
    ○ Choose several potential values of c_max.
    ○ For each, find the best feature set S with cost(S) ≤ c_max.
    ○ Train a model & find a cascade threshold for each set.
    ○ Pick the best overall.
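A minimal sketch of that search loop; every helper here (candidate_cost_budgets, select_features, train_approx_model, find_threshold, feature_cost) is a hypothetical stand-in for the steps the next slides describe.

def tune_cascade(train_set, holdout_set, accuracy_target, full_cost):
    best = None
    for c_max in candidate_cost_budgets(full_cost):  # e.g. fractions of full_cost
        S = select_features(train_set, c_max)        # best set with cost(S) <= c_max
        approx_model = train_approx_model(train_set, S)
        threshold, p_approx = find_threshold(        # tuned on held-out data
            approx_model, holdout_set, S, accuracy_target)
        expected = p_approx * feature_cost(S) + (1 - p_approx) * full_cost
        if best is None or expected < best[0]:       # minimize expected query time
            best = (expected, S, approx_model, threshold)
    return best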


SLIDE 48

End-to-end Cascades: Selecting Features

  • Subgoal: Find S minimizing query time if cost(S) = c_max.

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

SLIDE 49

End-to-end Cascades: Selecting Features

  • Subgoal: Find S minimizing query time if cost(S) = c_max.

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

  • Solution:
    ○ Find S maximizing approximate model accuracy.

SLIDE 50

End-to-end Cascades: Selecting Features

  • Subgoal: Find S minimizing query time if cost(S) = c_max.

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

  • Solution:
    ○ Find S maximizing approximate model accuracy.
    ○ Problem: Computing accuracy expensive.

SLIDE 51

End-to-end Cascades: Selecting Features

  • Subgoal: Find S minimizing query time if cost(S) = c_max.

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

  • Solution:
    ○ Find S maximizing approximate model accuracy.
    ○ Problem: Computing accuracy expensive.
    ○ Solution: Estimate accuracy via permutation importance → knapsack problem (sketched below).
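A sketch of the estimate-and-select step, assuming per-feature costs are known from profiling; accuracy() is a hypothetical helper, and Willump's actual implementation differs in detail.

import numpy as np

def permutation_importance(model, X_valid, y_valid, feature_idx):
    base = accuracy(model, X_valid, y_valid)        # hypothetical helper
    X_perm = X_valid.copy()
    np.random.shuffle(X_perm[:, feature_idx])       # break feature-label link
    return base - accuracy(model, X_perm, y_valid)  # accuracy drop = importance

def select_features(importances, costs, c_max):
    # Greedy knapsack: take the best importance-per-unit-cost first.
    order = sorted(range(len(costs)),
                   key=lambda i: importances[i] / costs[i], reverse=True)
    S, total_cost = [], 0.0
    for i in order:
        if total_cost + costs[i] <= c_max:
            S.append(i)
            total_cost += costs[i]
    return S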

SLIDE 52

End-to-end Cascades: Selecting Features

  • Goal: Select feature set S that minimizes query time:

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

  • Approach:
    ○ Choose several potential values of c_max.
    ○ For each, find the best feature set S with cost(S) ≤ c_max.
    ○ Train a model & find a cascade threshold for each set.
    ○ Pick the best overall.

SLIDE 53

End-to-end Cascades: Selecting Features

  • Subgoal: Train model & find cascade threshold for S.

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

  • Solution:
    ○ Compute empirically on held-out data.

SLIDE 54

End-to-end Cascades: Selecting Features

  • Subgoal: Train model & find cascade threshold for S.

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

  • Solution:
    ○ Compute empirically on held-out data.
    ○ Train approximate model from S.

SLIDE 55

End-to-end Cascades: Selecting Features

  • Subgoal: Train model & find cascade threshold for S.

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

  • Solution:
    ○ Compute empirically on held-out data.
    ○ Train approximate model from S.
    ○ Predict the held-out set; determine the cascade threshold empirically from the accuracy target (sketched below).
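A minimal sketch of that empirical procedure (our names; assumes a classifier with predict_proba and the original model's held-out predictions):

import numpy as np

def find_threshold(approx_model, X_holdout_S, y_holdout, full_preds,
                   accuracy_target):
    probs = approx_model.predict_proba(X_holdout_S)
    conf = probs.max(axis=1)
    approx_preds = probs.argmax(axis=1)
    for t in np.linspace(0.5, 1.0, 51):    # smallest threshold that works
        easy = conf > t                    # queries answered approximately
        preds = np.where(easy, approx_preds, full_preds)
        if np.mean(preds == y_holdout) >= accuracy_target:
            return t, easy.mean()          # threshold, estimated P(approx)
    return 1.0, 0.0                        # fallback: never approximate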

SLIDE 56

End-to-end Cascades: Selecting Features

  • Goal: Select feature set S that minimizes query time:

min_S  P(approx)·cost(S) + P(~approx)·cost(F)

  • Approach:
    ○ Choose several potential values of c_max.
    ○ For each, find the best feature set S with cost(S) ≤ c_max.
    ○ Train a model & find a cascade threshold for each set.
    ○ Pick the best overall.

SLIDE 57

End-to-end Cascades: Results

  • Speedups of up to 5x without statistically significant accuracy loss.

  • Full evaluation at end of talk!
SLIDE 58

Outline

  • System Overview
  • Optimization 1: End-to-end Cascades
  • Optimization 2: Top-K Query Approximation
  • Evaluation
SLIDE 59

Top-K Approximation: Query Overview

  • Top-K problem: Rank the K highest-scoring items of a dataset.
  • Top-K example: Find the 10 artists a user would like most (recommender system).

SLIDE 60

Top-K Approximation: Asymmetry

  • High-value items must be predicted, ranked precisely.
  • Low-value items need only be identified as low value.

Artist              Score  Rank
Beatles             9.7    1
Bruce Springsteen   9.5    2
…                   …      …
Justin Bieber       5.6    999
Nickelback          4.1    1000

High-value: Rank precisely, return. Low-value: Approximate, discard.

SLIDE 61

Top-K Approximation: How it Works

  • Use approximate model to identify and discard low-value items.
  • Rank high-value items with powerful model.
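A minimal sketch with hypothetical scoring functions; r (how many candidates to keep per returned item) is the kind of knob Willump tunes automatically against the accuracy target.

import numpy as np

def approximate_top_k(items, cheap_score, full_score, K, r=4):
    cheap = np.array([cheap_score(item) for item in items])
    n_keep = min(len(items), K * r)            # safety margin of candidates
    candidates = np.argsort(cheap)[-n_keep:]   # likely high-value items
    ranked = sorted(candidates,                # precise pass on survivors
                    key=lambda i: full_score(items[i]), reverse=True)
    return [items[i] for i in ranked[:K]]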

SLIDE 62

Top-K Approximation: Prior Work

  • Existing systems have similar ideas.
  • However, we automatically generate approximate models for any ML application—prior systems don’t.
  • Similar challenges as in cascades.

Source: Cheng et al. (DLRS ‘16)

SLIDE 63

Top-K Approximation: Automatic Tuning

  • Automatically selects features and tunes parameters to maximize performance given an accuracy target.
  • Works similarly to cascades.
  • See paper for details!

SLIDE 64

Top-K Approximation: Results

  • Speedups of up to 10x for top-K queries.
  • Full eval at end of talk!
SLIDE 65

Outline

  • System Overview
  • Optimization 1: End-to-end Cascades
  • Optimization 2: Top-K Query Approximation
  • Evaluation
SLIDE 66

Willump Evaluation: Benchmarks

  • Benchmarks curated from top-performing entries to data science competitions (e.g. Kaggle, WSDM, CIKM).
  • Three benchmarks in presentation (more in paper):
    ○ Music (music recommendation; queries remotely stored precomputed features)
    ○ Purchase (predict next purchase; tabular AutoML features)
    ○ Toxic (toxic comment detection; computes string features)

SLIDE 67

End-to-End Cascades Evaluation: Throughput

[Bar chart: throughput across benchmarks; baselines at 1x, with speedups of 15x, 1.6x, 2.4x, and 3.2x.]

SLIDE 68

End-to-End Cascades Evaluation: Latency

SLIDE 69

Top-K Query Approximation Evaluation

[Bar chart: top-K results across benchmarks; baselines at 1x, with speedups of 4.0x, 2.7x, 3.2x, and 30x.]

SLIDE 70

Summary

  • We introduce Willump, a statistically-aware end-to-end optimizer for ML inference.
  • The statistical nature of ML enables new optimizations: Willump applies them automatically for 10x speedups.

github.com/stanford-futuredata/Willump