

slide-1
SLIDE 1

Cuttlefish: Lightweight Primitives for Online Tuning

by Tomer Kaftan (UW), Magdalena Balazinska (UW), Alvin Cheung (UW), Johannes Gehrke (Microsoft)

1

slide-2
SLIDE 2

Data processing workloads today are complicated.

2

slide-3
SLIDE 3

Motivating Workload

3

slide-4
SLIDE 4

Motivating Workload

“A Cuttlefish pretending to be a rock”

3

*Image Sourced from https://www.flickr.com/photos/silkebaron/32001215104
slide-5
SLIDE 5

Motivating Workload

“A Cuttlefish pretending to be a rock”

3

Generate Training Data from:

etc.

*Image Sourced from https://www.flickr.com/photos/silkebaron/32001215104
slide-6
SLIDE 6

Motivating Workload

4

[Diagram: logical plan for the workload. HTML data is parsed with a Regex operator and joined with Images; Filter and Generate Training Labels steps produce labeled images, which are repeatedly fed through a caption-generating model built from Conv (CNN) and RNN operators to produce the output model.]

*caption-generating model portion of the logical plan inspired by: Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015

slide-7
SLIDE 7

Motivating Workload

4


Diverse, sophisticated operators, with multiple implementations!

slide-8
SLIDE 8

Example Operator: Convolution

5

slide-9
SLIDE 9

Example Operator: Convolution

5

Tested 3 convolution algorithms on 8000 Flickr images

Relative throughput normalized against the highest-throughput algorithm

slide-10
SLIDE 10

6

Traditionally: Use a Query Optimizer

slide-11
SLIDE 11

6

Traditionally: Use a Query Optimizer

(Collect Dataset Statistics, Apply Heuristics & Cost Models)

slide-12
SLIDE 12

These work great, BUT…

7

slide-16
SLIDE 16

These work great, BUT…

  • Designing good query optimizers takes time!
  • Requires deep knowledge of the operators and significant development effort.
  • Spark SQL took 2 years to go from heuristics-based optimization to cost-based optimization! [1]
  • Modern data processing applications involve more than just relational operators!

7

[1] http://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html

slide-17
SLIDE 17

Can we optimize without a full-fledged optimizer?

8

slide-18
SLIDE 18

The workload developer (or the query optimizer) inserts calls to Cuttlefish's API, adding tuners that select among implementations during execution.

Cuttlefish: A Lightweight Primitive for Online Tuning

9


slide-20
SLIDE 20

Cuttlefish: A Lightweight Primitive for Online Tuning

10

The workload developer (or the query optimizer) inserts calls to Cuttlefish's API, adding tuners that select among implementations during execution.

[Diagram: tuner lifecycle (Choose → Execute → Observe). Tuners are inserted into the logical plan: the Conv operator's tuner chooses among Nested Loop, Matrix Multiply, and FFT; the Join tuner chooses between Sort and Hash; the Regex tuner chooses among Libs 1-4.]

slide-25
SLIDE 25

Cuttlefish: A Lightweight Primitive for Online Tuning

10

The user maps tuning rounds to the execution model of each operator:

  • Regex: One round per HTML Doc
  • Convolve: One round per image
  • Parallel Distributed Join: One round per partition
slide-26
SLIDE 26

Cuttlefish

11

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-27
SLIDE 27

The Cuttlefish Primitive

12

slide-28
SLIDE 28
  • 1. Construct a tuner (from a set of choices)

The Cuttlefish Primitive

12

slide-29
SLIDE 29
  • 1. Construct a tuner (from a set of choices)
  • 2. Tuner.choose (pick one of the choices)

The Cuttlefish Primitive

12

slide-30
SLIDE 30
  • 1. Construct a tuner (from a set of choices)
  • 2. Tuner.choose (pick one of the choices)
  • 3. Tuner.observe (observe a reward for a choice)

The Cuttlefish Primitive

12

slide-31
SLIDE 31
  • 1. Construct a tuner (from a set of choices)
  • 2. Tuner.choose (pick one of the choices)
  • 3. Tuner.observe (observe a reward for a choice)

Cuttlefish tuners maximize the total reward after multiple choose-observe tuning rounds

The Cuttlefish Primitive

12

slide-32
SLIDE 32

Tuning Convolution with Cuttlefish

13

convolve, token = tuner.choose()
tuner.observe(token, reward)

slide-34
SLIDE 34

Tuning Convolution with Cuttlefish

def loopConvolve(image, filters): …
def fftConvolve(image, filters): …
def mmConvolve(image, filters): …

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

for image, filters in convolutions:
    convolve, token = tuner.choose()
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    reward = computeReward(elapsedTime)
    tuner.observe(token, reward)
    output result

13

slide-41
SLIDE 41

Cuttlefish

14

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-42
SLIDE 42

Approach: Tuning

15

slide-43
SLIDE 43

Multi-armed Bandit Problem

Approach: Tuning

15

slide-44
SLIDE 44
  • K possible choices (called arms)

Multi-armed Bandit Problem

Approach: Tuning

15

slide-45
SLIDE 45
  • K possible choices (called arms)
  • Arms have unknown reward distributions

Multi-armed Bandit Problem

Approach: Tuning

15

slide-46
SLIDE 46
  • K possible choices (called arms)
  • Arms have unknown reward distributions
  • At each round: select an Arm and observe a reward

Multi-armed Bandit Problem

Approach: Tuning

15

slide-47
SLIDE 47
  • K possible choices (called arms)
  • Arms have unknown reward distributions
  • At each round: select an Arm and observe a reward

Multi-armed Bandit Problem

Goal: Maximize Cumulative Reward

(by balancing exploration & exploitation)

Approach: Tuning

15

slide-48
SLIDE 48

Thompson Sampling

16

slide-49
SLIDE 49

Thompson Sampling

16

[Figure: belief distributions about the expected reward of Arms 1-4]

slide-54
SLIDE 54

Thompson Sampling

20

[Figure: samples drawn from each arm's belief distribution; better arms are chosen more often]

slide-55
SLIDE 55

Thompson Sampling

21

slide-56
SLIDE 56

Thompson Sampling

  • Gaussian runtimes with initially unknown means and variances

21

slide-57
SLIDE 57

Thompson Sampling

  • Gaussian runtimes with initially unknown means and variances
  • Belief distributions form t-distributions
  • Depend only on sample mean, variance, count

21

slide-58
SLIDE 58

Thompson Sampling

  • Gaussian runtimes with initially unknown means and variances
  • Belief distributions form t-distributions
  • Depend only on sample mean, variance, count
  • No meta-parameters, yet works well for diverse operators

21

slide-59
SLIDE 59

Thompson Sampling

  • Gaussian runtimes with initially unknown means and variances
  • Belief distributions form t-distributions
  • Depend only on sample mean, variance, count
  • No meta-parameters, yet works well for diverse operators
  • Constant memory overhead, 0.03 ms per tuning round

21
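A minimal Python sketch of this kind of per-arm Thompson sampling, assuming (as above) Gaussian rewards with unknown mean and variance, so each arm's belief about its expected reward is a shifted, scaled Student-t distribution over its running count, mean, and variance. The class and method names are illustrative, not Cuttlefish's actual code.

import math, random
import numpy as np

class GaussianThompsonTuner:
    def __init__(self, num_arms):
        self.n  = [0] * num_arms      # observations per arm
        self.mu = [0.0] * num_arms    # running sample mean of rewards
        self.m2 = [0.0] * num_arms    # running sum of squared deviations (Welford)

    def choose(self):
        # Try every arm a couple of times before trusting its belief distribution.
        unexplored = [i for i, c in enumerate(self.n) if c < 2]
        if unexplored:
            return random.choice(unexplored)
        draws = []
        for i in range(len(self.n)):
            var = self.m2[i] / (self.n[i] - 1)
            # Sample the arm's expected reward from its t-distributed belief.
            draws.append(self.mu[i] + np.random.standard_t(self.n[i] - 1) * math.sqrt(var / self.n[i]))
        return max(range(len(draws)), key=draws.__getitem__)

    def observe(self, arm, reward):
        self.n[arm] += 1
        delta = reward - self.mu[arm]
        self.mu[arm] += delta / self.n[arm]
        self.m2[arm] += delta * (reward - self.mu[arm])

Arms that keep returning higher rewards develop tighter beliefs around those rewards and are sampled (and therefore chosen) more often, which is the behaviour illustrated in the figures above.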

slide-60
SLIDE 60

Convolution Evaluation

22

slide-65
SLIDE 65

Convolution Evaluation

  • Prototype in Apache Spark
  • Tune between three convolution algorithms (Nested Loops, FFT, or Matrix Multiply)
  • Reward: -1*elapsedTime (maximizes throughput)
  • Convolve 8000 Flickr images with sets of filters (~32 GB)
  • Vary number & size of filters
  • Compute intensive (some configs take up to 45 min on a single node)
  • Run on an 8-node (AWS EC2 4-core r3.xlarge) cluster: 32 total cores, ~252 images per core

22
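Given the reward definition above, the computeReward helper used in the earlier code sketch reduces to negating the elapsed time (the helper name comes from the slides; the one-line body is the obvious reading of "-1*elapsedTime"):

def computeReward(elapsedTime):
    # Maximizing reward means minimizing elapsed time, i.e. maximizing throughput.
    return -1 * elapsedTime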

slide-66
SLIDE 66

Convolution Results

23

Relative throughput normalized against the highest-throughput algorithm

slide-69
SLIDE 69

Cuttlefish

24

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-70
SLIDE 70

Challenges in Distributed Tuning

25

slide-73
SLIDE 73

Challenges in Distributed Tuning

  • 1. Choosing and observing occur throughout a cluster
    • To maximize learning, need to communicate
  • 2. Synchronization & communication overheads
  • 3. Feedback delay
    • How many times is `choose' called before an earlier reward is observed?
    • Fortunately, theoretically sound to have delays

25

slide-74
SLIDE 74

Distributed Tuning Approach

26

slide-75
SLIDE 75

Distributed Tuning Approach

26

[Diagram: Option 1, a single centralized tuner; Machines 1-3 send their choose/observe calls to it.]

slide-76
SLIDE 76

Distributed Tuning Approach

26

[Diagram: Option 1, a centralized tuner serving choose/observe calls from Machines 1-3, versus Option 2, independent tuners on each machine that push local state to and pull global state from a centralized model store.]

slide-78
SLIDE 78

Distributed Tuning Approach

26

Cuttlefish uses independent tuners with a centralized model store. Peer-to-peer sharing is also a possibility, but requires more communication.

slide-79
SLIDE 79

Distributed Tuning Approach

27

[Diagram: each worker thread keeps its own local state plus a cached copy of non-local state; a model store (on the master or a parameter server) holds the local state pushed by every worker.]

slide-83
SLIDE 83

Distributed Tuning Approach

27


  • When choosing: aggregate local & non-local state
  • When observing: update the local state
  • Model store aggregates non-local state
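A rough sketch of the local / non-local bookkeeping described above, assuming each arm's state is a mergeable (count, mean, M2) summary like the one in the Thompson sampling sketch. The ModelStore methods (push, pullOthers) are hypothetical placeholders for whatever the master or parameter server exposes, not Cuttlefish's actual interface.

def merge(a, b):
    # Combine two (count, mean, M2) reward summaries (parallel variance formula).
    n_a, mu_a, m2_a = a
    n_b, mu_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mu_b - mu_a
    return (n,
            mu_a + delta * n_b / n,
            m2_a + m2_b + delta * delta * n_a * n_b / n)

class DistributedArmState:
    def __init__(self):
        self.local = (0, 0.0, 0.0)       # observations made by this thread
        self.nonLocal = (0, 0.0, 0.0)    # last state pulled from the model store

    def observe(self, reward):
        # Observing only touches local state: no synchronization on the hot path.
        self.local = merge(self.local, (1, reward, 0.0))

    def stateForChoosing(self):
        # Choosing aggregates local and non-local state.
        return merge(self.local, self.nonLocal)

    def communicate(self, store, armId):
        # Periodically push local observations and refresh the global view.
        store.push(armId, self.local)            # hypothetical model-store call
        self.local = (0, 0.0, 0.0)
        self.nonLocal = store.pullOthers(armId)  # hypothetical model-store call

Because the summaries merge associatively, a slightly stale non-local view delays learning rather than corrupting it, which is why the feedback delay noted in the challenges slide is tolerable.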
slide-84
SLIDE 84

Results with Distributed Approach

28

Relative throughput normalized against the highest-throughput algorithm

slide-85
SLIDE 85

Results with Distributed Approach

29

Throughput normalized against an ideal oracle that always picks the fastest option at each round

slide-87
SLIDE 87

Cuttlefish

30

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning (by learning cost models)
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-88
SLIDE 88

Contextual Tuning

31

slide-91
SLIDE 91

Contextual Tuning

  • Best physical operator for each round may depend on the current (easy to compute) context
    • e.g. convolution performance depends on the image & filter dimensions
  • Users may know important context features
    • e.g. from the asymptotic algorithmic complexity
  • Users can specify context in Tuner.choose

31

slide-92
SLIDE 92

Contextual Tuning Algorithm

32

slide-95
SLIDE 95

Contextual Tuning Algorithm

  • Linear contextual Thompson sampling learns a linear model that maps features to rewards
  • Feature normalization & regularization
    • Increased robustness towards feature choices
  • Effectively learns a cost model

32
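An illustrative sketch of linear contextual Thompson sampling of the kind described above: each choice keeps a ridge-regularized linear model from (normalized) context features to reward, a weight vector is sampled from the posterior, and the choice with the highest predicted reward wins. This is the textbook formulation (Agrawal & Goyal style), not necessarily Cuttlefish's exact algorithm, and the names are illustrative.

import numpy as np

class LinearThompsonArm:
    def __init__(self, dim, lam=1.0, noise=1.0):
        self.A = lam * np.eye(dim)   # regularized precision matrix
        self.b = np.zeros(dim)       # accumulated reward-weighted features
        self.noise = noise

    def sampleExpectedReward(self, x):
        cov = np.linalg.inv(self.A)
        mean = cov @ self.b
        # Sample a weight vector from the posterior, then score this context.
        theta = np.random.multivariate_normal(mean, self.noise ** 2 * cov)
        return float(x @ theta)

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def chooseContextual(arms, context):
    # context is assumed to already be a normalized (e.g. z-scored) feature vector
    return max(range(len(arms)), key=lambda i: arms[i].sampleExpectedReward(context))

The learned weights amount to a per-implementation cost model over the context features, which is the sense in which contextual tuning "effectively learns a cost model".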

slide-97
SLIDE 97

Tuning Convolution with Cuttlefish

def loopConvolve(image, filters): …
def fftConvolve(image, filters): …
def mmConvolve(image, filters): …
def getDimensions(image, filters): …

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

for image, filters in convolutions:
    context = getDimensions(image, filters)
    convolve, token = tuner.choose(context)
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    reward = computeReward(elapsedTime)
    tuner.observe(token, reward)
    output result

34

slide-98
SLIDE 98

Contextual Convolution Results

35

Throughput normalized against an ideal oracle that always picks the fastest algorithm

slide-99
SLIDE 99

Cuttlefish

36

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-100
SLIDE 100

Nonstationary Settings

37

slide-104
SLIDE 104

Nonstationary Settings

  • Runtimes may drift over time, or differ across nodes
    • heterogeneous cluster, changing resource availabilities, data properties varying throughout the workload, etc.
    • e.g. web crawl data and images may be stored sorted by website; this could correlate with performance
  • We might not be capturing sufficient context!
  • Standard multi-armed bandit techniques fail
  • Solution: only tune using observations from nodes & times with statistically similar data

37

slide-105
SLIDE 105

Possible Solution

38

[Figure: per-agent (core or machine) observation timelines]

slide-107
SLIDE 107

Possible Solution

39

[Figure: per-agent (core or machine) observations divided into epochs]

Use all epochs that pass a statistical similarity test

slide-111
SLIDE 111

To Lower Overheads

40

[Figure: per-agent observation timelines divided into epochs]

  • Store only one 'aggregated old state' per epoch
  • At epoch end: if the epoch is similar to the old state, merge it into the 'old state'; otherwise, replace the 'old state'
  • Identify (& merge) similar non-local states only at communication rounds, in the centralized model store
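A hedged sketch of the epoch-end logic above. The similarity check here is a crude Welch-style comparison of two (count, mean, M2) summaries against an arbitrary threshold; the statistical test Cuttlefish actually uses may differ, and merge() is the summary-combining helper from the distributed-tuning sketch.

import math

def looksSimilar(old, new, threshold=2.0):
    # Rough Welch-style check on whether two reward summaries have compatible means.
    n1, mu1, m2_1 = old
    n2, mu2, m2_2 = new
    if n1 < 2 or n2 < 2:
        return True
    se = math.sqrt(m2_1 / (n1 - 1) / n1 + m2_2 / (n2 - 1) / n2)
    return se == 0 or abs(mu1 - mu2) / se < threshold

def endOfEpoch(oldState, epochState):
    # If the finished epoch resembles the aggregated old state, fold it in;
    # otherwise discard the old observations and start over from this epoch.
    if looksSimilar(oldState, epochState):
        return merge(oldState, epochState)
    return epochState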

slide-112
SLIDE 112

Nonstationary Results

41

Throughput normalized against an ideal oracle that always picks the fastest algorithm

slide-113
SLIDE 113

Cuttlefish

42

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-114
SLIDE 114

Regex Operator

43

slide-119
SLIDE 119

Regex Operator

43

  • Tune between four regular expression searching libraries
    • Built-in Java Regex and 3 third-party libraries
  • Search through 256k Common Crawl docs (~30 GB uncompressed)
    • one tuning round per doc
  • Test 8 regexes sourced from the regex-sharing website RegExr
    • Match hyperlinks, trigrams, valid emails, color codes, etc.
  • Multiple orders of magnitude variation in performance
    • The email validation regex with the built-in Java utilities takes 33 μs to process the fastest document, but over 1,000 s for the slowest
  • 8-node (AWS EC2 4-core r3.xlarge) cluster
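The regex operator follows the same choose/observe pattern as the convolution example. A sketch in the deck's own pseudocode style, where the four search functions are placeholder names for the built-in Java engine and the three third-party libraries:

searchFns = [javaRegexSearch, lib1Search, lib2Search, lib3Search]  # placeholder names
tuner = Tuner(searchFns)

for doc in commonCrawlDocs:              # one tuning round per document
    search, token = tuner.choose()
    start = now()
    matches = search(pattern, doc)
    tuner.observe(token, computeReward(now() - start))
    # downstream operators consume `matches`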
slide-120
SLIDE 120

Regex Results

44

Note: Y-axis is Log-scale

slide-121
SLIDE 121

Distributed Parallel Join Operator

45

slide-127
SLIDE 127

Distributed Parallel Join Operator

45

  • Hash-partition relations according to join attributes
  • On each partition, pick a local hash join or a local sort-merge join
  • Rewards capture total join time
    • measured from when the joins begin until the result iterators are fully consumed
  • Set as Spark SQL 2.2's join for all equijoins too large to broadcast
    • No heuristics or cost models in the query optimizer; it falls back on explicit configurations (defaults to a global sort-merge join)
  • Test on the TPC-DS benchmark (scale factor 200)
    • Configure queries to use 512 shuffle / join partitions
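A sketch of the per-partition selection described above, again in the deck's pseudocode style. localHashJoin and localSortMergeJoin stand in for the actual Spark SQL implementations, and the reward is only observed once the result iterator has been fully consumed:

joinTuner = Tuner([localHashJoin, localSortMergeJoin])

def joinPartition(leftRows, rightRows):
    join, token = joinTuner.choose()     # one tuning round per partition
    start = now()
    for row in join(leftRows, rightRows):
        yield row                        # downstream operators consume the iterator
    # reward covers the full span, from join start to full consumption of the results
    joinTuner.observe(token, computeReward(now() - start))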
slide-128
SLIDE 128

Join Results (Query Throughput)

46

slide-130
SLIDE 130

The Cuttlefish join is usually faster or very comparable (the join-throughput graphs are even more dramatic)

Join Results (Query Throughput)

46

But it requires exploration & doesn't always provide 'special ordering' benefits

slide-131
SLIDE 131

Cuttlefish

47

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-132
SLIDE 132

Cuttlefish

48

  • A simple, flexible API for online tuning
  • Thompson-sampling based tuning algorithms
  • Supports contextual tuning (learns cost models)
  • Distributed learning between workers
  • Adapts to nonstationary workloads
  • Prototyped in Apache Spark & successfully tunes convolution, regex, and join operators

uwdb.io/projects/cuttlefish