Cuttlefish: Lightweight Primitives for Online Tuning
by Tomer Kaftan (UW), Magdalena Balazinska (UW), Alvin Cheung (UW), Johannes Gehrke (Microsoft)
1
Data processing workloads today are complicated.
2
“A Cuttlefish pretending to be a rock”
*Image sourced from https://www.flickr.com/photos/silkebaron/32001215104
3
Motivating Workload: train a caption-generating model. Generate training data from HTML data, images, etc.
[Logical plan diagram: HTML Data, Regex, Filter, Join, Images, Generate Training Labels, Conv, Conv, CNN, RNN, Repeat, Output Model]
*Caption-generating model portion of the logical plan inspired by: Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
4
Diverse, sophisticated operators, with multiple implementations!
Example Operator: Convolution
5
Tested 3 convolution algorithms on 8000 Flickr images
Relative throughput normalized against the highest-throughput algorithm
6
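The three algorithm families named later in the deck (nested loop, FFT, matrix multiply) can be sketched in miniature. This is an illustrative sketch, not the experiment's code: it uses 1-D signals rather than the deck's 2-D Flickr images, and the function names mirror the slide's `loopConvolve`/`fftConvolve`/`mmConvolve` by assumption. All three compute the same result by different means, which is exactly why their relative speed depends on the input.

```python
import numpy as np

def loop_convolve(signal, kernel):
    # Direct nested-loop sliding dot product ('valid' region)
    n = len(signal) - len(kernel) + 1
    out = np.zeros(n)
    for i in range(n):
        for j in range(len(kernel)):
            out[i] += signal[i + j] * kernel[j]
    return out

def fft_convolve(signal, kernel):
    # Same result via pointwise multiplication in the frequency domain
    # (kernel reversed so the linear convolution matches the loop version)
    m = len(signal) + len(kernel) - 1
    full = np.fft.irfft(np.fft.rfft(signal, m) * np.fft.rfft(kernel[::-1], m), m)
    return full[len(kernel) - 1 : len(signal)]

def mm_convolve(signal, kernel):
    # Same result as one matrix-vector multiply over sliding windows
    windows = np.lib.stride_tricks.sliding_window_view(signal, len(kernel))
    return windows @ kernel

x, k = np.random.rand(1000), np.random.rand(16)
assert np.allclose(loop_convolve(x, k), fft_convolve(x, k))
assert np.allclose(loop_convolve(x, k), mm_convolve(x, k))
```

Which one wins depends on signal and kernel size, memory layout, and library quality, so no single choice dominates, matching the chart above.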
Use a Query Optimizer
(Collect Dataset Statistics, Apply Heuristics & Cost Models)
But cost models require significant development effort [1], and typically cover just relational operators!
[1] http://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
7
Can we optimize without a full-fledged optimizer?
8
Cuttlefish: A Lightweight Primitive for Online Tuning
Workload developer (or the query optimizer) inserts calls to Cuttlefish’s API to insert tuners that select implementations during execution
9
[Diagram: the logical plan from before, with Tuners wrapping the Conv operators (Nest. Loop, Mat. Mult, FFT, …), the Join (Hash, …), and others]
10
Tuner Lifecycle: Choose → Execute → Observe
[Diagram: a tuner chooses among implementations (Nest. Loop, Mat. Mult, Sort, FFT, Lib 1, Lib 2, …), executes the choice, and observes the reward]
The user maps tuning rounds to the execution model of each operator
11
I. Problem & Motivation
VII. Other Operators
VIII. Conclusion
Cuttlefish tuners maximize the total reward after multiple choose-observe tuning rounds
12
Tuning Convolution with Cuttlefish

def loopConvolve(image, filters): …
def fftConvolve(image, filters): …
def mmConvolve(image, filters): …

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

for image, filters in convolutions:
    convolve, token = tuner.choose()
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    reward = computeReward(elapsedTime)
    tuner.observe(token, reward)

13
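To make the choose/observe API on the slide concrete, here is a minimal stand-in tuner — NOT Cuttlefish's actual implementation, just an epsilon-greedy sketch that tracks each function's average reward. The token is simply the index of the chosen arm; the epsilon parameter is an assumption of this sketch.

```python
import random

class Tuner:
    def __init__(self, arms, epsilon=0.1):
        self.arms = arms
        self.epsilon = epsilon
        self.counts = [0] * len(arms)
        self.totals = [0.0] * len(arms)

    def choose(self):
        untried = [i for i, n in enumerate(self.counts) if n == 0]
        if untried:
            i = untried[0]                           # try every arm once
        elif random.random() < self.epsilon:
            i = random.randrange(len(self.arms))     # explore
        else:                                        # exploit best average
            i = max(range(len(self.arms)),
                    key=lambda j: self.totals[j] / self.counts[j])
        return self.arms[i], i

    def observe(self, token, reward):
        self.counts[token] += 1
        self.totals[token] += reward
```

With this sketch in place, the loop on the slide runs unchanged: `choose()` hands back a callable and a token, and `observe(token, reward)` feeds the timing-based reward back into the arm's statistics.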
14
I. Problem & Motivation
III. Bandit-based Online Tuning
VII. Other Operators
VIII. Conclusion
Multi-armed Bandit Problem
Goal: Maximize Cumulative Reward
(by balancing exploration & exploitation)
15
16
[Bandit animation: belief distributions about the expected reward of Arms 1–4; as rewards are observed the beliefs sharpen, and better arms are chosen more often]
17–21
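The animation above can be simulated in a few lines. This is an illustrative sketch of sampling-from-beliefs (Thompson-style) bandit selection, not the deck's code: four Bernoulli arms with Beta belief distributions, where the arm probabilities are made-up values for the demo.

```python
import random

probs = [0.2, 0.4, 0.6, 0.8]        # true (unknown) expected rewards per arm
beliefs = [[1, 1] for _ in probs]   # Beta(alpha, beta) belief per arm
counts = [0] * len(probs)

random.seed(1)
for _ in range(2000):
    # Choose: sample one value from each arm's belief, take the argmax
    samples = [random.betavariate(a, b) for a, b in beliefs]
    arm = max(range(len(probs)), key=samples.__getitem__)
    counts[arm] += 1
    # Observe: update the chosen arm's belief with the 0/1 reward
    reward = random.random() < probs[arm]
    beliefs[arm][0] += reward
    beliefs[arm][1] += 1 - reward
```

As the posteriors sharpen, the highest-reward arm (Arm 4 here) is sampled highest most of the time, so exploration tapers off naturally — the exploration/exploitation balance the slide asks for.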
… Matrix Multiply)
22
Relative throughput normalized against the highest-throughput algorithm
23
24
I. Problem & Motivation
VII. Other Operators
VIII. Conclusion
Challenges in Distributed Tuning
What if a tuner must choose again before an earlier reward is observed?
25
Centralized Tuner: Machines 1–3 issue Choose/Observe calls to a single tuner
Independent Tuners, Centralized Store: each machine tunes on its own, pushing local state to / pulling global state from a Global Model Store
Peer-to-Peer is also a possibility, but requires more communication
26
[Diagram: each worker runs per-thread tuner state (Local State for Threads 1–3 on Workers 1 and 2, …); a Model Store (*on the master or a parameter server) tracks each worker’s Local State and the Non-local State contributed by the other workers]
27
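The push-local/pull-global pattern above works because bandit state can be kept as sufficient statistics that merge by addition. This is an illustrative sketch, not Cuttlefish's store: per-arm state is a (count, reward_sum) pair, and the `ModelStore` class and method names are assumptions of the sketch.

```python
def merge(states):
    """Combine per-arm (count, reward_sum) statistics from several tuners."""
    merged = [(0, 0.0)] * len(states[0])
    for state in states:
        merged = [(n0 + n1, s0 + s1)
                  for (n0, s0), (n1, s1) in zip(merged, state)]
    return merged

class ModelStore:
    """Centralized store: each worker pushes its local state and pulls the
    merged state of all *other* workers (its non-local state)."""
    def __init__(self):
        self.local = {}

    def push_local(self, worker_id, state):
        self.local[worker_id] = state

    def pull_global(self, worker_id):
        others = [s for w, s in self.local.items() if w != worker_id]
        return merge(others) if others else None

store = ModelStore()
store.push_local("w1", [(10, 4.0), (5, 4.5)])
store.push_local("w2", [(2, 1.0), (8, 7.0)])
nonlocal_state = store.pull_global("w3")   # combined stats of w1 and w2
```

Because merging is just addition, workers never block on each other: they tune on local state between communication rounds and fold in everyone else's observations whenever they sync.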
Results with Distributed Approach
28
Relative throughput normalized against the highest-throughput algorithm
Results with Distributed Approach
29
Throughput normalized against an ideal oracle that always picks the fastest option at each round
30
I. Problem & Motivation
VII. Other Operators
VIII. Conclusion
Context: image & filter dimensions
31
Learn a model that maps features to rewards
32
Tuning Convolution with Cuttlefish (recap)

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

for image, filters in convolutions:
    convolve, token = tuner.choose()
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    reward = computeReward(elapsedTime)
    tuner.observe(token, reward)

33
Tuning Convolution with Cuttlefish (contextual)

def getDimensions(image, filters): …

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

for image, filters in convolutions:
    context = getDimensions(image, filters)
    convolve, token = tuner.choose(context)
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    reward = computeReward(elapsedTime)
    tuner.observe(token, reward)

34
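A contextual tuner needs a model from features to rewards. Here is an illustrative sketch of one way to do that — NOT Cuttlefish's actual learner: each arm keeps an online ridge-regression model over the context, and `choose(context)` is greedy on the predicted reward with a little epsilon exploration. The class name, epsilon, and the linear-model choice are all assumptions of the sketch.

```python
import random
import numpy as np

class ContextualTuner:
    def __init__(self, arms, dim, epsilon=0.1):
        self.arms = arms
        self.eps = epsilon
        # Ridge-regression state per arm: A = I + sum(x x^T), b = sum(r x)
        self.A = [np.eye(dim) for _ in arms]
        self.b = [np.zeros(dim) for _ in arms]

    def choose(self, context):
        x = np.asarray(context, dtype=float)
        if random.random() < self.eps:
            i = random.randrange(len(self.arms))       # explore
        else:
            # Predict each arm's reward for this context, pick the best
            preds = [x @ np.linalg.solve(A, b)
                     for A, b in zip(self.A, self.b)]
            i = int(np.argmax(preds))
        return self.arms[i], (i, x)

    def observe(self, token, reward):
        i, x = token                 # token carries arm index and context
        self.A[i] += np.outer(x, x)
        self.b[i] += reward * x
```

Note the token now carries the context along with the arm index, so `observe` can credit the reward to the right model — mirroring how the slide threads `context` through `choose`.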
Contextual Convolution Results
35
Throughput normalized against an ideal oracle that always picks the fastest algorithm
36
I. Problem & Motivation
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion
Nonstationarity: properties varying throughout the workload, etc.
This could correlate with performance
Tune using only statistically similar data
37
38
[Diagram: per-agent (core or machine) observations are grouped into epochs]
Use all epochs that pass a statistical similarity test
39
Store only one ‘aggregated old state’ per epoch
At epoch end: if similar to old, merge into ‘old state’; otherwise, replace ‘old state’
Identify (& merge) similar non-local states only at communication rounds, in the centralized model store
40
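The epoch rule above — merge the new epoch into the old state if it looks statistically similar, otherwise replace — can be sketched with per-epoch (count, mean, variance) summaries. This is an illustrative sketch only: the Welch-style similarity test and the threshold of 3 standard errors are assumptions, not the deck's actual test.

```python
import math

def similar(s1, s2, threshold=3.0):
    """Crude Welch-style test: are the reward means within a few
    standard errors of each other?"""
    (n1, mean1, var1), (n2, mean2, var2) = s1, s2
    se = math.sqrt(var1 / n1 + var2 / n2) + 1e-12
    return abs(mean1 - mean2) / se < threshold

def merge(s1, s2):
    """Pool the count/mean/variance of two reward samples."""
    (n1, m1, v1), (n2, m2, v2) = s1, s2
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    v = (n1 * (v1 + (m1 - m) ** 2) + n2 * (v2 + (m2 - m) ** 2)) / n
    return (n, m, v)

def end_of_epoch(old_state, new_epoch):
    if old_state is not None and similar(old_state, new_epoch):
        return merge(old_state, new_epoch)  # statistically similar: pool
    return new_epoch                        # distribution shifted: replace
```

Keeping only one aggregated old state makes the check O(1) per epoch, and the same `similar` test can be reused when merging non-local states at communication rounds.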
41
Throughput normalized against an ideal oracle that always picks the fastest algorithm
42
I. Problem & Motivation
VII. Other Operators
VIII. Conclusion
…the fastest document, but over 1000s for the slowest document
43
44
Note: Y-axis is Log-scale
Distributed Parallel Join Operator
…explicit configurations (defaults to global sort-merge join)
45
Join Results (Query Throughput)
Cuttlefish join is usually faster or very comparable (join throughput graphs are even more dramatic)
But it requires exploration & doesn’t always provide ‘special ordering’ benefits
46
47
I. Problem & Motivation
VII. Other Operators
VIII. Conclusion
Conclusion
Cuttlefish provides lightweight primitives for online tuning, demonstrated on convolution, regex, and join operators
uwdb.io/projects/cuttlefish
48