

SLIDE 1

Accelerating the Configuration Tuning of Big Data Analytics with Similarity-aware Multitask Bayesian Optimization

Ayat Fekry, Lucian Carata, Thomas Pasquier, Andrew Rice

BigData2020

akmf3@cl.cam.ac.uk lucian.carata@cl.cam.ac.uk

SLIDE 2

High-level problem overview

  • We want to:
    – optimize configurations of data processing frameworks (Hadoop, Spark, Flink) in workload-specific ways
    – allow amortization of tuning costs in realistic settings:
      • evolving input data (increase in size, change of characteristics)
      • an elastic cluster configuration
SLIDE 3

High-level problem overview

  • We want to:
    – optimize execution of workloads in data processing frameworks (Hadoop, Spark, Flink)
    – allow amortization of tuning costs in realistic settings:
      • evolving input data (increase in size, change of characteristics)
      • an elastic cluster configuration
  • We assume repeated workload execution:
    – daily/weekly/monthly reporting
    – incremental data analysis
    – frequent analytics queries/processing

[Diagram: workloads run on a big-data processing framework, which runs on a cluster (# instances; instance type: memory, disk size/bandwidth, network bandwidth, …; topology); each execution stores/loads data and is driven by a configuration]

SLIDE 4

High-level solution overview

  • How:
    – By incrementally tuning the configuration of the framework
      • per workload
      • determining and tuning only significant parameters
      • aim is to quickly converge to configurations close to the optimum

[Diagram: the Base Tuner (Tuneful) suggests Config 1…n for successive executions W1 Exec 1…n on the big-data processing framework, observing execution-time metrics]

SLIDE 5

High-level solution overview

  • How:
    – By incrementally tuning the configuration of the framework
      • per workload
      • determining and tuning only significant parameters
    – By leveraging existing tuning knowledge across similar workloads

[Diagram: the Similarity-Aware Tuner (SimTune) suggests Config 1…n for executions W1 Exec 1…n; a new workload Wx is checked for similarity and, if similar, existing tuning knowledge is reused]

SLIDE 6

High-level solution overview

  • How:
    – By incrementally tuning the configuration of the framework
      • per workload
      • determining and tuning only significant parameters
    – By leveraging existing tuning knowledge across similar workloads
    – By carefully combining a number of established ML techniques and adapting them to the problem domain

[Diagram: SimTune checks whether a new workload Wx is similar to a tuned one and outputs a Tuned Config for it]

SLIDE 7

Required puzzle pieces

  • Workload characterization
    1) Workload monitoring
    2) Workload representations
    3) Similarity analysis

[Diagram: SimTune architecture (Config 1…n in, execution-time metrics out for W1 Exec 1…n; new workload Wx checked for similarity), annotated with pieces 1–3]

SLIDE 8

Required puzzle pieces

  • Workload characterization
    1) Workload monitoring
    2) Workload representations
    3) Similarity analysis
  • Similarity-aware tuning
    4) Multitask Bayesian Learning [1]

[Figure: single-task modeling for the blue task vs. multitask modeling for blue using knowledge about the red and green tasks]

[1] K. Swersky et al., Multi-task Bayesian optimization

SLIDE 9

Workload characterization

  • Monitoring workload characteristics & resource consumption
    – Metric examples:
      • number of tasks per stage, input/output size, data spilled to disk, etc.
      • CPU time, memory, GC time, serialization time, …
    – Representing metrics in relative terms
      • GC time as a proportion of total CPU time
      • amount of shuffled / disk-spilled data as a proportion of total input data
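The relative-metric idea can be sketched as follows; the metric names and values here are illustrative stand-ins, not the paper's actual monitoring schema:

```python
# Illustrative raw metrics for one workload execution (values invented).
raw = {
    "cpu_time_ms": 480_000,
    "gc_time_ms": 36_000,
    "input_bytes": 64 * 2**30,
    "shuffled_bytes": 8 * 2**30,
    "disk_spilled_bytes": 2 * 2**30,
}

# Express metrics in relative terms, so the fingerprint stays
# comparable across input sizes and cluster configurations.
fingerprint = {
    "gc_frac": raw["gc_time_ms"] / raw["cpu_time_ms"],          # GC share of CPU time
    "shuffle_frac": raw["shuffled_bytes"] / raw["input_bytes"],  # shuffle vs. input
    "spill_frac": raw["disk_spilled_bytes"] / raw["input_bytes"],  # spill vs. input
}
print(fingerprint)  # → {'gc_frac': 0.075, 'shuffle_frac': 0.125, 'spill_frac': 0.03125}
```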
SLIDE 10

Workload characterization

  • Workload representation
    – We would like a low-dimensional representation, because it is difficult to come up with informative distance metrics in high-dimensional space
    – We propose an autoencoder-based solution, where the low-dimensional representation is learned
      • offline phase based on historic execution metrics
      • the resulting encoding/decoding model can be reused
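A minimal sketch of the offline learning phase, assuming a plain linear autoencoder trained by gradient descent on synthetic stand-in metrics; the paper's actual architecture, metric set, and dimensions may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical historic execution metrics: 200 executions x 12
# relative metrics (both numbers are illustrative).
X = rng.random((200, 12))

d_in, d_code = X.shape[1], 3              # compress 12 metrics to 3 dims
W_enc = rng.normal(0.0, 0.1, (d_in, d_code))
W_dec = rng.normal(0.0, 0.1, (d_code, d_in))
lr, mse = 0.05, []

for _ in range(2000):                     # offline training phase
    Z = X @ W_enc                         # encode
    X_hat = Z @ W_dec                     # decode
    err = X_hat - X
    mse.append(float((err ** 2).mean()))  # reconstruction error
    g_dec = Z.T @ err / len(X)            # MSE gradients
    g_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

# The trained encoder is reusable: it maps any workload's metric
# vector to a low-dimensional fingerprint.
fingerprints = X @ W_enc
```

A real deployment would use a nonlinear autoencoder; the linear version keeps the sketch dependency-free while showing the encode/decode/reconstruct loop.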
SLIDE 11

Workload characterization

  • Similarity analysis
    – Given a new workload, find a source (already tuned) workload
      • closest in the encoded representation space (using the L1 norm)
      • distance computed on a fixed fingerprinting configuration for the new workload
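The nearest-source lookup can be sketched as below; the fingerprint values and the set of tuned workloads are hypothetical:

```python
import numpy as np

# Hypothetical encoded fingerprints of already-tuned source workloads.
source_fingerprints = {
    "PageRank":  np.array([0.1, 0.8, 0.3]),
    "Wordcount": np.array([0.7, 0.2, 0.5]),
    "Terasort":  np.array([0.6, 0.1, 0.9]),
}

def closest_source(new_fp, sources):
    """Return the tuned workload whose fingerprint is nearest in L1 norm."""
    return min(sources, key=lambda name: np.abs(sources[name] - new_fp).sum())

# Fingerprint of the new workload, measured under the fixed
# fingerprinting configuration (values illustrative).
new_fp = np.array([0.65, 0.15, 0.55])
print(closest_source(new_fp, source_fingerprints))  # → Wordcount
```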

SLIDE 12

Similarity-aware tuning

  • Assume a source workload s was found for workload w
    1) Tune the same significant parameters as for s
    2) Retrieve the Bayesian tuning model of s, Ts
    3) Add w as a new task to Ts
    4) Suggest the next (tuned) configuration sample csw for w
    5) Update the tuning model with metrics from executing w with configuration csw
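The steps above can be sketched roughly as follows. This is a simplified stand-in, not the paper's implementation: instead of a true multitask GP with a learned task-covariance matrix (Swersky et al.), it fits a single scikit-learn GP over (configuration, task-id) pairs and picks configurations by Expected Improvement; `run_workload`, the single 1-D knob, and all constants are illustrative:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)

def run_workload(x, task):
    """Hypothetical execution cost for a 1-D configuration knob; the
    source (task 0) and new (task 1) workloads have correlated optima."""
    shift = 0.0 if task == 0 else 0.05
    return (x - 0.6 - shift) ** 2 + 0.01 * rng.standard_normal()

# Steps 2-3: start from the source task's observations, add w as task 1.
X = np.array([[x, 0.0] for x in np.linspace(0, 1, 8)])
y = np.array([run_workload(x, 0) for x, _ in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
candidates = np.linspace(0, 1, 101)

for _ in range(5):                        # tuning iterations for w
    gp.fit(X, y)
    Xc = np.column_stack([candidates, np.ones_like(candidates)])  # task 1
    mu, sigma = gp.predict(Xc, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # Expected Improvement
    x_next = candidates[np.argmax(ei)]    # step 4: suggest next configuration
    y_next = run_workload(x_next, 1)      # execute w with that configuration
    X = np.vstack([X, [x_next, 1.0]])     # step 5: update the model
    y = np.append(y, y_next)

best_config = X[np.argmin(y), 0]
```

Because the GP sees the source task's points, the search for w starts near the source's optimum rather than from scratch, which is the effect the multitask formulation is after.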

SLIDE 13

Similarity-aware tuning

  • Natural criteria for stopping the tuning
    – e.g.: the acquisition function maximum (Expected Improvement) drops below 10%
  • The method is able to detect inaccurate similar-workload matching
    – large difference between the cost predicted by the model and the actual execution, across multiple executions
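Both checks can be sketched as below; the exact form of the 10% criterion, the tolerance, and the run count are assumptions for illustration:

```python
import numpy as np

def should_stop(ei_max, best_cost, threshold=0.10):
    """Stop tuning when the acquisition maximum (Expected Improvement)
    falls below 10% of the best observed cost (hypothetical form of
    the slide's criterion)."""
    return ei_max < threshold * best_cost

def mismatch_detected(predicted, actual, tol=0.5, runs=3):
    """Flag an inaccurate source-workload match when the model's
    predicted cost deviates strongly from the actual execution cost
    across several consecutive runs (tol and runs are illustrative)."""
    rel_err = np.abs(np.array(predicted) - np.array(actual)) / np.array(actual)
    return len(rel_err) >= runs and bool(np.all(rel_err[-runs:] > tol))
```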

SLIDE 14

Experiments

Workload (Abbrev)          DS1  DS2  DS3  DS4  DS5  Units
PageRank (PR)                5   10   15   20   25  million pages
Bayes Classifier (Bayes)     5   10   30   40   50  million pages
Wordcount (WC)              32   50   80  100  160  GB
TPC-H Benchmark (TPCH)      20   40   60   80  100  GB (compressed)
Terasort (TS)               20   40   60   80  100  GB

The *-DS1 datasets form the pre-tuned (source) set.

SLIDE 15

Tuned execution times (at convergence)

Source dataset: *-DS1

SLIDE 16

Tuned execution times (at convergence)

Source dataset: *-DS1

SLIDE 17

Time until finding best configuration

Source dataset: *-DS1 (log axis)

SLIDE 18

Extended tuned (source) dataset for Bayes-DS3

Source dataset: *-DS1 + Bayes DS2

SLIDE 19

Tuning cost amortization (Bayes-DS3)

SimTune source dataset: *-DS1 SimTune-extended source dataset: *-DS1 + Bayes-DS2

SLIDE 20

Thank you! Ready for questions!
https://github.com/ayat-khairy/simtune

Interested in discussing offline or collaborating?

BigData2020

akmf3@cl.cam.ac.uk lucian.carata@cl.cam.ac.uk