

Accelerating the Configuration Tuning of Big Data Analytics with Similarity-aware Multitask Bayesian Optimization
Ayat Fekry, Lucian Carata, Thomas Pasquier, Andrew Rice
akmf3@cl.cam.ac.uk, lucian.carata@cl.cam.ac.uk
BigData2020


  1. Accelerating the Configuration Tuning of Big Data Analytics with Similarity-aware Multitask Bayesian Optimization
     Ayat Fekry, Lucian Carata, Thomas Pasquier, Andrew Rice
     akmf3@cl.cam.ac.uk, lucian.carata@cl.cam.ac.uk
     BigData2020

  2. High-level problem overview
     ● We want to:
       – optimize configurations of data processing frameworks (Hadoop, Spark, Flink) in workload-specific ways
       – allow amortization of tuning costs in realistic settings:
         ● evolving input data (increase in size, change of characteristics)
         ● an elastic cluster configuration

  3. High-level problem overview
     ● We want to:
       – optimize execution of workloads in data processing frameworks (Hadoop, Spark, Flink)
       – allow amortization of tuning costs in realistic settings:
         ● evolving input data (increase in size, change of characteristics)
         ● an elastic cluster configuration
     ● When assuming repeated workload execution:
       – daily/weekly/monthly reporting
       – incremental data analysis
       – frequent analytics queries/processing
     [Diagram: workloads run on a big-data processing framework (configuration, execution), which stores/loads data and runs on a cluster (instance type, number of instances, memory, topology, disk size and bandwidth, network bandwidth, ...).]

  4. High-level solution overview
     ● How:
       – By incrementally tuning the configuration of the framework
         ● per workload
         ● determining and tuning only significant parameters
         ● aim is to quickly converge to configurations close to optimum
     [Diagram: over time, executions Exec 1, Exec 2, ..., Exec n of workload W1 run on the big-data processing framework; the Base Tuner (Tuneful) receives execution metrics and supplies Config 1, 2, ..., n.]

  5. High-level solution overview
     ● How:
       – By incrementally tuning the configuration of the framework
         ● per workload
         ● determining and tuning only significant parameters
       – By leveraging existing tuning knowledge across similar workloads
     [Diagram: over time, executions Exec 1, Exec 2, ..., Exec n of workload W1 run on the big-data processing framework; the Similarity-Aware Tuner (SimTune) receives execution metrics and supplies Config 1, 2, ..., n. A workload Wx is checked ("Similar?"); if yes, tuning knowledge is passed on.]

  6. High-level solution overview
     ● How:
       – By incrementally tuning the configuration of the framework
         ● per workload
         ● determining and tuning only significant parameters
       – By leveraging existing tuning knowledge across similar workloads
       – By carefully combining a number of established ML techniques and adapting them to the problem domain
     [Diagram: over time, executions Exec 1, Exec 2, ..., Exec n of workload W1 run on the big-data processing framework; the Similarity-Aware Tuner (SimTune) receives execution metrics and supplies Config 1, 2, ..., n. A workload Wx is checked ("Similar?"); if yes, it receives a tuned config.]

  7. Required puzzle pieces
     ● Workload characterization
       1) Workload monitoring
       2) Workload representations
       3) Similarity analysis
     [Diagram: the SimTune overview (executions of W1 over time, Similarity-Aware Tuner, "Similar?" check for Wx, tuned config, big-data processing framework with configuration and execution metrics), annotated with callouts 1, 2 and 3.]

  8. Required puzzle pieces
     ● Workload characterization
       1) Workload monitoring
       2) Workload representations
       3) Similarity analysis
     ● Similarity-aware tuning
       4) Multitask Bayesian Learning
     [Figure panels: "Single task modeling for blue"; "Multitask modeling for blue using knowledge about red and green" [1]]
     [1] K. Swersky et al., Multi-task Bayesian optimization

  9. Workload characterization
     ● Monitoring workload characteristics & resource consumption
       – Metric examples:
         ● number of tasks per stage, input/output size, data spilled to disk, etc.
         ● CPU time, memory, GC time, serialization time, …
       – Representing metrics in relative terms
         ● GC time as a proportion of total CPU time
         ● Amount of shuffled/disk-spilled data as a proportion of total input data
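The relative representation on this slide can be illustrated with a small sketch. The metric names (`cpu_time_s`, `gc_time_s`, `input_bytes`, `shuffle_bytes`, `spill_bytes`) and the sample values are assumptions made for illustration; they are not taken from the slides or from any framework's API.

```python
def relative_metrics(raw):
    """Convert absolute execution metrics into scale-free proportions.

    The keys used here are hypothetical names, not a real framework's schema.
    """
    cpu = raw["cpu_time_s"]
    inp = raw["input_bytes"]
    return {
        # GC time as a proportion of total CPU time
        "gc_fraction": raw["gc_time_s"] / cpu if cpu else 0.0,
        # Shuffled and spilled data as proportions of total input data
        "shuffle_fraction": raw["shuffle_bytes"] / inp if inp else 0.0,
        "spill_fraction": raw["spill_bytes"] / inp if inp else 0.0,
    }


# Made-up numbers for a single execution.
run = {
    "cpu_time_s": 400.0,
    "gc_time_s": 40.0,
    "input_bytes": 10_000,
    "shuffle_bytes": 2_500,
    "spill_bytes": 500,
}
rel = relative_metrics(run)
print(rel)
```

Expressing metrics as proportions keeps the values comparable when the same workload runs on a larger input, which matches the evolving-input setting described earlier in the deck.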

  10. Workload characterization
      ● Workload representation
        – We would like a low-dimensional representation, because it is difficult to come up with informative distance metrics in a high-dimensional space
        – We propose an autoencoder-based solution, where the low-dimensional representation is learned
          ● in an offline phase, based on historic execution metrics
          ● the resulting encoding/decoding model can be reused
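As a rough illustration of learning a low-dimensional representation offline, here is a minimal autoencoder sketch in NumPy. The architecture (a tanh encoder and a linear decoder without bias terms), the code size of 2, the learning rate and the synthetic "historic metrics" are all assumptions for this example; the slides do not describe the authors' network.

```python
import numpy as np


def train_autoencoder(X, code_dim=2, epochs=2000, lr=0.1, seed=0):
    """Fit a one-hidden-layer autoencoder with plain gradient descent
    on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_features, code_dim))
    W_dec = rng.normal(scale=0.1, size=(code_dim, n_features))
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W_enc)        # encode
        X_hat = Z @ W_dec             # decode
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared error through both layers.
        grad_out = 2.0 * err / err.size
        grad_W_dec = Z.T @ grad_out
        grad_Z = grad_out @ W_dec.T
        grad_W_enc = X.T @ (grad_Z * (1.0 - Z ** 2))
        W_enc -= lr * grad_W_enc
        W_dec -= lr * grad_W_dec
    return W_enc, W_dec, losses


def encode(X, W_enc):
    """Map metric vectors to their low-dimensional codes."""
    return np.tanh(X @ W_enc)


# Synthetic stand-in for historic execution metrics: 40 runs, 6 relative
# metrics, generated from 2 hidden factors so a 2-d code is plausible.
rng = np.random.default_rng(1)
X = rng.random((40, 2)) @ rng.random((2, 6)) / 2.0

W_enc, W_dec, losses = train_autoencoder(X)
codes = encode(X, W_enc)
print(codes.shape, losses[0], losses[-1])
```

Once trained, only `W_enc` is needed to encode a new workload's metrics, which corresponds to the reuse of the encoding model mentioned on the slide.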

  11. Workload characterization
      ● Similarity analysis
        – Given a new workload, find a source (already tuned) workload
          ● Closest in encoded representation space (using the L1 norm)
          ● Distance computed on a fixed fingerprinting configuration for the new workload
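The nearest-source lookup can be sketched as follows. The encoded vectors are invented for illustration, and the dictionary layout is an assumption; only the use of the L1 norm comes from the slide.

```python
def l1_distance(a, b):
    """L1 (Manhattan) distance between two encoded representations."""
    return sum(abs(x - y) for x, y in zip(a, b))


def closest_source(new_code, source_codes):
    """Return (name, distance) of the tuned workload nearest to new_code."""
    name = min(source_codes, key=lambda k: l1_distance(new_code, source_codes[k]))
    return name, l1_distance(new_code, source_codes[name])


# Made-up 2-d codes of already tuned workloads.
sources = {
    "PR-DS1": [0.10, 0.80],
    "WC-DS1": [0.70, 0.20],
    "TS-DS1": [0.40, 0.50],
}
# Code of the new workload, obtained from its fingerprinting run.
match, dist = closest_source([0.65, 0.30], sources)
print(match, dist)
```

In this toy data the new workload is closest to `WC-DS1`, with an L1 distance of about 0.15.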

  12. Similarity-aware tuning
      ● Assume a source workload s was found for workload w
        1) Tune the same significant parameters as for s
        2) Retrieve the Bayesian tuning model of s, T_s
        3) Add w as a new task to T_s
        4) Suggest the next (tuned) configuration sample, cs_w, for w
        5) Update the tuning model with metrics from executing w with configuration cs_w
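The five steps above can be sketched with a toy multitask Gaussian process. This is not the SimTune implementation. It assumes a single tuned parameter scaled to [0, 1], a fixed task correlation of 0.8 (Swersky et al. learn the task covariance instead), an RBF kernel with a fixed length scale, and a hypothetical `run_workload` function standing in for real executions.

```python
import math
import numpy as np


def rbf(A, B, length=0.3):
    """Squared-exponential kernel between two sets of configurations."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length ** 2)


class MultiTaskGP:
    """GP over (configuration, task) pairs with kernel B[t, t'] * rbf(x, x')."""

    def __init__(self, dim=1, task_corr=0.8, noise=1e-3):
        # Assumed fixed correlation between task 0 (source s) and task 1 (new w).
        self.B = np.array([[1.0, task_corr], [task_corr, 1.0]])
        self.noise = noise
        self.X = np.empty((0, dim))
        self.t = np.empty(0, dtype=int)
        self.y = np.empty(0)

    def add(self, x, task, cost):
        self.X = np.vstack([self.X, np.atleast_2d(x)])
        self.t = np.append(self.t, task)
        self.y = np.append(self.y, cost)

    def predict(self, Xs, task):
        K = self.B[np.ix_(self.t, self.t)] * rbf(self.X, self.X)
        K += self.noise * np.eye(len(self.y))
        Ks = self.B[task, self.t][None, :] * rbf(Xs, self.X)
        mu0 = self.y.mean()
        mean = mu0 + Ks @ np.linalg.solve(K, self.y - mu0)
        var = self.B[task, task] - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
        return mean, np.sqrt(np.maximum(var, 1e-12))


def expected_improvement(mean, std, best):
    """Expected improvement when minimising the cost."""
    z = (best - mean) / std
    cdf = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def run_workload(x, optimum):
    """Hypothetical stand-in for executing a workload and measuring its cost."""
    return (x - optimum) ** 2


# Steps 1-2: same parameter as for s; T_s is rebuilt here from s's history.
model = MultiTaskGP()
for x in np.linspace(0.0, 1.0, 6):
    model.add(x, 0, run_workload(x, 0.60))

# Step 3: w joins the model as task 1, seeded with one initial run.
first_cost = run_workload(0.2, 0.65)
model.add(0.2, 1, first_cost)

candidates = np.linspace(0.0, 1.0, 101)[:, None]
for _ in range(5):
    mean, std = model.predict(candidates, task=1)
    best = model.y[model.t == 1].min()
    ei = expected_improvement(mean, std, best)
    cs_w = candidates[np.argmax(ei), 0]              # step 4: next sample
    model.add(cs_w, 1, run_workload(cs_w, 0.65))     # step 5: update model

best_w = model.y[model.t == 1].min()
print(best_w, first_cost)
```

Because the kernel couples the two tasks, observations of s already shape the predictions for w after a single run of w, which is the mechanism the slide relies on to shorten tuning.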

  13. Similarity-aware tuning
      ● Natural criteria for stopping the tuning
        – e.g. the acquisition function maximum (Expected Improvement) drops below 10%
      ● Method able to detect inaccurate similar-workload matching
        – Large difference between cost predicted by model and actual execution, across multiple executions

  14. Experiments: input data sizes (DS)
      [Slide annotation: pre-tuned (source) set]

      Workload (Abbrev)         DS1  DS2  DS3  DS4  DS5  Units
      PageRank (PR)               5   10   15   20   25  million pages
      Bayes Classifier (Bayes)    5   10   30   40   50  million pages
      Wordcount (WC)             32   50   80  100  160  GB
      TPC-H Benchmark (TPCH)     20   40   60   80  100  GB (compressed)
      Terasort (TS)              20   40   60   80  100  GB

  15. Tuned execution times (at convergence)
      [Figure. Source dataset: *-DS1]

  16. Tuned execution times (at convergence)
      [Figure. Source dataset: *-DS1]

  17. Time until finding best configuration
      [Figure, log axis. Source dataset: *-DS1]

  18. Extended tuned (source) dataset for Bayes-DS3
      [Figure. Source dataset: *-DS1 + Bayes-DS2]

  19. Tuning cost amortization (Bayes-DS3)
      [Figure. SimTune source dataset: *-DS1; SimTune-extended source dataset: *-DS1 + Bayes-DS2]

  20. Thank you! Ready for questions!
      https://github.com/ayat-khairy/simtune
      Interested in discussing offline or collaborating?
      akmf3@cl.cam.ac.uk, lucian.carata@cl.cam.ac.uk
      BigData2020
