

SLIDE 1

Accelerating the Configuration Tuning of Big Data Analytics with Similarity-aware Multitask Bayesian Optimization

Ayat Fekry, Lucian Carata, Thomas Pasquier, Andrew Rice

BigData2020

akmf3@cl.cam.ac.uk lucian.carata@cl.cam.ac.uk

SLIDE 2

High-level problem overview

  • We want to:
    – optimize configurations of data processing frameworks (Hadoop, Spark, Flink) in workload-specific ways
    – allow amortization of tuning costs in realistic settings:
      • evolving input data (increase in size, change of characteristics)
      • an elastic cluster configuration
SLIDE 3

High-level problem overview

  • We want to:
    – optimize execution of workloads in data processing frameworks (Hadoop, Spark, Flink)
    – allow amortization of tuning costs in realistic settings:
      • evolving input data (increase in size, change of characteristics)
      • an elastic cluster configuration
  • We assume repeated workload execution:
    – daily/weekly/monthly reporting
    – incremental data analysis
    – frequent analytics queries/processing

[Diagram: workloads run on a big-data processing framework, which runs on a cluster (# instances; instance type: memory, disk size/bandwidth, network bandwidth, …; topology); each execution stores/loads data and is driven by a configuration]

SLIDE 4

High-level solution overview

  • How:
    – By incrementally tuning the configuration of the framework
      • per workload
      • determining and tuning only significant parameters
      • aim is to quickly converge to configurations close to the optimum

[Diagram: the Base Tuner (Tuneful) suggests Config 1…n for successive executions W1 Exec 1…n on the big-data processing framework, observing execution-time metrics]

SLIDE 5

High-level solution overview

  • How:
    – By incrementally tuning the configuration of the framework
      • per workload
      • determining and tuning only significant parameters
    – By leveraging existing tuning knowledge across similar workloads

[Diagram: the Similarity-Aware Tuner (SimTune) suggests Config 1…n for executions W1 Exec 1…n; a new workload Wx is checked for similarity and, if similar, existing tuning knowledge is reused]

SLIDE 6

High-level solution overview

  • How:
    – By incrementally tuning the configuration of the framework
      • per workload
      • determining and tuning only significant parameters
    – By leveraging existing tuning knowledge across similar workloads
    – By carefully combining a number of established ML techniques and adapting them to the problem domain

[Diagram: SimTune checks whether a new workload Wx is similar to a tuned one and outputs a Tuned Config for it]

SLIDE 7

Required puzzle pieces

  • Workload characterization
    1) Workload monitoring
    2) Workload representations
    3) Similarity analysis

[Diagram: SimTune architecture (Config 1…n in, execution-time metrics out for W1 Exec 1…n; new workload Wx checked for similarity), annotated with pieces 1–3]

SLIDE 8

Required puzzle pieces

  • Workload characterization
    1) Workload monitoring
    2) Workload representations
    3) Similarity analysis
  • Similarity-aware tuning
    4) Multitask Bayesian Learning [1]

[Figure: single-task modeling for the blue task vs. multitask modeling for blue using knowledge about the red and green tasks]

[1] K. Swersky et al., Multi-task Bayesian optimization

SLIDE 9

Workload characterization

  • Monitoring workload characteristics & resource consumption
    – Metric examples:
      • number of tasks per stage, input/output size, data spilled to disk, etc.
      • CPU time, memory, GC time, serialization time, …
    – Representing metrics in relative terms
      • GC time as a proportion of total CPU time
      • amount of shuffled / disk-spilled data as a proportion of total input data
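The relative-metric idea can be sketched as follows; the metric names and values here are illustrative stand-ins, not the paper's actual monitoring schema:

```python
# Illustrative raw metrics for one workload execution (values invented).
raw = {
    "cpu_time_ms": 480_000,
    "gc_time_ms": 36_000,
    "input_bytes": 64 * 2**30,
    "shuffled_bytes": 8 * 2**30,
    "disk_spilled_bytes": 2 * 2**30,
}

# Express metrics in relative terms, so the fingerprint stays
# comparable across input sizes and cluster configurations.
fingerprint = {
    "gc_frac": raw["gc_time_ms"] / raw["cpu_time_ms"],          # GC share of CPU time
    "shuffle_frac": raw["shuffled_bytes"] / raw["input_bytes"],  # shuffle vs. input
    "spill_frac": raw["disk_spilled_bytes"] / raw["input_bytes"],  # spill vs. input
}
print(fingerprint)  # → {'gc_frac': 0.075, 'shuffle_frac': 0.125, 'spill_frac': 0.03125}
```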
SLIDE 10

Workload characterization

  • Workload representation
    – We would like a low-dimensional representation, because it is difficult to come up with informative distance metrics in high-dimensional space
    – We propose an autoencoder-based solution, where the low-dimensional representation is learned
      • offline phase based on historic execution metrics
      • the resulting encoding/decoding model can be reused
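A minimal sketch of the offline learning phase, assuming a plain linear autoencoder trained by gradient descent on synthetic stand-in metrics; the paper's actual architecture, metric set, and dimensions may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical historic execution metrics: 200 executions x 12
# relative metrics (both numbers are illustrative).
X = rng.random((200, 12))

d_in, d_code = X.shape[1], 3              # compress 12 metrics to 3 dims
W_enc = rng.normal(0.0, 0.1, (d_in, d_code))
W_dec = rng.normal(0.0, 0.1, (d_code, d_in))
lr, mse = 0.05, []

for _ in range(2000):                     # offline training phase
    Z = X @ W_enc                         # encode
    X_hat = Z @ W_dec                     # decode
    err = X_hat - X
    mse.append(float((err ** 2).mean()))  # reconstruction error
    g_dec = Z.T @ err / len(X)            # MSE gradients
    g_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

# The trained encoder is reusable: it maps any workload's metric
# vector to a low-dimensional fingerprint.
fingerprints = X @ W_enc
```

A real deployment would use a nonlinear autoencoder; the linear version keeps the sketch dependency-free while showing the encode/decode/reconstruct loop.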
SLIDE 11

Workload characterization

  • Similarity analysis
    – Given a new workload, find a source (already tuned) workload
      • closest in the encoded representation space (using the L1 norm)
      • distance computed on a fixed fingerprinting configuration for the new workload
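The nearest-source lookup can be sketched as below; the fingerprint values and the set of tuned workloads are hypothetical:

```python
import numpy as np

# Hypothetical encoded fingerprints of already-tuned source workloads.
source_fingerprints = {
    "PageRank":  np.array([0.1, 0.8, 0.3]),
    "Wordcount": np.array([0.7, 0.2, 0.5]),
    "Terasort":  np.array([0.6, 0.1, 0.9]),
}

def closest_source(new_fp, sources):
    """Return the tuned workload whose fingerprint is nearest in L1 norm."""
    return min(sources, key=lambda name: np.abs(sources[name] - new_fp).sum())

# Fingerprint of the new workload, measured under the fixed
# fingerprinting configuration (values illustrative).
new_fp = np.array([0.65, 0.15, 0.55])
print(closest_source(new_fp, source_fingerprints))  # → Wordcount
```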

SLIDE 12

Similarity-aware tuning

  • Assume a source workload s was found for workload w
    1) Tune the same significant parameters as for s
    2) Retrieve the Bayesian tuning model of s, Ts
    3) Add w as a new task to Ts
    4) Suggest the next (tuned) configuration sample csw for w
    5) Update the tuning model with metrics from executing w with configuration csw
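The steps above can be sketched roughly as follows. This is a simplified stand-in, not the paper's implementation: instead of a true multitask GP with a learned task-covariance matrix (Swersky et al.), it fits a single scikit-learn GP over (configuration, task-id) pairs and picks configurations by Expected Improvement; `run_workload`, the single 1-D knob, and all constants are illustrative:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)

def run_workload(x, task):
    """Hypothetical execution cost for a 1-D configuration knob; the
    source (task 0) and new (task 1) workloads have correlated optima."""
    shift = 0.0 if task == 0 else 0.05
    return (x - 0.6 - shift) ** 2 + 0.01 * rng.standard_normal()

# Steps 2-3: start from the source task's observations, add w as task 1.
X = np.array([[x, 0.0] for x in np.linspace(0, 1, 8)])
y = np.array([run_workload(x, 0) for x, _ in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
candidates = np.linspace(0, 1, 101)

for _ in range(5):                        # tuning iterations for w
    gp.fit(X, y)
    Xc = np.column_stack([candidates, np.ones_like(candidates)])  # task 1
    mu, sigma = gp.predict(Xc, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # Expected Improvement
    x_next = candidates[np.argmax(ei)]    # step 4: suggest next configuration
    y_next = run_workload(x_next, 1)      # execute w with that configuration
    X = np.vstack([X, [x_next, 1.0]])     # step 5: update the model
    y = np.append(y, y_next)

best_config = X[np.argmin(y), 0]
```

Because the GP sees the source task's points, the search for w starts near the source's optimum rather than from scratch, which is the effect the multitask formulation is after.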

SLIDE 13

Similarity-aware tuning

  • Natural criteria for stopping the tuning
    – e.g.: the acquisition function maximum (Expected Improvement) drops below 10%
  • The method is able to detect inaccurate similar-workload matching
    – large difference between the cost predicted by the model and the actual execution, across multiple executions
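Both checks can be sketched as below; the exact form of the 10% criterion, the tolerance, and the run count are assumptions for illustration:

```python
import numpy as np

def should_stop(ei_max, best_cost, threshold=0.10):
    """Stop tuning when the acquisition maximum (Expected Improvement)
    falls below 10% of the best observed cost (hypothetical form of
    the slide's criterion)."""
    return ei_max < threshold * best_cost

def mismatch_detected(predicted, actual, tol=0.5, runs=3):
    """Flag an inaccurate source-workload match when the model's
    predicted cost deviates strongly from the actual execution cost
    across several consecutive runs (tol and runs are illustrative)."""
    rel_err = np.abs(np.array(predicted) - np.array(actual)) / np.array(actual)
    return len(rel_err) >= runs and bool(np.all(rel_err[-runs:] > tol))
```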

SLIDE 14

Experiments

Workload (Abbrev)          DS1  DS2  DS3  DS4  DS5  Units
PageRank (PR)                5   10   15   20   25  million pages
Bayes Classifier (Bayes)     5   10   30   40   50  million pages
Wordcount (WC)              32   50   80  100  160  GB
TPC-H Benchmark (TPCH)      20   40   60   80  100  GB (compressed)
Terasort (TS)               20   40   60   80  100  GB

The *-DS1 datasets form the pre-tuned (source) set.

SLIDE 15

Tuned execution times (at convergence)

Source dataset: *-DS1

SLIDE 16

Tuned execution times (at convergence)

Source dataset: *-DS1

SLIDE 17

Time until finding best configuration

Source dataset: *-DS1 (log axis)

SLIDE 18

Extended tuned (source) dataset for Bayes-DS3

Source dataset: *-DS1 + Bayes DS2

SLIDE 19

Tuning cost amortization (Bayes-DS3)

SimTune source dataset: *-DS1 SimTune-extended source dataset: *-DS1 + Bayes-DS2

SLIDE 20

Thank you! Ready for questions!
https://github.com/ayat-khairy/simtune

Interested in discussing offline or collaborating?

BigData2020

akmf3@cl.cam.ac.uk lucian.carata@cl.cam.ac.uk