SLIDE 1

ML for Resource Management

Arjun Karuvally, Priyanka Mary Mammen

SLIDE 2

Introduction

  • Big data analytics in the cloud is crucial for industry and is growing rapidly
  • A number of techniques are used for data processing: MapReduce, SQL-like languages, deep learning, and in-memory analytics
  • A cluster of virtual machines is the execution environment for these types of jobs
  • Different analytic jobs have diverse behavior and resource requirements
SLIDE 3

Problem Statement

  • The task of resource management is to find the right cloud configuration for an application
  • This configuration includes the number of VMs, number of CPUs, CPU speed per core, RAM, disk count, disk speed, network capacity, etc.
  • Any technique used for resource management in the cloud needs to build a performance model
  • The performance model indicates which cloud configuration is best for the particular job being run

SLIDE 4

Motivation

  • Choosing the right configuration for an application is essential to service quality and commercial competitiveness
  • Many jobs are recurring, meaning that similar workloads are executed repeatedly
  • Choosing poorly can result in a slowdown of 2-3x on average and 12x in the worst case

SLIDE 5

Challenges

  • Evaluating all possible cloud configurations to find the best one is prohibitively expensive
  • Each workload has its own preferred cloud configuration, so it is difficult to come up with one configuration for all workloads
  • The resource requirements to achieve a certain objective (execution time or running cost) for a specific workload are opaque
  • The running time and cost have a complex relation to the resources of cloud instances

SLIDE 6

CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics

SLIDE 7

Features

  • Uses Bayesian Optimization to build performance models for various applications
  • Models are just accurate enough to find a near-optimal configuration with only a few test runs
  • Bayesian Optimization obtains near-optimal configurations with a minimum number of samples and a good confidence interval

SLIDE 8

Problem Formulation

  • For a given application workload, the objective is to find an optimal or near-optimal cloud configuration that satisfies a performance requirement
  • The problem is formulated mathematically as shown below
  • The cloud configuration is represented by x, and C represents the cost
  • P is the price per unit time for VMs using x, and T is the running time function
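With these definitions, the formulation can be written as follows. This is a reconstruction of the slide's equation from the definitions above; T_max is an assumed symbol denoting the maximum tolerable running time.

```latex
\min_{x} \; C(x) = P(x)\, T(x)
\qquad \text{subject to} \qquad T(x) \le T_{\max}
```

That is, minimize the total running cost of configuration x while keeping the running time within the performance requirement.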
SLIDE 9

Problem Formulation

  • The unknown required to compute the cost is the running time function T for different configurations x
  • Since evaluating T directly is expensive, Bayesian Optimization is used to search for an approximate solution of the equation at a significantly smaller cost

SLIDE 10

Bayesian Optimization

  • Bayesian Optimization is used to solve optimization problems like the previous equation, where the objective function C is unknown but can be observed through experiments
  • The cost C can be modeled as a stochastic process (e.g., a Gaussian process), and a confidence interval can be computed using one or more samples of C
  • Observational noise can be incorporated into the computation of the confidence interval of the objective function
  • By integrating these, CherryPick can learn the objective function quickly and only take samples in the areas that most likely contain the minimum point (see the sketch below)
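A minimal sketch of such a loop over a discrete grid of candidate configurations, assuming scikit-learn and SciPy; `run_benchmark` is a hypothetical function that deploys a configuration and returns the measured cost. This only illustrates the idea, it is not CherryPick's implementation.

```python
# Minimal Bayesian Optimization sketch for cost minimization over a discrete
# set of cloud configurations (an illustration, not CherryPick's code).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, candidates, best_cost):
    """EI for minimization: expected amount by which each candidate beats best_cost."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (best_cost - mu) / sigma
    return (best_cost - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def search(run_benchmark, candidates, n_init=3, max_iters=20, ei_frac=0.1):
    rng = np.random.default_rng(0)
    idx = rng.choice(len(candidates), size=n_init, replace=False)   # starting points
    X = candidates[idx]
    y = np.array([run_benchmark(x) for x in X])                     # observed costs
    kernel = Matern(nu=2.5)                                         # Matern 5/2 prior (see later slide)
    for _ in range(max_iters):
        gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
        ei = expected_improvement(gp, candidates, y.min())
        best = int(np.argmax(ei))
        if ei[best] < ei_frac * y.min():       # rough analogue of the "EI below 10%" stop rule
            break
        X = np.vstack([X, candidates[best]])
        y = np.append(y, run_benchmark(candidates[best]))
    return X[int(np.argmin(y))], y.min()
```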

SLIDE 11

Working of BO

SLIDE 12

Prior and Acquisition function

  • The prior is given assuming a Gaussian Process
  • The acquisition function is given using Expected Improvement (see below)

Φ and φ are the standard normal cumulative distribution function and the standard normal probability density function, respectively
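A standard closed form of Expected Improvement for minimization, written with the Φ and φ defined above, is:

```latex
\mathrm{EI}(x) = \bigl(C_{\min} - \mu(x)\bigr)\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad z = \frac{C_{\min} - \mu(x)}{\sigma(x)}
```

where μ(x) and σ(x) are the posterior mean and standard deviation of the Gaussian Process at configuration x, and C_min is the lowest cost observed so far.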

SLIDE 13

Design options and decisions

  • Prior function: a Gaussian Process is chosen as the prior
  • C is described using a mean function and a kernel (covariance) function
  • Matern with parameter 5/2 is chosen as the covariance function between inputs because it does not require strong smoothness (its standard form is given below)
  • Acquisition function: Expected Improvement is chosen as the acquisition function
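For reference, the standard Matern 5/2 covariance between two configurations x and x', with length scale ℓ and signal variance σ², is:

```latex
k_{5/2}(x, x') = \sigma^2 \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\!\left(-\frac{\sqrt{5}\,r}{\ell}\right),
\qquad r = \lVert x - x' \rVert
```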

SLIDE 14

Design options and decisions

  • Stopping condition: stop when the EI is less than a threshold (10%) and at least N cloud configurations have been observed
  • Starting points: a quasi-random sequence is used to generate the starting points
  • Encoding cloud configurations: x is a vector of the number of VMs, number of cores, CPU speed per core, average RAM per core, disk count, disk speed, and network capacity of the VM
  • Most of the features are normalized and discretized (see the sketch below)
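A minimal sketch of such an encoding; the feature names and bounds are illustrative assumptions, not CherryPick's exact choices.

```python
# Encode a cloud configuration as the normalized feature vector x described above.
import numpy as np

FEATURES = ["num_vms", "num_cores", "cpu_speed_per_core_ghz", "ram_per_core_gb",
            "disk_count", "disk_speed_mbps", "network_gbps"]
MAX_VALUES = {"num_vms": 64, "num_cores": 32, "cpu_speed_per_core_ghz": 4.0,
              "ram_per_core_gb": 16.0, "disk_count": 8,
              "disk_speed_mbps": 1000.0, "network_gbps": 25.0}

def encode(config: dict) -> np.ndarray:
    """Normalize each feature to [0, 1] so no single dimension dominates the kernel distance."""
    return np.array([config[f] / MAX_VALUES[f] for f in FEATURES])

# Example: an 8-VM cluster (values are illustrative only)
x = encode({"num_vms": 8, "num_cores": 4, "cpu_speed_per_core_ghz": 2.4,
            "ram_per_core_gb": 4.0, "disk_count": 1,
            "disk_speed_mbps": 500.0, "network_gbps": 10.0})
```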
SLIDE 15

Handling uncertainties in clouds

  • The resources of clouds are shared by multiple users, so different workloads may interfere with one another
  • Failures and resource overloading can impact the completion time of a job

SLIDE 16

Implementation

SLIDE 17

Experimental Setup

  • Benchmark applications on Spark and Hadoop exercise different CPU/Disk/RAM/Network resources
  • TPC-DS: a recent benchmark for big data systems that models a decision support workload
  • TPC-H: another SQL benchmark that contains a number of ad-hoc decision support queries that process large amounts of data
  • TeraSort: a common benchmarking application for big data analytics
  • SparkReg: machine learning workloads on top of Spark
  • SparkKm: a clustering machine learning workload
SLIDE 18

Experimental Setup

  • Cloud configurations: four families in Amazon EC2: M4 (general purpose), C4 (compute optimized), R3 (memory optimized), I2 (disk optimized)
  • EI threshold = 10%, N = 6, and 3 initial samples; the EI threshold is chosen to give a good tradeoff between search cost and accuracy
  • Baselines: exhaustive search, coordinate descent
  • Metrics: running cost, search cost
SLIDE 19

Results

  • CherryPick finds the optimal configuration with low search time
SLIDE 20

Results

  • It reaches better configurations with more stability compared to random search on a similar budget

SLIDE 21

Results

  • CherryPick achieves running costs similar to a linear-predictor-based model, but with lower search cost and search time

SLIDE 22

Results

  • CherryPick can tune the EI threshold to trade off between search cost and accuracy
SLIDE 23

Results

  • Effectiveness of CherryPick is shown in three ways: navigation of the search space, estimation of running time vs. cluster size, and scaling with workload size

SLIDE 24

Discussion

  • Reliance on good representative workloads
  • Larger search space: complexity depends only on the number of samples, not on the number of candidates
  • Choice of prior: by choosing a Gaussian Process as the prior, the assumption is that the final function is a sample from a Gaussian process

SLIDE 25

Shortcomings of CherryPick

  • Model accuracy: tries to accurately model the performance metric, which requires more data
  • Cold start: Bayesian Optimization requires initial data to build the performance space
  • Fragility: overly sensitive to initial parameters (initial points, kernel function, process)

SLIDE 26

Scout: An Experienced Guide to Find the Best Cloud Configuration

SLIDE 27

Exploration and Exploitation

  • Any search-based method has two aspects: exploration and exploitation
  • Exploration: gather new information about the search space by executing a new cloud configuration
  • Exploitation: choose the most promising configuration based on the information gathered so far
  • Additional exploration incurs high cost, while exploitation without exploration leads to suboptimal solutions: the exploration-exploitation dilemma

SLIDE 28

Features

  • Search process efficiency: performance and workload characterization derived from historical data of previous workloads
  • Search process effectiveness: uses comprehensive performance data for prediction, including low-level performance information
  • Search process reliability: uses separate sets for unevaluated and evaluated configurations, and historical data to create a model for the current workload

SLIDE 29

Methodology

  • Low-level information is incorporated into the feature vector of the configuration
  • The set of all possible configurations is split into unevaluated and evaluated sets
  • To search for the next best configuration given a starting configuration, a function f(F(S_i), F(S_j), L_i) is learned
  • This function is a classifier that labels a candidate configuration as "better", "fair", or "worse" (see the sketch below)
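A minimal sketch of learning such a pairwise classifier from historical data; the random forest and the shape of the `history` records are assumptions for illustration, not Scout's actual model.

```python
# Pairwise-classifier sketch: given features F(S_i) of the current configuration,
# features F(S_j) of a candidate, and low-level metrics L_i observed while running
# S_i, predict whether S_j would be "better", "fair", or "worse".
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_row(f_si, f_sj, l_i):
    """Concatenate F(S_i), F(S_j), and L_i into one feature row."""
    return np.concatenate([f_si, f_sj, l_i])

def train_pairwise_classifier(history):
    """history: iterable of (F(S_i), F(S_j), L_i, label) tuples from previous workloads."""
    X = np.array([make_row(f_si, f_sj, l_i) for f_si, f_sj, l_i, _ in history])
    y = np.array([label for *_, label in history])   # "better" / "fair" / "worse"
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

def rank_candidates(model, f_si, l_i, candidates):
    """Return unevaluated candidates sorted by predicted probability of being 'better'."""
    rows = np.array([make_row(f_si, f_sj, l_i) for f_sj in candidates])
    better_idx = list(model.classes_).index("better")
    probs = model.predict_proba(rows)[:, better_idx]
    return sorted(zip(candidates, probs), key=lambda pair: -pair[1])
```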

SLIDE 30

Search Strategy

  • Given F(S_i) and L_i, the prediction classes can be obtained for the unevaluated configurations
  • The next configuration is chosen such that the expected performance improves
  • Due to the use of historical data, the search space is reduced and exploitation dominates
  • The search stops when it can no longer find a better configuration
  • It also stops if it repeatedly fails to find better solutions because of an inaccurate performance model (see the search-loop sketch below)
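A search-loop sketch built on rank_candidates() from the previous block; `run_config` is a hypothetical function that deploys a configuration and returns its measured performance and low-level metrics. The probability threshold and misprediction tolerance values mirror the parameters listed in the experimental setup below.

```python
# Scout-style search sketch (an illustration, not Scout's implementation).
import numpy as np

def scout_search(model, start_config, candidates, run_config,
                 prob_threshold=0.5, misprediction_tolerance=3):
    current = start_config
    best_perf, l_i = run_config(current)                # lower is better (time or cost)
    unevaluated = [c for c in candidates if not np.array_equal(c, current)]
    mispredictions = 0
    while unevaluated and mispredictions < misprediction_tolerance:
        ranked = rank_candidates(model, current, l_i, unevaluated)
        nxt, p_better = ranked[0]
        if p_better < prob_threshold:                   # no candidate is likely to be better
            break
        perf, l_next = run_config(nxt)
        unevaluated = [c for c in unevaluated if not np.array_equal(c, nxt)]
        if perf < best_perf:                            # prediction confirmed: move to the new configuration
            best_perf, current, l_i = perf, nxt, l_next
            mispredictions = 0
        else:                                           # model said "better" but it was not
            mispredictions += 1
    return current, best_perf
```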

SLIDE 31

Experimental Setup

  • Workloads: diverse workloads (CPU-intensive, memory-heavy, IO-intensive, and network-intensive) such as PageRank, sorting, recommendation, OLAP, etc., run on Apache Hadoop and Apache Spark
  • Deployment choices: single-node as well as multi-node settings
  • Parameters: 1) labelled classes: "better+", "better", "fair", "worse", and "worse+"; 2) probability threshold: 0.5; 3) misprediction tolerance: 3 and 4 for single and multiple nodes respectively

SLIDE 32

Evaluation: Baselines

  • Random Search: uniformly samples the configuration space; a naive baseline method. Random-4, -6, and -8 represent random samples of 4, 6, and 8 configurations respectively
  • Coordinate Descent: searches one dimension (CPU type, memory size, etc.) at a time; it determines the best choice in that dimension and then continues choosing the best in the other dimensions
  • CherryPick

SLIDE 33

Evaluation: Metrics

1) Normalized performance: execution time or deployment cost
2) Search cost: the number of cloud configurations measured to find the right configuration
3) Reliability across the workloads

SLIDE 34

Performance Comparison

SLIDE 35

Reliability Comparison

  • Although both CherryPick and Scout find near-optimal solutions most of the time, Scout is less fragile

SLIDE 36

Why does Scout work better?

Scout knows the stopping point of the optimizer, which avoids unnecessary search; it uses the "probability threshold" and "misprediction tolerance" as stopping criteria. Scout's convergence speed is better than the other solutions: it finds a better solution, with a 25% improvement in accuracy, at every iteration.

SLIDE 37

Pros and Cons

Pros:

  • No need of accurate performance model.
  • Formulation of search as a classification problem.
  • Incorporation of historical data.

Cons:

  • Collection of low-level information incurs overhead (which must be amortized).
  • Bias of the model due to the incorporation of historical data.
  • Configuration space is not fully explored.
SLIDE 38

Micky: A Cheaper Alternative for Selecting Cloud Instances

slide-39
SLIDE 39

Terminologies

  • Exemplar configuration: a configuration that is near-optimal or satisfactory for the majority of workloads
  • Workload: a combination of application and data
  • Performance metrics: execution time, operational cost

SLIDE 40

Need for Collective Optimization

  • Search performance: searching for the most effective configuration using a single optimizer can lead to an increase in cost
  • Measurement cost: the total cost of running an optimizer
  • Large-scale cloud migration: elaborate optimizers are expensive and time-consuming
  • Limited budgets: the cost of single-workload optimizers pays off only for recurring workloads
  • Expanding cloud portfolio: cloud providers expand their portfolio more than 20 times a year
  • Seeding cloud optimizers: an exemplar configuration can be used to seed a single-workload optimizer

SLIDE 41

Normalized performance of workloads on VMs

SLIDE 42

Problem Formulation

Objective: find the best configuration (vm* ∈ VM) for multiple workloads (W) with fewer measurements (|E_w1 ∪ E_w2 ∪ ... ∪ E_wn|) using a collective optimization method, while the corresponding performance measure (y_w,vm*) is comparable to the one obtained with a single-workload optimizer.

S_w: the set of cloud configuration options for a workload w (|U_w| + |E_w| = |S_w|)
U_w: unevaluated configurations
E_w: evaluated pool

SLIDE 43

Multi-Armed Bandit Problem

Problem: an agent sequentially searches for a slot machine to maximize the total reward collected in the long run. To find the suitable slot machine, the agent needs to balance exploration and exploitation.
Exploration: acquiring information about the arms.
Exploitation: using the acquired information to select the arms.

SLIDE 44

Applying MAB to VM selection!

Objective: find the best VM (slot machine) that maximizes the reward for a group of workloads.
Arm: the choices of VM (slot machines).
Pull: the act of selecting a VM and measuring the performance of the workload on the selected VM.
Reward: the difference in performance between the selected and the optimal choice.
Budget: the number of measurements. The minimum budget is |VM| and the maximum budget is |VM| * |W| (number of VM types * number of workloads).

SLIDE 45

Slot-selection strategies

  • Epsilon-greedy: oscillates between exploration and exploitation
  • Thompson sampling: selects the arm with the highest probability of being the optimal choice
  • Upper Confidence Bound: chooses the arm that has the highest upper confidence bound; it tends to use arms with high expected rewards or high uncertainty (see the sketch below)
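A minimal UCB1 sketch for picking an exemplar VM type across a stream of workloads; `measure` is a hypothetical function returning a normalized reward in [0, 1], and UCB1 itself is an assumption for illustration rather than Micky's exact procedure.

```python
# UCB1 sketch: pull the VM "arm" with the highest upper confidence bound,
# then return the arm with the best empirical mean as the exemplar configuration.
import math

def ucb1_exemplar(vm_types, workloads, measure):
    counts = {vm: 0 for vm in vm_types}    # pulls per arm
    totals = {vm: 0.0 for vm in vm_types}  # cumulative reward per arm
    # Pull every arm once so each confidence bound is defined (minimum budget = |VM|)
    for vm, w in zip(vm_types, workloads):
        counts[vm] += 1
        totals[vm] += measure(w, vm)
    # Remaining workloads: mean reward plus an exploration bonus that shrinks with more pulls
    for t, w in enumerate(workloads[len(vm_types):], start=len(vm_types) + 1):
        vm = max(vm_types, key=lambda v: totals[v] / counts[v]
                 + math.sqrt(2 * math.log(t) / counts[v]))
        counts[vm] += 1
        totals[vm] += measure(w, vm)
    return max(vm_types, key=lambda v: totals[v] / counts[v])
```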
SLIDE 46

Experimental setup

  • Diverse workloads such as data processing, OLAP queries, etc., on Apache Hadoop 2.7, Spark 2.1, and Spark 1.5
  • Evaluated on Amazon EC2 (Elastic Compute Cloud)
  • Used 18 different VM types belonging to three instance families: 1) compute-optimized instances, 2) memory-optimized instances, and 3) general-purpose instances

SLIDE 47

Evaluation

Baselines: 1) Brute force: measures all possible configurations; 2) CherryPick: the state-of-the-art method; 3) Random-4 and Random-8: randomly measure 4 and 8 configurations.
Metrics: normalized performance in terms of execution time and operational cost.

SLIDE 48

Search Performance


SLIDE 49

Measurement Cost Comparison

SLIDE 50

When not to use Micky?

Answer: when a user demands near-optimal solutions for highly recurring workloads. The knee point is calculated as K · f(Δp, Cp) ≥ g(Δm, Cm), where K is the recurrence of a workload (the knee point), f is the opportunity loss due to inferior search performance, g is the reduction in measurement cost with collective optimization, Δp is the delta of normalized search performance, Δm is the delta of measurement cost, and Cp and Cm are costs defined by the user.

SLIDE 51

Performance of MAB algorithms

UCB is found to be more stable

SLIDE 52

Alleviation of sub-optimal choices

Combination of Micky and Scout: Micky has a low measurement cost, while Scout assures a performance guarantee.

SLIDE 53

Learnings

  • Optimizing a batch of workloads is found to be cheaper than single-workload optimization, except in the case of recurring workloads
  • A slight decrease in performance can result in a large reduction in measurement cost (using an exemplar configuration)
  • Workloads for which the exemplar configuration performs poorly can be mitigated by the use of a single-workload optimizer (like Scout)

SLIDE 54

Determining factors for a Cloud Optimizer

Performance delta (Δ): represents the search performance.
Low-level metrics (LLM): runtime information such as CPU utilization, memory usage, and I/O rates.
Historical data: execution records of workloads on cloud configurations.
Budget: the measurement cost a user is willing to pay for an optimizer.

SLIDE 55

How to select a Cloud Optimizer?

SLIDE 56

Questions?