
Feb 28, 2023


SLIDE 1

Comprehensive Elastic Resource Management to Ensure Predictable Performance for Scientific Applications on Public IaaS Clouds

In Kee Kim, Jacob Steele, Yanjun Qi, Marty Humphrey

CS@University of Virginia

SLIDE 2

Motivation

Goals

– Meet Job Deadline
– Low Cost

SLIDE 3

Current Approach

Auto-Scaling
– Scale Up: Job Deadline Satisfaction (High Demand)
– Scale Down: Cost Efficiency (Low Demand)

[1] Schedule-based Scaling
– Static approach

[2] Rule-based Scaling
– Dynamic but Delays
– Reactive

[Figure: Schedule-based Scaling and Rule-based Scaling over time (T1, T2, T3); labels: Over Provisioning, Under Provisioning, Scale Up Delay, Scale Down Delay]

SLIDE 4

Research Goal and Approach

In order to 1) meet user-defined job deadlines and 2) minimize execution cost for scientific applications that have highly variable job execution times, we design a Comprehensive Resource Management System that uses:

Local Linear Regression-based Job Execution Time Prediction
Cost/Performance-Ratio based Resource Evaluation
Availability-Aware Job Scheduling and VM Scaling

SLIDE 5

Outline

Motivation
Three approaches of LCA
Experiment
Conclusion

SLIDE 6

LLR: Job Execution Time Prediction

Initial Intuition

– Job execution time has a linear relationship with IaaS/Application parameters

Data Collection (26 samples on 4 types of VMs) and Correlation Analysis
Local Linear Regression

Correlation with job execution time:

                              Size of Data          Type of VM
Non-Data Intensive Operation  0.0973 (negligible)   0.7089 (strong)
Data Intensive Operation      0.6129 (moderate)     0.3223 (weak)

Simple Linear Model → Cannot Produce Reliable Prediction

[Figure: (a) Global Linear Regression on m1.large (using all samples); (b) Local Linear Regression on m1.large (using three samples). Axis: Job Execution Time (sec.); annotation: error]
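The idea of fitting a line only to nearby samples can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single feature (input data size) for one VM type, picks the k = 3 closest historical samples (matching the "three samples" in panel (b)), and fits an ordinary least-squares line to them. The names `fit_line` and `llr_predict` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; cannot fit a line")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def llr_predict(samples, query_x, k=3):
    """Predict y at query_x from the k samples whose x is closest to it.

    samples: list of (data_size, exec_time_sec) pairs for one VM type
             (hypothetical layout, assumed for this sketch).
    """
    if k < 2 or k > len(samples):
        raise ValueError("k must be between 2 and the number of samples")
    nearest = sorted(samples, key=lambda s: abs(s[0] - query_x))[:k]
    slope, intercept = fit_line([x for x, _ in nearest],
                                [y for _, y in nearest])
    return slope * query_x + intercept


# Made-up history whose slope changes between small and large inputs.
history = [(1, 10), (2, 20), (3, 30), (10, 200), (11, 220), (12, 240)]
small_job = llr_predict(history, 2.5)   # fitted on the three small samples
large_job = llr_predict(history, 11)    # fitted on the three large samples
```

A single global line through all six points would have to compromise between the two regimes, which is the failure mode the slide attributes to the simple linear model.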

SLIDE 7

Cost-Perf. Ratio-based Resource Evaluation

SLIDE 8

Availability-Aware Job Scheduling

AAJS first assigns as many jobs as possible to currently running VMs based on CP evaluation results.

– Maximize machine utilization of currently running VM instances.
– Minimize overhead from starting new VMs.

Job Assignment Criteria

1) VM which has the higher order (rank) in Cost/Performance ratio.
2) VM which offers the earliest job completion time if multiple options are available.

Job completion time: Queue Wait Time + New Job Exec Time
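The two assignment criteria can be sketched as a sort key over the running VMs. This is an illustrative sketch under stated assumptions, not the AAJS implementation: the `RunningVM` record, the convention that rank 1 is the best Cost/Performance rank, and the deadline filter are all assumptions added here; the slide only gives the two criteria and the completion-time sum.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningVM:
    """Hypothetical record for a VM that is already running."""
    name: str
    cp_rank: int              # assumed convention: 1 = best C/P rank
    queue_wait_s: float       # time until its queued jobs finish
    predicted_exec_s: float   # predicted run time of the new job on this VM

    @property
    def completion_s(self) -> float:
        # Queue Wait Time + New Job Exec Time
        return self.queue_wait_s + self.predicted_exec_s


def pick_running_vm(vms: List[RunningVM],
                    deadline_s: float) -> Optional[RunningVM]:
    """Return the running VM to use, or None if none can meet the deadline."""
    # Assumption: only VMs that can finish before the deadline are candidates.
    feasible = [vm for vm in vms if vm.completion_s <= deadline_s]
    if not feasible:
        return None  # caller would have to consider starting a new VM
    # Criterion 1: better C/P rank. Criterion 2: earliest completion time.
    return min(feasible, key=lambda vm: (vm.cp_rank, vm.completion_s))


vms = [
    RunningVM("vm-a", cp_rank=1, queue_wait_s=600, predicted_exec_s=900),
    RunningVM("vm-b", cp_rank=1, queue_wait_s=0, predicted_exec_s=1000),
    RunningVM("vm-c", cp_rank=2, queue_wait_s=0, predicted_exec_s=500),
]
choice = pick_running_vm(vms, deadline_s=1800)  # vm-a and vm-b tie on rank
```

With an 1800-second deadline, `vm-a` and `vm-b` share the best rank and `vm-b` wins on completion time; with a tighter deadline only `vm-c` remains feasible.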

SLIDE 9

Experiment Setup

Baselines

– SCS – MH [SC 2011]
– SCS + LLR [NEW]

Implementation & Deployment

– LCA and 2 baselines on AWS

Workload Generation

# of Jobs       100 Watershed Delineation Jobs
Job Deadline    Mean Deadline: 30 min.    STD DEV: 9.7 min.
Job Duration    Mean Duration: 15 min.    STD DEV: 12.5 min.

Workload patterns: (a) Steady (b) Bursty (c) Incremental (d) Random

VM Types for Experiments

Instance Type   CPU/Mem   Hourly Price
m1.small        1/1.7G    $0.091/Hr.
m1.medium       1/3.7G    $0.182/Hr.
m1.large        2/7.5G    $0.364/Hr.
m1.xlarge       4/15G     $0.728/Hr.
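To make the cost side of the setup concrete, the sketch below turns the hourly prices listed on this slide into a per-VM bill. It assumes billing by whole instance-hours with partial hours rounded up, which is how EC2 On-Demand instances were charged at the time of these prices; the function name `billed_cost` is hypothetical and the slides do not show how the authors computed cost.

```python
import math

# Hourly prices copied from the instance table on this slide (USD).
HOURLY_PRICE = {
    "m1.small": 0.091,
    "m1.medium": 0.182,
    "m1.large": 0.364,
    "m1.xlarge": 0.728,
}


def billed_cost(instance_type: str, runtime_min: float) -> float:
    """Cost of keeping one VM up for runtime_min minutes.

    Assumption: every started hour is billed in full, and a VM that
    was started is billed for at least one hour.
    """
    billed_hours = max(1, math.ceil(runtime_min / 60))
    return billed_hours * HOURLY_PRICE[instance_type]


one_mean_job = billed_cost("m1.large", 15)    # a single 15-minute job
four_mean_jobs = billed_cost("m1.large", 60)  # four such jobs back to back
```

Under this billing assumption a 15-minute job on its own VM costs the same as four such jobs packed onto one VM, which is one reason packing jobs onto running VMs can lower cost.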

SLIDE 10

Job Exec. Time Predictor Performance

                     LLR      LR       kNN      Mean
Avg. Predict. Acc.   78.77%   67.62%   65.38%   60.99%
MAPE                 0.2773   0.3901   0.5012   0.8254

LLR: Local Linear Regression, LR: Linear Regression, MAPE: Mean Absolute Percentage Error
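For reference, MAPE in its standard form is the mean of |actual - predicted| / |actual| over all jobs, reported here as a fraction rather than a percentage (matching values such as 0.2773 in the table). The sketch below uses made-up numbers; the function name `mape` is hypothetical, and the slides do not say how the separate "Avg. Predict. Acc." figure was computed.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned as a fraction."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs((a - p) / a)
               for a, p in zip(actual, predicted)) / len(actual)


# Made-up execution times in seconds: each prediction is off by 10%.
example = mape([100.0, 200.0], [110.0, 180.0])
```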

SLIDE 11

Job Deadline Satisfaction Rate

LCA: average job deadline satisfaction rate of 83.25%

9% better than SCS+LLR
33% better than SCS

SLIDE 12

Overall Running Cost

LCA: average overall running cost of $8.9

$2.5 cheaper than SCS+LLR
$5.2 more expensive than SCS (but performance is not comparable)

SLIDE 13

Conclusion

LCA is a novel elastic resource management system for scientific applications on public IaaS clouds, based on three approaches:

[1] Local Linear Regression-based Job Execution Time Prediction
[2] Cost-Performance Ratio-based Resource Evaluation
[3] Availability-Aware Job Scheduling and VM Scaling

LCA performs better than the baselines (SCS, SCS with LLR) in four different workload patterns (Steady, Bursty, Incremental, Random).

– Predictor Performance: 11%-18% better accuracy
– Job Deadline Satisfaction Rate: 9%-33% better rate
– Overall Running Cost: $2.45 (22%) better cost efficiency

SLIDE 14

Thank you & Questions?

SLIDE 15

Back-up Slides

SLIDE 16

LCA System Design

[Figure: LCA system architecture]

Components:
– Prediction Module: LLR Predictor, Job History Repository
– Resource Evaluation: Cost-Performance Optimized Evaluation
– Job Scheduling & VM Scaling: Availability-Aware Job Scheduling and VM Scaling
– VM Manager
– VMs on IaaS
– User

Flow labels: Job + Deadline; Request Samples; Prediction Results; VM Ranking & Selection; VM Req, Job Assign; +/- VMs, Job Assignment; Update Exe Info; Results

SLIDE 17

VM Utilization

[Figure: VM utilization breakdown; legend: Startup, Idle, Job Running]

LCA: average VM utilization of 69.17%

25% higher than SCS + LLR
11% higher than SCS

SLIDE 18

VM Instance Types

TABLE. SPECIFICATION OF GENERAL PURPOSE MICROSOFT WINDOWS INSTANCES ON AMAZON EC2 IN US EAST REGION (THE PRICE IS BASED ON MARCH 2014)

Instance Type   ECU[1]   CPU Cores   Memory   Hourly Price
m1.small        1        1           1.7GB    $0.091/Hr.
m1.medium       2        1           3.7GB    $0.182/Hr.
m1.large        4        2           7.5GB    $0.364/Hr.
m1.xlarge       8        4           15GB     $0.728/Hr.

[1] A single ECU (EC2 Compute Unit) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.