Evaluation of a Failure Prediction Model for Large Scale Cloud Applications



SLIDE 1

Evaluation of a Failure Prediction Model for Large Scale Cloud Applications

Mohammad S. Jassas and Qusay H. Mahmoud

Presentation at Canadian AI 2020

SLIDE 2

Introduction

▪ Cloud services → Complexity for cloud architectures.

▪ Cloud applications have a high probability of failure.

▪ Most cloud providers have experienced a failure in at least one of their services.

▪ AWS experienced a failure in its Elastic Block Store (EBS) service [7].

▪ Many organizations are planning to use public cloud environments.

▪ Cloud providers → Maintaining their services to provide cloud consumers with a high level of QoS.

[7] P. Marshall, K. Keahey, T. Freeman, Elastic site: Using clouds to elastically extend site resources, in: 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010).

SLIDE 3

Problem Statement

Providing 24x7 service uptime has become one of the most significant challenges facing cloud providers.

Failed jobs consume a notable amount of computational resources and memory.

SLIDE 4

Objective

▪ High reliability and availability

▪ Minimize time, cost, and resource wastage

▪ Decrease the number of failed tasks

▪ Increase the performance of cloud apps

SLIDE 5

Related Work

▪ Failure analysis and characterization have been studied widely in grid computing, cloud clusters, and supercomputers [1].

▪ The Google traces [3] have been used in a range of research studies, including workload characterization [5] and the application of statistical methods.

▪ In [2], we studied workload features such as memory usage, CPU speed, and disk space.

▪ Limited research has been done on failure prediction [4,5,6].

▪ El-Sayed et al. [4] designed a job failure prediction model using a random forest (RF) classifier.

[1] Chen, X., Lu, C.D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. IEEE (2014)

[2] Jassas, M., Mahmoud, Q.H.: Failure analysis and characterization of scheduling jobs in Google cluster trace. IEEE (2018)

[3] Reiss, C., Wilkes, J., Hellerstein, J.L.: Google cluster-usage traces: format + schema. Google Inc. (2011)

[4] El-Sayed, N., Zhu, H., Schroeder, B.: Learning from failure across multiple clusters: a trace-driven approach to understanding, predicting, and mitigating job terminations. IEEE (2017)

[5] Chen, X., Lu, C.D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. IEEE (2014)

[6] Ros, A., Chen, L.Y., Binder, W.: Predicting and mitigating jobs failures in big data clusters. In: 2015 15th IEEE/ACM (2015)

SLIDE 6

Proposed Solution

SLIDE 7

Experiments and Evaluation Results

▪ Trace Description (Google and LANL)

▪ Experimental Setup

▪ scikit-learn → ML packages in Python

▪ Microsoft Azure → The Google trace has large volumes of data, requiring HPC nodes for analysis and prediction.

google/cluster-data: Borg cluster traces from Google - GitHub

The Atlas Cluster Trace Repository | USENIX
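The setup above can be illustrated with a minimal scikit-learn sketch. It does not read the Google or LANL traces: the job records are synthetic, and the feature names (`cpu_request`, `memory_request`, `priority`), the failure rule, and the choice of a random forest are assumptions made only for illustration, not the authors' pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def make_synthetic_jobs(n_jobs=2000, seed=0):
    """Generate hypothetical job records; NOT real Google or LANL trace data."""
    rng = np.random.default_rng(seed)
    cpu_request = rng.uniform(0.0, 1.0, n_jobs)
    memory_request = rng.uniform(0.0, 1.0, n_jobs)
    priority = rng.integers(0, 12, n_jobs)
    # Assumed rule for illustration only: larger requests fail more often.
    fail_prob = 0.1 + 0.7 * (cpu_request * memory_request)
    failed = (rng.uniform(0.0, 1.0, n_jobs) < fail_prob).astype(int)
    X = np.column_stack([cpu_request, memory_request, priority])
    return X, failed


X, y = make_synthetic_jobs()

# Hold out 30% of the jobs, keeping the failed/finished ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit a classifier on the training jobs and score it on the held-out jobs.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.3f}")
```

With a real trace, the synthetic generator would be replaced by code that loads the job and task event tables and derives one feature row per job.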

SLIDE 8

Experiments and Evaluation Results

▪ Classifiers and Prediction Techniques

Fig. 5. Performance evaluation of different algorithms applied to the Google trace
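A comparison of this kind could be sketched as follows. The slide does not list the algorithms or metrics here, so the four classifiers and the precision, recall, and F1 scores below are assumed choices, and the data comes from scikit-learn's `make_classification` rather than from the Google trace.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a job trace: label 1 plays the role of "failed".
X, y = make_classification(
    n_samples=3000, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Assumed candidate set; the presentation may have used different algorithms.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
}

results = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    results[name] = {
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Reporting precision and recall alongside accuracy matters when failed jobs are the minority class, because a model that never predicts failure can still score high accuracy.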

SLIDE 9

Experiments and Evaluation Results

Fig. 6. Performance evaluation of different algorithms applied to the Mustang and Trinity Traces

SLIDE 10

Experiments and Evaluation Results

▪ Feature Selection Algorithms
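The slide does not say which feature selection algorithms were evaluated. As one possible illustration, the sketch below ranks features with scikit-learn's univariate `SelectKBest` and the ANOVA F-score (`f_classif`); the feature names are hypothetical labels attached to synthetic columns, not fields from the traces.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical feature names attached to synthetic columns.
feature_names = [
    "cpu_request", "memory_request", "disk_request",
    "priority", "scheduling_class", "task_count",
]

X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3,
    n_redundant=0, random_state=0,
)

# Keep the 3 features with the highest ANOVA F-score against the label.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

selected = [
    name for name, keep in zip(feature_names, selector.get_support()) if keep
]
ranked = sorted(
    zip(feature_names, selector.scores_), key=lambda pair: pair[1], reverse=True
)
for name, score in ranked:
    print(f"{name}: F = {score:.1f}")
print("selected:", selected)
```

The reduced matrix `X_selected` can then be passed to any classifier in place of the full feature set, which shortens training on large traces.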

SLIDE 11

Conclusion and Future Work

Developing a prediction model for failed jobs based on ML methods.
Detecting failed jobs before the cloud management system schedules them.
Increasing the reliability and availability of cloud job execution.
Applying different classification algorithms to various workload traces.
In future work, we will develop the proposed model using a deep learning approach to improve the accuracy.

In addition, future research will consider failure mitigation policies and techniques.

SLIDE 12


Mohammad S. Jassas: mohammad.jassas@ontariotechu.net

Qusay H. Mahmoud: qusay.mahmoud@ontariotechu.net