Evaluation of a Failure Prediction Model for Large Scale Cloud Applications




SLIDE 1

Evaluation of a Failure Prediction Model for Large Scale Cloud Applications

Mohammad S. Jassas and Qusay H. Mahmoud

Presentation at Canadian AI 2020

SLIDE 2

Introduction

▪ Cloud services → Complexity for cloud architectures.

▪ Cloud applications have a high probability of failures

▪ Most Cloud providers have experienced failure in one of their services

▪ AWS experienced a failure in its Elastic Block Store (EBS) service [7].

▪ Many organizations are planning to use public cloud environments.

▪ Cloud providers → Maintaining their services to provide cloud consumers with a high level of quality of service (QoS).

[7] P. Marshall, K. Keahey, T. Freeman, Elastic site: Using clouds to elastically extend site resources, in: 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010).

SLIDE 3

Problem Statement

  • Providing 24x7 service uptime has become one of the most significant challenges facing cloud providers.

  • Failed jobs consume a notable amount of computational resources and memory.

SLIDE 4

Objective

  • High Reliability + Availability
  • Minimize Time + Cost
  • Resource wastage
  • Decrease the number of failed tasks
  • Increase the performance of Cloud apps

SLIDE 5

Related Work

▪ Failure analysis and characterization have been studied widely in grid computing, cloud clusters, and supercomputers [1].

▪ The Google traces [3] have been used in a range of research studies, including workload characterization [5] and the application of statistical methods.

▪ In [2], we studied workload features such as memory usage, CPU speed, and disk space.

▪ Limited research has been done on failure prediction [4,5,6].

▪ El-Sayed et al. [4] designed a job failure prediction model using a random forest (RF) classifier.

[1] Chen, X., Lu, C.D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. IEEE (2014).
[2] Jassas, M., Mahmoud, Q.H.: Failure analysis and characterization of scheduling jobs in Google cluster trace. IEEE (2018).
[3] Reiss, C., Wilkes, J., Hellerstein, J.L.: Google cluster-usage traces: format + schema. Google Inc. (2011).
[4] El-Sayed, N., Zhu, H., Schroeder, B.: Learning from failure across multiple clusters: a trace-driven approach to understanding, predicting, and mitigating job terminations. IEEE (2017).
[5] Chen, X., Lu, C.D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. IEEE (2014).
[6] Rosà, A., Chen, L.Y., Binder, W.: Predicting and mitigating jobs failures in big data clusters. In: 2015 15th IEEE/ACM (2015).

SLIDE 6

Proposed Solution

SLIDE 7

Experiments and Evaluation Results

▪ Trace Description (Google and LANL)

▪ Experimental Setup

  • scikit-learn → ML packages in Python

  • Microsoft Azure → The Google trace has large volumes of data, requiring HPC nodes for analysis and prediction.

Trace sources: google/cluster-data, Borg cluster traces from Google (GitHub); The Atlas Cluster Trace Repository (USENIX)

SLIDE 8

Experiments and Evaluation Results

▪ Classifiers and Prediction Techniques

Fig. 5. Performance evaluation of different algorithms applied to the Google trace
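The figure itself is not reproduced in this transcript. As a rough illustration of how classifiers can be compared on a common footing, the sketch below scores one set of predictions against the true job outcomes. It assumes binary labels (1 = failed job, 0 = finished job). The slides name only accuracy, so precision, recall, and F1 are assumptions added here because failed jobs are typically the minority class. The function name `evaluate` is hypothetical, not taken from the presentation.

```python
def evaluate(y_true, y_pred):
    """Score binary predictions (1 = failed job, 0 = finished job).

    Hypothetical helper for illustration; the metric set is an assumption.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # failures caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # finished jobs passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed failures

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels, not taken from any trace.
    truth = [1, 1, 1, 0, 0, 0, 0, 0]
    preds = [1, 1, 0, 1, 0, 0, 0, 0]
    print(evaluate(truth, preds))
```

Running the same function over the predictions of each candidate classifier gives directly comparable rows for a chart of this kind.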

SLIDE 9

Experiments and Evaluation Results

Fig. 6. Performance evaluation of different algorithms applied to the Mustang and Trinity traces

SLIDE 10

Experiments and Evaluation Results

▪ Feature Selection Algorithms
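The slide does not say which feature selection algorithms were evaluated. As one simple possibility, the sketch below shows a filter-style selector that ranks each workload feature by the absolute Pearson correlation between its values and the failure label, then keeps the top k. The choice of correlation as the score, the function names, and the toy feature columns (named after the workload features mentioned under Related Work) are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_top_k(features, labels, k):
    """Rank features by |correlation with the label| and keep the k best.

    `features` maps a feature name to its column of values.
    Hypothetical helper; not the authors' selection method.
    """
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


if __name__ == "__main__":
    # Toy columns, not taken from any trace.
    toy = {
        "memory_usage": [1, 2, 3, 4],
        "cpu_speed": [5, 5, 5, 5],
        "disk_space": [1, 3, 2, 4],
    }
    failed = [0, 0, 1, 1]
    print(select_top_k(toy, failed, k=2))
```

A filter of this kind is cheap on large traces because it scores each feature independently, but it cannot see interactions between features, which wrapper or model-based selectors can.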

SLIDE 11

Conclusion and Future Work

  • Developing a prediction model for failed jobs based on ML methods.
  • Detecting failed jobs before the cloud management system schedules them.
  • Increasing the reliability and availability of cloud job execution.
  • Applying different classification algorithms to various workload traces.
  • In future work, we will develop the proposed model using a deep learning approach to improve the accuracy.
  • In addition, future research will consider mitigation policies and techniques.

SLIDE 12

Mohammad S. Jassas mohammad.jassas@ontariotechu.net
Qusay H. Mahmoud qusay.mahmoud@ontariotechu.net