Evaluation of a Failure Prediction Model for Large Scale Cloud Applications
Mohammad S. Jassas and Qusay H. Mahmoud
Presentation at Canadian AI 2020
Introduction
▪ Cloud services → Complexity for cloud architectures.
▪ Cloud applications have a high probability of failure.
▪ Most cloud providers have experienced a failure in at least one of their services.
▪ AWS experienced a failure in Elastic Block Store (EBS) [7].
▪ Many organizations are planning to use public cloud environments.
▪ Cloud providers → Maintaining their services to provide cloud consumers with a high level of quality of service (QoS).
[7] P. Marshall, K. Keahey, T. Freeman, Elastic site: Using clouds to elastically extend site resources, in: 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010).
Problem Statement
- Providing 24x7 service uptime has become one of the most
significant challenges facing cloud providers.
- Failed jobs consume a notable amount of computational
resources and memory.
Objective
- High reliability + availability
- Minimize time + cost and resource wastage
- Decrease the number of failed tasks
- Increase the performance of cloud apps
Related Work
▪ Failure analysis and characterization have been studied widely in grid computing, cloud clusters, and supercomputers [1].
▪ The Google traces [3] are used in different research studies, including workload characterization [5] and applying statistical methods.
▪ In [2], we studied workload features such as memory usage, CPU speed, and disk space.
▪ Limited research has been done on failure prediction [4,5,6].
▪ El-Sayed et al. [4] designed a job failure prediction model using a random forest (RF) classifier.
[1] Chen, X., Lu, C.D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. IEEE (2014)
[2] Jassas, M., Mahmoud, Q.H.: Failure analysis and characterization of scheduling jobs in Google cluster trace. IEEE (2018)
[3] Reiss, C., Wilkes, J., Hellerstein, J.L.: Google cluster-usage traces: format + schema. Google Inc. (2011)
[4] El-Sayed, N., Zhu, H., Schroeder, B.: Learning from failure across multiple clusters: a trace-driven approach to understanding, predicting, and mitigating job terminations. IEEE (2017)
[5] Chen, X., Lu, C.D., Pattabiraman, K.: Failure analysis of jobs in compute clouds: a Google cluster case study. IEEE (2014)
[6] Ros, A., Chen, L.Y., Binder, W.: Predicting and mitigating jobs failures in big data clusters. In: 2015 15th IEEE/ACM (2015)
Proposed Solution
Experiments and Evaluation Results
▪ Trace Description (Google and LANL)
▪ Experimental Setup
▪ scikit-learn → ML packages in Python
▪ Microsoft Azure → Google trace has large volumes of data requiring HPC nodes for analysis and prediction.
Trace sources:
▪ google/cluster-data: Borg cluster traces from Google (GitHub)
▪ The Atlas Cluster Trace Repository (USENIX)
▪ Classifiers and Prediction Techniques
Fig. 5. Performance evaluation of different algorithms applied to the Google trace
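A comparison of this kind can be sketched with scikit-learn, the package named in the experimental setup. The following is a minimal sketch, not the authors' pipeline: the data are synthetic (generated with `make_classification`), the three candidate classifiers and the four metrics are assumptions chosen for illustration, and the slides do not state which algorithms or metrics appear in the figure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for job records (assumption): label 1 = failed job,
# generated as the minority class (about 20% of samples).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical candidate set; the slides do not list the algorithms compared.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each classifier on the same split and score it on the held-out jobs.
results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Using one stratified split for every classifier keeps the comparison like-for-like; with a real trace, the feature matrix would come from the job and task records instead of `make_classification`.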
Fig. 6. Performance evaluation of different algorithms applied to the Mustang and Trinity traces
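When reading evaluation results like these, it helps to recall how the usual classification metrics follow from confusion-matrix counts, and why accuracy alone can mislead when failed jobs are a minority. The sketch below uses made-up counts, not numbers from the traces; the function name `classification_metrics` and the choice of metrics are assumptions for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1, treating 'failed' as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 1000 jobs, 100 of which actually failed.
model = classification_metrics(tp=80, fp=20, fn=20, tn=880)

# Baseline that predicts "success" for every job: no failures are caught.
always_success = classification_metrics(tp=0, fp=0, fn=100, tn=900)

print({k: round(v, 3) for k, v in model.items()})
print({k: round(v, 3) for k, v in always_success.items()})
```

With these counts the always-success baseline still reaches 90% accuracy while its recall on failed jobs is zero, which is why precision, recall, and F1 are worth reporting next to accuracy.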
▪ Feature Selection Algorithms
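As an illustration of feature selection in scikit-learn, the sketch below ranks features with a univariate ANOVA F-test and keeps the top three. This is an assumption made for the example: the slides do not say which feature selection algorithms were evaluated, and the data and feature names (`feature_0` ... `feature_9`) are synthetic placeholders rather than trace attributes.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for job attributes (assumption); names are placeholders.
feature_names = [f"feature_{i}" for i in range(10)]
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Score each feature against the failed/successful label and keep the best k.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns.
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]

print("kept features:", selected)
print("reduced shape:", X_selected.shape)
```

Fitting the selector on the training split only, then applying it to the test split, avoids leaking label information into the evaluation.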
Conclusion and Future Work
- Developing a prediction model for failed jobs based on ML methods.
- Detecting failed jobs before the cloud management system schedules them.
- Increasing the reliability and availability of cloud job execution.
- Applying different classification algorithms to various workload traces.
- In future work, we will develop the proposed model using a deep learning
approach to improve the accuracy.
- Future research will also consider mitigation policies and techniques.
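The idea of detecting failed jobs before they are scheduled can be sketched as a simple gate in front of the scheduler. Everything here is hypothetical: the `Job` fields, the `split_by_predicted_failure` helper, the 0.5 threshold, and the toy probability rule that stands in for a trained classifier. What happens to flagged jobs is left open, since mitigation policies are listed as future work.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Job:
    job_id: str
    requested_cpu: float     # hypothetical feature, normalized to [0, 1]
    requested_memory: float  # hypothetical feature, normalized to [0, 1]


def split_by_predicted_failure(
    jobs: Iterable[Job],
    failure_probability: Callable[[Job], float],
    threshold: float = 0.5,
) -> Tuple[List[Job], List[Job]]:
    """Separate jobs predicted to fail from jobs that go on to the scheduler."""
    to_schedule: List[Job] = []
    flagged: List[Job] = []
    for job in jobs:
        if failure_probability(job) >= threshold:
            flagged.append(job)
        else:
            to_schedule.append(job)
    return to_schedule, flagged


def toy_failure_probability(job: Job) -> float:
    # Placeholder rule standing in for a trained model's probability output.
    return 0.9 if job.requested_memory > 0.8 else 0.1


jobs = [Job("a", 0.2, 0.3), Job("b", 0.5, 0.95), Job("c", 0.1, 0.1)]
to_schedule, flagged = split_by_predicted_failure(jobs, toy_failure_probability)

print("schedule:", [job.job_id for job in to_schedule])
print("flagged:", [job.job_id for job in flagged])
```

Passing the predictor in as a callable keeps the gate independent of the model; with scikit-learn, the probability could come from a fitted classifier's `predict_proba`.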