Intelligent Operation and Maintenance of Public Cloud Based on GPU-Accelerated Machine Learning
Feng Xie(stephen.xf@alibaba-inc.com) Zhaowei Ouyang(zhaowei.oyzw@alibaba-inc.com)
AIOps on Public Cloud
Data:
  VM Data: KPI, Abnor., Event
  Node Data: KPI, Abnor., Power
  Cluster Data: KPI, Abnor., Event
  IDC Data: Power, Rack Data
Platform: MaxCompute, Blink, SLS, HybridDB for MySQL, Dask, Rapids
Scenario and Case:
  Scheduling: Resource arrangement, Power management, Load Balance
  Maintenance and Upgrading: Migration downtime prediction, Outage/Failure prediction, Anomaly detection
  Portrait: Customer Portrait, VM Portrait, Cluster health portrait
  Product recommendations: Analysis of purchasing behavior, Resource demand analysis, …
Algorithm: Time Series, Classification, Regression, Clustering, …
Online Data (SLS / HybridDB for MySQL / Blink), Offline Data (MaxCompute)
Client, Web Server, Message Queue
Dask Scheduler, Dask Worker (Data Prepare), Dask Worker (Train), Dask Worker (Predict)
Model Repository (OSS), Redis
CPU, Network, Storage
Load Balancing, Traffic Warning, Anomaly detection, Resource Scheduling
Similarity, Periodicity
Training flow: Acquire Training Data → FFT → periodic?
  Is periodic → Clustering? Yes: train a cluster-specific regression model. No: train a general regression model.
  Is not periodic → train a general regression model.
Prediction flow: Acquire Historical Data → FFT → periodic?
  Is periodic → Classified? Yes: predict with a cluster-specific regression model. No: predict with a general regression model.
  Is not periodic → predict with a general regression model.
The Fast Fourier Transform (FFT) is used to transform the time series data from the time domain to the frequency domain. The frequency domain distribution is analyzed to determine whether it is periodic or not.
$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1$$
[Figure: original series, input time series, and its frequency-domain representation]
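The periodicity check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a small pure-Python radix-2 FFT in place of cuFFT, and the function names, the `min_power_share` threshold of 0.5, and the `cluster_of` callback are all assumptions introduced for the example. A series is treated as periodic when a single frequency bin holds most of the non-DC spectral power, and the result then routes the series to a cluster-specific or a general model as in the training and prediction flows.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(values[0::2])
    odd = fft(values[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def dominant_period(series, min_power_share=0.5):
    """Return the dominant period in samples, or None if no frequency dominates.

    min_power_share is a hypothetical threshold: the share of non-DC spectral
    power that the strongest frequency bin must hold.
    """
    n = len(series)
    mean = sum(series) / n
    spectrum = fft([x - mean for x in series])  # subtract the mean to drop the DC term
    power = [abs(c) ** 2 for c in spectrum[1 : n // 2 + 1]]  # bins 1 .. n/2
    total = sum(power)
    if total == 0:
        return None  # constant series carries no periodic component
    peak = max(range(len(power)), key=power.__getitem__)
    if power[peak] / total < min_power_share:
        return None
    return n / (peak + 1)  # bin k corresponds to a period of n / k samples


def choose_model(series, cluster_of=None):
    """Periodic and clustered series get a cluster-specific model, all others the general one."""
    if dominant_period(series) is None:
        return "general"
    cluster = cluster_of(series) if cluster_of else None
    return "general" if cluster is None else f"cluster-{cluster}"


wave = [math.sin(2 * math.pi * t / 8) for t in range(64)]  # period of 8 samples
impulse = [1.0] + [0.0] * 63                               # flat spectrum, not periodic
print(dominant_period(wave), dominant_period(impulse))
```

The fixed threshold is the simplest possible decision rule; a production check would likely need to handle noise, trends, and several competing periods.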
Use DTW distance as a measure of similarity between time series
Pairwise DTW distance matrix over the series indexed 0, …, n:
$$D = \left( d_{i,j} \right)_{0 \le i,\, j \le n}$$
A second matrix with the same index layout is derived from it and involves the factor $2\sigma^2$.
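Computing the Dynamic Time Warping (DTW) distance is expensive in both time and space: the standard dynamic-programming algorithm needs O(nm) time and memory for two series of lengths n and m, and a pairwise distance matrix needs it for every pair of series. The parallel computing power of the GPU can be used to accelerate the DTW distance calculation.
Use cuFFT to accelerate the FFT calculation over massive time series data.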
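For reference, the DTW distance itself can be sketched on the CPU as below. This is a plain dynamic-programming version under stated assumptions (absolute difference as the local cost, no warping-window constraint), not the GPU kernel used in the talk; the function names are made up for the example. Every pair in the distance matrix is computed independently of the others, which is what makes the workload a natural fit for GPU parallelism.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW with |x - y| as the local cost."""
    inf = float("inf")
    # prev[j] holds the best cumulative cost of aligning the prefix of `a`
    # processed so far with the first j items of `b`.
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # extend the cheapest of: insertion, deletion, match
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def pairwise_dtw(series_list):
    """Symmetric matrix of DTW distances with a zero diagonal."""
    n = len(series_list)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = dtw_distance(series_list[i], series_list[j])
    return dist


print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the repeated 2 is absorbed by warping
print(dtw_distance([1, 2, 3], [2, 3, 4]))     # 2.0
```

Keeping only two rows of the cost table reduces memory to O(m) per pair, while the running time stays O(nm).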
Model comparison (advantage / disadvantage):
  ARIMA: simple hyperparameter optimization / low accuracy
  LSTM: high accuracy / complicated hyperparameter optimization, poor interpretability
  XGBoost Regression Tree: high accuracy, good interpretability / complicated hyperparameter optimization
Predict Next 24 Hours: Result
[Figure: Feature (… migration) and Result]
Use XGBoost Classification Tree to predict whether a VM is migration-sensitive
Is the VM migration-sensitive? (classification algorithm)
  Insensitive → migrate immediately.
  Sensitive → predict the next 24 hours of load (regression algorithm), then predict the nearest migration-insensitive window in the next 24 hours (classification algorithm).
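The decision flow above can be sketched as follows. This is a hedged illustration only: `predict_load` and `classify_hour` stand in for the regression and classification models, and those names, the hourly granularity, and the `window_hours` parameter are assumptions made for the example rather than details from the talk.

```python
def nearest_insensitive_window(hourly_insensitive, window_hours=1):
    """Start hour of the earliest run of `window_hours` insensitive hours, or None."""
    run = 0
    for hour, ok in enumerate(hourly_insensitive):
        run = run + 1 if ok else 0  # length of the current insensitive streak
        if run >= window_hours:
            return hour - window_hours + 1
    return None


def plan_migration(sensitive_now, predict_load, classify_hour, window_hours=1):
    """Return the hour offset at which to migrate (0 = now), or None if no window exists.

    predict_load:  hypothetical callable returning the forecast hourly loads.
    classify_hour: hypothetical callable mapping one forecast load to True when
                   that hour is judged migration-insensitive.
    """
    if not sensitive_now:
        return 0  # insensitive VM: migrate immediately
    forecast = predict_load()                       # regression step
    flags = [classify_hour(v) for v in forecast]    # classification step
    return nearest_insensitive_window(flags, window_hours)


# Toy forecast with made-up load values, shortened to four hours for readability.
offset = plan_migration(
    sensitive_now=True,
    predict_load=lambda: [0.9, 0.8, 0.2, 0.1],
    classify_hour=lambda load: load < 0.5,
    window_hours=2,
)
print(offset)  # 2
```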
Accuracy ≈ 70%; recall on the migration-sensitive class: 76%
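As a reminder of how these two metrics are defined, the short sketch below computes them from label lists. The labels are invented toy data chosen only to illustrate the formulas, not the authors' evaluation set, and the function name is hypothetical. Accuracy counts all correct predictions, while recall counts only how many of the truly migration-sensitive VMs (label 1 here) were found.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Accuracy over all samples and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed positives
    correct = sum(t == p for t, p in pairs)
    accuracy = correct / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, recall


# Toy labels: 1 = migration-sensitive, 0 = migration-insensitive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(accuracy_and_recall(y_true, y_pred))  # (0.7, 0.75)
```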
GPU: 8 × NVIDIA Tesla P100
CPU: 2-socket Intel Xeon E5-2682 v4 (Broadwell)
Latency: 25 ms → 10 ms (60% drop)
Throughput: 20x speed-up