Accelerated Machine Learning Feng Xie(stephen.xf@alibaba-inc.com) - - PowerPoint PPT Presentation


Intelligent Operation and Maintenance of Public Cloud Based on GPU- Accelerated Machine Learning Feng Xie(stephen.xf@alibaba-inc.com) Zhaowei Ouyang(zhaowei.oyzw@alibaba-inc.com) AIOps on Public Cloud Product Maintenance and Scenario


slide-1
SLIDE 1

Intelligent Operation and Maintenance of Public Cloud Based on GPU-Accelerated Machine Learning

Feng Xie(stephen.xf@alibaba-inc.com) Zhaowei Ouyang(zhaowei.oyzw@alibaba-inc.com)

slide-2
SLIDE 2

AIOps on Public Cloud

Data Platform Case Scenario

VM Data

KPI,Abnor.,Event

Node Data

KPI,Abnor.,Power

Cluster Data

KPI,Abnor.,Event

IDC Data

Power, Rack Data

Data Platform: MaxCompute, Blink, SLS, HybridDB for MySQL, Dask, RAPIDS

Case: Scheduling, Maintenance and Upgrading, Portrait, Product recommendations

Scenario:

  • Resource arrangement
  • Power management
  • Load balancing
  • Migration downtime prediction
  • Outage/failure prediction
  • Anomaly detection
  • Customer portrait
  • VM portrait
  • Cluster health portrait
  • Analysis of purchasing behavior
  • Resource demand analysis
  • …

Algorithm

Time Series, Classification, Regression, Clustering, …

slide-3
SLIDE 3

Machine Learning Platform Architecture

Data sources: Online Data (SLS / HybridDB for MySQL / Blink), Offline Data (MaxCompute)

Components: Client, Web Server, Dask Scheduler, Dask Worker (Data Prepare), Dask Worker (Train), Dask Worker (Predict)

Message Queue

Model Repository (OSS) Redis

slide-4
SLIDE 4

KPI Prediction

CPU Network Storage

Load Balancing Traffic Warning Anomaly detection Resource Scheduling

slide-5
SLIDE 5

CPU Load Time Series

Similarity Periodicity

slide-6
SLIDE 6

Training Flow Chart

Flow chart nodes: Acquire Training Data → FFT → Is periodic / Is not periodic → Clustering? (Yes / No) → Training a cluster-specific regression model or Training a general regression model

slide-7
SLIDE 7

Predicting Flow Chart

Flow chart nodes: Acquire Historical Data → FFT → Is periodic / Is not periodic → Classified? (Yes / No) → Predicting with a cluster-specific regression model or Predicting with a general regression model

slide-8
SLIDE 8

Periodicity of Time Series

The Fast Fourier Transform (FFT) is used to transform the time series data from the time domain to the frequency domain. The frequency domain distribution is analyzed to determine whether it is periodic or not.

$$X_k = \sum_{n=0}^{N-1} x_n \cdot e^{-i 2\pi k n / N}$$
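The periodicity check can be illustrated with a minimal sketch. It evaluates the discrete Fourier transform sum directly (an O(N²) stand-in for an FFT, which computes the same values faster) and calls a series periodic when one frequency bin dominates the spectrum. The function names and the 0.5 power-ratio threshold are assumptions made for illustration; the slides do not state the decision rule the authors used.

```python
import cmath
import math


def dft(x):
    """Direct evaluation of X_k = sum_n x_n * exp(-i*2*pi*k*n/N)."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def dominant_period(x, power_ratio=0.5):
    """Return the dominant period in samples, or None.

    A series counts as periodic here when a single frequency bin holds at
    least `power_ratio` of the non-DC spectral power. The threshold is an
    illustrative assumption, not a value from the slides.
    """
    n_samples = len(x)
    mean = sum(x) / n_samples
    spectrum = dft([v - mean for v in x])  # subtract the mean to drop the DC term
    half = n_samples // 2
    power = [abs(c) ** 2 for c in spectrum[1:half + 1]]  # bins 1..N/2
    total = sum(power)
    if total == 0:
        return None  # constant series: no oscillation at all
    k = max(range(len(power)), key=power.__getitem__) + 1  # strongest bin index
    if power[k - 1] / total < power_ratio:
        return None  # energy is spread out: treat as not periodic
    return n_samples / k
```

For example, four days of hourly samples of a 24-hour sine wave yield a dominant period of 24 samples, while a single spike has a flat spectrum and is reported as not periodic.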

slide-9
SLIDE 9

Basics of Signal Processing

Figure panels: original series, input time series, frequency domain

slide-10
SLIDE 10

Similarity of Time Series

Use DTW distance as a measure of similarity between time series

The pairwise DTW distances d_{ij} between series i and j (0 ≤ i, j ≤ n) form a distance matrix, which is mapped entry by entry to a kernel (similarity) matrix K:

$$K_{ij} = e^{-d_{ij} / (2\sigma^2)}$$
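As a rough illustration of the similarity step, the sketch below computes the DTW distance with the standard dynamic-programming recurrence and then builds the kernel matrix from pairwise distances. The absolute-difference local cost, the function names, and the default `sigma` are assumptions; the exponent uses the unsquared distance as it appears on the slide, whereas Gaussian kernels are often written with a squared distance.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two sequences, O(len(a) * len(b)) time.

    Local cost is |x - y| (an assumption; the slides do not specify it).
    Only two rows of the cost table are kept in memory.
    """
    inf = math.inf
    prev = [0.0] + [inf] * len(b)  # row for "no elements of a consumed"
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of: insertion, deletion, match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[-1]


def kernel_matrix(series, sigma=1.0):
    """K[i][j] = exp(-d_ij / (2 * sigma^2)) from pairwise DTW distances."""
    n = len(series)
    k = [[1.0] * n for _ in range(n)]  # d_ii = 0, so the diagonal is exp(0) = 1
    for i in range(n):
        for j in range(i + 1, n):
            d = dtw_distance(series[i], series[j])
            k[i][j] = k[j][i] = math.exp(-d / (2 * sigma ** 2))
    return k
```

Every pair of series needs its own quadratic-time table, and the pairs are independent of one another, which is the kind of workload that parallelizes well.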

slide-11
SLIDE 11

GPU-Accelerated FFT and DTW Distance Calculation

Calculating the Dynamic Time Warping (DTW) distance has high time and space complexity. The parallel computing power of the GPU can be used to accelerate the DTW distance calculation.

cuFFT is used to accelerate the FFT calculation over massive time series data.

slide-12
SLIDE 12

Clustering results

slide-13
SLIDE 13

Time Series Regression Model

Model | Advantage | Disadvantage
ARIMA | Simple hyperparameter optimization | Low accuracy
LSTM | High accuracy | Complicated hyperparameter optimization; poor interpretability
XGBoost Regression Tree | High accuracy; good interpretability | Complicated hyperparameter optimization

slide-14
SLIDE 14

Regression Algorithm Result: XGBoost

Predict Next 24 Hours Result

slide-15
SLIDE 15

Regression algorithm accuracy

Algorithm | RMSE | MAPE
ARIMA | 70% < 5 | 70% < 0.5
XGB Regression Tree | 83% < 5 | 83% < 0.5
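For reference, the two error metrics can be computed as in the sketch below. It also includes a helper for the share of series whose error falls under a threshold, which is one plausible reading of entries such as "83% < 5"; that reading, and all function names, are assumptions rather than something the slides state.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.1 means 10%).

    Assumes no actual value is zero.
    """
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)


def share_below(errors, threshold):
    """Fraction of per-series errors strictly below `threshold`."""
    return sum(e < threshold for e in errors) / len(errors)
```

Under that reading, `share_below(per_series_rmse, 5)` would be compared against the RMSE column and `share_below(per_series_mape, 0.5)` against the MAPE column.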

slide-16
SLIDE 16

Migration downtime prediction

Feature

  • Average vCPU utilization (1 hour before migration)
  • Amplitude of vCPU utilization fluctuation (one day before migration)
  • VM instance type (how many vCPUs, how much memory?)
  • ……

Result

  • Migration-insensitive VM (downtime <= 100 ms)
  • Migration-sensitive VM (downtime > 100 ms)

Use XGBoost Classification Tree to predict whether a VM is migration-sensitive
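The features and the label described above could be prepared along the following lines before being handed to a classifier. This is only a sketch: the feature names, the use of max minus min as the "amplitude of fluctuation", and the function signatures are illustrative assumptions; only the 100 ms threshold comes from the slide.

```python
from statistics import mean

DOWNTIME_THRESHOLD_MS = 100  # boundary between insensitive and sensitive VMs


def is_migration_sensitive(downtime_ms):
    """Label a past migration: True when downtime exceeded 100 ms."""
    return downtime_ms > DOWNTIME_THRESHOLD_MS


def build_features(util_last_hour, util_last_day, vcpus, memory_gib):
    """Assemble one feature row for a VM (hypothetical feature names)."""
    return {
        "avg_vcpu_util_1h": mean(util_last_hour),
        # Amplitude taken as max - min over the day (an assumption).
        "vcpu_util_amplitude_1d": max(util_last_day) - min(util_last_day),
        "vcpus": vcpus,
        "memory_gib": memory_gib,
    }
```

Rows built this way, paired with labels from `is_migration_sensitive`, would form the training set for a binary classifier.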

slide-17
SLIDE 17

Migration Prediction Flow Chart

Classification algorithm: is the VM migration-sensitive?

  • Insensitive: migrate immediately
  • Sensitive: regression algorithm (predict the next 24 hours of load) → classification algorithm (predict the nearest migration-insensitive window in the next 24 hours)
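The decision flow can be sketched as a small function. It assumes an hourly load forecast and treats the classifier as an opaque callable; `plan_migration` and its parameters are hypothetical names, and the real classifier takes more features than a single load value.

```python
def plan_migration(current_load, forecast, is_sensitive):
    """Return how many hours to wait before migrating, or None.

    current_load: load observed now.
    forecast: predicted load for each of the next hours (e.g. 24 values).
    is_sensitive: callable standing in for the trained classifier.
    """
    if not is_sensitive(current_load):
        return 0  # insensitive right now: migrate immediately
    # Otherwise scan the forecast for the nearest insensitive hour.
    for hours_ahead, load in enumerate(forecast, start=1):
        if not is_sensitive(load):
            return hours_ahead
    return None  # no insensitive window within the forecast horizon
```

With a toy classifier such as `lambda load: load > 50`, a VM at load 80 with forecast `[90, 70, 40, 30]` would be scheduled three hours ahead.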

slide-18
SLIDE 18

Classification Algorithm Accuracy: XGBoost

Accuracy ≈ 70%; recall for migration-sensitive VMs: 76%
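As a reminder of how the two reported metrics are defined, the sketch below computes them from confusion-matrix counts, with "migration-sensitive" as the positive class. The function names are illustrative, and any counts passed in are the caller's own; the slides do not give the underlying confusion matrix.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp, fn):
    """Share of truly migration-sensitive VMs that the model flags."""
    return tp / (tp + fn)
```

Recall matters here because a missed migration-sensitive VM is migrated immediately and incurs the longer downtime.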

slide-19
SLIDE 19

Classification Algorithm Performance: XGBoost

GPU: NVIDIA Tesla P100 × 8
CPU: 2-socket Intel Xeon E5-2682 v4 (Broadwell)

Latency: 25 ms → 10 ms (60% drop)
Throughput: 20x speed-up
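Assuming the two values on the slide, 25 ms and 10 ms, are the latencies before and after GPU acceleration, the reported drop follows directly:

$$\frac{25\,\text{ms} - 10\,\text{ms}}{25\,\text{ms}} = 0.6 = 60\%$$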

slide-20
SLIDE 20

Questions?

slide-21
SLIDE 21