iCSI:
A Cloud Garbage VM Collector for Addressing Inactive VM with Machine Learning
In Kee Kim+, Sai Zeng*, Christopher Young*, Jinho Hwang*, and Marty Humphrey+
+University of Virginia *IBM T.J. Watson Research Center
– 10 million comatose servers worldwide.
– 30 billion dollars in data center capital investment.
– 40 percent electrical energy waste.
– Maintenance, software license, cooling cost, ...
– Recent study from Stanford University (2015).
–VMs are cheaper to create, and easier to forget.
–Financial owners may not be the actual users.
–Many zombie VMs keep legacy installations and data for future use.
–Identifying active/inactive VMs with certainty is difficult with conventional tools.
–Inactive VMs can look “active”
–Active VMs can look “inactive”
time.
[Figure: Creating metadata from VMs in the private cloud: Process, Utilization, Login, Network, Others]
Metadata
Description
Process
(e.g., patch update).
Utilization
Login
Network
Others
$$ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$
–Failed to find (global) correlation with active / inactive VMs.
<Analytics>
Features                       Correlation
%CPU of Significant Procs      0.95
%MEM of VMs                    0.95
# of Important Open Ports      0.90
# of Established Conn.         0.97
Etc.

<Development>
Features                       Correlation
%CPU of Imp. Procs > 5%        0.72
%MEM of Imp. Procs > 5%        0.73
# of Logins > 15               0.85
Daytime Login > 24 hrs         0.91
Etc.
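The per-purpose correlations above can be computed with a few lines of code. This is a minimal sketch, assuming each VM contributes one numeric feature value and a 0/1 activity label. The function name and the sample values are illustrative and are not taken from the study.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)

# Hypothetical sample for VMs of one purpose:
# number of established connections and activity label (1 = active)
est_conns = [0, 1, 0, 12, 15, 9]
is_active = [0, 0, 0, 1, 1, 1]
r = pearson(est_conns, is_active)
```

Running the same function once per VM purpose, rather than over all VMs at once, mirrors the observation that no global correlation was found.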
[Figure: iCSI architecture]
I) Data Collector: Agent on each VM (Proc, Login, ...), Data Collection Manager, VM Metadata (Offline)
II) VM Identification: Model Training, Process Knowledge Base, Identification Model (Base case identification, Determining VM purpose, VM Classification, Network Affinity Analysis); result: Active/inactive VMs
III) VM Mgmt. Action: Recommendation Engine, Recommendations to Private Cloud VM Owners, VM Management
– This script must not disrupt production services.
scale data centers.
– Executed every 4 hours.
– Collects only 50 KB of data and sends it to the manager via cURL.
– Deployed via an IBM Data Center Management tool.
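The size budget in the bullets above can be enforced before anything is sent. This is a minimal sketch, assuming the agent serializes each collection round as JSON. The field names, the `build_payload` helper, and the sample record are hypothetical, and the transfer itself (done via cURL in the talk) is left out.

```python
import json

MAX_PAYLOAD_BYTES = 50 * 1024  # the 50 KB budget mentioned above

def build_payload(vm_id, metadata, max_bytes=MAX_PAYLOAD_BYTES):
    """Serialize one collection round and refuse to exceed the size budget."""
    body = json.dumps(
        {"vm_id": vm_id, "metadata": metadata},
        separators=(",", ":"),  # compact encoding keeps the payload small
    ).encode("utf-8")
    if len(body) > max_bytes:
        raise ValueError(f"payload is {len(body)} bytes, budget is {max_bytes}")
    return body

# Hypothetical record covering three of the metadata groups
sample = {
    "process": [{"name": "mysqld", "cpu_pct": 3.2, "mem_pct": 11.0}],
    "login": [{"user": "alice", "duration_hr": 1.5}],
    "network": [{"remote": "host2.domain.com", "state": "ESTABLISHED"}],
}
payload = build_payload("vm-001", sample)
```

Failing loudly on an oversized payload is one possible design choice; an agent could also truncate or sample the records instead.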
VM Identification Model
Proc#1: Base Case VM Identification
Proc#2: Determining the VM Purpose
Proc#3: VM Classification with SVM
Proc#4: Network Affinity Analysis
Inputs: VM Metadata (Offline), Model Training, Process Knowledge Base ((Sig.) Proc. Info)
Output: Active or Inactive VMs
period.
Listen ports and Mgmt ports are not considered.
process#1 host1.domain.com host2.domain.com host3.domain.com
– A VM with MySQL can be used for Storage, Development, Test, …
Determined with user feedback
– An optimal margin-based classifier with a linear kernel.
– Linear SVM tries to find a small number of data points (the support vectors) that define a hyperplane separating the data points of the two classes.
– Use specific correlated features according to the purpose of VMs.
Server Purpose    Correlated Features
Analytics         %CPU, %MEM, #OpenPorts
DevOps            #SigProcs, %CPU_SigProcs, %MEM_SigProcs, #EstConns
Development       #LoginFreq (Daytime), AvgLoginHr, #SSH/VNCs, #UserActivityProcs
. . .
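The idea of training a linear SVM on a purpose-specific feature subset can be sketched in plain Python. This is an illustration only, not the authors' implementation: it uses batch subgradient descent on the regularized hinge loss, assumes features are already scaled to roughly [0, 1], and uses made-up feature keys and training data.

```python
# Feature subsets per VM purpose (two rows of the table; key names are hypothetical)
PURPOSE_FEATURES = {
    "Analytics": ["cpu", "mem", "open_ports"],
    "DevOps": ["sig_procs", "cpu_sig_procs", "mem_sig_procs", "est_conns"],
}

def select_features(vm, purpose):
    """Pick only the features correlated with activity for this purpose."""
    return [vm[name] for name in PURPOSE_FEATURES[purpose]]

def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=500):
    """Fit weights w and bias b by batch subgradient descent on the hinge loss.

    X: list of feature vectors; y: labels, +1 for active and -1 for inactive.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # margin violated, so the hinge term contributes
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 (active) or -1 (inactive) for one feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical, already-normalized training data for "Analytics" VMs
train = [
    ({"cpu": 0.05, "mem": 0.10, "open_ports": 0.0}, -1),
    ({"cpu": 0.10, "mem": 0.05, "open_ports": 0.1}, -1),
    ({"cpu": 0.02, "mem": 0.12, "open_ports": 0.0}, -1),
    ({"cpu": 0.80, "mem": 0.70, "open_ports": 0.6}, 1),
    ({"cpu": 0.90, "mem": 0.60, "open_ports": 0.8}, 1),
    ({"cpu": 0.75, "mem": 0.85, "open_ports": 0.7}, 1),
]
X = [select_features(vm, "Analytics") for vm, _ in train]
y = [label for _, label in train]
w, b = train_linear_svm(X, y)
```

One model per purpose would be trained this way, each seeing only the columns listed for that purpose.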
–The linear SVM classifier correctly classifies Hadoop/Mesos master nodes as “active”, but not their slave nodes.
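One plausible reading of network affinity analysis is label propagation: a VM that the classifier marked inactive is promoted to active when it holds an established connection to an active VM, so slave nodes inherit the status of their master. The slides do not spell out the actual algorithm, so the rule below (including its transitive propagation) and all names are assumptions.

```python
from collections import deque

def propagate_activity(svm_labels, connections):
    """Promote inactive VMs that are connected to an active VM.

    svm_labels: dict mapping VM name -> True (active) / False (inactive).
    connections: iterable of (vm_a, vm_b) pairs with an established connection.
    Returns a new dict; the input is not modified.
    """
    neighbors = {vm: set() for vm in svm_labels}
    for a, b in connections:
        if a in neighbors and b in neighbors:  # ignore unknown endpoints
            neighbors[a].add(b)
            neighbors[b].add(a)

    labels = dict(svm_labels)
    queue = deque(vm for vm, active in labels.items() if active)
    while queue:  # breadth-first walk outward from the active VMs
        vm = queue.popleft()
        for peer in neighbors[vm]:
            if not labels[peer]:
                labels[peer] = True
                queue.append(peer)
    return labels

# Hypothetical cluster: the master was classified active, the slaves were not
svm_labels = {"master": True, "slave1": False, "slave2": False, "idle": False}
connections = [("master", "slave1"), ("master", "slave2")]
final_labels = propagate_activity(svm_labels, connections)
```

Here the two slaves end up active while the unconnected VM stays inactive.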
Recommendation    Trigger Conditions
No Action
0.5) Terminating VM
Suspending VM
Resizing VM
the VM
– 750 VMs on IBM Research Cloud Infrastructure (3 data centers).
– Ground Truth: User feedback
– Active VMs are incorrectly identified as Inactive.
– Pleco (CNSM 2016) and Garbo (SoCC 2015)
[Figure legend: Classified Active as Active; Classified Active as “Inactive”]
          Recall    Precision    F-Measure
Pleco     0.75      0.69         0.72
Garbo     0.70      0.67         0.68
iCSI      0.90      0.81         0.85
Improve with Network Affinity Analysis
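The F-measure column follows from the other two columns as the harmonic mean of precision and recall. A short sketch, using the standard F1 definition and the values reported above:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (recall, precision) pairs as reported in the table
reported = {
    "Pleco": (0.75, 0.69),
    "Garbo": (0.70, 0.67),
    "iCSI": (0.90, 0.81),
}
recomputed = {
    name: round(f_measure(precision, recall), 2)
    for name, (recall, precision) in reported.items()
}
```

Rounded to two decimals, the recomputed values agree with the reported F-measure column.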
[Figure: Normalized cost saving against the baseline (next month's cost): iCSI 23%, Pleco 11%, Garbo 9%]
Average Improvement of VM Utilization: iCSI 46%, Pleco 31%, Garbo 29%
–A lightweight approach: collects only a few kilobytes of data from each VM.
–We have found specific correlated features according to the purpose of VMs on the production clouds.
–The VM identification mechanism is composed of heuristics (rule-based) and machine learning (Linear SVM).
–iCSI has over 90% recall in identifying active/inactive VMs.
–For future work, dealing with privacy regulations will be a critical issue.
Identification Result vs. Truth
TP (truth Active, identified Active): Active VMs are correctly identified as active.
FN (truth Active, identified Inactive): Active VMs are incorrectly identified as inactive.
FP (truth Inactive, identified Active): Inactive VMs are incorrectly identified as active.
TN (truth Inactive, identified Inactive): Inactive VMs are correctly identified as inactive.
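With "active" as the positive class, recall and precision follow directly from these four counts. The sketch below uses the standard definitions; the counts are invented for illustration and are not results from the evaluation.

```python
def recall(tp, fn):
    """Share of truly active VMs that were identified as active."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Share of VMs identified as active that are truly active."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical confusion-matrix counts
tp, fn, fp, tn = 90, 10, 20, 80
r = recall(tp, fn)
p = precision(tp, fp)
```

A false negative is the costly case here, since acting on it would mean shutting down a VM that is still in use, which is why recall is the headline metric.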
–Current version is focused on managing Linux VMs:
legacy applications)
–Need a better approach for determining the purpose of VMs.
–Needs to be verified on larger-scale data centers or real production clouds.
–We could only collect data from U.S.-owned VMs for this work!
Pleco (CNSM 2016)
Desc.: Reference Model (ALDM) + Decision Tree
Target Platform: Private Clouds
Cons: Expensive data collection.

Garbo (SoCC 2015)
Desc.: Graph Theory + “mark and sweep”
Target Platform: Amazon Web Services
Cons: Static connection. Only considers network connectivity.

Janitor Monkey (Netflix 2013)
Desc.: Aging of VM + User Feedback
Target Platform: Amazon Web Services
Cons: Depends on user feedback. Not a fully automated system.