Rapid and Robust Impact Assessment of Software Changes in Large Internet-based Services
Shenglin Zhang, Ying Liu, Dan Pei, Yu Chen, Xianping Qu, Shimin Tao, Zhi Zang
6/22/18 CoNEXT 2015 1
Internet-based Services: Search, Shopping, Social, Portal, Video
Introduce new feature Improve performance Fix bugs
Software Change: Software Upgrade or Configuration Change
Impact of Erroneous Software Upgrades
2012.10, Google: load balancing software; Gmail for 18 minutes
2014.11, Microsoft Azure: to Azure Storage; across services utilizing Azure Storage
Impact of Erroneous Configuration Changes
2014.1, Dropbox: to upgrade the OS; been down for three hours
2014.6, Facebook: configuration of the software systems; minutes
Impact of Erroneous Software Changes
Figure: A real-world example (the normalized number of successful orders)
Manual Software Change Impact Assessment
Select a subset of KPIs that may be impacted
Inspect KPI changes
Decide whether to roll back
KPI (Key Performance Indicator) in Software Change
Definition of KPI Change: Level Shift or Ramp up/down
Design Goal
Select a subset of KPIs that may be impacted
Manual inspection of KPI changes
Decide whether to roll back
Software Change Impact Assessment System
Outline
Challenge 1: Short Detection Delay Requirement Against Robustness
Figure: A real-world example, the number of successful orders (normalized), annotated with a level shift and a spike
Detect KPI changes rapidly and accurately
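The tension between detection delay and robustness can be illustrated with a small sketch. This is not the method of the talk: it is a plain median comparison of the windows before and after a candidate change point, and the function names, window sizes, and threshold are assumptions chosen for illustration. A longer window ignores a one-point spike but needs more post-change samples; a one-sample window reacts immediately but also flags the spike.

```python
from statistics import median

def window_shift(series, change_index, window):
    """Median of the window after change_index minus the median of the window before it."""
    before = series[change_index - window:change_index]
    after = series[change_index:change_index + window]
    return median(after) - median(before)

def looks_like_level_shift(series, change_index, window=5, threshold=0.5):
    """Hypothetical check: flag a change when the median moves by more than `threshold`."""
    return abs(window_shift(series, change_index, window)) > threshold

# Synthetic KPI curves (illustrative values, not data from the talk).
baseline = [1.0] * 10
spike = baseline + [3.0] + [1.0] * 9      # one abnormal point, then back to normal
level_shift = baseline + [3.0] * 10       # the KPI stays at the new level

shift_flagged = looks_like_level_shift(level_shift, 10)
spike_flagged = looks_like_level_shift(spike, 10)
spike_flagged_short_window = looks_like_level_shift(spike, 10, window=1)
```

With the five-sample window only the level shift is flagged; with `window=1` the spike is flagged as well, which is the robustness cost of a very short delay.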
Challenge 2: Large Number of KPIs
100+ Internet-based services
20+ Internet-based services have 100+ million users
10k+ modules
500+ thousand servers
Monitored by
team
10k+ software changes per day
100+ KPIs in a software change
Millions of KPIs should be monitored
Detect KPI changes with low computational cost
Challenge 3: Diverse Types of Data
Seasonal: Page view count
Variable: NIC throughput
Stationary: Memory utilization
Robust to various KPIs
Challenge 4: KPI Changes May Be Caused by Other Factors
Seasonality
Network breakdowns
Malicious attacks
Eliminate KPI changes induced by other factors
Outline
Design Overview
Step 1 – Identify the impact set
Software change in module A → Step 1 → KPIs in the impact set
Identify the Impact Set: Automatically Retrieve the Relevant KPIs
Input from operators
module B, C, D
the software change is deployed.
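A minimal sketch of how an impact set might be assembled, assuming the changed module, the servers it is deployed to, and operator-supplied related modules are all available as plain lookups. The function name `impact_set`, the dictionaries, and the KPI names are hypothetical; the slides do not show the actual data model.

```python
def impact_set(changed_module, deployed_servers, related_modules, kpi_catalog):
    """Collect (entity, KPI) pairs that a software change might affect.

    Entities are the changed module, the servers where the change is deployed,
    and the modules that operators list as related to the changed module.
    """
    entities = [changed_module, *deployed_servers,
                *related_modules.get(changed_module, [])]
    return [(entity, kpi)
            for entity in entities
            for kpi in kpi_catalog.get(entity, [])]

# Hypothetical inputs: module A changes; operators relate it to modules B, C, D.
related = {"A": ["B", "C", "D"]}
catalog = {
    "A": ["error_rate"],
    "server-1": ["cpu_utilization", "memory_utilization"],
    "B": ["page_view_count"],
}
kpis_to_watch = impact_set("A", ["server-1"], related, catalog)
```

Entities without an entry in the catalog (modules C and D here) simply contribute no KPIs.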
Design Overview
Step 2 – Detect behavior changes in KPIs
KPIs in the impact set → Step 2 → KPIs with behavior changes
Challenges at this step: short detection delay requirement against robustness, diverse types of data, large number of KPIs
Improved Singular Spectrum Transform (SST)
Advantages of SST: accurate; short detection delay
Drawbacks of SST: accuracy degrades with noisy baseline; high computational cost
Improve robustness (diverse types of data): utilize more information in the testing space
Reduce computational cost (large number of KPIs): matrix compression; implicit inner product calculation
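For orientation, the sketch below computes a change score in the style of the conventional SST, not the improved variant described in the talk. It assumes rank-1 subspaces on both sides, finds the dominant left singular vector of each Hankel (lagged-window) matrix by power iteration, and scores the change as one minus the squared cosine between the two directions. The window size, number of columns, lag, and all function names are illustrative assumptions.

```python
import math

def hankel_columns(series, end, window, n_cols):
    """Return n_cols lagged windows of length `window`; the newest ends at index `end` (exclusive)."""
    return [series[end - j - window:end - j] for j in range(n_cols)]

def top_left_singular_vector(cols, iterations=100):
    """Power iteration on H H^T, where the columns of H are the lagged windows."""
    window = len(cols[0])
    v = [1.0 / math.sqrt(window)] * window
    for _ in range(iterations):
        coeffs = [sum(c[i] * v[i] for i in range(window)) for c in cols]       # H^T v
        w = [sum(coeffs[j] * cols[j][i] for j in range(len(cols)))             # H (H^T v)
             for i in range(window)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return v
        v = [x / norm for x in w]
    return v

def sst_score(series, t, window=8, n_cols=8, lag=8):
    """Change score at time t: 0 when past and test directions agree, up to 1 when orthogonal."""
    past = hankel_columns(series, t, window, n_cols)          # windows ending at or before t
    test = hankel_columns(series, t + lag, window, n_cols)    # windows reaching `lag` past t
    u = top_left_singular_vector(past)
    b = top_left_singular_vector(test)
    cosine = sum(x * y for x, y in zip(u, b))
    return 1.0 - cosine * cosine

# Synthetic KPI with a level shift at index 40 (illustrative values).
series = [1.0] * 40 + [5.0] * 40
flat_score = sst_score(series, 20)    # both matrices lie in the flat region
shift_score = sst_score(series, 36)   # test windows straddle the level shift
```

The two full decompositions per time step are what make the conventional approach expensive when millions of KPIs have to be scored, which is the cost the matrix compression and implicit inner product items above are aimed at.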
Design Overview
Step 3 – Eliminate KPI changes induced by other factors
KPIs with behavior changes → Step 3 → KPIs with behavior changes induced by software change
Eliminate KPI Changes Induced by Other Factors
Treated group
Control group
DiD (difference-in-differences) method
Figure: KPI of the control group and the treated group, with the software change marked
KPI changes may be caused by other factors
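The difference-in-differences idea can be sketched as follows, assuming each group is summarized by the mean of its KPI samples before and after the software change. The function name and the sample values are illustrative, and how the talk's system selects the two groups or tests significance is not shown here.

```python
from statistics import mean

def did_estimate(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus the change in the control group."""
    treated_delta = mean(treated_after) - mean(treated_before)
    control_delta = mean(control_after) - mean(control_before)
    return treated_delta - control_delta

# Illustrative data: an external factor lifts both groups by 2.0,
# and the software change adds a further 3.0 to the treated group only.
effect = did_estimate(
    treated_before=[10.0, 10.0, 10.0], treated_after=[15.0, 15.0, 15.0],
    control_before=[10.0, 10.0, 10.0], control_after=[12.0, 12.0, 12.0],
)

# When both groups move together, the estimate is zero: no change is
# attributed to the software change.
no_effect = did_estimate(
    treated_before=[10.0, 10.0], treated_after=[12.0, 12.0],
    control_before=[10.0, 10.0], control_after=[12.0, 12.0],
)
```

Subtracting the control group's change is what removes shifts shared by both groups, such as seasonality.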
Design Overview
Step 1 – Identify the impact set
Step 2 – Detect behavior changes in KPIs
Step 3 – Eliminate KPI changes induced by other factors
Techniques: improved SST, split testing
Flow: Software change in module A → Step 1 → KPIs in the impact set → Step 2 → KPIs with behavior changes → Step 3 → KPIs with behavior changes induced by software change → Operators
Outline
Datasets of Evaluation
144 software changes of Baidu: 72 introduced KPI changes, 72 introduced no KPI changes
Large amount of labelling work: 9982 (software change, server/module/process, KPI) tuples, manually labelled by operators
Diverse KPIs: seasonal, variable, stationary
Comparison baselines: CUSUM (SIGCOMM 10), Multiscale Robust Local Subspace (MRLS) (CoNEXT 11)
Comparison of Accuracy
Figure: accuracy (0% to 100%) of FUNNEL, Improved SST, CUSUM, and MRLS on stationary, seasonal, and variable KPIs
Comparison of Computational Cost
Method               FUNNEL  CUSUM  MRLS
Number of cores for  7       31     47526
Comparison of Detection Delay
Detection delay: the interval from the time when the change starts to the time when the change is detected
Case Study: An Erroneous Software Upgrade in Advertising
Customer complaints Inspecting KPIs Troubleshooting
Outline
Conclusion
Challenges of automatic software change impact assessment
FUNNEL
Evaluation
Thank you!
zhangsl12@mails.tsinghua.edu.cn
Q&A
Why 144 Software Changes
Why Using Cores
system needs
Why just a single team
Unbalanced hotspot
determination robust even in the face of hotspots.
The parameters of FUNNEL, CUSUM and MRLS
About the detection delay comparison
delay than FUNNEL at times
Why not Just Split Testing?
Obtain the Relationship of Modules
hierarchy
Why not decide to roll out/back by FUNNEL?
change
changes is small
About the Deployment
based services
Number of software changes: 24119
Number of changes that have impact: 268
Number of KPIs: 2256390
Number of KPI changes: 10249
Precision: 98.21%
If A Software Change is Deployed to All Servers …
software change
period but on historical days
About the Number of Software Changes
firstly, and then on another subset of servers