
slide-1
SLIDE 1

Rapid and Robust Impact Assessment of Software Changes in Large Internet-based Services

Shenglin Zhang, Ying Liu, Dan Pei Yu Chen, Xianping Qu, Shimin Tao, Zhi Zang

6/22/18 CoNEXT 2015 1

slide-2
SLIDE 2

Internet-based Services lSearch lShopping lSocial lPortal lVideo

6/22/18 CoNEXT 2015 2

slide-3
SLIDE 3
Software Change: Software Upgrade or Configuration Change

  • Software upgrade
  • Introduces new features, improves performance, fixes bugs

slide-4
SLIDE 4

Software Change: Software Upgrade or Configuration Change

  • Software upgrade
  • Introduces new features, improves performance, fixes bugs
  • Configuration change
  • e.g., traffic switching for load balancing

slide-5
SLIDE 5
Software Change: Software Upgrade or Configuration Change

  • Software upgrade
  • Introduces new features, improves performance, fixes bugs
  • Configuration change
  • e.g., traffic switching for load balancing
  • Occurs frequently: 10K+ per day at Baidu

slide-6
SLIDE 6

Impact of Erroneous Software Upgrades

2012.10, Google
  • An update to Google's load balancing software
  • Poor Gmail performance for 18 minutes

slide-7
SLIDE 7

Impact of Erroneous Software Upgrades

2012.10, Google
  • An update to Google's load balancing software
  • Poor Gmail performance for 18 minutes

2014.11, Microsoft Azure
  • A performance update to Azure Storage
  • Reduced capacity across services using Azure Storage

slide-8
SLIDE 8

Impact of Erroneous Configuration Changes

2014.1, Dropbox
  • Planned maintenance to upgrade the OS on some machines
  • The Dropbox service was down for three hours

slide-9
SLIDE 9

Impact of Erroneous Configuration Changes

2014.1, Dropbox
  • Planned maintenance to upgrade the OS on some machines
  • The Dropbox service was down for three hours

2014.6, Facebook
  • An update to the configuration of the software systems
  • Facebook was down for 31 minutes

slide-10
SLIDE 10

Impact of Erroneous Software Changes

  • Poor user experience


slide-11
SLIDE 11

Impact of Erroneous Software Changes

  • Poor user experience
  • A drop in revenue

A real-world example: the normalized number of successful orders

slide-12
SLIDE 12

Manual Software Change Impact Assessment

Select a subset of KPIs that may be impacted


slide-13
SLIDE 13

Manual Software Change Impact Assessment

Select a subset of KPIs that may be impacted → Inspect KPI changes

slide-14
SLIDE 14

Manual Software Change Impact Assessment

Select a subset of KPIs that may be impacted → Inspect KPI changes → Decide whether to roll back

slide-15
SLIDE 15

KPI (Key Performance Indicator) in Software Change

  • KPIs of servers
  • CPU utilization
  • Memory utilization
  • NIC throughput


slide-16
SLIDE 16

KPI (Key Performance Indicator) in Software Change

  • KPIs of servers
  • CPU utilization
  • Memory utilization
  • NIC throughput
  • KPIs of modules/processes
  • Web page view count
  • Web page view delay


slide-17
SLIDE 17

KPI (Key Performance Indicator) in Software Change

  • KPIs of servers
  • CPU utilization
  • Memory utilization
  • NIC throughput
  • KPIs of modules/processes
  • Web page view count
  • Web page view delay
  • Up to hundreds of KPIs for a single software change


slide-18
SLIDE 18

Definition of KPI Change: Level Shift or Ramp up/down

  • KPI change
  • Indicative of performance increase/degradation
  • Hard to simulate in testbeds
  • Not reproducible


slide-19
SLIDE 19

Manual Software Change Impact Assessment

Select a subset of KPIs that may be impacted → Inspect KPI changes → Decide whether to roll back

  • Labor-intensive
  • Prone to error
  • Not scalable
slide-20
SLIDE 20

Design Goal

Select a subset of KPIs that may be impacted → Manual inspection of KPI changes → Decide whether to roll back

  • Automatic
  • Scalable
  • Robust to various software changes and KPIs

Software Change Impact Assessment System

slide-21
SLIDE 21

Outline

  • Background and Motivation
  • Challenges
  • Key Ideas
  • Results
  • Conclusion


slide-22
SLIDE 22

Challenge 1: Short Detection Delay Requirement Against Robustness

  • Poor user experience
  • A drop in revenue


A real-world example: the number of successful orders (normalized)

slide-23
SLIDE 23

Challenge 1: Short Detection Delay Requirement Against Robustness

  • Poor user experience
  • A drop in revenue


A real-world example: the number of successful orders (normalized), with a level shift and a spike

slide-24
SLIDE 24

Challenge 1: Short Detection Delay Requirement Against Robustness

  • Poor user experience
  • A drop in revenue


A real-world example: the number of successful orders (normalized)

Detect KPI changes rapidly and accurately

slide-25
SLIDE 25

Challenge 2: Large Number of KPIs


slide-26
SLIDE 26

Challenge 2: Large Number of KPIs


  • 100+ Internet-based services
  • 20+ Internet-based services have 100+ million users each
  • 10k+ modules
  • 500+ thousand servers

slide-27
SLIDE 27

Challenge 2: Large Number of KPIs

Monitored by one operations team

slide-28
SLIDE 28

Challenge 2: Large Number of KPIs

Monitored by one operations team
10k+ software changes per day

slide-29
SLIDE 29

Challenge 2: Large Number of KPIs

Monitored by one operations team
10k+ software changes per day
100+ KPIs in a software change

slide-30
SLIDE 30

Challenge 2: Large Number of KPIs

Millions of KPIs should be monitored
Monitored by one operations team
10k+ software changes per day
100+ KPIs in a software change

slide-31
SLIDE 31

Challenge 2: Large Number of KPIs

Millions of KPIs should be monitored
Monitored by one operations team
10k+ software changes per day
100+ KPIs in a software change

Detect KPI changes with low computational cost

slide-32
SLIDE 32

Challenge 3: Diverse Types of Data

  • Diverse types of KPI data


  • Seasonal: page view count
  • Variable: NIC throughput
  • Stationary: memory utilization

slide-33
SLIDE 33

Challenge 3: Diverse Types of Data

  • Diverse types of KPI data


  • Seasonal: page view count
  • Variable: NIC throughput
  • Stationary: memory utilization

Robust to various KPIs

slide-34
SLIDE 34

Challenge 4: KPI Changes May Be Caused by Other Factors


  • Seasonality
  • Network breakdowns
  • Malicious attacks

slide-35
SLIDE 35

Challenge 4: KPI Changes May Be Caused by Other Factors


  • Seasonality
  • Network breakdowns
  • Malicious attacks

Eliminate KPI changes induced by other factors

slide-36
SLIDE 36

Outline

  • Background and Motivation
  • Challenges
  • Key Ideas
  • Results
  • Conclusion


slide-37
SLIDE 37

Design Overview


Step 1 – Identify the impact set

Step 1

Software change in module A

slide-38
SLIDE 38

Design Overview


Step 1 – Identify the impact set

Step 1

KPIs in the impact set

Software change in module A

slide-39
SLIDE 39

Identify the Impact Set: Automatically Retrieve the Relevant KPIs


slide-40
SLIDE 40

Identify the Impact Set: Automatically Retrieve the Relevant KPIs


Input from operators

  • Modules related to module A: modules B, C, D
  • Servers/processes where the software change is deployed
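Step 1 is essentially a lookup over operator-supplied metadata. A minimal Python sketch of the idea; the dictionaries, KPI names, and server names below are hypothetical placeholders, not FUNNEL's actual data model:

```python
# Hypothetical metadata, as supplied by operators (illustrative only).
RELATED_MODULES = {"A": ["B", "C", "D"]}       # modules related to module A
DEPLOYED_SERVERS = {"A": ["srv-1", "srv-2"]}   # where the change is deployed
MODULE_KPIS = {"A": ["page_view_count"], "B": ["page_view_delay"],
               "C": ["query_count"], "D": ["cache_hit_ratio"]}
SERVER_KPIS = ["cpu_util", "mem_util", "nic_throughput"]

def impact_set(changed_module):
    """Collect every (entity, KPI) pair the software change may affect."""
    pairs = []
    # KPIs of the changed module and of the modules related to it.
    for mod in [changed_module] + RELATED_MODULES.get(changed_module, []):
        for kpi in MODULE_KPIS.get(mod, []):
            pairs.append((mod, kpi))
    # Server-level KPIs on every server where the change is deployed.
    for srv in DEPLOYED_SERVERS.get(changed_module, []):
        for kpi in SERVER_KPIS:
            pairs.append((srv, kpi))
    return pairs
```

For a change in module A, this yields the module-level KPIs of A, B, C, and D plus the three server KPIs on each of the two deployment servers.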

slide-41
SLIDE 41

Design Overview


Step 1 – Identify the impact set Step 2 – Detect behavior changes in KPIs

Step 1 Step 2

KPIs in the impact set

Software change in module A

slide-42
SLIDE 42

Design Overview


Step 1 – Identify the impact set Step 2 – Detect behavior changes in KPIs

Step 1 Step 2

KPIs with behavior changes KPIs in the impact set

Software change in module A

slide-43
SLIDE 43

Design Overview


Step 1 – Identify the impact set Step 2 – Detect behavior changes in KPIs

Step 1 Step 2

KPIs with behavior changes KPIs in the impact set

Challenges addressed in Step 2: short detection delay requirement against robustness, diverse types of data, large number of KPIs
Software change in module A

slide-44
SLIDE 44

Improved Singular Spectrum Transform (SST)

  • Improved singular spectrum transform (SST)


Advantages: accurate, short detection delay (addresses the short detection delay requirement against robustness)

slide-45
SLIDE 45

Improved Singular Spectrum Transform (SST)

  • Improved singular spectrum transform (SST)


Advantages: accurate, short detection delay
Drawbacks: accuracy degrades with a noisy baseline, high computational cost

  • T. Idé and K. Tsuda, SDM 2007
slide-46
SLIDE 46

Improved Singular Spectrum Transform (SST)

  • Improved singular spectrum transform (SST)


Advantages: accurate, short detection delay
Drawbacks: accuracy degrades with a noisy baseline, high computational cost
Improvement: utilize more information in the testing space to improve robustness (handles diverse types of data)
slide-47
SLIDE 47

Improved Singular Spectrum Transform (SST)

  • Improved singular spectrum transform (SST)


Advantages: accurate, short detection delay
Drawbacks: accuracy degrades with a noisy baseline, high computational cost
Improvements: utilize more information in the testing space to improve robustness; matrix compression and implicit inner product calculation to reduce computational cost (handles a large number of KPIs)
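For intuition, here is a minimal NumPy sketch of vanilla SST (the Idé and Tsuda formulation the paper starts from), not the improved variant with matrix compression and implicit inner products; the window length w, number of subseries m, subspace rank r, and the squared-projection scoring convention are all illustrative choices:

```python
import numpy as np

def hankel(x, start, w, m):
    """Matrix whose m columns are overlapping length-w subseries of x."""
    return np.column_stack([x[start + i : start + i + w] for i in range(m)])

def sst_score(x, t, w=10, m=5, r=2):
    """Vanilla SST change score at time t: near 0 = no change, up to 1."""
    past = hankel(x, t - w - m + 1, w, m)   # subseries ending by time t
    test = hankel(x, t + 1, w, m)           # subseries starting after t
    # top-r left singular vectors span the "normal behavior" subspace
    U = np.linalg.svd(past, full_matrices=False)[0][:, :r]
    # dominant pattern of the test window
    mu = np.linalg.svd(test, full_matrices=False)[0][:, 0]
    # score: how much of the new dominant pattern lies outside the
    # past-behavior subspace
    return 1.0 - float(np.linalg.norm(U.T @ mu) ** 2)
```

On a synthetic series whose frequency jumps at t = 60, the score stays near zero in the stationary region and rises sharply at the change point.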
slide-48
SLIDE 48

Design Overview


Step 1 – Identify the impact set Step 2 – Detect behavior changes in KPIs Step 3 – Eliminate KPI changes induced by other factors

Step 1 Step 2

KPIs in the impact set KPIs with behavior changes

Software change in module A

slide-49
SLIDE 49

Design Overview


Step 1 – Identify the impact set Step 2 – Detect behavior changes in KPIs Step 3 – Eliminate KPI changes induced by other factors

Step 1 Step 2 Step 3

KPIs in the impact set KPIs with behavior changes KPIs with behavior changes induced by software change

Software change in module A

slide-50
SLIDE 50

Design Overview


Step 1 – Identify the impact set Step 2 – Detect behavior changes in KPIs Step 3 – Eliminate KPI changes induced by other factors

Step 1 Step 2 Step 3

KPIs in the impact set KPIs with behavior changes KPIs with behavior changes induced by software change

KPI changes may be caused by other factors

Software change in module A

slide-51
SLIDE 51

Eliminate KPI Changes Induced by Other Factors


slide-52
SLIDE 52

Eliminate KPI Changes Induced by Other Factors

  • Split testing
  • Evaluation of interventions instituted at a specific time
  • Control group & treated group


slide-53
SLIDE 53

Eliminate KPI Changes Induced by Other Factors

  • Split testing
  • Evaluation of interventions instituted at a specific time
  • Control group & treated group


Software change

slide-54
SLIDE 54

Eliminate KPI Changes Induced by Other Factors

  • Servers/processes in the impact set

Treated group


treated group

slide-55
SLIDE 55
  • Servers/processes in the same module
  • Without software change

Eliminate KPI Changes Induced by Other Factors

  • Servers/processes in the impact set

Treated group Control group


control group treated group

slide-56
SLIDE 56
  • Servers/processes in the same module
  • Without software change

Eliminate KPI Changes Induced by Other Factors

  • Servers/processes in the impact set

Treated group Control group DiD method


control group treated group

slide-57
SLIDE 57
  • Servers/processes in the same module
  • Without software change

Eliminate KPI Changes Induced by Other Factors

  • Servers/processes in the impact set

Treated group Control group DiD method


control group, treated group; KPI changes may be caused by other factors
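The difference-in-differences (DiD) comparison can be sketched in a few lines; the plain-mean aggregation below is an illustrative simplification, not the paper's exact estimator:

```python
def did_effect(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences: the KPI shift attributable to the change.

    Any shift shared by both groups (seasonality, network breakdowns,
    malicious attacks) appears in the control group too and cancels out,
    because the control group runs the same module without the change.
    """
    mean = lambda xs: sum(xs) / len(xs)
    treated_shift = mean(treated_after) - mean(treated_before)
    control_shift = mean(control_after) - mean(control_before)
    return treated_shift - control_shift
```

If the treated servers drop from 10 to 6 while the control servers drop from 10 to 9 (a shared seasonal dip), the net effect attributed to the software change is -3, not -4.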

slide-58
SLIDE 58

Design Overview


Step 1 – Identify the impact set Step 2 – Detect behavior changes in KPIs Step 3 – Eliminate KPI changes induced by other factors

Step 1 Step 2 Step 3

KPIs in the impact set KPIs with behavior changes KPIs with behavior changes induced by software change

improved SST split testing Software change in module A

slide-59
SLIDE 59

Design Overview


Step 1 – Identify the impact set Step 2 – Detect behavior changes in KPIs Step 3 – Eliminate KPI changes induced by other factors

Step 1 Step 2 Step 3 Operators

KPIs in the impact set KPIs with behavior changes KPIs with behavior changes induced by software change

Software change in module A

slide-60
SLIDE 60

KPIs with behavior changes induced by software change KPIs in the impact set KPIs with behavior changes

Design Overview


Step 1 – Identify the impact set Step 2 – Detect behavior changes in KPIs Step 3 – Eliminate KPI changes induced by other factors

Step 1 Step 2 Step 3 Operators Software change in module A

slide-61
SLIDE 61

KPIs with behavior changes induced by software change KPIs with behavior changes KPIs in the impact set

Design Overview


Step 1 – Identify the impact set Step 2 – Detect behavior changes in KPIs Step 3 – Eliminate KPI changes induced by other factors

Step 1 Step 2 Step 3

FUNNEL

Operators Software change in module A

slide-62
SLIDE 62

Outline

  • Background and Motivation
  • Challenges
  • Key Ideas
  • Results
  • Conclusion


slide-63
SLIDE 63

Datasets of Evaluation


  • 144 software changes at Baidu: 72 introduced KPI changes, 72 introduced no KPI changes

slide-64
SLIDE 64

Datasets of Evaluation


  • 144 software changes at Baidu: 72 introduced KPI changes, 72 introduced no KPI changes
  • 9982 (software change, server/module/process, KPI) tuples, manually labelled by operators (a large amount of labelling work)

slide-65
SLIDE 65

Datasets of Evaluation


  • 144 software changes at Baidu: 72 introduced KPI changes, 72 introduced no KPI changes
  • 9982 (software change, server/module/process, KPI) tuples, manually labelled by operators (a large amount of labelling work)
  • Diverse KPIs: seasonal, variable, stationary

slide-66
SLIDE 66

Datasets of Evaluation


  • 144 software changes at Baidu: 72 introduced KPI changes, 72 introduced no KPI changes
  • 9982 (software change, server/module/process, KPI) tuples, manually labelled by operators (a large amount of labelling work)
  • Diverse KPIs: seasonal, variable, stationary
  • Comparison baselines: CUSUM (SIGCOMM 2010), Multiscale Robust Local Subspace (CoNEXT 2011)
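As a reference point for the CUSUM baseline, a minimal two-sided CUSUM detector looks like this; the drift k, threshold h, and known target mean are illustrative simplifications of the textbook scheme, not the SIGCOMM 2010 variant used in the evaluation:

```python
def cusum_detect(x, target, k=0.5, h=4.0):
    """Return the index where a mean shift away from `target` is first
    flagged, or None if no change is detected."""
    s_hi = s_lo = 0.0
    for i, v in enumerate(x):
        s_hi = max(0.0, s_hi + (v - target) - k)  # evidence of an upward shift
        s_lo = max(0.0, s_lo + (target - v) - k)  # evidence of a downward shift
        if s_hi > h or s_lo > h:
            return i
    return None
```

On a series that jumps from 0 to 2 at index 20, the cumulative sum needs three post-shift samples (1.5, 3.0, 4.5) to cross h = 4, so the change is flagged at index 22.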

slide-67
SLIDE 67

Comparison of Accuracy


[Bar chart: accuracy (0–100%) of FUNNEL, improved SST, CUSUM, and MRLS on stationary KPIs]

slide-68
SLIDE 68

Comparison of Accuracy


[Bar chart: accuracy (0–100%) of FUNNEL, improved SST, CUSUM, and MRLS on stationary and seasonal KPIs]

slide-69
SLIDE 69

Comparison of Accuracy


[Bar chart: accuracy (0–100%) of FUNNEL, improved SST, CUSUM, and MRLS on stationary, seasonal, and variable KPIs]

slide-70
SLIDE 70

Comparison of Computational Cost

  • Real-world scenario
  • At least 1 million KPIs need to be monitored
  • The detection interval for each KPI is 1 minute
  • Runs on the same kind of CPU as in testing


slide-71
SLIDE 71

Comparison of Computational Cost

  • Real-world scenario
  • At least 1 million KPIs need to be monitored
  • Each KPI is detected every 1 minute
  • Runs on the same kind of CPU as in testing
  • Comparison results


Number of cores needed for one million KPIs: FUNNEL 7, CUSUM 31, MRLS 47526
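The core counts follow from a simple capacity calculation: (number of KPIs × CPU time per detection) / detection interval, rounded up. The per-detection cost below is a hypothetical figure chosen only to illustrate the arithmetic, not a measurement from the paper:

```python
import math

def cores_needed(num_kpis, us_per_detection, interval_seconds=60):
    """Cores required to score every KPI once per detection interval,
    assuming 100% CPU utilization (as in the paper's testing setup).

    us_per_detection: CPU microseconds per single-KPI detection
    (hypothetical, for illustration).
    """
    cpu_seconds_per_interval = num_kpis * us_per_detection / 1_000_000
    return math.ceil(cpu_seconds_per_interval / interval_seconds)
```

For example, a hypothetical 400 µs per detection over one million KPIs at a 1-minute interval is 400 CPU-seconds per minute, i.e. ceil(400 / 60) = 7 cores.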

slide-72
SLIDE 72

Comparison of Detection Delay

  • Detection delay = time when a KPI change is detected − time when the KPI change starts


[Timeline: detection delay spans from when the change starts to when it is detected]

slide-73
SLIDE 73

Comparison of Detection Delay

  • Comparison results


slide-74
SLIDE 74

Comparison of Detection Delay

  • Comparison results


slide-75
SLIDE 75

Case Study: An Erroneous Software Upgrade in Advertising

  • Methodology
  • For a fraction of software changes, FUNNEL's results were not delivered to the operators
  • The operators assessed the software changes independently


slide-76
SLIDE 76

Case Study: An Erroneous Software Upgrade in Advertising

  • Methodology
  • For a fraction of software changes, FUNNEL's results were not delivered to the operators
  • The operators assessed the software changes independently
  • FUNNEL: detected the impact within 10 minutes, via seasonal KPIs


slide-77
SLIDE 77

Case Study: An Erroneous Software Upgrade in Advertising

  • Methodology
  • For a fraction of software changes, FUNNEL's results were not delivered to the operators
  • The operators assessed the software changes independently
  • FUNNEL: detected the impact within 10 minutes, via seasonal KPIs
  • The operators: took 1.5 hours


Customer complaints → Inspecting KPIs → Troubleshooting

slide-78
SLIDE 78

Outline

  • Background and Motivation
  • Challenges
  • Key Ideas
  • Results
  • Conclusion


slide-79
SLIDE 79

Conclusion

Challenges of automatic software change impact assessment
  • Short detection delay requirement against robustness
  • Large number of KPIs
  • Diverse types of data
  • KPI changes may be caused by other factors

FUNNEL
  • Improved SST (the paper's main algorithmic contribution)
  • Split testing

Evaluation
  • Real-world software changes


slide-80
SLIDE 80

Thank you!

zhangsl12@mails.tsinghua.edu.cn


slide-81
SLIDE 81

Q&A


slide-82
SLIDE 82

Why 144 Software Changes

  • Evaluation needs ground truth
  • FUNNEL
  • detects KPI changes
  • determines whether KPI changes are induced by the software change
  • Operators
  • label whether there is a behavior change in each KPI
  • label whether a KPI change is caused by the software change
  • 9982 (software change, server/module/process, KPI) tuples
  • A huge amount of work; labelling many more software changes is prohibitive


slide-83
SLIDE 83

Why Report the Number of Cores

  • The CPU utilization is 100% in testing
  • Assume the CPU utilization is also 100% in deployment
  • The operators care about how many servers/cores the system needs


slide-84
SLIDE 84

Why just a single team

  • For efficiency: build a single database to monitor all KPIs
  • It is the natural arrangement


slide-85
SLIDE 85

Unbalanced hotspot

  • Split testing
  • The number of hotspots is very small (3% in Microsoft)
  • Compare the treated group and the control group
  • The large number of KPIs in the control group makes the determination robust even in the face of hotspots


slide-86
SLIDE 86

The parameters of FUNNEL, CUSUM and MRLS

  • Two parameters
  • α in DiD method
  • ω in Improved SST
  • Parameters set for best accuracy
  • Operators care most about accuracy
  • Fair to all four methods


slide-87
SLIDE 87

About the detection delay comparison

  • Set a threshold for FUNNEL
  • MRLS can sometimes detect behavior changes with a smaller detection delay than FUNNEL
  • But only by sacrificing accuracy


slide-88
SLIDE 88

Why not Just Split Testing?

  • Set threshold small
  • Sensitive to spikes
  • Many false positives
  • Set threshold large
  • The detection delay is large
  • Almost impossible to find a balance in our scenario
  • The improved SST
  • Robust
  • Short detection delay


slide-89
SLIDE 89

Obtain the Relationship of Modules

  • The operators name the modules based on the module hierarchy

  • The operators know the relationship of modules


slide-90
SLIDE 90

Why not decide to roll out/back by FUNNEL?

  • The KPI changes & the decision
  • Hard to learn: few cases exist for a specific combination of KPI change and software change
  • Rolling back a software change is a major decision
  • The operators would like to decide themselves
  • FUNNEL helps the operators make the decision
  • The number of KPIs with behavior changes induced by software changes is small, so the operators' workload is small


slide-91
SLIDE 91

About the Deployment

  • Assess the software changes of a few dozen Internet-based services


Number of software changes: 24119
Number of changes that have impact: 268
Number of KPIs: 2256390
Number of KPI changes: 10249
Precision: 98.21%

slide-92
SLIDE 92

If A Software Change is Deployed to All Servers …

  • Treated group: measurements of KPIs in the impact set around the software change
  • Control group: measurements of KPIs in the impact set in the same period but on historical days


slide-93
SLIDE 93

About the Number of Software Changes

  • If a software change is deployed first on one subset of servers and then on another subset
  • From the operators' perspective, these are two software changes
