Barricade: Defending Systems Against Operator Mistakes
Ricardo Bianchini
Computer Science
In collaboration with Fábio Oliveira, Andrew Tjang, Rich Martin, Thu Nguyen
Motivation
Computer systems pervade our lives
Work and leisure
Enterprise systems: email, storage, database, etc.
Internet services: e-commerce, social networks, etc.
Cloud computing
Systems need to be highly available and behave correctly
Misbehavior and downtime can be costly
Operator mistakes are a common source of such problems
“Google Glitch Briefly Disrupts World’s Search”
By LIZ ROBBINS – The New York Times News Blog Jan 31, 2009
Google blacklisted all sites on the Web for 1 hour
Results warned “This site may harm your computer”
Sites would not load
Cause: Operator mistake
Operator added “/” to blacklist file
Estimated cost: $2-3 million
Operator Mistakes Are Very Common
Avg of 3 Internet sites [Patterson’02]
Examples of mistakes: misconfiguration, improper
testing of changes, improper deployment, dissemination
Fixing mistakes can be time-consuming
[Chart: operator 51%, hardware 15%, software 34%, overload 0%]
Mistakes are responsible (at least partly) for 79% of DB administration problems [Oliveira’06]
Our Idea
What do we need? Mistake-aware management of systems
Focus: multi-node systems with replicated components
Goals
Proactively defend the system against potential mistakes
Confine mistakes to isolated, off-line subset of nodes
Require less-experienced operators, lowering labor costs
Contributions
A framework for mistake-aware management and a
prototype system (Barricade)
Two case studies
Prototype 3-tier online auction service
System that mimics our dept’s computing infrastructure (email, DNS, authentication)
64 live experiments with 20 volunteers, showing
that mistake-aware management is effective
Barricade contained 75 out of 82 observed mistakes
Outline
Motivation Mistake-Aware Systems Management
Framework overview Example
Prototype Implementation: Barricade Evaluation Conclusions & Future Work
Mistake-Aware Management
Our approach to mistake-aware management is:
Monitor the operator’s actions
Predict the expected cost of a mistake
If cost is high, take nodes off-line and block actions
Enforce testing of actions
Lift blocks when tests confirm correctness
Blocking mechanisms: command and file blocking
Operators = scripts
Key questions: What should be blocked and when?
Can we make it all unintrusive?
Framework for MAM
For each task i, the expected cost of a mistake (ECM) is:
$\text{ECM}(\text{task}_i) = P(\text{task}_i) \cdot P(\text{mistake} \mid \text{task}_i) \cdot \text{CM}(\text{task}_i)$
Blocking actions for task i: $\text{ECM}(\text{task}_i) > \text{threshold}$
Lifting actions for task i: $\text{ECM}(\text{task}_i) \le \text{threshold}$
Modules: Task Prediction Module, Mistake Prediction Module, Cost Module, Blocking Module, Testing Module, Diagnosis Module
Monitors and Actuators
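The ECM rule above can be sketched in a few lines of Python. This is a minimal illustration, not Barricade's code: the `TaskEstimate` class, the `decide` helper, and all numeric values are hypothetical, and treating "ECM at or below the threshold" as the lifting condition is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TaskEstimate:
    """Per-task inputs to the ECM formula (hypothetical container)."""
    p_task: float     # P(task_i), from task prediction
    p_mistake: float  # P(mistake | task_i), from mistake prediction
    cost: float       # CM(task_i), average cost of a mistake in task i


def expected_cost_of_mistake(est: TaskEstimate) -> float:
    """ECM(task_i) = P(task_i) * P(mistake | task_i) * CM(task_i)."""
    return est.p_task * est.p_mistake * est.cost


def decide(est: TaskEstimate, threshold: float) -> str:
    """Return "block" when ECM exceeds the threshold, otherwise "lift".

    Assumption: lifting applies whenever ECM is at or below the threshold.
    """
    return "block" if expected_cost_of_mistake(est) > threshold else "lift"


if __name__ == "__main__":
    # Made-up numbers: a likely task with a moderately likely, costly mistake.
    add_app = TaskEstimate(p_task=0.5, p_mistake=0.5, cost=100.0)
    print(expected_cost_of_mistake(add_app), decide(add_app, threshold=20.0))
```

Because the three factors are multiplied, a task that is unlikely to be in progress contributes little expected cost even if its mistakes are expensive.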
Overview of a MAM System
[Figure: a management server holding the Task Prediction Model, Mistake Prediction Model, Cost Model, Diagnosis Model, Test Procedures, and Blocking Actions, connected to several managed servers, each with Monitors and an Actuator]
System engineer instantiates and configures the MAM system
Operators interact with modified shell at managed/target servers (site of operation)
Monitors report shell commands, changes to persistent state, and info for testing
Actuators apply blocking/lifting actions
Super-operator can bypass the blocks
Example: 3-tiered Service
[Figure: 3-tier service with Web Servers 1 to m, Application Servers 1 to n plus the new Application Server n+1, and a Database Server; later slides highlight the site of operation, the barricade, and the dissemination target]
Operator task: add an application server
Actions: install, config, start appl server; reconfig Web servers
Site of operation: appl server and 1 Web server at a time
Example: 3-tiered Service
P(add_app_server) increases
Example: 3-tiered Service
Expected cost of mistake for “add app server” > threshold
Erect barricade for “add app server”: containment phase
Block commands: startup and shutdown of server software
Block files: configuration files of server software
Example: 3-tiered Service
In background, run test procedures on site of operation:
Check running processes; verify consistency of config files; etc.
Example: 3-tiered Service
Operator starts working on “Web server m”
Extend the barricade and the site of operation; take the Web server off-line
Keep running tests until behavior on site of operation is “correct”
Example: 3-tiered Service
Operator is done working on site of operation AND all tests succeed
Containment phase is over; dissemination phase begins
Establish dissemination order and adjust barricade
Example: 3-tiered Service
Next dissemination target is Web server 2
In background, run test procedures on dissemination target
Example: 3-tiered Service
Operator is done working on dissemination target AND tests succeed
Allow operator to proceed to next dissemination target
Adjust barricade
Example: 3-tiered Service
Web server 1 is the next dissemination target
Example: 3-tiered Service
Operator completes last dissemination step AND tests succeed
Task is over
Destroy barricade
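The containment and dissemination phases walked through in this example can be modeled as a small state machine. The sketch below is illustrative only: the `Barricade` class, its field names, and the node names are hypothetical, and it reduces "adjust barricade" to opening one dissemination target at a time.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Barricade:
    """Toy model of one task's barricade (names are illustrative only)."""
    site_of_operation: Set[str]
    dissemination_order: List[str]
    phase: str = "containment"
    blocked_nodes: Set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # During containment, every node outside the site of operation
        # stays behind the barricade.
        self.blocked_nodes = set(self.dissemination_order)

    def step(self, operator_done: bool, tests_pass: bool) -> str:
        """Advance only when the operator is done AND the tests succeed."""
        if self.phase == "done" or not (operator_done and tests_pass):
            return self.phase
        if self.dissemination_order:
            # Open the barricade for the next dissemination target only.
            target = self.dissemination_order.pop(0)
            self.blocked_nodes.discard(target)
            self.phase = f"dissemination:{target}"
        else:
            # Last dissemination step finished: destroy the barricade.
            self.blocked_nodes.clear()
            self.phase = "done"
        return self.phase


if __name__ == "__main__":
    b = Barricade({"app_server_n+1", "web_server_m"},
                  ["web_server_2", "web_server_1"])
    print(b.step(operator_done=True, tests_pass=False))  # stays in containment
    print(b.step(operator_done=True, tests_pass=True))   # first target opens
```

The key property is that failing tests never move the barricade: the operator can keep working on the current site, but nothing new is exposed.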
Outline
Motivation Mistake-Aware Systems Management Prototype Implementation: Barricade Evaluation Conclusions & Future Work
Barricade: Task Prediction Model
Recursive Bayesian Estimation
Each observed action triggers an update of the probabilities
System eng identifies relevant tasks and defines signatures
Creates “other” task to account for unknown actions
$P(\text{task}_i) = \dfrac{P(\text{action} \mid \text{task}_i)\, P(\text{task}_i)}{\sum_{j=1}^{n} P(\text{action} \mid \text{task}_j)\, P(\text{task}_j)}$
Barricade: Mistake Prediction Model
Prior probabilities with feedback from tests
System eng defines weights and TF
Priors instantiated from experience, updated over time
$P(\text{mistake} \mid \text{task}_i) = \text{prior}_i \cdot \left(1 - TF \cdot \dfrac{\sum_j w_j t_j}{\sum_j w_j}\right)$
Where:
$t_j$ = result (1 or 0) of the jth test procedure
$w_j$ = weight assigned to the jth test procedure
$TF$ = Testing Factor, in [0, 1]
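A minimal Python sketch of the mistake prediction formula follows. It assumes the formula reads as the prior scaled by one minus TF times the weighted fraction of passing tests; the function name, argument names, and the fallback to the prior when no tests are configured are assumptions made for illustration.

```python
from typing import Sequence


def mistake_probability(prior: float,
                        test_results: Sequence[int],
                        weights: Sequence[float],
                        testing_factor: float) -> float:
    """P(mistake | task) = prior * (1 - TF * sum(w_j * t_j) / sum(w_j)).

    test_results[j] is 1 if the jth test procedure passed, 0 otherwise.
    """
    if len(test_results) != len(weights):
        raise ValueError("one weight is required per test result")
    if not 0.0 <= testing_factor <= 1.0:
        raise ValueError("testing factor must lie in [0, 1]")
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumption: with no weighted tests, fall back to the prior.
        return prior
    passed = sum(w * t for w, t in zip(weights, test_results)) / total_weight
    return prior * (1.0 - testing_factor * passed)
```

Under this reading, TF caps how much passing tests can reduce the estimate: with TF = 0.5, even a fully passing test suite only halves the prior.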
Barricade: Test Procedures
System eng writes assertions about system properties
that are checked periodically [SRDS’09]
Assertions on configuration Assertions on dynamic properties (e.g., running processes, CPU
utilization)
Barricade: Cost Model
Defines avg cost of a mistake in task i, $\text{CM}(\text{task}_i)$
Potential metrics
Percentage of capacity loss
Amount of time required to fix the mistake
Amount of money lost
Costs can be instantiated from experience with
managed system and updated over time
Other Aspects of Barricade
Blocking actions
Defined per task by system engineer
Diagnosis model
Similar to Task Prediction Model Blocking actions prevent operator from acting on subsystems
unrelated to the problem
Guidance
Getinfo reports the state of the management system, including
probabilities and test results
Extremely useful feature for operators
Outline
Motivation Mistake-Aware Systems Management Prototype Implementation: Barricade Evaluation
Case Study: 3-Tier Auction Service
Conclusions & Future Work
Case Study: 3-Tier Auction Service
Live operator experiments
18 volunteers: 8 novices, 6 intermediates, 4 experts
43 experiments
Operator tasks
3 scheduled-maintenance tasks: Add app server, upgrade Web
server, migrate DBMS
3 diagnose-and-repair tasks: App hang, Web misconfig/crash,
DB disk error
Barricade config derived from previous experiments
[OSDI ’04]
Results of 43 Experiments
Breakdown of Mistakes per Operator Category
[Figure: bar chart of the normalized number of mistakes per mistake type (local misconfig, global misconfig, incorrect restart, start of wrong SW version, unnecessary restart of SW component) for each category of operator (expert, intermediate, novice)]
Left bars: observed mistakes; right bars: contained mistakes
Mistakes
Barricade contained 34 out of 37 mistakes
3 mistakes missed
Bug in configuration file parser
Missing test procedure
Missing block
Super-operator needed in 3 cases
[Figure: contained mistakes by category (Containment, Containment Blocked, Dissemination, Diagnose-repair): 3, 8, 18, 5]
Conclusions & Future Work
Collected and disseminated data on operator behavior
64 live experiments, 20 volunteers, more than 90 hours
Proposed and evaluated mistake-aware management
Contained 75 out of 82 live mistakes
Combination of systems mechanisms, probabilistic
models, and proper configuration for a novel approach
Future work
Mistake-aware systems that handle malicious operators
Learning models to ease instantiation/configuration of MAM
Backup Slides
Motivation
Computer systems pervade our lives
Work and leisure Enterprise systems: email, storage, database, etc. Internet services: e-commerce, social networks, etc. Cloud computing
Systems need to be highly available, behave correctly Downtime can be costly ($ per hour)
PayPal $7,200,000
(datacenterknowledge.com, 8/2009)
Brokerage operations $6,450,000
(Internet Week, 4/2000)
Credit card authorization $2,600,000
(Internet Week, 4/2000)
Ebay $225,000
(Internet Week, 4/2000)
Mistake-Oblivious Operational Model
A typical model of operation in data centers
1. Operate on one or a few nodes off-line
2. Test behavior
3. Disseminate changes to other nodes using scripts
What can go wrong?
Operate directly on nodes without taking them off-line
Fail to test changes properly
Disseminate changes before they are tested
Dissemination scripts may be buggy
Mistake-Aware Management
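Basic idea
[Figure: flowchart. Normal operation, then "High cost?"; if N, continue normal operation; if Y, Block, then Test, then Lift, then back to normal operation]
Management roles
Regular operators
Super-operator
System engineer
Blocking mechanisms
Command blocking
File blocking
But what should be blocked and when?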
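Command and file blocking with a super-operator bypass might look like the check below. This is only a sketch of the idea: `BlockList`, `is_allowed`, and the particular blocked commands and files are hypothetical, and the real system enforces blocks inside a modified shell rather than in a standalone function.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class BlockList:
    """Commands and files currently blocked for a task (hypothetical)."""
    commands: Set[str] = field(default_factory=set)
    files: Set[str] = field(default_factory=set)


def is_allowed(block: BlockList, command: str, touched_files: Set[str],
               super_operator: bool = False) -> bool:
    """Decide whether an operator action may proceed."""
    if super_operator:
        # The super-operator can bypass the blocks.
        return True
    if command in block.commands:
        return False
    # Reject the action if it touches any blocked file.
    return not (touched_files & block.files)


if __name__ == "__main__":
    # Example block list: startup/shutdown of server software and one of
    # its configuration files (the specific entries are made up).
    block = BlockList(commands={"startup.sh", "shutdown.sh"},
                      files={"server.xml"})
    print(is_allowed(block, "startup.sh", set()))
    print(is_allowed(block, "vi", {"server.xml"}, super_operator=True))
```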
Related Work
Undo for operators [USENIX’03]
Operators can “undo” only after harmful actions affect the system
Guidance for operator actions [HotAC’06]
Operators may make mistakes while performing recommended actions
Auditing the interactions between applications and
persistent state [OSDI’06, LISA’06]
Finding a bad change does not guard the system against its effects
Automation
IBM’s autonomic computing
AutoBash [SOSP ’07]
Barricade Testing Assertions
Types of assertions used
Verify if certain processes are running
Verify structure of configuration files
Verify values of critical configuration parameters
Verify if system configuration allows adjacent components to communicate
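Two of the assertion types listed above (running processes and critical configuration values) can be sketched as plain functions that return a list of violations. Everything here is an illustrative assumption: the function names, the simple key=value configuration format, and the sample keys are not taken from Barricade.

```python
from typing import Dict, Iterable, List


def assert_processes_running(required: Iterable[str],
                             running: Iterable[str]) -> List[str]:
    """Return one violation message per required process that is missing."""
    running_set = set(running)
    return [f"process not running: {name}"
            for name in required if name not in running_set]


def parse_properties(text: str) -> Dict[str, str]:
    """Parse key=value lines, skipping blank lines and '#' comments.

    Raises ValueError on a malformed line, which doubles as a
    structure check on the configuration file.
    """
    config: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        config[key.strip()] = value.strip()
    return config


def assert_config_values(config: Dict[str, str],
                         expected: Dict[str, str]) -> List[str]:
    """Return one violation per critical parameter with an unexpected value."""
    return [f"{key}: expected {value!r}, found {config.get(key)!r}"
            for key, value in expected.items() if config.get(key) != value]
```

An empty list from every assertion corresponds to a passing test procedure; any non-empty list would keep the blocks in place.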
Motivation
Human mistakes are not surprising
Enterprise computer systems are complex
Distributed, inter-related SW components Difficult to operate and maintain
Source: James Hamilton (AWS)
Framework: Task Prediction Module
Inputs: shell commands executed, persistent-state changes
Output: probability distribution of tasks
Task Prediction Parameters
Evidence                      Add App   Mig Data  Upg Web   Other
httpd.conf                    0.01      0.01      0.13      0.0625
mcast_heartbeat_db.conf(w)    0.01      0.01      0.13      0.0625
apachectl start               0.115     0.094     0.13      0.0625
server.xml                    0.115     0.01      0.01      0.0625
mcast_heartbeat_db.conf(a)    0.115     0.01      0.01      0.0625
startup.sh                    0.115     0.094     0.01      0.0625
apachectl stop                0.115     0.094     0.13      0.0625
workers.properties            0.115     0.01      0.13      0.0625
shutdown.sh                   0.115     0.094     0.01      0.0625
mysqldump                     0.01      0.094     0.01      0.0625
my.cnf                        0.01      0.094     0.01      0.0625
mysql_install_db              0.01      0.094     0.01      0.0625
mysqld_safe                   0.01      0.094     0.01      0.0625
web.xml                       0.115     0.094     0.01      0.0625
apachectl                     0.01      0.01      0.13      0.0625
make                          0.01      0.094     0.13      0.0625
Task Prediction Computation
Observed evidence: httpd.conf
                      Add App       Mig Data      Upg Web       Other
Prior                 0.25          0.25          0.25          0.25
Prior x likelihood    0.25*0.01     0.25*0.01     0.25*0.13     0.25*0.0625
Normalize the above distribution
Posterior             0.047         0.047         0.612         0.294
Task Prediction Computation
Observed evidence: apachectl stop
                      Add App       Mig Data      Upg Web       Other
Prior                 0.047         0.047         0.612         0.294
Prior x likelihood    0.047*0.115   0.047*0.094   0.612*0.13    0.294*0.0625
Normalize the above distribution
Posterior             0.05          0.04          0.74          0.17
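The two update steps above can be reproduced with a short Python sketch of the recursive Bayesian update. The likelihood values are copied from the Task Prediction Parameters table for the two observed pieces of evidence; the names `TASKS`, `LIKELIHOOD`, and `update` are illustrative, not part of Barricade.

```python
from typing import Dict

TASKS = ("Add App", "Mig Data", "Upg Web", "Other")

# P(evidence | task) for the two observations used in the worked example.
LIKELIHOOD = {
    "httpd.conf": (0.01, 0.01, 0.13, 0.0625),
    "apachectl stop": (0.115, 0.094, 0.13, 0.0625),
}


def update(prior: Dict[str, float], evidence: str) -> Dict[str, float]:
    """One Bayesian step: multiply by the likelihood, then normalize."""
    likelihood = dict(zip(TASKS, LIKELIHOOD[evidence]))
    unnormalized = {task: prior[task] * likelihood[task] for task in TASKS}
    total = sum(unnormalized.values())
    return {task: value / total for task, value in unnormalized.items()}


if __name__ == "__main__":
    belief = {task: 0.25 for task in TASKS}  # uniform starting distribution
    for observed in ("httpd.conf", "apachectl stop"):
        belief = update(belief, observed)
        print(observed, {task: round(p, 3) for task, p in belief.items()})
```

Each posterior becomes the prior for the next observation, which is what makes the estimation recursive.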
Framework: Mistake Prediction Module
Inputs: shell commands executed, persistent-state changes, results from test procedures
Output: P(mistake) for all tasks
Cost Module
Inputs: probability distribution of tasks; P(mistake) for all tasks; CM (Cost of Mistake) for each task
Output: expected cost of mistake for all tasks
$\text{ECM}(\text{task}_i) = P(\text{task}_i) \cdot P(\text{mistake} \mid \text{task}_i) \cdot \text{CM}(\text{task}_i)$
Framework: Diagnosis Module
Input: output of monitors
Output: probability distribution of problems
Diagnosis Model
Takes output of monitors and computes probability
distribution of possible system problems
Uses Bayesian Estimation and a table of monitor
coverage for each problem
Triggers a flow of info similar to Task Prediction Module
Blocking actions prevent operator from acting on subsystems unrelated to the problem
Diagnosis Model
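Updating the probability distribution and expected cost:
$P(O_M \mid f_i)$: monitor coverage, the probability of the observed monitor outputs $O_M$ given problem $f_i$, computed over the monitors $m \in M$ from $P(m \mid f_i)$ and $1 - P(m \mid f_i)$
$P(f_i)$: prior
$P(f_i \mid O_M) = \dfrac{P(O_M \mid f_i)\, P(f_i)}{\sum_{f' \in F} P(O_M \mid f')\, P(f')}$
$\text{ExpectedCost}_i = P(f_i) \cdot P(\text{mistake} \mid f_i) \cdot \text{Cost}(\text{mistake} \mid f_i)$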
How about mis-predictions?
Super-operator intervention needed in 3 out of all
49 live experiments
2 instances of incorrect detection of site of operation
Simple heuristic used
1 instance of task prediction shifting towards wrong task
Rudimentary way of reporting command evidence, plus a missing
relevant command
Results of 43 Experiments
Summary of Operator Mistakes
[Figure: bar chart of the number of occurrences of each potential impact of mistake (degraded throughput or service inaccessible, incomplete component integration, security vulnerability), broken down by category of mistake (unnecessary restart of SW component, start of wrong SW version, incorrect restart, global misconfiguration, local misconfiguration)]
Summary of Results
Case study 1: auction service
43 live experiments: contained 34 out of 37 mistakes
Super-operator needed in 3 cases
43 trace-replay experiments (OSDI’04): contained all 42 mistakes
Case study 2: our dept’s computing infrastructure
3 operator tasks
15 live experiments: contained 36 out of 39 mistakes
Super-operator needed in 2 cases
Mistakes missed = bugs in implementation/configuration Super-operator = simplifications of our implementation