
SLIDE 1

Barricade: Defending Systems Against Operator Mistakes

Ricardo Bianchini
In collaboration with Fábio Oliveira, Andrew Tjang, Rich Martin, Thu Nguyen

Computer Science

SLIDE 2

Motivation

- Computer systems pervade our lives
  - Work and leisure
  - Enterprise systems: email, storage, database, etc.
  - Internet services: e-commerce, social networks, etc.
  - Cloud computing
- Systems need to be highly available and behave correctly
  - Misbehavior and downtime can be costly
  - Operator mistakes are a common source of such problems

SLIDE 3

“Google Glitch Briefly Disrupts World’s Search”

By LIZ ROBBINS – The New York Times News Blog Jan 31, 2009

- Google blacklisted all sites on the Web for 1 hour
  - Results warned "This site may harm your computer"
  - Sites would not load
- Cause: operator mistake
  - Operator added "/" to blacklist file
- Estimated cost: $2-3 million

SLIDE 4

Operator Mistakes Are Very Common

Avg of 3 Internet sites [Patterson’02]:

Operator    51%
Hardware    15%
Software    34%
Overload     0%

- Examples of mistakes: misconfiguration, improper testing of changes, improper deployment, dissemination
- Fixing mistakes can be time-consuming
- Mistakes are responsible (at least partly) for 79% of DB administration problems [Oliveira’06]

SLIDE 5

Our Idea

- What do we need?
  - Mistake-aware management of systems
- Focus: multi-node systems with replicated components
- Goals
  - Proactively defend the system against potential mistakes
  - Confine mistakes to isolated, off-line subset of nodes
  - Require less-experienced operators, lowering labor costs

SLIDE 6

Contributions

- A framework for mistake-aware management and a prototype system (Barricade)
- Two case studies
  - Prototype 3-tier online auction service
  - System that mimics our dept’s computing infrastructure (email, DNS, authentication)
- 64 live experiments with 20 volunteers, showing that mistake-aware management is effective
  - Barricade contained 75 out of 82 observed mistakes

SLIDE 7

Outline

- Motivation
- Mistake-Aware Systems Management
  - Framework overview
  - Example
- Prototype Implementation: Barricade
- Evaluation
- Conclusions & Future Work

SLIDE 8

Mistake-Aware Management

- Our approach to mistake-aware management is:
  - Monitor the operator’s actions
  - Predict the expected cost of a mistake
  - If cost is high, take nodes off-line and block actions
  - Enforce testing of actions
  - Lift blocks when tests confirm correctness
- Blocking mechanisms: command and file blocking
- Operators = scripts
- Key questions: What should be blocked and when? Can we make it all non-intrusive?
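The block-then-lift cycle above can be sketched in a few lines of Python. This is only an illustration of command and file blocking, not Barricade's implementation: the `Barricade` class, its method names, and the example commands and files are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Barricade:
    """Hypothetical block list covering commands and files (illustration only)."""
    blocked_commands: set = field(default_factory=set)
    blocked_files: set = field(default_factory=set)

    def block(self, commands=(), files=()):
        # Blocking action: forbid the given commands and files.
        self.blocked_commands.update(commands)
        self.blocked_files.update(files)

    def allows(self, command, touched_files=()):
        # An action passes only if neither its command nor any file it touches is blocked.
        if command in self.blocked_commands:
            return False
        return not any(f in self.blocked_files for f in touched_files)

    def lift_if_tests_pass(self, test_results):
        # Lifting action: remove all blocks once every test confirms correctness.
        if test_results and all(test_results):
            self.blocked_commands.clear()
            self.blocked_files.clear()
            return True
        return False


barricade = Barricade()
barricade.block(commands={"apachectl start"}, files={"workers.properties"})
print(barricade.allows("apachectl start"))          # False
print(barricade.allows("vi", ["httpd.conf"]))       # True
print(barricade.lift_if_tests_pass([True, False]))  # False
print(barricade.lift_if_tests_pass([True, True]))   # True
print(barricade.allows("apachectl start"))          # True
```

An empty list of test results never lifts the blocks in this sketch, which mirrors the requirement that testing be enforced before blocks are removed.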

SLIDE 9

Framework for MAM

For each task i, the expected cost of a mistake (ECM) is:

$$\mathrm{ECM}(task_i) = P(task_i) \times P(mistake \mid task_i) \times CM(task_i)$$

- Blocking actions for task i are triggered when ECM(task_i) exceeds the threshold
- Lifting actions for task i are triggered when ECM(task_i) falls back below the threshold

[Diagram components: Monitors, Actuators, Task Prediction Module, Mistake Prediction Module, Cost Module, Diagnosis Module, Blocking Module, Testing Module]
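As a rough sketch of how the expected cost of a mistake could drive blocking, the snippet below multiplies the three factors named on this slide and compares the result with a threshold. The function names, the task names, and all numbers are hypothetical, and the strict `>` comparison is an assumption, since the slide's comparison operator did not survive extraction.

```python
def expected_cost_of_mistake(p_task, p_mistake_given_task, cost_of_mistake):
    """ECM(task_i) = P(task_i) * P(mistake | task_i) * CM(task_i)."""
    return p_task * p_mistake_given_task * cost_of_mistake


def tasks_to_block(tasks, threshold):
    """Return the names of tasks whose ECM exceeds the threshold (assumed strict)."""
    return [name for name, (p, pm, cm) in tasks.items()
            if expected_cost_of_mistake(p, pm, cm) > threshold]


# Hypothetical inputs: (P(task), P(mistake | task), CM(task)).
tasks = {
    "add_app_server": (0.70, 0.40, 50.0),
    "upgrade_web_server": (0.20, 0.30, 30.0),
    "other": (0.10, 0.10, 10.0),
}
print(tasks_to_block(tasks, threshold=5.0))  # ['add_app_server']
```

The unit of the threshold follows the unit chosen for CM, for example percentage of capacity lost or time to repair.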

SLIDE 10

Overview of a MAM System

[Diagram: a Management Server (Task Prediction Model, Mistake Prediction Model, Cost Model, Diagnosis Model, Test Procedures, Blocking Actions) and three Managed Servers, each with Monitors and an Actuator]

System engineer instantiates and configures the MAM system

SLIDE 11

Overview of a MAM System

[Diagram: Management Server and Managed Servers as on the previous slide, with an added flow: shell commands, changes to persistent state, info for testing]

Operators interact with modified shell at managed/target servers (site of operation)

SLIDE 12

Overview of a MAM System

[Diagram: Management Server and Managed Servers as on the previous slide, with an added flow: Blocking/Lifting Actions]

Super-operator can bypass the blocks

SLIDE 13

Example: 3-tiered Service

[Diagram: Web Servers 1 to m, Application Servers 1 to n plus new Application Server n+1, Database Server]

- Operator task: add an application server
  - Actions: install, config, start appl server; reconfig Web servers
  - Site of operation: appl server and 1 Web server at a time

SLIDE 14

Example: 3-tiered Service

[Diagram: Web Servers 1 to m, Application Servers 1 to n plus new Application Server n+1, Database Server; the barricade and the dissemination target are marked]

- Operator task: add an application server
  - Web server 1 is the next dissemination target
  - Allow operator to proceed to next dissemination target
  - Adjust barricade

SLIDE 25

Example: 3-tiered Service

[Diagram: Web Servers 1 to m, Application Servers 1 to n plus new Application Server n+1, Database Server; the barricade and the dissemination target are marked]

- Operator task: add an application server
  - Operator completes last dissemination step AND tests succeed
  - Task is over
  - Destroy barricade

SLIDE 26

Example: 3-tiered Service

[Diagram: Web Servers 1 to m, Application Servers 1 to n plus new Application Server n+1, Database Server]

- Operator task: add an application server

SLIDE 27

Outline

- Motivation
- Mistake-Aware Systems Management
- Prototype Implementation: Barricade
- Evaluation
- Conclusions & Future Work

SLIDE 28

Barricade: Task Prediction Model

- Recursive Bayesian Estimation
  - Each observed action triggers an update of the probabilities
  - System eng identifies relevant tasks and defines signatures. Creates "other" task to account for unknown actions

$$P(task_i) = \frac{P(action \mid task_i)\,P(task_i)}{\sum_{j=1}^{n} P(action \mid task_j)\,P(task_j)}$$

The left-hand side is the updated probability; the P(task) terms on the right are the values before the action was observed.

SLIDE 29

Barricade: Mistake Prediction Model

- Prior probabilities with feedback from tests
- System eng defines weights and TF
- Priors instantiated from experience, updated over time

$$P(mistake \mid task_i) = prior_i \times \left(1 - \frac{\sum_j t_j\,w_j}{\sum_j w_j} \times TF\right)$$

Where:
  t_j = result (1 or 0) of the jth test procedure
  w_j = weight assigned to the jth test procedure
  TF = Testing Factor, in [0, 1]
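A small sketch of this model follows, assuming the slide's formula reads P(mistake | task_i) = prior_i × (1 − TF × Σ t_j w_j / Σ w_j) and that t_j = 1 means the jth test passed. The function name, the fallback for an empty test set, and the example numbers are hypothetical.

```python
def mistake_probability(prior, test_results, weights, testing_factor):
    """prior_i * (1 - TF * sum(w_j * t_j) / sum(w_j)), with t_j in {0, 1}."""
    if not 0.0 <= testing_factor <= 1.0:
        raise ValueError("TF must lie in [0, 1]")
    if len(test_results) != len(weights):
        raise ValueError("one weight per test procedure is required")
    total_weight = sum(weights)
    if total_weight == 0:
        return prior  # assumption: with no weighted tests, fall back to the prior
    passed_weight = sum(w * t for w, t in zip(weights, test_results))
    return prior * (1.0 - testing_factor * passed_weight / total_weight)


# Three tests with weights 2, 1, 1; the first two pass, the third fails.
p = mistake_probability(prior=0.4, test_results=[1, 1, 0],
                        weights=[2, 1, 1], testing_factor=0.8)
print(round(p, 3))  # 0.16
```

With TF = 0 the tests have no influence and the prior is returned unchanged; with TF = 1 and all tests passing, the probability drops to zero.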

SLIDE 30

Barricade: Test Procedures

- System eng writes assertions about system properties that are checked periodically [SRDS’09]
  - Assertions on configuration
  - Assertions on dynamic properties (e.g., running processes, CPU utilization)

SLIDE 31

Barricade: Cost Model

- Defines avg cost of a mistake in task i, CM(task_i)
- Potential metrics
  - Percentage of capacity loss
  - Amount of time required to fix the mistake
  - Amount of money lost
- Costs can be instantiated from experience with managed system and updated over time

SLIDE 32

Other Aspects of Barricade

- Blocking actions
  - Defined per task by system engineer
- Diagnosis model
  - Similar to Task Prediction Model
  - Blocking actions prevent operator from acting on subsystems unrelated to the problem
- Guidance
  - Getinfo reports the state of the management system, including probabilities and test results
  - Extremely useful feature for operators

SLIDE 33

Outline

- Motivation
- Mistake-Aware Systems Management
- Prototype Implementation: Barricade
- Evaluation
  - Case Study: 3-Tier Auction Service
- Conclusions & Future Work

SLIDE 34

Case Study: 3-Tier Auction Service

- Live operator experiments
  - 18 volunteers: 8 novices, 6 intermediates, 4 experts
  - 43 experiments
- Operator tasks
  - 3 scheduled-maintenance tasks: add app server, upgrade Web server, migrate DBMS
  - 3 diagnose-and-repair tasks: app hang, Web misconfig/crash, DB disk error
- Barricade config derived from previous experiments [OSDI ’04]

SLIDE 35

Results of 43 Experiments

[Figure: Breakdown of Mistakes per Operator Category. Axes: mistake (local misconfig, global misconfig, incorrect restart, start of wrong SW version, unnecessary restart of SW component) vs. normalized number of mistakes. Series: categories of operators (expert, intermediate, novice). Left bars: observed mistakes; right bars: contained mistakes.]

SLIDE 36

Mistakes

- Barricade contained 34 out of 37 mistakes
- 3 mistakes missed
  - Bug in configuration file parser
  - Missing test procedure
  - Missing block
- Super-operator needed in 3 cases

[Chart: values 3, 8, 18, 5; labels: Containment, Blocked, Dissemination, Diagnose-repair]

SLIDE 37

Conclusions & Future Work

- Collected and disseminated data on operator behavior
  - 64 live experiments, 20 volunteers, more than 90 hours
- Proposed and evaluated mistake-aware management
  - Contained 75 out of 82 live mistakes
- Combination of systems mechanisms, probabilistic models, and proper configuration for a novel approach
- Future work
  - Mistake-aware systems that handle malicious operators
  - Learning models to ease instantiation/configuration of MAM

SLIDE 38

Backup Slides

SLIDE 39

Motivation

- Computer systems pervade our lives
  - Work and leisure
  - Enterprise systems: email, storage, database, etc.
  - Internet services: e-commerce, social networks, etc.
  - Cloud computing
- Systems need to be highly available and behave correctly
  - Downtime can be costly ($ per hour)

PayPal                       $7,200,000   (datacenterknowledge.com, 8/2009)
Brokerage operations         $6,450,000   (Internet Week, 4/2000)
Credit card authorization    $2,600,000   (Internet Week, 4/2000)
Ebay                         $225,000     (Internet Week, 4/2000)

SLIDE 40

Mistake-Oblivious Operational Model

- A typical model of operation in data centers
  1. Operate on one or a few nodes off-line
  2. Test behavior
  3. Disseminate changes to other nodes using scripts
- What can go wrong?
  - Operate directly on nodes without taking them off-line
  - Fail to test changes properly
  - Disseminate changes before they are tested
  - Dissemination scripts may be buggy

SLIDE 41

Mistake-Aware Management

- Basic idea
- Management roles
  - Regular operators
  - Super-operator
  - System engineer
- Blocking mechanisms
  - Command blocking
  - File blocking
- But what should be blocked and when?

[Flowchart nodes: Normal operation, High cost? (Y/N), Block, Test, Lift]

SLIDE 42

Related Work

- Undo for operators [USENIX’03]
  - Operators can "undo" only after harmful actions affect the system
- Guidance for operator actions [HotAC’06]
  - Operators may make mistakes while performing recommended actions
- Auditing the interactions between applications and persistent state [OSDI’06, LISA’06]
  - Finding a bad change does not guard the system against its effects
- Automation
  - IBM’s autonomic computing
  - AutoBash [SOSP ’07]

SLIDE 43

Barricade Testing Assertions

- Types of assertions used
  - Verify if certain processes are running
  - Verify structure of configuration files
  - Verify values of critical configuration parameters
  - Verify if system configuration allows adjacent components to communicate
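The four assertion types listed above could look roughly like the checks below. This is a hedged sketch over in-memory data, not Barricade's assertion language: every function name is hypothetical, and the configuration fragment, host address, and process names are invented for illustration.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    params = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            raise ValueError(f"line {lineno}: expected key=value")
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def check_processes(running, required):
    """Assertion: every required process name appears among the running ones."""
    return set(required) <= set(running)


def check_structure(params, required_keys):
    """Assertion: the configuration defines every required key."""
    return all(key in params for key in required_keys)


def check_value(params, key, allowed):
    """Assertion: a critical parameter holds one of the allowed values."""
    return params.get(key) in allowed


def check_adjacent(web_params, worker, app_host, app_port):
    """Assertion: the Web tier's worker entry points at the application server."""
    return (web_params.get(f"worker.{worker}.host") == app_host
            and web_params.get(f"worker.{worker}.port") == str(app_port))


# Invented workers.properties-style fragment, for illustration only.
config = """
# hypothetical Web-server worker configuration
worker.list=app1
worker.app1.type=ajp13
worker.app1.host=10.0.0.5
worker.app1.port=8009
"""
params = parse_properties(config)
results = [
    check_processes({"httpd", "java"}, ["httpd"]),
    check_structure(params, ["worker.list", "worker.app1.host"]),
    check_value(params, "worker.app1.type", {"ajp13"}),
    check_adjacent(params, "app1", "10.0.0.5", 8009),
]
print(results)  # [True, True, True, True]
```

Each check returns 1-or-0 style results, so the list could feed a weighted test score directly.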

SLIDE 44

Motivation

- Human mistakes are not surprising
- Enterprise computer systems are complex
  - Distributed, inter-related SW components
  - Difficult to operate and maintain

Source: James Hamilton (AWS)

SLIDE 45

Framework: Task Prediction Module

[Diagram: inputs (shell commands executed, persistent-state changes) → Task Prediction → output (probability distribution of tasks)]

SLIDE 46

Task Prediction Parameters

Evidence                      Add App   Mig Data   Upg Web   Other
httpd.conf                    0.01      0.01       0.13      0.0625
mcast_heartbeat_db.conf(w)    0.01      0.01       0.13      0.0625
apachectl start               0.115     0.094      0.13      0.0625
server.xml                    0.115     0.01       0.01      0.0625
mcast_heartbeat_db.conf(a)    0.115     0.01       0.01      0.0625
startup.sh                    0.115     0.094      0.01      0.0625
apachectl stop                0.115     0.094      0.13      0.0625
workers.properties            0.115     0.01       0.13      0.0625
shutdown.sh                   0.115     0.094      0.01      0.0625
mysqldump                     0.01      0.094      0.01      0.0625
my.cnf                        0.01      0.094      0.01      0.0625
mysql_install_db              0.01      0.094      0.01      0.0625
mysqld_safe                   0.01      0.094      0.01      0.0625
web.xml                       0.115     0.094      0.01      0.0625
apachectl                     0.01      0.01       0.13      0.0625
make                          0.01      0.094      0.13      0.0625

SLIDE 47

Task Prediction Computation

Observed evidence: httpd.conf

             Add App      Mig Data     Upg Web      Other
Prior        0.25         0.25         0.25         0.25
Update       0.25*0.01    0.25*0.01    0.25*0.13    0.25*0.0625

Normalize the above distribution:

             Add App      Mig Data     Upg Web      Other
Result       0.047        0.047        0.612        0.294

SLIDE 48

Task Prediction Computation

Observed evidence: apachectl stop

             Add App        Mig Data       Upg Web       Other
Prior        0.047          0.047          0.612         0.294
Update       0.047*0.115    0.047*0.094    0.612*0.13    0.294*0.0625

Normalize the above distribution:

             Add App        Mig Data       Upg Web       Other
Result       0.05           0.04           0.74          0.17
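The two updates on these slides can be replayed with a short recursive Bayesian update. The likelihood values are taken from the Task Prediction Parameters table; the function and variable names are hypothetical, and the code is only a sketch of the computation, not Barricade's code.

```python
TASKS = ["Add App", "Mig Data", "Upg Web", "Other"]

# P(evidence | task) for the two pieces of evidence observed on the slides.
LIKELIHOOD = {
    "httpd.conf":     [0.01, 0.01, 0.13, 0.0625],
    "apachectl stop": [0.115, 0.094, 0.13, 0.0625],
}


def bayes_update(prior, likelihood):
    """Multiply the prior by the likelihood of the evidence, then normalize."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


belief = [0.25, 0.25, 0.25, 0.25]  # uniform prior over the four tasks
history = []
for evidence in ["httpd.conf", "apachectl stop"]:
    belief = bayes_update(belief, LIKELIHOOD[evidence])
    history.append(belief)
    print(evidence, dict(zip(TASKS, (round(b, 2) for b in belief))))
```

Rounded to two decimals, the second step gives 0.05, 0.04, 0.74, 0.17, matching the slide; the first step gives 0.047, 0.047, 0.612, 0.294 at three decimals.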

SLIDE 49

Framework: Mistake Prediction Module

[Diagram: inputs (shell commands executed, persistent-state changes, results from test procedures) → Mistake Prediction → output (P(mistake) for all tasks)]

SLIDE 50

Cost Module

[Diagram: inputs (probability distribution of tasks, P(mistake) for all tasks, CM (Cost of Mistake) for each task) → Cost Module → output (expected cost of mistake for all tasks)]

$$\mathrm{ECM}(task_i) = P(task_i) \times P(mistake \mid task_i) \times CM(task_i)$$

SLIDE 51

Framework: Diagnosis Module

[Diagram: input (output of monitors) → Diagnosis → output (probability distribution of problems)]

SLIDE 52

Diagnosis Model

- Takes output of monitors and computes probability distribution of possible system problems
- Uses Bayesian Estimation and a table of monitor coverage for each problem
- Triggers a flow of info similar to Task Prediction Module
- Blocking actions prevent operator from acting on subsystems unrelated to the problem

SLIDE 53

Diagnosis Model

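Updating the probability distribution and expected cost:

Likelihood of the monitor outputs O_M given problem f_i, where P(m | f_i) is the monitor coverage and 1[m] is 1 if monitor m fired and 0 otherwise:

$$P(O_M \mid f_i) = \prod_{m \in M} \Big( \mathbb{1}[m]\,P(m \mid f_i) + \big(1 - \mathbb{1}[m]\big)\big(1 - P(m \mid f_i)\big) \Big)$$

Posterior over problems, where P(f_i) is the prior:

$$P(f_i \mid O_M) = \frac{P(O_M \mid f_i)\,P(f_i)}{\sum_{f' \in F} P(O_M \mid f')\,P(f')}$$

Expected cost:

$$\mathrm{ExpectedCost}_i = P(f_i) \times P(mistake \mid f_i) \times \mathrm{Cost}(mistake \mid f_i)$$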
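The diagnosis update can be sketched as below, under one reading of the slide: each monitor contributes P(m | f) to the likelihood if it fired and 1 − P(m | f) if it did not, and Bayes' rule turns the likelihoods and priors into a posterior over problems. That reading, the function names, the monitor names, and the coverage numbers are all assumptions made for illustration.

```python
def observation_likelihood(fired, coverage_for_fault):
    """P(O_M | f): product over monitors of P(m | f) if fired, else 1 - P(m | f)."""
    likelihood = 1.0
    for monitor, p_fire in coverage_for_fault.items():
        likelihood *= p_fire if monitor in fired else (1.0 - p_fire)
    return likelihood


def diagnose(fired, coverage, priors):
    """Posterior P(f | O_M) for every problem f, via Bayes' rule."""
    joint = {f: observation_likelihood(fired, coverage[f]) * priors[f]
             for f in priors}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("observation has zero probability under every problem")
    return {f: j / total for f, j in joint.items()}


# Hypothetical monitor-coverage table: P(monitor fires | problem).
coverage = {
    "app_hang":      {"http_probe": 0.9, "db_probe": 0.1},
    "db_disk_error": {"http_probe": 0.5, "db_probe": 0.9},
}
priors = {"app_hang": 0.5, "db_disk_error": 0.5}

# Only the HTTP probe fired; the DB probe stayed quiet.
posterior = diagnose({"http_probe"}, coverage, priors)
print({f: round(p, 3) for f, p in posterior.items()})
# {'app_hang': 0.942, 'db_disk_error': 0.058}
```

A quiet monitor with high coverage for a problem counts as evidence against that problem, which is why the posterior shifts so strongly towards the application hang here.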

SLIDE 54

How about mis-predictions?

- Super-operator intervention needed in 3 out of all 49 live experiments
  - 2 instances of incorrect detection of site of operation
    - Simple heuristic used
  - 1 instance of task prediction shifting towards wrong task
    - Rudimentary way to report command evidence + missing relevant command

SLIDE 55

Results of 43 Experiments

[Figure: Summary of Operator Mistakes. Axes: potential impact of mistake (degraded throughput or service inaccessible; incomplete component integration; security vulnerability) vs. number of occurrences. Series: categories of mistakes (unnecessary restart of SW component, start of wrong SW version, incorrect restart, global misconfiguration, local misconfiguration).]

SLIDE 56

Summary of Results

- Case study 1: auction service
  - 43 live experiments: contained 34 out of 37 mistakes
  - Super-operator needed in 3 cases
  - 43 trace-replay experiments (OSDI’04): contained all 42 mistakes
- Case study 2: our dept’s computing infrastructure
  - 3 operator tasks
  - 15 live experiments: contained 36 out of 39 mistakes
  - Super-operator needed in 2 cases
- Mistakes missed = bugs in implementation/configuration
- Super-operator = simplifications of our implementation

SLIDE 57