

SLIDE 1

Anomaly Detection, Fault Tolerance, Anticipation: Patterns

John Allspaw, SVP Tech Ops
QCon London 2012

Wednesday, March 7, 2012
SLIDE 2

Four Cornerstones (Erik Hollnagel)

  • Knowing What Has Happened (Learning)
  • Knowing What To Look For (Monitoring)
  • Knowing What To Do (Response)
  • Knowing What To Expect (Anticipation)

SLIDE 5

Anomaly Detection

SLIDE 6

Anomaly Detection

  • Getting at the state of health
  • Evaluating the state of health
  • Components AND systems
SLIDE 7

Example: Active health check (Supervisory)

Monitor → Component (webserver): check_http
Component → Monitor: 200 OK, exit 0
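The active-check loop above is simple to sketch; a minimal version in Python rather than the Nagios check_http plugin the slide shows, mapping the result to Nagios-style exit codes:

```python
import urllib.request

# Minimal active health check in the spirit of Nagios's check_http:
# the supervisor polls the component and maps the result to an exit
# code (0 = OK, 2 = CRITICAL, per the Nagios plugin convention).
def check_http(url, timeout=1.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 0 if resp.status == 200 else 2
    except OSError:  # connection refused, timeout, HTTP error, etc.
        return 2
```

Non-2xx responses raise `HTTPError` (an `OSError` subclass), so they fall through to CRITICAL as well.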

SLIDE 8

Example: Active health check (Supervisory)

Pros:
  • Easy to implement
  • Easy to understand
  • Well-known pattern

Cons:
  • Messaging can fail
  • Scalability is limited
SLIDE 9

Supervisor Sensitivity

1 sec timeout, 1 retry, 3 sec interval:

  • Up to ~2.9s: fault lands just after the previous check passed
  • 1s: next check times out
  • 3s: wait for the retry interval
  • 1s: retry times out

(~7.9 sec exposure)
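That worst case follows directly from the check parameters; a small sketch (treating the retry interval as equal to the slide's 3s check interval, which is an assumption about how the deck's numbers combine):

```python
# Worst-case detection latency for an active check: the fault lands
# just after a passing check, so nearly a full interval elapses before
# the next check, which must then time out once per attempt.
def worst_case_exposure(interval, timeout, retries, retry_interval):
    missed_window = interval - 0.1   # fault lands ~0.1s after a pass
    first_attempt = timeout          # first check times out
    retry_cost = retries * (retry_interval + timeout)
    return missed_window + first_attempt + retry_cost

print(worst_case_exposure(interval=3, timeout=1, retries=1, retry_interval=3))
```

With the slide's parameters this reproduces the ~7.9s figure.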

SLIDE 10

Supervisor Sensitivity

Monitor → Component (webserver): check_http → 200 OK, exit 0

  • Schedule latency (max = N)
  • Request latency (max = 0.9s)
  • Response latency (max = 0.9s)

SLIDE 11

Supervisor Sensitivity

How many seconds of errors can you tolerate serving?

SLIDE 12

Example: Passive health check, on an interval (Supervisory)

Component (webserver) → Monitor: DISK consumption within bounds, exit 0
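In the passive pattern the component itself measures and submits; a sketch of the disk-consumption check, where the 90% threshold is an illustrative assumption rather than a value from the deck:

```python
import shutil

# Passive health check run on the component itself: measure local disk
# consumption and report a Nagios-style status code to the monitor.
# The 90% critical threshold is an illustrative assumption.
def check_disk(path="/", critical_pct=90.0):
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    if used_pct < critical_pct:
        return 0, "OK: %.1f%% used" % used_pct
    return 2, "CRITICAL: %.1f%% used" % used_pct
```

The tuple (exit code, message) is what the component would submit to the supervisor on each interval.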

SLIDE 13

Example: Passive health check, on an interval (Supervisory)

Pros:
  • Efficient
  • Scalability is different
  • Fewer moving parts
  • Less exposure
  • Can submit to multiple places

Cons:
  • Nonideal for network-based services
  • Different tuning (windowed expectation)
SLIDE 14

Example: Passive health check (Supervisory)

[diagram: component check-ins over time, bracketed into intervals; "?" marks intervals with no check-in]

SLIDE 15

Example: Passive health check (Supervisory)

[diagram: ✓ check-ins arriving in each interval, then an interval with none ("?")]

Exposure = (Schedule Latency + Interval) * (Unknown Consecutive Intervals + 1)

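The exposure formula is a one-liner to evaluate; a sketch (the grouping of the "+1" as belonging to the unknown-intervals factor is an interpretation of the slide's layout):

```python
# Worst-case exposure for a passive check, reading the slide's formula
# as (schedule latency + interval) * (tolerated unknown intervals + 1).
# Grouping the "+1" this way is an interpretation of the slide.
def passive_exposure(schedule_latency, interval, unknown_intervals):
    return (schedule_latency + interval) * (unknown_intervals + 1)

# e.g. 60s interval, 5s schedule latency, alert only after
# 2 consecutive silent intervals:
print(passive_exposure(5, 60, 2))  # → 195
```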
SLIDE 16

Frequency and Transience

  • Short intervals, low # of retries, short timeouts → higher probability of false positives
  • Long intervals, high # of retries, long timeouts → higher probability of nondetection

SLIDE 17

Example: Passive application event logging (In-Line)

application → monitor

SLIDE 18

Example: Passive application event logging (Supervisory)

Pros:
  • On-demand publish

Cons:
  • Onus is on the app
  • Can’t be 100% sure it’s working

SLIDE 19

Example: Passive application event logging (Supervisory)

  • Positive events (sales, registrations, etc.)
  • Negative events (errors, exceptions, etc.)

Lack or presence of data mean different things, so history is paramount.

SLIDE 20

Context

SLIDE 21

Evaluation

What is ‘abnormal’?

SLIDE 22

[chart: Response Time vs. Time]
SLIDE 23

[chart: Response Time vs. Time, with static Warning and Critical threshold lines]

Static Thresholds
SLIDE 27

Static Thresholds

SLIDE 29

Context

Normal?

SLIDE 30

24 hours

Context

SLIDE 31

Context

7 days

SLIDE 32

Context

Normal But Noisy

SLIDE 33

Context

Smoothing?

SLIDE 34

Context

Holt-Winters Exponential Smoothing: recent points influence the forecast, with exponentially decreasing influence backwards in time.
en.wikipedia.org/wiki/Exponential_smoothing
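A minimal sketch of the single-parameter form of that idea, simple exponential smoothing (full Holt-Winters adds trend and seasonality terms on top):

```python
# Simple exponential smoothing: each forecast is a weighted blend of
# the newest observation and the previous forecast, so older points'
# influence decays geometrically. Full Holt-Winters adds trend and
# seasonal components to this same recurrence.
def smooth(points, alpha=0.5):
    forecast = points[0]
    out = [forecast]
    for x in points[1:]:
        forecast = alpha * x + (1 - alpha) * forecast
        out.append(forecast)
    return out

print(smooth([10, 10, 20, 10], alpha=0.5))
```

Note how the spike to 20 pulls the forecast up only partway, and its effect fades on the next point.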

SLIDE 35

Context

Aberrant Behavior Detection in Time Series for Network Monitoring
http://static.usenix.org/events/lisa00/full_papers/brutlag/brutlag_html/

SLIDE 36

Dynamic Thresholds

SLIDE 37

Dynamic Thresholds

[chart: raw data with dynamic upper and lower bounds]

SLIDE 38

Hrm....

Dynamic Thresholds

SLIDE 40

Ah!

Holt-Winters Aberration

Dynamic Thresholds

SLIDE 41

Dynamic Thresholds

Nagios check for Graphite data:
https://github.com/etsy/nagios_tools/blob/master/check_graphite_data

Graphite metrics collection with Holt-Winters aberrations:
http://graphite.wikidot.com/
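Once the smoothed forecast and its deviation exist, the aberration check itself is small; a sketch in the style of Brutlag's confidence bands (the multiplier of 3 is an illustrative choice, not a value from the deck):

```python
# Dynamic threshold check in the Brutlag style: flag a point when it
# falls outside forecast ± delta * deviation. The default delta of 3
# is an illustrative assumption, not taken from the deck.
def is_aberrant(value, forecast, deviation, delta=3.0):
    upper = forecast + delta * deviation
    lower = forecast - delta * deviation
    return value > upper or value < lower

print(is_aberrant(value=95, forecast=50, deviation=10))  # → True
```

Because the forecast and deviation track the series itself, the same code flags a 95ms response time as aberrant at night and normal at peak, which is the whole point over static thresholds.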

SLIDE 42

Four Cornerstones (Erik Hollnagel)

  • Knowing What Has Happened (Learning)
  • Knowing What To Look For (Monitoring)
  • Knowing What To Do (Response)
  • Knowing What To Expect (Anticipation)

SLIDE 43

FAULT TOLERANCE

SLIDE 44

Detection of fault X → triggers corrective action Y → clean up, report back

(RECOVERY OR MASKING)

SLIDE 45

Variation Tolerance

SLIDE 46

Adaptive Systems

Expected Variation

SLIDE 49

Expected Variation

[diagram: disturbance vs. control over time; compensation absorbs disturbances until new disturbances arise and compensation is exhausted, leading to decompensation]

Woods, 2011
SLIDE 51

Expected Variation

[diagram as before, annotated: disturbances absorbed by compensation are Variation; once compensation is exhausted (decompensation), they are Faults]
SLIDE 52

Variations != Faults

SLIDE 53

Dead Corrupt Late Wrong

SLIDE 54

Fault Tolerance

Redundancy:
  • Spatial (server, network, process)
  • Temporal (checkpoint, “rollback”)
  • Informational (data in N locations)

SLIDE 56

Spatial Redundancy


SLIDE 57

Spatial Redundancy

Active/Active

SLIDE 58

Active/Passive

Spatial Redundancy

SLIDE 59

Spatial Redundancy

Roaming Spare vs. Dedicated Spare

SLIDE 60

In-Line Fault Tolerance

App (PHP, thrift client) → Thrift → Search (Lucene/Solr)

  • Connect timeout
  • Send timeout
  • Receive timeout
SLIDE 61

In-Line Fault Tolerance

App → Search (Lucene/Solr): connection fails (X)

  1. App attempts connection, can’t
  2. Caches APC user object with 60s “TTL”, key=server:port
  3. Moves to next server in rotation, skipping any found in APC
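The rotation logic can be sketched outside PHP; here a plain dict of expiry times stands in for APC's TTL'd keys, and `connect` is an injected stand-in for the real Thrift socket connect:

```python
import time

# Sketch of the TSocketPool-style failover above: a dead-host cache
# with a 60s TTL stands in for APC; connect() is a stand-in for the
# real Thrift socket connect, injected so the logic is testable.
class Pool:
    def __init__(self, servers, connect, ttl=60):
        self.servers, self.connect, self.ttl = servers, connect, ttl
        self.dead = {}  # "host:port" -> time when the down-mark expires

    def get_connection(self):
        now = time.time()
        for server in self.servers:
            if self.dead.get(server, 0) > now:
                continue  # still marked down, skip it
            try:
                return self.connect(server)
            except ConnectionError:
                self.dead[server] = now + self.ttl  # mark down for ttl
        raise ConnectionError("all servers down")
```

Because the down-mark expires on its own, a recovered server re-enters the rotation without any operator action, which is the auto-recovery property the next slide lists as a pro.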

SLIDE 62

In-Line Fault Tolerance

http://thrift.apache.org/ (lib/php/src/TSocketPool.php)

SLIDE 63

In-Line Fault Tolerance

Pros:
  • Distributed checking and perspective
  • Handles transient failures
  • Auto-recovery

Cons:
  • Onus is on the app for implementation

SLIDE 64

Fault Tolerance: Nagios Event Handlers

  • Attempt to recover from specific conditions
  • Chain together recovery actions

http://nagios.sourceforge.net/docs/3_0/eventhandlers.html

SLIDE 65

If (fault X) then
    HUP process; re-check
    If (OK) then notify + exit
    ELSE
        Hard restart process; re-check
        If (OK) then notify + exit
        ELSE
            Remove from production; notify + exit
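An event handler of that shape, sketched in Python rather than the shell script Nagios would normally invoke; the recover/restart/remove callables are injected stand-ins for "kill -HUP", a hard restart, and a load-balancer removal:

```python
# Escalating recovery chain in the shape of the slide's pseudocode:
# try progressively heavier actions, re-check after each, and pull the
# node from production if nothing works. All four callables are
# injected stand-ins for the real commands.
def handle_fault(recheck, hup, hard_restart, remove_from_production):
    hup()
    if recheck():
        return "recovered: HUP"
    hard_restart()
    if recheck():
        return "recovered: restart"
    remove_from_production()
    return "removed from production"
```

The return value is what you would send as the notification text in each branch.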

SLIDE 66

How many seconds of errors can you tolerate serving?

SLIDE 67

Fail Closed

When a fault is found and can’t be recovered or masked, operations cease, to protect the rest of the system from damage.

SLIDE 68

Depth and Dependencies

App DB Monitor Load Balancers Health check

SLIDE 69

Depth and Dependencies

App DB Monitor Load Balancers Health check

WARNING: Don’t be too crazy

SLIDE 70

Fail Closed: Aggregate Cluster Checking

[diagram: several failed nodes marked X]

If (clusterfail > 25%) then notify + exit, ELSE OK
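The aggregate check amounts to counting failures across the cluster before deciding to page; a sketch (the 25% threshold is from the slide, the per-node check is an injected stand-in):

```python
# Aggregate cluster check: only go critical when more than 25% of the
# cluster is failing, so a single flapping node doesn't fire a
# cluster-wide alert. node_ok is an injected per-node health check.
def cluster_status(node_ok, nodes, max_fail_ratio=0.25):
    failed = sum(1 for n in nodes if not node_ok(n))
    if failed / len(nodes) > max_fail_ratio:
        return 2  # CRITICAL: notify + exit
    return 0      # OK

print(cluster_status(lambda n: n != "web3", ["web1", "web2", "web3", "web4"]))  # → 0
```

With one of four nodes down the ratio is exactly 25%, which does not exceed the threshold, so the cluster as a whole still reports OK.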

SLIDE 71

Fail Open

When a fault happens and can’t be masked or recovered, operations continue without the feature.

SLIDE 72

Fail Open, Example 1 at Etsy: Geo Targeting

50ms internal SLA on guessing location via client IP. If >50ms, we just don’t show local results.
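The fail-open pattern here is a deadline plus a fallback; a sketch, where `lookup` is a hypothetical stand-in for the real IP-to-location call and a thread pool provides the timeout:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Fail open on a latency SLA: try the geo lookup, but if it misses the
# 50ms budget, render the page without local results instead of
# erroring. lookup() is a hypothetical stand-in for the real call.
_pool = ThreadPoolExecutor(max_workers=4)

def local_results(lookup, client_ip, sla=0.050):
    future = _pool.submit(lookup, client_ip)
    try:
        return future.result(timeout=sla)
    except TimeoutError:
        return None  # fail open: page renders, just without local results
```

The caller treats `None` as "no local results", so a slow geo backend degrades the feature rather than the page.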

SLIDE 73

Fail Open, Example 2 at Etsy: Rate Limiting

App → Memcache. Internal SLA on incrementing counters and checking totals. If >SLA, we let the action continue, and throw a fire-and-forget counter update if we can.
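The distinctive choice here is that a broken or slow counter backend permits the action rather than blocking it; a sketch, where `incr` is a stand-in for the memcache increment-and-fetch:

```python
# Fail-open rate limiter: if the counter backend errors out, allow the
# action rather than blocking users on a rate-limit outage. incr() is
# a stand-in for the memcache increment-and-fetch call.
def allow_action(incr, key, limit):
    try:
        count = incr(key)
    except Exception:
        return True  # fail open: counter unavailable, let it through
    return count <= limit
```

A fail-closed limiter would return False in the except branch; which way to fail is exactly the trade-off the two slides contrast.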

SLIDE 74

SYSTEMIC

SLIDE 75

App Cache DB Search Logging Queue

SLIDE 77

Functional Resonance

SLIDE 78
SLIDE 79

Shop Stats

SLIDE 80

[diagram: the Shop Stats feature touches App, Cache, DB, Search, Logging, Queue]

SLIDE 81

[diagram: the Registration feature touches App, Cache, DB, Search, Logging, Queue]

SLIDE 82

Registration

SLIDE 83

Shop Stats, Logins, Registrations, Checkout, New Listings, Photos, Search, API, Rate limiting, Data Analysis, Search A/B analysis, Page performance, Search Ads, Editorial content, Email systems, Feedback, Messaging/Convos, Activity Feeds, Circles, Shipping, Mobile, Internationalization, Testing, Fraud

SLIDE 84

Systemic

Application/Functionality Health
Componential/Resource Health

SLIDE 85

Four Cornerstones (Erik Hollnagel)

  • Knowing What Has Happened (Learning)
  • Knowing What To Look For (Monitoring)
  • Knowing What To Do (Response)
  • Knowing What To Expect (Anticipation)

SLIDE 86
Anticipation

  • During design of architecture
  • During choice of technologies
  • During design of monitoring and metrics

SLIDE 87

TRADE-OFFS

SLIDE 88

“What could possibly go wrong?”

SLIDE 89

REQUISITE IMAGINATION

SLIDE 90

[diagram: nested sets, smallest to largest: situations considered by a novice designer, by an average designer, by an expert designer, and all possible foreseeable situations]

Adamski and Westrum, 2003

SLIDE 91

Anticipation

Failure Mode Effects Analysis (FMEA)
http://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis

Failure Mode Effects and Criticality Analysis (FMECA)
http://en.wikipedia.org/wiki/Failure_mode,_effects,_and_criticality_analysis

SLIDE 92

  • Architectural reviews
  • Go or No-Go meetings
  • “Game Day” exercises

SLIDE 93

Anticipation

Servers, Networks, Software, Applications, Monitoring, Metrics, Traffic

SLIDE 94

PEOPLE

SLIDE 95

  • Knowing What Has Happened (Learning)
  • Knowing What To Look For (Monitoring)
  • Knowing What To Do (Response)
  • Knowing What To Expect (Anticipation)
SLIDE 96

THE END
