Monkeys in Lab Coats Automating Failure Testing Research at The - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Monkeys in Lab Coats

Automating Failure Testing Research at

slide-2
SLIDE 2

The whole is greater than the sum of its parts.

  • Aristotle [Metaphysics]

slide-3
SLIDE 3

The Professor vs The Practitioner

Peter Alvaro

  • Ex-Berkeley, Ex-Industry
  • Assistant Prof @ Santa Cruz
  • Misses the calm of PhD life
  • Likes prototyping stuff

Kolton Andrus

  • Ex-Netflix, Ex-Amazon
  • ‘Chaos’ Engineer
  • Misses his actual pager
  • Likes breaking stuff

slide-4
SLIDE 4

Measures of Success

Academic

  • H-Index
  • Grant warchest
  • Department ranking

Industry

  • Availability (e.g. 99.99% uptime)
  • Number of incidents
  • Reduced operational burden

slide-5
SLIDE 5

An Unlikely Team?

slide-6
SLIDE 6
slide-7
SLIDE 7

but ... it’s manual

Works Great!

slide-8
SLIDE 8

Surely there is a better way ...

slide-9
SLIDE 9
slide-10
SLIDE 10

Free lunch?

slide-11
SLIDE 11

The End?

(Academia + Industry)

slide-12
SLIDE 12

Let’s build it

“Can we, pretty please?”

slide-13
SLIDE 13

Freedom and Responsibility

Core Value

slide-14
SLIDE 14

Responsibility

Academic Industry Prove that it works Show that it scales Find real bugs

slide-15
SLIDE 15

The Big Idea

Lineage Driven Fault Injection

slide-16
SLIDE 16

What could possibly go wrong?

Consider computation involving 100 services

Search Space: 2^100 executions

slide-17
SLIDE 17

“Depth” of bugs

Single faults

Search Space: 100 executions

slide-18
SLIDE 18

“Depth” of bugs

Combination of 4 faults

Search Space: 3M executions

slide-19
SLIDE 19

“Depth” of bugs

Combination of 7 faults

Search Space: 16B executions
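The search-space figures on these slides can be reproduced with a few lines of arithmetic. The sketch below assumes the slide's 100 services and treats a "depth k" bug as any choice of k services failing together; the function names are illustrative only.

```python
from math import comb

SERVICES = 100  # service count taken from the slide


def fault_combinations(services: int, depth: int) -> int:
    """Number of distinct ways to pick `depth` failing services."""
    return comb(services, depth)


def all_fault_sets(services: int) -> int:
    """Every subset of services may fail, so the full space is 2^services."""
    return 2 ** services


if __name__ == "__main__":
    print(f"all subsets: {all_fault_sets(SERVICES):,}")
    for depth in (1, 4, 7):
        print(f"depth {depth}: {fault_combinations(SERVICES, depth):,}")
```

The exact counts are 100, 3,921,225 and 16,007,560,800, which the slides round to 100, 3M and 16B.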

slide-20
SLIDE 20

Random Search

Search Space: 2^100 executions

slide-21
SLIDE 21

Engineer-guided Search

Search Space: ???

slide-22
SLIDE 22

Fault-tolerance “is just” redundancy

slide-23
SLIDE 23

How do we find the redundancy?

Could a bad ‘thing’ ever happen?

Why did a good ‘thing’ happen?

slide-24
SLIDE 24

Lineage-driven fault injection

Why did a good thing happen? Consider its lineage.

[Figure: lineage graph with nodes "The write is stable", "Stored on RepA", "Stored on RepB", Bcast1, Bcast2, and two Client nodes]

slide-25
SLIDE 25

Lineage-driven fault injection

Why did a good thing happen? Consider its lineage.

What could have gone wrong? Faults are cuts in the lineage graph.

Is there a cut that breaks all supports?

[Figure: lineage graph with nodes "The write is stable", "Stored on RepA", "Stored on RepB", Bcast1, Bcast2, and two Client nodes]

slide-26
SLIDE 26

Lineage-driven fault injection

Why did a good thing happen? Consider its lineage.

What could have gone wrong? Faults are cuts in the lineage graph.

Is there a cut that breaks all supports?

[Figure: lineage graph with nodes "The write is stable", "Stored on RepA", "Stored on RepB", Bcast1, Bcast2, and two Client nodes]
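One way to read "faults are cuts in the lineage graph" is that every path supporting the good outcome yields one clause: at least one component on that path must fail. The sketch below assumes a hypothetical edge structure for the graph on this slide (the flattened figure does not preserve the edges) and assumes the Client itself is never a fault candidate.

```python
# Hypothetical lineage graph: each node maps to the nodes that support it.
# The edges are an assumption; only the node labels come from the slide.
LINEAGE = {
    "The write is stable": ["RepA", "RepB"],
    "RepA": ["Bcast1", "Bcast2"],
    "RepB": ["Bcast1", "Bcast2"],
    "Bcast1": ["Client"],
    "Bcast2": ["Client"],
    "Client": [],
}


def supports(node, graph):
    """Return every path from `node` down to a leaf of the graph."""
    children = graph[node]
    if not children:
        return [[node]]
    return [[node] + rest
            for child in children
            for rest in supports(child, graph)]


def clauses(goal, graph, immune=("Client",)):
    """One clause per support: the components whose failure cuts that path."""
    return [frozenset(n for n in path[1:] if n not in immune)
            for path in supports(goal, graph)]


if __name__ == "__main__":
    for clause in clauses("The write is stable", LINEAGE):
        print(" OR ".join(sorted(clause)))
```

Under these assumptions the four clauses pair each replica with each broadcast, and a fault set breaks the outcome only if it satisfies all of them at once.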

slide-27
SLIDE 27

What would have to go wrong?

(RepA OR Bcast1)

[Figure: lineage graph with nodes "The write is stable", "Stored on RepA", "Stored on RepB", Bcast1, Bcast2, and two Client nodes]

slide-28
SLIDE 28

What would have to go wrong?

(RepA OR Bcast1) AND (RepA OR Bcast2)

[Figure: lineage graph with nodes "The write is stable", "Stored on RepA", "Stored on RepB", Bcast1, Bcast2, and two Client nodes]

slide-29
SLIDE 29

What would have to go wrong?

(RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2)

[Figure: lineage graph with nodes "The write is stable", "Stored on RepA", "Stored on RepB", Bcast1, Bcast2, and two Client nodes]

slide-30
SLIDE 30

What would have to go wrong?

(RepA OR Bcast1) AND (RepA OR Bcast2) AND (RepB OR Bcast2) AND (RepB OR Bcast1)

[Figure: lineage graph with nodes "The write is stable", "Stored on RepA", "Stored on RepB", Bcast1, Bcast2, and two Client nodes]

slide-31
SLIDE 31

Lineage-driven fault injection

[Figure: lineage graph with nodes "The write is stable", "Stored on RepA", "Stored on RepB", Bcast1, Bcast2, and two Client nodes]

Hypothesis: {Bcast1, Bcast2}
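The hypothesis can be checked by solving the four-clause formula from the preceding slides. The sketch below brute-forces the minimal sets of failed components that satisfy every clause; a real implementation would presumably hand the formula to a solver, so treat the enumeration as illustrative only.

```python
from itertools import combinations

# The CNF from the slides: each clause lists components, one of which must fail.
CLAUSES = [
    {"RepA", "Bcast1"},
    {"RepA", "Bcast2"},
    {"RepB", "Bcast2"},
    {"RepB", "Bcast1"},
]


def minimal_fault_sets(clauses):
    """Enumerate the minimal sets of faults that satisfy every clause."""
    variables = sorted(set().union(*clauses))
    solutions = []
    for size in range(1, len(variables) + 1):
        for combo in combinations(variables, size):
            candidate = set(combo)
            if any(found <= candidate for found in solutions):
                continue  # a smaller solution already covers this one
            if all(clause & candidate for clause in clauses):
                solutions.append(candidate)
    return solutions


if __name__ == "__main__":
    for faults in minimal_fault_sets(CLAUSES):
        print(sorted(faults))
```

For this formula there are two minimal solutions: {Bcast1, Bcast2}, the hypothesis on the slide, and {RepA, RepB}.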

slide-32
SLIDE 32

Search Space Reduction

Each experiment either finds a bug or reduces the search space.

slide-33
SLIDE 33

The prototype system “Molly”

Recipe:

  1. Start with a successful outcome. Work backwards.
  2. Ask why it happened: Lineage
  3. Convert lineage to a boolean formula and solve
  4. Lather, rinse, repeat

[Figure: loop diagram with stages 1. Success, 2. Lineage, 3. CNF and 4. Repeat, connected by the steps Why?, Encode, Solve and Fail]
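The recipe can be sketched as a small loop. Everything here is a toy under stated assumptions: `run` is a hypothetical system under test that tries a primary path and then a fallback, the component names are borrowed from the earlier example, and the "solve" step is a brute-force search, not a real solver.

```python
from itertools import combinations

COMPONENTS = ["Bcast1", "Bcast2", "RepA", "RepB"]


def run(faults):
    """Hypothetical system under test: primary path first, then a fallback.

    Returns the components that supported success, or None on failure.
    """
    for path in ({"Bcast1", "RepA"}, {"Bcast2", "RepB"}):
        if not path & faults:
            return frozenset(path)
    return None


def hypotheses(clauses, tried):
    """Smallest untried fault sets that hit every known support."""
    for size in range(1, len(COMPONENTS) + 1):
        found = [frozenset(combo)
                 for combo in combinations(COMPONENTS, size)
                 if frozenset(combo) not in tried
                 and all(clause & set(combo) for clause in clauses)]
        if found:
            return found
    return []


def molly_loop():
    """Return (fault set that broke the outcome or None, experiments run)."""
    clauses = [run(frozenset())]      # 1. start with a successful outcome
    tried, experiments = set(), 0
    while True:
        candidates = hypotheses(clauses, tried)   # 3. encode and solve
        if not candidates:
            return None, experiments  # nothing left to try
        faults = candidates[0]
        tried.add(faults)
        experiments += 1
        support = run(faults)         # inject the faults and observe
        if support is None:
            return faults, experiments            # found a bug
        if support not in clauses:
            clauses.append(support)   # 2. new lineage shrinks the search


if __name__ == "__main__":
    print(molly_loop())
```

In this toy the first experiment injects {Bcast1}, the fallback path succeeds and contributes a second clause, and the second experiment injects {Bcast1, Bcast2} and breaks the outcome.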
slide-34
SLIDE 34

The Big Idea

Meets Production

slide-35
SLIDE 35

1. Start with a successful outcome

[Figure: loop diagram with stages 1. Success, 2. Lineage, 3. CNF and 4. Repeat, connected by the steps Why?, Encode, Solve and Fail]

slide-36
SLIDE 36

What is success?

slide-37
SLIDE 37
slide-38
SLIDE 38

“Start with the customer and work backwards”

Leadership Principle

slide-39
SLIDE 39
slide-40
SLIDE 40
slide-41
SLIDE 41

Lesson 1

Work backwards from what you know

slide-42
SLIDE 42

2. Ask why it happened

[Figure: loop diagram with stages 1. Success, 2. Lineage, 3. CNF and 4. Repeat, connected by the steps Why?, Encode, Solve and Fail]

slide-43
SLIDE 43

Request Tracing

slide-44
SLIDE 44
slide-45
SLIDE 45

Request Tracing

slide-46
SLIDE 46

Alternate Execution

slide-47
SLIDE 47

Evolution over time

slide-48
SLIDE 48

Redundancy through History

slide-49
SLIDE 49

Lesson 2

Meet in the middle

slide-50
SLIDE 50

3. Solve

[Figure: loop diagram with stages 1. Success, 2. Lineage, 3. CNF and 4. Repeat, connected by the steps Why?, Encode, Solve and Fail]

slide-51
SLIDE 51

A “small” matter of code

slide-52
SLIDE 52

4. Lather, Rinse, Repeat

[Figure: loop diagram with stages 1. Success, 2. Lineage, 3. CNF and 4. Repeat, connected by the steps Why?, Encode, Solve and Fail]

slide-53
SLIDE 53

Turn the crank, right?

slide-54
SLIDE 54

Idempotence

slide-55
SLIDE 55
slide-56
SLIDE 56
slide-57
SLIDE 57

Bins and Balls

[Figure: requests r and r’ sorted into bins labeled Class 1, Class 2, Class 3, ..., Class n]

slide-58
SLIDE 58

Predicting Request Graphs

[Figure: a Request mapped to Class n]

slide-59
SLIDE 59

Predicting Request Graphs

[Figure: a Request mapped to Class n]

Some function f: Requests → Classes
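A function from requests to classes makes the bins-and-balls picture concrete: requests that land in the same bin are presumably treated as interchangeable for experiments. The sketch below is an assumption about how such bucketing might look; the request tuples and the classifier are hypothetical.

```python
from collections import defaultdict


def bucket_requests(requests, classify):
    """Group requests into bins keyed by the class that `classify` assigns."""
    bins = defaultdict(list)
    for request in requests:
        bins[classify(request)].append(request)
    return dict(bins)


def representatives(bins):
    """Pick one request per class to stand in for the whole bin."""
    return {cls: members[0] for cls, members in bins.items()}


if __name__ == "__main__":
    # Hypothetical requests: (request id, endpoint). The endpoint is the class.
    requests = [("r1", "home"), ("r2", "search"), ("r3", "home")]
    bins = bucket_requests(requests, classify=lambda req: req[1])
    print(bins)
    print(representatives(bins))
```

The classifier is passed in as a parameter because the choice of f is exactly the open question the following slides explore.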

slide-60
SLIDE 60

Predicting Request Graphs

f(Request) = Class n

slide-61
SLIDE 61
slide-62
SLIDE 62

Solve the Machine Learning problem?

Or the Failure Testing one?
slide-63
SLIDE 63

Simplest thing that will work?

slide-64
SLIDE 64

Falcor Path Mapping

["bookmarks", "recent"]
["playlist", 0, "name"]
["ratings"]

=> "bookmarks,playlist,ratings"
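The mapping on this slide appears to keep only the first element of each Falcor path and join the results with commas. The sketch below reproduces the slide's example under that reading; whether the original also sorts or de-duplicates the path roots is an assumption, and `request_class` is an illustrative name.

```python
def request_class(paths):
    """Map a request's Falcor paths to a class key built from the path roots.

    Assumption: roots are de-duplicated and sorted so that the key does not
    depend on the order in which paths were requested.
    """
    roots = sorted({path[0] for path in paths})
    return ",".join(roots)


if __name__ == "__main__":
    paths = [["bookmarks", "recent"], ["playlist", 0, "name"], ["ratings"]]
    print(request_class(paths))
```

For the three paths on the slide this yields "bookmarks,playlist,ratings".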

slide-65
SLIDE 65

Lesson 3

Adapt the theory to the reality

slide-66
SLIDE 66

Many moons passed...

slide-67
SLIDE 67

Does it work?

YES!

slide-68
SLIDE 68

Case study: “Netflix AppBoot”

Services: ~100
Search space (executions): 2^100 (about 1,000,000,000,000,000,000,000,000,000,000)
Experiments performed: 200
Critical bugs found: 11

slide-69
SLIDE 69

Future Work

  • Richer device metrics
  • Request class creation
  • Better experiment selection
  • Search prioritization
  • Richer lineage collection
  • Exploring temporal interleavings

slide-70
SLIDE 70
slide-71
SLIDE 71
slide-72
SLIDE 72
slide-73
SLIDE 73
slide-74
SLIDE 74
slide-75
SLIDE 75
slide-76
SLIDE 76
slide-77
SLIDE 77
slide-78
SLIDE 78
slide-79
SLIDE 79

Lessons

  • Work backwards from what you know
  • Meet in the middle
  • Adapt the theory to the reality

slide-80
SLIDE 80

Academia + Industry

slide-81
SLIDE 81

Academia + Industry Academia Industry

slide-82
SLIDE 82

Thank You!

Peter Alvaro
@palvaro
palvaro@ucsc.edu

Kolton Andrus
@KoltonAndrus
kolton@gremlininc.com

slide-83
SLIDE 83

References

  • Netflix Blog on ‘Automated Failure Testing’

http://techblog.netflix.com/2016/01/automated-failure-testing.html

  • Netflix Blog on ‘Failure Injection Testing’

http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html

  • ‘Lineage Driven Fault Injection’

http://people.ucsc.edu/~palvaro/molly.pdf

  • ‘Automating Failure Testing Research at Scale’

https://people.ucsc.edu/~palvaro/socc16.pdf

slide-84
SLIDE 84

Photo Credits

  • http://etc.usf.edu/clipart/4000/4048/children_7_lg.gif
  • http://cdn.c.photoshelter.com/img-get2/I0000MIN8fL0q8AA/fit=1000x750/taiwan-hiking-river-tracing-walking.jpg

  • http://i.imgur.com/iWKad22.jpg
  • https://blogs.endjin.com/2014/05/event-stream-manipulation-using-rx-part-2/
  • http://youpivot.com/category/features/
  • https://www.cloudave.com/33427/boards-need-evolve-time/
  • https://www.linkedin.com/pulse/amelia-packager-missing-data-imputation-ramprakash-veluchamy