: LEARNING TO BEND BUT NOT BREAK AT Whoops, something went wrong - - PowerPoint PPT Presentation



slide-1
SLIDE 1

LEARNING TO BEND BUT NOT BREAK AT

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5

Whoops, something went wrong…

Netflix Streaming Error

We’re having trouble playing this title right now. Please try again later or select a different title.

slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8

Shard A Shard B Shard C

Functional Sharding

Client Server

RPC tuning Bulkheads & Fallbacks

slide-9
SLIDE 9
slide-10
SLIDE 10

Non-Critical Service Owner.

How to fail well?

Critical Service Owner.

How to stay up in spite of change and turmoil?

Chaos Engineer.

How to help teams build more resilient systems?

slide-11
SLIDE 11

Service Criticality

Driver_free_car.jpg, CC BY-SA 3.0, BP63Vincente 2015, Wikimedia

slide-12
SLIDE 12

Service B Service A Service E Service C Service F Service D Service G

Service Criticality

Non-critical Critical KPI = Playback Starts Per Second (SPS)

slide-13
SLIDE 13
slide-14
SLIDE 14

Non-Critical Service Owner.

Critical Service Owner. Chaos Engineer.

slide-15
SLIDE 15

Badging

slide-16
SLIDE 16

My service is non-critical, who needs Chaos? How do you know your service is non-critical?

slide-17
SLIDE 17

https://github.com/Netflix/Hystrix

Insights Timeouts Bulkheads Fallbacks Circuit Breakers

slide-18
SLIDE 18

Badging Service API Service

Badging Service (Non-Critical)

Fallback
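A fallback for a non-critical dependency can be as small as a default value. The following is a minimal Python sketch of the idea, not Hystrix itself: `with_fallback`, `fetch_badges_from_service`, and `get_badges` are hypothetical names, and the simulated outage is an assumption made for illustration.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Run primary; if it raises, return the fallback's result instead."""
    try:
        return primary()
    except Exception:
        return fallback()


def fetch_badges_from_service(title_id: str) -> List[str]:
    # Hypothetical remote call; it always fails here to simulate an outage.
    raise TimeoutError("badging service unavailable")


def get_badges(title_id: str) -> List[str]:
    # Degrade to "no badges" rather than failing the whole API response.
    return with_fallback(lambda: fetch_badges_from_service(title_id), lambda: [])
```

A fallback like this only helps if every caller tolerates the default value, which is the kind of assumption a chaos experiment can check.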

slide-19
SLIDE 19
slide-20
SLIDE 20

Badging Service API Service

Surprise! Badging is Critical!

Fallback

slide-21
SLIDE 21
Gaps in Traditional Testing

  • Environmental factors may differ between test and production (config, data, etc.).
  • Systems behave differently under load than they do in a single unit or integration test.
  • Users react differently to failures than you expect.

slide-22
SLIDE 22

Non-Critical Service Owner.

Critical Service Owner. Chaos Engineer.

How to fail well?

  • Functioning fallbacks.
  • Use Chaos to close gaps in traditional testing methods.

slide-23
SLIDE 23

Chaos Engineer. Non-Critical Service Owner.

Critical Service Owner.

slide-24
SLIDE 24

Protect your service (and your customers)

slide-25
SLIDE 25

How can I decrease the blast radius of failures? How about functional sharding!

slide-26
SLIDE 26

API Service API Service API Service Playback Service URL Service

Playback Service Architecture

slide-27
SLIDE 27

NON-CRITICAL

Experience or Performance Impact

CRITICAL

Customer Streaming Impact

slide-28
SLIDE 28

API Service API Service API Service Critical Playback Service Critical URL Service Non-Critical Playback Service Non-Critical URL Service

Playback Service Functional Shards
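Functional sharding comes down to a routing decision made before the request reaches the Playback Service. A minimal sketch, assuming the caller can classify each operation up front; the operation name and host names below are hypothetical, since the slides do not list them.

```python
# Hypothetical operation names; the slides do not say which calls are critical.
CRITICAL_OPERATIONS = frozenset({"start_playback"})

# Hypothetical host names for the two shards.
SHARD_HOSTS = {
    "critical": "critical-playback.example.internal",
    "non-critical": "non-critical-playback.example.internal",
}


def choose_shard(operation: str) -> str:
    """Return the name of the shard that should serve this operation."""
    return "critical" if operation in CRITICAL_OPERATIONS else "non-critical"


def host_for(operation: str) -> str:
    """Return the host of the shard that should serve this operation."""
    return SHARD_HOSTS[choose_shard(operation)]
```

Because anything not explicitly marked critical lands on the non-critical shard, a failure there cannot take down the streaming path.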

slide-29
SLIDE 29

CC BY-NC 2.5, Randall Munroe, xkcd.com

slide-30
SLIDE 30

API Service API Service API Service Critical Playback Service URL Service Non-Critical Playback Service Non-Critical URL Service

Experimenting with Shards

slide-31
SLIDE 31
slide-32
SLIDE 32

Customer Behavior Insights

API Service API Service API Service Critical Playback Service URL Service Non-Critical Playback Service Non-Critical URL Service 25% More Traffic

slide-33
SLIDE 33

How do I confirm my system is tuned properly? Inject latency, of course!

slide-34
SLIDE 34
Dependency Tuning

  • Retries
  • Timeouts
  • Load balancing strategies
  • Concurrency limits
  • Circuit breakers

Playback Service Customer Tag Service
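Of the tuning knobs on this slide, the circuit breaker is the one whose behavior is hardest to picture. Below is a simplified Python sketch, not the Hystrix implementation: the class, its parameters, and the thresholds are assumptions, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                # Open: skip the dependency entirely.
                return fallback()
            # Cool-down elapsed: fall through and let one trial call run.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, the dependency receives no traffic at all, which gives an overloaded service room to recover.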

slide-35
SLIDE 35

Calendar*, CC BY 2.0, Dafne Cholete 2011, Flickr

slide-36
SLIDE 36

Playback Service → Customer Tag Service

Customer Tag Service Playback Service

slide-37
SLIDE 37

Customer Tag Service Playback Service

Latency Injection - Round 1

slide-38
SLIDE 38

Customer Tag Service Playback Service

Latency Injection - Round 2

slide-39
SLIDE 39

Latency Injection - Round 2

Playback Service → 1. Customer Tag Service, 2. URL Service

300ms timeout, 350ms. Out of time!!
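The arithmetic behind "Out of time!!" can be written down directly. This sketch assumes the two downstream calls run one after the other and that 350ms is their combined duration under injected latency; the slide gives only the totals, so the per-call split below is hypothetical.

```python
from typing import Sequence


def total_latency_ms(sequential_calls_ms: Sequence[int]) -> int:
    """Sequential calls add up: the caller waits for each one in turn."""
    return sum(sequential_calls_ms)


def fits_timeout(caller_timeout_ms: int, sequential_calls_ms: Sequence[int]) -> bool:
    """True if the combined downstream time fits inside the caller's timeout."""
    return total_latency_ms(sequential_calls_ms) <= caller_timeout_ms


# Assumed split of the 350ms total; the slide does not give per-call numbers.
calls_ms = [200, 150]
print(total_latency_ms(calls_ms))   # prints 350
print(fits_timeout(300, calls_ms))  # prints False, since 350 > 300
```

A check like this is only as good as the numbers fed into it, which is why the slides measure the real durations by injecting latency.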

slide-40
SLIDE 40
Continuous Experimentation FTW!

  • Fewer changes between experiments make it easier to isolate the regression.
  • Fine-grained experiments scope the investigation (as opposed to outages, where there are lots of red herrings).

slide-41
SLIDE 41

Chaos Engineer. Non-Critical Service Owner.

Critical Service Owner.

How to stay up in spite of change and turmoil?

  • Functional sharding for fault isolation.
  • Tune RPC calls.
  • Use Chaos to validate config and resiliency strategies.
slide-42
SLIDE 42

Chaos Engineer.

Non-Critical Service Owner. Critical Service Owner.

slide-43
SLIDE 43
slide-44
SLIDE 44

How do you help teams build more resilient systems? We need to do more of the heavy lifting. Perhaps the Principles of Chaos can help!

slide-45
SLIDE 45

Principles of Chaos

  • Minimize Blast Radius
  • Build a Hypothesis around Steady State Behavior
  • Vary Real-world Events
  • Run Experiments in Production
  • Automate Experiments to Run Continuously

https://principlesofchaos.org/

slide-46
SLIDE 46

Rock-em, CC BY-SA 2.0, Ariel Waldmane 2009, Flickr

Test v. Production

slide-47
SLIDE 47

How can we Minimize Blast Radius? Safety, safety, safety!!

slide-48
SLIDE 48

Kill Switch

slide-49
SLIDE 49

Service B Service A Service C Service B (Control) Service B (Experiment)

Canary Strategy

0.5% 0.5%
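One way to carve out two equal 0.5% slices of traffic is to hash a stable identifier into buckets. This is a sketch under assumptions, not a description of how ChAP routes traffic: `assign_group`, the customer-ID key, and the use of SHA-256 are all illustrative choices.

```python
import hashlib


def assign_group(customer_id: str, control_pct: float = 0.5,
                 experiment_pct: float = 0.5) -> str:
    """Map a customer to 'control', 'experiment', or 'baseline' deterministically."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # 10,000 buckets, so each bucket represents 0.01% of customers.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    control_cutoff = round(control_pct * 100)
    experiment_cutoff = control_cutoff + round(experiment_pct * 100)
    if bucket < control_cutoff:
        return "control"
    if bucket < experiment_cutoff:
        return "experiment"
    return "baseline"
```

Keeping the control and experiment groups the same size means their metrics can be compared directly, and the same customer always lands in the same group for the duration of a run.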

slide-50
SLIDE 50

Limit Impact

Runs In Progress

Experiment   Cluster      Status
Latency      api-prod     In Progress
Latency      dredd-prod   In Progress
Failure      api-prod     Queued

slide-51
SLIDE 51

Limit When Experiments can Run

Safety First during the Holidays

slide-52
SLIDE 52

Ensure Failures are Addressed

slide-53
SLIDE 53

Fail Open

1. Control errors too high.
2. Errors in chaos code unrelated to the experiment in question.
3. Platform components crashing (monitoring, worker nodes, etc).
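The three fail-open conditions on this slide can be read as a single stop check that the platform evaluates while an experiment runs. A minimal sketch, assuming those signals are available as simple values; the `ExperimentHealth` fields and the 1% threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ExperimentHealth:
    control_error_rate: float          # fraction of control-group requests failing
    chaos_code_errors: int             # errors raised by the chaos tooling itself
    platform_components_healthy: bool  # monitoring, worker nodes, etc.


def should_stop_experiment(health: ExperimentHealth,
                           max_control_error_rate: float = 0.01) -> bool:
    """Stop injecting failure if any fail-open condition holds."""
    return (
        health.control_error_rate > max_control_error_rate
        or health.chaos_code_errors > 0
        or not health.platform_components_healthy
    )
```

In each case the experiment can no longer produce a trustworthy result, so the safe default is to stop injecting failure and return traffic to normal.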

slide-54
SLIDE 54

How should we Build a Hypothesis around Steady State? Observability is key! Add effective monitoring, analysis, and insights.

slide-55
SLIDE 55
slide-56
SLIDE 56

Insights

slide-57
SLIDE 57

Automated Canary Analysis (ACA)

https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69

slide-58
SLIDE 58

ChAP ACA Configurations

  • Validate the experiment itself
  • Validate the real-time monitoring didn’t miss anything
  • Check for service failures even if they didn’t cause an impact in KPIs
  • See if your service is approaching an unhealthy state

slide-59
SLIDE 59

How do you Vary Real-world Events in an automated fashion? By carefully designing and prioritizing your experiments, of course!

slide-60
SLIDE 60

Understand the Service Under Test

Dependency Insights:

  • Timeouts
  • Retries
  • % of Requests Involved
  • Requests Per Second
  • Latency
  • Hystrix Commands
      ○ Fallbacks
      ○ Timeouts

slide-61
SLIDE 61
slide-62
SLIDE 62

Evaluate Safety

NOT SAFE TO FAIL!!!

slide-63
SLIDE 63

Can more automation eventually lead to fewer experiments?

slide-64
SLIDE 64

Prioritize Experiments

Retries Traffic Percentage Failure Latency Experiment Type Aging

slide-65
SLIDE 65

Generate Experiments

Failure Latency Failure Latency

slide-66
SLIDE 66

Is it time to Run Experiments in Production? Here we go!

slide-67
SLIDE 67
slide-68
SLIDE 68

What happened?

14

Vulnerabilities Outages Confidence Tooling Gaps

slide-69
SLIDE 69

Example Finding

License Service Playback Service

376 ms Latency

No Fallback!

slide-70
SLIDE 70

88.85% of cluster traffic

10 threads

Thread Pool Rejections

Timeouts

Circuit Breaker
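Thread pool rejections are what a bulkhead produces once its concurrency limit is reached. The sketch below approximates that with a semaphore instead of a real thread pool, which is a simplification; the `Bulkhead` class and its counter are hypothetical, and the limit of 10 in the usage line simply mirrors the slide.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class Bulkhead:
    """Cap concurrent calls to one dependency; reject the overflow immediately."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.rejections = 0

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        # Non-blocking acquire: a full bulkhead rejects instead of queueing.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                self.rejections += 1
            return fallback()
        try:
            return fn()
        finally:
            self._slots.release()


license_bulkhead = Bulkhead(max_concurrent=10)
```

When a dependency slows down, its slots stay occupied longer, so rejections rise even if the request rate is unchanged. That is why a limit that looks adequate in normal operation can fall short under injected latency.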

slide-71
SLIDE 71

Fully validated fix in tool before rollout!

slide-72
SLIDE 72

After a day's worth of data, the results are looking fantastic. Every negative metric [for that Hystrix command] had a drastic improvement, and some by an order of magnitude.

-Robert Reta, Playback Licensing

slide-73
SLIDE 73

What else can be safer?

slide-74
SLIDE 74

Chaos Engineer.

Non-Critical Service Owner. Critical Service Owner. How do you help teams build more resilient systems?

  • Apply the “Principles of Chaos” to tooling.
  • Manage the heavy lifting.
slide-75
SLIDE 75

You Must be This Tall to Ride?

slide-76
SLIDE 76

Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.

How to fail well?

  • Functioning fallbacks.
  • Use Chaos to close gaps in traditional testing methods.

How to help teams build more resilient systems?

  • Apply the “Principles of Chaos” to tooling.
  • Manage the heavy lifting.

How to stay up in spite of change and turmoil?

  • Functional sharding for fault isolation.
  • Tune RPC calls.
  • Use Chaos to validate config and resiliency strategies.
slide-77
SLIDE 77

You Can Either Curl Up In A Ball And Die… Or You Can Stand Up And Say, “We’re Different. We’re The Strong Ones, And You Can’t Break Us!”

Haley Tucker

Senior Software Engineer Chaos Engineering

@hwilson1204