LEARNING TO BEND BUT NOT BREAK AT

Whoops, something went wrong… Netflix Streaming Error We’re having trouble playing this title right now. Please try again later or select a different title.

Functional Sharding RPC tuning Shard A Shard B Shard C Client Server Bulkheads & Fallbacks

How to stay up in spite of change and turmoil? How to fail well? How to help teams build more resilient systems? Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.

Service Criticality Driver_free_car.jpg, CC BY-SA 3.0, BP63Vincente 2015, Wikimedia

Service Criticality KPI = Playback Starts Per Second (SPS) Service A Service F Service B Service C Service G Non-critical Service D Service E Critical

Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.

Badging

My service is non-critical, who needs Chaos? How do you know your service is non-critical?

https://github.com/Netflix/Hystrix Insights Bulkheads Circuit Breakers Timeouts Fallbacks

Badging Service (Non-Critical) API Service Fallback Badging Service

Surprise! Badging is Critical! API Service Fallback Badging Service
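The badging example depends on the API Service degrading gracefully when the Badging Service fails. A minimal Python sketch of that fallback pattern follows. It is not Hystrix's API; `call_with_fallback`, `fetch_badges`, and the empty-list default are hypothetical names chosen for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Run the primary call; on any exception, return the fallback result."""
    try:
        return primary()
    except Exception:
        # A non-critical dependency failed: degrade instead of propagating.
        return fallback()


def fetch_badges() -> list[str]:
    # Hypothetical stand-in for a remote call to the Badging Service.
    raise TimeoutError("badging service did not respond")


badges = call_with_fallback(fetch_badges, lambda: [])
print(badges)  # [] (the page would render without badges)
```

The fallback branch is code that rarely runs in normal operation, so it only protects the caller if it is exercised, which is the gap the Chaos experiments are meant to close.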

Gaps in Traditional Testing ● Environmental factors may differ between test and production (config, data, etc.) ● Systems behave differently under load than they do in a single unit or integration test ● Users react differently to failures than you expect.

How to fail well? ● Functioning fallbacks. ● Use Chaos to close gaps in traditional testing methods. Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.

Critical Service Owner. Non-Critical Service Owner. Chaos Engineer.

Protect your service (and your customers)

How can I decrease the blast radius of failures? How about functional sharding!

Playback Service Architecture API Service API Service API Service Playback Service URL Service

CRITICAL NON-CRITICAL Customer Experience or Streaming Performance Impact Impact

Playback Service Functional Shards API Service API Service API Service Critical Playback Service Non-Critical Playback Service Critical URL Service Non-Critical URL Service
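One way to picture the functional shards is as a routing decision made by the caller. The sketch below is an assumption about how such routing could look, not the deck's implementation; the request types and cluster names are hypothetical.

```python
# Hypothetical request types; which calls count as critical is an assumption.
CRITICAL_REQUESTS = {"start_playback"}

# Hypothetical cluster names for the two functional shards.
SHARDS = {True: "playback-critical", False: "playback-noncritical"}


def pick_shard(request_type: str) -> str:
    """Route a request to the shard that matches its criticality."""
    return SHARDS[request_type in CRITICAL_REQUESTS]


print(pick_shard("start_playback"))    # playback-critical
print(pick_shard("prefetch_artwork"))  # playback-noncritical
```

Because the two shards are separate clusters, a failure or overload in the non-critical shard cannot consume capacity that critical requests need.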

CC BY-NC 2.5, Randall Munroe, xkcd.com

Experimenting with Shards API Service API Service API Service Critical Playback Service Non-Critical Playback Service URL Service Non-Critical URL Service

Customer Behavior Insights API Service API Service API Service Critical Playback Service Non-Critical Playback Service 25% More Non-Critical Traffic URL Service URL Service

How do I confirm my system is tuned properly? Inject latency, of course!

Dependency Tuning Playback Service → Customer Tag Service ● Retries ● Timeouts ● Load balancing strategies ● Concurrency limits ● Circuit breakers

Calendar*, CC BY 2.0, Dafne Cholete 2011, Flickr

Playback Service → Customer Tag Service Playback Service Customer Tag Service

Latency Injection - Round 1 Playback Service Customer Tag Service

Latency Injection - Round 2 Playback Service Customer Tag Service

Latency Injection - Round 2 300ms timeout Playback Service 350ms Out of time!! 1. Customer Tag Service 2. URL Service
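The Round 2 finding is a timeout-budget problem: two sequential downstream calls can take longer than the caller's own timeout allows. A small sketch of that check follows. The 300 ms and 350 ms figures come from the slide; the split into 200 ms and 150 ms and the zero retries are hypothetical values chosen only so the total matches.

```python
def worst_case_ms(calls: list[tuple[int, int]]) -> int:
    """Worst-case time for sequential calls given (timeout_ms, retries) each."""
    return sum(timeout * (1 + retries) for timeout, retries in calls)


CALLER_TIMEOUT_MS = 300  # from the slide

# Hypothetical per-dependency settings chosen to add up to the slide's 350 ms.
sequential_calls = [(200, 0), (150, 0)]  # Customer Tag Service, URL Service

total = worst_case_ms(sequential_calls)
print(total, total <= CALLER_TIMEOUT_MS)  # 350 False
```

A check like this could run whenever a timeout or retry setting changes, but it only covers configuration; the latency injection is still what confirms the real behavior.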

Continuous Experimentation FTW! ● Fewer changes between experiments make it easier to isolate the regression. ● Fine-grained experiments scope the investigation (as opposed to outages, where there are lots of red herrings).

How to stay up in spite of change and turmoil? ● Functional sharding for fault isolation. ● Tune RPC calls. ● Use Chaos to validate config and resiliency strategies. Critical Service Owner. Non-Critical Service Owner. Chaos Engineer.

Chaos Engineer. Non-Critical Service Owner. Critical Service Owner.

How do you help teams build more resilient systems? We need to do more of the heavy lifting. Perhaps the Principles of Chaos can help!

Principles of Chaos ● Minimize Blast Radius ● Build a Hypothesis around Steady State Behavior ● Vary Real-world Events ● Run Experiments in Production ● Automate Experiments to Run Continuously https://principlesofchaos.org/

Test v. Production Rock-em, CC BY-SA 2.0, Ariel Waldmane 2009, Flickr

How can we Minimize Blast Radius? Safety, safety, safety!!

Kill Switch

Canary Strategy Service A Service B Service C 0.5% 0.5% Service B (Control) Service B (Experiment)
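The canary strategy sends 0.5% of traffic to a control cluster and another 0.5% to an experiment cluster. One common way to make that split stable per customer is hash-based bucketing, sketched below. This is an illustrative assumption, not a description of how the tooling in the talk assigns traffic; `assign_group` and the customer ID format are hypothetical.

```python
import hashlib


def assign_group(customer_id: str, fraction: float = 0.005) -> str:
    """Deterministically map a customer to control, experiment, or neither."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    # First 8 bytes as an integer, scaled to a point in [0, 1).
    point = int.from_bytes(digest[:8], "big") / 2**64
    if point < fraction:
        return "control"
    if point < 2 * fraction:
        return "experiment"
    return "none"


groups = [assign_group(f"customer-{i}") for i in range(100_000)]
print(groups.count("control"), groups.count("experiment"))
```

Keeping the two groups the same size means the control cluster is a like-for-like baseline, and capping both at a small fraction bounds the blast radius of any experiment.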

Limit Impact
Runs In Progress
Experiment   Cluster      Status
Latency      api-prod     In Progress
Latency      dredd-prod   In Progress
Failure      api-prod     Queued

Limit When Experiments can Run Safety First during the Holidays
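Limiting when experiments can run can be expressed as a simple gate in the scheduler. The sketch below is hypothetical throughout: the slide does not give the actual hours or dates, so the weekday window and the blackout dates are made-up examples.

```python
from datetime import date, datetime

# Hypothetical blackout dates; the slide only says to be careful over holidays.
BLACKOUT_DATES = {date(2019, 12, 25), date(2020, 1, 1)}


def may_run(now: datetime) -> bool:
    """Allow experiments only on weekdays, 9:00-15:00, outside blackout dates."""
    if now.date() in BLACKOUT_DATES:
        return False
    if now.weekday() >= 5:  # Saturday or Sunday
        return False
    return 9 <= now.hour < 15


print(may_run(datetime(2019, 12, 24, 10)))  # True
print(may_run(datetime(2019, 12, 25, 10)))  # False
```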

Ensure Failures are Addressed

Fail Open 1. Control errors too high. 2. Errors in chaos code unrelated to the experiment in question. 3. Platform components crashing (monitoring, worker nodes, etc).
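The three fail-open conditions above can be read as a single stop check that the experiment runner evaluates while an experiment is in progress. A minimal sketch, assuming hypothetical field names and a made-up 5% threshold for the control error rate:

```python
from dataclasses import dataclass


@dataclass
class ExperimentHealth:
    control_error_rate: float  # errors / requests in the control group
    chaos_code_errors: int     # errors raised by the injection code itself
    platform_healthy: bool     # monitoring, worker nodes, etc. are up


def should_stop_injecting(
    h: ExperimentHealth, max_control_error_rate: float = 0.05
) -> bool:
    """Return True when any of the slide's three fail-open conditions holds."""
    return (
        h.control_error_rate > max_control_error_rate
        or h.chaos_code_errors > 0
        or not h.platform_healthy
    )


print(should_stop_injecting(ExperimentHealth(0.01, 0, True)))  # False
print(should_stop_injecting(ExperimentHealth(0.20, 0, True)))  # True
```

When the check returns True, the safe default is to stop injecting and let traffic flow normally, since the experiment can no longer be trusted.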

How should we Build a Hypothesis around Steady State? Observability is key! Add effective monitoring, analysis, and insights.

Insights

Automated Canary Analysis (ACA) https://medium.com/netflix-techblog/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69

ChAP ACA Configurations ● Validate the experiment itself ● Validate the real-time monitoring didn’t miss anything ● Check for service failures even if they didn’t cause an impact in KPIs ● See if your service is approaching an unhealthy state

How do you Vary Real-world Events in an automated fashion? By carefully designing and prioritizing your experiments, of course!

Understand the Service Under Test Dependency Insights: ● Timeouts ● Retries ● % of Requests Involved ● Requests Per Second ● Latency ● Hystrix Commands ○ Fallbacks ○ Timeouts

Evaluate Safety NOT SAFE TO FAIL!!!

Can more automation eventually lead to fewer experiments?

Prioritize Experiments Retries Traffic Percentage Failure Latency Experiment Type Aging
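The slide names the inputs to prioritization (retries, traffic percentage, experiment type, aging) but not how they are combined. The sketch below shows one possible scoring function purely as an illustration; the formula, the weights, and the direction of each factor are assumptions, and `Candidate` and `priority` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    retries: int
    traffic_pct: float        # % of requests that touch the dependency
    experiment_type: str      # "failure" or "latency"
    days_since_last_run: int  # the "aging" factor


def priority(c: Candidate) -> float:
    """Hypothetical score: higher means run sooner. Weights are made up."""
    type_weight = {"failure": 1.0, "latency": 1.5}[c.experiment_type]
    return type_weight * c.traffic_pct * (1 + c.retries) + c.days_since_last_run


candidates = [
    Candidate("a", retries=0, traffic_pct=10.0, experiment_type="failure", days_since_last_run=0),
    Candidate("b", retries=1, traffic_pct=10.0, experiment_type="latency", days_since_last_run=5),
]
ordered = sorted(candidates, key=priority, reverse=True)
print([c.name for c in ordered])  # ['b', 'a']
```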

Generate Experiments Failure Failure Latency Latency

Is it time to Run Experiments in Production? Here we go!

What happened? 14 Vulnerabilities 0 Outages Tooling Confidence Gaps

Example Finding Playback Service No Fallback! 376 ms Latency License Service

88.85% of cluster traffic Circuit Breaker Timeouts 10 threads Thread Pool Rejections
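The thread pool rejections follow from simple capacity arithmetic. As a rough worked example using Little's law, and assuming each call holds one of the 10 threads for the full 376 ms shown on the previous slide (the deck does not state the actual request rate or whether the pool has a queue):

```python
def max_throughput_rps(threads: int, latency_s: float) -> float:
    """Little's law: concurrency = rate * latency, so rate = concurrency / latency."""
    return threads / latency_s


limit = max_throughput_rps(threads=10, latency_s=0.376)
print(round(limit, 1))  # 26.6
```

Under those assumptions the pool saturates at roughly 26.6 requests per second per instance, and anything above that rate is rejected or falls through to the circuit breaker.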

Fully validated fix in tool before rollout!

After a day's worth of data, the results are looking fantastic. Every negative metric [for that Hystrix command] had a drastic improvement, and some by an order of magnitude. --Robert Reta, Playback Licensing

What else can be safer?

How do you help teams build more resilient systems? ● Apply the “Principles of Chaos” to tooling. ● Manage the heavy lifting. Chaos Engineer. Non-Critical Service Owner. Critical Service Owner.

You Must be This Tall to Ride?

How to stay up in spite of change and turmoil? ● Functional sharding for fault isolation. ● Tune RPC calls. ● Use Chaos to validate config and resiliency strategies. How to fail well? ● Functioning fallbacks. ● Use Chaos to close gaps in traditional testing methods. How to help teams build more resilient systems? ● Apply the “Principles of Chaos” to tooling. ● Manage the heavy lifting. Non-Critical Service Owner. Critical Service Owner. Chaos Engineer.

You Can Either Curl Up In A Ball And Die… Or You Can Stand Up And Say, “We’re Different. We’re The Strong Ones, And You Can’t Break Us!” Haley Tucker Senior Software Engineer Chaos Engineering @hwilson1204
