Towards a resilience pattern language
or how to get resilient software design right
Uwe Friedrichsen (codecentric AG) – Berlin Expert Days – Berlin, 16. September 2016
@ufried | Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com
Previously on “Resilience” …
Why resilience?
It's all about production!
Business Production Availability
Availability ≔ MTTF / (MTTF + MTTR)
MTTF: Mean Time To Failure
MTTR: Mean Time To Recovery
Traditional stability approach
Availability ≔ MTTF / (MTTF + MTTR)
Maximize MTTF
(Almost) every system is a distributed system
Chas Emerick
The Eight Fallacies of Distributed Computing
Peter Deutsch
https://blogs.oracle.com/jag/resource/Fallacies.html
A distributed system is one in which the failure of a computer you didn't even know existed
can render your own computer unusable.
Leslie Lamport
Failures in today's complex, distributed and interconnected systems are not the exception.
Do not try to avoid failures. Embrace them.
Resilience approach
Availability ≔ MTTF / (MTTF + MTTR)
Minimize MTTR
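To make the formula concrete, here is a small worked sketch. The class name `Availability` and all figures are assumptions chosen for illustration only; they do not come from the talk.

```java
/** Illustrative helper; the name and the sample figures are assumptions. */
final class Availability {
    private Availability() {}

    /** Availability = MTTF / (MTTF + MTTR); both arguments use the same time unit. */
    static double of(double mttf, double mttr) {
        if (mttf < 0 || mttr < 0 || mttf + mttr == 0) {
            throw new IllegalArgumentException("MTTF and MTTR must be non-negative and not both zero");
        }
        return mttf / (mttf + mttr);
    }
}
```

With an assumed MTTF of 1000 hours and an MTTR of 10 hours, availability is about 0.9901. Halving the MTTR to 5 hours raises it to about 0.9950, which is the same gain as doubling the MTTF to 2000 hours. The formula does not favor either lever; the resilience approach simply bets that shortening recovery is the more achievable one.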
resilience (IT): the ability of a system to handle unexpected situations
Do not fall for the “100% available” trap!
Isolation: Bulkheads
Latency Control: Fail Fast, Circuit Breaker, Timeouts, Fan out & quickest reply, Bounded Queues, Shed Load
Loose Coupling: Asynchronous Communication, Event-Driven, Idempotency, Self-Containment, Relaxed Temporal Constraints, Location Transparency, Stateless
Supervision: Monitor, Complete Parameter Checking, Error Handler, Escalation
… and there is more
A rich pattern family
(Title music starts & opening credits shown)
Let’s complete the picture first …
Isolation, Latency Control, Loose Coupling, Supervision
Core (Architectural), Detection, Treatment, Prevention, Recovery, Mitigation
Isolation, Loose Coupling, Latency Control
Node level
Supervision
System level
Isolation, Redundancy, Communication paradigm, Supporting patterns
Bulkheads
Communication paradigm
Redundancy
Failure types
Usage of redundancy
(e.g., HAProxy)
(e.g., Consul, ZooKeeper)
(e.g., Pacemaker & Corosync)
Supporting patterns (structural and behavioral): Stateless, Idempotency, Escalation, Zero downtime deployment, Location transparency, Relaxed temporal constraints
Node level, Supporting patterns, System level
Timeout, Circuit breaker, Complete parameter checking, Checksum
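As a rough illustration of the circuit breaker named above, here is a minimal, single-threaded sketch. The class name `SimpleCircuitBreaker`, the threshold and window parameters, and the explicit clock argument are assumptions made for this example; production libraries such as Hystrix are considerably more elaborate.

```java
import java.util.function.Supplier;

/**
 * Minimal circuit breaker sketch (not thread-safe).
 * Class name, thresholds and the explicit clock parameter are illustrative assumptions.
 */
final class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures before the breaker opens
    private final long openMillis;      // how long calls are rejected once open
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Runs the action guarded by the breaker, using the system clock. */
    <T> T call(Supplier<T> action, T fallback) {
        return call(action, fallback, System.currentTimeMillis());
    }

    /** Same as above, but with an explicit "now" so the behavior can be checked deterministically. */
    <T> T call(Supplier<T> action, T fallback, long now) {
        if (open && now - openedAt < openMillis) {
            return fallback; // open: fail fast without touching the resource
        }
        // Closed, or open long enough to allow one trial call (half-open)
        try {
            T result = action.get();
            consecutiveFailures = 0;
            open = false; // a success closes the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (open || consecutiveFailures >= failureThreshold) {
                open = true; // (re)open and restart the rejection window
                openedAt = now;
            }
            return fallback;
        }
    }

    boolean isOpen() {
        return open;
    }
}
```

While the breaker is open, callers get the fallback immediately instead of waiting for a timeout, which is how the pattern keeps a slow dependency from consuming the caller's resources.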
Monitor, Watchdog, Heartbeat, Acknowledgement
Voting, Synthetic transaction, Leaky bucket, Routine checks, Health check, Fail fast
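The leaky bucket entry can be sketched as an error counter, assuming it is meant in the sense of a counter that rises on every error and drains over time, so that only a burst of errors counts as a persistent fault. The class name `LeakyBucketCounter` and its methods are hypothetical.

```java
/**
 * Leaky bucket error counter sketch (not thread-safe).
 * The name and the method set are assumptions made for illustration.
 */
final class LeakyBucketCounter {
    private final int capacity; // errors tolerated before the fault counts as persistent
    private int level = 0;

    LeakyBucketCounter(int capacity) {
        this.capacity = capacity;
    }

    /** Records one error and returns true if the bucket has overflowed. */
    boolean recordError() {
        level++;
        return level > capacity;
    }

    /** Called periodically (for example by a timer): lets one error leak out. */
    void leak() {
        if (level > 0) {
            level--;
        }
    }
}
```

Occasional errors leak away before the bucket fills, so they can be handled as transient; an overflow is the signal to escalate.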
Retry, Limit retries
Retry
Retry example
// doAction returns true if successful, false otherwise
boolean doAction(...) { ... }

// General pattern
boolean success = false;
int tries = 0;
while (!success && (tries < MAX_TRIES)) {
    success = doAction(...);
    tries++;
}

// Alternative one-retry-only variant
success = doAction(...) || doAction(...);
Rollback, Checkpoint, Safe point
Rollback
Roll-forward
Reset, Restart, Reconnect, Data Reset, Startup consistency
Reset
Failover
Read repair
Read repair example (Riak, Java) 1/2
public class FooResolver implements ConflictResolver<Foo> {
    @Override
    public Foo resolve(List<Foo> siblings) {
        // Insert your sibling resolution logic here
    }
}

public class Buddy {
    public String name;
    public Set<String> nicknames;

    public Buddy(String name, Set<String> nicknames) {
        this.name = name;
        this.nicknames = nicknames;
    }
}
Read repair example (Riak, Java) 2/2
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BuddyResolver implements ConflictResolver<Buddy> {
    @Override
    public Buddy resolve(List<Buddy> siblings) {
        if (siblings.size() == 0) {
            return null;
        } else if (siblings.size() == 1) {
            return siblings.get(0);
        } else {
            // Name is also used as key. Thus, all siblings have the same name
            String name = siblings.get(0).name;
            // Merge the nickname sets of all siblings
            Set<String> mergedNicknames = new HashSet<String>();
            for (Buddy buddy : siblings) {
                mergedNicknames.addAll(buddy.nicknames);
            }
            return new Buddy(name, mergedNicknames);
        }
    }
}
Error Handler
Retry, Limit retries, Rollback, Checkpoint, Safe point, Roll-forward, Reset, Restart, Reconnect, Data Reset, Startup consistency, Failover, Read repair, Error handler
Fallback, Fail silently, Alternative action, Default value
Fallback
Fail silently example (Hystrix, Java) 1/2
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class FailSilentlyCommand extends HystrixCommand<String> {
    private static final String COMMAND_GROUP = "default";
    private final boolean preCondition;

    public FailSilentlyCommand(boolean preCondition) {
        super(HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP));
        this.preCondition = preCondition;
    }

    @Override
    protected String run() throws Exception {
        if (!preCondition) {
            throw new RuntimeException("Action failed");
        }
        return "I am a result";
    }

    @Override
    protected String getFallback() {
        return null; // Turn into silent failure
    }
}
Fail silently example (Hystrix, Java) 2/2
@Test
public void shouldSucceed() {
    FailSilentlyCommand command = new FailSilentlyCommand(true);
    String s = command.execute();
    assertEquals("I am a result", s);
}

@Test
public void shouldFailSilently() {
    FailSilentlyCommand command = new FailSilentlyCommand(false);
    String s = "Dummy";
    try {
        s = command.execute();
    } catch (Exception e) {
        fail("Did not fail silently");
    }
    assertNull(s);
}
Default value example (Hystrix, Java) 1/2
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class DefaultValueCommand extends HystrixCommand<String> {
    private static final String COMMAND_GROUP = "default";
    private final boolean preCondition;

    public DefaultValueCommand(boolean preCondition) {
        super(HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP));
        this.preCondition = preCondition;
    }

    @Override
    protected String run() throws Exception {
        if (!preCondition) {
            throw new RuntimeException("Action failed");
        }
        return "I am a smart result";
    }

    @Override
    protected String getFallback() {
        return "I am a default value"; // Return default value if action fails
    }
}
Default value example (Hystrix, Java) 2/2
@Test
public void shouldSucceed() {
    DefaultValueCommand command = new DefaultValueCommand(true);
    String s = command.execute();
    assertEquals("I am a smart result", s);
}

@Test
public void shouldProvideDefaultValue() {
    DefaultValueCommand command = new DefaultValueCommand(false);
    String s = null;
    try {
        s = command.execute();
    } catch (Exception e) {
        fail("Did not return default value");
    }
    assertEquals("I am a default value", s);
}
Queue for resources, Bounded queue, Finish work in progress, Fresh work before stale
Queues for resources
Shed Load
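One simple way to combine a bounded queue with load shedding is to reject new work as soon as the queue is full instead of blocking the caller. The sketch below assumes this interpretation; the class name `LoadSheddingInbox` and its methods are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Bounded queue that sheds load when full; names are illustrative assumptions. */
final class LoadSheddingInbox<T> {
    private final BlockingQueue<T> queue;

    LoadSheddingInbox(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false (request shed) instead of blocking when the queue is full. */
    boolean submit(T request) {
        return queue.offer(request);
    }

    /** Returns the next queued request, or null if the queue is empty. */
    T next() {
        return queue.poll();
    }
}
```

The caller sees the rejection immediately and can apply a fallback, while the work already queued keeps a bounded waiting time.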
Share Load
Statically, Dynamically
Deferrable work
Deferrable work example 1/2
// Do or wait variant
<init batch>
while (<more to process>) {
    int load = getLoad();
    if (load > THRESHOLD) {
        waitFixedDuration();
    } else {
        <process next batch of work>
    }
}

void waitFixedDuration() {
    Thread.sleep(DELAY); // try-catch left out for better readability
}
Deferrable work example 2/2
// Adaptive load variant
<init batch>
while (<more to process>) {
    waitLoadBased();
    <process next batch of work>
}

void waitLoadBased() {
    int load = getLoad();
    long delay = calcDelay(load);
    Thread.sleep(delay); // try-catch left out for better readability
}

long calcDelay(int load) {
    // Simple example implementation
    if (load < THRESHOLD) {
        return 0L;
    }
    return (load - THRESHOLD) * DELAY_FACTOR;
}
Marked data
Fallback, Fail silently, Alternative action, Default value, Shed load, Share load (statically, dynamically), Deferrable work, Marked data, Queue for resources, Bounded queue, Finish work in progress, Fresh work before stale
Let sleeping dogs lie, Small releases, Hot deployments
Routine maintenance, Anti-entropy
Routine maintenance
Spread the news
Backup request
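A backup request can be sketched as follows, assuming the pattern means: send the request, and if no reply arrives within a short time, send a second request to another replica and use whichever reply comes first. The class `BackupRequest`, its parameters, and the delay handling are assumptions for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Backup request sketch; names and the single-backup policy are illustrative assumptions. */
final class BackupRequest {
    private BackupRequest() {}

    static <T> T call(Supplier<T> primary, Supplier<T> backup,
                      long backupAfterMillis, Executor executor) {
        CompletableFuture<T> first = CompletableFuture.supplyAsync(primary, executor);
        try {
            // Fast path: the primary answers before the backup delay elapses
            return first.get(backupAfterMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException slow) {
            // Primary is slow: start the backup and take whichever completes first
            CompletableFuture<T> second = CompletableFuture.supplyAsync(backup, executor);
            @SuppressWarnings("unchecked")
            T quickest = (T) CompletableFuture.anyOf(first, second).join();
            return quickest;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the reply", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Primary request failed", e.getCause());
        }
    }
}
```

This sketch neither cancels the slower request nor handles the case where the first request to complete fails; both would need attention in real use. The requests should also be idempotent, since both may be executed.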
Anti-fragility, Diversity, Jitter
Anti-fragility
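Jitter is commonly applied to wait times, for example between retries, so that many clients do not act in lockstep. A minimal sketch, assuming an exponential backoff with a uniformly random delay; the class `JitteredBackoff` and its parameters are hypothetical.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Backoff delay with jitter; names and the doubling policy are illustrative assumptions. */
final class JitteredBackoff {
    private JitteredBackoff() {}

    /** Returns a delay drawn uniformly from [0, min(capMillis, baseMillis * 2^attempt)]. */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        if (attempt < 0 || baseMillis < 0 || capMillis < 0 || capMillis == Long.MAX_VALUE) {
            throw new IllegalArgumentException("attempt, base and cap must be non-negative; cap must be below Long.MAX_VALUE");
        }
        long bound = Math.min(baseMillis, capMillis);
        // Double the upper bound per attempt, stopping at the cap without overflowing
        for (int i = 0; i < attempt && bound < capMillis; i++) {
            bound = (bound > capMillis / 2) ? capMillis : bound * 2;
        }
        // nextLong's bound is exclusive, hence + 1 to include the upper bound
        return ThreadLocalRandom.current().nextLong(bound + 1);
    }
}
```

Because each client picks its own random delay, retries after a shared outage spread out over time instead of hitting the recovering system all at once.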
Error injection
https://github.com/Netflix/SimianArmy
Routine maintenance, Anti-entropy, Spread the news, Backup request, Anti-fragility, Diversity, Jitter, Error injection
Decisions to make
Core (Architectural), Detection, Treatment, Prevention, Recovery, Mitigation
Isolation, Redundancy, Communication paradigm, Supporting patterns, Node level, System level
1. Decide core system properties
2. Choose patterns per failure scenario (keep the different failure types in mind)
3. Decide complementing patterns
Ongoing: Create and refine system design and functional decomposition. Functionally decouple bulkheads. (A good functional decomposition at the business level is the prerequisite for effective resilience.)
Example: Erlang (Akka)
Monitor, Messaging, Actor, Escalation, Heartbeat, Restart (Let it crash), Hot deployments
Example: Netflix
Monitor, Request/response, (Micro)Service, Retry, Zero downtime deployment (Canary releases), Fallback, Share load, Bounded queue, Timeout, Circuit breaker, Several variants, Error injection
What is your pattern language?
Wrap-up
Further reading
1. Michael T. Nygard, Release It!, Pragmatic Bookshelf, 2007
2. Robert S. Hanmer, Patterns for Fault Tolerant Software, Wiley, 2007
3. Andrew Tanenbaum, Maarten van Steen, Distributed Systems: Principles and Paradigms, Prentice Hall, 2nd Edition, 2006
4. Hystrix Wiki, https://github.com/Netflix/Hystrix/wiki
5. Uwe Friedrichsen, Patterns of resilience, http://de.slideshare.net/ufried/patterns-of-resilience
Do not avoid failures. Embrace them!