Towards a resilience pattern language or how to get resilient software design right

SLIDE 1

Towards a resilience pattern language
or how to get resilient software design right

Uwe Friedrichsen (codecentric AG) – Berlin Expert Days – Berlin, 16. September 2016

SLIDE 2

@ufried

Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com

SLIDE 3

Previously on “Resilience” …

SLIDE 4

Why resilience?

SLIDE 5

It‘s all about production!

SLIDE 6

Business Production Availability

SLIDE 7

Availability ≔ MTTF / (MTTF + MTTR)

MTTF: Mean Time To Failure

MTTR: Mean Time To Recovery
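As a worked illustration of the formula, here is a minimal sketch. The class and method names are hypothetical and not from the talk. It assumes both values are given in the same time unit.

```java
// Hypothetical helper (not from the talk): availability from MTTF and MTTR.
final class AvailabilityCalc {

    /** Availability = MTTF / (MTTF + MTTR); both arguments in the same time unit. */
    static double availability(double mttf, double mttr) {
        if (mttf < 0 || mttr < 0 || mttf + mttr == 0) {
            throw new IllegalArgumentException(
                "MTTF and MTTR must be non-negative and not both zero");
        }
        return mttf / (mttf + mttr);
    }
}
```

With MTTF = 1000 h and MTTR = 1 h, halving MTTR to 0.5 h yields the same availability as doubling MTTF to 2000 h, because 1000 / 1000.5 = 2000 / 2001.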

SLIDE 8

Traditional stability approach

Availability ≔ MTTF / (MTTF + MTTR)

Maximize MTTF

SLIDE 9

(Almost) every system is a distributed system

Chas Emerick

SLIDE 10

The Eight Fallacies of Distributed Computing

1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

Peter Deutsch

https://blogs.oracle.com/jag/resource/Fallacies.html

SLIDE 11

A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.

Leslie Lamport

SLIDE 12

Failures in today's complex, distributed and interconnected systems are not the exception.

They are the normal case
They are not predictable
They are not avoidable

SLIDE 13

Do not try to avoid failures. Embrace them.

SLIDE 14

Resilience approach

Availability ≔ MTTF / (MTTF + MTTR)

Minimize MTTR

SLIDE 15

Resilience (IT): the ability of a system to handle unexpected situations

without the user noticing it (best case)
with a graceful degradation of service (worst case)

SLIDE 16

Do not fall for the “100% available” trap!

SLIDE 17

Isolation Latency Control

Fail Fast Circuit Breaker Timeouts Fan out & quickest reply Bounded Queues Shed Load Bulkheads

Loose Coupling

Asynchronous Communication Event-Driven Idempotency Self-Containment Relaxed Temporal Constraints Location Transparency Stateless

Supervision

Monitor Complete Parameter Checking Error Handler Escalation

SLIDE 18

… and there is more

Recovery & mitigation patterns
More supervision patterns
Architectural patterns
Anti-fragility patterns
Fault treatment & prevention patterns

A rich pattern family

SLIDE 19

(Title music starts & opening credits shown)

SLIDE 20

Let’s complete the picture first …

SLIDE 21

Isolation Latency Control Loose Coupling Supervision

SLIDE 22

Core

(Architectural)

Detection Treatment Prevention Recovery Mitigation

SLIDE 23

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Isolation Loose Coupling Latency Control

Node level

Supervision

System level

SLIDE 24

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Isolation Redundancy Communication paradigm Supporting patterns

SLIDE 25

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Supporting patterns Communication paradigm Redundancy Isolation

SLIDE 26

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Supporting patterns Communication paradigm Redundancy Isolation Bulkhead

SLIDE 27

Bulkheads

Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)
Shaping good bulkheads is extremely hard (pure design issue)
Diverse implementation choices available, e.g., µservice, actor, scs, ...
Implementation choice impacts system and resilience design a lot

SLIDE 28

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Supporting patterns Communication paradigm Redundancy Isolation

SLIDE 29

Communication paradigm

Request-response <-> messaging <-> events
Not a pattern, but heavily influences resilience patterns to be used
Also heavily influences functional bulkhead design
Very fundamental decision which is often underestimated

SLIDE 30

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Supporting patterns Communication paradigm Redundancy Isolation

SLIDE 31

Redundancy

Core resilience concept
Applicable to all failure types
Basis for many recovery and mitigation patterns
Often different variants implemented in a system

SLIDE 32

Failure types

Crash failure
Omission failure
Timing failure
Response failure
Byzantine failure

SLIDE 33

Failure types

Crash failure
Omission failure
Timing failure
Response failure
Byzantine failure

Usage of redundancy

Patterns
Failover
Schemes
Active/Passive
Active/Active
N+M Redundancy
Implementation examples
Load balancer + health check

(e.g., HAProxy)

Dynamic routing + health check

(e.g., Consul, ZooKeeper)

Cluster manager with shared IP

(e.g., Pacemaker & Corosync)

SLIDE 34

Failure types

Crash failure
Omission failure
Timing failure
Response failure
Byzantine failure

Usage of redundancy

Patterns
Retry (to different replica)
Failover
Backup Request
Schemes
Identical replicas
Failover schemes (for failover)
Implementation examples
Client-based routing
Load balancer
Leaky bucket + dynamic routing

SLIDE 35

Failure types

Crash failure
Omission failure
Timing failure
Response failure
Byzantine failure

Usage of redundancy

Patterns
Timeout + retry to different replica
Timeout + failover
Backup Request
Schemes
Identical replicas
Failover schemes (for failover)
Implementation examples
Client-based routing
Load balancer
Circuit breaker + dynamic routing

SLIDE 36

Failure types

Crash failure
Omission failure
Timing failure
Response failure
Byzantine failure

Usage of redundancy

Patterns
Voting
Recovery blocks
Routine exercise
Schemes
Identical replicas
Different replicas (recovery blocks)
Implementation examples
Majority based quorum
Adaptive weighted sum
Synthetic computation
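The majority-based quorum named above can be sketched as follows. This is only an illustration under stated assumptions: the class name `MajorityVote` is hypothetical, replies are non-null, and they are compared with `equals`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of majority voting over replies from identical replicas.
final class MajorityVote {

    /** Returns the value reported by a strict majority of replicas, or empty if there is none. */
    static <T> Optional<T> vote(List<T> replies) {
        Map<T, Integer> counts = new HashMap<>();
        for (T reply : replies) {
            int seen = counts.merge(reply, 1, Integer::sum);
            // Strict majority: more than half of all replies agree.
            if (seen > replies.size() / 2) {
                return Optional.of(reply);
            }
        }
        return Optional.empty();
    }
}
```

An empty result means the replicas disagree too much to decide. The caller then has to escalate, for example by retrying or falling back.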

SLIDE 37

Failure types

Crash failure
Omission failure
Timing failure
Response failure
Byzantine failure

Usage of redundancy

Patterns
Voting
Recovery blocks
Routine exercise
Schemes
Identical replicas
Different replicas (recovery blocks)
Implementation examples
n > 3t quorum
Adaptive weighted sum
Synthetic computation

SLIDE 38

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Supporting patterns Communication paradigm Redundancy Isolation Stateless Idempotency Escalation

Structural Behavioral

Zero downtime deployment Location transparency Relaxed temporal constraints

SLIDE 39

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Node level Supporting patterns System level

SLIDE 40

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Node level Supporting patterns System level Timeout Circuit breaker Complete parameter checking Checksum

SLIDE 41

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Node level Supporting patterns System level Monitor Watchdog Heartbeat Acknowledgement

SLIDE 42

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Node level Supporting patterns System level Voting Synthetic transaction Leaky bucket Routine checks Health check Fail fast

SLIDE 43

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

SLIDE 44

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Retry Limit retries

SLIDE 45

Retry

Very basic recovery pattern
Recover from omission or other transient errors
Limit retries to minimize extra load on an already loaded resource
Limit retries to avoid recurring errors

SLIDE 46

Retry example

// doAction returns true if successful, false otherwise
boolean doAction(...) { ... }

// General pattern
boolean success = false;
int tries = 0;
while (!success && (tries < MAX_TRIES)) {
    success = doAction(...);
    tries++;
}

// Alternative one-retry-only variant
success = doAction(...) || doAction(...);

SLIDE 47

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Retry Limit retries Rollback Checkpoint Safe point

SLIDE 48

Rollback

Roll back state and/or execution path to a defined safe state
Recover from internal errors caused by external failures
Use checkpoints and safe points to provide safe rollback points
Limit retries to avoid recurring errors
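A minimal sketch of rollback to a checkpoint might look like this. `CheckpointedState` is a hypothetical name. The sketch assumes the state type is immutable, so keeping a reference is enough to preserve the checkpoint, and that failures surface as runtime exceptions.

```java
import java.util.function.UnaryOperator;

// Hypothetical sketch: roll state back to the last checkpoint when a step fails.
final class CheckpointedState<S> {
    private S current;
    private S checkpoint;

    CheckpointedState(S initial) {
        this.current = initial;
        this.checkpoint = initial;
    }

    /** Marks the current state as the safe point to return to. */
    void checkpoint() {
        checkpoint = current;
    }

    /** Applies a step; on failure, restores the last checkpoint and returns false. */
    boolean apply(UnaryOperator<S> step) {
        try {
            current = step.apply(current);
            return true;
        } catch (RuntimeException e) {
            current = checkpoint; // roll back to the defined safe state
            return false;
        }
    }

    S get() {
        return current;
    }
}
```

Note that a rollback also discards successful steps applied since the last checkpoint, so checkpoint placement determines how much work is lost.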

SLIDE 49

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Retry Limit retries Rollback Checkpoint Safe point Roll-forward

SLIDE 50

Roll-forward

Advance execution past the point of error
Often used as escalation if retry or rollback do not succeed
Not applicable if skipped activity is essential
Use checkpoints and safe points to provide safe roll-forward points

SLIDE 51

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Retry Limit retries Rollback Roll-forward Checkpoint Safe point Restart Reconnect Data Reset Startup consistency Reset

SLIDE 52

Reset

Often used as radical escalation if all other measures failed
Restart service – do not forget to provide a consistent startup state
Reset data to a guaranteed consistent state if nothing else helps
Sometimes simply trying to reconnect helps (often forgotten)

SLIDE 53

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Retry Limit retries Rollback Restart Roll-forward Reconnect Checkpoint Safe point Data Reset Startup consistency Reset Failover

SLIDE 54

Failover

Used as escalation if other measures failed or would take too long
Requires redundancy – trades resources for availability
Many implementation variants available, incl. out-of-the-box solutions
Usually implemented as a monitor-dynamic router combination

SLIDE 55

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Retry Limit retries Rollback Restart Roll-forward Reconnect Checkpoint Safe point Data Reset Startup consistency Failover Reset Read repair

SLIDE 56

Read repair

Handle response failures due to relaxed temporal constraints
Requires redundancy – trades resources for availability
Decides correct state based on conflicting siblings
Often implemented in NoSQL databases (but not always accessible)

SLIDE 57

Read repair example (Riak, Java) 1/2

public class FooResolver implements ConflictResolver<Foo> {
    @Override
    public Foo resolve(List<Foo> siblings) {
        // Insert your sibling resolution logic here
    }
}

public class Buddy {
    public String name;
    public Set<String> nicknames;

    public Buddy(String name, Set<String> nicknames) {
        this.name = name;
        this.nicknames = nicknames;
    }
}

SLIDE 58

Read repair example (Riak, Java) 2/2

public class BuddyResolver implements ConflictResolver<Buddy> {
    @Override
    public Buddy resolve(List<Buddy> siblings) {
        if (siblings.size() == 0) {
            return null;
        } else if (siblings.size() == 1) {
            return siblings.get(0);
        } else {
            // Name is also used as key. Thus, all siblings have the same name
            String name = siblings.get(0).name;
            // Merge conflicting siblings by taking the union of their nicknames
            Set<String> mergedNicknames = new HashSet<String>();
            for (Buddy buddy : siblings) {
                mergedNicknames.addAll(buddy.nicknames);
            }
            return new Buddy(name, mergedNicknames);
        }
    }
}

SLIDE 59

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Retry Limit retries Rollback Restart Roll-forward Reconnect Checkpoint Safe point Data Reset Startup consistency Failover Read repair Reset Error handler

SLIDE 60

Error Handler

Separate business logic and error handling
Business logic just focuses on getting the task done
Error handler focuses on recovering from errors
Easier to maintain – can be extended to structural escalation

SLIDE 61

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Retry Limit retries Rollback Restart Roll-forward Reconnect Checkpoint Safe point Data Reset Startup consistency Failover Read repair Error handler Reset

SLIDE 62

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

SLIDE 63

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Fallback Fail silently Alternative action Default value

SLIDE 64

Fallback

Execute an alternative action if the original action fails
Basis for most mitigation patterns
Fail silently – silently ignore the error and continue processing
Default value – return a predefined default value if an error occurs

SLIDE 65

Fail silently example (Hystrix, Java) 1/2

public class FailSilentlyCommand extends HystrixCommand<String> {
    private static final String COMMAND_GROUP = "default";
    private final boolean preCondition;

    public FailSilentlyCommand(boolean preCondition) {
        super(HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP));
        this.preCondition = preCondition;
    }

    @Override
    protected String run() throws Exception {
        if (!preCondition)
            throw new RuntimeException("Action failed");
        return "I am a result";
    }

    @Override
    protected String getFallback() {
        return null; // Turn into silent failure
    }
}

SLIDE 66

Fail silently example (Hystrix, Java) 2/2

@Test
public void shouldSucceed() {
    FailSilentlyCommand command = new FailSilentlyCommand(true);
    String s = command.execute();
    assertEquals("I am a result", s);
}

@Test
public void shouldFailSilently() {
    FailSilentlyCommand command = new FailSilentlyCommand(false);
    String s = "Dummy";
    try {
        s = command.execute();
    } catch (Exception e) {
        fail("Did not fail silently");
    }
    assertNull(s);
}

SLIDE 67

Default value example (Hystrix, Java) 1/2

public class DefaultValueCommand extends HystrixCommand<String> {
    private static final String COMMAND_GROUP = "default";
    private final boolean preCondition;

    public DefaultValueCommand(boolean preCondition) {
        super(HystrixCommandGroupKey.Factory.asKey(COMMAND_GROUP));
        this.preCondition = preCondition;
    }

    @Override
    protected String run() throws Exception {
        if (!preCondition)
            throw new RuntimeException("Action failed");
        return "I am a smart result";
    }

    @Override
    protected String getFallback() {
        return "I am a default value"; // Return default value if action fails
    }
}

SLIDE 68

Default value example (Hystrix, Java) 2/2

@Test
public void shouldSucceed() {
    DefaultValueCommand command = new DefaultValueCommand(true);
    String s = command.execute();
    assertEquals("I am a smart result", s);
}

@Test
public void shouldProvideDefaultValue() {
    DefaultValueCommand command = new DefaultValueCommand(false);
    String s = null;
    try {
        s = command.execute();
    } catch (Exception e) {
        fail("Did not return default value");
    }
    assertEquals("I am a default value", s);
}

SLIDE 69

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Fallback Fail silently Alternative action Default value Queue for resources Bounded queue Finish work in progress Fresh work before stale

SLIDE 70

Queues for resources

Protect resource from temporary overload situations
Limit queue size to limit latency at longer-lasting overload
Finish work in progress – Create pushback on the callers
Fresh work before stale – Discard old entries

SLIDE 71

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Fallback Fail silently Alternative action Default value Queue for resources Bounded queue Finish work in progress Fresh work before stale Shed load

SLIDE 72

Shed Load

Use if overload will lead to unacceptable throughput of resource
Shed requests in order to keep throughput of resource acceptable
Shed load at periphery – Minimize impact on resource itself
Usually combined with monitor to watch load of resource
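One simple way to shed load is to cap the number of concurrent requests and reject the rest immediately. The sketch below makes that assumption. `LoadShedder` is a hypothetical name, the concurrency cap stands in for a real load monitor, and requests are assumed to return non-null results.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical sketch: shed requests once a concurrency limit is reached.
final class LoadShedder {
    private final Semaphore permits;

    LoadShedder(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /** Runs the request if capacity is available; otherwise sheds it and returns empty. */
    <T> Optional<T> execute(Supplier<T> request) {
        if (!permits.tryAcquire()) {
            return Optional.empty(); // shed without touching the protected resource
        }
        try {
            return Optional.of(request.get());
        } finally {
            permits.release();
        }
    }
}
```

Placing such a guard in front of the resource, rather than inside it, matches the advice to shed load at the periphery.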

SLIDE 73

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Fallback Fail silently Alternative action Default value Shed load Queue for resources Bounded queue Finish work in progress Fresh work before stale Share load

Statically Dynamically

SLIDE 74

Share Load

Use if overload will lead to unacceptable throughput of resource
Share load between (added) resources to keep throughput good
Minimize amount of synchronization needed between resources
Usually combined with monitor to watch load of resource(s)

SLIDE 75

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Fallback Fail silently Alternative action Default value Shed load Share load Queue for resources Bounded queue Finish work in progress Fresh work before stale

Statically Dynamically

Deferrable work

SLIDE 76

Deferrable work

Maximize resources for online request processing under high load
Pause or slow down routine and batch jobs
Provide a means to pause routine and batch jobs from outside
Alternatively use a scheduler with dynamic resource allocation

SLIDE 77

Deferrable work example 1/2

// Do or wait variant
<init batch>
while (<more to process>) {
    int load = getLoad();
    if (load > THRESHOLD) {
        waitFixedDuration();
    } else {
        <process next batch of work>
    }
}

void waitFixedDuration() {
    Thread.sleep(DELAY); // try-catch left out for better readability
}

SLIDE 78

Deferrable work example 2/2

// Adaptive load variant
<init batch>
while (<more to process>) {
    waitLoadBased();
    <process next batch of work>
}

void waitLoadBased() {
    int load = getLoad();
    long delay = calcDelay(load);
    Thread.sleep(delay); // try-catch left out for better readability
}

long calcDelay(int load) {
    // Simple example implementation
    if (load < THRESHOLD) {
        return 0L;
    }
    return (load - THRESHOLD) * DELAY_FACTOR;
}

SLIDE 79

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Fallback Fail silently Alternative action Default value Shed load Share load Queue for resources Bounded queue Finish work in progress Fresh work before stale Marked data

Statically Dynamically

Deferrable work

SLIDE 80

Marked data

Avoid repeated and/or spreading errors due to erroneous data
Use if time or information to correct data immediately is missing
Mark data as being erroneous – check flag before processing data
Use routine maintenance job to correct data
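As a rough illustration of checking and setting the error flag, consider the sketch below. The record type and the parsing step are made-up stand-ins for real business data and processing.

```java
import java.util.List;

// Hypothetical sketch: mark erroneous records instead of failing, and skip marked ones.
final class MarkedData {

    static final class Rec {
        final String value;
        boolean erroneous; // the mark; assumed to be persisted with the data in a real system

        Rec(String value) {
            this.value = value;
        }
    }

    /** Sums all parseable values; records that fail to parse are marked and skipped. */
    static int sumValid(List<Rec> records) {
        int sum = 0;
        for (Rec rec : records) {
            if (rec.erroneous) {
                continue; // check flag before processing data
            }
            try {
                sum += Integer.parseInt(rec.value);
            } catch (NumberFormatException e) {
                rec.erroneous = true; // mark now, leave correction to a maintenance job
            }
        }
        return sum;
    }
}
```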

SLIDE 81

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Fallback Fail silently Alternative action Default value Shed load Share load Marked data Queue for resources Bounded queue Finish work in progress Fresh work before stale

Statically Dynamically

Deferrable work

SLIDE 82

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

SLIDE 83

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Let sleeping dogs lie Small releases Hot deployments

SLIDE 84

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

SLIDE 85

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Routine maintenance Anti-entropy

SLIDE 86

Routine maintenance

Reduce system entropy – keep preventable errors from occurring
Especially important if errors were only mitigated, not corrected
Check system periodically and fix detected faults and errors
Balance benefits, costs and additional system load

SLIDE 87

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Routine maintenance Spread the news Anti-entropy

SLIDE 88

Spread the news

Pro-actively spread information about changes in system state
Use a gossip or epidemic protocol for robustness and efficiency
Can also be used for data reconciliation
Balance benefits, costs and additional network load

SLIDE 89

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Routine maintenance Backup request Spread the news Anti-entropy

SLIDE 90

Backup request

Send request to multiple workers (optionally a bit offset)
Use quickest reply and discard all other responses
Prevents latent responses (or at least reduces probability)
Requires redundancy – trades resources for availability
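With the JDK's `CompletableFuture`, a backup request to two workers can be sketched as follows. The class name is hypothetical. Both requests are sent at once (no offset), and they run on the default asynchronous executor. If the first reply to arrive is a failure, this sketch propagates that failure instead of waiting for the other worker.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch: send the same request to two workers, use the quickest reply.
final class BackupRequest {

    /** Completes with whichever of the two replies arrives first; the other is ignored. */
    static <T> CompletableFuture<T> firstOf(Supplier<T> primary, Supplier<T> backup) {
        CompletableFuture<T> first = CompletableFuture.supplyAsync(primary);
        CompletableFuture<T> second = CompletableFuture.supplyAsync(backup);
        return first.applyToEither(second, reply -> reply);
    }
}
```

The slower request still runs to completion in the background, which is the resource cost the pattern trades for lower latency.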

SLIDE 91

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Routine maintenance Backup request Anti-fragility Diversity Jitter Spread the news Anti-entropy

SLIDE 92

Anti-fragility

Avoid fragility caused by homogenization and standardization
Protect against disastrous failures by using diverse solutions
Protect against cumulating effects by introducing jitter
Balance risks, benefits and added costs and efforts carefully
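Jitter can be as simple as randomizing a delay around its nominal value, so that many clients do not act at the same instant. The sketch below shows one possible form. The names and the uniform distribution are assumptions.

```java
import java.util.Random;

// Hypothetical sketch: spread out periodic actions by adding random jitter to a delay.
final class Jitter {

    /**
     * Returns a delay drawn uniformly from [base - spread, base + spread],
     * clamped at zero so the result is never negative.
     */
    static long jittered(long baseMillis, long spreadMillis, Random random) {
        long offset = (long) (random.nextDouble() * (2 * spreadMillis + 1)) - spreadMillis;
        return Math.max(0, baseMillis + offset);
    }
}
```

For example, `Jitter.jittered(1000, 100, new Random())` yields a delay between 900 and 1100 milliseconds.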

SLIDE 93

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy

SLIDE 94

Error injection

Make resilient software design sustainable
Inject errors at runtime and observe how the system reacts
Can also be used to detect yet unknown failure modes
Make sure to inject errors of all types
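A very small form of error injection is a wrapper that makes a call fail with a configurable probability. The sketch below is hypothetical and covers only crash-style errors. Injecting latency, omissions, or wrong responses would need further wrappers.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical sketch: wrap a call so that it fails with a given probability.
final class FaultInjector {
    private final double failureRate; // 0.0 = never fail, 1.0 = always fail
    private final Random random;

    FaultInjector(double failureRate, Random random) {
        this.failureRate = failureRate;
        this.random = random;
    }

    /** Returns a supplier that throws an injected failure at the configured rate. */
    <T> Supplier<T> wrap(Supplier<T> delegate) {
        return () -> {
            if (random.nextDouble() < failureRate) {
                throw new IllegalStateException("injected failure");
            }
            return delegate.get();
        };
    }
}
```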

SLIDE 95

Chaos Monkey
Chaos Gorilla
Chaos Kong
Latency Monkey
Compliance Monkey
Security Monkey
Janitor Monkey
Doctor Monkey

https://github.com/Netflix/SimianArmy

SLIDE 96

Prevention Detection Core

(Architectural)

Recovery Mitigation Treatment

Routine maintenance Backup request Anti-fragility Diversity Jitter Error injection Spread the news Anti-entropy

SLIDE 97

Towards a pattern language …

SLIDE 98

Decisions to make

General decisions
Bulkhead type
Communication paradigm
Decisions per failure scenario (repeat)
Error detection on node & system level
Recovery/mitigation mechanism
Supporting treatment mechanism
Supporting prevention mechanism
Complementing decisions
Complementing redundancy mechanism(s)
Complementing architectural patterns

SLIDE 99

Core

(Architectural)

Detection Treatment Prevention Recovery Mitigation

Isolation Redundancy Communication paradigm Supporting patterns Node level System level

1

Decide core system properties

2

Choose patterns per failure scenario (Have the different failure types in mind)

3

Decide complementing patterns

Ongoing

Create and refine system design and functional decomposition. Functionally decouple bulkheads (A good functional decomposition on business level is the prerequisite for an effective resilience)

SLIDE 100

Core

(Architectural)

Detection Treatment Prevention Recovery Mitigation

Isolation Redundancy Communication paradigm Supporting patterns Node level System level

Example: Erlang (Akka)

Monitor Messaging Actor Escalation Heartbeat Restart

(Let it crash)

Hot deployments

SLIDE 101

Core

(Architectural)

Detection Treatment Prevention Recovery Mitigation

Isolation Redundancy Communication paradigm Supporting patterns Node level System level

Example: Netflix

Monitor Request/response (Micro)Service Retry Zero downtime deployment

(Canary releases)

Fallback Share load Bounded queue Timeout Circuit breaker Several variants Error injection

SLIDE 102

Core

(Architectural)

Detection Treatment Prevention Recovery Mitigation

Isolation Redundancy Communication paradigm Supporting patterns Node level System level

What is your pattern language?

SLIDE 103

Wrap-up

Today’s systems are distributed
Failures are not avoidable
Failures are not predictable
Resilient software design needed
Rich pattern language
Start with core system properties
Choose patterns based on failure scenarios
Complement with careful functional design

SLIDE 104

@ufried

Uwe Friedrichsen | uwe.friedrichsen@codecentric.de | http://slideshare.net/ufried | http://ufried.tumblr.com

SLIDE 107