Distributed Systems [COMP9243] Lecture 8: Fault Tolerance


  1. Slide 1: DISTRIBUTED SYSTEMS [COMP9243]
     Lecture 8: Fault Tolerance
     ➀ Failure
     ➁ Reliable Communication
     ➂ Process Resilience
     ➃ Recovery

     Slide 2: DEPENDABILITY
     Availability: system is ready to be used immediately
     Reliability: system can run continuously without failure
     Safety: when a system (temporarily) fails to operate correctly, nothing catastrophic happens
     Maintainability: how easily a failed system can be repaired
     Building a dependable system comes down to controlling failure and faults.

     Slide 3: CASE STUDY: AWS FAILURE 2011
     ➜ April 21, 2011
     ➜ EBS (Elastic Block Store) in US East region unavailable for about 2 days
     ➜ 13% of volumes in one availability zone got stuck
     ➜ led to control API errors and outage in whole region
     ➜ led to problems with EC2 instances and RDS in most popular region
     ➜ due to reconfig error and re-mirroring storm
     ➜ http://aws.amazon.com/message/65648/

     Slide 4:
     AWS EBS Overview:
     ➜ Region → Availability Zones
     ➜ Clusters → Nodes → Volumes
     ➜ Volume: replicated in cluster
     ➜ Control Plane Services: API for volumes for whole region
     ➜ Networks: primary, secondary
     [Figure labels: Region, Availability Zone (AZ), EBS Cluster, Node, Control Plane, server]
     What happened?:
     ➜ US east AZ
     ➜ network config problem
     ➜ re-mirroring storm
     ➜ CP API thread starvation
     ➜ node race condition
     ➜ CP election overload
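The availability definition on Slide 2 can be put into numbers for the case study on Slide 3. The sketch below is only an illustration: it assumes availability is measured as the fraction of an observation window during which the service is usable, that the window is one year, and that the roughly 2-day EBS outage is the only downtime. The function name `availability` is hypothetical.

```python
def availability(downtime_hours, window_hours):
    """Fraction of the observation window in which the service was usable."""
    if window_hours <= 0 or not 0 <= downtime_hours <= window_hours:
        raise ValueError("need 0 <= downtime_hours <= window_hours and window_hours > 0")
    return 1.0 - downtime_hours / window_hours


# Assumption: about 2 days (48 h) of downtime in a 365-day window.
affected_volume = availability(downtime_hours=48, window_hours=365 * 24)
print(f"{affected_volume:.4%}")  # 99.4521%
```

Under these assumptions a single two-day outage already puts an affected volume below "three nines" for the year.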

  2. Slide 5:
     Solution:
     ➜ Disconnect bad cluster
     ➜ Throttle re-mirroring
     ➜ Add more disk space
     ➜ Slowly un-throttle re-mirroring
     ➜ Volumes unstuck → reconnect cluster
     ➜ 0.07% data lost
     Lessons learned:
     ➜ Back off
     ➜ Re-establish connectivity to previous replicas
     ➜ Shorter timeouts
     ➜ Snapshot stuck volumes
     ➜ CP: one AZ shouldn’t crash another AZ
     ➜ Make it easier to use multiple AZs

     Slide 6: FAILURE
     Terminology:
     Failure: a system fails when it does not meet its promises or cannot provide its services in the specified manner
     Error: part of the system state that leads to failure (i.e., it differs from its intended value)
     Fault: the cause of an error (results from design errors, manufacturing faults, deterioration, or external disturbance)
     Recursive:
     ➜ Failure can be a fault
     ➜ Manufacturing fault leads to disk failure
     ➜ Disk failure is a fault that leads to database failure
     ➜ Database failure is a fault that leads to email service failure

     Slide 7: TOTAL VS PARTIAL FAILURE
     Total Failure: all components in a system fail
     ➜ Typical in non-distributed system
     Partial Failure: one or more (but not all) components in a distributed system fail
     ➜ Some components affected
     ➜ Other components completely unaffected
     ➜ Considered as fault for the whole system

     Slide 8: CATEGORISING FAULTS AND FAILURES
     Types of Faults:
     Transient Fault: occurs once, then disappears
     Intermittent Fault: occurs, vanishes, reoccurs, vanishes, etc.
     Permanent Fault: persists until faulty component is replaced
     Types of Failures:
     Process Failure: process proceeds incorrectly or not at all
     Storage Failure: secondary storage is inaccessible
     Communication Failure: communication link or node failure
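The "Back off" lesson on Slide 5 is commonly realised as exponential backoff with random jitter, so that many clients retrying at once do not produce another storm. The slides do not say how AWS implemented it; the sketch below is one possible form, and the names `backoff_delay` and `retry`, the default parameters, and the choice of `ConnectionError` as the retryable error are all assumptions.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    The upper bound doubles with each attempt up to `cap`; the actual delay
    is drawn uniformly below that bound ("full jitter").
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def retry(operation, max_attempts=5, sleep=time.sleep, **backoff_args):
    """Call `operation` until it succeeds or `max_attempts` calls have failed."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up and report the failure to the caller
            sleep(backoff_delay(attempt, **backoff_args))
```

Passing `sleep` and `rng` as parameters keeps the sketch testable without real waiting, for example `retry(op, sleep=delays.append, rng=lambda: 1.0)`.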

  3. Slide 9: FAILURE MODELS
     Crash Failure: a server halts, but works correctly until it halts
       Fail-Stop: server will stop in a way that clients can tell that it has halted.
       Fail-Resume: server will stop, then resume execution at a later time.
       Fail-Silent: clients do not know server has halted
     Omission Failure: a server fails to respond to incoming requests
       Receive Omission: fails to receive incoming messages
       Send Omission: fails to send messages

     Slide 10:
     Response Failure: a server’s response is incorrect
       Value Failure: the value of the response is wrong
       State Transition Failure: the server deviates from the correct flow of control
     Timing Failure: a server’s response lies outside the specified time interval
     Arbitrary Failure: a server may produce arbitrary response at arbitrary times (aka Byzantine failure)

     Slide 11: DETECTING FAILURE
     Failure Detector:
     ➜ Service that detects process failures
     ➜ Answers queries about status of a process
     Reliable:
     ➜ Failed – crashed
     ➜ Unsuspected – hint
     Unreliable:
     ➜ Suspected – may still be alive
     ➜ Unsuspected – hint

     Slide 12:
     Synchronous systems:
     ➜ Timeout
     ➜ Failure detector sends probes to detect crash failures
     Asynchronous systems:
     • Timeout gives no guarantees
     ➜ Failure detector can track suspected failures
     ➜ Combine results from multiple detectors
     • How to distinguish communication failure from process failure?
     ➜ Ignore messages from suspected processes
     • Turn an asynchronous system into a synchronous one
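The unreliable failure detector of Slides 11 and 12 can be sketched with heartbeats and a timeout. This is a minimal illustration, not a production design: the class name `HeartbeatFailureDetector`, the helper `combined_status`, and the majority rule for combining detectors are assumptions. Time is passed in explicitly so the behaviour is deterministic.

```python
class HeartbeatFailureDetector:
    """Unreliable detector: answers "suspected" or "unsuspected", never "failed"."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process id -> time of the last heartbeat

    def heartbeat(self, process_id, now):
        """Record that a heartbeat (or probe reply) arrived at time `now`."""
        self.last_seen[process_id] = now

    def status(self, process_id, now):
        """A process is suspected if it was never heard from or has been silent too long."""
        last = self.last_seen.get(process_id)
        if last is None or now - last > self.timeout:
            return "suspected"  # it may still be alive, only slow or unreachable
        return "unsuspected"


def combined_status(detectors, process_id, now):
    """Combine several detectors: suspect only if a strict majority suspects."""
    votes = sum(d.status(process_id, now) == "suspected" for d in detectors)
    return "suspected" if votes > len(detectors) / 2 else "unsuspected"
```

In an asynchronous system a late heartbeat simply moves the process back to "unsuspected", which is why the answer is only ever a hint.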

  4. Slide 13: FAULT TOLERANCE
     Fault Tolerance:
     ➜ System can provide its services even in the presence of faults
     Goal:
     ➜ Automatically recover from partial failure
     ➜ Without seriously affecting overall performance
     Techniques:
     ➜ Prevention: prevent or reduce occurrence of faults
     ➜ Masking: hide the occurrence of the fault
     ➜ Prediction: predict the faults that can occur and deal with them
     ➜ Recovery: restore an erroneous state to an error-free state

     Slide 14: FAILURE PREVENTION
     Make sure faults don’t happen:
     ➜ Quality hardware
     ➜ Hardened hardware
     ➜ Quality software

     Slide 15: FAILURE PREDICTION
     Deal with expected faults:
     ➜ Test for error conditions
     ➜ Error handling code
     ➜ Error correcting codes
       • checksums
       • erasure codes

     Slide 16: FAILURE MASKING
     Try to hide occurrence of failures from other processes
     Mask:
     ➀ Communication Failure → Reliable Communication
     ➁ Process Failure → Process Resilience
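The two coding techniques listed on Slide 15 can be illustrated briefly. The sketch below assumes a CRC-32 checksum (which detects corruption but does not repair it) and the simplest possible erasure code, a single XOR parity block that can rebuild one lost block of equal length. All function names are hypothetical; real systems typically use stronger codes.

```python
import zlib


def add_checksum(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify_checksum(frame: bytes) -> bytes:
    """Return the payload if the CRC matches, otherwise raise ValueError."""
    if len(frame) < 4:
        raise ValueError("frame too short")
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != received:
        raise ValueError("checksum mismatch")
    return payload


def xor_parity(blocks):
    """XOR equal-length blocks together byte by byte."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity block."""
    return xor_parity(list(surviving_blocks) + [parity])
```

For example, with blocks `b"abcd"`, `b"efgh"`, `b"ijkl"` and their parity, `recover_missing([b"abcd", b"ijkl"], parity)` yields `b"efgh"`.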

  5. Slide 17:
     Redundancy:
     ➜ Information redundancy
     ➜ Time redundancy
     ➜ Physical redundancy
     [Figure: (a) components A, B, C in series; (b) the same circuit with replicas A1 to A3, B1 to B3, C1 to C3 and voters V1 to V9]

     Slide 18: RELIABLE COMMUNICATION
     ➜ Communication channel experiences failure
     ➜ Focus on masking crash (lost/broken connections) and omission failures

     Slide 19:
     Two Army Problem:
     Non-faulty processes but lossy communication.
     [Figure: armies 1 and 2 with 3000 troops each, and an opposing army of 5000]
     ➜ 1 → 2 attack!
     ➜ 2 → 1 ack
     ➜ 2: did 1 get my ack?
     ➜ 1 → 2 ack ack
     ➜ 1: did 2 get my ack ack?
     ➜ etc.
     Consensus with lossy communication is impossible.
     Why does TCP work?

     Slide 20: RELIABLE POINT-TO-POINT COMMUNICATION
     ➜ Reliable transport protocol (e.g., TCP)
     • Masks omission failure (lost messages)
     • Not crash failure
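Slide 20 says a reliable transport masks omission failures but not crash failures. A stop-and-wait scheme shows the idea in miniature: the sender retransmits until it sees an acknowledgement (time redundancy), and the receiver uses sequence numbers to discard duplicates. This is a simplified simulation, not how TCP is implemented; `Receiver`, `LossyLink`, `send_reliably` and the scripted loss events are all assumptions made for the example.

```python
class Receiver:
    """Delivers each sequence number once and acknowledges every copy it sees."""

    def __init__(self):
        self.delivered = []
        self.expected_seq = 0

    def on_message(self, seq, payload):
        if seq == self.expected_seq:  # new message: deliver it
            self.delivered.append(payload)
            self.expected_seq += 1
        return seq  # duplicates are acknowledged again but not re-delivered


class LossyLink:
    """Scripted channel: each transmission is "ok", "lose_msg" or "lose_ack"."""

    def __init__(self, receiver, events):
        self.receiver = receiver
        self.events = iter(events)

    def transmit(self, seq, payload):
        event = next(self.events, "ok")
        if event == "lose_msg":
            return None  # the message never arrives
        ack = self.receiver.on_message(seq, payload)
        return None if event == "lose_ack" else ack


def send_reliably(link, seq, payload, max_attempts=5):
    """Retransmit until acknowledged; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if link.transmit(seq, payload) == seq:
            return attempt
    # Omissions can be masked by retrying; a crashed peer cannot.
    raise TimeoutError("no acknowledgement; the peer may have crashed")
```

With the event script `["lose_msg", "lose_ack", "ok"]` the sender needs three attempts, yet the receiver delivers the payload exactly once. The sender still never learns whether its final message was received without an acknowledgement, which is the situation the two-army exchange on Slide 19 describes.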

  6. Slide 21:
     Example: Failure and RPC:
     Possible failures:
     ➜ Client cannot locate server
     ➜ Request message to server is lost
     ➜ Server crashes after receiving a request
     ➜ Reply message from server is lost
     ➜ Client crashes after sending a request
     How to deal with the various kinds of failure?

     Slide 22: RELIABLE GROUP COMMUNICATION
     [Figure: (a) a sender with a history buffer multicasts M25 to four receivers whose last received messages are 24, 24, 23 and 24; (b) three receivers return ACK 25, while the receiver that missed message #24 reports "Missed 24"]

     Slide 23: SCALABILITY OF RELIABLE MULTICAST
     Feedback Implosion: sender is swamped with feedback messages
     Nonhierarchical Multicast:
     ➜ Use NACKs
     ➜ Feedback suppression: NACKs multicast to everyone
     ➜ Prevents other receivers from sending NACKs if they’ve already seen one.
     • Reduces (N)ACK load on server
     • Receivers have to be coordinated so they don’t all multicast NACKs at same time
     • Multicasting feedback also interrupts processes that successfully received message

     Slide 24:
     Hierarchical Multicast:
     [Figure: sender S, coordinators C and root R; receivers on local-area networks joined by (long-haul) connections]
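Feedback suppression on Slide 23 depends on receivers not firing their NACKs simultaneously. A common way to coordinate them, assumed here, is for each receiver that missed a message to wait a random time before multicasting its NACK and to cancel if another NACK arrives first. The simulation below is a sketch under that assumption; `nack_senders`, `random_timers`, and the single fixed propagation delay are hypothetical simplifications.

```python
import random


def random_timers(receivers, max_delay_ms, rng):
    """Give every receiver that missed the message a random NACK timer."""
    return {r: rng.randrange(max_delay_ms) for r in receivers}


def nack_senders(timers_ms, propagation_ms):
    """Return the receivers that end up multicasting a NACK.

    The earliest timer always fires. Any other receiver also sends if its
    timer expires before the first NACK has had time to reach it.
    """
    if not timers_ms:
        return []
    first_fire = min(timers_ms.values())
    return sorted(
        receiver
        for receiver, timer in timers_ms.items()
        if timer == first_fire or timer < first_fire + propagation_ms
    )


timers = random_timers(["r1", "r2", "r3", "r4"], max_delay_ms=100, rng=random.Random(1))
print(nack_senders(timers, propagation_ms=10))
```

If all timers are equal, every receiver sends and suppression has no effect, which is the coordination problem the slide points out.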

  7. Slide 25: PROCESS RESILIENCE
     Protection against process failures
     k Fault Tolerance:
     ➜ can survive faults in k components and still meet its specifications
     ➜ k + 1 replicas enough if fail-silent (or fail-stop)
     ➜ 2k + 1 replicas required if Byzantine

     Slide 26:
     Groups:
     ➜ Organise identical processes into groups
       • Process groups are dynamic
       • Processes can be members of multiple groups
       • Mechanisms for managing groups and group membership
     ➜ Deal with all processes in a group as a single abstraction
     Flat vs Hierarchical Groups:
     ➜ Flat group: all decisions made collectively
     ➜ Hierarchical group: coordinator makes decisions

     Slide 27: REPLICATION
     Create groups using replication
     Primary-Based:
     ➜ Primary-backup
     ➜ Hierarchical group
     ➜ If primary crashes others elect a new primary
     Replicated-Write:
     ➜ Active replication or Quorum
     ➜ Flat group
     ➜ Ordering of requests (atomic multicast problem)

     Slide 28: STATE MACHINE REPLICATION
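The replica counts on Slide 25 follow from how a client can use the replies. With fail-silent replicas any reply that arrives is correct, so one survivor out of k + 1 is enough. With Byzantine replicas a wrong reply looks like any other, so the k faulty replies must be outvoted by k + 1 correct ones. The sketch below illustrates that reasoning; `replicas_needed` and `vote` are hypothetical names, and majority voting on replies is the assumed masking mechanism.

```python
from collections import Counter


def replicas_needed(k, byzantine=False):
    """Replicas required to tolerate k faulty ones, per the slide's two cases."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if byzantine else k + 1


def vote(responses):
    """Return the value reported by a strict majority of replicas."""
    if not responses:
        raise ValueError("no responses")
    value, count = Counter(responses).most_common(1)[0]
    if count <= len(responses) // 2:
        raise ValueError("no majority")
    return value


# One Byzantine replica (k = 1) among 2k + 1 = 3 is outvoted.
print(replicas_needed(1, byzantine=True), vote([42, 42, 7]))  # 3 42
```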
