An Analysis of Network-Partitioning Failures in Cloud Systems


  1. An Analysis of Network-Partitioning Failures in Cloud Systems
Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, Samer Al-Kiswany

  2. Highlights
• Network-partitioning failures are catastrophic, silent, and deterministic
• Surprisingly, partial partitions cause a large number of failures
• Debunk two common presumptions:
  1. Admins believe that systems can tolerate network partitions
  2. Designers believe isolating one side of the partition is enough
• NEAT: a network-partitioning testing framework
• Tested 7 systems → 32 failures

  3. Motivation
• High availability: systems should tolerate infrastructure failures (devices, nodes, network, data centers)
• We focus on network partitioning
• Partitioning faults are common: once every two weeks at Google [1], 70% of downtime at Microsoft [2], once every 4 days at CENIC [3]
• Complex to handle
What is the impact of network partitions on modern systems?
[1] Govindan et al., "Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure", ACM SIGCOMM 2016
[2] Gill et al., "Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications", ACM SIGCOMM 2011
[3] Turner et al., "California Fault Lines: Understanding the Causes and Impact of Network Failures", ACM SIGCOMM 2010

  4. In-depth analysis of production failures
[Diagram: the end-to-end failure sequence studied. System configuration and user workload → network partition → system reaction (leader election, reconfiguration, …) → failure visible to users]
• Study the impact of failures
• Characterize conditions and sequence of events
• Identify opportunities to improve fault tolerance

  5. Methodology
• Studied 136 high-impact network-partitioning failures from 25 systems
• 104 failures are user-reported; 32 failures were discovered by NEAT
• Studied failure reports, discussions, logs, code, and tests
• Reproduced 24 failures to understand intricate details

  6. Highlights
• Network-partitioning failures are catastrophic, silent, and easy to manifest
• Surprisingly, partial partitions cause a large number of failures
• Debunk two common presumptions:
  1. Admins believe that systems can tolerate network partitions
  2. Designers believe isolating one side of the partition is enough
• NEAT: a network-partitioning testing framework
• Tested 7 systems → 32 failures

  7. Example – Dirty read in VoltDB
[Diagram: three replicas holding key = X; a network partition isolates the old master in the minority, and a majority replica is elected as the new master]
• Event 1: Network partition (triggers a leader election)
• Event 2: Write to minority: the old master sets key = Y, applying the update locally only
• Event 3: Read from minority: the client reads the uncommitted value (a dirty read)
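
A minimal sketch of how this three-event sequence could be scripted as a test, using the NEAT-style Partitioner API that appears later in the deck (slide 21). The client handle and its put/get calls are hypothetical placeholders for a VoltDB client, not actual VoltDB API:

    // Hypothetical reproduction sketch (NEAT-style API from slide 21;
    // client.put/get stand in for a real VoltDB client).
    List<Object> side1 = asList(S1, client);   // minority: the old master S1
    List<Object> side2 = asList(S2, S3);       // majority: elects a new master

    // Event 1: network partition.
    NetPart netPart = Partitioner.complete(side1, side2);

    // Event 2: write to the minority; the old master applies it locally only.
    client.put("key", "Y");

    // Event 3: read from the minority before the old master shuts down.
    // The value was never replicated to the majority, so this is a dirty
    // read: the update is discarded once the partition heals.
    assertEquals("Y", client.get("key"));

    Partitioner.heal(netPart);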

  8. Failure impact
[Diagram: the same VoltDB scenario. Event 1: network partition; Event 2: write to minority; Event 3: read from minority, yielding a dirty read]
• Catastrophic failures: data loss, dirty reads, broken locks, double dequeues, corruption
• The majority (80%) of the failures are catastrophic
• The majority (90%) of the failures are silent

  9. Timing and ordering
[Diagram: the same VoltDB scenario, annotated with its timing constraint: the read must occur before the old master shuts down]
• The example requires 3 events
• 70% of the failures require 3 or fewer events
• Multiple events must happen in a specific order, some within a timeout
• The majority (80%) are deterministic or have known timing constraints
Surprisingly, partition failures are deterministic, silent, and catastrophic

  10. Failure source
[Two pie charts: the sources of failures (leader election 40%; also configuration change, data consolidation, request routing, replication protocol, and others) and a breakdown of the leader-election failures (two leaders 57%, bad leader 20%, double voting 18%, conflicting election 4%)]
• 59% of the failures are due to design flaws
• Early design reviews can help
• High-impact area that needs further research

  11. Highlights
• Network-partitioning failures are catastrophic, silent, and easy to manifest
• Surprisingly, partial partitions cause a large number of failures
• Debunk two common presumptions:
  1. Admins believe that systems can tolerate network partitions
  2. Designers believe isolating one side of the partition is enough
• NEAT: a network-partitioning testing framework
• Tested 7 systems → 32 failures

  12. Partial network partitioning
[Diagram: three groups of nodes; the partition separates Group 1 and Group 2, while Group 3 can still reach both]
Network partition types:
• Complete
• Partial
• Simplex
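
To make the three types concrete, here is a small illustrative model of which node pairs can communicate under each type. This is my own sketch of the definitions, not NEAT code; groups are numbered as in the diagram:

    // Illustrative model of the three partition types (not NEAT code).
    // xxxCanSend(a, b): can a node in group a deliver packets to group b?
    class PartitionTypes {
        // Complete: the two sides are fully disconnected from each other.
        static boolean completeCanSend(int a, int b) {
            return a == b;   // with two groups, only same-side traffic flows
        }

        // Partial: groups 1 and 2 cannot reach each other, but group 3 still
        // reaches both sides, so nodes end up with inconsistent views.
        static boolean partialCanSend(int a, int b) {
            return !((a == 1 && b == 2) || (a == 2 && b == 1));
        }

        // Simplex: traffic flows in one direction only (1 -> 2 but not 2 -> 1).
        static boolean simplexCanSend(int a, int b) {
            return !(a == 2 && b == 1);
        }
    }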

  13. Partial network partition - double execution in MapReduce
[Diagram: normal operation. The user submits a job, the Resource Manager starts an AppMaster on a NodeManager, and the AppMaster runs tasks on the other NodeManagers]

  14. Partial network partition - double execution in MapReduce
[Diagram: a partial partition separates the AppMaster from the Resource Manager; the Resource Manager concludes the AppMaster has failed and starts another AppMaster on a reachable NodeManager]
• Double execution and data corruption

  15. Partial network partition - double execution in MapReduce
[Diagram: both AppMasters keep running on opposite sides of the partial partition and execute the same tasks]
• Double execution and data corruption
• Confuses the user
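
The root cause generalizes beyond MapReduce: the Resource Manager equates unreachability with a crash. A generic, hedged sketch of that hazard (not Hadoop's actual code) looks like this:

    // Generic sketch of the hazard (not YARN's real implementation): a
    // monitor that treats "no heartbeat" as "crashed" starts a second
    // AppMaster even though the first is alive behind a partial partition.
    class AppMasterMonitor {
        private static final long TIMEOUT_MS = 10_000;
        private volatile long lastHeartbeatMs = System.currentTimeMillis();

        void onHeartbeat() {
            lastHeartbeatMs = System.currentTimeMillis();
        }

        void checkLiveness() {
            if (System.currentTimeMillis() - lastHeartbeatMs > TIMEOUT_MS) {
                // Wrong inference: unreachable != dead. The old AppMaster may
                // still be running tasks on the other side of the partition,
                // so this launch causes double execution.
                startReplacementAppMaster();
            }
        }

        void startReplacementAppMaster() { /* launch a new AppMaster */ }
    }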

  16. Partial network partitioning
[Diagram: a partial partition; Group 1 and Group 2 are disconnected from each other, while Group 3 reaches both]
• Partial partitioning leads to 28% of the failures
• Affects leader election, scheduling, data placement, and configuration change
• Leads to an inconsistent view of the system state
• Partial partitions are poorly understood and tested

  17. Highlights
• Network-partitioning failures are catastrophic, silent, and easy to manifest
• Surprisingly, partial partitions cause a large number of failures
• Debunk two common presumptions:
  1. Admins believe that systems can tolerate network partitions
  2. Designers believe isolating one side of the partition is enough
• NEAT: a network-partitioning testing framework
• Tested 7 systems → 32 failures

  18. Debunking two presumptions
• Admins believe systems with data redundancy can tolerate partitioning → Action: low priority for repairing ToR switches [1]
  Reality: 83% of the failures occur by isolating a single node
• Systems restrict client access to one side to eliminate failures
  Reality: 64% of the failures require no client access or access to one side only
[1] Gill et al., "Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications", ACM SIGCOMM 2011
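
Since isolating a single node triggers 83% of the failures, a test for the first presumption can be as small as the following sketch, written in the style of the NEAT API shown later on slide 21 (the server and client handles are placeholders):

    // Isolating a single node is enough to trigger 83% of the studied
    // failures, e.g. one server cut off behind a failed ToR switch.
    side1 = asList(S1);                 // a single isolated node
    side2 = asList(S2, S3, client);     // the rest of the cluster
    netPart = Partitioner.complete(side1, side2);
    // ... run the workload and check system invariants ...
    Partitioner.heal(netPart);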

  19. Other findings
• Failures in proven protocols are due to optimizations
• Majority (83%) of the failures can be reproduced with 3 nodes
• Majority (93%) of the failures can be reproduced through tests

  20. Highlights
• Network-partitioning failures are catastrophic, silent, and easy to manifest
• Surprisingly, partial partitions cause a large number of failures
• Debunk two common presumptions:
  1. Admins believe that systems can tolerate network partitions
  2. Designers believe isolating one side of the partition is enough
• NEAT: a network-partitioning testing framework
• Tested 7 systems → 32 failures

  21. NEtwork pArtitioning Testing framework (NEAT)
• Supports all types of network partitions
• Simple API
[Diagram: Apache Ignite double-locking failure; a network partition splits {S1, S2, Client1} from {S3, Client2}, and both clients can acquire() the same semaphore]
Example test:
    client1.createSemaphore(1);
    side1 = asList(S1, S2, client1);
    side2 = asList(S3, client2);
    netPart = Partitioner.complete(side1, side2);
    assertTrue(client1.sem_trywait());
    assertFalse(client2.sem_trywait());
    Partitioner.heal(netPart);
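
Because NEAT supports all partition types, a partial-partition test would plausibly mirror the snippet above. Note that Partitioner.partial below is an assumed method name modeled on Partitioner.complete, not confirmed NEAT API:

    // Hypothetical partial-partition variant (method name assumed).
    // Only group1 <-> group2 traffic is cut; S3 still sees both sides.
    group1 = asList(S1, client1);
    group2 = asList(S2, client2);
    netPart = Partitioner.partial(group1, group2);   // S3 remains a bridge
    // ... exercise operations where the two sides disagree on liveness ...
    Partitioner.heal(netPart);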

  22. NEAT design
[Diagram: a test engine drives a client driver and a network partitioner; Client 1 and Client 2 issue operations against Server 1, Server 2, and Server 3, which run the target system]
• Client driver: issues and orders client operations
• Network partitioner: injects and heals partitions, via OpenFlow or iptables
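
The slide names iptables as one injection mechanism. The exact commands NEAT issues are not shown in the deck, but a minimal sketch of the rules such a partitioner might install on each server is:

    import static java.util.Arrays.asList;
    import java.util.List;

    // Sketch: cutting the link between this server and a peer with iptables
    // (one of the two mechanisms named on this slide). In practice these
    // commands would run on each server, e.g. over SSH; here we only build them.
    class IptablesPartitioner {
        static List<String> cutLink(String peerIp) {
            return asList(
                "iptables -A INPUT -s " + peerIp + " -j DROP",    // drop inbound from peer
                "iptables -A OUTPUT -d " + peerIp + " -j DROP");  // drop outbound to peer
        }

        static List<String> healLink(String peerIp) {
            return asList(
                "iptables -D INPUT -s " + peerIp + " -j DROP",    // remove the rules
                "iptables -D OUTPUT -d " + peerIp + " -j DROP");
        }
    }

Applying both rules on every cross-group pair yields a complete partition; applying them on only some of the nodes gives a partial partition, and dropping only one direction gives a simplex partition.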

  23. Testing with NEAT
• We tested 7 systems using NEAT
• Discovered 32 failures → 30 catastrophic
• Confirmed: 12

    System       # failures
    ActiveMQ     2
    Ceph         2
    Ignite       15
    Infinispan   1
    Terracotta   9
    MooseFS      2
    DKron        1

  24. Concluding remarks
• Further research is needed on network-partition fault tolerance, especially partial partitions
• Highlight the danger of using unreachability as an indicator of a node crash
• Identify ordering, timing, and network characteristics to simplify testing
• Identify common pitfalls for developers and admins
• NEAT: a network-partitioning testing framework (https://dsl.uwaterloo.ca/projects/neat/)
