SLIDE 1

An Analysis of Network-Partitioning Failures in Cloud Systems

Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, Samer Al-Kiswany

SLIDE 2

Highlights

  • Network-partitioning failures are catastrophic, silent, and deterministic
  • Surprisingly, partial partitions cause a large number of failures
  • Debunk two common presumptions
    1. Admins believe that systems can tolerate network partitions
    2. Designers believe isolating one side of the partition is enough
  • NEAT: a network partitioning testing framework
  • Tested 7 systems → 32 failures

SLIDE 3

Motivation

  • High availability: systems should tolerate infrastructure failures (devices, nodes, network, data centers)

  • We focus on network partitioning
  • Partitioning faults are common (once every two weeks at Google [1], 70% of downtime at Microsoft [2], once every 4 days at CENIC [3])

  • Complex to handle


What is the impact of network partitions on modern systems?

[1] Govindan et al., "Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure", ACM SIGCOMM 2016
[2] Gill et al., "Understanding network failures in data centers: measurement, analysis, and implications", ACM SIGCOMM 2011
[3] Turner et al., "California fault lines: understanding the causes and impact of network failures", ACM SIGCOMM 2010

SLIDE 4

In-depth analysis of production failures

Studied end-to-end failure sequence

  • Study the impact of failures
  • Characterize conditions and sequence of events
  • Identify opportunities to improve fault tolerance

(Figure: the end-to-end failure sequence. A user workload runs against a new system configuration; a network partition occurs; the system reacts, for example with leader election or reconfiguration; a failure becomes visible to users.)

SLIDE 5

Methodology

  • Studied 136 high-impact network-partitioning failures from 25 systems

  • 104 failures are user-reported
  • 32 failures were discovered by NEAT
  • Studied failure reports, discussions, logs, code, and tests
  • Reproduced 24 failures to understand intricate details


SLIDE 6

Highlights

  • Network partitioning failures are catastrophic, silent, and easy to manifest
  • Surprisingly, partial partitions cause a large number of failures
  • Debunk two common presumptions
    1. Admins believe that systems can tolerate network partitions
    2. Designers believe isolating one side of the partition is enough
  • NEAT: a network partitioning testing framework
  • Tested 7 systems → 32 failures

SLIDE 7

Example – Dirty read in VoltDB

(Figure: the dirty-read sequence. Event 1: a network partition isolates the master on the minority side, and the replica on the majority side is elected as the new master. Event 2: a write to the minority side is applied locally by the old master, changing the key from X to Y. Event 3: a read from the minority side returns Y, a dirty read.)
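To make the three-event sequence concrete, here is a minimal test-style sketch in the spirit of the NEAT snippet shown later on SLIDE 21. The put/get calls and the client and server handles are hypothetical stand-ins for illustration, not VoltDB's or NEAT's actual API; only Partitioner.complete and Partitioner.heal mirror the deck's own example.

    // Hypothetical sketch of the dirty-read scenario; handles and put/get are assumed names.
    client2.put("key", "X");                        // replicated: master and replica both hold X

    // Event 1: partition the old master and client1 (minority) away from the replica and
    // client2 (majority); the majority side elects a new master.
    side1 = asList(oldMaster, client1);
    side2 = asList(replica, client2);
    netPart = Partitioner.complete(side1, side2);

    // Event 2: client1's write reaches only the minority side and is applied locally.
    client1.put("key", "Y");

    // Event 3: a read served by the minority returns the unreplicated value, a dirty read.
    assertEquals("Y", client1.get("key"));
    assertEquals("X", client2.get("key"));          // the majority never saw the write

    Partitioner.heal(netPart);

The value client1 observes was never accepted by the majority side, which is what makes the read dirty.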

SLIDE 8

Failure impact

  • Majority (80%) of the failures are catastrophic
  • Catastrophic failures include data loss, dirty reads, broken locks, double dequeue, and data corruption
  • Majority (90%) of the failures are silent

(Figure: the VoltDB dirty-read example again. Event 1: network partition; Event 2: write to minority; Event 3: read from minority.)

SLIDE 9

Timing and ordering

  • Surprisingly, partition failures are deterministic, silent, and catastrophic
  • 70% of the failures require 3 or fewer events (the VoltDB dirty read requires 3)
  • Multiple events should happen in a specific order
  • Majority (80%) are deterministic or have known timing constraints (in the dirty-read example, the write and read to the minority side must occur before the old master times out and shuts down)

(Figure: the dirty-read sequence annotated with its timing constraint. Event 1: network partition; Event 2: write to minority; Event 3: read from minority; all before the old master shuts down.)

SLIDE 10

Failure source

  • 59% of the failures are due to design flaws
  • Early design reviews can help
  • High-impact area that needs further research

(Chart: breakdown of failure sources across leader election, configuration change, data consolidation, request routing, replication protocol, and others; leader-election failures further break down into two leaders (57%), bad leader (20%), double voting (18%), and conflicting election (4%).)

SLIDE 11

Highlights

  • Network partitioning failures are catastrophic, silent, and easy to manifest
  • Surprisingly, partial partitions cause a large number of failures
  • Debunk two common presumptions
    1. Admins believe that systems can tolerate network partitions
    2. Designers believe isolating one side of the partition is enough
  • NEAT: a network partitioning testing framework
  • Tested 7 systems → 32 failures

SLIDE 12

Network partition types

  • Complete: the cluster splits into two groups that cannot communicate at all
  • Partial: two groups cannot reach each other, but a third group can still reach both
  • Simplex: traffic flows in one direction only

(Figure: a partial network partition; Group 1 and Group 2 are disconnected, while Group 3 can reach both.)

SLIDE 13

Partial network partition - double execution in MapReduce

(Figure: normal operation. The user submits a job to the Resource Manager, which starts an AppMaster on one of the NodeManagers; the AppMaster then runs the job's tasks on the NodeManagers.)

SLIDE 14

Partial network partition - double execution in MapReduce

  • A partial partition separates the AppMaster from the Resource Manager
  • The Resource Manager assumes the AppMaster has failed and starts another AppMaster
  • Double execution and data corruption

(Figure: the Resource Manager, cut off from the original AppMaster by the partition, starts a second AppMaster on another NodeManager while the first keeps running.)

SLIDE 15

Partial network partition - double execution in MapReduce

  • Both AppMasters run the same tasks on the NodeManagers: double execution and data corruption
  • Confuses the user

(Figure: with the partition still in place, the original AppMaster and the new AppMaster run side by side on the NodeManagers.)
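The root cause here is that the Resource Manager treats unreachability as proof of a crash, a pitfall called out again in the concluding remarks (SLIDE 24). The sketch below is a generic, hypothetical heartbeat-based failure detector in that style, not Hadoop's actual code; under a partial partition the AppMaster is alive but unreachable, so the check fires and a second AppMaster is started.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of a failure detector that equates silence with death. Under a
    // partial partition the AppMaster still runs, but its heartbeats cannot reach the
    // Resource Manager, so presumedFailed() returns true and a duplicate AppMaster is launched.
    class NaiveFailureDetector {
        private static final long TIMEOUT_MS = 10_000;
        private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

        void onHeartbeat(String appMasterId) {
            lastHeartbeat.put(appMasterId, System.currentTimeMillis());
        }

        boolean presumedFailed(String appMasterId) {
            Long last = lastHeartbeat.get(appMasterId);
            // Flawed assumption: "no heartbeat" is read as "crashed" rather than "unreachable".
            return last == null || System.currentTimeMillis() - last > TIMEOUT_MS;
        }
    }

Because both the original and the replacement AppMaster keep running, the job's tasks execute twice, which is the double execution and corruption shown above.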

SLIDE 16

Partial network partitioning

  • Partial partitioning leads to 28% of the failures
  • Leads to an inconsistent view of the system state
  • Affects leader election, scheduling, data placement, and configuration change
  • Partial partitions are poorly understood and tested

SLIDE 17

Highlights

  • Network partitioning failures are catastrophic, silent, and easy to manifest
  • Surprisingly, partial partitions cause a large number of failures
  • Debunk two common presumptions
    1. Admins believe that systems can tolerate network partitions
    2. Designers believe isolating one side of the partition is enough
  • NEAT: a network partitioning testing framework
  • Tested 7 systems → 32 failures

SLIDE 18

Debunks two presumptions

  • Presumption 1: Admins believe systems with data redundancy can tolerate partitioning
    Action: low priority for repairing ToR switches [1]
    Reality: 83% of the failures occur by isolating a single node
  • Presumption 2: Systems restrict client access to one side to eliminate failures
    Reality: 64% of the failures require no client access or access to one side only

[1] Gill et al., "Understanding network failures in data centers: measurement, analysis, and implications", ACM SIGCOMM 2011

SLIDE 19

Other findings

  • Failures in proven protocols are due to optimizations
  • Majority (83%) of the failures can be reproduced with 3 nodes
  • Majority (93%) of the failures can be reproduced through tests

SLIDE 20

Highlights

  • Network partitioning failures are catastrophic, silent, and easy to manifest
  • Surprisingly, partial partitions cause a large number of failures
  • Debunk two common presumptions
    1. Admins believe that systems can tolerate network partitions
    2. Designers believe isolating one side of the partition is enough
  • NEAT: a network partitioning testing framework
  • Tested 7 systems → 32 failures

SLIDE 21

NEtwork pArtitioning Testing framework (NEAT)

  • Supports all types of network partitions
  • Simple API

    // Test from the deck: reproducing the Apache Ignite double-locking failure with NEAT
    client1.createSemaphore(1);                    // create a distributed semaphore with one permit
    side1 = asList(S1, S2, client1);               // majority side: servers S1, S2 and client1
    side2 = asList(S3, client2);                   // minority side: server S3 and client2
    netPart = Partitioner.complete(side1, side2);  // inject a complete partition between the sides
    assertTrue(client1.sem_trywait());             // client1 acquires the only permit
    assertFalse(client2.sem_trywait());            // client2 must not acquire it too (double locking)
    Partitioner.heal(netPart);                     // heal the partition

(Figure: the Apache Ignite double-locking failure. A partition separates {S1, S2, Client1} from {S3, Client2}; both Client1 and Client2 call acquire() on the same semaphore and both succeed.)
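The snippet above exercises a complete partition. Since NEAT supports all partition types and SLIDE 12 lists complete, partial, and simplex, tests for the other two types would presumably look similar; the method names partial and simplex below are assumptions for illustration, as only complete and heal appear in the deck.

    // Hypothetical sketch: partial() and simplex() are assumed names, not confirmed NEAT API.
    side1 = asList(S1, client1);
    side2 = asList(S3, client2);

    // Partial partition: side1 and side2 cannot reach each other, while the remaining
    // nodes (e.g., S2) can still reach both sides.
    netPart = Partitioner.partial(side1, side2);
    Partitioner.heal(netPart);

    // Simplex partition: traffic flows in one direction only, e.g., side1 can send to
    // side2 but receives nothing back.
    netPart = Partitioner.simplex(side1, side2);
    Partitioner.heal(netPart);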

SLIDE 22

NEAT design

(Figure: NEAT architecture. A Test Engine coordinates Client Drivers on Client 1 and Client 2 and a Net Partitioner that controls connectivity among Server 1, Server 2, and Server 3; the Test Engine runs the target system and issues client operations.)

  • Orders client operations
  • Injects and heals partitions, using OpenFlow or iptables
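To make the iptables option concrete, here is a minimal sketch of how a partitioner could cut one node off from a set of peers by installing DROP rules, and remove them again to heal. It is an illustration under assumptions (it runs locally on the node with root privileges, and remote orchestration across the cluster is omitted), not NEAT's actual implementation.

    import java.io.IOException;
    import java.util.List;

    // Illustrative only, not NEAT's implementation: installs and removes iptables DROP rules
    // on the local node to block traffic to and from the given peer IPs.
    class IptablesPartitioner {
        // Inject: stop exchanging traffic with every peer on the other side of the partition.
        static void isolateFrom(List<String> peerIps) throws IOException, InterruptedException {
            for (String ip : peerIps) {
                run("iptables", "-A", "INPUT", "-s", ip, "-j", "DROP");
                run("iptables", "-A", "OUTPUT", "-d", ip, "-j", "DROP");
            }
        }

        // Heal: delete the rules installed above.
        static void healFrom(List<String> peerIps) throws IOException, InterruptedException {
            for (String ip : peerIps) {
                run("iptables", "-D", "INPUT", "-s", ip, "-j", "DROP");
                run("iptables", "-D", "OUTPUT", "-d", ip, "-j", "DROP");
            }
        }

        private static void run(String... cmd) throws IOException, InterruptedException {
            new ProcessBuilder(cmd).inheritIO().start().waitFor();
        }
    }

Running isolateFrom on every node of one group, with the other group's addresses as the peers, approximates a complete partition; applying it only to two groups while leaving a third group untouched approximates a partial partition.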

SLIDE 23

Testing with NEAT

  • We tested 7 systems using NEAT
  • Discovered 32 failures, 30 of them catastrophic
  • Confirmed: 12

System       # failures found
ActiveMQ     2
Ceph         2
Ignite       15
Infinispan   1
Terracotta   9
MooseFS      2
DKron        1

SLIDE 24

Concluding remarks

  • Further research is needed on network-partition fault tolerance, especially partial partitions

  • Highlight the danger of using unreachability as an indicator of a node crash
  • Identify ordering, timing, and network characteristics to simplify testing
  • Identify common pitfalls for developers and admins
  • NEAT: network partitioning testing framework

https://dsl.uwaterloo.ca/projects/neat/
