SLIDE 1

An Analysis of Network-Partitioning Failures in Cloud Systems

Ahmed Alquraan, Hatem Takruri, Mohammed Alfatafta, Samer Al-Kiswany

SLIDE 2

Highlights

  • Network-partitioning failures are catastrophic, silent, and deterministic
  • Surprisingly, partial partitions cause a large number of failures
  • Debunk two common presumptions
    1. Admins believe that systems can tolerate network partitions
    2. Designers believe isolating one side of the partition is enough
  • NEAT: a network partitioning testing framework
  • Tested 7 systems → 32 failures

SLIDE 3

Motivation

  • High availability: systems should tolerate infrastructure failures (devices, nodes, network, data centers)

  • We focus on network partitioning
  • Partitioning faults are common (once every two weeks at Google [1], 70% of downtime at Microsoft [2], once every 4 days at CENIC [3])

  • Complex to handle


What is the impact of network partitions on modern systems?

[1] Govindan et al., "Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure", ACM SIGCOMM 2016
[2] Gill et al., "Understanding network failures in data centers: measurement, analysis, and implications", ACM SIGCOMM 2011
[3] Turner et al., "California fault lines: understanding the causes and impact of network failures", ACM SIGCOMM 2010

SLIDE 4

In-depth analysis of production failures

Studied end-to-end failure sequence

  • Study the impact of failures
  • Characterize conditions and sequence of events
  • Identify opportunities to improve fault tolerance

(Figure: the end-to-end failure sequence. A user workload runs against a new system configuration; a network partition occurs; the system reacts, for example with leader election or reconfiguration; a failure becomes visible to users.)

SLIDE 5

Methodology

  • Studied 136 high-impact network-partitioning failures from 25 systems

  • 104 failures are user-reported
  • 32 failures were discovered by NEAT
  • Studied failure reports, discussions, logs, code, and tests
  • Reproduced 24 failures to understand intricate details


SLIDE 6

Highlights

  • Network partitioning failures are catastrophic, silent, and easy to manifest
  • Surprisingly, partial partitions cause a large number of failures
  • Debunk two common presumptions
    1. Admins believe that systems can tolerate network partitions
    2. Designers believe isolating one side of the partition is enough
  • NEAT: a network partitioning testing framework
  • Tested 7 systems → 32 failures

SLIDE 7

Example – Dirty read in VoltDB

(Figure: the dirty-read sequence. Event 1: a network partition isolates the master on the minority side, and the replica on the majority side is elected as the new master. Event 2: a write to the minority side is applied locally by the old master, changing the key from X to Y. Event 3: a read from the minority side returns Y, a dirty read.)
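To make the three-event sequence concrete, here is a minimal test-style sketch in the spirit of the NEAT snippet shown later on SLIDE 21. The put/get calls and the client and server handles are hypothetical stand-ins for illustration, not VoltDB's or NEAT's actual API; only Partitioner.complete and Partitioner.heal mirror the deck's own example.

    // Hypothetical sketch of the dirty-read scenario; handles and put/get are assumed names.
    client2.put("key", "X");                        // replicated: master and replica both hold X

    // Event 1: partition the old master and client1 (minority) away from the replica and
    // client2 (majority); the majority side elects a new master.
    side1 = asList(oldMaster, client1);
    side2 = asList(replica, client2);
    netPart = Partitioner.complete(side1, side2);

    // Event 2: client1's write reaches only the minority side and is applied locally.
    client1.put("key", "Y");

    // Event 3: a read served by the minority returns the unreplicated value, a dirty read.
    assertEquals("Y", client1.get("key"));
    assertEquals("X", client2.get("key"));          // the majority never saw the write

    Partitioner.heal(netPart);

The value client1 observes was never accepted by the majority side, which is what makes the read dirty.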

SLIDE 8

Failure impact

  • Majority (80%) of the failures are catastrophic
  • Catastrophic failures include data loss, dirty reads, broken locks, double dequeue, and data corruption
  • Majority (90%) of the failures are silent

(Figure: the VoltDB dirty-read example again. Event 1: network partition; Event 2: write to minority; Event 3: read from minority.)

SLIDE 9

Timing and ordering

  • Surprisingly, partition failures are deterministic, silent, and catastrophic
  • 70% of the failures require 3 or fewer events (the VoltDB dirty read requires 3)
  • Multiple events should happen in a specific order
  • Majority (80%) are deterministic or have known timing constraints (in the dirty-read example, the write and read to the minority side must occur before the old master times out and shuts down)

(Figure: the dirty-read sequence annotated with its timing constraint. Event 1: network partition; Event 2: write to minority; Event 3: read from minority; all before the old master shuts down.)

SLIDE 10

Failure source

  • 59% of the failures are due to design flaws
  • Early design reviews can help
  • High-impact area that needs further research

(Chart: breakdown of failure sources across leader election, configuration change, data consolidation, request routing, replication protocol, and others; leader-election failures further break down into two leaders (57%), bad leader (20%), double voting (18%), and conflicting election (4%).)

SLIDE 11

Highlights

  • Network partitioning failures are catastrophic, silent, and easy to manifest
  • Surprisingly, partial partitions cause a large number of failures
  • Debunk two common presumptions
    1. Admins believe that systems can tolerate network partitions
    2. Designers believe isolating one side of the partition is enough
  • NEAT: a network partitioning testing framework
  • Tested 7 systems → 32 failures

SLIDE 12

Network partition types

  • Complete: the cluster splits into two groups that cannot communicate at all
  • Partial: two groups cannot reach each other, but a third group can still reach both
  • Simplex: traffic flows in one direction only

(Figure: a partial network partition; Group 1 and Group 2 are disconnected, while Group 3 can reach both.)

SLIDE 13

Partial network partition - double execution in MapReduce

(Figure: normal operation. The user submits a job to the Resource Manager, which starts an AppMaster on one of the NodeManagers; the AppMaster then runs the job's tasks on the NodeManagers.)

SLIDE 14

Partial network partition - double execution in MapReduce

  • A partial partition separates the AppMaster from the Resource Manager
  • The Resource Manager assumes the AppMaster has failed and starts another AppMaster
  • Double execution and data corruption

(Figure: the Resource Manager, cut off from the original AppMaster by the partition, starts a second AppMaster on another NodeManager while the first keeps running.)

SLIDE 15

Partial network partition - double execution in MapReduce

  • Both AppMasters run the same tasks on the NodeManagers: double execution and data corruption
  • Confuses the user

(Figure: with the partition still in place, the original AppMaster and the new AppMaster run side by side on the NodeManagers.)
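The root cause here is that the Resource Manager treats unreachability as proof of a crash, a pitfall called out again in the concluding remarks (SLIDE 24). The sketch below is a generic, hypothetical heartbeat-based failure detector in that style, not Hadoop's actual code; under a partial partition the AppMaster is alive but unreachable, so the check fires and a second AppMaster is started.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of a failure detector that equates silence with death. Under a
    // partial partition the AppMaster still runs, but its heartbeats cannot reach the
    // Resource Manager, so presumedFailed() returns true and a duplicate AppMaster is launched.
    class NaiveFailureDetector {
        private static final long TIMEOUT_MS = 10_000;
        private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

        void onHeartbeat(String appMasterId) {
            lastHeartbeat.put(appMasterId, System.currentTimeMillis());
        }

        boolean presumedFailed(String appMasterId) {
            Long last = lastHeartbeat.get(appMasterId);
            // Flawed assumption: "no heartbeat" is read as "crashed" rather than "unreachable".
            return last == null || System.currentTimeMillis() - last > TIMEOUT_MS;
        }
    }

Because both the original and the replacement AppMaster keep running, the job's tasks execute twice, which is the double execution and corruption shown above.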

SLIDE 16

Partial network partitioning

  • Partial partitioning leads to 28% of the failures
  • Leads to an inconsistent view of the system state
  • Affects leader election, scheduling, data placement, and configuration change
  • Partial partitions are poorly understood and tested

SLIDE 17

Highlights

  • Network partitioning failures are catastrophic, silent, and easy to manifest
  • Surprisingly, partial partitions cause a large number of failures
  • Debunk two common presumptions
    1. Admins believe that systems can tolerate network partitions
    2. Designers believe isolating one side of the partition is enough
  • NEAT: a network partitioning testing framework
  • Tested 7 systems → 32 failures

SLIDE 18

Debunks two presumptions

  • Presumption 1: Admins believe systems with data redundancy can tolerate partitioning
    Action: low priority for repairing ToR switches [1]
    Reality: 83% of the failures occur by isolating a single node
  • Presumption 2: Systems restrict client access to one side to eliminate failures
    Reality: 64% of the failures require no client access or access to one side only

[1] Gill et al., "Understanding network failures in data centers: measurement, analysis, and implications", ACM SIGCOMM 2011

SLIDE 19

Other findings

  • Failures in proven protocols are due to optimizations
  • Majority (83%) of the failures can be reproduced with 3 nodes
  • Majority (93%) of the failures can be reproduced through tests

SLIDE 20

Highlights

  • Network partitioning failures are catastrophic, silent, and easy to manifest
  • Surprisingly, partial partitions cause a large number of failures
  • Debunk two common presumptions
    1. Admins believe that systems can tolerate network partitions
    2. Designers believe isolating one side of the partition is enough
  • NEAT: a network partitioning testing framework
  • Tested 7 systems → 32 failures

SLIDE 21

NEtwork pArtitioning Testing framework (NEAT)

  • Supports all types of network partitions
  • Simple API

    // Test from the deck: reproducing the Apache Ignite double-locking failure with NEAT
    client1.createSemaphore(1);                    // create a distributed semaphore with one permit
    side1 = asList(S1, S2, client1);               // majority side: servers S1, S2 and client1
    side2 = asList(S3, client2);                   // minority side: server S3 and client2
    netPart = Partitioner.complete(side1, side2);  // inject a complete partition between the sides
    assertTrue(client1.sem_trywait());             // client1 acquires the only permit
    assertFalse(client2.sem_trywait());            // client2 must not acquire it too (double locking)
    Partitioner.heal(netPart);                     // heal the partition

(Figure: the Apache Ignite double-locking failure. A partition separates {S1, S2, Client1} from {S3, Client2}; both Client1 and Client2 call acquire() on the same semaphore and both succeed.)
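The snippet above exercises a complete partition. Since NEAT supports all partition types and SLIDE 12 lists complete, partial, and simplex, tests for the other two types would presumably look similar; the method names partial and simplex below are assumptions for illustration, as only complete and heal appear in the deck.

    // Hypothetical sketch: partial() and simplex() are assumed names, not confirmed NEAT API.
    side1 = asList(S1, client1);
    side2 = asList(S3, client2);

    // Partial partition: side1 and side2 cannot reach each other, while the remaining
    // nodes (e.g., S2) can still reach both sides.
    netPart = Partitioner.partial(side1, side2);
    Partitioner.heal(netPart);

    // Simplex partition: traffic flows in one direction only, e.g., side1 can send to
    // side2 but receives nothing back.
    netPart = Partitioner.simplex(side1, side2);
    Partitioner.heal(netPart);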

SLIDE 22

NEAT design

(Figure: NEAT architecture. A Test Engine coordinates Client Drivers on Client 1 and Client 2 and a Net Partitioner that controls connectivity among Server 1, Server 2, and Server 3; the Test Engine runs the target system and issues client operations.)

  • Orders client operations
  • Injects and heals partitions, using OpenFlow or iptables
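To make the iptables option concrete, here is a minimal sketch of how a partitioner could cut one node off from a set of peers by installing DROP rules, and remove them again to heal. It is an illustration under assumptions (it runs locally on the node with root privileges, and remote orchestration across the cluster is omitted), not NEAT's actual implementation.

    import java.io.IOException;
    import java.util.List;

    // Illustrative only, not NEAT's implementation: installs and removes iptables DROP rules
    // on the local node to block traffic to and from the given peer IPs.
    class IptablesPartitioner {
        // Inject: stop exchanging traffic with every peer on the other side of the partition.
        static void isolateFrom(List<String> peerIps) throws IOException, InterruptedException {
            for (String ip : peerIps) {
                run("iptables", "-A", "INPUT", "-s", ip, "-j", "DROP");
                run("iptables", "-A", "OUTPUT", "-d", ip, "-j", "DROP");
            }
        }

        // Heal: delete the rules installed above.
        static void healFrom(List<String> peerIps) throws IOException, InterruptedException {
            for (String ip : peerIps) {
                run("iptables", "-D", "INPUT", "-s", ip, "-j", "DROP");
                run("iptables", "-D", "OUTPUT", "-d", ip, "-j", "DROP");
            }
        }

        private static void run(String... cmd) throws IOException, InterruptedException {
            new ProcessBuilder(cmd).inheritIO().start().waitFor();
        }
    }

Running isolateFrom on every node of one group, with the other group's addresses as the peers, approximates a complete partition; applying it only to two groups while leaving a third group untouched approximates a partial partition.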

SLIDE 23

Testing with NEAT

  • We tested 7 systems using NEAT
  • Discovered 32 failures, 30 of them catastrophic
  • Confirmed: 12

System       # failures found
ActiveMQ     2
Ceph         2
Ignite       15
Infinispan   1
Terracotta   9
MooseFS      2
DKron        1

SLIDE 24

Concluding remarks

  • Further research is needed on network-partition fault tolerance, especially partial partitions

  • Highlight the danger of using unreachability as an indicator of a node crash
  • Identify ordering, timing, and network characteristics to simplify testing
  • Identify common pitfalls for developers and admins
  • NEAT: network partitioning testing framework

https://dsl.uwaterloo.ca/projects/neat/
