Building a Fault-Tolerant Distributed System with zookeepertcl

Tcl Conference 2018
Garrett McGrath

/whois

- Developer at FlightAware
- Work on Hyperfeed
- Current focus on distribution and reliability
- This talk is based on that work

System Definition

- Multiple components (processes)
  - All need to run concurrently
  - Too many to run on a single machine
- Spread across multiple machines (nodes)
- Egalitarian system
  - In terms of compute resources
- Each component
  - Runs on one machine at a time
  - A node may run multiple components

Faults and Failures

- Expect temporary and permanent failures
  - Of components
  - And of nodes
- Want to tolerate
  - Crash failures
  - Omission failures
- Consistency-Availability-Partition tolerance (the CAP trade-off)
  - This system addresses A and P

Recovery and Failover

- Since failure is expected, when it happens:
  - To a component: want it to run on another node
  - To a node: want its components to run on other nodes
- Want a system that supports automated failover
  - For common failure conditions

Scope and Limitations

- Cannot protect against all failures
- Consistency / integrity faults are unaddressed
- Byzantine failure is not touched
  - Arbitrary and/or malicious responses
  - Possibly from unintentional bugs
  - Or from collusion among nodes to deceive
- Network partitions are only partially addressed

Implementation

- A fault-tolerant distributed system with Tcl and Zookeeper
- Based on the leader election recipe
  - Though the term is used here in a peculiar way
- Each component will have a leader
  - The leader is the node running the component
  - With other nodes ready to step in

Per-Node Implementation

- Each node runs a supervisor
- The supervisor communicates with Zookeeper
- Elects components
  - Starts a component if it wins the election
  - Or if the current leader fails
- Monitors components, e.g., via SIGCHLD (see the sketch below)
- The supervisor knows
  - How to start and stop each component
  - The other nodes in the system
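
A minimal sketch of that supervisor loop, assuming the Tclx package for signal handling; start_component and handle_component_exit are hypothetical helper names, not taken from the talk:

    package require Tclx

    # Launch a component in the background; exec ... & returns its pid.
    proc start_component {cmd} {
        return [exec {*}$cmd &]
    }

    # Reap every child that has exited since the last SIGCHLD.
    proc reap_children {} {
        # wait -nohang returns an empty list when no more exited
        # children are pending; catch guards the no-children case.
        while {![catch {wait -nohang} status] && $status ne ""} {
            lassign $status pid kind code
            handle_component_exit $pid $kind $code
        }
    }

    # Policy point: give up leadership, rejoin the election,
    # or restart the component locally.
    proc handle_component_exit {pid kind code} {
        puts "child $pid exited ($kind $code)"
    }

    signal trap SIGCHLD reap_children
    vwait forever   ;# run the event loop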


Zookeeper

- Distributed coordination service
- Developed at Yahoo; maintained by the ASF
- Written in Java
- Runs
  - Standalone (dev / testing)
  - Replicated: handles k failures with 2k + 1 servers
    (e.g., a 5-server ensemble tolerates 2 failed servers)

Coordination

- Notoriously difficult to get right
  - Deadlocks
  - Race conditions
- Examples
  - Barriers
  - Queues
  - Locks (read or write)
  - Two-phase commit (atomic transactions)
  - Leader election

API

- Does not come with pre-baked primitives for each coordination task
- Exposes a simple API instead
  - More flexible
  - Use it to implement coordination tasks
  - Provides consistency and availability guarantees

API, Cont.

- Based on a file-system-like abstraction
- The znode
  - A combination of file and directory
  - Provides a hierarchical namespace
  - Enables process communication
- znodes contain
  - Data (small amounts, typically 1 MB max)
  - Metadata (ACLs, ctime, mtime, atime)

Example znode hierarchy:

    /
    /component0
    /component0/config
    /component0/election

API Operations: What Can We Do

- Create new znodes
  - Durable or ephemeral
  - Sequential
- Delete existing znodes
- Query znodes
  - Exists?
  - Children?
- Get / modify znode {meta,}data

Watch Callbacks

- Several operations support a watch callback (sketched below)
  - A one-time callback invoked when the znode changes
- A get or exists watch
  - Called when the znode is modified
- A children watch
  - Called when anything happens to the znode's children
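
Because each watch fires only once, re-arming it is the caller's job. A sketch (the znode path and callback name are illustrative):

    # One-shot watch: it must be re-armed after each firing.
    proc config_changed {args} {
        # ...react to the change, then set the watch again.
        zk exists /component0/config -watch config_changed
    }
    zk exists /component0/config -watch config_changed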

zookeepertcl

- Open-source library: github.com/flightaware/zookeepertcl
- Wraps the official C client
- Supports the latest stable Zookeeper version, r3.4.13
- Each API operation supports two styles (sketched below)
  - Synchronous
  - Asynchronous
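
A sketch of the two styles; the -async flag spelling and the callback's argument format are assumptions about zookeepertcl, not confirmed by the talk:

    # Synchronous: the command returns the result directly.
    set data [zk get /component0/args]

    # Asynchronous (assumed spelling): name a callback to be
    # invoked from the event loop when the operation completes.
    proc got_args {args} {
        # The callback's argument format is schematic here.
        puts "async get completed: $args"
    }
    zk get /component0/args -async got_args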

    # zookeepertcl provides the aptly named zookeeper package
    package require zookeeper

    # Turn off C client stderr debugging statements
    zookeeper::zookeeper debug_level none

    # Connect to a Zookeeper server/cluster.
    # We end up with a new command, zk, which supports
    # sub-commands for using the Zookeeper API.
    set hostStr "host1:2181,host2:2181,host3:2181"
    set timeout 5000
    zookeeper::zookeeper init zk $hostStr $timeout

    # Use the Zookeeper API!

    ## Create some znodes for the system components
    for {set i 0} {$i < $totalComponents} {incr i} {
        set componentRoot [file join / component$i]
        zk create $componentRoot
        zk create [file join $componentRoot args]
        zk create [file join $componentRoot election]
    }

    ## Exists
    zk exists /component0   ;# 1

    ## Children
    set rootZnodes [zk children /]
    lsearch -all -inline -glob $rootZnodes component*

    ## Get
    set c0Args [file join / component0 args]
    zk get $c0Args -stat c0ArgsStats

    ## Set (the version argument makes this a compare-and-set)
    zk set $c0Args "commandArgs" $c0ArgsStats(version)

    ## Delete (the set above bumped the version by one)
    zk delete $c0Args [expr {$c0ArgsStats(version) + 1}]


Leader Election Recipe

Step 1

Create znode z with path "ELECTION/n_" with both SEQUENCE and EPHEMERAL flags.

    # assume that $electionRoot already exists
    set electionRoot [file join / component0 election]
    set myVote [file join $electionRoot "n_"]
    set z [zk create $myVote -ephemeral -sequence]

Step 2

Let C be the children of "ELECTION", and i be the sequence number of z.

    # zk children returns relative znode paths
    set C [zk children $electionRoot]

    # create returns a full path
    set zRelative [lindex [file split $z] end]

    # use scan to extract i, since sequence numbers are in
    # the format %010d, i.e., 10 digits padded with zeros
    set i [scan [lindex [split $zRelative _] end] %d]

Step 3

Watch for changes on "ELECTION/n_j", where j is the largest sequence number such that j < i and n_j is a znode in C.

    # Sort C to make things easier
    set Cdigits [lmap vote $C {
        scan [lindex [split $vote _] end] %d
    }]
    set sortedC [lsort -integer $Cdigits]
    watch_next_node $sortedC $i $electionRoot

    proc watch_next_node {sortedC i electionPath} {
        # i's position in the sorted list
        set iPos [lsearch $sortedC $i]
        # the leader is element 0 in the sorted list of votes
        if {$iPos != 0} {
            # watch the vote immediately ahead of ours in sequence order
            set j [lindex $sortedC [expr {$iPos - 1}]]
            # re-pad j to match the %010d sequence suffix
            set jPath [file join $electionPath [format "n_%010d" $j]]
            zk exists $jPath -watch election_change
        } else {
            # run the component, since the election was won
        }
    }
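
The talk does not show election_change itself; the recipe's rule is to repeat steps 2 and 3 when the watched predecessor changes, so a plausible sketch is:

    # Hypothetical callback for the predecessor watch: re-derive
    # the sorted votes and re-check whether we are now the leader.
    proc election_change {args} {
        global electionRoot i
        set C [zk children $electionRoot]
        set sortedC [lsort -integer [lmap vote $C {
            scan [lindex [split $vote _] end] %d
        }]]
        watch_next_node $sortedC $i $electionRoot
    }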

Implementation Decisions

Abdication: Giving up Leadership

- Timing of elections can result in massive asymmetries
  - Do not want one node to crowd out the others
- Implement a policy of abdication (sketched below)
  - Based on, e.g., fair distribution
  - Delay after winning an election
  - If leader, set a children watch
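
One plausible mechanics for abdication, sketched under assumptions: stop_component and rejoin_election are hypothetical helpers, $z is the vote znode from the election, and -1 is the C client's any-version sentinel:

    proc maybe_abdicate {z} {
        global componentsLed fairShare
        # Illustrative fairness policy: yield once this node
        # leads more than its fair share of components.
        if {$componentsLed > $fairShare} {
            stop_component
            # Deleting our ephemeral vote hands leadership to
            # the next vote in sequence order.
            zk delete $z -1
            rejoin_election
        }
    }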

Restart Loops: Limiting Abdication

- Intermittent failures plus abdication
  - A single component could get passed around
- Need to avoid this potential instability
  - A matter of retaining sufficient state (one approach sketched below)
  - Can be done locally, or in znodes
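
An illustrative znode-based guard; the path, threshold, and cooldown are made up for the sketch, and the counter znode is assumed to already exist holding an integer:

    proc record_restart {component} {
        set counter [file join / $component restarts]
        # Shared history: every supervisor sees the same count.
        set n [zk get $counter -stat stats]
        incr n
        # The version argument makes the increment a compare-and-set.
        zk set $counter $n $stats(version)
        if {$n > 5} {
            # Flapping: back off before rejoining this
            # component's election (synchronous delay for brevity).
            after 60000
        }
    }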

Intentional Stops: Retaining Leadership

- Often desirable to restart or stop a component
  - Without giving up current leadership
- This is the main justification for using a supervisor
- Many potential methods of addressing this
  - One is to use special znodes to pass commands (sketched below)
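
A sketch of that command-znode idea; the path, command names, and helper procs are illustrative. The supervisor's ephemeral vote znode stays put throughout, which is how leadership is retained:

    proc command_changed {args} {
        set cmdPath /component0/command
        switch -- [zk get $cmdPath] {
            stop    { stop_component }
            restart { restart_component }
        }
        # Re-arm the one-shot watch for the next command.
        zk exists $cmdPath -watch command_changed
    }
    zk exists /component0/command -watch command_changed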

Config Changes: Targeted Restarts

- Watch callbacks on the /config portion of a component's znode hierarchy
- Callbacks can pile up
  - E.g., delete one argument and add another
- Need a way of performing targeted restarts (one approach sketched below)
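
One hedged approach to the pile-up: debounce the watch events and perform a single restart after a quiet period (the delay and restart_component are illustrative):

    set pendingRestart ""
    proc config_changed {args} {
        global pendingRestart
        # Each callback pushes the restart out; only the last
        # event in a burst actually triggers it.
        after cancel $pendingRestart
        set pendingRestart [after 2000 restart_component]
        # Re-arm the one-shot watch.
        zk exists /component0/config -watch config_changed
    }
    zk exists /component0/config -watch config_changed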

Connection Loss: Zookeeper Session States

- Need a policy for what to do when the connection to Zookeeper is lost
  - Watch callbacks do not persist
- A Zookeeper connection
  - Is called a session
  - Is represented as a state machine
  - Distinguishes between a lost and a merely interrupted connection
    (policy sketch below)
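
The talk does not show how zookeepertcl surfaces session events, so the hook below is an assumption; the point is the policy, keyed to the lost-versus-interrupted distinction:

    # Assumed session-state handler; treat the wiring as hypothetical.
    proc session_state_changed {state} {
        switch -- $state {
            connecting {
                # Interrupted: the session may survive, so keep
                # components running and wait for reconnection.
            }
            connected {
                # Reconnected: per the talk, watch callbacks do not
                # persist, so re-arm every watch we rely on.
                rearm_watches
            }
            expired {
                # Lost: ephemeral votes are gone; stop components
                # and rejoin elections with a fresh session.
                stop_all_components
                rejoin_elections
            }
        }
    }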