ZooKeeper: Wait-free coordination for Internet-scale systems



SLIDE 1

ZooKeeper

Wait-free coordination for Internet-scale systems

Patrick Hunt and Mahadev (Yahoo! Grid)
Flavio Junqueira and Benjamin Reed (Yahoo! Research)

SLIDE 2

Internet-scale Challenges

  • Lots of servers, users, data
  • FLP, CAP
  • Mere mortal programmers
SLIDE 3

Classic Distributed System

[Diagram: one Master with six Slaves]

SLIDE 4

Fault Tolerant Distributed System

[Diagram: a Master with six Slaves, plus a Coordination Service and a second Master]

SLIDE 5

Fault Tolerant Distributed System

[Diagram: a Master with six Slaves, plus a Coordination Service and a second Master]

SLIDE 6

Fully Distributed System

[Diagram: six Workers sharing a Coordination Service]

SLIDE 7

What is coordination?

  • Group membership
  • Leader election
  • Dynamic Configuration
  • Status monitoring
  • Queuing
  • Barriers
  • Critical sections
SLIDE 8

Goals

  • Been done in the past
      – ISIS, distributed locks (Chubby, VMS)
  • High Performance
      – Multiple outstanding ops
      – Read dominant

  • General (Coordination Kernel)
  • Reliable
  • Easy to use
SLIDE 9

Wait-free

  • Pros
      – Slow processes cannot slow down fast ones
      – No deadlocks
      – No blocking in the implementations
  • Cons
      – Some coordination primitives are blocking
      – Need to be able to efficiently wait for conditions

SLIDE 10

Serializability vs Linearizability

  • Linearizable writes
  • Serializable reads (may be stale)
  • Client FIFO ordering
SLIDE 11

Change Events

  • Clients request change notifications
  • The service delivers notifications in a timely manner
  • Notifications do not block write requests
  • Clients get notification of a change before they see the result of the change

SLIDE 12

Solution

Order + wait-free + change events = coordination

SLIDE 13

ZooKeeper API

    String create(path, data, acl, flags)
    void delete(path, expectedVersion)
    Stat setData(path, data, expectedVersion)
    (data, Stat) getData(path, watch)
    Stat exists(path, watch)
    String[] getChildren(path, watch)
    void sync()
    Stat setACL(path, acl, expectedVersion)
    (acl, Stat) getACL(path)
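The expectedVersion parameter on the write calls allows conditional updates. The following is a minimal sketch of that idea using a toy in-memory store, not the ZooKeeper client; the names ToyStore, BadVersion, and the /counter path are hypothetical and exist only for illustration.

```python
class BadVersion(Exception):
    """Raised when expected_version does not match the node's current version."""


class ToyStore:
    """In-memory stand-in for the data tree; NOT the real ZooKeeper client."""

    def __init__(self):
        self.nodes = {}  # path -> [data, version]

    def create(self, path, data):
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = [data, 0]
        return path

    def get_data(self, path):
        data, version = self.nodes[path]
        return data, version

    def set_data(self, path, data, expected_version):
        node = self.nodes[path]
        # -1 means "any version", as in setData(..., newConf, -1).
        if expected_version != -1 and expected_version != node[1]:
            raise BadVersion(path)
        node[0] = data
        node[1] += 1
        return node[1]


store = ToyStore()
store.create("/counter", 0)

# Read-modify-write: the write succeeds only if nobody updated in between.
value, version = store.get_data("/counter")
store.set_data("/counter", value + 1, version)

try:
    store.set_data("/counter", 99, version)  # version is now stale
    stale_write_accepted = True
except BadVersion:
    stale_write_accepted = False
```

A client that gets a version mismatch would typically re-read the znode and retry, which gives an optimistic-concurrency update without any blocking primitive.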

SLIDE 14

Data Model

  • Hierarchical namespace (like a file system)
  • Each znode has data and children
  • Data is read and written in its entirety

[Diagram: example znode tree rooted at /, containing the znodes services, users, apps, locks, workers, YaView, s-1, worker2, worker1]

SLIDE 15

Create Flags

  • Ephemeral: znode is deleted when its creator fails or when it is explicitly deleted
  • Sequence: append a monotonically increasing counter

[Diagram: znode tree (/, services, users, apps, locks, workers, YaView, s-1, worker2, worker1) annotated "Ephemerals created by Session X" and "Sequence appended on create"]
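The two flags can be sketched with a toy in-memory tree. This is an illustration under stated assumptions, not ZooKeeper itself: ToyTree, its method names, the per-parent counter, and the ten-digit zero padding are all choices made for this example.

```python
class ToyTree:
    """In-memory sketch of the EPHEMERAL and SEQUENCE flags (hypothetical)."""

    def __init__(self):
        self.owner = {}    # path -> session id for ephemerals, None otherwise
        self.counter = {}  # parent path -> next sequence number (assumed per parent)

    def create(self, path, session=None, ephemeral=False, sequence=False):
        if sequence:
            parent = path.rsplit("/", 1)[0]
            n = self.counter.get(parent, 0)
            self.counter[parent] = n + 1
            # Padding width is an assumption; it keeps names sortable as strings.
            path = f"{path}{n:010d}"
        self.owner[path] = session if ephemeral else None
        return path

    def close_session(self, session):
        # Ephemeral znodes disappear when the session that created them ends.
        for path in [p for p, s in self.owner.items() if s == session]:
            del self.owner[path]


tree = ToyTree()
a = tree.create("/locks/s-", session="X", ephemeral=True, sequence=True)
b = tree.create("/locks/s-", session="Y", ephemeral=True, sequence=True)
tree.create("/config")
tree.close_session("X")  # only the znode created by session X is removed
```

Combining both flags gives each client a uniquely named, ordered znode that cleans itself up on failure, which is what the lock recipes later in the deck rely on.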
SLIDE 16

Configuration

  • Workers get configuration
      – getData(".../config/settings", true)
  • Administrators change the configuration
      – setData(".../config/settings", newConf, -1)
  • Workers notified of change and get the new settings
      – getData(".../config/settings", true)

[Diagram: znode config with child settings]
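The read, notify, re-read cycle above can be sketched as follows. The sketch assumes watches fire once and must be re-armed by the next read, which is why the worker calls getData with the watch flag again. ToyConfigNode and Worker are hypothetical in-memory stand-ins, and the settings values are made up.

```python
class ToyConfigNode:
    """Single znode with one-shot watches; a stand-in for the real service."""

    def __init__(self, data):
        self.data = data
        self.watchers = []

    def get_data(self, watcher=None):
        if watcher is not None:
            self.watchers.append(watcher)  # fires at most once
        return self.data

    def set_data(self, data):
        self.data = data
        # Take the pending watchers and clear the list before notifying.
        pending, self.watchers = self.watchers, []
        for notify in pending:
            notify()


class Worker:
    def __init__(self, node):
        self.node = node
        self.reloads = 0
        self.load()

    def load(self):
        # Read the settings and set the watch in the same call, so no update
        # can slip in between reading and watching.
        self.settings = self.node.get_data(watcher=self.on_change)
        self.reloads += 1

    def on_change(self):
        self.load()


node = ToyConfigNode({"threads": 4})
worker = Worker(node)
node.set_data({"threads": 8})   # administrator update
node.set_data({"threads": 16})  # second update is seen because the watch was re-armed
```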

SLIDE 17

Group Membership

  • Register serverName in group
      – create(".../workers/workerName", hostInfo, EPHEMERAL)
  • List group members
      – getChildren(".../workers", true)

[Diagram: znode workers with children worker1 and worker2]
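A rough sketch of the membership recipe, using a toy model of the workers znode: each member is an ephemeral child, and an observer re-lists the children whenever its watch fires. ToyGroup, the session ids, and the host strings are hypothetical.

```python
class ToyGroup:
    """Children of a hypothetical workers znode, keyed by owning session."""

    def __init__(self):
        self.members = {}   # child name -> (session, host info)
        self.watchers = []  # one-shot child watches

    def create_ephemeral(self, name, host_info, session):
        self.members[name] = (session, host_info)
        self._notify()

    def get_children(self, watcher=None):
        if watcher is not None:
            self.watchers.append(watcher)
        return sorted(self.members)

    def expire_session(self, session):
        # A failed worker's session ends, so its ephemeral znode is removed.
        gone = [n for n, (s, _) in self.members.items() if s == session]
        for name in gone:
            del self.members[name]
        if gone:
            self._notify()

    def _notify(self):
        pending, self.watchers = self.watchers, []
        for notify in pending:
            notify()


group = ToyGroup()
group.create_ephemeral("worker1", "host-a:9000", session=1)
group.create_ephemeral("worker2", "host-b:9000", session=2)

views = []


def refresh():
    # Re-list and re-arm the watch each time membership changes.
    views.append(group.get_children(watcher=refresh))


refresh()
group.expire_session(2)  # worker2 fails; the observer sees the new membership
```

No explicit deregistration step is needed: a crashed worker drops out of the group because its ephemeral znode goes away with its session.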

SLIDE 18

Leader Election

  1. getData(".../workers/leader", true)
  2. if successful, follow the leader described in the data and exit
  3. create(".../workers/leader", hostname, EPHEMERAL)
  4. if successful, lead and exit
  5. goto step 1

If a watch is triggered for ".../workers/leader", followers will restart the leader election process.

[Diagram: znode workers with children worker1, worker2, and leader]
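The election steps can be traced with a single-threaded toy model. This is only a sketch: ToyLeaderNode and Candidate are hypothetical, a missing znode is modeled by returning None rather than raising an error, and there is no real concurrency, so the retry branch exists only to mirror the final "goto" step.

```python
class NodeExists(Exception):
    pass


class ToyLeaderNode:
    """Stand-in for the leader znode (hypothetical in-memory model)."""

    def __init__(self):
        self.holder = None  # hostname stored in the znode, or None if absent
        self.watchers = []

    def get_data(self, watcher):
        if self.holder is None:
            return None  # models "no such znode"; no watch is left behind
        self.watchers.append(watcher)
        return self.holder

    def create_ephemeral(self, hostname):
        if self.holder is not None:
            raise NodeExists()
        self.holder = hostname

    def session_ended(self):
        # The leader's session ends, so its ephemeral znode is deleted.
        self.holder = None
        pending, self.watchers = self.watchers, []
        for notify in pending:
            notify()


class Candidate:
    def __init__(self, hostname, node):
        self.hostname, self.node = hostname, node
        self.following = None
        self.elect()

    def elect(self):
        while True:
            leader = self.node.get_data(self.elect)  # read with watch
            if leader is not None:
                self.following = leader              # follow and exit
                return
            try:
                self.node.create_ephemeral(self.hostname)
                self.following = self.hostname       # lead and exit
                return
            except NodeExists:
                continue                             # lost the race; start over


node = ToyLeaderNode()
a = Candidate("host-a", node)
b = Candidate("host-b", node)
c = Candidate("host-c", node)
node.session_ended()  # host-a fails; watchers rerun the election
```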

SLIDE 19

Locks

  1. id = create(".../locks/x-", SEQUENCE|EPHEMERAL)
  2. getChildren(".../locks/", false)
  3. if id is the 1st child, exit
  4. exists(name of last child before id, true)
  5. if does not exist, goto 2
  6. wait for event
  7. goto 2

Each znode watches one other. No herd effect.

[Diagram: znode locks with children x-11, x-19, x-20]
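The decision made after each getChildren call (hold the lock, or watch exactly one predecessor) can be sketched as a pure function. The helper names sequence_of and lock_step are hypothetical, and the child names follow the x-N pattern from the diagram.

```python
def sequence_of(name):
    """Extract the numeric suffix, e.g. 'x-19' -> 19."""
    return int(name.rsplit("-", 1)[1])


def lock_step(my_name, children):
    """Return None if my_name holds the lock, else the one child to watch."""
    ordered = sorted(children, key=sequence_of)
    index = ordered.index(my_name)
    if index == 0:
        return None            # lowest sequence number: lock acquired
    return ordered[index - 1]  # watch only the immediate predecessor


children = ["x-19", "x-11", "x-20"]
holder = lock_step("x-11", children)        # x-11 is first, so it holds the lock
waits_on_19 = lock_step("x-19", children)   # x-19 watches x-11
waits_on_20 = lock_step("x-20", children)   # x-20 watches x-19, not x-11

# After x-11 releases (its znode is deleted), only x-19 is notified and re-checks.
after_release = lock_step("x-19", ["x-19", "x-20"])
```

Because each waiter watches a different znode, a release wakes a single client rather than every waiter.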

SLIDE 20

Shared Locks

  1. id = create(".../locks/s-", SEQUENCE|EPHEMERAL)
  2. getChildren(".../locks/", false)
  3. if no children that start with x- before id, exit
  4. exists(name of the last x- before id, true)
  5. if does not exist, goto 2
  6. wait for event
  7. goto 2

[Diagram: znode locks with a mix of x- and s- sequence children]
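For the shared variant, the check after getChildren only considers x- children that come earlier in sequence order. A minimal sketch, assuming x- marks exclusive requests and s- marks shared ones, and that both prefixes draw from the same counter; shared_lock_step and the child names are hypothetical.

```python
def sequence_of(name):
    """Extract the numeric suffix, e.g. 's-21' -> 21."""
    return int(name.rsplit("-", 1)[1])


def shared_lock_step(my_name, children):
    """Return None if the shared lock is held, else the x- child to watch."""
    mine = sequence_of(my_name)
    earlier_exclusive = [c for c in children
                         if c.startswith("x-") and sequence_of(c) < mine]
    if not earlier_exclusive:
        return None  # no exclusive request ahead of us: lock acquired
    return max(earlier_exclusive, key=sequence_of)  # last x- before our id


children = ["x-19", "s-20", "s-21", "x-22"]
s20_waits = shared_lock_step("s-20", children)  # blocked behind x-19
s21_waits = shared_lock_step("s-21", children)  # also blocked behind x-19

# Once x-19 is deleted, both shared holders proceed together;
# the later x-22 does not block them.
remaining = ["s-20", "s-21", "x-22"]
s20_after = shared_lock_step("s-20", remaining)
s21_after = shared_lock_step("s-21", remaining)
```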

SLIDE 21

ZooKeeper Servers

[Diagram: ZooKeeper Service made up of six Servers]

  • All servers have a copy of the state in memory
  • A leader is elected at startup
  • Followers serve clients; all updates go through the leader
  • Update responses are sent when a majority of servers have persisted the change

We need 2f+1 machines to tolerate f failures
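The 2f+1 rule follows from requiring a majority for every update. A small worked sketch (the function names are illustrative only):

```python
def quorum_size(n_servers):
    """Smallest majority of n_servers."""
    return n_servers // 2 + 1


def tolerated_failures(n_servers):
    """Largest f such that n_servers >= 2f + 1."""
    return (n_servers - 1) // 2


# Print ensemble size, majority size, and tolerated failures.
for n in (3, 4, 5, 6, 7):
    print(n, quorum_size(n), tolerated_failures(n))
```

Under this arithmetic an even-sized ensemble tolerates no more failures than the next smaller odd size: six servers still tolerate only two failures, the same as five.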

SLIDE 22

ZooKeeper Servers

[Diagram: ZooKeeper Service of six Servers, one labeled Leader, with eight Clients]

SLIDE 23

Current Performance

SLIDE 24

Summary

  • Easy to use
  • High Performance
  • General
  • Reliable
  • Release 3.3 on Apache
      – See http://hadoop.apache.org/zookeeper
      – Committers from Yahoo! and Cloudera