ZooKeeper: Wait-free coordination for Internet-scale systems
Xuyang Zhang Zhesheng Xie
What is ZooKeeper
A distributed coordination service for distributed applications. It provides a simple, high-performance kernel for building more complex coordination primitives at the client.
environment
coordination (e.g., configuration, group membership, leader election)
ZooKeeper exposes a generic API that lets clients build custom primitives.
changes to the service core.
applications
○ A distributed publish-subscribe system
○ A distributed indexer
ZooKeeper primitives used: group membership, leader election, configuration management
○ Part of Yahoo!
○ Has a master process that commands page-fetching processes
○ Master provides fetchers with configuration
○ Fetchers write back, reporting their status and health
FS uses ZooKeeper to manage configuration metadata and to elect masters (leader election).
○ Client API
○ Network Connection Management
2 types of znodes:
○ Regular: manipulated (created, deleted) explicitly by clients
○ Ephemeral: created by clients, removed automatically by the system when the session ends
○ Sequential flag: a monotonically increasing counter is appended to the znode's name
○ Watch flag: lets clients receive timely notifications of changes without polling
Illustration of the ZooKeeper hierarchical name space. A file system with a simplified API. It stores only metadata for coordination purposes.
A session represents the network connection between a client and a ZooKeeper server. With a session:
execution of its operations.
server to another within a ZooKeeper ensemble
create(path, data, flags)
delete(path, version)
exists(path, watch)
getData(path, watch)
setData(path, data, version)
getChildren(path, watch)
sync(path)
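To make the version arguments in the calls above concrete, here is a minimal in-memory sketch. It is a toy model, not a ZooKeeper client: the class name `ToyZNodeTree` and its snake_case method names are assumptions for illustration, and flags, watches, and sessions are left out. It assumes that a version of -1 skips the version check.

```python
class ToyZNodeTree:
    """Toy in-memory stand-in for the znode name space (hypothetical, not a real client)."""

    def __init__(self):
        # path -> [data, version]; the root znode always exists
        self.nodes = {"/": [b"", 0]}

    def _check_version(self, path, version):
        current = self.nodes[path][1]
        # Assumption: version -1 means "do not check the version"
        if version != -1 and version != current:
            raise ValueError("version mismatch on " + path)
        return current

    def create(self, path, data):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError("parent does not exist: " + parent)
        if path in self.nodes:
            raise KeyError("znode already exists: " + path)
        self.nodes[path] = [data, 0]
        return path

    def exists(self, path):
        return path in self.nodes

    def get_data(self, path):
        data, version = self.nodes[path]
        return data, version

    def set_data(self, path, data, version):
        # Conditional update: succeeds only if the caller saw the latest version
        current = self._check_version(path, version)
        self.nodes[path] = [data, current + 1]

    def get_children(self, path):
        prefix = path.rstrip("/") + "/"
        names = [p[len(prefix):] for p in self.nodes if p.startswith(prefix)]
        # Keep direct children only (non-empty name, no further "/")
        return sorted(n for n in names if n and "/" not in n)

    def delete(self, path, version):
        if self.get_children(path):
            raise ValueError("znode has children: " + path)
        self._check_version(path, version)
        del self.nodes[path]


if __name__ == "__main__":
    tree = ToyZNodeTree()
    tree.create("/app", b"")
    tree.create("/app/config", b"v1")
    tree.set_data("/app/config", b"v2", 0)  # version 0 matches, so the write applies
    print(tree.get_data("/app/config"))
```

The version argument is what lets a client do a safe read-modify-write: a `set_data` carrying a stale version is rejected instead of silently overwriting a newer value.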
Ordering Guarantee
Leader election systems: a number of processes elect a leader to command worker processes. Requirements:
processes to start using the configuration that is being changed;
we do not want the processes to use this partial configuration.
How to satisfy the requirements? Use a ready znode.
When a new leader starts making changes, it:
1. Deletes the ready znode
2. Updates the configuration znodes (accelerated by pipelining)
3. Creates a new ready znode
Ordering guarantee: if a process sees the ready znode, it must also see all configuration changes made by the leader.
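The three steps above can be sketched with a toy in-memory store. This is an illustration under stated assumptions, not ZooKeeper code: `ToyStore`, the `/config/...` paths, and the function names are all hypothetical, and writes are applied strictly in the order the leader issues them. The leader is written as a generator so that a reader can be interleaved between any two writes.

```python
READY = "/config/ready"  # hypothetical path for the ready znode


class ToyStore:
    """Toy store that applies writes in the order they are issued."""

    def __init__(self):
        self.nodes = {}

    def create(self, path, data=b""):
        self.nodes[path] = data

    def delete(self, path):
        self.nodes.pop(path, None)

    def exists(self, path):
        return path in self.nodes

    def get(self, path):
        return self.nodes[path]


def leader_update_steps(store, new_config):
    """Run the leader's update, yielding after every write."""
    store.delete(READY)                      # 1. delete the ready znode
    yield "ready deleted"
    for key, value in new_config.items():    # 2. update configuration znodes
        store.create("/config/" + key, value)
        yield "wrote " + key
    store.create(READY)                      # 3. create a new ready znode
    yield "ready created"


def read_config(store, keys):
    """Return the configuration only if the ready znode exists, else None."""
    if not store.exists(READY):
        return None
    return {key: store.get("/config/" + key) for key in keys}


if __name__ == "__main__":
    store = ToyStore()
    for step in leader_update_steps(store, {"host": b"b", "port": b"2"}):
        print(step, read_config(store, ["host", "port"]))
```

In this model a reader never observes a partial configuration: it gets `None` at every intermediate step and the full new configuration once the ready znode is back. The sketch does not model a reader that checks the ready znode and is then overtaken by a new change before it finishes reading.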
What else can be wrong?
configuration.
change is in progress.
How to solve? Add a watch to the ready znode.
The ordering guarantee for the notifications:
event before it sees the new state of the system after the change is made
What if clients have other communication channels in addition to ZooKeeper?
A and B share configuration on the ZooKeeper service.
A: updates the configuration, then sends a notification via the other channel
B: is supposed to read the updated configuration, but may not see it due to delay
Solution: B sends sync() before the read operation.
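The stale read and its fix can be illustrated with a toy leader and a lagging replica. This is a simplified model, assuming that reads are served from the local replica's state and that `sync()` makes the replica apply every write committed before the sync was issued. `ToyLeader` and `ToyFollower` are hypothetical names.

```python
class ToyLeader:
    """Holds the committed writes in order (toy model)."""

    def __init__(self):
        self.log = []  # committed (path, data) pairs

    def write(self, path, data):
        self.log.append((path, data))


class ToyFollower:
    """Replica that serves reads from local state and may lag behind the leader."""

    def __init__(self, leader):
        self.leader = leader
        self.applied = 0   # number of log entries applied locally
        self.state = {}

    def apply_up_to(self, index):
        for path, data in self.leader.log[self.applied:index]:
            self.state[path] = data
        self.applied = max(self.applied, index)

    def read(self, path):
        # Local read: does not contact the leader
        return self.state.get(path)

    def sync(self):
        # Catch up with everything committed before this call
        self.apply_up_to(len(self.leader.log))


if __name__ == "__main__":
    leader = ToyLeader()
    replica_of_b = ToyFollower(leader)
    leader.write("/config", b"v1")
    replica_of_b.sync()
    leader.write("/config", b"v2")       # A's update; B's replica has not applied it
    print(replica_of_b.read("/config"))  # stale read
    replica_of_b.sync()
    print(replica_of_b.read("/config"))  # read after sync
```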
the process will be notified
doesn’t know the address of the master ahead of time.
○ Master: fills in the znode with the address
○ Client: watches and reads the znode
zgroup
ZooKeeper
○ If it succeeds, the lock is acquired
○ If it fails, watch the znode
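A minimal sketch of this lock, assuming the lock is represented by an ephemeral znode that the holder creates and that disappears when the holder's session ends. `ToyLockService` and `try_lock` are hypothetical names, and the service is an in-memory toy rather than a ZooKeeper client.

```python
class ToyLockService:
    """Toy model of ephemeral znodes with one-shot deletion watches."""

    def __init__(self):
        self.owner = {}     # path -> session that created the ephemeral znode
        self.watchers = {}  # path -> callbacks fired once when the znode is deleted

    def create_ephemeral(self, path, session):
        if path in self.owner:
            return False    # znode already exists, so the create fails
        self.owner[path] = session
        return True

    def watch_delete(self, path, callback):
        self.watchers.setdefault(path, []).append(callback)

    def delete(self, path):
        self.owner.pop(path, None)
        # Watches are one-shot: remove them before firing
        for callback in self.watchers.pop(path, []):
            callback()

    def close_session(self, session):
        # Ephemeral znodes are removed when their session ends
        for path in [p for p, s in self.owner.items() if s == session]:
            self.delete(path)


def try_lock(service, path, session, on_release):
    """Try to create the lock znode; on failure, watch it and return False."""
    if service.create_ephemeral(path, session):
        return True
    service.watch_delete(path, on_release)
    return False


if __name__ == "__main__":
    service = ToyLockService()
    woken = []
    print(try_lock(service, "/lock", "s1", lambda: woken.append("s1")))
    print(try_lock(service, "/lock", "s2", lambda: woken.append("s2")))
    print(try_lock(service, "/lock", "s3", lambda: woken.append("s3")))
    service.close_session("s1")
    print(woken)
```

In this toy, releasing the lock wakes every waiting client at once, and each of them would then retry the create at the same time.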
the lock will access ZooKeeper at the same time
computation
○ Use a znode zb to represent a barrier
○ Register: create a child znode under zb
○ Unregister: delete the child znode
○ Enter the barrier when there are enough child znodes under zb
○ Leave the barrier when zb has no children
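The enter and leave conditions can be written down directly. The sketch below is a toy model in which a set stands in for the children of zb. `ToyBarrier` and the `threshold` parameter (how many children count as "enough") are assumptions for illustration.

```python
class ToyBarrier:
    """Toy barrier: a set of names stands in for the child znodes under zb."""

    def __init__(self, threshold):
        self.threshold = threshold  # number of children required to enter
        self.children = set()

    def register(self, name):
        # Corresponds to creating a child znode under zb
        self.children.add(name)

    def unregister(self, name):
        # Corresponds to deleting that child znode
        self.children.discard(name)

    def can_enter(self):
        return len(self.children) >= self.threshold

    def can_leave(self):
        return not self.children


if __name__ == "__main__":
    barrier = ToyBarrier(threshold=2)
    barrier.register("p1")
    print(barrier.can_enter())   # only one of two processes has registered
    barrier.register("p2")
    print(barrier.can_enter())
    barrier.unregister("p1")
    barrier.unregister("p2")
    print(barrier.can_leave())
```

In a real deployment a process would not poll these conditions; it would set a watch under zb and re-check when notified.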
protocol
○ Guarantees that messages are received in the same order they were sent
○ Guarantees that all messages from the previous leader are delivered to the new leader before it broadcasts its own messages
memory
○ Keep track of proposals using a log
○ Take snapshots periodically
○ Fuzzy snapshot: State is not locked when taking a snapshot
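A fuzzy snapshot can capture a state that never existed, because writes keep landing while znodes are being copied. The sketch below shows why recovery can still work, under one key assumption that the slide does not state: transactions are idempotent (each one sets an absolute value and version), so replaying the log over the snapshot converges to the correct state. All names and the `interleaved` parameter are hypothetical.

```python
def apply_txn(state, txn):
    """Apply an idempotent transaction: set an absolute (data, version) for a path."""
    path, data, version = txn
    state[path] = (data, version)


def fuzzy_snapshot(state, paths, interleaved):
    """Copy znodes one at a time without locking the state.

    `interleaved` maps a copy step to the transactions that hit the live
    state just before that step, to simulate writes during the snapshot.
    """
    snapshot = {}
    for step, path in enumerate(paths):
        for txn in interleaved.get(step, []):
            apply_txn(state, txn)
        snapshot[path] = state[path]
    return snapshot


def recover(snapshot, log):
    """Rebuild state by replaying the log on top of the fuzzy snapshot."""
    state = dict(snapshot)
    for txn in log:
        apply_txn(state, txn)
    return state


if __name__ == "__main__":
    live = {"/foo": ("f1", 1), "/goo": ("g1", 1)}
    log = [("/foo", "f2", 2), ("/goo", "g2", 2), ("/foo", "f3", 3)]
    # /goo is copied first; all three writes land before /foo is copied
    snap = fuzzy_snapshot(live, ["/goo", "/foo"], {1: log})
    print(snap)                # mixes old /goo with new /foo
    print(recover(snap, log))  # matches the live state
```

Replaying a transaction that the snapshot already reflects is harmless here only because each transaction overwrites with an absolute value; a relative update such as "increment" would break this scheme.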
○ Writes are applied in order, with no concurrency
○ Notifications are triggered locally
○ Handled locally
○ Tagged with a zxid that corresponds to the last write seen by the server
○ After a sync, reads reflect all changes made before the sync was issued
○ The zxid is used to guarantee that the new server's view is no older than the client's
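The zxid check on reconnection amounts to a single comparison. The sketch below assumes the client remembers the last zxid it has seen and that each candidate server reports the zxid of the last write it has applied. `pick_server` and its arguments are hypothetical names, not part of any ZooKeeper API.

```python
def pick_server(client_last_zxid, server_zxids):
    """Return the first server whose view is at least as recent as the client's.

    `server_zxids` maps a server name to the zxid of the last write that
    server has applied. Returns None if every server is behind the client.
    """
    for name, zxid in server_zxids.items():
        if zxid >= client_last_zxid:
            return name
    return None


if __name__ == "__main__":
    # The client has seen zxid 7, so a server at zxid 5 would show it an older view
    print(pick_server(7, {"s1": 5, "s2": 9}))
```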
1. Failure and recovery of a follower
2. Failure and recovery of a different follower
3. Failure of the leader
4. Failure of two followers (a, b) in the first two marks, and recovery at the third mark (c)
5. Failure of the leader
6. Recovery of the leader