ZooKeeper: Wait-free coordination for Internet-scale systems - PowerPoint PPT Presentation

  1. ZooKeeper: Wait-free coordination for Internet-scale systems
     Xuyang Zhang, Zhesheng Xie

  2. What is ZooKeeper?
     A coordination service for distributed applications that provides a simple, high-performance kernel for building more complex coordination primitives at the client.

  3. Motivation
     ● Coordinating and managing a service is difficult in a distributed environment.
     ● Large-scale distributed applications require different forms of coordination (e.g., configuration, group membership, leader election).
     ● Expose an API that enables the implementation of custom primitives.

  4. Why we need ZooKeeper
     ZooKeeper exposes a generic API that allows custom primitives.
     ● A coordination kernel that enables new primitives without requiring changes to the service core.
     ● Enables multiple forms of coordination adapted to the requirements of applications.

  5. Why we need ZooKeeper
     ● Wait-free
     ● Pipeline architecture

  6. ZooKeeper Applications
     ● Yahoo! Message Broker
       ○ A distributed publish-subscribe system

  7. ZooKeeper Applications
     ● Katta
       ○ A distributed indexer
     ZooKeeper primitives used: group membership, leader election, configuration management.

  8. ZooKeeper Applications
     ● The Fetching Service
       ○ Part of Yahoo!
       ○ Has a master process that commands page-fetching processes
       ○ The master provides the fetchers with configuration
       ○ Fetchers write back to report their status and health
     FS uses ZooKeeper to manage configuration metadata and to elect masters (leader election).

  9. ZooKeeper Service: Service Overview
     ● ZooKeeper client library
       ○ Client API
       ○ Network connection management
     ● Client: a user of the ZooKeeper service
     ● Server: a process providing the ZooKeeper service
     ● Znode: an in-memory data object
     ● Session: the network connection between a client and a ZooKeeper server

  10. znode
      Two types of znodes:
      ● Regular: created and deleted explicitly by clients.
      ● Ephemeral: created by clients and removed automatically by the system when the session ends.
      Sequential flag: a monotonically increasing counter is appended to the znode's name.
      Watch flag: lets clients receive timely notifications of changes without polling.
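
A toy in-memory model may make the ephemeral and sequential flags concrete. `ToyTree`, its method names, and the ten-digit zero-padded suffix are illustrative assumptions for this sketch, not the ZooKeeper client API:

```python
class ToyTree:
    """Toy stand-in for a znode tree (hypothetical, not the real service)."""

    def __init__(self):
        self.nodes = {}      # path -> owning session id (None for regular znodes)
        self.counters = {}   # parent path -> next sequence number

    def create(self, path, session=None, ephemeral=False, sequential=False):
        if sequential:
            # Append a monotonically increasing counter kept per parent znode.
            parent = path.rsplit("/", 1)[0]
            n = self.counters.get(parent, 0)
            self.counters[parent] = n + 1
            path = f"{path}{n:010d}"
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = session if ephemeral else None
        return path

    def close_session(self, session):
        # The service removes every ephemeral znode owned by the ended session.
        for path in [p for p, owner in self.nodes.items() if owner == session]:
            del self.nodes[path]


tree = ToyTree()
tree.create("/app")
first = tree.create("/app/task-", sequential=True)
second = tree.create("/app/task-", sequential=True)
tree.create("/app/worker", session="s1", ephemeral=True)
tree.close_session("s1")  # "/app/worker" disappears, regular znodes stay
```

Regular znodes survive the session; only the ephemeral one is cleaned up, which is what the group membership and lock recipes later in the deck rely on.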

  11. Data Model
      A file system with a simplified API that stores only metadata for coordination purposes.
      Figure: illustration of the ZooKeeper hierarchical namespace.

  12. Sessions
      A session represents the network connection between a client and a ZooKeeper server. With a session:
      ● The server can use a timeout to decide whether a client is faulty.
      ● A client observes a succession of state changes that reflect the execution of its operations.
      ● The service is persistent: a client can move transparently from one server to another within a ZooKeeper ensemble.

  13. Client API
      ● create(path, data, flags)
      ● delete(path, version)
      ● exists(path, watch)
      ● getData(path, watch)
      ● setData(path, data, version)
      ● getChildren(path, watch)
      ● sync(path)
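
The `version` argument of `setData` and `delete` makes those calls conditional, which allows optimistic updates. The sketch below is a toy in-memory model, not the client library: `ToyStore`, `BadVersion`, and `increment` are hypothetical names, and treating `-1` as "skip the version check" is stated here as an assumption taken from the ZooKeeper paper rather than from the slides.

```python
class BadVersion(Exception):
    """Raised when the caller's expected version is stale."""


class ToyStore:
    def __init__(self):
        self.znodes = {}  # path -> (data, version)

    def create(self, path, data):
        if path in self.znodes:
            raise KeyError(path)
        self.znodes[path] = (data, 0)

    def get_data(self, path):
        return self.znodes[path]

    def set_data(self, path, data, version):
        _, current = self.znodes[path]
        if version not in (-1, current):  # -1 skips the check (assumption)
            raise BadVersion(f"expected {version}, found {current}")
        self.znodes[path] = (data, current + 1)


def increment(store, path):
    """Read-modify-write that retries when another writer got in first."""
    while True:
        data, version = store.get_data(path)
        try:
            store.set_data(path, str(int(data) + 1), version)
            return
        except BadVersion:
            continue


store = ToyStore()
store.create("/counter", "0")
_, stale_version = store.get_data("/counter")
increment(store, "/counter")  # bumps the znode to version 1
try:
    store.set_data("/counter", "99", stale_version)  # uses the old version
    conflict = False
except BadVersion:
    conflict = True
increment(store, "/counter")
```

The stale write is rejected instead of silently overwriting the newer value, so no lock is needed for a simple counter.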

  14. ZooKeeper Guarantees
      ● Ordering guarantees
        ○ Linearizable writes
        ○ FIFO client order
      ● Liveness
      ● Durability

  15. Scenario Analysis
      Leader election systems: a number of processes elect a leader to command worker processes. Requirements:
      ● As the new leader starts making changes, we do not want other processes to start using the configuration that is being changed.
      ● If the new leader dies before the configuration has been fully updated, we do not want the processes to use this partial configuration.

  16. Scenario Analysis
      How to satisfy the requirements? Use a ready znode. When the new leader starts making changes, it:
      1. Deletes the ready znode
      2. Updates the configuration znodes (accelerated by pipelining)
      3. Creates a new ready znode
      Ordering guarantee: if a process sees the ready znode, it must also see all configuration changes made by the new leader.

  17. Scenario Analysis
      What else can go wrong?
      ● A process sees the ready znode first.
      ● The leader deletes the old ready znode and starts updating the configuration.
      ● The process starts reading the configuration while the change is in progress.
      How to solve it? Set a watch on the ready znode. The ordering guarantee for notifications:
      ● If a client is watching for a change, it sees the notification event before it sees the new state of the system after the change is made.
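
The ready-znode protocol from the last two slides can be sketched with a single-threaded toy model. Everything here is an illustrative assumption: `ToyZK`, the `/ready` and `/config/...` paths, and the `interleave` hook that stands in for a leader acting in the middle of a reader's loop.

```python
class ToyZK:
    def __init__(self):
        self.data = {}
        self.watchers = {}  # path -> one-shot callbacks

    def exists(self, path, watch=None):
        if watch is not None:
            self.watchers.setdefault(path, []).append(watch)
        return path in self.data

    def put(self, path, value=""):
        self.data[path] = value
        self._notify(path)

    def delete(self, path):
        del self.data[path]
        self._notify(path)

    def _notify(self, path):
        # Watches fire once and are then discarded.
        for callback in self.watchers.pop(path, []):
            callback(path)


def leader_update(zk, new_config):
    if zk.exists("/ready"):
        zk.delete("/ready")                # 1. delete the ready znode
    for key, value in new_config.items():  # 2. update configuration znodes
        zk.put(f"/config/{key}", value)
    zk.put("/ready")                       # 3. create a new ready znode


def read_config(zk, keys, interleave=lambda: None):
    """Return a consistent snapshot, or None if the caller should retry."""
    changed = []
    if not zk.exists("/ready", watch=changed.append):
        return None
    snapshot = {}
    for key in keys:
        snapshot[key] = zk.data[f"/config/{key}"]
        interleave()
        if changed:  # the watch fired: values may mix old and new state
            return None
    return snapshot


zk = ToyZK()
leader_update(zk, {"a": 1, "b": 1})
clean = read_config(zk, ["a", "b"])
torn = read_config(zk, ["a", "b"],
                   interleave=lambda: leader_update(zk, {"a": 2, "b": 2}))
retry = read_config(zk, ["a", "b"])
```

The second read is abandoned because the notification arrives before the reader can observe any of the new values, which is the ordering guarantee the slide describes.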

  18. Scenario Analysis
      What if clients have other communication channels in addition to ZooKeeper?
      ● A and B share configuration on the ZooKeeper service.
      ● A updates the configuration and sends B a notification via the other channel.
      ● B is supposed to read the updated configuration, but may read a stale value because of delay.
      Solution: B issues sync() before its read operation.

  19. Examples of Primitives

  20. Configuration Management
      ● The configuration is stored in a znode.
      ● A process obtains the configuration by reading the znode.
      ● A process can set the watch flag to true, so that it is notified when the configuration changes.
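
Because a watch fires only once, a process that wants to keep following the configuration has to set a new watch each time it re-reads the znode. A minimal sketch, assuming a hypothetical `ToyZnode` and `Worker` rather than the real client API:

```python
class ToyZnode:
    def __init__(self, data):
        self.data = data
        self.watchers = []  # one-shot callbacks

    def get_data(self, watch=None):
        if watch is not None:
            self.watchers.append(watch)
        return self.data

    def set_data(self, data):
        self.data = data
        # Fire each pending watch once, then forget it.
        pending, self.watchers = self.watchers, []
        for callback in pending:
            callback()


class Worker:
    def __init__(self, znode):
        self.znode = znode
        self.config = None
        self.reload()

    def reload(self):
        # Each read registers a fresh watch, so later changes are not missed.
        self.config = self.znode.get_data(watch=self.reload)


config_znode = ToyZnode("timeout=5")
worker = Worker(config_znode)
config_znode.set_data("timeout=10")
config_znode.set_data("timeout=20")
```

After both updates `worker.config` holds the latest value, even though no polling took place.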

  21. Rendezvous
      ● Scenario: a client needs to start a master and several workers, but it doesn’t know the address of the master ahead of time.
      ● Solution: use a znode to store the address.
        ○ Master: fills in the znode with its address
        ○ Client: watches and reads the znode

  22. Group Membership
      ● Use a znode z_group to represent a group.
      ● When a member starts, it creates an ephemeral child znode under z_group.
      ● When a member fails or ends, ZooKeeper removes its child znode.
      ● Group information can be obtained by listing the children of z_group.

  23. Simple Locks
      ● Use an ephemeral znode to represent a lock.
      ● Lock: try to create the znode.
        ○ If creation succeeds, the lock is acquired.
        ○ If it fails, watch the znode.
      ● Unlock: delete the znode.
      ● Problem: when the lock is released, all processes waiting for the lock access ZooKeeper at the same time.
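
The cost of that last bullet can be counted in a toy model. `ToyLockService` and its request counter are hypothetical and only mimic the create/watch/delete pattern described above:

```python
class ToyLockService:
    def __init__(self):
        self.holder = None        # client owning the ephemeral lock znode
        self.watchers = []        # clients watching the lock znode
        self.create_attempts = 0  # create requests sent to the service

    def try_lock(self, client):
        self.create_attempts += 1
        if self.holder is None:
            self.holder = client
            return True
        self.watchers.append(client)  # creation failed: watch the znode
        return False

    def unlock(self, client):
        assert self.holder == client
        self.holder = None
        woken, self.watchers = self.watchers, []
        for waiter in woken:          # every watcher is notified and retries
            self.try_lock(waiter)


svc = ToyLockService()
for name in ["c1", "c2", "c3", "c4"]:
    svc.try_lock(name)
before = svc.create_attempts
svc.unlock("c1")
herd_requests = svc.create_attempts - before
```

A single release triggers one create request per waiter, and all but one of them fail and have to set a watch again.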

  24. Simple Locks without Herd Effect
      ● Use a znode l to represent a lock.
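
The slide text stops after its first bullet. The recipe in the ZooKeeper paper has each client create a sequential ephemeral child under l, hold the lock when its child has the lowest sequence number, and otherwise watch only the child just before its own. The sketch below assumes that recipe is what the slide goes on to describe; `ToyQueueLock` is a hypothetical in-memory model.

```python
class ToyQueueLock:
    def __init__(self):
        self.next_seq = 0
        self.children = []      # live lock znodes, in sequence order
        self.watching = {}      # watched sequence number -> its single waiter
        self.notifications = 0  # watch events delivered so far

    def lock(self):
        n = self.next_seq       # create a sequential ephemeral child
        self.next_seq += 1
        self.children.append(n)
        self._check(n)
        return n

    def _check(self, n):
        idx = self.children.index(n)
        if idx > 0:             # not the lowest: watch the predecessor only
            self.watching[self.children[idx - 1]] = n

    def holder(self):
        return self.children[0] if self.children else None

    def unlock(self, n):
        # Also models the znode vanishing because its session ended.
        self.children.remove(n)
        self.watching = {p: w for p, w in self.watching.items() if w != n}
        waiter = self.watching.pop(n, None)
        if waiter is not None:
            self.notifications += 1
            self._check(waiter)  # the waiter re-reads the children


lock = ToyQueueLock()
a, b, c = lock.lock(), lock.lock(), lock.lock()
lock.unlock(a)
```

Releasing the lock wakes exactly one waiter, in contrast to the simple lock, where every waiter retries.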

  25. Read/Write Locks

  26. Double Barrier
      ● Scenario: clients need to synchronize the beginning and the end of a computation.
      ● Solution:
        ○ Use a znode z_b to represent a barrier.
        ○ Register: create a child znode under z_b.
        ○ Unregister: delete the child znode.
        ○ Enter the barrier when there are enough child znodes under z_b.
        ○ Leave the barrier when z_b has no children.
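
A thread-based toy can illustrate the enter and leave conditions. Here a `threading.Condition` stands in for ZooKeeper watches and a Python set stands in for the children of z_b; `ToyDoubleBarrier` is a hypothetical, single-use model.

```python
import threading


class ToyDoubleBarrier:
    def __init__(self, threshold):
        self.threshold = threshold
        self.children = set()   # stand-in for the child znodes of z_b
        self.entered = False
        self.cond = threading.Condition()

    def enter(self, name):
        with self.cond:
            self.children.add(name)                   # register
            if len(self.children) >= self.threshold:
                self.entered = True
                self.cond.notify_all()
            self.cond.wait_for(lambda: self.entered)  # wait for enough children

    def leave(self, name):
        with self.cond:
            self.children.discard(name)               # unregister
            if not self.children:
                self.cond.notify_all()
            self.cond.wait_for(lambda: not self.children)


log = []
barrier = ToyDoubleBarrier(threshold=3)


def worker(name):
    barrier.enter(name)
    log.append("start")  # the computation would run here
    barrier.leave(name)
    log.append("end")


threads = [threading.Thread(target=worker, args=(f"p{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No worker records "end" until every worker has recorded "start", whatever order the threads happen to run in.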

  27. ZooKeeper Implementation

  28. Overview

  29. Request Processor
      ● Converts client requests into idempotent transactions.
      ● Calculates the new data, new version number, and updated timestamp.
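
One way to picture the conversion: a conditional request ("set the data if the version is 0") becomes a transaction that records the resulting absolute state, so applying it twice changes nothing. The names `SetDataTxn`, `ErrorTxn`, `to_txn`, and `apply_txn` are illustrative assumptions, loosely modeled on the paper's description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetDataTxn:
    path: str
    data: str
    version: int  # absolute new version, not "current + 1"
    mtime: int    # updated timestamp


@dataclass(frozen=True)
class ErrorTxn:
    reason: str


def to_txn(state, path, data, expected_version, now):
    """Turn a conditional setData request into an idempotent transaction."""
    _, current, _ = state[path]  # state: path -> (data, version, mtime)
    if expected_version not in (-1, current):
        return ErrorTxn("version mismatch")
    return SetDataTxn(path, data, current + 1, now)


def apply_txn(state, txn):
    if isinstance(txn, SetDataTxn):
        state[txn.path] = (txn.data, txn.version, txn.mtime)


state = {"/cfg": ("a", 0, 100)}
txn = to_txn(state, "/cfg", "b", expected_version=0, now=200)
apply_txn(state, txn)
apply_txn(state, txn)  # replaying the same transaction is harmless
stale = to_txn(state, "/cfg", "c", expected_version=0, now=300)
```

The version check happens once, when the transaction is generated; replicas that apply it later do not need to re-evaluate the condition.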

  30. Atomic Broadcast
      ● Changes are broadcast through Zab, an atomic broadcast protocol.
        ○ Guarantees that messages are received in the same order in which they were sent.
        ○ Guarantees that all messages from the previous leader are delivered to the new leader before the new leader broadcasts its own messages.

  31. Replicated Database
      ● Every replica stores a copy of the ZooKeeper state in memory.
      ● How to recover?
        ○ Keep track of proposals in a log.
        ○ Take snapshots periodically.
      ● How to take snapshots?
        ○ Fuzzy snapshot: the state is not locked while the snapshot is taken.
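
A fuzzy snapshot can capture a combination of values that never existed at any single moment, yet replaying the logged idempotent transactions on top of it still reaches the right state. A small worked example, with made-up paths and a tuple-based transaction format chosen only for this sketch:

```python
def apply_txn(state, txn):
    path, data, version = txn  # idempotent: sets absolute data and version
    state[path] = (data, version)


log = [("/a", "a1", 1), ("/b", "b1", 1), ("/a", "a2", 2)]
live = {"/a": ("a0", 0), "/b": ("b0", 0)}
snapshot_start = 0  # log position when the snapshot began
snapshot = {}

# The snapshot copies znodes one at a time while writes keep arriving.
snapshot["/a"] = live["/a"]  # copied before any write: ("a0", 0)
apply_txn(live, log[0])
apply_txn(live, log[1])
snapshot["/b"] = live["/b"]  # copied after two writes: ("b1", 1)
apply_txn(live, log[2])

fuzzy = dict(snapshot)       # /a at version 0 together with /b at version 1

# Recovery: load the snapshot, then replay the log from where it began.
recovered = dict(snapshot)
for txn in log[snapshot_start:]:
    apply_txn(recovered, txn)
```

Re-applying a transaction that the snapshot already reflects (the write to `/b`) is harmless because each transaction sets absolute values.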

  32. Client-Server Interactions
      ● Write
        ○ Writes are processed in order, never concurrently.
        ○ Notifications are triggered locally.
      ● Read
        ○ Handled locally.
        ○ Tagged with the zxid that corresponds to the last write seen by the server.
      ● Sync
        ○ After a sync, reads reflect any changes made before the sync.
      ● Switching servers
        ○ The zxid is used to guarantee that the new server's view is no earlier than the client's.
      ● Failure detection: use timeouts.
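
The server-switch rule can be sketched as a comparison of zxids. `ToyServer` and `ToyClient` are hypothetical; the sketch only assumes that a client remembers the highest zxid it has seen and that a server behind that point does not accept the session until it has caught up.

```python
class ToyServer:
    def __init__(self, name, last_zxid):
        self.name = name
        self.last_zxid = last_zxid  # zxid of the last write this replica applied


class ToyClient:
    def __init__(self):
        self.last_seen_zxid = 0
        self.server = None

    def connect(self, server):
        # A server that is behind the client's view must not serve it.
        if server.last_zxid < self.last_seen_zxid:
            return False
        self.server = server
        return True

    def read(self):
        # Each read response is tagged with the server's latest zxid.
        self.last_seen_zxid = max(self.last_seen_zxid, self.server.last_zxid)
        return self.last_seen_zxid


fresh = ToyServer("fresh", last_zxid=7)
lagging = ToyServer("lagging", last_zxid=5)

client = ToyClient()
client.connect(fresh)
client.read()                           # the client has now seen zxid 7
refused = not client.connect(lagging)   # lagging is behind the client
lagging.last_zxid = 7                   # the replica catches up
accepted = client.connect(lagging)
```

This keeps a client from observing older state after a reconnect than it saw before.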

  33. Evaluation

  34. Throughput

  35. Throughput upon Failure
      1. Failure and recovery of a follower
      2. Failure and recovery of a different follower
      3. Failure of the leader
      4. Failure of two followers (a, b) at the first two marks, and recovery at the third mark (c)
      5. Failure of the leader
      6. Recovery of the leader

  36. Q&A

  37. Thank you!
