ZooKeeper: Wait-free coordination for Internet-scale systems - PowerPoint PPT Presentation



SLIDE 1

ZooKeeper: Wait-free coordination for Internet-scale systems

Xuyang Zhang, Zhesheng Xie

SLIDE 2

What is ZooKeeper

A distributed coordination service for distributed applications that provides a simple, high-performance kernel for building more complex coordination primitives at the client.

SLIDE 3

Motivation

  • Coordinating and managing a service in a distributed environment is difficult
  • Large-scale distributed applications require different forms of coordination (e.g., configuration, group membership, leader election)
  • Expose an API that enables implementation of custom primitives
SLIDE 4

Why we need ZooKeeper

ZooKeeper exposes a generic API that allows clients to build custom primitives.

  • A coordination kernel that enables new primitives without requiring changes to the service core.
  • Enables multiple forms of coordination, adapted to the requirements of each application.

SLIDE 5

Why we need ZooKeeper

  • Wait-free
  • Pipeline architecture
SLIDE 6

ZooKeeper Applications

  • Yahoo! Message Broker

○ A distributed publish-subscribe system

SLIDE 7

ZooKeeper Applications

  • Katta

○ A distributed indexer

ZooKeeper primitives used: group membership, leader election, configuration management

SLIDE 8

ZooKeeper Applications

  • The Fetching Service

○ Part of Yahoo!
○ Has a master process that commands page-fetching processes
○ The master provides fetchers with configuration
○ Fetchers write back, informing the master of their status and health

FS uses ZooKeeper to:
  • Manage configuration metadata
  • Elect masters (leader election)

SLIDE 9

ZooKeeper Service: Service Overview

  • ZooKeeper Client Library:

    ○ Client API
    ○ Network connection management

  • Client: user of the ZooKeeper service
  • Server: process providing the ZooKeeper service
  • Znode: in-memory data object
  • Session: network connection between a client and a ZooKeeper server
SLIDE 10

Znode

Two types of znodes:
  • Regular: created and deleted explicitly by the client
  • Ephemeral: created by the client, removed automatically by the system when the creating session ends

Flags:
  • Sequential flag: a monotonically increasing counter is appended to the znode's name
  • Watch flag: lets clients receive timely notifications of changes without polling

SLIDE 11

Data Model

Illustration of ZooKeeper's hierarchical namespace. It resembles a file system with a simplified API, and stores only metadata for coordination purposes.

SLIDE 12

Sessions

A session represents the network connection between a client and the ZooKeeper service. With a session:

  • The server can use timeouts to decide whether a client is faulty
  • A client observes a succession of state changes that reflect the execution of its operations
  • Sessions persist across servers: a client can move transparently from one server to another within a ZooKeeper ensemble

SLIDE 13

Client API

create(path, data, flags)
delete(path, version)
exists(path, watch)
getData(path, watch)
setData(path, data, version)
getChildren(path, watch)
sync(path)
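As a rough illustration of these call semantics, here is a minimal in-memory sketch (a plain dict standing in for the server; `ZnodeStore` and `BadVersionError` are invented names for this sketch, not the real client library):

```python
# Minimal in-memory sketch of the ZooKeeper data model: znodes carry a
# version number, and conditional writes fail on a version mismatch.
# This simulates call semantics only; it is not the real client library.

class BadVersionError(Exception):
    pass

class ZnodeStore:
    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        if path in self._nodes:
            raise KeyError(f"node exists: {path}")
        self._nodes[path] = (data, 0)

    def get_data(self, path):
        return self._nodes[path]  # returns (data, version)

    def set_data(self, path, data, version):
        _, current = self._nodes[path]
        # version == -1 means an unconditional write, as in ZooKeeper
        if version != -1 and version != current:
            raise BadVersionError(f"expected {version}, have {current}")
        self._nodes[path] = (data, current + 1)

    def delete(self, path, version):
        _, current = self._nodes[path]
        if version != -1 and version != current:
            raise BadVersionError(f"expected {version}, have {current}")
        del self._nodes[path]

    def exists(self, path):
        return path in self._nodes
```

The version argument is what lets clients build conditional updates (compare-and-swap style) on top of plain reads and writes.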

SLIDE 14

ZooKeeper Guarantees

  • Linearizable writes
  • FIFO client order
  • Liveness
  • Durability

Ordering Guarantee

SLIDE 15

Scenario Analysis

Leader-election systems: a number of processes elect a leader to command worker processes. Requirements:

  • As the new leader starts making changes, other processes should not start using the configuration that is being changed
  • If the new leader dies before the configuration has been fully updated, the processes should not use this partial configuration

SLIDE 16

Scenario Analysis

How to satisfy the requirements? Use a ready znode. The new leader makes changes as follows:

  1. Delete the ready znode
  2. Update the configuration znodes (accelerated by pipelining)
  3. Create a new ready znode

Ordering guarantee: if a process sees the ready znode, it must also see all configuration changes made by the leader before creating it.
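The leader's steps can be sketched against a plain dict standing in for the znode tree (illustrative names; a single-threaded simulation that shows the protocol shape, not the ordering machinery itself):

```python
# Sketch of the "ready znode" protocol over a plain dict standing in for
# the znode tree. Because ZooKeeper applies a client's writes in FIFO
# order, a reader that sees /ready can rely on seeing the configuration
# writes that preceded its creation.

def leader_update_config(store, new_config):
    store.pop("/ready", None)          # 1. delete the ready znode
    for path, data in new_config.items():
        store[path] = data             # 2. update configuration znodes
    store["/ready"] = b""              # 3. create a new ready znode

def reader_get_config(store, paths):
    if "/ready" not in store:
        return None                    # configuration is being changed
    return {p: store[p] for p in paths}
```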

SLIDE 17

Scenario Analysis

What else can go wrong?

  • A process sees the ready znode first.
  • The leader deletes the old ready znode and starts updating the configuration.
  • The process starts reading the configuration while the change is in progress.

How to solve this? Set a watch on the ready znode. The ordering guarantee for notifications: if a client is watching for a change, it will see the notification event before it sees the new state of the system after the change is made.

SLIDE 18

Scenario Analysis

What if clients have other communication channels in addition to ZooKeeper? Suppose A and B share configuration through the ZooKeeper service:

  • A updates the configuration, then notifies B through a direct channel
  • B then reads the configuration, but may not see the update yet due to replication delay

Solution: B issues sync() before its read operation, which causes its server to apply all pending writes before serving the read.

SLIDE 19

Examples of Primitives

SLIDE 20

Configuration Management

  • Configuration is stored in a znode
  • A process can obtain the configuration by reading the znode
  • A process can set the watch flag to true, so it will be notified when the configuration changes
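The read-with-watch pattern can be sketched as follows, assuming one-shot watches as in ZooKeeper (in-memory stand-in; `WatchedStore` is an invented name for this sketch):

```python
# Sketch of one-shot watches: get_data(path, watch=cb) registers a
# callback that fires once on the next change to that znode, after which
# the client must re-read (and re-register) to keep watching.

class WatchedStore:
    def __init__(self):
        self._data = {}
        self._watches = {}  # path -> list of pending callbacks

    def get_data(self, path, watch=None):
        if watch is not None:
            self._watches.setdefault(path, []).append(watch)
        return self._data.get(path)

    def set_data(self, path, data):
        self._data[path] = data
        for cb in self._watches.pop(path, []):  # one-shot: fire and clear
            cb(path)
```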

SLIDE 21

Rendezvous

  • Scenario: A client needs to start a master and several workers, but it does not know the address of the master ahead of time.
  • Solution: Use a znode to store the address
    ○ Master: fills in the znode with its address
    ○ Client: watches and reads the znode

SLIDE 22

Group Membership

  • Use a znode zgroup to represent a group
  • When a member starts, it creates an ephemeral child znode under zgroup
  • When a member fails or ends, its child znode is removed automatically by ZooKeeper
  • Group information can be obtained by listing the children of zgroup
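The membership bookkeeping can be sketched with a session id standing in for a real ZooKeeper session (invented names, no real networking):

```python
# Sketch of group membership via ephemeral child znodes: each member's
# node lives only as long as its session, so closing the session (or a
# timeout standing in for a crash) removes the member automatically.

class Group:
    def __init__(self):
        self._members = {}  # session_id -> member name

    def join(self, session_id, name):
        self._members[session_id] = name      # ephemeral child of zgroup

    def session_closed(self, session_id):
        self._members.pop(session_id, None)   # ZooKeeper deletes the znode

    def list_members(self):
        return sorted(self._members.values())  # getChildren(zgroup)
```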
SLIDE 23

Simple Locks

  • Use an ephemeral znode to represent a lock
  • Lock: try to create the znode
    ○ If the create succeeds, the lock is acquired
    ○ If it fails, watch the znode and retry when it is deleted
  • Unlock: delete the znode
  • Problem (herd effect): when the lock is released, all processes waiting for the lock wake up and contend for it at the same time
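The lock semantics can be sketched with a single field standing in for the ephemeral znode (invented names; real watches and retry loops omitted):

```python
# Sketch of the simple lock: the lock is "held" iff the znode exists.
# create() succeeds only for the first caller; everyone else must watch
# the znode and retry on deletion, which is what causes the herd effect.

class SimpleLock:
    def __init__(self):
        self._holder = None  # stands in for the ephemeral znode

    def try_lock(self, client):
        if self._holder is None:      # create(lock, EPHEMERAL) succeeded
            self._holder = client
            return True
        return False                  # create failed: caller sets a watch

    def unlock(self, client):
        if self._holder == client:    # delete the znode
            self._holder = None
```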

SLIDE 24

Simple Locks without Herd Effect

  • Use a znode l to represent the lock
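The transcript drops the algorithm itself; in the paper, clients create ephemeral sequential children of l, and each waiter watches only its immediate predecessor, so a release wakes exactly one client. A sketch of that ordering decision (the helper name is invented):

```python
# Sketch of herd-free lock ordering: clients create ephemeral SEQUENTIAL
# children of l (e.g. "lock-0003"). The child with the lowest sequence
# number holds the lock; every other client watches only the child just
# before its own, so a release notifies exactly one waiter.

def lock_status(children, my_node):
    ordered = sorted(children)        # sequence suffix gives the order
    i = ordered.index(my_node)
    if i == 0:
        return True, None             # lowest child: lock acquired
    return False, ordered[i - 1]      # watch the immediate predecessor
```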
SLIDE 25

Read/Write Locks
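This slide's body did not survive the transcript; in the paper, read/write locks extend the previous scheme: a write lock waits on any earlier child, while a read lock waits only on the nearest earlier write child, so readers can hold the lock concurrently. A sketch of that check (the helper name is invented):

```python
# Sketch of read/write lock ordering: children are named "write-NNNN" or
# "read-NNNN". A writer is blocked by any earlier child; a reader is
# blocked only by earlier *write* children, so readers proceed together.

def rw_lock_status(children, my_node):
    ordered = sorted(children, key=lambda n: n.split("-")[1])
    i = ordered.index(my_node)
    earlier = ordered[:i]
    if my_node.startswith("write-"):
        blockers = earlier            # writers wait on anyone earlier
    else:
        blockers = [n for n in earlier if n.startswith("write-")]
    if not blockers:
        return True, None             # lock acquired
    return False, blockers[-1]        # watch the nearest blocking node
```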

SLIDE 26

Double Barrier

  • Scenario: Clients need to synchronize the beginning and the end of a computation
  • Solution:
    ○ Use a znode zb to represent a barrier
    ○ Register: create a child znode under zb
    ○ Unregister: delete that child znode
    ○ Enter the barrier when there are enough child znodes under zb
    ○ Leave the barrier when zb has no children left
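The barrier bookkeeping can be sketched with a set standing in for zb's children (invented names; a real client would watch the children instead of polling):

```python
# Sketch of a double barrier: processes may enter once the group is
# full, and may leave only once every process has unregistered.

class DoubleBarrier:
    def __init__(self, threshold):
        self.threshold = threshold
        self.children = set()   # child znodes under zb

    def register(self, name):
        self.children.add(name)

    def unregister(self, name):
        self.children.discard(name)

    def can_enter(self):
        return len(self.children) >= self.threshold

    def can_leave(self):
        return len(self.children) == 0
```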

SLIDE 27

ZooKeeper Implementation

SLIDE 28

Overview

SLIDE 29

Request Processor

  • Converts client requests into idempotent transactions
  • Calculates the new data, new version number, and updated timestamp
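A sketch of why this conversion matters, assuming a setData request with an expected version (invented names; the paper calls the resulting records setDataTXN and errorTXN). The transaction carries the final data and version, so replaying it is harmless:

```python
# A conditional setData *request* depends on current state, but the
# *transaction* generated from it records the final data and version,
# so applying the same transaction twice leaves the state unchanged.

def to_txn(state, path, data, expected_version):
    cur_data, cur_version = state[path]
    if expected_version != -1 and expected_version != cur_version:
        return ("error", path)                 # becomes an error txn
    return ("setData", path, data, cur_version + 1)

def apply_txn(state, txn):
    if txn[0] == "setData":
        _, path, data, version = txn
        state[path] = (data, version)          # idempotent: fixed result
```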
SLIDE 30

Atomic Broadcast

  • Changes are broadcast through Zab, an atomic broadcast protocol
    ○ Guarantees that messages are delivered in the order they were sent
    ○ Guarantees that all messages from the previous leader are delivered to a new leader before it broadcasts its own messages

SLIDE 31

Replicated Database

  • Every replica stores an in-memory copy of the ZooKeeper state
  • How to recover?
    ○ Keep a log of committed proposals
    ○ Take snapshots periodically
  • How are snapshots taken?
    ○ Fuzzy snapshot: the state is not locked while the snapshot is taken; because transactions are idempotent, replaying the log over the snapshot still yields a consistent state

SLIDE 32

Client-Server Interactions

  • Write
    ○ Writes are processed in order and do not run concurrently
    ○ Notifications are triggered locally by the server the client is connected to
  • Read
    ○ Handled locally by the serving server
    ○ Tagged with the zxid of the last write that server has seen
  • Sync
    ○ After a sync, read results reflect all changes issued before the sync
  • Switching servers
    ○ The zxid is used to guarantee that the new server's view is no older than the client's
  • Failure detection: timeouts
SLIDE 33

Evaluation

SLIDE 34

Throughput

SLIDE 35

Throughput upon Failure

1. Failure and recovery of a follower
2. Failure and recovery of a different follower
3. Failure of the leader
4. Failure of two followers (a, b) at the first two marks, and recovery at the third mark (c)
5. Failure of the leader
6. Recovery of the leader

SLIDE 36

Q&A

SLIDE 37

Thank you!