

SLIDE 1

Census:

Location-Aware Membership Management for Large-Scale Distributed Systems

James Cowling Dan R. K. Ports Barbara Liskov Raluca Ada Popa Abhijeet Gaikwad* MIT CSAIL *École Centrale Paris

SLIDE 2

Motivation

  • Large-scale distributed systems are becoming more common: multiple datacenters, cloud computing, etc.
  • Reconfigurable distributed services adapt as nodes join, leave, or fail
  • A membership service that tracks changes in system membership can simplify system design

SLIDE 3

Census

  • A platform for building large-scale, distributed applications
  • Two main components: membership service, multicast communication mechanism
  • Designed to work in the wide-area
  • Locality-aware; fault tolerant

SLIDE 4

Membership Service

  • Time divided into sequential, fixed-duration epochs
  • Each epoch has a membership view: list of nodes (ID, IP address, location, etc.)
  • Consistency property: every node sees the same membership view for a particular epoch
    ➡ can simplify protocol design (e.g. partitioning storage)
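The consistency property is what enables such simplifications: a storage system can assign keys to nodes as a pure function of the shared view, with no extra coordination. A minimal sketch (node names and the hashing scheme are illustrative, not from Census):

```python
import hashlib

def partition_owner(key: str, view: list[str]) -> str:
    # Because every node sees the same view for a given epoch, all
    # nodes compute the same owner for a key without coordinating.
    members = sorted(view)               # canonical order shared by all
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return members[h % len(members)]

view = ["node-a", "node-b", "node-c"]    # hypothetical epoch view
owner = partition_owner("user:42", view)
```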

SLIDE 5

Consistency & Scalability

Existing systems: a tradeoff between consistency and scalability. Examples:

  • virtual synchrony (e.g. ISIS, Spread)
  • distributed hash tables (e.g. Chord, Pastry)

Census provides consistent membership views and is designed for large-scale, wide-area systems

SLIDE 6

Membership Service: Basic Approach

  • Designate one node as leader

SLIDE 7

Membership Service: Basic Approach

  • Designate one node as leader
  • Nodes report membership changes to leader

SLIDE 8

Membership Service: Basic Approach

  • Designate one node as leader
  • Nodes report membership changes to leader
  • Leader aggregates changes; multicasts item

SLIDE 9

Membership Service: Basic Approach

  • Designate one node as leader
  • Nodes report membership changes to leader
  • Leader aggregates changes; multicasts item
  • Members enter next epoch, update membership
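The basic single-leader loop can be sketched as follows (class and item format are illustrative; the slides do not specify Census's wire format):

```python
class Leader:
    """Minimal sketch of the basic membership-service leader."""

    def __init__(self, initial_view):
        self.epoch = 0
        self.view = set(initial_view)
        self.pending = []                    # changes reported this epoch

    def report(self, op, node):
        self.pending.append((op, node))      # op is "join" or "leave"

    def end_of_epoch(self):
        # Aggregate the epoch's changes into the next view, and build
        # the item that is multicast to all members.
        for op, node in self.pending:
            (self.view.add if op == "join" else self.view.discard)(node)
        self.pending = []
        self.epoch += 1
        return {"epoch": self.epoch, "view": sorted(self.view)}

leader = Leader(["a", "b", "c"])
leader.report("join", "d")
leader.report("leave", "b")
item = leader.end_of_epoch()
```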

SLIDE 10

What are the Challenges?

  • Delivering items efficiently and reliably ➡ Multicast mechanism
  • Reducing load on the leader ➡ Multi-region structure
  • Dealing with leader failure ➡ Fault tolerance

SLIDE 11

Outline

  • Overview
  • Basic Approach
  • Multicast Mechanism
  • Multi-region Design
  • Fault Tolerance
  • Evaluation
SLIDE 12

Multicast Mechanism

  • Need multicast to distribute membership updates and application data efficiently
  • Goals: high reliability, low latency, fair load balancing
  • Many multicast protocols exist... Census takes a different approach: exploits consistent membership information for a simpler design and lower overhead

SLIDE 13

Multicast Topology

  • Multiple interior-disjoint trees (similar to SplitStream)
  • Each node interior in one tree, leaf in others
  • Membership data distributed in full on each tree; application's multicast data erasure-coded
  • Improved reliability and load balancing vs. a single tree
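To see why erasure coding helps, consider a toy (k+1, k) XOR parity code: any k of the k+1 fragments suffice to rebuild the data, so a fragment lost on one tree can be reconstructed from the fragments carried by the other trees. Census uses a more general coding scheme; this sketch only illustrates the idea:

```python
from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list:
    # k data fragments plus one XOR parity fragment
    flen = -(-len(data) // k)                    # ceiling division
    padded = data.ljust(flen * k, b"\0")
    frags = [padded[i * flen:(i + 1) * flen] for i in range(k)]
    frags.append(reduce(xor, frags))             # parity fragment
    return frags

def recover(frags):
    # The XOR of all k+1 fragments is zero, so the single missing
    # fragment equals the XOR of the ones that did arrive.
    return reduce(xor, (f for f in frags if f is not None))

frags = encode(b"hello world!", k=4)
lost = frags[2]
frags[2] = None                 # fragment lost on one tree
assert recover(frags) == lost   # rebuilt from the other trees
```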

SLIDE 14

Multicast Topology

SLIDE 15

Multicast Topology

SLIDE 16

Multicast Topology

SLIDE 17

Building Multicast Trees

  • Exploit consistent membership knowledge: tree structure given by deterministic function of membership
    ➡ allows simple “centralized” algorithm in distributed context
  • Nodes independently recompute trees “on-the-fly”, upon receiving membership updates
  • No protocol overhead beyond that of membership service (even during churn!)

SLIDE 18

Tree Building Algorithm

SLIDE 19

Tree Building Algorithm

Background: network coordinates (e.g. Vivaldi)

d(x,y) ≈ latency(x,y)

SLIDE 20

Tree Building Algorithm

Assign nodes to a tree (color) based on ID
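The color assignment can be any deterministic function of the node ID; a hash-based sketch (the specific scheme is an assumption for illustration, not Census's actual rule):

```python
import hashlib

NUM_TREES = 4

def tree_color(node_id: str, num_trees: int = NUM_TREES) -> int:
    # Every member evaluates the same pure function over the same IDs,
    # so all members agree on each node's color with no messages.
    digest = hashlib.sha1(node_id.encode()).digest()
    return digest[0] % num_trees

colors = {nid: tree_color(nid) for nid in ["n1", "n2", "n3", "n4"]}
```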

SLIDE 21

Building the Red Tree

Split region through center of mass, along widest axis

SLIDE 22

Building the Red Tree

Choose closest red node in each subregion, attach to root

SLIDE 23

Building the Red Tree

Recursively subdivide each subregion in the same way

SLIDE 24

Building the Red Tree

Recursively subdivide each subregion in the same way

SLIDE 25

Building the Red Tree

Recursively subdivide each subregion in the same way

SLIDE 26

Building the Red Tree

Recursively subdivide each subregion in the same way

SLIDE 27

Building the Red Tree

Recursively subdivide each subregion in the same way

SLIDE 28

Building the Red Tree

Attach other-colored nodes to the nearest available red node
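The steps on these slides can be sketched as one recursive procedure, assuming 2-D network coordinates and taking “closest” to mean Euclidean distance to the parent in coordinate space (the exact selection and tie-breaking rules in Census may differ):

```python
from statistics import mean

def dist(a, b):
    (ax, ay), (bx, by) = a["coord"], b["coord"]
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def split(region):
    # Split through the center of mass, along the widest axis.
    xs = [n["coord"][0] for n in region]
    ys = [n["coord"][1] for n in region]
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    center = mean(n["coord"][axis] for n in region)
    lo = [n for n in region if n["coord"][axis] < center]
    hi = [n for n in region if n["coord"][axis] >= center]
    return lo, hi

def build_red_tree(region, parent, edges):
    # In each subregion, attach the red node closest to `parent`,
    # then recurse into that subregion with the new node as parent.
    for half in split(region):
        reds = [n for n in half if n["color"] == "red"]
        if not reds:
            continue
        child = min(reds, key=lambda n: dist(n, parent))
        edges.setdefault(parent["id"], []).append(child["id"])
        rest = [n for n in half if n["id"] != child["id"]]
        if rest:
            build_red_tree(rest, child, edges)

def attach_leaves(region, edges):
    # Other-colored nodes become leaves under the nearest red node.
    reds = [n for n in region if n["color"] == "red"]
    for n in region:
        if n["color"] != "red":
            nearest = min(reds, key=lambda r: dist(r, n))
            edges.setdefault(nearest["id"], []).append(n["id"])

root = {"id": "R", "coord": (0.0, 0.0), "color": "red"}
others = [
    {"id": "a", "coord": (1.0, 0.0), "color": "red"},
    {"id": "b", "coord": (-1.0, 0.0), "color": "red"},
    {"id": "c", "coord": (2.0, 0.0), "color": "blue"},
    {"id": "d", "coord": (0.5, 1.0), "color": "red"},
]
edges = {}
build_red_tree(others, root, edges)
attach_leaves(others + [root], edges)
```

Running the same function over the same view on every node yields identical trees everywhere, which is what removes the need for any tree-maintenance protocol.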

SLIDE 29

Multicast Improvements

  • Reduce bandwidth overhead – avoid sending redundant data
  • Reduce multicast latency – choose fragments to send based on expected path length
  • Improve reliability during failures – reconstruct missing fragments from other trees

SLIDE 30

Outline

  • Overview
  • Basic Approach
  • Multicast Mechanism
  • Multi-region Design
  • Fault Tolerance
  • Evaluation
SLIDE 31

Multi-Region Structure

Divide large deployments into location-based regions

SLIDE 32

Multi-Region Structure

One region leader per region, plus global leader

SLIDE 33

Multi-Region Structure

Region leaders aggregate membership changes from region

SLIDE 34

Multi-Region Structure

Region leaders aggregate membership changes from region

SLIDE 35

Multi-Region Structure

Global leader combines region reports to produce item

SLIDE 36

Region Dynamics

  • Regions split when they grow too large: global leader signals split in the next item; nodes independently split region across widest axis using consistent membership knowledge
  • Regions merge when one grows too small (similar process)
  • Nodes assigned to nearest region on joining

SLIDE 37

Multi-Region Structure

Benefits:
  • fewer messages processed by leader
  • fewer wide-area communications
  • cheaper multicast tree computation
  • useful abstraction for applications

SLIDE 38

Partial Knowledge

  • Maintaining global membership knowledge is usually feasible
  • Except: very large, dynamic, and/or bandwidth-constrained systems
  • Partial knowledge: each node knows only the membership of its own region and summary information of other regions

SLIDE 39

Outline

  • Overview
  • Basic Approach
  • Multicast Mechanism
  • Multi-region Design
  • Fault Tolerance
  • Evaluation
SLIDE 40

Fault Tolerance

  • Global leader and region leaders can fail
  • Solution: replication, using standard state machine replication techniques
  • Replication level based on expected concurrent failures
  • Optional: tolerating Byzantine faults
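The required replication level follows the standard state-machine-replication bounds; an illustrative helper:

```python
def replicas_needed(f: int, byzantine: bool = False) -> int:
    # Standard SMR bounds: 2f+1 replicas tolerate f crash failures;
    # 3f+1 are needed to tolerate f Byzantine faults.
    return 3 * f + 1 if byzantine else 2 * f + 1
```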

SLIDE 41

Outline

  • Overview
  • Basic Approach
  • Multicast Mechanism
  • Multi-region Design
  • Fault Tolerance
  • Evaluation
SLIDE 42

Evaluation

  • PlanetLab deployment: 614 nodes
  • Theoretical analysis: scalability to larger systems
  • Simulator: evaluate multicast performance

SLIDE 43

PlanetLab Deployment

614 nodes; 30 second epochs; 1 KB/epoch multicast

[Figures: mean bandwidth per node (KB/s) and reported membership (nodes) vs. time (epochs), with 10% and 25% of nodes failed; panels: bandwidth usage, multicast data size]

SLIDE 44

Bandwidth Overhead

  • Membership management cost analysis
  • Very high churn rate (avg. node lifetime 30 minutes)

[Figure: bandwidth overhead (KB/s) vs. number of nodes, for the global leader, region leaders, and regular nodes; panels: multiple regions, partial knowledge]

SLIDE 45

Multicast Reliability

  • Fraction of nodes successfully receiving multicast
  • Simulation results (10,000 nodes)

[Figure: success rate vs. fraction of bad nodes, for 8/16 coding (data), 12/16 coding (data), and 16 trees (membership)]

SLIDE 46

Multicast Performance

  • Stretch: multicast latency / ideal (unicast) latency
  • 1740-node measurement-derived topology

[Figure: cumulative fraction of nodes vs. stretch]

SLIDE 47

Conclusion

  • Census: a platform for membership management and communication in large distributed systems
  • Provides consistent views while scaling to extreme sizes, supporting future wide-scale distributed applications
  • Builds on an efficient multicast mechanism: high reliability, low latency, low bandwidth overhead
  • Exploits consistent knowledge: high performance while avoiding complexity

SLIDE 48


Thank you. Questions?