Minimizing Churn in Distributed Systems
Brighten Godfrey Scott Shenker Ion Stoica SIGCOMM 2006
Introduction

Churn is an important factor for most distributed systems: turnover causes dropped requests, increased bandwidth use, and more. We would like to optimize for stability.
We can't prevent a node from failing, but we can select which nodes to use.
Past work uses heuristics tailored to specific systems. Our goal is a general study of minimizing churn, applicable to a wide range of systems:
- How can we select nodes to minimize churn?
- Can we characterize how existing systems select nodes, and the impact on their performance?
(how can we minimize churn?)
(how do existing systems select nodes?)
Example: an overlay multicast tree
Join rule: pick a parent with # children < max, minimizing latency to the root.
When an interior node fails, its descendants experience an interruption.
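The join rule above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the dict fields, and the sampling of m candidates are all assumptions.

```python
import random

def pick_parent(candidates, m, max_children):
    """Join rule sketch: sample m random candidate nodes, keep those
    with a free child slot, and return the one with the lowest
    latency to the root. Returns None if no sampled candidate
    has room."""
    sample = random.sample(candidates, min(m, len(candidates)))
    eligible = [c for c in sample if c["num_children"] < max_children]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["latency_to_root"])

# Hypothetical candidates; with m covering all of them, the
# lowest-latency eligible node wins.
nodes = [{"latency_to_root": lat, "num_children": 0}
         for lat in (120, 40, 300)]
parent = pick_parent(nodes, m=3, max_children=2)
```

Note the role of m: m = 1 degenerates to random parent selection, while large m approaches a latency-optimized, preference-list-like choice.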
[Figure: latency to root (ms), 400-1600, vs. number of nodes considered when picking parent (m), m = 1 to 256]
[Figure: latency to root (ms) and interruptions per node per day vs. nodes considered when picking parent (m); as m grows from 1 to 256, latency drops but the interruption rate rises by 86%]
In terms of interruption rate, Random Replacement (m = 1) is better than Preference List selection (large m). Why?
(how can we minimize churn?)
(how do existing systems select nodes?)
Node selection task:
- n nodes are available; pick k to be "in use"
- when an in-use node fails, pick a replacement
- minimize churn: the rate of change in the set of in-use nodes
Churn metric: each node is in use, down, or available. For each join, leave, or fail event, churn += 1/k, where k = # of nodes in use; then divide by runtime.
Intuition: when a node joins or leaves a DHT, 1/k of the stored objects change ownership.
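A minimal sketch of this accounting, assuming a flat list of membership events; the helper name and the event encoding are hypothetical.

```python
def churn_rate(events, k, runtime):
    """Churn metric sketch: each join, leave, or fail event adds 1/k
    to the churn counter, where k is the number of nodes in use;
    normalizing by runtime gives turnover per unit time."""
    membership_events = [e for e in events if e in ("join", "leave", "fail")]
    return len(membership_events) / k / runtime

# 8 membership events among k = 4 in-use nodes over 2 days
rate = churn_rate(["join", "fail", "join", "leave"] * 2, k=4, runtime=2.0)
# rate == 1.0 turnover per day
```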
Replacement strategies fall into two classes: predictive (use observed failure history) and agnostic (do not).
Agnostic strategies:
- Random Replacement: select a random available node to replace a failed node.
- Passive Preference List: rank nodes (e.g. by latency) and select the most preferred available node as the replacement.
- Active Preference List: as above, and also switch to more preferred nodes when they join.
The preference list is (1) essentially static across time and (2) essentially unrelated to churn.
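The three agnostic strategies can be sketched as small selection functions. This is an illustration only; the function names and the latency table are hypothetical.

```python
import random

def random_replacement(available, rng=random):
    """Random Replacement: a uniformly random available node."""
    return rng.choice(sorted(available))

def passive_pl(available, rank):
    """Passive Preference List: the most preferred available node
    (lowest rank value, e.g. lowest latency)."""
    return min(available, key=rank)

def active_pl_on_join(in_use, newcomer, rank):
    """Active Preference List: when a node joins, also evict the
    least preferred in-use node if the newcomer outranks it.
    Returns (evicted, admitted), or (None, None) if no switch."""
    worst = max(in_use, key=rank)
    if rank(newcomer) < rank(worst):
        return worst, newcomer
    return None, None

# Hypothetical latencies used as the preference ranking
latency = {"a": 30, "b": 120, "c": 45, "d": 200}
```

For example, with nodes b, c, d available, Passive PL picks c (45 ms); if a (30 ms) joins while b and d are in use, Active PL evicts d for a.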
Churn relative to the predictive strategies (Longest Uptime, Max Expectation):
- Passive PL: 1.2-3×
- Active PL: 2.5-8×
- Random Replacement: 1.2-2.2×
Why such a difference between RR and the PL strategies, even though neither uses history?
Evaluated on 5 traces of node availability:
- PlanetLab [Stribling 2004-05]
- Web sites [Bakkaloglu et al 2002]
- Microsoft PCs [Bolosky et al 2000]
- Skype superpeers [Guha et al 2006]
- Gnutella peers [Saroiu et al 2002]
The main conclusions held in all cases.
[Figure: churn (turnover per day, 0.01-1) vs. fraction of nodes in use (0.2-1), for Active PL, Passive PL, RR, and Max Expectation]
Why the PL strategies do poorly: they use the top k nodes in the preference list, but the preference list is unrelated to stability, so the failure rate of the selected nodes is about the mean node failure rate. This becomes more and more true for Passive PL as k increases.
Why RR does well: RR is like picking a node at a random point in time. Long sessions trivially occupy more of the timeline, so RR is biased toward landing in longer sessions, and the failure rate of the selected nodes can be arbitrarily lower than the mean. This is an instance of the classic "inspection paradox", though how much it helps depends on the session-time distribution (a session is the time between two failures).
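The inspection paradox is easy to demonstrate by simulation. A quick sketch under an assumed skewed session-length distribution; the function name and all numbers are illustrative.

```python
import random

def mean_observed_session(sessions, trials=20000, seed=1):
    """Lay the sessions end to end, sample uniformly random instants,
    and report the mean length of the session each instant lands in.
    Long sessions cover more of the timeline, so this time-weighted
    mean exceeds the plain mean of the session lengths."""
    rng = random.Random(seed)
    total = sum(sessions)
    observed = []
    for _ in range(trials):
        x = rng.uniform(0, total)
        t = 0.0
        for s in sessions:
            if x < t + s:
                observed.append(s)
                break
            t += s
    return sum(observed) / len(observed)

# Skewed distribution: nine 1-hour sessions and one 91-hour session.
# The plain mean is 10.0, but a random instant almost always falls
# inside the long session, so the observed mean is near 82.9
# (= sum(s^2) / sum(s)).
sessions = [1.0] * 9 + [91.0]
plain_mean = sum(sessions) / len(sessions)
biased_mean = mean_observed_session(sessions)
```

This is exactly RR's advantage: picking a node at a random instant is picking a session with probability proportional to its length.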
The churn of RR decreases as the session-time distribution becomes "more skewed" (higher variance). RR can never have more than 2× the churn of the PL strategies.
[Analysis slide: closed-form bounds on the expected churn E[C] in terms of α, d, the mean session times 1/µi, and the lengths Li]
(how can we minimize churn?)
(how do existing systems select nodes?)
Applications:
- anycast
- DHT replica placement
- DHT neighbor selection
[Figure repeated: latency to root (ms) and interruptions per node per day vs. nodes considered when picking parent (m)]
There are two separate effects of increasing m: (1) the tree becomes more balanced (a small decrease in interruptions), and (2) the system moves from an RR-like strategy to a PL-like strategy (a big increase).
[Figure: failures per day (0.5-2.5) vs. depth in tree (5-20), for m = 1 (random selection) and m = n (latency-optimized)]
The basic framework is from [Sripanidkulchai et al SIGCOMM '04], which found random parent selection surprisingly good. We tested 2 other heuristics for minimizing interruptions; both can perform better with some randomization!
Standard Chord topology: an Active PL strategy is used for selecting each finger, so a preference list arises accidentally.
Randomized topology: divide the keyspace into intervals of size 1/2, 1/4, 1/8, ...; each finger points to a node at a random key within its interval.
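The two finger-target rules can be sketched on a small hypothetical ring. The names and the 16-bit identifier space are illustrative assumptions, not taken from the i3 codebase.

```python
import random

M = 16                 # identifier bits (hypothetical small ring)
RING = 2 ** M

def standard_fingers(node_id):
    """Standard Chord: finger i targets exactly node_id + 2^i, so the
    owner of each finger is fully determined by the key layout; this
    is an implicit Active Preference List."""
    return [(node_id + 2 ** i) % RING for i in range(M)]

def randomized_fingers(node_id, seed=0):
    """Randomized variant: finger i targets a uniformly random key in
    the interval [node_id + 2^i, node_id + 2^(i+1)), keeping the same
    halving of the keyspace but leaving the exact target free."""
    rng = random.Random(seed)
    return [(node_id + rng.randrange(2 ** i, 2 ** (i + 1))) % RING
            for i in range(M)]
```

Routing stays O(log n) hops because the interval sizes are unchanged; churn drops because any node in an interval is an acceptable finger, which removes the accidental preference list.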
Datagram-level simulation using the i3 Chord codebase and a Gnutella trace: an easy 29% reduction in failed requests at n = 850.
[Figure: fraction of requests failed (0.005-0.015) vs. average number of nodes in the DHT (32-1024), for the standard and randomized Chord topologies]
(how can we minimize churn?)
(how do existing systems select nodes?)
Conclusions: a guide to minimizing churn
- RR is pretty good; PL is much worse.
- RR and PL arise in many systems.
Design insights:
- Watch out for (implicit) PL strategies.
- An easy way to reduce churn is to add some randomness; doing less work may improve performance!
Why agnostic strategies?
- Simplicity: no need to monitor and disseminate failure data.
- Robustness to self-interested peers.
- Applicability to legacy systems.