SLIDE 1

Minimizing Churn in Distributed Systems

Brighten Godfrey, Scott Shenker, Ion Stoica. SIGCOMM 2006.

SLIDE 2

introduction

Churn: an important factor for most distributed systems
  • Turnover causes dropped requests, increased bandwidth, ...
  • Would like to optimize for stability: select which nodes to use

Can’t prevent a node from failing, but we can select which nodes to use

SLIDE 3

introduction

Past work uses heuristics for specific systems. Our goal: a general study of minimizing churn.
  • How can we select nodes to minimize churn?
  • Can we characterize how existing systems select nodes, and the impact on their performance?

...applicable to a wide range of systems

SLIDE 4

contents

  • an example system
  • evaluation of node selection strategies (how can we minimize churn?)
  • applications (how do existing systems select nodes?)
  • conclusions

SLIDE 5

example: overlay multicast

Join:
  • Consider m random nodes with # children < max
  • Pick one as parent to minimize latency to root

[Figure: multicast tree; when a node (X) fails, delivery to its descendants is interrupted]
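The join rule above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code; `Node`, `choose_parent`, and `max_children` are assumed names.

```python
import random
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    num_children: int
    latency_to_root: float  # ms, assumed precomputed

def choose_parent(nodes, m, max_children=4):
    # consider m random nodes with # children < max...
    candidates = [x for x in nodes if x.num_children < max_children]
    sample = random.sample(candidates, min(m, len(candidates)))
    # ...and pick the one that minimizes latency to the root
    return min(sample, key=lambda x: x.latency_to_root)
```

Note that m = 1 degenerates to picking a random parent, while large m approaches a latency-ordered preference list.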

SLIDE 6

example: overlay multicast

[Figure: latency to root (ms) vs. nodes considered when picking parent (m)]

SLIDE 7

example: overlay multicast

[Figure: latency to root (ms) and interruptions per node per day vs. nodes considered when picking parent (m); interruptions rise by +86% as m grows]

SLIDE 8

example: overlay multicast

In terms of interruption rate, Random Replacement of parent (m = 1) is better than Preference List selection (large m). Why?

SLIDE 9

contents

  • an example system
  • evaluation of node selection strategies (how can we minimize churn?)
  • applications (how do existing systems select nodes?)
  • conclusions

SLIDE 10

the core problem

Node selection task:
  • n nodes available
  • pick k to be “in use”
  • when one fails, pick a replacement
Minimize churn: the rate of change in the set of in-use nodes

SLIDE 11

defining churn

For each node, track its state (in use, available, down); on each join, leave, or failure event affecting the in-use set, churn += 1/k, where k = # of nodes in use. Then divide by runtime.

Intuition: when a node joins or leaves a DHT, 1/k of stored objects change ownership.

SLIDE 12

node selection strategies

Predictive:
  • Longest uptime
  • Most available
  • Max expectation
  • ...
Agnostic:
  • Random Replacement
  • Preference List

SLIDE 13

agnostic selection strategies

  • Random Replacement: select a random available node to replace the failed node
  • Passive Preference List: rank nodes (e.g. by latency); select the most preferred as replacement
  • Active Preference List: ...and also switch to more preferred nodes when they join

Pref List is: (1) essentially static across time (2) essentially unrelated to churn
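The two agnostic replacement rules differ only in how the replacement is chosen; a minimal sketch (hypothetical helper names, not the paper's code):

```python
import random

def random_replacement(available):
    # RR: any available node, uniformly at random
    return random.choice(list(available))

def preference_list(available, rank):
    # PL: always the most preferred available node
    # (rank maps node -> score, e.g. measured latency; lower is better)
    return min(available, key=rank)
```

Active PL additionally re-runs `preference_list` whenever a more preferred node joins, displacing a current node; Passive PL only runs it when a node fails.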

SLIDE 14

evaluation

Churn relative to the predictive strategies (Longest Uptime, Max Expectation):
  • Passive PL: 1.2-3×
  • Active PL: 2.5-8×
  • Random Replacement: 1.2-2.2×

Why such a difference between RR and PL, even though neither uses history?

SLIDE 15

evaluation

5 traces of node availability:
  • PlanetLab [Stribling 2004-05]
  • Web sites [Bakkaloglu et al 2002]
  • Microsoft PCs [Bolosky et al 2000]
  • Skype superpeers [Guha et al 2006]
  • Gnutella peers [Saroiu et al 2002]
Main conclusions held in all cases.

SLIDE 16

evaluation: PlanetLab trace

[Figure: churn (turnover per day) vs. fraction of nodes in use, for Active PL, Passive PL, RR, and Max Expectation]

SLIDE 17

intuition: PL

PL uses the top k nodes in the preference list.
The preference list is unrelated to stability, so the failure rate of in-use nodes is about the mean node failure rate.
(This becomes more and more true for Passive PL as k increases.)

SLIDE 18

intuition: RR

RR is like picking a node at a random time.
Long sessions occupy more time (trivially), so RR is biased towards landing in longer sessions.
The failure rate can be arbitrarily lower than the mean, an example of the classic “inspection paradox”, but it depends on the session time distribution (session = time between 2 failures).

[Figure: timeline with failures marked X; a randomly selected time tends to land in a long session, giving a long time to failure (TTF)]
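The inspection paradox is easy to reproduce in simulation. This is an illustrative experiment, not from the paper; the Pareto session-time distribution is an assumption chosen for its skew. A uniformly random instant lands in a session with probability proportional to its length, so the observed session is longer on average than a typical one.

```python
import random

random.seed(1)
# skewed session lengths (Pareto, shape 1.5: finite mean, heavy tail)
sessions = [random.paretovariate(1.5) for _ in range(100_000)]
mean_session = sum(sessions) / len(sessions)

# picking a random *instant* samples sessions length-biased:
observed = random.choices(sessions, weights=sessions, k=100_000)
mean_observed = sum(observed) / len(observed)

# mean_observed substantially exceeds mean_session: a node picked at a
# random time tends to sit inside a long (stable) session
```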

SLIDE 19

RR vs. PL: analysis

Churn of RR decreases as session time distributions become “more skewed” (higher variance).
RR can never have more than 2× the churn of PL strategies.

$$E[C] \;=\; \frac{2}{\alpha d} \sum_{i=1}^{d} \frac{1}{\mu_i}\left(1 - E\!\left[\exp\!\left(-\frac{\alpha}{2(1-\alpha)\,E[C]}\, L_i\right)\right]\right)$$

(an implicit equation: E[C] appears on both sides)

SLIDE 20

contents

  • an example system
  • evaluation of node selection strategies (how can we minimize churn?)
  • applications (how do existing systems select nodes?)
  • conclusions

SLIDE 21

applications of RR & PL

  • anycast
  • DHT replica placement
  • overlay multicast
  • DHT neighbor selection

SLIDE 22

overlay multicast

[Figure (as on slide 7): latency to root (ms) and interruptions per node per day vs. nodes considered when picking parent (m)]

Two separate effects of increasing m:
(1) the tree becomes more balanced (small decrease in interruptions)
(2) we move from an RR-like to a PL-like strategy (big increase)

SLIDE 23

a peek inside the tree

[Figure: failures per day vs. depth in tree, for m = 1 (random selection) and m = n (latency-optimized)]

SLIDE 24

overlay multicast notes

Basic framework from [Sripanidkulchai et al SIGCOMM’04], which found random parent selection surprisingly good.
Tested 2 other heuristics to minimize interruptions: both can perform better with some randomization!

SLIDE 25

DHT neighbor selection

Standard Chord topology: an Active PL strategy for selecting each finger. The Preference List arises accidentally.

SLIDE 26

DHT neighbor selection

Randomized topology:
  • Divide the keyspace into intervals of size 1/2, 1/4, 1/8, ...
  • Each finger points to a random key within its interval
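A sketch of the randomized finger choice (hypothetical code, not the i3 Chord codebase; `RING` and the function name are assumed): finger i is aimed at a uniformly random key in the interval [n + 2^i, n + 2^(i+1)) rather than at exactly n + 2^i, turning the implicit Active PL into RR-like selection.

```python
import random

RING = 2 ** 16  # keyspace size, assumed for illustration

def randomized_finger_targets(n, bits=16):
    targets = []
    for i in range(bits - 1):
        base = (n + 2 ** i) % RING
        # the i-th interval [n + 2^i, n + 2^(i+1)) has length 2^i
        targets.append((base + random.randrange(2 ** i)) % RING)
    return targets
```

Each target key then resolves to the node owning it, exactly as a standard Chord finger would.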

SLIDE 27

DHT neighbor selection

Datagram-level simulation, i3 Chord codebase, Gnutella trace

easy 29% reduction at n = 850

[Figure: fraction of requests failed vs. average number of nodes in DHT, for the standard and randomized Chord topologies]

SLIDE 28

contents

  • an example system
  • evaluation of node selection strategies (how can we minimize churn?)
  • applications (how do existing systems select nodes?)
  • conclusions

SLIDE 29

conclusions

A guide to minimizing churn:
  • RR is pretty good; PL is much worse
  • RR and PL arise in many systems
Design insights:
  • watch out for (implicit) PL strategies
  • easy way to reduce churn: add some randomness

doing less work may improve performance!

SLIDE 30

backup slides

SLIDE 31

Why use RR?

  • Simplicity: no need to monitor and disseminate failure data
  • Robustness to self-interested peers
  • Legacy systems
