
Minimizing Churn in Distributed Systems



  1. Minimizing Churn in Distributed Systems. Brighten Godfrey, Scott Shenker, Ion Stoica. SIGCOMM 2006.

  2. introduction: Churn is an important factor for most distributed systems. Turnover causes dropped requests, increased bandwidth, ... We would like to optimize for stability. We can't prevent a node from failing, but we can select which nodes to use.

  3. introduction: Past work uses heuristics for specific systems. Our goal: a general study of minimizing churn, applicable to a wide range of systems. How can we select nodes to minimize churn? Can we characterize how existing systems select nodes, and the impact on their performance?

  4. contents: • an example system • evaluation of node selection strategies (how can we minimize churn?) • applications (how do existing systems select nodes?) • conclusions

  5. example: overlay multicast. Join: • Consider m random nodes with # children < max • Pick one as parent to minimize latency to root. [Figure: multicast tree with labels "root", "X", and "Interruption".]
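
The join rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulator: the node fields (`children`, `latency_to_root`) and the function name `pick_parent` are hypothetical, and using each candidate's own latency to the root as the selection key is an assumption.

```python
import random

def pick_parent(candidates, m, max_children, rng=random):
    """Sample up to m eligible nodes and return the lowest-latency one.

    candidates: list of dicts with hypothetical keys
    'id', 'children', and 'latency_to_root' (ms).
    Returns None when no node has spare capacity.
    """
    # Only nodes with fewer than max_children children can accept a child.
    eligible = [c for c in candidates if c["children"] < max_children]
    if not eligible:
        return None
    # Consider m random eligible nodes (fewer if not enough exist).
    sample = rng.sample(eligible, min(m, len(eligible)))
    # Among those considered, minimize latency to the root.
    return min(sample, key=lambda c: c["latency_to_root"])
```

With m = 1 the choice is purely random; as m grows toward the number of nodes, the choice becomes a fixed latency ranking.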

  6. example: overlay multicast. [Plot: latency to root (ms) vs. nodes considered when picking parent (m), for m from 1 to 256.]

  7. example: overlay multicast. [Plot: latency to root (ms) and interruptions per node per day vs. nodes considered when picking parent (m), for m from 1 to 256; annotation: +86%.]

  8. example: overlay multicast. In terms of interruption rate, Random Replacement of parent (m = 1) is better than Preference List selection (large m). Why?

  9. contents: • an example system • evaluation of node selection strategies (how can we minimize churn?) • applications (how do existing systems select nodes?) • conclusions

  10. the core problem: Node selection task: n nodes available; pick k to be "in use"; when one fails, pick a replacement. Minimize churn: the rate of change in the set of in-use nodes.

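  11. defining churn: For each node, churn += 1/k on each transition into or out of the in-use set (join, leave, fail), where k = # of nodes in use ...then divide by runtime. [State diagram: node states "down", "available", "in use"; transitions "fail", "leave", "join".] Intuition: when a node joins or leaves a DHT, 1/k of stored objects change ownership.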
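
As a rough sketch of this definition, the function below accumulates 1/k for every change to the in-use set and divides by the runtime. The event format and the name `churn` are hypothetical, and taking k as the larger of the set sizes before and after the event is an assumption made here so that the first join is well defined.

```python
def churn(events, runtime):
    """Churn of the in-use set over a run of length `runtime`.

    events: ordered list of (node, kind) pairs, kind in
    {"join", "leave", "fail"} (a hypothetical trace format).
    """
    in_use = set()
    total = 0.0
    for node, kind in events:
        before = len(in_use)
        if kind == "join":
            in_use.add(node)
        elif kind in ("leave", "fail"):
            in_use.discard(node)
        else:
            raise ValueError(f"unknown event kind: {kind}")
        after = len(in_use)
        if after != before:
            # Assumption: k is the larger of the sizes around the event.
            total += 1.0 / max(before, after)
    return total / runtime
```

For example, the trace join(a), join(b), fail(a), join(c) over a runtime of 2 contributes 1 + 1/2 + 1/2 + 1/2 = 2.5, giving a churn of 1.25 per unit time.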

  12. node selection strategies: Predictive: • Longest uptime • Max expectation • Most available • ... Agnostic: • Random Replacement • Preference List

  13. agnostic selection strategies: Random Replacement: select a random available node to replace a failed node. Passive Preference List: rank nodes (e.g. by latency); select the most preferred as replacement. Active Preference List: ...and switch to more preferred nodes when they join. The Pref List is: (1) essentially static across time, (2) essentially unrelated to churn.
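
The three agnostic strategies can be written as small selection functions. This is a sketch under assumptions: the function names, the set-based representation of nodes, and the `rank` callable (lower value = more preferred, e.g. latency) are all hypothetical.

```python
import random

def random_replacement(available, rng=random):
    """RR: replace a failed node with a uniformly random available node."""
    return rng.choice(sorted(available))

def passive_preference_list(available, rank):
    """Passive PL: replace a failed node with the most preferred available node."""
    return min(available, key=rank)

def active_preference_list(in_use, available, rank, k):
    """Active PL: always hold the k most preferred nodes that are up.

    Returns the new in-use set, which may evict a current node as soon
    as a more preferred node joins.
    """
    return set(sorted(in_use | available, key=rank)[:k])
```

Note that none of these functions looks at failure history; they differ only in how the replacement is picked.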

  14. evaluation: [Figure: strategies ordered by churn, highest first: Active PL, Passive PL, Random Replacement, Longest Uptime / Max Expectation; gaps annotated 2.5-8×, 1.2-3×, and 1.2-2.2×.] Why such a difference, even though neither uses history?

  15. evaluation: 5 traces of node availability: • PlanetLab [Stribling 2004-05] • Web sites [Bakkaloglu et al 2002] • Microsoft PCs [Bolosky et al 2000] • Skype superpeers [Guha et al 2006] • Gnutella peers [Saroiu et al 2002]. Main conclusions held in all cases.

  16. evaluation: PlanetLab trace. [Plot: churn (turnover per day) vs. fraction of nodes in use (0.01 to 1), for Active PL, Passive PL, RR, and Max Exp.]

  17. intuition: PL. PL uses the top k nodes in the preference list (this becomes more and more true for Passive as k increases). The preference list is unrelated to stability, so the failure rate is about the mean node failure rate.

  18. intuition: RR. An example of the classic "inspection paradox": RR is like picking a node at a random time. [Timeline figure: failures marked X along a time axis; a session is the time between 2 failures; the selected node's time to failure (TTF) is marked.] Long sessions occupy more time (trivially), so RR is biased towards landing in longer sessions. The failure rate can be arbitrarily lower than the mean, but it depends on the session time distribution.
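
The length bias described on this slide is easy to check numerically. The sketch below is only an illustration of the sampling effect, not the paper's analysis; the function names and the example session lengths are made up. Picking a random instant on a timeline of back-to-back sessions lands in a session with probability proportional to its length, so the expected length of the session hit is sum(L^2) / sum(L) rather than the plain mean.

```python
import random

def plain_mean(sessions):
    """Average session length, each session counted once."""
    return sum(sessions) / len(sessions)

def length_biased_mean(sessions):
    """Expected length of the session containing a uniformly random instant."""
    return sum(s * s for s in sessions) / sum(sessions)

def simulated_biased_mean(sessions, samples, rng):
    """Monte Carlo version: land in each session with probability ~ its length."""
    hits = rng.choices(sessions, weights=sessions, k=samples)
    return sum(hits) / samples

# Hypothetical node: four short sessions and one long one.
sessions = [1, 1, 1, 1, 16]
ordinary = plain_mean(sessions)         # (4 + 16) / 5
biased = length_biased_mean(sessions)   # (4 + 256) / 20
```

The more skewed the session lengths, the larger the gap between the two means, which matches the slide's remark that the effect depends on the session time distribution.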

  19. RR vs. PL: analysis. $E[C] = \frac{2}{\alpha d} \sum_{i=1}^{d} \frac{1}{\mu_i} \left( 1 - E\left[ \exp\left( -\frac{\alpha}{2(1-\alpha)} \, E[C] \cdot L_i \right) \right] \right)$ Churn of RR decreases as session time distributions become "more skewed" (=> higher variance). RR can never have more than 2x the churn of PL strategies.

  20. contents: • an example system • evaluation of node selection strategies (how can we minimize churn?) • applications (how do existing systems select nodes?) • conclusions

  21. applications of RR & PL: • anycast • DHT replica placement • overlay multicast • DHT neighbor selection

  22. overlay multicast. [Plot: latency to root (ms) and interruptions per node per day vs. nodes considered when picking parent (m), for m from 1 to 256.] Two separate effects of increasing m: (1) the tree becomes more balanced (small decrease in interruptions); (2) a move from an RR-like to a PL-like strategy (big increase).

  23. a peek inside the tree. [Plots: failures per day vs. depth in tree, for m = 1 (random selection) and m = n (latency-optimized).]

  24. overlay multicast notes: Basic framework from [Sripanidkulchai et al SIGCOMM'04]. • Found random parent selection surprisingly good • Tested 2 other heuristics to minimize interruptions. Both can perform better with some randomization!

  25. DHT neighbor selection. Standard Chord topology: Active PL strategy for selecting each finger. The Preference List arises accidentally.

  26. DHT neighbor selection. Randomized topology: divide the keyspace into 1/2, 1/4, 1/8, ... Each finger points to a random key within its interval.
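
A minimal sketch of the two finger-placement rules, assuming a ring of 2^bits keys and measuring intervals by clockwise distance from the node. The function names are hypothetical, and this is not taken from the i3 Chord codebase. Interval j covers distances [2^j, 2^(j+1)), so the largest interval is half the keyspace, the next a quarter, and so on.

```python
import random

def standard_finger_keys(node_id, bits):
    """Standard Chord: finger j targets the key at distance exactly 2^j."""
    size = 1 << bits
    return [(node_id + (1 << j)) % size for j in range(bits)]

def randomized_finger_keys(node_id, bits, rng=random):
    """Randomized variant: finger j targets a random key at distance
    in [2^j, 2^(j+1)) from node_id."""
    size = 1 << bits
    return [
        (node_id + rng.randrange(1 << j, 1 << (j + 1))) % size
        for j in range(bits)
    ]
```

In both variants the finger would then be the node responsible for the target key; only the choice of target within each interval differs.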

  27. DHT neighbor selection. Datagram-level simulation, i3 Chord codebase, Gnutella trace. [Plot: fraction of requests failed vs. average number of nodes in DHT (32 to 1024), for the standard and randomized Chord topologies.] Easy 29% reduction at n = 850.

  28. contents: • an example system • evaluation of node selection strategies (how can we minimize churn?) • applications (how do existing systems select nodes?) • conclusions

  29. conclusions. A guide to minimizing churn: • RR is pretty good; PL much worse • RR and PL arise in many systems. Design insights: • watch out for (implicit) PL strategies • easy way to reduce churn: add some randomness • doing less work may improve performance!

  30. backup slides

  31. Why use RR? • Simplicity: no need to monitor and disseminate failure data • Robustness to self-interested peers • Legacy systems
