
SLIDE 1

CS 3700
Networks and Distributed Systems
Overlay Networks (P2P DHT via KBR FTW)
Revised 10/26/2016

SLIDE 2

Outline

❑ Consistent Hashing
❑ Structured Overlays / DHTs

SLIDE 3

Key/Value Storage Service

Imagine a simple service that stores key/value pairs
  Similar to memcached or redis
  put(“christo”, “abc…”)   get(“christo”) → “abc…”

One server is probably fine as long as total pairs < 1M
How do we scale the service as the number of pairs grows?
  Add more servers and distribute the data across them
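
To make the put/get interface concrete, here is a minimal single-server sketch in Python (an illustrative in-memory stand-in, not memcached or redis themselves):

# Minimal single-server key/value store: a thin wrapper around a dict.
class KVStore:
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

store = KVStore()
store.put("christo", "abc...")
print(store.get("christo"))  # -> "abc..."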

SLIDE 4

Mapping Keys to Servers

Problem: how do you map keys to servers?

[Figure: pairs <“key1”, “value1”>, <“key2”, “value2”>, <“key3”, “value3”> with an unknown (“?”) mapping onto a set of servers]

Keep in mind, the number of servers may change (e.g. we could add a new server, or a server could crash)

SLIDE 5

Hash Tables

hash(key) % n → array index

[Figure: an array of length n; each pair <“key1”, “value1”>, <“key2”, “value2”>, <“key3”, “value3”> is hashed to an index and stored in that slot]
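
A quick sketch of this mapping in Python (purely illustrative; Python’s built-in dict already does this internally, and collisions are ignored here):

# Map keys into a fixed-size array of buckets using hash(key) % n.
n = 8
buckets = [None] * n  # the array of length n

for key, value in [("key1", "value1"), ("key2", "value2"), ("key3", "value3")]:
    index = hash(key) % n      # hash(key) % n -> array index
    buckets[index] = (key, value)

print(buckets)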

SLIDE 6

(Bad) Distributed Key/Value Service

hash(str) % n → array index

[Figure: an array of length n holding the IP addresses of nodes A–D; keys k1, k2, k3 each hash to an index and are stored on the corresponding node. When node E is added, the array length becomes n + 1 and the keys are rehashed.]

• Number of servers (n) will change
• Need a “deterministic” mapping
• As few changes as possible when machines join/leave
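
Why is this “bad”? The sketch below (hypothetical server names; SHA-1 used only to get a stable hash) counts how many keys move when one server is added under modulo hashing:

import hashlib

def stable_hash(s):
    # Deterministic integer hash of a string (Python's built-in hash() is
    # randomized per process, so use SHA-1 for repeatability).
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

def owner(key, servers):
    # Modulo mapping: hash the key and index into the server list.
    return servers[stable_hash(key) % len(servers)]

servers = ["A", "B", "C", "D"]
keys = [f"key{i}" for i in range(1000)]

before = {k: owner(k, servers) for k in keys}
after = {k: owner(k, servers + ["E"]) for k in keys}   # add one server

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved")  # typically ~80%, far more than the ideal 1/(n+1)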

SLIDE 7

Consistent Hashing

Alternative hashing algorithm with many beneficial characteristics
1. Deterministic (just like normal hashing algorithms)
2. Balanced: given n servers, each server should get roughly 1/n keys
3. Locality sensitive: if a server is added, only 1/(n+1) keys need to be moved

Conceptually simple
  Imagine a circular number line from 0 → 1
  Place the servers at random locations on the number line
  Hash each key and place it at the next server on the number line
  ■ Move around the circle clockwise to find the next server
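
A minimal sketch of the idea in Python (ring positions in [0, 1), SHA-1 as the hash; server and key names are illustrative):

import hashlib

def ring_position(name):
    # Hash a string to a point on the circular number line [0, 1).
    h = int(hashlib.sha1(name.encode()).hexdigest(), 16)
    return h / 2**160

servers = {s: ring_position(s) for s in ["server A", "server B", "server C", "server D"]}

def owner(key):
    # Walk clockwise from the key's position to the next server on the ring,
    # wrapping back to the lowest-positioned server if we pass 1.0.
    pos = ring_position(key)
    after = [s for s, p in servers.items() if p >= pos]
    if after:
        return min(after, key=lambda s: servers[s])
    return min(servers, key=lambda s: servers[s])

print(owner("key1"))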

SLIDE 8

Consistent Hashing Example

(hash(str) % 256) / 256 → ring location

[Figure: “server A”–“server D” hashed onto the 0–1 ring; keys k1, k2, k3 hashed and assigned to the next server clockwise. When “server E” is added, only the keys that fall between E and its predecessor move to E.]

SLIDE 9

Practical Implementation

In practice, no need to implement complicated number lines
Store a list of servers, sorted by their hash (floats from 0 → 1)
To put() or get() a pair, hash the key and search through the list for the first server where hash(server) >= hash(key)
  O(log n) search time if we keep the list in a sorted data structure (binary search over a sorted array, or a balanced tree; a plain heap cannot answer this query)
  O(log n) time to insert a new server into the list
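
A sketch of this practical version using Python’s bisect module over a sorted list of (server hash, server name) pairs (server names are made up for illustration):

import hashlib
from bisect import bisect_left, insort

def h(name):
    # Map a string to a float in [0, 1).
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) / 2**160

ring = []  # sorted list of (position, server) tuples

def add_server(server):
    insort(ring, (h(server), server))   # O(log n) comparisons to find the slot

def lookup(key):
    # First server whose hash is >= hash(key), wrapping around the ring.
    i = bisect_left(ring, (h(key),))
    return ring[i % len(ring)][1]

for s in ["serverA", "serverB", "serverC", "serverD"]:
    add_server(s)
print(lookup("christo"))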

SLIDE 10

Improvements to Consistent Hashing

Problem: hashing may not result in perfect balance (1/n items per server)
Solution: balance the load by hashing each server multiple times
  consistent_hash(“serverA_1”) = …
  consistent_hash(“serverA_2”) = …
  consistent_hash(“serverA_3”) = …

Problem: if a server fails, data may be lost
Solution: replicate key/value pairs on multiple servers
  consistent_hash(“key1”) = 0.4

[Figure: the 0–1 ring with servers A and B each placed at several positions (virtual nodes); key1 at 0.4 is stored on more than one server for redundancy]
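
A sketch of both improvements together (VNODES and REPLICAS are illustrative parameter names; this builds on the ring helpers above):

import hashlib
from bisect import bisect_left, insort

def h(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) / 2**160

ring = []       # sorted (position, server) pairs
VNODES = 3      # hash each server multiple times for better balance
REPLICAS = 2    # store each pair on this many distinct servers

def add_server(server):
    for i in range(VNODES):
        insort(ring, (h(f"{server}_{i}"), server))

def owners(key):
    # Walk clockwise from the key, collecting the first REPLICAS distinct servers.
    distinct = len(set(s for _, s in ring))
    i = bisect_left(ring, (h(key),))
    found = []
    while len(found) < min(REPLICAS, distinct):
        server = ring[i % len(ring)][1]
        if server not in found:
            found.append(server)
        i += 1
    return found

for s in ["serverA", "serverB"]:
    add_server(s)
print(owners("key1"))   # e.g. ['serverB', 'serverA']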

SLIDE 11

Consistent Hashing Summary

Consistent hashing is a simple, powerful tool for building distributed systems
  Provides a consistent, deterministic mapping between names and servers
  Often called locality sensitive hashing
  ■ Ideal algorithm for systems that need to scale up or down gracefully

Many, many systems use consistent hashing
  CDNs
  Databases: memcached, redis, Voldemort, Dynamo, Cassandra, etc.
  Overlay networks (more on this coming up…)

SLIDE 12

Outline

❑ Consistent Hashing
❑ Structured Overlays / DHTs

SLIDE 13

Layering, Revisited

[Figure: the protocol stacks of Host 1, a Router, and Host 2 — Application / Transport / Network / Data Link / Physical, with the router implementing only the Network, Data Link, and Physical layers]

Layering hides low-level details from higher layers
IP is a logical, point-to-point overlay

SLIDE 14

Towards Network Overlays

IP provides best-effort, point-to-point datagram service
Maybe you want additional features not supported by IP or even TCP
  Multicast
  Security
  Reliable, performance-based routing
  Content addressing, reliable data storage

Idea: overlay an additional routing layer on top of IP that adds additional features

SLIDE 15

Example: Virtual Private Network (VPN)

VPNs encapsulate IP packets over an IP network

[Figure: two private networks (hosts 34.67.0.1–34.67.0.4) joined across the public Internet by VPN gateways 74.11.0.1 and 74.11.0.2; a packet with inner “Dest: 34.67.0.4” is wrapped in an outer header “Dest: 74.11.0.2”, carried across the Internet, then decapsulated and delivered]

• VPN is an IP over IP overlay
• Not all overlays need to be IP-based

SLIDE 16

Network Overlays

[Figure: the same layered stacks as before, with a VPN network layer inserted above IP on both hosts, and a P2P overlay layer running on top of that]

SLIDE 17

Network Layer, version 2?

Function:
  Provide natural, resilient routes based on keys
  Enable new classes of P2P applications

Key challenges:
  Routing table overhead
  Performance penalty vs. IP

[Figure: a protocol stack with an extra overlay Network layer sitting between the Application and Transport layers]

SLIDE 18

Unstructured P2P Review

[Figure: a search query flooded hop by hop through an unstructured P2P network]

Redundancy
Traffic overhead
What if the file is rare or far away?

• Search is broken
• High overhead
• No guarantee it will work

SLIDE 19

Why Do We Need Structure?

Without structure, it is difficult to search
  Any file can be on any machine
  Centralization can solve this (e.g. Napster), but we know how that ends

How do you build a P2P network with structure?
1. Give every machine and object a unique name
2. Map from objects → machines
   ■ Looking for object A? Map(A) → X, talk to machine X
   ■ Looking for object B? Map(B) → Y, talk to machine Y

Is this starting to sound familiar?

SLIDE 20

Naïve Overlay Network

P2P file-sharing network
  Peers choose random IDs on the 0–1 ring
  Locate files by hashing their names
    hash(“GoT_s03e04.mkv”) = 0.314

[Figure: the file’s hash 0.314 maps to the peer with the next ID on the ring, 0.322]

Problems?
  How do you know the IP addresses of arbitrary peers?
  There may be millions of peers
  Peers come and go at random (churn)

SLIDE 21

Structured Overlay Fundamentals

Every machine chooses a unique, random ID
  Used for routing and object location, instead of IP addresses

Deterministic Key → Node mapping
  Consistent hashing
  Allows peer rendezvous using a common name

Key-based routing
  Scalable to any network of size N
  ■ Each node needs to know the IP of b * logb(N) other nodes
  ■ Much better scalability than OSPF/RIP/BGP
  Routing from node A → B takes at most logb(N) hops

Advantages
• Completely decentralized
• Self-organizing
• Infinitely scalable

SLIDE 22

Structured Overlays at 10,000 ft.

Node IDs and keys come from a randomized namespace
Incrementally route towards the destination ID
Each node knows a small number of IDs + IPs

Each node has a routing table
Forward to the longest prefix match

[Figure: a message addressed “To: ABCD” is forwarded through nodes A930 → AB5F → ABC0 → ABCE, each hop matching a longer prefix of the destination]
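
A sketch of the forwarding decision (hex IDs; the neighbor list and IDs are illustrative, not from any particular protocol):

def shared_prefix_len(a, b):
    # Number of leading digits two IDs have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(my_id, dest, neighbors):
    # Forward to the neighbor whose ID shares the longest prefix with the
    # destination (tie-breaking by numeric closeness is omitted for brevity).
    best = max(neighbors, key=lambda n: shared_prefix_len(n, dest))
    if shared_prefix_len(best, dest) <= shared_prefix_len(my_id, dest):
        return None  # no neighbor is a better match; this node is responsible
    return best

# One hop of the example route toward key "ABCD":
print(next_hop("A930", "ABCD", ["AB5F", "9F00", "1234"]))  # -> "AB5F"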

SLIDE 23

Details

Structured overlay APIs
  route(key, msg) : route msg to the node responsible for key
  ■ Just like sending a packet to an IP address

Distributed hash table (DHT) functionality
  ■ put(key, value) : store value at the node responsible for key
  ■ get(key) : retrieve the stored value for key from that node

Key questions:
  Node ID space — what does it represent?
  How do you route within the ID space?
  How big are the routing tables?
  How many hops to a destination (in the worst case)?
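
As a hedged sketch of how the DHT layer sits on top of key-based routing (the overlay object, its route/send calls, and the message format are all assumptions made for illustration):

# A DHT node builds put/get on top of the overlay's route(key, msg) primitive.
# route() delivers msg at the node responsible for key; on_deliver() below is
# what runs on that responsible node.
class DHTNode:
    def __init__(self, overlay):
        self.overlay = overlay      # assumed object exposing route() and send()
        self.store = {}             # this node's slice of the key space

    def put(self, key, value):
        self.overlay.route(key, {"op": "put", "key": key, "value": value})

    def get(self, key, reply_to):
        self.overlay.route(key, {"op": "get", "key": key, "reply_to": reply_to})

    def on_deliver(self, msg):
        # Called by the overlay when this node is responsible for msg["key"].
        if msg["op"] == "put":
            self.store[msg["key"]] = msg["value"]
        elif msg["op"] == "get":
            self.overlay.send(msg["reply_to"], self.store.get(msg["key"]))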

SLIDE 24

Tapestry/Pastry

Node IDs are numbers in a ring
  160-bit circular ID space
  Node IDs chosen at random

Messages for key X are routed to the live node with the longest prefix match to X
  Incremental prefix routing, e.g. for key 1110: 1XXX → 11XX → 111X → 1110

[Figure: a ring of nodes 0010, 0100, 0110, 1000, 1010, 1100, 1110 (wrapping from 1111 back to 0); a message “To: 1110” is routed around the ring to node 1110]

SLIDE 25

Physical and Virtual Routing

[Figure: the same “To: 1110” message shown twice — as hops across the virtual ring of node IDs, and as the corresponding hops across the underlying physical network (nodes 0010, 1010, 1100, 1111). Each virtual hop may cross several physical links.]

SLIDE 26

Problem: Routing Table Size

Definitions:
  N is the size of the network
  b is the base of the node IDs
  d is the number of digits in node IDs
  b^d = N

If N is large, then a naïve routing table is going to be huge
  Assume a flat naming space (kind of like MAC addresses)
  A client knows its own ID
  To send to any other node, it would need to know N − 1 other IP addresses
  Suppose N = 1 billion :(

SLIDE 27

Tapestry/Pastry Routing Tables

Incremental prefix routing

Definitions:
  N is the size of the network
  b is the base of the node IDs
  d is the number of digits in node IDs
  b^d = N

How many neighbors at each prefix digit? b − 1
How big is the routing table? Total size: b * d, or equivalently, b * logb N
logb N hops to any destination

[Figure: node 1011 and its routing-table neighbors on the ring, one per prefix length: 0011 (no shared prefix), 1110 (shares “1”), 1000 (shares “10”), 1010 (shares “101”)]
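
A sketch of building such a table (base b, d digits; fills one slot per (prefix length, next digit) pair, giving d rows of up to b entries — the peer IDs here are invented):

def routing_table(my_id, peers, b=16):
    # my_id and peers are equal-length strings of base-b digits.
    # table[row][col] holds some peer that shares `row` leading digits with
    # my_id and whose next digit is `col` (None if no such peer is known).
    d = len(my_id)
    digits = "0123456789abcdef"[:b]
    table = [[None] * b for _ in range(d)]
    for p in peers:
        row = 0
        while row < d and p[row] == my_id[row]:
            row += 1
        if row < d:
            col = digits.index(p[row])
            table[row][col] = table[row][col] or p
    return table

t = routing_table("65a1fc4", ["65a2abc", "9304f00", "65a1f11", "6234567"])
# 9304f00 lands in row 0, 6234567 in row 1, 65a2abc in row 3, 65a1f11 in row 5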

SLIDE 28

Derivation

Definitions:
  N is the size of the network
  b is the base of the node IDs
  d is the number of digits in node IDs
  b^d = N

Routing table size is b * d

  b^d = N
  d * log b = log N
  d = log N / log b
  d = logb N

Thus, the routing table is size b * logb N

• Key result!
• Size of routing tables grows logarithmically with the size of the network
• Huge P2P overlays are totally feasible
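
Plugging in numbers makes the result concrete (the base and network size below are just an example):

import math

N = 1_000_000_000      # a billion nodes
b = 16                 # hexadecimal node IDs

d = math.log(N, b)     # digits needed: about 7.5
table_size = b * d     # about 120 routing table entries
hops = math.ceil(d)    # at most ~8 hops to any destination

print(f"d = {d:.1f}, table ~ {table_size:.0f} entries, <= {hops} hops")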

SLIDE 29

Routing Table Example

Hexadecimal (base-16), node ID = 65a1fc4

[Figure: the node’s routing table shown as Row 0, Row 1, Row 2, Row 3, … — d rows in total (d = length of the node ID); each x in the table is the IP address of a peer]

SLIDE 30

Routing, One More Time

Each node has a routing table
Routing table size: b * d, or b * logb N
Hops to any destination: logb N

[Figure: the message “To: 1110” routed around the ring of nodes, reaching node 1110 within logb N hops]

SLIDE 31

Leaf Sets

Each node has an additional table of the L/2 numerically closest neighbors
  Larger and smaller IDs

Uses
  Alternate routes
  Fault detection (keep-alive)
  Replication of data

SLIDE 32

Joining the Overlay

1. Pick a new ID X
2. Contact an arbitrary bootstrap node
3. Route a message to X, discover the current owner
4. Add the new node to the ring
5. Download routes from new neighbors, update leaf sets

[Figure: a new node with ID 0011 is spliced into the ring next to its numerically closest neighbors]

SLIDE 33

Node Departure

Leaf set members exchange periodic keep-alive messages
  Handles local failures

Leaf set repair:
  Request the leaf set from the farthest node in the set

Routing table repair:
  Get the table from peers in row 0, then row 1, …
  Periodic and lazy

SLIDE 34

DHTs and Consistent Hashing

[Figure: a message “To: 1101” routed around the ring and delivered to the node responsible for key 1101]

Mappings are deterministic in consistent hashing
  Nodes can leave
  Nodes can enter
  Most data does not move

Only local changes impact data placement
Data is replicated among the leaf set

SLIDE 35

Structured Overlay Advantages and Uses

High-level advantages
  Completely decentralized
  Self-organizing
  Scalable and (relatively) robust

Applications
  Reliable distributed storage
  ■ OceanStore (FAST’03), Mnemosyne (IPTPS’02)
  Resilient anonymous communication
  ■ Cashmere (NSDI’05)
  Consistent state management
  ■ Dynamo (SOSP’07)
  Many, many others
  ■ Multicast, spam filtering, reliable routing, email services, even distributed mutexes

SLIDE 36

Trackerless BitTorrent

Torrent hash: 1101

[Figure: the torrent’s hash (1101) is routed to the responsible node on the ring, which plays the role of the tracker; the initial seed announces itself there, and each leecher looks up the torrent hash to find the tracker node and join the swarm]
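
Conceptually, the swarm’s peer list lives in the DHT under the torrent’s hash. A greatly simplified, hedged sketch using a generic put/get interface (the function names and whole-list storage are illustrative, not how Mainline DHT is actually specified):

def announce(dht, torrent_hash, my_addr):
    # Add ourselves to the peer list stored under the torrent's hash.
    peers = dht.get(torrent_hash) or []
    if my_addr not in peers:
        peers.append(my_addr)
    dht.put(torrent_hash, peers)

def find_peers(dht, torrent_hash):
    # Any leecher can find the swarm by hashing the torrent metadata.
    return dht.get(torrent_hash) or []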