Data Structures from the Future: Bloom Filters, Distributed Hash Tables, and More!

SLIDE 1

Data Structures from the Future: Bloom Filters, Distributed Hash Tables, and More!

Tom Limoncelli, Google NYC tlim@google.com

SLIDE 2

Why am I here?

I have no idea.

SLIDE 3

Why are you here?

I have 3 theories...

SLIDE 4

Why are you here?

  • 1. You thought this was the Dreamworks talk.

SLIDE 5

Why are you here?

  • 2. You’re still drunk from last night.

SLIDE 6

Why are you here?

  • 3. You can’t manage what you don’t understand.

SLIDE 7

Overview

  • 1. Hashes & Caches
  • 2. Bloom Filters
  • 3. Distributed Hash Tables (DHTs)
  • 4. Key/Value Stores (NoSQL)
  • 5. Google Bigtable

SLIDE 8

Disclaimer #1

There will be hand-waving. The Presence of Slides != “Being Prepared”

SLIDE 9

Disclaimer #2

You could learn most of this from Wikipedia.

  • Really. Did I mention they’re talking about Shrek in the other room?

SLIDE 10

Disclaimer #3

My LISA 2008 talk also conflicted with a talk from Dreamworks.

SLIDE 11

To understand this talk, you must understand:

  • Hashes
  • Caches

SLIDE 12

Hashes

SLIDE 13

What is a Hash?

A fixed-size summary of a large amount of data.

SLIDE 14

Checksum

Simple checksum: sum the byte values, then take the last digit of the total.
Pros: easy.
Cons: change the order of the bytes and you get the same checksum.
Improvement: a Cyclic Redundancy Check (CRC) detects changes in order.
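A minimal sketch of the simple checksum described above (Python; the function name is just for illustration), showing why reordering bytes defeats it while a CRC does not:

```python
import zlib

def simple_checksum(data: bytes) -> int:
    """Sum the byte values and keep only the last decimal digit."""
    return sum(data) % 10

# Reordering the bytes gives the same checksum, which is the weakness noted above:
print(simple_checksum(b"abc"))  # 4  (97 + 98 + 99 = 294)
print(simple_checksum(b"cba"))  # 4  (same digit despite the different order)

# A CRC over the same two inputs produces different values, so it detects the reordering:
print(zlib.crc32(b"abc") != zlib.crc32(b"cba"))  # True
```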

SLIDE 15

Hash

“Cryptographically unique”: it is difficult to generate two files with the same MD5 hash.
It is even more difficult to make a “valid second file”, i.e. a second file that is a valid example of the same format (for instance, both are HTML files).

SLIDE 16

How do crypto hashes work?

“It works because of math.”

Matt Blaze, Ph.D

SLIDE 17

Reversible/Irreversible Functions

Reversible:   [ ] + 105 = 205
Irreversible: [ ] mod 10 = 4
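To make the reversible/irreversible contrast concrete, a small sketch using the slide’s two fill-in-the-blank equations:

```python
# Reversible: knowing the output and the operation pins down the input exactly.
x_plus = 205 - 105          # the only x with x + 105 == 205 is 100

# Irreversible (many-to-one): the output does not pin down the input.
candidates = [x for x in range(50) if x % 10 == 4]

print(x_plus)       # 100
print(candidates)   # [4, 14, 24, 34, 44] -- any of these satisfies x mod 10 == 4
```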

SLIDE 18

Some common hashes

MD4 MD5 SHA1 SHA2 AES-Hash

SLIDE 19

Hashes

SLIDE 20

Caches

SLIDE 21

What is a Cache?

Using a small/expensive/fast thing to make a big/cheap/slow thing faster.

SLIDE 22

[Diagram: User -> Cache (fast but expensive) -> Database (big, slow, cheap)]

SLIDE 23

Metric used to grade a cache? The “hit rate”: hits / total queries.
How to tune? Add additional storage.
Smallest increment: the result size.
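A minimal look-aside cache sketch that tracks the hit rate described above (the CountingCache name and the backend callable are hypothetical, not from the talk):

```python
class CountingCache:
    """Small/fast store in front of a big/slow backend, tracking the hit rate."""
    def __init__(self, backend):
        self.backend = backend   # the big/slow/cheap lookup, e.g. a database call
        self.store = {}
        self.hits = 0
        self.total = 0

    def get(self, key):
        self.total += 1
        if key in self.store:
            self.hits += 1                        # served from the cache
        else:
            self.store[key] = self.backend(key)   # miss: fetch and remember the result
        return self.store[key]

    def hit_rate(self):
        return self.hits / self.total if self.total else 0.0
```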

SLIDE 24

Suppose the cache is X times faster, but Y times more expensive.
Balance the cost of the cache against the savings you can get:
the web cache achieves a 30% hit rate at some cost per MB; 33% of cachable traffic costs some amount per MB from the ISP.
What about non-cachable traffic? What about query size?

SLIDE 25

The value of the next increment is less than the previous one:
  • 10 units of cache achieves a 30% hit rate
  • +10 units, and the hit rate goes to 32%
  • +10 more units, and the hit rate goes to 33%

[Chart: $/unit versus number of cache units, showing diminishing returns]

SLIDE 26

[Diagram: User -> Cache (fast but expensive) -> Data (big, slow, cheap)]

SLIDE 27

[Diagram: caches in NYC, CHI, and LAX (fast but expensive), each in front of the same data store (big, slow, cheap)]

SLIDE 28

                 Simple Cache   NCACHE      Intelligent
Add new data?    Ok             Not found   Ok
Delete data?     Stale          Stale       Ok
Modify data?     Stale          Stale       Ok

SLIDE 29

Caches

SLIDE 30

Bloom Filters

SLIDE 31

What is a Bloom Filter?

Knowing when NOT to waste time seeking out data. Invented by Burton Howard Bloom in 2070.

SLIDE 32

What is a Bloom Filter?

Knowing when NOT to waste time seeking out data. Invented by Burton Howard Bloom in 1970.

SLIDE 33

I invented Bloom Filters when I was 10 years old.

SLIDE 34

SLIDE 35

[Diagram: User -> Bloom filter (or a precocious 10 year old) -> Data (big, slow, cheap)]

SLIDE 36

Using the last 3 bits of hash:

Bit-array slots: 000 001 010 011 100 101 110 111

Olson  000100001111
Polk   000000000011
Smith  001011101110
Singh  001000011110
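A toy sketch of this scheme in Python. It follows the slide’s simplification of a single hash function and a bit array indexed by the hash’s low bits; real Bloom filters use several independent hash functions, and the choice of MD5 here is only an assumption:

```python
import hashlib

class TinyBloom:
    """Bit array indexed by the low `bits` bits of a hash, as on the slide."""
    def __init__(self, bits=3):
        self.slots = [0] * (2 ** bits)

    def _slot(self, key):
        digest = hashlib.md5(key.encode()).digest()           # any hash works; MD5 is an assumption
        return int.from_bytes(digest, "big") % len(self.slots)  # i.e. the last `bits` bits

    def add(self, key):
        self.slots[self._slot(key)] = 1

    def might_contain(self, key):
        # 0 means "definitely not stored"; 1 means "possibly stored" (false positives happen).
        return self.slots[self._slot(key)] == 1

bf = TinyBloom(bits=3)
for name in ["Olson", "Polk", "Smith", "Singh"]:
    bf.add(name)
print(bf.might_contain("Olson"))   # True: no need to explain further
print(bf.might_contain("Lakey"))   # False unless Lakey's bits happen to collide
```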

SLIDE 37

Using the last 3 bits of hash:

Bit-array slots: 000 001 010 011 100 101 110 111

Olson  000100001111
Polk   000000000011
Smith  001011101110
Singh  001000011110
Lakey  111110000000
Baird  001011011111
Camp   001101001010
Johns  010100010100
Burd   111000001101
Bloom  110111000011

SLIDE 38

Using the last 4 bits of hash:

Bit-array slots: 0000 0001 0010 0011 0100 0101 0110 0111
                 1000 1001 1010 1011 1100 1101 1110 1111

Olson  000100001111
Polk   000000000011
Smith  001011101110
Singh  001000011110
Lakey  111110000000
Baird  001011011111
Camp   001101001010
Johns  010100010100
Burd   111000001101
Bloom  110111000011

7/16 = 44% of the slots are set.

SLIDE 39

Bits of hash   # Entries             Bytes     <25% 1’s
3              2^3  = 8              1         2
4              2^4  = 16             2         4
5              2^5  = 32             4         8
6              2^6  = 64             8         16
7              2^7  = 128            16        32
8              2^8  = 256            32        64
20             2^20 = 1,048,576      131,072   262,144
24             2^24 = 16,777,216     2M        4.1 million
32             2^32 = 4,294,967,296  512M      1 billion

SLIDE 40

When to use? Sparse data.
When to tune? When more than x% of the bits are “1”.
Pitfall: to resize, you must rescan all keys.
The minimum increment doubles memory usage:
each increment is MORE USEFUL than the previous, but exponentially MORE EXPENSIVE!

SLIDE 41

Bloom Filter sample uses

Databases: accelerate lookups of indices.
Simulations: often have big, sparse data sets.
Routers: speed up route-table lookups.

SLIDE 42

Distributed Bloom Filters?

SLIDE 43

[Diagram: Bloom filters in NYC, CHI, and LAX, each sitting in front of the same data store]

SLIDE 44

What if your Bloom Filter is out of date?

New data added: BAD. Clients may not see it.
Data changed: OK.
Data deleted: OK, but not as efficient.

SLIDE 45

How to perform updates?

The master calculates the bitmap once and sends it to all clients.
For a 20-bit table, that’s about 130 KB. Smaller than most GIFs!
Reasonable for daily or hourly updates.

SLIDE 46

SLIDE 47

Big Bloom Filters often use 96, 120 or 160 bits!

SLIDE 48

Bloom Filters

SLIDE 49

Hash Tables

SLIDE 50

What is a Hash Table?

It’s like an array. But the index can be anything “hashable”.

SLIDE 51

Hash tables

Perl hash:
  $thing{'b'} = 123;
  $thing{'key2'} = "value2";
  print $thing{'key2'};

Python dictionary, or “dict”:
  thing = {}
  thing['b'] = 123
  thing['key2'] = "value2"
  print thing['key2']

SLIDE 52

hash('cow')   = 78f825
hash('bee')   = 92eb5f
hash('sheep') = 92eb5f

Bucket   Data
78f825   ('cow', 'moo')
92eb5f   ('bee', 'buzz'), ('sheep', 'baah')
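A toy chained hash table along these lines (the class and method names, and the bucket count, are illustrative only); keys whose hashes land in the same bucket, like ‘bee’ and ‘sheep’ above, share a list:

```python
class ToyHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return hash(key) % len(self.buckets)   # which bucket this key belongs to

    def put(self, key, value):
        bucket = self.buckets[self._bucket(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)       # overwrite an existing key
                return
        bucket.append((key, value))            # otherwise chain onto the bucket

    def get(self, key):
        for k, v in self.buckets[self._bucket(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ToyHashTable()
t.put("cow", "moo"); t.put("bee", "buzz"); t.put("sheep", "baah")
print(t.get("sheep"))   # 'baah', even if 'bee' shares its bucket
```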

SLIDE 53

Hash Tables

SLIDE 54

Distributed Hash Tables (DHTs)

SLIDE 55

What is a DHT?

A hash table so big you have to spread it over multiple machines.

SLIDE 56

Wouldn’t an infinitely large hash table be awesome?

SLIDE 57

Web server

lookup(url) -> page contents

'index.html' -> '<html><head>...'
'/images/smile.png' -> 0x4d4d2a...

SLIDE 58

Virtual Web server

lookup(vhost/url) -> page contents

'cnn.com/index.html' -> '<html><he...'
'time.com/images/smile.png' -> 0x4d...

SLIDE 59

Virtual FTP server

lookup(host:path/file) -> file contents

'ftp.gnu.org:public/gcc.tgz'
'ftp.usenix.org:public/usenix.bib'

SLIDE 60

NFS server

lookup(host:path/file) -> file contents

'srv1:home/tlim/Documents/foo.txt' -> file contents
'srv2:home/tlim/TODO.txt' -> file contents

SLIDE 61

Usenet (remember usenet?)

lookup(group:groupname:artnumber) -> article
lookup('group:comp.sci.math:987765')

lookup(id:message-id) -> pointer
lookup('id:foo-12345@uunet') -> 'group:comp.sci.math:987765'

SLIDE 62

IMAP

lookup('server:user:folder:NNNN') -> email message

SLIDE 63

Our DVD Collection

hash(disc image) -> disc image
How do I find a particular disc? Keep a lookup table of name -> hash.
Benefit: two people with the same DVD? It only gets stored once.
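A sketch of the idea, assuming SHA-1 as the content hash (any strong hash would do; the dict names are illustrative):

```python
import hashlib

store = {}    # hash of disc image -> disc image bytes
names = {}    # human-readable name -> hash

def add_disc(name, image_bytes):
    key = hashlib.sha1(image_bytes).hexdigest()
    store.setdefault(key, image_bytes)   # identical images are stored only once
    names[name] = key
    return key

add_disc("Alice's copy of Shrek", b"...disc image...")
add_disc("Bob's copy of Shrek",   b"...disc image...")
print(len(store))   # 1 -- the identical disc image is deduplicated
```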

SLIDE 64

How would this work?

SLIDE 65

[Diagram: “Load it up!” All hashed keys are stored on the single root host.]

SLIDE 66

[Diagram: “Split.” The root host hands part of its keys off to child hosts.]

SLIDE 67

[Diagram: keys beginning ‘01...’ are routed to the child host owning that prefix.]

SLIDE 68

[Diagram: keys beginning ‘0...’ are routed to the child host owning that prefix.]

SLIDE 69

[Diagram: keys beginning ‘1...’ are routed to the child host owning that prefix.]

SLIDE 70

[Diagram: another split, spreading the keys across more hosts.]

SLIDE 71

[Diagram: Find 0100100111011001... The query is forwarded from the root host down the prefix tree to the host holding that key.]

SLIDE 72

Find: 0100110111011...

SLIDE 73

[Diagram: Find 0100110111011... The query is again routed by key prefix to the owning host.]

SLIDE 74

Each host stores:

  • All the data that “leafs” there.
  • The list of parent nodes talking to it.
  • The list of children it knows about.
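A toy sketch of such a node in Python (prefix routing by key bits; the class name and split rule are illustrative, not any particular DHT protocol):

```python
class Node:
    """Toy DHT node: routes a binary key string toward the child owning its prefix."""
    def __init__(self, prefix=""):
        self.prefix = prefix      # the key prefix this node is responsible for
        self.data = {}            # keys that "leaf" here
        self.children = {}        # next bit ('0' or '1') -> child Node

    def put(self, key, value):
        if not self.children:
            self.data[key] = value
        else:
            self.children[key[len(self.prefix)]].put(key, value)

    def get(self, key):
        if not self.children:
            return self.data.get(key)
        return self.children[key[len(self.prefix)]].get(key)

    def split(self):
        """Running out of room: hand the data to two children, by the next key bit."""
        for bit in "01":
            self.children[bit] = Node(self.prefix + bit)
        for key, value in self.data.items():
            self.children[key[len(self.prefix)]].put(key, value)
        self.data = {}

root = Node()
root.put("0100100111011001", "a")
root.put("1110001010010110", "b")
root.split()
print(root.get("0100100111011001"))   # 'a', served by the child owning prefix '0'
```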

SLIDE 75

Dynamically Adjusting:

  • Data hashes in “clumps”, making some hosts under-full and some over-full.
  • Host running out of storage? Split it in two; give half the data to another node.
  • Host running out of bandwidth? Clone the data and load-balance.

SLIDE 76

[Diagram: the key space spread across many hosts, with several replicated root hosts.]

SLIDE 77

Real DHTs in action

  • Peer-to-peer file-sharing networks
  • Content Delivery Networks (CDNs like Akamai)
  • Cooperative caches

SLIDE 78

Distributed Hash Tables (DHTs)

SLIDE 79

Key/Value Stores

SLIDE 80

Some common Key/Value Stores

“NoSQL”
  • CouchDB
  • MongoDB
  • Apache Cassandra
  • Terrastore
  • Google Bigtable

SLIDE 81

Name             Email              Address
Tom Limoncelli   tlim@google.com    1515 Main Street
Mary Smith       mary@example.com   111 One Street
Joe Bond         joe@007.com        7 Seventh St

SLIDE 82

Name             Email              Address
Tom Limoncelli   tlim@google.com    1515 Main Street
Mary Smith       mary@example.com   111 One Street
Joe Bond         joe@007.com        7 Seventh St

User             Transaction   Amount
Tom Limoncelli   Deposit       100
Mary Smith       Deposit       200
Tom Limoncelli   Withdraw      50

SLIDE 83

Id   Name             Email              Address
1    Tom Limoncelli   tlim@google.com    1515 Main Street
2    Mary Smith       mary@example.com   111 One Street
3    Joe Bond         joe@007.com        7 Seventh St

User Id   Transaction   Amount
1         Deposit       100
2         Deposit       200
1         Withdraw      50

SLIDE 84

Id   Name             Email              Address
1    Tom Limoncelli   tlim@google.com    1515 Main Street
2    Mary Bond        mary@example.com   111 One Street
3    Joe Bond         joe@007.com        7 Seventh St

User Id   Transaction   Amount
1         Deposit       100
2         Deposit       200
3         Withdraw      50

SLIDE 85

Relational Databases

1st Normal Form, 2nd Normal Form, 3rd Normal Form
ACID: Atomicity, Consistency, Isolation, Durability

SLIDE 86

Key/Value Stores

Keys and values.
BASE: Basically Available, Soft-state, Eventually consistent

SLIDE 87

Eventually?

Who cares! This is the web, not payroll! Change the address listed in your profile. Might not propagate to Europe for 15 minutes. Can you fly to Europe in less than 15 minutes? And if you could, would you care?

SLIDE 88

Key/Value example:

Key                Value
tlim@google.com    BLOB OF DATA
mary@example.com   BLOB OF DATA
joe@007.com        BLOB OF DATA

SLIDE 89

Key/Value example:

Key: tlim@google.com
Value: { 'name': 'Tom Limoncelli', 'address': '1515 Main Street' }

Key: mary@example.com
Value: { 'name': 'Mary Smith', 'address': '111 One Street' }

Key: joe@007.com
Value: { 'name': 'Joe Bond', 'address': '7 Seventh St' }

SLIDE 90

Google Protobuf: http://code.google.com/p/protobuf/

Key: tlim@google.com
Value:
  message Person {
    required string name = 1;
    optional string address = 2;
    repeated string phone = 3;
  }

Key: mary@example.com
Value: { 'name': 'Mary Smith', 'address': '111 One Street', 'phone': ['201-555-3456', '908-444-1111'] }

Key: joe@007.com
Value: { 'name': 'Joe Bond', 'phone': ['862-555-9876'] }
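For context, this is roughly how the Person message above would be used from Python once compiled with protoc; the module name person_pb2 is an assumption based on a hypothetical person.proto file:

```python
# Assumes: protoc --python_out=. person.proto  generated person_pb2 for the message above.
import person_pb2

mary = person_pb2.Person()
mary.name = "Mary Smith"
mary.address = "111 One Street"
mary.phone.append("201-555-3456")
mary.phone.append("908-444-1111")

blob = mary.SerializeToString()     # compact bytes, suitable as the value in a key/value store

restored = person_pb2.Person()
restored.ParseFromString(blob)
print(restored.name, list(restored.phone))
```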

SLIDE 91

Key/Value Stores

SLIDE 92

Bigtable

SLIDE 93

Bigtable

Google’s very, very large database.
OSDI ’06: http://labs.google.com/papers/bigtable.html
Petabytes of data across thousands of commodity servers.
Used for web indexing, Google Earth, and Google Finance.

SLIDE 94

Bigtable Keys

Keys can be very large.
They don’t have to have a value (i.e. the value is “null”).
Query by key, or by a key start/stop range (lexicographical order).

SLIDE 95

Long keys are cool.

Key                Value
Main St/123/Apt1   Jones
Main St/123/Apt2   Smith
Main St/200        Olson

Query range:  Start: “Main St/123”   End: infinity
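A sketch of that range query over lexicographically sorted keys (plain Python standing in for Bigtable’s scanner; the scan helper is illustrative):

```python
import bisect

rows = {
    "Main St/123/Apt1": "Jones",
    "Main St/123/Apt2": "Smith",
    "Main St/200": "Olson",
}
keys = sorted(rows)

def scan(start, end=None):
    """Yield (key, value) for start <= key < end, in lexicographic key order."""
    i = bisect.bisect_left(keys, start)
    while i < len(keys) and (end is None or keys[i] < end):
        yield keys[i], rows[keys[i]]
        i += 1

print(list(scan("Main St/123")))                  # start "Main St/123", end infinity: all three rows
print(list(scan("Main St/123", "Main St/124")))   # only the two apartments at 123 Main St
```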

SLIDE 96

Bigtable Values

Values can be huge. Gigabytes. Multiple values per key, grouped in “families”: “key:family:family:family:...”

SLIDE 97

Families

  • Within a family: sub-keys that link to data.
  • Sub-keys are dynamic: no need to pre-define them.
  • Sub-keys can be repeated.

SLIDE 98

Example: Crawl the web

For every URL:
  • Store the HTML at that location.
  • Store a list of which URLs link to that URL.
  • Store the “anchor text” those sites used: <a href=”URL”>ANCHOR TEXT</a>

SLIDE 99

http://www.cnn.com
  <html>.........</html>

http://tomontime.com
  <html> <p>As you may have read on <a href=”http://www.cnn.com”>my favorite news site</a> there is...

SLIDE 100

Row key         contents:    anchor:tomontime.com       anchor:cnnsi.com
com.cnn.www     <html>...    “my favorite news site”    “CNN”

Row key         contents:    anchor:everythingsysadmin.com
com.tomontime   <html>...    “videos”

(“contents:” and “anchor:” are column families; each anchor:<site> is a sub-key within the anchor family.)

SLIDE 101

Each Family has its own...

  • Permissions (who can read/write/admin)
  • QoS (optimize for speed, storage diversity, etc.)

SLIDE 102

Plus “time”

All updates are timestamped.
Bigtable retains at least the n most recent updates (or retains them forever).
Expired updates are garbage collected “eventually”.

SLIDE 103

Bigtable

SLIDE 104

Further Reading:

Bigtable: http://research.google.com
A visual guide to NoSQL: http://blog.nahurst.com/visual-guide-to-nosql-systems
Hash tables, DHTs, everything else: Wikipedia

SLIDE 105

Other futuristic topics:

Stop using “locks” and eliminate all deadlocks:
  • STM: Software Transactional Memory
Centralized routing (you’d be surprised):
  • 2-minute overview: www.openflowswitch.org (the 4-minute demo video is MUCH BETTER)
“Network Coding”: n^2 more bandwidth?
  • SciAm.com: “Breaking Network Logjams”

SLIDE 106

Q&A

SLIDE 107

How to do a query?

SLIDE 108

KEY       VALUE
bird      “{ legs=2, horns=0, covering=‘feathers’ }”
cat       “{ legs=4, horns=0, covering=‘fur’ }”
dog       “{ legs=4, horns=0, covering=‘fur’ }”
spider    “{ legs=8, horns=0, covering=‘hair’ }”
unicorn   “{ legs=4, horns=1, covering=‘hair’ }”

SLIDE 109

“Which animals have 4 legs?”

  • Iterate over the entire list
  • Open up each blob
  • Parse the data
  • Accumulate the list
SLOW!

SLIDE 110

KEY              VALUE
animal:bird      “{ legs=2, horns=0, covering=‘feathers’ }”
animal:cat       “{ legs=4, horns=0, covering=‘fur’ }”
animal:dog       “{ legs=4, horns=0, covering=‘fur’ }”
animal:spider    “{ legs=8, horns=0, covering=‘hair’ }”
animal:unicorn   “{ legs=4, horns=1, covering=‘hair’ }”
legs:2:bird      (no value)
legs:4:cat       (no value)
legs:4:dog       (no value)
legs:4:unicorn   (no value)
legs:8:spider    (no value)

Iterate:  Start: “legs:4”   End: “legs:5”   (up to, but not including, “end”)
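A sketch of the same index-plus-range-scan trick in plain Python (the table dict and scan helper stand in for the key/value store and its iterator):

```python
import bisect

table = {
    "animal:bird":    "{ legs=2, horns=0, covering='feathers' }",
    "animal:cat":     "{ legs=4, horns=0, covering='fur' }",
    "animal:dog":     "{ legs=4, horns=0, covering='fur' }",
    "animal:spider":  "{ legs=8, horns=0, covering='hair' }",
    "animal:unicorn": "{ legs=4, horns=1, covering='hair' }",
    # index rows: empty values, the key itself carries the answer
    "legs:2:bird": "", "legs:4:cat": "", "legs:4:dog": "",
    "legs:4:unicorn": "", "legs:8:spider": "",
}
keys = sorted(table)

def scan(start, end):
    """Keys in [start, end), matching the slide's up-to-but-not-including rule."""
    i = bisect.bisect_left(keys, start)
    while i < len(keys) and keys[i] < end:
        yield keys[i]
        i += 1

print([k.split(":")[2] for k in scan("legs:4", "legs:5")])   # ['cat', 'dog', 'unicorn']
```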

SLIDE 111

legs=4 AND covering=fur

More indexes + the “zig-zag” algorithm.
More indexed attributes = slower insertions.
Automatic if you use AppEngine’s storage system.
