Consistent Hashing in your python applications (Europython 2017)

SLIDE 1

Consistent Hashing in your python applications

Europython 2017

SLIDE 2

@ultrabug

Gentoo Linux developer
CTO at Numberly

SLIDE 3

History & main use cases

Distributed (web) caching (Akamai)
P2P (Chord & BitTorrent)
Distributed databases (data distribution / sharding)

  • Amazon DynamoDB
  • Cassandra / ScyllaDB
  • Riak
  • CockroachDB
SLIDE 4

MAPPING

referential -> information

SLIDE 5

Phonebook

name -> phone number

SLIDE 6

Map logic

Referential (selection) -> Logical operation -> INFORMATION (lookup efficiency)

SLIDE 7

MAP

key -> value

SLIDE 8

Python dict()

{key: value}

SLIDE 9

Python dict() is a Hash Table

SLIDE 10

Hash Table logic

Hash function(key) -> Logical operation -> LOCATION

implementation

SLIDE 11

Python dict() implementation

hash(key) & (size of array - 1) = array index

hash('a') = 12416037344 & 11 = 0
hash('b') = 12544037731 & 11 = 3
hash('c') = 12672038114 & 11 = 2

Array (in memory)

0 | value: 123
1 |
2 | value: 'coco'
3 | value: None
...
11 |
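The index computation above can be sketched in a few lines. This is a simplified illustration, not CPython's actual dict code: the function name `slot_index` and the default table size of 8 are assumptions, and the sketch requires a power-of-two table size so that the bit mask behaves like a modulo (the slide itself uses a mask of 11).

```python
def slot_index(key, table_size=8):
    """Map a key to an array slot: hash(key) & (size of array - 1)."""
    # The mask only covers every slot when table_size is a power of two.
    assert table_size & (table_size - 1) == 0, "table_size must be a power of two"
    return hash(key) & (table_size - 1)


for key in ("a", "b", "c"):
    # str hashes are randomized per process unless PYTHONHASHSEED is set,
    # so the printed slots can differ from one run to the next.
    print(key, "->", slot_index(key))
```

Small integers hash to themselves in CPython, which makes them handy for checking the arithmetic by hand: `slot_index(10)` is `10 & 7`.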
SLIDE 12

Key factors to consider

Distribution (balancing)
Accuracy
LOCATION efficiency
scaling

SLIDE 13

Python dict efficiency & scaling

hash(key) & (size of array - 1) = array index

hash('a') = 12416037344 & 11 = 0
hash('b') = 12544037731 & 11 = 3
hash('c') = 12672038114 & 11 = 2

0 | value: 123
1 | MEMORY
2 | value: 'coco'
3 | value: None
...
11 | MEMORY

hash() = uneven distribution
Optimized for fast lookups O(1)
Memory inefficient (probing)
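To make the "memory inefficient (probing)" point concrete, here is a toy open-addressing table. It is a sketch under stated assumptions: the class name `ToyTable` is hypothetical, and it uses plain linear probing, which is simpler than the probing sequence CPython really uses.

```python
class ToyTable:
    """Toy hash table with linear probing (not CPython's real algorithm)."""

    def __init__(self, size=8):
        self.slots = [None] * size  # preallocated array, mostly empty

    def put(self, key, value):
        mask = len(self.slots) - 1
        index = hash(key) & mask
        for _ in range(len(self.slots)):
            entry = self.slots[index]
            if entry is None or entry[0] == key:
                self.slots[index] = (key, value)
                return index
            index = (index + 1) & mask  # collision: probe the next slot
        raise RuntimeError("table is full")

    def get(self, key):
        mask = len(self.slots) - 1
        index = hash(key) & mask
        for _ in range(len(self.slots)):
            entry = self.slots[index]
            if entry is None:
                raise KeyError(key)
            if entry[0] == key:
                return entry[1]
            index = (index + 1) & mask
        raise KeyError(key)


table = ToyTable()
table.put(1, "one")
table.put(9, "nine")  # 9 & 7 == 1 collides with key 1, so it lands in slot 2
unused = table.slots.count(None)
print(f"{unused} of {len(table.slots)} slots are allocated but unused")
```

Lookups stay fast because the array is kept sparse, and that sparseness is exactly the memory cost the slide points at.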

SLIDE 14

Distributed Hash Tables (DHT)

SLIDE 15

Split your key space into buckets

hash(key) operator -> bucket
hash(key) operator -> bucket
hash(key) operator -> bucket

the hash function will impact the size of each bucket
SLIDE 16

Distribute your buckets to servers

hash(key) operator -> SERVER 0, bucket 0
hash(key) operator -> SERVER 1, bucket 1
hash(key) operator -> SERVER 2, bucket 2

What's the best operator function to find the server hosting the bucket for my key?

SLIDE 17

md5(key) % (number of buckets) = server

Naive DHT implementation

int(md5(b'd').hexdigest(), 16) % 3 = 0 -> SERVER 0, bucket 0
int(md5(b'e').hexdigest(), 16) % 3 = 1 -> SERVER 1, bucket 1
int(md5(b'f').hexdigest(), 16) % 3 = 2 -> SERVER 2, bucket 2

simple & looking good...
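The naive scheme on this slide fits in one function. The sketch below uses only the standard library; the name `naive_server` is an assumption made for illustration.

```python
from hashlib import md5


def naive_server(key: bytes, num_servers: int) -> int:
    """md5(key) % (number of buckets) = server."""
    # hexdigest() returns a hex string, so it is converted to an int first.
    return int(md5(key).hexdigest(), 16) % num_servers


for key in (b"d", b"e", b"f"):
    print(key, "-> SERVER", naive_server(key, 3))
```

The mapping is deterministic, so every client computes the same server for a given key without any coordination, as long as they all agree on the number of servers.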

SLIDE 18

md5(key) % (number of buckets) = server

Naive DHT implementation

int(md5(b'd').hexdigest(), 16) % 4 = 1 (was 0)
int(md5(b'e').hexdigest(), 16) % 4 = 2 (was 1)
int(md5(b'f').hexdigest(), 16) % 4 = 3 (was 2)
int(md5(b'g').hexdigest(), 16) % 4 = 1

SERVER 0, bucket 0
SERVER 1, bucket 1
SERVER 2, bucket 2
SERVER 3, bucket 3

...until you add / remove a bucket/server!

SLIDE 19

n/(n+1)

~ fraction of remapped keys
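The n/(n+1) figure can be checked empirically. The sketch below is an assumption-laden experiment, not part of the talk: it generates synthetic keys (`key-0`, `key-1`, ...), assigns them with the naive modulo scheme for 3 and then 4 servers, and counts how many changed server.

```python
from hashlib import md5


def naive_server(key: bytes, num_servers: int) -> int:
    return int(md5(key).hexdigest(), 16) % num_servers


def remapped_fraction(num_keys: int, before: int, after: int) -> float:
    """Share of keys whose server changes when the server count changes."""
    keys = [f"key-{i}".encode() for i in range(num_keys)]
    moved = sum(naive_server(k, before) != naive_server(k, after) for k in keys)
    return moved / num_keys


fraction = remapped_fraction(10_000, before=3, after=4)
print(f"{fraction:.2%} of keys moved, n/(n+1) = {3 / 4:.2%}")
```

With n = 3 the measured share should sit close to three quarters, which in a cache means most entries become misses the moment one server is added.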

SLIDE 20

HELP! we need consistency

SLIDE 21

The Hash Ring

SLIDE 22

Place your servers on the continuum (ring)

hash(server 0)
hash(server 1)
hash(server 2)

SLIDE 23

A key's bucket is on the next server in the ring

SERVER 1 SERVER 2 SERVER 0

hash(key) hash(key)
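A minimal ring along these lines can be built with a sorted list and a binary search. This is a sketch, not the code of any of the libraries named later in the talk: the class name `ToyRing`, the helper `ring_hash`, and the choice of md5 are assumptions.

```python
from bisect import bisect_right
from hashlib import md5


def ring_hash(value: str) -> int:
    """Position of a server name or key on the continuum."""
    return int(md5(value.encode()).hexdigest(), 16)


class ToyRing:
    def __init__(self, servers):
        # Each server sits at the point given by the hash of its name.
        self._ring = sorted((ring_hash(s), s) for s in servers)
        self._points = [point for point, _ in self._ring]

    def get_server(self, key: str) -> str:
        # First server point after the key's hash; the modulo wraps
        # around to the first server when the key is past the last point.
        index = bisect_right(self._points, ring_hash(key)) % len(self._ring)
        return self._ring[index][1]


ring = ToyRing(["server-0", "server-1", "server-2"])
for key in ("potato", "coconut", "tomato"):
    print(key, "->", ring.get_server(key))
```

The binary search keeps a lookup at O(log n) in the number of points on the ring.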

SLIDE 24

1/n

~ fraction of remapped keys
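The property behind this number is that adding a server only takes keys away from its neighbour on the ring; every other key stays where it was. The sketch below checks that on synthetic keys. The helper names are assumptions, and with a single point per server the exact share that moves depends on where the new server's hash happens to land.

```python
from bisect import bisect_right
from hashlib import md5


def ring_hash(value: str) -> int:
    return int(md5(value.encode()).hexdigest(), 16)


def owner(key: str, servers) -> str:
    """Next server on the ring after the key's hash (ring rebuilt per call)."""
    ring = sorted((ring_hash(s), s) for s in servers)
    points = [point for point, _ in ring]
    return ring[bisect_right(points, ring_hash(key)) % len(ring)][1]


keys = [f"key-{i}" for i in range(2_000)]
old_servers = ["server-0", "server-1", "server-2"]
new_servers = old_servers + ["server-3"]

before = {k: owner(k, old_servers) for k in keys}
after = {k: owner(k, new_servers) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
fraction = len(moved) / len(keys)

print(f"{fraction:.2%} of keys moved, all of them to the new server")
```

Compare this with the naive modulo scheme, where keys are reshuffled between all servers, old and new.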

SLIDE 25

Uneven partitions lead to hotspots

server 0 server 2 server 1

hash functions are not perfect

SLIDE 26

Which hash function to use?

Cryptographic hash functions

  • MD5
  • SHA1
  • SHA256

standard adoption, need conversion to int

Non cryptographic hash functions

  • CityHash (Google)
  • Murmur (v3)

optimized for key lookups, fast, need of C libs

speed: SHAX - MD5 - CityHash128 - Murmur3 - CityHash64 - CityHash32
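The "need conversion to int" remark about the cryptographic functions can be illustrated with the standard library alone. The function name `digest_to_int` is an assumption; `hashlib.new` is the stdlib constructor that takes the algorithm name as a string.

```python
import hashlib


def digest_to_int(key: bytes, algorithm: str = "md5") -> int:
    """Turn a hashlib digest into an integer usable as a ring position."""
    digest = hashlib.new(algorithm, key).digest()  # raw bytes
    return int.from_bytes(digest, "big")


for algorithm in ("md5", "sha1", "sha256"):
    position = digest_to_int(b"my-key", algorithm)
    print(f"{algorithm}: {position.bit_length()} bits at most, position {position}")
```

`int.from_bytes(digest, "big")` gives the same number as `int(hexdigest, 16)`, the form used on the earlier slides. The non cryptographic functions listed above are not in the standard library, which is why they are left out of this sketch.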
SLIDE 27

Hash Rings

vnodes & weights mitigate hotspots

reduces load variance on servers
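One way to picture vnodes and weights: instead of one point per server, each server is hashed several times under derived names, and a weight simply multiplies that number of points. The sketch below is an illustration under assumptions (the `server-i` naming of points, md5, and the figure of 160 points are arbitrary choices), not the scheme of any particular library.

```python
from bisect import bisect_right
from collections import Counter
from hashlib import md5


def ring_hash(value: str) -> int:
    return int(md5(value.encode()).hexdigest(), 16)


def build_ring(weights, vnodes=1):
    """weights: {server: weight}; each server gets vnodes * weight points."""
    ring = sorted(
        (ring_hash(f"{server}-{i}"), server)
        for server, weight in weights.items()
        for i in range(vnodes * weight)
    )
    return [point for point, _ in ring], [server for _, server in ring]


def lookup(ring, key: str) -> str:
    points, owners = ring
    return owners[bisect_right(points, ring_hash(key)) % len(points)]


def load(ring, num_keys=10_000):
    """Number of synthetic keys that land on each server."""
    return Counter(lookup(ring, f"key-{i}") for i in range(num_keys))


servers = {"server-0": 1, "server-1": 1, "server-2": 1}
single = load(build_ring(servers, vnodes=1))
balanced = load(build_ring(servers, vnodes=160))
print("1 point per server:   ", dict(single))
print("160 points per server:", dict(balanced))
```

With many small arcs per server, the per-server totals should come out much closer to an even split than with one large arc each.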

SLIDE 28

My preciouuus!

SLIDE 29

Consistent Hashing implementations in python

ConsistentHashing: A simple implement of consistent hashing
consistent_hash: The algorithm is the same as libketama
hash_ring: Using md5 as hashing function
python-continuum: Using md5 as hashing function
uhashring: Full featured, ketama compatible

SLIDE 30

uhashring

SLIDE 31

Example use case #1

Database instances distribution

DB1 DB2 DB3 DB4

client A client B client C client D

SLIDE 32

Example use case #1

Database instances distribution

SLIDE 33

Example use case #1

Database instances distribution

SLIDE 34

Example use case #2

Disk & network I/O distribution

disk 1 disk 2 disk 3 disk 4

task A task B task C task D

SLIDE 35

Example use case #3

Log & tracing consistency

worker 1 worker 2 worker 3 worker 4

user_id A user_id B user_id C user_id D

SLIDE 36

Example use case #4

python-memcached consolidation

cache 1 cache 2 cache 3 cache 4

'potato' 'coconut' 'tomato' 'raspberry'

SLIDE 37

Live demo raffle

List of GIFs
One of the GIFs is the winner
Every participant is a node (bucket)
hash(WINNER_GIF_URL) picks the winner node
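The raffle rule can be sketched as a ring lookup where the participants are the nodes and the winning GIF's URL is the key. This is not the demo's actual code: the function name `pick_winner`, the participant names, and the example URL are all hypothetical.

```python
from bisect import bisect_right
from hashlib import md5


def ring_hash(value: str) -> int:
    return int(md5(value.encode()).hexdigest(), 16)


def pick_winner(participants, winner_gif_url: str) -> str:
    """Every participant is a node; the hashed URL picks the next node."""
    ring = sorted((ring_hash(name), name) for name in participants)
    points = [point for point, _ in ring]
    index = bisect_right(points, ring_hash(winner_gif_url)) % len(ring)
    return ring[index][1]


participants = ["alice", "bob", "carol"]  # hypothetical participants
winner = pick_winner(participants, "https://example.com/winner.gif")
print("winner:", winner)
```

The result does not depend on the order in which participants joined, only on the set of names and the URL.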

SLIDE 38

http://ep17.nbly.co (silly live demo)

SLIDE 39

Thanks

github.com/ultrabug/ep2017
github.com/ultrabug/uhashring
@ultrabug