[PPT] - Algorithmic Improvements for Fast Concurrent Cuckoo Hashing PowerPoint Presentation

SLIDE 1

Algorithmic Improvements for Fast Concurrent Cuckoo Hashing

Xiaozhou Li (Princeton)
David G. Andersen (CMU)
Michael Kaminsky (Intel Labs)
Michael J. Freedman (Princeton)

SLIDE 2

In this talk

How to build a fast concurrent hash table
– algorithm and data structure engineering

Experience with hardware transactional memory
– does NOT obviate the need for algorithmic optimizations

SLIDE 3

Concurrent hash table

Indexing key-value objects
– Lookup(key)
– Insert(key, value)
– Delete(key)

Fundamental building block for modern systems
– System applications (e.g., kernel caches)
– Concurrent user-level applications

Targeted workloads: small objects, high rate

SLIDE 4

Goal: memory-efficient and high-throughput

Memory efficient (e.g., > 90% space utilized)
Fast concurrent reads (scale with # of cores)
Fast concurrent writes (scale with # of cores)

SLIDE 5

Preview our results on a quad-core machine

[Bar chart: throughput (million requests per second, axis 0 to 40) for C++11 std::unordered_map, Google dense_hash_map, Intel TBB concurrent_hash_map, cuckoo+ with fine-grained locking, and cuckoo+ with HTM]

Workload: 64-bit key and 64-bit value, 120 million objects, 100% Insert

cuckoo+ uses (less than) half of the memory compared to others

SLIDE 6

Background: separate chaining hash table

Chaining items hashed in same bucket

[Diagram: a lookup follows a chain of K/V nodes]

Good: simple
Bad: poor cache locality
Bad: pointers cost space

e.g., Intel TBB concurrent_hash_map

SLIDE 7

Background: open addressing hash table

Probing alternate locations for vacancy
e.g., linear/quadratic probing, double hashing

[Diagram: lookup]

Good: cache friendly
Bad: poor memory efficiency
– performance dramatically degrades when the usage grows beyond 70% capacity or so
– e.g., Google dense_hash_map wastes 50% memory by default.

SLIDE 8

Our starting point

Multi-reader single-writer cuckoo hashing [Fan, NSDI'13]
– Open addressing
– Memory efficient
– Optimized for read-intensive workloads

SLIDE 9

Cuckoo hashing

Each bucket has b slots for items (b-way set-associative)
Each key is mapped to two random buckets
– stored in one of them

[Diagram: buckets 1 to 8; key x maps to the two buckets hash1(x) and hash2(x)]

SLIDE 10

Predictable and fast lookup

Lookup: read 2 buckets in parallel
– constant time in the worst case

[Diagram: Lookup x checks its two candidate buckets among buckets 1 to 8]
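The bucket layout and two-bucket lookup described on the last two slides can be sketched as follows. This is a minimal single-threaded illustration, not the authors' implementation: the names `CuckooTable`, `insert_if_room`, the choice of b = 4, and the way `hash2` is derived from `std::hash` are all assumptions made for the example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <initializer_list>
#include <vector>

constexpr int kSlots = 4;  // b = 4 slots per bucket (assumed for illustration)

struct Slot {
    bool used = false;
    std::uint64_t key = 0;
    std::uint64_t value = 0;
};
using Bucket = std::array<Slot, kSlots>;

struct CuckooTable {
    std::vector<Bucket> buckets;
    explicit CuckooTable(std::size_t n) : buckets(n) {}

    // Two hypothetical hash functions; any pair of independent hashes would do.
    std::size_t hash1(std::uint64_t k) const {
        return std::hash<std::uint64_t>{}(k) % buckets.size();
    }
    std::size_t hash2(std::uint64_t k) const {
        return std::hash<std::uint64_t>{}(k ^ 0x9e3779b97f4a7c15ULL) % buckets.size();
    }

    // Lookup inspects at most 2 buckets, i.e. at most 2 * kSlots slots,
    // regardless of how full the table is.
    bool lookup(std::uint64_t k, std::uint64_t& out) const {
        for (std::size_t b : {hash1(k), hash2(k)}) {
            for (const Slot& s : buckets[b]) {
                if (s.used && s.key == k) { out = s.value; return true; }
            }
        }
        return false;
    }

    // Simple insert: succeeds only if one of the two buckets has a free slot.
    // Displacing existing keys ("cuckoo move") is deliberately left out here.
    bool insert_if_room(std::uint64_t k, std::uint64_t v) {
        for (std::size_t b : {hash1(k), hash2(k)}) {
            for (Slot& s : buckets[b]) {
                if (!s.used) { s.used = true; s.key = k; s.value = v; return true; }
            }
        }
        return false;
    }
};
```

Because a key can only live in one of its two buckets, the lookup cost is bounded by a constant, which is the "predictable" property the slide refers to.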

SLIDE 11

Insert may need "cuckoo move"

Insert: Write to an empty slot in one of the two buckets

[Diagram: Insert y into buckets 1 to 8]

SLIDE 12

Insert may need "cuckoo move"

Insert: Both are full?

[Diagram: Insert y; buckets 1 to 8 hold keys a, x, b, k, r, c, s, e, n, f]

SLIDE 13

Insert may need "cuckoo move"

Insert: move keys to alternate buckets

[Diagram: Insert y; keys x, a, and b are each shown with their possible locations among buckets 1 to 8]

SLIDE 14

Insert may need "cuckoo move"

Insert: move keys to alternate buckets
– find a "cuckoo path" to an empty slot
– move hole backwards

A technique in [Fan, NSDI'13]
No reader/writer false misses

[Diagram: Insert y; keys x, a, and b are moved along the cuckoo path in buckets 1 to 8 so that y can be stored]
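To illustrate why moving the hole backwards avoids false misses, here is a minimal single-threaded sketch. It assumes one slot per bucket, uses key 0 to mark an empty slot, and takes the cuckoo path as a precomputed list of slot indices; `insert_along_path` and `present` are hypothetical names, and the sketch does not check that each hop really leads to the key's alternate bucket.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Key = std::uint64_t;  // key 0 marks an empty slot in this sketch

bool present(const std::vector<Key>& slots, Key k) {
    return std::find(slots.begin(), slots.end(), k) != slots.end();
}

// path[0] is the slot the new key will take; path.back() is the empty slot.
// Keys are shifted starting from the END of the path, so every displaced key
// is copied into its destination before its old slot is overwritten.
// Returns the number of slot writes performed (one per slot on the path).
std::size_t insert_along_path(std::vector<Key>& slots,
                              const std::vector<std::size_t>& path,
                              Key new_key) {
    assert(!path.empty() && slots[path.back()] == 0);
    std::size_t writes = 0;
    for (std::size_t i = path.size() - 1; i > 0; --i) {
        const Key moving = slots[path[i - 1]];
        slots[path[i]] = moving;          // fill the hole; the key now exists twice
        ++writes;
        assert(present(slots, moving));   // a concurrent reader could still find it
    }
    slots[path[0]] = new_key;             // the hole has reached the front
    ++writes;
    return writes;
}
```

With this ordering a displaced key is briefly duplicated but never absent, which is the property the slide calls "no reader/writer false misses". A real multi-reader table would still need something like per-bucket version counters so readers can detect a concurrent move; that part is not shown.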

SLIDE 15

Review our starting point [Fan, NSDI'13]: Multi-reader single-writer cuckoo hashing

Benefits
– support concurrent reads
– memory efficient for small objects
  over 90% space utilized when set-associativity ≥ 4

Limits
– Inserts are serialized
  poor performance for write-heavy workloads

[Chart panels: 50% Lookup, 100% Lookup]

SLIDE 16

Improve write concurrency

Algorithmic optimizations
– Minimize critical sections
– Exploit data locality

Explore two concurrency control mechanisms
– Hardware transactional memory
– Fine-grained locking

SLIDE 17

Algorithmic optimizations

Lock after discovering a cuckoo path
– minimize critical sections

Breadth-first search for an empty slot
– fewer items displaced
– enable prefetching

Increase set-associativity (see paper)
– fewer random memory reads

SLIDE 18

Previous approach: writer locks the table during the whole insert process

lock();
Search for a cuckoo path;        // at most hundreds of bucket reads
Cuckoo move and insert;          // at most hundreds of writes
unlock();

All Insert operations of other threads are blocked

SLIDE 19

Lock after discovering a cuckoo path

Search for a cuckoo path;        // no locking required
lock();
Cuckoo move and insert;          // ← collision
unlock();

Multiple Insert threads can look for cuckoo paths concurrently

SLIDE 20

Lock after discovering a cuckoo path

while (1) {
    Search for a cuckoo path;    // no locking required
    lock();
    Cuckoo move and insert while the path is valid;
    if (success) { unlock(); break; }
    unlock();
}

Multiple Insert threads can look for cuckoo paths concurrently
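The retry loop above can be sketched in C++ roughly as follows. This is a toy model under loud assumptions, not the authors' code: the table has 8 one-slot buckets, key 0 marks an empty slot (so keys must be nonzero), `h1`/`h2` are made-up hash functions, a single global `std::mutex` stands in for the writer lock, and concurrent readers are not handled.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

constexpr std::size_t kBuckets = 8;  // hypothetical toy size

struct ToyTable {
    std::array<std::atomic<std::uint64_t>, kBuckets> slot{};  // 0 = empty
    std::mutex write_lock;  // one global writer lock, for simplicity
};

inline std::size_t h1(std::uint64_t k) { return k % kBuckets; }
inline std::size_t h2(std::uint64_t k) { return (k / kBuckets) % kBuckets; }
inline std::size_t alt(std::uint64_t k, std::size_t b) {
    return b == h1(k) ? h2(k) : h1(k);
}

struct Step { std::size_t bucket; std::uint64_t seen; };  // key observed during search

// Search WITHOUT holding the lock. Returns an empty vector if no path is found.
std::vector<Step> search_path(const ToyTable& t, std::size_t start) {
    std::vector<Step> path;
    std::size_t b = start;
    for (std::size_t i = 0; i < kBuckets; ++i) {
        const std::uint64_t k = t.slot[b].load(std::memory_order_acquire);
        path.push_back({b, k});
        if (k == 0) return path;  // reached an empty slot
        b = alt(k, b);            // follow the occupant to its other bucket
    }
    path.clear();
    return path;
}

bool insert(ToyTable& t, std::uint64_t key, int max_attempts = 16) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        const std::vector<Step> path = search_path(t, h1(key));  // unlocked
        if (path.empty()) return false;

        std::lock_guard<std::mutex> guard(t.write_lock);  // lock only now
        bool valid = true;  // another writer may have changed the path meanwhile
        for (const Step& s : path) {
            if (t.slot[s.bucket].load(std::memory_order_relaxed) != s.seen) {
                valid = false;
                break;
            }
        }
        if (!valid) continue;  // guard is released; search again

        // Path is still valid: move the hole backwards, then store the new key.
        for (std::size_t i = path.size() - 1; i > 0; --i)
            t.slot[path[i].bucket].store(path[i - 1].seen, std::memory_order_release);
        t.slot[path[0].bucket].store(key, std::memory_order_release);
        return true;
    }
    return false;
}
```

The critical section now covers only the validation and the handful of writes along the path; the potentially long search runs outside it and is simply repeated if another writer invalidated the path.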

SLIDE 21

Cuckoo hash table ⟹ undirected cuckoo graph

bucket ⟶ vertex, key ⟶ edge

[Diagram: keys x, a, c, b, y, z stored in numbered buckets, redrawn as a graph with the buckets as vertices and the keys as edges]

SLIDE 22

Previous approach to search for an empty slot: random walk on the cuckoo graph

[Diagram: Insert y follows a random walk through buckets holding a, e, s, x, k, f, d, t until it reaches an empty slot ∅]

cuckoo path: a➝e➝s➝x➝k➝f➝d➝t➝∅ (9 writes)

One Insert may move at most hundreds of items when table occupancy > 90%

SLIDE 23

Breadth-first search for an empty slot

[Diagram: Insert y; the random-walk path through a, e, s, x, k, f, d, t to ∅, shown next to a breadth-first search tree that starts at a and reaches ∅ via z and u]

SLIDE 24

Breadth-first search for an empty slot

[Diagram: Insert y; breadth-first search tree that starts at a and reaches an empty slot ∅ via z and u]

cuckoo path: a➝z➝u➝∅ (4 writes)

Reduced to a logarithmic factor
– Same # of reads ⟶ unlocked
– Far fewer writes ⟶ locked

Prefetching: scan one bucket and load next bucket concurrently
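A breadth-first search for the shortest cuckoo path might look like the sketch below. It is an illustration under assumptions rather than the paper's code: buckets are 2-way to keep the example small, key 0 marks an empty slot, `h1`/`h2` are made-up hash functions, `max_nodes` is a hypothetical search budget, and prefetching is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kSlots = 2;  // assumed 2-way buckets to keep the example small

struct Bucket { std::uint64_t key[kSlots] = {}; };  // key 0 marks an empty slot

struct Table {
    std::vector<Bucket> buckets;
    explicit Table(std::size_t n) : buckets(n) {}
    std::size_t h1(std::uint64_t k) const { return k % buckets.size(); }
    std::size_t h2(std::uint64_t k) const { return (k / buckets.size()) % buckets.size(); }
    std::size_t alt(std::uint64_t k, std::size_t b) const {
        return b == h1(k) ? h2(k) : h1(k);
    }
};

struct Hop { std::size_t bucket; int slot; };  // one slot on the cuckoo path
struct Node { std::size_t bucket; int parent; int parent_slot; };  // BFS tree node

// Returns the shortest path from one of the key's two buckets to an empty
// slot, front to back; returns an empty vector if the budget is exhausted.
std::vector<Hop> bfs_path(const Table& t, std::uint64_t key, std::size_t max_nodes = 64) {
    std::vector<Node> nodes;  // doubles as the BFS queue
    nodes.push_back({t.h1(key), -1, -1});
    nodes.push_back({t.h2(key), -1, -1});
    for (std::size_t head = 0; head < nodes.size() && nodes.size() <= max_nodes; ++head) {
        const Node cur = nodes[head];  // copy: push_back below may reallocate
        const Bucket& b = t.buckets[cur.bucket];
        for (int s = 0; s < kSlots; ++s) {
            if (b.key[s] != 0) continue;
            // Found an empty slot: walk parent links back to the root.
            std::vector<Hop> path;
            path.push_back({cur.bucket, s});
            for (int n = static_cast<int>(head); nodes[n].parent != -1; n = nodes[n].parent)
                path.push_back({nodes[nodes[n].parent].bucket, nodes[n].parent_slot});
            std::reverse(path.begin(), path.end());
            return path;
        }
        // Bucket is full: each occupant's alternate bucket becomes a child.
        for (int s = 0; s < kSlots; ++s)
            nodes.push_back({t.alt(b.key[s], cur.bucket), static_cast<int>(head), s});
    }
    return {};
}
```

Because the search expands level by level, the first empty slot it finds is at minimum depth, so the number of writes in the locked phase grows with the depth of the tree rather than with the number of buckets examined.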

SLIDE 25

Concurrency control

Fine-grained locking
– spinlock and lock striping

Hardware transactional memory
– Intel Transactional Synchronization Extensions (TSX)
– Hardware support for lock elision
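As a rough illustration of "spinlock and lock striping", the sketch below maps buckets onto a small fixed array of spinlocks. The class names, the stripe count of 16, and the modulo mapping are assumptions for the example; the slides do not specify these details.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <utility>

// Minimal test-and-set spinlock.
class Spinlock {
    std::atomic<bool> locked_{false};
public:
    void lock() { while (locked_.exchange(true, std::memory_order_acquire)) { /* spin */ } }
    bool try_lock() { return !locked_.exchange(true, std::memory_order_acquire); }
    void unlock() { locked_.store(false, std::memory_order_release); }
};

constexpr std::size_t kStripes = 16;  // hypothetical; far fewer locks than buckets

// Lock striping: bucket i is protected by lock (i mod kStripes), so the lock
// table stays small while unrelated buckets rarely contend.
class StripedLocks {
    std::array<Spinlock, kStripes> locks_;
public:
    static std::size_t stripe_of(std::size_t bucket) { return bucket % kStripes; }

    // Lock the stripes of two buckets in ascending stripe order so that two
    // threads locking the same pair can never deadlock; if both buckets share
    // a stripe, the lock is taken only once.
    void lock_pair(std::size_t b1, std::size_t b2) {
        std::size_t s1 = stripe_of(b1), s2 = stripe_of(b2);
        if (s1 > s2) std::swap(s1, s2);
        locks_[s1].lock();
        if (s2 != s1) locks_[s2].lock();
    }
    void unlock_pair(std::size_t b1, std::size_t b2) {
        const std::size_t s1 = stripe_of(b1), s2 = stripe_of(b2);
        locks_[s1].unlock();
        if (s2 != s1) locks_[s2].unlock();
    }

    bool try_lock(std::size_t bucket) { return locks_[stripe_of(bucket)].try_lock(); }
    void unlock(std::size_t bucket) { locks_[stripe_of(bucket)].unlock(); }
};
```

A cuckoo move touches two buckets at a time, which is why the sketch offers `lock_pair`; longer paths can be executed one displacement at a time under successive pair locks.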

SLIDE 26

Lock elision

[Diagram: timeline of Thread 1 and Thread 2, each running acquire, critical section, release on the hash table at the same time while the lock stays free (Lock: Free)]

No serialization if no data conflicts

SLIDE 27

Implement lock elision with Intel TSX

[Flowchart: LOCK ⟶ START TX ⟶ execute Critical Section ⟶ on success, COMMIT ⟶ UNLOCK; on abort, either retry the transaction or fall back to the lock]

Abort reasons:
– data conflicts
– limited HW resources
– unfriendly instructions

Retry or fallback? Optimized to make better decisions

SLIDE 28

Principles to reduce transactional aborts

1. Minimize the size of transactional regions.
– Algorithmic optimizations: lock later, BFS, increase set-associativity

Maximum size of transactional regions

                 previous cuckoo [Fan, NSDI'13]    optimized cuckoo
cuckoo search    500 reads                         none
cuckoo move      250 writes                        5 writes/reads

SLIDE 29

Principles to reduce transactional aborts

2. Avoid unnecessary access to common data.
– Make globals thread-local

3. Avoid TSX-unfriendly instructions in transactions
– e.g., malloc() may cause problems

4. Optimize TSX lock elision implementation
– Elide the lock more aggressively for short transactions

SLIDE 30

Evaluation

How does the performance scale?
– throughput vs. # of cores

How much does each technique improve performance?
– algorithmic optimizations
– lock elision with Intel TSX

SLIDE 31

Experiment settings

Platform
– Intel Haswell i7-4770 @ 3.4GHz (with TSX support)
– 4 cores (8 hyper-threaded cores)

Cuckoo hash table
– 8 byte keys and 8 byte values
– 2 GB hash table, ~134.2 million entries
– 8-way set-associative

Workloads
– Fill an empty table to 95% capacity
– Random mixed reads and writes
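The table dimensions above can be checked with a little arithmetic, sketched here as compile-time constants. The sketch assumes "2 GB" means 2^31 bytes and that each entry occupies exactly 16 bytes with no per-entry metadata; the constant names are made up, and the bucket count and bucket size are derived figures rather than numbers stated on the slide.

```cpp
#include <cstdint>

constexpr std::uint64_t kTableBytes = 2ULL << 30;      // 2 GB, assumed to be 2^31 bytes
constexpr std::uint64_t kEntryBytes = 8 + 8;           // 8-byte key + 8-byte value
constexpr std::uint64_t kSlotsPerBucket = 8;           // 8-way set-associative

constexpr std::uint64_t kEntries = kTableBytes / kEntryBytes;       // total slots
constexpr std::uint64_t kNumBuckets = kEntries / kSlotsPerBucket;   // derived
constexpr std::uint64_t kBucketBytes = kEntryBytes * kSlotsPerBucket;
constexpr std::uint64_t kFillTarget = kEntries * 95 / 100;          // 95% capacity

static_assert(kEntries == 134217728ULL, "2^27 slots, i.e. about 134.2 million entries");
static_assert(kBucketBytes == 128, "one bucket spans two 64-byte cache lines");
```

Under these assumptions, filling to 95% capacity means inserting roughly 127.5 million objects.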

SLIDE 32

Multi-core scaling comparison (50% Insert)

[Line chart: throughput (MOPS, axis 0 to 30) vs. number of threads (1 to 8) for cuckoo+ w/ TSX lock elision, cuckoo+ w/ fine-grained locking, Intel TBB concurrent_hash_map, cuckoo w/ TSX, cuckoo+, and cuckoo]

cuckoo: single-writer/multi-reader [Fan, NSDI'13]
cuckoo+: cuckoo with our algorithmic optimizations

SLIDE 33

Multi-core scaling comparison (10% Insert)

[Line chart: throughput (MOPS, axis 0 to 50) vs. number of threads (1 to 8) for cuckoo+ w/ TSX lock elision, cuckoo+ w/ fine-grained locking, cuckoo+, cuckoo w/ TSX, Intel TBB concurrent_hash_map, and cuckoo]

cuckoo: single-writer/multi-reader [Fan, NSDI'13]
cuckoo+: cuckoo with our algorithmic optimizations

SLIDE 34

Factor analysis of Insert performance

cuckoo: multi-reader single-writer cuckoo hashing [Fan, NSDI'13]
+TSX-glibc: use released Intel glibc TSX lock elision
+TSX*: replace TSX-glibc with our optimized implementation
+lock later: lock after discovering a cuckoo path
+BFS: breadth first search for an empty slot

SLIDE 35

Lock elision enabled first and algorithmic optimizations applied later

[Chart: 100% Insert]

SLIDE 36

Algorithmic optimizations applied first and lock elision enabled later

[Chart: 100% Insert]

Both data structure and concurrency control optimizations are needed to achieve high performance

SLIDE 37

Conclusion

Concurrent cuckoo hash table
– high memory efficiency
– fast concurrent writes and reads

Lessons with hardware transactional memory
– algorithmic optimizations are necessary
