Lock-free Concurrent Level Hashing for Persistent Memory



slide-1
SLIDE 1

Lock-free Concurrent Level Hashing for Persistent Memory

Zhangyu Chen, Yu Hua, Bo Ding, Pengfei Zuo Huazhong University of Science and Technology

USENIX ATC 2020

slide-2
SLIDE 2

Persistent Memory (PM)

  • PM features

− Non-volatility
− Large capacity
− Byte-addressability
− DRAM-scale latency

  • PM speeds up storage systems

− TB-scale memory for applications
− Instant recovery from system failures

  • Intel Optane DC Persistent Memory

− 512 GB per module at most
− DIMM compatible

slide-3
SLIDE 3

PM Optimization

  • 1. High overhead for writes

− Limited endurance
− Low write bandwidth of PM (Optane PM study in FAST ’20)
    • 1/6 that of DRAM
    • 1/3 of the read bandwidth of PM
slide-4
SLIDE 4

PM Optimization

  • 1. High overhead for writes
  • 2. Inconsistency due to non-volatility

− Partial update: Copy-on-Write (CoW) or logging

[Figure: a 32 B update travels from the volatile CPU cache over the bus to persistent memory, where only 8-byte writes are atomic. A failure part-way through leaves a partial update (inconsistency). CoW or logging, persisted with clwb and sfence, avoids this at the cost of 2x writes.]

slide-13
SLIDE 13

PM Optimization

  • 1. High overhead for writes
  • 2. Inconsistency due to non-volatility

− Partial update: Copy-on-Write (CoW) or logging
− Reordering: memory fences

kv_t *item = new kv_t(k, v);
clwb(item); sfence;
slots[0] = item;

[Figure: in program order, item is written before slots[0]. The CPU cache may write the two back to PM in a different order (cache reordering), so after a crash slots[0] can refer to an item that never reached PM (inconsistency). Flushing item with clwb and issuing sfence before the slot is updated keeps the program order.]
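The ordering rule on this slide can be sketched as a small helper. This is a minimal illustration, not the authors' code: `persist` and `publish` are hypothetical names, and `persist` is only a portable stand-in (a release fence) so that the sketch compiles on any machine. On real PM hardware one would likely flush each cache line with `_mm_clwb` and order the flushes with `_mm_sfence` from `<immintrin.h>`, or call a PM library.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct kv_t {
    std::uint64_t key;
    std::uint64_t value;
};

// Hypothetical stand-in for "clwb over the range, then sfence".
// It only issues a fence here so the sketch runs anywhere.
inline void persist(const void* addr, std::size_t len) {
    (void)addr;
    (void)len;
    std::atomic_thread_fence(std::memory_order_release);
}

// Persist the item first, and only then make it reachable through the slot.
inline void publish(std::atomic<kv_t*>& slot, kv_t* item) {
    assert(item != nullptr);
    persist(item, sizeof(kv_t));                  // item is durable first
    slot.store(item, std::memory_order_release);  // 8-byte atomic pointer write
    persist(&slot, sizeof(slot));                 // then the slot itself
}
```

With this order, a crash can leave an unreferenced item in PM, but never a slot that refers to an item that was not persisted.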

slide-18
SLIDE 18

PM Index Structures

  • PM index structures are important for large-scale storage systems to provide fast queries
  • Hashing-based structures

− O(1) time complexity for point query

  • Tree-based structures

[Figure: a key is mapped directly to its slot by Hash(key).]

slide-22
SLIDE 22

Hashing Collisions and Resizing

  • Hash collisions

− Linear probing
− Linked list

  • Resizing

− Rehashing from the old hash table into the new hash table: high latency!

[Figure: key y hashes to a slot already occupied by x (collision). Linear probing stores y at some probing distance from that slot; a linked list chains y behind x.]

slide-27
SLIDE 27

Concurrent PM Hashing

  • The importance of concurrency

− Fast indexing for TB-scale PM data
− Multi-core environment for servers equipped with Optane PM

  • Concurrency for PM hashing

− Concurrent queries with correctness
    • Multi-reader concurrency
    • Multi-writer concurrency
− Concurrent resizing

slide-28
SLIDE 28

PM Variants of Concurrent Hashing

  • CCEH [FAST ’19]

− Segment reader/writer locks for queries
− Dynamic resizing with segment splitting and directory doubling

  • P-CLHT [SOSP ’19]

− Lock-free search and bucket lock for writes
− Full-table resizing with one helper thread

Coarse-grained locks! Resizing blocks queries!

[Figure: a CCEH directory with entries 00, 01, 10, 11 (binary) pointing to Segment 0 and Segment 1, each holding Bucket 0 to Bucket 255. During P-CLHT resizing, Thread-1 resizes, Thread-2 helps resizing, and Thread-3~n wait for resizing to finish.]

slide-35
SLIDE 35

Level Hashing [OSDI ’18]

  • PM-friendly hashing index

− Two-level bucketized hash table with one-step movement: write efficiency
− Low-overhead consistency guarantee via atomic token update: crash consistency
− Rehashing 1/3 buckets for one resizing

[Figure: a two-level structure with a top level (bucket indexes up to 2N-1) and a bottom level (bucket indexes up to N-1). A key maps to candidate buckets through h1(key) and h2(key). A 4-slot bucket holds tokens and slots KV1 to KV4; tokens are updated atomically. One-step movement costs one extra write at most. Resizing adds a new top level with bucket indexes up to 4N-1.]
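The "1/3" figure can be checked from the bucket counts in the diagram. This is a sketch under two assumptions read off the figure labels rather than stated on the slide: the top level has 2N buckets and the bottom level has N buckets, and one resizing rehashes only the bottom level into the new 4N-bucket top level.

```latex
\frac{\text{rehashed buckets}}{\text{buckets before resizing}}
  = \frac{N}{2N + N}
  = \frac{1}{3}
```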

slide-43
SLIDE 43

Concurrency in Level Hashing

  • Slot-grained lock for queries

− Thread-1 runs search(x) while Thread-2 runs insert(key) and relocates x
− Thread-1 finds no "x": missing inserted items!

  • Single-thread blocking resizing

− Thread-1 runs insert(key) and triggers resizing
− Thread-2~n wait for resizing to finish: resizing blocks queries!

Concurrency is the bottleneck

slide-49
SLIDE 49

Challenges for PM Hashing

  • Challenges

− Performance degradation for blocking resizing
    • High latency for coarse-grained locks
− Limited scalability for lock-based concurrency control
    • Lock constraint for concurrent accesses
    • Persisting overheads in the critical path

  • Design goals

− A PM-friendly and high-concurrency hashing scheme

slide-50
SLIDE 50

Our Approach: Clevel Hashing

  • Dynamic multi-level structure w/o extra writes for insertion

− Write-optimal insertion

  • Asynchronous rehashing w/o blocking concurrent queries

− Non-blocking resizing

  • Lock-free concurrency control

− Lock-free queries

[Figure: worker threads and rehashing threads each hold a thread-local context pointer. The global context pointer refers to a context with first_level, last_level and is_resizing, which point into the level list. A key maps to buckets through H1(key) and H2(key). Rehashing threads move items between levels.]

slide-54
SLIDE 54

Components

  • Dynamic Multi-level Structure
  • Non-blocking Resizing
  • Lock-free Concurrency Control
slide-56
SLIDE 56

Dynamic Multi-level Structure

  • Support for variable-length items

− Store pointers in slots and actual items outside of the table

  • Write-optimized hash table

− 8 slots per bucket (each slot is 8 bytes and holds a pointer, KV_PTR1 to KV_PTR8)
− 2 candidate buckets in one level, selected by H1(key) and H2(key)
− Sharing-based multiple levels
    • Add a level for resizing
    • Remove one when rehashing completes

No extra writes for insertion: write-optimal
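A minimal sketch of how a key could be mapped to its two candidate buckets in one level. The slides only name H1(key) and H2(key); the concrete hash functions below (`std::hash` and a hypothetical remixing step `mix`) and the function name `candidates` are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Hypothetical second hash: remix the first hash with a 64-bit finalizer.
inline std::uint64_t mix(std::uint64_t h) {
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    return h;
}

// Returns the two candidate bucket indexes of `key` in a level that has
// `buckets` buckets. The two indexes may coincide for some keys.
inline std::pair<std::size_t, std::size_t>
candidates(const std::string& key, std::size_t buckets) {
    assert(buckets > 0);
    const std::uint64_t h1 = std::hash<std::string>{}(key);
    const std::uint64_t h2 = mix(h1 + 0x9e3779b97f4a7c15ULL);
    return {static_cast<std::size_t>(h1 % buckets),
            static_cast<std::size_t>(h2 % buckets)};
}
```

With 8 slots per bucket, a key then has 16 possible slots per level, and the same function is applied to every level in the level list with that level's bucket count.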

slide-62
SLIDE 62

The Support for Concurrent Resizing

  • Level list

− A linked list to associate levels

  • Context

− A metadata structure including:
    • first_level (the largest level)
    • last_level
    • is_resizing

[Figure: the global context pointer refers to the context, whose first_level and last_level point into the level list. A key maps to candidate buckets through H1(key) and H2(key).]

slide-65
SLIDE 65

Non-blocking Resizing

  • Resizing steps

1. Make a local copy of the global context pointer
2. CAS to append a new level
3. CoW + CAS to update the first_level
4. Rehash items in the last level
5. CoW + CAS to update the last_level

  • Lightweight CoW

− Context size: 17 bytes (last_level 8 B, first_level 8 B, is_resizing 1 B)

  • Non-blocking resizing scheme

− Rehashing threads: rehash until there are 2 levels left

[Figure: a timeline of alternating expansion stages and rehashing stages. Rehashing threads carry out the resizing steps in the background while worker threads keep serving queries.]
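The "CoW + CAS" steps can be sketched as follows. This is a simplified illustration under assumptions: `Level` is reduced to a placeholder, `set_first_level` is a hypothetical name, and both persistence of the new context and reclamation of the old one are left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Placeholder: a real level is an array of buckets.
struct Level {
    std::size_t buckets;
};

// The three fields named on the slides: 8 B + 8 B + 1 B of payload.
struct Context {
    Level* first_level;
    Level* last_level;
    bool is_resizing;
};

// Copy-on-write update: copy the context the caller observed, change the
// copy, and swing the global pointer with one compare-and-swap. Returns false
// if another thread replaced the context first; the caller then re-reads the
// global pointer and retries.
inline bool set_first_level(std::atomic<Context*>& global, Context* seen,
                            Level* new_first) {
    assert(seen != nullptr && new_first != nullptr);
    Context* next = new Context(*seen);
    next->first_level = new_first;
    if (global.compare_exchange_strong(seen, next)) {
        return true;  // old context is now unreachable; reclamation omitted
    }
    delete next;      // lost the race, discard the private copy
    return false;
}
```

Because the context is small, copying it is cheap, and readers always see either the old context or the new one through a single 8-byte pointer.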

slide-75
SLIDE 75

Lock-free Search

  • High latency for pointer dereference

− Summary tags
    • A tag is the summary for a key
    • Leverage the unused 16 highest bits of a pointer in x86_64 to store the tag (2 B)
    • Update tag and pointer in an atomic manner

[Figure: a bucket of eight slots, KV_PTR1 to KV_PTR8; each slot holds a 2 B tag followed by the pointer.]
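A minimal sketch of packing a 16-bit tag into the upper bits of a pointer word. The helper names (`pack`, `tag_of`, `ptr_of`) are hypothetical, and the sketch assumes 64-bit pointers whose upper 16 bits are zero, as for ordinary user-space addresses on x86_64.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kTagShift = 48;
constexpr std::uint64_t kPtrMask = (std::uint64_t{1} << kTagShift) - 1;

// Combine a 16-bit tag and a pointer into one 8-byte slot word.
inline std::uint64_t pack(std::uint16_t tag, const void* ptr) {
    const std::uint64_t bits = reinterpret_cast<std::uintptr_t>(ptr);
    assert((bits & ~kPtrMask) == 0);  // upper 16 bits must be unused
    return (std::uint64_t{tag} << kTagShift) | bits;
}

// Extract the tag without touching the pointed-to item.
inline std::uint16_t tag_of(std::uint64_t slot) {
    return static_cast<std::uint16_t>(slot >> kTagShift);
}

// Extract the pointer; dereference it only after the tag has matched.
inline void* ptr_of(std::uint64_t slot) {
    return reinterpret_cast<void*>(
        static_cast<std::uintptr_t>(slot & kPtrMask));
}
```

Since tag and pointer share one 8-byte word, a single atomic store or CAS updates both, and a search can skip slots whose tag differs without dereferencing the pointer.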

slide-77
SLIDE 77

Lock-free Search

  • High latency for pointer dereference

− Summary tags
    • A tag is the summary for a key
    • Leverage the unused 16 highest bits of a pointer in x86_64 to store the tag

  • Missing items due to rehashing

− Bottom-to-top (b2t) search
    • Search from the last level to the first level
    • Redo the search when no item is found and the context changes

[Figure: Thread-2 rehashes a pointer out of the last level while Thread-1 searches, so Thread-1 can miss the item. A b2t search by Thread-1 avoids the miss.]
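The b2t search loop can be sketched with a toy model. This is an illustration under loud assumptions: each level is modeled as a plain `std::unordered_map` (not thread-safe, unlike the real bucket arrays), the level list is a simple `up` pointer, and the names `Level`, `Context` and `b2t_search` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy level: a real level is an array of 8-slot buckets.
struct Level {
    std::unordered_map<std::uint64_t, std::uint64_t> items;
    Level* up;  // next (larger) level in the level list
    Level() : up(nullptr) {}
};

struct Context {
    Level* first_level;  // largest level
    Level* last_level;   // smallest level, the one being rehashed
};

// Search from the last level up to the first level. If nothing is found and
// the global context changed during the scan, the result cannot be trusted,
// so the whole search is redone with the new context.
inline bool b2t_search(const std::atomic<Context*>& global, std::uint64_t key,
                       std::uint64_t* out) {
    for (;;) {
        Context* ctx = global.load(std::memory_order_acquire);
        for (Level* lv = ctx->last_level; lv != nullptr; lv = lv->up) {
            auto it = lv->items.find(key);
            if (it != lv->items.end()) {
                *out = it->second;
                return true;
            }
            if (lv == ctx->first_level) break;
        }
        if (global.load(std::memory_order_acquire) == ctx) return false;
    }
}
```

Scanning in the same direction in which rehashing moves items is what lets a moved item still be encountered later in the scan.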

slide-80
SLIDE 80

Lock-free Insertion

  • Basic workflow

− Allocate the new item in PM
− B2t search to find duplicate keys
− Insert the pointer via CAS

  • Duplicate items from concurrent insertions

− Both items are allowed for read
− Fix duplication in future update and deletion

  • Loss of new items due to rehashing

− Context-aware insertion
    • Do not insert into the last level while it is being rehashed
    • Redo insertion for possible loss

[Figure: Thread-1 and Thread-2 both run insert(x) into the candidate buckets H1(x) and H2(x) and create duplicates. Separately, Thread-1 runs insert(x) while Thread-2 is rehashing, and the new item is lost.]
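The "insert the pointer via CAS" step can be sketched for a single bucket. This is a simplified illustration: `Bucket`, `try_insert` and the use of 0 as the empty marker are assumptions, and the sketch leaves out persistence, the duplicate check, and the context re-check that context-aware insertion performs.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSlotsPerBucket = 8;
constexpr std::uint64_t kEmpty = 0;

// One bucket: eight 8-byte slots, each holding a pointer word or kEmpty.
struct Bucket {
    std::array<std::atomic<std::uint64_t>, kSlotsPerBucket> slots;
    Bucket() {
        for (auto& s : slots) s.store(kEmpty, std::memory_order_relaxed);
    }
};

// Try to claim an empty slot for `item` (a non-zero pointer word).
// Returns the slot index, or -1 if every slot is taken.
inline int try_insert(Bucket& b, std::uint64_t item) {
    assert(item != kEmpty);
    for (std::size_t i = 0; i < kSlotsPerBucket; ++i) {
        std::uint64_t expected = kEmpty;
        if (b.slots[i].compare_exchange_strong(expected, item)) {
            return static_cast<int>(i);
        }
        // CAS failed: the slot is occupied, possibly claimed by another
        // thread a moment ago. Move on to the next slot.
    }
    return -1;
}
```

A caller would try both candidate buckets of a level this way and fall back to resizing when neither has a free slot. Because the only shared write is one CAS on an 8-byte slot, no lock is needed.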

slide-89
SLIDE 89

Lock-free Update

33

  • Inconsistency for duplicate items
slide-90
SLIDE 90

... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

Lock-free Update

33

  • Inconsistency for duplicate items

− Concurrent insertions with the same key

...

2N-1 2N-2 2N-3 1 2

Thread-1: insert(x) Thread-2: insert(x)

slide-91
SLIDE 91

... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

Lock-free Update

33

  • Inconsistency for duplicate items

− Concurrent insertions with the same key − Retry of context-aware insertion

...

2N-1 2N-2 2N-3 1 2

... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

...

2N-1 2N-2 2N-3 1 2

...

Thread-1: insert(x) Thread-2: insert(x) Thread-1: insert(x) Thread-1: redo insert(x)

slide-92
SLIDE 92

... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

Lock-free Update

33

  • Inconsistency for duplicate items

− Concurrent insertions with the same key − Retry of context-aware insertion − Data movement for rehashing

...

2N-1 2N-2 2N-3 1 2

... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

...

2N-1 2N-2 2N-3 1 2

... ... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

...

2N-1 2N-2 2N-3 1 2

...

Rehashing thread

Thread-1: insert(x) Thread-2: insert(x) Thread-1: insert(x) Thread-1: redo insert(x)

slide-93
SLIDE 93

... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

Lock-free Update

33

  • Inconsistency for duplicate items

− Concurrent insertions with the same key − Retry of context-aware insertion − Data movement for rehashing

...

2N-1 2N-2 2N-3 1 2

... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

...

2N-1 2N-2 2N-3 1 2

... ... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

...

2N-1 2N-2 2N-3 1 2

...

Rehashing thread

Thread-1: insert(x) Thread-2: insert(x) Thread-1: insert(x) Thread-1: redo insert(x)

Two pointers to different items

slide-94
SLIDE 94

... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

Lock-free Update

33

  • Inconsistency for duplicate items

− Concurrent insertions with the same key − Retry of context-aware insertion − Data movement for rehashing

...

2N-1 2N-2 2N-3 1 2

... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

...

2N-1 2N-2 2N-3 1 2

... ... ...

2N-1 2N-2 2N-3 1 2

...

N-1 N-2 1

... ...

Level list

...

2N-1 2N-2 2N-3 1 2

...

Rehashing thread

Thread-1: insert(x) Thread-2: insert(x) Thread-1: insert(x) Thread-1: redo insert(x)

Two pointers to different items Two pointers to the same item

slide-98
SLIDE 98

Lock-free Update

  • Inconsistency for duplicate items

− Concurrent insertions with the same key
− Retry of context-aware insertion
− Data movement for rehashing

[Figure: the level list, with levels whose buckets are indexed up to 2N-1 and N-1, and a rehashing thread moving items into a new level. Thread-1: insert(x), Thread-2: insert(x), and Thread-1: redo insert(x) leave duplicates for key x: either two pointers to different items or two pointers to the same item.]

  • Content-conscious Find

− b2t (bottom-to-top) search to find two pointers to duplicate items
− Check whether the two pointers refer to the same item

  • Yes: delete the first pointer matching the key
  • No: delete the first pointer matching the key and its corresponding item
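
The duplicate-resolution rule of content-conscious Find can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Item`, `Slot`, `LevelList`, `Hit`, and `content_conscious_find` are hypothetical names, the level list is reduced to one bottom and one top level, only the first two matches are resolved, and the stale item is freed with a plain `delete`, whereas a real lock-free PM index would need safe memory reclamation and persistence.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <vector>

// Hypothetical key-value item stored out of line; slots hold pointers to it.
struct Item {
    std::string key;
    int value;
};

// A slot is an atomically updatable pointer (nullptr means empty).
using Slot = std::atomic<Item*>;

// Simplified level list: one bottom level (N slots) and one top level (2N slots).
struct LevelList {
    std::vector<Slot> bottom;
    std::vector<Slot> top;
    explicit LevelList(std::size_t n) : bottom(n), top(2 * n) {}
};

// A matching slot together with the item pointer observed in it.
struct Hit {
    Slot* slot;
    Item* item;
};

// Search bottom-to-top (b2t), resolve one pair of duplicates, and return the
// surviving item, or nullptr if the key is absent.
Item* content_conscious_find(LevelList& table, const std::string& key) {
    std::vector<Hit> hits;
    for (std::vector<Slot>* level : {&table.bottom, &table.top}) {
        for (Slot& slot : *level) {
            Item* item = slot.load(std::memory_order_acquire);
            if (item != nullptr && item->key == key) hits.push_back({&slot, item});
        }
    }
    if (hits.empty()) return nullptr;
    if (hits.size() == 1) return hits[0].item;

    Hit first = hits[0];   // met first in the b2t search
    Hit second = hits[1];
    // Clear the first pointer with CAS so a concurrent change is not overwritten.
    Item* expected = first.item;
    bool removed = first.slot->compare_exchange_strong(
        expected, nullptr, std::memory_order_acq_rel);
    // Different items: the item behind the first pointer is dropped as well.
    // Same item: only the extra pointer is removed.
    if (removed && first.item != second.item) delete first.item;
    return second.item;
}
```

In both cases the pointer found later in the b2t search survives, so repeated calls converge on a single pointer per key.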

slide-105
SLIDE 105

Failures of Lock-free Update

  • Update failures due to interleaved update and rehashing
  • Baseline: two-round Find for update
  • Optimization: redo Find only when all of the following hold:

− The table is resizing
− The updated bucket is in the last level
− The bucket index is in one of the regions already processed by rehashing threads

[Figure: timeline of the failure case on the level list. Thread-1 (update) runs Find and then update; Thread-2 (rehashing) runs copy and then delete. The steps interleave as Find, copy, update, delete, so the update lands on the old item after it has been copied and is lost when the old item is deleted.]
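
The three conditions for redoing Find can be read as a single predicate. The sketch below is one possible encoding under assumptions: `ProcessedRegion`, `UpdateContext`, and `must_redo_find` are hypothetical names, and representing each rehashing thread's processed region as a half-open index range is an illustrative choice rather than something the slides specify.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping: half-open range [begin, end) of last-level bucket
// indices that one rehashing thread has already processed.
struct ProcessedRegion {
    std::size_t begin;
    std::size_t end;
};

// What an update knows right after it has modified a bucket.
struct UpdateContext {
    bool table_is_resizing;
    bool bucket_in_last_level;
    std::size_t bucket_index;
};

// Returns true only when all three conditions hold at the same time; in every
// other case the second Find of the two-round baseline is skipped.
bool must_redo_find(const UpdateContext& ctx,
                    const std::vector<ProcessedRegion>& regions) {
    if (!ctx.table_is_resizing || !ctx.bucket_in_last_level) return false;
    for (const ProcessedRegion& r : regions) {
        if (ctx.bucket_index >= r.begin && ctx.bucket_index < r.end) return true;
    }
    return false;
}
```

The cheap boolean checks come first, so the region scan runs only while a resize is in progress and the update touched the last level.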

slide-106
SLIDE 106

Lock-free Deletion

  • Delete matched pointers atomically via CAS
  • Inconsistency due to duplicate items

− Instead of Find, delete all matched items in b2t search

  • Deletion failures due to interleaved deletion and rehashing

− Similar optimizations to avoid frequent re-execution of deletion
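
A minimal C++ sketch of CAS-based deletion during a bottom-to-top scan is shown below. It assumes a simplified table with one bottom and one top level; `Entry`, `EntrySlot`, and `lock_free_delete` are hypothetical names, and the sketch only clears pointers, leaving item reclamation, persistence, and the rehashing checks out of scope.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical out-of-line key-value entry referenced by slot pointers.
struct Entry {
    std::uint64_t key;
    std::uint64_t value;
};

using EntrySlot = std::atomic<Entry*>;

// Scan bottom-to-top and atomically clear every slot whose entry matches
// `key`. Returns the number of pointers removed, so duplicates are all deleted.
std::size_t lock_free_delete(std::vector<EntrySlot>& bottom,
                             std::vector<EntrySlot>& top,
                             std::uint64_t key) {
    std::size_t removed = 0;
    std::vector<EntrySlot>* order[] = {&bottom, &top};
    for (std::vector<EntrySlot>* level : order) {
        for (EntrySlot& slot : *level) {
            Entry* seen = slot.load(std::memory_order_acquire);
            while (seen != nullptr && seen->key == key) {
                // CAS succeeds only if the slot still holds the pointer we saw.
                if (slot.compare_exchange_weak(seen, nullptr,
                                               std::memory_order_acq_rel,
                                               std::memory_order_acquire)) {
                    ++removed;
                    break;
                }
                // On failure `seen` now holds the current value; re-check it.
            }
        }
    }
    return removed;
}
```

Because each slot is cleared with its own CAS, a concurrent writer that replaces a slot between the load and the CAS is detected rather than overwritten.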

slide-107
SLIDE 107

Crash Recovery

  • Crash consistency for lock-free Clevel hashing

− Persist after PM writes
− Persist dependent metadata after loading them

  • Recovery

− Rehashing resumes from the last processed bucket

Atomic visibility enables low-overhead crash consistency
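
The recovery rule, resuming rehashing from the last processed bucket, can be illustrated with a toy model in C++. Everything here is an assumption made for illustration: `persist_stub` stands in for a real cache-line flush and fence on PM, `RehashProgress` and `rehash` are hypothetical names, buckets are plain integers, and a crash is emulated by a step budget.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a flush plus fence on real persistent memory (no-op here).
void persist_stub(const void*, std::size_t) {}

// Progress marker that would live in PM next to the table metadata.
struct RehashProgress {
    std::size_t next_bucket = 0;
};

// Move buckets from old_level to new_level starting at progress.next_bucket.
// Stops after `budget` buckets to emulate a crash. Returns buckets moved.
std::size_t rehash(const std::vector<int>& old_level,
                   std::vector<int>& new_level,
                   RehashProgress& progress,
                   std::size_t budget) {
    std::size_t moved = 0;
    while (progress.next_bucket < old_level.size() && moved < budget) {
        std::size_t i = progress.next_bucket;
        new_level.push_back(old_level[i]);          // copy the bucket
        persist_stub(&new_level.back(), sizeof(int));
        progress.next_bucket = i + 1;               // then record progress
        persist_stub(&progress.next_bucket, sizeof(progress.next_bucket));
        ++moved;
    }
    return moved;
}
```

Calling `rehash` again after an emulated crash continues from `next_bucket` instead of restarting. In this toy model a crash between the copy and the progress write would repeat one bucket; handling such duplicates is outside the sketch.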

slide-108
SLIDE 108

Experimental Setup

  • Platform

− Intel Optane DC PMM configured in App Direct mode
− 36 threads in one NUMA node
− PMDK

  • Comparisons

− LEVEL: original level hashing [OSDI ’18]
− CCEH: lazy deletion version, default probing distance (16 slots) [FAST ’19]
− CMAP: concurrent_hash_map engine from Intel pmemkv
− P-CLHT: PM variant of CLHT converted by RECIPE [SOSP ’19]
− CLEVEL: our Clevel hashing

  • Benchmark: YCSB

slide-110
SLIDE 110

Load Factor

  • Clevel hashing has a load factor comparable to that of level hashing, i.e., 86%

[Figure: load factor (%) versus inserted items (k) for P-CLHT, CCEH, LEVEL, and CLEVEL.]

slide-113
SLIDE 113

Micro-benchmarks

  • Due to using lock-free search and summary tags, Clevel hashing obtains

− 1.2× to 5.0× speedup for positive search
− 1.4× to 9.0× speedup for negative search

  • Clevel hashing achieves low latency with correctness guarantee

[Figure: two panels of average latency (us), one for positive and negative search and one for insertion, update, and deletion, comparing P-CLHT, LEVEL, CCEH, CMAP, LEVEL-TBB, CCEH-TBB, CMAP-TBB, and CLEVEL. Value labels on the plot: 46, 101, 86, 57, and 106.]

* Lack of implementation of update and deletion in open-source code

slide-115
SLIDE 115

Macro-benchmarks

  • Clevel hashing obtains up to 4.2× speedup over P-CLHT due to lock-free concurrency control and non-blocking resizing

Workload:         Load A   A    B    C
Read ratio (%):   0        50   95   100
Write ratio (%):  100      50   5    0

[Figure: throughput ratio with respect to P-CLHT for workloads Load A, A, B, and C, comparing P-CLHT, LEVEL, CCEH, CMAP, LEVEL-TBB, CCEH-TBB, CMAP-TBB, and CLEVEL. Absolute throughput labels on the plot: 1.32, 1.81, 0.45, and 0.91 M op/s. The largest gain is marked 4.2×.]

slide-116
SLIDE 116

Conclusion

  • Existing PM hashing indexes have limited considerations for concurrency
  • Clevel hashing is PM-friendly

− Write-optimal multi-level structure without extra writes for insertion
− Crash consistency by making the lock-free index persistent

  • Clevel hashing achieves high concurrency

− Non-blocking resizing that does not block queries
− Lock-free concurrency control with correctness guarantee

  • Clevel hashing achieves up to 4.2× higher throughput than P-CLHT

slide-117
SLIDE 117

Thanks! Q&A

Email: chenzy@hust.edu.cn
Homepage: https://chenzhangyu.github.io
Open-source code: https://github.com/chenzhangyu/Clevel-Hashing