SLIDE 1

Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming

Josh Triplett¹  Paul E. McKenney²  Jonathan Walpole¹

¹Portland State University  ²IBM Linux Technology Center

June 16, 2011

SLIDE 2

Synchronization = Waiting

  • Concurrent programs require synchronization
  • Synchronization requires some threads to wait on others
  • Concurrent programs spend a lot of time waiting
SLIDE 3

Locking

  • One thread accesses shared data
  • The rest wait for the lock
SLIDE 4

Locking

  • One thread accesses shared data
  • The rest wait for the lock
  • Straightforward to get right
  • Minimal concurrency
SLIDE 5

Fine-grained Locking

  • Use different locks for different data
  • Disjoint-access parallelism
  • Reduce waiting, allow multiple threads to proceed
SLIDE 6

Fine-grained Locking

  • Use different locks for different data
  • Disjoint-access parallelism
  • Reduce waiting, allow multiple threads to proceed
  • Many expensive synchronization instructions
SLIDE 7

Fine-grained Locking

  • Use different locks for different data
  • Disjoint-access parallelism
  • Reduce waiting, allow multiple threads to proceed
  • Many expensive synchronization instructions
  • Wait on memory
  • Wait on the bus
  • Wait on cache coherence
SLIDE 8

Reader-writer locking

  • Don’t make readers wait on other readers
  • Readers still wait on writers and vice versa
SLIDE 9

Reader-writer locking

  • Don’t make readers wait on other readers
  • Readers still wait on writers and vice versa
  • Same expensive synchronization instructions
  • Their cost dwarfs the actual reader critical section
SLIDE 10

Reader-writer locking

  • Don’t make readers wait on other readers
  • Readers still wait on writers and vice versa
  • Same expensive synchronization instructions
  • Their cost dwarfs the actual reader critical section
  • No actual reader parallelism; readers get serialized
SLIDE 11

Non-blocking synchronization

  • Right there in the name: non-blocking
  • So, no waiting, right?
SLIDE 12

Non-blocking synchronization

  • Right there in the name: non-blocking
  • So, no waiting, right?
  • Expensive synchronization instructions
SLIDE 13

Non-blocking synchronization

  • Right there in the name: non-blocking
  • So, no waiting, right?
  • Expensive synchronization instructions
  • All but one thread must retry
  • Useless parallelism: waiting while doing busywork
  • At best equivalent to fine-grained locking
SLIDE 14

Transactional memory

  • Non-blocking synchronization made easy
  • (Often implemented using locks for performance)
SLIDE 15

Transactional memory

  • Non-blocking synchronization made easy
  • (Often implemented using locks for performance)
  • Theoretically equivalent performance to NBS
  • In practice, somewhat more expensive
SLIDE 16

Transactional memory

  • Non-blocking synchronization made easy
  • (Often implemented using locks for performance)
  • Theoretically equivalent performance to NBS
  • In practice, somewhat more expensive
  • Fancy generic abstraction wrappers around waiting
SLIDE 17

How do we stop waiting?

  • Reader-writer locking had the right idea
  • But readers needed synchronization to wait on writers
  • Some waiting required to check for potential writers
  • Can readers avoid synchronization entirely?
SLIDE 18

How do we stop waiting?

  • Reader-writer locking had the right idea
  • But readers needed synchronization to wait on writers
  • Some waiting required to check for potential writers
  • Can readers avoid synchronization entirely?
  • Readers should not wait at all
SLIDE 19

How do we stop waiting?

  • Reader-writer locking had the right idea
  • But readers needed synchronization to wait on writers
  • Some waiting required to check for potential writers
  • Can readers avoid synchronization entirely?
  • Readers should not wait at all
  • Joint-access parallelism: Can we allow concurrent readers and writers on the same data at the same time?

SLIDE 20

How do we stop waiting?

  • Reader-writer locking had the right idea
  • But readers needed synchronization to wait on writers
  • Some waiting required to check for potential writers
  • Can readers avoid synchronization entirely?
  • Readers should not wait at all
  • Joint-access parallelism: Can we allow concurrent readers and writers on the same data at the same time?
  • What does “at the same time” mean, anyway?
SLIDE 21

Modern computers

  • Shared address space
  • Distributed memory
  • Expensive illusion of coherent shared memory
SLIDE 22

Modern computers

  • Shared address space
  • Distributed memory
  • Expensive illusion of coherent shared memory
  • “At the same time” gets rather fuzzy
SLIDE 23

Modern computers

  • Shared address space
  • Distributed memory
  • Expensive illusion of coherent shared memory
  • “At the same time” gets rather fuzzy
  • Shared address spaces make communication simple
  • Incredibly optimized communication via cache coherence
SLIDE 24

Modern computers

  • Shared address space
  • Distributed memory
  • Expensive illusion of coherent shared memory
  • “At the same time” gets rather fuzzy
  • Shared address spaces make communication simple
  • Incredibly optimized communication via cache coherence
  • When we have to communicate, let’s take advantage of that!
  • (and not just to accelerate message passing)
SLIDE 25

Relativistic Programming

  • By analogy with relativity: no absolute reference frame
  • No global order for non-causally-related events
  • Readers do no waiting at all, for readers or writers
  • Minimize expensive communication and synchronization
  • Writers do all the waiting, when necessary
  • Reads linearly scalable
SLIDE 26

What if readers see partial writes?

  • Writers must not disrupt concurrent readers
  • Data structures must stay consistent after every write
  • Writers order their writes by waiting
  • No impact to concurrent readers
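One way to picture this, as a hedged C fragment using the publish and wait-for-readers primitives introduced on the upcoming slides (spelled here as RCU's rcu_assign_pointer and synchronize_rcu; the two pointers are invented for illustration): the writer separates dependent stores with a wait, so no reader can observe the second store without the first having been visible.

    /* Writer orders two dependent updates by waiting; readers never
     * block. 'first' and 'second' are illustrative pointer names. */
    rcu_assign_pointer(first, new_first);
    synchronize_rcu();   /* every reader predating this point is done */
    rcu_assign_pointer(second, new_second);
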
SLIDE 27

Outline

  • Synchronization = Waiting
  • Introduction to Relativistic Programming
  • Relativistic synchronization primitives
  • Relativistic data structures
  • Hash-table algorithm
  • Results
  • Future work
SLIDE 28

Relativistic synchronization primitives

  • Delimited readers
  • No waiting: Notification, not permission
SLIDE 29

Relativistic synchronization primitives

  • Delimited readers
  • No waiting: Notification, not permission
  • Pointer publication
  • Ensures ordering between initialization and publication
SLIDE 30

Relativistic synchronization primitives

  • Delimited readers
  • No waiting: Notification, not permission
  • Pointer publication
  • Ensures ordering between initialization and publication
  • Updaters can wait for readers
  • Existing readers only, not new readers
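These primitives map directly onto RCU operations (as the benchmarking slides note later). A minimal C sketch of all three, using the Linux/urcu names; struct foo, gp, and the surrounding functions are placeholders, not code from the talk:

    #include <stdlib.h>
    #include <urcu.h>   /* userspace RCU; the kernel uses <linux/rcupdate.h> */

    struct foo { int x; };
    static struct foo *gp;   /* a published pointer (placeholder) */

    /* Delimited reader: lock/unlock is notification, not permission;
     * neither call ever waits. */
    int read_x(void)
    {
            struct foo *p;
            int x = -1;

            rcu_read_lock();
            p = rcu_dereference(gp);   /* ordered read of the pointer */
            if (p)
                    x = p->x;
            rcu_read_unlock();
            return x;
    }

    /* Pointer publication: orders initialization before publication. */
    void publish(int x)
    {
            struct foo *newp = malloc(sizeof(*newp));

            newp->x = x;                   /* initialize first... */
            rcu_assign_pointer(gp, newp);  /* ...then publish     */
    }

    /* Updaters wait for pre-existing readers only, via synchronize_rcu(). */
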
SLIDE 31

Example: Relativistic linked list insertion

[Figure: list nodes a, b, c with potential readers traversing]

  • Initial state of the list; writer wants to insert b.
SLIDE 32

Example: Relativistic linked list insertion

[Figure: list nodes a, b, c with potential readers traversing]

  • Initial state of the list; writer wants to insert b.
  • Initialize b’s next pointer to point to c
SLIDE 33

Example: Relativistic linked list insertion

[Figure: list nodes a, b, c with potential readers traversing]

  • Initial state of the list; writer wants to insert b.
  • Initialize b’s next pointer to point to c
  • The writer can then “publish” b to node a’s next pointer
SLIDE 34

Example: Relativistic linked list insertion

[Figure: list nodes a, b, c with potential readers traversing]

  • Initial state of the list; writer wants to insert b.
  • Initialize b’s next pointer to point to c
  • The writer can then “publish” b to node a’s next pointer
  • Readers can immediately begin observing the new node
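The same steps as a hedged C sketch; the node layout is invented for illustration, and rcu_assign_pointer stands in for pointer publication:

    struct node { int key; struct node *next; };

    /* Insert b between a and c (where c == a->next). The list stays
     * traversable by concurrent readers at every instant. */
    void insert_after(struct node *a, struct node *b)
    {
            b->next = a->next;               /* 1: b -> c, not yet visible */
            rcu_assign_pointer(a->next, b);  /* 2: publish; readers now
                                                see a -> b -> c */
    }
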
SLIDE 35

Example: Relativistic linked list removal

[Figure: list a → b → c with potential readers]

  • Initial state of the list; writer wants to remove node b
SLIDE 36

Example: Relativistic linked list removal

[Figure: list a → b → c with potential readers]

  • Initial state of the list; writer wants to remove node b
  • Sets a’s next pointer to c, removing b from the list for all future readers

SLIDE 37

Example: Relativistic linked list removal

[Figure: list a → b → c with potential readers]

  • Initial state of the list; writer wants to remove node b
  • Sets a’s next pointer to c, removing b from the list for all future readers
  • Wait for existing readers to finish
SLIDE 38

Example: Relativistic linked list removal

[Figure: list a → c with potential readers; b reclaimed]

  • Initial state of the list; writer wants to remove node b
  • Sets a’s next pointer to c, removing b from the list for all future readers
  • Wait for existing readers to finish
  • Once no readers can hold references to b, the writer can safely reclaim it.
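And the matching removal sketch; free() stands in for whatever reclamation applies (the kernel would use kfree() after synchronize_rcu(), or call_rcu()):

    /* Remove b (== a->next), then reclaim it safely. */
    void remove_next(struct node *a)
    {
            struct node *b = a->next;

            rcu_assign_pointer(a->next, b->next);  /* future readers see a -> c */
            synchronize_rcu();   /* wait out readers that may still hold b */
            free(b);             /* no reader can reach b anymore */
    }
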

SLIDE 39

Relativistic data structures

  • Linked lists
  • Radix trees
  • Tries
  • Balanced trees
  • Hash tables
SLIDE 40

Relativistic hash tables

  • Open chaining with relativistic linked lists
  • Insertion and removal supported
  • Atomic move operation (see previous work)
SLIDE 41

Relativistic hash tables

  • Open chaining with relativistic linked lists
  • Insertion and removal supported
  • Atomic move operation (see previous work)
  • What about resizing?
  • Necessary to maintain constant-time performance and reasonable memory usage

SLIDE 42

Relativistic hash tables

  • Open chaining with relativistic linked lists
  • Insertion and removal supported
  • Atomic move operation (see previous work)
  • What about resizing?
  • Necessary to maintain constant-time performance and reasonable memory usage
  • Must keep the table consistent at all times
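For context, a hedged sketch of a table layout and an insert built on the list primitives above; the power-of-two bucket count is what later makes resizing by doubling and halving work, and hash() is an assumed helper:

    struct ht {
            unsigned long nbuckets;   /* always a power of two */
            struct node **buckets;    /* chain heads */
    };

    void ht_insert(struct ht *t, struct node *n)
    {
            unsigned long i = hash(n->key) & (t->nbuckets - 1);

            n->next = t->buckets[i];               /* initialize first */
            rcu_assign_pointer(t->buckets[i], n);  /* then publish     */
    }
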
SLIDE 43

Existing approaches to resizing

  • Don’t: allocate a fixed-size table and never resize it
  • Poor performance or wasted memory
SLIDE 44

Existing approaches to resizing

  • Don’t: allocate a fixed-size table and never resize it
  • Poor performance or wasted memory
  • “Dynamic Dynamic Data Structures” (DDDS)
  • Readers must check old and new data structures
  • Readers have to wait until no resizes are in progress
  • Slows down the common case
  • Significantly slows lookups while resizing
SLIDE 45

Existing approaches to resizing

  • Don’t: allocate a fixed-size table and never resize it
  • Poor performance or wasted memory
  • “Dynamic Dynamic Data Structures” (DDDS)
  • Readers must check old and new data structures
  • Readers have to wait until no resizes are in progress
  • Slows down the common case
  • Significantly slows lookups while resizing
  • Herbert Xu’s resizable relativistic hash tables
  • Extra linked-list pointers in every node
  • High memory usage
SLIDE 46

Defining “consistent”

  • A reader traversing a hash bucket must always observe all elements in that bucket

SLIDE 47

Defining “consistent”

  • A reader traversing a hash bucket must always observe all elements in that bucket
  • ...but if it observes more, no harm done
SLIDE 48

Defining “consistent”

  • A reader traversing a hash bucket must always observe all elements in that bucket
  • ...but if it observes more, no harm done
  • Imprecise hash buckets contain elements from other buckets
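A hedged lookup sketch shows why imprecise buckets are harmless: the reader compares every node's key anyway, so extra nodes from another bucket fail the comparison and are skipped; only a missing node would break correctness. The caller is assumed to be inside a delimited reader:

    struct node *ht_lookup(struct ht *t, int key)
    {
            unsigned long i = hash(key) & (t->nbuckets - 1);
            struct node *n;

            for (n = rcu_dereference(t->buckets[i]); n != NULL;
                 n = rcu_dereference(n->next))
                    if (n->key == key)
                            return n;   /* nodes from other buckets
                                           simply fail this test */
            return NULL;
    }
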
SLIDE 49

Shrinking: Initial state

[Figure: old table: odd bucket → 1 → 3, even bucket → 2 → 4]

SLIDE 50

Shrinking: Initialize new buckets

[Figure: new single 'all' bucket initialized to the odd chain (1 → 3); even chain (2 → 4) unchanged]

SLIDE 51

Shrinking: Link old chains

[Figure: tail of the odd chain (3) linked to the even chain's head (2); the chain now runs 1 → 3 → 2 → 4]

SLIDE 52

Shrinking: Publish new buckets

[Figure: new one-bucket table published; old odd/even buckets still point into the chain]

SLIDE 53

Shrinking: Wait for readers

[Figure: writer waits for readers still traversing via the old odd/even buckets]

SLIDE 54

Shrinking: Reclaim

[Figure: single 'all' bucket → 1 → 3 → 2 → 4; old table reclaimed]

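The whole shrink sequence, condensed into a hedged C sketch; chain_tail(), free_table(), and the global 'table' pointer are invented helpers, and the real implementation keeps more bookkeeping:

    /* Shrink to half as many buckets: keys from old buckets i and
     * i + new->nbuckets share new bucket i, so each new chain is two
     * old chains concatenated. */
    void ht_shrink(struct ht *old, struct ht *new)
    {
            unsigned long i;

            for (i = 0; i < new->nbuckets; i++) {
                    struct node *head = old->buckets[i];
                    struct node *rest = old->buckets[i + new->nbuckets];

                    if (head == NULL) {             /* steps 1-2: initialize */
                            new->buckets[i] = rest; /* new buckets and link  */
                            continue;               /* the old chains        */
                    }
                    new->buckets[i] = head;
                    chain_tail(head)->next = rest;
            }
            rcu_assign_pointer(table, new);  /* step 3: publish new buckets */
            synchronize_rcu();               /* step 4: wait for readers    */
            free_table(old);                 /* step 5: reclaim old table   */
    }
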
SLIDE 55

Expanding: Initial state

[Figure: single 'all' bucket → 1 → 2 → 3 → 4]

SLIDE 56

Expanding: Initialize new buckets

[Figure: new odd and even buckets initialized to point into the zipped chain: odd → 1, even → 2]

SLIDE 57

Expanding: Publish new buckets

[Figure: new two-bucket table published; old 'all' bucket still points at the chain]

SLIDE 58

Expanding: Wait for readers

[Figure: old table now auxiliary; writer waits for readers that entered through it]

SLIDE 59

Expanding: Unzip one step

[Figure: one unzip step: node 1 now points past 2, directly at 3]

SLIDE 60

Expanding: Wait for readers

[Figure: writer waits for readers again before the next unzip step]

SLIDE 61

Expanding: Unzip again

[Figure: second unzip step: node 2 now points past 3, directly at 4]

SLIDE 62

Expanding: Final state

[Figure: odd bucket → 1 → 3, even bucket → 2 → 4]

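A heavily simplified sketch of the unzip phase for one zipped chain; bucket_of() (a node's bucket index in the new table) is an invented helper, and the real algorithm works across all chains with one wait-for-readers per pass rather than per step:

    void unzip_chain(struct ht *new, struct node *head)
    {
            struct node *p = head;

            while (p != NULL) {
                    struct node *q;

                    /* skip past a run of nodes already in p's bucket */
                    while (p->next && bucket_of(new, p->next) == bucket_of(new, p))
                            p = p->next;
                    if (p->next == NULL)
                            break;

                    /* find the next node that belongs with p again */
                    q = p->next;
                    while (q->next && bucket_of(new, q->next) != bucket_of(new, p))
                            q = q->next;

                    rcu_assign_pointer(p->next, q->next);  /* one unzip step */
                    synchronize_rcu();  /* drain readers mid-chain before the
                                           next step changes their path */
                    p = q;   /* continue with the other bucket's run */
            }
    }
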
SLIDE 63

Benchmarking methodology

  • Implemented a microbenchmark as a Linux kernel module
  • Used Linux’s Read-Copy Update (RCU) implementation
  • Relativistic Programming primitives map to RCU operations
SLIDE 64

Benchmarking methodology

  • Implemented a microbenchmark as a Linux kernel module
  • Used Linux’s Read-Copy Update (RCU) implementation
  • Relativistic Programming primitives map to RCU operations
  • Lookups with no resize as a baseline
  • Lookups with continuous resizing as a worst-case scenario
SLIDE 65

Benchmarking methodology

  • Implemented a microbenchmark as a Linux kernel module
  • Used Linux’s Read-Copy Update (RCU) implementation
  • Relativistic Programming primitives map to RCU operations
  • Lookups with no resize as a baseline
  • Lookups with continuous resizing as a worst-case scenario
  • Compared: our algorithm, DDDS, rwlock
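Not the paper's kernel module, but a rough userspace analog of one reader thread using liburcu, to make "lookups per second" concrete; the stop flag and random_key() are assumed harness scaffolding around the ht_lookup() sketch from earlier:

    #include <urcu.h>

    void *reader_thread(void *arg)
    {
            struct ht *t = arg;
            unsigned long lookups = 0;

            rcu_register_thread();   /* urcu readers must register */
            while (!stop) {
                    rcu_read_lock();
                    (void)ht_lookup(t, random_key());
                    rcu_read_unlock();
                    lookups++;
            }
            rcu_unregister_thread();
            return (void *)lookups;  /* the harness sums these and divides
                                        by elapsed time */
    }
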
SLIDE 66

Results: fixed-size table baseline

[Figure: lookups/second (millions, 0 to 150) vs. reader threads (1, 2, 4, 8, 16) for RP, DDDS, and rwlock]

SLIDE 67

Results: continuous resizing

[Figure: lookups/second (millions, 0 to 200) vs. reader threads (1, 2, 4, 8, 16) for RP and DDDS under continuous resizing]

SLIDE 68

Results: our resize versus fixed

[Figure: lookups/second (millions, 0 to 200) vs. reader threads for fixed 8k and 16k tables versus our algorithm resizing continuously between them]

SLIDE 69

Results: DDDS resize versus fixed

[Figure: lookups/second (millions, 0 to 200) vs. reader threads for fixed 8k and 16k tables versus DDDS resizing continuously between them]

SLIDE 70

Hang on a minute...

  • This is USENIX!
  • We don’t settle for microbenchmarks here
  • We care about real-world implementations
SLIDE 71

memcached

  • Network-accessible key-value store
  • Used for caching
  • Performance-critical
SLIDE 72

memcached

  • Network-accessible key-value store
  • Used for caching
  • Performance-critical
  • ...and it uses a global table lock
SLIDE 73

memcached with relativistic hash tables

  • Uses the userspace RCU implementation, urcu
  • Adds a fast path for GET requests using relativistic lookups
  • Copies the value while still inside a relativistic reader
  • Falls back to the slow path for expiry and eviction
  • Writers use safe relativistic memory reclamation
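A hedged sketch of what such a fast path can look like; the item fields (value, vlen), the expiry check, and the function names are invented for illustration, not memcached's actual code:

    #include <string.h>

    /* GET fast path: look up and copy the value entirely inside a
     * delimited reader; returning 0 sends the caller to the locked
     * slow path. */
    int fast_get(struct ht *t, int key, char *buf, size_t buflen)
    {
            struct node *it;
            int hit = 0;

            rcu_read_lock();
            it = ht_lookup(t, key);
            if (it && !needs_expiry(it)) {   /* expiry/eviction: slow path */
                    size_t n = it->vlen < buflen ? it->vlen : buflen;

                    memcpy(buf, it->value, n);   /* copy while still inside
                                                    the reader */
                    hit = 1;
            }
            rcu_read_unlock();
            return hit;
    }
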
SLIDE 74

memcached results

[Figure: requests/second (thousands, 0 to 600) vs. mc-benchmark processes (1 through 12) for RP GET, default GET, default SET, and RP SET]

SLIDE 75

Future work: Relativistic data structures

  • New relativistic algorithms currently require careful construction

  • We have a general methodology for algorithm construction
  • Write an algorithm assuming our memory model
  • Use this methodology to mechanically place barriers and wait-for-readers operations

SLIDE 76

Summary

  • Relativistic programming allows linearly scalable readers
  • Relativistic hash tables support resizing now
  • Now suitable for general-purpose usage
  • Real-world code scales better with relativistic programming

Questions?