SLIDE 1

ffwd: delegation is (much) faster than you think

Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu

SLIDE 2

int get_seqno() {
    return ++seqno;
}  // ~1 billion ops/s, single-threaded

SLIDE 3

int threadsafe_get_seqno() {
    acquire(lock);
    int ret = ++seqno;
    release(lock);
    return ret;
}

// < 10 million ops/s

SLIDE 4

[figure: throughput (Mops) vs. hardware threads (16-128) for MCS, MUTEX, TTAS, CLH, and TAS locks]

SLIDE 5

why so slow?

SLIDE 6

SLIDE 7

SLIDE 8

~70 ns intra-socket latency → ~14 Mops (1 / 70 ns)

SLIDE 9

[figure: quad-socket Intel system, sockets connected by QPI links]

10-20 gigabytes/s per link = 150-300 M cache lines/s
~200 ns one-way (O/W) latency = 5 million ops per second

SLIDE 10

[timing diagram: THREAD 1 and THREAD 2 alternate "critical section" and "wait for lock"; every lock handoff crosses the interconnect, so interconnect latency dominates. Each operation now takes ~400x longer.]

SLIDE 11

[timing diagram: with THREAD 1, THREAD 2, and THREAD 3, each thread spends most of its time in "wait for lock"; the slowdown grows to ~600x]

SLIDE 12

dedicated server thread

[figure: several client threads send their critical-section requests to one dedicated server thread]

SLIDE 13

[timing diagram: CLIENT 1 sends a request to the DEDICATED SERVER THREAD, which runs the critical section and sends back a response; the client waits for the response, the server waits for the next request. One delegated operation still costs ~400x: a request/response round trip across the interconnect.]

SLIDE 14

[timing diagram: CLIENT 1 through CLIENT n each exchange request/response pairs with the DEDICATED SERVER THREAD; per operation, each client still sees 400x]

SLIDE 15

[timing diagram: with many clients, the server runs critical sections back to back; while each client waits for its response, the server serves the other clients, hiding the interconnect latency. Per client it is still 400x, but server throughput scales.]

SLIDE 16

ffwd design
(fast, fly-weight delegation)

[figure: each CLIENT writes its request to the server, then spins on the server response. The SERVER, for each of N thread groups, reads and acts on all of that group's requests, then writes all of its responses. Client request (64 bytes, one line per thread): function pointer, toggle bit, arg count, argv[6]. Shared server response (128 bytes, one line per group of 15 threads): toggle bits, return values[15].]

  • one dedicated 64-byte request line per client-server pair
  • requests are sent synchronously
  • each group of 15 clients shares one 128-byte response line pair
  • the server acts on pending requests in batches of 15 clients

SLIDE 17

A request in more detail

[figure: client request (64 bytes): function pointer, toggle bit, arg count, argv[6]; shared server response (128 bytes): toggle bits, return values[15]]

  • a request is new if: request toggle bit != response toggle bit
  • the server calls the function with the (64-bit) arguments provided
  • the client polls the response line until its response toggle bit == its request toggle bit (see the sketch below)
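To make the layout concrete, here is a minimal C sketch of the two cache lines and the client-side protocol. The struct and function names are ours, not libffwd's, and the memory-ordering annotations real code would need are omitted:

#include <stdint.h>

/* one 64-byte request line per client-server pair */
struct ffwd_request {
    void    *fptr;      /* delegated function                          */
    uint32_t toggle;    /* flipped by the client to mark a new request */
    uint32_t argc;      /* number of 64-bit arguments (up to 6)        */
    uint64_t argv[6];   /* arguments                                   */
} __attribute__((aligned(64)));

/* one 128-byte response line pair per group of 15 clients */
struct ffwd_response {
    uint64_t toggles;   /* one toggle bit per client slot in the group */
    uint64_t ret[15];   /* return values, one per client               */
} __attribute__((aligned(128)));

/* client side: publish a one-argument request, spin until served */
uint64_t ffwd_call(struct ffwd_request *req,
                   volatile struct ffwd_response *resp,
                   int slot,          /* this client's slot in its group */
                   void *f, uint64_t arg) {
    req->fptr = f;
    req->argc = 1;
    req->argv[0] = arg;
    req->toggle ^= 1;   /* request toggle now != response toggle: "new" */
    while (((resp->toggles >> slot) & 1) != (req->toggle & 1))
        ;               /* poll the (read-only) response line */
    return resp->ret[slot];
}

The server acknowledges a request by flipping the matching bit in toggles when it writes the return value.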
SLIDES 18-27

[animation: the delegation server thread scans the group 0 request lines for new requests (toggle bits ..111 → ..110 → ..100), writing each return value into a local response buffer; once the batch is done, it copies the whole buffer to the global response buffer in one step, and the response cache lines (and then the request lines) move between the modified and shared coherence states]
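The batching the animation depicts might look like the following sketch, continuing the hypothetical ffwd_request/ffwd_response layouts from slide 17 (again, the names are ours and barriers are omitted):

/* for the sketch, every delegated function takes six 64-bit args */
typedef uint64_t (*ffwd_fn)(uint64_t, uint64_t, uint64_t,
                            uint64_t, uint64_t, uint64_t);

/* server side: serve one group of 15 clients, batching the responses */
void serve_group(struct ffwd_request *reqs,      /* 15 request lines      */
                 struct ffwd_response *resp) {   /* shared response line  */
    struct ffwd_response local = *resp;          /* local response buffer */
    for (int s = 0; s < 15; s++) {
        struct ffwd_request *r = &reqs[s];
        /* a request is new iff its toggle differs from the response's */
        if ((r->toggle & 1) != ((local.toggles >> s) & 1)) {
            ffwd_fn f = (ffwd_fn)r->fptr;
            local.ret[s] = f(r->argv[0], r->argv[1], r->argv[2],
                             r->argv[3], r->argv[4], r->argv[5]);
            local.toggles ^= 1ULL << s;          /* mark as served */
        }
    }
    *resp = local;   /* 15 responses in one contiguous copy:
                        2 modified cache lines instead of 15 */
}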

SLIDE 28

performance evaluation

SLIDE 29

evaluation systems

  • 4×16-core Xeon E5-4660 (Broadwell, 2.2 GHz)
  • 4×8-core Xeon E5-4620 (Sandy Bridge-EP, 2.2 GHz)
  • 4×8-core Xeon E7-4820 (Westmere-EX, 2.0 GHz)
  • 4×8-core AMD Opteron 6378 (Abu Dhabi, 2.4 GHz)

SLIDE 30

application benchmarks

  • Same benchmarks as in Lozi et al. (RCL) [USENIX ATC’12]
  • programs that spend large % of time in critical sections
  • Except BerkeleyDB (ran out of time)
SLIDE 31

raytrace-car (SPLASH-2)

[figure: duration (ms) vs. number of threads (16-128) for FFWD, MCS, MUTEX, TAS, FC, and RCL]

RCL experienced correctness issues above 82 threads.

SLIDE 32

application benchmarks

  • comparing best performance (any thread count) for all methods
  • up to 2.5x improvement over pthreads, any thread count
  • 10+ times speedup at max thread count
SLIDE 33

memcached-set

[figure: duration (sec) vs. number of threads (16-128) for FFWD, MCS, MUTEX (pthread mutex), TAS, and RCL]

RCL experienced correctness issues above 24 threads. We did not get Flat Combining to work.

SLIDE 34

microbenchmarks

SLIDE 35
  • ffwd is much faster on largely sequential data structures
    • linked list (coarse locking), stack, queue
    • fetch-and-add, for few shared variables
  • for highly concurrent data structures, ffwd falls behind when the lock contention is low
    • fetch-and-add, with many shared variables
    • hashtable
  • for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
    • lazy linked list
    • binary search tree
SLIDE 36

naïve 1024-node linked-list, coarse-grained locking

[figure: throughput (Mops) vs. hardware threads (16-128) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, HTICKET, and STM]

SLIDE 37

two-lock queue

[figure: throughput (Mops) vs. hardware threads (16-128) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, HTICKET, FC, RCL, CC, DSM, H, MS, SIM, and BLF]

SLIDE 38

stack

[figure: throughput (Mops) vs. hardware threads (16-128) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, HTICKET, FC, RCL, CC, DSM, H, LF, SIM, and BLF]

SLIDE 39

fetch-and-add, 1 variable

[figure: throughput (Mops) vs. hardware threads (16-128) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, HTICKET, FC, RCL, and ATOMIC]

hardware-provided atomic increment instruction!
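For reference, the ATOMIC curve needs no lock and no delegation at all; a minimal sketch using the GCC/Clang builtin (which compiles to x86's lock xadd):

#include <stdint.h>

uint64_t seqno;

/* thread-safe sequence number via the hardware atomic increment */
uint64_t atomic_get_seqno(void) {
    return __atomic_add_fetch(&seqno, 1, __ATOMIC_RELAXED);
}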

SLIDE 40
  • ffwd is much faster on largely sequential data structures
    • naïve linked list, stack, queue
    • fetch-and-add, for few shared variables
  • for highly concurrent data structures, ffwd falls behind when there are many locks
    • fetch-and-add, with many shared variables
    • hashtable
  • for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
    • lazy linked list
    • binary search tree
SLIDE 41

128-thread hash table

[figure: throughput (Mops) vs. number of buckets (1-1024) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, and HTICKET]

locking takes the lead when #locks ~ #threads

SLIDE 42

fetch-and-add, 128 threads

[figure: throughput (Mops) vs. number of shared variables (locks) (1-1024) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, HTICKET, FC, and RCL]

SLIDE 43

fetch-and-add, 128 threads

[figure: as the previous slide, with the hardware atomic increment (ATOMIC) curve added]

SLIDE 44
  • ffwd is much faster on largely sequential data structures
    • naïve linked list, stack, queue
    • fetch-and-add, for few shared variables
  • for highly concurrent data structures, once the lock count is similar to the thread count, ffwd falls behind
    • fetch-and-add, with many shared variables
    • hashtable
  • for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
    • lazy linked list
    • binary search tree
SLIDE 45

128-thread lazy concurrent lists

[figure: throughput (Mops) vs. number of elements (1-16384) for FFWD-LZ, MCS-LZ, MUTEX-LZ, TTAS-LZ, TICKET-LZ, CLH-LZ, TAS-LZ, HTICKET-LZ, HARRIS, FC-LZ, and RCL-LZ]

SLIDE 46

128-thread lazy (LZ) + skip (SK) lists

[figure: throughput (Mops) vs. number of elements (1-16384) for FFWD-LZ, FFWD-SK, MCS-LZ, MCS-SK, MUTEX-LZ, TTAS-LZ, TICKET-LZ, CLH-LZ, TAS-LZ, HTICKET-LZ, HARRIS, FC-LZ, and RCL-LZ. Annotations: "lower skip list complexity saves the day"; "single MCS lock skip list"; "high concurrency, not very long list".]

SLIDE 47

binary search tree

  • simple, unbalanced tree
  • 50% queries, 50% updates
  • all tree operations delegated for ffwd/RCL
SLIDE 48

128-thread binary search tree

[figure: throughput (Mops) vs. initial tree size (128-128K) for FFWD, RCL, RCU, RLU, SWISSTM, VRBTREE, VRTREE, and single-threaded]

ffwd is bounded by single-threaded performance

SLIDE 49

128-thread tree + 4-way-sharded ffwd

[figure: throughput (Mops) vs. tree size (128-128K) for FFWD, FFWD-S4, RCL, RCU, RLU, SWISSTM, VRBTREE, VRTREE, and single-threaded]

SLIDE 50

what makes ffwd so fast?

[figure: the ffwd design diagram repeated from slide 16]

  • requests are virtually un-contended, contiguous in memory
    • = happy server hardware pre-fetcher
  • buffered responses on server
    • 15 responses in one contiguous copy
    • 2 modified cache-lines instead of 15
  • responses are read-only on the client
    • response line never leaves the server L1
  • very light-weight processing on the server
  • plenty of hand-tuning
SLIDE 51

Why isn’t it even faster?

  • Link bandwidth is ~300 M cache lines/s per link → 300+ Mops
  • Latency suggests 2.5 Mops/client; 120 clients → 300 Mops
  • Why are we only seeing 55 Mops?
  • Processing limit? 55 Mops = 40 cycles per operation (at 2.2 GHz)
  • Insufficient concurrency: the round-trip bandwidth-delay product is 120 cache lines
  • server store / load buffers, reorder window size?
SLIDE 52

using ffwd

  • Free C library available now (Rust is on the way); a usage sketch follows below
  • Some current limitations:
    • delegated functions cannot, in turn, delegate functions
    • delegated functions typically should not block (nor acquire locks)
    • up to six 64-bit parameters
    • currently assumes one client per hardware thread
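A hypothetical usage sketch under those limitations; ffwd_delegate is a name we made up for illustration, not libffwd's actual API (see the repository for the real interface):

#include <stdint.h>

/* hypothetical libffwd-style entry point, name ours */
uint64_t ffwd_delegate(uint64_t (*fn)(uint64_t), uint64_t arg);

static uint64_t seqno;   /* touched only by the delegation server thread */

/* the delegated function runs on the server thread, so it needs no lock */
uint64_t get_seqno_delegated(uint64_t unused) {
    (void)unused;
    return ++seqno;
}

void worker(void) {
    /* synchronous: writes the request line, spins on the response line */
    uint64_t s = ffwd_delegate(get_seqno_delegated, 0);
    (void)s;
}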
SLIDE 53

related work

  • Remote Core Locking [Lozi, USENIX ATC’12]
  • Barrelfish - delegation-based OS [Baumann, SOSP’09]
  • Flat Combining [Hendler, SPAA’10]
  • Log-based node replication [Calciu, ASPLOS’17]
SLIDE 54

in conclusion

  • delegation is (much) faster than you thought
  • it is easy to use, and has many attractive applications
  • similar results on Intel Broadwell, Intel Sandy Bridge, Intel Westmere-EX, and AMD Abu Dhabi

SLIDE 55

questions?

  • UIC has many open CS faculty positions this year, all areas
  • libffwd, extended paper, and more: http://github.com/bitslab/ffwd

SLIDE 56

comparing parallelism

                          locking   delegation
   critical section       #locks    #servers
   communication latency  #locks    #clients