ffwd: delegation is (much) faster than you think
Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu
int get_seqno() {
    return ++seqno;                 // ~1 Billion ops/s, single-threaded
}

int threadsafe_get_seqno() {
    acquire(lock);
    int s = ++seqno;
    release(lock);
    return s;
}
[Plot: throughput (Mops) vs. hardware threads for MCS, MUTEX, TTAS, CLH, and TAS locks]
~70 ns intra-socket latency: one lock hand-off per ~70 ns caps throughput at ~14 Mops
[Diagram: quad-socket Intel system; sockets connected by QPI links carrying 10-20 gigabytes/s, i.e. 150-300 M cache lines/s. ~200 ns one-way latency = at most ~5 million lock hand-offs per second.]
[Timeline: with locking, THREAD 1 and THREAD 2 alternate; each critical section is followed by a wait for the lock that is dominated by interconnect latency, annotated as ~400x the critical section itself.]
[Timeline: with THREAD 1, THREAD 2, and THREAD 3 the picture is the same, except the wait for the lock grows, annotated as ~600x.]
[Diagram: delegation — a dedicated server thread and several clients.]
[Timeline: CLIENT 1 writes a request and waits for the response while the DEDICATED SERVER THREAD, which otherwise waits for requests, runs the critical section and replies. The request/response round trip is still ~400x the critical section.]
[Timeline: with many clients (CLIENT 1 .. CLIENT n), each client still waits ~400x for its own response, but the server now runs critical sections back to back, with no waiting in between.]
ffwd design:
CLIENTS: each client writes its request to the server, then spins on the shared server response (client: WRITE request line, READ response line).
SERVER thread: for each of N thread groups, read & act on all of that group's requests, then write all of that group's responses (server: READ request lines, WRITE response line).
client request (64 bytes): function pointer, toggle bit, arg count, argv[6]
shared server response (128 bytes): toggle bits, return values[15], shared by a group of 15 threads
One dedicated 64-byte request line per client-server pair. Requests are sent synchronously. Each group of 15 clients shares one 128-byte response line pair. The server acts upon pending requests in batches of 15 clients.
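A minimal C sketch of how these two lines could be laid out; the struct names, field packing, and GCC aligned attributes are illustrative assumptions, not the actual ffwd definitions:

    #include <stdint.h>

    /* One 64-byte request cache line per client-server pair (layout assumed). */
    struct ffwd_request {
        uint32_t toggle;      /* flipped by the client to publish a new request */
        uint32_t argc;        /* argument count */
        void    *fptr;        /* function pointer: the delegated critical section */
        uint64_t argv[6];     /* up to 6 word-sized arguments */
    } __attribute__((aligned(64)));    /* 4 + 4 + 8 + 48 = 64 bytes on a 64-bit machine */

    /* One 128-byte response line pair shared by a group of 15 clients. */
    struct ffwd_response {
        uint64_t toggles;     /* one response toggle bit per client slot */
        uint64_t retval[15];  /* one return value per client in the group */
    } __attribute__((aligned(128)));   /* 8 + 120 = 128 bytes */

With this packing the request struct occupies exactly one cache line and the response struct exactly two, matching the 64-byte and 128-byte figures above.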
ffwd: fast, fly-weight delegation
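Under the layout sketched above, the client side of a delegated call might look roughly like this; ffwd_delegate, the slot parameter, and the single-argument signature are assumptions for illustration, not the actual ffwd API:

    /* Illustrative client-side delegation call (sketch, not the real ffwd API). */
    uint64_t ffwd_delegate(struct ffwd_request *req,            /* this client's private request line */
                           volatile struct ffwd_response *resp, /* this client's group response line  */
                           int slot,                            /* this client's slot in the group, 0..14 */
                           void *fptr, uint64_t arg0)
    {
        uint64_t old = (resp->toggles >> slot) & 1;   /* response toggle before this request */

        req->fptr    = fptr;
        req->argc    = 1;
        req->argv[0] = arg0;
        /* Flip the request toggle last, publishing the request to the server. */
        __atomic_store_n(&req->toggle, req->toggle ^ 1, __ATOMIC_RELEASE);

        /* Spin on the shared response line until the server flips our toggle bit. */
        while (((resp->toggles >> slot) & 1) == old)
            ;

        return resp->retval[slot];
    }

Because requests are sent synchronously, each client has at most one request in flight, so one toggle bit in each direction is enough to match requests with responses.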
[Animation: the delegation server thread walks group 0's request lines, checking each toggle bit for a new request and accumulating return values in a local response buffer; only when the group's batch is done does it write the local buffer out to the global response buffer. The response cache lines thus stay in the shared state while clients spin on them and become modified only once per batch; the request lines likewise alternate between modified (client write) and shared (server read).]
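A sketch of the server loop these frames suggest, again building on the structs above; the group size of 15, the last_toggle bookkeeping, and the fixed six-argument call are assumptions. The key point from the animation is that results are accumulated in a local response buffer and written to the global response line only once per batch, so the response cache lines stay shared while clients spin on them:

    /* Illustrative delegation server loop (sketch, not the real ffwd implementation). */
    typedef uint64_t (*ffwd_fn)(uint64_t, uint64_t, uint64_t,
                                uint64_t, uint64_t, uint64_t);

    void ffwd_server(volatile struct ffwd_request *req,   /* all request lines, contiguous */
                     struct ffwd_response *resp,          /* one global response line per group */
                     uint32_t *last_toggle,               /* last request toggle seen, per client */
                     int ngroups)
    {
        for (;;) {
            for (int g = 0; g < ngroups; g++) {
                struct ffwd_response local = resp[g];     /* local response buffer */

                for (int s = 0; s < 15; s++) {
                    int c = g * 15 + s;
                    if (req[c].toggle == last_toggle[c])
                        continue;                          /* no new request from this client */
                    last_toggle[c] = req[c].toggle;

                    ffwd_fn fn = (ffwd_fn)req[c].fptr;     /* run the delegated critical section */
                    local.retval[s] = fn(req[c].argv[0], req[c].argv[1], req[c].argv[2],
                                         req[c].argv[3], req[c].argv[4], req[c].argv[5]);
                    local.toggles ^= 1ull << s;            /* flip this slot's response toggle */
                }
                resp[g] = local;   /* one write of the whole 128-byte response line per batch */
            }
        }
    }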
Test systems:
4×16-core Xeon E5-4660 (Broadwell), 2.2 GHz
4×8-core Xeon E5-4620 (Sandy Bridge-EP), 2.2 GHz
4×8-core Xeon E7-4820 (Westmere-EX), 2.0 GHz
4×8-core AMD Opteron 6378 (Abu Dhabi), 2.4 GHz
[Plot: duration (ms) vs. number of threads for FFWD, MCS, MUTEX, TAS, FC, and RCL]
RCL experienced correctness issues above 82 threads.
[Plot: duration (sec) vs. number of threads for FFWD, MCS, MUTEX (pthread mutex), TAS, and RCL]
RCL experienced correctness issues above 24 threads. We did not get Flat Combining to work.
when lock contention is low
ffwd keeps up, but is not a clear leader
[Plot: throughput (Mops) vs. hardware threads for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, HTICKET, and STM]
[Plots: throughput (Mops) vs. hardware threads for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, HTICKET, FC, RCL, CC, DSM, H, MS, LF, SIM, BLF, and ATOMIC]
hardware-provided atomic increment instruction!
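For reference, the ATOMIC line is simply the hardware atomic increment applied directly to the shared counter, with no lock and no delegation; a minimal sketch using the GCC/Clang builtin:

    #include <stdint.h>

    static uint64_t seqno;

    /* One hardware fetch-and-add (lock xadd on x86) per operation. */
    uint64_t atomic_get_seqno(void)
    {
        return __atomic_add_fetch(&seqno, 1, __ATOMIC_RELAXED);
    }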
when there are many locks
ffwd keeps up, but is not a clear leader
[Plot: throughput (Mops) vs. number of buckets for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, and HTICKET]
locking takes the lead when #locks ~ #threads
[Plots: throughput (Mops) vs. number of shared variables (locks) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, HTICKET, FC, RCL, and ATOMIC (atomic increment)]
when #locks is similar to #threads, ffwd falls behind
ffwd keeps up, but is not a clear leader
[Plots: throughput (Mops) vs. number of elements for FFWD-LZ, FFWD-SK, MCS-LZ, MCS-SK, MUTEX-LZ, TTAS-LZ, TICKET-LZ, CLH-LZ, TAS-LZ, HTICKET-LZ, HARRIS, FC-LZ, and RCL-LZ]
lower skip-list complexity saves the day: even a skip list protected by a single MCS lock does well here (high concurrency, but not a very long list)
[Plot: throughput (Mops) vs. initial size for FFWD, RCL, RCU, RLU, SWISSTM, VRBTREE, VRTREE, and single-threaded]
ffwd is bounded by single-threaded performance
[Plot: throughput (Mops) vs. tree size for FFWD, FFWD-S4, RCL, RCU, RLU, SWISSTM, VRBTREE, VRTREE, and single-threaded]
Recap of the ffwd layout: one dedicated 64-byte request line per client-server pair (function pointer, toggle bit, arg count, argv[6]); each group of 15 clients shares one 128-byte response line pair (toggle bits, return values[15]); the server reads and acts on each group's requests, then writes back all of that group's responses. The request lines are contiguous in memory (120 cache lines), and clients never need to acquire locks.
Intel Broadwell, Intel Sandy Bridge, Intel Westmere-EX, and AMD Abu Dhabi
http://github.com/bitslab/ffwd
locking vs. delegation:
per-operation cost: locking pays the critical section plus communication latency; delegation pays only the critical section.
parallelism: locking is limited by #locks; delegation by #locks, #servers, and #clients.
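A back-of-envelope version of this comparison, with assumed notation t_cs for the critical section time and t_comm for the communication (hand-off) latency:

    locking:    throughput per lock   ≲ 1 / (t_cs + t_comm)
    delegation: throughput per server ≲ 1 / t_cs

With t_comm ≈ 200 ns, the locking bound is the ~5 Mops ceiling from the interconnect slide above, no matter how small t_cs is.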