SLIDE 1

ffwd: delegation is (much) faster than you think

Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu

SLIDE 2

int get_seqno() {
    return ++seqno;
}  // ~1 billion ops/s, single-threaded

SLIDE 3

int threadsafe_get_seqno() {
    acquire(lock);
    int ret = ++seqno;
    release(lock);
    return ret;
}

// < 10 million ops/s

SLIDE 4

[figure: throughput (Mops) vs. hardware threads (16-128) for MCS, MUTEX, TTAS, CLH, and TAS locks]

SLIDE 5

why so slow?

SLIDE 6

SLIDE 7

SLIDE 8

~70 ns intra-socket latency → ~14 Mops (1 / 70 ns)

SLIDE 9

[figure: quad-socket Intel system, sockets connected by QPI links]

10-20 gigabytes/s per link = 150-300 M cache lines/s
~200 ns one-way (O/W) latency = 5 million ops per second

SLIDE 10

[timing diagram: THREAD 1 and THREAD 2 alternate "critical section" and "wait for lock"; every lock handoff crosses the interconnect, so interconnect latency dominates. Each operation now takes ~400x longer.]

SLIDE 11

[timing diagram: with THREAD 1, THREAD 2, and THREAD 3, each thread spends most of its time in "wait for lock"; the slowdown grows to ~600x]

SLIDE 12

dedicated server thread

[figure: several client threads send their critical-section requests to one dedicated server thread]

SLIDE 13

[timing diagram: CLIENT 1 sends a request to the DEDICATED SERVER THREAD, which runs the critical section and sends back a response; the client waits for the response, the server waits for the next request. One delegated operation still costs ~400x: a request/response round trip across the interconnect.]

SLIDE 14

[timing diagram: CLIENT 1 through CLIENT n each exchange request/response pairs with the DEDICATED SERVER THREAD; per operation, each client still sees 400x]

SLIDE 15

[timing diagram: with many clients, the server runs critical sections back to back; while each client waits for its response, the server serves the other clients, hiding the interconnect latency. Per client it is still 400x, but server throughput scales.]

SLIDE 16

ffwd design
(fast, fly-weight delegation)

[figure: each CLIENT writes its request to the server, then spins on the server response. The SERVER, for each of N thread groups, reads and acts on all of that group's requests, then writes all of its responses. Client request (64 bytes, one line per thread): function pointer, toggle bit, arg count, argv[6]. Shared server response (128 bytes, one line per group of 15 threads): toggle bits, return values[15].]

  • one dedicated 64-byte request line per client-server pair
  • requests are sent synchronously
  • each group of 15 clients shares one 128-byte response line pair
  • the server acts on pending requests in batches of 15 clients

SLIDE 17

A request in more detail

[figure: client request (64 bytes): function pointer, toggle bit, arg count, argv[6]; shared server response (128 bytes): toggle bits, return values[15]]

  • a request is new if: request toggle bit != response toggle bit
  • the server calls the function with the (64-bit) arguments provided
  • the client polls the response line until its response toggle bit == its request toggle bit (see the sketch below)
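To make the layout concrete, here is a minimal C sketch of the two cache lines and the client-side protocol. The struct and function names are ours, not libffwd's, and the memory-ordering annotations real code would need are omitted:

#include <stdint.h>

/* one 64-byte request line per client-server pair */
struct ffwd_request {
    void    *fptr;      /* delegated function                          */
    uint32_t toggle;    /* flipped by the client to mark a new request */
    uint32_t argc;      /* number of 64-bit arguments (up to 6)        */
    uint64_t argv[6];   /* arguments                                   */
} __attribute__((aligned(64)));

/* one 128-byte response line pair per group of 15 clients */
struct ffwd_response {
    uint64_t toggles;   /* one toggle bit per client slot in the group */
    uint64_t ret[15];   /* return values, one per client               */
} __attribute__((aligned(128)));

/* client side: publish a one-argument request, spin until served */
uint64_t ffwd_call(struct ffwd_request *req,
                   volatile struct ffwd_response *resp,
                   int slot,          /* this client's slot in its group */
                   void *f, uint64_t arg) {
    req->fptr = f;
    req->argc = 1;
    req->argv[0] = arg;
    req->toggle ^= 1;   /* request toggle now != response toggle: "new" */
    while (((resp->toggles >> slot) & 1) != (req->toggle & 1))
        ;               /* poll the (read-only) response line */
    return resp->ret[slot];
}

The server acknowledges a request by flipping the matching bit in toggles when it writes the return value.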
SLIDES 18-27

[animation: the delegation server thread scans the group 0 request lines for new requests (toggle bits ..111 → ..110 → ..100), writing each return value into a local response buffer; once the batch is done, it copies the whole buffer to the global response buffer in one step, and the response cache lines (and then the request lines) move between the modified and shared coherence states]
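The batching the animation depicts might look like the following sketch, continuing the hypothetical ffwd_request/ffwd_response layouts from slide 17 (again, the names are ours and barriers are omitted):

/* for the sketch, every delegated function takes six 64-bit args */
typedef uint64_t (*ffwd_fn)(uint64_t, uint64_t, uint64_t,
                            uint64_t, uint64_t, uint64_t);

/* server side: serve one group of 15 clients, batching the responses */
void serve_group(struct ffwd_request *reqs,      /* 15 request lines      */
                 struct ffwd_response *resp) {   /* shared response line  */
    struct ffwd_response local = *resp;          /* local response buffer */
    for (int s = 0; s < 15; s++) {
        struct ffwd_request *r = &reqs[s];
        /* a request is new iff its toggle differs from the response's */
        if ((r->toggle & 1) != ((local.toggles >> s) & 1)) {
            ffwd_fn f = (ffwd_fn)r->fptr;
            local.ret[s] = f(r->argv[0], r->argv[1], r->argv[2],
                             r->argv[3], r->argv[4], r->argv[5]);
            local.toggles ^= 1ULL << s;          /* mark as served */
        }
    }
    *resp = local;   /* 15 responses in one contiguous copy:
                        2 modified cache lines instead of 15 */
}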

SLIDE 28

performance evaluation

SLIDE 29

evaluation systems

  • 4×16-core Xeon E5-4660 (Broadwell, 2.2 GHz)
  • 4×8-core Xeon E5-4620 (Sandy Bridge-EP, 2.2 GHz)
  • 4×8-core Xeon E7-4820 (Westmere-EX, 2.0 GHz)
  • 4×8-core AMD Opteron 6378 (Abu Dhabi, 2.4 GHz)

SLIDE 30

application benchmarks

  • Same benchmarks as in Lozi et al. (RCL) [USENIX ATC’12]
  • programs that spend large % of time in critical sections
  • Except BerkeleyDB (ran out of time)
SLIDE 31

raytrace-car (SPLASH-2)

[figure: duration (ms) vs. number of threads (16-128) for FFWD, MCS, MUTEX, TAS, FC, and RCL]

RCL experienced correctness issues above 82 threads.

SLIDE 32

application benchmarks

  • comparing best performance (any thread count) for all methods
  • up to 2.5x improvement over pthreads, any thread count
  • 10+ times speedup at max thread count
SLIDE 33

memcached-set

[figure: duration (sec) vs. number of threads (16-128) for FFWD, MCS, MUTEX (pthread mutex), TAS, and RCL]

RCL experienced correctness issues above 24 threads. We did not get Flat Combining to work.

SLIDE 34

microbenchmarks

SLIDE 35
  • ffwd is much faster on largely sequential data structures
    • linked list (coarse locking), stack, queue
    • fetch-and-add, for few shared variables
  • for highly concurrent data structures, ffwd falls behind when the lock contention is low
    • fetch-and-add, with many shared variables
    • hashtable
  • for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
    • lazy linked list
    • binary search tree
SLIDE 36

naïve 1024-node linked-list, coarse-grained locking

[figure: throughput (Mops) vs. hardware threads (16-128) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, HTICKET, and STM]

SLIDE 37

two-lock queue

[figure: throughput (Mops) vs. hardware threads (16-128) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, HTICKET, FC, RCL, CC, DSM, H, MS, SIM, and BLF]

SLIDE 38

stack

[figure: throughput (Mops) vs. hardware threads (16-128) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, HTICKET, FC, RCL, CC, DSM, H, LF, SIM, and BLF]

SLIDE 39

fetch-and-add, 1 variable

[figure: throughput (Mops) vs. hardware threads (16-128) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, HTICKET, FC, RCL, and ATOMIC]

hardware-provided atomic increment instruction!
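For reference, the ATOMIC curve needs no lock and no delegation at all; a minimal sketch using the GCC/Clang builtin (which compiles to x86's lock xadd):

#include <stdint.h>

uint64_t seqno;

/* thread-safe sequence number via the hardware atomic increment */
uint64_t atomic_get_seqno(void) {
    return __atomic_add_fetch(&seqno, 1, __ATOMIC_RELAXED);
}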

SLIDE 40
  • ffwd is much faster on largely sequential data structures
    • naïve linked list, stack, queue
    • fetch-and-add, for few shared variables
  • for highly concurrent data structures, ffwd falls behind when there are many locks
    • fetch-and-add, with many shared variables
    • hashtable
  • for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
    • lazy linked list
    • binary search tree
SLIDE 41

128-thread hash table

[figure: throughput (Mops) vs. number of buckets (1-1024) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, and HTICKET]

locking takes the lead when #locks ~ #threads

SLIDE 42

fetch-and-add, 128 threads

[figure: throughput (Mops) vs. number of shared variables (locks) (1-1024) for FFWD, MCS, MUTEX, TTAS, TICKET, CLH, TAS, HTICKET, FC, and RCL]

SLIDE 43

fetch-and-add, 128 threads

[figure: as the previous slide, with the hardware atomic increment (ATOMIC) curve added]

SLIDE 44
  • ffwd is much faster on largely sequential data structures
    • naïve linked list, stack, queue
    • fetch-and-add, for few shared variables
  • for highly concurrent data structures, once the lock count is similar to the thread count, ffwd falls behind
    • fetch-and-add, with many shared variables
    • hashtable
  • for concurrent data structures with long query times, ffwd keeps up, but is not a clear leader
    • lazy linked list
    • binary search tree
SLIDE 45

128-thread lazy concurrent lists

[figure: throughput (Mops) vs. number of elements (1-16384) for FFWD-LZ, MCS-LZ, MUTEX-LZ, TTAS-LZ, TICKET-LZ, CLH-LZ, TAS-LZ, HTICKET-LZ, HARRIS, FC-LZ, and RCL-LZ]

SLIDE 46

128-thread lazy (LZ) + skip (SK) lists

[figure: throughput (Mops) vs. number of elements (1-16384) for FFWD-LZ, FFWD-SK, MCS-LZ, MCS-SK, MUTEX-LZ, TTAS-LZ, TICKET-LZ, CLH-LZ, TAS-LZ, HTICKET-LZ, HARRIS, FC-LZ, and RCL-LZ. Annotations: "lower skip list complexity saves the day"; "single MCS lock skip list"; "high concurrency, not very long list".]

SLIDE 47

binary search tree

  • simple, unbalanced tree
  • 50% queries, 50% updates
  • all tree operations delegated for ffwd/RCL
SLIDE 48

128-thread binary search tree

[figure: throughput (Mops) vs. initial tree size (128-128K) for FFWD, RCL, RCU, RLU, SWISSTM, VRBTREE, VRTREE, and single-threaded]

ffwd is bounded by single-threaded performance

SLIDE 49

128-thread tree + 4-way-sharded ffwd

[figure: throughput (Mops) vs. tree size (128-128K) for FFWD, FFWD-S4, RCL, RCU, RLU, SWISSTM, VRBTREE, VRTREE, and single-threaded]

SLIDE 50

what makes ffwd so fast?

[figure: the ffwd design diagram repeated from slide 16]

  • requests are virtually un-contended, contiguous in memory
    • = happy server hardware pre-fetcher
  • buffered responses on server
    • 15 responses in one contiguous copy
    • 2 modified cache-lines instead of 15
  • responses are read-only on the client
    • response line never leaves the server L1
  • very light-weight processing on the server
  • plenty of hand-tuning
SLIDE 51

Why isn’t it even faster?

  • Link bandwidth is ~300 M cache lines/s per link → 300+ Mops
  • Latency suggests 2.5 Mops/client; 120 clients → 300 Mops
  • Why are we only seeing 55 Mops?
  • Processing limit? 55 Mops = 40 cycles per operation (at 2.2 GHz)
  • Insufficient concurrency: the round-trip bandwidth-delay product is 120 cache lines
  • server store / load buffers, reorder window size?
SLIDE 52

using ffwd

  • Free C library available now (Rust is on the way); a usage sketch follows below
  • Some current limitations:
    • delegated functions cannot, in turn, delegate functions
    • delegated functions typically should not block (nor acquire locks)
    • up to six 64-bit parameters
    • currently assumes one client per hardware thread
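A hypothetical usage sketch under those limitations; ffwd_delegate is a name we made up for illustration, not libffwd's actual API (see the repository for the real interface):

#include <stdint.h>

/* hypothetical libffwd-style entry point, name ours */
uint64_t ffwd_delegate(uint64_t (*fn)(uint64_t), uint64_t arg);

static uint64_t seqno;   /* touched only by the delegation server thread */

/* the delegated function runs on the server thread, so it needs no lock */
uint64_t get_seqno_delegated(uint64_t unused) {
    (void)unused;
    return ++seqno;
}

void worker(void) {
    /* synchronous: writes the request line, spins on the response line */
    uint64_t s = ffwd_delegate(get_seqno_delegated, 0);
    (void)s;
}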
SLIDE 53

related work

  • Remote Core Locking [Lozi, USENIX ATC’12]
  • Barrelfish - delegation-based OS [Baumann, SOSP’09]
  • Flat Combining [Hendler, SPAA’10]
  • Log-based node replication [Calciu, ASPLOS’17]
SLIDE 54

in conclusion

  • delegation is (much) faster than you thought
  • it is easy to use, and has many attractive applications
  • similar results on Intel Broadwell, Intel Sandy Bridge, Intel Westmere-EX, and AMD Abu Dhabi

SLIDE 55

questions?

  • UIC has many open CS faculty positions this year, all areas
  • libffwd, extended paper, and more: http://github.com/bitslab/ffwd

SLIDE 56

comparing parallelism

                          locking   delegation
   critical section       #locks    #servers
   communication latency  #locks    #clients