SLIDE 1

Algorithms for Optimization of Remote Core Locking Synchronization in Hierarchical Multicore Computer Systems

Paznikov Alexey apaznikov@gmail.com

Saint Petersburg Electrotechnical University "LETI"; Siberian State University of Telecommunications and Information Sciences, Novosibirsk; Rzhanov Institute of Semiconductor Physics, Siberian Branch of RAS, Novosibirsk

The first summer school on practice and theory of concurrent computing

July 3–7, 2017, St. Petersburg

ITMO University

SLIDE 2

Multicore computer systems with shared memory

[Figure: two NUMA nodes (NUMA-node 1, NUMA-node 2), each with CPU cores C1–C8, per-core L1/L2 caches, a shared L3 cache, and a memory controller, interconnected via QuickPath / Hyper-Transport]

Architecture features of modern multicore computer systems (CS):
▪ Scalability: the number of CPU cores reaches 10^2 – 10^3
▪ Hierarchical structure: multilevel caches, logical processors
▪ Non-uniform memory access (NUMA systems)
▪ Heterogeneous structure: specialized accelerators and co-processors
▪ A variety of mechanisms for maintaining a consistent view of memory

SLIDE 3

Synchronization in multithreaded programs: Locks

[Figure: threads T1 and T2 serialize on a lock around a critical section]

With locks, parallel execution of critical sections is serialized, i.e. replaced by sequential execution.

Drawbacks:
▪ Low scalability
▪ Serialization
▪ Deadlocks, livelocks
▪ Starvation
▪ Priority inversion
▪ Unpredictable order of critical section execution (not FIFO)
▪ Noticeable overheads

Implementations:
▪ Queue locks (CLH, MCS)
▪ Spinlocks (test-and-set-based, exponential-backoff-based)
▪ Flat combining (CC-Synch, DSM-Synch, Oyama lock, etc.)
▪ Futex-based (PThreads mutex, PThreads read-write mutex)
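
To make the simplest of these implementations concrete, here is a minimal sketch of a test-and-set spinlock with exponential backoff, written with C11 atomics. It is an illustration added to these notes, not code from the slides; the names spin_lock and spin_unlock are placeholders.

#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_flag flag; } spinlock_t;
#define SPINLOCK_INITIALIZER { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l) {
    unsigned backoff = 1;
    /* test-and-set: loop until the flag was previously clear */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire)) {
        for (unsigned i = 0; i < backoff; i++)
            sched_yield();                  /* back off before retrying */
        if (backoff < 1024)
            backoff *= 2;                   /* exponential backoff */
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}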

SLIDE 4

Synchronization in multithreaded programs: Lock-free algorithms and concurrent data structures

[Figure: threads T1 and T2 competing on shared data with repeated CAS (compare-and-swap) operations]

Atomic operations (CAS – compare-and-swap, LL/SC – load-linked / store-conditional) are used to ensure thread safety.

Drawbacks:
▪ High complexity of parallel program development
▪ ABA problem
▪ Atomic operations apply only to variables of machine-word size
▪ Low efficiency of atomic operations

Modern algorithms:
▪ Lock-free producer-consumer
▪ Exponential backoff
▪ Elimination arrays
▪ Diffracting trees
▪ Sorting networks
▪ Cliff Click hash table
▪ Skip-list
▪ Split-ordering

Solutions for the ABA problem:
▪ Quiescent-state-based schemes
▪ Pointer-based schemes (hazard pointers, drop-the-anchor, pass-the-buck)
▪ Reference counting
▪ Tagged state reference
▪ Intermediate nodes
▪ TM-based
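
For illustration (not from the slides), a minimal sketch of the CAS retry pattern in C11 atomics: a lock-free counter increment and a Treiber-style stack push. The names lf_counter_inc and lf_push are placeholders; a production pop operation would additionally need one of the ABA solutions listed above.

#include <stdatomic.h>

/* Lock-free increment: retry the CAS until no other thread interferes. */
static void lf_counter_inc(_Atomic long *counter) {
    long old = atomic_load_explicit(counter, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(counter, &old, old + 1,
                                                  memory_order_release,
                                                  memory_order_relaxed)) {
        /* 'old' now holds the current value; just retry */
    }
}

struct node { struct node *next; int value; };

/* Treiber-stack push onto an atomic top-of-stack pointer. */
static void lf_push(struct node *_Atomic *top, struct node *n) {
    n->next = atomic_load_explicit(top, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(top, &n->next, n,
                                                  memory_order_release,
                                                  memory_order_relaxed)) {
        /* n->next was reloaded with the current top; retry */
    }
}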

SLIDE 5

Synchronization in multithreaded programs: Transactional memory (TM)

[Figure: threads T1 and T2 execute transactions on shared memory]

Transactional sections ensure thread-safe access to shared memory regions (the protected unit is a memory area, not a code section).

Drawbacks:
▪ Very high overheads
▪ Transactions can be aborted
▪ Restricted set of operations inside transactional sections
▪ Debugging complexity
▪ The program has to be refactored

TM implementations:
▪ GCC TM
▪ LazySTM
▪ TinySTM
▪ DTMC
▪ RSTM
▪ STM Monad
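
As a small illustration of the first implementation in the list, here is a sketch of a transactional section using the GCC TM language extension (built with -fgnu-tm); the account variables are hypothetical and not from the slides.

/* build: gcc -fgnu-tm tm_example.c */
long account_a = 100, account_b = 0;

void transfer(long amount) {
    /* the compiler executes this block as a single transaction */
    __transaction_atomic {
        account_a -= amount;
        account_b += amount;
    }
}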

SLIDE 6

Synchronization in multithreaded programs: Locks (continued)

[Figure: threads T1 and T2 serialize on a lock around a critical section]

Parallel execution of critical sections is serialized (drawbacks and implementations as on Slide 3). In addition:
▪ Bottlenecks
▪ High access contention → expensive lock acquisition

Development of more efficient locking algorithms is therefore a pressing task.

SLIDE 7

Time of critical section execution

[Figure: timeline of threads T1–T3 executing critical sections CS1–CS3; the critical path includes the transfer of lock ownership]

Time of critical section execution (critical path): u = u1 + u2, where u1 is the time to execute the instructions of the critical section and u2 is the time of transfer of lock ownership.

Time of transfer of lock ownership in the existing locking algorithms:
▪ Spinlock → access to a global flag
▪ PThread mutex → context switch
▪ MCS → thread activation
▪ Flat combining → acquisition of the global lock

Development of locking algorithms that minimize the time of transfer of lock ownership is relevant.

SLIDE 8

Time of critical section execution

[Figure: timeline of threads T1–T3 executing critical sections CS1–CS3; access to a global variable v adds to the critical path]

Time of critical section execution (critical path): u = u1 + u2 + u4, where u1 is the time to execute the instructions of the critical section, u2 is the time of transfer of lock ownership, and u4 is the time of access to global variables.

Localization of memory accesses in the existing locking algorithms:
▪ Spinlock → no localization
▪ PThread mutex → no localization
▪ MCS → no localization
▪ Flat combining → partial localization

Development of locking algorithms that ensure memory access localization is relevant today.

SLIDE 9

Remote Core Locking (RCL) technique

[Figure: threads T1–T3 delegate critical sections CS1–CS3 to an RCL-server running on a dedicated core X; global variables v1, v2 stay local to the server core, shortening the critical path]

Remote Core Locking (RCL) minimizes the critical path of critical section execution, u = u1 + u2 + u3, by minimizing:
▪ the time u2 of transfer of lock ownership
▪ the time u3 of access to global variables

Lozi J.-P. et al. Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications // USENIX Annual Technical Conference, 2012. P. 65–76.

SLIDE 10

Remote Core Locking (RCL) technique

[Figure: the RCL-server occupies one core of a two-node NUMA system (cores 1–8 with L1/L2/L3 caches and memory, nodes connected via QuickPath / Hyper-Transport); threads T1–T3 delegate critical sections CS1–CS3 to it]

RCL minimizes the critical path u = u1 + u2 + u3 by minimizing the time u2 of transfer of lock ownership and the time u3 of access to global variables (see Slide 9; Lozi et al., USENIX ATC 2012, pp. 65–76).

SLIDE 11

Remote Core Locking (RCL) technique

[Figure: the RCL-server on a two-node NUMA system; global variables v1, v2 are accessed from the server core]

Time of critical section execution (critical path): u = u1 + u2 + u3. The time u3 of access to a global variable inside a critical section depends on:
▪ on which NUMA node the memory for the variable is allocated
▪ on which processor core the RCL-server runs
▪ which thread, on which processor core, accessed the variable last

(Lozi et al., USENIX ATC 2012, pp. 65–76.)

SLIDE 12

Example of critical section execution in RCL

liblock_lock_t lock;
const char* liblock_name = "rcl";
int global_var = 0;

/* Critical section: executed on the RCL-server core */
void *cs(void* arg) {
    global_var++;
    return NULL;
}

void *thread(void* arg) {
    int i;
    for (i = 0; i < NITERS; i++) {
        liblock_exec(&lock, cs, NULL);          /* critical section execution */
    }
    return NULL;
}

int main() {
    /* lock initialization: the RCL-server runs on hardware thread 0 */
    liblock_lock_init(liblock_name, &topology->hw_threads[0], &lock, 0);
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        liblock_thread_create(&tids[i], NULL, thread, NULL);   /* creation of a thread */
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tids[i], NULL);
    }
    liblock_lock_destroy(&lock);
    return 0;
}

SLIDE 13

Example of critical section execution in RCL

(Same code as on Slide 12. The highlighted parameter &topology->hw_threads[0] passed to liblock_lock_init is (1) the core number on which the RCL-server will run.)

SLIDE 14

Example of critical section execution in RCL

Same structure as on Slide 12, with global_var now allocated dynamically:

int *global_var = NULL;

void *cs(void* arg) {
    (*global_var)++;
    return NULL;
}

int main() {
    global_var = malloc(sizeof(*global_var));   /* (2) memory allocation in a NUMA system */
    *global_var = 0;
    /* (1) core number of the RCL-server */
    liblock_lock_init(liblock_name, &topology->hw_threads[0], &lock, 0);
    /* thread creation, joining and lock destruction as on Slide 12 */
    ...
}

SLIDE 15

Example of critical section execution in RCL

liblock_lock_t lock;
const char* liblock_name = "rcl";
int *global_var = NULL;

/* Critical section: executed on the RCL-server core */
void *cs(void* arg) {
    (*global_var)++;
    return NULL;
}

void *thread(void* arg) {
    int i;
    for (i = 0; i < NITERS; i++) {
        liblock_exec(&lock, cs, NULL);              /* critical section execution */
    }
    return NULL;
}

int main() {
    global_var = malloc(sizeof(*global_var));       /* (2) memory allocation in a NUMA system */
    *global_var = 0;
    /* (1) core number: lock initialization, RCL-server on hardware thread 0 */
    liblock_lock_init(liblock_name, &topology->hw_threads[0], &lock, 0);
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        /* (3) the user must bind threads and account for the structure of the CS himself */
        liblock_thread_create_and_bind(&topology->hw_threads[1], 0,
                                       &tids[i], NULL, thread, NULL);
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tids[i], NULL);
    }
    liblock_lock_destroy(&lock);
    return 0;
}

SLIDE 16

Drawbacks of the current RCL implementation

1. No mechanism for automatically choosing the CPU core during RCL-lock initialization. The user has to choose the processor cores for RCL-lock initialization himself and take the hierarchical structure of the computer system into account.

2. Memory allocation does not take non-uniform memory access into account. The execution time of a multithreaded program that uses RCL on a NUMA system depends essentially on the core on which the RCL-server runs (CPU affinity) and on the NUMA nodes from which memory is allocated.

3. The current RCL implementation has no automatic affinity of worker threads that takes the RCL-server affinity and the hierarchical structure of the computer system into account. When creating threads, the user has to consider the RCL-server affinity, the affinity of the created threads, and the hierarchical structure of the computer system.

SLIDE 17

Model of multicore computer system

[Figure: example system — two NUMA nodes with cores 1–8, per-core L1/L2 caches, shared L3 caches, memory, and a QuickPath / Hyper-Transport interconnect]

Q = {1, 2, …, O} – set of processor cores of the distributed computer system (CS)
M – number of hierarchical levels of the CS
o_m – number of elements on level m
o_ml – number of direct child nodes of element l ∈ {1, 2, …, o_m} on level m
d_ml – number of processor cores belonging to the descendants of element l on level m
D_ml – set of processor cores belonging to the descendants of element l on level m

(A small hwloc sketch for obtaining these parameters on a real machine follows the reference below.)

Khoroshevsky V., Kurnosov M. Mapping Parallel Programs into Hierarchical Distributed Computer Systems // Proceedings of 4th International Conference “Software and Data Technologies (ICSOFT 2009)”. - Sofia: INSTICC, 2009. - Vol. 2. - P. 123-128.
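
The following sketch, added for illustration, walks the machine topology with hwloc (the library the implementation relies on) and prints the quantities of the model above: the number of levels M, the number of elements o_m per level, and the number of processing units under each element (d_ml). It is not part of the original slides.

#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int M = hwloc_topology_get_depth(topo);            /* number of hierarchical levels */
    for (int m = 0; m < M; m++) {
        int o_m = hwloc_get_nbobjs_by_depth(topo, m);   /* elements on level m */
        printf("level %d: %d element(s)\n", m, o_m);
        for (int l = 0; l < o_m; l++) {
            hwloc_obj_t obj = hwloc_get_obj_by_depth(topo, m, l);
            /* d_ml: number of processing units under element l of level m */
            int d_ml = hwloc_bitmap_weight(obj->cpuset);
            printf("  %s #%d: %u child(ren), %d PU(s) underneath\n",
                   hwloc_obj_type_string(obj->type), l, obj->arity, d_ml);
        }
    }
    hwloc_topology_destroy(topo);
    return 0;
}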

SLIDE 18

Algorithm RCLLockInitNUMA of RCL-lock initialization

INITLIBRARY()
1  TRYSETMEMBIND(defaultNode)
2  topology = INITHWLOCTOPOLOGY()

SLIDE 19

Algorithm RCLLockInitNUMA of RCL-lock initialization

RCLLOCKINITNUMA()
1  node_usage = GETNODESUSAGE(node_usage)
2  TRYSETMEMBIND(nodes_usage)
3  core = GETFREECORE(nodes_usage)
4  RCLLOCKINITDEFAULT(core)

(INITLIBRARY as on Slide 18.)

SLIDE 20

Algorithm RCLLockInitNUMA of RCL-lock initialization

GETNODESUSAGE(node_usage)
1  for core = 1 to N do
2      if ISSERVERRUNNING(core) then
3          nb_free_cores = nb_free_cores + 1
4      else
5          nodes_usage[GETNODE(core)]++

(INITLIBRARY and RCLLOCKINITNUMA as on Slides 18–19.)

SLIDE 21

Algorithm RCLLockInitNUMA of RCL-lock initialization

TRYSETMEMBIND(nodes_usage)
1  for i = 0 to nnodes do
2      if nodes_usage[i] > 0 then
3          nb_busy_nodes++
4          node = i
5  if nb_busy_nodes = 1 then
6      SETMEMBIND(node)

(Other procedures as on Slides 18–20.)

SLIDE 22

Algorithm RCLLockInitNUMA of RCL-lock initialization

GETFREECORE(nodes_usage)
1  if nb_free_cores ≤ 1 then
2      core = GetNextCoreInRRFashion()
3  else
4      node = GetMostBusyNode(nodes_usage)
5      for each core in node do
6          if !IsServerRunning(core) then
7              return core

(INITLIBRARY, RCLLOCKINITNUMA, GETNODESUSAGE and TRYSETMEMBIND as on Slides 18–21.)
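
Below is a minimal sketch (my own, not the authors' code) of the memory-binding idea behind TRYSETMEMBIND / SETMEMBIND: if all RCL-servers so far occupy cores of a single NUMA node, bind future allocations to that node via libnuma. The bookkeeping array and the helper names (rcl_server_on_cpu, try_set_membind) are assumptions.

#include <numa.h>
#include <stdbool.h>

#define MAX_CPUS 1024
static bool rcl_server_on_cpu[MAX_CPUS];   /* assumed bookkeeping: cores hosting RCL-servers */

/* Bind memory allocations to a node if it is the only node hosting RCL-servers. */
static void try_set_membind(int ncpus) {
    if (numa_available() < 0)
        return;                            /* not a NUMA system */
    int nnodes = numa_max_node() + 1;
    int usage[nnodes];
    for (int i = 0; i < nnodes; i++) usage[i] = 0;
    for (int cpu = 0; cpu < ncpus; cpu++)
        if (rcl_server_on_cpu[cpu])
            usage[numa_node_of_cpu(cpu)]++;
    int busy_nodes = 0, node = -1;
    for (int i = 0; i < nnodes; i++)
        if (usage[i] > 0) { busy_nodes++; node = i; }
    if (busy_nodes == 1) {
        struct bitmask *bm = numa_allocate_nodemask();
        numa_bitmask_setbit(bm, node);
        numa_set_membind(bm);              /* future allocations come from 'node' */
        numa_free_nodemask(bm);
    }
}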

SLIDE 23

Algorithm RCLHierarchicalAffinity of thread affinity

RCLHIERARCHICALAFFINITY(thread_attr)
 1  if ISREGULARTHREAD(thread_attr) then
 2      for cpu = 1 to N do
 3          if ISRCLONCPU(cpu) then
 4              nthr_per_cpu = 0
 5              obj = cpu
 6              do
 7                  if NOCOVEROBJ(obj) then
 8                      nthr_per_cpu++
 9                      obj = cpu
10                  obj = GETCOVEROBJECT(obj)
11                  free_cpu =
12                      GETFREECPU(obj, nthr_per_cpu)
13              while free_cpu = ∅
14              SETAFFINITY(free_cpu, thread_attr)

SLIDE 24

Algorithm RCLHierarchicalAffinity of thread affinity

GETCOVEROBJECT(obj)
1  ncpus = GetCPUCountInsideObj(obj)
2  do
3      ncpus_prev = ncpus
4      obj = GetParent(obj)
5      ncpus = GetCPUCountInsideObj(obj)
6  while ncpus = ncpus_prev
7  return obj

(RCLHIERARCHICALAFFINITY as on Slide 23.)

SLIDE 25

Algorithm RCLHierarchicalAffinity of thread affinity

GETFREECPU(obj, nthr_per_cpu)
1  for core in obj do
2      if CPUIsFree(core) and busy_cpus[core] ≤ nthr_per_cpu then
3          return core

(RCLHIERARCHICALAFFINITY as on Slide 23, GETCOVEROBJECT as on Slide 24.)
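
For illustration, a minimal sketch (not the authors' code) of the mechanism that SETAFFINITY(free_cpu, thread_attr) amounts to on Linux: binding a thread that is about to be created to a chosen core through a pthread attribute. The value chosen_core stands for the core selected near the RCL-server by RCLHierarchicalAffinity.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Bind the thread created with 'attr' to core 'chosen_core'. */
static int set_affinity_attr(pthread_attr_t *attr, int chosen_core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(chosen_core, &set);
    return pthread_attr_setaffinity_np(attr, sizeof(set), &set);
}

/* Usage sketch:
   pthread_attr_t attr;
   pthread_attr_init(&attr);
   set_affinity_attr(&attr, chosen_core);
   pthread_create(&tid, &attr, thread, NULL);  */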

SLIDE 26

Experiments: software and hardware

Software: GNU/Linux Fedora 20 (Jet), CentOS 6.0 (Oak); compiler GCC 5.3.0

Node of cluster Oak (NUMA system) of SibSUTIS (Novosibirsk):
▪ 2 x Intel Xeon E5620 (2.4 GHz, 4 cores) – 8 cores in total
▪ Cache: L1 (32 KB), L2 (256 KB), L3 (12 MB)
▪ Memory: 24 GB
▪ Ratio of access times to local and remote NUMA-node memory: 10 : 21

Node of cluster Jet (SMP system) of SibSUTIS (Novosibirsk):
▪ 2 x Intel Xeon E5420 (2.4 GHz, 4 cores) – 8 cores in total
▪ Cache: L1 (32 KB), L2 (6 MB)
▪ Memory: 8 GB

SLIDE 27

Experiments: benchmarks

Synthetic benchmark:
▪ Cyclic access to the elements of an integer array
▪ Array size: c = 5 × 10^8 elements
▪ Number of operations: o = 10^8 / q
▪ Operation pattern: increment an integer variable by 1
▪ Element access patterns:
  ▫ Sequential access
  ▫ Random access
  ▫ Strided access with an interval of t = 20 elements

Performance indicators

Throughput: b = o / u, where u is the benchmark execution time.
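
A minimal sketch of what one thread's benchmark loop might look like, using a plain pthread mutex in place of the liblock/RCL API; the helper name run_ops and the pattern constants are illustrative assumptions, not the actual Membench code.

#include <pthread.h>
#include <stdlib.h>

enum pattern { SEQUENTIAL, RANDOM, STRIDED };

static long   *array;          /* c elements */
static size_t  c;              /* array size */
static size_t  stride = 20;    /* interval for strided access */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Perform n_ops increments, each on an element chosen by the access pattern. */
static void run_ops(size_t n_ops, enum pattern p) {
    size_t idx = 0;
    for (size_t i = 0; i < n_ops; i++) {
        switch (p) {
        case SEQUENTIAL: idx = i % c;                break;
        case RANDOM:     idx = (size_t)rand() % c;   break;
        case STRIDED:    idx = (i * stride) % c;     break;
        }
        pthread_mutex_lock(&lock);     /* critical section */
        array[idx]++;                  /* increment an integer variable by 1 */
        pthread_mutex_unlock(&lock);
    }
}
/* Throughput b = total operations / elapsed time (reported as 1000 op/s in the plots). */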

SLIDE 28

Thread affinity to processor cores

[Figure: placement of the RCL-server, the memory-allocating thread (A), and 1–7 working threads (T1–T7) on cores 1–8 of the two NUMA nodes, for each number of working threads]

SLIDE 29

Results of experiments, algorithm RCLLockInitNUMA

[Plot: throughput b (1000 op/s) vs. number of threads p, comparing Default and RCLLockInitNUMA for sequential, random, and strided access]

SLIDE 30

Results of experiments, algorithm RCLLockInitNUMA (no thread affinity)

[Plot: throughput b (1000 op/s) vs. number of threads p, comparing Default and RCLLockInitNUMA for sequential, random, and strided access]

SLIDE 31

Results of experiments, algorithm RCLLockInitNUMA

[Plot: throughput b (1000 op/s) vs. number of threads p, comparing Default and RCLLockInitNUMA for sequential, random, and strided access]

SLIDE 32

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Oak, 2 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3); Affinity 2: cores (2, 5); Affinity 3: cores (5, 6).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T2 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–3 for Default and RCLLockInitNUMA with sequential, random, and strided access]

SLIDE 33

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Oak, 3 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4); Affinity 2: cores (5, 6, 7); Affinity 3: cores (2, 3, 5); Affinity 4: cores (2, 5, 6).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T3 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–4 for Default and RCLLockInitNUMA with sequential, random, and strided access]

SLIDE 34

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Oak, 4 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4, 5); Affinity 2: cores (2, 3, 5, 6); Affinity 3: cores (2, 5, 6, 7); Affinity 4: cores (5, 6, 7, 8).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T4 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–4 for Default and RCLLockInitNUMA with sequential, random, and strided access]

SLIDE 35

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Oak, 5 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4, 5); Affinity 2: cores (2, 3, 5, 6); Affinity 3: cores (2, 3, 5).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T5 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–3 for Default and RCLLockInitNUMA with sequential, random, and strided access]

SLIDE 36

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Jet, 2 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 5); Affinity 2: cores (2, 3); Affinity 3: cores (5, 6); Affinity 4: cores (3, 4); Affinity 5: cores (3, 5).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T2 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–5 for Default with sequential, random, and strided access]

SLIDE 37

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Jet, 3 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4); Affinity 2: cores (5, 6, 7); Affinity 3: cores (2, 3, 5); Affinity 4: cores (2, 5, 6).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T3 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–4 for Default with sequential, random, and strided access]

SLIDE 38

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Jet, 4 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4, 5); Affinity 2: cores (2, 3, 5, 6); Affinity 3: cores (2, 5, 6, 7); Affinity 4: cores (5, 6, 7, 8).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T4 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–4 for Default with sequential, random, and strided access]

SLIDE 39

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Jet, 5 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4, 5); Affinity 2: cores (2, 3, 5, 6); Affinity 3: cores (2, 3, 5).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T5 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–3 for Default with sequential, random, and strided access]

SLIDE 40

Software implementation of the algorithms

[Diagram: software stack — the Liblock library (RCL, PThread mutex, MCS, Spinlock, Flat combining, CCSynch/DSMSynch, RCLLockInitNUMA, RCLHierarchicalAffinity), libnuma, hwloc, and Membench]

▪ Liblock – a library implementing the locks
▪ The algorithms RCLLockInitNUMA and RCLHierarchicalAffinity are implemented as functions of the Liblock library
▪ libnuma – a library providing memory affinity in NUMA systems
▪ hwloc – used to obtain information about the hierarchical structure of the computer system
▪ Membench – the synthetic benchmark

SLIDE 41

Thank you for your attention!

SLIDE 42

Table of requests in RCL

Each request entry is one cache line in size; the server thread loops over the table:

        Lock      Arguments     Function
req1    NULL      NULL          NULL
req2    &lock1    0x34dcf8af    &func1
req3    &lock2    0x45adcf9e    &func2
…
reqn    NULL      NULL          NULL
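
To make the table concrete, here is a rough sketch of a cache-line-sized request slot and the server loop, based on the RCL design of Lozi et al.; it is not the actual liblock source, and all names are illustrative (a real implementation would use atomics rather than volatile).

#include <stddef.h>
#include <stdalign.h>

#define CACHE_LINE 64
#define MAX_CLIENTS 128

/* One request slot per client thread, padded to one cache line. */
typedef struct {
    alignas(CACHE_LINE) void *volatile lock;   /* lock the client wants (NULL = no request) */
    void *volatile arg;                        /* argument for the critical section */
    void *(*volatile func)(void *);            /* critical section to execute */
} rcl_request_t;

static rcl_request_t requests[MAX_CLIENTS];

/* Server thread: repeatedly scan the table and execute pending critical sections. */
static void *rcl_server_loop(void *unused) {
    (void)unused;
    for (;;) {
        for (int i = 0; i < MAX_CLIENTS; i++) {
            if (requests[i].func != NULL) {
                requests[i].func(requests[i].arg);   /* run the client's critical section */
                requests[i].func = NULL;             /* signal completion to the client */
            }
        }
    }
    return NULL;
}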

SLIDE 43

Results of experiments

[Plot: throughput b (1000 op/s) vs. number of threads p]

SLIDE 44

Algorithm RCLLockInitNUMA of lock initialization

(Backup slide: combined pseudocode of INITLIBRARY, RCLLOCKINITNUMA, GETNODESUSAGE and GETFREECORE, as on Slides 18–22; in this version INITLIBRARY calls TrySetAffinity(defaultNode) rather than TRYSETMEMBIND.)

SLIDE 45

Delegating critical section execution to processor cores

[Figure: timeline of threads T1–T3 executing critical sections CS1–CS3; access to a global variable v lies on the critical path]

Time of critical section execution (critical path): u = u1 + u2 + u4, where u1 is the time to execute the instructions of the critical section, u2 is the time to transfer the right to execute the critical section, and u4 is the time of access to global variables.

Development of locking algorithms that ensure localization of memory accesses is required.

SLIDE 46

Experimental setup – hardware and software

Subsystem configuration:
▪ Node of cluster Oak (NUMA system)
  ▫ 2 x Intel Xeon E5620 (2.4 GHz, 4 cores)
  ▫ Cache: L1 (32 KB), L2 (256 KB), L3 (12 MB)
  ▫ Main memory: 24 GB
  ▫ Ratio of access speeds to the local and remote memory segments: 10 : 21
▪ Node of cluster Jet (SMP system)
  ▫ 2 x Intel Xeon E5420 (2.4 GHz, 4 cores)
  ▫ Cache: L1 (32 KB), L2 (6 MB)
  ▫ Main memory: 8 GB

Software:
▪ GNU/Linux Fedora 20 (Jet), CentOS 6.0 (Oak)
▪ GCC 5.3.0

SLIDE 47

Experimental setup – benchmark programs

Synthetic benchmark:
▪ Cyclic access to the elements of an integer array
▪ Array size: c = 5 × 10^8 elements
▪ Number of operations: o = 10^8 / q
▪ Operation pattern: increment a variable by 1
▪ Element access patterns:
  ▫ Sequential access
  ▫ Random access
  ▫ Strided access with an interval of t = 1000 elements

Real multithreaded programs:
▪ SPLASH-2 – a suite of application benchmarks and numerical kernels
  ▫ Barnes, FMM – simulation of interacting physical bodies (N-body methods)
  ▫ Cholesky, LU – linear-algebra methods for solving systems of linear equations
  ▫ FFT – fast Fourier transform
  ▫ Ocean – simulation of ocean currents
  ▫ Radiosity, Raytrace, Volrend – rendering of 3D graphics scenes
  ▫ Radix – radix sort
  ▫ Water-Nsquared, Water-Spatial – computational fluid dynamics problems

SLIDE 48

Experimental setup – benchmark programs

Real multithreaded programs:
▪ Phoenix – an implementation of MapReduce for multicore computer systems
  ▫ Histogram – computing the histogram of an RGB image
  ▫ Linear regression – fitting a line to a set of points
  ▫ String match – searching for a string in a text file
  ▫ Word count – counting word frequencies in a document
  ▫ Matrix multiply – multiplication of integer matrices
  ▫ Kmeans – clustering of a set of points
  ▫ PCA – principal component analysis
  ▫ Reverse index – building an index of links from a given HTML file