SLIDE 1

Algorithms for Optimization of Remote Core Locking Synchronization in Hierarchical Multicore Computer Systems

Paznikov Alexey apaznikov@gmail.com

Saint Petersburg Electrotechnical University "LETI"; Siberian State University of Telecommunications and Information Sciences, Novosibirsk; Rzhanov Institute of Semiconductor Physics, Siberian Branch of RAS, Novosibirsk

The first summer school on practice and theory of concurrent computing

July 3–7, 2017, St. Petersburg

ITMO University

SLIDE 2

Multicore computer systems with shared memory

[Figure: two NUMA nodes (NUMA-node 1, NUMA-node 2), each with CPU cores C1–C8, per-core L1/L2 caches, a shared L3 cache, and a memory controller, interconnected via QuickPath / Hyper-Transport]

Architecture features of modern multicore computer systems (CS):
▪ Scalability: the number of CPU cores reaches 10^2 – 10^3
▪ Hierarchical structure: multilevel caches, logical processors
▪ Non-uniform memory access (NUMA systems)
▪ Heterogeneous structure: specialized accelerators and co-processors
▪ A variety of mechanisms for maintaining a consistent view of memory

SLIDE 3

Synchronization in multithreaded programs: Locks

[Figure: threads T1 and T2 serialize on a lock around a critical section]

With locks, parallel execution of critical sections is serialized, i.e. replaced by sequential execution.

Drawbacks:
▪ Low scalability
▪ Serialization
▪ Deadlocks, livelocks
▪ Starvation
▪ Priority inversion
▪ Unpredictable order of critical section execution (not FIFO)
▪ Noticeable overheads

Implementations:
▪ Queue locks (CLH, MCS)
▪ Spinlocks (test-and-set-based, exponential-backoff-based)
▪ Flat combining (CC-Synch, DSM-Synch, Oyama lock, etc.)
▪ Futex-based (PThreads mutex, PThreads read-write mutex)
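
To make the simplest of these implementations concrete, here is a minimal sketch of a test-and-set spinlock with exponential backoff, written with C11 atomics. It is an illustration added to these notes, not code from the slides; the names spin_lock and spin_unlock are placeholders.

#include <stdatomic.h>
#include <sched.h>

typedef struct { atomic_flag flag; } spinlock_t;
#define SPINLOCK_INITIALIZER { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l) {
    unsigned backoff = 1;
    /* test-and-set: loop until the flag was previously clear */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire)) {
        for (unsigned i = 0; i < backoff; i++)
            sched_yield();                  /* back off before retrying */
        if (backoff < 1024)
            backoff *= 2;                   /* exponential backoff */
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}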

SLIDE 4

Synchronization in multithreaded programs: Lock-free algorithms and concurrent data structures

[Figure: threads T1 and T2 competing on shared data with repeated CAS (compare-and-swap) operations]

Atomic operations (CAS – compare-and-swap, LL/SC – load-linked / store-conditional) are used to ensure thread safety.

Drawbacks:
▪ High complexity of parallel program development
▪ ABA problem
▪ Atomic operations apply only to variables of machine-word size
▪ Low efficiency of atomic operations

Modern algorithms:
▪ Lock-free producer-consumer
▪ Exponential backoff
▪ Elimination arrays
▪ Diffracting trees
▪ Sorting networks
▪ Cliff Click hash table
▪ Skip-list
▪ Split-ordering

Solutions for the ABA problem:
▪ Quiescent-state-based schemes
▪ Pointer-based schemes (hazard pointers, drop-the-anchor, pass-the-buck)
▪ Reference counting
▪ Tagged state reference
▪ Intermediate nodes
▪ TM-based
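
For illustration (not from the slides), a minimal sketch of the CAS retry pattern in C11 atomics: a lock-free counter increment and a Treiber-style stack push. The names lf_counter_inc and lf_push are placeholders; a production pop operation would additionally need one of the ABA solutions listed above.

#include <stdatomic.h>

/* Lock-free increment: retry the CAS until no other thread interferes. */
static void lf_counter_inc(_Atomic long *counter) {
    long old = atomic_load_explicit(counter, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(counter, &old, old + 1,
                                                  memory_order_release,
                                                  memory_order_relaxed)) {
        /* 'old' now holds the current value; just retry */
    }
}

struct node { struct node *next; int value; };

/* Treiber-stack push onto an atomic top-of-stack pointer. */
static void lf_push(struct node *_Atomic *top, struct node *n) {
    n->next = atomic_load_explicit(top, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(top, &n->next, n,
                                                  memory_order_release,
                                                  memory_order_relaxed)) {
        /* n->next was reloaded with the current top; retry */
    }
}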

SLIDE 5

Synchronization in multithreaded programs: Transactional memory (TM)

[Figure: threads T1 and T2 execute transactions on shared memory]

Transactional sections ensure thread-safe access to shared memory regions (the protected unit is a memory area, not a code section).

Drawbacks:
▪ Very high overheads
▪ Transactions can be aborted
▪ Restricted set of operations inside transactional sections
▪ Debugging complexity
▪ The program has to be refactored

TM implementations:
▪ GCC TM
▪ LazySTM
▪ TinySTM
▪ DTMC
▪ RSTM
▪ STM Monad
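
As a small illustration of the first implementation in the list, here is a sketch of a transactional section using the GCC TM language extension (built with -fgnu-tm); the account variables are hypothetical and not from the slides.

/* build: gcc -fgnu-tm tm_example.c */
long account_a = 100, account_b = 0;

void transfer(long amount) {
    /* the compiler executes this block as a single transaction */
    __transaction_atomic {
        account_a -= amount;
        account_b += amount;
    }
}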

SLIDE 6

Synchronization in multithreaded programs: Locks (continued)

[Figure: threads T1 and T2 serialize on a lock around a critical section]

Parallel execution of critical sections is serialized (drawbacks and implementations as on Slide 3). In addition:
▪ Bottlenecks
▪ High access contention → expensive lock acquisition

Development of more efficient locking algorithms is therefore a pressing task.

SLIDE 7

Time of critical section execution

[Figure: timeline of threads T1–T3 executing critical sections CS1–CS3; the critical path includes the transfer of lock ownership]

Time of critical section execution (critical path): u = u1 + u2, where u1 is the time to execute the instructions of the critical section and u2 is the time of transfer of lock ownership.

Time of transfer of lock ownership in the existing locking algorithms:
▪ Spinlock → access to a global flag
▪ PThread mutex → context switch
▪ MCS → thread activation
▪ Flat combining → acquisition of the global lock

Development of locking algorithms that minimize the time of transfer of lock ownership is relevant.

SLIDE 8

Time of critical section execution

[Figure: timeline of threads T1–T3 executing critical sections CS1–CS3; access to a global variable v adds to the critical path]

Time of critical section execution (critical path): u = u1 + u2 + u4, where u1 is the time to execute the instructions of the critical section, u2 is the time of transfer of lock ownership, and u4 is the time of access to global variables.

Localization of memory accesses in the existing locking algorithms:
▪ Spinlock → no localization
▪ PThread mutex → no localization
▪ MCS → no localization
▪ Flat combining → partial localization

Development of locking algorithms that ensure memory access localization is relevant today.

SLIDE 9

Remote Core Locking (RCL) technique

[Figure: threads T1–T3 delegate critical sections CS1–CS3 to an RCL-server running on a dedicated core X; global variables v1, v2 stay local to the server core, shortening the critical path]

Remote Core Locking (RCL) minimizes the critical path of critical section execution, u = u1 + u2 + u3, by minimizing:
▪ the time u2 of transfer of lock ownership
▪ the time u3 of access to global variables

Lozi J.-P. et al. Remote Core Locking: Migrating Critical-Section Execution to Improve the Performance of Multithreaded Applications // USENIX Annual Technical Conference, 2012. P. 65–76.

SLIDE 10

Remote Core Locking (RCL) technique

[Figure: the RCL-server occupies one core of a two-node NUMA system (cores 1–8 with L1/L2/L3 caches and memory, nodes connected via QuickPath / Hyper-Transport); threads T1–T3 delegate critical sections CS1–CS3 to it]

RCL minimizes the critical path u = u1 + u2 + u3 by minimizing the time u2 of transfer of lock ownership and the time u3 of access to global variables (see Slide 9; Lozi et al., USENIX ATC 2012, pp. 65–76).

SLIDE 11

Remote Core Locking (RCL) technique

[Figure: the RCL-server on a two-node NUMA system; global variables v1, v2 are accessed from the server core]

Time of critical section execution (critical path): u = u1 + u2 + u3. The time u3 of access to a global variable inside a critical section depends on:
▪ on which NUMA node the memory for the variable is allocated
▪ on which processor core the RCL-server runs
▪ which thread, on which processor core, accessed the variable last

(Lozi et al., USENIX ATC 2012, pp. 65–76.)

SLIDE 12

Example of critical section execution in RCL

liblock_lock_t lock;
const char* liblock_name = "rcl";
int global_var = 0;

/* Critical section: executed on the RCL-server core */
void *cs(void* arg) {
    global_var++;
    return NULL;
}

void *thread(void* arg) {
    int i;
    for (i = 0; i < NITERS; i++) {
        liblock_exec(&lock, cs, NULL);          /* critical section execution */
    }
    return NULL;
}

int main() {
    /* lock initialization: the RCL-server runs on hardware thread 0 */
    liblock_lock_init(liblock_name, &topology->hw_threads[0], &lock, 0);
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        liblock_thread_create(&tids[i], NULL, thread, NULL);   /* creation of a thread */
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tids[i], NULL);
    }
    liblock_lock_destroy(&lock);
    return 0;
}

SLIDE 13

Example of critical section execution in RCL

(Same code as on Slide 12. The highlighted parameter &topology->hw_threads[0] passed to liblock_lock_init is (1) the core number on which the RCL-server will run.)

SLIDE 14

Example of critical section execution in RCL

Same structure as on Slide 12, with global_var now allocated dynamically:

int *global_var = NULL;

void *cs(void* arg) {
    (*global_var)++;
    return NULL;
}

int main() {
    global_var = malloc(sizeof(*global_var));   /* (2) memory allocation in a NUMA system */
    *global_var = 0;
    /* (1) core number of the RCL-server */
    liblock_lock_init(liblock_name, &topology->hw_threads[0], &lock, 0);
    /* thread creation, joining and lock destruction as on Slide 12 */
    ...
}

SLIDE 15

Example of critical section execution in RCL

liblock_lock_t lock;
const char* liblock_name = "rcl";
int *global_var = NULL;

/* Critical section: executed on the RCL-server core */
void *cs(void* arg) {
    (*global_var)++;
    return NULL;
}

void *thread(void* arg) {
    int i;
    for (i = 0; i < NITERS; i++) {
        liblock_exec(&lock, cs, NULL);              /* critical section execution */
    }
    return NULL;
}

int main() {
    global_var = malloc(sizeof(*global_var));       /* (2) memory allocation in a NUMA system */
    *global_var = 0;
    /* (1) core number: lock initialization, RCL-server on hardware thread 0 */
    liblock_lock_init(liblock_name, &topology->hw_threads[0], &lock, 0);
    pthread_t tids[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        /* (3) the user must bind threads and account for the structure of the CS himself */
        liblock_thread_create_and_bind(&topology->hw_threads[1], 0,
                                       &tids[i], NULL, thread, NULL);
    }
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tids[i], NULL);
    }
    liblock_lock_destroy(&lock);
    return 0;
}

SLIDE 16

Drawbacks of the current RCL implementation

1. No mechanism for automatically choosing the CPU core during RCL-lock initialization. The user has to choose the processor cores for RCL-lock initialization himself and take the hierarchical structure of the computer system into account.

2. Memory allocation does not take non-uniform memory access into account. The execution time of a multithreaded program that uses RCL on a NUMA system depends essentially on the core on which the RCL-server runs (CPU affinity) and on the NUMA nodes from which memory is allocated.

3. The current RCL implementation has no automatic affinity of worker threads that takes the RCL-server affinity and the hierarchical structure of the computer system into account. When creating threads, the user has to consider the RCL-server affinity, the affinity of the created threads, and the hierarchical structure of the computer system.

SLIDE 17

Model of multicore computer system

[Figure: example system — two NUMA nodes with cores 1–8, per-core L1/L2 caches, shared L3 caches, memory, and a QuickPath / Hyper-Transport interconnect]

Q = {1, 2, …, O} – set of processor cores of the distributed computer system (CS)
M – number of hierarchical levels of the CS
o_m – number of elements on level m
o_ml – number of direct child nodes of element l ∈ {1, 2, …, o_m} on level m
d_ml – number of processor cores belonging to the descendants of element l on level m
D_ml – set of processor cores belonging to the descendants of element l on level m

(A small hwloc sketch for obtaining these parameters on a real machine follows the reference below.)

Khoroshevsky V., Kurnosov M. Mapping Parallel Programs into Hierarchical Distributed Computer Systems // Proceedings of 4th International Conference “Software and Data Technologies (ICSOFT 2009)”. - Sofia: INSTICC, 2009. - Vol. 2. - P. 123-128.
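
The following sketch, added for illustration, walks the machine topology with hwloc (the library the implementation relies on) and prints the quantities of the model above: the number of levels M, the number of elements o_m per level, and the number of processing units under each element (d_ml). It is not part of the original slides.

#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int M = hwloc_topology_get_depth(topo);            /* number of hierarchical levels */
    for (int m = 0; m < M; m++) {
        int o_m = hwloc_get_nbobjs_by_depth(topo, m);   /* elements on level m */
        printf("level %d: %d element(s)\n", m, o_m);
        for (int l = 0; l < o_m; l++) {
            hwloc_obj_t obj = hwloc_get_obj_by_depth(topo, m, l);
            /* d_ml: number of processing units under element l of level m */
            int d_ml = hwloc_bitmap_weight(obj->cpuset);
            printf("  %s #%d: %u child(ren), %d PU(s) underneath\n",
                   hwloc_obj_type_string(obj->type), l, obj->arity, d_ml);
        }
    }
    hwloc_topology_destroy(topo);
    return 0;
}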

SLIDE 18

Algorithm RCLLockInitNUMA of RCL-lock initialization

INITLIBRARY()
1  TRYSETMEMBIND(defaultNode)
2  topology = INITHWLOCTOPOLOGY()

SLIDE 19

Algorithm RCLLockInitNUMA of RCL-lock initialization

RCLLOCKINITNUMA()
1  node_usage = GETNODESUSAGE(node_usage)
2  TRYSETMEMBIND(nodes_usage)
3  core = GETFREECORE(nodes_usage)
4  RCLLOCKINITDEFAULT(core)

(INITLIBRARY as on Slide 18.)

SLIDE 20

Algorithm RCLLockInitNUMA of RCL-lock initialization

GETNODESUSAGE(node_usage)
1  for core = 1 to N do
2      if ISSERVERRUNNING(core) then
3          nb_free_cores = nb_free_cores + 1
4      else
5          nodes_usage[GETNODE(core)]++

(INITLIBRARY and RCLLOCKINITNUMA as on Slides 18–19.)

SLIDE 21

Algorithm RCLLockInitNUMA of RCL-lock initialization

TRYSETMEMBIND(nodes_usage)
1  for i = 0 to nnodes do
2      if nodes_usage[i] > 0 then
3          nb_busy_nodes++
4          node = i
5  if nb_busy_nodes = 1 then
6      SETMEMBIND(node)

(Other procedures as on Slides 18–20.)

SLIDE 22

Algorithm RCLLockInitNUMA of RCL-lock initialization

GETFREECORE(nodes_usage)
1  if nb_free_cores ≤ 1 then
2      core = GetNextCoreInRRFashion()
3  else
4      node = GetMostBusyNode(nodes_usage)
5      for each core in node do
6          if !IsServerRunning(core) then
7              return core

(INITLIBRARY, RCLLOCKINITNUMA, GETNODESUSAGE and TRYSETMEMBIND as on Slides 18–21.)
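
Below is a minimal sketch (my own, not the authors' code) of the memory-binding idea behind TRYSETMEMBIND / SETMEMBIND: if all RCL-servers so far occupy cores of a single NUMA node, bind future allocations to that node via libnuma. The bookkeeping array and the helper names (rcl_server_on_cpu, try_set_membind) are assumptions.

#include <numa.h>
#include <stdbool.h>

#define MAX_CPUS 1024
static bool rcl_server_on_cpu[MAX_CPUS];   /* assumed bookkeeping: cores hosting RCL-servers */

/* Bind memory allocations to a node if it is the only node hosting RCL-servers. */
static void try_set_membind(int ncpus) {
    if (numa_available() < 0)
        return;                            /* not a NUMA system */
    int nnodes = numa_max_node() + 1;
    int usage[nnodes];
    for (int i = 0; i < nnodes; i++) usage[i] = 0;
    for (int cpu = 0; cpu < ncpus; cpu++)
        if (rcl_server_on_cpu[cpu])
            usage[numa_node_of_cpu(cpu)]++;
    int busy_nodes = 0, node = -1;
    for (int i = 0; i < nnodes; i++)
        if (usage[i] > 0) { busy_nodes++; node = i; }
    if (busy_nodes == 1) {
        struct bitmask *bm = numa_allocate_nodemask();
        numa_bitmask_setbit(bm, node);
        numa_set_membind(bm);              /* future allocations come from 'node' */
        numa_free_nodemask(bm);
    }
}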

SLIDE 23

Algorithm RCLHierarchicalAffinity of thread affinity

RCLHIERARCHICALAFFINITY(thread_attr)
 1  if ISREGULARTHREAD(thread_attr) then
 2      for cpu = 1 to N do
 3          if ISRCLONCPU(cpu) then
 4              nthr_per_cpu = 0
 5              obj = cpu
 6              do
 7                  if NOCOVEROBJ(obj) then
 8                      nthr_per_cpu++
 9                      obj = cpu
10                  obj = GETCOVEROBJECT(obj)
11                  free_cpu =
12                      GETFREECPU(obj, nthr_per_cpu)
13              while free_cpu = ∅
14              SETAFFINITY(free_cpu, thread_attr)

SLIDE 24

Algorithm RCLHierarchicalAffinity of thread affinity

GETCOVEROBJECT(obj)
1  ncpus = GetCPUCountInsideObj(obj)
2  do
3      ncpus_prev = ncpus
4      obj = GetParent(obj)
5      ncpus = GetCPUCountInsideObj(obj)
6  while ncpus = ncpus_prev
7  return obj

(RCLHIERARCHICALAFFINITY as on Slide 23.)

SLIDE 25

Algorithm RCLHierarchicalAffinity of thread affinity

GETFREECPU(obj, nthr_per_cpu)
1  for core in obj do
2      if CPUIsFree(core) and busy_cpus[core] ≤ nthr_per_cpu then
3          return core

(RCLHIERARCHICALAFFINITY as on Slide 23, GETCOVEROBJECT as on Slide 24.)
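
For illustration, a minimal sketch (not the authors' code) of the mechanism that SETAFFINITY(free_cpu, thread_attr) amounts to on Linux: binding a thread that is about to be created to a chosen core through a pthread attribute. The value chosen_core stands for the core selected near the RCL-server by RCLHierarchicalAffinity.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Bind the thread created with 'attr' to core 'chosen_core'. */
static int set_affinity_attr(pthread_attr_t *attr, int chosen_core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(chosen_core, &set);
    return pthread_attr_setaffinity_np(attr, sizeof(set), &set);
}

/* Usage sketch:
   pthread_attr_t attr;
   pthread_attr_init(&attr);
   set_affinity_attr(&attr, chosen_core);
   pthread_create(&tid, &attr, thread, NULL);  */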

SLIDE 26

Experiments: software and hardware

Software: GNU/Linux Fedora 20 (Jet), CentOS 6.0 (Oak); compiler GCC 5.3.0

Node of cluster Oak (NUMA system) of SibSUTIS (Novosibirsk):
▪ 2 x Intel Xeon E5620 (2.4 GHz, 4 cores) – 8 cores in total
▪ Cache: L1 (32 KB), L2 (256 KB), L3 (12 MB)
▪ Memory: 24 GB
▪ Ratio of access times to local and remote NUMA-node memory: 10 : 21

Node of cluster Jet (SMP system) of SibSUTIS (Novosibirsk):
▪ 2 x Intel Xeon E5420 (2.4 GHz, 4 cores) – 8 cores in total
▪ Cache: L1 (32 KB), L2 (6 MB)
▪ Memory: 8 GB

SLIDE 27

Experiments: benchmarks

Synthetic benchmark:
▪ Cyclic access to the elements of an integer array
▪ Array size: c = 5 × 10^8 elements
▪ Number of operations: o = 10^8 / q
▪ Operation pattern: increment an integer variable by 1
▪ Element access patterns:
  ▫ Sequential access
  ▫ Random access
  ▫ Strided access with an interval of t = 20 elements

Performance indicators

Throughput: b = o / u, where u is the benchmark execution time.
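
A minimal sketch of what one thread's benchmark loop might look like, using a plain pthread mutex in place of the liblock/RCL API; the helper name run_ops and the pattern constants are illustrative assumptions, not the actual Membench code.

#include <pthread.h>
#include <stdlib.h>

enum pattern { SEQUENTIAL, RANDOM, STRIDED };

static long   *array;          /* c elements */
static size_t  c;              /* array size */
static size_t  stride = 20;    /* interval for strided access */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Perform n_ops increments, each on an element chosen by the access pattern. */
static void run_ops(size_t n_ops, enum pattern p) {
    size_t idx = 0;
    for (size_t i = 0; i < n_ops; i++) {
        switch (p) {
        case SEQUENTIAL: idx = i % c;                break;
        case RANDOM:     idx = (size_t)rand() % c;   break;
        case STRIDED:    idx = (i * stride) % c;     break;
        }
        pthread_mutex_lock(&lock);     /* critical section */
        array[idx]++;                  /* increment an integer variable by 1 */
        pthread_mutex_unlock(&lock);
    }
}
/* Throughput b = total operations / elapsed time (reported as 1000 op/s in the plots). */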

SLIDE 28

Thread affinity to processor cores

[Figure: placement of the RCL-server, the memory-allocating thread (A), and 1–7 working threads (T1–T7) on cores 1–8 of the two NUMA nodes, for each number of working threads]

SLIDE 29

Results of experiments, algorithm RCLLockInitNUMA

[Plot: throughput b (1000 op/s) vs. number of threads p, comparing Default and RCLLockInitNUMA for sequential, random, and strided access]

SLIDE 30

Results of experiments, algorithm RCLLockInitNUMA (no thread affinity)

[Plot: throughput b (1000 op/s) vs. number of threads p, comparing Default and RCLLockInitNUMA for sequential, random, and strided access]

SLIDE 31

Results of experiments, algorithm RCLLockInitNUMA

[Plot: throughput b (1000 op/s) vs. number of threads p, comparing Default and RCLLockInitNUMA for sequential, random, and strided access]

SLIDE 32

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Oak, 2 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3); Affinity 2: cores (2, 5); Affinity 3: cores (5, 6).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T2 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–3 for Default and RCLLockInitNUMA with sequential, random, and strided access]

SLIDE 33

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Oak, 3 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4); Affinity 2: cores (5, 6, 7); Affinity 3: cores (2, 3, 5); Affinity 4: cores (2, 5, 6).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T3 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–4 for Default and RCLLockInitNUMA with sequential, random, and strided access]

SLIDE 34

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Oak, 4 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4, 5); Affinity 2: cores (2, 3, 5, 6); Affinity 3: cores (2, 5, 6, 7); Affinity 4: cores (5, 6, 7, 8).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T4 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–4 for Default and RCLLockInitNUMA with sequential, random, and strided access]

SLIDE 35

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Oak, 5 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4, 5); Affinity 2: cores (2, 3, 5, 6); Affinity 3: cores (2, 3, 5).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T5 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–3 for Default and RCLLockInitNUMA with sequential, random, and strided access]

SLIDE 36

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Jet, 2 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 5); Affinity 2: cores (2, 3); Affinity 3: cores (5, 6); Affinity 4: cores (3, 4); Affinity 5: cores (3, 5).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T2 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–5 for Default with sequential, random, and strided access]

SLIDE 37

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Jet, 3 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4); Affinity 2: cores (5, 6, 7); Affinity 3: cores (2, 3, 5); Affinity 4: cores (2, 5, 6).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T3 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–4 for Default with sequential, random, and strided access]

SLIDE 38

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Jet, 4 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4, 5); Affinity 2: cores (2, 3, 5, 6); Affinity 3: cores (2, 5, 6, 7); Affinity 4: cores (5, 6, 7, 8).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T4 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–4 for Default with sequential, random, and strided access]

SLIDE 39

Results of experiments, algorithm RCLHierarchicalAffinity

Cluster Jet, 5 working threads.
RCLHierarchicalAffinity: Affinity 1: cores (2, 3, 4, 5); Affinity 2: cores (2, 3, 5, 6); Affinity 3: cores (2, 3, 5).

[Figure: placement of the RCL-server, the memory-allocating thread A, and threads T1–T5 for each affinity. Plot: throughput b (1000 op/s) under Affinity 1–3 for Default with sequential, random, and strided access]

SLIDE 40

Software implementation of the algorithms

[Diagram: software stack — the Liblock library (RCL, PThread mutex, MCS, Spinlock, Flat combining, CCSynch/DSMSynch, RCLLockInitNUMA, RCLHierarchicalAffinity), libnuma, hwloc, and Membench]

▪ Liblock – a library implementing the locks
▪ The algorithms RCLLockInitNUMA and RCLHierarchicalAffinity are implemented as functions of the Liblock library
▪ libnuma – a library providing memory affinity in NUMA systems
▪ hwloc – used to obtain information about the hierarchical structure of the computer system
▪ Membench – the synthetic benchmark

SLIDE 41

Thank you for your attention!

SLIDE 42

Table of requests in RCL

Each request entry is one cache line in size; the server thread loops over the table:

        Lock      Arguments     Function
req1    NULL      NULL          NULL
req2    &lock1    0x34dcf8af    &func1
req3    &lock2    0x45adcf9e    &func2
…
reqn    NULL      NULL          NULL
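
To make the table concrete, here is a rough sketch of a cache-line-sized request slot and the server loop, based on the RCL design of Lozi et al.; it is not the actual liblock source, and all names are illustrative (a real implementation would use atomics rather than volatile).

#include <stddef.h>
#include <stdalign.h>

#define CACHE_LINE 64
#define MAX_CLIENTS 128

/* One request slot per client thread, padded to one cache line. */
typedef struct {
    alignas(CACHE_LINE) void *volatile lock;   /* lock the client wants (NULL = no request) */
    void *volatile arg;                        /* argument for the critical section */
    void *(*volatile func)(void *);            /* critical section to execute */
} rcl_request_t;

static rcl_request_t requests[MAX_CLIENTS];

/* Server thread: repeatedly scan the table and execute pending critical sections. */
static void *rcl_server_loop(void *unused) {
    (void)unused;
    for (;;) {
        for (int i = 0; i < MAX_CLIENTS; i++) {
            if (requests[i].func != NULL) {
                requests[i].func(requests[i].arg);   /* run the client's critical section */
                requests[i].func = NULL;             /* signal completion to the client */
            }
        }
    }
    return NULL;
}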

SLIDE 43

Results of experiments

[Plot: throughput b (1000 op/s) vs. number of threads p]

SLIDE 44

Algorithm RCLLockInitNUMA of lock initialization

(Backup slide: combined pseudocode of INITLIBRARY, RCLLOCKINITNUMA, GETNODESUSAGE and GETFREECORE, as on Slides 18–22; in this version INITLIBRARY calls TrySetAffinity(defaultNode) rather than TRYSETMEMBIND.)

SLIDE 45

Delegating critical section execution to processor cores

[Figure: timeline of threads T1–T3 executing critical sections CS1–CS3; access to a global variable v lies on the critical path]

Time of critical section execution (critical path): u = u1 + u2 + u4, where u1 is the time to execute the instructions of the critical section, u2 is the time to transfer the right to execute the critical section, and u4 is the time of access to global variables.

Development of locking algorithms that ensure localization of memory accesses is required.

SLIDE 46

Experimental setup – hardware and software

Subsystem configuration:
▪ Node of cluster Oak (NUMA system)
  ▫ 2 x Intel Xeon E5620 (2.4 GHz, 4 cores)
  ▫ Cache: L1 (32 KB), L2 (256 KB), L3 (12 MB)
  ▫ Main memory: 24 GB
  ▫ Ratio of access speeds to the local and remote memory segments: 10 : 21
▪ Node of cluster Jet (SMP system)
  ▫ 2 x Intel Xeon E5420 (2.4 GHz, 4 cores)
  ▫ Cache: L1 (32 KB), L2 (6 MB)
  ▫ Main memory: 8 GB

Software:
▪ GNU/Linux Fedora 20 (Jet), CentOS 6.0 (Oak)
▪ GCC 5.3.0

SLIDE 47

Experimental setup – benchmark programs

Synthetic benchmark:
▪ Cyclic access to the elements of an integer array
▪ Array size: c = 5 × 10^8 elements
▪ Number of operations: o = 10^8 / q
▪ Operation pattern: increment a variable by 1
▪ Element access patterns:
  ▫ Sequential access
  ▫ Random access
  ▫ Strided access with an interval of t = 1000 elements

Real multithreaded programs:
▪ SPLASH-2 – a suite of application benchmarks and numerical kernels
  ▫ Barnes, FMM – simulation of interacting physical bodies (N-body methods)
  ▫ Cholesky, LU – linear-algebra methods for solving systems of linear equations
  ▫ FFT – fast Fourier transform
  ▫ Ocean – simulation of ocean currents
  ▫ Radiosity, Raytrace, Volrend – rendering of 3D graphics scenes
  ▫ Radix – radix sort
  ▫ Water-Nsquared, Water-Spatial – computational fluid dynamics problems

SLIDE 48

Experimental setup – benchmark programs

Real multithreaded programs:
▪ Phoenix – an implementation of MapReduce for multicore computer systems
  ▫ Histogram – computing the histogram of an RGB image
  ▫ Linear regression – fitting a line to a set of points
  ▫ String match – searching for a string in a text file
  ▫ Word count – counting word frequencies in a document
  ▫ Matrix multiply – multiplication of integer matrices
  ▫ Kmeans – clustering of a set of points
  ▫ PCA – principal component analysis
  ▫ Reverse index – building an index of links from a given HTML file