Multiprocessor Parallelism
ASD Shared Memory HPC Workshop
Computer Systems Group, ANU Research School of Computer Science
Australian National University, Canberra, Australia
February 12, 2020
Schedule - Day 3
Computer Systems (ANU) Multiprocessor Parallelism Feb 12, 2020 2 / 110
Reference Material
Introduction to Parallel Computing, A. Grama, A. Gupta, G. Karypis and V. Kumar, Addison Wesley, 2003 (ISBN 0-201-64865-2)
Parallel Computer Architecture and Programming, Kayvon Fatahalian, Carnegie Mellon University Course 15-418/618 (http://15418.courses.cs.cmu.edu/spring2015/)
Concurrency: State Models and Java Programming, J. Magee and J. Kramer, 2nd edn, Wiley, 2006 (ISBN-10: 0470093552)
The C++ Programming Language, 4th edn, Bjarne Stroustrup, Pearson Education, 2013 (ISBN 978-0-321-56384-2)
Shared Memory Application Programming: Concepts and Strategies in Multicore Application Programming, 1st edn, Victor Alessandrini, Morgan Kaufmann, 2015 (ISBN 978-0128037614)
OS Support for Threads
Outline
1. OS Support for Threads
2. Mutual Exclusion
3. Synchronization Constructs
4. Memory Consistency
5. The OpenMP Programming Model
6. C++ 11 Threads
OS Support for Threads
Concurrent vs Parallel Processing
Concurrent Processing: environment in which tasks are defined and allowed to execute in any order
Does not imply a multiple processor environment
E.g. spawn a separate thread to do I/O while a CPU-intensive task continues with another operation
Parallel Processing: the simultaneous execution of concurrent tasks on different CPUs
All parallel processing is concurrent, but NOT all concurrent processing is parallel
OS Support for Threads
O/S Thread Support
O/S originally designed for process support, not thread support
Requires the O/S to:
Treat threads equally and ensure that they all get equal (or user-defined) access to machine resources
Know what to do when a thread issues a fork command
Be able to deliver a signal to the correct thread
Libraries executed by a multi-threaded program need to be thread safe
Potential for static data structures to be overwritten
Hardware support
Some hardware provides support for very fast context switches between threads, e.g. the Cray MTA or, more recently, hyper-threading
OS Support for Threads
Thread Implementations
Three categories
Pure user space
Pure kernel space
Mixed
User Mode: when a process executes instructions within its program or linked libraries
Kernel Mode: when the operating system executes instructions on behalf of a user program
Often as a result of a system call
In kernel mode the program can access kernel space
OS Support for Threads
User Space Threads
No kernel involvement in providing a thread
Kernel has no knowledge of threads and continues to schedule processes only
Thread library selects which thread to run
OK for concurrency, no good for parallel programming
OS Support for Threads
Kernel Space Threads
Kernel-level thread created for each user thread
O/S must manage on a per-thread basis information typically managed on a process basis
Potentially poor scaling when there are too many threads, as the O/S gets overloaded
OS Support for Threads
User-space v Kernel-space
User Space Advantages
No changes to core O/S
No kernel involvement means they may be faster for some applications
Can create many threads without overloading the system
User Space Disadvantages
Potentially long latency during system service blocking (e.g. one thread stalled on an I/O request)
All threads compete for the same CPU cycles
No advantage gained from multiple CPUs
Mixed scheme
A few user threads mapped to one kernel thread
Mutual Exclusion
Outline
1. OS Support for Threads
2. Mutual Exclusion
3. Synchronization Constructs
4. Memory Consistency
5. The OpenMP Programming Model
6. C++ 11 Threads
Mutual Exclusion
Accessing Shared Data
Concurrent read is no problem, but what about concurrent write?
Result could be x+1 or x+2
Similar problems with access to shared resources like I/O
Critical Sections: mechanism to ensure controlled access to shared resources
Processes enter a critical section under mutual exclusion
Mutual Exclusion
Locks
Simplest mechanism for providing mutual exclusion
A lock is a 1-bit variable; a value of
1 indicates a process is in the critical section
0 indicates no process is in the critical section
At its lowest level a lock is a protocol for coordinating processes;
the CPU is not physically prevented from executing those instructions

    while (lock == 1) do_nothing;  /* infinite testing loop */
    lock = 1;                      /* enter critical section */
    /* critical section */
    lock = 0;                      /* exit critical section */

The above is an incorrect implementation of a spin-lock, which uses a mechanism called busy waiting
Mutual Exclusion
The Assembly
    lock:   ld  R0, mem[addr]   ; load word into R0
            cmp R0, #0          ; if 0, store 1
            bnz lock            ; else, try again
            st  mem[addr], #1

    unlock: st  mem[addr], #0   ; store 0 to address

Problem: data race because LOAD-TEST-STORE is not atomic!
Processor 0 loads address X, observes 0
Processor 1 loads address X, observes 0
Processor 0 writes 1 to address X
Processor 1 writes 1 to address X
(See http://15418.courses.cs.cmu.edu/spring2015/lecture/synchronization)
Mutual Exclusion
Test and Set Lock
Test-and-set instruction:

    ts R0, mem[addr]    ; atomically: load mem[addr] into R0
                        ; if mem[addr] is 0 then set mem[addr] to 1

    lock:   ts  R0, mem[addr]   ; load word into R0
            cmp R0, #0          ; if 0, lock obtained
            bnz lock            ; else, try again

    unlock: st  mem[addr], #0   ; store 0 to address

What does 'atomic' mean in this context?
p threads contending for the lock will require O(p) attempts each, thus O(p^2) time (and energy!)
we can reduce the number of attempts (but still O(p)) by only trying the ts when we see mem[addr] is 0
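The "only try the ts when we see 0" idea above is the test-and-test-and-set lock. A minimal sketch using C11 atomics (the names ttas_lock and ttas_unlock are illustrative, not from any library; atomic_exchange plays the role of the ts instruction):

```c
#include <stdatomic.h>
#include <stdbool.h>

void ttas_lock(atomic_bool *lock) {
    for (;;) {
        /* spin on a plain read first: failed attempts generate no
           read-modify-write traffic on the interconnect */
        while (atomic_load_explicit(lock, memory_order_relaxed))
            ;
        /* attempt the atomic exchange only when the lock looks free */
        if (!atomic_exchange_explicit(lock, true, memory_order_acquire))
            return;  /* we observed false, so we now hold the lock */
    }
}

void ttas_unlock(atomic_bool *lock) {
    atomic_store_explicit(lock, false, memory_order_release);
}
```

The inner read-only spin hits the local cache; the expensive atomic exchange is attempted only when the lock appears free, reducing contention traffic (though worst-case acquisition cost is still O(p)).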
Mutual Exclusion
Common Atomic Operations
Test and Set
bool TestAndSet(bool *lock) {
    bool initial = *lock;
    *lock = true;
    return initial;
}
Fetch and Add (operate)
<<atomic>>
int FetchAndAdd(int *location, int inc) {
    int value = *location;
    *location = value + inc;
    return value;
}
Compare and Swap (Intel, CMPXCHG)
bool cas(int *p, int old, int new) {
    if (*p != old)
        return false;
    *p = new;
    return true;
}
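The pseudocode above describes the semantics; in portable C these operations are available through C11's <stdatomic.h>. A sketch (the wrapper names fetch_and_add and cas_int are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Fetch-and-add: atomically add inc to *loc, returning the old value */
int fetch_and_add(atomic_int *loc, int inc) {
    return atomic_fetch_add(loc, inc);
}

/* Compare-and-swap: if *p == old_v, set *p = new_v and return true */
bool cas_int(atomic_int *p, int old_v, int new_v) {
    /* old_v is a local copy, so the failure-path write-back to it
       (required by the C11 API) is harmless here */
    return atomic_compare_exchange_strong(p, &old_v, new_v);
}
```

On x86 these compile down to LOCK XADD and LOCK CMPXCHG respectively, i.e. the same hardware primitives the slide names.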
Mutual Exclusion
Locks: Phases and Properties
Phases
Waiting to acquire the lock (busy wait, or de-schedule and be woken later)
Lock acquisition
Releasing the lock
Properties
Fast to acquire in the absence of contention
Low interconnect traffic
Scalability
Low storage cost
Fairness
Do not degrade the overall system (e.g. when more threads than CPUs contend for a lock)
Mutual Exclusion
Pthread Lock Routines
Locks are implemented as mutually exclusive lock variables, or mutex variables
To use one, a variable must be declared as type pthread_mutex_t
Usually in main thread as it needs to be visible to all threads using it
pthread_mutex_t mutex1;
pthread_mutex_init(&mutex1, NULL);   /* NULL specifies default attributes for the mutex */
pthread_mutex_lock(&mutex1);
/* CRITICAL SECTION */
pthread_mutex_unlock(&mutex1);
Mutual Exclusion
Synchronization Terminology
Deadlock
Livelock
Starvation
Mutual Exclusion
Deadlock
http://15418.courses.cs.cmu.edu/spring2015/lecture/snoopimpl
Deadlock is a state where a system has outstanding operations to complete, but no operation can make progress
Can arise when each operation has acquired a shared resource that another operation needs
In a deadlock situation, there is no way for any thread (or, in this illustration, a car) to make progress unless some thread relinquishes a resource ("backs off")
Mutual Exclusion
Deadlock
Can occur with multiple processes seeking to acquire resources that cannot be shared
Pthreads provides pthread_mutex_trylock() to help address these circumstances
Mutual Exclusion
Transactions and Concurrency
https://aries.aston.ac.uk/modules/2010 2011/EE4007/chapter 13 8.html
Mutual Exclusion
Deadlock Conditions
Mutual exclusion: the resources involved must be unsharable
Hold and wait: processes must hold the resources they have been allocated while waiting for other resources needed to complete an operation
No pre-emption: processes cannot have resources taken away from them until they have completed the operation they wish to perform
Circular wait: a cycle exists in the resource dependency graph
Mutual Exclusion
Livelock
http://15418.courses.cs.cmu.edu/spring2015/lecture/snoopimpl
Mutual Exclusion
Starvation
In this example: assume traffic moving left/right (yellow cars) must yield to traffic moving up/down (green cars) http://15418.courses.cs.cmu.edu/spring2015/lecture/snoopimpl
State where a system is making overall progress, but some processes make no progress
(green cars make progress, but yellow cars are stopped)
Starvation is usually not a permanent state
(as soon as green cars pass, yellow cars can go)
Mutual Exclusion
Follow-On Issues
Lots of material on specific barrier implementations
Lock-free data structures
Transactional memory
Mutual Exclusion
Hands-on Exercise: Non-Determinism
Objective: To observe the effect of non-determinism in the form of race conditions on program behaviour and how it can be avoided
Synchronization Constructs
Outline
1. OS Support for Threads
2. Mutual Exclusion
3. Synchronization Constructs
4. Memory Consistency
5. The OpenMP Programming Model
6. C++ 11 Threads
Synchronization Constructs
Semaphores: Concepts
Semaphores are widely used for dealing with inter-process synchronization in operating systems
A semaphore s is an integer variable that can take only non-negative values
The only operations permitted on s are up(s) and down(s):
down(s): if s > 0, decrement s; else block execution of the calling process
up(s): if there are processes blocked on s, awaken one of them; else increment s
Blocked processes or threads are held in a FIFO queue
Synchronization Constructs
Semaphores: Implementation
binary semaphores have only a value of 0 or 1
hence can act as a lock
waking up of threads via an up(s) occurs via an OS signal Posix semaphore API
#include <semaphore.h>
...
int sem_init(sem_t *sem, int pshared, unsigned int value);
int sem_destroy(sem_t *sem);
int sem_wait(sem_t *sem);       /* down() */
int sem_trywait(sem_t *sem);
int sem_timedwait(sem_t *sem, const struct timespec *abstime);
int sem_post(sem_t *sem);       /* up() */
int sem_getvalue(sem_t *sem, int *value);

pshared: if non-zero, generates a semaphore for use between processes
sem_getvalue(): value delivers the number of waiting processes/threads as a negative integer, if there are any waiting on this semaphore
Synchronization Constructs
Monitors and Condition Synchronization
Concepts:
monitors: encapsulated data + access functions + mutual exclusion + condition synchronization
only a single access function can be active in the monitor
Practice:
private shared data and synchronized functions (exclusion)
the latter requires a per-object lock to be acquired in each function
only a single thread may be active in the monitor at a time
condition synchronization: achieved using a condition variable
wait(), notify_one() and notify_all()
Note: monitors serialize all accesses to the encapsulated data!
Synchronization Constructs
Monitors in ‘C’ / POSIX (types and creation)
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_timedlock(pthread_mutex_t *mutex,
                            const struct timespec *abstime);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
int pthread_cond_wait(pthread_cond_t *cond,
                      pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond,
                           pthread_mutex_t *mutex,
                           const struct timespec *abstime);
int pthread_cond_signal(pthread_cond_t *cond);
int pthread_cond_broadcast(pthread_cond_t *cond);

pthread_cond_signal() unblocks at least one thread
pthread_cond_broadcast() unblocks all threads waiting on cond
lock calls can be made at any time (multiple lock activations are possible)
the API is flexible and universal, but relies on conventions rather than compilers
Synchronization Constructs
Condition Synchronization
(courtesy Magee&Kramer: Concurrency)
Synchronization Constructs
Condition Synchronization in Posix
Synchronization between POSIX-threads:
typedef ... pthread_mutex_t;
typedef ... pthread_mutexattr_t;
typedef ... pthread_cond_t;
typedef ... pthread_condattr_t;
int pthread_mutex_init(pthread_mutex_t *mutex,
                       const pthread_mutexattr_t *attr);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
int pthread_cond_init(pthread_cond_t *cond,
                      const pthread_condattr_t *attr);
int pthread_cond_destroy(pthread_cond_t *cond);

Mutex and condition variable attributes include: semantics for trying to lock a mutex which is already locked by the same thread
pthread_mutex_destroy(): undefined if the lock is in use
pthread_cond_destroy(): undefined if there are threads waiting
Synchronization Constructs
Bounded Buffer in Posix
typedef struct bbuf_t {
    int count, N;
    ...
    pthread_mutex_t bufLock;
    pthread_cond_t  notFull;
    pthread_cond_t  notEmpty;
} bbuf_t;

int getBB(bbuf_t *b) {
    pthread_mutex_lock(&b->bufLock);
    while (b->count == 0)
        pthread_cond_wait(&b->notEmpty, &b->bufLock);
    b->count--;
    int v;  /* remove item from buffer */
    pthread_cond_signal(&b->notFull);
    pthread_mutex_unlock(&b->bufLock);
    return v;
}

void putBB(int v, bbuf_t *b) {
    pthread_mutex_lock(&b->bufLock);
    while (b->count == b->N)
        pthread_cond_wait(&b->notFull, &b->bufLock);
    b->count++;  /* add v to buffer */
    pthread_cond_signal(&b->notEmpty);
    pthread_mutex_unlock(&b->bufLock);
}
Synchronization Constructs
Deadlock and its Avoidance
Ideas for deadlock:
concepts: system deadlock: no further progress
model: deadlock: no eligible actions
practice: blocked threads
Our aim: deadlock avoidance: to design systems where deadlock cannot occur
Done by removing one of the four necessary and sufficient conditions:
serially reusable resources: the processes involved share resources which they use under mutual exclusion
incremental acquisition: processes hold on to resources already allocated to them while waiting to acquire additional resources
no pre-emption: once acquired by a process, resources cannot be pre-empted (forcibly withdrawn) but are only released voluntarily
wait-for cycle: a circular chain (or cycle) of processes exists such that each process holds a resource which its successor in the cycle is waiting to acquire
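One standard way to remove the wait-for-cycle condition is to impose a total order on lock acquisition, so no circular chain can form. A sketch using pthreads, ordering the two locks by address (lock_pair and unlock_pair are illustrative names, not a library API):

```c
#include <pthread.h>

/* Acquire two mutexes in a fixed global order (by address), so that any
   two threads locking the same pair cannot form a wait-for cycle. */
static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    if (a < b) {
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    } else {
        pthread_mutex_lock(b);
        pthread_mutex_lock(a);
    }
}

static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
    pthread_mutex_unlock(a);   /* release order does not matter */
    pthread_mutex_unlock(b);
}
```

This removes incremental acquisition as a danger by making every thread acquire resources in the same order; pthread_mutex_trylock() with back-off is the alternative when no global order can be agreed.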
Synchronization Constructs
Wait-for Cycles
Operating systems must deal with deadlock arising from processes requesting resources (e.g. printers, memory, co-processors)
pre-emption involves constructing a resource allocation graph, detecting a cycle and removing a resource
avoidance involves never granting a request that could lead to such a cycle (the Banker's Algorithm)
(courtesy Magee&Kramer: Concurrency)
Synchronization Constructs
The Dining Philosophers Problem
Five philosophers sit around a circular table. Each philosopher spends his life alternately thinking and eating. In the centre of the table is a large bowl of spaghetti. A philosopher needs two forks to eat a helping of spaghetti. One fork is placed between each pair of philosophers, and they agree that each will only use the forks to his immediate right and left. If all 5 sit down at once and each takes the fork on his left, there will be deadlock: a wait-for cycle exists!
(courtesy Magee&Kramer: Concurrency)
Synchronization Constructs
Dining Philosophers: Model Structure Diagram
Each FORK is a shared resource with actions get and put. When hungry, each Phil must first get his right and left forks before he can start eating. Each Fork is either in a taken or not-taken state. This is an example where the resulting monitor state can be distributed (resulting in greater system concurrency).
(courtesy Magee&Kramer: Concurrency)
Memory Consistency
Outline
1. OS Support for Threads
2. Mutual Exclusion
3. Synchronization Constructs
4. Memory Consistency
5. The OpenMP Programming Model
6. C++ 11 Threads
Memory Consistency
Consider the following
initially flag1 = flag2 = 0

// Process 0              // Process 1
flag1 = 1;                flag2 = 1;
if (flag2 == 0)           if (flag1 == 0)
    print "Hello";            print "World";
What is printed?
Memory Consistency
Time Ordered Events
Process 0 / Process 1 interleavings:

1. flag1 = 1
2. if (flag2 == 0) print "Hello";
3. flag2 = 1
4. if (flag1 == 0) print "World";
Output: Hello

1. flag1 = 1
2. flag2 = 1
3. if (flag2 == 0) print "Hello";
4. if (flag1 == 0) print "World";
Output: (nothing printed)

1. flag2 = 1
2. if (flag1 == 0) print "World";
3. flag1 = 1
4. if (flag2 == 0) print "Hello";
Output: World

Never Hello and World?
But what fundamental assumption are we making?
Memory Consistency
Memory Consistency
To write correct and efficient shared memory programs, programmers need a precise notion of how memory behaves with respect to read and write operations from multiple processors (Adve and Gharachorloo) Memory/cache coherency defines requirements for the observed behaviour of reads and writes to the same memory location (ensuring all processors have consistent view of same address) Memory consistency defines the behavior of reads and writes to different locations (as observed by other processors)
(See: Sarita V. Adve and Kourosh Gharachorloo. 1996. Shared Memory Consistency Models: A Tutorial. Computer 29, 12, 66-76. DOI=10.1109/2.546611)
Memory Consistency
Sequential Consistency
Lamport's definition: [A multiprocessor system is sequentially consistent if] the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
Two aspects:
Maintaining program order among operations from individual processors
Maintaining a single sequential order among operations from all processors
The latter aspect makes it appear as if a memory operation executes atomically, or instantaneously, with respect to other memory operations
Memory Consistency
Programmer’s View of Sequential Consistency
Conceptually
There is a single global memory, and a switch that connects an arbitrary processor to memory at any time step
Each process issues memory operations in program order, and the switch provides the global serialization among all memory operations
Memory Consistency
Motivation for Relaxed Consistency
To gain performance
Hide latency of independent memory accesses with other operations
Recall that memory accesses in a cache coherent system may involve much work
Can relax either the program order or atomicity requirements (or both)
e.g. relaxing write→read and write→write ordering allows writes to different locations to be pipelined or overlapped
Relaxing write atomicity allows a read to return another processor's write before all cached copies have been updated
Memory Consistency
Possible Reorderings
We consider first relaxing the program order requirements Four types of memory orderings (within a program):
W→R: write must complete before subsequent read (RAW)
R→R: read must complete before subsequent read (RAR)
R→W: read must complete before subsequent write (WAR)
W→W: write must complete before subsequent write (WAW)

Normally, different addresses are involved in the pair of operations
Relaxing these can give:
W→R: e.g. the write buffer (aka store buffer)
everything: the weak and release consistency models
Memory Consistency
Breaking W → R Ordering: Write Buffers
(Adve and Gharachorloo DOI=10.1109/2.546611)
Write buffer (very common)
Process inserts writes in the write buffer and proceeds, assuming the write completes in due course
Subsequent reads bypass previous writes, using the value in the write buffer
Subsequent writes are processed by the buffer in order (no W→W relaxation)
Memory Consistency
Allowing reads to move ahead of writes
Total Store Ordering (TSO)
Processor P can read B before its write to A is seen by all processors (a process can move its own reads in front of its own writes)
Reads by other processors cannot return the new value of A until the write to A is observed by all processors
Processor consistency(PC)
Any processor can read new value of A before the write is observed by all processors
In TSO and PC, only W → R order is relaxed. The W → W constraint still exists. Writes by the same thread are not reordered (they occur in program order)
See http://15418.courses.cs.cmu.edu/spring2015/lecture/consistency
Memory Consistency
Processor Consistency
Before a LOAD is allowed to perform wrt. any processor, all previous LOAD accesses must be performed wrt. everyone
Before a STORE is allowed to perform wrt. any processor, all previous LOAD and STORE accesses must be performed wrt. everyone
Memory Consistency
Four Example Programs
Do (all possible) results of execution match that of sequential consistency (SC)?

                              Program:  1   2   3   4
Total Store Ordering (TSO)              ✓   ✓   ✓   ✗
Processor Consistency (PC)              ✓   ✓   ✗   ✗

See http://15418.courses.cs.cmu.edu/spring2015/lecture/consistency
Memory Consistency
Clarification
The cache coherency problem exists because of the optimization of duplicating data in multiple processor caches. The copies of the data must be kept coherent. Relaxed memory consistency issues arise from the optimization of reordering memory operations (this is unrelated to whether there are caches in the system).
Memory Consistency
Allowing writes to be reordered
Four types of memory operations orderings
W→R: write must complete before subsequent read
R→R: read must complete before subsequent read
R→W: read must complete before subsequent write
W→W: write must complete before subsequent write
Partial Store Ordering (PSO)
Execution may not match sequential consistency on program 1 (P2 may observe the change to flag before the change to A)

Thread 1 (on P1)          Thread 2 (on P2)
A = 1;                    while (flag == 0);
flag = 1;                 print A;
Memory Consistency
Breaking W → W Ordering: Overlapped Writes
(Adve and Gharachorloo DOI=10.1109/2.546611)
General (non-bus) Interconnect with multiple memory modules
Different memory operations issued by the same processor are serviced by different memory modules
Writes from P1 are injected into the memory system in program order, but they may complete out of program order
Many processors coalesce writes to the same cache line in a write buffer, which can lead to similar effects
Memory Consistency
Allowing all reorderings
Four types of memory operations orderings
W→R: write must complete before subsequent read
R→R: read must complete before subsequent read
R→W: read must complete before subsequent write
W→W: write must complete before subsequent write
Examples:
Weak Ordering (WO) Release Consistency (RC)
Processors support special synchronization operations
Memory accesses before a memory fence instruction must complete before the fence issues
Memory accesses after the fence cannot begin until the fence instruction is complete

    reorderable reads and writes here
    ...
    <MEMORY FENCE>
    ...
    reorderable reads and writes here
    ...
    <MEMORY FENCE>
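The fence pattern above maps directly onto C11's atomic_thread_fence. A sketch of the classic message-passing idiom, where the writer's fence keeps the data store before the flag store and the reader's fence keeps the data load after the flag load (writer/reader/ready are illustrative names):

```c
#include <stdatomic.h>

static int data;               /* ordinary (non-atomic) payload */
static atomic_int ready = 0;

void writer(void) {
    data = 42;                                  /* ordinary store */
    atomic_thread_fence(memory_order_release);  /* no reordering past here */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int reader(void) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                       /* spin until published */
    atomic_thread_fence(memory_order_acquire);  /* no reordering above here */
    return data;                                /* guaranteed to see 42 */
}
```

Without the two fences this is exactly the PSO failure from the earlier slide: the reader may observe ready == 1 before the write to data becomes visible.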
Memory Consistency
Weak Consistency
Relies on the programmer having used critical sections to control access to shared variables
Within the critical section no other process can rely on that data structure being consistent until the critical section is exited
We need to distinguish critical points when the programmer enters or leaves a critical section
Distinguish standard load/stores from synchronization accesses
Memory Consistency
Weak Consistency
Before an ordinary LOAD/STORE is allowed to perform wrt. any processor, all previous SYNCH accesses must be performed wrt. everyone
Before a SYNCH access is allowed to perform wrt. any processor, all previous ordinary LOAD/STORE accesses must be performed wrt. everyone
SYNCH accesses are sequentially consistent wrt. one another
Memory Consistency
Release Consistency
Before any ordinary LOAD/STORE is allowed to perform wrt. any processor, all previous ACQUIRE accesses must be performed wrt. everyone
Before any RELEASE access is allowed to perform wrt. any processor, all previous ordinary LOAD/STORE accesses must be performed wrt. everyone
Acquire/Release accesses are processor consistent wrt. one another
Memory Consistency
Enforcing Consistency
The hardware provides underlying instructions that are used to enforce consistency
Fence or memory barrier instructions
Different processors provide different types of fence instructions
Memory Consistency
Example: Synchronization in relaxed models
Intel x86/x64: total store ordering
Provides sync instructions if software requires a specific instruction ordering not guaranteed by the consistency model:
_mm_lfence ("load fence": waits for all loads to complete)
_mm_sfence ("store fence": waits for all stores to complete)
_mm_mfence ("mem fence": waits for all memory operations to complete)
ARM processors: very relaxed consistency model
Memory Consistency
Summary: relaxed consistency
Motivation: obtain higher performance by allowing reordering of memory operations for latency hiding (reordering is not allowed by sequential consistency)
One cost is software complexity: the programmer or compiler must correctly insert synchronization to ensure certain specific orderings
But in practice these complexities are encapsulated in libraries that provide intuitive primitives like lock/unlock and barrier (or lower-level primitives like fence)
Optimize for the common case: most memory accesses are not conflicting, so don't pay the cost as if they are
Relaxed consistency models differ in which memory ordering constraints they ignore
Memory Consistency
Final Thoughts
What consistency model best describes pthreads?
Memory Consistency
Hands-on Exercises: Synchronization Constructs and Deadlock
Objectives:
To understand how to create threads, how race conditions can occur, and how monitors can be implemented in Pthreads programs
To understand how race conditions can cause deadlocks and erroneous results
The OpenMP Programming Model
Outline
1. OS Support for Threads
2. Mutual Exclusion
3. Synchronization Constructs
4. Memory Consistency
5. The OpenMP Programming Model
6. C++ 11 Threads
The OpenMP Programming Model
OpenMP Reference Material
http://www.openmp.org/
Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Chapters 6 & 7
High Performance Computing, Dowd and Severance, Chapter 11
Introduction to Parallel Computing, 2nd edn, A. Grama, A. Gupta, G. Karypis, V. Kumar
Parallel Programming in OpenMP, R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, R. Menon
The OpenMP Programming Model
Shared Memory Parallel Programming
Explicit thread programming is messy
low-level primitives
originally non-standard, although better since pthreads
used by system programmers, but application programmers have better things to do!
Many application codes can be usefully supported by higher level constructs
led to proprietary directive based approaches of Cray, SGI, Sun etc
OpenMP is an API for shared memory parallel programming targeting Fortran, C and C++
standardizes the form of the proprietary directives
avoids the need for explicitly setting up mutexes, condition variables, data scope, and initialization
The OpenMP Programming Model
OpenMP
Specifications maintained by OpenMP Architecture Review Board (ARB) members include AMD, Intel, Fujitsu, IBM, NVIDIA · · · cOMPunity Versions 1.0 (Fortran ’97, C ’98), 1.1 and 2.0 (Fortran ’00, C/C++ ’02), 2.5 (unified Fortran and C, 2005), 3.0 (2008), 3.1 (2011), 4.0 (2013), 4.5 (2015) Comprises compiler directives, library routines and environment variables C directives (case sensitive) #pragma omp directive name [clause-list] library calls begin with omp void omp set num threads(nthreads) environment variables begin with OMP export OMP NUM THREADS=4 OpenMP requires compiler support activated via -fopenmp (gcc) or -qopenmp (icc) compiler flags
The Parallel Directive
OpenMP uses a fork/join model, i.e. programs execute serially until they encounter a parallel directive:
this creates a group of threads
the number of threads depends on an environment variable or is set via a function call
the main thread becomes the master, with thread id 0

#pragma omp parallel [clause-list]
  /* structured block */

Each thread executes the structured block
Fork-Join Model
Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Figure 6.1
Parallel Clauses
Clauses are used to specify:
Conditional parallelization: determines whether the parallel construct results in the creation of threads
  if (scalar expression)
Degree of concurrency: explicit specification of the number of threads created
  num_threads (integer expression)
Data handling: indicates whether specific variables are local to a thread (allocated on the stack), global, or "special"
  private (variable list)
  shared (variable list)
  firstprivate (variable list)
Compiler Translation: OpenMP to Pthreads
OpenMP code:

int a, b;
main() {
  // serial segment
  #pragma omp parallel num_threads(8) private(a) shared(b)
  {
    // parallel segment
  }
  // rest of serial segment
}

Pthread equivalent (the structured block is outlined):

int a, b;
main() {
  // serial segment
  for (i = 0; i < 8; i++)
    pthread_create(....., internal_thunk, ...);
  for (i = 0; i < 8; i++)
    pthread_join(........);
  // rest of serial segment
}

void *internal_thunk(void *packaged_argument) {
  int a;
  // parallel segment
}
The OpenMP Programming Model
Parallel Directive Examples
#pragma omp parallel if (is_parallel == 1) num_threads(8) \
  private(a) shared(b) firstprivate(c)

If the value of the variable is_parallel is one, eight threads are used
Each thread has a private copy of a and c, but shares a single copy of b
The value of each private copy of c is initialized to the value of c before the parallel region

#pragma omp parallel reduction(+: sum) num_threads(8) default(private)

Eight threads get a copy of the variable sum
When the threads exit, the values of these local copies are accumulated into the sum variable on the master thread
Other reduction operations include *, -, &, |, ^, && and ||
All variables are private unless otherwise specified
Example: Computing Pi
Compute π by generating random numbers in a square with side length of 2 centered at (0,0), and counting the numbers that fall within a circle of radius 1
The area of the square = 4; the area of the circle = πr² = π
The ratio of points inside the circle to the total number of points approaches π/4

#pragma omp parallel default(private) shared(npoints) \
  reduction(+: sum) num_threads(8)
{
  num_threads = omp_get_num_threads();
  sample_points_per_thread = npoints / num_threads;
  sum = 0;
  for (i = 0; i < sample_points_per_thread; i++) {
    rand_x = (double) rand_range(&seed, -1, 1);
    rand_y = (double) rand_range(&seed, -1, 1);
    if ((rand_x * rand_x + rand_y * rand_y) <= 1.0)
      sum++;
  }
}
The OpenMP code is very simple - try writing the equivalent pthread code!
The for Worksharing Directive
Used in conjunction with the parallel directive to partition the for loop that immediately follows it

#pragma omp parallel default(private) shared(npoints) \
  reduction(+: sum) num_threads(8)
{
  sum = 0;
  #pragma omp for
  for (i = 0; i < npoints; i++) {
    rand_x = (double) rand_range(&seed, -1, 1);
    rand_y = (double) rand_range(&seed, -1, 1);
    if ((rand_x * rand_x + rand_y * rand_y) <= 1.0)
      sum++;
  }
}

The loop index (i) is assumed to be private
Only two directives plus the sequential code (the code is easy to read and maintain)
There is an implicit synchronization at the end of the loop
A nowait clause can be added to prevent this synchronization
The Combined parallel for Directive
The most common use case for parallelizing for loops:

sum = 0;
#pragma omp parallel for default(private) shared(npoints) \
  reduction(+: sum) num_threads(8)
for (i = 0; i < npoints; i++) {
  rand_x = (double) rand_range(&seed, -1, 1);
  rand_y = (double) rand_range(&seed, -1, 1);
  if ((rand_x * rand_x + rand_y * rand_y) <= 1.0)
    sum++;
}
printf("sum = %d\n", sum);

Inside the parallel region, sum is treated as a thread-local variable (implicitly initialized to 0)
At the end of the region, the thread-local versions of sum are added to the global sum (here initialized to 0) to get the final value
Assigning Iterations to Threads
The schedule clause of the for directive assigns iterations to threads:
schedule(scheduling-class[,parameter])

schedule(static[,chunk-size])
splits the iteration space into chunks of size chunk-size and allocates them to threads in a round-robin fashion
no chunk-size specification implies that the number of chunks equals the number of threads

schedule(dynamic[,chunk-size])
the iteration space is split into chunk-size blocks that are scheduled dynamically

schedule(guided[,chunk-size])
the chunk size decreases exponentially with the iterations, down to a minimum of chunk-size

schedule(runtime)
scheduling is determined by the setting of the OMP_SCHEDULE environment variable
Loop Schedules
Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Figure 6.2
Sections
Consider partitioning a fixed number of tasks across threads:
much less common than for loop partitioning
explicit programming naturally limits the number of threads (scalability)

#pragma omp sections
{
  #pragma omp section
  { taskA(); }
  #pragma omp section
  { taskB(); }
}

Separate threads will run taskA and taskB
It is illegal to branch into or out of section blocks
Nesting Parallel Directives
What happens for nested for loops?

#pragma omp parallel for num_threads(2)
for (i = 0; i < Ni; i++) {
  #pragma omp parallel for num_threads(2)
  for (j = 0; j < Nj; j++) {
    ...
  }
}

By default the inner loop is serialized and run by one thread
Enabling multiple threads in nested parallel loops requires the environment variable OMP_NESTED to be TRUE
Note: the use of synchronization constructs in nested parallel sections requires care (see the OpenMP specs)!
Synchronization#1
Barrier: threads wait until they have all reached this point
#pragma omp barrier

Single: the following block is executed only by the first thread to reach this point; the others wait at the end of the structured block unless a nowait clause is used
#pragma omp single [clause-list]
  /* structured block */

Master: only the master thread executes the following block; the other threads do NOT wait
#pragma omp master
  /* structured block */

Critical Section: only one thread is ever in the named critical section
#pragma omp critical [(name)]
  /* structured block */

Atomic: the memory location updated in the following instruction is updated in an atomic fashion; the same effect can be achieved using critical sections
Synchronization#2
Ordered: some operations within a for loop must be performed as if they were executed in sequential order

cumul_sum[0] = list[0];
#pragma omp parallel for shared(cumul_sum, list, n)
for (i = 1; i < n; i++) {
  /* other processing on list[i] if required */
  #pragma omp ordered
  {
    cumul_sum[i] = cumul_sum[i-1] + list[i];
  }
}

Flush: ensures a consistent view of memory, i.e. that variables have been flushed from registers into memory
#pragma omp flush [(list)]
Data Handling
private: an uninitialized local copy of the variable is made for each thread
shared: variables are shared between threads
firstprivate: makes a local copy of an existing variable and assigns it the same value; often better than multiple reads of a shared variable
lastprivate: copies back to the master the value from the thread that executed the equivalent of the last loop iteration had the loop run serially
threadprivate: creates private variables that persist between multiple parallel regions, maintaining their values
copyin: like firstprivate, but for threadprivate variables
OpenMP Tasks
A task has:
code to execute
a data environment (it owns its data)
an assigned thread that executes the code and uses the data

Creating a task involves two activities: packaging and execution
each encountering thread packages a new instance of a task (code and data)
some thread in the team executes the task at some later time
Task Syntax
#pragma omp task [clause-list]
  structured_block

where the clauses include:
if (scalar expression), final (scalar expression), untied, default (shared | none), mergeable, private (list), firstprivate (list), shared (list)

When the if clause evaluates to false, the task is executed immediately (in its own environment)
Tasks complete at thread barriers (explicit or implicit) and at task barriers:
#pragma omp taskwait
This applies only to child tasks generated in the current task, not to "descendants"
Task Example
int fib(int n) {
  int i, j;
  if (n < 2) return n;
  else {
    #pragma omp task shared(i) firstprivate(n)
    i = fib(n-1);
    #pragma omp task shared(j) firstprivate(n)
    j = fib(n-2);
    #pragma omp taskwait
    return i+j;
  }
}

int main() {
  int n = 10;
  omp_set_dynamic(0);
  omp_set_num_threads(4);
  #pragma omp parallel shared(n)
  {
    #pragma omp single
    printf("fib(%d) = %d\n", n, fib(n));
  }
}
Task Issues
Task switching:
certain constructs have task scheduling points at defined locations within them
when a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (task switching)
it can then return to the original task and resume

Tied tasks:
by default a suspended task must resume execution on the same thread it was previously executing on
the untied clause relaxes this constraint

Task generation:
it is very easy to generate many tasks very quickly!
the generating task may be suspended while its thread starts working on a long and boring task
other threads can then consume all the available tasks and have nothing left to do
Library Functions#1
Defined in the header file:
#include <omp.h>

Controlling threads and processors:
void omp_set_num_threads(int num_threads)
int omp_get_num_threads()
int omp_get_max_threads()
int omp_get_thread_num()
int omp_get_num_procs()
int omp_in_parallel()

Controlling thread creation:
void omp_set_dynamic(int dynamic_threads)
int omp_get_dynamic()
void omp_set_nested(int nested)
int omp_get_nested()
Library Functions#2
Mutual exclusion:
void omp_init_lock(omp_lock_t *lock)
void omp_destroy_lock(omp_lock_t *lock)
void omp_set_lock(omp_lock_t *lock)
void omp_unset_lock(omp_lock_t *lock)
int omp_test_lock(omp_lock_t *lock)

Nested mutual exclusion:
void omp_init_nest_lock(omp_nest_lock_t *lock)
void omp_destroy_nest_lock(omp_nest_lock_t *lock)
void omp_set_nest_lock(omp_nest_lock_t *lock)
void omp_unset_nest_lock(omp_nest_lock_t *lock)
int omp_test_nest_lock(omp_nest_lock_t *lock)
OpenMP Environment Variables
OMP_NUM_THREADS: the default number of threads entering a parallel region
OMP_DYNAMIC: if TRUE, permits the number of threads to change during execution
OMP_NESTED: if TRUE, permits nested parallel regions
OMP_SCHEDULE: determines scheduling for loops that are declared to have runtime scheduling, e.g.:
setenv OMP_SCHEDULE "static,4"
setenv OMP_SCHEDULE "dynamic"
setenv OMP_SCHEDULE "guided"
OpenMP and Pthreads
OpenMP removes the need for the programmer to initialize task attributes, set up arguments to threads, partition iteration spaces etc.
OpenMP code can closely resemble serial code
OpenMP is particularly useful for static or regular problems
OpenMP users are hostage to the availability of an OpenMP compiler, and performance is heavily dependent on the quality of that compiler
Pthreads data exchange is more apparent, so false sharing and contention are less likely
Pthreads has a richer API that is much more flexible, e.g. condition waits, locks of different types etc.
Pthreads is library based
All of the above must be balanced before deciding on a parallel model
Hands-on Exercise: Programming with OpenMP
Objective: To use OpenMP to provide a basic introduction to shared memory parallel programming
C++ 11 Threads
Threads in C++11
A higher level of abstraction for threads and synchronization mechanisms
The C++11 thread library provides a number of classes in the std namespace. The following header files need to be included:
<thread>: managing and identifying threads
<mutex>: mutual exclusion primitives
<condition_variable>: condition synchronization primitives
<atomic>: atomic types and operations
The thread::hardware_concurrency() operation reports the number of tasks that can simultaneously proceed with hardware support
C++11 Threads Basics
The std::thread constructor takes a task (function) to be executed and the arguments required by that task. The number and types of arguments must match what the function requires. For example:

void f0();        // no arguments
void f1(int);     // one int argument

thread t1 {f0};
thread t2 {f0, 1};                 // error: too many arguments
thread t3 {f1};                    // error: too few arguments
thread t4 {f1, 1};
thread t5 {f1, 1, 2};              // error: too many arguments
thread t6 {f1, "I'm being silly"}; // error: wrong type of argument

Each thread of execution has a unique identifier represented as a value of type thread::id. The id of a thread t can be obtained by a call of t.get_id().
C++11 Threads Basics contd.
An object of the std::thread class is automatically started on construction:

#include <thread>
void hello() {
  std::cout << "Hello from thread " << std::this_thread::get_id()
            << std::endl;
}
int main() {
  std::thread t1(hello);
  t1.join();   // wait for t1 to finish
}

Often, it is useful to supply our own ids (etc.) to a number of threads:

void hello(int i) { printf("Hello from thread with logical id %d\n", i); }
...
std::thread **ts = new std::thread*[5];
for (int i = 0; i < 5; i++)
  ts[i] = new std::thread(hello, i);
for (int i = 0; i < 5; i++)
  ts[i]->join();
C++11 Threads Basics contd.
A t.join() tells the current thread not to proceed until t completes:

void tick(int n) {
  for (int i = 0; i != n; ++i) {
    std::this_thread::sleep_for(std::chrono::seconds{1});
    printf("Alive!\n");
  }
}
int main() {
  thread timer {tick, 10};
  timer.join();
}

We can also use lambda functions to specify the function that a thread executes (but beware!):

for (int i = 0; i < 5; i++)
  ts[i] = new std::thread([&i]() {
    printf("Hello from thread %d\n", i);
  });
Thread Local Variables
A thread-local variable is an object owned by a thread. It is not accessible from other threads unless its owner gives out a pointer to it
Thread-local variables are shared among all functions of a thread and live for as long as the thread
Each thread has its own copy of its thread-local variables
These are initialized the first time control passes through their definition. If constructed, they will be destroyed on thread exit
A thread effectively keeps a cache of thread-local data for exclusive access

void debug_counter() {
  thread_local int count = 0;
  printf("This function has been called %d times by thread %d\n",
         ++count, std::this_thread::get_id());
}
...
for (int i = 0; i < 5; i++)
  ts[i] = new std::thread([](int n) {
    for (int j = 0; j < n; j++) debug_counter();
  }, i);
Locks in C++11 - Mutex
Suppose we wanted threads to repeatedly increment a counter:

class Counter {
  int v;
public:
  Counter() { v = 0; }
  void inc() { v++; }
};

Counter count;
for (int i = 0; i < 5; i++)
  ts[i] = new std::thread([&count]() {
    for (int j = 0; j < 1000; j++)
      count.inc();
  });

The above code suffers from interference. We could fix this by adding a lock field std::mutex l to Counter and redefining:

void inc() { l.lock(); v++; l.unlock(); }

Or use std::lock_guard:

void inc() { std::lock_guard<std::mutex> guard(l); v++; }

Note: when guard is constructed, it automatically calls l.lock(). The lock is automatically released when the object is destroyed
C++11 also provides the std::atomic<T> class for atomic operations. It provides (atomic) member functions such as compare_exchange, fetch_add etc. E.g.:

std::atomic<int> v{0};
v.fetch_add(1, std::memory_order_relaxed);
Monitors and Condition Synchronization
Concepts:
monitors: encapsulated data + access functions
mutual exclusion + condition synchronization
only a single access function can be active in the monitor

Practice:
private shared data and synchronized functions (exclusion)
the latter requires a per-object lock to be acquired in each function
only a single thread may be active in the monitor at a time

Condition synchronization is achieved using a condition variable, std::condition_variable, which provides the methods wait(), notify_one() and notify_all()
Note: monitors serialize all accesses to the encapsulated data!
Condition Synchronization: std::condition_variable
C++ provides a thread wait set for each monitor:

void wait(std::unique_lock<std::mutex>& l)
void wait(std::unique_lock<std::mutex>& l, Predicate pred)

These cause the current thread to block until the condition variable is notified, optionally looping until pred is true. They atomically release the lock l, block the currently executing thread, and add it to the list of threads waiting on *this
notify_one() or notify_all(): unblocks one or more threads in the wait set. The lock is reacquired and the condition is checked again
A thread is deemed to have entered the monitor when it acquires the monitor's mutual exclusion lock

(courtesy Magee & Kramer: Concurrency)
Bounded Buffer in C++
A bounded buffer has a capacity of N items:
we can put an item in if there are < N items
we can get an item out if there are > 0 items
C++ requires a condition variable for each of the above conditions

class BoundedBuffer {
  int count, N;
  std::mutex bufLock;
  std::condition_variable notFull, notEmpty;
public:
  BoundedBuffer(int capacity) { N = capacity; count = 0; }
  void put(int v) {
    std::unique_lock<std::mutex> l(bufLock);
    notFull.wait(l, [this](){ return count < N; });
    // code to add v to buffer
    count++;
    notEmpty.notify_one();
  }
  int get() {
    int v;
    std::unique_lock<std::mutex> l(bufLock);
    notEmpty.wait(l, [this](){ return count > 0; });
    // code to remove item from buffer and put in v
    count--;
    notFull.notify_one();
    return v;
  }
};
Launching the Bounded Buffer
First we need a producer to add items, and a consumer to remove them:

void consumer(BufferMon &buf) {
  for (int i = 0; i < NumOps; i++)
    int v = buf.get();
}
void producer(BufferMon &buf) {
  for (int i = 0; i < NumOps; i++)
    buf.put(i);
}

Thread startup requires creating the buffer first, then passing its reference to the appropriately constructed thread objects acting as the producer and consumer (which start immediately):

BufferMon buf(N);
std::thread cons(consumer, std::ref(buf));
std::thread prod(producer, std::ref(buf));
prod.join();
cons.join();

The shared buffer object ensures synchronization
Semaphores: Concepts
Semaphores are widely used for dealing with inter-process synchronization in operating systems
A semaphore s is an integer variable that can take only non-negative values
The only operations permitted on s are up(s) and down(s):
down(s): if (s > 0) decrement s, else block execution of the calling process
up(s): if there are processes blocked on s, awaken one of them, else increment s
Blocked processes or threads are held in a FIFO queue
A C++ Implementation of Semaphores
A semaphore may be implemented as a monitor encapsulating the integer value:

class Semaphore {
  int v;
  std::mutex semLock;
  std::condition_variable semZero;
public:
  Semaphore(int initial) { v = initial; }
  void up() {
    std::unique_lock<std::mutex> l(semLock);
    v++;
    semZero.notify_one();
  }
  void down() {
    std::unique_lock<std::mutex> l(semLock);
    semZero.wait(l, [this](){ return v > 0; });
    v--;
  }
}; // Semaphore
Bounded Buffer via Semaphores
Beware the nested monitor problem!

class BufferSem {
  int count, N;
  std::mutex bufLock;
  Semaphore *full, *empty;
public:
  BufferSem(int capacity) {
    N = capacity; count = 0;
    full = new Semaphore(0);
    empty = new Semaphore(N);
  }
  void put(int v) {
    empty->down();   // must occur before we get bufLock!
    std::unique_lock<std::mutex> l(bufLock);
    // code to add v to buffer
    count++;
    full->up();
  }
  int get() {
    int v;
    full->down();    // must occur before we get bufLock!
    std::unique_lock<std::mutex> l(bufLock);
    // code to remove item from buffer and put in v
    count--;
    empty->up();
    return v;
  }
};
Dining Philosophers: Model Structure Diagram
Each Fork is a shared resource with actions get and put. When hungry, each philosopher must first get his right and left forks before he can start eating. Each Fork is either in a taken or not-taken state. This is an example where the resulting monitor state can be distributed (resulting in greater system concurrency).

(courtesy Magee & Kramer: Concurrency)
Dining Philosophers in C++11
Forks: shared passive entities, implemented as monitors:

class Fork {
  bool taken;   // monitor state; available if false
  int forkId;   // position of fork in the ring
  std::mutex forkLock;
  std::condition_variable isTaken;
public:
  Fork(int id) { forkId = id; taken = false; }
  void get() {
    std::unique_lock<std::mutex> l(forkLock);
    isTaken.wait(l, [this](){ return (taken == false); });
    taken = true;
  }
  void put() {
    std::unique_lock<std::mutex> l(forkLock);
    taken = false;
    isTaken.notify_one();
  }
};

Philosophers: active entities, implemented as threads:

void philosopher(int id, Fork* left, Fork* right) {
  while (true) {                 // sit down
    left->get(); right->get();   // eat
    left->put(); right->put();   // stand up
  }
}
Launching the Philosophers in C++11
Fork **fork = new Fork*[N];
for (int i = 0; i < N; i++)
  fork[i] = new Fork(i);
std::thread **phil = new std::thread*[N];
for (int i = 0; i < N; i++)
  phil[i] = new std::thread(philosopher, i, fork[(i+N-1)%N], fork[i]);
for (int i = 0; i < N; i++)
  phil[i]->join();

How can we avoid the inevitable deadlock?
prevention (of the dangerous situation): do not permit all N philosophers to sit down
remove the wait-for cycle: odd and even philosophers pick up their forks in opposite orders
Hands-on Exercise: C++11 Threads and Condition Synchronization
Objective: To understand how to create threads in C++11 and to look at two crucial issues: interference and deadlock To learn how monitors are implemented in C++11
Summary
Topics covered today - Multiprocessor Parallelism:
Threads and OS support
Mutual Exclusion
Shared memory synchronization constructs
Memory Consistency
OpenMP
C++ 11 Threads

Tomorrow - Parallel Performance Optimization!