SLIDE 1

Multiprocessor Parallelism ASD Shared Memory HPC Workshop

Computer Systems Group, ANU

Research School of Computer Science Australian National University Canberra, Australia

February 12, 2020

SLIDE 2

Schedule - Day 3

Computer Systems (ANU) Multiprocessor Parallelism Feb 12, 2020 2 / 110

SLIDE 3

Reference Material

Introduction to Parallel Computing, A. Grama, A. Gupta, G. Karypis, and V. Kumar, Addison Wesley, 2003 (ISBN 0 201 64865 2)
Parallel Computer Architecture and Programming, Kayvon Fatahalian, Carnegie Mellon University, Course 15-418/618 (http://15418.courses.cs.cmu.edu/spring2015/)
Concurrency: State Models and Java Programming, J. Magee and J. Kramer, 2nd edn, Wiley, 2006 (ISBN-10: 0470093552)
The C++ Programming Language, 4th Edition, Bjarne Stroustrup, Pearson Education, 2013 (ISBN 978-0-321-56384-2)
Shared Memory Application Programming: Concepts and Strategies in Multicore Application Programming, 1st Edition, Victor Alessandrini, Morgan Kaufmann, 2015 (ISBN 978-0128037614)

SLIDE 4

OS Support for Threads

Outline

1. OS Support for Threads
2. Mutual Exclusion
3. Synchronization Constructs
4. Memory Consistency
5. The OpenMP Programming Model
6. C++11 Threads

SLIDE 5

OS Support for Threads

Concurrent vs Parallel Processing

Concurrent Processing: an environment in which tasks are defined and allowed to execute in any order
  • Does not imply a multiple processor environment
  • E.g. spawn a separate thread to do I/O while a CPU-intensive task continues with another operation

Parallel Processing: the simultaneous execution of concurrent tasks on different CPUs

All parallel processing is concurrent, but NOT all concurrent processing is parallel

SLIDE 6

OS Support for Threads

O/S Thread Support

O/S originally designed for process, not thread, support
Require the O/S to:
  • Treat threads equally and ensure that they all get equal (or user-defined) access to machine resources
  • Know what to do when a thread issues a fork command
  • Be able to deliver a signal to the correct thread

Libraries executed by a multi-threaded program need to be thread safe
  • Potential for static data structures to be overwritten

Hardware support
  • Some hardware provides support for very fast context switches between threads, e.g. Cray MTA or, more recently, hyper-threading

SLIDE 7

OS Support for Threads

Thread Implementations

Three categories
  • Pure user space
  • Pure kernel space
  • Mixed

User Mode: when a process executes instructions within its program or linked libraries

Kernel Mode: when the operating system executes instructions on behalf of a user program
  • Often as a result of a system call
  • In kernel mode the program can access kernel space

SLIDE 8

OS Support for Threads

User Space Threads

No kernel involvement in providing a thread
Kernel has no knowledge of threads and continues to schedule processes only
Thread library selects which thread to run
OK for concurrency, no good for parallel programming

SLIDE 9

OS Support for Threads

Kernel Space Threads

Kernel level thread created for each user thread
O/S must manage on a per-thread basis information typically managed on a process basis
Potentially poor scaling when there are too many threads, as the O/S gets overloaded

SLIDE 10

OS Support for Threads

User-space v Kernel-space

User Space Advantages
  • No changes to core O/S
  • No kernel involvement means they may be faster for some applications
  • Can create many threads without overloading the system

User Space Disadvantages
  • Potentially long latency during system service blocking (e.g. one thread stalled on an I/O request)
  • All threads compete for the same CPU cycles
  • No advantage gained from multiple CPUs

Mixed scheme
  • A few user threads mapped to one kernel thread

SLIDE 11

Mutual Exclusion

Outline

1. OS Support for Threads
2. Mutual Exclusion
3. Synchronization Constructs
4. Memory Consistency
5. The OpenMP Programming Model
6. C++11 Threads

SLIDE 12

Mutual Exclusion

Accessing Shared Data

Concurrent read is no problem
  • but what about concurrent write?
  • e.g. if two processes each increment a shared x, the result could be x+1 or x+2

Similar problems arise with access to shared resources like I/O
Critical Sections: mechanism to ensure controlled access to shared resources
  • Processes enter a critical section under mutual exclusion

SLIDE 13

Mutual Exclusion

Locks

Simplest mechanism for providing mutual exclusion
A lock is a 1-bit variable, where a value of
  • 1 indicates a process is in the critical section
  • 0 indicates no process is in the critical section

At its lowest level a lock is a protocol for coordinating processes; the CPU is not physically prevented from executing those instructions

    while (lock == 1)
        do_nothing;    /* infinite testing loop */
    lock = 1;          /* enter critical section */
    /* critical section */
    lock = 0;          /* exit critical section */

The above is an incorrect example of a spin-lock, which uses a mechanism called busy waiting

SLIDE 14

Mutual Exclusion

The Assembly

    lock:   ld  R0, mem[addr]   ; load word into R0
            cmp R0, #0          ; if 0, store 1
            bnz lock            ; else, try again
            st  mem[addr], #1

    unlock: st  mem[addr], #0   ; store 0 to address

Problem: data race because LOAD-TEST-STORE is not atomic!
  • Processor 0 loads address X, observes 0
  • Processor 1 loads address X, observes 0
  • Processor 0 writes 1 to address X
  • Processor 1 writes 1 to address X

(See http://15418.courses.cs.cmu.edu/spring2015/lecture/synchronization)

SLIDE 15

Mutual Exclusion

Test and Set Lock

Test-and-set instruction:

    ts R0, mem[addr]            ; atomically load mem[addr] into R0
                                ; if mem[addr] is 0 then set mem[addr] to 1

    lock:   ts  R0, mem[addr]   ; load word into R0
            cmp R0, #0          ; if 0, lock obtained
            bnz lock

    unlock: st  mem[addr], #0   ; store 0 to address

What does 'atomic' mean in this context?
p threads contending for the lock will require O(p) attempts, thus O(p^2) time (and energy!)
We can reduce the number of attempts (but still O(p)) by only trying the ts when we see mem[addr] is 0
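The "only try the ts when we see 0" idea can be sketched in C11 as a test-and-test-and-set lock. This is a minimal sketch, not the deck's own code: the names spin_lock_t, spin_acquire, spin_release and run_spin_demo are hypothetical, and atomic_exchange from <stdatomic.h> is assumed to play the role of the ts instruction.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 'held' is the 1-bit lock variable from the slides. */
typedef struct { atomic_bool held; } spin_lock_t;

static void spin_acquire(spin_lock_t *l) {
    for (;;) {
        /* "test": spin on ordinary atomic loads while the lock looks taken */
        while (atomic_load(&l->held))
            ;
        /* "test-and-set": atomic_exchange returns the previous value */
        if (!atomic_exchange(&l->held, true))
            return;   /* previous value was false, so we now hold the lock */
    }
}

static void spin_release(spin_lock_t *l) {
    atomic_store(&l->held, false);
}

static spin_lock_t demo_lock;   /* static storage: zero-initialised, not held */
static long demo_counter;       /* shared data protected by demo_lock */

static void *spin_worker(void *arg) {
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        spin_acquire(&demo_lock);
        demo_counter++;             /* critical section */
        spin_release(&demo_lock);
    }
    return NULL;
}

/* Runs nthreads workers (at most 16) and returns the final counter value. */
long run_spin_demo(int nthreads, long iters) {
    pthread_t tid[16];
    assert(nthreads >= 1 && nthreads <= 16);
    demo_counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, spin_worker, &iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return demo_counter;
}
```

Because every increment happens while the lock is held, run_spin_demo(n, k) should return n*k. Spinning on the load keeps waiting threads reading their cached copy rather than issuing an atomic write on every attempt.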

SLIDE 16

Mutual Exclusion

Common Atomic Operations

Test and Set

    /* body executes atomically */
    bool TestAndSet(bool *lock) {
        bool initial = *lock;
        *lock = true;
        return initial;
    }

Fetch and Add (operate)

    /* body executes atomically */
    int FetchAndAdd(int *location, int inc) {
        int value = *location;
        *location = value + inc;
        return value;
    }

Compare and Swap (Intel, CMPXCHG)

    /* body executes atomically */
    bool cas(int *p, int old, int new) {
        if (*p != old)
            return false;
        *p = new;
        return true;
    }
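The three operations above are pseudocode for hardware primitives. As a hedged sketch, they map onto C11 <stdatomic.h> roughly as follows; the wrapper names (demo_test_and_set, demo_fetch_and_add, demo_cas, cas_increment) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Test and Set: atomic_flag_test_and_set returns the previous state. */
bool demo_test_and_set(atomic_flag *lock) {
    return atomic_flag_test_and_set(lock);
}

/* Fetch and Add: returns the value held before the addition. */
int demo_fetch_and_add(atomic_int *location, int inc) {
    return atomic_fetch_add(location, inc);
}

/* Compare and Swap: stores new_val only if *p still equals old_val. */
bool demo_cas(atomic_int *p, int old_val, int new_val) {
    /* On failure the call overwrites its second argument (our local copy
       of old_val) with the value actually observed; we ignore that here. */
    return atomic_compare_exchange_strong(p, &old_val, new_val);
}

/* Illustrative use: a lock-free increment built from a CAS retry loop. */
int cas_increment(atomic_int *p) {
    int seen = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &seen, seen + 1))
        ;   /* 'seen' was refreshed by the failed call; retry with it */
    return seen + 1;
}
```

The retry loop in cas_increment is the usual pattern for building other atomic updates out of compare-and-swap when no dedicated instruction exists.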

SLIDE 17

Mutual Exclusion

Locks: Phases and Properties

Phases
  • Waiting to acquire lock (busy wait, or de-schedule and be woken later)
  • Lock acquisition
  • Releasing lock

Properties
  • Fast to acquire in the absence of contention
  • Low interconnect traffic
  • Scalability
  • Low storage cost
  • Fairness
  • Do not degrade overall system (e.g. when more threads than CPUs contend for a lock)

SLIDE 18

Mutual Exclusion

Pthread Lock Routines

Locks are implemented as mutually exclusive lock variables, or mutex variables
To use one, a variable must be declared as type pthread_mutex_t
  • Usually in the main thread, as it needs to be visible to all threads using it

    pthread_mutex_t mutex1;
    pthread_mutex_init(&mutex1, NULL);   /* NULL specifies default attributes for the mutex */
    pthread_mutex_lock(&mutex1);
    /* CRITICAL SECTION */
    pthread_mutex_unlock(&mutex1);
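To make the fragment above concrete, here is a small sketch of several threads incrementing one counter under a mutex. The struct shared_t and the functions add_worker and run_locked_counter are hypothetical names chosen for this illustration; only the pthread_* calls come from the Pthreads API.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state: the mutex lives next to the data it protects. */
typedef struct {
    pthread_mutex_t lock;
    long value;
    long iters;
} shared_t;

static void *add_worker(void *arg) {
    shared_t *s = arg;
    for (long i = 0; i < s->iters; i++) {
        pthread_mutex_lock(&s->lock);
        s->value++;                      /* critical section */
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

/* Starts nthreads workers (at most 16) and returns the final counter. */
long run_locked_counter(int nthreads, long iters) {
    pthread_t tid[16];
    shared_t s = { .value = 0, .iters = iters };
    assert(nthreads >= 1 && nthreads <= 16);
    pthread_mutex_init(&s.lock, NULL);   /* default attributes */
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, add_worker, &s);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    pthread_mutex_destroy(&s.lock);
    return s.value;
}
```

With the lock in place the result is always nthreads * iters; removing the lock/unlock pair reintroduces the x+1 versus x+2 race described earlier.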

SLIDE 19

Mutual Exclusion

Synchronization Terminology

  • Deadlock
  • Livelock
  • Starvation

SLIDE 20

Mutual Exclusion

Deadlock

(See http://15418.courses.cs.cmu.edu/spring2015/lecture/snoopimpl)

Deadlock is a state where a system has outstanding operations to complete, but no operation can make progress
Can arise when each operation has acquired a shared resource that another operation needs
In a deadlock situation, there is no way for any thread (or, in this illustration, a car) to make progress unless some thread relinquishes a resource ("backs off")

SLIDE 21

Mutual Exclusion

Deadlock

Can occur with multiple processes seeking to acquire resources that cannot be shared
Pthreads provides pthread_mutex_trylock() to help address these circumstances
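One way pthread_mutex_trylock() can help is a "back off" protocol: a thread that cannot get its second lock releases the first and retries. The sketch below assumes two threads that want the same two locks in opposite orders; lock_both, worker_ab, worker_ba and run_trylock_demo are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static long shared_updates;   /* only touched while BOTH locks are held */

/* Acquire 'first' then 'second' without blocking while holding a lock. */
static void lock_both(pthread_mutex_t *first, pthread_mutex_t *second) {
    for (;;) {
        pthread_mutex_lock(first);
        if (pthread_mutex_trylock(second) == 0)
            return;                      /* acquired both locks */
        pthread_mutex_unlock(first);     /* back off instead of hold-and-wait */
        sched_yield();                   /* let the other thread run */
    }
}

static void *worker_ab(void *arg) {
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        lock_both(&lock_a, &lock_b);
        shared_updates++;
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
    }
    return NULL;
}

static void *worker_ba(void *arg) {      /* same work, opposite lock order */
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        lock_both(&lock_b, &lock_a);
        shared_updates++;
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
    }
    return NULL;
}

/* Runs both workers for 'iters' rounds each; returns the update count. */
long run_trylock_demo(long iters) {
    pthread_t t1, t2;
    assert(iters >= 0);
    shared_updates = 0;
    pthread_create(&t1, NULL, worker_ab, &iters);
    pthread_create(&t2, NULL, worker_ba, &iters);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared_updates;
}
```

Had both workers used two blocking pthread_mutex_lock() calls, each could end up holding one lock and waiting forever for the other. Backing off removes the hold-and-wait condition, at the price that two threads can in principle keep retrying in step, which is the livelock scenario.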

SLIDE 22

Mutual Exclusion

Transactions and Concurrency

https://aries.aston.ac.uk/modules/2010 2011/EE4007/chapter 13 8.html

SLIDE 23

Mutual Exclusion

Deadlock Conditions

Mutual exclusion: the resources involved must be unsharable
Hold and wait: processes must hold the resources they have been allocated while waiting for other resources needed to complete an operation
No pre-emption: processes cannot have resources taken away from them until they have completed the operation they wish to perform
Circular wait: a cycle exists in the resource dependency graph

SLIDE 24

Mutual Exclusion

Livelock

http://15418.courses.cs.cmu.edu/spring2015/lecture/snoopimpl

SLIDE 25

Mutual Exclusion

Starvation

In this example: assume traffic moving left/right (yellow cars) must yield to traffic moving up/down (green cars)
(See http://15418.courses.cs.cmu.edu/spring2015/lecture/snoopimpl)

State where a system is making overall progress, but some processes make no progress
  • (green cars make progress, but yellow cars are stopped)

Starvation is usually not a permanent state
  • (as soon as green cars pass, yellow cars can go)

SLIDE 26

Mutual Exclusion

Follow-On Issues

  • Specific barrier implementations
  • Lock-free data structures
  • Transactional memory

SLIDE 27

Mutual Exclusion

Hands-on Exercise: Non-Determinism

Objective: To observe the effect of non-determinism in the form of race conditions on program behaviour and how it can be avoided

SLIDE 28

Synchronization Constructs

Outline

1. OS Support for Threads
2. Mutual Exclusion
3. Synchronization Constructs
4. Memory Consistency
5. The OpenMP Programming Model
6. C++11 Threads

SLIDE 29

Synchronization Constructs

Semaphores: Concepts

Semaphores are widely used for dealing with inter-process synchronization in operating systems
A semaphore s is an integer variable that can take only non-negative values
The only operations permitted on s are up(s) and down(s)

    down(s): if (s > 0) decrement s
             else block execution of the calling process
    up(s):   if there are processes blocked on s, awaken one of them
             else increment s

Blocked processes or threads are held in a FIFO queue

SLIDE 30

Synchronization Constructs

Semaphores: Implementation

Binary semaphores have only a value of 0 or 1
  • hence can act as a lock

Waking up of threads via an up(s) occurs via an OS signal
POSIX semaphore API:

    #include <semaphore.h>
    ...
    int sem_init(sem_t *sem, int pshared, unsigned int value);
    int sem_destroy(sem_t *sem);
    int sem_wait(sem_t *sem);       // down()
    int sem_trywait(sem_t *sem);
    int sem_timedwait(sem_t *sem, const struct timespec *abstime);
    int sem_post(sem_t *sem);       // up()
    int sem_getvalue(sem_t *sem, int *value);

pshared: if non-0, generate a semaphore for usage between processes
value (from sem_getvalue()): if there are processes/threads waiting on this semaphore, delivers either 0 or, on some implementations, the number of waiters as a negative integer
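A short sketch of the API in use, assuming a platform where unnamed semaphores created with sem_init() are supported (e.g. Linux). One semaphore counts available items and a second, binary one acts as a lock; the names items, mutex_sem, sem_producer and run_semaphore_demo are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t items;       /* counts produced-but-unconsumed items, starts at 0 */
static sem_t mutex_sem;   /* binary semaphore used as a lock, starts at 1 */
static long total;        /* shared data protected by mutex_sem */

static void *sem_producer(void *arg) {
    long n = *(long *)arg;
    for (long i = 1; i <= n; i++) {
        sem_wait(&mutex_sem);   /* down(): enter critical section */
        total += i;
        sem_post(&mutex_sem);   /* up(): leave critical section */
        sem_post(&items);       /* up(): announce one more item */
    }
    return NULL;
}

/* Producer adds 1..n to 'total'; the caller waits for n items, then reads it. */
long run_semaphore_demo(long n) {
    pthread_t tid;
    long result;
    assert(n >= 0);
    total = 0;
    sem_init(&items, 0, 0);       /* pshared = 0: shared between threads only */
    sem_init(&mutex_sem, 0, 1);
    pthread_create(&tid, NULL, sem_producer, &n);
    for (long i = 0; i < n; i++)
        sem_wait(&items);         /* down(): blocks until an item exists */
    pthread_join(tid, NULL);
    result = total;
    sem_destroy(&items);
    sem_destroy(&mutex_sem);
    return result;
}
```

The counting semaphore carries the "how many" information that a plain lock cannot, which is why the consumer loop needs no explicit condition check.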

SLIDE 31

Synchronization Constructs

Monitors and Condition Synchronization

Concepts:
  • monitors: encapsulated data + access functions
  • mutual exclusion + condition synchronization
  • only a single access function can be active in the monitor

Practice:
  • private shared data and synchronized functions (exclusion)
  • the latter requires a per-object lock to be acquired in each function
  • only a single thread may be active in the monitor at a time
  • condition synchronization: achieved using a condition variable
  • wait(), notify_one() and notify_all()

Note: monitors serialize all accesses to the encapsulated data!

SLIDE 32

Synchronization Constructs

Monitors in ‘C’ / POSIX (types and creation)

    int pthread_mutex_lock(pthread_mutex_t *mutex);
    int pthread_mutex_trylock(pthread_mutex_t *mutex);
    int pthread_mutex_timedlock(pthread_mutex_t *mutex,
                                const struct timespec *abstime);
    int pthread_mutex_unlock(pthread_mutex_t *mutex);

    int pthread_cond_wait(pthread_cond_t *cond,
                          pthread_mutex_t *mutex);
    int pthread_cond_timedwait(pthread_cond_t *cond,
                               pthread_mutex_t *mutex,
                               const struct timespec *abstime);
    int pthread_cond_signal(pthread_cond_t *cond);
    int pthread_cond_broadcast(pthread_cond_t *cond);

pthread_cond_signal() unblocks at least one thread
pthread_cond_broadcast() unblocks all threads waiting on cond
lock calls can be called anytime (multiple lock activations are possible)
the API is flexible and universal, but relies on conventions rather than compilers

SLIDE 33

Synchronization Constructs

Condition Synchronization

(courtesy Magee&Kramer: Concurrency)

SLIDE 34

Synchronization Constructs

Condition Synchronization in Posix

Synchronization between POSIX-threads:

    typedef ... pthread_mutex_t;
    typedef ... pthread_mutexattr_t;
    typedef ... pthread_cond_t;
    typedef ... pthread_condattr_t;

    int pthread_mutex_init(pthread_mutex_t *mutex,
                           const pthread_mutexattr_t *attr);
    int pthread_mutex_destroy(pthread_mutex_t *mutex);
    int pthread_cond_init(pthread_cond_t *cond,
                          const pthread_condattr_t *attr);
    int pthread_cond_destroy(pthread_cond_t *cond);

Mutex and condition variable attributes include: semantics for trying to lock a mutex which is already locked by the same thread
pthread_mutex_destroy(): undefined if the lock is in use
pthread_cond_destroy(): undefined if there are threads waiting

SLIDE 35

Synchronization Constructs

Bounded Buffer in Posix

    typedef struct bbuf_t {
        int count, N;               /* current and maximum number of items */
        ...
        pthread_mutex_t bufLock;    /* protects every field of the buffer */
        pthread_cond_t notFull;     /* signalled when an item is removed */
        pthread_cond_t notEmpty;    /* signalled when an item is added */
    } bbuf_t;

    int getBB(bbuf_t *b) {
        pthread_mutex_lock(&b->bufLock);
        while (b->count == 0)       /* re-check the condition after each wake-up */
            pthread_cond_wait(&b->notEmpty, &b->bufLock);
        b->count--;
        int v;                      // remove item from buffer
        pthread_cond_signal(&b->notFull);
        pthread_mutex_unlock(&b->bufLock);
        return v;
    }

    void putBB(int v, bbuf_t *b) {
        pthread_mutex_lock(&b->bufLock);
        while (b->count == b->N) {
            pthread_cond_wait(&b->notFull, &b->bufLock);
        }
        b->count++;
        // add v to buffer
        pthread_cond_signal(&b->notEmpty);
        pthread_mutex_unlock(&b->bufLock);
    }
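The slide elides the buffer storage itself. A complete variant might look like the following sketch, which assumes a fixed-capacity circular array; ring_t, BB_CAP, ring_init, ring_put, ring_get and run_ring_demo are hypothetical names, not part of the original code.

```c
#include <assert.h>
#include <pthread.h>

#define BB_CAP 4   /* assumed fixed capacity, chosen small to force blocking */

typedef struct {
    int data[BB_CAP];
    int head, tail, count;             /* next slot to read / write, item count */
    pthread_mutex_t lock;              /* the monitor lock */
    pthread_cond_t not_full, not_empty;
} ring_t;

void ring_init(ring_t *r) {
    r->head = r->tail = r->count = 0;
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->not_full, NULL);
    pthread_cond_init(&r->not_empty, NULL);
}

void ring_put(ring_t *r, int v) {
    pthread_mutex_lock(&r->lock);
    while (r->count == BB_CAP)                  /* wait until there is room */
        pthread_cond_wait(&r->not_full, &r->lock);
    r->data[r->tail] = v;
    r->tail = (r->tail + 1) % BB_CAP;
    r->count++;
    pthread_cond_signal(&r->not_empty);
    pthread_mutex_unlock(&r->lock);
}

int ring_get(ring_t *r) {
    pthread_mutex_lock(&r->lock);
    while (r->count == 0)                       /* wait until there is an item */
        pthread_cond_wait(&r->not_empty, &r->lock);
    int v = r->data[r->head];
    r->head = (r->head + 1) % BB_CAP;
    r->count--;
    pthread_cond_signal(&r->not_full);
    pthread_mutex_unlock(&r->lock);
    return v;
}

typedef struct { ring_t *ring; int n; } producer_arg_t;

static void *ring_producer(void *p) {
    producer_arg_t *a = p;
    for (int i = 1; i <= a->n; i++)
        ring_put(a->ring, i);
    return NULL;
}

/* Producer thread puts 1..n; the calling thread consumes and sums them. */
long run_ring_demo(int n) {
    ring_t r;
    pthread_t tid;
    long sum = 0;
    assert(n >= 0);
    ring_init(&r);
    producer_arg_t a = { &r, n };
    pthread_create(&tid, NULL, ring_producer, &a);
    for (int i = 0; i < n; i++)
        sum += ring_get(&r);
    pthread_join(tid, NULL);
    return sum;
}
```

The while loops around pthread_cond_wait() matter: a woken thread must re-test its condition because another thread may have changed the buffer between the signal and the wake-up.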

SLIDE 36

Synchronization Constructs

Deadlock and its Avoidance

Ideas for deadlock:
  • concepts: system deadlock: no further progress
  • model: deadlock: no eligible actions
  • practice: blocked threads

Our aim: deadlock avoidance: to design systems where deadlock cannot occur
Done by removing one of the four necessary and sufficient conditions:
  • serially reusable resources: the processes involved share resources which they use under mutual exclusion
  • incremental acquisition: processes hold on to resources already allocated to them while waiting to acquire additional resources
  • no pre-emption: once acquired by a process, resources cannot be pre-empted (forcibly withdrawn) but are only released voluntarily
  • wait-for cycle: a circular chain (or cycle) of processes exists such that each process holds a resource which its successor in the cycle is waiting to acquire

SLIDE 37

Synchronization Constructs

Wait-for Cycles

Operating systems must deal with deadlock arising from processes requesting resources (e.g. printers, memory, co-processors)
  • pre-emption involves constructing a resource allocation graph, detecting a cycle and removing a resource
  • avoidance involves never granting a request that could lead to such a cycle (the Banker's Algorithm)

(courtesy Magee&Kramer: Concurrency)

SLIDE 38

Synchronization Constructs

The Dining Philosophers Problem

Five philosophers sit around a circular table. Each philosopher spends his life alternately thinking and eating. In the centre of the table is a large bowl of spaghetti. A philosopher needs two forks to eat a helping of spaghetti. One fork is placed between each pair of philosophers and they agree that each will only use the fork to his immediate right and left.

If all 5 sit down at once and each takes the fork on his left, there will be deadlock: a wait-for cycle exists!

(courtesy Magee&Kramer: Concurrency)

SLIDE 39

Synchronization Constructs

Dining Philosophers: Model Structure Diagram

Each FORK is a shared resource with actions get and put. When hungry, each Phil must first get his right and left forks before he can start eating. Each Fork is either in a taken or not taken state. This is an example when the resulting monitor state can be distributed (resulting in greater system concurrency).

(courtesy Magee&Kramer: Concurrency)
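One common way to remove the wait-for cycle is to impose a global order on the forks and always pick up the lower-numbered one first. This is a sketch of that technique with Pthreads mutexes, not the Magee and Kramer model itself; philosopher, fork_lock and run_philosophers are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>

#define NPHIL 5

static pthread_mutex_t fork_lock[NPHIL];   /* one mutex per fork */
static int meals[NPHIL];                   /* meals[i] is written only by philosopher i */
static int rounds;                         /* helpings each philosopher eats */

static void *philosopher(void *arg) {
    int id = *(int *)arg;
    int left = id, right = (id + 1) % NPHIL;
    /* Always take the lower-numbered fork first: no wait-for cycle can form. */
    int first  = left < right ? left : right;
    int second = left < right ? right : left;
    for (int r = 0; r < rounds; r++) {
        pthread_mutex_lock(&fork_lock[first]);
        pthread_mutex_lock(&fork_lock[second]);
        meals[id]++;                               /* eat */
        pthread_mutex_unlock(&fork_lock[second]);
        pthread_mutex_unlock(&fork_lock[first]);
    }
    return NULL;
}

/* Seats all five philosophers at once; returns the total number of meals. */
int run_philosophers(int nrounds) {
    pthread_t tid[NPHIL];
    int id[NPHIL], total = 0;
    assert(nrounds >= 0);
    rounds = nrounds;
    for (int i = 0; i < NPHIL; i++) {
        pthread_mutex_init(&fork_lock[i], NULL);
        meals[i] = 0;
        id[i] = i;
    }
    for (int i = 0; i < NPHIL; i++)
        pthread_create(&tid[i], NULL, philosopher, &id[i]);
    for (int i = 0; i < NPHIL; i++) {
        pthread_join(tid[i], NULL);
        total += meals[i];
    }
    for (int i = 0; i < NPHIL; i++)
        pthread_mutex_destroy(&fork_lock[i]);
    return total;
}
```

Only the last philosopher (whose right fork is fork 0) ends up reaching for his right fork first, which is enough to break the cycle. Each fork keeps its own lock, matching the point above that the monitor state can be distributed.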

SLIDE 40

Memory Consistency

Outline

1. OS Support for Threads
2. Mutual Exclusion
3. Synchronization Constructs
4. Memory Consistency
5. The OpenMP Programming Model
6. C++11 Threads

SLIDE 41

Memory Consistency

Consider the following

initially flag1 = flag2 = 0

    // Process 0            | // Process 1
    flag1 = 1               | flag2 = 1
    if (flag2 == 0)         | if (flag1 == 0)
        print "Hello";      |     print "World";

What is printed?

SLIDE 42

Memory Consistency

Time Ordered Events

Interleaving A:
  1. Process 0: flag1 = 1
  2. Process 0: if (flag2 == 0) print "Hello";
  3. Process 1: flag2 = 1
  4. Process 1: if (flag1 == 0) print "World";
  Output: Hello

Interleaving B:
  1. Process 0: flag1 = 1
  2. Process 1: flag2 = 1
  3. Process 0: if (flag2 == 0) print "Hello";
  4. Process 1: if (flag1 == 0) print "World";
  Output: (Nothing Printed)

Interleaving C:
  1. Process 1: flag2 = 1
  2. Process 1: if (flag1 == 0) print "World";
  3. Process 0: flag1 = 1
  4. Process 0: if (flag2 == 0) print "Hello";
  Output: World

Never Hello and World?

But what fundamental assumption are we making?
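The assumption can be made explicit in C11: if the flags are atomics accessed with the default (sequentially consistent) ordering, the "Hello and World" outcome is ruled out. The sketch below repeats the experiment many times and counts how often both branches fire; process0, process1 and count_both_printed are hypothetical names, and the "print" is replaced by recording a boolean.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int flag1, flag2;
static int saw_hello, saw_world;   /* each written by one thread, read after join */

static void *process0(void *arg) {
    (void)arg;
    atomic_store(&flag1, 1);                   /* seq_cst store (the default) */
    saw_hello = (atomic_load(&flag2) == 0);    /* would print "Hello" */
    return NULL;
}

static void *process1(void *arg) {
    (void)arg;
    atomic_store(&flag2, 1);
    saw_world = (atomic_load(&flag1) == 0);    /* would print "World" */
    return NULL;
}

/* Runs the two processes 'trials' times; returns how often both "printed". */
int count_both_printed(int trials) {
    int both = 0;
    assert(trials >= 0);
    for (int t = 0; t < trials; t++) {
        pthread_t p0, p1;
        atomic_store(&flag1, 0);
        atomic_store(&flag2, 0);
        pthread_create(&p0, NULL, process0, NULL);
        pthread_create(&p1, NULL, process1, NULL);
        pthread_join(p0, NULL);
        pthread_join(p1, NULL);
        if (saw_hello && saw_world)
            both++;
    }
    return both;
}
```

With sequentially consistent atomics the function returns 0. If the stores and loads are weakened to memory_order_relaxed (or the flags are plain ints, which is a data race), hardware with store buffers may let each thread read the other's flag before its own write becomes visible, so both branches can fire.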

SLIDE 43

Memory Consistency

Memory Consistency

To write correct and efficient shared memory programs, programmers need a precise notion of how memory behaves with respect to read and write operations from multiple processors (Adve and Gharachorloo)

Memory/cache coherency defines requirements for the observed behaviour of reads and writes to the same memory location (ensuring all processors have a consistent view of the same address)

Memory consistency defines the behaviour of reads and writes to different locations (as observed by other processors)

(See: Sarita V. Adve and Kourosh Gharachorloo. 1996. Shared Memory Consistency Models: A Tutorial. Computer 29, 12, 66-76. DOI=10.1109/2.546611)

SLIDE 44

Memory Consistency

Sequential Consistency

Lamport's definition: [A multiprocessor system is sequentially consistent if] the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program

Two aspects:
  • Maintaining program order among operations from individual processors
  • Maintaining a single sequential order among operations from all processors

The latter aspect makes it appear as if a memory operation executes atomically or instantaneously with respect to other memory operations

SLIDE 45

Memory Consistency

Programmer’s View of Sequential Consistency

Conceptually

  • There is a single global memory and a switch that connects an arbitrary processor to memory at any time step
  • Each process issues memory operations in program order and the switch provides the global serialization among all memory operations

SLIDE 46

Memory Consistency

Motivation for Relaxed Consistency

To gain performance
  • Hide latency of independent memory accesses with other operations
  • Recall that memory accesses in a cache-coherent system may involve much work

Can relax either the program order or atomicity requirements (or both)
  • e.g. relaxing write-to-read and write-to-write order allows writes to different locations to be pipelined or overlapped
  • Relaxing write atomicity allows a read to return another processor's write before all cached copies have been updated

SLIDE 47

Memory Consistency

Possible Reorderings

We consider first relaxing the program order requirements
Four types of memory orderings (within a program):
  • W→R: write must complete before subsequent read (RAW)
  • R→R: read must complete before subsequent read (RAR)
  • R→W: read must complete before subsequent write (WAR)
  • W→W: write must complete before subsequent write (WAW)

Normally, different addresses are involved in the pair of operations
Relaxing these can give:
  • W→R: e.g. the write buffer (aka store buffer)
  • everything: the weak and release consistency models

SLIDE 48

Memory Consistency

Breaking W → R Ordering: Write Buffers

(Adve and Gharachorloo DOI=10.1109/2.546611)

Write buffer (very common)

  • Process inserts writes in the write buffer and proceeds, assuming each completes in due course
  • Subsequent reads bypass previous writes, using the value in the write buffer
  • Subsequent writes are processed by the buffer in order (no W→W relaxations)

SLIDE 49

Memory Consistency

Allowing reads to move ahead of writes

Total Store Ordering (TSO)

  • Processor P can read B before its write to A is seen by all processors (a processor can move its own reads in front of its own writes)
  • Reads by other processors cannot return the new value of A until the write to A is observed by all processors

Processor Consistency (PC)
  • Any processor can read the new value of A before the write is observed by all processors

In TSO and PC, only W→R order is relaxed. The W→W constraint still exists. Writes by the same thread are not reordered (they occur in program order)

(See http://15418.courses.cs.cmu.edu/spring2015/lecture/consistency)

SLIDE 50

Memory Consistency

Processor Consistency

  • Before a LOAD is allowed to perform wrt. any processor, all previous LOAD accesses must be performed wrt. everyone
  • Before a STORE is allowed to perform wrt. any processor, all previous LOAD and STORE accesses must be performed wrt. everyone

SLIDE 51

Memory Consistency

Four Example Programs

Do (all possible) results of execution match that of sequential consistency (SC)?

    Program                       1   2   3   4
    Total Store Ordering (TSO)    ✓   ✓   ✓   ✗
    Processor Consistency (PC)    ✓   ✓   ✗   ✗

(See http://15418.courses.cs.cmu.edu/spring2015/lecture/consistency)

SLIDE 52

Memory Consistency

Clarification

The cache coherency problem exists because of the optimization of duplicating data in multiple processor caches. The copies of the data must be kept coherent. Relaxed memory consistency issues arise from the optimization of reordering memory operations (this is unrelated to whether there are caches in the system).

SLIDE 53

Memory Consistency

Allowing writes to be reordered

Four types of memory operations orderings

  • W→R: write must complete before subsequent read
  • R→R: read must complete before subsequent read
  • R→W: read must complete before subsequent write
  • W→W: write must complete before subsequent write

Partial Store Ordering (PSO)
  • Execution may not match sequential consistency on program 1 (P2 may observe change to flag before change to A)

    Thread 1 (on P1)        | Thread 2 (on P2)
    A = 1;                  | while (flag == 0);
    flag = 1;               | print A;

SLIDE 54

Memory Consistency

Breaking W → W Ordering: Overlapped Writes

(Adve and Gharachorloo DOI=10.1109/2.546611)

General (non-bus) Interconnect with multiple memory modules

  • Different memory operations issued by the same processor are serviced by different memory modules
  • Writes from P1 are injected into the memory system in program order, but they may complete out of program order
  • Many processors coalesce writes to the same cache line in a write buffer, which could lead to similar effects

SLIDE 55

Memory Consistency

Allowing all reorderings

Four types of memory operations orderings

  • W→R: write must complete before subsequent read
  • R→R: read must complete before subsequent read
  • R→W: read must complete before subsequent write
  • W→W: write must complete before subsequent write

Examples:
  • Weak Ordering (WO)
  • Release Consistency (RC)

Processors support special synchronization operations
  • Memory accesses before a memory fence instruction must complete before the fence issues
  • Memory accesses after the fence cannot begin until the fence instruction is complete

    reorderable reads and writes here
    ...
    <MEMORY FENCE>
    ...
    reorderable reads and writes here
    ...
    <MEMORY FENCE>
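In portable C11 the fence pattern can be sketched with atomic_thread_fence(). The example assumes a writer that publishes an ordinary variable and then raises a flag using only relaxed atomic accesses, so all the ordering comes from the two fences; payload, ready, fence_writer, fence_reader and run_fence_demo are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;          /* ordinary (non-atomic) data */
static atomic_int ready;     /* flag, accessed with relaxed atomics only */

static void *fence_writer(void *arg) {
    payload = *(int *)arg;                        /* reorderable write ... */
    atomic_thread_fence(memory_order_release);    /* <MEMORY FENCE> */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
    return NULL;
}

static void *fence_reader(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                         /* spin until flag is raised */
    atomic_thread_fence(memory_order_acquire);    /* <MEMORY FENCE> */
    *(int *)arg = payload;                        /* sees the writer's value */
    return NULL;
}

/* Publishes 'value' from one thread and returns what the other thread read. */
int run_fence_demo(int value) {
    pthread_t w, r;
    int seen = -1;
    atomic_store(&ready, 0);
    pthread_create(&r, NULL, fence_reader, &seen);
    pthread_create(&w, NULL, fence_writer, &value);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;
}
```

Without the two fences, a weakly ordered machine (or the compiler) could make the flag visible before the payload, which is exactly the failure shown for program 1 under PSO.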

SLIDE 56

Memory Consistency

Weak Consistency

Relies on the programmer having used critical sections to control access to shared variables

Within the critical section no other process can rely on that data structure being consistent until the critical section is exited

We need to distinguish critical points when the programmer enters or leaves a critical section

Distinguish standard load/stores from synchronization accesses

SLIDE 57

Memory Consistency

Weak Consistency

  • Before an ordinary LOAD/STORE is allowed to perform wrt. any processor, all previous SYNCH accesses must be performed wrt. everyone
  • Before a SYNCH access is allowed to perform wrt. any processor, all previous ordinary LOAD/STORE accesses must be performed wrt. everyone
  • SYNCH accesses are sequentially consistent wrt. one another

SLIDE 58

Memory Consistency

Release Consistency

  • Before any ordinary LOAD/STORE is allowed to perform wrt. any processor, all previous ACQUIRE accesses must be performed wrt. everyone
  • Before any RELEASE access is allowed to perform wrt. any processor, all previous ordinary LOAD/STORE accesses must be performed wrt. everyone
  • Acquire/Release accesses are processor consistent wrt. one another
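C11 exposes acquire and release directly as memory orders on atomic operations. The sketch below is an illustration under that assumption rather than a statement about any particular hardware model: the store to flag is the RELEASE access, the load in the spin loop is the ACQUIRE access, and A is ordinary data. The names rc_thread1, rc_thread2 and run_message_passing are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int A;               /* ordinary data */
static atomic_int flag;     /* synchronization variable */

static void *rc_thread1(void *arg) {
    (void)arg;
    A = 1;                                                   /* ordinary STORE */
    atomic_store_explicit(&flag, 1, memory_order_release);   /* RELEASE */
    return NULL;
}

static void *rc_thread2(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                    /* ACQUIRE */
    *(int *)arg = A;                                         /* ordinary LOAD */
    return NULL;
}

/* Returns the value of A observed by the second thread. */
int run_message_passing(void) {
    pthread_t t1, t2;
    int seen = -1;
    A = 0;
    atomic_store(&flag, 0);
    pthread_create(&t2, NULL, rc_thread2, &seen);
    pthread_create(&t1, NULL, rc_thread1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return seen;
}
```

The release store keeps the earlier write to A from moving after it, and the acquire load keeps the later read of A from moving before it, so the second thread reads 1.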

SLIDE 59

Memory Consistency

Enforcing Consistency

The hardware provides underlying instructions that are used to enforce consistency

  • Fence or memory barrier instructions

Different processors provide different types of fence instructions

SLIDE 60

Memory Consistency

Example: Synchronization in relaxed models

Intel x86/x64 - total store ordering

Provides sync instructions if software requires a specific instruction ordering not guaranteed by the consistency model
  • _mm_lfence ("load fence": waits for all loads to complete)
  • _mm_sfence ("store fence": waits for all stores to complete)
  • _mm_mfence ("mem fence": waits for all mem operations to complete)

ARM processors: very relaxed consistency model

SLIDE 61

Memory Consistency

Summary: relaxed consistency

Motivation: obtain higher performance by allowing reordering of memory operations for latency hiding (reordering is not allowed by sequential consistency)

One cost is software complexity: the programmer or compiler must correctly insert synchronization to ensure certain specific orderings
  • But in practice the complexities are encapsulated in libraries that provide intuitive primitives like lock/unlock and barrier (or lower level primitives like fence)
  • Optimize for the common case: most memory accesses are not conflicting, so don't pay the cost as if they are

Relaxed consistency models differ in which memory ordering constraints they ignore

SLIDE 62

Memory Consistency

Final Thoughts

What consistency model best describes pthreads?

SLIDE 63

Memory Consistency

Hands-on Exercises: Synchronization Constructs and Deadlock

Objectives:
  • To understand how to create threads, how race conditions can occur, and how monitors can be implemented in Pthreads programs
  • To understand how race conditions can cause deadlocks and erroneous results

Computer Systems (ANU) Multiprocessor Parallelism Feb 12, 2020 63 / 110

slide-64
SLIDE 64

The OpenMP Programming Model

Outline

1 OS Support for Threads
2 Mutual Exclusion
3 Synchronization Constructs
4 Memory Consistency
5 The OpenMP Programming Model
6 C++ 11 Threads

slide-65
SLIDE 65

The OpenMP Programming Model

OpenMP Reference Material

http://www.openmp.org/
Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Chapters 6 & 7
High Performance Computing, Dowd and Severance, Chapter 11
Introduction to Parallel Computing, 2nd Ed, A. Grama, A. Gupta, G. Karypis, V. Kumar
Parallel Programming in OpenMP, R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, R. Menon

slide-66
SLIDE 66

The OpenMP Programming Model

Shared Memory Parallel Programming

Explicit thread programming is messy

low-level primitives
originally non-standard, although better since pthreads
used by system programmers, but ... application programmers have better things to do!

Many application codes can be usefully supported by higher level constructs

led to proprietary directive based approaches of Cray, SGI, Sun etc

OpenMP is an API for shared memory parallel programming targeting Fortran, C and C++

standardizes the form of the proprietary directives
avoids the need for explicitly setting up mutexes, condition variables, data scope, and initialization

slide-67
SLIDE 67

The OpenMP Programming Model

OpenMP

Specifications maintained by the OpenMP Architecture Review Board (ARB)
members include AMD, Intel, Fujitsu, IBM, NVIDIA, ..., cOMPunity
Versions 1.0 (Fortran '97, C '98), 1.1 and 2.0 (Fortran '00, C/C++ '02), 2.5 (unified Fortran and C, 2005), 3.0 (2008), 3.1 (2011), 4.0 (2013), 4.5 (2015)
Comprises compiler directives, library routines and environment variables
C directives (case sensitive): #pragma omp directive-name [clause-list]
library calls begin with omp_: void omp_set_num_threads(int nthreads)
environment variables begin with OMP_: export OMP_NUM_THREADS=4
OpenMP requires compiler support, activated via the -fopenmp (gcc) or -qopenmp (icc) compiler flags

slide-68
SLIDE 68

The OpenMP Programming Model

The Parallel Directive

OpenMP uses a fork/join model, i.e. programs execute serially until they encounter a parallel directive:

this creates a group of threads
the number of threads depends on an environment variable or is set via a function call
the main thread becomes the master, with thread id 0

#pragma omp parallel [clause-list]
/* structured block */

Each thread executes the structured block
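A minimal sketch of a parallel region, assuming a compiler invoked with -fopenmp (without it the pragmas are ignored and the code runs serially with a team of one). The function name collect_thread_ids is hypothetical; each thread records its own id in its own slot.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns one entry per thread in the team; entry t holds the id reported by thread t.
std::vector<int> collect_thread_ids() {
    std::vector<int> ids;
    #pragma omp parallel num_threads(4) shared(ids)
    {
#ifdef _OPENMP
        int tid = omp_get_thread_num();        // 0 for the master thread
        int nthreads = omp_get_num_threads();  // actual team size (may be fewer than 4)
#else
        int tid = 0, nthreads = 1;             // serial fallback when OpenMP is disabled
#endif
        #pragma omp single
        ids.assign(nthreads, -1);              // one thread sizes the vector; implicit barrier follows
        ids[tid] = tid;                        // each thread writes only its own slot
    }                                          // join: only the master continues from here
    return ids;
}
```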


slide-69
SLIDE 69

The OpenMP Programming Model

Fork-Join Model

Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Figure 6.1


slide-70
SLIDE 70

The OpenMP Programming Model

Parallel Clauses

Clauses are used to specify

Conditional parallelization: determines whether the parallel construct results in the creation of threads
if (scalar expression)

Degree of concurrency: explicit specification of the number of threads created
num_threads (integer expression)

Data handling: indicates whether specific variables are local to a thread (allocated on the stack), global, or "special"
private (variable list)
shared (variable list)
firstprivate (variable list)

slide-71
SLIDE 71

The OpenMP Programming Model

Compiler Translation: OpenMP to Pthreads

OpenMP code

int a, b;
main() {
    // serial segment
    #pragma omp parallel num_threads(8) private(a) shared(b)
    {
        // parallel segment
    }
    // rest of serial segment
}

Pthread equivalent (structured block is outlined)

int a, b;
main() {
    // serial segment
    for (i = 0; i < 8; i++)
        pthread_create(....., internal_thunk, ...);
    for (i = 0; i < 8; i++)
        pthread_join(........);
    // rest of serial segment
}
void *internal_thunk(void *packaged_argument) {
    int a;
    // parallel segment
}

slide-72
SLIDE 72

The OpenMP Programming Model

Parallel Directive Examples

#pragma omp parallel if (is_parallel == 1) num_threads(8) \
    private(a) shared(b) firstprivate(c)

If the value of the variable is_parallel is one, eight threads are used
Each thread has a private copy of a and c, but shares a single copy of b
The value of each private copy of c is initialized to the value of c before the parallel region

#pragma omp parallel reduction(+: sum) num_threads(8) default(private)

Eight threads each get a copy of the variable sum
When the threads exit, the values of these local copies are accumulated into the sum variable on the master thread
other reduction operations include *, -, &, |, ^, && and ||
All variables are private unless otherwise specified (note: in C/C++, default(private) requires OpenMP 5.1 or later; earlier versions accept only default(shared) or default(none))

slide-73
SLIDE 73

The OpenMP Programming Model

Example: Computing Pi

Compute π by generating random points in a square with side length of 2 centered at (0,0) and counting the points that fall within a circle of radius 1

The area of the square = 4, the area of the circle = πr² = π
The ratio of points inside the circle to the total number of points approaches π/4

#pragma omp parallel default(private) shared(npoints) \
    reduction(+: sum) num_threads(8)
{
    num_threads = omp_get_num_threads();
    sample_points_per_thread = npoints / num_threads;
    sum = 0;
    for (i = 0; i < sample_points_per_thread; i++) {
        rand_x = (double) rand_range(&seed, -1, 1);
        rand_y = (double) rand_range(&seed, -1, 1);
        if ((rand_x * rand_x + rand_y * rand_y) <= 1.0)
            sum++;
    }
}

The OpenMP code is very simple - try writing the equivalent pthread code!
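For comparison, a self-contained sketch of the same Monte Carlo estimate. It is not the slide's code: the rand_range helper is replaced by a per-thread std::mt19937 generator (an assumption made so that the example compiles on its own), and the name estimate_pi and the seed value are hypothetical.

```cpp
#include <random>
#ifdef _OPENMP
#include <omp.h>
#endif

// Estimates pi from the fraction of random points in [-1,1]^2 that land in the unit circle.
double estimate_pi(long npoints) {
    long inside = 0;
    #pragma omp parallel reduction(+: inside)
    {
#ifdef _OPENMP
        unsigned tid = static_cast<unsigned>(omp_get_thread_num());
#else
        unsigned tid = 0;                       // serial fallback
#endif
        std::mt19937 gen(12345u + tid);         // one generator per thread, so no sharing
        std::uniform_real_distribution<double> coord(-1.0, 1.0);
        #pragma omp for
        for (long i = 0; i < npoints; i++) {
            double x = coord(gen), y = coord(gen);
            if (x * x + y * y <= 1.0)
                inside++;                       // updates the thread's private copy
        }
    }                                           // private copies are summed into inside here
    return 4.0 * static_cast<double>(inside) / static_cast<double>(npoints);
}
```

Giving each thread its own generator avoids both a race on shared generator state and the cost of locking around it.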


slide-74
SLIDE 74

The OpenMP Programming Model

The for Worksharing Directive

Used in conjunction with parallel directive to partition the for loop immediately afterwards

#pragma omp parallel default(private) shared(npoints) \
    reduction(+: sum) num_threads(8)
{
    sum = 0;
    #pragma omp for
    for (i = 0; i < npoints; i++) {
        rand_x = (double) rand_range(&seed, -1, 1);
        rand_y = (double) rand_range(&seed, -1, 1);
        if ((rand_x * rand_x + rand_y * rand_y) <= 1.0)
            sum++;
    }
}

The loop index (i) is assumed to be private
Only two directives plus the sequential code (code is easy to read/maintain)

There is implicit synchronization at the end of the loop

A nowait clause can be added to prevent synchronization

slide-75
SLIDE 75

The OpenMP Programming Model

The Combined parallel for Directive

The most common use case for parallelizing for loops

sum = 0;
#pragma omp parallel for default(private) shared(npoints) \
    reduction(+: sum) num_threads(8)
for (i = 0; i < npoints; i++) {
    rand_x = (double) rand_range(&seed, -1, 1);
    rand_y = (double) rand_range(&seed, -1, 1);
    if ((rand_x * rand_x + rand_y * rand_y) <= 1.0)
        sum++;
}
printf("sum = %d\n", sum);

Inside the parallel region, sum is treated as a thread-local variable (implicitly initialized to 0)
At the end of the region, the thread-local versions of sum are added to the global sum (here initialized to 0) to get the final value

slide-76
SLIDE 76

The OpenMP Programming Model

Assigning Iterations to Threads

The schedule clause of the for directive assigns iterations to threads

schedule(scheduling clause[,parameter])

schedule(static[,chunk-size])

splits the iteration space into chunks of size chunk-size and allocates them to threads in a round-robin fashion
no specification implies the number of chunks equals the number of threads

schedule(dynamic[,chunk-size])

iteration space split into chunk-size blocks that are scheduled dynamically

schedule(guided[,chunk-size])

chunk size decreases exponentially with iterations to a minimum of chunk-size

schedule(runtime)

determine scheduling based on the setting of the OMP_SCHEDULE environment variable
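The round-robin rule for schedule(static, chunk-size) can be checked with a small sketch: iteration i belongs to chunk i / chunk, and chunk c goes to thread c mod nthreads. The function names static_owner and observed_owners are hypothetical; built without -fopenmp the loop runs serially and every iteration is owned by thread 0.

```cpp
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Owner predicted by the round-robin rule for schedule(static, chunk).
int static_owner(int i, int chunk, int nthreads) {
    return (i / chunk) % nthreads;
}

// Records which thread actually executed each iteration; team size is returned via nthreads.
std::vector<int> observed_owners(int n, int chunk, int &nthreads) {
    std::vector<int> owner(n, -1);
    int team = 1;                                // serial fallback
    #pragma omp parallel num_threads(4) shared(owner, team)
    {
#ifdef _OPENMP
        #pragma omp single
        team = omp_get_num_threads();            // implicit barrier after single
#endif
        #pragma omp for schedule(static, chunk)
        for (int i = 0; i < n; i++) {
#ifdef _OPENMP
            owner[i] = omp_get_thread_num();     // each iteration writes a distinct element
#else
            owner[i] = 0;
#endif
        }
    }
    nthreads = team;
    return owner;
}
```

Replacing static with dynamic in the clause would make the observed owners vary from run to run, which is the price paid for better load balance on irregular iterations.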


slide-77
SLIDE 77

The OpenMP Programming Model

Loop Schedules

Introduction to High Performance Computing for Scientists and Engineers, Hager and Wellein, Figure 6.2


slide-78
SLIDE 78

The OpenMP Programming Model

Sections

Consider partitioning of fixed number of tasks across threads

much less common than for loop partitioning
explicit programming naturally limits the number of threads (scalability)

#pragma omp sections
{
    #pragma omp section
    { taskA(); }
    #pragma omp section
    { taskB(); }
}

Separate threads will run taskA and taskB
It is illegal to branch into or out of section blocks

slide-79
SLIDE 79

The OpenMP Programming Model

Nesting Parallel Directives

What happens for nested for loops?

#pragma omp parallel for num_threads(2)
for (i = 0; i < Ni; i++) {
    #pragma omp parallel for num_threads(2)
    for (j = 0; j < Nj; j++) {
        ...
    }
}

By default the inner loop is serialized and run by one thread
Enabling multiple threads in nested parallel loops requires the environment variable OMP_NESTED to be TRUE
Note - the use of synchronization constructs in nested parallel sections requires care (see OpenMP specs)!

slide-80
SLIDE 80

The OpenMP Programming Model

Synchronization#1

Barrier: threads wait until they have all reached this point

#pragma omp barrier

Single: the following block is executed only by the first thread to reach this point
others wait at the end of the structured block unless a nowait clause is used

#pragma omp single [clause-list]
/* structured block */

Master: only the master executes the following block; other threads do NOT wait

#pragma omp master
/* structured block */

Critical Section: only one thread is ever in the named critical section

#pragma omp critical [(name)]
/* structured block */

Atomic: the memory location updated in the following statement is updated in an atomic fashion
the same effect can be achieved using critical sections
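The difference between atomic and critical can be sketched as follows: a single read-modify-write of one memory location fits atomic, while a compare-then-write sequence needs a critical section. The function name scan and the critical-section name max_update are hypothetical; without -fopenmp the pragmas are ignored and the loop runs serially.

```cpp
#include <vector>

// Counts the even values and finds the maximum of a non-empty vector.
void scan(const std::vector<int> &data, int &evens, int &maximum) {
    int even_count = 0;
    int max_seen = data[0];                     // assumes data is non-empty
    const long n = static_cast<long>(data.size());
    #pragma omp parallel for shared(even_count, max_seen)
    for (long i = 0; i < n; i++) {
        if (data[i] % 2 == 0) {
            #pragma omp atomic
            even_count++;                       // one memory update: atomic suffices
        }
        #pragma omp critical(max_update)
        {
            if (data[i] > max_seen)             // test and update must not be interleaved
                max_seen = data[i];
        }
    }
    evens = even_count;
    maximum = max_seen;
}
```

In practice a reduction clause would usually be preferable for both quantities; the explicit constructs are used here only to show their scope.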

slide-81
SLIDE 81

The OpenMP Programming Model

Synchronization#2

Ordered: some operations within a for loop must be performed as if they were executed in sequential order

cumul_sum[0] = list[0];
#pragma omp parallel for ordered shared(cumul_sum, list, n)
for (i = 1; i < n; i++) {
    /* other processing on list[i] if required */
    #pragma omp ordered
    {
        cumul_sum[i] = cumul_sum[i-1] + list[i];
    }
}

Flush: ensures a consistent view of memory
i.e. that variables have been flushed from registers into memory

#pragma omp flush [(list)]

slide-82
SLIDE 82

The OpenMP Programming Model

Data Handling

private: an uninitialized local copy of the variable is made for each thread
shared: variables shared between threads
firstprivate: make a local copy of an existing variable and assign it the same value
often better than multiple reads of a shared variable
lastprivate: copies back to the master the value from the thread executing the equivalent of the last loop iteration if executed serially
threadprivate: creates private variables that persist between multiple parallel regions, maintaining their values
copyin: like firstprivate but for threadprivate variables
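A short sketch of firstprivate and lastprivate together, assuming n is at least 1. The function name last_square_plus is hypothetical. Each thread starts with its own copy of offset initialized from the caller's value, and after the loop last holds the value assigned in the sequentially final iteration.

```cpp
// Returns offset + (n-1)*(n-1), i.e. the value computed by the last iteration (n >= 1).
int last_square_plus(int offset, int n) {
    int last = -1;                    // overwritten by the copy-back from the final iteration
    #pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (int i = 0; i < n; i++) {
        last = offset + i * i;        // writes the thread's private copy of last
    }
    return last;                      // value from iteration n-1, as in serial execution
}
```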


slide-83
SLIDE 83

The OpenMP Programming Model

OpenMP Tasks

A task has

Code to execute
A data environment (it owns its data)
An assigned thread that executes the code and uses the data

Creating a task involves two activities: packaging and execution

Each encountering thread packages a new instance of a task (code and data)
Some thread in the team executes the task at some later time


slide-84
SLIDE 84

The OpenMP Programming Model

Task Syntax

#pragma omp task [clause ...]
    if (scalar expression)
    final (scalar expression)
    untied
    default (shared | none)
    mergeable
    private (list)
    firstprivate (list)
    shared (list)
structured_block

When the if clause is false, the task is executed immediately (in its own environment)
Tasks complete at thread barriers (explicit or implicit) and at task barriers

#pragma omp taskwait

Applies only to child tasks generated in the current task, not to "descendants"

slide-85
SLIDE 85

The OpenMP Programming Model

Task Example

#include <stdio.h>
#include <omp.h>

int fib(int n) {
    int i, j;
    if (n < 2)
        return n;
    else {
        #pragma omp task shared(i) firstprivate(n)
        i = fib(n-1);
        #pragma omp task shared(j) firstprivate(n)
        j = fib(n-2);
        #pragma omp taskwait   /* wait for both child tasks before using i and j */
        return i + j;
    }
}

int main() {
    int n = 10;
    omp_set_dynamic(0);
    omp_set_num_threads(4);
    #pragma omp parallel shared(n)
    {
        #pragma omp single     /* one thread generates the root task */
        printf("fib(%d) = %d\n", n, fib(n));
    }
}

slide-86
SLIDE 86

The OpenMP Programming Model

Task Issues

Task switching

Certain constructs have task scheduling points at defined locations within them
When a thread encounters a task scheduling point, it is allowed to suspend the current task and execute another (task switching)
It can then return to the original task and resume

Tied Tasks

By default a suspended task must resume execution on the same thread that was previously executing it
the "untied" clause relaxes this constraint

Task Generation

Very easy to generate many tasks very quickly!
The generating task may be suspended while its thread starts working on a long and boring task
Other threads can consume all their tasks and have nothing to do

slide-87
SLIDE 87

The OpenMP Programming Model

Library Functions#1

Defined in header file

#include <omp.h>

Controlling threads and processors

void omp_set_num_threads(int num_threads)
int omp_get_num_threads()
int omp_get_max_threads()
int omp_get_thread_num()
int omp_get_num_procs()
int omp_in_parallel()

Controlling thread creation

void omp_set_dynamic(int dynamic_threads)
int omp_get_dynamic()
void omp_set_nested(int nested)
int omp_get_nested()

slide-88
SLIDE 88

The OpenMP Programming Model

Library Functions#2

Mutual exclusion

void omp_init_lock(omp_lock_t *lock)
void omp_destroy_lock(omp_lock_t *lock)
void omp_set_lock(omp_lock_t *lock)
void omp_unset_lock(omp_lock_t *lock)
int omp_test_lock(omp_lock_t *lock)

Nested mutual exclusion

void omp_init_nest_lock(omp_nest_lock_t *lock)
void omp_destroy_nest_lock(omp_nest_lock_t *lock)
void omp_set_nest_lock(omp_nest_lock_t *lock)
void omp_unset_nest_lock(omp_nest_lock_t *lock)
int omp_test_nest_lock(omp_nest_lock_t *lock)

slide-89
SLIDE 89

The OpenMP Programming Model

OpenMP Environment Variables

OMP_NUM_THREADS: default number of threads entering a parallel region
OMP_DYNAMIC: if TRUE it permits the number of threads to change during execution
OMP_NESTED: if TRUE it permits nested parallel regions
OMP_SCHEDULE: determines scheduling for loops that are defined to have runtime scheduling

setenv OMP_SCHEDULE "static,4"
setenv OMP_SCHEDULE "dynamic"
setenv OMP_SCHEDULE "guided"

slide-90
SLIDE 90

The OpenMP Programming Model

OpenMP and Pthreads

OpenMP removes the need for a programmer to initialize task attributes, set up arguments to threads, partition iteration spaces etc
OpenMP code can closely resemble serial code
OpenMP is particularly useful for static or regular problems
OpenMP users are hostage to the availability of an OpenMP compiler
performance is heavily dependent on the quality of the compiler

Pthreads data exchange is more apparent, so false sharing and contention are less likely
Pthreads has a richer API that is much more flexible, e.g. condition waits, locks of different types etc
Pthreads is library based
Must balance the above before deciding on a parallel model

slide-91
SLIDE 91

The OpenMP Programming Model

Hands-on Exercise: Programming with OpenMP

Objective: To use OpenMP to provide a basic introduction to shared memory parallel programming


slide-92
SLIDE 92

C++ 11 Threads

Outline

1 OS Support for Threads
2 Mutual Exclusion
3 Synchronization Constructs
4 Memory Consistency
5 The OpenMP Programming Model
6 C++ 11 Threads

slide-93
SLIDE 93

C++ 11 Threads

Threads in C++11

A higher level of abstraction for threads and synchronization mechanisms
The C++11 thread library provides a number of classes in the std namespace. The header files below need to be included:

<thread>: managing and identifying threads
<mutex>: mutual exclusion primitives
<condition_variable>: condition synchronization primitives
<atomic>: atomic types and operations

The thread::hardware_concurrency() operation reports the number of tasks that can simultaneously proceed with hardware support

slide-94
SLIDE 94

C++ 11 Threads

C++11 Threads Basics

The std::thread constructor takes a task (function) to be executed and the arguments required by that task. The number and types of arguments must match what the function requires. For example:

void f0();     // no arguments
void f1(int);  // one int argument

thread t1 {f0};
thread t2 {f0, 1};     // error: too many arguments
thread t3 {f1};        // error: too few arguments
thread t4 {f1, 1};
thread t5 {f1, 1, 2};  // error: too many arguments
thread t6 {f1, "I'm being silly"};  // error: wrong type of argument

Each thread of execution has a unique identifier represented as a value of type thread::id. The id of a thread 't' can be obtained by a call of t.get_id().

slide-95
SLIDE 95

C++ 11 Threads

C++11 Threads Basics contd.

An object of the std::thread class is automatically started on construction:

#include <iostream>
#include <thread>
void hello() {
    std::cout << "Hello from thread " << std::this_thread::get_id()
              << std::endl;
}
int main() {
    std::thread t1(hello);
    t1.join();  // wait for t1 to finish
}

Often, it is useful to supply our own ids (etc) to a number of threads:

void hello(int i) { printf("Hello from thread with logical id %d\n", i); }
...
std::thread** ts = new std::thread*[5];
for (int i = 0; i < 5; i++)
    ts[i] = new std::thread(hello, i);
for (int i = 0; i < 5; i++)
    ts[i]->join();

slide-96
SLIDE 96

C++ 11 Threads

C++11 Threads Basics contd.

A t.join() tells the current thread not to proceed until t completes

void tick(int n) {
    for (int i = 0; i != n; ++i) {
        this_thread::sleep_for(std::chrono::seconds{1});
        printf("Alive!\n");
    }
}
int main() {
    thread timer {tick, 10};
    timer.join();
}

We can also use lambda functions to specify the functions that the thread executes (but beware!):

for (int i = 0; i < 5; i++)
    ts[i] = new std::thread([&i]() {  // beware: i is captured by reference
        printf("Hello from thread %d\n", i);
    });
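The danger in the lambda above is that i is captured by reference, so each thread may read i after the loop has changed it (or after it has gone out of scope). A minimal sketch of the usual remedy, capturing i by value; the function name ids_seen_by_threads is hypothetical.

```cpp
#include <thread>
#include <vector>

// Starts n threads; thread i stores the id it was given into slot i.
std::vector<int> ids_seen_by_threads(int n) {
    std::vector<int> seen(n, -1);
    std::vector<std::thread> ts;
    for (int i = 0; i < n; i++)
        ts.emplace_back([i, &seen]() {   // i is copied at creation time; seen is shared
            seen[i] = i;                 // each thread writes only its own slot
        });
    for (auto &t : ts)
        t.join();                        // all writes are visible after join
    return seen;
}
```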

slide-97
SLIDE 97

C++ 11 Threads

Thread Local Variables

A thread local variable is an object owned by a thread. It is not accessible from other threads unless its owner gives out a pointer to it
thread local variables are shared among all functions of a thread and live for as long as the thread
Each thread has its own copy of its thread local variables
These are initialized the first time control passes through their definition. If constructed, they are destroyed on thread exit

A thread explicitly keeps a cache of thread local data for exclusive access

void debug_counter() {
    thread_local int count = 0;
    // thread::id is streamed, since it cannot be passed to printf's %d
    std::cout << "This function has been called " << ++count
              << " times by thread " << std::this_thread::get_id() << std::endl;
}
...
for (int i = 0; i < 5; i++)
    ts[i] = new std::thread([](int n) {
        for (int j = 0; j < n; j++) debug_counter();
    }, i);

slide-98
SLIDE 98

C++ 11 Threads

Locks in C++11 - Mutex

Suppose we wanted threads to repeatedly increment a counter

class Counter {
    int v;
public:
    Counter() { v = 0; }
    void inc() { v++; }
};

Counter count;
for (int i = 0; i < 5; i++)
    ts[i] = new std::thread([&count]() {
        for (int j = 0; j < 1000; j++)
            count.inc();
    });

The above code suffers from interference. We could fix this by adding a lock field std::mutex l to Counter and redefining:

void inc() { l.lock(); v++; l.unlock(); }

Or use std::lock_guard:

void inc() { std::lock_guard<std::mutex> guard(l); v++; }

Note: When guard is constructed, it automatically calls l.lock(). The lock is automatically released when the object is destroyed

C++11 also provides the std::atomic<T> class for atomic operations. It provides atomic member functions such as compare_exchange_strong and fetch_add. E.g.

std::atomic<int> v{0}; v.fetch_add(1, std::memory_order_relaxed);
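The atomic alternative can be sketched as a complete function. The name count_with_atomic is hypothetical; relaxed ordering is enough here because only the final total matters and join() orders the final read after all increments.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each of nthreads threads increments a shared atomic counter `increments` times.
int count_with_atomic(int nthreads, int increments) {
    std::atomic<int> v{0};
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; t++)
        ts.emplace_back([&v, increments]() {
            for (int j = 0; j < increments; j++)
                v.fetch_add(1, std::memory_order_relaxed);  // indivisible read-modify-write
        });
    for (auto &t : ts)
        t.join();
    return v.load();   // no increments are lost
}
```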


slide-99
SLIDE 99

C++ 11 Threads

Monitors and Condition Synchronization

Concepts: monitors
Encapsulated data + access functions
Mutual exclusion + condition synchronization
Only a single access function can be active in the monitor

Practice:
Private shared data and synchronized functions (exclusion)
The latter requires a per-object lock to be acquired in each function
Only a single thread may be active in the monitor at a time

Condition synchronization: achieved using a condition variable, std::condition_variable.
It provides the following methods: wait(), notify_one() and notify_all()

Note: monitors serialize all accesses to the encapsulated data!


slide-100
SLIDE 100

C++ 11 Threads

Condition Synchronization: std::condition_variable

C++ provides a thread wait set for each monitor:

void wait(std::unique_lock<std::mutex>& l);
void wait(std::unique_lock<std::mutex>& l, Predicate pred);

Causes the current thread to block until the condition variable is notified, optionally looping until pred is true. Atomically releases lock l, blocks the current executing thread, and adds it to the list of threads waiting on *this

notify_one() or notify_all(): Unblocks one or more threads in the wait set. The lock is reacquired and the condition is checked again

A thread is deemed to have entered the monitor when it acquires the monitor's mutual exclusion lock

(courtesy Magee&Kramer: Concurrency)

slide-101
SLIDE 101

C++ 11 Threads

Bounded Buffer in C++

A bounded buffer has a capacity of N items:

We can put an item in if there are < N items
We can get an item out if there are > 0 items

C++ requires a condition variable for each of the above conditions

class BoundedBuffer {
    int count, N;
    std::mutex bufLock;
    std::condition_variable notFull, notEmpty;
public:
    BoundedBuffer(int capacity) { N = capacity; count = 0; }
    void put(int v) {
        std::unique_lock<std::mutex> l(bufLock);
        notFull.wait(l, [this]() { return count < N; });
        // code to add v to buffer
        count++;
        notEmpty.notify_one();
    }
    int get() {
        int v;
        std::unique_lock<std::mutex> l(bufLock);
        notEmpty.wait(l, [this]() { return count > 0; });
        // code to remove item from buffer and put in v
        count--;
        notFull.notify_one();
        return v;
    }
};

slide-102
SLIDE 102

C++ 11 Threads

Launching the Bounded Buffer

First need a producer to add items, and a consumer to remove them

void consumer(BoundedBuffer &buf) {
    for (int i = 0; i < NumOps; i++) {
        int v = buf.get();
    }
}
void producer(BoundedBuffer &buf) {
    for (int i = 0; i < NumOps; i++)
        buf.put(i);
}

Thread startup requires creating the buffer first, and passing its reference to the appropriately constructed thread objects acting as the producer and consumer (which immediately start)

BoundedBuffer buf(N);
std::thread cons(consumer, std::ref(buf));
std::thread prod(producer, std::ref(buf));
prod.join(); cons.join();

The shared buffer object ensures synchronization
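Putting the monitor and its launch together, a self-contained sketch might look as follows. The std::queue storage is an assumption (the slides leave the buffer's storage unspecified), and the function name run_producer_consumer is hypothetical; it returns the sum of the consumed items so the result can be checked.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

class BoundedBuffer {
    std::queue<int> items;                     // assumed storage for the buffered items
    std::size_t capacity;
    std::mutex bufLock;                        // the monitor lock
    std::condition_variable notFull, notEmpty; // one condition variable per condition
public:
    explicit BoundedBuffer(std::size_t n) : capacity(n) {}
    void put(int v) {
        std::unique_lock<std::mutex> l(bufLock);
        notFull.wait(l, [this]() { return items.size() < capacity; });
        items.push(v);
        notEmpty.notify_one();                 // a waiting consumer may now proceed
    }
    int get() {
        std::unique_lock<std::mutex> l(bufLock);
        notEmpty.wait(l, [this]() { return !items.empty(); });
        int v = items.front();
        items.pop();
        notFull.notify_one();                  // a waiting producer may now proceed
        return v;
    }
};

// Runs one producer (puts 0..num_ops-1) and one consumer; returns the sum consumed.
long run_producer_consumer(int num_ops, std::size_t capacity) {
    BoundedBuffer buf(capacity);
    long total = 0;                            // written only by the consumer thread
    std::thread cons([&]() { for (int i = 0; i < num_ops; i++) total += buf.get(); });
    std::thread prod([&]() { for (int i = 0; i < num_ops; i++) buf.put(i); });
    prod.join();
    cons.join();
    return total;
}
```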


slide-103
SLIDE 103

C++ 11 Threads

Semaphores: Concepts

Semaphores are widely used for dealing with inter-process synchronization in operating systems
A semaphore s is an integer variable that can take only non-negative values
The only operations permitted on s are up(s) and down(s)

down(s): if (s > 0) decrement s, else block execution of the calling process
up(s): if there are processes blocked on s, awaken one of them, else increment s

blocked processes or threads are held in a FIFO queue


slide-104
SLIDE 104

C++ 11 Threads

A C++ Implementation of Semaphores

A semaphore may be implemented as a monitor encapsulating the integer value

class Semaphore {
    int v;
    std::mutex semLock;
    std::condition_variable semZero;
public:
    Semaphore(int initial) {
        v = initial;
    }
    void up() {
        std::unique_lock<std::mutex> l(semLock);
        v++;
        semZero.notify_one();
    }
    void down() {
        std::unique_lock<std::mutex> l(semLock);
        semZero.wait(l, [this]() { return v > 0; });
        v--;
    }
}; // Semaphore

slide-105
SLIDE 105

C++ 11 Threads

Bounded Buffer via Semaphores

Beware the nested monitor problem!

class BufferSem {
    int count, N;
    std::mutex bufLock;
    Semaphore *full, *empty;
public:
    BufferSem(int capacity) {
        N = capacity; count = 0;
        full = new Semaphore(0);
        empty = new Semaphore(N);
    }
    void put(int v) {
        empty->down();  // must occur before we get bufLock!
        std::unique_lock<std::mutex> l(bufLock);
        // code to add v to buffer
        count++;
        full->up();
    }
    int get() {
        int v;
        full->down();   // must occur before we get bufLock!
        std::unique_lock<std::mutex> l(bufLock);
        // code to remove item from buffer and put in v
        count--;
        empty->up();
        return v;
    }
};

slide-106
SLIDE 106

C++ 11 Threads

Dining Philosophers: Model Structure Diagram

Each FORK is a shared resource with actions get and put. When hungry, each Phil must first get his right and left forks before he can start eating. Each Fork is either in a taken or not taken state. This is an example where the resulting monitor state can be distributed (resulting in greater system concurrency).

(courtesy Magee&Kramer: Concurrency)

slide-107
SLIDE 107

C++ 11 Threads

Dining Philosophers in C++11

forks: shared passive entities – implement as monitors

class Fork {
    bool taken;   // monitor state; available if false
    int forkId;   // position of fork in the ring
    std::mutex forkLock;
    std::condition_variable isTaken;
public:
    Fork(int id) { forkId = id; taken = false; }
    void get() {
        std::unique_lock<std::mutex> l(forkLock);
        isTaken.wait(l, [this]() { return (taken == false); });
        taken = true;
    }
    void put() {
        std::unique_lock<std::mutex> l(forkLock);
        taken = false;
        isTaken.notify_one();
    }
};

philosophers: active entities – implement as threads

void philosopher(int id, Fork* left, Fork* right) {
    while (true) {                   // sit down
        left->get(); right->get();   // eat
        left->put(); right->put();   // stand up
    }
}

slide-108
SLIDE 108

C++ 11 Threads

Launching the Philosophers in C++11

Fork** fork = new Fork*[N];
for (int i = 0; i < N; i++)
    fork[i] = new Fork(i);
std::thread** phil = new std::thread*[N];
for (int i = 0; i < N; i++)
    phil[i] = new std::thread(philosopher, i, fork[(i+N-1)%N], fork[i]);
for (int i = 0; i < N; i++)
    phil[i]->join();

How can we avoid the inevitable deadlock?
prevention (of the dangerous situation): do not permit all N philosophers to sit down
remove the wait-for cycle: odd and even philosophers pick up forks in the opposite order
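The second remedy (odd and even philosophers picking up forks in opposite order) can be sketched as below. To keep the example short, each fork is a plain std::mutex rather than the Fork monitor, and each philosopher eats a fixed number of meals so the program terminates; both are simplifying assumptions. The name dine is hypothetical, and n is assumed to be at least 2.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// n philosophers (n >= 2) each eat `meals` times; returns the total number of meals eaten.
int dine(int n, int meals) {
    std::vector<std::mutex> forks(n);
    std::vector<int> eaten(n, 0);
    std::vector<std::thread> phils;
    for (int id = 0; id < n; id++)
        phils.emplace_back([&, id]() {
            int left = (id + n - 1) % n, right = id;
            // Even philosophers take the left fork first, odd ones the right fork first,
            // so a cycle in which everyone holds one fork and waits cannot form.
            int first  = (id % 2 == 0) ? left : right;
            int second = (id % 2 == 0) ? right : left;
            for (int m = 0; m < meals; m++) {
                std::lock_guard<std::mutex> a(forks[first]);
                std::lock_guard<std::mutex> b(forks[second]);
                eaten[id]++;             // eat; both forks are released at the end of the scope
            }
        });
    for (auto &p : phils)
        p.join();
    int total = 0;
    for (int e : eaten)
        total += e;
    return total;
}
```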


slide-109
SLIDE 109

C++ 11 Threads

Hands-on Exercise: C++11 Threads and Condition Synchronization

Objective:
To understand how to create threads in C++11 and to look at two crucial issues: interference and deadlock
To learn how monitors are implemented in C++11

slide-110
SLIDE 110

C++ 11 Threads

Summary

Topics covered today - Multiprocessor Parallelism:
Threads and OS support
Mutual Exclusion
Shared memory synchronization constructs
Memory Consistency
OpenMP
C++ 11 Threads

Tomorrow - Parallel Performance Optimization!