
1

Lecture 3:

Small-Scale Shared Address Space Multiprocessors & Shared memory programming (OpenMP + POSIX threads)

2

Small-Scale Shared Address Space Multiprocessor (MIMD)

  • Processors are connected via a dynamic network/bus to a shared memory
  • Communication and coordination through shared variables in the memory
  • The user (the compiler) guarantees that data is written and read in the right order (barriers, semaphores, ...)
  • UMA: All processors have the same access time to all memory modules
  • (CC-)NUMA: Processors may have different access times to different memory modules
  • Data locality (temporal + spatial) important to get high performance

[Figure: processors P connected via a memory bus to the memory (modules)]

3

Contention

Processor-memory communication demand grows in proportion to the number of processors, so the bandwidth must also increase in proportion to the number of processors

  • Communication Contention
    – More than one processor wants to use the same link at the same time
  • Memory Contention
    – More than one processor accesses the same memory module at the same time
    – Serialization
  • Organization of the memory + network are important components for reducing contention
  • A maximum of around one hundred processors (in practice fewer); more demands something other than just a bus


4

Bandwidth to memory & cache memories in multiprocessors

  • Use locality and introduce local private cache memories

> Reduces accesses to memory ⇒ reduces the pressure on the network

  • Cache-coherence problem

> More than one copy of a shared block at the same time

5

Programming SMM

  • Big similarities with OS programming
  • Done by small extensions of existing programming languages, OS & libraries
    – create processes/threads that execute in parallel
    – different processes/threads can be assigned to different processors
    – synchronize and lock critical regions and data

6

Multi- and Microtasking

  • UNIX - Coarse-grained (Multitasking)
  • Vacant processors are assigned new processes continuously
  • Work queue, "heavy weight" tasks
  • Dynamic load balancing, coarse-grained
  • Heterogeneous tasks – perform different, separate things
  • Expensive to create in both time and memory (copies everything from the parent, except the process ID)

  • POSIX threads - Fine-grained (Microtasking)
  • "Light weight" processes
  • Parallelism within an application (homogeneous tasks), e.g., loop splitting
  • An application is a series of forks and joins
  • Cheap to create in both time and memory (shares the memory and global variables)

slide-2
SLIDE 2

7

UNIX Process & shared memory

  • A UNIX process has three segments:
    – text, executable code
    – data
    – stack, activation records + dynamic data
  • Processes are independent and have no shared addresses
  • fork() - creates an exact copy of the process (pid is an exception)
  • UNIX process + shared data segment
  • A fork() copies everything except the shared data

[Figure: parent with code, data, stack and a shared data segment; after fork(), each process keeps its own code, private data and private stack but shares the shared data segment]
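To make the fork-plus-shared-segment idea concrete, here is a minimal C sketch (an illustration, assuming a POSIX system where the shared segment is created with mmap(MAP_SHARED | MAP_ANONYMOUS); variable names are illustrative):

#include <stdio.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Shared data segment: stays shared across fork(). */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    *shared = 0;

    int private_counter = 0;      /* copied by fork(), not shared */

    pid_t pid = fork();           /* child is an exact copy, except pid */
    if (pid == 0) {               /* child */
        *shared = 42;             /* visible to the parent */
        private_counter = 1;      /* invisible to the parent */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("shared = %d, private = %d\n", *shared, private_counter); /* 42, 0 */
    munmap(shared, sizeof(int));
    return 0;
}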

8

Introducing multiple threads

  • A thread is a sequence of instructions that executes within a program
  • A normal UNIX process contains one single thread
  • A process may contain more than one thread
  • Different parts of the same code may be executed at the same time
  • The threads form one process and share the address space, files, signal handling etc. For example:
    – When a thread opens a file, it is immediately accessible to all threads
    – When a thread writes to a global variable, it is readable by all threads
  • The threads have their own IP (instruction pointers) and stacks

9

Execution of threads

[Figure: the master thread creates worker threads with pthread_create(); the worker threads start, do their work and terminate; the master joins them with pthread_join()]

10

Coordinating parallel processes

  • Lock

– Mechanism to maintain a policy for how shared data may be read/written

  • Determinism

– The access order is the same for each execution

  • Nondeterminism

– The access order is random

  • Indeterminism (not good...)

– The result of nondeterminism: different executions may produce different results

11

Safe, Live, and Fair

  • Safe Lock

– deterministic results (can lead to unfairness)

  • Unfair Lock

– a job may wait forever while others get repeated accesses

  • Fair Lock

– all get access in roughly the order they arrived + "the neighbour to the right has priority" (may result in deadlocks = "all wait for their right neighbour")

  • Live Lock

– prevents deadlocks

12

Critical section, Road crossing

Yield right: safe but unfair – a car may be forced to wait forever on an unlimited number of cars
Yield right + first in first out (FIFO): safe, fair but not live – if all four cars arrive at exactly the same time, everybody is forced to wait on each other (deadlock)
Yield right + FIFO + priority: safe, fair and live

[Figure: four cars (1-4) meeting at a road crossing]


13

Example, Spin lock - Unsafe

    while C do;        % spin while C is true
    C := TRUE;         % Lock(C), only one access
    CR;                % Critical section
    C := FALSE;        % unLock(C)

  • Unsafe

– both may access critical section

  • Race

– The processes compete for the CR

14

Example, Spin lock - Safe

  • Safe
    – only one process may access the CR
  • Not live
    – risk of deadlock if all processes execute the first statement at the same time

    Flag[me] := TRUE;        % set my flag
    while Flag[other] do;    % Spin
    CR;                      % Critical section
    Flag[me] := FALSE;       % Clear lock

15

Example: Atomic test and set, Safe and Live

  • Atomic (uninterruptable operation) test and set lock

    char *lock;
    *lock = UNLOCKED;

    Atomic_test_and_set_lock(lock, LOCKED)
      % TRUE if the lock already is LOCKED
      % FALSE if the lock changes from UNLOCKED to LOCKED

    while Atomic_test_and_set_lock(lock, LOCKED);   % spin until the lock is acquired
    CR;                                             % Critical section
    Atomic_test_and_set_lock(lock, UNLOCKED)        % Unlock
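A minimal sketch of the same idea using C11 atomics (atomic_flag_test_and_set returns the previous value, so spinning on it gives a safe and live lock; the function name critical_section is illustrative):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* starts UNLOCKED */

void critical_section(void)
{
    /* Spin until test-and-set returns false, i.e. we were the ones who
       flipped the flag from UNLOCKED to LOCKED. */
    while (atomic_flag_test_and_set(&lock))
        ;                                     /* someone else holds the lock */

    /* CR: critical section */

    atomic_flag_clear(&lock);                 /* Unlock */
}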

16

Locks for simultaneous reading

  • Queue Lock

    struct q_lock {
        int head;
        int tasks[NPROCS];
    } lock;

    void add_to_q(lock, myid);
    while (head_of_q(lock) != myid);   % spin until it is my turn
    CR;
    void remove_from_q(lock, myid);

  • Safe, Live & Fair

– The queue assures that nobody has to wait forever to access the CR

  • Drawbacks

– Spinning
– How to implement the queue?
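One common way to implement such a FIFO queue without an explicit task array is a ticket lock; a hedged C11 sketch (illustrative, not taken from the slides):

#include <stdatomic.h>

typedef struct {
    atomic_int next_ticket;    /* ticket handed to the next arriving thread */
    atomic_int now_serving;    /* ticket currently allowed into the CR */
} ticket_lock_t;               /* initialize both fields to 0, e.g. {0, 0} */

void ticket_lock(ticket_lock_t *l)
{
    /* Take a ticket: this atomically places us at the end of the queue. */
    int my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    /* Spin until it is our turn; arrival order is preserved (fair). */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;
}

void ticket_unlock(ticket_lock_t *l)
{
    atomic_fetch_add(&l->now_serving, 1);   /* let the next ticket in */
}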

17

Multiple Readers Lock

  • Demands

– Several readers may access the CR at the same time
– When a writer arrives, no more readers are allowed to enter the Access Room
– When the readers inside the CR are finished, the writer has exclusive right to the CR

  • Waiting-room analogy

– safe, only one Writer at a time is allowed into the Access Room
– fair, when there is a Writer in the waiting-room, Readers in the waiting-room must let the Writer go first, except when there already is a Writer in the Access Room
– fair, when there is a Reader in the waiting-room, a Writer in the waiting-room must let the Reader go first, except when there already is a Reader in the Access Room
– live, no deadlocks. Everyone inside the waiting-room is guaranteed access to the Access Room
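In practice, POSIX threads provide this readers-writer pattern directly; a minimal sketch (error checking omitted, variable names illustrative):

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_value;

int reader(void)
{
    pthread_rwlock_rdlock(&rw);      /* many readers may hold this at once */
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void writer(int v)
{
    pthread_rwlock_wrlock(&rw);      /* exclusive: waits for readers to leave */
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}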

[Figure: WriteQ and ReadQ feed the Waiting-room, which leads into the Access Room]

18

Shared memory programming - tools

  • OpenMP
  • POSIX threads

19

OpenMP

  • A portable fork-join parallel model for architectures with shared memory
  • Portable
    – Fortran 77 and C/C++ bindings
    – Many implementations, all work in the same way (in theory at least)
  • Fork-join model
    – Execution starts with one thread
    – Parallel regions fork new threads at entry
    – The threads join at the end of the region
  • Shared memory
    – (Almost all) memory is accessible by all threads

20

OpenMP

  • Two kinds of parallelism:
    – Coarse-grained (task parallelism)
      • Split the program into segments (threads) that can be executed in parallel
      • Implicit join at the end of the segment, or explicit synchronization points (like barriers)
      • E.g.: let two threads call different subroutines in parallel
    – Fine-grained (loop parallelism)
      • Execute independent iterations of DO-loops in parallel
      • Several choices of splitting
  • Data environments for both kinds:
    – Shared data
    – Private data

22

Design of OpenMP

“A flexible standard, easily implemented across different platforms”

  • Control structures

– Minimalistic, for simplicity – PARALLEL, DO (for), SECTIONS, SINGLE, MASTER

  • Data environments

– New access possibilities for forked threads – SHARED, PRIVATE, REDUCTION

23

Design of OpenMP II

  • Synchronization

– Simple implicit synchronization at the end of control structures – Explicit synchronization for more complex patterns: BARRIER, CRITICAL, ATOMIC, ORDERED, FLUSH – Lock subroutines for fine-grained control

  • Runtime environment and library

– Manages preferences for forking and scheduling – E.g.: OMP_GET_THREAD_NUM(), OMP_SET_NUM_THREADS(intexpr)

  • OpenMP may (in principle) be added to any computer language

– In reality, Fortran and C (C++)

  • OpenMP is mainly a collection of directives to the compiler

– In Fortran: structured comments interpreted by the compiler
  C$OMP directive [clause[,],clause[,],..]
– In C: pragmas sending information to the compiler:
  #pragma omp directive [clause[,],clause[,],..]

24

Control structures I

  • PARALLEL / END PARALLEL

– Fork and join
– Number of threads does not change within the region
– SPMD execution within the region

  • SINGLE / END SINGLE

– (Short) sequential section within a parallel region

  • MASTER / END MASTER

– SINGLE on master processor (often 0)

[Figure: all threads run S1 and S3; only one thread runs S2 inside the SINGLE section of the parallel region]

C$OMP PARALLEL
      CALL S1()
C$OMP SINGLE
      CALL S2()
C$OMP END SINGLE
      CALL S3()
C$OMP END PARALLEL

#pragma omp parallel
{
    S1();
    #pragma omp single
    {
        S2();
    }
    S3();
}


25

Control structures II

  • DO / END DO

– The classical parallel loop – The iteration space is split between threads

  • Static, dynamic, guided

– Loop index is normally private for the threads

  • More about other variables later

C$OMP PARALLEL
C$OMP DO
      DO J = 1, 12
         CALL FOO(J)
      END DO
C$OMP END DO
C$OMP END PARALLEL
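A C equivalent, with an explicit schedule clause as an illustration of the static/dynamic/guided choices above (the function foo is assumed to exist):

#include <omp.h>

void foo(int j);

void loop(void)
{
    /* The 12 iterations are split between the threads of the team. */
    #pragma omp parallel
    {
        #pragma omp for schedule(static)   /* or dynamic, guided */
        for (int j = 1; j <= 12; j++)
            foo(j);
    }
}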

[Figure: the iterations of the loop are split between the threads inside the parallel region]

26

Control structures III

  • SECTIONS / END SECTIONS

– Task parallelism

  • SECTION marks tasks

– Within a parallel region

  • Nested parallelism

– Demands a new parallel region (i.e. a new PARALLEL directive)
– Not all OpenMP implementations support this

  • If there is no support, the inner PARALLEL has no effect
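A small C sketch of task parallelism with sections (the subroutine names are illustrative):

#include <omp.h>

void subroutine_a(void);
void subroutine_b(void);

void tasks(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        subroutine_a();          /* one thread runs this task */

        #pragma omp section
        subroutine_b();          /* another thread runs this, in parallel */
    }                            /* implicit join at END SECTIONS */
}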

[Figure: nested parallel regions, each with its own fork and join]

27

OpenMP Data Environments

  • Shared memory may be PRIVATE or SHARED

– Declare this in OpenMP directives for parallelism

  • Choices

– SHARED – Reachable by all threads in the ”team”

  • The normal case
  • DEFAULT can change this

– PRIVATE – Each thread has its own copy

  • Normally, private data are not copied in to/out from parallel regions and are never touched by other threads
  • FIRSTPRIVATE – initializes each private copy with the value from before the region
  • LASTPRIVATE – copies the value from the (sequentially) last iteration back out
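A hedged sketch of the data-scoping clauses in C (the variable names are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int n = 100, last = 0, offset = 10;
    double a[100];

    /* a and n are shared; j is private; offset is firstprivate (each thread
       starts with the value 10); last is lastprivate (the value from the
       sequentially last iteration, j = 99, survives the region). */
    #pragma omp parallel for shared(a, n) firstprivate(offset) lastprivate(last)
    for (int j = 0; j < n; j++) {
        a[j] = j + offset;
        last = j;
    }
    printf("last = %d\n", last);    /* prints 99 */
    return 0;
}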

28

Guidelines for Classification of Variables
  • Generally, ”big things” are SHARED

– Main matrices, the ones taking all the space – Automatic

  • Local variables in subprograms are PRIVATE variables

– Automatic

  • Small temporaries are normally PRIVATE

– Often you need one copy of loop temporaries for each thread – Automatic, for iteration variables

29

OpenMP Data Environments II

  • REDUCTION-variables in DO (for) constructions

– A local phase followed by a global phase. Initialization is handled "as expected"

  • Fortran REDUCTION ops/fcns: +, *, -, .AND., .OR., .EQV., .NEQV., MAX, MIN, IAND, IOR, IEOR
  • C reduction operators: +, *, -, &, |, ^, &&, ||
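A minimal reduction sketch in C (the array and its size are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double a[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++)
        a[i] = 1.0;

    /* Local phase: each thread accumulates into a private copy of sum.
       Global phase: the private copies are combined with + at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += a[i];

    printf("sum = %f\n", sum);   /* 1000.0 */
    return 0;
}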


31

OpenMP Synchronization

  • Implicit barriers wait for all threads in the "team" at the end of each construction

– DO (for), SECTIONS, SINGLE, MASTER
– NOWAIT at END can override the synchronization

  • Explicit directives for finer control

– BARRIER – Waits for all threads in the "team"
– CRITICAL (name), END CRITICAL – Only one thread at a time
– ATOMIC – Single-statement critical section for reduction
– ORDERED – Used to order (the start of) loop iterations
– Lock routines to get access control
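A small sketch contrasting ATOMIC and CRITICAL in C (the histogram and max_seen are illustrative):

#include <omp.h>

void update(int *histogram, int nbins, const int *data, int n)
{
    int max_seen = 0;    /* shared, declared before the parallel region */

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        /* ATOMIC: a single-statement update, cheap */
        #pragma omp atomic
        histogram[data[i] % nbins]++;

        /* CRITICAL: an arbitrary block, only one thread at a time */
        #pragma omp critical
        {
            if (data[i] > max_seen)
                max_seen = data[i];
        }
    }
}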

32

OpenMP summary

  • Based on fork-join parallelism in shared memory

– Threads start at the beginning of a parallel region, "return" at the end
– Maps well to some hardware
– Linked to traditional languages

  • For more information:

– http://www.openmp.org/

33

POSIX threads (IEEE standard)

It is cheaper to create threads than to fork new processes

  • Calls to create and remove threads
  • Calls to synchronize threads and to lock resources

  • Calls to handle thread scheduling
  • etc

34

Pthread: Basic Routines

Function: pthread_create()

int pthread_create(pthread_t *thread,
                   const pthread_attr_t *attr,
                   void *(*start_routine)(void *),
                   void *arg);

  • pthread_create() creates a new thread within a process
  • The thread starts in the routine start_routine, which takes one argument, arg
  • attr specifies attributes, or default attributes if NULL (see Ass. 1)
  • If the pthread_create routine is successful, 0 is returned and the new thread's ID is stored in thread, else an error code is returned

35

Ending a Thread

Function: pthread_exit()

void pthread_exit(void * status);

  • Ends the currently executing thread and makes status available to the thread that joins the terminating thread with pthread_join()
  • Another way to end: return
  • Threads that are not joined are called "detached". Can be used if you do not explicitly have to join them: more efficient!

36

Hello World ... or World Hello?

#include <stdio.h>
#include <pthread.h>

void *print_message_function(void *ptr);

int main(void)
{
    pthread_t thread1, thread2;
    char *message1 = "Hello", *message2 = "World";
    void *status1, *status2;

    pthread_create(&thread1, NULL, print_message_function, (void *) message1);
    pthread_create(&thread2, NULL, print_message_function, (void *) message2);
    pthread_join(thread1, &status1);
    pthread_join(thread2, &status2);
    return 0;
}

void *print_message_function(void *ptr)
{
    char *message = (char *) ptr;
    printf("%s ", message);
    return NULL;
}


37

Joining threads

Function: pthread_join()

int pthread_join(pthread_t thread, void ** status);

  • Waits for the target thread (thread) to finish, if it is not detached

38

Identifying and Comparing Threads

Function: pthread_self() and pthread_equal()

pthread_t pthread_self(void);

  • returns the ID of the calling thread

int pthread_equal(pthread_t thread_1, pthread_t thread_2);

  • compares the IDs of thread_1 and thread_2 and returns a nonzero value if the IDs represent the same thread

39

Pthread: Synchronization routines

Function: pthread_mutex_init()

int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *attr);

Creates a new mutex (mutually exclusive lock), where attr specifies attributes, or default attributes if attr is NULL.

int pthread_mutex_destroy(pthread_mutex_t *mutex);

int pthread_mutex_lock(pthread_mutex_t *mutex);

Locks the mutex mutex. If mutex is already locked, the calling thread is blocked until mutex is available.

int pthread_mutex_trylock(pthread_mutex_t *mutex);

int pthread_mutex_unlock(pthread_mutex_t *mutex);

Lock and unlock delimit a "critical section" (one thread only at one time).
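A minimal usage sketch protecting a shared counter (illustrative names, error checking omitted):

#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
long counter = 0;

void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* critical section: one thread at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}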

40

Condition Variables

  • Condition variables are associated with a specific mutex – can block a thread until a signal is given from another thread
  • Function: pthread_cond_init() creates a new condition variable:

int pthread_cond_init(pthread_cond_t *cond, const pthread_condattr_t *attr);

int pthread_cond_signal(pthread_cond_t *cond);

int pthread_cond_destroy(pthread_cond_t *cond);

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
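A hedged sketch of the usual wait/signal pattern (the flag ready and the function names are illustrative):

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready = 0;

void wait_until_ready(void)
{
    pthread_mutex_lock(&m);
    while (!ready)                     /* re-check: wakeups may be spurious */
        pthread_cond_wait(&cond, &m);  /* atomically unlocks m and blocks */
    pthread_mutex_unlock(&m);
}

void set_ready(void)
{
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&cond);        /* wake one waiting thread */
    pthread_mutex_unlock(&m);
}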

41

Pthreads - general

#include <pthread.h> in the program
Link with -lpthread
Always check error codes!
More info: Getting Started With POSIX Threads, man pages

42

Pros’n’cons

  • Portability

– OpenMP: good C/FORTRAN compilers
– Pthreads: a library. Now also with FORTRAN bindings

  • Functionality

– OpenMP: perfect for HPC applications!
– Pthreads: a little more general, but some functionality is missing (e.g.: barriers)

  • Performance

– OpenMP is better than pthreads for fine-grained problems. Does not create parallelism "unnecessarily"
– Later versions of pthreads are efficient as well

  • Standard

– OpenMP relatively new as a standard – Pthreads are stable


43

Cache-aspects

44

Why caches?

  • Cache memories serve to:

– Increase the bandwidth to memory and at the same time reduce the load on the memory bus
– Shorten the latency
– Good for both shared and private data

  • But how is the data kept consistent?

[Figure: processors P, each with a private cache $, connected by a memory bus to the memory]

45

The Cache Coherence Problem

Splitting may result in several copies of the same, shared, block in one or more cache memories at the same time. To give a consistent view of the memory, the copies must be kept coherent. Different solutions exist, ranging from hardware to software.

46

The Cache Coherence Problem

Thanks to Erik Hagersten, TDB in Uppsala, for the pictures

47

Coherent?

  • Informally

– A read must return the latest write – Too strict, and too hard to implement

  • Better

– A write must, in the end, be seen by a read – All writes must be seen in order (serialization)

  • Two rules
    1. If P1 writes x and P2 reads x, P1's write will be seen if the read and write are far enough apart
    2. Writes to an address are serialized
  • The latest write is seen; otherwise you could see old values after new ones

48

Hardware based protocols

Guarantee memory coherency without software involvement

  • Snoopy cache protocol (”snooping”)
  • Directory schemes
  • Cache-coherent network architecture

– data is split into blocks of the same size
– the block is the unit that is sent between memory and cache
– arbitrarily many copies of a block are allowed to exist at the same time
– updates within blocks are "visible" to all


49

Cache coherence policies

  • Write Invalidate:

– reading is done locally if the block is in the cache
– during an update, all other copies are invalidated
– additional updates can be made since all other copies are invalid

  • Write Update:

– updates all copies instead of invalidation (broadcast)

50

Consistency Commands

invalidation & update commands

  • Bus-based architectures

– Broadcast all consistency commands ⇒ all the caches have to process all messages
– Each cache snoops the network: "snoopy cache protocol"

  • Switching network-based architectures

– Sends consistency commands only to the caches with a copy – Requires book-keeping, directory schemes

51

A simple c.c.-protocol

Bus commands:

2. A) Send a request to the owner to return data to the memory (BRB)
   B) Read data from memory
3. Bus Invalidate (BInv)

52

False Sharing

  • Different parts of the same block (cache line) are used by different processors
  • If a processor writes to one part of the block, then all copies (of the whole block) have to be either updated or invalidated
  • Even though the data is not really shared
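A hedged C sketch of the effect (the padding size assumes a 64-byte cache line; thread code and names are illustrative):

#include <pthread.h>

#define NTHREADS 4

/* Each counter sits in its own cache line, so writes by different threads
   do not invalidate each other's copies. Without the padding, all counters
   would share one block and every increment would cause coherence traffic
   (false sharing), even though no data is actually shared. */
struct padded_counter {
    long value;
    char pad[64 - sizeof(long)];
};

struct padded_counter counters[NTHREADS];

void *work(void *arg)
{
    int id = *(int *) arg;
    for (long i = 0; i < 10000000; i++)
        counters[id].value++;          /* private line, no false sharing */
    return NULL;
}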

53

Review questions

  • Briefly describe the structure of a shared-memory parallel machine!
  • What is the cache coherence problem and how is it solved?
  • What is "fork-join"?

54

Exam question 040116, 4a

  • The Frobenius norm of an m x n matrix A is defined as

    \|A\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} a_{i,j}^2 \right)^{1/2}

  • Implement a subroutine that computes the Frobenius norm of A in parallel on a shared-memory machine using OpenMP or pthreads!
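One possible OpenMP answer in C (a minimal sketch; the matrix is assumed to be stored row-major in a[], and a pthreads solution is equally acceptable):

#include <math.h>
#include <omp.h>

/* Frobenius norm of an m x n matrix stored row-major in a[]. */
double frobenius_norm(const double *a, int m, int n)
{
    double sum = 0.0;

    /* Each thread sums its share of the elements into a private copy of
       sum; the copies are combined with + when the loop ends. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < m * n; i++)
        sum += a[i] * a[i];

    return sqrt(sum);
}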