Lecture 7: Shared memory programming
David Bindel
20 Sep 2011
Logistics
◮ Still have a couple people looking for groups – help out?
◮ Check out this week’s CS colloquium:
“Trumping the Multicore Memory Hierarchy with Hi-Spade”
Monte Carlo
Basic idea: Express the answer a as a = E[f(X)] for some random variable(s) X. Typical toy example: π/4 = E[χ_{[0,1]}(X² + Y²)] where X, Y ∼ U(−1, 1). We’ll be slightly more interesting...
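As a warm-up, here is a minimal serial C sketch of the π/4 example (the loop count and variable names are my own, not from the lecture):

#include <stdio.h>
#include <stdlib.h>

int main()
{
    long hits = 0;
    long ntrials = 1000000;
    for (long i = 0; i < ntrials; ++i) {
        double x = 2.0*rand()/RAND_MAX - 1.0;  /* X ~ U(-1,1) */
        double y = 2.0*rand()/RAND_MAX - 1.0;  /* Y ~ U(-1,1) */
        if (x*x + y*y <= 1.0)  /* indicator: is (x,y) in the unit disc? */
            ++hits;
    }
    printf("pi estimate: %g\n", 4.0*(double) hits/ntrials);
    return 0;
}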
A toy problem
Given ten points (Xᵢ, Yᵢ) drawn uniformly in [0, 1]², what is the expected minimum distance between any pair?
Toy problem: Version 1
Serial version:

sum_fX = 0;
for i = 1:ntrials
  x = rand(10,2);
  fX = min distance between points in x;
  sum_fX = sum_fX + fX;
end
result = sum_fX/ntrials;

Parallel version: run twice and average results?! No communication needed (embarrassingly parallel). Need to worry a bit about rand...
Error estimators
Central limit theorem: if R is the computed result, then R ∼ N(E[f(X)], σ/√n), where σ is the standard deviation of f(X).
So:
◮ Compute the sample standard deviation σ̂
◮ Error bars are ±σ̂/√n (see the sketch below)
◮ Use error bars to monitor convergence
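In code, the error bar is cheap to maintain from running sums; a minimal C helper of my own (equivalent, up to the n vs. n−1 normalization, to the errbar expression in the code below):

#include <math.h>

/* Error bar sigma_hat/sqrt(n), given running sums of f(X) and f(X)^2 */
double error_bar(double sum_fX, double sum_fX2, long n)
{
    double var = (sum_fX2 - sum_fX*sum_fX/n) / (n - 1);  /* sample variance */
    return sqrt(var / n);
}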
Toy problem: Version 2
Serial version:

sum_fX = 0; sum_fX2 = 0;
for i = 1:ntrials
  x = rand(10,2);
  fX = min distance between points in x;
  sum_fX = sum_fX + fX;
  sum_fX2 = sum_fX2 + fX*fX;
  result = sum_fX/i;
  errbar = sqrt(sum_fX2-sum_fX*sum_fX/i)/i;
  if (abs(errbar/result) < reltol), break; end
end
result = sum_fX/i;

Parallel version: ?
Pondering parallelism
Two major points:
◮ How should we handle random number generation?
◮ How should we manage termination criteria?
Some additional points (briefly):
◮ How quickly can we compute fX?
◮ Can we accelerate convergence (variance reduction)?
Pseudo-random number generation
◮ Pretend a deterministic process is random.
  =⇒ We lose if it doesn’t look random!
◮ RNG functions have state
  =⇒ Basic random() call is not thread-safe!
◮ Parallel strategies:
  ◮ Put RNG in critical section (slow)
  ◮ Run independent RNGs per thread
    ◮ Concern: correlation between streams
  ◮ Split stream from one RNG
    ◮ E.g. thread 0 uses even steps, thread 1 uses odd steps
    ◮ Helpful if it’s cheap to skip steps!
◮ Good libraries help! Mersenne twister, SPRNG, ...?
One solution
◮ Use a version of Mersenne twister with no global state:
void sgenrand(long seed, struct mt19937p* mt);
double genrand(struct mt19937p* mt);
◮ Choose pseudo-random seeds per thread at startup:
long seeds[NTHREADS];
srandom(clock());
for (i = 0; i < NTHREADS; ++i)
    seeds[i] = random();
...
/* sgenrand(seeds[i], mt) for thread i */
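Each thread then owns its own generator state, so no locking is needed for random draws. A minimal sketch assuming the mt19937p interface above (the header name and thread body are illustrative):

#include "mt19937p.h"  /* assumed header declaring the interface above */

void* thread_body(void* arg)
{
    long seed = *(long*) arg;   /* this thread's entry from seeds[] */
    struct mt19937p mt;
    sgenrand(seed, &mt);        /* private state: no locks needed */
    /* ... run this thread's trials, drawing with genrand(&mt) ... */
    return NULL;
}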
Toy problem: Version 2.1p
sum_fX = 0; sum_fX2 = 0; n = 0;
for each thread in parallel
  do
    fX = result of one random trial
    ++n;
    sum_fX += fX;
    sum_fX2 += fX*fX;
    errbar = ...
    if (abs(errbar/result) < reltol), break; end
  loop
end
result = sum_fX/n;
Toy problem: Version 2.2p
sum_fX = 0; sum_fX2 = 0; n = 0; done = false;
for each thread in parallel
  do
    fX = result of one random trial
    get lock
    ++n;
    sum_fX = sum_fX + fX;
    sum_fX2 = sum_fX2 + fX*fX;
    errbar = ...
    if (abs(errbar/result) < reltol)
      done = true;
    end
    release lock
  until done
end
result = sum_fX/n;
Toy problem: Version 2.3p
sum_fX = 0; sum_fX2 = 0; n = 0; done = false;
for each thread in parallel
  do
    batch_sum_fX, batch_sum_fX2 = B trials
    get lock
    n += B;
    sum_fX += batch_sum_fX;
    sum_fX2 += batch_sum_fX2;
    errbar = ...
    if (abs(errbar/result) < reltol)
      done = true;
    end
    release lock
  until done or n > n_max
end
result = sum_fX/n;
Toy problem: actual code (pthreads)
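The posted lecture code is definitive; below is a condensed sketch of my own of the Version 2.3p logic in pthreads, where run_trial, RELTOL, and the batch size B are illustrative stand-ins:

#include <math.h>
#include <pthread.h>

#define B 100           /* batch size (illustrative) */
#define RELTOL 1e-3     /* relative tolerance (illustrative) */

double run_trial(void);  /* assumed: one random trial, returns fX */

double sum_fX = 0, sum_fX2 = 0;
long n = 0;
int done = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void* worker(void* arg)
{
    int mydone = 0;
    while (!mydone) {
        double bsum = 0, bsum2 = 0;
        for (int i = 0; i < B; ++i) {   /* B trials outside the lock */
            double fX = run_trial();
            bsum += fX;
            bsum2 += fX*fX;
        }
        pthread_mutex_lock(&lock);
        n += B;
        sum_fX += bsum;
        sum_fX2 += bsum2;
        double errbar = sqrt(sum_fX2 - sum_fX*sum_fX/n)/n;
        if (fabs(errbar*n/sum_fX) < RELTOL)  /* |errbar/result| test */
            done = 1;
        mydone = done;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}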
Some loose ends
◮ Alternative: “master-slave” organization
◮ Master sends out batches of work to slaves
◮ Example: SETI at Home, Folding at Home, ...
◮ What is the right batch size?
◮ Large B =⇒ amortize locking/communication overhead
  (and variance actually helps with contention!)
◮ Small B avoids too much extra work
◮ How to evaluate f(X)?
◮ For p points, obvious algorithm is O(p²)
◮ Binning points better? No gain for p small...
◮ Is f(X) the right thing to evaluate?
◮ Maybe E[g(X)] = E[f(X)] but Var[g(X)] ≪ Var[f(X)]?
◮ May make much more difference than parallelism!
The problem with pthreads revisited
pthreads can be painful!
◮ Makes code verbose
◮ Synchronization is hard to think about
Would like to make this more automatic!
◮ ... and have been trying for a couple decades.
◮ OpenMP gets us part of the way
OpenMP: Open spec for MultiProcessing
◮ Standard API for multi-threaded code
◮ Only a spec: multiple implementations
◮ Lightweight syntax
◮ C or Fortran (with appropriate compiler support)
◮ High level:
◮ Preprocessor/compiler directives (80%)
◮ Library calls (19%)
◮ Environment variables (1%)
Parallel “hello world”
#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    printf("Hello world from %d\n",
           omp_get_thread_num());
    return 0;
}
Parallel sections
[Figure: fork-join diagram – threads share s = [1, 2, 3, 4] while each holds a private i = 0, 1, 2, 3]
◮ Basic model: fork-join
◮ Each thread runs same code block
◮ Annotations distinguish shared (s) and private (i) data
◮ Relaxed consistency for shared data
Parallel sections
...
double s[MAX_THREADS];
int i;
#pragma omp parallel shared(s) private(i)
{
    i = omp_get_thread_num();
    s[i] = i;
}
...
Critical sections
[Figure: two threads serialize through lock/unlock pairs on a critical section]
◮ Automatically lock/unlock at ends of critical section
◮ Automatic memory flushes for consistency
◮ Locks are still there if you really need them...
Critical sections
#pragma omp parallel
{
    ...
    #pragma omp critical(my_data_cs)
    {
        ... modify data structure here ...
    }
}
Barriers
[Figure: two threads repeatedly synchronize at barriers between steps]
#pragma omp parallel
for (i = 0; i < nsteps; ++i) {
    do_stuff;
    #pragma omp barrier
}
Toy problem: actual code (OpenMP)
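Again, the posted code is the real version; here is a minimal OpenMP sketch of my own of the same batched loop, with run_trial, RELTOL, and B as illustrative stand-ins as before:

#include <math.h>
#include <omp.h>

#define B 100           /* batch size (illustrative) */
#define RELTOL 1e-3     /* relative tolerance (illustrative) */

double run_trial(void);  /* assumed: one random trial, returns fX */

double monte_carlo(void)
{
    double sum_fX = 0, sum_fX2 = 0;
    long n = 0;
    int done = 0;
    #pragma omp parallel
    while (!done) {  /* a stale read of done costs at most one extra batch */
        double bsum = 0, bsum2 = 0;
        for (int i = 0; i < B; ++i) {  /* B trials, no synchronization */
            double fX = run_trial();
            bsum += fX;
            bsum2 += fX*fX;
        }
        #pragma omp critical
        {   /* implicit flush at critical entry/exit keeps done current */
            n += B;
            sum_fX += bsum;
            sum_fX2 += bsum2;
            double errbar = sqrt(sum_fX2 - sum_fX*sum_fX/n)/n;
            if (fabs(errbar*n/sum_fX) < RELTOL)
                done = 1;
        }
    }
    return sum_fX/n;
}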
A practical aside...
◮ GCC 4.3+ has OpenMP support by default
◮ Earlier versions may support it (e.g. latest Xcode gcc-4.2)
◮ GCC 4.4 (prerelease) for my laptop has buggy support!
◮ -O3 -fopenmp == death of an afternoon
◮ Need -fopenmp for both compile and link lines
gcc -c -fopenmp foo.c
gcc -fopenmp -o mycode.x foo.o
Parallel loops
[Figure: parallel for partitions the index space – e.g. thread 0 gets i = 0, 1, 2, ..., thread 1 gets i = 10, 11, 12, ..., thread 2 gets i = 20, 21, 22, ...]
◮ Independent loop body? At least order doesn’t matter¹.
◮ Partition index space among threads
◮ Implicit barrier at end (except with nowait)

¹If order matters, there’s an ordered modifier.
Parallel loops
/* Compute dot of x and y of length n */
int i, tid;
double my_dot, dot = 0;
#pragma omp parallel \
        shared(dot,x,y,n) \
        private(i,my_dot)
{
    tid = omp_get_thread_num();
    my_dot = 0;
    #pragma omp for
    for (i = 0; i < n; ++i)
        my_dot += x[i]*y[i];
    #pragma omp critical
    dot += my_dot;
}
Parallel loops
/* Compute dot of x and y of length n */
int i, tid;
double dot = 0;
#pragma omp parallel \
        shared(x,y,n) \
        private(i) \
        reduction(+:dot)
{
    #pragma omp for
    for (i = 0; i < n; ++i)
        dot += x[i]*y[i];
}
Parallel loop scheduling
Partition index space different ways:
◮ static[(chunk)]: decide at start of loop; default chunk is n/nthreads. Lowest overhead, most potential load imbalance.
◮ dynamic[(chunk)]: each thread takes chunk iterations when it has time; default chunk is 1. Higher overhead, but automatically balances load.
◮ guided: take chunks of size unassigned iterations/threads; chunks get smaller toward end of loop. Somewhere between static and dynamic.
◮ auto: up to the system!
Default behavior is implementation-dependent.
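For example, a one-line sketch requesting dynamic scheduling (the chunk size and loop body names are illustrative):

/* Threads grab 16 iterations at a time as they become free */
#pragma omp parallel for schedule(dynamic, 16)
for (i = 0; i < n; ++i)
    y[i] = f(x[i]);  /* assume uneven cost per iteration */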
Other parallel work divisions
◮ single: do only in one thread (e.g. I/O)
◮ master: do only in the master thread; others skip
◮ sections: like cobegin/coend (see the sketch below)
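A minimal sketch of sections and single (the task functions are illustrative):

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        do_task_a();   /* one thread runs this... */
        #pragma omp section
        do_task_b();   /* ...possibly another runs this, concurrently */
    }
    #pragma omp single
    printf("Both tasks done\n");  /* exactly one thread does the I/O */
}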
Essential complexity?
Fred Brooks (The Mythical Man-Month) identified two types of software complexity: essential and accidental. Does OpenMP address accidental complexity? Yes, somewhat! Essential complexity is harder.
Things to still think about with OpenMP
◮ Proper serial performance tuning?
◮ Minimizing false sharing?
◮ Minimizing synchronization overhead?
◮ Minimizing loop scheduling overhead?
◮ Load balancing?
◮ Finding enough parallelism in the first place?
Let’s focus again on memory issues...
Memory model
◮ Single processor: return last write
◮ What about DMA and memory-mapped I/O?
◮ Simplest generalization: sequential consistency – as if
  ◮ Each process runs in program order
  ◮ Instructions from different processes are interleaved
  ◮ Interleaved instructions ran on one processor
Sequential consistency
“A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” – Lamport, 1979
Example: Spin lock
Initially, flag = 0 and sum = 0

Processor 1:
    sum += p1;
    flag = 1;

Processor 2:
    while (!flag);
    sum += p2;
Example: Spin lock
Initially, flag = 0 and sum = 0

Processor 1:
    sum += p1;
    flag = 1;

Processor 2:
    while (!flag);
    sum += p2;

Without sequential consistency support, what if
1. Processor 2 caches flag?
2. Compiler optimizes away the loop?
3. Compiler reorders assignments on P1?
Starts to look restrictive!
Sequential consistency: the good, the bad, the ugly
Program behavior is “intuitive”:
◮ Nobody sees garbage values
◮ Time always moves forward
One issue is cache coherence:
◮ Coherence: different copies, same value
◮ Requires (nontrivial) hardware support
Also an issue for optimizing compilers! There are cheaper relaxed consistency models.
Snoopy bus protocol
Basic idea:
◮ Broadcast operations on memory bus
◮ Cache controllers “snoop” on all bus transactions
  ◮ Memory writes induce serial order
  ◮ Act to enforce coherence (invalidate, update, etc.)
Problems:
◮ Bus bandwidth limits scaling
◮ Contending writes are slow
There are other protocol options (e.g. directory-based), but these usually give up on full sequential consistency.
Weakening sequential consistency
Try to reduce to the true cost of sharing
◮ volatile tells compiler when to worry about sharing
◮ Memory fences tell when to force consistency (sketch below)
◮ Synchronization primitives (lock/unlock) include fences
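For instance, the spin-lock example from before could be patched by hand; a sketch using GCC’s full-fence builtin (one possible fix, not the lecture’s code):

volatile int flag = 0;  /* volatile: compiler must not cache or drop reads */

/* Processor 1 */
sum += p1;
__sync_synchronize();   /* fence: make the write to sum visible first */
flag = 1;

/* Processor 2 */
while (!flag);          /* volatile keeps the loop from being optimized away */
__sync_synchronize();   /* fence: do not hoist the read of sum above the loop */
sum += p2;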
Sharing
True sharing:
◮ Frequent writes cause a bottleneck.
◮ Idea: make independent copies (if possible).
◮ Example problem: malloc/free data structure.
False sharing:
◮ Distinct variables on same cache block
◮ Idea: make processor memory contiguous (if possible)
◮ Example problem: array of ints, one per processor (see the padding sketch below)
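A standard fix for the array-of-ints example is to pad each entry to its own cache line; a sketch assuming 64-byte lines (the struct name and NTHREADS are illustrative):

/* One counter per thread, padded so no two share a cache block */
struct padded_count {
    int value;
    char pad[64 - sizeof(int)];  /* assume 64-byte cache lines */
};
struct padded_count counts[NTHREADS];  /* counts[i].value for thread i */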
Take-home message
◮ Sequentially consistent shared memory is a useful idea...
  ◮ “Natural” analogue to serial case
  ◮ Architects work hard to support it
◮ ... but implementation is costly!
  ◮ Makes life hard for optimizing compilers
  ◮ Coherence traffic slows things down
  ◮ Helps to limit sharing