

Slide 1

Lecture 7: Shared memory programming

David Bindel 20 Sep 2011

Slide 2

Logistics

◮ Still have a couple people looking for groups – help out?
◮ Check out this week’s CS colloquium:

“Trumping the Multicore Memory Hierarchy with Hi-Spade”

Slide 3

Monte Carlo

Basic idea: Express the answer a as a = E[f(X)] for some random variable(s) X.

Typical toy example: π/4 = E[χ_{[0,1]}(X² + Y²)], where X, Y ∼ U(−1, 1).

We’ll be slightly more interesting...

Slide 4

A toy problem

Given ten points (X_i, Y_i) drawn uniformly in [0, 1]², what is the expected minimum distance between any pair?

Slide 5

Toy problem: Version 1

Serial version:

    sum_fX = 0;
    for i = 1:ntrials
      x = rand(10,2);
      fX = min distance between points in x;
      sum_fX = sum_fX + fX;
    end
    result = sum_fX/ntrials;

Parallel version: run twice and average results?!

◮ No communication — embarrassingly parallel
◮ Need to worry a bit about rand...
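For concreteness, here is a minimal C sketch of a single trial; the function name min_dist_trial, the use of rand_r, and the hard-coded 10 points are illustrative choices (the slides use MATLAB-style pseudocode and, later, a Mersenne twister), not the course code:

    #include <math.h>
    #include <stdlib.h>

    /* One trial: drop 10 uniform points in the unit square and return the
     * minimum pairwise distance.  rand_r is used so the sketch stays
     * thread-safe once we parallelize; any pairwise distance is < sqrt(2). */
    double min_dist_trial(unsigned* seed)
    {
        double x[10], y[10], dmin = 2.0;
        for (int i = 0; i < 10; ++i) {
            x[i] = rand_r(seed) / (RAND_MAX + 1.0);
            y[i] = rand_r(seed) / (RAND_MAX + 1.0);
        }
        for (int i = 0; i < 10; ++i)
            for (int j = i+1; j < 10; ++j) {
                double dx = x[i] - x[j], dy = y[i] - y[j];
                double d  = sqrt(dx*dx + dy*dy);
                if (d < dmin) dmin = d;
            }
        return dmin;
    }

Averaging ntrials calls to this function (each caller with its own seed) reproduces the serial loop above.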

Slide 6

Error estimators

Central limit theorem: if R is the computed result, then

    R ∼ N(E[f(X)], σ_{f(X)}/√n).

So:

◮ Compute the sample standard deviation σ̂_{f(X)}
◮ Error bars are ± σ̂_{f(X)}/√n
◮ Use error bars to monitor convergence

slide-7
SLIDE 7

Toy problem: Version 2

Serial version:

    sum_fX = 0;
    sum_fX2 = 0;
    for i = 1:ntrials
      x = rand(10,2);
      fX = min distance between points in x;
      sum_fX = sum_fX + fX;
      sum_fX2 = sum_fX2 + fX*fX;
      result = sum_fX/i;
      errbar = sqrt(sum_fX2-sum_fX*sum_fX/i)/i;
      if (abs(errbar/result) < reltol), break; end
    end

Parallel version: ?

slide-8
SLIDE 8

Pondering parallelism

Two major points:

◮ How should we handle random number generation?
◮ How should we manage termination criteria?

Some additional points (briefly):

◮ How quickly can we compute f(X)?
◮ Can we accelerate convergence (variance reduction)?

slide-9
SLIDE 9

Pseudo-random number generation

◮ Pretend a deterministic process is random
  ⇒ We lose if it doesn’t look random!

◮ RNG functions have state
  ⇒ Basic random() call is not thread-safe!

◮ Parallel strategies:
  ◮ Put RNG in critical section (slow)
  ◮ Run independent RNGs per thread
    ◮ Concern: correlation between streams
  ◮ Split stream from one RNG
    ◮ E.g. thread 0 uses even steps, thread 1 uses odd steps
    ◮ Helpful if it’s cheap to skip steps!

◮ Good libraries help! Mersenne twister, SPRNG, ...?

slide-10
SLIDE 10

One solution

◮ Use a version of Mersenne twister with no global state:

    void sgenrand(long seed, struct mt19937p* mt);
    double genrand(struct mt19937p* mt);

◮ Choose pseudo-random seeds per thread at startup:

    long seeds[NTHREADS];
    srandom(clock());
    for (i = 0; i < NTHREADS; ++i)
        seeds[i] = random();
    ...
    /* sgenrand(seeds[i], mt) for thread i */
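Putting the two pieces together, a minimal sketch of the per-thread setup; only the sgenrand/genrand prototypes come from the slide, while the header name mt19937p.h and the thread_work framing are assumptions:

    #include <stdlib.h>
    #include <time.h>
    #include "mt19937p.h"   /* assumed header for struct mt19937p, sgenrand, genrand */

    #define NTHREADS 4
    static long seeds[NTHREADS];

    /* Serial setup: pick one pseudo-random seed per thread. */
    void seed_threads(void)
    {
        srandom(clock());
        for (int i = 0; i < NTHREADS; ++i)
            seeds[i] = random();
    }

    /* Run on thread tid: the generator lives on the thread's own stack,
     * so there is no shared RNG state at all. */
    void thread_work(int tid)
    {
        struct mt19937p mt;
        sgenrand(seeds[tid], &mt);
        double u = genrand(&mt);   /* one draw; use it for a trial */
        (void) u;
    }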

slide-11
SLIDE 11

Toy problem: Version 2.1p

    sum_fX = 0; sum_fX2 = 0; n = 0;
    for each thread in parallel
      do
        fX = result of one random trial
        ++n;
        sum_fX += fX;
        sum_fX2 += fX*fX;
        errbar = ...
        if (abs(errbar/result) < reltol), break; end
      loop
    end
    result = sum_fX/n;

slide-12
SLIDE 12

Toy problem: Version 2.2p

    sum_fX = 0; sum_fX2 = 0; n = 0; done = false;
    for each thread in parallel
      do
        fX = result of one random trial
        get lock
        ++n;
        sum_fX = sum_fX + fX;
        sum_fX2 = sum_fX2 + fX*fX;
        errbar = ...
        if (abs(errbar/result) < reltol)
          done = true;
        end
        release lock
      until done
    end
    result = sum_fX/n;

slide-13
SLIDE 13

Toy problem: Version 2.3p

    sum_fX = 0; sum_fX2 = 0; n = 0; done = false;
    for each thread in parallel
      do
        batch_sum_fX, batch_sum_fX2 = B trials
        get lock
        n += B;
        sum_fX += batch_sum_fX;
        sum_fX2 += batch_sum_fX2;
        errbar = ...
        if (abs(errbar/result) < reltol)
          done = true;
        end
        release lock
      until done or n > n_max
    end
    result = sum_fX/n;

slide-14
SLIDE 14

Toy problem: actual code (pthreads)
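The course code itself is not reproduced here. As a rough illustration only, here is a pthreads sketch in the spirit of Version 2.3p; the worker/batch structure, all constants, and the reuse of the earlier min_dist_trial sketch are my choices, not the actual code:

    #include <math.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define BATCH    100
    #define NMAX     1000000
    #define RELTOL   1e-3

    extern double min_dist_trial(unsigned* seed);   /* trial sketch from earlier */

    static double sum_fX = 0, sum_fX2 = 0;
    static int n = 0, done = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void* worker(void* arg)
    {
        unsigned seed = 1234u + *(int*) arg;    /* per-thread RNG state */
        int stop = 0;
        while (!stop) {
            double bsum = 0, bsum2 = 0;
            for (int i = 0; i < BATCH; ++i) {   /* B trials with no shared state */
                double fX = min_dist_trial(&seed);
                bsum  += fX;
                bsum2 += fX*fX;
            }
            pthread_mutex_lock(&lock);          /* fold the batch into global tallies */
            n += BATCH;
            sum_fX  += bsum;
            sum_fX2 += bsum2;
            double result = sum_fX / n;
            double errbar = sqrt(sum_fX2 - sum_fX*sum_fX/n) / n;
            if (fabs(errbar/result) < RELTOL || n >= NMAX)
                done = 1;
            stop = done;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        int id[NTHREADS];
        for (int i = 0; i < NTHREADS; ++i) {
            id[i] = i;
            pthread_create(&tid[i], NULL, worker, &id[i]);
        }
        for (int i = 0; i < NTHREADS; ++i)
            pthread_join(tid[i], NULL);
        printf("result = %g +/- %g (n = %d)\n",
               sum_fX/n, sqrt(sum_fX2 - sum_fX*sum_fX/n)/n, n);
        return 0;
    }

Compile with something like gcc -std=gnu99 -pthread together with the file holding min_dist_trial.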

Slide 15

Some loose ends

◮ Alternative: “master-slave” organization
  ◮ Master sends out batches of work to slaves
  ◮ Example: SETI at Home, Folding at Home, ...

◮ What is the right batch size?
  ◮ Large B ⇒ amortize locking/communication overhead (and variance actually helps with contention!)
  ◮ Small B avoids too much extra work

◮ How to evaluate f(X)?
  ◮ For p points, obvious algorithm is O(p²)
  ◮ Binning points better? No gain for p small...

◮ Is f(X) the right thing to evaluate?
  ◮ Maybe E[g(X)] = E[f(X)] but Var[g(X)] ≪ Var[f(X)]?
  ◮ May make much more difference than parallelism!

Slide 16

The problem with pthreads revisited

pthreads can be painful!

◮ Makes code verbose
◮ Synchronization is hard to think about

Would like to make this more automatic!

◮ ... and have been trying for a couple of decades.
◮ OpenMP gets us part of the way

Slide 17

OpenMP: Open spec for MultiProcessing

◮ Standard API for multi-threaded code
  ◮ Only a spec — multiple implementations
  ◮ Lightweight syntax
  ◮ C or Fortran (with appropriate compiler support)

◮ High level:
  ◮ Preprocessor/compiler directives (80%)
  ◮ Library calls (19%)
  ◮ Environment variables (1%)

Slide 18

Parallel “hello world”

    #include <stdio.h>
    #include <omp.h>

    int main()
    {
        #pragma omp parallel
        printf("Hello world from %d\n", omp_get_thread_num());
        return 0;
    }

Slide 19

Parallel sections

[Figure: shared array s = [1, 2, 3, 4]; each of four threads has its own private i = 0, 1, 2, 3]

◮ Basic model: fork-join
◮ Each thread runs same code block
◮ Annotations distinguish shared (s) and private (i) data
◮ Relaxed consistency for shared data

Slide 20

Parallel sections

[Figure: shared array s = [1, 2, 3, 4]; each of four threads has its own private i = 0, 1, 2, 3]

    ...
    double s[MAX_THREADS];
    int i;
    #pragma omp parallel shared(s) private(i)
    {
        i = omp_get_thread_num();
        s[i] = i;
    }
    ...

Slide 21

Critical sections

[Figure: Thread 0 and Thread 1 take turns acquiring and releasing the lock around the critical section]

◮ Automatic lock/unlock at ends of critical section
◮ Automatic memory flushes for consistency
◮ Locks are still there if you really need them...

Slide 22

Critical sections

[Figure: Thread 0 and Thread 1 take turns acquiring and releasing the lock around the critical section]

    #pragma omp parallel
    {
        ...
        #pragma omp critical(my_data_cs)
        {
            ... modify data structure here ...
        }
    }

Slide 23

Barriers

[Figure: Thread 0 and Thread 1 repeatedly meet at barriers between steps]

    #pragma omp parallel
    for (i = 0; i < nsteps; ++i) {
        do_stuff;
        #pragma omp barrier
    }

Slide 24

Toy problem: actual code (OpenMP)
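The OpenMP code shown in lecture is likewise not reproduced here; a hedged sketch of the same batched strategy (again reusing the illustrative min_dist_trial and made-up constants) could look like:

    #include <math.h>
    #include <omp.h>
    #include <stdio.h>

    extern double min_dist_trial(unsigned* seed);   /* trial sketch from earlier */

    int main(void)
    {
        const int    BATCH  = 100;
        const int    NMAX   = 1000000;
        const double RELTOL = 1e-3;
        double sum_fX = 0, sum_fX2 = 0;
        int n = 0, done = 0;

        #pragma omp parallel shared(sum_fX, sum_fX2, n, done)
        {
            unsigned seed = 1234u + omp_get_thread_num();  /* per-thread RNG state */
            int stop = 0;
            while (!stop) {
                double bsum = 0, bsum2 = 0;
                for (int i = 0; i < BATCH; ++i) {          /* B trials, no sharing */
                    double fX = min_dist_trial(&seed);
                    bsum  += fX;
                    bsum2 += fX*fX;
                }
                #pragma omp critical
                {
                    /* Fold the batch into the global tallies under the lock. */
                    n += BATCH;
                    sum_fX  += bsum;
                    sum_fX2 += bsum2;
                    double result = sum_fX / n;
                    double errbar = sqrt(sum_fX2 - sum_fX*sum_fX/n) / n;
                    if (fabs(errbar/result) < RELTOL || n >= NMAX)
                        done = 1;
                    stop = done;   /* leaving the critical section flushes done */
                }
            }
        }
        printf("result = %g (n = %d)\n", sum_fX/n, n);
        return 0;
    }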

Slide 25


A practical aside...

◮ GCC 4.3+ has OpenMP support by default
  ◮ Earlier versions may support it (e.g. latest Xcode gcc-4.2)
  ◮ GCC 4.4 (prerelease) for my laptop has buggy support!
  ◮ -O3 -fopenmp == death of an afternoon

◮ Need -fopenmp for both compile and link lines:

    gcc -c -fopenmp foo.c
    gcc -fopenmp -o mycode.x foo.o

Slide 26

Parallel loops

[Figure: parallel for partitions the index space, e.g. i = 0, 1, 2, ... to one thread, i = 10, 11, 12, ... to another, i = 20, 21, 22, ... to a third]

◮ Independent loop body? At least order doesn’t matter¹.
◮ Partition index space among threads
◮ Implicit barrier at end (except with nowait)

¹ If order matters, there’s an ordered modifier.

Slide 27

Parallel loops

    /* Compute dot of x and y of length n */
    int i, tid;
    double my_dot, dot = 0;
    #pragma omp parallel \
            shared(dot,x,y,n) \
            private(i,my_dot)
    {
        tid = omp_get_thread_num();
        my_dot = 0;
        #pragma omp for
        for (i = 0; i < n; ++i)
            my_dot += x[i]*y[i];
        #pragma omp critical
        dot += my_dot;
    }

Slide 28

Parallel loops

    /* Compute dot of x and y of length n */
    int i, tid;
    double dot = 0;
    #pragma omp parallel \
            shared(x,y,n) \
            private(i) \
            reduction(+:dot)
    {
        #pragma omp for
        for (i = 0; i < n; ++i)
            dot += x[i]*y[i];
    }

Slide 29

Parallel loop scheduling

Partition index space different ways:

◮ static[(chunk)]: decide at start of loop; default chunk is n/nthreads. Lowest overhead, most potential load imbalance.
◮ dynamic[(chunk)]: each thread takes chunk iterations when it has time; default chunk is 1. Higher overhead, but automatically balances load.
◮ guided: take chunks of size (unassigned iterations)/threads; chunks get smaller toward end of loop. Somewhere between static and dynamic.
◮ auto: up to the system!

Default behavior is implementation-dependent.
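For reference, the schedule clause attaches directly to the loop directive; a small sketch (expensive_estimate is a hypothetical helper whose cost varies with i):

    /* Threads grab 8 iterations at a time, so a few slow iterations
       don't leave everyone else idle at the end of the loop. */
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (i = 0; i < n; ++i)
        total += expensive_estimate(i);

With schedule(runtime), the choice can instead be deferred to the OMP_SCHEDULE environment variable.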

Slide 30

Other parallel work divisions

◮ single: do only in one thread (e.g. I/O)
◮ master: do only in the master thread; others skip
◮ sections: like cobegin/coend (small sketch below)
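A small sketch of single and sections (the function names are placeholders; note that single ends with an implicit barrier, while master does not):

    #pragma omp parallel
    {
        do_local_work();                  /* hypothetical: all threads work */
        #pragma omp single
        printf("checkpoint reached\n");   /* exactly one thread; others wait */
    }

    #pragma omp parallel sections
    {
        #pragma omp section
        assemble_matrix();                /* hypothetical task 1 */
        #pragma omp section
        assemble_rhs();                   /* hypothetical task 2, runs concurrently */
    }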

Slide 31

Essential complexity?

Fred Brooks (The Mythical Man-Month) identified two types of software complexity: essential and accidental. Does OpenMP address accidental complexity? Yes, somewhat! Essential complexity is harder.

Slide 32

Things to still think about with OpenMP

◮ Proper serial performance tuning?
◮ Minimizing false sharing?
◮ Minimizing synchronization overhead?
◮ Minimizing loop scheduling overhead?
◮ Load balancing?
◮ Finding enough parallelism in the first place?

Let’s focus again on memory issues...

Slide 33

Memory model

◮ Single processor: return last write
  ◮ What about DMA and memory-mapped I/O?

◮ Simplest generalization: sequential consistency – as if
  ◮ Each process runs in program order
  ◮ Instructions from different processes are interleaved
  ◮ Interleaved instructions ran on one processor

Slide 34

Sequential consistency

“A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”

– Lamport, 1979

Slide 35

Example: Spin lock

Initially, flag = 0 and sum = 0

Processor 1:
    sum += p1;
    flag = 1;

Processor 2:
    while (!flag);
    sum += p2;

Slide 36

Example: Spin lock

Initially, flag = 0 and sum = 0

Processor 1:
    sum += p1;
    flag = 1;

Processor 2:
    while (!flag);
    sum += p2;

Without sequential consistency support, what if

1. Processor 2 caches flag?
2. Compiler optimizes away the loop?
3. Compiler reorders assignments on P1?

Starts to look restrictive!

Slide 37

Sequential consistency: the good, the bad, the ugly

Program behavior is “intuitive”:

◮ Nobody sees garbage values
◮ Time always moves forward

One issue is cache coherence:

◮ Coherence: different copies, same value
◮ Requires (nontrivial) hardware support

Also an issue for the optimizing compiler! There are cheaper relaxed consistency models.

Slide 38

Snoopy bus protocol

Basic idea:

◮ Broadcast operations on memory bus
◮ Cache controllers “snoop” on all bus transactions
  ◮ Memory writes induce serial order
  ◮ Act to enforce coherence (invalidate, update, etc)

Problems:

◮ Bus bandwidth limits scaling
◮ Contending writes are slow

There are other protocol options (e.g. directory-based). But usually we give up on full sequential consistency.

Slide 39

Weakening sequential consistency

Try to reduce to the true cost of sharing

◮ volatile tells compiler when to worry about sharing
◮ Memory fences tell when to force consistency
◮ Synchronization primitives (lock/unlock) include fences
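As a hedged illustration (not from the slides), here is the earlier flag/sum example written with OpenMP flush directives standing in for explicit fences; real code would normally just use locks or atomics:

    /* sum and flag are shared and initially 0; p1, p2 are private inputs. */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            sum += p1;
            #pragma omp flush(sum, flag)   /* make sum visible before the flag */
            flag = 1;
            #pragma omp flush(flag)
        } else {
            do {
                #pragma omp flush(flag)    /* force a re-read of flag each pass */
            } while (!flag);
            #pragma omp flush(sum, flag)   /* now the update to sum is visible */
            sum += p2;
        }
    }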

Slide 40

Sharing

True sharing:

◮ Frequent writes cause a bottleneck.
◮ Idea: make independent copies (if possible).
◮ Example problem: malloc/free data structure.

False sharing:

◮ Distinct variables on same cache block
◮ Idea: make processor memory contiguous (if possible)
◮ Example problem: array of ints, one per processor
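A common fix for the last problem is to pad each per-thread variable out to its own cache line; a minimal sketch, assuming 64-byte lines:

    #include <omp.h>

    #define LINE_BYTES 64                 /* assumed cache line size */
    #define MAXTHREADS 32                 /* illustrative bound */

    /* One counter per thread, padded so no two counters share a cache line. */
    struct padded_counter {
        long value;
        char pad[LINE_BYTES - sizeof(long)];
    };
    static struct padded_counter count[MAXTHREADS];

    void tally(const int* data, int n)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            #pragma omp for
            for (int i = 0; i < n; ++i)
                if (data[i] > 0)          /* stand-in for a real predicate */
                    count[tid].value++;   /* each thread touches only its own line */
        }
    }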

Slide 41

Take-home message

◮ Sequentially consistent shared memory is a useful idea...
  ◮ “Natural” analogue to serial case
  ◮ Architects work hard to support it

◮ ... but implementation is costly!
  ◮ Makes life hard for optimizing compilers
  ◮ Coherence traffic slows things down
  ◮ Helps to limit sharing

Have to think about these things to get good performance.