Deterministic OpenMP
Amittai Aviram Dissertation Defense Department of Computer Science Yale University 20 September 2012
Committee: Bryan Ford, Yale University (Advisor); Zhong Shao, Yale University; Ramakrishna Gummadi, University of Massachusetts Amherst
20 September 2012 Amittai Aviram | Yale University CS 2
DOMP (Deterministic OpenMP): a deterministic version of OpenMP, the pragma-based language extension used to parallelize source code. DOMP:
– Guarantees the same results for the same input
– Enforces a deterministic programming model
– Catches concurrency bugs
(Diagram: starting from x = 0, interleaving one thread's "lock; x++; unlock" with another's "lock; x *= 2; unlock" and then reading y = x yields different final values, e.g. y == 0 or y == 2, depending on the order.)
Thread A: lock(x); x := x + 2; unlock(x)
Thread B: lock(x); x := x * 3; unlock(x)
The accesses remained unordered. "But I got the right answer on 1,000,000 test runs..." That's a HEISENBUG: B just happened to go first, until now!
Approach 1: Run any parallel program deterministically, even a racy one, by imposing a deterministic schedule on the program. Potentially useful, but can be problematic.
Approach 2: Run only deterministic programs, by enforcing a deterministic programming model: Determinator OS (OSDI '10) and DOMP.
Contributions: a study of how OpenMP's nondeterministic constructs (critical, atomic, flush) are actually used; a generalized reduction construct; a deterministic programming model.
Survey method: classify each use of parallel constructs (locks, barriers, etc.) in the benchmark suites by the higher-level pattern (idiom) it implements.
Programmers usually (74%) use nondeterministic primitives to build deterministic higher-level idioms for which the language lacks direct expression.
– Work Sharing Idioms: 8.44%
– Reduction Idioms: 35.71%
– Pipeline Idioms: 10.06%
– Task Queue Idioms: 11.04%
– Legacy: 9.09%
– Nondeterministic: 25.65%
Implementation: built on the GNU OpenMP runtime library (libgomp).
Deterministic languages: no data races, deterministic results. But they are UNFAMILIAR and require rewriting legacy code.
Deterministic schedulers: SLOW, or require special hardware.
(Diagram: Threads A and B run non-conflicting accesses in parallel; conflicting accesses force sequential execution; then execution becomes parallel again.)
x = 42;
// Thread A:
{ if (input_is_typical) do_a_lot(); x++; }
// Thread B:
{ do_a_little(); x++; }
Instruction traces, divided into quanta:

Thread A:
    t1 ← input_is_typical
    jump_zero t1 L1
    call do_a_lot
    ...
    ret
L1: t1 ← x
    add t1 1
    x ← t1

Thread B:
    call do_a_little
    ret
    t2 ← x
    add t2 1
    x ← t2

Quanta Qn, Qn+1, Qn+2 run in parallel until a quantum touches shared state; the two increments of x must then run serially. Where the serialization falls depends on how long Thread A's quanta last, e.g. on whether input_is_typical makes it call do_a_lot.
Determinator: a deterministic programming model, but a limited API and an unconventional OS.
DOMP enforces a deterministic programming model (like Determinator), but in a conventional, familiar environment.
Replace nondeterministic constructs with the deterministic idioms they are used to implement.
long ProcessId;
/* Get unique ProcessId */
LOCK(Global->CountLock);
ProcessId = Global->current_id++;
UNLOCK(Global->CountLock);

From barnes (SPLASH-2): a work-sharing idiom built from locks.
Work sharing:
– "Data Parallelism": a LOOP of n iterations divided among t threads (thread 0 gets iterations 0...n/t-1, thread 1 gets n/t...2n/t-1, thread 2 gets 2n/t...3n/t-1, ..., the last thread gets (t-1)n/t...n-1), via a loop work-sharing construct.
– "Task Parallelism": distinct tasks A, B, C, D assigned to threads 0-3, via task work-sharing constructs.
Reduction: combining values v0...v7 into X:
((((((((X * v0) * v1) * v2) * v3) * v4) * v5) * v6) * v7)
Pthreads (low-level threading) has no reduction construct. OpenMP's reduction construct allows only scalar types and simple operations.
DETERMINISTIC IDIOMS
(Table: uses of parallel constructs in SPLASH-2 benchmarks: barnes, fmm, radiosity, raytrace, volrend, water-nsquared, water-spatial, cholesky, fft, lu, radix. Totals: fork/join 20 (7%), barrier 126 (46%), work sharing 23 (8%), reduction (8%), pipeline 5 (2%), task queue (3%), plus legacy and nondeterministic uses; rows grouped into deterministic constructs, deterministic idioms, and nondeterministic.)
(Table: uses of parallel constructs in NPB-OMP benchmarks: BT, CG, DC, EP, FT, IS, LU, MG, SP, UA; 538 uses in total. Totals: fork/join 134 (25%), barrier 4 (2%), work sharing 280 (52%), reduction 17 (3%), work-sharing and reduction idioms 89 (17%), pipeline (1%), plus task queue and legacy uses; rows grouped into deterministic constructs, deterministic idioms, and nondeterministic.)
(Table: uses of parallel constructs in PARSEC benchmarks: blackscholes, bodytrack, facesim, fluidanimate, freqmine, raytrace, swaptions, vips, canneal, dedup, streamcluster, ferret, x264. Totals include fork/join 48 (23%), work sharing idioms (1%), reduction idioms (3%), pipeline idioms 21 (10%), plus task queue, legacy, and nondeterministic uses; rows grouped into deterministic constructs, deterministic idioms, and nondeterministic.)
– Fork/Join: 17.87%
– Barrier: 14.79%
– Work Sharing Constructs: 32.77%
– Reduction Constructs: 1.81%
– Work Sharing Idioms: 2.77%
– Reduction Idioms: 11.70%
– Pipeline Idioms: 3.30%
– Task Queue Idioms: 3.62%
– Legacy: 2.98%
– Nondeterministic: 8.40%
All NPB-OMP plus PARSEC blackscholes, bodytrack, and freqmine.
– Fork/Join: 25.21%
– Barrier: 2.21%
– Work Sharing: 52.47%
– Simple Reductions: 2.90%
– Reduction Idioms: 16.35%
– Pipeline Idioms: 0.85%
– Work Sharing Idioms: 8.44%
– Reduction Idioms: 35.71%
– Pipeline Idioms: 10.06%
– Task Queue Idioms: 11.04%
– Legacy: 9.09%
– Nondeterministic: 25.65%
A deterministic model that covers these idioms is therefore compatible with many existing programs.
Thread 0: x := 42;  barrier;  x := y
Thread 1: y := 33;  barrier;  y := x

Under conventional shared memory the outcome races: x = y = 33 or x = y = 42. Under deterministic semantics each thread reads the other's pre-barrier value, so the effect is a clean swap: (x,y) := (y,x).
Shared state is reconciled only at each explicit synchronization event (WoDet '11).
(Diagram: a BARRIER between Thread 0 and Thread 1, expressed as matched release/acquire operations, rel(1,1), rel(0,1), acq(1,0), acq(0,0), over versioned states (0,0), (0,1), (1,0), (1,1).)
(Diagram: FORK and JOIN among four threads. Thread 0 releases its initial state to Threads 1-3 (rel(1,0), rel(2,0), rel(3,0)); each thread starts, computes, and exits; Thread 0 then acquires each result (acq(1,1), acq(2,1), acq(3,1)), its own state advancing through versions (0,0) to (0,5).)
(Diagram: the same FORK/JOIN structure with a BARRIER between fork and join, again expressed as release/acquire pairs among Thread 0 and Threads 1-3.)
Deterministic master/worker: tasks flow from the master to Workers A and B, and results flow back, over per-worker channels (in_1, in_2).

Master:
while (true) {
    send(new_task(), out_1);
    send(next_task(), out_2);
    result = wait(in_1);
    store(result);
    result = wait(in_2);
    store(result);
}

Worker A / Worker B:
while (true) {
    task = receive(in);
    result = process(task);
    send(result, out);
}
(For contrast) Nondeterministic master/worker: a mutex locks common channels, so tasks and results arrive in whatever order the workers happen to run.

Master:
while (true) {
    result = receive(in);
    store(result);
    send(new_task(), out);
}

Worker A / Worker B:
while (true) {
    task = receive(in);
    result = process(task);
    send(result, out);
}
(Diagram: workspace consistency over shared memory. Fork: the state is copied to each thread. Thread A's writes and Thread B's writes stay in their own copies, so B reads "old" values rather than A's writes. Join: the changes are merged; conflicting writes raise an ERROR.)
(Diagram: FORK: the parent thread's working copy is copied to a hidden reference copy and to per-thread working copies for the master and threads 1...n-1. JOIN: each thread's working copy is merged, against the reference copy, back into a working copy that is released to the parent thread.)
Work sharing constructs
SEQUENTIAL:

// Multiply an n x m matrix A by an m x p matrix B
// to get an n x p matrix C.
void matrixMultiply(int n, int m, int p,
                    double ** A, double ** B, double ** C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < m; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}
OpenMP: the pragma creates new threads and distributes the work; the threads join the parent at the end of the loop.

// Multiply an n x m matrix A by an m x p matrix B
// to get an n x p matrix C.
void matrixMultiply(int n, int m, int p,
                    double ** A, double ** B, double ** C)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < m; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}
DOMP: the same pragma creates new threads and distributes the work plus copies of the shared state; at the end it merges the copies into the parent's state and joins the threads to the parent.

// Multiply an n x m matrix A by an m x p matrix B
// to get an n x p matrix C.
void matrixMultiply(int n, int m, int p,
                    double ** A, double ** B, double ** C)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < m; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}
OpenMP's reduction construct is limited enough that programmers fall back on nondeterministic synchronization to compensate.
In NPB-OMP EP (vector sum):

      do 155 i = 0, nq - 1
!$omp atomic
         q(i) = q(i) + qq(i)
 155  continue
Using atomic here gives a nondeterministic programming model: the evaluation order is unpredictable, so floating-point results can vary from run to run.
void domp_xreduction(void *(*op)(void *, void *), void **var,
                     void *idty, size_t size);
The generalized reduction preserves sequential-parallel equivalence semantics. At each merge step, one thread (the "up-buddy") runs op on its own and the other thread's (the "down-buddy's") version of var; the final result combines the initial value and the cumulative var from merges.
In NPB-OMP EP (vector sum), the atomic loop

      do 155 i = 0, nq - 1
!$omp atomic
         q(i) = q(i) + qq(i)
 155  continue

becomes

      call xreduction_add(q_ptr, nq)

backed by a C routine ending:

    nq = *nq_val;
    init_idty();
    domp_xreduction(&add_, input, (void *)idty, nq * sizeof(double));
}
#pragma omp sections pipeline
{
    while (more_work()) {
        #pragma omp section
        { do_step_a(); }
        #pragma omp section
        { do_step_b(); }
        /* ... */
        #pragma omp section
        { do_step_n(); }
    }
}
for each data segment seg in (stack, heap, bss)
    for each byte b in seg
        writer = WRITER_NONE
        for each thread t
            if (seg[t][b] ≠ reference_copy[b])
                if (writer ≠ WRITER_NONE)
                    race_condition_exception()
                writer = t
        if (writer ≠ WRITER_NONE)
            seg[MASTER][b] = seg[writer][b]
Benchmark      Max Pages   Total Pages
MatMult            24578         24578
Mandelbrot             1             1
BT                     4          1911
DC                     2             3
EP                     3             4
IS                 34778         90100
blackscholes        9768          9768
swaptions            677           677
FFT                    5             5
LU-cont                7             7
LU-non-cont            7             7
(Table: porting effort in lines of code, total vs. DOMP changes. Totals per benchmark: MatMult 109, Mandelbrot 105, BT 3589, DC 2809, EP 228, IS 634, blackscholes 359, swaptions 1780, FFT 1504, LU-cont 2484, LU-non-cont 1890; surviving per-module/change figures include BT 16/30/1, DC 3/48/2, EP 16/30/20.)
Future work: maintain a persistent thread pool at runtime.
Conclusion: an accessible support framework for a deterministic parallel programming model may have wide applicability, and accessible deterministic parallel programming can be efficient and easy to use for many programs.