

SLIDE 1

Deterministic OpenMP

Amittai Aviram
Dissertation Defense
Department of Computer Science, Yale University
20 September 2012

SLIDE 2

Committee

  • Bryan Ford, Yale University, Advisor
  • Zhong Shao, Yale University
  • Ramakrishna Gummadi, Yale University
  • Emery Berger, University of Massachusetts Amherst

SLIDE 3

The Big Picture

  • OpenMP is a well-established annotation language for parallelizing source code
  • Deterministic OpenMP (DOMP) is our new version of OpenMP
    – Guarantees the same results for the same input
    – Enforces a deterministic programming model
    – Catches concurrency bugs

SLIDE 4

Unordered Memory Accesses

[Diagram: two threads access a shared x without ordering, so y = x may see y == 0 or y == 2. Even with lock-protected updates, starting from x = 0, one thread's lock; x++; unlock and another's lock; x *= 2; unlock leave x == 1 or x == 2 depending on which goes first.]

SLIDE 5

    A: lock(x); x := x + 2; unlock(x)
    B: lock(x); x := x * 3; unlock(x)

The accesses remain unordered. "But I got the right answer on 1,000,000 test runs..." That is only because B happened to go first, until now: a HEISENBUG.
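To make the heisenbug concrete, here is a minimal pthreads sketch (my illustration, not from the slides): the program is free of data races, yet with x starting at 0 it prints 6 when A runs first and 2 when B runs first.

    #include <pthread.h>
    #include <stdio.h>

    static long x = 0;
    static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *thread_a(void *arg) {   /* A: x := x + 2 */
        pthread_mutex_lock(&x_lock);
        x = x + 2;
        pthread_mutex_unlock(&x_lock);
        return NULL;
    }

    static void *thread_b(void *arg) {   /* B: x := x * 3 */
        pthread_mutex_lock(&x_lock);
        x = x * 3;
        pthread_mutex_unlock(&x_lock);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Race-free but order-dependent:
           A then B gives (0 + 2) * 3 == 6; B then A gives 0 * 3 + 2 == 2. */
        printf("x = %ld\n", x);
        return 0;
    }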

SLIDE 6

Determinism

  • program : input → (output, behavior)
  • Results are as if memory accesses were always ordered
  • Bugs are always reproducible
  • Computations can be reproduced exactly, enabling:
    – Byzantine fault tolerance
    – Accountability systems
    – Addressing timing-channel attacks
SLIDES 7-11

Two Approaches

  • Run any parallel program deterministically, even a racy one.
    – Impose a deterministic schedule on the program.
    – Potentially useful, but can be problematic.
  • Run only deterministic programs.
    – Enforce a deterministic programming model.
    – Determinator OS (OSDI '10) and DOMP take this approach.

SLIDE 12

DOMP Semantics

  • Based on the familiar OpenMP API
  • Excludes the nondeterministic OpenMP constructs (critical, atomic, flush)
  • Extends OpenMP with a generalized reduction construct
  • Implements a strict deterministic programming model

SLIDE 13

But can programmers really use a deterministic programming model?

SLIDE 14

Our Analysis

  • Analyzed standard parallel benchmarks
  • Counted instances of synchronization constructs
    – Deterministic (fork, join, barrier)
    – Nondeterministic (mutex locks, condition variables, etc.)
  • Classified nondeterministic instances by use (idiom)

SLIDE 15

We found ...

Programmers usually (74%) use nondeterministic primitives to build deterministic higher-level idioms for which the language lacks direct expression.

    Work-Sharing Idioms    8.44%
    Reduction Idioms      35.71%
    Pipeline Idioms       10.06%
    Task-Queue Idioms     11.04%
    Legacy                 9.09%
    Nondeterministic      25.65%

SLIDES 16-17

Making Determinism Accessible: OUR GOAL

  • OpenMP API
  • User library for Linux
  • Replacement for GCC's OpenMP support library (libgomp)
  • Often a drop-in replacement for libgomp

SLIDES 18-19

Outline

  • The Big Picture √
  • Background
  • Analysis
  • Design and Semantics
  • Implementation
  • Evaluation
  • Conclusion
SLIDES 20-24

Single-Assignment Languages

  • Dataflow languages
  • Data Parallel Haskell
  • Concurrency Collections (CnC)

No data races, and deterministic. But UNFAMILIAR, and legacy code must be rewritten.

SLIDES 25-26

Deterministic Imperative Languages

  • SHIM
    – Message passing
  • Deterministic Parallel Java (DPJ)
    – Programmer annotates data with effect classes
SLIDES 27-29

Record-and-Replay Systems

  • Instant Replay (1987)
  • Recap (1988)
  • DejaVu (1998)
  • ReVirt (2002)
  • Many others

SLOW, or require special hardware ($$$).

SLIDE 30

Deterministic Schedulers

  • DMP
  • CoreDet
  • Grace
  • Dthreads
  • Kendo
    – Orders lock acquisitions only
    – Racy programs remain nondeterministic
  • Tern
    – Memoizes and re-uses schedules
SLIDE 31

Deterministic Scheduling

[Diagram: Threads A and B run their non-conflicting accesses in parallel; conflicting accesses force a sequential phase, after which execution becomes parallel again.]

SLIDE 32

Schedule Dependency

    x = 42;
    // Thread A:
    { if (input_is_typical) do_a_lot(); x++; }
    // Thread B:
    { do_a_little(); x++; }

SLIDES 33-34

Schedule Dependency (continued)

[Diagram: the same code shown as instruction streams for Threads A and B, divided into scheduling quanta Qn, Qn+1, Qn+2. Thread B always runs do_a_little and increments x; whether Thread A's increment lands in the same quantum depends on whether do_a_lot runs. With one input the two increments fall into parallel quanta; with another input they collide and must run serially, so the schedule depends on the input.]

SLIDES 35-36

Determinator OS

Deterministic programming model, but a limited API on an unconventional OS.

SLIDE 37

Deterministic OpenMP (DOMP)

  • Familiar, expressive OpenMP API
    – Includes almost all constructs
    – Excludes the nondeterministic constructs: atomic, critical, flush
  • Extends OpenMP with a generalized reduction
  • Enforces a deterministic parallel programming model (like Determinator)
  • User library for Linux; works with GCC
SLIDES 38-39

Outline

  • The Big Picture √
  • Background √
  • Analysis
  • Design and Semantics
  • Implementation
  • Evaluation
  • Conclusion
SLIDE 40

How easily could real programs conform to DOMP's deterministic programming model?

SLIDE 41

Method

  • Used three parallel benchmark suites: SPLASH2, NPB-OMP, PARSEC (35 benchmarks in total)
  • Hand-counted instances of synchronization constructs
  • Recorded instances of deterministic constructs
  • Classified and recorded instances of nondeterministic constructs by their use

SLIDE 42

Deterministic Constructs

  • Fork/join
  • Barrier
  • OpenMP work sharing constructs
  • Loop
  • Master
  • (Sections)
  • (Task)
SLIDE 43

Nondeterministic Constructs

  • Mutex lock/unlock
  • Condition variable wait/broadcast
  • (Semaphore wait/post)
  • OpenMP critical
  • OpenMP atomic
  • (OpenMP flush)
SLIDE 44

Use in Idioms

    long ProcessId;
    /* Get unique ProcessId */
    LOCK(Global->CountLock);
    ProcessId = Global->current_id++;
    UNLOCK(Global->CountLock);

barnes (SPLASH2): a work-sharing idiom.
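As an aside (my sketch, not on the slide): in OpenMP this particular idiom needs no lock at all, because each thread can obtain a unique ID deterministically with omp_get_thread_num:

    #include <omp.h>

    void worker(void) {
        #pragma omp parallel
        {
            /* Unique per-thread ID, no lock required. */
            long ProcessId = omp_get_thread_num();
            /* ... use ProcessId ... */
        }
    }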

SLIDE 45

Idioms

  • Work sharing
  • Reduction
  • Pipeline
  • Task queue
  • Legacy
    – Obsolete: making I/O or heap allocation thread-safe
  • Nondeterministic
    – Load balancing, random simulated interaction, ...
SLIDE 46

Work Sharing

[Diagram: "Data parallelism" splits a loop's n iterations across t threads (thread 0 gets iterations 0...n/t-1, thread 1 gets n/t...2n/t-1, and so on); cf. the OpenMP loop work-sharing construct. "Task parallelism" assigns distinct tasks A, B, C, D to different threads; cf. the OpenMP sections and task work-sharing constructs.]
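A hedged OpenMP sketch of the two idioms (mine, not from the slides); do_task_a and do_task_b stand in for arbitrary work:

    void data_parallel(int n, double *a) {
        /* Data parallelism: the n iterations are split across the threads. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= 2.0;
    }

    void task_parallel(void (*do_task_a)(void), void (*do_task_b)(void)) {
        /* Task parallelism: distinct tasks run on distinct threads. */
        #pragma omp parallel sections
        {
            #pragma omp section
            do_task_a();
            #pragma omp section
            do_task_b();
        }
    }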

SLIDES 47-48

Reduction

[Diagram: values v0...v7 folded into an initial X by repeated application of *, yielding ((((((((X * v0) * v1) * v2) * v3) * v4) * v5) * v6) * v7.]

Pthreads (low-level threading) has no reduction construct. OpenMP's reduction construct allows only scalar types and simple operations.
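For contrast, OpenMP's built-in reduction clause in its supported form, a scalar accumulator with a simple operator (this sketch is mine, not from the slides):

    #include <stdio.h>

    int main(void) {
        enum { N = 1000 };
        double a[N], sum = 0.0;
        for (int i = 0; i < N; i++)
            a[i] = 1.0;
        /* Allowed: scalar type, built-in operator. An array or a
           user-defined combiner is not accepted by this clause. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];
        printf("sum = %f\n", sum);   /* 1000.0 on every run */
        return 0;
    }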

SLIDE 49

Pipeline


SLIDE 51

Task Queue

SLIDE 52

Idioms (continued)

The first five idioms (work sharing, reduction, pipeline, task queue, legacy) are DETERMINISTIC IDIOMS; only the last class is inherently nondeterministic.

SLIDE 53

SPLASH2

Benchmarks: barnes, fmm, ocean, radiosity, raytrace, volrend, water-nsquared, water-spatial, cholesky, fft, lu, radix. Suite totals:

    Category                   Count   Share
    fork/join                     20      7%
    barrier                      126     46%
    work-sharing constructs        0      0%
    reduction constructs           0      0%
    work-sharing idioms           23      8%
    reduction idioms              21      8%
    pipeline idioms                5      2%
    task-queue idioms              9      3%
    legacy                        28     10%
    nondeterministic              43     16%
    TOTAL                        275

(The first four rows are deterministic constructs; the idiom rows are deterministic idioms built from nondeterministic primitives.)

SLIDE 54

NPB-OMP

Benchmarks: BT, CG, DC, EP, FT, IS, LU, MG, SP, UA. Suite totals:

    Category                   Count   Share
    fork/join                    134     25%
    barrier                       13      2%
    work-sharing constructs      280     52%
    reduction constructs          17      3%
    work-sharing idioms            0      0%
    reduction idioms              89     17%
    pipeline idioms                5      1%
    task-queue idioms              0      0%
    legacy                         0      0%
    nondeterministic               0      0%
    TOTAL                        538

SLIDE 55

PARSEC

Benchmarks: blackscholes, bodytrack, facesim, fluidanimate, freqmine, raytrace, swaptions, vips, canneal, dedup, streamcluster, ferret, x264. Suite totals:

    Category                   Count   Share
    fork/join                     48     23%
    barrier                       52     25%
    work-sharing constructs       28     14%
    reduction constructs           0      0%
    work-sharing idioms            3      1%
    reduction idioms               7      3%
    pipeline idioms               21     10%
    task-queue idioms             25     12%
    legacy                         0      0%
    nondeterministic              21     10%
    TOTAL                        205

SLIDE 56

Aggregate

    Fork/Join                 17.87%
    Barrier                   14.79%
    Work-Sharing Constructs   32.77%
    Reduction Constructs       1.81%
    Work-Sharing Idioms        2.77%
    Reduction Idioms          11.70%
    Pipeline Idioms            3.30%
    Task-Queue Idioms          3.62%
    Legacy                     2.98%
    Nondeterministic           8.40%

SLIDE 57

OpenMP Benchmarks

All NPB-OMP plus PARSEC blackscholes, bodytrack, and freqmine.

    Fork/Join           25.21%
    Barrier              2.21%
    Work Sharing        52.47%
    Simple Reductions    2.90%
    Reduction Idioms    16.35%
    Pipeline Idioms      0.85%

SLIDE 58

Nondeterministic Synchronization (by idiom)

    Work-Sharing Idioms    8.44%
    Reduction Idioms      35.71%
    Pipeline Idioms       10.06%
    Task-Queue Idioms     11.04%
    Legacy                 9.09%
    Nondeterministic      25.65%

SLIDE 59

Conclusions

  • A deterministic parallel programming model is compatible with many programs
  • Generalized reductions can help increase the number of programs that conform
SLIDES 60-61

Outline

  • The Big Picture √
  • Background √
  • Analysis √
  • Design and Semantics
    – Extended Reduction
  • Implementation
  • Evaluation
  • Conclusion
SLIDE 62

Foundations

  • Workspace consistency
    – Memory consistency model
    – Naturally deterministic synchronization
  • Working Copies Determinism
    – Programming model
    – Based on workspace consistency
SLIDE 63

"Parallel Swap" Example

Initially x := 42 and y := 33; the goal is (x,y) := (y,x).

    Thread 0: x := y        Thread 1: y := x
              barrier                 barrier

Under unordered shared memory, the race can instead produce x = y = 33 or x = y = 42.
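A hedged sketch of the swap as code, assuming DOMP's working-copies semantics (described on SLIDE 71); under standard OpenMP the same code is a data race:

    #include <stdio.h>

    int main(void) {
        int x = 42, y = 33;
        #pragma omp parallel sections
        {
            #pragma omp section
            x = y;   /* reads y as of the fork: 33 */
            #pragma omp section
            y = x;   /* reads x as of the fork: 42 */
        }
        /* Standard OpenMP: racy (x = y = 33 or x = y = 42 possible).
           DOMP: each section writes its own working copy; the join merges
           the non-conflicting writes, deterministically giving
           x == 33, y == 42. */
        printf("x = %d, y = %d\n", x, y);
        return 0;
    }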

SLIDE 64

Memory Consistency Model: Communication Events

  • Acquire
    – Acquires access to a location in shared memory
    – Involves a read
  • Release
    – Enables access to a location in shared memory for other threads
    – Involves a write
SLIDE 65

Workspace Consistency

  • Pair each release with a determinate acquire
  • Delay visibility of updates until the next synchronization event

(WoDet '11)

SLIDE 66

WC "Parallel Swap"

[Diagram: at the barrier, Thread 0 and Thread 1 each release their version (rel(0,1), rel(1,1)) and acquire the other thread's prior version (acq(1,0), acq(0,0)), so every read is paired with a determinate release.]

SLIDE 67

WC Fork/Join

[Diagram: at the FORK, Thread 0 releases state (rel(1,0), rel(2,0), rel(3,0)) that Threads 1-3 acquire before they compute; at the JOIN, each child releases its results, which Thread 0 acquires (acq(1,1), acq(2,1), acq(3,1)). Every acquire is paired with a determinate release.]

SLIDE 68

WC Barrier

[Diagram: a barrier expressed in workspace consistency as a join immediately followed by a fork: each worker thread releases its state to Thread 0, which merges and releases the next version back to all threads through paired release/acquire events.]

SLIDE 69

Kahn Process Network

Master (output channels out_1, out_2 to the workers; input channels in_1, in_2 back):

    while (true) {
        send(new_task(), out_1);
        send(next_task(), out_2);
        result = wait(in_1);
        store(result);
        result = wait(in_2);
        store(result);
    }

Worker A and Worker B (each with its own in and out channels):

    while (true) {
        task = receive(in);
        result = process(task);
        send(result, out);
    }

Every receive blocks on one fixed channel, so the network behaves deterministically.

SLIDE 70

Nondeterministic Network (For Contrast)

Here the workers share common channels, guarded by mutex locks; the master receives results from whichever worker finishes first:

    /* Master */
    while (true) {
        result = receive(in);
        store(result);
        send(new_task(), out);
    }

    /* Workers */
    while (true) {
        task = receive(in);
        result = process(task);
        send(result, out);
    }

Because the arrival order on the shared channels depends on timing, the results are nondeterministic.

SLIDE 71

Working Copies Determinism

[Diagram: the fork copies the shared memory state for Threads A and B. Each thread's writes go to its own copy, so B reads "old" values rather than A's writes. The join merges the changes; conflicting writes raise an ERROR.]

SLIDES 72-80

Working Copies: Fork and Join

[Animation: at FORK, the parent thread's working copy is duplicated into working copies for the master and worker threads (thread 1 ... thread n-1), and a hidden reference copy is kept. At JOIN, each thread's working copy is merged against the reference copy into a single working copy, which is released back to the parent thread.]

SLIDE 81

DOMP API

  • Supports most OpenMP constructs
    – Parallel blocks
    – Work sharing
    – Simple (scalar-type) reductions
  • Excludes OpenMP's few nondeterministic constructs: atomic, critical, flush
  • Extends OpenMP with a generalized reduction
SLIDE 82

Example: Matrix Multiply (SEQUENTIAL)

    // Multiply an n x m matrix A by an m x p matrix B
    // to get an n x p matrix C.
    void matrixMultiply(int n, int m, int p,
                        double **A, double **B, double **C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < p; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < m; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }

SLIDE 83

Example: Matrix Multiply (OpenMP)

The same function with one annotation. The pragma creates new threads and distributes the work; at the end of the parallel loop, the threads are joined to the parent.

    void matrixMultiply(int n, int m, int p,
                        double **A, double **B, double **C) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            for (int j = 0; j < p; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < m; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }

SLIDE 84

Example: Matrix Multiply (DOMP)

The same code, unchanged. Under DOMP, the fork creates new threads and distributes the work plus copies of the shared state; the join merges the copies of shared variables into the parent's state and joins the threads to the parent.

SLIDE 85

Extended Reduction

  • OpenMP's reduction is limited
    – Scalar types only (no pointers!)
    – Arithmetic, logical, or bitwise operations only
  • Benchmark programmers used nondeterministic synchronization to compensate

SLIDES 86-87

Typical Workaround

In NPB-OMP EP (vector sum):

          do 155 i = 0, nq - 1
    !$omp atomic
             q(i) = q(i) + qq(i)
     155  continue

This relies on a nondeterministic programming model, with an unpredictable evaluation order.

SLIDES 88-89

DOMP Reduction API

  • Binary operation op
    – Arbitrary, user-defined
    – Associative but not necessarily commutative
  • Identity object idty
    – Defined in contiguous memory
  • Reduction variable object var
    – Also defined in contiguous memory
  • Size in bytes of idty and var

    void domp_xreduction(void *(*op)(void *, void *), void **var,
                         void *idty, size_t size);

SLIDE 90

Why the Identity Object?

  • DOMP preserves OpenMP's guaranteed sequential-parallel equivalence semantics
  • Each thread runs op on its right-hand side and idty
  • At merge time, each merging thread (the "up-buddy") runs op on its own and the other thread's (the "down-buddy's") version of var
  • The master thread runs op on the original var and the cumulative var from the merges (worked example below)
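A worked illustration (mine, not from the slides), taking op to be concatenation, which is associative but not commutative. Four threads reduce the chunks s0, s1, s2, s3 of an array: each thread Ti starts from idty (the empty string) and computes vi = op(idty, si) = si. At merge time, up-buddy T0 computes op(v0, v1) = s0s1 and up-buddy T2 computes op(v2, v3) = s2s3; then T0 computes op(s0s1, s2s3) = s0s1s2s3. Finally the master applies op to the original var and this cumulative result, so the outcome equals the left-to-right sequential reduction.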

SLIDE 91

DOMP Replacement

In NPB-OMP EP (vector sum), the atomic loop becomes a single call:

    call xreduction_add(q_ptr, nq)

    void xreduction_add_(void **input, int *nq_val) {
        nq = *nq_val;
        init_idty();
        domp_xreduction(&add_, input, (void *)idty,
                        nq * sizeof(double));
    }

SLIDES 92-93

Desirable Future Extensions

  • Pipeline
  • Task queue or task object

Sketch of a possible pipeline construct:

    #pragma omp sections pipeline
    {
        while (more_work()) {
            #pragma omp section
            { do_step_a(); }
            #pragma omp section
            { do_step_b(); }
            /* ... */
            #pragma omp section
            { do_step_n(); }
        }
    }

SLIDES 94-95

Outline

  • The Big Picture √
  • Background √
  • Analysis √
  • Design and Semantics √
  • Implementation
  • Evaluation
  • Conclusion
SLIDE 96

Stats

  • 8 files in libgomp, ~5600 LOC
  • Changes in gcc/omp-low.c and *.def files to support deterministic simple reductions
SLIDE 97

Naive Merge Loop

    for each data segment seg in (stack, heap, bss):
        for each byte b in seg:
            writer = WRITER_NONE
            for each thread t:
                if seg[t][b] != reference_copy[b]:
                    if writer != WRITER_NONE:
                        raise race_condition_exception()
                    writer = t
            if writer != WRITER_NONE:
                seg[MASTER][b] = seg[writer][b]
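The same loop as a self-contained C sketch (mine); the real implementation walks the stack, heap, and BSS segments, but the conflict-detection logic is the same:

    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define WRITER_NONE (-1)

    /* Merge nthreads working copies against the reference copy, byte by
       byte; copies[0] plays the role of the master's copy. */
    static void merge(unsigned char **copies, const unsigned char *ref,
                      size_t len, int nthreads) {
        for (size_t b = 0; b < len; b++) {
            int writer = WRITER_NONE;
            for (int t = 0; t < nthreads; t++) {
                if (copies[t][b] != ref[b]) {
                    if (writer != WRITER_NONE) {
                        fprintf(stderr,
                                "race: byte %zu written by threads %d and %d\n",
                                b, writer, t);
                        exit(1);   /* race condition exception */
                    }
                    writer = t;
                }
            }
            if (writer != WRITER_NONE)
                copies[0][b] = copies[writer][b];
        }
    }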

SLIDE 98

Improvements

  • Copy-on-write (page granularity)
  • Merge or copy pages only as needed
  • Parallel merge (binary tree; see the sketch below)
  • Thread pool
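A hedged sketch (my illustration; the actual DOMP pairing may differ) of how a binary-tree merge can pair threads, using the up-buddy/down-buddy roles from SLIDE 90:

    #include <stdio.h>

    /* Print a log2(n)-round pairing: in each round, "up-buddy" thread t
       merges in its "down-buddy" t + stride. All merges within a round
       are independent and can run in parallel; after the last round,
       thread 0 holds the fully merged state. */
    static void tree_merge_schedule(int nthreads) {
        for (int stride = 1; stride < nthreads; stride *= 2)
            for (int t = 0; t + stride < nthreads; t += 2 * stride)
                printf("stride %d: thread %d <- thread %d\n",
                       stride, t, t + stride);
    }

    int main(void) {
        /* 8 threads: rounds (0<-1, 2<-3, 4<-5, 6<-7), (0<-2, 4<-6), (0<-4) */
        tree_merge_schedule(8);
        return 0;
    }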
SLIDES 99-100

Binary Tree Merge

[Diagram slides illustrating the pairwise merge tree.]

SLIDE 101

Limitations

  • Problem of granularity
    – False-positive/false-negative tradeoff
  • Scaling constraints and space inefficiency
    – Global bookkeeping data structures
    – Globally visible heaps (mapped files)
  • No nested parallelism
SLIDES 102-103

Outline

  • The Big Picture √
  • Background √
  • Analysis √
  • Design and Semantics √
  • Implementation √
  • Evaluation
  • Conclusion
SLIDE 104

Performance

[Chart slide.]

SLIDE 105

Speedup

[Chart slide.]

SLIDE 106

Why Is IS So Bad?

    Benchmark      Max Pages   Total Pages
    MatMult            24578         24578
    Mandelbrot             1             1
    BT                     4          1911
    DC                     2             3
    EP                     3             4
    IS                 34778         90100
    blackscholes        9768          9768
    swaptions            677           677
    FFT                    5             5
    LU-cont                7             7
    LU-non-cont            7             7

IS copies and merges far more pages than any other benchmark, which explains its poor performance under page-granularity merging.

SLIDE 107

Converting Nondeterministic Code

    Benchmark      Total LOC   DOMP changes
    MatMult              109   --
    Mandelbrot           105   --
    BT                  3589   16 + 30 lines  (1%)
    DC                  2809    3 + 48 lines  (2%)
    EP                   228   16 + 30 lines  (20%)
    IS                   634   --
    blackscholes         359   --
    swaptions           1780   --
    FFT                 1504   --
    LU-cont             2484   --
    LU-non-cont         1890   --

[Change counts for the remaining benchmarks were not recoverable.]

SLIDES 108-109

Outline

  • The Big Picture √
  • Background √
  • Analysis √
  • Design and Semantics √
  • Implementation √
  • Evaluation √
  • Conclusion
SLIDE 110

Future Work

  • More flexible design for changing the thread-pool size at runtime
  • Pipeline construct
  • Task queue construct
  • Nested parallelism
SLIDE 111

In Conclusion ...

  • Our analysis of benchmarks suggests that an accessible support framework for a deterministic parallel programming model may have wide applicability.
  • Our experiments with DOMP suggest that such accessible deterministic parallel programming can be efficient and easy to use for many programs.

SLIDE 112

Thank You

  • Bryan Ford
  • Ramakrishna Gummadi
  • Zhong Shao
  • Emery Berger
  • DeDiS Lab members
  • Family and friends
  • NSF Grant No. CNS-1017206.