SLIDE 1

Interfaces for Runtime Correctness Checking of Parallel Programs

Joachim Protze (protze@itc.rwth-aachen.de)

SLIDE 2

Generic Tool Interface for Runtime Correctness Checking Joachim Protze 2

Motivation

  • OpenMP 3.0 introduced tasks (2008)
  • Several data race detection tools for OpenMP tasks appeared just last year
  • How can we effectively reduce the porting effort for new programming paradigms?

Memory accesses · Concurrency · Synchronization

SLIDE 3

Synchronization in OpenMP: Parallel region

  • Encountering a parallel directive happens before execution of the parallel region
  • Encountering a barrier directive happens before execution of code following the barrier region
  • Encountering the implicit barrier happens before the master continues code following the parallel region

[Figure labels: parallel-begin, implicit-task-begin, implicit-task-end, parallel-end, barrier-begin, barrier-end, barrier-begin]

SLIDE 4

Synchronization in OpenMP: Task region

  • Encountering a task directive happens before execution of the task region
  • Finishing execution of a child task happens before execution of code following a taskwait, barrier, or taskgroup region
  • Finishing a predecessor task happens before a dependent task starts execution
  • Deferring a task happens before scheduling the task again

[Figure labels: task-create, task-begin, taskwait-end, task-end, task-begin, task-end, + task-dependencies]

[Figure code: task depend(out:a), task depend(in:a), taskwait, task, task]
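
Sketch (not from the slides): the task rules above can be modeled with vector clocks. The helper names `merge`, `tick`, and `happens_before` and the unit names are illustrative assumptions, not part of OMPT or of any tool mentioned in this talk.

```python
def merge(a, b):
    """Component-wise maximum of two vector clocks (dicts: unit -> counter)."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def tick(clock, unit):
    """Return a copy of clock with the unit's own component advanced."""
    out = dict(clock)
    out[unit] = out.get(unit, 0) + 1
    return out

def happens_before(a, b):
    """True if the event stamped a is ordered before the event stamped b."""
    return a != b and all(v <= b.get(k, 0) for k, v in a.items())

# Parent code before the first task directive.
before_create = tick({}, "parent")
# task-create -> task-begin: the child starts from the parent's clock.
child1 = tick(before_create, "child1")
# The parent continues and creates a second task.
between = tick(before_create, "parent")
child2 = tick(between, "child2")
# task-end -> taskwait-end: the parent absorbs the clocks of both children.
after_taskwait = tick(merge(merge(between, child1), child2), "parent")

print(happens_before(before_create, child1))   # creation orders the child after the parent
print(happens_before(child1, child2))          # sibling tasks are unordered ...
print(happens_before(child2, child1))          # ... in both directions
print(happens_before(child1, after_taskwait))  # taskwait orders the child before the parent
```

In this model the two sibling tasks are concurrent with each other and with the parent code between task creation and the taskwait, so conflicting accesses in those slices would be reported as a race.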

SLIDE 5

Archer based on ThreadSanitizer

  • ThreadSanitizer comes with clang and gcc (-fsanitize=thread)
  • Compiler instrumentation of memory accesses
    − Less overhead than binary instrumentation (e.g., Pin, Valgrind)
  • ThreadSanitizer is not aware of OpenMP synchronization
  • Happens-before analysis with a simplified FastTrack algorithm
    − 4 records of memory accesses per word, each storing (epoch, tid, r/w)
  • Archer annotates OpenMP synchronization
    − Initially by instrumenting the LLVM/OpenMP runtime
    − Now based on OMPT events
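
Sketch (not from the slides): a toy version of the per-word access history described above, with 4 records of (epoch, tid, r/w). The class name `ShadowWord`, the oldest-first eviction, and the dict-based vector clock are assumptions made for illustration; this is not ThreadSanitizer's actual data layout or code.

```python
from collections import deque

SLOTS = 4  # records kept per memory word, as stated on the slide

class ShadowWord:
    def __init__(self):
        # Oldest record is dropped first once all slots are used (assumption).
        self.records = deque(maxlen=SLOTS)

    def access(self, tid, clock, is_write):
        """Record an access and return the earlier records it races with.

        clock maps a thread id to the latest epoch of that thread that is
        known to happen before this access (the accessing thread's view).
        """
        races = [
            (epoch, other, was_write)
            for (epoch, other, was_write) in self.records
            if other != tid                     # different thread
            and (is_write or was_write)         # at least one write
            and epoch > clock.get(other, 0)     # not ordered by happens-before
        ]
        self.records.append((clock[tid], tid, is_write))
        return races

# Unsynchronized: thread 1 knows nothing about thread 0's write.
racy = ShadowWord()
racy.access(0, {0: 1}, True)
print(racy.access(1, {1: 1}, False))          # [(1, 0, True)]

# Synchronized: thread 1's clock already covers epoch 1 of thread 0.
synced = ShadowWord()
synced.access(0, {0: 1}, True)
print(synced.access(1, {0: 1, 1: 1}, False))  # []
```

Because only a bounded number of records is kept per word, older accesses can be evicted before a conflicting access arrives, which is one reason such a detector can miss races.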

SLIDE 6

Data race analysis overhead for SPEC OMP 2012 (train)

  • Expected overhead according to base tool: 2-20x
  • 359.botsspar and 370.mgrid331 > 20x
    − Both run for less than 1 second with a high synchronization rate
      ▪ 359.botsspar: 353400 task switches
      ▪ 370.mgrid331: 6383 parallel regions

[Figure: bar chart of tool slowdown (axis ticks 10 to 50) for SPEC OMP 2012 benchmarks 350 to 376, with series for 2, 4, and 12 threads; one bar is labeled 99.8]

SLIDE 7

Concurrency for OpenMP Tasks

  • Execution of the thread as observed by a tool
  • Lamport happens-before
  • Separating the logical slices
  • Observed actual execution with HB

[Figure: timelines of one thread; axes: wallclock time, logical clock; legend: happens-before, observed execution order]

SLIDE 8

TLC: Marking execution within a thread as concurrent

  • Execution of the thread as observed by the tool
  • Lamport happens-before
  • Separating the logical slices
  • Observed actual execution with HB

[Figure: timelines of one thread; axes: wallclock time, logical clock; legend: happens-before, observed execution order, fork / spawn]

SLIDE 9

Generic events

  • Fork(curr, *new)
    − Fork(curr, *new, *msg)
  • Join(curr, next)
  • Switch(curr, next)
    − Switch(curr, next, msg)
  • Send(curr, *msg)
  • Recv(curr, msg)
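
Sketch (not from the slides): one way a tool could consume the generic events, keeping a vector clock per execution unit. The slides give only the event signatures; the semantics coded here, the class name `HBTool`, and the use of return values in place of the output parameters `*new` and `*msg` are assumptions.

```python
import itertools

class HBTool:
    """Toy consumer of the generic events; one vector clock per execution unit."""

    def __init__(self):
        self._ids = itertools.count()
        self.clock = {}                    # execution unit -> vector clock (dict)
        self.root = self._new_unit({})
        self.running = self.root           # unit currently executing on the thread

    def _new_unit(self, inherited):
        unit = next(self._ids)
        self.clock[unit] = {**inherited, unit: 1}
        return unit

    def _tick(self, unit):
        self.clock[unit][unit] += 1

    def _absorb(self, unit, other):
        mine = self.clock[unit]
        for k, v in other.items():
            mine[k] = max(mine.get(k, 0), v)

    def fork(self, curr):
        """Fork(curr, *new): new starts after curr's past, concurrent with curr's future."""
        new = self._new_unit(self.clock[curr])
        self._tick(curr)
        return new

    def join(self, curr, next_unit):
        """Join(curr, next): everything curr did happens before next continues."""
        self._absorb(next_unit, self.clock[curr])
        self._tick(next_unit)

    def switch(self, curr, next_unit):
        """Switch(curr, next): the thread stops running curr and runs next; no new ordering."""
        self.running = next_unit

    def send(self, curr):
        """Send(curr, *msg): snapshot curr's clock into a message handle."""
        msg = dict(self.clock[curr])
        self._tick(curr)
        return msg

    def recv(self, curr, msg):
        """Recv(curr, msg): the matching Send happens before curr continues."""
        self._absorb(curr, msg)
        self._tick(curr)

def ordered(a, b):
    """True if clock snapshot a happens before clock snapshot b."""
    return a != b and all(v <= b.get(k, 0) for k, v in a.items())

tool = HBTool()
a = tool.fork(tool.root)
b = tool.fork(tool.root)
snap_a, snap_b = dict(tool.clock[a]), dict(tool.clock[b])
print(ordered(snap_a, snap_b), ordered(snap_b, snap_a))  # False False
tool.join(a, tool.root)
print(ordered(snap_a, tool.clock[tool.root]))            # True
```

Two units forked from the same parent come out as concurrent, and a Join makes the joined unit's work visible to the unit that continues.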

SLIDE 10

Concurrency / Synchronization in Shared Memory: Parallel, Tasks, Loops

  • Fork → P2P synchronization, concurrency
  • Join → P2P synchronization
  • Barrier → global synchronization
    − Can translate into N2N synchronization
  • Dependencies → P2P synchronization
  • Locks → ?
    − Should be flexible to enable lock-set and HB analysis
  • Parallel loop → concurrency for each iteration
  • Doacross loops → P2P synchronization

Generic events: Fork(curr, *new), Join(curr, next), Switch(curr, next), Send(curr, *msg), Recv(curr, msg)
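
Sketch (not from the slides): the "barrier translates into N2N synchronization" item can be read as every participant sending a clock snapshot and then receiving all snapshots. The function name `barrier_as_send_recv` and the dict-based clocks are illustrative assumptions.

```python
def merge(a, b):
    """Component-wise maximum of two vector clocks (dicts: unit -> counter)."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def barrier_as_send_recv(clocks):
    """Model a barrier as N sends followed by N receives per participant."""
    # Phase 1: every participant "sends" a snapshot of its clock.
    msgs = [dict(c) for c in clocks.values()]
    # Phase 2: every participant "receives" all snapshots, then advances.
    after = {}
    for unit, clock in clocks.items():
        for m in msgs:
            clock = merge(clock, m)
        clock[unit] = clock.get(unit, 0) + 1
        after[unit] = clock
    return after

before = {"t0": {"t0": 3}, "t1": {"t1": 5}}
after = barrier_as_send_recv(before)
print(after["t0"])  # t0 now knows epoch 5 of t1
print(after["t1"])  # t1 now knows epoch 3 of t0
```

A tool with a native barrier event could compute the same result with a single merge over all participants, which avoids the quadratic number of point-to-point steps.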

SLIDE 11

Applying these semantics to MPI: MPI Non-Blocking

  • MPI_Isend / MPI_Irecv → concurrency, P2P synchronization
    − Bind the new execution unit handle to the request
  • MPI_Wait → synchronize
  • Buffer access → read/write

[Figure labels: thread, task, task, MPI_Irecv, MPI_Wait]
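
Sketch (not from the slides): a toy model of the non-blocking receive case, where the pending request acts as a concurrent execution unit that writes the buffer until the wait completes. The class and callback names (`NonBlockingChecker`, `on_irecv`, `on_wait`, `on_access`) are hypothetical; no MPI library is called.

```python
class NonBlockingChecker:
    """Treats each pending receive as a concurrent unit writing its buffer."""

    def __init__(self):
        self.pending = {}   # request handle -> (start, end) address range
        self.reports = []   # (request, address, access kind)

    def on_irecv(self, request, start, length):
        # Fork: bind a new execution unit (the transfer) to the request.
        self.pending[request] = (start, start + length)

    def on_wait(self, request):
        # Join: the transfer happens before code following the wait.
        self.pending.pop(request, None)

    def on_access(self, addr, is_write):
        # Any access to a buffer of a pending receive conflicts with the
        # concurrent write performed by the transfer.
        for request, (lo, hi) in self.pending.items():
            if lo <= addr < hi:
                kind = "write" if is_write else "read"
                self.reports.append((request, addr, kind))

checker = NonBlockingChecker()
checker.on_irecv("req0", 100, 8)
checker.on_access(104, False)   # read while the receive is pending: reported
checker.on_wait("req0")
checker.on_access(104, False)   # read after the wait: not reported
print(checker.reports)          # [('req0', 104, 'read')]
```

For a non-blocking send the transfer only reads the buffer, so under the same model only application writes before the wait would conflict.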

SLIDE 12

Applying these semantics to MPI: MPI One-sided

  • MPI One-sided epochs → concurrency, P2P synchronization
  • MPI One-sided target completion → synchronize
  • Remote memory access → read/write

SLIDE 13

Device Offloading

SLIDE 14

Basic memory operations in device offloading

  • Memory access
  • Alloc/release memory
  • (Dis-)Associate memory
  • Update memory (memcopy)

OpenMP mapping semantics:

  • Alloc: alloc + associate
  • Map-to: ((alloc +) associate +) update to device
  • Map-from: update from device (+ disassociate (+ release))
  • Update-to/from: update to/from device
  • Release: disassociate + release

Challenge: semantics of global/static memory
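
Sketch (not from the slides): the decomposition of the mapping semantics into basic operations, written as functions that return the sequence of basic operations a tool would see. The state layout, the function names, and the `disassociate` / `release` flags that stand in for the parenthesized (optional) steps are assumptions.

```python
def new_state():
    """Track which variables have device memory and a host/device association."""
    return {"allocated": set(), "associated": set()}

def map_alloc(state, var):
    state["allocated"].add(var)
    state["associated"].add(var)
    return ["alloc", "associate"]

def map_to(state, var):
    ops = []
    if var not in state["associated"]:
        if var not in state["allocated"]:
            state["allocated"].add(var)
            ops.append("alloc")            # optional inner step
        state["associated"].add(var)
        ops.append("associate")            # optional outer step
    ops.append("update-to")                # always performed in this model
    return ops

def map_from(state, var, disassociate=True, release=True):
    ops = ["update-from"]
    if disassociate:
        state["associated"].discard(var)
        ops.append("disassociate")
        if release:
            state["allocated"].discard(var)
            ops.append("release")
    return ops

def map_release(state, var):
    state["associated"].discard(var)
    state["allocated"].discard(var)
    return ["disassociate", "release"]

state = new_state()
print(map_to(state, "a"))    # ['alloc', 'associate', 'update-to']
print(map_to(state, "a"))    # ['update-to']
print(map_from(state, "a"))  # ['update-from', 'disassociate', 'release']
```

With this decomposition a checker only needs handlers for the four basic operations, and each mapping clause becomes a short sequence of them.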

SLIDE 15

Distributed Memory ?

SLIDE 16

Thank you for your attention.