


SLIDE 1

Trace-based detection of lock contention in MPI one-sided communication

Marc-André Hermanns, Bernd Mohr, Felix Wolf
October 4, 2016, Parallel Tools Workshop, Stuttgart

SLIDE 2

Motivation

The Message Passing Interface (MPI) standard

  • De facto distributed-memory programming standard in HPC
  • Defines multiple communication paradigms

MPI one-sided communication not often used (yet)

  • Initial interface not well adopted by users
  • Gaining traction since MPI-3

Tool support for one-sided communication is narrow

  • Crucial for understanding of complex synchronization behavior
  • Required for supporting multi-paradigm applications

Lock contention in MPI one-sided communication (Hermanns et al.) | Oct 4, 2016 Slide 2

SLIDE 3

The Scalasca Toolkit

Toolkit for trace-based performance analysis

  • Processes OTF2 traces created by Score-P
  • Also processes legacy traces in EPILOG format

Parallel wait state detection

  • Inter-process synchronization
  • Inter-thread synchronization

Uses message replay to interchange local data just in time

  • Synchronizing processes exchange data
  • Uses recorded communication information
  • Uses similar communication pattern

SLIDE 4

MPI one-sided communication

Separate communication from synchronization

  • Multiple (logically concurrent) RMA operations
  • Single synchronization to ensure completion of pending operations

Two different synchronization modes

Active-target synchronization

  • Both origin and target call synchronization functions
  • Target needs to know when to synchronize window

Passive-target synchronization

  • Only origin calls synchronization functions
  • Target is not explicitly involved in synchronization
  • Mutual exclusion to window using locks (shared & exclusive)
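
The shared/exclusive rule can be sketched as a small compatibility check. This is a minimal model, not MPI code: `LockType` and `conflicts` are hypothetical names introduced here, standing in for the MPI lock types `MPI_LOCK_SHARED` and `MPI_LOCK_EXCLUSIVE`.

```python
from enum import Enum


class LockType(Enum):
    """Hypothetical stand-ins for MPI_LOCK_SHARED and MPI_LOCK_EXCLUSIVE."""
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


def conflicts(a: LockType, b: LockType) -> bool:
    """Two lock epochs on the same window conflict unless both are shared."""
    return a is LockType.EXCLUSIVE or b is LockType.EXCLUSIVE
```

Only conflicting epochs have to be serialized; two shared epochs may be open on the same window at the same time.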

SLIDE 5

Lock Contention

[Figure: timeline of processes A to E over time, showing one lock epoch (Lock, Put, Unlock) with event markers E, L, Rq, Rl]

RMA operations in passive target synchronization need to be placed in a lock epoch

SLIDE 6

Lock Contention

[Figure: timeline of processes A to E over time, showing two lock epochs (Lock, Put, Unlock each)]

Intuitively, competing lock epochs would be mutually exclusive

SLIDE 7

Lock Contention

[Figure: timeline of processes A to E over time, showing two lock epochs (Lock, Put, Unlock each) and one marked waiting-time interval]

Waiting time occurs in lock on process B, waiting for process A to release the lock
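
As a minimal sketch of this case, assume the lock is granted as soon as the conflicting holder releases it. The function names and the timestamps in the usage note are hypothetical and not taken from the slides.

```python
def acquisition_time(request_time: float, previous_release_time: float) -> float:
    """Earliest time the lock can be granted under the assumption above."""
    return max(request_time, previous_release_time)


def lock_wait_time(request_time: float, previous_release_time: float) -> float:
    """Time the requesting process spends blocked in the lock call."""
    return acquisition_time(request_time, previous_release_time) - request_time
```

If B requests the lock at t = 3.0 and A releases it at t = 5.0, `lock_wait_time(3.0, 5.0)` gives 2.0. If A released before B asked, the result is 0.0.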

SLIDE 8

Lock Contention

[Figure: timeline of processes A to E over time, showing three lock epochs (Lock, Put, Unlock each) and one marked waiting-time interval]

MPI semantics allow the lock call to postpone the actual acquisition

SLIDE 9

Lock Contention

[Figure: timeline of processes A to E over time, showing three lock epochs (Lock, Put, Unlock each) and two marked waiting-time intervals]

Waiting time now occurs in the RMA operation, which waits for the lock

SLIDE 10

Lock Contention

[Figure: timeline of processes A to E over time, showing four lock epochs (three with Put, one with Get) and two marked waiting-time intervals]

MPI semantics even allow lock acquisition and RMA operations to be postponed until the unlock call

SLIDE 11

Lock Contention

[Figure: timeline of processes A to E over time, showing four lock epochs (three with Put, one with Get) and three marked waiting-time intervals]

Waiting time occurs in the unlock operation

SLIDE 12

Lock Contention

[Figure: timeline of processes A to E over time, showing five lock epochs (three with Put, two with Get) and three marked waiting-time intervals]

Lock epochs with shared locks may overlap

SLIDE 13

Lock Contention

[Figure: timeline of processes A to E over time, showing five lock epochs (three with Put, two with Get) and four marked waiting-time intervals]

Waiting time can occur in the context of previous conflicting locks

SLIDE 14

Trace-based message-replay

Difficulties with passive-target synchronization

Time of actual lock acquisition unknown

  • Use heuristic to compute lock acquisition
  • Full epoch information needed

Target cannot record synchronization data via wrappers

  • No target-side events to trigger data exchange

Incomplete synchronization information at the origin

  • Events contain target information
  • Access conflict is among two or more origin processes

Locks may suffer contention and insufficient progress

SLIDE 15

Active-message replay

[Figure: trace timelines of target and origin]

Individual trace processing can be at arbitrary points

SLIDE 16

Active-message replay

[Figure: trace timelines of target and origin]

Origin packs and sends active message to target

SLIDE 17

Active-message replay

[Figure: trace timelines of target and origin]

Origin continues processing

SLIDE 18

Active-message replay

[Figure: trace timelines of target and origin]

  • Target processes active message upon arrival
  • Identifies corresponding event in O(log n)
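
The slides do not say how the O(log n) lookup is done. One plausible way, sketched here purely as an assumption, is a binary search over the target's time-sorted event timestamps; `find_event_index` is a hypothetical helper.

```python
from bisect import bisect_right
from typing import Sequence


def find_event_index(timestamps: Sequence[float], t: float) -> int:
    """Index of the last event with timestamp <= t, or -1 if there is none.

    Assumes `timestamps` is sorted in ascending order, as events in a
    per-process trace normally are.
    """
    return bisect_right(timestamps, t) - 1
```

With events at `[0.0, 1.5, 3.0, 4.2]`, a message stamped 3.1 maps to index 2, the event at 3.0.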

SLIDE 19

Analysis phase I

Collation of epoch data

[Figure: timeline of processes A, B, and C over time. A and C each execute Lock, foo, Put, bar, Unlock; B executes foo and bar.]

Annotated steps, once for the epoch from A and once for the epoch from C:

  • Begin tracking lock epoch
  • Pack RMA operation data
  • Send epoch data to target
  • Process epoch data from A (respectively C)
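
The collation steps can be modelled roughly as follows. This is a sketch under stated assumptions: `LockEpoch` and `EpochTracker` are hypothetical names, the record fields are guesses at what an epoch needs to carry, and sending to the target is modelled as appending to a list rather than an active message.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LockEpoch:
    """Hypothetical record of one lock epoch as seen by its origin."""
    origin: str
    exclusive: bool
    lock_enter: float
    unlock_leave: Optional[float] = None
    rma_ops: List[str] = field(default_factory=list)


class EpochTracker:
    """Origin-side helper that follows the annotated steps."""

    def __init__(self, origin: str) -> None:
        self.origin = origin
        self.current: Optional[LockEpoch] = None

    def begin(self, time: float, exclusive: bool) -> None:
        # Begin tracking lock epoch (at the Lock event).
        self.current = LockEpoch(self.origin, exclusive, lock_enter=time)

    def add_rma(self, op: str) -> None:
        # Pack RMA operation data (here just the operation name).
        assert self.current is not None, "RMA operation outside a lock epoch"
        self.current.rma_ops.append(op)

    def end(self, time: float, target_queue: List[LockEpoch]) -> None:
        # Send epoch data to target (modelled as appending to a list).
        assert self.current is not None, "Unlock without matching Lock"
        self.current.unlock_leave = time
        target_queue.append(self.current)
        self.current = None
```

The target then holds one queue of epochs from all origins, which is the input that contention detection needs.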

SLIDE 20

Analysis phase II

Detecting contention

Epochs are sorted by latest unlock event

Analysis starts with last epoch and continues back in time

For each lock epoch in queue:

  1. Find conflicting preceding epoch
  2. Get unlock time from preceding epoch
  3. Get local (target-side) progress region
  4. Find location of wait state within current epoch
  5. Send active message with wait state information to affected processes
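
A simplified sketch of this backward pass covers steps 1, 2, and 4 only. The progress region (step 3) and the active message (step 5) are left out. It assumes the wait state sits in whichever call of the current epoch is active when the preceding conflicting epoch unlocks, and it counts the time from entering that call as waiting time. `Epoch`, `Call`, and `detect_contention` are hypothetical names, not Scalasca's.

```python
from dataclasses import dataclass
from typing import List, Tuple

Call = Tuple[str, float, float]  # (name, enter time, leave time)


@dataclass
class Epoch:
    origin: str
    exclusive: bool
    calls: List[Call]  # Lock, RMA operations, Unlock in program order

    @property
    def unlock_time(self) -> float:
        return self.calls[-1][2]  # leave time of the Unlock call


def conflicts(a: Epoch, b: Epoch) -> bool:
    return a.exclusive or b.exclusive


def detect_contention(epochs: List[Epoch]) -> List[Tuple[str, str, float]]:
    """Return (origin, call name, waiting time) tuples, latest epoch first."""
    queue = sorted(epochs, key=lambda e: e.unlock_time)
    waits = []
    for i in range(len(queue) - 1, 0, -1):  # last epoch first, back in time
        current = queue[i]
        # Step 1: closest preceding epoch that conflicts with the current one.
        previous = next(
            (p for p in reversed(queue[:i]) if conflicts(p, current)), None
        )
        if previous is None:
            continue
        # Step 2: the lock cannot be granted before this time.
        release = previous.unlock_time
        # Step 4: the call that is active at the release time holds the wait.
        for name, enter, leave in current.calls:
            if enter < release <= leave:
                waits.append((current.origin, name, release - enter))
                break
    return waits
```

Depending on when the preceding epoch unlocks, the wait is attributed to the Lock call, an RMA operation, or the Unlock call, matching the cases on the lock contention slides.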

SLIDE 21

Simple benchmark

  • Single iteration
  • Skewed begin of lock ensures lock request order on processes
  • Target blocks window until all processes requested lock
  • Target releases lock to let origins complete RMA operation

SLIDE 22

Simple benchmark

Vampir view: Lock phase

  • All processes execute foo for time depending on their rank
  • Process 0 is target for RMA operations
  • Target locks window exclusively
  • Target then executes bar for 2 s
  • Rest of processes wait for access in unlock operation

SLIDE 23

Simple benchmark

Vampir view: Target unlocks

  • Target unlocks window after leaving bar
  • First origin (process 1) gains access to window
  • Target continues to execute foo again for 100 ms
  • Second origin (process 2) is waiting for target progress

SLIDE 24

Simple benchmark

Vampir view: Unlock completion

  • After foo completes, target enters barrier
  • Barrier provides progress for remaining origin processes
  • Remaining origins complete access

SLIDE 25

Simple benchmark

Vampir view: Benchmark completion

  • foo completes on remaining processes
  • Sequence completes with all processes entering barrier

SLIDE 26

Simple benchmark

Cube view: Lock contention & wait for progress

SLIDE 27

SOR benchmark

Solves Poisson equation using successive over-relaxation

2D domain decomposition

Ghost-cell exchange configured for

  • MPI one-sided communication
  • Passive-target synchronization

Configured for weak scaling (keeps local trace data constant)
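
For orientation, a serial sketch of the numerical kernel follows. It is not the benchmark code. It assumes a uniform grid with spacing `h`, the five-point stencil for -∇²u = f, and fixed boundary values stored in the outer rows and columns of `u`. The names and the relaxation factor are illustrative. In the parallel benchmark each process would own one block of the grid and fetch neighbor boundary values (ghost cells) before each sweep, which is not shown here.

```python
from typing import List

Grid = List[List[float]]


def sor_sweep(u: Grid, f: Grid, h: float, omega: float) -> float:
    """One in-place SOR sweep over the interior; returns the largest update."""
    max_delta = 0.0
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[i]) - 1):
            gauss_seidel = 0.25 * (
                u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                + h * h * f[i][j]
            )
            delta = omega * (gauss_seidel - u[i][j])
            u[i][j] += delta
            max_delta = max(max_delta, abs(delta))
    return max_delta


def solve_poisson(u: Grid, f: Grid, h: float, omega: float = 1.5,
                  tol: float = 1e-10, max_iters: int = 10_000) -> int:
    """Sweep until the largest update falls below tol; returns sweeps used."""
    for iteration in range(1, max_iters + 1):
        if sor_sweep(u, f, h, omega) < tol:
            return iteration
    return max_iters
```

With `f` set to zero and boundary values u = x, the sweeps converge to the linear profile u = x in the interior, which makes a convenient sanity check.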

SLIDE 28

SOR benchmark

Weak scaling

[Figure: analysis time [s] (axis ticks 20, 40, 60) over number of processes (2^9 to 2^16)]

SLIDE 29

Conclusion

Trace-based detection of lock contention in MPI

  • Identifies lock acquisition time and order
  • Differentiates between lock contention and insufficient progress
  • Enables understanding of complex synchronization schemes
  • Narrows support gap in wait-state detection for one-sided communication

SLIDE 30

Outlook

Port analysis prototype to Scalasca 2.x

Extend support to other one-sided libraries (OpenSHMEM, ARMCI, etc.)

Further improve analysis performance

  • Target-side message handling
  • Work distribution

Integrate contention wait states into higher-level analysis

  • Root-cause detection
  • Critical path detection

Enhance MPI interfaces

  • Enable target-side event generation
  • Provide more precise locking information on origin

SLIDE 31

Thank you.
