Time Stamp Synchronization for Event Traces of Large-Scale Message Passing Applications (PowerPoint PPT Presentation)



SLIDE 1

Time Stamp Synchronization for Event Traces of Large-Scale Message Passing Applications

  • D. Becker and F. Wolf

Forschungszentrum Jülich Central Institute for Applied Mathematics

  • R. Rabenseifner

High Performance Computing Center Stuttgart Department Parallel Computing

SLIDE 2

Daniel Becker 2

Outline

  • Introduction
  • Event model and replay-based parallel analysis
  • Controlled logical clock
  • Extended controlled logical clock
  • Timestamp synchronization
  • Conclusion
  • Future work

SLIDE 3

SCALASCA

Goal: diagnose wait states in MPI applications on large-scale systems

Scalability through parallel analysis of event traces

[Workflow diagram: execution on parallel machine → local trace files → parallel trace analyzer → trace analysis report]

SLIDE 4

Wait States in MPI Applications

[Time-line diagrams of four wait-state patterns, built from ENTER, EXIT, SEND, RECV, and COLLEXIT events: (a) Late sender, (b) Late receiver, (c) Late sender / wrong order, (d) Wait at n-to-n]

SLIDE 5

Non-Synchronized Clocks

Wait-state diagnosis measures temporal displacements between concurrent events

Problem: local processor clocks are often non-synchronized

  • Clocks may vary in offset and drift

Present approach: linear interpolation

  • Accounts for differences in offset and drift
  • Assumes that drift is not time dependent

Inaccuracies and changing drifts can still cause violations of the logical event ordering

A synchronization method is required for violations not already covered by linear interpolation
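Such violations can be detected by checking each message against the clock condition. A minimal sketch, assuming a simple `(send_ts, recv_ts)` record layout and an illustrative minimum-latency value (neither is from the slides):

```python
# Hypothetical sketch: detecting logical-ordering violations that survive
# linear interpolation. Record layout and MIN_LATENCY are assumptions.

MIN_LATENCY = 1e-6  # assumed lower bound on message latency (seconds)

def find_violations(messages, min_latency=MIN_LATENCY):
    """Return messages whose receive timestamp precedes send + latency.

    Each message is a (send_ts, recv_ts) pair taken from two different,
    independently clocked processes.
    """
    return [m for m in messages if m[1] < m[0] + min_latency]

# A drifting receiver clock makes the second message appear to arrive
# before it was sent:
msgs = [(1.000, 1.010), (2.000, 1.999)]
print(find_violations(msgs))  # -> [(2.0, 1.999)]
```

The first message is consistent; the second is flagged because its receive timestamp lies before the send timestamp plus the assumed latency.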

SLIDE 6

Idea

Requirement: realistic message-passing codes

  • Different modes of communication (P2P & collective)
  • Large numbers of processes

Build on controlled logical clock by Rolf Rabenseifner

  • Synchronization based on Lamport’s logical clock
  • Only P2P communication
  • Sequential program

Approach

  • Extend controlled logical clock to collective operations
  • Define scalable correction algorithm through parallel replay

SLIDE 7

Event Model

Event includes at least timestamp, location, and event type

  • Additional information may be supplied depending on event type

Event type refers to

  • Programming-model independent events
  • MPI-related events
  • Events internal to tracing library

Event sequence recorded for typical MPI operations

[Legend: E = Enter, X = Exit, CX = Collective Exit, S = Send, R = Receive. Example sequences: MPI_Send() → E S X, MPI_Recv() → E R X, MPI_Allreduce() → E CX]
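The event model above can be sketched as a small record type; field names and the payload layout are illustrative, not SCALASCA's actual trace format:

```python
# Minimal event record matching the model on this slide: every event
# carries at least a timestamp, a location (process rank), and a type.
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float
    location: int      # process rank
    etype: str         # 'ENTER', 'EXIT', 'SEND', 'RECV', 'COLLEXIT', ...
    data: dict = None  # optional type-specific payload (tag, dest, ...)

# Event sequence recorded for MPI_Send on rank 0: ENTER, SEND, EXIT
trace = [
    Event(0.10, 0, 'ENTER', {'region': 'MPI_Send'}),
    Event(0.11, 0, 'SEND', {'dest': 1, 'tag': 7}),
    Event(0.12, 0, 'EXIT'),
]
print([e.etype for e in trace])  # -> ['ENTER', 'SEND', 'EXIT']
```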

SLIDE 8

Replay-Based Parallel Analysis

  • Parallel analysis scheme
  • SCALASCA toolset
  • Originally developed to improve scalability on large-scale systems
  • Analyze separate local trace files in parallel
  • Exploits distributed memory & processing capabilities
  • Keeps whole trace in main memory
  • Only process-local information visible to a process
  • Parallel replay of target application's communication behavior
  • Parallel traversal of event streams
  • Analyze communication with operation of same type
  • Exchange of required data at synchronization points of target application

SLIDE 9

Example: Wait at N x N

Waiting time due to inherent synchronization in N-to-N operations (e.g., MPI_Allreduce)

Algorithm:

  • Triggered by collective exit event
  • Determine enter events
  • Determine & distribute latest enter event (max-reduction)
  • Calculate & store waiting time

[Time-line diagram: enter events of all processes in the collective operation, with each process's waiting time measured up to the latest enter event]
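The algorithm above can be sketched in a few lines. In the real analyzer the latest enter event is found by a max-reduction across ranks during replay; here the reduction is simulated with plain Python:

```python
# Sketch of the wait-at-N-to-N computation: each process contributes its
# enter timestamp, the latest one is determined by a max-reduction, and
# the per-process waiting time is the gap to that maximum.

def wait_at_nxn(enter_ts):
    """enter_ts: list of enter timestamps, one per process."""
    latest = max(enter_ts)               # max-reduction over all ranks
    return [latest - t for t in enter_ts]

# The process entering last (at 5.0) waits not at all; the others wait
# for it to arrive:
print(wait_at_nxn([3.0, 5.0, 4.0]))  # -> [2.0, 0.0, 1.0]
```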

SLIDE 10

Controlled Logical Clock

Guarantees Lamport's clock condition

  • Uses happened-before relations to synchronize timestamps
  • Send event always earlier than the matching receive event

Scans event trace for clock condition violations and modifies inexact timestamps

Stretches the process-local time axis in the immediate vicinity of the affected event

  • Forward amortization: preserves the length of intervals between local events
  • Backward amortization: smoothes the discontinuity at the affected event

SLIDE 11

Forward Amortization

Inconsistent event stream vs. corrected and forward-amortized event stream

[Diagram: on two processes p0 and p1, a receive event recorded before its matching send violates the clock condition; after correction, the receive is shifted to at least the send timestamp plus the minimum latency]
SLIDE 12

Backward Amortization

Forward-amortized event stream vs. forward- and backward-amortized event stream

[Diagram: forward amortization leaves a jump discontinuity of size Δt before the corrected receive on p0; backward amortization smooths this jump over the preceding local events]

SLIDE 13

Extended Controlled Logical Clock

Consider a single collective operation as a composition of many point-to-point communications

Distinguish between different types

  • 1-to-N
  • N-to-1
  • N-to-N

Determine send and receive events for each type

Define happened-before relations based on the decomposition of collective operations

SLIDE 14

Decomposition of Collective Operations

  • 1-to-N: root sends data to N processes
  • N-to-1: N processes send data to root
  • N-to-N: N processes send data to N processes

SLIDE 15

Happened-Before Relation

  • Synchronization needs one send event timestamp
  • An operation may have multiple send and receive events
  • Multiple receives are used to synchronize multiple clocks
  • The latest send event is the relevant send event
  • Example: N-to-1 (all processes send to root)
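The "latest send is the relevant send" rule is a one-liner; a minimal sketch for the N-to-1 case (the helper name is illustrative):

```python
# For a decomposed N-to-1 operation, the root's receive must happen after
# every contribution, so only the latest send timestamp constrains the
# happened-before relation at the root.

def relevant_send(send_timestamps):
    """Return the timestamp of the relevant (latest) send event."""
    return max(send_timestamps)

# Three senders contribute to the root; only the one at 4.2 matters for
# synchronizing the root's receive:
print(relevant_send([3.9, 4.2, 4.0]))  # -> 4.2
```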

SLIDE 16

Forward Amortization

New timestamp LC' is the maximum of

  • max( send event timestamp ) + minimum latency
  • the event's original timestamp
  • previous event timestamp + minimum event spacing
  • previous event timestamp + controlled event spacing
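The maximum rule above can be sketched as follows; the parameter values (minimum latency `mu`, minimum spacing `delta`, control factor `gamma`) are placeholders, not values from the talk:

```python
# Sketch of the forward-amortization rule: the corrected timestamp LC' is
# the maximum of four candidates. MU, DELTA, and GAMMA are assumed values.

MU = 1e-6     # minimum message latency
DELTA = 1e-9  # minimum event spacing
GAMMA = 0.99  # controller factor scaling the original local spacing

def forward_amortize(prev_new, prev_old, own_ts, send_ts=None,
                     mu=MU, delta=DELTA, gamma=GAMMA):
    """Compute the corrected timestamp LC' for one event.

    prev_new: corrected timestamp of the previous local event
    prev_old: original timestamp of the previous local event
    own_ts:   original timestamp of this event
    send_ts:  latest matching send timestamp, or None for non-receives
    """
    candidates = [
        own_ts,                                  # keep original if consistent
        prev_new + delta,                        # minimum event spacing
        prev_new + gamma * (own_ts - prev_old),  # controlled event spacing
    ]
    if send_ts is not None:
        candidates.append(send_ts + mu)          # clock condition
    return max(candidates)

# A receive recorded at 1.999 although the send left at 2.0 is pushed
# just past the send plus the minimum latency:
print(forward_amortize(prev_new=1.5, prev_old=1.5, own_ts=1.999, send_ts=2.0))
```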

SLIDE 17

Controller

Approximates original communication after clock condition violation

  • Limits synchronization error
  • Bounds propagation during forward amortization
  • Requires global view of the trace data

SLIDE 18

Backward Amortization

  • Results of the extended controlled logical clock show jump discontinuities
  • Linear interpolation with backward amortization
  • Piecewise linear interpolation with backward amortization

[Diagram: the correction LCi' - LCi plotted over the amortization interval; the jump discontinuity caused by a clock condition violation is smoothed by interpolating the corrections of preceding local events, bounded by min( LCk' of the corrected receive events - µ ) so that no new violation is introduced]

SLIDE 19

Timestamp Synchronization

Event tracing of applications running on thousands of processes requires a scalable synchronization scheme

Proposed algorithm depends on the accuracy of original timestamps

Two-step synchronization scheme

  • Pre-synchronization: linear interpolation
  • Parallel post-mortem timestamp synchronization: extended controlled logical clock

SLIDE 20

Pre-Synchronization

Account for differences in offset and drift

Assume that drift is not time dependent

Offset measurement at program initialization and finalization

  • Between an arbitrarily chosen master and the worker processes

Linear interpolation between these two points
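The pre-synchronization step can be sketched as a two-point linear model; variable names and measurement values are illustrative:

```python
# Sketch of pre-synchronization: offsets between a worker clock and the
# master clock are measured once at initialization and once at
# finalization, and a linear model between the two points corrects every
# local timestamp (drift assumed constant in between).

def make_corrector(t1, off1, t2, off2):
    """Return f(t) mapping worker time t to master time.

    (t1, off1), (t2, off2): (local time, measured offset) at program
    initialization and finalization.
    """
    drift = (off2 - off1) / (t2 - t1)
    return lambda t: t + off1 + drift * (t - t1)

# Worker clock is 0.5 s behind at start and 0.7 s behind at the end of a
# 100 s run; halfway through, the correction is about 0.6 s:
correct = make_corrector(t1=0.0, off1=0.5, t2=100.0, off2=0.7)
print(correct(50.0))  # about 50.6
```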

SLIDE 21

Parallel Timestamp Synchronization

Extended controlled logical clock

Parallel traversal of the event stream

  • Forward amortization
  • Backward amortization

Exchange required timestamps at synchronization points

Perform clock correction

Apply control mechanism after replaying the communication

  • Global view of the trace data
  • Multiple passes until error is below a predefined threshold

SLIDE 22

Forward Amortization

Timestamps exchanged depend on the type of operation

Type of operation | Timestamp exchanged                | MPI function
P2P               | timestamp of send event            | MPI_Send
1-to-N            | timestamp of root enter event      | MPI_Bcast
N-to-1            | max( all enter event timestamps )  | MPI_Reduce
N-to-N            | max( all enter event timestamps )  | MPI_Allreduce
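The selection in this table can be sketched as a dispatch function; the function name and parameter layout are illustrative, not the analyzer's actual interface:

```python
# Pick the timestamp to exchange for forward amortization, given the
# operation type and the timestamps gathered during replay.

def forward_exchange(op_type, send_ts=None, root_enter=None, enter_ts=None):
    if op_type == 'P2P':                  # e.g. MPI_Send
        return send_ts
    if op_type == '1-to-N':               # e.g. MPI_Bcast
        return root_enter
    if op_type in ('N-to-1', 'N-to-N'):   # e.g. MPI_Reduce, MPI_Allreduce
        return max(enter_ts)              # max-reduction over enter events
    raise ValueError(f"unknown operation type: {op_type}")

print(forward_exchange('N-to-N', enter_ts=[3.0, 5.0, 4.0]))  # -> 5.0
```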

SLIDE 23

Backward Amortization

Timestamps exchanged depend on the type of operation

Type of operation | Timestamp exchanged                         | MPI function
P2P               | timestamp of receive event                  | MPI_Send
1-to-N            | min( all collective exit event timestamps ) | MPI_Bcast
N-to-1            | timestamp of root collective exit event     | MPI_Reduce
N-to-N            | min( all collective exit event timestamps ) | MPI_Allreduce

SLIDE 24

Conclusion

Extended controlled logical clock algorithm takes collective communication semantics into account

  • Defined collective send and receive operations
  • Defined collective happened-before relations

Parallel implementation design presented using SCALASCA's parallel replay approach

  • Exploits distributed memory & processing capabilities

SLIDE 25

Future Work

  • Finish actual implementation
  • Evaluate algorithm using real message-passing codes
  • Extend algorithm to shared-memory programming models
  • Extend algorithm to one-sided communication

SLIDE 26

Thank you…

For more information, visit our project home page:

http://www.scalasca.org