Does your tool support PAPI SDEs yet? 13th Scalable Tools Workshop

slide-1
SLIDE 1

Does your tool support PAPI SDEs yet?

13th Scalable Tools Workshop

Anthony Danalis, Heike Jagode, Jack Dongarra

Tahoe City, CA, July 28-Aug 1, 2019

slide-2
SLIDE 2

Case study: PaRSEC’s task scheduling algorithm

[Diagram: cores 0 through N, each with a core-local queue, plus a shared global queue (overflow).]

slide-3
SLIDE 3

Case study: PaRSEC’s task scheduling algorithm

[Diagram: cores 0 through N, each with a core-local queue, plus a shared global queue (overflow).]

  • Thread-local queues => high locality
  • Overflow & work stealing => load balance

slide-4
SLIDE 4

Parameter selection

Q1: How long should the local queues be?

Q2: Should a thread first steal from a close queue, any queue, or the shared queue?

slide-5
SLIDE 5

Parameter selection

Q1: How long should the local queues be?

A: 4*Core_Count

Q2: Should a thread first steal from a close queue, any queue, or the shared queue?

A: Any local queue (closest to farthest), then shared queue.
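The two answers above can be sketched in C. This is only an illustration of the policy as stated on the slide, not PaRSEC's scheduler code: the function names and the SHARED_QUEUE marker are hypothetical, and "closest" is assumed to mean ring distance on the core index, which is a stand-in for whatever hardware topology the runtime actually consults.

```c
#include <stddef.h>

/* Hypothetical marker for the shared global (overflow) queue. */
#define SHARED_QUEUE (-1)

/* Local queue length from the slide: 4 * core count. */
size_t local_queue_length(size_t n_cores)
{
    return 4 * n_cores;
}

/* Fills `order` with the queues a thread tries, in sequence, once its
 * own local queue is empty: the other cores' local queues from closest
 * to farthest, then the shared queue. `order` must have room for
 * n_cores entries. Returns the number of entries written.
 * Assumption: distance is ring distance on the core index. */
size_t steal_order(int self, int n_cores, int *order)
{
    size_t k = 0;
    for (int d = 1; d <= n_cores / 2; d++) {
        int right = (self + d) % n_cores;
        int left  = (self - d + n_cores) % n_cores;
        order[k++] = right;
        if (left != right)          /* avoid listing the opposite core twice */
            order[k++] = left;
    }
    order[k++] = SHARED_QUEUE;      /* last resort: the overflow queue */
    return k;
}
```

With 20 cores, as on the testing node, local_queue_length(20) gives 80 entries per local queue.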

slide-6
SLIDE 6

Testing Benchmark

  • 20 independent fork-join chains x 20 (or 25) tasks per fork.
  • Memory-bound kernel with good cache locality.
  • 20 cores on the testing node.

slide-7
SLIDE 7

Execution time vs Local Queue Length

slide-8
SLIDE 8

Execution time vs Local Queue Length (zoom)

slide-9
SLIDE 9

Execution time vs Local Queue Length (zoom 2)

slide-10
SLIDE 10

Execution time vs Local Queue Length (zoom 3)

slide-11
SLIDE 11

Execution time vs Local Queue Length (zoom 4)

slide-12
SLIDE 12

Execution time vs Local Queue Length (zoom 5)

slide-13
SLIDE 13

Execution time vs Local Queue Length (combined)

slide-14
SLIDE 14

Failed Stealing Attempts

slide-15
SLIDE 15

L2 Cache Misses (L3 shows the same pattern)

slide-16
SLIDE 16

Successful Close Stealing

slide-17
SLIDE 17

Successful Close & Far Stealing

slide-18
SLIDE 18

Successful Shared Queue Stealing

slide-19
SLIDE 19

Successful Local + Shared Queue Stealing

slide-20
SLIDE 20

Unanswered questions

Q: So, what causes the bump?

Q: How did you measure all these things?

slide-21
SLIDE 21

Unanswered questions

Q: So, what causes the bump?

A: I don’t know!

Q: How did you measure all these things?

slide-22
SLIDE 22

Unanswered questions

Q: So, what causes the bump?

A: I don’t know!

Q: How did you measure all these things?

A: I am glad you asked.

slide-23
SLIDE 23

What is missing from current infrastructure?

Events that occurred inside the software stack

There is no standardized way for a software layer to export information about its behavior such that other, independently developed, software layers can read it.

[Diagram: software stack layers (HPC application, math library, task runtime, MPI, libibverbs) alongside example events (RDMA completion, one-sided communication, data dependency, distributed factorization, quantum chemistry method).]

slide-24
SLIDE 24

PAPI Software Defined Events

  • De facto standard: SDEs from your library can be read using the standard PAPI_start()/PAPI_stop()/PAPI_read().
  • Low overhead: performance-critical codes can implement SDEs with zero overhead by exporting existing code variables, without adding any new instructions to the fast path.
  • Rich feature set: PAPI SDE supports counters, groups, recordings, simple statistics, thread safety, and custom callbacks.
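The low-overhead point can be illustrated with a toy registry. This is not PAPI's implementation, and every name in it (toy_register, toy_read, tasks_stolen, steal_task) is hypothetical. The sketch only shows the mechanism the bullet describes: the library hands over the address of a counter it already maintains, and a reader dereferences that address on demand, so the fast path is the same code with or without a tool attached. It is single-threaded and ignores the thread-safety concerns a real implementation has to handle.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_EVENTS 16

/* One exported event: a name plus the address of a variable the
 * library already maintains. Hypothetical structure, not PAPI's. */
struct toy_event {
    const char *name;
    const long long *value;
};

static struct toy_event toy_events[TOY_MAX_EVENTS];
static size_t toy_event_count;

/* Library side, called once at init time.
 * Returns 0 on success, -1 if the registry is full. */
int toy_register(const char *name, const long long *value)
{
    if (toy_event_count == TOY_MAX_EVENTS)
        return -1;
    toy_events[toy_event_count].name = name;
    toy_events[toy_event_count].value = value;
    toy_event_count++;
    return 0;
}

/* Tool side: look the event up by name and read its current value.
 * Returns 0 on success, -1 if the name is unknown. */
int toy_read(const char *name, long long *out)
{
    for (size_t i = 0; i < toy_event_count; i++) {
        if (strcmp(toy_events[i].name, name) == 0) {
            *out = *toy_events[i].value;
            return 0;
        }
    }
    return -1;
}

/* A counter the library keeps for its own purposes. */
long long tasks_stolen;

/* Fast path: a plain increment, no registry call involved. */
void steal_task(void)
{
    tasks_stolen++;
}
```

All bookkeeping happens in toy_register() at initialization and in toy_read() on the tool's side; steal_task() never touches the registry.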

slide-25
SLIDE 25

The tool infrastructure is already there

slide-26
SLIDE 26

The tool infrastructure is already there

slide-27
SLIDE 27

Simplest SDE code (library side)

    static long long local_var;

    void small_test_init(void) {
        local_var = 0;
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_register_counter(handle, "Evnt",
                                  PAPI_SDE_RO|PAPI_SDE_DELTA,
                                  PAPI_SDE_long_long, &local_var);
        ...
    }

slide-28
SLIDE 28

SDE code for registering a callback function

    sometype_t *data;

    void small_test_init(void) {
        data = ...
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_register_fp_counter(handle, "Evnt",
                                     PAPI_SDE_RO|PAPI_SDE_DELTA,
                                     PAPI_SDE_long_long, accessor, data);
        ...
    }
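The slide leaves sometype_t and accessor undefined. A minimal sketch of what such a pair could look like follows. It assumes the callback receives the pointer that was registered with it and returns a long long; that signature is an assumption to be checked against the PAPI SDE headers. The queue type is a hypothetical stand-in. The point of a callback is that the value is computed when the tool reads the event, which suits quantities that do not live in a single variable.

```c
#include <stddef.h>

/* Hypothetical stand-in for the slide's sometype_t: a queue whose
 * length is derived from two indices rather than stored directly. */
typedef struct {
    size_t head;   /* index of the next task to pop */
    size_t tail;   /* index one past the last task pushed */
} sometype_t;

/* Callback in the assumed shape: it receives the pointer that was
 * registered alongside it and computes the event value on demand. */
long long accessor(void *param)
{
    const sometype_t *q = param;
    return (long long)(q->tail - q->head);
}
```

Nothing runs in the library's fast path here: pushes and pops update head and tail as they always did, and accessor() is only invoked when the event is read.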

slide-29
SLIDE 29

SDE code for creating a counter (push mode)

    void *counter_handle;

    void small_test_init(void) {
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_create_counter(handle, "Evnt", PAPI_SDE_long_long,
                                &counter_handle);
        ...
    }

slide-30
SLIDE 30

SDE code for creating a recorder (push mode)

    void *recorder_handle;

    void small_test_init(void) {
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_create_recorder(handle, "RCRDR", sizeof(double),
                                 cmpr_func_ptr, &recorder_handle);
        ...
    }

slide-31
SLIDE 31

SDE code for creating a recorder (push mode)

    void *recorder_handle;

    void small_test_init(void) {
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_create_recorder(handle, "RCRDR", sizeof(double),
                                 cmpr_func_ptr, &recorder_handle);
        ...
    }

Resulting event:

    sde:::TEST::RCRDR

slide-32
SLIDE 32

SDE code for creating a recorder (push mode)

    void *recorder_handle;

    void small_test_init(void) {
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_create_recorder(handle, "RCRDR", sizeof(double),
                                 cmpr_func_ptr, &recorder_handle);
        ...
    }

Resulting events:

    sde:::TEST::RCRDR
    sde:::TEST::RCRDR:CNT

slide-33
SLIDE 33

SDE code for creating a recorder (push mode)

    void *recorder_handle;

    void small_test_init(void) {
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_create_recorder(handle, "RCRDR", sizeof(double),
                                 cmpr_func_ptr, &recorder_handle);
        ...
    }

Resulting events:

    sde:::TEST::RCRDR
    sde:::TEST::RCRDR:CNT
    sde:::TEST::RCRDR:MIN
    sde:::TEST::RCRDR:Q1
    sde:::TEST::RCRDR:MED
    sde:::TEST::RCRDR:Q3
    sde:::TEST::RCRDR:MAX
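The :MIN, :Q1, :MED, :Q3 and :MAX events suggest why the recorder takes a comparison function: order statistics over opaque fixed-size elements need an ordering. The sketch below shows one simple way such statistics could be derived from recorded doubles. It is not PAPI's implementation: the struct, the function names, and the index rule for the quartiles are assumptions made for illustration, and cmpr_double merely plays the role of the slide's cmpr_func_ptr.

```c
#include <stdlib.h>

/* Comparison function for doubles in the form qsort() expects. */
int cmpr_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Hypothetical summary of the recorded values. */
struct toy_stats {
    size_t cnt;
    double min, q1, med, q3, max;
};

/* Sorts vals in place and picks order statistics by index.
 * The index rule (k * n / 4) is one simple convention, assumed here.
 * Returns 0 on success, -1 if nothing has been recorded. */
int toy_summarize(double *vals, size_t n, struct toy_stats *out)
{
    if (n == 0)
        return -1;
    qsort(vals, n, sizeof vals[0], cmpr_double);
    out->cnt = n;
    out->min = vals[0];
    out->q1  = vals[n / 4];
    out->med = vals[n / 2];
    out->q3  = vals[(3 * n) / 4];
    out->max = vals[n - 1];
    return 0;
}
```

Sorting at read time keeps the recording side cheap, since appending a value requires no ordering work.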

slide-34
SLIDE 34

SDE code for updating created counters/recorders

    void *counter_handle;
    void *recorder_handle;

    void push_test_dowork(void) {
        double val;
        long long increment = 3;

        val = perform_useful_work();

        papi_sde_inc_counter(counter_handle, increment);
        papi_sde_record(recorder_handle, sizeof(val), &val);
    }

slide-35
SLIDE 35

Performance overheads in simple benchmark

slide-36
SLIDE 36

Performance overhead in PaRSEC

slide-37
SLIDE 37

Performance overhead in HPCG

slide-38
SLIDE 38

Performance overhead in HPCG (zoom)

slide-39
SLIDE 39

Open Problem for our Community:

What meaningful information to associate with “TASKS_STOLEN”?

  • Code location
  • Hardware events (e.g. cache misses)
  • Patterns in history (e.g. last task before stealing event)
  • Patterns in call-path/stack/originating thread

How do we associate useful context information with SDEs?

slide-40
SLIDE 40

Conclusions

  • Libraries/runtimes generate multiple useful software “events”.
  • PAPI SDE allows any software layer to export events.
  • SDEs can be read using the standard PAPI functionality.
  • SDEs have minimal to zero performance overhead.
  • SDEs might require different types of analysis by tools.