Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization

ROBERTO BELLI, TORSTEN HOEFLER

spcl.inf.ethz.ch @spcl_eth


SLIDE 1

Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization
Roberto Belli, Torsten Hoefler

SLIDE 2

COMMUNICATION IN TODAY’S HPC SYSTEMS

§ The de-facto programming model: MPI-1
§ Using send/recv messages and collectives
§ The de-facto network standard: RDMA
§ Zero-copy, user-level, OS-bypass, fuzz-bang

SLIDE 3

PRODUCER-CONSUMER RELATIONS

§ The most important communication idiom
§ Some examples:
§ Perfectly supported by MPI-1 message passing
§ But how does this actually work over RDMA?
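To make the idiom concrete, here is a minimal thread-based sketch of producer/consumer with message passing, the pattern that MPI-1 send/recv expresses across processes. The queue stands in for the network; all names are illustrative, not an MPI API.

```python
import queue
import threading

channel = queue.Queue()  # stands in for the network between two ranks

def producer(n_items):
    for i in range(n_items):
        channel.put(("data", i))   # plays the role of MPI_Send
    channel.put(("done", None))    # termination marker

def consumer(results):
    while True:
        tag, payload = channel.get()  # plays the role of MPI_Recv
        if tag == "done":
            break
        results.append(payload * 2)   # "consume" the item

results = []
t = threading.Thread(target=producer, args=(4,))
t.start()
consumer(results)
t.join()
```

The point of the sketch: the receive both delivers the data and synchronizes the consumer, which is exactly what RDMA by itself does not give us.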

SLIDE 4–7

MPI-1 MESSAGE PASSING – SIMPLE EAGER

Critical path: 1 latency + 1 copy

[1]: T. S. Woodall, G. M. Shipman, G. Bosilca, R. L. Graham, and A. B. Maccabe, “High performance RDMA protocols in HPC”, EuroMPI’06
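A toy cost model of the eager protocol above: the sender pushes header plus payload in one network traversal into a preallocated receive ("bounce") buffer, and the receiver later copies the payload into the user buffer on matching. The structure and counters are illustrative only, not an MPI implementation.

```python
def eager_transfer(payload, bounce, user_buffer):
    cost = {"latencies": 0, "copies": 0}
    # one network traversal carries the whole message into the bounce buffer
    bounce.append(payload)
    cost["latencies"] += 1
    # receiver copies out of the bounce buffer when the receive matches
    user_buffer.append(bounce.pop())
    cost["copies"] += 1
    return cost

bounce, user = [], []
cost = eager_transfer(b"hello", bounce, user)
```

This matches the slide's critical path: 1 latency + 1 copy, paid on the receiver side.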

SLIDE 8–13

MPI-1 MESSAGE PASSING – SIMPLE RENDEZVOUS

Critical path: 3 latencies

[1]: T. S. Woodall, G. M. Shipman, G. Bosilca, R. L. Graham, and A. B. Maccabe, “High performance RDMA protocols in HPC”, EuroMPI’06
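The rendezvous counterpart of the eager cost model: request-to-send (RTS), clear-to-send (CTS), then a zero-copy RDMA transfer of the payload, i.e. three network traversals on the critical path and no intermediate copy. Again a sketch, not an implementation.

```python
def rendezvous_transfer(payload, user_buffer):
    cost = {"latencies": 0, "copies": 0}
    cost["latencies"] += 1          # sender -> receiver: RTS (envelope only)
    cost["latencies"] += 1          # receiver -> sender: CTS (target address)
    user_buffer.append(payload)     # sender -> receiver: RDMA write, no extra copy
    cost["latencies"] += 1
    return cost

user = []
cost = rendezvous_transfer(b"bigdata", user)
```

The trade-off between the two protocols is visible in the counters: eager saves two latencies but pays a copy, so it is used for small messages; rendezvous is zero-copy but needs the handshake.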

SLIDE 14

COMMUNICATION IN TODAY’S HPC SYSTEMS

§ The de-facto programming model: MPI-1
§ Using send/recv messages and collectives
§ The de-facto hardware standard: RDMA
§ Zero-copy, user-level, OS-bypass, fuzz-bang

http://www.hpcwire.com/2006/08/18/a_critique_of_rdma-1/

SLIDE 15

REMOTE MEMORY ACCESS PROGRAMMING

§ Why not use these RDMA features more directly?
§ A global address space may simplify programming
§ … and accelerate communication
§ … and there could be a widely accepted standard
§ MPI-3 RMA (“MPI One Sided”) was born
§ Just one among many others (UPC, CAF, …)
§ Designed to react to hardware trends, learn from others
§ Direct (hardware-supported) remote access
§ A new way of thinking for programmers

[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf

SLIDE 16

MPI-3 RMA SUMMARY

§ MPI-3 updates RMA (“MPI One Sided”)
§ Significant change from MPI-2
§ Communication is “one sided” (no involvement of the destination)
§ Utilizes direct memory access
§ RMA decouples communication & synchronization
§ Fundamentally different from message passing

[Diagram: two sided – Proc A sends, Proc B receives (communication + synchronization in one step); one sided – Proc A puts into Proc B’s memory (communication), with synchronization as a separate step]

[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
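The decoupling the slide describes can be sketched with shared memory and an explicit sync step: the put deposits data directly into the target's memory without involving it, and the target only learns of completion through a separate synchronization. The bytearray and event stand in for the MPI window and the sync call; this is not the MPI API itself.

```python
import threading

window = bytearray(8)          # stands in for an MPI window at the target
synced = threading.Event()     # stands in for the separate synchronization step

def origin():
    window[0:5] = b"hello"     # "put": communication only, target not involved
    synced.set()               # explicit synchronization afterwards

t = threading.Thread(target=origin)
t.start()
synced.wait()                  # target observes completion only via the sync
t.join()
```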

SLIDE 17–21

MPI-3 RMA COMMUNICATION OVERVIEW

[Diagram: Process A (passive) exposes memory through an MPI window; Processes B, C, D (active) access it with non-atomic communication calls (Put, Get) and atomic communication calls (Acc, Get & Acc, CAS, FAO)]

slide-21
SLIDE 21

spcl.inf.ethz.ch @spcl_eth

21

MPI-3 RMA COMMUNICATION OVERVIEW

Process A (passive) Memory

MPI window

Process B (active) Process C (active)

Put Get Atomic Non-atomic communication calls (put, get) Atomic communication calls (Acc, Get & Acc, CAS, FAO)

Memory

MPI window …

Process D (active)

slide-22
SLIDE 22

spcl.inf.ethz.ch @spcl_eth

22

MPI-3 RMA SYNCHRONIZATION OVERVIEW

Active process Passive process Synchroni- zation

Passive Target Mode

Lock Lock All

Active Target Mode

Fence Post/Start/ Complete/Wait Communi- cation

slide-23
SLIDE 23

spcl.inf.ethz.ch @spcl_eth

23

MPI-3 RMA SYNCHRONIZATION OVERVIEW

Active process Passive process Synchroni- zation

Passive Target Mode

Lock Lock All

Active Target Mode

Fence Post/Start/ Complete/Wait Communi- cation

slide-24
SLIDE 24

spcl.inf.ethz.ch @spcl_eth

24

MPI-3 RMA SYNCHRONIZATION OVERVIEW

Active process Passive process Synchroni- zation

Passive Target Mode

Lock Lock All

Active Target Mode

Fence Post/Start/ Complete/Wait Communi- cation

slide-25
SLIDE 25

spcl.inf.ethz.ch @spcl_eth

25

MPI-3 RMA SYNCHRONIZATION OVERVIEW

Active process Passive process Synchroni- zation

Passive Target Mode

Lock Lock All

Active Target Mode

Fence Post/Start/ Complete/Wait Communi- cation

slide-26
SLIDE 26

spcl.inf.ethz.ch @spcl_eth

26

MPI-3 RMA SYNCHRONIZATION OVERVIEW

Active process Passive process Synchroni- zation

Passive Target Mode

Lock Lock All

Active Target Mode

Fence Post/Start/ Complete/Wait Communi- cation

slide-27
SLIDE 27

spcl.inf.ethz.ch @spcl_eth

27

MPI-3 RMA SYNCHRONIZATION OVERVIEW

Active process Passive process Synchroni- zation

Passive Target Mode

Lock Lock All

Active Target Mode

Fence Post/Start/ Complete/Wait Communi- cation

IN CASE YOU WANT TO LEARN MORE

How to implement producer/consumer in passive mode?

SLIDE 28–31

ONE SIDED – PUT + SYNCHRONIZATION

Critical path: 3 latencies
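A toy critical-path count for producer/consumer over plain MPI-3 RMA: the producer puts the data, ensures remote completion (flush), and then puts a separate flag that the consumer polls. Charging one network traversal per step gives the three latencies on the slide; the step names and counters are illustrative, not MPI calls.

```python
def put_with_flag_sync(payload, window, flag):
    latencies = 0
    window.append(payload)   # put of the data
    latencies += 1
    latencies += 1           # flush: wait for remote completion
    flag.append(1)           # put of the notification flag
    latencies += 1
    return latencies

window, flag = [], []
lat = put_with_flag_sync(b"data", window, flag)
```

Compare with message passing above: one-sided loses its latency advantage as soon as the consumer must be told that the data has arrived.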

SLIDE 32

COMPARING APPROACHES

Message Passing: 1 latency + copy / 3 latencies
One Sided: 3 latencies

SLIDE 33

IDEA: RMA NOTIFICATIONS

§ First seen in Split-C (1992)
§ Combine communication and synchronization using RDMA
§ RDMA networks can provide various notifications:
§ Flags
§ Counters
§ Event queues

SLIDE 34–35

COMPARING APPROACHES

Message Passing: 1 latency + copy / 3 latencies
One Sided: 3 latencies
Notified Access: 1 latency

But how to notify?

SLIDE 36

PREVIOUS WORK: OVERWRITING INTERFACE

§ Flags (polling at the remote side)
§ Used in GASPI, DMAPP, NEON
§ Disadvantages:
§ Location of the flag is chosen at the sender side
§ Consumer needs at least one flag for every process
§ Polling a high number of flags is inefficient
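A thread-based sketch of the overwriting interface and its scalability problem: each producer overwrites its own flag at the consumer, so the consumer must scan one flag per potential producer. This is a stand-in for the semantics, not GASPI/DMAPP code.

```python
import threading

N = 8
flags = [0] * N                # one flag per potential producer
data = [None] * N

def producer(rank, payload):
    data[rank] = payload       # RDMA-style write of the data
    flags[rank] = 1            # overwrite "my" flag at the consumer

threads = [threading.Thread(target=producer, args=(r, r * 10)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# consumer: must scan every flag to discover completed accesses
arrived = [r for r in range(N) if flags[r] == 1]
```

The linear scan is the point: with many producers, polling cost and flag memory both grow with the number of processes.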

SLIDE 37

PREVIOUS WORK: COUNTING INTERFACE

§ Atomic counters (accumulate notifications → scalable)
§ Used in Split-C, LAPI, SHMEM counting puts, …
§ Disadvantages:
§ Dataflow applications may require many counters
§ High polling overhead to identify accesses
§ Does not preserve order (may not be linearizable)
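The counting interface in the same thread-based sketch: producers atomically increment a single counter at the consumer, which only needs to poll that one location. Scalable, but, as the slide notes, the counter alone cannot tell *which* producer wrote what, nor in which order. The lock stands in for a hardware atomic such as a counting put.

```python
import threading

N = 8
counter = 0
lock = threading.Lock()        # stands in for a hardware atomic
data = [None] * N

def producer(rank):
    global counter
    data[rank] = rank * 10     # RDMA-style write of the data
    with lock:
        counter += 1           # fetch-and-add style notification

threads = [threading.Thread(target=producer, args=(r,)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the counter reaches N the consumer knows that N accesses completed, but identifying them still requires inspecting the data, which is the polling overhead the slide criticizes.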

SLIDE 38

WHAT IS A GOOD NOTIFICATION INTERFACE?

§ Scalable to yotta-scale
§ Does memory or polling overhead grow with the number of processes?
§ Computation/communication overlap
§ Do we support maximum asynchrony? (better than MPI-1)
§ Complex data-flow graphs
§ Can we distinguish between different accesses locally?
§ Can we avoid starvation?
§ What about load balancing?
§ Ease of use
§ Does it use standard mechanisms?

SLIDE 39

OUR APPROACH: NOTIFIED ACCESS

§ Notifications with MPI-1-style (queue-based) matching
§ Retains the benefits of previous notification schemes
§ Poll only the head of the queue
§ Provides linearizable semantics

SLIDE 40–41

NOTIFIED ACCESS – AN MPI INTERFACE

§ Minor interface evolution
§ Leverages MPI two-sided <source, tag> matching
§ Wildcard matching with FIFO semantics

Example Communication Primitives / Example Synchronization Primitives
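Queue-based matching as Notified Access proposes can be sketched as follows: each completed access enqueues a (source, tag) notification, and the consumer waits on the queue with MPI-style <source, tag> matching, including wildcards, in FIFO order. The class and names below are illustrative, not the proposed MPI binding.

```python
from collections import deque

ANY = object()   # stands in for MPI_ANY_SOURCE / MPI_ANY_TAG

class NotificationQueue:
    def __init__(self):
        self.q = deque()

    def notify(self, source, tag):
        # posted when a notified put/get completes at the target
        self.q.append((source, tag))

    def wait(self, source=ANY, tag=ANY):
        # match oldest-first, honoring wildcards (FIFO semantics)
        for i, (s, t) in enumerate(self.q):
            if (source is ANY or s == source) and (tag is ANY or t == tag):
                del self.q[i]
                return (s, t)
        raise LookupError("no matching notification")

nq = NotificationQueue()
nq.notify(source=3, tag=7)
nq.notify(source=5, tag=7)
first = nq.wait(tag=7)        # wildcard source: FIFO yields the oldest match
specific = nq.wait(source=5)  # match on source, any tag
```

Because the consumer polls only the head of one queue and each notification identifies its access, this combines the scalability of counters with the ability to distinguish accesses that flags provide.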

SLIDE 42

NOTIFIED ACCESS – IMPLEMENTATION

§ foMPI – a fully functional MPI-3 RMA implementation
§ Runs on newer Cray machines (Aries, Gemini)
§ DMAPP: low-level networking API for Cray systems
§ XPMEM: a portable Linux kernel module
§ Implementation of Notified Access via uGNI [1]
§ Leverages uGNI queue semantics
§ Adds an unexpected queue
§ Uses a 32-bit immediate value to encode source and tag

[Diagram: processes on two computing nodes communicate via XPMEM (intra-node communication), DMAPP (inter-node non-notified communication), and uGNI (inter-node notified communication)]

[1] http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI_NA/
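Encoding <source, tag> into the single 32-bit immediate value that arrives with a completion event can be sketched as a bit-packing scheme. The 16/16 split below is an illustrative choice; the actual field widths are an implementation detail of foMPI-NA.

```python
def pack(source, tag):
    # pack source and tag into one 32-bit immediate (16 bits each here)
    assert 0 <= source < 2**16 and 0 <= tag < 2**16
    return (source << 16) | tag

def unpack(imm):
    # recover the pair on the target side from the completion event
    return imm >> 16, imm & 0xFFFF

imm = pack(source=1234, tag=42)
```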

SLIDE 43

EXPERIMENTAL SETTING

§ Piz Daint [1]
§ Cray XC30, Aries interconnect
§ 5,272 computing nodes (Intel Xeon E5-2670 + NVIDIA Tesla K20X)
§ Theoretical peak performance: 7.787 petaflops
§ Peak network bisection bandwidth: 33 TB/s

[1] http://www.cscs.ch

SLIDE 44

PING PONG PERFORMANCE (INTER-NODE)

§ 1000 repetitions, each timed separately, RDTSC timer
§ 95% confidence interval always within 1% of the median

(lower is better)
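The reporting methodology above (time each repetition separately, report the median, check that a 95% confidence interval lies within 1% of it) can be sketched with a nonparametric order-statistic CI for the median. This is one standard choice; the paper's exact procedure may differ, and the data below is synthetic.

```python
import random
import statistics

def median_with_ci(samples, z=1.96):
    xs = sorted(samples)
    n = len(xs)
    med = statistics.median(xs)
    # order-statistic indices for an approximate 95% CI of the median
    half = z * (n ** 0.5) / 2
    lo = xs[max(0, int(n / 2 - half))]
    hi = xs[min(n - 1, int(n / 2 + half))]
    return med, lo, hi

random.seed(0)                                       # synthetic "timings"
samples = [1000 + random.gauss(0, 5) for _ in range(1000)]
med, lo, hi = median_with_ci(samples)
within_1pct = lo >= 0.99 * med and hi <= 1.01 * med  # the slide's criterion
```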

SLIDE 45

PING PONG PERFORMANCE (INTRA-NODE)

§ 1000 repetitions, each timed separately, RDTSC timer
§ 95% confidence interval always within 1% of the median

(lower is better)

SLIDE 46

COMPUTATION/COMMUNICATION OVERLAP

§ 1000 repetitions, each timed separately, RDTSC timer
§ 95% confidence interval always within 1% of the median
§ Uses a communication progression thread

(lower is better)

SLIDE 47

PIPELINE – ONE-TO-ONE SYNCHRONIZATION

§ 1000 repetitions, each timed separately, RDTSC timer
§ 95% confidence interval always within 1% of the median

[1] https://github.com/intelesg/PRK2

(lower is better)

SLIDE 48

REDUCE – ONE-TO-MANY SYNCHRONIZATION

§ Reduce as an example (same for FMM, BH, etc.)
§ Small data (8 bytes), 16-ary tree
§ 1000 repetitions, each timed separately with RDTSC

(lower is better)

SLIDE 49

CHOLESKY – MANY-TO-MANY SYNCHRONIZATION

§ 1000 repetitions, each timed separately, RDTSC timer
§ 95% confidence interval always within 10% of the median

[1]: J. Kurzak, H. Ltaief, J. Dongarra, R. Badia: “Scheduling dense linear algebra operations on multicore processors”, CCPE 2010

(higher is better)

SLIDE 50

DISCUSSION AND CONCLUSIONS

§ A simple and fast solution
§ The interface lies between RMA and message passing
§ Similarity to MPI-1 eases adoption of NA
§ Richer semantics than current notification systems
§ Maintains the benefits of RDMA for producer/consumer
§ The effect on other RMA operations needs to be defined
§ Either synchronizing [1] or no effect
§ Currently discussed in the MPI Forum
§ Fully parameterized LogGP-like performance model

[1]: Kourosh Gharachorloo et al.: “Memory consistency and event ordering in scalable shared-memory multiprocessors”, ISCA’90

SLIDE 51–52

ACKNOWLEDGMENTS

Thank you for your attention

spcl.inf.ethz.ch

SLIDE 53

BACKUP SLIDES

SLIDE 54

NOTIFIED ACCESS – EXAMPLE

SLIDE 55

PERFORMANCE: APPLICATIONS

NAS 3D FFT [1] performance; MILC [2] application execution time
(scale to 512k procs; scale to 65k procs)

Annotations represent the performance gain of foMPI over Cray MPI-1.

[1] Nishtala et al.: “Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap”, IPDPS’09
[2] Shan et al.: “Accelerating applications at scale using one-sided communication”, PGAS’12

SLIDE 56

PERFORMANCE: MOTIF APPLICATIONS

Key/value store: random inserts per second; dynamic sparse data exchange (DSDE) with 6 neighbors

SLIDE 57

COMPARING APPROACHES – EXAMPLE

Overwriting Interface / Counting Interface / Notified Access

SLIDE 58–60

ONE SIDED – GET + SYNCHRONIZATION

Critical path: 3 messages

SLIDE 61

COMPARING APPROACHES