SLIDE 1

Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER

spcl.inf.ethz.ch  @spcl_eth

SLIDE 5

MPI-3.0 REMOTE MEMORY ACCESS

  • MPI-3.0 supports RMA ("MPI One Sided")
  • Designed to react to hardware trends
  • Majority of HPC networks support RDMA
  • Communication is "one sided" (no involvement of the destination)
  • RMA decouples communication & synchronization
  • Different from message passing

[Figure: two sided (Proc A sends, Proc B receives; communication and synchronization are coupled) vs. one sided (Proc A puts into Proc B's memory; communication and synchronization are separate, explicit sync)]

[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
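
To make the contrast concrete, here is a minimal sketch (not from the slides; it assumes at least two processes and standard MPI-3 C bindings) of the same transfer done two sided and one sided:

  /* Two-sided vs. one-sided transfer of a single int (illustrative sketch). */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, value = 0, newval = 43;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Two sided: both processes take part in the transfer. */
      if (rank == 0)      MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      else if (rank == 1) MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      /* One sided: only the origin names the transfer; the destination is not involved.
         Synchronization (here a fence epoch) is decoupled from the communication call. */
      MPI_Win_create(&value, sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
      MPI_Win_fence(0, win);
      if (rank == 0) MPI_Put(&newval, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
      MPI_Win_fence(0, win);                     /* completes the put everywhere */

      if (rank == 1) printf("rank 1 now holds %d\n", value);
      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }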

SLIDE 6

PRESENTATION OVERVIEW

  • 1. Overview of three MPI-3 RMA concepts
  • 2. MPI window creation
  • 3. Communication
  • 4. Synchronization
  • 5. Application evaluation

SLIDE 7

MPI-3 RMA COMMUNICATION OVERVIEW

  • Non-atomic communication calls (Put, Get)
  • Atomic communication calls (Acc, Get & Acc, CAS, FAO)

[Figure: active processes B, C, D, … issue Put/Get/atomic operations into the MPI window exposed in the memory of passive process A]
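
As a concrete illustration (not from the slides; the ranks, window size, and the fence epoch are illustrative, and at least two processes are assumed), one non-atomic and one atomic call issued by an active process against another process's window:

  /* One non-atomic call (MPI_Get) and one atomic call (MPI_Fetch_and_op). */
  #include <mpi.h>
  #include <stdint.h>

  int main(int argc, char **argv) {
      int rank;
      int64_t *base, remote_val = 0, one = 1, old = 0;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Win_allocate(2 * sizeof(int64_t), sizeof(int64_t), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &base, &win);
      base[0] = rank;                 /* exposed data    */
      base[1] = 0;                    /* exposed counter */

      MPI_Win_fence(0, win);
      if (rank == 1) {                /* an "active" process */
          MPI_Get(&remote_val, 1, MPI_INT64_T, 0, 0, 1, MPI_INT64_T, win);   /* non-atomic */
          MPI_Fetch_and_op(&one, &old, MPI_INT64_T, 0, 1, MPI_SUM, win);     /* atomic FAO */
      }
      MPI_Win_fence(0, win);          /* process 0 takes part only in the fences */

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }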

SLIDE 12

MPI-3.0 RMA SYNCHRONIZATION OVERVIEW

  • Active target mode: Fence, Post/Start/Complete/Wait
  • Passive target mode: Lock, Lock All

[Figure: synchronization and communication between an active and a passive process]

SLIDE 20

SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION

  • Scalable & generic protocols
  • Can be used on any RDMA network (e.g., OFED/IB)
  • Window creation, communication and synchronization
  • foMPI, a fully functional MPI-3 RMA implementation
  • DMAPP: lowest-level networking API for Cray Gemini/Aries systems
  • XPMEM: a portable Linux kernel module

http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI

SLIDE 23

PART 1: SCALABLE WINDOW CREATION

Traditional windows
  • Backwards compatible (MPI-2)
  • Every process may expose a different base address (e.g., 0x111, 0x123, 0x120)
  • Time bound: O(p), Memory bound: O(p)

p = total number of processes

SLIDE 24

PART 1: SCALABLE WINDOW CREATION

Allocated windows
  • Allows MPI to allocate the memory; the same base address (e.g., 0x123) can be used on every process
  • Time bound: O(log p) (whp), Memory bound: O(1)

p = total number of processes

SLIDE 25

PART 1: SCALABLE WINDOW CREATION

Dynamic windows
  • Local attach/detach, most flexible
  • Processes expose arbitrary local addresses (e.g., 0x111, 0x123, 0x129)
  • Time bound: O(p), Memory bound: O(p)

p = total number of processes
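
A compact sketch (not from the slides; buffer sizes are illustrative) showing how the three window flavors above are created:

  /* Traditional, allocated, and dynamic MPI-3 windows. */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      /* Traditional (MPI-2 compatible): the user supplies the memory;
         every process may pass a different base address. */
      int *buf = malloc(1024 * sizeof(int));
      MPI_Win win_created;
      MPI_Win_create(buf, 1024 * sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win_created);

      /* Allocated: MPI allocates the memory, which lets the implementation
         place it at the same address/offset on every process. */
      int *abuf;
      MPI_Win win_allocated;
      MPI_Win_allocate(1024 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &abuf, &win_allocated);

      /* Dynamic: the window starts empty; memory is attached and
         detached locally at any time. */
      MPI_Win win_dynamic;
      MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win_dynamic);
      int *dbuf = malloc(256 * sizeof(int));
      MPI_Win_attach(win_dynamic, dbuf, 256 * sizeof(int));
      /* ... communicate using addresses exchanged out of band ... */
      MPI_Win_detach(win_dynamic, dbuf);

      MPI_Win_free(&win_dynamic);
      MPI_Win_free(&win_allocated);
      MPI_Win_free(&win_created);
      free(dbuf); free(buf);
      MPI_Finalize();
      return 0;
  }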

SLIDE 26

PART 2: COMMUNICATION

  • Put and Get:
  • Direct DMAPP put and get operations or local (blocking) memcpy (XPMEM)
  • Accumulate:
  • DMAPP atomic operations for 64 bit types
  • ...or fall back to remote locking protocol
  • MPI datatype handling with MPITypes library [1]
  • Fast path for contiguous data transfers of common intrinsic datatypes (e.g., MPI_DOUBLE)

[Figure: for contiguous memory, MPI_Put maps to dmapp_put_nbi and MPI_Compare_and_swap maps to dmapp_acswap_qw_nbi at the remote process]

[1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM'09
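
A sketch (not from the slides; the target rank and displacements are illustrative, and at least two processes are assumed) of a non-atomic and an atomic call from the list above, inside a passive-target epoch and on a 64-bit type, the case for which foMPI can use hardware atomics:

  /* MPI_Put plus MPI_Compare_and_swap on an allocated window of 64-bit integers. */
  #include <mpi.h>
  #include <stdint.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank;
      int64_t *base;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Win_allocate(8 * sizeof(int64_t), sizeof(int64_t), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &base, &win);
      for (int i = 0; i < 8; i++) base[i] = 0;
      MPI_Barrier(MPI_COMM_WORLD);              /* everyone finished initializing */

      MPI_Win_lock_all(0, win);                 /* passive target: no action by the targets */
      if (rank == 0) {
          int64_t val = 7, compare = 0, swap = 1, result;
          MPI_Put(&val, 1, MPI_INT64_T, 1, 0, 1, MPI_INT64_T, win);
          MPI_Compare_and_swap(&swap, &compare, &result, MPI_INT64_T, 1, 1, win);
          MPI_Win_flush(1, win);                /* both operations are remotely complete here */
          printf("CAS read the previous value %lld\n", (long long)result);
      }
      MPI_Win_unlock_all(win);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }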

SLIDE 27

PERFORMANCE INTER-NODE: LATENCY

Put Inter-Node: 20% faster    Get Inter-Node: 80% faster

[Figure: half ping-pong benchmark — Proc 0 issues put + sync, Proc 1 only exposes memory]

SLIDE 28

PERFORMANCE INTRA-NODE: LATENCY

Put/Get Intra-Node: 3x faster

[Figure: half ping-pong benchmark — Proc 0 issues put + sync, Proc 1 only exposes memory]

SLIDE 29

PERFORMANCE: OVERLAP

Inter-node overlap (in %). Useful for, e.g., scientific codes: 3D FFT, MILC, AWM-Olsen seismic.

[Figure: Proc 0 overlaps computation with put + sync, Proc 1 only exposes memory]

SLIDE 30

PERFORMANCE: MESSAGE RATE

[Figure: intra-node and inter-node message rates — Proc 0 issues many puts followed by a sync, Proc 1 only exposes memory]

SLIDE 31

PERFORMANCE: ATOMICS

64 bit integers

[Figure: hardware-accelerated (proprietary) protocol: lower latency; fall-back protocol: higher bandwidth]

SLIDE 32

PART 3: SYNCHRONIZATION

  • Active target mode: Fence, Post/Start/Complete/Wait
  • Passive target mode: Lock, Lock All

[Figure: synchronization and communication between an active and a passive process]

SLIDE 33

SCALABLE FENCE IMPLEMENTATION

  • Collective call
  • Completes all outstanding memory operations

int MPI_Win_fence(...) {
  asm("mfence");
  dmapp_gsync_wait();
  MPI_Barrier(...);
  return MPI_SUCCESS;
}

[Figure: Node 0 (Proc 0, Proc 1) and Node 1 (Proc 2, Proc 3) with outstanding puts between them]

SLIDE 34

SCALABLE FENCE IMPLEMENTATION

  • Collective call
  • Completes all outstanding memory operations

int MPI_Win_fence(...) {
  asm("mfence");          /* local completion (XPMEM) */
  dmapp_gsync_wait();
  MPI_Barrier(...);
  return MPI_SUCCESS;
}

SLIDE 35

SCALABLE FENCE IMPLEMENTATION

  • Collective call
  • Completes all outstanding memory operations

int MPI_Win_fence(...) {
  asm("mfence");
  dmapp_gsync_wait();     /* local completion (DMAPP) */
  MPI_Barrier(...);
  return MPI_SUCCESS;
}

SLIDE 36

SCALABLE FENCE IMPLEMENTATION

  • Collective call
  • Completes all outstanding memory operations

int MPI_Win_fence(...) {
  asm("mfence");
  dmapp_gsync_wait();
  MPI_Barrier(...);       /* global completion (barrier) */
  return MPI_SUCCESS;
}
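
For reference, the user-level pattern served by this implementation is the familiar fence epoch; a minimal sketch (not from the slides; each rank writes its rank to its right neighbour):

  /* Active-target fence epochs around one-sided puts. */
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, nprocs, *base;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
      *base = -1;

      MPI_Win_fence(0, win);                               /* opens the epoch (collective) */
      MPI_Put(&rank, 1, MPI_INT, (rank + 1) % nprocs, 0, 1, MPI_INT, win);
      MPI_Win_fence(0, win);                               /* completes all outstanding puts */

      /* *base now holds the rank of the left neighbour */
      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }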

SLIDE 37

SCALABLE FENCE PERFORMANCE

Time bound: O(log p), Memory bound: O(1)

90% faster

SLIDE 38

PSCW SYNCHRONIZATION

  • The posting process opens an exposure epoch (post … wait): it allows access from other processes
  • The starting process opens an access epoch (start … complete): it is allowed to access other processes and issues its puts in between
  • Post and start calls are paired by a matching algorithm

[Figure: Proc 0 and Proc 1 with their access and exposure epochs and the puts issued inside them]
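
A sketch of the user-level calls (not from the slides; rank 0 as the only starting process and rank 1 as the only posting process are illustrative, and at least two processes are assumed):

  /* General active target synchronization (PSCW) between two ranks. */
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, partner, value = -1;
      MPI_Group world_group, peer_group;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_group(MPI_COMM_WORLD, &world_group);
      partner = (rank == 0) ? 1 : 0;
      MPI_Group_incl(world_group, 1, &partner, &peer_group);

      MPI_Win_create(&value, sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win);

      if (rank == 1) {                    /* posting process: exposure epoch */
          MPI_Win_post(peer_group, 0, win);
          MPI_Win_wait(win);              /* returns once all starters completed */
      } else if (rank == 0) {             /* starting process: access epoch */
          int mine = 42;
          MPI_Win_start(peer_group, 0, win);
          MPI_Put(&mine, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
          MPI_Win_complete(win);          /* the put is done when this returns */
      }

      MPI_Group_free(&peer_group);
      MPI_Group_free(&world_group);
      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }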

SLIDE 39

PSCW SYNCHRONIZATION

[Figure: several concurrent post/wait and start/complete pairings among Proc 0–5]

SLIDE 40

PSCW SCALABLE POST/START MATCHING

  • In general, there can be n posting and m starting processes
  • In this example there is one posting process and 4 starting processes

[Figure: posting process i (opens its window) and starting processes j1–j4 (access the remote window)]

SLIDE 42

PSCW SCALABLE POST/START MATCHING

  • Each starting process has a local list

[Figure: each starting process j1–j4 holds a local list; posting process i]

SLIDE 43

PSCW SCALABLE POST/START MATCHING

  • Posting process i adds its rank i to a list at each starting process j1, ..., j4
  • Each starting process j waits until the rank of the posting process i is present in its local list

[Figure: rank i is written into the local lists of j1–j4]

SLIDE 54

PSCW SCALABLE COMPLETE/WAIT MATCHING

  • Each starting process increments a counter stored at the posting process
  • When the counter is equal to the number of starting processes, the posting process returns from wait

[Figure: starting processes j1–j4 increment the counter at posting process i until it reaches 4]
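
An illustrative-only sketch of this counter idea, written with MPI atomics rather than the DMAPP calls foMPI uses; the window layout, the busy-wait loop, and treating every rank other than 0 as a starting process are simplifications made for this sketch:

  /* "Starters" increment a counter at the "posting" process; the posting
     process spins until the counter equals the number of starters. */
  #include <mpi.h>
  #include <stdint.h>

  int main(int argc, char **argv) {
      int rank, nprocs;
      int64_t *counter;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      MPI_Win_allocate(sizeof(int64_t), sizeof(int64_t), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &counter, &win);
      *counter = 0;
      MPI_Barrier(MPI_COMM_WORLD);
      MPI_Win_lock_all(0, win);

      if (rank != 0) {                       /* starting processes signal completion */
          int64_t one = 1, before;
          MPI_Fetch_and_op(&one, &before, MPI_INT64_T, 0, 0, MPI_SUM, win);
          MPI_Win_flush(0, win);
      } else {                               /* posting process: "wait" */
          int64_t dummy = 0, seen = 0;
          while (seen < nprocs - 1) {        /* busy wait; foMPI's real protocol differs */
              MPI_Fetch_and_op(&dummy, &seen, MPI_INT64_T, rank, 0, MPI_NO_OP, win);
              MPI_Win_flush(rank, win);
          }
      }

      MPI_Win_unlock_all(win);
      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }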

SLIDE 55

PSCW PERFORMANCE

Time bound: start = wait = O(1); post = complete = O(log p)
Memory bound: O(log p) (for scalable programs)

Ring topology

SLIDE 56

SCALABLE LOCK SYNCHRONIZATION

  • Lock/Unlock (shared/exclusive), Lock All (always shared)
  • Two-level lock hierarchy:
  • Each process holds a local shared counter and an exclusive bit
  • A master process holds a global shared counter and a global exclusive counter

[Figure: local counters on processes 0 … P-1, global counters on the master process]
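
The user-level calls served by this hierarchy are the passive-target lock routines; a short sketch (not from the slides; at least three processes are assumed so that rank 2 can lock rank 1, mirroring the next slides):

  /* Exclusive lock on one target, then a shared lock on the whole window. */
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, *base, one = 1;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &base, &win);
      *base = 0;
      MPI_Barrier(MPI_COMM_WORLD);

      if (rank == 2) {                                   /* exclusive lock on process 1 */
          MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
          MPI_Put(&one, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
          MPI_Win_unlock(1, win);                        /* completes the put */
      }
      MPI_Barrier(MPI_COMM_WORLD);

      MPI_Win_lock_all(0, win);                          /* shared lock on every process */
      /* ... any process may now issue puts/gets/atomics to any target ... */
      MPI_Win_unlock_all(win);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }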

SLIDE 58

EXCLUSIVE LOCAL LOCK: TWO PHASES

Proc 2 wants to lock Proc 1 exclusively: MPI_Win_lock( EXCL, 1 )

  • PHASE 1: increment the global exclusive counter with a fetch-add at the master process (Invariant 1: no global shared lock held concurrently)

SLIDE 59

EXCLUSIVE LOCAL LOCK: TWO PHASES

Proc 2 wants to lock Proc 1 exclusively: MPI_Win_lock( EXCL, 1 )

  • PHASE 2: set the exclusive bit at the target process with a compare & swap (Invariant 2: no local shared/exclusive lock held concurrently)

SLIDE 61

SHARED LOCAL LOCK: ONE PHASE

Proc 0 wants to lock Proc 1: MPI_Win_lock( SHRD, 1 )

  • Increment the local shared counter with a fetch-add (Invariant: no local exclusive lock on this process held concurrently)

SLIDE 63

SHARED GLOBAL LOCK: ONE PHASE

Proc 2 wants to lock the whole window: MPI_Win_lock_all()

  • Increment the global shared counter with a fetch-add (Invariant: no local exclusive lock is held concurrently)
  • Constant number of operations for p processes

SLIDE 64

FLUSH SYNCHRONIZATION

  • Guarantees remote completion
  • Issues a remote bulk synchronization and an x86 mfence
  • One of the most performance-critical functions; we add only 78 x86 CPU instructions to the critical path

Time bound: O(1), Memory bound: O(1)

[Figure: Process 0 repeatedly increments a counter at Process 1, then flushes]
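
A sketch matching the figure (not from the slides; the target rank and the number of increments are illustrative, and at least two processes are assumed): process 0 issues several accumulates to a counter on process 1 and uses MPI_Win_flush to wait for their remote completion without closing the epoch:

  /* Flush completes all pending operations at one target. */
  #include <mpi.h>
  #include <stdint.h>

  int main(int argc, char **argv) {
      int rank;
      int64_t *counter;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Win_allocate(sizeof(int64_t), sizeof(int64_t), MPI_INFO_NULL,
                       MPI_COMM_WORLD, &counter, &win);
      *counter = 0;
      MPI_Barrier(MPI_COMM_WORLD);

      MPI_Win_lock_all(0, win);
      if (rank == 0) {
          int64_t one = 1;
          for (int i = 0; i < 3; i++)      /* three increments at process 1 */
              MPI_Accumulate(&one, 1, MPI_INT64_T, 1, 0, 1, MPI_INT64_T, MPI_SUM, win);
          MPI_Win_flush(1, win);           /* returns once all three are remotely complete */
      }
      MPI_Win_unlock_all(win);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }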

SLIDE 66

PERFORMANCE

  • Evaluation on the Blue Waters system
  • 22,640 computing Cray XE6 nodes
  • 724,480 schedulable cores
  • All microbenchmarks
  • 4 applications
  • One nearly full-scale run

SLIDE 67

PERFORMANCE: MOTIF APPLICATIONS

Key/Value Store: Random Inserts per Second
Dynamic Sparse Data Exchange (DSDE) with 6 neighbors

SLIDE 68

PERFORMANCE: APPLICATIONS

NAS 3D FFT [1] performance    MILC [2] application execution time
(scaling to 512k and 65k procs)

Annotations represent the performance gain of foMPI over Cray MPI-1.

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS'09
[2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS'12

SLIDE 74

CONCLUSIONS & SUMMARY

  • 1. MPI window creation routines
  • 2. Non-atomic & atomic communication
  • 3. Fence / PSCW
  • 4. Locks
  • 5. foMPI reference implementation

SLIDE 77

ACKNOWLEDGMENTS

Thanks to: Timo Schneider, Greg Bauer, Bill Kramer, Duncan Roweth, Nick Wright, Paul Hargrove (and the whole UPC team) and the MPI Forum RMA WG ... and the institutions.

SLIDE 78

Thank you for your attention

http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI

SLIDE 79

Backup slides

SLIDE 83

DYNAMIC WINDOW CREATION

  • Each process keeps a list of attached memory regions (e.g., 0x111, 0x120, 0x129, 0x123) together with a counter (id) that changes on attach/detach
  • Before accessing a target, the origin fetches the target's id with a Get
  • If the fetched id matches the cached id, the window is accessed directly
  • If the cached id is stale, the origin first updates its cached list, then accesses the window

[Figure: Process A fetches Process B's id (2); cached id 2 matches → access the window; cached id 1 is stale → Update(list), then access the window]
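
For reference, the user-level pattern behind this machinery (not from the slides; the address exchange via MPI_Bcast and the single attached buffer are illustrative, and at least two processes are assumed): on a dynamic window the target attaches memory and publishes its address, which the origin then uses as the displacement:

  /* Communicating on a dynamic window using an explicitly exchanged address. */
  #include <mpi.h>
  #include <stdlib.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, *buf = NULL;
      MPI_Aint remote_addr = 0;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      if (rank == 1) {                       /* target: attach local memory */
          buf = malloc(sizeof(int));
          *buf = 123;
          MPI_Win_attach(win, buf, sizeof(int));
          MPI_Get_address(buf, &remote_addr);
      }
      MPI_Bcast(&remote_addr, 1, MPI_AINT, 1, MPI_COMM_WORLD);

      MPI_Win_lock_all(0, win);
      if (rank == 0) {                       /* origin: displacement = absolute address */
          int value = 0;
          MPI_Get(&value, 1, MPI_INT, 1, remote_addr, 1, MPI_INT, win);
          MPI_Win_flush(1, win);
          printf("got %d\n", value);
      }
      MPI_Win_unlock_all(win);

      if (rank == 1) { MPI_Win_detach(win, buf); free(buf); }
      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }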

SLIDE 84

PERFORMANCE INTER-NODE: LATENCY (SHMEM)

Put Inter-Node, Get Inter-Node

[Figure: half ping-pong benchmark — Proc 0 issues put + sync, Proc 1 only exposes memory]

SLIDE 85

PERFORMANCE INTRA-NODE: LATENCY (SHMEM)

Put/Get Intra-Node

[Figure: half ping-pong benchmark — Proc 0 issues put + sync, Proc 1 only exposes memory]

SLIDE 86

EXCLUSIVE LOCAL LOCK: TWO PHASES

Proc 2 wants to lock Proc 1 exclusively

  • BACKOFF: decrement the global exclusive counter (Add(-1) at the master process)
  • ...then retry

SLIDE 87

PERFORMANCE MODELING

Performance functions for synchronization protocols:
  Fence:  T_fence = 2.9 μs · log2(p)
  PSCW:   T_start = 0.7 μs,  T_wait = 1.8 μs,  T_post = T_complete = 350 ns · k
  Locks:  T_lock,excl = 5.4 μs,  T_lock,shrd = T_lock_all = 2.7 μs,  T_unlock = T_unlock_all = 0.4 μs
          T_flush = 76 ns,  T_sync = 17 ns

Performance functions for communication protocols:
  Put/Get:  T_put = 0.16 ns · s + 1 μs,  T_get = 0.17 ns · s + 1.9 μs
  Atomics:  T_acc,sum = 28 ns · s + 2.4 μs,  T_acc,min = 0.8 ns · s + 7.3 μs

(p: number of processes, k: group size, s: message size)