SLIDE 1

rmalloc() and rpipe() – a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging

Udayanga Wickramasinghe, Indiana University; Andrew Lumsdaine, Pacific Northwest National Laboratory

SLIDE 2

Overview

§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work

SLIDE 3

RDMA Network Communication

[Figure: a conventional network operation goes through the kernel and CPU directly; RDMA is kernel+CPU bypass with zero copy.]

Designed for one-sided communication!!

SLIDE 4

One-sided Communication

Advantages
§ Great for random access + irregular data patterns
§ Less overhead / high performance

Disadvantages
§ Explicit synchronization – separate from the data path!!

SLIDE 5

RDMA Challenges – Communication

[Figure: sender and receiver each pin and register buffers with their NICs, exchange and match registration information, then communicate.]

§ Buffer pin/registration
§ Rendezvous
§ Model-imposed overheads

SLIDE 6

RDMA Challenges – Synchronization

[Figure: communication bracketed by Barrier/Fence calls; the target's exposure epoch must line up with the origin's access epoch.]

How to make reads and updates visible? “in-use” / “re-use”

SLIDE 7

RDMA Challenges – Dynamic Memory Management

Cluster-wide allocations → costly in a dynamic context, i.e. PGAS

SLIDE 8

RDMA Challenges – Programming

[Figure: after a register/match exchange, repeated RDMA PUTs to 0x1F0000 race with a local Load/Inc of 0x1F0000: a data race!!! The issues are delivery completion and buffer re-use.]

SLIDE 9

Challenges – Programming

§ Enforcing “in-use”/“re-use” semantics
  – Flow control – credit based, counter based, polling (CQ based)
§ Enforcing completion semantics
  – MPI 3.0 active/passive – barriers, fence, lock, unlock, flush
  – GAS/PGAS based (SHMEM, X10, Titanium) – futures, barriers, locks, actions
  – GASNet-like (RDMA) libraries – the user has to implement it

§ Explicit and complex to implement for applications!!

SLIDE 10

Challenges – Summary

§ Low-overhead, high-throughput communication?
  – Eliminate unnecessary overheads.
§ Dynamic on-demand RDMA memory?
  – Allocate/de-allocate with heuristics support.
  – Less coherence traffic and possibly better utilization.
§ Scalable synchronization?
  – Completion and buffer in-use/re-use.
§ RDMA programming abstractions for applications?
  – No explicit synchronization – let the middleware handle it transparently.
  – Expose lightweight RDMA-ready memory and operations.

SLIDE 11

How rmalloc()/rpipe() meet these Challenges

Problem | Key Idea
Low communication overhead | Fast-path (MMIO vs. doorbell) network operation (in uGNI) with synchronized updates
Dynamic RDMA memory mgmt | Per-endpoint dynamic RDMA heap → heuristics + asymmetric allocation
Synchronization | Notification Flags with Polling (NFP)
Programmability | A familiar two-level abstraction → allocator (rmalloc) + stream-like channel (rpipe) → no explicit synchronization

SLIDE 12

Overview

§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work

SLIDE 13

System Overview

SLIDE 14

System Overview

High-Performance RDMA Channel
§ Expose zero-copy RDMA ops
§ Interface/s
  • rread()
  • rwrite()

Enable Implicit Synchronization
§ NFP (Notified Flags with Polling) – sketched below
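
The NFP idea: the writer delivers the payload and then writes a small notification flag placed right after it; the consumer simply polls that flag, so no separate synchronization message is needed. Below is a minimal shared-memory analogue of that flag-polling pattern; the slot layout and all names are illustrative assumptions rather than the library's API, and the real library performs the two writes over RDMA.

    /* Minimal shared-memory analogue of notified-flags-with-polling (NFP).
     * The slot layout (payload followed immediately by a flag word) and all
     * names here are illustrative, not the library's API.                  */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <string.h>

    #define PAYLOAD_LEN 64

    struct nfp_slot {
        char payload[PAYLOAD_LEN];   /* data lands here first          */
        atomic_int flag;             /* notification flag written last */
    };

    static struct nfp_slot slot;

    static void *writer(void *arg) {
        (void)arg;
        /* Step 1: deliver the payload (stands in for the RDMA data PUT). */
        strcpy(slot.payload, "hello from the writer");
        /* Step 2: write the flag just after the data; release ordering keeps
         * the payload visible before the flag, mirroring the ordered PUTs.  */
        atomic_store_explicit(&slot.flag, 1, memory_order_release);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        atomic_init(&slot.flag, 0);
        pthread_create(&t, NULL, writer, NULL);
        /* Consumer side: poll the flag instead of waiting on a separate
         * synchronization message or a barrier.                          */
        while (atomic_load_explicit(&slot.flag, memory_order_acquire) == 0)
            ;  /* spin */
        printf("consumed: %s\n", slot.payload);
        pthread_join(t, NULL);
        return 0;
    }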

SLIDE 15

System Overview

Allocates RDMA memory
§ Returns network-compatible memory
§ Dynamic asymmetric heap for RDMA
§ Interface/s
  • rmalloc()

Allocation policies
§ Next-fit, First-fit

SLIDE 16

System Overview

Network Backend
§ Cray-specific – uGNI
§ MPI 3.0 based (portability layer)

Cray uGNI
§ FMA/BTE support
§ Memory registration
§ CQ handling

SLIDE 17

“rmalloc”

Asymmetric heaps across cluster

  • 0 or more for each endpoint pair
  • dynamically created

SLIDE 18

“rmalloc” Allocation

Next-fit heuristic – return the next available RDMA heap segment (see the sketch after this slide)

[Figure: an rmalloc instance with L - local heap, S - shadow heap, R - remote heap; segments marked unused/used.]

Synchronization → a special bootstrap rpipe
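
To make the next-fit policy concrete, here is a rough, self-contained sketch that scans a circular list of heap segments starting from where the previous allocation stopped. The seg_t record, next_fit(), and the tiny segment list are hypothetical; the actual allocator additionally maintains the shadow and remote heap state shown above.

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical record for one segment of the RDMA heap. */
    typedef struct seg {
        size_t      offset;   /* offset into the registered RDMA heap */
        size_t      len;      /* segment length in bytes              */
        int         used;     /* 0 = unused, 1 = used                 */
        struct seg *next;     /* circular list of heap segments       */
    } seg_t;

    static seg_t segs[3] = {
        {0,   256, 1, &segs[1]},
        {256, 128, 0, &segs[2]},
        {384, 512, 0, &segs[0]},
    };
    static seg_t *rover = &segs[0];   /* where the previous search stopped */

    /* Next-fit: return the first unused segment at or after the rover that
     * can hold `size` bytes, so repeated allocations walk the heap forward. */
    static seg_t *next_fit(size_t size) {
        seg_t *s = rover;
        do {
            if (!s->used && s->len >= size) {
                s->used = 1;
                rover   = s->next;    /* the next search resumes after this hit */
                return s;
            }
            s = s->next;
        } while (s != rover);
        return NULL;                  /* no free segment is large enough */
    }

    int main(void) {
        seg_t *a = next_fit(100);     /* lands in the 128-byte hole at offset 256 */
        seg_t *b = next_fit(300);     /* lands in the 512-byte hole at offset 384 */
        if (a && b)
            printf("allocated at offsets %zu and %zu\n", a->offset, b->offset);
        return 0;
    }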

SLIDE 19

“rmalloc” Allocation

Best-fit heuristic – find the smallest possible RDMA heap segment

[Figure: an rmalloc instance with L - local heap, S - shadow heap, R - remote heap; segments marked unused/used.]

SLIDE 20

“rmalloc” Allocation

Worst-fit heuristic – find the largest possible RDMA heap segment

[Figure: an rmalloc instance with L - local heap, S - shadow heap, R - remote heap; segments marked unused/used.]

SLIDE 21

“rmalloc” Implementation

rmalloc_descriptor → manages local and remote virtual memory (one possible layout is sketched below)
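
The slide only names the rmalloc_descriptor. As one possible reading of the local/shadow/remote (L/S/R) heap picture from the preceding slides, such a descriptor could bundle the three views roughly as below; every field name here is a guess for illustration, not the actual structure.

    #include <stdint.h>
    #include <stddef.h>

    /* Purely illustrative: one descriptor per endpoint pair, holding the
     * three heap views (L/S/R) shown on the allocation slides.           */
    typedef struct rmalloc_descriptor_sketch {
        /* L: locally registered RDMA heap this endpoint allocates from. */
        void     *local_base;
        size_t    local_len;
        uint64_t  local_mem_handle;   /* NIC registration handle         */

        /* S: shadow copy of the peer's heap state (free/used tags),
         * kept current by the rfree()/rmalloc() synchronization.        */
        void     *shadow_tags;
        size_t    shadow_ntags;

        /* R: remote heap coordinates needed to target RDMA operations. */
        uint64_t  remote_base;
        size_t    remote_len;
        uint64_t  remote_mem_handle;
    } rmalloc_descriptor_sketch_t;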

SLIDE 22

rfree()/rmalloc() Synchronization

§ When to synchronize? Buffer “in-use/re-use”
  – Two options; both are used, for different allocation modes
    • At allocation time → latency (i.e. rmalloc())
    • At de-allocation time → throughput (i.e. rfree())

§ Deferred synchronization by rfree() → next-fit
  – Coalesce tags from a sorted free list (see the sketch after this list)
  – rmalloc updates state by RDMA into the coalesced tag list on the remote side

§ Immediate synchronization by rmalloc() → best-fit OR worst-fit
  – Uses a special bootstrap rpipe to synchronize each memory allocation
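
A minimal sketch of the deferred-free idea, under illustrative assumptions: the free_tag_t layout, defer_free(), and push_tags_rdma() are hypothetical names. rfree() only records a tag in a sorted local list and merges touching neighbours; the coalesced list is later published to the peer's shadow heap state in a single deferred RDMA update instead of one update per free.

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical free tag: a [offset, offset+len) hole in the remote heap. */
    typedef struct { size_t offset, len; } free_tag_t;

    static free_tag_t tags[128];
    static size_t     ntags = 0;

    /* Stand-in for the single RDMA write that publishes the coalesced list
     * into the peer's shadow heap state (deferred, off the critical path).  */
    static void push_tags_rdma(const free_tag_t *t, size_t n) {
        (void)t;
        printf("RDMA-update %zu coalesced tag(s)\n", n);
    }

    /* rfree()-side bookkeeping: insert the tag in sorted order, then merge
     * it with adjacent tags so the eventual remote update stays small.     */
    static void defer_free(size_t offset, size_t len) {
        size_t i = ntags;
        while (i > 0 && tags[i - 1].offset > offset) { tags[i] = tags[i - 1]; i--; }
        tags[i].offset = offset;
        tags[i].len    = len;
        ntags++;

        size_t w = 0;                         /* coalesce touching neighbours */
        for (size_t r = 1; r < ntags; r++) {
            if (tags[w].offset + tags[w].len == tags[r].offset)
                tags[w].len += tags[r].len;   /* merge into the previous tag  */
            else
                tags[++w] = tags[r];
        }
        ntags = w + 1;
    }

    int main(void) {
        defer_free(0,   64);
        defer_free(128, 64);
        defer_free(64,  64);          /* fills the gap: all three coalesce    */
        push_tags_rdma(tags, ntags);  /* one deferred update instead of three */
        return 0;
    }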

SLIDE 23

“rpipe” – rwrite()

§ Completion Queue (CQ): lightweight events raised by the NIC/HCA

  • 1. Initiate the RDMA write.
    – Source buffer → “in-use”

SLIDE 24

“rpipe” – rwrite()

  • 2. Probe the local CQ for completion.
    – Zero-copy of the source data to the target.

SLIDE 25

“rpipe” – rwrite()

  • 3. Write to the flag placed just after the data.

SLIDE 26

“rpipe” – rwrite()

  • 4. Probe the local CQ for success.
    – Source buffer → “re-use”

SLIDE 27

“rpipe” – rwrite()

  • 5. Probe the flag for success.
    – The target buffer is ready for loads/ops.

SLIDE 28

“rpipe” – rwrite()

  • 6. The remote host consumes the data (e.g. Load 0x1F0000).
    – The source does not yet know the buffer has been consumed → rfree()

SLIDE 29

“rpipe” – rread()

  • 1. Store data into the target buffer (e.g. Store 0x1F0000, val).
    – Target buffer → “in-use”

SLIDE 30

“rpipe” – rread()

  • 2. Write to the source's flag.
    – The data is now ready for rread()!!

SLIDE 31

“rpipe” – rread()

  • 3. RDMA zero-copy of the data to the source.

SLIDE 32

“rpipe” – rread()

  • 4. Write to the flag placed just after the data.

SLIDE 33

“rpipe” – rread()

  • 5. Probe the local CQ for completion.

SLIDE 34

Implementing rpipe(), rwrite() and rread()

§ An rpipe is created between two endpoints.
  – A uGNI-based Control Message (FMA Cmsg) network lazily initializes the rpipe, i.e. GNI_CqCreate, GNI_EpCreate, GNI_EpBind.
§ Implements rwrite(), rread() in uGNI
  – Small/medium messages – FMA (Fast Memory Access)
  – Large messages – BTE (Byte Transfer Engine)
§ MPI portability layer
  – rpipe with MPI-3.0 windows + passive RMA (see the sketch below)
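
To make the data-then-flag sequence from the rwrite() walkthrough concrete, here is a minimal sketch written directly against MPI-3.0 passive-target RMA, in the spirit of the portability layer above. The window layout (one payload word followed by one flag word) and the fixed writer/reader ranks are illustrative assumptions, not the library's actual code.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, *base;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Two ints per process: [0] = payload slot, [1] = notification flag. */
        MPI_Win_allocate(2 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &base, &win);
        base[0] = 0;
        base[1] = 0;
        MPI_Win_lock_all(0, win);          /* one long passive-target epoch   */
        MPI_Barrier(MPI_COMM_WORLD);       /* all windows are now initialized */

        if (rank == 0) {                   /* writer: data first, then flag   */
            int value = 42, flag = 1, target = 1;
            MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
            MPI_Win_flush(target, win);    /* data is remotely complete       */
            MPI_Put(&flag, 1, MPI_INT, target, 1, 1, MPI_INT, win);
            MPI_Win_flush(target, win);    /* source buffer is reusable now   */
        } else if (rank == 1) {            /* reader: poll the flag locally   */
            while (((volatile int *)base)[1] == 0)
                MPI_Win_sync(win);         /* keep window copies coherent     */
            printf("rank 1 received payload %d\n", base[0]);
        }

        MPI_Win_unlock_all(win);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

Run with at least two ranks, e.g. mpirun -np 2.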

SLIDE 35

Overview

§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work

SLIDE 36

rpipe programming

    #define PIPE_WIDTH 8

    int main() {
        rpipe_t rp;
        rinit(&rank, NULL);
        // create a Half Duplex RMA pipe
        rpipe(rp, peer, iswriter, PIPE_WIDTH, HD_PIPE);
        raddr_t addr;
        int *ptr;
        if (iswriter) {
            addr = rmalloc(rp, sizeof(int));   // remote allocate
            ptr  = rmem(rp, addr);
            *ptr = SEND_VAL;
            rwrite(rp, addr);                  // rpipe op
        } else {
            rread(rp, addr, sizeof(int));      // rpipe op
            ptr = rmem(rp, addr);
            rfree(addr);                       // free remote memory; release immediately after use!!
        }
    }

SLIDE 37

Experimentation Setup

Cray XC30 [Aries] / Dragonfly topology
BigRed II+ – 550 nodes / Rpeak 280 Tflops
  – 10 GB/s uni-directional, 15 GB/s bi-directional bandwidth

Performance baseline → MPI / OSU benchmarks

SLIDE 38

Small/Medium Message Latency Comparison

[Chart: latency/operation (us) vs. message size (bytes, 1–8192); series: MPI_RMA_FENCE, MPI_RMA_PASSIVE(lock_once), MPI_RMA_PSCW, MPI_SEND, RMA_PIPE_WRITE(uGNI_FMA_2PUTS), RMA_PIPE_WRITE(uGNI_FMA_PUTW_SYNC).]

§ Default allocation = Next-fit
§ FMA_PUT_W_SYNC
  – Up to 6X speedup over MPI RMA
§ rpipe PUT_W_sync(s) < rpipe 2PUT(s)

SLIDE 39

Large Message Latency Comparison – rwrite()

[Chart: latency/operation (us) vs. message size (bytes, 1–4M); series: MPI_RMA_PASSIVE, MPI_RMA_PSCW, RMA_PIPE_WRITE; small/medium-message latency ≈ 0.65 us.]

§ rpipe uGNI(s) ≈ rpipe MPI(s) when s > 4K
  – s ≥ 4K → FMA to BTE switch

SLIDE 40

Large Message Latency Comparison – rread()

[Chart: latency/operation (us) vs. message size (bytes, 1–4M); series: MPI_RMA_PASSIVE, MPI_RMA_PSCW, RMA_PIPE_READ; small/medium-message latency ≈ 2.14 us.]

§ rpipe uGNI(s) ≈ rpipe MPI(s) when s > 1K
  – s < 4 B → FMA_FETCH atomic (AMO)
  – s < 1K → FMA_FETCH + PSYNC
  – s ≥ 1K → FMA to BTE switch (BTE_FETCH + FMA_PSYNC)
  (the switch points are summarized in the sketch below)
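
The size-based protocol switch above reduces to a simple dispatch on message size. The thresholds below are the ones stated on this slide, while the enum and function names are illustrative only.

    #include <stdio.h>
    #include <stddef.h>

    /* Illustrative transport choices mirroring the slide's thresholds;
     * rread_path() and the enum are hypothetical names, not the library API. */
    typedef enum { FMA_FETCH_AMO, FMA_FETCH_PSYNC, BTE_FETCH_FMA_PSYNC } rread_path_t;

    static rread_path_t rread_path(size_t s) {
        if (s < 4)    return FMA_FETCH_AMO;     /* tiny reads: fetching atomic */
        if (s < 1024) return FMA_FETCH_PSYNC;   /* small/medium: FMA fetch     */
        return BTE_FETCH_FMA_PSYNC;             /* >= 1K: BTE bulk fetch       */
    }

    int main(void) {
        size_t sizes[] = {2, 64, 4096};
        for (size_t i = 0; i < 3; i++)
            printf("size %zu -> path %d\n", sizes[i], (int)rread_path(sizes[i]));
        return 0;
    }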

SLIDE 41

rpipe Scales ...

[Chart: latency/operation (us) vs. number of nodes (2–32); series: RPIPE_WRITE(1K)(unbounded), RPIPE_WRITE(64)(unbounded), RPIPE_WRITE(8)(4K), RPIPE_WRITE(8)(64), RPIPE_WRITE(8)(unbounded), RPIPE_WRITE(8K)(unbounded).]

§ “unbounded” → the allocator has the full rpipe available for all zero-copy operations
§ Scaling up to 32 nodes – randomized rwrite()
  – 0.65–3.8 us average latency

SLIDE 42

Allocation Algorithms

[Three charts (Next-fit, Best-fit, Worst-fit panels): latency/operation (us) vs. message size (bytes, 1–512); series: MPI3.0_RMA, RPIPE_WRITE(1K), RPIPE_WRITE(256K), RPIPE_WRITE(unbounded).]

§ Zero-copy write vs. heuristics
  – The next-fit allocator has better performance
  – 1X–3.5X slowdown for best-fit/worst-fit
  – L = latency: L[Next-fit] < L[MPI] < L[Worst-fit]

SLIDE 43

Overview

§ Motivation
§ Design/System Implementation
§ Evaluation
§ Future Work

SLIDE 44

Future Work

§ Platform support / automated synchronization
§ High-performance RMA kernels
  – Active messages / neighbor and collective communication
§ Aggregated rpipes
  – Leverage zero copy / eliminate hidden buffers
    • i.e. collectives
    • possible throughput and memory utilization gains
§ Irregular RMA and memory disaggregation

SLIDE 45

Questions?

SLIDE 46

Thank You!