Shawn Hall – Hybrid RDMA: RDMA/SR mix for data, SR otherwise (PowerPoint PPT presentation)



SLIDE 1

Shawn Hall

SLIDE 2

SLIDE 3

SLIDE 4

Hybrid RDMA

SLIDE 5

RDMA/SR mix for data, SR otherwise

Client-side events

- Completion of sending request messages – polling
- Completion of incoming reply and control messages – interrupts

Server-side events

- Since the server is dedicated – polling

SLIDE 6

For small messages, memory (de)registration cost > zero-copy benefit

Data-piggybacked SR w/ pre-registered buffers

- Client caches the locations of preallocated/preregistered fast RDMA buffers on the I/O server
- RDMA Write with immediate data
- Large transfers are split into smaller ones
- Client/server communication and disk I/O are pipelined
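The split-and-pipeline step can be sketched as a minimal Python model; the buffer size and function name are illustrative assumptions, not from the slides:

```python
# Hypothetical sketch: splitting a large transfer into pieces that each
# fit one of the server's preregistered fast RDMA buffers.
# FAST_BUF_SIZE is an assumed value, not from the presentation.

FAST_BUF_SIZE = 64 * 1024  # assumed size of one preregistered server buffer

def split_transfer(length, buf_size=FAST_BUF_SIZE):
    """Return (offset, size) pieces covering a transfer of `length` bytes."""
    pieces = []
    off = 0
    while off < length:
        size = min(buf_size, length - off)
        pieces.append((off, size))
        off += size
    return pieces

# A 200 KiB write becomes four pieces; while piece i is being written to
# disk, piece i+1 can be in flight on the network (the pipelining above).
assert split_transfer(200 * 1024) == [
    (0, 65536), (65536, 65536), (131072, 65536), (196608, 8192)
]
```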

SLIDE 7

Internal Buffer Credit-Based Flow Control

Preallocated/prepinned buffers per connection
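The credit idea can be sketched as follows; the class and counters are hypothetical, since the slides only state that preallocated buffers gate sends per connection:

```python
# Minimal model of credit-based flow control over a fixed pool of
# preallocated/prepinned receive buffers. One credit == one free buffer
# on the peer; a send is allowed only if a credit is available.

class CreditedConnection:
    def __init__(self, num_buffers):
        self.credits = num_buffers        # one credit per receive buffer

    def try_send(self, msg):
        if self.credits == 0:
            return False                  # peer has no free buffer: back off
        self.credits -= 1                 # consume one receive buffer
        return True

    def on_credit_return(self, n=1):
        self.credits += n                 # peer freed buffers (credits often
                                          # piggybacked on reply messages)

conn = CreditedConnection(num_buffers=2)
assert conn.try_send("a") and conn.try_send("b")
assert not conn.try_send("c")             # would overrun the peer's buffers
conn.on_credit_return()
assert conn.try_send("c")
```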

Server RDMA Buffer Management

- Most I/O server memory is allocated as RDMA buffers
- Buffers are grouped by size into "zones"

Try to fit into contiguous buffer, otherwise split transfer
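One way the zone scheme could work, sketched under assumed zone sizes (the slides do not give actual sizes):

```python
# Sketch of "zoned" server buffer management: pick the smallest zone whose
# buffers hold the whole transfer contiguously; otherwise split the transfer
# across buffers from the largest zone. Zone sizes are assumptions.

ZONE_SIZES = [4096, 65536, 1048576]   # illustrative zones: 4 KiB, 64 KiB, 1 MiB

def place_transfer(length, zones=ZONE_SIZES):
    """Return the list of buffer sizes used for a transfer of `length` bytes."""
    for size in sorted(zones):
        if length <= size:
            return [size]              # fits in one contiguous buffer
    big = max(zones)
    n = length // big                  # otherwise split across largest buffers
    rest = length - n * big
    bufs = [big] * n
    if rest:
        bufs += place_transfer(rest, zones)
    return bufs

assert place_transfer(3000) == [4096]                     # single buffer
assert place_transfer(3 * 1048576 + 10) == [1048576, 1048576, 1048576, 4096]
```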

Client RDMA Buffer Management

- Dynamic (de)registration is required for clients
- A pin-down cache delays deregistration and caches registration info

The pin-down cache is not useful for I/O-intensive applications

SLIDE 8

Fast Memory Registration and Deregistration

Uses pin-down cache and batched deregistration
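A toy model of the pin-down cache with batched deregistration; the class, eviction policy, and interface are assumptions for illustration:

```python
# Sketch of a pin-down cache: deregistration is deferred so that
# re-registering the same region is a cache hit, and evicted regions are
# deregistered in one batch instead of one HCA call each.

class PinDownCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pinned = {}              # addr -> length of registered regions
        self.pending_dereg = []       # regions waiting for batched dereg

    def register(self, addr, length):
        if self.pinned.get(addr) == length:
            return "hit"              # already pinned: skip costly registration
        if len(self.pinned) >= self.capacity:
            victim = next(iter(self.pinned))       # evict (FIFO for brevity)
            self.pending_dereg.append((victim, self.pinned.pop(victim)))
        self.pinned[addr] = length    # real code would register with the HCA
        return "miss"

    def flush(self):
        n = len(self.pending_dereg)   # one batched deregistration call
        self.pending_dereg.clear()
        return n

c = PinDownCache(capacity=1)
assert c.register(0x1000, 4096) == "miss"
assert c.register(0x1000, 4096) == "hit"   # deregistration was delayed
assert c.register(0x2000, 4096) == "miss"  # evicts 0x1000 into the batch
assert c.flush() == 1
```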

SLIDE 9

SLIDE 10

SLIDE 11

(Figure: % of pin-down cache hits)

SLIDE 12

SLIDE 13

Chunk List – multidimensional lists that store the locations of multiple buffers

RPC Long Call – long RPCs are broken into chunks

- The first message contains a chunk list describing the other messages
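The long-call chunking described above can be sketched as follows; the chunk size and message layout are illustrative assumptions, not from the slides:

```python
# Sketch of an RPC "long call": the payload is broken into chunks and the
# first message carries a chunk list describing where the rest live.

CHUNK = 8 * 1024  # assumed chunk size

def build_long_call(payload, chunk=CHUNK):
    """Split `payload`; return (first message with chunk list, remaining chunks)."""
    chunks = [payload[i:i + chunk] for i in range(0, len(payload), chunk)]
    chunk_list = [(i, len(c)) for i, c in enumerate(chunks)]  # (index, length)
    first = {"chunk_list": chunk_list, "data": chunks[0]}
    return first, chunks[1:]

first, rest = build_long_call(b"x" * 20000)
assert first["chunk_list"] == [(0, 8192), (1, 8192), (2, 3616)]
assert len(rest) == 2
```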

NFS Write

(Diagram: client/server message exchange)

SLIDE 14

NFS Readdir and Readlink – similar to NFS Read

NFS Read

(Diagrams: client/server exchange in the Read-Read design and in the Read-Write design)

SLIDE 15

- Server buffers are exposed to client RDMA
- Server resources are not freed until the client sends RDMA_DONE
- Synchronous RDMA Read causes latency
- The number of concurrent RDMA Reads is limited

SLIDE 16

- RPC long replies and NFS READ data can come directly from the server
- The client cannot initiate RDMA and try to access other buffers, so the design is more secure
- The Mellanox HCA can issue many RDMA Write operations in parallel
- No waiting for RDMA_DONE
- Fewer server interrupts

SLIDE 17

Fast Memory Registration – registration steps that involve communication with the HCA are done at initialization rather than dynamically

Buffer Registration Cache

- No information about server buffers is exposed

Physical Memory Registration – avoids virtual-to-physical address translation

- The translation also does not need to be sent to the HCA

SLIDE 18

- RDMA_DONE elimination
- RDMA Write parallelism

SLIDE 19

SLIDE 20

With no local scatter/gather, more RDMA Reads are required; and because simultaneous Reads are capped, parallelism decreases.

SLIDE 21

Server memory saturates.

SLIDE 22

SLIDE 23

- MPI-level checkpointing: applies only to a specific MPI implementation; not portable
- File-system-level approach: portable and transparent to MPI stacks and applications

SLIDE 24

- FUSE – software that allows creating a user-level virtual file system
- Berkeley Lab Checkpoint/Restart (BLCR) – writes a process image to a file for later restart
- MPI Checkpointing Mechanisms – offered by MVAPICH2, MPICH2, and OpenMPI

Checkpoint flow:

- The MPI library flushes the communication channel
- The BLCR library is used to dump a memory snapshot
- The BLCR library is used to restart the job if needed

SLIDE 25

- VFS cache
- Efficient sequential writes
- Needs work

SLIDE 26

SLIDE 27

- File Open – caught by FUSE; CRFS inserts/increments a value in a hash table, then passes the call to the underlying file system
- File Close – the buffer pool is flushed into the work queue; the call blocks until the operations complete
- File Sync – all writes on the file are completed, then fsync() is passed to the underlying file system
- Other File Operations – passed through to the underlying file system
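A minimal model of this dispatch logic, with hypothetical names (the slides describe behavior, not an API):

```python
# Sketch of how CRFS, as described on this slide, might intercept FUSE
# calls: open tracks the file in a hash table, close flushes buffered
# chunks and waits, everything else falls through. All names are assumed.

class CRFS:
    def __init__(self):
        self.open_count = {}          # hash table: path -> open count
        self.buffered = {}            # path -> buffered write chunks

    def open(self, path):
        self.open_count[path] = self.open_count.get(path, 0) + 1
        # ... then pass the open() to the underlying file system

    def close(self, path):
        flushed = self.buffered.pop(path, [])
        # real code enqueues these chunks and blocks until I/O threads finish
        self.open_count[path] -= 1
        return len(flushed)

fs = CRFS()
fs.open("/ckpt/rank0")
fs.open("/ckpt/rank0")
assert fs.open_count["/ckpt/rank0"] == 2
fs.buffered["/ckpt/rank0"] = [b"chunk"]
assert fs.close("/ckpt/rank0") == 1
```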

SLIDE 28

File Write

- Data is copied from the file into a chunk in the buffer pool until the chunk is full
- The full chunk is enqueued into the work queue
- This triggers an I/O thread to wake up and write the chunk

Number of I/O threads limited to prevent contention
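The write path with a bounded I/O thread pool can be modeled like this; the chunk size and thread count are toy values chosen for the demo:

```python
# Sketch of the write path: data accumulates in a chunk until it fills,
# full chunks go onto a work queue, and a small fixed pool of I/O threads
# drains the queue (bounded to limit contention, as the slide notes).

import queue
import threading

CHUNK_SIZE = 4   # tiny for the demo; real chunks would be much larger
work_q = queue.Queue()
written = []

def io_thread():
    while True:
        chunk = work_q.get()
        if chunk is None:
            break                      # shutdown sentinel
        written.append(bytes(chunk))   # real code writes to the file system
        work_q.task_done()

workers = [threading.Thread(target=io_thread) for _ in range(2)]
for w in workers:
    w.start()

buf = bytearray()

def write(data):
    global buf
    buf += data
    while len(buf) >= CHUNK_SIZE:      # chunk full: enqueue, wake a worker
        work_q.put(buf[:CHUNK_SIZE])
        buf = buf[CHUNK_SIZE:]

write(b"abcdef")
write(b"gh")
work_q.join()                          # wait until all chunks are written
for _ in workers:
    work_q.put(None)
for w in workers:
    w.join()

assert b"".join(sorted(written)) == b"abcdefgh"
```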

SLIDES 29–38

(Figures/charts only; no recoverable text)