SLIDE 1
Shawn Hall
SLIDE 2
SLIDE 3
SLIDE 4
Hybrid RDMA
SLIDE 5
RDMA/SR mix for data, SR otherwise
Client side events
Completion of sending request messages – polling
Completion of incoming reply and control messages – interrupts
Server side events
Since it’s dedicated – polling
SLIDE 6
For small messages, memory (de)registration cost > zero-copy benefit
Data-piggybacked SR w/ pre-registered buffers
Client caches location of preallocated/preregistered Fast RDMA buffers on I/O server
RDMA Write with Immediate data
Large transfers split into smaller ones
Client/server communication and disk I/O pipelined
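A minimal sketch of the size-based choice described on this slide: small messages piggyback their data on send/receive (SR) through pre-registered buffers, and large transfers are split into segments that can be pipelined. The names `plan_transfer`, `SMALL_MSG_LIMIT`, and `SEGMENT_SIZE` and both numeric values are assumptions for illustration; the slides give no thresholds.

```python
from typing import List, Tuple

# Hypothetical tuning values; the slides do not state concrete numbers.
SMALL_MSG_LIMIT = 8 * 1024   # at or below this, piggyback data on send/receive
SEGMENT_SIZE = 64 * 1024     # size of each piece of a large transfer


def plan_transfer(nbytes: int) -> Tuple[str, List[Tuple[int, int]]]:
    """Return (method, [(offset, length), ...]) for a transfer of nbytes."""
    if nbytes <= SMALL_MSG_LIMIT:
        # Small message: copy into a pre-registered buffer and send it with
        # the request, so no memory (de)registration is needed.
        return "send_recv_piggyback", [(0, nbytes)]
    # Large message: split into segments so that moving one segment over the
    # network can overlap with disk I/O on another.
    segments = [(off, min(SEGMENT_SIZE, nbytes - off))
                for off in range(0, nbytes, SEGMENT_SIZE)]
    return "rdma_write", segments
```

For example, `plan_transfer(150 * 1024)` yields three segments, the last one shorter than `SEGMENT_SIZE`.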
SLIDE 7
Internal Buffer Credit-Based Flow Control
Preallocated/prepinned buffers per connection
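Credit-based flow control over a fixed set of preallocated buffers can be sketched as follows. This is an illustrative model only, not the implementation from the talk; the class name `CreditedConnection` and its methods are hypothetical. Each credit stands for one free receive buffer on the peer.

```python
from collections import deque


class CreditedConnection:
    """Sender-side view: one credit per preallocated receive buffer on the peer."""

    def __init__(self, num_buffers: int):
        self.credits = num_buffers
        self.pending = deque()   # messages waiting for a free buffer

    def send(self, msg) -> bool:
        """Return True if msg can be posted now, False if it was queued."""
        if self.credits > 0:
            self.credits -= 1    # consumes one receive buffer on the peer
            return True
        self.pending.append(msg)
        return False

    def on_credit_return(self, n: int = 1) -> list:
        """Peer freed n buffers; return the queued messages that can now go out."""
        self.credits += n
        released = []
        while self.credits > 0 and self.pending:
            self.credits -= 1
            released.append(self.pending.popleft())
        return released
```

The sender never posts more messages than the receiver has buffers for, so a receive buffer is always available when a message arrives.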
Server RDMA Buffer Management
Most I/O server memory allocated as RDMA buffers
Buffers are grouped by size into “zones”
Try to fit into contiguous buffer, otherwise split transfer
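The zone idea can be illustrated with a small allocator: first look for a single buffer that holds the whole request, and only if none is free split the transfer across several buffers. `ZonePool` and its methods are hypothetical names, and the selection policy (smallest fitting zone first, then largest buffers first when splitting) is an assumption, not something the slides specify.

```python
class ZonePool:
    """Free-buffer counts per zone, keyed by buffer size in bytes."""

    def __init__(self, zones: dict):
        self.free = dict(zones)          # {buffer_size: free_count}

    def allocate(self, nbytes: int):
        """Return a list of buffer sizes covering nbytes, or None if impossible."""
        # Preferred case: one contiguous buffer from the smallest zone that fits.
        for size in sorted(self.free):
            if size >= nbytes and self.free[size] > 0:
                self.free[size] -= 1
                return [size]
        # Fallback: split the transfer over several buffers, largest first.
        picked, remaining = [], nbytes
        for size in sorted(self.free, reverse=True):
            while remaining > 0 and self.free[size] > 0:
                self.free[size] -= 1
                picked.append(size)
                remaining -= size
        if remaining > 0:
            self.release(picked)         # not enough memory: undo and fail
            return None
        return picked

    def release(self, buffers: list) -> None:
        for size in buffers:
            self.free[size] += 1
```

A caller that receives more than one buffer would issue one smaller transfer per buffer.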
Client RDMA Buffer Management
Dynamic (de)registration required for clients
Pin-down cache delays deregistration, caches info
Pin-down not useful for I/O intensive applications
SLIDE 8
Fast Memory Registration and Deregistration
Uses pin-down cache and batched deregistration
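A pin-down cache with batched deregistration might look roughly like the sketch below. It is a simplified model: `register` and `deregister_batch` are caller-supplied stand-ins for the real memory registration calls, the LRU eviction policy is an assumption, and the sketch does not track whether an evicted region is still in use.

```python
from collections import OrderedDict


class PinDownCache:
    def __init__(self, capacity, batch_size, register, deregister_batch):
        self.capacity = capacity
        self.batch_size = batch_size
        self.register = register                  # hypothetical: (addr, length) -> handle
        self.deregister_batch = deregister_batch  # hypothetical: list of handles -> None
        self.cache = OrderedDict()                # (addr, length) -> handle, LRU order
        self.pending = []                         # evicted handles awaiting deregistration
        self.hits = 0
        self.misses = 0

    def acquire(self, addr, length):
        key = (addr, length)
        if key in self.cache:
            self.cache.move_to_end(key)           # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        handle = self.register(addr, length)
        self.cache[key] = handle
        if len(self.cache) > self.capacity:
            _, old = self.cache.popitem(last=False)   # evict least recently used
            self.pending.append(old)                  # delay the deregistration
            if len(self.pending) >= self.batch_size:
                self.flush()
        return handle

    def flush(self):
        """Deregister all delayed handles in one batch."""
        if self.pending:
            self.deregister_batch(self.pending)
            self.pending = []
```

The hit percentage is `hits / (hits + misses)`. An application that rarely reuses buffers sees mostly misses, so the cache gives it little benefit.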
SLIDE 9
SLIDE 10
SLIDE 11
% of pin-down cache hits
SLIDE 12
SLIDE 13
Chunk List – Multidimensional lists that store the locations of multiple buffers
RPC Long Call – Long RPCs are broken into chunks
First message contains chunk list of other messages
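As a rough illustration of a long call, the sketch below breaks a payload into chunks and builds a first message whose chunk list describes where each chunk lives. The `Chunk` fields, the dictionary layout of the header, and the function names are assumptions made for this example; they do not reproduce the actual RPC/RDMA wire format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Chunk:
    handle: int   # hypothetical registration handle of the buffer
    offset: int   # where the chunk starts inside that buffer
    length: int   # chunk size in bytes


def build_long_call(payload: bytes, chunk_size: int,
                    handle: int = 1) -> Tuple[dict, List[Chunk]]:
    """Describe payload as a list of chunks and build the first message."""
    chunks = [Chunk(handle, off, min(chunk_size, len(payload) - off))
              for off in range(0, len(payload), chunk_size)]
    first_message = {"type": "long_call", "chunk_list": chunks}
    return first_message, chunks


def fetch_chunks(buffer: bytes, chunk_list: List[Chunk]) -> bytes:
    """Stand-in for the peer pulling each chunk; here just slices a local buffer."""
    return b"".join(buffer[c.offset:c.offset + c.length] for c in chunk_list)
```

The receiver only needs the first message to learn how many chunks follow and where to find them.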
NFS Write
[Diagram: NFS Write message exchange between client and server]
SLIDE 14
NFS Read
[Diagrams: client/server message exchange for the Read-Read design and the Read-Write design]
NFS Readdir and Readlink – similar to NFS Read
SLIDE 15
Server buffers exposed to client RDMA
Server resources not freed until client sends RDMA_DONE
Synchronous RDMA read causes latency
Number of concurrent RDMA reads is limited
SLIDE 16
RPC long replies and NFS READ can come directly from server
Client cannot initiate RDMA to access other server buffers, so the design is more secure
Mellanox HCA can issue many RDMA write operations in parallel
No waiting for RDMA_DONE
Fewer server interrupts
SLIDE 17
Fast Memory Registration – registration steps that involve communication with the HCA are done at initialization rather than dynamically
Buffer Registration Cache
No information about server buffers exposed
Physical Memory Registration – avoids virtual to physical address translation
Translation also does not need to be sent to HCA
SLIDE 18
RDMA_DONE elimination
RDMA Write parallelism
SLIDE 19
SLIDE 20
No local scatter/gather, so more RDMA reads are needed. The number of simultaneous reads is capped, which reduces parallelism.
SLIDE 21
Server memory saturates.
SLIDE 22
SLIDE 23
Applies only to MPI implementation. Not portable.
Portable and transparent to MPI stacks and applications
SLIDE 24
FUSE – software for creating a user-level virtual file system.
Berkeley Lab Checkpoint/Restart (BLCR) – writes a process image to a file for later restart.
MPI Checkpointing Mechanisms – offered by MVAPICH2, MPICH2, OpenMPI.
MPI library flushes communication channel
BLCR library used to dump memory snapshot
BLCR library used to restart job if needed
SLIDE 25
VFS Cache Efficient Sequential Writes Needs Work
SLIDE 26
SLIDE 27
File Open – caught by FUSE, CRFS inserts/increments value in hash table, passes call to underlying file system
File Close – buffer pool flushed into work queue, blocked until operations complete
File Sync – complete all writes on file, pass fsync() to underlying file system
Other File Operations – passed to file system
SLIDE 28
File Write
Data copied from file into chunk in buffer pool until chunk is full
Chunk enqueued into work queue
This triggers an I/O thread to wake up and write chunk
Number of I/O threads limited to prevent contention
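The write path above can be modeled in a few lines: writes accumulate in a chunk, full chunks go onto a work queue, and a small fixed number of I/O threads drain the queue. This is a sketch under stated assumptions, not CRFS code; `ChunkedWriter`, its parameters, and the single lock around the sink are all hypothetical simplifications.

```python
import queue
import threading


class ChunkedWriter:
    def __init__(self, sink, chunk_size=4, io_threads=2):
        self.sink = sink                    # file-like object with seek() and write()
        self.chunk_size = chunk_size
        self.buf = bytearray()              # the chunk currently being filled
        self.offset = 0                     # file offset where self.buf starts
        self.work = queue.Queue()           # full chunks waiting to be written
        self.lock = threading.Lock()        # serializes access to the sink
        # A small, fixed number of I/O threads limits contention.
        self.threads = [threading.Thread(target=self._io_loop, daemon=True)
                        for _ in range(io_threads)]
        for t in self.threads:
            t.start()

    def _io_loop(self):
        while True:
            item = self.work.get()          # blocks until a chunk is enqueued
            if item is None:                # shutdown marker
                self.work.task_done()
                return
            offset, data = item
            with self.lock:
                self.sink.seek(offset)
                self.sink.write(data)
            self.work.task_done()

    def write(self, data: bytes):
        self.buf += data
        while len(self.buf) >= self.chunk_size:
            chunk = bytes(self.buf[:self.chunk_size])
            del self.buf[:self.chunk_size]
            self.work.put((self.offset, chunk))   # wakes an I/O thread
            self.offset += self.chunk_size

    def close(self):
        if self.buf:                        # flush the partially filled chunk
            self.work.put((self.offset, bytes(self.buf)))
            self.offset += len(self.buf)
            self.buf.clear()
        self.work.join()                    # block until all writes complete
        for _ in self.threads:
            self.work.put(None)
        for t in self.threads:
            t.join()
```

Usage with an in-memory sink: create `io.BytesIO()`, call `write()` a few times, then `close()`; the sink holds the concatenated data once `close()` returns, even though chunks may be written out of order.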
SLIDE 29
SLIDE 30
SLIDE 31
SLIDE 32
SLIDE 33
SLIDE 34
SLIDE 35
SLIDE 36
SLIDE 37
SLIDE 38