High Performance Pipelined Process Migration with RDMA
Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Dhabaleswar K. (DK) Panda
Department of Computer Science & Engineering, The Ohio State University
CCGrid 2011

Outline
- Introduction and Motivation
- Profiling Process Migration
- Pipelined Process Migration with RDMA
- Performance Evaluation
- Conclusions and Future Work
Motivation
- Computer clusters continue to grow larger
– Heading towards the multi-petaflop and exaflop era
– Mean time between failures (MTBF) is getting smaller
- Fault tolerance becomes imperative
- Checkpoint/Restart (C/R) is the common approach to fault tolerance
– Checkpoint: save snapshots of all processes (I/O overhead)
– Restart: restore and resubmit the job (I/O overhead + queue delay)
- C/R drawbacks
× Unnecessarily dumps all processes: I/O bottleneck
× Resubmission: queuing delay
- Checkpoint/Restart alone doesn't scale to large systems
Job/Process Migration
- Pro-active fault tolerance
– Only handles the processes on the failing node
– Relies on health monitoring mechanisms and failure prediction models
- Five steps (see the sketch after this list)
(1) Suspend communication channels
(2) Write snapshots on the source node
(3) Transfer the process image files (source => target)
(4) Read the image files on the target node
(5) Reconnect communication channels
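These five steps can be pictured as a sequential driver on the migration framework's side. The sketch below is purely illustrative: the function names are hypothetical stubs rather than the MVAPICH2 API, and it highlights that, in the basic approaches, each step completes before the next one begins.

```c
/*
 * Illustrative driver for the five migration steps.  The function names are
 * hypothetical stubs, not the MVAPICH2 API; in the basic approaches each step
 * completes before the next one starts, which is the serialization that the
 * pipelined design later removes.
 */
#include <stdio.h>

static void suspend_channels(void)   { puts("(1) suspend communication channels"); }
static void write_snapshot(const char *src) { printf("(2) write snapshots on %s\n", src); }
static void transfer_images(const char *src, const char *dst)
                                     { printf("(3) transfer image files %s => %s\n", src, dst); }
static void read_images(const char *dst)    { printf("(4) read image files on %s\n", dst); }
static void reconnect_channels(void) { puts("(5) reconnect communication channels"); }

int main(void)
{
    const char *source = "source-node", *target = "target-node";

    suspend_channels();
    write_snapshot(source);           /* checkpoint I/O on the source node */
    transfer_images(source, target);  /* network transfer                  */
    read_images(target);              /* restart I/O on the target node    */
    reconnect_channels();
    return 0;
}
```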
Process Migration Advantages
- Overcomes the C/R drawbacks
× No unnecessary dump of all processes
× No resubmission queuing delay
- A desirable feature for other applications as well
– Cluster-wide load balancing
– Server consolidation
– Performance isolation
Existing MPI Process Migration
- Available in MVAPICH2 and OpenMPI
- Both suffer from low performance
- What is the cause? What is the solution?
Problem Statements
- What are the dominant factors behind the high cost of process migration?
- How to design an efficient protocol that minimizes the overhead?
– How to optimize the checkpoint-related I/O path?
– How to optimize the data transfer path?
– How to leverage the RDMA transport to accelerate data transmission?
- What will be the performance benefits?
Outline
- Introduction and Motivation
- Profiling Process Migration
- Pipelined Process Migration with RDMA
- Performance Evaluation
- Conclusions and Future Work
MVAPICH/MVAPICH2 Software
- MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
– Used by more than 1,550 organizations worldwide (in 60 countries)
– Empowering many TOP500 clusters (11th, 15th, ...)
– Available with the software stacks of many IB, 10GE/iWARP, RoCE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
– Available with Red Hat and SuSE distributions
– http://mvapich.cse.ohio-state.edu/
- Has supported Checkpoint/Restart and Process Migration for the last several years
– Already used by many organizations
Three Process Migration Approaches
- MVAPICH2 already supports three process migration strategies
– Local Filesystem-based Migration (Local)
– Shared Filesystem-based Migration (Shared)
– RDMA + Local Filesystem-based Migration (RDMA+Local)
Local Filesystem-based Process Migration (Local)
[Figure: timeline of the Local approach. (1) Suspend communication channels; (2) the process writes its snapshot through the VFS page cache to the local filesystem on the source node; (3) the image files are transferred over the network stack to the target node's local filesystem; (4) the restarted process reads the image files on the target node; (5) reconnect communication channels. The Write, Transfer and Read phases run one after another.]
Shared Filesystem-based Process Migration (Shared)
[Figure: timeline of the Shared approach. (1) Suspend; (2)+(3) the source node writes the snapshot through the VFS page cache to a shared filesystem (Write + Transfer 1); (4) the target node reads the image back from the shared filesystem (Transfer 2 + Read); (5) Reconnect.]
RDMA + Local Filesystem-based Process Migration (RDMA+Local)
[Figure: timeline of the RDMA+Local approach. (1) Suspend; (2)+(3) the snapshot is written into an RDMA buffer pool on the source node and transferred directly into an RDMA buffer pool on the target node, where it is staged in the local filesystem; (4) the restarted process reads the image files through the target's VFS page cache; (5) Reconnect.]
Profiling Process Migration Time Cost
- Experiment: migrate 8 processes and break down the time spent in each step
– Write: the source node writes the checkpoint files
– Transfer: copy the checkpoint from the source to the target
– Read: read the image files on the target node
- All three approaches suffer from I/O cost
- Conclusion: all three steps (Write, Transfer, Read) must be optimized
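Per-step breakdowns like the one above can be collected with coarse wall-clock timers around each phase. The sketch below only illustrates the idea; the do_write/do_transfer/do_read helpers are hypothetical placeholders for the real phases, not part of the migration framework.

```c
/* Minimal per-phase timing harness (sketch).  The three do_* helpers are
 * hypothetical stand-ins for the real Write/Transfer/Read phases. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double elapsed_s(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

static void do_write(void)    { usleep(100 * 1000); }  /* placeholder work */
static void do_transfer(void) { usleep(100 * 1000); }
static void do_read(void)     { usleep(100 * 1000); }

int main(void)
{
    struct timespec t0, t1, t2, t3;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    do_write();                 /* source node writes the checkpoint files */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    do_transfer();              /* copy the checkpoint from source to target */
    clock_gettime(CLOCK_MONOTONIC, &t2);
    do_read();                  /* target node reads the image files */
    clock_gettime(CLOCK_MONOTONIC, &t3);

    printf("write    %.3f s\n", elapsed_s(t0, t1));
    printf("transfer %.3f s\n", elapsed_s(t1, t2));
    printf("read     %.3f s\n", elapsed_s(t2, t3));
    return 0;
}
```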
Outline
- Introduction and Motivation
- Profiling Process Migration
- Pipelined Process Migration with RDMA
- Performance Evaluation
- Conclusions and Future Work
Pipelined Process Migration with RDMA (PPMR)
[Figure: PPMR architecture. On the source node, the checkpoint writes of the process pass through FUSE to a buffer manager that fills chunks in an RDMA buffer pool; the chunks are transferred to the RDMA buffer pool on the target node, where the restarting process reads them back through FUSE. (1) Suspend and (5) Reconnect bracket the sequence, and the Write, Transfer and Read phases overlap in time.]
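PPMR stages checkpoint chunks in pre-registered RDMA buffers on both nodes. The sketch below shows only how such a pool could be pinned with libibverbs; queue-pair setup, connection exchange and the actual RDMA writes are omitted, and the 8 MB / 128 KB sizes simply mirror the parameters used later in the evaluation.

```c
/* Sketch: allocate and register a fixed RDMA buffer pool with libibverbs.
 * Only the registration step is shown; QP creation, connection exchange and
 * the RDMA writes that move chunks to the target node are omitted.
 * Build with: gcc pool.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

#define POOL_SIZE  (8 * 1024 * 1024)   /* 8 MB buffer pool */
#define CHUNK_SIZE (128 * 1024)        /* 128 KB chunks    */

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no IB device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "cannot open device\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) { fprintf(stderr, "cannot allocate PD\n"); return 1; }

    void *pool = NULL;
    if (posix_memalign(&pool, 4096, POOL_SIZE)) return 1;

    /* Pin the whole pool once; 128 KB chunks are then carved out of it so
     * that each chunk can be the source of an RDMA write without any further
     * registration cost on the critical path. */
    struct ibv_mr *mr = ibv_reg_mr(pd, pool, POOL_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    printf("registered %d chunks of %d KB (lkey=0x%x)\n",
           POOL_SIZE / CHUNK_SIZE, CHUNK_SIZE / 1024, mr->lkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    free(pool);
    return 0;
}
```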
Comparisons
[Figure: timeline comparison of Local, Shared, RDMA+Local and PPMR. In Local, Shared and RDMA+Local the pipeline between the Write, Transfer and Read phases is incomplete; PPMR overlaps all three.]
PPMR Design Strategy
- Fully pipelines the three key steps (a minimal model of the pipeline follows this list)
– Write at the source node
– Transfer checkpoint data to the target node
– Read process images at the target node
- Efficient restart mechanism on the target node
– Restart directly from the RDMA data streams
- Design choices
– Buffer pool size, chunk size
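The overlap of the three steps can be modeled as a chunk pipeline: the writer fills fixed-size chunks from a bounded pool, the transfer stage ships each chunk as soon as it is full, and the reader consumes chunks as they arrive, recycling them back into the pool. The sketch below is a minimal, self-contained model of that idea (pthreads plus bounded queues), not the actual PPMR code: the RDMA transfer is simulated with memcpy, and the queues play the role of the buffer manager.

```c
/* Minimal model of a PPMR-style pipeline: writer, transfer and reader overlap
 * on a bounded pool of 128 KB chunks.  Illustrative sketch only.
 * Build with: gcc -pthread pipeline.c */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define CHUNK  (128 * 1024)   /* chunk size                */
#define SLOTS  64             /* 8 MB pool = 64 chunks     */
#define TOTAL  512            /* 64 MB of checkpoint data  */

struct queue { int buf[SLOTS + 1], head, tail, count; pthread_mutex_t mu; pthread_cond_t cv; };

static void q_push(struct queue *q, int v) {
    pthread_mutex_lock(&q->mu);
    while (q->count == SLOTS + 1) pthread_cond_wait(&q->cv, &q->mu);
    q->buf[q->tail] = v; q->tail = (q->tail + 1) % (SLOTS + 1); q->count++;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mu);
}
static int q_pop(struct queue *q) {
    pthread_mutex_lock(&q->mu);
    while (q->count == 0) pthread_cond_wait(&q->cv, &q->mu);
    int v = q->buf[q->head]; q->head = (q->head + 1) % (SLOTS + 1); q->count--;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mu);
    return v;
}

static char src_pool[SLOTS][CHUNK], dst_pool[SLOTS][CHUNK];
static struct queue free_q, filled_q, delivered_q;

static void *writer(void *a) {                 /* (2) write checkpoint chunks    */
    (void)a;
    for (int i = 0; i < TOTAL; i++) {
        int s = q_pop(&free_q);                /* wait for a free chunk          */
        memset(src_pool[s], i & 0xff, CHUNK);  /* "write" checkpoint data        */
        q_push(&filled_q, s);
    }
    q_push(&filled_q, -1);                     /* end-of-stream marker           */
    return NULL;
}
static void *transfer(void *a) {               /* (3) ship each chunk when full  */
    (void)a;
    int s;
    while ((s = q_pop(&filled_q)) != -1) {
        memcpy(dst_pool[s], src_pool[s], CHUNK);   /* stand-in for an RDMA write */
        q_push(&delivered_q, s);
    }
    q_push(&delivered_q, -1);
    return NULL;
}
static void *reader(void *a) {                 /* (4) restart consumes chunks    */
    (void)a;
    long bytes = 0; int s;
    while ((s = q_pop(&delivered_q)) != -1) {
        bytes += CHUNK;                        /* a real reader feeds the restart */
        q_push(&free_q, s);                    /* recycle the chunk               */
    }
    printf("restarted process consumed %ld MB\n", bytes >> 20);
    return NULL;
}

int main(void) {
    pthread_t th[3];
    pthread_mutex_init(&free_q.mu, NULL);      pthread_cond_init(&free_q.cv, NULL);
    pthread_mutex_init(&filled_q.mu, NULL);    pthread_cond_init(&filled_q.cv, NULL);
    pthread_mutex_init(&delivered_q.mu, NULL); pthread_cond_init(&delivered_q.cv, NULL);
    for (int s = 0; s < SLOTS; s++) q_push(&free_q, s);  /* seed the buffer pool */
    pthread_create(&th[0], NULL, writer, NULL);
    pthread_create(&th[1], NULL, transfer, NULL);
    pthread_create(&th[2], NULL, reader, NULL);
    for (int i = 0; i < 3; i++) pthread_join(th[i], NULL);
    return 0;
}
```

The two design choices above map directly onto the CHUNK and SLOTS constants in this model: a larger chunk amortizes per-transfer overhead, while the pool size bounds how far the writer may run ahead of the reader.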
Outline
- Introduction and Motivation
- Profiling Process Migration
- Pipelined Process Migration with RDMA
- Performance Evaluation
- Conclusions and Future Work
Experiment Environment
- System setup
– Linux cluster with dual-socket, quad-core Xeon processors at 2.33 GHz
– Nodes connected by InfiniBand DDR (16 Gbps)
– Linux 2.6.30, FUSE 2.8.5
- NAS Parallel Benchmark suite, version 3.2.1
– LU/BT/SP with Class C/D input
- MVAPICH2 with the Job Migration Framework
– PPMR
– Local, Shared, RDMA+Local
Raw Data Bandwidth Test (1)
[Figure: the PPMR data path with the source-node segment highlighted (process -> FUSE -> buffer manager -> RDMA buffer pool). The measured metric is the aggregation bandwidth.]
Aggregation Bandwidth
- Write unit size = 128 KB
- Bandwidth saturates with 8-16 writer processes at roughly 800 MB/s
- Bandwidth is determined by FUSE and is insensitive to the buffer pool size
- A chunk size of 128 KB is generally the best
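The aggregation-bandwidth numbers above come from streaming 128 KB writes through the FUSE mount on the source node. A measurement of this kind can be approximated with a small micro-benchmark such as the sketch below; the mount path is a placeholder, and the real test runs several writer processes concurrently.

```c
/* Sketch of a write-bandwidth micro-benchmark: stream NCHUNKS writes of
 * WRITE_UNIT bytes to a file and report MB/s.  Point the path at the FUSE
 * mount used by the migration framework to estimate aggregation bandwidth,
 * or at any other filesystem for comparison.  Path and sizes are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define WRITE_UNIT (128 * 1024)     /* 128 KB, the write unit used in the test */
#define NCHUNKS    2048             /* 256 MB in total                          */

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/mnt/ppmr_fuse/ckpt.img"; /* placeholder */
    char *buf = malloc(WRITE_UNIT);
    if (!buf) return 1;
    memset(buf, 0xab, WRITE_UNIT);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NCHUNKS; i++) {
        if (write(fd, buf, WRITE_UNIT) != WRITE_UNIT) { perror("write"); return 1; }
    }
    fsync(fd);                       /* include the flush in the measurement */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb   = (double)NCHUNKS * WRITE_UNIT / (1024.0 * 1024.0);
    printf("%.1f MB in %.3f s = %.1f MB/s\n", mb, secs, mb / secs);
    free(buf);
    return 0;
}
```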
Raw Data Bandwidth Test (2)
[Figure: the PPMR data path with the node-to-node segment highlighted (RDMA buffer pool on the source node to RDMA buffer pool on the target node). The measured metric is the network transfer bandwidth.]
InfiniBand DDR Bandwidth
- InfiniBand DDR (16 Gbps) reaches its peak bandwidth of about 1450 MB/s once the chunk size exceeds 16 KB
Network Transfer Bandwidth
- Chunk size = 128 KB
- Bandwidth is insensitive to the buffer pool size
- 8 I/O streams can saturate the network
Raw Data Bandwidth Test (3)
[Figure: the PPMR data path measured end to end, from the writing processes on the source node to the restarted processes on the target node. The measured metric is the pipeline bandwidth.]
Pipeline Bandwidth
- Buffer pool = 8 MB, chunk size = 128 KB
- The end-to-end pipeline bandwidth is determined by the aggregation bandwidth
- It is insensitive to the buffer pool size, and a chunk size of 128 KB is generally the best
Time to Complete a Process Migration (Lower is Better)
[Chart: time to migrate 8 processes with PPMR (buffer pool = 8 MB, chunk size = 128 KB) versus the existing approaches; PPMR is 10.7x, 2.3x and 4.3x faster.]
Application Execution Time (Lower is Better)
[Chart: application execution time with and without migration; the added overheads are +5.1%, +9.2% and +38% relative to the no-migration run.]
Scalability: Memory Footprint
[Chart: migration time for different problem sizes (64 processes on 8 nodes); PPMR achieves 10.9x, 2.6x and 7.3x speedups.]
Scalability: IO Multiplexing
[Chart: LU.D with 8/16/32/64 processes on 8 compute nodes; migration data = 1500 MB. Raising the number of processes per node from 1 to 8 multiplexes more I/O streams and yields better pipeline bandwidth.]
Outline
- Introduction and Motivation
- Profiling Process Migration
- Pipelined Process Migration with RDMA
- Performance Evaluation
- Conclusions and Future Work
Conclusions
- Process migration overcomes the drawbacks of Checkpoint/Restart
- Process migration must be optimized along its I/O path
- Pipelined Process Migration with RDMA (PPMR)
– Pipelines all the steps in the I/O path
Software Distribution
- The PPMR design has been released in MVAPICH2 1.7
– Downloadable from http://mvapich.cse.ohio-state.edu/
Future Work
- How PPMR can benefit general cluster applications
– Cluster-wide load balancing
– Server consolidation
- How a diskless cluster architecture can utilize PPMR