

SLIDE 1

High Performance Pipelined Process Migration with RDMA

Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Dhabaleswar K. (DK) Panda
Department of Computer Science & Engineering, The Ohio State University

SLIDE 2

Outline

  • Introduction and Motivation
  • Profiling Process Migration
  • Pipelined Process Migration with RDMA
  • Performance Evaluation
  • Conclusions and Future Work

SLIDE 3

Motivation

  • Computer clusters continue to grow larger
    – Heading towards the Multi-PetaFlop and ExaFlop era
    – Mean Time Between Failures (MTBF) is getting smaller
  • Fault tolerance becomes imperative
  • Checkpoint/Restart (C/R) – a common approach to fault tolerance
    – Checkpoint: save snapshots of all processes (I/O overhead)
    – Restart: restore and resubmit the job (I/O overhead + queue delay)
  • C/R drawbacks
    × Unnecessarily dumps all processes → I/O bottleneck
    × Resubmission queuing delay
  • Checkpoint/Restart alone doesn't scale to large systems

SLIDE 4

Job/Process Migration

  • Pro-active fault tolerance
    – Handles only the processes on the failing node
    – Relies on health monitoring mechanisms and failure prediction models
  • Five steps:
    (1) Suspend communication channels
    (2) Write process snapshots on the source node
    (3) Transfer process image files (source => target)
    (4) Read image files on the target node
    (5) Reconnect communication channels
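
To make the ordering concrete, here is a tiny illustrative C sketch of a migration controller running these five steps strictly in sequence; the function names are hypothetical and are not the MVAPICH2 migration API. This strict serialization is exactly what the pipelined design later removes.

```c
/* Hypothetical sketch: the five migration steps executed one after another,
 * as in the existing (non-pipelined) schemes. */
#include <stdio.h>

static void suspend_channels(void)   { puts("(1) suspend MPI communication channels"); }
static void write_snapshots(void)    { puts("(2) write process snapshots on the source node"); }
static void transfer_images(void)    { puts("(3) transfer image files, source => target"); }
static void read_images(void)        { puts("(4) read image files on the target node"); }
static void reconnect_channels(void) { puts("(5) reconnect communication channels"); }

int main(void)
{
    /* Each step must finish completely before the next one starts. */
    suspend_channels();
    write_snapshots();
    transfer_images();
    read_images();
    reconnect_channels();
    return 0;
}
```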

SLIDE 5

Process Migration Advantages

  • Overcomes the C/R drawbacks
    × Unnecessary dump of all processes
    × Resubmission queuing delay
  • A desirable feature for other applications
    – Cluster-wide load balancing
    – Server consolidation
    – Performance isolation

SLIDE 6

Existing MPI Process Migration

  • Available in MVAPICH2 and OpenMPI
  • Both suffer from low performance
  • Cause? Solution?

SLIDE 7

Problem Statements

  • What are the dominant factors in the high cost of process migration?
  • How to design an efficient protocol that minimizes this overhead?
    – How to optimize the checkpoint-related I/O path?
    – How to optimize the data transfer path?
    – How to leverage the RDMA transport to accelerate data transmission?
  • What will be the performance benefits?
SLIDE 8

Outline

  • Introduction and Motivation
  • Profiling Process Migration
  • Pipelined Process Migration with RDMA
  • Performance Evaluation
  • Conclusions and Future Work

SLIDE 9

MVAPICH/MVAPICH2 Software

  • MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
    – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
    – Used by more than 1,550 organizations worldwide (in 60 countries)
    – Empowering many TOP500 clusters (11th, 15th, …)
    – Available with the software stacks of many IB, 10GE/iWARP, RoCE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
    – Available with Red Hat and SuSE distributions
    – http://mvapich.cse.ohio-state.edu/
  • Has supported Checkpoint/Restart and Process Migration for the last several years
    – Already used by many organizations

SLIDE 10

Three Process Migration Approaches

  • MVAPICH2 already supports three process migration strategies
    – Local Filesystem-based Migration (Local)
    – Shared Filesystem-based Migration (Shared)
    – RDMA + Local Filesystem-based Migration (RDMA+Local)

SLIDE 11

Local Filesystem-based Process Migration (Local)

[Diagram: processes on the migration source node write their snapshots (2) through memory and the VFS page cache to the local filesystem; the image files are then transferred (3) over the network stack to the local filesystem of the migration target node, where the restarted processes read them back (4) through the target's page cache. (1) Suspend and (5) Reconnect bracket the sequence. The timeline shows Write, Transfer and Read running strictly one after another.]

SLIDE 12

Shared Filesystem-based Process Migration (Shared)

[Diagram: the source node writes snapshots through its VFS page cache to a shared filesystem, so the write (2) overlaps with a first network transfer (3, Transfer 1); the target node then reads the image files (4), which pulls the data from the shared filesystem through its own page cache (Transfer 2). (1) Suspend and (5) Reconnect bracket the sequence. The timeline shows Write overlapped with Transfer 1, followed by Transfer 2 and Read.]

SLIDE 13

RDMA + Local Filesystem-based Process Migration (RDMA+Local)

[Diagram: the source node writes snapshots into an RDMA buffer pool, so the write (2) overlaps with the RDMA transfer (3) to the buffer pool on the target node; the target drains the buffers to its local filesystem, and the restarted processes then read the image files (4) through the VFS page cache. (1) Suspend and (5) Reconnect bracket the sequence. The timeline shows Write and Transfer overlapped, followed by a separate Read phase.]

SLIDE 14

Profiling Process Migration Time Cost

[Chart: time breakdown for migrating 8 processes under each approach, split into Write (source node writes checkpoint files), Transfer (copy checkpoint from source to target) and Read (read image files on target node).]

  • All three approaches suffer from I/O cost
  • Conclusion: all three steps (Write, Transfer, Read) shall be optimized

SLIDE 15

Outline

  • Introduction and Motivation
  • Profiling Process Migration
  • Pipelined Process Migration with RDMA
  • Performance Evaluation
  • Conclusions and Future Work

SLIDE 16

Pipelined Process Migration with RDMA (PPMR)

[Diagram: on the source node the migrating processes write their snapshots into a FUSE mount backed by a buffer manager and an RDMA buffer pool; chunks flow over RDMA into the buffer pool on the target node, where a buffer manager and FUSE mount feed the restarting processes directly. (1) Suspend and (5) Reconnect bracket the sequence. The timeline shows Write, Transfer and Read fully overlapped.]

SLIDE 17

Comparisons

[Diagram: timelines of the four schemes. Local runs Write, Transfer and Read strictly one after another; Shared overlaps Write with Transfer 1 but still needs Transfer 2 before Read; RDMA+Local overlaps Write and Transfer but reads only afterwards; PPMR overlaps Write, Transfer and Read completely.]
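
A rough timing model (my back-of-the-envelope framing, not an equation from the slides) makes the difference between these timelines explicit. Split the checkpoint into N chunks and let t_w, t_t, t_r be the per-chunk write, transfer and read times:

```latex
% Illustrative timing model: N chunks, per-chunk write/transfer/read times.
\begin{align*}
T_{\text{serial (Local)}} &\approx N\,(t_w + t_t + t_r) \\
T_{\text{PPMR}} &\approx N\,\max(t_w, t_t, t_r)
  + \bigl[(t_w + t_t + t_r) - \max(t_w, t_t, t_r)\bigr]
\end{align*}
```

The bracketed term is the pipeline fill/drain cost, which becomes negligible for large N, so a fully overlapped pipeline runs at roughly the speed of its slowest stage.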

SLIDE 18

PPMR Design Strategy

  • Fully pipelines the three key steps
    – Write at the source node
    – Transfer checkpoint data to the target node
    – Read process images
  • Efficient restart mechanism on the target node
    – Restart directly from RDMA data streams
  • Design choices
    – Buffer pool size, chunk size
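
The slides name the pipeline stages and the two design knobs but show no code, so the following is a minimal, self-contained sketch in plain C with pthreads: a bounded pool of fixed-size chunks lets stand-in Write, Transfer and Read stages run concurrently. The queue, stage functions and constants are illustrative assumptions, not the actual MVAPICH2, FUSE or RDMA code; only the chunk-size and pool-size values mirror the design choices above.

```c
/* A minimal sketch of the PPMR pipelining idea: a bounded pool of fixed-size
 * chunks lets the Write, Transfer and Read stages overlap. Stand-in code only;
 * the real system uses FUSE on both nodes and RDMA for the transfer stage. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE  (128 * 1024)  /* 128 KB chunk size (the value the slides favor) */
#define POOL_CHUNKS 64            /* 8 MB buffer pool / 128 KB per chunk            */
#define NUM_CHUNKS  256           /* total chunks pushed through the pipeline       */

struct chunk { char data[CHUNK_SIZE]; };

struct queue {                    /* bounded queue modelling one stage boundary */
    struct chunk *items[POOL_CHUNKS];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static struct queue to_transfer, to_read;

static void q_init(struct queue *q) {
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void q_push(struct queue *q, struct chunk *c) {
    pthread_mutex_lock(&q->lock);
    while (q->count == POOL_CHUNKS)          /* back-pressure: pool is full */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = c;
    q->tail = (q->tail + 1) % POOL_CHUNKS;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static struct chunk *q_pop(struct queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    struct chunk *c = q->items[q->head];
    q->head = (q->head + 1) % POOL_CHUNKS;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return c;
}

static void *write_stage(void *arg) {        /* (2) source node writes snapshot chunks */
    (void)arg;
    for (int i = 0; i < NUM_CHUNKS; i++) {
        struct chunk *c = malloc(sizeof(*c));
        memset(c->data, i & 0xff, CHUNK_SIZE);   /* stand-in for checkpoint data */
        q_push(&to_transfer, c);
    }
    return NULL;
}

static void *transfer_stage(void *arg) {     /* (3) stand-in for the RDMA transfer */
    (void)arg;
    for (int i = 0; i < NUM_CHUNKS; i++)
        q_push(&to_read, q_pop(&to_transfer));
    return NULL;
}

static void *read_stage(void *arg) {         /* (4) target consumes chunks to restart */
    (void)arg;
    long bytes = 0;
    for (int i = 0; i < NUM_CHUNKS; i++) {
        free(q_pop(&to_read));
        bytes += CHUNK_SIZE;
    }
    printf("restarted from %ld MB of streamed checkpoint data\n", bytes >> 20);
    return NULL;
}

int main(void) {
    pthread_t w, t, r;
    q_init(&to_transfer);
    q_init(&to_read);
    /* All three stages run concurrently, so Write, Transfer and Read overlap. */
    pthread_create(&w, NULL, write_stage, NULL);
    pthread_create(&t, NULL, transfer_stage, NULL);
    pthread_create(&r, NULL, read_stage, NULL);
    pthread_join(w, NULL);
    pthread_join(t, NULL);
    pthread_join(r, NULL);
    return 0;
}
```

Build with cc -pthread. Because all three stages run concurrently, the end-to-end rate is set by the slowest stage, which is consistent with the later observation that the pipeline bandwidth tracks the aggregation bandwidth.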

SLIDE 19

Outline

  • Introduction and Motivation
  • Profiling Process Migration
  • Pipelined Process Migration with RDMA
  • Performance Evaluation
  • Conclusions and Future Work

SLIDE 20

Experiment Environment

  • System setup
    – Linux cluster
      • Dual-socket, quad-core Xeon processors at 2.33 GHz
      • Nodes connected by InfiniBand DDR (16 Gbps)
      • Linux 2.6.30, FUSE 2.8.5
  • NAS Parallel Benchmark suite, version 3.2.1
    – LU/BT/SP with Class C/D input
  • MVAPICH2 with the Job Migration Framework
    – PPMR
    – Local, Shared, RDMA+Local

SLIDE 21

Raw Data Bandwidth Test (1): Aggregation Bandwidth

[Diagram: the PPMR data path with the measured segment highlighted: the rate at which the processes on the source node can write checkpoint data through the FUSE mount and buffer manager into the RDMA buffer pool.]

SLIDE 22

Aggregation Bandwidth

[Chart: aggregation bandwidth measured on the source node; write unit size = 128 KB.]

  • Saturates with 8-16 processes (~800 MB/s)
  • Bandwidth is determined by FUSE (insensitive to the buffer pool size)
  • A chunk size of 128 KB is generally the best

SLIDE 23

Raw Data Bandwidth Test (2): Network Transfer Bandwidth

[Diagram: the PPMR data path with the measured segment highlighted: the rate at which chunks move from the RDMA buffer pool on the source node to the RDMA buffer pool on the target node.]

SLIDE 24

InfiniBand DDR Bandwidth

[Chart: point-to-point RDMA bandwidth versus chunk size on InfiniBand DDR (16 Gbps); bandwidth saturates for chunk sizes above 16 KB, reaching a peak of about 1450 MB/s.]
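
A quick sanity check on these numbers (my arithmetic, not from the slides): a 4X InfiniBand DDR link signals at 20 Gbit/s, and 8b/10b encoding leaves the quoted 16 Gbit/s for data, so

```latex
% Theoretical data rate of a 4X InfiniBand DDR link vs. the measured peak.
20\ \mathrm{Gbit/s} \times \tfrac{8}{10} = 16\ \mathrm{Gbit/s} = 2\ \mathrm{GB/s},
\qquad
\frac{1450\ \mathrm{MB/s}}{2000\ \mathrm{MB/s}} \approx 0.73
```

i.e. the measured peak is roughly three quarters of the theoretical data rate, with the remainder commonly attributed to PCIe and transport protocol overheads.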

SLIDE 25

Network Transfer Bandwidth

[Chart: network transfer bandwidth versus number of I/O streams and buffer pool size; chunk size = 128 KB.]

  • Bandwidth is insensitive to the buffer pool size
  • 8 I/O streams can saturate the network

SLIDE 26

Raw Data Bandwidth Test (3): Pipeline Bandwidth

[Diagram: the PPMR data path with the measured segment highlighted: the end-to-end rate from the writing processes on the source node, through both RDMA buffer pools, to the reading processes on the target node.]

SLIDE 27

Pipeline Bandwidth

[Chart: end-to-end pipeline bandwidth; buffer pool = 8 MB, chunk size = 128 KB.]

  • Determined by the aggregation bandwidth
  • Insensitive to the buffer pool size
  • A chunk size of 128 KB is generally the best
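
One way to read this result (my framing, consistent with the measurements on the previous slides): in a fully overlapped pipeline the steady-state throughput is bounded by the slowest stage, and the FUSE-bound aggregation stage is slower than the network, so

```latex
% Steady-state bound of a fully overlapped pipeline (illustrative).
B_{\mathrm{pipeline}} \;\approx\; \min\!\bigl(B_{\mathrm{aggregation}},\ B_{\mathrm{network}}\bigr)
\;\approx\; \min(800,\ 1450)\ \mathrm{MB/s} \;=\; 800\ \mathrm{MB/s}
```

which is why the measured pipeline bandwidth is determined by the aggregation bandwidth rather than by the buffer pool size.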

SLIDE 28

Time to Complete a Process Migration (Lower is Better)

[Chart: time to migrate 8 processes under Local, Shared, RDMA+Local and PPMR (PPMR: buffer pool = 8 MB, chunk size = 128 KB), with annotated speedups of 10.7X, 2.3X and 4.3X for PPMR over the existing schemes.]

SLIDE 29

Application Execution Time (Lower is Better)

[Chart: application execution time with one migration, relative to the No Migration baseline, with annotated overheads of +5.1%, +9.2% and +38%.]

SLIDE 30

Scalability: Memory Footprint

[Chart: migration time for different problem sizes (64 processes on 8 nodes), with annotated speedups of 10.9X, 2.6X and 7.3X for PPMR over the existing schemes.]

SLIDE 31

Scalability: I/O Multiplexing

[Chart: migration time for LU.D with 8/16/32/64 processes on 8 compute nodes; migration data = 1500 MB.]

  • Processes per node: 1 → 4
    – More concurrent I/O streams give better pipeline bandwidth

SLIDE 32

Outline

  • Introduction and Motivation
  • Profiling Process Migration
  • Pipelined Process Migration with RDMA
  • Performance Evaluation
  • Conclusions and Future Work

SLIDE 33

Conclusions

  • Process migration overcomes the C/R drawbacks
  • Process migration shall be optimized along its I/O path
  • Pipelined Process Migration with RDMA (PPMR)
    – Pipelines all steps of the I/O path

SLIDE 34

Software Distribution

  • The PPMR design has been released in MVAPICH2 1.7
    – Downloadable from http://mvapich.cse.ohio-state.edu/

SLIDE 35

Future Work

  • How PPMR can benefit general cluster applications
    – Cluster-wide load balancing
    – Server consolidation
  • How a diskless cluster architecture can utilize PPMR

SLIDE 36

Thank you!

{ouyangx, rajachan, besseron, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://mvapich.cse.ohio-state.edu