

SLIDE 1

High Performance Pipelined Process Migration with RDMA

Xiangyong Ouyang, Raghunath Rajachandrasekar, Xavier Besseron, Dhabaleswar K. (DK) Panda
Department of Computer Science & Engineering, The Ohio State University

SLIDE 2

Outline

  • Introduction and Motivation
  • Profiling Process Migration
  • Pipelined Process Migration with RDMA
  • Performance Evaluation
  • Conclusions and Future Work

SLIDE 3

Motivation

  • Computer clusters continue to grow larger
    – Heading towards the Multi-PetaFlop and ExaFlop era
    – Mean Time Between Failures (MTBF) is getting smaller
  • Fault tolerance becomes imperative
  • Checkpoint/Restart (C/R) – a common approach to fault tolerance
    – Checkpoint: save snapshots of all processes (I/O overhead)
    – Restart: restore and resubmit the job (I/O overhead + queue delay)
  • C/R drawbacks
    × Unnecessarily dumps all processes → I/O bottleneck
    × Resubmission queuing delay
  • Checkpoint/Restart alone doesn't scale to large systems

SLIDE 4

Job/Process Migration

  • Pro-active fault tolerance
    – Handles only the processes on the failing node
    – Relies on health monitoring mechanisms and failure prediction models
  • Five steps:
    (1) Suspend communication channels
    (2) Write process snapshots on the source node
    (3) Transfer process image files (source => target)
    (4) Read image files on the target node
    (5) Reconnect communication channels
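
To make the ordering concrete, here is a tiny illustrative C sketch of a migration controller running these five steps strictly in sequence; the function names are hypothetical and are not the MVAPICH2 migration API. This strict serialization is exactly what the pipelined design later removes.

```c
/* Hypothetical sketch: the five migration steps executed one after another,
 * as in the existing (non-pipelined) schemes. */
#include <stdio.h>

static void suspend_channels(void)   { puts("(1) suspend MPI communication channels"); }
static void write_snapshots(void)    { puts("(2) write process snapshots on the source node"); }
static void transfer_images(void)    { puts("(3) transfer image files, source => target"); }
static void read_images(void)        { puts("(4) read image files on the target node"); }
static void reconnect_channels(void) { puts("(5) reconnect communication channels"); }

int main(void)
{
    /* Each step must finish completely before the next one starts. */
    suspend_channels();
    write_snapshots();
    transfer_images();
    read_images();
    reconnect_channels();
    return 0;
}
```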

SLIDE 5

Process Migration Advantages

  • Overcomes the C/R drawbacks
    × Unnecessary dump of all processes
    × Resubmission queuing delay
  • A desirable feature for other applications
    – Cluster-wide load balancing
    – Server consolidation
    – Performance isolation

SLIDE 6

Existing MPI Process Migration

  • Available in MVAPICH2 and OpenMPI
  • Both suffer from low performance
  • Cause? Solution?

SLIDE 7

Problem Statements

  • What are the dominant factors in the high cost of process migration?
  • How to design an efficient protocol that minimizes this overhead?
    – How to optimize the checkpoint-related I/O path?
    – How to optimize the data transfer path?
    – How to leverage the RDMA transport to accelerate data transmission?
  • What will be the performance benefits?
SLIDE 8

Outline

  • Introduction and Motivation
  • Profiling Process Migration
  • Pipelined Process Migration with RDMA
  • Performance Evaluation
  • Conclusions and Future Work

SLIDE 9

MVAPICH/MVAPICH2 Software

  • MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
    – MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
    – Used by more than 1,550 organizations worldwide (in 60 countries)
    – Empowering many TOP500 clusters (11th, 15th, …)
    – Available with the software stacks of many IB, 10GE/iWARP, RoCE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
    – Available with Red Hat and SuSE distributions
    – http://mvapich.cse.ohio-state.edu/
  • Has supported Checkpoint/Restart and Process Migration for the last several years
    – Already used by many organizations

SLIDE 10

Three Process Migration Approaches

  • MVAPICH2 already supports three process migration strategies
    – Local Filesystem-based Migration (Local)
    – Shared Filesystem-based Migration (Shared)
    – RDMA + Local Filesystem-based Migration (RDMA+Local)

SLIDE 11

Local Filesystem-based Process Migration (Local)

[Diagram: processes on the migration source node write their snapshots (2) through memory and the VFS page cache to the local filesystem; the image files are then transferred (3) over the network stack to the local filesystem of the migration target node, where the restarted processes read them back (4) through the target's page cache. (1) Suspend and (5) Reconnect bracket the sequence. The timeline shows Write, Transfer and Read running strictly one after another.]

SLIDE 12

Shared Filesystem-based Process Migration (Shared)

[Diagram: the source node writes snapshots through its VFS page cache to a shared filesystem, so the write (2) overlaps with a first network transfer (3, Transfer 1); the target node then reads the image files (4), which pulls the data from the shared filesystem through its own page cache (Transfer 2). (1) Suspend and (5) Reconnect bracket the sequence. The timeline shows Write overlapped with Transfer 1, followed by Transfer 2 and Read.]

SLIDE 13

RDMA + Local Filesystem-based Process Migration (RDMA+Local)

[Diagram: the source node writes snapshots into an RDMA buffer pool, so the write (2) overlaps with the RDMA transfer (3) to the buffer pool on the target node; the target drains the buffers to its local filesystem, and the restarted processes then read the image files (4) through the VFS page cache. (1) Suspend and (5) Reconnect bracket the sequence. The timeline shows Write and Transfer overlapped, followed by a separate Read phase.]

SLIDE 14

Profiling Process Migration Time Cost

[Chart: time breakdown for migrating 8 processes under each approach, split into Write (source node writes checkpoint files), Transfer (copy checkpoint from source to target) and Read (read image files on target node).]

  • All three approaches suffer from I/O cost
  • Conclusion: all three steps (Write, Transfer, Read) shall be optimized

SLIDE 15

Outline

  • Introduction and Motivation
  • Profiling Process Migration
  • Pipelined Process Migration with RDMA
  • Performance Evaluation
  • Conclusions and Future Work

SLIDE 16

Pipelined Process Migration with RDMA (PPMR)

[Diagram: on the source node the migrating processes write their snapshots into a FUSE mount backed by a buffer manager and an RDMA buffer pool; chunks flow over RDMA into the buffer pool on the target node, where a buffer manager and FUSE mount feed the restarting processes directly. (1) Suspend and (5) Reconnect bracket the sequence. The timeline shows Write, Transfer and Read fully overlapped.]

SLIDE 17

Comparisons

[Diagram: timelines of the four schemes. Local runs Write, Transfer and Read strictly one after another; Shared overlaps Write with Transfer 1 but still needs Transfer 2 before Read; RDMA+Local overlaps Write and Transfer but reads only afterwards; PPMR overlaps Write, Transfer and Read completely.]
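
A rough timing model (my back-of-the-envelope framing, not an equation from the slides) makes the difference between these timelines explicit. Split the checkpoint into N chunks and let t_w, t_t, t_r be the per-chunk write, transfer and read times:

```latex
% Illustrative timing model: N chunks, per-chunk write/transfer/read times.
\begin{align*}
T_{\text{serial (Local)}} &\approx N\,(t_w + t_t + t_r) \\
T_{\text{PPMR}} &\approx N\,\max(t_w, t_t, t_r)
  + \bigl[(t_w + t_t + t_r) - \max(t_w, t_t, t_r)\bigr]
\end{align*}
```

The bracketed term is the pipeline fill/drain cost, which becomes negligible for large N, so a fully overlapped pipeline runs at roughly the speed of its slowest stage.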

SLIDE 18

PPMR Design Strategy

  • Fully pipelines the three key steps
    – Write at the source node
    – Transfer checkpoint data to the target node
    – Read process images
  • Efficient restart mechanism on the target node
    – Restart directly from RDMA data streams
  • Design choices
    – Buffer pool size, chunk size
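
The slides name the pipeline stages and the two design knobs but show no code, so the following is a minimal, self-contained sketch in plain C with pthreads: a bounded pool of fixed-size chunks lets stand-in Write, Transfer and Read stages run concurrently. The queue, stage functions and constants are illustrative assumptions, not the actual MVAPICH2, FUSE or RDMA code; only the chunk-size and pool-size values mirror the design choices above.

```c
/* A minimal sketch of the PPMR pipelining idea: a bounded pool of fixed-size
 * chunks lets the Write, Transfer and Read stages overlap. Stand-in code only;
 * the real system uses FUSE on both nodes and RDMA for the transfer stage. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE  (128 * 1024)  /* 128 KB chunk size (the value the slides favor) */
#define POOL_CHUNKS 64            /* 8 MB buffer pool / 128 KB per chunk            */
#define NUM_CHUNKS  256           /* total chunks pushed through the pipeline       */

struct chunk { char data[CHUNK_SIZE]; };

struct queue {                    /* bounded queue modelling one stage boundary */
    struct chunk *items[POOL_CHUNKS];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static struct queue to_transfer, to_read;

static void q_init(struct queue *q) {
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void q_push(struct queue *q, struct chunk *c) {
    pthread_mutex_lock(&q->lock);
    while (q->count == POOL_CHUNKS)          /* back-pressure: pool is full */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = c;
    q->tail = (q->tail + 1) % POOL_CHUNKS;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static struct chunk *q_pop(struct queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    struct chunk *c = q->items[q->head];
    q->head = (q->head + 1) % POOL_CHUNKS;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return c;
}

static void *write_stage(void *arg) {        /* (2) source node writes snapshot chunks */
    (void)arg;
    for (int i = 0; i < NUM_CHUNKS; i++) {
        struct chunk *c = malloc(sizeof(*c));
        memset(c->data, i & 0xff, CHUNK_SIZE);   /* stand-in for checkpoint data */
        q_push(&to_transfer, c);
    }
    return NULL;
}

static void *transfer_stage(void *arg) {     /* (3) stand-in for the RDMA transfer */
    (void)arg;
    for (int i = 0; i < NUM_CHUNKS; i++)
        q_push(&to_read, q_pop(&to_transfer));
    return NULL;
}

static void *read_stage(void *arg) {         /* (4) target consumes chunks to restart */
    (void)arg;
    long bytes = 0;
    for (int i = 0; i < NUM_CHUNKS; i++) {
        free(q_pop(&to_read));
        bytes += CHUNK_SIZE;
    }
    printf("restarted from %ld MB of streamed checkpoint data\n", bytes >> 20);
    return NULL;
}

int main(void) {
    pthread_t w, t, r;
    q_init(&to_transfer);
    q_init(&to_read);
    /* All three stages run concurrently, so Write, Transfer and Read overlap. */
    pthread_create(&w, NULL, write_stage, NULL);
    pthread_create(&t, NULL, transfer_stage, NULL);
    pthread_create(&r, NULL, read_stage, NULL);
    pthread_join(w, NULL);
    pthread_join(t, NULL);
    pthread_join(r, NULL);
    return 0;
}
```

Build with cc -pthread. Because all three stages run concurrently, the end-to-end rate is set by the slowest stage, which is consistent with the later observation that the pipeline bandwidth tracks the aggregation bandwidth.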

SLIDE 19

Outline

  • Introduction and Motivation
  • Profiling Process Migration
  • Pipelined Process Migration with RDMA
  • Performance Evaluation
  • Conclusions and Future Work

SLIDE 20

Experiment Environment

  • System setup
    – Linux cluster
      • Dual-socket, quad-core Xeon processors at 2.33 GHz
      • Nodes connected by InfiniBand DDR (16 Gbps)
      • Linux 2.6.30, FUSE 2.8.5
  • NAS Parallel Benchmark suite, version 3.2.1
    – LU/BT/SP with Class C/D input
  • MVAPICH2 with the Job Migration Framework
    – PPMR
    – Local, Shared, RDMA+Local

SLIDE 21

Raw Data Bandwidth Test (1): Aggregation Bandwidth

[Diagram: the PPMR data path with the measured segment highlighted: the rate at which the processes on the source node can write checkpoint data through the FUSE mount and buffer manager into the RDMA buffer pool.]

SLIDE 22

Aggregation Bandwidth

[Chart: aggregation bandwidth measured on the source node; write unit size = 128 KB.]

  • Saturates with 8-16 processes (~800 MB/s)
  • Bandwidth is determined by FUSE (insensitive to the buffer pool size)
  • A chunk size of 128 KB is generally the best

SLIDE 23

Raw Data Bandwidth Test (2): Network Transfer Bandwidth

[Diagram: the PPMR data path with the measured segment highlighted: the rate at which chunks move from the RDMA buffer pool on the source node to the RDMA buffer pool on the target node.]

SLIDE 24

InfiniBand DDR Bandwidth

[Chart: point-to-point RDMA bandwidth versus chunk size on InfiniBand DDR (16 Gbps); bandwidth saturates for chunk sizes above 16 KB, reaching a peak of about 1450 MB/s.]
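
A quick sanity check on these numbers (my arithmetic, not from the slides): a 4X InfiniBand DDR link signals at 20 Gbit/s, and 8b/10b encoding leaves the quoted 16 Gbit/s for data, so

```latex
% Theoretical data rate of a 4X InfiniBand DDR link vs. the measured peak.
20\ \mathrm{Gbit/s} \times \tfrac{8}{10} = 16\ \mathrm{Gbit/s} = 2\ \mathrm{GB/s},
\qquad
\frac{1450\ \mathrm{MB/s}}{2000\ \mathrm{MB/s}} \approx 0.73
```

i.e. the measured peak is roughly three quarters of the theoretical data rate, with the remainder commonly attributed to PCIe and transport protocol overheads.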

SLIDE 25

Network Transfer Bandwidth

[Chart: network transfer bandwidth versus number of I/O streams and buffer pool size; chunk size = 128 KB.]

  • Bandwidth is insensitive to the buffer pool size
  • 8 I/O streams can saturate the network

SLIDE 26

Raw Data Bandwidth Test (3): Pipeline Bandwidth

[Diagram: the PPMR data path with the measured segment highlighted: the end-to-end rate from the writing processes on the source node, through both RDMA buffer pools, to the reading processes on the target node.]

SLIDE 27

Pipeline Bandwidth

[Chart: end-to-end pipeline bandwidth; buffer pool = 8 MB, chunk size = 128 KB.]

  • Determined by the aggregation bandwidth
  • Insensitive to the buffer pool size
  • A chunk size of 128 KB is generally the best
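
One way to read this result (my framing, consistent with the measurements on the previous slides): in a fully overlapped pipeline the steady-state throughput is bounded by the slowest stage, and the FUSE-bound aggregation stage is slower than the network, so

```latex
% Steady-state bound of a fully overlapped pipeline (illustrative).
B_{\mathrm{pipeline}} \;\approx\; \min\!\bigl(B_{\mathrm{aggregation}},\ B_{\mathrm{network}}\bigr)
\;\approx\; \min(800,\ 1450)\ \mathrm{MB/s} \;=\; 800\ \mathrm{MB/s}
```

which is why the measured pipeline bandwidth is determined by the aggregation bandwidth rather than by the buffer pool size.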

SLIDE 28

Time to Complete a Process Migration (Lower is Better)

[Chart: time to migrate 8 processes under Local, Shared, RDMA+Local and PPMR (PPMR: buffer pool = 8 MB, chunk size = 128 KB), with annotated speedups of 10.7X, 2.3X and 4.3X for PPMR over the existing schemes.]

SLIDE 29

Application Execution Time (Lower is Better)

[Chart: application execution time with one migration, relative to the No Migration baseline, with annotated overheads of +5.1%, +9.2% and +38%.]

SLIDE 30

Scalability: Memory Footprint

[Chart: migration time for different problem sizes (64 processes on 8 nodes), with annotated speedups of 10.9X, 2.6X and 7.3X for PPMR over the existing schemes.]

SLIDE 31

Scalability: I/O Multiplexing

[Chart: migration time for LU.D with 8/16/32/64 processes on 8 compute nodes; migration data = 1500 MB.]

  • Processes per node: 1 → 4
    – More concurrent I/O streams give better pipeline bandwidth

SLIDE 32

Outline

  • Introduction and Motivation
  • Profiling Process Migration
  • Pipelined Process Migration with RDMA
  • Performance Evaluation
  • Conclusions and Future Work

SLIDE 33

Conclusions

  • Process migration overcomes the C/R drawbacks
  • Process migration shall be optimized along its I/O path
  • Pipelined Process Migration with RDMA (PPMR)
    – Pipelines all steps of the I/O path

SLIDE 34

Software Distribution

  • The PPMR design has been released in MVAPICH2 1.7
    – Downloadable from http://mvapich.cse.ohio-state.edu/

SLIDE 35

Future Work

  • How PPMR can benefit general cluster applications
    – Cluster-wide load balancing
    – Server consolidation
  • How a diskless cluster architecture can utilize PPMR

SLIDE 36

Thank you!

{ouyangx, rajachan, besseron, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory
http://mvapich.cse.ohio-state.edu