Optimized MPI Gather Collective for Many Integrated Core (MIC) InfiniBand Clusters
Akshay Venkatesh, Krishna Kandalla, Dhabaleswar K. Panda
Network-based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
2
Outline
- Introduction
- Problem Statement
- Designs
- Experimental Evaluation and Analyses
- Conclusion and Future Work
3
Scientific applications, Accelerators and MPI
- Several areas such as medical sciences, atmospheric research, and earthquake simulations rely on the speed of computations for better prediction and analysis
- Many applications benefit from the use of large-scale systems
- Accelerators/coprocessors further increase computational speed and improve energy efficiency
- MPI continues to be widely used as HPC embraces heterogeneous architectures
- MPI(+X) continues to be the predominant programming model in HPC
MVAPICH2/MVAPICH2-X Software
- High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2012
– Used by more than 2,055 organizations (HPC centers, industry, and universities) in 70 countries
– More than 181,000 downloads directly from the OSU site
– Empowering many TOP500 clusters
- 6th-ranked 462,462-core cluster (Stampede) at TACC
- 19th-ranked 125,980-core cluster (Pleiades) at NASA
- 21st-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology, and many others
– Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
- Partner in the U.S. NSF-TACC Stampede System
4 XSCALE13
5
Latest Version of MVAPICH2
- Support for GPU-Aware MPI
- Optimized and tuned point-to-point operations involving GPU buffers
- Support for GPUDirect RDMA
- Optimized GPU Collectives
- Ongoing effort to design a high-performance library that enables MPI communication on MIC clusters
6
Intel Xeon Phi Specifications
- Belongs to the Many Integrated Core (MIC) family
- 61 cores on the chip (each running at 1 GHz)
- 4 hardware threads per core (smart round-robin scheduler)
- 1 teraflop peak throughput and energy efficient
- x86-compatible and supports OpenMP, MPI, Cilk, etc.
- Installed in the compute node as a PCI Express device
7
MPI on MIC Clusters (1)
- Stampede has ~6,000 MIC Coprocessors
- Tianhe-2 has ~48,000 MIC Coprocessors
- Through MPSS*, MIC coprocessors can directly use IB HCAs via peer-to-peer PCIe communication for inter- and intra-node communication
- MPI is predominantly used to make use of multiple such compute nodes in tandem
*MPSS – Intel Manycore Platform Software Stack
8
MPI on MIC Clusters(2)
- MIC supports various modes of operation
– Offload mode
– Coprocessor-only or Native mode
– Symmetric mode
- Non-uniform host and destination platforms
– Host to MIC, MIC to host, MIC-to-MIC
- Transfers involving the MIC incur additional cost owing to the expensive PCIe path
9
Symmetric mode and Implications
- Non-uniform host and destination platforms
– Host to MIC
– MIC to host
– MIC to MIC
- Transfers involving the MIC incur additional cost owing to the expensive PCIe path
- Performance is non-uniform
10
Symmetric mode and Implications
- MIC->MIC Latency = 8 X Host->Host Latency
- Host->IB NIC bandwidth = 6 X MIC->IB NIC Bandwidth
[Chart: latency (us x 1000) vs. message size (256K–16M bytes) for host->remote_host, host->remote_mic, mic->remote_host, and mic->remote_mic]
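A minimal ping-pong sketch (in the spirit of osu_latency) can expose this path asymmetry; it is not the benchmark used for the chart above. The rank placement (rank 0 on a host, rank 1 on a local or remote MIC) is assumed to be controlled by the launcher, and the message size and iteration count are illustrative.

```c
/* Minimal ping-pong latency sketch; run in symmetric mode with rank 0 on a
 * host and rank 1 on a MIC (or any other pairing) and compare the results. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NBYTES (1 << 20)   /* 1 MB message, illustrative */
#define ITERS  100         /* illustrative iteration count */

int main(int argc, char **argv)
{
    int rank;
    static char buf[NBYTES];
    memset(buf, 0, sizeof buf);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {            /* e.g. a host rank */
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* e.g. a MIC rank */
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("avg one-way latency: %.1f us\n",
               (MPI_Wtime() - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```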
11
MPI on MIC Clusters(2)
- MIC supports various modes of operation
– Offload mode – Coprocessor-only or Native mode – Symmetric mode
- Besides point-to-point primitives, MPI Standard also
defines a set of collectives such as:
– MPI_Bcast
– MPI_Gather
12
MPI_Gather
- Gather used in
– Multi-agent heuristics
– Mini applications
– Can be used for reduction operations and more
[Diagram: each process contributes its data elements (1, 2, 3), which are collected at the ROOT process]
13
MPI Gather(2)
- MPI_Gather
– One root process receives data from every other process
- On homogeneous systems the collective adopts
– Linear scheme
– Binomial scheme
– Hierarchical scheme
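A minimal MPI_Gather example in C: every rank contributes one integer and rank 0 (the root) collects them in rank order. The buffer sizes and the use of MPI_INT are illustrative; the internal scheme (linear, binomial, hierarchical) is chosen by the MPI library.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendval = rank;                 /* each process contributes its rank */
    int *recvbuf = NULL;
    if (rank == 0)                      /* only the root needs a receive buffer */
        recvbuf = malloc((size_t)size * sizeof(int));

    /* Root (rank 0) receives one int from every process, ordered by rank. */
    MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("from rank %d: %d\n", i, recvbuf[i]);
        free(recvbuf);
    }
    MPI_Finalize();
    return 0;
}
```

Launched with, e.g., `mpirun -np 8 ./gather_example` (the binary name is hypothetical), rank 0 prints one value per rank.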
14
Typical MPI_Gather on MIC Clusters
- Yellow grid boxes
represent host processors
- Blue grid boxes
represent MIC coprocessors
[Diagram: four nodes (Node 0–3), each with a host (Host 0–3), a MIC coprocessor (MIC 0–3), and an IB HCA]
15
Typical MPI_Gather on MIC Clusters
- Hierarchical or leader-based scheme
- Communicator per node
- Leader is the least rank in the node
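A sketch of how such a leader-based gather can be pieced together from standard MPI calls; this is not the MVAPICH2 implementation. MPI_COMM_TYPE_SHARED stands in for "communicator per node" (in symmetric mode the host and MIC run separate OS images, so a real implementation would split on an explicit node key), and equal rank counts per node plus a world root that is itself a leader are assumed.

```c
#include <mpi.h>
#include <stdlib.h>

/* Leader-based gather of 'count' ints per rank; recvbuf is significant only
 * at the global root (world rank 0). */
void hierarchical_gather_int(const int *sendbuf, int count, int *recvbuf)
{
    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    /* Communicator per node; the leader is the least rank in the node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, wrank,
                        MPI_INFO_NULL, &node_comm);
    int nrank, nsize;
    MPI_Comm_rank(node_comm, &nrank);
    MPI_Comm_size(node_comm, &nsize);

    /* Leaders form their own communicator for the inter-node step. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, nrank == 0 ? 0 : MPI_UNDEFINED, wrank,
                   &leader_comm);

    /* Step 1: gather inside the node at the local leader. */
    int *node_buf = (nrank == 0)
        ? malloc((size_t)nsize * count * sizeof(int)) : NULL;
    MPI_Gather(sendbuf, count, MPI_INT, node_buf, count, MPI_INT, 0, node_comm);

    /* Step 2: leaders gather the aggregated node blocks at the root leader;
     * only leaders touch the network in this step. */
    if (nrank == 0) {
        MPI_Gather(node_buf, nsize * count, MPI_INT,
                   recvbuf, nsize * count, MPI_INT, 0, leader_comm);
        MPI_Comm_free(&leader_comm);
        free(node_buf);
    }
    MPI_Comm_free(&node_comm);
}
```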
16
Typical MPI_Gather on MIC Clusters
- Local leader on the
MIC directly uses the NIC
- IB reading from MIC
is costly
- When the root of the gather is on the MIC, there are transfers from the local MIC to the remote MIC
- This can be very
costly
17
Outline
- Introduction
- Problem Statement
- Designs
- Experimental Evaluation and Analyses
- Conclusion and Future Work
18
Problem Statement
- What are the primary bottlenecks that affect the
performance of the MPI Gather operation on MIC clusters?
- Can we design algorithms to overcome architecture
specific performance deficits to improve gather latency?
- Can we analyze and quantify the benefits of our
proposed approach using micro-benchmarks?
19
MPI Collectives on MIC
- Primitive operations such as MPI_Send, MPI_Recv, and their non-blocking counterparts have been optimized*
- MPI collectives such as MPI_Alltoall, MPI_Scatter, etc., which are designed on top of p2p primitives, immediately benefit from such optimizations
- Can we do better?
*S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla and D. K. Panda: Efficient Intra-node Communication on Intel-MIC Clusters, CCGrid’13
20
Outline
- Introduction
- Problem Statement
- Designs
- Experimental Evaluation and Analyses
- Conclusion and Future Work
21
Design Goals
- Avoid IB reading from MIC
– Especially for large transfers
- Use pipelining methods
- Overlap operations when possible
22
Designs
- 3-level hierarchical algorithm
- Pipelined Algorithm
- Overlapped 3-Level-hierarchical algorithm
23
Designs
- 3-level hierarchical algorithm
- Pipelined Algorithm
- Overlapped 3-Level-hierarchical algorithm
24
Design1: 3-Level-Hierarchical Algorithm
- Step 1: Same as the default hierarchical or leader-based scheme
25
Design1: 3-Level-Hierarchical Algorithm
- Step 2: Transfer of MIC-aggregated data to the host over PCIe
- The difference? SCIF is used, and its performance is comparatively good
26
Design1: 3-Level-Hierarchical Algorithm
- Advantage: Step 3, which transfers the large aggregate message, does not involve any IB reads from the MIC
- Disadvantage: MIC cores are slow, so intra-MIC gathers are slower
- => The host leader needs to wait
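An illustrative sketch of the three levels in plain MPI, not the MVAPICH2 code. The communicators (mic_comm, host_comm, node_comm, leader_comm), the leader ranks, and the assumption of equal rank counts per MIC and per host are hypothetical simplifications.

```c
#include <mpi.h>
#include <stdlib.h>

/* 3-level gather of 'count' ints per rank.  Assumed (hypothetical) layout:
 *   mic_comm    - ranks on this node's MIC (leader = rank 0)
 *   host_comm   - ranks on this node's host (leader = rank 0)
 *   node_comm   - this node's host leader (rank 0) and MIC leader (rank 1)
 *   leader_comm - one host leader per node, the root's leader at rank 0
 * rootbuf is significant only at the root leader. */
void gather_3level_int(const int *sendbuf, int count, int *rootbuf,
                       int on_mic, int ranks_per_mic,
                       MPI_Comm mic_comm, MPI_Comm host_comm,
                       MPI_Comm node_comm, MPI_Comm leader_comm)
{
    if (on_mic) {
        int mrank;
        MPI_Comm_rank(mic_comm, &mrank);
        int *mic_agg = (mrank == 0)
            ? malloc((size_t)ranks_per_mic * count * sizeof(int)) : NULL;

        /* Level 1: intra-MIC gather to the MIC leader (slow MIC cores). */
        MPI_Gather(sendbuf, count, MPI_INT, mic_agg, count, MPI_INT, 0, mic_comm);

        /* Level 2: MIC leader ships the aggregate to the host leader; inside
         * MVAPICH2 this PCIe hop is SCIF-optimized. */
        if (mrank == 0) {
            MPI_Send(mic_agg, ranks_per_mic * count, MPI_INT, 0, 0, node_comm);
            free(mic_agg);
        }
    } else {
        int hrank, hsize;
        MPI_Comm_rank(host_comm, &hrank);
        MPI_Comm_size(host_comm, &hsize);
        int node_vals = (hsize + ranks_per_mic) * count;
        int *node_agg = (hrank == 0) ? malloc((size_t)node_vals * sizeof(int)) : NULL;

        /* Host-side gather to the host leader. */
        MPI_Gather(sendbuf, count, MPI_INT, node_agg, count, MPI_INT, 0, host_comm);

        if (hrank == 0) {
            /* Append the MIC aggregate arriving over PCIe. */
            MPI_Recv(node_agg + (size_t)hsize * count, ranks_per_mic * count,
                     MPI_INT, 1, 0, node_comm, MPI_STATUS_IGNORE);

            /* Level 3: only host leaders talk to the HCA, so the large
             * aggregate never triggers an IB read from MIC memory. */
            MPI_Gather(node_agg, node_vals, MPI_INT,
                       rootbuf, node_vals, MPI_INT, 0, leader_comm);
            free(node_agg);
        }
    }
}
```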
27
Designs
- 3-level hierarchical algorithm
- Pipelined Algorithm
- Overlapped 3-Level-hierarchical algorithm
28
Design 2: Pipelined Gather
- The leader on each host posts non-blocking receives from all processes within the node
- The leader sends its own data to the gather root
- Each process within the node sends its data to the leader on the host
- The host leader forwards it to the gather root
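A sketch of the host-leader side of this pipelined idea in plain MPI, not the MVAPICH2 code. The node_comm layout, the tags, and the fixed per-rank count are illustrative assumptions; a real implementation would also carry origin-rank information so the root can place each chunk.

```c
#include <mpi.h>
#include <stdlib.h>

/* Pipelined forwarding at a host leader.  Assumed (hypothetical) layout:
 * node_comm holds all ranks on this node (host + MIC) with the host leader
 * at rank 0; non-leaders simply MPI_Send 'count' ints to rank 0 of node_comm. */
void pipelined_leader(const int *own_data, int count, int root,
                      MPI_Comm node_comm, MPI_Comm world_comm)
{
    int nlocal;
    MPI_Comm_size(node_comm, &nlocal);

    int *chunks = malloc((size_t)nlocal * count * sizeof(int));
    MPI_Request *reqs = malloc((size_t)(nlocal - 1) * sizeof(MPI_Request));

    /* Post non-blocking receives from every other rank on this node. */
    for (int r = 1; r < nlocal; r++)
        MPI_Irecv(chunks + (size_t)r * count, count, MPI_INT, r, 0,
                  node_comm, &reqs[r - 1]);

    /* Send the leader's own contribution to the gather root right away. */
    MPI_Send(own_data, count, MPI_INT, root, 0, world_comm);

    /* Forward each local chunk as soon as it lands, so the PCIe hop and the
     * IB hop overlap instead of being serialized. */
    for (int done = 0; done < nlocal - 1; done++) {
        int idx;
        MPI_Waitany(nlocal - 1, reqs, &idx, MPI_STATUS_IGNORE);
        MPI_Send(chunks + (size_t)(idx + 1) * count, count, MPI_INT,
                 root, 0, world_comm);
    }

    free(reqs);
    free(chunks);
}
```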
29
Design 2: Pipelined Gather
- Advantage: None of
the steps involve IB reading from MIC
- Disadvantage: the majority of the transfer burden lies on the host leaders
- => Processing non-blocking receives on the host leader can become a bottleneck
30
Designs
- 3-level hierarchical algorithm
- Pipelined Algorithm
- Overlapped 3-Level-hierarchical algorithm
31
Design 3: 3-Level-Hierarchical Overlapped Algorithm
- Step 1: Same as the default hierarchical or leader-based scheme
32
Design 3: 3-Level-Hierarchical Overlapped Algorithm
- The host leader first posts a non-blocking receive from the MIC leader on the node
- The host leader gathers data from the host processes and sends it to the root
- Meanwhile, the MIC leader starts sending its data to the host leader
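A sketch of this overlap at the host leader in plain MPI, not the MVAPICH2 code. The communicator layout, leader ranks, and message tags are illustrative assumptions.

```c
#include <mpi.h>
#include <stdlib.h>

/* Overlapped host-leader step.  Assumed (hypothetical) layout: host_comm
 * holds this node's host ranks (leader = rank 0); node_comm pairs the host
 * leader (rank 0) with the MIC leader (rank 1); 'mic_vals' is the size of
 * the MIC leader's aggregate in ints; tags 1 and 2 are arbitrary. */
void overlapped_leader(const int *own_data, int count, int mic_vals, int root,
                       MPI_Comm host_comm, MPI_Comm node_comm, MPI_Comm world_comm)
{
    int hsize;
    MPI_Comm_size(host_comm, &hsize);

    int *mic_agg  = malloc((size_t)mic_vals * sizeof(int));
    int *host_agg = malloc((size_t)hsize * count * sizeof(int));

    /* 1. Post the receive for the MIC aggregate first; the slow intra-MIC
     *    gather and the PCIe transfer now proceed in the background. */
    MPI_Request req;
    MPI_Irecv(mic_agg, mic_vals, MPI_INT, 1, 0, node_comm, &req);

    /* 2. Meanwhile, gather the host ranks' data and ship it to the root. */
    MPI_Gather(own_data, count, MPI_INT, host_agg, count, MPI_INT, 0, host_comm);
    MPI_Send(host_agg, hsize * count, MPI_INT, root, 1, world_comm);

    /* 3. Only now wait for the MIC data and forward it: the host leader is
     *    never idle waiting on the MIC, and no IB read touches MIC memory. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Send(mic_agg, mic_vals, MPI_INT, root, 2, world_comm);

    free(host_agg);
    free(mic_agg);
}
```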
33
Design 3: 3-Level-Hierarchical Overlapped Algorithm
- Advantage: No step
involves transferring the large aggregate message using IB reads from MIC
- Advantage: Host
does not wait for relatively slow MIC leader to finish and send data
- Experiments show
good overlap
- Host leader
forwards MIC data to gather root
34
Designs
- 3-level hierarchical algorithm
- Pipelined Algorithm
- Overlapped 3-Level-hierarchical algorithm
– Direct variant for MIC rooted gathers
35
Design 3: 3-Level-Hierarchical Overlapped Algorithm (MIC root)
- Root on MIC
- Data is still channeled through the host leader on the node
- Advantage: SCIF write is used
36
Design 3: 3-Level-Hierarchical Overlapped Algorithm (MIC root)
- Host leader on the
node where MIC gather root exists becomes a bottleneck
37
Design 3: 3-Level-Hierarchical Overlapped Algorithm (Direct Variant)
- IB writes into the MIC perform comparatively well (unlike IB reads from the MIC)
38
Design 3: 3-Level-Hierarchical Overlapped Algorithm (Direct Variant)
- The HCA directly writes into MIC memory
- Limitations remain for very large messages but are alleviated in the medium message range
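A sketch of the direct variant at a MIC-resident gather root in plain MPI, not the MVAPICH2 code. The leader_ranks array, the equal per-node aggregate size, and the handling of the root's own node (omitted here) are illustrative assumptions; the point is that the remote leaders' transfers complete as writes into MIC memory rather than taking an extra hop through the local host.

```c
#include <mpi.h>
#include <stdlib.h>

/* Direct variant: the MIC-resident root posts receives straight from each
 * remote host leader.  'leader_ranks' lists their world ranks and
 * 'vals_per_node' is the per-node aggregate size in ints. */
void direct_mic_root(int *rootbuf, int vals_per_node, int nnodes,
                     const int *leader_ranks, MPI_Comm world_comm)
{
    MPI_Request *reqs = malloc((size_t)nnodes * sizeof(MPI_Request));

    /* One receive per remote node, landing directly in the root's buffer. */
    for (int n = 0; n < nnodes; n++)
        MPI_Irecv(rootbuf + (size_t)n * vals_per_node, vals_per_node, MPI_INT,
                  leader_ranks[n], 0, world_comm, &reqs[n]);

    MPI_Waitall(nnodes, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```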
39
Outline
- Introduction
- Problem Statement
- Optimizations
- Experimental Evaluation and Analyses
- Conclusion and Future Work
40
Experimental Setup
- Stampede Cluster
- Up to 64 nodes used, each with Sandy Bridge-EP (Xeon E5-2680) processors and Xeon Phi coprocessors (Knights Corner)
- Each host has 32 GB of memory; each MIC has 8 GB
- Mellanox FDR switches (SX6036) to interconnect
- OSU Micro-benchmark suite (OMB)
- Optimizations added to MVAPICH2-1.9 (MV2-MIC)
– Version already has intra-node SCIF based optimizations
- Intel MPI 4.1.1.026
41
Experiments
- Symmetric mode of execution
- The number of MPI processes on the MIC is restricted to 16 due to limited memory
- OpenMP + MPI mode where each MPI process
spawns several OpenMP threads is a better fit for MIC
- All designs constructed on top of MVAPICH2 version
that has SCIF optimizations*
*S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla and D. K. Panda: Efficient Intra-node Communication on Intel-MIC Clusters, CCGrid’13
42
Host-rooted Gather
- 3-Level-Overlapped shows up to 78% and 85% improvement over MV2 and
Intel-MPI respectively
[Chart: 128-process gather (16h + 16m); latency (us x 1000) vs. message size (8K–512K bytes) for 3-Level-Overlapped, 3-Level-Hierarchical, Pipelined, Intel-MPI, and MV2-MIC; labels show 62% and 76% gains]
[Chart: 256-process gather (16h + 16m); same axes and schemes; labels show 78% and 85% gains]
43
MIC-rooted Gather
- 3-Level-Overlapped shows up to 83% and 87% improvement over MV2 and
Intel-MPI respectively
[Chart: 256-process gather (16h + 16m); latency (us x 1000) vs. message size (8K–512K bytes) for 3-Level-Overlapped, 3-Level-Hierarchical, Pipelined, Intel-MPI, and MV2-MIC; labels show 83% and 87% gains]
[Chart: 128-process gather (16h + 16m); same axes and schemes]
44
Case with Maximum Gather Latency
[Chart: 256-process MIC-rooted gather (16h + 16m); latency (us x 1000) vs. message size (8K–512K bytes) for 3-Level-Overlapped-Direct, 3-Level-Overlapped, and MV2-MIC]
- The maximum latency of the 3-Level-Overlapped algorithm bloats up in the large message range compared to the default MV2-MIC
- The Direct variant of the algorithm alleviates this problem to a large extent (up to 67%)
45
Outline
- Introduction
- Problem Statement
- Designs
- Experimental Evaluation and Analyses
- Conclusion and Future Work
Conclusions
46
- Bottlenecks in MPI_Gather identified for MIC
clusters
- Optimized MPI Gather for the Symmetric mode of operation
- Hierarchical and overlapping methods used to
alleviate architectural bottlenecks
- Future Goals: Exploration of other MPI collectives
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu
panda@cse.ohio-state.edu
47 XSCALE13