High Performance Design and Implementation of Nemesis Communication Layer for Two-sided and One-Sided MPI Semantics in MVAPICH2 (PowerPoint Presentation)

SLIDE 1

High Performance Design and Implementation of Nemesis Communication Layer for Two-sided and One-Sided MPI Semantics in MVAPICH2

Miao Luo, Sreeram Potluri, Ping Lai, Emilio P. Mancini, Hari Subramoni, Krishna Kandalla, Sayantan Sur, D. K. Panda
Network-based Computing Lab, The Ohio State University

SLIDE 2

Outline

  • Introduction & Motivation
  • Problem Statement
  • Design Challenges
  • Evaluation of Performance
  • Conclusions and Future Work
SLIDE 3

Introduction

  • Message Passing Interface

– Predominant parallel programming model
– Deployed by many scientific applications:

  • Earthquake simulation
  • Weather prediction
  • Computational fluid dynamics
SLIDE 4

Introduction

  • MPI-2 R(emote) M(emory) A(ccess)

– Allows a data transfer to involve only one process explicitly
– Data transfer operations:

  • MPI_Put
  • MPI_Get
  • MPI_Accumulate

– Synchronization operations:

  • Fence
  • Post-start-complete-wait
  • Lock/unlock
SLIDE 5

Introduction

  • MPICH2

– Freely available, open-source, widely portable implementation of the MPI standard
– Re-designed for multi-core systems
– Nemesis Communication Layer

  • Optimized for fast intra-node communication

    – Lock-free queues in shared memory
    – Kernel-based transfers: KNEM

  • Modular design for various high-performance interconnects

SLIDE 6

Nemesis Communication Layer

  • Nemesis Communication Layer

– Designed for scalability and high-performance intra-node communication

– Modular design: multiple network modules

– Envisioned as the next-generation, highest-performing design for MPICH2

[Diagram: MPICH2 layering: the ADI3 device over CH3; CH3 channels (sock, …, Nemesis); Nemesis network modules (TCP/IP Netmod, …)]

SLIDE 7

An overview of InfiniBand

  • InfiniBand

– High-speed, general-purpose I/O interconnect
– Widely used by scientific computing centers worldwide
– Used in 40% of Top500 systems (June 2010)
– Two communication semantics:

  • Channel semantics: send/recv
  • Memory semantics: RDMA
SLIDE 8

Motivation

  • Nemesis + InfiniBand?
  • InfiniBand network module (IB-Netmod)

– Exposes InfiniBand's high performance to the intra-node-optimized Nemesis Communication Layer

[Diagram: the same MPICH2 layering, with the IB-Netmod added as a Nemesis network module alongside the TCP/IP Netmod]

SLIDE 9

Outline

  • Introduction
  • Problem Statement
  • Design Challenges
  • Evaluation of Performance
  • Conclusions and Future Work
SLIDE 10

Problem Statement

  • What are the considerations for a high-performance network module?

– Best two-sided performance
– Efficient utilization of the interconnect's full capabilities

  • Limitations of the current CH3 and Nemesis generic API:

– Can extensions be made to the current layering API?
– RMA functionality can be optimized by the lower layer

  • Can an extended Nemesis interface deliver better performance

– while keeping a unified design?
– while preserving modularity?

SLIDE 11

Outline

  • Introduction
  • Problem Statement
  • Design Challenges
  • Evaluation of Performance
  • Conclusions and Future Work
SLIDE 12

Designing IB Support for Nemesis: IB-Netmod

  • Credit-based InfiniBand Netmod header
  • Additional optimization techniques:

– SRQ (Shared Receive Queue)
– RDMA Fast Path
– Header caching

  • Limitation of the existing API:

– It blocks direct one-sided support from the lower layer!

[Diagram: ADI3 over CH3; two-sided operations (CH3_iSendv, CH3_iStartMsg, …) pass through the implementation of the CH3 two-sided API and the original Nemesis network-module API down to the netmods (TCP/IP Netmod, the RDMA-enabled IB-Netmod, and others)]

SLIDE 13

Proposed Extensions to Nemesis

[Diagram: the MVAPICH2 design for IB adds a customized CH3 interface with RDMA operations (the MRAIL sub-channel in MVAPICH2's design for IB), alongside the standard CH3 two-sided path (CH3_iSendv, CH3_iStartMsg, …) through Nemesis and its network modules]

SLIDE 14

Proposed Extensions to Nemesis

[Diagram: the proposed design adds a one-sided operations path: an extended CH3 one-sided API (CH3_1scWinCreate, …) implemented over a new Nemesis one-sided netmod API, next to the existing two-sided path through Nemesis]

SLIDE 15

Proposed Extensions to Nemesis

[Diagram: the same layering as before, plus a fall-back path from the extended CH3 one-sided API to the two-sided implementation for netmods without RMA support]

SLIDE 16

Extended CH3 One-sided API

  • CH3_1scWinCreate(void *base, MPI_Aint size, MPID_Win *win_ptr, MPID_Comm *comm_ptr)

– Get the window object handle and the start address of the window

  • CH3_1scWinPost(MPID_Win *win_ptr, int *group)

– Implement, or be made aware of, the start of an RMA epoch

  • CH3_1scWinWait(MPID_Win *win_ptr)

– Check the completion of an RMA epoch as a target

  • CH3_1scWinFinish(MPID_Win *win_ptr)

– Inform remote processes that all RMA operations in the current epoch have finished

  • CH3_1scWinPut(MPID_Win *win_ptr, MPIDI_RMA_ops *rma_op)

– Interface for sub-channels to realize truly one-sided put operations

  • CH3_1scWinGet(MPID_Win *win_ptr, MPIDI_RMA_ops *rma_op)

– Interface for sub-channels to realize truly one-sided get operations
SLIDE 17

Extended Nemesis One-sided API

  • MPID_nem_net_mod_WinCreate(void *base, MPI_Aint size, int comm_size, int rank, MPID_Win **win_ptr, MPID_Comm *comm_ptr)

– Interface for netmods to prepare for truly one-sided operations

  • MPID_nem_net_mod_WinPost(MPID_Win *win_ptr, int target_rank)

– Interface for netmods with RMA ability to realize synchronization by RDMA write, or even by hardware multicast features

  • MPID_nem_net_mod_WinFinish(MPID_Win *win_ptr)

– Interface for netmods with RDMA ability to realize CH3_1scWinFinish by RDMA write

  • MPID_nem_net_mod_WinWait(MPID_Win *win_ptr)

– Interface for netmods to match the net_mod_WinFinish functions with proper polling schemes

  • MPID_nem_net_mod_Put(MPID_Win *win_ptr, MPIDI_RMA_ops *rma_op, int size)
  • MPID_nem_net_mod_Get(MPID_Win *win_ptr, MPIDI_RMA_ops *rma_op, int size)

– Interfaces for netmods to carry out truly one-sided put/get operations using hardware features

SLIDE 18

Outline

  • Introduction
  • Problem Statement
  • Design Challenges
  • Evaluation of Performance
  • Conclusions and Future Work
SLIDE 19

MVAPICH2 Software

  • High Performance MPI Library for IB and 10GE

– MVAPICH (MPI-1) and MVAPICH2 (MPI-2)
– Used by more than 1,250 organizations
– Empowering many TOP500 clusters
– Available with the software stacks of many IB, 10GE and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
– Also supports the uDAPL device
– http://mvapich.cse.ohio-state.edu
– IB-Netmod has been incorporated into MVAPICH2 since the 1.5 release (July 2010); IB-Netmod with the one-sided extension will be available in the near future

SLIDE 20

Experimental Testbed

  • Cluster A:

– 8 Intel Nehalem machines with ConnectX QDR HCAs
– Eight Intel Xeon 5500 cores per node (two sockets of four cores)
– 2.40 GHz, with 12 GB of main memory

  • Cluster B:

– 32 Intel Clovertown machines with ConnectX DDR HCAs
– Eight Intel Xeon cores per node
– 2.33 GHz, with 6 GB of main memory

  • RedHat Enterprise Linux Server 5, OFED version 1.4.2

SLIDE 21

Results Evaluation

  • Micro-benchmark Level Evaluation

– Two-sided
– One-sided
– Available overlap rate

  • Application Level Evaluation

– NAMD
– AWP-ODC

SLIDE 22

Micro-benchmark Evaluation: Two-sided Intra-node Latency

  • The Nemesis intra-node communication design helps reduce small-message latency.

SLIDE 23

Micro-benchmark Evaluation: Two-sided Intra-node Bandwidth

  • In the 8 KB to 128 KB message range, MVAPICH2 1.5 with LiMIC2 performs better.
  • For even larger messages, Nemesis with KNEM delivers on average 400 MB/s higher bandwidth.
  • The difference is due to the different internal designs of KNEM and LiMIC2.
SLIDE 24

Micro-benchmark Evaluation: Two-sided Inter-node Latency

  • IB-Netmod provides 1.5 us latency using native InfiniBand, efficiently exploiting the high performance of the InfiniBand network.
  • Performance is comparable to MVAPICH2 1.5.
SLIDE 25

Micro-benchmark Evaluation: Two-sided Inter-node Bandwidth

  • Although IB-Netmod achieves even better bi-directional bandwidth for medium message sizes up to 16 KB, it loses up to 200 MB/s for messages between 32 KB and 256 KB.

SLIDE 26

Micro-benchmark Evaluation: One-Sided MPI_Put Latency

  • Through the extended API, the Nemesis IB-Netmod reduces small-message latency by 10% on average.
  • The extended API eliminates the fall-back overhead of the customized CH3 interfaces.
SLIDE 27

Micro-benchmark Evaluation: One-Sided MPI_Put Bandwidth

  • With a direct one-sided implementation of MPI_Put, Nemesis-IB with the extended one-sided API achieves nearly full bandwidth, matching MVAPICH2 1.5.
  • Nemesis IB-Netmod with the original two-sided-based API achieves only 60% of full bandwidth.

SLIDE 28

Micro-benchmark Evaluation: One-Sided MPI_Get

  • Similar results are observed in the MPI_Get benchmark.
SLIDE 29

Micro-benchmark Evaluation: Communication/Computation Overlap

  • Computation is inserted after each round of multiple Put or Get operations.
  • Overlap = (Tcomm + Tcomp - Ttotal) / Tcomm
  • 90% overlap is achieved for large messages through the extended API.
SLIDE 30

Application Evaluation: NAMD apoa1

  • NAMD is a production molecular dynamics program for high-performance simulation of large bio-molecular systems.
  • Nemesis IB-Netmod performs as well as MVAPICH2 1.5.
  • As the number of processes increases, the new IB-Netmod shows a trend toward even better performance, which may be due to the Nemesis intra-node optimizations.

SLIDE 31

Application Evaluation: AWP-ODC

  • Anelastic Wave Propagation: an earthquake simulation application.
  • http://hpgeoc.sdsc.edu/AWPODC/
  • AWP-ODC one-sided version with 128*256*256 elements per process.
  • 24% reduction in execution time.
SLIDE 32

Conclusion

  • InfiniBand-based network module

– Based on MVAPICH2
– Built for the modular Nemesis communication layer

  • Extended Nemesis API

– Truly one-sided communication support for RMA semantics
– Implemented in the new Nemesis IB-Netmod
– Evaluated its impact in comparison with MVAPICH2 1.5

  • Reusability?

– We believe the extended API can also be utilized by other netmods.

SLIDE 33

Future Work

  • Intra-node one-sided communication
  • IB-Netmod:

– Scalability
– Performance optimization techniques

  • Continue to design and evaluate new interfaces

SLIDE 34

Thanks!

{luom, potluri, laipi, mancini, subramon, kandalla, surs, panda}@cse.ohio-state.edu
Network-based Computing Laboratory
http://mvapich.cse.ohio-state.edu/