

SLIDE 1

High Performance MPI on IBM 12x InfiniBand Architecture

Abhinav Vishnu, Brad Benton1 and Dhabaleswar K. Panda

{vishnu, panda}@cse.ohio-state.edu, {brad.benton}@us.ibm.com1

SLIDE 2

Presentation Road-Map

  • Introduction and Motivation
  • Background
  • Enhanced MPI design for IBM 12x Architecture

  • Performance Evaluation
  • Conclusions and Future Work
SLIDE 3

Introduction and Motivation

  • Demand for more compute power is driven by parallel applications
    – Molecular Dynamics (NAMD), Car Crash Simulations (LS-DYNA), ...
  • Cluster sizes keep increasing to meet these demands
    – 9K processors (Sandia Thunderbird, ASCI Q)
    – Larger-scale clusters are planned using upcoming multi-core architectures
  • MPI is the primary programming model for writing these applications

SLIDE 4

Emergence of InfiniBand

  • Interconnects with very low latency and very high throughput have become available
    – InfiniBand, Myrinet, Quadrics, ...
  • InfiniBand
    – High performance and open standard
    – Advanced features
  • PCI-Express based InfiniBand adapters are becoming popular
    – 8X (1X ~ 2.5 Gbps) with Double Data Rate (DDR) support
    – MPI designs for these adapters are emerging
  • In addition to PCI-Express, GX+ I/O bus based adapters are also emerging
    – 4X and 12X link support

SLIDE 5

InfiniBand Adapters

[Figure: InfiniBand HCA block diagrams. Each HCA has an I/O bus interface to the host, an HCA chipset, and two ports (P1, P2) to the network. I/O bus options: PCI-X (4x bidirectional), PCI-Express (16x bidirectional), and GX+ (>24x bidirectional bandwidth); network links are 4x or 12x, SDR/DDR.]

MPI designs for PCI-Express based adapters are coming up; IBM 12x InfiniBand adapters on GX+ are also coming up.

SLIDE 6

Problem Statement

  • How do we design an MPI with low overhead for the IBM 12x InfiniBand architecture?
  • What are the performance benefits of the enhanced design over existing designs?
    – Point-to-point communication
    – Collective communication
    – MPI applications

SLIDE 7

Presentation Road-Map

  • Introduction and Motivation
  • Background
  • Enhanced MPI design for IBM 12x Architecture

  • Performance Evaluation
  • Conclusions and Future Work
SLIDE 8

Overview of InfiniBand

  • An interconnect technology to connect I/O nodes and processing nodes
  • InfiniBand provides multiple transport semantics
    – Reliable Connection: supports reliable notification and Remote Direct Memory Access (RDMA)
    – Unreliable Datagram: data delivery is not reliable; send/recv is supported
    – Reliable Datagram: currently not implemented by vendors
    – Unreliable Connection: notification is not supported
  • InfiniBand uses a queue pair (QP) model for data transfer (see the sketch below)
    – Send queue (for send operations)
    – Receive queue (not involved in RDMA operations)
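To make the QP model concrete, here is a minimal sketch (not from the slides) that creates a Reliable Connection QP with the OpenFabrics verbs API; the queue depths and the choice of the first device are illustrative assumptions, and error handling is abbreviated.

```c
/* Minimal sketch: create a Reliable Connection (RC) queue pair with the
 * verbs API to illustrate the QP model (send queue + receive queue). */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no HCA found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA (assumption) */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,              /* completions for the send queue    */
        .recv_cq = cq,              /* completions for the receive queue */
        .qp_type = IBV_QPT_RC,      /* Reliable Connection transport     */
        .cap = {
            .max_send_wr  = 128,    /* send queue depth (illustrative)   */
            .max_recv_wr  = 128,    /* receive queue depth               */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    printf("created RC QP number 0x%x\n", qp->qp_num);

    /* The QP must still be transitioned INIT -> RTR -> RTS (ibv_modify_qp)
     * and connected to a remote QP before any data transfer can happen. */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```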

SLIDE 9

MultiPathing Configurations

[Figure: Multipathing configurations through one or more switches: multiple adapters and multiple ports (multi-rail configurations); a combination of these is also possible. Multi-rail provides multiple send/recv engines (sketched below).]
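As a small illustration of what a multi-rail configuration looks like to software, the following sketch (an assumption-based example, not code from the talk) enumerates the HCAs and active ports that a multi-rail MPI could open QPs on.

```c
/* Sketch: enumerate candidate rails, i.e. (HCA, port) pairs, with the
 * verbs API. A multi-rail MPI would open one or more QPs on each
 * active port that is found here. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        struct ibv_device_attr dev_attr;
        ibv_query_device(ctx, &dev_attr);

        for (int port = 1; port <= dev_attr.phys_port_cnt; port++) {
            struct ibv_port_attr pa;
            ibv_query_port(ctx, port, &pa);
            if (pa.state == IBV_PORT_ACTIVE)     /* usable rail */
                printf("rail: %s port %d (active_width code %d)\n",
                       ibv_get_device_name(devs[i]), port, (int)pa.active_width);
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}
```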

SLIDE 10

Presentation Road-Map

  • Introduction and Motivation
  • Background
  • Enhanced MPI design for IBM 12x Architecture

  • Performance Evaluation
  • Conclusions and Future Work
SLIDE 11

MPI Design for 12x Architecture

[Figure: Enhanced MPI design. Components: ADI layer (eager and rendezvous protocols; point-to-point and collective communication), communication scheduler with its scheduling policies, communication marker, completion notifier and notification, and EPC using multiple QPs per port, above the InfiniBand layer.]

Jiuxing Liu, Abhinav Vishnu and Dhabaleswar K. Panda, "Building Multi-rail InfiniBand Clusters: MPI-level Design and Performance Evaluation", SuperComputing 2004.

SLIDE 12

Discussion on Scheduling Policies

  • Policies
    – Reverse Multiplexing
    – Even Striping (see the sketch after this list)
    – Binding
    – Round Robin
    – Enhanced Pt-to-Pt and Collective (EPC)
  • Overhead
    – Multiple stripes
    – Multiple completions
  • Communication types: blocking, non-blocking, and collective communication
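The sketch below illustrates even striping over multiple QPs (an illustrative example, not the MVAPICH source): the message is split into equal stripes, one RDMA-write work request is posted per QP, and the sender later reaps one completion per stripe, which is exactly the multiple-stripes/multiple-completions overhead listed above. The qps array, memory region, and remote address/rkey are assumed to have been set up and exchanged already.

```c
/* Sketch: even striping of a large message across several QPs on one port.
 * Each stripe is posted as a separate signaled RDMA-write work request. */
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

int stripe_rdma_write(struct ibv_qp **qps, int num_qps,
                      void *buf, size_t len, struct ibv_mr *mr,
                      uint64_t remote_addr, uint32_t rkey)
{
    size_t stripe = (len + num_qps - 1) / num_qps;   /* even striping */

    for (int i = 0; i < num_qps; i++) {
        size_t off = (size_t)i * stripe;
        if (off >= len)
            break;
        size_t chunk = (off + stripe > len) ? len - off : stripe;

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf + off,
            .length = (uint32_t)chunk,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = (uint64_t)i,       /* identifies the stripe      */
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED, /* one completion per stripe  */
        };
        wr.wr.rdma.remote_addr = remote_addr + off;
        wr.wr.rdma.rkey        = rkey;

        struct ibv_send_wr *bad_wr;
        if (ibv_post_send(qps[i], &wr, &bad_wr))
            return -1;
    }
    return 0;
}
```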

SLIDE 13

EPC Characteristics

  • For small messages, the round-robin policy is used
    – Striping leads to overhead for small messages
  • Policy by communication type (a selection sketch follows):

    pt-2-pt blocking        striping
    pt-2-pt non-blocking    round-robin
    collective              striping
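A small selection function capturing the policy table above (a sketch based on the slide, not the actual MVAPICH code; the small-message threshold is an assumed, tunable value):

```c
/* Sketch: choose an EPC scheduling policy from the message size and the
 * kind of operation, following the table reconstructed from the slide. */
enum epc_policy  { EPC_ROUND_ROBIN, EPC_STRIPING };
enum epc_op_kind { EPC_PT2PT_BLOCKING, EPC_PT2PT_NONBLOCKING, EPC_COLLECTIVE };

#define SMALL_MSG_THRESHOLD (8 * 1024)   /* bytes; illustrative cutoff */

enum epc_policy epc_select_policy(enum epc_op_kind kind, unsigned long nbytes)
{
    /* Small messages always go round-robin: striping them only adds
     * per-stripe posting and completion overhead. */
    if (nbytes <= SMALL_MSG_THRESHOLD)
        return EPC_ROUND_ROBIN;

    switch (kind) {
    case EPC_PT2PT_BLOCKING:    return EPC_STRIPING;     /* finish one message fast     */
    case EPC_PT2PT_NONBLOCKING: return EPC_ROUND_ROBIN;  /* overlap messages across QPs */
    case EPC_COLLECTIVE:        return EPC_STRIPING;
    }
    return EPC_ROUND_ROBIN;
}
```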

SLIDE 14

MVAPICH/MVAPICH2

  • We have used MVAPICH as our MPI framework for the enhanced design
  • MVAPICH/MVAPICH2
    – High-performance MPI-1/MPI-2 implementations over InfiniBand and iWARP
    – Has powered many supercomputers in the TOP500 supercomputing rankings
    – Currently used by more than 450 organizations (academia and industry worldwide)
    – http://nowlab.cse.ohio-state.edu/projects/mpi-iba
  • The enhanced design is available with MVAPICH
    – It will become available with MVAPICH2 in upcoming releases

SLIDE 15

Presentation Road-Map

  • Introduction and Motivation
  • Background
  • Enhanced MPI design for IBM 12x Architecture

  • Performance Evaluation
  • Conclusions and Future Work
SLIDE 16

Experimental TestBed

  • The experimental test bed consists of:
    – Power5 based systems with SLES9 SP2
    – GX+ at 950 MHz clock speed
    – 2.6.9 kernel version
    – 2.8 GHz processors with 8 GB of memory
    – TS120 switch for connecting the adapters
  • One port per adapter and one adapter are used for communication
    – The objective is to see the benefit of using only one physical port

SLIDE 17

Ping-Pong Latency Test

  • EPC adds negligible overhead to small-message latency
  • Large-message latency is reduced by 41% using EPC with the IBM 12x architecture (a benchmark sketch follows)
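For reference, a ping-pong latency test of the kind reported here can be sketched as follows (illustrative code, not the exact benchmark used; the message size and iteration count are assumptions):

```c
/* Sketch: ping-pong latency micro-benchmark. Rank 0 sends 'size' bytes to
 * rank 1 and waits for the echo; one-way latency is half the averaged
 * round-trip time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank, size = (argc > 1) ? atoi(argv[1]) : 4;   /* message size in bytes */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("%d bytes: %.2f us one-way latency\n",
               size, elapsed / ITERS / 2.0 * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```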

SLIDE 18

Small Messages Throughput

  • Unidirectional bandwidth doubles for small messages using EPC
  • Bidirectional bandwidth does not improve with an increasing number of QPs, due to the copy-bandwidth limitation (a bandwidth-test sketch follows)
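The bandwidth results on this and the next slide come from window-based throughput tests; a sketch of such a unidirectional test is shown below (illustrative, not the exact benchmark source; window size and iteration count are assumptions):

```c
/* Sketch: unidirectional bandwidth test. The sender posts a window of
 * non-blocking sends, the receiver pre-posts matching receives, and an
 * acknowledgement closes each window. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64
#define ITERS  100

int main(int argc, char **argv)
{
    int rank, size = (argc > 1) ? atoi(argv[1]) : (1 << 20);   /* bytes */
    char *buf, ack = 0;
    MPI_Request req[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc(size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0)   /* MB/s, counting 10^6 bytes per MB */
        printf("%d bytes: %.1f MB/s\n", size,
               (double)size * WINDOW * ITERS / elapsed / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}
```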

SLIDE 19

Large Messages Throughput

  • EPC improves the unidirectional and bidirectional throughput significantly for medium-size messages
  • We achieve a peak unidirectional bandwidth of 2731 MB/s and a peak bidirectional bandwidth of 5421 MB/s

SLIDE 20

Collective Communication

  • MPI_Alltoall shows significant benefits for large messages
  • MPI_Bcast shows more benefit for very large messages (a timing sketch follows)
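A collective timing loop of the kind behind these numbers can be sketched as follows (illustrative; MPI_Bcast would be timed the same way, and the message size is an assumed parameter):

```c
/* Sketch: time MPI_Alltoall for one message size and report the slowest
 * rank's average time per call. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 100

int main(int argc, char **argv)
{
    int rank, nprocs, size = (argc > 1) ? atoi(argv[1]) : (256 * 1024);
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char *sendbuf = malloc((size_t)size * nprocs);
    char *recvbuf = malloc((size_t)size * nprocs);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Alltoall(sendbuf, size, MPI_CHAR,
                     recvbuf, size, MPI_CHAR, MPI_COMM_WORLD);
    double local = (MPI_Wtime() - start) / ITERS, max;

    MPI_Reduce(&local, &max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Alltoall, %d bytes per pair: %.3f ms\n", size, max * 1e3);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```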

SLIDE 21

NAS Parallel Benchmarks

  • For Class A and Class B problem sizes, the x1 configuration shows improvement
  • There is no degradation for the other configurations on Fourier Transform

SLIDE 22

NAS Parallel Benchmarks

  • Integer Sort shows 7-11% improvement for x1 configurations
  • The other NAS Parallel Benchmarks show no performance degradation

SLIDE 23

Presentation Road-Map

  • Introduction and Motivation
  • Background
  • Enhanced MPI design for IBM 12x Architecture

  • Performance Evaluation
  • Conclusions and Future Work
SLIDE 24

Conclusions

  • We presented an enhanced design for the IBM 12x InfiniBand architecture
    – EPC (Enhanced Point-to-Point and Collective communication)
  • We have implemented our design and evaluated it with micro-benchmarks, collectives, and MPI application kernels
  • IBM 12x HCAs can significantly improve communication performance
    – 41% for the ping-pong latency test
    – 63-65% for the unidirectional and bidirectional bandwidth tests
    – 7-13% improvement for the NAS Parallel Benchmarks
    – Peak unidirectional and bidirectional bandwidths of 2731 MB/s and 5421 MB/s, respectively

SLIDE 25

Future Directions

  • We plan to evaluate EPC with multi-rail configurations on upcoming multi-core systems
    – Multi-port configurations
    – Multi-HCA configurations
  • Scalability studies of using multiple QPs on large-scale clusters
    – Impact of QP caching
    – Network fault tolerance

SLIDE 26

Acknowledgements

Our research is supported by the following organizations

  • Current Funding support by
  • Current Equipment support by
SLIDE 27

Web Pointers

http://nowlab.cse.ohio-state.edu/
MVAPICH web page: http://mvapich.cse.ohio-state.edu
E-mail: {vishnu, panda}@cse.ohio-state.edu, brad.benton@us.ibm.com