

SLIDE 1

Design Alternatives for Implementing Fence Synchronization in MPI-2 One-Sided Communication for InfiniBand Clusters

G. Santhanaraman, T. Gangadharappa, S. Narravula, A. Mamidala and D. K. Panda

Presented by: Miao Luo

National Center for Supercomputing Applications
Dept. of Computer Science and Engineering, The Ohio State University

SLIDE 2

Introduction

  • High-end Computing (HEC) Systems (approaching petascale capability)

– Systems with a few thousand to hundreds of thousands of cores
– Meet the requirements of grand challenge problems

  • Greater emphasis on programming models

– One-sided communication is getting popular

  • Minimizes the need for synchronization

– Ability to overlap computation and communication

  • Scalable application communication patterns

– Clique-based communication

  • Nearest neighbor: Ocean/Climate modeling, PDE solvers
  • Cartesian grids: 3DFFT

SLIDE 3

Introduction: HPC Clusters

  • HPC has been the key driving force

– Provides immense computing power by increasing the scale of parallel machines

  • Approaching petascale capabilities

– Increased node performance
– Faster/larger memory
– Hundreds of thousands of cores

  • Commodity clusters with modern interconnects (InfiniBand, Myrinet, 10GigE, etc.)

SLIDE 4

Introduction:

Message Passing Interface (MPI)

  • MPI - Dominant programming model
  • Very Portable

– Available on all High end systems

  • Two sided message passing

– Requires a handshake between the sender and receiver
– Matching sends and receives

  • One sided programming models becoming popular

– MPI also provides one-sided communication semantics

SLIDE 5

Introduction:

One-sided Communication

  • P0 reads/writes directly into the address space of P1
  • Only one processor (P0) involved in the communication
  • MPI-2 standard (extension to MPI-1)

One-sided communication is also known as Remote Memory Access (RMA). The MPI-3 standard is coming up...

[Diagram: two nodes, each with memory and two processes (P0/P2 and P1/P3), attached through PCI/PCI-Express to InfiniBand HCAs that connect the nodes over the IB fabric.]
SLIDE 6

Introduction:

MPI-2 One-sided Communication

  • The sender (origin) can access the receiver's (target's) remote address space (window) directly

  • Decouples data transfer and synchronization operations
  • Communication operations

– MPI_Put, MPI_Get, MPI_Accumulate
– Contiguous and non-contiguous operations

  • Synchronization Modes

– Active synchronization

  • Post/Start, Wait/Complete
  • Fence (collective)

– Passive synchronization

  • Lock/unlock
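As a concrete illustration of the operations and synchronization modes listed above, here is a minimal C sketch using the passive (lock/unlock) mode; the window size, target rank and value are illustrative only and not taken from the slides.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs, win_buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Every process exposes one int as its RMA window. */
    MPI_Win_create(&win_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0 && nprocs > 1) {
        int value = 42;
        /* Passive-target epoch: only the origin calls synchronization;
         * the target (rank 1) is not involved in this transfer. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(&value, 1, MPI_INT, 1 /* target */, 0 /* disp */,
                1, MPI_INT, win);
        MPI_Win_unlock(1, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```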

SLIDE 7

Introduction:

Fence Synchronization

[Diagram: processes 0, 1 and 2 separated into epoch 0 and epoch 1 by collective Fence calls; Put/Get operations such as Put(1), Put(2), Get(1) and Put(0) are issued inside an epoch and completed by the Fence that ends it.]
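The epoch structure in the figure can be written down directly. The sketch below assumes a window win has already been created and that left and right are neighbour ranks; both names are illustrative.

```c
#include <mpi.h>

/* Two fence epochs, as in the diagram: each MPI_Win_fence both closes the
 * current epoch and opens the next one. */
static void two_epochs(MPI_Win win, int left, int right)
{
    int send_val = 1, recv_val = 0;

    MPI_Win_fence(0, win);                                      /* open epoch 0   */
    MPI_Put(&send_val, 1, MPI_INT, right, 0, 1, MPI_INT, win);  /* Put in epoch 0 */
    MPI_Win_fence(0, win);          /* close epoch 0, open epoch 1: Puts visible  */
    MPI_Get(&recv_val, 1, MPI_INT, left, 0, 1, MPI_INT, win);   /* Get in epoch 1 */
    MPI_Win_fence(0, win);                                      /* close epoch 1  */
}
```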

SLIDE 8

Introduction:

Top 100 Interconnect Share


The use of InfiniBand in top systems has grown significantly: 58 of the top 100 systems in the Top500 (over 50%) use InfiniBand.

SLIDE 9

Introduction:

InfiniBand Overview

  • The InfiniBand Architecture (IBA): an open standard for high-speed interconnects
  • IBA supports send/recv and RDMA semantics
  • Can provide good hardware support for the RMA/one-sided communication model
  • Very good performance with many features
  • Minimum latency ~1 µs, peak bandwidth ~2500 MB/s
  • RDMA Read and RDMA Write (match well with one-sided get/put semantics)
  • RDMA Write with Immediate (explored in this work; sketched below)
  • Several High-End Computing systems use InfiniBand, e.g. Ranger at TACC (62,976 cores) and Chinook at PNNL (18,176 cores)
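RDMA Write with Immediate behaves like an ordinary RDMA Write, but it additionally consumes a posted receive at the target and generates a completion carrying a 32-bit immediate value, which is what makes remote-completion detection possible without a separate finish message. A hedged sketch of posting such a request with the verbs API is shown below; the QP, buffers and keys are assumed to have been set up by the usual connection code.

```c
#include <infiniband/verbs.h>
#include <arpa/inet.h>   /* htonl */
#include <stdint.h>
#include <string.h>

/* Post one RDMA Write with Immediate on an already-connected QP.
 * The immediate value tags the message; the target sees it in a completion
 * of type IBV_WC_RECV_RDMA_WITH_IMM. */
static int post_rdma_write_with_imm(struct ibv_qp *qp,
                                    void *local_buf, uint32_t lkey, uint32_t len,
                                    uint64_t remote_addr, uint32_t rkey,
                                    uint32_t tag)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;  /* data + 32-bit imm */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.imm_data            = htonl(tag);                  /* delivered at the target */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```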

SLIDE 10

Presentation Layout
  • Introduction
  • Problem Statement
  • Design Alternatives
  • Experimental Evaluation
  • Conclusions and Future Work
SLIDE 11

Problem Statement

  • How can we explore the design space for implementing fence synchronization on modern interconnects?

  • Can we design a novel fence synchronization mechanism that leverages InfiniBand's RDMA Write with Immediate primitives?

– Reduced synchronization overhead and network traffic
– Provide increased scope for overlap

SLIDE 12

Presentation Layout
  • Introduction
  • Problem Statement
  • Design Alternatives
  • Experimental Evaluation
  • Conclusions and Future Work
SLIDE 13

Design Space

  • Deferred Approach

– All operations and synchronizations deferred to subsequent fence

– Use two-sided operations
– Certain optimizations possible to reduce latency of ops and overhead of sync
– Capability for overlap is lost

  • Immediate Approach

– Sync and communication ops happen as they are issued
– Use RDMA for communication ops
– Can achieve good overlap of computation and communication
– How can we handle remote completions?

  • Characterize the performance

– Overlap capability
– Synchronization overhead

SLIDE 14

Fence Designs

  • Deferred approach (Fence-2S)

– Two-sided based approach
– First fence does nothing
– All one-sided operations queued locally (see the sketch below)
– The second fence goes through the queue, issues operations, and handles completion
– The last message in the epoch can signal completion

  • Optimizations (combining a put with the ensuing synchronization) → reduced synchronization overhead
  • Cons: no scope for providing overlap
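To make the deferred idea concrete, here is a rough, hypothetical sketch (not MVAPICH2 code): one-sided calls only enqueue descriptors, and the closing fence drains the queue over the two-sided channel. send_queued_put() and the final barrier are placeholders for whatever the real implementation does, including the optimization of piggybacking completion on the last message to each peer.

```c
#include <mpi.h>
#include <stdlib.h>

/* Descriptor for a put deferred until the next fence. */
typedef struct pending_op {
    int                target;   /* target rank                   */
    const void        *buf;      /* origin buffer                 */
    int                count;    /* number of elements            */
    MPI_Aint           disp;     /* displacement in target window */
    struct pending_op *next;
} pending_op;

void send_queued_put(const pending_op *op);  /* hypothetical two-sided helper */

static pending_op *queue_head = NULL;

/* "Put" in the deferred scheme: nothing goes on the wire yet. */
static void deferred_put(const void *buf, int count, MPI_Aint disp, int target)
{
    pending_op *op = malloc(sizeof(*op));
    op->target = target; op->buf = buf; op->count = count;
    op->disp = disp;     op->next = queue_head;
    queue_head = op;
}

/* The closing fence issues the queued operations over the two-sided channel
 * and then synchronizes to end the epoch. */
static void deferred_fence(MPI_Comm comm)
{
    while (queue_head) {
        pending_op *op = queue_head;
        queue_head = op->next;
        send_queued_put(op);
        free(op);
    }
    MPI_Barrier(comm);   /* stand-in for the epoch-closing synchronization */
}
```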

SLIDE 15

Fence Designs

  • Immediate Approach

– Issue a completion message on all the channels
– Issue a barrier after the operations?

[Diagram: processes 0-3 performing a two-step barrier; a PUT issued from process 0 to process 3 arrives only after barrier step 2, so the barrier alone does not guarantee remote completion of the put.]
SLIDE 16

Fence-Imm Naive Design (Fence-1S)

[Diagram: Fence-1S across P0-P3. Fence begin, local completion of PUTs, a REDUCE SCATTER, per-peer finish messages and their completion complete epoch 0; fence end starts epoch 1.]
SLIDE 17

Fence-Imm Opt Design (Fence-1S-Barrier)

[Diagram: Fence-1S-Barrier across P0-P3, as in Fence-1S (PUTs, local completion, REDUCE SCATTER, finish messages) with an additional BARRIER before the fence ends and epoch 1 starts.]
SLIDE 18

Novel Fence-RI Design

[Diagram: Fence-RI across P0-P3. PUTs are issued as RDMA Write with Immediate, giving the target remote RDMA immediate completions; local completion plus an ALLREDUCE and a BARRIER complete epoch 0, and the fence end starts epoch 1.]
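A rough sketch of the idea (hypothetical, not the MVAPICH2 implementation): because every put is an RDMA Write with Immediate, the target can count arrivals from its completion queue; an allreduce of per-target issue counts tells each process how many arrivals to wait for before leaving the fence. poll_imm_completions() stands in for draining IBV_WC_RECV_RDMA_WITH_IMM completions.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int poll_imm_completions(void);  /* hypothetical: drain CQ, return #imm arrivals */

/* Fence-RI style synchronization: puts were already issued eagerly as RDMA
 * Writes with Immediate during the epoch; puts_issued_to[i] counts how many
 * this process sent to rank i. */
static void fence_ri(MPI_Comm comm, int *puts_issued_to)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* Everyone learns how many puts were aimed at each rank this epoch. */
    int *expected_all = malloc(nprocs * sizeof(int));
    MPI_Allreduce(puts_issued_to, expected_all, nprocs, MPI_INT, MPI_SUM, comm);
    int expected = expected_all[rank];
    free(expected_all);

    /* Wait until all puts destined for this process have raised
     * immediate-data completions. */
    int arrived = 0;
    while (arrived < expected)
        arrived += poll_imm_completions();

    /* A barrier marks the start of the next epoch for the whole group. */
    MPI_Barrier(comm);
    memset(puts_issued_to, 0, nprocs * sizeof(int));
}
```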

SLIDE 19

Presentation Layout
  • Introduction
  • Problem Statement
  • Design Alternatives
  • Experimental Evaluation
  • Conclusions and Future Work
SLIDE 20

Experimental Evaluation

Experimental Testbed

  • 64 Node Intel Cluster
  • 2.33 GHz quad-core processor
  • 4GB Main Memory
  • RedHat Linux AS4
  • Mellanox MT25208 HCAs with PCI Express Interfaces
  • Silverstorm 144 port switch
  • MVAPICH2 Software Stack

Experiments Conducted

  • Overlap Measurements
  • Fence Synchronization Microbenchmarks
  • Halo Exchange Communication Pattern

SLIDE 21

MVAPICH/MVAPICH2 Software Distributions

  • High Performance MPI Library for InfiniBand and iWARP Clusters

– MVAPICH2 (MPI-2)
– Used by more than 975 organizations world-wide
– Empowering many TOP500 clusters
– Available with the software stacks of many InfiniBand, iWARP and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
– http://mvapich.cse.ohio-state.edu/

SLIDE 22
  • Overlap Metric
  • An increasing amount of computation is inserted between the put and the fence sync
  • Percentage overlap is measured as the amount of computation that can be inserted without increasing the overall latency (a sketch of the measurement loop follows the figure)

  • The two-sided implementation (Fence-2S) uses the deferred approach
– No scope for overlap

  • The one-sided implementations can achieve overlap

[Figure: Overlap. Percentage overlap vs. message size (16 to 256K) for Fence-2S, Fence-1S, Fence-1S-Barrier and Fence-RI.]
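The overlap measurement can be sketched as below: a fixed-size put is followed by a tunable amount of computation before the closing fence, and the reported overlap is the fraction of that computation that can be hidden without increasing the total time relative to the no-computation baseline. do_compute() is a hypothetical busy-loop; sizes and targets are illustrative.

```c
#include <mpi.h>

void do_compute(double usec);  /* hypothetical busy-loop for "usec" microseconds */

/* One overlap sample: put + inserted computation, bounded by two fences.
 * Returns the measured time in microseconds. */
static double timed_put_fence(MPI_Win win, void *buf, int bytes, int target,
                              double compute_usec)
{
    MPI_Win_fence(0, win);                 /* open the epoch */
    double t0 = MPI_Wtime();

    MPI_Put(buf, bytes, MPI_BYTE, target, 0, bytes, MPI_BYTE, win);
    do_compute(compute_usec);              /* computation inserted before the fence */

    MPI_Win_fence(0, win);                 /* close the epoch: put is complete */
    return (MPI_Wtime() - t0) * 1e6;
}
```

In this setup, the percentage overlap for a message size corresponds to the largest compute_usec for which the measured time stays roughly at the zero-computation baseline.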

SLIDE 23

Latency of Fence (Zero-put)

[Figure: fence latency (usecs) vs. number of processes (8 to 64) for Fence-2S, Fence-1S, Fence-1S-Barrier and Fence-RI.]

  • Performance of fence alone, without any one-sided operations
  • Overhead of synchronization alone
  • Fence-1S performs badly due to the all pair-wise sync to indicate the start of the next epoch
  • Fence-2S performs the best since it does not need an additional collective to indicate the start of an epoch

SLIDE 24

Latency of Fence with Put Operations

[Figure: latency (usecs) of fence with a single put vs. number of processes (8 to 64) for Fence-2S, Fence-1S, Fence-1S-Barrier and Fence-RI.]

  • Performance of fence with put operations
– Measuring synchronization with communication ops
– A single put is issued by all the processes between two fences

  • Fence-1S performs the worst
  • Fence-RI performs better than Fence-1S-Barrier

SLIDE 25

Latency of Fence with Multiple Put Operations

[Figure: latency (usecs) of fence with 8 puts per process vs. number of processes (8 to 64) for Fence-2S, Fence-1S, Fence-1S-Barrier and Fence-RI.]

  • Performance of fence with multiple put operations
– Each process issues puts to 8 neighbors

  • Fence-RI performs better than Fence-1S-Barrier
  • Fence-2S still performs the best
– However, poor overlap capability

SLIDE 26
Halo Communication Pattern

  • Mimics a halo or ghost-cell update (a sketch follows the figure)
  • The Fence-RI scheme performs the best

[Figure: latency (usecs) of the halo exchange pattern vs. number of processes (8 to 64) for Fence-2S, Fence-1S, Fence-1S-Barrier and Fence-RI.]
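The halo pattern used here can be sketched with fence-bounded puts to each neighbour. nbr[], disp[] and the buffer layout are illustrative; a real halo code would put a different boundary region, at a neighbour-specific displacement, to each peer.

```c
#include <mpi.h>

/* One halo/ghost-cell update step: put boundary data into each neighbour's
 * window inside a single fence epoch. */
static void halo_exchange(MPI_Win win, double *halo_buf, int halo_count,
                          const int *nbr, const MPI_Aint *disp, int num_nbrs)
{
    MPI_Win_fence(0, win);                          /* open the epoch        */
    for (int i = 0; i < num_nbrs; i++)              /* e.g. 8 neighbours     */
        MPI_Put(halo_buf, halo_count, MPI_DOUBLE, nbr[i],
                disp[i], halo_count, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                          /* all puts complete here */
}
```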

SLIDE 27

Presentation Layout
  • Introduction
  • Problem Statement
  • Design Alternatives
  • Experimental Evaluation
  • Conclusions and Future Work
SLIDE 28

Conclusions and Future Work

  • Analyzed different design choices for implementing fence synchronization on modern interconnects

  • Proposed a new design using the RDMA Write with Immediate mechanism

– Handles remote completions

  • Significantly improved performance for microbenchmarks and application communication patterns

  • Future Work

– Impact of these designs on real world applications

SLIDE 29

THANK YOU

Email Contacts

  • G. Santhanaraman: gopal@ncsa.uiuc.edu

  • D.K. Panda: panda@cse.ohio-state.edu
