

  1. Design Alternatives for Implementing Fence Synchronization in MPI-2 One-Sided Communication for InfiniBand Clusters
     G. Santhanaraman, T. Gangadharappa, S. Narravula, A. Mamidala and D. K. Panda
     Presented by: Miao Luo, National Center for Supercomputing Applications
     Dept. of Computer Science and Engineering, The Ohio State University

  2. Introduction
     • High-end Computing (HEC) systems (approaching petascale capability)
       – Systems with a few thousand to hundreds of thousands of cores
       – Meet the requirements of grand challenge problems
     • Greater emphasis on programming models
       – One-sided communication is gaining popularity
     • Minimize the need for synchronization
       – Ability to overlap computation and communication
     • Scalable application communication patterns
       – Clique-based communication
         • Nearest neighbor: ocean/climate modeling, PDE solvers
         • Cartesian grids: 3D FFT

  3. Introduction: HPC Clusters
     • HPC has been the key driving force
       – Provides immense computing power by increasing the scale of parallel machines
     • Approaching petascale capabilities
       – Increased node performance
       – Faster/larger memory
       – Hundreds of thousands of cores
     • Commodity clusters with modern interconnects (InfiniBand, Myrinet, 10GigE, etc.)

  4. Introduction: Message Passing Interface (MPI)
     • MPI: the dominant programming model
     • Very portable
       – Available on all high-end systems
     • Two-sided message passing
       – Requires a handshake between the sender and receiver
       – Matching sends and receives (see the sketch below)
     • One-sided programming models becoming popular
       – MPI also provides one-sided communication semantics
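As a simple illustration of the two-sided handshake mentioned above, the sketch below pairs an MPI_Send with a matching MPI_Recv; the ranks, tag and message size are illustrative choices, not anything prescribed by the talk.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, data[4] = {0, 1, 2, 3};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Sender side: the transfer cannot complete on its own. */
        MPI_Send(data, 4, MPI_INT, 1, /* tag */ 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The receiver must post a matching receive (same source and tag)
         * for the handshake to complete. */
        MPI_Recv(data, 4, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```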

  5. Introduction: One-sided Communication
     • P0 reads/writes directly into the address space of P1
     • Only one processor (P0) is involved in the communication
     • One-Sided Communication or Remote Memory Access (RMA) is part of the MPI-2 standard (an extension to MPI-1)
     • MPI-3 standard coming up...
     [Figure: two nodes connected by InfiniBand, each with memory and a PCI/PCI-Express adapter, hosting processes P0-P3]

  6. Introduction: MPI-2 One-sided Communication
     • The sender (origin) can access the receiver's (target) remote address space (window) directly
     • Decouples data transfer and synchronization operations
     • Communication operations (a minimal usage sketch follows)
       – MPI_Put, MPI_Get, MPI_Accumulate
       – Contiguous and non-contiguous operations
     • Synchronization modes
       – Active synchronization
         • Post/Start and Wait/Complete
         • Fence (collective)
       – Passive synchronization
         • Lock/Unlock
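A minimal usage sketch of the operations and fence synchronization named on this slide, assuming at least two ranks; the window size, counts and displacements are illustrative and chosen so that the put, get and accumulate touch disjoint target locations.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int winbuf[8] = {0};                 /* memory exposed to other ranks */
    int local[8]  = {0};
    MPI_Win win;
    MPI_Win_create(winbuf, sizeof(winbuf), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int target = (rank + 1) % nprocs;

    MPI_Win_fence(0, win);               /* open the epoch               */
    MPI_Put(local, 4, MPI_INT, target, 0, 4, MPI_INT, win);     /* write */
    MPI_Get(local + 4, 2, MPI_INT, target, 4, 2, MPI_INT, win); /* read  */
    MPI_Accumulate(&rank, 1, MPI_INT, target, 6, 1, MPI_INT,
                   MPI_SUM, win);        /* atomic-style update          */
    MPI_Win_fence(0, win);               /* all operations complete here */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```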

  7. Introduction: Fence Synchronization
     • Example with three processes (P0, P1, P2), shown as a code sketch below:
       – A collective fence call starts epoch 0
       – Each process issues one-sided operations, e.g. Put(2), Put(0), Get(1), Put(2)
       – A second fence ends epoch 0 and starts epoch 1
       – More operations are issued, e.g. Put(1), Put(1), Put(2)
       – A final fence ends epoch 1
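A minimal sketch of the epoch structure in the figure, assuming a window `win` spanning at least three ints has already been created (for example as in the previous sketch); the choice of neighbours and displacements is illustrative.

```c
#include <mpi.h>

/* Two fence-delimited epochs: operations issued inside an epoch are only
 * guaranteed complete, at both origin and target, after the closing fence. */
static void two_epochs(MPI_Win win, int *buf, int left, int right)
{
    MPI_Win_fence(0, win);                                 /* start epoch 0 */
    MPI_Put(buf, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Get(buf + 1, 1, MPI_INT, left, 1, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                                 /* end epoch 0,
                                                              start epoch 1 */
    MPI_Put(buf + 2, 1, MPI_INT, left, 2, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                                 /* end epoch 1   */
}
```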

  8. Introduction: Top 100 Interconnect Share
     • 58 of the top 100 systems use InfiniBand
     • In top systems, the use of InfiniBand has grown significantly: over 50% of the top 100 systems in the Top500 use InfiniBand

  9. Introduction: InfiniBand Overview
     • The InfiniBand Architecture (IBA): an open standard for high-speed interconnects
     • IBA supports send/recv and RDMA semantics
       – Can provide good hardware support for the RMA/one-sided communication model
     • Very good performance with many features
       – Minimum latency ~1 usec, peak bandwidth ~2500 MB/s
       – RDMA Read, RDMA Write (match well with one-sided get/put semantics)
       – RDMA Write with Immediate (explored in this work; see the sketch below)
     • Several High End Computing systems use InfiniBand, e.g. Ranger at TACC (62,976 cores) and Chinook at PNNL (18,176 cores)
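The sketch below shows how an RDMA Write with Immediate could be posted through the ibverbs API; it is an illustration of the primitive, not MVAPICH2 code, and it assumes the queue pair, local memory registration and remote address/rkey exchange have already been set up.

```c
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Post an RDMA Write with Immediate: the payload is written directly into
 * the remote buffer, while the 32-bit immediate value is delivered through
 * a receive completion at the target, which can use it to detect arrival. */
static int post_rdma_write_with_imm(struct ibv_qp *qp, struct ibv_mr *mr,
                                    void *local_buf, uint32_t len,
                                    uint64_t remote_addr, uint32_t rkey,
                                    uint32_t imm_value)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request local completion */
    wr.imm_data            = htonl(imm_value);
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}
```

At the target, the immediate consumes a pre-posted receive and surfaces as a completion with opcode IBV_WC_RECV_RDMA_WITH_IMM, which is the property the Fence-RI design later exploits.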

  10. Presentation Layout
     • Introduction
     • Problem Statement
     • Design Alternatives
     • Experimental Evaluation
     • Conclusions and Future Work

  11. Problem Statement
     • How can we explore the design space for implementing fence synchronization on modern interconnects?
     • Can we design a novel fence synchronization mechanism that leverages InfiniBand's RDMA Write with Immediate primitive?
       – Reduced synchronization overhead and network traffic
       – Increased scope for overlap

  12. Presentation Layout
     • Introduction
     • Problem Statement
     • Design Alternatives
     • Experimental Evaluation
     • Conclusions and Future Work

  13. Design Space
     • Deferred approach
       – All operations and synchronization are deferred to the subsequent fence
       – Uses two-sided operations
       – Certain optimizations are possible to reduce the latency of operations and the overhead of synchronization
       – The capability for overlap is lost
     • Immediate approach
       – Synchronization and communication operations happen as they are issued
       – Uses RDMA for communication operations
       – Can achieve good overlap of computation and communication
       – How can we handle remote completions?
     • Characterize the performance
       – Overlap capability
       – Synchronization overhead

  14. Fence Designs
     • Deferred approach (Fence-2S)
       – Two-sided based approach
       – The first fence does nothing
       – All one-sided operations are queued locally
       – The second fence goes through the queue, issues the operations, and handles completion
       – The last message in the epoch can signal completion
     • Optimizations (combining a put with the ensuing synchronization) reduce synchronization overhead
     • Cons: no scope for providing overlap (a queueing sketch follows)
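A minimal sketch of the deferred idea described above, not MVAPICH2's actual implementation: a put merely records a descriptor, and the next fence drains the queue. The helpers `issue_two_sided_transfer` and `signal_epoch_completion` are hypothetical placeholders for the two-sided transfer and the completion notification, and all names are illustrative.

```c
#include <stdlib.h>

/* Descriptor for a one-sided operation recorded at the origin. */
typedef struct op {
    const void *buf;        /* origin buffer                 */
    int         nbytes;     /* payload size                  */
    int         target;     /* target rank                   */
    int         disp;       /* displacement in target window */
    struct op  *next;
} op_t;

static op_t *op_queue = NULL;

/* Hypothetical helper: perform the transfer with two-sided messaging
 * (send at the origin, matching receive progressed at the target). */
extern void issue_two_sided_transfer(const op_t *op);
/* Hypothetical helper: tell every target that this epoch is complete. */
extern void signal_epoch_completion(void);

/* "Put" under the deferred scheme: record the operation, send nothing yet. */
static void deferred_put(const void *buf, int nbytes, int target, int disp)
{
    op_t *op = malloc(sizeof(*op));
    op->buf = buf; op->nbytes = nbytes; op->target = target; op->disp = disp;
    op->next = op_queue;
    op_queue = op;
}

/* Fence under the deferred scheme: the opening fence of an epoch finds an
 * empty queue and does nothing; the closing fence drains the queue, issues
 * the queued operations over two-sided channels, and signals completion. */
static void deferred_fence(void)
{
    while (op_queue) {
        op_t *op = op_queue;
        op_queue = op->next;
        issue_two_sided_transfer(op);
        free(op);
    }
    signal_epoch_completion();
}
```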

  15. Fence Designs
     • Immediate approach
       – Issue a completion message on all the channels
       – Issue a barrier after the operations?
     [Figure: four processes perform a two-step barrier; a PUT from process 0 to process 3 is issued before the barrier but arrives only after barrier step 2]

  16. Fence-Imm Naive Design (Fence-1S)
     [Timeline for P0-P3: puts are issued during epoch 0; at fence begin, finish messages are sent, local completion is awaited, a reduce-scatter is performed, and finish-message completion marks the end of epoch 0; epoch 1 then starts at fence end]

  17. Fence-Imm Opt Design (Fence-1S-Barrier)
     [Timeline for P0-P3: as in Fence-1S, puts are issued, finish messages are sent, local completion is awaited, a reduce-scatter is performed and finish-message completion is detected; an additional barrier then precedes fence end and the start of epoch 1 (a protocol sketch follows)]
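A hedged sketch in the spirit of the Fence-1S-Barrier timeline, not the paper's exact protocol: a reduce-scatter of per-target put counts tells each process how many incoming messages to wait for, and a barrier then separates the epochs. `puts_issued_to` and `wait_for_remote_arrivals` are hypothetical placeholders for origin-side bookkeeping and target-side progress.

```c
#include <mpi.h>

/* Assumed to be maintained elsewhere: puts_issued_to[r] counts the puts this
 * rank issued toward rank r in the current epoch (illustrative name). */
extern int puts_issued_to[];

/* Hypothetical helper: block until `n` incoming messages for this window
 * have fully arrived (e.g. by polling the network completion queue). */
extern void wait_for_remote_arrivals(int n);

static void fence_1s_barrier(MPI_Comm comm)
{
    int expected = 0;

    /* Reduce-scatter: each rank contributes a per-target count of issued
     * puts and learns the total number of messages it should expect. */
    MPI_Reduce_scatter_block(puts_issued_to, &expected, 1, MPI_INT,
                             MPI_SUM, comm);

    /* Wait until that many remote operations have landed locally. */
    wait_for_remote_arrivals(expected);

    /* The extra barrier keeps any rank from entering the next epoch before
     * every other rank has finished this one. */
    MPI_Barrier(comm);
}
```

MPI_Reduce_scatter_block (MPI-2.2) is used here only for brevity; the same effect can be had with MPI_Reduce_scatter and a vector of ones.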

  18. Novel Fence-RI Design
     [Timeline for P0-P3: puts are issued as RDMA Write with Immediate during epoch 0; at fence begin, local completion is awaited, an all-reduce is performed, remote completion is detected through the RDMA-immediate receive completions, and a barrier completes epoch 0 before epoch 1 starts at fence end (a completion-counting sketch follows)]
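A hedged sketch of the remote-completion step that RDMA Write with Immediate enables, not MVAPICH2 internals: each incoming put consumes a pre-posted receive and produces a receive completion, so the target only has to count those completions against an expected total (e.g. obtained from the all-reduce). The function name and error handling are illustrative.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

static int wait_for_incoming_puts(struct ibv_cq *cq, int expected)
{
    int arrived = 0;
    struct ibv_wc wc;

    while (arrived < expected) {
        int n = ibv_poll_cq(cq, 1, &wc);        /* non-blocking poll */
        if (n < 0)
            return -1;                          /* CQ error          */
        if (n == 0)
            continue;                           /* nothing yet       */

        if (wc.status != IBV_WC_SUCCESS) {
            fprintf(stderr, "work completion failed: %d\n", wc.status);
            return -1;
        }
        if (wc.opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
            /* One incoming RDMA-write-with-immediate has fully landed;
             * wc.imm_data could identify the origin rank or the window. */
            arrived++;
            /* A real implementation would repost the consumed receive here. */
        }
    }
    return 0;
}
```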

  19. Presentation Layout
     • Introduction
     • Problem Statement
     • Design Alternatives
     • Experimental Evaluation
     • Conclusions and Future Work

  20. Experimental Evaluation
     • Experimental testbed
       – 64-node Intel cluster
       – 2.33 GHz quad-core processors
       – 4 GB main memory
       – RedHat Linux AS4
       – Mellanox MT25208 HCAs with PCI Express interfaces
       – SilverStorm 144-port switch
       – MVAPICH2 software stack
     • Experiments conducted
       – Overlap measurements
       – Fence synchronization microbenchmarks
       – Halo exchange communication pattern

  21. MVAPICH/MVAPICH2 Software Distributions
     • High-performance MPI library for InfiniBand and iWARP clusters
       – MVAPICH2 (MPI-2)
       – Used by more than 975 organizations worldwide
       – Empowering many TOP500 clusters
       – Available with the software stacks of many InfiniBand, iWARP and server vendors, including the OpenFabrics Enterprise Distribution (OFED)
     • http://mvapich.cse.ohio-state.edu/

  22. Overlap
     • Overlap metric (a benchmark sketch follows)
       – Increasing amounts of computation are inserted between the put and the fence synchronization
       – Percentage overlap is measured as the amount of computation that can be inserted without increasing the overall latency
     • The two-sided implementation (Fence-2S) uses the deferred approach
       – No scope for overlap
     • The one-sided implementations can achieve overlap
     [Figure: percentage overlap vs. message size (16 B to 256 KB) for Fence-2S, Fence-1S, Fence-1S-Barrier and Fence-RI]
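A minimal sketch of an overlap measurement of this kind, assuming a window `win` and a valid `target` rank already exist; the busy-wait `compute()` and the timing structure are illustrative choices, not the authors' benchmark code.

```c
#include <mpi.h>

/* Busy-wait for roughly `usecs` microseconds of pure computation. */
static void compute(double usecs)
{
    double end = MPI_Wtime() + usecs * 1e-6;
    while (MPI_Wtime() < end)
        ;
}

/* Time one epoch consisting of a put, some computation, and the closing
 * fence. Comparing the result against the compute_usecs == 0 baseline shows
 * how much computation the implementation can hide behind the put. */
static double time_put_compute_fence(MPI_Win win, void *buf, int nbytes,
                                     int target, double compute_usecs)
{
    MPI_Win_fence(0, win);                        /* open the epoch  */
    double t0 = MPI_Wtime();
    MPI_Put(buf, nbytes, MPI_BYTE, target, 0, nbytes, MPI_BYTE, win);
    compute(compute_usecs);                       /* work to overlap */
    MPI_Win_fence(0, win);                        /* close the epoch */
    return (MPI_Wtime() - t0) * 1e6;              /* microseconds    */
}
```

Percentage overlap would then be the largest compute_usecs that leaves the measured time unchanged, expressed relative to the baseline put-plus-fence latency.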

  23. Latency of Fence (Zero-put)
     • Performance of fence alone, without any one-sided operations (a timing-loop sketch follows)
       – Overhead of synchronization alone
     • Fence-1S performs badly due to the all-pairwise synchronization needed to indicate the start of the next epoch
     • Fence-2S performs the best since it does not need an additional collective to indicate the start of an epoch
     [Figure: fence latency (usecs) vs. number of processes (8 to 64) for Fence-2S, Fence-1S, Fence-1S-Barrier and Fence-RI]
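A minimal sketch of a zero-put fence latency microbenchmark: call MPI_Win_fence repeatedly with no intervening one-sided operations and report the average time per fence. The iteration counts and warm-up are illustrative choices, not the authors' settings.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int warmup = 100, iters = 1000;
    int buf = 0;                     /* window backing memory */
    int rank;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&buf, sizeof(buf), sizeof(buf),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    for (int i = 0; i < warmup; i++)
        MPI_Win_fence(0, win);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Win_fence(0, win);
    double per_fence = (MPI_Wtime() - t0) / iters * 1e6;

    if (rank == 0)
        printf("average fence latency: %.2f usec\n", per_fence);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```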

  24. Latency of Fence with Put Operations
     • Performance of fence with put operations
       – Measures synchronization together with communication operations
       – A single put is issued by every process between two fences
     • Fence-1S performs the worst
     • Fence-RI performs better than Fence-1S-Barrier
     [Figure: latency (usecs) vs. number of processes (8 to 64) for Fence-2S, Fence-1S, Fence-1S-Barrier and Fence-RI with a single put]

  25. Latency of Fence with Multiple Put Operations
     • Performance of fence with multiple put operations
       – Each process issues puts to 8 neighbors
     • Fence-RI performs better than Fence-1S-Barrier
     • Fence-2S still performs the best
       – However, it has poor overlap capability
     [Figure: latency (usecs) vs. number of processes (8 to 64) for Fence-2S, Fence-1S, Fence-1S-Barrier and Fence-RI with 8 puts]

  26. Halo Communication Pattern
     • Mimics halo or ghost-cell updates (a sketch of the pattern follows)
     • The Fence-RI scheme performs the best
     [Figure: latency (usecs) vs. number of processes (8 to 64) for Fence-2S, Fence-1S, Fence-1S-Barrier and Fence-RI]
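A minimal sketch of a halo/ghost-cell exchange expressed with fence-synchronized puts, using a 1-D ring with two neighbours for brevity (the benchmark in the talk uses more neighbours); the halo width and data values are illustrative.

```c
#include <mpi.h>
#include <stdlib.h>

#define HALO 64   /* illustrative halo width, in doubles */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* The window holds the two ghost regions this rank receives:
     * [0, HALO) from its right neighbour, [HALO, 2*HALO) from its left. */
    double *ghost = calloc(2 * HALO, sizeof(double));
    double  boundary[2][HALO];
    for (int i = 0; i < HALO; i++)
        boundary[0][i] = boundary[1][i] = (double)rank;

    MPI_Win win;
    MPI_Win_create(ghost, 2 * HALO * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int left  = (rank - 1 + nprocs) % nprocs;
    int right = (rank + 1) % nprocs;

    MPI_Win_fence(0, win);    /* open the epoch                             */
    /* Our left boundary fills the left neighbour's "from the right" region,
     * and our right boundary fills the right neighbour's "from the left". */
    MPI_Put(boundary[0], HALO, MPI_DOUBLE, left,  0,    HALO, MPI_DOUBLE, win);
    MPI_Put(boundary[1], HALO, MPI_DOUBLE, right, HALO, HALO, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);    /* ghost regions are now valid on every rank  */

    MPI_Win_free(&win);
    free(ghost);
    MPI_Finalize();
    return 0;
}
```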

  27. Presentation Layout
     • Introduction
     • Problem Statement
     • Design Alternatives
     • Experimental Evaluation
     • Conclusions and Future Work
