

  1. Scaling Alltoall Collective on Multi-core Systems. Rahul Kumar, Amith R Mamidala, Dhabaleswar K Panda. Department of Computer Science & Engineering, The Ohio State University. {kumarra, mamidala, panda}@cse.ohio-state.edu

  2. Presentation Outline • Introduction • Motivation & Problem Statement • Proposed Design • Performance Evaluation • Conclusion & Future Work

  3. Introduction • Multi-core architectures are being widely used for high performance computing • The Ranger cluster at TACC has 16 cores/node and more than 60,000 cores in total • Message passing is the default programming model for distributed memory systems • MPI provides many communication primitives • MPI collective operations are widely used in applications

  4. Introduction • MPI_Alltoall is the most communication-intensive collective and is widely used in many applications such as CPMD, NAMD, FFT, and matrix transpose • In MPI_Alltoall, every process has distinct data to send to every other process • An efficient alltoall is highly desirable on multi-core systems, since the number of processes per job has increased dramatically with the low cost per core of multi-core architectures
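For reference, the minimal sketch below (plain MPI C, not the authors' code) illustrates the MPI_Alltoall semantics: each rank contributes one distinct element per destination and receives one element from every peer.

```c
/* Minimal MPI_Alltoall example: every rank sends one distinct integer
 * to every other rank.  Compile with an MPI C compiler, e.g.
 * mpicc alltoall.c -o alltoall, and run with mpirun. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* sendbuf[i] is the element destined for rank i;
     * recvbuf[i] will hold the element received from rank i. */
    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        sendbuf[i] = rank * size + i;   /* distinct data per destination */

    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received element %d from rank 0\n", rank, recvbuf[0]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```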

  5. Introduction • 24% of the Top 500 supercomputers use InfiniBand as their interconnect (based on the Nov '07 rankings) • There are several different implementations of InfiniBand network interfaces • Offload implementation, e.g. InfiniHost III (3rd-generation cards from Mellanox) • Onload implementation, e.g. QLogic InfiniPath • Combination of both onload and offload, e.g. ConnectX from Mellanox

  6. Offload & Onload Architecture [Diagram: nodes with cores and NICs connected over InfiniBand, contrasting the offload and onload architectures] • In an offload architecture, network processing is offloaded to the network interface: the NIC can send messages on its own, relieving the CPU of communication work • In an onload architecture, the CPU is involved in communication in addition to performing the computation • In an onload architecture the faster CPU can speed up communication, but communication can no longer be overlapped with computation

  7. Characteristics of Various Network Interfaces • Some basic experiments were performed on the various network architectures and the following observations were made • The bi-directional bandwidth of onload network interfaces increases as more cores are used to push data onto the network • This is shown in the following slides
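As a point of reference, a multi-pair bi-directional bandwidth test of the kind described here could be sketched as follows. This is a hedged illustration, not the benchmark used in the study; the message size, window depth, and the assumption that ranks 0..P-1 sit on one node and ranks P..2P-1 on the other are choices made for the sketch.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSG_SIZE (1 << 20)   /* 1 MB per message (assumption) */
#define WINDOW   32          /* messages in flight per direction */
#define ITERS    100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank i on the first node pairs with rank i + size/2 on the second;
     * launching more ranks per node activates more pairs. */
    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;

    char *sbuf = malloc((size_t)MSG_SIZE * WINDOW);
    char *rbuf = malloc((size_t)MSG_SIZE * WINDOW);
    memset(sbuf, 'a', (size_t)MSG_SIZE * WINDOW);
    MPI_Request req[2 * WINDOW];

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        /* Both partners send and receive a full window concurrently,
         * so the link is driven in both directions at once. */
        for (int w = 0; w < WINDOW; w++) {
            MPI_Irecv(rbuf + (size_t)w * MSG_SIZE, MSG_SIZE, MPI_CHAR,
                      peer, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Isend(sbuf + (size_t)w * MSG_SIZE, MSG_SIZE, MPI_CHAR,
                      peer, 0, MPI_COMM_WORLD, &req[WINDOW + w]);
        }
        MPI_Waitall(2 * WINDOW, req, MPI_STATUSES_IGNORE);
    }
    double elapsed = MPI_Wtime() - t0;

    /* Per-rank bi-directional traffic; summing over the ranks of one node
     * gives the aggregate bandwidth the following slides refer to. */
    double gb = 2.0 * MSG_SIZE * WINDOW * ITERS / 1e9;
    printf("rank %d: %.2f GB/s\n", rank, gb / elapsed);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
```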

  8. Bi-directional Bandwidth: InfiniPath (onload) • Bi-directional bandwidth increases as more cores are used to push data • With an onload interface, more cores help achieve better network utilization

  9. Bi-directional Bandwidth: ConnectX • A similar trend is also observed for the ConnectX network interface

  10. Bi-directional Bandwidth: InfiniHost III (offload) • However, with offload network interfaces the bandwidth drops when more cores are used • We attribute this to congestion at the network interface when many cores inject data simultaneously

  11. Results from the Experiments • Depending on the interface implementation, the characteristics differ – QLogic onload implementation: using more cores simultaneously for inter-node communication is beneficial – Mellanox offload implementation: using fewer cores at the same time for inter-node communication is beneficial – Mellanox ConnectX architecture: using more cores simultaneously is beneficial

  12. Presentation Outline • Introduction • Motivation & Problem Statement • Proposed Design • Performance Evaluation • Conclusion & Future Work

  13. Motivation • To evaluate the performance of the existing alltoall algorithm, we conduct the following experiment • Alltoall time is measured on a fixed set of nodes • The number of cores per node participating in the alltoall is increased gradually

  14. Motivation • The alltoall time doubles when the number of cores per node is doubled

  15. What is the Problem with the Algorithm? [Diagram: cores on Node 1 and Node 2 exchanging messages] • An alltoall between two nodes with one core each involves one inter-node communication step • With two cores per node, the number of inter-node communications performed by each core increases to two • So, on doubling the cores, the alltoall time is almost doubled • This is exactly what we observed in the previous experiment

  16. Problem Statement • Can low-cost shared memory help to avoid network transactions? • Can the performance of alltoall be improved, especially on multi-core systems? • Which algorithm should be chosen for each InfiniBand implementation?

  17. Related Work • There have been studies that propose a leader-based hierarchical scheme for other collectives • A leader is chosen on each node • Only the leader is involved in inter-node communication • The communication takes place in three stages • The cores aggregate data at the leader of the node • The leader performs the inter-node communication • The leader distributes the data to the cores • We implemented this scheme for alltoall, as illustrated in the diagram on the next slide

  18. Leader-based Scheme for Alltoall [Diagram: two nodes; Step 1 gathers data at each node's leader, Step 2 is an alltoall within the group of leaders, Step 3 distributes data to the other cores] • Step 1: all cores send their data to the leader • Step 2: the leaders perform an alltoall among themselves • Step 3: each leader distributes the respective data to the other cores
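The sketch below shows one way such a leader-based alltoall could be written with plain MPI calls. It is an illustration under simplifying assumptions (one int per destination rank, a node-major rank layout, and a hypothetical CORES_PER_NODE constant that divides the job size), not the authors' implementation.

```c
/* Hedged sketch of a leader-based alltoall (not the authors' code).
 * Assumes one int per destination rank and a node-major layout:
 * ranks 0..C-1 on node 0, C..2C-1 on node 1, ... */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define CORES_PER_NODE 8   /* C: assumption for this sketch */

void leader_alltoall(const int *sendbuf, int *recvbuf, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);
    int C = CORES_PER_NODE, N = P / C;
    int node = rank / C, lrank = rank % C;

    /* Node-local communicator and a communicator of the N leaders. */
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_split(comm, node, lrank, &node_comm);
    MPI_Comm_split(comm, (lrank == 0) ? 0 : MPI_UNDEFINED, node, &leader_comm);

    /* Step 1: every core sends its whole send buffer (P ints) to the leader. */
    int *gath = (lrank == 0) ? malloc(sizeof(int) * C * P) : NULL;
    MPI_Gather(sendbuf, P, MPI_INT, gath, P, MPI_INT, 0, node_comm);

    int *scat = (lrank == 0) ? malloc(sizeof(int) * C * P) : NULL;
    if (lrank == 0) {
        /* Pack: for destination node j, take the C ints destined to its
         * cores from each of the C local sources -> C*C ints per node. */
        int *pack = malloc(sizeof(int) * C * P);
        int *xchg = malloc(sizeof(int) * C * P);
        for (int j = 0; j < N; j++)
            for (int src = 0; src < C; src++)
                memcpy(&pack[(j * C + src) * C],
                       &gath[src * P + j * C], C * sizeof(int));

        /* Step 2: alltoall among the leaders, C*C ints per node pair. */
        MPI_Alltoall(pack, C * C, MPI_INT, xchg, C * C, MPI_INT, leader_comm);

        /* Unpack so that local destination core d gets its P ints in order. */
        for (int d = 0; d < C; d++)
            for (int j = 0; j < N; j++)
                for (int src = 0; src < C; src++)
                    scat[d * P + j * C + src] = xchg[(j * C + src) * C + d];
        free(pack); free(xchg);
    }

    /* Step 3: the leader scatters to each core its P received ints. */
    MPI_Scatter(scat, P, MPI_INT, recvbuf, P, MPI_INT, 0, node_comm);

    if (lrank == 0) { free(gath); free(scat); MPI_Comm_free(&leader_comm); }
    MPI_Comm_free(&node_comm);
}
```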

  19. Issues with the Leader-based Scheme • It uses only one core to send data out on the network • It does not exploit the bandwidth increase obtained when more cores send data out of the node

  20. Presentation Outline • Introduction • Motivation & Problem Statement • Proposed Design • Performance Evaluation • Conclusion & Future Work

  21. Proposed Design [Diagram: the cores of each node are grouped by local rank into GROUP 1, GROUP 2, ...; Step 1 is an intra-node exchange, Step 2 an inter-node alltoall within each group] • All the cores take part in the communication • Each core communicates with one and only one core on every other node • Step 1 (intra-node communication): the data destined for other nodes is exchanged among the local cores; the core that communicates with the corresponding core of the destination node receives the data • Step 2 (inter-node communication): an alltoall is called within each group
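For comparison, here is a sketch of the two-step scheme described on this slide, under the same simplifying assumptions as the leader-based sketch above (one int per destination rank, node-major layout, a hypothetical CORES_PER_NODE). The group communicator is built by splitting on the local rank, so each core talks only to the cores with the same local rank on other nodes; again, this is an illustration, not the authors' code.

```c
/* Hedged sketch of the proposed two-step alltoall (not the authors'
 * implementation). */
#include <mpi.h>
#include <stdlib.h>

#define CORES_PER_NODE 8   /* C: assumption for this sketch */

void grouped_alltoall(const int *sendbuf, int *recvbuf, MPI_Comm comm)
{
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);
    int C = CORES_PER_NODE, N = P / C;
    int node = rank / C, lrank = rank % C;

    /* node_comm: the C cores of this node (shared-memory exchange).
     * group_comm: one core per node, grouped by local rank. */
    MPI_Comm node_comm, group_comm;
    MPI_Comm_split(comm, node, lrank, &node_comm);
    MPI_Comm_split(comm, lrank, node, &group_comm);

    /* Step 1 (intra-node): give local core d everything destined for
     * local rank d on any node.  Pack so the block for core d holds the
     * N ints destined for ranks {j*C + d : j = 0..N-1}. */
    int *pack  = malloc(sizeof(int) * P);
    int *intra = malloc(sizeof(int) * P);
    for (int d = 0; d < C; d++)
        for (int j = 0; j < N; j++)
            pack[d * N + j] = sendbuf[j * C + d];
    MPI_Alltoall(pack, N, MPI_INT, intra, N, MPI_INT, node_comm);
    /* Now intra[s*N + j] = data from local core s destined for
     * (node j, local rank lrank). */

    /* Step 2 (inter-node): alltoall inside the group, C ints per node.
     * Pack so the block for node j holds the C ints destined for (j, lrank). */
    for (int j = 0; j < N; j++)
        for (int s = 0; s < C; s++)
            pack[j * C + s] = intra[s * N + j];
    MPI_Alltoall(pack, C, MPI_INT, recvbuf, C, MPI_INT, group_comm);
    /* recvbuf[i*C + s] now holds the element sent to us by global rank
     * i*C + s, i.e. the standard alltoall result. */

    free(pack); free(intra);
    MPI_Comm_free(&node_comm);
    MPI_Comm_free(&group_comm);
}
```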

  22. Advantages of the Proposed Scheme • The scheme takes advantage of low-cost shared memory • It uses multiple cores to send data out on the network, thus achieving better network utilization • Each core issues the same number of inter-node sends as a leader does in the leader-based scheme, so start-up costs stay low (see the message-count comparison below)
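A rough message-count comparison (our own arithmetic, assuming N nodes, C cores per node, and a per-destination block of size m):
  – original alltoall: each core issues N*C - 1 sends of size m
  – leader-based: each leader issues N - 1 inter-node sends of size C*C*m
  – proposed: each core issues N - 1 inter-node sends of size C*m
So the per-core start-up count matches the leader-based scheme, while the data injected into the network is spread across all C cores of the node.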

  23. Presentation Outline • Introduction • Motivation & Problem Statement • Proposed Design • Performance Evaluation • Conclusion & Future Work

  24. Evaluation Framework
  • Testbed
    – Cluster A: 64 nodes (512 cores); dual 2.33 GHz Intel Xeon “Clovertown” quad-core; InfiniPath SDR network interface QLE7140; InfiniHost III DDR network interface MT25208
    – Cluster B: 4 nodes (32 cores); dual 2.33 GHz Intel Xeon “Clovertown” quad-core; Mellanox DDR ConnectX network interface
  • Experiments
    – Alltoall collective time on the onload InfiniPath, offload InfiniHost III, and ConnectX network interfaces
    – CPMD application performance

  25. Alltoall: InfiniPath [Chart: alltoall time (us) vs. message size from 1 B to 2 KB for the original, leader-based, and proposed schemes] • The figure shows the alltoall time for different message sizes on the 512-core system • The leader-based scheme reduces the alltoall time • The proposed design gives the best performance on the onload network interface

  26. Alltoall, InfiniPath: 512-Byte Messages [Chart: alltoall time (us) vs. number of nodes, 2 to 64, for the original, leader-based, and proposed schemes] • The figure shows the alltoall time for 512-byte messages at varying system sizes • The proposed scheme scales much better than the other schemes as the system size increases

  27. Alltoall: InfiniHost III [Chart: alltoall time (us) vs. message size from 1 B to 2 KB for the original, leader-based, and proposed schemes] • The figure shows the performance of the schemes on the offload network interface • The leader-based scheme performs best on the offload NIC as it avoids congestion • This matches our expectations

  28. Alltoall: ConnectX [Chart: alltoall time (us) vs. message size from 1 B to 8 KB for the original, leader-based, and proposed schemes] • As seen earlier, bi-directional bandwidth increases when more cores are used on the ConnectX architecture • Therefore, the proposed scheme attains the best performance

  29. CPMD Application [Chart: execution time (sec) for the 32-wat, si63-10ryd, si63-70ryd, and si63-120ryd inputs under the original, leader-based, and proposed schemes] • CPMD is designed for ab-initio molecular dynamics and makes extensive use of alltoall communication • The figure shows the performance improvement of the CPMD application on a 128-core system • The proposed design delivers the best execution time

  30. CPMD Application Performance at Varying System Sizes [Chart: CPMD execution time (sec) at system sizes 8x8, 16x8, 32x8, and 64x8 for the original, leader-based, and proposed schemes] • The figure shows the application execution time at different system sizes • The reduction in application execution time grows with increasing system size; the proposed design scales very well
