Accelerating Large Charm++ Messages using RDMA
Nitin Bhat, Vipul Harsh

Nitin Bhat
Master’s Student
Parallel Programming Lab, UIUC

Motivation
• Major bottleneck in HPC applications: communication
• Strategies to address communication bottlenecks
  – Overlap communication and computation
  – Topology-aware mapping
  – Reduce message sending times
    • Avoid copying for large messages

Charm++ Programming Model
• Asynchronous message-driven execution
• Naturally one-sided

    Cell_Proxy[8].recv_forces(forces, 1000000, 4.0);

[Figure: array elements 3 and 8; PE 0 on Node 0 and PE 0 on Node 1.]

Charm Interface File – Declarations (forcecalculations.ci)

    module forcecalculations {
      …...
      array [1D] Cell {
        entry forces( );
        entry void recv_forces(double forces[size], int size, double value);
      }
      ….....
    }

C++ Code File – Entry method (forcecalculations.C)

    void recv_forces(double *forces, int size, double value) {
      ….
    }

C++ Code File – Call site (forcecalculations.C)

    Cell_Proxy[n].recv_forces(forces, 1000000, 4.0);

What happens under the hood?

[Figure: a regular send from Node 0 to Node 1.]
Node 0 (Charm++): Cell_Proxy[n].recv_forces(forces, size, value);
  – Marshalling of parameters: forces, size and value are copied into one message
  – Message handed to LRTS: Header | size | value | forces
Node 1 (Charm++): void recv_forces(double *forces, int size, double value) { ....... }
  – Message delivered by LRTS: Header | size | value | forces
  – Un-marshalling of parameters: forces, size, value
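The marshalling path above can be sketched in plain C++. This is a minimal illustration, not Charm++ runtime code: the `Header` layout, the `marshall` and `unmarshall` functions, and the `Unmarshalled` struct are hypothetical names introduced here. It shows why the regular path costs a full copy of `forces` on the sender and another on the receiver.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical fixed-size header; a real runtime header carries more fields.
struct Header {
    int entryMethodId;
    std::size_t payloadBytes;
};

// Sender side: every parameter is copied into one contiguous message,
// laid out as Header | size | value | forces.
std::vector<char> marshall(int entryMethodId, const double* forces,
                           int size, double value) {
    const std::size_t arrayBytes = sizeof(double) * static_cast<std::size_t>(size);
    const std::size_t payload = sizeof(int) + sizeof(double) + arrayBytes;
    std::vector<char> msg(sizeof(Header) + payload);

    const Header h{entryMethodId, payload};
    char* p = msg.data();
    std::memcpy(p, &h, sizeof h);          p += sizeof h;
    std::memcpy(p, &size, sizeof size);    p += sizeof size;
    std::memcpy(p, &value, sizeof value);  p += sizeof value;
    if (arrayBytes > 0) {
        std::memcpy(p, forces, arrayBytes);  // the large sender-side copy
    }
    return msg;
}

// Receiver side: parameters are read back out in the same order.
struct Unmarshalled {
    int size;
    double value;
    std::vector<double> forces;
};

Unmarshalled unmarshall(const std::vector<char>& msg) {
    const char* p = msg.data() + sizeof(Header);
    Unmarshalled out{};
    std::memcpy(&out.size, p, sizeof out.size);    p += sizeof out.size;
    std::memcpy(&out.value, p, sizeof out.value);  p += sizeof out.value;
    out.forces.resize(static_cast<std::size_t>(out.size));
    if (out.size > 0) {
        std::memcpy(out.forces.data(), p, sizeof(double) * out.forces.size());
    }
    return out;
}
```

Calling `unmarshall(marshall(7, forces, n, 4.0))` returns the same `size`, `value` and array contents that were passed in. The cost of the `forces` copy grows linearly with the array length, which is the overhead the rest of the talk targets.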

In RDMA-enabled networks, for large messages:

[Figure: a large-message send from Node 0 to Node 1 on an RDMA-enabled network.]
Node 0 (Charm++): Cell_Proxy[n].recv_forces(forces, size, value);
  – Marshalling of parameters: forces, size and value are copied into one message
  – Message handed to LRTS: Header | size | value | forces
  – LRTS to LRTS: metadata
Node 1 (Charm++): void recv_forces(double *forces, int size, double value) { ....... }
  – LRTS: allocate memory, perform Get
  – Message delivered by LRTS: Header | size | value | forces
  – Un-marshalling of parameters: forces, size, value

How to accelerate large messages?
• Avoid the sender-side copy of large messages
  – Small parameters are marshalled into contiguous memory and sent.
  – Large arrays are transferred through RDMA Get operations.

Regular Charm++ (forcecalculations.ci)

    module forcecalculations {
      …...
      array [1D] Cell {
        entry forces( );
        entry void recv_forces(double forces[size], int size, double value);
      }
      ….....
    }

No copy Rdma API (forcecalculations.ci)

    module forcecalculations {
      …...
      array [1D] Cell {
        entry forces( );
        entry void recv_forces(Rdma double forces[size], int size, double value);
      }
      ….....
    }

Regular Charm++: C++ Code File – Call site (forcecalculations.C)

    Cell_Proxy[98].recv_forces(forces, 1000000, 4.0);

No copy Rdma API: C++ Code File – Call site (forcecalculations.C)

    Callback Cb = new Callback(CkIndex_Cell::completed, cellArrayID);
    Cell_Proxy[98].recv_forces(RDMA(forces, Cb), 1000000, 4.0);

No Copy One-sided API

[Figure: a no-copy send from Node 0 to Node 1.]
Node 0 (Charm++): Cell_Proxy[n].recv_forces(RDMA(forces, Cb), size, value);
  – Marshalling of non-Rdma parameters with metadata; forces stays in the user's buffer
  – Message handed to LRTS: Header | size | value | metadata
  – On ack from Node 1: Callback
Node 1 (Charm++): void recv_forces(double *forces, int size, double value) { ....... }
  – Message delivered by LRTS: Header | size | value | metadata
  – LRTS: allocate memory, perform Get for forces, send ack
  – Un-marshalling of parameters: forces, size, value
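The no-copy flow can be mimicked in a single process to make the ordering of steps concrete. This is a sketch under stated assumptions, not the Charm++ implementation: `RdmaMetadata`, `NoCopyMessage`, `sendNoCopy` and `receiveNoCopy` are hypothetical names, a `std::memcpy` stands in for the RDMA Get, and a `std::function` stands in for the callback `Cb`.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical metadata: enough for the receiver to fetch the array itself.
struct RdmaMetadata {
    const double* srcAddress;  // sender's buffer, left in place (no copy)
    std::size_t count;         // number of doubles to fetch
};

// Hypothetical message: small parameters plus metadata, but no array payload.
struct NoCopyMessage {
    int size;
    double value;
    RdmaMetadata meta;
    std::function<void()> onComplete;  // stands in for the callback Cb
};

// Sender side: only the small parameters and the metadata are packed.
NoCopyMessage sendNoCopy(const double* forces, int size, double value,
                         std::function<void()> cb) {
    return NoCopyMessage{size, value,
                         RdmaMetadata{forces, static_cast<std::size_t>(size)},
                         std::move(cb)};
}

// Receiver side: allocate memory, perform the Get, then acknowledge.
std::vector<double> receiveNoCopy(const NoCopyMessage& msg) {
    std::vector<double> forces(msg.meta.count);  // allocate memory
    if (msg.meta.count > 0) {
        // Simulated Get: a real transport would read remote memory here.
        std::memcpy(forces.data(), msg.meta.srcAddress,
                    msg.meta.count * sizeof(double));
    }
    if (msg.onComplete) {
        msg.onComplete();  // ack: the sender may now reuse its buffer
    }
    return forces;
}
```

The point of the callback is visible in the ordering: between `sendNoCopy` and the invocation of `onComplete`, the message only refers to the sender's buffer, so the sender must not modify or free it until the callback fires.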

Results on Blue Gene/Q Vesta – Pingpong Benchmark

Message Size (MB)   Existing One-sided Paradigm (ms)   No-copy One-sided Paradigm (ms)   Speedup
0.125               0.1040                             0.1036                            1.01
0.25                0.19                               0.18                              1.07
0.5                 0.36                               0.32                              1.12
1                   0.70                               0.61                              1.14
2                   1.62                               1.25                              1.30
4                   3.21                               2.46                              1.31
8                   6.40                               5.13                              1.25
16                  12.81                              10.22                             1.25
32                  28.38                              20.44                             1.39
64                  55.62                              43.87                             1.27

Performance Improvement
• 1.3x speedup

Conclusions and Future Work
• Avoiding the copy for large messages on RDMA-capable networks improves performance.
• On the receiver side, the user can pre-allocate a buffer and post a receive.
• Persistent RDMA
• Use cases in:
  – Charm++ with a posted receive
  – Charm++ SDAG when clause
  – AMPI non-blocking receive

Questions?
