
Communication and Topology-aware Load Balancing in Charm++ with TreeMatch
Joint lab 10th workshop (IEEE Cluster 2013, Indianapolis, IN)
Emmanuel Jeannot, Esteban Meneses-Rojas, Guillaume Mercier, François Tessier, Gengbin Zheng
November 27, 2013
Emmanuel Jeannot Communication-aware load balancing 1 / 25

Introduction
Scalable execution of parallel applications
- Number of cores is increasing
- But memory per core is decreasing
- Applications will need to communicate even more than now
Issues
- Process placement should take into account process affinity
- Here: load balancing in Charm++ taking into account:
  - load
  - affinity
  - topology
  - migration cost (transfer time)

Outline
1. Introduction
2. Problem and models
3. Load balancing for compute-bound applications
4. Load balancing for communication-bound applications
5. Conclusion

Charm++
Features
- Parallel object-oriented programming language based on C++
- Programs are decomposed into a number of cooperating message-driven objects called chares. In general we have more chares than processing units
- Chares are mapped to physical processors by an adaptive runtime system
- Load balancers can be called to migrate chares
- Chare placement and load balancing are transparent to the programmer

Chares/Process Placement
Why we should consider it
- Many current and future parallel platforms have several levels of hierarchy
- Application chares/processes do not exchange the same amount of data (affinity)
- The process placement policy may have an impact on performance
  - Cache hierarchy, memory bus, high-performance network...
In this work we deal with tree topologies only
[Figure: tree topology, from top to bottom: switch, cabinets, nodes, processors, cores]

Problems
Given
- The parallel machine topology
- The application communication pattern
Map application processes/chares to physical resources (cores) to reduce the communication costs
[Figure: zeus16.map, communication matrix of 16 processes; axes: sender rank, receiver rank]
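The mapping problem above can be illustrated with a small cost function. This is a minimal sketch, not the cost model used by the authors: it assumes a balanced tree described by per-level arities, measures distance as the number of levels two cores must climb to reach a common ancestor, and weights each exchanged volume by that distance. The names tree_distance, placement_cost, and the 4-process matrix are hypothetical.

```python
from typing import Sequence


def tree_distance(core_a: int, core_b: int, arities: Sequence[int]) -> int:
    """Levels to climb in a balanced tree before two leaf cores share an ancestor.

    arities is listed from the root down, e.g. (2, 2) means 2 nodes of 2 cores.
    """
    if core_a == core_b:
        return 0
    level = 0
    for arity in reversed(arities):  # walk from the leaves towards the root
        core_a //= arity
        core_b //= arity
        level += 1
        if core_a == core_b:
            break
    return level


def placement_cost(comm, sigma, arities) -> int:
    """Sum of volume * tree distance, where sigma[i] is the core of process i."""
    n = len(comm)
    return sum(
        comm[i][j] * tree_distance(sigma[i], sigma[j], arities)
        for i in range(n)
        for j in range(n)
    )


# Hypothetical pattern: processes 0<->2 and 1<->3 exchange most of the data.
comm = [
    [0, 1, 10, 1],
    [1, 0, 1, 10],
    [10, 1, 0, 1],
    [1, 10, 1, 0],
]
arities = (2, 2)  # assumed machine: 2 nodes, 2 cores per node

identity_cost = placement_cost(comm, [0, 1, 2, 3], arities)
grouped_cost = placement_cost(comm, [0, 2, 1, 3], arities)
print(identity_cost, grouped_cost)  # 92 56
```

Placing the heavily communicating pairs on the same node lowers the cost from 92 to 56 in this toy case, which is the effect a topology-aware mapping aims for.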

TreeMatch
The TreeMatch Algorithm
Algorithm and environment to compute process placement based on process affinities and NUMA topology
Input:
- The communication pattern of the application
  - Preliminary execution with a monitored MPI implementation for static placement
  - Dynamic recording on iterative applications with Charm++
- A model (tree) of the underlying architecture: Hwloc can provide this.
Output:
- A process permutation σ such that σ_i is the core number to which process i must be bound

Example
example16.mat ⇒ example16_TreeMatch.mat
σ = (0,2,8,10,4,6,12,14,1,3,9,11,5,7,13,15)
[Figure: communication matrices before (example16.mat) and after (example16_TreeMatch.mat) applying σ; axes: sender rank, receiver rank]
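How a permutation such as the σ of this example reorders a communication matrix can be sketched as follows. The σ values are taken from the slide; the input matrix is a synthetic placeholder (the content of example16.mat is not reproduced here), and apply_permutation is a hypothetical helper name.

```python
# sigma[i] is the core on which process i is bound (values from the slide).
SIGMA = (0, 2, 8, 10, 4, 6, 12, 14, 1, 3, 9, 11, 5, 7, 13, 15)


def apply_permutation(matrix, sigma):
    """Return the matrix indexed by core instead of by process rank."""
    n = len(matrix)
    permuted = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Traffic from process i to process j now flows between
            # core sigma[i] and core sigma[j].
            permuted[sigma[i]][sigma[j]] = matrix[i][j]
    return permuted


# Synthetic 16x16 matrix, used only so every entry is distinguishable.
matrix = [[16 * i + j for j in range(16)] for i in range(16)]
permuted = apply_permutation(matrix, SIGMA)

# Process 1 is bound to core 2 and process 2 to core 8.
print(permuted[2][8] == matrix[1][2])
```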

TreeMatch vs. existing solutions
Graph partitioners
- Parallel Scotch
- (Par)Metis
Other static algorithms
- [Träff 02]: placement through graph embedding and graph partitioning
- MPIPP [Chen et al. 2006]: placement through local exchange of processes
- LibTopoMap [Hoefler & Snir 11]: placement through network model + graph partitioning (ParMetis)
Other topology-aware load-balancing algorithms
- [L. L. Pilla, et al. 2012] NUCOLB, shared memory machines
- [L. L. Pilla, et al. 2012] HwTopoLB
All these solutions require quantitative information about the network and the communication duration.
TreeMatch: only qualitative information about the topology (the structure) is required.

Load balancing
Principle
- Iterative applications: load balancer called at regular intervals
- Migrate chares in order to optimize several criteria
- Charm++ runtime system provides:
  - chares load
  - chares affinity
  - etc.
Constraints
- Dealing with complex modern architectures
- Taking into account communications between elements
- Cost of migrations
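To make the principle concrete, here is a minimal sketch of a load-only balancing step: a generic "heaviest chare first, least loaded processor first" heuristic. It is an assumption-laden illustration, not the code of any Charm++ balancer, and it deliberately ignores affinity, topology, and migration cost, which are the constraints listed above. The function name and the load values are hypothetical.

```python
import heapq


def greedy_balance(chare_loads, num_procs):
    """Assign each chare to the currently least loaded processor.

    Returns (assignment, proc_loads) where assignment[c] is the processor
    chosen for chare c.
    """
    # Min-heap of (accumulated load, processor id).
    heap = [(0, p) for p in range(num_procs)]
    heapq.heapify(heap)
    assignment = [None] * len(chare_loads)
    # Visit chares from heaviest to lightest.
    for chare in sorted(range(len(chare_loads)), key=lambda c: -chare_loads[c]):
        load, proc = heapq.heappop(heap)
        assignment[chare] = proc
        heapq.heappush(heap, (load + chare_loads[chare], proc))
    proc_loads = [0] * num_procs
    for chare, proc in enumerate(assignment):
        proc_loads[proc] += chare_loads[chare]
    return assignment, proc_loads


# Hypothetical measured loads for 6 chares, balanced over 2 processors.
assignment, proc_loads = greedy_balance([7, 5, 4, 3, 3, 2], 2)
print(assignment, proc_loads)  # [0, 1, 1, 0, 1, 0] [12, 12]
```

Because this heuristic rebuilds the placement from scratch, it may move many chares and separate chares that communicate heavily, which is why migration cost and affinity appear among the constraints.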

What about Charm++?
Not so easy... Several issues raised!
- Scalability of TreeMatch
- Need to find a relevant compromise between process affinities and load balancing
  - Compute-bound applications
  - Communication-bound applications
- Impact of chare migrations?
- What about load balancing time?
The next slides will present two load balancers relying on TreeMatch
- Compute-bound applications: TMLB_Min_Weight, which applies a communication-aware load balancing by favoring CPU load levelling and minimizing migrations
- Communication-bound applications: TMLB_TreeBased, which performs a parallel communication-aware load balancing that favors the minimization of communication cost

Strategy for Charm++
TMLB_Min_Weight
- Applies TreeMatch on all chares (fake topology: #leaves = #chares)
- Binds chares according to their load leveling on less loaded chares (see example below)
- Hungarian algorithm to minimize migrations of groups of chares (min. weight matching)
[Figure: each part sorted by CPU load; chare placement + load balancing -> groups of chares; axes: chares, CPU load]
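The last step, matching groups of chares to processors so that as few chares as possible move, can be sketched as a minimum-weight assignment. The slide names the Hungarian algorithm; the sketch below swaps in a brute-force search over permutations, which returns the same optimum but is only practical for a handful of processors. The cost definition (number of chares in a group not already on the candidate processor) and all names and data are assumptions made for illustration.

```python
from itertools import permutations


def migration_cost(groups, current_proc):
    """cost[g][p] = chares of group g that must move if g is bound to processor p."""
    k = len(groups)
    return [
        [sum(1 for chare in group if current_proc[chare] != p) for p in range(k)]
        for group in groups
    ]


def min_migration_assignment(groups, current_proc):
    """Brute-force minimum-weight matching of groups to processors.

    Stand-in for the Hungarian algorithm (same result, exponential time).
    Returns (assignment, migrations) where assignment[g] is the processor
    chosen for group g.
    """
    cost = migration_cost(groups, current_proc)
    k = len(groups)
    best = min(
        permutations(range(k)),
        key=lambda perm: sum(cost[g][perm[g]] for g in range(k)),
    )
    return list(best), sum(cost[g][best[g]] for g in range(k))


# Hypothetical state: chare c currently runs on processor current_proc[c].
current_proc = [0, 0, 1, 1, 2, 2]
# Hypothetical groups produced by the placement + load balancing step.
groups = [[2, 3, 4], [5, 0], [1]]

assignment, migrations = min_migration_assignment(groups, current_proc)
print(assignment, migrations)  # [1, 2, 0] 2
```

Binding group g to processor g without matching would move 6 chares in this example; the matching reduces that to 2.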

Results
LeanMD
- Molecular Dynamics application
- Massive imbalance, few communications
- Experiments on 8 nodes with 8 cores on each (Intel Xeon 5550)
[Figure: LeanMD on 64 cores, 960 chares; execution time (in seconds) vs. particles per cell for Baseline, GreedyLB, RefineLB, and TMLB_min_weight]

Results
LeanMD - Migrations
Compared to TMLB_Min_Weight without minimizing migrations:
- Execution time up to 5% better
- Around 200 fewer migrations
[Figure: number of migrated chares in LeanMD, 960 chares on 64 cores, vs. particles per cell for GreedyLB, RefineLB, and TMLB_min_weight]
