SLIDE 1

Communication and Topology-aware Load Balancing in Charm++ with TreeMatch

Joint lab 10th workshop (IEEE Cluster 2013, Indianapolis, IN)
Emmanuel Jeannot, Esteban Meneses-Rojas, Guillaume Mercier, François Tessier, Gengbin Zheng
November 27, 2013

Emmanuel Jeannot Communication-aware load balancing 1 / 25

SLIDE 2

Introduction

Scalable execution of parallel applications
  • Number of cores is increasing
  • But memory per core is decreasing
  • Applications will need to communicate even more than now

Issues
  • Process placement should take into account process affinity
  • Here: load balancing in Charm++ taking into account:
      • load
      • affinity
      • topology
      • migration cost (transfer time)

SLIDE 3

Outline

1. Introduction
2. Problem and models
3. Load balancing for compute-bound applications
4. Load balancing for communication-bound applications
5. Conclusion

SLIDE 5

Charm++

Features
  • Parallel object-oriented programming language based on C++
  • Programs are decomposed into a number of cooperating message-driven objects called chares
  • In general there are more chares than processing units
  • Chares are mapped to physical processors by an adaptive runtime system
  • Load balancers can be called to migrate chares
  • Chare placement and load balancing are transparent to the programmer

SLIDE 6

Chares/Process Placement

Why we should consider it
  • Many current and future parallel platforms have several levels of hierarchy: cache hierarchy, memory bus, high-performance network...
  • Application chares/processes do not exchange the same amount of data (affinity)
  • The process placement policy may have an impact on performance

In this work we deal with tree topologies only.

[Figure: tree topology. Switch -> cabinets -> nodes -> processors -> cores]
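A tree topology of this kind can be described by the arity of each level. The sketch below is only an illustration: the `ARITIES` values and the helper names `num_cores` and `tree_distance` are assumptions, not taken from the talk. It counts how many levels two cores must climb before they share an ancestor, which is one simple way to express "how far apart" two cores are in the hierarchy.

```python
# Hypothetical balanced tree: switch -> 2 cabinets -> 2 nodes -> 2 processors -> 2 cores.
ARITIES = [2, 2, 2, 2]  # assumed arities, top level first


def num_cores(arities):
    """Number of leaves (cores) in a balanced tree with the given arities."""
    total = 1
    for arity in arities:
        total *= arity
    return total


def tree_distance(core_a, core_b, arities):
    """Levels to climb from two leaves until they share an ancestor (0 = same core)."""
    levels = 0
    # Walk up from the leaves: dividing by the arity of a level gives the parent index.
    for arity in reversed(arities):
        if core_a == core_b:
            break
        core_a //= arity
        core_b //= arity
        levels += 1
    return levels
```

With these assumed arities, two cores of the same processor are at distance 1, while cores in different cabinets are at distance 4.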

SLIDE 7

Problems

Given
  • The parallel machine topology
  • The application communication pattern

Map application processes/chares to physical resources (cores) to reduce the communication costs.

[Figure: communication matrix zeus16.map; axes: sender rank, receiver rank]
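One way to make "communication cost of a mapping" concrete is to weight each exchanged volume by the distance between the two cores involved. The sketch below assumes that cost model; the talk does not state its exact objective function, and the matrix `COMM`, the arities and the function names are made up for illustration.

```python
def tree_distance(a, b, arities):
    """Levels to climb from leaves a and b until they share an ancestor."""
    levels = 0
    for arity in reversed(arities):
        if a == b:
            break
        a, b = a // arity, b // arity
        levels += 1
    return levels


def mapping_cost(comm, sigma, arities):
    """Sum of volume * tree distance over all ordered pairs; sigma[i] is the core of process i."""
    n = len(comm)
    return sum(
        comm[i][j] * tree_distance(sigma[i], sigma[j], arities)
        for i in range(n)
        for j in range(n)
    )


# Toy pattern (made up): processes 0 and 2 talk a lot, and so do 1 and 3.
COMM = [
    [0, 1, 10, 1],
    [1, 0, 1, 10],
    [10, 1, 0, 1],
    [1, 10, 1, 0],
]
ARITIES = [2, 2]  # assumed machine: 2 nodes with 2 cores each

IDENTITY = [0, 1, 2, 3]  # process i on core i
GROUPED = [0, 2, 1, 3]   # processes 0 and 2 share a node, as do 1 and 3

print(mapping_cost(COMM, IDENTITY, ARITIES), mapping_cost(COMM, GROUPED, ARITIES))
```

Under this assumed model the grouped mapping is cheaper than the identity mapping, because the heavy exchanges stay inside a node.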

SLIDE 8

TreeMatch

The TreeMatch Algorithm
Algorithm and environment to compute process placement based on process affinities and NUMA topology.

Input:
  • The communication pattern of the application
      • Preliminary execution with a monitored MPI implementation for static placement
      • Dynamic recording on iterative applications with Charm++
  • A model (tree) of the underlying architecture: Hwloc can provide this

Output:
  • A process permutation σ such that σ_i is the core number to which process i must be bound
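To illustrate the input/output contract (a communication matrix in, a permutation σ out), here is a deliberately simplified greedy pairing. It is not the TreeMatch algorithm: it only handles a bottom level of arity 2 and an even number of processes, and the function names and the matrix `COMM` are hypothetical.

```python
def greedy_pairs(comm):
    """Pair processes by descending exchanged volume (even process count assumed)."""
    n = len(comm)
    candidates = sorted(
        ((comm[i][j] + comm[j][i], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    used, pairs = set(), []
    for _, i, j in candidates:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs


def pairs_to_sigma(pairs):
    """Place each pair on two consecutive cores; sigma[i] is the core of process i."""
    sigma = [None] * (2 * len(pairs))
    for k, (i, j) in enumerate(pairs):
        sigma[i], sigma[j] = 2 * k, 2 * k + 1
    return sigma


# Toy pattern (made up): processes 0 and 2 talk a lot, and so do 1 and 3.
COMM = [
    [0, 1, 10, 1],
    [1, 0, 1, 10],
    [10, 1, 0, 1],
    [1, 10, 1, 0],
]

sigma = pairs_to_sigma(greedy_pairs(COMM))
print(sigma)
```

The resulting σ puts each heavily communicating pair on neighboring cores, which is the kind of output the slide describes.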

SLIDE 10

Example

[Figure: communication matrix example16.mat; axes: sender rank, receiver rank]

σ = (0, 2, 8, 10, 4, 6, 12, 14, 1, 3, 9, 11, 5, 7, 13, 15)

⇒ [Figure: permuted communication matrix example16_TreeMatch.mat; axes: sender rank, receiver rank]
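The effect of σ on a communication matrix can be sketched as follows. The convention used here (entry (i, j) moves to position (σ_i, σ_j)) is an assumption consistent with "σ_i is the core of process i"; the helper names are hypothetical and the small 2x2 matrix is made up.

```python
# Permutation from the example slide.
SIGMA = (0, 2, 8, 10, 4, 6, 12, 14, 1, 3, 9, 11, 5, 7, 13, 15)


def is_permutation(sigma):
    """True if sigma contains every index 0..n-1 exactly once."""
    return sorted(sigma) == list(range(len(sigma)))


def permute_matrix(comm, sigma):
    """Re-index the matrix by core: out[sigma[i]][sigma[j]] = comm[i][j]."""
    n = len(comm)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[sigma[i]][sigma[j]] = comm[i][j]
    return out


# Tiny made-up matrix to show the re-indexing on its own.
SMALL = [[0, 5], [7, 0]]
SWAPPED = permute_matrix(SMALL, (1, 0))
```

With the σ above, processes 0 and 8 end up on cores 0 and 1, so their exchanges appear next to the diagonal of the permuted matrix.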

SLIDE 11

TreeMatch vs. existing solutions

Graph partitioners
  • Parallel Scotch
  • (Par)Metis

Other static algorithms
  • [Träff 02]: placement through graph embedding and graph partitioning
  • MPIPP [Chen et al. 2006]: placement through local exchange of processes
  • LibTopoMap [Hoefler & Snir 11]: placement through network model + graph partitioning (ParMetis)

Other topology-aware load-balancing algorithms
  • [L. L. Pilla, et al. 2012] NUCOLB, shared memory machines
  • [L. L. Pilla, et al. 2012] HwTopoLB

All these solutions require quantitative information about the network and the communication duration.
TreeMatch: only qualitative information about the topology (the structure) is required.

SLIDE 12

Load balancing

Principle
  • Iterative applications: load balancer called at regular intervals
  • Migrate chares in order to optimize several criteria
  • Charm++ runtime system provides:
      • chares load
      • chares affinity
      • etc.

Constraints
  • Dealing with complex modern architectures
  • Taking into account communications between elements
  • Cost of migrations

SLIDE 13

What about Charm++?

Not so easy... Several issues raised!
  • Scalability of TreeMatch
  • Need to find a relevant compromise between process affinities and load balancing
      • Compute-bound applications
      • Communication-bound applications
  • Impact of chare migrations?
  • What about load balancing time?

The next slides present two load balancers relying on TreeMatch:
  • Compute-bound applications: TMLB_Min_Weight, which applies a communication-aware load balancing by favoring CPU load leveling and minimizing migrations
  • Communication-bound applications: TMLB_TreeBased, which performs a parallel communication-aware load balancing by giving priority to the minimization of communication cost

SLIDE 16

Strategy for Charm++

TMLB_Min_Weight
  • Applies TreeMatch on all chares (fake topology: #leaves = #chares)
  • Binds chares according to their load, leveling on less loaded chares (see example below)
  • Hungarian algorithm to minimize group of chares migrations (min. weight matching)

[Figure: chares placement + load balancing -> groups of chares; each part is sorted by CPU load]
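The migration-minimizing step is a minimum-weight assignment of chare groups to cores. The sketch below assumes the weight of assigning a group to a core is the number of its chares that would have to move there; that cost definition, the data and the function names are illustrative assumptions. For brevity it tries every assignment instead of running the Hungarian algorithm, which gives the same optimum on tiny inputs but does not scale.

```python
from itertools import permutations


def migration_cost(groups, current_core, assignment):
    """Count chares whose assigned core differs from the core they sit on now."""
    return sum(
        1
        for group, core in zip(groups, assignment)
        for chare in group
        if current_core[chare] != core
    )


def min_migration_assignment(groups, current_core):
    """Exhaustive search over group-to-core assignments (stand-in for the Hungarian algorithm)."""
    cores = range(len(groups))
    return min(
        permutations(cores),
        key=lambda assignment: migration_cost(groups, current_core, assignment),
    )


# Made-up example: 6 chares currently spread over 3 cores.
CURRENT_CORE = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
# Groups as they might come out of a placement step (hypothetical).
GROUPS = [[2, 3], [4, 0], [1, 5]]

best = min_migration_assignment(GROUPS, CURRENT_CORE)
print(best, migration_cost(GROUPS, CURRENT_CORE, best))
```

In this toy case, sending group k to core k would move five chares, while the best assignment moves only two; a real implementation would replace the exhaustive search with a polynomial-time matching.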

SLIDE 18

Results

LeanMD
  • Molecular dynamics application
  • Massive imbalance, few communications
  • Experiments on 8 nodes with 8 cores each (Intel Xeon 5550)

[Figure: LeanMD on 64 cores, 960 chares. Execution time (in seconds) vs. particles per cell for Baseline, GreedyLB, RefineLB and TMLB_min_weight]

SLIDE 19

Results

LeanMD - Migrations
Compared to TMLB_Min_Weight without minimizing migrations:
  • Execution time up to 5% better
  • Around 200 fewer migrations

[Figure: number of migrated chares in LeanMD, 960 chares on 64 cores. Number of migrated chares vs. particles per cell for GreedyLB, RefineLB and TMLB_min_weight]

SLIDE 21

Strategy for Charm++

TMLB_TreeBased
  • 1st step: applies TreeMatch while considering groups of chares on cores
  • 2nd step: reorders chares inside each node
      • Defines the subtree
      • Creates a fake topology with as many leaves as the number of chares + something... (constraints)
      • Applies TreeMatch on this topology and the chares communication pattern
      • Binds chares according to their load (leveling on less loaded chares)
      • Each node in parallel

[Figure: groups of chares assigned to cores, with CPU load]
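The load-leveling idea on its own can be sketched with a generic greedy heuristic: take chares from heaviest to lightest and always place the next one on the least loaded core. This is an assumption chosen for illustration; how the talk's balancer combines leveling with the TreeMatch groups is not spelled out on the slide, and the loads and function names below are made up.

```python
import heapq


def level_loads(chare_loads, num_cores):
    """Greedy leveling: heaviest chare first, always onto the least loaded core."""
    heap = [(0.0, core) for core in range(num_cores)]  # (accumulated load, core id)
    heapq.heapify(heap)
    placement = {}
    for chare, load in sorted(chare_loads.items(), key=lambda item: -item[1]):
        core_load, core = heapq.heappop(heap)
        placement[chare] = core
        heapq.heappush(heap, (core_load + load, core))
    return placement


def core_totals(chare_loads, placement, num_cores):
    """Total load per core for a given placement."""
    totals = [0.0] * num_cores
    for chare, core in placement.items():
        totals[core] += chare_loads[chare]
    return totals


# Made-up CPU loads for six chares, to be spread over two cores of one node.
LOADS = {0: 5.0, 1: 4.0, 2: 3.0, 3: 3.0, 4: 2.0, 5: 1.0}

placement = level_loads(LOADS, 2)
print(core_totals(LOADS, placement, 2))
```

Because each node only looks at its own chares in the second step, a routine like this could run on every node independently, which matches the "each node in parallel" point.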

SLIDE 28

Results

kNeighbor
  • Benchmark application designed to simulate intensive communication between processes
  • Experiments on 8 nodes with 8 cores each (Intel Xeon 5550)
  • Particularly compared to RefineCommLB:
      • Takes into account load and communication
      • Minimizes migrations

[Figure: three plots, kNeighbor on 64 cores with 64, 128 and 256 elements (1 MB message size). Execution time (in seconds) for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased]

SLIDE 29

Results

Impact on communication
Evolution of communications between ten iterations

[Figure: communication between 10 iterations without any load balancing strategy (in thousands of messages sent); labels 1 to 8]
Values: 864 864 652 640 672 692 348 380 404 372 392 400 412 376

[Figure: communication between 10 iterations after the first call of TreeMatchLB (in thousands of messages sent); labels 1 to 8]
Values: 800 800 620 636 688 664 364 376 396 376 420 404 360 408

SLIDE 30

Results

Stencil3D
  • 3-dimensional stencil with regular communication with fixed neighbors
  • One chare per core: balance only considering communications
  • Only one load balancing step after 10 iterations
  • Experiments on 8 nodes with 8 cores each (Intel Xeon 5550)

[Figure: Stencil3D on 64 cores, 64 elements. Execution time (in seconds) for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased]

SLIDE 31

Results

What about the load balancing time?
  • Linear trajectory while the number of chares is doubled
  • TMLB_TreeBased is clearly slower than the other strategies
  • But the parallel version is almost implemented...

Figure: Load balancing time of the different strategies vs. number of chares for the kNeighbor application (running on 64 cores; execution time in ms; DummyLB, GreedyCommLB, GreedyLB, RefineCommLB, TMLB_TreeBased).

SLIDE 33

Conclusion and Future

Conclusion
  • Topology is not flat!
  • Process affinities are not homogeneous
  • Taking this information into account when mapping chares yields improvements
  • Need to distinguish between compute-bound and communication-bound applications
  • Several criteria taken into account: affinity, topology, load, migration cost, etc.

Future work
  • Find a better way to gather the topology (Hwloc?)
  • Distribute and parallelize TMLB_TreeBased on the different nodes (work in progress with the PPL)
  • Make TMLB_TreeBased more scalable for large-scale clusters: allow choosing the level in the hierarchy where the algorithm will be distributed
  • Hybrid architecture? Intel MIC?
  • Continue collaborations between Inria and PPL

SLIDE 34

The End

Thanks for your attention! Any questions?
