SLIDE 1

Designing Multi-Leader based Allgather Algorithms for Multi-core Clusters

Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda

Computer Science & Engineering Department The Ohio State University

SLIDE 2

Outline

Introduction and Background
Motivation
Related Work
Multi-Leader based Algorithms
Experimental evaluation
Conclusions and Future Work

SLIDE 3

Introduction and Background

MPI is the de-facto programming model for HPC
Multi-core clusters are becoming increasingly common
Modern interconnects like InfiniBand offer high bandwidth and low latency
The collective communication primitives consume a significant amount of time
Necessary to have multi-core aware collective designs

SLIDE 4

Allgather Communication

Each process broadcasts a vector of data to every other process in the group

Commonly used algorithms:

Recursive Doubling (RD) Algorithm for small messages
    tcomm = ts * log(p) + tw * (p - 1) * m

Ring Algorithm for large messages
    tcomm = (ts + tw * m) * (p - 1)

where:
tcomm - Total communication cost
ts - Communication start-up cost
tw - Cost of sending a byte of data
p - Number of processes
m - Message size
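The two cost formulas above can be evaluated directly. The following is a minimal sketch, not the authors' code; it assumes the logarithm is base 2 (the slide writes only log(p)), and the function names are hypothetical.

```python
import math


def rd_cost(p, m, ts, tw):
    """Recursive doubling cost: ts * log(p) + tw * (p - 1) * m.

    Assumption: log is taken base 2, one start-up per doubling step.
    """
    return ts * math.log2(p) + tw * (p - 1) * m


def ring_cost(p, m, ts, tw):
    """Ring cost: (ts + tw * m) * (p - 1)."""
    return (ts + tw * m) * (p - 1)
```

In this model both algorithms have the same bandwidth term, tw * (p - 1) * m, so they differ only in the number of start-ups: log(p) for recursive doubling against p - 1 for the ring. The model does not capture the multi-core effects examined in the following slides.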

SLIDE 5

Outline

Introduction and Background
Motivation
Related Work
Multi-Leader based Algorithms
Experimental evaluation
Conclusions and Future Work

SLIDE 6

Recursive Doubling (RD) Algorithm on Multi-cores

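As an illustration of how recursive doubling maps onto multi-core nodes, here is a small simulation sketch. It assumes a power-of-two process count and a block mapping of ranks to nodes (ranks 0 to c-1 on node 0, and so on); neither assumption is stated on the slide, and the function name is hypothetical.

```python
def recursive_doubling_allgather(p, cores_per_node):
    """Simulate a recursive doubling allgather over p ranks.

    Returns (buffers, inter_node_msgs): buffers[r] is the set of source
    ranks whose block rank r holds at the end, and inter_node_msgs[k] is
    the number of messages in step k that cross a node boundary.
    Assumptions: p is a power of two; rank r lives on node r // cores_per_node.
    """
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    buffers = [{r} for r in range(p)]
    inter_node_msgs = []
    dist = 1
    while dist < p:
        crossing = 0
        new_buffers = []
        for r in range(p):
            partner = r ^ dist  # exchange partner at this step
            if r // cores_per_node != partner // cores_per_node:
                crossing += 1
            # each rank ends the step with its own data plus its partner's
            new_buffers.append(buffers[r] | buffers[partner])
        buffers = new_buffers
        inter_node_msgs.append(crossing)
        dist *= 2
    return buffers, inter_node_msgs
```

Calling `recursive_doubling_allgather(16, 4)` shows that, under this mapping, the first two steps stay inside a node and the last two steps have every rank sending across nodes. Since the amount of data exchanged doubles at each step, those are also the largest messages.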

SLIDE 7

Ring Algorithm on Multi-cores

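The ring algorithm can be sketched in the same way. This is an illustrative simulation only, again assuming a block mapping of ranks to nodes; the function name and the block labels are hypothetical.

```python
def ring_allgather(p, cores_per_node):
    """Simulate a ring allgather over p ranks.

    Returns (buffers, inter_node_links): buffers[r][i] is the block that
    originated at rank i as held by rank r, and inter_node_links is the
    number of ring links (r -> r + 1) that cross a node boundary.
    Assumption: rank r lives on node r // cores_per_node.
    """
    buffers = [[None] * p for _ in range(p)]
    for r in range(p):
        buffers[r][r] = f"block{r}"

    inter_node_links = sum(
        1
        for r in range(p)
        if r // cores_per_node != ((r + 1) % p) // cores_per_node
    )

    for step in range(p - 1):
        for r in range(p):
            # rank r forwards the block it received in the previous step
            # (its own block in step 0) to its right-hand neighbour
            idx = (r - step) % p
            buffers[(r + 1) % p][idx] = buffers[r][idx]
    return buffers, inter_node_links
```

With 16 ranks and 4 cores per node, only 4 of the 16 ring links cross a node boundary in each of the p - 1 steps under this mapping.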

SLIDE 8

Scaling on Multi-cores : Recursive Doubling Algorithm

[Plot: latency (usec) vs. message size (32 to 4096 bytes); series: 8 Cores/Node, 16 Cores/Node]

Recursive Doubling (RD) scales poorly with increasing core counts

SLIDE 9

Scaling on Multi-cores : Ring Algorithm

[Plot: latency (usec) vs. message size (8192 to 262144 bytes); series: 8 Cores/Node, 16 Cores/Node]

Ring Algorithm scales as expected with increasing core counts

SLIDE 10

Scaling on Large Scale Multi-core clusters (Recursive Doubling)

[Plot: latency (usec) vs. message size (64 to 8192 bytes); series: 128, 256, 512 and 1024 Processes]

Recursive Doubling (RD) scales poorly for large system sizes

SLIDE 11

Scaling on Large Scale Multi-core clusters (Ring Algorithm)

[Plot: latency (usec) vs. message size (16384 to 262144 bytes); series: 128, 256, 512 and 1024 Processes]

Ring Algorithm scales as expected for large system sizes

SLIDE 12

Problem Statement

Is it possible to design an algorithm to:
be Multi-core and NUMA aware to achieve better performance and scalability as core counts and system sizes increase?
fully exploit the differential memory access costs in NUMA based Multi-core systems?

SLIDE 13

Outline

Introduction and Background
Motivation
Related Work
Multi-Leader based Algorithms
Experimental evaluation
Conclusions and Future Work

SLIDE 14

Collective Design Framework

Collective Algorithms
    Conventional Schemes: Pt2pt
    Single Leader Schemes: Pt2pt, Shmem

SLIDE 15

Existing Multi-core aware Algorithms

Single Leader approaches:

Aggregation – Distribution
Step 1: Data aggregation at the leader on each node
Step 2: Inter-leader exchanges
Step 3: Data distribution within each node

Steps 1 and 3 are intra-node operations, which can use either:
→ Point-to-point MPI calls
→ Shared memory buffer visible to all the processes within a node
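The three steps above can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the MVAPICH implementation: ranks are block-mapped to nodes, each node's leader is modelled implicitly, and the function name and block labels are hypothetical.

```python
def single_leader_allgather(num_nodes, cores_per_node):
    """Simulate the single-leader aggregation/distribution allgather.

    Returns result, where result[r] is the list of blocks held by rank r
    once all three steps have completed.
    Assumption: rank r lives on node r // cores_per_node.
    """
    p = num_nodes * cores_per_node

    # Step 1 (intra-node): the leader on each node aggregates local blocks
    node_blocks = [
        [f"d{r}" for r in range(n * cores_per_node, (n + 1) * cores_per_node)]
        for n in range(num_nodes)
    ]

    # Step 2 (inter-node): leaders exchange their aggregated blocks
    full = [block for blocks in node_blocks for block in blocks]
    leader_result = [list(full) for _ in range(num_nodes)]

    # Step 3 (intra-node): each leader distributes the result on its node
    return [list(leader_result[r // cores_per_node]) for r in range(p)]
```

Only Step 2 crosses node boundaries, and it involves one process per node. In Steps 1 and 3 every process on a node interacts with the same leader, whether through point-to-point calls or a shared memory buffer.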

SLIDE 16

Single Leader Algorithms : Step1 intra-node (pt2pt)

SLIDE 17

Single Leader Algorithms : Step2 inter-node (pt2pt)

SLIDE 18

Single Leader Algorithms : Step 3 intra-node (pt2pt)

SLIDE 19

Single Leader Algorithms : Step1 intra-node (shmem)

SLIDE 20

Single Leader Algorithms : Step1 intra-node (shmem)

SLIDE 21

Single Leader Algorithms : Step2 inter-node (pt2pt)

SLIDE 22

Single Leader Algorithms : Step 3 intra-node (shmem)

SLIDE 23

Single Leader Algorithms : Step3 intra-node (shmem)

SLIDE 24

Performance of Single Leader Schemes

[Plot: latency (usec) vs. message size (32 to 2048 bytes); series: Single Leader pt2pt, Single Leader shmem, Conventional (RD)]

Single Leader schemes show potential for improvement

SLIDE 25

Performance of Single Leader Schemes

[Plot: latency (usec) vs. message size (1024 to 16384 bytes); series: Single Leader pt2pt, Single Leader shmem, Conventional (Ring)]

Conventional Ring Algorithm performs better for larger messages

SLIDE 26

Performance of Single Leader Schemes

[Plot: latency (usec) vs. message size (16384 to 262144 bytes); series: Single Leader pt2pt, Conventional (Ring)]

Conventional Ring Algorithm performs better for larger messages

SLIDE 27

Outline

Introduction and Background
Motivation
Related Work
Multi-Leader based Algorithms
Experimental evaluation
Conclusions and Future Work

SLIDE 28

AMD Barcelona Architecture

[Diagram: one node with four sockets (Socket 1 to Socket 4), each with four cores (C1 to C4) and its own memory (Mem), connected by HT links]

SLIDE 29

Single Leader algorithms on the AMD Barcelona Architecture

[Diagram: four-socket node (Socket 1 to Socket 4, cores C1 to C4 on each, Mem per socket, HT links) with a single Leader Process and a Shared Memory buffer]

SLIDE 30

Proposed Collective Design Framework

Collective Algorithms
    Conventional Schemes: Pt2pt
    Single Leader Schemes: Pt2pt, Shmem
    Multi Leader Schemes: Pt2pt, Shmem

SLIDE 31

Multi-Leader based Algorithms

Number of leader processes per node
Intra-socket and Inter-leader exchange algorithms
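To make the role of the leader count concrete, the sketch below generalises the single-leader steps to several leaders per node. It rests on assumptions the slides do not spell out: that the three multi-leader steps mirror aggregation, inter-leader exchange and distribution, and that each leader serves an equal-sized contiguous group of ranks (for example one socket). The function name and block labels are hypothetical.

```python
def multi_leader_allgather(num_nodes, cores_per_node, leaders_per_node):
    """Simulate an allgather with several leaders per node.

    Returns (result, group_size): result[r] is the list of blocks held by
    rank r at the end, and group_size is the number of processes served
    by each leader.
    Assumptions: block rank mapping; cores_per_node is divisible by
    leaders_per_node; one leader per contiguous group of ranks.
    """
    assert cores_per_node % leaders_per_node == 0
    group_size = cores_per_node // leaders_per_node
    p = num_nodes * cores_per_node
    num_groups = p // group_size

    # Step 1: each leader aggregates the blocks of its own group
    group_blocks = [
        [f"d{r}" for r in range(g * group_size, (g + 1) * group_size)]
        for g in range(num_groups)
    ]

    # Step 2: all leaders exchange their aggregated blocks
    full = [block for blocks in group_blocks for block in blocks]
    leader_result = [list(full) for _ in range(num_groups)]

    # Step 3: each leader distributes the result within its group
    result = [list(leader_result[r // group_size]) for r in range(p)]
    return result, group_size
```

With 16 cores per node, moving from 1 leader to 4 leaders reduces the group each leader serves from 16 processes to 4, at the price of more participants in the inter-leader exchange. Setting `leaders_per_node=1` reduces the sketch to the single-leader case.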

SLIDE 32

Multi-Leader based Algorithms(Step 1)

[Diagram: four nodes (N1 to N4), each with four sockets (S1 to S4, labelled on N1) and one marked process (*) per socket]

SLIDE 33

Multi-Leader based Algorithms(Step 2)

[Diagram: four nodes (N1 to N4), each with four sockets (S1 to S4, labelled on N1) and one marked process (*) per socket]

SLIDE 34

Multi-Leader based Algorithms(Step 3)

[Diagram: four nodes (N1 to N4), each with four sockets (S1 to S4, labelled on N1) and one marked process (*) per socket]

SLIDE 35

Outline

Introduction and Background
Motivation
Related Work
Multi-Leader based Algorithms
Experimental evaluation
Conclusions and Future Work

SLIDE 36

Experimental Test-bed

Each node of our testbed has 16 AMD Opteron 1.95 GHz processors with 512 KB L2 cache. We used 8 nodes.
Each node has 16 GB of memory, a PCI-Express bus, and 2 MT25418 DDR HCAs with PCI-Ex interfaces.
A 24-port Mellanox switch is used to connect all the nodes.
RedHat Enterprise Linux Server 5.

SLIDE 37

Performance of Multi-Leader pt2pt

[Plot: latency (usec) vs. message size (32 to 2048 bytes); series: 1 Leader pt2pt, 2 Leader pt2pt, 4 Leader pt2pt, 8 Leader pt2pt, Conventional (RD)]

4-Leader scheme does about 20% better than Single Leader scheme and 50% better than RD

SLIDE 38

Performance of Multi-leader : Shared Memory

[Plot: latency (usec) vs. message size (32 to 2048 bytes); series: 1 Leader shmem, 2 Leader shmem, 4 Leader shmem, 8 Leader shmem, Conventional (RD)]

4-Leader scheme performs better than 1-Leader scheme by about 25% and 70% better than RD

SLIDE 39

Performance of Multi-Leader Schemes (pt2pt Vs Shared Memory)

[Plot: latency (usec) vs. message size (32 to 2048 bytes); series: 4-Leader pt2pt, 4-Leader shmem, Conventional (RD)]

4-Leader Shared Memory approach performs better than 4-Leader Point-to-point scheme by about 40%

SLIDE 40

Performance of Multi-Leader schemes on large scale Multi-cores

[Plot: latency (usec) vs. message size (bytes); series: 1-Leader pt2pt, 2-Leader pt2pt, 4-Leader pt2pt, 8-Leader pt2pt, Conventional (RD)]

4-Leader Point-to-point scheme outperforms the recursive doubling method on 1024 processes on the TACC Ranger

SLIDE 41

Performance of Multi-Leader schemes on large scale Multi-cores

[Plot: latency (usec) vs. message size (1024 to 16384 bytes); series: 1-Leader pt2pt, 2-Leader pt2pt, 4-Leader pt2pt, 8-Leader pt2pt, Conventional (Ring)]

Conventional Ring Algorithm performs better for larger messages

SLIDE 42

Proposed Unified Scheme

Message Size       Intra-Node Mechanism    Inter-Leader Algorithm    Design
Small Messages     Point-to-Point          Recursive Doubling        Hierarchical
Medium Messages    Shared Memory           Recursive / Ring          Hierarchical
Large Messages     Point-to-Point          Ring                      Conventional
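The unified scheme amounts to a lookup on message size. The sketch below shows one way such a selection could look. The slide does not give the boundaries between small, medium and large messages, so `small_max` and `medium_max` are hypothetical parameters that the caller must supply; the names `Scheme` and `select_scheme` are also hypothetical.

```python
from typing import NamedTuple


class Scheme(NamedTuple):
    intra_node: str
    inter_leader: str
    design: str


def select_scheme(msg_bytes, small_max, medium_max):
    """Pick a row of the unified scheme by message size.

    small_max and medium_max are caller-supplied thresholds in bytes;
    they are assumptions, not values taken from the slides.
    """
    if msg_bytes <= small_max:
        return Scheme("Point-to-Point", "Recursive Doubling", "Hierarchical")
    if msg_bytes <= medium_max:
        return Scheme("Shared Memory", "Recursive / Ring", "Hierarchical")
    return Scheme("Point-to-Point", "Ring", "Conventional")
```

For example, `select_scheme(64, small_max=1024, medium_max=16384)` returns the small-message row. The thresholds in that call are placeholders for illustration only and would need to be tuned per system.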

SLIDE 43

Outline

Introduction and Background
Motivation
Related Work
Multi-Leader based Algorithms
Experimental evaluation
Conclusions and Future Work

SLIDE 44

Conclusions & Future Work

Single Leader schemes are limited by scalability and memory contention.
Proposed Multi-Leader schemes show significant performance benefits.

Future work:
Examine the benefits of using kernel based zero-copy intra-node exchanges for large messages.
A framework that can choose leaders in an optimal manner for emerging multi-core systems.
Evaluate the impact of such designs on real-world applications.

SLIDE 45

http://mvapich.cse.ohio-state.edu

SLIDE 46

Thank you!

{kandalla, subramon, santhana, koop, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory