Designing Multi-Leader based Allgather Algorithms for Multi-core Clusters

Designing Multi-Leader based Allgather Algorithms for Multi-core Clusters. Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda. Computer Science & Engineering Department, The Ohio State University.


slide-1
SLIDE 1

Designing Multi-Leader based Allgather Algorithms for Multi-core Clusters

Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda

Computer Science & Engineering Department
The Ohio State University

slide-2
SLIDE 2

Outline

  • Introduction and Background
  • Motivation
  • Related Work
  • Multi-Leader based Algorithms
  • Experimental evaluation
  • Conclusions and Future Work
slide-3
SLIDE 3

Introduction and Background

  • MPI is the de-facto programming model for HPC
  • Multi-core clusters are becoming increasingly common
  • Modern interconnects like InfiniBand offer high bandwidth and low latency
  • The collective communication primitives consume a significant amount of time
  • Necessary to have multi-core aware collective designs

slide-4
SLIDE 4

Allgather Communication

  • Each process broadcasts a vector of data to every other process in the group
  • Commonly used algorithms :
    • Recursive Doubling (RD) Algorithm for small messages:
      tcomm = ts * log2(p) + tw * (p - 1) * m
    • Ring Algorithm for large messages:
      tcomm = (ts + tw * m) * (p - 1)

where:
tcomm - Total communication cost
ts - Communication start-up cost
tw - Cost of sending a byte of data
p - Number of processes
m - Message size
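The two cost formulas can be evaluated directly. The sketch below is only an illustration of the model on this slide; the function names are made up here, and any values passed for ts and tw are assumptions, not measurements from the talk.

```python
import math


def rd_cost(p, m, ts, tw):
    """Modelled cost of recursive doubling: ts * log2(p) + tw * (p - 1) * m."""
    return ts * math.log2(p) + tw * (p - 1) * m


def ring_cost(p, m, ts, tw):
    """Modelled cost of the ring algorithm: (ts + tw * m) * (p - 1)."""
    return (ts + tw * m) * (p - 1)


def startup_gap(p, ts):
    """Difference ring_cost - rd_cost; the tw terms cancel in this model."""
    return ts * ((p - 1) - math.log2(p))
```

In this model both algorithms move the same volume, tw * (p - 1) * m, and differ only in the number of start-ups: log2(p) for recursive doubling versus p - 1 for the ring. The model does not include contention inside a node or on the network, so it cannot by itself explain the measured behavior on multi-core nodes.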

slide-5
SLIDE 5

Outline

  • Introduction and Background
  • Motivation
  • Related Work
  • Multi-Leader based Algorithms
  • Experimental evaluation
  • Conclusions and Future Work
slide-6
SLIDE 6

Recursive Doubling (RD) Algorithm on Multi-cores

[Figure: diagram of the recursive doubling exchange pattern across processes]

slide-7
SLIDE 7

Ring Algorithm on Multi-cores

[Figure: diagram of the ring exchange pattern across processes]
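To make the two exchange patterns concrete, here is a small simulation sketch. It is not the authors' code: it assumes a power-of-two process count for recursive doubling and a block mapping of ranks to nodes (ranks 0 to c-1 on the first node, and so on), and it only counts messages; it does not model contention or timing.

```python
def recursive_doubling(p):
    """Simulate allgather by recursive doubling. p must be a power of two.

    Returns (blocks held by each rank, list of (step, src, dst, blocks_sent)).
    """
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    have = [{r} for r in range(p)]
    sends = []
    dist, step = 1, 1
    while dist < p:
        snapshot = [set(s) for s in have]  # state before this step
        for r in range(p):
            partner = r ^ dist             # exchange partner at this distance
            have[r] |= snapshot[partner]
            sends.append((step, partner, r, len(snapshot[partner])))
        dist, step = dist * 2, step + 1
    return have, sends


def ring(p):
    """Simulate allgather on a ring: p - 1 steps, one block per send."""
    have = [{r} for r in range(p)]
    sends = []
    for step in range(1, p):
        for r in range(p):
            left = (r - 1) % p
            # The left neighbour forwards the block it received one step earlier.
            block = (left - (step - 1)) % p
            have[r].add(block)
            sends.append((step, left, r, 1))
    return have, sends


def inter_node(sends, cores_per_node):
    """Count sends (and blocks) whose endpoints sit on different nodes,
    assuming the block rank-to-node mapping described above."""
    crossing = [s for s in sends
                if s[1] // cores_per_node != s[2] // cores_per_node]
    return len(crossing), sum(s[3] for s in crossing)
```

With 16 ranks and an assumed 4 ranks per node, this sketch counts 32 inter-node sends carrying 192 blocks for recursive doubling (every rank crosses the node boundary in the last two steps, which also carry the largest messages), and 60 inter-node sends carrying 60 blocks for the ring (one rank per node crosses the boundary in each step).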

slide-8
SLIDE 8

Scaling on Multi-cores : Recursive Doubling Algorithm

[Figure: Latency (usec) vs. Message Size (Bytes), 32 to 4096 bytes; series: 8 Cores/Node, 16 Cores/Node]

Recursive Doubling (RD) scales poorly with increasing core counts

slide-9
SLIDE 9

Scaling on Multi-cores : Ring Algorithm

[Figure: Latency (usec) vs. Message Size (Bytes), 8192 to 262144 bytes; series: 8 Cores/Node, 16 Cores/Node]

Ring Algorithm scales as expected with increasing core counts

slide-10
SLIDE 10

Scaling on Large Scale Multi-core clusters (Recursive Doubling)

[Figure: Latency (usec) vs. Message Size (Bytes), 64 to 8192 bytes; series: 128, 256, 512 and 1024 Processes]

Recursive Doubling (RD) scales poorly for large system sizes

slide-11
SLIDE 11

Scaling on Large Scale Multi-core clusters (Ring Algorithm)

[Figure: Latency (usec) vs. Message Size (Bytes), 16384 to 262144 bytes; series: 128, 256, 512 and 1024 Processes]

Ring Algorithm scales as expected for large system sizes

slide-12
SLIDE 12

Problem Statement

  • Is it possible to design an algorithm to :
    • be Multi-core and NUMA aware to achieve better performance and scalability as core counts and system sizes increase?
    • fully exploit the differential memory access costs in NUMA based Multi-core systems?

slide-13
SLIDE 13

Outline

  • Introduction and Background
  • Motivation
  • Related Work
  • Multi-Leader based Algorithms
  • Experimental evaluation
  • Conclusions and Future Work
slide-14
SLIDE 14

Collective Design Framework

Collective Algorithms
  • Conventional Schemes: Pt2pt
  • Single Leader Schemes: Pt2pt, Shmem

slide-15
SLIDE 15

Existing Multi-core aware Algorithms

  • Single Leader approaches :

Aggregation - Distribution :

Step 1 : Data aggregation at the leader on each node
Step 2 : Inter-leader exchanges
Step 3 : Data distribution within each node

Steps 1 and 3 are intra-node operations. They can use either:
→ Point-to-point MPI calls
→ Shared memory buffer visible to all the processes within a node
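The three steps can be traced with a small, purely illustrative simulation. It assumes ranks are mapped to nodes in blocks and models the inter-leader exchange as a simple merge, standing in for whichever algorithm the leaders actually run. The function name and the "block<r>" payloads are made up for this sketch.

```python
def single_leader_allgather(num_nodes, cores_per_node):
    """Trace aggregation-distribution with one leader per node.

    Returns a list where entry r is the full gathered vector seen by rank r.
    """
    p = num_nodes * cores_per_node
    contrib = [f"block{r}" for r in range(p)]  # each rank's own data

    # Step 1 (intra-node): every rank hands its block to the node leader.
    node_buf = []
    for n in range(num_nodes):
        first = n * cores_per_node
        node_buf.append({r: contrib[r] for r in range(first, first + cores_per_node)})

    # Step 2 (inter-node): leaders exchange their aggregated buffers.
    # Modelled here as a merge of all node buffers.
    merged = {}
    for buf in node_buf:
        merged.update(buf)

    # Step 3 (intra-node): each leader distributes the full vector locally.
    full = [merged[r] for r in range(p)]
    return [list(full) for _ in range(p)]
```

Only one process per node takes part in Step 2, which is the point of the scheme; Steps 1 and 3 are where the point-to-point and shared-memory variants differ.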

slide-16
SLIDE 16

Single Leader Algorithms : Step 1 intra-node (pt2pt)

* * *

slide-17
SLIDE 17

Single Leader Algorithms : Step 2 inter-node (pt2pt)

* * *

slide-18
SLIDE 18

Single Leader Algorithms : Step 3 intra-node (pt2pt)

* * *

slide-19
SLIDE 19

Single Leader Algorithms : Step 1 intra-node (shmem)

* * *

slide-20
SLIDE 20

Single Leader Algorithms : Step 1 intra-node (shmem)

* * *

slide-21
SLIDE 21

Single Leader Algorithms : Step 2 inter-node (pt2pt)

* * *

slide-22
SLIDE 22

Single Leader Algorithms : Step 3 intra-node (shmem)

* * *

slide-23
SLIDE 23

Single Leader Algorithms : Step 3 intra-node (shmem)

* * *

slide-24
SLIDE 24

Performance of Single Leader Schemes

[Figure: Latency (usec) vs. Message Size (Bytes), 32 to 2048 bytes; series: Single Leader pt2pt, Single Leader shmem, Conventional (RD)]

Single Leader schemes show potential for improvement

slide-25
SLIDE 25

Performance of Single Leader Schemes

[Figure: Latency (usec) vs. Message Size (Bytes), 1024 to 16384 bytes; series: Single Leader pt2pt, Single Leader shmem, Conventional (Ring)]

Conventional Ring Algorithm performs better for larger messages

slide-26
SLIDE 26

Performance of Single Leader Schemes

[Figure: Latency (usec) vs. Message Size (Bytes), 16384 to 262144 bytes; series: Single Leader pt2pt, Conventional (Ring)]

Conventional Ring Algorithm performs better for larger messages

slide-27
SLIDE 27

Outline

  • Introduction and Background
  • Motivation
  • Related Work
  • Multi-Leader based Algorithms
  • Experimental evaluation
  • Conclusions and Future Work
slide-28
SLIDE 28

AMD Barcelona Architecture

[Figure: one node with four sockets (Socket 1 to Socket 4), each with four cores (C1 to C4) and its own memory (Mem), connected by HT links]

slide-29
SLIDE 29

Single Leader algorithms on the AMD Barcelona Architecture

[Figure: the same four-socket node (Socket 1 to Socket 4, cores C1 to C4 each, per-socket memory, HT links), annotated with the Leader Process and the Shared Memory buffer]

slide-30
SLIDE 30

Proposed Collective Design Framework

Collective Algorithms
  • Conventional Schemes: Pt2pt
  • Single Leader Schemes: Pt2pt, Shmem
  • Multi Leader Schemes: Pt2pt, Shmem

slide-31
SLIDE 31

Multi-Leader based Algorithms

  • Number of leader processes per node
  • Intra-socket and Inter-leader exchange algorithms
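A multi-leader variant can be sketched by splitting each node's cores into equal groups with one leader each. This is an assumed reading of the scheme, not the authors' implementation: the leader is taken to be the lowest rank of its group, ranks are mapped to nodes in blocks, and the inter-leader exchange is again modelled as a merge. With one group per socket, Steps 1 and 3 stay inside a socket.

```python
def multi_leader_allgather(num_nodes, cores_per_node, leaders_per_node):
    """Trace a three-step allgather with several leaders per node.

    Returns (per-rank gathered vectors, sorted list of leader ranks).
    """
    if cores_per_node % leaders_per_node:
        raise ValueError("leaders_per_node must divide cores_per_node")
    group = cores_per_node // leaders_per_node  # ranks served by one leader
    p = num_nodes * cores_per_node
    contrib = [f"block{r}" for r in range(p)]

    # Step 1: each leader aggregates the blocks of its own group.
    leader_buf = {
        leader: {r: contrib[r] for r in range(leader, leader + group)}
        for leader in range(0, p, group)
    }

    # Step 2: all leaders exchange their aggregated buffers (modelled as a merge).
    merged = {}
    for buf in leader_buf.values():
        merged.update(buf)

    # Step 3: each leader distributes the full vector to its group.
    full = [merged[r] for r in range(p)]
    result = [None] * p
    for leader in leader_buf:
        for r in range(leader, leader + group):
            result[r] = list(full)
    return result, sorted(leader_buf)
```

For example, 8 nodes with 16 cores each and 4 leaders per node gives 32 leaders, each serving a group of 4 ranks. Raising the leader count shrinks the intra-node groups but enlarges the inter-leader exchange, which is the trade-off behind choosing the number of leaders per node.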

slide-32
SLIDE 32

Multi-Leader based Algorithms(Step 1)

[Figure: four nodes (N1 to N4), with the four sockets of N1 labeled S1 to S4]

slide-33
SLIDE 33

Multi-Leader based Algorithms(Step 2)

[Figure: four nodes (N1 to N4), with the four sockets of N1 labeled S1 to S4]

slide-34
SLIDE 34

Multi-Leader based Algorithms(Step 3)

[Figure: four nodes (N1 to N4), with the four sockets of N1 labeled S1 to S4]

slide-35
SLIDE 35

Outline

  • Introduction and Background
  • Motivation
  • Related Work
  • Multi-Leader based Algorithms
  • Experimental evaluation
  • Conclusions and Future Work
slide-36
SLIDE 36

Experimental Test-bed

  • Each node of our testbed has 16 AMD Opteron 1.95 GHz processors with 512 KB L2 cache. We used 8 nodes.
  • Each node has 16 GB memory, a PCI-Express bus, and 2 MT25418 DDR HCAs with PCI-Express interfaces.
  • A 24-port Mellanox switch is used to connect all the nodes.

  • RedHat Enterprise Linux Server 5.
slide-37
SLIDE 37

Performance of Multi-Leader pt2pt

[Figure: Latency (usec) vs. Message Size (Bytes), 32 to 2048 bytes; series: 1 Leader pt2pt, 2 Leader pt2pt, 4 Leader pt2pt, 8 Leader pt2pt, Conventional (RD)]

4-Leader scheme does about 20% better than Single Leader scheme and 50% better than RD

slide-38
SLIDE 38

Performance of Multi-leader : Shared Memory

[Figure: Latency (usec) vs. Message Size (Bytes), 32 to 2048 bytes; series: 1 Leader shmem, 2 Leader shmem, 4 Leader shmem, 8 Leader shmem, Conventional (RD)]

4-Leader scheme performs better than 1-Leader scheme by about 25% and 70% better than RD

slide-39
SLIDE 39

Performance of Multi-Leader Schemes (pt2pt Vs Shared Memory)

[Figure: Latency (usec) vs. Message Size (Bytes), 32 to 2048 bytes; series: 4-Leader pt2pt, 4-Leader shmem, Conventional (RD)]

4-Leader Shared Memory approach performs better than 4-Leader Point-to-point scheme by about 40%

slide-40
SLIDE 40

Performance of Multi-Leader schemes on large scale Multi-cores

[Figure: Latency (usec) vs. Message Size (Bytes); series: 1-Leader pt2pt, 2-Leader pt2pt, 4-Leader pt2pt, 8-Leader pt2pt, Conventional (RD)]

4-Leader Point-to-point scheme outperforms the recursive doubling method on 1024 processes on the TACC Ranger

slide-41
SLIDE 41

Performance of Multi-Leader schemes on large scale Multi-cores

[Figure: Latency (usec) vs. Message Size (Bytes), 1024 to 16384 bytes; series: 1-Leader pt2pt, 2-Leader pt2pt, 4-Leader pt2pt, 8-Leader pt2pt, Conventional (Ring)]

Conventional Ring Algorithm performs better for larger messages

slide-42
SLIDE 42

Proposed Unified Scheme

Message Size      Intra-Node Mechanism   Inter-Leader Algorithm      Design
Small Messages    Point-to-Point         Recursive Doubling          Hierarchical
Medium Messages   Shared Memory          Recursive Doubling / Ring   Hierarchical
Large Messages    Point-to-Point         Ring                        Conventional
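The unified scheme amounts to a lookup on message size. The sketch below shows one way such a selector could look. The slide does not give the size boundaries, so `small_max` and `medium_max` are hypothetical parameters the caller must supply, and the `Scheme` type and `choose_scheme` name are invented for this illustration.

```python
from typing import NamedTuple


class Scheme(NamedTuple):
    intra_node: str    # mechanism used inside a node
    inter_leader: str  # algorithm used between leaders (or all ranks)
    design: str        # "hierarchical" or "conventional"


def choose_scheme(msg_bytes, small_max, medium_max):
    """Pick a row of the unified scheme for a given message size.

    small_max and medium_max are hypothetical thresholds in bytes; they are
    not taken from the talk and would need tuning per system.
    """
    if small_max > medium_max:
        raise ValueError("small_max must not exceed medium_max")
    if msg_bytes <= small_max:
        return Scheme("point-to-point", "recursive doubling", "hierarchical")
    if msg_bytes <= medium_max:
        return Scheme("shared memory", "recursive doubling or ring", "hierarchical")
    return Scheme("point-to-point", "ring", "conventional")
```

For instance, with placeholder thresholds of 1024 and 16384 bytes, a 64-byte message maps to the hierarchical point-to-point row and a 1 MB message maps to the conventional ring.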

slide-43
SLIDE 43

Outline

  • Introduction and Background
  • Motivation
  • Related Work
  • Multi-Leader based Algorithms
  • Experimental evaluation
  • Conclusions and Future Work
slide-44
SLIDE 44

Conclusions & Future Work

  • Single Leader schemes are limited by scalability and memory contention. The proposed Multi-Leader schemes show significant performance benefits.
  • Future work:
    • Examine the benefits of using kernel based zero-copy intra-node exchanges for large messages.
    • A framework that can choose leaders in an optimal manner for emerging multi-core systems.
    • Evaluate the impact of such designs on real-world applications.

slide-45
SLIDE 45

http://mvapich.cse.ohio-state.edu

slide-46
SLIDE 46

Thank you!

{kandalla, subramon, santhana, koop, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory