Designing Multi-Leader based Allgather Algorithms for Multi-core Clusters
Krishna Kandalla, Hari Subramoni, Gopal Santhanaraman, Matthew Koop and Dhabaleswar K. Panda
Computer Science & Engineering Department, The Ohio State University
tcomm - Total communication cost
ts - Communication start-up cost
tw - Cost of sending a byte of data
p - Number of processes
m - Message size
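The symbols above are usually combined into simple latency/bandwidth estimates for the two conventional allgather algorithms. The sketch below uses the common textbook forms, ts * log2(p) + tw * (p - 1) * m for recursive doubling and (p - 1) * (ts + tw * m) for ring. These formulas are an assumption here, since the extracted slide text lists only the symbols, and the function names and sample values are hypothetical.

```python
import math


def rd_cost(p, m, ts, tw):
    """Estimated allgather cost with recursive doubling (assumed textbook model).

    log2(p) exchange steps, each paying the start-up cost ts, and
    (p - 1) * m bytes received in total by every process.
    """
    return ts * math.log2(p) + tw * (p - 1) * m


def ring_cost(p, m, ts, tw):
    """Estimated allgather cost with the ring algorithm (assumed textbook model).

    p - 1 steps, each forwarding one m-byte block to the next neighbor.
    """
    return (p - 1) * (ts + tw * m)


if __name__ == "__main__":
    # Hypothetical parameters: 5 us start-up, 1 ns per byte, 64 processes.
    ts, tw, p = 5e-6, 1e-9, 64
    for m in (64, 65536):
        print(m, rd_cost(p, m, ts, tw), ring_cost(p, m, ts, tw))
```

Both estimates share the same bandwidth term and differ only in the number of start-ups, so the model by itself never favors the ring. It ignores contention among the cores of a node for the network interface, which is the kind of effect the measurements in the following slides expose.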
Recursive Doubling (RD) scales poorly with increasing core counts
[Figure: Latency (usec) vs. Message Size (Bytes), 32 to 4096 bytes. Series: 16 Cores/Node, 8 Cores/Node.]
Ring Algorithm scales as expected with increasing core counts
[Figure: Latency (usec) vs. Message Size (Bytes), 8192 to 262144 bytes. Series: 16 Cores/Node, 8 Cores/Node.]
Recursive Doubling (RD) scales poorly for large system sizes
[Figure: Latency (usec) vs. Message Size (Bytes), 64 to 8192 bytes. Series: 128, 256, 512 and 1024 Processes.]
Ring Algorithm scales as expected for large system sizes
[Figure: Latency (usec) vs. Message Size (Bytes), 16384 to 262144 bytes. Series: 128, 256, 512 and 1024 Processes.]
Collective Algorithms
  Conventional Schemes: Pt2pt
  Single Leader Schemes: Pt2pt, Shmem
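A single-leader scheme can be pictured as three phases: a gather inside each node, an allgather among the node leaders, and a distribution back inside each node. The simulation below is a minimal sketch of that idea on plain Python lists, not the authors' implementation. It assumes the leader is local rank 0, that the number of nodes is a power of two, and that leaders use recursive doubling; all function names are hypothetical.

```python
def recursive_doubling_allgather(bufs):
    """Simulate recursive doubling among len(bufs) participants.

    bufs[i] is the list of blocks held by participant i. The participant
    count must be a power of two. Returns one buffer per participant, each
    holding every block in rank order.
    """
    n = len(bufs)
    assert n > 0 and n & (n - 1) == 0, "power-of-two participant count assumed"
    bufs = [list(b) for b in bufs]
    dist = 1
    while dist < n:
        exchanged = []
        for rank in range(n):
            peer = rank ^ dist  # partner in this step
            lo, hi = (rank, peer) if rank < peer else (peer, rank)
            # Both partners end up with the lower group's data first.
            exchanged.append(bufs[lo] + bufs[hi])
        bufs = exchanged
        dist *= 2
    return bufs


def single_leader_allgather(nodes):
    """nodes[i][j] is the block contributed by local rank j on node i."""
    # Phase 1: the leader of each node gathers the local blocks.
    leader_bufs = [list(local) for local in nodes]
    # Phase 2: only the leaders communicate across the network.
    gathered = recursive_doubling_allgather(leader_bufs)
    # Phase 3: each leader hands the full result to its local processes.
    return [[list(gathered[i]) for _ in nodes[i]] for i in range(len(nodes))]


if __name__ == "__main__":
    result = single_leader_allgather([["a0", "a1"], ["b0", "b1"]])
    print(result[1][0])
```

Phases 1 and 3 are where the Pt2pt and Shmem variants differ: the same data movement can be done with point-to-point messages or through a shared-memory buffer.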
Single Leader schemes show potential for improvement
[Figure: Latency (usec) vs. Message Size (Bytes), 32 to 2048 bytes. Series: Single Leader pt2pt, Single Leader shmem, Conventional (RD).]
Conventional Ring Algorithm performs better for larger messages
[Figure: Latency (usec) vs. Message Size (Bytes), 1024 to 16384 bytes. Series: Single Leader pt2pt, Single Leader shmem, Conventional (Ring).]
Conventional Ring Algorithm performs better for larger messages
[Figure: Latency (usec) vs. Message Size (Bytes), 16384 to 262144 bytes. Series: Single Leader pt2pt, Conventional (Ring).]
[Diagram, shown twice: a node with four sockets (Socket 1 to Socket 4), each with four cores (C1 to C4) and attached memory (Mem), connected by HT links. Label: Leader Process.]
Collective Algorithms
  Conventional Schemes: Pt2pt
  Single Leader Schemes: Pt2pt, Shmem
  Multi Leader Schemes: Pt2pt, Shmem
[Diagram, three panels: four nodes (N1 to N4); node N1 is expanded into sockets S1 to S4, with asterisks marking processes.]
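The extracted slides do not spell out the steps of the multi-leader scheme, so the following is only one plausible arrangement, given as a minimal sketch: each node is split into equal groups (for example one per socket), each group has a leader, leaders with the same group index exchange data across nodes, and the leaders on a node then merge their buffers for all local processes. The function name, the rank layout (node by node) and the use of (rank, block) pairs for reordering are assumptions.

```python
def multi_leader_allgather(blocks, num_nodes, procs_per_node, leaders_per_node):
    """Simulate a multi-leader allgather on plain lists.

    blocks[r] is the block contributed by global rank r. Returns one result
    buffer per rank, each holding every block in rank order.
    """
    assert len(blocks) == num_nodes * procs_per_node
    assert procs_per_node % leaders_per_node == 0
    group_size = procs_per_node // leaders_per_node

    # Phase 1: every leader gathers (rank, block) pairs from its own group.
    leader = {}
    for node in range(num_nodes):
        for g in range(leaders_per_node):
            first = node * procs_per_node + g * group_size
            leader[(node, g)] = [(r, blocks[r]) for r in range(first, first + group_size)]

    # Phase 2: leaders with the same group index exchange across nodes.
    # Each leader moves only 1/leaders_per_node of its node's data.
    across = {}
    for g in range(leaders_per_node):
        combined = [item for node in range(num_nodes) for item in leader[(node, g)]]
        for node in range(num_nodes):
            across[(node, g)] = combined

    # Phase 3: inside each node, the leaders' buffers are merged, put back
    # into rank order and made visible to every local process.
    result = []
    for node in range(num_nodes):
        merged = sorted(item for g in range(leaders_per_node) for item in across[(node, g)])
        full = [payload for _, payload in merged]
        result.extend(list(full) for _ in range(procs_per_node))
    return result


if __name__ == "__main__":
    out = multi_leader_allgather(list("abcdefgh"), 2, 4, 2)
    print(out[0])
```

With leaders_per_node set to 1 the sketch degenerates to a single-leader scheme, which makes the trade-off visible: more leaders mean smaller per-leader transfers that can proceed concurrently, at the price of an extra merge step inside the node.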
4-Leader scheme does about 20% better than Single Leader scheme and 50% better than RD
[Figure: Latency (usec) vs. Message Size (Bytes), 32 to 2048 bytes. Series: 1 Leader pt2pt, 2 Leader pt2pt, 4 Leader pt2pt, 8 Leader pt2pt, Conventional (RD).]
4-Leader scheme performs better than 1-Leader scheme by about 25% and 70% better than RD
[Figure: Latency (usec) vs. Message Size (Bytes), 32 to 2048 bytes. Series: 1 Leader shmem, 2 Leader shmem, 4 Leader shmem, 8 Leader shmem, Conventional (RD).]
4-Leader Shared Memory approach performs better than 4-Leader Point-to-point scheme by about 40%
[Figure: Latency (usec) vs. Message Size (Bytes), 32 to 2048 bytes. Series: 4-Leader pt2pt, 4-Leader shmem, Conventional (RD).]
4-Leader Point-to-point scheme outperforms the recursive doubling method on 1024 processes on the TACC Ranger
[Figure: Latency (usec) vs. Message Size (Bytes). Series: 1-Leader pt2pt, 2-Leader pt2pt, 4-Leader pt2pt, 8-Leader pt2pt, Conventional (RD).]
Conventional Ring Algorithm performs better for larger messages
[Figure: Latency (usec) vs. Message Size (Bytes), 1024 to 16384 bytes. Series: 1-Leader pt2pt, 2-Leader pt2pt, 4-Leader pt2pt, 8-Leader pt2pt, Conventional (Ring).]
Message Size      Intra-Node Mechanism   Inter-Leader Algorithm      Design
Small Messages    Point-to-Point         Recursive Doubling          Hierarchical
Medium Messages   Shared Memory          Recursive Doubling / Ring   Hierarchical
Large Messages    Point-to-Point         Ring                        Conventional
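The selection summarized above amounts to a lookup keyed on message size. The sketch below shows how such a dispatch might look. The slides give no numeric boundaries between small, medium and large messages, so the two thresholds are hypothetical placeholders, as is the function name.

```python
def choose_allgather(msg_bytes, small_max=1024, medium_max=16384):
    """Pick an allgather configuration by message size.

    small_max and medium_max are hypothetical cut-offs in bytes; in practice
    they would be tuned per system.
    """
    if msg_bytes <= small_max:
        return {
            "intra_node": "Point-to-Point",
            "inter_leader": "Recursive Doubling",
            "design": "Hierarchical",
        }
    if msg_bytes <= medium_max:
        return {
            "intra_node": "Shared Memory",
            "inter_leader": "Recursive Doubling / Ring",
            "design": "Hierarchical",
        }
    return {
        "intra_node": "Point-to-Point",
        "inter_leader": "Ring",
        "design": "Conventional",
    }


if __name__ == "__main__":
    for size in (256, 8192, 262144):
        print(size, choose_allgather(size))
```

Keeping the thresholds as parameters reflects that the crossover points in the latency plots depend on core count and system size.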