


SLIDE 1

Architectural Specialities of InfiniBand™ A new Barrier Algorithm for InfiniBand™ Results and Conclusions

Fast Barrier Synchronization for InfiniBand™

Torsten Hoefler

Chair of Computer Architecture
Technical University of Chemnitz

IPDPS’06 - CAC’06 Workshop

Rhodes Island, Greece

25th April 2006

Torsten Hoefler n-way Dissemination

SLIDE 2

1. Architectural Specialities of InfiniBand™
  • 1:n n:1 Microbenchmark
  • LogP Prediction
  • 1:n n:1 Benchmark Results

2. A new Barrier Algorithm for InfiniBand™
  • The Dissemination Algorithm
  • The n-way Dissemination Algorithm

3. Results and Conclusions
  • Comparison with other MPI_Barrier Implementations
  • Conclusions and Future Work

SLIDE 4

1:n n:1 Microbenchmark

  • Developed to analyze the InfiniBand™ network
  • Especially for collective communication
  • Measures single message performance (RDTSC)
  • MPI based
  • Supports (nearly) all transport types

SLIDE 5

1:n n:1 Microbenchmark - principle

[Figure: "ping" to hosts 1, 2, ..., P, then "pong" back from hosts 1, 2, ..., P]

1. (0): Take time
2. (1..n-1): Send a single message to n-1 hosts
3. (1..n-1): Hosts respond immediately
4. (0): Wait for message reception from all hosts
5. (0): Take time
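The five steps above can be sketched in a few lines of Python. This is only an illustration of the measurement pattern, not the benchmark itself: threads stand in for the hosts, in-process queues stand in for the network, and `time.perf_counter` stands in for RDTSC. The function name `one_to_n_round_trip` is made up for this sketch.

```python
import queue
import threading
import time


def one_to_n_round_trip(n_hosts):
    """Run one 1:n n:1 round trip; return (elapsed_seconds, replies_received).

    Host 0 is the calling thread; hosts 1..n-1 are worker threads
    (an assumption of this sketch, the real benchmark uses MPI processes).
    """
    pings = [queue.Queue() for _ in range(n_hosts - 1)]  # host 0 -> host i
    pongs = queue.Queue()                                # host i -> host 0

    def responder(inbox):
        inbox.get()        # wait for the ping from host 0
        pongs.put("pong")  # step 3: respond immediately

    workers = [threading.Thread(target=responder, args=(q,)) for q in pings]
    for w in workers:
        w.start()

    start = time.perf_counter()            # step 1: take time
    for q in pings:                        # step 2: one message per host
        q.put("ping")
    received = 0
    for _ in range(n_hosts - 1):           # step 4: wait for all replies
        pongs.get()
        received += 1
    elapsed = time.perf_counter() - start  # step 5: take time

    for w in workers:
        w.join()
    return elapsed, received


if __name__ == "__main__":
    elapsed, received = one_to_n_round_trip(8)
    print(f"{received} replies in {elapsed * 1e6:.1f} microseconds")
```

The absolute times from such a sketch say nothing about InfiniBand™; it only shows where the two timestamps sit relative to the sends and receives.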

SLIDE 7

The LogP Model

LogP model by Culler et al., 1993

LogP Parameters
  • L - Latency
  • g - Bandwidth-limiting gap between consecutive messages (g ≈ 1/BW)
  • o - Send/receive overhead
  • P - Number of involved processors

SLIDE 8

LogP Prediction

[Figure: predicted RTT(P)/P over P, with max{g, o} marked]

$RTT(P)/P = (4o + 2L + (P - 1) \cdot \max\{g, o\}) / P$
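To get a feeling for the shape of this prediction, the formula can be evaluated directly. The sketch below is a plain transcription of the expression above. The parameter values in the example (L = 5, o = 1, g = 2, in microseconds) are made up for illustration and are not measurements from the talk.

```python
def predicted_rtt_per_host(P, L, o, g):
    """LogP prediction of RTT(P)/P, transcribed from the slide's formula."""
    if P < 1:
        raise ValueError("P must be at least 1")
    return (4 * o + 2 * L + (P - 1) * max(g, o)) / P


if __name__ == "__main__":
    # Hypothetical parameters in microseconds, chosen only for illustration.
    L, o, g = 5.0, 1.0, 2.0
    for P in (1, 2, 4, 8, 16, 64):
        print(P, round(predicted_rtt_per_host(P, L, o, g), 3))
```

Rewriting the expression as max{g, o} + (4o + 2L − max{g, o})/P shows that, under this model, the per-host cost falls monotonically towards max{g, o} as P grows, provided 4o + 2L exceeds max{g, o}.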

SLIDE 10

1:n n:1 Benchmark Results

Test Environment
  • 8 Nodes Dual Xeon 2.066 GHz
  • Red Hat Linux release 9 (Shrike)
  • Kernel: 2.4.27 SMP
  • HCA: Mellanox "Cougar" (MTPB 23108)

SLIDE 11

1:n n:1 Benchmark Results

[Figure: minimal RTT in microseconds over # processors (P) for RDMA Write and RDMA Write inline]

  • RDMA/W - fastest transport type in our tests
  • Graph shows minimal values
  • We benefit from sending to multiple hosts simultaneously
  • Atomic was not available on our HCAs

SLIDE 12

1:n n:1 Benchmark Results

[Figure: average RTT in microseconds over # processors (P) for RDMA Write and RDMA Write inline]

  • Average graph has many outliers
  • Still the same "shape"

SLIDE 13

A possible Explanation - The LoP Model

[Figure: RTT(P) over P, with labels "pipeline", "processing", "saturation", t and t_saturated]

  • Pipeline startup function - hardware pipe, caches
  • Minimal processing time - hardware
  • Network saturation - network hardware / transceiver

SLIDE 15

The Dissemination Algorithm

[Figure: Round 0, Round 1, Round 2]

  • Logarithmic running time ($O(\log_2 P)$)
  • Works with non-power-of-two P

SLIDE 16

Dissemination - Peer selection

[Figure: Round 0, Round 1, Round 2]

$speer = (p + 2^r) \bmod P$
$rpeer = (p - 2^r) \bmod P$
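The peer formulas can be checked with a small simulation. The sketch below is not the implementation from the talk: it tracks, for every process, the set of processes it has (transitively) heard from, and assumes that all processes advance round by round in lockstep. The helper names are invented for this example.

```python
def dissemination_rounds(P):
    """Smallest number of rounds k with 2**k >= P."""
    rounds = 0
    while (1 << rounds) < P:
        rounds += 1
    return rounds


def peers(p, r, P):
    """Return (speer, rpeer) of process p in round r."""
    return (p + 2 ** r) % P, (p - 2 ** r) % P


def simulate_barrier(P):
    """True if every process has heard from all others after the last round."""
    known = [{p} for p in range(P)]
    for r in range(dissemination_rounds(P)):
        # Each process merges what its receive peer knew before this round.
        known = [known[p] | known[peers(p, r, P)[1]] for p in range(P)]
    everyone = set(range(P))
    return all(k == everyone for k in known)


if __name__ == "__main__":
    for P in (2, 5, 8, 13):
        print(P, dissemination_rounds(P), simulate_barrier(P))
```

The simulation also covers process counts that are not powers of two, for example P = 5 or P = 13, where the modulo wrap-around makes some information arrive twice without breaking the barrier.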

SLIDE 18

The n-way Dissemination Algorithm

[Figure: Round 0, Round 1]

  • Logarithmic running time ($O(\log_2 P) - O(\log_n P)$?)
  • Works with non-power-of-n P

SLIDE 19

n-way Dissemination - Peer selection

[Figure: Round 0, Round 1]

$speer_i = (p + i \cdot (n + 1)^r) \bmod P$
$rpeer_i = (p - i \cdot (n + 1)^r) \bmod P$
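A short simulation makes the effect of n on the number of rounds visible. This sketch rests on two assumptions that the slide does not spell out: the index i runs from 1 to n, and the algorithm needs the smallest number of rounds k with (n + 1)^k ≥ P. The function names are hypothetical, and the processes are assumed to advance in lockstep.

```python
def n_way_rounds(P, n):
    """Smallest k with (n + 1)**k >= P (assumed round count)."""
    rounds, reach = 0, 1
    while reach < P:
        reach *= n + 1
        rounds += 1
    return rounds


def n_way_peers(p, r, P, n):
    """Return (send_peers, recv_peers) of process p in round r, for i = 1..n."""
    step = (n + 1) ** r
    send = [(p + i * step) % P for i in range(1, n + 1)]
    recv = [(p - i * step) % P for i in range(1, n + 1)]
    return send, recv


def simulate_n_way(P, n):
    """True if every process has heard from all others after the last round."""
    known = [{p} for p in range(P)]
    for r in range(n_way_rounds(P, n)):
        updated = []
        for p in range(P):
            _, recv = n_way_peers(p, r, P, n)
            merged = set(known[p])
            for q in recv:
                # For small P a peer may coincide with p itself; merging is
                # harmless here, a real implementation might skip that send.
                merged |= known[q]
            updated.append(merged)
        known = updated
    everyone = set(range(P))
    return all(k == everyone for k in known)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, n_way_rounds(64, n), simulate_n_way(64, n))
```

With n = 1 the peer formulas reduce to the classic dissemination pattern. Under the assumptions above, 64 processes need 6 rounds for n = 1, 4 rounds for n = 2 and 3 rounds for n = 3, at the price of n messages per process and round.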

SLIDE 21

The n-way Dissemination Algorithm

Implementation Details
  • Implementation as collv1 component in Open MPI
  • Communication peers are precomputed

Benchmark Details
  • LAM/MPI 7.1.1
  • TUC SHIBA 1.0
  • MVAPICH 0.9.4

SLIDE 22

Benchmark Results

[Figure: average runtime in microseconds (rt) over # processors (P) for LAM-MPI, SHIBA, MVAPICH, IBBARR-1, IBBARR-2, IBBARR-3, IBBARR-4 and IBBARR-6]

  • LAM/MPI dominates
  • Zoom in ...

SLIDE 23

Benchmark Results

[Figure: average runtime in microseconds (rt) over # processors (P) for MVAPICH, IBBARR-1, IBBARR-2 and IBBARR-3]

  • Fastest Barrier in test

SLIDE 24

Benchmark Results

[Figure: speedup over # processors (P) for IBBARR-1, IBBARR-2, IBBARR-3 and MAXIMUM]

  • Speedup with regard to MVAPICH

SLIDE 26

Conclusions

InfiniBand™
  • InfiniBand™ hardware shows parallelism
  • n-way dissemination principle can lower barrier latency

MPI Layer
  • Open MPI collv1 provides a low overhead framework
  • n-selection non trivial → collv2

SLIDE 27

Future Work/Ongoing Efforts

InfiniBand™
  • Implementation of InfiniBand™ specialized collectives

MPI Layer
  • Open MPI collective framework version 2
