Implementation and Analysis of Nonblocking Collective Operations on SCI Networks (PowerPoint PPT Presentation)
SLIDE 1

Christian Kaiser, Torsten Hoefler, Boris Bierbaum, Thomas Bemmerl

Implementation and Analysis of Nonblocking Collective Operations on SCI Networks

SLIDE 2

Nonblocking Collectives for SCI Networks Chair for Operating Systems

Scalable Coherent Interface (SCI)

[Figure: SCI topologies, Ringlet and 2D Torus]

  • IEEE Std 1596-1992
  • Memory Coupled Clusters
  • Data Transfer: PIO and DMA
  • SISCI User-Level Interface

Test system:
  • 16 x Intel Pentium D, 2.8 GHz
  • SCI: D352 (IB: Mellanox DDR x4)
SLIDE 3

Collective Operations: GATHER

[Figure: GATHER. Processes 0-3 each hold one source block (A, B, C, D); the root, process 1, receives all four blocks A B C D in rank order in its destination buffer]
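The data movement can be sketched as a plain Python simulation (not MPI code; each rank's buffer is modeled as a list entry):

```python
# Simulate the semantics of MPI_GATHER: every rank contributes one
# block and the root receives all blocks in rank order.
def gather(blocks, root):
    """blocks[i] is rank i's send buffer; returns the per-rank
    receive buffers (only the root's is filled)."""
    recv = [None] * len(blocks)
    recv[root] = [blocks[src] for src in range(len(blocks))]
    return recv

# Four ranks holding A..D, root = process 1 as on the slide:
result = gather(["A", "B", "C", "D"], root=1)
print(result[1])  # ['A', 'B', 'C', 'D']
```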

SLIDE 4

Collective Operations: GATHERV

[Figure: GATHERV. Processes 0-3 hold source blocks A, B, C, D of varying sizes; the root, process 1, places them in its destination buffer at caller-specified displacements (shown on the slide as D C B A)]

SLIDE 5

Collective Operations: ALLTOALL

[Figure: ALLTOALL. Process 0 holds source blocks A0-A3, process 1 holds B0-B3, process 2 holds C0-C3, process 3 holds D0-D3; block j of each source buffer is delivered to process j, so process j's destination buffer ends up with Aj Bj Cj Dj]
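ALLTOALL is a transpose of the rank-by-rank block matrix, which a small Python simulation (not MPI code) makes explicit:

```python
# Simulate the semantics of MPI_ALLTOALL: rank i sends its j-th block
# to rank j, which stores it at position i -- a transpose of the
# block matrix.
def alltoall(send):
    """send[i][j] = block rank i sends to rank j; returns recv
    with recv[j][i] = send[i][j]."""
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]

# Rank 0 holds A0..A3, rank 1 holds B0..B3, and so on, as on the slide:
send = [[f"{name}{j}" for j in range(4)] for name in "ABCD"]
recv = alltoall(send)
print(recv[1])  # ['A1', 'B1', 'C1', 'D1']
```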

SLIDE 6

Collective Operations: ALLTOALLV

[Figure: ALLTOALLV. As ALLTOALL (process i sends its j-th block to process j), but with per-destination block sizes and displacements; the slide shows source blocks A0-A3, B0-B3, C0-C3, D0-D3 of varying sizes]

SLIDE 7

The SCI Collectives Library

Purpose:

  • Study collective communication algorithms for SCI clusters
  • Support multiple MPI libraries: Open MPI, NMPI
  • Support arbitrary communication libraries: LibNBC
SLIDE 8

Nonblocking Collectives (NBC)

Purpose: Overlap of Computation and Communication

SLIDE 9

NBC in MPI

MPI-2.0 JoD: Split Collectives

  MPI_BCAST_BEGIN(buffer, count, datatype, root, comm)
  MPI_BCAST_END(buffer, comm)

MPI-3 Draft:

  MPI_IBCAST(buffer, count, datatype, root, comm, request)
  MPI_WAIT(request, status)

MPI-2.1:

  • Implement with nonblocking point-to-point operations
  • Run blocking collectives in a separate thread
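The second MPI-2.1 workaround, running a blocking collective in a helper thread, can be sketched in Python (a simulation only; `dummy_bcast` is a stand-in for a real blocking broadcast, and the returned thread plays the role of the request handle):

```python
import threading

# Emulate a nonblocking broadcast by running a blocking one in a
# helper thread; waiting on the request means joining the thread.
def ibcast(buf, root, blocking_bcast):
    t = threading.Thread(target=blocking_bcast, args=(buf, root))
    t.start()
    return t  # the "request" handle

def dummy_bcast(buf, root):
    # Stand-in for a blocking broadcast: fill the buffer.
    buf[:] = ["data"] * len(buf)

buf = [None] * 4
req = ibcast(buf, root=0, blocking_bcast=dummy_bcast)
# ... overlapped computation would run here ...
req.join()  # the MPI_WAIT equivalent
print(buf)  # ['data', 'data', 'data', 'data']
```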
SLIDE 10

LibNBC

[Diagram: LibNBC software stack. Application kernels (FFT, CG, PC) run on top of LibNBC; LibNBC provides IB support, an scicoll adapter (scicoll built on SISCI and pthreads), and MPI support]

SLIDE 11

Rationale: NBC for SCI

So far:

  • Promising results with NBC via LibNBC
  • Research done on InfiniBand clusters

Therefore:

What about a very different network architecture?

Implementation considerations:

  • Use algorithms different from the blocking versions?
  • PIO vs. DMA?
  • Use a background thread?
SLIDE 12

Available Benchmarks for LibNBC API

Synthetic:

  • NBCBench: measures the communication overhead / overlap potential

Application Kernels:

  • CG (Alltoallv): 3D grid, overlaps computation with halo zone exchange
  • PC (Gatherv): overlaps compression with gathering of previous results
  • FFT (Alltoall): parallel matrix transpose, overlaps data exchange for the z transpose with computation for x and y

SLIDE 13

Gather(v)

  • Underlying concept: Hamiltonian Path in a 2D torus
  • Algorithms: Binary Tree, Binomial Tree, Flat Tree, Sequential Transmission
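One way to realize a Hamiltonian path over a 2D torus is a row-wise "snake" ordering; this is a generic sketch of the concept (not necessarily the exact mapping used in the talk), visiting every node exactly once with each step moving to a direct grid neighbour:

```python
# Build a Hamiltonian path over a rows x cols grid/torus by walking
# boustrophedon-style: even rows left-to-right, odd rows
# right-to-left, so consecutive nodes are always neighbours.
def snake_path(rows, cols):
    path = []
    for r in range(rows):
        cols_order = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        path.extend((r, c) for c in cols_order)
    return path

print(snake_path(2, 3))  # [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]
```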

SLIDE 14

Gather(v)/Alltoall(v)

Gather(v):

  • Additional progress thread: Binary Tree (PIO), Binomial Tree (PIO), Flat Tree (PIO), Sequential Transmission (PIO, DMA)
  • Single thread with manual progress: Sequential Transmission
  • Vector variant: Flat Tree and Sequential Transmission

Alltoall(v):

  • Additional progress thread: Bruck (PIO), Pairwise Exchange (PIO), Ring (PIO), Flat Tree (PIO)
  • Single thread with manual progress: Pairwise Exchange (DMA)
  • Vector variant: Pairwise Exchange, Flat Tree
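The "single thread with manual progress" alternative can be illustrated with a small state machine (a hypothetical sketch, not the library's actual API): the operation is a schedule of rounds, and each `test()` call advances at most one round, so communication progresses only when the application polls, with no background thread:

```python
# A nonblocking operation modeled as a schedule of rounds; progress
# happens only inside test(), never in a helper thread.
class NonblockingOp:
    def __init__(self, rounds):
        self.rounds = list(rounds)  # each round: one communication step
        self.pos = 0

    def test(self):
        """Advance at most one round; True once the schedule is done."""
        if self.pos < len(self.rounds):
            self.rounds[self.pos]()   # e.g. issue/complete one DMA transfer
            self.pos += 1
        return self.pos == len(self.rounds)

log = []
op = NonblockingOp([lambda i=i: log.append(f"round {i}") for i in range(3)])
while not op.test():
    pass  # application computation would run here between polls
print(log)  # ['round 0', 'round 1', 'round 2']
```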

SLIDE 15

Application Kernels: Algorithms

[Figures: algorithm illustrations for PC (Gatherv), CG (Alltoallv), and FFT (Alltoall)]

SLIDE 16

Communication Overhead (NBCBench)

[Graph: Gather communication overhead]

SLIDE 17

Communication Overhead (NBCBench)

[Graph: Alltoall communication overhead]

SLIDE 18

Application Kernels: Performance

[Graphs: performance of CG (Alltoallv), FFT (Alltoall), and PC (Gatherv)]

SLIDE 19

Conclusion

What we've done:

Implemented nonblocking Gather(v) and Alltoall(v) collective operations on SCI clusters with different algorithms and implementation alternatives

What we found out:

  • Applications can benefit from nonblocking collectives on SCI clusters in spite of inferior DMA performance
  • Best implementation method: DMA in a single thread; PIO is usually used for blocking collectives
  • Issues with multiple threads
SLIDE 20

The End

Thank you for your attention!