

SLIDE 1

Overlapping Communication and Computation with High Level Communication Routines

  • On Optimizing Parallel Applications

Torsten Hoefler and Andrew Lumsdaine

Open Systems Lab Indiana University Bloomington, IN 47405, USA

Conference on Cluster Computing and the Grid (CCGrid’08)

Lyon, France

21st May 2008

SLIDE 2

Introduction

Solving Grand Challenge Problems
  • not a Grid talk: an HPC-centric view of highly scalable, tightly coupled machines
  • Thanks for the introduction, Manish!
  • All processors will be multi-core
  • All computers will be massively parallel
  • All programmers will be parallel programmers
  • All programs will be parallel programs
  • ⇒ All (massively) parallel programs need optimized communication (patterns)

SLIDE 3

Fundamental Assumptions (I)

We need more powerful machines!
  • Solutions for real-world scientific problems need huge processing power (Grand Challenges)
  • Capabilities of single PEs have fundamental limits
  • The scaling/frequency race is currently stagnating
  • Moore's law is still valid (number of transistors per chip)
  • Instruction-level parallelism is limited (pipelining, VLIW, multi-scalar)
  • Explicit parallelism seems to be the only solution
  • Single chips and transistors get cheaper
  • Implicit transistor use (ILP, branch prediction) has its limits

SLIDE 4

Fundamental Assumptions (II)

Parallelism requires communication
  • Local or even global data dependencies exist
  • Off-chip communication becomes necessary; it bridges a physical distance (many PEs)

Communication latency is limited
  • It is widely accepted that the speed of light limits data transmission
  • Example: minimal 0-byte latency for 1 m ≈ 3.3 ns ≈ 13 cycles on a 4 GHz PE
  • Bandwidth can hide latency only partially
  • Bandwidth is limited (physical constraints)
  • The problem of "scaling out" (especially iterative solvers)
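The cycle count in the example follows directly from the speed of light ($c \approx 3 \times 10^{8}$ m/s):

$t_{1\,\mathrm{m}} \ge \frac{1\,\mathrm{m}}{3 \times 10^{8}\,\mathrm{m/s}} \approx 3.3\,\mathrm{ns}$, and $3.3\,\mathrm{ns} \times 4\,\mathrm{GHz} \approx 13$ cycles.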

SLIDE 5

Assumptions about Parallel Program Optimization

Collective Operations
  • Collective Operations (COs) are an optimization tool
  • CO performance influences application performance
  • Optimized implementation and analysis of COs is non-trivial

Hardware Parallelism
  • More PEs handle more tasks in parallel
  • Transistors/PEs take over communication processing
  • Communication and computation could run simultaneously

Overlap of Communication and Computation
  • Overlap can hide latency
  • Improves application performance

SLIDE 6

Overview (I)

Theoretical Considerations
  • a model for parallel architectures
  • parametrize the model
  • derive models for BC and NBC (blocking and non-blocking collectives)
  • prove optimality of collective operations in the model (?)
  • show processor idle time during BC
  • show limits of the model (IB, BG/L)

Implementation of NBC
  • how to assess performance?
  • highly portable, low-performance version
  • IB-optimized, high-performance, threaded version

SLIDE 7

Overview (II)

Application Kernels
  • FFT (strong data dependency)
  • compression (parallel data analysis)
  • Poisson solver (2D decomposition)

Applications - show how performance benefits in microbenchmarks carry over to real-world applications
  • ABINIT
  • Octopus
  • OSEM medical image reconstruction

SLIDE 8

The LogGP model

Modelling Network Communication
  • The LogP model family offers the best tradeoff between ease of use and accuracy
  • LogGP is the most accurate variant across different message sizes

Methodology
  • assess LogGP parameters for modern interconnects
  • model collective communication
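For reference (not stated explicitly on the slide, but it is the form behind the "G*s+g" fits in the plots that follow), the standard LogGP estimate for transferring a single message of s bytes is

$T(s) \approx o_s + L + o_r + (s - 1) \cdot G$,

with $g$ (or $(s-1) \cdot G$ for large messages) as the minimum gap between consecutive message injections at the sender.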

[Figure: LogGP time diagram of a single message between sender and receiver, showing send overhead o_s, receive overhead o_r, latency L, and gaps g and G on the CPU and network levels]

SLIDE 9

TCP/IP - GigE/SMP

[Plot: time in microseconds vs. data size in bytes; curves: MPICH2 G·s+g, MPICH2 o, TCP G·s+g, TCP o]

SLIDE 10

Myrinet/GM (preregistered/cached)

[Plot: time in microseconds vs. data size in bytes; curves: Open MPI G·s+g, Open MPI o, Myrinet/GM G·s+g, Myrinet/GM o]

SLIDE 11

InfiniBand (preregistered/cached)

[Plot: time in microseconds vs. data size in bytes; curves: Open MPI G·s+g, Open MPI o, OpenIB G·s+g, OpenIB o]

SLIDE 12

Modelling Collectives

LogGP Models - general
  • $t_{barr} = (2o + L) \cdot \lceil \log_2 P \rceil$
  • $t_{allred} = 2 \cdot (2o + L + m \cdot G) \cdot \lceil \log_2 P \rceil + m \cdot \gamma \cdot \lceil \log_2 P \rceil$
  • $t_{bcast} = (2o + L + m \cdot G) \cdot \lceil \log_2 P \rceil$

CPU and Network LogGP parts
  • $t^{CPU}_{barr} = 2o \cdot \lceil \log_2 P \rceil$, $t^{NET}_{barr} = L \cdot \lceil \log_2 P \rceil$
  • $t^{CPU}_{allred} = (4o + m \cdot \gamma) \cdot \lceil \log_2 P \rceil$, $t^{NET}_{allred} = 2 \cdot (L + m \cdot G) \cdot \lceil \log_2 P \rceil$
  • $t^{CPU}_{bcast} = 2o \cdot \lceil \log_2 P \rceil$, $t^{NET}_{bcast} = (L + m \cdot G) \cdot \lceil \log_2 P \rceil$
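As a quick consistency check (implied by, though not written on, the slide), the CPU and network parts sum back to the general model, e.g. for the allreduce:

$t^{CPU}_{allred} + t^{NET}_{allred} = (4o + m\gamma)\lceil \log_2 P \rceil + 2(L + mG)\lceil \log_2 P \rceil = 2(2o + L + mG)\lceil \log_2 P \rceil + m\gamma \lceil \log_2 P \rceil = t_{allred}$,

and analogously for the barrier and broadcast models.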

SLIDE 13

CPU Overhead - MPI_Allreduce LAM/MPI 7.1.2

[Surface plot: CPU usage (share) as a function of communicator size and data size]

SLIDE 14

CPU Overhead - MPI_Allreduce MPICH2 1.0.3

[Surface plot: CPU usage (share) as a function of communicator size and data size]

SLIDE 15

Implementation of Non-blocking Collectives

LibNBC for MPI
  • single-threaded
  • highly portable
  • schedule-based design

LibNBC for InfiniBand
  • single-threaded (first version)
  • receiver-driven message passing
  • very low overhead

Threaded LibNBC
  • thread support requires MPI_THREAD_MULTIPLE
  • completely asynchronous progress
  • complicated due to scheduling issues
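A minimal sketch of the single-threaded overlap pattern, written against the MPI-3 nonblocking-collective interface (MPI_Ialltoall / MPI_Test / MPI_Wait) that later standardized these routines; LibNBC's NBC_* calls are used analogously, and compute_block() is a hypothetical placeholder for independent application work.

```c
/* Sketch only: overlap an MPI_Ialltoall with independent computation and
 * progress the (single-threaded) library by testing the request. */
#include <mpi.h>

void compute_block(int b);   /* hypothetical placeholder for application work */

void overlapped_alltoall(double *sendbuf, double *recvbuf, int count,
                         int nblocks, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    /* start the collective; it can proceed while we compute */
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, comm, &req);

    for (int b = 0; b < nblocks; b++) {
        compute_block(b);
        /* a single-threaded library only progresses when called into,
         * so test the request occasionally during the computation */
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    /* the exchanged data is needed from here on (no-op if already done) */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```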

SLIDE 16

LibNBC - Alltoall overhead, 64 nodes

[Plot: overhead in microseconds vs. message size; curves: Open MPI/blocking, LibNBC/Open MPI (1024), LibNBC/OF (waitonsend)]

SLIDE 17

First Example

Derivation from the “normal” implementation
  • data distribution identical to the “normal” 3D-FFT
  • first FFT in z direction and index swap are identical

Design goals to minimize communication overhead
  • start communication as early as possible
  • achieve maximum overlap time

Solution (sketched in code below)
  • start MPI_Ialltoall as soon as the first xz-plane is ready
  • calculate the next xz-plane
  • start the next communication accordingly ...
  • collect multiple xz-planes (tile factor)
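A structural sketch of this pipeline (not the authors' code), assuming hypothetical helpers fft_1d_z(), fft_1d_x(), plane_sendbuf(), and plane_recvbuf(); the tile factor (collecting several planes per message) is omitted for clarity.

```c
/* Structural sketch of the pipelined 3D-FFT (not the authors' code). */
#include <mpi.h>

#define NPLANES 3   /* xz-planes per process, as in the 3x3x3 example */

void    fft_1d_z(int plane);       /* 1D FFTs of one xz-plane in z direction */
void    fft_1d_x(int plane);       /* 1D FFTs of one xz-plane in x direction */
double *plane_sendbuf(int plane);  /* packed plane, ready to send            */
double *plane_recvbuf(int plane);  /* destination for the exchanged plane    */

void pipelined_fft(int count, MPI_Comm comm)
{
    MPI_Request req[NPLANES];

    /* transform each xz-plane in z and immediately start its exchange,
     * so the MPI_Ialltoall overlaps with transforming the next plane */
    for (int p = 0; p < NPLANES; p++) {
        fft_1d_z(p);
        MPI_Ialltoall(plane_sendbuf(p), count, MPI_DOUBLE,
                      plane_recvbuf(p), count, MPI_DOUBLE, comm, &req[p]);
    }

    /* wait for each plane's data, then finish it in x direction */
    for (int p = 0; p < NPLANES; p++) {
        MPI_Wait(&req[p], MPI_STATUS_IGNORE);
        fft_1d_x(p);
    }
}
```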

SLIDE 18

Transformation in z Direction

Data already transformed in y direction


1 block = 1 double value (3x3x3 grid)

SLIDE 19

Transformation in z Direction

Transform first xz plane in z direction


  • pattern means that data was transformed in y and z direction

SLIDE 20

Transformation in z Direction

start MPI_Ialltoall of first xz plane and transform second plane


  • cyan color means that data is communicated in the background

SLIDE 21

Transformation in z Direction

start MPI_Ialltoall of second xz plane and transform third plane


  • data of two planes is not accessible due to communication

SLIDE 22

Transformation in x Direction

start communication of the third plane and ...


  • we need the first xz plane to go on ...

SLIDE 23

Transformation in x Direction

... so MPI_Wait for the first MPI_Ialltoall!


  • and transform first plane (new pattern means xyz transformed)

SLIDE 24

Transformation in x Direction

Wait and transform second xz plane


  • first plane’s data could be accessed for next operation

SLIDE 25

Transformation in x Direction

wait and transform last xz plane


  • done! → 1 complete 1D-FFT overlaps a communication

SLIDE 26

3d-FFT performance

[Plot: FFT communication overhead in seconds on 2 to 64 nodes; series: MPI/BL, MPI/NBC, OF/NBC]

SLIDE 27

Parallel Data Compression

Second Example: Data-Parallel Loops (Parallel Compression)

    for (i = 0; i < N/P; i++) {
        compute(i);
    }
    comm(N/P);
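A hedged sketch of how such a loop can be overlapped: split the N/P iterations into blocks, start a nonblocking exchange for each finished block, and compute the next block while the previous exchange is in flight. compute_block() and start_block_comm() are hypothetical placeholders; the measured code uses LibNBC rather than this exact structure.

```c
/* Overlapped (pipelined) variant of the loop above; a sketch, not the
 * measured implementation. */
#include <mpi.h>

void compute_block(int block);                                     /* compress one block  */
void start_block_comm(int block, MPI_Comm comm, MPI_Request *req); /* e.g. MPI_Ialltoallv */

void overlapped_loop(int nblocks, MPI_Comm comm)
{
    MPI_Request req = MPI_REQUEST_NULL;

    for (int b = 0; b < nblocks; b++) {
        compute_block(b);                   /* compute/compress block b         */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* previous block's exchange done?  */
        start_block_comm(b, comm, &req);    /* send block b while computing b+1 */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* exchange of the last block       */
}
```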

SLIDE 28

Parallel Compression Performance

[Plot: communication overhead in seconds on 8 to 64 nodes; series: MPI/BL, MPI/NBC, OF/NBC]

SLIDE 29

Domain Decomposition

  • nearest-neighbor communication can be implemented with MPI_Alltoallv
  • we propose a new collective, MPI_Neighbor_xchg[v] (see the sketch below)

[Figure: 2D domain decomposition across processes P0 to P11, showing process-local data and the halo data exchanged with neighbors]
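For illustration, a minimal halo-exchange sketch using the MPI-3 neighborhood collective MPI_Neighbor_alltoall on a Cartesian communicator, which addresses the same use case as the proposed MPI_Neighbor_xchg[v]. The buffer layout (one contiguous block of halo_count doubles per neighbor) is an assumption of this sketch; MPI_Ineighbor_alltoall is the nonblocking form that enables overlap.

```c
/* Minimal halo-exchange sketch on a 2D process grid. sendhalo/recvhalo each
 * hold 4 blocks of halo_count doubles, one per Cartesian neighbor. */
#include <mpi.h>

void halo_exchange(double *sendhalo, double *recvhalo, int halo_count,
                   MPI_Comm comm)
{
    int dims[2] = {0, 0}, periods[2] = {0, 0}, nprocs;
    MPI_Comm cart;

    MPI_Comm_size(comm, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);                   /* choose a 2D grid  */
    MPI_Cart_create(comm, 2, dims, periods, 0, &cart);  /* neighbor topology */

    /* one block of halo_count doubles to/from each of the 4 neighbors;
     * MPI_Ineighbor_alltoall is the nonblocking form that allows overlap */
    MPI_Neighbor_alltoall(sendhalo, halo_count, MPI_DOUBLE,
                          recvhalo, halo_count, MPI_DOUBLE, cart);

    MPI_Comm_free(&cart);
}
```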

SLIDE 30

Parallel 3d-Poisson solver - Speedup

[Plots: speedup vs. number of CPUs (8 to 96) for Gigabit Ethernet and InfiniBand, each comparing blocking and non-blocking communication]

Cluster: 128 nodes, 2 GHz Opteron 246; Interconnect: Gigabit Ethernet, InfiniBand; System size 800x800x800 (1 node ≈ 5300 s)

SLIDE 31

Medical Image Reconstruction

  • OSEM algorithm
  • allreduction of the full image

[Plot: communication overhead in seconds on 8, 16, and 32 nodes; bars for MPI_Allreduce(), NBC_Iallreduce(), and NBC_Iallreduce() (threaded)]
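A generic sketch of the pattern behind the NBC_Iallreduce() bars above: start a nonblocking allreduce of the locally accumulated image and keep doing work that does not depend on the summed image until it is needed. This is written with MPI-3's MPI_Iallreduce (the standardized counterpart of NBC_Iallreduce); independent_work() is a hypothetical placeholder, and what can legally be overlapped is determined by the actual OSEM code, not by this sketch.

```c
/* Generic overlap sketch for the full-image reduction. */
#include <mpi.h>

void independent_work(void);   /* whatever can legally be overlapped */

void reduce_image(const double *local_img, double *global_img, int npixels,
                  MPI_Comm comm)
{
    MPI_Request req;

    MPI_Iallreduce(local_img, global_img, npixels, MPI_DOUBLE,
                   MPI_SUM, comm, &req);
    independent_work();                  /* overlap window                */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* summed image is available now */
}
```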

SLIDE 32

Thank you for your Attention
