

SLIDE 1

CSE 262 Lecture 13

Communication overlap (continued); Heterogeneous processing

SLIDE 2

Announcements

  • Final presentations
      ◦ Friday, March 13th, 10:00 AM to 1:00 PM (note the time change)
      ◦ Room 3217, CSE Building (EBU3B)


slide-3
SLIDE 3

An alternative way to hide communication

  • Reformulate MPI code into a data-driven form
      ◦ Decouple scheduling and communication handling from the application
      ◦ Automatically overlap communication with computation (a minimal sketch of the starting SPMD pattern appears after the figure below)

[Figure: an SPMD MPI code (Irecv, Send, Wait, Compute) is translated into a task dependency graph executed by a runtime system with worker threads, communication handlers, and dynamic scheduling]

SLIDE 4

Bamboo Programming Model

  • Olap-regions: task switching points
      ◦ Data availability is checked at entry
      ◦ Only one olap-region may be active at a time
      ◦ When a task is ready, some olap-region's input conditions have been satisfied
  • Send blocks
      ◦ Hold send calls only
      ◦ Enable the olap-region
  • Receive blocks
      ◦ Hold receive and/or send calls
      ◦ Receive calls are inputs to the olap-region
      ◦ Send calls are outputs of the olap-region
  • Activities in send blocks must be independent of those in receive blocks
  • MPI_Wait/MPI_Waitall can reside anywhere within the olap-region

#pragma bamboo olap
{
    #pragma bamboo send
    { … }
    #pragma bamboo receive
    { … }
}
… Computation …
#pragma bamboo olap
{ … }

[Diagram: phases φ1, φ2, …, φN, each containing an olap-region (OLAP1 … OLAPN)]
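A slightly fuller sketch of how the pragmas might wrap an existing halo exchange; the neighbor ranks, buffers, and compute_interior() are illustrative names, not from the slides.

/* Hedged example: sends, receives, and the wait grouped inside one
   Bamboo olap-region; computation follows the region. */
for (int iter = 0; iter < nIters; iter++) {
    #pragma bamboo olap
    {
        #pragma bamboo send
        {
            MPI_Isend(top_row, n, MPI_DOUBLE, north, 0, comm, &sreq[0]);
            MPI_Isend(bot_row, n, MPI_DOUBLE, south, 0, comm, &sreq[1]);
        }
        #pragma bamboo receive
        {
            MPI_Irecv(top_halo, n, MPI_DOUBLE, north, 0, comm, &rreq[0]);
            MPI_Irecv(bot_halo, n, MPI_DOUBLE, south, 0, comm, &rreq[1]);
            MPI_Waitall(2, rreq, MPI_STATUSES_IGNORE);
        }
    }
    compute_interior();  /* runs once the olap-region's inputs are satisfied */
}

Each olap-region becomes a task switching point, so while one task waits for its halo the runtime can schedule another task's computation.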


SLIDE 5

Results

  • Stampede at TACC
      ◦ 102,400 cores; dual-socket Sandy Bridge processors
      ◦ K20 GPUs
  • Cray XE-6 at NERSC (Hopper)
      ◦ 153,216 cores; dual-socket 12-core Magny-Cours
      ◦ 4 NUMA nodes per Hopper node, each with 6 cores
      ◦ 3D toroidal network
  • Cray XC30 at NERSC (Edison)
      ◦ 133,824 cores; dual-socket 12-core Ivy Bridge
      ◦ Dragonfly network


SLIDE 6

Stencil application performance (Hopper)
[Figure: TFLOP/s vs. core count (12288–98304); series: MPI-basic, MPI-olap, Bamboo-basic, MPI-nocomm]

  • Solve the 3D Laplace equation Δu = 0 with Dirichlet boundary conditions u = f on ∂Ω (N = 3072³), using a 7-point stencil (a minimal sweep sketch appears below)
  • Added 4 Bamboo pragmas to a 419-line MPI code
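A minimal sketch (assumed array names and extents, not the benchmark's actual code) of one 7-point Jacobi sweep for Δu = 0 over the local interior points:

/* Hypothetical local update: each interior point becomes the average of
   its six neighbors; u and unew are the old/new local grids. */
for (int i = 1; i < nx - 1; i++)
    for (int j = 1; j < ny - 1; j++)
        for (int k = 1; k < nz - 1; k++)
            unew[i][j][k] = ( u[i-1][j][k] + u[i+1][j][k]
                            + u[i][j-1][k] + u[i][j+1][k]
                            + u[i][j][k-1] + u[i][j][k+1] ) / 6.0;

The halo planes read by this loop are what the MPI (and Bamboo-annotated) exchange supplies each iteration.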


SLIDE 7

2D Cannon - Weak scaling study

  • Communication cost: 11%–39%
  • Bamboo improves on MPI-basic by 9%–37%
  • Bamboo outperforms MPI-olap at scale

[Figure: TFLOP/s vs. cores (4096, 16384, 65536) on Edison; series: MPI-basic, MPI-olap, Bamboo, MPI-nocomm; problem sizes N0/4^(2/3), N0/4^(1/3), and N = N0 = 196608 across the three core counts]


SLIDE 8

Communication Avoiding Matrix Multiplication (Hopper)

  • Pathological matrices arising in planewave basis methods for ab initio molecular dynamics (Ng³ × Ne); for Si: Ng = 140, Ne = 2000
  • Weak scaling study; used OpenMP; 23 pragmas added to 337 lines of code

[Figure: TFLOP/s vs. cores (4096, 8192, 16384, 32768); series: MPI+OMP, MPI+OMP-olap, Bamboo+OMP, MPI+OMP-nocomm; matrix size grows from N = N0 = 20608 to 2^(1/3)·N0, 2^(2/3)·N0, and 2·N0 with core count; peak ≈ 210.4 TF]


SLIDE 9

Virtualization Improves Performance

[Figure: performance with virtualization for Jacobi (MPI+OMP) and Cannon 2.5D (MPI); configurations c=2, VF=8; c=2, VF=4; c=2, VF=2; c=4, VF=2]


SLIDE 10

Virtualization Improves Performance

[Figure: performance with virtualization for Jacobi (MPI+OMP) and Cannon 2D (MPI)]


SLIDE 11

High Performance Linpack (HPL) on Stampede

[Figure: TFLOP/s vs. matrix size (139264–180224); series: Basic, Unprioritized Scheduling, Prioritized Scheduling, Olap]

  • Solve systems of linear equations using LU factorization
  • The latency-tolerant lookahead code is complicated
  • 2048 cores on Stampede


  • Results
      ◦ Bamboo matches the performance of the highly optimized (lookahead) version of HPL
      ◦ Uses the far simpler non-lookahead version
      ◦ Task prioritization is crucial
      ◦ Bamboo improves the baseline version of HPL by up to 10%


SLIDE 12

Bamboo on multiple GPUs

  • MPI+CUDA programming model
      ◦ The CPU is the host and the GPU works as a device
      ◦ Host–host transfers with MPI, host–device transfers with CUDA (see the sketch after the diagram below)
      ◦ Optimize the MPI and CUDA portions separately
  • Need a GPU-aware programming model
      ◦ Allow a device to transfer data to another device
      ◦ The compiler and runtime system handle the data transfer
      ◦ Hide both host–device and host–host communication automatically

[Diagram: under MPI+CUDA, each MPI process moves data between its GPU and host with cudaMemcpy and between hosts with MPI send/recv; under the GPU-aware model, GPU tasks x and y exchange data via send/recv handled by the runtime system (RTS) on MPI ranks 0 and 1]
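A minimal sketch (illustrative names, pinned host buffers assumed) of the host-staged exchange that plain MPI+CUDA requires and that a GPU-aware runtime can hide:

#include <mpi.h>
#include <cuda_runtime.h>

/* Hedged example: device->host copy, host->host MPI, host->device copy. */
void exchange_mpi_cuda(double *d_send, double *d_recv,
                       double *h_send, double *h_recv,
                       int count, int nbr, MPI_Comm comm)
{
    size_t bytes = count * sizeof(double);
    cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);   /* device -> host */
    MPI_Sendrecv(h_send, count, MPI_DOUBLE, nbr, 0,
                 h_recv, count, MPI_DOUBLE, nbr, 0,
                 comm, MPI_STATUS_IGNORE);                       /* host -> host */
    cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);   /* host -> device */
}

A GPU-aware runtime hides both copies and the MPI transfer behind the execution of other tasks.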

SLIDE 13

3D Jacobi – Weak Scaling Study

  • Results on Stampede
      ◦ Bamboo-GPU outperforms MPI-basic
      ◦ Bamboo-GPU and MPI-olap hide most communication overheads
  • Bamboo-GPU improves performance by
      ◦ Hiding host–host transfers
      ◦ Hiding host–device transfers
      ◦ Letting tasks residing on the same GPU send only the address of the message

[Figure: GFLOP/s vs. GPU count (4, 8, 16, 32) on Stampede; series: MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, MPI-nocomm]


SLIDE 14

Multigrid – Weak Scaling

  • A geometric multigrid solver for Helmholtz's equation [Williams et al. 12]
      ◦ V-cycle: restrict, smooth, solve, interpolate, smooth (a minimal V-cycle sketch follows below)
      ◦ Smooth: Red-Black Gauss-Seidel
      ◦ DRAM-avoiding with the wave-front method
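A minimal sketch of the V-cycle structure named above; the type and routine names (Grid, smooth, restrict_residual, interpolate_and_correct, coarse_solve, coarse_of) are placeholders, not the solver's actual API.

typedef struct Grid Grid;  /* placeholder grid type */
void smooth(Grid *u, Grid *f);
void restrict_residual(Grid *u, Grid *f, int level);
void interpolate_and_correct(Grid *u, int level);
void coarse_solve(Grid *u, Grid *f);
Grid *coarse_of(Grid *g);

/* Hedged recursive V-cycle: pre-smooth, restrict, recurse, correct, post-smooth. */
void vcycle(Grid *u, Grid *f, int level, int nLevels)
{
    if (level == nLevels - 1) {           /* coarsest level */
        coarse_solve(u, f);
        return;
    }
    smooth(u, f);                         /* pre-smooth (e.g., Red-Black GS) */
    restrict_residual(u, f, level);       /* move the residual to level+1 */
    vcycle(coarse_of(u), coarse_of(f), level + 1, nLevels);
    interpolate_and_correct(u, level);    /* bring the correction back */
    smooth(u, f);                         /* post-smooth */
}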

[Figure: time (secs) vs. cores (2048–32768) on Edison; series: MPI, Bamboo, MPI-nocomm]

Results:
  • Communication cost: 16%–22%
  • Bamboo improves performance by up to 14%
  • Communication overlap is effective on levels L0 and L1

Per-level breakdown (times in seconds; L0–L4 give communication as a fraction of total time at each level):

Cores   Comm    Compute  pack/unpack  inter-box copy   L0    L1    L2    L3    L4
2048    0.448   1.725    0.384        0.191            12%   21%   36%   48%   48%
4096    0.476   1.722    0.353        0.191            12%   24%   37%   56%   50%
8192    0.570   1.722    0.384        0.191            13%   27%   45%   69%   63%
16384   0.535   1.726    0.386        0.192            12%   30%   48%   53%   49%
32768   0.646   1.714    0.376        0.189            17%   28%   44%   63%   58%

SLIDE 15

A GPU-aware programming model

  • MPI+CUDA programming model
      ◦ The CPU is the host and the GPU works as a device
      ◦ Host–host transfers with MPI, host–device transfers with CUDA
      ◦ Optimize the MPI and CUDA portions separately
  • A GPU-aware programming model
      ◦ Allow a device to transfer data to another device
      ◦ The compiler and runtime system handle the data transfer
      ◦ We implemented a GPU-aware runtime system
      ◦ Hide both host–device and host–host communication automatically

[Diagram: under MPI+CUDA, each MPI process moves data between its GPU and host with cudaMemcpy and between hosts with MPI send/recv; under the GPU-aware model, GPU tasks x and y exchange data via send/recv handled by the runtime system (RTS) on MPI ranks 0 and 1]

SLIDE 16

3D Jacobi – Weak Scaling

  • Results
      ◦ Bamboo-GPU outperforms MPI-basic
      ◦ Bamboo-GPU and MPI-olap hide most communication overheads
  • Bamboo-GPU improves performance by
      ◦ Hiding host–host transfers
      ◦ Hiding host–device transfers
      ◦ Letting tasks residing on the same GPU send only the address of the message

[Figure: GFLOP/s vs. GPU count (4, 8, 16, 32) on Stampede; series: MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, MPI-nocomm]

SLIDE 17

Bamboo Design

  • Core message passing
      ◦ Supports point-to-point routines
      ◦ Requires programmer annotation
      ◦ Employs the Tarragon runtime system [Cicotti 06, 11]
  • Subcommunicator layer
      ◦ Supports MPI_Comm_split
      ◦ No annotation required
  • Collectives
      ◦ A framework to translate collectives
      ◦ Common collectives are implemented
      ◦ No annotation required
  • User-defined subprograms
      ◦ A normal MPI program

[Diagram: Bamboo implementation of collective routines — layers: Collectives, Subcommunicator, Core message passing, User-defined subprograms]


SLIDE 18

Bamboo Translator

[Diagram: Bamboo translator pipeline — annotated MPI input → EDG front end → ROSE AST → Bamboo middle end (annotation handler, MPI extractor, analyzer; transformer: inlining, outlining, MPI translation, MPI reordering; optimizer) → ROSE back end → Tarragon code]


SLIDE 19

Bamboo Transformations

  • Outlining
      ◦ TaskGraph definition: fill the various Tarragon methods with the input source code blocks
  • MPI translation: capture MPI calls and generate calls to Tarragon
      ◦ Some MPI calls are removed, e.g. Barrier(), Wait()
      ◦ Conservative static analysis determines task dependencies
  • Code reordering: reorder certain code to accommodate Tarragon semantics


SLIDE 20

Code outlining

[Diagram: the runtime system (RTS) listens for incoming messages; when all of a task's input is ready, the task becomes fireable (state = EXEC), computes, and outputs its data; otherwise the RTS keeps listening]

for (int iter = 0; iter < nIters; iter++) {
    #pragma bamboo olap
    {
        #pragma bamboo send
        { … }
        #pragma bamboo receive
        { … }
    }
    compute
}

(The task injects its output data when iter reaches nIters.)


SLIDE 21

Firing & yielding Rule Generation

  • Extract the source information from all Recv and Irecv calls of each olap-region, including associated if and for statements
  • A task is fireable if and only if it has received messages from all sources

Annotated receive loop:

for (source = 0 to n)
    if (source % 2 == 0)
        MPI_Recv from source

Generated test function:

bool test() {
    for (source = 0 to n)
        if (source % 2 == 0)
            if (notArrivedFrom(source)) return false;
    return true;
}

Firing rule: Recv(0) ∧ Recv(2) ∧ … ∧ Recv(n)

while (messageArrival) { return test(); }

Yielding rule:

yield = !test();


SLIDE 22

Inter-procedural translation

void multiGridSolver() {
    #pragma bamboo olap
    for (int level = 0; level < nLevels; level++) {
        Send_to_neighbors();
        Receive_from_neighbors();
        /* update the data grid */
    }
}

for (cycle = 0 to nVcycles) {
    multiGridSolver();
}

void Send_to_neighbors() {
    forall neighbors
        if (neighbor) MPI_Isend(neighbor);
}

void Receive_from_neighbors() {
    forall neighbors
        if (neighbor) MPI_Irecv(neighbor);
}

Bamboo inlines only those functions that directly or indirectly contain MPI calls.

[Diagram: call graph rooted at main; invoked functions containing MPI calls are inlined]


SLIDE 23

Collective implementation

Main file:

error = MPI_Barrier(comm);   /* rewritten as */   Bamboo_Barrier(comm);

Collective library:

int Bamboo_Barrier(MPI_Comm comm) {
    #pragma bamboo olap
    for (int step = 1; step < size; step <<= 1) {
        MPI_Send(1 byte to (rank + step) % size, comm);
        MPI_Recv(1 byte from (rank - step + size) % size, comm);
    }
    return SUCCESS;
}

Rename the collective call, merge the library's AST into main, then inline the collective call:

comm_0 = comm;
#pragma bamboo olap
for (int step = 1; step < size; step <<= 1) {
    MPI_Send(1 byte to (rank + step) % size, comm_0);
    MPI_Recv(1 byte from (rank - step + size) % size, comm_0);
}
error = SUCCESS;

The inlined code then proceeds through the normal translation process.
