  1. CSE 262 Lecture 13: Communication overlap (continued); Heterogeneous processing
  Scott B. Baden / CSE 262 / UCSD, Wi '15

  2. Announcements
  • Final presentations
    - Friday, March 13th, 10:00 AM to 1:00 PM (note time change)
    - Room 3217, CSE Building (EBU3B)

  3. An alternative way to hide communication
  • Reformulate MPI code into a data-driven form
    - Decouple scheduling and communication handling from the application
    - Automatically overlap communication with computation
  [Figure: an SPMD MPI code (Irecv, Send, Wait, Compute) is transformed into a task dependency graph that a runtime system executes with communication handlers and worker threads under dynamic scheduling]
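For reference, the SPMD MPI pattern the figure starts from looks roughly like the following halo-exchange fragment. This is a minimal sketch, not the course code: the buffer names, neighbor ranks, and the computation step are illustrative placeholders.

    /* Sketch of the SPMD MPI pattern Bamboo reformulates:
       post receives, send boundary faces, wait, then compute. */
    #include <mpi.h>

    void exchange_then_compute(double *ghost_lo, double *ghost_hi,
                               double *face_lo, double *face_hi,
                               int n, int rank_lo, int rank_hi)
    {
        MPI_Request reqs[2];
        MPI_Irecv(ghost_lo, n, MPI_DOUBLE, rank_lo, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(ghost_hi, n, MPI_DOUBLE, rank_hi, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Send(face_lo, n, MPI_DOUBLE, rank_lo, 1, MPI_COMM_WORLD);
        MPI_Send(face_hi, n, MPI_DOUBLE, rank_hi, 0, MPI_COMM_WORLD);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* blocks here: no overlap */
        /* the computation runs only after all communication has completed */
    }

The wait forces the process to idle until communication finishes; the data-driven reformulation removes that idle time by switching to another ready task instead.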

  4. Bamboo Programming Model
  • Olap-regions: task switching points
    - Data availability is checked at entry
    - Only one olap region may be active at a time
    - When a task is ready, some olap region's input conditions have been satisfied
  • Send blocks
    - Hold send calls only
    - Enable the olap-region
  • Receive blocks
    - Hold receive and/or send calls
    - Receive calls are input to the olap-region
    - Send calls are output of an olap-region
  • Activities in send blocks must be independent of those in receive blocks
  • MPI_Wait/MPI_Waitall can reside anywhere within the olap-region

  Annotation skeleton from the slide:

    #pragma bamboo olap
    {
        #pragma bamboo send
        { ... }
        #pragma bamboo receive
        { ... }
    }
    ... computation ...
    #pragma bamboo olap
    { ... }

  [Figure: a program partitioned into phases φ1 ... φN separated by olap-regions OLAP1 ... OLAPN]
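Putting the pragmas together with a halo exchange gives a shape like the sketch below. This is an illustration based on the block structure described above, not code from the lecture; the buffers and ranks are placeholders, and MPI_Waitall is placed in the receive block, which the rules above allow since a wait may reside anywhere within the olap-region.

    /* Sketch: a halo exchange annotated with Bamboo pragmas.
       The send block holds only sends; the receive block holds the
       receives (inputs to the olap-region) and the wait. */
    MPI_Request reqs[2];
    #pragma bamboo olap
    {
        #pragma bamboo send
        {
            MPI_Send(face_lo, n, MPI_DOUBLE, rank_lo, 1, MPI_COMM_WORLD);
            MPI_Send(face_hi, n, MPI_DOUBLE, rank_hi, 0, MPI_COMM_WORLD);
        }
        #pragma bamboo receive
        {
            MPI_Irecv(ghost_lo, n, MPI_DOUBLE, rank_lo, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Irecv(ghost_hi, n, MPI_DOUBLE, rank_hi, 1, MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        }
    }
    /* ... computation on the local subgrid follows the olap-region ... */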

  5. Results
  • Stampede at TACC
    - 102,400 cores; dual-socket Sandy Bridge processors
    - K20 GPUs
  • Cray XE-6 at NERSC (Hopper)
    - 153,216 cores; dual-socket 12-core Magny-Cours
    - 4 NUMA nodes per Hopper node, each with 6 cores
    - 3D toroidal network
  • Cray XC30 at NERSC (Edison)
    - 133,824 cores; dual-socket 12-core Ivy Bridge
    - Dragonfly network

  6. Stencil application performance (Hopper)
  • Solve the 3D Laplace equation with Dirichlet BCs (N = 3072^3), 7-point stencil: Δu = 0 in Ω, u = f on ∂Ω
  • Added 4 Bamboo pragmas to a 419-line MPI code
  [Chart: TFLOP/s vs. core count (12288, 24576, 49152, 98304) for MPI-basic, MPI-olap, Bamboo-basic, and MPI-nocomm]
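For context, each sweep of the 7-point stencil replaces an interior point with the average of its six neighbors. A minimal serial sketch, assuming a cubic local grid with a one-cell ghost/boundary layer (the IDX macro and array names are illustrative, not from the course code):

    /* One Jacobi sweep of the 7-point stencil for Δu = 0 on an
       (n+2)^3 array with a one-cell ghost/boundary layer. */
    #include <stddef.h>
    #define IDX(i, j, k, n) ((size_t)(i)*((n)+2)*((n)+2) + (size_t)(j)*((n)+2) + (size_t)(k))

    void jacobi_sweep(const double *u, double *unew, int n)
    {
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                for (int k = 1; k <= n; k++)
                    unew[IDX(i, j, k, n)] =
                        (u[IDX(i-1, j, k, n)] + u[IDX(i+1, j, k, n)] +
                         u[IDX(i, j-1, k, n)] + u[IDX(i, j+1, k, n)] +
                         u[IDX(i, j, k-1, n)] + u[IDX(i, j, k+1, n)]) / 6.0;
    }

In the MPI version each process owns a subgrid and must exchange ghost faces with its neighbors before each sweep; that exchange is the communication being overlapped in the chart above.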

  7. 2D Cannon - Weak scaling study (Edison)
  [Chart: TFLOP/s vs. core count (4096, 16384, 65536) for MPI-basic, MPI-olap, Bamboo, and MPI-nocomm; matrix size scaled from N0/4^(2/3) and N0/4^(1/3) up to N = N0 = 196608]
  • Communication cost: 11%-39%
  • Bamboo improves MPI-basic by 9%-37%
  • Bamboo outperforms MPI-olap at scale
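As a reminder of the communication pattern being overlapped: Cannon's 2D algorithm places the p processes on a √p x √p torus and, after an initial skew, alternates a local block multiply with circular shifts of A along rows and B along columns. A rough sketch of the main loop using a Cartesian communicator; local_multiply() and the initial skew are placeholders, not the benchmark code:

    /* Sketch of Cannon's main loop on a q x q process torus. */
    #include <mpi.h>
    #include <math.h>

    void local_multiply(double *C, const double *A, const double *B, int nb); /* local dgemm stub */

    void cannon_loop(double *A, double *B, double *C, int nb, MPI_Comm grid)
    {
        int p, q, left, right, up, down;
        MPI_Comm_size(grid, &p);
        q = (int)lround(sqrt((double)p));
        MPI_Cart_shift(grid, 1, 1, &left, &right);  /* neighbors along the row    */
        MPI_Cart_shift(grid, 0, 1, &up, &down);     /* neighbors along the column */

        for (int step = 0; step < q; step++) {
            local_multiply(C, A, B, nb);            /* C += A * B on nb x nb blocks */
            /* shift A one step left along the row, B one step up along the column */
            MPI_Sendrecv_replace(A, nb*nb, MPI_DOUBLE, left, 0, right, 0,
                                 grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(B, nb*nb, MPI_DOUBLE, up, 0, down, 0,
                                 grid, MPI_STATUS_IGNORE);
        }
    }

The blocking shifts serialize communication and computation, which roughly corresponds to the MPI-basic variant; MPI-olap and Bamboo overlap the shifts with the block multiply.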

  8. Communication-Avoiding Matrix Multiplication (Hopper)
  • Pathological matrices (Ng^3 x Ne) arise in planewave-basis methods for ab initio molecular dynamics; for Si: Ng = 140, Ne = 2000
  • Weak scaling study; used OpenMP; 23 pragmas added to a 337-line code
  [Chart: TFLOP/s vs. core count (4096, 8192, 16384, 32768) for MPI+OMP, MPI+OMP-olap, Bamboo+OMP, and MPI+OMP-nocomm; matrix size scaled from N = N0 = 20608 through 2^(1/3) N0 and 2^(2/3) N0 to 2 N0; best point 210.4 TF]

  9. Virtualization Improves Performance
  [Charts: performance vs. virtualization factor for Cannon 2.5D (MPI), with configurations c=2, VF=8; c=2, VF=4; c=2, VF=2; c=4, VF=2, and for Jacobi (MPI+OMP)]

  10. Virtualization Improves Performance
  [Charts: performance vs. virtualization factor for Cannon 2D (MPI) and Jacobi (MPI+OMP)]

  11. High Performance Linpack (HPL) on Stampede
  • Solve systems of linear equations using LU factorization
  • Latency-tolerant lookahead: the basic code is complicated
  [Chart: TFLOP/s vs. matrix size (139264 to 180224) for unprioritized scheduling, prioritized scheduling, and olap; 2048 cores on Stampede]
  • Results
    - Bamboo meets the performance of the highly optimized version of HPL while using the far simpler non-lookahead version
    - Task prioritization is crucial
    - Bamboo improves the baseline version of HPL by up to 10%

  12. Bamboo on multiple GPUs
  • MPI+CUDA programming model
    - CPU is the host; the GPU works as a device
    - Host-host communication with MPI, host-device with CUDA
    - MPI and CUDA portions are optimized separately
  • Need a GPU-aware programming model
    - Allow a device to transfer data to another device
    - Compiler and runtime system handle the data transfer
    - Hide both host-device and host-host communication automatically
  [Figure: MPI+CUDA model, with cudaMemcpy between each GPU and its host and send/recv between MPI0 and MPI1, vs. a GPU-aware model in which the runtime system handles communication between GPU tasks x and y]
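In the plain MPI+CUDA model, a GPU-to-GPU transfer is staged through the hosts by hand, roughly as in the sketch below. The function and buffer names are placeholders and error checking is omitted; only cudaMemcpy, MPI_Send, and MPI_Recv are standard calls.

    /* Sketch of a host-staged GPU-to-GPU transfer in MPI+CUDA:
       device -> host on the sender, MPI between hosts,
       host -> device on the receiver. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void staged_send(const double *d_buf, double *h_buf, int n, int dest)
    {
        cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    void staged_recv(double *d_buf, double *h_buf, int n, int src)
    {
        MPI_Recv(h_buf, n, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
    }

A GPU-aware runtime hides both legs of this staging, so the application sees a single logical transfer between GPU tasks.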

  13. 3D Jacobi – Weak Scaling Study (Stampede)
  • Results
    - Bamboo-GPU outperforms MPI-basic
    - Bamboo-GPU and MPI-olap hide most communication overheads
  • Bamboo-GPU improves performance by
    - Hiding host-host transfers
    - Hiding host-device transfers
    - Letting tasks residing in the same GPU send the address of the message
  [Chart: GFLOP/s vs. GPU count (4, 8, 16, 32) for MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, and MPI-nocomm]

  14. Multigrid – Weak Scaling (Edison)
  • A geometric multigrid solver for Helmholtz's equation [Williams et al. '12]
    - V-cycle: restrict, smooth, solve, interpolate, smooth
    - Smoother: Red-Black Gauss-Seidel
    - DRAM-avoiding with the wave-front method
  [Chart: time (secs) vs. core count (2048 to 32768) for MPI, Bamboo, and MPI-nocomm]
  • Results
    - Communication cost: 16%-22%
    - Bamboo improves performance by up to 14%
    - Communication overlap is effective on levels L0 and L1

  Time breakdown (secs) and Comm/total time at each level:

    Cores   Comm   Compute  pack/unpack  inter-box copy | L0   L1   L2   L3   L4
    2048    0.448  1.725    0.384        0.191          | 12%  21%  36%  48%  48%
    4096    0.476  1.722    0.353        0.191          | 12%  24%  37%  56%  50%
    8192    0.570  1.722    0.384        0.191          | 13%  27%  45%  69%  63%
    16384   0.535  1.726    0.386        0.192          | 12%  30%  48%  53%  49%
    32768   0.646  1.714    0.376        0.189          | 17%  28%  44%  63%  58%
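The Red-Black Gauss-Seidel smoother sweeps the "red" points (i+j+k even) and then the "black" points (i+j+k odd), so each half-sweep has no dependencies among the points it updates and parallelizes cleanly. Below is a minimal sketch of one half-sweep for a 7-point Poisson-like operator; the actual solver targets Helmholtz's equation and uses a DRAM-avoiding wave-front formulation, so the update formula and names here are simplified placeholders:

    /* One half-sweep of Red-Black Gauss-Seidel for a 7-point
       Poisson-like operator on an (n+2)^3 array with ghost layer;
       color = 0 updates red points, color = 1 updates black points. */
    #include <stddef.h>
    #define IDX(i, j, k, n) ((size_t)(i)*((n)+2)*((n)+2) + (size_t)(j)*((n)+2) + (size_t)(k))

    void rb_gs_half_sweep(double *u, const double *rhs, int n, double h2, int color)
    {
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                for (int k = 1; k <= n; k++)
                    if (((i + j + k) & 1) == color)
                        u[IDX(i, j, k, n)] =
                            (u[IDX(i-1, j, k, n)] + u[IDX(i+1, j, k, n)] +
                             u[IDX(i, j-1, k, n)] + u[IDX(i, j+1, k, n)] +
                             u[IDX(i, j, k-1, n)] + u[IDX(i, j, k+1, n)] -
                             h2 * rhs[IDX(i, j, k, n)]) / 6.0;
    }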

  15. A GPU-aware programming model
  • MPI+CUDA programming model
    - CPU is the host; the GPU works as a device
    - Host-host communication with MPI, host-device with CUDA
    - MPI and CUDA portions are optimized separately
  • A GPU-aware programming model
    - Allow a device to transfer data to another device
    - Compiler and runtime system handle the data transfer
    - We implemented a GPU-aware runtime system
    - Hide both host-device and host-host communication automatically
  [Figure: MPI+CUDA model, with cudaMemcpy between each GPU and its host and send/recv between MPI0 and MPI1, vs. the GPU-aware model in which the runtime system handles communication between GPU tasks x and y]

  16. 3D Jacobi – Weak Scaling (Stampede)
  • Results
    - Bamboo-GPU outperforms MPI-basic
    - Bamboo-GPU and MPI-olap hide most communication overheads
  • Bamboo-GPU improves performance by
    - Hiding host-host transfers
    - Hiding host-device transfers
    - Letting tasks residing in the same GPU send the address of the message
  [Chart: GFLOP/s vs. GPU count (4, 8, 16, 32) for MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, and MPI-nocomm]

  17. Bamboo Design
  • Core message passing
    - Supports point-to-point routines
    - Requires programmer annotation
    - Employs the Tarragon runtime system [Cicotti 06, 11]
  • Subcommunicator layer
    - Supports MPI_Comm_split
    - No annotation required
  • Collectives
    - A framework to translate collectives; Bamboo implements the common collectives
    - No annotation required
  • User-defined subprograms
    - A normal MPI program
  [Figure: layered Bamboo implementation, top to bottom: user-defined subprograms, collectives, subcommunicator, core message passing]
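As an illustration of what the subcommunicator layer must handle, MPI_Comm_split partitions an existing communicator by a color/key pair. A small hypothetical usage, splitting MPI_COMM_WORLD into the row communicators of a q x q process grid:

    /* Hypothetical MPI_Comm_split usage: build one communicator per
       grid row; color selects the row, key orders ranks within it. */
    #include <mpi.h>

    MPI_Comm make_row_comm(int q)
    {
        int rank;
        MPI_Comm row_comm;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_split(MPI_COMM_WORLD, rank / q, rank % q, &row_comm);
        return row_comm;
    }

Per the design above, Bamboo supports such splits in its subcommunicator layer without requiring additional annotation.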

  18. Bamboo Translator
  [Figure: translation pipeline. Annotated MPI input -> EDG front-end -> ROSE AST -> Bamboo middle-end (annotation handler, MPI extractor, analyzer, transformer, optimizer, performing MPI reordering, inlining, outlining, and translating) -> ROSE back-end -> Tarragon code]

  19. Bamboo Transformations
  • Outlining
    - TaskGraph definition: fill the various Tarragon methods with the input source-code blocks
  • MPI translation: capture MPI calls and generate calls to Tarragon
    - Some MPI calls are removed, e.g. Barrier(), Wait()
    - Conservative static analysis determines task dependencies
  • Code reordering: reorder certain code to accommodate Tarragon semantics
