CSE 262 Lecture 13: Communication overlap (continued); Heterogeneous processing
Announcements
- Final presentations
  - Friday, March 13th, 10:00 AM to 1:00 PM (note the time change)
  - Room 3217, CSE Building (EBU3B)
An alternative way to hide communication
- Reformulate MPI code into a data-driven form
  - Decouple scheduling and communication handling from the application
  - Automatically overlap communication with computation
[Diagram: an SPMD MPI program (Irecv, Send, Wait, Compute) is transformed into a task dependency graph executed by a runtime system with worker threads, communication handlers, and dynamic scheduling]
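To make concrete what the runtime automates, here is a minimal sketch of the manual overlap pattern (the "MPI-olap" style referred to in the results later): post MPI_Irecv early, issue MPI_Isend, compute on data that does not depend on the incoming messages, and wait only when the ghost data is needed. The buffer names, neighbor ranks, and compute stubs are hypothetical placeholders, not code from the lecture.

    /* Manual communication/computation overlap ("MPI-olap" style).
       A minimal sketch; buffers, neighbor ranks, and compute stubs are
       hypothetical placeholders. */
    #include <mpi.h>

    static void compute_interior(void)            { /* updates that need no ghost data   */ }
    static void compute_boundary(const double *g) { (void)g; /* updates that need ghosts */ }

    void exchange_and_compute(double *send_buf, double *recv_buf, int count,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        MPI_Irecv(recv_buf, count, MPI_DOUBLE, left,  0, comm, &reqs[0]);  /* post receive early */
        MPI_Isend(send_buf, count, MPI_DOUBLE, right, 0, comm, &reqs[1]);  /* start the send     */

        compute_interior();                          /* overlap: independent work          */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* block only when ghosts are needed  */
        compute_boundary(recv_buf);
    }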
Bamboo Programming Model
- Olap-regions: task switching points
  - Data availability is checked at entry
  - Only one olap-region may be active at a time
  - When a task is ready, some olap-region's input conditions have been satisfied
- Send blocks
  - Hold send calls only
  - Enable the olap-region
- Receive blocks
  - Hold receive and/or send calls
  - Receive calls are inputs to an olap-region
  - Send calls are outputs of an olap-region
- Activities in send blocks must be independent of those in receive blocks
- MPI_Wait/MPI_Waitall can reside anywhere within the olap-region

    #pragma bamboo olap
    {
        #pragma bamboo send
        { … }
        #pragma bamboo receive
        { … }
    }
    … computation …
    #pragma bamboo olap
    { … }

[Diagram: execution alternates between olap-regions OLAP1 … OLAPN and computation phases φ1 … φN]
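For contrast with the manual version above, the sketch below shows how the same exchange might be annotated under the Bamboo model just described: the MPI calls stay, and the olap/send/receive pragmas mark the blocks for the translator. This is an illustrative sketch assembled from the slide's pragma layout; the buffer and neighbor names are hypothetical.

    /* Bamboo-annotated halo exchange: an illustrative sketch following the
       pragma layout above; names are hypothetical. */
    #include <mpi.h>

    void annotated_iterations(double *send_buf, double *recv_buf, int count,
                              int left, int right, int nIters, MPI_Comm comm)
    {
        MPI_Request sreq, rreq;
        for (int iter = 0; iter < nIters; iter++) {
            #pragma bamboo olap
            {
                #pragma bamboo send
                {
                    MPI_Isend(send_buf, count, MPI_DOUBLE, right, 0, comm, &sreq);
                }
                #pragma bamboo receive
                {
                    MPI_Irecv(recv_buf, count, MPI_DOUBLE, left, 0, comm, &rreq);
                    MPI_Wait(&rreq, MPI_STATUS_IGNORE);  /* waits may sit anywhere in the olap-region */
                    MPI_Wait(&sreq, MPI_STATUS_IGNORE);
                }
            }
            /* computation phase: runs once the olap-region's inputs have arrived */
        }
    }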
Results
- Stampede at TACC
  - 102,400 cores; dual-socket Sandy Bridge processors
  - K20 GPUs
- Cray XE-6 at NERSC (Hopper)
  - 153,216 cores; dual-socket 12-core Magny-Cours
  - 4 NUMA nodes per Hopper node, each with 6 cores
  - 3D toroidal network
- Cray XC30 at NERSC (Edison)
  - 133,824 cores; dual-socket 12-core Ivy Bridge
  - Dragonfly network
Stencil application performance (Hopper)
[Plot: TFLOP/s vs. core count (12,288 to 98,304 cores) for MPI-basic, MPI-olap, Bamboo-basic, and MPI-nocomm]
- Solve the 3D Laplace equation, Δu = 0 with Dirichlet BCs u = f on ∂Ω, using a 7-point stencil (N = 3072³)
- Added 4 Bamboo pragmas to a 419-line MPI code
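A minimal sketch of the local 7-point Jacobi update such a solver performs on each interior block (one ghost layer, simplified indexing; the layout and loop order are assumptions, not the benchmark's actual code):

    /* One Jacobi sweep of the 7-point Laplace stencil on an
       (nx+2) x (ny+2) x (nz+2) local block with one ghost layer.
       A minimal sketch; indexing and boundary handling are simplified. */
    #include <stddef.h>

    void sweep(const double *u, double *unew, int nx, int ny, int nz)
    {
        #define IDX(i, j, k) ((size_t)(i) * (ny + 2) * (nz + 2) + (size_t)(j) * (nz + 2) + (k))
        for (int i = 1; i <= nx; i++)
            for (int j = 1; j <= ny; j++)
                for (int k = 1; k <= nz; k++)
                    unew[IDX(i, j, k)] =
                        (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                         u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                         u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]) / 6.0;
        #undef IDX
    }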
2D Cannon - Weak scaling study
- Communication cost: 11%-39%
- Bamboo improves on MPI-basic by 9%-37%
- Bamboo outperforms MPI-olap at scale (a sketch of Cannon's algorithm follows the plot below)
[Plot (Edison): TFLOP/s vs. core count (4,096 to 65,536) for MPI-basic, MPI-olap, Bamboo, and MPI-nocomm; matrix sizes N = N0 = 196,608 and scaled variants N0/4^(1/3), N0/4^(2/3)]
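For reference, a compact sketch of Cannon's algorithm measured here, assuming a periodic q x q process grid (MPI Cartesian communicator) with one n x n block of each matrix per rank; the grid setup and local multiply are simplified assumptions:

    /* 2D Cannon matrix multiply: a minimal sketch, not the benchmark code.
       Assumes a periodic q x q MPI Cartesian communicator and square blocks. */
    #include <mpi.h>

    static void local_mm(const double *A, const double *B, double *C, int n)
    {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    void cannon(double *A, double *B, double *C, int n, int q, MPI_Comm grid)
    {
        int rank, coords[2], src, dst, left, right, up, down;
        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);

        /* Initial skew: shift A left by my row index, B up by my column index. */
        MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
        MPI_Sendrecv_replace(A, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
        MPI_Sendrecv_replace(B, n * n, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

        MPI_Cart_shift(grid, 1, 1, &left, &right);   /* neighbors along my row    */
        MPI_Cart_shift(grid, 0, 1, &up,   &down);    /* neighbors along my column */

        for (int step = 0; step < q; step++) {
            local_mm(A, B, C, n);
            /* Shift A one position left, B one position up. */
            MPI_Sendrecv_replace(A, n * n, MPI_DOUBLE, left, 0, right, 0, grid, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(B, n * n, MPI_DOUBLE, up,   0, down,  0, grid, MPI_STATUS_IGNORE);
        }
    }

The blocking shifts in this sketch are the communication that the overlapped variants (MPI-olap, Bamboo) hide behind the local multiplies.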
Communication Avoiding Matrix Multiplication (Hopper)
- Pathological matrices arising in planewave basis methods for ab-initio molecular dynamics: Ng³ x Ne; for Si, Ng = 140 and Ne = 2000
- Weak scaling study; used OpenMP; 23 pragmas, 337 lines
[Plot: TFLOP/s vs. core count (4,096 to 32,768) for MPI+OMP, MPI+OMP-olap, Bamboo+OMP, and MPI+OMP-nocomm; matrix size grows from N = N0 = 20,608 through 2^(1/3)·N0 and 2^(2/3)·N0 to 2·N0; peak 210.4 TF]
Virtualization Improves Performance
[Plot: performance vs. virtualization factor for Jacobi (MPI+OMP) and Cannon 2.5D (MPI) at several core counts; configurations c=2 VF=8, c=2 VF=4, c=2 VF=2, c=4 VF=2]
Virtualization Improves Performance
[Plot: performance vs. virtualization factor for Jacobi (MPI+OMP) and Cannon 2D (MPI)]
High Performance Linpack (HPL) on Stampede
[Plot: TFLOP/s (about 21.5 to 28) vs. matrix size (139,264 to 180,224) for Basic, Unprioritized Scheduling, Prioritized Scheduling, and Olap]
- Solve systems of linear equations using LU factorization
- The latency-tolerant lookahead code is complicated
- 2048 cores on Stampede
!"#"$%&'()*+,(-.(/( !"#"$%&'()*+,(-.(0( 1( /!" 0!" 2!(3(2!(4(0!5/!"
- Results
  - Bamboo meets the performance of the highly optimized version of HPL
  - Uses the far simpler non-lookahead version
  - Task prioritization is crucial
  - Bamboo improves the baseline version of HPL by up to 10%
Bamboo on multiple GPUs
- MPI+CUDA programming model
  - The CPU is the host and the GPU works as a device
  - Host-host communication with MPI, host-device communication with CUDA
  - Optimize the MPI and CUDA portions separately
- Need a GPU-aware programming model
  - Allow a device to transfer data to another device
  - The compiler and runtime system handle the data transfer (we implemented a GPU-aware runtime system)
  - Hide both host-device and host-host communication automatically (a sketch of the manual pattern follows the diagram below)
[Diagram: under MPI+CUDA, each GPU stages data through its host with cudaMemcpy and the hosts exchange it with MPI send/recv; under the GPU-aware model, the runtime system (RTS) handles communication between GPU tasks directly]
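To show what the MPI+CUDA side of the diagram requires the programmer to write by hand (and what a GPU-aware runtime hides), a minimal sketch of staging a device buffer through the host; the buffer names are hypothetical, and a CUDA-aware MPI would let device pointers be passed directly instead.

    /* Manually staged GPU-to-GPU exchange under MPI+CUDA:
       device -> host, host -> host via MPI, host -> device.
       A minimal sketch with hypothetical buffer names. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void exchange_halo(const double *d_send, double *d_recv,
                       double *h_send, double *h_recv,
                       int count, int peer, MPI_Comm comm)
    {
        cudaMemcpy(h_send, d_send, count * sizeof(double), cudaMemcpyDeviceToHost);

        MPI_Sendrecv(h_send, count, MPI_DOUBLE, peer, 0,
                     h_recv, count, MPI_DOUBLE, peer, 0,
                     comm, MPI_STATUS_IGNORE);

        cudaMemcpy(d_recv, h_recv, count * sizeof(double), cudaMemcpyHostToDevice);
    }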
3D Jacobi – Weak Scaling Study
- Results on Stampede
  - Bamboo-GPU outperforms MPI-basic
  - Bamboo-GPU and MPI-olap hide most communication overheads
- Bamboo-GPU improves performance by
  - Hiding host-host transfers
  - Hiding host-device transfers
  - Letting tasks residing on the same GPU send only the address of the message
[Plot (Stampede): GFLOP/s vs. GPU count (4 to 32) for MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, and MPI-nocomm]
Multigrid – Weak Scaling
- A geometric multigrid solver for Helmholtz's equation [Williams et al. 12]
  - V-cycle: restrict, smooth, solve, interpolate, smooth
  - Smoother: red-black Gauss-Seidel (see the sketch after the table below)
  - DRAM-avoiding via the wavefront method
[Plot (Edison): time (secs) vs. core count (2,048 to 32,768) for MPI, Bamboo, and MPI-nocomm]
Results:
- Communication cost: 16%-22%
- Bamboo improves performance by up to 14%
- Communication overlap is effective on levels L0 and L1
Cores   Comm   Compute  pack/unpack  inter-box copy   Comm/total time at each level (L0  L1  L2  L3  L4)
2048    0.448  1.725    0.384        0.191            12%  21%  36%  48%  48%
4096    0.476  1.722    0.353        0.191            12%  24%  37%  56%  50%
8192    0.570  1.722    0.384        0.191            13%  27%  45%  69%  63%
16384   0.535  1.726    0.386        0.192            12%  30%  48%  53%  49%
32768   0.646  1.714    0.376        0.189            17%  28%  44%  63%  58%
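As a reference for the smoother listed above, a minimal red-black Gauss-Seidel sweep for a 7-point operator; the Laplacian is used as a stand-in for the Helmholtz operator, and the layout and boundary handling are simplifying assumptions:

    /* One red-black Gauss-Seidel sweep of a 7-point operator (Laplacian used
       as a stand-in for the Helmholtz operator). A minimal sketch. */
    #include <stddef.h>

    void smooth(double *u, const double *rhs, double h2, int nx, int ny, int nz)
    {
        #define ID(i, j, k) ((size_t)(i) * (ny + 2) * (nz + 2) + (size_t)(j) * (nz + 2) + (k))
        for (int color = 0; color < 2; color++)              /* 0 = red, 1 = black */
            for (int i = 1; i <= nx; i++)
                for (int j = 1; j <= ny; j++)
                    for (int k = 1; k <= nz; k++)
                        if (((i + j + k) & 1) == color)
                            u[ID(i, j, k)] =
                                (u[ID(i - 1, j, k)] + u[ID(i + 1, j, k)] +
                                 u[ID(i, j - 1, k)] + u[ID(i, j + 1, k)] +
                                 u[ID(i, j, k - 1)] + u[ID(i, j, k + 1)] -
                                 h2 * rhs[ID(i, j, k)]) / 6.0;
        #undef ID
    }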
Bamboo Design
- Core message passing
  - Supports point-to-point routines
  - Requires programmer annotation
  - Employs the Tarragon runtime system [Cicotti 06, 11]
- Subcommunicator layer
  - Supports MPI_Comm_split (a usage example follows the diagram below)
  - No annotation required
- Collectives
  - A framework to translate collectives
  - Implements common collectives
  - No annotation required
- User-defined subprograms
  - A normal MPI program
[Diagram: Bamboo implementation of collective routines — collectives layered over the subcommunicator layer, which builds on core message passing; user-defined subprograms sit alongside]
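Since the subcommunicator layer tracks MPI_Comm_split without annotation (noted in the list above), here is a minimal example of the kind of call it must handle; interpreting ranks as a process grid of assumed width q is illustrative only:

    /* Splitting MPI_COMM_WORLD into row communicators of an assumed
       process grid of width q. A minimal, self-contained example. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int q = 4;                      /* assumed grid width */
        MPI_Comm row_comm;
        MPI_Comm_split(MPI_COMM_WORLD, rank / q, rank % q, &row_comm);

        /* Collectives on row_comm now involve only the ranks in one row. */

        MPI_Comm_free(&row_comm);
        MPI_Finalize();
        return 0;
    }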
Bamboo Translator
[Diagram: Bamboo translator pipeline — annotated MPI input enters the EDG front-end and ROSE AST; the Bamboo middle-end (annotation handler, MPI extractor, outlining, inlining, MPI translation and reordering, analyzer, optimizer, transformer) rewrites the AST; the ROSE back-end emits Tarragon code]
Bamboo Transformations
- Outlining
  - TaskGraph definition: fill various Tarragon methods with the input source code blocks
- MPI translation: capture MPI calls and generate calls to Tarragon
  - Some MPI calls are removed, e.g. Barrier(), Wait()
  - Conservative static analysis determines task dependencies
- Code reordering: reorder certain code to accommodate Tarragon semantics
Code outlining
[Flowchart: the runtime system (RTS) listens for incoming messages; when all of a task's input is ready the task becomes fireable, enters the EXEC state, computes, and outputs data; otherwise it keeps waiting for input]
    for (int iter = 0; iter < nIters; iter++) {
        #pragma bamboo olap
        {
            #pragma bamboo send
            { … }
            #pragma bamboo receive
            { … }
        }
        compute
    }
Firing & Yielding Rule Generation
- Extract source information from all Recv and Irecv calls of each olap-region, including the associated if and for statements
- A task is fireable if and only if it has received messages from all sources
Input:

    for (source = 0 to n)
        if (source % 2 == 0)
            MPI_Recv from source

Generated test:

    bool test() {
        for (source = 0 to n)
            if (source % 2 == 0)
                if (notArrivedFrom(source)) return false;
        return true;
    }

Firing condition: Recv(0) ∧ Recv(2) ∧ … ∧ Recv(n)
Firing rule:   while (messageArrival) { return test(); }
Yielding rule: yield = !test();
Inter-procedural translation
    void multiGridSolver() {
        #pragma bamboo olap
        for (int level = 0; level < nLevels; level++) {
            send_to_neighbors();
            receive_from_neighbors();
            // update the data grid
        }
    }

    for (cycle = 0 to nVcycles)
        multiGridSolver();

    void send_to_neighbors() {
        forall neighbors
            if (neighbor) MPI_Isend(neighbor);
    }

    void receive_from_neighbors() {
        forall neighbors
            if (neighbor) MPI_Irecv(neighbor);
    }
- Bamboo inlines only the functions that directly or indirectly contain MPI calls
[Diagram: call tree rooted at main; the functions containing MPI calls are inlined at their invocation sites]
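Under the inlining rule just stated, send_to_neighbors and receive_from_neighbors would be folded into multiGridSolver before translation. The sketch below shows the assumed shape of the result, in the same pseudocode style as the fragment above; it is an illustration, not the translator's actual output.

    void multiGridSolver() {
        #pragma bamboo olap
        for (int level = 0; level < nLevels; level++) {
            forall neighbors                        /* inlined from send_to_neighbors()    */
                if (neighbor) MPI_Isend(neighbor);
            forall neighbors                        /* inlined from receive_from_neighbors() */
                if (neighbor) MPI_Irecv(neighbor);
            // update the data grid
        }
    }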
Collective implementation
Main file: the collective call is renamed

    error = MPI_Barrier(comm);   →   Bamboo_Barrier(comm);

Collective library:

    int Bamboo_Barrier(MPI_Comm comm) {
        #pragma bamboo olap
        for (int step = 1; step < size; step <<= 1) {
            MPI_Send(1 byte to (rank + step) % size, comm);
            MPI_Recv(1 byte from (rank - step + size) % size, comm);
        }
        return SUCCESS;
    }

Steps: rename the collective call, merge the library's AST into main, then inline the collective call:

    comm_0 = comm;
    #pragma bamboo olap
    for (int step = 1; step < size; step <<= 1) {
        MPI_Send(1 byte to (rank + step) % size, comm_0);
        MPI_Recv(1 byte from (rank - step + size) % size, comm_0);
    }
    error = SUCCESS;

The inlined code then goes through the normal translation process.
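As a concrete, non-Bamboo reference, the dissemination pattern used by Bamboo_Barrier above can be written directly in MPI as follows; posting MPI_Irecv before MPI_Send is an implementation choice of this sketch, not the library's actual code:

    /* Dissemination barrier over ceil(log2(P)) rounds, matching the pattern
       in Bamboo_Barrier above. A minimal sketch. */
    #include <mpi.h>

    int dissemination_barrier(MPI_Comm comm)
    {
        int rank, size;
        char sbyte = 0, rbyte = 0;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int step = 1; step < size; step <<= 1) {
            MPI_Request req;
            MPI_Irecv(&rbyte, 1, MPI_CHAR, (rank - step + size) % size, 0, comm, &req);
            MPI_Send(&sbyte, 1, MPI_CHAR, (rank + step) % size, 0, comm);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        return MPI_SUCCESS;
    }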