CSE 262 Lecture 12: Communication overlap
Announcements
- A problem set has been posted, due next Thursday
Technological trends of scalable HPC systems
- Growth: cores/socket rather than sockets
- Hybrid processors
- Complicated software-managed parallel memory hierarchy
- Memory/core is shrinking
- Communication costs increasing relative to computation
[Figure: peak performance of Top500 systems, 2008-2013 (PFLOP/s): growth slowed from 2x/year to 2x every 3-4 years]
Reducing communication costs in scalable applications
- Tolerate or avoid them [Demmel et al.]
- Difficult to reformulate MPI apps to overlap communication with computation
  - Enables but does not support communication hiding
  - Split-phase coding
  - Scheduling
- Implementation policies become entangled with correctness
  - Non-robust performance
  - High software development costs
[Figure: MPI processes i and j each post Irecv and Send, compute, Wait, and then perform the remaining computation]
Motivating application
- Solve Laplace's equation in 3 dimensions with Dirichlet boundary conditions: Δϕ = ρ(x,y,z), ϕ = 0 on ∂Ω
- Building block: iterative solver using Jacobi's method (7-point stencil)
[Figure: domain Ω with boundary ∂Ω; ρ ≠ 0 in the interior]
for (i,j,k) in 1:N x 1:N x 1:N
  u'[i][j][k] = (u[i-1][j][k] + u[i+1][j][k] + u[i][j-1][k] + u[i][j+1][k] + u[i][j][k+1] + u[i][j][k-1]) / 6.0
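A hedged C++ rendering of one such sweep; the flat ghost-cell layout and the function names are illustrative choices, not from the slides:

#include <cstddef>

// One Jacobi sweep over an N^3 interior surrounded by one layer of ghost cells,
// stored in a flat array of (N+2)^3 doubles (layout is an illustrative choice).
inline std::size_t idx(int i, int j, int k, int N) {
    return (static_cast<std::size_t>(i) * (N + 2) + j) * (N + 2) + k;
}

void jacobi_sweep(int N, const double* u, double* unew) {
    for (int i = 1; i <= N; ++i)
        for (int j = 1; j <= N; ++j)
            for (int k = 1; k <= N; ++k)
                unew[idx(i, j, k, N)] =
                    (u[idx(i-1, j, k, N)] + u[idx(i+1, j, k, N)] +
                     u[idx(i, j-1, k, N)] + u[idx(i, j+1, k, N)] +
                     u[idx(i, j, k-1, N)] + u[idx(i, j, k+1, N)]) / 6.0;
}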
Classic message passing implementation
- Decompose domain into sub-regions, one per process
  - Transmit halo regions between processes
  - Compute the inner region after communication completes
- Loop carried dependences impose a strict ordering on communication and computation
Latency tolerant variant
- Only a subset of the domain exhibits loop carried dependences with respect to the halo region
- Subdivide the domain to remove some of the dependences
- We may now sweep the inner region in parallel with communication
- Sweep the annulus after communication finishes
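A hedged MPI sketch of this schedule; compute_interior, compute_annulus, and the halo buffers are assumed to be defined elsewhere:

#include <mpi.h>
#include <vector>

// Assumed to exist elsewhere: sweeps over the inner region and the annulus.
void compute_interior();
void compute_annulus();

// Overlap the halo exchange with the inner sweep, then finish the annulus.
void timestep(const std::vector<int>& neighbor,
              std::vector<std::vector<double>>& send_halo,
              std::vector<std::vector<double>>& recv_halo) {
    std::vector<MPI_Request> reqs;
    for (std::size_t n = 0; n < neighbor.size(); ++n) {
        if (neighbor[n] == MPI_PROC_NULL) continue;          // no neighbor on this face
        reqs.push_back(MPI_Request());
        MPI_Irecv(recv_halo[n].data(), (int)recv_halo[n].size(), MPI_DOUBLE,
                  neighbor[n], 0, MPI_COMM_WORLD, &reqs.back());
        reqs.push_back(MPI_Request());
        MPI_Isend(send_halo[n].data(), (int)send_halo[n].size(), MPI_DOUBLE,
                  neighbor[n], 0, MPI_COMM_WORLD, &reqs.back());
    }
    compute_interior();                                      // no loop-carried deps on the halo
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
    compute_annulus();                                       // cells that read the halo
}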
MPI Encoding
MPI_Init(); MPI_Comm_rank(); MPI_Comm_size();
Data initialization
MPI_Send / MPI_Isend
MPI_Recv / MPI_Irecv
Computations
MPI_Finalize();
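For concreteness, a minimal runnable sketch along the lines of this skeleton (a single split-phase exchange between adjacent ranks; buffer sizes and the neighbor pattern are illustrative assumptions, not from the slides):

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Data initialization (placeholder halo buffers)
    const int N = 1024;
    std::vector<double> halo_in(N), halo_out(N);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // Post the receive, then the send, then wait (split-phase coding)
    MPI_Request reqs[2];
    MPI_Irecv(halo_in.data(),  N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_out.data(), N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    // Computations would go here

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    MPI_Finalize();
    return 0;
}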
A few implementation details
- Some installations of MPI cannot realize overlap with MPI_Irecv and MPI_Isend
- We can use multithreading to handle the overlap
- We let one or more processors (proxy thread(s)) handle communication
- S. Fink, PhD thesis, UCSD, 1998
- Baden and Fink, "Communication overlap in multi-tier parallel algorithms," SC98
A performance model of overlap
- Assumptions
  p = number of processors per node
  running time = 1.0
  f < 1 = communication time (i.e., not overlapped)
[Figure: baseline timeline T = 1.0, split into computation (1 - f) and communication (f)]
Performance
- When we displace computation to make way for the proxy, computation time increases
- Wait on communication drops to zero, ideally
- When f < p/(2p-1): improvement is 1 / [(1-f) x p/(p-1)]
- Communication bound: improvement is 1/(1-f)
[Figure: with a proxy, the computation dilates from 1 - f to T = (1-f) x p/(p-1), while the communication time f is overlapped]
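To make the two cases explicit (a sketch of the same model): with one of the p cores acting as the proxy, the computation that took 1 - f now runs on p - 1 cores and dilates to (1 - f) x p/(p-1), while the communication time f proceeds alongside it, so T = max(f, (1-f) x p/(p-1)). The two terms are equal exactly at f = p/(2p-1), which is the threshold separating the compute-bound and communication-bound cases above.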
Processor Virtualization
- Virtualize the processors by overdecomposing
- AMPI [Kalé et al.]
- When an MPI call blocks, the thread yields to another virtual process
- How do we inform the scheduler about ready tasks?
Observations
- The exact execution order depends on the data dependence structure: communication & computation
- We don't have to hard code a particular overlap strategy
- We can alter the behavior by changing the data dependences (e.g. disable overlap), or by varying the on-node decomposition geometry
- For other algorithms we can add priorities to force a preferred ordering
- Applies to many scales of granularity (i.e. memory locality, network, etc.)
An alternative way to hide communication
- Reformulate MPI code into a data-driven form
  - Decouple scheduling and communication handling from the application
  - Automatically overlap communication with computation
[Figure: SPMD MPI code (Irecv, Send, Wait, Compute in each process) is mapped to a task dependency graph executed by a runtime system with worker threads, communication handlers, and dynamic scheduling]
Tarragon - Non-SPMD, Graph Driven Execution
- Pietro Cicotti [Ph.D., 2011]
- Automatically tolerate communication delays via a Task Precedence Graph
  - Vertices = computation
  - Edges = dependences
- Inspired by Dataflow and Actors
  - Parallelism ~ independent tasks
  - Task completion ⇋ data motion
[Figure: task precedence graph with tasks T0-T13]
- Asynchronous task graph model of execution
  - Tasks run according to availability of the data
  - Graph execution semantics are independent of the schedule
Task Graph
- Represents the program as a task precedence graph encoding data dependences
- Background run-time services support dataflow execution of the graph
- Virtualized tasks: many to each processor
- The graph maintains meta-data to inform the scheduler about runnable tasks
for (i,j,k) in 1:N x 1:N x 1:N: u[i][j][k] = …
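A minimal C++ sketch of the kind of per-task meta-data this implies; the names and the dependence-counting scheme are illustrative assumptions, not Tarragon's actual classes:

#include <atomic>
#include <vector>

// Illustrative task-graph vertex: dependence counting decides runnability.
struct TaskNode {
    int in_degree = 0;                  // number of incoming edges (data dependences)
    std::atomic<int> arrived{0};        // how many inputs have been satisfied so far
    std::vector<TaskNode*> successors;  // outgoing edges

    // Runnable once every incoming dependence has delivered its data.
    bool runnable() const { return arrived.load() >= in_degree; }

    // Called by the runtime when one of this task's inputs arrives.
    void satisfy_one() { arrived.fetch_add(1); }
};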
Graph execution semantics
- Parallelism exists among independent tasks
- Independent tasks may execute concurrently
- A task is runnable when its data dependences have been met
- A task suspends if its data dependences are not met
- Computation and data motion are coupled activities
- Background services manage graph execution
- The scheduler determines which task(s) to run next
- Scheduler and application are only vaguely aware of one another
- Scheduler doesn’t affect graph execution semantics
Code reformulation
- Reformulating code by hand is difficult
- Observation: for every MPI program there is a corresponding dataflow graph, determined by the matching patterns of sends and receives invoked by the running program
- Can we come up with a deterministic procedure for translating MPI code to Tarragon, using dynamic and static information about MPI call sites?
- Yes! Bamboo: a custom, domain-specific translator
Tan Nguyen (PhD, 2014, UCSD)
Bamboo
- Uses a deterministic procedure for translating MPI code to Tarragon, using dynamic and static information about MPI call sites
- A custom, domain-specific translator
  - MPI library primitives → primitive language objects
  - Collects static information about MPI call sites
  - Relies on some programmer annotation
  - Targets the Tarragon library, which supports scheduling and data motion [Pietro Cicotti '06, '11]
[Figure: MPI source → Bamboo translator → Tarragon program]
Example: MPI with annotations
for (iter = 0; iter < maxIters && Error > ε; iter++)
{
  #pragma bamboo olap
  {
    #pragma bamboo receive
    {
      for each dim in 4 dimensions
        if (hasNeighbor[dim]) MPI_Irecv(…, neighbor[dim], …);
    }
    #pragma bamboo send
    {
      for each dim in 4 dimensions
        if (hasNeighbor[dim]) MPI_Send(…, neighbor[dim], …);
      MPI_Waitall(…);
    }
  }
  update(Uold, Un); swap(Uold, Un); lError = Err(Uold);
}
MPI_Allreduce(lError, Error);  // translated automatically
Send and Receive blocks are independent. (Neighbors: up, down, left, right.)
Task definition and communication
- Bamboo instantiates MPI processes as tasks
  - User-defined threads + user-level scheduling (not OS threads)
  - Tasks communicate via messages, which are not imperative
- Mapping processes → tasks
  - Send → put(); the RTS handles delivery
  - Recvs → firing rule: a task is ready to run when its input conditions are met; firing rule processing is handled by the RTS
  - No explicit receives; when a task is runnable, its input conditions have been met by definition
[Figure: tasks i, j, and k exchange messages (i,j,1), (j,k,0), (j,k,1) through the RTS via outgoing and incoming buffers; each message carries its source and destination]
Translated code
class task {
  void Init();
  void Execute();
  void Inject();
};

Init:
  pack ghostcells
  put(ghostcells, left/right/up/down edges)
  _state = WAIT;

Exec:
  if (it <= number_of_iterations) {
    unpack and update ghostcells
    for (j = 1; j < localN-1; j++)
      for (i = 1; i < localN-1; i++)
        V(j,i) = c*(U(j-1,i) + U(j+1,i) + U(j,i-1) + U(j,i+1));
    swap(U,V);
    pack ghostcells
    put(ghostcells, left/right/up/down edges);
    _state = WAIT;
    it++;
  } else _state = DONE;

Inject:
  if (received from left & right & up & down)
    _state = EXEC;
Control flow
!"#$%&' !"#$%&' !#"()*$%&'
+"#,-.#/)'0-11-23"' +"#,-.#/)'0-11-23"' 45+'613*)77'8' 45+'613*)77'"9:' ;1)-$)'21-6<' ;1)-$)'21-6<' =#"-.#/)'0-11-23"' =#"-.#/)'0-11-23"' +"#,-.#/)'45+' +"#,-.#/)'45+' >' >' +"#,-.#/)'?1-6<' =#"-.#/)'45+' =#"-.#/)'45+'
+@+0' AB+0' CDC;' EF@C' !#"#$%&' !#"()*$%&' AB+0' CDC;' EF@C' !)G)*H$)%&' !)G)*H$)%&' !)G)*H$)%&'
I54E' ?1-6<' I54E' ?1-6<' I54E'
0-7J' CK2)' !"#$%&'&%$(&)(*(+*,*-&.(/,&-,*0(
>' CG)*H$)'?1-6<' >' >'
class user-defined task {
  void vinit() { ... }
  void vexecute() { ... }
  void vinject(Message* msg) { ... }
};
int main(int argc, char** argv) {
  MPI_Init();
  Tarragon::initialize();
  Graph graph = new Graph(task);   // (processor layout)
  graph->accept(dependency);
  Tarragon::graph_init(graph);
  Tarragon::graph_execute(graph);
  Tarragon::finalize();
  MPI_Finalize();
}
Virtualization
- Recall Little’s law:
Concurrency = Latency × Effective throughput
- Latency hiding requires more parallelism (and memory)
- Virtualization: multiple tasks per core
- AMPI [Huang et al.,’03] and FGMPI [Kamal, ’13]
[Figure: virtualizing MPI: several virtual processes are multiplexed onto each of the physical cores P0-P3]
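As a quick worked example of the law in this setting (the numbers are illustrative, not from the slides): if the network latency is 2 µs and a core can absorb one incoming halo message per microsecond, then latency x effective throughput = 2 µs x 1 message/µs = 2 messages must be in flight per core at all times. That in turn calls for at least two virtual tasks per core, each with its own outstanding receive, which is what overdecomposition provides.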
Task Scheduling
- Data-driven execution model
  - A task is executable when its input is available
  - Tasks can't hold on to control while waiting
- Non-preemptive prioritization model
  - The scheduler can't stop a task
  - Among runnable tasks, those with higher priorities are executed first
  - A task may volunteer to yield control
[Figure: scheduler loop: tasks wait in an idle pool until runnable, then enter a priority queue from which the scheduler dispatches them]
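A minimal C++ sketch of the non-preemptive, priority-driven loop described above; the Task interface and the scheduling policy shown are illustrative assumptions, not Tarragon's API:

#include <vector>

// Illustrative task interface: a firing rule plus a non-preemptible body.
struct Task {
    int priority = 0;
    bool done = false;
    virtual bool runnable() const = 0;   // firing rule: have all inputs arrived?
    virtual void execute() = 0;          // runs until the task yields or finishes
    virtual ~Task() = default;
};

// Pick the highest-priority runnable task and run it to its next yield point.
void schedule(std::vector<Task*>& tasks) {
    for (;;) {
        Task* next = nullptr;
        bool pending = false;
        for (Task* t : tasks) {
            if (t->done) continue;
            pending = true;
            if (t->runnable() && (next == nullptr || t->priority > next->priority))
                next = t;
        }
        if (!pending) break;            // every task has completed
        if (next == nullptr) continue;  // idle: a real RTS would wait for messages here
        next->execute();                // non-preemptive: the scheduler cannot stop it
    }
}

Note that execute() runs only to the task's next voluntary yield point; the scheduler never interrupts it, which is what makes the model non-preemptive.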
Bamboo Programming Model
- Olap-regions: task switching point
  - Data availability is checked at entry
  - Only one olap region may be active at a time
  - When a task is ready, some olap region's input conditions have been satisfied
- Send blocks
  - Hold send calls only
  - Enable the olap-region
- Receive blocks
  - Hold receive and/or send calls
  - Receive calls are input to an olap-region
  - Send calls are output to an olap-region
- Activities in send blocks must be independent of those in receive blocks
- MPI_Wait/MPI_Waitall can reside anywhere within the olap-region

#pragma bamboo olap
{
  #pragma bamboo send
  { … }
  #pragma bamboo receive
  { … }
}
… computation …
#pragma bamboo olap
{ … }

[Figure: execution alternates between olap regions OLAP1 … OLAPN and computation phases φ1 … φN]
Results
- Stampede at TACC
  - 102,400 cores; dual-socket Sandy Bridge processors
  - K20 GPUs
- Cray XE-6 at NERSC (Hopper)
  - 153,216 cores; dual-socket 12-core Magny-Cours
  - 4 NUMA nodes per Hopper node, each with 6 cores
  - 3D toroidal network
- Cray XC30 at NERSC (Edison)
  - 133,824 cores; dual-socket 12-core Ivy Bridge
  - Dragonfly network
Stencil application performance (Hopper)
- Solve the 3D Laplace equation with Dirichlet BCs (N = 3072³): 7-point stencil, Δu = 0, u = f on ∂Ω
- Added 4 Bamboo pragmas to a 419-line MPI code
[Figure: TFLOP/s vs. cores (12,288 to 98,304) on Hopper for MPI-basic, MPI-olap, Bamboo-basic, and MPI-nocomm]
2D Cannon - Weak scaling study
- Communication cost: 11% - 39%
- Bamboo improves on MPI-basic by 9%-37%
- Bamboo outperforms MPI-olap at scale
[Figure: weak scaling on Edison, 4,096 to 65,536 cores: TFLOP/s for MPI-basic, MPI-olap, Bamboo, and MPI-nocomm; matrix dimension scaled from N0/4^(2/3) and N0/4^(1/3) up to N = N0 = 196,608]
Communication Avoiding Matrix Multiplication (Hopper)
- Pathological matrices arise in planewave basis methods for ab-initio molecular dynamics (Ng³ x Ne); for Si: Ng = 140, Ne = 2000
- Weak scaling study, used OpenMP, 23 pragmas, 337 lines
[Figure: weak scaling on 4,096 to 32,768 cores: TFLOP/s vs. cores and matrix size for MPI+OMP, MPI+OMP-olap, Bamboo+OMP, and MPI+OMP-nocomm; N scaled from N0 = 20,608 by factors of 2^(1/3), 2^(2/3), and 2; 210.4 TF at the largest configuration]
Virtualization Improves Performance
[Figure: normalized performance vs. virtualization factor (1, 2, 4, 8) on 12,288 to 98,304 cores; panels for Jacobi (MPI+OMP) and Cannon 2.5D (MPI); configurations c=2 with VF=8, 4, 2 and c=4 with VF=2]
[Figure: corresponding results for Jacobi (MPI+OMP) and Cannon 2D (MPI)]
High Performance Linpack (HPL) on Stampede
[Figure: TFLOP/s vs. matrix size (139,264 to 180,224) for the Basic, Unprioritized Scheduling, Prioritized Scheduling, and Olap variants]
- Solve systems of linear equations using LU factorization
- The latency-tolerant lookahead code is complicated
2048 cores on Stampede
- Results
  - Bamboo meets the performance of the highly optimized version of HPL
  - Uses the far simpler non-lookahead version
  - Task prioritization is crucial
  - Bamboo improves the baseline version of HPL by up to 10%
Bamboo on multiple GPUs
- MPI+CUDA programming model
  - CPU is the host and the GPU works as a device
  - Host-host communication with MPI and host-device with CUDA
  - Optimize the MPI and CUDA portions separately
- Need a GPU-aware programming model
  - Allow a device to transfer data to another device
  - Compiler and runtime system handle the data transfer
  - Hide both host-device and host-host communication automatically
[Figure: left, MPI+CUDA: GPU0 and GPU1 exchange data via cudaMemcpy to their hosts and MPI send/recv between hosts; right, GPU-aware model: GPU tasks x and y communicate while the RTS handles delivery]
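For reference, a hedged sketch of the MPI+CUDA pattern on the left of the figure, staging a halo through host memory; the buffer names and the use of MPI_Sendrecv are illustrative choices:

#include <mpi.h>
#include <cuda_runtime.h>

// Exchange a halo of 'count' doubles with one neighbor: device -> host,
// host <-> host over MPI, then host -> device.
void exchange_halo(double* d_send, double* d_recv, double* h_send, double* h_recv,
                   int count, int neighbor) {
    cudaMemcpy(h_send, d_send, count * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Sendrecv(h_send, count, MPI_DOUBLE, neighbor, 0,
                 h_recv, count, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpy(d_recv, h_recv, count * sizeof(double), cudaMemcpyHostToDevice);
}

In the GPU-aware model on the right, the application would not write this code; the runtime system performs both the host-device staging and the host-host transfer when a task's output is put().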
3D Jacobi – Weak Scaling Study
- Results on Stampede
  - Bamboo-GPU outperforms MPI-basic
  - Bamboo-GPU and MPI-olap hide most communication overheads
- Bamboo-GPU improves performance by
  - Hiding host-host transfers
  - Hiding host-device transfers
  - Letting tasks residing in the same GPU send the address of the message
[Figure: GFLOP/s vs. GPU count (4, 8, 16, 32) on Stampede for MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, and MPI-nocomm]
Bamboo Design
- Core message passing
  - Supports point-to-point routines
  - Requires programmer annotation
  - Employs the Tarragon runtime system [Cicotti 06, 11]
- Subcommunicator layer
  - Supports MPI_Comm_split
  - No annotation required
- Collectives
  - A framework to translate collectives
  - Implements common collectives
  - No annotation required
- User-defined subprograms
  - A normal MPI program
[Figure: Bamboo implementation of collective routines, layered over collectives, the subcommunicator, core message passing, and user-defined subprograms]
Bamboo Translator
Annotated MPI input
[Figure: translator pipeline: annotated MPI input → ROSE/EDG front-end → Bamboo middle-end (annotation handler, analyzer, MPI extractor, transformer with inlining, outlining, MPI translation, and code reordering) → optimizer → ROSE back-end → Tarragon]
Bamboo Transformations
- Outlining
  - TaskGraph definition: fill the various Tarragon methods with input source code blocks
- MPI translation: capture MPI calls and generate calls to Tarragon
  - Some MPI calls are removed, e.g. Barrier(), Wait()
  - Conservative static analysis to determine task dependencies
- Code reordering: reorder certain code to accommodate Tarragon semantics
Code outlining
[Figure: outlined control flow: the runtime system (RTS) listens for incoming messages; when all input is ready the task becomes fireable (state = EXEC), computes, and outputs/injects data, repeating until iter = nIters]
for (int iter=0; iter<nIters; iter++){ #pragma bamboo olap { #pragma bamboo send { … } #pragma bamboo receive
{ … }
}
compute
}
Firing & yielding Rule Generation
- Extract the source information in all Recv and Irecv calls of each olap-region, including associated if and for statements
- A task is fireable if and only if it has received messages from all sources
for (source = 0 to n)
  if (source % 2 == 0)
    MPI_Recv from source

bool test() {
  for (source = 0 to n)
    if (source % 2 == 0)
      if (notArrivedFrom(source)) return false;
  return true;
}

Firing rule: Recv(0) ∧ Recv(2) ∧ … ∧ Recv(n)
  while (messageArrival) { return test(); }
Yielding rule: yield = !test();
Inter-procedural translation
void multiGridSolver() {
  #pragma bamboo olap
  for (int level = 0; level < nLevels; level++) {
    send_to_neighbors();
    receive_from_neighbors();
    update the data grid
  }
}

for (cycle = 0 to nVcycles) {
  multiGridSolver();
}

void send_to_neighbors() {
  forall neighbors
    if (neighbor) MPI_Isend(neighbor)
}
void receive_from_neighbors() {
  forall neighbors
    if (neighbor) MPI_Irecv(neighbor)
}
Bamboo inlines only those functions that directly or indirectly contain MPI calls
[Figure: call graph rooted at main; invoke edges to such functions are replaced by inlining]
Collective implementation
Main file
error = MPI_Barrier(comm); Bamboo_Barrier(comm);
int Bamboo_Barrier(MPI_Comm comm) {
  #pragma bamboo olap
  for (int step = 1; step < size; step <<= 1) {
    MPI_Send(1 byte to (rank + step) % size, comm);
    MPI_Recv(1 byte from (rank - step + size) % size, comm);
  }
  return SUCCESS;
}
A collective library
Rename the collective call, merge the library's AST into main, then inline the collective call
comm_0 = comm;
#pragma bamboo olap
for (int step = 1; step < size; step <<= 1) {
  MPI_Send(1 byte to (rank + step) % size, comm_0);
  MPI_Recv(1 byte from (rank - step + size) % size, comm_0);
}
error = SUCCESS;
Next translation process