Lecture 13 – CSE 260 Parallel Computation (Fall 2015) – Scott B. Baden


SLIDE 1

Lecture 13
CSE 260 – Parallel Computation (Fall 2015)
Scott B. Baden

Message Passing
Stencil methods with message passing

SLIDE 2

Announcements

  • Weds office hours changed for the remainder of quarter: 3:30 to 5:30

SLIDE 3

Today’s lecture

  • Aliev Panfilov Method (A3)
  • Message passing
  • Stencil methods in MPI


SLIDE 4

Warp aware summation

    __global__ void reduce(int *input, unsigned int N, int *total) {
        __shared__ int x[512];                 // shared workspace, one slot per thread
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + tid;
        x[tid] = input[i];                     // assumes N is a multiple of blockDim.x
        for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
            __syncthreads();
            if (tid < s)
                x[tid] += x[tid + s];          // pairwise tree summation in shared memory
        }
        if (tid == 0)
            atomicAdd(total, x[0]);            // total must point to a zero-initialized device int
    }

    reduce <<< N/512, 512 >>> (input, N, total);

[Figure: pairwise tree summation of an array; partial sums 4, 7, 11, 14 combine into the final total 25]

For next time: complete the code so it handles global data with arbitrary N
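One possible completion, as a sketch (not the course's official solution): guard the load and use a grid-stride loop, so a fixed-size grid covers any N.

    __global__ void reduce(int *input, unsigned int N, int *total) {
        __shared__ int x[512];
        unsigned int tid = threadIdx.x;
        int sum = 0;
        // Grid-stride loop: each thread accumulates every (gridDim.x*blockDim.x)-th element,
        // so the kernel works for any N, including N not divisible by the block size.
        for (unsigned int i = blockIdx.x * blockDim.x + tid; i < N; i += gridDim.x * blockDim.x)
            sum += input[i];
        x[tid] = sum;
        for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
            __syncthreads();
            if (tid < s)
                x[tid] += x[tid + s];
        }
        if (tid == 0)
            atomicAdd(total, x[0]);
    }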

SLIDE 5

Recapping from last time

  • Stencil methods use nearest neighbor computations
  • The Aliev-Panfilov method solves a coupled set of differential equations on a mesh
  • We showed how to implement it on a GPU
  • We use shared memory (and registers) to store “ghost cells” to optimize performance

SLIDE 6

Computational loop of the cardiac simulator

  • ODE solver:
    - No data dependency, trivially parallelizable
    - Requires a lot of registers to hold temporary variables
  • PDE solver:
    - Jacobi update for the 5-point Laplacian operator
    - Sweeps over a uniformly spaced mesh
    - Updates voltage with weighted contributions from the 4 nearest neighbors, updating the solution as a function of the values in the previous time step

For a specified number of iterations, using supplied initial conditions, repeat:

    for (j=1; j < m+1; j++) {
      for (i=1; i < n+1; i++) {
        // PDE SOLVER
        E[j,i] = E_p[j,i] + α*(E_p[j,i+1] + E_p[j,i-1] - 4*E_p[j,i] + E_p[j+1,i] + E_p[j-1,i]);
        // ODE SOLVER
        E[j,i] += -dt*(kk*E[j,i]*(E[j,i]-a)*(E[j,i]-1) + E[j,i]*R[j,i]);
        R[j,i] += dt*(ε + M1*R[j,i]/(E[j,i]+M2))*(-R[j,i] - kk*E[j,i]*(E[j,i]-b-1));
      }
    }
    swap E_p and E

End repeat

SLIDE 7

Where is the time spent (Sorken)?

  • Loops are unfused

    I1 cache: 32768 B, 64 B lines, 8-way associative
    D1 cache: 32768 B, 64 B lines, 8-way associative
    LL cache: 20971520 B, 64 B lines, 20-way associative
    Command: ./apf -n 256 -i 2000

        Ir     I1mr   ILmr             Dr        D1mr   DLmr     Dw        D1mw    DLmw
    4.451B    2,639  2,043  1,381,173,237  50,592,465  7,051  3957M  16,794,937  26,115   PROGRAM TOTALS

               Dr        D1mr
    1,380,464,019  50,566,007   solve.cpp:solve(...)
    . . .
                                // Fills in the TOP ghost cells
           10,000       1,999   for (i = 0; i < n+2; i++)
          516,000      66,000       E_prev[i] = E_prev[i + (n+2)*2];
                                // Fills in the RIGHT ghost cells
           10,000           0   for (i = (n+1); i < (m+2)*(n+2); i += (m+2))
          516,000     504,003       E_prev[i] = E_prev[i-2];
                                // Solve for the excitation, a PDE
        1,064,000       8,000   for (j = innerBlkRowStartIndx; j <= innerBlkRowEndIndx; j += (m+3)) {
                0                   E_prevj = E_prev + j;  E_tmp = E + j;
          512,000           0       for (i = 0; i < n; i++) {
      721,408,002  16,630,001           E_tmp[i] = E_prevj[i] + alpha*(E_prevj[i+1] ...);
                                    }
                                // Solve the ODEs
            4,000       4,000   for (j = innerBlkRowStartIndx; j <= innerBlkRowEndIndx; j += (m+3)) {
                                    for (i = 0; i <= n; i++) {
      262,144,000  33,028,000           E_tmp[i] += -dt*(kk*E_tmp[i]*(E_tmp[i]-a) ... )*R_tmp[i]);
      393,216,000       4,000           R_tmp[i] += dt*(ε + M1*R_tmp[i]/(E_tmp[i]+M2))*(…);
                                    }

SLIDE 8

Fusing the loops

  • On Sorken:
    - Slows down the simulation by 20%
    - # data references drops by 35%
    - Total number of L1 read misses drops by 48%
  • What happened?
  • The code didn’t vectorize

For a specified number of iterations, using supplied initial conditions, repeat:

    for (j=1; j < m+1; j++) {
      for (i=1; i < n+1; i++) {
        // PDE SOLVER
        E[j,i] = E_p[j,i] + α*(E_p[j,i+1] + E_p[j,i-1] - 4*E_p[j,i] + E_p[j+1,i] + E_p[j-1,i]);
        // ODE SOLVER
        E[j,i] += -dt*(kk*E[j,i]*(E[j,i]-a)*(E[j,i]-1) + E[j,i]*R[j,i]);
        R[j,i] += dt*(ε + M1*R[j,i]/(E[j,i]+M2))*(-R[j,i] - kk*E[j,i]*(E[j,i]-b-1));
      }
    }
    swap E_p and E

End repeat

SLIDE 9

Vectorization output

  • GCC, compiling with -ftree-vectorizer-verbose=1, reports:

    Analyzing loop at solve.cpp:118
    solve.cpp:43: note: vectorized 0 loops in function.

For a specified number of iterations, using supplied initial conditions, repeat:

    for (j=1; j < m+1; j++) {
      for (i=1; i < n+1; i++) {
        // PDE SOLVER
        E[j,i] = E_p[j,i] + α*(E_p[j,i+1] + E_p[j,i-1] - 4*E_p[j,i] + E_p[j+1,i] + E_p[j-1,i]);
        // ODE SOLVER
        E[j,i] += -dt*(kk*E[j,i]*(E[j,i]-a)*(E[j,i]-1) + E[j,i]*R[j,i]);
        R[j,i] += dt*(ε + M1*R[j,i]/(E[j,i]+M2))*(-R[j,i] - kk*E[j,i]*(E[j,i]-b-1));
      }
    }
    swap E_p and E

End repeat

SLIDE 10

On Stampede

  • We use the Intel compiler suite:

    icpc --std=c++11 -O3 -qopt-report=1 -c solve.cpp
    icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location

    LOOP BEGIN at solve.cpp(142,9)
       remark #25460: No loop optimizations reported
    LOOP END

    for (j=1; j < m+1; j++) {
      for (i=1; i < n+1; i++) {   // Line 142
        // PDE SOLVER
        E[j,i] = E_p[j,i] + α*(E_p[j,i+1] + E_p[j,i-1] - 4*E_p[j,i] + E_p[j+1,i] + E_p[j-1,i]);
        // ODE SOLVER
        E[j,i] += -dt*(kk*E[j,i]*(E[j,i]-a)*(E[j,i]-1) + E[j,i]*R[j,i]);
        R[j,i] += dt*(ε + M1*R[j,i]/(E[j,i]+M2))*(-R[j,i] - kk*E[j,i]*(E[j,i]-b-1));
      }
    }

SLIDE 11

A vectorized loop

  • We use the Intel compiler suite:

    icpc --std=c++11 -O3 -qopt-report=1 -c solve.cpp

    6:  for (j=0; j < 10000; j++) x[j] = j-1;
    8:  for (j=0; j < 10000; j++) x[j] = x[j]*x[j];

    LOOP BEGIN at vec.cpp(6,3)
       remark #25045: Fused Loops: ( 6 8 )
       remark #15301: FUSED LOOP WAS VECTORIZED
    LOOP END

SLIDE 12

Thread assignment in a GPU implementation

  • We assign threads to interior cells only
  • 3 phases (see the sketch below):
    1. Fill the interior
    2. Fill the ghost cells – red circles correspond to active threads, orange to ghost cell data they copy into shared memory
    3. Compute – uses the same thread mapping as in step 1
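A minimal CUDA sketch of the 3-phase scheme (assumptions: square TILE×TILE thread blocks, interior dimensions n and m that are multiples of TILE, and a padded (m+2)×(n+2) mesh; the names here are illustrative, not the course code):

    #define TILE 16
    __global__ void stencil(float *E, const float *E_prev, float alpha, int n, int m) {
        __shared__ float tile[TILE+2][TILE+2];     // interior block plus a ghost ring
        int tx = threadIdx.x, ty = threadIdx.y;
        int i = blockIdx.x*TILE + tx + 1;          // interior coordinates 1..n
        int j = blockIdx.y*TILE + ty + 1;          // interior coordinates 1..m
        int W = n + 2;                             // padded row width
        // Phase 1: every thread loads its own interior cell
        tile[ty+1][tx+1] = E_prev[j*W + i];
        // Phase 2: threads on the tile edges also load the adjacent ghost cells
        if (tx == 0)       tile[ty+1][0]       = E_prev[j*W + i - 1];
        if (tx == TILE-1)  tile[ty+1][TILE+1]  = E_prev[j*W + i + 1];
        if (ty == 0)       tile[0][tx+1]       = E_prev[(j-1)*W + i];
        if (ty == TILE-1)  tile[TILE+1][tx+1]  = E_prev[(j+1)*W + i];
        __syncthreads();
        // Phase 3: compute, using the same thread mapping as phase 1
        E[j*W + i] = tile[ty+1][tx+1] + alpha*(tile[ty+1][tx+2] + tile[ty+1][tx]
                   - 4.0f*tile[ty+1][tx+1] + tile[ty+2][tx+1] + tile[ty][tx+1]);
    }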

SLIDE 13

Today’s lecture

  • Aliev Panfilov Method (A3)
  • Message passing

    - The Message Passing Programming Model
    - The Message Passing Interface - MPI
    - A first MPI Application – The Trapezoidal Rule

  • Stencil methods in MPI

SLIDE 14


Architectures without shared memory

  • Shared nothing architecture, or a multicomputer
  • Hierarchical parallelism

[Figure: hierarchical parallel machines and a fat-tree interconnect (image credits: uk.hardware.info, tinyurl.com/k6jqag5, Wikipedia)]

SLIDE 15

Programming with Message Passing

  • Programs execute as a set of P processes (the user specifies P)
  • Each process is assumed to run on a different core
    - Usually initialized with the same code, but has private state (SPMD = “Same Program Multiple Data”)
    - Access to local memory only
    - Communicates with other processes by passing messages
    - Executes instructions at its own rate, according to its rank (0:P-1) and the messages it sends and receives

[Figure: two nodes, each hosting processes P0–P3, connected by a network (diagram: Tan Nguyen)]

SLIDE 16

Bulk Synchronous Execution Model

  • A process is either communicating or computing
  • Generally, all processors are performing the same activity at the same time
  • Pathological cases arise when workloads aren’t well balanced

SLIDE 17

Message passing

  • There are two kinds of communication patterns
  • Point-to-point communication: a single pair of communicating processes copy data between address spaces
  • Collective communication: all the processors participate, possibly exchanging information

SLIDE 18

Point-to-Point communication

  • Messages are like email; to send or receive one, we specify
    - A destination and a message body (which can be empty)
  • Requires a sender and an explicit recipient that must be aware of one another
  • Message passing performs two events
    - A memory to memory block copy
    - A synchronization signal at the recipient: “Data has arrived”

[Figure: Process 0 executes Send(y,1) while Process 1 executes Recv(x); the contents of y are copied into x]

SLIDE 19

Send and Recv

  • Primitives that implement point-to-point communication
  • When Send( ) returns, the message is “in transit”
    - The return doesn’t tell us if the message has been received
    - The data is somewhere in the system
    - It is safe to overwrite the buffer
  • Receive( ) blocks until the message has been received
    - It is then safe to use the data in the buffer
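These semantics can be made concrete with a small sketch (MPI-style blocking calls; the buffer names are illustrative, not from the slides):

    // Rank 0: once MPI_Send returns, the message is in transit
    // and the send buffer may safely be reused.
    double y[4] = {1, 2, 3, 4};
    MPI_Send(y, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    y[0] = 42.0;   // safe: does not affect the message already sent

    // Rank 1: MPI_Recv blocks until the data has arrived,
    // after which the receive buffer is safe to read.
    double x[4];
    MPI_Status status;
    MPI_Recv(x, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);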

[Figure: Process 0 executes Send(y,1) while Process 1 executes Recv(x); the contents of y are copied into x]

SLIDE 20

Causality

  • If a process sends multiple messages to the same destination, then the messages will be received in the order sent
  • If different processes send messages to the same destination, the order of receipt isn’t defined across sources

[Figure: space-time diagrams illustrating message ordering among processes A, B, C and X, Y]
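A small sketch of the first guarantee (the ranks, tag, and variables are illustrative): two sends from rank 0 to rank 1 on the same communicator do not overtake each other, so matching receives see them in order.

    int a = 1, b = 2;
    if (rank == 0) {
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);  // gets the first message
        MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);  // then the second
    }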

SLIDE 21

Today’s lecture

  • Aliev Panfilov Method (A3)
  • Message passing

    - The Message Passing Programming Model
    - The Message Passing Interface - MPI
    - A first MPI Application – The Trapezoidal Rule

  • Stencil methods in MPI

SLIDE 22

MPI

  • We’ll use a library called MPI, the “Message Passing Interface”
    - 125 routines in MPI-1
    - 7 minimal routines needed by every MPI program:
      - start, end, and query MPI execution state (4)
      - non-blocking point-to-point message passing (3)
  • Reference material: cseweb.ucsd.edu/~baden/Doc/mpi.html
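(The slide doesn’t name the seven; presumably the four state routines are MPI_Init, MPI_Finalize, MPI_Comm_rank, and MPI_Comm_size, and the three non-blocking point-to-point routines are MPI_Isend, MPI_Irecv, and MPI_Wait.)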

SLIDE 23

Functionality we’ll cover today

  • Point-to-point communication
  • Message Filtering
  • Communicators and Tags
  • Application: the trapezoidal rule
  • Collective Communication


SLIDE 24

A first MPI program : “hello world”

#include "mpi.h” int main(int argc, char **argv ){ MPI_Init(&argc, &argv); int rank, size; MPI_Comm_size(MPI_COMM_WORLD,&size); MPI_Comm_rank(MPI_COMM_WORLD,&rank); printf("Hello, world! I am process %d of %d.\n”, rank, size); MPI_Finalize(); return(0); }

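Typically this is built with mpicc hello.c -o hello and launched with mpirun -np 4 ./hello, after which each of the 4 processes prints its rank (standard MPI tooling; the exact launch command varies by machine).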

SLIDE 25

    const int Tag = 99;
    int msg[2] = { rank, rank * rank };
    if (rank == 0) {
        MPI_Status status;
        MPI_Recv(msg, 2, MPI_INT, 1, Tag, MPI_COMM_WORLD, &status);
    }
    else
        MPI_Send(msg, 2, MPI_INT, 0, Tag, MPI_COMM_WORLD);

Send and Recv

The labeled arguments, in order: message buffer, message length, source process ID (for Recv) or destination process ID (for Send), message tag, and communicator.

SLIDE 26

Communicators

  • A communicator is a name-space (or a context) describing a set of processes that may communicate
  • MPI defines a default communicator, MPI_COMM_WORLD, containing all processes
  • MPI provides the means of generating uniquely named subsets (later on)
  • A mechanism for screening or filtering messages

SLIDE 27

MPI Tags

  • Tags enable processes to organize or screen messages
  • Each sent message is accompanied by a user-defined integer tag:
    - The receiving process can use this information to organize or filter messages
    - MPI_ANY_TAG inhibits tag filtering
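For instance (a sketch; buf, n, src, and the tag value 42 are illustrative):

    // Match only messages sent with tag 42:
    MPI_Recv(buf, n, MPI_INT, src, 42, MPI_COMM_WORLD, &status);
    // Match a message with any tag; the actual tag is reported in status.MPI_TAG:
    MPI_Recv(buf, n, MPI_INT, src, MPI_ANY_TAG, MPI_COMM_WORLD, &status);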

SLIDE 28

MPI Datatypes

  • MPI messages have a specified length
  • The unit depends on the type of the data
    - The length in bytes is sizeof(type) × # elements
    - We don’t specify the length as the # of bytes
  • MPI specifies a set of built-in types for each of the primitive types of the language
  • In C: MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR, MPI_LONG, MPI_UNSIGNED, MPI_BYTE, …
  • Also user-defined types, e.g. structs
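So to send 10 doubles we pass a count of 10, not 80 bytes (dest and tag here are illustrative):

    double buf[10];
    MPI_Send(buf, 10, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);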

SLIDE 29
Message status

  • An MPI_Status variable is a struct that contains the sending processor and the message tag
  • This information is useful when we aren’t filtering messages
  • We may also access the length of the received message (it may be shorter than the message buffer)

    MPI_Recv(message, count, TYPE, MPI_ANY_SOURCE, MPI_ANY_TAG, COMMUNICATOR, &status);
    MPI_Get_count(&status, TYPE, &recv_count);
    status.MPI_SOURCE
    status.MPI_TAG

SLIDE 30

Today’s lecture

  • Aliev Panfilov Method (A3)
  • Message passing

    - The Message Passing Programming Model
    - The Message Passing Interface - MPI
    - A first MPI Application – The Trapezoidal Rule

  • Stencil methods in MPI

SLIDE 31

The trapezoidal rule

  • Use the trapezoidal rule to numerically approximate a definite integral, the area under the curve f(x) from a to b
  • Divide the interval [a,b] into n segments of size h = (b-a)/n
  • Area under the i-th trapezoid: ½ (f(a+i×h) + f(a+(i+1)×h)) × h
  • Area under the entire curve ≈ the sum of all the trapezoids

[Figure: f(x) over [a,b], with one trapezoid of base width h spanning a+i*h to a+(i+1)*h]
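Summing the n trapezoids and combining the shared interior points gives the usual composite form (standard algebra; not shown on the slide):

    T = \sum_{i=0}^{n-1} \frac{h}{2}\,\bigl[f(a+ih) + f(a+(i+1)h)\bigr]
      = h\left[\frac{f(a)}{2} + \sum_{i=1}^{n-1} f(a+ih) + \frac{f(b)}{2}\right]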

SLIDE 32

Reference material

  • For a discussion of the trapezoidal rule: http://en.wikipedia.org/wiki/Trapezoidal_rule
  • An applet to carry out integration: http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Numerical/Integration
  • Code on Bang/Sorken/Stampede (from Pacheco’s MPI text)
    - Serial code: $PUB/Examples/MPI/Pacheco/ppmpi_c/chap04/serial.c
    - Parallel code: $PUB/Examples/MPI/Pacheco/ppmpi_c/chap04/trap.c

SLIDE 33

Serial code (Following Pacheco)

    float f(float x) { return x*x; }       // Function we're integrating

    main() {
        // a and b: endpoints; n = # of trapezoids (supplied elsewhere)
        float h = (b-a)/n;                 // h = trapezoid base width
        float integral = (f(a) + f(b))/2.0;
        float x;
        int i;
        for (i = 1, x = a; i <= n-1; i++) {
            x += h;
            integral = integral + f(x);
        }
        integral = integral*h;
    }

SLIDE 34

Parallel Implementation of the Trapezoidal Rule

  • Decompose the integration interval into sub-intervals, one per processor
  • Each processor computes the integral on its local subdomain
  • Processors combine their local integrals into a global one

SLIDE 35

First version of the parallel code

    int local_n = n/p;               // # trapezoids; assume p divides n evenly
    float local_a = a + my_rank*local_n*h,
          local_b = local_a + local_n*h,
          integral = Trap(local_a, local_b, local_n);

    if (my_rank == ROOT) {           // Sum the integrals calculated by all processes
        total = integral;
        for (source = 1; source < p; source++) {
            MPI_Recv(&integral, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag, WORLD, &status);
            total += integral;
        }
    } else
        MPI_Send(&integral, 1, MPI_FLOAT, ROOT, tag, WORLD);
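The receive loop serializes the combining step at ROOT. A collective does the same job in one call and lets the library choose an efficient reduction tree (a sketch, not necessarily what trap.c does):

    float total;
    MPI_Reduce(&integral, &total, 1, MPI_FLOAT, MPI_SUM, ROOT, MPI_COMM_WORLD);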