Lecture 13 – CSE 260 Parallel Computation (Fall 2015) – Scott B. Baden


SLIDE 1

Lecture 13
CSE 260 – Parallel Computation (Fall 2015)
Scott B. Baden

Message Passing
Stencil methods with message passing

SLIDE 2

Announcements

  • Weds office hours changed for the remainder of quarter: 3:30 to 5:30

SLIDE 3

Today’s lecture

  • Aliev Panfilov Method (A3)
  • Message passing
  • Stencil methods in MPI


SLIDE 4

Warp aware summation

    __global__ void reduce(int *input, unsigned int N, int *total) {
        __shared__ int x[512];                 // shared workspace, one slot per thread
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x * blockDim.x + tid;
        x[tid] = input[i];                     // assumes N is a multiple of blockDim.x
        for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
            __syncthreads();
            if (tid < s)
                x[tid] += x[tid + s];          // pairwise tree summation in shared memory
        }
        if (tid == 0)
            atomicAdd(total, x[0]);            // total must point to a zero-initialized device int
    }

    reduce <<< N/512, 512 >>> (input, N, total);

[Figure: pairwise tree summation of an array; partial sums 4, 7, 11, 14 combine into the final total 25]

For next time: complete the code so it handles global data with arbitrary N
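One possible completion, as a sketch (not the course's official solution): guard the load and use a grid-stride loop, so a fixed-size grid covers any N.

    __global__ void reduce(int *input, unsigned int N, int *total) {
        __shared__ int x[512];
        unsigned int tid = threadIdx.x;
        int sum = 0;
        // Grid-stride loop: each thread accumulates every (gridDim.x*blockDim.x)-th element,
        // so the kernel works for any N, including N not divisible by the block size.
        for (unsigned int i = blockIdx.x * blockDim.x + tid; i < N; i += gridDim.x * blockDim.x)
            sum += input[i];
        x[tid] = sum;
        for (unsigned int s = blockDim.x / 2; s > 0; s /= 2) {
            __syncthreads();
            if (tid < s)
                x[tid] += x[tid + s];
        }
        if (tid == 0)
            atomicAdd(total, x[0]);
    }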

SLIDE 5

Recapping from last time

  • Stencil methods use nearest neighbor computations
  • The Aliev-Panfilov method solves a coupled set of differential equations on a mesh
  • We showed how to implement it on a GPU
  • We use shared memory (and registers) to store “ghost cells” to optimize performance

SLIDE 6

Computational loop of the cardiac simulator

  • ODE solver:
    - No data dependency, trivially parallelizable
    - Requires a lot of registers to hold temporary variables
  • PDE solver:
    - Jacobi update for the 5-point Laplacian operator
    - Sweeps over a uniformly spaced mesh
    - Updates voltage with weighted contributions from the 4 nearest neighbors, updating the solution as a function of the values in the previous time step

For a specified number of iterations, using supplied initial conditions, repeat:

    for (j=1; j < m+1; j++) {
      for (i=1; i < n+1; i++) {
        // PDE SOLVER
        E[j,i] = E_p[j,i] + α*(E_p[j,i+1] + E_p[j,i-1] - 4*E_p[j,i] + E_p[j+1,i] + E_p[j-1,i]);
        // ODE SOLVER
        E[j,i] += -dt*(kk*E[j,i]*(E[j,i]-a)*(E[j,i]-1) + E[j,i]*R[j,i]);
        R[j,i] += dt*(ε + M1*R[j,i]/(E[j,i]+M2))*(-R[j,i] - kk*E[j,i]*(E[j,i]-b-1));
      }
    }
    swap E_p and E

End repeat

SLIDE 7

Where is the time spent (Sorken)?

  • Loops are unfused

    I1 cache: 32768 B, 64 B lines, 8-way associative
    D1 cache: 32768 B, 64 B lines, 8-way associative
    LL cache: 20971520 B, 64 B lines, 20-way associative
    Command: ./apf -n 256 -i 2000

        Ir     I1mr   ILmr             Dr        D1mr   DLmr     Dw        D1mw    DLmw
    4.451B    2,639  2,043  1,381,173,237  50,592,465  7,051  3957M  16,794,937  26,115   PROGRAM TOTALS

               Dr        D1mr
    1,380,464,019  50,566,007   solve.cpp:solve(...)
    . . .
                                // Fills in the TOP ghost cells
           10,000       1,999   for (i = 0; i < n+2; i++)
          516,000      66,000       E_prev[i] = E_prev[i + (n+2)*2];
                                // Fills in the RIGHT ghost cells
           10,000           0   for (i = (n+1); i < (m+2)*(n+2); i += (m+2))
          516,000     504,003       E_prev[i] = E_prev[i-2];
                                // Solve for the excitation, a PDE
        1,064,000       8,000   for (j = innerBlkRowStartIndx; j <= innerBlkRowEndIndx; j += (m+3)) {
                0                   E_prevj = E_prev + j;  E_tmp = E + j;
          512,000           0       for (i = 0; i < n; i++) {
      721,408,002  16,630,001           E_tmp[i] = E_prevj[i] + alpha*(E_prevj[i+1] ...);
                                    }
                                // Solve the ODEs
            4,000       4,000   for (j = innerBlkRowStartIndx; j <= innerBlkRowEndIndx; j += (m+3)) {
                                    for (i = 0; i <= n; i++) {
      262,144,000  33,028,000           E_tmp[i] += -dt*(kk*E_tmp[i]*(E_tmp[i]-a) ... )*R_tmp[i]);
      393,216,000       4,000           R_tmp[i] += dt*(ε + M1*R_tmp[i]/(E_tmp[i]+M2))*(…);
                                    }

SLIDE 8

Fusing the loops

  • On Sorken:
    - Slows down the simulation by 20%
    - # data references drops by 35%
    - Total number of L1 read misses drops by 48%
  • What happened?
  • The code didn’t vectorize

For a specified number of iterations, using supplied initial conditions, repeat:

    for (j=1; j < m+1; j++) {
      for (i=1; i < n+1; i++) {
        // PDE SOLVER
        E[j,i] = E_p[j,i] + α*(E_p[j,i+1] + E_p[j,i-1] - 4*E_p[j,i] + E_p[j+1,i] + E_p[j-1,i]);
        // ODE SOLVER
        E[j,i] += -dt*(kk*E[j,i]*(E[j,i]-a)*(E[j,i]-1) + E[j,i]*R[j,i]);
        R[j,i] += dt*(ε + M1*R[j,i]/(E[j,i]+M2))*(-R[j,i] - kk*E[j,i]*(E[j,i]-b-1));
      }
    }
    swap E_p and E

End repeat

SLIDE 9

Vectorization output

  • GCC, compiling with -ftree-vectorizer-verbose=1, reports:

    Analyzing loop at solve.cpp:118
    solve.cpp:43: note: vectorized 0 loops in function.

For a specified number of iterations, using supplied initial conditions, repeat:

    for (j=1; j < m+1; j++) {
      for (i=1; i < n+1; i++) {
        // PDE SOLVER
        E[j,i] = E_p[j,i] + α*(E_p[j,i+1] + E_p[j,i-1] - 4*E_p[j,i] + E_p[j+1,i] + E_p[j-1,i]);
        // ODE SOLVER
        E[j,i] += -dt*(kk*E[j,i]*(E[j,i]-a)*(E[j,i]-1) + E[j,i]*R[j,i]);
        R[j,i] += dt*(ε + M1*R[j,i]/(E[j,i]+M2))*(-R[j,i] - kk*E[j,i]*(E[j,i]-b-1));
      }
    }
    swap E_p and E

End repeat

SLIDE 10

On Stampede

  • We use the Intel compiler suite:

    icpc --std=c++11 -O3 -qopt-report=1 -c solve.cpp
    icpc: remark #10397: optimization reports are generated in *.optrpt files in the output location

    LOOP BEGIN at solve.cpp(142,9)
       remark #25460: No loop optimizations reported
    LOOP END

    for (j=1; j < m+1; j++) {
      for (i=1; i < n+1; i++) {   // Line 142
        // PDE SOLVER
        E[j,i] = E_p[j,i] + α*(E_p[j,i+1] + E_p[j,i-1] - 4*E_p[j,i] + E_p[j+1,i] + E_p[j-1,i]);
        // ODE SOLVER
        E[j,i] += -dt*(kk*E[j,i]*(E[j,i]-a)*(E[j,i]-1) + E[j,i]*R[j,i]);
        R[j,i] += dt*(ε + M1*R[j,i]/(E[j,i]+M2))*(-R[j,i] - kk*E[j,i]*(E[j,i]-b-1));
      }
    }

SLIDE 11

A vectorized loop

  • We use the Intel compiler suite:

    icpc --std=c++11 -O3 -qopt-report=1 -c solve.cpp

    6:  for (j=0; j < 10000; j++) x[j] = j-1;
    8:  for (j=0; j < 10000; j++) x[j] = x[j]*x[j];

    LOOP BEGIN at vec.cpp(6,3)
       remark #25045: Fused Loops: ( 6 8 )
       remark #15301: FUSED LOOP WAS VECTORIZED
    LOOP END

SLIDE 12

Thread assignment in a GPU implementation

  • We assign threads to interior cells only
  • 3 phases (see the sketch below):
    1. Fill the interior
    2. Fill the ghost cells – red circles correspond to active threads, orange to ghost cell data they copy into shared memory
    3. Compute – uses the same thread mapping as in step 1
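A minimal CUDA sketch of the 3-phase scheme (assumptions: square TILE×TILE thread blocks, interior dimensions n and m that are multiples of TILE, and a padded (m+2)×(n+2) mesh; the names here are illustrative, not the course code):

    #define TILE 16
    __global__ void stencil(float *E, const float *E_prev, float alpha, int n, int m) {
        __shared__ float tile[TILE+2][TILE+2];     // interior block plus a ghost ring
        int tx = threadIdx.x, ty = threadIdx.y;
        int i = blockIdx.x*TILE + tx + 1;          // interior coordinates 1..n
        int j = blockIdx.y*TILE + ty + 1;          // interior coordinates 1..m
        int W = n + 2;                             // padded row width
        // Phase 1: every thread loads its own interior cell
        tile[ty+1][tx+1] = E_prev[j*W + i];
        // Phase 2: threads on the tile edges also load the adjacent ghost cells
        if (tx == 0)       tile[ty+1][0]       = E_prev[j*W + i - 1];
        if (tx == TILE-1)  tile[ty+1][TILE+1]  = E_prev[j*W + i + 1];
        if (ty == 0)       tile[0][tx+1]       = E_prev[(j-1)*W + i];
        if (ty == TILE-1)  tile[TILE+1][tx+1]  = E_prev[(j+1)*W + i];
        __syncthreads();
        // Phase 3: compute, using the same thread mapping as phase 1
        E[j*W + i] = tile[ty+1][tx+1] + alpha*(tile[ty+1][tx+2] + tile[ty+1][tx]
                   - 4.0f*tile[ty+1][tx+1] + tile[ty+2][tx+1] + tile[ty][tx+1]);
    }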

SLIDE 13

Today’s lecture

  • Aliev Panfilov Method (A3)
  • Message passing

    - The Message Passing Programming Model
    - The Message Passing Interface - MPI
    - A first MPI Application – The Trapezoidal Rule

  • Stencil methods in MPI

SLIDE 14


Architectures without shared memory

  • Shared nothing architecture, or a multicomputer
  • Hierarchical parallelism

[Figure: hierarchical parallel machines and a fat-tree interconnect (image credits: uk.hardware.info, tinyurl.com/k6jqag5, Wikipedia)]

SLIDE 15

Programming with Message Passing

  • Programs execute as a set of P processes (the user specifies P)
  • Each process is assumed to run on a different core
    - Usually initialized with the same code, but has private state (SPMD = “Same Program Multiple Data”)
    - Access to local memory only
    - Communicates with other processes by passing messages
    - Executes instructions at its own rate, according to its rank (0:P-1) and the messages it sends and receives

[Figure: two nodes, each hosting processes P0–P3, connected by a network (diagram: Tan Nguyen)]

SLIDE 16

Bulk Synchronous Execution Model

  • A process is either communicating or computing
  • Generally, all processors are performing the same activity at the same time
  • Pathological cases arise when workloads aren’t well balanced

SLIDE 17

Message passing

  • There are two kinds of communication patterns
  • Point-to-point communication: a single pair of communicating processes copy data between address spaces
  • Collective communication: all the processors participate, possibly exchanging information

SLIDE 18

Point-to-Point communication

  • Messages are like email; to send or receive one, we specify
    - A destination and a message body (which can be empty)
  • Requires a sender and an explicit recipient that must be aware of one another
  • Message passing performs two events
    - A memory to memory block copy
    - A synchronization signal at the recipient: “Data has arrived”

[Figure: Process 0 executes Send(y,1) while Process 1 executes Recv(x); the contents of y are copied into x]

SLIDE 19

Send and Recv

  • Primitives that implement point-to-point communication
  • When Send( ) returns, the message is “in transit”
    - The return doesn’t tell us if the message has been received
    - The data is somewhere in the system
    - It is safe to overwrite the buffer
  • Receive( ) blocks until the message has been received
    - It is then safe to use the data in the buffer
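These semantics can be made concrete with a small sketch (MPI-style blocking calls; the buffer names are illustrative, not from the slides):

    // Rank 0: once MPI_Send returns, the message is in transit
    // and the send buffer may safely be reused.
    double y[4] = {1, 2, 3, 4};
    MPI_Send(y, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    y[0] = 42.0;   // safe: does not affect the message already sent

    // Rank 1: MPI_Recv blocks until the data has arrived,
    // after which the receive buffer is safe to read.
    double x[4];
    MPI_Status status;
    MPI_Recv(x, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);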

[Figure: Process 0 executes Send(y,1) while Process 1 executes Recv(x); the contents of y are copied into x]

SLIDE 20

Causality

  • If a process sends multiple messages to the same destination, then the messages will be received in the order sent
  • If different processes send messages to the same destination, the order of receipt isn’t defined across sources

[Figure: space-time diagrams illustrating message ordering among processes A, B, C and X, Y]
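A small sketch of the first guarantee (the ranks, tag, and variables are illustrative): two sends from rank 0 to rank 1 on the same communicator do not overtake each other, so matching receives see them in order.

    int a = 1, b = 2;
    if (rank == 0) {
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);  // gets the first message
        MPI_Recv(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);  // then the second
    }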

SLIDE 21

Today’s lecture

  • Aliev Panfilov Method (A3)
  • Message passing

    - The Message Passing Programming Model
    - The Message Passing Interface - MPI
    - A first MPI Application – The Trapezoidal Rule

  • Stencil methods in MPI

SLIDE 22

MPI

  • We’ll use a library called MPI, the “Message Passing Interface”
    - 125 routines in MPI-1
    - 7 minimal routines needed by every MPI program:
      - start, end, and query MPI execution state (4)
      - non-blocking point-to-point message passing (3)
  • Reference material: cseweb.ucsd.edu/~baden/Doc/mpi.html
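(The slide doesn’t name the seven; presumably the four state routines are MPI_Init, MPI_Finalize, MPI_Comm_rank, and MPI_Comm_size, and the three non-blocking point-to-point routines are MPI_Isend, MPI_Irecv, and MPI_Wait.)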

SLIDE 23

Functionality we’ll cover today

  • Point-to-point communication
  • Message Filtering
  • Communicators and Tags
  • Application: the trapezoidal rule
  • Collective Communication


SLIDE 24

A first MPI program : “hello world”

#include "mpi.h” int main(int argc, char **argv ){ MPI_Init(&argc, &argv); int rank, size; MPI_Comm_size(MPI_COMM_WORLD,&size); MPI_Comm_rank(MPI_COMM_WORLD,&rank); printf("Hello, world! I am process %d of %d.\n”, rank, size); MPI_Finalize(); return(0); }

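Typically this is built with mpicc hello.c -o hello and launched with mpirun -np 4 ./hello, after which each of the 4 processes prints its rank (standard MPI tooling; the exact launch command varies by machine).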

SLIDE 25

    const int Tag = 99;
    int msg[2] = { rank, rank * rank };
    if (rank == 0) {
        MPI_Status status;
        MPI_Recv(msg, 2, MPI_INT, 1, Tag, MPI_COMM_WORLD, &status);
    }
    else
        MPI_Send(msg, 2, MPI_INT, 0, Tag, MPI_COMM_WORLD);

Send and Recv

The labeled arguments, in order: message buffer, message length, source process ID (for Recv) or destination process ID (for Send), message tag, and communicator.

SLIDE 26

Communicators

  • A communicator is a name-space (or a context) describing a set of processes that may communicate
  • MPI defines a default communicator, MPI_COMM_WORLD, containing all processes
  • MPI provides the means of generating uniquely named subsets (later on)
  • A mechanism for screening or filtering messages

SLIDE 27

MPI Tags

  • Tags enable processes to organize or screen messages
  • Each sent message is accompanied by a user-defined integer tag:
    - The receiving process can use this information to organize or filter messages
    - MPI_ANY_TAG inhibits tag filtering
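For instance (a sketch; buf, n, src, and the tag value 42 are illustrative):

    // Match only messages sent with tag 42:
    MPI_Recv(buf, n, MPI_INT, src, 42, MPI_COMM_WORLD, &status);
    // Match a message with any tag; the actual tag is reported in status.MPI_TAG:
    MPI_Recv(buf, n, MPI_INT, src, MPI_ANY_TAG, MPI_COMM_WORLD, &status);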

SLIDE 28

MPI Datatypes

  • MPI messages have a specified length
  • The unit depends on the type of the data
    - The length in bytes is sizeof(type) × # elements
    - We don’t specify the length as the # of bytes
  • MPI specifies a set of built-in types for each of the primitive types of the language
  • In C: MPI_INT, MPI_FLOAT, MPI_DOUBLE, MPI_CHAR, MPI_LONG, MPI_UNSIGNED, MPI_BYTE, …
  • Also user-defined types, e.g. structs
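So to send 10 doubles we pass a count of 10, not 80 bytes (dest and tag here are illustrative):

    double buf[10];
    MPI_Send(buf, 10, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);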

SLIDE 29
Message status

  • An MPI_Status variable is a struct that contains the sending processor and the message tag
  • This information is useful when we aren’t filtering messages
  • We may also access the length of the received message (it may be shorter than the message buffer)

    MPI_Recv(message, count, TYPE, MPI_ANY_SOURCE, MPI_ANY_TAG, COMMUNICATOR, &status);
    MPI_Get_count(&status, TYPE, &recv_count);
    status.MPI_SOURCE
    status.MPI_TAG

SLIDE 30

Today’s lecture

  • Aliev Panfilov Method (A3)
  • Message passing

    - The Message Passing Programming Model
    - The Message Passing Interface - MPI
    - A first MPI Application – The Trapezoidal Rule

  • Stencil methods in MPI

SLIDE 31

The trapezoidal rule

  • Use the trapezoidal rule to numerically approximate a definite integral, the area under the curve f(x) from a to b
  • Divide the interval [a,b] into n segments of size h = (b-a)/n
  • Area under the i-th trapezoid: ½ (f(a+i×h) + f(a+(i+1)×h)) × h
  • Area under the entire curve ≈ the sum of all the trapezoids

[Figure: f(x) over [a,b], with one trapezoid of base width h spanning a+i*h to a+(i+1)*h]
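Summing the n trapezoids and combining the shared interior points gives the usual composite form (standard algebra; not shown on the slide):

    T = \sum_{i=0}^{n-1} \frac{h}{2}\,\bigl[f(a+ih) + f(a+(i+1)h)\bigr]
      = h\left[\frac{f(a)}{2} + \sum_{i=1}^{n-1} f(a+ih) + \frac{f(b)}{2}\right]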

SLIDE 32

Reference material

  • For a discussion of the trapezoidal rule: http://en.wikipedia.org/wiki/Trapezoidal_rule
  • An applet to carry out integration: http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Numerical/Integration
  • Code on Bang/Sorken/Stampede (from Pacheco’s MPI text)
    - Serial code: $PUB/Examples/MPI/Pacheco/ppmpi_c/chap04/serial.c
    - Parallel code: $PUB/Examples/MPI/Pacheco/ppmpi_c/chap04/trap.c

SLIDE 33

Serial code (Following Pacheco)

    float f(float x) { return x*x; }       // Function we're integrating

    main() {
        // a and b: endpoints; n = # of trapezoids (supplied elsewhere)
        float h = (b-a)/n;                 // h = trapezoid base width
        float integral = (f(a) + f(b))/2.0;
        float x;
        int i;
        for (i = 1, x = a; i <= n-1; i++) {
            x += h;
            integral = integral + f(x);
        }
        integral = integral*h;
    }

SLIDE 34

Parallel Implementation of the Trapezoidal Rule

  • Decompose the integration interval into sub-intervals, one per processor
  • Each processor computes the integral on its local subdomain
  • Processors combine their local integrals into a global one

SLIDE 35

First version of the parallel code

    int local_n = n/p;               // # trapezoids; assume p divides n evenly
    float local_a = a + my_rank*local_n*h,
          local_b = local_a + local_n*h,
          integral = Trap(local_a, local_b, local_n);

    if (my_rank == ROOT) {           // Sum the integrals calculated by all processes
        total = integral;
        for (source = 1; source < p; source++) {
            MPI_Recv(&integral, 1, MPI_FLOAT, MPI_ANY_SOURCE, tag, WORLD, &status);
            total += integral;
        }
    } else
        MPI_Send(&integral, 1, MPI_FLOAT, ROOT, tag, WORLD);
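The receive loop serializes the combining step at ROOT. A collective does the same job in one call and lets the library choose an efficient reduction tree (a sketch, not necessarily what trap.c does):

    float total;
    MPI_Reduce(&integral, &total, 1, MPI_FLOAT, MPI_SUM, ROOT, MPI_COMM_WORLD);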