
SLIDE 1

SLIDE 2

Matthias Christen, Cetus Users and Compiler Infrastructure Workshop

Image courtesy: Southern California Earthquake Center

SLIDE 3


stencil laplacian {
    operation (float grid u, float param alpha, float param beta) {
        u[x, y, z; t+1] = alpha * u[x, y, z; t] + beta * (
            u[x+1, y, z; t] + u[x-1, y, z; t] +
            u[x, y+1, z; t] + u[x, y-1, z; t] +
            u[x, y, z+1; t] + u[x, y, z-1; t]);
    }
}
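To make the update rule concrete, here is a minimal pure-Python sketch of one time step of the same 7-point stencil. It is not Patus output. The nested-list storage `u[x][y][z]`, the function name `laplacian_step`, and the choice to leave boundary points unchanged are all assumptions made for illustration; the slide does not specify boundary handling.

```python
def laplacian_step(u, alpha, beta):
    """Apply one time step of the 7-point stencil to a 3D grid u[x][y][z].

    Assumption: boundary points are copied unchanged; only interior points
    (those with all six neighbours) are updated.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    # Separate output grid, playing the role of time level t+1.
    out = [[row[:] for row in plane] for plane in u]
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                out[x][y][z] = alpha * u[x][y][z] + beta * (
                    u[x + 1][y][z] + u[x - 1][y][z] +
                    u[x][y + 1][z] + u[x][y - 1][z] +
                    u[x][y][z + 1] + u[x][y][z - 1])
    return out


# Usage: a constant grid of ones with alpha=0.5, beta=0.25
# gives 0.5 + 0.25 * 6 = 2.0 at every interior point.
grid = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
result = laplacian_step(grid, 0.5, 0.25)
print(result[1][1][1])
```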

PATUS

/*
(u[0][0][0][1][0]=((alpha*u[0][0][0][0][0])+(beta*((u[1][0][0][0][0]+(u[-1][0][0][0][0]+u[0][1][0][0][0]))+(u[0][-1][0][0][0]+(u[0][0][1][0][0]+u[0][0][-1][0][0]))))))
*/
__global__ void laplacian(float * * u_0_1_out, float * u_0_0, float * u_0_1,
    float alpha, float beta, int x_max, int y_max, int z_max,
    int tbx, int tby, int tbz, int c)
{
    float * const u__u_0[] = { u_0_0, u_0_1 };
    size_1_1 = (y_max / blockDim.y);
    size_1_2 = (z_max / blockDim.z);
    idx_1_2 = (blockIdx.y / size_1_2);

…

SLIDE 4


Gradient

SLIDE 5


Kernel classes placed on the arithmetic intensity axis: Stencils, Particle methods, Sparse Linear Algebra, BLAS 1, BLAS 2, Lattice Methods, FFT, Dense Linear Algebra (BLAS3)
Low arithmetic intensity → memory bandwidth bound
High arithmetic intensity → processor bound

Arithmetic Intensity := Flops / Transferred Data
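As a worked instance of this definition, the sketch below evaluates it for the 7-point Laplacian stencil, which performs 2 multiplications and 6 additions per updated point. The byte counts are assumptions, not figures from the slides: the first assumes ideal caching (each 4-byte float is read once and written once), the second assumes a write-allocate cache that also loads the target line before writing it.

```python
def arithmetic_intensity(flops, bytes_transferred):
    """Arithmetic intensity in Flop/Byte: Flops / Transferred Data."""
    return flops / bytes_transferred


# 7-point Laplacian: 2 multiplications + 6 additions per updated point.
LAPLACIAN_FLOPS = 2 + 6

# Assumption: ideal caching, one 4-byte read and one 4-byte write per point.
BYTES_IDEAL = 4 + 4

# Assumption: write-allocate adds a 4-byte read of the output location.
BYTES_WRITE_ALLOCATE = 4 + 4 + 4

ideal = arithmetic_intensity(LAPLACIAN_FLOPS, BYTES_IDEAL)
write_allocate = arithmetic_intensity(LAPLACIAN_FLOPS, BYTES_WRITE_ALLOCATE)
print(f"ideal caching:  {ideal:.2f} Flop/Byte")
print(f"write-allocate: {write_allocate:.2f} Flop/Byte")
```

Either way the value is on the order of 1 Flop/Byte or less, which places this stencil toward the low, memory-bandwidth-bound end of the axis.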

SLIDE 6


SLIDE 7


M. Christen et al., PATUS: A Code Generation and Autotuning Framework For Parallel Iterative Stencil Computations on Modern Microarchitectures, IPDPS 2011

[Diagram labels: Coco/R, Cetus, Cetus]

SLIDE 8


stencil pmcl3d_uxx1 {
    domainsize = (nxb .. nxe, nyb .. nye, nzb .. nze);
    t_max = 1;

    operation (
        const float grid d1(-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid u1(-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        ...
        float param dth)
    {
        float d = 0.25 * (d1[x,y,z] + d1[x,y-1,z] + d1[x,y,z-1] + d1[x,y-1,z-1]);
        u1[x,y,z; t+1] = u1[x,y,z; t] + (dth / d) * (
            c1 * (xx[x,y,z] - xx[x-1,y,z] + xy[x,y,z] - xy[x,y-1,z] + xz[x,y,z] - xz[x,y,z-1]) +
            c2 * (xx[x+1,y,z] - xx[x-2,y,z] + xy[x,y+1,z] - xy[x,y-2,z] + xz[x,y,z+1] - xz[x,y,z-2])
        );
    }
}

SLIDE 9


strategy cacheblocking (domain u, auto dim cb) {
    // iterate over time steps
    for t = 1 .. stencil.t_max {
        // iterate over subdomain
        for subdomain v(cb) in u(:; t) parallel {
            for point p in v(:; t)
                v[p; t+1] = stencil (v[p; t]);
        }
    }
}
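The loop structure of this strategy can be sketched in plain Python as follows. This only illustrates the traversal order and is not what Patus generates. The helper names `subdomains`, `cache_blocked`, and `average6`, the nested-list grid, and the unchanged boundary are assumptions, and the subdomain loop marked `parallel` in the strategy runs sequentially here.

```python
from itertools import product


def subdomains(shape, cb):
    """Yield one list of index ranges per block of size cb covering the interior."""
    starts = [range(1, n - 1, b) for n, b in zip(shape, cb)]
    for origin in product(*starts):
        yield [range(o, min(o + b, n - 1)) for o, b, n in zip(origin, cb, shape)]


def cache_blocked(u, stencil, cb, t_max):
    """Run t_max sweeps, visiting points block by block (sequentially here)."""
    shape = (len(u), len(u[0]), len(u[0][0]))
    for _ in range(t_max):                       # iterate over time steps
        new = [[row[:] for row in plane] for plane in u]
        for block in subdomains(shape, cb):      # iterate over subdomains v(cb)
            for x, y, z in product(*block):      # iterate over points p in v
                new[x][y][z] = stencil(u, x, y, z)
        u = new
    return u


def average6(u, x, y, z):
    """Hypothetical example stencil: mean of the six axis neighbours."""
    return (u[x + 1][y][z] + u[x - 1][y][z] + u[x][y + 1][z] +
            u[x][y - 1][z] + u[x][y][z + 1] + u[x][y][z - 1]) / 6.0


# The block size changes only the traversal order, not the result.
grid = [[[float(x + 10 * y + 100 * z) for z in range(6)]
         for y in range(6)] for x in range(6)]
blocked = cache_blocked(grid, average6, cb=(2, 3, 4), t_max=2)
unblocked = cache_blocked(grid, average6, cb=(6, 6, 6), t_max=2)
print(blocked == unblocked)
```

Because each sweep reads only time level t and writes only t+1, any block size gives the same values. The block size `cb` affects only memory locality, which is why it can be left to the auto-tuner.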

SLIDE 10


[Figure: performance in GFlop/s (axis 5 to 20) as a function of the cache block sizes cbx, cby, cbz, with a slice along cby at cbx=96, cbz=4.]

SLIDE 11


[Figure: Single Precision Wave Stencil. Single precision GFlop/s (axis 5 to 50) against number of benchmark runs (2 to 1024) for the search methods DIRECT, GCE, Genetic, Greedy, Hooke-Jeeves, Simplex Search.]

[Figure: Auto-Tuning Process Duration. Optimization duration in minutes (axis 10 to 60) per search method; panels Wave | Upstream.]
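To illustrate how one of the listed search methods explores block sizes, here is a minimal sketch of a Hooke-Jeeves pattern search on an integer grid. It is not the Patus auto-tuner. The function `hooke_jeeves`, the bounds, the start point, and the synthetic objective `fake_performance` are all assumptions; a real tuner would run the benchmark executable instead. The peak at cbx=96, cbz=4 echoes the slice shown earlier, while cby=16 is made up.

```python
def hooke_jeeves(f, start, step, lo, hi):
    """Maximise f over integer points in [lo, hi]^n with a pattern search.

    Returns (best_point, best_value, number_of_distinct_evaluations).
    """
    cache = {}

    def evaluate(p):
        p = tuple(p)
        if p not in cache:          # each distinct point costs one benchmark run
            cache[p] = f(p)
        return cache[p]

    def clamp(v):
        return max(lo, min(hi, v))

    def explore(base, step):
        """Exploratory move: try +step and -step along each coordinate."""
        best = list(base)
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] = clamp(best[i] + delta)
                if evaluate(trial) > evaluate(best):
                    best = trial
                    break
        return best

    base = list(start)
    while step >= 1:
        new = explore(base, step)
        if evaluate(new) > evaluate(base):
            while True:
                # Pattern move: jump further along the improving direction.
                pattern = [clamp(2 * n - b) for n, b in zip(new, base)]
                base = new
                candidate = explore(pattern, step)
                if evaluate(candidate) > evaluate(base):
                    new = candidate
                else:
                    break
        else:
            step //= 2              # no improvement: refine the step size
    return tuple(base), evaluate(base), len(cache)


def fake_performance(p):
    """Synthetic stand-in for a benchmark run, peaking at (96, 16, 4)."""
    cbx, cby, cbz = p
    return -((cbx - 96) ** 2 + (cby - 16) ** 2 + (cbz - 4) ** 2)


best, value, runs = hooke_jeeves(fake_performance, (32, 32, 32), 16, 1, 128)
print(best, value, runs)
```

The third return value counts distinct evaluations, which corresponds to the number of benchmark runs a tuner would need with this method.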

SLIDE 12


Vectorization
Loop unrolling
Benchmarking Harness
Time measurement & performance calculation
Result validation
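The harness duties listed above (time measurement, performance calculation, result validation) can be sketched as below. This is a hypothetical stand-alone illustration, not the harness Patus generates. The function `benchmark`, the two 3-point smoothing kernels, the best-of-N timing policy, and the relative tolerance are all assumptions.

```python
import time


def benchmark(kernel, reference, make_input, flops_per_call,
              repeats=5, rel_tol=1e-6):
    """Return (gflops, valid) for kernel, validated against reference."""
    expected = reference(make_input())
    best = float("inf")
    result = None
    for _ in range(repeats):
        data = make_input()                      # fresh input for every run
        start = time.perf_counter()
        result = kernel(data)
        best = min(best, time.perf_counter() - start)
    # Result validation: element-wise comparison with a relative tolerance.
    valid = len(result) == len(expected) and all(
        abs(a - b) <= rel_tol * max(1.0, abs(b))
        for a, b in zip(result, expected))
    # Performance calculation from the best measured time.
    gflops = flops_per_call / max(best, 1e-9) / 1e9
    return gflops, valid


def smooth_reference(u):
    """3-point average as a plain loop: 2 additions + 1 multiplication per point."""
    out = u[:]
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) * (1.0 / 3.0)
    return out


def smooth_variant(u):
    """Same stencil written with zip; stands in for a generated variant."""
    inner = [(a + b + c) * (1.0 / 3.0) for a, b, c in zip(u, u[1:], u[2:])]
    return u[:1] + inner + u[-1:]


n = 10_000
gflops, valid = benchmark(
    smooth_variant, smooth_reference,
    make_input=lambda: [float(i % 7) for i in range(n)],
    flops_per_call=3 * (n - 2))
print(f"{gflops:.4f} GFlop/s, valid={valid}")
```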

SLIDE 13


SLIDE 14


Image courtesy: Southern California Earthquake Center

HP2C Petaquake project

Dynamic Coulomb failure stress changes in a shakeout simulation of an earthquake on the southern San Andreas Fault

SLIDE 15


Kernel | Description                               | Discretization | Flops/Stencil | Arith. Intens.
uxx1   | Velocity in one direction                 | 4th order      | 20 Flops      | 0.83 Flop/Byte
xy1    | Diagonal stress in one direction          | 4th order      | 16 Flops      | 0.80 Flop/Byte
xyz1   | Stresses parallel to axes                 | 4th order      | 90 Flops      | 2.04 Flop/Byte
xyzq   | Stresses parallel to axes in viscous mode | 4th order      | 129 Flops     | 1.61 Flop/Byte
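Since arithmetic intensity is flops divided by transferred data, the table implies a data volume per stencil evaluation for each kernel. The short sketch below derives it from the table values. The dictionary layout and the function name `bytes_per_stencil` are illustrative choices, and the results are rounded figures implied by the table, not numbers stated in the slides.

```python
# (flops per stencil, arithmetic intensity in Flop/Byte), taken from the table.
KERNELS = {
    "uxx1": (20, 0.83),
    "xy1": (16, 0.80),
    "xyz1": (90, 2.04),
    "xyzq": (129, 1.61),
}


def bytes_per_stencil(flops, intensity):
    """Invert Arithmetic Intensity = Flops / Transferred Data."""
    return flops / intensity


for name, (flops, intensity) in KERNELS.items():
    print(f"{name}: about {bytes_per_stencil(flops, intensity):.1f} bytes per stencil")
```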

SLIDE 16


[Diagram: CPU architecture. Core 0, Core 1, … Core 5, each with an Exec Unit, 64K L1I, 64K L1D, and 512K L2; shared 6M L3; System Request Interface / Crossbar; HyperTransport (HT Links); Mem Ctrlr with Probe Filter Directory; DRAM. Replicated x2. Memory bandwidth: 53GB/s (total).]
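Combining the 53GB/s total bandwidth with the arithmetic intensities from the kernel table gives a rough bandwidth-bound performance ceiling. The sketch below is a simple roofline-style estimate under assumptions: it treats 53 GB/s as the sustained bandwidth available to the kernels, and it leaves the compute peak as an optional parameter because the slides do not state it numerically. The function name `attainable_gflops` is illustrative.

```python
# Arithmetic intensities in Flop/Byte, from the kernel table.
INTENSITY = {"uxx1": 0.83, "xy1": 0.80, "xyz1": 2.04, "xyzq": 1.61}

# Total memory bandwidth in GB/s, from the architecture diagram.
BANDWIDTH_GB_S = 53.0


def attainable_gflops(intensity, bandwidth_gb_s, peak_gflops=float("inf")):
    """Roofline-style bound: min(compute peak, intensity * bandwidth).

    Assumption: peak_gflops defaults to infinity because the compute peak
    is not given in the slides; pass a real value to apply the cap.
    """
    return min(peak_gflops, intensity * bandwidth_gb_s)


for name, ai in INTENSITY.items():
    bound = attainable_gflops(ai, BANDWIDTH_GB_S)
    print(f"{name}: at most about {bound:.1f} GFlop/s if bandwidth bound")
```

For the two low-intensity kernels (uxx1, xy1) this gives a ceiling in the low 40s of GFlop/s. For xyz1 and xyzq the bandwidth bound is much higher, so the compute peak is likely the limiting factor there.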

SLIDE 17


[Chart: single precision GFlop/s (axis 5 to 45) for the kernels uxx1, xy1, xyz1, xyzq, each at x-axis values 1, 2, 4, 6, 12, 24. Series: Reference (Fortran).]

SLIDE 18


[Chart: single precision GFlop/s (axis 5 to 45) for the kernels uxx1, xy1, xyz1, xyzq, each at x-axis values 1, 2, 4, 6, 12, 24. Series: Peak, Reference (Fortran).]

SLIDE 19


[Chart: single precision GFlop/s (axis 5 to 45) for the kernels uxx1, xy1, xyz1, xyzq, each at x-axis values 1, 2, 4, 6, 12, 24. Series: Patus, Basic Cache Blocking; Peak; Reference (Fortran).]

SLIDE 20


[Chart: single precision GFlop/s (axis 5 to 45) for the kernels uxx1, xy1, xyz1, xyzq, each at x-axis values 1, 2, 4, 6, 12, 24. Series: Patus, Basic Cache Blocking; +SSE; Peak; Reference (Fortran).]

SLIDE 21


[Chart: single precision GFlop/s (axis 5 to 45) for the kernels uxx1, xy1, xyz1, xyzq, each at x-axis values 1, 2, 4, 6, 12, 24. Series: Patus, Basic Cache Blocking; +SSE; +Loop Unrolling; Peak; Reference (Fortran).]

SLIDE 22


[Diagram: GPU architecture. Host IF on PCIe; Giga Thd Eng; GPC (x4), each with a Raster Engine and several SMs with PolyMorph engines; 786K L2; Mem Ctrlrs to GDDR5. Each SM has 64K Shared Memory, SFU units, and L/S units.]

SLIDE 23


[Chart: single precision GFlop/s (axis 10 to 80) for the kernels uxx1, xy1, xyz1, xyzq, each at x-axis labels (1,1), (2,3), (3,3). Series: Default, +Blocking, +Loop Unrolling.]

SLIDE 24


Code generation framework for stencil codes of arbitrary stencil shapes for different types of hardware
Experimentation environment for parallelization and blocking strategies
Auto-tuning as a methodology to achieve the best performance on a given hardware architecture for a given strategy
Performance improvement of up to ~4x for 3D AWP-ODC stencil kernels

SLIDE 25
