Matthias Christen, Cetus Users and Compiler Infrastructure Workshop
Image courtesy: Southern California Earthquake Center
stencil laplacian {
    operation(float grid u, float param alpha, float param beta) {
        u[x, y, z; t+1] = alpha * u[x, y, z; t] + beta * (
            u[x+1, y, z; t] + u[x-1, y, z; t] +
            u[x, y+1, z; t] + u[x, y-1, z; t] +
            u[x, y, z+1; t] + u[x, y, z-1; t]);
    }
}
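For readers who want to check what the laplacian specification computes, the following is a minimal reference sketch in plain Python. It is not code produced by PATUS. The function name `laplacian_step` and the choice to leave boundary points unchanged are assumptions made for illustration, since the specification does not state how boundaries are handled.

```python
def laplacian_step(u, alpha, beta):
    """Apply one time step of the 7-point stencil to the interior of a 3D grid."""
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    # Copy the grid so boundary values carry over unchanged (an assumption;
    # the stencil specification does not define boundary handling).
    out = [[row[:] for row in plane] for plane in u]
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                out[x][y][z] = alpha * u[x][y][z] + beta * (
                    u[x + 1][y][z] + u[x - 1][y][z]
                    + u[x][y + 1][z] + u[x][y - 1][z]
                    + u[x][y][z + 1] + u[x][y][z - 1]
                )
    return out


# A constant grid of ones: every interior point becomes alpha + 6 * beta.
grid = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
result = laplacian_step(grid, alpha=0.5, beta=0.25)
print(result[1][1][1])  # 2.0
```

A plain reference version like this can serve as the baseline against which a tuned kernel is validated.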
PATUS
/*
(u[0][0][0][1][0]=((alpha*u[0][0][0][0][0])+(beta*((u[1][0][0][0][0]+(u[-1][0][0][0][0]+u[0][1][0][0][0]))+(u[0][-1][0][0][0]+(u[0][0][1][0][0]+u[0][0][-1][0][0]))))))
*/
__global__ void laplacian(float * * u_0_1_out, float * u_0_0, float * u_0_1,
    float alpha, float beta, int x_max, int y_max, int z_max,
    int tbx, int tby, int tbz, int c)
{
    float * const u__u_0[] = { u_0_0, u_0_1 } ;
    size_1_1=(y_max/blockDim.y);
    size_1_2=(z_max/blockDim.z);
    idx_1_2=(blockIdx.y/size_1_2);
…
Gradient
Arithmetic Intensity := Flops / Transferred Data
Low arithmetic intensity: memory bandwidth bound
High arithmetic intensity: processor bound
Kernel classes shown on the spectrum: Stencils, Particle methods, Sparse Linear Algebra, BLAS 1, BLAS 2, Lattice Methods, FFT, Dense Linear Algebra (BLAS 3)
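As a worked example of this definition, the sketch below estimates the arithmetic intensity of the laplacian stencil shown earlier. The flop count (2 multiplications and 6 additions) is read off the stencil expression. The two data-transfer scenarios are assumptions: the "ideal" case assumes perfect cache reuse so that each point costs one 4-byte read and one 4-byte write, and the "no reuse" case assumes all seven inputs come from memory.

```python
def arithmetic_intensity(flops, bytes_transferred):
    """Arithmetic intensity in Flop/Byte."""
    return flops / bytes_transferred


FLOAT_BYTES = 4          # single precision
LAPLACIAN_FLOPS = 8      # 2 multiplications + 6 additions per point

# Assumed best case: neighbours are served from cache, so only one read
# and one write per point reach main memory.
ideal = arithmetic_intensity(LAPLACIAN_FLOPS, 2 * FLOAT_BYTES)

# Assumed worst case: all 7 reads and the 1 write go to main memory.
no_reuse = arithmetic_intensity(LAPLACIAN_FLOPS, 8 * FLOAT_BYTES)

print(ideal, no_reuse)  # 1.0 0.25
```

Either way the value is low, which is consistent with stencils being listed among the memory-bandwidth-bound kernels.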
- M. Christen et al., PATUS: A Code Generation and Autotuning Framework for Parallel Iterative Stencil Computations on Modern Microarchitectures, IPDPS 2011
[Diagram labels: Coco/R, Cetus]
stencil pmcl3d_uxx1 {
    domainsize = (nxb .. nxe, nyb .. nye, nzb .. nze);
    t_max = 1;

    operation (
        const float grid d1(-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        float grid u1(-1 .. nxt+2, -1 .. nyt+2, -1 .. nzt+2),
        ...
        float param dth)
    {
        float d = 0.25 * (d1[x,y,z] + d1[x,y-1,z] + d1[x,y,z-1] + d1[x,y-1,z-1]);
        u1[x,y,z; t+1] = u1[x,y,z; t] + (dth / d) * (
            c1 * (xx[x,y,z] - xx[x-1,y,z] + xy[x,y,z] - xy[x,y-1,z] + xz[x,y,z] - xz[x,y,z-1]) +
            c2 * (xx[x+1,y,z] - xx[x-2,y,z] + xy[x,y+1,z] - xy[x,y-2,z] + xz[x,y,z+1] - xz[x,y,z-2])
        );
    }
}
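The `c1` and `c2` terms in this stencil have the shape of a fourth-order finite difference on a staggered grid: a near pair of samples weighted by `c1` and a far pair weighted by `c2`. The one-dimensional sketch below illustrates that pattern. The coefficient values 9/8 and -1/24 are the standard fourth-order staggered-grid weights and are an assumption here, since the slide does not state them. The function name `staggered_diff` and the grid spacing parameter `h` are also hypothetical.

```python
C1 = 9.0 / 8.0      # assumed value; not given in the stencil specification
C2 = -1.0 / 24.0    # assumed value; not given in the stencil specification


def staggered_diff(f, i, h=1.0):
    """Approximate df/dx at the midpoint between samples i-1 and i."""
    return (C1 * (f[i] - f[i - 1]) + C2 * (f[i + 1] - f[i - 2])) / h


# Samples of f(x) = x^3 at x = 0, 1, ..., 5.
xs = [float(k) for k in range(6)]
cubic = [x ** 3 for x in xs]

# Midpoint between x = 2 and x = 3 is 2.5; the exact derivative is 3 * 2.5^2.
approx = staggered_diff(cubic, 3)
exact = 3 * 2.5 ** 2
print(abs(approx - exact) < 1e-12)  # True
```

With these weights the formula reproduces the derivative of any cubic polynomial up to rounding, which is what "4th order" refers to in this setting.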
strategy cacheblocking (domain u, auto dim cb) {
    // iterate over time steps
    for t = 1 .. stencil.t_max {
        // iterate over subdomain
        for subdomain v(cb) in u(:; t) parallel {
            for point p in v(:; t)
                v[p; t+1] = stencil (v[p; t]);
        }
    }
}
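The loop structure of the cacheblocking strategy can be sketched in plain Python as follows. This is an illustration of the traversal order only, not generated PATUS code: the `parallel` keyword is ignored and blocks are visited sequentially, and the helper names `blocks`, `blocked_sweep`, `run`, and `double` are hypothetical.

```python
def blocks(n, size):
    """Yield (start, stop) ranges that tile 0..n with blocks of at most `size`."""
    for start in range(0, n, size):
        yield start, min(start + size, n)


def blocked_sweep(u, cb, point_update):
    """One time step: visit the grid block by block, then point by point."""
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    out = [[row[:] for row in plane] for plane in u]
    for x0, x1 in blocks(nx, cb[0]):            # for subdomain v(cb) in u
        for y0, y1 in blocks(ny, cb[1]):
            for z0, z1 in blocks(nz, cb[2]):
                for x in range(x0, x1):          # for point p in v
                    for y in range(y0, y1):
                        for z in range(z0, z1):
                            out[x][y][z] = point_update(u, x, y, z)
    return out


def run(u, cb, point_update, t_max):
    """Outer time loop: for t = 1 .. t_max."""
    for _ in range(t_max):
        u = blocked_sweep(u, cb, point_update)
    return u


def double(g, x, y, z):
    """Trivial stand-in for the stencil computation."""
    return 2.0 * g[x][y][z]


grid = [[[float(x + y + z) for z in range(5)] for y in range(4)] for x in range(3)]
blocked = run(grid, (2, 3, 4), double, t_max=2)
unblocked = run(grid, (3, 4, 5), double, t_max=2)
print(blocked == unblocked)  # True
```

Blocking changes only the order in which points are visited, so the result matches the unblocked sweep. The block size `cb` is the parameter left to the auto-tuner.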
[Figure: performance in GFlop/s over the cache block sizes cbx, cby, cbz, with a slice along cby at cbx=96, cbz=4.]
[Figure: Single Precision Wave Stencil. Single-precision GFlop/s versus number of benchmark runs (2 to 1024) for the search methods DIRECT, GCE, Genetic, Greedy, Hooke-Jeeves, and Simplex Search.]
[Figure: Auto-Tuning Process Duration. Optimization duration in minutes for the Wave and Upstream stencils.]
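To give a feel for what a search method does during auto-tuning, here is a small sketch of a coordinate-wise greedy search over block sizes. It is a simplified illustration and not the search implementation used in PATUS. The function `synthetic_gflops` is a made-up performance model that stands in for compiling and timing a real kernel, and its optimum at (96, 8, 4) is chosen arbitrarily.

```python
def greedy_search(benchmark, candidates, start):
    """Tune one parameter at a time and keep any change that improves the score."""
    best = list(start)
    best_score = benchmark(tuple(best))
    runs = 1
    improved = True
    while improved:
        improved = False
        for dim, values in enumerate(candidates):
            for v in values:
                if v == best[dim]:
                    continue
                trial = best[:dim] + [v] + best[dim + 1:]
                score = benchmark(tuple(trial))
                runs += 1
                if score > best_score:
                    best, best_score, improved = trial, score, True
    return tuple(best), best_score, runs


def synthetic_gflops(cb):
    """Hypothetical performance model; a real tuner would time the kernel."""
    cbx, cby, cbz = cb
    return 20.0 - abs(cbx - 96) / 16 - abs(cby - 8) / 4 - abs(cbz - 4) / 2


sizes = [2, 4, 8, 16, 32, 64, 96, 128]
best, score, runs = greedy_search(synthetic_gflops, [sizes, sizes, sizes], (2, 2, 2))
print(best, score)  # (96, 8, 4) 20.0
```

The `runs` counter corresponds to the "number of benchmark runs" axis: each evaluation is one benchmark execution, which is why search methods that converge in fewer runs shorten the tuning process.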
- Vectorization
- Loop unrolling
- Benchmarking Harness
- Time measurement & performance calculation
- Result validation
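The harness duties listed above (time measurement, performance calculation, result validation) can be sketched as follows. This is an illustrative outline, not the harness that PATUS generates. All function names are hypothetical, and the relative tolerance of 1e-5 is an assumed value.

```python
import time


def gflops(flops_per_point, points, time_steps, seconds):
    """Performance in GFlop/s from the flop count and the elapsed time."""
    return flops_per_point * points * time_steps / seconds / 1e9


def validate(result, reference, rel_tol=1e-5):
    """Return True if both sequences agree element-wise within rel_tol."""
    if len(result) != len(reference):
        return False
    return all(abs(a - b) <= rel_tol * max(abs(a), abs(b), 1.0)
               for a, b in zip(result, reference))


def benchmark(kernel, reference_kernel, data, flops_per_point, time_steps=1):
    """Time one kernel run, then check its output against a reference."""
    start = time.perf_counter()
    result = kernel(data)
    # Guard against a zero reading on very coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    ok = validate(result, reference_kernel(data))
    return gflops(flops_per_point, len(data), time_steps, elapsed), ok


def tuned(data):
    """Stand-in for an optimized kernel: 2 flops per point."""
    return [2.0 * v + 1.0 for v in data]


def reference(data):
    """Stand-in for the trusted reference kernel."""
    out = []
    for v in data:
        out.append(2.0 * v + 1.0)
    return out


perf, ok = benchmark(tuned, reference, [float(i) for i in range(100000)], 2)
print(ok)  # True
```

Validating against a reference on every run matters during auto-tuning, because a fast but incorrect variant must not be selected as the best configuration.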
Image courtesy: Southern California Earthquake Center
HP2C Petaquake project
Dynamic Coulomb failure stress changes in a ShakeOut simulation of an earthquake on the southern San Andreas Fault
Kernel  Description                                Discretization  Flops/Stencil  Arith. Intens.
uxx1    Velocity in one direction                  4th order       20 Flops       0.83 Flop/Byte
xy1     Diagonal stress in one direction           4th order       16 Flops       0.80 Flop/Byte
xyz1    Stresses parallel to axes                  4th order       90 Flops       2.04 Flop/Byte
xyzq    Stresses parallel to axes in viscous mode  4th order       129 Flops      1.61 Flop/Byte
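The table values can be turned into two derived quantities: the data volume per stencil evaluation that each arithmetic intensity implies, and a bandwidth-bound performance ceiling. The sketch below does this under the assumption that the 53 GB/s total figure from the CPU architecture diagram is the relevant sustained memory bandwidth. The resulting ceiling is a rough roofline-style estimate, not a measured value, and the processor's compute peak (not given here) may be the tighter limit for the higher-intensity kernels.

```python
# (flops per stencil, arithmetic intensity in Flop/Byte), taken from the table.
KERNELS = {
    "uxx1": (20, 0.83),
    "xy1": (16, 0.80),
    "xyz1": (90, 2.04),
    "xyzq": (129, 1.61),
}

# Assumption: the 53 GB/s total figure is usable as sustained bandwidth.
BANDWIDTH_GB_S = 53.0

bytes_per_stencil = {}
bandwidth_ceiling = {}
for name, (flops, intensity) in KERNELS.items():
    # Arithmetic intensity = flops / bytes, so bytes = flops / intensity.
    bytes_per_stencil[name] = flops / intensity
    # Memory-bound ceiling in GFlop/s = intensity * bandwidth.
    bandwidth_ceiling[name] = intensity * BANDWIDTH_GB_S

for name in KERNELS:
    print(name,
          round(bytes_per_stencil[name], 1), "Byte/stencil,",
          round(bandwidth_ceiling[name], 1), "GFlop/s ceiling")
```

For xy1, for example, this gives 20 bytes per stencil and a ceiling of about 42.4 GFlop/s.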
[Diagram: CPU architecture. Cores 0 to 5, each with an execution unit, 64K L1I, 64K L1D, and 512K L2; shared 6M L3; System Request Interface / Crossbar; HyperTransport with HT links; memory controller with probe filter directory, connected to DRAM; x2; 53 GB/s (total).]
[Figure, built up over several slides: single-precision GFlop/s (axis 5 to 45) for the kernels uxx1, xy1, xyz1, and xyzq, each at x-axis values 1, 2, 4, 6, 12, and 24. Series added step by step: Reference (Fortran); Peak; Patus, Basic Cache Blocking; +SSE; +Loop Unrolling.]
[Diagram: GPU architecture. Host interface (PCIe); GigaThread Engine; x4 GPCs, each with a Raster Engine and SMs with PolyMorph Engines; 786K L2; memory controllers connected to GDDR5. Per SM: 64K shared memory, load/store (L/S) units, and SFUs.]
[Figure: single-precision GFlop/s (axis 10 to 80) for the kernels uxx1, xy1, xyz1, and xyzq, each at configurations (1,1), (2,3), and (3,3). Series: Default, +Blocking, +Loop Unrolling.]
- Code generation framework for stencil codes of arbitrary stencil shapes for different types of hardware
- Experimentation environment for parallelization and blocking strategies
- Auto-tuning as a methodology to achieve the best performance on a given hardware architecture for a given strategy
- Performance improvement of up to ~4x for 3D AWP-ODC stencil kernels