SLIDE 1
DiamondTile Algorithm for High-Performance Wave Modeling
Vadim Levchenko Anastasia Perepelkina Keldysh Institute of Apllied Mathematics RAS GTC 2015
SLIDE 2 FLOPs and Bandwidth Performance Ratio
10 100 1000 0.1 1 10 GB/s TFLOP/s (fp32) nVidia Maxwell, 2014-15 nVidia Kepler, 2012-13 Intel CPU, 2014 NEC SX, 199x . 1 B y t e s / F l
s . 4 B y t e s / F l
s 4 B y t e s / F l
s
SLIDE 3 RoofLine model
- S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore
- architectures. Commun. ACM, 52:65–76, 2009.
L.Barba, R.Yokota, How Will the Fast Multipole Method Fare in the Exascale Era?
SLIDE 4 Wave Modeling Specifics
∂2F ∂t2 = c2 ∂2F ∂x2 + ∂2F ∂y2 + ∂2F ∂z2
Finite difference along each axis:
∂2F ∂x2
1 ∆x2
NO/2
i=0
Ci(F|x0+i∆x,y0,z0 + F|x0−i∆x,y0,z0) NO = 2 for ∂2F
∂t2 , NO = 2, 4, 6, ..14 for coordinate axes.
Per one cell, per one time step calculation:
◮ O = 1 + 3NO FMA operations ◮ D = 3 + 3NO data
Operational intensity: O/D ∼ 1/2 Flop/byte (na¨ ıve algorithm) .
x y t x2+y2+z2=c2t2 domain of influence domain of dependence asynchro- nous domain asynchro- nous domain synchronization instant
SLIDE 5 Wave Modeling Specifics
∂2F ∂t2 = c2 ∂2F ∂x2 + ∂2F ∂y2 + ∂2F ∂z2
Finite difference along each axis:
∂2F ∂x2
1 ∆x2
NO/2
i=0
Ci(F|x0+i∆x,y0,z0 + F|x0−i∆x,y0,z0) NO = 2 for ∂2F
∂t2 , NO = 2, 4, 6, ..14 for coordinate axes.
. Cross-shaped stencil fits into diamond shape
x y t x2+y2+z2=c2t2 domain of influence domain of dependence asynchro- nous domain asynchro- nous domain synchronization instant
SLIDE 6
Wave equation modelling
Computational Grid projection to (x–t)
SLIDE 7
Wave equation modelling
Computational Grid projection to (x–t)
SLIDE 8
Wave equation modelling
SLIDE 9
Wave equation modelling
SLIDE 10
Wave equation modelling
SLIDE 11
Traditional stepwise evaluation order
SLIDE 12
Traditional stepwise evaluation order
SLIDE 13
Traditional stepwise evaluation order
SLIDE 14
Traditional stepwise evaluation order
Overlapping stencils increase operational intensity:
◮ O = 1 + 3NO FMA operations ◮ D = 3 data
Operational intensity: O/D ∼ (1 + NO) Flop/byte
SLIDE 15
RoofLine Model for Wave Equation on GPGPU
10 100 1000 0.1 1 10 performance, 109 cells/sec localization parameter, cells calculations/(data loads+stores) the best of stepwise naive CUDA FDTD3d results TitanZ GTX 970
SLIDE 16
LRnLA method
SLIDE 17
LRnLA method
SLIDE 18
LRnLA method
Locality Take advantage of memory subsystem hierarchy, from on-chip CPU cash and up to disk and network Recursivity Application of “divide et impera” strategy for any situations (computer architectures, numerical schemes, etc.) non-Locality Optimized for distributed computations Asynchrony Adaptable parallel computations on any levels
SLIDE 19 Memory Subsystem Hierarchy for GPGPU and CPU
. GK110 Haswell GM204 . . GTX Titan Xeon E5 v3 GTX 980 .
109 1010 1011 1012 1013 1014 1T 1G 1M 1K 1M 1G 1T Data throughput, B/sec Data set size, B regs L1+sh L2 GDDR5 regs L1+sh L2 GDDR5 regs L1 L2 LLC DDR4 SSD/PCIe HDD
SLIDE 20
DiamondTile based algorithm construction
Computational grid in x-y and x-t projections
SLIDE 21
DiamondTile based algorithm construction
Computational domain is subdivided into Diamond shaped tiles in x-y.
◮ Diamond encloses cross-shaped stencil ◮ All elements along 3rd (z) axis are included
SLIDE 22
DiamondTile based algorithm construction
❖ Choose a DiamondTile on first time-step
SLIDE 23
DiamondTile based algorithm construction
❖ Choose a DiamondTile on first time-step ❖ Plot influence cone of first tile
SLIDE 24
DiamondTile based algorithm construction
❖ Choose a DiamondTile on first time-step ❖ Plot influence cone of first tile ❖ Choose a shifted DiamondTile on another time-step (Nt steps later)
SLIDE 25
DiamondTile based algorithm construction
❖ Choose a DiamondTile on first time-step ❖ Plot influence cone of first tile ❖ Choose a shifted DiamondTile on another time-step (Nt steps later) ❖ Plot dependence cone of last tile
SLIDE 26
DiamondTile based algorithm construction
❖ Choose a DiamondTile on first time-step ❖ Plot influence cone of first tile ❖ Choose a shifted DiamondTile on another time-step (Nt steps later) ❖ Plot dependence cone of last tile ❖ Find intersection
SLIDE 27
DiamondTorre Algorithm shape
SLIDE 28
Understand Algorithm as a shape
Stepwise
SLIDE 29
Understand Algorithm as a shape
Domain decomposition
SLIDE 30
Understand Algorithm as a shape
More operational intensity
SLIDE 31
Understand Algorithm as a shape
DiamondTorre
SLIDE 32
DiamondTorre Algorithm shape
◮ DiamondTorre tilt depends on stencil size ◮ Stencil width is determined by order of approximation (NO)
SLIDE 33
DiamondTorre Algorithm parameters
Performance depends on careful choice of algorithm parameters:
◮ Size of DiamondTorre base — Diamond Tile Size, DTS ◮ Quantity of time layers — Nt
Operational Intensity ∼ DTS/(4-1/DTS) (for large Nt)
SLIDE 34 RoofLine Model for Wave Equation on GPGPU
10 100 1000 0.1 1 10 performance, 109 cells/sec localization parameter, cells calculations/(data loads+stores) DiamondTile, DTS=1 DTS=4 DTS=7 DTS=14 DTS=20 the best of stepwise DTS=1 naive D i a m
d T
r e f
v a r i
s D T S TitanZ GTX 970
SLIDE 35
DiamondTorre Algorithm with CUDA
In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block
First stage
SLIDE 36
DiamondTorre Algorithm with CUDA
In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block
Second stage
SLIDE 37
DiamondTorre Algorithm with CUDA
In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block
Odd and even stages are alternating. Synchronization after each stage.
SLIDE 38
DiamondTorre Algorithm with CUDA
In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block
Odd and even stages are alternating. Synchronization after each stage.
SLIDE 39
DiamondTorre Algorithm with CUDA
In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block
Odd and even stages are alternating. Synchronization after each stage.
SLIDE 40
DiamondTorre Algorithm with CUDA
❖ At first, some portion of cells remain on first time step, while some are processed to several time layers
SLIDE 41
DiamondTorre Algorithm with CUDA
❖ At first, some portion of cells remain on first time step, while some are processed to several time layers
SLIDE 42
DiamondTorre Algorithm with CUDA
❖ At first, some portion of cells remain on first time step, while some are processed to several time layers
SLIDE 43
DiamondTorre Algorithm with CUDA
❖ At the end, all data are progressed to a given time step. This time step is determined by DiamondTorre height
SLIDE 44 RoofLine Model for Wave Equation on GPGPU
10 100 1000 0.1 1 10 performance, 109 cells/sec localization parameter, cells calculations/(data loads+stores) DiamondTile, DTS=1 DTS=4 DTS=7 DTS=14 DTS=20 the best of stepwise DTS=1 naive D i a m
d T
r e f
v a r i
s D T S CUDA FDTD3d results TitanZ GTX 970
SLIDE 45
10 20 30 40 50 60 2/1 4/1 6/1 8/1 10/112/114/1 6/1 6/2 4/1 4/2 4/3 2/1 2/2 2/3 2/4 2/5 2/6 2/7 calc rate, Gcells/sec various scheme/algorithm parameters, NO/DTS GTX 750Ti GTX 970 TitanZ (1)
SLIDE 46
0.01 0.1 1 10 100 0.01 0.1 1 10 100 1000 calc rate, Gcells/sec parallel level, warps FDTD3d CPU rate FDTD3d CPU rate with -O3 FDTD3d TitanZ rate FDTD3d GTX970 rate TitanZ GTX970
SLIDE 47
Wave Modeling Applications
FDTD simulation for electromagnetics (2nd and 4th order approximation, PML) (Zakirov A., Goryachev I.)
SLIDE 48
Wave Modeling Applications
Gas Dynamis with RKDG scheme (Korneev B.)
SLIDE 49 Wave Modeling Applications
2000 3000 4000 5000 6000 7000 7.5 3.75
6 6 4 4 2 2 7.5 3.75
FDTD simulation for elastic seismic media (Levander scheme, 4th order, PML, Thompsen anisotropy, TFSF source) (Levchenko V., Zakirov A., Perepelkina A., Ivanov A.)
SLIDE 50
Wave Modeling Applications
Particle-in-cell plasma kinetics (Levchenko V., Perepelkina A., Goryachev I.)
SLIDE 51
Main Results and Conclusions
◮ New algorithms DiamondTile of LRnLA family are developed for wave modeling.
The algorithms are efficient on memory and parallelism models of CUDA GPGPU;
◮ Unlike traditional stepwise evaluation order, data dependencies are traced for many
time iteration steps. It increases operational intensity and allows to reach higher calculation rates.
◮ Performance of 50-60 billion cells/s is achieved with Titan, as well as with
GTX970 in the implementation of wave modeling.