DiamondTile Algorithm for High-Performance Wave Modeling


slide-1
SLIDE 1

DiamondTile Algorithm for High-Performance Wave Modeling

Vadim Levchenko, Anastasia Perepelkina
Keldysh Institute of Applied Mathematics RAS
GTC 2015

slide-2
SLIDE 2

FLOPs and Bandwidth Performance Ratio

[Figure: memory bandwidth (GB/s, 10 to 1000) against peak performance (TFLOP/s fp32, 0.1 to 10) for NEC SX (199x), Intel CPU (2014), nVidia Kepler (2012-13) and nVidia Maxwell (2014-15), with guide lines at 0.1, 0.4 and 4 Bytes/Flops.]

slide-3
SLIDE 3

RoofLine model

  • S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52:65–76, 2009.
  • L. Barba, R. Yokota. How Will the Fast Multipole Method Fare in the Exascale Era?
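The roofline bound can be stated in a few lines. This is a minimal sketch, assuming the standard form of the model (attainable performance is the smaller of peak compute and memory bandwidth times operational intensity). The function name and the device numbers are illustrative and are not taken from the slides.

```python
def roofline_gflops(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Attainable performance (GFLOP/s) under the roofline model."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


# Hypothetical device: 4000 GFLOP/s peak, 200 GB/s memory bandwidth.
PEAK, BW = 4000.0, 200.0
ridge = PEAK / BW  # intensity above which a kernel is no longer memory-bound

for intensity in (0.5, 2.0, ridge, 50.0):
    print(intensity, roofline_gflops(PEAK, BW, intensity))
```

Below the ridge point the kernel is limited by memory traffic, so raising operational intensity is the only way to approach peak performance.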

slide-4
SLIDE 4

Wave Modeling Specifics

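$$\frac{\partial^2 F}{\partial t^2} = c^2 \left( \frac{\partial^2 F}{\partial x^2} + \frac{\partial^2 F}{\partial y^2} + \frac{\partial^2 F}{\partial z^2} \right) \quad (+\,\text{BCs} + \text{ICs})$$

Finite difference along each axis:

$$\left. \frac{\partial^2 F}{\partial x^2} \right|_{x_0, y_0, z_0} = \frac{1}{\Delta x^2} \sum_{i=0}^{N_O/2} C_i \left( F|_{x_0 + i\Delta x,\, y_0,\, z_0} + F|_{x_0 - i\Delta x,\, y_0,\, z_0} \right)$$

Here $N_O$ (written NO in the text) is the order of approximation: NO = 2 for ∂²F/∂t², NO = 2, 4, 6, ..., 14 for the coordinate axes.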
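The symmetric sum above can be checked numerically. This is a minimal sketch in pure Python. The coefficient tables are an assumption: they are the standard central-difference weights rewritten for this form (C_0 is half the usual centre weight because the i = 0 term adds F(x0) twice), not values quoted from the slides.

```python
# Coefficients C_i for the form: sum_i C_i * (F(x0 + i*dx) + F(x0 - i*dx)).
# Assumption: standard central-difference weights, with C_0 halved because
# the i = 0 term counts F(x0) twice.
COEFFS = {
    2: (-1.0, 1.0),
    4: (-5.0 / 4.0, 4.0 / 3.0, -1.0 / 12.0),
}


def second_derivative(f, x0, dx, order):
    """Approximate d2f/dx2 at x0 with a symmetric stencil of the given order."""
    total = 0.0
    for i, c in enumerate(COEFFS[order]):  # i runs from 0 to order/2
        total += c * (f(x0 + i * dx) + f(x0 - i * dx))
    return total / dx**2


def quadratic(x):
    return x * x  # exact second derivative is 2 everywhere


print(second_derivative(quadratic, 1.0, 0.5, 2))
print(second_derivative(quadratic, 1.0, 0.5, 4))
```

A stencil of order NO touches NO/2 neighbours on each side, which is why the stencil width grows with the order of approximation.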

Per one cell, per one time step calculation:

◮ O = 1 + 3NO FMA operations
◮ D = 3 + 3NO data

Operational intensity: O/D ∼ 1/2 Flop/byte (naïve algorithm).
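The estimate can be reproduced from the two counts. This sketch assumes that one FMA counts as two flops and that each data value is a 4-byte fp32 number; neither conversion factor is stated on the slide.

```python
FLOPS_PER_FMA = 2    # assumption: one fused multiply-add counts as two flops
BYTES_PER_VALUE = 4  # assumption: fp32 storage


def naive_counts(order):
    """Per-cell, per-step counts for the naive algorithm."""
    ops = 1 + 3 * order   # O = 1 + 3*NO FMA operations
    data = 3 + 3 * order  # D = 3 + 3*NO data values
    return ops, data


def flop_per_byte(ops, data):
    """Convert FMA and data-value counts into Flop/byte."""
    return (ops * FLOPS_PER_FMA) / (data * BYTES_PER_VALUE)


for order in (2, 4, 6, 14):
    ops, data = naive_counts(order)
    print(order, ops, data, round(flop_per_byte(ops, data), 3))
```

Under these assumptions the ratio stays just below 1/2 Flop/byte for every order, because operations and data traffic both grow as 3NO.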

[Figure: characteristic cone x² + y² + z² = c²t² in (x, y, t) axes, showing the domain of influence, the domain of dependence, the asynchronous domains and the synchronization instant.]

slide-5
SLIDE 5

Wave Modeling Specifics


Cross-shaped stencil fits into diamond shape.

[Figure: characteristic cone x² + y² + z² = c²t² in (x, y, t) axes, showing the domain of influence, the domain of dependence, the asynchronous domains and the synchronization instant.]

slide-6
SLIDE 6

Wave equation modelling

Computational Grid projection to (x–t)

slide-7
SLIDE 7

Wave equation modelling

Computational Grid projection to (x–t)

slide-8
SLIDE 8

Wave equation modelling

slide-9
SLIDE 9

Wave equation modelling

slide-10
SLIDE 10

Wave equation modelling

slide-11
SLIDE 11

Traditional stepwise evaluation order

slide-12
SLIDE 12

Traditional stepwise evaluation order

slide-13
SLIDE 13

Traditional stepwise evaluation order

slide-14
SLIDE 14

Traditional stepwise evaluation order

Overlapping stencils increase operational intensity:

◮ O = 1 + 3NO FMA operations
◮ D = 3 data

Operational intensity: O/D ∼ (1 + NO) Flop/byte
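The stepwise order can be sketched as follows: every time step sweeps the whole grid before the next step starts. This is an illustrative 1D, second-order leapfrog solver in pure Python, not the 3D scheme from the slides. The function name, the fixed boundaries and the `courant2` parameter are assumptions made for the example.

```python
def stepwise_wave_1d(prev, curr, courant2, steps):
    """Advance a 1D wave field by `steps` time steps, one full sweep per step.

    prev, curr: field values at time layers n-1 and n (lists of floats).
    courant2:   (c * dt / dx) ** 2.
    The two end points are held fixed.
    """
    prev, curr = list(prev), list(curr)
    for _ in range(steps):
        nxt = list(curr)  # boundary cells keep their values
        for i in range(1, len(curr) - 1):
            lap = curr[i + 1] - 2.0 * curr[i] + curr[i - 1]
            nxt[i] = 2.0 * curr[i] - prev[i] + courant2 * lap
        prev, curr = curr, nxt  # the whole grid is read again on the next step
    return curr


# A linear profile at rest has zero Laplacian, so it must not change.
line = [float(i) for i in range(8)]
print(stepwise_wave_1d(line, line, 0.25, 5))
```

Within one sweep, neighbouring cells reuse each other's stencil values, but between sweeps the entire field has to travel through memory again. That limit is what motivates tracing dependencies across several time steps.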

slide-15
SLIDE 15

RoofLine Model for Wave Equation on GPGPU

[Figure: roofline plot of performance (10^9 cells/sec) against the localization parameter (cell calculations per data load or store) for TitanZ and GTX 970. Marked results: naive, the best of stepwise, CUDA FDTD3d results.]

slide-16
SLIDE 16

LRnLA method

slide-17
SLIDE 17

LRnLA method

slide-18
SLIDE 18

LRnLA method

◮ Locality: take advantage of the memory subsystem hierarchy, from on-chip CPU cache up to disk and network.
◮ Recursivity: apply the “divide et impera” strategy in any situation (computer architectures, numerical schemes, etc.).
◮ non-Locality: optimized for distributed computations.
◮ Asynchrony: adaptable parallel computations at any level.

slide-19
SLIDE 19

Memory Subsystem Hierarchy for GPGPU and CPU

[Figure: data throughput (B/sec, 10^9 to 10^14) against data set size (B) for three devices. GTX Titan (GK110): regs, L1+sh, L2, GDDR5. Xeon E5 v3 (Haswell): regs, L1, L2, LLC, DDR4, SSD/PCIe, HDD. GTX 980 (GM204): regs, L1+sh, L2, GDDR5.]

slide-20
SLIDE 20

DiamondTile based algorithm construction

Computational grid in x-y and x-t projections

slide-21
SLIDE 21

DiamondTile based algorithm construction

Computational domain is subdivided into Diamond shaped tiles in x-y.

◮ Diamond encloses cross-shaped stencil
◮ All elements along 3rd (z) axis are included

slide-22
SLIDE 22

DiamondTile based algorithm construction

❖ Choose a DiamondTile on first time-step

slide-23
SLIDE 23

DiamondTile based algorithm construction

❖ Choose a DiamondTile on first time-step
❖ Plot influence cone of first tile

slide-24
SLIDE 24

DiamondTile based algorithm construction

❖ Choose a DiamondTile on first time-step
❖ Plot influence cone of first tile
❖ Choose a shifted DiamondTile on another time-step (Nt steps later)

slide-25
SLIDE 25

DiamondTile based algorithm construction

❖ Choose a DiamondTile on first time-step
❖ Plot influence cone of first tile
❖ Choose a shifted DiamondTile on another time-step (Nt steps later)
❖ Plot dependence cone of last tile

slide-26
SLIDE 26

DiamondTile based algorithm construction

❖ Choose a DiamondTile on first time-step
❖ Plot influence cone of first tile
❖ Choose a shifted DiamondTile on another time-step (Nt steps later)
❖ Plot dependence cone of last tile
❖ Find intersection
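The idea behind the construction is that cells bounded by influence and dependence cones can be advanced through several time layers without waiting for the rest of the grid. A 1D analogue illustrates this. The sketch below is not the authors' DiamondTorre code: it cuts the (x, t) plane of a 1D leapfrog scheme into diamonds and checks that the tiled order reproduces the stepwise result. All names and the tile indexing by (t + i) and (t - i) are assumptions made for the example.

```python
def step_cell(u, t, i, c2):
    """Leapfrog update of cell i on layer t from layers t-1 and t-2."""
    n = len(u[t])
    if i == 0 or i == n - 1:
        u[t][i] = u[t - 1][i]  # fixed boundary value
    else:
        lap = u[t - 1][i + 1] - 2.0 * u[t - 1][i] + u[t - 1][i - 1]
        u[t][i] = 2.0 * u[t - 1][i] - u[t - 2][i] + c2 * lap


def new_history(u0, u1, nt):
    """Space-time storage: two initial layers plus nt empty layers."""
    return [list(u0), list(u1)] + [[None] * len(u0) for _ in range(nt)]


def run_stepwise(u0, u1, nt, c2):
    u = new_history(u0, u1, nt)
    for t in range(2, nt + 2):  # one full sweep per time step
        for i in range(len(u0)):
            step_cell(u, t, i, c2)
    return u[-1]


def run_diamond(u0, u1, nt, c2, size):
    u = new_history(u0, u1, nt)
    tiles = {}
    for t in range(2, nt + 2):
        for i in range(len(u0)):
            # Diamond index: cells between the same pair of characteristics.
            key = ((t + i) // size, (t - i) // size)
            tiles.setdefault(key, []).append((t, i))
    # Every dependency lies in a tile with p' <= p and q' <= q, so sorting by
    # p + q is a valid order; tiles with equal p + q are mutually independent.
    for key in sorted(tiles, key=lambda k: (k[0] + k[1], k[0])):
        for t, i in sorted(tiles[key]):  # lower time layers first
            step_cell(u, t, i, c2)
    return u[-1]


pulse = [1.0 if 6 <= i <= 9 else 0.0 for i in range(16)]
reference = run_stepwise(pulse, pulse, 10, 0.25)
tiled = run_diamond(pulse, pulse, 10, 0.25, 4)
print(tiled == reference)
```

Cells that have not been computed yet hold `None`, so an invalid traversal order would raise a `TypeError` instead of silently producing wrong values. Each tile spans several time layers, which is the source of the higher data reuse.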

slide-27
SLIDE 27

DiamondTorre Algorithm shape

slide-28
SLIDE 28

Understand Algorithm as a shape

Stepwise

slide-29
SLIDE 29

Understand Algorithm as a shape

Domain decomposition

slide-30
SLIDE 30

Understand Algorithm as a shape

More operational intensity

slide-31
SLIDE 31

Understand Algorithm as a shape

DiamondTorre

slide-32
SLIDE 32

DiamondTorre Algorithm shape

◮ DiamondTorre tilt depends on stencil size
◮ Stencil width is determined by order of approximation (NO)

slide-33
SLIDE 33

DiamondTorre Algorithm parameters

Performance depends on careful choice of algorithm parameters:

◮ Size of the DiamondTorre base: Diamond Tile Size (DTS)
◮ Number of time layers: Nt

Operational Intensity ∼ DTS/(4-1/DTS) (for large Nt)
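The estimate is easy to tabulate for the tile sizes that appear in the roofline plots. This is a small sketch that only evaluates the expression as written on the slide; the function name is illustrative.

```python
def diamond_intensity(dts):
    """Large-Nt estimate from the slide: DTS / (4 - 1/DTS)."""
    return dts / (4.0 - 1.0 / dts)


for dts in (1, 4, 7, 14, 20):
    print(dts, round(diamond_intensity(dts), 2))
```

For large DTS the estimate approaches DTS/4, so the intensity grows almost linearly with the tile size.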

slide-34
SLIDE 34

RoofLine Model for Wave Equation on GPGPU

[Figure: roofline plot of performance (10^9 cells/sec) against the localization parameter (cell calculations per data load or store) for TitanZ and GTX 970. Marked results: DiamondTile with DTS = 1, 4, 7, 14, 20; the best of stepwise; DTS = 1 naive. Annotation: DiamondTorre for various DTS.]

slide-35
SLIDE 35

DiamondTorre Algorithm with CUDA

In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block

First stage

slide-36
SLIDE 36

DiamondTorre Algorithm with CUDA

In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block

Second stage

slide-37
SLIDE 37

DiamondTorre Algorithm with CUDA

In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block

Odd and even stages are alternating. Synchronization after each stage.

slide-38
SLIDE 38

DiamondTorre Algorithm with CUDA

In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block

Odd and even stages are alternating. Synchronization after each stage.

slide-39
SLIDE 39

DiamondTorre Algorithm with CUDA

In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block

Odd and even stages are alternating. Synchronization after each stage.

slide-40
SLIDE 40

DiamondTorre Algorithm with CUDA

❖ At first, some of the cells remain on the first time step, while others are advanced through several time layers

slide-41
SLIDE 41

DiamondTorre Algorithm with CUDA

❖ At first, some of the cells remain on the first time step, while others are advanced through several time layers

slide-42
SLIDE 42

DiamondTorre Algorithm with CUDA

❖ At first, some of the cells remain on the first time step, while others are advanced through several time layers

slide-43
SLIDE 43

DiamondTorre Algorithm with CUDA

❖ At the end, all data are advanced to a given time step. This time step is determined by the DiamondTorre height

slide-44
SLIDE 44

RoofLine Model for Wave Equation on GPGPU

[Figure: roofline plot of performance (10^9 cells/sec) against the localization parameter (cell calculations per data load or store) for TitanZ and GTX 970. Marked results: DiamondTile with DTS = 1, 4, 7, 14, 20; the best of stepwise; DTS = 1 naive; CUDA FDTD3d results. Annotation: DiamondTorre for various DTS.]

slide-45
SLIDE 45

[Figure: calculation rate (Gcells/sec, 10 to 60) for various scheme/algorithm parameters NO/DTS (2/1 to 14/1, 6/1 to 6/2, 4/1 to 4/3, 2/1 to 2/7) on GTX 750Ti, GTX 970 and TitanZ (1).]

slide-46
SLIDE 46

[Figure: calculation rate (Gcells/sec) against parallel level (warps) for TitanZ and GTX970, with reference levels: FDTD3d CPU rate, FDTD3d CPU rate with -O3, FDTD3d TitanZ rate, FDTD3d GTX970 rate.]

slide-47
SLIDE 47

Wave Modeling Applications

FDTD simulation for electromagnetics (2nd and 4th order approximation, PML) (Zakirov A., Goryachev I.)

slide-48
SLIDE 48

Wave Modeling Applications

Gas dynamics with RKDG scheme (Korneev B.)

slide-49
SLIDE 49

Wave Modeling Applications

FDTD simulation for elastic seismic media (Levander scheme, 4th order, PML, Thomsen anisotropy, TFSF source) (Levchenko V., Zakirov A., Perepelkina A., Ivanov A.)

slide-50
SLIDE 50

Wave Modeling Applications

Particle-in-cell plasma kinetics (Levchenko V., Perepelkina A., Goryachev I.)

slide-51
SLIDE 51

Main Results and Conclusions

◮ New DiamondTile algorithms of the LRnLA family are developed for wave modeling. The algorithms are efficient on the memory and parallelism models of CUDA GPGPU.

◮ Unlike the traditional stepwise evaluation order, data dependencies are traced over many time iteration steps. This increases operational intensity and makes higher calculation rates reachable.

◮ Performance of 50-60 billion cells/s is achieved with Titan, as well as with GTX970, in the implementation of wave modeling.