DiamondTile Algorithm for High-Performance Wave Modeling

DiamondTile Algorithm for High-Performance Wave Modeling. Vadim Levchenko, Anastasia Perepelkina. Keldysh Institute of Applied Mathematics RAS. GTC 2015

FLOPs and Bandwidth Performance Ratio
[Figure: memory bandwidth (GB/s, 10 to 1000) versus peak performance (TFLOP/s, fp32, 0.1 to 10), with guide lines of constant Bytes/Flops ratio. Data series: nVidia Maxwell, 2014-15; nVidia Kepler, 2012-13; Intel CPU, 2014; NEC SX, 199x.]

RoofLine model
S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52:65–76, 2009.
L. Barba, R. Yokota. How Will the Fast Multipole Method Fare in the Exascale Era?
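The roofline bound itself is a one-line formula: attainable performance is the smaller of the compute peak and the memory bandwidth multiplied by operational intensity. A minimal sketch follows; the device numbers (`PEAK`, `BANDWIDTH`) are hypothetical placeholders, not values read from the slides.

```python
def roofline(peak_flops, bandwidth_bytes, intensity):
    """Attainable performance (Flop/s) for a given operational intensity (Flop/byte)."""
    return min(peak_flops, bandwidth_bytes * intensity)

# Hypothetical device, chosen only for illustration.
PEAK = 5.0e12        # 5 TFLOP/s compute peak
BANDWIDTH = 250.0e9  # 250 GB/s memory bandwidth
RIDGE = PEAK / BANDWIDTH  # intensity at which both limits coincide

for intensity in (0.5, 4.0, RIDGE, 100.0):
    perf = roofline(PEAK, BANDWIDTH, intensity)
    regime = "memory-bound" if intensity < RIDGE else "compute-bound"
    print(f"{intensity:6.1f} Flop/byte -> {perf / 1e12:5.3f} TFLOP/s ({regime})")
```

Below the ridge point, raising operational intensity raises attainable performance proportionally, which is why the rest of the talk concentrates on data reuse.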

Wave Modeling Specifics
$$\frac{\partial^2 F}{\partial t^2} = c^2\left(\frac{\partial^2 F}{\partial x^2} + \frac{\partial^2 F}{\partial y^2} + \frac{\partial^2 F}{\partial z^2}\right) \quad \text{(+BCs + ICs)}$$
Finite difference along each axis:
$$\left.\frac{\partial^2 F}{\partial x^2}\right|_{x_0,y_0,z_0} = \frac{1}{\Delta x^2}\sum_{i=0}^{N_O/2} C_i\left(F|_{x_0+i\Delta x,\,y_0,\,z_0} + F|_{x_0-i\Delta x,\,y_0,\,z_0}\right)$$
$N_O = 2$ for $\partial^2 F/\partial t^2$; $N_O = 2, 4, 6, \ldots, 14$ for coordinate axes.
[Figure: cone $x^2+y^2+z^2=c^2t^2$ in (x, y, t) axes, marking the domain of influence, the domain of dependence, the asynchronous domains, and the synchronization instant.]
Per one cell, per one time step calculation:
◮ $O = 1 + 3N_O$ FMA operations
◮ $D = 3 + 3N_O$ data
Operational intensity: $O/D \sim 1/2$ Flop/byte (naïve algorithm).
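The symmetric finite-difference sum can be tried out on a 1D array. The sketch below is an illustration, not the authors' code: the coefficient table `COEFFS` holds the standard central-difference weights rewritten for the slide's form, where the i = 0 term counts the central value twice, so C_0 is half the usual central weight. Only orders 2 and 4 are filled in.

```python
# Half-stencil coefficients C_i for
#   d2F/dx2 ~ (1/dx^2) * sum_{i=0}^{N_O/2} C_i * (F[j + i] + F[j - i]).
# Assumption: standard central-difference weights, with C_0 halved because
# the i = 0 term adds F[j] twice.
COEFFS = {
    2: [-1.0, 1.0],
    4: [-5.0 / 4.0, 4.0 / 3.0, -1.0 / 12.0],
}

def second_derivative(f, j, dx, order):
    """Approximate the second derivative of samples f at index j."""
    c = COEFFS[order]
    total = 0.0
    for i in range(order // 2 + 1):
        total += c[i] * (f[j + i] + f[j - i])
    return total / (dx * dx)

# Check on F(x) = x^2, whose second derivative is exactly 2.
dx = 0.5
samples = [(i * dx) ** 2 for i in range(9)]
for order in (2, 4):
    print(order, second_derivative(samples, 4, dx, order))
```

The cross-shaped 3D stencil applies the same sum independently along x, y and z.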

Wave Modeling Specifics (same wave equation and finite-difference stencil as on the previous slide)
[Figure: cone $x^2+y^2+z^2=c^2t^2$ in (x, y, t) axes, marking the domain of influence, the domain of dependence, the asynchronous domains, and the synchronization instant.]
Cross-shaped stencil fits into diamond shape.

Wave equation modelling Computational Grid projection to (x–t)

Wave equation modelling

Traditional stepwise evaluation order

Traditional stepwise evaluation order
Overlapping stencils increase operational intensity:
◮ $O = 1 + 3N_O$ FMA operations
◮ $D = 3$ data
Operational intensity: $O/D \sim (1 + N_O)$ Flop/byte
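The effect of sharing loads between overlapping stencils can be checked with simple counting. The sketch below uses only the per-cell counts quoted on the slides and reports ratios in FMA operations per data item; converting to Flop/byte would need extra assumptions (flops per FMA, bytes per value) that are not made here. The function names are illustrative.

```python
def fma_ops(order):
    """FMA operations per cell per time step: O = 1 + 3*N_O."""
    return 1 + 3 * order

def data_naive(order):
    """Data items per cell when every stencil point is fetched: D = 3 + 3*N_O."""
    return 3 + 3 * order

def data_overlapping():
    """Data items per cell when neighbouring stencils share loads: D = 3."""
    return 3

for order in (2, 4, 6, 14):
    naive = fma_ops(order) / data_naive(order)
    shared = fma_ops(order) / data_overlapping()
    print(f"N_O={order:2d}: naive {naive:.2f}, overlapping {shared:.2f}, "
          f"gain x{shared / naive:.0f}")
```

With these counts the gain over the naive ratio is (3 + 3 N_O)/3 = 1 + N_O for every order.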

RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot. Horizontal axis: localization parameter, cell calculations/(data loads + stores), 0.1 to 10. Vertical axis: performance, 10^9 cells/sec, 10 to 1000. Rooflines: TitanZ, GTX 970. Marked points: naive, the best of stepwise, CUDA FDTD3d results.]

LRnLA method

LRnLA method
◮ Locality: take advantage of the memory subsystem hierarchy, from on-chip CPU cache up to disk and network
◮ Recursivity: application of the "divide et impera" strategy for any situation (computer architectures, numerical schemes, etc.)
◮ non-Locality: optimized for distributed computations
◮ Asynchrony: adaptable parallel computations on any level

Memory Subsystem Hierarchy for GPGPU and CPU
[Figure: data throughput (B/sec, 10^9 to 10^14) versus data set size (B, 1K to 1T) for GK110 (GTX Titan), Haswell (Xeon E5 v3) and GM204 (GTX 980). Levels shown: registers, L1 + shared memory (L1 on CPU), L2, GDDR5 on the GPUs, LLC and DDR4 on the CPU, then SSD/PCIe and HDD.]

DiamondTile based algorithm construction Computational grid in x-y and x-t projections

DiamondTile based algorithm construction
Computational domain is subdivided into diamond-shaped tiles in x-y.
◮ Diamond encloses cross-shaped stencil
◮ All elements along 3rd (z) axis are included

DiamondTile based algorithm construction
❖ Choose a DiamondTile on the first time step
❖ Plot the influence cone of the first tile
❖ Choose a shifted DiamondTile on another time step (Nt steps later)
❖ Plot the dependence cone of the last tile
❖ Find the intersection
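The construction steps above can be sketched in the x-t projection, where a tile is an inclusive interval of cells. This is a 1D simplification with assumptions that are not on the slide: the shift of the last tile is taken as Nt times the stencil half-width, and the names `tower_slices` and `half_width` are illustrative.

```python
def influence_cone(tile, step, half_width):
    """Cells at time `step` that can be affected by `tile` given at step 0."""
    lo, hi = tile
    return (lo - step * half_width, hi + step * half_width)

def dependence_cone(tile, steps_before, half_width):
    """Cells needed `steps_before` steps earlier in order to compute `tile`."""
    lo, hi = tile
    return (lo - steps_before * half_width, hi + steps_before * half_width)

def intersect(a, b):
    """Intersection of two inclusive intervals."""
    return (max(a[0], b[0]), min(a[1], b[1]))

def tower_slices(first_tile, nt, half_width):
    """Slice of the tower at each step 0..nt (assumed shift: nt * half_width)."""
    shift = nt * half_width
    last_tile = (first_tile[0] + shift, first_tile[1] + shift)
    return [intersect(influence_cone(first_tile, s, half_width),
                      dependence_cone(last_tile, nt - s, half_width))
            for s in range(nt + 1)]

for s, cells in enumerate(tower_slices((0, 7), 4, 1)):
    print(f"step {s}: cells {cells[0]}..{cells[1]}")
```

Under these assumptions every slice keeps the width of the base tile and moves by one half-width per step, so the intersection is a tilted prism whose slope is set by the stencil size.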

DiamondTorre Algorithm shape

Understand Algorithm as a shape Stepwise

Understand Algorithm as a shape Domain decomposition

Understand Algorithm as a shape More operational intensity

Understand Algorithm as a shape DiamondTorre

DiamondTorre Algorithm shape
◮ DiamondTorre tilt depends on stencil size
◮ Stencil width is determined by order of approximation ($N_O$)

DiamondTorre Algorithm parameters
Performance depends on careful choice of algorithm parameters:
◮ Size of DiamondTorre base (Diamond Tile Size, DTS)
◮ Number of time layers (Nt)
Operational Intensity ∼ DTS/(4-1/DTS) (for large Nt)
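To get a feel for how the estimate scales, the expression DTS/(4-1/DTS) can be evaluated for a few tile sizes. The sketch below evaluates the formula exactly as written on the slide, for the DTS values that appear in the roofline plots; the function name is illustrative and no hardware limits are modelled.

```python
def torre_intensity(dts):
    """Slide estimate for large Nt: DTS / (4 - 1/DTS)."""
    return dts / (4.0 - 1.0 / dts)

for dts in (1, 4, 7, 14, 20):
    print(f"DTS={dts:2d}: operational intensity ~ {torre_intensity(dts):.2f}")
```

For large DTS the estimate approaches DTS/4, so it grows roughly linearly with the tile size.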

RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot. Horizontal axis: localization parameter, cell calculations/(data loads + stores), 0.1 to 10. Vertical axis: performance, 10^9 cells/sec, 10 to 1000. Rooflines: TitanZ, GTX 970. Marked points: naive; the best of stepwise; DiamondTile, DTS=1; DiamondTorre for various DTS (DTS=1, 4, 7, 14, 20).]

DiamondTorre Algorithm with CUDA. In each tile, Nz elements (along the 3rd (z) axis) are processed asynchronously by CUDA threads in a block. First stage.

DiamondTorre Algorithm with CUDA. In each tile, Nz elements (along the 3rd (z) axis) are processed asynchronously by CUDA threads in a block. Second stage.

DiamondTorre Algorithm with CUDA. In each tile, Nz elements (along the 3rd (z) axis) are processed asynchronously by CUDA threads in a block. Odd and even stages alternate, with synchronization after each stage.

DiamondTorre Algorithm with CUDA
❖ At first, some of the cells remain on the first time step, while others are advanced by several time layers

DiamondTorre Algorithm with CUDA
❖ At the end, all data are advanced to a given time step. This time step is determined by the DiamondTorre height
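The progression described above can be imitated in one dimension, where a tower becomes a tilted strip in x-t. The sketch below is a serial 1D analogue, not the authors' CUDA implementation, and rests on several assumptions: a second-order leapfrog scheme (N_O = 2), towers that tilt left by one cell per step and are processed left to right, fixed zero boundaries, and all time levels kept in memory for clarity. `k` stands for the squared Courant number.

```python
def update(u, s, i, k):
    """Leapfrog update of cell i: level s+1 from levels s and s-1."""
    u[s + 1][i] = (2.0 * u[s][i] - u[s - 1][i]
                   + k * (u[s][i + 1] - 2.0 * u[s][i] + u[s][i - 1]))

def make_levels(n, nt):
    """Levels 0..nt+1; levels 0 and 1 hold a small pulse, the ends stay 0."""
    u = [[0.0] * n for _ in range(nt + 2)]
    for level in (0, 1):
        u[level][n // 2] = 1.0
        u[level][n // 2 + 1] = 0.5
    return u

def run_stepwise(n, nt, k):
    """Traditional order: one full sweep over the grid per time step."""
    u = make_levels(n, nt)
    for s in range(1, nt + 1):
        for i in range(1, n - 1):
            update(u, s, i, k)
    return u[nt + 1]

def run_towers(n, nt, k, width):
    """Tower order: each tilted tower is advanced nt steps before the next one."""
    u = make_levels(n, nt)
    towers = (n - 1 + nt + width - 1) // width  # enough towers to cover the tilt
    for t in range(towers):                     # towers from left to right
        for s in range(1, nt + 1):              # climb nt steps inside one tower
            lo = max(1, t * width - s)          # slice moves left 1 cell per step
            hi = min(n - 1, (t + 1) * width - s)
            for i in range(lo, hi):
                update(u, s, i, k)
    return u[nt + 1]

same = run_towers(32, 10, 0.25, 4) == run_stepwise(32, 10, 0.25)
print("tower order matches stepwise order:", same)
```

After the first tower finishes, cells on the left have already reached the final level while the rest of the grid still holds the initial data; once all towers are done, every cell sits at the level set by the tower height `nt`.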

RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot. Horizontal axis: localization parameter, cell calculations/(data loads + stores), 0.1 to 10. Vertical axis: performance, 10^9 cells/sec, 10 to 1000. Rooflines: TitanZ, GTX 970. Marked points: naive; the best of stepwise; CUDA FDTD3d results; DiamondTile, DTS=1; DiamondTorre for various DTS (DTS=1, 4, 7, 14, 20).]

[Figure: calculation rate (Gcells/sec, 0 to 60) on GTX 750Ti, GTX 970 and TitanZ (1) for various scheme/algorithm parameters, NO/DTS (NO from 2 to 14, DTS from 1 to 7).]

[Figure: calculation rate (Gcells/sec, 0.01 to 100) versus parallel level (warps, 0.01 to 1000). Series: TitanZ, GTX970. Reference levels: FDTD3d TitanZ rate, FDTD3d GTX970 rate, FDTD3d CPU rate with -O3, FDTD3d CPU rate.]

Wave Modeling Applications FDTD simulation for electromagnetics (2nd and 4th order approximation, PML) (Zakirov A., Goryachev I.)

Wave Modeling Applications Gas Dynamics with RKDG scheme (Korneev B.)

Wave Modeling Applications FDTD simulation for elastic seismic media (Levander scheme, 4th order, PML, Thomsen anisotropy, TFSF source) (Levchenko V., Zakirov A., Perepelkina A., Ivanov A.)

Wave Modeling Applications Particle-in-cell plasma kinetics (Levchenko V., Perepelkina A., Goryachev I.)

Main Results and Conclusions
◮ New DiamondTile algorithms of the LRnLA family are developed for wave modeling. The algorithms are efficient on the memory and parallelism models of CUDA GPGPU.
◮ Unlike the traditional stepwise evaluation order, data dependencies are traced across many time iteration steps. This increases operational intensity and makes higher calculation rates attainable.
◮ Performance of 50-60 billion cells/s is achieved with Titan, as well as with GTX970, in the implementation of wave modeling.
