DiamondTile Algorithm for High-Performance Wave Modeling

Vadim Levchenko, Anastasia Perepelkina. Keldysh Institute of Applied Mathematics RAS. GTC 2015.


  1. DiamondTile Algorithm for High-Performance Wave Modeling. Vadim Levchenko, Anastasia Perepelkina. Keldysh Institute of Applied Mathematics RAS. GTC 2015.

  2. FLOPs and Bandwidth Performance Ratio. [Figure: memory bandwidth (GB/s, 10 to 1000) versus peak performance (TFLOP/s fp32, 0.1 to 10) for NEC SX (199x), Intel CPU (2014), nVidia Kepler (2012-13), and nVidia Maxwell (2014-15), with guide lines of constant Bytes/Flops ratio.]

  3. RoofLine model. S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52:65–76, 2009. L. Barba, R. Yokota. How Will the Fast Multipole Method Fare in the Exascale Era?
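As a minimal sketch of the roofline bound cited on this slide: attainable performance is the smaller of the compute peak and the product of operational intensity and memory bandwidth. The device numbers below are hypothetical placeholders, not values read from the slides.

```python
def roofline(intensity, peak_flops, bandwidth):
    """Attainable FLOP/s for a kernel with the given intensity (FLOP/byte)."""
    return min(peak_flops, intensity * bandwidth)

# Hypothetical device (placeholder values, not taken from the presentation).
PEAK = 5.0e12   # peak compute rate, FLOP/s
BW = 250.0e9    # memory bandwidth, byte/s

# Intensity at which a kernel stops being memory-bound.
ridge = PEAK / BW

for oi in (0.5, 2.0, ridge, 40.0):
    bound = roofline(oi, PEAK, BW)
    regime = "memory-bound" if oi < ridge else "compute-bound"
    print(f"intensity {oi:6.2f} FLOP/byte -> {bound:.3e} FLOP/s ({regime})")
```

Below the ridge point the bound grows linearly with intensity, which is why the later slides focus on raising the number of cell calculations per data load and store.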

  4. Wave Modeling Specifics. Wave equation: $\frac{\partial^2 F}{\partial t^2} = c^2 \left( \frac{\partial^2 F}{\partial x^2} + \frac{\partial^2 F}{\partial y^2} + \frac{\partial^2 F}{\partial z^2} \right)$ (+BCs + ICs). Finite difference along each axis: $\left. \frac{\partial^2 F}{\partial x^2} \right|_{x_0, y_0, z_0} = \frac{1}{\Delta x^2} \sum_{i=0}^{N_O/2} C_i \left( F|_{x_0 + i \Delta x, y_0, z_0} + F|_{x_0 - i \Delta x, y_0, z_0} \right)$, with $N_O = 2$ for $\partial^2 F / \partial t^2$ and $N_O = 2, 4, 6, \ldots, 14$ for the coordinate axes. [Figure: cone $x^2 + y^2 + z^2 = c^2 t^2$ in $(x, y, t)$ axes, showing the domain of influence, the domain of dependence, the asynchronous domains, and the synchronization instant.] Per one cell, per one time step calculation: ◮ $O = 1 + 3 N_O$ FMA operations ◮ $D = 3 + 3 N_O$ data. Operational intensity: $O / D \sim 1/2$ Flop/byte (naïve algorithm).
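The symmetric finite-difference formula on this slide can be sketched in one dimension as follows. The coefficient tables are the standard central-difference weights for orders 2 and 4, rewritten for the slide's form, in which the $i = 0$ term counts $F(x_0)$ twice (so $C_0$ is half the usual central weight). The function and table names are illustrative assumptions.

```python
import math

# C_i for: d2F/dx2 ~ (1/dx^2) * sum_{i=0}^{N_O/2} C_i * (F(x0+i*dx) + F(x0-i*dx)).
# Derived from the standard weights (1, -2, 1) and (-1/12, 4/3, -5/2, 4/3, -1/12).
COEFFS = {
    2: [-1.0, 1.0],
    4: [-5.0 / 4.0, 4.0 / 3.0, -1.0 / 12.0],
}

def second_derivative(f, x0, dx, order):
    """Approximate f''(x0) with a symmetric stencil of the given order N_O."""
    total = 0.0
    for i, c in enumerate(COEFFS[order]):
        total += c * (f(x0 + i * dx) + f(x0 - i * dx))
    return total / dx**2

x0, dx = 0.7, 0.01
exact = -math.sin(x0)  # second derivative of sin
err2 = abs(second_derivative(math.sin, x0, dx, 2) - exact)
err4 = abs(second_derivative(math.sin, x0, dx, 4) - exact)
print(f"order 2 error: {err2:.2e}, order 4 error: {err4:.2e}")
```

A wider stencil lowers the truncation error but touches more neighbours per cell, which is the trade-off behind the $N_O$-dependent operation and data counts on the slide.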

  5. Wave Modeling Specifics (equation, finite-difference stencil, and cone figure repeated from slide 4). Cross-shaped stencil fits into diamond shape.

  6. Wave equation modelling Computational Grid projection to (x–t)

  7. Wave equation modelling Computational Grid projection to (x–t)

  8. Wave equation modelling

  9. Wave equation modelling

  10. Wave equation modelling

  11. Traditional stepwise evaluation order

  12. Traditional stepwise evaluation order

  13. Traditional stepwise evaluation order

  14. Traditional stepwise evaluation order. Overlapping stencils increase operational intensity: ◮ $O = 1 + 3 N_O$ FMA operations ◮ $D = 3$ data. Operational intensity: $O / D \sim (1 + N_O)$ Flop/byte.
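The per-cell counts quoted for the naïve and stepwise orders can be tabulated with a short sketch. It reports the raw ratio of FMA operations to data elements only; converting that ratio to Flop/byte depends on the data width and on how an FMA is counted, which the slides do not spell out, so no conversion is attempted here. Function names are illustrative.

```python
def naive_counts(n_o):
    """Per cell and time step, no reuse: O = 1 + 3*N_O FMA, D = 3 + 3*N_O data."""
    return 1 + 3 * n_o, 3 + 3 * n_o

def stepwise_counts(n_o):
    """Per cell and time step with overlapping stencils: same O, but D = 3 data."""
    return 1 + 3 * n_o, 3

for n_o in range(2, 15, 2):
    o_naive, d_naive = naive_counts(n_o)
    o_step, d_step = stepwise_counts(n_o)
    print(f"N_O={n_o:2d}  naive O/D={o_naive / d_naive:.2f}  "
          f"stepwise O/D={o_step / d_step:.2f}")
```

The naïve ratio stays below one for every order, while the stepwise ratio grows roughly linearly with $N_O$, matching the trend stated on the slides.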

  15. RoofLine Model for Wave Equation on GPGPU. [Figure: roofline plot of performance (10^9 cells/sec, 10 to 1000) versus localization parameter (cells calculations/(data loads+stores), 0.1 to 10) for TitanZ and GTX 970, marking the naive algorithm, the best of stepwise, and CUDA FDTD3d results.]

  16. LRnLA method

  17. LRnLA method

  18. LRnLA method ◮ Locality: take advantage of the memory subsystem hierarchy, from on-chip CPU cache up to disk and network ◮ Recursivity: application of the “divide et impera” strategy in any situation (computer architectures, numerical schemes, etc.) ◮ non-Locality: optimized for distributed computations ◮ Asynchrony: adaptable parallel computations at any level

  19. Memory Subsystem Hierarchy for GPGPU and CPU. [Figure: data throughput (B/sec, 10^9 to 10^14) versus data set size (B, 1K to 1T) for GK110 (GTX Titan), GM204 (GTX 980), and Haswell (Xeon E5 v3). Levels shown: regs, L1+sh, L2, and GDDR5 for the GPUs; regs, L1, L2, LLC, and DDR4 for the CPU; then SSD/PCIe and HDD.]

  20. DiamondTile based algorithm construction Computational grid in x-y and x-t projections

  21. DiamondTile based algorithm construction Computational domain is subdivided into Diamond shaped tiles in x-y. ◮ Diamond encloses cross-shaped stencil ◮ All elements along 3rd (z) axis are included

  22. DiamondTile based algorithm construction ❖ Choose a DiamondTile on first time-step

  23. DiamondTile based algorithm construction ❖ Choose a DiamondTile on first time-step ❖ Plot influence cone of first tile

  24. DiamondTile based algorithm construction ❖ Choose a DiamondTile on first time-step ❖ Plot influence cone of first tile ❖ Choose a shifted DiamondTile on another time-step (Nt steps later)

  25. DiamondTile based algorithm construction ❖ Choose a DiamondTile on first time-step ❖ Plot influence cone of first tile ❖ Choose a shifted DiamondTile on another time-step (Nt steps later) ❖ Plot dependence cone of last tile

  26. DiamondTile based algorithm construction ❖ Choose a DiamondTile on first time-step ❖ Plot influence cone of first tile ❖ Choose a shifted DiamondTile on another time-step (Nt steps later) ❖ Plot dependence cone of last tile ❖ Find intersection

  27. DiamondTorre Algorithm shape

  28. Understand Algorithm as a shape Stepwise

  29. Understand Algorithm as a shape Domain decomposition

  30. Understand Algorithm as a shape More operational intensity

  31. Understand Algorithm as a shape DiamondTorre

  32. DiamondTorre Algorithm shape ◮ DiamondTorre tilt depends on stencil size ◮ Stencil width is determined by order of approximation ($N_O$)

  33. DiamondTorre Algorithm parameters Performance depends on careful choice of algorithm parameters: ◮ Size of DiamondTorre base — Diamond Tile Size, DTS ◮ Quantity of time layers — Nt Operational Intensity ∼ DTS/(4-1/DTS) (for large Nt)
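As a quick numerical reading of the estimate DTS/(4-1/DTS) for large Nt, the sketch below evaluates it at the DTS values that appear in the roofline plots. It only evaluates the slide's expression; the function name is illustrative.

```python
def torre_intensity(dts):
    """Large-Nt operational intensity estimate from the slide: DTS / (4 - 1/DTS)."""
    return dts / (4.0 - 1.0 / dts)

for dts in (1, 4, 7, 14, 20):
    print(f"DTS={dts:2d}  intensity ~ {torre_intensity(dts):.2f}")
```

For large DTS the estimate approaches DTS/4, so the intensity grows almost linearly with the tile size.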

  34. RoofLine Model for Wave Equation on GPGPU. [Figure: roofline plot of performance (10^9 cells/sec, 10 to 1000) versus localization parameter (cells calculations/(data loads+stores), 0.1 to 10) for TitanZ and GTX 970, marking the naive algorithm, the best of stepwise, DiamondTile with DTS=1, and DiamondTorre for various DTS (1, 4, 7, 14, 20).]

  35. DiamondTorre Algorithm with CUDA In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block First stage

  36. DiamondTorre Algorithm with CUDA In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block Second stage

  37. DiamondTorre Algorithm with CUDA In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block Odd and even stages are alternating. Synchronization after each stage.

  38. DiamondTorre Algorithm with CUDA In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block Odd and even stages are alternating. Synchronization after each stage.

  39. DiamondTorre Algorithm with CUDA In each tile Nz elements (along 3rd (z) axis) are processed asynchronously by CUDA threads in a block Odd and even stages are alternating. Synchronization after each stage.

  40. DiamondTorre Algorithm with CUDA ❖ At first, some portion of cells remain on first time step, while some are processed to several time layers

  41. DiamondTorre Algorithm with CUDA ❖ At first, some portion of cells remain on first time step, while some are processed to several time layers

  42. DiamondTorre Algorithm with CUDA ❖ At first, some portion of cells remain on first time step, while some are processed to several time layers

  43. DiamondTorre Algorithm with CUDA ❖ At the end, all data are progressed to a given time step. This time step is determined by DiamondTorre height
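The evaluation order described on these slides can be illustrated with a serial one-dimensional analogue. This is a sketch under stated assumptions, not the authors' CUDA implementation: it uses a second-order leapfrog scheme with fixed ends, stores every time layer for clarity, tilts each tower one cell to the left per step, and processes towers left to right. The odd/even stage parallelism of the real algorithm is not modelled. Unfilled cells start as NaN so that any dependency violation would show up in the result.

```python
import math

def step_cell(U, t, i, r2):
    """Leapfrog update of cell i on layer t from layers t-1 and t-2."""
    U[t][i] = (2.0 * U[t - 1][i] - U[t - 2][i]
               + r2 * (U[t - 1][i + 1] - 2.0 * U[t - 1][i] + U[t - 1][i - 1]))

def make_layers(u0, u1, nt):
    """Layers 0 and 1 are initial data; later layers are NaN except fixed ends."""
    n = len(u0)
    U = [list(u0), list(u1)] + [[math.nan] * n for _ in range(nt)]
    for t in range(2, nt + 2):
        U[t][0] = U[t][n - 1] = 0.0
    return U

def stepwise(u0, u1, r2, nt):
    """Traditional order: finish a whole layer before starting the next."""
    U = make_layers(u0, u1, nt)
    for t in range(2, nt + 2):
        for i in range(1, len(u0) - 1):
            step_cell(U, t, i, r2)
    return U[-1]

def tower_order(u0, u1, r2, nt, width):
    """Tower order: each tilted tower is advanced nt layers before the next."""
    n = len(u0)
    U = make_layers(u0, u1, nt)
    n_towers = (n + nt) // width + 1     # enough towers to cover the top layer
    for k in range(n_towers):            # towers, left to right
        for s in range(1, nt + 1):       # layers inside one tower, bottom to top
            lo = max(1, k * width - s)   # tower shifts left by one cell per step
            hi = min(n - 1, (k + 1) * width - s)
            for i in range(lo, hi):
                step_cell(U, s + 1, i, r2)
    return U[-1]

n, nt = 64, 12
u0 = [math.exp(-0.05 * (i - n / 2) ** 2) for i in range(n)]
u0[0] = u0[-1] = 0.0
u1 = list(u0)
ref = stepwise(u0, u1, 0.25, nt)
out = tower_order(u0, u1, 0.25, nt, width=8)
print("max difference:", max(abs(p - q) for p, q in zip(ref, out)))
```

While the first towers are being processed, cells further to the right still sit on the initial layers; once the last tower finishes, every cell has reached layer nt, as the slides describe.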

  44. RoofLine Model for Wave Equation on GPGPU. [Figure: roofline plot of performance (10^9 cells/sec, 10 to 1000) versus localization parameter (cells calculations/(data loads+stores), 0.1 to 10) for TitanZ and GTX 970, marking the naive algorithm, the best of stepwise, CUDA FDTD3d results, DiamondTile with DTS=1, and DiamondTorre for various DTS (1, 4, 7, 14, 20).]

  45. [Figure: calculation rate (Gcells/sec, 0 to 60) for GTX 750Ti, GTX 970, and TitanZ (1) across various scheme/algorithm parameters NO/DTS: 2/1, 4/1, 6/1, 8/1, 10/1, 12/1, 14/1, 6/1, 6/2, 4/1, 4/2, 4/3, 2/1, 2/2, 2/3, 2/4, 2/5, 2/6, 2/7.]

  46. [Figure: calculation rate (Gcells/sec, 0.01 to 100) versus parallel level (warps, 0.01 to 1000) for TitanZ and GTX970, with reference levels for the FDTD3d TitanZ rate, FDTD3d GTX970 rate, FDTD3d CPU rate with -O3, and FDTD3d CPU rate.]

  47. Wave Modeling Applications FDTD simulation for electromagnetics (2nd and 4th order approximation, PML) (Zakirov A., Goryachev I.)

  48. Wave Modeling Applications Gas Dynamics with RKDG scheme (Korneev B.)

  49. Wave Modeling Applications FDTD simulation for elastic seismic media (Levander scheme, 4th order, PML, Thomsen anisotropy, TFSF source) (Levchenko V., Zakirov A., Perepelkina A., Ivanov A.)

  50. Wave Modeling Applications Particle-in-cell plasma kinetics (Levchenko V., Perepelkina A., Goryachev I.)

  51. Main Results and Conclusions ◮ New DiamondTile algorithms of the LRnLA family are developed for wave modeling. The algorithms are efficient on the memory and parallelism models of CUDA GPGPU. ◮ Unlike the traditional stepwise evaluation order, data dependencies are traced over many time iteration steps. This increases operational intensity and makes higher calculation rates reachable. ◮ Performance of 50-60 billion cells/s is achieved with Titan, as well as with GTX970, in the implementation of wave modeling.
