  1. A Technical Overview of PyFR F.D. Witherden Department of Ocean Engineering, Texas A&M University

  2. Why Go High-Order?
     • Greater resolving power per degree of freedom (DOF), and thus fewer overall DOFs for the same accuracy.
     • Tight coupling between DOFs inside an element reduces indirection and saves memory bandwidth.

  3. Flux Reconstruction
     • Our high-order method of choice is the flux reconstruction (FR) scheme of Huynh.
     • It is both unifying and capable of operating effectively on mixed unstructured grids.

  4. PyFR: Python + Flux Reconstruction

  5. PyFR
     • Features.

       Governing equations      Compressible and incompressible Navier-Stokes
       Spatial discretisation   Arbitrary order flux reconstruction on mixed
                                unstructured grids (tris, quads, hexes, tets,
                                prisms, and pyramids)
       Temporal discretisation  Adaptive explicit Runge-Kutta schemes
       Precision                Single, double
       Sub-grid scale models    None
       Platforms                CPU and Xeon Phi clusters, NVIDIA GPU clusters,
                                AMD GPU clusters

  6. PyFR
     • High level structure:
       • Python outer layer (hardware independent): setup, data, and distributed memory parallelism; the outer loop calls hardware-specific kernels.
       • Matrix multiply kernels: call GEMM.
       • Point-wise nonlinear kernels: flux functions, Riemann solvers, interpolation/extrapolation, etc.; templates are passed through the Mako templating engine to generate hardware-specific C/OpenMP, OpenCL, and CUDA kernels.
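     As a hedged illustration of the templating step, the sketch below renders a trivial point-wise kernel with Mako; the kernel body, name, and parameters are illustrative assumptions, not PyFR's actual templates.

       # A minimal sketch of Mako-based kernel generation. The template text,
       # kernel name, and parameters are illustrative, not PyFR's templates.
       from mako.template import Template

       kernel_tpl = Template('''
       void scale(int n, ${dtype} *x)
       {
           for (int i = 0; i < n; i++)
               x[i] *= ${factor};
       }
       ''')

       # Render a C/OpenMP variant; a CUDA or OpenCL variant would be
       # produced from an analogous template for that backend.
       print(kernel_tpl.render(dtype='double', factor='0.5'))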

  7. PyFR • Enables heterogeneous computing from a homogeneous code base.

  8. PyFR
     • PyFR can scale up to leadership-class DOE machines and was shortlisted for the 2016 Gordon Bell Prize.

  9. Implementing FR Efficiently
     1. Use non-blocking communication primitives.
     2. Arrange data in a cache- and vectorisation-friendly manner.
     3. Cast key kernels as performance primitives.

  10. Non-Blocking Communication
      • Time to solution is heavily impacted by the parallel scaling of a code.
      • This, in turn, is influenced by the amount of communication performed at each time step.

  11. Non-Blocking Communication

  12. Non-Blocking Communication
      • If a code is to strong scale, it is hence essential for it to overlap communication with computation.

  13. Non-Blocking Communication
      • (Figure: blocking timeline along t: MPI Recv, then Compute A, B, C, and D, then MPI Send; communication and computation are serialised.)

  14. Non-Blocking Communication
      • (Figure: non-blocking timeline along t: MPI IRecv and MPI ISend are posted up front; Compute A, C, and D proceed while messages are in flight, with an MPI Wait before the dependent Compute B.)
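      As a hedged illustration of this pattern, the mpi4py sketch below overlaps a halo exchange with independent computation; the neighbour rank, buffer sizes, and the split into halo-independent and halo-dependent work are illustrative assumptions.

        # A minimal mpi4py sketch of overlapping communication with
        # computation. Neighbour choice and array sizes are illustrative.
        import numpy as np
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        nbr = (comm.rank + 1) % comm.size

        halo_send = np.ones(1024)
        halo_recv = np.empty(1024)

        # Post the exchange up front (slide 14) instead of blocking (slide 13)
        reqs = [comm.Irecv(halo_recv, source=nbr),
                comm.Isend(halo_send, dest=nbr)]

        interior = halo_send.sum()   # work that needs no halo data (A, C, D)

        MPI.Request.Waitall(reqs)    # wait only before the dependent step
        boundary = halo_recv.sum()   # work that consumes the halo (B)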

  15. Non-Blocking Communication

  16. Implementing FR Efficiently
      1. Use non-blocking communication primitives.
      2. Arrange data in a cache- and vectorisation-friendly manner.
      3. Cast key kernels as performance primitives.

  17. Data Layouts
      • FR is very often a memory-bandwidth bound algorithm.
      • It is therefore vital that a code arranges its data in a way which enables us to extract a high fraction of peak bandwidth.

  18. Data Layouts
      • Three main layouts:
        • AoS
        • SoA
        • AoSoA(k)

  19. Data Layouts: AoS

      struct
      {
          float rho;
          float rhou;
          float E;
      } data[NELES];

  20. Data Layouts: AoS
      • (Figure: memory layout with rho, rhou, and E interleaved element by element.)
      • Cache and TLB friendly.
      • Difficult to vectorise.

  21. Data Layouts: SoA

      struct
      {
          float rho[NELES];
          float rhou[NELES];
          float E[NELES];
      } data;

  22. Data Layouts: SoA
      • (Figure: memory layout with each of rho, rhou, and E stored contiguously across all elements.)
      • Trivial to vectorise.
      • Can put pressure on TLB and/or hardware pre-fetchers.

  23. Data Layouts: AoSoA(k = 2)

      struct
      {
          float rho[k];
          float rhou[k];
          float E[k];
      } data[NELES / k];

  24. Data Layouts: AoSoA(k = 2)
      • (Figure: memory layout with fields interleaved in blocks of k elements.)
      • Can be vectorised efficiently for suitable k.
      • Cache and TLB friendly.

  25. Data Layouts: AoSoA(k = 2)
      • The ideal 'Goldilocks' solution, albeit at the cost of messy indexing (sketched below), and compilers require coaxing to vectorise it.
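      To make the indexing concrete, here is a small numpy sketch of AoSoA(k) addressing; the sizes and the (rho, rhou, E) variable ordering are illustrative assumptions.

        # A small numpy sketch of AoSoA(k) addressing; sizes and variable
        # ordering are illustrative assumptions.
        import numpy as np

        neles, nvars, k = 8, 3, 2

        # AoSoA storage: [block][variable][lane], with k elements per block
        aosoa = np.arange(neles * nvars, dtype=np.float32).reshape(
            neles // k, nvars, k)

        def lookup(e, v):
            # The 'messy indexing': variable v of element e
            return aosoa[e // k, v, e % k]

        print(lookup(5, 0))   # rho of element 5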

  26. Data Layouts: AoSoA(k) Results
      • FR with SoA vs FR with AoSoA on an Intel KNL.
      • (Figure: bar chart of time per DOF per RK stage / ns, from 0 to 12, for p = 1 through p = 4.)

  27. Implementing FR Efficiently
      1. Use non-blocking communication primitives.
      2. Arrange data in a cache- and vectorisation-friendly manner.
      3. Cast key kernels as performance primitives.

  28. Performance Primitives
      • On modern hardware it can be extremely difficult to extract a high percentage of peak FLOP/s in otherwise compute-bound kernels.
      • To this end it is important, where possible, to cast operations in terms of performance primitives.

  29. Performance Primitives
      • Have data at one set of points within an element and want to interpolate it to another set of points.
      • The interpolation weights form a constant operator matrix M.

  30. Performance Primitives
      • This operation can be recognised as a matrix-vector product (GEMV): u = Mv.
      • If we are working in transformed space then M is the same for all elements…
      • …so, stacking the per-element vectors as columns of V, the whole operation can be recognised as a matrix-matrix product (GEMM): U = MV.
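      A small numpy sketch of this batched interpolation as a single GEMM; the point counts and element count are illustrative assumptions.

        # A small numpy sketch of casting per-element interpolation as one
        # GEMM; nupts, nfpts, and neles are illustrative sizes.
        import numpy as np

        nupts, nfpts, neles = 8, 12, 10000

        M = np.random.rand(nfpts, nupts)   # constant operator matrix
        V = np.random.rand(nupts, neles)   # one column per element

        # One GEMM replaces neles GEMVs: U[:, e] = M @ V[:, e]
        U = M @ V                          # shape (nfpts, neles)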

  31. Performance Primitives
      • Both GEMV and GEMM are performance primitives, and optimised implementations are readily available from vendor BLAS libraries.
      • These routines can perform an order of magnitude better than hand-rolled routines.

  32. Performance Primitives
      • In FR the operator matrix M can sometimes be sparse.
      • This calls for more specialised primitives, such as those found in GiMMiK and libxsmm, which account for the size/sparsity of FR operators.
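      As a toy illustration of the idea behind such kernels, the sketch below unrolls a small sparse M into straight-line C source with its nonzeros baked in; it shows the approach only and is not GiMMiK's or libxsmm's actual API.

        # A toy sketch of GiMMiK-style code generation: bake the nonzeros of
        # a small sparse operator into generated C. Not the real GiMMiK API.
        import numpy as np

        def generate_mm_src(M, tol=1e-12):
            rows = []
            for i, row in enumerate(M):
                terms = [f'{a}*b[{j}*ldb + e]'
                         for j, a in enumerate(row) if abs(a) > tol]
                rows.append(f'        c[{i}*ldc + e] = '
                            + (' + '.join(terms) or '0.0') + ';')
            return ('void mm(int n, const double *b, int ldb,\n'
                    '        double *c, int ldc)\n{\n'
                    '    for (int e = 0; e < n; e++)\n    {\n'
                    + '\n'.join(rows) + '\n    }\n}\n')

        M = np.array([[2.0, 0.0], [0.0, 0.0], [1.0, 3.0]])
        print(generate_mm_src(M))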

  33. Summary
      • Use non-blocking communication primitives.
      • Arrange data in a cache- and vectorisation-friendly manner.
      • Cast key kernels as performance primitives.
