A Technical Overview of PyFR
F.D. Witherden, Department of Ocean Engineering, Texas A&M University


SLIDE 1

A Technical Overview of PyFR

F.D. Witherden Department of Ocean Engineering, Texas A&M University

SLIDE 2

Why Go High-Order?

  • Greater resolving power per degree of freedom (DOF), and thus fewer overall DOFs for the same accuracy.
  • Tight coupling between DOFs inside an element reduces indirection and saves memory bandwidth.
SLIDE 3

Flux Reconstruction

  • Our high-order method of choice is the flux reconstruction (FR) scheme of Huynh.
  • It is both unifying and capable of operating effectively on mixed unstructured grids.

SLIDE 4

PyFR

Python + Flux Reconstruction

SLIDE 5
  • Features.

PyFR

  • Governing equations: compressible and incompressible Navier–Stokes.
  • Spatial discretisation: arbitrary-order flux reconstruction on mixed unstructured grids (tris, quads, hexes, tets, prisms, and pyramids).
  • Temporal discretisation: adaptive explicit Runge–Kutta schemes.
  • Precision: single and double.
  • Sub-grid scale models: none.
  • Platforms: CPU and Xeon Phi clusters, NVIDIA GPU clusters, and AMD GPU clusters.

SLIDE 6
  • High-level structure.

PyFR

[Architecture diagram, summarised:]

  • Python outer layer (hardware independent): setup, distributed-memory parallelism, and an outer loop that calls hardware-specific kernels.
  • Matrix multiply kernels (data interpolation/extrapolation etc.): call GEMM.
  • Point-wise nonlinear kernels (flux functions, Riemann solvers etc.): generated by passing templates through a Mako-derived templating engine.
  • Hardware-specific kernel backends: CUDA, OpenCL, and C/OpenMP.
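As a loose illustration of the template-driven kernel generation described above (PyFR itself uses a Mako-derived engine; the template body, placeholder names, and kernel signature here are hypothetical), the stdlib's `string.Template` can stand in for the real engine:

```python
# Illustration only: render one point-wise kernel template for two
# different backends by substituting backend-specific tokens.
from string import Template

# Hypothetical kernel template; ${qualifier} and ${dtype} are the
# backend-dependent pieces.
kernel_tpl = Template("""
${qualifier} void rsolve(const ${dtype} *ul, const ${dtype} *ur, ${dtype} *f)
{
    /* ... flux function / Riemann solver body ... */
}
""")

# One template, two rendered sources.
cuda_src = kernel_tpl.substitute(qualifier="__global__", dtype="double")
openmp_src = kernel_tpl.substitute(qualifier="static inline", dtype="float")
```

The point is that the physics lives in one hardware-independent template, while the backend differences reduce to a handful of substituted tokens.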

SLIDE 7

PyFR

  • Enables heterogeneous computing from a homogeneous code base.

SLIDE 8

PyFR

  • PyFR can scale up to leadership-class DOE machines and was shortlisted for the 2016 Gordon Bell Prize.

SLIDE 9

Implementing FR Efficiently

  1. Use non-blocking communication primitives.
  2. Arrange data in a cache- and vectorisation-friendly manner.
  3. Cast key kernels as performance primitives.
SLIDE 10

Non-Blocking Communication

  • Time to solution is heavily impacted by the parallel scaling of a code.
  • This, in turn, is influenced by the amount of communication performed at each time step.

SLIDE 11

Non-Blocking Communication

SLIDE 12

Non-Blocking Communication

  • If a code is to strong scale it is hence essential for it to overlap communication with computation.

SLIDE 13

Non-Blocking Communication

[Timeline: Compute A → MPI Send → MPI Recv → Compute B → Compute C → Compute D]

SLIDE 14

Non-Blocking Communication

[Timeline: Compute A → MPI ISend / MPI IRecv → Compute C → Compute D → MPI Wait → Compute B]

SLIDE 15

Non-Blocking Communication

SLIDE 16

Implementing FR Efficiently

  1. Use non-blocking communication primitives.
  2. Arrange data in a cache- and vectorisation-friendly manner.
  3. Cast key kernels as performance primitives.
SLIDE 17

Data Layouts

  • FR is very often a memory-bandwidth-bound algorithm.
  • It is therefore vital that a code arranges its data in a way which enables us to extract a high fraction of peak bandwidth.

SLIDE 18

Data Layouts

  • Three main layouts:
  • AoS
  • SoA
  • AoSoA(k)
SLIDE 19

Data Layouts: AoS

struct { float rho; float rhou; float E; } data[NELES];

SLIDE 20
Data Layouts: AoS

  • Cache and TLB friendly.
  • Difficult to vectorise.

[Memory layout diagram]

SLIDE 21

Data Layouts: SoA

struct { float rho[NELES]; float rhou[NELES]; float E[NELES]; } data;

SLIDE 22
Data Layouts: SoA

  • Trivial to vectorise.
  • Can put pressure on the TLB and/or hardware pre-fetchers.

[Memory layout diagram]

SLIDE 23

Data Layouts: AoSoA(k = 2)

struct { float rho[k]; float rhou[k]; float E[k]; } data[NELES / k];

SLIDE 24

Data Layouts: AoSoA(k = 2)

  • Can be vectorised efficiently for suitable k.
  • Cache and TLB friendly.

[Memory layout diagram]

SLIDE 25

Data Layouts: AoSoA(k = 2)

  • The ideal ‘Goldilocks’ solution…
  • …albeit at the cost of messier indexing…
  • …and of having to coax compilers into vectorising it.
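The indexing trade-off can be made concrete with a small sketch (pure Python; the layout functions and constants are illustrative, not PyFR's actual indexing). Each function maps an (element, variable) pair to its offset in a flat array under the corresponding layout:

```python
# Flattened index of (element e, variable v) under each layout, for
# NELES elements, NVARS field variables, and AoSoA block size K.
NELES, NVARS, K = 8, 3, 2

def aos(e, v):
    return e * NVARS + v

def soa(e, v):
    return v * NELES + e

def aosoa(e, v):
    # Blocks of K elements; within a block each variable is stored
    # as a contiguous run of K values -- hence the messier arithmetic.
    return (e // K) * NVARS * K + v * K + e % K

# Sanity check: each layout is a bijection onto 0..NELES*NVARS-1.
for fn in (aos, soa, aosoa):
    idxs = {fn(e, v) for e in range(NELES) for v in range(NVARS)}
    assert idxs == set(range(NELES * NVARS))
```

Under AoSoA(K) the K values of one variable sit contiguously (e.g. `aosoa(0, 1)` and `aosoa(1, 1)` are adjacent), so a K-wide SIMD unit can load them in one read, while successive blocks stay close together in memory for the cache and TLB.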
SLIDE 26

Data Layouts: AoSoA(k) Results

  • FR with SoA vs. FR with AoSoA(k) on an Intel KNL.

[Bar chart: time per DOF per RK stage in ns, for p = 1 to 4]

SLIDE 27

Implementing FR Efficiently

  1. Use non-blocking communication primitives.
  2. Arrange data in a cache- and vectorisation-friendly manner.
  3. Cast key kernels as performance primitives.
SLIDE 28

Performance Primitives

  • On modern hardware it can be extremely difficult to extract a high percentage of peak FLOP/s in otherwise compute-bound kernels.
  • To this end it is important, where possible, to cast operations in terms of performance primitives.
SLIDE 29

Performance Primitives

  • Have data at one set of points and want to interpolate it to another set.

u = Mv
SLIDE 30

Performance Primitives

  • This operation can be recognised as a matrix-vector product (GEMV), u = Mv.
  • If we are working in transformed space then M is the same for all elements.
  • Stacking the per-element vectors as the columns of a matrix V, the whole operation can therefore be recognised as a single matrix-matrix product (GEMM), U = MV.
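A minimal sketch of this batched view (pure-Python matmul standing in for a BLAS GEMM call; the operator and values are illustrative, not a real FR operator):

```python
# U = M V: one GEMM interpolates every element at once, because in
# transformed space the operator M is shared by all elements.
def gemm(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

M = [[0.5, 0.5, 0.0],      # illustrative interpolation operator:
     [0.0, 0.5, 0.5]]      # 2 target points from 3 source points

V = [[1.0, 2.0],           # column j holds the 3 source-point values
     [3.0, 4.0],           # of element j (here, 2 elements)
     [5.0, 6.0]]

U = gemm(M, V)             # 2 interpolated points x 2 elements
```

Column j of U is exactly the per-element GEMV result Mv for element j, which is why a single large GEMM, handed to a vendor BLAS, replaces many small matrix-vector products.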

SLIDE 31

Performance Primitives

  • Both GEMV and GEMM are performance primitives, and optimised implementations are readily available from vendor BLAS libraries.
  • These routines can perform an order of magnitude better than hand-rolled alternatives.

SLIDE 32

Performance Primitives

  • In FR the operator matrix M can sometimes be sparse.
  • This calls for more specialised primitives, such as those found in GiMMiK and libxsmm, which account for the size/sparsity of FR operators.
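The idea behind such operator-specialised kernels, though not the actual API of GiMMiK or libxsmm, can be sketched as follows: since M is fixed and known ahead of time, a kernel "generated" for it can bake in the nonzero pattern and skip zero entries entirely:

```python
# Illustrative sparse operator (not a real FR matrix): 8 entries,
# only 3 of them nonzero.
M = [[2.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 3.0, 0.0]]

# "Generation" step: record the nonzero pattern once, up front.
nonzeros = [(i, j, a) for i, row in enumerate(M)
            for j, a in enumerate(row) if a != 0.0]

def spmv(v):
    # Specialised matrix-vector product: touches only the nonzeros.
    out = [0.0] * len(M)
    for i, j, a in nonzeros:
        out[i] += a * v[j]
    return out

result = spmv([1.0, 2.0, 3.0, 4.0])   # 3 multiplies instead of 8
```

A dense GEMV here would do 8 multiply-adds; the specialised kernel does 3, which is the kind of saving these libraries exploit when FR operators are sparse.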

SLIDE 33

Summary

  • Use non-blocking communication primitives.
  • Arrange data in a cache- and vectorisation-friendly manner.
  • Cast key kernels as performance primitives.