SLIDE 1

Experiences with OpenCL in PyFR: 2014—Present

F.D. Witherden¹ and P.E. Vincent²

¹Department of Ocean Engineering, Texas A&M University
²Department of Aeronautics, Imperial College London

SLIDE 2

Motivation · High-Order Methods · PyFR · OpenCL Backend · Future

SLIDE 3

Motivation

  • Computational fluid dynamics (CFD) is the bedrock of several high-tech industries.

SLIDE 4

Motivation

  • Interested in simulating unsteady, turbulent flows.
SLIDE 5

Motivation

  • Objective is to solve the Navier–Stokes equations in the vicinity of complex geometrical configurations.

[Figure: modelling fidelity spectrum, RANS → URANS → DES → LES → DNS]

SLIDE 6

Motivation · High-Order Methods · PyFR · OpenCL Backend · Future

SLIDE 7

High-Order Methods

  • Our choice of method is the high-order accurate Flux Reconstruction (FR) approach of Huynh.

  • Combines aspects of traditional finite volume (FVM) and finite element methods (FEM).

SLIDE 8

High-Order Methods

  • Consider a smooth function
SLIDE 9

High-Order Methods

  • In FVM we divide the domain into cells…
SLIDE 10

High-Order Methods

  • …and in each cell store the average of the function.
SLIDE 11

High-Order Methods

  • Cells are coupled via Riemann solves at the interfaces.
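To make the interface coupling concrete, here is a minimal sketch (illustrative only, not PyFR's implementation) of a Riemann-type interface flux: a Rusanov/local Lax–Friedrichs flux for 1D linear advection. The function and argument names are hypothetical.

```python
def rusanov_flux(uL, uR, a=1.0):
    """Rusanov (local Lax-Friedrichs) flux for u_t + a*u_x = 0.

    uL, uR are the states either side of the interface; the flux is
    the average of the physical fluxes plus an upwinding penalty
    scaled by the maximum signal speed.
    """
    fL, fR = a*uL, a*uR    # physical flux f(u) = a*u
    smax = abs(a)          # maximum wave speed at the interface
    return 0.5*(fL + fR) - 0.5*smax*(uR - uL)
```

For a positive wave speed this reduces to pure upwinding on the left state: `rusanov_flux(2.0, 5.0, a=1.0)` gives `2.0`.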
SLIDE 12

High-Order Methods

  • In FR we divide the domain into elements…
SLIDE 13

High-Order Methods

  • …and in each element store a discontinuous interpolating polynomial of degree p.
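As a toy numpy illustration of this representation (the point choice and names are illustrative, not PyFR's data structures): each element's degrees of freedom are the function values at p + 1 solution points, which define a degree-p interpolating polynomial over the element.

```python
import numpy as np

p = 3
xi = np.cos(np.pi*np.arange(p + 1)/p)   # p + 1 solution points on [-1, 1]
u = np.sin(xi)                          # DOFs: function values at the points

# The element's polynomial is recovered by degree-p interpolation
coeffs = np.polyfit(xi, u, p)
approx = np.polyval(coeffs, 0.3)        # evaluate anywhere in the element
```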

SLIDE 14

High-Order Methods

  • As before, elements are coupled via Riemann solves.
SLIDE 15

High-Order Methods

  • Greater resolving power per degree of freedom (DOF)…
  • …and thus fewer overall DOFs for same accuracy.
  • Tight coupling between DOFs inside of an element…
  • …reduces indirection and saves memory bandwidth.
SLIDE 16

High-Order Methods

  • Direct extension into 2D and 3D.
SLIDE 17

High-Order Methods

  • Most operations can be cast as matrix-matrix products.
  • Element type determines whether the operation is sparse or dense.
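A small numpy sketch of why these operations cast as GEMM (shapes and names are illustrative): one constant operator matrix is applied to the DOFs of every element of a given type at once.

```python
import numpy as np

nspts, nfpts, nele = 8, 12, 1000        # solution points, flux points, elements
M = np.random.rand(nfpts, nspts)        # constant interpolation operator
u = np.random.rand(nspts, nele)         # one column of DOFs per element

uf = M @ u                              # a single GEMM covers all elements
```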
SLIDE 18

Motivation · High-Order Methods · PyFR · OpenCL Backend · Future

SLIDE 19

PyFR

Python + Flux Reconstruction

SLIDE 20
  • High-level structure.

PyFR

Python Outer Layer (Hardware Independent)

  • Setup
  • Distributed memory parallelism
  • Outer loop calls hardware specific kernels
SLIDE 21

PyFR

  • Need to generate hardware specific kernels.
SLIDE 22

PyFR

Python Outer Layer (Hardware Independent)

  • Matrix multiply kernels: data interpolation/extrapolation etc.
  • Point-wise nonlinear kernels: flux functions, Riemann solvers etc.

  • In FR two types of kernel are required.
SLIDE 23

PyFR

  • Matrix multiply kernels: call GEMM.

  • Matrix multiplications are quite simple.
SLIDE 24

PyFR

  • Point-wise nonlinear kernels: pass templates through a Mako-derived templating engine.

  • For the point-wise nonlinear kernels we use a DSL.
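The gist of the templating step can be sketched without Mako (a dependency-free stand-in for illustration; `gen_velocity_code` is a hypothetical name): runtime parameters such as the number of dimensions are burned into straight-line C source at generation time.

```python
def gen_velocity_code(ndims):
    """Unroll the velocity computation for a given number of dimensions."""
    lines = [f'fpdtype_t rhov[{ndims}];']
    for i in range(ndims):
        lines.append(f'rhov[{i}] = s[{i + 1}];')
        lines.append(f'v[{i}] = invrho*rhov[{i}];')
    return '\n'.join(lines)

src = gen_velocity_code(2)   # C source specialised for 2D at start-up
```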
SLIDE 25

PyFR

  • Templates are rendered into hardware-specific CUDA, OpenCL, and C/OpenMP kernels.

  • Kernels are generated and compiled at start-up.
SLIDE 26

PyFR


  • Which may then be called by the outer layer.
SLIDE 27

PyFR

  • An example.

<%namespace module='pyfr.backends.base.makoutil' name='pyfr'/>
<%pyfr:macro name='inviscid_flux' params='s, f, p, v'>
fpdtype_t invrho = 1.0/s[0];
fpdtype_t E = s[${nvars - 1}];

// Compute the velocities
fpdtype_t rhov[${ndims}];
% for i in range(ndims):
    rhov[${i}] = s[${i + 1}];
    v[${i}] = invrho*rhov[${i}];
% endfor

// Compute the pressure
p = ${c['gamma'] - 1}*(E - 0.5*invrho*${pyfr.dot('rhov[{i}]', i=ndims)});

// Density and energy fluxes
% for i in range(ndims):
    f[${i}][0] = rhov[${i}];
    f[${i}][${nvars - 1}] = (E + p)*v[${i}];
% endfor

// Momentum fluxes
% for i, j in pyfr.ndrange(ndims, ndims):
    f[${i}][${j + 1}] = rhov[${i}]*v[${j}]${' + p' if i == j else ''};
% endfor
</%pyfr:macro>


PyFR-Mako

SLIDE 28
  • An example.


// AoSoA macros
#define SOA_SZ 32
#define SOA_IX(a, v, nv) ((((a) / SOA_SZ)*(nv) + (v))*SOA_SZ + (a) % SOA_SZ)

// Typedefs
typedef double fpdtype_t;

__global__ void
tflux(int _ny, int _nx,
      fpdtype_t* __restrict__ f_v, int ldf,
      const fpdtype_t* __restrict__ smats_v, int ldsmats,
      const fpdtype_t* __restrict__ u_v, int ldu)
{
    int _x = blockIdx.x*blockDim.x + threadIdx.x;
    int _y = blockIdx.y*blockDim.y + threadIdx.y;
#define X_IDX (_x)
#define X_IDX_AOSOA(v, nv) SOA_IX(X_IDX, v, nv)
    if (_x < _nx && _y < _ny)
    {
        // Compute the flux
        fpdtype_t ftemp[2][4];
        fpdtype_t p, v[2];
        {
            fpdtype_t invrho_ = 1.0/u_v[ldu*_y + X_IDX_AOSOA(0, 4)];
            fpdtype_t E_ = u_v[ldu*_y + X_IDX_AOSOA(3, 4)];

PyFR

CUDA

SLIDE 29

(continued)

            // Compute the velocities
            fpdtype_t rhov_[2];
            rhov_[0] = u_v[ldu*_y + X_IDX_AOSOA(1, 4)];
            v[0] = invrho_*rhov_[0];
            rhov_[1] = u_v[ldu*_y + X_IDX_AOSOA(2, 4)];
            v[1] = invrho_*rhov_[1];

            // Compute the pressure
            p = 0.4*(E_ - 0.5*invrho_*((rhov_[0])*(rhov_[0]) + (rhov_[1])*(rhov_[1])));

            // Density and energy fluxes
            ftemp[0][0] = rhov_[0];
            ftemp[0][3] = (E_ + p)*v[0];

  • Abstracts data layout.
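The AoSoA layout behind the generated SOA_IX macro can be transcribed into Python to see the mapping (a direct transcription of the macro, for illustration):

```python
SOA_SZ = 32

def soa_ix(a, v, nv):
    """Flat offset of variable v of element a in an AoSoA layout.

    Elements are grouped into sub-arrays of SOA_SZ, so SOA_SZ
    consecutive elements of one variable are contiguous in memory,
    giving coalesced loads on GPUs.
    """
    return ((a // SOA_SZ)*nv + v)*SOA_SZ + a % SOA_SZ
```

For example, with four variables the first 32 elements of variable 0 occupy offsets 0–31, then variable 1 of the same elements occupies offsets 32–63, and so on.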


PyFR

CUDA


SLIDE 30

(continued)

            ftemp[1][0] = rhov_[1];
            ftemp[1][3] = (E_ + p)*v[1];

            // Momentum fluxes

  • Templates based on runtime parameters.


PyFR

CUDA

SLIDE 31

(continued)

            ftemp[0][1] = rhov_[0]*v[0] + p;
            ftemp[0][2] = rhov_[0]*v[1];
            ftemp[1][1] = rhov_[1]*v[0];
            ftemp[1][2] = rhov_[1]*v[1] + p;
        };

        // Transform the fluxes
        f_v[(0*_ny + _y)*ldf + X_IDX_AOSOA(0, 4)] =
              smats_v[(0*_ny + _y)*ldsmats + X_IDX_AOSOA(0, 2)]*ftemp[0][0]
            + smats_v[(0*_ny + _y)*ldsmats + X_IDX_AOSOA(1, 2)]*ftemp[1][0];
        f_v[(0*_ny + _y)*ldf + X_IDX_AOSOA(1, 4)] =
              smats_v[(0*_ny + _y)*ldsmats + X_IDX_AOSOA(0, 2)]*ftemp[0][1]
            + smats_v[(0*_ny + _y)*ldsmats + X_IDX_AOSOA(1, 2)]*ftemp[1][1];
        f_v[(0*_ny + _y)*ldf + X_IDX_AOSOA(2, 4)] =
              smats_v[(0*_ny + _y)*ldsmats + X_IDX_AOSOA(0, 2)]*ftemp[0][2]
            + smats_v[(0*_ny + _y)*ldsmats + X_IDX_AOSOA(1, 2)]*ftemp[1][2];

  • Can also use ${Python} to generate code.


PyFR

CUDA

SLIDE 32

PyFR

  • Architecture enables PyFR to be performance portable across a range of platforms with complete feature parity.

  • Can also mix and match backends across MPI ranks, enabling heterogeneous computing.

SLIDE 33

Motivation · High-Order Methods · PyFR · OpenCL Backend · Future

SLIDE 34

OpenCL Backend

  • PyFR’s OpenCL backend was written in 2014.
  • Uses the PyOpenCL wrappers by Andreas Klöckner.
  • As OpenCL is a C API it is easy to wrap.
  • The code generation model also fits well with PyFR.
SLIDE 35

OpenCL Backend

SLIDE 36

OpenCL Backend

  • Performance in bandwidth-bound scenarios is usually on a par with ‘native’ approaches.
  • Consider the run-time breakdown of a Taylor–Green vortex simulation.
SLIDE 37

OpenCL Backend

[Chart: normalised run-time, CUDA vs OpenCL, at p = 3 and p = 4]

  • Normalised run-time on an NVIDIA V100 GPU.
  • OpenCL is slightly faster at both p = 3 and p = 4.

SLIDE 38

OpenCL Backend

  • Limitation #1: Lack of performance primitives.
  • Getting a substantial fraction of peak FLOP/s for GEMM requires hardware-specific assembly code.
  • These routines are usually provided by vendors in the form of BLAS libraries.
SLIDE 39

OpenCL Backend

  • However, for OpenCL we have to call out to generic BLAS libraries; originally clBLAS and, since 2018, CLBlast.
  • Although both employ auto-tuning, their performance is not competitive with Intel MKL or NVIDIA cuBLAS.

  • Unable to push past 40–50% of peak FLOP/s.
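As a back-of-envelope check of such figures (the numbers below are illustrative, not measurements from the talk), GEMM efficiency follows directly from the ~2mnk flop count:

```python
def gemm_efficiency(m, n, k, seconds, peak_flops):
    """Fraction of peak achieved by an (m x k) x (k x n) multiply."""
    flops = 2.0*m*n*k/seconds    # GEMM performs ~2*m*n*k floating-point ops
    return flops/peak_flops

# e.g. a hypothetical 4096^3 double-precision GEMM in 40 ms
# on a 7 TFLOP/s (FP64) device lands at ~49% of peak
eff = gemm_efficiency(4096, 4096, 4096, 0.04, 7e12)
```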
SLIDE 40

OpenCL Backend

  • Limitation #2: No MPI interoperability.
  • With CUDA it is possible to pass device pointers to MPI.
  • Under the right conditions it can improve strong scaling by ~10–15%.

SLIDE 41

OpenCL Backend

  • Limitation #3: Implementation quality.
  • Observed incorrect results with Mesa on AMD Fiji GPUs and the Intel Graphics Compute Runtime on Gen9 GPUs.

SLIDE 42

Motivation · High-Order Methods · PyFR · OpenCL Backend · Future

SLIDE 43

Future Work

  • Currently investigating how to support upcoming Intel GPUs.
  • As we’re a Python application, SYCL/DPC++ is not viable.
  • Initial support will likely be through our OpenCL backend.
SLIDE 44

Future Work

  • In order to better support AMD GPUs we’re looking to add a HIP backend into PyFR.

  • This will enable us to avoid some of the issues we’ve encountered with the OpenCL backend.