On a Novel Method for High Performance Computational Fluid Dynamics


SLIDE 1

CCDSC 2016

On a Novel Method for High Performance Computational Fluid Dynamics

Christian Obrecht

Energy and Thermal Sciences Centre of Lyon (CETHIL)
Department of Civil Engineering and Urban Planning
National Institute of Applied Sciences of Lyon (INSA-Lyon)

October 6, 2016

SLIDE 2

Outline

1. Motivation
2. Link-wise artificial compressibility method
3. Work in progress

SLIDE 3

I – Motivation

SLIDE 4

Areas of interest: Urban physics

Margheri and Sagaut, 2014

Urban micro-climate, pedestrian wind comfort, pollutant dispersion...

SLIDE 5

Areas of interest: Thermal energy storage

Latent heat storage (phase change materials): shell and tube heat exchanger.

Sorption and/or chemical heat storage: zeolite beads with air inlet and outlet.

SLIDE 6

Computational Fluid Dynamics

The previous engineering applications rely heavily on CFD simulations.

◮ Multi-physics models.
◮ Complex geometries.
◮ O(10⁹) fluid cells.
◮ Physically relevant simulation times.

Technical issues:

◮ Multi-physics commercial codes (e.g. Fluent) are expensive and do not scale beyond O(10²) cores.
◮ Open-source CFD codes (e.g. Code_Saturne) are not designed for accelerators.

SLIDE 7

Unstructured vs Cartesian meshes

Unstructured

◮ Body-fitted mesh.
◮ Time-consuming generation process.
◮ Isotropy is an issue.
◮ Irregular data access pattern.

Cartesian

◮ Trivial meshing.
◮ GPU-friendly data layout.
◮ Hierarchical structure is often needed.

SLIDE 8

Lattice Boltzmann method

◮ Discretized version of the Boltzmann equation recovering the solutions of the Navier–Stokes equations.
◮ Regular Cartesian grid of mesh size δx with constant time step δt.
◮ Finite set of particle densities fα associated with particle velocities ξα.
◮ Collision operator Ω (usually explicit).

fα(x + δt ξα, t + δt) − fα(x, t) = Ω[fα(x, t)]

[Figure: D3Q19 lattice showing the 19 discrete velocities ξα]

ρ = Σα fα        ρu = Σα fα ξα
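The moment relations above translate directly into code. A minimal sketch, in C-style CUDA host code, recovering ρ and u from the 19 particle densities of a single D3Q19 node; the velocity ordering below is one common convention and need not match the numbering on the slide.

```cuda
/* Minimal sketch: macroscopic density and velocity of one D3Q19 node from its 19
 * particle densities, following rho = sum_a f_a and rho u = sum_a f_a xi_a.        */

static const int xi[19][3] = {
    { 0, 0, 0},
    { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1},
    { 1, 1, 0}, {-1,-1, 0}, { 1,-1, 0}, {-1, 1, 0},
    { 1, 0, 1}, {-1, 0,-1}, { 1, 0,-1}, {-1, 0, 1},
    { 0, 1, 1}, { 0,-1,-1}, { 0, 1,-1}, { 0,-1, 1}
};

void moments(const float f[19], float *rho, float u[3])
{
    float r = 0.f, m[3] = { 0.f, 0.f, 0.f };
    for (int a = 0; a < 19; ++a) {
        r    += f[a];                   /* rho   = sum_a f_a      */
        m[0] += f[a] * xi[a][0];        /* rho u = sum_a f_a xi_a */
        m[1] += f[a] * xi[a][1];
        m[2] += f[a] * xi[a][2];
    }
    *rho = r;
    for (int d = 0; d < 3; ++d)
        u[d] = m[d] / r;                /* velocity = momentum / density */
}
```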

SLIDE 9

Pull formulation of the LBM

Two-step formulation of LBM: propagation (1) followed by collision (2).

fα(x, t + δt) = f*α(x − δt ξα, t)    (1)

f*α(x, t + δt) = fα(x, t + δt) + Ω[fα(x, t + δt)]    (2)
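A minimal sketch of one pull-formulation time step, reduced to a periodic D1Q3 lattice for brevity (the solvers in these slides use D3Q19) and with a BGK single-relaxation-time operator standing in for Ω as an illustrative assumption; TheLMA itself uses an MRT operator.

```cuda
#define N 64                                        /* number of lattice nodes  */

static const int   xi[3] = { 0, 1, -1 };            /* D1Q3 discrete velocities */
static const float w[3]  = { 2.f/3.f, 1.f/6.f, 1.f/6.f };

static float feq(int a, float rho, float u)         /* second-order equilibrium */
{
    float cu = xi[a] * u;
    return w[a] * rho * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * u * u);
}

void pull_step(const float fstar_old[N][3], float fstar_new[N][3], float omega)
{
    for (int x = 0; x < N; ++x) {
        float f[3], rho = 0.f, mom = 0.f;
        for (int a = 0; a < 3; ++a) {
            int src = (x - xi[a] + N) % N;          /* (1) propagation: pull f* from x - xi_a */
            f[a] = fstar_old[src][a];
            rho += f[a];
            mom += f[a] * xi[a];
        }
        float u = mom / rho;
        for (int a = 0; a < 3; ++a)                 /* (2) collision: relax toward equilibrium */
            fstar_new[x][a] = f[a] + omega * (feq(a, rho, u) - f[a]);
    }
}
```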

SLIDE 10

Solid-fluid interface

[Figure: simple bounce-back boundary condition]
SLIDE 11

LBM pros and cons

Pros

◮ Explicitness, algorithmic simplicity.
◮ Easy solid boundary processing.
◮ Well-suited to GPUs.

Cons

◮ Large memory consumption (19 scalars vs 4 hydrodynamic variables).
◮ Impact on performance in a memory-bound context.

SLIDE 12

II – Link-wise artificial compressibility method

SLIDE 13

Link-wise artificial compressibility method (LW-ACM)

◮ Novel formulation of the artificial compressibility method.
◮ Strong analogies with lattice Boltzmann schemes.

Updating rule:

fα(x, t + 1) = f(e)α(x − ξα, t) + 2 (ω − 1)/ω [ f(e,o)α(x, t) − f(e,o)α(x − ξα, t) ]

where the f(e)α are local equilibria which only depend on the local ρ and u, and the f(e,o)α are the odd parts of the equilibrium functions:

f(e,o)α(ρ, u) = ½ [ f(e)α(ρ, u) − f(e)α(ρ, −u) ].
P. Asinari, T. Ohwada, E. Chiavazzo, and A. F. Di Rienzo. Link-wise artificial compressibility method. Journal of Computational Physics, 231(15):5109–5143, 2012.
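A per-direction sketch of the updating rule above. The second-order polynomial equilibrium (cs² = 1/3) is an assumed choice for f(e)α, the odd part is computed literally from its definition, and the function and variable names are illustrative rather than taken from the Louise or TheLMA* sources.

```cuda
static float f_e(float w_a, const int xi_a[3], float rho, const float u[3])
{
    float cu = xi_a[0]*u[0] + xi_a[1]*u[1] + xi_a[2]*u[2];   /* xi_a . u */
    float uu = u[0]*u[0] + u[1]*u[1] + u[2]*u[2];            /* u . u    */
    return w_a * rho * (1.f + 3.f*cu + 4.5f*cu*cu - 1.5f*uu);
}

static float f_eo(float w_a, const int xi_a[3], float rho, const float u[3])
{
    /* odd part: 0.5 * ( f(e)(rho, u) - f(e)(rho, -u) ) */
    float mu[3] = { -u[0], -u[1], -u[2] };
    return 0.5f * (f_e(w_a, xi_a, rho, u) - f_e(w_a, xi_a, rho, mu));
}

/* f_a(x, t+1) = f(e)_a(x - xi_a, t)
 *             + 2 (omega - 1)/omega * [ f(e,o)_a(x, t) - f(e,o)_a(x - xi_a, t) ] */
float lwacm_update(float w_a, const int xi_a[3], float omega,
                   float rho_x,  const float u_x[3],    /* rho, u at node x        */
                   float rho_up, const float u_up[3])   /* rho, u at node x - xi_a */
{
    float k = 2.f * (omega - 1.f) / omega;
    return f_e(w_a, xi_a, rho_up, u_up)
         + k * (f_eo(w_a, xi_a, rho_x, u_x) - f_eo(w_a, xi_a, rho_up, u_up));
}
```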

SLIDE 14

First GPU implementation: TheLMA*

Two-step updating rule:

fα(x, t + 1) = f*α(x − ξα, t) + 2 (ω − 1)/ω f(e,o)α(x, t)

f*α(x, t + 1) = f(e)α(x, t + 1) − 2 (ω − 1)/ω f(e,o)α(x, t + 1)

◮ LW-ACM is very similar to LBM, with the additional cost of loading and storing ρ and u at each time step.
◮ First GPU implementation of LW-ACM: a slightly modified version of a TheLMA-based single-GPU CUDA LBM solver.

C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux. The TheLMA project: Multi-GPU implementation of the lattice Boltzmann method. International Journal of High Performance Computing Applications, 25(3):295–303, 2011.

SLIDE 15

Second GPU implementation: Louise

◮ It is sufficient to have access to ρ and u at node x and its neighbours x − ξα.
◮ Reduction of read redundancy: use CUDA blocks of 8 × 8 × 8 threads and store ρ and u in an array of 10³ float4 structures in shared memory (the block plus a one-node halo), as sketched below.

C. Obrecht, P. Asinari, F. Kuznik, and J.-J. Roux. High-performance implementations and large-scale validation of the link-wise ACM. Journal of Computational Physics, 275:143–153, 2014.
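A hedged CUDA sketch of the shared-memory strategy described above: an 8 × 8 × 8 block cooperatively loads the 10 × 10 × 10 tile of float4 values (ρ and u for the block plus a one-node halo) before the LW-ACM update would read its neighbours from shared memory. Kernel and array names are illustrative, and the clamping at domain edges is an assumption, not the boundary treatment of the actual Louise code.

```cuda
#define BX   8
#define TILE (BX + 2)                 /* 10: block plus one halo node per side */

__global__ void lwacm_tile_demo(const float4* __restrict__ macro_in,   /* rho, u per node */
                                float4* macro_out, int nx, int ny, int nz)
{
    __shared__ float4 tile[TILE][TILE][TILE];        /* 10^3 float4, about 16 kB */

    int gx = blockIdx.x * BX + threadIdx.x;          /* this thread's node */
    int gy = blockIdx.y * BX + threadIdx.y;
    int gz = blockIdx.z * BX + threadIdx.z;

    /* Cooperative load of the 1000 tile cells by the 512 threads of the block. */
    int tid = (threadIdx.z * BX + threadIdx.y) * BX + threadIdx.x;
    for (int c = tid; c < TILE * TILE * TILE; c += BX * BX * BX) {
        int lx = c % TILE, ly = (c / TILE) % TILE, lz = c / (TILE * TILE);
        int sx = min(max((int)(blockIdx.x * BX) + lx - 1, 0), nx - 1);   /* clamp halo  */
        int sy = min(max((int)(blockIdx.y * BX) + ly - 1, 0), ny - 1);   /* at domain   */
        int sz = min(max((int)(blockIdx.z * BX) + lz - 1, 0), nz - 1);   /* boundaries  */
        tile[lz][ly][lx] = macro_in[(sz * ny + sy) * nx + sx];
    }
    __syncthreads();

    /* The LW-ACM update would now read the 19 neighbours of cell
     * (threadIdx + 1) from `tile`; here only the centre value is copied out. */
    if (gx < nx && gy < ny && gz < nz)
        macro_out[(gz * ny + gy) * nx + gx] =
            tile[threadIdx.z + 1][threadIdx.y + 1][threadIdx.x + 1];
}
```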

SLIDE 16

Data throughput and memory footprint

Louise data throughput per time step

◮ 992 float4 structures read per CUDA block (41% of LBM).
◮ 512 float4 structures written per block (21% of LBM).

Test hardware: GTX Titan Black (single precision)

◮ LBM: 38 million nodes (e.g. 320³ cubic cavity).
◮ LW-ACM: 201 million nodes (e.g. 576³ cubic cavity).
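A hedged back-of-the-envelope check of these figures: assuming the LBM solver stores two copies of the 19 particle densities plus the 4 hydrodynamic variables (42 scalars per node, i.e. 168 B in single precision) while LW-ACM stores two copies of the float4 holding ρ and u (8 scalars, 32 B per node), the 6 GiB of the Titan Black yield roughly 6 GiB / 168 B ≈ 38 million and 6 GiB / 32 B ≈ 201 million nodes respectively, and the footprint ratio 168/32 = 5.25 matches the factor quoted in the conclusions.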

SLIDE 17

Local bounce-back boundary condition

◮ Bounce-back boundary condition: f*α(x − ξα, t) = fᾱ(x, t − 1), where x − ξα is a wall node and ᾱ is such that ξᾱ = −ξα.
◮ Louise does not keep the f*α variables: finite difference boundary conditions (cumbersome for complex geometries).
◮ Louise* variant: local bounce-back, f(e)ᾱ(x, t) = f(e)α(ρ, −u).

Updating rule at a boundary node:

fα(x, t + 1) = f(e)ᾱ(x, t) + 2 (ω − 1)/ω [ f(e,o)α(x, t) − f(e,o)ᾱ(x, t) ].

SLIDE 18

Runtime video (Louise)

Lid-driven cubic cavity at Re = 1000, 160³ ≈ 4.1 million nodes, 20,320 time steps, computation time 37.1 s on the GTX Titan, 2259 MLUPS.

SLIDE 19

Performance comparison: lid-driven cavity in single precision

[Plot: MLUPS vs cavity size (32 to 576) for TheLMA (MRT + SBB), TheLMA* (LW-ACM + SBB), Louise (LW-ACM + FDBC), and Louise* (LW-ACM + LBB)]

GPU start temperature: 60 °C, runtime per resolution ≈ 30 s. For long-term computations, performance is about 15% lower.

SLIDE 20

Velocity discrepancy with respect to spectral element data

[Log-log plot: L₂ velocity discrepancy vs cavity size, with observed convergence slopes: TheLMA (MRT + SBB): −1.25; TheLMA* (LW-ACM + SBB): −1.32; Louise (LW-ACM + FDBC): −1.23; Louise* (LW-ACM + LBB): −1.25]

SLIDE 21

III – Work in progress

SLIDE 22

OpenCL Link-wise ACM on Many-core Processors (OpenCLAMP)

◮ OpenCLAMP: a newly developed OpenCL program based on the same principles as Louise*.
◮ Performance portability: execution parameters are specified in a JSON configuration file loaded at runtime.
◮ Performance on the GTX Titan Black: similar to that of the Louise* code, i.e. higher than 2000 MLUPS, using 8 × 8 × 8 work-groups.
◮ Performance on an octo-core Xeon (E5-2687W v2 at 3.40 GHz): up to 40 MLUPS using 32 × 1 × 1 work-groups.

SLIDE 23

Conclusions

◮ LW-ACM is a promising approach for CFD on GPUs.
◮ Device memory consumption divided by up to 5.25 with respect to LBM.
◮ Performance on Kepler GPUs increased by 1.8×.
◮ OpenCLAMP: to be released soon as free software.
◮ Future work: extension to thermal flows, MPI-based multi-device implementation.

SLIDE 24

Thank you for listening!