

SLIDE 1

Improving 3D Lattice Boltzmann Method with asynchronous transfers on many-core processors

Minh Quan HO 1,3, Bernard TOURANCHEAU 1, Christian OBRECHT 2, Benoˆ ıt DUPONT DE DINECHIN 3 and Julien HASCOET 3

1LIG UMR 5217 - Grenoble Alps University - Grenoble, France 2CETHIL UMR 5008 - INSA-Lyon - Villeurbanne, France 3Kalray S.A. - Montbonnot, France

CCDSC - October 03-06, 2016

1 / 27

SLIDE 2

Overview

1. Introduction
2. Motivation
3. Kalray MPPA-256 architecture
4. Pipelined 3D LBM stencil
  • Domain decomposition and macro pipeline
  • Sub-domain addressing
  • Sub-domain size and Halo bandwidth
5. Results
6. Conclusions

SLIDE 3

Introduction - LBM theory

The Lattice Boltzmann Method operates on a regular Cartesian grid:
  • constant mesh size δx and constant time step δt
  • a node = {particle densities fα, velocities ξα}
Nodes are linked by a stencil, e.g. D3Q19, and updated following [He, 1997]:

fα(x + δt ξα, t + δt) − fα(x, t) = Ω(fα(x, t))    (1)

[Figure: the D3Q19 stencil with its numbered velocity directions]

SLIDE 4

Introduction - Memory bound context

Consider a cubic fluid domain of L × L × L lattice nodes in D3Q19, evolving through T time steps. Simulating the whole domain requires moving 19 × 2 × L³ × T floating-point numbers for at most about 400 × L³ × T floating-point operations. Moving data is now much slower than computing, so LBM is memory-bound; GPUs have so far been the best-suited platforms for it.

[Figure: roofline model of the MPPA (raw performance 634 GFLOPS SP, peak STREAM bandwidth 2.5 GB/s, reference lines at 2 and 4 GB/s and 100 and 200 GFLOPS SP) with the OPAL kernel at arithmetic intensity log(AI) = log(2.34), deep in the memory-bound region]
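As a back-of-the-envelope check of the memory-bound claim (a sketch using the slide's numbers; the function names are ours, not OPAL's), the per-node traffic and the resulting arithmetic-intensity bound can be computed directly:

```c
/* Bytes moved per D3Q19 node per time step: 19 populations,    */
/* each read once and written once, stored as 4-byte floats.    */
double bytes_per_node(void) { return 19.0 * 2.0 * 4.0; }

/* Upper bound on arithmetic intensity, taking the slide's      */
/* <= 400 flops per node per time step.                         */
double ai_upper_bound(void) { return 400.0 / bytes_per_node(); }
```

This gives at most ≈ 2.6 flops/byte, consistent with the measured OPAL intensity of 2.34 and far left of the roofline ridge, so bandwidth, not compute, limits performance.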

SLIDE 5

Motivation

Power-efficient NoC-based many-core processors are very promising for upcoming HPC challenges (e.g. Sunway, MPPA, PULP, STHORM).
  • Good latency, but low memory bandwidth (DDR3).
  • Lack of efficient programming models and optimization methods.
  • High computing and data predictability, centered on fast local memory.
  • Enable sophisticated optimizations based on software prefetching and streaming.
This motivates us to study a pipelined 3D LBM algorithm on many-core processors, using local memory and asynchronous communication.

SLIDE 6

Kalray MPPA-256 architecture

  • 16 × 16-core Compute Clusters (CC)
  • 2 × I/O clusters with quad-core CPUs, DDR3, Ethernet, PCIe
  • Dual 2D-torus NoC, 24 GB/s per link @ 600 MHz
  • Peak 634 GFLOPS SP for 25 W @ 600 MHz
  • 2 MB multi-banked shared memory (SMEM) per CC, 77 GB/s bandwidth
  • SMEM configurable as DDR L2 cache or as explicit user buffers
  • Asynchronous data transfers supported by DMA engines
  • POSIX C/C++ programming or OpenCL offloading

SLIDE 7

Outline

1. Introduction
2. Motivation
3. Kalray MPPA-256 architecture
4. Pipelined 3D LBM stencil
  • Domain decomposition and macro pipeline
  • Sub-domain addressing
  • Sub-domain size and Halo bandwidth
5. Results
6. Conclusions

SLIDE 8

Domain decomposition and macro pipeline

We take the lid-driven cavity example from the OPAL solver [Obrecht, 2015], originally implemented in OpenCL.

The Lx × Ly × Lz domain is decomposed into sub-domains of size Cx × Cy × Cz.

SLIDE 9

Domain decomposition and macro pipeline

[Figure: sub-domains streamed between DDR and Cluster 0 local memory via async_copy_3D]

Main idea: a sub-domain is copied into the CC's local memory by a 3D asynchronous copy function; computation is carried out in local memory, then data are copied back to global memory (DDR).

SLIDE 10

Domain decomposition and macro pipeline


This requires copying halo layers for each sub-domain. With a first-order stencil, the copied sub-domain is at most (Cx + 2) × (Cy + 2) × (Cz + 2).

SLIDE 11

Domain decomposition and macro pipeline


16 compute clusters, each working on NB_CUBES_PER_CLUSTER sub-domains:

    /* Prologue */
    prefetch_cube(0);                      // non-blocking

    /* Pipeline */
    for i in 0 .. NB_CUBES_PER_CLUSTER-1
        prefetch_cube(i+1);                // non-blocking
        wait_cube(i);
        compute_cube(i);
        put_cube(i);
    done

    /* Epilogue */
    wait_cube(NB_CUBES_PER_CLUSTER-1);
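The macro pipeline can be sketched as a host-side simulation in plain C (a sketch, not MPPA code: memcpy stands in for the asynchronous 3D DMA copy plus its completion wait, and the helper names mirror the slide's pseudocode rather than any real API):

```c
#include <string.h>

#define NB_CUBES   8               /* sub-domains handled by one cluster  */
#define CUBE_CELLS 64              /* toy 4x4x4 sub-domain, halo omitted  */

static float global_mem[NB_CUBES][CUBE_CELLS];  /* stands in for DDR     */
static float local_buf[2][CUBE_CELLS];          /* double buffer in SMEM */

/* In this simulation memcpy plays the role of async_copy_3D + wait. */
static void prefetch_cube(int i) {
    if (i < NB_CUBES)
        memcpy(local_buf[i % 2], global_mem[i], sizeof local_buf[0]);
}
static void wait_cube(int i) { (void)i; /* DMA completion awaited here */ }
static void compute_cube(int i) {       /* toy kernel: add 1 to each cell */
    for (int c = 0; c < CUBE_CELLS; c++)
        local_buf[i % 2][c] += 1.0f;
}
static void put_cube(int i) {
    memcpy(global_mem[i], local_buf[i % 2], sizeof local_buf[0]);
}

/* The slide's prologue / pipeline / epilogue; returns 0 on success. */
int run_pipeline(void) {
    prefetch_cube(0);                       /* prologue                  */
    for (int i = 0; i < NB_CUBES; i++) {
        prefetch_cube(i + 1);               /* overlaps the next fetch   */
        wait_cube(i);
        compute_cube(i);
        put_cube(i);
    }
    wait_cube(NB_CUBES - 1);                /* epilogue: last write-back */
    for (int i = 0; i < NB_CUBES; i++)      /* verify every cell updated */
        for (int c = 0; c < CUBE_CELLS; c++)
            if (global_mem[i][c] != 1.0f) return 1;
    return 0;
}
```

The two-entry `local_buf` is what makes the overlap safe: cube i+1 streams into one buffer while cube i is computed in the other.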

SLIDE 12

Outline

1. Introduction
2. Motivation
3. Kalray MPPA-256 architecture
4. Pipelined 3D LBM stencil
  • Domain decomposition and macro pipeline
  • Sub-domain addressing
  • Sub-domain size and Halo bandwidth
5. Results
6. Conclusions

SLIDE 13

Sub-domain addressing

A : “Hey, don’t touch my cube !” B : “No, that’s mine.”


Credit: 9gag
SLIDE 14

Sub-domain addressing

Space-filling curves like Morton or Hilbert are fast.

[Figure: space-filling-curve ordering of sub-domain blocks over (iblockx, iblocky)]
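As an illustration of why such curves are fast (a sketch; these helpers are ours, not the solver's), a 2D Morton index is just bit interleaving, a handful of shifts and masks with no lookup tables:

```c
#include <stdint.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
uint32_t part1by1(uint32_t v) {
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* 2D Morton (Z-order) index: x bits on even positions, y on odd. */
uint32_t morton2(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}
```

For example, morton2 visits the 2×2 block (0,0), (1,0), (0,1), (1,1) in the familiar Z pattern.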

SLIDE 15

Sub-domain addressing

Space-filling curves like Morton or Hilbert are fast. But what if the (sub-)domains are not cubic?

[Figure: Morton ordering on a square block grid vs. undefined ordering on a non-square grid]

SLIDE 16

Sub-domain addressing

Space-filling curves like Morton or Hilbert are fast. But what if the (sub-)domains are not cubic?

[Figure: Morton ordering on a square block grid vs. undefined ordering on a non-square grid]

A curve that works for any configuration would be more complex (octree, recursion, handling of trailing blocks).

SLIDE 17

Sub-domain addressing

Space-filling curves like Morton or Hilbert are fast. But what if the (sub-)domains are not cubic?

[Figure: Morton ordering on a square block grid vs. undefined ordering on a non-square grid]

A curve that works for any configuration would be more complex (octree, recursion, handling of trailing blocks). Instead, we address sub-domains in 3D row-major order.
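Row-major addressing needs no restrictions on the block-grid shape. A minimal sketch (the helper names and struct are hypothetical, not OPAL's API) of mapping a linear sub-domain id to 3D block coordinates and back, for an NBX × NBY × NBZ grid of sub-domains:

```c
typedef struct { int bx, by, bz; } block3;

/* Row-major linear id: x varies fastest, z slowest. */
int block_id(block3 b, int nbx, int nby) {
    return (b.bz * nby + b.by) * nbx + b.bx;
}

/* Inverse mapping: recover block coordinates from the linear id. */
block3 block_coords(int id, int nbx, int nby) {
    block3 b;
    b.bx = id % nbx;
    b.by = (id / nbx) % nby;
    b.bz = id / (nbx * nby);
    return b;
}
```

Unlike a Morton curve, this works for any NBX, NBY, NBZ, including non-square and non-power-of-two block grids, at the cost of weaker spatial locality.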

SLIDE 18

Outline

1. Introduction
2. Motivation
3. Kalray MPPA-256 architecture
4. Pipelined 3D LBM stencil
  • Domain decomposition and macro pipeline
  • Sub-domain addressing
  • Sub-domain size and Halo bandwidth
5. Results
6. Conclusions

SLIDE 19

Sub-domain size and Halo bandwidth

We call "halo bandwidth" the ratio of halo cells to the total number of copied cells.

[Figure: halo bandwidth ratio vs. cube size (Cx = Cy = Cz), decreasing from near 1 at Cx = 2 toward 0 at Cx = 96]

g(Cx) = ((Cx + 2)³ − Cx³) / (Cx + 2)³
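The curve is easy to tabulate; a small C helper (a sketch, and `halo_fraction` is our name, not OPAL's) evaluates g for a one-cell halo:

```c
/* Fraction of transferred cells that are halo, for a cubic      */
/* sub-domain of edge c with a one-cell halo on every face.      */
double halo_fraction(int c) {
    double total = (double)(c + 2) * (c + 2) * (c + 2); /* (Cx+2)^3 copied */
    double inner = (double)c * c * c;                   /* Cx^3 useful     */
    return (total - inner) / total;
}
```

For instance halo_fraction(2) = 0.875 while halo_fraction(64) ≈ 0.088, which is why larger sub-domains waste far less transfer bandwidth.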

SLIDE 20

Sub-domain size and Halo bandwidth

We call "halo bandwidth" the ratio of halo cells to the total number of copied cells.

[Figure: halo bandwidth ratio vs. cube size (Cx = Cy = Cz), decreasing from near 1 at Cx = 2 toward 0 at Cx = 96]

g(Cx) = ((Cx + 2)³ − Cx³) / (Cx + 2)³

Which size should sub-domains have, given limited local memory?

SLIDE 21

Sub-domain size and Halo bandwidth

We call "halo bandwidth" the ratio of halo cells to the total number of copied cells.

[Figure: halo bandwidth ratio vs. cube size (Cx = Cy = Cz), decreasing from near 1 at Cx = 2 toward 0 at Cx = 96]

g(Cx) = ((Cx + 2)³ − Cx³) / (Cx + 2)³

Which size should sub-domains have, given limited local memory? E.g. with double buffering: malloc(2 × (Cx + 2)³ × sizeof(float)) (Cx = Cy = Cz).
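Under the slide's double-buffering formula (one float per cell; `largest_cube` is a hypothetical helper, and a real SMEM budget must also hold the 19 D3Q19 populations and program data, so practical sizes are smaller), the largest cube fitting a byte budget can be found by a simple search:

```c
/* Largest cubic sub-domain edge Cx whose double buffer of         */
/* (Cx+2)^3 floats fits in budget_bytes, per the slide's formula   */
/* malloc(2 * (Cx+2)^3 * sizeof(float)).                           */
int largest_cube(long budget_bytes) {
    int c = 0;
    /* Grow while the *next* candidate, edge c+1 (thus c+3 with     */
    /* its halo), still fits in the budget.                         */
    while (2L * (c + 3) * (c + 3) * (c + 3) * (long)sizeof(float)
           <= budget_bytes)
        c++;
    return c;
}
```

With the MPPA's 2 MB SMEM per cluster this gives Cx = 62 if the whole SMEM held a single float per cell, which is why cubic sub-domains "as big as possible" are the right target.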

SLIDE 22

Sub-domain size and Halo bandwidth

We call "halo bandwidth" the ratio of halo cells to the total number of copied cells.

[Figure: halo bandwidth ratio vs. cube size (Cx = Cy = Cz), decreasing from near 1 at Cx = 2 toward 0 at Cx = 96]

g(Cx) = ((Cx + 2)³ − Cx³) / (Cx + 2)³

Which size should sub-domains have, given limited local memory? E.g. with double buffering: malloc(2 × (Cx + 2)³ × sizeof(float)) (Cx = Cy = Cz). Sub-domains should be cubic and as big as possible.

SLIDE 23

Outline

1. Introduction
2. Motivation
3. Kalray MPPA-256 architecture
4. Pipelined 3D LBM stencil
  • Domain decomposition and macro pipeline
  • Sub-domain addressing
  • Sub-domain size and Halo bandwidth
5. Results
6. Conclusions

SLIDE 24

Results (1/2)

We compare the original OPAL performance on an Intel CPU, an Intel MIC, an NVIDIA GPU and the Kalray MPPA-256 (all in OpenCL).

[Figure: original OPAL OpenCL, duration = 1000, workgroup = 32×1×1, cavity sizes 32-256, on Tesla C2070, Xeon E5-2667 v3, Xeon Phi 3100 and MPPA-256 Bostan. (a) Performance in MLUPS; (b) relative throughput vs. GPU-STREAM (%); (c) power efficiency (MLUPS/W)]

Figure: Original OPAL OpenCL on GPU, CPU, MIC and MPPA

GPU-STREAM benchmark [Deakin, 2015]

24 / 27

slide-25
SLIDE 25

Results (2/2)

Asynchronous approach implemented in POSIX C on the MPPA:
  • outperforms the OpenCL version by 33%
  • twice as fast again when using both DDRs (MPPA OpenCL currently supports only one DDR)

[Figure: OPAL_async vs. OPAL OpenCL on MPPA, duration = 1000, cavity sizes 64-224. (a) Single-DDR; (b) Double-DDR. OPAL_async variants: inplace/outplace, 3- and 4-depth (29-43 % halo bandwidth), vs. OPAL OpenCL with WG = 32×1×1 on a single DDR]

Figure: OPAL_async vs. OPAL OpenCL on MPPA

SLIDE 26

Conclusions

33% performance gain by actively streaming stencil sub-domains through local memories. Software pipelining is not a trivial task, but it is essential for good performance on many-core processors. DDR bandwidth is the bottleneck: halo copies are critical to performance and consume up to 60% of the bandwidth on small sub-domains.

Perspective: applying an alternative method, the link-wise artificial compressibility method [Obrecht, 2016], with 5× less memory traffic.

SLIDE 27

References

He, X. and Luo, L.-S. (1997). Theory of the lattice Boltzmann method: From the Boltzmann equation to the lattice Boltzmann equation. Physical Review E 56(6): 6811.

Obrecht, C., Tourancheau, B. and Kuznik, F. (2015). Performance evaluation of an OpenCL implementation of the lattice Boltzmann method on the Intel Xeon Phi. Parallel Processing Letters 25(3): 1541001.

Deakin, T. and McIntosh-Smith, S. (2015). GPU-STREAM: Benchmarking the achievable memory bandwidth of Graphics Processing Units. Supercomputing poster, Austin, Texas.

Obrecht, C. et al. (2016). Thermal link-wise artificial compressibility method: GPU implementation and validation of a double-population model. Computers & Mathematics with Applications 72(2): 375-385.
