

SLIDE 1

Improving 3D Lattice Boltzmann Method with asynchronous transfers on many-core processors

Minh Quan HO 1,3, Bernard TOURANCHEAU 1, Christian OBRECHT 2, Benoˆ ıt DUPONT DE DINECHIN 3 and Julien HASCOET 3

1LIG UMR 5217 - Grenoble Alps University - Grenoble, France 2CETHIL UMR 5008 - INSA-Lyon - Villeurbanne, France 3Kalray S.A. - Montbonnot, France

CCDSC - October 03-06, 2016

1 / 27

SLIDE 2

Overview

1. Introduction
2. Motivation
3. Kalray MPPA-256 architecture
4. Pipelined 3D LBM stencil
  • Domain decomposition and macro pipeline
  • Sub-domain addressing
  • Sub-domain size and Halo bandwidth
5. Results
6. Conclusions

SLIDE 3

Introduction - LBM theory

The Lattice Boltzmann Method operates on a regular Cartesian grid:
  • constant mesh size δx and constant time step δt
  • a node = {particle densities fα, velocities ξα}
Nodes are linked by a stencil, e.g. D3Q19, and updated following [He, 1997]:

fα(x + δt ξα, t + δt) − fα(x, t) = Ω(fα(x, t))    (1)

[Figure: the D3Q19 stencil with its numbered velocity directions]

SLIDE 4

Introduction - Memory bound context

Consider a cubic fluid domain of L × L × L lattice nodes in D3Q19, evolving through T time steps. Simulating the whole domain requires moving 19 × 2 × L³ × T floating-point numbers for at most about 400 × L³ × T floating-point operations. Moving data is now much slower than computing, so LBM is memory-bound; GPUs have so far been the best-suited platforms for it.

[Figure: roofline model of the MPPA (raw performance 634 GFLOPS SP, peak STREAM bandwidth 2.5 GB/s, reference lines at 2 and 4 GB/s and 100 and 200 GFLOPS SP) with the OPAL kernel at arithmetic intensity log(AI) = log(2.34), deep in the memory-bound region]
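As a back-of-the-envelope check of the memory-bound claim (a sketch using the slide's numbers; the function names are ours, not OPAL's), the per-node traffic and the resulting arithmetic-intensity bound can be computed directly:

```c
/* Bytes moved per D3Q19 node per time step: 19 populations,    */
/* each read once and written once, stored as 4-byte floats.    */
double bytes_per_node(void) { return 19.0 * 2.0 * 4.0; }

/* Upper bound on arithmetic intensity, taking the slide's      */
/* <= 400 flops per node per time step.                         */
double ai_upper_bound(void) { return 400.0 / bytes_per_node(); }
```

This gives at most ≈ 2.6 flops/byte, consistent with the measured OPAL intensity of 2.34 and far left of the roofline ridge, so bandwidth, not compute, limits performance.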

SLIDE 5

Motivation

Power-efficient NoC-based many-core processors are very promising for upcoming HPC challenges (e.g. Sunway, MPPA, PULP, STHORM).
  • Good latency, but low memory bandwidth (DDR3).
  • Lack of efficient programming models and optimization methods.
  • High computing and data predictability, centered on fast local memory.
  • Enable sophisticated optimizations based on software prefetching and streaming.
This motivates us to study a pipelined 3D LBM algorithm on many-core processors, using local memory and asynchronous communication.

SLIDE 6

Kalray MPPA-256 architecture

  • 16 × 16-core Compute Clusters (CC)
  • 2 × I/O clusters with quad-core CPUs, DDR3, Ethernet, PCIe
  • Dual 2D-torus NoC, 24 GB/s per link @ 600 MHz
  • Peak 634 GFLOPS SP for 25 W @ 600 MHz
  • 2 MB multi-banked shared memory (SMEM) per CC, 77 GB/s bandwidth
  • SMEM configurable as DDR L2 cache or as explicit user buffers
  • Asynchronous data transfers supported by DMA engines
  • POSIX C/C++ programming or OpenCL offloading

SLIDE 7

Outline

1. Introduction
2. Motivation
3. Kalray MPPA-256 architecture
4. Pipelined 3D LBM stencil
  • Domain decomposition and macro pipeline
  • Sub-domain addressing
  • Sub-domain size and Halo bandwidth
5. Results
6. Conclusions

SLIDE 8

Domain decomposition and macro pipeline

We take the lid-driven cavity example from the OPAL solver [Obrecht, 2015], originally implemented in OpenCL.

The Lx × Ly × Lz domain is decomposed into sub-domains of size Cx × Cy × Cz.

SLIDE 9

Domain decomposition and macro pipeline

[Figure: sub-domains streamed between DDR and Cluster 0 local memory via async_copy_3D]

Main idea: a sub-domain is copied into the CC's local memory by a 3D asynchronous copy function; computation is carried out in local memory, then data are copied back to global memory (DDR).

SLIDE 10

Domain decomposition and macro pipeline


This requires copying halo layers for each sub-domain. With a first-order stencil, the copied sub-domain is at most (Cx + 2) × (Cy + 2) × (Cz + 2).

SLIDE 11

Domain decomposition and macro pipeline


16 compute clusters, each working on NB_CUBES_PER_CLUSTER sub-domains:

    /* Prologue */
    prefetch_cube(0);                      // non-blocking

    /* Pipeline */
    for i in 0 .. NB_CUBES_PER_CLUSTER-1
        prefetch_cube(i+1);                // non-blocking
        wait_cube(i);
        compute_cube(i);
        put_cube(i);
    done

    /* Epilogue */
    wait_cube(NB_CUBES_PER_CLUSTER-1);
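The macro pipeline can be sketched as a host-side simulation in plain C (a sketch, not MPPA code: memcpy stands in for the asynchronous 3D DMA copy plus its completion wait, and the helper names mirror the slide's pseudocode rather than any real API):

```c
#include <string.h>

#define NB_CUBES   8               /* sub-domains handled by one cluster  */
#define CUBE_CELLS 64              /* toy 4x4x4 sub-domain, halo omitted  */

static float global_mem[NB_CUBES][CUBE_CELLS];  /* stands in for DDR     */
static float local_buf[2][CUBE_CELLS];          /* double buffer in SMEM */

/* In this simulation memcpy plays the role of async_copy_3D + wait. */
static void prefetch_cube(int i) {
    if (i < NB_CUBES)
        memcpy(local_buf[i % 2], global_mem[i], sizeof local_buf[0]);
}
static void wait_cube(int i) { (void)i; /* DMA completion awaited here */ }
static void compute_cube(int i) {       /* toy kernel: add 1 to each cell */
    for (int c = 0; c < CUBE_CELLS; c++)
        local_buf[i % 2][c] += 1.0f;
}
static void put_cube(int i) {
    memcpy(global_mem[i], local_buf[i % 2], sizeof local_buf[0]);
}

/* The slide's prologue / pipeline / epilogue; returns 0 on success. */
int run_pipeline(void) {
    prefetch_cube(0);                       /* prologue                  */
    for (int i = 0; i < NB_CUBES; i++) {
        prefetch_cube(i + 1);               /* overlaps the next fetch   */
        wait_cube(i);
        compute_cube(i);
        put_cube(i);
    }
    wait_cube(NB_CUBES - 1);                /* epilogue: last write-back */
    for (int i = 0; i < NB_CUBES; i++)      /* verify every cell updated */
        for (int c = 0; c < CUBE_CELLS; c++)
            if (global_mem[i][c] != 1.0f) return 1;
    return 0;
}
```

The two-entry `local_buf` is what makes the overlap safe: cube i+1 streams into one buffer while cube i is computed in the other.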

SLIDE 12

Outline

1. Introduction
2. Motivation
3. Kalray MPPA-256 architecture
4. Pipelined 3D LBM stencil
  • Domain decomposition and macro pipeline
  • Sub-domain addressing
  • Sub-domain size and Halo bandwidth
5. Results
6. Conclusions

SLIDE 13

Sub-domain addressing

A : “Hey, don’t touch my cube !” B : “No, that’s mine.”


Credit: 9gag
SLIDE 14

Sub-domain addressing

Space-filling curves like Morton or Hilbert are fast.

[Figure: space-filling-curve ordering of sub-domain blocks over (iblockx, iblocky)]
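As an illustration of why such curves are fast (a sketch; these helpers are ours, not the solver's), a 2D Morton index is just bit interleaving, a handful of shifts and masks with no lookup tables:

```c
#include <stdint.h>

/* Spread the low 16 bits of v so they occupy the even bit positions. */
uint32_t part1by1(uint32_t v) {
    v &= 0x0000ffff;
    v = (v | (v << 8)) & 0x00ff00ff;
    v = (v | (v << 4)) & 0x0f0f0f0f;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

/* 2D Morton (Z-order) index: x bits on even positions, y on odd. */
uint32_t morton2(uint32_t x, uint32_t y) {
    return part1by1(x) | (part1by1(y) << 1);
}
```

For example, morton2 visits the 2×2 block (0,0), (1,0), (0,1), (1,1) in the familiar Z pattern.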

SLIDE 15

Sub-domain addressing

Space-filling curves like Morton or Hilbert are fast. But what if the (sub-)domains are not cubic?

[Figure: Morton ordering on a square block grid vs. undefined ordering on a non-square grid]

SLIDE 16

Sub-domain addressing

Space-filling curves like Morton or Hilbert are fast. But what if the (sub-)domains are not cubic?

[Figure: Morton ordering on a square block grid vs. undefined ordering on a non-square grid]

A curve that works for any configuration would be more complex (octree, recursion, handling of trailing blocks).

SLIDE 17

Sub-domain addressing

Space-filling curves like Morton or Hilbert are fast. But what if the (sub-)domains are not cubic?

[Figure: Morton ordering on a square block grid vs. undefined ordering on a non-square grid]

A curve that works for any configuration would be more complex (octree, recursion, handling of trailing blocks). Instead, we address sub-domains in 3D row-major order.
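Row-major addressing needs no restrictions on the block-grid shape. A minimal sketch (the helper names and struct are hypothetical, not OPAL's API) of mapping a linear sub-domain id to 3D block coordinates and back, for an NBX × NBY × NBZ grid of sub-domains:

```c
typedef struct { int bx, by, bz; } block3;

/* Row-major linear id: x varies fastest, z slowest. */
int block_id(block3 b, int nbx, int nby) {
    return (b.bz * nby + b.by) * nbx + b.bx;
}

/* Inverse mapping: recover block coordinates from the linear id. */
block3 block_coords(int id, int nbx, int nby) {
    block3 b;
    b.bx = id % nbx;
    b.by = (id / nbx) % nby;
    b.bz = id / (nbx * nby);
    return b;
}
```

Unlike a Morton curve, this works for any NBX, NBY, NBZ, including non-square and non-power-of-two block grids, at the cost of weaker spatial locality.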

SLIDE 18

Outline

1. Introduction
2. Motivation
3. Kalray MPPA-256 architecture
4. Pipelined 3D LBM stencil
  • Domain decomposition and macro pipeline
  • Sub-domain addressing
  • Sub-domain size and Halo bandwidth
5. Results
6. Conclusions

SLIDE 19

Sub-domain size and Halo bandwidth

We call "halo bandwidth" the ratio of halo cells to the total number of copied cells.

[Figure: halo bandwidth ratio vs. cube size (Cx = Cy = Cz), decreasing from near 1 at Cx = 2 toward 0 at Cx = 96]

g(Cx) = ((Cx + 2)³ − Cx³) / (Cx + 2)³
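The curve is easy to tabulate; a small C helper (a sketch, and `halo_fraction` is our name, not OPAL's) evaluates g for a one-cell halo:

```c
/* Fraction of transferred cells that are halo, for a cubic      */
/* sub-domain of edge c with a one-cell halo on every face.      */
double halo_fraction(int c) {
    double total = (double)(c + 2) * (c + 2) * (c + 2); /* (Cx+2)^3 copied */
    double inner = (double)c * c * c;                   /* Cx^3 useful     */
    return (total - inner) / total;
}
```

For instance halo_fraction(2) = 0.875 while halo_fraction(64) ≈ 0.088, which is why larger sub-domains waste far less transfer bandwidth.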

SLIDE 20

Sub-domain size and Halo bandwidth

We call "halo bandwidth" the ratio of halo cells to the total number of copied cells.

[Figure: halo bandwidth ratio vs. cube size (Cx = Cy = Cz), decreasing from near 1 at Cx = 2 toward 0 at Cx = 96]

g(Cx) = ((Cx + 2)³ − Cx³) / (Cx + 2)³

Which size should sub-domains have, given limited local memory?

SLIDE 21

Sub-domain size and Halo bandwidth

We call "halo bandwidth" the ratio of halo cells to the total number of copied cells.

[Figure: halo bandwidth ratio vs. cube size (Cx = Cy = Cz), decreasing from near 1 at Cx = 2 toward 0 at Cx = 96]

g(Cx) = ((Cx + 2)³ − Cx³) / (Cx + 2)³

Which size should sub-domains have, given limited local memory? E.g. with double buffering: malloc(2 × (Cx + 2)³ × sizeof(float)) (Cx = Cy = Cz).
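Under the slide's double-buffering formula (one float per cell; `largest_cube` is a hypothetical helper, and a real SMEM budget must also hold the 19 D3Q19 populations and program data, so practical sizes are smaller), the largest cube fitting a byte budget can be found by a simple search:

```c
/* Largest cubic sub-domain edge Cx whose double buffer of         */
/* (Cx+2)^3 floats fits in budget_bytes, per the slide's formula   */
/* malloc(2 * (Cx+2)^3 * sizeof(float)).                           */
int largest_cube(long budget_bytes) {
    int c = 0;
    /* Grow while the *next* candidate, edge c+1 (thus c+3 with     */
    /* its halo), still fits in the budget.                         */
    while (2L * (c + 3) * (c + 3) * (c + 3) * (long)sizeof(float)
           <= budget_bytes)
        c++;
    return c;
}
```

With the MPPA's 2 MB SMEM per cluster this gives Cx = 62 if the whole SMEM held a single float per cell, which is why cubic sub-domains "as big as possible" are the right target.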

SLIDE 22

Sub-domain size and Halo bandwidth

We call "halo bandwidth" the ratio of halo cells to the total number of copied cells.

[Figure: halo bandwidth ratio vs. cube size (Cx = Cy = Cz), decreasing from near 1 at Cx = 2 toward 0 at Cx = 96]

g(Cx) = ((Cx + 2)³ − Cx³) / (Cx + 2)³

Which size should sub-domains have, given limited local memory? E.g. with double buffering: malloc(2 × (Cx + 2)³ × sizeof(float)) (Cx = Cy = Cz). Sub-domains should be cubic and as big as possible.

SLIDE 23

Outline

1. Introduction
2. Motivation
3. Kalray MPPA-256 architecture
4. Pipelined 3D LBM stencil
  • Domain decomposition and macro pipeline
  • Sub-domain addressing
  • Sub-domain size and Halo bandwidth
5. Results
6. Conclusions

SLIDE 24

Results (1/2)

We compare the original OPAL performance on an Intel CPU, an Intel MIC, an NVIDIA GPU and the Kalray MPPA-256 (all in OpenCL).

[Figure: original OPAL OpenCL, duration = 1000, workgroup = 32×1×1, cavity sizes 32-256, on Tesla C2070, Xeon E5-2667 v3, Xeon Phi 3100 and MPPA-256 Bostan. (a) Performance in MLUPS; (b) relative throughput vs. GPU-STREAM (%); (c) power efficiency (MLUPS/W)]

Figure: Original OPAL OpenCL on GPU, CPU, MIC and MPPA

GPU-STREAM benchmark [Deakin, 2015]

24 / 27

slide-25
SLIDE 25

Results (2/2)

Asynchronous approach implemented in POSIX C on the MPPA:
  • outperforms the OpenCL version by 33%
  • twice as fast again when using both DDRs (MPPA OpenCL currently supports only one DDR)

[Figure: OPAL_async vs. OPAL OpenCL on MPPA, duration = 1000, cavity sizes 64-224. (a) Single-DDR; (b) Double-DDR. OPAL_async variants: inplace/outplace, 3- and 4-depth (29-43 % halo bandwidth), vs. OPAL OpenCL with WG = 32×1×1 on a single DDR]

Figure: OPAL_async vs. OPAL OpenCL on MPPA

SLIDE 26

Conclusions

33% performance gain by actively streaming stencil sub-domains through local memories. Software pipelining is not a trivial task, but it is essential for good performance on many-core processors. DDR bandwidth is the bottleneck: halo copies are critical to performance and consume up to 60% of the bandwidth on small sub-domains.

Perspective: applying an alternative method, the link-wise artificial compressibility method [Obrecht, 2016], with 5× less memory traffic.

SLIDE 27

References

He, X. and Luo, L.-S. (1997). Theory of the lattice Boltzmann method: From the Boltzmann equation to the lattice Boltzmann equation. Physical Review E 56(6): 6811.

Obrecht, C., Tourancheau, B. and Kuznik, F. (2015). Performance evaluation of an OpenCL implementation of the lattice Boltzmann method on the Intel Xeon Phi. Parallel Processing Letters 25(3): 1541001.

Deakin, T. and McIntosh-Smith, S. (2015). GPU-STREAM: Benchmarking the achievable memory bandwidth of Graphics Processing Units. Supercomputing poster, Austin, Texas.

Obrecht, C. et al. (2016). Thermal link-wise artificial compressibility method: GPU implementation and validation of a double-population model. Computers & Mathematics with Applications 72(2): 375-385.
