SLIDE 1

Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain

Bérenger Bramas

Inria, Bordeaux - Sud-Ouest

PhD defense - Feb. 15th 2016

Advisor: Olivier Coulaud (Inria). Industrial co-advisor: Guillaume Sylvand (Airbus)

SLIDE 2


Context · Problem Formulation · Matrix Approach · FMM Approach · Conclusion & Perspectives

Context

SLIDE 3

Wave Equation Problems

  • Study of wave propagation in acoustics or electromagnetism
  • Critical in several industrial fields (design, robustness studies)

In our case: antenna placement, electromagnetic compatibility, stealth, lightning, ...

Image from Airbus Group.

SLIDE 4

Wave Equation Simulations

Boundary element method (BEM): an integral equation solved over a discretized mesh.

Advantages of BEM compared to other approaches:

  • Better accuracy
  • Surface mesh (easier to produce)

Disadvantages of BEM:

  • Dense matrices (requiring specific solvers)

SLIDE 5

BEM

Boundary Element Method (BEM) for the Wave Equation:

  • In Time Domain (TD): one solve = a range of frequencies (via Fourier Transform)
  • In Frequency Domain (FD): one solve = one frequency

SLIDE 6

BEM

Boundary Element Method (BEM) for the Wave Equation:

  • In Time Domain (TD): one solve = a range of frequencies
    • Dense BEM/Matrix Approach
    • Accelerated by FMM (Fast-BEM)
    • ...
  • In Frequency Domain (FD): one solve = one frequency
    • Dense BEM
    • Accelerated by FMM or H-Matrix
    • ...

SLIDE 7

BEM

Boundary Element Method (BEM) for the Wave Equation:

  • In Time Domain (TD): one solve = a range of frequencies; less studied and used
    • Dense BEM/Matrix Approach
    • Accelerated by FMM (Fast-BEM)
  • In Frequency Domain (FD): one solve = one frequency; widely used (academia and industry)
    • Dense BEM
    • Accelerated by FMM or H-Matrix

SLIDE 8

BEM

Boundary Element Method (BEM) for the Wave Equation:

  • In Time Domain (TD): one solve = a range of frequencies; less studied and used
    • Dense BEM/Matrix Approach
    • Accelerated by FMM (Fast-BEM)
  • In Frequency Domain (FD): one solve = one frequency; widely used (academia and industry)
    • Dense BEM
    • Accelerated by FMM or H-Matrix

Advantages/disadvantages depend on the application/configuration

SLIDE 9

Industrial Context

  • In partnership with Airbus Group Innovation (financed jointly with Région Aquitaine)
  • Airbus solvers:
    • FD-BEM
      • Accelerated by FMM or H-Matrix techniques
    • TD-BEM (experimental)
      • No stability problem (formulation based on a full Galerkin discretization, unconditionally stable [Terrasse, 1993])
      • With FMM [Ergin et al., 2000] (trial)

SLIDE 10

Industrial Context

  • In partnership with Airbus Group Innovation (financed jointly with Région Aquitaine)
  • Airbus solvers:
    • FD-BEM
      • Accelerated by FMM or H-Matrix techniques
    • TD-BEM (experimental)
      • No stability problem (formulation based on a full Galerkin discretization, unconditionally stable [Terrasse, 1993])
      • With FMM [Ergin et al., 2000] (trial)

Objective:

  • Reduce the performance gap between FD and TD approaches

SLIDE 11

HPC

Super-computers are mandatory to solve large problems

  • Shared/Distributed memory
  • Heterogeneous (one or more GPU per node)

[Diagram: a cluster of nodes, each node with two CPUs and a GPU]

SLIDE 12

HPC

Super-computers are mandatory to solve large problems

  • Shared/Distributed memory
  • Heterogeneous (one or more GPU per node)

[Diagram: a cluster of nodes, each node with two CPUs and a GPU]

Some of the challenges

  • Efficient computational algorithms/kernels
  • Parallelization
  • Load balancing
  • Hardware abstraction, portable implementation, long-term development, ...

SLIDE 13

Outline

  • Problem Formulation
  • BEM Solver (Matrix Approach)
  • Fast-Multipole Method Approach
  • FMM Algorithm & Parallelization
  • FMM BEM Solver (Experimental Implementation)
  • Conclusion & Perspectives

SLIDE 14

TD-BEM Application Stages

User inputs, simulation parameters
  ↓
Mesh generator, configuration
  ↓
Solver
  ↓
Post-processing (TD → FD)

SLIDE 15

Linear Formulation

Notations:

  • δΩ discretized in N unknowns/degrees of freedom


SLIDE 16

Linear Formulation

Notations:

  • δΩ discretized into N unknowns/degrees of freedom
  • Mk: the convolution matrices (dimension N × N) (input)
  • ln: the incident wave emitted by a source on the unknowns at time step n (input)
  • an: the state of the system at time step n (to compute)

SLIDE 17

Linear Formulation

Notations:

  • δΩ discretized into N unknowns/degrees of freedom
  • Mk: the convolution matrices (dimension N × N) (input)
  • ln: the incident wave emitted by a source on the unknowns at time step n (input)
  • an: the state of the system at time step n (to compute)

Convolution system:

    M0 · an + Σ_{k=1}^{Kmax} Mk · an−k = ln    (1)

SLIDE 18

Linear Formulation

Notations:

  • δΩ discretized into N unknowns/degrees of freedom
  • Mk: the convolution matrices (dimension N × N) (input)
  • ln: the incident wave emitted by a source on the unknowns at time step n (input)
  • an: the state of the system at time step n (to compute)

Convolution system:

    M0 · an + Σ_{k=1}^{Kmax} Mk · an−k = ln    (1)

Solve at each time step:

    an = (M0)⁻¹ ( ln − Σ_{k=1}^{Kmax} Mk · an−k )    (2)
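Equation (2) amounts to a plain time-marching loop. Below is an illustrative toy only, assuming small dense matrices and a diagonal M0 so the linear solve is trivial; the actual solver uses sparse Mk and an external direct solver for M0, and the names (`time_march`, `solve_diag`) are hypothetical.

```python
# Toy time-marching loop for Eq. (2): a_n = M0^{-1}(l_n - sum_k Mk a_{n-k}).
# Dense lists stand in for the sparse convolution matrices.

def mat_vec(M, v):
    """Dense matrix/vector product."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def solve_diag(M0, rhs):
    """Stand-in for the M0 linear solve (M0 assumed diagonal here)."""
    return [rhs[i] / M0[i][i] for i in range(len(rhs))]

def time_march(M, l, steps):
    """M[k] are the convolution matrices (M[0] included), l[n] the incident
    wave at each time step; returns the last computed state a_n."""
    N = len(M[0])
    history = []  # a_{n-1}, a_{n-2}, ... most recent first
    for n in range(steps):
        s = [0.0] * N
        for k in range(1, min(len(M), len(history) + 1)):
            contrib = mat_vec(M[k], history[k - 1])  # Mk * a_{n-k}
            s = [s[i] + contrib[i] for i in range(N)]
        rhs = [l[n][i] - s[i] for i in range(N)]
        history.insert(0, solve_diag(M[0], rhs))
    return history[0]
```

With Kmax = 1, two unknowns and two steps, the loop reproduces the summation stage followed by the M0 solve at each iteration.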

SLIDE 19

Interaction/Convolution Matrices (Mk)

  • Interactions between unknowns
  • Symmetric and sparse, Mk(i, j) ≠ 0 if distance(i, j) ≈ k·c·Δt
  • Pre-computed (external tool)

[Diagram: the sequence of interaction matrices M0, M1, M2, ..., MKmax]
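The sparsity rule can be made concrete with a small helper: entry (i, j) of Mk is nonzero only when the wave travel time between unknowns i and j falls around time step k. A sketch with a hypothetical name (`active_k`), where p stands for the temporal support width of the basis functions:

```python
# For a pair of unknowns, list the k for which Mk(i, j) may be nonzero,
# following distance(i, j) ≈ k·c·Δt. Purely illustrative.
import math

def active_k(pos_i, pos_j, c, dt, p=0):
    """Indices k of the interaction matrices the pair (i, j) contributes to."""
    d = math.dist(pos_i, pos_j)   # Euclidean distance between the unknowns
    ks = int(d / (c * dt))        # first time step reached by the wave
    return list(range(ks, ks + p + 1))
```

Each pair thus touches only a narrow band of matrices, which is why every Mk is sparse.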

SLIDE 20

Solve (Schematic View)

    an = (M0)⁻¹ ( ln − Σ_{k=1}^{Kmax} Mk · an−k )    (3)

[Schematic: the summation over M1..M5 applied to an−1..an−5 gives s̃n; the linear solver then computes an from M0 and ln − s̃n]

SLIDE 21

SpMV (sparse matrix/vector product)

Summation stage → Kmax SpMVs

  • Permutation, advanced storages/kernels, blocking

[White III and Sadayappan, 1997, Pinar and Heath, 1999, Pichel et al., 2005, Vuduc and Moon, 2005]

  • Auto-tuning

[Im and Yelick, 2001, Vuduc et al., 2005]

SLIDE 22

SpMV (sparse matrix/vector product)

Summation stage → Kmax SpMVs

  • Permutation, advanced storages/kernels, blocking

[White III and Sadayappan, 1997, Pinar and Heath, 1999, Pichel et al., 2005, Vuduc and Moon, 2005]

  • Auto-tuning

[Im and Yelick, 2001, Vuduc et al., 2005]

Low Flop rate:

  • Memory-bound operation
  • Flop/Word hardware limit
  • Irregular/non-contiguous memory accesses
  • Instruction-level limits (pipelining, vectorization)
  • Not well suited to GPUs

[Garland, 2008, Baskaran and Bordawekar, 2008, Bell and Garland, 2009]
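To make the memory-access pattern concrete, here is a plain SpMV in the standard CSR storage (not necessarily the solver's own format): each nonzero is read once, so the flop/word ratio is low, and x is gathered through the column indices, which is the irregular access the slide refers to.

```python
# y = A·x with A in CSR format (row_ptr, col, val).

def spmv_csr(row_ptr, col, val, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for idx in range(row_ptr[i], row_ptr[i + 1]):
            acc += val[idx] * x[col[idx]]  # indirect, non-contiguous read of x
        y[i] = acc
    return y
```

Every `val[idx]` is used for exactly two flops (one multiply, one add), which is the Flop/Word limit the next slides try to raise.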

SLIDE 23

SpMV (Performance)

[Figure: SpMV performance in GFlop/s (double precision) for COO MKL, CRS MKL, DIA MKL, BCSR MKL, CRS cuSparse and BCSR cuSparse, on test matrices Dense 5000, Diagonal 15/500000, Random 4/20000, Block Random 5/80000 and Dense 200 (×10000). Peak performance: CPU Haswell Intel Xeon E5-2680 at 2.50 GHz, 20 GFlop/s per core; K40-M GPU, 1.43 TFlop/s.]

SLIDE 24

TD-BEM Application Stages

User inputs, simulation parameters
  ↓
Mesh generator, configuration, interaction matrices pre-computation
  ↓
Solver
  · Summation stage
  · M0 linear solver (external tool)
  ↓
Post-processing (TD → FD)

SLIDE 25

Outline

  • Problem Formulation
  • BEM Solver (Matrix Approach)
  • Fast-Multipole Method Approach
  • FMM Algorithm & Parallelization
  • FMM BEM Solver (Experimental Implementation)
  • Conclusion & Perspectives

SLIDE 26

Computational Ordering

[Diagram: the summation seen as a block of matrices M1..M6 against past states A0..A5, traversed front by front (k), i.e. one SpMV per front]

    sn(i) = Σ_{k=1}^{Kmax} Σ_{j=1}^{N} Mk(i, j) × an−k(j),  1 ≤ i ≤ N.    (4)

SLIDE 27

Computational Ordering

Three traversal orders of the same summation:

  • Front (k): one SpMV per front
  • Top (i): outer loop over rows i
  • Side (j): outer loop over columns j

    sn(i) = Σ_{k=1}^{Kmax} Σ_{j=1}^{N} Mk(i, j) × an−k(j),  1 ≤ i ≤ N.    (4)
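The orderings are loop permutations of Eq. (4) and all produce the same sn. An illustrative toy with dense matrices (hypothetical names; M[0] is present but unused, matching the summation starting at k = 1):

```python
# Two of the three orderings of Eq. (4); only the outermost loop changes.

def sum_front(M, a_hist):
    """k outermost: one SpMV per front Mk. a_hist[k-1] holds a_{n-k}."""
    N = len(M[1])
    s = [0.0] * N
    for k in range(1, len(M)):
        for i in range(N):
            for j in range(N):
                s[i] += M[k][i][j] * a_hist[k - 1][j]
    return s

def sum_side(M, a_hist):
    """j outermost: the order the slice-matrix storage works on."""
    N = len(M[1])
    s = [0.0] * N
    for j in range(N):
        for k in range(1, len(M)):
            for i in range(N):
                s[i] += M[k][i][j] * a_hist[k - 1][j]
    return s
```

The side (j) order is what motivates the slice matrices of the next slides: for a fixed j, all accessed data belongs to one column of each Mk.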

SLIDE 28

Structure of a Slice Matrix

A Slicej:

  • Used when the outer loop index is j
  • The concatenation of column j of the interaction matrices Mk (except M0)
  • Size N × (Kmax − 1)
  • One dense vector per row
  • Slicej(i, k) = Mk(i, j) ≠ 0 with ks = d(i, j)/(c∆t) and ks ≤ k ≤ ks + p

[Diagram: Slicej assembled from the columns M1(*, j) through M12(*, j)]
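Building a slice amounts to gathering column j of every Mk with k ≥ 1. A sketch with hypothetical names, using dense matrices in place of the sparse ones:

```python
# Slice_j: column j of each Mk (k >= 1) becomes column k-1 of the slice,
# so one slice holds everything that unknown j radiates to the other unknowns.

def build_slice(M, j):
    """M[1..Kmax] are dense N x N matrices; returns the N x Kmax slice for j."""
    N = len(M[1])
    return [[M[k][i][j] for k in range(1, len(M))] for i in range(N)]
```

In the sparse case only the short dense run ks ≤ k ≤ ks + p of each row would be stored, which is what gives the slice its one-dense-vector-per-row structure.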

SLIDE 29

Computing with a Slice Matrix

[Diagram: Slicej, built from M1(*, j) through M12(*, j), multiplied against the past states a*<n(j) and accumulated into sn]

Computation with N vector/vector products (one per row):

  • Regular memory access (vectorization, pipelining)
  • Low Flop/word ratio (same as SpMV)

SLIDE 30

Improving the Flop/Word Ratio

[Diagram: Slicej applied against the history an−1..an−9 to produce sn; the state of unknown j advances one time step per pass]

SLIDE 31

Improving the Flop/Word Ratio

[Diagram: with ng = 2, two results sn and sn+1 are computed per pass over Slicej, re-using the loaded history]

SLIDE 32

Improving the Flop/Word Ratio

[Diagram: with ng = 3, the three results sn, sn+1 and sn+2 are computed per pass over Slicej]

SLIDE 33

Improving the Flop/Word Ratio

[Diagram: with ng = 3, the partial results sn, sn+1, sn+2 accumulate as the pass over Slicej proceeds]

SLIDE 34

Improving the Flop/Word Ratio

[Diagram: with ng = 3, the partial results sn, sn+1, sn+2 accumulate as the pass over Slicej proceeds]

SLIDE 35

Flop/Word Ratio

Vector length v = 4, group size ng = 4 (v × ng × 2 Flops):

[Diagram: vector/vector product, vector/matrix product and multi-vectors/vector product for v = 4, ng = 4]

  • Vectors product (≈ SpMV) : ng(2v + 1)
  • Vector/matrix product : v + ng(v + 1)
  • Multi-vectors/vector product : (v + ng − 1) + (v) + (ng)

SLIDE 36

Flop/Word Ratio

Vector length v = 4, group size ng = 4 (v × ng × 2 Flops):

[Diagram: vector/vector product, vector/matrix product and multi-vectors/vector product for v = 4, ng = 4]

  • Vectors product (≈ SpMV) : ng(2v + 1)
  • Vector/matrix product : v + ng(v + 1)
  • Multi-vectors/vector product : (v + ng − 1) + (v) + (ng)
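The three word counts above, set against the fixed v × ng × 2 flops, can be tabulated directly. A small helper with a hypothetical name makes the ordering of the flop/word ratios explicit:

```python
# Flop/word ratios of the three product variants on the slide.
# All three perform 2*v*ng flops; only the number of words moved differs.

def flop_per_word(v, ng):
    flops = 2 * v * ng
    words = {
        "vectors product":      ng * (2 * v + 1),        # ~SpMV
        "vector/matrix":        v + ng * (v + 1),
        "multi-vectors/vector": (v + ng - 1) + v + ng,   # shifted history read once
    }
    return {name: flops / w for name, w in words.items()}
```

For v = 4, ng = 4 this gives roughly 0.89, 1.33 and 2.13 flops per word, i.e. the multi-vectors/vector product moves the least data for the same work.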

[Plot: Flop/Word ratio versus vector length v (2 to 20), for ng = 8]

SLIDE 37

Multi-vectors/vector Product (CPU)

[Figure: multi-vectors/vector product speed (GFlop/s) versus vector length v, for AVX-Asm, AVX-Intrinsic, AVX-Template, SSE-Intrinsic and the compiler version; one plot for Nr = 1 024 and one for Nr = 20 480. Test cases of dimension Nr × v with ng = 8, double precision, on a Haswell Intel Xeon E5-2680 at 2.50 GHz (20 GFlop/s).]

SLIDE 38

GPUs Slice Storages

  • Blocking scheme (small conversion overhead)
  • Data access appropriate for SIMT/SIMD
  • Memory accesses (coalesced, low bank conflicts)
  • Data re-use (shared memory)
  • CPU/GPU Balancing

SLIDE 39

GPUs Slice Storages

  • Blocking scheme (small conversion overhead)
  • Data access appropriate for SIMT/SIMD
  • Memory accesses (coalesced, low bank conflicts)
  • Data re-use (shared memory)
  • CPU/GPU Balancing

[Diagram: storage layouts (a), (b), (c) of Slicej for the GPU]

SLIDE 40

Parallelization

Sequential algorithm:

[Schematic: as in Eq. (3), the summation over M1..M5 and the past states an−1..an−5 gives s̃n; the linear solver with M0 and ln − s̃n gives an]

SLIDE 41

Parallel Solver (Schematic View)

[Schematic: the summation is distributed over processes P0, P1, P2, each holding a block of the matrices M1..M5 and past states; the partial sn are summed, then a parallel linear solver with M0 computes an]

SLIDE 42

Parallel Solver with ng > 1 (Schematic View)

[Schematic: with ng > 1, each of the T/ng outer iterations computes the group sn..sn+ng−1; processes P0, P1, P2 alternate summation, radiation steps with M1, M2 and the parallel M0 solves]

SLIDE 43

Airplane Simulation

  • Acoustics
  • N = 23 962 unknowns
  • 10 823 time iterations
  • Kmax = 341 interaction matrices Mk
  • ng = 8
  • 70 GB of data
  • Double precision
  • Homogeneous node: 24 CPU cores (128 GB memory)
  • Heterogeneous node: 24 CPU cores (128 GB memory) and 4 K40M GPUs (12 GB memory each)

SLIDE 44

Parallel Efficiency/Percentage (Homogeneous)

[Plot: parallel efficiency of the Full-MPI version versus the number of nodes (1 to 20)]

SLIDE 45

Parallel Efficiency/Percentage (Homogeneous)

[Plots: parallel efficiency (Full-MPI) and percentage of time spent in Summation, Idle, and the Direct solver M0, versus the number of nodes (1 to 20)]

SLIDE 46

Parallel Efficiency/Percentage (Homogeneous)

[Plots: parallel efficiency (Full-MPI) and percentage of time spent in Summation, Idle, and the Direct solver M0, versus the number of nodes (1 to 20)]

Summation stage ↘ (decreasing share); M0 solve → (constant share)

SLIDE 47

With GPUs

[Plots: execution time in seconds (log scale, 300 to 20 000) and speedup against CPU-only, versus the number of nodes (1 to 5), for CPU-only and 1 GPU]

Problem size ≈ 70 GB versus 12 GB of GPU memory.

slide-48
SLIDE 48

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Results (29)

With GPUs

[Plot: execution time (s) vs. number of nodes (1 to 5); series: CPU-Only, 1GPU, 2GPU]
Figure : Execution time

[Plot: speedup over CPU-Only vs. number of nodes (1 to 5); series: 1GPU, 2GPU]
Figure : Speedup against CPU-Only

Problem size ≈ 70 GB; memory per GPU: 12 GB

slide-49
SLIDE 49

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Results (29)

With GPUs

[Plot: execution time (s) vs. number of nodes (1 to 5); series: CPU-Only, 1GPU, 2GPU, 3GPU]
Figure : Execution time

[Plot: speedup over CPU-Only vs. number of nodes (1 to 5); series: 1GPU, 2GPU, 3GPU]
Figure : Speedup against CPU-Only

Problem size ≈ 70 GB; memory per GPU: 12 GB

slide-50
SLIDE 50

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Results (29)

With GPUs

[Plot: execution time (s) vs. number of nodes (1 to 5); series: CPU-Only, 1GPU, 2GPU, 3GPU, 4GPU]
Figure : Execution time

[Plot: speedup over CPU-Only vs. number of nodes (1 to 5); series: 1GPU, 2GPU, 3GPU, 4GPU]
Figure : Speedup against CPU-Only

Problem size ≈ 70 GB; memory per GPU: 12 GB

slide-51
SLIDE 51

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Summary (30)

Summary:

  • New computational ordering [Bramas et al., 2014]
  • Solver with few communication points

Additional contributions:

  • Permutations/SpMV
  • Efficient SIMD kernels for CPU
  • Efficient blocking scheme/kernel for GPU [Bramas et al., 2015]

  • Dynamic balancing (CPU/GPU)

Limits:

  • M0 Linear solver
  • GPUs’ memory
  • Interaction matrices construction
  • Complexity → O(N²) per iteration

slide-52
SLIDE 52

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives (31)

Outline

  • Problem Formulation
  • BEM Solver (Matrix Approach)
  • Fast-Multipole Method Approach
  • FMM Algorithm & Parallelization
  • FMM BEM Solver (Experimental Implementation)
  • Conclusion & Perspectives

slide-53
SLIDE 53

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Algorithm (32)

FMM Operators (1D)

  • Spatial decomposition → Potential decomposition

f_i = f_i^near + f_i^far

  • Near field by direct interactions (leaves)
  • Far field with FMM operators (tree)

[Figure: 1D spatial decomposition tree, levels l = 0 to l = 3]

slide-54
SLIDE 54

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Algorithm (32)

FMM Operators (1D)

  • Spatial decomposition → Potential decomposition

f_i = f_i^near + f_i^far

  • Near field by direct interactions (leaves)
  • Far field with FMM operators (tree)

[Figure: 1D tree, levels l = 0 to l = 3, annotated with the operators: P2P at the leaves, M2M upward, M2L between well-separated cells, L2L downward]
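The near/far split can be illustrated with a toy 1D sketch (hypothetical 1/|x − y| kernel and a flat leaf decomposition; the far field is computed directly here only to show that the decomposition f_i = f_i^near + f_i^far is exact, where a real FMM would approximate it with the tree operators):

```python
def direct_potential(xs, qs):
    """Reference: f_i = sum over j != i of q_j / |x_i - x_j|."""
    n = len(xs)
    return [sum(qs[j] / abs(xs[i] - xs[j]) for j in range(n) if j != i)
            for i in range(n)]

def fmm_split(xs, qs, nleaves=8):
    """Split each potential into near field (own leaf + neighbour leaves,
    handled by direct P2P) and far field (well-separated leaves, which the
    FMM tree operators would approximate)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / nleaves
    leaf = [min(int((x - lo) / width), nleaves - 1) for x in xs]
    n = len(xs)
    near, far = [0.0] * n, [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            contrib = qs[j] / abs(xs[i] - xs[j])
            if abs(leaf[i] - leaf[j]) <= 1:   # adjacent leaves -> near field
                near[i] += contrib
            else:                             # well separated -> far field
                far[i] += contrib
    return near, far
```

For every particle, near[i] + far[i] reproduces the direct sum exactly; the FMM's gain comes from replacing the far-field double loop with the tree traversal.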

slide-55
SLIDE 55

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (33)

Related work:

  • Multicore study [Chandramowlishwaran et al., 2010]
  • NVidia GPU [Yokota and Barba, 2011]
  • Distributed GPU [Hamada et al., 2009]
  • Distributed CPU/GPU [Hu et al., 2011, Lashuk et al., 2012,

Malhotra and Biros, 2015]

  • Using a runtime system (multicore) [Ltaief and Yokota, 2014]

slide-56
SLIDE 56

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (34)

Paradigms

  • Fork-join
  • Parallel-for (OpenMP)

slide-57
SLIDE 57

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (34)

Paradigms

  • Fork-join
  • Parallel-for (OpenMP)
  • Task-based
  • Tasks pool (OpenMP 3.1) [Agullo et al., 2014]1

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based FMM for multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66–C93.

slide-58
SLIDE 58

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (34)

Paradigms

  • Fork-join
  • Parallel-for (OpenMP)
  • Task-based
  • Tasks pool (OpenMP 3.1) [Agullo et al., 2014]1
  • Tasks-and-dependencies (runtime systems, OpenMP 4)
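The tasks-and-dependencies paradigm can be illustrated with a toy sequential executor that infers edges from declared data accesses, in the spirit of what an OpenMP 4 `depend` clause or StarPU does (names and structure are illustrative, not ScalFMM's API):

```python
from collections import defaultdict, deque

class TaskGraph:
    """Toy model of tasks-and-dependencies: tasks declare the data they
    read and write, and edges (read-after-write, write-after-read,
    write-after-write) are inferred from those accesses."""
    def __init__(self):
        self.tasks = []  # (name, reads, writes)

    def insert(self, name, reads=(), writes=()):
        self.tasks.append((name, set(reads), set(writes)))

    def order(self):
        """Return one dependency-respecting execution order (a runtime
        would instead run independent tasks concurrently)."""
        last_writer = {}
        readers = defaultdict(list)
        deps = [set() for _ in self.tasks]
        for i, (_, reads, writes) in enumerate(self.tasks):
            for d in reads:
                if d in last_writer:
                    deps[i].add(last_writer[d])   # read-after-write
            for d in writes:
                deps[i].update(readers[d])        # write-after-read
                if d in last_writer:
                    deps[i].add(last_writer[d])   # write-after-write
            for d in reads:
                readers[d].append(i)
            for d in writes:
                last_writer[d] = i
                readers[d] = []
        # Kahn's algorithm over the inferred edges
        succ = defaultdict(list)
        indeg = [len(deps[i]) for i in range(len(self.tasks))]
        for i, ds in enumerate(deps):
            for d in ds:
                succ[d].append(i)
        ready = deque(i for i, k in enumerate(indeg) if k == 0)
        out = []
        while ready:
            i = ready.popleft()
            out.append(self.tasks[i][0])
            for s in succ[i]:
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.append(s)
        return out
```

For example, inserting P2M, M2M, M2L, L2L with the multipole/local data they touch yields an order where the upward pass precedes the transfer and downward passes, while an independent P2P can run anywhere.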

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based FMM for multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66–C93.

slide-59
SLIDE 59

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (35)

Tasks-and-Dependencies Model (OpenMP 4, StarPU)

slide-60
SLIDE 60

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (35)

Tasks-and-Dependencies Model (OpenMP 4, StarPU)

Challenges

  • Granularity

slide-61
SLIDE 61

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (35)

Tasks-and-Dependencies Model (OpenMP 4, StarPU)

CPU/GPU

Challenges

  • Granularity
  • Computational kernels

slide-62
SLIDE 62

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (35)

Tasks-and-Dependencies Model (OpenMP 4, StarPU)

Challenges

  • Granularity
  • Computational kernels
  • Scheduling

slide-63
SLIDE 63

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (36)

Scheduling

[Diagram: task graph A-F; the scheduler dispatches task A to workers CPU0, CPU1, GPU0]

slide-64
SLIDE 64

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (36)

Scheduling

[Diagram: task graph A-F; tasks B, C, D ready in the scheduler for workers CPU0, CPU1, GPU0]

slide-65
SLIDE 65

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (36)

Scheduling

[Diagram: task graph A-F; tasks B, C, D ready in the scheduler for workers CPU0, CPU1, GPU0]

  • Priority
  • Work stealing [Blumofe and Leiserson, 1999]
  • Heterogeneous Earliest Finish Time (HEFT) [Topcuoglu et al., 2002]

slide-66
SLIDE 66

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (36)

Scheduling

[Diagram: task graph A-F; tasks B, C, D ready in the scheduler for workers CPU0, CPU1, GPU0]

  • Priority
  • Work stealing [Blumofe and Leiserson, 1999]
  • Heterogeneous Earliest Finish Time (HEFT) [Topcuoglu et al., 2002]

Drawbacks:

  • Calibration
  • Overhead
  • Ready-tasks view
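A minimal flavor of the HEFT idea, assigning each ready task to the worker with the earliest predicted finish time, can be sketched as follows (communication costs and full HEFT's upward-rank ordering are omitted; the `cost` table stands for the performance models whose calibration is listed as a drawback above):

```python
def eft_assign(tasks, workers, cost):
    """Greedy earliest-finish-time assignment over heterogeneous workers.
    cost[(task, worker)] is the predicted runtime of task on worker
    (a runtime system estimates these online, hence the calibration cost)."""
    available = {w: 0.0 for w in workers}   # when each worker becomes free
    schedule = {}
    for t in tasks:                         # tasks assumed priority-ordered
        best = min(workers, key=lambda w: available[w] + cost[(t, w)])
        start = available[best]
        available[best] = start + cost[(t, best)]
        schedule[t] = (best, start)
    return schedule
```

With a task that is 10x faster on the GPU and one that runs equally well everywhere, the first goes to the GPU and the second to a CPU that would finish it earlier.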

slide-67
SLIDE 67

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (37)

Heteroprio

  • Heteroprio [Agullo et al., 2015]1
  • Steady-state: execute tasks where they have the best acceleration factor
  • Critical-state: a worker executes a task only if it does not delay the hypothetical end
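The steady-state rule can be sketched as one bucket per task type with a per-architecture pop order (an illustrative structure, not the actual scheduler code; the critical-state correction is omitted):

```python
from collections import deque

class Heteroprio:
    """Steady-state Heteroprio sketch: each worker kind walks the buckets
    in its own priority order, so a task type is consumed first by the
    architecture on which its acceleration factor is best."""
    def __init__(self, cpu_order, gpu_order):
        # Both orders are assumed to be permutations of the same task types.
        self.buckets = {t: deque() for t in cpu_order}
        self.order = {"cpu": cpu_order, "gpu": gpu_order}

    def push(self, task_type, task):
        self.buckets[task_type].append(task)

    def pop(self, worker_kind):
        for t in self.order[worker_kind]:   # first non-empty bucket wins
            if self.buckets[t]:
                return self.buckets[t].popleft()
        return None
```

With GPU-friendly P2P/M2L listed first for GPU workers and last for CPU workers, a GPU worker drains P2P tasks while a CPU worker picks up M2M/L2L work.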

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based FMM for heterogeneous architectures. Concurrency and Computation: Practice and Experience.

slide-68
SLIDE 68

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (37)

Heteroprio

  • Heteroprio [Agullo et al., 2015]1
  • Steady-state: execute tasks where they have the best acceleration factor
  • Critical-state: a worker executes a task only if it does not delay the hypothetical end

[Diagram: scheduler with separate CPU-priority and GPU-priority lists; tasks B, C, D queued for workers CPU0, CPU1, GPU0]

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based FMM for heterogeneous architectures. Concurrency and Computation: Practice and Experience.

slide-69
SLIDE 69

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (37)

Heteroprio

  • Heteroprio [Agullo et al., 2015]1
  • Steady-state: execute tasks where they have the best acceleration factor
  • Critical-state: a worker executes a task only if it does not delay the hypothetical end

[Diagram: animation frame, tasks B, C, D flowing through the CPU/GPU priority lists]

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based FMM for heterogeneous architectures. Concurrency and Computation: Practice and Experience.

slide-70
SLIDE 70

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (37)

Heteroprio

  • Heteroprio [Agullo et al., 2015]1
  • Steady-state: execute tasks where they have the best acceleration factor
  • Critical-state: a worker executes a task only if it does not delay the hypothetical end

[Diagram: animation frame, tasks B, C, D flowing through the CPU/GPU priority lists]

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based FMM for heterogeneous architectures. Concurrency and Computation: Practice and Experience.

slide-71
SLIDE 71

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Parallelization (37)

Heteroprio

  • Heteroprio [Agullo et al., 2015]1
  • Steady-state: execute tasks where they have the best acceleration factor
  • Critical-state: a worker executes a task only if it does not delay the hypothetical end

[Diagram: animation frame, tasks B, C, D flowing through the CPU/GPU priority lists]

1 Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based FMM for heterogeneous architectures. Concurrency and Computation: Practice and Experience.

slide-72
SLIDE 72

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (38)

Test Case

[Diagram: one node with 24 CPU cores and 4 GPUs]

  • N = 30 million particles
  • Spherical Expansion/Rotation kernel
  • Acc = 10⁻³, h = 7, granularity = 1500

slide-73
SLIDE 73

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (24 CPUs)

0 GPUs: 15.5 s

Legend: P2P (■), P2M (■) , M2M (■) , M2L (■), L2L (■), L2P (■) and Idle (■)

slide-74
SLIDE 74

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (1 GPU/23 CPUs)

0 GPUs: 15.5 s → 1 GPU: 13.4 s

Legend: P2P (■), P2M (■) , M2M (■) , M2L (■), L2L (■), L2P (■) and Idle (■)

slide-76
SLIDE 76

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (2 GPUs/22 CPUs)

0 GPUs: 15.5 s → 2 GPUs: 10.9 s

Legend: P2P (■), P2M (■) , M2M (■) , M2L (■), L2L (■), L2P (■) and Idle (■)

slide-77
SLIDE 77

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (3 GPUs/21 CPUs)

0 GPUs: 15.5 s → 3 GPUs: 9.4 s

Legend: P2P (■), P2M (■) , M2M (■) , M2L (■), L2L (■), L2P (■) and Idle (■)

slide-78
SLIDE 78

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (39)

Trace - Heterogeneous (4 GPUs/20 CPUs)

0 GPUs: 15.5 s → 4 GPUs: 8.7 s

Legend: P2P (■), P2M (■) , M2M (■) , M2L (■), L2L (■), L2P (■) and Idle (■)

slide-79
SLIDE 79

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (40)

Test Case

[Diagram: one node with 24 CPU cores and 4 GPUs]

  • N = 30 million particles
  • Uniform/Lagrange kernel
  • Acc = {10⁻⁵, 10⁻⁷}, h = 7, granularity = 1500

slide-80
SLIDE 80

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (41)

Trace - Heterogeneous (4 GPUs)

Acc = 10⁻⁵: 7.9 s

Legend: P2P (■), P2M (■) , M2M (■) , M2L (■), L2L (■), L2P (■) and Idle (■)

slide-81
SLIDE 81

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (41)

Trace - Heterogeneous (4 GPUs)

Acc = 10⁻⁵: 7.9 s → Acc = 10⁻⁷: 17 s

Legend: P2P (■), P2M (■) , M2M (■) , M2L (■), L2L (■), L2P (■) and Idle (■)

slide-82
SLIDE 82

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (42)

Test Cases

[Diagram: 7 nodes, 24 cores each]

  • N = 200 million particles
  • Spherical Expansion/Rotation kernel
  • Acc = 10⁻³, h = 8, granularity = 2000

slide-83
SLIDE 83

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives FMM Particles Interaction Simulations (43)

Trace - 7 nodes × 24 CPUs

Legend: P2P (■), P2M (■) , M2M (■) , M2L (■), L2L (■), L2P (■) and Idle (■) .

slide-85
SLIDE 85

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Summary (44)

Summary:

  • Generic
  • Kernel independent
  • Architecture independent
  • Performance portability

Additional contributions:

  • Commutativity expression in FMM
  • MPI/OpenMP implementation

All included in ScalFMM (C++/HPC library)

slide-86
SLIDE 86

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives (45)

Outline

  • Problem Formulation
  • BEM Solver (Matrix Approach)
  • Fast-Multipole Method Approach
  • FMM Algorithm & Parallelization
  • FMM BEM Solver (Experimental Implementation)
  • Conclusion & Perspectives

slide-87
SLIDE 87

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives (46)

Propagation of the Current State to the Future

[Diagram: time-marching scheme. At step n the linear solver computes a_n from M0 a_n = s̃_n, with s̃_n = s_n − (M1 a_{n−1} + M2 a_{n−2} + M3 a_{n−3} + M4 a_{n−4} + M5 a_{n−5})]

slide-88
SLIDE 88

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives (46)

Propagation of the Current State to the Future

[Diagram: time-marching scheme. Solve M0 a_n = s̃_n with s̃_n = s_n − Σ_{k=1..5} M_k a_{n−k}]

[Diagram: reordered scheme. Once a_n is known, its contributions M_k a_n are immediately added to the future right-hand sides s_{n+1}, …, s_{n+5}]
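The reordering on this slide, gathering the K past contributions versus propagating M_k a_n forward into the future right-hand sides as soon as a_n is known, is algebraically the same update. A sketch with scalar (1×1) matrices and hypothetical values:

```python
def march_gather(m, s):
    """Classic marching: at step n, gather the past contributions and
    solve M0 a_n = s_n - sum_k M_k a_{n-k}. Scalar toy version."""
    m0, mk = m[0], m[1:]
    a = []
    for n in range(len(s)):
        rhs = s[n] - sum(mk[k] * a[n - 1 - k]
                         for k in range(len(mk)) if n - 1 - k >= 0)
        a.append(rhs / m0)
    return a

def march_scatter(m, s):
    """Reordered scheme: once a_n is known, immediately propagate M_k a_n
    into the K future right-hand sides (the current state pushed to the
    future)."""
    m0, mk = m[0], m[1:]
    rhs = list(s)
    a = []
    for n in range(len(s)):
        a.append(rhs[n] / m0)
        for k in range(len(mk)):
            if n + 1 + k < len(rhs):
                rhs[n + 1 + k] -= mk[k] * a[n]
    return a
```

Both loops evaluate the same M_k a_m products, only in a different order; the scatter form groups the work per known a_n, which is what the reordered solver exploits.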

slide-89
SLIDE 89

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives (47)

With FMM

[Diagram: reordered scheme with FMM. Only the near matrices (M0, M1, M2) are kept; the remaining propagation to s_{n+1}, …, s_{n+5} is performed by the FMM]

  • Far interactions in time (between far elements in space) are computed by the FMM
  • The spatial decomposition is given by the octree

slide-90
SLIDE 90

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives (48)

Overview

  • The octree is built over the mesh (integration points)
  • Interaction matrices between leaves
  • Approximation/FMM:
  • expansion in the time domain
  • multipole: what a cell emits to the outside
  • local: what a cell receives from the outside
  • operators in FD or TD
  • accurate up to a chosen frequency
  • in the TD, the results of the matrix approach ≠ the FMM's

Figure : Complete unit sphere / Figure : Truncated unit sphere

slide-91
SLIDE 91

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives (49)

Operators (Overview)

  • P2M
  • compute what is emitted by a leaf to the outside

slide-92
SLIDE 92

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives (49)

Operators (Overview)

  • P2M
  • compute what is emitted by a leaf to the outside
  • M2M/L2L
  • Extrapolation + time shift

slide-93
SLIDE 93

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives (49)

Operators (Overview)

  • P2M
  • compute what is emitted by a leaf to the outside
  • M2M/L2L
  • Extrapolation + time shift
  • M2L
  • Convolution product in TD

(term-by-term multiplication in FD)
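The TD/FD equivalence behind the M2L choice is the convolution theorem; a naive DFT sketch (circular convolution stands in here for the actual time convolution, which real code would zero-pad):

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (O(n^2), for illustration only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(X):
    n = len(X)
    return [sum(X[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]

def circ_conv_td(x, h):
    """Convolution product evaluated in the time domain."""
    n = len(x)
    return [sum(x[j] * h[(t - j) % n] for j in range(n)) for t in range(n)]

def circ_conv_fd(x, h):
    """Same product as a term-by-term multiplication in the frequency
    domain, then transformed back."""
    X, H = dft(x), dft(h)
    return [y.real for y in idft([a * b for a, b in zip(X, H)])]
```

The FD route replaces the O(n²) time-domain sum by a term-by-term product between transforms, which is why the TD + FD-M2L configuration pays off in the results that follow.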

slide-95
SLIDE 95

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives (49)

Operators (Overview)

  • P2M
  • compute what is emitted by a leaf to the outside
  • M2M/L2L
  • Extrapolation + time shift
  • M2L
  • Convolution product in TD

(term-by-term multiplication in FD)

  • L2P
  • Integration

slide-96
SLIDE 96

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Results (50)

Cone-Sphere Test Cases

Case                              C-927   C-4269   C-10012
Number of unknowns                  927     4269     10012
FMM tree height                       3        4         5
Number of leaves                     16       64       234
Number of Mk matrices (K max)       117      244       370
Number of Mk matrices (leaves)       60       64        49
Number of time steps (T)           2033     4345      6647

slide-97
SLIDE 97

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Results (51)

Sequential Executions

TD vs. FD operators:

FMM stage         TD         TD + FD-M2L   FD         Matrix approach
Mk construction   76 s       76 s          76 s       242 s
Solve             58 122 s   53 241 s      97 861 s   7.8 s (*)
Total             58 198 s   53 317 s      97 937 s   249.8 s

Execution time of the TD-FMM vs. the matrix approach for case C-927 in double precision. (*) Our optimized BEM solver.

slide-98
SLIDE 98

Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Results (52)

Parallel Executions (FMM Vs. Matrix Approach)

[Bar charts: time for matrix generation and solve, FMM vs. matrix approach]

Figure : C-927 (×3.8) - FMM 883 s vs. matrix approach 237 s
Figure : C-4269 (×1) - FMM 9,426 s vs. matrix approach 10,080 s
Figure : C-10012 (×1.4) - FMM 33,408 s vs. matrix approach 25,256 s

The factor in each caption gives the overhead of the FMM TD-BEM against the matrix approach.

SLIDE 99

Summary:

  • Preliminary results
  • Best configuration: TD + FD M2L
  • Not competitive against the direct approach (maybe on larger test cases)
  • Any improvement of the matrix creation will make the FMM less competitive

Additional contributions:

  • Incomplete/4D FMM
  • Sphere discretization/length of the APS signal

SLIDE 100

Conclusion & Perspectives

SLIDE 101

Conclusion

Dense BEM/Matrix Approach

  • Based on a new computational order
  • Removes the SpMV bottleneck
  • Implemented efficiently on modern architectures
  • A complete BEM solver

SLIDE 102

Conclusion

FMM

  • Generic, state-of-the-art library
  • Several parallelization strategies
  • Robust OpenMP/MPI implementation (10 billion particles)
  • Modern task-based approach
  • ScalFMM

SLIDE 103

Conclusion

FMM BEM Solver (Preliminary)

  • Parallelized using ScalFMM
  • Best configuration: TD operators + FD M2L
  • Our implementation is not faster than the direct approach

SLIDE 104

Perspectives

TD-BEM

  • Improve the construction of the interaction matrices
  • M0 linear solver: small matrix, lots of nodes
  • Compare existing solvers (TD vs. FD)

SLIDE 105

Perspectives

FMM parallelization

  • Task-based with implicit MPI communications
  • Group-Tree update

SLIDE 106

Perspectives

FMM BEM

  • Study the cost of the solve compared to the direct approach (complexity for some cases)
  • Lots of remaining optimizations to test

SLIDE 107

SLIDE 108

Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based FMM for multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66–C93.

Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based FMM for heterogeneous architectures. Concurrency and Computation: Practice and Experience. Article cpe.3723.

Baskaran, M. M. and Bordawekar, R. (2008). Optimizing sparse matrix-vector multiplication on GPUs using compile-time and run-time strategies. IBM Research Report, RC24704 (W0812-047).

Bell, N. and Garland, M. (2009). Implementing sparse matrix-vector multiplication on throughput-oriented processors.

SLIDE 109

Vectorized Implementation (Overview)

The algorithm is tied to the SIMD vector length lSIMD. Example with ng = 4, v = 4, lSIMD = 2:

[Figure: the values a, b, c, d of a slice row-vector are combined step by step with the summation vectors sn, sn+1, sn+2, sn+3 through a small buffer that is shifted between steps.]

  • Each value from the vectors is read only once (and possibly copied into the buffer)
  • The values in the buffer are shifted to avoid reloading
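To make the shifting idea concrete, here is a hypothetical scalar sketch (names and structure are ours, not the thesis code): successive products need overlapping windows of the summation vector, so the current window is kept in a small buffer that is shifted by one element per step, and only the single new value is loaded from memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar sketch of the shifting buffer (illustrative only): compute the
// sliding products r[j] = sum_k a[k] * s[j + k]. Each element of s is
// loaded from memory exactly once; afterwards it lives in `buf`, which is
// shifted by one position between steps instead of being refilled.
std::vector<double> slidingProducts(const std::vector<double>& a,
                                    const std::vector<double>& s) {
    const std::size_t w = a.size();
    std::vector<double> buf(s.begin(), s.begin() + w);  // initial window of s
    std::vector<double> r;
    for (std::size_t j = 0; j + w <= s.size(); ++j) {
        double acc = 0.0;
        for (std::size_t k = 0; k < w; ++k) {
            acc += a[k] * buf[k];
        }
        r.push_back(acc);
        if (j + w < s.size()) {
            // Shift: drop the oldest value, load exactly one new value of s.
            std::rotate(buf.begin(), buf.begin() + 1, buf.end());
            buf[w - 1] = s[j + w];
        }
    }
    return r;
}
```

In the real kernel the buffer holds lSIMD-wide vector registers and the shift is a register shuffle; the scalar version only shows the load-once property.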


SLIDE 114

In Time or Frequency Domain

Propagation of the wave for several time steps on a target discretized sphere. The different spheres represent the values that will be applied on the included mesh elements.

SLIDE 115

Contiguous-Blocking Computational Kernel

[Figure: layout of a slice (Kmax−1 × N) and its blocks across global, shared, and local GPU memory.
 I) Coalesced copy of the summation vectors sn, sn+1, sn+2 into shared memory.
 II) Coalesced access to the blocks in global memory (column per column); uncoalesced reads from shared memory.
 III) Partial results kept in registers, then a coalesced add to global memory.]

(a) The original slice is transformed into a block during the pre-computation stage (ng = 3, bc = 11).
(b) The blocks are moved to the device memory for the summation stage.
(c) A thread-block (nb-threads = 9) is in charge of the blocks from a slice interval and computes several summation vectors at the same time.
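As an illustration of step (a), a hypothetical host-side sketch (our own helper, not the actual solver's API): the slice is repacked column per column into a contiguous block, which is the layout that makes step (II)'s column-major reads coalesced on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the pre-computation step (a): a row-major slice of
// `rows` x `cols` values is repacked column per column into a contiguous
// block, so that consecutive threads reading one column touch consecutive
// memory addresses (coalesced access).
std::vector<double> packColumnMajor(const std::vector<double>& slice,
                                    std::size_t rows, std::size_t cols) {
    std::vector<double> block(rows * cols);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            block[c * rows + r] = slice[r * cols + c];  // transpose the layout
        }
    }
    return block;
}
```

On the device, thread t of a thread-block would then read `block[c * rows + t]` for each column c, one coalesced transaction per column.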

SLIDE 116

Multi-vectors/vector Product

[Figure: computing one slice-row with 3 vectors (ng = 3): (a) using 3 scalar products, each re-reading the row; (b) using the multi-vectors/vector product, which reads the row once against the shifted windows of sn(i), sn+1(i), sn+2(i).]
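The difference between (a) and (b) can be sketched as follows (hypothetical scalar code, assuming the index windows of consecutive summation vectors are shifted by one element): in the multi-vectors/vector product each matrix value is loaded once and reused for all ng vectors, instead of being reloaded ng times.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multi-vectors/vector product sketch: one slice-row `row` against ng
// summation vectors s[0..ng-1] whose index windows are shifted by one.
// Variant (a) would loop over the row ng times; here the value `a` is
// read once and contributes to all ng partial results r[g].
std::vector<double> multiVectorProduct(const std::vector<double>& row,
                                       const std::vector<std::vector<double>>& s) {
    const std::size_t ng = s.size();
    std::vector<double> r(ng, 0.0);
    for (std::size_t j = 0; j < row.size(); ++j) {
        const double a = row[j];            // single load of the matrix value
        for (std::size_t g = 0; g < ng; ++g) {
            r[g] += a * s[g][j + g];        // shifted window for each vector
        }
    }
    return r;
}
```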

SLIDE 117

In Shared Memory

  • OpenMP parallelization
  • Summation divided/balanced between the threads
  • No communication during the summation
  • Multi-threaded M0 linear solver if possible
  • NUMA effects are not handled
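A minimal sketch of the static split behind the first two bullets (our own helper, not ScalFMM's API): each thread gets a contiguous, nearly equal range of the summation rows, so the threads need no communication during the summation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>

// Contiguous range [begin, end) of rows assigned to thread `tid` out of
// `nbThreads`. The remainder rows are spread over the first threads, so
// the per-thread load differs by at most one row.
std::pair<std::size_t, std::size_t> threadRange(std::size_t n,
                                                std::size_t nbThreads,
                                                std::size_t tid) {
    const std::size_t chunk = n / nbThreads;
    const std::size_t rest  = n % nbThreads;
    const std::size_t begin = tid * chunk + std::min(tid, rest);
    const std::size_t end   = begin + chunk + (tid < rest ? 1 : 0);
    return {begin, end};
}
```

In an OpenMP region each thread would call this with its own id and sum its rows independently; NUMA placement (the last bullet) is exactly what such a purely index-based split does not address.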

SLIDE 118

On Heterogeneous Nodes

  • OpenMP parallelization
  • One thread/core per GPU
  • An interval of the slices is moved onto each GPU
  • Intervals are rebalanced between iterations with a greedy algorithm
  • The memory limit of the GPUs may reduce their performance
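The greedy rebalancing can be sketched like this (hypothetical helper; the real scheduler works from measured timings per interval): each slice interval is handed to the currently least-loaded GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy assignment sketch: owner[i] is the GPU that receives interval i,
// chosen as the GPU with the smallest accumulated cost so far.
std::vector<int> greedyAssign(const std::vector<double>& intervalCosts,
                              int nbGpus) {
    std::vector<double> load(nbGpus, 0.0);
    std::vector<int> owner(intervalCosts.size(), 0);
    for (std::size_t i = 0; i < intervalCosts.size(); ++i) {
        const int g = static_cast<int>(
            std::min_element(load.begin(), load.end()) - load.begin());
        owner[i] = g;
        load[g] += intervalCosts[i];
    }
    return owner;
}
```

A common refinement sorts the intervals by decreasing cost first; between iterations, measured execution times replace the cost estimates.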

SLIDE 119

Application Example

FD-BEM + FMM (antenna at 1GHz)

Image from Airbus Group.

SLIDE 120

Cone-Sphere Test Cases

Case                                         C-927   C-4269   C-10012   C-22468
Number of unknowns                           927     4269     10012     22468
FMM tree height                              3       4        5         6
Number of leaves in the FMM tree             16      64       234       936
Number of NNZ interaction matrices (Kmax)    117     244      370       551
Number of NNZ matrices between FMM leaves    60      64       49        37
Number of time steps (T)                     2033    4345     6647      9957
Size of the simulation box                   3.3     7.3      11        16
Fmax                                         348     337      335       334
Incomplete FMM coefficient l = h − 1         16      18       13        10
Incomplete FMM coefficient l = 2             16      36       52        80

SLIDE 121

Multi-vectors/vector Product (GPU)

For the Contiguous-Blocking scheme:

                  GPU                            CPU
Width (bc)    16    32    64    128          16    32    64    128
Single        243   338   431   496 (11%)    4.3   5.5   7.8   6.8 (17%)
Double        143   199   248   286 (20%)    3.9   5.6   4.2   4.3 (21%)

GFlop/s for 420 slices (6400 rows and bc columns); (%) percentage of the peak performance.

SLIDE 126

Fork-join (OpenMP)

  • Level by level
  • Balancing is critical
  • Possible bottleneck at the top of the tree
  • Difficult to mix near/far fields

SLIDE 127

Fork-join+Message-passing (Hybrid OpenMP/MPI)

[Figure: the octree distributed over four processes P0, P1, P2, P3.]

  • Distribute the tree between nodes
  • Progress level by level
  • Communication between all stages


Drawback: poor expression of the available parallelism

SLIDE 129


Trace - Heterogeneous (4GPUs)

Top: h = 7, ng = 1500, Acc = 10⁻⁷, 17 s. Bottom: h = 6, ng = 300, Acc = 10⁻⁷, 39 s (not on the same scale). 24 threads, N = 30 million particles, uniform distribution, Uniform/Lagrange kernel. Legend: P2P (■), P2M (■), M2M (■), M2L (■), L2L (■)

SLIDE 132

Flop/Cost Estimation

Figure: Matrix generation cost estimation. Unit of cost (10⁶ to 10⁹) against the number of unknowns (1,000 to 20,000), comparing the cost between FMM leaves and the cost of the complete interaction matrices. Slow-down factors: 1, 6, 23, 92.

Figure: Summation stage Flop estimation. Number of Flop (10¹⁰ to 10¹⁶) against the number of unknowns, comparing TD-BEM FMM, TD-BEM FMM (FD M2L), and the matrix approach. Slow-down factors: 1340, 388, 203, 154 and 1320, 353, 119, 57.

The numbers above the slower plot represent the slow-down factors against the faster method.

SLIDE 135

Reducing the Complexity

Direct computation O(N²) → FMM O(N)

[Figure: interactions between the source set X and the target set Y, computed directly in O(N²) and hierarchically with the FMM in O(N).]

  • Spatial decomposition → potential decomposition: fi = fi^near + fi^far
  • The near field is computed by direct interactions
  • The far field is computed using the different FMM operators
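A toy 1D example of the decomposition fi = fi^near + fi^far (our own kernel choice, purely illustrative): contributions from particles inside a cutoff go to the near field, the rest to the far field, and the two parts add back up to the direct result.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Fields {
    double nearField;
    double farField;
};

// Potential felt by particle i from all others, split by distance. In the
// FMM the far part is not summed directly like this but approximated by
// the P2M/M2M/M2L/L2L/L2P operators; the decomposition itself is the same.
Fields decompose(const std::vector<double>& x, std::size_t i, double cutoff) {
    Fields f{0.0, 0.0};
    for (std::size_t j = 0; j < x.size(); ++j) {
        if (j == i) continue;
        const double d = std::fabs(x[i] - x[j]);
        const double contrib = 1.0 / d;     // toy 1/r kernel
        if (d < cutoff) f.nearField += contrib;
        else            f.farField  += contrib;
    }
    return f;
}
```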

SLIDE 136

Airplane Simulation

  • Acoustics
  • N = 23 962
  • 10 823 time iterations
  • Kmax = 341 interaction matrices Mk (≈ 5.5 × 10⁹ NNZ)
  • Computing sn ≈ 11 GFlop
  • Total ≈ 130 651 GFlop
  • ng = 8
  • 70 GB of data
  • CPU node: 2 dodeca-core Intel Xeon E5-2680 (Haswell) at 2.50 GHz with 128 GB (DDR4) of shared memory
  • GPUs per node: 4 NVIDIA Kepler K40M (745 MHz), 2880 cores, 12 GB of dedicated memory

SLIDE 137

Hybrid MPI/OpenMP

[Figure: parallel efficiency of the hybrid MPI/OpenMP version, from 1 to 50 nodes.]

Efficiency: uniform distribution, Spherical Expansion/Rotation kernel, N = 200 million, h = 8, Acc = 10⁻³, from 1 to 50 nodes (24 threads per node); for np = 50 the execution time is 2.24 s

SLIDE 138

Group-tree

  • Granularity G
  • A group → G cells/leaves
  • Good locality
  • Low iteration complexity
  • Dependencies between cells ≠ dependencies between groups
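The grouping itself is straightforward; a sketch under the assumption that the cells are already sorted in Morton order (our own helper, not ScalFMM's group-tree class): G consecutive cells form one group, and the task runtime then tracks one dependency per group instead of one per cell.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pack consecutive cell indices into groups of granularity G; the last
// group may be smaller. Task dependencies are then expressed per group.
std::vector<std::vector<int>> buildGroups(const std::vector<int>& cells,
                                          std::size_t G) {
    std::vector<std::vector<int>> groups;
    for (std::size_t i = 0; i < cells.size(); i += G) {
        const std::size_t end = std::min(i + G, cells.size());
        groups.emplace_back(cells.begin() + i, cells.begin() + end);
    }
    return groups;
}
```

A larger G gives fewer, heavier tasks (lower runtime overhead) at the price of less available parallelism, which is exactly the granularity trade-off the bullets above describe.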

SLIDE 139

Trace - Shared Memory - 24CPUs

N = 20 million particles, ellipsoid distribution, Spherical Expansion/Rotation kernel, Acc = 10⁻³, h = 11 and ng = 8000, in 5.2 s. Legend: P2P (■), P2M (■), M2M (■), M2L (■), L2L (■), L2P (■) and Idle (■)

SLIDE 140

Trace - 24CPUs

N = 30 million particles, uniform distribution, Spherical Expansion/Rotation kernel, Acc = 10⁻³, h = 7 and ng = 1500, in 15.5 s. Legend: P2P (■), P2M (■), M2M (■), M2L (■), L2L (■), L2P (■) and Idle (■)

SLIDE 141

Particles Interaction Simulations

Test Cases

Interactions between N particles for two distributions:

Figure: Uniform
Figure: Ellipsoid

The height of the tree (h) is chosen to minimize the sequential execution time.

SLIDE 142

Parallel Strategies for FMM BEM

Three strategies (fork-join OpenMP):

  • Threaded FMM: divide each level between threads (classic ScalFMM)
  • Threaded kernel: divide the work inside the kernel
  • Mix FMM/Kernel: two layers of parallelism, one in the FMM and a second in the kernel

SLIDE 143

Parallel Executions

Parallel execution times (matrix generation + solve):

Figure: C-927. Threaded FMM 3,752 s, Threaded Kernel 1,897 s, Mix FMM/Kernel 883 s, Matrix Approach 237 s.
Figure: C-4269 (×1). Threaded FMM 15,732 s, Threaded Kernel 15,371 s, Mix FMM/Kernel 9,426 s, Matrix Approach 10,080 s.
Figure: C-10012. Threaded FMM 33,408 s, Threaded Kernel 76,931 s, Mix FMM/Kernel 39,227 s, Matrix Approach 25,256 s.

The captions of the different cases show the overhead of the FMM TD-BEM against the matrix approach.
