Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain
Bérenger Bramas, Inria Bordeaux - Sud-Ouest
PhD defense - Feb. 15th 2016
Advisor: Olivier Coulaud (Inria)
Industrial co-advisor: Guillaume
Context
Wave Equation Problems
- Study the wave propagation in acoustics or electromagnetism
- Critical in several industrial fields (design, robustness study)
In our case: antenna placement, electromagnetic compatibility, stealth, lightning, ...
Image from Airbus Group.
Wave Equation Simulations
Boundary element method (BEM): integral equation over a discretized mesh
Interest of BEM compared to other approaches
- Better accuracy
- Surface mesh (easier to produce)
Disadvantages of BEM
- Dense matrices (specific solvers)
BEM
Boundary Element Method (BEM) for the Wave Equation:
- In Time Domain (TD):
  - Dense BEM/Matrix Approach
  - Accelerated by FMM (Fast-BEM)
  - One solve = a range of frequencies (Fourier Transform)
  - Less studied and used
- In Frequency Domain (FD):
  - Dense BEM
  - Accelerated by FMM or by H-Matrix
  - One solve = one frequency
  - Widely used (academia and industry)
Advantages/disadvantages depend on the application/configuration
Industrial Context
- In partnership with Airbus Group Innovation (financed jointly with Région Aquitaine)
- Airbus solvers:
  - FD-BEM
    - Accelerated by FMM or H-Matrix techniques
  - TD-BEM (experimental)
    - No stability problem (formulation based on a full Galerkin discretization, unconditionally stable [Terrasse, 1993])
    - With FMM [Ergin et al., 2000] (trial)

Objective:
- Reduce the performance gap between the FD and TD approaches
HPC
Super-computers are mandatory to solve large problems
- Shared/Distributed memory
- Heterogeneous (one or more GPU per node)
[Figure: a cluster of nodes, each node holding two CPUs and a GPU]
Some of the challenges
- Efficient computational algorithm/kernel
- Parallelization
- Balancing
- Hardware abstraction, portable implementation, long-term development, ...
Outline
- Problem Formulation
- BEM Solver (Matrix Approach)
- Fast-Multipole Method Approach
- FMM Algorithm & Parallelization
- FMM BEM Solver (Experimental Implementation)
- Conclusion & Perspectives
TD-BEM Application Stages
User inputs, simulation parameters
↓ Mesh generator, configuration
↓ Solver
↓ Post-processing (TD → FD)
Linear Formulation
Notations:
- δΩ discretized into N unknowns/degrees of freedom
- Mk: the convolution matrices (dimension N × N) - input
- ln: the incident wave emitted by a source on the unknowns at time step n - input
- an: the state of the system at time step n - to compute
Convolution system:

M0 · an + ∑_{k=1}^{Kmax} Mk · an−k = ln    (1)

Solve at each time step:

an = (M0)⁻¹ ( ln − ∑_{k=1}^{Kmax} Mk · an−k )    (2)
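One time iteration of system (2) can be sketched as follows. This is a minimal dense NumPy illustration only: the real solver stores the sparse Mk in dedicated formats and delegates the M0 solve to an external linear solver, and all matrix values here are hypothetical.

```python
import numpy as np

def time_step(M, a_hist, l_n):
    """One iteration of equation (2): a_n = M0^{-1} (l_n - sum_k Mk a_{n-k}).

    M      : list [M0, M1, ..., MKmax] of (N, N) matrices
    a_hist : past states, a_hist[k-1] == a_{n-k}
    l_n    : incident wave at time step n
    """
    # Summation stage: Kmax matrix/vector products
    s_n = sum(M[k] @ a_hist[k - 1] for k in range(1, len(M)))
    # M0 solve (a direct/external solver in the real application)
    return np.linalg.solve(M[0], l_n - s_n)

# Tiny example with N = 2, Kmax = 2 (hypothetical values)
M = [np.eye(2), 0.1 * np.eye(2), 0.05 * np.eye(2)]
a_hist = [np.array([1.0, 2.0]), np.array([0.5, 0.5])]
a_n = time_step(M, a_hist, np.array([1.0, 1.0]))
```

The computed state a_n is then pushed onto the history and the loop moves to time step n + 1.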
Interaction/Convolution Matrices (Mk)
- Interactions between unknowns
- Symmetric and sparse, Mk(i, j) ≠ 0 if distance(i, j) ≈ k·c·∆t
- Pre-computed (external tool)
[Figure: the sequence of interaction matrices M0, M1, M2, ..., MKmax]
Solve (Schematic View)
an = (M0)⁻¹ ( ln − ∑_{k=1}^{Kmax} Mk · an−k )    (3)

[Figure: one time step: the past states an−1, ..., an−5 are multiplied by M1, ..., M5 and summed into sn; the linear solver then computes an from M0 and s̃n = ln − sn]
SpMV (sparse matrix/vector product)
Summation stage → Kmax SpMVs
- Permutation, advanced storages/kernels, blocking [White III and Sadayappan, 1997, Pinar and Heath, 1999, Pichel et al., 2005, Vuduc and Moon, 2005]
- Auto-tuning [Im and Yelick, 2001, Vuduc et al., 2005]

Low Flop-rate:
- Memory-bound operation
- Flop/Word hardware limit
- Irregular/non-contiguous memory accesses
- Instruction-level limits (pipelining, vectorization)
- Not appropriate for GPUs [Garland, 2008, Baskaran and Bordawekar, 2008, Bell and Garland, 2009]
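The memory-bound behavior is visible in the kernel itself. A plain-Python sketch of a CRS (compressed row storage) SpMV, illustrative only, since real kernels are written in C with SIMD:

```python
def spmv_crs(row_ptr, col_idx, values, x):
    """y = A·x with A stored in CRS: two flops per stored value,
    but every value forces an indirect read x[col_idx[p]]."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[p] * x[col_idx[p]]  # irregular access to x
        y[i] = acc
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]
y = spmv_crs([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])  # [3.0, 3.0]
```

Each stored value yields one multiply and one add but also loads the value, its column index, and an unpredictable entry of x, which is why the Flop/Word ratio stays low.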
SpMV (Performance)
[Figure: SpMV performance (GFlop/s) for several matrices (Dense 5000, Diagonal 15/500000, Random 4/20000, Block Random 5/80000, Dense 200 ×10000) and formats (COO MKL, CRS MKL, DIA MKL, BCSR MKL, CRS cuSparse, BCSR cuSparse)]
SpMVs MKL/cuSparse (double precision). Peak performance: CPU Haswell Intel Xeon E5-2680 2.50 GHz core 20 GFlop/s, and K40-M GPU 1.43 TFlop/s.
TD-BEM Application Stages
User inputs, simulation parameters
↓ Mesh generator, configuration, interaction matrices pre-computation
↓ Solver
 · Summation stage
 · M0 Linear Solver (external tool)
↓ Post-processing (TD → FD)
Computational Ordering
[Figure: the summation viewed as a 3D block of elementary products, traversed in three possible orders]
- Front (k): outer loop over k, i.e. one SpMV per matrix Mk
- Top (i): outer loop over i
- Side (j): outer loop over j

sn(i) = ∑_{k=1}^{Kmax} ∑_{j=1}^{N} Mk(i, j) × an−k(j) , 1 ≤ i ≤ N.    (4)
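All three orderings compute the same sums (4) with different outer loops; a dense-matrix sketch of two of them (illustrative only, with hypothetical values; the real kernels exploit sparsity):

```python
def summation_front(M, a, Kmax, N):
    """Front (k) ordering: outer loop over k, i.e. one (Sp)MV per matrix Mk."""
    s = [0.0] * N
    for k in range(1, Kmax + 1):
        for i in range(N):
            for j in range(N):
                s[i] += M[k][i][j] * a[k - 1][j]   # a[k-1] holds a_{n-k}
    return s

def summation_side(M, a, Kmax, N):
    """Side (j) ordering: outer loop over j, i.e. one slice per unknown j."""
    s = [0.0] * N
    for j in range(N):
        for k in range(1, Kmax + 1):
            for i in range(N):
                s[i] += M[k][i][j] * a[k - 1][j]
    return s

# Hypothetical dense 2x2 matrices, Kmax = 2; both orderings give the same s
M = [None, [[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
a = [[1.0, 1.0], [2.0, 2.0]]
s_front = summation_front(M, a, 2, 2)
s_side = summation_side(M, a, 2, 2)
```

The results are identical; only the memory-access pattern changes, which is what the slice structure below exploits.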
Structure of a Slice Matrix
A Slicej:
- Used when the outer loop index is j
- The concatenation of column j of the interaction matrices Mk (except M0)
- Size N × (Kmax − 1)
- There is one dense vector per row
- Slicej(i, k) = Mk(i, j) ≠ 0 with ks = d(i, j)/(c∆t) and ks ≤ k ≤ ks + p
[Figure: Slicej as the concatenation of the columns M1(∗,j), M2(∗,j), ..., M12(∗,j)]
Computing with a Slice Matrix
[Figure: the product of Slicej (columns M1(∗,j), ..., M12(∗,j)) with the past states a∗<n(j), accumulated into sn]
Computation with N vector/vector products (one per row):
- Regular memory accesses (vectorization, pipelining)
- Low Flop/Word ratio (same as SpMV)
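A minimal sketch of this computation, assuming each slice row stores its start offset ks plus a short dense vector, per the slice structure above; the row layout and values are illustrative:

```python
def apply_slice(slice_rows, a_past_j, s):
    """Accumulate Slicej · a_{*<n}(j) into s: one dense vector/vector
    product per row, with contiguous accesses on both operands.

    slice_rows : list of (i, ks, vec) with vec[m] == M_{ks+m}(i, j)
    a_past_j   : a_past_j[k-1] == a_{n-k}(j), the past states of unknown j
    """
    for i, ks, vec in slice_rows:
        past = a_past_j[ks - 1:ks - 1 + len(vec)]      # contiguous slice
        s[i] += sum(v * a for v, a in zip(vec, past))  # dense dot product
    return s

# Hypothetical slice with two rows
rows = [(0, 1, [2.0, 3.0]),  # M1(0,j) = 2, M2(0,j) = 3
        (1, 2, [1.0])]       # M2(1,j) = 1
s = apply_slice(rows, [1.0, 2.0, 4.0], [0.0, 0.0])  # s == [8.0, 2.0]
```

The accesses are now regular, but one pass still loads each slice value for only two flops, hence the next step: reusing each slice across several time steps.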
Improving the Flop/Word Ratio

[Figure: animation for group sizes ng = 2 and ng = 3: the same Slicej is combined with the shifted past states an−1, ..., an−9 of unknown j to produce several outputs sn, sn+1, sn+2 at once, so each slice value is loaded once for ng products]
Flop/Word Ratio
Vector length v = 4, group size ng = 4 (v × ng × 2 Flops):

[Figure: the three access patterns: vector/vector product, vector/matrix product, multi-vectors/vector product]

Words moved per v × ng × 2 Flops:
- Vector/vector product (≈ SpMV): ng(2v + 1)
- Vector/matrix product: v + ng(v + 1)
- Multi-vectors/vector product: (v + ng − 1) + v + ng
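The counts above translate directly into Flop/Word ratios (v × ng × 2 flops over the words moved); a small sketch to compare the three kernels:

```python
def flop_per_word(v, ng):
    """Flop/Word ratio of the three products for v * ng * 2 flops."""
    flops = 2.0 * v * ng
    words = {
        "vector/vector": ng * (2 * v + 1),             # ~ SpMV
        "vector/matrix": v + ng * (v + 1),
        "multi-vectors/vector": (v + ng - 1) + v + ng,
    }
    return {name: flops / w for name, w in words.items()}

# For v = 4 and ng = 8, the multi-vectors/vector product moves the
# fewest words per flop of the three kernels
ratios = flop_per_word(v=4, ng=8)
```

The multi-vectors/vector product amortizes every loaded word over many products, which is what lifts the kernel off the memory-bandwidth limit.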
[Figure: Flop/Word ratio as a function of v (2 to 20) for ng = 8]
Multi-vectors/vector Product (CPU)
[Figure: speed (GFlop/s) vs length of vectors (v) for the kernel variants AVX-Asm, AVX-Intrinsic, AVX-Template, SSE-Intrinsic and the compiler version; left: Nr = 1 024, right: Nr = 20 480]
Plots show the GFlop/s with ng = 8 for test cases of dimension Nr × v (in double precision). Haswell Intel Xeon E5-2680 at 2.50 GHz (20 GFlop/s).
GPUs Slice Storages
- Blocking scheme (small conversion overhead)
- Data access appropriate for SIMT/SIMD
- Memory accesses (coalesced, low bank conflicts)
- Data re-use (shared memory)
- CPU/GPU Balancing
[Figure: candidate GPU storages (a), (b), (c) of Slicej and the past states a∗<n(j)]
Parallelization
Sequential algorithm:
[Figure: the sequential time step: M1, ..., M5 times an−1, ..., an−5 summed into sn; the linear solver computes an from M0 and s̃n = ln − sn]
Parallel Solver (Schematic View)
[Figure: the summation is split across processes P0, P1, P2, each computing a partial sn from its share of the Mk and past states; the partial results are combined (+) and a parallel linear solver computes an from M0]
Parallel Solver with ng > 1 (Schematic View)
[Figure: with ng > 1, the T/ng outer loops each run ng inner loops: the processes P0, P1, P2 compute sn, sn+1, sn+2 from their shares of the Mk, exchange radiation contributions (sn+f+1, sn+f+2), and the parallel linear solver computes an+f from M0 and s̃n+f = ln+f − sn+f]
Airplane Simulation
- Acoustics
- N = 23 962
- 10 823 time iterations
- Kmax = 341 interaction matrices Mk
- ng = 8
- 70GB of data
- double precision
- Homogeneous node: 24 CPU cores (128GB of memory)
- Heterogeneous node: 24 CPU cores (128GB of memory) and 4 K40M GPUs (12GB of memory each)
Parallel Efficiency/Percentage (Homogeneous)
[Figure: left, parallel efficiency of the Full-MPI solver vs number of nodes (1 to 20); right, percentage of time in the summation, the direct solver for M0, and idle]
As the number of nodes grows, the summation stage decreases (↘) while the M0 solve stays constant (→).
With GPUs
. . . 1 . 2 . 3 . 4 . 5 . 300 . 1,000 . 3,000 . 20,000 . Number of nodes . Time (seconds) . . .
CPU-Only .
.
1GPU
Figure : Execution time
. . . 1 . 2 . 3 . 4 . 5 . 1.0 . 11.0 . 21.0 . Number of nodes . Speedup . . .
1GPU
Figure : Speedup against CPU-Only
Problem ≈ 70GB/GPU 12GB memory
Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain B´ erenger Bramas
. .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. .
Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Results (29)
With GPUs
. . . 1 . 2 . 3 . 4 . 5 . 300 . 1,000 . 3,000 . 20,000 . Number of nodes . Time (seconds) . . .
CPU-Only .
.
1GPU .
.
2GPU
Figure : Execution time
. . . 1 . 2 . 3 . 4 . 5 . 1.0 . 11.0 . 21.0 . Number of nodes . Speedup . . .
1GPU .
.
2GPU
Figure : Speedup against CPU-Only
Problem ≈ 70GB/GPU 12GB memory
Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain B´ erenger Bramas
. .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. .
Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Results (29)
With GPUs
. . . 1 . 2 . 3 . 4 . 5 . 300 . 1,000 . 3,000 . 20,000 . Number of nodes . Time (seconds) . . .
CPU-Only .
.
1GPU .
.
2GPU
. .
3GPU
Figure : Execution time
. . . 1 . 2 . 3 . 4 . 5 . 1.0 . 11.0 . 21.0 . Number of nodes . Speedup . . .
1GPU .
.
2GPU .
.
3GPU
Figure : Speedup against CPU-Only
Problem ≈ 70GB/GPU 12GB memory
Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain B´ erenger Bramas
. .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. . . .. .
Context Problem Formulation Matrix Approach FMM Approach Conclusion & Perspectives Results (29)
With GPUs
Figure : Execution time vs. number of nodes (1 to 5) for the CPU-only, 1 GPU, 2 GPU, 3 GPU, and 4 GPU configurations.
Figure : Speedup against CPU-only vs. number of nodes (1 to 5) for 1, 2, 3, and 4 GPUs.
Problem size ≈ 70 GB per GPU, against 12 GB of GPU memory.
Summary:
- New computational ordering [Bramas et al., 2014]
- Solver with few communication points
Additional contributions:
- Permutations/SpMV
- Efficient SIMD kernel CPU
- Efficient blocking scheme/kernel for GPU [Bramas et al., 2015]
- Dynamic balancing (CPU/GPU)
Limits:
- M0 Linear solver
- GPUs’ memory
- Interaction matrices construction
- Complexity → O(N²) for each iteration
Outline
- Problem Formulation
- BEM Solver (Matrix Approach)
- Fast-Multipole Method Approach
- FMM Algorithm & Parallelization
- FMM BEM Solver (Experimental Implementation)
- Conclusion & Perspectives
FMM Operators (1D)
- Spatial decomposition → potential decomposition: f_i = f_i^near + f_i^far
- Near field by direct interactions (leaves)
- Far field with FMM operators (tree)
Figure : Tree levels l = 0 to l = 3, with the operators P2P (leaves), M2M, M2L, and L2L applied through the hierarchy.
FMM Parallelization
Related work:
- Multicore study [Chandramowlishwaran et al., 2010]
- NVIDIA GPU [Yokota and Barba, 2011]
- Distributed GPU [Hamada et al., 2009]
- Distributed CPU/GPU [Hu et al., 2011, Lashuk et al., 2012, Malhotra and Biros, 2015]
- Using a runtime system (multicore) [Ltaief and Yokota, 2014]
Paradigms
- Fork-join
- Parallel-for (OpenMP)
- Task-based
- Tasks pool (OpenMP 3.1) [Agullo et al., 2014]1
- Tasks-and-dependencies (runtime systems, OpenMP 4)
¹ Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based FMM for multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66–C93.
Tasks-and-Dependencies Model (OpenMP 4, StarPU)
Challenges
- Granularity
- Computational kernels
- Scheduling
Scheduling
Figure : A scheduler dispatching the ready tasks (A to F) to the workers CPU0, CPU1, and GPU0.
- Priority
- Work stealing [Blumofe and Leiserson, 1999]
- Heterogeneous Earliest Finish Time (HEFT) [Topcuoglu et al., 2002]
Drawbacks:
- Calibration
- Overhead
- Ready-tasks view
Heteroprio
- Heteroprio [Agullo et al., 2015]1
- Steady-state: execute tasks on the worker type where they have the best acceleration factor
- Critical-state: a worker executes a task if it does not delay the hypothetical end
¹ Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based FMM for heterogeneous architectures. Concurrency and Computation: Practice and Experience.
FMM Particle Interaction Simulations
Test Case
Hardware: one node with 24 CPU cores and 4 GPUs
- N = 30 million particles
- Spherical Expansion/Rotation kernel
- Acc = 10⁻³, h = 7, granularity = 1500
Trace - Heterogeneous Executions
Execution time as GPUs replace CPU workers:
- 24 CPUs, 0 GPU: 15.5 s
- 23 CPUs, 1 GPU: 13.4 s
- 22 CPUs, 2 GPUs: 10.9 s
- 21 CPUs, 3 GPUs: 9.4 s
- 20 CPUs, 4 GPUs: 8.7 s
Legend of the execution traces: P2P, P2M, M2M, M2L, L2L, L2P, and idle time (trace figures not reproduced).
Test Case
Hardware: one node with 24 CPU cores and 4 GPUs
- N = 30 million particles
- Uniform/Lagrange kernel
- Acc = {10⁻⁵, 10⁻⁷}, h = 7, granularity = 1500
Trace - Heterogeneous (4 GPUs)
- Acc = 10⁻⁵: 7.9 s
- Acc = 10⁻⁷: 17 s
Legend of the execution traces: P2P, P2M, M2M, M2L, L2L, L2P, and idle time (trace figures not reproduced).
Test Cases
Hardware: 7 nodes with 24 CPU cores each
- N = 200 million particles
- Spherical Expansion/Rotation kernel
- Acc = 10⁻³, h = 8, granularity = 2000
Trace - 7 nodes × 24 CPUs
Legend of the execution trace: P2P, P2M, M2M, M2L, L2L, L2P, and idle time (trace figure not reproduced).
Summary:
- Generic
- Kernel independent
- Architecture independent
- Performance portability
Additional contributions:
- Commutativity expression in FMM
- MPI/OpenMP implementation
All included in ScalFMM (C++/HPC library)
FMM BEM Solver (Experimental Implementation)
Propagation of the Current State to the Future
- At each time step, the linear solver computes a_n from M_0 a_n = s̃_n, where s̃_n gathers the source term and the past contributions M_1 a_{n-1}, M_2 a_{n-2}, ..., M_5 a_{n-5}
- Reordered scheme: as soon as a_n is known, its contributions are propagated to the future right-hand sides, s_{n+k} += M_k a_n for k = 1, ..., 5
With FMM
- Same propagation scheme, but only the first interaction matrices (M_0, M_1, M_2) are kept dense; the remaining contributions are computed by the FMM
- Far interactions in time (between far elements in space) are
computed by the FMM
- The spatial decomposition is given by the octree
Overview
- The octree is built over the mesh (integration points)
- Interaction matrices between leaves
- Approximation/FMM
- expansions in the time domain
- multipole: what a cell emits to the outside
- local: what a cell receives from the outside
- operators in FD or TD
- accurate up to a chosen frequency
- the TD results of the matrix approach ≠ the FMM results
Figures : Complete unit sphere; truncated unit sphere.
Operators (Overview)
- P2M
- compute what is emitted by a leaf to the outside
- M2M/L2L
- Extrapolation + time shift
- M2L
- Convolution product in TD
(term-by-term multiplication in FD)
- L2P
- Integration
Cone-Sphere Test Cases
Case                              C-927   C-4269   C-10012
Number of unknowns                  927     4269     10012
FMM tree height                       3        4         5
Number of leaves                     16       64       234
Number of Mk matrices (K max)       117      244       370
Number of Mk matrices (leaves)       60       64        49
Number of time steps (T)           2033     4345      6647
Sequential Executions
TD vs. FD operators:
FMM stages        TD         TD + FD-M2L   FD         Matrix approach
Mk construction   76 s       76 s          76 s       242 s
Solve             58,122 s   53,241 s      97,861 s   7.8 s (*)
Total             58,198 s   53,317 s      97,937 s   249.8 s

Execution time of the TD-FMM vs. the matrix approach for case C-927 in double precision. (*) Our optimized BEM solver.
Parallel Executions (FMM vs. Matrix Approach)
Total time (matrix generation + solve) for each case; the factor in each caption gives the overhead of the FMM TD-BEM against the matrix approach:
- C-927 (×3.8): FMM 883 s vs. matrix approach 237 s
- C-4269 (×1): FMM 10,080 s vs. matrix approach 9,426 s
- C-10012 (×1.4): FMM 33,408 s vs. matrix approach 25,256 s
Summary:
- Preliminary results
- Best configuration: TD + FD M2L
- Not competitive against the direct approach (maybe on larger test cases)
- Any improvement of the matrix construction will make the FMM less competitive
Additional contributions:
- Incomplete/4D FMM
- Sphere discretization/length APS signal
Conclusion & Perspectives
Conclusion
Dense BEM/Matrix Approach
- Based on a new computational ordering
- Removes the SpMV bottleneck
- Implemented efficiently on modern architectures
- Complete BEM solver
Conclusion
FMM
- Generic and state-of-the-art library
- Several parallelization strategies
- Robust OpenMP/MPI implementation (10 billion particles)
- Modern task-based approach
- ScalFMM
Conclusion
FMM BEM Solver (Preliminary)
- Parallelized using ScalFMM
- Best configuration TD operators + FD M2L
- Our implementation is not faster than the direct approach
Perspectives
TD-BEM
- Improve the construction of the interaction matrices
- M0 linear solver: small matrix, lots of nodes
- Compare existing solvers (TD vs. FD)
Perspectives
FMM parallelization
- Task-based with implicit MPI communications
- Group-Tree update
Perspectives
FMM BEM
- Study the cost of the solve compared to the direct approach
(complexity for some cases)
- Many remaining optimizations to test
(3)
Vectorized Implementation (Overview)
The algorithm is tied to the SIMD vector length lSIMD. Example with ng = 4, v = 4, lSIMD = 2:
[Figure: animation of the vectorized summation over a slice, showing the vectors a, b, c, d, the summation vectors sn to sn+3, the partial results r and the shifting buffer]
- Each value from the vectors is read only once (and possibly copied into the buffer)
- The values in the buffer are shifted to avoid reloading
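The shift-and-reuse idea above can be sketched in a scalar C++ model (my own simplification, not the thesis kernel; `shiftedDots` and its layout are hypothetical): ng results consume the same source vector at shifted offsets, each source value is loaded from memory exactly once into a sliding buffer, and the buffer is shifted instead of reloading.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Scalar model of the shifted summation: ng results share the source vector,
//   s_g = sum_i w[i] * x[i + g],  g = 0 .. ng-1.
// Each value of x enters the buffer once; the buffer is then shifted.
std::vector<double> shiftedDots(const std::vector<double>& w,
                                const std::vector<double>& x,
                                std::size_t ng) {
    std::vector<double> s(ng, 0.0);
    std::deque<double> buffer;          // x values currently in use
    std::size_t loaded = 0;             // next x index to load (each read once)
    for (std::size_t i = 0; i < w.size(); ++i) {
        // load just enough new values so that x[i + ng - 1] is available
        while (loaded <= i + ng - 1 && loaded < x.size())
            buffer.push_back(x[loaded++]);
        // every buffered value feeds all ng partial results
        for (std::size_t g = 0; g < ng; ++g)
            if (g < buffer.size()) s[g] += w[i] * buffer[g];
        buffer.pop_front();             // shift: x[i] is never needed again
    }
    return s;
}
```

In the real kernel the buffer is a set of SIMD registers and the shift advances by lSIMD values at a time; the deque only models the data reuse.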
(4)
In Time or Frequency Domain
[Figure: propagation of the wave over several time steps on a discretized target sphere. Each sphere represents the values that will be applied to the mesh elements it contains.]
(5)
Contiguous-Blocking Computational Kernel
[Figure: data movement between Global, Shared and Local memory for the Contiguous-Blocking kernel, with one thread per row (Thread-1 to Thread-9)]
(a) the original slice is transformed into a block during the pre-computation stage (ng = 3, bc = 11)
(b) the blocks are moved to the device memory for the summation stage
(c) a thread-block (nb-threads = 9) is in charge of the blocks from a slice interval and computes several summation vectors at the same time
Steps: I) coalesced copy of the vector into shared memory; II) coalesced access in global memory (column per column) and uncoalesced read from shared memory; III) partial results in registers, then a coalesced add to global memory.
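The three stages can be modeled on the host with plain C++ (a hedged sketch: `blockSum` and the dense block layout are my own simplification; on the GPU, `staged` would live in shared memory and `partial` in a register, with one thread per row):

```cpp
#include <cstddef>
#include <vector>

// Host-side model of the Contiguous-Blocking summation for one block.
std::vector<double> blockSum(const std::vector<std::vector<double>>& block,
                             const std::vector<double>& xSlice) {
    // I) copy the vector slice once into fast (staging) memory
    std::vector<double> staged(xSlice);
    std::vector<double> result(block.size(), 0.0);
    for (std::size_t row = 0; row < block.size(); ++row) {
        double partial = 0.0;                         // III) partial result "in registers"
        for (std::size_t col = 0; col < staged.size(); ++col)
            partial += block[row][col] * staged[col]; // II) read the block column per column
        result[row] += partial;                       // III) single add back to global memory
    }
    return result;
}
```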
(6)
Multi-vectors/vector Product
[Figure: computing one slice-row with 3 vectors (ng = 3), (a) using 3 scalar products, (b) using the multi-vectors/vector product]
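The benefit of variant (b) can be sketched as follows (a minimal dense model; `multiVectorDot` is a hypothetical name): the ng scalar products share the same row of matrix values, so each value is loaded once and multiplied against the ng vectors, instead of re-reading the row ng times as independent dot products would.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Multi-vectors/vector product: one pass over the row, ng accumulators.
template <std::size_t ng>
std::array<double, ng>
multiVectorDot(const std::vector<double>& row,
               const std::array<std::vector<double>, ng>& vecs) {
    std::array<double, ng> r{};          // one partial result per vector
    for (std::size_t j = 0; j < row.size(); ++j) {
        const double a = row[j];         // matrix value read only once
        for (std::size_t g = 0; g < ng; ++g)
            r[g] += a * vecs[g][j];
    }
    return r;
}
```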
(7)
In Shared Memory
- OpenMP parallelization
- Summation divided/balanced between the threads
- No communication during the summation
- Multi-threaded M0 linear solver if possible
- NUMA effects are not handled
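The race-free division of the summation can be sketched with OpenMP (a hedged sketch over a hypothetical dense layout, not the thesis data structures): each thread owns a disjoint range of rows of the result, so no communication or synchronization is needed during the summation.

```cpp
#include <cstddef>
#include <vector>

// Shared-memory summation: s += M * a, rows divided statically between threads.
void parallelSummation(const std::vector<std::vector<double>>& M,
                       const std::vector<double>& a,
                       std::vector<double>& s) {
    #pragma omp parallel for schedule(static)  // balanced row ranges per thread
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(M.size()); ++i) {
        double acc = 0.0;
        for (std::size_t j = 0; j < a.size(); ++j)
            acc += M[i][j] * a[j];
        s[i] += acc;                           // disjoint writes: no races
    }
}
```

The pragma is ignored when compiled without OpenMP support, so the code stays valid sequentially.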
(8)
On Heterogeneous Nodes
- OpenMP parallelization
- One thread/core per GPU
- An interval of the slices is moved to each GPU
- Intervals are rebalanced between iterations with a greedy
algorithm
- The memory limit of the GPUs may reduce their performance
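A greedy interval split can be sketched as follows (the exact heuristic and the per-slice cost model are assumptions here; the thesis rebalances from measured timings): walk the slices in order and close the current interval once it reaches its fair share of the total cost.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the first slice index of each device's contiguous interval;
// device d owns [start[d], start[d+1]) (last device owns up to the end).
std::vector<std::size_t>
greedyIntervals(const std::vector<double>& cost, std::size_t nDevices) {
    const double total = std::accumulate(cost.begin(), cost.end(), 0.0);
    const double share = total / static_cast<double>(nDevices);
    std::vector<std::size_t> start{0};
    double acc = 0.0;
    for (std::size_t i = 0; i < cost.size() && start.size() < nDevices; ++i) {
        acc += cost[i];
        if (acc >= share) {      // interval full: next device starts at i + 1
            start.push_back(i + 1);
            acc = 0.0;
        }
    }
    return start;
}
```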
(9)
Application Example
FD-BEM + FMM (antenna at 1GHz)
Image from Airbus Group.
(10)
Cone-Sphere Test Cases
Case                                        C-927   C-4269   C-10012   C-22468
Number of unknowns                            927     4269     10012     22468
FMM tree height                                 3        4         5         6
Number of leaves in the FMM tree               16       64       234       936
Number of NNZ interaction matrices (Kmax)     117      244       370       551
Number of NNZ matrices between FMM leaves      60       64        49        37
Number of time steps (T)                     2033     4345      6647      9957
Size of the simulation box                    3.3      7.3        11        16
Fmax                                          348      337       335       334
Incomplete FMM coefficient l = h - 1           16       18        13        10
Incomplete FMM coefficient l = 2               16       36        52        80
(11)
Multi-vectors/vector Product (GPU)
For the Contiguous-Blocking scheme:
                GPU                              CPU
Width (bc)   16    32    64    128          16    32    64    128
Single      243   338   431   496 (11%)    4.3   5.5   7.8   6.8 (17%)
Double      143   199   248   286 (20%)    3.9   5.6   4.2   4.3 (21%)

GFlop/s for 420 slices (6400 rows and bc columns); (%) is the percentage of the peak performance.
(12)
Fork-join (OpenMP)
- Progress level by level
- Balancing is critical
- Possible bottleneck at the top of the tree
- Difficult to mix near and far fields
(13)
Fork-join+Message-passing (Hybrid OpenMP/MPI)
[Figure: the FMM tree distributed over four processes P0, P1, P2, P3]
- Distribute the tree between nodes
- Progress level by level
- Communication between all stages
→ Poor expression of the parallelism
(14)
Trace - Heterogeneous (4GPUs)
h = 7 / ng = 1500 / Acc = 10−7 / 17s and h = 6 / ng = 300 / Acc = 10−7 / 39s (not on the same scale); 24 threads, N = 30 million, uniform distribution, Uniform/Lagrange kernel. Legend: P2P (■), P2M (■), M2M (■), M2L (■), L2L (■), L2P (■) and Idle (■)
(15)
Flop/Cost Estimation
[Figure: Matrix generation cost estimation. Unit of cost versus number of unknowns (1,000 to 20,000); curves: between FMM leaves, and complete interaction matrices; slow-down factors: 1, 6, 23, 92]
[Figure: Summation stage Flop estimation. Number of Flop versus number of unknowns (1,000 to 20,000); curves: TD-BEM FMM, TD-BEM FMM (FD M2L), and Matrix Approach; slow-down factors: 1340, 388, 203, 154 and 1320, 353, 119, 57]
The numbers above the slower plot represent the slow-down factors against the faster method.
(16)
Reducing the Complexity
Direct computation O(N²) → FMM O(N)
[Figure: all-pairs interactions between the sets X and Y in O(N²), versus the hierarchical FMM interactions in O(N)]
- Spatial decomposition → potential decomposition: fi = fi^near + fi^far
- The near field is computed by direct interactions
- The far field is computed using dedicated operators
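The near/far decomposition can be illustrated with a toy 1D example (my own simplification, not the thesis code: one level of cells of width w and a monopole far-field approximation, whereas the real FMM uses a tree and the P2M/M2M/M2L/L2L/L2P operators; `nearFarPotential` is a hypothetical name):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// f_i = f_i^near + f_i^far for the kernel 1/|x - y| with unit charges.
std::vector<double> nearFarPotential(const std::vector<double>& pos, double w) {
    const std::size_t n = pos.size();
    std::vector<long> cell(n);
    std::map<long, std::pair<double, double>> mono; // cell -> (mass, sum of positions)
    for (std::size_t i = 0; i < n; ++i) {
        cell[i] = static_cast<long>(std::floor(pos[i] / w));
        mono[cell[i]].first += 1.0;
        mono[cell[i]].second += pos[i];
    }
    std::vector<double> f(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        // near field: direct interactions with particles in the same or adjacent cells
        for (std::size_t j = 0; j < n; ++j)
            if (i != j && std::abs(cell[i] - cell[j]) <= 1)
                f[i] += 1.0 / std::fabs(pos[i] - pos[j]);
        // far field: one approximated interaction per well-separated cell (monopole)
        for (const auto& [c, m] : mono)
            if (std::abs(cell[i] - c) > 1)
                f[i] += m.first / std::fabs(pos[i] - m.second / m.first);
    }
    return f;
}
```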
Results (17)
Airplane Simulation
- Acoustics
- N = 23 962
- 10 823 time iterations
- Kmax = 341 interaction matrices Mk (≈ 5.5 × 10^9 NNZ)
- Computing sn ≈ 11 GFlop
- Total ≈ 130 651 GFlop
- ng = 8
- 70GB of data
- CPU node: 2 dodeca-core Intel Xeon E5-2680 (Haswell) at
2.50 GHz and 128 GB (DDR4) of shared memory
- GPUs per node: 4 NVIDIA Kepler K40M (745 MHz),
2880 cores, 12 GB of dedicated memory
Results (18)
Hybrid MPI/OpenMP
[Figure: parallel efficiency of the hybrid MPI/OpenMP version from 1 to 50 nodes, efficiency axis from 0.5 to 1]
Efficiency: uniform distribution, Spherical Expansion/Rotation kernel, N = 200 million, h = 8, Acc = 10−3, from 1 to 50 nodes (24 threads per node); for np = 50 the execution time is 2.24s
Results (19)
Group-tree
- Granularity G
- A group → G cells/leaves
- Good locality
- Low iteration complexity
- Dependencies between cells ≠ dependencies between groups
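The grouping itself can be sketched in a few lines (a hypothetical structure, assuming cells are already sorted, e.g. in Morton order; ScalFMM's actual group-tree stores the cell data contiguously inside each group):

```cpp
#include <cstddef>
#include <vector>

struct Cell { long mortonIndex = 0; /* expansions would live here */ };

struct Group {
    std::vector<Cell> cells;   // up to G cells handled as a single unit
};

// One level of the group-tree: G consecutive cells per group, so the runtime
// manages one task/dependency per group instead of one per cell.
std::vector<Group> buildGroups(const std::vector<Cell>& level, std::size_t G) {
    std::vector<Group> groups;
    for (std::size_t i = 0; i < level.size(); i += G) {
        Group g;
        for (std::size_t j = i; j < level.size() && j < i + G; ++j)
            g.cells.push_back(level[j]);
        groups.push_back(g);
    }
    return groups;
}
```

Larger G means fewer, coarser tasks (less scheduling overhead, better locality) at the cost of less available parallelism.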
Results (20)
Trace - Shared Memory - 24CPUs
N = 20 million, ellipsoid distribution, Spherical Expansion/Rotation kernel, Acc = 10−3, h = 11 and ng = 8000, in 5.2s. Legend: P2P (■), P2M (■), M2M (■), M2L (■), L2L (■), L2P (■) and Idle (■)
Results (21)
Trace - 24CPUs
N = 30 million, uniform distribution, Spherical Expansion/Rotation kernel, Acc = 10−3, h = 7 and ng = 1500, in 15.5s. Legend: P2P (■), P2M (■), M2M (■), M2L (■), L2L (■), L2P (■) and Idle (■)
Particles Interaction Simulations (22)
Test Cases
Interactions between N particles for two distributions:
[Figures: the Uniform and Ellipsoid particle distributions]
The height of the tree (h) is chosen such that the sequential execution time is minimal.
Particles Interaction Simulations (23)
Parallel Strategies for FMM BEM
Three strategies (fork-join OpenMP)
- Threaded FMM: divide each level between threads (classic ScalFMM)
- Threaded kernel: divide the work inside the kernel
- Mix FMM/Kernel: two layers of parallelism, one in the FMM and a second in the kernel
Particles Interaction Simulations (24)
Parallel Executions
[Bar charts: execution time split into matrix generation and solve]

Time (s)           C-927    C-4269    C-10012
Threaded FMM       3,752    15,732     33,408
Threaded Kernel    1,897    15,371     76,931
Mix FMM/Kernel       883     9,426     39,227
Matrix Approach      237    10,080     25,256

The captions of the different cases show the overhead of the FMM TD-BEM against the matrix approach.