Optimization and Parallelization of the Boundary Element Method for the Wave Equation in Time Domain

Bérenger Bramas, Inria, Bordeaux - Sud-Ouest. PhD defense - Feb. 15th 2016. Advisor: Olivier Coulaud (Inria). Industrial co-advisor: Guillaume


  4. Context | Problem Formulation | Matrix Approach | FMM Approach | Conclusion & Perspectives
     TD-BEM Formulation (10): Linear Formulation
     Notations:
     • $\delta\Omega$ discretized in $N$ unknowns/degrees of freedom
     • $M^k$: the convolution matrices (dimension $N \times N$) - input
     • $l^n$: the incident wave emitted by a source on the unknowns at time step $n$ - input
     • $a^n$: the state of the system at time step $n$ - to compute
     Convolution system:
     $$M^0 \cdot a^n + \sum_{k \ge 1}^{K_{max}} M^k \cdot a^{n-k} = l^n \quad (1)$$
     Solve at each time step:
     $$a^n = (M^0)^{-1} \left( l^n - \sum_{k=1}^{K_{max}} M^k \cdot a^{n-k} \right) \quad (2)$$
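The time loop implied by equation (2) can be sketched as follows. This is a minimal illustration on plain Python lists, not the solver presented in the defense: the names `matvec`, `solve_diagonal` and `td_bem_time_loop` are hypothetical, the matrices are dense, the states before the first time step are assumed to be zero, and the $M^0$ solve is replaced by a toy diagonal solver.

```python
def matvec(M, x):
    """Dense matrix/vector product on plain Python lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def solve_diagonal(M0, rhs):
    """Toy stand-in for the M^0 linear solver: assumes M^0 is diagonal."""
    return [rhs[i] / M0[i][i] for i in range(len(rhs))]


def td_bem_time_loop(M, incident, solve=solve_diagonal):
    """Apply equation (2) at every time step.

    M[k] is the convolution matrix M^k (M[0] is M^0) and incident[n] is l^n.
    States before the first time step are taken to be zero (an assumption).
    """
    k_max = len(M) - 1
    states = []
    for n, l_n in enumerate(incident):
        s_n = [0.0] * len(l_n)
        # Summation stage: s^n = sum_{k=1}^{K_max} M^k . a^{n-k}
        for k in range(1, k_max + 1):
            if n - k < 0:
                break  # no state exists before the first step
            contribution = matvec(M[k], states[n - k])
            s_n = [s + c for s, c in zip(s_n, contribution)]
        # Solve stage: a^n = (M^0)^{-1} (l^n - s^n)
        rhs = [l - s for l, s in zip(l_n, s_n)]
        states.append(solve(M[0], rhs))
    return states


# Tiny example: N = 2 unknowns, K_max = 2.
M = [
    [[2.0, 0.0], [0.0, 2.0]],  # M^0 (diagonal so the toy solver applies)
    [[0.0, 1.0], [1.0, 0.0]],  # M^1
    [[1.0, 0.0], [0.0, 1.0]],  # M^2
]
incident = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
states = td_bem_time_loop(M, incident)
```

The inner loop over `k` is the summation stage, which is the part the rest of the talk reorganizes; the call to `solve` stands for the external $M^0$ linear solver.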

  5. TD-BEM Formulation (11): Interaction/Convolution Matrices ($M^k$)
     • Interactions between unknowns
     • Symmetric and sparse, $M^k(i,j) \neq 0$ if $\mathrm{distance}(i,j) \approx k \cdot c \cdot \Delta t$
     • Pre-computed (external tool)
     [Figure: sparsity patterns of $M^0$, $M^1$, $M^2$, ..., $M^{K_{max}}$.]

  6. TD-BEM Formulation (12): Solve (Schematic View)
     $$a^n = (M^0)^{-1} \left( l^n - \sum_{k=1}^{K_{max}} M^k \cdot a^{n-k} \right) \quad (3)$$
     [Figure: one time step. Summation: $s^n = M^1 a^{n-1} + M^2 a^{n-2} + \dots + M^5 a^{n-5}$; then $\tilde{s}^n = l^n - s^n$; then the linear solver computes $a^n$ from $M^0$ and $\tilde{s}^n$.]

  8. TD-BEM Formulation (13): SpMV (sparse matrix/vector product)
     Summation stage → $K_{max}$ SpMVs
     • Permutation, advanced storages/kernels, blocking [White III and Sadayappan, 1997, Pinar and Heath, 1999, Pichel et al., 2005, Vuduc and Moon, 2005]
     • Auto-tuning [Im and Yelick, 2001, Vuduc et al., 2005]
     Low Flop-rate:
     • Memory bound operation
     • Flop/Word hardware limit
     • Irregular/not contiguous memory accesses
     • Instruction (pipelining, vectorization)
     • Not appropriate for GPUs [Garland, 2008, Baskaran and Bordawekar, 2008, Bell and Garland, 2009]

  9. TD-BEM Formulation (14): SpMV (Performance)
     [Figure: bar chart of SpMV performance in GFlop/s (0 to 30, double precision) for COO MKL, CRS MKL, DIA MKL, BCSR MKL, CRS cuSparse and BCSR cuSparse on the test matrices Dense 5000, Diagonal 15/500000, Random 4/20000, Block Random (5/80000) and Dense 200 (x10000).]
     SpMVs MKL/cuSparse (double precision). Peak performance: CPU Haswell Intel Xeon E5-2680 at 2.50 GHz, 20 GFlop/s per core, and K40-M GPU, 1.43 TFlop/s.

  10. TD-BEM Formulation (15): TD-BEM Application Stages
      User inputs, simulation parameters
      ↓
      Mesh generator, configuration, interaction matrices pre-computation
      ↓
      Solver
      · Summation stage
      · $M^0$ Linear Solver (external tool)
      ↓
      Post-processing (TD → FD)

  11. (16) Outline
      • Problem Formulation
      • BEM Solver (Matrix Approach)
      • Fast-Multipole Method Approach
      • FMM Algorithm & Parallelization
      • FMM BEM Solver (Experimental Implementation)
      • Conclusion & Perspectives

  13. Improving the Summation (17): Computational Ordering
      $$s^n(i) = \sum_{k=1}^{K_{max}} \sum_{j=1}^{N} M^k(i,j) \times a^{n-k}(j), \quad 1 \le i \le N. \quad (4)$$
      [Figure: the matrices $M^1, \dots, M^6$ and the past states $A^0, \dots, A^5$ that produce $S^6$, cut along each of the three loop indices: Top ($i$), Front ($k$)/SpMV, Side ($j$).]
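The three loop indices of equation (4) can be nested in any order. The sketch below compares the front ordering ($k$ outermost, one matrix/vector product per $M^k$) with the side ordering ($j$ outermost), using dense matrices on plain Python lists. The function names and the convention `past[k - 1]` = $a^{n-k}$ are assumptions made for the illustration.

```python
def summation_front(M, past):
    """k outermost: one matrix/vector product per M^k (the SpMV ordering)."""
    n_unknowns = len(past[0])
    s = [0.0] * n_unknowns
    for k in range(1, len(M)):
        for i in range(n_unknowns):
            for j in range(n_unknowns):
                s[i] += M[k][i][j] * past[k - 1][j]
    return s


def summation_side(M, past):
    """j outermost: each j visits column j of every M^k."""
    n_unknowns = len(past[0])
    s = [0.0] * n_unknowns
    for j in range(n_unknowns):
        for i in range(n_unknowns):
            for k in range(1, len(M)):
                s[i] += M[k][i][j] * past[k - 1][j]
    return s


# M[0] (M^0) is not used by the summation stage.
M = [
    None,
    [[1.0, 2.0], [3.0, 4.0]],  # M^1
    [[0.0, 1.0], [1.0, 0.0]],  # M^2
]
past = [[1.0, 2.0], [3.0, 4.0]]  # past[0] = a^{n-1}, past[1] = a^{n-2}

s_front = summation_front(M, past)
s_side = summation_side(M, past)
```

Both functions perform the same multiplications; only the traversal order, and therefore the memory access pattern, differs. With general floating-point data the two results may differ by rounding, since the additions are reordered.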

  14. Improving the Summation (18): Structure of a Slice Matrix
      A $Slice^j$:
      • When outer loop index is $j$
      • The concatenation of column $j$ of the interaction matrices $M^k$ (except $M^0$)
      • Size $(N \times (K_{max} - 1))$
      • There is one dense vector per row
      • $Slice^j(i,k) = M^k(i,j) \neq 0$ with $k_s = d(i,j) / (c \Delta t)$ and $k_s \le k \le k_s + p$
      [Figure: $Slice^j$ built from the columns $M^1(*,j), M^2(*,j), \dots, M^{12}(*,j)$.]

  15. Improving the Summation (19): Computing with a Slice Matrix
      [Figure: $Slice^j$ (columns $M^1(*,j), \dots, M^{12}(*,j)$) multiplied by the past states $a^{*<n}(j)$ of unknown $j$ to contribute to $s^n$.]
      Computation with $N$ vector/vector products (one per line):
      • Regular memory access (vectorization, pipelining)
      • Low Flop/word ratio (same as SpMV)
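A slice and its use in the summation can be sketched as below. This is an illustration only: `build_slice`, `apply_slice` and `summation_with_slices` are hypothetical names, and each row is stored densely over all $k$, whereas the slides keep only the nonzero run $k_s \le k \le k_s + p$ of each row.

```python
def build_slice(M, j):
    """Slice^j(i, k) = M^k(i, j): column j of every M^k, k >= 1, side by side."""
    n_unknowns = len(M[1])
    return [[M[k][i][j] for k in range(1, len(M))] for i in range(n_unknowns)]


def apply_slice(slice_j, past_j):
    """One vector/vector product per row of the slice.

    past_j[k - 1] holds a^{n-k}(j), the past states of unknown j.
    """
    return [sum(m * a for m, a in zip(row, past_j)) for row in slice_j]


def summation_with_slices(M, past):
    """s^n as the sum over j of the contributions of each slice."""
    n_unknowns = len(past[0])
    s = [0.0] * n_unknowns
    for j in range(n_unknowns):
        past_j = [state[j] for state in past]
        contribution = apply_slice(build_slice(M, j), past_j)
        s = [acc + c for acc, c in zip(s, contribution)]
    return s


M = [
    None,                       # M^0 is excluded from the slices
    [[1.0, 2.0], [3.0, 4.0]],   # M^1
    [[0.0, 1.0], [1.0, 0.0]],   # M^2
]
past = [[1.0, 2.0], [3.0, 4.0]]  # past[0] = a^{n-1}, past[1] = a^{n-2}

slice_0 = build_slice(M, 0)
s_n = summation_with_slices(M, past)
```

In a real code the slices would be built once, before the time loop, rather than at every call as in this sketch.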

  20. Improving the Summation (20): Improving the Flop/Word Ratio
      [Figure: the states of unknown $j$ ($a^{n-9}, \dots, a^{n-1}$, followed by the not yet computed $a^n, a^{n+1}, a^{n+2}$, marked "?") are multiplied by $Slice^j$ (columns $M^1(*,j), \dots, M^9(*,j)$). One pass gives $s^n$; shifting the state vector gives $s^n$ and $s^{n+1}$ ($n_g = 2$), then $s^n$, $s^{n+1}$ and $s^{n+2}$ ($n_g = 3$). In the last frames the unknown future states are replaced by 0.]
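The idea of producing $n_g$ summation values from one row of a slice can be sketched as below, assuming that states which are not yet computed are treated as zeros, as in the last frames of the figure. The function name and argument layout are hypothetical; the contributions of the missing states have to be added later, once those states are known.

```python
def multi_vectors_vector(row, history, n_g):
    """Partial values of s^n(i), ..., s^{n+n_g-1}(i) from one slice row.

    row[k - 1] = M^k(i, j) for k = 1..K, history[t] = a^t(j) for t = 0..n-1.
    States a^t(j) with t >= n are unknown and counted as zero.
    """
    n = len(history)
    out = []
    for g in range(n_g):
        acc = 0.0
        for k, m in enumerate(row, start=1):
            t = n + g - k  # time index of the state multiplied by M^k
            if 0 <= t < n:
                acc += m * history[t]
        out.append(acc)
    return out


row = [1.0, 2.0, 3.0]            # M^1(i, j), M^2(i, j), M^3(i, j)
history = [1.0, 1.0, 1.0, 1.0]   # a^0(j) .. a^3(j), so n = 4
partial = multi_vectors_vector(row, history, n_g=3)
```

Here `partial[0]` is the complete contribution to $s^4(i)$, while `partial[1]` and `partial[2]` still lack the terms that involve $a^4(j)$ and $a^5(j)$. The same row and the same window of states are reused for all $n_g$ outputs, which is what raises the Flop/word ratio.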

  22. Multi-Vectors/Vector Product (21): Flop/Word Ratio
      Vector length $v = 4$, group size $n_g = 4$ ($v \times n_g \times 2$ Flops).
      Number of words accessed by each kernel:
      • Vectors product (≈ SpMV): $n_g (2v + 1)$
      • Vector/matrix product: $v + n_g (v + 1)$
      • Multi-vectors/vector product: $(v + n_g - 1) + (v) + (n_g)$
      [Figure: data layouts of the vector/vector product, the vector/matrix product and the multi-vectors/vector product; plot of the Flop/Word ratio F/W ($n_g = 8$, 0 to 6) against $v$ (0 to 20).]
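The word counts above can be turned into Flop/word ratios directly. The short sketch below evaluates them for the slide's example ($v = 4$, $n_g = 4$); the function names are hypothetical.

```python
def words_accessed(v, n_g):
    """Word counts of the three kernels, as listed on the slide."""
    return {
        "vectors product": n_g * (2 * v + 1),
        "vector/matrix product": v + n_g * (v + 1),
        "multi-vectors/vector product": (v + n_g - 1) + v + n_g,
    }


def flop_per_word(v, n_g):
    """All three kernels perform v * n_g * 2 flops."""
    flops = 2 * v * n_g
    return {name: flops / words for name, words in words_accessed(v, n_g).items()}


words = words_accessed(4, 4)
ratios = flop_per_word(4, 4)
```

For $v = 4$ and $n_g = 4$ the kernels access 36, 24 and 15 words for the same 32 flops, so the multi-vectors/vector product has the highest ratio of the three.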

  23. Multi-Vectors/Vector Product (22): Multi-vectors/vector Product (CPU)
      [Figure: two plots of speed (GFlop/s, 0 to 20) against the length of vectors ($v$, 0 to 80) for the AVX-Asm, AVX-Intrinsic, AVX-Template, SSE-Intrinsic and Compiler Version kernels; one panel for $N_r = 1\,024$ and one for $N_r = 20\,480$.]
      Plots show the GFlop/s with $n_g = 8$ for test cases of dimension $N_r \times v$ (in double precision). Haswell Intel Xeon E5-2680 at 2.50 GHz (20 GFlop/s).

  25. Multi-Vectors/Vector Product on GPUs (23): GPUs Slice Storages
      • Blocking scheme (small conversion overhead)
      • Data access appropriate for SIMT/SIMD
      • Memory accesses (coalesced, low bank conflicts)
      • Data re-use (shared memory)
      • CPU/GPU Balancing
      [Figure: panels (a), (b) and (c) illustrating the slice storage with $a^{*<n}(j)$ and $Slice^j$.]

  26. Parallelization (24): Parallelization
      Sequential algorithm:
      [Figure: $s^n = M^1 a^{n-1} + M^2 a^{n-2} + \dots + M^5 a^{n-5}$; then $\tilde{s}^n = l^n - s^n$; then the linear solver computes $a^n$ from $M^0$ and $\tilde{s}^n$.]

  27. Parallelization (25): Parallel Solver (Schematic View)
      [Figure: the processes P0, P1 and P2 each compute a partial $s^n$ from $M^1, \dots, M^5$ and $a^{n-1}, \dots, a^{n-5}$; the partial results are summed, $\tilde{s}^n = l^n - s^n$ is formed, and the parallel linear solver computes $a^n$ from $M^0$ and $\tilde{s}^n$.]

  28. Parallelization (26): Parallel Solver with $n_g > 1$ (Schematic View)
      [Figure: $T/n_g$ outer loops. In each one, P0, P1 and P2 compute partial $s^n, s^{n+1}, s^{n+2}$ from $M^1, \dots, M^5$ and $a^{n-1}, \dots, a^{n-5}$. Then, for $n_g$ inner loops of index $f$: the partial $s^{n+f}$ are summed, $\tilde{s}^{n+f} = l^{n+f} - s^{n+f}$ is formed, the parallel linear solver computes $a^{n+f}$ from $M^0$ and $\tilde{s}^{n+f}$, and a radiation step on each process applies $M^1, M^2$ to $a^{n+f}$ to update $s^{n+f+1}, s^{n+f+2}$.]
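The structure of one parallel time step (case $n_g = 1$) can be imitated sequentially as below. This is only a sketch under assumptions that the slides do not state: each process is assumed to own a contiguous block of unknowns $j$ (that is, of slices), the reduction is a plain sum standing in for a message-passing all-reduce, and the $M^0$ solve is a toy diagonal solver. All function names are hypothetical.

```python
def owned_columns(rank, n_procs, n_unknowns):
    """Contiguous block of unknowns j assigned to one process (assumed split)."""
    start = rank * n_unknowns // n_procs
    stop = (rank + 1) * n_unknowns // n_procs
    return range(start, stop)


def local_summation(M, past, columns):
    """Partial s^n restricted to the slices owned by one process."""
    n_unknowns = len(past[0])
    s = [0.0] * n_unknowns
    for j in columns:
        for i in range(n_unknowns):
            for k in range(1, len(M)):
                s[i] += M[k][i][j] * past[k - 1][j]
    return s


def reduce_sum(partials):
    """Stand-in for the reduction across processes."""
    return [sum(values) for values in zip(*partials)]


def solve_diagonal(M0, rhs):
    """Toy stand-in for the parallel M^0 solver: assumes M^0 is diagonal."""
    return [rhs[i] / M0[i][i] for i in range(len(rhs))]


def parallel_step(M, past, l_n, n_procs, solve=solve_diagonal):
    partials = [
        local_summation(M, past, owned_columns(rank, n_procs, len(l_n)))
        for rank in range(n_procs)
    ]
    s_n = reduce_sum(partials)
    rhs = [l - s for l, s in zip(l_n, s_n)]
    return partials, solve(M[0], rhs)


M = [
    [[2.0, 0.0], [0.0, 2.0]],  # M^0
    [[1.0, 2.0], [3.0, 4.0]],  # M^1
    [[0.0, 1.0], [1.0, 0.0]],  # M^2
]
past = [[1.0, 2.0], [3.0, 4.0]]  # past[0] = a^{n-1}, past[1] = a^{n-2}
partials, a_n = parallel_step(M, past, l_n=[10.0, 16.0], n_procs=2)
```

The only communication points in this sketch are the reduction of the partial sums and the solve, which matches the "few communication points" emphasized in the summary of the matrix approach.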

  29. Results (27): Airplane Simulation
      • Acoustics
      • $N = 23\,962$
      • 10 823 time iterations
      • $K_{max} = 341$ interaction matrices $M^k$
      • $n_g = 8$
      • 70 GB of data
      • Double precision
      • Homogeneous node: 24-core CPU (128 GB memory)
      • Heterogeneous node: 24-core CPU (128 GB memory) and 4 K40M GPUs (12 GB memory)

  32. Results (28): Parallel Efficiency/Percentage (Homogeneous)
      [Figure: parallel efficiency (0 to 1) of the Full-MPI version against the number of nodes (1, 10, 20), and percentage of time (0 to 100 %) spent in Summation, Direct solver $M^0$ and Idle against the number of nodes (1, 10, 20).]
      Summation stage ↘, $M^0$ Solve →

  36. Results (29): With GPUs
      [Figure: execution time (seconds, axis from 300 to 20 000) and speedup against CPU-Only (axis from 1.0 to 21.0) versus the number of nodes (1 to 5), for CPU-Only, 1GPU, 2GPU, 3GPU and 4GPU.]
      Problem ≈ 70 GB / GPU 12 GB memory

  37. Summary (30)
      Summary:
      • New computational ordering [Bramas et al., 2014]
      • Solver with few communication points
      Additional contributions:
      • Permutations/SpMV
      • Efficient SIMD kernel CPU
      • Efficient blocking scheme/kernel for GPU [Bramas et al., 2015]
      • Dynamic balancing (CPU/GPU)
      Limits:
      • $M^0$ Linear solver
      • GPUs' memory
      • Interaction matrices construction
      • Complexity → $O(N^2)$ for each iteration

  38. (31) Outline
      • Problem Formulation
      • BEM Solver (Matrix Approach)
      • Fast-Multipole Method Approach
      • FMM Algorithm & Parallelization
      • FMM BEM Solver (Experimental Implementation)
      • Conclusion & Perspectives

  40. FMM Algorithm (32): FMM Operators (1D)
      • Spatial decomposition → Potential decomposition: $f_i = f_i^{near} + f_i^{far}$
      • Near field by direct interactions (leaves)
      • Far field with FMM operators (tree)
      [Figure: 1D tree with levels $l = 0$, $l = 1$, $l = 2$, $l = 3$ and the operators P2P, M2M, M2L and L2L.]
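The split $f_i = f_i^{near} + f_i^{far}$ can be illustrated on a uniform 1D partition into leaf cells. The sketch below is not an FMM: the far field is evaluated directly instead of through the M2M, M2L and L2L operators, the neighbor criterion (adjacent leaves) is an assumption, and the kernel $1/|x - y|$ and all names are chosen only for the example.

```python
def leaf_index(x, n_leaves):
    """Leaf cell of a point x in [0, 1] for a uniform 1D partition."""
    return min(int(x * n_leaves), n_leaves - 1)


def split_potential(points, charges, n_leaves, kernel):
    """Return (near, far) with near[i] + far[i] equal to the full potential."""
    near = [0.0] * len(points)
    far = [0.0] * len(points)
    for i, x_i in enumerate(points):
        cell_i = leaf_index(x_i, n_leaves)
        for j, x_j in enumerate(points):
            if i == j:
                continue
            value = charges[j] * kernel(x_i, x_j)
            if abs(cell_i - leaf_index(x_j, n_leaves)) <= 1:
                near[i] += value  # direct interaction between neighbor leaves (P2P)
            else:
                far[i] += value   # a real FMM would use M2M/M2L/L2L here
    return near, far


def inverse_distance(x, y):
    return 1.0 / abs(x - y)


points = [0.05, 0.15, 0.6, 0.95]
charges = [1.0, 1.0, 1.0, 1.0]
near, far = split_potential(points, charges, n_leaves=8, kernel=inverse_distance)
total_0 = sum(inverse_distance(points[0], x) for x in points[1:])
```

With 8 leaves (level $l = 3$), the point at 0.6 has no neighbor in an adjacent leaf, so its potential is entirely far field in this example.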

  41. FMM Parallelization (33)
      Related work:
      • Multicore study [Chandramowlishwaran et al., 2010]
      • NVidia GPU [Yokota and Barba, 2011]
      • Distributed GPU [Hamada et al., 2009]
      • Distributed CPU/GPU [Hu et al., 2011, Lashuk et al., 2012, Malhotra and Biros, 2015]
      • Using a runtime system (multicore) [Ltaief and Yokota, 2014]

  44. FMM Parallelization (34): Paradigms
      • Fork-join
        - Parallel-for (OpenMP)
      • Task-based
        - Tasks pool (OpenMP 3.1) [Agullo et al., 2014]
        - Tasks-and-dependencies (runtime systems, OpenMP 4)
      Reference: Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2014). Task-based FMM for multicore architectures. SIAM Journal on Scientific Computing, 36(1):C66-C93.

  48. FMM Parallelization (35): Tasks-and-Dependencies Model (OpenMP 4, StarPU)
      CPU/GPU
      Challenges
      • Granularity
      • Computational kernels
      • Scheduling

  49-52. FMM Parallelization: Scheduling
      [Figure: a scheduler dispatching the ready tasks A to F to the workers CPU0, CPU1 and GPU0]
      • Priority
      • Work stealing [Blumofe and Leiserson, 1999]
      • Heterogeneous Earliest Finish Time (HEFT) [Topcuoglu et al., 2002]
      Drawbacks:
      • Calibration
      • Overhead
      • Ready-tasks view

  53-57. FMM Parallelization: Heteroprio
      • Heteroprio [Agullo et al., 2015]
      • Steady state: execute each task where it has the best acceleration factor
      • Critical state: a worker executes a task only if doing so does not delay the hypothetical end of the execution
      [Figure: animation of the scheduler handing the tasks A to F to CPU0, CPU1 and GPU0 through separate CPU and GPU priority lists]
      Reference: Agullo, E., Bramas, B., Coulaud, O., Darve, E., Messner, M., and Takahashi, T. (2015). Task-based FMM for heterogeneous architectures. Concurrency and Computation: Practice and Experience.
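      The steady-state rule can be illustrated with one queue per task type and, for each kind of worker, its own priority order over those queues. This is a simplified Python sketch under stated assumptions, not the StarPU or ScalFMM implementation. The class name `BucketScheduler`, the priority lists and the task names are hypothetical, and the critical-state rule is deliberately left out.

```python
from collections import deque


class BucketScheduler:
    """One FIFO bucket per task type; each worker kind scans the buckets
    in its own order of preference (steady-state behaviour only)."""

    def __init__(self, priorities):
        # priorities: worker kind -> task types, most preferred first.
        # A task type missing from a list cannot run on that worker kind.
        self.priorities = priorities
        self.buckets = {}

    def push(self, task_type, task):
        self.buckets.setdefault(task_type, deque()).append(task)

    def pop(self, worker_kind):
        for task_type in self.priorities[worker_kind]:
            bucket = self.buckets.get(task_type)
            if bucket:  # bucket exists and is not empty
                return bucket.popleft()
        return None  # nothing this worker kind can execute


# Assumed preferences: the GPU favours P2P, the CPU favours M2L and
# is the only worker kind that can run P2M.
sched = BucketScheduler({
    "cpu": ["M2L", "P2M", "P2P"],
    "gpu": ["P2P", "M2L"],
})
for task_type, task in [("P2P", "A"), ("M2L", "B"), ("P2M", "C"), ("P2P", "D")]:
    sched.push(task_type, task)

picks = [sched.pop(kind) for kind in ["gpu", "cpu", "gpu", "cpu", "gpu"]]
print(picks)
```

      Because workers only look at per-type buckets, a pop costs a short scan of a priority list and needs no per-task performance model. This relates to the calibration and overhead drawbacks listed for the other policies.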

  58. FMM Particles Interaction Simulations: Test Case
      [Figure: platform with a 24-core CPU and four GPUs (GPU 1 to GPU 4)]
      • N = 30 million particles
      • Spherical expansion/rotation kernel
      • Acc = $10^{-3}$, h = 7 and granularity = 1500

  59-64. FMM Particles Interaction Simulations: Traces, Heterogeneous Executions
      Configuration        Execution time
      0 GPU / 24 CPUs      15.5 s
      1 GPU / 23 CPUs      13.4 s
      2 GPUs / 22 CPUs     10.9 s
      3 GPUs / 21 CPUs     9.4 s
      4 GPUs / 20 CPUs     8.7 s
      [Figure: execution traces for each configuration. Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle]
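      The execution times reported on these traces can be turned into speedups over the CPU-only run. The short Python computation below uses only the five timings from the slides; the variable names are illustrative.

```python
# Execution time in seconds, indexed by the number of GPUs used.
times = {0: 15.5, 1: 13.4, 2: 10.9, 3: 9.4, 4: 8.7}

baseline = times[0]  # CPU-only run on 24 cores
speedups = {gpus: baseline / t for gpus, t in times.items()}

for gpus, s in speedups.items():
    print(f"{gpus} GPU(s): {times[gpus]:.1f} s, speedup x{s:.2f}")
```

      With four GPUs the speedup over the CPU-only execution is roughly 1.8, and each additional GPU brings a smaller gain than the previous one.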

  65. FMM Particles Interaction Simulations: Test Case
      [Figure: platform with a 24-core CPU and four GPUs (GPU 1 to GPU 4)]
      • N = 30 million particles
      • Uniform/Lagrange kernel
      • Acc = $\{10^{-5}, 10^{-7}\}$, h = 7 and granularity = 1500

  66-67. FMM Particles Interaction Simulations: Traces, Heterogeneous Executions (4 GPUs)
      Accuracy             Execution time
      Acc = $10^{-5}$      7.9 s
      Acc = $10^{-7}$      17 s
      [Figure: execution traces for both accuracies. Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle]

  68. FMM Particles Interaction Simulations: Test Cases
      [Figure: platform with seven nodes (Node 1 to Node 7) of 24 cores each]
      • N = 200 million particles
      • Spherical expansion/rotation kernel
      • Acc = $10^{-3}$, h = 8 and granularity = 2000

  69. FMM Particles Interaction Simulations: Trace, 7 nodes × 24 CPUs
      [Figure: execution trace. Legend: P2P, P2M, M2M, M2L, L2L, L2P and Idle]

  70-71. Summary
      Summary:
      • Generic
      • Kernel independent
      • Architecture independent
      • Performance portability
      Additional contributions:
      • Commutativity expression in FMM
      • MPI/OpenMP implementation
      All included in ScalFMM (C++/HPC library)

  72. Outline
      • Problem Formulation
      • BEM Solver (Matrix Approach)
      • Fast-Multipole Method Approach
        - FMM Algorithm & Parallelization
        - FMM BEM Solver (Experimental Implementation)
      • Conclusion & Perspectives

  73-74. Propagation of the Current State to the Future
      Gathering from the past (drawn with five convolution matrices): $s_n = \sum_{k=1}^{5} M_k \cdot a_{n-k}$ and $\tilde{s}_n = l_n - s_n$, then the linear solver computes $a_n$ from $M_0 \cdot a_n = \tilde{s}_n$.
      Propagating to the future: once $a_n$ is known, its contribution is added to the future sums, $s_{n+k} \leftarrow s_{n+k} + M_k \cdot a_n$ for $k = 1, \dots, 5$.
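      Both orderings evaluate the same convolution system $M_0 \cdot a_n + \sum_{k \geq 1} M_k \cdot a_{n-k} = l_n$. The pure-Python sketch below checks this on a tiny made-up problem. The matrices, the right-hand sides and the helper names (`matvec`, `solve`, `step_from_past`, `step_to_future`) are assumptions for illustration, and the dense Gaussian elimination stands in for whatever linear solver the actual code uses.

```python
def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def solve(m, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(m, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def step_from_past(M, rhs):
    """At step n, gather s_n = sum_k M_k a_{n-k} from the stored past states."""
    states = []
    for n, l_n in enumerate(rhs):
        s_n = [0.0] * len(l_n)
        for k in range(1, len(M)):
            if n - k >= 0:
                s_n = [s + c for s, c in zip(s_n, matvec(M[k], states[n - k]))]
        states.append(solve(M[0], [l - s for l, s in zip(l_n, s_n)]))
    return states


def step_to_future(M, rhs):
    """At step n, solve for a_n, then scatter M_k a_n into the sums s_{n+k}."""
    sums = [[0.0] * len(l) for l in rhs]
    states = []
    for n, l_n in enumerate(rhs):
        a_n = solve(M[0], [l - s for l, s in zip(l_n, sums[n])])
        states.append(a_n)
        for k in range(1, len(M)):
            if n + k < len(rhs):
                sums[n + k] = [s + c for s, c in zip(sums[n + k], matvec(M[k], a_n))]
    return states


# Made-up data: N = 2 unknowns, K_max = 2, four time steps.
M = [
    [[2.0, 0.0], [1.0, 3.0]],    # M_0
    [[0.5, 0.0], [0.0, 0.5]],    # M_1
    [[0.0, 0.25], [0.25, 0.0]],  # M_2
]
rhs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

past = step_from_past(M, rhs)
future = step_to_future(M, rhs)
max_diff = max(abs(x - y) for p, f in zip(past, future) for x, y in zip(p, f))
print(max_diff)
```

      The second form needs only the newest state $a_n$ at each step, so the product by each $M_k$ can be replaced by another operator applied to $a_n$ without keeping the history of past states.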

  75. With FMM
      [Figure: propagation of $a_n$ to the future, where $M_1$ and $M_2$ are still applied as matrices to update $s_{n+1}$ and $s_{n+2}$, and the contributions to $s_{n+3}$, $s_{n+4}$ and $s_{n+5}$ are computed by the FMM]
      • Far interactions in time (between far elements in space) are computed by the FMM
      • The spatial decomposition is given by the octree

  76. Overview
      • The octree is built over a mesh (integration points)
      • Interaction matrices between leaves
      • Approximation/FMM
        - development in the time domain (TD)
        - multipole: what a cell emits to the outside
        - local: what a cell receives from the outside
        - operators in the frequency domain (FD) or in the TD
        - accurate up to a chosen frequency
        - in the TD, the results of the matrix approach ≠ those of the FMM
      [Figures: complete unit sphere; truncated unit sphere]

  77-81. Operators (Overview)
      • P2M
        - compute what is emitted by a leaf to the outside
      • M2M/L2L
        - extrapolation + time shift
      • M2L
        - convolution product in TD (term-by-term multiplication in FD)
      • L2P
        - integration
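      The remark on M2L rests on the convolution theorem: a convolution of two time signals equals a term-by-term product of their Fourier transforms. The Python sketch below checks that identity with a naive discrete Fourier transform. The two signals are made up, the function names are illustrative, and the actual transfer functions of the time-domain FMM are not reproduced here.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def convolve_td(u, v):
    """Linear convolution computed directly in the time domain."""
    out = [0.0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] += ui * vj
    return out


def convolve_fd(u, v):
    """Same convolution through the frequency domain: zero-pad both signals,
    transform, multiply term by term, transform back."""
    n = len(u) + len(v) - 1
    U = dft(list(u) + [0.0] * (n - len(u)))
    V = dft(list(v) + [0.0] * (n - len(v)))
    W = [a * b for a, b in zip(U, V)]
    return [w.real for w in dft(W, inverse=True)]


# Made-up signals standing in for a multipole expansion and a transfer function.
multipole = [1.0, 2.0, 0.5, 0.0, -1.0]
transfer = [0.5, 0.25, 0.0, 1.0]

td = convolve_td(multipole, transfer)
fd = convolve_fd(multipole, transfer)
max_error = max(abs(a - b) for a, b in zip(td, fd))
print(max_error)
```

      The time-domain form costs a double loop over both signals, whereas the frequency-domain form costs one multiplication per frequency once the transforms are available.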

  82. Results: Cone-Sphere Test Cases
      Case                                  C-927    C-4269    C-10012
      Number of unknowns                    927      4269      10012
      FMM tree height                       3        4         5
      Number of leaves                      16       64        234
      Number of M_k matrices (K_max)        117      244       370
      Number of M_k matrices (leaves)       60       64        49
      Number of time steps (T)              2033     4345      6647

  83. Results: Sequential Executions
      TD vs. FD operators:
      Stage               FMM (TD)    FMM (TD + FD-M2L)   FMM (FD)    Matrix approach
      M_k construction    76 s        76 s                76 s        242 s
      Solve               58 122 s    53 241 s            97 861 s    7.8 s (*)
      Total               58 198 s    53 317 s            97 937 s    249.8 s
      Execution time of the TD-FMM vs. the matrix approach to solve case C-927 in double precision.
      (*) Our optimized BEM solver.

  84. Results: Parallel Executions (FMM vs. Matrix Approach)
      [Figure: three stacked bar charts of time (s), split into matrix generation and solve, comparing the FMM with the matrix approach. Panels: C-927 (×3.8), C-4269 (×1), C-10012 (×1.4). Bar labels surviving from the charts: 883, 237, 10,080, 9,426, 33,408 and 25,256.]
      The panel captions give the overhead of the FMM TD-BEM relative to the matrix approach.

  85. Summary
      Summary:
      • Preliminary results
      • Best configuration: TD + FD M2L
      • Not competitive against the direct approach (maybe on larger test cases)
      • Any improvement of the matrix creation will make the FMM less competitive
      Additional contributions:
      • Incomplete/4D FMM
      • Sphere discretization/length APS signal

  86. Conclusion & Perspectives
