Stencil-like operations on unstructured meshes
Christian Engwer
joint work with P. Bastian, J. Fahlke, S. Müthing
13.04.2015, Schloss Dagstuhl
Westfälische Wilhelms-Universität Münster (WWU Münster)

Solving Partial Differential Equations
◮ Wide range of applications
◮ In general requires unstructured meshes
◮ Not accessible to classical stencil approaches
[Unat et al. 2012]

The EXA-DUNE Project
Stencils vs. FEM?
Introducing Local Structure
Exploiting Local Structure for Vectorization
Matrix-free vs. matrix-based solvers
Discussion

Framework approach to software development
◮ Open-source C++ framework
◮ Integrated toolbox of simulation components
◮ Existing body of complex applications
◮ Scalability for traditional MPI applications
[Figure: DUNE module diagram with core modules (grid, istl, localfunctions), discretization modules (pdelab, fem), extra grids and external modules]
[Bastian, Blatt, Dedner, E, Klöfkorn, Kornhuber, Ohlberger, Sander 2008] http://www.dune-project.org/
Challenges
◮ Incorporate new algorithms and hardware paradigms
◮ Integrate changes across simulation stages (Amdahl's Law)
◮ Provide a "reasonable" upgrade path for existing applications

Framework approach to software development
◮ DUNE + FEAST = Flexibility + Performance
◮ General Software Frameworks
  → co-design for specific hardware platforms is not sufficient
◮ Hardware-Oriented Numerics
  → standard low-order algorithms no longer scale
⇒ Much more than a pure implementation issue
[Figure: EXA-DUNE hardware model (nodes coupled by MPI; CPUs and accelerators with memory, L1/L2 caches and SIMD units)]
EXA-DUNE hardware model
Coarse-grained: MPI between heterogeneous nodes
Medium-grained: multicore CPUs, GPUs, MICs, APUs, ...
Fine-grained: vectorization, GPU 'threads', ...
PS: CPUs are catching up: E5-2699v3: 0.9 TF at 145 W, K80: 2.9 TF at 300 W

◮ Structured data layout
◮ Define update operation, based on local neighbourhood, as
  $y_i = f(x_i, N(x_i))$ for each $i \in [0, \#\mathrm{DOFs} - 1]$, for some neighbourhood $N$
◮ Data parallel pattern
◮ Easily vectorized
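
The update rule above can be sketched in C++ for a one-dimensional three-point stencil. The weights, the function name `applyStencil`, and the choice to leave the two boundary entries at zero are assumptions made for illustration; they are not taken from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: y_i = f(x_i, N(x_i)) with N(x_i) = {x_{i-1}, x_{i+1}}
// and f a weighted sum (the 1D Laplacian stencil [-1, 2, -1]).
std::vector<double> applyStencil(const std::vector<double>& x) {
  std::vector<double> y(x.size(), 0.0);
  // Every interior index i is independent of all others (data parallel),
  // and all accesses are contiguous, so this loop is a good candidate
  // for compiler vectorization.
  for (std::size_t i = 1; i + 1 < x.size(); ++i)
    y[i] = -x[i - 1] + 2.0 * x[i] - x[i + 1];
  return y;
}
```

Because no two iterations write to the same entry of `y`, there are no read/write conflicts to resolve.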

◮ Based on a weak formulation
◮ Loop over cells
◮ Compute local contributions to global residual/stiffness matrix
◮ Similar to Map/Reduce pattern
  $A = \sum_E R_E A_E R_E^T$
  $A_E = (a(\phi_i, \phi_j))_{i,j}$ with $\operatorname{supp} \phi_i \cap E \neq \emptyset$ and $\operatorname{supp} \phi_j \cap E \neq \emptyset$
Challenges:
◮ size of $A_E$ varies
◮ indirect memory access (gather/scatter)
◮ read/write conflicts
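
The cell loop with gather and scatter can be sketched for the 1D Poisson problem with linear elements on a uniform mesh. This is a minimal illustration, not DUNE code: the names `assembleResidual` and `cellDofs` and the hard-coded local matrix are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: residual r = A u for -u'' with P1 elements on a
// uniform 1D mesh of width h, evaluated cell by cell as
// A = sum_E R_E A_E R_E^T without storing the global matrix.
std::vector<double> assembleResidual(const std::vector<double>& u, double h) {
  std::vector<double> r(u.size(), 0.0);
  if (u.size() < 2) return r;  // no cells
  const std::size_t numCells = u.size() - 1;
  // Local stiffness matrix A_E; identical on every cell of a uniform mesh.
  const double AE[2][2] = {{1.0 / h, -1.0 / h}, {-1.0 / h, 1.0 / h}};
  for (std::size_t e = 0; e < numCells; ++e) {
    // Indirection table (the role of R_E): global DOFs touching cell e.
    const std::array<std::size_t, 2> cellDofs = {{e, e + 1}};
    // Gather: indirect reads from the global vector.
    const double ul[2] = {u[cellDofs[0]], u[cellDofs[1]]};
    // Local computation (the "map" step).
    const double rl[2] = {AE[0][0] * ul[0] + AE[0][1] * ul[1],
                          AE[1][0] * ul[0] + AE[1][1] * ul[1]};
    // Scatter-add (the "reduce" step): neighbouring cells write to the same
    // vertex DOF, which is the read/write conflict when cells run in parallel.
    for (std::size_t i = 0; i < 2; ++i) r[cellDofs[i]] += rl[i];
  }
  return r;
}
```

In this sequential form the scatter is safe; a parallel cell loop would need one of the conflict-resolution strategies discussed at the end of the talk.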

◮ Neighbor data
  ◮ CCFV and DG methods
  ◮ similar to classical stencil semantics
  ◮ explicit read access to neighbor data
◮ Vertex, face or edge data
  ◮ e.g. conforming FEM, Raviart-Thomas, Nédélec
  ◮ several cells contributing to the same DOF
  ◮ implicitly shared: read/write conflicts
◮ Element-local data
  ◮ DG methods
  ◮ no coupling to other cells
  ◮ simple, no conflicts

Locally structured, globally unstructured data → increase algorithmic intensity
◮ Topological tensor product meshes
◮ Virtual local refinement
◮ Higher order methods: Discontinuous Galerkin, Reduced Basis

Vectorization of local operator (user code)
◮ hardly possible in user code
◮ Ideas:
A)
  ◮ Evaluate cell assembly
  ◮ 90% compatible with old code
  ◮ works also for low-order methods
  ◮ Vectorize over cells
B)
  ◮ Spectral DG + sum factorization (reduce computational complexity)
  ◮ Point evaluations (simplify user code)
  ◮ Vectorize over point evaluations (quadrature point × test function)

◮ Intrinsics (non-portable)
◮ Special language (needs special compiler)
◮ Auto-vectorization (difficult to drive portably)
◮ Vectorization library
  ◮ Hide intrinsics beneath a portable interface
  ◮ Implementations: e.g. Vc, Boost.SIMD, NGSolve

◮ Patch Grid
  ◮ reduce costs of unstructured meshes
  ◮ extract subset of mesh
  ◮ store as flat grid
  ◮ store in consecutive arrays, without pointers
  ◮ add structured refinement
  ◮ exploit local structure
  ◮ improve data locality
◮ Vectorization library
  ◮ Hide intrinsics beneath a portable interface
  ◮ Use operator overloading
[Figure: entries of the global coefficients vector gathered into local coefficients vectors v0, v1, v2, v3]
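
A portable vector type with operator overloading might look like the following sketch. The `Simd4` type, the `gather` helper and the `axpy` kernel are hypothetical names; plain loops stand in for the intrinsics that a real library such as Vc would use, and no actual library API is reproduced here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 4-lane vector type. A real library would map these loops
// to SSE/AVX intrinsics; plain loops keep the sketch portable.
struct Simd4 {
  std::array<double, 4> lane;
  Simd4() : lane{} {}
  explicit Simd4(double s) { lane.fill(s); }  // broadcast a scalar
  friend Simd4 operator+(Simd4 a, const Simd4& b) {
    for (std::size_t i = 0; i < 4; ++i) a.lane[i] += b.lane[i];
    return a;
  }
  friend Simd4 operator*(Simd4 a, const Simd4& b) {
    for (std::size_t i = 0; i < 4; ++i) a.lane[i] *= b.lane[i];
    return a;
  }
};

// User kernel written once, generic in the number type T.
// With T = double it handles one cell, with T = Simd4 four cells at once.
template <class T>
T axpy(const T& a, const T& x, const T& y) {
  return a * x + y;
}

// Gather one coefficient for each of four cells from the global
// coefficient vector into the lanes of a Simd4.
Simd4 gather(const double* global, const std::array<std::size_t, 4>& idx) {
  Simd4 v;
  for (std::size_t i = 0; i < 4; ++i) v.lane[i] = global[idx[i]];
  return v;
}
```

The point of the design is that the user kernel stays unchanged; only the number type decides whether it runs on one cell or on several cells in lock-step.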

◮ Poisson problem
◮ Conforming FEM, Lagrange Qk
◮ Simplified: assembly only, no write back
◮ Best result: assembly of residual, Q4 on a regular grid of 15 × 16 × 16 elements, 240 threads: 282 GFlop/s (28% peak)

Matrix computations
      #elems      NNZ          GFlop/s  %peak  speedup
Q1    1 966 080   4 027 123    21       2      1.8
Q2    245 760     23 876 718   62       6      3.5
Q3    30 720      30 178 929   65       6      3.4
Q4    3 840       19 897 272   60       6      3.6

◮ Conforming FEM, Lagrange, Q1
◮ Including full write back
◮ Dual-core Intel i5-3340M
◮ Advertised peak: 10.8 GFlop/s per core
◮ Non-vectorized operation count: 7278912391 Flop
  Note: the count is actually lower in the vectorized version

Full matrix assembly
SIMD   lanes  runtime  GFlop/s  %peak
none   1      2.674 s  2.7      25.2
SSE    2      1.605 s  4.5      41.6
AVX    4      1.069 s  6.8      62.9

◮ Discontinuous Galerkin: effort for face terms dominates computation time for, say, k < 10
◮ Summary of complexities per element for n = m = k + 1 and d = 3

Operation                              CG    DG (moderate k)
Matrix-free operator evaluation        n^4   n^3
Assembled operator evaluation          n^6   n^6
Sum-factorized matrix assembly         n^7   n^6
Naive matrix assembly                  n^9   n^9
Matrix-free point Jacobi application   n^4   n^3
Block SSOR preconditioner              n^9   n^9

Storage needed per element is reduced since only 1D basis functions need to be evaluated
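
The complexity reduction from sum factorization can be illustrated with a small sketch that applies a tensor-product operator $A \otimes A$ to a coefficient array. To keep it short the sketch uses d = 2 rather than the d = 3 of the table, so the naive version costs n^4 operations and the factorized one 2n^3. The function names and the row-major storage are assumptions, not the implementation used in the talk.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Naive evaluation:
//   y(i1,i2) = sum_{j1,j2} A(i1,j1) * A(i2,j2) * x(j1,j2)
// Cost: n^4 multiply-adds in 2D (n^{2d} in general).
Vec applyNaive(const Vec& A, const Vec& x, std::size_t n) {
  Vec y(n * n, 0.0);
  for (std::size_t i1 = 0; i1 < n; ++i1)
    for (std::size_t i2 = 0; i2 < n; ++i2)
      for (std::size_t j1 = 0; j1 < n; ++j1)
        for (std::size_t j2 = 0; j2 < n; ++j2)
          y[i1 * n + i2] += A[i1 * n + j1] * A[i2 * n + j2] * x[j1 * n + j2];
  return y;
}

// Sum factorization: contract one coordinate direction at a time.
// Cost: 2 * n^3 multiply-adds in 2D (d * n^{d+1} in general).
Vec applySumFactorized(const Vec& A, const Vec& x, std::size_t n) {
  Vec t(n * n, 0.0), y(n * n, 0.0);
  // First direction: t(j1,i2) = sum_{j2} A(i2,j2) * x(j1,j2)
  for (std::size_t j1 = 0; j1 < n; ++j1)
    for (std::size_t i2 = 0; i2 < n; ++i2)
      for (std::size_t j2 = 0; j2 < n; ++j2)
        t[j1 * n + i2] += A[i2 * n + j2] * x[j1 * n + j2];
  // Second direction: y(i1,i2) = sum_{j1} A(i1,j1) * t(j1,i2)
  for (std::size_t i1 = 0; i1 < n; ++i1)
    for (std::size_t i2 = 0; i2 < n; ++i2)
      for (std::size_t j1 = 0; j1 < n; ++j1)
        y[i1 * n + i2] += A[i1 * n + j1] * t[j1 * n + i2];
  return y;
}
```

Both functions compute the same result; only the 1D matrix `A` is ever stored, which matches the remark that only 1D basis functions need to be evaluated.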

[Figure: time per DOF (naive and sum-factorized) and speedup versus polynomial degree 1 to 10]
Except for Q1 in 2D, sum factorization is always faster than the naive version

[Figure: GFlop/s versus polynomial degree 1 to 10; curves: 2D and 3D on Ivy Bridge (10 threads), 2D and 3D on MIC (60 threads)]
◮ Each thread runs the sum factorization kernel in 3D
◮ icc with auto-vectorization / SIMD hints
◮ Intel MIC (61 cores) vs. Xeon E5-2680v2 (Ivy Bridge, 10 cores)

SF-based matrix-free operator evaluation: $O(n^{d+1})$ per element
SF-based element matrix assembly: $O(n^{2d+1})$ instead of $O(n^{3d})$
Example: 3D Laplace, 8^3 elements

degree           1      2      3      4      5       6       7
DOF              4096   13824  32768  64000  110592  175616  262144 [1]
Assembly naive   0.02   0.39   3.6    21.2   91.8    319.1   971.2
Assembly SF      0.02   0.17   0.8    2.8    8.2     25.1    59.3
#IT BILU1        20     25     30     31     34      35      38
TSOLVE [2]       0.01   0.09   0.57   2.3    8.3     18.9    48.8
#IT JAC          105    124    153    167    203     251     298
TSOLVE           0.3    1.3    2.4    4.4    10.2    17.9    28.8

[1] Matrix has 10^9 nonzeros!
[2] Does not include ILU decomposition! → O(naive assembly)

Domain 1200 × 2200 × 170 ft^3, 60 × 220 × 85 mesh (1.1 million cells)
Anisotropic, diagonal permeability tensor, 11 orders of magnitude variation

Solve elliptic problem $-\nabla \cdot (K_{\mathrm{SPE10}} \nabla u) = 0$; $u = g$ on $\partial\Omega_{\mathrm{SPE10}}$
CCFV with two-point flux and AMG preconditioner vs. DG(Q1) with hybrid DG/AMG-P0 and Block-SSOR preconditioners

Discretization   CCFV   CCFV   DG-Q1           DG-Q1
Preconditioner   AMG    AMG    AMG-DG(2,2,V)   BSSOR(1)
DOF (×10^6)      1.1    9.0    9.0             9.0
#IT (10^-6)      29     36     55              2637
TIT (s)          0.33   2.66   8.90            1.90
Tsolve (s)       9.6    95.8   490             5010

→ Matrix-free on DG + AMG on Q1 conforming (or P0) subspace

◮ Innermost part of assembler is user code
◮ Auto-vectorization of this user code hardly possible
◮ Patch grid with uniform refinement
◮ Vectorization over grid cells
Challenges
◮ What is the best data management in the patch?
  ◮ ? Patch cells first, then patch skeleton
  ◮ ? All at once, using locking
  ◮ ? Duplicate face/vertex DOFs, additional reduce
◮ Which interface for the user code?
  ◮ ? Point evaluation
  ◮ ? Cell evaluation
  ◮ ... why not both? ;-)
◮ How to get the user kernel into CUDA?
  ◮ ? Tabulate shape function evaluations
  ◮ ? Code generator (e.g. FEniCS UFL)
  ◮ ? Special compilers (LLVM + something)
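
One of the data-management options listed above, duplicating shared DOFs and adding a reduce step, can be sketched as follows for 1D cells with two vertex DOFs each. The function name, the storage layout and the rule that each cell contributes half of its value to each vertex are assumptions chosen only to make the two phases visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of "duplicate DOFs, additional reduce":
// cell e touches the global vertex DOFs e and e+1.
std::vector<double> scatterWithDuplicates(const std::vector<double>& cellValues) {
  const std::size_t numCells = cellValues.size();
  // Phase 1: every cell writes only to its own two private slots, so the
  // cell loop has no write conflicts and needs no locking.
  std::vector<double> duplicated(2 * numCells, 0.0);
  for (std::size_t e = 0; e < numCells; ++e) {
    duplicated[2 * e] = 0.5 * cellValues[e];      // copy for left vertex
    duplicated[2 * e + 1] = 0.5 * cellValues[e];  // copy for right vertex
  }
  // Phase 2 (reduce): each global DOF sums its duplicates. Every iteration
  // writes exactly one entry, so this loop is conflict-free as well.
  std::vector<double> global(numCells + 1, 0.0);
  for (std::size_t v = 0; v <= numCells; ++v) {
    if (v > 0) global[v] += duplicated[2 * (v - 1) + 1];  // from cell v-1
    if (v < numCells) global[v] += duplicated[2 * v];     // from cell v
  }
  return global;
}
```

The price of avoiding conflicts this way is the extra storage for the duplicated slots and the second pass over the data.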