slide-1
SLIDE 1

Stencil-like operations on unstructured meshes

Christian Engwer

joint work with P. Bastian, J. Fahlke, S. Müthing

13.04.2015, Schloss Dagstuhl

wissen leben WWU Münster

Westfälische Wilhelms-Universität Münster

slide-4
SLIDE 4

What we aim at...

Solving Partial Differential Equations

◮ Wide range of applications
◮ In general requires unstructured meshes
◮ Not accessible to classical stencil approaches

[Unat et al., 2012]

slide-5
SLIDE 5

Outline

1 The EXA-DUNE Project
2 Stencils vs. FEM?
3 Introducing Local Structure
4 Exploiting Local Structure for Vectorization
5 Matrix-free vs. matrix-based solvers
6 Discussion

slide-7
SLIDE 7

The EXA-DUNE Project

Framework approach to software development

◮ Open-source C++ framework
◮ Integrated toolbox of simulation components
◮ Existing body of complex applications
◮ Scalability for traditional MPI applications

[Figure: DUNE module structure; labels: core modules, discretization modules, external modules, extra grids, grid, istl, localfunctions, pdelab, fem]

[Bastian, Blatt, Dedner, Engwer, Klöfkorn, Kornhuber, Ohlberger, Sander 2008] http://www.dune-project.org/

Challenges

◮ Incorporate new algorithms, hardware paradigms
◮ Integrate changes across simulation stages (Amdahl's Law)
◮ Provide “reasonable” upgrade path for existing applications

slide-8
SLIDE 8

The EXA-DUNE Project (2)

Framework approach to software development

◮ DUNE + FEAST = Flexibility + Performance
◮ General Software Frameworks
  → co-designed to specific hardware platforms is not sufficient
◮ Hardware-Oriented Numerics
  → standard low order algorithms do not scale any more

⇒ Much more than a pure implementational issue

[Figure: EXA-DUNE hardware model; labels: MPI, UMA, CPU, Accl, M, P, L1, L2, SIMD, MIC]

EXA-DUNE hardware model
Coarse-grained: MPI between heterogeneous nodes
Medium-grained: multicore CPUs, GPUs, MICs, APUs, ...
Fine-grained: vectorization, GPU ‘threads’, ...

PS: CPUs are catching up: E5-2699v3: 0.9 TF @ 145 W, K80: 2.9 TF @ 300 W

slide-11
SLIDE 11

Stencil vs. FEM?

Stencil approach

◮ Structured data layout
◮ Define update operation, based on local neighbourhood, as

  $y_i = f(x_i, N(x_i))$ for each $i \in [0, \#\mathrm{DOFs} - 1]$, for some neighbourhood $N$

◮ Data parallel pattern
◮ Easily vectorized
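
The update rule above can be illustrated with a three-point stencil on a structured 1D array. This is a minimal sketch, not code from the talk: the function name `apply_stencil`, the weights and the boundary treatment are assumptions chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: y_i = f(x_i, N(x_i)) with N(x_i) = {x_{i-1}, x_{i+1}}
// and f a weighted average. Boundary entries are copied unchanged (assumption).
std::vector<double> apply_stencil(const std::vector<double>& x)
{
    assert(x.size() >= 2);
    std::vector<double> y(x);
    // Every y[i] depends only on the read-only input x, so the iterations
    // are independent: a data parallel loop that vectorizes easily.
    for (std::size_t i = 1; i + 1 < x.size(); ++i)
        y[i] = 0.25 * x[i - 1] + 0.5 * x[i] + 0.25 * x[i + 1];
    return y;
}
```

Because input and output are separate arrays and the neighbours are found by index arithmetic alone, no gather/scatter and no write conflicts occur.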

slide-13
SLIDE 13

Stencil vs. FEM?

Local Stiffness Matrix approach

◮ Based on a weak formulation
◮ Loop over cells
◮ Compute local contributions to global residual/stiffness-matrix
◮ Similar to Map/Reduce pattern

  $A = \sum_E R_E A_E R_E^T$, $A_E = (a(\varphi_i, \varphi_j))_{i,j}$ with $\operatorname{supp} \varphi \cap E \neq \emptyset$

Challenges:

◮ size of $A_E$ varies
◮ indirect memory access (gather/scatter)
◮ read/write conflicts
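
The cell loop with gather and scatter can be sketched for the simplest case, linear elements for the 1D Laplace operator. The `Cell` struct and the function `apply_cellwise` are hypothetical names for this illustration and not part of DUNE; the sketch computes the residual $r = A u$ cell by cell without forming $A$.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D mesh cell: two vertex indices and the cell width h.
struct Cell { std::size_t v[2]; double h; };

// Compute r = A u with A = sum_E R_E A_E R_E^T for the P1 Laplace operator:
// gather local coefficients, apply the local matrix A_E, scatter-add.
std::vector<double> apply_cellwise(const std::vector<Cell>& cells,
                                   const std::vector<double>& u)
{
    std::vector<double> r(u.size(), 0.0);
    for (const Cell& c : cells) {
        assert(c.v[0] < u.size() && c.v[1] < u.size() && c.h > 0.0);
        // gather: indirect read through the vertex indices of the cell
        const double u0 = u[c.v[0]];
        const double u1 = u[c.v[1]];
        // local stiffness matrix A_E = 1/h * [[1, -1], [-1, 1]]
        const double r0 = (u0 - u1) / c.h;
        const double r1 = (u1 - u0) / c.h;
        // scatter: neighbouring cells add to the same entry of r, which is
        // the read/write conflict as soon as cells are processed in parallel
        r[c.v[0]] += r0;
        r[c.v[1]] += r1;
    }
    return r;
}
```

In contrast to a stencil, the memory access pattern is only known through the index arrays, and the serial loop hides the conflict on shared vertices.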

slide-14
SLIDE 14

Possible Patterns

◮ Neighbor data
  ◮ CCFV and DG method
  ◮ similar to classical stencil semantics
  ◮ explicit read access to neighbor data
◮ Vertex, Face or Edge data
  ◮ e.g. Conforming FEM, Raviart-Thomas, Nédélec
  ◮ several cells contributing to the same DOF
  ◮ implicitly shared: read/write conflicts
◮ Element local data
  ◮ DG methods
  ◮ No coupling to other cells
  ◮ simple, no conflicts

slide-17
SLIDE 17

Introducing Local Structure

Locally structured, globally unstructured data → increase algorithmic intensity

◮ Topological tensor product meshes
◮ Virtual local refinement
◮ Higher order methods: Discontinuous Galerkin, Reduced Basis

slide-21
SLIDE 21

Vectorization based on local structure

Vectorization of local operator (user code)

◮ hardly possible in user code
◮ Ideas:

  A)
  ◮ Evaluate cell assembly
  ◮ 90% compatible with old code
  ◮ works also for low-order methods
  ◮ Vectorize over cells

  B)
  ◮ Spectral DG + Sum-Factorization (reduce computational complexity)
  ◮ Point evaluations (simplify user code)
  ◮ Vectorize over point evaluations (quadrature point × test function)

slide-24
SLIDE 24

Programming Approaches

◮ Intrinsics (non-portable)
◮ Special language (needs special compiler)
◮ Autovectorize (difficult to drive portably)
◮ Vectorization library
  ◮ Hide intrinsics beneath a portable interface
  ◮ Implementations: e.g. Vc, Boost.SIMD, NGSolve
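
What "hiding intrinsics beneath a portable interface" means can be sketched with a tiny fixed-width vector type. `Vec4`, `load` and `axpy` are hypothetical names for this illustration and are not the API of Vc, Boost.SIMD or NGSolve; the operators use plain loops where a real library would use intrinsics.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical four-lane vector type. A real vectorization library would
// implement the operators below with SSE/AVX intrinsics.
struct Vec4 {
    std::array<double, 4> lane;
};

// Load four consecutive doubles into the lanes.
inline Vec4 load(const double* p)
{
    assert(p != nullptr);
    Vec4 r;
    for (std::size_t i = 0; i < 4; ++i) r.lane[i] = p[i];
    return r;
}

inline Vec4 operator+(const Vec4& a, const Vec4& b)
{
    Vec4 r;
    for (std::size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

inline Vec4 operator*(const Vec4& a, const Vec4& b)
{
    Vec4 r;
    for (std::size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

// Kernel written once against a generic number type T: it compiles for
// T = double (one value) and for T = Vec4 (four values per operation).
template <class T>
T axpy(const T& a, const T& x, const T& y)
{
    return a * x + y;
}
```

The point of the design is that the kernel (`axpy` here) contains no platform-specific code; only the vector type has to be ported.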

slide-26
SLIDE 26

A) Vectorizing over Elements

◮ Patch Grid
  ◮ reduce costs of unstructured meshes
  ◮ extract subset of mesh
  ◮ store as flat grid
  ◮ store in consecutive arrays, without pointers
◮ add structured refinement
  ◮ exploit local structure
  ◮ improve data locality
◮ Vectorization library
  ◮ Hide intrinsics beneath a portable interface
  ◮ Use operator overloading

[Figure: entries of the global coefficients vector gathered into local coefficients vectors v0, v1, v2, v3]
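
The gather from the global coefficients vector into lane-wise local vectors can be sketched as follows. This is an illustration under stated assumptions, not the EXA-DUNE implementation: the batch width `W`, the number of local DOFs `NLOC` and the function names are hypothetical, and a difference of the two local coefficients stands in for a real cell kernel.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t W = 4;     // cells processed together (assumed SIMD width)
constexpr std::size_t NLOC = 2;  // local DOFs per cell (1D linear elements)

using Lanes = std::array<double, W>;  // one value per cell of the batch
using DofMap = std::array<std::array<std::size_t, NLOC>, W>;

// Gather: v[k] receives local coefficient k of all W cells, so each local
// coefficient vector holds the same local DOF for W different cells.
std::array<Lanes, NLOC> gather(const std::vector<double>& global,
                               const DofMap& dofs)
{
    std::array<Lanes, NLOC> v;
    for (std::size_t k = 0; k < NLOC; ++k)
        for (std::size_t c = 0; c < W; ++c) {
            assert(dofs[c][k] < global.size());
            v[k][c] = global[dofs[c][k]];
        }
    return v;
}

// Cell kernel evaluated for W cells at once: the loop runs over the cells
// of the batch with unit stride, so it maps directly to vector instructions.
Lanes local_difference(const std::array<Lanes, NLOC>& v)
{
    Lanes d;
    for (std::size_t c = 0; c < W; ++c)
        d[c] = v[1][c] - v[0][c];
    return d;
}
```

The indirect access is confined to `gather`; the kernel itself only sees contiguous lanes, which is what makes vectorizing over cells work for low-order methods too.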

slide-35
SLIDE 35

A) Vectorizing over Elements (2)

Performance without sub refinement

◮ Poisson problem.
◮ Conforming FEM, Lagrange Qk
◮ Simplified: assembly only, no write back.
◮ Best result: assembly of residual, Q4 on a regular grid of 15 × 16 × 16 elements, 240 threads: 282 GFlop/s (28% peak).

Matrix computations

      #elems        NNZ          GFlop/s   %peak   speedup
Q1    1 966 080      4 027 123   21        2       1.8
Q2      245 760     23 876 718   62        6       3.5
Q3       30 720     30 178 929   65        6       3.4
Q4        3 840     19 897 272   60        6       3.6

slide-36
SLIDE 36

A) Vectorizing over Elements (3)

Performance with sub refinement

◮ Conforming FEM, Lagrange, Q1
◮ Including full write back
◮ Dual-Core Intel i5-3340M
◮ Advertised 10.8 GFlop/s/Core
◮ Non-vectorized: 7278912391 Flop/s
  Note: actually less in vectorized version

Full matrix assembly

SIMD   lanes   runtime   GFlop/s   %peak
none   1       2.674 s   2.7       25.2
SSE    2       1.605 s   4.5       41.6
AVX    4       1.069 s   6.8       62.9
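
The runtimes in the table give the speedups over the scalar run directly. As a consistency check, and only under the assumption that the non-vectorized figure 7278912391 is an operation count rather than a rate, dividing it by the scalar runtime reproduces the first table row:

```latex
\begin{align*}
S_{\mathrm{SSE}} &= \frac{2.674\,\mathrm{s}}{1.605\,\mathrm{s}} \approx 1.67 \quad (2\ \text{lanes}), &
S_{\mathrm{AVX}} &= \frac{2.674\,\mathrm{s}}{1.069\,\mathrm{s}} \approx 2.50 \quad (4\ \text{lanes}), \\
\frac{7278912391\ \mathrm{Flop}}{2.674\,\mathrm{s}} &\approx 2.72\ \mathrm{GFlop/s}, &
\frac{2.72\ \mathrm{GFlop/s}}{10.8\ \mathrm{GFlop/s}} &\approx 25.2\,\%.
\end{align*}
```

With full write back included, the speedup therefore stays below the lane count: about 83% of ideal for SSE and about 63% of ideal for AVX.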

slide-38
SLIDE 38

B) Sum Factorization: DG

◮ Discontinuous Galerkin: Effort for face terms dominates computation time for say k < 10
◮ Summary of complexities per element for n = m = k + 1 and d = 3

Operation                              CG      DG (moderate k)
Matrix-free operator evaluation        $n^4$   $n^3$
Assembled operator evaluation          $n^6$   $n^6$
Sum factorized matrix assembly         $n^7$   $n^6$
Naive matrix assembly                  $n^9$   $n^9$
Matrix-free point Jacobi application   $n^4$   $n^3$
Block SSOR preconditioner              $n^9$   $n^9$

Storage needed per element is reduced since only 1d basis functions need to be evaluated
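
The idea behind sum factorization can be shown in 2D, where the tensor-product structure of the basis lets the 1D basis matrix be applied one direction at a time. The sketch below is an assumption-laden illustration, not the EXA-DUNE kernel: `B[q][i]` is taken to be the value of 1D basis function i at 1D quadrature point q, and the function names are hypothetical. For m = n the naive evaluation costs $O(n^{2d})$ per element and the factorized one $O(d\,n^{d+1})$, which for d = 3 gives the $n^4$ listed for matrix-free operator evaluation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // indexed [row][col]

// Naive evaluation at all m x m tensor-product quadrature points:
// u(q1,q2) = sum_{i,j} B[q1][i] * B[q2][j] * c[i][j], cost O(m^2 n^2).
Matrix evaluate_naive(const Matrix& B, const Matrix& c)
{
    const std::size_t m = B.size(), n = c.size();
    assert(m > 0 && B[0].size() == n);
    Matrix u(m, std::vector<double>(m, 0.0));
    for (std::size_t q1 = 0; q1 < m; ++q1)
        for (std::size_t q2 = 0; q2 < m; ++q2)
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t j = 0; j < n; ++j)
                    u[q1][q2] += B[q1][i] * B[q2][j] * c[i][j];
    return u;
}

// Sum factorization: first t = B c (direction 1), then u = t B^T
// (direction 2), cost O(m n^2 + m^2 n).
Matrix evaluate_sum_factorized(const Matrix& B, const Matrix& c)
{
    const std::size_t m = B.size(), n = c.size();
    assert(m > 0 && B[0].size() == n);
    Matrix t(m, std::vector<double>(n, 0.0));
    for (std::size_t q1 = 0; q1 < m; ++q1)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                t[q1][j] += B[q1][i] * c[i][j];
    Matrix u(m, std::vector<double>(m, 0.0));
    for (std::size_t q1 = 0; q1 < m; ++q1)
        for (std::size_t q2 = 0; q2 < m; ++q2)
            for (std::size_t j = 0; j < n; ++j)
                u[q1][q2] += t[q1][j] * B[q2][j];
    return u;
}
```

Only the small 1D matrix `B` has to be stored, which is the storage reduction mentioned above.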

slide-39
SLIDE 39

B) Sum Factorization: Basis Evaluation in 3D

[Figure: time per DOF and speedup versus polynomial degree (1 to 10); curves: naive T/DOF, sum fact T/DOF, Speedup]

Except Q1 in 2d, sum factorization is always faster than naive version

slide-40
SLIDE 40

B) Sum Factorization: Floating-Point Performance

[Figure: GFLOP/s versus polynomial degree (1 to 10); curves: 2D Ivy Bridge, 10 threads; 3D Ivy Bridge, 10 threads; 2D MIC, 60 threads; 3D MIC, 60 threads]

◮ Each thread runs sum factorization kernel in 3D
◮ icc with auto vectorization / SIMD hints
◮ Intel MIC (61 cores) vs. Xeon E5-2680v2 (Ivy Bridge, 10 cores)

slide-41
SLIDE 41

Implicit Case: To Assemble or not to Assemble?

SF-based matrix-free operator evaluation: $O(n^{d+1})$ per element
SF-based element matrix assembly: $O(n^{2d+1})$ instead of $O(n^{3d})$
Example: 3D Laplace, $8^3$ elements

degree           1      2       3       4       5        6        7
DOF              4096   13824   32768   64000   110592   175616   262144 (1)
Assembly naive   0.02   0.39    3.6     21.2    91.8     319.1    971.2
Assembly SF      0.02   0.17    0.8     2.8     8.2      25.1     59.3
#IT BILU1        20     25      30      31      34       35       38
TSOLVE           0.01   0.09    0.57    2.3     8.3      18.9     48.8
#IT JAC          105    124     153     167     203      251      298
TSOLVE           0.3    1.3     2.4     4.4     10.2     17.9     28.8

(1) Matrix has $10^9$ nonzeroes!
(2) does not include ILU decomposition! → O(naive Assembly)
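
The DOF row follows from the mesh size and the polynomial degree, and the order of magnitude of the nonzero count for degree 7 can be estimated by hand. The estimate below is a sketch under assumptions that the slide does not state: a structured 8 × 8 × 8 grid, $(k+1)^3$ DOFs per element, dense element blocks, and coupling only across the $3 \cdot 8 \cdot 8 \cdot 7 = 1344$ interior faces.

```latex
\begin{align*}
\#\mathrm{DOF}(k) &= 8^3 (k+1)^3 = \bigl(8(k+1)\bigr)^3, &
\#\mathrm{DOF}(1) &= 16^3 = 4096, &
\#\mathrm{DOF}(7) &= 64^3 = 262144, \\
\mathrm{nnz}(k=7) &\approx (512 + 2 \cdot 1344) \cdot 512^2 = 3200 \cdot 262144 \approx 8.4 \cdot 10^{8}.
\end{align*}
```

Under these assumptions the estimate agrees with the stated order of $10^9$ nonzeroes, which is why storing the matrix becomes the limiting factor at high degree.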

slide-42
SLIDE 42

SPE10 Test Problem I

Domain 1200 × 2200 × 170 ft³, 60 × 220 × 85 mesh (1.1 million cells)
Anisotropic, diagonal permeability tensor, 11 orders of magnitude variation

slide-43
SLIDE 43

SPE10 Test Problem II

Solve elliptic problem $-\nabla \cdot (K_{\mathrm{SPE10}} \nabla u) = 0$; $u = g$ on $\partial\Omega_{\mathrm{SPE10}}$

CCFV with two point flux, AMG preconditioner vs. DG(Q1), hybrid DG/AMG-P0 and Block-SSOR preconditioner

Discretization    CCFV   CCFV   DG-Q1           DG-Q1
Preconditioner    AMG    AMG    AMG-DG(2,2,V)   BSSOR(1)
DOF (× 10^6)      1.1    9.0    9.0             9.0
#IT (10^-6)       29     36     55              2637
TIT (s)           0.33   2.66   8.90            1.90
Tsolve (s)        9.6    95.8   490             5010

→ Matrix-free on DG + AMG on Q1 conforming (or P0) subspace
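
The solve times in the table appear to be the product of iteration count and time per iteration, which can be checked row by row (a reading of the table, not a statement from the slide):

```latex
\begin{align*}
29 \cdot 0.33 &\approx 9.6, &
36 \cdot 2.66 &\approx 95.8, &
55 \cdot 8.90 &\approx 490, &
2637 \cdot 1.90 &\approx 5010.
\end{align*}
```

A Block-SSOR iteration is thus about 4.7 times cheaper than a hybrid AMG-DG iteration, but it needs roughly 48 times as many iterations.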

slide-44
SLIDE 44

Discussion

◮ Innermost part of assembler is user code
◮ Auto-vectorization of this user code hardly possible
◮ Patch-grid with uniform refinement
◮ Vectorization over grid cells

Challenges

◮ What is the best data management in the Patch?
  ◮ ? Patch cells first, then Patch skeleton
  ◮ ? All at once, using locking
  ◮ ? Duplicate Face/Vertex DOFs, additional reduce
◮ Which interface for the user code?
  ◮ ? Point evaluation
  ◮ ? Cell evaluation
  ◮ ... why not both? ;-)
◮ How to get the user kernel into CUDA?
  ◮ ? Tabulate shape function evaluations?
  ◮ ? Code generator (e.g. FEniCS UFL)
  ◮ ? Special compilers (LLVM + something)
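
One of the data management options named above, duplicating shared face/vertex DOFs and adding a reduce step, can be sketched in two phases. This is a minimal illustration and not the approach chosen in EXA-DUNE: the layout, the names `compute_duplicated` and `reduce`, and the constant stand-in for a cell kernel are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: every cell owns NLOC private slots; dof[c * NLOC + k]
// names the shared global DOF that slot k of cell c belongs to.
constexpr std::size_t NLOC = 2;

// Phase 1: cell-local work. Each cell writes only to its own slots, so the
// cells could run in parallel or in SIMD lanes without conflicts. A constant
// contribution stands in for a real local operator.
std::vector<double> compute_duplicated(std::size_t num_cells, double contribution)
{
    std::vector<double> dup(num_cells * NLOC);
    for (std::size_t c = 0; c < num_cells; ++c)
        for (std::size_t k = 0; k < NLOC; ++k)
            dup[c * NLOC + k] = contribution;
    return dup;
}

// Phase 2: the additional reduce, summing all duplicates of each global DOF.
std::vector<double> reduce(const std::vector<double>& dup,
                           const std::vector<std::size_t>& dof,
                           std::size_t num_dofs)
{
    assert(dup.size() == dof.size());
    std::vector<double> global(num_dofs, 0.0);
    for (std::size_t s = 0; s < dup.size(); ++s) {
        assert(dof[s] < num_dofs);
        global[dof[s]] += dup[s];
    }
    return global;
}
```

The trade-off is extra memory and a second pass over the data in exchange for a conflict-free first phase; the locking variant avoids the duplication but serializes writes to shared DOFs.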