Modular forest-of-octrees AMR: algorithms and interfaces


Modular forest-of-octrees AMR: algorithms and interfaces. Carsten Burstedde, Institut für Numerische Simulation (INS), Rheinische Friedrich-Wilhelms-Universität Bonn, Germany. June 06, 2012. FEniCS '12, Simula Research Laboratory, Norway.


slide-1
SLIDE 1

Modular forest-of-octrees AMR: algorithms and interfaces

Carsten Burstedde

Institut für Numerische Simulation (INS), Rheinische Friedrich-Wilhelms-Universität Bonn, Germany

June 06, 2012

FEniCS ’12 Simula Research Laboratory, Norway

slide-2
SLIDE 2

Additional Credits

Parallel AMR

◮ joint work with Lucas C. Wilcox, Tobin Isaac, Tiankai Tu (ICES, The University of Texas at Austin, USA)

Numerical methods and applications

◮ joint work with Georg Stadler, James Martin (ICES), Mike Gurnis, Laura Alisic (CalTech, Pasadena, USA)

And most importantly

◮ Omar Ghattas (ICES)

slide-3
SLIDE 3

Key points about AMR

AMR—Adaptive Mesh Refinement

◮ local refinement
◮ local coarsening
◮ dynamic
◮ parallel
◮ (element-based)
◮ (general geometry)

slide-6
SLIDE 6

Why (not) use AMR?

AMR—Adaptive Mesh Refinement

Benefits (problem-dependent)

◮ Reduction in problem size
◮ Reduction in run time
◮ Gain in accuracy per degree of freedom
◮ Gain in modeling flexibility

Challenges (fundamental)

◮ Storage: Irregular mesh structure
◮ Computational: Tree traversals and searches
◮ Networking: Irregular communication patterns
◮ Numerical: Horizontal/vertical projections

slide-7
SLIDE 7

Geoscience simulations enabled by AMR

AMR—Adaptive Mesh Refinement

Mantle convection: High resolution for faults and plate boundaries

[Figure: artist rendering; image by US Geological Survey]

[Figure: simulation (with M. Gurnis, L. Alisic, CalTech); surface viscosity (colors), velocity (arrows)]

slide-8
SLIDE 8

Geoscience simulations enabled by AMR

AMR—Adaptive Mesh Refinement

Mantle convection: High resolution for faults and plate boundaries

Zoom into the boundary between the Australia/New Hebrides plates

slide-10
SLIDE 10

Geoscience simulations enabled by AMR

AMR—Adaptive Mesh Refinement

Ice sheet dynamics: Complex geometry and boundaries

Antarctica meshes (w. C. Jackson, UTIG) Adapt to geometry from SeaRISE data

slide-11
SLIDE 11

Geoscience simulations enabled by AMR

AMR—Adaptive Mesh Refinement

Seismic wave propagation: Adapt to local wave length

Varying local wave speeds Adapt to local wave length

slide-12
SLIDE 12

AMR

AMR—Adaptive Mesh Refinement

Initial mesh: CSG description → mesh generator → XML file

◮ uniform element sizes
◮ finer resolution “where it matters”

a-priori adaptation

slide-13
SLIDE 13

AMR

AMR—Adaptive Mesh Refinement

“Where it matters” is sometimes known, often unknown beforehand

◮ emerging features ◮ moving fronts

a-posteriori adaptation

slide-14
SLIDE 14

AMR

AMR—Adaptive Mesh Refinement

Common AMR cycle: Solve → Mark → Refine → (repeat)

◮ Mesh exists standalone (topology/geometry)

slide-15
SLIDE 15

AMR

AMR—Adaptive Mesh Refinement

Common AMR cycle: Solve → Estimate → Mark → Refine → (repeat)

◮ Mesh exists standalone (topology/geometry)
◮ Fields (function space elements) are tied to a mesh

Solve → Solution → Indicator → Flag → Mark

slide-16
SLIDE 16

AMR

AMR—Adaptive Mesh Refinement

Common AMR cycle: Solve → Estimate → Mark → Refine → (repeat)

◮ Mesh exists standalone (topology/geometry)
◮ Fields (function space elements) are tied to a mesh

Solve → Solution → Indicator → Flag → Mark
Solution + Refine → Interpolate → Solution
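The cycle above can be sketched on a 1D mesh. This is a minimal illustration, not code from the talk: the "solve" step merely samples a known function, the indicator is the midpoint error of linear interpolation, and all function names are hypothetical.

```python
import math

def solve(mesh, f):
    # Stand-in for a PDE solve (assumption): sample f at the mesh nodes.
    return [f(x) for x in mesh]

def estimate(mesh, sol, f):
    # Element-local indicator: error of linear interpolation at the midpoint.
    return [abs(f(0.5 * (a + b)) - 0.5 * (ua + ub))
            for a, b, ua, ub in zip(mesh, mesh[1:], sol, sol[1:])]

def mark(indicators, tol):
    # One 0/1 flag per element.
    return [eta > tol for eta in indicators]

def refine_and_interpolate(mesh, sol, flags):
    # Bisect flagged elements; values at new nodes are interpolated linearly.
    new_mesh, new_sol = [mesh[0]], [sol[0]]
    for k, flag in enumerate(flags):
        a, b = mesh[k], mesh[k + 1]
        if flag:
            new_mesh.append(0.5 * (a + b))
            new_sol.append(0.5 * (sol[k] + sol[k + 1]))
        new_mesh.append(b)
        new_sol.append(sol[k + 1])
    return new_mesh, new_sol

def adapt(f, mesh, tol, max_cycles=20):
    # Solve -> Estimate -> Mark -> Refine (+ Interpolate) -> repeat.
    for _ in range(max_cycles):
        sol = solve(mesh, f)
        flags = mark(estimate(mesh, sol, f), tol)
        if not any(flags):
            break
        mesh, sol = refine_and_interpolate(mesh, sol, flags)
    return mesh, sol

def front(x):
    # A steep front at x = 0.5, unknown to the initial mesh.
    return math.tanh(20.0 * (x - 0.5))

mesh, sol = adapt(front, [0.0, 0.25, 0.5, 0.75, 1.0], tol=1e-3)
```

In this toy setting the interpolated solution is overwritten by the next solve; in a time-dependent code it would typically serve as the state carried onto the new mesh.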

slide-17
SLIDE 17

AMR

AMR—Adaptive Mesh Refinement

Estimator, Flag, Interpolate: element-local (conforming)

slide-18
SLIDE 18

AMR

AMR—Adaptive Mesh Refinement

Estimator, Flag, Interpolate: element-local (non-conforming)

◮ Hanging node values are not part of Solution, never stored

slide-19
SLIDE 19

Parallel AMR

AMR—Adaptive Mesh Refinement

Parallelization aspects

S → E → M → R → Balance → Partition → (repeat)

◮ 1. Balance: restore 2:1 non-conformity

Global split propagation ⇒ tricky algorithm (in serial) ⇒ extra tricky in parallel
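How one split can trigger further splits is easiest to see in 1D. The following is a serial sketch under simplifying assumptions (a binary tree over the unit interval, leaves stored in position order as hypothetical `(level, index)` pairs); it does not reproduce the parallel algorithm the slide refers to.

```python
def split(leaf):
    # Replace a leaf by its two children, keeping position order.
    level, index = leaf
    return [(level + 1, 2 * index), (level + 1, 2 * index + 1)]

def balance(leaves):
    # Repeat sweeps until no leaf is more than one level coarser
    # than either of its direct neighbors (2:1 condition).
    leaves = list(leaves)
    changed = True
    while changed:
        changed = False
        out = []
        for i, leaf in enumerate(leaves):
            neighbors = leaves[max(i - 1, 0):i] + leaves[i + 1:i + 2]
            if any(n[0] > leaf[0] + 1 for n in neighbors):
                out.extend(split(leaf))
                changed = True
            else:
                out.append(leaf)
        leaves = out
    return leaves

# Leaf (level, index) covers [index / 2**level, (index + 1) / 2**level].
unbalanced = [(1, 0), (2, 2), (4, 12), (4, 13), (3, 7)]
balanced = balance(unbalanced)
```

In this example the level-2 leaf must split because of its level-4 neighbor, and only then does the level-1 leaf violate the condition: the split propagates away from where the fine elements are.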

slide-20
SLIDE 20

Parallel AMR

AMR—Adaptive Mesh Refinement

Parallelization aspects

S → E → M → R → Balance → Partition → (repeat)

◮ 2. Partition: restore load balance
◮ Mesh ≡ graph: partition is NP-hard

Add extra structure (⇔ reduce search space) ⇒ faster algorithms

slide-21
SLIDE 21

Parallel AMR

AMR—Adaptive Mesh Refinement

Parallelization aspects

S → E → M → R → Balance → Partition → (repeat)

◮ 3. Nodes: create globally unique dof indices
◮ Nodes relevant to 2 or more processes ⇒ ownership conflict

[Figure: two-tree forest (k0, k1) with local axes (x0, y0) and (x1, y1), partitioned among processes p0, p1, p2; labels (0, p0), (1, p0), (2, p1), (3, p1), (4, p2)]

Add ghost elements (⇒ parallel algorithm) ⇒ resolve conflicts locally
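One way such a conflict can be resolved locally is sketched below. The tie-breaking rule (lowest rank touching a node owns it) is an assumption made for illustration; the slides do not state which rule is used. The function `number_nodes` and its input layout are hypothetical.

```python
def number_nodes(touching, num_procs):
    """touching maps a node id to the set of ranks whose elements touch it.

    With one layer of ghost elements each process can build this set for
    its own nodes without further communication (assumption of the sketch).
    """
    # Assumed rule: the lowest rank that touches a node owns it.
    owner = {node: min(ranks) for node, ranks in touching.items()}
    # Count owned nodes per process, then take an exclusive prefix sum.
    counts = [0] * num_procs
    for rank in owner.values():
        counts[rank] += 1
    offsets = [sum(counts[:p]) for p in range(num_procs)]
    # Globally unique indices: owner's offset plus a local running index.
    next_index = list(offsets)
    global_index = {}
    for node in sorted(touching):
        p = owner[node]
        global_index[node] = next_index[p]
        next_index[p] += 1
    return owner, global_index

touching = {"a": {0}, "b": {0, 1}, "c": {1}, "d": {1, 2}, "e": {2}}
owner, global_index = number_nodes(touching, num_procs=3)
```

In a distributed code the prefix sum over the per-process counts would be a collective operation; here it is computed serially.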

slide-22
SLIDE 22

Modular AMR

AMR—Adaptive Mesh Refinement

Yesterday’s quotes on scalability

◮ “straightforward, but time required”
◮ “software engineering problem”
◮ Parallel AMR algorithms are neither

Modular tools available

◮ Outsource distributed mesh generation/modification
◮ Encapsulate algorithms, define interfaces
◮ Differ in scalability and speed/memory footprint

slide-23
SLIDE 23

AMR

AMR—Adaptive Mesh Refinement

Types of AMR

◮ Block-structured (patch-based) AMR

www.cactuscode.org

slide-24
SLIDE 24

AMR

AMR—Adaptive Mesh Refinement

Types of AMR

◮ Conforming tetrahedral (unstructured) AMR

mesh data courtesy David Lazzara, MIT

slide-25
SLIDE 25

AMR

AMR—Adaptive Mesh Refinement

Types of AMR

◮ Octree-based AMR
◮ Octree maps to cube-like geometry
◮ 1:1 relation between octree leaves and mesh elements

slide-35
SLIDE 35

AMR

AMR—Adaptive Mesh Refinement

Types of AMR

◮ Octree-based AMR

[Figure: octree leaves ordered along a space-filling curve and split among Proc 0, Proc 1, Proc 2]

◮ Space-filling curve (SFC): Fast parallel partitioning
◮ Fast parallel tree algorithms for sorting and searching
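Once the leaves are ordered along the curve, partitioning reduces to cutting a 1D sequence into contiguous pieces of roughly equal weight. A minimal serial sketch, assuming the weights are already listed in curve order (the function name `sfc_partition` is hypothetical):

```python
from bisect import bisect_left
from itertools import accumulate

def sfc_partition(weights, num_procs):
    """Return cut indices: process p gets elements cuts[p] .. cuts[p+1]-1."""
    n = len(weights)
    # prefix[i] = total weight of all elements before element i.
    prefix = [0] + list(accumulate(weights))
    total = prefix[-1]
    # Process p starts at the first element whose preceding weight
    # reaches the fraction p / num_procs of the total.
    cuts = [bisect_left(prefix, total * p / num_procs, 0, n)
            for p in range(num_procs)]
    cuts.append(n)
    return cuts

uniform = sfc_partition([1] * 10, 3)
weighted = sfc_partition([1, 1, 1, 1, 4, 4], 2)
```

Each cut is found by a binary search on the prefix sums, so no graph partitioner is needed; the price is that the quality of the partition depends on the locality of the curve.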

slide-39
SLIDE 39

Octree-based AMR

Efficient encoding and total ordering

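[Figure: quadtree with children labeled 00, 01, 10, 11 on each level; leaves assigned to Proc 0, Proc 1, Proc 2 along the space-filling curve]

◮ 1:1 relation between leaves and elements → efficient encoding
◮ path from root to node, append level

10 01 11 11 → key

◮ derive element x-coordinate (second bit of each pair in the path)

0 1 1 → x = 3

◮ derive element y-coordinate (first bit of each pair in the path)

1 0 1 → y = 5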
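The encoding on this slide can be reproduced with a few bit operations. The sketch below assumes the layout suggested by the example key (three levels, each pair holding the y bit followed by the x bit, and a 2-bit level field appended); the names `encode`, `decode`, `MAXLEVEL` and `LEVEL_BITS` are hypothetical.

```python
MAXLEVEL = 3     # depth of the example tree (assumption from the slide)
LEVEL_BITS = 2   # enough bits to store levels 0..3

def encode(x, y, level):
    # Interleave coordinate bits from the root downwards: pair = (y bit, x bit).
    path = 0
    for i in range(MAXLEVEL - 1, -1, -1):
        path = (path << 2) | (((y >> i) & 1) << 1) | ((x >> i) & 1)
    # Append the level behind the path.
    return (path << LEVEL_BITS) | level

def decode(key):
    level = key & ((1 << LEVEL_BITS) - 1)
    path = key >> LEVEL_BITS
    x = y = 0
    for i in range(MAXLEVEL):
        x |= ((path >> (2 * i)) & 1) << i       # low bit of each pair
        y |= ((path >> (2 * i + 1)) & 1) << i   # high bit of each pair
    return x, y, level

key = encode(3, 5, 3)   # the slide's example: path 10 01 11, level 11
```

Because the path occupies the high bits, sorting the integer keys orders the leaves along the curve, which is what makes the encoding a total ordering.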

slide-41
SLIDE 41

Octree-based AMR

Fast elementary operations

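[Figure: quadtree with children labeled 00, 01, 10, 11 on each level; leaves assigned to Proc 0, Proc 1, Proc 2]

◮ Construct parent or children → vertical tree step, O(1)
◮ path from root to node, append level

10 01 11 11 → key

◮ parent: zero the coordinate bits of the last level, decrease the level

10 01 00 10 → key of the parent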
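The vertical step needs no tree traversal, only masking and a level update. A minimal sketch using the same assumed key layout as the slide's example (three 2-bit path pairs followed by a 2-bit level); `parent` and `children` are hypothetical names.

```python
MAXLEVEL = 3     # depth of the example tree (assumption from the slide)
LEVEL_BITS = 2   # width of the appended level field

def parent(key):
    level = key & ((1 << LEVEL_BITS) - 1)
    if level == 0:
        raise ValueError("the root has no parent")
    path = key >> LEVEL_BITS
    # Zero the bit pair that belongs to the current level ...
    shift = 2 * (MAXLEVEL - level)
    path &= ~(0b11 << shift)
    # ... and decrease the level.
    return (path << LEVEL_BITS) | (level - 1)

def children(key):
    level = key & ((1 << LEVEL_BITS) - 1)
    if level == MAXLEVEL:
        raise ValueError("cannot refine below the maximum level")
    path = key >> LEVEL_BITS
    # Fill the next (still zero) bit pair with 00, 01, 10, 11.
    shift = 2 * (MAXLEVEL - level - 1)
    return [((path | (c << shift)) << LEVEL_BITS) | (level + 1)
            for c in range(4)]

up = parent(0b10011111)
down = children(up)
```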

slide-43
SLIDE 43

Octree-based AMR

Fast elementary operations

[Figure: quadtree with children labeled 00, 01, 10, 11 on each level; leaves assigned to Proc 0, Proc 1, Proc 2]

◮ Construct neighbors → horizontal tree step/jump, O(1)
◮ path from root to node, append level

10 01 00 10 → key

◮ Subtract x-coordinate increment

10 00 00 10 → key of the neighbor

◮ Search on-processor element → tree search, O(log N_p)
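A same-size neighbor is obtained by decoding the key, shifting one coordinate by the element length, and encoding again; finding it among the local leaves is then a binary search in the sorted key array. The sketch assumes the key layout of the slides' example and only looks for an exact match; in an adapted mesh the actual neighbor may be coarser or finer. All names are hypothetical.

```python
from bisect import bisect_left

MAXLEVEL = 3     # depth of the example tree (assumption from the slide)
LEVEL_BITS = 2   # width of the appended level field

def encode(x, y, level):
    path = 0
    for i in range(MAXLEVEL - 1, -1, -1):
        path = (path << 2) | (((y >> i) & 1) << 1) | ((x >> i) & 1)
    return (path << LEVEL_BITS) | level

def decode(key):
    level = key & ((1 << LEVEL_BITS) - 1)
    path = key >> LEVEL_BITS
    x = sum(((path >> (2 * i)) & 1) << i for i in range(MAXLEVEL))
    y = sum(((path >> (2 * i + 1)) & 1) << i for i in range(MAXLEVEL))
    return x, y, level

def x_neighbor(key, direction):
    # direction is -1 (subtract increment) or +1 (add increment).
    x, y, level = decode(key)
    h = 1 << (MAXLEVEL - level)   # element length in finest-level units
    nx = x + direction * h
    if not 0 <= nx < (1 << MAXLEVEL):
        return None               # neighbor lies outside this tree
    return encode(nx, y, level)

def find_local(sorted_keys, key):
    # Binary search in the sorted local leaves: O(log N_p).
    i = bisect_left(sorted_keys, key)
    return i if i < len(sorted_keys) and sorted_keys[i] == key else -1

left = x_neighbor(0b10010010, -1)
right = x_neighbor(0b10010010, +1)
local = sorted([left, 0b10010010])
```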

slide-45
SLIDE 45

Octree-based AMR

Fast elementary operations

[Figure: quadtree with children labeled 00, 01, 10, 11 on each level; leaves assigned to Proc 0, Proc 1, Proc 2]

◮ Construct neighbors → horizontal tree step/jump, O(1)
◮ path from root to node, append level

10 01 00 10 → key

◮ Add x-coordinate increment

11 00 00 10 → key of the neighbor

◮ Search off-processor element-owner → search SFC, O(log P)

slide-46
SLIDE 46

Synthesis: Forest of octrees

From tree... =

◮ Limitation: Cube-like geometric shapes

slide-47
SLIDE 47

Synthesis: Forest of octrees

...to forest =

◮ Advantage: Geometric flexibility ◮ Challenge: Non-matching coordinate systems between octrees

slide-48
SLIDE 48

“p4est”—forest-of-octrees algorithms

Connect SFC through all octrees

[Figure: space-filling curve running through two octrees k0 and k1 with local axes (x0, y0) and (x1, y1); segments owned by p0, p1, p2]

Minimal global shared storage (metadata)

◮ Shared list of octant counts per core, (N)_p: 4 × P bytes
◮ Shared list of partition markers, (k; x, y, z)_p: 16 × P bytes
◮ 2D example above (h = 8): markers (0; 0, 0), (0; 6, 4), (1; 0, 4)

[1] C. Burstedde, L. C. Wilcox, O. Ghattas (SISC, 2011)
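With the markers replicated on every process, the owner of any position in the forest follows from a binary search over P entries. The sketch below uses the three markers of the 2D example and makes two assumptions: each marker is the first curve position owned by its process, and the curve within a tree is a z-order with the y bit above the x bit. The helper names are hypothetical.

```python
from bisect import bisect_right

def morton(x, y, bits=3):
    # z-order index inside one tree; h = 8 gives 3 bits per coordinate.
    m = 0
    for i in range(bits - 1, -1, -1):
        m = (m << 2) | (((y >> i) & 1) << 1) | ((x >> i) & 1)
    return m

# Markers (k; x, y) of the slide's 2D example, one per process.
MARKERS = [(0, 0, 0), (0, 6, 4), (1, 0, 4)]
# Comparable form: tree number first, then position along the curve.
SFC_MARKERS = [(k, morton(x, y)) for k, x, y in MARKERS]

def owner(k, x, y):
    # Last process whose marker is not beyond (k; x, y): O(log P).
    return bisect_right(SFC_MARKERS, (k, morton(x, y))) - 1

owners = [owner(0, 2, 4), owner(0, 7, 7), owner(1, 0, 0), owner(1, 4, 4)]
```

At 16 bytes per marker, a run on 220,320 processes would replicate about 3.5 MB of markers on each process.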

slide-49
SLIDE 49

“p4est”—forest-of-octrees algorithms

p4est is a pure AMR module

◮ Rationale: Support diverse numerical approaches
◮ Internal state: Element ordering and parallel partition
◮ Provide minimal API for mesh modification

Connect to numerical discretizations / solvers (“App”)

◮ p4est API calls are like MPI collectives (atomic to App)
◮ p4est API hides parallel algorithms and communication
◮ App → p4est: API invokes per-element callbacks
◮ App ← p4est: Access internal state read-only

slide-50
SLIDE 50

“p4est”—forest-of-octrees algorithms

p4est core API (for “write access”)

◮ p4est_new: Create a uniformly refined, partitioned forest
◮ p4est_refine: Refine per element according to 0/1 callbacks
◮ p4est_coarsen: Coarsen 2^d elements according to 0/1 callbacks
◮ p4est_balance: Establish 2:1 neighbor sizes by additional refines
◮ p4est_partition: Parallel redistribution according to weights
◮ p4est_ghost: Gather one layer of off-processor elements

p4est “random read access” not formalized

◮ Loop through p4est data structures as needed
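The callback style of this interface can be imitated in a few lines. The following is a serial toy in Python, not the p4est C API: the functions `new_uniform`, `refine` and `coarsen` and the `(level, i, j)` leaf layout are hypothetical, and each call changes the mesh by at most one level.

```python
def refine(leaves, should_refine, max_level=8):
    # Ask a 0/1 callback for every leaf; replace flagged leaves by 4 children.
    out = []
    for level, i, j in leaves:
        if level < max_level and should_refine(level, i, j):
            out += [(level + 1, 2 * i + di, 2 * j + dj)
                    for dj in (0, 1) for di in (0, 1)]
        else:
            out.append((level, i, j))
    return out

def coarsen(leaves, should_coarsen):
    # Ask a 0/1 callback for every complete family of 2^d = 4 siblings.
    out, k = [], 0
    while k < len(leaves):
        level, i, j = leaves[k]
        family = [(level, i + di, j + dj) for dj in (0, 1) for di in (0, 1)]
        is_family = (level > 0 and i % 2 == 0 and j % 2 == 0
                     and leaves[k:k + 4] == family)
        if is_family and should_coarsen(family):
            out.append((level - 1, i // 2, j // 2))
            k += 4
        else:
            out.append(leaves[k])
            k += 1
    return out

def new_uniform(level):
    # Start from the root and refine everywhere, level by level.
    leaves = [(0, 0, 0)]
    for _ in range(level):
        leaves = refine(leaves, lambda *leaf: True)
    return leaves

mesh = new_uniform(1)
mesh = refine(mesh, lambda level, i, j: (i, j) == (0, 0))
coarse = coarsen(mesh, lambda family: True)
```

The application only supplies the two callbacks; ordering and storage of the leaves stay inside the mesh functions, which mirrors the division of labor described on the slide.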

slide-51
SLIDE 51

“p4est”—forest-of-octrees algorithms

Weak scalability on ORNL’s “Jaguar” supercomputer

[Figure: percentage of runtime (0 to 100) versus number of CPU cores (12, 60, 432, 3444, 27540, 220320) for Partition, Balance, Ghost, Nodes]

◮ Cost of New, Refine, Coarsen, Partition negligible
◮ 5.13 × 10^11 octants; < 10 seconds per million octants per core

slide-52
SLIDE 52

“p4est”—forest-of-octrees algorithms

Weak scalability on ORNL’s “Jaguar” supercomputer

[Figure: seconds per (million elements / core) versus number of CPU cores (12, 60, 432, 3444, 27540, 220320) for Balance and Nodes]

◮ Dominant operations: Balance and Nodes scale over 18,360x
◮ 5.13 × 10^11 octants; < 10 seconds per million octants per core

slide-53
SLIDE 53

“p4est”—forest-of-octrees algorithms

What is a p4est element? Anything!

◮ The App defines how it will interpret an element

Examples

◮ Continuous bi-/trilinear elements
◮ High-order continuous spectral elements
◮ High-order DG elements with Gauss quadrature, LGL, ...
◮ An ijk subgrid optimized for GPU computation
◮ An M^d patch from PyClaw
◮ ...

slide-54
SLIDE 54

Parallel AMR

AMR—Adaptive Mesh Refinement

A-priori adaptation

[Flowchart: NewTree, RefineTree, BalanceTree, PartitionTree, ExtractMesh]

◮ refinement guided by material properties or geometry
◮ mesh and data fields

A-posteriori/dynamic adaptation

[Flowchart: CoarsenTree, RefineTree, BalanceTree, ExtractMesh, PartitionTree, ExtractMesh; InterpolateFields, TransferFields]

◮ old mesh and application data are used to derive error indicator
◮ intermediate mesh is used for interpolation of data fields
◮ new mesh with interpolated data fields on new partition

[2] C. Burstedde, O. Ghattas, G. Stadler, et al. (TeraGrid, 2008)

slide-55
SLIDE 55

App: Dynamic-mesh DG (3D advection)

Weak scalability on ORNL’s “Jaguar” supercomputer

[Figure: parallel efficiency (0.1 to 1) versus number of CPU cores (12 to 220K); normalized work per core per total run time]

◮ 3,200 high-order elements per core from 12 to 220,320 cores
◮ Overall parallel efficiency is 70% over an 18,360x scale
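The scale factor follows directly from the core counts. The efficiency formula below is one plausible reading of the axis label "normalized work per core per total run time", with W the work done, T the total run time and P_0 = 12 the baseline; these symbols are assumptions, not notation from the talk.

```latex
\[
\frac{220{,}320}{12} = 18{,}360,
\qquad
E(P) = \frac{W(P) \,/\, \bigl(P \, T(P)\bigr)}{W(P_0) \,/\, \bigl(P_0 \, T(P_0)\bigr)},
\quad P_0 = 12 .
\]
```

Under this reading, the reported 70% corresponds to E(220,320) ≈ 0.7.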

slide-56
SLIDE 56

Acknowledgements

Publications

◮ Homepage: http://burstedde.ins.uni-bonn.de/

Funding

◮ NSF DMS, OCI PetaApps, OPP CDI
◮ DOE SciDAC TOPS, SC
◮ AFOSR

HPC Resources

◮ Texas Advanced Computing Center (TACC)
◮ National Center for Computational Sciences (NCCS)