Towards achieving GPU-native adaptive mesh refinement, Ania Brown (presentation slides)



slide-1
SLIDE 1

Towards achieving GPU-native adaptive mesh refinement

Ania Brown Prof Takayuki Aoki

slide-2
SLIDE 2

Why AMR?

slide-3
SLIDE 3

AMR is not GPU friendly

  • Complicated, time-varying data structures
  • Can you use AMR and keep GPU performance?

My conclusion: yes, but it’s messy

slide-4
SLIDE 4

Contents

  • Introduction to the algorithm + data structures
  • The challenges
  • Optimisation possibilities
  • The RSE perspective — some lessons learnt
slide-5
SLIDE 5

Problem domain

  • Stencil calculations on a square structured mesh
  • Cell centre values

[Figure: five-point stencil around cell (i, j), with neighbours (i, j+1), (i+1, j), (i-1, j) and (i, j-1)]
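As a minimal sketch of what such a stencil calculation looks like, the following applies one explicit diffusion-style update using the five cells named on the slide. The function name `five_point_update`, the coefficient `alpha` and the fixed outer ring of cells are assumptions made for illustration; the slides do not specify the update rule.

```python
def five_point_update(u, alpha=0.1):
    """One explicit update of the interior cells with a five-point stencil.

    `u` is a list of rows of cell-centre values, indexed u[j][i].
    The outer ring of cells is treated as a fixed boundary (assumption)
    and is copied unchanged.
    """
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]  # copy so reads always see the old values
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            # Neighbours (i-1, j), (i+1, j), (i, j-1), (i, j+1) and centre (i, j)
            laplacian = (u[j][i - 1] + u[j][i + 1]
                         + u[j - 1][i] + u[j + 1][i]
                         - 4.0 * u[j][i])
            new[j][i] = u[j][i] + alpha * laplacian
    return new
```

Each interior cell reads only its four nearest neighbours, which is why a halo of neighbour values around each patch is enough to update the patch independently.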

slide-6
SLIDE 6

1) Block structured AMR

slide-7
SLIDE 7

1) Block structured AMR

slide-8
SLIDE 8

1) Block structured AMR

slide-9
SLIDE 9

2) Tree based AMR

Simulation mesh Refinement representation

slide-10
SLIDE 10

Simulation mesh Refinement representation

2) Tree based AMR

slide-11
SLIDE 11

Simulation mesh Refinement representation

2) Tree based AMR

slide-12
SLIDE 12

Simulation mesh Refinement representation … …

2) Tree based AMR

slide-13
SLIDE 13

Simulation mesh Refinement representation … … … …

2) Tree based AMR

slide-14
SLIDE 14

3) Patches + Tree AMR

slide-15
SLIDE 15

CPU codes:
  • Block-structured: Enzo, CHOMBO
  • Octree: RAMSES
  • Octree + patches: FLASH (using PARAMESH), NIRVANA

GPU codes:
  • Octree + patches: GAMER (2011), Daino (2016)

slide-16
SLIDE 16

The AMR algorithm

  • Initialize data structures
  • Main loop:
      • Create halo regions
      • Update patch values
      • Refine/coarsen patches
      • Update neighbour relations
  • Output values for visualisation

slide-17
SLIDE 17

  • Initialize data structures
  • Main loop (steps divided between CPU and GPU):
      • Create halo regions
      • Update patch values
      • Refine/coarsen patches
      • Update neighbour relations
  • Output values for visualisation

slide-18
SLIDE 18

Update step

1 CUDA block

slide-19
SLIDE 19

Update step

  • Tune block size for coalesced access
  • Z-marching
slide-20
SLIDE 20

  • Initialize data structures
  • Main loop (steps divided between CPU and GPU):
      • Create halo regions
      • Update patch values
      • Refine/coarsen patches
      • Update neighbour relations
  • Output values for visualisation

slide-21
SLIDE 21

  • Initialize data structures
  • Main loop (steps divided between CPU and GPU):
      • Create halo regions
      • Update patch values
      • Refine/coarsen patches
      • Update neighbour relations
  • Output values for visualisation

slide-22
SLIDE 22

Ordering leaves

slide-23
SLIDE 23

Hilbert curve

At each step:

  • Divide space into 4
  • Replace each quadrant with a rotated or reflected version of the original curve
  • Connect them so that the start and end points remain the same
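The construction above can be sketched with the widely used iterative mapping from grid coordinates to a position along the Hilbert curve. This is an illustrative version, not necessarily the one used in the talk; the function name `hilbert_index` and the requirement that the grid side `n` is a power of two are assumptions.

```python
def hilbert_index(n, x, y):
    """Return the position of cell (x, y) along a Hilbert curve
    covering an n x n grid, where n is a power of two (assumption)."""
    d = 0
    s = n // 2
    while s > 0:
        # Which quadrant of the current square does the cell fall in?
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect so the sub-curve has the same orientation
        # as the original curve at the next finer level.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(n):
    """List all cells of an n x n grid sorted along the Hilbert curve."""
    cells = [(x, y) for y in range(n) for x in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
```

Sorting leaves by this index keeps cells that are close in space close together in memory, which is the property that makes the curve attractive for ordering leaves.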
slide-24
SLIDE 24

Rules for refinement

slide-25
SLIDE 25

Neighbour relations

Leaf nodes:
  • Neighbour indices in each direction
  • Parent index

Parent nodes:
  • Child indices
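One possible way to hold these relations is sketched below for a 2D quad-tree. All names (`LeafNode`, `ParentNode`, `NO_NODE`, `make_refined_root`), the direction ordering and the child ordering are hypothetical choices for illustration, not the layout used in the talk.

```python
from dataclasses import dataclass, field
from typing import List

NO_NODE = -1  # sentinel (assumption): no neighbour or no parent
WEST, EAST, SOUTH, NORTH = range(4)
OPPOSITE = {WEST: EAST, EAST: WEST, SOUTH: NORTH, NORTH: SOUTH}


@dataclass
class LeafNode:
    parent: int = NO_NODE  # index into the parent-node array
    # One neighbour index per direction: [west, east, south, north]
    neighbours: List[int] = field(default_factory=lambda: [NO_NODE] * 4)


@dataclass
class ParentNode:
    # Child indices ordered as (0,0), (1,0), (0,1), (1,1) (assumption)
    children: List[int] = field(default_factory=lambda: [NO_NODE] * 4)


def make_refined_root():
    """Build a root split once into four leaves and link the siblings."""
    root = ParentNode(children=[0, 1, 2, 3])
    leaves = [LeafNode(parent=0) for _ in range(4)]
    for idx, leaf in enumerate(leaves):
        cx, cy = idx % 2, idx // 2  # position of the child inside the parent
        if cx == 0:
            leaf.neighbours[EAST] = idx + 1
        else:
            leaf.neighbours[WEST] = idx - 1
        if cy == 0:
            leaf.neighbours[NORTH] = idx + 2
        else:
            leaf.neighbours[SOUTH] = idx - 2
    return root, leaves
```

Storing plain integer indices rather than object references keeps the structure easy to flatten into arrays, which matters if the relations ever need to be copied to the GPU.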

slide-26
SLIDE 26

Create halo regions

Find correct neighbour node

slide-27
SLIDE 27

Create halo regions

Find correct neighbour node

slide-28
SLIDE 28

Create halo regions

Find correct neighbour node

slide-29
SLIDE 29

Create halo regions

Copy halo values

slide-30
SLIDE 30

Interpolating halo values

slide-31
SLIDE 31

Interpolating halo values

slide-32
SLIDE 32

Interpolating halo values

slide-33
SLIDE 33

Interpolating halo values

slide-34
SLIDE 34

Interpolating halo values

slide-35
SLIDE 35

Reducing halo values

slide-36
SLIDE 36

Reducing halo values
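The slides do not state which reduction operator is used. As a sketch under the assumption that a coarse halo cell takes the plain average of the 2 x 2 fine cells it covers (a common choice for cell-centred values), the step could look like this; `reduce_fine_to_coarse` is a hypothetical name.

```python
def reduce_fine_to_coarse(fine):
    """Average each 2 x 2 group of fine cell-centre values into one coarse value.

    `fine` is a list of rows with even dimensions (assumption).
    """
    ny, nx = len(fine), len(fine[0])
    coarse = []
    for j in range(ny // 2):
        row = []
        for i in range(nx // 2):
            total = (fine[2 * j][2 * i] + fine[2 * j][2 * i + 1]
                     + fine[2 * j + 1][2 * i] + fine[2 * j + 1][2 * i + 1])
            row.append(0.25 * total)
        coarse.append(row)
    return coarse
```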

slide-37
SLIDE 37

Coarsen/refine step

Main loop (steps divided between CPU and GPU):
  • Find patches to coarsen/refine
  • Defragment value array
  • Update neighbour relations
  • Refine/coarsen patch values

slide-38
SLIDE 38

Defragment value array

refine node

slide-39
SLIDE 39

Defragment value array

refine node

slide-40
SLIDE 40

Defragment value array

refine node

slide-41
SLIDE 41

Defragment value array

coarsen node

slide-42
SLIDE 42

Defragment value array

coarsen node

slide-43
SLIDE 43

Defragment value array

coarsen node

slide-44
SLIDE 44

Calculate new defragmented position

  • Input: for all nodes, flag whether that node is to be refined, coarsened or unchanged
      • Refined: +3
      • Coarsened: -3
      • Unchanged: 0
  • For each element, sum all preceding elements in the array
  • For n nodes, requires n/2 threads and O(log₂ n) serial steps
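Summing all preceding elements is an exclusive prefix sum (scan). The sketch below simulates a work-efficient up-sweep/down-sweep scan on the CPU, where each inner `for` loop stands in for threads that would run in parallel on a GPU. Whether the talk used exactly this variant is not stated; `exclusive_scan`, `new_positions` and the flag strings are hypothetical names, and treating the new position as the old index plus the scanned offset is an assumption.

```python
DELTA = {"refine": 3, "coarsen": -3, "unchanged": 0}


def exclusive_scan(values):
    """Exclusive prefix sum using up-sweep and down-sweep passes."""
    n = len(values)
    size = 1
    while size < n:  # pad to a power of two
        size *= 2
    data = list(values) + [0] * (size - n)

    # Up-sweep: build partial sums; at most size/2 independent updates per level.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: push the sums of preceding elements back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2
    return data[:n]


def new_positions(flags):
    """New array index of each node: old index plus the summed shifts before it."""
    offsets = exclusive_scan([DELTA[f] for f in flags])
    return [i + off for i, off in enumerate(offsets)]
```

For example, with flags `["unchanged", "refine", "unchanged", "unchanged"]`, the node after the refined one moves from index 2 to index 5, leaving room for the three extra children.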

slide-45
SLIDE 45

Multi-GPU

Load balancing

slide-46
SLIDE 46

Boundaries between subdomains

slide-47
SLIDE 47

Node 0 Node 1 Node 2

Boundaries between subdomains

slide-48
SLIDE 48

Node 0 Node 1 Node 2

Boundaries between subdomains

slide-49
SLIDE 49

How to distribute tree

slide-50
SLIDE 50

How to distribute tree

slide-51
SLIDE 51

Software design

  • Code framework: allow the user to edit/add functions for initialisation, resolution criterion and stencil calculation
  • Code generation: annotated regular data structures
  • How much to offer? Cell/node centre, interpolation level, stencil type

slide-52
SLIDE 52

Software development process

  • Unit testing
  • Verification
  • Profiling
slide-53
SLIDE 53

Code by: T.Shimokawabe, T.Takaki (2011)

Phase field model of dendritic solidification in a binary alloy

slide-54
SLIDE 54

7 refinement levels in quad-tree

slide-55
SLIDE 55

Regular mesh Adaptive mesh

slide-56
SLIDE 56

Performance testing for dendritic solidification model

256 x 256

L = 1.5 × 10⁻³ m

R = 4.5 × 10⁻⁴ m

Δx_min = 6 × 10⁻⁶ m, Δx_max = 1.2 × 10⁻⁵ m

slide-57
SLIDE 57

Performance testing for dendritic solidification model

8192 x 8192

L = 1.5 × 10⁻³ m

R = 4.5 × 10⁻⁴ m

Δx_min = 1.9 × 10⁻⁷ m, Δx_max = 1.2 × 10⁻⁵ m

slide-58
SLIDE 58

[Figure: execution time per timestep (ms) against resolution, comparing AMR, regular mesh and worst-case AMR]

slide-59
SLIDE 59

In summary

  • Patch based tree-AMR
  • For quick gains, offload the update step to the GPU
  • GPU-native version possible: values on GPU, neighbour relations on CPU
  • Likely won’t be a one-size-fits-all fix
slide-60
SLIDE 60
slide-61
SLIDE 61

Governing PDEs for phase field model

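Phase field φ(x, y, t) and concentration c(x, y, t):

$$ c = (1 - \phi)\, c_L + \phi\, c_S $$

where $c_L$ is the liquid concentration and $c_S$ is the solid concentration.

$$ \frac{\partial c}{\partial t} = \nabla \cdot \left[ D_S\, \phi\, \nabla c_S + D_L\, (1 - \phi)\, \nabla c_L \right] $$

$$ \frac{\partial \phi}{\partial t} = M_\phi \left[ \nabla \cdot \left( a^2 \nabla \phi \right) + \frac{\partial}{\partial x}\left( a \frac{\partial a}{\partial \phi_x} |\nabla \phi|^2 \right) + \frac{\partial}{\partial y}\left( a \frac{\partial a}{\partial \phi_y} |\nabla \phi|^2 \right) + \frac{\partial}{\partial z}\left( a \frac{\partial a}{\partial \phi_z} |\nabla \phi|^2 \right) - S \Delta T \frac{dp(\phi)}{d\phi} - W \frac{dq(\phi)}{d\phi} \right] $$

Term labels on the slide: diffusion, interface anisotropy, chemical driving force, phase change.

Symbols:
  • φ: phase
  • M_φ: mobility
  • a: interface anisotropy
  • p(φ): interpolating function
  • q(φ): double well function
  • D_S, D_L: diffusion in solid, liquid
  • S: entropy of fusion
  • T: temperature
  • W: height of double well potential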

slide-62
SLIDE 62