SLIDE 1

Hybrid Parallelization of Assembly in DUNE

Jö Fahlke, Christian Engwer

July 16, 2014

living knowledge WWU Münster

WESTFÄLISCHE WILHELMS-UNIVERSITÄT MÜNSTER

SLIDE 2

ExaDune

Dune+PDELab

◮ Framework for solving PDEs
◮ Already good at MPI-Parallelization

Exa-Scale Computers

◮ Arriving around 2018
◮ Many processing units
◮ Little memory per processing unit
◮ Accelerator hardware (e.g. GPU, MIC, etc.)

SLIDE 3

Outline

◮ Intro
◮ Threading
◮ Vectorization
◮ Conclusions
◮ Outlook

SLIDE 4

Anatomy of a PDE solver

Two components consume most of the CPU cycles and benefit most from parallelization:

◮ Linear algebra (Steffen Müthing)
◮ Assembling the linear system (this talk)

SLIDE 5

DUNE-PDELab design

[Figure: block diagram of the DUNE-PDELab design. Box labels: pdelab, applications, common, istl, grid, geometry, localfunctions, time integrator, assembler, grid function space, grid operator, local function space, petsc, linear algebra, core modules, local operator, user code.]

SLIDE 7

Challenges for PDE frameworks

◮ Assembly kernels are often written by the framework user
◮ The user is not an expert in programming
  ⇒ We can't rely on obscure languages.
◮ Avoid multiple versions of a kernel
  ⇒ Kernels must be portable.
◮ Keep it open
  ⇒ Avoid relying on proprietary languages and libraries.

SLIDE 8

Threading

SLIDE 9

Why Threading?

◮ Reduced memory overhead compared to message passing.
◮ Reduced communication overhead.

SLIDE 10

Challenges in threading

◮ Choosing a partitioning scheme.
◮ Choosing a race avoidance strategy.

SLIDE 11

Partitioning Strategies

[Figure: example partitionings of a grid, labeled strided, ranged, sliced, tensor.]

◮ Strided: calculated on the fly, but only efficient with random access iterators.
◮ Ranged: memory efficient, needs preprocessing or random access iterators.
◮ General (sliced, tensor): not memory efficient, but enables coloring, or tuning for small surface area.
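The difference between the strided and ranged schemes can be sketched in a few lines of C++. This is a minimal illustration under stated assumptions, not PDELab code: the names `strided_partition`, `ranged_partition` and `Range` are hypothetical, and grid elements are represented by plain indices 0..n-1 rather than grid iterators.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: element indices visited by thread `tid` under a
// strided scheme (tid, tid + nthreads, tid + 2*nthreads, ...).
// Nothing is precomputed, but every step skips `nthreads` elements,
// which is only cheap when the iterator supports random access.
std::vector<std::size_t> strided_partition(std::size_t n, std::size_t nthreads,
                                           std::size_t tid)
{
  assert(nthreads > 0 && tid < nthreads);
  std::vector<std::size_t> result;
  for (std::size_t i = tid; i < n; i += nthreads)
    result.push_back(i);
  return result;
}

// Hypothetical helper: contiguous half-open range [begin, end) for one thread.
// Only two numbers per thread have to be stored.
struct Range { std::size_t begin, end; };

Range ranged_partition(std::size_t n, std::size_t nthreads, std::size_t tid)
{
  assert(nthreads > 0 && tid < nthreads);
  std::size_t chunk = n / nthreads;
  std::size_t rest = n % nthreads;  // the first `rest` threads get one extra element
  std::size_t begin = tid * chunk + (tid < rest ? tid : rest);
  std::size_t end = begin + chunk + (tid < rest ? 1 : 0);
  return Range{begin, end};
}
```

For 10 elements and 3 threads, thread 1 visits elements 1, 4, 7 in the strided scheme and the range [4, 7) in the ranged scheme.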

SLIDE 12

Data Access Strategies

Races can occur when accumulating into global data structures. Strategies to avoid races:

◮ batched: batched writeback with a global lock.
◮ elock: one lock per mesh entity.
◮ coloring: partitions of the same color do not "touch".

Other strategies not considered here:

◮ global locking
◮ race-free schemes
  ◮ not considered since it is often not possible.
  ◮ but tried by R. Klöfkorn (Proceedings of ALGORITMY 2012).
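The entity-wise locking idea (elock) can be sketched as follows. This is a simplified illustration, not the PDELab implementation: `LockedVector` and `assemble_stride` are hypothetical names, the mesh is a 1D chain in which element e touches entities e and e+1, and every contribution is simply 1.0.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a global residual vector with one lock per
// mesh entity (here: one lock per vector entry).
struct LockedVector
{
  std::vector<double> data;
  std::vector<std::mutex> locks;

  explicit LockedVector(std::size_t n) : data(n, 0.0), locks(n) {}

  // Accumulate a local contribution while holding only this entity's lock.
  void add(std::size_t entity, double value)
  {
    std::lock_guard<std::mutex> guard(locks[entity]);
    data[entity] += value;
  }
};

// Work of one thread: a strided subset of the elements. Neighboring
// elements share an entity, so concurrent threads would race on it
// without the per-entity lock taken inside add().
void assemble_stride(LockedVector& global, std::size_t nelems,
                     std::size_t nthreads, std::size_t tid)
{
  assert(nthreads > 0 && tid < nthreads);
  assert(global.data.size() == nelems + 1);
  for (std::size_t e = tid; e < nelems; e += nthreads) {
    global.add(e, 1.0);
    global.add(e + 1, 1.0);
  }
}
```

In a threaded run each `tid` would call `assemble_stride` from its own thread (for example a `std::thread`). Interior entities end up with the value 2.0 and the two boundary entities with 1.0, independent of the thread count.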

SLIDE 13

Test Setup

◮ Stationary advection problem

    ∇ · (−A(x)∇u + b(x)u) + c(x)u = f in Ω,

  with appropriate Dirichlet and outflow boundary conditions.
◮ DG (weighted SIPG).
◮ Orthonormalized Pk basis.
◮ Wall time for assembly of residual and Jacobian.
◮ As many threads as possible.

SLIDE 14

Results

          batched        elock          colored
strided   1048ns (7%)    350ns (22%)    -
ranged     712ns (10%)   209ns (37%)    -
sliced     712ns (10%)   212ns (37%)    209ns (36%)
tensor     715ns (10%)   208ns (37%)    211ns (36%)

Table: PHI, degree=1, Jacobian, threads=240.

◮ Runtime per DoF and (efficiency).

SLIDE 16

CPU vs. PHI

k   t(CPU,1)  t(CPU,10)  t(CPU,20)  t(PHI,1)  t(PHI,60)  t(PHI,120)  t(PHI,240)
0   4.59µs    0.74µs     0.54µs     59.57µs   1.33µs     1.17µs      1.20µs
1   1.38µs    0.22µs     0.17µs     18.92µs   0.37µs     0.27µs      0.26µs
2   1.10µs    0.15µs     0.12µs     17.12µs   0.32µs     0.21µs      0.19µs
3   1.29µs    0.16µs     0.13µs     19.84µs   0.36µs     0.23µs      0.20µs
4   1.52µs    0.18µs     0.15µs     -         -          -           -
5   1.81µs    0.21µs     0.18µs     -         -          -           -

Table: Runtimes per DoF, degree k, Jacobian, sliced partitioning, entity-wise locking. t(device,n) is the runtime on that device with n threads.

◮ CPU still better than PHI.
◮ Unfair comparison: SIMD units are 128bit for CPU and 512bit for PHI.
  ⇒ Requires vectorization.
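The slides report efficiencies without defining them. A common definition, assumed here and not confirmed by the source, is the single-thread runtime divided by the thread count times the multi-thread runtime. The helper names below are hypothetical.

```cpp
#include <cassert>

// Speedup of a run with runtime tp relative to the single-thread runtime t1.
double speedup(double t1, double tp)
{
  assert(tp > 0.0);
  return t1 / tp;
}

// Parallel efficiency under the assumed definition t1 / (p * tp),
// where p is the number of threads.
double parallel_efficiency(double t1, double tp, unsigned p)
{
  assert(p > 0);
  return speedup(t1, tp) / p;
}
```

With the k = 1 runtimes from the table, `parallel_efficiency(18.92, 0.26, 240)` gives about 0.30 for the PHI and `parallel_efficiency(1.38, 0.17, 20)` about 0.41 for the CPU.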

SLIDE 17

Conclusions

◮ Useful partitionings:
  ◮ ranged: memory efficient
  ◮ general: as a backup, allows coloring
◮ Useful data access strategies:
  ◮ entity-wise locking: general, good performance
  ◮ coloring: good performance, needs particular partitioning
◮ Need vectorization.

SLIDE 18

Vectorization

SLIDE 19

Zoology of Devices

◮ CPU
  ◮ Lots of smart heuristics
  ◮ Good single-thread performance
  Typical stats (1 UMA node):
  ◮ 10 cores, 20 threads
  ◮ 2 SIMD lanes (double precision)
  ◮ 48GiB memory
◮ Phi
◮ GPU

SLIDE 20

Zoology of Devices

◮ CPU
◮ Phi
  ◮ Simplified CPU
  ◮ Needs a host system for housing
  ◮ Main program can run on host (with offloading) or native on device
  Typical stats:
  ◮ 60 cores, 240 threads
  ◮ 8 SIMD lanes (double precision)
  ◮ 8GiB memory
◮ GPU

SLIDE 21

Zoology of Devices

◮ CPU
◮ Phi
◮ GPU
  ◮ Very basic processor
  ◮ Needs host system for housing and scheduling
  ◮ Main program runs on host, offloading to device
  Typical stats:
  ◮ 2000–3000 cores currently*
  ◮ 32 SIMT lanes
  ◮ 5–12GiB memory

*Meaning of core differs from CPU/PHI

SLIDE 23

Programming Approaches I

Unsuitable approaches:

◮ Intrinsics (non-portable)
◮ Special language (needs special compiler)
◮ Autovectorizer (difficult to drive portably)

SLIDE 24

Programming Approaches II

Better approach: vectorization library

◮ Hide intrinsics beneath a portable interface
◮ Possible drawback: are compilers still capable of reordering optimizations?
◮ Implementations: Vc, Boost.SIMD¹, NGSolve
◮ Could also be implemented on top of some special languages (Cilk, OpenMP).

¹ Not an official Boost library

SLIDE 25

SIMD Example: SISD

Compute one scalar product

    double scalar_product(unsigned size, double *a, double *b)
    {
      double sum = 0;
      for(unsigned i = 0; i < size; ++i)
        sum += a[i] * b[i];
      return sum;
    }

SLIDE 27

SIMD Example: SIMD I

Compute one scalar product

    double scalar_product(unsigned size, double *a, double *b)
    {
      // assume *a and *b are properly aligned
      // assume size % lanes == 0
      Vector<double> sum = 0;
      for(unsigned i = 0; i < size; i += lanes)
        sum += (Vector<double>&)(a[i]) * (Vector<double>&)(b[i]);
      for(unsigned i = 1; i < lanes; ++i)
        sum[0] += sum[i];
      return sum[0];
    }

Bad example!

SLIDE 29

SIMD Example: SIMD II

Compute many scalar products at once

    Vector<double> scalar_product(unsigned size, Vector<double> *a, Vector<double> *b)
    {
      Vector<double> sum = 0;
      for(unsigned i = 0; i < size; ++i)
        sum += a[i] * b[i];
      return sum;
    }

SLIDE 30

SIMD Example: SIMD II

Compute many scalar products at once

    template<class T>
    T scalar_product(unsigned size, T *a, T *b)
    {
      T sum = 0;
      for(unsigned i = 0; i < size; ++i)
        sum += a[i] * b[i];
      return sum;
    }

Works both in scalar and vectorized modes
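To see the template run in both modes without a real vectorization library, the following sketch pairs it with a small stand-in vector type. `Vec` is a hypothetical placeholder for a library type such as Vc's vector class; it uses plain loops instead of intrinsics and only implements the operators the kernel needs. The kernel itself is the one from the slide, apart from const-qualified pointers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-width vector type with N lanes of type T.
// A real library would map these operations to SIMD instructions.
template<class T, std::size_t N>
struct Vec
{
  T lane[N];

  // Broadcast constructor: fill every lane with the same value.
  Vec(T value = T())
  {
    for (std::size_t i = 0; i < N; ++i) lane[i] = value;
  }

  // Lane-wise accumulation.
  Vec& operator+=(const Vec& other)
  {
    for (std::size_t i = 0; i < N; ++i) lane[i] += other.lane[i];
    return *this;
  }

  // Lane-wise product.
  friend Vec operator*(const Vec& a, const Vec& b)
  {
    Vec result;
    for (std::size_t i = 0; i < N; ++i) result.lane[i] = a.lane[i] * b.lane[i];
    return result;
  }
};

// Generic kernel: T = double computes one scalar product,
// T = Vec<double, N> computes N independent scalar products at once.
template<class T>
T scalar_product(unsigned size, const T* a, const T* b)
{
  assert(a != nullptr && b != nullptr);
  T sum = 0;
  for (unsigned i = 0; i < size; ++i)
    sum += a[i] * b[i];
  return sum;
}
```

Called with `double` arrays the function returns a single number. Called with arrays of `Vec<double, 2>`, lane j of the result is the scalar product of lane j of all the `a[i]` with lane j of all the `b[i]`, so no horizontal reduction across lanes is needed.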

SLIDE 31

Vectorization Results

◮ Continuous CG, Qk
◮ Simplified problem: scattering is missing.
◮ Best result: assembly of residual, Q4 on a regular grid of 15 × 16 × 16 elements, 240 threads: 282GFlop/s (28% peak).

Residual assembly

      #elems      #dofs       GFlop/s  %peak  speedup
Q1    1 966 080   2 013 561    21       2     1.2
Q2      245 760   2 013 561    76       8     1.5
Q3       30 720     856 219   166      16     1.7
Q4        3 840     257 725   282      28     2.0

SLIDE 32

Vectorization Results

Jacobian assembly

      #elems      NNZ          GFlop/s  %peak  speedup
Q1    1 966 080    4 027 123   21       2      1.8
Q2      245 760   23 876 718   62       6      3.5
Q3       30 720   30 178 929   65       6      3.4
Q4        3 840   19 897 272   60       6      3.6

SLIDE 34

Conclusions

Vectorization libraries

+ Easy to understand interface.
+ Simple enough for non-experts.
− Preliminary performance results are not very satisfying.

SLIDE 35

Outlook

◮ Finish evaluation of vectorization approaches.
◮ Grid data structures for GPU-type devices
  ◮ Current ideas inspired by OP2: http://www.oerc.ox.ac.uk/projects/op2
  ◮ Should also help with vectorization on CPU/PHI.
◮ GPU

SLIDE 36

Pointers

◮ DUNE: http://www.dune-project.org/
◮ PDELab: http://www.dune-project.org/pdelab/
◮ Vc by M. Kretz and V. Lindenstruth: http://compeng.uni-frankfurt.de/?vc
◮ Thanks to Steffen Müthing who provided code for counting floating point ops.
◮ Funding was provided by the DFG through the SPPEXA priority programme: http://www.sppexa.de/
