FMM goes GPU: A smooth trip or bumpy ride?


SLIDE 1

FMM goes GPU

A smooth trip or bumpy ride?

March 18, 2015

  • B. Kohnke, I. Kabadshow

MPI BPC Göttingen & Jülich Supercomputing Centre

SLIDE 2

FMM goes GPU

A smooth trip or bumpy ride?

  • The Fast Multipole Method
    – FMM Preliminaries, Operators, Algorithm
  • GPU Parallelization Strategies
    – Naïve parallelism, Dynamic parallelism
  • Results & Conclusions

SLIDE 3

FMM Applications

1/r Long-Range Interactions in O(N) instead of O(N^2)

  • Molecular Dynamics
  • Plasma Physics
  • Astrophysics

SLIDE 4

FMM Preliminaries

Different Approaches to Solving the N-body Problem

Solving the N-body Problem

Classical Approach

  • Simple load balancing
  • Computationally expensive: O(N^2)
  • Cannot be applied to large particle systems

PME (Particle Mesh Ewald)

  • Computational complexity O(N log N)
  • FFT-based
  • Not flexible enough for our use-case

Fast Multipole Method

  • Computational complexity O(N)
  • Tree-based approach

SLIDE 5

N-Body Problem

Classical Approach

  • For all particles, compute the force acting on q_i due to q_j
  • Integrate the equations of motion
  • Determine the new system configuration

$$E_c = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{q_i \, q_j}{r_{ij}} \qquad (1)$$
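As a point of reference, a direct-summation sketch of Eq. (1) might look like the following (illustrative only; the array names and layout are assumptions, not the authors' code):

    #include <cmath>

    // Direct O(N^2) evaluation of Eq. (1): q holds the charges, x/y/z the positions.
    double direct_energy(int N, const double* q,
                         const double* x, const double* y, const double* z)
    {
        double e = 0.0;
        for (int i = 0; i < N - 1; ++i) {
            for (int j = i + 1; j < N; ++j) {
                double dx = x[i] - x[j], dy = y[i] - y[j], dz = z[i] - z[j];
                double r_ij = std::sqrt(dx * dx + dy * dy + dz * dz);
                e += q[i] * q[j] / r_ij;   // each pair counted once
            }
        }
        return e;
    }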

SLIDE 6

N-Body Problem

Classical Approach

Pairwise sum E_c as in Eq. (1) above.

FMM on one line

$$E_C \approx \sum_{\text{level}} \sum_{A} \sum_{B(A)} \sum_{l=0}^{p} \sum_{m=-l}^{l} \sum_{j=0}^{p} \sum_{k=-j}^{j} (-1)^{j} \, \omega^{A}_{lm}(a) \, \mathrm{M2L}(R) \, \omega^{B}_{jk}(b)$$

SLIDE 7

FMM Parameter I: Depth of the FMM Tree d

Tree depth d, number of levels L = d + 1

Simulation box divided into 8^d subboxes (in 3D)

SLIDE 8

FMM Parameter II: Separation Criterion ws

Near field contains (2 · ws + 1)^3 boxes
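To make the two parameters concrete, a tiny helper (illustrative names, not from the talk) computes the resulting box counts:

    // Boxes per level in a 3D octree of depth d, and near-field size for a given ws.
    long long boxes_at_depth(int d) {
        long long n = 1;
        for (int i = 0; i < d; ++i) n *= 8;   // 8^d subboxes at depth d
        return n;
    }

    int near_field_boxes(int ws) {
        int s = 2 * ws + 1;                   // (2*ws + 1)^3 boxes, including the box itself
        return s * s * s;
    }

    // Example: d = 4 gives 4096 boxes on the lowest level; ws = 1 gives 27 near-field boxes.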

SLIDE 9

FMM Parameter III: Number of Poles p

Infinite Expansion

$$\frac{1}{d} = \frac{1}{|r - a|} = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} O_{lm}(a) \cdot M_{lm}(r)$$

Finite Expansion Up To Order p

$$\frac{1}{d} = \frac{1}{|r - a|} \approx \sum_{l=0}^{p} \sum_{m=-l}^{l} O_{lm}(a) \cdot M_{lm}(r)$$

SLIDE 10

FMM Workflow

Preprocessing Steps

Define the FMM parameters ws, d, p

SLIDE 11

FMM Workflow Pass 1

  • Set up the FMM tree and expand multipoles (P2M)
  • Shift multipoles to the root node (M2M)

SLIDE 12

FMM Workflow Pass 2

Translate multipole moments into Taylor moments (M2L)

SLIDE 13

FMM Workflow Pass 3

Shift Taylor moments to the leaf nodes (L2L)

SLIDE 14

FMM Workflow Pass 4

Computation of the far-field energy, forces, and potentials

SLIDE 15

FMM Workflow Pass 5

Computation of the near-field energy, forces, and potentials
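Taken together, Passes 1–5 correspond to a host-side driver of roughly the following shape (a schematic sketch under assumed names; the Tree type and pass functions are placeholders, not the presented code):

    struct Tree { /* boxes, moments, particle data (omitted) */ };

    // The five passes; bodies omitted in this outline.
    void p2m(Tree&, int p);
    void m2m(Tree&, int p);
    void m2l(Tree&, int ws, int p);
    void l2l(Tree&, int p);
    void far_field(Tree&, int p);
    void p2p(Tree&, int ws);

    void fmm_run(Tree& tree, int ws, int p)
    {
        p2m(tree, p);         // Pass 1: expand particles into multipoles (P2M) ...
        m2m(tree, p);         //         ... and shift multipoles toward the root (M2M)
        m2l(tree, ws, p);     // Pass 2: translate multipole into Taylor moments (M2L)
        l2l(tree, p);         // Pass 3: shift Taylor moments to the leaf nodes (L2L)
        far_field(tree, p);   // Pass 4: far-field energy, forces, potentials
        p2p(tree, ws);        // Pass 5: near-field energy, forces, potentials
    }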

SLIDE 16

FMM Data Structures

Far Field

Coefficient Matrix, Generalized Storage Type

  • Multipole order p
  • Matrix size O(p^2)
  • Triangular shape

Multipole/Taylor Moments ω_lm, µ_lm

Stored as a coefficient matrix of size O(p^2)

Operators, e.g. M2L

Stored as a coefficient matrix of size O(p^2)

Translation Operation

$$\mu(b - a) = \sum_{l=0}^{p} \sum_{m=0}^{l} \sum_{j=0}^{p} \sum_{k=-j}^{j} \omega_{jk}(a) \, O_{l+j,\,m+k}(b)$$
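A minimal container for such a triangular O(p^2) coefficient matrix could look as follows (a sketch under assumptions: the class name and std::complex value type are illustrative, only m >= 0 is stored with negative m recovered by symmetry, and operators would additionally need entries up to order 2p):

    #include <complex>
    #include <vector>

    // Triangular coefficient matrix up to order p: entries (l, m) with
    // 0 <= m <= l <= p, i.e. (p+1)(p+2)/2 stored values.
    class CoefficientMatrix {
    public:
        explicit CoefficientMatrix(int p) : p_(p), data_((p + 1) * (p + 2) / 2) {}

        // Row l starts after l*(l+1)/2 previously stored entries.
        std::complex<double>&       operator()(int l, int m)       { return data_[l * (l + 1) / 2 + m]; }
        const std::complex<double>& operator()(int l, int m) const { return data_[l * (l + 1) / 2 + m]; }

        int order() const { return p_; }

    private:
        int p_;
        std::vector<std::complex<double>> data_;
    };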

SLIDE 17

M2L Operation – FMM Tree Loops

Box – Neighbor Structure, ws=1

for (int d_c = 2; d_c <= depth; ++d_c) {                // loop over all relevant tree levels
    for (int i = 0; i < dim; ++i) {                     // loop over all boxes on a certain level
        for (int j = 0; j < dim; ++j) {
            for (int k = 0; k < dim; ++k) {
                int idx = boxid(i, j, k);
                for (int ii = 0; ii < 6; ++ii) {        // loop over all neighbors (216 for ws=1)
                    for (int jj = 0; jj < 6; ++jj) {
                        for (int kk = 0; kk < 6; ++kk) {
                            int idx_n  = neighbor_boxid(ii, jj, kk);
                            int idx_op = operator_boxid(i, ii, j, jj, k, kk);
                            // single p^4 operation
                            M2L_operation(omega, *m2l_operators[d_c], mu, p, idx, idx_n, idx_op);
                        }
                    }
                }
            }
        }
    }
}

SLIDE 18

M2L Operation – Internal Structure

p^4 loop structure:

void M2L_operation(...) {
    for (int l = 0; l <= p; ++l) {
        for (int m = 0; m <= l; ++m) {
            mu_l_m = 0;
            for (int j = 0; j <= p; ++j) {
                for (int k = -j; k <= j; ++k) {
                    mu_l_m += M2L[idx_op](l + j, m + k) * omega[idx_n](j, k);
                }
            }
            mu[idx](l, m) += mu_l_m;
        }
    }
}

SLIDE 19

Parallelization Strategies

Full parallelization – all loops

Full parallel kernel

  • Parallelize all loops
  • Map the last reduction loop to warps
  • Use cub::WarpReduce for the reduction step mu_l_m += ... (see the sketch below)

Drawbacks

  • Massive bookkeeping overhead
  • Execution time dominated by many integer division/modulo operations

Loop structure

for (int l = 0; l <= p; ++l) {
    for (int m = 0; m <= l; ++m) {
        mu_l_m = 0;
        for (int j = 0; j <= p; ++j) {
            for (int k = -j; k <= j; ++k) {
                mu_l_m += M2L[idx_op](l + j, m + k) * omega[idx_n](j, k);
            }
        }
        mu[idx](l, m) += mu_l_m;
    }
}
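A fully parallel kernel with a warp-level reduction could be sketched as below (an illustrative reconstruction, not the authors' kernel: coefficients are simplified to real-valued doubles in a dense (p+1)^2 layout, a single box/neighbor pair is assumed, and the index decoding stands in for the bookkeeping overhead mentioned above):

    #include <cub/cub.cuh>

    // tri(l, m) flattens the (l, m) layout used below: index = l*l + l + m.
    __device__ __forceinline__ int tri(int l, int m) { return l * l + l + m; }

    // One warp per target coefficient mu(l, m); lanes stride over the source
    // index (j, k) and cub::WarpReduce sums the partial products.
    __global__ void m2l_full_parallel(const double* omega,  // source moments, (p+1)^2 values
                                      const double* m2l,    // operator, (2p+1)^2 values
                                      double* mu,           // target moments, (p+1)^2 values
                                      int p)
    {
        typedef cub::WarpReduce<double> WarpReduce;
        __shared__ typename WarpReduce::TempStorage temp[4];   // blockDim.x == 128 -> 4 warps

        const int warp = threadIdx.x / 32;
        const int lane = threadIdx.x % 32;
        const int n    = (p + 1) * (p + 1);

        const int lm = blockIdx.x * 4 + warp;                  // flattened target index
        if (lm >= n) return;

        // Decode lm -> (l, m); this kind of index arithmetic is the overhead
        // that dominates the fully parallel variant.
        int l = (int)sqrtf((float)lm); if ((l + 1) * (l + 1) <= lm) ++l;
        int m = lm - l * l - l;

        double partial = 0.0;
        for (int jk = lane; jk < n; jk += 32) {
            int j = (int)sqrtf((float)jk); if ((j + 1) * (j + 1) <= jk) ++j;
            int k = jk - j * j - j;
            partial += m2l[tri(l + j, m + k)] * omega[jk];
        }

        double sum = WarpReduce(temp[warp]).Sum(partial);      // reduction step mu_l_m += ...
        if (lane == 0) mu[lm] += sum;
    }

In this sketch the kernel would be launched with 128 threads per block and ceil((p+1)^2 / 4) blocks per box/neighbor pair; covering all 216 neighbors would additionally require atomics or per-neighbor buffers.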

SLIDE 20

M2L Runtime – Full Loop Parallelization

[Figure: M2L runtime, time [s] (log scale, 10^-3 to 10^3) vs. multipole order p = 2–18, depth 4 (4096 boxes); curves: full parallelization, optimal]

SLIDE 21

Full Parallelization Costs

[Figure: share of index-computation cost in the full-parallel M2L kernel (0–100 %) vs. multipole order p = 2, 4, 8, 16, for tree depths 2, 3, and 4]

Reduce Overhead Further

  • Keep the innermost loop sequential
  • Additional 2x–7x speedup

SLIDE 22

M2L Runtime – Sequential Inner Loop

[Figure: M2L runtime, time [s] (log scale, 10^-3 to 10^3) vs. multipole order p = 2–18, depth 4 (4096 boxes); curves: full parallelization, sequential inner loop, optimal]

SLIDE 23

Add Dynamic Parallelization I

Dynamic Scheme 1

Launch Kernels

  • Loop over all tree levels (2–10) @ host CPU
  • Loop over boxes on a certain level (64, 512, 4096, ...), launch kernel from host
  • Loop over all neighbor boxes (216), launch kernel from GPU (dynamically)
  • Launch kernel from GPU (dynamically) for each M2L operator (a CUDA sketch follows below)

Drawbacks

  • Large kernel launch latencies
  • Kernels too small
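In CUDA terms, Scheme 1 corresponds to a parent kernel that launches a small child grid per neighbor box, as in the assumption-laden illustration below (not the original code: real-valued float coefficients in a dense (p+1)^2 layout, atomicAdd for accumulation, hypothetical pointer arrays; building requires -rdc=true and sm_35 or newer):

    // Child kernel: one thread per target coefficient mu(l, m), sequential (j, k) loop.
    __global__ void m2l_child(const float* omega, const float* op, float* mu, int p)
    {
        const int n  = (p + 1) * (p + 1);
        const int lm = blockIdx.x * blockDim.x + threadIdx.x;
        if (lm >= n) return;
        int l = (int)sqrtf((float)lm); if ((l + 1) * (l + 1) <= lm) ++l;
        int m = lm - l * l - l;

        float acc = 0.0f;
        for (int j = 0; j <= p; ++j)
            for (int k = -j; k <= j; ++k)
                acc += op[(l + j) * (l + j) + (l + j) + (m + k)] * omega[j * j + j + k];
        atomicAdd(&mu[lm], acc);     // several neighbors write into the same target box
    }

    // Parent kernel: one thread per box on the current level; each thread launches
    // one child grid per neighbor. The many tiny launches cause the latency problem
    // listed under "Drawbacks".
    __global__ void m2l_parent(const float* const* omega_of_neighbor,  // per box and neighbor
                               const float* const* op_of_neighbor,     // per relative offset
                               float** mu_of_box,                      // per-box target moments
                               int p, int num_boxes, int num_neighbors)
    {
        const int box = blockIdx.x * blockDim.x + threadIdx.x;
        if (box >= num_boxes) return;
        const int n = (p + 1) * (p + 1);
        for (int nb = 0; nb < num_neighbors; ++nb) {
            m2l_child<<<(n + 127) / 128, 128>>>(omega_of_neighbor[box * num_neighbors + nb],
                                                op_of_neighbor[nb],
                                                mu_of_box[box], p);
        }
    }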

SLIDE 24

M2L Runtime – Dynamic Parallelism I

[Figure: M2L runtime, time [s] (log scale, 10^-3 to 10^3) vs. multipole order p = 2–18, depth 4 (4096 boxes); curves: full parallelization, sequential inner loop, dynamic 1, optimal]

SLIDE 25

Add Dynamic Parallelization II

Dynamic Scheme 2

Launch Kernels

  • Loop over all tree levels (2–10) @ host CPU
  • Loop over boxes on a certain level (64, 512, 4096, ...) @ host CPU
  • Loop over all neighbor boxes (216), launch kernel from host
  • Launch kernel from GPU (dynamically) for each M2L operator

Drawbacks

  • Global memory access too slow
  • Disadvantageous access pattern

SLIDE 26

M2L Runtime – Dynamic Parallelism II

[Figure: M2L runtime, time [s] (log scale, 10^-3 to 10^3) vs. multipole order p = 2–18, depth 4 (4096 boxes); curves: full parallelization, sequential inner loop, dynamic 1, dynamic 2, optimal]

SLIDE 27

Add Dynamic Parallelization III

Dynamic Scheme 3

Launch Kernels

  • Loop over all tree levels (2–10) @ host CPU
  • Loop over boxes on a certain level (64, 512, 4096, ...) @ host CPU
  • Loop over all neighbor boxes (216), launch kernel from host
  • Launch kernel from GPU (dynamically) for each M2L operator

Improvements

  • Start only p^2 kernel threads
  • Use shared GPU memory to cache ω and the operator M (see the sketch below)
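A sketch of the Scheme 3 idea (again an assumption-based illustration, not the original kernel): launch only on the order of p^2 threads per box/neighbor pair and stage the source moments ω in shared memory before the p^4-style accumulation; the operator could be staged the same way.

    // Child kernel for one box/neighbor pair: (p+1)^2 threads, omega cached in shared
    // memory. Launch as m2l_child_shared<<<1, n, n * sizeof(float)>>>(...) with
    // n = (p+1)*(p+1), which stays below 1024 threads for the p values shown here.
    __global__ void m2l_child_shared(const float* omega, const float* op, float* mu, int p)
    {
        extern __shared__ float s_omega[];        // (p+1)^2 floats, set via launch config
        const int n  = (p + 1) * (p + 1);
        const int lm = threadIdx.x;

        if (lm < n) s_omega[lm] = omega[lm];      // cooperative load of the source moments
        __syncthreads();
        if (lm >= n) return;

        int l = (int)sqrtf((float)lm); if ((l + 1) * (l + 1) <= lm) ++l;
        int m = lm - l * l - l;

        float acc = 0.0f;
        for (int j = 0; j <= p; ++j)
            for (int k = -j; k <= j; ++k)
                acc += op[(l + j) * (l + j) + (l + j) + (m + k)] * s_omega[j * j + j + k];
        atomicAdd(&mu[lm], acc);                  // neighbors accumulate into the same box
    }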

SLIDE 28

M2L Runtime – Dynamic Parallelism III

[Figure: M2L runtime, time [s] (log scale, 10^-3 to 10^3) vs. multipole order p = 2–18, depth 4 (4096 boxes); curves: full parallelization, sequential inner loop, dynamic 1, dynamic 2, dynamic 3, optimal]

SLIDE 29

Total Runtime Distribution

N = 103000, p = 10, d = 4, Tesla K40

Operator   Runtime share
P2P        39.61 %
P2M         3.95 %
M2M         2.50 %
M2L        48.24 %
L2L         2.18 %
Forces      3.52 %

SLIDE 30

Conclusions

The Smooth Part

  • Fairly fast start possible thanks to unified memory
  • Dynamic parallelization avoids subsequent index computations (for both developer and hardware)
  • Dynamic parallelization achieves good performance, but only for some predefined parameter setups

The Bumpy Part

  • Finding the best parallelization for every parameter setup is not trivial
  • A single one-size-fits-all kernel is not possible (hybrid approach necessary)
  • O(p^3) vs. O(p^4) operator: GPU parallelization benefit?

SLIDE 31

Questions?

bartosz.kohnke@mpibpc.mpg.de

SLIDE 32

FMM goes GPU

A smooth trip or bumpy ride?

March 18, 2015

  • B. Kohnke, I. Kabadshow

MPI BPC Göttingen & Jülich Supercomputing Centre