Efficient GPU parallelization of the Fast Multipole Method with periodic boundary conditions - PowerPoint PPT Presentation
SLIDE 1

Efficient GPU parallelization of the Fast Multipole Method with periodic boundary conditions

Bartosz Kohnke

SLIDE 2

Fast Multipole Method (FMM)

Calculation of long-ranged forces in the n-body problem (Greengard and Rokhlin, 1987)

08.05.2017 2 B.Kohnke

  • Tree-based approach
  • O(n) complexity, n := number of particles
  • Multipole expansion of the long-range 1/r interactions
SLIDE 3

Fast Multipole Method (FMM)

1/r long-range interactions

Applications: molecular dynamics, plasma physics, astrophysics

SLIDE 4

Molecular Dynamics Simulation


Simulations on the atomistic level

SLIDE 5

F = Σ_α F_α + Σ_{ij} ( F_ij^Coul. + F_ij^vdW. )

Molecular Dynamics Simulation

Describing the energy of the system

Bonded interactions

[Plot: energy E as a function of distance R]


SLIDE 7

F = Σ_α F_α + Σ_{ij} ( F_ij^Coul. + F_ij^vdW. )

Molecular Dynamics Simulation

Describing the energy of the system

Non-bonded interactions

[Plot: energy E as a function of distance R]

N-body problem O(n²) + periodic boundaries

SLIDE 8

GROMACS


  • Splitting the computation of the electrostatic potential into two parts
  • Direct
  • Compute the particle-particle interactions directly within a cutoff
  • Reciprocal (FFT based)
  • Extrapolate the charges onto a grid
  • FFT of the charge grid
  • Computation O(n log n)
  • Communication O(nodes²)

Basic idea of PME

F = F_direct + F_reciprocal

Particle Mesh Ewald (PME)

SLIDE 9

PME vs FMM

Massive parallelization, 150000 Particles

Parallel efficiency

[Chart: parallel efficiency (0-100%) vs. replication factor (1-256) for PME and FMM (near field)]

SLIDE 12

FMM


Basic Idea

Classical O(n²) approach · Tree-code O(n log n) · FMM O(n)

SLIDE 13

FMM – Parameters

Controlling the accuracy of the approximation and the performance

FMM – Parameters d – depth of the FMM tree

SLIDE 14

FMM – Parameters

Controlling the accuracy of the approximation and the performance

FMM – Parameters ws – separation criterion

SLIDE 15

FMM – Parameters

Controlling the accuracy of the approximation and the performance

FMM – Parameters p – multipole order

1/d = Σ_{l=0}^{p} Σ_{m=−l}^{l} (l−m)!/(l+m)! · a^l/r^{l+1} · P_lm(cos α) P_lm(cos θ) e^{−im(β−φ)}

SLIDE 16

FMM – Workflow


Preprocessing

Particle input – positions and charges (x, y, z, q)

[Workflow diagram: Preprocessing → Pass 1 → Pass 2 → Pass 3 → Pass 4 → Pass 5; inputs: ws, d, p, xyz, q; outputs: E, F]

Set algorithm parameters

  • ws – separation criterion
  • p – multipole order
  • d – tree depth
SLIDE 17

FMM – Workflow


Preprocessing

Clustering particles


Build a tree according to chosen parameter d

  • ws – separation criterion
  • p – multipole order
  • d – tree depth
SLIDE 18

FMM – Workflow


Preprocessing

Particle to multipole (P2M)


Expand particle positions and charges into multipole moments ω

  • O(np²) operation
  • Fills the boxes on the lowest level
SLIDE 19

FMM – Workflow


Pass 1

Translate the multipole moments ω up the FMM tree

  • O(p⁴) operation
  • One operation per box
  • Vertical operator

Multipole to multipole (M2M)


πœ• 𝐛 + 𝐜 =

π‘š=0 π‘ž 𝑛=βˆ’π‘š π‘š π‘˜=0 π‘š 𝑙=βˆ’π‘˜ π‘˜

πœ•π‘˜π‘™ 𝐛 π‘ƒπ‘šβˆ’π‘˜,π‘›βˆ’π‘™(𝐜)

SLIDE 20

FMM – Workflow


Pass 2

Multipole to local (M2L): transform the multipole moments ω into Taylor moments μ

  • O(p⁴) operation
  • 189 operations per box (ws = 1)
  • Horizontal operator


μ_lm(b − a) = Σ_{j=0}^{p} Σ_{k=−j}^{j} ω_jk(a) M_{l+j,m+k}(b)

SLIDE 21

FMM – Workflow


Pass 3

Local to local (L2L): translate the local moments μ down the tree

  • O(p⁴) operation
  • One operation per box
  • Vertical operator


μ_lm(r − b) = Σ_{j=l}^{p} Σ_{k=−j}^{j} μ_jk(r) M_{j−l,k−m}(b)

SLIDE 22

FMM – Workflow


Pass 4 and 5

Computing the forces: far-field contribution from μ and ω

  • Compute the potential Φ_FF, field E_FF and forces F_FF
  • O(np²) operation

Near field contribution (particle to particle)


  • Compute the potential Φ_NF, field E_NF and forces F_NF
  • O(n_cutoff³) operation

SLIDE 23

FMM – Data Structures


Far field

Coefficient Matrix, Generalized Storage Type

  • Matrix size O(p²)
  • Triangular shape
  • One complex value per element
  • Used as storage for:
  • Multipole moments ω
  • Taylor moments μ
  • Operators M

[Figure: triangular (l,m) index layout and its physical memory alignment]

SLIDE 24
  • For all boxes in the tree
  • Determine the interaction set
  • Children of direct neighbors of the own parent
  • Determine the M2L operator
  • Compute one M2L operation
  • For each valid ω, μ pair

M2L Operation – Tree structure

Tree loop and box-neighbor structure, ws = 1

M2L Operation

[Figure: FMM tree with multipole moments ω, Taylor moments μ and operators M]

SLIDE 27
  • For all boxes in the tree
  • Determine the interaction set
  • Children of direct neighbors of the own parent
  • Determine the valid M2L operator
  • Compute one M2L operation
  • For each valid ω, μ pair

M2L Operation – Tree structure

Tree loop and box-neighbor structure, ws = 1

M2L Operation

μ_lm(b − a) = Σ_{j=0}^{p} Σ_{k=−j}^{j} ω_jk(a) M_{l+j,m+k}(b)

SLIDE 28


Translating the multipole expansion into a local expansion: the p⁴ loop structure

μ_lm(b − a) = Σ_{j=0}^{p} Σ_{k=−j}^{j} ω_jk(a) M_{l+j,m+k}(b)

Translation Operation (M2L) – O(p⁴)

One M2L operation

for (int l = 0; l <= p; ++l)
  for (int m = 0; m <= l; ++m) {
    mu_l_m = 0;
    for (int j = 0; j <= p; ++j) {
      for (int k = -j; k <= j; ++k) {
        mu_l_m += M[m_idx](l + j, m + k) * omega[o_idx](j, k);
      }
    }
    mu[mu_idx](l, m) += mu_l_m;
  }

M2L Operation – Internal Structure

SLIDE 31

M2L – GPU dynamic parallelism


Tree loop and Box – Neighbor Structure, ws=1

  • Blocksize too small for small p values
  • Small grids (streams help to utilize the GPU)

M2L Operation

  • Start one parent kernel for each ω
  • Parent kernel
  • Computes interaction set
  • Spawns the child kernels
  • Child kernel
  • Compute one p⁴ operation (p² threads)

parent_kernel<<<(1,1,1),(3,3,3)>>>
child_kernel<<<(2,2,2),(p,p,1)>>>


SLIDE 33


M2L Dynamic Scheme + Shared Memory

Depth 4, 4096 boxes, periodic boundaries, 4096 · 189 = 774144 p⁴ operations

[Chart: M2L time in seconds (log scale, ~1.2e-4 to ~5e-1) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic]

SLIDE 34

M2L – enhanced parallelization

Better tree abstraction


  • For each box in the tree store:
  • ω
  • A list of pointers to the targets μ_j
  • A list of pointers to the operators M_j

Idea – precompute interaction sets

[Table: per source ω_j, pointer lists to the 189 targets μ_1 … μ_189 and operators M_1 … M_189]

  • Overhead only in the initialization step
  • Increased memory requirement
  • Redundant pointer storage

SLIDE 35

M2L – enhanced parallelization

Better tree abstraction


  • Periodic boundary conditions
  • All interaction lists have the same size
  • For each box in the tree store:
  • ω
  • A list of pointers to the targets μ_j
  • A list of pointers to the operators M_j

Idea – precompute interaction sets

[Table: per source ω_j, pointer lists to the 189 targets μ_1 … μ_189 and operators M_1 … M_189]

SLIDE 36

M2L – precomputed interaction sets

Enhancing memory access patterns and index computation


  • Precompute the interaction lists
  • Reorganize memory SoA → AoS

Parallelization

π››πŸ π››πŸ π››πŸ’ π››πŸ‘ π››πŸ π››πŸπ››πŸ‘ π››πŸ’

  • Threading at tree level (for d ≥ 3)
  • One kernel launch per Symmetry_Type

Kernel<Symmetry_Type><<<(189,100,1),(64,1,1)>>> // d = 3, p = 10

p2 # of boxes and p2 (sequential) # of operators


SLIDE 37

M2L – precomputed interaction sets

Enhancing memory access patterns and index computation


Kernel<Symmetry_Type><<<(189,100,1),(64,1,1)>>> // d = 3, p = 10

  • Threads within the same block share the operator (shared memory)
  • Nearly no index computation for ω
  • ω_j = ThreadId + offset(Symmetry_Type)
  • Targets μ_j accessed through a list
  • Presorted to achieve coalescing

Parallelization

π››πŸ π››πŸ π››πŸ’ π››πŸ‘

Ξ©

πœ• 𝜈 𝑁

SLIDE 38


M2L – precomputed interaction sets

Depth 2, 64 boxes, periodic boundaries, 64 · 189 = 12096 p⁴ operations

[Chart: M2L time in seconds (log scale) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic, presorted SoA]

SLIDE 39


M2L – precomputed interaction sets

Depth 3, 512 boxes, periodic boundaries, 512 · 189 = 96768 p⁴ operations

[Chart: M2L time in seconds (log scale) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic, presorted SoA]

SLIDE 40


M2L – precomputed interaction sets

Depth 4, 4096 boxes, periodic boundaries, 4096 · 189 = 774144 p⁴ operations

[Chart: M2L time in seconds (log scale) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic, presorted SoA]

SLIDE 41

M2L – reduced operator set

Exploring operator symmetry

Symmetry of the operators


  • The whole operator set contains 316 operators (3D)
  • There are 56 unique operators (3D)
  • Operators of the same length are rotationally symmetric
  • Only the signs of the symmetric counterparts' elements differ

SLIDE 43

M2L – reduced operator set

Exploring operator symmetry

Symmetry of the operators


[Figure: operator pairs mirrored as (x,y) / (x,−y)]

  • The whole operator set contains 316 operators (3D)
  • There are 56 unique operators (3D)
  • Operators of the same length are rotationally symmetric
  • Only the signs of the symmetric counterparts' elements differ

SLIDE 44

M2L – reduced operator set

Exploring operator symmetry

Different symmetry groups


  • The interaction sets are asymmetric
  • 8 distinct types according to the position within the parent box
  • There are 4 symmetry groups (3D)
  • 7 operators are unique
  • 21 operators are 2-way symmetric
  • 21 operators are 4-way symmetric
  • 7 operators are 8-way symmetric
SLIDE 49
  • Preload the operator data into shared memory
  • Only once for the whole symmetry group
  • Preload the bitsets of signs for the symmetrical counterparts
  • One full dot product yields (1, 2, 4, 8) results, depending on the symmetry group

M2L – precomputed interaction sets

Enhancing memory access patterns and index computation


Parallelization of reduced operator

π››πŸ π››πŸ π››πŸ’ π››πŸ‘

Ξ©

πœ• 𝜈 𝑁

x,y x,y x,y x,y

1010011101 0110110101 1110110101 1110110101

SLIDE 51

M2L – precomputed interaction sets

Enhancing memory access patterns and index computation



Parallelization of reduced operator

  • Grid size reduced by a factor of ~4
  • Logic added to fetch and flip the signs of the result

Kernel<Symmetry_Type, Operator><<<(7,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(21,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(21,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(7,100,1),(64,1,1)>>>(…)

SLIDE 52


M2L – reduced operator

Depth 4, 4096 boxes, periodic boundaries, 4096 · 189 = 774144 p⁴ operations

[Chart: M2L time in seconds (log scale) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic, presorted SoA, reduced operator]

SLIDE 53
  • Uncoalesced memory access for the symmetrical counterparts' targets
  • Solution
  • Store coalesced pointers for all (l,m) targets

M2L – precomputed interaction sets

Enhancing memory access patterns and index computation



Parallelization of reduced operator

Kernel<Symmetry_Type, Operator><<<(7,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(21,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(21,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(7,100,1),(64,1,1)>>>(…)

SLIDE 54


M2L – reduced operator

Depth 4, 4096 boxes, periodic boundaries, 4096 · 189 = 774144 p⁴ operations

[Chart: M2L time in seconds (log scale) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic, presorted SoA, reduced operator, reduced operator with resorted targets]

SLIDE 55

FMM vs GROMACS - 285119 particles – channel protein TehA

FMM – DEPTH 3, 512 BOXES

Intel i7-6800 CPU, 4 cores (8 hyperthreads), GeForce GTX 1060
GROMACS: fourierspacing 0.12 nm, cutoff 1.0 nm, interpolation order 4

[Chart: wall-clock time (0-0.16 s) per multipole order p = 1-11; stacked FMM stages (FORCES, L2L, LATTICE, M2L, M2M, P2M, COPY AND FLUSH, DISTRIBUTE PARTICLES, P2P) compared with GROMACS P2P]

SLIDE 56

FMM vs GROMACS - 285119 particles – channel protein TehA

FMM – DEPTH 4, 4096 BOXES

Intel i7-6800 CPU, 4 cores (8 hyperthreads), GeForce GTX 1060
GROMACS: fourierspacing 0.12 nm, cutoff 1.0 nm, interpolation order 4

[Chart: wall-clock time (0-0.16 s) per multipole order p = 1-11; stacked FMM stages (FORCES, L2L, LATTICE, M2L, M2M, P2M, COPY AND FLUSH, DISTRIBUTE PARTICLES, P2P) compared with GROMACS P2P]

SLIDE 57

FMM vs GROMACS - 285119 particles – channel protein TehA

FMM – DEPTH 3.32, 1000 BOXES

Intel i7-6800 CPU, 4 cores (8 hyperthreads), GeForce GTX 1060
GROMACS: fourierspacing 0.12 nm, cutoff 1.0 nm, interpolation order 4

[Chart: wall-clock time (0-0.16 s) per multipole order p = 1-11; stacked FMM stages (FORCES, L2L, LATTICE, M2L, M2M, P2M, COPY AND FLUSH, DISTRIBUTE PARTICLES, P2P) compared with GROMACS P2P]

SLIDE 58

GPU parallelization of the FMM


Summary and results

Summary

  • All stages of the FMM run on a GPU
  • Distributing particles
  • Building multipoles
  • M2M (compute bound)
  • M2L (compute bound)
  • L2L (compute bound)
  • Lattice (dealing with periodicity)
  • Computing far-field forces
  • P2P (compute bound)
  • Enhancements are still possible
  • M2L (exploring more symmetry)
  • Redistributing the work between the CPU and GPU (CPU is nearly idle)
SLIDE 59

Questions?

Thank You