Efficient GPU parallelization of the Fast Multipole Method with periodic boundary conditions
Bartosz Kohnke
Fast Multipole Method (FMM)
Calculation of long-range forces in the n-body problem (Greengard and Rokhlin, 1987)
Tree based
08.05.2017 2 B.Kohnke
n := number of particles
1/r Long-Range interactions
Molecular Dynamics, Plasma Physics, Astrophysics
Simulations on the atomistic level
Describing the energy of the system
GROMACS
Massive parallelization, 150000 particles
[Figure: parallel efficiency (0% to 100%) versus replication factor (1 to 256)]
Basic Idea
Classical approach: O(n²)
Tree code: O(n log n)
FMM: O(n)
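As a point of reference for the classical O(n²) approach, a minimal C++ sketch of the direct pairwise 1/r energy sum could look as follows. The `Particle` struct and the function name `direct_energy` are illustrative assumptions, not taken from the slides, and physical prefactors and periodic images are omitted.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle type: position (x, y, z) and charge q.
struct Particle {
    double x, y, z, q;
};

// Direct summation of the 1/r pair energy. Every pair is visited once,
// so the cost grows as n(n-1)/2, i.e. O(n^2).
double direct_energy(const std::vector<Particle>& p) {
    double energy = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            energy += p[i].q * p[j].q / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return energy;
}
```

Tree codes and the FMM avoid this double loop by replacing distant groups of particles with expansions.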
Controlling the accuracy of the approximation and performance
Truncated expansion: $\sum_{l=0}^{p} \sum_{m=-l}^{l}$
Preprocessing
Input: ws (separation criterion), d (tree depth), p (multipole order), xyz (particle positions), q (charges)
Stages: Preprocessing, Pass 1, Pass 2, Pass 3, Pass 4, Pass 5
Output: E (energy), F (forces)
Pass 1
$\omega_{lm}(\mathbf{a} + \mathbf{b}) = \sum_{j=0}^{l} \sum_{k=-j}^{j} \omega_{jk}(\mathbf{a})\, O_{l-j,m-k}(\mathbf{b})$, for $l = 0, \dots, p$ and $m = -l, \dots, l$
Pass 2
$\mu_{lm}(\mathbf{b} - \mathbf{a}) = \sum_{j=0}^{p} \sum_{k=-j}^{j} \omega_{jk}(\mathbf{a})\, M_{l+j,m+k}(\mathbf{b})$, for $l = 0, \dots, p$ and $m = -l, \dots, l$
Pass 3
$\mu_{lm}(\mathbf{r} - \mathbf{b}) = \sum_{j=l}^{p} \sum_{k=-j}^{j} \mu_{jk}(\mathbf{r})\, O_{j-l,k-m}(\mathbf{b})$, for $l = 0, \dots, p$ and $m = -l, \dots, l$
Pass 4 and 5
Far field
Physical memory alignment
Full coefficient index set (l,m) for p = 2: (0,0) (1,0) (1,1) (1,-1) (2,0) (2,1) (2,2) (2,-1) (2,-2)
Stored coefficients (m >= 0 only): (0,0) (1,0) (1,1) (2,0) (2,1) (2,2)
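The reduced index list above keeps only coefficients with m >= 0, which for p = 2 leaves 6 of the 9 entries. A minimal C++ sketch of such a triangular layout follows. The degree-major ordering in `tri_index` and the sign convention ω(l,-m) = (-1)^m conj(ω(l,m)) used in `coeff` are assumptions for illustration; the slides do not state the ordering or the normalization actually used.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Number of stored coefficients for orders 0..p when only m >= 0 is kept.
constexpr std::size_t tri_size(int p) {
    return static_cast<std::size_t>((p + 1) * (p + 2) / 2);
}

// Linear index of (l, m) with 0 <= m <= l in a degree-major triangular layout.
constexpr std::size_t tri_index(int l, int m) {
    return static_cast<std::size_t>(l * (l + 1) / 2 + m);
}

// Read coefficient (l, m) for any -l <= m <= l from triangular storage,
// assuming the convention w(l,-m) = (-1)^m * conj(w(l,m)).
std::complex<double> coeff(const std::vector<std::complex<double>>& w, int l, int m) {
    if (m >= 0) {
        return w[tri_index(l, m)];
    }
    const std::complex<double> c = std::conj(w[tri_index(l, -m)]);
    return ((-m) % 2 == 0) ? c : -c;
}
```

Storing only the m >= 0 half roughly halves the memory traffic per expansion, at the price of a conjugation and a sign flip when a negative m is needed.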
Tree loop and Box-Neighbor Structure, ws=1
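Translating multipole expansion to local expansion, p^4 loop structure
$\mu_{lm}(\mathbf{b} - \mathbf{a}) = \sum_{j=0}^{p} \sum_{k=-j}^{j} \omega_{jk}(\mathbf{a})\, M_{l+j,m+k}(\mathbf{b})$
for (int l = 0; l <= p; ++l)
  for (int m = 0; m <= l; ++m) {
    // accumulator for target (l, m); type name and operator access
    // M[m_idx](l + j, m + k) are assumed, reconstructed from the formula
    complex_type omega_l_m = 0;
    for (int j = 0; j <= p; ++j) {
      for (int k = -j; k <= j; ++k) {
        omega_l_m += M[m_idx](l + j, m + k) * omega[o_idx](j, k);
      }
    }
    mu[mu_idx](l, m) += omega_l_m;
  }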
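To make the p^4 loop structure concrete, here is a self-contained C++ sketch of one multipole-to-local translation on flat arrays. The square index layout `sq_index`, the function name `m2l`, and the use of `std::complex<double>` are assumptions for illustration; the slides only show the loop nest, not the storage behind `omega`, `mu`, or the operator.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Full square layout: (l, m) with -l <= m <= l maps to l*l + l + m.
inline std::size_t sq_index(int l, int m) {
    return static_cast<std::size_t>(l * l + l + m);
}

// mu(l,m) += sum_{j=0}^{p} sum_{k=-j}^{j} M(l+j, m+k) * omega(j,k)
// omega and mu hold orders 0..p, M holds orders 0..2p, all in sq_index layout.
// Only targets with m >= 0 are computed, as in the loop on the slide.
void m2l(int p, const std::vector<cplx>& omega, const std::vector<cplx>& M,
         std::vector<cplx>& mu) {
    for (int l = 0; l <= p; ++l) {
        for (int m = 0; m <= l; ++m) {
            cplx acc(0.0, 0.0);
            for (int j = 0; j <= p; ++j) {
                for (int k = -j; k <= j; ++k) {
                    acc += M[sq_index(l + j, m + k)] * omega[sq_index(j, k)];
                }
            }
            mu[sq_index(l, m)] += acc;
        }
    }
}
```

The two outer loops visit about p²/2 targets and the two inner loops about p² sources, which gives the p^4 scaling per box pair.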
Tree loop and Box-Neighbor Structure, ws=1
parent_kernel<<<(1,1,1),(3,3,3)>>>
child_kernel<<<(2,2,2),(p,p,1)>>>
Depth 4, 4096 boxes, periodic boundaries, 4096*189 = 774144 p^4 operations
[Figure: time in seconds (log scale, 1.2E-04 to 5.0E-01) versus multipole order (1 to 12); series: computational effort (estimate), dynamic]
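The factor 189 is the number of interaction partners per box for ws=1: the children of the parent's 3x3x3 neighborhood form a 6x6x6 block (216 boxes), from which the box's own 3x3x3 near field (27 boxes) is excluded. The following C++ sketch counts this by brute force; the function name and the coordinate convention (child coordinates relative to the parent's origin) are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdlib>

// Counts the boxes in the interaction list of one child box.
// (cx, cy, cz), each 0 or 1, is the position of the box inside its parent.
// Candidates are all children of the parent's neighbors within ws parent
// boxes; boxes closer than or equal to ws (Chebyshev distance) are near field.
int interaction_list_size(int ws, int cx, int cy, int cz) {
    const int lo = -2 * ws;     // first child coordinate of the leftmost parent
    const int hi = 2 * ws + 1;  // last child coordinate of the rightmost parent
    int count = 0;
    for (int x = lo; x <= hi; ++x) {
        for (int y = lo; y <= hi; ++y) {
            for (int z = lo; z <= hi; ++z) {
                const int dist = std::max({std::abs(x - cx), std::abs(y - cy),
                                           std::abs(z - cz)});
                if (dist > ws) {
                    ++count;
                }
            }
        }
    }
    return count;
}
```

With periodic boundaries every box has the full list, so a tree level with 4096 boxes needs 4096 * 189 = 774144 translations.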
Better tree abstraction
[Table: per-box list with columns 1, 2, …, 188]
Enhancing memory access patterns and index computation
Kernel<Symmetry_Type><<<(189,100,1),(64,1,1)>>> // d=3, p=10
Annotations: # of operators (189), p^2 (100), # of boxes and p^2 (sequential)
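The "SoA" series in the benchmark legends points to a structure-of-arrays layout for the expansion coefficients. A minimal C++ sketch of such a layout is given below; the type name `ExpansionSoA`, the single-precision storage, and the choice to make the box index the fastest-varying one are assumptions, not details taken from the slides.

```cpp
#include <cstddef>
#include <vector>

// Structure of arrays: real and imaginary parts live in separate arrays,
// and one coefficient of all boxes is stored contiguously.
struct ExpansionSoA {
    std::size_t num_boxes;
    std::size_t num_coeffs;
    std::vector<float> re;
    std::vector<float> im;

    ExpansionSoA(std::size_t boxes, std::size_t coeffs)
        : num_boxes(boxes), num_coeffs(coeffs),
          re(boxes * coeffs, 0.0f), im(boxes * coeffs, 0.0f) {}

    // Index of coefficient `coeff` in box `box`: consecutive boxes are adjacent.
    std::size_t index(std::size_t box, std::size_t coeff) const {
        return coeff * num_boxes + box;
    }
};
```

With this ordering, GPU threads that work on consecutive boxes read consecutive addresses for the same coefficient, which is the access pattern that coalesces well.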
Depth 2, 64 boxes, periodic boundaries, 64*189 = 12096 p^4 operations
[Figure: time in seconds (log scale, 1.9E-06 to 3.1E-02) versus multipole order (1 to 12); series: computational effort (estimate), dynamic, presorted, SoA]
Depth 3, 512 boxes, periodic boundaries, 512*189 = 96768 p^4 operations
[Figure: time in seconds (log scale, 1.5E-05 to 6.3E-02) versus multipole order (1 to 12); series: computational effort (estimate), dynamic, presorted, SoA]
Depth 4, 4096 boxes, periodic boundaries, 4096*189 = 774144 p^4 operations
[Figure: time in seconds (log scale, 1.2E-04 to 5.0E-01) versus multipole order (1 to 12); series: computational effort (estimate), dynamic, presorted, SoA]
Exploring operator symmetry
[Figure: operator labels x,y and x,-y]
within the parent box
Enhancing memory access patterns and index computation
[Figure: operators labeled x,y with bit patterns 1010011101, 0110110101, 1110110101, 1110110101]
Kernel<Symmetry_Type, Operator><<<(7,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(21,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(21,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(7,100,1),(64,1,1)>>>(…)
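The grid x-dimensions of the four launches add up to 7 + 21 + 21 + 7 = 56. One plausible reading, which is an assumption and not stated on the slides, is that 56 is the number of distinct M2L shift vectors for ws=1 once vectors that differ only by the sign of a component are treated as one operator. The C++ sketch below counts both the full and the sign-reduced operator sets by brute force; the names `OperatorCounts` and `count_operators` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstdlib>
#include <set>

// Number of distinct shift vectors, and the number that remains when
// vectors differing only by component signs are merged.
struct OperatorCounts {
    int all;
    int up_to_sign;
};

OperatorCounts count_operators(int ws) {
    std::set<std::array<int, 3>> all;
    std::set<std::array<int, 3>> reduced;
    const int lo = -2 * ws;
    const int hi = 2 * ws + 1;
    // Loop over the 8 child positions and over all candidate target boxes.
    for (int c = 0; c < 8; ++c) {
        const int cx = c & 1, cy = (c >> 1) & 1, cz = (c >> 2) & 1;
        for (int x = lo; x <= hi; ++x)
            for (int y = lo; y <= hi; ++y)
                for (int z = lo; z <= hi; ++z) {
                    const int dx = x - cx, dy = y - cy, dz = z - cz;
                    const int dist = std::max({std::abs(dx), std::abs(dy), std::abs(dz)});
                    if (dist <= ws) continue;  // near field, no M2L operator
                    all.insert({dx, dy, dz});
                    reduced.insert({std::abs(dx), std::abs(dy), std::abs(dz)});
                }
    }
    return {static_cast<int>(all.size()), static_cast<int>(reduced.size())};
}
```

For ws=1 this gives 316 shift vectors in total and 56 after merging sign variants, which matches the sum of the four grid sizes.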
Depth 4, 4096 boxes, periodic boundaries, 4096*189 = 774144 p^4 operations
[Figure: time in seconds (log scale, 1.2E-04 to 5.0E-01) versus multipole order (1 to 12); series: computational effort (estimate), dynamic, presorted, SoA, reduced operator, reduced operator - resorting targets]
FMM - DEPTH 3, 512 BOXES
Intel i7-6800 CPU, 4 Cores (8 Hyperthreads), GeForce GTX 1060
GROMACS: fourierspacing 0.12 nm, cutoff 1.0 nm, interpolation order 4
[Figure: time in seconds (0.00 to 0.16), x-axis ticks 1 to 11; legend: FORCES, L2L, LATTICE, M2L, M2M, P2M, COPY AND FLUSH, DISTRIBUTE PARTICLES, P2P, GROMACS, P2P]
FMM - DEPTH 4, 4096 BOXES
[Figure: same hardware, GROMACS settings, axes, and legend as for depth 3]
FMM - DEPTH 3.32, 1000 BOXES
[Figure: same hardware, GROMACS settings, axes, and legend as for depth 3]
Summary and results
Questions?