Efficient GPU parallelization of the Fast Multipole Method with periodic boundary conditions - PowerPoint PPT Presentation
SLIDE 1

Efficient GPU parallelization of the Fast Multipole Method with periodic boundary conditions

Bartosz Kohnke

SLIDE 2

Fast Multipole Method (FMM)

Calculation of long-ranged forces in the n-body problem (Greengard and Rokhlin, 1987)

08.05.2017 2 B.Kohnke

  • Tree-based approach
  • O(n) complexity, n := number of particles
  • Multipole expansion of the long-range 1/r interactions
SLIDE 3

Fast Multipole Method (FMM)

1/r long-range interactions

Applications: molecular dynamics, plasma physics, astrophysics

SLIDE 4

Molecular Dynamics Simulation


Simulations on the atomistic level

SLIDE 5

F = Σ_α F_α + Σ_{ij} ( F_ij^Coul. + F_ij^vdW. )

Molecular Dynamics Simulation

Describing the energy of the system

Bonded interactions

[Plot: energy E as a function of distance R]


SLIDE 7

F = Σ_α F_α + Σ_{ij} ( F_ij^Coul. + F_ij^vdW. )

Molecular Dynamics Simulation

Describing the energy of the system

Non-bonded interactions

[Plot: energy E as a function of distance R]

N-body problem O(n²) + periodic boundaries

SLIDE 8

GROMACS


  • Splitting the computation of the electrostatic potential into two parts
  • Direct
  • Compute the particle-particle interactions directly within a cutoff
  • Reciprocal (FFT based)
  • Extrapolate the charges onto a grid
  • FFT of the charge grid
  • Computation O(n log n)
  • Communication O(nodes²)

Basic idea of PME

F = F_direct + F_reciprocal

Particle Mesh Ewald (PME)

SLIDE 9

PME vs FMM

Massive parallelization, 150000 Particles

Parallel efficiency

[Chart: parallel efficiency (0-100%) vs. replication factor (1-256) for PME and FMM (near field)]

SLIDE 12

FMM


Basic Idea

Classical O(n²) approach · Tree-code O(n log n) · FMM O(n)

SLIDE 13

FMM – Parameters

Controlling the accuracy of the approximation and the performance

FMM – Parameters d – depth of the FMM tree

SLIDE 14

FMM – Parameters

Controlling the accuracy of the approximation and the performance

FMM – Parameters ws – separation criterion

SLIDE 15

FMM – Parameters

Controlling the accuracy of the approximation and the performance

FMM – Parameters p – multipole order

1/d = Σ_{l=0}^{p} Σ_{m=−l}^{l} (l−m)!/(l+m)! · a^l/r^{l+1} · P_lm(cos α) P_lm(cos θ) e^{−im(β−φ)}

SLIDE 16

FMM – Workflow


Preprocessing

Particle input – positions and charges (x, y, z, q)

[Workflow diagram: Preprocessing → Pass 1 → Pass 2 → Pass 3 → Pass 4 → Pass 5; inputs: ws, d, p, xyz, q; outputs: E, F]

Set algorithm parameters

  • ws – separation criterion
  • p – multipole order
  • d – tree depth
SLIDE 17

FMM – Workflow


Preprocessing

Clustering particles


Build a tree according to chosen parameter d

  • ws – separation criterion
  • p – multipole order
  • d – tree depth
SLIDE 18

FMM – Workflow


Preprocessing

Particle to multipole (P2M)


Expand particle positions and charges into multipole moments ω

  • O(np²) operation
  • Fills the boxes on the lowest level
SLIDE 19

FMM – Workflow


Pass 1

Translate the multipole moments ω up the FMM tree

  • O(p⁴) operation
  • One operation per box
  • Vertical operator

Multipole to multipole (M2M)


πœ• 𝐛 + 𝐜 =

π‘š=0 π‘ž 𝑛=βˆ’π‘š π‘š π‘˜=0 π‘š 𝑙=βˆ’π‘˜ π‘˜

πœ•π‘˜π‘™ 𝐛 π‘ƒπ‘šβˆ’π‘˜,π‘›βˆ’π‘™(𝐜)

SLIDE 20

FMM – Workflow


Pass 2

Multipole to local (M2L): transform the multipole moments ω into Taylor moments μ

  • O(p⁴) operation
  • 189 operations per box (ws = 1)
  • Horizontal operator


μ_lm(b − a) = Σ_{j=0}^{p} Σ_{k=−j}^{j} ω_jk(a) M_{l+j,m+k}(b)

SLIDE 21

FMM – Workflow


Pass 3

Local to local (L2L): translate the local moments μ down the tree

  • O(p⁴) operation
  • One operation per box
  • Vertical operator


μ_lm(r − b) = Σ_{j=l}^{p} Σ_{k=−j}^{j} μ_jk(r) M_{j−l,k−m}(b)

SLIDE 22

FMM – Workflow


Pass 4 and 5

Computing the forces: far-field contribution from μ and ω

  • Compute the potential Φ_FF, field E_FF and forces F_FF
  • O(np²) operation

Near field contribution (particle to particle)


  • Compute the potential Φ_NF, field E_NF and forces F_NF
  • O(n_cutoff³) operation

SLIDE 23

FMM – Data Structures


Far field

Coefficient Matrix, Generalized Storage Type

  • Matrix size O(p²)
  • Triangular shape
  • One complex value per element
  • Used as storage for:
  • Multipole moments ω
  • Taylor moments μ
  • Operators M

[Figure: triangular (l,m) index layout and its physical memory alignment]

SLIDE 24
  • For all boxes in the tree
  • Determine the interaction set
  • Children of direct neighbors of the own parent
  • Determine the M2L operator
  • Compute one M2L operation
  • For each valid ω, μ pair

M2L Operation – Tree structure

Tree loop and box-neighbor structure, ws = 1

M2L Operation

[Figure: FMM tree with multipole moments ω, Taylor moments μ and operators M]

SLIDE 27
  • For all boxes in the tree
  • Determine the interaction set
  • Children of direct neighbors of the own parent
  • Determine the valid M2L operator
  • Compute one M2L operation
  • For each valid ω, μ pair

M2L Operation – Tree structure

Tree loop and box-neighbor structure, ws = 1

M2L Operation

μ_lm(b − a) = Σ_{j=0}^{p} Σ_{k=−j}^{j} ω_jk(a) M_{l+j,m+k}(b)

SLIDE 28


Translating the multipole expansion into a local expansion: the p⁴ loop structure

μ_lm(b − a) = Σ_{j=0}^{p} Σ_{k=−j}^{j} ω_jk(a) M_{l+j,m+k}(b)

Translation Operation (M2L) – O(p⁴)

One M2L operation

for (int l = 0; l <= p; ++l)
  for (int m = 0; m <= l; ++m) {
    mu_l_m = 0;
    for (int j = 0; j <= p; ++j) {
      for (int k = -j; k <= j; ++k) {
        mu_l_m += M[m_idx](l + j, m + k) * omega[o_idx](j, k);
      }
    }
    mu[mu_idx](l, m) += mu_l_m;
  }

M2L Operation – Internal Structure

SLIDE 31

M2L – GPU dynamic parallelism


Tree loop and Box – Neighbor Structure, ws=1

  • Blocksize too small for small p values
  • Small grids (streams help to utilize the GPU)

M2L Operation

  • Start one parent kernel for each ω
  • Parent kernel
  • Computes interaction set
  • Spawns the child kernels
  • Child kernel
  • Compute one p⁴ operation (p² threads)

parent_kernel<<<(1,1,1),(3,3,3)>>>
child_kernel<<<(2,2,2),(p,p,1)>>>


SLIDE 33


M2L Dynamic Scheme + Shared Memory

Depth 4, 4096 boxes, periodic boundaries, 4096 · 189 = 774144 p⁴ operations

[Chart: M2L time in seconds (log scale, ~1.2e-4 to ~5e-1) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic]

SLIDE 34

M2L – enhanced parallelization

Better tree abstraction


  • For each box in the tree store:
  • ω
  • A list of pointers to the targets μ_j
  • A list of pointers to the operators M_j

Idea – precompute interaction sets

[Table: per source ω_j, pointer lists to the 189 targets μ_1 … μ_189 and operators M_1 … M_189]

  • Overhead only in the initialization step
  • Increased memory requirement
  • Redundant pointer storage

SLIDE 35

M2L – enhanced parallelization

Better tree abstraction


  • Periodic boundary conditions
  • All interaction lists have the same size
  • For each box in the tree store:
  • ω
  • A list of pointers to the targets μ_j
  • A list of pointers to the operators M_j

Idea – precompute interaction sets

[Table: per source ω_j, pointer lists to the 189 targets μ_1 … μ_189 and operators M_1 … M_189]

SLIDE 36

M2L – precomputed interaction sets

Enhancing memory access patterns and index computation


  • Precompute the interaction lists
  • Reorganize memory SoA → AoS

Parallelization

π››πŸ π››πŸ π››πŸ’ π››πŸ‘ π››πŸ π››πŸπ››πŸ‘ π››πŸ’

  • Threading at tree level (for d ≥ 3)
  • One kernel launch per Symmetry_Type

Kernel<Symmetry_Type><<<(189,100,1),(64,1,1)>>> // d = 3, p = 10

p2 # of boxes and p2 (sequential) # of operators


SLIDE 37

M2L – precomputed interaction sets

Enhancing memory access patterns and index computation


Kernel<Symmetry_Type><<<(189,100,1),(64,1,1)>>> // d = 3, p = 10

  • Threads within the same block share the operator (shared memory)
  • Nearly no index computation for ω
  • ω_j = ThreadId + offset(Symmetry_Type)
  • Targets μ_j accessed through a list
  • Presorted to achieve coalescing

Parallelization

π››πŸ π››πŸ π››πŸ’ π››πŸ‘

Ξ©

πœ• 𝜈 𝑁

SLIDE 38


M2L – precomputed interaction sets

Depth 2, 64 boxes, periodic boundaries, 64 · 189 = 12096 p⁴ operations

[Chart: M2L time in seconds (log scale) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic, presorted SoA]

SLIDE 39


M2L – precomputed interaction sets

Depth 3, 512 boxes, periodic boundaries, 512 · 189 = 96768 p⁴ operations

[Chart: M2L time in seconds (log scale) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic, presorted SoA]

SLIDE 40


M2L – precomputed interaction sets

Depth 4, 4096 boxes, periodic boundaries, 4096 · 189 = 774144 p⁴ operations

[Chart: M2L time in seconds (log scale) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic, presorted SoA]

SLIDE 41

M2L – reduced operator set

Exploring operator symmetry

Symmetry of the operators


  • The whole operator set contains 316 operators (3D)
  • There are 56 unique operators (3D)
  • Operators of the same length are rotationally symmetric
  • Only the signs of the symmetric counterparts' elements differ

SLIDE 43

M2L – reduced operator set

Exploring operator symmetry

Symmetry of the operators


[Figure: operator pairs mirrored as (x,y) / (x,−y)]

  • The whole operator set contains 316 operators (3D)
  • There are 56 unique operators (3D)
  • Operators of the same length are rotationally symmetric
  • Only the signs of the symmetric counterparts' elements differ

SLIDE 44

M2L – reduced operator set

Exploring operator symmetry

Different symmetry groups


  • The interaction sets are asymmetric
  • 8 distinct types according to the position within the parent box
  • There are 4 symmetry groups (3D)
  • 7 operators are unique
  • 21 operators are 2-way symmetric
  • 21 operators are 4-way symmetric
  • 7 operators are 8-way symmetric
SLIDE 49
  • Preload the operator data into shared memory
  • Only once for the whole symmetry group
  • Preload the bitsets of signs for the symmetrical counterparts
  • One full dot product yields (1, 2, 4, 8) results, depending on the symmetry group

M2L – precomputed interaction sets

Enhancing memory access patterns and index computation


Parallelization of reduced operator

π››πŸ π››πŸ π››πŸ’ π››πŸ‘

Ξ©

πœ• 𝜈 𝑁

x,y x,y x,y x,y

1010011101 0110110101 1110110101 1110110101

SLIDE 51

M2L – precomputed interaction sets

Enhancing memory access patterns and index computation



Parallelization of reduced operator

  • Grid size reduced by a factor of ~4
  • Logic added to fetch and flip the signs of the result

Kernel<Symmetry_Type, Operator><<<(7,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(21,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(21,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(7,100,1),(64,1,1)>>>(…)

SLIDE 52


M2L – reduced operator

Depth 4, 4096 boxes, periodic boundaries, 4096 · 189 = 774144 p⁴ operations

[Chart: M2L time in seconds (log scale) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic, presorted SoA, reduced operator]

SLIDE 53
  • Uncoalesced memory access for the symmetrical counterparts' targets
  • Solution
  • Store coalesced pointers for all (l,m) targets

M2L – precomputed interaction sets

Enhancing memory access patterns and index computation



Parallelization of reduced operator

Kernel<Symmetry_Type, Operator><<<(7,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(21,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(21,100,1),(64,1,1)>>>(…)
Kernel<Symmetry_Type, Operator><<<(7,100,1),(64,1,1)>>>(…)

SLIDE 54


M2L – reduced operator

Depth 4, 4096 boxes, periodic boundaries, 4096 · 189 = 774144 p⁴ operations

[Chart: M2L time in seconds (log scale) vs. multipole order p = 1-12; series: computational effort (estimate), dynamic, presorted SoA, reduced operator, reduced operator with resorted targets]

SLIDE 55

FMM vs GROMACS - 285119 particles – channel protein TehA

FMM – DEPTH 3, 512 BOXES

Intel i7-6800 CPU, 4 cores (8 hyperthreads), GeForce GTX 1060
GROMACS: fourierspacing 0.12 nm, cutoff 1.0 nm, interpolation order 4

[Chart: wall-clock time (0-0.16 s) per multipole order p = 1-11; stacked FMM stages (FORCES, L2L, LATTICE, M2L, M2M, P2M, COPY AND FLUSH, DISTRIBUTE PARTICLES, P2P) compared with GROMACS P2P]

SLIDE 56

FMM vs GROMACS - 285119 particles – channel protein TehA

FMM – DEPTH 4, 4096 BOXES

Intel i7-6800 CPU, 4 cores (8 hyperthreads), GeForce GTX 1060
GROMACS: fourierspacing 0.12 nm, cutoff 1.0 nm, interpolation order 4

[Chart: wall-clock time (0-0.16 s) per multipole order p = 1-11; stacked FMM stages (FORCES, L2L, LATTICE, M2L, M2M, P2M, COPY AND FLUSH, DISTRIBUTE PARTICLES, P2P) compared with GROMACS P2P]

SLIDE 57

FMM vs GROMACS - 285119 particles – channel protein TehA

FMM – DEPTH 3.32, 1000 BOXES

Intel i7-6800 CPU, 4 cores (8 hyperthreads), GeForce GTX 1060
GROMACS: fourierspacing 0.12 nm, cutoff 1.0 nm, interpolation order 4

[Chart: wall-clock time (0-0.16 s) per multipole order p = 1-11; stacked FMM stages (FORCES, L2L, LATTICE, M2L, M2M, P2M, COPY AND FLUSH, DISTRIBUTE PARTICLES, P2P) compared with GROMACS P2P]

SLIDE 58

GPU parallelization of the FMM


Summary and results

Summary

  • All stages of the FMM run on a GPU
  • Distributing particles
  • Building multipoles
  • M2M (compute bound)
  • M2L (compute bound)
  • L2L (compute bound)
  • Lattice (dealing with periodicity)
  • Computing far-field forces
  • P2P (compute bound)
  • Enhancements are still possible
  • M2L (exploring more symmetry)
  • Redistributing the work between the CPU and GPU (CPU is nearly idle)
SLIDE 59

Questions?

Thank You