FMM goes GPU
A smooth trip or bumpy ride?
B. Kohnke, I. Kabadshow
MPI BPC Göttingen & Jülich Supercomputing Centre
Member of the Helmholtz-Association
March 18, 2015
The Fast Multipole Method: FMM Preliminaries, Operators, Algorithm
GPU Parallelization Strategies: Naïve parallelism, Dynamic parallelism
Results & Conclusions
Slide 2
1/r Long-Range Interactions in O(N) instead of O(N^2)
Molecular Dynamics, Plasma Physics, Astrophysics
Slide 3
Different Approaches to Solve the N-body Problem

Classical Approach
- Simple load balancing
- Computationally expensive, O(N^2)
- Cannot be applied to large particle systems

PME (Particle Mesh Ewald)
- Computational complexity O(N log N)
- FFT-based
- Not flexible enough for our use case

Fast Multipole Method
- Computational complexity O(N)
- Tree-based approach
Slide 4
For all particles, compute the force acting on charge q_i due to charge q_j, integrate the equations of motion, and determine the new system configuration. The pairwise Coulomb energy is

E_C = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{q_i \, q_j}{r_{ij}}    (1)

The FMM approximates the far-field part of this sum with multipole expansions truncated at order p:

E_C \approx \sum_{l=0}^{p} \sum_{m=-l}^{l} \sum_{j=0}^{p} \sum_{k=-j}^{j} (-1)^j \, \omega^A_{lm}(a) \, \mathrm{M2L}(R) \, \omega^B_{jk}(b)

Slide 5
Depth of the FMM Tree d
The simulation box is divided into 8^d subboxes (in 3D).
Slide 6
Separation Criterion ws
The near field contains (2·ws + 1)^3 boxes.
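As a quick check of these two parameters, the box counts can be computed directly; a minimal sketch in plain C++ (illustrative only, not part of the presented code):

#include <cstdio>

// Number of subboxes on tree level d (octree in 3D: 8^d) and size of the
// near-field stencil for separation criterion ws: (2*ws + 1)^3 boxes.
int main() {
    int d  = 4;   // example tree depth
    int ws = 1;   // example separation criterion

    long boxes_on_level  = 1L << (3 * d);                               // 8^d = 4096 for d = 4
    int  near_field_size = (2 * ws + 1) * (2 * ws + 1) * (2 * ws + 1);  // 27 for ws = 1

    std::printf("boxes on level %d: %ld\n", d, boxes_on_level);
    std::printf("near-field boxes for ws = %d: %d\n", ws, near_field_size);
    return 0;
}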
Slide 7
Number of Poles p

Infinite Expansion
\frac{1}{d} = \frac{1}{|r - a|} = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} O_{lm}(a) \cdot M_{lm}(r)

Finite Expansion Up To Order p
\frac{1}{d} = \frac{1}{|r - a|} \approx \sum_{l=0}^{p} \sum_{m=-l}^{l} O_{lm}(a) \cdot M_{lm}(r)
Slide 8
FMM Workflow
Define the FMM parameters ws, d, p.
Pass 1: Set up the FMM tree and expand multipoles (P2M); shift multipoles to the root node (M2M).
Pass 2: Translate multipole moments into Taylor moments (M2L).
Pass 3: Shift Taylor moments to the leaf nodes (L2L).
Pass 4: Computation of the far-field energy, forces, and potentials.
Pass 5: Computation of the near-field energy, forces, and potentials.
Slide 9
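Put together, the five passes form a single sweep over the tree. The following is only a structural sketch of that sweep; the types and the helper functions (p2m, m2m, m2l, l2l, far_field, p2p) are hypothetical placeholders, not the presented implementation:

// Structural sketch of the five FMM passes (hypothetical placeholder types and helpers).
struct Tree {};        // placeholder for the octree of depth d
struct Particles {};   // placeholder for the particle data

void p2m(Tree&, Particles&, int /*p*/) {}                 // P2M: expand multipoles on leaf boxes
void m2m(Tree&, int /*level*/, int /*p*/) {}              // M2M: shift multipoles towards the root
void m2l(Tree&, int /*level*/, int /*ws*/, int /*p*/) {}  // M2L: multipole -> Taylor moments
void l2l(Tree&, int /*level*/, int /*p*/) {}              // L2L: shift Taylor moments to the leaves
void far_field(Tree&, Particles&, int /*p*/) {}           // far-field energy, forces, potentials
void p2p(Tree&, Particles&, int /*ws*/) {}                // near-field (direct) interactions

void fmm(Tree& tree, Particles& particles, int ws, int d, int p) {
    p2m(tree, particles, p);                                           // Pass 1
    for (int level = d; level > 2; --level)  m2m(tree, level, p);      // Pass 1 (upward shift)
    for (int level = 2; level <= d; ++level) m2l(tree, level, ws, p);  // Pass 2
    for (int level = 2; level < d; ++level)  l2l(tree, level, p);      // Pass 3
    far_field(tree, particles, p);                                     // Pass 4
    p2p(tree, particles, ws);                                          // Pass 5
}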
Farfield
Multipole order p, matrix size O(p^2), triangular shape; stored as a coefficient matrix of size O(p^2).
The M2L operation builds each Taylor coefficient from all multipole coefficients:

\mu_{lm} = \sum_{j=0}^{p} \sum_{k=-j}^{j} \omega_{jk}(a) \, O_{l+j,\,m+k}(b), \qquad 0 \le m \le l \le p
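Since the expansions are triangular (0 <= m <= l <= p), a coefficient matrix of this kind is typically stored as a flat array of (p+1)(p+2)/2 entries. A small sketch of such a layout (real-valued entries for illustration; the actual storage of the presented code is not shown on the slides):

#include <vector>
#include <cstddef>

// Sketch: flat storage for a triangular coefficient matrix omega_{lm},
// 0 <= m <= l <= p, i.e. (p+1)(p+2)/2 entries in total.
struct TriangularCoefficients {
    int p;
    std::vector<double> data;  // real part only, for illustration

    explicit TriangularCoefficients(int order)
        : p(order), data((order + 1) * (order + 2) / 2, 0.0) {}

    // Row-by-row index of the (l, m) coefficient.
    std::size_t index(int l, int m) const { return static_cast<std::size_t>(l) * (l + 1) / 2 + m; }

    double&       operator()(int l, int m)       { return data[index(l, m)]; }
    const double& operator()(int l, int m) const { return data[index(l, m)]; }
};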
Slide 10
Box – Neighbor Structure, ws=1
for (int d_c = 2; d_c <= depth; ++d_c) {            // loop over all relevant tree levels
  for (int i = 0; i < dim; ++i) {                    // loop over all boxes on a certain level
    for (int j = 0; j < dim; ++j) {
      for (int k = 0; k < dim; ++k) {
        int idx = boxid(i, j, k);
        for (int ii = 0; ii < 6; ++ii) {             // loop over all neighbors (216 for ws=1)
          for (int jj = 0; jj < 6; ++jj) {
            for (int kk = 0; kk < 6; ++kk) {
              int idx_n  = neighbor_boxid(ii, jj, kk);
              int idx_op = operator_boxid(i, ii, j, jj, k, kk);
              // single p^4 operation
              M2L_operation(omega, *m2l_operators[d_c], mu, p, idx, idx_n, idx_op);
            }
          }
        }
      }
    }
  }
}
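The index helpers used above are not shown on the slide. Purely as an illustration of where the 216 neighbor boxes for ws = 1 come from, here is a hypothetical pair of helpers for a dim x dim x dim level with row-major numbering (the presented code may use different signatures and a different box ordering, e.g. Morton/Z-order):

// Hypothetical helpers, illustrative only.

// Row-major id of box (i, j, k) on a level with dim boxes per edge.
static inline int boxid3(int i, int j, int k, int dim) {
    return (i * dim + j) * dim + k;
}

// For ws = 1 the interaction region of box (i, j, k) consists of the
// 6 x 6 x 6 = 216 children of the parent box's 3 x 3 x 3 neighborhood.
// (ii, jj, kk) in [0, 6) select one box of that region; boxes outside the
// domain must be skipped (or wrapped for periodic boundaries).
static inline int interaction_boxid(int i, int j, int k,
                                    int ii, int jj, int kk, int dim) {
    int ni = 2 * (i / 2) - 2 + ii;
    int nj = 2 * (j / 2) - 2 + jj;
    int nk = 2 * (k / 2) - 2 + kk;
    if (ni < 0 || nj < 0 || nk < 0 || ni >= dim || nj >= dim || nk >= dim)
        return -1;   // outside the open-boundary domain
    return boxid3(ni, nj, nk, dim);
}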
Slide 11
p^4 Loop Structure

void M2L_operation(...) {
  for (int l = 0; l <= p; ++l) {
    for (int m = 0; m <= l; ++m) {
      mu_l_m = 0;
      for (int j = 0; j <= p; ++j) {
        for (int k = -j; k <= j; ++k) {
          mu_l_m += M2L[idx_op](l + j, m + k) * omega[idx_n](j, k);
        }
      }
      mu[idx](l, m) += mu_l_m;
    }
  }
}
Slide 12
Full parallelization – all loops
- Parallelize all loops
- Map the last reduction loop to warps; use cub::WarpReduce for the reduction step mu_l_m += ... (see the sketch below)

Drawbacks
- Massive bookkeeping overhead
- Execution time dominated by many integer division/modulo operations

for (int l = 0; l <= p; ++l)
  for (int m = 0; m <= l; ++m) {
    mu_l_m = 0;
    for (int j = 0; j <= p; ++j)
      for (int k = -j; k <= j; ++k)
        mu_l_m += M2L[idx_op](l + j, m + k) * omega[idx_n](j, k);
    mu[idx](l, m) += mu_l_m;
  }
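A minimal sketch of the cub::WarpReduce step mentioned above, i.e. one warp summing the partial products of a single (l, m) target coefficient. The flattened term tables j_of/k_of, the dense coefficient layout, and the single-warp thread block are assumptions for illustration; this is not the presented kernel:

#include <cub/cub.cuh>

// Hypothetical dense layout: coefficient (l, m), m in [-l, l], stored at index l*l + l + m.
__device__ inline double coeff(const double* a, int l, int m) {
    return a[l * l + l + m];
}

// One warp reduces all (j, k) terms contributing to the (l, m) Taylor coefficient.
// m2l_op: (2p+1)^2 entries, omega: (p+1)^2 entries (dense layout above).
// j_of/k_of: assumed precomputed tables mapping a flat term index t to (j, k);
// n_terms = (p + 1)^2. Assumes a single 32-thread warp per block.
__device__ double m2l_reduce_lm(const double* m2l_op, const double* omega,
                                const int* j_of, const int* k_of,
                                int l, int m, int n_terms)
{
    typedef cub::WarpReduce<double> WarpReduce;
    __shared__ typename WarpReduce::TempStorage temp_storage;

    double partial = 0.0;
    for (int t = threadIdx.x; t < n_terms; t += 32) {   // lanes stride over the (j, k) terms
        int j = j_of[t], k = k_of[t];
        partial += coeff(m2l_op, l + j, m + k) * coeff(omega, j, k);
    }
    return WarpReduce(temp_storage).Sum(partial);        // result valid in lane 0
}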
Slide 13
[Plot: depth 4, 4096 boxes; time [s] (log scale) vs. multipole order p; curve: full parallelization]
Slide 14
[Plot: costs of index computation (0–100 %) vs. multipole order p for the full-parallel M2L kernel; curves: depth 2, depth 3, depth 4]
Keeping the innermost loop sequential gives an additional 2x–7x speedup.
Slide 15
[Plot: depth 4, 4096 boxes; time [s] (log scale) vs. multipole order p; curves: full parallelization, sequential inner loop]
Slide 16
Dynamic Scheme 1
- Loop over all tree levels (2–10) @ host CPU
- Loop over boxes on a certain level (64, 512, 4096, ...): launch kernel from host
- Loop over all neighbor boxes (216): launch kernel from GPU (dynamically)
- Launch kernel from GPU (dynamically) for each M2L operator (device-side launches sketched below)

Drawbacks
- Large kernel launch latencies
- Kernels too small
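Scheme 1 relies on CUDA dynamic parallelism, i.e. kernels launched from other kernels on the device. The following is only a minimal illustration of that mechanism; kernel names, arguments, and launch configuration are hypothetical, and compute capability >= 3.5 plus compilation with -rdc=true and -lcudadevrt are required:

// Minimal illustration of device-side ("dynamic") kernel launches.
__global__ void m2l_single_operator(const double* omega, const double* op,
                                    double* mu, int p)
{
    // one M2L operation for one interaction partner (placeholder body)
}

__global__ void m2l_per_box(const double* omega, const double* const* ops,
                            double* mu, int p, int num_interactions)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < num_interactions) {
        // device-side launch of a small child kernel per M2L operator
        m2l_single_operator<<<1, (p + 1) * (p + 1)>>>(omega, ops[n], mu, p);
    }
}

Every device-side launch still carries a launch overhead, and with only on the order of p^2 threads per child kernel that overhead is hard to amortize, which is consistent with the drawbacks listed above.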
Slide 17
[Plot: depth 4, 4096 boxes; time [s] (log scale) vs. multipole order p; curves: full parallelization, sequential inner loop, dynamic 1]
Slide 18
Dynamic Scheme 2
- Loop over all tree levels (2–10) @ host CPU
- Loop over boxes on a certain level (64, 512, 4096, ...) @ host CPU
- Loop over all neighbor boxes (216): launch kernel from host
- Launch kernel from GPU (dynamically) for each M2L operator

Drawbacks
- Global memory access too slow
- Disadvantageous access pattern
Slide 19
[Plot: depth 4, 4096 boxes; time [s] (log scale) vs. multipole order p; curves: full parallelization, sequential inner loop, dynamic 1, dynamic 2]
Slide 20
Dynamic Scheme 3
- Loop over all tree levels (2–10) @ host CPU
- Loop over boxes on a certain level (64, 512, 4096, ...) @ host CPU
- Loop over all neighbor boxes (216): launch kernel from host
- Launch kernel from GPU (dynamically) for each M2L operator

Improvements
- Start only p^2 kernel threads
- Use shared GPU memory to cache ω and the operator M (see the sketch below)
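A sketch of what such a p^2-thread kernel with shared-memory staging could look like. The coefficient layouts (dense m = -l..l storage for ω and the operator, triangular storage for μ), the real-valued data, and the accumulation strategy are illustrative assumptions, not the presented implementation:

// Hypothetical dense layout: coefficient (l, m), m in [-l, l], stored at index l*l + l + m.
__device__ inline int dense_idx(int l, int m) { return l * l + l + m; }

// Launch example (per M2L operation):
//   m2l_shared<<<1, (p + 1) * (p + 2) / 2,
//                ((p + 1) * (p + 1) + (2 * p + 1) * (2 * p + 1)) * sizeof(double)>>>(...)
__global__ void m2l_shared(const double* __restrict__ omega_g,  // (p+1)^2 multipole coefficients
                           const double* __restrict__ op_g,     // (2p+1)^2 M2L operator entries
                           double* mu_g,                        // (p+1)(p+2)/2 Taylor coefficients
                           int p)
{
    extern __shared__ double smem[];               // dynamically sized shared memory
    double* s_omega = smem;                        // cache for omega
    double* s_op    = smem + (p + 1) * (p + 1);    // cache for the M2L operator

    const int n_omega = (p + 1) * (p + 1);
    const int n_op    = (2 * p + 1) * (2 * p + 1);

    // cooperative staging of omega and the operator into shared memory
    for (int i = threadIdx.x; i < n_omega; i += blockDim.x) s_omega[i] = omega_g[i];
    for (int i = threadIdx.x; i < n_op;    i += blockDim.x) s_op[i]    = op_g[i];
    __syncthreads();

    // one thread per target coefficient (l, m), 0 <= m <= l <= p
    const int t = threadIdx.x;
    if (t < (p + 1) * (p + 2) / 2) {
        int l = 0;                                  // invert t = l*(l+1)/2 + m
        while ((l + 1) * (l + 2) / 2 <= t) ++l;
        const int m = t - l * (l + 1) / 2;

        double mu_l_m = 0.0;
        for (int j = 0; j <= p; ++j)                // inner loops stay sequential per thread
            for (int k = -j; k <= j; ++k)
                mu_l_m += s_op[dense_idx(l + j, m + k)] * s_omega[dense_idx(j, k)];

        // if several launches accumulate into the same expansion concurrently,
        // this += must become an atomic update or a later reduction
        mu_g[t] += mu_l_m;
    }
}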
Slide 21
[Plot: depth 4, 4096 boxes; time [s] (log scale) vs. multipole order p; curves: full parallelization, sequential inner loop, dynamic 1, dynamic 2, dynamic 3]
Slide 22
Runtime distribution (N = 103000, p = 10, d = 4, Tesla K40):

P2P      39.61 %
P2M       3.95 %
M2M       2.50 %
M2L      48.24 %
L2L       2.18 %
Forces    3.52 %
Slide 23
Conclusions
- A fairly fast start was possible thanks to unified memory.
- Dynamic parallelization avoids subsequent index computations (for both developer and hardware).
- Dynamic parallelization achieves good performance, but only for some predefined parameter setups.
- Finding the best parallelization for every parameter setup is not trivial.
- One 'fits all' kernel is not possible; a hybrid approach is necessary.

Open question: for an O(p^3) instead of an O(p^4) operator, is there still a GPU parallelization benefit?
Slide 24
bartosz.kohnke@mpibpc.mpg.de
MPI BPC Göttingen & Jülich Supercomputing Centre