

  1. EXPOSING PARTICLE PARALLELISM IN THE XGC PIC CODE BY EXPLOITING GPU MEMORY HIERARCHY Stephen Abbott, March 26, 2018

  2. ACKNOWLEDGEMENTS Collaborators: Oak Ridge National Laboratory - Ed D’Azevedo; NVIDIA - Peng Wang; Princeton Plasma Physics Laboratory - Seung-Hoe Ku, Robert Hager, C.S. Chang; and many others in the XGC team! This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.

  3. OUTLINE
     1. A very brief review of the Particle-In-Cell cycle
     2. Overview of the XGC1 approach
     3. Optimized particle pushing with CUDA Fortran/C

  4. THE PIC CYCLE
     • SCATTER particle attributes to a mesh/grid
     • SOLVE for the fields on the mesh/grid
     • GATHER the fields to each particle
     • PUSH each particle

  5. THE XGC1 ALGORITHM
     XGC is a full-f gyrokinetic particle-in-cell/finite-element code that solves the 5D gyrokinetic equation on a field-following triangular mesh. A time-step is (minimally):
     1) Collect charge density to the mesh (SCATTER)
     2) Solve the GK Poisson equation (SOLVE)
     3) Compute E and its derivatives
     4) Calculate diagnostics
     5) Push particles (GATHER + PUSH)
     6) Calculate collisions and sources

  6. THE XGC1 ELECTRON PUSHER
     Fundamentals of the method:
     • Electrons are stepped O(40) times per ion step
     • 4th-order Runge-Kutta per step reduces numerical heating
     • Electron sub-cycling (PUSHE) is done without communication
     • Even after the initial CUDA Fortran port to GPUs, it remained a major cost
     • A deep call tree makes optimization difficult
     • The kernel is limited by memory access, not FLOPS
     To make the kernel faster: understand the data structures, how particles move across the grid, and the mapping to GPU memory regions.
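
     The structure described above (one long-lived kernel advancing each electron through O(40) RK4 sub-steps with no communication) might look roughly like the following CUDA C++ sketch. All names here are hypothetical illustrations, not the actual XGC1 routines, and particle storage is simplified; the SoA packing used in practice appears on slide 11.

     // Minimal sketch of a sub-cycled RK4 electron push: one thread per particle.
     struct Phase { double y[6]; };                 // 6 R/W phase variables per particle

     // Placeholder right-hand side; the real kernel gathers E-field data from the mesh here.
     __device__ Phase eval_rhs(const Phase& p) {
         Phase d;
         for (int i = 0; i < 6; ++i) d.y[i] = -p.y[(i + 3) % 6];   // dummy dynamics
         return d;
     }

     __device__ Phase axpy(const Phase& a, const Phase& b, double s) {
         Phase r;
         for (int i = 0; i < 6; ++i) r.y[i] = a.y[i] + s * b.y[i];
         return r;
     }

     __global__ void pushe_kernel(Phase* ptl, int nptl, int nsub, double dt) {
         int idx = blockIdx.x * blockDim.x + threadIdx.x;
         if (idx >= nptl) return;

         Phase p = ptl[idx];                        // one read per particle
         for (int s = 0; s < nsub; ++s) {           // electron sub-cycling, ~40 steps per ion step
             Phase k1 = eval_rhs(p);
             Phase k2 = eval_rhs(axpy(p, k1, 0.5 * dt));
             Phase k3 = eval_rhs(axpy(p, k2, 0.5 * dt));
             Phase k4 = eval_rhs(axpy(p, k3, dt));
             for (int i = 0; i < 6; ++i)
                 p.y[i] += dt / 6.0 * (k1.y[i] + 2.0 * k2.y[i] + 2.0 * k3.y[i] + k4.y[i]);
         }
         ptl[idx] = p;                              // one write-back per particle
     }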

  7. THE XGC1 ELECTRON PUSHER

  8. DATA STRUCTURES
     Unstructured mesh data: ρ / E
     • The ρ / E field lives on the field-following triangular mesh
     • O(10^6) triangles for an ITER mesh
     • Pre-computed node-to-node field following between O(30) toroidal φ-planes, with node numbers for lookup
     Particles
     • The only read-write data structures for electron sub-cycling
     • CPU storage is A.o.S.
     • Per particle: 6 R/W phase variables (8B each), 3 RO constants (8B each), 1 RO integer uid (8B)
     2D uniform grids: ψ(R,Z)
     • Equilibrium flux function on a uniform 2D grid, read in from an EQDSK file
     • Interpolation to particle positions using PSPLINE bi-cubic splines
     • Spline coefficients extracted for the GPU
     1D uniform grids: I(ψ)
     • Equilibrium current read in from the EQDSK file, on a uniform 1D grid in ψ space
     • Interpolate ψ at the particle position, then I(ψ), using PSPLINE
     • Spline coefficients extracted for the GPU
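
     Taking the per-particle sizes above at face value, the host-side A.o.S. record could be sketched as follows (field names are hypothetical; only the counts and 8-byte sizes come from the slide):

     // Hypothetical host-side Array-of-Structs particle record, 80 bytes per particle:
     // 6 read/write phase variables, 3 read-only constants, 1 read-only integer uid.
     struct ParticleAoS {
         double    phase[6];   // R/W phase-space coordinates, 8 B each
         double    consts[3];  // RO per-particle constants, 8 B each
         long long uid;        // RO particle identifier, 8 B
     };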

  9. THE XGC1 ELECTRON PUSHER

  10. IDEAL GLOBAL TRANSACTIONS
     Assumptions:
     • Best to read once and write once (or not at all)
     • Particle accesses are perfectly coalesced (128B lines)
     • 1.57 × 10^6 particles (a medium-sized test)
     Minimum necessary: 1.72 × 10^6 load transactions, 6.37 × 10^5 store transactions
     'Un-optimized' profile (already including algorithm improvements): 2.58 × 10^8 load transactions, 3.20 × 10^6 store transactions
     Initial profiling therefore shows ~100X the ideal number of loads and ~5X the ideal number of stores.

  11. PACKING AND UNPACKING PARTICLES
     Per-particle data is small, but there is a lot of it. Two important considerations:
     1. The default Array-of-Structs storage on the CPU is un-coalesced on GPUs.
     2. Particle data is used throughout the pushe call tree.
     Layout: AoS on the host, SoA in GPU global memory, packed into a thread-private struct (see the sketch below).
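
     A minimal sketch of this scheme, reusing the hypothetical ParticleAoS layout from above: the host AoS array is transposed into SoA arrays in GPU global memory so that neighbouring threads issue coalesced loads, and each thread then packs its own particle into a private struct for the duration of the push.

     // Structure-of-Arrays layout in GPU global memory: one contiguous device array per
     // field, so that thread i reading phase[k][i] is coalesced with its neighbours.
     struct ParticleSoA {
         double*    phase[6];
         double*    consts[3];
         long long* uid;
     };

     // Thread-private working copy used throughout the push.
     struct ParticlePriv {
         double    phase[6];
         double    consts[3];
         long long uid;
     };

     __device__ ParticlePriv load_particle(const ParticleSoA soa, int i) {
         ParticlePriv p;
         for (int k = 0; k < 6; ++k) p.phase[k]  = soa.phase[k][i];   // coalesced reads
         for (int k = 0; k < 3; ++k) p.consts[k] = soa.consts[k][i];
         p.uid = soa.uid[i];
         return p;
     }

     __device__ void store_particle(ParticleSoA soa, int i, const ParticlePriv& p) {
         for (int k = 0; k < 6; ++k) soa.phase[k][i] = p.phase[k];    // only R/W fields are written back
     }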

  12. UNIFORM GRIDS
     The K20X has a 48KB texture/constant cache with specialized access paths.
     • CUDA Fortran: "texture, pointer" declarations; intent(in) plus compiler analysis
     • CUDA C/C++: texture objects and references, __ldg, const __restrict__
     1D grids can be bound to linear memory in CUDA Fortran.
     2D grids benefit by >15% from proper CUDA textures, but the hardware filtering is too low precision.
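
     On the CUDA C/C++ side, routing read-only spline coefficients through the texture/read-only path can be as simple as qualifying the pointer const __restrict__, or calling __ldg explicitly. A sketch with hypothetical names and a hypothetical 4-coefficients-per-interval layout:

     // 1D equilibrium-profile lookup from pre-extracted cubic-spline coefficients.
     // const __restrict__ lets the compiler issue these loads through the read-only
     // (texture) cache; __ldg() requests that path explicitly (cc >= 3.5, e.g. K20X).
     __device__ double eval_spline_1d(const double* __restrict__ coef,  // 4 coefficients per interval
                                      double x, double x0, double dx, int n)
     {
         int j = min(max(int((x - x0) / dx), 0), n - 2);   // clamp to a valid interval
         double t = x - (x0 + j * dx);                      // local coordinate within the interval

         double c0 = __ldg(&coef[4 * j + 0]);
         double c1 = __ldg(&coef[4 * j + 1]);
         double c2 = __ldg(&coef[4 * j + 2]);
         double c3 = __ldg(&coef[4 * j + 3]);
         return c0 + t * (c1 + t * (c2 + t * c3));
     }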

  13. UNSTRUCTURED MESH DATA
     The electric field array is large:
     • 3 components
     • 2 directions (along the B field)
     • N_phi (4-32) planes (toroidal discretization)
     • N_nodes (50K - 1M mesh nodes)
     It is touched non-uniformly 4x per particle: extreme register, L1/L2, and DRAM pressure! It can also over-run the 2^27-element limit on linear textures.
     Solution: the array must be decomposed. Basic splitting works for now but is not sustainable for larger grids, and the access pattern must still be optimized (bonus material).
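
     For scale, a quick worked example with the largest values quoted above: 3 × 2 × 32 × 10^6 ≈ 1.9 × 10^8 elements, already well past the 2^27 ≈ 1.3 × 10^8 element limit for linear textures, which is why the array has to be split.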

  14. PARTICLE SORTING AND SEARCHING
     Coalesced loads and cache reuse are fundamental for good performance, and frequent sorting helps. Sorting to a coarse Cartesian grid ignores the true dimensionality of the underlying data structure (the triangular mesh); instead, sort by the last known triangle number. More keys mean a slower O(n+k) count sort, but the extra cost is worth it.
     Similarly, particles move slowly enough per time-step that it is worth checking the last known triangle before conducting a lookup-table search (see the sketch below).
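
     A sketch of the "check the last known triangle first" idea, with hypothetical names and a plain barycentric sign test standing in for XGC1's field-following mesh tables: only fall back to the full search when the particle has actually left its stored triangle.

     // Hypothetical triangle record: vertex coordinates of one mesh triangle.
     struct Tri { double x[3], y[3]; };

     // True if point (px, py) lies inside triangle t (same-sign cross-product test).
     __device__ bool in_triangle(const Tri& t, double px, double py)
     {
         double d1 = (px - t.x[1]) * (t.y[0] - t.y[1]) - (t.x[0] - t.x[1]) * (py - t.y[1]);
         double d2 = (px - t.x[2]) * (t.y[1] - t.y[2]) - (t.x[1] - t.x[2]) * (py - t.y[2]);
         double d3 = (px - t.x[0]) * (t.y[2] - t.y[0]) - (t.x[2] - t.x[0]) * (py - t.y[0]);
         bool has_neg = (d1 < 0.0) || (d2 < 0.0) || (d3 < 0.0);
         bool has_pos = (d1 > 0.0) || (d2 > 0.0) || (d3 > 0.0);
         return !(has_neg && has_pos);
     }

     // Placeholder for the real lookup-table search: brute force over all triangles.
     __device__ int full_search(const Tri* __restrict__ tris, int ntri, double px, double py)
     {
         for (int i = 0; i < ntri; ++i)
             if (in_triangle(tris[i], px, py)) return i;
         return -1;
     }

     __device__ int locate(const Tri* __restrict__ tris, int ntri, int last_tri,
                           double px, double py)
     {
         if (in_triangle(tris[last_tri], px, py))
             return last_tri;                         // cheap common case: particle stayed put
         return full_search(tris, ntri, px, py);      // otherwise do the expensive search
     }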

  15. PERFORMANCE IMPACT
                                  3mm ITER        12mm ITER
     Original                     38s   (100%)    23s   (100%)
     Ptl mapping                  28s   (73%)     16s   (70%)
     Tr. check                    15s   (39%)     12s   (52%)
     Triangle sort                13s   (34%)     11s   (48%)
     2D textures                  9.2s  (24%)     7.7s  (33%)
     Mesh tex.                    7.8s  (21%)     7.2s  (31%)
     Compiler options             7.7s  (20%)     7.0s  (30%)
     Current best with OpenACC    14s   (37%)     12s   (52%)
     3mm ITER: 40 cycles of 1.57M particles on K20X. 12mm ITER: 40 cycles of 4.9M particles on K20X.

  16. SUMMARY
     • Reducing the impact of mesh and cell data is critical to performance in XGC
     • CUDA Fortran has most of the tools needed; CUDA C/C++ has the rest
     • OpenACC compilers can get close, but not close enough
     • Testing is necessary, and it alleviates the portability impact

  17. BONUS: COMPILER OPTIMIZATION FLAGS FOR PGI
     For CUDA Fortran, with -Mcuda=...:
     • 7.5,cc35 (on K20X), 8.0,cc60 (on Pascal), 9.1,cc70 (on Volta) to match the CUDA toolkit and compute capability of the target GPU
     • madconst to put module array descriptors in constant memory (cmem)
     • maxrregcount=N to restrict register usage
     • ptxinfo,keepgpu,keepptx to report on and keep the generated GPU/PTX source
     For OpenACC, with -acc -ta=tesla:... the same applies, but there is no madconst.
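
     Put together, one possible invocation (a sketch only: it assumes the PGI pgfortran driver on a K20X system, takes the sub-option spellings above as given, and uses a hypothetical source file name; N is the chosen per-thread register limit):

     pgfortran -Mcuda=7.5,cc35,madconst,maxrregcount=N,ptxinfo,keepgpu,keepptx pushe.F90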

  18. USING OPTIMIZED CUDA WITH OPENACC
     Texture cache usage for 1.58 × 10^6 particles, comparing OpenACC (intent(in) plus !$acc copyin(..) to hint __ldg) with CUDA Fortran (explicit "texture, pointer" / __ldg):
     • Un-cached global loads: 2.0 × 10^6 / 2.5 × 10^8 (6.43 × 10^7)
     • Texture cache hit rate: 76% / 99.9% (73%)
     • L2 texture reads: 1.1 × 10^8 / 500 (9.1 × 10^7)
     • L2 texture read hit rate: 94% / 99.6% (93%)
     Why wrap hand-optimized CUDA instead of relying on the compiler alone?
     • Deep call trees are hard for the compiler to tune
     • Value safety over performance
     • Developers have finite patience
     The OpenACC code binds its host-side routine to a device-only wrapper around the CUDA C implementation:

     subroutine cuda_bi2(..)            ! wrapper for CUDA C
       !$acc routine seq nohost
     end subroutine cuda_bi2

     subroutine bicub_interpol2(..)     ! host-side routine
       !$acc routine seq bind(cuda_bi2)
     end subroutine bicub_interpol2
