SLIDE 1

Accelerating a Spectral Algorithm for Plasma Physics with Python/Numba on GPU

FBPIC: A spectral, quasi-3D, GPU-accelerated Particle-In-Cell code

Manuel Kirchen
Center for Free-Electron Laser Science, University of Hamburg, Germany
manuel.kirchen@desy.de

Rémi Lehe
BELLA Center & Center for Beam Physics, LBNL, USA
rlehe@lbl.gov

SLIDE 2


Content

  • Introduction to Plasma Accelerators
  • Modelling Plasma Physics with Particle-In-Cell Simulations
  • A Spectral, Quasi-3D PIC Code (FBPIC)
  • Two-Level Parallelization Concept
  • GPU Acceleration with Numba
  • Implementation & Performance
  • Summary
SLIDE 3


Introduction to Plasma Accelerators

[Figure: basic principle of Laser Wakefield Acceleration — laser pulse, plasma wakefield (plasma period: 10 - 100 µm) and a few-fs electron bunch. Wake-analogy image taken from: http://features.boats.com/boat-content/files/2013/07/centurion-elite.jpg]

  • cm-scale plasma target (ionized gas)
  • A laser pulse or electron beam drives the wake
  • Length scale of the accelerating structure: the plasma wavelength (µm scale)
  • Charge separation induces strong electric fields (~100 GV/m)

Shrinks the accelerating distance from the km to the mm scale (several orders of magnitude), at ultra-short timescales (few fs)

The laser wake is formed by electrons oscillating against the static background of heavy ions.
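
These scales follow directly from the plasma density. As a quick sanity check (not part of the original slide), a few lines of Python reproduce the numbers above for a typical density of n_e = 10^18 cm^-3:

```python
# Plasma wavelength and cold wave-breaking field for n_e = 1e18 cm^-3
# (a back-of-the-envelope sketch; constants in SI units).
import math

e    = 1.602176634e-19   # elementary charge [C]
m_e  = 9.1093837015e-31  # electron mass [kg]
eps0 = 8.8541878128e-12  # vacuum permittivity [F/m]
c    = 2.99792458e8      # speed of light [m/s]

n_e = 1e18 * 1e6         # 1e18 cm^-3, converted to m^-3

omega_p  = math.sqrt(n_e * e**2 / (eps0 * m_e))  # plasma frequency [rad/s]
lambda_p = 2 * math.pi * c / omega_p             # plasma wavelength [m]
E_wb     = m_e * c * omega_p / e                 # wave-breaking field [V/m]

print(f"lambda_p ~ {lambda_p * 1e6:.0f} um")     # ~33 um  (10-100 um scale)
print(f"E_wb     ~ {E_wb * 1e-9:.0f} GV/m")      # ~96 GV/m (~100 GV/m)
```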
SLIDE 4


Modelling Plasma Physics with Particle-In-Cell Simulations

[Figure: PIC cycle — simulation box divided into cells of size ∆x; macroparticles deposit charge and current onto the grid and gather fields from it.]

PIC Cycle

  • Charge/Current deposition on grid nodes
  • Fields are calculated ➔ Maxwell equations
  • Fields are gathered onto particles
  • Particles are pushed ➔ Lorentz equation
Grid (cell size ∆x)

  • Fields live on the discrete grid
  • Macroparticles interact with the fields
Millions of cells, particles and iterations!
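
To make the cycle concrete, here is a minimal runnable sketch, reduced to a 1D electrostatic model for brevity (FBPIC itself is electromagnetic and quasi-3D; all names here are illustrative):

```python
import numpy as np

# grid and macroparticles (normalized units)
n_cells, n_part, n_steps = 64, 10_000, 100
L, dt = 1.0, 0.05
dx = L / n_cells
rng = np.random.default_rng(0)
x = rng.random(n_part) * L               # macroparticle positions
v = rng.normal(0.0, 0.01, n_part)        # macroparticle velocities
q_over_m = -1.0
weight = L / n_part                      # charge per macroparticle

k = 2.0 * np.pi * np.fft.fftfreq(n_cells, dx)

for step in range(n_steps):
    # 1. Charge deposition: linear (cloud-in-cell) weighting onto grid nodes
    cell = np.floor(x / dx).astype(int)
    frac = x / dx - cell
    rho = np.zeros(n_cells)
    np.add.at(rho, cell, weight / dx * (1.0 - frac))
    np.add.at(rho, (cell + 1) % n_cells, weight / dx * frac)
    # 2. Field solve in Fourier space: ik E_k = rho_k (Gauss's law)
    rho_k = np.fft.fft(rho - rho.mean())
    E_k = np.zeros_like(rho_k)
    E_k[1:] = rho_k[1:] / (1j * k[1:])
    E_grid = np.fft.ifft(E_k).real
    # 3. Field gathering: interpolate E from the grid to the particles
    E_p = E_grid[cell] * (1.0 - frac) + E_grid[(cell + 1) % n_cells] * frac
    # 4. Particle push: update velocities and positions
    v += q_over_m * E_p * dt
    x = (x + v * dt) % L
```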
SLIDE 5


Productivity of a (Computational) Physicist

[Cartoon: productivity (as a physicist) over time — "Simulations take too long!" ➞ develop a novel algorithm + efficient parallelization… ➞ fast simulations, physical insights! Python/Numba helped us speed up this process.]

Our goal: a reasonably fast & accurate code with many features and a user-friendly interface

SLIDE 6


A Spectral, Quasi-3D PIC Code


PIC simulations in 3D are essential, but computationally demanding. The majority of codes are based on finite-difference algorithms, which introduce numerical artefacts.

Quasi-cylindrical symmetry

  • Captures important 3D effects (Lifschitz et al., 2009)
  • Computational cost similar to a 2D code

Spectral solvers

  • Correct evolution of electromagnetic waves: PSATD algorithm (Haber et al., 1973)
  • Fewer numerical artefacts

Combine best of both worlds ➞ Spectral & quasi-cylindrical algorithm
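
As a toy illustration of why spectral solvers get the wave evolution right (this is not the actual PSATD update used in FBPIC), a pulse can be advanced in Fourier space by an exact per-mode phase rotation, so it propagates without numerical dispersion:

```python
import numpy as np

n, dx, c, dt = 256, 1.0, 1.0, 0.5
z = np.arange(n) * dx
E = np.exp(-((z - n * dx / 2) / (8 * dx)) ** 2)   # Gaussian pulse

k = 2 * np.pi * np.fft.fftfreq(n, dx)
E_k = np.fft.fft(E)
for step in range(400):
    E_k *= np.exp(-1j * c * k * dt)   # exact phase advance: no dispersion
E = np.fft.ifft(E_k).real             # pulse has moved, shape unchanged
```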

SLIDE 7


A Spectral, Quasi-3D PIC Code


FBPIC (Fourier-Bessel Particle-In-Cell)

(R. Lehe et al., 2016)
  • Written entirely in Python; uses Numba just-in-time compilation
  • Single-core only so far, and not easy to parallelize due to global operations (FFT and DHT)
Algorithm developed by Rémi Lehe
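
For illustration, a generic Numba-compiled particle loop (not an excerpt from FBPIC) looks like this:

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on first call
def push_x(x, ux, inv_gamma, c, dt):
    # explicit loop: fast under Numba, would be slow in pure Python
    for i in range(x.shape[0]):
        x[i] += c * dt * inv_gamma[i] * ux[i]

N = 1_000_000
ux = np.random.randn(N)
inv_gamma = 1.0 / np.sqrt(1.0 + ux**2)
x = np.zeros(N)
push_x(x, ux, inv_gamma, 1.0, 1e-3)
```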
SLIDE 8


Parallelization Approach for Spectral PIC Algorithms


Spectral Transformations

Not easy to parallelize by domain decomposition, due to FFT & DHT.

Three options (shown schematically on the slide):

  • Global spectral transformations: arbitrary accuracy, but global communication
  • Standard (FDTD) domain decomposition: local communication & exchange, but low accuracy
  • Local transformations & domain decomposition: local exchange, high accuracy

➞ Local parallelization of the global operations & global domain decomposition

SLIDE 9


Parallelization Concept

[Figure: typical HPC infrastructure — a cluster of nodes connected by a local area network; each node holds RAM, CPU and a GPU with its own device memory.]

Shared and distributed memory layouts ➞ two-level parallelization, entirely with Python

Intra-node parallelization

  • Shared memory layout
  • GPU (or multi-core CPU)
  • Parallel PIC methods & transformations
  • Numba + CUDA

Inter-node parallelization

  • Distributed memory layout
  • Multi-CPU / Multi-GPU
  • Spatial domain decomposition for spectral codes (Vay et al., 2013)
  • mpi4py (a minimal sketch follows below)
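
A minimal sketch of how the two levels fit together (hypothetical setup code, assuming one MPI rank per GPU):

```python
from mpi4py import MPI
from numba import cuda

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Inter-node level: one MPI rank per spatial subdomain, one GPU per rank
cuda.select_device(rank % len(cuda.gpus))

# Intra-node level: the PIC methods on this subdomain run as CUDA kernels
# (see the following slides); between time steps, neighboring ranks exchange
# only their guard-region data via point-to-point MPI (sketched on slide 15).
```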
SLIDE 10


Intra-Node Parallelization of PIC Methods


Particles

  • Particle push: each thread updates one particle
  • Field gathering: some threads read the same field value
  • Field deposition: some threads write the same field value ➞ race conditions!

Fields

  • Field push and current correction: each thread updates one grid value
  • Transformations: use optimized parallel algorithms

Intra-node parallelization ➞ CUDA with Numba
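
One simple way to make such concurrent writes safe is an atomic add; FBPIC itself uses the sorting-based deposition shown on the next slide. A 1D sketch with illustrative names:

```python
import numpy as np
from numba import cuda

@cuda.jit
def deposit_atomic(x, w, inv_dx, rho):
    i = cuda.grid(1)                    # one thread per particle
    if i < x.shape[0]:
        iz = int(x[i] * inv_dx)         # cell containing particle i
        cuda.atomic.add(rho, iz, w[i])  # concurrent writes made safe

n_part, n_cells = 100_000, 64
x   = cuda.to_device(np.random.rand(n_part))   # positions in [0, 1)
w   = cuda.to_device(np.ones(n_part))          # particle weights
rho = cuda.to_device(np.zeros(n_cells))
deposit_atomic.forall(n_part)(x, w, float(n_cells), rho)
```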

SLIDE 11


CUDA Implementation with Numba


Fields

  • Transformations ➞ CUDA libraries
  • Field push & current correction: per-cell

Particles

  • Field gathering and particle push: per-particle
  • Field deposition ➞ particles are sorted, and each thread loops over the particles in its cell (sketched below)
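
A sketch of this sorting-based deposition (simplified to 1D, with illustrative names; the sorting itself, e.g. with RadixSort, is assumed to have happened already):

```python
import numpy as np
from numba import cuda

@cuda.jit
def deposit_sorted(sorted_w, cell_start, cell_end, rho):
    ic = cuda.grid(1)                      # one thread per *cell*
    if ic < rho.shape[0]:
        s = 0.0
        # particles are pre-sorted by cell index, so the particles of cell
        # ic occupy the contiguous range [cell_start[ic], cell_end[ic])
        for ip in range(cell_start[ic], cell_end[ic]):
            s += sorted_w[ip]
        rho[ic] = s                        # no other thread writes rho[ic]

n_cells, per_cell = 64, 128                # toy setup: equal-sized cells
w     = cuda.to_device(np.ones(n_cells * per_cell))
start = cuda.to_device(np.arange(n_cells) * per_cell)
end   = cuda.to_device((np.arange(n_cells) + 1) * per_cell)
rho   = cuda.to_device(np.zeros(n_cells))
deposit_sorted.forall(n_cells)(w, start, end, rho)
```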

SLIDE 12


CUDA Implementation with Numba

[Code listing: a simple CUDA kernel in FBPIC]
  • Simple interface for writing CUDA kernels
  • Made use of cuBLAS, cuFFT, RadixSort
  • Manual memory management: data is kept on the GPU and only copied to the CPU for I/O

  • Almost full control over CUDA API
  • Ported code to GPU in less than 3 weeks
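
The original slide shows one of FBPIC's kernels; as a stand-in, the following sketch illustrates the same Numba kernel interface and the manual memory management (illustrative names, not FBPIC code):

```python
import numpy as np
from numba import cuda

@cuda.jit
def push_p(ux, Ez, q_over_m_dt):
    i = cuda.grid(1)
    if i < ux.shape[0]:
        ux[i] += q_over_m_dt * Ez[i]       # per-particle momentum update

# manual memory management: arrays live on the GPU across all time steps...
d_ux = cuda.to_device(np.zeros(1_000_000))
d_Ez = cuda.to_device(np.ones(1_000_000))
threads = 256
blocks = (d_ux.shape[0] + threads - 1) // threads
for step in range(100):
    push_p[blocks, threads](d_ux, d_Ez, 1e-3)
ux = d_ux.copy_to_host()                   # ...copied back only for I/O
```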
SLIDE 13


Single-GPU Performance Results

Speed-up on different Nvidia GPUs (vs. single-core Intel Xeon E5-2650 v2):

  Intel Xeon E5-2650 v2 (single-core)    1
  Nvidia M2070                          26
  Nvidia K20m                           67
  Nvidia K20x                           77

Runtime distribution of the GPU PIC methods:

  Particle push      29.0%
  Field deposition    7.9%
  Field gathering    14.0%
  Particle sort       8.3%
  Field push          6.8%
  FFT                14.0%
  DHT                20.0%

Speed-up of up to ~70× compared to the single-core CPU version; ~20 ns per particle per step.
SLIDE 14


Parallelization of FBPIC


Which parallelization for FBPIC?

  • PSATD with global transformations: global communication ➞ ✘
  • Standard (FDTD-style) domain decomposition: local communication & exchange, but limited/low accuracy ➞ ✔ (used here, with large guard regions; see next slides)
  • Local transformations & domain decomposition: local exchange, high accuracy ➞ ? (work in progress)
SLIDE 15


Inter-Node Parallelization


Spatial domain decomposition

  • Split the work by spatial decomposition
  • Domains are computed in parallel
  • Local information is exchanged at the boundaries
  • The order of accuracy defines the guard region size (large guard regions for quasi-spectral accuracy)

[Figure: concept of domain decomposition in the longitudinal direction — processes 0-3 each compute one domain; neighboring domains share overlapping guard regions for the local field and particle exchange.]
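
A minimal mpi4py sketch of such a guard-region exchange for a field array (hypothetical names and sizes; FBPIC's actual exchange also transfers particles):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size   # periodic for simplicity

n_guard, nz_local, nr = 64, 512, 512                 # illustrative sizes
E = np.zeros((nz_local + 2 * n_guard, nr))           # local domain + guards

# send own edge cells to each neighbor, receive into own guard regions
comm.Sendrecv(np.ascontiguousarray(E[n_guard:2 * n_guard]), dest=left,
              recvbuf=E[-n_guard:], source=right)
comm.Sendrecv(np.ascontiguousarray(E[-2 * n_guard:-n_guard]), dest=right,
              recvbuf=E[:n_guard], source=left)
```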
SLIDE 16


Scaling of the MPI version of FBPIC

[Plot: GPU scaling of FBPIC — strong scaling on the JURECA supercomputer (Nvidia K80), preliminary results (not optimized). Speed-up (1-32) vs. number of GPUs (4-128) for 16384×512 cells with 64 guard cells per domain; at the largest GPU counts the guard region size approaches the local domain size.]

For productive and fast simulations, 4-32 GPUs are more than enough!

Best strategy for our case: extensive intra-node parallelization on the GPU, and only a few inter-node domains.
SLIDE 17


Summary

  • Motivation: efficient and easy parallelization of a novel PIC algorithm, combining speed, accuracy and usability in order to work productively as a physicist

  • FBPIC is entirely written in Python (easy to develop and maintain the code)
  • Implementation uses Numba (JIT compilation and interface for writing CUDA-Python)
  • Intra- and Inter-node parallelization approach suitable for spectral algorithms
  • Single GPU well suited for global operations (FFT & DHT)
  • Enabling CUDA support for the full code took less than 3 weeks
  • Multi-GPU parallelization by spatial domain decomposition with mpi4py
  • Outlook: finalize multi-GPU support, CUDA streams, GPU Direct, open-sourcing of FBPIC
SLIDE 18

Thanks… Questions?

Funding contributed by BMBF FSP302.

Thanks to the group of Jens Osterhoff, the group of Brian McNeil, the group of Johannes Bahrdt, LBNL and the WARP code, and the JURECA supercomputer.

Special thanks to Rémi Lehe.