SLIDE 1

Accelerating a Spectral Algorithm for Plasma Physics with Python/Numba on GPU

FBPIC: A spectral, quasi-3D, GPU-accelerated Particle-In-Cell code

Manuel Kirchen
Center for Free-Electron Laser Science, University of Hamburg, Germany
manuel.kirchen@desy.de

Rémi Lehe
BELLA Center & Center for Beam Physics, LBNL, USA
rlehe@lbl.gov

SLIDE 2


Content

  • Introduction to Plasma Accelerators
  • Modelling Plasma Physics with Particle-In-Cell Simulations
  • A Spectral, Quasi-3D PIC Code (FBPIC)
  • Two-Level Parallelization Concept
  • GPU Acceleration with Numba
  • Implementation & Performance
  • Summary
SLIDE 3


Introduction to Plasma Accelerators

[Figure: basic principle of Laser Wakefield Acceleration — laser pulse, plasma wakefield (plasma period: 10 - 100 µm) and a few-fs electron bunch. Wake-analogy image taken from: http://features.boats.com/boat-content/files/2013/07/centurion-elite.jpg]

  • cm-scale plasma target (ionized gas)
  • A laser pulse or electron beam drives the wake
  • Length scale of the accelerating structure: the plasma wavelength (µm scale)
  • Charge separation induces strong electric fields (~100 GV/m)

Shrinks the accelerating distance from the km to the mm scale (several orders of magnitude), at ultra-short timescales (few fs)

The laser wake is formed by electrons oscillating against the static background of heavy ions.
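
These scales follow directly from the plasma density. As a quick sanity check (not part of the original slide), a few lines of Python reproduce the numbers above for a typical density of n_e = 10^18 cm^-3:

```python
# Plasma wavelength and cold wave-breaking field for n_e = 1e18 cm^-3
# (a back-of-the-envelope sketch; constants in SI units).
import math

e    = 1.602176634e-19   # elementary charge [C]
m_e  = 9.1093837015e-31  # electron mass [kg]
eps0 = 8.8541878128e-12  # vacuum permittivity [F/m]
c    = 2.99792458e8      # speed of light [m/s]

n_e = 1e18 * 1e6         # 1e18 cm^-3, converted to m^-3

omega_p  = math.sqrt(n_e * e**2 / (eps0 * m_e))  # plasma frequency [rad/s]
lambda_p = 2 * math.pi * c / omega_p             # plasma wavelength [m]
E_wb     = m_e * c * omega_p / e                 # wave-breaking field [V/m]

print(f"lambda_p ~ {lambda_p * 1e6:.0f} um")     # ~33 um  (10-100 um scale)
print(f"E_wb     ~ {E_wb * 1e-9:.0f} GV/m")      # ~96 GV/m (~100 GV/m)
```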
SLIDE 4


Modelling Plasma Physics with Particle-In-Cell Simulations

[Figure: PIC cycle — simulation box divided into cells of size ∆x; macroparticles deposit charge and current onto the grid and gather fields from it.]

PIC Cycle

  • Charge/Current deposition on grid nodes
  • Fields are calculated ➔ Maxwell equations
  • Fields are gathered onto particles
  • Particles are pushed ➔ Lorentz equation
Grid (cell size ∆x)

  • Fields live on the discrete grid
  • Macroparticles interact with the fields
Millions of cells, particles and iterations!
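
To make the cycle concrete, here is a minimal runnable sketch, reduced to a 1D electrostatic model for brevity (FBPIC itself is electromagnetic and quasi-3D; all names here are illustrative):

```python
import numpy as np

# grid and macroparticles (normalized units)
n_cells, n_part, n_steps = 64, 10_000, 100
L, dt = 1.0, 0.05
dx = L / n_cells
rng = np.random.default_rng(0)
x = rng.random(n_part) * L               # macroparticle positions
v = rng.normal(0.0, 0.01, n_part)        # macroparticle velocities
q_over_m = -1.0
weight = L / n_part                      # charge per macroparticle

k = 2.0 * np.pi * np.fft.fftfreq(n_cells, dx)

for step in range(n_steps):
    # 1. Charge deposition: linear (cloud-in-cell) weighting onto grid nodes
    cell = np.floor(x / dx).astype(int)
    frac = x / dx - cell
    rho = np.zeros(n_cells)
    np.add.at(rho, cell, weight / dx * (1.0 - frac))
    np.add.at(rho, (cell + 1) % n_cells, weight / dx * frac)
    # 2. Field solve in Fourier space: ik E_k = rho_k (Gauss's law)
    rho_k = np.fft.fft(rho - rho.mean())
    E_k = np.zeros_like(rho_k)
    E_k[1:] = rho_k[1:] / (1j * k[1:])
    E_grid = np.fft.ifft(E_k).real
    # 3. Field gathering: interpolate E from the grid to the particles
    E_p = E_grid[cell] * (1.0 - frac) + E_grid[(cell + 1) % n_cells] * frac
    # 4. Particle push: update velocities and positions
    v += q_over_m * E_p * dt
    x = (x + v * dt) % L
```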
SLIDE 5


Productivity of a (Computational) Physicist

[Cartoon: productivity (as a physicist) over time — "Simulations take too long!" ➞ develop a novel algorithm + efficient parallelization… ➞ fast simulations, physical insights! Python/Numba helped us speed up this process.]

Our goal: a reasonably fast & accurate code with many features and a user-friendly interface

SLIDE 6


A Spectral, Quasi-3D PIC Code


PIC simulations in 3D are essential, but computationally demanding. The majority of codes are based on finite-difference algorithms, which introduce numerical artefacts.

Quasi-cylindrical symmetry

  • Captures important 3D effects (Lifschitz et al., 2009)
  • Computational cost similar to a 2D code

Spectral solvers

  • Correct evolution of electromagnetic waves: PSATD algorithm (Haber et al., 1973)
  • Fewer numerical artefacts

Combine best of both worlds ➞ Spectral & quasi-cylindrical algorithm
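
As a toy illustration of why spectral solvers get the wave evolution right (this is not the actual PSATD update used in FBPIC), a pulse can be advanced in Fourier space by an exact per-mode phase rotation, so it propagates without numerical dispersion:

```python
import numpy as np

n, dx, c, dt = 256, 1.0, 1.0, 0.5
z = np.arange(n) * dx
E = np.exp(-((z - n * dx / 2) / (8 * dx)) ** 2)   # Gaussian pulse

k = 2 * np.pi * np.fft.fftfreq(n, dx)
E_k = np.fft.fft(E)
for step in range(400):
    E_k *= np.exp(-1j * c * k * dt)   # exact phase advance: no dispersion
E = np.fft.ifft(E_k).real             # pulse has moved, shape unchanged
```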

SLIDE 7


A Spectral, Quasi-3D PIC Code


FBPIC (Fourier-Bessel Particle-In-Cell)

(R. Lehe et al., 2016)
  • Written entirely in Python; uses Numba just-in-time compilation
  • Single-core only so far, and not easy to parallelize due to global operations (FFT and DHT)
Algorithm developed by Rémi Lehe
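
For illustration, a generic Numba-compiled particle loop (not an excerpt from FBPIC) looks like this:

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code on first call
def push_x(x, ux, inv_gamma, c, dt):
    # explicit loop: fast under Numba, would be slow in pure Python
    for i in range(x.shape[0]):
        x[i] += c * dt * inv_gamma[i] * ux[i]

N = 1_000_000
ux = np.random.randn(N)
inv_gamma = 1.0 / np.sqrt(1.0 + ux**2)
x = np.zeros(N)
push_x(x, ux, inv_gamma, 1.0, 1e-3)
```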
SLIDE 8


Parallelization Approach for Spectral PIC Algorithms


Spectral Transformations

Not easy to parallelize by domain decomposition, due to FFT & DHT.

Three options (shown schematically on the slide):

  • Global spectral transformations: arbitrary accuracy, but global communication
  • Standard (FDTD) domain decomposition: local communication & exchange, but low accuracy
  • Local transformations & domain decomposition: local exchange, high accuracy

➞ Local parallelization of the global operations & global domain decomposition

SLIDE 9


Parallelization Concept

[Figure: typical HPC infrastructure — a cluster of nodes connected by a local area network; each node holds RAM, CPU and a GPU with its own device memory.]

Shared and distributed memory layouts ➞ two-level parallelization, entirely with Python

Intra-node parallelization

  • Shared memory layout
  • GPU (or multi-core CPU)
  • Parallel PIC methods & transformations
  • Numba + CUDA

Inter-node parallelization

  • Distributed memory layout
  • Multi-CPU / Multi-GPU
  • Spatial domain decomposition for spectral codes (Vay et al., 2013)
  • mpi4py (a minimal sketch follows below)
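
A minimal sketch of how the two levels fit together (hypothetical setup code, assuming one MPI rank per GPU):

```python
from mpi4py import MPI
from numba import cuda

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Inter-node level: one MPI rank per spatial subdomain, one GPU per rank
cuda.select_device(rank % len(cuda.gpus))

# Intra-node level: the PIC methods on this subdomain run as CUDA kernels
# (see the following slides); between time steps, neighboring ranks exchange
# only their guard-region data via point-to-point MPI (sketched on slide 15).
```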
SLIDE 10


Intra-Node Parallelization of PIC Methods


Particles

  • Particle push: each thread updates one particle
  • Field gathering: some threads read the same field value
  • Field deposition: some threads write the same field value ➞ race conditions!

Fields

  • Field push and current correction: each thread updates one grid value
  • Transformations: use optimized parallel algorithms

Intra-node parallelization ➞ CUDA with Numba
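
One simple way to make such concurrent writes safe is an atomic add; FBPIC itself uses the sorting-based deposition shown on the next slide. A 1D sketch with illustrative names:

```python
import numpy as np
from numba import cuda

@cuda.jit
def deposit_atomic(x, w, inv_dx, rho):
    i = cuda.grid(1)                    # one thread per particle
    if i < x.shape[0]:
        iz = int(x[i] * inv_dx)         # cell containing particle i
        cuda.atomic.add(rho, iz, w[i])  # concurrent writes made safe

n_part, n_cells = 100_000, 64
x   = cuda.to_device(np.random.rand(n_part))   # positions in [0, 1)
w   = cuda.to_device(np.ones(n_part))          # particle weights
rho = cuda.to_device(np.zeros(n_cells))
deposit_atomic.forall(n_part)(x, w, float(n_cells), rho)
```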

SLIDE 11


CUDA Implementation with Numba


Fields

  • Transformations ➞ CUDA libraries
  • Field push & current correction: per-cell

Particles

  • Field gathering and particle push: per-particle
  • Field deposition ➞ particles are sorted, and each thread loops over the particles in its cell (sketched below)
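
A sketch of this sorting-based deposition (simplified to 1D, with illustrative names; the sorting itself, e.g. with RadixSort, is assumed to have happened already):

```python
import numpy as np
from numba import cuda

@cuda.jit
def deposit_sorted(sorted_w, cell_start, cell_end, rho):
    ic = cuda.grid(1)                      # one thread per *cell*
    if ic < rho.shape[0]:
        s = 0.0
        # particles are pre-sorted by cell index, so the particles of cell
        # ic occupy the contiguous range [cell_start[ic], cell_end[ic])
        for ip in range(cell_start[ic], cell_end[ic]):
            s += sorted_w[ip]
        rho[ic] = s                        # no other thread writes rho[ic]

n_cells, per_cell = 64, 128                # toy setup: equal-sized cells
w     = cuda.to_device(np.ones(n_cells * per_cell))
start = cuda.to_device(np.arange(n_cells) * per_cell)
end   = cuda.to_device((np.arange(n_cells) + 1) * per_cell)
rho   = cuda.to_device(np.zeros(n_cells))
deposit_sorted.forall(n_cells)(w, start, end, rho)
```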

SLIDE 12


CUDA Implementation with Numba

[Code listing: a simple CUDA kernel in FBPIC]
  • Simple interface for writing CUDA kernels
  • Made use of cuBLAS, cuFFT, RadixSort
  • Manual memory management: data is kept on the GPU and only copied to the CPU for I/O

  • Almost full control over CUDA API
  • Ported code to GPU in less than 3 weeks
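
The original slide shows one of FBPIC's kernels; as a stand-in, the following sketch illustrates the same Numba kernel interface and the manual memory management (illustrative names, not FBPIC code):

```python
import numpy as np
from numba import cuda

@cuda.jit
def push_p(ux, Ez, q_over_m_dt):
    i = cuda.grid(1)
    if i < ux.shape[0]:
        ux[i] += q_over_m_dt * Ez[i]       # per-particle momentum update

# manual memory management: arrays live on the GPU across all time steps...
d_ux = cuda.to_device(np.zeros(1_000_000))
d_Ez = cuda.to_device(np.ones(1_000_000))
threads = 256
blocks = (d_ux.shape[0] + threads - 1) // threads
for step in range(100):
    push_p[blocks, threads](d_ux, d_Ez, 1e-3)
ux = d_ux.copy_to_host()                   # ...copied back only for I/O
```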
SLIDE 13


Single-GPU Performance Results

Speed-up on different Nvidia GPUs (vs. single-core Intel Xeon E5-2650 v2):

  Intel Xeon E5-2650 v2 (single-core)    1
  Nvidia M2070                          26
  Nvidia K20m                           67
  Nvidia K20x                           77

Runtime distribution of the GPU PIC methods:

  Particle push      29.0%
  Field deposition    7.9%
  Field gathering    14.0%
  Particle sort       8.3%
  Field push          6.8%
  FFT                14.0%
  DHT                20.0%

Speed-up of up to ~70× compared to the single-core CPU version; ~20 ns per particle per step.
SLIDE 14


Parallelization of FBPIC


Which parallelization for FBPIC?

  • PSATD with global transformations: global communication ➞ ✘
  • Standard (FDTD-style) domain decomposition: local communication & exchange, but limited/low accuracy ➞ ✔ (used here, with large guard regions; see next slides)
  • Local transformations & domain decomposition: local exchange, high accuracy ➞ ? (work in progress)
SLIDE 15


Inter-Node Parallelization


Spatial domain decomposition

  • Split the work by spatial decomposition
  • Domains are computed in parallel
  • Local information is exchanged at the boundaries
  • The order of accuracy defines the guard region size (large guard regions for quasi-spectral accuracy)

[Figure: concept of domain decomposition in the longitudinal direction — processes 0-3 each compute one domain; neighboring domains share overlapping guard regions for the local field and particle exchange.]
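
A minimal mpi4py sketch of such a guard-region exchange for a field array (hypothetical names and sizes; FBPIC's actual exchange also transfers particles):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size   # periodic for simplicity

n_guard, nz_local, nr = 64, 512, 512                 # illustrative sizes
E = np.zeros((nz_local + 2 * n_guard, nr))           # local domain + guards

# send own edge cells to each neighbor, receive into own guard regions
comm.Sendrecv(np.ascontiguousarray(E[n_guard:2 * n_guard]), dest=left,
              recvbuf=E[-n_guard:], source=right)
comm.Sendrecv(np.ascontiguousarray(E[-2 * n_guard:-n_guard]), dest=right,
              recvbuf=E[:n_guard], source=left)
```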
SLIDE 16


Scaling of the MPI version of FBPIC

[Plot: GPU scaling of FBPIC — strong scaling on the JURECA supercomputer (Nvidia K80), preliminary results (not optimized). Speed-up (1-32) vs. number of GPUs (4-128) for 16384×512 cells with 64 guard cells per domain; at the largest GPU counts the guard region size approaches the local domain size.]

For productive and fast simulations, 4-32 GPUs are more than enough!

Best strategy for our case: extensive intra-node parallelization on the GPU, and only a few inter-node domains.
SLIDE 17


Summary

  • Motivation: efficient and easy parallelization of a novel PIC algorithm, combining speed, accuracy and usability in order to work productively as a physicist

  • FBPIC is entirely written in Python (easy to develop and maintain the code)
  • Implementation uses Numba (JIT compilation and interface for writing CUDA-Python)
  • Intra- and Inter-node parallelization approach suitable for spectral algorithms
  • Single GPU well suited for global operations (FFT & DHT)
  • Enabling CUDA support for the full code took less than 3 weeks
  • Multi-GPU parallelization by spatial domain decomposition with mpi4py
  • Outlook: finalize multi-GPU support, CUDA streams, GPU Direct, open-sourcing of FBPIC
SLIDE 18

Thanks… Questions?

Funding contributed by BMBF FSP302.

Thanks to the group of Jens Osterhoff, the group of Brian McNeil, the group of Johannes Bahrdt, LBNL and the WARP code, and the JURECA supercomputer.

Special thanks to Rémi Lehe.