SLIDE 1

Parallel accelerator simulations: past, present and future

James Amundson
Fermilab
November 21, 2011

SLIDE 2

This Talk

Accelerator Modeling and High-Performance Computing (HPC)

Accelerator Modeling
  - Accelerator Physics
  - Synergia

High Performance Computing
  - Supercomputers
  - Clusters with High-Performance Networking

Optimizing Synergia Performance

SLIDE 3

Accelerator Physics

Computational accelerator physics is a huge topic, crossing several disciplines. The three main areas of current interest are:
  - Electromagnetic simulations of accelerating structures
  - Simulations of advanced accelerator techniques, primarily involving plasmas
  - Beam dynamics simulations

SLIDE 4

Independent-Particle Physics and Collective Effects

Independent-particle physics
  - The interaction of individual particles with external fields, e.g., magnets, RF cavities, etc.
  - Usually the dominant effect in an accelerator
    - Otherwise, it wouldn't work...
  - Well-established theory of simulation
  - Easily handled by current desktop computers

Collective effects
  - Space charge, wake fields, electron cloud, beam-beam interactions, etc.
  - Usually considered a nuisance
  - Topic of current beam dynamics simulation research
  - Calculations typically require massively parallel computing
    - Clusters and supercomputers

SLIDE 5

Split-Operator and Particle-in-Cell Techniques

The split-operator technique allows us to approximate the evolution operator for a time t by

    O(t) = O_sp(t/2) O_coll(t) O_sp(t/2)

The Particle-in-Cell (PIC) technique allows us to simulate the large number of particles in a bunch (typically O(10^12)) by a much smaller number of macroparticles (typically O(10^7)). Collective effects are calculated using fields computed on discrete meshes with O(10^6) degrees of freedom.
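As a rough illustration of the stepping structure (a sketch only; the types, names, and toy "physics" here are invented for this example and are not Synergia's actual interface):

    // Sketch of split-operator stepping. Everything here is illustrative,
    // not Synergia's implementation.
    #include <vector>

    struct Particle { double x, xp, y, yp; };
    using Bunch = std::vector<Particle>;

    // Independent-particle physics for half a step: here just a drift.
    void apply_single_particle_map(Bunch& bunch, double half_length) {
        for (auto& p : bunch) {
            p.x += p.xp * half_length;
            p.y += p.yp * half_length;
        }
    }

    // Collective physics for the full step: a real code would do the PIC
    // charge deposition and field solve here before kicking the particles.
    void apply_space_charge_kick(Bunch& bunch, double length) {
        (void)bunch; (void)length;  // placeholder
    }

    // One step: O(t) = O_sp(t/2) O_coll(t) O_sp(t/2)
    void split_operator_step(Bunch& bunch, double length) {
        apply_single_particle_map(bunch, 0.5 * length);
        apply_space_charge_kick(bunch, length);
        apply_single_particle_map(bunch, 0.5 * length);
    }

    int main() {
        Bunch bunch(1000, Particle{0.001, 0.0001, -0.002, 0.0});
        for (int step = 0; step < 32; ++step) {  // e.g., 32 space-charge kicks
            split_operator_step(bunch, 1.0);
        }
        return 0;
    }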

SLIDE 6

Synergia

Beam-dynamics framework developed at Fermilab
Mixed C++ and Python
Designed for MPI-based parallel computations
  - Desktops (laptops)
  - Clusters
  - Supercomputers

https://compacc.fnal.gov/projects/wiki/synergia2

SLIDE 7

Supercomputers and Clusters with High-Performance Networking

Tightly-coupled high-performance computing in the recent era has been dominated by MPI, the Message Passing Interface. MPI provides:
  - Point-to-point communications
  - Collective communications
    - Reduce
    - Gather
    - Broadcast
    - Many derivatives and combinations

MPI is a relatively low-level interface. Parallelizing a serial program to run efficiently under MPI is not a trivial undertaking.

Modern supercomputers and HPC clusters differ from large collections of desktop machines in their networking:
  - High bandwidth
  - Low latency
  - Exotic topologies
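As a point of reference for the level MPI operates at, here is a minimal, generic example of the collectives mentioned above (standard MPI C API; nothing Synergia-specific):

    // Minimal MPI collectives example: reduce, then broadcast.
    // Build with mpicxx; run with e.g. mpirun -n 4 ./a.out
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each rank contributes a local value; Reduce sums them onto rank 0.
        double local = rank + 1.0;
        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        // Broadcast the summed result from rank 0 back to every rank.
        MPI_Bcast(&total, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        std::printf("rank %d of %d: total = %g\n", rank, size, total);
        MPI_Finalize();
        return 0;
    }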

SLIDE 8

Platforms

In recent times, we have run Synergia on ALCF’s Intrepid and NERSC’s Hopper. We also run on our (Fermilab’s) Wilson cluster.

SLIDE 9

Intrepid

Intrepid’s Blue Gene/P system consists of:
  - 40 racks
  - 1,024 nodes per rack
  - 850 MHz quad-core processor and 2 GB RAM per node
for a total of 164K cores, 80 terabytes of RAM, and a peak performance of 557 teraflops.

SLIDE 10

Hopper

Hopper’s Cray XE6 system consists of:
  - 6,384 nodes
  - 2 twelve-core AMD ’MagnyCours’ 2.1-GHz processors per node
  - 24 cores per node (153,216 total cores)
  - 32 GB DDR3 1333-MHz memory per node (6,000 nodes)
  - 64 GB DDR3 1333-MHz memory per node (384 nodes)
  - 1.28 petaflops for the entire machine

SLIDE 11

Wilson Cluster

2005: 20 nodes with dual-socket, single-core (2 cores/node) Intel Xeon CPUs
  - 0.13 TFlop/s Linpack performance

2010: 25 nodes with dual-socket, six-core (12 cores/node) Intel Westmere CPUs
  - 2.31 TFlop/s Linpack performance

2011 (last week!): 34 nodes with quad-socket, eight-core (32 cores/node) AMD Opteron CPUs

SLIDE 12

Strong and Weak Scaling

Strong scaling: fixed problem size
Weak scaling: fixed ratio of problem size to number of processes
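In terms of the run time T(p) on p processes, these are commonly quantified by the parallel efficiencies (standard definitions, added here for reference):

    E_{\mathrm{strong}}(p) = \frac{T(1)}{p\,T(p)} \quad\text{(fixed total problem size)}, \qquad
    E_{\mathrm{weak}}(p) = \frac{T(1)}{T(p)} \quad\text{(work per process held fixed)}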

Figure: time [sec] vs. number of cores (roughly 500-2500); actual vs. ideal strong scaling.

SLIDE 13

Strong Scaling is Hard

Take a serial program. Profile it. Parallelize the routines taking up 99% of the runtime.
  - Assume the parallelized 99% scales perfectly.
  - Assume the remaining 1% does not scale at all.

Could be worse!
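The limiting behavior is just Amdahl's law: with a parallel fraction f (here f = 0.99), the speedup on p processes is bounded (standard result, stated for reference):

    S(p) = \frac{1}{(1 - f) + f/p}, \qquad
    \lim_{p \to \infty} S(p) = \frac{1}{1 - f} = 100 \quad (f = 0.99)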

Figure: normalized time vs. procs (1-256); ideal scaling vs. the "real" 99%-parallel case.

SLIDE 14

Optimizing Synergia Performance

In Synergia, particles are distributed among processors randomly. Each processor calculates a spatial subsection of the field in field solves. (Other schemes have been tried.)

The major portions of a Synergia space charge calculation step (see the sketch below):
  - Track individual particles (twice): easily parallelizable.
  - Deposit charge on the grid locally: easily parallelizable.
  - Add up the total charge distribution (semi-)globally: a communication step.
  - Solve the Poisson equation: uses parallel FFTW, with internal communications.
  - Calculate the electric field from the scalar field locally: easily parallelizable.
  - Broadcast the electric field to each processor: a communication step.
  - Apply the electric field to the particles: easily parallelizable.
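The sketch below shows how these pieces fit together in a single kick. The helper functions are hypothetical placeholders (only the MPI calls are real API), and the field exchange is written as a simple broadcast; the gatherv+bcast / allgatherv / allreduce variants compared later fill the same role.

    // Schematic structure of one space-charge kick; helper functions are
    // empty placeholders standing in for the real work.
    #include <mpi.h>
    #include <vector>

    using Grid = std::vector<double>;
    using Particles = std::vector<double>;  // stand-in for the particle array

    void deposit_charge(const Particles&, Grid&)    { /* local deposition */ }
    void solve_poisson(const Grid&, Grid&)          { /* parallel FFT solve */ }
    void compute_electric_field(const Grid&, Grid&) { /* local differencing */ }
    void apply_kick(Particles&, const Grid&)        { /* field interpolation */ }

    void space_charge_kick(Particles& particles, int grid_points, MPI_Comm comm) {
        Grid rho(grid_points, 0.0), phi(grid_points, 0.0), efield(grid_points, 0.0);

        deposit_charge(particles, rho);              // local, easily parallel

        // Communication step 1: sum per-rank deposits into the global density.
        MPI_Allreduce(MPI_IN_PLACE, rho.data(), grid_points,
                      MPI_DOUBLE, MPI_SUM, comm);

        solve_poisson(rho, phi);                     // parallel FFTW, internal comms
        compute_electric_field(phi, efield);         // local

        // Communication step 2: make the field available on every rank.
        MPI_Bcast(efield.data(), grid_points, MPI_DOUBLE, 0, comm);

        apply_kick(particles, efield);               // local, easily parallel
    }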

SLIDE 15

The Benchmark

A space charge problem using a 64 × 64 × 512 space charge grid with 10 particles per cell, for a total of 20,971,520 particles.

There are 32 evenly-spaced space charge kicks. The single-particle dynamics use second-order maps. Real simulations are similar, but thousands of times longer. We performed all profiling and optimization on the Wilson Cluster. Hopper has similar performance characteristics, but its networking is a few times faster.

SLIDE 16

Initial Profile

In May 2011, we embarked on an optimization of the newest version of Synergia, v2.1.

Figure: initial profile; time [s] vs. cores for the total time and for sc-get-global-rho, independent-operation-aperture, sc-get-phi2, sc-get-global-en, sc-apply-kick, sc-get-local-rho, and other.

We decided to look at the field applications and the communication steps.

SLIDE 17

Optimizing Field Applications

  - Minimized data extraction from classes
  - Minimized function calls
  - Inlined functions in the inner loop
  - Added a periodic sort of particles in the z-coordinate
    - Minimizes cache misses when accessing field data
    - std::sort is really fast
  - Added a faster version of floor
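For concreteness, generic versions of two of these tricks might look like the following (a sketch under the assumption of a simple particle struct; not the actual Synergia code):

    // Generic sketches of two of the optimizations described above.
    #include <algorithm>
    #include <vector>

    struct Particle { double x, y, z; };

    // Periodically sorting particles by z keeps neighbors in the particle
    // array close in space, so field-grid lookups hit cache more often.
    void sort_by_z(std::vector<Particle>& particles) {
        std::sort(particles.begin(), particles.end(),
                  [](const Particle& a, const Particle& b) { return a.z < b.z; });
    }

    // A cheaper floor for values within int range; truncation toward zero
    // plus a correction for negative non-integers.
    inline double fast_floor(double x) {
        int i = static_cast<int>(x);   // truncates toward zero
        return (x < i) ? i - 1 : i;
    }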

Figure: kick time [s] vs. cores, before and after optimization.

Overall gain was ∼ 1.9×

SLIDE 18

Optimizing Communication Steps

We tried different combinations of MPI collectives for the charge communication and the field communication steps.

Figures: time [s] vs. number of nodes. Charge communication: reduce_scatter and allreduce at 8 and 12 cores/node. Field communication: gatherv+bcast, allgatherv, and allreduce at 8 and 12 cores/node.

SLIDE 19

Another MPI implementation

The previous results used OpenMPI 1.4.3rc2. Here are the same measurements with MVAPICH2 1.6, again for the charge communication and the field communication steps.

Figures: time [s] vs. number of nodes with MVAPICH2 1.6. Charge communication: reduce_scatter and allreduce at 8 and 12 cores/node. Field communication: gatherv+bcast, allgatherv, and allreduce at 8 and 12 cores/node.

SLIDE 20

Communication Optimization

No single solution won, so we keep all the options. We added a function that tries each communication type once and keeps the fastest one (see the sketch below).

The user can also choose explicitly if desired.
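A sketch of what such one-time auto-selection could look like (hypothetical and simplified to two of the charge-communication options; the real code keeps more variants):

    // Sketch: time each collective option once and keep the fastest.
    // Simplified and hypothetical; not Synergia's actual implementation.
    // Assumes rho.size() is divisible by the communicator size.
    #include <mpi.h>
    #include <vector>

    enum class ChargeComm { allreduce, reduce_scatter };

    // rho is taken by value so the trial communications do not disturb
    // the caller's data.
    ChargeComm choose_charge_comm(std::vector<double> rho, MPI_Comm comm) {
        int size = 1;
        MPI_Comm_size(comm, &size);
        const int n = static_cast<int>(rho.size());

        // Option 1: allreduce the whole grid onto every rank.
        MPI_Barrier(comm);
        double t0 = MPI_Wtime();
        MPI_Allreduce(MPI_IN_PLACE, rho.data(), n, MPI_DOUBLE, MPI_SUM, comm);
        double t_allreduce = MPI_Wtime() - t0;

        // Option 2: reduce_scatter, each rank receiving only its own block.
        std::vector<int> counts(size, n / size);
        std::vector<double> block(n / size);
        MPI_Barrier(comm);
        t0 = MPI_Wtime();
        MPI_Reduce_scatter(rho.data(), block.data(), counts.data(),
                           MPI_DOUBLE, MPI_SUM, comm);
        double t_reduce_scatter = MPI_Wtime() - t0;

        // Agree on the slowest-rank timings so every rank makes the same choice.
        double times[2] = { t_allreduce, t_reduce_scatter };
        MPI_Allreduce(MPI_IN_PLACE, times, 2, MPI_DOUBLE, MPI_MAX, comm);
        return times[0] <= times[1] ? ChargeComm::allreduce
                                    : ChargeComm::reduce_scatter;
    }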

SLIDE 21

Final Results

We gained a factor of ∼ 1.7 in peak performance.

Figure: time [s] vs. number of nodes for OpenMPI at 8 and 12 cores/node, unoptimized and optimized. Best pre-optimization time: 74.9 s; best post-optimization time: 45.0 s.

SLIDE 22

The Future

Where to go next?

Optimize for multiple threads
  - OpenMP
    - Not very hard
    - Cannot be a final solution: not enough threads
  - Hybrid OpenMP-MPI
    - Promising

Utilize GPUs
  - CUDA
    - Not very easy
    - Cannot be a final solution: single GPUs not fast enough
  - Hybrid CUDA-MPI
    - Promising
  - Hybrid CUDA-OpenMP-MPI
    - Sounds complicated
    - Where we will probably have to end up

SLIDE 23

First Steps with OpenMP

Figure: OpenMP timing vs. cores (1-32) for independent-operation-apply, sc-get-green-fn, sc-apply-kick, sc-get-phi2, sc-get-local-rho, and independent-operation-aperture.

Charge deposition is actually harder than in the MPI case (see the sketch below).
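The basic problem is that two threads can try to deposit charge into the same grid cell at the same time. A minimal (generic, assumed) way to make that safe is an atomic update, which is exactly what hurts scaling:

    // Generic sketch of threaded charge deposition (nearest-grid-point, 1D,
    // for brevity); not Synergia code. Atomics avoid races when two threads
    // hit the same cell, but they serialize those updates.
    #include <omp.h>
    #include <cmath>
    #include <vector>

    void deposit_ngp(const std::vector<double>& z, std::vector<double>& rho,
                     double z_min, double dz)
    {
        const long n = static_cast<long>(z.size());
        #pragma omp parallel for
        for (long i = 0; i < n; ++i) {
            int cell = static_cast<int>(std::floor((z[i] - z_min) / dz));
            if (cell >= 0 && cell < static_cast<int>(rho.size())) {
                #pragma omp atomic
                rho[cell] += 1.0;   // unit macroparticle charge
            }
        }
    }

The usual alternative is to give each thread a private copy of the grid and sum the copies afterwards, trading memory and a reduction step for less contention.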

SLIDE 24

First Steps with Hybrid OpenMP-MPI

Figure: time vs. procs (1-1024) for cxx_benchmark/mpi-amd and cxx_benchmark/mixed32-amd.

Peak performance improved, but we still have a long way to go.

SLIDE 25

First Steps with CUDA (profiling)

SLIDE 26

First Steps with CUDA (comparison)

SLIDE 27

Next Step

Next step: communication avoidance.
  - In principle: do more computation in order to avoid communication.
  - In practice: solve the Poisson equation on each processor, which avoids broadcasting the field (see the sketch below).
  - We already did this in the multi-GPU calculation.
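Schematically (hypothetical helper names again, paralleling the earlier sketch), the communication-avoiding kick replaces the field exchange with a redundant local solve:

    // Sketch: communication-avoiding space-charge kick. Every rank obtains
    // the full charge density, solves Poisson redundantly, and no field
    // broadcast is needed. Helper functions are hypothetical placeholders.
    #include <mpi.h>
    #include <vector>

    using Grid = std::vector<double>;
    using Particles = std::vector<double>;

    void solve_poisson_locally(const Grid&, Grid&)  { /* serial FFT solve */ }
    void compute_electric_field(const Grid&, Grid&) { /* local differencing */ }
    void apply_kick(Particles&, const Grid&)        { /* field interpolation */ }

    void space_charge_kick_comm_avoiding(Particles& particles, Grid& rho,
                                         MPI_Comm comm)
    {
        // Single communication step: every rank ends up with the global rho.
        MPI_Allreduce(MPI_IN_PLACE, rho.data(), static_cast<int>(rho.size()),
                      MPI_DOUBLE, MPI_SUM, comm);

        Grid phi(rho.size(), 0.0), efield(rho.size(), 0.0);
        solve_poisson_locally(rho, phi);        // redundant work on every rank
        compute_electric_field(phi, efield);    // local
        apply_kick(particles, efield);          // local
        // No broadcast of the field is required.
    }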

SLIDE 28

Predicted Behavior

Figure: time vs. procs (1-256) for cxx_benchmark/mpi-amd and the predicted communication-avoiding behavior.

Peak performance is expected to improve by ∼2×.

SLIDE 29

Conclusions

High-performance computing passed the 100k-core mark quite a while ago. The evolution is toward more cores per CPU. The future promises even more cores, plus GPUs.

Exascale computing discussions have considered millions to ∼a billion cores.

Our techniques are evolving to include hybrid approaches to parallelism.
