Petascale Parallelization of the Gyrokinetic Toroidal Code
Stephane Ethier, Princeton Plasma Physics Laboratory
Mark Adams, Columbia University
Jonathan Carter, Leonid Oliker, Lawrence Berkeley National Laboratory
VECPAR 2010, June 23rd, 2010
Outline
- System configurations
– Blue Gene/P, Cray XT4, Hyperion cluster
- Parallel gyrokinetic toroidal code (GTC-P)
– First fully parallel toroidal PIC code algorithm
- ITER-sized scaling experiments
– 128K IBM BG/P cores – 32K Cray XT4 cores – 2K Hyperion cores
Blue Gene/P
- Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM
- Compute card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR, supports 4-way SMP
- Node card: 32 compute cards (32 chips, 4x4x2), 0-2 I/O cards, 435 GF/s, 64 GB
- Rack: 32 node cards, 1024 chips (4096 procs), 14 TF/s, 2 TB
- System: 1 to 72 or more racks, cabled 8x8x16, 1 PF/s+, 144 TB+
Figure courtesy IBM
Blue Gene/P Interconnection Networks
- 3 Dimensional Torus
– Interconnects all compute nodes – 3.4 Gb/s on each of the 12 node links (5.1 GB/s per node) – MPI: 3 µs latency for one hop, 10 µs to the farthest; bandwidth 1.27 GB/s – 1.7/2.6 TB/s bisection bandwidth
- Collective Network
– Interconnects all compute and I/O nodes – One-to-all broadcast functionality – Reduction operations functionality – 6.8 Gb/s of bandwidth per link – MPI latency of 5 µs for a one-way tree traversal
- Low-Latency Global Barrier and Interrupt
– MPI latency of 1.6 µs to reach all 72K nodes (one way)
Figure courtesy IBM
Cray XT4
- Single-socket 2.3 GHz quad-core AMD Opteron per compute node
- 37 Gflop/s peak per node
- Microkernel on compute PEs, full-featured Linux on service PEs
- Service PEs specialize by function
Service partition: specialized Linux nodes (login, network, system, and I/O PEs) alongside the compute PEs
Figure courtesy Cray
Cray XT4 Network
- AMD Opteron with direct-attached memory: 8.5 GB/s local memory bandwidth
- HyperTransport to the Cray SeaStar2 interconnect: 4 GB/s MPI bandwidth
- Six torus links per node at 7.6 GB/s each; 6.5 GB/s torus link bandwidth
- MPI latency 4-8 µs, bandwidth 1.7 GB/s
Figure courtesy Cray
Hyperion Scalable Unit
- 134 dual-socket quad-core compute nodes (1,072 cores)
- QsNet Elan3 interconnect; 100BaseT control network; 1 GbE management network
- Login/service/master node: 1 RPS, 2x1 GbE, and RAID 1
- 35 Lustre Object Storage Systems over 2x10 GbE + IBA 4x: 732 TB and 47 GB/s
- Lustre metadata servers; 144-port IBA 4x switch with uplinks to the spine switch
- 4 gateway nodes @ 1.5 GB/s delivered I/O over 2x10 GbE to the 1/10 GbE SAN
- 4 gateway nodes @ 1.5 GB/s delivered I/O over 1x IBA 4x DDR to the IBA 4x DDR SAN
- 12x24 = 288-port InfiniBand 4x DDR switch
Hyperion Phase 1 - 4 SU, 46 TF/s cluster
- 576 nodes and 4,608 cores; 12.1 TB/s memory bandwidth; 4.6 TB capacity
- 85 GF/s dual-socket 2.5 GHz quad-core Intel LV Harpertown nodes
- 8 SU base system, 4 expansion SUs
Figure courtesy LLNL
Hyperion Connectivity
- Bandwidth: 4X IB DDR, 2 GB/s peak
- MPI latency 2-5 µs, bandwidth 400 MB/s
The Gyrokinetic Toroidal Code
- 3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas
- Solves the gyro-averaged Vlasov equation
- Gyrokinetic Poisson equation solved in real space
- 4-point average method for charge deposition
- Global code (full torus as opposed to only a flux tube)
- Massively parallel: typical runs done on 1000s of processors
- Nonlinear and fully self-consistent.
- Written in Fortran 90/95 + MPI
- Originally written by Z. Lin, subsequently modified
Fusion: DOE #1 Facility Priority
November 10, 2003: Energy Secretary Spencer Abraham announces the Department of Energy's 20-Year Science Facility Plan, setting priorities for 28 new, major science research facilities
#1 on the list of priorities is ITER, an unprecedented international collaboration on the next major step for the development of fusion
#2 is UltraScale Scientific Computing Capability
Particle-in-cell (PIC) method
- Particles sample distribution function (markers).
- The particles interact via a grid, on which the potential is calculated from deposited charges.
The PIC Steps
- “SCATTER”, or deposit, charges on the grid (nearest neighbors)
- Solve Poisson equation
- “GATHER” forces on each particle from the potential
- Move particles (PUSH)
- Repeat…
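To make the four steps concrete, below is a minimal one-step sketch of an electrostatic PIC cycle in Fortran 90 (GTC's implementation language). It is a toy 1D periodic example with linear weighting and a placeholder field solve, not GTC's actual code; all names are illustrative.

! Toy 1D periodic PIC cycle: scatter charge, (placeholder) field solve,
! gather the field, push the particles.  Illustrative only, not GTC.
program pic_cycle_sketch
  implicit none
  integer, parameter :: ng = 64, np = 4096      ! grid points, markers
  real,    parameter :: lx = 1.0, dt = 0.01
  real,    parameter :: dx = lx / ng
  real    :: x(np), v(np), rho(0:ng-1), efield(0:ng-1)
  integer :: ip, j, jp1
  real    :: w

  call random_number(x)
  x = x * lx                                    ! positions in [0, lx)
  v = 0.0

  ! SCATTER: deposit each marker's charge on its two nearest grid points
  rho = 0.0
  do ip = 1, np
     j   = int(x(ip) / dx)                      ! left grid point
     w   = x(ip) / dx - real(j)                 ! fractional offset
     jp1 = mod(j + 1, ng)                       ! periodic neighbour
     rho(j)   = rho(j)   + (1.0 - w)
     rho(jp1) = rho(jp1) + w
  end do

  ! SOLVE: the Poisson solve would go here (GTC solves the gyrokinetic
  ! Poisson equation in real space); a zero field keeps the sketch short.
  efield = 0.0

  ! GATHER + PUSH: interpolate the field to each marker and advance it
  do ip = 1, np
     j   = int(x(ip) / dx)
     w   = x(ip) / dx - real(j)
     jp1 = mod(j + 1, ng)
     v(ip) = v(ip) + dt * ((1.0 - w) * efield(j) + w * efield(jp1))
     x(ip) = modulo(x(ip) + dt * v(ip), lx)     ! periodic wrap
  end do

  print *, 'total deposited weight =', sum(rho) ! equals np by construction
end program pic_cycle_sketch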
Parallel GTC: Domain decomposition + particle splitting
- 1D Domain decomposition:
– Several MPI processes allocated to a section of the torus
- Particle splitting method
– The particles in a toroidal section are equally divided between several MPI processes
- Also has loop-level parallelism – OpenMP directives (not used in this study)
Figure: torus divided into toroidal sections, one section per group of processors (Processors 0-3)
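A sketch of how such a two-level layout can be built with MPI_Comm_split is shown below. It assumes a flat-MPI run whose size is a multiple of the number of toroidal sections; the communicator and variable names are illustrative, not GTC's.

! Sketch of the two-level process layout: processes are grouped by
! toroidal section (domain decomposition), and the particles of each
! section are split among the processes that share it.  Names are
! illustrative, not GTC's actual variables.
program layout_sketch
  use mpi
  implicit none
  integer :: ierr, world_rank, world_size
  integer :: ntoroidal                 ! number of toroidal sections
  integer :: mydomain, myslot          ! which section / which particle slice
  integer :: domain_comm               ! processes sharing one section's grid
  integer :: toroidal_comm             ! same slot across all sections

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, world_size, ierr)

  ntoroidal = 64                       ! e.g. 64 sections of the torus
  if (mod(world_size, ntoroidal) /= 0) call MPI_Abort(MPI_COMM_WORLD, 1, ierr)

  mydomain = mod(world_rank, ntoroidal)
  myslot   = world_rank / ntoroidal

  ! Processes with the same toroidal section: they share that section's
  ! grid and each holds an equal share of its particles.
  call MPI_Comm_split(MPI_COMM_WORLD, mydomain, myslot, domain_comm, ierr)

  ! One process per section with the same slot: used to pass particles
  ! from section to section around the torus (the "shift" step).
  call MPI_Comm_split(MPI_COMM_WORLD, myslot, mydomain, toroidal_comm, ierr)

  call MPI_Finalize(ierr)
end program layout_sketch

With the illustrative 64 sections, a 32,768-process flat-MPI run would place 512 processes in each domain_comm.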
Radial grid decomposition
- Non-overlapping geometric partitioning
Charge Deposition for charged rings
Charge deposition step (SCATTER operation): classic PIC point deposition vs. the 4-point average gyrokinetic method (W.W. Lee) used in GTC
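The sketch below illustrates the 4-point average on a plain 2D Cartesian grid: each marker deposits a quarter of its charge at four points spaced 90 degrees apart on its gyro-ring, each with the usual bilinear weights. GTC performs this on its field-aligned toroidal mesh; this routine and its argument names are illustrative only.

! Illustrative 4-point gyro-averaged SCATTER on a 2D periodic grid.
! Each marker deposits q/4 at four ring points, with bilinear weights.
subroutine deposit_ring(xg, yg, rho_gyro, q, dx, dy, nx, ny, charge)
  implicit none
  real,    intent(in)    :: xg, yg             ! guiding-centre position
  real,    intent(in)    :: rho_gyro           ! marker gyroradius
  real,    intent(in)    :: q, dx, dy          ! charge and grid spacings
  integer, intent(in)    :: nx, ny
  real,    intent(inout) :: charge(0:nx-1, 0:ny-1)
  real, parameter :: pi = 3.1415927
  integer :: k, i, j, ip1, jp1
  real    :: xp, yp, wx, wy

  do k = 0, 3
     ! k-th point on the gyro-ring, 90 degrees apart
     xp = xg + rho_gyro * cos(0.5 * pi * real(k))
     yp = yg + rho_gyro * sin(0.5 * pi * real(k))

     ! bilinear (area) weights to the four surrounding grid points
     i  = floor(xp / dx);  wx = xp / dx - real(floor(xp / dx))
     j  = floor(yp / dy);  wy = yp / dy - real(floor(yp / dy))
     i  = modulo(i, nx);   ip1 = modulo(i + 1, nx)   ! periodic wrap
     j  = modulo(j, ny);   jp1 = modulo(j + 1, ny)

     charge(i,   j  ) = charge(i,   j  ) + 0.25 * q * (1.0 - wx) * (1.0 - wy)
     charge(ip1, j  ) = charge(ip1, j  ) + 0.25 * q * wx         * (1.0 - wy)
     charge(i,   jp1) = charge(i,   jp1) + 0.25 * q * (1.0 - wx) * wy
     charge(ip1, jp1) = charge(ip1, jp1) + 0.25 * q * wx         * wy
  end do
end subroutine deposit_ring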
Overlapping partitioning
- Extend local domain to line up with the grid
- Extend local domain to account for the gyroradius
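A minimal sketch of this bookkeeping is given below: each process first gets a non-overlapping radial interval, then extends it by the largest gyroradius it must cover and snaps the extended bounds to grid surfaces. The equal-width split and all names are illustrative assumptions (GTC-P actually places the boundaries to balance particles).

! Illustrative overlapping radial partition: own a non-overlapping
! interval, then extend it by the largest gyroradius and line the
! extended bounds up with radial grid surfaces.  Names are illustrative.
subroutine local_radial_range(rank, nproc, a_minor, dr_grid, rho_max, &
                              r_lo, r_hi, r_lo_ext, r_hi_ext)
  implicit none
  integer, intent(in)  :: rank, nproc
  real,    intent(in)  :: a_minor            ! minor radius of the device
  real,    intent(in)  :: dr_grid            ! radial grid spacing
  real,    intent(in)  :: rho_max            ! largest gyroradius to cover
  real,    intent(out) :: r_lo, r_hi         ! owned, non-overlapping range
  real,    intent(out) :: r_lo_ext, r_hi_ext ! range actually stored locally

  ! Non-overlapping geometric partition (equal width here for simplicity;
  ! GTC-P balances the particle count per process instead).
  r_lo = a_minor * real(rank)     / real(nproc)
  r_hi = a_minor * real(rank + 1) / real(nproc)

  ! Extend by the maximum gyroradius so gyro-ring deposition stays local,
  ! then snap outward so the bounds line up with grid surfaces.
  r_lo_ext = max(0.0,     r_lo - rho_max)
  r_hi_ext = min(a_minor, r_hi + rho_max)
  r_lo_ext = dr_grid * real(floor  (r_lo_ext / dr_grid))
  r_hi_ext = dr_grid * real(ceiling(r_hi_ext / dr_grid))
end subroutine local_radial_range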
Major Components in GTC
- Particle work – O(p)
– Major computational kernel “Moves particles” – Large body loops; Lots of loop level parallelism
- Grid (cell) work – O(g)
– Poisson solver, “smoothing”, E-field calculations
- Particle-Grid work – O(p)
– Major computational kernel “Scatter” and “Gather” – Unstructured and “random” access to grids – Semi-structured grids in GTC – Cache effects are critical
Major routines in GTC
- Push ions (P,P-G)
– Major computational kernel “Moves particles” – Large body loops; Gathers; Lots of loop level parallelism
- Charge deposition (P-G)
– Major computational kernel “Scatter” – Pressure on cache – unstructured access to grid – block grid
- Shift ions, communication (P)
– Sorts out particles that have moved out of the local domain and sends them to the “next” processor (see the sketch after this list)
- Poisson Solver (G)
– Solve Poisson equation. Prior to 2007 the solve was redundantly executed on each processor; the new version uses the PETSc solver to distribute it efficiently
- Smooth (G) and Field (G)
– Smaller computational kernels – Prior to 2007 - NOT parallel
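Below is a hedged sketch of the communication pattern behind the shift routine: outgoing markers are packed into a buffer and exchanged with the neighbouring toroidal section over the toroidal communicator. The routine, buffer layout, and names are illustrative, not GTC's; a complete shift would also handle markers moving in the other direction and markers that cross more than one section.

! Illustrative shift step: send markers that left the local toroidal
! section to the next section, receive markers arriving from the
! previous one.  Names and buffer layout are illustrative, not GTC's.
subroutine shift_sketch(toroidal_comm, nsend, sendbuf, maxrecv, recvbuf, nrecv)
  use mpi
  implicit none
  integer, intent(in)  :: toroidal_comm      ! ring of toroidal sections
  integer, intent(in)  :: nsend              ! reals packed for the right neighbour
  real,    intent(in)  :: sendbuf(*)         ! packed phase-space coordinates
  integer, intent(in)  :: maxrecv            ! capacity of recvbuf
  real,    intent(out) :: recvbuf(*)
  integer, intent(out) :: nrecv              ! reals actually received
  integer :: ierr, rank, nproc, right, left
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Comm_rank(toroidal_comm, rank, ierr)
  call MPI_Comm_size(toroidal_comm, nproc, ierr)
  right = mod(rank + 1, nproc)               ! next section around the torus
  left  = mod(rank - 1 + nproc, nproc)

  ! Exchange: send outgoing markers to the right, receive from the left.
  call MPI_Sendrecv(sendbuf, nsend,   MPI_REAL, right, 0, &
                    recvbuf, maxrecv, MPI_REAL, left,  0, &
                    toroidal_comm, status, ierr)
  call MPI_Get_count(status, MPI_REAL, nrecv, ierr)
end subroutine shift_sketch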
Weak Scaling Experiments
- Keep the number of cells and the number of particles per process constant
- Double size of device in each case
– Final case is ITER sized plasma
- Cray XT4 up to 32K cores
– Quad core / flat MPI (Franklin - NERSC)
- BG/P up to 128K cores
– Quad core / flat MPI (Intrepid – ANL)
- Hyperion up to 2K cores
– Dual socket quad core / flat MPI (LLNL)
Absolute Performance
- Cray XT4 is the highest performing at both 2K and 32K procs
- BG/P scaling is much better than XT4 going from 2K to 32K procs
- Even though Hyperion (Xeon Harpertown) has higher peak performance than the XT4 (Opteron), its performance lags at 2K procs
– Worse memory bandwidth relative to peak
Communication Performance
- Shift routine is a good proxy for communication costs
- BG/P has the lowest percentage of time in communication
– Also has the lowest-performing processor
- Hyperion has the highest percentage of time in communication
Performance on XT4
- Push and Charge scale well
- Shift has moderate scaling
- Field and Smooth scale relatively poorly
Performance on BG/P
- Push, Charge and Shift scale well
- Field and Smooth scale relatively poorly
Performance on Hyperion
- Push and Charge scale well
- Shift has moderate scaling
- Field and Smooth scale relatively poorly
Load Imbalance
- Field and Smooth are both dominated by grid-related work
- At high processor counts the number of grid points per MPI task is imbalanced
– Fewer grid points in radial domains near the center of the circular plane – The radial decomposition targets the same number of particles per process, as particle work is >80% of the total
Summary
- Radial decomposition enables GTC to scale to ITER-sized devices
– Impossible to fit the full grid on a single node without radial decomposition
- XT4 offers the best performance, but is perhaps not as scalable as BG/P
- The Hyperion IB cluster seems to lag, although more data are required
– Upgraded to Intel Nehalem nodes and enlarged since our work was completed
Acknowledgements
- U. S. Department of Energy
– Office of Fusion Energy Sciences under contract number DE-AC02-76CH03073 – Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231
- Computing Resources