Slide 1

Petascale Parallelization of the Gyrokinetic Toroidal Code

Stephane Ethier, Princeton Plasma Physics Laboratory
Mark Adams, Columbia University
Jonathan Carter and Leonid Oliker, Lawrence Berkeley National Laboratory
VECPAR 2010, June 23rd, 2010

Slide 2

Outline

  • System configurations
    – Blue Gene/P, Cray XT4, Hyperion cluster
  • Parallel gyrokinetic toroidal code (GTC-P)
    – First fully parallel toroidal PIC code algorithm
  • ITER-sized scaling experiments
    – 128K IBM BG/P cores
    – 32K Cray XT4 cores
    – 2K Hyperion cores

Slide 3

Blue Gene/P

Figure (courtesy IBM): packaging hierarchy.
  • Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM
  • Compute card: 1 chip + 20 DRAMs, 13.6 GF/s, 2.0 GB DDR (supports 4-way SMP)
  • Node card: 32 compute cards, 0-2 I/O cards, 435 GF/s, 64 GB (32 chips, 4x4x2)
  • Rack: 32 node cards, 1024 chips (4,096 processors), 14 TF/s, 2 TB
  • System: 1 to 72 or more racks (cabled 8x8x16), 1 PF/s and up, 144 TB and up

Slide 4

Blue Gene/P Interconnection Networks

  • 3-dimensional torus
    – Interconnects all compute nodes
    – 3.4 GB/s on all 12 node links (5.1 GB/s per node)
    – MPI: 3 µs latency for one hop, 10 µs to the farthest; 1.27 GB/s bandwidth
    – 1.7/2.6 TB/s bisection bandwidth
  • Collective network
    – Interconnects all compute and I/O nodes
    – One-to-all broadcast functionality
    – Reduction operations functionality
    – 6.8 GB/s of bandwidth per link
    – MPI latency of 5 µs for a one-way tree traversal
  • Low-latency global barrier and interrupt network
    – MPI latency of 1.6 µs to reach all 72K nodes one way

Figure courtesy IBM
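For context on how per-hop latencies like those quoted above are typically obtained, here is a hedged, generic MPI ping-pong microbenchmark in Python with mpi4py. It is an illustrative sketch, not the measurement procedure behind these slides.

```python
# Generic ping-pong latency microbenchmark (illustrative, not the
# procedure behind the numbers above). Run with at least 2 ranks,
# e.g.: mpiexec -n 2 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = np.zeros(1, dtype='b')      # 1-byte message isolates latency
iters = 10000

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
t1 = MPI.Wtime()

if rank == 0:
    # One-way latency is half the per-iteration round-trip time.
    print(f"one-way latency ~ {(t1 - t0) / iters / 2 * 1e6:.2f} us")
```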

Slide 5

Cray XT4

  • Single-socket 2.3 GHz quad-core AMD Opteron per compute node
  • 37 Gflop/s peak per node
  • Microkernel on compute PEs, full-featured Linux on service PEs
  • Service PEs specialize by function

Figure (courtesy Cray): service partition of specialized Linux nodes (compute, login, network, system, and I/O PEs).

Slide 6

Cray XT4 Network

Figure (courtesy Cray): each node pairs an AMD Opteron with direct-attached memory (8.5 GB/s local memory bandwidth) and a Cray SeaStar2 interconnect chip over HyperTransport (4 GB/s MPI bandwidth); six torus links per node at 7.6 GB/s each, with 6.5 GB/s torus link bandwidth.

  • MPI latency 4-8 µs, bandwidth 1.7 GB/s

Slide 7

Hyperion Scalable Unit

Figure (courtesy LLNL): one scalable unit (SU) comprises 134 dual-socket quad-core compute nodes (1,072 cores) with QsNet Elan3 and 100BaseT control; a 144-port IBA 4x switch with uplinks to a spine switch (12x24 = 288-port InfiniBand 4x DDR); 4 gateway nodes at 1.5 GB/s delivered I/O over 2x10GbE to a 1/10 GbE SAN and 4 more over 1x IBA 4x DDR to an IBA 4x DDR SAN; Lustre metadata servers; 35 Lustre object storage servers (2x10GbE + IBA 4x, 732 TB and 47 GB/s); a login/service/master node (2x1 GbE and RAID); and a 1 GbE management network.

Hyperion Phase 1 (4 SUs): 46 TF/s cluster
  • 576 nodes and 4,608 cores; 12.1 TB/s memory bandwidth; 4.6 TB memory capacity
  • 85 GF/s dual-socket 2.5 GHz quad-core Intel LV Harpertown nodes

Slide 8

Hyperion Connectivity

Figure (courtesy LLNL): SU base system with 4 expansion SUs.

  • Bandwidth: 4x IB DDR, 2 GB/s peak
  • MPI latency 2-5 µs, bandwidth 400 MB/s
Slide 9

The Gyrokinetic Toroidal Code

  • 3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas
  • Solves the gyro-averaged Vlasov equation
  • Gyrokinetic Poisson equation solved in real space
  • 4-point average method for charge deposition
  • Global code (full torus as opposed to only a flux tube)
  • Massively parallel: typical runs use thousands of processors
  • Nonlinear and fully self-consistent
  • Written in Fortran 90/95 + MPI
  • Originally written by Z. Lin, subsequently modified

Slide 10

Fusion: DOE #1 Facility Priority

November 10, 2003: Energy Secretary Spencer Abraham announces the Department of Energy's 20-year science facility plan, setting priorities for 28 new, major science research facilities.

#1 on the list of priorities is ITER, an unprecedented international collaboration on the next major step in the development of fusion. #2 is the UltraScale Scientific Computing Capability.

Slide 11

Particle-in-cell (PIC) method

  • Particles sample the distribution function (markers)
  • The particles interact via a grid, on which the potential is calculated from deposited charges

The PIC Steps

  • “SCATTER”, or deposit, charges on the grid (nearest neighbors)
  • Solve the Poisson equation
  • “GATHER” forces on each particle from the potential
  • Move particles (“PUSH”)
  • Repeat…
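To make the four steps concrete, here is a minimal generic 1D electrostatic PIC cycle in Python. This is an illustrative sketch with assumed parameters, not the GTC implementation, which is gyrokinetic, toroidal, and written in Fortran 90/95 + MPI.

```python
# Minimal 1D electrostatic PIC cycle (illustrative sketch, not GTC):
# linear nearest-neighbor weighting on a periodic grid.
import numpy as np

ng, npart, L, dt = 64, 10000, 2 * np.pi, 0.1
dx = L / ng
x = np.random.rand(npart) * L         # marker positions
v = np.random.randn(npart) * 0.1      # marker velocities
qm = -1.0                             # charge-to-mass ratio (electrons)
wq = L / npart                        # charge carried by each marker

for step in range(100):
    # SCATTER: deposit each marker's charge on its two nearest grid points
    g = x / dx
    il = np.floor(g).astype(int) % ng
    fr = g - np.floor(g)
    rho = (np.bincount(il, wq * (1 - fr), ng)
           + np.bincount((il + 1) % ng, wq * fr, ng)) / dx
    rho -= rho.mean()                 # neutralizing ion background

    # SOLVE: Poisson equation in Fourier space, then E = -dphi/dx
    k = 2 * np.pi * np.fft.fftfreq(ng, d=dx)
    k2 = k**2
    k2[0] = 1.0                       # avoid division by zero for the k=0 mode
    phi_k = np.fft.fft(rho) / k2
    phi_k[0] = 0.0
    E = np.real(np.fft.ifft(-1j * k * phi_k))

    # GATHER: interpolate the field back to each marker
    Ep = E[il] * (1 - fr) + E[(il + 1) % ng] * fr

    # PUSH: advance marker velocities and positions
    v += qm * Ep * dt
    x = (x + v * dt) % L
```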
Slide 12

Parallel GTC: Domain decomposition + particle splitting

  • 1D domain decomposition
    – Several MPI processes allocated to a section of the torus
  • Particle splitting method
    – The particles in a toroidal section are equally divided among several MPI processes
  • Also has loop-level parallelism via OpenMP directives (not used in this study)

Figure: the torus divided into toroidal sections, one per group of processors (processors 0-3).
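A hedged sketch of how such a two-level decomposition can be set up with MPI communicators follows. The section count and all names are illustrative assumptions, not GTC's actual code.

```python
# Two-level decomposition sketch (illustrative assumptions, not GTC's
# source): group ranks into toroidal sections, then split each
# section's particles among the ranks of that group.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

ntoroidal = 64                          # assumed number of toroidal sections
assert size % ntoroidal == 0, "ranks must be a multiple of sections"
npe_per_section = size // ntoroidal     # particle-splitting factor

# Ranks sharing a color own the same toroidal section and divide
# its particles among themselves.
section = rank // npe_per_section
particle_comm = comm.Split(color=section, key=rank)

# The sub-rank selects this process's share of the section's particles.
my_share = particle_comm.Get_rank()     # 0 .. npe_per_section-1
print(f"rank {rank}: toroidal section {section}, particle share {my_share}")
```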

Slide 13

Radial grid decomposition

  • Non-overlapping geometric partitioning of the radial grid

Slide 14

Charge Deposition for Charged Rings

Figure: the charge deposition step (SCATTER operation), comparing classic PIC point-particle deposition with the 4-point average gyrokinetic method (W.W. Lee), as used in GTC.
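To make the 4-point average concrete, here is a hedged Python sketch of gyro-averaged deposition on a simple 2D periodic Cartesian grid. GTC actually works in field-line-following toroidal coordinates; the function name, grid, and parameters here are illustrative assumptions.

```python
# Hedged sketch of 4-point gyro-averaged charge deposition on a 2D
# periodic Cartesian grid (illustrative assumptions, not GTC's grid).
import numpy as np

def deposit_4point(xc, yc, rho_gyro, q, grid, dx):
    """Deposit charge q for a guiding center at (xc, yc) with
    gyroradius rho_gyro, averaged over 4 points on the gyro-ring."""
    ng = grid.shape[0]
    for angle in (0.0, 0.5 * np.pi, np.pi, 1.5 * np.pi):
        # One of the 4 sample points on the charged ring
        px = xc + rho_gyro * np.cos(angle)
        py = yc + rho_gyro * np.sin(angle)
        # Bilinear (area-weighted) scatter of q/4 to the nearest cell corners
        gx, gy = px / dx, py / dx
        i, j = int(np.floor(gx)) % ng, int(np.floor(gy)) % ng
        fx, fy = gx - np.floor(gx), gy - np.floor(gy)
        qq = 0.25 * q
        grid[i, j] += qq * (1 - fx) * (1 - fy)
        grid[(i + 1) % ng, j] += qq * fx * (1 - fy)
        grid[i, (j + 1) % ng] += qq * (1 - fx) * fy
        grid[(i + 1) % ng, (j + 1) % ng] += qq * fx * fy
```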

Slide 15

Overlapping partitioning

  • Extend the local domain to line up with the grid
  • Extend the local domain by the gyroradius (see the sketch below)
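A hedged sketch of the overlap construction follows. The uniform radial grid, the function name, and the equal-width owned intervals are illustrative assumptions; GTC-P's actual partition balances particle counts rather than radial width.

```python
# Hedged sketch of overlapping radial partitioning: each process owns
# a non-overlapping radial interval, snapped to grid points and then
# extended by the maximum gyroradius so the ring points of local
# particles always fall inside the local (ghost-extended) domain.
import numpy as np

def radial_domain(pid, nproc, r_min, r_max, nr, rho_max):
    """Return (owned, ghost-extended) index ranges on a uniform
    radial grid of nr points spanning [r_min, r_max]."""
    dr = (r_max - r_min) / (nr - 1)
    # Non-overlapping owned interval (equal width here, for simplicity)
    lo = r_min + pid * (r_max - r_min) / nproc
    hi = r_min + (pid + 1) * (r_max - r_min) / nproc
    # Snap owned bounds to grid points so domains line up with the grid
    i_lo = int(np.floor((lo - r_min) / dr))
    i_hi = int(np.ceil((hi - r_min) / dr))
    # Extend by the gyroradius, again rounded to whole grid points
    nghost = int(np.ceil(rho_max / dr))
    g_lo = max(0, i_lo - nghost)
    g_hi = min(nr - 1, i_hi + nghost)
    return (i_lo, i_hi), (g_lo, g_hi)
```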
Slide 16

Major Components in GTC

  • Particle work – O(p)
    – Major computational kernel: “move particles”
    – Large body loops; lots of loop-level parallelism
  • Grid (cell) work – O(g)
    – Poisson solver, “smoothing”, E-field calculations
  • Particle-grid work – O(p)
    – Major computational kernels: “scatter” and “gather”
    – Unstructured and “random” access to grids
    – Semi-structured grids in GTC
    – Cache effects are critical

Slide 17

Major routines in GTC

  • Push ions (P, P-G)
    – Major computational kernel: “move particles”
    – Large body loops; gathers; lots of loop-level parallelism
  • Charge deposition (P-G)
    – Major computational kernel: “scatter”
    – Pressure on cache from unstructured access to the grid; blocked grid
  • Shift ions, communication (P)
    – Sorts out particles that have moved out of the local domain and sends them to the “next” processor (see the sketch after this list)
  • Poisson solver (G)
    – Solves the Poisson equation. Prior to 2007 the solve was executed redundantly on each processor; the new version uses the PETSc solver to distribute it efficiently
  • Smooth (G) and Field (G)
    – Smaller computational kernels
    – Not parallel prior to 2007
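The shift step can be sketched as follows in Python with mpi4py. This is a hedged illustration under simplifying assumptions (a single exchange with the next neighbor only; a production shift would iterate until no particle moves), not GTC's routine.

```python
# Hedged sketch of a "shift" step (illustrative, not GTC's code):
# keep particles still inside the local toroidal section and forward
# the rest to the "next" rank around the 1D torus.
from mpi4py import MPI
import numpy as np

def shift(comm, zeta, zeta_lo, zeta_hi):
    """zeta: toroidal angles of locally held particles. Returns the
    local array after one exchange; repeat until no particle moves."""
    rank, size = comm.Get_rank(), comm.Get_size()
    nxt, prv = (rank + 1) % size, (rank - 1) % size
    inside = (zeta >= zeta_lo) & (zeta < zeta_hi)
    keep, leave = zeta[inside], zeta[~inside]
    # Send leavers forward; receive arrivals from the previous section
    # (lowercase sendrecv pickles the numpy array, fine for a sketch).
    arrivals = comm.sendrecv(leave, dest=nxt, source=prv)
    return np.concatenate([keep, arrivals])
```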

Slide 18

Weak Scaling Experiments

  • Keep the number of cells and the number of particles per process constant
  • Double the size of the device in each case
    – Final case is an ITER-sized plasma
  • Cray XT4 up to 32K cores
    – Quad-core / flat MPI (Franklin, NERSC)
  • BG/P up to 128K cores
    – Quad-core / flat MPI (Intrepid, ANL)
  • Hyperion up to 2K cores
    – Dual-socket quad-core / flat MPI (LLNL)
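This setup implies the total problem size grows linearly with core count. A small sketch of the arithmetic (the per-process constants are illustrative placeholders, not the values used in these experiments):

```python
# Weak-scaling arithmetic sketch: per-process work is held fixed, so
# the totals grow with core count. Both constants below are assumed
# placeholders, not the experiment's actual values.
PARTICLES_PER_PROC = 400_000
CELLS_PER_PROC = 32_000

for cores in (2_048, 32_768, 131_072):   # Hyperion, XT4, BG/P maxima
    print(f"{cores:>7} cores: {cores * PARTICLES_PER_PROC:.2e} particles, "
          f"{cores * CELLS_PER_PROC:.2e} cells")
```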

Slide 19

Absolute Performance

  • Cray XT4 is the highest performing at both 2K and 32K processors
  • BG/P scaling is much better than the XT4's going from 2K to 32K processors
  • Even though Hyperion (Xeon Harpertown) has higher peak performance than the XT4 (Opteron), its performance lags at 2K processors
    – Worse memory bandwidth relative to peak

Slide 20

Communication Performance

  • The shift routine is a good proxy for communication costs
  • BG/P has the lowest percentage of time in communication
    – It also has the lowest-performing processor
  • Hyperion has the highest percentage of time in communication

Slide 21

Performance on XT4

  • Push and Charge scale well
  • Shift has moderate scaling
  • Field and Smooth scale relatively poorly
Slide 22

Performance on BG/P

  • Push, Charge and Shift scale well
  • Field and Smooth scale relatively poorly
Slide 23

Performance on Hyperion

  • Push and Charge scale well
  • Shift has moderate scaling
  • Field and Smooth scale relatively poorly
Slide 24

Load Imbalance

  • Field and Smooth are both dominated by grid-related work
  • At high processor counts the number of grid points per MPI task is imbalanced
    – Fewer grid points in radial domains near the center of the circular plane
    – The radial decomposition balances the number of particles per domain, since particle work is >80% of the total (see the sketch below)
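A small numeric sketch of the mechanism. Both radial profiles are illustrative assumptions, chosen only to show how equal-particle domains can leave the inner domains with fewer grid points; they are not GTC's actual distributions.

```python
# Load-imbalance sketch: pick radial domain boundaries that equalize
# particles, then see how unevenly the grid points land. Both profiles
# below are illustrative assumptions.
import numpy as np

nr, nproc = 1024, 8
r = np.linspace(0.05, 1.0, nr)          # radial bins
particles = np.full(nr, 1.0 / nr)       # assumed: uniform particles per bin
gridpts = r / r.sum()                   # assumed: points per bin grow with r

# Equal-particle boundaries from the cumulative particle fraction
cum = np.cumsum(particles)
cuts = np.searchsorted(cum, np.linspace(0, 1, nproc + 1)[1:-1])
domains = np.split(np.arange(nr), cuts)

for pid, idx in enumerate(domains):
    # Inner domains show a much smaller grid share for an equal
    # particle share, which is the imbalance described above.
    print(f"domain {pid}: particle share {particles[idx].sum():.3f}, "
          f"grid share {gridpts[idx].sum():.3f}")
```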

Slide 25

Summary

  • Radial decomposition enables GTC to scale to ITER-sized devices
    – Impossible to fit the full grid on a single node without radial decomposition
  • The XT4 offers the best performance, but is perhaps not as scalable as BG/P
  • The Hyperion IB cluster seems to lag, although more data is required
    – It has been upgraded to Intel Nehalem nodes and enlarged since our work was completed

Slide 26

Acknowledgements

  • U.S. Department of Energy
    – Office of Fusion Energy Sciences under contract number DE-AC02-76CH03073
    – Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231
  • Computing Resources
    – Argonne Leadership Computing Facility
    – National Energy Research Scientific Computing Center
    – Hyperion Project at LLNL