A parallel Grad-Shafranov solver for real-time control of tokamak plasmas


SLIDE 1

A parallel Grad-Shafranov solver for real-time control of tokamak plasmas

  • M. Rampp, R. Preuss, R. Fischer, L. Giannone, K. Hallatschek

Computing Centre (RZG) of the Max-Planck-Society and Max-Planck-Institute for Plasma Physics (IPP)

Frascati, Mar 26, 2012

SLIDE 2

Overview

GPEC (Garching Parallel Equilibrium Code), a descendant of GEC (Lackner et al.)

Outline:

  • 1. Introduction
  • 2. Basic equations and parallel algorithm for numerical solution
  • 3. Benchmarks & Notes on implementation
  • 4. Summary & Outlook

References
⊲ R. Preuss: Real-time equilibrium calculation, AUG Seminar, IPP (July 2009)
⊲ R. Preuss, R. Fischer, M. Rampp, K. Hallatschek, U. von Toussaint, L. Giannone, P. McCarthy: Parallel equilibrium algorithm for real-time control of tokamak plasmas, IPP-Report R/47 (2012)

SLIDE 3

Introduction

Scope & Motivation
⊲ develop a "fast" numerical solver for the Grad-Shafranov equation
⊲ "fast": enable application in plasma equilibrium codes (CLISTE, IDE @ AUG):
   · real-time control of fusion experiments (moderate numerical resolution, runtime ∼ 1 ms)
   · efficient and scalable offline analysis (high numerical resolution)

Strategy
⊲ rely on Open-Source Software, OSS (no vendor lock-in: technology, licensing, ITER policy)
⊲ employ commodity hardware and open, well-established software standards:
   · x86 Linux servers or clusters
   · standard compilers, programming models and runtimes (Fortran 90, OpenMP, MPI)
⊲ try to exploit all possible levels of parallelism: processors/tasks, cores/threads, instructions

Status
⊲ GPEC fully implemented, validated and documented
⊲ integration into IDE completed (offline analysis of AUG data)
⊲ prototypical MPI parallelization of IDE (towards real-time capability)

SLIDE 4

Context: Grad-Shafranov equation in equilibrium codes

Grad-Shafranov equation in axisymmetric geometry:

$$ \Delta^* \psi = -4\pi^2 \mu_0\, r\, j_\phi $$

with

$$ \Delta^* := r \frac{\partial}{\partial r} \frac{1}{r} \frac{\partial}{\partial r} + \frac{\partial^2}{\partial z^2},
\qquad
j_\phi := r\, \frac{\partial p(r,\psi)}{\partial \psi} + \frac{\mu_0\, F(\psi)}{r}\, \frac{dF(\psi)}{d\psi} $$

Application in equilibrium codes:

⊲ expansion of the toroidal current density j_φ into a linear combination of Np + NF basis functions:

$$ p'(\psi) = \sum_{h=1}^{N_p} c_h\, \pi_h(\psi), \qquad FF'(\psi) = \sum_{k=1}^{N_F} d_k\, \varphi_k(\psi) $$

⊲ for each basis function, solve individually:

$$ \Delta^* \psi_h = -4\pi^2 \mu_0\, r^2\, \pi_h(\psi_{\mathrm{old}}), \qquad \Delta^* \psi_{N_p+k} = -4\pi^2 \mu_0^2\, \varphi_k(\psi_{\mathrm{old}}) $$

⊲ therefore:

$$ \psi = \sum_{i=1}^{N_p} c_i\, \psi_i + \sum_{j=1}^{N_F} d_j\, \psi_{N_p+j} $$

SLIDE 5

Context: Grad-Shafranov equation in equilibrium codes

⊲ obtain the weights c_h, d_k by linear regression against the diagnostic signals m, using the computed "response matrix" b:

$$
\begin{pmatrix} m_1 \\ m_2 \\ \vdots \\ m_{N_m} \end{pmatrix}
=
\begin{pmatrix}
b_1(\psi_1) & b_1(\psi_2) & \cdots & b_1(\psi_{N_p+N_F}) \\
b_2(\psi_1) & b_2(\psi_2) & \cdots & b_2(\psi_{N_p+N_F}) \\
\vdots & \vdots & \ddots & \vdots \\
b_{N_m}(\psi_1) & b_{N_m}(\psi_2) & \cdots & b_{N_m}(\psi_{N_p+N_F})
\end{pmatrix}
\cdot
\begin{pmatrix} c_1 \\ \vdots \\ c_{N_p} \\ d_1 \\ \vdots \\ d_{N_F} \end{pmatrix}
$$

Summary of the complete procedure

⊲ Picard iteration cycle:
  • 1. solve the GS equation ∆*ψ = −g(r, z) individually for the Np + NF basis functions; g(r, z) is computed using ψ from the last iteration step
  • 2. obtain c_h, d_k from linear regression (see the sketch below)
  • 3. compute the new $\psi = \sum_{h=1}^{N_p} c_h\, \psi_h + \sum_{k=1}^{N_F} d_k\, \psi_{N_p+k}$
⊲ typical: 2 . . . 10 iteration steps
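Step 2 is a linear least-squares problem m ≈ b · (c, d). A minimal sketch using LAPACK's DGELS is given below; the array sizes and data are placeholders, and the regression actually used in CLISTE/IDE may differ from this plain, unweighted fit.

```fortran
! Minimal sketch of the regression step: obtain the weights (c_h, d_k) from
! the diagnostic signals m and the response matrix b by unweighted linear
! least squares, using LAPACK's DGELS.  Sizes and data are placeholders;
! the regression actually used in CLISTE/IDE may differ in detail.
program fit_weights_sketch
  implicit none
  integer, parameter :: nm = 40             ! number of diagnostic signals N_m
  integer, parameter :: nb = 8              ! number of basis functions N_p + N_F
  real(8) :: b(nm, nb)                      ! response matrix b_i(psi_j)
  real(8) :: rhs(nm)                        ! signals m_i; overwritten with the solution
  real(8) :: work(2 * (nm + nb))
  integer :: info

  call random_number(b)                     ! placeholder data
  call random_number(rhs)

  ! least-squares solution of  b * x = rhs ;  on exit rhs(1:nb) holds x = (c, d)
  call dgels('N', nm, nb, 1, b, nm, rhs, nm, work, size(work), info)
  if (info /= 0) stop 'dgels failed'

  print '(a, 8f10.5)', 'weights (c_1..c_Np, d_1..d_NF): ', rhs(1:nb)
end program fit_weights_sketch
```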

SLIDE 6

Requirements for computational performance

Real-time control
⊲ AUG machine control cycle: ≃ 1 ms
⊲ assume 10 iteration steps required for convergence
⇒ runtime per iteration must be ≲ 0.1 ms for
   · Np + NF ≃ 8 calls of the GS solver
   · linear regression, calculation of the response matrix, . . .
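Spelling this budget out:

$$ T_{\mathrm{iter}} \lesssim \frac{1\ \mathrm{ms}}{10\ \mathrm{iterations}} = 0.1\ \mathrm{ms},
\qquad
\frac{0.1\ \mathrm{ms}}{N_p + N_F \simeq 8\ \mathrm{solver\ calls}} \approx 0.01\ \mathrm{ms\ per\ call\ (if\ executed\ serially)} $$

which motivates distributing the solver calls for the basis functions over several CPUs, as discussed in the parallelization slides below.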

Offline analysis
⊲ as fast as possible
⊲ goals: allow significantly larger numerical resolution, enable new data-analysis methodologies

Basic strategy
⊲ starting point: the GS solver dominates the runtime; GEC/CLISTE: O(ms) for a single basis function (Np + NF = 1)
⊲ achieve performance gains by parallelization

SLIDE 7

Excursus: why parallel? Moore’s law – reinterpreted

  • H. Sutter: The Free Lunch Is Over. A Fundamental Turn Toward Concurrency in Software

http://www.gotw.ca/publications/concurrency-ddj.htm

  • H. Esmaeilzadeh, et al.: Dark Silicon and the End of Multicore Scaling

http://doi.acm.org/10.1145/2000064.2000108

⊲ Moore's law is holding, in its original terms of the number of transistors
   · transistors on a chip still double every 18 months at constant cost
   · V ∼ l, I ∼ l ⇒ P ∼ l² (constant power density, Dennard scaling)
   · but: the era of decades of exponential clock-rate growth has ended (around 2005)
⊲ Moore's law "reinterpreted"
   · the number of cores per chip doubles every 18 months instead of the clock rate
   · performance improvements are now coming from the increase in the number of cores on a processor, plus other levels of parallelism (e.g. vector instructions)
   · the free (performance) lunch is over
⊲ Future: 100 . . . 1000 cores on a chip?

SLIDE 8

Excursus: parallel challenges

Typical architecture of a present-day x86_64 cluster

[Figure: cluster schematic; nodes (Node 0, 1, 2, . . . ) connected by a network (ETH, IB, . . . ), each node with 4 multicore CPUs, each CPU socket with 2 cores / 4 threads]

⊲ multiple nodes (1 . . . 1000): connected by Ethernet or InfiniBand networks
⊲ multiple sockets (CPUs) per node (2 . . . 4): connected by QPI, HT (NUMA!)
⊲ multiple cores per CPU (2 . . . 16)
⊲ multiple hyper-threads per core (2)
⊲ multiple instructions per cycle (2 . . . 4 DP ops; SSE: 128 bit, AVX: 256 bit)

SLIDE 9

Parallelization strategies

Basic considerations
⊲ exploit fine-grained parallelism (spatial grid) within the multi-core CPU: threads, SIMD vectorization (GPEC)
⊲ exploit coarse-grained parallelism (basis functions) across multiple CPUs (IDE); fits well with the program structure
⊲ flexibility of a hybrid MPI (tasks) / OpenMP (threads) approach, examples:
   · distribute 8 BFs across 8 quad-core CPUs: each GPEC call uses all 4 CPU cores
   · distribute 8 BFs across 2 16-core CPUs: each GPEC call uses only 8 CPU cores

Communication estimates
⊲ required bandwidth for communicating a 32 × 64 array of double-precision numbers (corresponding to a single ψ_k): 32 × 64 · 8 Byte / 0.01 ms ≃ 1.6 GB/s (worked out below)
⊲ inter-node (InfiniBand: QDR ≃ 1 GB/s, FDR ≃ 1.7 GB/s)
⊲ intra-node (STREAM, 4-socket: ≃ 5 GB/s, 2-socket: ≃ 10 GB/s)
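For reference, the quoted bandwidth follows directly from the array size and the 0.01 ms time slot per solver call:

$$ \frac{32 \times 64 \times 8\ \mathrm{Byte}}{0.01\ \mathrm{ms}} = \frac{16\,384\ \mathrm{Byte}}{10^{-5}\ \mathrm{s}} \approx 1.6\ \mathrm{GB/s} $$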

SLIDE 10

GPEC: discretization

GS equation: ∆*ψ = −g(r, z)

⊲ standard finite-difference discretization on an (M × N) grid in (r, z)
⊲ five-point stencil for $\Delta^* := r\,\frac{\partial}{\partial r}\frac{1}{r}\frac{\partial}{\partial r} + \frac{\partial^2}{\partial z^2}$:

$$ s_i^+ \psi_{i+1,j} + \delta_i \psi_{i,j} + s_i^- \psi_{i-1,j} + r_i^{-1}\,(\psi_{i,j+1} + \psi_{i,j-1}) = \rho_{i,j} $$

[Figure: (M × N) finite-difference grid; index i = 1 . . . M in r, index j = 1 . . . N in z]

⊲ solution methods for the resulting linear system (block-tridiagonal, N blocks of size M × M):
   · Block-Cyclic Reduction: GEC
   · Fourier analysis: GPEC
   · FACR, multigrid, . . .

SLIDE 11

GPEC: discretization & FA solution method

Fourier analysis: applying a Discrete Sine Transform (DST) decouples the linear system in one dimension.

Starting from the five-point stencil

$$ s_i^+ \psi_{i+1,j} + \delta_i \psi_{i,j} + s_i^- \psi_{i-1,j} + r_i^{-1}(\psi_{i,j+1} + \psi_{i,j-1}) = \rho_{i,j} $$

apply

$$ \xi_{i,j} = \frac{2}{N} \sum_{l=1}^{N-1} \hat{\xi}_{i,l} \sin(\pi j l / N), \qquad \xi := \rho, \psi $$

which yields, for each Fourier mode l, a tridiagonal system in i:

$$ s_i^+ \hat{\psi}_{i+1,l} + \hat{\delta}_{i,l} \hat{\psi}_{i,l} + s_i^- \hat{\psi}_{i-1,l} = \hat{\rho}_{i,l} $$

⊲ Basic solution scheme (since the 1970s; a naive reference sketch is given below):
  • 1. compute M iDSTs of size N ⇒ ρ̂
  • 2. solve N tridiagonal systems of size M × M ⇒ ψ̂
  • 3. compute M DSTs of size N ⇒ ψ
⊲ optimized FFTs and tridiagonal solver
⊲ fully parallelizable over M, N

[Figure: after the DST the system decouples into N independent tridiagonal M × M blocks]
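A naive, unoptimized reference sketch of the three-step scheme for a single right-hand side follows. The sine transforms are plain O(N²) sums and the tridiagonal solver is a scalar Thomas algorithm; the coefficient values and normalization conventions are illustrative placeholders, not the actual GPEC discretization.

```fortran
! Naive reference sketch of the basic Fourier-analysis solve
! (step 1: M inverse DSTs, step 2: tridiagonal solves, step 3: M DSTs).
program gs_fourier_sketch
  implicit none
  integer, parameter :: m = 32, n = 64        ! (M x N) grid in (r, z)
  real(8), parameter :: pi = 3.141592653589793d0
  real(8) :: rho(m, n-1), rhohat(m, n-1)      ! right-hand side, real and Fourier space
  real(8) :: psihat(m, n-1), psi(m, n-1)      ! solution, Fourier and real space
  real(8) :: sp(m), sm(m), dhat(m, n-1)       ! tridiagonal coefficients s+, s-, delta-hat
  integer :: i, j, l

  call random_number(rho)                     ! placeholder right-hand side
  sp = 1.0d0;  sm = 1.0d0                     ! dummy, diagonally dominant coefficients
  dhat = -4.0d0

  ! 1) M inverse DSTs of size N:  rhohat(i,l) = sum_j rho(i,j) sin(pi*j*l/N)
  do l = 1, n-1
     do i = 1, m
        rhohat(i, l) = 0.0d0
        do j = 1, n-1
           rhohat(i, l) = rhohat(i, l) + rho(i, j) * sin(pi * j * l / n)
        end do
     end do
  end do

  ! 2) N-1 independent tridiagonal systems of size M (Thomas algorithm, no pivoting)
  do l = 1, n-1
     call thomas(m, sm, dhat(:, l), sp, rhohat(:, l), psihat(:, l))
  end do

  ! 3) M DSTs of size N back to real space, with the 2/N normalization
  do j = 1, n-1
     do i = 1, m
        psi(i, j) = 0.0d0
        do l = 1, n-1
           psi(i, j) = psi(i, j) + psihat(i, l) * sin(pi * j * l / n)
        end do
        psi(i, j) = psi(i, j) * 2.0d0 / n
     end do
  end do

  print *, 'psi(16,16) =', psi(16, 16)

contains

  subroutine thomas(nn, a, b, c, r, x)
    ! Solve one tridiagonal system  a(i) x(i-1) + b(i) x(i) + c(i) x(i+1) = r(i).
    integer, intent(in)  :: nn
    real(8), intent(in)  :: a(nn), b(nn), c(nn), r(nn)
    real(8), intent(out) :: x(nn)
    real(8) :: cp(nn), rp(nn), denom
    integer :: i
    cp(1) = c(1) / b(1)
    rp(1) = r(1) / b(1)
    do i = 2, nn
       denom = b(i) - a(i) * cp(i-1)
       cp(i) = c(i) / denom
       rp(i) = (r(i) - a(i) * rp(i-1)) / denom
    end do
    x(nn) = rp(nn)
    do i = nn-1, 1, -1
       x(i) = rp(i) - cp(i) * x(i+1)
    end do
  end subroutine thomas

end program gs_fourier_sketch
```

The optimized GPEC implementation replaces the O(N²) sums by FFTW transforms and the scalar Thomas sweep by a vectorized multi-system solver, as described on the implementation slides.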

SLIDE 12

More details of the complete GPEC algorithm

Complete algorithm (von Hagenow & Lackner, 1975)
⊲ additional "inner" iteration for taming the Dirichlet boundary condition:
  • 1. solve ∆*ψ₀ = −g(r, z) in R with ψ₀ = 0 on ∂R
  • 2. compute ψ on ∂R using the Green's function:

$$ \psi_1(\mathbf{r}) = -\oint_{\partial R} \frac{1}{r^*}\, G(\mathbf{r}, \mathbf{r}^*)\, \frac{\partial \psi_0(\mathbf{r}^*)}{\partial n}\, ds^* $$

  • 3. solve ∆*ψ = −g(r, z) using ψ₁ on ∂R
⊲ the (iDST → tridiagonal solve → DST) cycle is thus applied twice; some work is saved in between by transforming only the "boundary" values (cf. Giannone et al. 2010)

Main algorithmic steps
  • dst1: M inverse discrete sine transforms (iDSTs) of size N
  • trid1: solution of N tridiagonal linear systems of size M
  • dstspare1: 2 DSTs of size N plus 2 × (M − 2) scalar products of size N for computing 2(N + M − 2) Fourier coefficients
  • matvec: 1 matrix-vector multiplication of rank 2(N + M) × 2(N + M)
  • dstspare2: 2 iDSTs of size N plus a corresponding rank-1 update of an N × M matrix
  • trid2: solution of N tridiagonal linear systems of size M
  • dst2: M DSTs of size N

SLIDE 13

Computational performance of GPEC

Parallel runtimes (per iteration) on n cores:

  grid                     32 × 64    64 × 128   128 × 256   256 × 512
                      n    Tn [ms]    Tn [ms]    Tn [ms]     Tn [ms]
  Xeon W5580 3.2 GHz  4    0.040      0.114      0.432       3.550
  Xeon E5540 2.7 GHz  4    0.043      0.130      0.489       3.163
  Xeon X7542 2.7 GHz  6    0.048      0.113      0.343       3.379

[Figure: parallel runtimes (per iteration) on 4 cores (Xeon W5580) vs. grid size]

Observations:
⊲ the parallel runtime roughly scales with the number of grid points M · N (theoretical complexity)
⊲ large grids: cache and memory bandwidth limit performance
⊲ small grids: parallel overhead limits performance

SLIDE 14

Parallel performance of GPEC

Parallel scaling with the number of cores

[Figure: parallel scaling on a 2-socket server (2 × quad-core Intel Xeon W5580 @ 3.2 GHz)]

Limits for parallel scaling and efficiency:
⊲ overhead: synchronization of threads
⊲ algorithmic: vector efficiency of the tridiagonal solver
⊲ memory access to the "remote" CPU (NUMA)
⊲ positive parallel speedups obtained on a 10-core CPU (Xeon E7 8870)
SLIDE 15

Implementation of GPEC
Discrete Sine Transform (dst)

⊲ FFTW (www.fftw.org), the "Fastest Fourier Transform in the West/World"?
⊲ OSS, widely adopted in scientific computing
⊲ explicit parallelization (the intrinsic FFTW parallelism is not efficient for small 1D DSTs)

[Figure: (M × N) grid; the M one-dimensional transforms along z are distributed over threads 0 . . . n]

⊲ compute 1D transformations

$$ \xi_j = \frac{2}{N} \sum_{l=1}^{N-1} \hat{\xi}_l \sin(\pi j l / N), \qquad j = 1 \ldots N-1, \quad \xi := \rho, \psi $$

⊲ uses precomputed FFTW plans (machine-specific optimization)
⊲ grouping individual 1D DSTs allows FFTW-internal optimizations to be exploited (a minimal usage sketch follows below)
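For illustration, a minimal way to set up and reuse DST-I plans with FFTW's legacy Fortran interface might look as follows. The plan flags, normalization handling, data layout and threading scheme are assumptions for this sketch, not the actual GPEC choices; FFTW's RODFT00 transform is unnormalized and already carries a factor 2.

```fortran
! Sketch of the DST setup with FFTW's legacy Fortran interface: each OpenMP
! thread creates its own DST-I (RODFT00) plan once (plan creation is not
! thread-safe, hence the critical sections) and then reuses it for its share
! of the M column transforms.
program dst_fftw_sketch
  implicit none
  include 'fftw3.f'                      ! FFTW constants (FFTW_RODFT00, ...)
  integer, parameter :: m = 32, n = 64
  integer(8) :: plan
  real(8) :: field(n-1, m)               ! z-columns stored contiguously
  real(8) :: in(n-1), out(n-1)
  integer :: i

  call random_number(field)              ! placeholder data

  !$omp parallel private(plan, in, out, i)
  !$omp critical
  call dfftw_plan_r2r_1d(plan, n-1, in, out, FFTW_RODFT00, FFTW_ESTIMATE)
  !$omp end critical
  !$omp do
  do i = 1, m
     in = field(:, i)
     call dfftw_execute(plan)            ! DST-I of one z-column
     field(:, i) = out / n               ! 2/N normalization (RODFT00 carries a factor 2)
  end do
  !$omp end do
  !$omp critical
  call dfftw_destroy_plan(plan)
  !$omp end critical
  !$omp end parallel

  print *, 'transformed field(1,1) =', field(1, 1)
end program dst_fftw_sketch
```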

SLIDE 16

Implementation of GPEC
Tridiagonal solver (trid)

⊲ our own development, based on the VERTEX code (Rampp & Janka, Astron. & Astrophys. 2002)
⊲ combines SIMD vectorization and multi-core parallelization
⊲ fits the "transposed" memory layout of the main work arrays (allows the DSTs to be performed on contiguous memory sections)
⊲ conceptually simple to implement (see IPP-Report R/47, 2012)

[Figure: (M × N) grid; the N independent tridiagonal solves along r are distributed over threads 0 . . . n]

⊲ solver for sets of independent tridiagonal systems

$$ s_i^+ \hat{\psi}_{i+1,l} + \hat{\delta}_{i,l} \hat{\psi}_{i,l} + s_i^- \hat{\psi}_{i-1,l} = \hat{\rho}_{i,l} $$

⊲ uses a precomputed LU decomposition (s_i, δ_i are constants)
⊲ no pivoting
⊲ grouping individual solver calls allows SIMD vectorization to be exploited within a thread (a schematic sketch follows below); acknowledgement: Dr. R. Fischer, NEC Inc.
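A schematic sketch of this idea is given below: the system index l is the contiguous (innermost) array dimension, so the elimination sweeps vectorize over the independent systems, and the factorization is precomputed once. The module, array names and dummy coefficients are illustrative assumptions, not the actual VTRS routine.

```fortran
! Sketch of a tridiagonal solver for sets of independent systems: the system
! index l is innermost, so the forward/backward sweeps vectorize over l;
! the LU-type factorization is precomputed, since the coefficients are constant.
module many_tridiag
  implicit none
contains
  subroutine factorize(m, nsys, sm, sp, dhat, cp, dinv)
    ! Precompute the Thomas-elimination factors (no pivoting).
    integer, intent(in)  :: m, nsys
    real(8), intent(in)  :: sm(m), sp(m)       ! sub-/super-diagonals s^-, s^+ (l-independent)
    real(8), intent(in)  :: dhat(nsys, m)      ! diagonal delta-hat, different for each system l
    real(8), intent(out) :: cp(nsys, m), dinv(nsys, m)
    integer :: i
    dinv(:, 1) = 1.0d0 / dhat(:, 1)
    cp(:, 1)   = sp(1) * dinv(:, 1)
    do i = 2, m
       dinv(:, i) = 1.0d0 / (dhat(:, i) - sm(i) * cp(:, i-1))   ! vectorizes over l
       cp(:, i)   = sp(i) * dinv(:, i)
    end do
  end subroutine factorize

  subroutine solve(m, nsys, sm, cp, dinv, rhs, x)
    ! Forward elimination and back substitution for all nsys systems at once.
    integer, intent(in)  :: m, nsys
    real(8), intent(in)  :: sm(m), cp(nsys, m), dinv(nsys, m), rhs(nsys, m)
    real(8), intent(out) :: x(nsys, m)
    integer :: i
    x(:, 1) = rhs(:, 1) * dinv(:, 1)
    do i = 2, m
       x(:, i) = (rhs(:, i) - sm(i) * x(:, i-1)) * dinv(:, i)
    end do
    do i = m - 1, 1, -1
       x(:, i) = x(:, i) - cp(:, i) * x(:, i+1)
    end do
  end subroutine solve
end module many_tridiag

program demo
  use many_tridiag
  implicit none
  integer, parameter :: m = 32, nsys = 63
  real(8) :: sm(m), sp(m), dhat(nsys, m), cp(nsys, m), dinv(nsys, m)
  real(8) :: rhs(nsys, m), x(nsys, m)
  sm = 1.0d0;  sp = 1.0d0;  dhat = -4.0d0     ! dummy, diagonally dominant coefficients
  call random_number(rhs)
  call factorize(m, nsys, sm, sp, dhat, cp, dinv)
  call solve(m, nsys, sm, cp, dinv, rhs, x)   ! all nsys systems solved in one pass
  print *, 'x(1,1) =', x(1, 1)
end program demo
```

Distributing blocks of the l index over OpenMP threads then adds the multi-core level on top of the SIMD level.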

SLIDE 17

Implementation of GPEC
Matrix-vector multiplication (matvec)

⊲ GotoBLAS2 / OpenBLAS (https://github.com/xianyi/OpenBLAS)
⊲ parallel-efficient implementation of DGEMV
⊲ OSS, might be superseded by projects like ATLAS, MAGMA, . . .

[Figure: (M × N) computational grid]

⊲ the matrix-vector multiplication computes

$$ \psi_1(\mathbf{r}) = -\oint_{\partial R} \frac{1}{r^*}\, G(\mathbf{r}, \mathbf{r}^*)\, \frac{\partial \psi_0(\mathbf{r}^*)}{\partial n}\, ds^* $$

   matrix: precomputed integration weights; vector: ∂ψ₀/∂n
⊲ boundary values are linearized into a vector
⊲ uses the internal parallelization of OpenBLAS (call sketch below)
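A minimal sketch of this step as a single BLAS call is shown below; the matrix of integration weights and all sizes are placeholders for illustration.

```fortran
! Sketch: evaluate the boundary values psi1 = -W * dpsi0_dn with BLAS DGEMV,
! where W holds precomputed Green's-function integration weights
! (1/r*) G(r, r*) ds* for all pairs of boundary points.
program boundary_matvec_sketch
  implicit none
  integer, parameter :: nb = 2 * (32 + 64)      ! number of boundary points, 2(M + N)
  real(8) :: w(nb, nb)                          ! precomputed integration weights
  real(8) :: dpsi0_dn(nb)                       ! normal derivative of psi0 on the boundary
  real(8) :: psi1(nb)                           ! Dirichlet values for the second solve

  call random_number(w)                         ! placeholder data
  call random_number(dpsi0_dn)

  ! psi1 := -1.0 * W * dpsi0_dn + 0.0 * psi1   (reference BLAS / OpenBLAS DGEMV)
  call dgemv('N', nb, nb, -1.0d0, w, nb, dpsi0_dn, 1, 0.0d0, psi1, 1)

  print *, 'psi1(1) =', psi1(1)
end program boundary_matvec_sketch
```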

SLIDE 18

The "price" of Open-Source Software: performance penalties in GPEC?

Compare¹ our OSS components with highly optimized routines taken from the commercial Intel Math Kernel Library (MKL).

¹ Caveat: performance depends on grid size, software release, platform, implementation details, . . .

  • Discrete Sine Transform
    ⊲ MKL's DSTs are faster by a factor of ≃ 1.5 wrt. FFTW
    ⇒ overall performance penalty: ≃ 20%
    note: performance is sensitive to M, N, n and the gory details of the implementation

  • Tridiagonal solver
    ⊲ MKL's DDTTRSB is slower by a factor of ≃ 2.5 wrt. our own routine VTRS (even in an artificial setup favourable for MKL)
    ⇒ overall performance improvement: > 35%
    note: routines like DDTTRSB are not optimized for computing sets of independent systems, so there is no advantage from SIMD vectorization

  • Matrix-vector multiplication
    ⊲ MKL's DGEMV is generally slower (less scalable) wrt. OpenBLAS
    ⇒ overall performance improvement
    note: GotoBLAS/OpenBLAS is well known for its superior performance

SLIDE 19

Summary & Outlook

Summary
⊲ GPEC: a fast numerical Grad-Shafranov solver based on Open-Source Software (OSS)
⊲ achieved runtimes on a 3.2 GHz quad-core CPU:
   · 0.04 ms for grid size 32 × 64
   · 0.11 ms for grid size 64 × 128
⊲ lower clock frequencies can be compensated by using more cores
⊲ the parallel algorithm grants flexibility with respect to upcoming multi-core processor developments
⊲ the OSS solution is highly competitive, performance-wise
⊲ GPEC enables "real-time" applications in equilibrium codes

Outlook
⊲ further GPEC optimization: assessment of other methods (parallel BCR, FACR, . . . )
⊲ IDE/GPEC: towards a hybrid MPI/OpenMP parallelization (prototype implementation done)
⊲ estimate for a 32 × 64 grid with 8 basis functions on a 4-socket x86_64 server (Intel "Nehalem"): 1/4 · 8 · 0.04 ms + communication + linear regression + . . . (spelled out below)
⊲ challenges: communication (MPI_Allreduce) starts to dominate parallel runtimes
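Spelling out the estimate (assuming the 8 basis functions are spread evenly over the 4 sockets and each GPEC call runs in the quoted 0.04 ms):

$$ \tfrac{1}{4} \cdot 8 \cdot 0.04\ \mathrm{ms} = 0.08\ \mathrm{ms}\ \text{per iteration for the GS solves,} $$

which would leave roughly 0.02 ms per iteration for communication and the linear regression if 10 iterations are to fit into the 1 ms control cycle.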

SLIDE 20

Summary & Outlook (contd.)

First IDE benchmarks (very preliminary)
⊲ complete equilibrium computed on a 32 × 64 grid, 7 basis functions, on a 4-socket x86_64 server (Xeon X7542 @ 2.7 GHz):

  CPUs   cores   T/iteration [ms]   T_GPEC [ms]
   1       6          0.9              0.45
   2      12          0.5              0.22
   4      24          0.4              0.11

⊲ we are still missing a factor of 2 . . . 4 (target: 5 . . . 10 iterations per ms); possible remedies:
   · Intel Sandy Bridge processors (2× per core due to AVX, 8 cores)
   · faster interconnect
   · thorough optimization