SLIDE 1

From Stencils to Elliptic PDE Solvers

Lehrstuhl für Simulation FAU Erlangen-Nürnberg

  • U. Rüde (FAU Erlangen, ulrich.ruede@fau.de)

joint work with B. Gmeiner, H. Köstler, H. Stengel (FAU)

  • M. Huber, C. Waluga, L. John, B. Wohlmuth (TUM)
  • M. Mohr, J. Weismüller, P. Bunge (LMU)


Advanced Stencil-Code Engineering

  • April 12-17, 2015

Seminar 15161

SLIDE 2

What are we up to?

Stencil Code Engineering?

  • One step of designing efficient parallel algorithms:

application ➞ model ➞ discretization ➞ solver ➞ simulation ➞ validation

this week at Dagstuhl:

  • Opportunity to build a bridge between CS and Math

but danger of babylonizing the theme. In my talk I will briefly touch on three topics:

  • How can we optimize stencil codes? Architecture awareness can bring large speedups.
  • What algorithms should be considered? Jacobi iteration is not enough.
  • Where do we stand? HHG, a stencil-based FE solver, as an example.

SLIDE 3

Stencils

A stencil is a geometric pattern of weights applied to a grid function at each location in a structured grid.

Example: the mother of all PDEs (the Laplace equation in 2D):

$$-\Delta u = f \quad\longrightarrow\quad \frac{1}{h^2}\begin{bmatrix} & -1 & \\ -1 & 4 & -1 \\ & -1 & \end{bmatrix} u_h = f_h$$

Structured grid in d dimensions: $G_h \subset h\mathbb{Z}^d$; a (scalar, real-valued) grid function $u_h : G_h \to \mathbb{R}$ is an element of a vector space on the structured grid. Constant stencils on a rectangular (cuboid) grid are related to sparse Toeplitz matrices via lexicographic linearization of the grid function; a Toeplitz matrix is a banded matrix with constant entries along the diagonals. On a finite grid: what about the boundaries?

SLIDE 4

Applying a stencil

A stencil can be applied to a grid function: a "sweep" results in a new grid function.

  • Occurs in: filters (signal processing) and linear iterative schemes, with simultaneous (Jacobi) versus successive (Gauss-Seidel) updates.

When the stencil has s entries and the grid has N points, the computational cost is 2sN flops. The stencil application (sweep) is often memory bound; to climb the memory wall one may use spatial and temporal blocking (when stencils are applied repeatedly). A minimal sweep sketch follows.
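To make the cost statement concrete, here is a minimal sketch (my illustration, not from the slides) of one Jacobi-type sweep with the 5-point Laplace stencil; with s = 5 entries it performs roughly 2sN flops but streams the whole grid through memory, which is why sweeps are typically memory bound:

    #include <vector>

    // One "sweep": apply the 5-point stencil (1/h^2)[ -1; -1 4 -1; -1 ] as a
    // Jacobi update to the interior of an n x n grid function u (lexicographic
    // storage). Boundary values stay fixed; all names here are illustrative.
    void sweep(const std::vector<double>& u, const std::vector<double>& f,
               std::vector<double>& u_new, int n, double h)
    {
        for (int j = 1; j < n - 1; ++j)
            for (int i = 1; i < n - 1; ++i) {
                const int k = j * n + i;   // lexicographic linearization
                // equivalent to u_new = (I - (h^2/4) A_h) u + (h^2/4) f
                u_new[k] = 0.25 * (u[k - 1] + u[k + 1] + u[k - n] + u[k + n]
                                   + h * h * f[k]);
            }
    }

Writing into the separate array u_new gives the simultaneous (Jacobi) update; overwriting u in place, so that already-updated neighbors are read, would give the successive (Gauss-Seidel) variant instead.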

SLIDE 5

Stencil optimizations

DFG project (1996-2008): Data Local Iterative Methods for the Efficient Solution of Partial Differential Equations (DiME), with the DiME Pack software. Focus on cache blocking techniques and their interplay with the CPU microarchitecture (pipelining, superscalarity, in-order/out-of-order execution, etc.); systematic performance analysis, monitoring tools, performance counters, ... 2D and 3D grids, mostly structured: tiling, blocking, fusing, ..., but also memory layout transformations: red-black ordering, padding, ... Example: temporally skewed 2D blocking, especially beneficial for in-order architectures with a small and fast L1 cache (see the sketch after the link below).

https://www10.informatik.uni-erlangen.de/Research/Projects/DiME/index.html
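A minimal sketch of the temporal skewing idea (my illustration, in 1D for clarity, not DiME code): tiles of width B are shifted left by one cell per time step, which respects the dependences of the ping-pong update, so all T time steps of a tile can be computed while its points sit in cache:

    #include <algorithm>
    #include <vector>

    // T time steps of the 1D 3-point Jacobi stencil u_new[i] = 0.5*(u[i-1]+u[i+1])
    // with fixed boundary values, using time-skewed tiles of width B (B >= 2).
    // u[0] and u[1] are the two ping-pong buffers; the result ends up in u[T % 2].
    void skewedJacobi(std::vector<double> u[2], int T, int B)
    {
        const int N = static_cast<int>(u[0].size());
        u[1][0]     = u[0][0];         // both parity buffers need the
        u[1][N - 1] = u[0][N - 1];     // fixed boundary values
        for (int s = 1; s < N - 1 + T; s += B)        // skewed tiles, left to right
            for (int t = 0; t < T; ++t) {             // all T steps per tile
                const int lo = std::max(1, s - t);    // tile shifts left by 1 per step
                const int hi = std::min(N - 1, s - t + B);
                const std::vector<double>& src = u[t % 2];
                std::vector<double>&       dst = u[(t + 1) % 2];
                for (int i = lo; i < hi; ++i)
                    dst[i] = 0.5 * (src[i - 1] + src[i + 1]);
            }
    }

Compared with T separate full sweeps, each point is updated T times while its O(B) working set stays in cache; this is the kind of temporal blocking the project exploited, there in 2D and combined with the spatial techniques listed above.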

SLIDE 6

DiME project archive

1. M. Stürmer, H. Köstler, and U. Rüde. A fast full multigrid solver for applications in image processing. Numer. Linear Algebra Appl., 15:187–200, 2008.
2. Josef Weidendorfer and Carsten Trinitis. Off-loading Application controlled Data Prefetching in numerical Codes for Multi-Core Processors. Int. J. High Performance Computing and Networking, 4(1):22–28, 2008.
3. M. Stürmer, J. Treibig, and U. Rüde. Optimising a 3D Multigrid Algorithm for the IA-64 Architecture. International Journal of Computational Science and Engineering (IJCSE), 4(1):29–35, 2008.
4. Tobias Gradl and Ulrich Rüde. Massively Parallel Multilevel Finite Element Solvers on the Altix 4700. inSiDE, 5(2):24–29, 2007.
5. C. Freundl, T. Gradl, U. Rüde, and B. Bergen. Towards Petascale Multilevel Finite Element Solvers. In Petascale Computing: Algorithms and Applications. Chapman & Hall/CRC, December 2007.
6. M. Stürmer, J. Götz, G. Richter, and U. Rüde. Blood Flow Simulation on the Cell Broadband Engine using the Lattice Boltzmann Method. Technical Report 07-9, Lehrstuhl für Informatik 10 (Systemsimulation), Friedrich-Alexander-Universität Erlangen-Nürnberg, September 2007.
7. H. Köstler, M. Stürmer, C. Freundl, and U. Rüde. PDE based Video Compression in Real Time. Technical Report 07-11, Lehrstuhl für Informatik 10 (Systemsimulation), Friedrich-Alexander-Universität Erlangen-Nürnberg, August 2007.
8. M. Stürmer, H. Köstler, and U. Rüde. A fast multigrid solver for applications in image processing. Technical Report 07-6.
9. C. C. Douglas, U. Rüde, J. Hu, and M. L. Bittencourt. A Guide to Designing Cache Aware Multigrid Algorithms. Technical Report 07-3.
10. B. Bergen, T. Gradl, F. Hülsemann, and U. Rüde. A Massively Parallel Multigrid Method for Finite Elements. Computing in Science and Engineering, 8(6):56–62, December 2006.
11. G. Wellein, T. Zeiser, G. Hager, and S. Donath. On the single processor performance of simple lattice Boltzmann kernels. Computers & Fluids, 35(8–9):910–919, November 2006.
12. M. Stürmer, J. Treibig, and U. Rüde. Optimizing a 3D Multigrid Algorithm for the IA-64 Architecture. In Proc. of the ASIM-06 Conf., Frontiers in Simulation. SCS, 2006.
13. Josef Weidendorfer and Carsten Trinitis. Block Prefetching for Numerical Codes. In Proc. of the ASIM-06 Conf., Frontiers in Simulation. SCS, 2006.
14. A. Nitsure, K. Iglberger, U. Rüde, C. Feichtinger, G. Wellein, and G. Hager. Optimization of Cache Oblivious Lattice Boltzmann Method in 2D and 3D. In Proc. of the ASIM-06 Conf., Frontiers in Simulation. SCS, 2006.
15. A. Nitsure. Implementation and optimization of a cache-oblivious Lattice Boltzmann algorithm. Master's thesis, Lehrstuhl für Informatik 10 (Systemsimulation), Friedrich-Alexander-Universität Erlangen-Nürnberg, August 2006.
16. Josef Weidendorfer and Carsten Trinitis. Cache Optimizations for Iterative Numerical Codes Aware of Hardware Prefetching. Volume 3732 of Lecture Notes in Computer Science, pages 921–927. Springer, 2006.
17. J. Götz. Simulation of blood flow in aneurysms using the lattice Boltzmann method and an adapted data structure. Technical Report 06-6, 2006.
18. S. Donath, T. Zeiser, G. Hager, J. Habich, and G. Wellein. Optimizing Performance of the Lattice Boltzmann Method for Complex Structures on Cache-based Architectures. In F. Hülsemann, M. Kowarschik, and U. Rüde, editors, 18th Symposium Simulationstechnique ASIM 2005 Proceedings, volume 15 of Frontiers in Simulation, pages 728–735. ASIM, SCS Publishing House, September 2005.
19. J. Treibig, S. Hausmann, and U. Rüde. Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures. In F. Hülsemann, M. Kowarschik, and U. Rüde, editors, 18th Symposium Simulationstechnique ASIM 2005.
20. B. Bergen. Hierarchical Hybrid Grids: Data Structures and Core Algorithms for Efficient Finite Element Simulations on Supercomputers. PhD thesis, FAU Erlangen, 2005.
21. Josef Weidendorfer and Carsten Trinitis. Collecting and Exploiting Cache-Reuse Metrics. In ICCS 2005: 5th International Conference on Computational Science, volume 3515 of LNCS, pages 191–198. Springer, May 2005.
22. B. Bergen, F. Hülsemann, and U. Rüde. Is 1.7×10^10 Unknowns the Largest Finite Element System that Can Be Solved Today? In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 2005. IEEE Computer Society.
23. T. Pohl, N. Thürey, F. Deserno, U. Rüde, P. Lammers, G. Wellein, and T. Zeiser. Performance Evaluation of Parallel Large-Scale Lattice Boltzmann Applications on Three Supercomputing Architectures. Supercomputing Conference 04, November 2004.
24. Markus Kowarschik. Data Locality Optimizations for Iterative Numerical Algorithms and Cellular Automata on Hierarchical Memory Architectures. PhD thesis, SCS Publishing House, Germany, July 2004.
25. Markus Kowarschik, Iris Christadler, and Ulrich Rüde. Towards Cache-Optimized Multigrid Using Patch-Adaptive Relaxation. In Proceedings of the 2004 Conference on Applied Parallel Computing (PARA'04), Copenhagen, Denmark, June 2004. Lecture Notes in Computer Science (LNCS), Springer.
26. Josef Weidendorfer, Markus Kowarschik, and Carsten Trinitis. A Tool Suite for Simulation Based Analysis of Memory Access Behavior. In Proceedings of the 2004 International Conference on Computational Science, Krakow, Poland, June 2004. Lecture Notes in Computer Science (LNCS), vol. 3038, Springer.
27. Jan Treibig et al. Performance Analysis of the Lattice Boltzmann Method on x86-64 Architectures. In Proceedings of the ASIM-05 Conference, volume 2790 of Frontiers in Simulation, pages 441–450. SCS, 2003.
28. Markus Kowarschik and Christian Weiß. An Overview of Cache Optimization Techniques and Cache-Aware Numerical Algorithms. Proceedings of the GI-Dagstuhl

SLIDE 7

ExaStencils

DFG SPPExa project, 2013-15. Domain Specific Language (DSL) approach: http://www.exastencils.org

  • Optimizing stencil codes by transformations in the context of multigrid algorithms; several talks upcoming here at Dagstuhl.

SLIDE 8

Algorithms:

Good stencil codes are hierarchical

8

4 4 4 4 4 4

  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1
  • 1

u11 u1n u21 unn u2n

=

SLIDE 9

Solving systems described by stencils

The system matrix described by a stencil is sparse by construction; its inverse, however, is often dense. Application background: the stencil describes a geometrically local relation (each point depends only on its neighbors), but the goal is to compute the global solution (in which every point influences every other one).

Laplace stencil:

$$A_h = \frac{1}{h^2}\begin{bmatrix} & -1 & \\ -1 & 4 & -1 \\ & -1 & \end{bmatrix}$$

Jacobi iteration (one sweep of stencil application corresponds to a sparse matrix-vector product):

$$u_h^{\mathrm{new}} = \left(I - \frac{h^2}{4} A_h\right) u_h + \frac{h^2}{4} f_h$$

Exact solution:

$$u_h = A_h^{-1} f_h$$

SLIDE 10

Global data exchange is essential

Example: a hydraulic system. An incompressible fluid is discretized by grid functions; stencils describe the flow of brake fluid between neighboring cells: local conservation of momentum, local conservation of mass. But the system is globally incompressible: a change at the inflow (brake pedal) has an immediate effect at the outflow (brake pistons). How many stencil sweeps are necessary?

(image source: Wikipedia)

Stokes system:

$$\begin{pmatrix} A_h & & G_{xh} \\ & A_h & G_{yh} \\ G_{xh}^T & G_{yh}^T & \end{pmatrix} \begin{pmatrix} u_h \\ v_h \\ p_h \end{pmatrix} = \begin{pmatrix} f_{xh} \\ f_{yh} \\ 0 \end{pmatrix}$$

SLIDE 11

Solving systems described by stencils

A stencil application propagates information by just one grid cell per sweep: if the diameter of the grid is M cells, then at least M sweeps are required to propagate information across the system. The complexity of any one-grid solver is therefore bounded below by

$$\mathrm{Cost} = 2sN \times M.$$

Often $M = O(N^{1/d})$, but it can be worse, e.g. $M = O(N)$ for a pipe system whose mesh diameter is O(N). This complexity for a one-grid solver is achieved e.g. by Krylov-space methods or successive over-relaxation (SOR). Better complexity (getting rid of the factor M) can be achieved only by hierarchical techniques: FFT, multigrid, fast multipole, ... others?

SLIDE 12

The efficient solution of many stencil-based systems requires hierarchical structures!

Just optimizing stencil sweeps will not do! The core principle: divide et impera!

Hierarchy, recursion, tree structures.

SLIDE 13

Hierarchical Algorithms: Sort

(diagram: an unsorted sequence is split, each half is sorted recursively ("Sort!"), and the sorted halves are fused into the sorted sequence: Split, Recurse, Fuse)

SLIDE 14

Two different archetypes of recursive sorting algorithms:

Merge sort: split arbitrarily (trivial); fuse by "merging" (dominating cost).

Quicksort: split at a separator (dominating cost); fuse by concatenation (trivial).

The same algorithmic principles can be used as the starting point to design fast solvers for stencil systems; the two archetypes are contrasted in the sketch below.
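For reference, the two archetypes side by side (standard textbook formulations, added here for illustration); note where the dominating work sits: in the fuse step for merge sort, in the split step for quicksort:

    #include <algorithm>
    #include <vector>

    // Merge sort: trivial split, all the work is in the fuse ("merge") step.
    void mergeSort(std::vector<int>& v, int lo, int hi) {        // sorts v[lo, hi)
        if (hi - lo < 2) return;
        const int mid = lo + (hi - lo) / 2;                      // split: arbitrary
        mergeSort(v, lo, mid);                                   // recurse
        mergeSort(v, mid, hi);
        std::inplace_merge(v.begin() + lo, v.begin() + mid,      // fuse: dominating cost
                           v.begin() + hi);
    }

    // Quicksort: all the work is in the split (partition) step, trivial fuse.
    void quickSort(std::vector<int>& v, int lo, int hi) {        // sorts v[lo, hi)
        if (hi - lo < 2) return;
        const int pivot = v[lo + (hi - lo) / 2];
        auto m1 = std::partition(v.begin() + lo, v.begin() + hi, // split: dominating cost
                                 [pivot](int x) { return x < pivot; });
        auto m2 = std::partition(m1, v.begin() + hi,             // three-way: middle part
                                 [pivot](int x) { return x <= pivot; });
        quickSort(v, lo, static_cast<int>(m1 - v.begin()));      // recurse on both parts;
        quickSort(v, static_cast<int>(m2 - v.begin()), hi);      // fuse: nothing to do
    }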

slide-15
SLIDE 15

A x b       α β γ ... ... ... β γ α             x1 . . . . . . xn       =       b1 . . . . . . bn      

Stencils and Elliptic Solvers - Ulrich Rüde

Hierarchical Algorithms for Linear Systems

Divide and Conquer for 1D stencil system

1-D stencil leads to tridiagonal Toeplitz matrix

15

[γ α β]

slide-16
SLIDE 16

A =                   α β γ ... ... ... ... β γ α β γ α β γ α β γ ... ... ... ... β γ α                   =      A11 X Y T S XT Y A22     

Stencils and Elliptic Solvers - Ulrich Rüde

In analogy to quicksort: determine a separator

16

1st approach:

start from quicksort principle

SLIDE 17

Divide and Conquer!

Find a separator S. Split the system and form the Schur complement

$$T = S - Y^T A_{11}^{-1} X - X^T A_{22}^{-1} Y,$$

then solve the Schur complement system $T x_S = \tilde f_S$ for the separator unknowns (here just one scalar unknown). Recurse: using the separator value(s), in parallel solve with $A_{11}$ for the upper block of unknowns and with $A_{22}$ for the lower block. Fuse: nothing to do.

The method can also be interpreted as a two-sided shooting method for a two-point boundary value problem.
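Spelled out, the elimination behind the Schur complement (a routine computation added for completeness, following the 3x3 block structure of slide 16): the outer rows give

$$x_1 = A_{11}^{-1}(f_1 - X x_S), \qquad x_2 = A_{22}^{-1}(f_2 - Y x_S),$$

and substituting into the separator row $Y^T x_1 + S x_S + X^T x_2 = f_S$ yields

$$\left(S - Y^T A_{11}^{-1} X - X^T A_{22}^{-1} Y\right) x_S = f_S - Y^T A_{11}^{-1} f_1 - X^T A_{22}^{-1} f_2 =: \tilde f_S.$$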

SLIDE 18

but ... this is not ready yet ...

The inverses $A_{11}^{-1}$ and $A_{22}^{-1}$ are dense matrices and must not be computed explicitly. Use an LU factorization instead, which can be re-used in the Schur complement and for solving the two subsystems. But even better, use recursion: i.e., use a divide-et-impera method to compute the LU factorization implicitly. This results in a recursive split of the linear system along a binary tree; at the leaves we get trivial systems with just one unknown. The tree must be traversed twice (in analogy to parsing and attribute evaluation in a syntax tree). The resulting algorithm is equivalent to cyclic reduction. Note that:

  • 1. the tree is built top-down
  • 2. information is collected into the root bottom-up
  • 3. information is distributed towards the leaves top-down

The algorithm can be realised with stencil applications, with stencils of growing width; this is essential for the global data transport.

SLIDE 19

But it only gets interesting once we try to generalize the principle further

For linear systems: whenever a "small" separator can be found: nested dissection algorithms, substructuring techniques. The complexity of solving the Schur complement system often dominates: a dense separator system of size m costs O(m^3) to solve, so the overall complexity is asymptotically optimal only when the separator is smaller than $O(N^{1/3})$. If one avoids the exact computation of the Schur complement, this becomes an iterative process and results in one of the so-called domain decomposition algorithms.

(figure: elimination tree for banded linear systems, with subdomain blocks $A_{11}, A_{22}, \dots$ at the leaves and separators S at the interior nodes)
slide-20
SLIDE 20

Stencils and Elliptic Solvers - Ulrich Rüde

Complexity of iterative nested dissection (aka domain decomposition)

Assume a partitioning into two subsets, each with N/2 unknowns, and an iterative scheme that must visit each partition k times. Assume that an optimal algorithm exists that solves the problem with cost Cost(N) = cN. Then a domain decomposition algorithm with 2 subdomains and k visits to each subdomain will cost

$$\mathrm{DD\text{-}Cost}(N) = 2k\,\mathrm{Cost}(N/2) = 2k \cdot c\,\frac{N}{2} = kcN.$$

Consequence: for k > 1 (i.e., if each subdomain is visited more than once), a domain decomposition algorithm will not be optimal.

SLIDE 21

2nd approach: start from the mergesort principle.

Recursive splitting spans a binary tree; single elements (the leaves) are trivially sorted. Form, bottom-up by "merging": sorted pairs, sorted 4-tuples, sorted 8-tuples, sorted 16-tuples, ..., the sorted full sequence.

For tridiagonal systems this is analogous to cyclic reduction.

SLIDE 22

Cyclic reduction as an algorithm with recursive structure analogous to mergesort

For the model stencil [1 -2 1], combining every second equation with half of each of its two neighbors eliminates every second unknown and yields equations of the same form at double stride:

$$\begin{aligned} E_i &:\ x_{i-1} - 2x_i + x_{i+1} \\ E_{i+1} &:\ x_i - 2x_{i+1} + x_{i+2} \\ E_{i+2} &:\ x_{i+1} - 2x_{i+2} + x_{i+3} \\ E_{i+3} &:\ x_{i+2} - 2x_{i+3} + x_{i+4} \\ E_{i+4} &:\ x_{i+3} - 2x_{i+4} + x_{i+5} \end{aligned} \qquad\longrightarrow\qquad \begin{aligned} Z_{i+1} = \tfrac{1}{2}E_i + E_{i+1} + \tfrac{1}{2}E_{i+2} &:\ \tfrac{1}{2}x_{i-1} - x_{i+1} + \tfrac{1}{2}x_{i+3} \\ Z_{i+3} = \tfrac{1}{2}E_{i+2} + E_{i+3} + \tfrac{1}{2}E_{i+4} &:\ \tfrac{1}{2}x_{i+1} - x_{i+3} + \tfrac{1}{2}x_{i+5} \end{aligned}$$

SLIDE 23

Cyclic reduction

Cyclic reduction can also be interpreted as a red-black partitioning with block elimination of the red unknowns: the remaining system maintains its tridiagonal structure, so recursive application is possible; the size of the system is halved in every step; this results in asymptotically optimal complexity for tridiagonal systems in 1D, and in the same algorithm as above. A code sketch follows.
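A minimal self-contained sketch of cyclic reduction (my illustration, not from the slides); it assumes n = 2^k - 1 unknowns, zero Dirichlet values outside the index range (so a[0] and c[n-1] are taken as 0), and a diagonally dominant system such as the [-1 2 -1] stencil, so no pivoting is needed:

    #include <cassert>
    #include <vector>

    // Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], i = 0..n-1,
    // with x[-1] = x[n] = 0 and n = 2^k - 1.
    std::vector<double> cyclicReduction(std::vector<double> a, std::vector<double> b,
                                        std::vector<double> c, std::vector<double> d)
    {
        const int n = static_cast<int>(b.size());
        assert(n > 0 && ((n + 1) & n) == 0);              // n = 2^k - 1

        // Forward phase: each level eliminates every second remaining unknown.
        for (int stride = 2; stride <= n; stride *= 2) {
            const int half = stride / 2;
            for (int i = stride - 1; i < n; i += stride) {
                const double alpha = -a[i] / b[i - half]; // eliminate x[i-half]
                const double gamma = -c[i] / b[i + half]; // eliminate x[i+half]
                b[i] += alpha * c[i - half] + gamma * a[i + half];
                d[i] += alpha * d[i - half] + gamma * d[i + half];
                a[i]  = alpha * a[i - half];              // now couples to x[i-stride]
                c[i]  = gamma * c[i + half];              // now couples to x[i+stride]
            }
        }

        // Backward phase: the middle unknown first, then fill in level by level.
        std::vector<double> x(n, 0.0);
        for (int stride = n + 1; stride >= 2; stride /= 2) {
            const int half = stride / 2;
            for (int i = half - 1; i < n; i += stride) {
                const double xl = (i - half >= 0) ? x[i - half] : 0.0;
                const double xr = (i + half < n)  ? x[i + half] : 0.0;
                x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i];
            }
        }
        return x;                                          // O(n) work in total
    }

Each forward level halves the number of remaining equations (the red-black elimination just described); note how a[i] and c[i] turn into couplings at distance 'stride', the stencils of growing width mentioned on slide 18.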

SLIDE 24

however: this becomes interesting when using the principle more generally

Structured grids in 2D and 3D lead to block-tridiagonal matrices: cyclic reduction on block structures, with numerical stabilization: fast Poisson solvers (Buneman). Optimal complexity only for special cases. To generalize, avoid exact elimination: partition into fine and coarse grid with an iterative coarse-grid correction. When the fine-grid equations are solved exactly: hierarchical basis multigrid method. With relaxation instead of exact solution: multigrid method, with optimal complexity.

(figure: the same sparse 2D Laplace system as on slide 8, with diagonal entries 4 and off-diagonal entries -1, applied to the grid function u_11, ..., u_nn)

SLIDE 25

Where we stand:

Parallel Textbook Multigrid Efficiency in HHG

from TerraNeo@SPPExa


Gmeiner, UR, Stengel, Waluga, Wohlmuth: Towards Textbook Efficiency for Parallel Multigrid, Journal of Numerical Mathematics: Theory, Methods and Applications, 2015

SLIDE 26

How big can stencil systems be?

JuQueen has 448 TB memory, say 300 TB available to the user (without operating system, MPI buffers, etc.). When we store u, u_old, f, r, g in double precision and take the coarser grids into account, then N_max ≈ 7.5×10^12. Note that for this size, storing the stiffness matrix in a compressed row format for the Laplace operator with trilinear brick elements would cost 3 240 TByte; a naive storage of the full variable-coefficient Stokes stiffness matrix with quadratic brick elements would cost 45 000 TByte. Consequence: a matrix-free implementation is essential whenever possible.
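The arithmetic behind these figures, spelled out (my reconstruction: five double-precision grid functions, with the coarser levels, which in 3D add only about a factor of 8/7, absorbed into the 300 TB reserve; for the matrix, 27 nonzeros per row for trilinear brick elements at 16 Byte per entry, value plus column index):

$$N_{\max} \approx \frac{300\times10^{12}\ \mathrm{Byte}}{5 \cdot 8\ \mathrm{Byte}} = 7.5\times10^{12}, \qquad 7.5\times10^{12}\ \mathrm{rows} \times 27 \times 16\ \mathrm{Byte} = 3\,240\ \mathrm{TByte}.$$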

SLIDE 27

Multigrid: Algorithms for 10^12 unknowns

Goal: solve $A_h u_h = f_h$ using a hierarchy of grids. Multigrid uses coarse grids to accomplish the inevitable global data exchange in the most efficient possible way. One cycle: relax, compute the residual, restrict, solve on the coarse grid (by recursion), interpolate, correct:

$$r_h = f_h - A_h u_h, \qquad r_H = I_h^H r_h, \qquad A_H e_H = r_H, \qquad e_h = I_H^h e_H, \qquad u_h \leftarrow u_h + e_h$$
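A minimal self-contained sketch of this coarse-grid correction as a recursive V-cycle, here for the 1D Poisson problem (illustrative only; this is not the HHG implementation):

    #include <algorithm>
    #include <vector>

    // V-cycle for -u'' = f, stencil (1/h^2)[-1 2 -1], zero Dirichlet boundaries.
    struct Level { std::vector<double> u, f; double h; };

    // One Gauss-Seidel sweep: u_i = (u_{i-1} + u_{i+1} + h^2 f_i) / 2
    void smooth(Level& L) {
        const int n = static_cast<int>(L.u.size());
        for (int i = 1; i < n - 1; ++i)
            L.u[i] = 0.5 * (L.u[i - 1] + L.u[i + 1] + L.h * L.h * L.f[i]);
    }

    void vcycle(std::vector<Level>& lv, std::size_t l) {
        Level& F = lv[l];
        const int n = static_cast<int>(F.u.size());
        if (l + 1 == lv.size()) {                       // coarsest grid: solve
            for (int s = 0; s < 50; ++s) smooth(F);
            return;
        }
        for (int s = 0; s < 2; ++s) smooth(F);          // relax (pre-smoothing)

        std::vector<double> r(n, 0.0);                  // residual r = f - A u
        for (int i = 1; i < n - 1; ++i)
            r[i] = F.f[i] - (2.0 * F.u[i] - F.u[i-1] - F.u[i+1]) / (F.h * F.h);

        Level& C = lv[l + 1];                           // restrict (full weighting)
        std::fill(C.u.begin(), C.u.end(), 0.0);
        for (int I = 1; I < static_cast<int>(C.u.size()) - 1; ++I)
            C.f[I] = 0.25 * (r[2*I - 1] + 2.0 * r[2*I] + r[2*I + 1]);

        vcycle(lv, l + 1);                              // solve A_H e_H = r_H by recursion

        for (int I = 0; I + 1 < static_cast<int>(C.u.size()); ++I) {  // interpolate, correct
            F.u[2*I]     += C.u[I];
            F.u[2*I + 1] += 0.5 * (C.u[I] + C.u[I + 1]);
        }
        for (int s = 0; s < 2; ++s) smooth(F);          // post-smoothing
    }

    // Hierarchy: finest grid 2^k + 1 points on [0,1], coarsened down to 3 points.
    std::vector<Level> makeHierarchy(int k) {
        std::vector<Level> lv;
        for (int n = (1 << k) + 1; n >= 3; n = (n - 1) / 2 + 1)
            lv.push_back({std::vector<double>(n, 0.0),
                          std::vector<double>(n, 0.0), 1.0 / (n - 1)});
        return lv;
    }

Calling vcycle(lv, 0) repeatedly reduces the error by a grid-independent factor per cycle; full multigrid (as in the FMG runs on slide 35) additionally nests such cycles over the levels.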

SLIDE 28

Regular tetrahedral refinement in HHG

3D refinement hierarchy of tetrahedral patches. Geometric hierarchy with 1-layer halos for (volume) elements, faces, edges, and vertices; communication of ghost layers. Structured refinement of the tetrahedra.
SLIDE 29

Textbook Multigrid Efficiency (TME)

"Textbook multigrid efficiency means solving a discrete PDE problem with a computational effort that is only a small (less than 10) multiple of the operation count associated with the discretized equations itself." [Brandt, 98]

Work unit (WU) = a single elementary relaxation. Classical algorithmic TME factor: ops for the solution / ops for one work unit. New parallel TME factor to quantify algorithmic efficiency combined with implementation scalability.

SLIDE 30

Parallel TME

Let µ_sm be the number of elementary relaxation steps per second on a single core, U the number of cores, and Uµ_sm the aggregate relaxation performance. T(N, U) is the time to solution for N unknowns on U cores; the idealized time for a work unit is

$$T_{WU}(N, U) = \frac{N}{U \mu_{sm}}.$$

The parallel textbook efficiency factor combines algorithmic and implementation efficiency:

$$E_{ParTME}(N, U) = \frac{T(N, U)}{T_{WU}(N, U)} = \frac{T(N, U)\, U \mu_{sm}}{N}$$

SLIDE 31

Analysing TME Efficiency: RB-GS Smoother

    // Red-black Gauss-Seidel relaxation with the 15-point stencil on a
    // structured tetrahedral patch. Offsets: mp/tp/bp = middle/top/bottom
    // plane, tr/mr/br = top/middle/bottom row; i += 2 gives the red-black
    // (every second point) ordering.
    for (int i = 1; i < (tsize - j - k - 1); i += 2) {
        u[mp_mr+i] = c[0] * (
            - c[1] *u[mp_mr+i+1] - c[2] *u[mp_tr+i-1] - c[3] *u[mp_tr+i]
            - c[4] *u[tp_br+i]   - c[5] *u[tp_br+i+1] - c[6] *u[tp_mr+i-1]
            - c[7] *u[tp_mr+i]   - c[8] *u[bp_mr+i]   - c[9] *u[bp_mr+i+1]
            - c[10]*u[bp_tr+i-1] - c[11]*u[bp_tr+i]   - c[12]*u[mp_br+i]
            - c[13]*u[mp_br+i+1] - c[14]*u[mp_mr+i-1]
            + f[mp_mr+i] );
    }

This loop should execute on a single SuperMUC core at 720 MLUPs/s (in theory: peak performance); in practice it reaches 176 M updates/s (memory access bottleneck; the red-black ordering prohibits vector loads). Thus the whole SuperMUC should perform Uµ_sm = 147456 × 176 M ≈ 26 TLUPs/s.

SLIDE 32

Execution-Cache-Memory Model (ECM)

ECM model for the 15-point stencil on an SNB (Sandy Bridge) core. The arrow indicates a 64-Byte cache line transfer. Run-times represent 8 elementary updates.

Gmeiner, UR, Stengel, Waluga, Wohlmuth: Towards Textbook Efficiency for Parallel Multigrid, Journal of Numerical Mathematics: Theory, Methods and Applications, 2015.

Hager et al.: Exploring performance and power properties of modern multi-core chips via simple machine models. Concurrency and Computation: Practice and Experience, 2013.

(figure: ECM model for the 15-point stencil with variable coefficients (left) and constant coefficients (right))

SLIDE 33

ECM: single-chip prediction vs. measurement

Sandy Bridge single-chip performance scaling of the stencil smoothers on 256³ grid points. Measured data and ECM prediction ranges shown.

(plot: performance [MLUPs/s], 400 to 1600, versus # cores, 1 to 8; constant and variable coefficients, ECM model ranges and measured data)

SLIDE 34

HHG Weak Scalability on JuQueen for Stokes

Nodes  | Threads | Grid points | Resolution | Time (A) | Time (B)
1      | 30      | 2.1·10^7    | 32 km      | 30 s     | 89 s
4      | 240     | 1.6·10^8    | 16 km      | 38 s     | 114 s
30     | 1 920   | 1.3·10^9    | 8 km       | 40 s     | 121 s
240    | 15 360  | 1.1·10^10   | 4 km       | 44 s     | 133 s
1 920  | 122 880 | 8.5·10^10   | 2 km       | 48 s     | 153 s
15 360 | 983 040 | 6.9·10^11   | 1 km       | 54 s     | 170 s

Largest computation to date: 2.76×10^12 unknowns.

SLIDE 35

TME and Parallel TME results

Setting / Measure | E_TME | E_SerTME | E_NodeTME | E_ParTME1 | E_ParTME2
Grid points       |   -   | 2·10^6   | 3·10^7    | 9·10^9    | 2·10^11
Processor cores U |   -   | 1        | 16        | 4096      | 16384
(CC) FMG(2,2)     |  6.5  | 15       | 22        | 26        | 22
(VC) FMG(2,2)     |  6.5  | 11       | 13        | 15        | 13
(SF) FMG(2,1)     |  31   | 64       | 100       | 118       | -

Three model problems: scalar with constant coefficients (CC), scalar with variable coefficients (VC), and Stokes solved via Schur complement (SF). Full multigrid, with the number of iterations chosen such that asymptotic optimality is maintained. TME = 6.5 (algorithmically, for the scalar cases); ParTME around 20 for scalar, and ≥ 100 for Stokes.

SLIDE 36

Thought Experiment

10^9 elements/nodes. Assume that every entity must contribute to every other one (as is typical for an elliptic problem); this is equivalent to either multiplication with the (dense) inverse matrix (n × n) or a Coulomb interaction. This results in 10^18 data movements, each costing 1 nanojoule (optimistic). Together: 10^9 Joule = 277 kWh.
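The conversion behind the 277 kWh, spelled out: 10^9 entities, each contributing to 10^9 others, give (10^9)^2 = 10^18 movements, so

$$10^{18} \times 10^{-9}\ \mathrm{J} = 10^{9}\ \mathrm{J} = \frac{10^{9}\ \mathrm{J}}{3.6\times10^{6}\ \mathrm{J/kWh}} \approx 277\ \mathrm{kWh}.$$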

(figure: energy cost of data movement; from: Exascale Programming Challenges, Report of the 2011 Workshop on Exascale Programming Challenges, Marina del Rey, July 27-29, 2011)

Energy as limiting factor in exascale computing

SLIDE 37

Thought Experiment

10^12 elements/nodes. Assume again that every entity must contribute to every other one (as is typical for an elliptic problem); equivalent to either multiplication with the (dense) inverse matrix (n × n) or a Coulomb interaction. This results in 10^24 data movements, each costing 1 nanojoule (optimistic). Together: 10^15 Joule = 277 GWh, or 240 kilotons of TNT.

!"#$%"&'$()*!+ ,&--."/012&"+ 3"405/6++(+,)*+ 0&--."/012&"+ !"#%1"&'$(7)*+ ,&--."/012&"+

12

from: ¡ ¡Exascale ¡Programming ¡Challenges, ¡Report ¡of ¡the ¡ 2011 ¡Workshop ¡on ¡Exascale ¡Programming ¡Challenges, ¡ Marina ¡del ¡Rey, ¡July ¡27-­‑29, ¡2011

277 GWh or 240 Kilotons TNT

the picture shows the Badger-Explosion

  • f 1953 with 23 Kilotons TNT

Source: Wikipedia

Energy as limiting factor in exascale computing

SLIDE 38

Thank you for your attention!

Preprints, slides: https://www10.informatik.uni-erlangen.de