
Software Sustainability in the Many-Core Era - Jonas Thies

Software Sustainability in the Many-Core Era. Slides by J. Thies (DLR.de), Erlangen, July 11 2016.


  1. Software Sustainability in the Many-Core Era
     Jonas Thies (DLR.de), Erlangen, July 11 2016

  2. German Aerospace Center (DLR)
     Aerospace center, project manager and space agency
     ◮ > 8 000 employees
     ◮ 16(?) sites in Germany
     Main areas of research
     ◮ Aeronautics
     ◮ Energy
     ◮ Space
     ◮ Security
     For the ESA mission ‘Rosetta’, DLR developed and operates the ‘Philae’ lander.
     ... so who am I to talk to you about software and HPC?

  3. Institute Simulation and Software Technology
     Software is developed everywhere at DLR
     ◮ 2005: ∼25 % of personnel expenses spent on software development
     ◮ cost: > 100 million Euro/year
     ◮ Examples: CFD, material science, onboard computers, data analysis...
     Our mission (∼50 staff) is to increase the efficiency of software development in other institutes by software research, teaching and contributing to key projects.

  4. Equipping Sparse Solvers for the EXa-scale (ESSEX)

  5. Sparse Eigenvalue Problems
     Formulation
     Find some eigenpairs $(\lambda_j, v_j)$ of a large and sparse matrix (pair) in a target region of the spectrum:
     $A v_j = \lambda_j B v_j$
     ◮ $A$ Hermitian or general, real or complex
     ◮ $B$ may be identity matrix (or not)
     ◮ ‘some’ may mean ‘quite a few’, 100-1 000 or so
     Applications
     ◮ Graphene
     ◮ Quantum Hubbard model and Anderson localization
     ◮ Fluid mechanics: driven cavity, Rayleigh-Benard convection
     ◮ DLR applications

  6. Block Jacobi-Davidson QR
     ◮ Aim: partial QR decomposition,
       $A Q = Q R$, $R \in \mathbb{C}^{k \times k}$ upper triangular,
       $\tfrac{1}{2} Q^T Q - \tfrac{1}{2} I = 0$, $Q \in \mathbb{R}^{N \times k}$.
     ◮ Newton’s method, let $Q = \tilde{Q} + \Delta Q$
       ◮ $A \Delta Q - \Delta Q \tilde{R} = A \tilde{Q} - \tilde{Q} \tilde{R}$
       ◮ $\tilde{Q}^T \Delta Q = 0$

  7. Block Jacobi-Davidson QR (2)
     This leads to a set of correction equations
     $(I - \tilde{Q} \tilde{Q}^T) A (I - \tilde{Q} \tilde{Q}^T) \Delta Q - \Delta Q \tilde{R} = A \tilde{Q} - \tilde{Q} \tilde{R}$
     ◮ Subspace acceleration: add corrections to expanding search space $V$
     ◮ Ritz-Galerkin: $M = V^T A V$, $M = S^H R S$
     ◮ Lock converged eigenpairs ⇒ growing projection space $\tilde{Q}$
     ◮ Solve correction eq. using (deflated) GMRES or MINRES Krylov solver

  8. Projection-Based Eigensolvers
     Input: interval $I_\lambda$, matrix pair $A, B \in \mathbb{C}^{N \times N}$
     Output: $\hat{m}$ eigenpairs $(X, \Lambda)$ in $I_\lambda$
      1: Estimate $\tilde{m} \approx \hat{m}$, choose random $Y \in \mathbb{C}^{N \times m}$ of rank $m > \tilde{m}$
      2: while not $\tilde{m}$ pairs converged do
      3:   Compute $U = P Y$ with suitable projector $P = P_{I_\lambda}(A, B)$
      4:   Compute Rayleigh quotients $A_U = U^* A U$ and $B_U = U^* B U$
      5:   Update estimate $\tilde{m}$ of $\hat{m}$ and adjust $m > \tilde{m}$
      6:   Solve EVP $A_U W = B_U W \Lambda$
      7:   $X \leftarrow U W$
      8:   Orthogonalize $X$ against locked vectors, lock newly converged ones
      9:   $Y \leftarrow B X$
     10: end while

  9. Two Ways of Computing the Projector $U = PY$
     BEAST-C/FEAST: contour integration
     $U := \frac{1}{2 \pi i} \oint_C (zB - A)^{-1} B \, dz \; Y$
     Requires solving many independent but hard linear systems
     Polynomial expansion of resolvent function
     ◮ Chebyshev iteration
     ◮ requires very large number of spMMVMs
     ◮ but no global synchronization
     ◮ ‘filter polynomials’ to reduce Gibbs oscillations, aka ChebFD
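The polynomial route can be illustrated with a minimal sketch. It assumes the spectrum of $A$ has already been scaled into $[-1, 1]$ and that the filter coefficients are given. The names `spmv` and `chebyshev_filter` and the list-of-dicts matrix format are illustrative assumptions, not the BEAST or ChebFD API. Each step uses one sparse matrix-vector product plus vector updates and no inner products, which is why no global synchronization is needed.

```python
def spmv(rows, x):
    """y = A x, with A stored as a list of {column: value} dicts (one per row)."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


def chebyshev_filter(rows, x, coeffs):
    """Return sum_k coeffs[k] * T_k(A) x using the three-term recurrence
    T_{k+1}(A) x = 2 A T_k(A) x - T_{k-1}(A) x.

    Assumes (hypothetically) that the spectrum of A lies in [-1, 1]
    and that coeffs has at least one entry.
    """
    t_prev = list(x)                                  # T_0(A) x = x
    result = [coeffs[0] * v for v in t_prev]
    if len(coeffs) == 1:
        return result
    t_cur = spmv(rows, x)                             # T_1(A) x = A x
    result = [r + coeffs[1] * v for r, v in zip(result, t_cur)]
    for c in coeffs[2:]:
        ax = spmv(rows, t_cur)                        # the only matrix operation per step
        t_prev, t_cur = t_cur, [2.0 * a - p for a, p in zip(ax, t_prev)]
        result = [r + c * v for r, v in zip(result, t_cur)]
    return result
```

For a block of vectors the same recurrence applies column by column, so the matrix product becomes a sparse matrix times multiple vector operation.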

  10. Common Operations of Iterative Methods
     1. Memory-bounded linear operations involving
        ◮ sparse matrices $A \in \mathbb{R}^{N \times N}$ (sparseMat)
        ◮ multi-vectors $X, Y \in \mathbb{R}^{N \times m}$ (mVecs)
        ◮ small and dense matrices $C \in \mathbb{R}^{m \times k}$ (sdMats), node-local/in shared memory
        (e.g. $Y \leftarrow \alpha A X + \beta Y$, $C \leftarrow X^T Y$, $X \leftarrow Y \cdot C$)
        Developed in ESSEX/[...]
     2. Algorithms for sdMats
        ◮ e.g. eigendecomposition of projected matrix
        ◮ use LAPACK/PLASMA/MAGMA
     3. Sparse matrix (I)LU factorization
        ◮ not available in [...]
        ◮ allow using external libraries via Trilinos interface
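The three example operations from item 1 can be written down as a plain-Python reference sketch. The function names, the row-major list-of-lists storage for mVecs and sdMats, and the list-of-dicts storage for the sparse matrix are assumptions for illustration only. They are not the kernel library's interface, and a real kernel would update its output in place rather than return a new object.

```python
def spmmv(alpha, a_rows, x, beta, y):
    """Return alpha*A*X + beta*Y (A sparse as list of {col: val}; X, Y are N x m)."""
    m = len(x[0])
    out = []
    for row, y_row in zip(a_rows, y):
        acc = [beta * v for v in y_row]
        for j, a in row.items():
            for c in range(m):
                acc[c] += alpha * a * x[j][c]
        out.append(acc)
    return out


def mvec_t_mvec(x, y):
    """Return C = X^T Y, a small dense m x k matrix (one pass over the rows)."""
    m, k = len(x[0]), len(y[0])
    c = [[0.0] * k for _ in range(m)]
    for x_row, y_row in zip(x, y):
        for i in range(m):
            for j in range(k):
                c[i][j] += x_row[i] * y_row[j]
    return c


def mvec_times_sdmat(y, c):
    """Return X = Y * C for an N x m block Y and an m x k small dense matrix C."""
    k = len(c[0])
    return [[sum(y_row[i] * c[i][j] for i in range(len(c))) for j in range(k)]
            for y_row in y]
```

All three stream each long vector once and do little arithmetic per element, which is what makes them memory-bound.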

  11. Comparing Performance Results
     simple(?) operation: $C = V^T V$, $V \in \mathbb{R}^{1M \times 4}$

  15. Present Challenges to HPC Users
     Performance increase on low to intermediate levels
     ◮ SIMD/SIMT
     ◮ increasingly non-uniform cache/memory hierarchies
     ◮ increasing core count
     Many programming models and (semi-)standards
     ◮ OpenMP+OpenACC vs. OpenCL
     ◮ vendor-specific (e.g. CUDA)
     ◮ MPI vs. PGAS (GPI/GASPI, Co-Array Fortran, UPC)
     ◮ ca. 15 different tasking runtimes
     ◮ C++11, Intel TBB, Kokkos
     imo: MPI is here to stay, the node-level is uncertain

  16. Our Test System
     [Figure: roofline plot, DP GFlop/s over compute intensity [Flop / DP element], with peak Flop/s, peak bandwidth and BLAS1 (ddot) marked]
     CPU
     ◮ 2 × 12 core Haswell EP @2.3 GHz
     ◮ Theoretical Peak: 442 GFlop/s
     ◮ 128 GB RAM
     ◮ STREAM-Triad: 42 GB/s / socket
     GPU
     ◮ Tesla K40 GPU
     ◮ Theoretical Peak: 1.43 TFlop/s
     ◮ 12 GB RAM
     ◮ STREAM-Triad: 215 GB/s
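The figures above can be turned into a rough bound with the usual roofline model: attainable performance is the smaller of the peak rate and intensity times bandwidth. The sketch below is an illustration under stated assumptions. It treats the quoted CPU peak and the per-socket STREAM number as belonging to the same unit, and it counts ddot as 1 Flop per DP element (one multiply and one add for every two elements loaded). The function name is hypothetical.

```python
BYTES_PER_DP = 8  # one double-precision element


def roofline_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Attainable DP GFlop/s for a kernel doing `intensity` Flop per DP element moved."""
    gelements_per_s = bandwidth_gb_s / BYTES_PER_DP
    return min(peak_gflops, intensity * gelements_per_s)


# ddot on the CPU (442 GFlop/s peak, 42 GB/s) and on the GPU (1.43 TFlop/s, 215 GB/s)
cpu_ddot = roofline_gflops(1.0, 442.0, 42.0)     # 5.25 GFlop/s
gpu_ddot = roofline_gflops(1.0, 1430.0, 215.0)   # 26.875 GFlop/s
```

Under these assumptions ddot reaches only about 1 to 2 % of the theoretical peak on either device, which is why the memory-bound kernels dominate the discussion.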

  17. SPMD/OK Programming Model
     ◮ SPMD (‘BSP’) vs. task parallelism
     ◮ Heterogeneous cluster: distribute problem according to limiting resource (e.g. memory bandwidth)
     ◮ Optimized Kernels make sure each component runs as fast as possible
     ◮ User sees a simple functional interface (no general-purpose looping constructs etc.)
     A success story: Chebyshev methods on Piz Daint
     [Figure: performance in Tflop/s and parallel efficiency over number of heterogeneous nodes (1 to 1024); series: Square, weak scaling; Bar, weak scaling; Square, strong scaling]
     Only needs sparse matrix times multiple vector (spMMV) products and an occasional vector operation
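Distributing a problem according to the limiting resource can be sketched as a weighted row partition. The function below is a hypothetical illustration, not the scheme used in the Piz Daint runs. It splits the matrix rows in proportion to each device's memory bandwidth and hands out the rounding leftovers to the largest fractional shares.

```python
def partition_rows(n_rows, weights):
    """Split n_rows among devices in proportion to weights (e.g. bandwidth in GB/s)."""
    total = float(sum(weights))
    exact = [n_rows * w / total for w in weights]
    shares = [int(e) for e in exact]                  # round down first
    leftover = n_rows - sum(shares)
    # Give the remaining rows to the devices with the largest fractional part.
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: exact[i] - shares[i], reverse=True)
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


# Two CPU sockets at 42 GB/s each and one GPU at 215 GB/s (example numbers)
print(partition_rows(299, [42, 42, 215]))   # [42, 42, 215]
```

With such a split every device needs about the same time to stream its part of the matrix, so none of them waits at the end of a bulk-synchronous step.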

  18. Upcoming Challenges
     (even more) heterogeneous memory
     ◮ Knights Landing: additional fast NUMA domain
     ◮ IBM Power 9 + Nvidia Volta: GPU can read from main memory (at same speed as CPU)
     Algorithm developer must decide which data should be accessed fast
     ◮ E.g. eigensolvers often have an outer/inner (project/correct) structure; the complete outer search space may not be needed in the inner loop

  20. PHIST Software Architecture
     a Pipelined Hybrid-parallel Iterative Solver Toolkit
     ◮ facilitate algorithm development using [...]
     ◮ holistic performance engineering
     ◮ portability and interoperability
     [Figure: layered architecture diagram. Layers from top to bottom: application; algorithms (eigenproblem solver templates, BEAST, preconditioners, FT strategies); C wrapper; algo core and computational core; adapter; kernel interface (sparseMat, mVec, sdMat). Annotations: vertical integration, holistic performance engineering, «abstraction», «interface», setup/apply]

  21. Useful Abstraction: Kernel Interface
     Choose from several ‘backends’ at compile time, to
     ◮ easily use PHIST in existing applications
     ◮ perform the same run with different kernel libraries
     ◮ compare numerical accuracy and performance
     ◮ exploit unique features of a kernel library (e.g. preconditioners)
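The idea of a kernel interface can be shown with a small sketch. It is not PHIST code: PHIST selects its backend at compile time, whereas this Python illustration passes the backend at run time, and all class and function names are hypothetical. The algorithm (`norm2`) is written once against the interface, and two backends with different summation schemes can then be compared for accuracy.

```python
import math
from abc import ABC, abstractmethod


class KernelBackend(ABC):
    """Minimal stand-in for a kernel interface: one vector operation."""

    @abstractmethod
    def dot(self, x, y):
        """Return the inner product of x and y."""


class NaiveBackend(KernelBackend):
    def dot(self, x, y):
        total = 0.0
        for a, b in zip(x, y):
            total += a * b                # plain left-to-right summation
        return total


class FsumBackend(KernelBackend):
    def dot(self, x, y):
        # math.fsum adds the (already rounded) products without further rounding error
        return math.fsum(a * b for a, b in zip(x, y))


def norm2(kernels, x):
    """Algorithm code: sees only the interface, not the backend."""
    return math.sqrt(kernels.dot(x, x))


x = [1e16, 1.0, -1e16]
ones = [1.0, 1.0, 1.0]
print(NaiveBackend().dot(x, ones))   # 0.0, the 1.0 is lost to rounding
print(FsumBackend().dot(x, ones))    # 1.0
```

Running the same algorithm with both backends is the kind of comparison of numerical accuracy that the slide lists as a benefit.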
