DLR.de · Chart 1
> Software Sustainability in the Many-Core Era > J.Thies · slides > Erlangen, July 11 2016
Software Sustainability in the Many-Core Era
Jonas Thies
DLR.de · Chart 2
◮ > 8 000 employees
◮ 16(?) sites in Germany
◮ Aeronautics
◮ Space
◮ Energy
◮ Security
DLR.de · Chart 3
◮ 2005: ∼ 25% of personnel expenses
◮ cost: > 100 million Euro/year
◮ Examples: CFD, material science, onboard
DLR.de · Chart 5
◮ A Hermitian or general, real or
◮ B may be identity matrix (or not)
◮ ‘some’ may mean ‘quite a few’,
[Figure labels: Graphene · Anderson localization · Hubbard model · DLR applications · Driven cavity · Rayleigh-Benard convection]
DLR.de · Chart 6
◮ Aim: partial QR decomposition, $AQ = QR$, $R \in \mathbb{C}^{k \times k}$ upper triangular,
◮ $\tfrac{1}{2} Q^T Q - \tfrac{1}{2} I = 0$, $Q \in \mathbb{R}^{N \times k}$.
◮ A∆Q − ∆Q˜
◮ ˜
DLR.de · Chart 7
◮ Subspace acceleration: add corrections to expanding search space $V$
◮ Ritz-Galerkin: $M = V^T A V$, $M = S^H R S$
◮ Lock converged eigenpairs ⇒ growing projection space
◮ Solve correction eq. using (deflated) GMRES or MINRES Krylov solver
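The Ritz-Galerkin step above can be illustrated with a small sketch. It is not taken from the talk: it assumes a real symmetric A, a search space of exactly two vectors, and plain Python lists as vectors, and the helper names (`orthonormalize`, `ritz_pairs`) are hypothetical. It forms M = V^T A V, takes the eigenpairs of the 2×2 matrix M in closed form, and returns the Ritz values, Ritz vectors and residuals; the residuals are orthogonal to the search space, which is the Galerkin condition.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def orthonormalize(vectors):
    """Modified Gram-Schmidt; assumes the input vectors are linearly independent."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            h = dot(q, w)
            w = [wi - h * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        basis.append([wi / norm for wi in w])
    return basis


def small_eig_sym2(m11, m12, m22):
    """Eigenpairs of the symmetric 2x2 matrix [[m11, m12], [m12, m22]], ascending."""
    if m12 == 0.0:
        return sorted([(m11, (1.0, 0.0)), (m22, (0.0, 1.0))])
    mean = 0.5 * (m11 + m22)
    radius = math.hypot(0.5 * (m11 - m22), m12)
    pairs = []
    for theta in (mean - radius, mean + radius):
        s = (m12, theta - m11)  # satisfies (M - theta I) s = 0
        n = math.hypot(*s)
        pairs.append((theta, (s[0] / n, s[1] / n)))
    return pairs


def ritz_pairs(A, V):
    """Rayleigh-Ritz on span(V) for symmetric A; V holds two orthonormal vectors."""
    AV = [matvec(A, v) for v in V]
    m11, m12, m22 = dot(V[0], AV[0]), dot(V[0], AV[1]), dot(V[1], AV[1])
    result = []
    for theta, s in small_eig_sym2(m11, m12, m22):
        u = [s[0] * a + s[1] * b for a, b in zip(V[0], V[1])]   # Ritz vector
        r = [au - theta * ui for au, ui in zip(matvec(A, u), u)]  # residual
        result.append((theta, u, r))
    return result
```

In a Jacobi-Davidson-type method the residual of the selected Ritz pair would then drive the correction equation, and the correction would be added to V.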
DLR.de · Chart 9
◮ Chebyshev iteration
◮ requires very large number of
◮ but no global synchronization
◮ ‘filter polynomials’ to reduce
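To make the "no global synchronization" point concrete, here is a minimal sketch of a Chebyshev filter polynomial applied to a vector. It is an illustration under assumptions, not the implementation from the talk: the function name `chebyshev_filter`, the list-based vectors, and the choice of damping interval [a, b] are all hypothetical. The three-term recurrence needs only operator applications and vector updates, with no inner products and therefore no reductions.

```python
def chebyshev_filter(apply_A, x, degree, a, b):
    """Return T_degree((A - c I) / e) x with c = (a + b) / 2 and e = (b - a) / 2.

    Components belonging to eigenvalues inside [a, b] stay bounded by 1 in
    magnitude; components outside the interval are amplified.
    """
    c = 0.5 * (a + b)
    e = 0.5 * (b - a)

    def scaled(v):
        # One operator application plus a local vector update, no reduction.
        Av = apply_A(v)
        return [(avi - c * vi) / e for avi, vi in zip(Av, v)]

    if degree == 0:
        return list(x)
    prev = list(x)        # T_0 term
    curr = scaled(prev)   # T_1 term
    for _ in range(degree - 1):
        # T_{j+1}(t) = 2 t T_j(t) - T_{j-1}(t)
        nxt = [2.0 * s - p for s, p in zip(scaled(curr), prev)]
        prev, curr = curr, nxt
    return curr
```

For a diagonal operator the effect is easy to check by hand, because each component is simply multiplied by the Chebyshev polynomial evaluated at its shifted and scaled eigenvalue.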
DLR.de · Chart 10
◮ e.g. eigendecomposition of
◮ use LAPACK/PLASMA/MAGMA
◮ not available in
◮ allow using external libraries via
DLR.de · Chart 12
◮ SIMD/SIMT
◮ increasing core count
◮ increasingly non-uniform
◮ OpenMP+OpenACC vs.
◮ vendor-specific (e.g. CUDA)
◮ ca. 15 different tasking
◮ C++11, Intel TBB, Kokkos
◮ MPI vs. PGAS (GPI/GASPI, Co-Array Fortran, UPC)
DLR.de · Chart 13
[Figure: roofline plot, DP GFlop/s vs. compute intensity [Flop / DP element]; lines: peak bandwidth, peak Flop/s; marked kernel: BLAS1 (ddot)]
◮ 2 × 12 core Haswell EP @2.3 GHz
◮ Theoretical Peak: 442 GFlop/s
◮ 128 GB RAM
◮ STREAM-Triad: 42 GB/s / socket
◮ Tesla K40 GPU
◮ Theoretical Peak: 1.43 TFlop/s
◮ 12 GB RAM
◮ STREAM-Triad: 215 GB/s
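The hardware figures above can be turned into a roofline estimate for ddot. This is a back-of-the-envelope sketch with assumptions stated explicitly: `roofline_gflops` is a hypothetical helper, a DP element is taken as 8 bytes, and ddot is counted as 2 flops (multiply and add) per 2 streamed elements, i.e. 1 Flop / DP element. The slide does not say whether the 442 GFlop/s peak is per socket or per node; for a bandwidth-bound kernel like ddot the answer does not change the result.

```python
def roofline_gflops(peak_gflops, bandwidth_gbytes, flops_per_element,
                    bytes_per_element=8):
    """Attainable GFlop/s = min(peak, intensity * streamed elements per second)."""
    giga_elements_per_second = bandwidth_gbytes / bytes_per_element
    return min(peak_gflops, flops_per_element * giga_elements_per_second)


DDOT_INTENSITY = 1.0  # assumption: 2 flops per 2 loaded DP elements

# One Haswell socket, using the STREAM-Triad bandwidth quoted per socket.
haswell_socket = roofline_gflops(442.0, 42.0, DDOT_INTENSITY)

# Tesla K40, using the quoted STREAM-Triad bandwidth.
k40 = roofline_gflops(1430.0, 215.0, DDOT_INTENSITY)

print(f"ddot bound, Haswell socket: {haswell_socket} GFlop/s")
print(f"ddot bound, Tesla K40:      {k40} GFlop/s")
```

Both bounds sit far below the theoretical peaks, so for BLAS1-type kernels the memory bandwidth is the figure that matters.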
DLR.de · Chart 14
◮ SPMD (‘BSP’) vs. task
◮ Heterogeneous cluster:
◮ Optimized Kernels make sure
◮ User sees a simple functional
[Figure: performance in Tflop/s vs. number of heterogeneous nodes; series: 100% Parallel Efficiency; Square, Weak Scaling; Bar, Weak Scaling; Square, Strong Scaling]
DLR.de · Chart 15
◮ Knights Landing: additional fast NUMA domain
◮ IBM Power 9 + Nvidia Volta: GPU can read from main memory (at same
◮ E.g. eigensolvers often have an outer/inner (project/correct) structure,
DLR.de · Chart 16
◮ facilitate algorithm development using
◮ holistic performance engineering
◮ portability and interoperability
[Figure: software architecture diagram; labels: application, vertical integration, algorithms, preconditioners, computational core, «abstraction», eigenproblem, setup/apply, sparseMat, mVec, sdMat, solver templates, FT strategies, algo core, «interface», kernel interface, holistic performance engineering, C wrapper, adapter]
DLR.de · Chart 17
◮ easily use PHIST in existing applications
◮ perform the same run with different kernel libraries
◮ compare numerical accuracy and performance
◮ exploit unique features of a kernel library (e.g. preconditioners)
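The idea of running the same algorithm on different kernel libraries and comparing accuracy can be sketched as follows. None of these names are PHIST's API: `Kernels`, `NaiveKernels`, `AccurateKernels` and `rayleigh_quotient` are hypothetical and exist only for illustration. The algorithm is written against a small interface, and the two backends differ only in how they accumulate the dot product.

```python
import math
from typing import Callable, List, Protocol, Sequence


class Kernels(Protocol):
    """Hypothetical kernel interface: the only thing the algorithm may call."""

    def dot(self, x: Sequence[float], y: Sequence[float]) -> float: ...


class NaiveKernels:
    """Backend 1: plain left-to-right accumulation."""

    def dot(self, x, y):
        acc = 0.0
        for xi, yi in zip(x, y):
            acc += xi * yi
        return acc


class AccurateKernels:
    """Backend 2: exactly rounded summation of the (individually rounded) products."""

    def dot(self, x, y):
        return math.fsum(xi * yi for xi, yi in zip(x, y))


def rayleigh_quotient(kernels: Kernels,
                      apply_A: Callable[[List[float]], List[float]],
                      x: List[float]) -> float:
    """Algorithm layer: uses the kernel interface and nothing else."""
    return kernels.dot(x, apply_A(x)) / kernels.dot(x, x)


# Same run, two backends; the ill-conditioned dot product exposes the difference.
x, y = [1e16, 1.0, -1e16], [1.0, 1.0, 1.0]
for backend in (NaiveKernels(), AccurateKernels()):
    print(type(backend).__name__, backend.dot(x, y))
```

Because the algorithm never touches the vector storage directly, swapping the backend requires no change in `rayleigh_quotient`.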
DLR.de · Chart 18
◮ overlap communication and computation
◮ asynchronous checkpointing
◮ ...
◮ across platforms (CPU, GPU...)
◮ across kernel libraries
◮ independent of #procs, #threads
◮ deficiencies of kernel lib
◮ implementation issues of algorithm (strided
◮ fused kernels, e.g. compute
◮ highly accurate core functions, e.g. block
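As an illustration of what a fused kernel looks like, the sketch below combines a vector update with the squared norm of the result in a single sweep. The slide text is cut off, so which operations the talk actually fuses is not known; the pairing chosen here and the function names are assumptions. In a compiled kernel the benefit is that y is streamed through memory once instead of twice; the Python version only shows the structure.

```python
def axpby_then_norm2(alpha, x, beta, y):
    """Unfused reference: two separate sweeps over y."""
    for i, xi in enumerate(x):
        y[i] = alpha * xi + beta * y[i]
    acc = 0.0
    for yi in y:
        acc += yi * yi
    return acc


def fused_axpby_norm2(alpha, x, beta, y):
    """Fused variant: update y in place and accumulate its squared norm in one sweep."""
    acc = 0.0
    for i, xi in enumerate(x):
        yi = alpha * xi + beta * y[i]
        y[i] = yi
        acc += yi * yi
    return acc
```

Both variants accumulate in the same order, so they return identical results; only the number of passes over the data differs.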
DLR.de · Chart 19
◮ #MPI procs, #threads
◮ data types (S/D/C/Z)
◮ block sizes and memory alignment
◮ vectorization (SSE,AVX,CUDA)
Algorithms
DLR.de · Chart 22
◮ need to update
◮ no adjustments in kernels, interface, or
◮ unit tests will ensure correctness of refactoring
◮ perfcheck will reveal performance benefits
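A performance check of this kind can be pictured as comparing a measured runtime with a simple bandwidth model. The sketch below is not PHIST's perfcheck; the function names, the model (memory traffic divided by bandwidth) and the 8-byte word size are assumptions made for illustration.

```python
import time


def expected_seconds(n_elements, words_per_element, bandwidth_gbytes,
                     bytes_per_word=8):
    """Model runtime of a streaming kernel: memory traffic / bandwidth."""
    traffic_bytes = n_elements * words_per_element * bytes_per_word
    return traffic_bytes / (bandwidth_gbytes * 1e9)


def fraction_of_model(measured_seconds, model_seconds):
    """1.0 means the kernel runs as fast as the model predicts."""
    return model_seconds / measured_seconds


def timed(kernel, *args):
    """Run the kernel once and return (result, wall-clock seconds)."""
    start = time.perf_counter()
    result = kernel(*args)
    return result, time.perf_counter() - start
```

Run before and after a refactoring, the reported fraction shows whether the change moved the kernel closer to what the hardware allows.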
DLR.de · Chart 23
◮ kernels and data structures belong together
◮ Separation of kernels, core and high-level algorithms
◮ through interfaces
◮ which are verified by tests, benchmarks and performance models
◮ Interfaces that expose raw data (e.g. reverse communication)
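For readers unfamiliar with the term, a reverse communication interface lets the solver hand a vector back to the caller whenever it needs an operator application, instead of calling the operator itself. The sketch below imitates this with a Python generator; the power method and the names `power_method_rc` and `drive` are illustrative assumptions, not code from the talk. Note how the caller receives the solver's internal vector as raw data.

```python
import math


def power_method_rc(x0, iterations):
    """Power method in reverse-communication style.

    Each `yield` hands the current normalized vector to the caller, who is
    expected to send back A times that vector. The generator's return value
    is the final Rayleigh quotient.
    """
    x = list(x0)
    theta = 0.0
    for _ in range(iterations):
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
        y = yield x  # raw vector goes out, A*x comes back in
        theta = sum(a * b for a, b in zip(x, y))
        x = y
    return theta


def drive(solver, apply_A):
    """Caller-side loop: answer every request with an operator application."""
    try:
        x = next(solver)
        while True:
            x = solver.send(apply_A(x))
    except StopIteration as stop:
        return stop.value
```

The solver needs no knowledge of how A is stored, but the caller must understand the layout of the vectors it is handed.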
DLR.de · Chart 24
◮ Project website
◮ Source code