 
Los Alamos National Laboratory LA-UR-16-22225
Kokkos Implementation of Albany: Towards Performance Portable Finite Element Code
I. Demeshko, O. Guba, R. P. Pawlowski, A. G. Salinger, W. F. Spotz, I. K. Tezaur, M. A. Heroux
04/07/2016
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
Performance Portability
Performance Portability
Exascale system: new programming models, new architecture, new libraries
Albany: agile component-based parallel unstructured mesh application
• A finite element based application development environment containing the "typical" building blocks needed for rapid deployment and prototyping of analysis capabilities.
• A Trilinos demonstration application, built almost exclusively from reusable libraries.
• Albany leverages 100+ packages/libraries.
• Open-source.
Strategic Goal: To enable the rapid development of new production codes embedded with transformational capabilities.
[Figure: Albany structure diagram. Labels: Software Quality Tools, Libraries, Demo Apps, Interfaces, Analysis Tools, Version Control, Main, Build System, Optimization, Input Parser, Regression Testing, UQ, Nonlinear Model, Problem Discretization, Application, Mesh Tools, Mesh Database, Solvers, Nonlinear, Interoperability, Mesh I/O, Transient, Use Case, Load Balancing, Linear Solve, ManyCore Node, PDE Assembly, Linear Solvers, Node Kernels, Field Manager, Iterative, PDE Terms, Multi-Core, Discretization, Multi-Level, Accelerators.]
Albany supports a wide variety of application physics areas:
• Heat transfer
• Fluid dynamics
• Quantum device modeling
• Structural mechanics
• Climate modeling
Albany team is Rapidly Developing Several New Component-Based Applications
1. Turbulent CFD for nuclear energy [NE]
2. Computational mechanics R&D [ASC]
3. Quantum device design [LDRD]
4. Extended MHD [ASCR]
5. CRADA partner's in-house code [CRADA]
6. Peridynamics solver [ASC]
7. Biogeochemical element cycling: climate [SciDAC]
8. Fuel rod degradation modeling [NE]
9. Ice Sheet dynamics [SciDAC]
10. Atmospheric Dynamics [LDRD]
+ Impacting Many Others
Codes are born: parallel, scalable, robust, with sensitivities, optimization, UQ
… and ready to adopt: embedded UQ, multi-core kernels, adaptivity, code coupling, ROM
[Figure panels: Temperature, Strain]
Our goal: to create an architecture-portable version of Albany by using the Kokkos library.
Albany to Kokkos refactoring
Trilinos packages used by Albany: Phalanx, Intrepid, Tpetra, Piro, Kokkos, MueLu
• Phalanx: manages dependencies between different components of Albany and manages data in the code.
• Intrepid: library of interoperable tools for compatible discretizations of Partial Differential Equations.
• Tpetra: implements linear algebra objects, including sparse graphs, sparse matrices, and dense vectors.
A new Albany-Kokkos implementation:
• has Kokkos::Views at the base layer
• has Kokkos::View-like temporary data
• has Kokkos kernels in place of the original nested loops
• is a single code base that runs and performs well on diverse HPC architectures
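
To illustrate why array data sits behind a view abstraction, here is a minimal sketch in plain C++ with no Kokkos dependency. `View2D`, `LayoutRight`, `LayoutLeft`, `sum_cell` and `fill_and_sum` are hypothetical names written for this note; they only mimic the idea behind `Kokkos::View` combined with `Kokkos::LayoutRight` or `Kokkos::LayoutLeft`, and are not Albany or Kokkos code. The point is that the kernel body indexes `v(cell, node)` the same way whichever memory layout is selected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for Kokkos::LayoutRight / Kokkos::LayoutLeft.
struct LayoutRight {  // row-major: the last index is contiguous in memory
  static std::size_t index(std::size_t i, std::size_t j,
                           std::size_t /*n0*/, std::size_t n1) {
    return i * n1 + j;
  }
};
struct LayoutLeft {  // column-major: the first index is contiguous in memory
  static std::size_t index(std::size_t i, std::size_t j,
                           std::size_t n0, std::size_t /*n1*/) {
    return i + j * n0;
  }
};

// Hypothetical 2D view: the memory layout is a template parameter,
// so calling code never computes a flat offset itself.
template <class Layout>
class View2D {
 public:
  View2D(std::size_t n0, std::size_t n1)
      : n0_(n0), n1_(n1), data_(n0 * n1, 0.0) {}

  double& operator()(std::size_t i, std::size_t j) {
    assert(i < n0_ && j < n1_);
    return data_[Layout::index(i, j, n0_, n1_)];
  }
  double operator()(std::size_t i, std::size_t j) const {
    assert(i < n0_ && j < n1_);
    return data_[Layout::index(i, j, n0_, n1_)];
  }
  std::size_t extent(int d) const { return d == 0 ? n0_ : n1_; }

 private:
  std::size_t n0_, n1_;
  std::vector<double> data_;
};

// Kernel body written once against the (cell, node) interface.
template <class ViewT>
double sum_cell(const ViewT& v, std::size_t cell) {
  double s = 0.0;
  for (std::size_t node = 0; node < v.extent(1); ++node) s += v(cell, node);
  return s;
}

// Fills v(cell, node) = 10 * cell + node, then sums the nodes of one cell.
template <class Layout>
double fill_and_sum(std::size_t num_cells, std::size_t num_nodes,
                    std::size_t cell) {
  View2D<Layout> v(num_cells, num_nodes);
  for (std::size_t c = 0; c < num_cells; ++c)
    for (std::size_t n = 0; n < num_nodes; ++n) v(c, n) = 10.0 * c + n;
  return sum_cell(v, cell);
}
```

Calling `fill_and_sum<LayoutRight>(3, 4, 2)` and `fill_and_sum<LayoutLeft>(3, 4, 2)` runs the same kernel body on both layouts and gives the same sum; only the flat offset computed inside the view differs.
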
FELIX: Albany Greenland Ice Sheet model
Albany FELIX project
• An unstructured-grid finite element ice sheet code for land-ice modeling (Greenland, Antarctica).
Project objective:
• Provide sea level rise prediction
• Run on new architecture machines (hybrid systems).
– 50% time spent in FE Assembly
– 50% time spent in Linear Solves
Funding Source: SciDAC
Collaborators: SNL, ORNL, LANL, LBNL, UT, FSU, SC, MIT, NCAR
Sandia Staff: A. Salinger, I. Kalashnikova, M. Perego, R. Tuminaro, J. Jakeman, M. Eldred
Phalanx graph for the Greenland Ice-Sheet model
[Figure: evaluator dependency graph. Nodes: Scatter Stokes, Stokes Resid, ViscosityFO, Stokes BodyForce, VecInterpolation, VecGradInterpolation, GradInterpolation, Gather Solution, Compute Basis Functions, Load State Field, Gather Coordinate Vector. Edges: 10:9, 9:5, 9:8, 9:3, 9:4, 5:4, 9:2, 8:7, 3:0, 4:0, 3:2, 4:2, 7:2, 7:6, 2:1.]
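
The graph above can be read as a dependency list that fixes the order in which evaluators run. The sketch below, in plain C++, derives one valid order from the edge pairs printed on the slide. It assumes that `a:b` means node `a` depends on node `b`, which the slide text does not state, and it uses the node numbers only, since the mapping from numbers to evaluator names is not given. All function names are hypothetical, and this is a generic depth-first topological sort, not Phalanx's own implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Edge list copied from the slide, read here as
// "first node depends on second node" (an assumption about the notation).
std::vector<std::pair<int, int>> slide_edges() {
  return {{10, 9}, {9, 5}, {9, 8}, {9, 3}, {9, 4}, {5, 4}, {9, 2}, {8, 7},
          {3, 0},  {4, 0}, {3, 2}, {4, 2}, {7, 2}, {7, 6}, {2, 1}};
}

// Depth-first post-order visit: a node's dependencies are appended
// to the order before the node itself.
void visit(int node, const std::vector<std::vector<int>>& deps,
           std::vector<char>& done, std::vector<int>& order) {
  if (done[node]) return;
  done[node] = 1;
  for (int d : deps[node]) visit(d, deps, done, order);
  order.push_back(node);
}

// Returns an evaluation order for nodes 0..num_nodes-1.
// Assumes the graph is acyclic.
std::vector<int> evaluation_order(
    int num_nodes, const std::vector<std::pair<int, int>>& edges) {
  std::vector<std::vector<int>> deps(num_nodes);
  for (const auto& e : edges) {
    assert(e.first >= 0 && e.first < num_nodes);
    assert(e.second >= 0 && e.second < num_nodes);
    deps[e.first].push_back(e.second);
  }
  std::vector<char> done(num_nodes, 0);
  std::vector<int> order;
  for (int n = 0; n < num_nodes; ++n) visit(n, deps, done, order);
  return order;
}

// Checks that every node appears after all nodes it depends on.
bool respects_dependencies(const std::vector<int>& order,
                           const std::vector<std::pair<int, int>>& edges) {
  std::vector<std::size_t> pos(order.size());
  for (std::size_t i = 0; i < order.size(); ++i) pos[order[i]] = i;
  for (const auto& e : edges)
    if (pos[e.first] <= pos[e.second]) return false;
  return true;
}
```

With the 11 nodes of the slide, `evaluation_order(11, slide_edges())` places node 10 last, since under the assumed reading it depends, directly or indirectly, on every other node.
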
Kokkos implementation (Greenland Ice-Sheet model)
• Loop over the number of worksets
• Copy solution vector to the Device
• Device: [Figure: the same evaluator graph as on the previous slide, from Gather Coordinate Vector through Scatter Stokes]
• Copy residual vector to the Host
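
The workset loop and the two copies named on this slide can be sketched as follows in plain C++. This is an illustration under stated assumptions, not Albany code: the "device" buffers are ordinary host vectors, the copies are done once per workset (the slide does not say whether they are per workset or global), and `evaluate_on_device` is a hypothetical placeholder kernel that stands in for the whole evaluator graph.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder for the device-side evaluators:
// residual[i] = 2 * solution[i] (not real physics).
void evaluate_on_device(const std::vector<double>& d_solution,
                        std::vector<double>& d_residual) {
  for (std::size_t i = 0; i < d_solution.size(); ++i)
    d_residual[i] = 2.0 * d_solution[i];
}

// Processes the entries in fixed-size worksets, staging each workset
// through separate "device" buffers (plain host vectors in this sketch).
std::vector<double> assemble(const std::vector<double>& h_solution,
                             std::size_t workset_size) {
  assert(workset_size > 0);
  std::vector<double> h_residual(h_solution.size(), 0.0);
  for (std::size_t begin = 0; begin < h_solution.size();
       begin += workset_size) {
    const std::size_t end =
        std::min(begin + workset_size, h_solution.size());
    // Host -> device: copy this workset's part of the solution vector.
    std::vector<double> d_solution(h_solution.data() + begin,
                                   h_solution.data() + end);
    std::vector<double> d_residual(d_solution.size(), 0.0);
    // Device: run the evaluators for this workset.
    evaluate_on_device(d_solution, d_residual);
    // Device -> host: copy this workset's part of the residual vector.
    std::copy(d_residual.begin(), d_residual.end(),
              h_residual.data() + begin);
  }
  return h_residual;
}
```

The result does not depend on `workset_size`; the workset size only changes how much data is staged and processed per pass.
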
Kokkos functor example in Albany
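
A minimal sketch of the functor pattern is given here in plain C++ without a Kokkos dependency; it is not the Albany code from the slide. `parallel_for_serial` is a hypothetical serial stand-in for `Kokkos::parallel_for`, and `InterpolateFunctor` and `interpolate` are illustrative names. The functor holds its data as members and exposes a `const` call operator taking one cell index, so each cell can be processed independently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial stand-in for Kokkos::parallel_for:
// calls f(i) for every i in [0, n).
template <class Functor>
void parallel_for_serial(std::size_t n, const Functor& f) {
  for (std::size_t i = 0; i < n; ++i) f(i);
}

// Interpolates nodal values to quadrature points, one cell per call.
// Arrays are flattened: nodal[cell * num_nodes + node],
// basis[node * num_qps + qp], at_qps[cell * num_qps + qp].
struct InterpolateFunctor {
  const double* nodal;  // input, size num_cells * num_nodes
  const double* basis;  // input, size num_nodes * num_qps
  double* at_qps;       // output, size num_cells * num_qps
  std::size_t num_nodes;
  std::size_t num_qps;

  // The former inner loops become the body executed for one cell.
  void operator()(std::size_t cell) const {
    for (std::size_t qp = 0; qp < num_qps; ++qp) {
      double value = 0.0;
      for (std::size_t node = 0; node < num_nodes; ++node)
        value += nodal[cell * num_nodes + node] * basis[node * num_qps + qp];
      at_qps[cell * num_qps + qp] = value;
    }
  }
};

// Builds the functor and launches it over all cells.
std::vector<double> interpolate(const std::vector<double>& nodal,
                                const std::vector<double>& basis,
                                std::size_t num_cells, std::size_t num_nodes,
                                std::size_t num_qps) {
  assert(nodal.size() == num_cells * num_nodes);
  assert(basis.size() == num_nodes * num_qps);
  std::vector<double> at_qps(num_cells * num_qps, 0.0);
  InterpolateFunctor f{nodal.data(), basis.data(), at_qps.data(), num_nodes,
                       num_qps};
  parallel_for_serial(num_cells, f);
  return at_qps;
}
```

Because each call writes only the output entries of its own cell, the loop over cells carries no dependence between iterations, which is the property a parallel launch relies on.
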
FELIX performance results
Evaluation environment: Shannon, 32 nodes:
• Two 8-core Sandy Bridge Xeon E5-2670 @ 2.6GHz (HT deactivated) per node
• 128GB DDR3 memory per node
• 2x NVIDIA K20x/K40 per node
Run configurations:
• Serial = 2 MPI processes
• OpenMP = 16 OpenMP threads
• CUDA = 1 NVIDIA K80 GPU
UVM for CPU-GPU data management
FELIX performance results
Evaluation environment: TITAN, 18,688 AMD Opteron nodes:
• 16 cores per node
• 1 K20X Kepler GPU per node
• 32GB + 6GB memory per node
Aeras
• Next generation global atmosphere model.
• Numerics are similar to the Community Atmosphere Model - Spectral Elements (CAM-SE).
• Model development: shallow water, X-Z hydrostatic, 3D hydrostatic, clouds, 3D non-hydrostatic.
Aeras performance results
Evaluation environment: Shannon, 32 nodes:
• Two 8-core Sandy Bridge Xeon E5-2670 @ 2.6GHz (HT deactivated) per node
• 128GB DDR3 memory per node
• 2x NVIDIA K20x/K40 per node
[Figure: two plots of time (sec) versus number of elements per workset (100 to 100000). Panels: "Aeras compute time (Total time - Gather/Scatter)" and "Aeras total time". Series: Serial (1 MPI thread per node), OpenMP (16 OpenMP threads per node), CUDA (1 NVIDIA K80 GPU per node).]
Aeras performance results
Evaluation environment: TITAN, 18,688 AMD Opteron nodes:
• 16 cores per node
• 1 K20X Kepler GPU per node
• 32GB + 6GB memory per node
Conclusion
• The new version of Albany provides architecture portability.
• Our numerical experiments on two climate applications implemented in Albany show that:
(1) a single code can execute correctly in several evaluation environments (MPI, OpenMP, CUDA UVM), and
(2) reasonable performance is achieved across the different architectures without implicit data management: speed-ups using OpenMP and GPUs can be achieved over an MPI-only run.
Acknowledgments
I would like to thank:
• C. R. Trott and H. C. Edwards for their help with Kokkos,
• Adam V. Delora for his work on Intrepid,
• Eric T. Phipps, Eric C. Cyr and Andrew Bradley for their help with Trilinos and Albany,
• Steve Price, Matt Hoffman and Mauro Perego for providing the data used in the FELIX land-ice runs.
Thank you! irina@lanl.gov