SLIDE 1

Los Alamos National Laboratory

Kokkos Implementation of Albany: Towards Performance Portable Finite Element Code

  • I. Demeshko, O. Guba,
  • R. P. Pawlowski, A. G. Salinger,
  • W. F. Spotz and I. K. Tezaur,

M.A. Heroux 04/07/2016

LA-UR-16-22225

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

SLIDE 2

Los Alamos National Laboratory 04/07/16 | 2

Performance Portability

SLIDE 3

Performance Portability

SLIDE 4

Performance Portability

EXASCALE SYSTEM

new architectures, new libraries, new programming models

SLIDE 5

SLIDE 6

SLIDE 7

Albany: agile component-based parallel unstructured mesh application

  • A finite element based application development environment containing the "typical" building blocks needed for rapid deployment and prototyping of analysis capabilities.
  • A Trilinos demonstration application, built almost exclusively from reusable libraries; Albany leverages 100+ packages/libraries.
  • Open-source.

Albany structure (diagram): Main; PDE Assembly (Field Manager, Discretization, PDE Terms, Node Kernels); Solvers (nonlinear, transient, optimization, UQ analysis tools; iterative linear solvers, multi-level methods); Mesh Tools (mesh I/O, mesh database, load balancing); ManyCore Node (multi-core, accelerators); Software Quality Tools (regression testing, version control, build system); Libraries, Interfaces, Demo Apps.

Strategic goal: to enable the rapid development of new production codes embedded with transformational capabilities.

SLIDE 8

Albany supports a wide variety of application physics areas: heat transfer, fluid dynamics, structural mechanics, quantum device modeling, and climate modeling.

SLIDE 9

Albany Team is Rapidly Developing Several New Component-Based Applications

1. Turbulent CFD for nuclear energy [NE]
2. Computational mechanics R&D [ASC]
3. Quantum device design [LDRD]
4. Extended MHD [ASCR]
5. CRADA partner's in-house code [CRADA]
6. Peridynamics solver [ASC]
7. Biogeochemical element cycling: climate [SciDAC]
8. Fuel rod degradation modeling [NE]
9. Ice Sheet dynamics [SciDAC]
10. Atmospheric Dynamics [LDRD]

+ Impacting Many Others

Codes are born: parallel, scalable, robust, with sensitivities, optimization, UQ

… and ready to adopt: embedded UQ, multi-core kernels, adaptivity, code coupling, ROM

[Example result figures: temperature and strain fields]

SLIDE 10

Our goal:

To create an architecture-portable version of Albany by using the Kokkos library.

SLIDE 11

Albany to Kokkos refactoring

Trilinos components involved in the Albany refactoring:

  • Phalanx — manages dependencies between different components of Albany and manages data in the code.
  • Intrepid — library of interoperable tools for compatible discretizations of partial differential equations.
  • Kokkos — performance-portable node-level parallelism and data layout.
  • Piro — embedded analysis and solver package.
  • Tpetra — implements linear algebra objects, including sparse graphs, sparse matrices, and dense vectors.
  • MueLu — multigrid preconditioning.

SLIDE 12

A new Albany-Kokkos implementation:

  • has Kokkos::Views at the base layer,
  • has Kokkos::View-like temporary data,
  • has Kokkos kernels replacing the original nested loops,
  • is a single code base that runs and is performant on diverse HPC architectures.

SLIDE 13

FELIX: Albany Greenland Ice Sheet model

SLIDE 14

Albany FELIX project

  • An unstructured-grid finite element ice sheet code for land-ice modeling (Greenland, Antarctica).

  • Project objective:
    – Provide sea level rise prediction
    – Run on new architecture machines (hybrid systems).

– 50% time spent in FE Assembly
– 50% time spent in Linear Solves

Funding Source: SciDAC
Collaborators: SNL, ORNL, LANL, LBNL, UT, FSU, SC, MIT, NCAR
Sandia Staff: A. Salinger, I. Kalashnikova, M. Perego, R. Tuminaro, J. Jakeman, M. Eldred
SLIDE 15

Phalanx graph for the Greenland Ice-Sheet model

[Phalanx evaluation graph: Gather Solution and Gather Coordinate Vector feed Compute Basis Functions; then VecInterpolation, VecGradInterpolation, ViscosityFO, Load State Field, GradInterpolation, Stokes BodyForce, and Stokes Resid; Scatter Stokes closes the graph.]

SLIDE 16

Kokkos implementation (Greenland Ice-Sheet model)

Copy the solution vector to the device; copy the residual vector back to the host.

Loop over the number of worksets.

[Same Phalanx evaluation graph as the previous slide, with each evaluator executed as a Kokkos kernel on the device.]

SLIDE 17

Kokkos functor example in Albany

SLIDE 18

FELIX Performance results

Evaluation environment: Shannon (32 nodes): two 8-core Sandy Bridge Xeon E5-2670 @ 2.6 GHz per node (HT deactivated), 128 GB DDR3 memory per node, 2x NVIDIA K20x/K40 per node.

Serial = 2 MPI processes; OpenMP = 16 OpenMP threads; CUDA = 1 NVIDIA K80 GPU; UVM for CPU-GPU data management.

SLIDE 19

FELIX performance results

Evaluation environment:

TITAN:

18,688 AMD Opteron nodes:

  • 16 cores per node,
  • 1 K20X Kepler GPU per node,
  • 32GB + 6GB memory per node
SLIDE 20

SLIDE 21

Aeras:

  • Next generation global atmosphere model.
  • Numerics are similar to the Community Atmosphere Model - Spectral Elements (CAM-SE).
  • Model development: shallow water, X-Z hydrostatic, 3D hydrostatic, clouds, 3D non-hydrostatic.

SLIDE 22

Aeras performance results

[Plots: Aeras compute time (total time minus Gather/Scatter) and Aeras total time, in seconds, vs. number of elements per workset (100–100,000), for Serial (1 MPI thread per node), OpenMP (16 OpenMP threads per node), and CUDA (1 NVIDIA K80 GPU per node).]

Evaluation environment: Shannon (32 nodes): two 8-core Sandy Bridge Xeon E5-2670 @ 2.6 GHz per node (HT deactivated), 128 GB DDR3 memory per node, 2x NVIDIA K20x/K40 per node.

SLIDE 23

Aeras performance results

Evaluation environment:

TITAN:

18,688 AMD Opteron nodes:

  • 16 cores per node,
  • 1 K20X Kepler GPU per node,
  • 32GB + 6GB memory per node

SLIDE 24

Conclusion

  • The new version of Albany provides architecture portability.
  • Our numerical experiments on two climate applications implemented in Albany show that:

(1) a single code can execute correctly in several evaluation environments (MPI, OpenMP, CUDA UVM), and (2) reasonable performance is achieved across the different architectures without explicit data management: speed-ups using OpenMP and GPUs can be achieved over an MPI-only run.

SLIDE 25

Acknowledgments

I would like to thank:

  • C. R. Trott and H. C. Edwards for their help with Kokkos,
  • Adam V. Delora for his work on Intrepid,
  • Eric T. Phipps, Eric C. Cyr and Andrew Bradley for their help with Trilinos and Albany,
  • Steve Price, Matt Hoffman and Mauro Perego for providing the data used in the FELIX land-ice runs.

SLIDE 26

Thank you! irina@lanl.gov