Improving Uintah’s Scalability Through the Use of Portable Kokkos-Based Data Parallel Tasks
John Holmen1, Alan Humphrey1, Daniel Sunderland2, Martin Berzins1
University of Utah1 Sandia National Laboratories2
Figure: Uintah architecture. Labels: ARCHES; DSL: NEBO; UQ drivers; Simulation Controller; Scheduler; Load Balancer; Runtime System; Task; Data Warehouse; PIDX; VisIt; Hypre Linear Solver; CPUs; GPUs; Xeon Phis.
Surviving text fragments: model; communications; C++ task graph; runtime system; execution, & computation; node.
DOE NNSA PSAAP II Center
supercritical clean coal boiler at scale with Uintah
50-92 meters
solved today
simultaneously
Diffusion – Convection + Source/Sinks
requires integration of incoming intensity about a sphere
methods
$$\int_{4\pi} I_{in}\, d\Omega = \sum_{ray=1}^{N} I_{ray}\, \frac{4\pi}{N}$$
incoming intensity absorbed by a given cell
to compute incoming intensity backwards to the origin cell
adds its contribution to the incoming intensity absorbed by the origin cell
contribution becomes
Back path of ray from S to emitter E, 9-cell structured mesh patch
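The integration of incoming intensity about a sphere can be approximated by averaging over N randomly directed rays and scaling by the full solid angle, 4π. The following is a minimal sketch of that estimate only, not Uintah's implementation: `Direction`, `sample_uniform_direction`, and `estimate_incoming_intensity` are hypothetical names, and the per-ray intensity is supplied as a caller-provided function instead of being computed by tracing a ray back through a mesh.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <random>

// Unit direction vector for one ray (hypothetical helper type).
struct Direction { double x, y, z; };

// Draw a direction uniformly distributed over the unit sphere.
Direction sample_uniform_direction(std::mt19937& rng) {
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    const double pi = std::acos(-1.0);
    const double z = 2.0 * unit(rng) - 1.0;   // cos(theta), uniform in [-1, 1]
    const double phi = 2.0 * pi * unit(rng);  // azimuth, uniform in [0, 2*pi)
    const double r = std::sqrt(1.0 - z * z);
    return {r * std::cos(phi), r * std::sin(phi), z};
}

// Monte Carlo estimate of the incoming intensity integrated over the sphere:
// (4*pi / N) * sum over rays of I_ray.
// ray_intensity is an assumption: it stands in for tracing one ray.
double estimate_incoming_intensity(
    const std::function<double(const Direction&)>& ray_intensity,
    int num_rays, unsigned seed) {
    assert(num_rays > 0);
    std::mt19937 rng(seed);
    const double pi = std::acos(-1.0);
    double sum = 0.0;
    for (int ray = 0; ray < num_rays; ++ray) {
        sum += ray_intensity(sample_uniform_direction(rng));
    }
    return sum * 4.0 * pi / num_rays;
}
```

With a constant intensity of 1 for every ray, the estimate reduces to 4π regardless of N; for direction-dependent intensity the statistical error shrinks roughly as one over the square root of N.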
given cell and/or timestep
rays that never reach an origin cell
Figure labels: Node 1, Global Mesh, Local Mesh
across compute nodes
each origin cell that it owns across the entire domain
information and physics properties for the entire domain
Figure labels: Node 2, Node 3, Node 4
communication
computational domain
memory usage, and MPI message volume
surrounded by successively coarser grids
taken between cells becomes larger
Figure labels: Local Mesh, Coarse Mesh, Fine Mesh
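When a fine mesh is surrounded by successively coarser grids, a ray leaving the fine region must be relocated to the coarse cell that contains its current fine cell, and the distance covered per cell grows with each coarser level. The sketch below illustrates only that index and spacing arithmetic. It assumes a uniform integer refinement ratio between levels and non-negative cell indices; `CellIndex`, `fine_to_coarse`, and `cell_width` are hypothetical names, not Uintah functions.

```cpp
#include <cassert>

// Integer cell index on a structured mesh level (hypothetical helper type).
struct CellIndex { int i, j, k; };

// Map a fine-level cell index to the coarse-level cell that contains it.
// Assumes non-negative indices and the same refinement ratio in each direction.
CellIndex fine_to_coarse(const CellIndex& fine, int refinement_ratio) {
    assert(refinement_ratio > 0);
    assert(fine.i >= 0 && fine.j >= 0 && fine.k >= 0);
    return {fine.i / refinement_ratio,
            fine.j / refinement_ratio,
            fine.k / refinement_ratio};
}

// Cell width on a level that is levels_coarser levels coarser than the finest.
// Each coarsening step multiplies the width by the refinement ratio.
double cell_width(double fine_dx, int refinement_ratio, int levels_coarser) {
    assert(refinement_ratio > 0 && levels_coarser >= 0);
    double dx = fine_dx;
    for (int level = 0; level < levels_coarser; ++level) {
        dx *= refinement_ratio;
    }
    return dx;
}
```

For example, with a refinement ratio of 4, fine cell (5, 8, 3) lies inside coarse cell (1, 2, 0), and a ray crossing that coarse cell covers four fine-cell widths per step in each direction.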
for CPU-, GPU-, and MIC-based architectures
performant kernels
Example Stencil Task
Unew = Uold + dt*F(Uold, Uhalo)
Old Data Warehouse: GET Uold, Uhalo
New Data Warehouse: PUT Unew
Halo Sends, Halo Receives: Uhalo
Use Kokkos abstraction layer that maps loops
appropriate memory abstractions
Figure labels: Kokkos Unmanaged Views; Memory Structure; Cache, and Vectorization Friendly
implementations
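The example stencil task, Unew = Uold + dt*F(Uold, Uhalo), can be sketched as a loop body handed to a generic loop-mapping function, which is the pattern that a portability layer such as Kokkos builds on. The code below is a plain C++ illustration under stated assumptions and does not use the Kokkos or Uintah APIs: `for_each_cell` is a hypothetical serial stand-in for a portable parallel loop, F is assumed to be a 1D diffusion stencil, and the data warehouse GET and PUT are represented by a function argument and return value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a portable parallel loop: applies body(i) for
// every i in [begin, end). Serial here; a portability layer would map the
// same body onto the threads of the target architecture.
template <typename Body>
void for_each_cell(std::size_t begin, std::size_t end, Body body) {
    for (std::size_t i = begin; i < end; ++i) {
        body(i);
    }
}

// One stencil task on a 1D patch.
// u_old_with_halo layout: [left halo | interior cells ... | right halo].
// Returns Unew for the interior cells only.
// Assumption: F is the 1D diffusion operator (u[c-1] - 2u[c] + u[c+1]) / dx^2.
std::vector<double> stencil_task(const std::vector<double>& u_old_with_halo,
                                 double dt, double dx) {
    assert(u_old_with_halo.size() >= 3);
    const std::size_t n = u_old_with_halo.size() - 2;
    std::vector<double> u_new(n);
    const double* u = u_old_with_halo.data();  // "GET Uold, Uhalo"
    double* out = u_new.data();                // "PUT Unew"
    for_each_cell(0, n, [=](std::size_t i) {
        const std::size_t c = i + 1;  // skip the left halo cell
        const double F = (u[c - 1] - 2.0 * u[c] + u[c + 1]) / (dx * dx);
        out[i] = u[c] + dt * F;
    });
    return u_new;
}
```

Because the loop body captures only raw pointers and scalars, the same body could in principle be passed to a different loop mapper without changing the task logic, which is the design point the slide makes about mapping loops through an abstraction layer.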
selection of optimal iteration schemes via Kokkos
Figure labels (result plots): Medium, Large
larger thread/core counts per node
patch work can be computed cooperatively by multiple threads sharing a cache)
introduced machines
Program
Support provided by the Department of Energy, National Nuclear Security Administration, under Award Number(s) DE-NA0002375.
Computing time provided by the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program.
Texas Advanced Computing Center resources used under Award Number(s) MCA08X004, "Resilience and Scalability of the Uintah Software".
Thanks to TACC and those involved with the CCMSC and Uintah past and present.
Uintah Download: http://www.uintah.utah.edu