Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
UNCLASSIFIED
Delivering science and technology to protect our nation and promote world stability
LA-UR-16-28559 Approved for public release; distribution is unlimited.
Change is driven by the evolution of computing technology: growth in scale and in node complexity.
Common theme: methods that can tolerate latency variability within a node and across an extreme-scale system.
The complexity of node architecture that applications must account for to make effective use of the system has increased significantly.
Building leadership in computational science, from advanced materials to novel programming models:
Diverse architectures. Common theme at exascale: the need for asynchronous methods tolerant of latency variability within a computational node and across an extreme-scale system.
Diverse questions of interest: diverse physics topologies.
Resolving grain-level physics: improved fidelity in experiment (DARHT, MaRIE) and simulation; resolving multiple scales (fine and coarse) and bridging between them (multi-scale methods).
Current methods (operator split, MPI) have poor asynchrony; the goal is to expose more parallelism for asynchronous execution.
[Figure: programming models arranged from coarse- to fine-grained asynchronous concurrency: MPI + threads, asynchronous MPI + threads, and Legion with a control & state manager.]
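The latency tolerance discussed above can be illustrated with a minimal C++ sketch. This is generic std::async code, not any Legion construct; the function name and the simulated delays are illustrative. Independent tasks are launched asynchronously and their results gathered as futures, so a slow task delays only itself, not the launch of its siblings.

```cpp
#include <chrono>
#include <future>
#include <thread>
#include <vector>

// Toy model of latency-tolerant asynchronous execution: independent work
// items run as asynchronous tasks, so per-task latency variability does
// not serialize the launch of other work. Not the Legion API.
double sum_of_squares_async(const std::vector<double>& chunks) {
    std::vector<std::future<double>> futures;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        double c = chunks[i];
        futures.push_back(std::async(std::launch::async, [c, i] {
            // Simulate latency that varies from task to task.
            std::this_thread::sleep_for(std::chrono::milliseconds((i * 3) % 5));
            return c * c;
        }));
    }
    double total = 0.0;
    for (auto& f : futures) total += f.get();  // tasks complete in any order
    return total;
}
```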
Next-generation DOE supercomputers require near real-time analysis (~10 min) of data bursts, demanding burst computational intensities exceeding an exaflop.
From detected signal to a model of the sample.
[Figure: workflow from the sample, to the detected signal, to computation.]
Tasks: describe parallel execution elements and algorithmic operations. Sequential semantics, with out-of-order execution. Example task body: [=](int i) { rho(i) = … }
Regions: describe the decomposition of the computational domain and its data (e.g., subregions rho0 and rho1).
Mapper: describes how tasks and regions should be mapped to the target architecture.
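A simplified model of the three roles (tasks, regions, mapper) can be sketched in plain C++. This is NOT the real Legion API: the types Region, Task, and Mapper and the launch helper below are illustrative only, showing how the mapping policy is kept separate from the algorithm.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// A Region names a piece of the decomposed domain and holds its data.
struct Region { std::string name; std::vector<double> data; };

// A Task is a computation applied to a region.
using Task = std::function<void(Region&)>;

// The Mapper decides which processor handles each region; policy here is
// simple round-robin over the available processors.
struct Mapper {
    std::vector<std::string> processors;
    std::string map(std::size_t region_index) const {
        return processors[region_index % processors.size()];
    }
};

// "Launch" a task over every region; return which processor each region
// was mapped to. (Execution here is sequential; a real runtime would run
// independent tasks asynchronously.)
std::map<std::string, std::string> launch(Task task, std::vector<Region>& regions,
                                          const Mapper& mapper) {
    std::map<std::string, std::string> placement;
    for (std::size_t i = 0; i < regions.size(); ++i) {
        placement[regions[i].name] = mapper.map(i);  // mapping decision
        task(regions[i]);                            // algorithmic operation
    }
    return placement;
}
```

The point of the separation: changing the Mapper's processor list retargets the same task/region code to a different machine without touching the algorithm.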
[Figure: the mapper assigns a task and its regions (Region 1, Region 2) to the resources of a heterogeneous node: CPUs with NUMA domains (NUMA 0, NUMA 1) and GPUs, each with its own memory.]
Weak scaling results on Titan out to 8K nodes
Challenge: sharing compute resources between the application and in situ analysis.
Approach: use a data-centric programming approach to scheduling and mapping between the application and the in situ analysis.
Workload: 3D simulation of a realistic primary reference fuel (PRF) blend of iso-octane and n-heptane, involving 116 chemical species and 861 reactions, with in situ Chemical Explosive Mode Analysis (CEMA).
Results: the cost of the in situ calculations was reduced by a factor of 10, improving the achievable performance on Titan and Piz Daint. Flexible scheduling and mapping reduces analysis overhead to less than 1% of overall execution time, with additional benefits from improvement in overall application performance.

Time per time step (s):
Configuration            Without CEMA   With CEMA
MPI Fortran, Piz Daint   6.79           7.30
Legion, Piz Daint        1.79           1.80
MPI Fortran, Titan       7.25           8.42
Legion, Titan            2.25           2.44
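The scheduling idea can be sketched generically: run the in situ analysis asynchronously on a snapshot of the previous step's data while the next simulation step proceeds, so most of the analysis time is hidden. This is plain C++ with std::async, not the actual S3D/Legion implementation; the "physics update" and "analysis" below are illustrative stand-ins.

```cpp
#include <future>
#include <numeric>
#include <vector>

struct StepResult { std::vector<double> field; double analysis; };

// Stand-in for one step of the physics solver.
std::vector<double> simulate_step(std::vector<double> field) {
    for (auto& x : field) x += 1.0;
    return field;
}

// Stand-in for in situ analysis (here: mean of the field).
double analyze(const std::vector<double>& snapshot) {
    return std::accumulate(snapshot.begin(), snapshot.end(), 0.0)
           / static_cast<double>(snapshot.size());
}

StepResult run_two_steps(std::vector<double> field) {
    field = simulate_step(field);
    // Launch analysis of step 1's data asynchronously (std::async copies
    // the snapshot) ...
    auto analysis = std::async(std::launch::async, analyze, field);
    // ... while step 2 advances concurrently on the application's copy.
    field = simulate_step(field);
    return {field, analysis.get()};
}
```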
Checkpointing data from tasks incurs 15-76% overhead, vs. 2-12% when checkpointing is managed by the runtime.
Performance of S3D checkpoints running on 64 nodes (i.e., 1,024 cores) of Titan.
Graph analytics is dominated by irregular data access, which challenges performance and scalability on heterogeneous systems and processors.
PageRank: widely used to rank webpages.
GraphX: a distributed graph engine on top of Spark. Ligra: a state-of-the-art shared-memory graph engine.
Comparison: GraphX (4 nodes), Ligra, and Legion (4 nodes) on graphs with 16M vertices / 256M edges and 128M vertices / 2B edges.