

SLIDE 1

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

UNCLASSIFIED

Delivering science and technology to protect our nation and promote world stability

SLIDE 2

Confluence on the path to exascale?

Galen M. Shipman

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

LA-UR-16-28559 Approved for public release; distribution is unlimited.

SLIDE 3

National Strategic Computing Initiative (NSCI)

  • Create systems that can apply exaflops of computing power to exabytes of data
  • Keep the United States at the forefront of HPC capabilities
  • Improve HPC application developer productivity
  • Make HPC readily available
  • Establish hardware technology for future HPC systems

SLIDE 4

Exascale Computing Project

DOE is a lead agency within NSCI, with the responsibility that the DOE Office of Science and the DOE National Nuclear Security Administration execute a joint program focused on advanced simulation through a capable exascale computing program emphasizing sustained performance on relevant applications.

SLIDE 5

ECP Goals

  • Develop a broad set of modeling and simulation applications that meet the requirements of the scientific, engineering, and nuclear security programs of the Department of Energy and the NNSA
  • Develop a productive exascale capability in the US by 2023, including the required software and hardware technologies
  • Prepare two or more DOE Office of Science and NNSA facilities to house this capability
  • Maximize the benefits of HPC for US economic competitiveness and scientific discovery

SLIDE 6

Exascale will be driven by application needs and the demands of changing technology.

SLIDE 7

Exascale challenges for future application capability

– Performance and productivity at extreme scale
– Agile response to new scientific questions; integrating new physics

Change is driven by computing technology evolution: growth in scale and node complexity.

  • Massive parallelism of many-core/GPU nodes
– Leads to a push away from bulk synchrony
– Task- and data-parallel programming models
  • Deep memory hierarchies (on node)
– Cache and scratchpad management
– Challenge of spatial complexity in codes
– Need to get the granularity of tasks right
  • Extreme scales
– Power, load balance, and performance variability
– Reliability and resilience
– Data management and data analysis

Common theme: methods that can tolerate latency variability within a node and across an extreme-scale system

[Figure: node architectures, 1996 vs. 2016] The complexity of node architecture that applications must consider to make effective use of the system has increased significantly.

SLIDE 8

Diverse architectures

LANL’s Next Generation Code (NGC): Multi-Physics simulation at exascale

Common theme at exascale: need for asynchronous methods tolerant of latency variability within a computational node, and across an extreme-scale system

Control & state manager

Legion MPI + threads Asynchrono us MPI + threads Coarse Fine

Resolving grain-level physics: improved fidelity in experiment (DARHT, MaRIE) and simulation

  • Models at different scales (fine to coarse) & bridging between them (multi-scale methods)
– Coarse: multi-physics coupling
– Fine: higher fidelity and asynchronous concurrency

Building leadership in computational science from advanced materials to novel programming models

  • Traditional physics and CS methods (operator split, MPI) have poor asynchrony
  • New programming models expose more parallelism for asynchronous execution

Diverse questions of interest: diverse physics topologies


SLIDE 9

EXascale Atomistics for Accuracy, Length and Time (EXAALT)

[Figure: ParSplice, parallel replica dynamics using data-driven asynchronous tasks (replicate, dephase, and correct phases over τ_corr in parallel time). Molecular dynamics alone cannot reach the boundaries of length-time space; improvement is anticipated with aggressive co-design.]

SLIDE 10

Data Analytics at the Exascale for Free Electron Lasers (ExaFEL)

  • Perform prompt LCLS data analysis on next-generation DOE supercomputers
  • LCLS will increase its data throughput by three orders of magnitude by 2025
  • Enabling new photon science from the LCLS will require near real-time analysis (~10 min) of data bursts, requiring burst computational intensities exceeding an exaflop

From detected signal to a model of the sample:

  • High-throughput analysis of individual images
  • Ray-tracing for inverse-modeling of the sample
  • Requires data-driven asynchronous computation
  • A distributed task-based runtime

SLIDE 11

These applications require a data-aware asynchronous programming environment

SLIDE 12

Legion: a data-aware, task-based programming system

Tasks (execution model)

  • Describe parallel execution elements and algorithmic operations
  • Sequential semantics, with out-of-order execution and in-order completion

Regions (data model)

  • Describe decomposition of the computational domain, with:
– Privileges (read-only, read+write, reduce)
– Coherence (exclusive, atomic, etc.)

Mapper

  • Describes how tasks and regions should be mapped to the target architecture

[Diagram: a task body written as a C++ lambda, [=](int i) { rho(i) = … }, operating on region instances rho0 and rho1, with the Mapper assigning them to hardware]

Mapper allows architecture-specific optimization without affecting the correctness of the task and domain descriptions
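
To make tasks, regions, and the mapper concrete, here is a minimal sketch in the style of Legion's C++ runtime API (the task IDs, the single FID_RHO field, and the scale_task body are illustrative, and exact signatures vary across Legion versions):

    #include "legion.h"
    using namespace Legion;

    enum TaskIDs { TOP_LEVEL_TASK_ID, SCALE_TASK_ID };
    enum FieldIDs { FID_RHO };

    // Leaf task: its declared privileges (READ_WRITE below) bound its effects.
    void scale_task(const Task *task,
                    const std::vector<PhysicalRegion> &regions,
                    Context ctx, Runtime *runtime) {
      const FieldAccessor<READ_WRITE, double, 1> rho(regions[0], FID_RHO);
      Rect<1> r = runtime->get_index_space_domain(ctx,
          task->regions[0].region.get_index_space());
      for (PointInRectIterator<1> pir(r); pir(); pir++)
        rho[*pir] *= 2.0;                      // illustrative computation
    }

    void top_level_task(const Task *task,
                        const std::vector<PhysicalRegion> &regions,
                        Context ctx, Runtime *runtime) {
      // Region = index space (domain decomposition) + field space (data).
      IndexSpace is = runtime->create_index_space(ctx, Rect<1>(0, 1023));
      FieldSpace fs = runtime->create_field_space(ctx);
      {
        FieldAllocator fa = runtime->create_field_allocator(ctx, fs);
        fa.allocate_field(sizeof(double), FID_RHO);
      }
      LogicalRegion rho = runtime->create_logical_region(ctx, is, fs);

      // Launch with explicit privileges and coherence; the runtime infers
      // dependences and may execute out of order behind the sequential code.
      TaskLauncher launcher(SCALE_TASK_ID, TaskArgument(NULL, 0));
      launcher.add_region_requirement(
          RegionRequirement(rho, READ_WRITE, EXCLUSIVE, rho));
      launcher.add_field(0, FID_RHO);
      runtime->execute_task(ctx, launcher);    // deferred execution

      runtime->destroy_logical_region(ctx, rho);
      runtime->destroy_field_space(ctx, fs);
      runtime->destroy_index_space(ctx, is);
    }

    int main(int argc, char **argv) {
      Runtime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
      TaskVariantRegistrar top(TOP_LEVEL_TASK_ID, "top_level");
      top.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
      Runtime::preregister_task_variant<top_level_task>(top, "top_level");
      TaskVariantRegistrar scale(SCALE_TASK_ID, "scale");
      scale.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
      Runtime::preregister_task_variant<scale_task>(scale, "scale");
      return Runtime::start(argc, argv);
    }

Because the launcher declares READ_WRITE/EXCLUSIVE access to the region, the runtime can defer and reorder execution while preserving the sequential semantics of the task and domain descriptions.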

SLIDE 13

Mapping Tasks and Data to Hardware Resources

[Diagram: the Mapper assigns a Task and its regions (Region 1, Region 2) to hardware resources: CPUs with NUMA 0 / NUMA 1 memories and GPUs with their attached memory]

  • Application selects where tasks run and where regions are placed (see the mapper sketch below)
  • These mapping decisions are computed dynamically
  • Decouples correctness from performance
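
As a sketch of how such a selection can be expressed, the following subclasses Legion's DefaultMapper to steer the "scale" task from the earlier sketch onto a GPU when one is available; NodeAwareMapper is a hypothetical name, and the override point is approximate across Legion versions:

    #include <cstring>
    #include "legion.h"
    #include "default_mapper.h"
    using namespace Legion;
    using namespace Legion::Mapping;

    // Mapping policy lives here, not in the task or region code, so it can
    // change per machine without affecting correctness.
    class NodeAwareMapper : public DefaultMapper {
    public:
      NodeAwareMapper(MapperRuntime *rt, Machine machine, Processor local)
        : DefaultMapper(rt, machine, local, "node_aware_mapper") {}

      Processor default_policy_select_initial_processor(
          MapperContext ctx, const Task &task) override {
        // Prefer a local GPU for the "scale" task (assumes a GPU variant
        // of the task has been registered); otherwise defer to the default.
        if (!local_gpus.empty() && strcmp(task.get_task_name(), "scale") == 0)
          return local_gpus[0];
        return DefaultMapper::default_policy_select_initial_processor(ctx, task);
      }
    };

    // Install one mapper instance per local processor at startup.
    static void create_mappers(Machine machine, Runtime *runtime,
                               const std::set<Processor> &local_procs) {
      for (Processor p : local_procs)
        runtime->replace_default_mapper(
            new NodeAwareMapper(runtime->get_mapper_runtime(), machine, p), p);
    }

    // In main(), before Runtime::start(argc, argv):
    //   Runtime::add_registration_callback(create_mappers);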
SLIDE 14

Can a new programming system address the needs of simulation, analysis, workflow, and “big data”?

SLIDE 15

Simulation: Legion S3D Execution and Performance Details

Weak scaling results on Titan out to 8K nodes

  • Mapping for 96³ Heptane workload
  • Top line shows runtime
  • Different species required mapping changes (e.g., due to limited GPU memory size), i.e., tuning is often not just app- and system-specific…
SLIDE 16

Analysis: A Unified Approach for Programming in situ Analysis & Visualization

Challenge:

  • Data produced by applications is too large for post-processing, so in situ analysis and visualization are needed.
  • In situ processing works best when tightly coupled with applications, in order to avoid unnecessary data movement and copies and to share compute resources between the application and in situ analysis.
  • Manual data mapping and task scheduling impact application portability and productivity.

Approach: use a data-centric programming approach to scheduling and mapping between the application and in situ analysis

  • Legion runtime developed as part of the ExaCT Co-Design Center. http://legion.stanford.edu
  • Promotes data to a first-class programming construct
  • Separates implementation of computations from mapping to hardware resources
  • Implements data transformation and sublinear algorithms as well as visualization pipeline abstractions

Results:

  • Flexible data-driven tasking model reduced overhead of in situ calculations by a factor of 10
  • Time-to-solution improved by 9x, obtaining over 80% of the achievable performance on Titan and Piz Daint
  • Enabled building blocks for new science: first large-scale 3-D simulation of a realistic primary reference fuel (PRF) blend of iso-octane and n-heptane, involving 116 chemical species and 861 reactions

Time per time step (s) for S3D, without and with in situ Chemical Explosive Mode Analysis (CEMA):

                             Without CEMA   With CEMA
  MPI Fortran (Piz Daint)        6.79          7.25
  Legion (Piz Daint)             1.79          2.25
  MPI Fortran (Titan)            7.30          8.42
  Legion (Titan)                 1.80          2.44

Flexible scheduling and mapping reduces analysis overhead to less than 1% of overall execution time, with additional benefits from improvement in overall application performance.

SLIDE 17

Workflow: Integration of External Resources into the Programming Model

  • We can't ignore the full workflow!
  • Amdahl's law sneaks in if we consider I/O from tasks: 15-76% overhead vs. 2-12% for the original Fortran code!
  • Introduce new semantics for operating with external resources (e.g., storage, databases, etc.)
– Incorporates these resources into the deferred execution model
– Maintains consistency between different copies of the same data
– Underlying parallel I/O handled by HDF5 but scheduled by the runtime (a sketch of this HDF5 path follows below)
  • Allows applications to adjust the snapshot interval based on available storage and system fault concerns instead of overheads

Performance of S3D checkpoints running on 64 nodes (i.e., 1,024 cores) of Titan.

THANKS OLCF!
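
For reference, the parallel-I/O path mentioned above, when written by hand rather than scheduled by the runtime, looks roughly like this minimal collective HDF5 checkpoint (plain MPI + HDF5, independent of Legion; the file name, dataset name, and sizes are illustrative):

    #include <hdf5.h>
    #include <mpi.h>
    #include <vector>

    // Each rank writes its slab of a 1-D field into one shared file.
    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, nranks;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      const hsize_t local_n = 1024;             // elements per rank
      std::vector<double> rho(local_n, rank);   // illustrative field data

      // Open the file collectively through the MPI-IO driver.
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
      hid_t file = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      // Global dataspace plus this rank's hyperslab within it.
      hsize_t global_n = local_n * nranks;
      hid_t filespace = H5Screate_simple(1, &global_n, NULL);
      hid_t dset = H5Dcreate2(file, "rho", H5T_NATIVE_DOUBLE, filespace,
                              H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
      hsize_t offset = local_n * (hsize_t)rank;
      H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                          &local_n, NULL);
      hid_t memspace = H5Screate_simple(1, &local_n, NULL);

      // Collective write of all slabs.
      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
      H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, rho.data());

      H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
      H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
      MPI_Finalize();
      return 0;
    }

In the Legion version these writes become deferred operations, so the runtime can overlap them with computation rather than stalling tasks at the I/O boundary.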

SLIDE 18

Big Data: Exploring the use of Legion in Graph Applications

  • Big Data: high-frequency, massive, and irregular data access
– E.g., graph traversal randomly accesses vertices
– Need a deep memory hierarchy to achieve both performance and scalability
  • Research interest is shifting to heterogeneous systems
– E.g., GPUs, CPUs, and FPGAs
– Need for a generic runtime that coordinates different processors
  • PageRank: a graph-based application widely used to rank webpages (a minimal sketch of the kernel follows below)
– Legion CPU version: 600 lines of code
– Legion GPU version: 160 lines of code
– Legion Mapper: 260 lines of code

GraphX: a distributed graph engine on top of Spark. Ligra: a state-of-the-art shared-memory graph engine.

[Chart: PageRank time (log scale, 0.1-100 s) on RMAT24 (16M vertices, 256M edges) and RMAT27 (128M vertices, 2B edges), comparing GraphX (4 nodes), Ligra, and Legion (4 nodes)]
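
For orientation, the kernel being benchmarked is the standard damped PageRank power iteration; a minimal serial C++ sketch (not the Legion implementation) looks like:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // The graph is an adjacency list of out-links; rank flows from each
    // page to its successors with damping factor d.
    std::vector<double> pagerank(const std::vector<std::vector<int>> &out_links,
                                 int iters = 20, double d = 0.85) {
      const std::size_t n = out_links.size();
      std::vector<double> rank(n, 1.0 / n), next(n);
      for (int it = 0; it < iters; ++it) {
        std::fill(next.begin(), next.end(), (1.0 - d) / n);
        for (std::size_t u = 0; u < n; ++u) {
          if (out_links[u].empty()) continue;  // dangling nodes ignored here
          double share = d * rank[u] / out_links[u].size();
          for (int v : out_links[u]) next[v] += share;  // random vertex access
        }
        rank.swap(next);
      }
      return rank;
    }

The scattered updates to next[v] in the inner loop are exactly the irregular access pattern noted above, which is why memory hierarchy and mapping dominate performance at scale.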

SLIDE 19

  • Exascale should push us to rethink how we program applications:
– Data-aware: to maximize available throughput in a complex distributed hierarchy of memories
– Declarative: describing what needs to be computed rather than precisely how it needs to be computed on a complex system
– Runtime assisted: greater reliance on runtime mechanisms to efficiently schedule computation and data movement (see the sketch after this list)
  • Big data has embraced many of these concepts
  • Will we see a confluence of approaches?
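
As a small illustration of the runtime-assisted, deferred style (standard C++ futures here, not Legion): the calling code keeps a sequential shape while the runtime is free to schedule the two independent stages out of order:

    #include <future>
    #include <numeric>
    #include <vector>

    double stage(const std::vector<double> &v) {  // an independent work unit
      return std::accumulate(v.begin(), v.end(), 0.0);
    }

    int main() {
      std::vector<double> a(1 << 20, 1.0), b(1 << 20, 2.0);
      // Launch both stages; execution order is up to the runtime.
      auto fa = std::async(std::launch::async, stage, std::cref(a));
      auto fb = std::async(std::launch::async, stage, std::cref(b));
      // Sequential semantics at the point of use: get() enforces completion.
      double total = fa.get() + fb.get();
      return total > 0.0 ? 0 : 1;
    }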
SLIDE 20

Acknowledgements

Alex Aiken, Michael Bauer, Ben Bergen, Timo Bremer, Jacqueline Chen, David Daniel, Kei Davis, Mattan Erez, Charles Ferenbaugh, Sam Gutierrez, Zhihao Jia, Quincey Koziol, Yongkee Kwon, Wonchan Lee, Carlos Maltzahn, Pat McCormick, Nick Moss, Scott Pakin, Rob Ross, Christine Sweeney, Brad Settlemyer, Elliot Slaughter, Sean Treichler, Noah Watkins and many more…
