SLIDE 1

Los Alamos National Laboratory

Using Charm++ to Support Multiscale Multiphysics on the Trinity Supercomputer

Robert Pavel, Christoph Junghans, Susan M. Mniszewski, Timothy C. Germann
April 18, 2017

LA-UR-17-23218

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

SLIDE 2

Exascale Co-Design Center for Materials in Extreme Environments (ExMatEx)

  • ExMatEx was one of three* DOE Office of Science application co-design centers (2011-16)
    *Others are: CESAR (ANL/reactors), ExaCT (SNL-CA/combustion)
  • Large-scale collaborations between national labs, academia, and vendors
  • Coordinated with related DOE NNSA co-design efforts
  • Goal: to establish the relationships between algorithms, software stack, and architectures needed to enable exascale-ready science applications
  • Two ultimate objectives:
    • Identify the requirements for the exascale ecosystem that are necessary to perform computational materials science simulations (both single- and multi-scale)
    • Demonstrate and deliver a prototype scale-bridging materials science application based upon adaptive physics refinement

SLIDE 3

Tabasco Test Problem: Modeling a Taylor cylinder impact test

  • The simple Taylor model cannot account for the twinning and anisotropy of the tantalum sample used in LANL experiments (MST-8), and thus the final shapes do not match.
  • The physics goal of this demonstration is to show that the more accurate VPSC fine-scale model, with an appropriate reduced-dimensionality (~60 degrees of freedom) model of texture, can (qualitatively or quantitatively?) reproduce the experimental shape.

P.J. Maudlin, J.F. Bingert, J.W. House, and S.R. Chen, "On the modeling of the Taylor cylinder impact test for orthotropic textured materials: experiments and simulations," Int. J. Plasticity 15(2), 139–166 (1999).

[Figure: Ta Taylor cylinder, CoEVP simulation]

SLIDE 4

Workflow Overview

[Diagram: the coarse-scale model is decomposed into Subdomain 1 … Subdomain N across Node 1 … Node N/2; each node runs an Adaptive Sampler that calls on-demand fine-scale models (FSM) and an eventually consistent distributed database (DB).]

SLIDE 5

Task-Based Scale-bridging Code (TaBaSCo)

  • Demonstrate the feasibility of at-scale heterogeneous computations composed of:
    • Coarse-scale Lagrangian hydrodynamics
    • Dynamically launched constitutive model calculations
    • Results of fine-scale evaluations stored for reuse in databases
    • Taylor fine-scale model evaluation
    • VPSC fine-scale model evaluation
    • Adaptive sampling, which queries the database, interpolates results, and decides when to spawn fine-scale evaluations (see the sketch after this list)
  • Combines an asynchronous task-based runtime environment with persistent database storage
  • Provides load balancing and checkpointing for fault-tolerant modes
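To make the adaptive-sampling step concrete, here is a minimal, self-contained C++ sketch of the accept-or-spawn decision. It is illustrative only: TaBaSCo uses kriging interpolation and an approximate nearest-neighbor index (FLANN or M-tree, see the later slides), whereas this sketch does a linear scan and simply reuses the nearest stored result, and the names (Point, FineScaleDB, evaluateFineScale, tolerance) are hypothetical.

    // Minimal sketch of the adaptive-sampling decision (hypothetical names).
    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <limits>
    #include <vector>

    struct Point { std::vector<double> in, out; };   // fine-scale input -> response
    using FineScaleDB = std::vector<Point>;          // stand-in for the results database

    static double dist(const std::vector<double>& a, const std::vector<double>& b) {
      double s = 0;
      for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
      return std::sqrt(s);
    }

    // Reuse a nearby stored evaluation when the nearest point is within
    // `tolerance`; otherwise spawn a fine-scale evaluation and store it for reuse.
    std::vector<double> adaptiveSample(
        const std::vector<double>& query, FineScaleDB& db, double tolerance,
        const std::function<std::vector<double>(const std::vector<double>&)>& evaluateFineScale) {
      const Point* nearest = nullptr;
      double dmin = std::numeric_limits<double>::max();
      for (const Point& p : db) {                    // nearest-neighbor query (linear scan here)
        double d = dist(query, p.in);
        if (d < dmin) { dmin = d; nearest = &p; }
      }
      if (nearest && dmin <= tolerance)
        return nearest->out;                         // interpolation accepted (kriging would blend neighbors)
      Point p{query, evaluateFineScale(query)};      // spawn the fine-scale model
      db.push_back(p);                               // store the new result for later reuse
      return p.out;
    }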

SLIDE 6

Tabasco Framework

  • Asynchronous task-based runtimes explored
    • Charm++ (built/ran on Trinity)
    • libCircle (built/ran on Darwin, but not Trinity)
    • MPI Task Pool (dual-binary version built/ran on Darwin; single-binary version ran for small examples on Trinity)
  • Nearest neighbor search
    • Mtree vs. FLANN (both worked on Trinity; a minimal FLANN query sketch follows this list)
  • Database storage
    • In-memory HashMap (was limited for long runs)
    • POSIX (became our reliable database option for Trinity)
    • POSIX/DataWarp (only worked for short runs on Trinity)
    • REDIS (never ran on Trinity)
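As a concrete illustration of the FLANN option mentioned above, a minimal k-nearest-neighbor query with FLANN's C++ interface could look like the following. The dimensionality, index parameters, and data are placeholders; TaBaSCo's own ApproxNearestNeighbors wrapper is not reproduced here, only the underlying FLANN calls.

    // Minimal FLANN k-nearest-neighbor query (placeholder data and parameters).
    #include <flann/flann.hpp>
    #include <vector>

    int main() {
      const int dim = 6, npoints = 1000, knn = 4;
      std::vector<float> data(npoints * dim, 0.0f);   // stored fine-scale input points
      std::vector<float> q(dim, 0.0f);                // query point

      flann::Matrix<float> dataset(data.data(), npoints, dim);
      flann::Matrix<float> query(q.data(), 1, dim);

      // Build a randomized kd-tree index, then run a k-NN search against it.
      flann::Index<flann::L2<float> > index(dataset, flann::KDTreeIndexParams(4));
      index.buildIndex();

      std::vector<int> idxBuf(knn);
      std::vector<float> distBuf(knn);
      flann::Matrix<int> indices(idxBuf.data(), 1, knn);
      flann::Matrix<float> dists(distBuf.data(), 1, knn);
      index.knnSearch(query, indices, dists, knn, flann::SearchParams(128));
      return 0;
    }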

SLIDE 7

Chare Wrapper Mapping

Chare                  Dim  Class                          Lib                     Resolution            Migrate
CoarseScaleModel       1D   Lulesh                         MPI                     Rank                  N
FineScaleModel         2D   Constitutive/ElastoPlasticity  CM                      Element               Y
Evaluate               1D   Taylor/VPSC                    CM                      Element               Y
NearestNeighborSearch  1D   ApproxNearestNeighbors         CM/FLANN/Mtree          Request (Service)     N
DBInterface            1D   KrigingDataBase/SingletonDB    CM/Redis/libhio/POSIX   Read/Write (Service)  N
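For orientation, the chares in the table above might be declared in a Charm++ interface (.ci) file roughly as follows. This is a hypothetical sketch: only the chare names and array dimensions follow the table; the module name, main chare, and entry methods are placeholders rather than TaBaSCo's actual interface.

    // Hypothetical Charm++ interface sketch (placeholder entry methods).
    mainmodule tabasco {
      mainchare Main {
        entry Main(CkArgMsg* m);
      };
      array [1D] CoarseScaleModel      { entry CoarseScaleModel(); };
      array [2D] FineScaleModel        { entry FineScaleModel(); };
      array [1D] Evaluate              { entry Evaluate(); };
      array [1D] NearestNeighborSearch { entry NearestNeighborSearch(); };
      array [1D] DBInterface           { entry DBInterface(); };
    }

With the generated tabasco.decl.h included, the 2D fine-scale array could then be created with, for example, CProxy_FineScaleModel::ckNew(numDomains, elementsPerDomain); those constructor arguments are again placeholders.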

SLIDE 8

Trinity: Advanced Technology System

(a mixture of Intel Haswell and Knights Landing (KNL) processors)

SLIDE 9

Open Science Trinity Port of TaBaSCo

[Diagram: TaBaSCo components (Coarse Scale (LULESH), Neighbor Search, Neighbor Interpolation, Fine Scale and Evaluate (Taylor/VPSC), and a NoSQL Database of prior fine-scale results) mapped onto Haswell nodes, Burst Buffer nodes, and Haswell (Phase 1) or KNL (Phase 2) nodes.]

SLIDE 10

Tabasco Weak and Strong Scaling

  • Weak scaling
    • Brute force (w/o AS) and Adaptive Sampling (w AS)
    • Edge = 64, height = 26 to 13,312 (46,592 to 23,855,104 elements)
    • Good scaling up to 128 nodes
    • Communication overhead for >= 256 nodes
  • Strong scaling
    • Brute force (w/o AS) and Adaptive Sampling (w AS)
    • Edge = 128, height = 208 (1,490,944 elements)

[Plot: Tabasco Weak Scaling, runtime (s) vs. number of nodes (1 to 256), with and without adaptive sampling (w AS, w/o AS).]

[Plot: Tabasco Strong Scaling (128x208), runtime (s, log scale) vs. number of nodes (1 to 256), with and without adaptive sampling (w AS, w/o AS).]

SLIDE 11

Tabasco Brute Force (w/o Adaptive Sampling) on Trinity (512 nodes) – step 0

SLIDE 12

Tabasco Brute Force (w/o Adaptive Sampling) on Trinity (512 nodes) – step 500

SLIDE 13

Tabasco Brute Force (w/o Adaptive Sampling) on Trinity (512 nodes) – step 5000

SLIDE 14

Tabasco Brute Force (w/o Adaptive Sampling) on Trinity (512 nodes) – step 10,000

SLIDE 15

Tabasco Brute Force (w/o Adaptive Samp.) on Trinity (512 nodes) – step 20,000

SLIDE 16

Tabasco Brute Force (w/o Adaptive Samp.) on Trinity (512 nodes) – step 22,000


SLIDE 17

Early Work in Hybrid Runs on Trinity

  • Trinity is a machine that was installed in two stages
    • 9408 compute nodes with Intel Haswell processors
      • 32 CPU cores per node, each with 2 hyperthreads
    • 9500 compute nodes with Intel Knights Landing (KNL) processors
      • 68 cores per node
    • While similar, the node types have different strengths
  • Goal for Tabasco is to perform a hybrid run in which both stages are utilized
    • Coarse-scale solver and less compute-intensive work on Haswell nodes
    • Fine-grain solver on KNLs
  • Used Charm++'s Logical Machine Entities to identify KNLs and Haswells
    • Then used a custom mapper to assign chares based on physical node type (a minimal sketch follows this list)
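To make that mapping concrete, here is a minimal, hypothetical sketch of a custom Charm++ array map that selects PEs by physical node type. It is not TaBaSCo's mapper: the class name, the constructor argument, and the "more PEs per physical node means KNL" heuristic are assumptions for illustration, and the usual Charm++ generated headers are assumed to exist (from a .ci file declaring, e.g., group NodeTypeMap : CkArrayMap { entry NodeTypeMap(bool); };).

    // Hypothetical map group: places array elements only on PEs of one node type.
    #include <algorithm>
    #include <vector>
    #include "charm++.h"
    #include "nodetypemap.decl.h"   // generated from the hypothetical .ci above

    class NodeTypeMap : public CkArrayMap {
      std::vector<int> pes;   // PEs belonging to the selected node type
    public:
      // wantManyCores = true  -> prefer the "big" (KNL-like) nodes,
      // wantManyCores = false -> prefer the "small" (Haswell-like) nodes.
      explicit NodeTypeMap(bool wantManyCores) {
        // Heuristic (an assumption of this sketch): in a hybrid job, KNL nodes
        // expose more PEs per physical node than Haswell nodes.
        int maxPes = 0;
        for (int node = 0; node < CmiNumPhysicalNodes(); ++node)
          maxPes = std::max(maxPes, CmiNumPesOnPhysicalNode(node));
        for (int node = 0; node < CmiNumPhysicalNodes(); ++node) {
          bool manyCores = (CmiNumPesOnPhysicalNode(node) == maxPes);
          if (manyCores != wantManyCores) continue;
          int *pelist, n;
          CmiGetPesOnPhysicalNode(node, &pelist, &n);
          pes.insert(pes.end(), pelist, pelist + n);
        }
        if (pes.empty()) pes.push_back(0);   // fall back to PE 0
      }
      // Overrides CkArrayMap::procNum: round-robin elements over the selected PEs.
      int procNum(int /*arrayHdl*/, const CkArrayIndex &idx) {
        return pes[idx.data()[0] % pes.size()];
      }
    };
    #include "nodetypemap.def.h"

An array would then be bound to such a map at creation time, roughly via CkArrayOptions opts(n); opts.setMap(CProxy_NodeTypeMap::ckNew(true)); before calling ckNew on the fine-scale chare array.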

SLIDE 18

Open Science Trinity Port of TaBaSCo

[Diagram repeated from Slide 9: mapping of TaBaSCo components onto Haswell, Burst Buffer, and Haswell (Phase 1) or KNL (Phase 2) nodes.]

SLIDE 19

Proof of Concept Hybrid Run: Host Platform

  • Current Proof of Concept implementation running on Trinitite
  • Run with three types of node
    • Dual-socket Haswell "Solver" node
      • 32 MPI ranks per node
    • KNL "Solver" node
      • 64 MPI ranks per node
    • Dedicated Haswell "Organizer" node
      • 4 MPI ranks per node
  • Run on a subset of Trinitite
  • Goal was to work with the system stack and get initial performance results
  • Larger runs planned following unification of the Trinity phases

SLIDE 20

Proof of Concept Hybrid Run: Simulation Configuration

  • Restricted to "Taylor" solver
  • No Adaptive Sampling
  • Maximize work, minimize communication
  • Re-used a run from Open Science Phase 1
    • 48 edge elements
    • 104 height elements
    • 4 domains (coarse-scale chares)
    • 34,944 fine-scale evaluations
    • 100 time steps

SLIDE 21

Proof of Concept Hybrid Run Results: Raw Execution Time

SLIDE 22

Approximation of Energy Savings

  • Used very rough approximations to estimate the energy savings of the hybrid run
    • Execution times of the Proof of Concept runs
    • TDP from the spec sheets for each processor
    • Don't do this
    • Assumed all else the same

  Energy ≈ TDP × Execution Time
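As a purely illustrative worked example of this approximation (the numbers are hypothetical and are not the measured results on the following slides): a solver phase that runs for 600 s on processors with a combined TDP of 270 W would be charged roughly 270 W × 600 s = 162 kJ, or about 0.045 kWh. Under this crude model, halving the execution time on a processor with 1.5x the TDP would still count as a net energy saving (0.5 × 1.5 = 0.75 of the original estimate).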

SLIDE 23

Very Early TDP-Based Energy Results

SLIDE 24

Energy Savings Through Use of KNL Solvers

SLIDE 25

Questions?