

SLIDE 1

Operated by Los Alamos National Security, LLC for DOE/NNSA

LA-UR 09-02032

Total Work-Flow: Exploiting Hybrid Computing Architectures for Scientific Computing

Ben Bergen Computational Physics (CCS-2) Los Alamos National Laboratory

Brian Albright (X-1), Kevin Bowers (D.E. Shaw), Lin Yin (X-1), William Daughton (X-1)

ScicomP 15

SLIDE 2

Overview

• Roadrunner System Overview
• Basic Considerations and Programming Models
• Adapting VPIC Kinetic Plasma Code to Roadrunner
• Optimizing Total Workflow
• Open Science on Roadrunner

SLIDE 3

Roadrunner is a Cluster

SLIDE 4

Roadrunner is a Cluster of Clusters

SLIDE 5

Roadrunner is a Cluster of Clusters with Accelerators

SLIDE 6

Triblade Compute Node

SLIDE 7

Original Blade Topology

• One-to-one affinity between Opteron core and Cell processor
• Newer versions of DaCS support two-to-one affinity
• Not sure about four-to-one?

SLIDE 8

Roadrunner: Basic Considerations for Adaptation

• First hybrid supercomputer of the current generation, incorporating x86_64, PowerPC, and SPU ISAs
• Codes require three executables
  - x86_64 executable runs on the Opteron host processor
  - PowerPC executable runs on the Power Processing Element (PPE) accelerator processor
  - SPU threads run on the eight Synergistic Processing Element (SPE) special-purpose vector unit processors
  - Three compilers: gcc, ppu-gcc, spu-gcc (also XL C/C++)
• Design considerations: process launch and synchronization

Roadrunner has three different architectures: Opteron, PowerPC, SPE

SLIDE 9

Roadrunner: Basic Considerations for Adaptation

• Incorporates main memory on the Opteron and Cell eDP blades, plus the user-controlled local store SRAM on the SPEs
• Codes that run on Roadrunner must handle communication between these memory spaces
  - Distributed-memory communication between Opteron hosts
  - Point-to-point communication between Opteron host and Cell accelerator
  - Direct Memory Access (DMA) communication between Cell main memory and SPE local store memory
• Opteron and Cell have different endianness
  - Some byte-swapping is necessary
• Cell blades are diskless
• Design considerations: communication and I/O

Roadrunner has three different address spaces
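The byte-swapping mentioned above can be sketched in a few lines of C. The Opteron is little-endian and the Cell is big-endian, so 32-bit words crossing between them need their byte order reversed on one side. This is only an illustration: the function names (swap_u32, f32_to_wire, f32_from_wire, swap_buffer_u32) are hypothetical and are not taken from VPIC or DaCS.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of one 32-bit word. */
uint32_t swap_u32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}

/* Convert a float to its byte-swapped bit pattern for sending.
 * memcpy is used to read the bits without violating aliasing rules. */
uint32_t f32_to_wire(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return swap_u32(bits);
}

/* Convert a received, byte-swapped bit pattern back to a float. */
float f32_from_wire(uint32_t wire)
{
    uint32_t bits = swap_u32(wire);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Swap a whole buffer of 32-bit words in place (hypothetical helper). */
void swap_buffer_u32(uint32_t *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = swap_u32(buf[i]);
}
```

Swapping on only one side of the link (for example, always on the host) keeps the accelerator code free of conversion logic; which side pays the cost is a design choice, not something the slide specifies.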

SLIDE 10

Roadrunner: Basic Considerations for Adaptation

Multiple tools and programming models: MPI, DaCS, libSPE2

• Process launch and synchronization
  - MPI, DaCS/ALF, libSPE2
• Communication
  - MPI, DaCS/ALF, libSPE2

Hierarchical/heterogeneous advantages

• Fault tolerance
  - Faults can be caught at multiple levels
• Scalability
  - Strong scalability is possible on SPEs
  - Weak scalability through distributed memory

SLIDE 11

Programming Models: Host-Centric (Function Offload)

[Diagram: Opteron and Cell]

Pros
• Allows staged development
  - Existing MPI codes will run on Opterons
• Synchronous or asynchronous function offload to accelerator
• Minimizes reliance on PPE (poor performer!)

Cons
• Potential data-movement bottleneck
• Offload cost must be amortized by work done on accelerator

SLIDE 12

Programming Models: Accelerator-Centric

[Diagram: Opteron and Cell]

Pros
• Also allows staged development
  - Existing MPI codes will run on PowerPC (PPE)
• Hides complexity of hybrid architecture
• Avoids data-movement bottleneck

Cons
• Heavier reliance on PPE
• Computationally intensive portions of code must run on SPEs
• Requires “relay” to forward message traffic

SLIDE 13

Message Passing Relay

[Diagram: Cell, Opteron, Cell, Opteron]

Direct point-to-point communication is not possible between Cells

SLIDE 14

Message Passing Relay

[Diagram: Cell, Opteron, Cell, Opteron, with data in transit]

Relay forwards messages through hosts to peer

SLIDE 15

Programming Models: All Roads Lead Everywhere

There is a natural evolution of both of these approaches into a fully hybrid computing model

• Initial difference is in the program locus, or control process
• In the “evolved” model, the host process runs a task queue
• Tasks may be offloaded to other host-type cores or to accelerators
• Task data may live in the worker’s memory to avoid data-movement bottlenecks
• More on how we can use this to follow!

[Diagram labels: Opteron, Cell, Scheduler, Cell/GPU, Opteron Core]
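A host-side task queue of the kind described above could look roughly like the following C sketch. It is single-threaded and runs every task inline, so it only shows the bookkeeping: each task names a target worker kind, and the host drains the queue in FIFO order. All names here (task_queue_t, queue_push, queue_run_all, add_one, the worker kinds, the capacity of 64) are assumptions made for illustration, not part of VPIC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical worker kinds a task may be offloaded to. */
typedef enum { WORKER_HOST_CORE, WORKER_ACCELERATOR, WORKER_KINDS } worker_kind_t;

typedef void (*task_fn_t)(void *data);

typedef struct {
    task_fn_t     fn;      /* work to perform */
    void         *data;    /* task data; may live in the worker's memory */
    worker_kind_t target;  /* where the host wants the task to run */
} task_t;

#define QUEUE_CAPACITY 64

typedef struct {
    task_t items[QUEUE_CAPACITY];
    size_t head, count;
    size_t dispatched[WORKER_KINDS];  /* tasks sent to each worker kind */
} task_queue_t;

void queue_init(task_queue_t *q)
{
    q->head = 0;
    q->count = 0;
    for (int k = 0; k < WORKER_KINDS; ++k)
        q->dispatched[k] = 0;
}

/* Append a task; returns 0 on success, -1 if the queue is full. */
int queue_push(task_queue_t *q, task_fn_t fn, void *data, worker_kind_t target)
{
    assert(fn != NULL);
    if (q->count == QUEUE_CAPACITY)
        return -1;
    size_t tail = (q->head + q->count) % QUEUE_CAPACITY;
    q->items[tail] = (task_t){ fn, data, target };
    q->count++;
    return 0;
}

/* Drain the queue in FIFO order. A real scheduler would hand each task to a
 * worker of the requested kind; this sketch calls it inline and counts it. */
size_t queue_run_all(task_queue_t *q)
{
    size_t done = 0;
    while (q->count > 0) {
        task_t t = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAPACITY;
        q->count--;
        t.fn(t.data);
        q->dispatched[t.target]++;
        done++;
    }
    return done;
}

/* Example task used for demonstration: increments an int counter. */
void add_one(void *data)
{
    ++*(int *)data;
}
```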

SLIDE 16

Particle-In-Cell (PIC) Methods Simulate Plasma Physics

• One application of VPIC is to simulate Laser Plasma Interactions (LPI) critical to understanding Inertial Confinement Fusion (ICF) at the National Ignition Facility (NIF)
• Several difficulties arise during the compression of hohlraum capsules
  - Laser scattering – not enough energy to compress capsule
  - Laser scattering – laser does not target desired areas (asymmetric compression)
  - Pre-heating – electrons heat plasma, making compression more difficult

[Figures: LLNL pF3D modeling of a laser beam; VPIC modeling of a single laser speckle; integrated LLNL Hydra modeling of ICF experiment]

SLIDE 17

Particle-In-Cell Method

[Diagram: the time iteration cycles through Advance Particles, Accumulate Currents, Update Fields, and Interpolate Field Effects, over a spatial domain of grids and particles]
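The four-step cycle can be illustrated with a toy C time step. This is not VPIC's algorithm: it assumes a 1D periodic grid with unit cell width, nearest-grid-point interpolation, non-relativistic motion, and normalized units in which the field update reduces to dE/dt = -J. The names toy_particle_t and toy_pic_step are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float x;  /* position, must satisfy 0 <= x < ncell */
    float v;  /* velocity */
    float q;  /* charge */
    float m;  /* mass */
} toy_particle_t;

/* Advance np particles and the field E by one step of length dt.
 * E and J each hold ncell values; J is recomputed from scratch. */
void toy_pic_step(toy_particle_t *p, size_t np,
                  float *E, float *J, size_t ncell, float dt)
{
    assert(ncell > 0);
    const float length = (float)ncell;

    for (size_t c = 0; c < ncell; ++c)
        J[c] = 0.0f;

    for (size_t n = 0; n < np; ++n) {
        /* Interpolate field effects: nearest-grid-point lookup. */
        size_t c = (size_t)p[n].x;

        /* Advance particle: update velocity, then position. */
        p[n].v += (p[n].q / p[n].m) * E[c] * dt;
        p[n].x += p[n].v * dt;

        /* Periodic wrap back into [0, length). */
        while (p[n].x < 0.0f)    p[n].x += length;
        while (p[n].x >= length) p[n].x -= length;

        /* Accumulate current in the cell the particle started in. */
        J[c] += p[n].q * p[n].v;
    }

    /* Update fields from the accumulated current. */
    for (size_t c = 0; c < ncell; ++c)
        E[c] -= dt * J[c];
}
```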

SLIDE 18

VPIC – Vector Particle-In-Cell

• 3D, fully relativistic, electromagnetic Particle-In-Cell (PIC) code
  - Self-consistent evolution of a kinetic plasma
  - Charge conserving (no implicit solve)
• Optimized for data motion
  - Single precision – halves the memory bandwidth demand and doubles the theoretical peak
  - Single-pass particle processing
  - Field interpolation coefficients are pre-computed
• Optimized for modern architectures
  - Uses short-vector SIMD intrinsics (SSE, Altivec, SPU)
    * Assumes that particles do not leave the voxel in which they started
    * Exceptions are handled separately
  - O(N) particle sorting
    * Improves spatial locality of particle data access
    * Improves temporal locality of field data access
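One common way to get an O(N) particle sort is a counting sort keyed on the voxel index; the slides do not say which algorithm VPIC uses, so the sketch below is an assumption. It reuses the 32-byte particle layout shown later in the deck; sort_by_voxel and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct particle {
    float   dx, dy, dz;  /* position (relative to voxel) */
    int32_t i;           /* index of voxel containing particle */
    float   ux, uy, uz;  /* particle normalized momentum */
    float   q;           /* particle charge */
} particle_t;

/* Copy src[0..np) into dst ordered by voxel index (stable).
 * Cost is O(np + nvoxel). Returns 0 on success, -1 if allocation fails. */
int sort_by_voxel(const particle_t *src, particle_t *dst,
                  size_t np, int32_t nvoxel)
{
    assert(nvoxel > 0);
    size_t *start = calloc((size_t)nvoxel + 1, sizeof *start);
    if (start == NULL)
        return -1;

    /* Pass 1: count particles per voxel (shifted by one slot). */
    for (size_t n = 0; n < np; ++n) {
        assert(src[n].i >= 0 && src[n].i < nvoxel);
        start[src[n].i + 1]++;
    }

    /* Prefix sum: start[v] becomes the first output slot for voxel v. */
    for (int32_t v = 0; v < nvoxel; ++v)
        start[v + 1] += start[v];

    /* Pass 2: scatter each particle to its voxel's next free slot. */
    for (size_t n = 0; n < np; ++n)
        dst[start[src[n].i]++] = src[n];

    free(start);
    return 0;
}
```

After sorting, particles in the same voxel are contiguous, so the field data for a voxel is touched once per run of particles rather than at random.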

SLIDE 19

Porting to Roadrunner (things that we did)

• Message Passing Relay (MP Relay)
  - Flattens communication topology
  - Allows logical point-to-point communication between Cell processors
  - Abstracts remote I/O layer for restart and visualization dumps
• Pipelined execution
  - Code restructured for data-parallel thread execution
  - Current support for serial, pthreads, and SPE threads
  - Simple, common interface: init(), finalize(), execute(function_t), sync()
• Particle data structures
  - Optimized for efficient communication via DMA requests
  - Can be tuned to cache size on traditional cached-memory architectures (padding)
• Voxel cache (access to field data)
  - Fully associative, least recently used (LRU) policy
  - Simple interface: voxel_cache_fetch() and voxel_cache_wait()
• Text overlay support
  - Allows acceleration of field advance, particle sorting and accumulators
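A fully associative LRU cache like the voxel cache above can be sketched as follows. This is a simplified stand-in, not VPIC's code: the fetch is a synchronous memcpy from a backing array rather than an asynchronous DMA with a separate wait call, and the names (toy_voxel_cache_t, toy_cache_init, toy_cache_fetch), the line count, and the payload size are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINES  4   /* assumed number of cache lines */
#define VOXEL_FLOATS 8   /* assumed field payload per voxel */

typedef struct {
    int32_t  voxel;               /* cached voxel index, -1 if empty */
    uint64_t last_use;            /* logical time of most recent access */
    float    data[VOXEL_FLOATS];
} cache_line_t;

typedef struct {
    cache_line_t line[CACHE_LINES];
    uint64_t     clock;           /* incremented on every fetch */
    unsigned     hits, misses;
    const float *backing;         /* stand-in for field data in main memory */
} toy_voxel_cache_t;

void toy_cache_init(toy_voxel_cache_t *c, const float *backing)
{
    memset(c, 0, sizeof *c);
    c->backing = backing;
    for (int k = 0; k < CACHE_LINES; ++k)
        c->line[k].voxel = -1;
}

/* Return the cached field data for a voxel, loading it on a miss.
 * Fully associative: any line may hold any voxel, so every line is checked.
 * On a miss the least recently used line (smallest last_use) is replaced. */
const float *toy_cache_fetch(toy_voxel_cache_t *c, int32_t voxel)
{
    assert(voxel >= 0);
    int victim = 0;
    c->clock++;

    for (int k = 0; k < CACHE_LINES; ++k) {
        if (c->line[k].voxel == voxel) {
            c->line[k].last_use = c->clock;
            c->hits++;
            return c->line[k].data;
        }
        if (c->line[k].last_use < c->line[victim].last_use)
            victim = k;
    }

    cache_line_t *line = &c->line[victim];
    memcpy(line->data, c->backing + (size_t)voxel * VOXEL_FLOATS,
           sizeof line->data);
    line->voxel = voxel;
    line->last_use = c->clock;
    c->misses++;
    return line->data;
}
```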

SLIDE 20

Pipeline Abstraction

[Diagram: the master thread calls init, execute, sync, finalize]

• Worker threads block for execute message to reduce thread creation overhead
  - pthreads implementation uses condition variables
  - SPE implementation uses mailboxes
  - SPE symbols are exposed to the PPE through _SPUEAR_ linker magic
  - Function call is implemented through mailbox message

SLIDE 21

VPIC applies best strategy to particle advance

• Data are processed in segments of even multiples of 16 particles
  - Segments are accessed in blocks of up to 512 particles (16 KB largest possible single DMA request)
  - Triple-buffered: streaming data paradigm (read, update, write)
• Block processing groups particles in sets of 4
  - Optimal for single-precision SIMD operations
  - Inner loop is 4x hand unrolled

typedef struct particle {
  float dx, dy, dz;  // position (relative to voxel)
  int32_t i;         // index of voxel containing particle
  float ux, uy, uz;  // particle normalized momentum
  float q;           // particle charge
} particle_t;        // 32 bytes
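The block size follows directly from the struct: 16 KB per DMA request divided by 32 bytes per particle gives 512 particles. The C sketch below checks the struct size at compile time and splits a segment into DMA-sized blocks. The helper names (next_block, count_blocks) are hypothetical, and the triple buffering itself is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct particle {
    float   dx, dy, dz;  /* position (relative to voxel) */
    int32_t i;           /* index of voxel containing particle */
    float   ux, uy, uz;  /* particle normalized momentum */
    float   q;           /* particle charge */
} particle_t;

/* Eight 4-byte fields with no padding: 32 bytes per particle. */
_Static_assert(sizeof(particle_t) == 32, "particle_t must be 32 bytes");

enum {
    MAX_DMA_BYTES   = 16 * 1024,           /* largest single DMA request */
    BLOCK_PARTICLES = MAX_DMA_BYTES / 32   /* 512 particles per block */
};

/* Number of particles in the next block, given how many remain. */
size_t next_block(size_t remaining)
{
    return remaining < (size_t)BLOCK_PARTICLES ? remaining
                                               : (size_t)BLOCK_PARTICLES;
}

/* Number of DMA blocks needed to stream one segment.
 * Segments are assumed to hold a multiple of 16 particles. */
size_t count_blocks(size_t segment_particles)
{
    assert(segment_particles % 16 == 0);
    size_t blocks = 0;
    while (segment_particles > 0) {
        segment_particles -= next_block(segment_particles);
        ++blocks;
    }
    return blocks;
}
```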

SLIDE 22

Overlays

• VPIC’s particle advance logic maxes out the Local Store (LS)
  - Particle advance data uses 206 KB
  - This leaves ~50 KB for text (machine instructions)
• Overlays are segments of text that can be loaded/unloaded from LS
  - Expand the effective maximum size of an SPE program
  - Avoid overhead of starting new SPE threads (prohibitive)
  - Limited by management table size
• IBM has implemented overlay support as a software cache
  - Overlay manager fetches text that is not in LS (DMA call)
  - No prefetch capability

SLIDE 23

Overlay Properties

[Diagram: SPE local store (data, root segment, region 1, region 2) and main memory holding text segments SA, SB, SC, SD, SE, SF]

SLIDE 24

Overlay Properties

[Diagram: SPE local store (data, root segment, region 1, region 2) and main memory holding text segments SA, SB, SC, SD, SE, SF; the text portion of local store is highlighted]

Text is partitioned into regions with a static root segment

SLIDE 25

Overlay Properties

[Diagram: SPE local store (data, root segment, region 1, region 2) and main memory holding text segments SA, SB, SC, SD, SE, SF; the segments for region 1 are highlighted]

Each region can be filled by specific segments of text

SLIDE 26

Overlay Properties

[Diagram: SPE local store (data, root segment, region 1, region 2) and main memory holding text segments SA, SB, SC, SD, SE, SF; size labels shown: 32 KB, 28 KB, 32 KB, 20 KB]

The size of a region is determined by its largest segment
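The sizing rule can be written as a short C sketch: a region must be able to hold any one of its segments, so its size is the maximum segment size, and the text footprint in local store is the root segment plus one slot per region. The function names are hypothetical; the 28/32/20 KB example reuses the size labels from the slide, though which segment carries which label is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Size of one overlay region: the largest segment that can be loaded into it. */
size_t region_size_kb(const size_t *segment_kb, size_t nseg)
{
    assert(nseg > 0);
    size_t largest = segment_kb[0];
    for (size_t s = 1; s < nseg; ++s)
        if (segment_kb[s] > largest)
            largest = segment_kb[s];
    return largest;
}

/* Local store used by text: the static root segment plus every region. */
size_t text_footprint_kb(size_t root_kb, const size_t *region_kb, size_t nregion)
{
    size_t total = root_kb;
    for (size_t r = 0; r < nregion; ++r)
        total += region_kb[r];
    return total;
}
```

For example, segments of 28 KB, 32 KB, and 20 KB sharing one region cost 32 KB of local store instead of the 80 KB needed to keep all three resident.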

SLIDE 27

Overlay Properties

[Diagram: SPE local store (data, root segment, region 1, region 2) and main memory holding text segments SA, SB, SC, SD, SE, SF; segment SB is shown newly loaded into local store]

Loading a new segment overwrites its respective region

SLIDE 28

Overlays

• VPIC has been extended to support overlays
  - Particle advance, accumulators and particle sorting have been accelerated
  - Current code decomposition uses one region with two segments
• Even this fairly trivial approach has difficulties
  - VPIC overlay strategy uses stack for data buffers
  - Stack placement changes with linkage (even with only trivial code changes)
  - Silent overflows into memory handled by overlay manager
  - Data corruption, hangs, segmentation faults…
• May need to implement light-weight heap
  - void * spu_malloc(), spu_free_all()
  - Would reserve 224 KB of LS space for heap
  - Actual implementation will use static byte array in root segment

SLIDE 29

VPIC Highlights

• 1.00 Trillion Particle Run (Poughkeepsie just after stand-up)
  - Aggressive test of full system
  - Achieved sustained performance of >374.25 TF
    * 11% of theoretical max performance (single precision 3.0 PF)
    * Gordon Bell Prize Finalist SC2008
  - Cell processes used 42.8 TB RAM (93.8% of available Cell memory)
    * Opteron processes used 7.3 TB RAM
• Science Runs: Back-scatter in laser plasma interactions
  - Current bread-and-butter runs on 6 CUs (4,096 ranks : 32,768 threads)
  - Next set of runs will be at 16 CUs (11,520 ranks : 92,160 threads)
  - 11x speedup over Opteron-only
  - Excellent machine stability (main difficulty is I/O subsystem)

SLIDE 30

Efficiency is a Poor Metric for Performance

• Real computational workflows expose many more bottlenecks
  - Data I/O for visualization and restarts
  - Data rendering for visualization
  - Diagnostics and statistical analysis
• Many of these steps can be handled concurrently
  - Once the pipeline is full, we can fully subscribe a hybrid compute node
  - Reduces vulnerability to machine instabilities by reducing total time to solution
  - Special-purpose accelerators can be targeted to specific tasks
  - Host process only manages tasks
• VPIC will be enhanced to address these issues
  - Initial enhancements will use pipeline abstraction
  - OpenCL implementation planned

Hybrid Computing Architectures Can Help Us!

SLIDE 31

Exploiting Hybrid Architectures

[Diagram labels: Cell, Opteron, Scheduler, GPU, Opteron Core; tasks shown: Disk I/O, Rendering, Computation, Host Process/Task Queue]

SLIDE 32

Commodity Nodes Like This Already Exist!

• Scalable Informatics – Pegasus GPU+Cell Node
  - 4-16 AMD or Intel cores
  - 8-128 GB RAM
  - One or more Tesla GPU cards
  - One or more GA-180 (PXCAB) Cell cards

How can we develop for a cluster of such nodes?

SLIDE 33

OpenCL – One Possibility

• OpenCL is a programming framework for accelerated compute nodes
  - Runtime – handles work distribution and JIT compilation
    * Fully static embedded kernels are supported
    * Topology interrogation
  - API – process launch, communication and synchronization
  - OpenCL C – kernel programming language
    * Abstraction layer for SIMD vector types and intrinsics
    * Explicit dependency specification of kernel parameters
• Host process controls one or more attached devices
• Still missing
  - Heterogeneous device support
  - Support for clusters (host-to-host communication)
  - Build/configuration system

SLIDE 34

Hybrid Compute Node

[Diagram labels: Node Control Process; Abstraction Layer (OpenCL, OpenMPI (h), OpenMPI); Kernel Logic (C/C++/OpenCL C) under OpenCL and under OpenMPI (h); IB Interconnect]

• Hybrid OpenMPI
  - Working version in use on Roadrunner architecture
  - Extends MPI interface
    * Process launch on multiple architectures
    * Introduces hierarchical communicators
  - Available in next release
• OpenCL
  - No current support for peer-to-peer communication between hosts
• Others
  - CellSs, OpenMP

SLIDE 35

Open Science on Roadrunner

• Internal peer-review process identified 9 projects
  - Bio-Fuels, Astrophysics, Plasma Physics, Phylogenetics, Atmospheric Science, Molecular Dynamics
• Projects have been awarded allocations on full machine
  - Many are currently underway
  - Window of 3-4 months before the machine goes behind-the-fence
• Cerrillos
  - 162 TF Roadrunner architecture (2 CUs)
  - Call has been issued and proposals are being evaluated
  - Allocations will begin soon!
• Education
  - Hands-On Cell programming class (second offering currently underway)
  - Student program
  - Development allocations for collaboration

SLIDE 36

Supernovae

• The final event in the evolution of a sufficiently massive star is a supernova
• SNSPH, developed at LANL by Chris Fryer and Mike Warren, is a parallel three-dimensional smoothed particle hydrodynamics code
• Simulations conducted on Roadrunner will allow comparison with data from actual light curves and spectra from supernova observations
  - Large Synoptic Survey Telescope (LSST) and the Joint Dark Energy Mission (JDEM)

Work on Roadrunner will extend these simulations by calculating light curves and spectra from full radiation-hydrodynamic models of these explosions.

SLIDE 37

Shock Compression of Metals

• Dislocation interactions such as line defects determine the strength of metals
• Roadrunner will finally allow us to realize the promise of computational science by bridging the gap between simulation and experiment

This animation shows a shock front traveling through polycrystalline Fe, causing a phase transformation from the bcc (gray) to hcp (red) and fcc (green) structure

SLIDE 38

Turbulent Mixing in Buoyancy Driven Flows

• Material mixing to molecular scale in the presence of turbulence-induced stirring is an important process in many areas
• Most studies to date address the Boussinesq case
• Significant and unexpected differences in the mixing process occur as the material density parameters diverge
• These animations highlight the complexity of the mixing process, illustrating the new physics associated with mixing at large density differences

SLIDE 39

Laser Plasma Interactions (LPI)

SLIDE 40

Thanks!

• Our HPC Division staff is committed to making Roadrunner succeed
  - Meghan Wingate
  - Mark Vernon
  - Phil Church
  - Randall Rheinheimer
• Applications’ developers have done amazing work
  - Sriram Swaminarayan
  - Tim Kelley
  - Paul Henning
  - Jamal Mohd-Yusof
• Special thanks to Larry Cox for funding Khronos membership!