
Total Work-Flow: Exploiting Hybrid Computing Architectures for Scientific Computing
ScicomP 15
Ben Bergen, Computational Physics (CCS-2), Los Alamos National Laboratory
Brian Albright (X-1), Kevin Bowers (D.E. Shaw), Lin Yin (X-1), William Daughton (X-1)
LA-UR 09-02032
Operated by Los Alamos National Security, LLC for DOE/NNSA

Overview
- Roadrunner System Overview
- Basic Considerations and Programming Models
- Adapting VPIC Kinetic Plasma Code to Roadrunner
- Optimizing Total Workflow
- Open Science on Roadrunner

Roadrunner is a Cluster

Roadrunner is a Cluster of Clusters

Roadrunner is a Cluster of Clusters with Accelerators

Triblade Compute Node

Original Blade Topology
- One-to-one affinity between Opteron core and Cell processor
- Newer versions of DaCS support two-to-one affinity
- Support for four-to-one affinity is unconfirmed

Roadrunner: Basic Considerations for Adaptation
Roadrunner has three different architectures
- First hybrid supercomputer of the current generation, incorporating x86_64, PowerPC, and SPU ISAs
- Codes require three executables
  - x86_64 executable runs on the Opteron host processor
  - PowerPC executable runs on the Power Processing Element (PPE) accelerator processor
  - SPU threads run on the eight Synergistic Processing Element (SPE) special-purpose vector unit processors
- Three compilers: gcc, ppu-gcc, spu-gcc (also XL C/C++)
- Design considerations: process launch and SPE synchronization

Roadrunner: Basic Considerations for Adaptation
Roadrunner has three different address spaces
- Incorporates main memory on the Opteron and Cell eDP blades plus the user-controlled local store SRAM on the SPEs
- Codes that run on Roadrunner must handle communication between these memory spaces
  - Distributed memory communication between Opteron hosts
  - Point-to-point communication between Opteron host and Cell accelerator
  - Direct Memory Access (DMA) communication between Cell main memory and SPE local store memory
- Opteron and Cell have different endianness
  - Some byte-swapping is necessary
- Cell blades are diskless
- Design considerations: communication and I/O
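
The byte-swapping point can be illustrated with a minimal sketch. It assumes single-precision data packed in a flat buffer; the helper name swap_float32_array is hypothetical and not taken from the VPIC source. The Opteron is little-endian and the Cell is big-endian, so each 4-byte word has to be reversed when it crosses between them.

```python
import struct


def swap_float32_array(raw: bytes) -> bytes:
    """Reverse the byte order of every 4-byte word in a buffer (hypothetical helper)."""
    if len(raw) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4")
    count = len(raw) // 4
    # Read the words as little-endian and write them back as big-endian.
    words = struct.unpack(f"<{count}I", raw)
    return struct.pack(f">{count}I", *words)


host_values = [1.0, -2.5, 3.25]
little = struct.pack("<3f", *host_values)  # layout on the little-endian host
big = swap_float32_array(little)           # layout expected on the big-endian side
round_trip = struct.unpack(">3f", big)
```

Swapping is its own inverse, so the same routine serves both directions of the host-to-accelerator link.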

Roadrunner: Basic Considerations for Adaptation
Multiple tools and programming models
- Process launch and synchronization: MPI, DaCS/ALF, libSPE2
- Communication: MPI, DaCS/ALF, libSPE2
Hierarchical/heterogeneous advantages
- Fault tolerance
  - Faults can be caught at multiple levels
- Scalability
  - Strong scalability is possible on SPEs
  - Weak scalability through distributed memory
[Figure labels: MPI, DaCS, libSPE2]

Programming Models: Host-Centric (Function Offload)
Pros
- Allows staged development
- Existing MPI codes will run on Opterons
- Synchronous or asynchronous function offload to accelerator
- Minimizes reliance on PPE (poor performer!)
Cons
- Potential data-movement bottleneck
- Offload cost must be amortized by work done on accelerator
[Figure labels: Opteron, Cell]

Programming Models: Accelerator-Centric
Pros
- Also allows staged development
- Existing MPI codes will run on PowerPC (PPE)
- Hides complexity of hybrid architecture
- Avoids data-movement bottleneck
Cons
- Heavier reliance on PPE
- Computationally intensive portions of code must run on SPEs
- Requires “relay” to forward message traffic
[Figure labels: Opteron, Cell]

Message Passing Relay
- Direct point-to-point communication is not possible between Cells
[Figure labels: Opteron, Opteron, Cell, Cell]

Message Passing Relay
- Relay forwards messages through hosts to peer Cell
[Figure labels: Opteron, Opteron, Cell, Cell, Data, Data]
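
The relay idea can be sketched as a toy model. It assumes one host per Cell, and all class and function names here (Cell, Host, cell_send) are hypothetical illustrations rather than the MP Relay API. Plain method calls stand in for the Cell-to-host and host-to-host links.

```python
from collections import deque


class Cell:
    """Stands in for a Cell process that can only talk to its own host."""

    def __init__(self, rank):
        self.rank = rank
        self.inbox = deque()


class Host:
    """Stands in for the Opteron process paired with one Cell."""

    def __init__(self, cell, network):
        self.cell = cell
        self.network = network        # rank -> Host, models host-to-host messaging
        self.forwarded = 0
        network[cell.rank] = self

    def relay_from_cell(self, dest_rank, payload):
        # Hop 1 (Cell -> own host) is this call; hop 2 is host -> peer host.
        self.forwarded += 1
        self.network[dest_rank].deliver_to_cell(self.cell.rank, payload)

    def deliver_to_cell(self, src_rank, payload):
        # Hop 3: peer host -> its Cell.
        self.cell.inbox.append((src_rank, payload))


def cell_send(network, src_rank, dest_rank, payload):
    """Logical Cell-to-Cell send, carried out as a three-hop relay."""
    network[src_rank].relay_from_cell(dest_rank, payload)


network = {}
cells = [Cell(0), Cell(1)]
hosts = [Host(c, network) for c in cells]
cell_send(network, 0, 1, b"ghost data")
received = cells[1].inbox.popleft()
```

From the sender's point of view the call looks like a direct point-to-point send, which is the flattening effect the relay is meant to provide.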

Programming Models: All Roads Lead Everywhere
There is a natural evolution of both of these approaches into a fully hybrid computing model
- Initial difference is in the program locus, or control process
- In the “evolved” model, the host process runs a task queue
- Tasks may be offloaded to other host-type cores or to accelerators
- Task data may live in the worker’s memory to avoid data-movement bottlenecks
- More on how we can use this to follow!
[Figure labels: Opteron, Scheduler, Opteron Core, Cell, Cell/GPU]

Particle-In-Cell (PIC) Methods Simulate Plasma Physics
[Figures: VPIC modeling of a single laser speckle; LLNL pF3D modeling of a laser beam; integrated LLNL Hydra modeling of ICF experiment]
- One application of VPIC is to simulate Laser Plasma Interactions (LPI) critical to understanding Inertial Confinement Fusion (ICF) at the National Ignition Facility (NIF)
- Several difficulties arise during the compression of hohlraum capsules
  - Laser scattering: not enough energy to compress capsule
  - Laser scattering: laser does not target desired areas (asymmetric compression)
  - Pre-heating: electrons heat plasma, making compression more difficult

Particle-In-Cell Method
[Figure: time iteration over grids + particles in the spatial domain, cycling through four stages: Interpolate Field Effects, Advance Particles, Accumulate Currents, Update Fields]
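
The four stages of the loop can be sketched in a toy form. This is a 1D, non-relativistic, periodic example in normalized units (eps0 = 1) with linear weighting, written only to show the order of operations. It is an assumption-laden illustration, not VPIC's 3D relativistic electromagnetic scheme, and unlike VPIC it is not charge conserving. The function name pic_step is hypothetical.

```python
def pic_step(x, v, E, dx, dt, q=-1.0, m=1.0):
    """Advance a toy 1D periodic PIC system by one time step, in place."""
    n = len(E)
    length = n * dx
    J = [0.0] * n
    for p in range(len(x)):
        # 1. Interpolate the grid field to the particle (linear weighting).
        s = x[p] / dx
        i = int(s) % n
        w = s - int(s)
        e_here = (1.0 - w) * E[i] + w * E[(i + 1) % n]
        # 2. Advance the particle velocity, then its position.
        v[p] += (q / m) * e_here * dt
        x[p] = (x[p] + v[p] * dt) % length
        # 3. Accumulate the particle's current onto the two nearest grid points.
        s = x[p] / dx
        i = int(s) % n
        w = s - int(s)
        J[i] += q * v[p] * (1.0 - w) / dx
        J[(i + 1) % n] += q * v[p] * w / dx
    # 4. Update the field from the accumulated current: dE/dt = -J.
    for i in range(n):
        E[i] -= J[i] * dt
    return J


x = [0.5]
v = [1.0]
E = [0.0, 0.0, 0.0, 0.0]
J = pic_step(x, v, E, dx=1.0, dt=0.25)
```

Each particle is visited once per step, and the field is only modified after all particles have been processed, so every particle sees the same field from the start of the step.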

VPIC – Vector Particle-In-Cell
- 3D, fully relativistic, electromagnetic Particle-In-Cell (PIC) code
  - Self-consistent evolution of a kinetic plasma
  - Charge conserving (no implicit solve)
- Optimized for data motion
  - Single precision: half the memory bandwidth/double the theoretical peak
  - Single-pass particle processing
  - Field interpolation coefficients are pre-computed
- Optimized for modern architectures
  - Uses short-vector SIMD intrinsics (SSE, AltiVec, SPU)
- Assumes that particles do not leave the voxel in which they started
  - Exceptions are handled separately
- O(N) particle sorting
  - Improves spatial locality of particle data access
  - Improves temporal locality of Field data access
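
One way to sort particles in linear time is a counting sort keyed on voxel index. The slide only states that the sort is O(N); the counting-sort approach and the function name below are assumptions made for illustration.

```python
def sort_particles_by_voxel(voxel_ids, n_voxels):
    """Return particle indices ordered by voxel, in O(N + V) time (hypothetical helper)."""
    # Histogram: counts[v + 1] holds the number of particles in voxel v.
    counts = [0] * (n_voxels + 1)
    for vid in voxel_ids:
        counts[vid + 1] += 1
    # Prefix sum: counts[v] becomes the first output slot for voxel v.
    for vox in range(n_voxels):
        counts[vox + 1] += counts[vox]
    # Scatter each particle index into the next free slot of its voxel.
    order = [0] * len(voxel_ids)
    for p, vid in enumerate(voxel_ids):
        order[counts[vid]] = p
        counts[vid] += 1
    return order


voxel_ids = [2, 0, 1, 0, 2]
order = sort_particles_by_voxel(voxel_ids, n_voxels=3)
sorted_ids = [voxel_ids[p] for p in order]
```

After the sort, particles that share a voxel sit next to each other in memory, so consecutive particles tend to touch the same field data.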

Porting to Roadrunner (things that we did)
- Message Passing Relay (MP Relay)
  - Flattens communication topology
  - Allows logical point-to-point communication between Cell processors
  - Abstracts remote I/O layer for restart and visualization dumps
- Pipelined execution
  - Code restructured for data-parallel thread execution
  - Current support for serial, pthreads, and SPE threads
  - Simple, common interface: init(), finalize(), execute(function_t), sync()
- Particle data structures
  - Optimized for efficient communication via DMA requests
  - Can be tuned to cache size on traditional cached-memory architectures (padding)
- Voxel cache (access to Field data)
  - Fully associative least recently used (LRU) policy
  - Simple interface: voxel_cache_fetch() and voxel_cache_wait()
- Text overlay support
  - Allows acceleration of field advance, particle sorting and accumulators
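
A fully associative LRU cache for per-voxel field data can be sketched as follows. The fetch/wait split mirrors the interface names on the slide, but the signatures, the VoxelCache class, and the load_voxel callback (a stand-in for a DMA transfer from main memory) are assumptions. The sketch is synchronous, whereas a real SPE version would presumably start the transfer in fetch and block in wait.

```python
from collections import OrderedDict


class VoxelCache:
    """Toy fully associative cache with least-recently-used eviction."""

    def __init__(self, capacity, load_voxel):
        self.capacity = capacity
        self.load_voxel = load_voxel   # hypothetical stand-in for a DMA get
        self.lines = OrderedDict()     # voxel index -> data, least recent first
        self.misses = 0

    def fetch(self, voxel):
        """Make sure `voxel` is resident, evicting the LRU line if the cache is full."""
        if voxel in self.lines:
            self.lines.move_to_end(voxel)      # mark as most recently used
            return
        self.misses += 1
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)     # drop the least recently used line
        self.lines[voxel] = self.load_voxel(voxel)

    def wait(self, voxel):
        """Return the data for a voxel that was previously fetched."""
        return self.lines[voxel]


cache = VoxelCache(capacity=2, load_voxel=lambda voxel: voxel * 10)
for voxel in (1, 2, 1, 3):
    cache.fetch(voxel)
resident = list(cache.lines)
```

In this run voxel 2 is evicted when voxel 3 arrives, because voxel 1 was touched more recently.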

Pipeline Abstraction
[Figure: master thread timeline: init, execute, sync, finalize]
- Worker threads block for execute message to reduce thread creation overhead
  - pthreads implementation uses condition variables
  - SPE implementation uses mailboxes
- SPE symbols are exposed to the PPE through _SPUEAR_ linker magic
  - Function call is implemented through mailbox message
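
The condition-variable variant of this pattern can be sketched with persistent worker threads. The four method names follow the interface listed in the slides (with the constructor playing the role of init()); everything else, including the Pipeline class, the (rank, n_workers) job signature, and the generation counter, is an assumption and not the VPIC implementation.

```python
import threading


class Pipeline:
    """Persistent workers that sleep on a condition variable between jobs."""

    def __init__(self, n_workers):
        self.n_workers = n_workers
        self.cond = threading.Condition()
        self.job = None          # function posted by the most recent execute()
        self.generation = 0      # incremented once per execute()
        self.pending = 0         # workers that have not finished the current job
        self.stopping = False
        self.threads = [threading.Thread(target=self._worker, args=(rank,))
                        for rank in range(n_workers)]
        for t in self.threads:
            t.start()

    def _worker(self, rank):
        seen = 0
        while True:
            with self.cond:
                # Block until a new job is posted or shutdown is requested.
                self.cond.wait_for(
                    lambda: self.stopping or self.generation != seen)
                if self.stopping:
                    return
                seen = self.generation
                job = self.job
            job(rank, self.n_workers)    # do the work outside the lock
            with self.cond:
                self.pending -= 1
                self.cond.notify_all()

    def execute(self, function):
        with self.cond:
            if self.pending != 0:
                raise RuntimeError("call sync() before the next execute()")
            self.job = function
            self.pending = self.n_workers
            self.generation += 1
            self.cond.notify_all()

    def sync(self):
        with self.cond:
            self.cond.wait_for(lambda: self.pending == 0)

    def finalize(self):
        self.sync()
        with self.cond:
            self.stopping = True
            self.cond.notify_all()
        for t in self.threads:
            t.join()


partial = [0] * 4


def sum_slice(rank, n_workers):
    # Each worker sums a strided slice of 0..99.
    partial[rank] = sum(range(rank, 100, n_workers))


pipe = Pipeline(4)
pipe.execute(sum_slice)
pipe.sync()
total = sum(partial)
pipe.finalize()
```

The threads are created once and reused for every execute() call, which is the overhead saving the slide refers to.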
