SLIDE 1

Building the Next Generation of Parallel Applications

Michael A. Heroux Scalable Algorithms Department Sandia National Laboratories, USA

Sandia National Laboratories is a multi-program laboratory operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin company, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

SLIDE 2

A Brief Personal Computing History

1988 - 1997: Cray vectorization and autotasking directives

    CMIC$ DO ALL VECTOR IF (N .GT. 800)
    CMIC$1 SHARED(BETA, N, Y, Z)
    CMIC$2 PRIVATE(I)
    CDIR$ IVDEP
          do 15 i = 1, n
            z(i) = beta * y(i)
       15 continue
          endif

1993 - 2008: MPI

    #include <mpi.h>

    int main(int argc, char *argv[]) {
      // Initialize MPI
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      ...

SLIDE 3

2008 - Present

    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[]) {
      // Initialize MPI
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      ...
      #pragma omp parallel
      {
        double localasum = 0.0;
        #pragma omp for
        for (int j = 0; j < MyLength_; j++)
          localasum += std::abs(from[j]);
        #pragma omp critical
        asum += localasum;
      }

Unification and composition:

  • Vectorization
  • Threading
  • Multiprocessing

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>

    thrust::device_vector<int> vd(10, 1);
    thrust::host_vector<int>   vh(10, 1);

SLIDE 4

Quiz (True or False)

  • 1. MPI-only has the best parallel performance.
  • 2. Future parallel applications will not have MPI_Init().
  • 3. All future programmers will need to write parallel code.
  • 4. Use of “markup”, e.g., OpenMP pragmas, is the least intrusive approach to parallelizing a code.
  • 5. DRY is not possible across CPUs and GPUs.
  • 6. GPUs are a harbinger of CPU things to come.
  • 7. Checkpoint/Restart will be sufficient for scalable resilience.
  • 8. Resilience will be built into algorithms.
  • 9. MPI-only and MPI+X can coexist in the same application.
  • 10. Kernels will be different in the future.

SLIDE 5

Basic Exascale Concerns: Trends, Manycore

  • Stein’s Law: If a trend cannot continue, it will stop.

Herbert Stein, chairman of the Council of Economic Advisers under Nixon and Ford.

  • Trends at risk:

– Power.
– Single core performance.
– Node count.
– Memory size & BW.
– Concurrency expression in existing programming models.

[Chart: Parallel CG performance, 512 threads, on 32 nodes = 2.2 GHz AMD, 4 sockets x 4 cores. Gigaflops vs. number of 3D grid points (27-pt stencil) for p32 x t16, p128 x t4, and p512 x t1. Edwards: SAND2009-8196, Trilinos ThreadPool Library v1.1.]

“Status Quo” ~ MPI-only. Strong scaling potential. One outcome: Greatly increased interest in OpenMP.

SLIDE 6

Implications

  • MPI-Only is not sufficient, except … much of the time.
  • Near-to-medium term:

– MPI+[OMP|TBB|Pthreads|CUDA|OCL|MPI]
– Long term, too?

  • Long- term:

– Something hierarchical, global in scope.

  • Conjecture:

– Data-intensive apps need a non-SPMD model.
– Will develop new programming model/env.
– Rest of apps will adopt over time.
– Time span: 20 years.

SLIDE 7

What Can we Do Right Now?

  • Study why MPI was successful.
  • Study new parallel landscape.
  • Try to cultivate an approach similar to MPI.
SLIDE 8

MPI Impressions

SLIDE 9

Dan Reed, Microsoft. Workshop on the Road Map for the Revitalization of High End Computing, June 16-18, 2003.
Tim Stitts, CSCS. SOS14 Talk, March 2010.
“MPI is often considered the ‘portable assembly language’ of parallel computing, …” Brad Chamberlain, Cray, 2000.

SLIDE 10

Brad Chamberlain, Cray, PPOPP’06, http://chapel.cray.com/publications/ppopp06-slides.pdf

SLIDE 11

MPI Reality

SLIDE 12

Tramonto WJDC Functional

  • New functional.
  • Bonded systems.
  • 552 lines C code.

WJDC-DFT (Wertheim, Jain, Dominik, and Chapman) theory for bonded systems. (S. Jain, A. Dominik, and W.G. Chapman. Modified interfacial statistical associating fluid theory: A perturbation density functional theory for inhomogeneous complex fluids. J. Chem. Phys., 127:244904, 2007.) Models stoichiometry constraints inherent to bonded systems.

How much MPI-specific code?

dft_fill_wjdc.c

SLIDE 13

dft_fill_wjdc.c MPI-specific code

SLIDE 14

MFIX

Source term for pressure correction

  • MPI-callable, OpenMP-enabled.
  • 340 Fortran lines.
  • No MPI-specific code.
  • Ubiquitous OpenMP markup (red regions).

MFIX: Multiphase Flows with Interphase eXchanges (https://www.mfix.org/)

source_pp_g.f

SLIDE 15

Reasons for MPI Success?

  • Portability? Yes.
  • Standardized? Yes.
  • Momentum? Yes.
  • Separation of many parallel & algorithms concerns? Big Yes.

  • Once framework in place:

– Sophisticated physics added as serial code.
– Ratio of science experts vs. parallel experts: 10:1.

  • Key goal for new parallel apps: Preserve this ratio
SLIDE 16

Computational Domain Expert Writing MPI Code

SLIDE 17

Computational Domain Expert Writing Future Parallel Code

SLIDE 18

Evolving Parallel Programming Model

SLIDE 19

Parallel Programming Model: Multi-level/Multi-device

[Diagram: multi-level/multi-device programming model. Stateless computational kernels run on each core. Intra-node (manycore) parallelism and resource management is handled by threading; node-local control flow is serial; inter-node/inter-device (distributed) parallelism and resource management is handled by message passing across a network of computational nodes with manycore CPUs and/or GPGPUs.]

Adapted from slide of H. Carter Edwards

SLIDE 20

Domain Scientist’s Parallel Palette

  • MPI-only (SPMD) apps:

– Single parallel construct.
– Simultaneous execution.
– Parallelism of even the messiest serial code.

  • Next-generation applications:

– Internode:

  • MPI, yes, or something like it.
  • Composed with intranode.

– Intranode:

  • Much richer palette.
  • More care required from programmer.
  • What are the constructs in our new palette?
SLIDE 21

Obvious Constructs/Concerns

  • Parallel for:

– No loop-carried dependence.
– Rich loops.

  • Parallel reduce:

– Couple with other computations.
– Concern for reproducibility (see the sketch below).
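A minimal OpenMP sketch of these two constructs (the array name and scaling factor are hypothetical); note that the reduction combines partial sums in a nondeterministic order, which is exactly the reproducibility concern above:

    #include <cmath>
    #include <vector>

    // Sketch: parallel for (independent iterations) and parallel reduce.
    double scale_and_sum(std::vector<double>& y, double beta) {
      const int n = static_cast<int>(y.size());

      // Parallel for: no loop-carried dependence.
      #pragma omp parallel for
      for (int i = 0; i < n; ++i)
        y[i] = beta * y[i];

      // Parallel reduce: order of additions varies with thread count/schedule,
      // so results are correct but not bitwise reproducible across runs.
      double asum = 0.0;
      #pragma omp parallel for reduction(+:asum)
      for (int i = 0; i < n; ++i)
        asum += std::abs(y[i]);

      return asum;
    }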

SLIDE 22

Other construct: Pipeline

  • Sequence of filters.
  • Each filter is:

– Sequential (grab element ID, enter global assembly), or
– Parallel (fill element stiffness matrix).

  • Filters executed in sequence.
  • Programmer’s concern:

– Determine (conceptually): Can the filter execute in parallel?
– Write the filter (serial code).
– Register it with the pipeline (see the sketch below).

  • Extensible:

– New physics feature.
– New filter added to pipeline.
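A minimal sketch of what such a pipeline abstraction might look like (the Pipeline and Filter names are hypothetical, not an existing API): the domain scientist writes each filter as serial code and registers it, and the runtime applies parallel filters to many work items concurrently while sequential filters run one item at a time.

    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical work item flowing through the pipeline (e.g., one finite element).
    struct WorkItem { int elementID; /* element data ... */ };

    struct Filter {
      std::string name;
      bool parallelOK;                       // declared by the programmer
      std::function<void(WorkItem&)> body;   // serial code written by the domain scientist
    };

    class Pipeline {
    public:
      void registerFilter(const Filter& f) { filters_.push_back(f); }

      // Filters execute in sequence over the work items.
      void run(std::vector<WorkItem>& items) {
        for (const Filter& f : filters_) {
          if (f.parallelOK) {
            #pragma omp parallel for
            for (int i = 0; i < (int)items.size(); ++i) f.body(items[i]);
          } else {
            for (WorkItem& w : items) f.body(w);
          }
        }
      }
    private:
      std::vector<Filter> filters_;
    };

Adding a new physics feature then amounts to writing one more serial filter body and registering it.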

SLIDE 23

Other construct: Thread team

  • Multiple threads.
  • Fast barrier.
  • Shared, fast access memory pool.
  • Example: Nvidia SM
  • X86 more vague, emerging more clearly in future.
SLIDE 24

Finite Elements/Volumes/Differences and parallel node constructs

  • Parallel for, reduce, pipeline:

– Sufficient for vast majority of node-level computation.
– Supports:

  • Complex modeling expression.
  • Vanilla parallelism.
  • Thread team:

– Complicated.
– Requires true parallel algorithm knowledge.
– Useful in solvers.

SLIDE 25
  • Observe: Iteration count increases with number of subdomains.
  • With scalable threaded triangular solves

– Solve triangular system on larger subdomains.
– Reduce number of subdomains.

  • Goal:

– Better kernel scaling (threads vs. MPI processes).
– Better convergence, more robust.

  • Note: App (-solver) scales very well in MPI-only mode.
  • Exascale Potential: Tiled, pipelined implementation.

Preconditioners for Scalable Multicore Systems

Strong scaling of Charon on TLCC (P. Lin, J. Shadid 2009)

MPI Tasks | Threads | Iterations
4096 | 1 | 153
2048 | 2 | 129
1024 | 4 | 125
512 | 8 | 117
256 | 16 | 117
128 | 32 | 111

Factors Impacting Performance of Multithreaded Sparse Triangular Solve, Michael M. Wolf, Michael A. Heroux, and Erik G. Boman, VECPAR 2010, to appear.

SLIDE 26

Level Set Triangular Solver

[Figure panels: L, DAG, Permuted System, Multi-step Algorithm.]

Triangular Solve:

  • Critical Kernel
  • MG Smoothers
  • Incomplete IC/ILU
  • Naturally Sequential
  • Building on classic algorithms:
  • Level scheduling: circa 1990 (see the sketch below).
  • Vectorization.
  • New: Generalized.
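As a rough illustration (not the Trilinos implementation), a level-scheduled forward solve might look like the sketch below. It assumes a lower triangular matrix in CSR format with the diagonal stored last in each row, and level sets (level_ptr and level_rows are hypothetical names) precomputed from the DAG: rows within one level have no dependences on each other, so each level is a parallel loop and levels execute in order.

    #include <vector>

    // Sketch: level-scheduled lower triangular solve, L x = b.
    void level_scheduled_solve(const std::vector<int>& rowptr,       // CSR row pointers of L
                               const std::vector<int>& colidx,       // CSR column indices
                               const std::vector<double>& vals,      // CSR values, diagonal last in each row
                               const std::vector<int>& level_ptr,    // start of each level in level_rows
                               const std::vector<int>& level_rows,   // row indices grouped by level
                               const std::vector<double>& b,
                               std::vector<double>& x) {
      const int nlevels = static_cast<int>(level_ptr.size()) - 1;
      for (int lev = 0; lev < nlevels; ++lev) {
        // Rows in one level depend only on rows from earlier levels.
        #pragma omp parallel for
        for (int k = level_ptr[lev]; k < level_ptr[lev + 1]; ++k) {
          const int row = level_rows[k];
          double sum = b[row];
          for (int j = rowptr[row]; j < rowptr[row + 1] - 1; ++j)
            sum -= vals[j] * x[colidx[j]];           // already-solved entries
          x[row] = sum / vals[rowptr[row + 1] - 1];  // divide by the diagonal
        }
      }
    }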

SLIDE 27

Triangular Solve Results

[Charts: speedup on Nehalem and Istanbul.]

Passive (PB) vs. Active (AB) barriers: critical for performance.
AB + No Thread Affinity (NTA) vs. AB + Thread Affinity (TA): also helpful.

Level sets: Trilinos/Isorropia. Core kernel timings: Trilinos/Kokkos.

SLIDE 28

Thread Team Advantages

  • Qualitatively better algorithm:

– Threaded triangular solve scales.
– Fewer MPI ranks means fewer iterations, better robustness.

  • Exploits:

– Shared data.
– Fast barrier.
– Data-driven parallelism.

SLIDE 29

Placement and Migration

SLIDE 30

Placement and Migration

  • MPI:

– Data/work placement clear.
– Migration explicit.

  • Threading:

– It’s a mess (IMHO).
– Some platforms good. Many not.
– Default is bad (but getting better).
– Some issues are intrinsic.

SLIDE 31

Data Placement on NUMA

  • Memory intensive computations: page placement has a huge impact.
  • Most systems: first touch (except LWKs); see the sketch below.
  • Application data objects:

– Phase 1: Construction phase, e.g., finite element assembly.
– Phase 2: Use phase, e.g., linear solve.

  • Problem: First touch difficult to control in phase 1.
  • Idea: Page migration.

– Not new: SGI Origin. Many old papers on topic.
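As a hedged illustration of the first-touch issue (the helper name below is hypothetical): a plain allocation typically reserves address space without committing pages, and the first write to each page decides which socket's memory backs it. Touching the data with the same parallel pattern the use phase will employ places pages near the threads that will use them; when the construction phase cannot do that, migration (as proposed above) is the fallback.

    #include <cstdlib>

    // Sketch: NUMA-friendly allocation via first touch.
    double* numa_friendly_alloc(std::size_t n) {
      double* x = static_cast<double*>(std::malloc(n * sizeof(double)));
      // First-touch with the static schedule the solve phase will also use,
      // so each page ends up in the memory of the thread that will use it.
      #pragma omp parallel for schedule(static)
      for (long long i = 0; i < static_cast<long long>(n); ++i)
        x[i] = 0.0;
      return x;
    }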

SLIDE 32

Data placement experiments

  • MiniApp: HPCCG (Mantevo Project)
  • Construct sparse linear system, solve with CG.
  • Two modes:

– Data placed by assembly, not migrated for NUMA.
– Data migrated using parallel access pattern of CG.

  • Results on dual socket quad-core Nehalem system.

SLIDE 33

Weak Scaling Problem

  • MPI and conditioned data approach comparable.
  • Non-conditioned very poor scaling.

SLIDE 34

Page Placement summary

  • MPI+OpenMP (or any threading approach) is best overall.
  • But:

– Data placement is a big issue.
– Hard to control.
– Insufficient runtime support.

  • Current work:

– Migrate on next-touch (MONT).
– Considered in OpenMP (next version).
– Also being studied in Kitten (Kevin Pedretti).

  • Note: This phenomenon is especially damaging to common OpenMP usage.

SLIDE 35

Transition: MPI-only to MPI+[X|Y|Z]

SLIDE 36

Parallel Machine Block Diagram

[Diagram: parallel machine with m nodes (Node 0 … Node m-1); each node has its own memory and cores Core 0 … Core n-1.]

– Parallel machine with p = m * n processors:

  • m = number of nodes.
  • n = number of shared memory processors per node.

– Two ways to program:

  • Way 1: p MPI processes.
  • Way 2: m MPI processes with n threads per MPI process.
  • New third way:
  • “Way 1” in some parts of the execution (the app).
  • “Way 2” in others (the solver).

SLIDE 37

Multicore Scaling: App vs. Solver

Application:

  • Scales well (sometimes superlinear).

  • MPI-only sufficient.

Solver:

  • Scales more poorly.
  • Memory system-limited.
  • MPI+threads can help.

* Charon Results: Lin & Shadid TLCC Report

SLIDE 38

MPI-Only + MPI/Threading: Ax=b

[Diagram: four MPI ranks (Rank 0-3), each with its own App, Lib, and Mem boxes. Multicore “PNAS” layout: Lib on Rank 0 runs Threads 0-3.]

App passes matrix and vector values to library data classes. All ranks store A, x, b data in memory visible to rank 0. Library solves Ax=b using shared memory algorithms on the node.

SLIDE 39

MPI Shared Memory Allocation

Idea:

  • Shared memory alloc/free

functions:

– MPI_Comm_alloc_mem
– MPI_Comm_free_mem

  • Predefined communicators:

MPI_COMM_NODE – ranks on node
MPI_COMM_SOCKET – UMA ranks
MPI_COMM_NETWORK – inter node

  • Status:

– Available in current development branch of OpenMPI.
– First “Hello World” program works.
– Incorporation into standard still not certain. Need to build case.
– Next step: Demonstrate usage with threaded triangular solve.

  • Exascale potential:

– Incremental path to MPI+X.
– Dial-able SMP scope.

    int n = …;
    double* values;
    MPI_Comm_alloc_mem(
        MPI_COMM_NODE,     // comm (SOCKET works too)
        n*sizeof(double),  // size in bytes
        MPI_INFO_NULL,     // placeholder for now
        &values);          // Pointer to shared array (out)

    // At this point:
    // - All ranks on a node/socket have a pointer to a shared buffer (values).
    // - Can continue in MPI mode (using shared memory algorithms), or
    // - Can quiet all but one:
    int rank;
    MPI_Comm_rank(MPI_COMM_NODE, &rank);
    if (rank == 0) {
      // Start threaded code segment, only on rank 0 of the node
      …
    }
    MPI_Comm_free_mem(MPI_COMM_NODE, values);

Collaborators: B. Barrett, Brightwell, Wolf - SNL; Vallee, Koenig - ORNL

SLIDE 40

Resilient Algorithms

SLIDE 41

My Luxury in Life (wrt FT/Resilience)

The privilege to think of a computer as a reliable, digital machine.


“At 8 nm process technology, it will be harder to tell a 1 from a 0.” (W. Camp 2008, 2010)

SLIDE 42

Users’ View of the System Now

  • “All nodes up and running.”
  • Certainly nodes fail, but invisible to user.
  • No need for me to be concerned.
  • Someone else’s problem.

SLIDE 43

Users’ View of the System Future

  • Nodes in one of four states.
  • 1. Dead.
  • 2. Dying (perhaps producing faulty results).
  • 3. Reviving.
  • 4. Running properly:

a) Fully reliable, or…
b) Maybe still producing an occasional bad result.

SLIDE 44

Faults: Hard vs. Soft

  • Hard:

– Program flow interrupted.
– Majority of faults.
– Presently handled by (global) checkpoint/restart.
– Numerous papers on alternatives.

  • Soft:

– Program flow continues.
– Minor perturbations in data state:

  • Incorrect address lookup (but still in user scope).
  • Incorrect FP value.
SLIDE 45

Algorithm-Based (Hard) Fault Tolerance

  • Numerous approaches.
  • Most common strategies:

– Meta data:
  • Embed meta data into user-defined data structures.
  • Manage fault detection, recovery manually.
– Algorithm results validation:
  • Use known algorithm properties.
  • Validate computed to known (e.g., residual check; see the sketch below).

  • Note: A lack of app awareness.
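A minimal sketch of the residual-check style of validation (dense storage only to keep the example short; names are illustrative): after a solve, a corrupted x is very unlikely to still satisfy the residual tolerance.

    #include <cmath>
    #include <vector>

    // Sketch: a posteriori validation of a computed solution x for A x = b.
    bool solution_acceptable(const std::vector<std::vector<double>>& A,
                             const std::vector<double>& x,
                             const std::vector<double>& b,
                             double tol = 1e-8) {
      const std::size_t n = b.size();
      double rnorm2 = 0.0, bnorm2 = 0.0;
      for (std::size_t i = 0; i < n; ++i) {
        double ri = b[i];
        for (std::size_t j = 0; j < n; ++j) ri -= A[i][j] * x[j];
        rnorm2 += ri * ri;
        bnorm2 += b[i] * b[i];
      }
      return std::sqrt(rnorm2) <= tol * std::sqrt(bnorm2);  // ||b - Ax|| <= tol * ||b||
    }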

SLIDE 46

Common Approach to FT (Diplomacy Analogy)

“We have linearized our portion of the nonlinear problem and would like you to negotiate a global linear solution with the other processors.”

“Yes, Madame President. I will return with our portion of the global linear solution, ASAP.”

“Madame President, although there was some rough weather, our fault tolerant linear solver worked and I have returned with our portion of the linear solution.”

Possible replies:
“Thank you, but we lost nonlinear state and cannot use your results.”
“Thank you, but we recovered nonlinear state and sent out a new diplomat who already returned.”
“Thank you, we recovered nonlinear state, the linear solution is expensive. We can use your results.”

SLIDE 47

Hard Error Futures

  • C/R will continue as dominant approach:

– Global state to global file system OK for small systems.
– Large systems: State control will be localized, use SSD.

  • Checkpoint-less restart:

– Requires full vertical HW/SW stack co-operation.
– Very challenging.
– Stratified research efforts not effective.

SLIDE 48

Soft Error Futures

  • Soft error handling: A legitimate algorithms issue.
  • Programming model, runtime environment play role.
SLIDE 49

Consider GMRES as an example of how soft errors affect correctness

  • Basic Steps

1) Compute Krylov subspace (preconditioned sparse matrix-vector multiplies).
2) Compute orthonormal basis for Krylov subspace (matrix factorization).
3) Compute vector yielding minimum residual in subspace (linear least squares).
4) Map to next iterate in the full space.
5) Repeat until residual is sufficiently small.

  • More examples in Bronevetsky & Supinski, 2008

SLIDE 50

Why GMRES?

  • Many apps are implicit.
  • Most popular (nonsymmetric) linear solver is preconditioned GMRES.
  • Only a small subset of calculations need to be reliable.

– GMRES is iterative, but also direct.

SLIDE 51

Every calculation matters

  • Small PDE Problem: Dim 21K, Nz 923K.
  • ILUT/GMRES
  • Correct computation 35 Iters: 343M FLOPS
  • Two examples of a single bad floating point op

Description | Iterations | FLOPS | Recursive Residual Error | Solution Error
All correct calcs | 35 | 343M | 4.6e-15 | 1.0e-6
Iter=2, y[1] += 1.0 (SpMV incorrect, ortho subspace) | 35 | 343M | 6.7e-15 | 3.7e+3
Q[1][1] += 1.0 (non-ortho subspace) | N/C | N/A | 7.7e-02 | 5.9e+5

SLIDE 52

One possible approach is transactional computation

  • Database transactions: atomic
  • Transactional memory: atomic memory operation
  • Transactional computation:

– Designated sensitive computation region (orthogonalization step in GMRES).
– Guarantee accurate computation or notify user (see the sketch below).
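There is no standard API for this yet; as a rough sketch of the idea, the sensitive region (here, modified Gram-Schmidt orthogonalization of a new Krylov vector) can be wrapped so its result is validated, retried a bounded number of times, and reported to the user if it still fails. All names below are hypothetical.

    #include <cmath>
    #include <stdexcept>
    #include <vector>

    using Vec = std::vector<double>;

    static double dot(const Vec& a, const Vec& b) {
      double s = 0.0;
      for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
      return s;
    }

    // Sketch of a "transactional" computation: orthogonalize w against the
    // basis Q, validate the result, and retry or notify on failure.
    void orthogonalize_transaction(const std::vector<Vec>& Q, Vec& w,
                                   double tol = 1e-10, int maxTries = 2) {
      const Vec w_in = w;                       // keep input so the region can be re-run
      for (int attempt = 0; attempt < maxTries; ++attempt) {
        w = w_in;
        for (const Vec& q : Q) {                // modified Gram-Schmidt projections
          const double h = dot(q, w);
          for (std::size_t i = 0; i < w.size(); ++i) w[i] -= h * q[i];
        }
        // Validation: w should now be numerically orthogonal to every q.
        const double wnorm = std::sqrt(dot(w, w));
        bool ok = true;
        for (const Vec& q : Q)
          if (std::abs(dot(q, w)) > tol * (wnorm + 1.0)) { ok = false; break; }
        if (ok) return;                         // "commit" the transaction
      }
      throw std::runtime_error("orthogonalization did not complete reliably");
    }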

SLIDE 53

Needs to be coupled with SW-enabled guaranteed data regions

  • User-designated reliable data region.
  • Extra protection to improve reliable data storage and transfer.
  • Examples:

– Original input data (needed for verification).
– Linear solver: A, x, b.
– Orthogonal vectors for GMRES.

  • OpenMP pragma-enabled?

SLIDE 54

Goal

  • Algorithms well-conditioned wrt soft failure.
  • Now:

– Single soft error produces erroneous results.

  • Goal:

– Correct results always.
– Cost increase proportional to number of soft errors.

  • Note: These are just two approaches to ABFT.

SLIDE 55

Software Development and Delivery

SLIDE 56

Compile-time Polymorphism

Templates and Sanity upon a shifting foundation


“Are C++ templates safe? No, but they are good.”

Software delivery:

  • Essential Activity

How can we:

  • Implement mixed precision algorithms?
  • Implement generic fine-grain parallelism?
  • Support hybrid CPU/GPU computations?
  • Support extended precision?
  • Explore redundant computations?
  • Prepare for both exascale “swim lanes”?

C++ templates only sane way:

  • Moving to completely templated Trilinos libraries.

  • Other important benefits.
  • A usable stack exists now in Trilinos.

Template Benefits:

– Compile-time polymorphism.
– True generic programming.
– No runtime performance hit.
– Strong typing for mixed precision (see the sketch below).
– Support for extended precision.
– Many more…

Template Drawbacks:

– Huge compile-time performance hit:
  • But good use of multicore :)
  • Eliminated for common data types.
– Complex notation:
  • Esp. for Fortran & C programmers.
  • Can insulate to some extent.
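A small sketch of why compile-time polymorphism matters for mixed and extended precision: one templated kernel serves float, double, and (in principle) an extended-precision scalar type, with the types enforced at compile time.

    #include <vector>

    // One generic kernel: y = alpha*x + y, for any Scalar supporting * and +.
    template <typename Scalar>
    void axpy(Scalar alpha, const std::vector<Scalar>& x, std::vector<Scalar>& y) {
      for (std::size_t i = 0; i < x.size(); ++i)
        y[i] += alpha * x[i];
    }

    int main() {
      std::vector<float>  xf(100, 1.0f), yf(100, 2.0f);
      std::vector<double> xd(100, 1.0),  yd(100, 2.0);
      axpy(0.5f, xf, yf);   // single precision instantiation
      axpy(0.5,  xd, yd);   // double precision instantiation
      // An extended-precision type (e.g., a double-double class) would
      // instantiate the same kernel with no new source code.
      return 0;
    }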
SLIDE 57

Solver Software Stack

[Diagram: Trilinos solver software stack.]

Bifurcation analysis: LOCA
DAEs/ODEs, transient problems: Rythmos
Eigen problems: Anasazi
Linear equations, linear problems: AztecOO, Ifpack, ML, etc.
Vector problems, matrix/graph equations, distributed linear algebra: Epetra, Teuchos
Optimization (unconstrained, constrained): MOOCHO
Nonlinear problems: NOX
Sensitivities (automatic differentiation): Sacado

Phase I packages: SPMD, int/double. Phase II packages: Templated.

SLIDE 58

Solver Software Stack

[Diagram: solver software stack, updated with templated packages.]

Bifurcation analysis: LOCA, T-LOCA
DAEs/ODEs, transient problems: Rythmos
Eigen problems: Anasazi
Linear equations, linear problems: AztecOO, Ifpack, ML, etc.; Belos*, T-Ifpack*, T-ML*, etc.
Vector problems, matrix/graph equations, distributed linear algebra: Epetra; Tpetra*, Kokkos*
Optimization (unconstrained, constrained): MOOCHO
Nonlinear problems: NOX, T-NOX
Sensitivities (automatic differentiation): Sacado
Teuchos

Phase I packages; Phase II packages; Phase III packages: Manycore*, templated.

SLIDE 59

Trilinos/Kokkos Node API

SLIDE 60

Generic Shared Memory Node

  • Abstract inter-node comm provides DMP support.
  • Need some way to portably handle SMP support.
  • Goal: allow code, once written, to be run on any parallel node, regardless of architecture.
  • Difficulty #1: Many different memory architectures.

– Node may have multiple, disjoint memory spaces.
– Optimal performance may require special memory placement.

  • Difficulty #2: Kernels must be tailored to architecture.

– Implementation of optimal kernel will vary between archs.
– No universal binary → need for separate compilation paths.

SLIDE 61

Kokkos Node API

  • Kokkos provides two main components:

– Kokkos memory model addresses Difficulty #1

  • Allocation, deallocation and efficient access of memory
  • compute buffer: special memory used for parallel computation
  • New: Local Store Pointer and Buffer with size.

– Kokkos compute model addresses Difficulty #2

  • Description of kernels for parallel execution on a node
  • Provides stubs for common parallel work constructs
  • Currently, parallel for loop and parallel reduce
  • Code is developed around a polymorphic Node object.
  • Supporting a new platform requires only the implementation of a new node type (see the sketch below).
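A hedged sketch of what “a new node type” means in practice; this is not the actual Kokkos Node class, just a minimal stand-in that follows the parallel_for / parallel_reduce and work-data-pair contract described on the following slides.

    // Hypothetical node types implementing the generic compute model.
    struct SerialNode {
      template <class WDP>
      static void parallel_for(int beg, int end, WDP wd) {
        for (int i = beg; i < end; ++i) wd.execute(i);
      }
      template <class WDP>
      static typename WDP::ReductionType parallel_reduce(int beg, int end, WDP wd) {
        typename WDP::ReductionType result = wd.identity();
        for (int i = beg; i < end; ++i) result = wd.reduce(result, wd.generate(i));
        return result;
      }
    };

    struct OpenMPNode {
      template <class WDP>
      static void parallel_for(int beg, int end, WDP wd) {
        #pragma omp parallel for
        for (int i = beg; i < end; ++i) wd.execute(i);
      }
      template <class WDP>
      static typename WDP::ReductionType parallel_reduce(int beg, int end, WDP wd) {
        typename WDP::ReductionType result = wd.identity();
        // Simple (not optimal) reduction: each thread reduces its chunk, then combines.
        #pragma omp parallel
        {
          typename WDP::ReductionType local = wd.identity();
          #pragma omp for nowait
          for (int i = beg; i < end; ++i) local = wd.reduce(local, wd.generate(i));
          #pragma omp critical
          result = wd.reduce(result, local);
        }
        return result;
      }
    };

User code written against parallel_for / parallel_reduce then runs unchanged on either node type.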

SLIDE 62

Kokkos Memory Model

  • A generic node model must at least:

– support the scenario involving distinct device memory
– allow efficient memory access under traditional scenarios

  • Nodes provide the following memory routines:

    ArrayRCP<T> Node::allocBuffer<T>(size_t sz);
    void        Node::copyToBuffer<T>(T * src, ArrayRCP<T> dest);
    void        Node::copyFromBuffer<T>(ArrayRCP<T> src, T * dest);
    ArrayRCP<T> Node::viewBuffer<T>(ArrayRCP<T> buff);
    void        Node::readyBuffer<T>(ArrayRCP<T> buff);

SLIDE 63

Kokkos Compute Model

  • How to make shared-memory programming generic:

– Parallel reduction is the intersection of dot() and norm1().
– Parallel for loop is the intersection of axpy() and mat-vec.
– We need a way of fusing kernels with these basic constructs.

  • Template meta-programming is the answer.

– This is the same approach that Intel TBB and Thrust take.
– Has the effect of requiring that Tpetra objects be templated on Node type.

  • Node provides generic parallel constructs, user fills in the rest:

    template <class WDP>
    void Node::parallel_for(int beg, int end, WDP workdata);

    template <class WDP>
    WDP::ReductionType Node::parallel_reduce(int beg, int end, WDP workdata);

For parallel_for, the work-data pair (WDP) struct provides:

  • loop body via WDP::execute(i)

For parallel_reduce, the work-data pair (WDP) struct provides:

  • reduction type WDP::ReductionType
  • element generation via WDP::generate(i)
  • reduction via WDP::reduce(x,y)

SLIDE 64

Example Kernels: axpy() and dot()

    template <class WDP>
    void Node::parallel_for(int beg, int end, WDP workdata);

    template <class WDP>
    WDP::ReductionType Node::parallel_reduce(int beg, int end, WDP workdata);

    template <class T>
    struct AxpyOp {
      const T * x;
      T * y;
      T alpha, beta;
      void execute(int i) { y[i] = alpha*x[i] + beta*y[i]; }
    };

    template <class T>
    struct DotOp {
      typedef T ReductionType;
      const T * x, * y;
      T identity()       { return (T)0; }
      T generate(int i)  { return x[i]*y[i]; }
      T reduce(T x, T y) { return x + y; }
    };

    AxpyOp<double> op;
    op.x = ...; op.alpha = ...;
    op.y = ...; op.beta = ...;
    node.parallel_for< AxpyOp<double> >(0, length, op);

    DotOp<float> dotop;
    dotop.x = ...; dotop.y = ...;
    float dot;
    dot = node.parallel_reduce< DotOp<float> >(0, length, dotop);

SLIDE 65

Hybrid CPU/GPU Computing

SLIDE 66

Hybrid Timings (Tpetra)

  • Tests of simple iterations:
  • Power method: one sparse mat-vec, two vector operations.
  • Conjugate gradient: one sparse mat-vec, five vector operations.
  • DNVS/x104 from UF Sparse Matrix Collection (100K rows, 9M entries).

  • NCCS/ORNL Lens node includes:
  • one NVIDIA Tesla C1060
  • one NVIDIA 8800 GTX
  • Four AMD quad-core CPUs
  • Results are very tentative!
  • suboptimal GPU traffic
  • bad format/kernel for GPU
  • bad data placement for threads

Node | PM (mflop/s) | CG (mflop/s)
Single thread | 140 | 614
8800 GPU | 1,172 | 1,222
Tesla GPU | 1,475 | 1,531
Tesla + 8800 | 981 | 1,025
16 threads | 816 | 1,376
1 node: 15 threads + Tesla | 867 | 1,731
2 nodes: 15 threads + Tesla | 1,677 | 2,102

SLIDE 67

New Core Linear Algebra Needs

SLIDE 68

Advanced Modeling and Simulation Capabilities: Stability, Uncertainty and Optimization

  • Promise: 10-1000 times increase in parallelism (or more).
  • Pre-requisite: High-fidelity “forward” solve:

– Computing families of solutions to similar problems.
– Differences in results must be meaningful.

[Figure: block structure of families of forward problems (SPDEs, transient, optimization): lower block bi-diagonal and block tri-diagonal systems over time steps t0 … tn; each block is the size of a single forward problem.]

SLIDE 69

Advanced Capabilities: Readiness and Importance

Modeling Area | Sufficient Fidelity? | Other concerns | Advanced capabilities priority
Seismic (S. Collis, C. Ober) | Yes. | None as big. | Top.
Shock & Multiphysics (Alegra) (A. Robinson, C. Ober) | Yes, but some concerns. | Constitutive models, material responses maturity. | Secondary now. Non-intrusive most attractive.
Multiphysics (Charon) (J. Shadid) | Reacting flow w/ simple transport, device w/ drift diffusion, … | Higher fidelity, more accurate multiphysics. | Emerging, not top.
Solid mechanics (K. Pierson) | Yes, but… | Better contact. Better timestepping. Failure modeling. | Not high for now.

SLIDE 70

Advanced Capabilities: Other issues

  • Non-intrusive algorithms (e.g., Dakota):

– Task level parallel:
  • A true peta/exa scale problem?
  • Needs a cluster of 1000 tera/peta scale nodes.

  • Embedded/intrusive algorithms (e.g., Trilinos):

– Cost of code refactoring:
  • Non-linear application becomes “subroutine”.
  • Disruptive, pervasive design changes.

  • Forward problem fidelity:

– Not uniformly available.
– Smoothness issues.
– Material responses.

SLIDE 71

Advanced Capabilities: Derived Requirements

  • Large-scale problem presents collections of related subproblems with forward problem sizes.

  • Linear Solvers:

– Krylov methods for multiple RHS, related systems.

  • Preconditioners:

– Preconditioners for related systems.

  • Data structures/communication:

– Substantial graph data reuse.

Ax = b  →  AX = B,  A x_i = b_i,  A_i x_i = b_i

A_i = A_0 + ΔA_i,  pattern(A_i) = pattern(A_j)

SLIDE 72

Summary

  • App targets will change:

– Advanced modeling and simulation: Gives a better answer.
– Kernel set changes.

  • Resilience requires an integrated strategy:

– Most effort at the system/runtime level.
– C/R (with localization) will continue at the app level.
– Resilient algorithms will mitigate soft error impact.

  • Building the next generation of parallel applications requires enabling domain scientists to:

– Write sophisticated methods.
– Do so with serial fragments.
– Fragments hoisted into scalable, resilient fragment.

SLIDE 73

Quiz (True or False)

  • 1. MPI-only has the best parallel performance.
  • 2. Future parallel applications will not have MPI_Init().
  • 3. All future programmers will need to write parallel code.
  • 4. Use of “markup”, e.g., OpenMP pragmas, is the least intrusive approach to parallelizing a code.
  • 5. DRY is not possible across CPUs and GPUs.
  • 6. GPUs are a harbinger of CPU things to come.
  • 7. Checkpoint/Restart will be sufficient for scalable resilience.
  • 8. Resilience will be built into algorithms.
  • 9. MPI-only and MPI+X can coexist in the same application.
  • 10. Kernels will be different in the future.