Kokkos: Performance Portability and Productivity for C++ Applications

Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions
October 23-27, 2017
Schloss Dagstuhl Seminar 17431, Wadern, Germany
SAND2017-11734 PE

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

[Figure: software stack. Applications & Libraries (LAMMPS, EMPIRE, Albany, SPARC, Drekar, Trilinos) sit on Kokkos*, which provides performance portability for C++ applications across Multi-Core, Many-Core, APU, and CPU+GPU architectures with HBM and DDR memory.]

*κόκκος, Greek: “granule” or “grain”; like grains of sand on a beach

Performance Portability and Productivity
§ Economics: optimize F_org(perf, port, prod)
§ Performance: execution time / energy to solution
  § On what platforms?
§ Portability: sustaining for multiple evolving architectures
  § Tool ecosystem: compilers, debuggers, analyzers, …
  § Interoperability with “architecture native” programming mechanism
  § Standard language using “as is” compilers; we use C++
§ Productivity: aggregate development time and resources
  § Skills ecosystem: ease-of-use, education & training, support, …
  § Incremental path to adoption by legacy code; we use C++
§ Kokkos’ 1+ζ economics
  § N codes on M architectures leading to N*(1+ζ*M) versions
  § O(ζ*N*M) architecture specialized components
  § Written in “architecture native” programming mechanism
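To make the 1+ζ economics concrete, here is a worked count under assumed numbers. The values N = 5 codes, M = 4 architectures, and ζ = 0.1 are illustrative and do not come from the talk; ζ is read here as the small fraction of a code that must be specialized per architecture, which is an interpretation of the slide.

```latex
\begin{align*}
\text{one version per code per architecture:}\quad
  & N \cdot M = 5 \cdot 4 = 20 \\
\text{Kokkos' } 1+\zeta \text{ economics:}\quad
  & N\,(1 + \zeta M) = 5\,(1 + 0.1 \cdot 4) = 7
\end{align*}
```

Under these assumptions the saving grows with M as long as ζ stays well below 1.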

Kokkos Programming Model
§ Foundation
  § Well-defined Patterns for Parallel Programming (e.g., Mattson et al., 2004)
  § Well-defined multidimensional array semantics
§ Strategy
  § User exposes parallelizable grains of computations and data
  § Kokkos maps grains onto hardware according to patterns and policies
  § Integrated mapping of both computations and data leading to architecture-appropriate memory access pattern
§ Policies may introduce architecture-specific parameters
  § Without changing source code
  § N(p)+O(ζ*N*M) versions
  § Opportunity for auto-tuners to choose parameters?
§ Policy parameters have architecture-specific defaults
  § Work for simple / common use cases

Programming Model Abstractions (slide annotation: Unique to Kokkos)
Parallel Pattern (for, reduce, scan, task-dag, …) | Polymorphic Multidimensional Array (data structure pattern)
Parallel Execution Policy (Scheduling, Tiling, Thread Teams, …) | Array Element Access Policy (Layout, Tiling, RandomAccess, …)
Execution Space (CPU, GPU, which cores, which GPU) | Memory Space (CPU, GPU, which NUMA, which GPU)
* Extensibility throughout

Multidimensional Array w/ Polymorphic Layout
§ Classical (50 years!) data pattern for science & engineering codes
§ Computer languages hard-wire multidimensional array layout mapping
§ Problem: different architectures require different layouts for performance
  Ø Leads to architecture-specific versions of code to obtain performance
  § E.g., “Array of Structure” ↔ “Structure of Array” redesigns
  § E.g., “row-major” for CPU caching, “column-major” for GPU coalescing
§ Kokkos separates layout from user’s computational code
  § Choose layout for architecture-specific memory access pattern
  Ø Without modifying user’s computational code
  § Polymorphic layout via C++ template meta-programming (extensible)
  Ø e.g., Hierarchical Tiling layout (array of structure of array)
§ Bonus: easy/transparent use of special data access hardware
  § Atomic operations, GPU texture cache, ... (extensible)
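The separation of layout from computational code can be sketched in standard C++ with a layout template parameter. This is a minimal illustration, not Kokkos' implementation: the names RowMajor, ColumnMajor, Array2D, and fill_with_index are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies (illustrative names, not Kokkos types).
// Each maps a 2D index (i, j) of an n0 x n1 array to a linear offset.
struct RowMajor {
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t /*n0*/, std::size_t n1) {
    return i * n1 + j;  // j is the fastest-varying index
  }
};

struct ColumnMajor {
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t n0, std::size_t /*n1*/) {
    return i + j * n0;  // i is the fastest-varying index
  }
};

// A 2D array whose memory layout is a compile-time template parameter.
template <class T, class Layout>
class Array2D {
 public:
  Array2D(std::size_t n0, std::size_t n1)
      : n0_(n0), n1_(n1), data_(n0 * n1) {}

  // The indexing operator applies the layout mapping.
  T& operator()(std::size_t i, std::size_t j) {
    assert(i < n0_ && j < n1_);
    return data_[Layout::offset(i, j, n0_, n1_)];
  }

  std::size_t extent0() const { return n0_; }
  std::size_t extent1() const { return n1_; }
  const T* data() const { return data_.data(); }

 private:
  std::size_t n0_, n1_;
  std::vector<T> data_;
};

// Layout-agnostic user code: written once against a(i, j).
template <class Array>
void fill_with_index(Array& a) {
  for (std::size_t i = 0; i < a.extent0(); ++i)
    for (std::size_t j = 0; j < a.extent1(); ++j)
      a(i, j) = static_cast<double>(10 * i + j);
}
```

Calling fill_with_index on an Array2D<double, RowMajor> and on an Array2D<double, ColumnMajor> gives the same a(i, j) values while the order of elements in memory differs, which is the property the slide describes.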

Performance Impact of Layout
[Figure: Kokkos Tutorial Exercise 04 (Layout), kernel <y|Ax>, fixed size. Bandwidth (GB/s) versus Number of Rows (N) for Left and Right layouts on three platforms: KNL (Xeon Phi 68c), HSW (Dual Xeon Haswell 2x16c), and Pascal60 (Nvidia GPU). Plot annotations: coalesced, uncoalesced, cached, uncached.]

Multidimensional Array, Kokkos’ C++ API
§ View< ArrayType , Policy… > a ;
  § ArrayType defines scalar type and array’s static/dynamic dimensions
  § Indexing operator applies the layout mapping: a(i,j,k,l) → memory location
  § Policy: memory space, layout, access intent, reference counting, …
§ Trivial to swap between array-of-struct (AoS) and struct-of-arrays (SoA)
  § Kokkos does this by default between CPU and GPU
§ Layout is specifiable (otherwise defaults)
  § Why? For compatibility with legacy code, algorithmic performance tuning, ...
§ Layout is customizable
  § E.g., hierarchical tiling (brick) layout (AoSoA)
§ Changing layout can be transparent to existing code, if written layout-agnostic
  § Layout-aware algorithm can extract tiles
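One way a single ArrayType can carry both the scalar type and the static/dynamic dimensions is through ordinary C++ type syntax, where each `*` marks a runtime dimension and each `[N]` a compile-time extent. The sketch below decodes such a type with standard type traits. It is an assumption-laden illustration: array_traits, pointer_depth, and remove_all_pointers are hypothetical helpers written for this example, not Kokkos' internal machinery.

```cpp
#include <cstddef>
#include <type_traits>

// Counts pointer levels: each '*' stands for one runtime dimension.
template <class T>
struct pointer_depth : std::integral_constant<std::size_t, 0> {};

template <class T>
struct pointer_depth<T*>
    : std::integral_constant<std::size_t, 1 + pointer_depth<T>::value> {};

// Strips all pointer levels to recover the scalar type.
template <class T>
struct remove_all_pointers { using type = T; };

template <class T>
struct remove_all_pointers<T*> : remove_all_pointers<T> {};

// Hypothetical traits class (not Kokkos' own) that decodes an ArrayType.
template <class ArrayType>
struct array_traits {
  using no_extents = typename std::remove_all_extents<ArrayType>::type;
  using scalar_type = typename remove_all_pointers<no_extents>::type;
  static constexpr std::size_t dynamic_rank = pointer_depth<no_extents>::value;
  static constexpr std::size_t static_rank = std::rank<ArrayType>::value;
  static constexpr std::size_t rank = dynamic_rank + static_rank;
};

// double**[3]: two runtime dimensions and one compile-time extent of 3.
using Example = double**[3];
static_assert(array_traits<Example>::dynamic_rank == 2, "two '*' dimensions");
static_assert(array_traits<Example>::static_rank == 1, "one '[3]' dimension");
static_assert(std::extent<Example, 0>::value == 3, "static extent is 3");
static_assert(std::is_same<array_traits<Example>::scalar_type, double>::value,
              "scalar type is double");
```

Because everything is resolved at compile time, a static extent such as 3 is available to the optimizer, while the runtime dimensions would be supplied when the array is constructed.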

Patterns, Policies, and C++ Lambdas
§ Pattern composed with policy drives the computational body
    for ( int i = 0 ; i < N ; ++i ) { /* body */ }
    (slide labels on the loop: pattern, policy, body)
    parallel_for ( N , [=]( int i ) { /* body */ } );
    (slide label: C++ lambda)
    pattern( Policy<Params…>(params...) , body(args…) );
    (args… derived from pattern and Policy)
§ Data parallel patterns: for, reduce, scan
  § Transparently manage thread local values, inter-thread reductions, …
§ Data parallel policies: 1D range, nD range, thread teams, …
§ Data parallel policy parameters (extensible)
  § Static or dynamic work partitioning
  § nD loop collapse ordering and tiling
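The composition pattern( policy, body ) can be sketched with a serial reference implementation in standard C++. Everything in the sketch namespace is hypothetical and only mirrors the shape of the calls on the slide; it is not the Kokkos API, and it runs serially where a real backend would distribute the indices.

```cpp
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical names, not the Kokkos API

// Policy: a 1D index range [begin, end).
struct Range {
  std::size_t begin;
  std::size_t end;
};

// Pattern "for": apply the body to every index the policy describes.
// The body only sees an index, so iterations could be run in any order.
template <class Body>
void parallel_for(Range policy, Body body) {
  for (std::size_t i = policy.begin; i < policy.end; ++i) body(i);
}

// Pattern "reduce": the body receives a reference to a running value, so
// the pattern owns the thread-local values and their final combination.
template <class Body, class T>
void parallel_reduce(Range policy, Body body, T& result) {
  T local = T();  // a parallel backend would keep one of these per thread
  for (std::size_t i = policy.begin; i < policy.end; ++i) body(i, local);
  result = local;
}

}  // namespace sketch

// Usage: compute y = 2*x with "for", then sum y with "reduce".
inline double scale_and_sum(const std::vector<double>& x) {
  std::vector<double> y(x.size());
  const double* xp = x.data();
  double* yp = y.data();
  // Capture by value, as on the slide; the pointers refer to shared data.
  sketch::parallel_for(sketch::Range{0, x.size()},
                       [=](std::size_t i) { yp[i] = 2.0 * xp[i]; });
  double sum = 0.0;
  sketch::parallel_reduce(sketch::Range{0, x.size()},
                          [=](std::size_t i, double& acc) { acc += yp[i]; },
                          sum);
  return sum;
}
```

The user code never names a thread or a partial sum; swapping the serial loops inside the patterns for a threaded implementation would leave scale_and_sum unchanged.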

Pattern and Policy (brand new!): Directed Acyclic Graph (DAG) of Tasks
§ Parallel Patterns: Task-DAG and Work-DAG
  § DAG: acyclic execute-after dependences
  § Task-DAG: Heterogeneous and dynamic collection of parallel computations
  § Work-DAG: Homogeneous and static collection of parallel computations
§ Task Scheduler Responsibilities
  § Choose and execute ready tasks
  § Update execute-after dependences
  § Manage tasks’ dynamic memory and lifecycle
  § GPU was a real challenge!

Pattern and Policy: Directed Acyclic Graph (DAG) of Tasks
§ Task-DAG (heterogeneous and dynamic)
  § Tasks spawn tasks of different functions; DAG is dynamic
      spawn( Policy(params…) , body );
  § GPU portability and performance: tasks cannot block or be interrupted
      respawn( this , params… ); // replaces “wait” semantics
  § Policy: single-thread/thread-team, dependences, priority
  § GPU challenges: scalable scheduler & memory pool, non-coherent L1 cache
§ Work-DAG (homogeneous and static)
  § DAG is declared up-front; single work function given integer work index
  § Similar to data-parallel with an execute-after index graph (CRS array)
      parallel_for( WorkGraphPolicy<Params...>( graph ), body( int index ) );
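The Work-DAG idea, one work function driven by an execute-after graph in CRS form, can be sketched as a serial reference scheduler. This is an illustration under stated assumptions: CrsGraph and run_work_graph are hypothetical names, the graph orientation (an edge i -> j means j executes after i) is assumed, and a real scheduler would execute ready items concurrently rather than from a single queue.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical CRS (compressed row storage) graph: the successors of work
// item i are entries[row_offsets[i]] .. entries[row_offsets[i + 1] - 1].
// Assumed orientation: an edge i -> j means "j executes after i".
struct CrsGraph {
  std::vector<std::size_t> row_offsets;  // size = number of work items + 1
  std::vector<std::size_t> entries;      // successor indices
};

// Serial reference scheduler: calls body(index) once per work item, never
// before all of that item's predecessors have run. Returns the number of
// items executed.
template <class Body>
std::size_t run_work_graph(const CrsGraph& g, Body body) {
  if (g.row_offsets.empty()) return 0;
  const std::size_t n = g.row_offsets.size() - 1;

  // Count the unfinished predecessors of each item.
  std::vector<std::size_t> pending(n, 0);
  for (std::size_t e : g.entries) ++pending[e];

  // Items with no predecessors are ready immediately.
  std::queue<std::size_t> ready;
  for (std::size_t i = 0; i < n; ++i)
    if (pending[i] == 0) ready.push(i);

  std::size_t executed = 0;
  while (!ready.empty()) {
    const std::size_t i = ready.front();
    ready.pop();
    body(i);  // choose and execute a ready item
    ++executed;
    // Update the execute-after dependences of the successors.
    for (std::size_t k = g.row_offsets[i]; k < g.row_offsets[i + 1]; ++k)
      if (--pending[g.entries[k]] == 0) ready.push(g.entries[k]);
  }
  return executed;  // fewer than n would indicate a cycle in the graph
}
```

For a diamond-shaped graph (0 before 1 and 2, both before 3), the body runs for item 0 first and item 3 last; items 1 and 2 become ready together and are the ones a parallel scheduler could run at the same time.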

Conclusion / Future
§ Performance Portability, for C++ Applications
  § Integrated mapping of applications’ computations and data
  Ø Other programming models fail to map data and limit performance portability
  § Future proofing via designed-in extensibility and ongoing R&D
  § github.com/kokkos/kokkos
§ Application Developer Productivity, for C++ Applications
  § C++ lambda for simple data parallel loop syntax
  § Reduce and Scan inter-thread complexity managed by Kokkos
  § Hierarchical parallelism using nested patterns can increase parallelism
  § Case Study: no harder than OpenMP, optimization is easier
§ Goal: Future ISO/C++ Standard subsumes Kokkos abstractions
  § Parallel algorithms (C++17) incremental step for data parallel pattern/policy
  § Next steps in progress: Executors and ExecutionContext
  § Polymorphic multidimensional array on track for C++20
  § Atomic operations on non-atomic types on track for C++20
