Kokkos: Performance Portability and Productivity for C++ Applications



SLIDE 1

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP

Kokkos: Performance Portability and Productivity for C++ Applications

Performance Portability in Extreme Scale Computing: Metrics, Challenges, Solutions

October 23-27, 2017
Schloss Dagstuhl Seminar 17431
Wadern, Germany
SAND2017-11734 PE

SLIDE 2

[Figure: software stack diagram]

Applications & Libraries: Albany, Drekar, EMPIRE, LAMMPS, SPARC, Trilinos

Kokkos*: performance portability for C++ applications

Architectures: Multi-Core, Many-Core, APU, CPU+GPU (with DDR and HBM memories)

*κόκκος (Greek): “granule” or “grain”; like grains of sand on a beach

SLIDE 3

Performance Portability and Productivity

§ Economics: optimize F_org(perf, port, prod)
  § Performance: execution time / energy to solution
    § On what platforms?
  § Portability: sustaining for multiple evolving architectures
    § Tool ecosystem: compilers, debuggers, analyzers, …
    § Interoperability with “architecture native” programming mechanism
    § Standard language using “as is” compilers; we use C++
  § Productivity: aggregate development time and resources
    § Skills ecosystem: ease-of-use, education & training, support, …
    § Incremental path to adoption by legacy code; we use C++

§ Kokkos’ 1+𝛇 economics
  § N codes on M architectures leading to N*(1+𝛇*M) versions
  § O(𝛇*N*M) architecture specialized components
  § Written in “architecture native” programming mechanism
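
As a worked illustration of the 1+𝛇 economics, read 𝛇 as the fraction of each code that has to be specialized per architecture. That reading is an interpretation; the slide does not define 𝛇. The values N = 10, M = 4, and 𝛇 = 0.1 are made up for this example, and the "one full port per architecture" baseline is an assumed point of comparison, not a figure from the talk.

```latex
\begin{align*}
\text{one full port per architecture:} \quad & N \cdot M = 10 \cdot 4 = 40 \text{ versions} \\
\text{shared code plus specializations:} \quad & N\,(1 + \zeta M) = 10\,(1 + 0.1 \cdot 4) = 14 \text{ version-equivalents}
\end{align*}
```

Under these assumed numbers the architecture-specialized part is 𝛇*N*M = 4 version-equivalents, which is the O(𝛇*N*M) term on the slide.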

SLIDE 4

Kokkos Programming Model

§ Foundation
  § Well-defined Patterns for Parallel Programming (e.g., Mattson et al., 2004)
  § Well-defined multidimensional array semantics

§ Strategy
  § User exposes parallelizable grains of computations and data
  § Kokkos maps grains onto hardware according to patterns and policies
  § Integrated mapping of both computations and data leading to architecture-appropriate memory access pattern
  § Policies may introduce architecture-specific parameters
    § Without changing source code
    § N(p)+O(𝛇*N*M) versions
    § Opportunity for auto-tuners to choose parameters?
  § Policy parameters have architecture-specific defaults
    § Work for simple / common use cases

SLIDE 5

Programming Model Abstractions

Unique to Kokkos

Execution Space (CPU, GPU, which cores, which GPU)
Memory Space (CPU, GPU, which NUMA, which GPU)
Parallel Execution Policy (Scheduling, Tiling, Thread Teams, …)
Array Element Access Policy (Layout, Tiling, RandomAccess, …)
Parallel Pattern (for, reduce, scan, task-dag, …)
Polymorphic Multidimensional Array (data structure pattern)

* Extensibility throughout

SLIDE 6

Multidimensional Array w/ Polymorphic Layout

§ Classical (50 years!) data pattern for science & engineering codes
  § Computer languages hard-wire multidimensional array layout mapping
  § Problem: different architectures require different layouts for performance
    Ø Leads to architecture-specific versions of code to obtain performance
  § E.g., “Array of Structure” ↔ “Structure of Array” redesigns

§ Kokkos separates layout from user’s computational code
  § Choose layout for architecture-specific memory access pattern
    Ø Without modifying user’s computational code
    Ø e.g., “row-major” for CPU caching, “column-major” for GPU coalescing
  § Polymorphic layout via C++ template meta-programming (extensible)
    Ø e.g., Hierarchical Tiling layout (array of structure of array)

§ Bonus: easy/transparent use of special data access hardware
  § Atomic operations, GPU texture cache, ... (extensible)
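
The separation of layout from computational code can be sketched in plain C++ templates. This is a minimal illustration, not the Kokkos implementation: `Array2D`, `RowMajor`, `ColMajor`, and `inner_product` are hypothetical names invented for this sketch, and only two dimensions are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies (names invented for this sketch, not Kokkos types).
struct RowMajor {  // last index contiguous: the "row-major" CPU-caching case
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t /*n_rows*/, std::size_t n_cols) {
    return i * n_cols + j;
  }
};
struct ColMajor {  // first index contiguous: the "column-major" GPU-coalescing case
  static std::size_t offset(std::size_t i, std::size_t j,
                            std::size_t n_rows, std::size_t /*n_cols*/) {
    return j * n_rows + i;
  }
};

// 2D array whose index-to-memory mapping is a compile-time policy.
template <class T, class Layout>
class Array2D {
 public:
  Array2D(std::size_t n_rows, std::size_t n_cols)
      : n_rows_(n_rows), n_cols_(n_cols), data_(n_rows * n_cols) {}

  T& operator()(std::size_t i, std::size_t j) {
    assert(i < n_rows_ && j < n_cols_);
    return data_[Layout::offset(i, j, n_rows_, n_cols_)];
  }
  const T& operator()(std::size_t i, std::size_t j) const {
    assert(i < n_rows_ && j < n_cols_);
    return data_[Layout::offset(i, j, n_rows_, n_cols_)];
  }

  std::size_t rows() const { return n_rows_; }
  std::size_t cols() const { return n_cols_; }
  const T* data() const { return data_.data(); }  // raw storage, in layout order

 private:
  std::size_t n_rows_, n_cols_;
  std::vector<T> data_;
};

// Layout-agnostic user code: computes <y, Ax>. The same source works for
// either layout because it only ever touches A through A(i, j).
template <class Layout>
double inner_product(const std::vector<double>& y,
                     const Array2D<double, Layout>& A,
                     const std::vector<double>& x) {
  double result = 0.0;
  for (std::size_t i = 0; i < A.rows(); ++i)
    for (std::size_t j = 0; j < A.cols(); ++j)
      result += y[i] * A(i, j) * x[j];
  return result;
}
```

Switching `Array2D<double, RowMajor>` to `Array2D<double, ColMajor>` changes where each element lives in memory but leaves `inner_product` untouched, which is the point the slide makes about layout-agnostic code.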

SLIDE 7

Performance Impact of Layout: Kokkos Tutorial Kernel < y , Ax >

[Figure: “<y|Ax> Exercise 04 (Layout) Fixed Size”. Bandwidth (GB/s, axis 100 to 600) versus Number of Rows (N, axis 1 to 1x10^9). Series: HSW Left, HSW Right, KNL Left, KNL Right, Pascal60 Left, Pascal60 Right. Platforms: KNL = Xeon Phi 68c; HSW = Dual Xeon Haswell 2x16c; Pascal60 = Nvidia GPU. Curve annotations: coalesced, cached, uncoalesced, cached, uncached.]

SLIDE 8

Multidimensional Array, Kokkos’ C++ API

§ View< ArrayType , Policy… > a ;
  § ArrayType defines scalar type and array’s static/dynamic dimensions
  § Layout mapping indexing operator: a(i,j,k,l) → memory location
  § Policy: memory space, layout, access intent, reference counting, …
  § Trivial to swap between array-of-struct (AoS) and struct-of-arrays (SoA)
    § This is Kokkos’ default between CPU and GPU

§ Layout is specifiable (otherwise defaults)
  § Why? For compatibility with legacy code, algorithmic performance tuning, ...

§ Layout is customizable
  § E.g., hierarchical tiling (brick) layout (AoSoA)
  § Changing layout can be transparent to existing code, if that code is written layout-agnostic
  § Layout-aware algorithm can extract tiles
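
A customizable layout boils down to a different index-to-offset function. The sketch below shows one possible tiled ("brick") mapping for a 2D array in plain C++. `TiledLayout` and its members are hypothetical names for this illustration, not Kokkos types, and the choice of row-major ordering both across tiles and inside a tile is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tiled ("brick") layout for an n_rows x n_cols array.
// Elements of one TileRows x TileCols tile are stored contiguously;
// tiles themselves are stored one after another, row of tiles by row of tiles.
template <std::size_t TileRows, std::size_t TileCols>
struct TiledLayout {
  std::size_t n_rows;
  std::size_t n_cols;

  // Number of tiles needed to cover the columns / the rows (rounded up).
  std::size_t tiles_across() const { return (n_cols + TileCols - 1) / TileCols; }
  std::size_t tiles_down() const { return (n_rows + TileRows - 1) / TileRows; }

  // Total storage in elements, including padding in partially filled tiles.
  std::size_t span() const {
    return tiles_across() * tiles_down() * TileRows * TileCols;
  }

  // Offset of the first element of the tile that contains (i, j).
  // A layout-aware algorithm could use this to pull out a whole tile.
  std::size_t tile_begin(std::size_t i, std::size_t j) const {
    assert(i < n_rows && j < n_cols);
    const std::size_t tile_index = (i / TileRows) * tiles_across() + (j / TileCols);
    return tile_index * TileRows * TileCols;
  }

  // Layout mapping: (i, j) -> memory offset.
  std::size_t offset(std::size_t i, std::size_t j) const {
    return tile_begin(i, j) + (i % TileRows) * TileCols + (j % TileCols);
  }
};
```

Code that only calls `offset(i, j)` stays layout-agnostic, while a tile-aware kernel can start from `tile_begin(i, j)` and walk `TileRows * TileCols` contiguous elements.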

SLIDE 9

Patterns, Policies, and C++ Lambdas

§ Pattern composed with policy drives the computational body
    for ( int i = 0 ; i < N ; ++i ) { /* body */ }          (serial loop: pattern, policy, body)
    parallel_for ( N , [=]( int i ) { /* body */ } );       (pattern, policy, body as a C++ lambda)
    pattern( Policy<Params…>(params...) , body(args…) );    (general form; args… derived from pattern and Policy)

§ Data parallel patterns: for, reduce, scan
  § Transparently manage thread local values, inter-thread reductions, …

§ Data parallel policies: 1D range, nD range, thread teams, …

§ Data parallel policy parameters (extensible)
  § Static or dynamic work partitioning
  § nD loop collapse ordering and tiling
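
The "pattern( policy, body )" composition can be sketched without any library. The code below is a serial stand-in written for this illustration only: `RangePolicy`, `parallel_for`, and `parallel_reduce` here are simplified hypothetical versions that reuse the slide's vocabulary, not the Kokkos implementations, and chunks run one after another instead of on separate threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D range policy with one tunable parameter (chunk size).
struct RangePolicy {
  std::size_t begin;
  std::size_t end;
  std::size_t chunk;
  RangePolicy(std::size_t b, std::size_t e, std::size_t c = 1)
      : begin(b), end(e), chunk(c) {}
};

// "for" pattern: calls body(i) exactly once for every i in [begin, end).
// The policy decides how the range is cut into chunks; a parallel backend
// could hand each chunk to a different thread.
template <class Body>
void parallel_for(const RangePolicy& policy, Body body) {
  assert(policy.chunk > 0);
  for (std::size_t c = policy.begin; c < policy.end; c += policy.chunk) {
    const std::size_t stop =
        (policy.end - c < policy.chunk) ? policy.end : c + policy.chunk;
    for (std::size_t i = c; i < stop; ++i) body(i);
  }
}

// "reduce" pattern: body(i, partial) accumulates into a per-chunk partial
// value; the pattern, not the user, combines the partial values.
template <class Body>
double parallel_reduce(const RangePolicy& policy, Body body) {
  assert(policy.chunk > 0);
  double total = 0.0;
  for (std::size_t c = policy.begin; c < policy.end; c += policy.chunk) {
    const std::size_t stop =
        (policy.end - c < policy.chunk) ? policy.end : c + policy.chunk;
    double partial = 0.0;  // stands in for a thread-local value
    for (std::size_t i = c; i < stop; ++i) body(i, partial);
    total += partial;      // stands in for the inter-thread reduction
  }
  return total;
}

// Usage: fill a[i] = 2*i with the "for" pattern, then sum with "reduce".
double sum_of_doubled_indices(std::size_t n) {
  std::vector<double> a(n);
  parallel_for(RangePolicy(0, n, 4),
               [&a](std::size_t i) { a[i] = 2.0 * static_cast<double>(i); });
  return parallel_reduce(RangePolicy(0, n, 4),
                         [&a](std::size_t i, double& partial) { partial += a[i]; });
}
```

Changing the chunk size in `RangePolicy` alters how work is partitioned without touching either lambda body, mirroring the slide's point that policy parameters are separate from the computational body.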

SLIDE 10

Pattern and Policy (brand new!): Directed Acyclic Graph (DAG) of Tasks

§ Parallel Patterns: Task-DAG and Work-DAG
  § DAG: acyclic execute-after dependences
  § Task-DAG: Heterogeneous and dynamic collection of parallel computations
  § Work-DAG: Homogeneous and static collection of parallel computations

§ Task Scheduler Responsibilities
  § Choose and execute ready tasks
  § Update execute-after dependences
  § Manage tasks’ dynamic memory and lifecycle
  § GPU was a real challenge!

SLIDE 11

Pattern and Policy: Directed Acyclic Graph (DAG) of Tasks

§ Task-DAG (heterogeneous and dynamic)
  § Tasks spawn tasks of different functions; DAG is dynamic
      spawn( Policy(params…) , body );
  § GPU portability and performance: tasks cannot block or be interrupted
      respawn( this , params… ); // replaces “wait” semantics
  § Policy: single-thread/thread-team, dependences, priority
  § GPU challenges: scalable scheduler & memory pool, non-coherent L1 cache

§ Work-DAG (homogeneous and static)
  § DAG is declared up-front; single work function given integer work index
  § Similar to data-parallel with an execute-after index graph (CRS array)
      parallel_for( WorkGraphPolicy<Params...>( graph ), body( int index ) );
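
The Work-DAG idea (one work function, an integer work index, and an execute-after graph stored as a CRS array) can be sketched as a serial scheduler. Everything below is an assumption made for illustration: `WorkGraph` and `run_work_graph` are hypothetical names, the CRS rows are taken to list each item's successors, and a real scheduler would run ready items concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Execute-after graph in compressed row storage (CRS). The items that must
// run after work item i are col_idx[row_ptr[i]] .. col_idx[row_ptr[i + 1] - 1].
struct WorkGraph {
  std::vector<std::size_t> row_ptr;  // size = number of work items + 1
  std::vector<std::size_t> col_idx;  // successor indices, row by row
};

// Serial stand-in for a work-graph scheduler: calls body(index) for each
// work item only after all of its predecessors have finished.
// Returns the number of items executed (fewer than the item count only if
// the graph contains a cycle, i.e. is not a DAG).
template <class Body>
std::size_t run_work_graph(const WorkGraph& g, Body body) {
  assert(!g.row_ptr.empty());
  const std::size_t n = g.row_ptr.size() - 1;

  // Count unfinished predecessors of every item.
  std::vector<std::size_t> pending(n, 0);
  for (std::size_t successor : g.col_idx) {
    assert(successor < n);
    ++pending[successor];
  }

  // Items with no predecessors are ready immediately.
  std::queue<std::size_t> ready;
  for (std::size_t i = 0; i < n; ++i)
    if (pending[i] == 0) ready.push(i);

  std::size_t executed = 0;
  while (!ready.empty()) {
    const std::size_t i = ready.front();
    ready.pop();
    body(i);  // the single work function, given the integer work index
    ++executed;
    // Update execute-after dependences and release newly ready items.
    for (std::size_t k = g.row_ptr[i]; k < g.row_ptr[i + 1]; ++k) {
      const std::size_t successor = g.col_idx[k];
      if (--pending[successor] == 0) ready.push(successor);
    }
  }
  return executed;
}
```

For a diamond-shaped graph (0 before 1 and 2, both before 3) the CRS arrays would be `row_ptr = {0, 2, 3, 4, 4}` and `col_idx = {1, 2, 3, 3}`; item 3 becomes ready only after both 1 and 2 have run.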

SLIDE 12

Conclusion / Future

§ Performance Portability, for C++ Applications
  § Integrated mapping of applications’ computations and data
    Ø Other programming models fail to map data and limit performance portability
  § Future proofing via designed-in extensibility and ongoing R&D
  § github.com/kokkos/kokkos

§ Application Developer Productivity, for C++ Applications
  § C++ lambda for simple data parallel loop syntax
  § Reduce and Scan inter-thread complexity managed by Kokkos
  § Hierarchical parallelism using nested patterns can increase parallelism
  § Case Study: no harder than OpenMP, optimization is easier

§ Goal: Future ISO/C++ Standard subsumes Kokkos abstractions
  § Parallel algorithms (C++17) incremental step for data parallel pattern/policy
  § Next steps in progress: Executors and ExecutionContext
  § Polymorphic multidimensional array on track for C++20
  § Atomic operations on non-atomic types on track for C++20
