Kokkos Hierarchical Task-Data Parallelism



SLIDE 1


Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

Kokkos Hierarchical Task-Data Parallelism for C++ HPC Applications

GPU Tech. Conference

May 8-11, 2017 San Jose, CA

  • H. Carter Edwards

SAND2017-4681 C

SLIDE 2

[Figure: node architectures (Multi-Core, Many-Core, APU, CPU+GPU), each with DDR and/or HBM memory]

Applications & Libraries: Drekar, Trilinos, SPARC, Albany, EMPIRE, LAMMPS

Kokkos*: performance portability for C++ applications

*κόκκος (Greek: "granule" or "grain"; like grains of sand on a beach)

SLIDE 3

Dynamic Directed Acyclic Graph (DAG) of Tasks

§ Parallel Pattern
  § Tasks: heterogeneous collection of parallel computations
  § DAG: tasks may have acyclic "execute after" dependences
  § Dynamic: new tasks may be created/allocated by executing tasks

§ Task Scheduler Responsibilities
  § Execute ready tasks
  § Choose from among ready tasks
  § Honor "execute after" dependences
  § Manage tasks' dynamic lifecycle
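The scheduler responsibilities above can be sketched with a dependency-counting executor: each task records how many unfinished predecessors it has, and it enters the ready queue only when that count reaches zero. This is a serial toy sketch (`TaskDag` and its methods are hypothetical names, not Kokkos API), but it captures the ready-queue / "execute after" mechanics:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Toy task-DAG executor: a task becomes ready once every predecessor
// has finished. Serial stand-in for a parallel scheduler.
struct TaskDag {
  struct Task {
    std::function<void()> work;
    std::vector<int> successors;   // tasks that "execute after" this one
    int unfinished_preds = 0;
  };
  std::vector<Task> tasks;

  int create(std::function<void()> work) {
    tasks.push_back({std::move(work), {}, 0});
    return static_cast<int>(tasks.size()) - 1;
  }
  // Declare that task b executes after task a.
  void add_dependence(int a, int b) {
    tasks[a].successors.push_back(b);
    ++tasks[b].unfinished_preds;
  }
  // Run every task, honoring dependences; returns the execution order.
  std::vector<int> run() {
    std::queue<int> ready;
    for (int i = 0; i < static_cast<int>(tasks.size()); ++i)
      if (tasks[i].unfinished_preds == 0) ready.push(i);
    std::vector<int> order;
    while (!ready.empty()) {
      int t = ready.front(); ready.pop();
      tasks[t].work();
      order.push_back(t);
      // Completing t may make successors ready.
      for (int s : tasks[t].successors)
        if (--tasks[s].unfinished_preds == 0) ready.push(s);
    }
    return order;
  }
};
```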

SLIDE 4

Motivating Use Cases

  • 1. Incomplete Level-K Cholesky factorization of a sparse matrix
    § Block partitioning into submatrices
    § A given submatrix may or may not exist
    § DAG of submatrix computations
    § Each submatrix computation is internally data parallel
    § Lead: Kyungjoo Kim / SNL

  • 2. Triangle enumeration in social networks (highly irregular graphs)
    § Discover triangles within the graph
    § Compute statistics on those triangles
    § Triangles are an intermediate result that need not be saved / stored
    Ø Problem: memory "high water mark"
    § Lead: Michael Wolf / SNL
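The key point of the triangle use case is that triangles are consumed as they are discovered rather than stored, which keeps the memory high-water mark low. A minimal serial illustration of that idea (sorted-adjacency intersection; `count_triangles` is a hypothetical helper, not the SNL implementation) accumulates a count instead of materializing the triangle list:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Count triangles in an undirected graph without storing them.
// Vertices are 0..n-1; each adjacency list adj[v] must be sorted.
// For each edge (u,v) with u < v, count common neighbors w > v, so each
// triangle u < v < w is discovered exactly once.
std::int64_t count_triangles(const std::vector<std::vector<int>>& adj) {
  std::int64_t triangles = 0;
  const int n = static_cast<int>(adj.size());
  for (int u = 0; u < n; ++u)
    for (int v : adj[u]) {
      if (v <= u) continue;  // visit each edge once, with u < v
      // Intersect adj[u] and adj[v], keeping only neighbors w > v.
      auto iu = std::lower_bound(adj[u].begin(), adj[u].end(), v + 1);
      auto iv = std::lower_bound(adj[v].begin(), adj[v].end(), v + 1);
      while (iu != adj[u].end() && iv != adj[v].end()) {
        if (*iu < *iv) ++iu;
        else if (*iv < *iu) ++iv;
        else { ++triangles; ++iu; ++iv; }  // update statistic, discard triangle
      }
    }
  return triangles;
}
```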

[Figures: level-k sparsity pattern of the block-partitioned matrix, and the resulting DAG of Chol / Trsm / Herk / Gemm submatrix tasks]

SLIDE 5

Hierarchical Parallelism

§ Shared functionality with hierarchical data-data parallelism
  § The same kernel (task) executed on ...
  § OpenMP: league of teams of threads
  § Cuda: grid of blocks of threads

§ Intra-team parallelism (data or task)
  § Threads within a team execute concurrently
  § Data: each team executes the same computation
  Ø Task: each team executes a different task
  § Nested parallel patterns: for, reduce, scan

§ Mapping teams onto hardware
  § CPU: team == hyperthreads sharing an L1 cache
    § Requires a low degree of intra-team parallelism
  Ø Cuda: team == warp
    § Requires a modest degree of intra-team parallelism
    § A year ago: team == block, an infeasibly high degree of parallelism
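The league/team/thread decomposition can be made concrete with plain serial loops standing in for the parallel levels. `hierarchical_sum` is a hypothetical name, and the loops only model the indexing (league = outer level, team threads = inner level), not actual concurrent execution:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Two-level (league-of-teams-of-threads) sum reduction, written as plain
// serial loops so the index mapping is explicit. In OpenMP terms the outer
// loop is the league of teams; in CUDA terms, the grid of blocks.
double hierarchical_sum(const std::vector<double>& x,
                        int league_size, int team_size) {
  double total = 0.0;
  for (int team = 0; team < league_size; ++team) {    // the league of teams
    double team_sum = 0.0;                            // team-local reduction
    for (int lane = 0; lane < team_size; ++lane) {    // threads in one team
      // Strided so every (team, lane) pair owns a disjoint slice of x.
      for (std::size_t i = std::size_t(team) * team_size + lane;
           i < x.size(); i += std::size_t(league_size) * team_size)
        team_sum += x[i];
    }
    total += team_sum;                                // inter-team combine
  }
  return total;
}
```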

SLIDE 6

Anatomy and Life-cycle of a Task

§ Anatomy
  § Is a C++ closure (e.g., a functor) of data + function
  § Is referenced by a Kokkos::future
  § Executes on a single thread or a thread team
  § May only execute when its dependences are complete (DAG)

§ Life-cycle: constructing -> waiting -> executing -> complete
  § A task with internal data parallelism executes on a thread team
  § A serial task executes on a single thread
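The life-cycle can be written down as a small state machine. `TaskState` and `valid_transition` are hypothetical names for illustration; the respawn transition (executing back to waiting) is the pattern described later in the deck:

```cpp
#include <cassert>

// Toy model of the task life-cycle: constructing -> waiting -> executing
// -> complete, where a respawn sends an executing task back to waiting.
enum class TaskState { Constructing, Waiting, Executing, Complete };

// Returns true when `from -> to` is one of the legal transitions.
bool valid_transition(TaskState from, TaskState to) {
  switch (from) {
    case TaskState::Constructing:
      return to == TaskState::Waiting;            // spawn
    case TaskState::Waiting:
      return to == TaskState::Executing;          // dependences complete
    case TaskState::Executing:
      return to == TaskState::Complete            // finished
          || to == TaskState::Waiting;            // respawn
    case TaskState::Complete:
      return false;                               // terminal state
  }
  return false;
}
```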

SLIDE 7

Dynamic Task DAG Challenges

§ A DAG of heterogeneous closures
  § Map execution to a single thread or a thread team
  § Manage memory of dynamically created and completed tasks
  § Manage a DAG with dynamically created and completed dependences

§ GPU: an executing task cannot block or yield to another task
  Ø Forced a beneficial reconceptualization! Non-blocking tasks
  § Eliminates context-switching overhead: stack, registers, ...

§ Portability and Performance
  § Heterogeneous function pointers (CPU, GPU)
  § Creating GPU tasks on the host and within tasks executing on the GPU
  § Bounded memory pool with scalable allocation/deallocation
  § Scalable DAG management and scheduling

SLIDE 8

Managing a Non-blocking Task’s Lifecycle

§ Create: allocate and construct
  § By the main process or within another task
  § Allocate from a memory pool
  § Construct internal data
  § Assign DAG dependences

§ Spawn: enqueue to the scheduler
  § Assign DAG dependences
  § Assign priority: high, regular, low

§ Respawn: re-enqueue to the scheduler
  § Replaces waiting or yielding
  § Assign new DAG dependences and/or priority
  § Reconceived wait-for-child-task pattern:
    Ø Create & spawn child task(s)
    Ø Reassign DAG dependence(s) to the new child task(s)
    Ø Respawn to execute again after the child task(s) complete

[Diagram: life-cycle states constructing, waiting, executing, complete, driven by create, spawn, and respawn]
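The reconceived wait-for-child pattern can be modeled with a toy single-threaded scheduler in which no task ever blocks: a Fibonacci task spawns its children, marks itself as pending on them, and is re-enqueued by the last child to finish. All names here (`FibTask`, `Scheduler`) are hypothetical; this sketches the respawn idea, not the Kokkos scheduler:

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <vector>

// A task in the toy respawn scheduler. phase 0: spawn children and
// "respawn" (become pending on them); phase 1: combine child results.
struct FibTask {
  long n = 0;
  int phase = 0;
  long result = 0;
  int pending = 0;                 // unfinished children
  FibTask* parent = nullptr;
  std::vector<std::unique_ptr<FibTask>> children;
};

struct Scheduler {
  std::queue<FibTask*> ready;
  void spawn(FibTask* t) { ready.push(t); }
  void run() {
    while (!ready.empty()) {
      FibTask* t = ready.front(); ready.pop();
      if (t->phase == 0 && t->n >= 2) {
        // Create & spawn children; respawn self. The task re-enters the
        // ready queue only when both children have completed -- it never
        // blocks or yields while holding execution resources.
        t->phase = 1;
        t->pending = 2;
        for (long m : {t->n - 1, t->n - 2}) {
          auto c = std::make_unique<FibTask>();
          c->n = m;
          c->parent = t;
          spawn(c.get());
          t->children.push_back(std::move(c));
        }
      } else {
        // Base case, or second execution after the children finished.
        t->result = (t->n < 2)
            ? t->n
            : t->children[0]->result + t->children[1]->result;
        if (t->parent && --t->parent->pending == 0)
          spawn(t->parent);        // last child re-enqueues the parent
      }
    }
  }
};
```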

SLIDE 9

Task Scheduler and Memory Pool

§ Memory Pool

  § A large chunk of memory allocated in a Kokkos memory space
  § Allocate & deallocate small blocks of varying size within a parallel execution
  § Lock free, extremely low overhead
  § Tuning: min-alloc-size <= max-alloc-size <= superblock-size <= total-size
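A drastically simplified sketch of a lock-free pool shows how allocation can proceed with compare-and-swap instead of locks. Here one superblock of up to 64 fixed-size blocks is tracked by a single atomic bitmask; the real Kokkos memory pool is far more general (many superblocks, varying block sizes), and `BlockPool` is a hypothetical name:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal lock-free fixed-size-block pool: a 64-bit atomic bitmask tracks
// which blocks are in use. allocate() claims the first clear bit with a
// compare-and-swap; deallocate() clears the bit.
class BlockPool {
  std::vector<std::byte> storage_;
  std::size_t block_size_;
  std::atomic<std::uint64_t> used_{0};
public:
  BlockPool(std::size_t block_size, unsigned num_blocks /* <= 64 */)
    : storage_(block_size * num_blocks), block_size_(block_size) {}

  void* allocate() {
    std::uint64_t mask = used_.load();
    for (;;) {
      unsigned bit = 0;
      while (bit < 64 && ((mask >> bit) & 1u)) ++bit;       // first free block
      if (bit * block_size_ >= storage_.size()) return nullptr;  // pool full
      if (used_.compare_exchange_weak(mask, mask | (1ull << bit)))
        return storage_.data() + bit * block_size_;
      // CAS failed: another thread raced us; mask was refreshed, retry.
    }
  }
  void deallocate(void* p) {
    auto bit = (static_cast<std::byte*>(p) - storage_.data()) / block_size_;
    used_.fetch_and(~(1ull << bit));
  }
  unsigned in_use() const {
    std::uint64_t m = used_.load();
    unsigned n = 0;
    while (m) { n += static_cast<unsigned>(m & 1u); m >>= 1; }
    return n;
  }
};
```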

§ Task Scheduler

  § Uses the memory pool for tasks' memory
  § Ready queues (by priority) and waiting queues
    Ø Each queue is a simple linked list of tasks
    § A ready queue is the head of a linked list
    § Each task is the head of a linked list of its "execute after" tasks
  § Limit updates to push/pop, implemented with atomic operations
  § "When all" is a non-executing task with a list of dependences

[Diagram: tasks linked by "next" pointers within a queue and by DAG pointers to dependent tasks]
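The "limit updates to push/pop, implemented with atomic operations" point can be illustrated with a Treiber-stack style linked list of tasks. This sketch (hypothetical `ReadyStack` type) deliberately ignores ABA hazards and memory reclamation, which a production scheduler must handle:

```cpp
#include <atomic>
#include <cassert>

// A task node that lives inside an intrusive linked list.
struct TaskNode {
  int id;
  TaskNode* next = nullptr;
};

// Ready queue as an atomic singly linked list: the only mutations are
// push and pop at the head, each a single compare-and-swap loop.
class ReadyStack {
  std::atomic<TaskNode*> head_{nullptr};
public:
  void push(TaskNode* t) {
    t->next = head_.load();
    // Retry until no other thread changed head between the read and the swap.
    while (!head_.compare_exchange_weak(t->next, t)) {}
  }
  TaskNode* pop() {
    TaskNode* t = head_.load();
    while (t && !head_.compare_exchange_weak(t, t->next)) {}
    return t;   // nullptr means no ready task
  }
};
```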

SLIDE 10

Memory Pool Performance, as of April’17

§ Test Setup

  § 10 MB pool comprised of 153 x 64 KB superblocks, minimum block size 32 bytes
  § Allocations ranging between 32 and 128 bytes; average 80 bytes
  § [1] Allocate to N%; [2] cyclically deallocate & allocate between N% and 2/3 N%
  § parallel_for: every index allocates, then cyclically deallocates & allocates
  § Measure allocate + deallocate operations per second (best of 10 trials)

§ Deallocation is much simpler, with fewer operations, than allocation

§ Test Hardware: Pascal, Broadwell, Knights Landing

  § Fully subscribe cores
  § Every thread within every warp allocates & deallocates

§ For reference, an "apples to oranges" comparison
  § CUDA malloc / free on Pascal
  § jemalloc on Knights Landing

SLIDE 11

Memory Pool Performance, as of April’17

§ Memory pools have finite size with well-bounded scope

§ Algorithms’ and data structures’ memory pools do not pollute (fragment) each other’s memory

                           Fill 75%   Fill 95%    Cycle 75%   Cycle 95%
blocks:                    938,500    1,187,500
Pascal                     79 M/s     74 M/s      287 M/s     244 M/s
Broadwell                  13 M/s     13 M/s      46 M/s      49 M/s
Knights Landing            5.8 M/s    5.8 M/s     40 M/s      43 M/s

Apples-to-oranges comparison:
Pascal using CUDA malloc   3.5 M/s    2.9 M/s     15 M/s      12 M/s
Knights Landing using jemalloc: 379 M/s, 4115 M/s
  (thread-local caches, optimal blocking, NOT a fixed pool size)

SLIDE 12

Scheduler Unit Test Performance, as of April’17

§ Test Setup, (silly) Fibonacci task-dag algorithm

  § F(k) = F(k-1) + F(k-2)
  § If k >= 2, spawn F(k-1) and F(k-2), then respawn F(k) dependent on completion of when_all( { F(k-1), F(k-2) } )
  § F(k) cumulatively allocates/deallocates N tasks >> the "high water mark"
  § 1 MB pool comprised of 31 x 32 KB superblocks, minimum block size 32 bytes
  § Fully subscribe cores; a single-thread Fibonacci task consumes an entire GPU warp
    § Real algorithms' tasks have modest internal parallelism
  § Measure tasks per second; compare to raw allocate + deallocate performance
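The "cumulative tasks" figures can be cross-checked: one reading that reproduces the slide's numbers exactly is that each F(k) with k >= 2 contributes itself plus one when_all aggregate task, giving tasks(k) = tasks(k-1) + tasks(k-2) + 2 with tasks(0) = tasks(1) = 1. `fib_task_count` is a hypothetical helper name for this sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Cumulative task count for the Fibonacci task-DAG test: each F(k), k >= 2,
// accounts for itself, its when_all aggregate, and both child subtrees.
std::int64_t fib_task_count(int k) {
  std::vector<std::int64_t> t(static_cast<std::size_t>(std::max(k + 1, 2)));
  t[0] = t[1] = 1;                         // F(0), F(1): one task each
  for (int i = 2; i <= k; ++i)
    t[i] = t[i - 1] + t[i - 2] + 2;        // self + when_all + subtrees
  return t[k];
}
```

With this recurrence, F(21) yields 53,131 cumulative tasks and F(23) yields 139,102, matching the table below.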

                     F(21)      F(23)      Alloc/Dealloc (for comparison)
cumulative tasks:    53,131     139,102
Pascal               1.2 M/s    1.3 M/s    144 M/s
Broadwell            0.98 M/s   1.1 M/s    24 M/s
Knights Landing      0.30 M/s   0.31 M/s   21 M/s

SLIDE 13

Conclusion

✓ Initial dynamic Task-DAG capability
  § Portable: CPU and NVIDIA GPU architectures
  § Directed acyclic graph (DAG) of heterogeneous tasks
  § Dynamic: tasks may create tasks and dependences
  § Hierarchical: thread-team data parallelism within tasks

§ Challenges, primarily for GPU portability and performance
  § Non-blocking tasks -> respawn instead of wait
  § Memory pool for dynamically allocatable tasks
  § Map a task's thread team onto a GPU warp, with modest intra-team parallelism

SLIDE 14

Ongoing Research & Development

§ In progress / to be resolved

  § Work around a warp-divergence / fail-to-reconverge bug with CUDA 8 + Pascal
    § Known issue; NVIDIA will soon have a fix for us
    § Prevents task-team parallelism:
      § one thread per warp atomically pops a task from the DAG
      § the whole warp executes the task

§ In progress / to be done
  § Merge the Kokkos ThreadTeam and TaskTeam intra-team parallel capabilities
    § currently separate / redundant implementations
  § Performance evaluation & optimization
  § Performance evaluation with applications' algorithms
    § sparse matrix factorization, social-network triangle enumeration/analysis

§ ... stay tuned