
Kokkos Hierarchical Task-Data Parallelism for C++ HPC Applications - PowerPoint PPT Presentation



  1. Kokkos Hierarchical Task-Data Parallelism for C++ HPC Applications
     GPU Tech. Conference, May 8-11, 2017, San Jose, CA
     H. Carter Edwards
     SAND2017-4681 C
     Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

  2. Kokkos*: performance portability for C++ applications
     § Applications & Libraries: LAMMPS, EMPIRE, Albany, SPARC, Drekar, Trilinos
     [Diagram: Kokkos layered between applications/libraries and diverse node architectures: Multi-Core (DDR), Many-Core (HBM+DDR), APU, CPU+GPU]
     *κόκκος, Greek: "granule" or "grain"; like grains of sand on a beach

  3. Dynamic Directed Acyclic Graph (DAG) of Tasks
     § Parallel Pattern
       § Tasks: heterogeneous collection of parallel computations
       § DAG: tasks may have acyclic "execute after" dependences
       § Dynamic: new tasks may be created/allocated by executing tasks
     § Task Scheduler Responsibilities
       § Execute ready tasks
       § Choose from among ready tasks
       § Honor "execute after" dependences
       § Manage tasks' dynamic lifecycle

  4. Motivating Use Cases
     1. Incomplete Level-K Cholesky factorization of a sparse matrix
        § Block partitioning into submatrices
        § A given submatrix may or may not exist
        § DAG of submatrix computations (Chol, Trsm, Herk, Gemm)
        § Each submatrix computation is internally data parallel
        § Lead: Kyungjoo Kim / SNL
        [Figure: block sparsity pattern of the matrix and the resulting Chol / Trsm / Herk / Gemm task DAG]
     2. Triangle enumeration in social networks, highly irregular graphs
        § Discover triangles within the graph
        § Compute statistics on those triangles
        § Triangles are an intermediate result that does not need to be saved/stored
        Ø Problem: memory "high water mark"
        § Lead: Michael Wolf / SNL
        [Figure: example graph with an enumerated triangle]

  5. Hierarchical Parallelism
     § Shared functionality with hierarchical data-data parallelism
       § The same kernel (task) executed on ...
         § OpenMP: League of Teams of Threads
         § Cuda: Grid of Blocks of Threads
     § Intra-Team Parallelism (data or task)
       § Threads within a team execute concurrently
       § Data: each team executes the same computation
       Ø Task: each team executes a different task
       § Nested parallel patterns: parallel_for, parallel_reduce, parallel_scan
     § Mapping teams onto hardware
       § CPU: team == hyperthreads sharing an L1 cache
         § Requires a low degree of intra-team parallelism
       Ø Cuda: team == warp
         § Requires a modest degree of intra-team parallelism
         § A year ago: team == block, an infeasibly high degree of parallelism

  6. Anatomy and Life-cycle of a Task
     § Anatomy
       § Is a C++ closure (e.g., a functor) of data + function
       § Is referenced by a Kokkos::future
       § Executes on a single thread (serial task) or a thread team (task with internal data parallelism)
       § May only execute when its dependences are complete (DAG)
     § Life-cycle: constructing → waiting → executing → complete

  7. Dynamic Task DAG Challenges
     § A DAG of heterogeneous closures
       § Map execution to a single thread or a thread team
       § Manage memory of dynamically created and completed tasks
       § Manage the DAG with dynamically created and completed dependences
     § GPU: an executing task cannot block or yield to another task
       Ø Forced a beneficial reconceptualization! Non-blocking tasks
       § Eliminate context-switching overhead: stack, registers, ...
     § Portability and Performance
       § Heterogeneous function pointers (CPU, GPU)
       § Creating GPU tasks on the host and within tasks executing on the GPU
       § Bounded memory pool and scalable allocation/deallocation
       § Scalable DAG management and scheduling

  8. Managing a Non-blocking Task's Lifecycle
     [Diagram: create → constructing → spawn → waiting → executing → complete, with respawn looping executing back to waiting]
     § Create: allocate and construct
       § By the main process or within another task
       § Allocate from a memory pool
       § Construct internal data
       § Assign DAG dependences
     § Spawn: enqueue to scheduler
       § Assign DAG dependences
       § Assign priority: high, regular, low
     § Respawn: re-enqueue to scheduler
       § Replaces waiting or yielding
       § Assign new DAG dependences and/or priority
     § Reconceived wait-for-child-task pattern
       Ø Create & spawn child task(s)
       Ø Reassign DAG dependence(s) to new child task(s)
       Ø Respawn to execute again after child task(s) complete

  9. Task Scheduler and Memory Pool
     § Memory Pool
       § Large chunk of memory allocated in a Kokkos memory space
       § Allocate & deallocate small blocks of varying size within a parallel execution
       § Lock free, extremely low overhead
       § Tuning: min-alloc-size <= max-alloc-size <= superblock-size <= total-size
     § Task Scheduler
       § Uses the memory pool for tasks' memory
       § Ready queues (by priority) and waiting queues
         Ø Each queue is a simple linked list of tasks
         § A ready queue is the head of a linked list
         § Each task is the head of a linked list of "execute after" tasks
       § Limit updates to push/pop, implemented with atomic operations
       § "When all" is a non-executing task with a list of dependences

  10. Memory Pool Performance, as of April '17
      § Test Setup
        § 10 MB pool comprised of 153 64 KB superblocks; minimum block size 32 bytes
        § Allocations ranging between 32 and 128 bytes; average 80 bytes
        § [1] Allocate to N%; [2] cyclically deallocate & allocate between N and 2/3 N
        § parallel_for: every index allocates, then cyclically deallocates & allocates
        § Measure allocate + deallocate operations / second (best of 10 trials)
        § Deallocate is much simpler, with fewer operations, than allocate
      § Test Hardware: Pascal, Broadwell, Knights Landing
        § Fully subscribe cores
        § Every thread within every warp allocates & deallocates
      § For reference, an "apples to oranges" comparison
        § CUDA malloc / free on Pascal
        § jemalloc on Knights Landing

  11. Memory Pool Performance, as of April '17

                                 Fill 75%    Fill 95%     Cycle 75%   Cycle 95%
      blocks:                    938,500     1,187,500
      Pascal                     79 M/s      74 M/s       287 M/s     244 M/s
      Broadwell                  13 M/s      13 M/s       46 M/s      49 M/s
      Knights Landing            5.8 M/s     5.8 M/s      40 M/s      43 M/s

      "Apples to oranges" comparison:
      Pascal, CUDA malloc        3.5 M/s     2.9 M/s      15 M/s      12 M/s
      Knights Landing, jemalloc: 379 M/s (fill), 4115 M/s (cycle)
        (jemalloc uses thread-local caches and optimal blocking, NOT a fixed pool size)

      § Memory pools have a finite size with well-bounded scope
        § Algorithms' and data structures' memory pools do not pollute (fragment) each other's memory

  12. Scheduler Unit Test Performance, as of April '17
      § Test Setup: (silly) Fibonacci task-DAG algorithm
        § F(k) = F(k-1) + F(k-2)
        § If k >= 2, spawn F(k-1) and F(k-2), then respawn F(k) dependent on completion of when_all( { F(k-1), F(k-2) } )
        § F(k) cumulatively allocates/deallocates N tasks >> "high water mark"
        § 1 MB pool comprised of 31 32 KB superblocks; minimum block size 32 bytes
        § Fully subscribe cores; a single-thread Fibonacci task consumes an entire GPU warp
          § Real algorithms' tasks have modest internal parallelism
        § Measure tasks / second; compare to raw allocate + deallocate performance

                            F(21)       F(23)       Alloc/Dealloc (for comparison)
      cumulative tasks:     53,131      139,102
      Pascal                1.2 M/s     1.3 M/s     144 M/s
      Broadwell             0.98 M/s    1.1 M/s     24 M/s
      Knights Landing       0.30 M/s    0.31 M/s    21 M/s

  13. Conclusion
      ✓ Initial dynamic task-DAG capability
        § Portable: CPU and NVIDIA GPU architectures
        § Directed acyclic graph (DAG) of heterogeneous tasks
        § Dynamic: tasks may create tasks and dependences
        § Hierarchical: thread-team data parallelism within tasks
      § Challenges, primarily for GPU portability and performance
        § Non-blocking tasks → respawn instead of wait
        § Memory pool for dynamically allocatable tasks
        § Map a task's thread team onto a GPU warp; modest intra-team parallelism

  14. Ongoing Research & Development
      § In progress / to be resolved
        § Work around warp divergence / fail-to-reconverge bug with CUDA 8 + Pascal
          § Known issue; NVIDIA will soon have a fix for us
          § Prevents task-team parallelism:
            § one thread per warp atomically pops a task from the DAG
            § the whole warp executes the task
      § In progress / to be done
        § Merge Kokkos ThreadTeam and TaskTeam intra-team parallel capabilities
          § Currently separate / redundant implementations
        § Performance evaluation & optimization
        § Performance evaluation with applications' algorithms
          § sparse matrix factorization, social-network triangle enumeration/analysis
        § ... stay tuned

