
Kokkos Task-DAG: Memory Management and Locality Challenges Conquered
PADAL Workshop, August 2-4, 2017, Chicago, IL


1. Kokkos Task-DAG: Memory Management and Locality Challenges Conquered
PADAL Workshop, August 2-4, 2017, Chicago, IL
H. Carter Edwards
SAND2017-8173 C
Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

2. Kokkos*: performance portability for C++ applications
[Stack diagram: applications & libraries (LAMMPS, EMPIRE, Albany, SPARC, Drekar, Trilinos) run on Kokkos, which maps them onto diverse hardware: Multi-Core, Many-Core, APU, and CPU+GPU nodes with DDR and HBM memories.]
*κόκκος, Greek: "granule" or "grain"; like grains of sand on a beach

3. Dynamic Directed Acyclic Graph (DAG) of Tasks
§ Parallel pattern
  § Tasks: heterogeneous collection of parallel computations
  § DAG: tasks may have acyclic execute-after dependences
  § Dynamic: tasks are allocated by executing tasks and deallocated when complete
§ Task scheduler responsibilities
  § Execute ready tasks
  § Choose from among ready tasks
  § Honor "execute after" dependences
  § Manage tasks' dynamic lifecycle
  § Manage tasks' dynamic memory

4. Motivating Use Cases
§ 1. Multifrontal Cholesky factorization of a sparse matrix
  § Frontal matrices require different sizes of workspace for sub-assembly
  § Hybrid task parallelism: tree-parallel across the assembly tree & matrix-parallel within supernodes
  § Dynamic task-DAG with memory constraints
  § Matrix computation is internally data parallel
  § Lead: Kyungjoo Kim / SNL
  [Figure: sparsity pattern of a 9x9 example matrix with supernodes and the corresponding assembly tree; omitted]
§ 2. Triangle enumeration in social networks, highly irregular graphs
  § Discover triangles within the graph
  § Compute statistics on those triangles
  § Triangles are an intermediate result that does not need to be saved / stored
  Ø Challenge: memory "high water mark"
  § Lead: Michael Wolf / SNL

5. Hierarchical Parallelism
§ Shared functionality with hierarchical data parallelism: the same kernel (task) executed on
  § OpenMP: a league of teams of threads
  § Cuda: a grid of blocks of threads
§ Inter-team parallelism (data or task)
  § Data: each team executes the same computation
  Ø Task: each team executes a different task
§ Intra-team parallelism (data)
  § Threads within a team execute concurrently
  § Nested parallel patterns: parallel_for, parallel_reduce, parallel_scan
§ Mapping teams onto hardware
  § CPU: team == hyperthreads sharing an L1 cache
  § GPU: team == warp, for a modest degree of intra-team data parallelism
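The league/team split above maps directly onto Kokkos's nested parallel API. Below is a minimal sketch, with illustrative names and sizes not taken from the slides: an outer parallel_for over a league of teams (inter-team data parallelism) and a nested parallel_for over each team's threads (intra-team data parallelism).

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using policy_type = Kokkos::TeamPolicy<>;
    using member_type = policy_type::member_type;

    const int league_size = 64;   // number of teams (illustrative)
    const int chunk       = 1024; // entries owned by each team
    Kokkos::View<double*> x("x", league_size * chunk);

    // Inter-team parallelism: each team of the league takes one chunk.
    Kokkos::parallel_for("scale", policy_type(league_size, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type& team) {
        const int offset = team.league_rank() * chunk;
        // Intra-team parallelism: the team's threads execute concurrently.
        Kokkos::parallel_for(Kokkos::TeamThreadRange(team, chunk),
          [&](const int i) { x(offset + i) = 2.0 * x(offset + i); });
      });
  }
  Kokkos::finalize();
  return 0;
}
```

Kokkos chooses the team-to-hardware mapping per backend (Kokkos::AUTO picks a team size), so the same source exploits both levels of parallelism.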

6. Anatomy and Life-cycle of a Task
§ Anatomy
  § Is a C++ closure (e.g., a functor) of data + function
  § Is referenced by a Kokkos::future
  § Executes on a single thread (serial task) or on a thread team (task with internal data parallelism)
  § May only execute when its dependences are complete (DAG)
§ Life-cycle: constructing → waiting → executing → complete
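As a concrete illustration, here is a minimal sketch of such a closure and its future. Caveat: the Kokkos tasking interface was experimental and its spellings changed across releases; the names below (TaskScheduler, BasicFuture, host_spawn, TaskSingle, wait) follow later Kokkos unit tests and should be read as approximate, and the pool size is illustrative.

```cpp
#include <Kokkos_Core.hpp>

// A task is a C++ closure: data members plus an operator() that the
// scheduler invokes once all of the task's dependences are complete.
template <class Scheduler>
struct DoubleTask {
  using value_type = long;  // type carried by the task's future

  long n;  // internal data, captured at construction

  KOKKOS_INLINE_FUNCTION
  void operator()(typename Scheduler::member_type& /*member*/, long& result) {
    result = 2 * n;  // a serial task: runs on a single thread
  }
};

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using exec_space     = Kokkos::DefaultExecutionSpace;
    using scheduler_type = Kokkos::TaskScheduler<exec_space>;

    // The scheduler owns a memory pool from which tasks are allocated.
    scheduler_type sched(scheduler_type::memory_space(), 1u << 20);

    // Spawning returns a Kokkos future that references the task.
    Kokkos::BasicFuture<long, scheduler_type> f = Kokkos::host_spawn(
        Kokkos::TaskSingle(sched), DoubleTask<scheduler_type>{21});

    Kokkos::wait(sched);  // drain the task DAG
  }
  Kokkos::finalize();
  return 0;
}
```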

7. Dynamic Task DAG Challenges
§ A DAG of heterogeneous closures
§ Map execution to a single thread or a thread team
§ Scalable, low-latency scheduling
§ Scalable dynamically allocated / deallocated tasks
§ Scalable dynamically created and completed execute-after dependences
§ GPU idiosyncrasies
  Ø Non-blocking tasks: forced a beneficial reconceptualization!
  § Eliminate context-switching overhead: stack, registers, ...
  § Heterogeneous function pointers (CPU, GPU)
  § Creating GPU tasks on the host and within tasks executing on the GPU
  § Bounded memory pool and scalable allocation / deallocation
  § Non-coherent L1 caches

8. Managing a Non-blocking Task's Lifecycle
§ States: constructing → waiting → executing → complete
§ Create: allocate and construct
  § By the main process or within another task
  § Allocate from a memory pool
  § Construct internal data
  § Assign DAG dependences
§ Spawn: enqueue to the scheduler
  § Assign DAG dependences
  § Assign priority: high, regular, low
§ Respawn: re-enqueue to the scheduler
  § Replaces waiting or yielding
  § Assign new DAG dependences and/or priority
§ Reconceived wait-for-child-task pattern (sketched below)
  Ø Create & spawn child task(s)
  Ø Reassign DAG dependence(s) to the new child task(s)
  Ø Respawn to execute again after the child task(s) complete
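The reconceived pattern in the last bullet group is worth spelling out in code. A sketch under the same caveat as above (era-specific experimental API; task_spawn, when_all, respawn, and TaskPriority as they appear in Kokkos unit tests): on first execution the task spawns two children, aggregates them with a non-executing when_all, and respawns itself; on second execution it consumes the children's results, so it never blocks.

```cpp
// Sketch of the non-blocking wait-for-children pattern.
template <class Scheduler>
struct SumTask {
  using value_type  = long;
  using future_type = Kokkos::BasicFuture<long, Scheduler>;

  long n;
  future_type child1, child2;  // null until the first execution

  KOKKOS_INLINE_FUNCTION
  void operator()(typename Scheduler::member_type& member, long& result) {
    auto& sched = member.scheduler();
    if (n < 2) {
      result = n;  // leaf: nothing to wait for
    } else if (child1.is_null()) {
      // First execution: create & spawn children from the memory pool ...
      child1 = Kokkos::task_spawn(Kokkos::TaskSingle(sched), SumTask{n - 1});
      child2 = Kokkos::task_spawn(Kokkos::TaskSingle(sched), SumTask{n - 2});
      // ... reassign this task's DAG dependence to the new children ...
      Kokkos::BasicFuture<void, Scheduler> dep[] = {child1, child2};
      auto all = sched.when_all(dep, 2);
      // ... and respawn instead of waiting or yielding.
      Kokkos::respawn(this, all, Kokkos::TaskPriority::Regular);
    } else {
      // Second execution: children are complete; consume their results.
      result = child1.get() + child2.get();
    }
  }
};
```

This same shape reappears in the Fibonacci unit test of slide 12.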

9. Task Scheduler and Memory Pool
§ Memory pool
  § Large chunk of memory allocated in a Kokkos memory space
  § Allocate & deallocate small blocks of varying size within a parallel execution
  § Lock free, extremely low latency
  § Tuning: min-alloc-size <= max-alloc-size <= superblock-size <= total-size
§ Task scheduler
  § Uses the memory pool for tasks' memory
  § Ready queues (by priority) and waiting queues
  Ø Each queue is a simple linked list of tasks: a ready queue is the head of a linked list, and each task is the head of a linked list of its "execute after" tasks
  § Limit updates to push/pop, implemented with atomic operations
  § "When all" is a non-executing task with a list of dependences
  [Diagram: task blocks with per-task next/dep pointers and data forming the linked-list queues; omitted]
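A minimal sketch of the memory pool side, using Kokkos::MemoryPool with illustrative sizes that follow the slide's tuning rule (min-alloc-size <= max-alloc-size <= superblock-size <= total-size); the constructor argument order is per later Kokkos headers and should be treated as approximate:

```cpp
#include <Kokkos_Core.hpp>

using exec_space = Kokkos::DefaultExecutionSpace;

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // total >= superblock >= max block >= min block
    Kokkos::MemoryPool<exec_space> pool(exec_space::memory_space(),
                                        1u << 20,   // total size: 1 MB
                                        32,         // min block alloc size
                                        128,        // max block alloc size
                                        1u << 16);  // superblock size: 64 KB

    // allocate()/deallocate() are lock free and callable from inside a
    // parallel execution; deallocate must be given the allocation's size.
    Kokkos::parallel_for("touch", Kokkos::RangePolicy<exec_space>(0, 256),
      KOKKOS_LAMBDA(const int /*i*/) {
        void* p = pool.allocate(64);
        if (p != nullptr) pool.deallocate(p, 64);
      });
  }
  Kokkos::finalize();
  return 0;
}
```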

10. Memory Pool Performance
§ Test setup
  § 10 MB pool comprised of 153 x 64 KB superblocks, minimum block size 32 bytes
  § Allocations ranging between 32 and 128 bytes; average 80 bytes
  § [1] Allocate to N%; [2] cyclically deallocate & allocate between N and 2/3 N (sketched below)
  § parallel_for: every index allocates, then cyclically deallocates & allocates
  § Measure allocate + deallocate operations / second (best of 10 trials)
  § Deallocate is much simpler and needs fewer operations than allocate
§ Test hardware: Pascal, Broadwell, Knights Landing
  § Fully subscribe cores
  § Every thread within every warp allocates & deallocates
§ For reference, an "apples to oranges" comparison
  § CUDA malloc / free on Pascal
  § jemalloc on Knights Landing
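The cycle phase of such a test could look like the following hypothetical kernel (reusing pool and exec_space from the previous sketch; this shows the shape of the benchmark, not the actual test code). Each index holds one block and repeatedly deallocates and re-allocates it with sizes spanning 32 to 128 bytes, averaging 80:

```cpp
// Hypothetical cycle kernel: every index deallocates & allocates in a loop.
const int n_indices = 100000;  // illustrative
const int n_cycles  = 100;     // illustrative
Kokkos::parallel_for("cycle", Kokkos::RangePolicy<exec_space>(0, n_indices),
  KOKKOS_LAMBDA(const int i) {
    const size_t nbytes = 32 + 8 * (i % 13);  // 32..128 bytes, mean 80
    void* p = pool.allocate(nbytes);
    for (int c = 0; c < n_cycles; ++c) {
      if (p != nullptr) pool.deallocate(p, nbytes);
      p = pool.allocate(nbytes);  // may fail near the fill target
    }
    if (p != nullptr) pool.deallocate(p, nbytes);
  });
```

Timing the kernel and dividing total allocate + deallocate operations by elapsed seconds yields the M/s figures reported on the next slide.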

11. Memory Pool Performance

                              Fill 75%   Fill 95%    Cycle 75%   Cycle 95%
  blocks                      938,500    1,187,500
  Pascal                      79 M/s     74 M/s      287 M/s     244 M/s
  Broadwell                   13 M/s     13 M/s      46 M/s      49 M/s
  Knights Landing             5.8 M/s    5.8 M/s     40 M/s      43 M/s
  "Apples to oranges" comparison:
  Pascal (CUDA malloc)        3.5 M/s    2.9 M/s     15 M/s      12 M/s
  Knights Landing (jemalloc)  379 M/s                4115 M/s
  (jemalloc uses thread-local caches and optimal blocking, NOT a fixed pool size)

§ Memory pools have finite size with well-bounded scope
§ Algorithms' and data structures' memory pools do not pollute (fragment) each other's memory

12. Scheduler Unit Test Performance
§ Test setup: (silly) Fibonacci task-DAG algorithm
  § F(k) = F(k-1) + F(k-2)
  § If k >= 2: spawn F(k-1) and F(k-2), then respawn F(k) dependent on completion of when_all( { F(k-1), F(k-2) } )
  § F(k) cumulatively allocates/deallocates N tasks >> the "high water mark"
  § 1 MB pool comprised of 31 x 32 KB superblocks, minimum block size 32 bytes
  § Fully subscribe cores; a single-thread Fibonacci task consumes an entire GPU warp
  § Real algorithms' tasks have modest internal parallelism
  § Measure tasks / second; compare to raw allocate + deallocate performance

                     F(21)      F(23)      Alloc/Dealloc (for comparison)
  cumulative tasks:  53,131     139,102
  Pascal             1.2 M/s    1.3 M/s    144 M/s
  Broadwell          0.98 M/s   1.1 M/s    24 M/s
  Knights Landing    0.30 M/s   0.31 M/s   21 M/s
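Constructing a scheduler whose pool matches the test's shape (1 MB total, 32 KB superblocks, 32-byte minimum block) and spawning the root task might look like this sketch, reusing the SumTask pattern from the earlier sketch as the Fibonacci closure; the constructor argument order is an assumption based on later Kokkos headers:

```cpp
using exec_space     = Kokkos::DefaultExecutionSpace;
using scheduler_type = Kokkos::TaskScheduler<exec_space>;

// Pool shaped like the unit test: 1 MB total, min block 32 bytes,
// 32 KB superblocks (the max block size here is illustrative).
scheduler_type sched(scheduler_type::memory_space(),
                     1u << 20,   // total pool size: 1 MB
                     32,         // min block size
                     1u << 12,   // max block size (illustrative)
                     1u << 15);  // superblock size: 32 KB

// Root task F(21); executing tasks then spawn F(k-1), F(k-2) recursively.
auto f21 = Kokkos::host_spawn(Kokkos::TaskSingle(sched),
                              SumTask<scheduler_type>{21});
Kokkos::wait(sched);  // execute the dynamic task DAG to completion
```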

13. GPU Non-Coherent L1 Cache
§ Production and consumption of tasks
  § Create: allocate from the memory pool and construct the closure in that memory
  § Complete: destroy the closure and deallocate to the memory pool
  § Task memory is re-used as the dynamic task-DAG executes
§ "Race" consequence of the non-coherent L1 cache (global memory pool shared by SM0 and SM1, each with its own L1 cache):
  [1] SM0 executes & completes task-A
  [2] SM0 deallocates task-A's block to the pool
  [3] SM1 allocates the same block for task-B
  [4] SM1 constructs task-B (steps [3-4] leave SM0's L1 untouched)
  [5] SM1 push-queue; SM0 pop-queue
  [6] SM0 executes task-??, reading stale task-A bytes from its L1

14. GPU Non-Coherent L1 Cache: Conquered
§ Options:
  § Mark all user task code with the "volatile" qualifier to bypass the L1 cache (CUDA)
    § Extremely annoying to users: ugly, and degrades performance
  Ø Manage memory motion through GPU shared memory (a.k.a. explicit L1)
    Ø Transparent to user code and retains L1 performance
§ Revised flow (global memory pool shared by SM0 and SM1, each staging through explicit shared memory):
  [1] SM0 executes & completes task-A
  [2] SM0 deallocates task-A's block to the pool
  [3] SM1 allocates the block for task-B
  [4.1] SM1 constructs task-B in shared memory; [4.2] copies it to the pool block; [4.3] push-queue
  [5.1] SM0 pop-queue; [5.2] copies task-B into shared memory
  [6] SM0 executes task-B

15. Tacho's Sparse Cholesky Factorization
§ Multifrontal algorithm with a bounded memory constraint
§ Kokkos task DAG + Kokkos memory pool for shared scratch memory
§ Task fails allocation => respawn to try again after other tasks deallocate (sketched below)
§ Test setup: scratch memory size = M * sparse matrix supernode size
§ Compare to Intel's Pardiso; sparse matrix N = 57k, NNZ = 383k, 6662 supernodes
[Charts: factorizations / minute and peak memory (MB) vs. # threads, on Haswell (2x16x2) and Knights Landing (1x68x4); series: pardiso and tacho with M = 4, 8, 16]
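The failed-allocation strategy from the second bullet can be sketched as a task fragment (hypothetical; not Tacho's actual code, and under the same era-specific API caveat as the earlier sketches):

```cpp
// A task needing pool scratch memory; if the pool is exhausted it
// respawns itself at low priority to retry after other tasks deallocate.
template <class Scheduler>
struct FactorTask {
  using value_type = void;

  Kokkos::MemoryPool<typename Scheduler::execution_space> pool;
  size_t scratch_bytes;  // = M * supernode size in the slide's setup

  KOKKOS_INLINE_FUNCTION
  void operator()(typename Scheduler::member_type& member) {
    void* scratch = pool.allocate(scratch_bytes);
    if (scratch == nullptr) {
      // Allocation failed: re-enqueue and try again later
      // (respawn-with-scheduler overload assumed per the era's unit tests).
      Kokkos::respawn(this, member.scheduler(), Kokkos::TaskPriority::Low);
      return;
    }
    // ... factor this frontal matrix using 'scratch' ...
    pool.deallocate(scratch, scratch_bytes);
  }
};
```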
