Kokkos Hierarchical Task-Data Parallelism



SLIDE 1


Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

Kokkos Hierarchical Task-Data Parallelism for C++ HPC Applications

GPU Tech. Conference

May 8-11, 2017 San Jose, CA

  • H. Carter Edwards

SAND2017-4681 C

SLIDE 2

[Figure: node architectures (Multi-Core, Many-Core, APU, CPU+GPU), each with DDR and/or HBM memory]

Applications & Libraries: Drekar, Trilinos, SPARC, Albany, EMPIRE, LAMMPS

Kokkos*: performance portability for C++ applications

*κόκκος (Greek: "granule" or "grain"; like grains of sand on a beach)

SLIDE 3

Dynamic Directed Acyclic Graph (DAG) of Tasks

§ Parallel Pattern
  § Tasks: heterogeneous collection of parallel computations
  § DAG: tasks may have acyclic "execute after" dependences
  § Dynamic: new tasks may be created/allocated by executing tasks

§ Task Scheduler Responsibilities
  § Execute ready tasks
  § Choose from among ready tasks
  § Honor "execute after" dependences
  § Manage tasks' dynamic lifecycle
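The scheduler responsibilities above can be sketched with a dependency-counting executor: each task records how many unfinished predecessors it has, and it enters the ready queue only when that count reaches zero. This is a serial toy sketch (`TaskDag` and its methods are hypothetical names, not Kokkos API), but it captures the ready-queue / "execute after" mechanics:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Toy task-DAG executor: a task becomes ready once every predecessor
// has finished. Serial stand-in for a parallel scheduler.
struct TaskDag {
  struct Task {
    std::function<void()> work;
    std::vector<int> successors;   // tasks that "execute after" this one
    int unfinished_preds = 0;
  };
  std::vector<Task> tasks;

  int create(std::function<void()> work) {
    tasks.push_back({std::move(work), {}, 0});
    return static_cast<int>(tasks.size()) - 1;
  }
  // Declare that task b executes after task a.
  void add_dependence(int a, int b) {
    tasks[a].successors.push_back(b);
    ++tasks[b].unfinished_preds;
  }
  // Run every task, honoring dependences; returns the execution order.
  std::vector<int> run() {
    std::queue<int> ready;
    for (int i = 0; i < static_cast<int>(tasks.size()); ++i)
      if (tasks[i].unfinished_preds == 0) ready.push(i);
    std::vector<int> order;
    while (!ready.empty()) {
      int t = ready.front(); ready.pop();
      tasks[t].work();
      order.push_back(t);
      // Completing t may make successors ready.
      for (int s : tasks[t].successors)
        if (--tasks[s].unfinished_preds == 0) ready.push(s);
    }
    return order;
  }
};
```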

SLIDE 4

Motivating Use Cases

  • 1. Incomplete Level-K Cholesky factorization of a sparse matrix
    § Block partitioning into submatrices
    § A given submatrix may or may not exist
    § DAG of submatrix computations
    § Each submatrix computation is internally data parallel
    § Lead: Kyungjoo Kim / SNL

  • 2. Triangle enumeration in social networks (highly irregular graphs)
    § Discover triangles within the graph
    § Compute statistics on those triangles
    § Triangles are an intermediate result that need not be saved / stored
    Ø Problem: memory "high water mark"
    § Lead: Michael Wolf / SNL
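The key point of the triangle use case is that triangles are consumed as they are discovered rather than stored, which keeps the memory high-water mark low. A minimal serial illustration of that idea (sorted-adjacency intersection; `count_triangles` is a hypothetical helper, not the SNL implementation) accumulates a count instead of materializing the triangle list:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Count triangles in an undirected graph without storing them.
// Vertices are 0..n-1; each adjacency list adj[v] must be sorted.
// For each edge (u,v) with u < v, count common neighbors w > v, so each
// triangle u < v < w is discovered exactly once.
std::int64_t count_triangles(const std::vector<std::vector<int>>& adj) {
  std::int64_t triangles = 0;
  const int n = static_cast<int>(adj.size());
  for (int u = 0; u < n; ++u)
    for (int v : adj[u]) {
      if (v <= u) continue;  // visit each edge once, with u < v
      // Intersect adj[u] and adj[v], keeping only neighbors w > v.
      auto iu = std::lower_bound(adj[u].begin(), adj[u].end(), v + 1);
      auto iv = std::lower_bound(adj[v].begin(), adj[v].end(), v + 1);
      while (iu != adj[u].end() && iv != adj[v].end()) {
        if (*iu < *iv) ++iu;
        else if (*iv < *iu) ++iv;
        else { ++triangles; ++iu; ++iv; }  // update statistic, discard triangle
      }
    }
  return triangles;
}
```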

[Figures: level-k sparsity pattern of the block-partitioned matrix, and the resulting DAG of Chol / Trsm / Herk / Gemm submatrix tasks]

SLIDE 5

Hierarchical Parallelism

§ Shared functionality with hierarchical data-data parallelism
  § The same kernel (task) executed on ...
  § OpenMP: league of teams of threads
  § Cuda: grid of blocks of threads

§ Intra-team parallelism (data or task)
  § Threads within a team execute concurrently
  § Data: each team executes the same computation
  Ø Task: each team executes a different task
  § Nested parallel patterns: for, reduce, scan

§ Mapping teams onto hardware
  § CPU: team == hyperthreads sharing an L1 cache
    § Requires a low degree of intra-team parallelism
  Ø Cuda: team == warp
    § Requires a modest degree of intra-team parallelism
    § A year ago: team == block, an infeasibly high degree of parallelism
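The league/team/thread decomposition can be made concrete with plain serial loops standing in for the parallel levels. `hierarchical_sum` is a hypothetical name, and the loops only model the indexing (league = outer level, team threads = inner level), not actual concurrent execution:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Two-level (league-of-teams-of-threads) sum reduction, written as plain
// serial loops so the index mapping is explicit. In OpenMP terms the outer
// loop is the league of teams; in CUDA terms, the grid of blocks.
double hierarchical_sum(const std::vector<double>& x,
                        int league_size, int team_size) {
  double total = 0.0;
  for (int team = 0; team < league_size; ++team) {    // the league of teams
    double team_sum = 0.0;                            // team-local reduction
    for (int lane = 0; lane < team_size; ++lane) {    // threads in one team
      // Strided so every (team, lane) pair owns a disjoint slice of x.
      for (std::size_t i = std::size_t(team) * team_size + lane;
           i < x.size(); i += std::size_t(league_size) * team_size)
        team_sum += x[i];
    }
    total += team_sum;                                // inter-team combine
  }
  return total;
}
```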

SLIDE 6

Anatomy and Life-cycle of a Task

§ Anatomy
  § Is a C++ closure (e.g., a functor) of data + function
  § Is referenced by a Kokkos::future
  § Executes on a single thread or a thread team
  § May only execute when its dependences are complete (DAG)

§ Life-cycle: constructing -> waiting -> executing -> complete
  § A task with internal data parallelism executes on a thread team
  § A serial task executes on a single thread
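The life-cycle can be written down as a small state machine. `TaskState` and `valid_transition` are hypothetical names for illustration; the respawn transition (executing back to waiting) is the pattern described later in the deck:

```cpp
#include <cassert>

// Toy model of the task life-cycle: constructing -> waiting -> executing
// -> complete, where a respawn sends an executing task back to waiting.
enum class TaskState { Constructing, Waiting, Executing, Complete };

// Returns true when `from -> to` is one of the legal transitions.
bool valid_transition(TaskState from, TaskState to) {
  switch (from) {
    case TaskState::Constructing:
      return to == TaskState::Waiting;            // spawn
    case TaskState::Waiting:
      return to == TaskState::Executing;          // dependences complete
    case TaskState::Executing:
      return to == TaskState::Complete            // finished
          || to == TaskState::Waiting;            // respawn
    case TaskState::Complete:
      return false;                               // terminal state
  }
  return false;
}
```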

SLIDE 7

Dynamic Task DAG Challenges

§ A DAG of heterogeneous closures
  § Map execution to a single thread or a thread team
  § Manage memory of dynamically created and completed tasks
  § Manage a DAG with dynamically created and completed dependences

§ GPU: an executing task cannot block or yield to another task
  Ø Forced a beneficial reconceptualization! Non-blocking tasks
  § Eliminates context-switching overhead: stack, registers, ...

§ Portability and Performance
  § Heterogeneous function pointers (CPU, GPU)
  § Creating GPU tasks on the host and within tasks executing on the GPU
  § Bounded memory pool with scalable allocation/deallocation
  § Scalable DAG management and scheduling

SLIDE 8

Managing a Non-blocking Task’s Lifecycle

§ Create: allocate and construct
  § By the main process or within another task
  § Allocate from a memory pool
  § Construct internal data
  § Assign DAG dependences

§ Spawn: enqueue to the scheduler
  § Assign DAG dependences
  § Assign priority: high, regular, low

§ Respawn: re-enqueue to the scheduler
  § Replaces waiting or yielding
  § Assign new DAG dependences and/or priority
  § Reconceived wait-for-child-task pattern:
    Ø Create & spawn child task(s)
    Ø Reassign DAG dependence(s) to the new child task(s)
    Ø Respawn to execute again after the child task(s) complete

[Diagram: life-cycle states constructing, waiting, executing, complete, driven by create, spawn, and respawn]
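The reconceived wait-for-child pattern can be modeled with a toy single-threaded scheduler in which no task ever blocks: a Fibonacci task spawns its children, marks itself as pending on them, and is re-enqueued by the last child to finish. All names here (`FibTask`, `Scheduler`) are hypothetical; this sketches the respawn idea, not the Kokkos scheduler:

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <vector>

// A task in the toy respawn scheduler. phase 0: spawn children and
// "respawn" (become pending on them); phase 1: combine child results.
struct FibTask {
  long n = 0;
  int phase = 0;
  long result = 0;
  int pending = 0;                 // unfinished children
  FibTask* parent = nullptr;
  std::vector<std::unique_ptr<FibTask>> children;
};

struct Scheduler {
  std::queue<FibTask*> ready;
  void spawn(FibTask* t) { ready.push(t); }
  void run() {
    while (!ready.empty()) {
      FibTask* t = ready.front(); ready.pop();
      if (t->phase == 0 && t->n >= 2) {
        // Create & spawn children; respawn self. The task re-enters the
        // ready queue only when both children have completed -- it never
        // blocks or yields while holding execution resources.
        t->phase = 1;
        t->pending = 2;
        for (long m : {t->n - 1, t->n - 2}) {
          auto c = std::make_unique<FibTask>();
          c->n = m;
          c->parent = t;
          spawn(c.get());
          t->children.push_back(std::move(c));
        }
      } else {
        // Base case, or second execution after the children finished.
        t->result = (t->n < 2)
            ? t->n
            : t->children[0]->result + t->children[1]->result;
        if (t->parent && --t->parent->pending == 0)
          spawn(t->parent);        // last child re-enqueues the parent
      }
    }
  }
};
```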

SLIDE 9

Task Scheduler and Memory Pool

§ Memory Pool

  § A large chunk of memory allocated in a Kokkos memory space
  § Allocate & deallocate small blocks of varying size within a parallel execution
  § Lock free, extremely low overhead
  § Tuning: min-alloc-size <= max-alloc-size <= superblock-size <= total-size
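A drastically simplified sketch of a lock-free pool shows how allocation can proceed with compare-and-swap instead of locks. Here one superblock of up to 64 fixed-size blocks is tracked by a single atomic bitmask; the real Kokkos memory pool is far more general (many superblocks, varying block sizes), and `BlockPool` is a hypothetical name:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal lock-free fixed-size-block pool: a 64-bit atomic bitmask tracks
// which blocks are in use. allocate() claims the first clear bit with a
// compare-and-swap; deallocate() clears the bit.
class BlockPool {
  std::vector<std::byte> storage_;
  std::size_t block_size_;
  std::atomic<std::uint64_t> used_{0};
public:
  BlockPool(std::size_t block_size, unsigned num_blocks /* <= 64 */)
    : storage_(block_size * num_blocks), block_size_(block_size) {}

  void* allocate() {
    std::uint64_t mask = used_.load();
    for (;;) {
      unsigned bit = 0;
      while (bit < 64 && ((mask >> bit) & 1u)) ++bit;       // first free block
      if (bit * block_size_ >= storage_.size()) return nullptr;  // pool full
      if (used_.compare_exchange_weak(mask, mask | (1ull << bit)))
        return storage_.data() + bit * block_size_;
      // CAS failed: another thread raced us; mask was refreshed, retry.
    }
  }
  void deallocate(void* p) {
    auto bit = (static_cast<std::byte*>(p) - storage_.data()) / block_size_;
    used_.fetch_and(~(1ull << bit));
  }
  unsigned in_use() const {
    std::uint64_t m = used_.load();
    unsigned n = 0;
    while (m) { n += static_cast<unsigned>(m & 1u); m >>= 1; }
    return n;
  }
};
```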

§ Task Scheduler

  § Uses the memory pool for tasks' memory
  § Ready queues (by priority) and waiting queues
    Ø Each queue is a simple linked list of tasks
    § A ready queue is the head of a linked list
    § Each task is the head of a linked list of its "execute after" tasks
  § Limit updates to push/pop, implemented with atomic operations
  § "When all" is a non-executing task with a list of dependences

[Diagram: tasks linked by "next" pointers within a queue and by DAG pointers to dependent tasks]
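The "limit updates to push/pop, implemented with atomic operations" point can be illustrated with a Treiber-stack style linked list of tasks. This sketch (hypothetical `ReadyStack` type) deliberately ignores ABA hazards and memory reclamation, which a production scheduler must handle:

```cpp
#include <atomic>
#include <cassert>

// A task node that lives inside an intrusive linked list.
struct TaskNode {
  int id;
  TaskNode* next = nullptr;
};

// Ready queue as an atomic singly linked list: the only mutations are
// push and pop at the head, each a single compare-and-swap loop.
class ReadyStack {
  std::atomic<TaskNode*> head_{nullptr};
public:
  void push(TaskNode* t) {
    t->next = head_.load();
    // Retry until no other thread changed head between the read and the swap.
    while (!head_.compare_exchange_weak(t->next, t)) {}
  }
  TaskNode* pop() {
    TaskNode* t = head_.load();
    while (t && !head_.compare_exchange_weak(t, t->next)) {}
    return t;   // nullptr means no ready task
  }
};
```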

SLIDE 10

Memory Pool Performance, as of April’17

§ Test Setup

  § 10 MB pool comprised of 153 x 64 KB superblocks, minimum block size 32 bytes
  § Allocations ranging between 32 and 128 bytes; average 80 bytes
  § [1] Allocate to N%; [2] cyclically deallocate & allocate between N% and 2/3 N%
  § parallel_for: every index allocates, then cyclically deallocates & allocates
  § Measure allocate + deallocate operations per second (best of 10 trials)

§ Deallocation is much simpler, with fewer operations, than allocation

§ Test Hardware: Pascal, Broadwell, Knights Landing

  § Fully subscribe cores
  § Every thread within every warp allocates & deallocates

§ For reference, an "apples to oranges" comparison
  § CUDA malloc / free on Pascal
  § jemalloc on Knights Landing

SLIDE 11

Memory Pool Performance, as of April’17

§ Memory pools have finite size with well-bounded scope

§ Algorithms’ and data structures’ memory pools do not pollute (fragment) each other’s memory

                           Fill 75%   Fill 95%    Cycle 75%   Cycle 95%
blocks:                    938,500    1,187,500
Pascal                     79 M/s     74 M/s      287 M/s     244 M/s
Broadwell                  13 M/s     13 M/s      46 M/s      49 M/s
Knights Landing            5.8 M/s    5.8 M/s     40 M/s      43 M/s

Apples-to-oranges comparison:
Pascal using CUDA malloc   3.5 M/s    2.9 M/s     15 M/s      12 M/s
Knights Landing using jemalloc: 379 M/s, 4115 M/s
  (thread-local caches, optimal blocking, NOT a fixed pool size)

SLIDE 12

Scheduler Unit Test Performance, as of April’17

§ Test Setup, (silly) Fibonacci task-dag algorithm

  § F(k) = F(k-1) + F(k-2)
  § If k >= 2, spawn F(k-1) and F(k-2), then respawn F(k) dependent on completion of when_all( { F(k-1), F(k-2) } )
  § F(k) cumulatively allocates/deallocates N tasks >> the "high water mark"
  § 1 MB pool comprised of 31 x 32 KB superblocks, minimum block size 32 bytes
  § Fully subscribe cores; a single-thread Fibonacci task consumes an entire GPU warp
    § Real algorithms' tasks have modest internal parallelism
  § Measure tasks per second; compare to raw allocate + deallocate performance
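The "cumulative tasks" figures can be cross-checked: one reading that reproduces the slide's numbers exactly is that each F(k) with k >= 2 contributes itself plus one when_all aggregate task, giving tasks(k) = tasks(k-1) + tasks(k-2) + 2 with tasks(0) = tasks(1) = 1. `fib_task_count` is a hypothetical helper name for this sketch:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Cumulative task count for the Fibonacci task-DAG test: each F(k), k >= 2,
// accounts for itself, its when_all aggregate, and both child subtrees.
std::int64_t fib_task_count(int k) {
  std::vector<std::int64_t> t(static_cast<std::size_t>(std::max(k + 1, 2)));
  t[0] = t[1] = 1;                         // F(0), F(1): one task each
  for (int i = 2; i <= k; ++i)
    t[i] = t[i - 1] + t[i - 2] + 2;        // self + when_all + subtrees
  return t[k];
}
```

With this recurrence, F(21) yields 53,131 cumulative tasks and F(23) yields 139,102, matching the table below.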

                     F(21)      F(23)      Alloc/Dealloc (for comparison)
cumulative tasks:    53,131     139,102
Pascal               1.2 M/s    1.3 M/s    144 M/s
Broadwell            0.98 M/s   1.1 M/s    24 M/s
Knights Landing      0.30 M/s   0.31 M/s   21 M/s

SLIDE 13

Conclusion

✓ Initial dynamic Task-DAG capability
  § Portable: CPU and NVIDIA GPU architectures
  § Directed acyclic graph (DAG) of heterogeneous tasks
  § Dynamic: tasks may create tasks and dependences
  § Hierarchical: thread-team data parallelism within tasks

§ Challenges, primarily for GPU portability and performance
  § Non-blocking tasks -> respawn instead of wait
  § Memory pool for dynamically allocatable tasks
  § Map a task's thread team onto a GPU warp, with modest intra-team parallelism

SLIDE 14

Ongoing Research & Development

§ In progress / to be resolved

  § Work around a warp-divergence / fail-to-reconverge bug with CUDA 8 + Pascal
    § Known issue; NVIDIA will soon have a fix for us
    § Prevents task-team parallelism:
      § one thread per warp atomically pops a task from the DAG
      § the whole warp executes the task

§ In progress / to be done
  § Merge the Kokkos ThreadTeam and TaskTeam intra-team parallel capabilities
    § currently separate / redundant implementations
  § Performance evaluation & optimization
  § Performance evaluation with applications' algorithms
    § sparse matrix factorization, social-network triangle enumeration/analysis

§ ... stay tuned