
Composing multiple StarPU applications over heterogeneous machines: a supervised approach
Andra Hugo, with Abdou Guermouche, Pierre-André Wacrenier and Raymond Namyst
Inria, LaBRI, University of Bordeaux


  1. Composing multiple StarPU applications over heterogeneous machines: a supervised approach. Andra Hugo, with Abdou Guermouche, Pierre-André Wacrenier and Raymond Namyst. Inria, LaBRI, University of Bordeaux. RUNTIME group, INRIA Bordeaux Sud-Ouest.

  2. The increasing role of runtime systems: code reusability
  • Many HPC applications rely on specific parallel libraries: linear algebra, FFT, stencils.
  • Efficient implementations sit on top of dynamic runtime systems (OpenMP, Cilk, Intel TBB, StarSs, KAAPI, StarPU, Charm++, DAGuE, Harmony, Anthill, Qilin), both to deal with hybrid, multicore, complex hardware and to avoid reinventing the wheel. E.g. MKL/OpenMP, MAGMA/StarPU.
  • Some applications may benefit from relying on multiple libraries, potentially using different underlying runtime systems... but what about the performance of the application?

  3. Struggle for resources: interference between parallel libraries
  • Parallel libraries typically allocate and bind one thread per core (a minimal sketch of this pattern follows this slide).
  • Problems: resource over-subscription, resource under-subscription.
  • Current solutions: stand-alone allocation, hand-made allocation.
  • Examples: sparse direct solvers (e.g. qr_mumps), code coupling (multi-physics, multi-scale), etc.
  [Figure: two libraries each claiming CPUs 1-4 and the GPU of the same machine]
  => The composability problem.
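To make the over-subscription problem concrete, here is a minimal, self-contained sketch (not from the slides) of what happens when two libraries in the same process each follow the "one pinned thread per core" recipe; library_init and worker are hypothetical stand-ins for any parallel runtime's initialization code.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    /* What a typical parallel library does at startup: spawn one worker
       per core and pin it there. Two libraries doing this in the same
       process end up with 2 x ncores pinned threads fighting per core. */
    static void *worker(void *arg)
    {
        long core = (long)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        /* ... library-specific work loop would run here ... */
        return NULL;
    }

    static void library_init(pthread_t *workers, long ncores)
    {
        for (long c = 0; c < ncores; c++)
            pthread_create(&workers[c], NULL, worker, (void *)c);
    }

    int main(void)
    {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        if (ncores > 64) ncores = 64;   /* keep the sketch's arrays simple */
        pthread_t lib_a[64], lib_b[64];

        library_init(lib_a, ncores);    /* e.g. an OpenMP-based solver */
        library_init(lib_b, ncores);    /* e.g. a StarPU-based library */
        /* Each core now hosts two competing pinned workers. */
        printf("%ld threads pinned on %ld cores\n", 2 * ncores, ncores);

        for (long c = 0; c < ncores; c++) {
            pthread_join(lib_a[c], NULL);
            pthread_join(lib_b[c], NULL);
        }
        return 0;
    }

Compiled with -pthread, this prints, for instance, "24 threads pinned on 12 cores": exactly the interference the talk sets out to eliminate.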

  4. The composability problem: how to deal with it?
  • Advanced environments allow partitioning of hardware resources:
    - Intel TBB: the pool of workers is split into arenas.
    - Lithe: a resource-sharing management interface; harts are transferred between parallel libraries.
  • Main challenge: automatically adjusting the amount of resources allocated to each library.

  5. Our approach: scheduling contexts, toward code composability
  • Isolate concurrent parallel codes in separate contexts (context A, context B), each pushing its tasks to its own subset of the CPU and GPU workers.
  • Similar to lightweight virtual machines.
  • Contexts may expand and shrink: a hypervised approach in which a hypervisor resizes contexts and shares resources, to maximize overall throughput, using dynamic feedback from both the application and the runtime.

  6. Tackle the composability problem: a runtime system to validate our proposal
  • Scheduling contexts, to isolate parallel codes.
  • The hypervisor, to (re)size scheduling contexts.

  7. Using StarPU as an experimental platform: a runtime system for *PU architectures, used here to study resource negotiation
  • The StarPU runtime system dynamically schedules tasks (e.g. A = A + B) on all processing units, exposing the machine as a pool of heterogeneous processing units.
  • It avoids unnecessary data transfers between accelerators: a software VSM (virtual shared memory) for heterogeneous machines.
  [Figure: CPUs and GPUs with their respective memories; copies of A and B migrate between host and GPU memory]
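To make the VSM idea concrete, here is a minimal sketch (not from the slides) of how an application hands a buffer over to StarPU's data-management layer, which then tracks where valid copies live and performs transfers on demand. It assumes the StarPU 1.x API (starpu_vector_data_register, starpu_data_acquire, STARPU_MAIN_RAM); older releases spell some of these differently.

    #include <starpu.h>
    #include <stdio.h>

    int main(void)
    {
        float vec[1024];
        for (int i = 0; i < 1024; i++)
            vec[i] = 1.0f;

        if (starpu_init(NULL) != 0)
            return 1;

        /* Register the buffer: StarPU's software VSM now owns it and will
           replicate/transfer it between host and GPU memory as the tasks
           that access it require. */
        starpu_data_handle_t handle;
        starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                    (uintptr_t)vec, 1024, sizeof(vec[0]));

        /* ... submit tasks accessing `handle` here ... */

        /* To touch the data from the main program again, acquire it in the
           desired mode; StarPU first brings back a coherent copy. */
        starpu_data_acquire(handle, STARPU_R);
        printf("vec[0] = %f\n", vec[0]);
        starpu_data_release(handle);

        starpu_data_unregister(handle);
        starpu_shutdown();
        return 0;
    }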

  8. Overview of StarPU: maximizing PU occupancy, minimizing data transfers
  • Sits between HPC applications (parallel libraries, parallel compilers) and the drivers (CUDA, OpenCL) of the CPU, GPU and MIC processing units.
  • Accepts tasks that may have multiple implementations (cpu, gpu, spu), with potential inter-dependencies; this leads to a directed acyclic graph of tasks.
  • Data-flow approach: a task is declared as f(A_RW, B_R, C_R), i.e. with the access mode of each piece of data it touches.
  • Open, general-purpose scheduling platform: scheduling policies are plugins.
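As an illustration of this task model, below is a hedged sketch of a complete StarPU program: a codelet with one CPU implementation (a CUDA variant would simply be listed in its cuda_funcs field), submitted as a task that accesses one vector in read-write mode. Kernel and variable names (scal_cpu, scal_cl) are illustrative, and the API shown is the StarPU 1.x one.

    #include <starpu.h>
    #include <stdio.h>

    /* CPU implementation of the kernel: scale a vector in place. */
    static void scal_cpu(void *buffers[], void *cl_arg)
    {
        struct starpu_vector_interface *v = buffers[0];
        float *x = (float *)STARPU_VECTOR_GET_PTR(v);
        unsigned n = STARPU_VECTOR_GET_NX(v);
        float factor = *(float *)cl_arg;
        for (unsigned i = 0; i < n; i++)
            x[i] *= factor;
    }

    /* A codelet groups the implementations of one task type; a GPU
       version would be added in .cuda_funcs. */
    static struct starpu_codelet scal_cl =
    {
        .cpu_funcs = { scal_cpu },
        .nbuffers  = 1,
        .modes     = { STARPU_RW },   /* data-flow access mode */
    };

    int main(void)
    {
        float vec[1024], factor = 2.0f;
        for (int i = 0; i < 1024; i++)
            vec[i] = 1.0f;

        if (starpu_init(NULL) != 0)
            return 1;

        starpu_data_handle_t handle;
        starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                    (uintptr_t)vec, 1024, sizeof(vec[0]));

        struct starpu_task *task = starpu_task_create();
        task->cl = &scal_cl;
        task->handles[0] = handle;
        task->cl_arg = &factor;
        task->cl_arg_size = sizeof(factor);
        starpu_task_submit(task);        /* becomes a node in the task DAG */

        starpu_task_wait_for_all();
        starpu_data_unregister(handle);  /* vec now holds the result */
        starpu_shutdown();
        printf("vec[0] = %f\n", vec[0]);
        return 0;
    }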

  9. Task scheduling: how does it work?
  • When a task is submitted, it first goes into a pool of "frozen tasks" until all of its dependencies are met.
  • The task is then "pushed" to the scheduler.
  • Idle processing units actively poll for work ("pop").
  • What happens inside the scheduler is up to you! Examples: mct, work stealing, eager, priority.
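Since policies are plugins, adding one amounts to filling in a struct starpu_sched_policy with push/pop callbacks. The skeleton below is a hedged sketch modeled on the dummy FIFO scheduler shipped with StarPU; it assumes the StarPU 1.x starpu_sched_ctx_set_policy_data/starpu_sched_ctx_get_policy_data helpers, and a production policy would block idle workers on a condition variable rather than letting pop return NULL.

    #include <starpu.h>
    #include <starpu_scheduler.h>
    #include <pthread.h>
    #include <stdlib.h>

    /* Per-context state: a FIFO of ready tasks. */
    struct fifo_data
    {
        struct starpu_task_list list;
        pthread_mutex_t mutex;
    };

    static void fifo_init(unsigned sched_ctx_id)
    {
        struct fifo_data *d = malloc(sizeof(*d));
        starpu_task_list_init(&d->list);
        pthread_mutex_init(&d->mutex, NULL);
        starpu_sched_ctx_set_policy_data(sched_ctx_id, d);
    }

    static void fifo_deinit(unsigned sched_ctx_id)
    {
        struct fifo_data *d = starpu_sched_ctx_get_policy_data(sched_ctx_id);
        pthread_mutex_destroy(&d->mutex);
        free(d);
    }

    /* "Push": called once a task's dependencies are met. */
    static int fifo_push(struct starpu_task *task)
    {
        struct fifo_data *d = starpu_sched_ctx_get_policy_data(task->sched_ctx);
        pthread_mutex_lock(&d->mutex);
        starpu_task_list_push_back(&d->list, task);
        pthread_mutex_unlock(&d->mutex);
        return 0;
    }

    /* "Pop": called by idle workers polling for work. */
    static struct starpu_task *fifo_pop(unsigned sched_ctx_id)
    {
        struct fifo_data *d = starpu_sched_ctx_get_policy_data(sched_ctx_id);
        struct starpu_task *task = NULL;
        pthread_mutex_lock(&d->mutex);
        if (!starpu_task_list_empty(&d->list))
            task = starpu_task_list_pop_front(&d->list);
        pthread_mutex_unlock(&d->mutex);
        return task;   /* NULL: the worker will poll again */
    }

    /* Plugged in via starpu_conf.sched_policy, or per scheduling context. */
    struct starpu_sched_policy fifo_policy =
    {
        .init_sched         = fifo_init,
        .deinit_sched       = fifo_deinit,
        .push_task          = fifo_push,
        .pop_task           = fifo_pop,
        .policy_name        = "fifo_sketch",
        .policy_description = "minimal FIFO policy sketch",
    };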

  10. Tackle the composability problem (outline revisited): a runtime system to validate our proposal
  • Scheduling contexts, to isolate parallel codes.
  • The hypervisor, to (re)size scheduling contexts.

  11. Scheduling contexts in StarPU: an extension of StarPU
  • "Virtual" StarPU machines: each features its own scheduler, minimizes interference and enforces data locality.
  • Allocation of resources:
    - Explicit: the programmer's input.
    - Supervised: hints on the number of resources, hints on the number of flops.
    - Processing units may be shared between contexts.

  12. Scheduling contexts in StarPU: easily use contexts in your application

    unsigned sched_ctx1, sched_ctx2;
    int i;
    int resources1[3] = {CPU_1, CPU_2, GPU_1};          /* worker ids */
    int resources2[4] = {CPU_3, CPU_4, CPU_5, CPU_6};

    /* create each context from a scheduling policy
       and a table of resource ids */
    sched_ctx1 = starpu_create_sched_ctx("mct", resources1, 3);
    sched_ctx2 = starpu_create_sched_ctx("greedy", resources2, 4);

    /* thread 1: bind to the context of parallel kernel 1
       and submit its set of tasks */
    starpu_set_sched_ctx(sched_ctx1);
    for (i = 0; i < ntasks1; i++)
        starpu_task_submit(tasks1[i]);

    /* thread 2: bind to the context of parallel kernel 2
       and submit its set of tasks */
    starpu_set_sched_ctx(sched_ctx2);
    for (i = 0; i < ntasks2; i++)
        starpu_task_submit(tasks2[i]);

  13. Experimental evaluation: platform and applications
  • Platform: 9 CPUs (two Intel hexa-core processors, with 3 of the 12 cores devoted to running the GPU drivers) + 3 GPUs.
  • MAGMA linear-algebra library: StarPU implementation; Cholesky factorization kernel.
  • Euler3D solver: a computational fluid dynamics (CFD) benchmark from the Rodinia benchmark suite; an iterative solver of the 3D Euler equations for compressible fluids; StarPU implementation.
  [Figure: MAGMA Cholesky factorization]

  14. Composing MAGMA and the Euler3D solver: different parallel kernels
  • CFD (Euler3D): domain-decomposition parallelization; independent tasks within an iteration, dependencies between iterations; strong affinity with GPUs; 2 sub-domains mapped onto 2 GPUs.
  • Cholesky factorization: scalable on both CPUs and GPUs; 1 GPU and 9 CPUs; large number of tasks.
  • Contexts' benefit: enforcing locality constraints.
  • Measured execution time: no contexts, 19.8 s; 2 contexts, 14.2 s.

  15. Micro-benchmark: 9 Cholesky factorizations in parallel, gaining performance from data locality
  • Mixing parallel kernels in a single context causes unnecessary data transfers between host memory and GPU memory (hence blocking waits) and GPU memory flushes.
  • Measured execution time and total data transferred:
    - Serial execution: 52 s, 87 GB
    - 1 context (9 CPUs / 3 GPUs): 44.3 s, 113 GB
    - 3 contexts (3 x (3 CPUs / 1 GPU)): 34.8 s, 37 GB
    - 9 contexts (9 x (1 CPU / 0.3 GPU)): 34.4 s, 41 GB

  16. Tackle the composability problem (outline revisited): a runtime system to validate our proposal
  • Scheduling contexts, to isolate parallel codes.
  • The hypervisor, to (re)size scheduling contexts.
