A Pluggable Framework for Composable HPC Scheduling Libraries



  1. A Pluggable Framework for Composable HPC Scheduling Libraries. Max Grossman¹, Vivek Kumar², Nick Vrvilo¹, Zoran Budimlic¹, Vivek Sarkar¹. ¹Habanero Extreme Scale Software Research Group, Rice University; ²IIIT-Delhi. AsHES 2017, May 29, 2017.

  2. Top10: Past decade has seen more heterogeneous supercomputers. (https://www.top500.org)

  3. Top10: Majority of Top10 Peak and Achieved GFlop/s has come from heterogeneous machines since 2013. (https://www.top500.org)

  4. Top10: We as a community are very bad at programming heterogeneous supercomputers (even for LINPACK). (https://www.top500.org)

  5. How Do We Define Heterogeneity? For the past decade, "heterogeneous computing" == "GPUs".
  • Dealing with GPUs has taught us a lot about software heterogeneity
  But heterogeneity is on the rise everywhere in HPC:
  • Hardware: memory, networks, storage, cores
  • Software: networking libraries, compute libraries, managed runtimes, domain libraries, storage APIs
  (Figure: depiction of the abstract platform motivating this work, processors plus the software APIs above them.)

  6. Heterogeneous Programming in Practice. (Figure: pthreads, QThreads, …)

  7. Heterogeneous Programming in Research
  • Legion: hide all heterogeneity from the user, rely on the runtime to map the problem to hardware efficiently, with implicit dependencies discovered by the runtime.
  • PaRSEC, OCR: explicit dataflow model.
  • HCMPI, HCUPC++, HC-CUDA, HPX: task-based runtimes that create dedicated proxy threads for managing some external resource (e.g. NIC, GPU).
  • HiPER: generalize a task-based, locality-aware, work-stealing runtime/model to support non-CPU resources.
    • Retain the appearance of legacy APIs
    • Composability, extensibility, and compatibility are first-class citizens from the start.

  8. Outline: HiPER Execution & Platform Model; HiPER Use Cases (• MPI Module • Composing MPI and CUDA); Performance Evaluation; Conclusions & Future Work

  9. Outline: HiPER Execution & Platform Model; HiPER Use Cases (• MPI Module • Composing MPI and CUDA); Performance Evaluation; Conclusions & Future Work

  10. HiPER's Predecessors: Hierarchical Place Trees. (Figure: a place tree with sysmem at the root, L2 places, and L1 places at the leaves.)

  11. HiPER's Predecessors: Hierarchical Place Trees w/ GPU Proxy Thread. (Figure: a GPU place and its proxy thread added to the tree.)

  12. HiPER's Predecessors: Hierarchical Place Trees w/ GPU and OSHMEM. (Figure: GPU and OSHMEM places, each served by its own proxy thread.)

  13. HiPER's Predecessors: Hierarchical Place Trees w/ GPUs and OSHMEM. (Figure: two GPU places and an OSHMEM place, each served by a proxy thread.)

  14. HiPER's Predecessors: Hierarchical Place Trees w/ GPUs, OSHMEM, MPI. (Figure: GPU, OSHMEM, and MPI places, each served by a proxy thread.)

  15. HiPER's Predecessors: Hierarchical Place Trees w/ GPUs, OSHMEM, MPI.
  • The simple model makes it attractive for many past research efforts, but …
  • Not scalable software engineering
  • Wasteful use of host resources
  • Not easily extendable to new software/hardware capabilities

  16. HiPER Platform & Execution Model. (Figure: the HiPER work-stealing thread pool.)

  17. HiPER Platform & Execution Model: pluggable system modules (OSHMEM, MPI, CUDA, etc.). Modules expose user-visible APIs for work creation. (Figure: modules above the HiPER work-stealing thread pool.)

  18. HiPER Platform & Execution Model: the platform model gives modules somewhere to place work, and gives the thread pool somewhere to find work. (Figure: pluggable modules, the HiPER platform model, and the work-stealing thread pool.)

  19. HiPER Platform & Execution Model: modules fill in the platform model and tell threads the subset of the platform they are responsible for scheduling work on. (Figure: the platform model populated with x86 CPU places, CPU 0 through CPU 4, above the work-stealing thread pool.)

  20. HiPER Platform & Execution Model: modules fill in the platform model and tell threads the subset of the platform they are responsible for scheduling work on. (Figure: a NIC place added alongside the CPU places.)

  21. HiPER Platform & Execution Model. (Figure: the platform model populated with GPU, NIC, and x86 CPU places above the work-stealing thread pool.)

  22. Outline: HiPER Execution & Platform Model; HiPER Use Cases (• MPI Module • Composing MPI and CUDA); Performance Evaluation; Conclusions & Future Work

  23. Fundamental Task-Parallel API: the HiPER core exposes a fundamental C/C++ tasking API. Summary of the core tasking APIs (this list is not comprehensive):
  • async([] { S1; });              // create an asynchronous task
  • finish([] { S2; });             // suspend the calling task until nested tasks have completed
  • async_at([] { S3; }, place);    // create an asynchronous task at a place in the platform model
  • fut = async_future([] { S4; }); // get a future that is signaled when the task completes
  • async_await([] { S5; }, fut);   // create an asynchronous task whose execution is predicated on the satisfaction of fut
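  A minimal sketch of how these primitives compose. The hiper.h header, the hiper:: qualification on the core calls, and the hiper::launch entry point are assumptions (the slide shows the calls unqualified and does not show how the runtime is started); the call shapes themselves follow the table above.

      #include <cstdio>
      #include "hiper.h"   // hypothetical header for the HiPER core tasking API

      int main() {
          // Assumed entry point that initializes the runtime and runs a top-level task.
          hiper::launch([] {
              hiper::finish([] {
                  // Fire-and-forget task.
                  hiper::async([] { printf("hello from a task\n"); });

                  // Task that hands back a future, signaled when the task completes.
                  hiper::future_t<void> *fut =
                      hiper::async_future([] { printf("producer\n"); });

                  // Task whose execution is predicated on the satisfaction of fut.
                  hiper::async_await([] { printf("consumer\n"); }, fut);
              });  // finish: suspends until all nested tasks above have completed
              printf("all tasks done\n");
          });
          return 0;
      }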

  24. MPI Module: extends the HiPER namespace with familiar MPI APIs.
  • Programmers can use the APIs they already know and love
  • Built on 1) an MPI implementation, and 2) HiPER's core tasking APIs
  Asynchronous APIs return futures rather than MPI_Requests, enabling composability in the programming layer with all other future-based APIs:
      hiper::future_t<void> *MPI_Irecv/Isend(...);
  This also enables non-standard extensions, e.g.:
      hiper::future_t<void> *MPI_Isend_await(..., hiper::future_t<void> *await); // start an asynchronous send once await is satisfied
      hiper::future_t<void> *MPI_Allreduce_future(...);                          // asynchronous collectives
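  To see why a dedicated module is worth the trouble, here is a deliberately naive way one could wrap a nonblocking receive as a future using only the core tasking API from slide 23 plus plain MPI. This is not HiPER's implementation (that is on the next slide); the hiper.h header and hiper:: qualification are assumptions. Blocking inside the task ties up a worker thread for the full duration of the receive, which is exactly the inefficiency the real module avoids by taskifying the call at the NIC place and polling.

      #include <mpi.h>
      #include "hiper.h"  // hypothetical header

      // Naive sketch: run MPI_Irecv inside a task that blocks until completion
      // and return that task's future.
      hiper::future_t<void> *irecv_as_future(void *buf, int count, MPI_Datatype dt,
                                             int src, int tag, MPI_Comm comm) {
          return hiper::async_future([=] {
              MPI_Request req;
              MPI_Irecv(buf, count, dt, src, tag, comm, &req);
              MPI_Wait(&req, MPI_STATUS_IGNORE);  // occupies a worker thread while waiting
          });
      }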

  25. Example API Implementation

  hiper::future_t<void> *hiper::MPI_Isend_await(..., hiper::future_t<void> *await) {
      // Create a promise to be satisfied on the completion of this operation
      hiper::promise_t<void> *prom = new hiper::promise_t<void>();
      // Taskify the actual MPI_Isend at the NIC, pending the satisfaction of await
      hclib::async_nb_await_at([=] {
          // At the MPI place, do the actual Isend
          MPI_Request req;
          ::MPI_Isend(..., &req);
          // Create a data structure to track the status of the pending Isend
          pending_mpi_op *op = malloc(sizeof(*op));
          ...
          // test_mpi_completion is the periodic polling function that eventually
          // satisfies the promise once the Isend completes
          hiper::append_to_pending(op, &pending, test_mpi_completion, nic);
      }, await, nic);
      return prom->get_future();
  }
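  The slide elides how the pending operation is eventually completed. Purely as an illustration, a polling callback such as test_mpi_completion could look like the sketch below; the pending_mpi_op fields, the callback signature, and the prom->put() promise-satisfaction call are all assumptions, since none of them appear on the slide.

      #include <mpi.h>
      #include <cstdlib>
      #include "hiper.h"  // hypothetical header

      // Assumed layout; the real structure's fields are elided ("...") on the slide.
      typedef struct pending_mpi_op {
          MPI_Request req;               // request returned by ::MPI_Isend
          hiper::promise_t<void> *prom;  // promise to satisfy on completion
      } pending_mpi_op;

      // Hypothetical polling callback, run periodically at the NIC place by the runtime.
      // Returns non-zero once the operation has completed.
      static int test_mpi_completion(void *generic_op) {
          pending_mpi_op *op = (pending_mpi_op *)generic_op;
          int complete = 0;
          ::MPI_Test(&op->req, &complete, MPI_STATUS_IGNORE);
          if (complete) {
              op->prom->put();  // assumed API for satisfying the promise
              free(op);
          }
          return complete;
      }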

  26. Composing System, MPI, CUDA Modules

  // Asynchronously process ghost regions on this rank in parallel on the CPU
  ghost_fut = forasync_future([] (z) { ... });

  // Asynchronously exchange ghost regions with neighbors
  reqs[0] = MPI_Isend_await(..., ghost_fut);
  reqs[1] = MPI_Isend_await(..., ghost_fut);
  reqs[2] = MPI_Irecv(...);
  reqs[3] = MPI_Irecv(...);

  // Asynchronously process the remainder of z values on this rank on the GPU
  kernel_fut = forasync_cuda(..., [] (z) { ... });

  // Copy the received ghost region to the CUDA device once the receives and the kernel complete
  copy_fut = async_copy_await(..., reqs[2], reqs[3], kernel_fut);
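  One plausible way to close the timestep (not shown on the slide) is to predicate the next piece of work on copy_fut using the async_await primitive from slide 23; next_timestep() below is a hypothetical placeholder for whatever follows the exchange.

      // Runs only after the device copy completes, which in turn waited on the
      // receives and the GPU kernel.
      async_await([=] { next_timestep(); }, copy_fut);

  In a real code one would also chain on reqs[0], reqs[1], and ghost_fut (or wrap the whole step in a finish scope) before reusing the send buffers.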

  27. Outline: HiPER Execution & Platform Model; HiPER Use Cases (• MPI Module • Composing MPI and CUDA); Performance Evaluation; Conclusions & Future Work

  28. Task Micro-Benchmarking. (Figure: micro-benchmark performance normalized to HiPER on Edison, higher is better.) https://github.com/habanero-rice/tasking-micro-benchmark-suite

  29. Experimental Setup. Experiments shown here were run on Titan @ ORNL and Edison @ NERSC.
  Application | Platform | Dataset                          | Modules Used | Scaling
  ISx         | Titan    | 2^29 keys per node               | OpenSHMEM    | Weak
  HPGMG-FV    | Edison   | log2_box_dim=7, boxes_per_rank=8 | UPC++        | Weak
  UTS         | Titan    | T1XXL                            | OpenSHMEM    | Strong
  Graph500    | Titan    | 2^29 nodes                       | OpenSHMEM    | Strong
  LBM         | Titan    |                                  | MPI, CUDA    | Weak

  30. HiPER Evaluation – Regular Applications. HiPER is low-overhead, with no impact on performance for regular applications. (Figures: ISx total execution time vs. total nodes on Titan (16 cores per node), comparing Flat OpenSHMEM, OpenSHMEM+OpenMP, and HiPER; HPGMG solve-step total execution time vs. total nodes on Edison (2 processes/sockets per node, 12 cores per process), comparing UPC++ + OpenMP and HiPER.)

  31. HiPER Evaluation – Regular Applications: ~2% performance improvement on LBM through reduced synchronization from futures-based programming.

  32. HiPER Evaluation – UTS: HiPER integration improves computation-communication overlap, scalability, and load balance. (Figure: total execution time vs. total nodes on Titan (16 cores per node), OpenSHMEM+OpenMP vs. HiPER.)

  33. HiPER Evaluation – Graph500: HiPER is used for concurrent (not parallel) programming in Graph500. Rather than periodic polling, novel shmem_async_when APIs trigger local computation on incoming RDMA. This reduces code complexity and hands the scheduling problem to the runtime.
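  The slide does not show the shmem_async_when signature. Purely to illustrate the pattern described (schedule a local task when an incoming RDMA changes a symmetric word), a hypothetical sketch might look like the following; the argument list, the use of an OpenSHMEM comparison constant, and the callback form are all assumptions, as are process_incoming_vertices() and register_handler().

      #include <shmem.h>
      #include "hiper.h"   // hypothetical header exposing the HiPER OpenSHMEM module

      extern void process_incoming_vertices();  // hypothetical local Graph500 work

      // Symmetric flag that a neighboring PE sets remotely via RDMA (e.g. a put).
      static int incoming_flag = 0;

      void register_handler() {
          // Hypothetical registration: when incoming_flag becomes non-zero, the
          // runtime schedules the lambda as a HiPER task instead of the application
          // spinning in a polling loop.
          hiper::shmem_async_when(&incoming_flag, SHMEM_CMP_NE, 0, [=] {
              process_incoming_vertices();
          });
      }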

  34. Outline: HiPER Execution & Platform Model; HiPER Use Cases (• OpenSHMEM w/o Thread Safety • OpenSHMEM w/ Contexts); Performance Evaluation; Conclusions & Future Work
