A Pluggable Framework for Composable HPC Scheduling Libraries


SLIDE 1

A Pluggable Framework for Composable HPC Scheduling Libraries

Max Grossman1, Vivek Kumar2, Nick Vrvilo1, Zoran Budimlic1, Vivek Sarkar1

1 Habanero Extreme Scale Software Research Group, Rice University   2 IIIT-Delhi

AsHES 2017 - May 29 2017


SLIDE 2

Top10


The past decade has seen a growing number of heterogeneous supercomputers.

https://www.top500.org

SLIDE 3

Top10


The majority of Top10 peak and achieved GFlop/s has come from heterogeneous machines since 2013.

https://www.top500.org

SLIDE 4

Top10


We as a community are very bad at programming heterogeneous supercomputers (even for LINPACK).

https://www.top500.org

SLIDE 5

How Do We Define Heterogeneity?


Processors: for the past decade, “heterogeneous computing” == “GPUs”

  • Dealing with GPUs has taught us a lot about software heterogeneity

But heterogeneity is on the rise everywhere in HPC:

  • Hardware: memory, networks, storage, cores
  • Software: networking libraries, compute libraries, managed runtimes, domain libraries, storage APIs

Depiction of the abstract platform motivating this work.

SLIDE 6

Heterogeneous Programming in Practice

[Diagram: libraries composed in practice, including pthreads and QThreads]

SLIDE 7


Heterogeneous Programming in Research

Legion: Hide all heterogeneity from the user, rely on the runtime to map the problem to hardware efficiently, with implicit dependencies discovered by the runtime.
PaRSEC, OCR: Explicit dataflow model.
HCMPI, HCUPC++, HC-CUDA, HPX: Task-based runtimes that create dedicated proxy threads for managing some external resource (e.g. NIC, GPU).
HiPER: Generalize a task-based, locality-aware, work-stealing runtime/model to support non-CPU resources.

  • Retain the appearance of legacy APIs
  • Composability, extensibility, and compatibility are first-class citizens from the start.
SLIDE 8

Outline

HiPER Execution & Platform Model
HiPER Use Cases
  • MPI Module
  • Composing MPI and CUDA
Performance Evaluation
Conclusions & Future Work

SLIDE 9

Outline

HiPER Execution & Platform Model
HiPER Use Cases
  • MPI Module
  • Composing MPI and CUDA
Performance Evaluation
Conclusions & Future Work

SLIDE 10


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees, with sysmem, L2, and L1 places]

SLIDE 11


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees with a GPU place managed by a proxy thread]

SLIDE 12


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees with GPU and OSHMEM places, each managed by a proxy thread]

SLIDE 13


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees with two GPU places and an OSHMEM place, each managed by a proxy thread]

SLIDE 14


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees with GPU, OSHMEM, and MPI places, each managed by a proxy thread]

SLIDE 15


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees with GPU, OSHMEM, and MPI places, each managed by a proxy thread]

  • The simplicity of this model made it attractive for many past research efforts, but…
  • Not scalable software engineering
  • Wasteful use of host resources
  • Not easily extendable to new software/hardware capabilities

SLIDE 16

HiPER Platform & Execution Model

[Diagram: the HiPER work-stealing thread pool]

SLIDE 17

HiPER Platform & Execution Model

[Diagram: work-stealing thread pool plus pluggable modules: System, OSHMEM, MPI, CUDA, etc.]
Modules expose user-visible APIs for work creation.

SLIDE 18

HiPER Platform & Execution Model

[Diagram: thread pool, pluggable modules, and the HiPER platform model]
The platform model gives modules somewhere to place work, and the thread pool somewhere to find work.

SLIDE 19

HiPER Platform & Execution Model

[Diagram: platform model populated with x86 CPU places CPU0 through CPU4]
Modules fill in the platform model and tell threads which subset of the platform they are responsible for scheduling work on.

SLIDE 20

HiPER Platform & Execution Model

[Diagram: platform model with the CPU places plus a NIC place]
Modules fill in the platform model and tell threads which subset of the platform they are responsible for scheduling work on.

SLIDE 21

HiPER Platform & Execution Model

[Diagram: platform model with the CPU places, a NIC place, and a GPU place]

SLIDE 22

Outline

HiPER Execution & Platform Model
HiPER Use Cases
  • MPI Module
  • Composing MPI and CUDA
Performance Evaluation
Conclusions & Future Work

SLIDE 23


Fundamental Task-Parallel API

The HiPER core exposes a fundamental C/C++ tasking API.

API                                  Explanation
async([] { S1; });                   Create an asynchronous task.
finish([] { S2; });                  Suspend the calling task until all nested tasks have completed.
async_at([] { S3; }, place);         Create an asynchronous task at a place in the platform model.
fut = async_future([] { S4; });      Get a future that is signaled when the task completes.
async_await([] { S5; }, fut);        Create an asynchronous task whose execution is predicated on the satisfaction of fut.

Summary of core tasking APIs. The above list is not comprehensive.
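
A minimal usage sketch of how these calls compose (assuming the hiper namespace used on later slides; the header name, do_work_* functions, and place_t pointer below are placeholders, not part of the slides):

#include "hiper.h"  // hypothetical header name

void example(hiper::place_t *gpu_place) {
  hiper::finish([=] {
    // Plain asynchronous task
    hiper::async([] { do_work_a(); });

    // Task created at a specific place in the platform model
    hiper::async_at([] { do_work_b(); }, gpu_place);

    // Future-returning task, plus a task predicated on that future
    hiper::future_t<void> *fut = hiper::async_future([] { do_work_c(); });
    hiper::async_await([] { do_work_d(); }, fut);
  }); // finish returns only after all nested tasks have completed
}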

SLIDE 24


MPI Module

Extends the HiPER namespace with familiar MPI APIs:

  • Programmers can use the APIs they already know and love
  • Built on 1) an MPI implementation, and 2) HiPER's core tasking APIs

Asynchronous APIs return futures rather than MPI_Requests, enabling composability in the programming layer with all other future-based APIs:

hiper::future_t<void> *MPI_Irecv/Isend(...);

Enables non-standard extensions, e.g.:

hiper::future_t<void> *MPI_Isend_await(..., hiper::future_t<void> *await);
Start an asynchronous send once await is satisfied.

hiper::future_t<void> *MPI_Allreduce_future(...);
Asynchronous collectives.
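
As an illustration of this composability (a sketch only: the slide elides the argument lists, so the buffer, count, rank, and tag arguments below are placeholders), a future returned by MPI_Irecv can gate a downstream task directly:

// Sketch: receive into buf, then process it once the receive completes,
// without blocking the calling task on MPI_Wait.
hiper::future_t<void> *recv_done =
    hiper::MPI_Irecv(buf, count, MPI_DOUBLE, src_rank, tag, MPI_COMM_WORLD);
hiper::async_await([=] { process(buf, count); }, recv_done);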

SLIDE 25

Example API Implementation

hiper::future_t<void> *hiper::MPI_Isend_await(..., hiper::future_t<void> *await) {
  // Create a promise to be satisfied on the completion of this operation
  hiper::promise_t<void> *prom = new hiper::promise_t<void>();

  // Taskify the actual MPI_Isend at the NIC place, pending the satisfaction of await
  hclib::async_nb_await_at([=] {
    // At the MPI place, do the actual Isend
    MPI_Request req;
    ::MPI_Isend(..., &req);

    // Create a data structure to track the status of the pending Isend
    pending_mpi_op *op = (pending_mpi_op *)malloc(sizeof(*op));
    ...
    // test_mpi_completion is the periodic polling function
    hiper::append_to_pending(op, &pending, test_mpi_completion, nic);
  }, await, nic);

  return prom->get_future();
}
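
The periodic polling function itself is not shown on the slide; the following is only a plausible sketch, assuming pending_mpi_op stores the MPI_Request and the promise to satisfy (both the struct layout and the callback signature are assumptions):

// Hypothetical sketch of the periodic polling callback referenced above.
static int test_mpi_completion(void *generic_op) {
    pending_mpi_op *op = (pending_mpi_op *)generic_op;
    int complete = 0;
    ::MPI_Test(&op->req, &complete, MPI_STATUS_IGNORE);
    if (complete) {
        op->prom->put();   // satisfy the promise handed back to the caller
        free(op);
    }
    return complete;       // non-zero tells the runtime to retire this pending op
}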

SLIDE 26


Composing System, MPI, CUDA Modules

// Asynchronously process ghost regions on this rank in parallel on the CPU
ghost_fut = forasync_future([] (z) { ... });

// Asynchronously exchange ghost regions with neighbors
reqs[0] = MPI_Isend_await(..., ghost_fut);
reqs[1] = MPI_Isend_await(..., ghost_fut);
reqs[2] = MPI_Irecv(...);
reqs[3] = MPI_Irecv(...);

// Asynchronously process the remainder of z values on this rank
kernel_fut = forasync_cuda(..., [] (z) { ... });

// Copy the received ghost region to the CUDA device
copy_fut = async_copy_await(..., reqs[2], reqs[3], kernel_fut);

SLIDE 27

Outline

HiPER Execution & Platform Model
HiPER Use Cases
  • MPI Module
  • Composing MPI and CUDA
Performance Evaluation
Conclusions & Future Work

SLIDE 28


Task Micro-Benchmarking

Micro-benchmark performance normalized to HiPER on Edison; higher is better.

https://github.com/habanero-rice/tasking-micro-benchmark-suite

SLIDE 29


Experimental Setup

Experiments shown here were run on Titan @ ORNL and Edison @ NERSC.

Application   Platform   Dataset                              Modules Used   Scaling
ISx           Titan      2^29 keys per node                   OpenSHMEM      Weak
HPGMG-FV      Edison     log2_box_dim=7, boxes_per_rank=8     UPC++          Weak
UTS           Titan      T1XXL                                OpenSHMEM      Strong
Graph500      Titan      2^29 nodes                           OpenSHMEM      Strong
LBM           Titan                                           MPI, CUDA      Weak

SLIDE 30

HiPER Evaluation – Regular Applications

[Plot: ISx total execution time (s) vs. total nodes on Titan (16 cores per node), comparing Flat OpenSHMEM, OpenSHMEM+OpenMP, and HiPER]

[Plot: HPGMG solve step total execution time (s) vs. total nodes on Edison (2 processes/sockets per node, 12 cores per process), comparing UPC++ + OpenMP and HiPER]

HiPER is low-overhead, with no impact on performance for regular applications.

SLIDE 31

HiPER Evaluation – Regular Applications

LBM: ~2% performance improvement through reduced synchronization from futures-based programming.

SLIDE 32

HiPER Evaluation – UTS

[Plot: UTS total execution time (s) vs. total nodes on Titan (16 cores per node), comparing OpenSHMEM+OpenMP and HiPER]

HiPER integration improves computation-communication overlap, scalability, and load balance.

SLIDE 33

HiPER Evaluation – Graph500

HiPER is used for concurrent (not parallel) programming in Graph500. Rather than periodic polling, novel shmem_async_when APIs trigger local computation on incoming RDMA. This reduces code complexity and hands the scheduling problem to the runtime.
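
The exact shmem_async_when signature is not shown here, so the following is only a hypothetical sketch of the pattern it enables (the argument list, comparison-constant usage, and helper names are assumptions):

// Hypothetical sketch: run a task when a symmetric signal word changes,
// instead of spinning in a polling loop waiting for the incoming RDMA.
shmem_async_when(&signal_word, SHMEM_CMP_NE, 0, [=] {
    // Invoked by the runtime once the remote write is locally visible.
    process_incoming_vertices(recv_buf);
});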

SLIDE 34

Outline

HiPER Execution & Platform Model
HiPER Use Cases
  • OpenSHMEM w/o Thread Safety
  • OpenSHMEM w/ Contexts
Performance Evaluation
Conclusions & Future Work

SLIDE 35


HiPER

Working to generalize past work on improving the composability of HPC libraries through tasking. Exploring improvements at both the runtime and API layers. System requirements are driven by OpenSHMEM, but composition of CUDA, MPI, and UPC++ is also supported today.

Future work:

  • Continuing work on additional module support (integration with OpenSHMEM contexts)
  • Continue to iterate on existing benchmarks
  • New application development (Fast Multipole Method)

https://github.com/habanero-rice/hclib/tree/resource_workers
https://github.com/habanero-rice/tasking-micro-benchmark-suite

SLIDE 36


Acknowledgements