A Pluggable Framework for Composable HPC Scheduling Libraries


SLIDE 1

A Pluggable Framework for Composable HPC Scheduling Libraries

Max Grossman1, Vivek Kumar2, Nick Vrvilo1, Zoran Budimlic1, Vivek Sarkar1

1 Habanero Extreme Scale Software Research Group, Rice University   2 IIIT-Delhi

AsHES 2017 - May 29 2017


SLIDE 2

Top10


The past decade has seen a growing number of heterogeneous supercomputers.

https://www.top500.org

SLIDE 3

Top10


The majority of Top10 peak and achieved GFlop/s has come from heterogeneous machines since 2013.

https://www.top500.org

SLIDE 4

Top10


We as a community are very bad at programming heterogeneous supercomputers (even for LINPACK).

https://www.top500.org

SLIDE 5

How Do We Define Heterogeneity?


Processors: for the past decade, “heterogeneous computing” == “GPUs”

  • Dealing with GPUs has taught us a lot about software heterogeneity

But heterogeneity is on the rise everywhere in HPC:

  • Hardware: memory, networks, storage, cores
  • Software: networking libraries, compute libraries, managed runtimes, domain libraries, storage APIs

Depiction of the abstract platform motivating this work.

SLIDE 6

Heterogeneous Programming in Practice

[Diagram: libraries composed in practice, including pthreads and QThreads]

SLIDE 7


Heterogeneous Programming in Research

Legion: Hide all heterogeneity from the user, rely on the runtime to map the problem to hardware efficiently, with implicit dependencies discovered by the runtime.
PaRSEC, OCR: Explicit dataflow model.
HCMPI, HCUPC++, HC-CUDA, HPX: Task-based runtimes that create dedicated proxy threads for managing some external resource (e.g. NIC, GPU).
HiPER: Generalize a task-based, locality-aware, work-stealing runtime/model to support non-CPU resources.

  • Retain the appearance of legacy APIs
  • Composability, extensibility, and compatibility are first-class citizens from the start.
SLIDE 8

Outline

HiPER Execution & Platform Model
HiPER Use Cases
  • MPI Module
  • Composing MPI and CUDA
Performance Evaluation
Conclusions & Future Work

SLIDE 9

Outline

HiPER Execution & Platform Model
HiPER Use Cases
  • MPI Module
  • Composing MPI and CUDA
Performance Evaluation
Conclusions & Future Work

SLIDE 10


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees, with sysmem, L2, and L1 places]

SLIDE 11


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees with a GPU place managed by a proxy thread]

SLIDE 12


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees with GPU and OSHMEM places, each managed by a proxy thread]

SLIDE 13


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees with two GPU places and an OSHMEM place, each managed by a proxy thread]

SLIDE 14


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees with GPU, OSHMEM, and MPI places, each managed by a proxy thread]

SLIDE 15


HiPER’s Predecessors

[Diagram: Hierarchical Place Trees with GPU, OSHMEM, and MPI places, each managed by a proxy thread]

  • The simplicity of this model made it attractive for many past research efforts, but…
  • Not scalable software engineering
  • Wasteful use of host resources
  • Not easily extendable to new software/hardware capabilities

SLIDE 16

HiPER Platform & Execution Model

[Diagram: the HiPER work-stealing thread pool]

SLIDE 17

HiPER Platform & Execution Model

[Diagram: work-stealing thread pool plus pluggable modules: System, OSHMEM, MPI, CUDA, etc.]
Modules expose user-visible APIs for work creation.

SLIDE 18

HiPER Platform & Execution Model

[Diagram: thread pool, pluggable modules, and the HiPER platform model]
The platform model gives modules somewhere to place work, and the thread pool somewhere to find work.

SLIDE 19

HiPER Platform & Execution Model

[Diagram: platform model populated with x86 CPU places CPU0 through CPU4]
Modules fill in the platform model and tell threads which subset of the platform they are responsible for scheduling work on.

SLIDE 20

HiPER Platform & Execution Model

[Diagram: platform model with the CPU places plus a NIC place]
Modules fill in the platform model and tell threads which subset of the platform they are responsible for scheduling work on.

SLIDE 21

HiPER Platform & Execution Model

[Diagram: platform model with the CPU places, a NIC place, and a GPU place]

SLIDE 22

Outline

HiPER Execution & Platform Model
HiPER Use Cases
  • MPI Module
  • Composing MPI and CUDA
Performance Evaluation
Conclusions & Future Work

SLIDE 23


Fundamental Task-Parallel API

The HiPER core exposes a fundamental C/C++ tasking API.

API                                  Explanation
async([] { S1; });                   Create an asynchronous task.
finish([] { S2; });                  Suspend the calling task until all nested tasks have completed.
async_at([] { S3; }, place);         Create an asynchronous task at a place in the platform model.
fut = async_future([] { S4; });      Get a future that is signaled when the task completes.
async_await([] { S5; }, fut);        Create an asynchronous task whose execution is predicated on the satisfaction of fut.

Summary of core tasking APIs. The above list is not comprehensive.
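
A minimal usage sketch of how these calls compose (assuming the hiper namespace used on later slides; the header name, do_work_* functions, and place_t pointer below are placeholders, not part of the slides):

#include "hiper.h"  // hypothetical header name

void example(hiper::place_t *gpu_place) {
  hiper::finish([=] {
    // Plain asynchronous task
    hiper::async([] { do_work_a(); });

    // Task created at a specific place in the platform model
    hiper::async_at([] { do_work_b(); }, gpu_place);

    // Future-returning task, plus a task predicated on that future
    hiper::future_t<void> *fut = hiper::async_future([] { do_work_c(); });
    hiper::async_await([] { do_work_d(); }, fut);
  }); // finish returns only after all nested tasks have completed
}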

SLIDE 24


MPI Module

Extends the HiPER namespace with familiar MPI APIs:

  • Programmers can use the APIs they already know and love
  • Built on 1) an MPI implementation, and 2) HiPER's core tasking APIs

Asynchronous APIs return futures rather than MPI_Requests, enabling composability in the programming layer with all other future-based APIs:

hiper::future_t<void> *MPI_Irecv/Isend(...);

Enables non-standard extensions, e.g.:

hiper::future_t<void> *MPI_Isend_await(..., hiper::future_t<void> *await);
Start an asynchronous send once await is satisfied.

hiper::future_t<void> *MPI_Allreduce_future(...);
Asynchronous collectives.
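
As an illustration of this composability (a sketch only: the slide elides the argument lists, so the buffer, count, rank, and tag arguments below are placeholders), a future returned by MPI_Irecv can gate a downstream task directly:

// Sketch: receive into buf, then process it once the receive completes,
// without blocking the calling task on MPI_Wait.
hiper::future_t<void> *recv_done =
    hiper::MPI_Irecv(buf, count, MPI_DOUBLE, src_rank, tag, MPI_COMM_WORLD);
hiper::async_await([=] { process(buf, count); }, recv_done);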

SLIDE 25

Example API Implementation

hiper::future_t<void> *hiper::MPI_Isend_await(..., hiper::future_t<void> *await) {
  // Create a promise to be satisfied on the completion of this operation
  hiper::promise_t<void> *prom = new hiper::promise_t<void>();

  // Taskify the actual MPI_Isend at the NIC place, pending the satisfaction of await
  hclib::async_nb_await_at([=] {
    // At the MPI place, do the actual Isend
    MPI_Request req;
    ::MPI_Isend(..., &req);

    // Create a data structure to track the status of the pending Isend
    pending_mpi_op *op = (pending_mpi_op *)malloc(sizeof(*op));
    ...
    // test_mpi_completion is the periodic polling function
    hiper::append_to_pending(op, &pending, test_mpi_completion, nic);
  }, await, nic);

  return prom->get_future();
}
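
The periodic polling function itself is not shown on the slide; the following is only a plausible sketch, assuming pending_mpi_op stores the MPI_Request and the promise to satisfy (both the struct layout and the callback signature are assumptions):

// Hypothetical sketch of the periodic polling callback referenced above.
static int test_mpi_completion(void *generic_op) {
    pending_mpi_op *op = (pending_mpi_op *)generic_op;
    int complete = 0;
    ::MPI_Test(&op->req, &complete, MPI_STATUS_IGNORE);
    if (complete) {
        op->prom->put();   // satisfy the promise handed back to the caller
        free(op);
    }
    return complete;       // non-zero tells the runtime to retire this pending op
}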

SLIDE 26


Composing System, MPI, CUDA Modules

// Asynchronously process ghost regions on this rank in parallel on the CPU
ghost_fut = forasync_future([] (z) { ... });

// Asynchronously exchange ghost regions with neighbors
reqs[0] = MPI_Isend_await(..., ghost_fut);
reqs[1] = MPI_Isend_await(..., ghost_fut);
reqs[2] = MPI_Irecv(...);
reqs[3] = MPI_Irecv(...);

// Asynchronously process the remainder of z values on this rank
kernel_fut = forasync_cuda(..., [] (z) { ... });

// Copy the received ghost region to the CUDA device
copy_fut = async_copy_await(..., reqs[2], reqs[3], kernel_fut);

SLIDE 27

Outline

HiPER Execution & Platform Model
HiPER Use Cases
  • MPI Module
  • Composing MPI and CUDA
Performance Evaluation
Conclusions & Future Work

SLIDE 28


Task Micro-Benchmarking

Micro-benchmark performance normalized to HiPER on Edison; higher is better.

https://github.com/habanero-rice/tasking-micro-benchmark-suite

SLIDE 29


Experimental Setup

Experiments shown here were run on Titan @ ORNL and Edison @ NERSC.

Application   Platform   Dataset                              Modules Used   Scaling
ISx           Titan      2^29 keys per node                   OpenSHMEM      Weak
HPGMG-FV      Edison     log2_box_dim=7, boxes_per_rank=8     UPC++          Weak
UTS           Titan      T1XXL                                OpenSHMEM      Strong
Graph500      Titan      2^29 nodes                           OpenSHMEM      Strong
LBM           Titan                                           MPI, CUDA      Weak

SLIDE 30

HiPER Evaluation – Regular Applications

[Plot: ISx total execution time (s) vs. total nodes on Titan (16 cores per node), comparing Flat OpenSHMEM, OpenSHMEM+OpenMP, and HiPER]

[Plot: HPGMG solve step total execution time (s) vs. total nodes on Edison (2 processes/sockets per node, 12 cores per process), comparing UPC++ + OpenMP and HiPER]

HiPER is low-overhead, with no impact on performance for regular applications.

SLIDE 31

HiPER Evaluation – Regular Applications

LBM: ~2% performance improvement through reduced synchronization from futures-based programming.

SLIDE 32

HiPER Evaluation – UTS

[Plot: UTS total execution time (s) vs. total nodes on Titan (16 cores per node), comparing OpenSHMEM+OpenMP and HiPER]

HiPER integration improves computation-communication overlap, scalability, and load balance.

SLIDE 33

HiPER Evaluation – Graph500

HiPER is used for concurrent (not parallel) programming in Graph500. Rather than periodic polling, novel shmem_async_when APIs trigger local computation on incoming RDMA. This reduces code complexity and hands the scheduling problem to the runtime.
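
The exact shmem_async_when signature is not shown here, so the following is only a hypothetical sketch of the pattern it enables (the argument list, comparison-constant usage, and helper names are assumptions):

// Hypothetical sketch: run a task when a symmetric signal word changes,
// instead of spinning in a polling loop waiting for the incoming RDMA.
shmem_async_when(&signal_word, SHMEM_CMP_NE, 0, [=] {
    // Invoked by the runtime once the remote write is locally visible.
    process_incoming_vertices(recv_buf);
});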

SLIDE 34

Outline

HiPER Execution & Platform Model
HiPER Use Cases
  • OpenSHMEM w/o Thread Safety
  • OpenSHMEM w/ Contexts
Performance Evaluation
Conclusions & Future Work

SLIDE 35


HiPER

Working to generalize past work on improving the composability of HPC libraries through tasking. Exploring improvements at both the runtime and API layers. System requirements are driven by OpenSHMEM, but composition of CUDA, MPI, and UPC++ is also supported today.

Future work:

  • Continuing work on additional module support (integration with OpenSHMEM contexts)
  • Continue to iterate on existing benchmarks
  • New application development (Fast Multipole Method)

https://github.com/habanero-rice/hclib/tree/resource_workers
https://github.com/habanero-rice/tasking-micro-benchmark-suite

SLIDE 36


Acknowledgements