

SLIDE 1

UPC++: A High-Performance Communication Framework for Asynchronous Computation

John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California, USA

SLIDE 2

UPC++: a C++ PGAS Library

  • Global Address Space (PGAS)
    – A portion of the physically distributed address space is visible to all processes. Now generalized to handle GPU memory.
  • Partitioned (PGAS)
    – Global pointers to shared memory segments have an affinity to a particular rank
    – Explicitly managed by the programmer to optimize for locality (see the sketch below)

[Figure: Ranks 0-3, each with private memory and a shared segment in the global address space; local and global pointers (l, g) reference data in remote shared segments]
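To make the partitioning concrete, here is a minimal sketch (mine, not the slides'): rank 0 allocates an integer in its shared segment, broadcasts the global pointer, and every rank then queries which rank that memory has affinity to. The value 42 and the variable name gp are illustrative.

    #include <upcxx/upcxx.hpp>
    #include <iostream>

    int main() {
      upcxx::init();
      upcxx::global_ptr<int> gp;                  // null on every rank to start
      if (upcxx::rank_me() == 0)
        gp = upcxx::new_<int>(42);                // lives in rank 0's shared segment
      gp = upcxx::broadcast(gp, 0).wait();        // all ranks now hold the same global pointer
      std::cout << "rank " << upcxx::rank_me()
                << ": pointer has affinity to rank " << gp.where() << "\n";
      upcxx::barrier();
      if (upcxx::rank_me() == 0) upcxx::delete_(gp);
      upcxx::finalize();
      return 0;
    }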

SLIDE 3

Why is PGAS attractive?

  • The overheads are low
    – Multithreading can’t speed up overheads
  • Memory-per-core is dropping, requiring reduced communication granularity
  • Irregular applications exacerbate the granularity problem
    – Asynchronous computations are critical
  • Current and future HPC networks use one-sided transfers at their lowest level, and the PGAS model matches this hardware with very little overhead

SLIDE 4

What does UPC++ offer?

  • Asynchronous behavior based on futures/promises
    – RMA: low-overhead, zero-copy, one-sided communication. Get/put to a remote location in another address space (see the rput sketch below)
    – RPC (Remote Procedure Call): invoke a function remotely. A higher level of abstraction, though at a cost
  • Design principles encourage performant program design
    – All communication is syntactically explicit (unlike UPC)
    – All communication is asynchronous: futures and promises
    – Scalability

[Figure: Ranks 0-3 with shared segments forming the global address space and private memory below; arrows illustrate one-sided communication and remote procedure call (RPC)]
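As a hedged illustration of the RMA path above (the ring pattern and the names slot, all_slots, neighbor are mine, not the slides'), each rank exposes one double in its shared segment and rputs into its neighbor's copy; the returned future tracks completion. dist_object, used here only to exchange pointers, is one of the advanced constructs listed at the end of the talk.

    #include <upcxx/upcxx.hpp>

    int main() {
      upcxx::init();
      // each rank allocates one slot in its shared segment and publishes the pointer
      upcxx::global_ptr<double> slot = upcxx::new_<double>(0.0);
      upcxx::dist_object<upcxx::global_ptr<double>> all_slots(slot);
      int neighbor = (upcxx::rank_me() + 1) % upcxx::rank_n();
      upcxx::global_ptr<double> dst = all_slots.fetch(neighbor).wait();
      // zero-copy, one-sided put: no code runs on the target to receive it
      upcxx::future<> done = upcxx::rput(3.14, dst);
      done.wait();
      upcxx::barrier();        // make sure all puts have landed before anyone reads
      upcxx::delete_(slot);
      upcxx::finalize();
      return 0;
    }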

SLIDE 5

How does UPC++ deliver the PGAS model?

  • A “Compiler-Free” approach
    – Need only a standard C++ compiler; leverage C++ standards
    – UPC++ is a C++ template library
  • Relies on GASNet-EX for low-overhead communication
    – Efficiently utilizes the network, whatever that network may be, including any special-purpose offload support
  • Designed to allow interoperation with existing programming systems
    – 1-to-1 mapping between MPI and UPC++ ranks (see the sketch below)
    – OpenMP and CUDA can be mixed with UPC++ in the same way as MPI+X
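A minimal interoperability sketch of the 1-to-1 rank mapping mentioned above, assuming a UPC++ installation built to coexist with MPI. Whether MPI_Init must precede upcxx::init can depend on the installation, so treat the initialization order here as an assumption rather than a rule.

    #include <mpi.h>
    #include <upcxx/upcxx.hpp>
    #include <cstdio>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);      // assumed ordering; see note above
      upcxx::init();
      int mpi_rank = -1;
      MPI_Comm_rank(MPI_COMM_WORLD, &mpi_rank);
      // the slide's 1-to-1 mapping: each MPI rank is also a UPC++ rank
      std::printf("MPI rank %d <-> UPC++ rank %d\n", mpi_rank, (int)upcxx::rank_me());
      // ... alternate between MPI phases and UPC++ RMA/RPC phases here, MPI+X style ...
      upcxx::finalize();
      MPI_Finalize();
      return 0;
    }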

SLIDE 6

A simple example of asynchronous execution

By default, all communication ops are split-phased:
  – Initiate the operation
  – Wait for completion

A future holds a value and a state: ready / not ready

[Figure: Ranks 0-3, global address space and private memory; rank 0 starts an rget on a location in a remote shared segment]

  global_ptr<T> gptr1 = ...;
  future<T> f1 = rget(gptr1);   // start the get
  // unrelated work ...
  T t1 = f1.wait();             // wait() returns the result when the rget completes

SLIDE 7

Simple example of remote procedure call

Execute a function on another rank, sending arguments and returning an optional result:

  upcxx::rpc(target, fn, arg1, arg2)

1. Injects the RPC to the target rank
2. Executes fn(arg1, arg2) on the target rank at some future time determined at the target
3. Result becomes available to the caller via the future

Many invocations can run simultaneously, hiding data movement (see the complete example below).

[Figure: rank 0 issues upcxx::rpc(target, fn, arg1, arg2); fn(arg1, arg2) executes on the target rank; the result returns to rank 0 via a future]
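A self-contained version of the call sketched above; fn, arg1 and arg2 are placeholders on the slide, so a no-capture lambda and two small integers stand in for them here.

    #include <upcxx/upcxx.hpp>
    #include <iostream>

    int main() {
      upcxx::init();
      if (upcxx::rank_me() == 0) {
        int target = upcxx::rank_n() - 1;
        upcxx::future<int> f =
            upcxx::rpc(target,
                       [](int a, int b) { return a + b; },   // executes on the target rank
                       2, 3);
        std::cout << "result from rank " << target << ": " << f.wait() << "\n";
      }
      upcxx::barrier();   // the other ranks wait here, making progress so the RPC can run
      upcxx::finalize();
      return 0;
    }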

SLIDE 8

Asynchronous operations

  • Build a DAG of futures, synchronize on the whole rather than on the individual operations
    – Attach a callback: .then(Foo)
    – Foo is the completion handler, a function or a lambda
      • runs locally when the rget completes
      • receives arguments containing the result associated with the future

  double Foo(int x) { return sqrt(2*x); }

  global_ptr<int> gptr1;
  // ... gptr1 initialized
  future<int> f1 = rget(gptr1);
  future<double> f2 = f1.then(Foo);
  // DO SOMETHING ELSE
  double y = f2.wait();

SLIDE 9

A look under the hood of UPC++

  • Relies on GASNet-EX (https://gasnet.lbl.gov) to provide low-overhead communication
    – Efficiently utilizes the network, whatever that network may be, including any special-purpose support
    – Get/put map directly onto the network hardware’s global address support, when available
  • RPC uses an active message (AM) to enqueue the function handle remotely
    – Any return result is also transmitted via an AM
  • RPC callbacks are only executed inside a call to a UPC++ method (also a distinguished progress() method); see the sketch below
    – RPC execution is serialized at the target, and this attribute can be used to avoid explicit synchronization
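A sketch of the last point above, using an illustrative rank-local counter (rpcs_seen is my name, not the slides'): the fire-and-forget RPCs sent to rank 0 only execute when rank 0 calls into the runtime, here via upcxx::progress().

    #include <upcxx/upcxx.hpp>

    int rpcs_seen = 0;   // rank-local state, incremented by incoming RPCs

    int main() {
      upcxx::init();
      if (upcxx::rank_me() != 0)
        upcxx::rpc_ff(0, []() { rpcs_seen++; });   // enqueued at rank 0 via an active message
      if (upcxx::rank_me() == 0) {
        // RPC callbacks run only inside UPC++ calls such as progress();
        // execution is serialized at the target, so no lock is needed on rpcs_seen
        while (rpcs_seen < upcxx::rank_n() - 1)
          upcxx::progress();
      }
      upcxx::barrier();
      upcxx::finalize();
      return 0;
    }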

SLIDE 10

RMA microbenchmarks

[Figures: round-trip put latency (lower is better) and flood put bandwidth (higher is better)]

Experiments on NERSC Cori:
  • Cray XC40 system
  • Two processor partitions:
    • Intel Haswell (2 x 16 cores per node)
    • Intel KNL (1 x 68 cores per node)

Data collected on Cori Haswell

SLIDE 11

Distributed hash table – Productivity

  • Uses Remote Procedure Call (RPC)
  • RPC simplifies the distributed hash table design
  • Store a value in the distributed hash table at a remote location

[Figure: rank 0 sends an RPC carrying the key to rank get_target(key); each rank holds its hash table partition, a std::unordered_map, in private memory]

// C++ global variables correspond to rank-local state
std::unordered_map<string, string> local_map;

// insert a key-value pair and return a future
future<> dht_insert(const string &key, const string &val) {
  return upcxx::rpc(get_target(key),
                    [](string key, string val) {
                      local_map.insert({key, val});
                    },
                    key, val);
}

SLIDE 12
Distributed hash table – Performance

  • RPC+RMA implementation, higher performance (zero-copy)
  • RPC inserts the key at the target and obtains a landing-zone pointer
  • Once the RPC completes, an attached callback (.then) uses a zero-copy rput to store the associated data
  • The returned future represents the whole operation

[Figure: rank 0 calls rpc(get_target(key), F, key, len); on the target, F allocates a landing zone for data of size len, stores (key, gptr) in the local hash table (remote to the sender), and returns a global pointer loc to the landing zone; when the RPC completes, fut.then(...) issues rput(val.c_str(), loc, val.size()+1) into the landing zone]

SLIDE 13

The hash table code

// C++ global variables correspond to rank-local state
std::unordered_map<string, global_ptr<char>> local_map;

// insert a key-value pair and return a future
future<> dht_insert(const string &key, const string &val) {
  auto f1 = rpc(get_target(key),
                // RPC obtains a location for the data (lambda runs on the target)
                [](string key, size_t len) -> global_ptr<char> {
                  global_ptr<char> gptr = new_array<char>(len);
                  local_map[key] = gptr;   // insert in local map
                  return gptr;
                },
                key, val.size() + 1);
  return f1.then(
      // callback lambda executes when the RPC completes: RMA put
      [val](global_ptr<char> loc) -> future<> {
        return rput(val.c_str(), loc, val.size() + 1);
      });
}

SLIDE 14

Weak scaling of distributed hash table insertion

  • Randomly distributed keys
  • Excellent weak scaling up to 32K cores
  • RPC leads to a simplified and more efficient design
  • RPC+RMA achieves high performance at scale

[Figure: weak scaling of distributed hash table insertion on NERSC Cori Haswell]

SLIDE 15

Weak scaling of distributed hash table insertion

  • Randomly distributed keys
  • Excellent weak scaling up to 32K cores
  • RPC leads to a simplified and more efficient design
  • RPC+RMA achieves high performance at scale

[Figure: weak scaling of distributed hash table insertion on NERSC Cori Haswell and NERSC Cori KNL]

SLIDE 16

UPC++ improves sparse solver performance

  • Sparse matrix factorizations have low computational intensity and irregular communication patterns
  • The extend-add operation is an important building block for multifrontal sparse solvers
  • Sparse factors are organized as a hierarchy of condensed matrices called frontal matrices:
    • 4 sub-matrices: factors + contribution block
    • Contribution blocks are accumulated in the parent

[Figure: frontal matrices of a parent and its left and right children, each partitioned into F11, F12, F21, F22 blocks with index sets Ip, IlC, IrC; child contribution blocks are accumulated into the parent]

SLIDE 17

UPC++ improves sparse solver performance

  • Data is packed into per-destination contiguous buffers
  • Traditional MPI implementation uses MPI_Alltoallv
    ✚ Variants: MPI_Isend/MPI_Irecv + MPI_Waitall / MPI_Waitany
  • UPC++ implementation:
    ✚ RPC sends child contributions to the parent
    ✚ RPCs compare indices and accumulate contributions on the target

[Figure: child F22 contribution blocks, with index sets i1-i4, are sent via RPC and accumulated into the parent frontal matrix]

SLIDE 18

UPC++ improves sparse solver performance


Assembly trees / Frontal matrices extracted from STRUMPACK

SLIDE 19

UPC++ improves sparse solver performance


Assembly trees / Frontal matrices extracted from STRUMPACK

SLIDE 20

UPC++ = Productivity + Performance

Productivity
  • UPC++ does not prescribe solutions for implementing distributed irregular data structures: it provides building blocks
  • Interoperates with MPI, OpenMP and CUDA
  • Develop incrementally, enhance selected parts of the code

Reduced communication costs
  • Embraces communication networks that use one-sided transfers at their lowest level
  • Low overhead reduces the cost of fine-grained communication
  • Overlap communication via asynchrony and futures
  • High-performance distributed hash table
  • Increased efficiency in the extend-add operation (sparse solvers)

More advanced constructs (not discussed)
  • Remote atomics, distributed objects, teams and collectives
  • Promises, end points, generalized completion
  • Serialization, non-contiguous transfers

SLIDE 21

The Pagoda Team

  • Scott B. Baden (PI)
  • Paul H. Hargrove (co-PI)
  • John Bachan
  • Dan Bonachea
  • Mathias Jacquelin
  • Amir Kamil
  • Hadia Ahmed
  • Alumni:

    – Brian van Straalen, Steven Hofmeyr, Khaled Ibrahim

Code and documentation at http://upcxx.lbl.gov
Examples and extras available at the end of May

SLIDE 22

Acknowledgements

Early work with UPC++ involved Yili Zheng, Amir Kamil, Kathy Yelick, and others [IPDPS ’14].

This research was supported in part by the Exascale Computing Project (17-SC-20-SC), funded by the U.S. Department of Energy.

ECP collaborators: Kathy Yelick, Sherry Li, Pieter Ghysels, John Bell and Tan Nguyen (Lawrence Berkeley National Laboratory)

Academic collaborators: Alex Pöppl and Michael Bader (TUM), Niclas Jansson and Johann Hoffman (KTH), Sergio Martin (ETH Zurich), Phuong Ha (Arctic Univ. of Norway)


http://upcxx.lbl.gov

Figure courtesy Alexander Pöppl


SLIDE 23

Additional information

SLIDE 24

Related work on PGAS

  • UPC, Fortran 2008 coarrays, OpenSHMEM, Titanium
  • Fork-join model: X10, Chapel
  • DASH / DART (over MPI-3 RMA backend)
  • Coarray C++
  • Task-based models: HPX, Phalanx, Charm++, HabaneroUPC++

SLIDE 25

Differences with legacy UPC++ v0.1

  • Both implement the PGAS model
  • Different APIs; the current version avoids:
    – Implicit communication
    – Non-scalable data structures
  • The current version is based on futures/promises (similar to C++11)
    – The legacy version uses async/finish syntax (like X10, Habanero-C)
  • New functionality:
    – Futures encapsulate values; events do not
    – Futures allow attaching callbacks
    – A future’s lifetime is easier to manage than an event’s
    – RPCs can return a value; asyncs cannot

SLIDE 26

UPC++ v1.0 vs. v0.1 performance

[Figure: symPACK time (s) vs. number of processes for Flan 1565 on Cori Haswell, comparing UPC++ v0.1 and UPC++ v1.0]

  • symPACK, a supernodal solver for symmetric sparse matrices
  • Implementation based on RPC & RMA
  • Outperforms state-of-the-art solvers implemented using MPI

SLIDE 27

Where does message passing overhead come from?

  • Matching sends to receives
    – Messages have an associated context that needs to be matched to handle incoming messages correctly
    – Data movement and synchronization are coupled
  • Ordering guarantees are not semantically matched to the hardware
  • UPC++ avoids these factors that increase the overhead
    – No matching overhead between source and target
    – Executes fewer instructions to perform a transfer