UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON APPLICATIONS

March 21, 2018


OUTLINE

• Motivation and goals
• Implementation choices
• Features/API
• Performance
• Next steps


WHY PYTHON-BASED GPU COMMUNICATION?

• Python use is growing, backed by extensive libraries
• Python in data science/HPC is growing, along with GPU usage and its communication needs


IMPACT ON DATA SCIENCE

• RAPIDS uses dask-distributed for data distribution over Python sockets, which slows down all communication-bound components
• Critical to enable dask to leverage IB and NVLink

[Stack diagram: CUDA, Python, Apache Arrow, Dask, deep learning frameworks, cuDNN, RAPIDS (cuML, cuDF, cuGraph). Courtesy RAPIDS team.]


CURRENT COMMUNICATION DRAWBACKS

• Existing Python communication modules primarily rely on sockets
• Low latency / high bandwidth is critical for better system utilization of GPUs (e.g., NVLink, IB)
• Frameworks that transfer GPU data between sites make intermediate copies
• But CUDA-aware data movement is largely a solved problem in HPC!


REQUIREMENTS AND RESTRICTIONS

• Dask – a popular framework that facilitates scaling Python workloads to many nodes
  • Permits use of CUDA-based Python objects
  • Allows workers to be added and removed dynamically
  • Communication backend built around coroutines (more later)
• Why not use mpi4py then? (See the table below.)

Dimension           mpi4py
CUDA-aware?         No – makes GPU<->CPU copies
Dynamic scaling?    No – imposes MPI restrictions
Coroutine support?  No known support


GOALS

Provide a flexible communication library that:

  1. Supports CPU/GPU buffers over a range of message types: raw bytes, host objects/memoryview, cupy objects, numba objects
  2. Supports dynamic connection establishment
  3. Supports Pythonesque programming using futures, coroutines, etc. (if needed)
  4. Provides close-to-native performance from the Python world

How? – Cython, UCX


WHY UCX?

• Popular unified communication library used for MPI/PGAS implementations such as OpenMPI, MPICH, OSHMEM, etc.
• Exposes APIs for:
  • Client-server based connection establishment
  • Point-to-point, RMA, and atomics capabilities
  • Tag matching
  • Callbacks on communication events
  • Blocking/polling progress
  • CUDA-aware point-to-point communication
• C library!


PYTHON BINDING APPROACHES

• Three main candidates: SWIG, CFFI, Cython
• Problems with SWIG and CFFI:
  • Work well for small examples but not for C libraries like UCX
  • UCX's structure definitions aren't consolidated in one place
  • Tedious to populate the interface file / Python script by hand


CYTHON

• Call C functions and structures from Cython code (.pyx)
• Expose classes and functions from Python which use C underneath

# ucx_echo.py (user code)
...
ep = ucp.get_endpoint()
ep.send_obj(…)
…

# ucp_py.pyx (defined in the UCX-PY module)
cdef class ucp_endpoint:
    def send_obj(self, …):
        ucp_py_send_nb(…)

/* ucp_py_ucp_fxns.c (defined in the UCX-PY module) */
struct ctx *ucp_py_send_nb(…) {
    ucp_tag_send_nb(…);   /* defined in the UCX C library */
}
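For readers new to Cython, the generic wrapping pattern behind the .pyx layer looks roughly like this (an illustrative sketch with hypothetical header and function names, not the actual UCX-PY source):

# bindings.pyx — generic Cython wrapping pattern (hypothetical names)
cdef extern from "my_c_library.h":
    # Declare the C function so Cython can call it directly
    int my_c_send(char *buf, size_t length)

def send_bytes(bytes data):
    # Python-visible wrapper; Cython coerces the bytes object to char*
    cdef char *buf = data
    return my_c_send(buf, len(data))

Compiling this .pyx file produces an extension module, so Python callers get a plain function while the hot path runs in C.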


UCX-PY STACK

[Layer diagram, top to bottom:]
• UCX-PY (.py)
• Object metadata extraction / UCX C-wrappers (.pyx)
• Resource management / callback handling / UCX calls (.c)
• UCX C library


COROUTINES

• Co-operative, concurrent functions
• Suspended when they read/write from disk, perform communication, sleep, etc.
• A scheduler/event loop manages execution of all coroutines
• Utilization of the single thread increases

# Sequential version
import time

def zzz(i):
    print("start", i)
    time.sleep(2)
    print("finish", i)

def main():
    zzz(1)
    zzz(2)

main()

# Output:
# start 1   # t = 0
# finish 1  # t = 2
# start 2   # t = 2 + Δ
# finish 2  # t = 4 + Δ

# Coroutine version
import asyncio

async def zzz(i):
    print("start", i)
    await asyncio.sleep(2)
    print("finish", i)

f = asyncio.create_task

async def main():
    task1 = f(zzz(1))
    task2 = f(zzz(2))
    await task1
    await task2

asyncio.run(main())

# Output:
# start 1   # t = 0
# start 2   # t = 0 + Δ
# finish 1  # t = 2
# finish 2  # t = 2 + Δ


UCX-PY CONNECTION ESTABLISHMENT API

• Dynamic connection establishment:
  • .start_listener(accept_cb, port, is_coroutine): server creates a listener
  • .get_endpoint(ip, port): client connects
• Multiple listeners allowed; multiple endpoints to a server allowed

# Server
async def accept_cb(ep, …):
    …
    await ep.send_obj()
    …
    await ep.recv_obj()
    …

ucp.start_listener(accept_cb, port, is_coroutine=True)

# Client
async def talk_to_client():
    ep = ucp.get_endpoint(ip, port)
    …
    await ep.recv_obj()
    …
    await ep.send_obj()
    …


UCX-PY CONNECTION ESTABLISHMENT

[Diagram: the server calls ucp.start_listener() and enters a listening state; the client calls ucp.get_endpoint(); the server accepts the connection and invokes the accept_cb() callback.]


UCX-PY DATA MOVEMENT API

• Send data (on an endpoint):
  • .send_*(): raw bytes, host objects (numpy), CUDA objects (cupy, numba)
• Receive data (on an endpoint):
  • .recv_obj(): pass an object as argument where data is received
  • .recv_future(): 'blind' receive; takes no input and returns the received object; low performance

# Server
async def accept_cb(ep, …):
    …
    await ep.send_obj(cupy.array([42]))
    …

# Client
async def talk_to_client():
    ep = ucp.get_endpoint(ip, port)
    …
    rr = await ep.recv_future()
    msg = ucp.get_obj_from_msg(rr)
    …
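Putting connection establishment and data movement together, a minimal echo exchange might look like the sketch below. It assumes the ucp API shown on these slides; the module import name, ucp.init(), the port number, and the callback's second argument are assumptions, and error handling/cleanup are omitted:

import asyncio
import ucp_py as ucp   # import name is an assumption

PORT = 13337           # hypothetical port

async def accept_cb(ep, listener):   # second argument assumed
    # Server side: 'blind' receive, then echo the object back
    rr = await ep.recv_future()
    msg = ucp.get_obj_from_msg(rr)
    await ep.send_obj(msg)

async def talk_to_server(ip):
    # Client side: connect, send raw bytes, receive the echo
    ep = ucp.get_endpoint(ip, PORT)
    await ep.send_obj(b"ping")
    rr = await ep.recv_future()
    print(ucp.get_obj_from_msg(rr))

# Server process: ucp.init(); ucp.start_listener(accept_cb, PORT, is_coroutine=True)
# Client process: ucp.init(); asyncio.get_event_loop().run_until_complete(talk_to_server(server_ip))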


UCX-PY DATA MOVEMENT SEMANTICS

• Send/recv operations are non-blocking by default
• Issuing an operation returns a future
• Calling await on the future, or calling future.result(), blocks until completion (sketched below)
• Caveat: a limited number of object types has been tested (memoryview, numpy, cupy, numba)
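As an illustration, the two completion styles might look like this (a sketch against the API from the previous slides, not verbatim library code):

async def send_overlapped(ep, obj):
    # Coroutine style: suspend here and let the event loop run other work
    await ep.send_obj(obj)

def send_blocking(ep, obj):
    # Traditional HPC style: issue the non-blocking send, then wait on the future
    fut = ep.send_obj(obj)   # returns a future immediately
    fut.result()             # blocks the caller until the transfer completes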


UNDER THE HOOD

UCX depends on event notification to keep the main thread from constantly polling.

Layer                  UCX calls
Connection management  ucp_{listener/ep}_create
Issuing data movement  ucp_tag_{send/recv/probe}_nb
Request progress       ucp_worker_{arm/signal/progress}

• UCX-PY blocking progress relies on UCX's event notification mechanism:
  • Completion-queue events from the IB event channel
  • Read/write events from sockets
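A sketch of how blocking progress can plug into an event loop follows; the Python-level worker methods (event_fd, arm, progress) are hypothetical wrappers over the ucp_worker_* calls named above:

import asyncio

def on_worker_event(worker):
    # Drain completed work, then re-arm the worker so the next network
    # event wakes the loop instead of requiring a busy poll
    while worker.progress():   # hypothetical wrapper over ucp_worker_progress
        pass
    worker.arm()               # hypothetical wrapper over ucp_worker_arm

def install_blocking_progress(worker, loop=None):
    loop = loop or asyncio.get_event_loop()
    # Register the worker's epoll-able fd with the loop (fd access is assumed)
    loop.add_reader(worker.event_fd(), on_worker_event, worker)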


EXPERIMENTAL TESTBED

Hardware (2 nodes):
• Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
• Tesla V100-SXM2 (CUDA 9.2.88, driver version 410.48)
• ConnectX-4 Mellanox HCAs (OFED-internal-4.0-1.0.1)

Software: UCX 1.5, Python 3.7.1

Case             UCX progress mode                    Python functions
Latency bound    Polling                              Regular
Bandwidth bound  Blocking (event-notification based)  Coroutines


HOST MEMORY LATENCY

[Charts: Short Message Latency (1 B–8 KB) and Large Message Latency (16 KB–4 MB), latency in µs vs. message size, native-UCX vs. python-UCX. Latency-bound host transfers.]


DEVICE MEMORY LATENCY

[Charts: Short Message Latency (1 B–8 KB) and Large Message Latency (16 KB–4 MB), latency in µs vs. message size, native-UCX vs. python-UCX. Latency-bound device transfers.]


DEVICE MEMORY BANDWIDTH

Bandwidth-bound transfers (cupy)

Message size   10MB   20MB   40MB   80MB   160MB   320MB
cupy (GB/s)    6.15   7.36   8.85   9.86   10.47   10.85
native (GB/s)  8.7    9.7    10.3   10.7   10.9    11.03


NEXT STEPS

• Performance:
  • Validate dask-distributed over UCX-PY with dask-cuda workloads
  • Objects that have mixed physical backing (CPU and GPU)
  • Adding blocking support to the NVLink-based UCT
  • Non-contiguous data transfers
• Integration into dask-distributed underway (https://github.com/TomAugspurger/distributed/commits/ucx+data-handling)
• Current implementation (https://github.com/Akshay-Venkatesh/ucx/tree/topic/py-bind)
• Push to the UCX project underway (https://github.com/openucx/ucx/pull/3165)


SUMMARY

• UCX-PY is a flexible communication library
  • Provides Python developers a way to leverage high-speed interconnects like IB
• Supports a Pythonesque way of overlapping communication with other coroutines
  • Or can be non-overlapped, as in traditional HPC
• Supports data movement of objects residing in CPU memory or in GPU memory
  • Users needn't explicitly copy GPU<->CPU
• UCX-PY is close to native performance for the major use-case range


BIG PICTURE

UCX-PY will serve as a high-performance communication module for dask

[Stack diagram: the RAPIDS stack from earlier (CUDA, Python, Apache Arrow, Dask, deep learning frameworks, cuDNN, RAPIDS: cuML, cuDF, cuGraph), now with UCX-PY as the communication layer.]
