  1. UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON APPLICATIONS March 21, 2018

  2. OUTLINE
     - Motivation and goals
     - Implementation choices
     - Features/API
     - Performance
     - Next steps

  3. WHY PYTHON-BASED GPU COMMUNICATION?
     - Python use is growing, backed by extensive libraries
     - Python in data science/HPC is growing
     - Plus growing GPU usage and communication needs

  4. IMPACT ON DATA SCIENCE
     - RAPIDS uses dask-distributed for data distribution over Python sockets, which slows down all communication-bound components
     - Critical to give Dask the ability to leverage IB and NVLINK
     [Stack diagram: Python / deep learning frameworks / RAPIDS (Dask, cuDF, cuML, cuGraph) / cuDNN / CUDA / Apache Arrow - courtesy RAPIDS team]

  5. CURRENT COMMUNICATION DRAWBACKS
     - Existing Python communication modules primarily rely on sockets
     - Low latency / high bandwidth is critical for better system utilization of GPUs (e.g., NVLINK, IB)
     - Frameworks that transfer GPU data between sites make copies
     - But CUDA-aware data movement is largely solved in HPC!

  6. REQUIREMENTS AND RESTRICTIONS
     Dask, a popular framework, facilitates scaling Python workloads to many nodes:
     - Permits use of CUDA-based Python objects
     - Allows workers to be added and removed dynamically
     - Communication backend built around coroutines (more later)
     Why not use mpi4py then?

     Dimension           | mpi4py
     --------------------|---------------------------------
     CUDA-aware?         | No - makes GPU<->CPU copies
     Dynamic scaling?    | No - imposes MPI restrictions
     Coroutine support?  | No known support

  7. GOALS
     Provide a flexible communication library that:
     1. Supports CPU/GPU buffers over a range of message types - raw bytes, host objects (memoryview), cupy objects, numba objects
     2. Supports dynamic connection capability
     3. Supports Pythonic programming using futures, coroutines, etc. (if needed)
     4. Provides close-to-native performance from the Python world
     How? - Cython, UCX
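
     Goal 1 mentions host objects via memoryview. As a self-contained illustration of why memoryview matters for a communication library, here is a minimal sketch (standard library only, no UCX involved) of zero-copy access to a host buffer:

```python
# Zero-copy host buffers with memoryview: a communication layer can
# read from or write into an existing buffer without extra copies.
buf = bytearray(b"hello world")
view = memoryview(buf)

# Slicing a memoryview does not copy the underlying bytes.
sub = view[6:]
assert bytes(sub) == b"world"

# Writing through the view mutates the original buffer in place,
# which is how a receive call can fill a caller-provided object.
view[0:5] = b"HELLO"
assert buf == bytearray(b"HELLO world")
```

     This is the property that lets a send or receive operate directly on caller-owned memory.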

  8. OUTLINE
     - Motivation and goals
     - Implementation choices
     - Features/API
     - Performance
     - Next steps

  9. WHY UCX?
     - Popular unified communication library used for MPI/PGAS implementations such as OpenMPI, MPICH, OSHMEM, etc.
     - Exposes API for:
       - Client-server based connection establishment
       - Point-to-point, RMA, and atomics capabilities
       - Tag matching
       - Callbacks on communication events
       - Blocking/polling progress
     - CUDA-aware point-to-point communication
     - C library!

  10. PYTHON BINDING APPROACHES
     Three main considerations: SWIG, CFFI, Cython
     Problems with SWIG and CFFI:
     - Work well for small examples but not for C libraries
     - UCX's structure definitions aren't consolidated
     - Tedious to populate the interface file / Python script by hand

  11. CYTHON
     - Call C functions and structures from Cython code (.pyx)
     - Expose classes and functions from Python which can use C underneath
     Call path (ucx_echo.py -> ucp_py.pyx -> ucp_py_ucp_fxns.c):
     - ucx_echo.py:        ep = ucp.get_endpoint(...); ep.send_obj(...)
     - ucp_py.pyx:         cdef class ucp_endpoint, with def send_obj(...) calling ucp_py_send_nb(...)   [defined in the UCX-PY module]
     - ucp_py_ucp_fxns.c:  struct ctx *ucp_py_send_nb() { ... ucp_tag_send_nb(...); ... }               [ucp_tag_send_nb defined in the UCX C library]
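
     Cython itself needs a build step, so as a rough, runnable stand-in for the Python-to-C call path sketched above, the standard-library ctypes module can call a C library directly. This is a minimal sketch using libc's strlen, not UCX, and it only illustrates the layering idea, not the actual UCX-PY binding:

```python
import ctypes
import ctypes.util

# Load the C standard library; CDLL(None) falls back to the symbols
# already linked into the interpreter (works on Linux).
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# A Python call that executes C code underneath - the same layering
# idea UCX-PY uses (Python API -> wrapper layer -> UCX C library).
n = libc.strlen(b"ucx-py")
assert n == 6
```

     Cython plays the same role as the ctypes declarations here, but generates compiled C wrappers instead of resolving symbols at runtime.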

  12. UCX-PY STACK
     - UCX-PY (.py)
     - Object metadata extraction / UCX C wrappers (.pyx)
     - Resource management / callback handling / UCX calls (.c)
     - UCX C LIBRARY

  13. OUTLINE
     - Motivation and goals
     - Implementation choices
     - Features/API
     - Performance
     - Next steps

  14. COROUTINES
     Co-operative concurrent functions: a coroutine yields control when it reads/writes from disk, performs communication, sleeps, etc. A scheduler/event loop manages execution of all coroutines, so single-thread utilization increases.

     Synchronous version:

         def zzz(i):
             print("start", i)
             time.sleep(2)
             print("finish", i)

         def main():
             zzz(1)
             zzz(2)

         main()

     Output:
         start 1   # t = 0
         finish 1  # t = 2
         start 2   # t = 2 + Δ
         finish 2  # t = 4 + Δ

     Coroutine version:

         async def zzz(i):
             print("start", i)
             await asyncio.sleep(2)
             print("finish", i)

         async def main():
             task1 = asyncio.create_task(zzz(1))
             task2 = asyncio.create_task(zzz(2))
             await task1
             await task2

         asyncio.run(main())

     Output:
         start 1   # t = 0
         start 2   # t = 0 + Δ
         finish 1  # t = 2
         finish 2  # t = 2 + Δ
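
     The timeline on this slide can be checked directly: with coroutines the two sleeps overlap, so total wall time stays near one delay rather than two. A runnable version (shorter delays than the slide's 2 s, to keep the demo quick):

```python
import asyncio
import time

log = []

async def zzz(i, delay):
    log.append(("start", i))
    await asyncio.sleep(delay)   # yields to the event loop instead of blocking
    log.append(("finish", i))

async def main():
    # Both coroutines are scheduled before either finishes sleeping,
    # so their delays overlap on a single thread.
    t0 = time.monotonic()
    await asyncio.gather(zzz(1, 0.2), zzz(2, 0.2))
    return time.monotonic() - t0

elapsed = asyncio.run(main())

# Interleaved order: both starts happen before either finish.
assert log == [("start", 1), ("start", 2), ("finish", 1), ("finish", 2)]
# Overlapped: roughly 0.2 s total, not the 0.4 s a sequential run would take.
assert elapsed < 0.35
```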

  15. UCX-PY CONNECTION ESTABLISHMENT API
     Dynamic connection establishment:
     - .start_listener(accept_cb, port, is_coroutine): server creates a listener
     - .get_endpoint(ip, port): client connects
     - Multiple listeners allowed; multiple endpoints to a server allowed

     Server:

         async def accept_cb(ep, ...):
             ...
             await ep.send_obj(...)
             await ep.recv_obj(...)
             ...

         ucp.start_listener(accept_cb, port, is_coroutine=True)

     Client:

         async def talk_to_client():
             ep = ucp.get_endpoint(ip, port)
             await ep.recv_obj(...)
             await ep.send_obj(...)
             ...

  16. UCX-PY CONNECTION ESTABLISHMENT
     [Diagram: the server calls ucp.start_listener() and sits in a listening state; the client calls ucp.get_endpoint(); the server accepts the connection and invokes the callback accept_cb()]
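
     Running ucx-py itself requires a UCX installation, but the same listener-callback and endpoint pattern can be sketched with the standard asyncio streams API: asyncio.start_server plays the role of ucp.start_listener (with a per-connection callback), and asyncio.open_connection plays the role of ucp.get_endpoint. A self-contained echo example:

```python
import asyncio

async def accept_cb(reader, writer):
    # Server-side callback, invoked once per accepted connection
    # (the role ucp.start_listener's accept_cb plays).
    data = await reader.read(100)
    writer.write(data.upper())       # echo back, transformed
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def talk_to_server(port):
    # Client side: connect, send, receive (the role of ucp.get_endpoint).
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"ping")
    await writer.drain()
    data = await reader.read(100)
    writer.close()
    await writer.wait_closed()
    return data

async def main():
    # Port 0 lets the OS pick a free port, like a dynamic listener.
    server = await asyncio.start_server(accept_cb, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    data = await talk_to_server(port)
    server.close()
    await server.wait_closed()
    return data

reply = asyncio.run(main())
assert reply == b"PING"
```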

  17. UCX-PY DATA MOVEMENT API
     Send data (on endpoint):
     - .send_*(): raw bytes, host objects (numpy), cuda objects (cupy, numba)
     Receive data (on endpoint):
     - .recv_obj(): pass an object as argument where data is received
     - .recv_future() ('blind'): no input; returns the received object; low performance

     Client:

         async def talk_to_client():
             ep = ucp.get_endpoint(ip, port)
             await ep.send_obj(cupy.array([42]))
             ...

     Server:

         async def accept_cb(ep, ...):
             rr = await ep.recv_future()
             msg = ucp.get_obj_from_msg(rr)
             ...

  18. UCX-PY DATA MOVEMENT SEMANTICS
     - Send/recv operations are non-blocking by default
     - Issuing the operation returns a future
     - Calling await on the future, or calling future.result(), blocks until completion
     - Caveat: limited number of object types tested - memoryview, numpy, cupy, and numba
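
     The issue-now/complete-later semantics can be sketched with plain asyncio tasks (fake_send below is a hypothetical stand-in for a non-blocking send, not a ucx-py call): issuing returns a future-like handle immediately, and awaiting it blocks only until completion.

```python
import asyncio

async def fake_send(data, delay=0.05):
    # Hypothetical stand-in for a non-blocking send: the "transfer"
    # completes some time after it is issued.
    await asyncio.sleep(delay)
    return len(data)

async def main():
    # Issuing the operation returns immediately with a handle...
    task = asyncio.create_task(fake_send(b"payload"))
    assert not task.done()           # not complete yet: issue was non-blocking
    # ...other coroutines could run here...
    n = await task                   # await blocks until completion
    assert task.done()
    return n

nbytes = asyncio.run(main())
assert nbytes == 7                   # len(b"payload")
```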

  19. UNDER THE HOOD

     Layer                  | UCX calls
     -----------------------|----------------------------------
     Connection management  | ucp_{listener/ep}_create
     Issuing data movement  | ucp_tag_{send/recv/probe}_nb
     Request progress       | ucp_worker_{arm/signal/progress}

     UCX-PY blocking progress relies on event notification to keep the main thread from constantly polling:
     - Read/write events from sockets
     - Completion queue events from the IB event channel
     - Delivered through the UCX event notification mechanism
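
     The event-notification idea - sleep until a file descriptor becomes readable instead of spinning - can be sketched with the standard selectors module over a socketpair standing in for the transport's event fd. This only illustrates the pattern; UCX's arm/signal/progress mechanism wraps its own descriptors:

```python
import selectors
import socket

# A connected socket pair stands in for the transport's event fd.
left, right = socket.socketpair()

sel = selectors.DefaultSelector()
sel.register(right, selectors.EVENT_READ)

# Nothing has been sent yet: a zero-timeout poll reports no events.
# A blocking select() here would put the thread to sleep (no busy loop).
assert sel.select(timeout=0) == []

left.send(b"x")                      # the "completion event" arrives

events = sel.select(timeout=1)       # wakes up only now
assert len(events) == 1
assert events[0][0].fileobj is right
received = right.recv(1)
assert received == b"x"

sel.close()
left.close()
right.close()
```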

  20. OUTLINE
     - Motivation and goals
     - Implementation choices
     - Features/API
     - Performance
     - Next steps

  21. EXPERIMENTAL TESTBED
     Hardware (2 nodes):
     - Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
     - Tesla V100-SXM2 (CUDA 9.2.88, driver version 410.48)
     - ConnectX-4 Mellanox HCAs (OFED-internal-4.0-1.0.1)
     Software: UCX 1.5, Python 3.7.1

     Case            | UCX progress mode                   | Python functions
     ----------------|-------------------------------------|-----------------
     Latency bound   | polling                             | regular
     Bandwidth bound | blocking (event-notification based) | coroutines

  22. HOST MEMORY LATENCY
     Latency-bound host transfers
     [Charts: Short Message Latency and Large Message Latency, latency (us) vs. message size (1 B to 4 MB), comparing native-UCX and python-UCX]

  23. DEVICE MEMORY LATENCY
     Latency-bound device transfers
     [Charts: Short Message Latency and Large Message Latency, latency (us) vs. message size (1 B to 4 MB), comparing native-UCX and python-UCX]

  24. DEVICE MEMORY BANDWIDTH
     Bandwidth-bound transfers (cupy)
     [Chart: bandwidth (GB/s) vs. message size (10 MB, 20 MB, 40 MB, 80 MB, 160 MB, 320 MB), comparing cupy over UCX-PY against native UCX; paired readings per size: 11.03/10.9, 10.85/10.7, 10.47/10.3, 9.86/9.7, 8.85/8.7, 7.36/6.15]

  25. OUTLINE
     - Motivation and goals
     - Implementation choices
     - Features/API
     - Performance
     - Next steps

  26. NEXT STEPS
     Performance:
     - Validate dask-distributed over UCX-PY with dask-cuda workloads
     - Objects that have mixed physical backing (CPU and GPU)
     - Adding blocking support to NVLINK-based UCT
     - Non-contiguous data transfers
     Integration into dask-distributed underway (https://github.com/TomAugspurger/distributed/commits/ucx+data-handling)
     Current implementation (https://github.com/Akshay-Venkatesh/ucx/tree/topic/py-bind)
     Push to UCX project underway (https://github.com/openucx/ucx/pull/3165)

  27. SUMMARY
     - UCX-PY is a flexible communication library
     - Provides Python developers a way to leverage high-speed interconnects like IB
     - Supports a Pythonic style of overlapping communication with other coroutines, or non-overlapped use as in traditional HPC
     - Supports data movement of objects residing in CPU memory or GPU memory; users needn't explicitly copy GPU<->CPU
     - UCX-PY is close to native performance for the major range of use cases

  28. BIG PICTURE
     UCX-PY will serve as a high-performance communication module for Dask.
     [Stack diagram: Python / deep learning frameworks / RAPIDS (Dask over UCX-PY, cuDF, cuML, cuGraph) / cuDNN / CUDA / Apache Arrow]
