UCX-PYTHON: A FLEXIBLE COMMUNICATION LIBRARY FOR PYTHON APPLICATIONS

March 21, 2018


OUTLINE

• Motivation and goals
• Implementation choices
• Features/API
• Performance
• Next steps


WHY PYTHON-BASED GPU COMMUNICATION?

• Python use is growing, backed by extensive libraries
• Python in data science/HPC is growing, along with GPU usage and its communication needs


IMPACT ON DATA SCIENCE

• RAPIDS uses dask-distributed for data distribution over Python sockets, which slows down all communication-bound components
• Critical to enable dask to leverage IB and NVLink

[Stack diagram: CUDA, Python, Apache Arrow, Dask, deep learning frameworks, cuDNN, RAPIDS (cuML, cuDF, cuGraph). Courtesy RAPIDS team.]


CURRENT COMMUNICATION DRAWBACKS

• Existing Python communication modules primarily rely on sockets
• Low latency / high bandwidth is critical for better system utilization of GPUs (e.g., NVLink, IB)
• Frameworks that transfer GPU data between sites make intermediate copies
• But CUDA-aware data movement is largely a solved problem in HPC!


REQUIREMENTS AND RESTRICTIONS

• Dask – a popular framework that facilitates scaling Python workloads to many nodes
  • Permits use of CUDA-based Python objects
  • Allows workers to be added and removed dynamically
  • Communication backend built around coroutines (more later)
• Why not use mpi4py then? (See the table below.)

Dimension           mpi4py
CUDA-aware?         No – makes GPU<->CPU copies
Dynamic scaling?    No – imposes MPI restrictions
Coroutine support?  No known support


GOALS

Provide a flexible communication library that:

  1. Supports CPU/GPU buffers over a range of message types: raw bytes, host objects/memoryview, cupy objects, numba objects
  2. Supports dynamic connection establishment
  3. Supports Pythonesque programming using futures, coroutines, etc. (if needed)
  4. Provides close-to-native performance from the Python world

How? – Cython, UCX


WHY UCX?

• Popular unified communication library used for MPI/PGAS implementations such as OpenMPI, MPICH, OSHMEM, etc.
• Exposes APIs for:
  • Client-server based connection establishment
  • Point-to-point, RMA, and atomics capabilities
  • Tag matching
  • Callbacks on communication events
  • Blocking/polling progress
  • CUDA-aware point-to-point communication
• C library!


PYTHON BINDING APPROACHES

• Three main candidates: SWIG, CFFI, Cython
• Problems with SWIG and CFFI:
  • Work well for small examples but not for C libraries like UCX
  • UCX's structure definitions aren't consolidated in one place
  • Tedious to populate the interface file / Python script by hand


CYTHON

• Call C functions and structures from Cython code (.pyx)
• Expose classes and functions from Python which use C underneath

# ucx_echo.py (user code)
...
ep = ucp.get_endpoint()
ep.send_obj(…)
…

# ucp_py.pyx (defined in the UCX-PY module)
cdef class ucp_endpoint:
    def send_obj(self, …):
        ucp_py_send_nb(…)

/* ucp_py_ucp_fxns.c (defined in the UCX-PY module) */
struct ctx *ucp_py_send_nb(…) {
    ucp_tag_send_nb(…);   /* defined in the UCX C library */
}
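For readers new to Cython, the generic wrapping pattern behind the .pyx layer looks roughly like this (an illustrative sketch with hypothetical header and function names, not the actual UCX-PY source):

# bindings.pyx — generic Cython wrapping pattern (hypothetical names)
cdef extern from "my_c_library.h":
    # Declare the C function so Cython can call it directly
    int my_c_send(char *buf, size_t length)

def send_bytes(bytes data):
    # Python-visible wrapper; Cython coerces the bytes object to char*
    cdef char *buf = data
    return my_c_send(buf, len(data))

Compiling this .pyx file produces an extension module, so Python callers get a plain function while the hot path runs in C.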


UCX-PY STACK

[Layer diagram, top to bottom:]
• UCX-PY (.py)
• Object metadata extraction / UCX C-wrappers (.pyx)
• Resource management / callback handling / UCX calls (.c)
• UCX C library


COROUTINES

• Co-operative, concurrent functions
• Suspended when they read/write from disk, perform communication, sleep, etc.
• A scheduler/event loop manages execution of all coroutines
• Utilization of the single thread increases

# Sequential version
import time

def zzz(i):
    print("start", i)
    time.sleep(2)
    print("finish", i)

def main():
    zzz(1)
    zzz(2)

main()

# Output:
# start 1   # t = 0
# finish 1  # t = 2
# start 2   # t = 2 + Δ
# finish 2  # t = 4 + Δ

# Coroutine version
import asyncio

async def zzz(i):
    print("start", i)
    await asyncio.sleep(2)
    print("finish", i)

f = asyncio.create_task

async def main():
    task1 = f(zzz(1))
    task2 = f(zzz(2))
    await task1
    await task2

asyncio.run(main())

# Output:
# start 1   # t = 0
# start 2   # t = 0 + Δ
# finish 1  # t = 2
# finish 2  # t = 2 + Δ


UCX-PY CONNECTION ESTABLISHMENT API

• Dynamic connection establishment:
  • .start_listener(accept_cb, port, is_coroutine): server creates a listener
  • .get_endpoint(ip, port): client connects
• Multiple listeners allowed; multiple endpoints to a server allowed

# Server
async def accept_cb(ep, …):
    …
    await ep.send_obj()
    …
    await ep.recv_obj()
    …

ucp.start_listener(accept_cb, port, is_coroutine=True)

# Client
async def talk_to_client():
    ep = ucp.get_endpoint(ip, port)
    …
    await ep.recv_obj()
    …
    await ep.send_obj()
    …


UCX-PY CONNECTION ESTABLISHMENT

[Diagram: the server calls ucp.start_listener() and enters a listening state; the client calls ucp.get_endpoint(); the server accepts the connection and invokes the accept_cb() callback.]


UCX-PY DATA MOVEMENT API

• Send data (on an endpoint):
  • .send_*(): raw bytes, host objects (numpy), CUDA objects (cupy, numba)
• Receive data (on an endpoint):
  • .recv_obj(): pass an object as argument where data is received
  • .recv_future(): 'blind' receive; takes no input and returns the received object; low performance

# Server
async def accept_cb(ep, …):
    …
    await ep.send_obj(cupy.array([42]))
    …

# Client
async def talk_to_client():
    ep = ucp.get_endpoint(ip, port)
    …
    rr = await ep.recv_future()
    msg = ucp.get_obj_from_msg(rr)
    …
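Putting connection establishment and data movement together, a minimal echo exchange might look like the sketch below. It assumes the ucp API shown on these slides; the module import name, ucp.init(), the port number, and the callback's second argument are assumptions, and error handling/cleanup are omitted:

import asyncio
import ucp_py as ucp   # import name is an assumption

PORT = 13337           # hypothetical port

async def accept_cb(ep, listener):   # second argument assumed
    # Server side: 'blind' receive, then echo the object back
    rr = await ep.recv_future()
    msg = ucp.get_obj_from_msg(rr)
    await ep.send_obj(msg)

async def talk_to_server(ip):
    # Client side: connect, send raw bytes, receive the echo
    ep = ucp.get_endpoint(ip, PORT)
    await ep.send_obj(b"ping")
    rr = await ep.recv_future()
    print(ucp.get_obj_from_msg(rr))

# Server process: ucp.init(); ucp.start_listener(accept_cb, PORT, is_coroutine=True)
# Client process: ucp.init(); asyncio.get_event_loop().run_until_complete(talk_to_server(server_ip))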


UCX-PY DATA MOVEMENT SEMANTICS

• Send/recv operations are non-blocking by default
• Issuing an operation returns a future
• Calling await on the future, or calling future.result(), blocks until completion (sketched below)
• Caveat: a limited number of object types has been tested (memoryview, numpy, cupy, numba)
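As an illustration, the two completion styles might look like this (a sketch against the API from the previous slides, not verbatim library code):

async def send_overlapped(ep, obj):
    # Coroutine style: suspend here and let the event loop run other work
    await ep.send_obj(obj)

def send_blocking(ep, obj):
    # Traditional HPC style: issue the non-blocking send, then wait on the future
    fut = ep.send_obj(obj)   # returns a future immediately
    fut.result()             # blocks the caller until the transfer completes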


UNDER THE HOOD

UCX depends on event notification to keep the main thread from constantly polling.

Layer                  UCX calls
Connection management  ucp_{listener/ep}_create
Issuing data movement  ucp_tag_{send/recv/probe}_nb
Request progress       ucp_worker_{arm/signal/progress}

• UCX-PY blocking progress relies on UCX's event notification mechanism:
  • Completion-queue events from the IB event channel
  • Read/write events from sockets
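A sketch of how blocking progress can plug into an event loop follows; the Python-level worker methods (event_fd, arm, progress) are hypothetical wrappers over the ucp_worker_* calls named above:

import asyncio

def on_worker_event(worker):
    # Drain completed work, then re-arm the worker so the next network
    # event wakes the loop instead of requiring a busy poll
    while worker.progress():   # hypothetical wrapper over ucp_worker_progress
        pass
    worker.arm()               # hypothetical wrapper over ucp_worker_arm

def install_blocking_progress(worker, loop=None):
    loop = loop or asyncio.get_event_loop()
    # Register the worker's epoll-able fd with the loop (fd access is assumed)
    loop.add_reader(worker.event_fd(), on_worker_event, worker)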


EXPERIMENTAL TESTBED

Hardware (2 nodes):
• Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
• Tesla V100-SXM2 (CUDA 9.2.88, driver version 410.48)
• ConnectX-4 Mellanox HCAs (OFED-internal-4.0-1.0.1)

Software: UCX 1.5, Python 3.7.1

Case             UCX progress mode                    Python functions
Latency bound    Polling                              Regular
Bandwidth bound  Blocking (event-notification based)  Coroutines


HOST MEMORY LATENCY

[Charts: Short Message Latency (1 B–8 KB) and Large Message Latency (16 KB–4 MB), latency in µs vs. message size, native-UCX vs. python-UCX. Latency-bound host transfers.]


DEVICE MEMORY LATENCY

[Charts: Short Message Latency (1 B–8 KB) and Large Message Latency (16 KB–4 MB), latency in µs vs. message size, native-UCX vs. python-UCX. Latency-bound device transfers.]


DEVICE MEMORY BANDWIDTH

Bandwidth-bound transfers (cupy)

Message size   10MB   20MB   40MB   80MB   160MB   320MB
cupy (GB/s)    6.15   7.36   8.85   9.86   10.47   10.85
native (GB/s)  8.7    9.7    10.3   10.7   10.9    11.03


NEXT STEPS

• Performance:
  • Validate dask-distributed over UCX-PY with dask-cuda workloads
  • Objects that have mixed physical backing (CPU and GPU)
  • Adding blocking support to the NVLink-based UCT
  • Non-contiguous data transfers
• Integration into dask-distributed underway (https://github.com/TomAugspurger/distributed/commits/ucx+data-handling)
• Current implementation (https://github.com/Akshay-Venkatesh/ucx/tree/topic/py-bind)
• Push to the UCX project underway (https://github.com/openucx/ucx/pull/3165)


SUMMARY

• UCX-PY is a flexible communication library
  • Provides Python developers a way to leverage high-speed interconnects like IB
• Supports a Pythonesque way of overlapping communication with other coroutines
  • Or can be non-overlapped, as in traditional HPC
• Supports data movement of objects residing in CPU memory or in GPU memory
  • Users needn't explicitly copy GPU<->CPU
• UCX-PY is close to native performance for the major use-case range


BIG PICTURE

UCX-PY will serve as a high-performance communication module for dask

[Stack diagram: the RAPIDS stack from earlier (CUDA, Python, Apache Arrow, Dask, deep learning frameworks, cuDNN, RAPIDS: cuML, cuDF, cuGraph), now with UCX-PY as the communication layer.]
