Dask
extending Python data tools for parallel and distributed computing
Joris Van den Bossche - FOSDEM 2017
1 / 29
Python's scientific/data tools ecosystem
(Figure thanks to Jake VanderPlas)
2 / 29
3 / 29
Provides high-performance, easy-to-use data structures and tools
Widely used for practical data analysis in Python
Suited for tabular data (e.g. column data, spreadsheets, databases)
import pandas as pd

df = pd.read_csv("myfile.csv")
subset = df[df['value'] > 0]
subset.groupby('key').mean()
4 / 29
5 / 29
6 / 29
7 / 29
8 / 29
Parallel and out-of-core array library
Mirrors the NumPy interface
Coordinates many NumPy arrays into a single logical Dask array
(figure: a NumPy array vs. a Dask array composed of chunked NumPy arrays)
9 / 29
import numpy as np
x = np.random.random(...)
u, s, v = np.linalg.svd(x.dot(x.T))

import dask.array as da
x = da.random.random(..., chunks=(1000, 1000))
u, s, v = da.linalg.svd(x.dot(x.T))
10 / 29
(figure: a Dask DataFrame as monthly Pandas DataFrame partitions, January through May 2016)
Parallel and out-of-core dataframe library
Mirrors the Pandas interface
Coordinates many Pandas DataFrames into a single logical Dask DataFrame
Index is (optionally) sorted, allowing for optimizations
11 / 29
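A sketch of the optimization the sorted index enables: a Dask DataFrame stores its partition boundary values ("divisions"), so a lookup by index value can jump straight to the right partition with a binary search instead of scanning every partition. This is a hypothetical stdlib-only illustration of the idea, not Dask's actual code; the `divisions` values and `partition_for` helper are made up for the example.

```python
import bisect

# Boundary index values of 5 monthly partitions (illustrative).
divisions = ['2016-01-01', '2016-02-01', '2016-03-01',
             '2016-04-01', '2016-05-01', '2016-06-01']

def partition_for(index_value):
    """Return which partition holds `index_value` (hypothetical helper)."""
    i = bisect.bisect_right(divisions, index_value) - 1
    # Clamp to the valid partition range [0, n_partitions - 1].
    return min(max(i, 0), len(divisions) - 2)

print(partition_for('2016-03-15'))  # partition 2 (the March partition)
```

With sorted divisions, index-based selections and joins only touch the partitions that can actually contain the requested values.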
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
res = df.groupby('user_id').mean()

import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
res = df.groupby('user_id').mean()
res.compute()
12 / 29
13 / 29
x = da.ones((15, 15), chunks=(5, 5))
x.sum(axis=0)
14 / 29
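The blocked computation behind the sum above can be sketched in pure Python: sum each (5, 5) chunk along axis 0 independently, then add the per-chunk partial results. This is a hypothetical stand-in using nested lists instead of NumPy arrays, just to make the chunked structure visible; the helper names are invented for the example.

```python
def column_sums(block):
    """Sum a list-of-rows block along axis 0."""
    return [sum(col) for col in zip(*block)]

def blocked_sum_axis0(rows, chunk=5):
    # Split the rows into blocks of `chunk` rows each.
    blocks = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    # Sum each block independently (these are the tasks Dask could run in parallel)...
    partials = [column_sums(b) for b in blocks]
    # ...then combine the partial results into the final answer.
    return [sum(vals) for vals in zip(*partials)]

x = [[1.0] * 15 for _ in range(15)]  # like da.ones((15, 15), chunks=(5, 5))
print(blocked_sum_axis0(x))          # 15 columns, each summing to 15.0
```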
x = da.ones((15, 15), chunks=(5, 5))
x.dot(x.T + 1)
15 / 29
df.value.resample('1w').mean()
16 / 29
df.value.rolling(100).mean()
17 / 29
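A rolling mean like the one above parallelizes well because each chunk only needs the last window-1 values from its neighbour. The windowed computation itself can be sketched in plain Python; this is a hypothetical illustration of the semantics (result emitted only once the window is full, as in Pandas), not Dask's or Pandas' implementation.

```python
from collections import deque

def rolling_mean(values, window):
    out, buf, total = [], deque(), 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            # Slide the window: drop the oldest value.
            total -= buf.popleft()
        # Emit a result only once the window is full, None before that.
        out.append(total / window if len(buf) == window else None)
    return out

print(rolling_mean([1, 2, 3, 4, 5], 3))  # [None, None, 2.0, 3.0, 4.0]
```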
18 / 29
Tool for creating arbitrary task graphs
Dead simple interface (one function)
results = {}
for a in A:
    for b in B:
        results[a, b] = fit(a, b)
best = score(results)
19 / 29
from dask import delayed

results = {}
for a in A:
    for b in B:
        results[a, b] = delayed(fit)(a, b)
best = delayed(score)(results)
result = best.compute()
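The essence of `delayed` can be sketched in a few lines: wrapping a function call records it as a graph node instead of executing it, and `.compute()` walks the recorded graph. This is a hypothetical toy implementation for intuition only, not dask's actual code, and `fit`/`score` are stand-in functions.

```python
class Delayed:
    def __init__(self, func, args):
        self.func, self.args = func, args

    def compute(self):
        # Recursively compute any Delayed arguments first (the "graph"),
        # then apply the recorded function.
        def resolve(v):
            if isinstance(v, Delayed):
                return v.compute()
            if isinstance(v, dict):
                return {k: resolve(x) for k, x in v.items()}
            return v
        return self.func(*[resolve(a) for a in self.args])

def delayed(func):
    """Turn an eager function into one returning a lazy Delayed node."""
    return lambda *args: Delayed(func, args)

# Usage mirroring the grid search on this slide (stand-in fit/score):
def fit(a, b):
    return a * b

def score(results):
    return max(results.values())

A, B = [1, 2], [3, 4]
results = {}
for a in A:
    for b in B:
        results[a, b] = delayed(fit)(a, b)
best = delayed(score)(results)
print(best.compute())  # 8
```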
20 / 29
21 / 29
22 / 29
Optimized for larger-than-memory use
Parallel CPU: uses multiple threads or processes
Minimizes RAM: chooses tasks so as to remove intermediates
Low overhead: ~100 µs per task
Concise: ~600 LOC, stable
23 / 29
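Underneath the schedulers, a Dask task graph is just a plain dict mapping keys to values or to tuples of `(function, *args)`, where string arguments refer to other keys. This toy, recursive single-threaded `get` is a hypothetical illustration of executing such a graph, not the scheduler's actual implementation.

```python
from operator import add, mul

def get(dsk, key):
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Resolve arguments that name other tasks in the graph.
        args = [get(dsk, a) if a in dsk else a for a in args]
        return func(*args)
    return task  # a plain piece of data

dsk = {
    'x': 1,
    'y': (add, 'x', 10),   # y = x + 10
    'z': (mul, 'y', 2),    # z = y * 2
}
print(get(dsk, 'z'))  # 22
```

A real scheduler does the same traversal, but chooses execution order to run independent tasks in parallel and to free intermediates early.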
24 / 29
Distributed: one scheduler coordinates many workers
Data local: tries to move computation to the "best" worker
Asynchronous: continuous non-blocking conversation
Multi-user: several users can share the same system
HDFS aware: works well with HDFS, S3, YARN, etc.
Less concise: ~3000 LOC Tornado TCP application
25 / 29
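The "data local" point above can be sketched as: among candidate workers, prefer the one already holding the most bytes of a task's inputs, so that data moves as little as possible. This is a hypothetical toy heuristic for intuition; the distributed scheduler's real placement logic weighs more factors, and all names here are invented.

```python
def best_worker(task_inputs, who_has, nbytes):
    """who_has: data key -> set of workers; nbytes: data key -> size in bytes."""
    scores = {}
    for key in task_inputs:
        for w in who_has.get(key, ()):
            scores[w] = scores.get(w, 0) + nbytes.get(key, 0)
    # Break ties deterministically; None if no worker holds any input.
    return max(scores, key=lambda w: (scores[w], w)) if scores else None

who_has = {'a': {'w1'}, 'b': {'w1', 'w2'}, 'c': {'w2'}}
nbytes = {'a': 100, 'b': 50, 'c': 500}
print(best_worker(['a', 'b', 'c'], who_has, nbytes))  # 'w2' (550 bytes local)
```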
26 / 29
Dynamic task scheduler for arbitrary computations
Familiar: implements NumPy/Pandas interfaces
Flexible: handles arbitrary task graphs efficiently (custom workloads, integration with other projects)
Fast: optimized for demanding applications
Scales up: runs resiliently on clusters with 1000s of cores
Scales down: pragmatic on a laptop
Responsive: designed for interactive computing
27 / 29
28 / 29
Researcher at Vrije Universiteit Brussel (VUB), and contractor for Continuum Analytics
PhD bio-science engineer, air quality research
pandas core dev
https://github.com/jorisvandenbossche
@jorisvdbossche
29 / 29