Vaex: Out of core dataframes for Python
Maarten A. Breddels & Jovan Veljanoski
Article: A&A 618, 2017 / arXiv 1801.02638
PyParis - Nov 13/2018

Maarten Breddels
• Ex: astronomer (working on software for big data and visualization: vaex)
• Now: Freelancer / consultant / data scientist for Python / Jupyter
• Core Jupyter-Widgets developer
• Author of vaex and ipyvolume
I live on the internet at: @maartenbreddels, maartenbreddels@gmail.com, github.com/maartenbreddels, www.maartenbreddels.com

Jovan Veljanoski
• Ex-astronomer (big influence on vaex)
• Data scientist at Xebia Labs
• vaex coauthor
I live on the internet at: @N147185, jovan.veljanoski@gmail.com, github.com/JovanVeljanoski, https://www.linkedin.com/in/jovanvel/

Agenda
• Why does vaex exist?
• What is vaex?
• Why is it so fast?
• Demos
• Summary

Motivation: Gaia
• > 1 billion stars
  • Sky positions
  • Distance
  • Motions
  • And many more
  • Errors / Correlations
• Latest data release
  • 1.7 billion rows
  • 1.2 TB
  • 94 columns/features

scatter

scatter density

• How fast can it be done?
• 10^9 * 2 * 8 bytes ≈ 15 GiB (double is 8 bytes)
• Memory bandwidth: 10-50 GiB/s: ~1 second
• CPU: 3 GHz (but multicore, say 4-8 cores): 12-24 billion cycles/second
• Few cycles per row/object, simple algorithm
• Histograms/Density/Statistics grids
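The estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch of the slide's numbers; the variable names are illustrative, and the assumption that the factor 2 stands for two columns (for example x and y) is ours.

```python
# Back-of-envelope estimate for scanning 10^9 rows of float64 data.
ROWS = 10**9
COLUMNS = 2            # assumption: two columns, e.g. x and y
BYTES_PER_DOUBLE = 8
GIB = 2**30

data_gib = ROWS * COLUMNS * BYTES_PER_DOUBLE / GIB

# Time to stream the data once at both ends of the quoted bandwidth range.
scan_seconds_slow = data_gib / 10   # at 10 GiB/s
scan_seconds_fast = data_gib / 50   # at 50 GiB/s

# Cycle budget per row if the whole pass should take about one second:
# 3 GHz per core times 4 to 8 cores, divided over all rows.
cycles_per_row_min = 3e9 * 4 / ROWS
cycles_per_row_max = 3e9 * 8 / ROWS

print(f"data size: {data_gib:.1f} GiB")
print(f"scan time: {scan_seconds_fast:.1f} to {scan_seconds_slow:.1f} s")
print(f"cycle budget: {cycles_per_row_min:.0f} to {cycles_per_row_max:.0f} cycles per row")
```

With only 12 to 24 cycles available per row, the per-row work has to stay very simple, which is why the slide points to histograms and statistics on grids.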

[Figure: statistics on 0d, 1d, 2d and 3d grids. 0d panel: 330,000 rows, mean: -0.083]
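A statistic on a grid can be sketched in plain Python as follows. This is a minimal illustration of the idea for the 2d case (count and mean per cell), not vaex's implementation; the function name `binned_stats` and its parameters are hypothetical. The 0d case is simply the same statistic over all rows.

```python
def binned_stats(x, y, values, limits, shape):
    """Count and mean of `values` on a regular 2d grid over (x, y).

    Hypothetical helper for illustration; one pass over the rows,
    a few simple operations per row.
    """
    (xmin, xmax), (ymin, ymax) = limits
    nx, ny = shape
    counts = [[0] * ny for _ in range(nx)]
    sums = [[0.0] * ny for _ in range(nx)]
    for xi, yi, vi in zip(x, y, values):
        # Rows outside the grid limits are ignored.
        if not (xmin <= xi < xmax and ymin <= yi < ymax):
            continue
        # Map the coordinates to integer cell indices; the min() guards
        # against floating point rounding at the upper edge.
        i = min(int((xi - xmin) / (xmax - xmin) * nx), nx - 1)
        j = min(int((yi - ymin) / (ymax - ymin) * ny), ny - 1)
        counts[i][j] += 1
        sums[i][j] += vi
    means = [
        [s / c if c else float("nan") for s, c in zip(sum_row, count_row)]
        for sum_row, count_row in zip(sums, counts)
    ]
    return counts, means


x = [0.1, 0.2, 0.8, 0.9, 0.3]
y = [0.1, 0.2, 0.1, 0.9, 0.7]
values = [1.0, 3.0, 10.0, 20.0, 5.0]

counts, means = binned_stats(x, y, values, limits=((0, 1), (0, 1)), shape=(2, 2))
overall_mean = sum(values) / len(values)   # the "0d" statistic
print(counts)   # [[2, 1], [1, 1]]
print(means)    # [[2.0, 5.0], [10.0, 20.0]]
```

The size of the result depends only on the grid shape, not on the number of rows, so the same approach applies whether there are five rows or a billion.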

vaex
• Python library (conda/pip installable)
• Pandas-like (familiar API)
• Out-of-core, expression system
• ~1 second
• Apache Arrow / hdf5 + memory mapping
• Strong focus on statistics on N-d grids (count/mean/max/std/…)
• >1 billion rows / sec on a desktop (quad core 3 GHz)
• >50x faster than scipy.stats.binned_statistic_2d
• Does visualisation / matplotlib / bqplot / ipyvolume / ipyleaflet
• More
  • Machine learning (Boosted Trees, K-means, PCA, ..)
  • Distributed computing (>10^10 rows)
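Memory mapping can be illustrated with the Python standard library alone. The sketch below writes a column of doubles to a raw binary file and then reads it back through `mmap` without loading the file into a Python list first. The file name and layout are assumptions made for the example; vaex itself works with hdf5 and Apache Arrow files rather than this raw format.

```python
import array
import mmap
import os
import tempfile

# Write a small column of float64 values to disk as raw bytes.
# (Assumed layout for this sketch: one contiguous column, native byte order.)
values = array.array("d", [0.5 * i for i in range(1000)])
path = os.path.join(tempfile.mkdtemp(), "column_x.bin")
with open(path, "wb") as f:
    values.tofile(f)

# Map the file into memory. The operating system pages data in on demand,
# so opening is cheap regardless of the file size.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Reinterpret the mapped bytes as doubles without copying them.
        with memoryview(mm) as raw, raw.cast("d") as column:
            n_rows = len(column)
            mean = sum(column) / n_rows

print(n_rows, mean)   # 1000 249.75
```

Because the data stays in the operating system's page cache rather than in a private copy, several processes can map the same file without duplicating it in memory.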

What kind of data?

“Never do a live demo” - Many people
Demo notebooks at: https://github.com/maartenbreddels/talk-pyparis-2018

Takeaway
• Next generation data frame library (vaex?)
• Large datasets should be explored with statistics, not individual points
• Large datasets should be memory mapped: Apache Arrow / hdf5
• Should use expressions
• No memory wasted
• No information lost: JIT/derivatives
• ML pipelines are a byproduct
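The point about expressions and memory can be sketched as follows: a derived column is stored as a recipe and evaluated chunk by chunk when a statistic is requested, so the full derived column is never materialized. The `VirtualColumn` class is a hypothetical stand-in written for this example; it is not vaex's API or implementation.

```python
import math


class VirtualColumn:
    """Hypothetical lazily evaluated column: stores a function and its inputs."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs

    def chunks(self, chunk_size):
        """Yield the evaluated expression one chunk at a time."""
        n = len(self.inputs[0])
        for start in range(0, n, chunk_size):
            stop = min(start + chunk_size, n)
            pieces = (column[start:stop] for column in self.inputs)
            yield [self.func(*row) for row in zip(*pieces)]

    def mean(self, chunk_size=1024):
        """Mean of the expression, holding at most one chunk in memory."""
        total, count = 0.0, 0
        for chunk in self.chunks(chunk_size):
            total += sum(chunk)
            count += len(chunk)
        return total / count


x = [3.0, 6.0, 9.0, 12.0]
y = [4.0, 8.0, 12.0, 16.0]

# Defining the expression costs no memory; nothing is computed yet.
r = VirtualColumn(lambda a, b: math.sqrt(a * a + b * b), x, y)

r_mean = r.mean(chunk_size=2)
print(r_mean)   # 12.5
```

Since the expression itself is kept rather than only its result, it remains available for later steps such as compiling it or applying it to new data.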

vaex
• https://vaex.io
• https://github.com/maartenbreddels/vaex
• pip install --pre vaex
• conda install -c conda-forge vaex
• https://github.com/maartenbreddels/talk-pyparis-2018
• maartenbreddels@gmail.com
• jovan.veljanoski@gmail.com
