Vaex: Out of core dataframes for Python

Maarten A. Breddels & Jovan Veljanoski
Article: A&A 618, 2017 / arXiv 1801.02638
PyParis, Nov 13, 2018


SLIDE 1

Vaex: Out of core dataframes for Python

PyParis - Nov 13/2018

Maarten A. Breddels & Jovan Veljanoski
Article: A&A 618, 2017 / arXiv 1801.02638

SLIDE 11

Maarten Breddels

  • Ex: astronomer (working on software for big data and visualization: vaex)
  • Now: freelancer / consultant / data scientist for Python / Jupyter
  • Core Jupyter-Widgets developer
  • Author of vaex and ipyvolume

I live on the internet at:

  • @maartenbreddels
  • maartenbreddels@gmail.com
  • github.com/maartenbreddels
  • www.maartenbreddels.com

Jovan Veljanoski

  • Ex-astronomer (big influence on vaex)
  • Data scientist at Xebia Labs
  • vaex coauthor

I live on the internet at:

  • @N147185
  • jovan.veljanoski@gmail.com
  • github.com/JovanVeljanoski
  • https://www.linkedin.com/in/jovanvel/

SLIDE 12

Agenda

  • Why does vaex exist?
  • What is vaex?
  • Why is it so fast?
  • Demos
  • Summary
SLIDE 15

Motivation: Gaia

  • > 1 billion stars
  • Sky positions
  • Distance
  • Motions
  • And many more
  • Errors / Correlations
  • Latest data release: 1.7 billion rows, 1.2 TB, 94 columns/features
SLIDE 17

scatter

SLIDE 18

scatter density

SLIDE 25
  • How fast can it be done?
  • 10^9 * 2 * 8 bytes ≈ 15 GiB (a double is 8 bytes)
  • Memory bandwidth: 10-50 GiB/s, so ~1 second
  • CPU: 3 GHz (but multicore, say 4-8 cores): 12-24 billion cycles/second
  • Few cycles per row/object, simple algorithm
  • Histograms/Density/Statistics grids
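
The back-of-envelope estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide: the variable names are made up for illustration, and the "two columns" assumption stands for the x and y of a scatter plot.

```python
GIB = 2**30  # bytes in one GiB

rows = 10**9          # about one billion stars
columns = 2           # assumed: the x and y of a scatter plot
bytes_per_value = 8   # a double is 8 bytes

# Size of the data that has to be streamed through memory once.
data_gib = rows * columns * bytes_per_value / GIB

# Time for one pass at the memory bandwidths quoted on the slide.
seconds_at_10 = data_gib / 10   # 10 GiB/s
seconds_at_50 = data_gib / 50   # 50 GiB/s

# Cycle budget per row if the whole pass should take about one second.
clock_hz = 3 * 10**9
cycles_per_row = {cores: clock_hz * cores // rows for cores in (4, 8)}

print(f"data size: {data_gib:.1f} GiB")
print(f"one pass:  {seconds_at_50:.2f} to {seconds_at_10:.2f} s")
print(f"budget:    {cycles_per_row} cycles per row")
```

With 12 to 24 cycles available per row, only a very simple per-row operation fits in the one-second budget, which is why the slide points at histograms and statistics on grids.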
SLIDE 33

[Figure: panels labeled 0d, 1d, 2d and 3d; 330,000 rows; mean: -0.083]

SLIDE 40

vaex

  • Python library (conda/pip installable)
  • Pandas-like (familiar API)
  • Out-of-core, expression system
  • Apache Arrow / hdf5 + memory mapping
  • Strong focus on statistics on N-d grids (count/mean/max/std/…)
  • >1 billion rows / sec on a desktop (quad core, 3 GHz)
  • >50x faster than scipy.stats.binned_statistic_2d
  • Does visualisation: matplotlib / bqplot / ipyvolume / ipyleaflet
  • More:
  • Machine learning (Boosted Trees, K-means, PCA, ...)
  • Distributed computing (>10^10 rows)
  • ~1 second
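
To make "statistics on N-d grids" concrete, here is a minimal pure-Python sketch of a count and a mean on a 2-d grid, computed in a single pass over the rows. It is an illustration of the idea only, not vaex's implementation or API: the function name `binned_count_and_mean` and the tiny data set are hypothetical.

```python
def binned_count_and_mean(xs, ys, values, limits, shape):
    """One pass over the rows: accumulate a count and a sum per grid cell."""
    (xmin, xmax), (ymin, ymax) = limits
    nx, ny = shape
    counts = [[0] * ny for _ in range(nx)]
    sums = [[0.0] * ny for _ in range(nx)]
    for x, y, v in zip(xs, ys, values):
        # Rows outside the limits are skipped (half-open intervals).
        if not (xmin <= x < xmax and ymin <= y < ymax):
            continue
        # Map the coordinates to cell indices; clamp against float rounding.
        i = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        j = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        counts[i][j] += 1
        sums[i][j] += v
    # Empty cells have no mean, so they are reported as NaN.
    means = [[s / c if c else float("nan") for s, c in zip(srow, crow)]
             for srow, crow in zip(sums, counts)]
    return counts, means


# Hypothetical toy data: four rows on a 2 x 2 grid over the unit square.
xs = [0.1, 0.2, 0.6, 0.9]
ys = [0.1, 0.3, 0.7, 0.8]
values = [1.0, 3.0, 10.0, 20.0]
counts, means = binned_count_and_mean(xs, ys, values, ((0, 1), (0, 1)), (2, 2))
print(counts)
print(means)
```

The per-row work is a handful of arithmetic operations and two array updates, and the result has a fixed size (the grid) no matter how many rows are processed.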
SLIDE 41

What kind of data?

SLIDE 45

“Never do a live demo”

  • Many people

Demo notebooks at: https://github.com/maartenbreddels/talk-pyparis-2018

SLIDE 53

Takeaway

  • Next generation data frame library (vaex?)
  • Large datasets should be explored with statistics, not individual points
  • Large datasets should be memory mapped: Apache Arrow / hdf5
  • Should use expressions
  • No memory wasted
  • No information lost: JIT/derivatives
  • ML pipelines are a byproduct
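
The memory-mapping takeaway can be illustrated with the Python standard library alone. The sketch below assumes a column stored as raw doubles in a file (the file name `column.bin` is hypothetical) and uses `mmap` rather than Apache Arrow or hdf5, so it shows the principle, not how vaex reads its files.

```python
import mmap
import os
import tempfile
from array import array

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file: one column of 1000 doubles stored as raw bytes.
    path = os.path.join(tmp, "column.bin")
    with open(path, "wb") as f:
        array("d", (i * 0.5 for i in range(1000))).tofile(f)

    # Map the file into memory and view the bytes as doubles, without
    # copying the file into a Python list first.
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
            memoryview(mm) as raw, raw.cast("d") as column:
        n_rows = len(column)
        # Only the pages backing rows 100..199 need to be touched here.
        chunk_mean = sum(column[100:200]) / 100

print(n_rows, chunk_mean)
```

Opening the mapping is cheap regardless of file size, because the operating system loads pages only when they are accessed; that is what makes it practical to work with files larger than RAM.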
SLIDE 54
  • vaex
  • https://vaex.io
  • https://github.com/maartenbreddels/vaex
  • pip install --pre vaex
  • conda install -c conda-forge vaex
  • https://github.com/maartenbreddels/talk-pyparis-2018
  • maartenbreddels@gmail.com
  • jovan.veljanoski@gmail.com