SLIDE 1

Making Pandas Fly (live from London)

@IanOzsvald – ianozsvald.com

Ian Ozsvald

EuroPython 2020

SLIDE 2

Introductions

Interim Chief Data Scientist
19+ years experience
Team coaching & public courses
– I’m sharing from my Higher Performance Python course
2nd Edition!

By [ian]@ianozsvald[.com]

Ian Ozsvald

SLIDE 3

Thank the organisers!

All volunteers – go say thank you in #lobby
They’ve put in a huge amount of volunteered work for us!

SLIDE 4

Today’s goal

Pandas
– Saving RAM to fit in more data
– Calculating faster by dropping to NumPy
Advice for “being highly performant”
Has Covid 19 affected UK Company Registrations?

SLIDE 5

Strings are expensive and slow

SLIDE 6

Categoricals are cheap and fast!

Circa 1% of previous memory cost
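
The memory saving can be checked with a minimal sketch like the one below. The company-type labels and the row count are made-up stand-ins for the slide's data (which is not reproduced here), so the exact ratio will differ from the slide's figure.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: a few long strings repeated many times.
labels = np.array([
    "Private Limited Company",
    "Limited Partnership",
    "Charitable Incorporated Organisation",
])
rng = np.random.default_rng(0)

# Force object dtype so every row holds a full Python string.
ser_str = pd.Series(rng.choice(labels, size=100_000), dtype=object)

# Same values stored as small integer codes plus one copy of each label.
ser_cat = ser_str.astype("category")

bytes_str = ser_str.memory_usage(deep=True, index=False)
bytes_cat = ser_cat.memory_usage(deep=True, index=False)
print(f"object: {bytes_str:,} bytes, category: {bytes_cat:,} bytes")
print(f"category uses {bytes_cat / bytes_str:.1%} of the object cost")
```

The saving depends on having few distinct values relative to the number of rows; a column of mostly unique strings would not benefit.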

SLIDE 7

Categoricals “.cat” accessor
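
As a small illustration of the accessor (the values here are invented, not taken from the slide), .cat exposes the lookup table and the integer codes behind a categorical Series:

```python
import pandas as pd

# Toy data; the real talk used UK company registration data.
ser = pd.Series(["ltd", "plc", "ltd", "llp", "ltd"], dtype="category")

# The distinct labels, stored once.
categories = list(ser.cat.categories)

# One small integer per row, indexing into the categories.
codes = ser.cat.codes.tolist()

print(categories)
print(codes)
```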

SLIDE 8

Categoricals – over 10x speed up (on this data)!

SLIDE 9

Categoricals – index queries faster!

Circa 500x speed-up!

SLIDE 10

float64 is default and a bit expensive

SLIDE 11

float32 “half-price” and a bit faster
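
A rough sketch of the "half-price" claim, using random numbers as a stand-in for real data. It also prints the rounding error introduced, since float32 keeps fewer significant digits than float64.

```python
import numpy as np
import pandas as pd

# Stand-in data: one million random values in [0, 1).
ser64 = pd.Series(np.random.default_rng(0).random(1_000_000))
ser32 = ser64.astype("float32")

# Ignore the index so only the values are compared.
bytes64 = ser64.memory_usage(index=False)
bytes32 = ser32.memory_usage(index=False)
print(f"float64: {bytes64:,} bytes, float32: {bytes32:,} bytes")

# Largest difference caused by the narrower type on this data.
max_abs_error = (ser64 - ser32.astype("float64")).abs().max()
print(f"max rounding error: {max_abs_error:.2e}")
```

Whether that loss of precision is acceptable depends on the data, so it is worth checking before converting.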

SLIDE 12

Make choices to save RAM

Including the index (previously we ignored it) we still save circa 50% RAM so you can fit in more rows of data

SLIDE 13

“dtype_diet” gives you advice

SLIDE 14

Drop to NumPy if you know you can

Caveat – Pandas mean() is not np.mean(): Pandas skips NaN values by default, so the fair comparison is np.nanmean(), which is slower – see my blog or PyDataAmsterdam 2020 talk for details
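
The difference is easy to see on a tiny Series containing a missing value (toy data, chosen only to show the behaviour):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, np.nan, 4.0])
arr = ser.to_numpy()

pandas_mean = ser.mean()         # skips NaN by default
numpy_mean = np.mean(arr)        # NaN propagates into the result
numpy_nanmean = np.nanmean(arr)  # skips NaN, matching Pandas

print(pandas_mean, numpy_mean, numpy_nanmean)
```

So dropping to np.mean is only safe when you know the data has no missing values.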

SLIDE 15

NumPy vs Pandas overhead (ser.sum())

25 files, 83 functions. Very few NumPy calls! Thanks!

SLIDE 16

Overhead...

SLIDE 17

Overhead with ser.values.sum()

18 files, 51 functions. Many fewer Pandas calls (but still a lot!)
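
One way to compare the two call paths yourself is a quick timeit run such as the sketch below. The Series size and repeat count are arbitrary choices, and timings vary by machine and Pandas version, so no numbers are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

ser = pd.Series(np.random.default_rng(0).random(100_000))

# Both routes should agree on the answer (up to float rounding).
total_pandas = ser.sum()
total_numpy = ser.values.sum()

# Time each route; ser.values.sum() skips much of the Pandas machinery.
t_pandas = timeit.timeit(ser.sum, number=200)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=200)

print(f"ser.sum():        {t_pandas:.4f}s for 200 calls")
print(f"ser.values.sum(): {t_numpy:.4f}s for 200 calls")
```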

SLIDE 18

Is Pandas unnecessarily slow – NO!

https://github.com/pandas-dev/pandas/issues/34773 - the truth is a bit complicated!

SLIDE 19

Being highly performant

Install optional (but great!) Pandas dependencies
– bottleneck
– numexpr
Investigate https://github.com/ianozsvald/dtype_diet
Investigate my ipython_memory_usage (PyPI/Conda)

https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
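
A quick way to see whether the optional dependencies are present is sketched below. It checks that each package can be found and reads the matching Pandas option; the status dictionary is just an illustrative name.

```python
import importlib.util

import pandas as pd

status = {}
for name in ("bottleneck", "numexpr"):
    # Is the package importable in this environment?
    installed = importlib.util.find_spec(name) is not None
    # Is Pandas configured to use it when available?
    enabled = pd.get_option(f"compute.use_{name}")
    status[name] = (installed, enabled)
    print(f"{name}: installed={installed}, pandas option enabled={enabled}")
```

If a package shows as not installed, the option alone has no effect, so installing it is the first step.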

SLIDE 20

Pure Python is “slow” and expressive

Deliberately poor function – pretend this is clever but slow!

SLIDE 21

Compile to Numba judiciously

Near 10x speed-up!
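
The function from the slide is not reproduced in this transcript, so the sketch below uses a hypothetical loop-heavy function to show the general pattern. It assumes Numba is installed; if it is not, the code falls back to plain Python so it still runs (without any speed-up).

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the example runs without Numba (no compilation happens).
    def njit(func):
        return func


def count_above(arr, threshold):
    """Hypothetical stand-in for a 'clever but slow' pure-Python loop."""
    count = 0
    for value in arr:
        if value > threshold:
            count += 1
    return count


# Compile the same function; the first call pays the compilation cost.
count_above_fast = njit(count_above)

arr = np.random.default_rng(0).random(100_000)
result_slow = count_above(arr, 0.5)
result_fast = count_above_fast(arr, 0.5)
print(result_slow, result_fast)
```

When timing, exclude the first compiled call, otherwise the compile time hides the benefit.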

SLIDE 22

Parallelise with Dask for multi-core

Make plain-Python code multi-core

Note I had to drop text index column due to speed-hit

Data copy cost can overwhelm any benefits so (always) profile & time

SLIDE 23

Being highly performant

Mistakes slow us down (PAY ATTENTION!)
– Try nullable Int64 & boolean, forthcoming Float64
– Write tests (unit & end-to-end)
– Lots more material & my newsletter on my blog IanOzsvald.com
– Time saving docs:
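
As a small illustration of the nullable dtypes mentioned above (toy values only), compare what happens to integers and booleans with a missing entry:

```python
import pandas as pd

# Default behaviour: a missing value forces integers to become float64.
ser_default = pd.Series([1, 2, None])

# Nullable Int64 keeps integers and marks the gap as <NA>.
ser_int = pd.Series([1, 2, None], dtype="Int64")

# Nullable boolean does the same for True/False data.
ser_bool = pd.Series([True, None, False], dtype="boolean")

print(ser_default.dtype, ser_int.dtype, ser_bool.dtype)
print(ser_int.sum(), int(ser_int.isna().sum()))
```

Note the capital letter: "Int64" is the nullable Pandas dtype, while "int64" is the plain NumPy one.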

SLIDE 24

Vaex / Modin

Memory mapped & lazy computation
– New string dtype (RAM efficient)
Modin sits on Pandas, new “algebra” for dfs
– Drop in replacement, easy to try

See talks on my blog:

SLIDE 25

Summary

Make it right then make it fast
Think about being performant
See blog for my classes
I’d love a postcard if you learned something new!

SLIDE 26

Covid 19’s effect on UK Economy?

Sharp decline in corporate registration after Lockdown – then apparent surge (perhaps just backed-up paperwork?). Will the recovery “last”? All open data, you can do similar things!