Scientist meets web dev: how Python became the language of data, by Gaël Varoquaux



slide-1
SLIDE 1

Scientist meets web dev:

how Python became the language of data

Gaël Varoquaux

slide-2
SLIDE 2

Scientist meets web dev:

how Python became the language of data

Gaël Varoquaux

Very diverse community

This talk: a reflection on what we have in common, Python

I am talking about things you don’t understand (my science) and things I don’t understand (web dev)

slide-3
SLIDE 3

I actually did a PhD in quantum physics

Hence I think I qualify as a “scientist”

G Varoquaux 3

slide-4
SLIDE 4

I now do computer science for neuroscience

Try to link neural activity to thoughts and cognition

G Varoquaux 4

slide-5
SLIDE 5

I now do computer science for neuroscience

Try to link neural activity to thoughts and cognition

We attack it as a machine learning problem

Python software: nilearn

G Varoquaux 5

slide-6
SLIDE 6

On the way, we created a machine-learning library: scikit-learn

G Varoquaux 6

slide-7
SLIDE 7

Data science with Python is hot

Huge success. Cool. Data science is THE thing.

G Varoquaux 7

slide-8
SLIDE 8

Data science with Python is hot

Huge success. Cool. Data science is THE thing. Python is the go-to language

How did it happen?

We built scikit-learn, others pandas, etc., but these were built on solid foundations

G Varoquaux 7

slide-9
SLIDE 9

1 Scientists come from Jupiter

And web devs from Saturn?

And sysadmins from Neptune?

G Varoquaux 8

slide-10
SLIDE 10

1 We’re different

numbers (in arrays), arrays (of numbers), arrays, arrays

strings, databases, object-oriented programming, flow control

A bit of a culture gap

G Varoquaux 9

slide-11
SLIDE 11

1 Let’s do something together: sort EuroPython site

205 talks:

How OpenStack makes Python better (and vice-versa)

Introduction to aiohttp

So you think your Python startup is worth $10 million...

SQLAlchemy as the backbone of a Data Science company

Learn Python The Fun Way

Scaling Microservices with Crossbar.io

If you can read this you don’t need glasses

Let’s find some common topics with data science

G Varoquaux 10

slide-12
SLIDE 12

1 Let’s do something together: sort EuroPython site

205 talks:

How OpenStack makes Python better (and vice-versa)

Introduction to aiohttp

So you think your Python startup is worth $10 million...

SQLAlchemy as the backbone of a Data Science company

Learn Python The Fun Way

Scaling Microservices with Crossbar.io

If you can read this you don’t need glasses

Let’s find some common topics with data science

Anyone who has used Python to search text for substring patterns has at least heard of the regular expression module. Many of us use it extensively for parsers and lexers,

The py.test tool presents a rapid and simple way to write tests for your Python code. This training gives a quick introduction with exercises into some distinguishing features.

Chat with the core developers about how to extend django CMS or how to integrate your own apps seamlessly. Let’s talk about your plugins, apphooks, toolbar extensions

G Varoquaux 10

slide-13
SLIDE 13

1 Let’s do something together: sort EuroPython site

205 talks:

How OpenStack makes Python better (and vice-versa)

Introduction to aiohttp

So you think your Python startup is worth $10 million...

SQLAlchemy as the backbone of a Data Science company

Learn Python The Fun Way

Scaling Microservices with Crossbar.io

If you can read this you don’t need glasses

Let’s find some common topics with data science

Anyone who has used Python to search text for substring patterns has at least heard of the regular expression module. Many of us use it extensively for parsers and lexers,

The py.test tool presents a rapid and simple way to write tests for your Python code. This training gives a quick introduction with exercises into some distinguishing features.

Chat with the core developers about how to extend django CMS or how to integrate your own apps seamlessly. Let’s talk about your plugins, apphooks, toolbar extensions

import urllib2, bs4
import sklearn, wordcloud

G Varoquaux 10

slide-14
SLIDE 14

1 Let’s do something together: sort EuroPython site

Crawl the schedule to get a list of titles and URLs, then the talk pages to retrieve abstracts and tags

bs4: Beautiful Soup, matching on the DOM tree
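The crawling step can be illustrated without touching the network. The sketch below is only an approximation of the idea: it uses the standard library's `html.parser` as a stand-in for Beautiful Soup, and the schedule markup, the URLs, and the `TalkLinkParser` name are all hypothetical, since the slides do not show the real page layout.

```python
from html.parser import HTMLParser


class TalkLinkParser(HTMLParser):
    """Collect (title, url) pairs from every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.talks = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.talks.append(("".join(self._text).strip(), self._href))
            self._href = None


# Hypothetical schedule markup (assumption: one link per talk).
SCHEDULE_HTML = """
<ul>
  <li><a href="/talks/aiohttp">Introduction to aiohttp</a></li>
  <li><a href="/talks/fun-way">Learn Python The Fun Way</a></li>
</ul>
"""

parser = TalkLinkParser()
parser.feed(SCHEDULE_HTML)
talks = parser.talks
```

With Beautiful Soup the same matching would be expressed as a query on the DOM tree rather than as parser callbacks, which is presumably why the talk reaches for it.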

G Varoquaux 11

slide-15
SLIDE 15

1 Let’s do something together: sort EuroPython site

Crawl the schedule to get a list of titles and URLs, then the talk pages to retrieve abstracts and tags

bs4: Beautiful Soup, matching on the DOM tree

Vectorize

Anyone who has used Python to search text for substring patterns has at least heard of the regular expression module. Many of us use it extensively for parsers and lexers,

The py.test tool presents a rapid and simple way to write tests for your Python code. This training gives a quick introduction with exercises into some distinguishing features.

Chat with the core developers about how to extend django CMS or how to integrate your own apps seamlessly. Let’s talk about your plugins, apphooks, toolbar extensions

Term          Term freq
a                    20
can                  10
code                  4
is                   14
module                3
profiling             2
performance           1
Python                9
the                  18

G Varoquaux 11

slide-16
SLIDE 16

1 Let’s do something together: sort EuroPython site

Crawl the schedule to get a list of titles and URLs, then the talk pages to retrieve abstracts and tags

bs4: Beautiful Soup, matching on the DOM tree

Vectorize

Anyone who has used Python to search text for substring patterns has at least heard of the regular expression module. Many of us use it extensively for parsers and lexers,

The py.test tool presents a rapid and simple way to write tests for your Python code. This training gives a quick introduction with exercises into some distinguishing features.

Chat with the core developers about how to extend django CMS or how to integrate your own apps seamlessly. Let’s talk about your plugins, apphooks, toolbar extensions

Term          Term freq   All docs
a                    20       1321
can                  10        540
code                  4        208
is                   14        964
module                3        123
profiling             2          7
performance           1          6
Python                9        191
the                  18       1450

G Varoquaux 11

slide-17
SLIDE 17

1 Let’s do something together: sort EuroPython site

Crawl the schedule to get a list of titles and URLs, then the talk pages to retrieve abstracts and tags

bs4: Beautiful Soup, matching on the DOM tree

Vectorize

Anyone who has used Python to search text for substring patterns has at least heard of the regular expression module. Many of us use it extensively for parsers and lexers,

The py.test tool presents a rapid and simple way to write tests for your Python code. This training gives a quick introduction with exercises into some distinguishing features.

Chat with the core developers about how to extend django CMS or how to integrate your own apps seamlessly. Let’s talk about your plugins, apphooks, toolbar extensions

Term          Term freq   All docs   Ratio
a                    20       1321    .015
can                  10        540    .018
code                  4        208    .019
is                   14        964    .014
module                3        123    .023
profiling             2          7    .286
performance           1          6    .167
Python                9        191    .047
the                  18       1450    .012

TF-IDF in scikit-learn

sklearn.feature_extraction.text.TfidfVectorizer
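The ratio column of the table can be recomputed in a few lines of plain Python. This is only a sketch of the simplified ratio shown on the slide (term frequency divided by the count over all documents); scikit-learn's `TfidfVectorizer` uses a log-scaled, smoothed IDF and normalizes each document by default, so its numbers would differ. The variable names are illustrative.

```python
# Counts copied from the table: (term frequency, count over all documents).
counts = {
    "a": (20, 1321),
    "can": (10, 540),
    "code": (4, 208),
    "is": (14, 964),
    "module": (3, 123),
    "profiling": (2, 7),
    "performance": (1, 6),
    "Python": (9, 191),
    "the": (18, 1450),
}

# Ratio of the local frequency to the corpus-wide count, per term.
ratios = {term: tf / total for term, (tf, total) in counts.items()}

# Terms sorted from most to least specific.
ranked = sorted(ratios, key=ratios.get, reverse=True)
```

The ranking puts "profiling" and "performance" first and common words such as "the" last, which is the point of weighting by inverse document frequency.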

G Varoquaux 11

slide-18
SLIDE 18

1 Let’s do something together: sort EuroPython site

[Figure: term-document matrix, with documents on one axis and terms (the, Python, performance, profiling, module, is, code, can, a) on the other]

Term-document matrix

G Varoquaux 12

slide-19
SLIDE 19

1 Let’s do something together: sort EuroPython site

[Figure: term-document matrix, with documents on one axis and terms (the, Python, performance, profiling, module, is, code, can, a) on the other]

Term-document matrix

Can be a sparse matrix

G Varoquaux 12

slide-20
SLIDE 20

1 Let’s do something together: sort EuroPython site

[Figure: the term-document matrix factorized into a documents-by-topics matrix and a topics-by-terms matrix; terms shown: the, Python, performance, profiling, module, is, code, can, a]

What terms are in a topic

What documents are in a topic

A matrix factorization

Often with non-negative constraints

sklearn.decomposition.NMF
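To make the factorization concrete, here is a minimal numpy sketch of non-negative matrix factorization using the classic multiplicative updates. It is not scikit-learn's implementation (the `sklearn.decomposition.NMF` solvers are more elaborate), and the toy matrix, the `nmf` function name, and its parameters are assumptions made for illustration.

```python
import numpy as np


def nmf(X, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix X (documents x terms) as W @ H.

    W (documents x topics) and H (topics x terms) stay non-negative because
    each update multiplies by a ratio of non-negative quantities.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, n_topics))
    H = rng.random((n_topics, n_terms))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Made-up term-document counts: two groups of documents using disjoint terms.
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])
W, H = nmf(X, n_topics=2)
error = np.linalg.norm(X - W @ H)
```

Rows of `H` answer "what terms are in a topic" and rows of `W` answer "what topics are in a document", matching the two questions on the slide.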

G Varoquaux 13

slide-21
SLIDE 21

1 Let’s do something together: sort EuroPython site

EuroPython abstracts: Topic 1

G Varoquaux 14

slide-22
SLIDE 22

1 Let’s do something together: sort EuroPython site

EuroPython abstracts: Topic 2

G Varoquaux 14

slide-23
SLIDE 23

1 Let’s do something together: sort EuroPython site

EuroPython abstracts: Topic 3

G Varoquaux 14

slide-24
SLIDE 24

1 Let’s do something together: sort EuroPython site

EuroPython abstracts

G Varoquaux 14

slide-25
SLIDE 25

1 Let’s do something together: sort EuroPython site

EuroPython abstracts

Add one of Python’s great templating engines ... and you get a usable website

https://gaelvaroquaux.github.io/my_topics/ep16

G Varoquaux 14

slide-26
SLIDE 26

Want to try it?

$ pip install scikit-learn
...
ImportError: Numerical Python (NumPy) is not installed.
scikit-learn requires NumPy >= 1.6.1

G Varoquaux 15

slide-28
SLIDE 28

Want to try it?

$ pip install scikit-learn
...
ImportError: Numerical Python (NumPy) is not installed.
scikit-learn requires NumPy >= 1.6.1

C:> pip install numpy
...
error: Unable to find vcvarsall.bat

G Varoquaux 15

slide-30
SLIDE 30

1 We’re different Well fast linear algebra

ATLAS (Fortran) 70x faster

libfortran.so.3 ?? you’re kidding me

G Varoquaux 16

slide-31
SLIDE 31

1 We’re different Well fast linear algebra

ATLAS (Fortran) 70x faster

libfortran.so.3 ?? you’re kidding me

Packaging is a major roadblock for scientific Python

A lot of compiled code + shared libraries ⇒ library + ABI compatibility issues

Progress:

  • Manylinux wheels (PEP 513, R. T. McGibbon, N. J. Smith): rely on a conservative core set of libs
  • OpenBLAS: pure-C, fast linear algebra

G Varoquaux 16

slide-32
SLIDE 32

1 We’re different

But working together gives us awesome things

Text mining ⇒ intelligent interfaces

G Varoquaux 17

slide-33
SLIDE 33

2 The scientist’s view of code

Numerics versus control flow

Numerics versus databases

Numerics versus strings

Numerics versus the world

G Varoquaux 18

slide-34
SLIDE 34

2 Why we love numpy

100 000 term frequency vs inverse doc frequency:

In [*]: %timeit [t * i for t, i in izip(tf, idf)]
100 loops, best of 3: 6.2 ms per loop

The numpy style:

In [*]: %timeit tf * idf
1000 loops, best of 3: 74.2 µs per loop

[Plot: time per element (1 ns to 1 µs) vs. number of elements (10^2 to 10^6), for lists and numpy]

G Varoquaux 19

slide-35
SLIDE 35

2 Why we love numpy

100 000 term frequency vs inverse doc frequency:

In [*]: %timeit [t * i for t, i in izip(tf, idf)]
100 loops, best of 3: 6.2 ms per loop

The numpy style:

In [*]: %timeit tf * idf
1000 loops, best of 3: 74.2 µs per loop

[Plot: time per element (1 ns to 1 µs) vs. number of elements (10^2 to 10^6), for lists and numpy]

Array computing can be more readable tf * idf vs [t * i for t, i in izip(tf, idf)]
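For readers who want to try the comparison, here is a small self-contained sketch of the two styles. The random arrays are stand-ins for real term and document frequencies, and `zip` is used as the Python 3 spelling of the `izip` that appears on the slide; timings are deliberately left out because they depend on the machine.

```python
import numpy as np

rng = np.random.default_rng(0)
tf = rng.random(100_000)    # stand-in term frequencies
idf = rng.random(100_000)   # stand-in inverse document frequencies

# Pure-Python style: one interpreted multiplication per element.
as_list = [t * i for t, i in zip(tf, idf)]

# numpy style: a single vectorized multiplication over the whole buffer.
as_array = tf * idf
```

Both produce the same numbers; the array expression is shorter to read and leaves the loop to compiled code.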

G Varoquaux 19

slide-36
SLIDE 36

2 arrays are nothing but pointers

A numpy array =

  • memory address
  • data type
  • shape
  • strides

[Figure: a 2D grid of numbers annotated with shape 1, shape 2, stride 1, and stride 2, next to the same values laid out as a flat memory buffer]

Represents any regular data in a structured way: how to access elements via pointer arithmetic (computing offsets)

G Varoquaux 20

slide-37
SLIDE 37

2 arrays are nothing but pointers

A numpy array =

  • memory address
  • data type
  • shape
  • strides

[Figure: a 2D grid of numbers annotated with shape 1, shape 2, stride 1, and stride 2, next to the same values laid out as a flat memory buffer]

Represents any regular data in a structured way: how to access elements via pointer arithmetic (computing offsets)

Matches the memory model of numerical libraries ⇒ Enables copyless interactions Numpy is really a memory model
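The shape-and-strides description can be checked directly from Python. The sketch below assumes a small C-ordered `int64` array chosen for illustration; it computes the byte offset of one element by hand and shows that a transpose is a view on the same memory with swapped strides.

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Strides are the number of bytes to skip to move one step along each axis.
row_stride, col_stride = a.strides        # (32, 8) for int64 in C order

# Pointer arithmetic: byte offset of element (i, j) from the start of the buffer.
i, j = 2, 1
offset = i * row_stride + j * col_stride
flat = np.frombuffer(a.tobytes(), dtype=a.dtype)
element = flat[offset // a.itemsize]      # same value as a[i, j]

# A transpose is a new view on the same memory with swapped shape and strides.
t = a.T
```

Because only the metadata changes, `t` costs no copy, which is what makes copy-free exchange with numerical libraries possible.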

G Varoquaux 20

slide-38
SLIDE 38

2 Array computing is fast

[Plot: time per element (1 ns to 1 µs) vs. number of elements (10^2 to 10^6), for lists and numpy]

tf_idf = tf * idf

[Diagram: the CPU multiplies the tf and idf buffers element by element to produce tf_idf]

  • No type checking
  • Direct sequential memory access
  • Vector operations (SIMD)

G Varoquaux 21

slide-39
SLIDE 39

2 Array computing is limited by CPU starvation

[Plot: time per element (1 ns to 5 ns) vs. number of elements (10^3 to 10^6)]

tf_idf = tf * idf

[Diagram: the CPU multiplies the tf and idf buffers element by element to produce tf_idf]

2x slowdown past a certain size

G Varoquaux 22

slide-40
SLIDE 40

2 Array computing is limited by CPU starvation

[Plot: time per element (1 ns to 5 ns) vs. number of elements (10^3 to 10^6)]

10^5 ∼ size of the CPU cache

Memory is much slower than CPU

tf_idf = tf * idf

[Diagram: only a small part of the tf and idf buffers sits in the CPU cache at any time]

2x slowdown past a certain size

G Varoquaux 22

slide-41
SLIDE 41

2 Array computing is limited by CPU starvation

[Plot: time per element (1 ns to 5 ns) vs. number of elements (10^3 to 10^6)]

Memory is much slower than CPU

tf_idf = tf * idf - 1

It gets worse for complex expressions

G Varoquaux 22

slide-42
SLIDE 42

2 Array computing is limited by CPU starvation

[Plot: time per element (1 ns to 5 ns) vs. number of elements (10^3 to 10^6)]

Memory is much slower than CPU

tf_idf = tf * idf - 1

What’s going on:

  1. tmp ← tf * idf
  2. tf_idf ← tmp - 1

Big temporary: Moving data in & out of cache

G Varoquaux 22

slide-43
SLIDE 43

2 Array computing is limited by CPU starvation

tf_idf = tf * idf - 1

What’s going on:

  1. tmp ← tf * idf
  2. tf_idf ← tmp - 1

Big temporary: Moving data in & out of cache

In [*]: %timeit tf * idf
10000 loops, best of 3: 74.2 µs per loop

In [*]: %timeit tf * idf - 1
1000 loops, best of 3: 418 µs per loop

G Varoquaux 22

slide-44
SLIDE 44

2 Array computing is limited by CPU starvation

tf_idf = tf * idf - 1

What’s going on:

  1. tmp ← tf * idf
  2. tf_idf ← tmp - 1

Big temporary: Moving data in & out of cache

In [*]: %timeit tf * idf
10000 loops, best of 3: 74.2 µs per loop

In [*]: %timeit tf * idf - 1
1000 loops, best of 3: 418 µs per loop

In-place operations: reuse the allocation

In [*]: %timeit tmp = tf * idf; tmp -= 1
10000 loops, best of 3: 112 µs per loop
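The in-place variant can be written out as follows. This is a sketch on random stand-in data; the third form, which uses numpy's `out=` argument with a preallocated buffer, is an addition not shown on the slide and expresses the same "reuse the allocation" idea explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
tf = rng.random(100_000)
idf = rng.random(100_000)

# Naive expression: allocates one array for tf * idf and a second for the result.
naive = tf * idf - 1

# In-place variant from the slide: the subtraction reuses the first allocation.
tmp = tf * idf
tmp -= 1

# Same idea with an explicit, reusable output buffer (numpy's `out=` argument).
buf = np.empty_like(tf)
np.multiply(tf, idf, out=buf)
np.subtract(buf, 1, out=buf)
```

All three give the same values; they differ only in how many large temporaries travel through the cache.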

G Varoquaux 22

slide-45
SLIDE 45

2 Array computing is limited by CPU starvation

[Plot: time per element (1 ns to 5 ns) vs. number of elements (10^3 to 10^6), for numpy and numpy in-place]

tmp = tf * idf
tmp -= 1

tf_idf = tf * idf - 1

What’s going on:

  1. tmp ← tf * idf
  2. tf_idf ← tmp - 1

Big temporary: Moving data in & out of cache

G Varoquaux 22

slide-46
SLIDE 46

2 Array computing is limited by CPU starvation

[Plot: time per element (1 ns to 5 ns) vs. number of elements (10^3 to 10^6), for numpy and numpy in-place]

tmp = tf * idf
tmp -= 1

tf_idf = tf * idf - 1

What’s going on:

  1. tmp ← tf * idf
  2. tf_idf ← tmp - 1

Big temporary: Moving data in & out of cache

A compilation problem:

tf_idf = tf * idf - 1

tf_idf = tf * idf

tf_idf -= 1

G Varoquaux 22

slide-47
SLIDE 47

2 Array computing is limited by CPU starvation

[Plot: time per element (1 ns to 5 ns) vs. number of elements (10^3 to 10^6), for numpy, numpy in-place, and numexpr]

tf_idf = tf * idf - 1

What’s going on:

  1. tmp ← tf * idf
  2. tf_idf ← tmp - 1

Big temporary: Moving data in & out of cache

A compilation problem:

  • Removing/reusing temporaries
  • Operating on “chunks” that fit in cache

Addressed by numexpr, with string expressions

numexpr.evaluate('tf * idf - 1', locals())
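The two ingredients named above, reusing temporaries and working on cache-sized chunks, can be sketched by hand in numpy. This only illustrates the idea and is not how numexpr is implemented; the function name and the chunk size of 4096 elements (32 KiB of float64) are assumptions.

```python
import numpy as np


def chunked_tf_idf_minus_one(tf, idf, chunk=4096):
    """Evaluate tf * idf - 1 block by block, reusing one small temporary."""
    out = np.empty_like(tf)
    tmp = np.empty(chunk, dtype=tf.dtype)      # small enough to stay in cache
    for start in range(0, tf.size, chunk):
        stop = min(start + chunk, tf.size)
        block = tmp[: stop - start]
        # Multiply into the small temporary, then subtract into the output.
        np.multiply(tf[start:stop], idf[start:stop], out=block)
        np.subtract(block, 1, out=out[start:stop])
    return out


rng = np.random.default_rng(0)
tf = rng.random(100_000)
idf = rng.random(100_000)
result = chunked_tf_idf_minus_one(tf, idf)
```

In pure Python the loop overhead can cancel the benefit, which is the argument for doing this in a compiled engine driven by a string expression.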

G Varoquaux 22

slide-48
SLIDE 48

2 Array computing is limited by CPU starvation

[Plot: time per element (1 ns to 5 ns) vs. number of elements (10^3 to 10^6), for numpy, numpy in-place, and numexpr]

tf_idf = tf * idf - 1

What’s going on:

  1. tmp ← tf * idf
  2. tf_idf ← tmp - 1

Big temporary: Moving data in & out of cache

A compilation problem:

  • Removing/reusing temporaries
  • Operating on “chunks” that fit in cache

Addressed by numexpr, with string expressions

Addressed by numba, with bytecode inspection

lazyarray

Similar problem to pagination with SQL queries

G Varoquaux 22

slide-49
SLIDE 49

2 Array computing is limited by CPU starvation

tf_idf = tf * idf

Too small: overhead

Too BIG: out of cache

BIG Data $ $ $

G Varoquaux 23

slide-50
SLIDE 50

2 Numerics versus control flow

What if there is an if

tf_idf = tf / idf
tf_idf[idf == 0] = 0

Suppose that we are looking at ages in a population:

ages[gender == 'male'].mean() - ages[gender == 'female'].mean()
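Both patterns above replace an `if` with a boolean mask. The sketch below uses made-up data; for the division it shows a variant with `np.divide(..., where=...)`, which gives the same result as dividing and then overwriting the zero-denominator entries but avoids the division-by-zero warning.

```python
import numpy as np

tf = np.array([2.0, 0.0, 3.0, 1.0])
idf = np.array([4.0, 0.0, 1.5, 0.0])

# Division with a guard: `where=` skips the zero denominators, and the
# pre-filled zeros in `out` play the role of the "else" branch.
tf_idf = np.divide(tf, idf, out=np.zeros_like(tf), where=idf != 0)

# Selection by mask: difference of mean ages between two groups.
ages = np.array([31, 45, 28, 52, 39])
gender = np.array(["male", "female", "male", "female", "male"])
gap = ages[gender == "male"].mean() - ages[gender == "female"].mean()
```

The second expression is already a small query (select, group, aggregate), which is where the comparison with databases comes from.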

G Varoquaux 24

slide-51
SLIDE 51

2 Numerics versus control flow

What if there is an if

tf_idf = tf / idf
tf_idf[idf == 0] = 0

Suppose that we are looking at ages in a population:

ages[gender == 'male'].mean() - ages[gender == 'female'].mean()

This is really starting to look like databases

pandas: something in between arrays and an in-memory database

Great for queries, less great for numerics.

G Varoquaux 24

slide-52
SLIDE 52

Installation PROBLEMS

Beautiful Python code

Routines in Fortran or C++

Scalability

G Varoquaux 25

slide-53
SLIDE 53

Installation PROBLEMS

Beautiful Python code

Routines in Fortran or C++

Scalability

Deployment PROBLEMS

Beautiful Python code

DATABASE in C++, JAVA, ERLANG...

Scalability

Numpy is the scientist’s equivalent to an ORM

Gives speed with non-Python code

G Varoquaux 25

slide-54
SLIDE 54

numerics vs databases

numerics: efficient on regularly spaced data

But numpy creates cache misses for big arrays

⇒ Need to remove temporaries and chunk data

G Varoquaux 26

slide-55
SLIDE 55

numerics vs databases

numerics: efficient on regularly spaced data

But numpy creates cache misses for big arrays

⇒ Need to remove temporaries and chunk data

selection and grouping: efficient with indexes or trees

⇒ Need to group queries

Compilation is unpythonic

G Varoquaux 26

slide-56
SLIDE 56

numerics vs databases

numerics: efficient on regularly spaced data

But numpy creates cache misses for big arrays

⇒ Need to remove temporaries and chunk data

selection and grouping: efficient with indexes or trees

⇒ Need to group queries

Compilation is unpythonic

A computation & query language?

numexpr

I hate domain-specific languages (SQL)

Numpy is very expressive

G Varoquaux 26

slide-57
SLIDE 57

numerics vs databases

numerics: efficient on regularly spaced data

But numpy creates cache misses for big arrays

⇒ Need to remove temporaries and chunk data

selection and grouping: efficient with indexes or trees

⇒ Need to group queries

Compilation is unpythonic

A computation & query language?

numexpr

I hate domain-specific languages (SQL)

Numpy is very expressive

PonyORM: Compiling Python to optimized SQL

Datascience with SQL: Ibis & Blaze

G Varoquaux 26

slide-58
SLIDE 58

numerics vs databases

numerics: efficient on regularly spaced data

But numpy creates cache misses for big arrays

⇒ Need to remove temporaries and chunk data

selection and grouping: efficient with indexes or trees

⇒ Need to group queries

Spark: java-world “big data” rising star

combines distributed store + computing model

We (scikit-learn) are faster when data fits in RAM

G Varoquaux 26

slide-59
SLIDE 59

Operations on chunks, or algorithms on chunks

Machine learning, data mining = numerics

[Figure: blocks of numeric data]

G Varoquaux 27

slide-60
SLIDE 60

Operations on chunks, or algorithms on chunks

Machine learning, data mining = numerics

[Figure: blocks of numeric data]

ETL (extract, transform, & load)

Multivariate statistics

G Varoquaux 27

slide-61
SLIDE 61

Operations on chunks, or algorithms on chunks

Machine learning, data mining = numerics

[Figure: blocks of numeric data]

ETL (extract, transform, & load)

Multivariate statistics

Out-of-core operations not efficient: no data locality

On-line algorithms (streaming)

eg stochastic gradient descent

As in deep learning
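An on-line algorithm only ever looks at one chunk of data at a time. The following is a minimal sketch of mini-batch stochastic gradient descent for least squares in numpy; the synthetic data, the `chunks` generator (a stand-in for reading blocks from disk or a stream), and the learning rate are all assumptions made for illustration.

```python
import numpy as np


def sgd_least_squares(chunks, n_features, lr=0.1, n_epochs=20):
    """Fit y ~ X @ w by gradient steps on one chunk at a time."""
    w = np.zeros(n_features)
    for _ in range(n_epochs):
        for X_chunk, y_chunk in chunks():
            residual = X_chunk @ w - y_chunk
            gradient = X_chunk.T @ residual / len(y_chunk)
            w -= lr * gradient
    return w


rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0, 0.5])
X = rng.standard_normal((1000, 3))
y = X @ w_true


def chunks(size=100):
    # Stand-in for reading successive blocks from disk or a stream.
    for start in range(0, len(y), size):
        yield X[start:start + size], y[start:start + size]


w = sgd_least_squares(chunks, n_features=3)
```

Only the current chunk and the weight vector need to be in memory, so the same loop works when the full dataset does not fit in RAM.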

G Varoquaux 27

slide-62
SLIDE 62

Making the data-science magic happen

from sklearn import

[Figure: the term-document matrix factorized into a documents-by-topics matrix and a topics-by-terms matrix; terms shown: the, Python, performance, profiling, module, is, code, can, a]

What terms are in a topic

What documents are in a topic

G Varoquaux 28

slide-63
SLIDE 63

Making the data-science magic happen

from sklearn import

Turning applied maths papers to robust code

High-level, readable, simple syntax reduces cognitive load

Thanks

G Varoquaux 28

slide-64
SLIDE 64

3 Beyond numerics

Make #PyData great (again)

G Varoquaux 29

slide-65
SLIDE 65

3 Data/computation flow is crucial

[Figure: blocks of numeric data]

Data-flow engines are everywhere

  • dask: pure-Python, dynamic scheduler, parallel & distributed
  • theano: expression analysis, pure-Python, static compiler
  • tensorflow: C library, distributed

G Varoquaux 30

slide-66
SLIDE 66

3 Data/computation flow is crucial

[Figure: blocks of numeric data]

Data-flow engines are everywhere

Python should shine there: reflexivity + metaprogramming + async

“Python is the best numerical language out there because it’s not a numerical language.” – Nathaniel Smith

API challenging: For algorithm design: no framework / inversion of control

G Varoquaux 30

slide-67
SLIDE 67

3 Ingredients for future data flows

Distributed computation & Run-time analysis

Reflexivity is central:

  • Debugging
  • Interactive work
  • Code analysis
  • Persistence
  • Parallel computing

G Varoquaux 31

slide-68
SLIDE 68

3 Ingredients for future data flows

Distributed computation & Run-time analysis

Reflexivity is central:

  • Debugging
  • Interactive work
  • Code analysis
  • Persistence
  • Parallel computing

Pickle

  • distribute code and data without data model
  • serialize intermediate results
  • deep hash of any data structure: joblib.hash

Very limited (eg no lambda, #19272) ⇒ variants: dill, cloudpickle
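The lambda limitation is easy to reproduce with the standard library alone. In this small sketch, `math.sqrt` stands in for any importable function; the exception types caught cover the behaviour of different Python 3 versions.

```python
import math
import pickle

# Functions are pickled by reference (module + qualified name), so an
# importable function round-trips to the very same object.
restored = pickle.loads(pickle.dumps(math.sqrt))

# A lambda has no importable name, so the standard pickler rejects it.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
```

Variants such as dill and cloudpickle work around this by serializing the function's code rather than a reference to it.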

G Varoquaux 31

slide-69
SLIDE 69

3 Ingredients for future data flows

Distributed computation & Run-time analysis

Reflexivity is central:

  • Debugging
  • Interactive work
  • Code analysis
  • Persistence
  • Parallel computing

Pickle

  • distribute code and data without data model
  • serialize intermediate results
  • deep hash of any data structure: joblib.hash

joblib:

Simple parallel syntax:

Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))

Fast persistence:

joblib.dump(anything, 'filename.pkl.gz')

Primitive for out of core:

pointer = mem.cache(f).call_and_shelve(big_data)

  • Non-invasive syntax / paradigm
  • Fast on big numpy arrays
  • Soon backend system (job broker and persistence)

Gets job management into algorithms (eg in scikit-learn)

G Varoquaux 31

slide-70
SLIDE 70

3 The Python VM is great

The simplicity of the VM is our strength

Software Transactional Memory... would be nice

But, I want to use foreign memory

Java gained jmalloc for foreign memory

Better garbage collection

Yes but, I easily plug into reference counting

A strength of Python is its clear C API ⇒ Easy foreign functionality

G Varoquaux 32

slide-71
SLIDE 71

3 The Python VM is great

The simplicity of the VM is our strength

Software Transactional Memory... would be nice

But, I want to use foreign memory

Java gained jmalloc for foreign memory

Better garbage collection

Yes but, I easily plug into reference counting

A strength of Python is its clear C API ⇒ Easy foreign functionality

Cython: the best of C and Python

  • Add types for speed (numpy arrays as float*)
  • Call C to bind external libraries: surprisingly easy, no pointer arithmetic
  • An adaptation layer between Python VM and C

G Varoquaux 32

slide-72
SLIDE 72

4 Working together

G Varoquaux 33

slide-73
SLIDE 73

4 Scikit-learn is easy machine learning

As easy as py

from sklearn import svm
classifier = svm.SVC()
classifier.fit(X_train, Y_train)
Y_test = classifier.predict(X_test)

People love the encapsulation

classifier is a semi black box

The power of a simple object-oriented API

Documentation-driven development

G Varoquaux 34

slide-74
SLIDE 74

4 Scikit-learn is easy machine learning

As easy as py

from sklearn import svm
classifier = svm.SVC()
classifier.fit(X_train, Y_train)
Y_test = classifier.predict(X_test)

People love the encapsulation

classifier is a semi black box

The power of a simple object-oriented API

Documentation-driven development

High-level, readable, simple API reduces cognitive load

PyData loves Python in return

G Varoquaux 34

slide-75
SLIDE 75

4 Difference is richness

We all do different things

We can all benefit from others, though we don’t know how

G Varoquaux 35

slide-76
SLIDE 76

4 Difference is richness, but requires outreach

We all do different things

We can all benefit from others, though we don’t know how

Being didactic outside one’s community is crucial

Avoiding jargon

take that, machine learning

Prioritizing information

“Simple is better than complex”

Students learning numerics don’t care about unicode

Build documentation upon very simple examples

Think stackoverflow

Sphinx + Sphinx-gallery

G Varoquaux 35

slide-77
SLIDE 77

@GaelVaroquaux

Scientist web dev: Python is the language for data

Python language & VM is perfect to manipulate low-level constructs with high-level wordings

Connects to other paradigms, eg C

slide-78
SLIDE 78

@GaelVaroquaux

Scientist web dev: Python is the language for data

Python language & VM is perfect to manipulate low-level constructs with high-level wordings

Dynamism and reflexivity ⇒ meta-programming and debugging

slide-79
SLIDE 79

@GaelVaroquaux

Scientist web dev: Python is the language for data

Python language & VM is perfect to manipulate low-level constructs with high-level wordings

Dynamism and reflexivity ⇒ meta-programming and debugging

Needs for compilation and dynamism: a difficult balance

PEP 509: guards on run-time modification

PEP 510: function specialization

slide-80
SLIDE 80

@GaelVaroquaux

Scientist web dev: Python is the language for data

Python language & VM is perfect to manipulate low-level constructs with high-level wordings

Dynamism and reflexivity ⇒ meta-programming and debugging

Needs for compilation and dynamism

PyData will use DB and concurrency from web

PyData can give knowledge engineering + AI