Building Data Pipelines in Python, Marco Bonzanini, QCon London 2017



slide-1
SLIDE 1

Building Data Pipelines in Python

Marco Bonzanini

QCon London 2017

slide-2
SLIDE 2

Nice to meet you

slide-3
SLIDE 3

R&D ≠ Engineering

slide-4
SLIDE 4

R&D ≠ Engineering

R&D results in production = high value

slide-5
SLIDE 5
slide-6
SLIDE 6

Big Data Problems vs Big Data Problems

slide-7
SLIDE 7

Data Pipelines (from 30,000ft)

Data → ETL → Analytics

slide-8
SLIDE 8

Data Pipelines (zooming in)

ETL:

  • Extract
  • Transform: Clean, Augment, Join
  • Load
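
The stages named on this slide can be sketched as plain Python functions. This is only an illustration, not code from the talk: the sample data, the field names (`user_id`, `country`), the lookup table and every function name are assumptions.

```python
import csv
import io

# Hypothetical raw input; a real pipeline would read a file or a database.
RAW_CSV = "user_id,country\n1, uk \n2,IT\n,FR\n"

# Hypothetical lookup table used by the join step.
CONTINENTS = {"UK": "Europe", "IT": "Europe", "FR": "Europe"}


def extract(text):
    """Extract: parse the raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop rows without an id and normalise the country code."""
    return [
        {"user_id": int(row["user_id"]), "country": row["country"].strip().upper()}
        for row in rows
        if row["user_id"].strip()
    ]


def augment(rows):
    """Augment: add a derived field to each row."""
    return [dict(row, is_domestic=(row["country"] == "UK")) for row in rows]


def join(rows, lookup):
    """Join: enrich each row from a second data source."""
    return [dict(row, continent=lookup.get(row["country"])) for row in rows]


def load(rows, sink):
    """Load: write the transformed rows to the destination (here, a list)."""
    sink.extend(rows)
    return sink


def run_etl(text):
    """Chain the stages: extract, then clean/augment/join, then load."""
    rows = extract(text)
    rows = join(augment(clean(rows)), CONTINENTS)
    return load(rows, [])
```

Keeping each stage a separate function means each one can be tested and replaced on its own.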

slide-9
SLIDE 9

Good Data Pipelines

Easy to:

  • Reproduce
  • Productise

slide-10
SLIDE 10

Towards Good Data Pipelines

slide-11
SLIDE 11

Towards Good Data Pipelines (a)

Your Data is Dirty

unless proven otherwise

“It’s in the database, so it’s already good”

slide-12
SLIDE 12

Towards Good Data Pipelines (b)

All Your Data is Important

unless proven otherwise

slide-13
SLIDE 13

Towards Good Data Pipelines (b)

All Your Data is Important

unless proven otherwise

Keep it. Transform it. Don’t overwrite it.
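
One way to read "Keep it. Transform it. Don't overwrite it." is that every transformation writes to a new location and the raw input stays untouched. The helper below is a hypothetical sketch of that idea (the name `transform_to_new_file` and its signature are assumptions, not something from the talk).

```python
from pathlib import Path


def transform_to_new_file(raw_path, out_path, transform):
    """Read raw_path and write transform(text) to out_path, leaving the raw file as is."""
    raw_path, out_path = Path(raw_path), Path(out_path)
    if out_path.resolve() == raw_path.resolve():
        raise ValueError("refusing to overwrite the raw data")
    # Mode "x" raises FileExistsError if out_path already exists,
    # so earlier outputs are not silently replaced either.
    with out_path.open("x", encoding="utf-8") as out:
        out.write(transform(raw_path.read_text(encoding="utf-8")))
    return out_path
```

For example, `transform_to_new_file("raw.txt", "clean.txt", str.lower)` would leave `raw.txt` as it was and create `clean.txt` next to it.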

slide-14
SLIDE 14

Towards Good Data Pipelines (c)

Pipelines vs Script Soups

slide-15
SLIDE 15

Tasty, but not a pipeline

Pic: Romanian potato soup from Wikipedia

slide-16
SLIDE 16

$ ./do_something.sh
$ ./do_something_else.sh
$ ./extract_some_data.sh
$ ./join_some_other_data.sh
...

Anti-pattern: the script soup

slide-17
SLIDE 17

Script soups kill replicability

slide-18
SLIDE 18

$ cat ./run_everything.sh
./do_something.sh
./do_something_else.sh
./extract_some_data.sh
./join_some_other_data.sh
$ ./run_everything.sh

Anti-pattern: the master script

slide-19
SLIDE 19

Towards Good Data Pipelines (d)

Break it Down

setup.py and conda

slide-20
SLIDE 20

Towards Good Data Pipelines (e)

Automated Testing

i.e. why scientists don’t write unit tests

slide-21
SLIDE 21

Intermezzo

Let me rant about testing

Icon by Freepik from flaticon.com

slide-22
SLIDE 22

(Unit) Testing

Unit tests in three easy steps:

  • import unittest
  • Write your tests
  • Quit complaining about lack of time to write tests
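
The first two steps can be as small as the sketch below. The function under test, `normalise_whitespace`, is a made-up example and not part of the talk.

```python
import unittest


def normalise_whitespace(text):
    """Collapse runs of whitespace into single spaces (hypothetical helper)."""
    return " ".join(text.split())


class NormaliseWhitespaceTest(unittest.TestCase):

    def test_collapses_inner_whitespace(self):
        self.assertEqual(normalise_whitespace("a  b\t\nc"), "a b c")

    def test_strips_leading_and_trailing_whitespace(self):
        self.assertEqual(normalise_whitespace("  a b  "), "a b")

    def test_empty_string_stays_empty(self):
        self.assertEqual(normalise_whitespace(""), "")
```

Saved in a file whose name starts with `test_`, this can be run with `python -m unittest`.
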
slide-23
SLIDE 23

Benefits of (unit) testing

  • Safety net for refactoring
  • Safety net for lib upgrades
  • Validate your assumptions
  • Document code / communicate your intentions
  • You’re forced to think
slide-24
SLIDE 24

Testing: not convinced yet?

slide-25
SLIDE 25

Testing: not convinced yet?

slide-26
SLIDE 26

Testing: not convinced yet?

f1 = fscore(p, r)
min_bound, max_bound = sorted([p, r])
assert min_bound <= f1 <= max_bound
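
The slide does not show `fscore` itself. The sketch below assumes it is the usual F1 score (the harmonic mean of precision and recall) and shows how the bounds check catches a plausible bug. The names `f1_is_bounded` and `buggy_fscore`, and the small tolerance for floating-point rounding, are additions for illustration.

```python
def fscore(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention assumed here for the degenerate case
    return 2 * precision * recall / (precision + recall)


def f1_is_bounded(precision, recall, tolerance=1e-12):
    """Check that F1 lies between precision and recall, allowing for float rounding."""
    f1 = fscore(precision, recall)
    min_bound, max_bound = sorted([precision, recall])
    return min_bound - tolerance <= f1 <= max_bound + tolerance


def buggy_fscore(precision, recall):
    """A plausible typo: the factor 2 is missing."""
    return precision * recall / (precision + recall)
```

With precision 0.8 and recall 0.6 the buggy version returns a value below 0.6, so the assertion from the slide would fail and expose the mistake.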

slide-27
SLIDE 27

Testing: I’m almost done

  • Unit tests vs Defensive Programming
  • Say no to tautologies
  • Say no to vanity tests
  • The Python ecosystem is rich: py.test, nosetests, hypothesis, coverage.py, …

slide-28
SLIDE 28

</rant>

slide-29
SLIDE 29

Towards Good Data Pipelines (f)

Orchestration

Don’t re-invent the wheel

slide-30
SLIDE 30

You need a workflow manager

Think: GNU Make + Unix pipes + Steroids

slide-31
SLIDE 31

Intro to Luigi

  • Task dependency management
  • Error control, checkpoints, failure recovery
  • Minimal boilerplate
  • Dependency graph visualisation

$ pip install luigi

slide-32
SLIDE 32

Luigi Task: unit of execution

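import luigi

class MyTask(luigi.Task):

    def requires(self):
        # Tasks that must be complete before this one runs
        return [SomeTask()]

    def output(self):
        # Where this task writes its result
        return luigi.LocalTarget(…)

    def run(self):
        # The actual work, delegated to your own library
        mylib.run()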

slide-33
SLIDE 33

Luigi Target: output of a task

class MyTarget(luigi.Target):
    def exists(self):
        ...  # return bool

Great off-the-shelf support: local file system, S3, Elasticsearch, RDBMS (also via luigi.contrib)
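
The Task/Target contract can be imitated with the standard library alone. The sketch below is a toy stand-in, not Luigi's implementation or API: `FileTarget`, `Task`, `build`, `Extract` and `Sort` are all hypothetical names. It only shows the idea that a task is skipped when its output already exists and that dependencies run first.

```python
from pathlib import Path


class FileTarget:
    """Toy target: the output exists when the file exists."""

    def __init__(self, path):
        self.path = Path(path)

    def exists(self):
        return self.path.exists()


class Task:
    """Toy task with the three hooks from the slides: requires, output, run."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


def build(task, log=None):
    """Run dependencies first, then the task, skipping anything already done."""
    log = [] if log is None else log
    if task.output().exists():
        return log  # checkpoint: the output is already there
    for dependency in task.requires():
        build(dependency, log)
    task.run()
    log.append(type(task).__name__)
    return log


class Extract(Task):
    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def output(self):
        return FileTarget(self.workdir / "raw.txt")

    def run(self):
        self.output().path.write_text("3\n1\n2\n", encoding="utf-8")


class Sort(Task):
    def __init__(self, workdir):
        self.workdir = Path(workdir)

    def requires(self):
        return [Extract(self.workdir)]

    def output(self):
        return FileTarget(self.workdir / "sorted.txt")

    def run(self):
        raw = self.requires()[0].output().path.read_text(encoding="utf-8")
        lines = sorted(raw.split())
        self.output().path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Calling `build(Sort(some_dir))` twice runs both tasks the first time and nothing the second time. A real workflow manager adds scheduling, failure handling and visualisation on top of this.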

slide-34
SLIDE 34
slide-35
SLIDE 35

Intro to Airflow

  • Like Luigi, just younger
  • Nicer (?) GUI
  • Scheduling
  • Apache Project
slide-36
SLIDE 36

Towards Good Data Pipelines (g)

When things go wrong

The Joy of debugging

slide-37
SLIDE 37

import logging
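
A minimal sketch of what comes after that import, using only the standard library. The logger name `mypipeline` and the `parse_rows` function are invented for illustration.

```python
import logging

logger = logging.getLogger("mypipeline")  # hypothetical logger name


def configure_logging(level=logging.INFO):
    """Send timestamped records to stderr; call once at program start-up."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )


def parse_rows(rows):
    """Convert strings to ints, logging and skipping the ones that fail."""
    parsed = []
    for row in rows:
        try:
            parsed.append(int(row))
        except ValueError:
            logger.warning("skipping malformed row: %r", row)
    logger.info("parsed %d of %d rows", len(parsed), len(rows))
    return parsed
```

Library code only asks for a logger and emits records. Handlers and levels are configured once by the program that runs the pipeline.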

slide-38
SLIDE 38

Who reads the logs?

You’re not going to read the logs, unless…

  • E-mail notifications (built into Luigi)
  • Slack notifications

$ pip install luigi_slack # WIP

slide-39
SLIDE 39

Towards Good Data Pipelines (h)

Static Analysis

The Joy of Duck Typing

slide-40
SLIDE 40

If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.

— somebody on the Web

slide-41
SLIDE 41

>>> 1.0 == 1 == True
True
>>> 1 + True
2

slide-42
SLIDE 42

>>> '1' * 2
'11'
>>> '1' + 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly

slide-43
SLIDE 43

def do_stuff(a: int, b: int) -> str:
    ...
    return something

PEP 3107: Function Annotations (since Python 3.0)

(annotations are ignored by the interpreter)
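
A short demonstration of "ignored by the interpreter". It reuses the function name from the slide with a made-up body: the annotations are stored on the function object, but nothing enforces them at call time.

```python
def do_stuff(a: int, b: int) -> str:
    return a + b  # nothing checks that this really is a str


# The annotations are stored on the function object...
annotations = do_stuff.__annotations__

# ...but they are not enforced: str arguments are accepted for int parameters,
str_result = do_stuff("1", "2")

# and an int comes back from a function annotated as returning str.
int_result = do_stuff(1, 2)
```

Both calls succeed at runtime. Only an external tool such as a static type checker would report the mismatch.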

slide-44
SLIDE 44

typing module: semantically coherent

PEP 484: Type Hints (since Python 3.5)

(still ignored by the interpreter)
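
A small sketch of what hints from the `typing` module look like in practice. The two functions are invented examples. The local variable annotation uses the syntax added in Python 3.6, slightly later than the version named on the slide.

```python
from typing import Dict, List, Optional


def count_tokens(lines: List[str]) -> Dict[str, int]:
    """Count whitespace-separated tokens across all lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        for token in line.split():
            counts[token] = counts.get(token, 0) + 1
    return counts


def most_common(counts: Dict[str, int]) -> Optional[str]:
    """Return the most frequent token, or None for an empty mapping."""
    if not counts:
        return None
    return max(counts, key=lambda token: counts[token])
```

`Optional[str]` makes the "may return None" case part of the signature, so a type checker can warn callers that forget to handle it.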

slide-45
SLIDE 45

pip install mypy

slide-46
SLIDE 46
  • Add optional types
  • Run:

mypy --follow-imports silent mylib

  • Refine gradual typing (e.g. Any)
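
The third step can be illustrated as below, assuming a record is a dict that maps field names to ints. Both functions are hypothetical. The first uses `Any`, which is compatible with every type and so effectively switches checking off for that value. The second states the expected shape.

```python
from typing import Any, Dict


def get_user_id_untyped(record: Any) -> Any:
    """First pass: Any lets everything through the type checker."""
    return record["user_id"]


def get_user_id(record: Dict[str, int]) -> int:
    """Refined: a checker can now flag callers that pass the wrong shape."""
    return record["user_id"]
```

The runtime behaviour of the two is identical. The difference only shows up when a type checker is run over the calling code.
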
slide-47
SLIDE 47

Summary

Basic engineering principles help

(packaging, testing, orchestration, logging, static analysis, ...)

slide-48
SLIDE 48

Summary

R&D is not Engineering: can we meet halfway?

slide-49
SLIDE 49

Vanity Slide

  • speakerdeck.com/marcobonzanini
  • github.com/bonzanini
  • marcobonzanini.com
  • @MarcoBonzanini