bonobo: Simple ETL in Python 3.5+, by Romain Dorgueil (@rdorgueil)



slide-1
SLIDE 1

bonobo

Simple ETL in Python 3.5+

slide-2
SLIDE 2

Romain Dorgueil @rdorgueil

  • CTO/Hacker in Residence, L’Atelier BNP Paribas
  • Technical Co-founder, WeAreTheShops
  • (Solo) Founder, RDC Dist. Agency
  • Eng. Manager, Sensio/SensioLabs
  • Developer, AffiliationWizard

  • Felt too young in a Linux Cauldron
  • Dismantler of Atari computers
  • Basic literacy using a Minitel
  • Guitars & accordions
  • Off by one baby
  • Inception

slide-3
SLIDE 3

STARTUP ACCELERATION PROGRAMS

NO HYPE, JUST BUSINESS launchpad.atelier.net

slide-4
SLIDE 4

bonobo

Simple ETL in Python 3.5+

slide-5
SLIDE 5

Plan

  • History of Extract Transform Load
      • Concept; Existing tools; Related tools; Ignition
  • Practical Bonobo
      • Tutorial; Under the hood; Demo; Plugins & Extensions; More demos
  • Wrap up
      • Present & future; Resources; Sprint; Feedback

slide-6
SLIDE 6

Once upon a time…

slide-7
SLIDE 7

Extract Transform Load

  • Not new. Popular concept in the 1970s [1] [2]
  • Everywhere. Commerce, websites, marketing, finance, …

[1] https://en.wikipedia.org/wiki/Extract,_transform,_load
[2] https://www.sas.com/en_us/insights/data-management/what-is-etl.html

slide-8
SLIDE 8

Extract Transform Load

[Diagram: the rows foo, bar, baz pass through three stages labelled Extract, Transform, Load]

slide-9
SLIDE 9

Extract Transform Load

[Diagram: the foo, bar, baz pipeline (Extract, Transform, Load) with additional labels: Transform, more, Join, DB, HTTP POST, log?]

slide-10
SLIDE 10

Data Integration Tools

  • Pentaho Data Integration (IDE/Java)
  • Talend Open Studio (IDE/Java)
  • CloverETL (IDE/Java)
slide-11
SLIDE 11

Talend Open Studio

slide-12
SLIDE 12
slide-13
SLIDE 13
slide-14
SLIDE 14

Data Integration Tools

  • Java + IDE based, for most of them
  • Data transformations are blocks
  • IO flow managed by connections
  • Execution

GUI first, eventually code :-(

slide-15
SLIDE 15

In the Python world …

  • Bubbles (https://github.com/stiivi/bubbles)
  • PETL (https://github.com/alimanfoo/petl)
  • (insert a few more here)
  • and now… Bonobo (https://www.bonobo-project.org/)

You can also use amazing libraries including Joblib, Dask, Pandas, Toolz, but ETL is not their main focus.

slide-16
SLIDE 16

Other scales…

slide-17
SLIDE 17

Small Automation Tools

  • Mostly aimed at simple recurring tasks.
  • Cloud / SaaS only.

slide-18
SLIDE 18

Big Data Tools

  • Can do anything. And probably more. Fast.
  • Either needs an infrastructure, or cloud based.
slide-19
SLIDE 19

Story time

slide-20
SLIDE 20

Partner 1 Data Integration

slide-21
SLIDE 21

WE GOT DEALS !!!

slide-22
SLIDE 22

Partner 1 Partner 2 Partner 3 Partner 4 Partner 5 Partner 6 Partner 7 Partner 8 Partner 9 …

slide-23
SLIDE 23

Tiny bug there… Can you fix it?

slide-24
SLIDE 24
slide-25
SLIDE 25

My need

  • A data integration / ETL tool using code as configuration.
  • Preferably Python code.
  • Something that can be tested (I mean, by a machine).
  • Something that can use inheritance.
  • Fast & cheap to install on a laptop, but designed for servers too.
slide-26
SLIDE 26

And that’s Bonobo

slide-27
SLIDE 27

It is …

  • A framework to write ETL jobs in Python 3 (3.5+)
  • Using the same concepts as the old ETLs.
  • You can use OOP!

Code first. Eventually a GUI will come.

slide-28
SLIDE 28

It is NOT …

  • Pandas / R Dataframes
  • Dask (but will probably implement a dask.distributed strategy someday)
  • Luigi / Airflow
  • Hadoop / Big Data / Big Query / …
  • A monkey (spoiler: it’s an ape, damn it, French language…)
slide-29
SLIDE 29

Let’s see…

slide-30
SLIDE 30

Create a project

~ $ pip install bonobo
~ $ bonobo init europython/tutorial
~ $ bonobo run europython/tutorial

slide-31
SLIDE 31

TEMPLATE

~ $ bonobo run .

…demo

slide-32
SLIDE 32

Write our own

import bonobo

def extract():
    yield 'euro'
    yield 'python'
    yield '2017'

def transform(s):
    return s.title()

def load(s):
    print(s)

graph = bonobo.Graph(
    extract,
    transform,
    load,
)
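
To make the data flow explicit, here is a minimal plain-Python sketch of what a linear chain like this one amounts to: each value yielded by the first callable is passed through the following ones in order. The helper `run_linear_chain` is hypothetical and only illustrates the idea; it is not how Bonobo executes a graph (no threads, no queues), and the results are collected in a list instead of printed.

```python
def extract():
    yield 'euro'
    yield 'python'
    yield '2017'


def transform(s):
    return s.title()


def run_linear_chain(first, *others):
    """Hypothetical helper (not part of Bonobo): feed every value
    produced by `first` through the remaining callables, in order."""
    results = []
    for value in first():
        for step in others:
            value = step(value)
        results.append(value)
    return results


titles = run_linear_chain(extract, transform)
# titles == ['Euro', 'Python', '2017']
```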

slide-33
SLIDE 33

EXAMPLE_1

~ $ bonobo run .

…demo

slide-34
SLIDE 34

EXAMPLE_1

~ $ bonobo run first.py

…demo

slide-35
SLIDE 35

Under the hood…

slide-36
SLIDE 36

graph = bonobo.Graph(…)

slide-37
SLIDE 37

[Graph diagram. Nodes: BEGIN, CsvReader('clients.csv'), InsertOrUpdate('db.site', 'clients', key='guid'), update_crm, retrieve_orders]

slide-38
SLIDE 38

Graph…

class Graph:
    def __init__(self, *chain):
        self.edges = {}
        self.nodes = []
        self.add_chain(*chain)

    def add_chain(self, *nodes, _input=None, _output=None):
        # ...
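
The body of `add_chain` is elided on the slide. As an illustration only, the sketch below shows one way such a method could record a chain: nodes are stored in a list and edges map a node index to the set of indexes that follow it. The names `MiniGraph`, `add_node` and the `BEGIN` sentinel value are assumptions made for this sketch, not Bonobo's actual implementation.

```python
BEGIN = '<begin>'  # hypothetical sentinel marking the graph's entry point


class MiniGraph:
    """Illustrative sketch only, not Bonobo's Graph class."""

    def __init__(self, *chain):
        self.edges = {BEGIN: set()}
        self.nodes = []
        self.add_chain(*chain)

    def add_node(self, node):
        # Store the callable and return its index in the node list.
        self.nodes.append(node)
        return len(self.nodes) - 1

    def add_chain(self, *nodes, _input=BEGIN):
        # Link each node to the previous one, starting from `_input`.
        for node in nodes:
            index = self.add_node(node)
            self.edges.setdefault(_input, set()).add(index)
            _input = index
        return self


graph = MiniGraph(str.strip, str.title, print)
# A second chain can branch off an existing node (here, node 0).
graph.add_chain(len, _input=0)
```

With this representation, a fork is simply a node whose edge set holds more than one index.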

slide-39
SLIDE 39

bonobo.run(graph)

…or in a shell:

$ bonobo run main.py

slide-40
SLIDE 40

[Graph diagram. Nodes: BEGIN, CsvReader('clients.csv'), InsertOrUpdate('db.site', 'clients', key='guid'), update_crm, retrieve_orders]

slide-41
SLIDE 41

[Graph diagram. Nodes: BEGIN, CsvReader('clients.csv'), InsertOrUpdate('db.site', 'clients', key='guid'), update_crm, retrieve_orders. Each of the four nodes after BEGIN is annotated "Context + Thread"]

slide-42
SLIDE 42

Context…

class GraphExecutionContext:
    def __init__(self, graph, plugins, services):
        self.graph = graph
        self.nodes = [
            NodeExecutionContext(node, parent=self)
            for node in self.graph
        ]
        self.plugins = [
            PluginExecutionContext(plugin, parent=self)
            for plugin in plugins
        ]
        self.services = services

slide-43
SLIDE 43

Strategy…

class ThreadPoolExecutorStrategy(Strategy):
    def execute(self, graph, plugins, services):
        context = self.create_context(graph, plugins, services)
        executor = self.create_executor()

        for node_context in context.nodes:
            executor.submit(
                self.create_runner(node_context)
            )

        while context.alive:
            self.sleep()

        executor.shutdown()

        return context
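
The "one context and one thread per node" idea can be sketched with the standard library alone. The code below is an assumption-laden illustration, not Bonobo's code: `run_node`, `execute_chain` and the `END` marker are hypothetical names, the graph is restricted to a linear chain, and nodes are plain one-argument functions connected by `queue.Queue` objects.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

END = object()  # hypothetical end-of-stream marker, not a Bonobo name


def run_node(node, inbox, outbox):
    """Apply `node` to every item from `inbox`; forward non-None results."""
    try:
        while True:
            item = inbox.get()
            if item is END:
                break
            result = node(item)
            if result is not None:
                outbox.put(result)
    finally:
        # Always signal the end of the stream, even if `node` raised.
        outbox.put(END)


def execute_chain(source, *nodes):
    """Run a linear chain with one worker thread per node."""
    queues = [queue.Queue() for _ in range(len(nodes) + 1)]
    with ThreadPoolExecutor(max_workers=max(len(nodes), 1)) as executor:
        futures = [
            executor.submit(run_node, node, queues[i], queues[i + 1])
            for i, node in enumerate(nodes)
        ]
        for item in source:
            queues[0].put(item)
        queues[0].put(END)
        for future in futures:
            future.result()  # re-raise any exception from a worker

    # Drain the last queue into a list.
    results = []
    while True:
        item = queues[-1].get()
        if item is END:
            break
        results.append(item)
    return results


shouted = execute_chain(['euro', 'python', '2017'], str.title, lambda s: s + '!')
```

Because each node has exactly one worker and the queues are FIFO, rows leave the chain in the order they entered it.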

slide-44
SLIDE 44

</ implementation details >

slide-45
SLIDE 45

Transformations

a.k.a

nodes in the graph

slide-46
SLIDE 46

Functions

def get_more_infos(api, **row):
    more = api.query(row.get('id'))
    return {
        **row,
        **(more or {}),
    }
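
A quick way to see what this function returns is to call it by hand with a stub. `FakeApi` below is a hypothetical stand-in for whatever object is passed as `api`; only the `query` method used on the slide is assumed. Note that keys coming from `more` override keys already present in the row, and that a missing record leaves the row unchanged.

```python
class FakeApi:
    """Hypothetical stand-in for the `api` argument."""

    def __init__(self, records):
        self.records = records

    def query(self, id):
        # Returns None when the id is unknown.
        return self.records.get(id)


def get_more_infos(api, **row):
    more = api.query(row.get('id'))
    return {
        **row,
        **(more or {}),
    }


api = FakeApi({1: {'email': 'ada@example.com', 'name': 'Ada L.'}})
known = get_more_infos(api, id=1, name='Ada')
unknown = get_more_infos(api, id=2, name='Bob')
```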

slide-47
SLIDE 47

Generators

def join_orders(order_api, **row):
    for order in order_api.get(row.get('customer_id')):
        yield {
            **row,
            **order,
        }
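
The point of a generator node is that one input row can produce zero, one or many output rows. The sketch below exercises `join_orders` with a hypothetical `FakeOrderApi` stub (only the `get` method from the slide is assumed): a customer with two orders yields two rows, a customer without orders yields none.

```python
class FakeOrderApi:
    """Hypothetical stand-in for the `order_api` argument."""

    def __init__(self, orders_by_customer):
        self.orders_by_customer = orders_by_customer

    def get(self, customer_id):
        return self.orders_by_customer.get(customer_id, [])


def join_orders(order_api, **row):
    for order in order_api.get(row.get('customer_id')):
        yield {
            **row,
            **order,
        }


order_api = FakeOrderApi({42: [{'order_id': 'A1'}, {'order_id': 'A2'}]})
joined = list(join_orders(order_api, customer_id=42, name='Ada'))
nothing = list(join_orders(order_api, customer_id=7, name='Bob'))
```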

slide-48
SLIDE 48

Iterators

extract = (
    'foo',
    'bar',
    'baz',
)

extract = range(0, 1001, 7)

slide-49
SLIDE 49

Classes

class RiminizeThis:
    def __call__(self, **row):
        return {
            **row,
            'Rimini': 'Woo-hou-wo...',
        }

Anything, as long as it’s callable().

slide-50
SLIDE 50

Configurable classes

from bonobo.config import Configurable, Option, Service

class QueryDatabase(Configurable):
    table_name = Option(str, default='customers')
    database = Service('database.default')

    def call(self, database, **row):
        customer = database.query(
            self.table_name,
            customer_id=row['clientId'],
        )
        return {
            **row,
            'is_customer': bool(customer),
        }

slide-54
SLIDE 54

Configurable classes

query_database = QueryDatabase(
    table_name='test_customers',
    database='database.testing',
)
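
To show why declaring options on the class is convenient (defaults, per-instance overrides, inheritance), here is a standard-library sketch of the pattern. `Option` and `MiniConfigurable` below are simplified, hypothetical stand-ins written for this example; they are not `bonobo.config` classes and make no claim about how Bonobo implements them.

```python
class Option:
    """Hypothetical stand-in: a typed option with a default value."""

    def __init__(self, type=None, default=None):
        self.type = type
        self.default = default


class MiniConfigurable:
    """Hypothetical base class: copies declared options onto the instance."""

    def __init__(self, **kwargs):
        for name, option in self.declared_options().items():
            value = kwargs.pop(name, option.default)
            if option.type is not None and value is not None:
                value = option.type(value)
            setattr(self, name, value)
        if kwargs:
            raise TypeError('Unknown options: ' + ', '.join(sorted(kwargs)))

    @classmethod
    def declared_options(cls):
        # Walk the MRO from base to subclass so subclasses can override.
        options = {}
        for klass in reversed(cls.__mro__):
            for name, attr in vars(klass).items():
                if isinstance(attr, Option):
                    options[name] = attr
        return options


class QueryTable(MiniConfigurable):
    table_name = Option(str, default='customers')

    def __call__(self, **row):
        return {**row, 'table': self.table_name}


class QueryTestTable(QueryTable):
    # Inheritance: same behaviour, different default.
    table_name = Option(str, default='test_customers')


default_node = QueryTable()
custom_node = QueryTable(table_name='archive')
test_node = QueryTestTable()
```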

slide-55
SLIDE 55

Services

slide-56
SLIDE 56

Define as names

class QueryDatabase(Configurable):
    database = Service('database.default')

    def call(self, database, **row):
        return { … }

slide-57
SLIDE 57

Runtime injection

import bonobo

graph = bonobo.Graph(...)

def get_services():
    return {
        'database.default': MyDatabaseImpl()
    }
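
The idea behind "define as names, inject at runtime" can be sketched without Bonobo: a node declares which service name it needs for which parameter, and the runner looks the name up in the dictionary returned by `get_services()` just before calling the node. Everything below (`FakeDatabase`, `is_customer`, `call_with_services`, the `requirements` mapping) is hypothetical and only illustrates the lookup; Bonobo's own resolution mechanism may differ.

```python
class FakeDatabase:
    """Hypothetical in-memory implementation used in place of a real database."""

    def __init__(self, customers):
        self.customers = customers

    def query(self, table_name, customer_id):
        return self.customers.get((table_name, customer_id))


def get_services():
    return {
        'database.default': FakeDatabase({('customers', 7): {'guid': 'abc'}}),
    }


def is_customer(database, **row):
    customer = database.query('customers', customer_id=row['clientId'])
    return {**row, 'is_customer': bool(customer)}


def call_with_services(node, requirements, services, **row):
    """Resolve each required service name, then call the node with the row."""
    injected = {param: services[name] for param, name in requirements.items()}
    return node(**injected, **row)


services = get_services()
requirements = {'database': 'database.default'}
found = call_with_services(is_customer, requirements, services, clientId=7)
missing = call_with_services(is_customer, requirements, services, clientId=8)
```

Swapping the implementation for tests then only means returning a different dictionary from `get_services()`; the node itself does not change.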

slide-58
SLIDE 58

Bananas!

slide-59
SLIDE 59

Library

bonobo.FileReader(…)
bonobo.CsvReader(…)
bonobo.JsonReader(…)
bonobo.PickleReader(…)
bonobo.ExcelReader(…)
bonobo.XMLReader(…)
… more to come

bonobo.FileWriter(…)
bonobo.CsvWriter(…)
bonobo.JsonWriter(…)
bonobo.PickleWriter(…)
bonobo.ExcelWriter(…)
bonobo.XMLWriter(…)
… more to come

slide-60
SLIDE 60

Library

bonobo.Limit(limit)
bonobo.PrettyPrinter()
bonobo.Filter(…)
… more to come

slide-61
SLIDE 61

Extensions & Plugins

slide-62
SLIDE 62

Console Plugin

slide-63
SLIDE 63

Jupyter Plugin

slide-64
SLIDE 64

SQLAlchemy Extension

bonobo_sqlalchemy.Select(
    query,
    *,
    pack_size=1000,
    limit=None
)

bonobo_sqlalchemy.InsertOrUpdate(
    table_name,
    *,
    fetch_columns,
    insert_only_fields,
    discriminant,
    …
)

PREVIEW

slide-65
SLIDE 65

Docker Extension

$ pip install bonobo[docker]
$ bonobo runc myjob.py

PREVIEW

slide-66
SLIDE 66

Dev Kit

PREVIEW

https://github.com/python-bonobo/bonobo-devkit

slide-67
SLIDE 67

More examples

?

slide-68
SLIDE 68

EXAMPLE_1 -> EXAMPLE_2

…demo

  • Use filesystem service.
  • Write to a CSV
  • Also write to JSON
slide-69
SLIDE 69

EXAMPLE_3

Rimini open data

slide-70
SLIDE 70

~/bdk/demos/europython2017

Europython attendees

featuring… Jupyter Notebook, Selenium & Firefox

slide-71
SLIDE 71

~/bdk/demos/sirene

French companies registry

featuring… Docker, PostgreSQL, SQLAlchemy

slide-72
SLIDE 72

Wrap up

slide-73
SLIDE 73

Young

  • First commit: December 2016
  • 23 releases, ~420 commits, 4 contributors
  • Current "stable": 0.4.3
  • Target: 1.0 early 2018
slide-74
SLIDE 74

Python 3.5+

  • {**}
  • async/await
  • (…, *, …)
  • GIL :(
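
Two of the features listed above appear throughout the examples in this deck: dictionary unpacking (`{**a, **b}`) to build a new row from an existing one, and keyword-only arguments (the bare `*` in a signature). A small sketch, with made-up names (`defaults`, `overrides`, `select`) chosen only for illustration:

```python
# Dictionary unpacking: later keys win, the inputs are left untouched.
defaults = {'table_name': 'customers', 'limit': 10}
overrides = {'limit': 100}
merged = {**defaults, **overrides}


# Keyword-only arguments: everything after the bare `*` must be named.
def select(query, *, pack_size=1000, limit=None):
    return {'query': query, 'pack_size': pack_size, 'limit': limit}


named = select('SELECT 1', limit=5)

try:
    select('SELECT 1', 500)  # positional value for a keyword-only argument
    positional_rejected = False
except TypeError:
    positional_rejected = True
```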
slide-75
SLIDE 75

1.0

  • 100% Open-Source.
  • Light & Focused.
  • Very few dependencies.
  • Comprehensive standard library.
  • The rest goes to plugins and extensions.
slide-76
SLIDE 76

Small scale

  • 1 minute to install
  • Easy to deploy
  • NOT: Big Data, Statistics, Analytics …
  • IS: Lean manufacturing for data
slide-77
SLIDE 77

Interwebs are crazy

slide-78
SLIDE 78

Data Processing for Humans

slide-79
SLIDE 79

www.bonobo-project.org
docs.bonobo-project.org
bonobo-slack.herokuapp.com
github.com/python-bonobo

Let me know what you think!

slide-80
SLIDE 80

Sprint

  • Sprints at Europython are amazing
  • Nice place to learn about Bonobo, basics, etc.
  • Nice place to contribute while learning.
  • You’re amazing.
slide-81
SLIDE 81

Thank you!

@monkcage @rdorgueil

https://goo.gl/e25eoa

slide-82
SLIDE 82

bonobo

@monkcage