Mastering a data pipeline with Python: 6 years of learned lessons, from mistakes to success

Robson Júnior

Me
Developer, 16+ years
Handles (Telegram, Twitter, GitHub): BSAO0, BSAO

Agenda
• It's not about code
• Anatomy of a data product
• Lambda vs Kappa architecture
• Qualities of a data pipeline
• Where Python matters

My goal is to help you start planning great data-driven products.

Anatomy of a data product API’s Logs Jobs & Datasets DB DB Ingress Processes Egress Veracity / Velocity Veracity Volume / Variety Credits: Lars Albertsson https://www.youtube.com/watch?v=IVEl0bsTbdg

Same as a computer program
Input → Processes → Output
• Components shown: APIs, files, memory (RAM), functions, variables
Credits: Lars Albertsson https://www.youtube.com/watch?v=IVEl0bsTbdg
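The comparison with a computer program can be sketched in a few lines. This is a minimal illustration, not code from the talk: the JSON-per-line log format, the field names, and the function names `ingress`, `process`, and `egress` are all assumptions chosen to mirror the three stages.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def ingress(raw_lines: Iterable[str]) -> Iterator[dict]:
    """Input: parse raw log lines (hypothetical JSON-per-line format)."""
    for line in raw_lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def process(events: Iterable[dict]) -> dict:
    """Processes: count events per user, like a function working on variables."""
    return dict(Counter(event["user"] for event in events))


def egress(counts: dict) -> str:
    """Output: serialise the dataset for a downstream database or API."""
    return json.dumps(counts, sort_keys=True)


raw = [
    '{"user": "ana", "action": "click"}',
    '{"user": "bob", "action": "view"}',
    '{"user": "ana", "action": "view"}',
]
result = egress(process(ingress(raw)))
print(result)  # {"ana": 2, "bob": 1}
```

Each stage only talks to the next one through plain data, which is what lets the stages be tested and replaced separately.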

Lambda (Λ) vs Kappa (Κ) architecture

Lambda
Data ingress feeds two layers:
• Speed layer: stream → real-time views
• Batch layer: all data → batch views
• Serving layer: answers queries from the batch views and the real-time views

Applications
• Systems that require permanently stored data.
• User queries based on immutable data.
• Users or systems that require a huge number of updates to the data and serve it as new datasets.

Pros
• Reliable and safe.
• Fault tolerant (you can reprocess everything from scratch).
• Scalable in each batch cycle.
• Manages all the historical data in a distributed file system (delta lake).

Cons
• Premature data modelling: it gets hard to migrate schemas or datasets.
• Might be expensive due to the volume of data you need to process.
• Code can become complex due to the separation of concerns between the layers.
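A toy version of the three Lambda layers, as a sketch only. The page-view events and all function names are hypothetical, and plain in-memory lists stand in for the distributed storage and stream a real system would use.

```python
from collections import Counter

# Hypothetical page-view events: (event_id, page). Nothing here comes from the talk.
master_dataset = [(1, "home"), (2, "pricing"), (3, "home")]  # all data, immutable
recent_events = [(4, "home"), (5, "docs")]  # arrived after the last batch run


def build_batch_view(events):
    """Batch layer: recompute the view from scratch over all stored data."""
    return Counter(page for _, page in events)


def build_realtime_view(events):
    """Speed layer: cover only what the batch view has not seen yet."""
    return Counter(page for _, page in events)


def query(page, batch_view, realtime_view):
    """Serving layer: answer a query by merging both views."""
    return batch_view[page] + realtime_view[page]


batch_view = build_batch_view(master_dataset)
realtime_view = build_realtime_view(recent_events)
print(query("home", batch_view, realtime_view))  # 3
```

Note that the counting logic appears twice, once per layer. Even in this tiny sketch that duplication hints at the complexity listed under Cons.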

Kappa
Data → speed layer (stream) → real-time views → query

Pro tip: Unless you are desperate for real-time answers, stay with batch processing.

Applications
• You need a well-defined event order and want to interact with your dataset at any time.
• Systems that need real-time learning (social networks, ads platforms, fraud detection).
• Focus on the code changes.

Pros
• Uses fewer resources than the Lambda architecture.
• Leverages machine learning on a real-time basis.
• Horizontally scalable.
• You only need to reprocess the data when the code changes.

Cons
• Errors in data processing need better exception handling.
• You might have to stop the pipeline to get bugs fixed.
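The "reprocess only when the code changes" idea can be sketched with an ordered event log that is replayed through whichever version of the job is current. Everything below is an assumption for illustration: the event fields, both job versions, and the in-memory list standing in for a durable log (in practice this role is often played by a log such as Kafka).

```python
# Hypothetical append-only event log, ordered by offset.
event_log = [
    {"offset": 0, "user": "ana", "amount": 10},
    {"offset": 1, "user": "bob", "amount": 5},
    {"offset": 2, "user": "ana", "amount": 7},
]


def total_per_user_v1(view, event):
    """First version of the stream job: sum amounts per user."""
    view[event["user"]] = view.get(event["user"], 0) + event["amount"]
    return view


def total_per_user_v2(view, event):
    """Changed code: ignore amounts below 6 (an arbitrary rule for illustration)."""
    if event["amount"] >= 6:
        view[event["user"]] = view.get(event["user"], 0) + event["amount"]
    return view


def replay(log, job):
    """Rebuild a real-time view by replaying the log in offset order."""
    view = {}
    for event in sorted(log, key=lambda e: e["offset"]):
        view = job(view, event)
    return view


view_v1 = replay(event_log, total_per_user_v1)
view_v2 = replay(event_log, total_per_user_v2)
print(view_v1)  # {'ana': 17, 'bob': 5}
print(view_v2)  # {'ana': 17}
```

There is a single code path here, unlike the Lambda sketch with separate batch and speed layers. The price is that a bug in the job affects the only view you have until the log is replayed.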

Qualities of a pipeline
It's a computer program :) The problems are almost the same.
If you see something that can go wrong in software, it will probably go wrong in a data pipeline.

Security
• Access levels for each level of data
• Privacy over all layers

Separation of concerns
• Use a common format
• Avoid hard-coding / duplication
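One way to avoid hard-coded paths and keep a single common format is to load settings once and pass them to every job. This is a sketch under stated assumptions: the class name, the field names, the `PIPELINE_*` environment variable names, and the choice of Parquet as the default format are all hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Settings shared by every job; field names are hypothetical."""
    input_path: str
    output_path: str
    file_format: str = "parquet"  # one common format for all stages


def load_config(env=None) -> PipelineConfig:
    """Read settings from environment variables instead of hard-coding them."""
    env = os.environ if env is None else env
    return PipelineConfig(
        input_path=env["PIPELINE_INPUT_PATH"],
        output_path=env["PIPELINE_OUTPUT_PATH"],
        file_format=env.get("PIPELINE_FILE_FORMAT", "parquet"),
    )


# A plain dict is passed here so the example runs without touching the real environment.
config = load_config({
    "PIPELINE_INPUT_PATH": "/data/raw",
    "PIPELINE_OUTPUT_PATH": "/data/clean",
})
print(config.file_format)  # parquet
```

Because the dataclass is frozen, a job cannot silently change a setting halfway through a run.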

Automation
• Versioning
• Use the power of different tech platforms
• CI/CD
• Code review / lint

Monitoring
• Let the cloud help you (cheap and fast)
• Avoid vendor lock-in
• Infrastructure monitoring

Testable and traceable
• Regression tests
• Inputs must be deterministic
• Focus on testing the (internal) units of the pipeline
• Test all the third-party components
• After all that, implement an end-to-end test
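A small example of a deterministic unit and its test, written so pytest would pick it up by its `test_` name. The transformation and record fields are hypothetical; the point is that the current date is passed in as an argument rather than read from the clock, so the same input always produces the same output.

```python
from datetime import date


def days_since_signup(record: dict, today: date) -> dict:
    """Pure transformation: 'today' is passed in, never read from the clock."""
    signup = date.fromisoformat(record["signup_date"])
    return {**record, "days_since_signup": (today - signup).days}


def test_days_since_signup():
    record = {"user": "ana", "signup_date": "2024-01-01"}
    result = days_since_signup(record, today=date(2024, 1, 31))
    assert result["days_since_signup"] == 30
    # The input record is not mutated, so the step can be re-run safely.
    assert "days_since_signup" not in record


test_days_since_signup()
```

The same pattern applies to random seeds and external lookups: anything that can vary between runs becomes an explicit input.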

Python plays well with all technologies

Where Python matters: ELT
• PySpark - Apache Spark Python API.
• dask - A flexible parallel computing library for analytic computing.
• luigi - A module that helps you build complex pipelines of batch jobs.
• mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services.
• Ray - A system for parallel and distributed Python that unifies the machine learning ecosystem.

Where Python matters: Streaming
• faust - A stream processing library, porting the ideas from Kafka Streams to Python.
• streamparse - Run Python code against real-time streams of data via Apache Storm.

Where Python matters: Analysis
• Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
• Blaze - NumPy and Pandas interface to Big Data.
• Open Mining - Business Intelligence (BI) in Pandas interface.
• Orange - Data mining, data visualization, analysis and machine learning through visual programming or scripts.
• Optimus - Agile Data Science Workflows made easy with PySpark.

Where Python matters: Management & Scheduling
• Airflow - A platform to programmatically author, schedule and monitor workflows.
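The core idea behind workflow schedulers, running tasks in dependency order, can be shown with the standard library alone. This sketch deliberately does not use Airflow's API; the task names and the `run_task` helper are hypothetical stand-ins, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each key runs only after the tasks in its set have finished.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "export": {"aggregate"},
}

run_log = []


def run_task(name):
    """Stand-in for real work; it only records the execution order."""
    run_log.append(name)


for task in TopologicalSorter(dependencies).static_order():
    run_task(task)

print(run_log)
```

"report" and "export" both depend only on "aggregate", so their relative order is not fixed. A real scheduler adds what this sketch leaves out: retries, time-based triggers, and monitoring.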

Where Python matters: Testing
• pytest - A mature full-featured Python testing tool.
• mimesis - A Python library that helps you generate fake data.
• fake2db - Fake database generator.
• https://github.com/holdenk/spark-testing-base - A Python framework for implementing PySpark tests.
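Fake data is most useful in tests when it is reproducible. The sketch below uses only the standard library's `random` module with a fixed seed rather than mimesis or fake2db; the record fields and the `fake_orders` name are assumptions for illustration.

```python
import random


def fake_orders(n, seed=42):
    """Generate n fake order records; the fixed seed makes every run identical."""
    rng = random.Random(seed)
    users = ["ana", "bob", "carla"]
    return [
        {"order_id": i, "user": rng.choice(users), "amount": rng.randint(1, 100)}
        for i in range(n)
    ]


first = fake_orders(5)
second = fake_orders(5)
print(first == second)  # True
```

Using a private `random.Random` instance instead of the module-level functions keeps the generator independent of any other code that touches the global random state.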

Where Python matters: Validation
• Cerberus - A lightweight and extensible data validation library.
• schema - A library for validating Python data structures.
• voluptuous - A Python data validation library.
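To show what record validation looks like at the edge of a pipeline, here is a hand-rolled stand-in written with the standard library only. It is not the API of Cerberus, schema, or voluptuous, which offer far richer rules; the schema layout and the `validate` function are hypothetical.

```python
# Hypothetical schema: field name -> (expected type, required?)
SCHEMA = {
    "user": (str, True),
    "amount": (int, True),
    "coupon": (str, False),
}


def validate(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []
    for field, (expected_type, required) in schema.items():
        if field not in record:
            if required:
                errors.append(f"{field}: missing required field")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    for field in record:
        if field not in schema:
            errors.append(f"{field}: unknown field")
    return errors


good = validate({"user": "ana", "amount": 10})
bad = validate({"user": "ana", "amount": "10", "extra": 1})
print(good)  # []
print(bad)   # ['amount: expected int', 'extra: unknown field']
```

Returning all errors at once, instead of raising on the first, makes it easier to route bad records to a quarantine dataset with a full explanation.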

Obrigado / Thank you
Questions?
HELLO@BSAO.ME
