Data's Inferno: 7 layers of data testing hell
Who are we?
- Globally active bank, based in Amsterdam, the Netherlands
- 52,000 employees, 38.4 million customers
- Wholesale Banking Advanced Analytics (WBAA)
- Works for corporate clients, like Shell and Unilever
- Consists mostly of Data Scientists (booo!) and Data Engineers (yeaaah!)
- Builds data-driven algorithmic products
- Big Data and Data Science Consultancy, based in Amsterdam, the Netherlands
- 40 people: 2/3 data scientists, 1/3 data engineers
- Provides expertise in machine learning, big data, cloud, and scalable architectures
- Helps organizations become more data-driven
- Develops production-ready data applications
Real data sucks
Square peg, round hole. So, you thought it was square? Test the pegs!
Data quality is a problem everywhere and always!
Airflow 101

Define your data pipeline in Python, as a (complex) sequence of tasks called a Directed Acyclic Graph (DAG), which can be scheduled.
t1 = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)

t2 = BashOperator(
    task_id='sleep',
    bash_command='sleep 5',
    retries=3,
    dag=dag)

t2.set_upstream(t1)
Our environment
Seems like a highway to hell?
Layer I: DAG Integrity Tests
DAG integrity test (Layer 1)
Example: ssh_execute_operator → ssh_operator (renamed between Airflow versions)

Two main use cases:
- Airflow version upgrade testing in the Continuous Integration (CI) pipeline
- Sanity and typo checking of our DAGs
task_a = BashOperator(task_id='task_a', …)
test_task_a = BashOperator(task_id='test_task_a', …)
task_b = BashOperator(task_id='task_a', …)  # typo: duplicate task_id!
DAG integrity test (Layer 1)
- We used to spend a lot of time fixing typos and logic errors in DAGs, only after uploading them to Airflow to actually run
- You can now do this in your CI thanks to the guys and girls at CloverHealth
for each DAG.py in /dags/: assert all objects are valid Airflow DAGs
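A minimal sketch of such a test in pytest, using Airflow's DagBag to import every DAG file and surface any import error; the dag_folder path is an assumption:

from airflow.models import DagBag

def test_dag_integrity():
    # Import every DAG file under dags/ and collect any import errors
    dagbag = DagBag(dag_folder='dags/', include_examples=False)
    assert not dagbag.import_errors, \
        'DAG import failures: {}'.format(dagbag.import_errors)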
Layer II: Split Data Ingestion from Data Deployments
Split your data ingestion from your data deployment (Layer 2)
- Data comes in from many different sources
- Create an ingestion DAG per source
- Create an interface for systems that do the same thing, e.g. payment transactions
- Let your data deployment pipeline for your project work with "clean" data
- Make sure you add a debugging column from your source, like a unique ID, if none exists. Also add a column indicating which source it came from (see the sketch below).
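A minimal PySpark sketch of that last point; the function and column names are illustrative, not from the talk:

from pyspark.sql import functions as F

def add_debug_columns(df, source_name):
    # Tag every ingested row with its source system and a surrogate row id
    return (df
            .withColumn('source_system', F.lit(source_name))
            .withColumn('row_id', F.monotonically_increasing_id()))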
Layer III: Data Tests
Data tests (Layer 3)
- After every action we take, we have a test to check whether that step went as expected
- We split this up into testing ingestion of data sources and testing data deployment steps
Source → Central Data Store:
- Are there files available for ingestion?
- Did we get the columns that we expected?
- Are the rows that are in there valid? (Join it)
- Did the count only increase? (one such check is sketched below)
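Checks like these can run as ordinary tasks in the ingestion DAG. A minimal sketch using Airflow's CheckOperator, which fails the task when the first value of the query result is falsy; the connection id and table name are assumptions:

from airflow.operators.check_operator import CheckOperator

rows_valid = CheckOperator(
    task_id='check_rows_valid',
    conn_id='central_data_store',  # assumed connection to the data store
    sql="SELECT COUNT(*) = 0 FROM transactions WHERE iban IS NULL",
    dag=dag)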
Data tests (Layer 3)

The data deployment pipeline also contains tests along the way:
- Are the known private individuals filtered out?
- Are the known companies still in?
- Do all the output rows have a classification?
- Has the aggregation of PIs (private individuals) worked correctly? (see the sketch below)
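Each of these translates to a small query over the pipeline's output; a sketch for the classification check, with table and column names assumed:

# Fail the deployment step if any output row is missing a classification
assert spark.sql("""SELECT COUNT(*) ct
                    FROM classified_output
                    WHERE classification IS NULL""").first().ct == 0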
Layer IV: Chuck Norris
Alerting by Chuck Norris (Layer 4)
- Pointing out mistakes of others :)
- People owning their mistakes :)
- Go take a look, something blew up
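"Chuck Norris" here is a chat bot that shouts when a task fails. A minimal sketch using Airflow's on_failure_callback; the Slack webhook URL, function name, and message are assumptions:

import requests

SLACK_WEBHOOK_URL = 'https://hooks.slack.com/services/...'  # assumed webhook

def chuck_norris_alert(context):
    # Called by Airflow on task failure; context carries the task instance
    ti = context['task_instance']
    requests.post(SLACK_WEBHOOK_URL, json={
        'text': 'Go take a look, something blew up: {}.{}'.format(
            ti.dag_id, ti.task_id)})

default_args = {'on_failure_callback': chuck_norris_alert}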
Layer V: Nuclear GIT
Nuclear GIT (Layer 5)
if PRD≈DEV:
Nuclear GIT (Layer 5)
# this will hard reset all repos to the version on the master branch
# any local commits that have not been pushed yet will be lost
echo "Resetting "${dir%/*}
git fetch
git checkout -f master
git reset --hard origin/master
git clean -df
git submodule update --recursive --force
Don’t copy paste this at home!
Layer VI: Mock Pipeline Tests
Mock pipeline tests (Layer 6)
- Control the exact data that goes into your pipeline
- A pipeline has two variables: code and data
- With the data fixed, code is the only variable, allowing you to test your logic
Mock pipeline tests (Layer 6)
Step 1: Create fake data that looks like your real data in a pytest fixture
Step 2: Run your code in pytest
Step 3: Check if your task returns the data that you expect

PERSONS = [{'name': 'Kim Yong Un', 'country': 'North Korea', 'iban': 'NK99NKBK0000000666'}, …]
TRANSACTIONS = [{'iban': 'NK99NKBK0000000666', 'amount': 10}, …]

filter_data(spark, PERSONS, TRANSACTIONS)

assert spark.sql("""SELECT COUNT(*) ct
                    FROM filtered_data
                    WHERE iban = 'NK99NKBK0000000666'""").first().ct == 0
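A sketch of the pytest setup behind Step 1, assuming a local pyspark installation; filter_data is the talk's illustrative task under test, not a real API:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope='session')
def spark():
    # One local SparkSession shared by the whole test session
    return (SparkSession.builder
            .master('local[1]')
            .appName('mock-pipeline-tests')
            .getOrCreate())

def test_filter_data_drops_known_persons(spark):
    persons = [{'name': 'Kim Yong Un', 'country': 'North Korea',
                'iban': 'NK99NKBK0000000666'}]
    transactions = [{'iban': 'NK99NKBK0000000666', 'amount': 10}]
    filter_data(spark, persons, transactions)  # the task under test
    assert spark.sql("""SELECT COUNT(*) ct FROM filtered_data
                        WHERE iban = 'NK99NKBK0000000666'""").first().ct == 0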
Layer VII: DTAP
DTAP (Layer 7)
So… you've passed 6 layers. It will work now, right?
REAL DATA SUCKS
DTAP (Layer 7)
DEV
- Quickly run your pipeline on a very small subset of your data
- In our case 0.0025% of all data
- Nothing will make sense, but it's a nice integration test

TST
- Select a subset of your data for data that you know
- Immediately see if something is off
- Still quick to run

ACC
- Carbon copy of production
- You can check if you feel comfortable pushing to PRD
- Give access to a Product Owner for them to check

PRD
- Greenlight procedure for merging from ACC to PRD
- Manual operation
DTAP (Layer 7)
- 4 branches (dev, tst, acc, prd), each separately checked out in your /dags/ directory
- An environment.conf file outside of GIT in the corresponding directory
- Automatic promotion of code from dev to tst and tst to acc if everything went "green" in the DAG
- TriggerDagRunOperator to trigger the next DAG automatically for dev to tst and tst to acc (see the sketch below)
DEV → TST → ACC → PRD (the first two promotions are automatic, ACC → PRD is manual)
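A minimal sketch of such a promotion trigger, assuming Airflow 1.x, where the python_callable confirms the trigger by returning the dag_run object; the DAG ids are illustrative:

from airflow.operators.dagrun_operator import TriggerDagRunOperator

def confirm_trigger(context, dag_run_obj):
    # Returning the dag_run_obj tells Airflow to go ahead with the trigger
    return dag_run_obj

promote_to_tst = TriggerDagRunOperator(
    task_id='promote_to_tst',
    trigger_dag_id='pipeline_tst',  # assumed name of the TST copy of this DAG
    python_callable=confirm_trigger,
    dag=dag)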
Local testing of Airflow with Whirl
Colleagues Bas Beelen and Kris Geusebroek made some very nice improvements after our time on this project: https://www.youtube.com/watch?v=jqK_HCOJ9Ak

High-level overview:
- Data is confidential, so we can't take it local
- There are many different DAGs, some of which are very complex
- Whirl speeds up development by:
  - Letting you reuse standard components of a DAG
  - Letting you test your DAG locally, end to end, with fake data using Docker
- Open-sourcing the code is in the pipeline with Bas and Kris :)
Now take your time to understand it all!
Blogpost: https://medium.com/@ingwbaa/datas-inferno-7-circles-of-data-testing-hell-with-airflow-cef4adff58d8
Github: https://github.com/danielvdende/data-testing-with-airflow