Testing and documenting your data doesn't have to suck - Data Council

Testing and documenting your data doesn't have to suck. Data Council NYC - Nov 2019. @abeGong. About me (Abe): data scientist/engineer; tech-first and "enterprise"; human-scale, ethical data; first time in NYC as an adult (?!).


SLIDE 1

@abeGong

Testing and documenting your data doesn’t have to suck

Data Council NYC - Nov 2019

SLIDE 2

About me (Abe)

  • Data scientist/engineer
  • Tech-first and “enterprise”
  • Human-scale, ethical data
  • First time in NYC as an adult (?!)

SLIDE 3

Outline

  1. A thing we do that is ABSOLUTELY CRAZY
  2. How to defeat pipeline debt
  3. Volunteers wanted!

SLIDE 4

a thing we do

that is

ABSOLUTELY CRAZY

SLIDE 5

a thing we do that is ABSOLUTELY CRAZY

SLIDE 6

a thing we do that is ABSOLUTELY CRAZY

Undocumented

SLIDE 7

a thing we do that is ABSOLUTELY CRAZY

Undocumented Untested

SLIDE 8

a thing we do that is ABSOLUTELY CRAZY

Undocumented Untested Unstable

SLIDE 12

Trying to maintain a data system that is untested, undocumented and unstable is ABSOLUTELY CRAZY

SLIDE 13

?

SLIDE 14

a thing we do that is ABSOLUTELY CRAZY

Give the monster a name

  • > Pipeline debt

SLIDE 15

a thing we do that is ABSOLUTELY CRAZY

Give the monster a name

  • > Pipeline debt

The monster’s name is pipeline debt.

SLIDE 16

Always know what to expect from your data

SLIDE 17

great_expectations

Expectations are assertions about data

  • expect_column_to_exist
  • expect_table_row_count_to_be_between
  • expect_column_values_to_be_unique
  • expect_column_values_to_not_be_null
  • expect_column_values_to_be_between
  • expect_column_values_to_match_regex
  • expect_column_values_to_match_strftime_format
  • expect_column_mean_to_be_between
  • expect_column_kl_divergence_to_be_less_than
  • etc. etc. etc.
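To make "expectations are assertions about data" concrete, here is a minimal sketch in plain Python. It does not use the great_expectations library: the functions borrow three expectation names from the slide, but their signatures, the list-of-dicts table, the sample rows, and the shape of the returned dict are all assumptions made for illustration.

```python
# Toy versions of three expectations over a list-of-dicts "table".
# NOT the great_expectations API: signatures and result shape are assumptions.


def expect_column_to_exist(rows, column):
    """Succeed if every row has the given key."""
    return {"success": all(column in row for row in rows)}


def expect_column_values_to_not_be_null(rows, column):
    """Succeed if no row has None in the given column."""
    unexpected = [row for row in rows if row.get(column) is None]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


def expect_column_values_to_be_unique(rows, column):
    """Succeed if no value appears more than once in the column."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row.get(column)
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    # unexpected_count here is the number of distinct duplicated values.
    return {"success": not duplicates, "unexpected_count": len(duplicates)}


# Hypothetical sample data: one duplicated id and one missing temperature.
rows = [
    {"id": 1, "room_temp": 68},
    {"id": 2, "room_temp": None},
    {"id": 2, "room_temp": 71},
]

results = {
    "exists": expect_column_to_exist(rows, "room_temp"),
    "not_null": expect_column_values_to_not_be_null(rows, "room_temp"),
    "unique": expect_column_values_to_be_unique(rows, "id"),
}

for name, result in results.items():
    print(name, result)
```

Each function returns a result instead of raising, so a pipeline can collect every failure in one pass and decide afterwards whether to stop.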

SLIDE 21

Expectations are assertions about data

Expectation Types

SLIDE 22

Expectations are assertions about data

  • Expectation Types
  • Data Sources
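One way to read the pairing of expectation types with data sources is that the same expectation can be evaluated against different backends. The sketch below illustrates that reading only; it is not Great Expectations code. The helper names, the `readings` table, and the result dict are hypothetical, and SQLite stands in for whatever database a real pipeline would use.

```python
import sqlite3


def not_null_in_rows(rows, column):
    """Check the expectation against in-memory rows (a list of dicts)."""
    nulls = sum(1 for row in rows if row.get(column) is None)
    return {"success": nulls == 0, "unexpected_count": nulls}


def not_null_in_sql(connection, table, column):
    """Check the same expectation by pushing a COUNT query to the database."""
    # Identifiers are interpolated directly; only safe for trusted names.
    query = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
    (nulls,) = connection.execute(query).fetchone()
    return {"success": nulls == 0, "unexpected_count": nulls}


# Hypothetical sample data, loaded into both "data sources".
rows = [{"room_temp": 68}, {"room_temp": None}, {"room_temp": 71}]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE readings (room_temp REAL)")
connection.executemany(
    "INSERT INTO readings (room_temp) VALUES (?)",
    [(row["room_temp"],) for row in rows],
)

in_memory_result = not_null_in_rows(rows, "room_temp")
sql_result = not_null_in_sql(connection, "readings", "room_temp")
print(in_memory_result)
print(sql_result)
```

Both calls report the same outcome for the same data, which is the property that lets one expectation definition travel between data sources.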

SLIDE 23

How to draw an owl

  1. Draw some circles
  2. Draw the rest of the stupid owl

SLIDE 24

Great Expectations has a bunch of shiny new features

SLIDE 25

Great Expectations has a bunch of shiny new features

  • Stores
  • Profilers
  • Renderers and Views
  • Validation Operators
  • Data Context and Data Asset namespace
  • Expectation Types
  • Data Sources

SLIDE 26

Great Expectations has a bunch of shiny new features

SLIDE 29

Set up data testing in a day, not a month.

SLIDE 30

Your docs are your tests, and your tests are your docs.

Icons created by SBTS from Noun Project

SLIDE 31

Your docs are your tests, and your tests are your docs.

https://www.locallyoptimistic.com/post/data_dictionaries/

SLIDE 32

Your docs are your tests, and your tests are your docs.

expect_column_values_to_be_between(
    column="room_temp",
    min_value=60,
    max_value=75,
    mostly=.95
)

“Values in this column should be between 60 and 75, at least 95% of the time.”

“Warning: more than 5% of values fell outside the specified range of 60 to 75.”
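The slide shows one expectation producing both a documentation sentence and a validation warning. A rough sketch of how a single definition could drive both follows. The dict layout and the `render_doc` and `validate` helpers are assumptions for illustration, not the library's API; only the parameter names and the two sentences come from the slide.

```python
# One expectation definition drives both the documentation and the test.
# The dict layout and helper names are assumptions, not the library's API.

expectation = {
    "column": "room_temp",
    "min_value": 60,
    "max_value": 75,
    "mostly": 0.95,
}


def render_doc(exp):
    """Turn the expectation into a human-readable sentence."""
    return (
        f"Values in this column should be between {exp['min_value']} and "
        f"{exp['max_value']}, at least {exp['mostly']:.0%} of the time."
    )


def validate(exp, values):
    """Return (success, message) for a non-empty list of numeric values."""
    in_range = sum(exp["min_value"] <= v <= exp["max_value"] for v in values)
    fraction = in_range / len(values)
    if fraction >= exp["mostly"]:
        return True, "OK"
    return False, (
        f"Warning: more than {1 - exp['mostly']:.0%} of values fell outside "
        f"the specified range of {exp['min_value']} to {exp['max_value']}."
    )


# Hypothetical readings: 8 of 10 in range, which is below the 95% threshold.
readings = [70] * 8 + [90, 40]

print(render_doc(expectation))
print(validate(expectation, readings))
```

Because both strings are generated from the same dict, changing `max_value` updates the documentation and the test together, so the two cannot drift apart.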
SLIDE 33

Your docs are your tests, and your tests are your docs.

SLIDE 34

Warning: Great Expectations still has rough edges

SLIDE 35

Warning: Great Expectations still has rough edges

  • Stores
  • Profilers
  • Renderers and Views
  • Validation Operators
  • Data Context and Data Asset namespace
  • Expectation Types
  • Data Sources

SLIDE 36

Volunteers wanted!

  1. Pick a day
  2. Work with us
  3. Get set up
  4. Improve the project

How to get in touch:

👌

https://greatexpectations.io/slack

SLIDE 37

Recap

SLIDE 38

Trying to maintain a data system that is untested, undocumented and unstable is ABSOLUTELY CRAZY

SLIDE 39

a thing we do that is ABSOLUTELY CRAZY

Give the monster a name

  • > Pipeline debt

The monster’s name is pipeline debt.

SLIDE 40

To defeat pipeline debt, always know what to expect of your data.

  • expect_column_to_exist
  • expect_table_row_count_to_be_between
  • expect_column_values_to_be_unique
  • expect_column_values_to_not_be_null
  • expect_column_values_to_be_between
  • expect_column_values_to_match_regex
  • expect_column_values_to_match_strftime_format
  • expect_column_mean_to_be_between
  • expect_column_kl_divergence_to_be_less_than
  • etc. etc. etc.

SLIDE 41

Set up data testing in a day, not a month.

SLIDE 42

Your docs are your tests, and your tests are your docs.

Icons created by SBTS from Noun Project

SLIDE 43

Warning: Great Expectations still has rough edges

SLIDE 44

Volunteers wanted!

  1. Pick a day
  2. Work with us
  3. Get set up
  4. Improve the project

How to get in touch:

👌

https://greatexpectations.io/slack

SLIDE 45

@abeGong

https://greatexpectations.io/slack

Thank you, New York!