Dirty Data: It’s a mess. It’s your problem. Friso van Vollenhoven @fzk


Oct 22, 2022


SLIDE 1

GoDataDriven

PROUDLY PART OF THE XEBIA GROUP

@fzk frisovanvollenhoven@godatadriven.com

Dirty Data

Friso van Vollenhoven

It’s a mess. It’s your problem.

SLIDE 2

SLIDE 3

SLIDE 4

SLIDE 5

SLIDE 6

SLIDE 7

'februari-22 2013'

SLIDE 8

SLIDE 9

A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.

SLIDE 10

SLIDE 11

SLIDE 12

þ

SLIDE 13

þ

SLIDE 14

TSV == thorn separated values?

SLIDE 15

þ == 0xFE

SLIDE 16

Or -2, in Hive

CREATE TABLE browsers (
  browser_id STRING,
  browser STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '-2';
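The step from þ to 0xFE to -2 can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes the files are ISO-8859-1 (Latin-1) encoded, where þ is the single byte 0xFE, and that the -2 in the table definition is that same byte read as a signed 8-bit value. The sample record is made up for illustration.

```python
# þ (LATIN SMALL LETTER THORN) is one byte, 0xFE, in ISO-8859-1 (Latin-1).
thorn_byte = "þ".encode("latin-1")

# Read as an unsigned byte, the value is 254 (0xFE).
unsigned_value = thorn_byte[0]

# Read as a signed 8-bit integer (the way a Java byte is interpreted), it is -2.
signed_value = int.from_bytes(thorn_byte, byteorder="big", signed=True)

# Splitting a raw record on that byte recovers the fields.
# The record below is a hypothetical example, not data from the talk.
record = b"42\xfeFirefox"
fields = [part.decode("latin-1") for part in record.split(b"\xfe")]

print(hex(unsigned_value), signed_value, fields)
```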

SLIDE 17

SLIDE 18

SLIDE 19

SLIDE 20

SLIDE 21

SLIDE 22

SLIDE 23

The format will change
Faulty deliveries will occur
Your parser will break
Records will be mistakenly produced (over-logging)
Other people test in production too (and you get the data from it)

Etc., etc.
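One way to live with a list like this is a parser that sets bad records aside and counts them instead of failing the whole job. The sketch below is an illustration, not the parser from the talk: the two-field layout mirrors the browsers table, and the names `parse_lines`, `DELIMITER` and `EXPECTED_FIELDS` are hypothetical.

```python
from collections import Counter

DELIMITER = "\xfe"   # thorn, as in the slides
EXPECTED_FIELDS = 2  # assumption: browser_id and browser


def parse_lines(lines):
    """Split each line on the delimiter; set aside anything that does not fit."""
    good, rejected, counts = [], [], Counter()
    for line in lines:
        fields = line.rstrip("\n").split(DELIMITER)
        # Accept only records with the expected number of non-empty fields.
        if len(fields) == EXPECTED_FIELDS and all(fields):
            good.append(tuple(fields))
            counts["parsed"] += 1
        else:
            # Keep the raw line so it can be inspected or re-parsed later.
            rejected.append(line)
            counts["rejected"] += 1
    return good, rejected, counts


good, rejected, counts = parse_lines(["1\xfeFirefox\n", "garbage\n", "2\xfe\n"])
```

The counts are the kind of number a metrics service can track over time, so a sudden rise in rejected records shows up as a format change rather than as a crashed job.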

SLIDE 24

Simple deployment of ETL code
Scheduling
Scalable
Independent jobs
Fixable data store
Incremental where possible
Metrics

SLIDE 25

EXTRACT TRANSFORM LOAD

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

SLIDE 30

SLIDE 31

No JVM startup overhead for Hadoop API usage
Relatively concise syntax (Python)
Mix Python standard library with any Java libs

SLIDE 32

SLIDE 33

Flexible scheduling with dependencies
Saves output
E-mails on errors
Scales to multiple nodes
REST API
Status monitor
Integrates with version control

SLIDE 34

SLIDE 35

Deployment

git push jenkins master

SLIDE 36

Independent jobs

source (external) -> staging (HDFS) -> hive-staging (HDFS) -> Hive

source -> staging: HDFS upload + move in place
staging -> hive-staging: MapReduce + HDFS move
hive-staging -> Hive: Hive map external table + SELECT INTO
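The "upload + move in place" step is what keeps the jobs independent: a downstream job never sees a half-written file, because the file only appears under its final name once it is complete. A minimal sketch of the pattern, using the local filesystem as a stand-in for HDFS (the function name and the `.incoming-` prefix are assumptions for illustration):

```python
import os
import tempfile


def upload_then_move(data: bytes, final_path: str) -> None:
    """Write to a temporary name first, then rename into place."""
    directory = os.path.dirname(final_path) or "."
    # The temporary file lives in the target directory, so the rename
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".incoming-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Rename in a single step; readers see either no file or the whole file.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial upload if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

On HDFS the same idea would be an upload to a temporary path followed by a rename into the directory the next job reads from.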

SLIDE 37

Out of order jobs

At any point, you don’t really know what ‘made it’ to Hive
Will happen anyway, because some days the data delivery is going to be three hours late
Or you get half in the morning and the other half later in the day

It really depends on what you do with the data
This is where metrics + fixable data store help...

SLIDE 38

Fixable data store

Using Hive partitions
Jobs that move data from staging create partitions
When new data / insight about the data arrives, drop the partition and re-insert
Be careful to reset any metrics in this case
Basically: instead of trying to make everything transactional, repair afterwards
Use metrics to determine whether data is fit for purpose
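The drop-and-re-insert repair can be sketched as a small helper that builds the two HiveQL statements for one partition. Everything named here is an assumption for illustration: the table and column names are hypothetical, the staging table is assumed to carry the partition value as an ordinary column, and the values are assumed to come from trusted job configuration (there is no escaping).

```python
def repair_partition_statements(table, staging_table, columns,
                                partition_col, partition_value):
    """Build HiveQL that drops one partition and re-inserts it from staging."""
    partition_spec = f"{partition_col}='{partition_value}'"
    column_list = ", ".join(columns)
    return [
        # Throw away whatever made it into the partition so far.
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({partition_spec})",
        # Re-create the partition from the (corrected) staging data.
        f"INSERT OVERWRITE TABLE {table} PARTITION ({partition_spec}) "
        f"SELECT {column_list} FROM {staging_table} "
        f"WHERE {partition_col}='{partition_value}'",
    ]


statements = repair_partition_statements(
    "browsers", "browsers_staging", ["browser_id", "browser"], "dt", "2013-02-22"
)
for statement in statements:
    print(statement + ";")
```

Because the unit of repair is a whole partition, running the helper twice for the same day leaves the table in the same state, which is what makes "repair afterwards" cheaper than making every job transactional.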

SLIDE 39

Metrics

SLIDE 40

Metrics service

Job ran, so many units processed, took so much time

e.g. 10GB imported, took 1 hr
e.g. 60M records transformed, took 10 minutes
Dropped partition
Inserted X records into partition
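A metrics service of this kind can be very small. The sketch below is an in-memory stand-in, not the service from the talk: the class names, fields and event strings are hypothetical. It records the events listed above and resets a partition's metrics when that partition is dropped for repair.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    job: str
    partition: str
    event: str
    units: int
    seconds: float


@dataclass
class MetricsLog:
    entries: list = field(default_factory=list)

    def record(self, job, partition, event, units=0, seconds=0.0):
        """Store one 'job ran, N units processed, took T seconds' event."""
        self.entries.append(Metric(job, partition, event, units, seconds))

    def reset_partition(self, partition):
        """Forget earlier metrics for a partition that is being re-inserted."""
        self.entries = [m for m in self.entries if m.partition != partition]
        self.record("repair", partition, "dropped partition")

    def units_for(self, partition, event):
        """Total units recorded for one event type within one partition."""
        return sum(m.units for m in self.entries
                   if m.partition == partition and m.event == event)


log = MetricsLog()
log.record("transform", "2013-02-22", "records transformed",
           units=60_000_000, seconds=600)
log.record("load", "2013-02-22", "records inserted", units=100)
log.reset_partition("2013-02-22")  # partition dropped for repair
log.record("load", "2013-02-22", "records inserted", units=120)
```

After the reset, only the re-inserted count remains for the partition, so a fitness check does not double-count records from the earlier load.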

SLIDE 41

GoDataDriven

We’re hiring / Questions? / Thank you!

@fzk frisovanvollenhoven@godatadriven.com Friso van Vollenhoven