Dirty Data: It's a mess. It's your problem. Friso van Vollenhoven (@fzk)



slide-1
SLIDE 1

GoDataDriven

PROUDLY PART OF THE XEBIA GROUP

@fzk frisovanvollenhoven@godatadriven.com

Dirty Data

Friso van Vollenhoven

It’s a mess. It’s your problem.

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7

'februari-22 2013'
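
A minimal sketch of how a value like the one on this slide could be handled, assuming it is a date field delivered with a Dutch month name in the shape `<month>-<day> <year>`. The slide shows only the string itself; the function name `parse_dutch_date`, the pattern, and the choice to return `None` for anything unparseable are illustrative assumptions.

```python
import re
from datetime import date

# Dutch month names mapped to month numbers.
DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}

# Assumed shape: month name, hyphen, day, whitespace, four-digit year.
_PATTERN = re.compile(r"^\s*([a-z]+)-(\d{1,2})\s+(\d{4})\s*$", re.IGNORECASE)


def parse_dutch_date(raw):
    """Return a date for strings like 'februari-22 2013', or None if unparseable."""
    match = _PATTERN.match(raw)
    if match is None:
        return None
    month = DUTCH_MONTHS.get(match.group(1).lower())
    if month is None:
        return None
    try:
        return date(int(match.group(3)), month, int(match.group(2)))
    except ValueError:
        # Day out of range for the month, e.g. 'februari-30 2013'.
        return None
```

Returning `None` rather than raising lets the caller count and log rejected values instead of failing the whole job on one bad record.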

slide-8
SLIDE 8
slide-9
SLIDE 9

A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.

slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12

þ

slide-13
SLIDE 13

þ

slide-14
SLIDE 14

TSV == thorn separated values?

slide-15
SLIDE 15

þ == 0xFE

slide-16
SLIDE 16
Or -2, in Hive

CREATE TABLE browsers (
  browser_id STRING,
  browser STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '-2';
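
The relationship between þ, 0xFE and -2 can be checked with a short sketch. It assumes the files are Latin-1 encoded, where þ is the single byte 0xFE; read as a signed byte, that value is -2. The helper names and the sample record are hypothetical.

```python
import struct

# The thorn character as one raw byte (its Latin-1 encoding).
THORN_BYTE = b"\xfe"


def signed_byte_value(single_byte):
    """Interpret one byte as a signed 8-bit integer."""
    return struct.unpack("b", single_byte)[0]


def split_thorn_line(raw_line):
    """Split a raw record on 0xFE and decode each field as Latin-1.

    Latin-1 is an assumption here; it maps every byte to a character,
    so decoding itself never fails on unexpected input.
    """
    fields = raw_line.rstrip(b"\r\n").split(THORN_BYTE)
    return [field.decode("latin-1") for field in fields]
```

Splitting on the raw byte before decoding avoids depending on how a text reader would interpret 0xFE in other encodings.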

slide-17
SLIDE 17
slide-18
SLIDE 18
slide-19
SLIDE 19
slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22
slide-23
SLIDE 23
  • The format will change
  • Faulty deliveries will occur
  • Your parser will break
  • Records will be mistakenly produced (over-logging)
  • Other people test in production too (and you get the data from it)
  • Etc., etc.
slide-24
SLIDE 24
  • Simple deployment of ETL code
  • Scheduling
  • Scalable
  • Independent jobs
  • Fixable data store
  • Incremental where possible
  • Metrics
slide-25
SLIDE 25

EXTRACT TRANSFORM LOAD

slide-26
SLIDE 26
slide-27
SLIDE 27
slide-28
SLIDE 28
slide-29
SLIDE 29
slide-30
SLIDE 30
slide-31
SLIDE 31
  • No JVM startup overhead for Hadoop API usage
  • Relatively concise syntax (Python)
  • Mix Python standard library with any Java libs
slide-32
SLIDE 32
slide-33
SLIDE 33
  • Flexible scheduling with dependencies
  • Saves output
  • E-mails on errors
  • Scales to multiple nodes
  • REST API
  • Status monitor
  • Integrates with version control
slide-34
SLIDE 34
slide-35
SLIDE 35

Deployment

git push jenkins master

slide-36
SLIDE 36

Independent jobs

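source (external) -> staging (HDFS) -> hive-staging (HDFS) -> Hive

  • source (external) -> staging (HDFS): HDFS upload + move in place
  • staging (HDFS) -> hive-staging (HDFS): MapReduce + HDFS move
  • hive-staging (HDFS) -> Hive: Hive map external table + SELECT INTO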
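
The "upload + move in place" step can be sketched as a write to a temporary name followed by a rename, so that a downstream job never sees a half-written file. This sketch uses the local filesystem as a stand-in; it assumes a rename within HDFS plays the same role in the actual pipeline, which the slides do not show. The function name `publish_atomically` and the `.incoming-` prefix are hypothetical.

```python
import os
import tempfile


def publish_atomically(data, final_path):
    """Write data to a temporary file next to final_path, then rename it into place."""
    target_dir = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file in the target directory so the rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".incoming-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # os.replace is atomic when source and destination share a filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the partial file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because each stage only picks up files that have been moved into place, the jobs can run independently of each other's timing.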

slide-37
SLIDE 37

Out of order jobs

  • At any point, you don’t really know what ‘made it’ to Hive
  • Will happen anyway, because some days the data delivery is going to be three hours late
  • Or you get half in the morning and the other half later in the day
  • It really depends on what you do with the data
  • This is where metrics + fixable data store help...
slide-38
SLIDE 38

Fixable data store

  • Using Hive partitions
  • Jobs that move data from staging create partitions
  • When new data / insight about the data arrives, drop the partition and re-insert
  • Be careful to reset any metrics in this case
  • Basically: instead of trying to make everything transactional, repair afterwards
  • Use metrics to determine whether data is fit for purpose
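
The drop-and-re-insert repair can be sketched as a helper that builds the two HiveQL statements for one partition. The table names, the partition column `dt`, and the assumption that the staging table carries the same column are hypothetical; the slides do not show the actual schema or job code. Identifiers and values are trusted here, not escaped.

```python
def repair_partition_statements(table, staging_table, partition_column,
                                partition_value, columns):
    """Build HiveQL that drops one partition and re-inserts it from staging.

    Assumes the staging table has a column with the same name as the
    partition column, so the same predicate selects the rows to reload.
    """
    spec = "{0}='{1}'".format(partition_column, partition_value)
    column_list = ", ".join(columns)
    drop = "ALTER TABLE {0} DROP IF EXISTS PARTITION ({1})".format(table, spec)
    insert = (
        "INSERT INTO TABLE {0} PARTITION ({1}) "
        "SELECT {2} FROM {3} WHERE {1}"
    ).format(table, spec, column_list, staging_table)
    return [drop, insert]
```

A caller would run the two statements in order and reset any metrics recorded for that partition before recording the new insert.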

slide-39
SLIDE 39

Metrics

slide-40
SLIDE 40

Metrics service

  • Job ran, so many units processed, took so much time
  • e.g. 10GB imported, took 1 hr
  • e.g. 60M records transformed, took 10 minutes
  • Dropped partition
  • Inserted X records into partition
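
The kinds of events listed above can be sketched with a small in-memory recorder. This is a stand-in under stated assumptions, not the metrics service from the talk: the class `MetricsLog`, its method names, and the event fields are all hypothetical.

```python
import time
from contextlib import contextmanager


class MetricsLog:
    """In-memory stand-in for a metrics service."""

    def __init__(self):
        self.events = []

    def record(self, job, event, units=None, seconds=None):
        """Store one event, e.g. 'dropped partition' or 'records inserted'."""
        self.events.append(
            {"job": job, "event": event, "units": units, "seconds": seconds})

    @contextmanager
    def timed(self, job, event):
        """Time a block of work; the block reports its unit count via the yielded dict.

        The event is recorded only if the block finishes without raising.
        """
        result = {"units": 0}
        started = time.monotonic()
        yield result
        self.record(job, event, units=result["units"],
                    seconds=time.monotonic() - started)
```

A job would wrap its work in `with metrics.timed("transform", "records transformed") as run:` and set `run["units"]` to the number of records it handled; one-off events such as a dropped partition go through `record` directly.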
slide-41
SLIDE 41

GoDataDriven

We’re hiring / Questions? / Thank you!

@fzk frisovanvollenhoven@godatadriven.com Friso van Vollenhoven