GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
@fzk frisovanvollenhoven@godatadriven.com
Dirty Data
Friso van Vollenhoven
It’s a mess. It’s your problem.
'februari-22 2013'
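A date string like the one above ('februari' is Dutch for February) will not go through a standard date parser. A minimal sketch of normalizing it in Python, assuming the only variations are Dutch month names and loose separators; the function name and the regular expression are illustrative, not from the talk:

```python
import re
from datetime import date

# Dutch month names mapped to month numbers.
DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}

# Assumed shape: <month name><dash or space><day><space or comma><year>.
PATTERN = re.compile(r"^\s*([a-z]+)[-\s]+(\d{1,2})[,\s]+(\d{4})\s*$", re.IGNORECASE)


def parse_dirty_date(text):
    """Return a date for strings like 'februari-22 2013', or None if unparseable."""
    match = PATTERN.match(text)
    if match is None:
        return None
    month_name, day, year = match.groups()
    month = DUTCH_MONTHS.get(month_name.lower())
    if month is None:
        return None
    try:
        return date(int(year), month, int(day))
    except ValueError:
        # The day does not exist in that month (for example februari-30).
        return None


parsed = parse_dirty_date("februari-22 2013")
```

Returning None instead of raising keeps one bad record from killing a whole job; the caller decides whether to count, log, or drop it.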
A: Yes, sometimes as often as 1 in every 10K calls. Or about once […] week at 3K files / day.
CREATE TABLE browsers (
  browser_id STRING,
  browser STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '-2';
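The '-2' delimiter looks like a typo but is probably deliberate. The assumption here is that Hive reads a numeric delimiter string as a signed byte, so '-2' stands for byte 0xFE. A minimal Python sketch of that interpretation; the function names and the sample row are hypothetical:

```python
def delimiter_byte(spec: str) -> bytes:
    """Map a delimiter spec such as '-2' to the single byte it is assumed to denote."""
    try:
        value = int(spec)
    except ValueError:
        value = None
    if value is not None and -128 <= value <= 127:
        # Signed byte to unsigned byte: -2 becomes 254 (0xFE).
        return bytes([value & 0xFF])
    # Not a signed-byte number: fall back to the low byte of the first character.
    return bytes([ord(spec[0]) & 0xFF])


def split_row(raw: bytes, spec: str) -> list[str]:
    """Split one raw line on the delimiter byte and decode each field as UTF-8."""
    fields = raw.rstrip(b"\n").split(delimiter_byte(spec))
    return [field.decode("utf-8") for field in fields]


# Hypothetical row: browser_id and browser separated by byte 0xFE.
row = split_row(b"42\xfeFirefox\n", "-2")
```

Splitting on the raw bytes before decoding matters: 0xFE is not valid UTF-8 on its own, so decoding the whole line first would fail.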
data from it)
Deployment
git push jenkins master
Independent jobs
source (external) -> staging (HDFS) -> hive-staging (HDFS) -> Hive
source to staging: HDFS upload + move in place
staging to hive-staging: MapReduce + HDFS move
hive-staging to Hive: Hive map external table + SELECT INTO
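The "upload + move in place" step can be sketched as follows: write the file somewhere the next job does not look, then rename it into the directory it does read. This is a sketch only. The local filesystem stands in for HDFS, and the `_incoming` directory and function names are assumptions for illustration:

```python
import os
import shutil
import tempfile
from pathlib import Path


def upload_then_move(source: Path, staging_dir: Path) -> Path:
    """Copy into a hidden incoming directory, then rename into the staging directory."""
    incoming = staging_dir / "_incoming"
    incoming.mkdir(parents=True, exist_ok=True)
    tmp_path = incoming / source.name
    shutil.copyfile(source, tmp_path)   # slow part; readers never see this path
    final_path = staging_dir / source.name
    os.replace(tmp_path, final_path)    # rename on the same filesystem
    return final_path


def ready_files(staging_dir: Path) -> list[str]:
    """List what a downstream job would pick up: plain files directly in staging."""
    return sorted(p.name for p in staging_dir.iterdir() if p.is_file())


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    source = Path(root) / "source" / "clicks.log"
    source.parent.mkdir()
    source.write_text("line 1\n")
    staging = Path(root) / "staging"
    delivered = upload_then_move(source, staging)
    delivered_text = delivered.read_text()
    ready = ready_files(staging)
```

Because the rename is the only step that makes a file visible, each job can run on its own schedule and only ever sees complete files.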
Out of order jobs
to Hive
delivery is going to be three hours late
later in the day
Fixable data store
drop the partition and re-insert
transactional, repair afterwards
purpose
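"Drop the partition and re-insert" makes a repair safe to rerun. A minimal in-memory sketch of the idea in Python; the class, its methods, and the date-keyed partitions are hypothetical stand-ins for a partitioned Hive table, not the speaker's implementation:

```python
class PartitionedStore:
    """Toy store that keeps rows grouped by partition key."""

    def __init__(self):
        self._partitions = {}

    def drop_partition(self, key):
        # Dropping a partition that does not exist is not an error.
        self._partitions.pop(key, None)

    def insert(self, key, rows):
        # Plain insert appends, so running it twice duplicates rows.
        self._partitions.setdefault(key, []).extend(rows)

    def reload_partition(self, key, rows):
        # Drop first, then insert: the result is the same however often it runs.
        self.drop_partition(key)
        self.insert(key, rows)

    def rows(self, key):
        return list(self._partitions.get(key, []))


store = PartitionedStore()

# Rerunning a reload leaves exactly one copy of the data.
store.reload_partition("2013-02-22", ["a", "b"])
store.reload_partition("2013-02-22", ["a", "b"])
reloaded = store.rows("2013-02-22")

# Rerunning a plain insert does not.
store.insert("2013-02-23", ["x"])
store.insert("2013-02-23", ["x"])
appended = store.rows("2013-02-23")
```

The partition is the unit of repair: a bad delivery is fixed by reloading one partition rather than editing rows in place.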
Metrics
Metrics service
time
We’re hiring / Questions? / Thank you!
@fzk frisovanvollenhoven@godatadriven.com Friso van Vollenhoven