Dataiku Flow and dctc
Data pipelines made easy
Berlin Buzzwords 2013
About me
Clément Stenac <clement.stenac@dataiku.com> @ClementStenac
CTO @ Dataiku
Head of product R&D @ Exalead (Search Engine Technology)
OSS developer @ VLC, Debian and OpenStreetMap
Dataiku Training – Hadoop for Data Science
Agenda: The hard life of a Data Scientist · Dataiku Flow · dctc · Lunch!
DIP – Introduction to Dataiku Flow
Dataiku - Pig, Hive and Cascading
[Example data pipeline diagram: sources (tracker logs, MongoDB, MySQL, syslog, S3, partner FTP, search logs) feed datasets (product catalog, orders, Apache logs, sessions) that are transformed with Pig, Hive, Python and Sync In / Sync Out steps into category affinity, category targeting, customer profiles, a recommender, search engine optimization (external), search ranking (internal), and ElasticSearch.]
[Recommendation pipeline diagram: page views, orders, and the catalog are filtered (bots, special users) into filtered page views, then combined into user affinity, product popularity, user similarity (per category and per brand), an order summary, a recommendation graph, and final recommendations via machine learning.]
Many tasks and tools. Dozens of stages, evolving daily. Exceptional situations are the norm. Many pains, in both data and computation.
1970 Shell scripts → 1977 Makefile → 1980 Makedeps → 1999 SCons/CMake → 2001 Maven → … → 2008 HaMake → 2009 Oozie, ETLs, … Next?
Next: better dependencies, higher-level tasks.
Manage data, not steps and tasks. Simplify common maintenance situations. Handle real day-to-day pains.
A dataset is like a table: it contains records, with a schema. It can be partitioned. Various backends are supported.
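A minimal sketch of the dataset abstraction described above: record-oriented, schema-carrying, partitionable, backend-agnostic. All class and field names here are illustrative, not Flow's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    schema: list                # column names
    backend: str                # e.g. "hdfs", "s3", "mysql"
    partitioning: str = None    # e.g. "day", or None if unpartitioned
    partitions: dict = field(default_factory=dict)  # partition id -> records

# A day-partitioned visits dataset stored on HDFS:
visits = Dataset("visits", ["user_id", "url", "ts"],
                 backend="hdfs", partitioning="day")
visits.partitions["2013-01-25"] = [("u1", "/home", 1359072000)]
print(visits.backend, list(visits.partitions))  # hdfs ['2013-01-25']
```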
A task has input datasets and output datasets, and declares dependencies from inputs to outputs. Built-in tasks with strong integration; customizable tasks.
[Diagram: daily and weekly aggregation tasks computing «Aggregate Visits» from «Visits» and «Customers».]
[Example flow: web tracker logs → Shaker «cleanlogs» → clean logs → Pig «aggr_visits» → visits; CRM customers table → Shaker «enrich_cust» → clean customers (browsers, referrers); both feed Hive «customer_visits» → customer last visits; with the CRM products table, Pig «customer_last_product» → customer last products.]
Flow is data-oriented.
Don't ask «Run task A and then task B». Don't even ask «Run all tasks that depend on task A». Ask «Do what's needed so that my aggregated customers data for 2013/01/25 is up to date».
Flow manages dependencies between datasets, through tasks. You don't execute tasks; you compute or refresh datasets.
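The data-oriented refresh above can be sketched as a walk upstream from the requested dataset: only tasks whose outputs are stale get scheduled, inputs first. This is an illustrative sketch, not Flow's real API; the task graph and names are hypothetical.

```python
# Each task maps its input datasets to one output dataset.
TASKS = {
    "clean_logs":     {"inputs": ["raw_logs"], "output": "cleaned_logs"},
    "aggr_customers": {"inputs": ["cleaned_logs", "crm_customers"],
                       "output": "aggregated_customers"},
}

def tasks_to_refresh(target, up_to_date):
    """Walk upstream from `target`; return the tasks to run, inputs first."""
    plan = []
    def visit(dataset):
        for name, task in TASKS.items():
            if task["output"] == dataset and dataset not in up_to_date:
                for dep in task["inputs"]:
                    visit(dep)          # refresh upstream datasets first
                if name not in plan:
                    plan.append(name)
    visit(target)
    return plan

# «Do what's needed so that aggregated_customers is up to date»:
print(tasks_to_refresh("aggregated_customers", up_to_date={"crm_customers"}))
# -> ['clean_logs', 'aggr_customers']
```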
[Flow fragment: wtlogs → Shaker «cleantask1» → cleanlog → Pig «aggr_visits» → weekly_aggr]
"wtlogs" and "cleanlog" are day-partitioned. "weekly_aggr" needs the previous 7 days of clean logs: a "sliding days" partition-level dependency, declared as sliding_days(7). Request: "Compute weekly_aggr for 2012-01-25".
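A sketch of resolving the sliding_days(7) dependency: for a requested target partition, list the source partitions that must exist. The function name and the convention "the 7 days strictly before the target day" are assumptions for illustration.

```python
from datetime import date, timedelta

def sliding_days(target_day, n):
    """Source partitions needed for `target_day`: the n previous days."""
    return [target_day - timedelta(days=i) for i in range(1, n + 1)]

# "Compute weekly_aggr for 2012-01-25" requires cleanlog partitions
# 2012-01-24 back to 2012-01-18:
needed = sliding_days(date(2012, 1, 25), 7)
print(needed[0], needed[-1])  # 2012-01-24 2012-01-18
```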
Flow computes the global DAG of required activities and determines which activities can run in parallel (previous example: 8 activities). It manages running activities and starts new ones based on available resources.
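One way to picture this scheduling: group the activity DAG into "waves", where an activity joins a wave once all its upstream activities are done. A minimal sketch with a hypothetical graph (not Flow's internals):

```python
def parallel_waves(deps):
    """deps: activity -> set of upstream activities. Returns waves of
    activities that can run in parallel, in execution order."""
    remaining = {a: set(d) for a, d in deps.items()}
    done, waves = set(), []
    while remaining:
        # ready = everything whose dependencies are all completed
        wave = sorted(a for a, d in remaining.items() if d <= done)
        if not wave:
            raise ValueError("cycle in activity graph")
        waves.append(wave)
        done.update(wave)
        for a in wave:
            del remaining[a]
    return waves

deps = {
    "clean_d1": set(), "clean_d2": set(),
    "aggr": {"clean_d1", "clean_d2"},
    "sync_out": {"aggr"},
}
print(parallel_waves(deps))
# -> [['clean_d1', 'clean_d2'], ['aggr'], ['sync_out']]
```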
Datasets have a schema, available in all tools. Advanced verification of computed data, e.g. «check that day B's data is not more than 40% different than day A». Automatic tests for data pipelines.
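A toy version of such a check, here on record counts only; the metric and the 40% threshold are taken from the example above, everything else is an assumption for illustration.

```python
def day_over_day_ok(count_a, count_b, max_change=0.40):
    """True if day B's record count is within 40% of day A's."""
    if count_a == 0:
        return count_b == 0
    return abs(count_b - count_a) / count_a <= max_change

print(day_over_day_ok(1000, 1200))  # True  (+20%: within threshold)
print(day_over_day_ok(1000, 400))   # False (-60%: pipeline alert)
```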
Native knowledge of Pig and Hive formats; schema-aware loaders and storages. A great ecosystem, but not omnipotent.
Hadoop is a first-class citizen of Flow, but not the only one: native integration of SQL capabilities; automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, …
Custom tasks
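The incremental-synchronization idea can be sketched as a watermark-based copy: only records updated since the last sync are pushed. The stores are plain dicts here; real backends (MongoDB, Vertica, ElasticSearch) would sit behind the same idea. All names are illustrative.

```python
def incremental_sync(source, target, last_sync):
    """Copy records with updated_at > last_sync; return the new watermark."""
    newest = last_sync
    for key, (value, updated_at) in source.items():
        if updated_at > last_sync:
            target[key] = (value, updated_at)
            newest = max(newest, updated_at)
    return newest

source = {"a": ("v1", 10), "b": ("v2", 25), "c": ("v3", 30)}
target = {"a": ("v1", 10)}          # synchronized up to timestamp 10
watermark = incremental_sync(source, target, last_sync=10)
print(sorted(target), watermark)    # ['a', 'b', 'c'] 30
```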
Engine and core tasks are working; under active development for the first betas. Get more info and stay informed: http://flowbeta.dataiku.com
dctc
dctc is an extract from the core of Flow: manipulate files across filesystems.
# List the files and folders in a S3 bucket
% dctc ls s3://my-bucket
# Synchronize incrementally from GCS to folder
% dctc sync gs://my-bucket/my-path target_directory
# Copy from GCS, compress to gz on the fly
# (decompress handled too)
% dctc cp -R -c gs://my-bucket/my-path hdfs:///data/input
# Dispatch the lines of a file to files on S3, gzip-compressed
% dctc dispatch input s3://bucket/target -f random -nf 8
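Conceptually, the dispatch command above scatters the lines of one input across 8 gzip-compressed outputs. A sketch of that behavior (in-memory buffers stand in for real S3 files; this is not dctc's implementation):

```python
import gzip
import random

def dispatch_random(lines, n_files, seed=0):
    """Scatter lines randomly across n_files gzip-compressed outputs."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(n_files)]
    for line in lines:
        buckets[rng.randrange(n_files)].append(line)
    # gzip-compress each bucket, as dctc does on the fly
    return [gzip.compress("\n".join(b).encode()) for b in buckets]

lines = [f"record-{i}" for i in range(100)]
outputs = dispatch_random(lines, 8)
total = sum(len(gzip.decompress(o).decode().split()) for o in outputs)
print(len(outputs), total)  # 8 100: all lines land in exactly one file
```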
# cat from anywhere
% dctc cat ftp://account@:/pub/data/data.csv
# Edit a remote file (with $EDITOR)
% dctc edit ssh://account@:myfile.txt
# Transparently unzip
% dctc
# Head / tail from the cloud
% dctc tail s3://bucket/huge-log.csv
# Multi-account aware
% dctc sync s3://account1@path s3://account2@other_path
Self-contained binary for Linux, OS X, Windows.
Supported: Amazon S3, Google Cloud Storage, FTP, HTTP, SSH, HDFS (through local install).
Marc Batty, Chief Customer Officer, marc.batty@dataiku.com, +33 6 45 65 67 04, @battymarc
Florian Douetteau, Chief Executive Officer, florian.douetteau@dataiku.com, +33 6 70 56 88 97, @fdouetteau
Thomas Cabrol, Chief Data Scientist, thomas.cabrol@dataiku.com, +33 7 86 42 62 81, @ThomasCabrol
Clément Stenac, Chief Technical Officer, clement.stenac@dataiku.com, +33 6 28 06 79 04, @ClementStenac