Dataiku Flow and dctc: Data pipelines made easy
Berlin Buzzwords 2013

About me
Clément Stenac <clement.stenac@dataiku.com>, @ClementStenac
• CTO @ Dataiku
• Head of product R&D @ Exalead (Search Engine Technology)
• OSS developer @ VLC, Debian and OpenStreetMap
Dataiku Training – Hadoop for Data Science

• The hard life of a Data Scientist
• Dataiku Flow
• DCTC
• Lunch!
DIP – Introduction to Dataiku Flow

Follow the Flow
[Diagram of an example data flow. Recoverable labels: Sync In, Sync Out; Tracker Log, Apache Logs, Syslog, Partner FTP, Search Logs; Session, Customer Profile, Product Catalog, Order, Category Affinity, Category Targeting, Product Recommender, Product Transformation, Search Engine Optimization, (External) Search, (Internal) Search Ranking; tools and stores: Pig, Hive, Python, MongoDB, MySQL, ElasticSearch, S3]
Dataiku - Pig, Hive and Cascading

Zooming more
[Diagram of the recommendation part of the flow. Recoverable labels: Bots, Special Users; Page Views; Filtered Page Views; User Affinity; Catalog; User Similarity (Per Category); User Similarity (Per Brand); Product Popularity; Orders; Order Summary; Recommendation Graph; Machine Learning; Recommendation]

Real-life data pipelines
• Many tasks and tools
• Dozens of stages, evolving daily
• Exceptional situations are the norm
• Many pains
  ◦ Shared schemas
  ◦ Efficient incremental synchronization and computation
  ◦ Data is bad

An evolution similar to build tools
Build tools:
• 1970 Shell scripts
• 1977 Makefile
• 1980 Makedeps
• 1999 SCons/CMake
• 2001 Maven
• …
Data pipelines:
• Shell scripts
• 2008 HaMake
• 2009 Oozie
• ETLs, …
• Next?
Direction of the evolution: better dependencies, higher-level tasks

Introduction to Flow
Dataiku Flow is a data-driven orchestration framework for complex data pipelines
• Manage data, not steps and tasks
• Simplify common maintenance situations
  ◦ Data rebuilds
  ◦ Processing step updates
• Handle real day-to-day pains
  ◦ Data validity checks
  ◦ Transfers between systems

Concepts: Dataset
• Like a table: contains records, with a schema
• Can be partitioned
  ◦ Time partitioning (by day, by hour, …)
  ◦ « Value » partitioning (by country, by partner, …)
• Various backends
  ◦ SQL
  ◦ Filesystem
  ◦ NoSQL (MongoDB, …)
  ◦ HDFS
  ◦ Cloud storage
  ◦ ElasticSearch

Concepts: Task
• Has input datasets and output datasets
[Diagram labels: Visits, Customers; Aggregate Visits; Weekly aggregation, Daily aggregation]
• Declares dependencies from input to output
• Built-in tasks with strong integration
  ◦ Pig
  ◦ Python Pandas & SciKit
  ◦ Hive
  ◦ Data transfers
• Customizable tasks
  ◦ Shell script, Java, …
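The two concepts above can be pictured with a small data model. This is only an illustrative sketch, not Flow's actual API: the class names, field names and the `producers` helper are assumptions made for this example, and the dataset and task names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Dataset:
    """Table-like collection of records with a schema (hypothetical model)."""
    name: str
    backend: str                        # e.g. "sql", "hdfs", "mongodb"
    schema: Dict[str, str]              # column name -> type name
    partitioning: Optional[str] = None  # e.g. "day", "hour", "country"


@dataclass
class Task:
    """Processing step that reads input datasets and writes output datasets."""
    name: str
    kind: str           # e.g. "pig", "hive", "python", "shell"
    inputs: List[str]   # names of input datasets
    outputs: List[str]  # names of output datasets


def producers(tasks: List[Task]) -> Dict[str, Task]:
    """Map each output dataset name to the single task that builds it."""
    result: Dict[str, Task] = {}
    for task in tasks:
        for output in task.outputs:
            if output in result:
                raise ValueError(f"{output} is produced by more than one task")
            result[output] = task
    return result


# Hypothetical two-step flow: raw logs -> clean logs -> visits.
clean_logs = Dataset("clean_logs", "hdfs", {"user": "string", "url": "string"}, "day")
tasks = [
    Task("cleanlogs", "python", ["web_tracker_logs"], ["clean_logs"]),
    Task("aggr_visits", "pig", ["clean_logs"], ["visits"]),
]
by_output = producers(tasks)
```

Because every task declares its inputs and outputs, the dependency graph between datasets can be derived from the task list alone.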

Introduction to Flow: a sample Flow
[Diagram of a sample flow. Tasks: Shaker « cleanlogs », Pig « aggr_visits », Shaker « enrich_cust », Hive « customer_visits », Pig « customer_last_product ». Datasets: Web Tracker logs, Clean logs, Visits, CRM table customers, Clean custs., Cust last visits, CRM table products, Cust last products. Other labels: Browsers, Referent.]

Data-oriented
Flow is data-oriented
• Don't ask « Run task A and then task B »
• Don't even ask « Run all tasks that depend on task A »
• Ask « Do what's needed so that my aggregated customers data for 2013/01/25 is up to date »
• Flow manages dependencies between datasets, through tasks
• You don't execute tasks, you compute or refresh datasets
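The data-oriented request above resembles what a build tool does for a target file. The following is a minimal sketch of that idea, not Flow's actual algorithm: it assumes each dataset carries a last-modified timestamp and treats a dataset as stale when it is missing, older than one of its inputs, or downstream of something that must be rebuilt. The function name, the task table and the timestamps are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical flow description: task name -> (input datasets, output dataset)
TaskTable = Dict[str, Tuple[List[str], str]]


def plan_refresh(target: str, tasks: TaskTable, mtimes: Dict[str, float]) -> List[str]:
    """Return the tasks to run, in order, so that `target` is up to date."""
    producer = {out: (name, inputs) for name, (inputs, out) in tasks.items()}
    plan: List[str] = []
    visited: Set[str] = set()
    rebuilt: Set[str] = set()

    def visit(dataset: str) -> bool:
        """Return True if `dataset` is going to be rebuilt."""
        if dataset in visited:
            return dataset in rebuilt
        visited.add(dataset)
        if dataset not in producer:
            return False  # source dataset: nothing can rebuild it
        name, inputs = producer[dataset]
        # Visit every input first, so upstream tasks are planned before this one.
        upstream_rebuilt = [visit(i) for i in inputs]
        own = mtimes.get(dataset)
        stale = (
            own is None
            or any(upstream_rebuilt)
            # Inputs with no recorded timestamp are treated as never newer.
            or any(mtimes.get(i, float("-inf")) > own for i in inputs)
        )
        if stale:
            plan.append(name)
            rebuilt.add(dataset)
        return stale

    visit(target)
    return plan


tasks = {
    "cleanlogs": (["logs"], "clean_logs"),
    "aggr_visits": (["clean_logs"], "visits"),
}
# "logs" changed after "clean_logs" was last built, so both tasks must run.
plan = plan_refresh("visits", tasks, {"logs": 10, "clean_logs": 5, "visits": 20})
```

The caller names only the dataset it wants; the list of tasks to run is a consequence of the dependency graph and the current state of the data.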

Partition-level dependencies
[Diagram: wtlogs → Shaker cleantask1 → cleanlog → Pig « aggr_visits », with sliding_days(7) → weekly_aggr]
• "wtlogs" and "cleanlog" are day-partitioned
• "weekly_aggr" needs the previous 7 days of clean logs
• "sliding days" partition-level dependency
• "Compute weekly_aggr for 2012-01-25"
  ◦ Automatically computes the required 7 partitions
  ◦ For each partition, check if cleanlog is up-to-date wrt. the wtlogs partition
  ◦ Perform cleantask1 in parallel for all missing / stale days
  ◦ Perform aggr_visits with the 7 partitions as input
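The first two steps of that computation can be sketched as follows. This is an illustration under stated assumptions, not Flow's implementation: it assumes the 7-day window ends on the target day (inclusive), that staleness is decided by comparing per-partition timestamps, and that every source partition exists. The function names and timestamp dictionaries are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List


def sliding_days(target: date, window: int = 7) -> List[date]:
    """Day partitions required for `target`, oldest first.

    Assumption: the window ends on the target day itself.
    """
    return [target - timedelta(days=offset) for offset in range(window - 1, -1, -1)]


def stale_days(
    days: List[date],
    source_mtime: Dict[date, float],
    cleaned_mtime: Dict[date, float],
) -> List[date]:
    """Days whose cleaned partition is missing or older than its source partition."""
    stale = []
    for day in days:
        cleaned = cleaned_mtime.get(day)
        # Assumes source_mtime has an entry for every requested day.
        if cleaned is None or cleaned < source_mtime[day]:
            stale.append(day)
    return stale


days = sliding_days(date(2012, 1, 25))
wtlogs_mtime = {day: 100.0 for day in days}
cleanlog_mtime = {day: 200.0 for day in days[:5]}  # five days already cleaned
cleanlog_mtime[days[5]] = 50.0                      # cleaned before the source changed
to_clean = stale_days(days, wtlogs_mtime, cleanlog_mtime)
```

Only the partitions returned by `stale_days` would need the cleaning task; the weekly aggregation then reads all seven.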

Automatic parallelism
• Flow computes the global DAG of required activities
• Compute activities that can take place in parallel
• Previous example: 8 activities
  ◦ 7 can be parallelized
  ◦ 1 requires the other 7 first
• Manages running activities
• Starts new activities based on available resources
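A minimal sketch of this kind of scheduling, using Python's standard thread pool, is shown below. It is not Flow's engine: the `run_dag` function, the activity names and the use of `max_workers` as a stand-in for "available resources" are assumptions made for this example.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, List, Set


def run_dag(
    deps: Dict[str, Set[str]],
    action: Callable[[str], None],
    max_workers: int = 4,
) -> List[str]:
    """Run `action` once per activity, never before its dependencies finished.

    `deps` maps each activity to the set of activities it requires.
    Returns the activities in completion order.
    """
    remaining = {activity: set(required) for activity, required in deps.items()}
    running = {}  # future -> activity name
    completed: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Submit everything whose dependencies are all satisfied.
            ready = [a for a, required in remaining.items() if not required]
            for activity in ready:
                del remaining[activity]
                running[pool.submit(action, activity)] = activity
            if not running:
                raise ValueError("dependency cycle or unknown dependency")
            finished, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in finished:
                activity = running.pop(future)
                future.result()  # re-raise any error from the activity
                completed.append(activity)
                for required in remaining.values():
                    required.discard(activity)
    return completed


# The example above: 7 independent cleaning activities, then 1 aggregation.
cleaning = {f"cleantask1/day-{i}" for i in range(7)}
deps = {activity: set() for activity in cleaning}
deps["aggr_visits"] = set(cleaning)
order = run_dag(deps, action=lambda activity: None)
```

The pool size caps how many activities run at once; the aggregation is only submitted after the last cleaning activity completes.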

Schema and data validity checks
• Datasets have a schema, available in all tools
• Advanced verification of computed data
  ◦ "Check that output is not empty"
  ◦ "Check that this custom query returns between X and Y records"
  ◦ "Check that this specific record is found in output"
  ◦ "Check that number of computed records for day B is no more than 40% different than day A"
• Automatic tests for data pipelines
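The four example checks can be written as small predicates. This is only a sketch of the idea; the function names and signatures are hypothetical and say nothing about how Flow declares or runs its checks.

```python
from typing import Any, Sequence


def check_not_empty(records: Sequence[Any]) -> bool:
    """The output contains at least one record."""
    return len(records) > 0


def check_count_between(count: int, low: int, high: int) -> bool:
    """A query returned between `low` and `high` records, bounds included."""
    return low <= count <= high


def check_contains(records: Sequence[Any], expected: Any) -> bool:
    """A specific record is present in the output."""
    return expected in records


def check_day_over_day(count_a: int, count_b: int, max_change: float = 0.40) -> bool:
    """Day B's record count differs from day A's by at most `max_change`.

    The change is measured relative to day A (an assumption of this sketch).
    """
    if count_a == 0:
        return count_b == 0
    return abs(count_b - count_a) / count_a <= max_change


output = [{"customer": 1, "visits": 3}, {"customer": 2, "visits": 7}]
results = {
    "not_empty": check_not_empty(output),
    "count": check_count_between(len(output), 1, 10),
    "has_customer_2": check_contains(output, {"customer": 2, "visits": 7}),
    "stable_volume": check_day_over_day(1000, 1300),
}
```

Run after each computation, such predicates act as automatic tests: a failed check flags the dataset before downstream tasks consume it.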

Integrated in Hadoop, open beyond
• Native knowledge of Pig and Hive formats
• Schema-aware loaders and storages
• A great ecosystem, but not omnipotent
  ◦ Not everything requires Hadoop's strong points
• Hadoop = first-class citizen of Flow, but not the only one
• Native integration of SQL capabilities
• Automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, …
• Custom tasks

What about Oozie and HCatalog?

Are we there yet?
• Engine and core tasks are working
• Under active development for first betas
• Get more info and stay informed: http://flowbeta.dataiku.com
And while you wait, another thing: ever been annoyed by data transfers?

Feel the pain

DCTC: Cloud data manipulation
• Extracted from the core of Flow
• Manipulate files across filesystems

    # List the files and folders in a S3 bucket
    % dctc ls s3://my-bucket

    # Synchronize incrementally from GCS to local folder
    % dctc sync gs://my-bucket/my-path target-directory

    # Copy from GCS to HDFS, compress to .gz on the fly
    # (decompress handled too)
    % dctc cp -R -c gs://my-bucket/my-path hdfs:///data/input

    # Dispatch the lines of a file to 8 files on S3, gzip-compressed
    % dctc dispatch input s3://bucket/target -f random -nf 8 -c
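To make the last command concrete, here is a local-only sketch of what dispatching lines at random into gzip-compressed files involves. It is not dctc's implementation and does not talk to S3: the `dispatch_lines` function, its parameters and the `part-N.gz` file naming are assumptions made for this example.

```python
import gzip
import random
from pathlib import Path
from typing import List, Optional


def dispatch_lines(
    source: str,
    target_dir: str,
    n_files: int = 8,
    seed: Optional[int] = None,
) -> List[Path]:
    """Send each line of `source` to one of `n_files` gzip files, chosen at random."""
    rng = random.Random(seed)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme for the output parts.
    paths = [target / f"part-{index}.gz" for index in range(n_files)]
    outputs = [gzip.open(path, "wt", encoding="utf-8") for path in paths]
    try:
        with open(source, "r", encoding="utf-8") as lines:
            for line in lines:
                outputs[rng.randrange(n_files)].write(line)
    finally:
        for output in outputs:
            output.close()
    return paths
```

A call such as `dispatch_lines("input", "target", n_files=8)` leaves eight compressed parts whose concatenated contents hold every input line exactly once, in no particular order across parts.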

DCTC: More examples

    # cat from anywhere
    % dctc cat ftp://account@:/pub/data/data.csv

    # Multi-account aware
    % dctc sync s3://account1@path s3://account2@other_path

    # Edit a remote file (with $EDITOR)
    % dctc edit ssh://account@:myfile.txt

    # Transparently unzip
    % dctc

    # Head / tail from the cloud
    % dctc tail s3://bucket/huge-log.csv

Try it now
http://dctc.io
• Self-contained binary for Linux, OS X, Windows
• Amazon S3
• HTTP
• Google Cloud Storage
• SSH
• FTP
• HDFS (through local install)

Questions?
Florian Douetteau, Chief Executive Officer, florian.douetteau@dataiku.com, +33 6 70 56 88 97, @fdouetteau
Marc Batty, Chief Customer Officer, marc.batty@dataiku.com, +33 6 45 65 67 04, @battymarc
Thomas Cabrol, Chief Data Scientist, thomas.cabrol@dataiku.com, +33 7 86 42 62 81, @ThomasCabrol
Clément Stenac, Chief Technical Officer, clement.stenac@dataiku.com, +33 6 28 06 79 04, @ClementStenac
