Dataiku Flow and dctc: Data pipelines made easy (Berlin Buzzwords)



slide-1
SLIDE 1

Dataiku Flow and dctc

Data pipelines made easy

Berlin Buzzwords 2013

slide-2
SLIDE 2

About me

Clément Stenac <clement.stenac@dataiku.com> @ClementStenac

• CTO @ Dataiku
• Head of product R&D @ Exalead (Search Engine Technology)
• OSS developer @ VLC, Debian and OpenStreetMap

Dataiku Training – Hadoop for Data Science

slide-3
SLIDE 3

• The hard life of a Data Scientist
• Dataiku Flow
• DCTC
• Lunch !

DIP – Introduction to Dataiku Flow

slide-4
SLIDE 4

Follow the Flow

Dataiku - Pig, Hive and Cascading

[Diagram labels: Tracker Log, MongoDB, MySQL, Syslog, Product Catalog, Order, Apache Logs, Session, Product Transformation, Category Affinity, Category Targeting, Customer Profile, Recommender, S3, Search Logs, (External) Search Engine Optimization, (Internal) Search Ranking, Partner FTP, Sync In, Sync Out, Pig, Hive, ElasticSearch, Python]

slide-5
SLIDE 5

Zooming more

[Diagram labels: Page Views, Orders, Catalog, Bots, Special Users, Filtered Page Views, User Affinity, Product Popularity, User Similarity (Per Category), Recommendation Graph, Recommendation, Order Summary, User Similarity (Per Brand), Machine Learning]

slide-6
SLIDE 6

Real-life data pipelines

• Many tasks and tools
• Dozens of stages, evolving daily
• Exceptional situations are the norm
• Many pains
  • Shared schemas
  • Efficient incremental synchronization and computation
  • Data is bad

slide-7
SLIDE 7

An evolution similar to build

• 1970 Shell scripts
• 1977 Makefile
• 1980 Makedeps
• 1999 SCons/CMake
• 2001 Maven
• … Shell Scripts
• 2008 HaMake
• 2009 Oozie
• ETLs, …
• Next ?

• Better dependencies
• Higher-level tasks

slide-8
SLIDE 8

• The hard life of a Data Scientist
• Dataiku Flow
• DCTC
• Lunch !

slide-9
SLIDE 9

Introduction to Flow

Dataiku Flow is a data-driven orchestration framework for complex data pipelines

• Manage data, not steps and tasks
• Simplify common maintenance situations
  • Data rebuilds
  • Processing steps update
• Handle real day-to-day pains
  • Data validity checks
  • Transfers between systems
slide-10
SLIDE 10

Concepts: Dataset

• Like a table: contains records, with a schema
• Can be partitioned
  • Time partitioning (by day, by hour, …)
  • « Value » partitioning (by country, by partner, …)
• Various backends
  • Filesystem
  • HDFS
  • ElasticSearch
  • SQL
  • NoSQL (MongoDB, …)
  • Cloud Storages
slide-11
SLIDE 11

Concepts: Task

• Has input datasets and output datasets
• Declares dependencies from input to output
• Built-in tasks with strong integration
  • Pig
  • Hive
  • Python Pandas & SciKit
  • Data transfers
• Customizable tasks
  • Shell script, Java, …

[Diagram labels: Aggregate Visits, Visits, Customers, Weekly aggregation, Daily aggregation]
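The Dataset and Task concepts can be pictured as two small record types. The sketch below is only an illustration: the class names, fields and sample datasets are assumptions made for this example, not the actual Flow API, which the slides do not show.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dataset:
    """Hypothetical model: table-like records with a schema, optionally partitioned."""
    name: str
    backend: str                        # e.g. "hdfs", "sql", "elasticsearch"
    schema: tuple                       # ordered (column, type) pairs
    partitioning: Optional[str] = None  # e.g. "day", "hour", "country"


@dataclass(frozen=True)
class Task:
    """Hypothetical model: a step that declares its input and output datasets."""
    name: str
    kind: str                           # e.g. "pig", "hive", "shell"
    inputs: tuple
    outputs: tuple


# Sample declarations (names and columns are invented for the example).
visits = Dataset("visits", "hdfs", (("user_id", "string"), ("url", "string")), "day")
customers = Dataset("customers", "sql", (("user_id", "string"), ("country", "string")))
weekly = Dataset("weekly_visits", "hdfs", (("user_id", "string"), ("n", "int")), "day")

aggregate = Task("aggregate_visits", "pig", inputs=(visits, customers), outputs=(weekly,))

# Dependencies are read off the declaration instead of being ordered by hand.
print([d.name for d in aggregate.inputs], "->", [d.name for d in aggregate.outputs])
```

Because a task names its inputs and outputs, an engine can derive the dataset-to-dataset dependencies without any explicit ordering of steps.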
slide-12
SLIDE 12

Introduction to Flow

A sample Flow

[Diagram labels. Tasks: Shaker « cleanlogs », Pig « aggr_visits », Shaker « enrich_cust », Hive « customer_visits », Pig « customer_last_product ». Datasets: Web Tracker logs, Clean logs, Visits, CRM table customers, Clean custs., Browsers, Referent., Cust last visits, CRM table products, Cust last products.]

slide-13
SLIDE 13

Flow is data-oriented

• Don’t ask « Run task A and then task B »
• Don’t even ask « Run all tasks that depend on task A »
• Ask « Do what’s needed so that my aggregated customers data for 2013/01/25 is up to date »
• Flow manages dependencies between datasets, through tasks
• You don’t execute tasks, you compute or refresh datasets
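Asking for a dataset rather than for tasks amounts to walking the dependency graph backwards from the target, much like make does. The sketch below is a minimal illustration, not Flow's implementation: the task names come from the sample flow, but the wiring between them is assumed, and staleness is ignored so every upstream task is listed.

```python
# Hypothetical wiring: task name -> (input datasets, output dataset).
TASKS = {
    "cleanlogs": (["web_tracker_logs"], "clean_logs"),
    "aggr_visits": (["clean_logs"], "visits"),
    "enrich_cust": (["crm_customers"], "clean_custs"),
    "customer_visits": (["visits", "clean_custs"], "cust_last_visits"),
}

# Index tasks by the dataset they produce.
PRODUCER = {out: name for name, (_, out) in TASKS.items()}


def plan(target, done=None):
    """Return the tasks needed to build `target`, upstream tasks first."""
    done = [] if done is None else done
    task = PRODUCER.get(target)
    if task is None:              # source dataset: nothing to run
        return done
    for upstream in TASKS[task][0]:
        plan(upstream, done)
    if task not in done:
        done.append(task)
    return done


print(plan("cust_last_visits"))
# ['cleanlogs', 'aggr_visits', 'enrich_cust', 'customer_visits']
```

The caller only names the dataset it wants; the list of tasks is a by-product of the declared inputs and outputs.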

slide-14
SLIDE 14

Partition-level dependencies

[Diagram: wtlogs → Shaker cleantask1 → cleanlog → sliding_days(7) → Pig « aggr_visits » → weekly_aggr]

• "wtlogs" and "cleanlog" are day-partitioned
• "weekly_aggr" needs the previous 7 days of clean logs
• "sliding days" partition-level dependency
• "Compute weekly_aggr for 2012-01-25"
  • Automatically computes the required 7 partitions
  • For each partition, check if cleanlog is up-to-date wrt. the wtlogs partition
  • Perform cleantask1 in parallel for all missing / stale days
  • Perform aggr_visits with the 7 partitions as input
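The partition resolution described above can be sketched in a few lines. This is an illustration under stated assumptions: the slides only name `sliding_days(7)`, so the function signature, the choice to include the target day in the window, and the use of modification times to detect stale partitions are all hypothetical.

```python
from datetime import date, timedelta


def sliding_days(target, n=7):
    """Return n day partitions ending at `target` (inclusive: an assumption)."""
    return [target - timedelta(days=i) for i in range(n - 1, -1, -1)]


def stale_days(days, source_mtime, output_mtime):
    """Days whose output is missing or older than its source partition."""
    stale = []
    for day in days:
        built = output_mtime.get(day)
        if built is None or built < source_mtime[day]:
            stale.append(day)
    return stale


days = sliding_days(date(2012, 1, 25))

# Made-up modification times: every wtlogs partition was written at t=100.
wtlogs_mtime = {d: 100 for d in days}
# cleanlog is missing for the last day and outdated for the first one.
cleanlog_mtime = {d: 200 for d in days[:-1]}
cleanlog_mtime[days[0]] = 50

to_clean = stale_days(days, wtlogs_mtime, cleanlog_mtime)
print([d.isoformat() for d in days])      # 2012-01-19 ... 2012-01-25
print([d.isoformat() for d in to_clean])  # ['2012-01-19', '2012-01-25']
```

Only the two days in `to_clean` would need cleantask1 before aggr_visits runs on the full 7-day window.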

slide-15
SLIDE 15

Automatic parallelism

• Flow computes the global DAG of required activities
• Determines which activities can take place in parallel
• Previous example: 8 activities
  • 7 can be parallelized
  • 1 requires the other 7 first
• Manages running activities
• Starts new activities based on available resources
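The 8-activity example can be reproduced with Python's standard library: `graphlib.TopologicalSorter` hands out activities whose dependencies are done, and a thread pool caps how many run at once. This is a sketch of the scheduling idea only, not Flow's engine; the activity names, the `run` placeholder and the worker limit are assumptions.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # Python 3.9+

# Activity -> set of activities it depends on: 7 cleaning jobs, 1 aggregation.
cleans = [f"cleantask1[day{i}]" for i in range(1, 8)]
dag = {c: set() for c in cleans}
dag["aggr_visits"] = set(cleans)


def run(activity):
    # Placeholder for real work (a Pig job, a cleaning script, ...).
    return activity


def execute(dag, max_workers=4):
    """Run each activity once its dependencies are done; return finish order."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    finished, running = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit everything that has just become runnable.
            for activity in sorter.get_ready():
                running[pool.submit(run, activity)] = activity
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise any failure
                activity = running.pop(future)
                sorter.done(activity)
                finished.append(activity)
    return finished


order = execute(dag)
print(len(order), order[-1])  # 8 aggr_visits
```

The seven cleaning activities are submitted together and the pool size stands in for "available resources"; the aggregation is released only after all seven have been marked done.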

slide-16
SLIDE 16

Schema and data validity checks

• Datasets have a schema, available in all tools
• Advanced verification of computed data
  • "Check that output is not empty"
  • "Check that this custom query returns between X and Y records"
  • "Check that this specific record is found in output"
  • "Check that the number of computed records for day B differs from day A by no more than 40%"
• Automatic tests for data pipelines
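The four checks quoted above are simple predicates over a computed dataset. The sketch below shows one way to express them in plain Python; the function names, the in-memory record format and the sample numbers are assumptions for illustration, since the slides do not show how Flow declares its checks.

```python
def check_not_empty(records):
    return len(records) > 0


def check_count_between(records, low, high):
    return low <= len(records) <= high


def check_contains(records, expected):
    return expected in records


def check_day_over_day(count_a, count_b, max_change=0.40):
    """True when day B's record count is within max_change of day A's."""
    if count_a == 0:
        return count_b == 0
    return abs(count_b - count_a) / count_a <= max_change


# Invented sample output and counts.
output = [{"user": "u1", "visits": 3}, {"user": "u2", "visits": 1}]
results = {
    "not_empty": check_not_empty(output),
    "count_1_to_10": check_count_between(output, 1, 10),
    "has_u1": check_contains(output, {"user": "u1", "visits": 3}),
    "day_b_vs_day_a": check_day_over_day(count_a=1000, count_b=550),
}
failed = [name for name, ok in results.items() if not ok]
print(failed)  # ['day_b_vs_day_a'] because 550 is 45% below 1000
```

Run after every build, such predicates act as automatic tests: a failed check can stop downstream activities from consuming bad data.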

slide-17
SLIDE 17

Integrated in Hadoop, open beyond

• Native knowledge of Pig and Hive formats
• Schema-aware loaders and storages
• A great ecosystem, but not omnipotent
  • Not everything requires Hadoop's strong points
• Hadoop = first-class citizen of Flow, but not the only one
• Native integration of SQL capabilities
• Automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, …
• Custom tasks

slide-18
SLIDE 18

What about Oozie and HCatalog ?

slide-19
SLIDE 19

Are we there yet ?

• Engine and core tasks are working
• Under active development for first betas
• Get more info and stay informed: http://flowbeta.dataiku.com

And while you wait, another thing

Ever been annoyed by data transfers ?

slide-20
SLIDE 20

Feel the pain


slide-21
SLIDE 21

• The hard life of a Data Scientist
• Dataiku Flow
• DCTC
• Lunch !

slide-22
SLIDE 22

DCTC : Cloud data manipulation

• Extracted from the core of Flow
• Manipulate files across filesystems

# List the files and folders in a S3 bucket
% dctc ls s3://my-bucket

# Synchronize incrementally from GCS to local folder
% dctc sync gs://my-bucket/my-path target-directory

# Copy from GCS to HDFS, compress to .gz on the fly
# (decompress handled too)
% dctc cp -R -c gs://my-bucket/my-path hdfs:///data/input

# Dispatch the lines of a file to 8 files on S3, gzip-compressed
% dctc dispatch input s3://bucket/target -f random -nf 8 -c
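To make the dispatch operation concrete, here is a local-only Python sketch of the same idea: each line of an input file goes to one of 8 gzip-compressed files chosen at random. It is not how dctc is implemented and does not talk to S3; the `dispatch` function, the `part-N.gz` file names and the `seed` parameter are assumptions for this example.

```python
import gzip
import random
import tempfile
from pathlib import Path


def dispatch(input_path, target_dir, n_files=8, seed=None):
    """Spread the lines of input_path at random over n_files gzip files."""
    rng = random.Random(seed)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = [target / f"part-{i}.gz" for i in range(n_files)]
    outputs = [gzip.open(p, "wt", encoding="utf-8") for p in paths]
    try:
        with open(input_path, encoding="utf-8") as source:
            for line in source:
                rng.choice(outputs).write(line)
    finally:
        for out in outputs:
            out.close()
    return paths


def count_lines(path):
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Demo on a temporary directory with 100 generated lines.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    src.write_text("".join(f"line {i}\n" for i in range(100)), encoding="utf-8")
    parts = dispatch(src, Path(tmp) / "target", n_files=8, seed=0)
    total = sum(count_lines(p) for p in parts)
    print(len(parts), total)  # 8 100
```

Every input line lands in exactly one output file, so the per-file line counts always add up to the input size, whatever the random split.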
slide-23
SLIDE 23

DCTC : More examples

# cat from anywhere
% dctc cat ftp://account@:/pub/data/data.csv

# Edit a remote file (with $EDITOR)
% dctc edit ssh://account@:myfile.txt

# Transparently unzip
% dctc

# Head / tail from the cloud
% dctc tail s3://bucket/huge-log.csv

# Multi-account aware
% dctc sync s3://account1@path s3://account2@other_path

slide-24
SLIDE 24

Try it now

http://dctc.io

• Self-contained binary for Linux, OS X, Windows
• Amazon S3
• Google Cloud Storage
• FTP
• HTTP
• SSH
• HDFS (through local install)

slide-25
SLIDE 25

Questions ?

Marc Batty, Chief Customer Officer, marc.batty@dataiku.com, +33 6 45 65 67 04, @battymarc
Florian Douetteau, Chief Executive Officer, florian.douetteau@dataiku.com, +33 6 70 56 88 97, @fdouetteau
Thomas Cabrol, Chief Data Scientist, thomas.cabrol@dataiku.com, +33 7 86 42 62 81, @ThomasCabrol
Clément Stenac, Chief Technical Officer, clement.stenac@dataiku.com, +33 6 28 06 79 04, @ClementStenac
