Automating Workflows for Analytics Pipelines
Sadayuki Furuhashi
Open Source Summit 2017
An open-source hacker. A founder of Treasure Data, Inc., located in Silicon Valley.
GitHub: @frsyuki
OSS projects I founded:
Different APIs, different tools, many scripts.
Amazon S3
Amazon Redshift Amazon EMR
> Hi! > I'm a new technology!
Ingest: application logs, user attribute data, ad impressions, 3rd-party cookie data
Enrich: removing bot access, geo location from IP address, parsing User-Agent, JOINing user attributes to event logs
Model: A/B testing, funnel analysis, segmentation analysis, machine learning
Load: creating indexes, data partitioning, data compression, statistics collection
Utilize: recommendation API, realtime ad bidding, visualize using BI applications
Ingest → Enrich → Model → Load → Utilize
#!/bin/bash
./run_mysql_query.sh
./load_facebook_data.sh
./rsync_apache_logs.sh
./start_emr_cluster.sh
for query in emr/*.sql; do
  ./run_emr_hive $query
done
./shutdown_emr_cluster.sh
./run_redshift_queries.sh
./call_finish_notification.sh
> Poor error handling
> Write once, nobody reads
> No alerts on failure
> No alerts on too-long runs
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
Digdag solves:
> Poor error handling
> Write once, nobody reads
> No alerts on failure
> No alerts on too-long runs
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
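As an example of how a workflow engine replaces hand-rolled retry loops, here is a minimal sketch using Digdag's built-in _retry option (the script name is reused from the shell pipeline above for illustration):

```yaml
# Sketch: _retry re-runs this task automatically on failure, up to 3 times.
+load:
  _retry: 3
  sh>: ./run_redshift_queries.sh
```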
BigQuery
Treasure Data
(on-premises)
+wait_for_arrival:
  s3_wait>: bucket/www_${session_date}.csv

+load_table:
  redshift>: scripts/copy.sql

Powerful for Engineers
> Comfortable for advanced users
Friendly for Analysts
> Still straightforward for analysts to understand & leverage workflows
+wait_for_arrival:
  s3_wait>: bucket/www_${session_date}.csv

+load_table:
  redshift>: scripts/copy.sql
+ is a task
> is an operator
${...} is a variable
_export:
  td:
    database: workflow_temp

+task1:
  td>: queries/open.sql
  create_table: daily_open

+task2:
  td>: queries/close.sql
  create_table: daily_close
Standard libraries
redshift>: runs Amazon Redshift queries
emr>: creates/shuts down a cluster & runs steps
s3_wait>: waits until a file is put on S3
pg>: runs PostgreSQL queries
td>: runs Treasure Data queries
td_for_each>: repeats tasks for result rows
mail>: sends an email
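These operators combine with Digdag's error hooks. As a minimal sketch (the body template file and address are hypothetical), a workflow can send mail whenever any task fails:

```yaml
# Sketch: the _error block runs if any task in the workflow fails.
# error_body.txt and the address are placeholders.
_error:
  +notify:
    mail>: error_body.txt
    subject: workflow failed
    to: [alerts@example.com]

+load_table:
  redshift>: scripts/copy.sql
```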
Open-source libraries
You can release & use open-source operator libraries.
+load_data:
  _parallel: true

  +load_users:
    redshift>: copy/users.sql

  +load_items:
    redshift>: copy/items.sql
Parallel execution
Tasks under the same group run in parallel if the _parallel option is set to true.
+send_email_to_active_users:
  td_for_each>: list_active.sql
  _do:
    +send:
      email>: template.txt
      to: ${td.for_each.addr}
Parameter
A task can propagate parameters to the following tasks.
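One way this works (a sketch; tasks.py, the function, and the variable name are made up) is a py> task storing a parameter via Digdag's Python API, which later tasks then read as a ${...} variable:

```yaml
# workflow.dig -- +produce stores a parameter, +consume references it
+produce:
  py>: tasks.produce

+consume:
  sh>: echo "greeting is ${greeting}"

# tasks.py
#   import digdag
#   def produce():
#       digdag.env.store({"greeting": "hello"})
```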
Loop
Generate subtasks dynamically so that Digdag applies the same set of tasks to different data.
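Besides td_for_each>, the same pattern works with the generic for_each> operator. A minimal sketch (the region values and script are hypothetical):

```yaml
# Sketch: for_each> generates one +process subtask per region value.
+repeat:
  for_each>:
    region: [us, eu, ap]
  _do:
    +process:
      sh>: ./process.sh ${region}
```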
[Diagram: each pipeline stage (Ingest, Enrich, Model, Load, Utilize) expands into a group of +tasks; groups such as +ingest, +enrich, +model, +basket_analysis, +learn, and +load nest their own subtasks.]
schedule:
  daily>: 01:30:00

timezone: Asia/Tokyo

_export:
  docker:
    image: my_image:latest

+task:
  sh>: ./run_in_docker
Digdag server
> Develop on laptop, push it to a server.
> Workflows run periodically on a server.
> Backfill
> Web editor & monitor
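The laptop-to-server flow above can be sketched with the digdag CLI (the project and workflow names, and the backfill date, are hypothetical):

```shell
# Sketch: upload workflow definitions from a laptop to a Digdag server,
# trigger a run, and re-run past scheduled sessions.
digdag push my_project
digdag start my_project daily_load --session now
digdag backfill my_project daily_load --from 2017-01-01
```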
Docker
> Install scripts & dependencies in a Docker image, not on a server.
> Workflows can run anywhere, including a developer's laptop.
[Diagram: Digdag server architecture. It's just like a web application: the Digdag client talks to a Digdag server (API & scheduler & executor, with a visual UI), and all task state lives in PostgreSQL.]
[Diagram: HA setup. Stateless Digdag servers + replicated PostgreSQL: clients and the visual UI reach multiple Digdag servers through an HTTP load balancer; all task state is in the replicated database.]
[Diagram: Isolating API and execution for reliability. API servers behind the HTTP load balancer are separated from the scheduler & executor servers; both share the HA PostgreSQL holding all task state.]
Plugin types: input/output, parser/formatter, decoder/encoder, filter, and executor (Embulk); input/output, and filter (Fluentd).