Automating Workflows for Analytics Pipelines - Sadayuki Furuhashi - PowerPoint PPT Presentation



SLIDE 1

Automating Workflows for Analytics Pipelines

Sadayuki Furuhashi

Open Source Summit 2017

SLIDE 2

Sadayuki Furuhashi

A founder of Treasure Data, Inc. located in Silicon Valley.

OSS projects I founded:

An open-source hacker. Github: @frsyuki

SLIDE 3

What's a Workflow Engine?

  • Automates your manual operations.
  • Load data → Clean up → Analyze → Build reports
  • Get customer list → Generate HTML → Send email
  • Monitor server status → Restart on abnormal state
  • Backup database → Alert on failure
  • Run test → Package it → Deploy


(Continuous Delivery)
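As an illustration, the first chain above could be written as a workflow definition for an engine like Digdag (the engine presented in this talk); the file and script names are hypothetical:

```
# analytics.dig (hypothetical script names)
+load_data:
  sh>: ./load_data.sh
+clean_up:
  sh>: ./clean_up.sh
+analyze:
  sh>: ./analyze.sh
+build_reports:
  sh>: ./build_reports.sh
```

Each +task starts only after the previous one succeeds, and the engine can layer scheduling, retries, and alerts on top.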

SLIDE 4

Challenge: Multiple Clouds & Regions

On-Premises

Different APIs, different tools, many scripts.

SLIDE 5

Challenge: Multiple DB technologies

Amazon S3
Amazon Redshift
Amazon EMR

SLIDE 6

Challenge: Multiple DB technologies

Amazon S3
Amazon Redshift
Amazon EMR

> Hi! I'm a new technology!

SLIDE 7

Challenge: Modern complex data analytics

Ingest: Application logs, User attribute data, Ad impressions, 3rd-party cookie data
Enrich: Removing bot access, Geo location from IP address, Parsing User-Agent, JOIN user attributes to event logs
Model: A/B testing, Funnel analysis, Segmentation analysis, Machine learning
Load: Creating indexes, Data partitioning, Data compression, Statistics collection
Utilize: Recommendation API, Realtime ad bidding, Visualize using BI applications

Ingest → Enrich → Model → Load → Utilize

SLIDE 8

Traditional "false" solution

#!/bin/bash
./run_mysql_query.sh
./load_facebook_data.sh
./rsync_apache_logs.sh
./start_emr_cluster.sh
for query in emr/*.sql; do
  ./run_emr_hive $query
done
./shutdown_emr_cluster.sh
./run_redshift_queries.sh
./call_finish_notification.sh

> Poor error handling
> Write once, nobody reads
> No alerts on failure
> No alerts on too-long runs
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization

SLIDE 9

Solution: Multi-Cloud Workflow Engine

Solves

> Poor error handling
> Write once, nobody reads
> No alerts on failure
> No alerts on too-long runs
> No retrying on errors
> No resuming
> No parallel execution
> No distributed execution
> No log collection
> No visualized monitoring
> No modularization
> No parameterization
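As a hedged sketch, the shell script from the previous slide could become a Digdag workflow like the following; _retry, _parallel, _error, and for_each>/_do are real Digdag constructs, while the script names are the hypothetical ones from the slide:

```
_error:
  sh>: ./call_error_notification.sh   # alert on any failure

+run_mysql_query:
  sh>: ./run_mysql_query.sh
  _retry: 3                           # retry on errors

+load_inputs:
  _parallel: true                     # parallel execution
  +load_facebook_data:
    sh>: ./load_facebook_data.sh
  +rsync_apache_logs:
    sh>: ./rsync_apache_logs.sh

+run_emr_queries:
  for_each>:
    query: [emr/query1.sql, emr/query2.sql]
  _do:
    sh>: ./run_emr_hive ${query}

+run_redshift_queries:
  sh>: ./run_redshift_queries.sh

+finish:
  sh>: ./call_finish_notification.sh
```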

SLIDE 10

Example in our case

  • 1. Dump data to BigQuery
  • 2. Load all tables to Treasure Data
  • 3. Run queries
  • 4. Create reports on Tableau Server (on-premises)
  • 5. Notify on Slack
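A sketch of that pipeline as a Digdag workflow; bq>, td_load>, and td> are standard Digdag operators, but the queries, config files, and scripts here are invented for illustration, and Slack notification would go through a plugin or script:

```
+dump_to_bigquery:
  bq>: queries/dump.sql
+load_to_treasure_data:
  td_load>: config/from_bigquery.yml
+run_queries:
  td>: queries/daily_report.sql
+create_reports:
  sh>: ./refresh_tableau.sh        # on-premises Tableau Server
+notify:
  sh>: ./notify_slack.sh
```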

SLIDE 11

Workflow constructs

SLIDE 12

Unite Engineering & Analytic Teams

+wait_for_arrival:
  s3_wait>: |
    bucket/www_${session_date}.csv
+load_table:
  redshift>: scripts/copy.sql

Powerful for Engineers

> Comfortable for advanced users

Friendly for Analysts

> Still straightforward for analysts to understand & leverage workflows

SLIDE 13

Unite Engineering & Analytic Teams

Powerful for Engineers

> Comfortable for advanced users

Friendly for Analysts

> Still straightforward for analysts to understand & leverage workflows

+wait_for_arrival:
  s3_wait>: |
    bucket/www_${session_date}.csv
+load_table:
  redshift>: scripts/copy.sql

+ is a task
> is an operator
${...} is a variable

SLIDE 14

Operator library

_export:
  td:
    database: workflow_temp

+task1:
  td>: queries/open.sql
  create_table: daily_open

+task2:
  td>: queries/close.sql
  create_table: daily_close

Standard libraries

redshift>: runs Amazon Redshift queries
emr>: creates/shuts down a cluster & runs steps
s3_wait>: waits until a file is put on S3
pg>: runs PostgreSQL queries
td>: runs Treasure Data queries
td_for_each>: repeats a task for each result row
mail>: sends an email

Open-source libraries

You can release & use open-source operator libraries.
SLIDE 15

Parallel execution

+load_data:
  _parallel: true

  +load_users:
    redshift>: copy/users.sql

  +load_items:
    redshift>: copy/items.sql

Parallel execution

Tasks under the same group run in parallel if the _parallel option is set to true.
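Groups can nest, so a sequential outer flow can contain a parallel inner group; a sketch with hypothetical task names:

```
+extract:
  sh>: ./extract.sh
+load_data:
  _parallel: true
  +load_users:
    redshift>: copy/users.sql
  +load_items:
    redshift>: copy/items.sql
+aggregate:
  redshift>: queries/aggregate.sql   # runs only after both loads finish
```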

SLIDE 16

Loops & Parameters

+send_email_to_active_users:
  td_for_each>: list_active.sql
  _do:
    +send:
      email>: template.txt
      to: ${td.for_each.addr}

Parameter

A task can propagate parameters to the tasks that follow it.

Loop

Generate subtasks dynamically so that Digdag applies the same set of operators to different data sets.
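Parameters set with _export at the top of a workflow are visible to every task below it; a minimal sketch with an invented variable name:

```
_export:
  target_table: www_access          # visible to all tasks below

+load:
  sh>: ./load.sh ${target_table}
+analyze:
  sh>: ./analyze.sh ${target_table}
```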
SLIDE 17

Grouping workflows...

(Diagram: Ingest, Enrich, Model, Load, and Utilize stages, each consisting of many +task steps)

SLIDE 18

Grouping workflows

(Diagram: the same pipeline grouped into +ingest, +enrich, +model, and +load parent tasks; +model contains +basket_analysis and +learn subgroups, and each group nests its own +tasks)
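The grouped diagram above might correspond to a workflow file shaped like this (task bodies invented for illustration):

```
+ingest:
  +fetch_logs:
    sh>: ./fetch_logs.sh
  +fetch_users:
    sh>: ./fetch_users.sh

+enrich:
  +clean:
    td>: queries/clean.sql

+model:
  +basket_analysis:
    td>: queries/basket.sql
  +learn:
    sh>: ./train_model.sh

+load:
  +build_indexes:
    sh>: ./build_indexes.sh
```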

SLIDE 19

Pushing workflows to a server with Docker image

schedule:
  daily>: 01:30:00

timezone: Asia/Tokyo

_export:
  docker:
    image: my_image:latest

+task:
  sh>: ./run_in_docker

Digdag server

> Develop on laptop, push it to a server.
> Workflows run periodically on a server.
> Backfill
> Web editor & monitor

Docker

> Install scripts & dependencies in a Docker image, not on a server.
> Workflows can run anywhere, including a developer's laptop.
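The develop-then-push cycle uses the Digdag CLI; a session sketch (project and workflow names hypothetical, see digdag.io for exact flags):

```
$ digdag run daily_job.dig       # run locally on a laptop
$ digdag push myproject          # upload the project to a Digdag server
$ digdag backfill myproject daily_job --from 2017-01-01   # re-run past sessions
```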

SLIDE 20

Demo

SLIDE 21

Digdag is production-ready

It's just like a web application: a Digdag client and the Visual UI talk to a Digdag server (API & scheduler & executor), which keeps all task state in PostgreSQL.

SLIDE 22

Digdag is production-ready

HA: stateless servers + replicated DB. The Digdag client and Visual UI connect through an HTTP load balancer to multiple Digdag servers (API & scheduler & executor), and all task state lives in replicated PostgreSQL.

SLIDE 23

Digdag is production-ready

Isolating API and execution for reliability: Digdag API servers sit behind an HTTP load balancer, separate from the scheduler & executor servers, and all task state lives in HA PostgreSQL.

SLIDE 24

Digdag at Treasure Data

3,600 workflows run every day
28,000 tasks run every day
850 active workflows
400,000 workflow executions in total

SLIDE 25

Digdag & Open Source

SLIDE 26

Learning from my OSS projects

  • Make it pluggable!

700+ plugins in 6 years
200+ plugins in 3 years
70+ implementations in 8 years

(Plugin types: input/output, parser/formatter, decoder/encoder, filter, and executor; input/output, and filter.)

SLIDE 27

Digdag also has plugin architecture

32 operators
7 schedulers
2 command executors
1 error notification module

SLIDE 28

Sadayuki Furuhashi
https://digdag.io
Visit my website!