ETL pipeline to achieve reliability at scale, by Isabel López Andrade



slide-1
SLIDE 1

ETL pipeline to achieve reliability at scale

By Isabel López Andrade

slide-2
SLIDE 2

Accounting at Smarkets

slide-3
SLIDE 3

Accounting at Smarkets

Reports generated daily using transactions from the exchange. In 2013, the average number of daily transactions was under 190K. In 2018, this figure is over 8.8M.

slide-4
SLIDE 4

Original pipeline

  • Difficult to identify errors.
  • Regenerating reports required manual work and expert knowledge of the system.
  • System too slow and unable to scale. It took more than one day to run.
  • Costly storage.
slide-5
SLIDE 5

Requirements

  • Fault tolerance and reliability.
  • Storage with fast I/O, high availability, durability, and cost efficiency.
  • Good processing performance.
  • Scalable.

[Diagram: Exchange → Generate transaction files → Generate daily and monthly account statistics → Generate accounting reports, with persistent storage.]

slide-6
SLIDE 6

Fault tolerance and reliability

Vulnerabilities

  • Communication with exchange may fail.
  • Hardware or software errors may happen while the job is running.

Design solutions

  • Store transactions per day.
  • Compute financial statistics per day.
  • Retrieve the last two days' worth of transactions.
  • Break the accounting job into modular Luigi tasks.
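
The per-day design above can be sketched as follows. This is a minimal illustration, not the implementation from the talk: the path layout `transactions/YYYY-MM-DD.parquet`, the function names, and the reading of "last two days" as the two days before the run day are all assumptions. It shows why one file per day makes a rerun cheap: a retry only rewrites the days it fetches again.

```python
from datetime import date, timedelta
from typing import List


def transactions_path(day: date) -> str:
    # Hypothetical layout: one transaction file per calendar day.
    return f"transactions/{day.isoformat()}.parquet"


def days_to_retrieve(run_day: date, lookback: int = 2) -> List[date]:
    # The `lookback` days before the run day, oldest first (assumed meaning
    # of "the last two days' worth of transactions").
    return [run_day - timedelta(days=offset) for offset in range(lookback, 0, -1)]


run_day = date(2018, 3, 1)
paths = [transactions_path(day) for day in days_to_retrieve(run_day)]
print(paths)
```

Fetching an overlapping window on every run means that a failed exchange request on one day is repaired by the next day's run without manual work.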
slide-7
SLIDE 7

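import luigi
import pandas as pd

# AccountingTask, GenerateAccountingReport, operate() and get_target() are
# project code that is not shown in the slides.


class GenerateHumanReadableAccountingReport(AccountingTask):
    def requires(self) -> luigi.Task:
        return GenerateAccountingReport()

    def run(self) -> None:
        # Read the Parquet report produced by the upstream task.
        with self.input().operate('r') as target_path:
            df_accounting = pd.read_parquet(target_path)
        # Write it back out as a tab-separated file.
        with self.output().open('w') as file_:
            df_accounting.to_csv(file_, sep='\t', index=False)

    def output(self) -> luigi.Target:
        return self.get_target(path='data/reports/accounting-report.tsv')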

slide-8
SLIDE 8

Efficient storage

  • Columnar storage.
  • Only read the columns needed for the task.

  • Minimised I/O.
  • Efficient compression and encoding.
  • Python support.

Parquet

[Figure: row-based vs. column-based storage layout]
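
To make the "only read the columns needed" point concrete, here is a toy comparison in plain Python. It is not Parquet itself: the records, the column names, and the `to_columns` helper are made up for illustration. It counts how many values must be visited to sum one column under each layout.

```python
from typing import Dict, List

# Hypothetical transaction records in a row-based layout.
rows: List[Dict[str, object]] = [
    {"account_id": 1, "amount": 10.0, "currency": "GBP"},
    {"account_id": 2, "amount": 5.5, "currency": "GBP"},
    {"account_id": 1, "amount": 2.5, "currency": "EUR"},
]


def to_columns(records: List[Dict[str, object]]) -> Dict[str, List[object]]:
    # Pivot row-based records into one list per column.
    columns: Dict[str, List[object]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row-based: every field of every record is loaded to sum one column.
values_touched_rows = sum(len(record) for record in rows)
total_from_rows = sum(record["amount"] for record in rows)

# Column-based: only the "amount" column is loaded.
values_touched_columns = len(columns["amount"])
total_from_columns = sum(columns["amount"])

print(values_touched_rows, values_touched_columns)
```

With pandas, the same idea is expressed by passing a column list, for example `pd.read_parquet(path, columns=["amount"])`.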

slide-9
SLIDE 9

Efficient storage

  • High durability.
  • High availability.
  • Low maintenance.
  • Cost efficient.
  • Decoupling of processing and storage.
  • Python library boto/boto3.
  • Web interface.

Amazon S3

slide-10
SLIDE 10

Good performance

Requirements

  • Fast data processing.
  • Scalable.

Solution

  • General purpose data processing engine.
  • Massively parallel. Spark builds its own execution plans.
  • Caches data in RAM.
  • Python support.
slide-11
SLIDE 11

Spark key concepts

RDD

  • Resilient: fault-tolerant.
  • Distributed: partitioned across multiple nodes.
  • Dataset: collection of data.

Dataframes

  • Data organised in columns, built on top of RDDs.
  • Better performance than RDDs.
  • User-friendly API.

Lineage

[Diagram: Create RDD → RDD (Transformation) → Action → Result]
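
The relationship between transformations, actions and lineage can be illustrated with a toy model in plain Python. This is not Spark: `ToyDataset` and its methods are hypothetical names that only mimic the idea that transformations are recorded and nothing is computed until an action runs.

```python
from typing import Callable, List, Tuple


class ToyDataset:
    """Toy stand-in for an RDD: records its lineage and evaluates lazily."""

    def __init__(self, source: List[int], lineage: Tuple = ()):
        self._source = list(source)
        self._lineage = lineage

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Transformation: only recorded, not executed.
        return ToyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate: Callable[[int], bool]) -> "ToyDataset":
        # Transformation: only recorded, not executed.
        return ToyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self) -> List[int]:
        # Action: replay the recorded lineage over the source data.
        data = self._source
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def lineage(self) -> List[str]:
        return [kind for kind, _ in self._lineage]


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # Track when the function actually runs.
    return x * 2


ds = ToyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # Still empty: nothing has run yet.
result = ds.collect()
print(ds.lineage(), result)
```

Because the result can always be rebuilt by replaying the lineage, lost data can be recomputed instead of being replicated, which is where the "resilient" in RDD comes from.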

slide-12
SLIDE 12

Execution on Spark

slide-13
SLIDE 13

Spark job from Luigi

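import luigi
import pyspark
import pyspark.sql
from luigi.contrib.spark import PySparkTask  # assumed source of PySparkTask

# AccountingTask, GenerateAccountingReport, read_parquet, write_parquet and
# SMARKETS_ACCOUNT_ID are project code that is not shown in the slides.


class GenerateSmarketsAccountReport(PySparkTask, AccountingTask):
    def requires(self) -> luigi.Task:
        return GenerateAccountingReport()

    def main(self, sc: pyspark.SparkContext) -> None:
        spark = pyspark.sql.SparkSession(sc)
        sdf_per_account = read_parquet(spark, self.input())
        # Keep only the rows that belong to the Smarkets account.
        sdf_smarkets = sdf_per_account.filter(
            sdf_per_account.account_id == SMARKETS_ACCOUNT_ID
        )
        write_parquet(sdf_smarkets, self.output())

    def output(self) -> luigi.Target:
        return self.get_target(
            path='data/reports/accounting-report-smarkets.parquet'
        )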

slide-14
SLIDE 14

Scalability

  • Spark cluster.
  • Fast deployment.
  • Easy to use.
  • Flexible.
  • Seamless integration with S3 - EMRFS.
  • Ability to shut down the cluster when the job is done without data loss.

  • Low cost.
  • Nice web interface.

Submit Step

slide-15
SLIDE 15

Spark on EMR

[Diagram: EMR cluster with a master node running the YARN Resource Manager and slave node(s) holding three empty YARN containers.]

slide-16
SLIDE 16

Spark on EMR

[Diagram, step 1: a client is added next to the master node (YARN Resource Manager); the three YARN containers on the slave node(s) are still empty.]

slide-17
SLIDE 17

Spark on EMR

[Diagram, step 2: one YARN container on the slave node(s) now hosts the Spark driver with its Spark Context; the other two YARN containers are empty.]

slide-18
SLIDE 18

Spark on EMR

[Diagram, step 3: client, master node (YARN Resource Manager), one YARN container hosting the Spark driver with its Spark Context, and two empty YARN containers.]

slide-19
SLIDE 19

Spark on EMR

[Diagram, step 4: the two remaining YARN containers now each host a Spark executor, alongside the container with the Spark driver and its Spark Context.]

slide-20
SLIDE 20

Spark on EMR

[Diagram, step 5: client, master node (YARN Resource Manager), one YARN container hosting the Spark driver with its Spark Context, and two YARN containers each hosting a Spark executor.]

slide-21
SLIDE 21

Spark on EMR

[Diagram, step 6, complete view: client, master node (YARN Resource Manager), and slave node(s) with one YARN container hosting the Spark driver and its Spark Context and two YARN containers each hosting a Spark executor that runs tasks. Steps are numbered 1 to 6.]

slide-22
SLIDE 22

[Flowchart. Lanes: Luigi task, Application script, Luigi task main().]

  • Create Spark cluster on EMR.
  • Upload task pickle to S3.
  • Add step to EMR cluster to execute application script.
  • Poll for step status.
  • Unpickle Luigi task.
  • Create SparkContext instance.
  • Run Luigi task main().
  • Dataframe and RDD operations.
  • Store result in S3.
  • Luigi Event Handler (SUCCESS, FAILURE): when the Luigi Scheduler has no pending EmrSparkSubmitTask that can be run successfully, destroy the EMR cluster.

Run Accounting job

  • Create accounting container.
  • Start Luigi Central Scheduler.
  • Run Accounting wrapper task.
  • Create Spark cluster on EMR.
  • Submit Luigi tasks to EMR cluster.
  • Destroy EMR cluster.

Accounting container and EMR cluster share/save files using S3.
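
The pickle, upload, unpickle and poll steps of this flow can be sketched with in-memory stand-ins. Nothing here talks to AWS: the dictionary plays the role of an S3 bucket, `ExampleTask`, `upload_task`, `application_script` and `poll_step` are hypothetical names, and the state stream is canned. Only the state names are real EMR step states.

```python
import pickle
from typing import Dict, Iterator, List


class ExampleTask:
    """Hypothetical stand-in for a Luigi task that exposes main()."""

    def __init__(self, account_id: int):
        self.account_id = account_id

    def main(self, rows: List[dict]) -> List[dict]:
        # Stand-in for the Dataframe and RDD operations.
        return [row for row in rows if row["account_id"] == self.account_id]


def upload_task(store: Dict[str, bytes], key: str, task: ExampleTask) -> None:
    # Stand-in for "Upload task pickle to S3".
    store[key] = pickle.dumps(task)


def application_script(store: Dict[str, bytes], key: str, rows: List[dict]) -> List[dict]:
    # Stand-in for the script executed as an EMR step:
    # unpickle the task, then run its main().
    task = pickle.loads(store[key])
    return task.main(rows)


TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "INTERRUPTED"}


def poll_step(states: Iterator[str]) -> str:
    # Stand-in for "Poll for step status": read states until a terminal one.
    for state in states:
        if state in TERMINAL_STATES:
            return state
    raise RuntimeError("status stream ended before a terminal state")


store: Dict[str, bytes] = {}
upload_task(store, "tasks/example.pickle", ExampleTask(account_id=7))
rows = [{"account_id": 7, "amount": 1.0}, {"account_id": 8, "amount": 2.0}]
result = application_script(store, "tasks/example.pickle", rows)
final_state = poll_step(iter(["PENDING", "RUNNING", "COMPLETED"]))
print(result, final_state)
```

Pickling the task keeps its parameters with it, so the script on the cluster can stay generic and does not need to know which report it is producing.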

slide-23
SLIDE 23

Thanks!

slide-24
SLIDE 24

Parquet


slide-25
SLIDE 25

Submit Spark application to EMR from Luigi

[Flowchart. Lanes: Luigi task, Application script, Luigi task main().]

  • Create Spark cluster on EMR.
  • Upload task pickle to S3.
  • Add step to EMR cluster to execute application script.
  • Poll for step status.
  • Unpickle Luigi task.
  • Create SparkContext instance.
  • Run Luigi task main().
  • Dataframe and RDD operations.
  • Store result in S3.
  • Luigi Event Handler (SUCCESS, FAILURE, PROCESS_FAILURE, TIMEOUT, BROKEN_TASK, DEPENDENCY_MISSING): when the Luigi Scheduler has no pending EmrSparkSubmitTask that can be run successfully, destroy the EMR cluster.

Run Accounting task

  • Create accounting container.
  • Start Luigi Central Scheduler.
  • Run Accounting wrapper task.
  • Create Spark cluster on EMR.
  • Submit Luigi tasks to EMR cluster.
  • Destroy EMR cluster.

Accounting container and EMR cluster share/save files using S3.

slide-26
SLIDE 26

Shutdown EMR cluster

  • A task won’t raise an event if one of its dependencies has failed.
  • In case of a dependency failure, we want to destroy the cluster if the only tasks left depend on the failing task.
  • Information about pending tasks and task dependencies is fetched from the Luigi Central Scheduler.
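
The shutdown rule above amounts to a reachability check on the dependency graph. The sketch below is one way to express it, assuming the pending tasks and their dependencies have already been fetched from the scheduler into plain Python collections; the function names and the task names are hypothetical.

```python
from typing import Dict, Set


def is_blocked(task: str, deps: Dict[str, Set[str]], failed: Set[str]) -> bool:
    # A task is blocked if it failed or depends, directly or transitively,
    # on a task that failed.
    stack, seen = [task], set()
    while stack:
        current = stack.pop()
        if current in failed:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(deps.get(current, set()))
    return False


def should_destroy_cluster(
    pending: Set[str], deps: Dict[str, Set[str]], failed: Set[str]
) -> bool:
    # Destroy the cluster when no pending task can still run successfully.
    # With no pending tasks at all, this is trivially true.
    return all(is_blocked(task, deps, failed) for task in pending)


# Hypothetical dependency graph: task -> tasks it requires.
deps = {"report": {"stats"}, "stats": {"transactions"}, "audit": set()}
failed = {"transactions"}

only_blocked_left = should_destroy_cluster({"report", "stats"}, deps, failed)
runnable_task_left = should_destroy_cluster({"report", "audit"}, deps, failed)
print(only_blocked_left, runnable_task_left)
```

In the first call every pending task depends on the failed one, so the cluster can be shut down; in the second, `audit` can still run, so the cluster is kept.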
