ETL pipeline to achieve reliability at scale
By Isabel López Andrade
Accounting at Smarkets
Reports generated daily using transactions from the exchange. In 2013, the average number of daily transactions was under 190K. In 2018, this figure is over 8.8M.
Original pipeline
Requirements
Exchange → Generate transaction files → Generate daily and monthly account statistics → Generate accounting reports → Persistent storage
Fault tolerance and reliability
Vulnerabilities
A single machine runs the whole pipeline: progress is lost if it fails while the job is running.
Design solutions
class GenerateHumanReadableAccountingReport(AccountingTask):
    def requires(self) -> luigi.Task:
        # Depends on the Parquet report produced upstream.
        return GenerateAccountingReport()

    def run(self) -> None:
        with self.input().operate('r') as target_path:
            df_accounting = pd.read_parquet(target_path)
        # Write the report as a tab-separated file for human consumption.
        with self.output().open('w') as file_:
            df_accounting.to_csv(file_, sep='\t', index=False)

    def output(self) -> luigi.Target:
        return self.get_target(path='data/reports/accounting-report.tsv')
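A task like this can be triggered in-process through Luigi's Python entry point (a minimal sketch; luigi.build and local_scheduler are standard Luigi, the task class is the one above):

import luigi

# Run the task with an in-process scheduler; Luigi skips it if the
# output target already exists, which is what makes reruns safe.
luigi.build(
    [GenerateHumanReadableAccountingReport()],
    local_scheduler=True,
)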
Efficient storage
Row-based vs. column-based storage formats: Parquet, a column-based format, is used for the outputs of the task.
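The trade-off is easy to see with pandas (a minimal sketch, assuming pyarrow is installed; the file names and data are illustrative):

import pandas as pd

df = pd.DataFrame({
    'account_id': [1, 2, 3],
    'balance': [100.0, 250.5, 0.0],
})

# Row-based: human-readable, but untyped and uncompressed.
df.to_csv('accounts.csv', index=False)

# Column-based: typed, compressed, and columns can be read selectively.
df.to_parquet('accounts.parquet')

# Reading a single column avoids scanning the whole file.
balances = pd.read_parquet('accounts.parquet', columns=['balance'])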
Good performance
Requirements
Solution
Apache Spark: a distributed data processing engine.
Create RDD → Transformation → RDD → Action → Result
Spark key concepts
RDD
Resilient: fault-tolerant. Distributed: partitioned across multiple nodes. Dataset: collection of data.
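A minimal PySpark sketch of the create → transform → act flow (the numbers are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName='rdd-demo')

# Create an RDD, partitioned across the cluster.
rdd = sc.parallelize(range(10))

# Transformations are lazy: nothing executes yet.
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# An action triggers execution and returns a result to the driver.
print(evens.collect())  # [0, 4, 16, 36, 64]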
DataFrames
Data organised in columns, built on top of RDDs. Better performance than RDDs. User-friendly API.
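The same idea as a short sketch (column names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('df-demo').getOrCreate()

df = spark.createDataFrame(
    [(1, 100.0), (2, 250.5), (3, 0.0)],
    ['account_id', 'balance'],
)

# The query is optimised by Catalyst before running on the underlying RDDs.
df.filter(df.balance > 0).show()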
Lineage
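Each RDD records the chain of transformations that produced it, so a lost partition can be recomputed from its ancestors instead of being restored from a backup. The recorded graph can be inspected (reusing the RDD from the sketch above):

# toDebugString() shows the transformations Spark would replay
# if a partition of `evens` were lost.
print(evens.toDebugString().decode())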
Execution on Spark
Spark job from Luigi
class GenerateSmarketsAccountReport(PySparkTask, AccountingTask):
    def requires(self) -> luigi.Task:
        return GenerateAccountingReport()

    def main(self, sc: pyspark.SparkContext) -> None:
        spark = pyspark.sql.SparkSession(sc)
        sdf_per_account = read_parquet(spark, self.input())
        # Keep only the rows for the Smarkets house account.
        sdf_smarkets = sdf_per_account.filter(
            sdf_per_account.account_id == SMARKETS_ACCOUNT_ID
        )
        write_parquet(sdf_smarkets, self.output())

    def output(self) -> luigi.Target:
        return self.get_target(
            path='data/reports/accounting-report-smarkets.parquet'
        )
Scalability
Nodes can be added to or removed from the cluster without data loss.
Spark on EMR
An EMR cluster consists of a Master Node, which runs the YARN Resource Manager, and one or more Slave Nodes, which provide YARN Containers. A Spark application starts as follows:
1. The Client submits a step to the Master Node.
2. The YARN Resource Manager allocates a YARN Container on a slave node for the Spark Driver.
3. The Spark Driver creates its Spark Context.
4. The Resource Manager allocates further YARN Containers for the application.
5. A Spark Executor starts inside each of those containers.
6. The Spark Context schedules Tasks on the executors.
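Submitting a step to a running cluster might look like this (a hedged sketch using boto3; the cluster id, bucket, and script path are placeholders):

import boto3

emr = boto3.client('emr', region_name='eu-west-1')

# Each EMR step wraps a spark-submit invocation via command-runner.jar.
response = emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster id
    Steps=[{
        'Name': 'accounting-report',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'spark-submit',
                '--deploy-mode', 'cluster',
                's3://example-bucket/scripts/run_task.py',  # placeholder
            ],
        },
    }],
)
step_id = response['StepIds'][0]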
Submit Spark application to EMR from Luigi
Luigi task (in the accounting container):
1. Create a Spark cluster on EMR.
2. Upload the task pickle to S3.
3. Add a step to the EMR cluster to execute the application script.
4. Poll for step status.

Application script (on the cluster):
1. Unpickle the Luigi task.
2. Create a SparkContext instance.
3. Run the Luigi task's main(): DataFrame and RDD operations, storing the result in S3.

Luigi event handler: on SUCCESS, FAILURE, PROCESS_FAILURE, TIMEOUT, BROKEN_TASK or DEPENDENCY_MISSING, check the Luigi Scheduler; when no pending EmrSparkSubmitTask can be run successfully, destroy the EMR cluster.

Overall flow: run the Accounting task → create the accounting container → start the Luigi Central Scheduler → run the Accounting wrapper task → create the Spark cluster on EMR → submit Luigi tasks to the EMR cluster → destroy the EMR cluster. The accounting container and the EMR cluster share and save files using S3.
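The cluster-shutdown hook can be wired up with Luigi's event system (a minimal sketch; EmrSparkSubmitTask, no_runnable_emr_tasks_pending and destroy_emr_cluster are hypothetical stand-ins for the project's own classes and helpers):

import luigi

class EmrSparkSubmitTask(luigi.Task):
    # Stand-in for the project's base class for Spark-on-EMR tasks.
    pass

def no_runnable_emr_tasks_pending() -> bool:
    # Hypothetical helper: ask the scheduler whether any pending
    # EmrSparkSubmitTask can still run successfully.
    ...

def destroy_emr_cluster() -> None:
    # Hypothetical helper: terminate the EMR cluster.
    ...

@EmrSparkSubmitTask.event_handler(luigi.Event.SUCCESS)
def on_success(task):
    if no_runnable_emr_tasks_pending():
        destroy_emr_cluster()

@EmrSparkSubmitTask.event_handler(luigi.Event.FAILURE)
def on_failure(task, exception):
    if no_runnable_emr_tasks_pending():
        destroy_emr_cluster()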