 
ETL pipeline to achieve reliability at scale
By Isabel López Andrade
Accounting at Smarkets
Reports are generated daily using transactions from the exchange. In 2013, the average number of daily transactions was under 190K. In 2018, this figure is over 8.8M.
Original pipeline
• Difficult to identify errors.
• Regenerating reports required manual work and expert knowledge of the system.
• System too slow and unable to scale. It took more than one day to run.
• Costly storage.
[Diagram: Exchange → Generate transaction files → Generate account statistics → Generate daily and monthly accounting reports, with persistent storage]

Requirements
• Fault tolerance and reliability.
• Storage with fast I/O, availability, durability, and cost efficiency.
• Good processing performance.
• Scalability.
Fault tolerance and reliability

Vulnerabilities
• Communication with the exchange may fail.
• Hardware or software errors may happen while the job is running.

Design solutions
• Store transactions per day.
• Compute financial statistics per day.
• Retrieve the last two days' worth of transactions.
• Break the accounting job into modular Luigi tasks.
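The per-day design above can be sketched roughly as follows. This is a minimal illustration, not the talk's code: the helper names and the `data/transactions/<date>.parquet` layout are hypothetical, and "last two days" is assumed to mean the two days before the run date. The point is that re-fetching a small, fixed window of per-day files lets a run recover from a failed previous run without manual work.

```python
from datetime import date, timedelta
from typing import List


def days_to_retrieve(run_date: date, lookback_days: int = 2) -> List[date]:
    """Return the `lookback_days` days before `run_date`, oldest first."""
    return [run_date - timedelta(days=offset) for offset in range(lookback_days, 0, -1)]


def transactions_path(day: date) -> str:
    """Hypothetical per-day file layout; the real layout is not given in the talk."""
    return f"data/transactions/{day.isoformat()}.parquet"


if __name__ == "__main__":
    # Each day maps to its own file, so a single bad day can be regenerated alone.
    for day in days_to_retrieve(date(2018, 11, 3)):
        print(transactions_path(day))
```

Because each day is stored and computed independently, a communication failure on one day only requires that day's file to be fetched again.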
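import luigi
import pandas as pd

# AccountingTask and GenerateAccountingReport are the project's own Luigi
# classes (not shown in the slides); get_target() presumably comes from AccountingTask.
class GenerateHumanReadableAccountingReport(AccountingTask):
    def requires(self) -> luigi.Task:
        return GenerateAccountingReport()

    def run(self) -> None:
        # Read the Parquet report produced by the upstream task.
        with self.input().operate('r') as target_path:
            df_accounting = pd.read_parquet(target_path)
        # Write it back out as a tab-separated file.
        with self.output().open('w') as file_:
            df_accounting.to_csv(file_, sep='\t', index=False)

    def output(self) -> luigi.Target:
        return self.get_target(path='data/reports/accounting-report.tsv')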
Efficient storage: Parquet
• Columnar storage.
• Only read the columns needed for the task.
• Minimised I/O.
• Efficient compression and encoding.
• Python support.
[Diagram: row-based vs. column-based storage layout]
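To make the "only read the columns needed" point concrete, here is a toy model in plain Python. It is not Parquet's actual file format, and the field names are illustrative rather than the real transaction schema; it only counts how many values a scan has to touch in each layout.

```python
from typing import Dict, List

# Toy records; the field names are illustrative, not the real schema.
ROWS: List[dict] = [
    {"account_id": 1, "amount": 10.0, "currency": "GBP"},
    {"account_id": 2, "amount": -4.5, "currency": "GBP"},
    {"account_id": 1, "amount": 2.5, "currency": "EUR"},
]


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot row-based records into a column-based layout (one list per field)."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def values_touched_row_based(rows: List[dict]) -> int:
    """A row-based scan reads every field of every row, even to sum one column."""
    return sum(len(row) for row in rows)


def values_touched_column_based(columns: Dict[str, list], wanted: List[str]) -> int:
    """A column-based scan reads only the requested columns."""
    return sum(len(columns[name]) for name in wanted)


columns = to_columns(ROWS)
total_amount = sum(columns["amount"])  # only the "amount" column is read
```

With pandas, the same idea is expressed by passing `columns=[...]` to `pd.read_parquet`, so that only the listed columns are loaded.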
Efficient storage
• High durability.
• High availability.
• Low maintenance.
• Cost efficient.
• Decoupling of processing and storage.
• Python library boto/boto3.
• Web interface.
Good performance

Requirements
• Fast data processing.
• Scalable.

Solution
• General purpose data processing engine.
• Massively parallel. Spark builds its own execution plans.
• Caches data in RAM.
• Python support.
Spark key concepts

RDD
• Resilient: fault-tolerant.
• Distributed: partitioned across multiple nodes.
• Dataset: collection of data.

Dataframes
• Data organised in columns, built on top of RDDs.
• Better performance than RDDs.
• User friendly API.

[Diagram: Create RDD → RDD → Transformation (recorded in the lineage) → Action → Result]
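The transformation / lineage / action flow can be illustrated with a toy class in plain Python. This is a sketch of the idea only: `ToyRDD` is a hypothetical name, it runs on a single machine, and it does none of Spark's partitioning or scheduling. It shows that transformations merely record steps, and that work happens when an action is called.

```python
from typing import Callable, Iterable, List, Tuple


class ToyRDD:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source: Iterable, lineage: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = list(source)
        self._lineage = lineage

    def map(self, fn: Callable) -> "ToyRDD":
        # Transformation: nothing is computed, the step is appended to the lineage.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate: Callable) -> "ToyRDD":
        # Transformation: also lazy, only recorded.
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def collect(self) -> List:
        # Action: replay the lineage over the source data and return the result.
        data = self._source
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def lineage(self) -> List[str]:
        """Names of the recorded transformations, in order."""
        return [kind for kind, _ in self._lineage]


amounts = ToyRDD([5, -3, 12, 7])
large_doubled = amounts.filter(lambda x: x > 4).map(lambda x: x * 2)
result = large_doubled.collect()
```

Because the lineage is enough to rebuild the result from the source data, a lost piece of data can be recomputed rather than restored from a copy, which is where the "resilient" in RDD comes from.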
Execution on Spark
Spark job from Luigi

class GenerateSmarketsAccountReport(PySparkTask, AccountingTask):
    def requires(self) -> luigi.Task:
        return GenerateAccountingReport()

    def main(self, sc: pyspark.SparkContext) -> None:
        spark = pyspark.sql.SparkSession(sc)
        sdf_per_account = read_parquet(spark, self.input())
        # Keep only the rows that belong to the Smarkets account.
        sdf_smarkets = sdf_per_account.filter(
            sdf_per_account.account_id == SMARKETS_ACCOUNT_ID
        )
        write_parquet(sdf_smarkets, self.output())

    def output(self) -> luigi.Target:
        return self.get_target(
            path='data/reports/accounting-report-smarkets.parquet'
        )
Scalability
• Spark cluster.
• Fast deployment.
• Easy to use.
• Flexible.
• Seamless integration with S3 - EMRFS.
• Ability to shut down the cluster when the job is done without data loss.
• Low cost.
• Nice web interface.
[Diagram label: Submit Step]
Spark on EMR
[Diagram, built up over several slides: a master node with the client and the YARN Resource Manager; EMR slave node(s) with YARN containers running the Spark driver (holding the SparkContext) and the Spark executors, each executing tasks. The interactions are numbered 1 to 6.]
Run Accounting job
• Create accounting container → Start Luigi Central Scheduler → Run Accounting wrapper task.
• Create Spark cluster on EMR → Submit Luigi tasks to EMR cluster → Destroy EMR cluster.
• Accounting container and EMR cluster share/save files using S3.

Luigi task
• Create Spark cluster on EMR → Upload task pickle to S3 → Add step to EMR cluster to execute application script → Poll for step status.

Application script
• Create SparkContext → Unpickle Luigi task instance → Run Luigi task main().

Luigi task main()
• Dataframe and RDD operations → Store result in S3.

Luigi Event Handler
• On SUCCESS or FAILURE, check with the Luigi Scheduler and destroy the EMR cluster if there is no pending EmrSparkSubmitTask that can be run successfully.
Thanks!
Submit Spark application to EMR from Luigi

Luigi task
• Create Spark cluster on EMR → Upload task pickle to S3 → Add step to EMR cluster to execute application script → Poll for step status.

Application script
• Create SparkContext → Unpickle Luigi task instance → Run Luigi task main().

Luigi task main()
• Dataframe and RDD operations → Store result in S3.

Luigi Event Handler
• Events handled: SUCCESS, FAILURE, PROCESS_FAILURE, TIMEOUT, BROKEN_TASK, DEPENDENCY_MISSING.
• Check with the Luigi Scheduler and destroy the EMR cluster if there is no pending EmrSparkSubmitTask that can be run successfully.
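The "Poll for step status" box could look roughly like the loop below. This is a sketch under assumptions, not the talk's implementation: `poll_step` and its parameters are hypothetical names, the state strings follow the EMR step states (PENDING, RUNNING, CANCEL_PENDING, COMPLETED, and so on), and the state source is injected so that the example runs without AWS access. In practice `get_state` would presumably wrap a boto3 `describe_step` call.

```python
import time
from typing import Callable

# Step states in which the step has not finished yet.
IN_FLIGHT_STATES = {"PENDING", "RUNNING", "CANCEL_PENDING"}
SUCCESS_STATE = "COMPLETED"


def poll_step(
    get_state: Callable[[], str],
    interval_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the step leaves the in-flight states.

    Returns True if the step completed, False if it ended in any other state.
    Raises TimeoutError if it is still in flight after `max_polls` checks.
    """
    for _ in range(max_polls):
        state = get_state()
        if state not in IN_FLIGHT_STATES:
            return state == SUCCESS_STATE
        sleep(interval_seconds)
    raise TimeoutError("step still in flight after polling limit")


# Usage with a fake state source, so the sketch runs without a cluster.
fake_states = iter(["PENDING", "RUNNING", "RUNNING", "COMPLETED"])
succeeded = poll_step(lambda: next(fake_states), sleep=lambda _seconds: None)
```

Bounding the number of polls keeps a stuck step from holding the Luigi task, and therefore the cluster, open indefinitely.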
Shut down EMR cluster

Luigi Event Handler
• On SUCCESS or FAILURE, destroy the EMR cluster if there is no pending EmrSparkSubmitTask that can be run successfully.

Notes
• A task won't raise an event if one of its dependencies has failed.
• In case of a dependency failure, we want to destroy the cluster if the only tasks left depend on the failing task.
• Information about pending tasks and task dependencies is fetched from the Luigi Central Scheduler.
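The shutdown rule above can be sketched as a small dependency check. This is an illustration under assumptions: the function names and task names are hypothetical, and the pending set and dependency map are passed in directly, whereas the talk fetches them from the Luigi Central Scheduler.

```python
from typing import Dict, List, Set


def is_blocked(task: str, deps: Dict[str, List[str]], failed: Set[str]) -> bool:
    """True if `task` has failed or depends, directly or transitively, on a failed task."""
    stack, seen = [task], set()
    while stack:
        current = stack.pop()
        if current in failed:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(deps.get(current, []))
    return False


def should_destroy_cluster(
    pending: Set[str], deps: Dict[str, List[str]], failed: Set[str]
) -> bool:
    """Destroy the cluster once no pending task can still run successfully."""
    return all(is_blocked(task, deps, failed) for task in pending)


# Hypothetical task graph: each task maps to the tasks it requires.
DEPS = {
    "smarkets_report": ["report"],
    "report": ["stats"],
    "stats": ["transactions"],
    "unrelated_report": [],
}

# "stats" failed, so everything left depends on the failing task.
destroy = should_destroy_cluster({"report", "smarkets_report"}, DEPS, failed={"stats"})
```

The transitive walk matters because blocked tasks never raise their own event, so the handler has to work out from the failure alone that nothing runnable is left.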