Scalable Data Ingestion Architecture Using Airflow and Spark
Johannes Leppä, Data Engineer, Komodo Health (johannes.leppa@komodohealth.com)
Data Council, San Francisco, CA, April 17, 2019
Agenda
❖ Komodo Health
❖ Data Ingestion Challenges
❖ Data Ingestion System Architecture
❖ Lessons Learned and Future Developments
❖ Scaling Processes
❖ Conclusions
Our Mission
To reduce the global burden of disease through the most actionable healthcare map
Our Map Links Activities of the Entire Healthcare System
- Payers: 500+ payers
- Providers: 3.5M doctors / nurses
- Institutions: 450K hospitals / clinics
- Biopharma: $20B payments
- Clinical Trials: 100k+ clinical trials
- Scientific Publications: 20M publications
Patient-centric, AI-powered linkages
Agenda
❖ Komodo Health
❖ Data Ingestion Challenges
❖ Data Ingestion System Architecture
❖ Lessons Learned and Future Developments
❖ Scaling Processes
❖ Conclusions
Variation in data size and cadence
- Public and proprietary sources
- Size of data
  ○ From MBs to TBs
- Refresh cadences:
  ○ Daily
  ○ Weekly
  ○ Monthly
  ○ Quarterly
  ○ Bi-annual
  ○ One-off
    ■ Historical drop followed by incremental additions
Variation in access to raw data
- Public and proprietary sources
- Several interfaces for data extraction
  ○ SFTP (Source 1)
  ○ AWS S3 (Source 2)
  ○ API (Source 3)
  ○ Download (Source 4)
  ○ Hard drive (Source 5)
Pipeline stage: Landed (original format)
Variation in file formats
- Original file formats
  ○ CSV
  ○ XML
  ○ SAS
  ○ Fixed-width
  ○ Parquet
- Various compression formats
- Encrypted data
Pipeline stages: Landed (original format) → Raw (Parquet)
Covering several aspects of the healthcare system
- Several datasets covering a single aspect of healthcare
  ○ Different schemas
  ○ Different conventions
- Need to transform to a common schema
Pipeline stages: Landed (original format) → Raw (Parquet) → Transformed (Parquet)
Security and privacy
- Access control
- Data encryption
- Compliance
Pipeline stages: Landed (original format) → Raw (Parquet) → Transformed (Parquet)
Prior to the centralized data ingestion system
- Eternal question: what is the priority?
  ○ Scalability, maintainability, robustness, reliability
  ○ Rapid development ← the startup choice
    ■ Provide value to customers and show progress to investors
    ■ React to changing requirements
- Consequences:
  ○ Specialized pipelines
  ○ Manual operations
  ○ Variation in technologies and how to use them
  ○ Less reusable code
Why did we build a centralized ingestion system?
- Previous approach was hard to maintain
  ○ Overhead in onboarding engineers to processes
  ○ Accumulation of manual tasks
- Project to integrate a few new data sources
  ○ Daily increments
  ○ Similar data sources
  ○ Opportunity: build the system for these sources and migrate other sources later
- Pros of an in-house implementation
  ○ Flexibility
  ○ Integrates with our tech stack
    ■ Leverage previous experience
Agenda
❖ Komodo Health
❖ Data Ingestion Challenges
❖ Data Ingestion System Architecture
❖ Lessons Learned and Future Developments
❖ Scaling Processes
❖ Conclusions
Overview of the system infrastructure
- Airflow
  ○ Organize workflows
  ○ Automation
  ○ Alerting
- Spark
  ○ Distributed processing
- Kubernetes
  ○ Container management
- AWS
  ○ EC2 for servers
  ○ S3 for data storage
Airflow: Schedule workflows
Pros:
- DAGs written in Python (see the sketch below)
- Hooks to integrate with sources
- Operators for common tasks
- Alerts on success/failure
- Monitoring
- Parallelized DAGs and tasks
Cons:
- Had to customize hooks and operators
  ○ Handling credentials
  ○ Needing additional S3 metadata
Pipeline stages: Landed (original format) → Raw (Parquet) → Transformed (Parquet)
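As a rough illustration of the kind of DAG this setup runs, here is a minimal daily ingestion sketch. It is not Komodo's actual code: the DAG id, connection id, and bucket names are hypothetical, and the imports follow Airflow 1.10-era module paths.

```python
# Minimal sketch of a daily ingestion DAG (Airflow 1.10-era imports).
# The DAG id, connection id, and bucket names are hypothetical.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "email": ["data-eng@example.com"],
    "email_on_failure": True,  # alert on task failure
}


def land_files(**context):
    """Copy newly dropped files into the landed zone, keeping the original format."""
    s3 = S3Hook(aws_conn_id="source_2_s3")  # hypothetical Airflow connection
    for key in s3.list_keys(bucket_name="source-2-drops", prefix="daily/"):
        # Server-side S3 copy via the underlying boto3 client.
        s3.get_conn().copy_object(
            Bucket="landed-zone",
            Key="source_2/" + key,
            CopySource={"Bucket": "source-2-drops", "Key": key},
        )


dag = DAG(
    dag_id="ingest_source_2",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",  # refresh cadence varies per source
    catchup=False,
)

land = PythonOperator(
    task_id="land_files",
    python_callable=land_files,
    provide_context=True,
    dag=dag,
)
```

The `default_args` block is where the alerting pros come in: retries and failure emails apply to every task in the DAG.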
Spark: Distributed processing
Pros:
- Reliable
- Python and Scala APIs (see the sketch below)
Cons:
- Performance tuning can be tricky
Pipeline stages: Landed (original format) → Raw (Parquet) → Transformed (Parquet)
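To make the Landed → Raw step concrete, here is a hedged PySpark sketch converting one source's CSV drop to Parquet; the paths, read options, and app name are illustrative, not the actual pipeline.

```python
# Sketch of the Landed -> Raw step: read a CSV drop, persist as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("source_2_landed_to_raw").getOrCreate()

landed = (
    spark.read
    .option("header", "true")       # assume this source ships headered CSVs
    .option("inferSchema", "true")  # good enough for a first pass
    .csv("s3://landed-zone/source_2/2019-04-17/")
)

# The raw zone keeps the data as-is, just in a columnar, splittable format.
landed.write.mode("overwrite").parquet("s3://raw-zone/source_2/2019-04-17/")
```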
Kubernetes: Container management
(Diagram: Airflow scheduler, Airflow web UI, and Spark master pods running across nodes)
Pros:
- Environments isolated to namespaces
- Node selectors for resource allocation (see the sketch below)
  ○ Nodes labeled based on the Auto Scaling Groups their instances are tied to
- Self-healing of pods!
Cons:
- Occasional stability issues
  ○ Networking issues
- Difficult to troubleshoot
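The node-selector point can be sketched with the official kubernetes Python client; the label key/value, namespace, image, and resource numbers below are made-up placeholders.

```python
# Sketch: pinning a Spark worker pod to nodes from a labeled Auto Scaling
# Group via a node selector. All names and sizes are placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="spark-worker-0", namespace="ingestion"),
    spec=client.V1PodSpec(
        node_selector={"asg": "spark-workers"},  # nodes labeled per ASG
        containers=[
            client.V1Container(
                name="spark-worker",
                image="registry.example.com/spark-worker:2.4.0",
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ingestion", body=pod)
```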
So far so good
- Scheduled execution
- Parallelized tasks
- Scalable resources
- Alerting
- Monitoring
- Resilient infrastructure
- Isolated environments
Agenda
❖ Komodo Health
❖ Data Ingestion Challenges
❖ Data Ingestion System Architecture
❖ Lessons Learned and Future Developments
❖ Scaling Processes
❖ Conclusions
Infra limitation: Spark scaled manually
Big spikes in resource usage:
- Wasteful to keep scaled up
- Scaling down is tricky
- Currently run big workloads on a separate cluster
  ○ Manual operation :(
- Two Spark workers on the same node resulted in double-counting Spark resources
Automatic scaling under development
Future solution (see the sketch below):
- Run Spark directly on Kubernetes
  ○ Introduced in Spark 2.4.0 for client mode
- K8s autoscaler to scale nodes
(Diagram: Spark executor pods scheduled directly on Kubernetes nodes)
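A sketch of what that could look like from PySpark, assuming an in-cluster API server; the URL, image, namespace, and instance count are placeholders.

```python
# Sketch: a Spark session whose executors run as Kubernetes pods
# (client mode against a k8s:// master, supported since Spark 2.4.0).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")  # placeholder API server
    .appName("landed-to-raw")
    .config("spark.executor.instances", "4")
    .config("spark.kubernetes.container.image", "registry.example.com/spark:2.4.0")
    .config("spark.kubernetes.namespace", "ingestion")
    .getOrCreate()
)
```

Because executors are then ordinary pods with resource requests, the Kubernetes cluster autoscaler can add nodes for a spike and remove them afterwards, which addresses the manual scale-up/scale-down problem above.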
Infra limitation: Scheduler a single point of failure
Using local executor:
- Tasks executed as subprocesses of the scheduler
- Scale resources vertically
- Self-healing on failures? It depends...
Issues in self-healing:
- Inconsistency in Airflow database
- Dependency on lost local file
- Pod evicted due to disk pressure
(Diagram: a single Airflow scheduler pod handling file transfers and Spark drivers)
Why are you using the local executor?
It has served us well, so far:
- It was enough when we started
- Did not want to add complexity
Automatic scaling under development, again
Future solution (see the sketch below):
- Kubernetes executor
  ○ Introduced in Airflow 1.10.0
- K8s autoscaler to scale nodes
(Diagram: Airflow scheduler pod plus separate pods per task: Spark drivers and file transfers)
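A sketch of the switch itself, using the fact that Airflow reads AIRFLOW__&lt;SECTION&gt;__&lt;KEY&gt; environment variables as configuration; the namespace value is a placeholder.

```python
# Sketch: switching Airflow (>= 1.10.0) to the Kubernetes executor.
# Equivalent to setting executor = KubernetesExecutor under [core]
# in airflow.cfg; with it, each task instance runs in its own pod.
import os

os.environ["AIRFLOW__CORE__EXECUTOR"] = "KubernetesExecutor"
os.environ["AIRFLOW__KUBERNETES__NAMESPACE"] = "ingestion"  # placeholder
```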
Agenda
❖ Komodo Health
❖ Data Ingestion Challenges
❖ Data Ingestion System Architecture
❖ Lessons Learned and Future Developments
❖ Scaling Processes
❖ Conclusions
Beyond infra - Scaling the ingestion processes
- Our data ingestion priorities:
  ○ Speed of data delivery
  ○ Data quality
  ○ Security and privacy
- Bottleneck is engineering time spent on integrating new data sources
  ○ Tools to simplify processes
Early and fast iterations
Data profiling tool (see the sketch below):
- Recognize columns
  ○ Simplifies commonization
- Validate raw data
  ○ Communicate issues with source
  ○ Compliance risks
Pipeline: Landed (original format) → Raw (Parquet) → Data Profiling → Commonize → Transformed (Parquet)
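A minimal sketch of the profiling idea, not the actual tool: per-column null rates, approximate distinct counts, and a few sample values are often enough to recognize columns before commonization.

```python
# Hypothetical column-profiling pass over a raw-zone DataFrame.
from pyspark.sql import DataFrame, functions as F


def profile_raw(df: DataFrame) -> dict:
    """Return per-column stats that help map source columns to known ones."""
    total = df.count() or 1  # guard against empty inputs
    profile = {}
    for name in df.columns:
        profile[name] = {
            "null_rate": df.filter(F.col(name).isNull()).count() / total,
            "approx_distinct": df.agg(F.approx_count_distinct(name)).first()[0],
            "samples": [row[0] for row in df.select(name).dropna().limit(5).collect()],
        }
    return profile
```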
Avoid repeated work
Commonization tool (see the sketch below):
- Transforms similar data to a common schema
- Based on a configuration file
  ○ Very little code needed
Pipeline: Landed (original format) → Raw (Parquet) → Data Profiling → Commonize → Transformed (Parquet)
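A hedged sketch of configuration-driven commonization: a small mapping from common names to (source column, target type) pairs drives one generic transform, so each new source mostly needs config, not code. The mapping format and column names below are invented for illustration.

```python
# Hypothetical per-source mapping: common_name -> (source_column, target_type)
from pyspark.sql import DataFrame, functions as F

SOURCE_2_MAPPING = {
    "patient_id": ("PAT_NO", "string"),
    "service_date": ("SVC_DT", "date"),
    "provider_npi": ("PROV_NPI", "string"),
}


def commonize(df: DataFrame, mapping: dict) -> DataFrame:
    """Rename and cast source columns onto the common schema."""
    cols = [
        F.col(src).cast(dtype).alias(name)
        for name, (src, dtype) in mapping.items()
    ]
    return df.select(cols)
```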
Emphasis on data quality
Data validation tool (see the sketch below):
- Validate against data standard
  ○ Catch bugs in commonization
  ○ Improve data profiling
  ○ Communicate issues with source
Pipeline: Landed (original format) → Raw (Parquet) → Data Profiling → Commonize → Data Validation → Transformed (Parquet)
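And a matching sketch of rule-based validation against the data standard; the specific checks are illustrative (e.g. NPIs are 10 digits), each returning a count of violating rows.

```python
# Hypothetical validation checks run on commonized data.
from pyspark.sql import DataFrame, functions as F


def validate(df: DataFrame) -> dict:
    """Each entry counts rows violating one rule of the data standard."""
    return {
        "null_patient_id": df.filter(F.col("patient_id").isNull()).count(),
        "bad_npi_length": df.filter(F.length("provider_npi") != 10).count(),
        "future_service_date": df.filter(
            F.col("service_date") > F.current_date()
        ).count(),
    }
```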