Democratized data workflows at scale, by Emil Todorov and Mihail Petkov

Mar 08, 2024

SLIDE 1

Democratized data workflows at scale

Emil Todorov Mihail Petkov

SLIDE 2

Our agenda for today

Why Airflow?
Architecture
Security
Execution environment in Kubernetes

SLIDE 3

SLIDE 4

FT is a data driven organization

SLIDE 5

Time for a change

SLIDE 6

SLIDE 7

Why Airflow?

SLIDE 8

Dynamic
Extendable
Scalable
Elegant

SLIDE 9

Architecture

SLIDE 10

Architecture

PostgreSQL
Scheduler Pod
Web Server Pod
Worker Pod
Worker Pod
Worker Pod

SLIDE 11

User Business Tech

SLIDE 12

Airflow will be used by multiple teams

SLIDE 13

Airflow requirements

Team 1
Team 2
Team N

SLIDE 14

SLIDE 15

Teams will share Airflow resources

SLIDE 16

Airflow shared components

Team 1 DAGs, Team 2 DAGs, Team N DAGs
Team 1, Team 2, Team N
Team 1 Connections, Team 2 Connections, Team N Connections

SLIDE 17

Teams will share Kubernetes resources

SLIDE 18

Kubernetes shared components

Team 1 Worker Pod
Team 2 Worker Pod
Team N Worker Pod

SLIDE 19

SLIDE 20

How to evolve this architecture?

SLIDE 21

SLIDE 22

Airflow instance per team

SLIDE 23

One instance components

SLIDE 24

SLIDE 25

Instance per team problems

Adding a new team is hard
Maintaining an environment per team is difficult
Releasing new features is slow
Resources are not fully utilised
Total cost increases

SLIDE 26

SLIDE 27

Another way?

SLIDE 28

Multitenancy

SLIDE 29

Multiple independent instances in a shared environment

SLIDE 30

Multi-tenant components

SLIDE 31

How to make AWS multi-tenant?

SLIDE 32

IAM Security

Team 1 IAM user
Team 2 IAM user
Team N IAM user

SLIDE 33

IAM Security

Team 1 IAM user
Team 2 IAM user
Team N IAM user
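
The slides do not show what each team's IAM user is allowed to do. As one possible illustration, the sketch below builds an IAM policy document that confines a team to its own prefix in a shared S3 bucket. The bucket name, the prefix layout and the function name `team_s3_policy` are assumptions made for this example and are not taken from the talk.

```python
import json
from typing import Any, Dict


def team_s3_policy(bucket: str, team: str) -> Dict[str, Any]:
    """Build an IAM policy document limiting a team to its own S3 prefix.

    The bucket and the "<team>/" prefix layout are hypothetical.
    """
    prefix = f"{team}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket, restricted by key prefix.
                "Sid": "ListOwnPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Object access is granted only below the team's prefix.
                "Sid": "ReadWriteOwnObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


if __name__ == "__main__":
    # "example-airflow-data" is a placeholder bucket name.
    print(json.dumps(team_s3_policy("example-airflow-data", "team-1"), indent=2))
```

Attaching one such policy per team user would keep Team 1 from reading Team 2's objects even though both share the bucket.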

SLIDE 34

How to enhance Kubernetes?

SLIDE 35

System namespace: Airflow scheduler, Airflow web server
Team 1 namespace: Service Account, Resource Quota, Team 1 worker Pod, Team 1 worker Pod
Team 2 namespace: Service Account, Resource Quota, Team 2 worker Pod, Team 2 worker Pod
Team N namespace: Service Account, Resource Quota, Team 3 worker Pod, Team 3 worker Pod
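
The per-team layout on this slide (a namespace with its own Service Account and Resource Quota) can be sketched as plain Kubernetes manifests. The helper below is only an illustration: the function name, the `-worker` and `-quota` naming and the quota numbers are assumptions, not values from the talk.

```python
from typing import Any, Dict, List


def team_namespace_manifests(
    team: str, cpu: str = "8", memory: str = "16Gi", pods: str = "20"
) -> List[Dict[str, Any]]:
    """Return Namespace, ServiceAccount and ResourceQuota manifests for a team.

    The quota defaults are placeholders chosen for the example.
    """
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": team},
    }
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": f"{team}-worker", "namespace": team},
    }
    resource_quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{team}-quota", "namespace": team},
        # Caps the total requests and pod count of everything in the namespace.
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "pods": pods,
            }
        },
    }
    return [namespace, service_account, resource_quota]


if __name__ == "__main__":
    for manifest in team_namespace_manifests("team-1"):
        print(manifest["kind"], manifest["metadata"]["name"])
```

Serialising these dictionaries to YAML or JSON gives manifests that could be applied once per team, so onboarding a new team amounts to one more call with a different name.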

SLIDE 36

How to improve PostgreSQL?

SLIDE 37

CHANGES

SLIDE 38

How to extend Airflow?

SLIDE 39

Redesign Airflow source code

SLIDE 40

Redesign Airflow source code

Module per team

SLIDE 41

Redesign Airflow source code

Module per team
Connections per team

SLIDE 42

Redesign Airflow source code

Module per team
Connections per team
Extend hooks, operators and sensors

SLIDE 43

Redesign Airflow source code

Module per team
Connections per team
Extend hooks, operators and sensors
Use airflow_local_settings.py
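
Airflow looks for a `pod_mutation_hook(pod)` function in `airflow_local_settings.py` and calls it on each worker pod before the pod is submitted to Kubernetes. The sketch below shows how such a hook could route a pod to its team's namespace and service account. The `team` label, the `TEAM_SETTINGS` mapping and all names in it are assumptions for illustration; the slides do not show the speakers' actual settings file. Depending on the Airflow version and executor configuration, launching pods in several namespaces may need additional settings.

```python
from types import SimpleNamespace

# Hypothetical mapping from team name to Kubernetes settings.
TEAM_SETTINGS = {
    "team-1": {"namespace": "team-1", "service_account": "team-1-worker"},
    "team-2": {"namespace": "team-2", "service_account": "team-2-worker"},
}


def pod_mutation_hook(pod) -> None:
    """Route a worker pod to its team's namespace and service account.

    Airflow passes a kubernetes V1Pod; only attribute access is used here,
    so any object with the same shape works for a dry run.
    """
    labels = pod.metadata.labels or {}
    team = labels.get("team")  # assumed label carrying the team name
    if team not in TEAM_SETTINGS:
        raise ValueError(f"pod has no known team label: {team!r}")
    settings = TEAM_SETTINGS[team]
    pod.metadata.namespace = settings["namespace"]
    pod.spec.service_account_name = settings["service_account"]


if __name__ == "__main__":
    # Stand-in for a V1Pod so the sketch runs without the kubernetes package.
    fake_pod = SimpleNamespace(
        metadata=SimpleNamespace(labels={"team": "team-1"}, namespace="default"),
        spec=SimpleNamespace(service_account_name=None),
    )
    pod_mutation_hook(fake_pod)
    print(fake_pod.metadata.namespace, fake_pod.spec.service_account_name)
```

Rejecting pods without a recognised team keeps unlabelled work from landing in a shared namespace by accident.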

SLIDE 44

Redesign repository structure

Airflow system code repository
Team 1 DAG repository
Team 2 DAG repository
Team N DAG repository
Airflow repository

SLIDE 45

Execution environment in Kubernetes

SLIDE 46

Load Transform Extract

ETL

DATA SOURCE 1 DATA SOURCE 2 AGGREGATIONS DATA DESTINATION

SLIDE 47

Load Transform Extract

Extract

DATA SOURCE 1 DATA SOURCE 2 AGGREGATIONS DATA DESTINATION

SLIDE 48

Load Transform Extract

Load

DATA SOURCE 1 DATA SOURCE 2 AGGREGATIONS DATA DESTINATION

SLIDE 49

Transform?

SLIDE 50

Example workflow

Task 1 Task 2 Task 3 Task 4

SLIDE 51

Our goals

Language-agnostic jobs
Cross-task data access

SLIDE 52

KubernetesPodOperator

SLIDE 53

Our goals

Language-agnostic jobs
Cross-task data access

SLIDE 54

Unique storage pattern

Unique team name from the multitenancy
Unique DAG id
Unique task id per DAG
Unique execution date per DAG run

/{team}/{dag_id}/{task_id}/{execution_date}
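
A minimal sketch of how the pattern above could be rendered into a concrete path. The function name `storage_path`, the validation rule and the ISO 8601 rendering of the execution date are assumptions; the slide only gives the template.

```python
from datetime import datetime, timezone


def storage_path(team: str, dag_id: str, task_id: str, execution_date: datetime) -> str:
    """Render /{team}/{dag_id}/{task_id}/{execution_date} for one task run."""
    for name, value in (("team", team), ("dag_id", dag_id), ("task_id", task_id)):
        # A slash inside a segment would make two different runs collide.
        if not value or "/" in value:
            raise ValueError(f"invalid {name}: {value!r}")
    return f"/{team}/{dag_id}/{task_id}/{execution_date.isoformat()}"


if __name__ == "__main__":
    run = datetime(2020, 7, 1, tzinfo=timezone.utc)
    print(storage_path("team-1", "daily_sales", "extract", run))
    # prints /team-1/daily_sales/extract/2020-07-01T00:00:00+00:00
```

Because each of the four segments is unique at its level, two task runs can never write to the same location.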

SLIDE 55

The power of extensibility

SLIDE 56

ExecutionEnvironmentOperator

ExecutionEnvironmentOperator: PRE EXECUTE, KUBERNETES POD OPERATOR EXECUTE, POST EXECUTE
KubernetesPodOperator: KUBERNETES POD OPERATOR EXECUTE

SLIDE 57

Configurable cross task data dependencies

SLIDE 58

Example input configuration

SLIDE 59

Example output configuration

SLIDE 60

Pre-execute

Bootstrap the environment
Enrich the configuration
Export the configuration to the execution environment pod

KUBERNETES POD OPERATOR EXECUTE PRE EXECUTE

SLIDE 61

Post-execute

Handle the execution
Clear all bootstraps
Deal with the output

KUBERNETES POD OPERATOR EXECUTE POST EXECUTE
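
The pre-execute and post-execute steps can be sketched as hooks wrapped around the pod execution. Airflow operators do expose `pre_execute(context)` and `post_execute(context, result=None)`, but everything else below is a stand-in so the example runs without Airflow: `PodOperatorStub` replaces KubernetesPodOperator, `run` replaces the Airflow task runner, and the environment variable names and context keys are invented for illustration.

```python
from typing import Any, Dict, List, Optional


class PodOperatorStub:
    """Stand-in for KubernetesPodOperator; it only reports what it would run."""

    def __init__(self, image: str) -> None:
        self.image = image
        self.env: Dict[str, str] = {}

    def execute(self, context: Dict[str, Any]) -> Dict[str, Any]:
        # A real operator would launch a pod with this image and environment.
        return {"image": self.image, "env": dict(self.env)}


class ExecutionEnvironmentOperator(PodOperatorStub):
    """Sketch: wrap the pod execution with pre- and post-execute steps."""

    def __init__(self, image: str, inputs: Optional[List[str]] = None) -> None:
        super().__init__(image)
        self.inputs = inputs or []  # upstream task ids whose output is read
        self.steps: List[str] = []

    def pre_execute(self, context: Dict[str, Any]) -> None:
        # Bootstrap: resolve input and output locations, then export them
        # to the pod as environment variables (hypothetical names).
        self.steps.append("pre_execute")
        root = f"/{context['team']}/{context['dag_id']}"
        run = context["execution_date"]
        for upstream in self.inputs:
            self.env[f"INPUT_{upstream.upper()}"] = f"{root}/{upstream}/{run}"
        self.env["OUTPUT"] = f"{root}/{context['task_id']}/{run}"

    def post_execute(self, context: Dict[str, Any], result: Any = None) -> None:
        # Clear everything the bootstrap created.
        self.steps.append("post_execute")
        self.env.clear()

    def run(self, context: Dict[str, Any]) -> Dict[str, Any]:
        """Stand-in for the task runner, which calls the hooks in Airflow."""
        self.pre_execute(context)
        try:
            result = self.execute(context)
            self.steps.append("execute")
        finally:
            # Design choice of this sketch: clean up even if the pod fails.
            self.post_execute(context)
        return result


if __name__ == "__main__":
    op = ExecutionEnvironmentOperator("example/transform:latest", inputs=["task_1"])
    ctx = {"team": "team-1", "dag_id": "demo", "task_id": "task_2",
           "execution_date": "2020-07-01"}
    print(op.run(ctx))
```

Keeping the bootstrap and cleanup in the hooks means the container image itself stays unaware of Airflow and only reads its input and output locations from the environment.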

SLIDE 62

POC with AWS S3 as intermediate storage

Task 1 Task 2 Task 3 Task 4

SLIDE 63

Is this efficient?

Multiple downloads and uploads
Single processing power
Always loading the data in memory

SLIDE 64

How to evolve the execution environment?

Remove unnecessary data transfers
Parallelize the processing
Provide hot data access

SLIDE 65

Shared file system

SLIDE 66

Kubernetes persistent volume

Task 1 Task 2 Task 3 Task 4

SLIDE 67

Kubernetes persistent volume with EFS

Task 1 Task 2 Task 3 Task 4
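
For several task pods to share one file system, the volume has to be mountable by many pods at once, which in Kubernetes means the `ReadWriteMany` access mode. The sketch below builds such a PersistentVolumeClaim manifest. The storage class name `efs-sc`, the claim name and the requested size are assumptions; the slides do not show the speakers' manifests.

```python
from typing import Any, Dict


def shared_volume_claim(
    team: str, storage_class: str = "efs-sc", size: str = "5Gi"
) -> Dict[str, Any]:
    """Build a PersistentVolumeClaim that many task pods can mount together.

    "efs-sc" is a placeholder for whatever EFS-backed storage class exists
    in the cluster.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{team}-shared-data", "namespace": team},
        "spec": {
            # ReadWriteMany lets Task 1..4 pods on different nodes share files.
            "accessModes": ["ReadWriteMany"],
            "storageClassName": storage_class,
            # Kubernetes requires a size request on every claim.
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    print(shared_volume_claim("team-1"))
```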

SLIDE 68

So far so good

Remove unnecessary data transfers
Parallelize the processing
Provide hot data access

SLIDE 69

One worker?

SLIDE 70

SLIDE 71

Benefits from Spark

Runs perfectly in Kubernetes
Supports many distributed storages
Allows faster data processing
Supports multiple languages
Easy to use

SLIDE 72

SparkExecutionEnvironmentOperator

PRE EXECUTE: SETUP SPARK ENVIRONMENT
KUBERNETES POD OPERATOR EXECUTE: RUN SPARK BASED IMAGE
POST EXECUTE: CLEAR SPARK BASED RESOURCES
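
The talk does not say how the Spark job is started inside the pod. One common option is `spark-submit` against the Kubernetes API server, so the sketch below assembles such a command as an assumption, not as the speakers' implementation. The configuration keys are standard Spark-on-Kubernetes properties; the function name and all argument values are placeholders.

```python
from typing import List


def spark_submit_args(
    api_server: str,
    namespace: str,
    image: str,
    service_account: str,
    executors: int,
    application: str,
) -> List[str]:
    """Assemble a spark-submit command line for Spark on Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        # Run the driver and executors inside the team's namespace.
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        # The driver needs a service account allowed to create executor pods.
        "--conf",
        f"spark.kubernetes.authenticate.driver.serviceAccountName={service_account}",
        "--conf", f"spark.executor.instances={executors}",
        application,
    ]


if __name__ == "__main__":
    args = spark_submit_args(
        api_server="https://kubernetes.default.svc",
        namespace="team-1",
        image="example/spark-job:latest",  # placeholder image
        service_account="team-1-worker",
        executors=3,
        application="local:///opt/app/job.py",  # placeholder path in the image
    )
    print(" ".join(args))
```

Passing the team's namespace and service account here would keep the Spark driver and workers inside the same per-team quota as ordinary worker pods.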

SLIDE 73

Spark execution environment

Spark driver Spark workers

SLIDE 74

Our current state

Remove unnecessary data transfers
Parallelize the processing
Provide hot data access

SLIDE 75

Hot & cold data

HOT DATA COLD DATA Task 1 Task 2 Task 3 Task 4

SLIDE 76

Alluxio

HOT DATA COLD DATA Task 1 Task 2 Task 3 Task 4

SLIDE 77

Thank you!

#apacheairflow