Democratized data workflows at scale | Emil Todorov, Mihail Petkov | PowerPoint PPT Presentation



SLIDE 1

Democratized data workflows at scale

Emil Todorov Mihail Petkov

SLIDE 2

Our agenda for today

  • Why Airflow?
  • Architecture
  • Security
  • Execution environment in Kubernetes
SLIDE 3

SLIDE 4

FT is a data-driven organization

SLIDE 5

Time for a change

SLIDE 6

SLIDE 7

Why Airflow?

SLIDE 8

  • Dynamic
  • Extensible
  • Scalable
  • Elegant

SLIDE 9

Architecture

SLIDE 10

[Diagram: PostgreSQL database, Scheduler Pod, Web Server Pod, and multiple Worker Pods]

Architecture

SLIDE 11

[Diagram: User, Business, Tech]

SLIDE 12

Airflow will be used by multiple teams

SLIDE 13

[Diagram: Team 1, Team 2, ..., Team N]

Airflow requirements

SLIDE 14

SLIDE 15

Teams will share Airflow resources

SLIDE 16

[Diagram: Team 1, Team 2, ..., Team N, each with their own DAGs and Connections]

Airflow shared components

SLIDE 17

Teams will share Kubernetes resources

SLIDE 18

[Diagram: Team 1, Team 2, and Team N Worker Pods]

Kubernetes shared components

SLIDE 19

SLIDE 20

How to evolve this architecture?

SLIDE 21

SLIDE 22

Airflow instance per team

SLIDE 23

One instance components

SLIDE 24

SLIDE 25

Instance-per-team problems

  • Adding a new team is hard
  • Maintaining an environment per team is difficult
  • Releasing new features is slow
  • Resources are not fully utilised
  • Total cost increases
SLIDE 26

SLIDE 27

Another way?

SLIDE 28

Multitenancy

SLIDE 29

Multiple independent instances in a shared environment

SLIDE 30

Multi-tenant components

SLIDE 31

How to make AWS multi-tenant?

SLIDE 32

IAM Security

[Diagram: Team 1, Team 2, ..., Team N IAM users]

SLIDE 33


SLIDE 34

How to enhance Kubernetes?

SLIDE 35

[Diagram: a System namespace runs the Airflow scheduler and Airflow web server; each team namespace (Team 1, Team 2, ..., Team N) has its own Service Account, Resource Quota, and team worker Pods]
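The namespace-per-team layout above can be captured as Kubernetes manifests. A minimal sketch that builds them as plain dicts, where the quota values and naming scheme are illustrative assumptions rather than the actual setup from the slides:

```python
def team_manifests(team, cpu="4", memory="8Gi", pods="10"):
    """Build Namespace, ServiceAccount and ResourceQuota manifests
    for one team (resource limits here are illustrative defaults)."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": team},
    }
    service_account = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {"name": f"{team}-worker", "namespace": team},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{team}-quota", "namespace": team},
        # spec.hard caps the total resources the team's worker pods
        # can request inside its namespace.
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "pods": pods,
            }
        },
    }
    return namespace, service_account, quota
```

Serialized to YAML, these would be applied once per team; onboarding a new team then amounts to generating one more set of manifests.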

SLIDE 36

How to improve PostgreSQL?

SLIDE 37

CHANGES

SLIDE 38

How to extend Airflow?

SLIDE 39

Redesign Airflow source code

  • Module per team
  • Connections per team
  • Extend hooks, operators and sensors
  • Use airflow_local_settings.py
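One way to tie the module-per-team idea to the Kubernetes setup is a cluster policy in `airflow_local_settings.py`, which Airflow applies to every task. A minimal sketch, assuming DAG ids are prefixed with the team name; both the prefix convention and the service-account naming are assumptions for illustration, not the actual FT implementation:

```python
# airflow_local_settings.py -- Airflow discovers this file on its
# PYTHONPATH and calls task_policy() for every task it loads.

def task_policy(task):
    # Assumption: DAG ids follow "<team>__<name>", e.g. "team1__daily_sales".
    team = task.dag_id.split("__", 1)[0]

    # Route the task's worker pod into the team's namespace and
    # service account (keys mirror the KubernetesExecutor config).
    task.executor_config = {
        "KubernetesExecutor": {
            "namespace": team,
            "service_account_name": f"{team}-worker",
        }
    }
```

Because the policy runs centrally, no team has to remember to set its own namespace, and a misconfigured DAG cannot escape into another team's resources.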
SLIDE 44

Redesign repository structure

[Diagram: an Airflow system code repository plus per-team DAG repositories (Team 1, Team 2, ..., Team N) combined into the Airflow repository]

SLIDE 45

Execution environment in Kubernetes

SLIDE 46

ETL

[Diagram: Extract, Transform, Load pipeline from DATA SOURCE 1 and DATA SOURCE 2 through AGGREGATIONS to the DATA DESTINATION]

SLIDE 47

Extract

[Diagram: the Extract step of the pipeline from DATA SOURCE 1 and DATA SOURCE 2 through AGGREGATIONS to the DATA DESTINATION]

SLIDE 48

Load

[Diagram: the Load step of the pipeline from DATA SOURCE 1 and DATA SOURCE 2 through AGGREGATIONS to the DATA DESTINATION]

SLIDE 49

Transform?

SLIDE 50

Example workflow

[Diagram: workflow with Task 1, Task 2, Task 3, Task 4]

SLIDE 51

Our goals

  • Language-agnostic jobs
  • Cross-task data access

SLIDE 52

KubernetesPodOperator

SLIDE 53

Our goals

  • Language-agnostic jobs
  • Cross-task data access

SLIDE 54

Unique storage pattern

  • Unique team name from the multitenancy
  • Unique DAG id
  • Unique task id per DAG
  • Unique execution date per DAG run

/{team}/{dag_id}/{task_id}/{execution_date}
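The pattern above can be turned into a small helper. A sketch assuming the execution date arrives as an ISO-8601 string; the sanitization of `:` and `+` is an assumption to keep the key friendly for object stores, not something stated on the slides:

```python
import re


def storage_path(team, dag_id, task_id, execution_date):
    """Build the unique intermediate-storage key for one task run,
    following the pattern /{team}/{dag_id}/{task_id}/{execution_date}."""
    # Assumption: execution_date is an ISO-8601 string; replace the
    # characters that are awkward in object-store keys.
    safe_date = re.sub(r"[:+]", "-", execution_date)
    return f"/{team}/{dag_id}/{task_id}/{safe_date}"
```

Because every component in the key is unique at its level (team, DAG, task, run), two tasks can never collide, and any task can deterministically locate the output of an upstream task in the same DAG run.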

SLIDE 55

The power of extensibility

SLIDE 56

ExecutionEnvironmentOperator

[Diagram: the ExecutionEnvironmentOperator wraps the KubernetesPodOperator execute between a pre-execute and a post-execute step]

SLIDE 57

Configurable cross task data dependencies

SLIDE 58

Example input configuration

SLIDE 59

Example output configuration

SLIDE 60

Pre-execute

  • Bootstrap the environment
  • Enrich the configuration
  • Export the configuration to the execution environment pod

[Diagram: pre-execute step running before the KubernetesPodOperator execute]

SLIDE 61

Post-execute

  • Handle the execution
  • Clear all bootstraps
  • Deal with the output

[Diagram: post-execute step running after the KubernetesPodOperator execute]
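The pre-execute and post-execute steps can be sketched as plain functions around the pod run. The function names, config keys, and use of a scratch directory are illustrative assumptions, not the actual FT implementation:

```python
import json
import os
import shutil
import tempfile


def pre_execute(team, dag_id, task_id, execution_date, config):
    """Bootstrap a scratch directory, enrich the task configuration
    with the unique storage path, and export it for the pod."""
    workdir = tempfile.mkdtemp(prefix=f"{team}-{task_id}-")
    enriched = dict(config)
    enriched["storage_path"] = f"/{team}/{dag_id}/{task_id}/{execution_date}"
    # The exported file would be mounted into (or otherwise passed to)
    # the execution-environment pod.
    with open(os.path.join(workdir, "config.json"), "w") as fh:
        json.dump(enriched, fh)
    return workdir, enriched


def post_execute(workdir, enriched):
    """Clear the bootstrapped resources and hand back the location
    where the task wrote its output."""
    shutil.rmtree(workdir, ignore_errors=True)
    return enriched["storage_path"]
```

In the real operator these hooks would live on the ExecutionEnvironmentOperator itself, so every team's container image gets the same bootstrap and cleanup for free.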

SLIDE 62

POC with AWS S3 as intermediate storage

[Diagram: Task 1, Task 2, Task 3, Task 4 exchanging data through AWS S3]

SLIDE 63

Is this efficient?

  • Multiple downloads and uploads
  • Single processing power
  • Always loading the data in memory

SLIDE 64

How to evolve the execution environment?

  • Remove unnecessary data transfers
  • Parallelize the processing
  • Provide hot data access

SLIDE 65

Shared file system

SLIDE 66

Kubernetes persistent volume

[Diagram: Task 1, Task 2, Task 3, Task 4 sharing a Kubernetes persistent volume]

SLIDE 67

Kubernetes persistent volume with EFS

[Diagram: Task 1, Task 2, Task 3, Task 4 sharing an EFS-backed persistent volume]

SLIDE 68

So far so good

  • Remove unnecessary data transfers
  • Parallelize the processing
  • Provide hot data access

SLIDE 69

One worker?

SLIDE 70

SLIDE 71

Benefits of Spark

  • Runs perfectly in Kubernetes
  • Supports many distributed storages
  • Allows faster data processing
  • Supports multiple languages
  • Easy to use
SLIDE 72

SparkExecutionEnvironmentOperator

[Diagram: pre-execute sets up the Spark environment, the KubernetesPodOperator execute runs a Spark-based image, and post-execute clears the Spark-based resources]

SLIDE 73

Spark execution environment

[Diagram: a Spark driver coordinating Spark workers]

SLIDE 74

Our current state

  • Remove unnecessary data transfers
  • Parallelize the processing
  • Provide hot data access

SLIDE 75

Hot & cold data

[Diagram: Task 1, Task 2, Task 3, Task 4 with HOT DATA and COLD DATA tiers]

SLIDE 76

Alluxio

[Diagram: Task 1, Task 2, Task 3, Task 4 with Alluxio managing the HOT DATA and COLD DATA tiers]

SLIDE 77

Thank you!

#apacheairflow