ETL and Event Sourcing Integration Architecture: Best Practice and - - PowerPoint PPT Presentation


ETL and Event Sourcing Integration Architecture: Best Practice and Case Study Marc Siegel - Panorama Education - Wed Feb 6 2019 ETL pipelines from external systems ETL and Event Sourcing Prerequisite knowledge Familiarity with traditional ETL


slide-1
SLIDE 1

ETL and Event Sourcing

Integration Architecture: Best Practice and Case Study

Marc Siegel - Panorama Education - Wed Feb 6 2019

slide-2
SLIDE 2

ETL pipelines from external systems

slide-3
SLIDE 3

ETL and Event Sourcing

Prerequisite knowledge

Familiarity with traditional ETL architectures:

Software systems that Extract data from external systems, Transform them, and Load the resulting data sets into internal systems, most often relational databases

Dissatisfaction with traditional ETL architectures / curiosity to learn about and consider an alternative architecture

slide-4
SLIDE 4

ETL and Event Sourcing

What you’ll learn

  • How Event Sourcing can be applied to ETL
  • How Determinism can be a property of a system
  • Value of treating the Past as First Class

slide-5
SLIDE 5

What is ETL?

slide-6
SLIDE 6

ETL

In a nutshell

slide-7
SLIDE 7

ETL

In a nutshell

External System

slide-8
SLIDE 8

ETL

Traditional ETL Process Extract

In a nutshell

External System

slide-9
SLIDE 9

ETL

Traditional ETL Process Extract Transform

In a nutshell

External System

slide-10
SLIDE 10

ETL

Traditional ETL Process Extract Transform Load

In a nutshell

External System

slide-11
SLIDE 11

ETL

Traditional ETL Process Extract Transform Load Internal Database

In a nutshell

External System

slide-12
SLIDE 12

ETL

Traditional ETL Process Extract Transform Load Internal Database

In a nutshell

External System

Q: What is the System of Record? What is the Source of Truth?

slide-13
SLIDE 13

ETL

In a nutshell

External System

System of Record

The authoritative data source for a given data element or piece of information (1)

slide-14
SLIDE 14

ETL

Internal Database

In a nutshell

Source of Truth

A trusted data source that gives a complete picture of the data object as a whole (2)

slide-15
SLIDE 15

ETL

Traditional ETL Process Extract Transform Load Internal Database

In a nutshell

External System

slide-16
SLIDE 16
slide-17
SLIDE 17

ETL Challenges

Operational / Domain Modelling / Selective Attention

slide-18
SLIDE 18

ETL Challenges

Operational / Domain Modelling / Selective Attention

Must rerun long ETL job to test edge case

Missing Interests:

  • Decoupling
slide-19
SLIDE 19

ETL Challenges

Operational / Domain Modelling / Selective Attention

Must rerun long ETL job to test edge case

Running ETL job can overwrite history

Missing Interests:

  • Decoupling
  • Determinism
slide-20
SLIDE 20

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL Challenges

slide-21
SLIDE 21

ETL Challenges

Operational / Domain Modelling / Selective Attention

Must create one true schema to load into

Missing Interests:

  • Decoupling (of each interpretation)
slide-22
SLIDE 22

ETL Challenges

Operational / Domain Modelling / Selective Attention

Must create one true schema to load into

Tend toward lowest common denominator OR superset of all external model features

Missing Interests:

  • Decoupling (of each interpretation)
  • Modeling State Explicitly
slide-23
SLIDE 23

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL Challenges

slide-24
SLIDE 24

ETL Challenges

Operational / Domain Modelling / Selective Attention

From Psychology: the act of focusing on a particular object while ignoring irrelevant information

→ Can’t re-interpret past extracts

Missing Interests:

  • Past as First Class
slide-25
SLIDE 25

ETL Problems

Awareness Tests (YouTube):

  • Basketball
  • Monkey Business

How many passes did the team in white make?

slide-26
SLIDE 26

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL Challenges

slide-27
SLIDE 27

ETL Advantage

Not just problems. Positive trade-offs of ETL?

  • Low Costs: Training, framing, explaining

      ○ Training: Low cost to train new engineers in ETL concepts
      ○ Framing: No requirement for explicit domain modeling
      ○ Explaining: Intuitive to explain to non-engineers

slide-28
SLIDE 28

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL Challenges

slide-29
SLIDE 29
slide-30
SLIDE 30

What is ELT?

slide-31
SLIDE 31

ETL

Traditional ETL Process Extract Transform Load Internal Database

In a nutshell

External System

slide-32
SLIDE 32

ETL and ELT

Traditional ETL Process Extract Transform Load Internal Database External System

slide-33
SLIDE 33

ETL and ELT

EL Process

Extract

Traditional ETL Process Extract Transform Load Internal Database

Load

External System

slide-34
SLIDE 34

ETL and ELT

EL Process

Extract

Data Lake or Blob or File Store

Traditional ETL Process Extract Transform Load Internal Database

Load

External System

slide-35
SLIDE 35

ETL and ELT

EL Process

Extract

Data Lake or Blob or File Store

T Process

Do anything here! Many vendors offering various solutions.

Traditional ETL Process Extract Transform Load Internal Database

Load

External System

slide-36
SLIDE 36

ETL and ELT

EL Process

Extract

Data Lake or Blob or File Store

T Process(es)

Do anything here! Many vendors offering various solutions.

Traditional ETL Process Extract Transform Load Internal Database

Load

External System

slide-37
SLIDE 37

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL and ELT

slide-38
SLIDE 38

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL and ELT

slide-39
SLIDE 39

ETL and ELT

EL Process

Extract

Data Lake or Blob or File Store

T Process(es)

Do anything here! Many vendors offering various solutions.

Traditional ETL Process Extract Transform Load Internal Database

Load

External System

slide-40
SLIDE 40

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL and ELT

slide-41
SLIDE 41

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL and ELT

slide-42
SLIDE 42

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL and ELT

slide-43
SLIDE 43

What is Event Sourcing?

slide-44
SLIDE 44

ETL

Traditional ETL Process Extract Transform Load Internal Database

In a nutshell

External System

slide-45
SLIDE 45

ETL and ELT

EL Process

Extract

Data Lake or Blob or File Store

T Process(es)

Do anything here! Many vendors offering various solutions.

Traditional ETL Process Extract Transform Load Internal Database

Load

External System

slide-46
SLIDE 46

ETL and Event Sourcing

EL Process

Ex

Traditional ETL Process Extract Transform Load Internal Database

Lo

External System

slide-47
SLIDE 47

ETL and Event Sourcing

EL Process

Ex

Traditional ETL Process Extract Transform Load Internal Database

Lo

External System Immutable & Sequential Store

slide-48
SLIDE 48

ETL and Event Sourcing

EL Process

Ex

Traditional ETL Process Extract Transform Load Internal Database

Lo

External System Immutable & Sequential Store TeTL Process

slide-49
SLIDE 49

ETL and Event Sourcing

EL Process

Ex

Traditional ETL Process Extract Transform Load Internal Database

Lo

External System Immutable & Sequential Store TeTL Process

Domain Events Tr

slide-50
SLIDE 50

ETL and Event Sourcing

EL Process

Ex

Traditional ETL Process Extract Transform Load Internal Database

Lo

External System Immutable & Sequential Store TeTL Process

Domain Events Tr Tr Lo

slide-51
SLIDE 51

ETL and Event Sourcing

EL Process

Ex

Traditional ETL Process Extract Transform Load Internal Database

Lo

External System Immutable & Sequential Store

Read Model

TeTL Process

Domain Events Tr Tr Lo

slide-52
SLIDE 52

ETL and Event Sourcing

EL Process

Ex

Traditional ETL Process Extract Transform Load Internal Database

Lo

External System Immutable & Sequential Store

Read Model(s)

TeTL Process(es)

Domain Events Tr Tr Lo

slide-53
SLIDE 53

ETL and Event Sourcing

EL Process

Ex

Traditional ETL Process Extract Transform Load Internal Database

Lo

External System Immutable & Sequential Store

Read Model(s)

TeTL Process(es)

Domain Events Tr Tr Lo

1) Decouple extractions
2) Source of Truth: the extracts
3) Deterministic transform: to events + to model

Regular expression mnemonic: from /(ETL)/ to /E{1}T*L*/ ← Extract once, Transform & Load Infinitely
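A minimal sketch of the "extract once, transform and load many times" idea. The `ExtractStore` class, the toy comma-separated extract format, and the two interpretation functions are all hypothetical and not taken from the talk; an in-memory array stands in for the Immutable & Sequential Store.

```ruby
# Hypothetical in-memory stand-in for the "Immutable & Sequential Store".
class ExtractStore
  include Enumerable

  def initialize
    @extracts = []
  end

  # E{1}: record each raw extract exactly once, frozen, in arrival order.
  def record(raw)
    @extracts << raw.dup.freeze
    self
  end

  def each(&block)
    @extracts.each(&block)
  end
end

# Parse one raw extract into row hashes (toy parser, no quoting support).
def rows(raw)
  header, *lines = raw.lines.map(&:chomp)
  keys = header.split(",")
  lines.map { |line| keys.zip(line.split(",")).to_h }
end

# T*L*: interpretations can be added or re-run at any time over the same extracts.
def latest_grades(store)
  store.flat_map { |raw| rows(raw) }.each_with_object({}) do |row, model|
    model[[row["student_id"], row["course_id"]]] = row["grade"]
  end
end

def grade_history(store)
  store.flat_map { |raw| rows(raw) }.each_with_object({}) do |row, model|
    (model[[row["student_id"], row["course_id"]]] ||= []) << row["grade"]
  end
end

store = ExtractStore.new
store.record("student_id,course_id,grade\n123,abc,B+\n")
store.record("student_id,course_id,grade\n123,abc,A-\n")
```

Both interpretations read the same stored extracts, so a new one can be added later without going back to the external system.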

slide-54
SLIDE 54

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL, ELT, and Event Sourcing

slide-55
SLIDE 55

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL, ELT, and Event Sourcing

slide-56
SLIDE 56

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL, ELT, and Event Sourcing

slide-57
SLIDE 57

ETL and Event Sourcing

EL Process

Ex

Traditional ETL Process Extract Transform Load Internal Database

Lo

External System Immutable & Sequential Store

Read Model(s)

TeTL Process(es)

Domain Events Tr Tr Lo

1) Decouple extractions
2) Source of Truth: the extracts
3) Deterministic transform: to events + to model

Regular expression mnemonic: from /(ETL)/ to /E{1}T*L*/ ← Extract once, Transform & Load Infinitely

slide-58
SLIDE 58

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL, ELT, and Event Sourcing

slide-59
SLIDE 59

Event Sourcing Challenge

Not just advantages. Negative trade-offs of ES?

  • High Costs: Training, framing, explaining

      ○ Training: Higher cost to train new engineers in ES concepts
      ○ Framing: Requirement for (lots of) explicit domain modeling
      ○ Explaining: Not necessarily intuitive to explain to non-engineers

slide-60
SLIDE 60

Interests and Positions

Positions: ETL, ELT, Event Sourcing

Interests: Decoupling, Determinism, Modeling State Explicitly, Past as First Class, Low Cost

ETL, ELT, and Event Sourcing

slide-61
SLIDE 61
slide-62
SLIDE 62

How does Event Sourcing work?

slide-63
SLIDE 63

Event Sourcing Basics

GradeCreated

student_id: 123 course_id: abc grade: B+

GradeUpdated

student_id: 123 course_id: abc grade: C

GradeUpdated

student_id: 123 course_id: abc grade: A-

Events

slide-64
SLIDE 64

Event Sourcing Basics

Events

State transitions are an important part of our problem space and should be modeled within our domain.

slide-65
SLIDE 65

Event Sourcing Basics

Events

State transitions are an important part of our problem space and should be modeled within our domain. Event Sourcing says all state is transient and you only store facts.

slide-66
SLIDE 66

Event Sourcing Basics

Events

State transitions are an important part of our problem space and should be modeled within our domain. Event Sourcing says all state is transient and you only store facts. Event: something that happened in the past; a fact; a state transition.
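One way to picture "you only store facts" is an append-only log of frozen event values. This is a sketch only: `GradeEvent` and `EventLog` are hypothetical names, and the fields are borrowed from the grade example in these slides.

```ruby
# Hypothetical event type; field names follow the grade example in the slides.
GradeEvent = Struct.new(:type, :student_id, :course_id, :grade, keyword_init: true)

# An append-only log: events are frozen when recorded and never edited afterwards.
class EventLog
  def initialize
    @events = []
  end

  def append(event)
    @events << event.freeze
    self
  end

  # Hand out a frozen copy so callers cannot reorder or remove history.
  def to_a
    @events.dup.freeze
  end
end

log = EventLog.new
log.append(GradeEvent.new(type: :GradeCreated, student_id: 123, course_id: "abc", grade: "B+"))
log.append(GradeEvent.new(type: :GradeUpdated, student_id: 123, course_id: "abc", grade: "C"))
log.append(GradeEvent.new(type: :GradeUpdated, student_id: 123, course_id: "abc", grade: "A-"))
```

A correction is recorded as a new event appended to the log; nothing already recorded is changed.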

slide-67
SLIDE 67

Event Sourcing Basics

GradeCreated

student_id: 123 course_id: abc grade: B+

GradeUpdated

student_id: 123 course_id: abc grade: C

GradeUpdated

student_id: 123 course_id: abc grade: A-

Events

slide-68
SLIDE 68

Event Sourcing Basics

GradeCreated

student_id: 123 course_id: abc grade: B+

GradeUpdated

student_id: 123 course_id: abc grade: C

GradeUpdated

student_id: 123 course_id: abc grade: A-

Events Read Models

student_id course_id grade

123 abc B+

slide-69
SLIDE 69

Event Sourcing Basics

GradeCreated

student_id: 123 course_id: abc grade: B+

GradeUpdated

student_id: 123 course_id: abc grade: C

GradeUpdated

student_id: 123 course_id: abc grade: A-

Events Read Models

student_id course_id grade

123 abc C

slide-70
SLIDE 70

Event Sourcing Basics

GradeCreated

student_id: 123 course_id: abc grade: B+

GradeUpdated

student_id: 123 course_id: abc grade: C

GradeUpdated

student_id: 123 course_id: abc grade: A-

Events Read Models

student_id course_id grade

123 abc A-

slide-71
SLIDE 71

Event Sourcing Basics

Read Models

Event Sourcing takes the term Read Model from CQRS.

slide-72
SLIDE 72

Event Sourcing Basics

Read Models

Event Sourcing takes the term Read Model from CQRS. A Read Model is an interpretation of a sequence of events that is optimized for answering a given set of queries (reads).
slide-73
SLIDE 73

Event Sourcing Basics

Read Models

Event Sourcing takes the term Read Model from CQRS. A Read Model is an interpretation of a sequence of events that is optimized for answering a given set of queries (reads).

Read Models: independent representations of state that we deterministically regenerate from events using projections.
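To illustrate "independent representations", here is a sketch in which two read models are built from the same event list, each shaped for a different query. The `Event` struct and both function names are assumptions made for illustration; plain Hashes stand in for read-model tables.

```ruby
Event = Struct.new(:type, :student_id, :course_id, :grade, keyword_init: true)

EVENTS = [
  Event.new(type: :GradeCreated, student_id: 123, course_id: "abc", grade: "B+"),
  Event.new(type: :GradeUpdated, student_id: 123, course_id: "abc", grade: "C"),
  Event.new(type: :GradeUpdated, student_id: 123, course_id: "abc", grade: "A-")
].freeze

# Read model 1: answers "what is the current grade?"
def current_grades(events)
  events.each_with_object({}) do |e, model|
    model[[e.student_id, e.course_id]] = e.grade
  end
end

# Read model 2: answers "how many times was a grade revised?"
def revision_counts(events)
  events.each_with_object(Hash.new(0)) do |e, model|
    model[[e.student_id, e.course_id]] += 1 if e.type == :GradeUpdated
  end
end
```

Either read model can be discarded and rebuilt from `EVENTS` without affecting the other.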

slide-74
SLIDE 74

Event Sourcing Basics

GradeCreated

student_id: 123 course_id: abc grade: B+

GradeUpdated

student_id: 123 course_id: abc grade: C

GradeUpdated

student_id: 123 course_id: abc grade: A-

Events Projections

    def f(state, event)
      state.where(
        student_id: event.student_id,
        course_id: event.course_id
      ).update(grade: event.grade)
    end

student_id course_id grade

123 abc A-

slide-75
SLIDE 75

Event Sourcing Basics

Projections

When we talk about Event Sourcing, current state is a left-fold of previous behaviors.

slide-76
SLIDE 76

Event Sourcing Basics

Projections

When we talk about Event Sourcing, current state is a left-fold of previous behaviors. We play back a stream of events, applying a function f(state[n], event[n]) -> state[n+1]

slide-77
SLIDE 77

Event Sourcing Basics

Projections

When we talk about Event Sourcing, current state is a left-fold of previous behaviors. We play back a stream of events, applying a function f(state[n], event[n]) -> state[n+1]

Projection: a function through which we apply events in sequence to deterministically derive the state of our application.
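The left-fold can be written directly with `reduce`. This sketch assumes an in-memory Hash as the state instead of the database relation used in the slide's `f`; `Event`, `apply_event`, and `project` are illustrative names.

```ruby
Event = Struct.new(:type, :student_id, :course_id, :grade, keyword_init: true)

# f(state, event) -> next state. State is a Hash keyed by [student_id, course_id].
# It returns a new Hash instead of mutating, so replays never disturb earlier states.
def apply_event(state, event)
  state.merge([event.student_id, event.course_id] => event.grade)
end

# The projection: a left fold of apply_event over the event stream.
def project(events)
  events.reduce({}) { |state, event| apply_event(state, event) }
end

events = [
  Event.new(type: :GradeCreated, student_id: 123, course_id: "abc", grade: "B+"),
  Event.new(type: :GradeUpdated, student_id: 123, course_id: "abc", grade: "C"),
  Event.new(type: :GradeUpdated, student_id: 123, course_id: "abc", grade: "A-")
]

read_model = project(events)
```

Folding only the first two events yields the intermediate state with grade C, matching the step-by-step read model shown in the earlier slides.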

slide-78
SLIDE 78

Event Sourcing Basics

GradeCreated

student_id: 123 course_id: abc grade: B+

GradeUpdated

student_id: 123 course_id: abc grade: C

GradeUpdated

student_id: 123 course_id: abc grade: A-

Events Projections

    def f(state, event)
      state.where(
        student_id: event.student_id,
        course_id: event.course_id
      ).update(grade: event.grade)
    end

student_id course_id grade

123 abc A-

Read Models

slide-79
SLIDE 79

Event Sourcing Basics

Review

  • Event: something that happened in the past; a fact; a state transition.
  • Projection: a function through which we apply events in sequence to deterministically derive the state of our application.
  • Read Models: independent representations of state that we deterministically regenerate from events using projections.

slide-80
SLIDE 80

Event Sourcing Basics

GradeCreated

student_id: 123 course_id: abc grade: B+

GradeUpdated

student_id: 123 course_id: abc grade: C

GradeUpdated

student_id: 123 course_id: abc grade: A-

Events Projections

    def f(state, event)
      state.where(
        student_id: event.student_id,
        course_id: event.course_id
      ).update(grade: event.grade)
    end

student_id course_id grade

123 abc A-

Read Models

slide-81
SLIDE 81

Applying Event Sourcing to ETL

slide-82
SLIDE 82

Applying Event Sourcing to ETL

Q: How do we get from ETL to explicitly modeled Domain Events?

slide-83
SLIDE 83

Applying Event Sourcing to ETL

Q: How do we get from ETL to explicitly modeled Domain Events?

Immutable & Sequential Store

Read Model(s)

TeTL Process(es)

Domain Events Tr Tr Lo

slide-84
SLIDE 84

Applying Event Sourcing to ETL

Q: How do we get from ETL to explicitly modeled Domain Events?

A: Build an Observational Event Sourced system

Immutable & Sequential Store

Read Model(s)

TeTL Process(es)

Domain Events Tr Tr Lo

slide-85
SLIDE 85

Observations

student_id course_id grade

123 abc A-

Applying Event Sourcing to ETL

Domain Events

GradeUpdated

student_id: 123 course_id: abc grade: A-

Read Models

slide-86
SLIDE 86

Applying Event Sourcing to ETL

Observational

When capturing observations of external systems using Event Sourcing, the events in our domain are the observations we capture.

slide-87
SLIDE 87

Applying Event Sourcing to ETL

Observational

When capturing observations of external systems using Event Sourcing, the events in our domain are the observations we capture. Transforming a sequence of observations into explicitly modeled domain events is the first projection.

slide-88
SLIDE 88

Applying Event Sourcing to ETL

Observational

When capturing observations of external systems using Event Sourcing, the events in our domain are the observations we capture. Transforming a sequence of observations into explicitly modeled domain events is the first projection.

Observational: an Event Sourced system where the event history is of captured observations, and all state is derived from them.
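A sketch of what a "first projection" could look like. The slides do not show how the talk's system derives domain events, so the approach here (comparing each observation with the last value seen for the same student and course) is an assumption, as are the `Observation` and `DomainEvent` names.

```ruby
Observation = Struct.new(:student_id, :course_id, :grade, keyword_init: true)
DomainEvent = Struct.new(:type, :student_id, :course_id, :grade, keyword_init: true)

# First projection: walk the observations in order, remember the last grade seen
# per (student, course), and emit a domain event only when something changed.
def to_domain_events(observations)
  last_seen = {}
  observations.each_with_object([]) do |obs, events|
    key = [obs.student_id, obs.course_id]
    next if last_seen.key?(key) && last_seen[key] == obs.grade # unchanged, no event

    type = last_seen.key?(key) ? :GradeUpdated : :GradeCreated
    events << DomainEvent.new(type: type, student_id: obs.student_id,
                              course_id: obs.course_id, grade: obs.grade)
    last_seen[key] = obs.grade
  end
end

observations = [
  Observation.new(student_id: 123, course_id: "abc", grade: "B+"),
  Observation.new(student_id: 123, course_id: "abc", grade: "B+"), # repeated extract
  Observation.new(student_id: 123, course_id: "abc", grade: "C"),
  Observation.new(student_id: 123, course_id: "abc", grade: "A-")
]

domain_events = to_domain_events(observations)
```

Because the function depends only on the ordered observations, rerunning it over the same history produces the same domain events.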
slide-89
SLIDE 89

Observations

student_id course_id grade

123 abc A-

Applying Event Sourcing to ETL

Domain Events

GradeUpdated

student_id: 123 course_id: abc grade: A-

Read Models

slide-90
SLIDE 90

Observations

student_id course_id grade

123 abc A-

Applying Event Sourcing to ETL

Domain Events

GradeUpdated

student_id: 123 course_id: abc grade: A-

Read Models

Immutable & Sequential Store

slide-91
SLIDE 91

Observations

student_id course_id grade

123 abc A-

Applying Event Sourcing to ETL

Domain Events

GradeUpdated

student_id: 123 course_id: abc grade: A-

Read Models

Immutable & Sequential Store TeTL Process(es)

Domain Events Tr

slide-92
SLIDE 92

Observations

student_id course_id grade

123 abc A-

Applying Event Sourcing to ETL

Domain Events

GradeUpdated

student_id: 123 course_id: abc grade: A-

Read Models

Immutable & Sequential Store

Read Model(s)

TeTL Process(es)

Domain Events Tr Tr Lo

slide-93
SLIDE 93

Case study: Event Sourcing ETL

slide-94
SLIDE 94

Case study: Event Sourcing ETL

GradeUpdated

student_id: 1 date: Oct 11 course: Biology grade: B-

GradeUpdated

student_id: 1 date: Oct 12 course: Biology grade: B+

projection

observation events

domain events

slide-95
SLIDE 95

Case study: Event Sourcing ETL

GradeUpdated

student_id: 1 date: Oct 11 course: Biology grade: B-

GradeUpdated

student_id: 1 date: Oct 12 course: Biology grade: B+

projection InProgressGrades domain events read models

slide-96
SLIDE 96

Case study: Event Sourcing ETL

queried InProgressGrades read models

slide-97
SLIDE 97

Case study: Event Sourcing ETL

Past as First Class

First interpretation / Later interpretation

slide-98
SLIDE 98

Case study: Event Sourcing ETL

Past as First Class

First interpretation / Later interpretation

slide-99
SLIDE 99

Case study: Event Sourcing ETL

Past as First Class

First interpretation / Later interpretation

slide-100
SLIDE 100

Case study: Event Sourcing ETL

Determinism

slide-101
SLIDE 101

Case study: Event Sourcing ETL

Determinism

  • Read Models regenerated nightly from source of truth

○ Given the same history, we regenerate the same Read Models

slide-102
SLIDE 102

Case study: Event Sourcing ETL

Determinism

  • Read Models regenerated nightly from source of truth

○ Given the same history, we regenerate the same Read Models

  • On-demand Read Model Comparison tool

○ Ensure no Read Model changes across larger code refactors
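The comparison idea can be sketched as a diff between two regenerated read models. This is not the talk's tool, which works on cloned read-model databases; here Hashes stand in for read models, and `project_v1`, `project_v2`, and `compare_read_models` are hypothetical names.

```ruby
Event = Struct.new(:student_id, :course_id, :grade, keyword_init: true)

# Projection before a refactor.
def project_v1(events)
  events.each_with_object({}) { |e, m| m[[e.student_id, e.course_id]] = e.grade }
end

# Projection after a refactor that is meant to preserve behavior.
def project_v2(events)
  events.group_by { |e| [e.student_id, e.course_id] }
        .transform_values { |group| group.last.grade }
end

# Returns a Hash of differing keys; an empty Hash means the read models match.
def compare_read_models(before, after)
  (before.keys | after.keys).each_with_object({}) do |key, diff|
    next if before.key?(key) && after.key?(key) && before[key] == after[key]

    diff[key] = { before: before[key], after: after[key] }
  end
end

events = [
  Event.new(student_id: 1, course_id: "Biology", grade: "B-"),
  Event.new(student_id: 1, course_id: "Biology", grade: "B+")
]

diff = compare_read_models(project_v1(events), project_v2(events))
```

An empty `diff` is the evidence that a refactor did not change the read model for the given history.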

slide-103
SLIDE 103

Case study: Event Sourcing ETL

Determinism

Read Model Comparison - Before and After Regeneration

Read Model DB → Clone Read Model → batch_BEFORE

Regenerations Run

Same DB, but later → Clone Read Model Again → batch_AFTER

slide-104
SLIDE 104

Case study: Event Sourcing ETL

Determinism

Read Model Comparison - Before and After Regeneration

Read Model DB Same DB, but later.

Regenerations Run

slide-105
SLIDE 105

Case study: Event Sourcing ETL

Determinism

Read Model Comparison - Before and After Regeneration

Read Model DB Same DB, but later.

Regenerations Run

slide-106
SLIDE 106

Case study: Event Sourcing ETL

Trade-off: Investment in Training

slide-107
SLIDE 107

Case study: Event Sourcing ETL

Trade-off: Investment in Training

  • 5 x (1 hr training video + 1 hr discussion) = 10 hrs
slide-108
SLIDE 108

Case study: Event Sourcing ETL

Trade-off: Investment in Training

  • 5 x (1 hr training video + 1 hr discussion) = 10 hrs
  • Gentle ramp up w/ pairing and joint designs (weeks)
slide-109
SLIDE 109

Case study: Event Sourcing ETL

Trade-off: Investment in Training

  • 5 x (1 hr training video + 1 hr discussion) = 10 hrs
  • Gentle ramp up w/ pairing and joint designs (weeks)
  • Set expectation that architecture will feel different
slide-110
SLIDE 110

Lessons Learned

At the two year mark

  • Lessons learned: Thinnest extractions possible
  • Lessons learned: Extracted files as Source of Truth
  • Lessons learned: Many iterations on transformations
  • Lessons learned: Why TL must be fast and run often
slide-111
SLIDE 111

Lessons Learned

At the two year mark

Lessons learned: Thinnest extractions possible

My first version of converting [one type of] XML to CSV was silently dropping rows, and I would have lost all that data if not for the ability to reprocess from the original extract.
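A toy sketch of why a thin extraction helps. The raw payload is stored untouched with a checksum, and conversion is a separate step that can be fixed and rerun. The `RawExtract` struct, the regex-based row counters, and the sample XML are illustrative assumptions, not the code from the case study; a real system would use a proper XML parser.

```ruby
require "digest"

RawExtract = Struct.new(:source, :sha256, :body, keyword_init: true)

# Thin extraction: keep the payload byte-for-byte, plus a checksum. No parsing here.
def capture(source, body)
  RawExtract.new(source: source,
                 sha256: Digest::SHA256.hexdigest(body),
                 body: body.dup.freeze).freeze
end

# First interpretation: only recognizes <row>...</row>, so it silently
# skips self-closing <row/> elements.
def count_rows_v1(extract)
  extract.body.scan(%r{<row>.*?</row>}m).size
end

# Corrected interpretation, rerun later against the same stored extract.
def count_rows_v2(extract)
  extract.body.scan(%r{<row\b[^>]*?(?:/>|>.*?</row>)}m).size
end

body = "<rows><row>1</row><row/><row>2</row></rows>"
extract = capture("hypothetical-sis-export", body)
```

Since the stored body is never modified, the second interpretation recovers the row that the first one missed.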

slide-112
SLIDE 112

Lessons Learned

At the two year mark

Lessons learned: Extracted files as Source of Truth

Real world example of changing incorrect foreign key reference (which had been nearly all overlapping previously).

slide-113
SLIDE 113

Lessons Learned

At the two year mark

Lessons learned: Many iterations on interpretations

Very natural to handle the changes, big and small, that appear in the format and content of the data we have extracted. Also, new features sometimes mean new or changed interpretations.

slide-114
SLIDE 114

Lessons Learned

At the two year mark

Lessons learned: Why TL must be fast and run often

Consider the “nightly restores from backups” to prove that you can actually restore from backups. This practice exists in our application rather than our tools. If regeneration ever gets too slow to complete overnight, we could lose this.

slide-115
SLIDE 115

Summary and Review

What we covered

  • How Event Sourcing can be applied to ETL
  • How Determinism can be a property of a system
  • Value of treating the Past as First Class

slide-116
SLIDE 116

Learn More

Resources

  • DDD, CQRS, and Event Sourcing videos by Greg Young
  • CQRS documentation site by Edument AB
  • Domain Driven Design book by Eric Evans

Keep in touch!

  • twitter: @ms_ati
  • email: msiegel@panoramaed.com