ETL is dead; long live streams (PowerPoint presentation by Neha Narkhede, Co-founder & CTO, Confluent)



SLIDE 1

ETL is dead; long live streams

Neha Narkhede, Co-founder & CTO, Confluent

SLIDE 2

Data and data systems have really changed in the past decade

SLIDE 3

Old world: two popular locations for data: operational databases (DB) and the relational data warehouse (DWH)

SLIDE 4

Several recent data trends are driving a dramatic change in the ETL architecture

SLIDE 5

#1: Single-server databases are replaced by a myriad of distributed data platforms that operate at company-wide scale

SLIDE 6

#2: Many more types of data sources beyond transactional data - logs, sensors, metrics...

SLIDE 7

#3: Stream data is increasingly ubiquitous; need for faster processing than daily

SLIDE 8

The end result? This is what data integration ends up looking like in practice

SLIDE 9

[Diagram: apps, search, Hadoop, the DWH, monitoring, and security wired together point-to-point through MQs and caches]

SLIDE 10

A giant mess!

[Diagram: apps, search, Hadoop, the DWH, monitoring, and security wired together point-to-point through MQs and caches]

SLIDE 11

We will see how transitioning to streams cleans up this mess and works towards...

SLIDE 12

Streaming platform

[Diagram: a streaming platform connecting apps, the DWH, Hadoop, security, search, NoSQL, and monitoring via request-response messaging or stream processing, streaming data pipelines, and changelogs]

SLIDE 13

A short history of data integration

SLIDE 14

Surfaced in the 1990s in retail organizations for analyzing buyer trends
SLIDE 15

  • Extract data from databases
  • Transform into the destination warehouse schema
  • Load into a central data warehouse
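The three steps can be sketched as a toy in-memory pipeline (the row shapes and field names below are hypothetical, just to make the shape of batch ETL concrete):

```python
# Toy batch ETL: extract rows from a "database", transform them into the
# warehouse schema, and load them into a central "warehouse" (all in memory).

def extract(db_rows):
    # Pull raw rows out of the operational store.
    return list(db_rows)

def transform(rows):
    # Reshape each row into the destination warehouse schema.
    return [{"product_id": r["id"], "amount_usd": r["amt"]} for r in rows]

def load(warehouse, rows):
    # Append the transformed rows into the central warehouse.
    warehouse.extend(rows)

warehouse = []
load(warehouse, transform(extract([{"id": 1, "amt": 9.99}, {"id": 2, "amt": 5.0}])))
```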

SLIDE 16

BUT … even though ETL tools have been around for a long time, data coverage in data warehouses is still low! WHY?

SLIDE 17

ETL has drawbacks

SLIDE 18

#1: The need for a global schema

SLIDE 19

#2: Data cleansing and curation is manual and fundamentally error-prone

SLIDE 20

#3: Operational cost of ETL is high; it is slow; time and resource intensive

SLIDE 21

#4: ETL tools were built to narrowly focus on connecting databases and the data warehouse in a batch fashion

SLIDE 22

Early take on real-time ETL = Enterprise Application Integration (EAI)

SLIDE 23

EAI: A different class of data integration technology for connecting applications in real-time

SLIDE 24

EAI employed Enterprise Service Buses and MQs, but these weren’t scalable

SLIDE 25

ETL and EAI are outdated!
SLIDE 26

Old world: scale or timely data, pick one

[Diagram: real-time vs. scale: EAI is real-time BUT not scalable; ETL is scalable BUT batch]

SLIDE 27

Data integration and ETL in the modern world need a complete revamp

SLIDE 28

New world: streaming, real-time and scalable

[Diagram: real-time vs. scale: EAI is real-time BUT not scalable; ETL is scalable BUT batch; a streaming platform is real-time AND scalable]

SLIDE 29

The modern streaming world has a new set of requirements for data integration

SLIDE 30

#1: Ability to process high-volume and high-diversity data

SLIDE 31

#2: Real-time from the ground up; a fundamental transition to event-centric thinking

SLIDE 32

Event-Centric Thinking

[Diagram: a web app publishes “A product was viewed” events to the streaming platform, which feeds Hadoop]

SLIDE 33

Event-Centric Thinking

[Diagram: a web app, a mobile app, and APIs publish “A product was viewed” events to the streaming platform, which feeds Hadoop]

SLIDE 34

Event-Centric Thinking

[Diagram: a mobile app, a web app, and APIs publish “A product was viewed” events to the streaming platform, which feeds Hadoop, security, monitoring, and a recommendation engine]

SLIDE 35

Event-centric thinking, when applied at a company-wide scale, leads to this simplification ...

SLIDE 36

Streaming platform

[Diagram: a streaming platform connecting apps, the DWH, and Hadoop via request-response messaging or stream processing, streaming data pipelines, and changelogs]

SLIDE 37

#3: Enable forward-compatible data architecture; the ability to add more applications that need to process the same data … differently

SLIDE 38

To enable forward compatibility, redefine the T in ETL:

Clean data in; Clean data out

SLIDE 39

[Diagram: app logs → #1: Extract as unstructured text → #2: Transform1 = data cleansing (“what is a product view?”) → #3: Transform2 = drop PII fields → #4: Load into the DWH]

SLIDE 40

[Diagram: the same pipeline duplicated per destination. For the DWH: #1: Extract as unstructured text; #2: Transform1 = data cleansing (“what is a product view?”); #3: Transform2 = drop PII fields; #4: Load cleansed data. For Cassandra: extract the same unstructured text again, repeat both transforms, and load the cleansed data a second time.]

SLIDE 41

[Diagram: with a streaming platform: #1: Extract as structured product view events; #2: Transform = drop PII fields; #4.1: Load the product view stream into the DWH; #4.2: Load the filtered product view stream into Cassandra]

SLIDE 42

To enable forward compatibility, redefine the T in ETL: Data transformations, not data cleansing!

SLIDE 43

[Diagram: #1: Extract once as structured product view events; #2: Transform once = drop PII fields and enrich with product metadata; #4.1: Load the product views stream into the DWH; #4.2: Load the filtered and enriched product views stream into Cassandra]

SLIDE 44

Forward compatibility = extract clean data once; transform it in many different ways before loading into the respective destinations … as and when required
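A minimal sketch of this "extract clean data once, transform many ways" idea in plain Python (the event shape and the two destination transforms are made up for illustration):

```python
# Toy "extract once, transform many" pipeline: clean, structured events are
# extracted a single time; each destination applies its own transform on read.

events = [
    {"event": "product_viewed", "product": "p1", "user": "u1"},
    {"event": "product_viewed", "product": "p2", "user": "u2"},
]

def for_warehouse(evts):
    # The warehouse wants the events without PII (the user field, here).
    return [{k: v for k, v in e.items() if k != "user"} for e in evts]

def for_cassandra(evts):
    # A serving store might only need (product, event) pairs.
    return [(e["product"], e["event"]) for e in evts]

warehouse_rows = for_warehouse(events)   # transformed when the DWH needs it
cassandra_rows = for_cassandra(events)   # transformed differently, same source
```

A new destination added later just brings its own transform; the extracted events never have to be re-extracted.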

SLIDE 45

In summary, what does a modern data integration solution need? Scale, diversity, latency, and forward compatibility

SLIDE 46

Requirements for a modern streaming data integration solution

  • Fault tolerance
  • Parallelism
  • Latency
  • Delivery semantics
  • Operations and monitoring
  • Schema management
SLIDE 47

Data integration: platform vs tool

Platform: central, reusable infrastructure for many use cases. Tool: one-off, non-reusable solution for a particular use case.

SLIDE 48

New shiny future of ETL: a streaming platform

[Diagram: a streaming platform at the center, connecting NoSQL, RDBMS, Hadoop, the DWH, apps, search, monitoring, and real-time analytics]

SLIDE 49

A streaming platform serves as the central nervous system for a company’s data in the following ways ...

SLIDE 50

#1: Serves as the real-time, scalable messaging bus for applications; no EAI

SLIDE 51

#2: Serves as the source-of-truth pipeline for feeding all data processing destinations; Hadoop, DWH, NoSQL systems and more

SLIDE 52

#3: Serves as the building block for stateful stream processing microservices

SLIDE 53

Batch data integration

Streaming

SLIDE 54

Batch ETL

Streaming

SLIDE 55

  • A short history of data integration
  • Drawbacks of ETL
  • Needs and requirements for a streaming platform
  • New, shiny future of ETL: a streaming platform
  • What does a streaming platform look like, and how does it enable streaming ETL?

SLIDE 56

Apache Kafka: a distributed streaming platform

SLIDE 57

Apache Kafka, 6 years ago

SLIDE 58

> 1,400,000,000,000 messages processed per day

SLIDE 59

Now: adopted at 1000s of companies worldwide

SLIDE 60

What role does Kafka play in the new shiny future for data integration?

SLIDE 61

#1: Kafka is the de-facto storage of choice for stream data

SLIDE 62

The log

[Diagram: an append-only log with offsets 1–7, a “next write” position at the head, and two readers at independent positions]

SLIDE 63

The log & pub-sub

[Diagram: the same log as pub-sub: a publisher appends at the head while two subscribers consume from independent offsets]
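The structure both diagrams describe can be sketched as a toy append-only log where each reader tracks its own offset (this mimics the shape of Kafka's storage model, not its actual API):

```python
# Toy append-only log: writers append at the head; each reader keeps its own
# offset, and reading never deletes anything, so readers are independent.

class Log:
    def __init__(self):
        self.entries = []

    def append(self, msg):
        # Append at the head; return the offset of the written message.
        self.entries.append(msg)
        return len(self.entries) - 1

    def read(self, offset):
        # Return (messages from offset, new offset to resume from).
        return self.entries[offset:], len(self.entries)

log = Log()
for m in ["a", "b", "c"]:
    log.append(m)

r1_msgs, r1_off = log.read(0)   # subscriber 1 starts from the beginning
r2_msgs, r2_off = log.read(2)   # subscriber 2 is further along
```

Because consumption is just "remember my offset", many subscribers can read the same log at their own pace; that is what turns a storage log into pub-sub.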

SLIDE 64

#2: Kafka offers a scalable messaging backbone for application integration

SLIDE 65

Kafka messaging APIs: scalable EAI

[Diagram: apps integrate through the messaging APIs: produce(message) and consume(message)]

SLIDE 66

#3: Kafka enables building streaming data pipelines (the E & L in ETL)

SLIDE 67

Kafka’s Connect API: streaming data ingestion

[Diagram: the Connect API extracts from a source system into Kafka, apps read and write via the messaging APIs, and the Connect API loads into a sink system: Extract → Load]
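A toy sketch of the source/sink idea, using an in-memory queue in place of a Kafka topic (nothing here is the real Connect API; connector names and row shapes are made up):

```python
# Toy source/sink "connectors" around an in-memory topic: the source
# continuously extracts rows into the topic, and the sink drains the topic
# into a destination system.

from collections import deque

topic = deque()  # stands in for a Kafka topic

def source_connector(db_rows):
    # Extract: publish every source row into the topic.
    for row in db_rows:
        topic.append(row)

def sink_connector(destination):
    # Load: drain the topic into the destination system.
    while topic:
        destination.append(topic.popleft())

dwh = []
source_connector([{"id": 1}, {"id": 2}])  # E
sink_connector(dwh)                        # L
```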

SLIDE 68

#4: Kafka is the basis for stream processing and transformations

SLIDE 69

Kafka’s Streams API: stream processing (transforms)

[Diagram: the Connect API extracts from a source, apps transform via the Streams API, and the Connect API loads into a sink: Extract → Transforms → Load]

SLIDE 70

Kafka’s Connect API = the E and L in streaming ETL

SLIDE 71

Connectors!

[Diagram: connectors link the streaming platform to NoSQL, RDBMS, Hadoop, the DWH, search, monitoring, real-time analytics, and apps]

SLIDE 72

How to keep data centers in-sync?

SLIDE 73

[Diagram: sources and sinks: the Connect API extracts from a source (Extract) and loads into a sink (Load)]

SLIDE 74

changelogs

SLIDE 75

Transforming changelogs

SLIDE 76

Kafka’s Connect API = Connectors Made Easy!

  • Scalability: Leverages Kafka for scalability
  • Fault tolerance: Builds on Kafka’s fault tolerance model
  • Management and monitoring: One way of monitoring all connectors
  • Schemas: Offers an option for preserving schemas from source to sink

SLIDE 77

Kafka all the things! Connect API

SLIDE 78

Kafka’s Streams API = the T in streaming ETL

SLIDE 79

Stream processing = transformations on stream data

SLIDE 80

2 visions for stream processing

Real-time MapReduce VS event-driven microservices

SLIDE 81

2 visions for stream processing

Real-time MapReduce:
  • Central cluster
  • Custom packaging, deployment & monitoring
  • Suitable for analytics-type use cases

Event-driven microservices:
  • Embedded library in any Java app
  • Just Kafka and your app
  • Makes stream processing accessible to any use case

SLIDE 82

Vision 1: real-time MapReduce

SLIDE 83

Vision 2: event-driven microservices => Kafka’s Streams API

[Diagram: a microservice embeds the Streams API to perform transforms]

SLIDE 84

Kafka’s Streams API = Easiest way to do stream processing using Kafka

SLIDE 85

#1: Powerful and lightweight Java library; need just Kafka and your app

SLIDE 86

#2: Convenient DSL with all sorts of operators: join(), map(), filter(), windowed aggregates, etc.

SLIDE 87

Word count program using Kafka’s streams API
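The slide refers to the classic Kafka Streams word count example. A toy version of the same logic, in plain Python rather than the real Streams DSL, updates the count table one record at a time as messages arrive:

```python
# Toy streaming word count: split each incoming message into words
# (a flatMap), then update a running count table per word.

from collections import defaultdict

counts = defaultdict(int)

def process(message):
    # Handle one record at a time, as a stream processor would.
    for word in message.lower().split():
        counts[word] += 1

for msg in ["all streams lead to Kafka", "hello Kafka streams"]:
    process(msg)
```

In the real Streams DSL this is a `flatMapValues` over the words followed by a `groupBy` and `count`; the toy above collapses those stages into one function.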

SLIDE 88

#3: True event-at-a-time stream processing; no microbatching

SLIDE 89

#4: Dataflow-style windowing based on event-time; handles late-arriving data
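A toy illustration of the idea (not Kafka's actual windowing API): events carry their own timestamps, and a late-arriving event is still assigned to the event-time window it belongs to rather than the window in which it arrived:

```python
# Toy event-time windowing: events are grouped into fixed 60-second windows
# by their own (event-time) timestamp, so late-arriving data still updates
# the correct past window.

from collections import defaultdict

WINDOW = 60  # window size in seconds (an arbitrary choice for this sketch)

windows = defaultdict(int)

def on_event(event_time, _arrival_time):
    # Assign by event time, not arrival time.
    windows[event_time // WINDOW] += 1

on_event(event_time=10, _arrival_time=12)   # lands in window 0
on_event(event_time=70, _arrival_time=71)   # lands in window 1
on_event(event_time=50, _arrival_time=90)   # late, but still window 0
```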

SLIDE 90

#5: Out-of-the-box support for local state; supports fast stateful processing

SLIDE 91

External state

SLIDE 92

Local state

SLIDE 93

Fault-tolerant local state
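The mechanism can be sketched as follows: every local update is also appended to a changelog, so a restarted instance can rebuild its state by replaying that changelog. In Kafka the changelog is a compacted topic; the code below is a toy model of the idea, not the real API:

```python
# Toy fault-tolerant local state: each update to the fast local store is
# also recorded in a durable changelog, so state can be rebuilt on restart.

changelog = []  # stands in for a compacted changelog topic

def update(state, key, value):
    state[key] = value
    changelog.append((key, value))  # record the change durably

def restore():
    # Rebuild local state from scratch by replaying the changelog;
    # later entries for a key overwrite earlier ones.
    state = {}
    for key, value in changelog:
        state[key] = value
    return state

state = {}
update(state, "views:p1", 3)
update(state, "views:p1", 4)
recovered = restore()  # e.g. on a new instance after a crash
```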

SLIDE 94

#6: Kafka’s Streams API allows reprocessing; useful to upgrade apps or do A/B testing
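A toy model of why reprocessing works: because the log retains past events, a new version of an application can re-read the same log from the beginning and compute a different answer from the very same data (the apps and event format here are hypothetical):

```python
# Toy reprocessing: the retained log is the source of truth, so upgraded
# application logic (or an A/B variant) can replay it from offset 0.

log = ["view:p1", "view:p2", "view:p1"]  # retained event log

def app_v1(events):
    # v1 counts all view events.
    return len(events)

def app_v2(events):
    # v2 counts distinct products instead; it reprocesses the same log.
    return len({e.split(":")[1] for e in events})

old_answer = app_v1(log)
new_answer = app_v2(log)
```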

SLIDE 95

Reprocessing

SLIDE 96

Real-time dashboard for security monitoring

SLIDE 97

Kafka’s Streams API: simple is beautiful (Vision 1 vs. Vision 2)

SLIDE 98

Logs unify batch and stream processing

SLIDE 99

New shiny future of ETL: Kafka

[Diagram: the Connect API extracts from a source, apps transform via the Streams API, and the Connect API loads into a sink: Extract → Transforms → Load]

SLIDE 100

A giant mess!

[Diagram: apps, search, Hadoop, the DWH, monitoring, and security wired together point-to-point through MQs and caches]

SLIDE 101

All your data … everywhere … now

[Diagram: a streaming platform connecting apps, the DWH, and Hadoop via request-response messaging or stream processing, streaming data pipelines, and changelogs]

SLIDE 102

VISION: All your data … everywhere … now

[Diagram: a streaming platform connecting apps, the DWH, and Hadoop via request-response messaging or stream processing, streaming data pipelines, and changelogs]

SLIDE 103

Thank you!

@nehanarkhede