SLIDE 1

Swimming in the data river

Or, when “streaming analytics” isn’t

Gian Merlino

gian@imply.io

SLIDE 2

Who am I?

Gian Merlino
Committer & PMC member on Apache Druid
Cofounder at Imply
10 years working on scalable systems

SLIDE 3

Agenda

From warehouses to rivers
What can we do with streaming data?
Streaming analytics
Enter the Druid
Do try this at home!

SLIDE 4

Rolling down the river

SLIDE 5

Data warehouses

Tightly coupled architecture with limited flexibility.

Diagram: data sources → ETL → data warehouse (processing: store and compute) → querying (analytics, reporting, data mining)

SLIDE 6

Data lakes

Modern data architectures are more application-centric.

Confidential. Do not redistribute.

Diagram: data sources → ETL → data lake (storage) → MapReduce, Spark → apps (SQL, ML/AI, TSDB)

SLIDE 7

Data rivers

Streaming architectures are true-to-life and enable faster decision cycles.

Diagram: data sources → stream hub → stream processors, streaming analytics, databases, ETL, storage, apps; archive to data lake

SLIDE 8

Streaming data

Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/

SLIDE 9

Streaming data

Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/ (plus question marks)

App or set of requirements? Alerting, charting, observability, APM?
Do search and NoSQL need apps?
Analytical or transactional?
How does data move between systems?

SLIDE 10

Streaming data

Direct production

Diagram: app → stream hub library → stream hub

SLIDE 11

Streaming data

Change data capture

Diagram: transactional DB → app → stream hub (the app reads over a DB connection, using something DB-specific)

SLIDE 12

Streaming data

Streaming data pipeline

Diagram: stream hub → Kafka Connect, Kinesis Firehose → EDW or data lake

SLIDE 13

Streaming data

Microservice communication

Diagram: service #1 and service #2 communicate through the stream hub

SLIDE 14

Streaming data

SLIDE 15

Streaming data

Stream hub

Spark Streaming, Flink, Storm, Samza, Kafka Streams, Kinesis Data Analytics, Azure Stream Analytics, Google Cloud Dataflow

?

SLIDE 16

Streaming data

Real-time actions

Diagram: stream hub → stream processors → alerting, API calls
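As a rough illustration of the "real-time actions" pattern, here is a minimal Python sketch of a processor that reads events one at a time and emits an alert when a threshold is crossed. The event shape (`service`, `latency_ms`), the function name `alerts`, and the 500 ms threshold are all assumptions made for this example; a real deployment would consume from the stream hub rather than from a list.

```python
from typing import Dict, Iterable, Iterator


def alerts(events: Iterable[Dict], threshold_ms: float = 500.0) -> Iterator[str]:
    """Yield one alert message per event whose latency exceeds the threshold.

    The event fields used here are hypothetical, chosen only for illustration.
    """
    for event in events:
        if event["latency_ms"] > threshold_ms:
            yield f"ALERT {event['service']}: latency {event['latency_ms']:.0f} ms"


# A bounded list stands in for the unbounded stream.
sample_stream = [
    {"service": "checkout", "latency_ms": 120},
    {"service": "search", "latency_ms": 750},
    {"service": "checkout", "latency_ms": 980},
]

for message in alerts(sample_stream):
    # In practice this is where an API call or a pager notification would go.
    print(message)
```

The processor holds no state between events, which is what makes this the simplest of the patterns in this part of the deck.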

SLIDE 17

Streaming data

Data movement

Diagram: stream hub → stream processors → storage (HDFS, cloud storage)

SLIDE 18

Streaming data

Data movement + enrichment
Continuous query

Diagram: stream hub → stream processors → storage (HDFS, cloud storage)
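One way to picture a continuous query is as an aggregation that is fixed in advance and updated as events arrive. The sketch below counts events per key in tumbling windows. It assumes events arrive in timestamp order as `(timestamp_seconds, key)` pairs; the function name and the 60-second window are illustrative choices, not something taken from the slides or from any particular stream processor.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def tumbling_counts(
    events: Iterable[Tuple[int, str]], window_s: int = 60
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Emit (window_start, counts_by_key) each time a window closes.

    Assumes in-order timestamps; windows with no events are skipped.
    """
    current_start = None
    counts: Counter = Counter()
    for ts, key in events:
        start = ts - ts % window_s  # start of the window this event falls in
        if current_start is not None and start != current_start:
            yield current_start, dict(counts)  # previous window is complete
            counts = Counter()
        current_start = start
        counts[key] += 1
    if current_start is not None:
        # Flush the last window; only meaningful for a bounded demo input.
        yield current_start, dict(counts)


sample_events = [(0, "a"), (10, "b"), (59, "a"), (60, "a"), (130, "b")]
for window_start, counts in tumbling_counts(sample_events):
    print(window_start, counts)
```

Note that the query (count by key per minute) is baked into the code. Asking a different question means deploying a different processor, which is the limitation the later slides turn to.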

SLIDE 19

Streaming data

Continuous query + write to serving layer

Diagram: stream hub → stream processors → K/V stores (HBase, Cassandra, Redis) → realtime dashboard

SLIDE 20

Streaming data

Continuous query + write to serving layer + unemitted state serving

Diagram: stream hub → stream processors → K/V stores (HBase, Cassandra, Redis) → realtime dashboard; direct state access to the stream processors

SLIDE 21

Streaming data

Continuous query + write to serving layer + unemitted state serving

Diagram: stream hub → stream processors → K/V stores (HBase, Cassandra, Redis) → realtime dashboard; direct state access to the stream processors

Annotations: key lookups, short range scans, ‘precomputation’, continuous query, access unemitted aggregation state, etc.

SLIDE 22

SLIDE 23

The problem

SLIDE 24

The problem

SLIDE 25

The problem

Slice-and-dice for big data streams
Interactive exploration
Look under the hood of reports and dashboards
And we want our data fresh, too

SLIDE 26

Challenges

Scale: when data is large, we need a lot of servers
Speed: aiming for sub-second response time
Complexity: too many fine-grained combinations to precompute
High dimensionality: 10s or 100s of dimensions
Concurrency: many users and tenants
Freshness: load from streams
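To make the complexity and dimensionality points concrete, here is a small back-of-the-envelope calculation. It counts only how many distinct group-by combinations exist for a given number of dimensions (each dimension is either grouped on or not), ignoring filters and the number of values per dimension, so it is a lower bound on what full precomputation would need. The function name is made up for this example.

```python
def group_by_combinations(num_dimensions: int) -> int:
    """Number of distinct dimension subsets one could group by: 2^d."""
    return 2 ** num_dimensions


for d in (10, 20, 100):
    print(f"{d} dimensions -> {group_by_combinations(d):,} possible groupings")
```

With 10 dimensions there are 1,024 groupings and with 20 there are over a million, before considering filters at all.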

SLIDE 27

high performance analytics data store for event-driven data

SLIDE 28

What is Druid?

“high performance”: low query latency, high ingest rates
“analytics”: counting, ranking, groupBy, time trend
“data store”: the cluster stores a copy of your data
“event-driven data”: fact data like clickstream, network flows, user behavior, digital marketing, server metrics, IoT
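For readers unfamiliar with the four "analytics" operations named above, the following sketch runs each of them over a handful of in-memory clickstream events using only the Python standard library. The event fields (`ts`, `page`) and the function name `summarize` are assumptions for illustration; this shows the shape of the questions, not how Druid executes them.

```python
from collections import Counter
from typing import Dict, List


def summarize(events: List[Dict[str, str]], top_n: int = 2) -> Dict[str, object]:
    """Compute a count, a groupBy, a ranking, and an hourly time trend."""
    by_page = Counter(e["page"] for e in events)      # groupBy page
    by_hour = Counter(e["ts"][:13] for e in events)   # bucket "YYYY-MM-DDTHH"
    return {
        "count": len(events),                          # counting
        "group_by_page": dict(by_page),
        "top_pages": by_page.most_common(top_n),       # ranking
        "hourly_trend": sorted(by_hour.items()),       # time trend
    }


sample_events = [
    {"ts": "2019-01-01T00:05", "page": "/home"},
    {"ts": "2019-01-01T00:17", "page": "/docs"},
    {"ts": "2019-01-01T00:40", "page": "/home"},
    {"ts": "2019-01-01T00:59", "page": "/download"},
    {"ts": "2019-01-01T01:03", "page": "/home"},
    {"ts": "2019-01-01T01:30", "page": "/docs"},
]

print(summarize(sample_events))
```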

SLIDE 29

Streaming data

Diagram: stream hub → stream processors (enrichment) → analytical database (like Apache Druid!) → SQL → app

Continuous load; aggregation + serving layers merged

Apps: Imply Pivot, Apache Superset, Looker, Your App Here™

SLIDE 30

Key features

Column oriented
High concurrency
Scalable to 100s of servers, millions of messages/sec
Continuous, real-time ingest
Indexes on all dimensions by default
Query through SQL
Target query latency sub-second to a few seconds
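As a sketch of what "query through SQL" can look like from an application, the snippet below builds an HTTP request for Druid's SQL endpoint (`POST /druid/v2/sql` with a JSON body containing a `query` field). The table name `clickstream`, the `page` column, and the `localhost:8888` address are assumptions for this example; adjust them to your own cluster. The request is only constructed here, not sent, so the code runs without a Druid instance.

```python
import json
import urllib.request

# Hypothetical query: top pages per hour over the last day.
# "clickstream" and "page" are made-up names; __time is Druid's timestamp column.
SQL = """
SELECT TIME_FLOOR(__time, 'PT1H') AS "hour", page, COUNT(*) AS views
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1, 2
ORDER BY views DESC
LIMIT 10
"""


def build_sql_request(sql: str, base_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Build (but do not send) a POST request for the Druid SQL endpoint."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sql_request(SQL)
print(request.get_method(), request.full_url)

# To run against a live cluster (assumed reachable at base_url):
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```

Because the query is ordinary SQL sent at request time, changing the grouping or the filter does not require redeploying anything, in contrast to a continuous query.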

SLIDE 31

Use cases

Clickstreams, user behavior
Digital advertising
Application performance management
Network flows
IoT

SLIDE 32

Powered by Apache Druid

Source: http://druid.io/druid-powered.html + many more!

SLIDE 33

Powered by Apache Druid

From Yahoo:

“The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.”

Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html

SLIDE 34

Architecture

Diagram: data enters from streams and files; nodes shown are Indexers, Historicals, and Brokers; queries go to the Brokers

SLIDE 35

Why this works

Computers are fast these days
Indexes help save work and cost
But don’t be afraid to scan tables — it can be done efficiently
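To illustrate how an index can save work compared with a scan, here is a toy bitmap index in Python: one integer per distinct column value, with bit *i* set when row *i* holds that value. A filter on two columns becomes a single bitwise AND. This is a simplified sketch with made-up column data and function names, not a description of Druid's internal index format.

```python
from collections import defaultdict
from typing import Dict, List


def build_bitmap_index(column: List[str]) -> Dict[str, int]:
    """Map each distinct value to an int whose set bits mark matching rows."""
    index: Dict[str, int] = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)


def rows_from_bitmap(bitmap: int) -> List[int]:
    """Return the row numbers whose bits are set in the bitmap."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


# Two hypothetical dimension columns of a six-row table.
country = ["US", "DE", "US", "FR", "DE", "US"]
device = ["ios", "ios", "android", "ios", "android", "ios"]

country_index = build_bitmap_index(country)
device_index = build_bitmap_index(device)

# Indexed filter: country = 'US' AND device = 'ios' is one bitwise AND.
indexed = rows_from_bitmap(country_index["US"] & device_index["ios"])

# Full scan for comparison: touches every row of both columns.
scanned = [i for i, (c, d) in enumerate(zip(country, device)) if c == "US" and d == "ios"]

print(indexed, scanned)
```

Both approaches return the same rows; the index just avoids looking at rows that cannot match. The scan stays a reasonable fallback, in line with the last bullet above.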

SLIDE 36

Integration patterns

SLIDE 37

Deployment patterns

Modern data architecture
Centered around stream hub

Diagram: event streams (millions of events/sec) → stream hub → enrichment

SLIDE 38

Deployment patterns

(Slightly less) modern data architecture
Centered around data lake

Diagram: file dumps (hourly, daily) → data lake (Hadoop, S3) → enrichment (Spark, Hive)

SLIDE 39

SLIDE 40

Download

Apache Druid community site (new): https://druid.apache.org/
Apache Druid community site (legacy): http://druid.io/
Imply distribution: https://imply.io/get-started

SLIDE 41

Contribute

https://github.com/apache/druid

SLIDE 42

Stay in touch

Join the community! http://druid.apache.org/
Follow the Druid project on Twitter! @druidio