Swimming in the data river: Or, when “streaming analytics” isn’t



slide-1
SLIDE 1

Swimming in the data river

Or, when “streaming analytics” isn’t

Gian Merlino

gian@imply.io

slide-2
SLIDE 2

Who am I?

Gian Merlino

  • Committer & PMC member on Apache Druid
  • Cofounder at Imply
  • 10 years working on scalable systems

slide-3
SLIDE 3

Agenda

  • From warehouses to rivers
  • What can we do with streaming data?
  • Streaming analytics
  • Enter the Druid
  • Do try this at home!


slide-4
SLIDE 4

Rolling down the river


slide-5
SLIDE 5

Data warehouses

Tightly coupled architecture with limited flexibility.

[Diagram: Data sources → ETL → Data warehouse (processing; store and compute) → Querying (analytics, reporting, data mining)]

slide-6
SLIDE 6

Data lakes

Modern data architectures are more application-centric.

[Diagram: Data sources → ETL → Data lake (storage) → Apps (MapReduce, Spark, SQL, ML/AI, TSDB)]

slide-7
SLIDE 7

Data rivers

Streaming architectures are true-to-life and enable faster decision cycles.

[Diagram: Data sources → Stream hub → Stream processors, streaming analytics, databases, apps (ETL, storage); archive to data lake]

slide-8
SLIDE 8

Streaming data


Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/

slide-9
SLIDE 9

Streaming data


Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/ (plus question marks)

  • App or set of requirements?
  • Alerting, charting, observability, APM?
  • Do search and NoSQL need apps?
  • Analytical or transactional?
  • How does data move between systems?

slide-10
SLIDE 10

Streaming data

Direct production

[Diagram: App → Stream hub, via the stream hub library]

slide-11
SLIDE 11

Streaming data

Change data capture

[Diagram: App → (DB connection) → Transactional DB → (something DB-specific) → Stream hub]

slide-12
SLIDE 12

Streaming data

Streaming data pipeline

[Diagram: Stream hub → (Kafka Connect, Kinesis Firehose) → EDW or data lake]

slide-13
SLIDE 13

Streaming data

Microservice communication

[Diagram: Service #1 → Stream hub → Service #2]

slide-14
SLIDE 14

Streaming data


slide-15
SLIDE 15

Streaming data

[Diagram: Stream hub → Stream processors → ?]

Stream processors:

  • Spark Streaming
  • Flink
  • Storm
  • Samza
  • Kafka Streams
  • Kinesis Data Analytics
  • Azure Stream Analytics
  • Google Cloud Dataflow

slide-16
SLIDE 16

Streaming data

Real-time actions

[Diagram: Stream hub → Stream processors → Alerting, API calls]

slide-17
SLIDE 17

Streaming data

Data movement

[Diagram: Stream hub → Stream processors → Storage (HDFS, cloud storage)]

slide-18
SLIDE 18

Streaming data

Data movement + enrichment; continuous query

[Diagram: Stream hub → Stream processors → Storage (HDFS, cloud storage); a second group of stream processors is labeled "Continuous query"]

slide-19
SLIDE 19

Streaming data

Continuous query + write to serving layer

[Diagram: Stream hub → Stream processors → K/V stores (HBase, Cassandra, Redis) → Realtime dashboard]

slide-20
SLIDE 20

Streaming data

Continuous query + write to serving layer + unemitted state serving

[Diagram: Stream hub → Stream processors → K/V stores (HBase, Cassandra, Redis) → Realtime dashboard; the dashboard also reads from the stream processors via direct state access]

slide-21
SLIDE 21

Streaming data

Continuous query + write to serving layer + unemitted state serving

[Diagram: Stream hub → Stream processors → K/V stores (HBase, Cassandra, Redis) → Realtime dashboard; the dashboard also reads from the stream processors via direct state access]

What this setup offers:

  • Key lookups
  • Short range scans
  • ‘Precomputation’
  • Continuous query
  • Access unemitted aggregation state, etc.
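
The "continuous query + write to serving layer" pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any particular stream processor's API: the event shape (`ts`, `page`), the one-minute tumbling window, and the plain dict standing in for a K/V store such as HBase, Cassandra or Redis are all made up for the example.

```python
WINDOW_SECONDS = 60  # assumed tumbling-window size


def window_start(ts: int) -> int:
    """Round an epoch-seconds timestamp down to the start of its window."""
    return ts - (ts % WINDOW_SECONDS)


def run_continuous_query(events, serving_layer):
    """Maintain per-(window, page) view counts in the serving layer.

    `events` is an iterable of dicts with hypothetical keys "ts" and "page".
    `serving_layer` is a dict standing in for a K/V store.
    """
    for event in events:
        key = (window_start(event["ts"]), event["page"])
        serving_layer[key] = serving_layer.get(key, 0) + 1


events = [
    {"ts": 0, "page": "/home"},
    {"ts": 10, "page": "/home"},
    {"ts": 30, "page": "/cart"},
    {"ts": 65, "page": "/home"},
]
serving_layer = {}
run_continuous_query(events, serving_layer)

# The dashboard side: a key lookup against what was precomputed.
home_first_minute = serving_layer[(0, "/home")]
```

The dashboard can answer "views of /home in the first minute" with a key lookup, but only because that exact aggregate was precomputed. A question along a dimension the query did not group by would need a new continuous query.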

slide-22
SLIDE 22


slide-23
SLIDE 23

The problem


slide-24
SLIDE 24

The problem


slide-25
SLIDE 25

The problem

  • Slice-and-dice for big data streams
  • Interactive exploration
  • Look under the hood of reports and dashboards
  • And we want our data fresh, too


slide-26
SLIDE 26

Challenges

  • Scale: when data is large, we need a lot of servers
  • Speed: aiming for sub-second response time
  • Complexity: too much fine grain to precompute
  • High dimensionality: 10s or 100s of dimensions
  • Concurrency: many users and tenants
  • Freshness: load from streams
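
The "complexity" and "high dimensionality" points can be made concrete with some rough arithmetic. As a sketch: if any subset of dimensions is a grouping someone might ask for, then d dimensions give 2^d possible groupings, before even counting the distinct values within each dimension. The dimension counts below are illustrative, not measurements.

```python
def possible_groupings(num_dimensions: int) -> int:
    """Number of distinct dimension subsets a user could group by (including none)."""
    return 2 ** num_dimensions


for d in (5, 10, 20, 100):
    print(f"{d} dimensions -> {possible_groupings(d):,} possible groupings")
```

Even at 20 dimensions there are over a million candidate groupings, which is why precomputing every answer ahead of time stops being practical.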


slide-27
SLIDE 27

Apache Druid: a high performance analytics data store for event-driven data

slide-28
SLIDE 28

What is Druid?

  • “high performance”: low query latency, high ingest rates
  • “analytics”: counting, ranking, groupBy, time trend
  • “data store”: the cluster stores a copy of your data
  • “event-driven data”: fact data like clickstream, network flows, user behavior, digital marketing, server metrics, IoT
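
To make the "analytics" bullet concrete, here is a small Python sketch of the four query shapes it names (counting, ranking, groupBy, time trend) over a handful of events. The event tuples are hypothetical, and this shows only what the queries compute, not how Druid executes them.

```python
from collections import Counter

# Hypothetical clickstream events: (epoch_seconds, country, page)
events = [
    (0, "US", "/home"),
    (20, "US", "/cart"),
    (45, "DE", "/home"),
    (70, "US", "/home"),
    (95, "FR", "/home"),
]

# Counting: total number of events.
total = len(events)

# groupBy: event count per country.
by_country = Counter(country for _, country, _ in events)

# Ranking (top-N): the two most-viewed pages.
top_pages = Counter(page for _, _, page in events).most_common(2)

# Time trend: event count per one-minute bucket, ordered by time.
trend = sorted(Counter(ts // 60 * 60 for ts, _, _ in events).items())
```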


slide-29
SLIDE 29

Streaming data

[Diagram: Stream hub → Stream processors (enrichment) → continuous load → Analytical database (like Apache Druid!) → SQL → App]

  • Aggregation + serving layers merged
  • Apps: Imply Pivot, Apache Superset, Looker, Your App Here™
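
As a sketch of the "App → SQL → analytical database" arrow, the snippet below builds an HTTP request for Druid's SQL endpoint (`/druid/v2/sql`) without sending it. The host and port assume a local quickstart setup, and the `clickstream` table and `page` column are hypothetical names chosen for the example.

```python
import json
import urllib.request

# Assumed address: Druid's SQL API path on a local quickstart router.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# "clickstream" and "page" are hypothetical table and column names.
SQL = """
SELECT FLOOR(__time TO MINUTE) AS time_bucket, page, COUNT(*) AS views
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1, 2
ORDER BY views DESC
LIMIT 10
"""


def build_sql_request(sql: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying a SQL query as JSON."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sql_request(SQL)
# To run it against a live cluster: urllib.request.urlopen(request).read()
```

The point of the pattern is that the app asks an ad hoc question at query time instead of reading a precomputed key.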

slide-30
SLIDE 30

Key features

  • Column oriented
  • High concurrency
  • Scalable to 100s of servers, millions of messages/sec
  • Continuous, real-time ingest
  • Indexes on all dimensions by default
  • Query through SQL
  • Target query latency sub-second to a few seconds
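
The "column oriented" and "indexes on all dimensions" bullets can be illustrated with a toy example. This is a simplified sketch, not Druid's storage format: real systems typically use compressed bitmaps and dictionary encoding, while plain Python lists and sets stand in for them here. The column names and values are hypothetical.

```python
from collections import defaultdict

# Column-oriented layout: one list per column, aligned by row position.
columns = {
    "country": ["US", "DE", "US", "FR", "US"],
    "page": ["/home", "/home", "/cart", "/home", "/home"],
    "bytes": [120, 80, 200, 50, 100],
}


def build_index(values):
    """Map each distinct value to the set of row ids where it appears."""
    index = defaultdict(set)
    for row_id, value in enumerate(values):
        index[value].add(row_id)
    return index


# Index every dimension column (the metric column "bytes" is left unindexed).
indexes = {name: build_index(columns[name]) for name in ("country", "page")}

# Filter: country = 'US' AND page = '/home', answered by intersecting row-id sets.
matching_rows = indexes["country"]["US"] & indexes["page"]["/home"]

# Aggregate: read only the "bytes" column, and only at the matching positions.
total_bytes = sum(columns["bytes"][row_id] for row_id in matching_rows)
```

Because every dimension is indexed, any combination of filters can be resolved by set intersection, and the aggregation touches only the columns the query names.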


slide-31
SLIDE 31

Use cases

  • Clickstreams, user behavior
  • Digital advertising
  • Application performance management
  • Network flows
  • IoT


slide-32
SLIDE 32

Powered by Apache Druid


Source: http://druid.io/druid-powered.html + many more!

slide-33
SLIDE 33

Powered by Apache Druid

From Yahoo:

“The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.”

Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html

slide-34
SLIDE 34

Architecture

[Diagram components: Streams, Files, Indexers (×3), Historicals (×3), Brokers (×2), Queries]
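
One way to read this architecture is as scatter-gather: a broker fans a query out to the nodes that hold data and merges their partial results. The sketch below is a deliberate simplification under that assumption. The class names, the `(country, count)` row shape, and the single hard-coded query are hypothetical, and there is no segment routing, caching, or parallelism.

```python
from collections import Counter


class DataNode:
    """Stands in for an Indexer or Historical holding some rows (a simplification)."""

    def __init__(self, rows):
        self.rows = rows  # list of (country, count) tuples; hypothetical shape

    def partial_counts(self) -> Counter:
        """Compute this node's share of a 'count by country' query."""
        result = Counter()
        for country, count in self.rows:
            result[country] += count
        return result


class Broker:
    """Fans a query out to every data node and merges the partial results."""

    def __init__(self, nodes):
        self.nodes = nodes

    def count_by_country(self) -> Counter:
        merged = Counter()
        for node in self.nodes:
            merged.update(node.partial_counts())  # Counter.update adds counts
        return merged


indexer = DataNode([("US", 2), ("DE", 1)])      # fresh data arriving from streams
historical = DataNode([("US", 10), ("FR", 4)])  # older data served from stored files
broker = Broker([indexer, historical])
result = broker.count_by_country()
```

Merging fresh and older data at query time is what lets a single query cover both without the caller knowing where each row lives.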

slide-35
SLIDE 35

Why this works

  • Computers are fast these days
  • Indexes help save work and cost
  • But don’t be afraid to scan tables: it can be done efficiently

slide-36
SLIDE 36


Integration patterns

slide-37
SLIDE 37

Deployment patterns

[Diagram: Event streams (millions of events/sec) → Stream hub → Enrichment]

  • Modern data architecture
  • Centered around stream hub

slide-38
SLIDE 38

Deployment patterns

[Diagram: File dumps (hourly, daily) → Data lake (Hadoop, S3) → Enrichment (Spark, Hive)]

  • (Slightly less) modern data architecture
  • Centered around data lake

slide-39
SLIDE 39


slide-40
SLIDE 40

Download

  • Apache Druid community site (new): https://druid.apache.org/
  • Apache Druid community site (legacy): http://druid.io/
  • Imply distribution: https://imply.io/get-started

slide-41
SLIDE 41

Contribute


https://github.com/apache/druid

slide-42
SLIDE 42

Stay in touch

  • Join the community! http://druid.apache.org/
  • Follow the Druid project on Twitter! @druidio