Swimming in the data river
Or, when “streaming analytics” isn’t
Gian Merlino
gian@imply.io
Who am I?
Gian Merlino
Committer & PMC member on Apache Druid
Cofounder at Imply
10 years working on scalable systems
Agenda
From warehouses to rivers
Tightly coupled architecture with limited flexibility.
Diagram: data sources → ETL → data warehouse (processing: store and compute) → querying (analytics, reporting, data mining)
Modern data architectures are more application-centric.
Diagram labels: data sources; ETL; data lake (storage); apps (MapReduce, Spark, SQL, ML/AI, TSDB)
Streaming architectures are true-to-life and enable faster decision cycles.
Diagram labels: data sources; stream hub; stream processors; streaming analytics; databases; apps; ETL; storage (archive to data lake)
Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/
Source: https://www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes/ (plus question marks)
App or set of requirements? Alerting, charting,
Do search and NoSQL need apps?
Analytical or transactional?
How does data move between systems?
Direct production
Diagram: app → stream hub, via a stream hub library
Change data capture
Diagram: transactional DB → app (DB connection, something DB-specific) → stream hub
Streaming data pipeline
Diagram: stream hub → EDW or data lake, via Kafka Connect or Kinesis Firehose
Microservice communication
Diagram: service #1 → stream hub → service #2
Diagram: stream hub → Spark Streaming, Flink, Storm, Samza, Kafka Streams, Kinesis Data Analytics, Azure Stream Analytics, Google Cloud Dataflow
Real-time actions
Diagram: stream hub → stream processors → alerting, API calls
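The real-time actions pattern can be sketched as a stream processor that inspects each event as it arrives and emits an alert when a condition holds. This is a minimal, hypothetical sketch: the in-memory list stands in for a stream hub topic, and `Event`, `detect_alerts`, and the 90% threshold are illustrative names and values, not the API of any stream processor.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    host: str
    cpu_pct: float


def detect_alerts(events: Iterable[Event], threshold: float = 90.0) -> Iterator[str]:
    """Yield an alert message for every event above the threshold."""
    for event in events:  # one pass, one event at a time, nothing stored
        if event.cpu_pct > threshold:
            yield f"ALERT {event.host}: cpu at {event.cpu_pct:.0f}%"


# Hypothetical stand-in for events read from a stream hub topic.
stream = [Event("web-1", 42.0), Event("web-2", 97.0), Event("web-1", 95.0)]
alerts = list(detect_alerts(stream))
```

In a real deployment the alert would more likely be an API call or a message to a paging system than a string, but the shape is the same: the action happens inside the processor, and no queryable copy of the data is kept.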
Data movement
Diagram: stream hub → stream processors → storage (HDFS, cloud storage)
Data movement + enrichment
Diagram labels: stream hub; stream processors (×2); storage (HDFS, cloud storage); continuous query
Continuous query + write to serving layer
Diagram: stream hub → stream processors → K/V stores (HBase, Cassandra, Redis) → realtime dashboard
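The continuous query + serving layer pattern can be sketched as follows: a stream processor maintains a running aggregate per time window and writes it to a K/V store, and the dashboard reads it back by key. This is a rough sketch under assumptions: a plain dict stands in for the K/V store, and `continuous_count`, `dashboard_lookup`, and the 60-second tumbling window are hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # assumed tumbling window size

# Hypothetical stand-in for a K/V store such as HBase, Cassandra, or Redis.
KVStore = Dict[Tuple[int, str], int]


def continuous_count(events: Iterable[Tuple[int, str]], store: KVStore) -> None:
    """Count events per (window, page) and upsert the running totals."""
    for timestamp, page in events:
        window_start = timestamp - timestamp % WINDOW_SECONDS
        key = (window_start, page)
        store[key] = store.get(key, 0) + 1


def dashboard_lookup(store: KVStore, window_start: int, page: str) -> int:
    """Key lookup: the kind of read the serving layer answers cheaply."""
    return store.get((window_start, page), 0)


store: KVStore = {}
continuous_count([(0, "/home"), (15, "/home"), (59, "/cart"), (61, "/home")], store)
home_first_minute = dashboard_lookup(store, 0, "/home")
```

Note that the grouping (window and page) is fixed when the continuous query is written. Asking the same data a different question, such as counts by country, would need another continuous query and another set of keys.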
Continuous query + write to serving layer + unemitted state serving
Diagram: stream hub → stream processors → K/V stores (HBase, Cassandra, Redis) → realtime dashboard, plus direct state access to the stream processors
Continuous query + write to serving layer + unemitted state serving (annotated)
Diagram: stream hub → stream processors → K/V stores (HBase, Cassandra, Redis) → realtime dashboard, plus direct state access
Annotations: key lookups; short range scans; ‘precomputation’; continuous query; access unemitted aggregation state, etc.
user behavior, digital marketing, server metrics, IoT
Aggregation + serving layers merged
Diagram: stream hub → stream processors (enrichment) → continuous load → analytical database (like Apache Druid!) → SQL → app (Imply Pivot, Apache Superset, Looker, Your App Here™)
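To illustrate what merging the aggregation and serving layers buys, here is a small sketch in which events are loaded continuously and aggregations are chosen at query time through SQL. Python's built-in `sqlite3` is used purely as a stand-in so the example runs anywhere; it is not an analytical database and says nothing about how Druid stores or queries data. The table layout and the `load` helper are assumptions made for illustration.

```python
import sqlite3

# sqlite3 is only a stand-in here; it is not how Druid stores or queries data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (ts INTEGER, page TEXT, country TEXT)")


def load(batch):
    """Continuous load: append enriched events as they arrive."""
    db.executemany("INSERT INTO events VALUES (?, ?, ?)", batch)


# Two small batches, as if they arrived from the stream at different times.
load([(0, "/home", "US"), (15, "/home", "DE")])
load([(59, "/cart", "US"), (61, "/home", "US")])

# The grouping is decided at query time, not at ingestion time.
by_page = db.execute(
    "SELECT page, COUNT(*) FROM events GROUP BY page ORDER BY page"
).fetchall()
by_country = db.execute(
    "SELECT country, COUNT(*) FROM events GROUP BY country ORDER BY country"
).fetchall()
```

Both questions are answered from the same loaded data, with no separate precomputation step for each one.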
Source: http://druid.io/druid-powered.html + many more!
From Yahoo:
“The performance is great ... some of the tables that we have internally in Druid have billions and billions of events in them, and we’re scanning them in under a second.”
Source: https://www.infoworld.com/article/2949168/hadoop/yahoo-struts-its-hadoop-stuff.html
Diagram labels: streams; files; indexers (×3); historicals (×3); brokers (×2); queries
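One way to read this diagram: a broker receives a query, fans it out to the processes that hold the relevant data (indexers for recently ingested data, historicals for older data), and merges the partial results. The toy sketch below illustrates only that scatter/gather idea. `Node`, `broker_query`, and the row layout are hypothetical and are not Druid's API.

```python
from collections import Counter
from typing import Dict, List


class Node:
    """Hypothetical data-serving process (an indexer or a historical)."""

    def __init__(self, rows: List[Dict[str, str]]):
        self.rows = rows

    def count_by(self, dimension: str) -> Counter:
        # Each node aggregates only the rows it holds.
        return Counter(row[dimension] for row in self.rows)


def broker_query(nodes: List[Node], dimension: str) -> Counter:
    """Scatter the query to every node, then merge the partial counts."""
    merged: Counter = Counter()
    for node in nodes:
        merged.update(node.count_by(dimension))
    return merged


indexer = Node([{"page": "/home"}, {"page": "/cart"}])     # recently streamed data
historical = Node([{"page": "/home"}, {"page": "/home"}])  # older data
result = broker_query([indexer, historical], "page")
```

Because each node returns an already-aggregated partial result, the broker merges a handful of counters rather than scanning raw rows itself.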
Architecture: stream hub
Diagram: event streams (millions of events/sec) → stream hub → enrichment
Modern data architecture: data lake
Diagram: file dumps (hourly, daily) → data lake (Hadoop, S3) → enrichment (Spark, Hive)
Apache Druid community site (new): https://druid.apache.org/
Apache Druid community site (legacy): http://druid.io/
Imply distribution: https://imply.io/get-started
https://github.com/apache/druid
Join the community! http://druid.apache.org/
Follow the Druid project on Twitter! @druidio