Web analytics at scale with Druid at naver.com
Jason Heo (analytic.js.heo@navercorp.com)
Doo Yong Kim (dooyong.kim@navercorp.com)
- Part 1
- About naver.com
- What is & Why Druid
- The Architecture of our service
- Part 2
- Druid Segment File Structure
- Spark Druid Connector
- TopN Query
- Plywood & Split-Apply-Combine
- How to fix TopN’s unstable results
- Appendix
Agenda
About naver.com
https://en.wikipedia.org/wiki/Naver
- naver.com
- The biggest website in South Korea
- The Google of South Korea
- 74.7% of all web searches in South Korea
- Developed Analytics Systems at Naver
- Working with Databases since 2000
- Author of 3 MySQL books
- Currently Elasticsearch, Spark, Kudu, and Druid
- Working on Spark and Druid-based OLAP platform
- Implemented search infrastructure at coupang.com
- Have been interested in MPP and advanced file formats for big data
Jason Heo, Doo Yong Kim
About Speakers
Platforms we've tested so far
- Storage Format: Parquet, ORC, Carbon Data, Elasticsearch, ClickHouse, Kudu, Druid
- Query Engine: SparkSQL, Hive, Impala, Drill, Presto, Kylin, Phoenix
- What is Druid?
- Our Requirements
- Why Druid?
- Experimental Results
What is & Why Druid
- Column-oriented distributed datastore
- Real-time streaming ingestion
- Scalable to petabytes of data
- Approximate algorithms (HyperLogLog, theta sketch)
From Hortonworks: https://www.slideshare.net/HadoopSummit/scalable-realtime-analytics-using-druid
What is Druid?
From my point of view
- Druid is a cumbersome version of Elasticsearch (w/o search feature)
- Similar points
- Secondary Index
- DSLs for query
- Flow of Query Processing
- Terms Aggregation ↔ TopN Query, Coordinator ↔ Broker, Data Node ↔ Historical
- Different points
- more complicated to operate
- better with much more data
- better for Ultra High Cardinality
- less GC overhead
- better for Spark Connectivity (for Full Scan)
What is Druid?
What is Druid? - Architecture
[Architecture diagram. Labels: Real-time Node, Historical, Broker, Overlord, Middle Manager, Coordinator, Kafka, Index Service, Segment management]
- MySQL: metadata
- Zookeeper: cluster mgmt.
- Deep Storage (HDFS, S3): stores Druid segments for durability
- Query Service (Real-time Node, Historical, Broker): receives Druid DSL from Clients
- Diagram arrows: Segments download, Segments for query
groupBy query:
{
  "queryType": "groupBy",
  "dataSource": "sample_data",
  "dimensions": ["country", "device"],
  "filter": {},
  "aggregations": [...],
  "limitSpec": {...}
}
topN query:
{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "filter": {...},
  "aggregations": [...],
  "threshold": 5
}
SQL:
SELECT ... FROM dataSource
What is Druid? - Queries
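As a rough illustration of the native query DSL, the sketch below builds a complete TopN query as a Python dictionary and serializes it to JSON. The data source, dimension, metric name and interval are made-up placeholders; the field names (`dimension`, `metric`, `threshold`, `granularity`, `intervals`, `aggregations`) are the ones Druid's native TopN query expects.

```python
import json


def build_topn_query(data_source, dimension, metric, threshold, interval):
    """Return a Druid native TopN query as a plain dict."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "dimension": dimension,      # TopN takes exactly one dimension
        "metric": metric,            # aggregator output used for ranking
        "threshold": threshold,      # the N in TopN
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric},
        ],
    }


# All argument values below are hypothetical.
query = build_topn_query(
    "sample_data", "sample_dim", "count", 5, "2018-05-23/2018-05-24"
)
print(json.dumps(query, indent=2))
```

The resulting JSON is what a client would typically POST to a Broker's `/druid/v2` endpoint.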
- SQLs can be converted to Druid DSL
- No JOIN
- 1. Random Access (OLTP)
SELECT COUNT(*) FROM logs WHERE url = ?;
- 2. Most Viewed
SELECT url, COUNT(*) FROM logs GROUP BY url ORDER BY COUNT(*) DESC LIMIT 10;
- 3. Full Aggregation
SELECT visitor, COUNT(*) FROM logs GROUP BY visitor;
- 4. JOIN
SELECT ... FROM logs INNER JOIN users GROUP BY ... HAVING ...
Why Druid? - Requirements
Perfect solution for OLTP and OLAP
For OLTP
- Supports Bitmap Index
- Fast Random Access
For OLAP
- Supports TopN Query
- 100x faster than GroupBy query
- Supports Complex Queries
- JOIN, HAVING, etc.
- with our Spark Druid Connector
Why Druid?
- 1. Random Access: ★★★★☆
- 2. Most Viewed: ★★★★★
- 3. Full Aggregation: ★★★★☆
- 4. JOIN: ★★★★☆
Pros
- Fast Random Access
- Terms Aggregation
- TopN Query
- Easy to manage
Cons
- Slow full scan with es-hadoop
- Low performance for multi-field terms aggregation (esp. High Cardinality)
- GC Overhead
Comparison – ElasticSearch
- 1. Random Access: ★★★★★
- 2. Most Viewed: ★★★☆☆
- 3. Full Aggregation: ☆☆☆☆☆
- 4. JOIN: ☆☆☆☆☆
Pros
- Fast Random Access via Primary Key
- Fast OLAP with Impala
Cons
- No Secondary Index
- No TopN Query
Comparison – Kudu + Impala
- 1. Random Access: ★★★★★ (PK) ★☆☆☆☆ (non-PK)
- 2. Most Viewed: ☆☆☆☆☆
- 3. Full Aggregation: ★★★★★
- 4. JOIN: ★★★★★
Experimental Results – Response Time (sec)
[Bar chart, Random Access: Elasticsearch, Kudu+Impala, Druid; values as extracted: 0.003, 0.14, 0.03]
[Bar chart, Most Viewed with 1 Field and 2 Fields: Elasticsearch, Kudu+Impala, Druid; values as extracted: 0.25, 0.35, 0.08, 2.7, 2.9, 0.78]
Experimental Results – Notes
Random Access
- ES: Lucene Index
- Kudu+Impala: Primary Key
- Druid: Bitmap Index
Most Viewed
- ES: Terms Aggregation
- Kudu+Impala: Group By
- Druid: TopN
- Split-Apply-Combine for Multi Fields
Data Sets
- 210 mil. rows
- same parallelism
- same number of shards/partitions/segments
Logs
The Architecture of our service
[Architecture diagram. Labels: Zeppelin, Plywood, API Server, Druid DSL, SparkSQL, Spark Thrift Server, Spark Executor; Druid (Coordinator, Overlord, Middle Manager, Peon, Broker, Historical, Segments File); Real-time Ingestion (Kafka, Kafka Indexing Service); Batch Ingestion (Kafka, transform logs, Parquet, remove duplicated logs, Run daily batch job); Switching]
Introduction – Who am I?
- 1. Doo Yong Kim
- 2. Naver
- 3. Software engineer
- 4. Big data
Contents
- 1. Druid Storage Model
- 2. Spark Druid Connector Implementation
- 3. TopN Query
- 4. Plywood & Split-Apply-Combine
- 5. Extending Druid Query
Druid Storage Model – 4 characteristics
- Columnar format
- Explicitly distinguishes between dimensions and metrics
- Bitmap index
- Dictionary encoded
Druid Storage Model - background
Druid treats dimensions and metrics separately.
Dimension
- Bitmap Index
- GroupBy Fields
Metric
- Argument of Aggregate Function
Druid Ingestion Spec
{
  "dimensionsSpec": {
    "dimensions": ["country", "device", ...]
  },
  ...
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "doubleSum", "fieldName": "duration", "name": "duration" }
  ]
}
Druid Storage Model- Dimension
Country (Dimension): Korea, UK, Korea, Korea, Korea, UK
Dictionary for country: Korea ↔ 0, UK ↔ 1
Dictionary Encoded Values: 0, 1, 0, 0, 0, 1
Bitmap per value: Korea → 101110, UK → 010001 (UK appears in 2nd, 6th rows)
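The dictionary and bitmaps on this slide can be reproduced with a few lines of Python. This is only a sketch of the idea: bitmaps are shown as plain bit strings and ids are assigned in order of first appearance, whereas Druid itself keeps a sorted dictionary and stores compressed bitmaps.

```python
def dictionary_encode(values):
    """Map each distinct value to an integer id (order of first appearance)."""
    dictionary = {}
    encoded = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        encoded.append(dictionary[value])
    return dictionary, encoded


def build_bitmaps(values):
    """One bit string per distinct value; position i is '1' if row i holds it."""
    return {
        distinct: "".join("1" if v == distinct else "0" for v in values)
        for distinct in dict.fromkeys(values)
    }


country = ["Korea", "UK", "Korea", "Korea", "Korea", "UK"]
dictionary, encoded = dictionary_encode(country)
bitmaps = build_bitmaps(country)

print(dictionary)  # {'Korea': 0, 'UK': 1}
print(encoded)     # [0, 1, 0, 0, 0, 1]
print(bitmaps)     # {'Korea': '101110', 'UK': '010001'}
```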
Druid Storage Model - Metric
Country (Dimension)   duration (Metric)
Korea                 13
UK                    2
Korea                 15
Korea                 29
Korea                 30
UK                    14
Stored metric column: 13 2 15 29 30 14
Druid Storage Model
SELECT country, device, duration FROM logs WHERE country = 'Korea' AND device LIKE 'Iphone%'
- Filter by bitmap: country = 'Korea'
- Filter it manually, row by row: device LIKE 'Iphone%'
- Columns involved: country (bitmap filtering), device (filtering), duration
- Result row: ('Korea', 'Iphone 6s', 13)
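The two-step evaluation of this query can be sketched as follows: the equality predicate is answered from the bitmap, and only the surviving rows are checked against the LIKE prefix. The country and duration columns come from the earlier slides; every device value except the first is invented for the example.

```python
country = ["Korea", "UK", "Korea", "Korea", "Korea", "UK"]
# Hypothetical device values, except row 0 which matches the slide's result row.
device = ["Iphone 6s", "Iphone 7", "Galaxy S8", "Galaxy S8", "Pixel 2", "Iphone 6s"]
duration = [13, 2, 15, 29, 30, 14]


def build_bitmap(column, value):
    """Bitmap as a Python int: bit i is set when row i equals value."""
    bits = 0
    for i, v in enumerate(column):
        if v == value:
            bits |= 1 << i
    return bits


def run_query(country_value, device_prefix):
    candidates = build_bitmap(country, country_value)  # step 1: bitmap filter
    result = []
    for i in range(len(country)):
        if not (candidates >> i) & 1:
            continue  # pruned by the bitmap, device is never inspected
        if device[i].startswith(device_prefix):        # step 2: row-level filter
            result.append((country[i], device[i], duration[i]))
    return result


print(run_query("Korea", "Iphone"))  # [('Korea', 'Iphone 6s', 13)]
```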
Spark Druid Connector
- 1. 3 Ways to implement, Our implementation
- 2. What is needed to implement
- 3. Sample Codes, Performance Test
- 4. How to implement
Spark Druid Connector - 3 Ways to implement
1st way: Spark Driver rewrites SQL into DSL and sends it to the Druid Broker
- Good if SQL is rewritable to DSL
- But DSL does not support all SQL
- Ex: JOIN, sub-query
2nd way: Spark Executors send Select DSL to Druid Historicals and receive large JSON
- Easy to implement
- No need to understand Druid Index Library
- Ser/de operation is expensive
- Parallelism is bounded to no. of Historical
Spark Druid Connector - 3 Ways to implement
3rd way (We chose this way!)
[Diagram: Spark Driver (SQL), Executor, Segment File]
- Read Druid segment files directly.
- Similar to the way of reading Parquet
- Difficult to implement
- Need to understand Druid segment library
- Reads segments using Druid Library
- Allocate Spark executor into Historical Node
Spark Druid Connector – How to use
Create table
spark.read
  .format("com.navercorp.ni.druid.spark.druid")
  .option("coordinator", "host1.com:18081")
  .option("broker", "host2.com:18082")
  .option("datasource", "logs")
  .load()
  .createOrReplaceTempView("logs")
Execute Query
spark.sql("""
  SELECT country, device, duration
  FROM logs
  WHERE country = 'Korea' AND device LIKE 'Iphone%'
""").show(false)
Spark Druid Connector - Performance
Total 4.4B rows. Seconds, lower is better.
[Bar chart, Random Access: Spark Druid, Spark Parquet; values as extracted: 0.21, 7.5]
[Bar chart, Full Scan & GROUP BY: Spark Druid, Spark Parquet; values as extracted: 24.1, 7.7]
Spark Druid Connector – How to implement
- 1. Druid Rest API
- 2. Druid Segment Library
- 3. Spark Data Source API
Spark Druid Connector – Get table schema
Spark Driver → Druid Broker
Request:
{ "queryType": "segmentMetadata", "dataSource": "logs", "merge": true }
Response:
{ "columns": { "__time": {...}, "country": {...}, "device": {...}, "duration": {...}, ... } }
Schema
spark.read
  .format("...")
  .option("coordinator", "...")
  .option("broker", "...")
  .option("datasource", "logs")
  .load()
Spark Druid Connector – Partition pruning
WHERE country = 'Korea' AND __time = CAST('2018-05-23' AS TIMESTAMP)
Segments can be pruned by interval condition and single dimension partition
- 1. Interval condition: serverview returns only matched segments
- 2. Single dimension partition: compare start and end with given filter
Spark Driver → Druid Coordinator
GET /.../logs/intervals/2018-05-23/serverview
[
  {
    "segment": {
      "shardSpec": { "dimension": "country", "start": "null", "end": "b" ... },
      "id": "segmentId"
    },
    "servers": [ {"host": "host1"}, {"host": "host2"} ]
  },
  { "segment": ... },
  ...
]
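Pruning on a single-dimension partition can be sketched as a range check against each segment's `shardSpec`. The segment ids and boundaries below are invented, `None` stands in for an unbounded side, and the range is assumed to be start-inclusive and end-exclusive with plain string comparison.

```python
def may_contain(shard_spec, dimension, value):
    """Return False only when the shard's [start, end) range excludes value."""
    if shard_spec.get("dimension") != dimension:
        return True  # partitioned on another dimension: cannot prune
    start, end = shard_spec.get("start"), shard_spec.get("end")
    if start is not None and value < start:
        return False
    if end is not None and value >= end:
        return False
    return True


def prune(segments, dimension, value):
    """Keep the ids of segments that may hold rows with dimension == value."""
    return [
        s["id"] for s in segments
        if may_contain(s["shardSpec"], dimension, value)
    ]


# Hypothetical segments partitioned on "country".
segments = [
    {"id": "seg_0", "shardSpec": {"dimension": "country", "start": None, "end": "b"}},
    {"id": "seg_1", "shardSpec": {"dimension": "country", "start": "b", "end": "m"}},
    {"id": "seg_2", "shardSpec": {"dimension": "country", "start": "m", "end": None}},
]

print(prune(segments, "country", "korea"))  # ['seg_1']
print(prune(segments, "city", "seoul"))     # all three: no pruning possible
```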
Spark Druid Connector – Spark filters to Druid filters
WHERE country = 'Korea' AND city = 'Seoul'
buildScan(requiredColumns: [country, device, duration], filters: [EqualTo(country, Korea), EqualTo(city, Seoul)])
Spark's filters are converted into Druid's DimFilter
private def toDruidDimFilters(sparkFilter: Filter): DimFilter = {
  sparkFilter match {
    ...
    case EqualTo(attribute, value) =>
      new SelectorDimFilter(attribute, value.toString, null)
    case GreaterThan(attribute, value) => ...
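The same mapping can be illustrated outside the JVM by translating filters into Druid's JSON filter objects. This is a sketch, not the connector's code: Spark filters are modelled as `(operator, attribute, value)` tuples, only three operators are handled, and the filters passed to a scan are assumed to be combined with AND.

```python
def to_druid_filter(spark_filter):
    """Translate one (op, attribute, value) tuple into a Druid JSON filter."""
    op, attribute, value = spark_filter
    if op == "EqualTo":
        return {"type": "selector", "dimension": attribute, "value": str(value)}
    if op == "GreaterThan":
        # Bound filters compare lexicographically unless an ordering is given.
        return {"type": "bound", "dimension": attribute,
                "lower": str(value), "lowerStrict": True}
    if op == "LessThan":
        return {"type": "bound", "dimension": attribute,
                "upper": str(value), "upperStrict": True}
    raise ValueError(f"unsupported filter: {op}")


def combine(spark_filters):
    """AND all pushed-down filters together."""
    fields = [to_druid_filter(f) for f in spark_filters]
    if len(fields) == 1:
        return fields[0]
    return {"type": "and", "fields": fields}


druid_filter = combine([
    ("EqualTo", "country", "Korea"),
    ("EqualTo", "city", "Seoul"),
])
print(druid_filter)
```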
Spark Druid Connector – Attach locality to RACK_LOCAL
- getPreferredLocations(partition: Partition)
- Returns Hosts having Druid Segments
- Caution: Spark does not always guarantee that executors launch on preferred locations
- Set spark.locality.wait to very large value
Spark Druid Connector - How to implement
Done!
Now Spark executors can read records from Druid segment files.
[Diagram: Segment File, Spark Druid Connector, Spark]
TopN Query
- 1. How TopN Query works
- 2. Performance
- 3. Limitation
TopN Query – We heavily use TopN query
TopN Query flow (N=100)
[Diagram: User, Broker, three Historical nodes each with a Segment Cache]
- Each Historical node returns its local top 100 results.
- The Broker merges the results from each Historical node and makes the final records.
- The client gets the merged results from the Broker.
TopN Query - Example
Top 3 country ORDER BY SUM(duration)

Top 3 of Historical a    Top 3 of Historical b    Top 3 of Historical c
country  SUM(duration)   country  SUM(duration)   country  SUM(duration)
korea    114             uk       67              korea    87
uk       47              korea    24              uk       57
usa      21              usa      3               china    33

Merged at Broker: korea 225, uk 171, china 33, usa 24
Broker Top 3 Result: korea 225, uk 171, china 33

Full data held by each Historical:
Historical a             Historical b             Historical c
country  SUM(duration)   country  SUM(duration)   country  SUM(duration)
korea    114             uk       67              korea    87
uk       47              korea    24              uk       57
usa      21              usa      3               usa      22
china    17              china    1               china    33

Broker Top 3 Result: korea 225, uk 171, china 33
Missing! Rows outside each Historical's local top 3 never reach the Broker.
TopN – is an approximate approach
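The example can be replayed in Python to see where the approximation comes from. The per-Historical numbers are the ones on the slide; the merge logic is a simplified sketch, not Druid's actual implementation.

```python
from collections import Counter


def top_n(counts, n):
    """Sort by metric descending (name breaks ties) and keep the first n."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def broker_merge(partials, n):
    """Sum the partial results from every Historical, then take the top n."""
    total = Counter()
    for partial in partials:
        total.update(dict(partial))
    return top_n(total, n)


historicals = {
    "a": {"korea": 114, "uk": 47, "usa": 21, "china": 17},
    "b": {"uk": 67, "korea": 24, "usa": 3, "china": 1},
    "c": {"korea": 87, "uk": 57, "usa": 22, "china": 33},
}

# TopN: each Historical truncates to its local top 3 before the Broker merges.
approx = broker_merge([top_n(data, 3) for data in historicals.values()], 3)

# Exact answer: merge everything first, truncate once at the end.
exact = broker_merge([list(data.items()) for data in historicals.values()], 3)

print(approx)  # [('korea', 225), ('uk', 171), ('china', 33)]
print(exact)   # [('korea', 225), ('uk', 171), ('china', 51)]
```

The ranking happens to match here, but china's value is short by 18 because its rows on Historicals a and b fell outside the local top 3.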
rank   GroupBy (Few minutes)   TopN (1536 ms)
1      1,948,297               1,948,297
2      1,404,167               1,404,167
3      1,383,538               1,383,538
4      1,141,977               1,141,977
5      1,099,028               1,090,277
6      1,090,277               1,079,242
7      1,051,448               1,051,448
8      996,961                 996,961
9      941,284                 941,284
10     937,078                 937,078
100x Faster!
TopN – 100x faster than GroupBy
- 1. rank changed: rank 5 → rank 6
- 2. value changed: 1,099,028 → 1,079,242
TopN – Limitations
- 1. TopN supports only one dimension.
- 2. Results are unstable when the replication factor is 2 or more.
Plywood
- 1. Plywood
- 2. Split-Apply-Combine
- 3. Our Improvement
- 1. https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf
- 2. http://plywood.imply.io/index
// Split [ country, city, device ]
ply()
  .apply(dataSource, $(dataSource).filter(...)) // Filter1
  .apply(dataSource, $(dataSource).filter(...)) // Filter2
  .apply(dataSource, $(dataSource).filter(...)) // Filter3
  .apply('country', $(dataSource).split(...)
    .apply(...) // Filter to Split1 (country)
    .apply('city', $(dataSource).split(...)
      .apply(...) // Filter to Split2 (city)
      .apply(...) // Filter to Split2 (city)
      .apply('device', $(dataSource).split(...)
        .apply(...) // Filter to Split3 (device)
      )
    )
  )
SELECT country, city, device FROM $TABLE WHERE … GROUP BY country, city, device
≒
Split Apply Combine - SAC
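The nesting in the Plywood expression can be illustrated with an in-memory sketch: split on the first dimension, apply an aggregate, keep the top groups, then repeat on each group's rows for the next dimension. The rows, city names and limit are invented, and this is plain Python rather than the Plywood API.

```python
from collections import defaultdict


def split_apply(rows, dimension, metric, limit):
    """Split rows by one dimension, rank groups by SUM(metric), keep the top ones."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row)
    ranked = sorted(
        groups.items(),
        key=lambda kv: (-sum(r[metric] for r in kv[1]), kv[0]),
    )
    return ranked[:limit]


def nested_split(rows, dimensions, metric, limit):
    """Recursively split on each dimension and combine the results into a tree."""
    if not dimensions:
        return []
    head, rest = dimensions[0], dimensions[1:]
    return [
        {
            head: value,
            metric: sum(r[metric] for r in group),
            "children": nested_split(group, rest, metric, limit),
        }
        for value, group in split_apply(rows, head, metric, limit)
    ]


# Hypothetical rows.
rows = [
    {"country": "korea", "city": "seoul", "duration": 100},
    {"country": "korea", "city": "busan", "duration": 14},
    {"country": "uk", "city": "london", "duration": 40},
    {"country": "uk", "city": "leeds", "duration": 7},
    {"country": "usa", "city": "nyc", "duration": 21},
]

tree = nested_split(rows, ["country", "city"], "duration", limit=2)
for node in tree:
    print(node["country"], node["duration"],
          [(c["city"], c["duration"]) for c in node["children"]])
```

Each level only ever groups by one dimension, which is what lets a single-dimension query such as TopN serve a multi-field breakdown.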
Plywood tuning
[Figure: Before vs. After]
Tuning Results
[Bar chart: Throughput (qps, higher is better), Before vs. After]
Challenge
The same query can return different results when the replication factor is 2 or more.
Stable TopN - Motivation
[Diagram: Seg_1, Seg_2 and Seg_3 are replicated across Historical 1 and Historical 2.]
- First Result: Broker merges TopN(Seg_1 + Seg_2) and TopN(Seg_3)
- Second Result: Broker merges TopN(Seg_1) and TopN(Seg_2 + Seg_3)
- First Result != Second Result: results can be different
Bypass the Historical-side TopN merge and let the Broker merge the per-segment TopN results in segment ID order.
by_segment patch
[Diagram: Seg_1, Seg_2 and Seg_3 are replicated across Historical 1 and Historical 2; each Historical returns per-segment results.]
- First Result: Broker merges TopN(Seg_1) + TopN(Seg_2) and TopN(Seg_3)
- Second Result: Broker merges TopN(Seg_1) and TopN(Seg_2) + TopN(Seg_3)
- First Result == Second Result: always identical
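A small simulation shows why the result depends on which replica serves which segment, and why merging per-segment results removes that dependency. The segment contents and the threshold of 2 are invented, and the functions only model the behaviour described on these slides rather than Druid's internals.

```python
from collections import Counter


def top_n(counts, n):
    """Sort by metric descending (name breaks ties) and keep the first n."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def historical_topn(segments, n):
    """Default behaviour: merge the segments served locally, then truncate."""
    total = Counter()
    for segment in segments:
        total.update(segment)
    return top_n(total, n)


def broker_merge(partials, n):
    """Sum the truncated partial results, then truncate again."""
    total = Counter()
    for partial in partials:
        total.update(dict(partial))
    return top_n(total, n)


# Hypothetical segment contents.
seg_1 = {"a": 10, "b": 9, "c": 8}
seg_2 = {"c": 10, "b": 9, "a": 1}
seg_3 = {"b": 10, "c": 3, "a": 2}
N = 2

# Historical 1 serves Seg_1 + Seg_2, Historical 2 serves Seg_3.
first = broker_merge(
    [historical_topn([seg_1, seg_2], N), historical_topn([seg_3], N)], N)
# Historical 1 serves Seg_1, Historical 2 serves Seg_2 + Seg_3.
second = broker_merge(
    [historical_topn([seg_1], N), historical_topn([seg_2, seg_3], N)], N)
# Per-segment results: independent of which Historical served each segment.
stable = broker_merge(
    [historical_topn([s], N) for s in (seg_1, seg_2, seg_3)], N)

print(first)   # [('b', 28), ('c', 21)]
print(second)  # [('b', 28), ('c', 13)]
print(stable)  # [('b', 28), ('c', 13)]
```

The per-segment result is still approximate, but it is the same on every run because the truncation points no longer depend on segment placement.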
Navis @ SK Telecom Ens @ Naver
Special Thanks
Thank you!
Appendix
- 10 Broker Nodes
- 40 Historical Nodes
- 2 MiddleManager & Overlord Nodes
- 2 Coordinator Nodes
- 10 Yarn & HDFS Nodes for Batch Ingestion
- Spark Standalone Cluster runs on Historical Nodes
- for Locality
Druid Deploy & Configuration (1)
- Druid version: 0.11
- H/W Spec for Broker & Historical
- CPU: 40 cores (w/ hyperthread)
- RAM: 128GB
- HDD: SSD w/ RAID 5
- Memory Configuration
Configuration: Value for Broker, Value for Historical
-Xmx: 20GB, 12GB
-XX:MaxDirectMemorySize: 30GB, 45GB
druid.processing.numMergeBuffers: 10, 20
druid.processing.numThreads: 20, 30
druid.processing.buffer.sizeBytes: 512MB, 800MB
druid.cache.sizeInBytes: 5GB
druid.server.http.numThreads: 40, 40
Druid Deploy & Configuration (2)
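The direct memory settings above can be sanity-checked against Druid's documented sizing rule, which requires at least `druid.processing.buffer.sizeBytes * (numThreads + numMergeBuffers + 1)` of direct memory. The sketch below applies that rule to the values on this slide, assuming MB and GB are binary units.

```python
def required_direct_memory_mb(buffer_mb, num_threads, num_merge_buffers):
    """Minimum direct memory: one buffer per thread and merge buffer, plus one."""
    return buffer_mb * (num_threads + num_merge_buffers + 1)


# Values taken from the configuration slide.
settings = {
    "Broker": {"buffer_mb": 512, "threads": 20, "merge_buffers": 10, "max_direct_gb": 30},
    "Historical": {"buffer_mb": 800, "threads": 30, "merge_buffers": 20, "max_direct_gb": 45},
}

for role, s in settings.items():
    needed = required_direct_memory_mb(s["buffer_mb"], s["threads"], s["merge_buffers"])
    available = s["max_direct_gb"] * 1024
    print(f"{role}: needs {needed} MB, has {available} MB, fits={needed <= available}")
```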
Use Yarn External Resource for Batch Ingestion
"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "yarn.resourcemanager.hostname": "host1.com",
    "yarn.resourcemanager.address": "host1.com:8032",
    "yarn.resourcemanager.scheduler.address": "host1.com:8030",
    "yarn.resourcemanager.webapp.address": "host1.com:8088",
    "yarn.resourcemanager.resource-tracker.address": "host1.com:8031",
    "yarn.resourcemanager.admin.address": "host1.com:8033"
  }
}
Ingest Spec for External Yarn and HDFS
Use External HDFS for intermediate MR output
"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "fs.defaultFS": "hdfs://DEFAULT_FS:8020",
    "dfs.namenode.http-address": "NAMENODE:50070",
    "dfs.namenode.https-address": "NAMENODE:50470",
    "dfs.namenode.servicerpc-address": "NAMENODE:8022"
  }
}
Ingest Spec for External Yarn and HDFS
Lambda Architecture with Two Databases
https://en.wikipedia.org/wiki/Lambda_architecture
Lambda Architecture with Druid
https://www.slideshare.net/gianmerlino/druid-at-sf-big-analytics-2015-1201
Why Druid? – Simple Lambda Architecture
How Kafka Indexing Service
https://github.com/knoguchi/cm-druid
Druid on CDH
Extending Druid Query
- 1. Accumulated Metric in TopN
- 2. Stable TopN Result
[Diagram labels: Query, Second Query, Historical, Result, Result, Row stream]
Extending Druid Query
[Diagram: Client, Broker, Historical; a Cursor feeds rows into Aggregation]
Extending Druid Query - Motivation
2 queries are needed to make the following table:
- 1. Total 3 times TopN query for 3 countries
- 2. Aggregation query for total duration

Country   SUM(duration)   Ratio over total duration
korea     225             20%
uk        171             15.2%
usa       33              2.9%
Can we do it at once?
Extending Druid Query - Background
Yes we can! Just do TopN operation and SUM operation simultaneously!
Segment Data
country   duration
korea     100
korea     14
uk        40
uk        7
usa       21
china     17

Aggregated in map structure
country   SUM(duration)
korea     114
china     17
usa       21
uk        47

Final records
country   SUM(duration)
korea     114
uk        47
usa       21

Total duration equals sum of all metric values!
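Computing the TopN and the grand total in one pass over the rows can be sketched like this. The rows are the segment data from the slide; the function is only an illustration of the idea, not the customized Druid code.

```python
from collections import defaultdict


def topn_with_total(rows, n):
    """Aggregate per country and accumulate the overall total in the same scan."""
    sums = defaultdict(int)
    total = 0
    for country, duration in rows:
        sums[country] += duration   # per-dimension aggregate (map structure)
        total += duration           # accumulated metric over all rows
    top = sorted(sums.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
    return top, total


rows = [
    ("korea", 100), ("korea", 14),
    ("uk", 40), ("uk", 7),
    ("usa", 21), ("china", 17),
]

top, total = topn_with_total(rows, 3)
for country, value in top:
    print(country, value, f"{value / total:.1%}")
print("total", total)
```

With the total available alongside the top rows, the ratio column needs no second query.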
User Request
{
  "queryType": "topN",
  ...
  "metric": "edits",
  "accMetrics": ["edits"],
  ...
}
Druid Response
{
  ...
  "edits": 33,
  "__acc_edits": 1234
  ...
}
Extending Druid Query in TopN
[Diagram: Broker, Historical; a Cursor feeds rows into TopN Aggregation, which updates the TopN Queue and the Count Metric]
We customized Druid to calculate total edits and metric at once!
Huge intermediate files with MapReduce
- Druid's default Batch Ingestion uses MapReduce
- To ingest a 1.4GB Parquet file (Single Dim. Partition)
- Read: 16.6GB
- Write: 20.5GB
- Total: 37.1GB
Druid Spark Batch
We modified the original Druid Spark Batch
- https://github.com/metamx/druid-spark-batch
- Original version of Druid Spark Batch from Metamarkets (creator of Druid)
- We added some features
- Parquet input
- Single Dimension Partition
- Query Granularity
- Same ingest spec as Druid MapReduce Batch
Druid Spark Batch
Druid Spark Batch
[Bar chart, Disk Read, Write (GB, lower is better): MapReduce, Spark; values as extracted: 37.1, 7]
[Bar chart, Ingest time, Single Dim Partition, 3 Segments, 430MB each (seconds, lower is better): MapReduce, Spark; values as extracted: 759, 2260]
[Bar chart, Ingest time, Single Dim Partition, 11 Segments, 135MB each (seconds, lower is better): MapReduce, Spark; values as extracted: 333, 376]