Web analytics at scale with Druid at naver.com - Jason Heo



slide-1
SLIDE 1

Web analytics at scale with Druid at naver.com

Jason Heo (analytic.js.heo@navercorp.com) Doo Yong Kim (dooyong.kim@navercorp.com)

slide-2
SLIDE 2
  • Part 1
  • About naver.com
  • What is & Why Druid
  • The Architecture of our service
  • Part 2
  • Druid Segment File Structure
  • Spark Druid Connector
  • TopN Query
  • Plywood & Split-Apply-Combine
  • How to fix TopN’s unstable results
  • Appendix

Agenda

slide-3
SLIDE 3

About naver.com

https://en.wikipedia.org/wiki/Naver

  • naver.com
  • The biggest website in South Korea
  • The Google of South Korea
  • 74.7% of all web searches in South Korea
slide-4
SLIDE 4
Jason Heo
  • Developed Analytics Systems at Naver
  • Working with Databases since 2000
  • Author of 3 MySQL books
  • Currently working with Elasticsearch, Spark, Kudu, and Druid
  • Working on a Spark and Druid-based OLAP platform

Doo Yong Kim
  • Implemented search infrastructure at coupang.com
  • Has been interested in MPP and advanced file formats for big data

About Speakers

slide-5
SLIDE 5

Platforms we've tested so far

  • Storage Formats: Parquet, ORC, CarbonData, Elasticsearch, ClickHouse, Kudu, Druid
  • Query Engines: SparkSQL, Hive, Impala, Drill, Presto, Kylin, Phoenix

slide-6
SLIDE 6
  • What is Druid?
  • Our Requirements
  • Why Druid?
  • Experimental Results

What is & Why Druid

slide-7
SLIDE 7
  • Column-oriented distributed datastore
  • Real-time streaming ingestion
  • Scalable to petabytes of data
  • Approximate algorithms (HyperLogLog, Theta Sketch)

https://www.slideshare.net/HadoopSummit/scalable-realtime-analytics-using-druid (from Hortonworks)

What is Druid?

slide-8
SLIDE 8

From my point of view

  • Druid is a cumbersome version of Elasticsearch (w/o search feature)
  • Similar points
  • Secondary Index
  • DSLs for query
  • Flow of Query Processing
  • Terms Aggregation ↔ TopN Query, Coordinator ↔ Broker, Data Node ↔ Historical
  • Different points
  • more complicated to operate
  • better with much more data
  • better for Ultra High Cardinality
  • less GC overhead
  • better for Spark Connectivity (for Full Scan)

What is Druid?

slide-9
SLIDE 9

What is Druid? - Architecture

  • Node types: Real-time Node, Historical, Broker, Overlord, MiddleManager, Coordinator
  • Kafka feeds the Index Service for streaming ingestion; the Coordinator handles segment management
  • MySQL: stores metadata
  • Zookeeper: cluster management
  • Deep Storage (HDFS, S3): stores Druid segments for durability; Historicals download segments for query
  • Query service: clients send Druid DSL to the Broker

slide-10
SLIDE 10

What is Druid? - Queries

  • SQL can be converted to Druid DSL
  • No JOIN

groupBy DSL:

{
  "queryType": "groupBy",
  "dataSource": "sample_data",
  "dimensions": ["country", "device"],
  "filter": {...},
  "aggregations": [...],
  "limitSpec": {...}
}

topN DSL:

{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "filter": {...},
  "aggregations": [...],
  "threshold": 5
}
slide-11
SLIDE 11

Why Druid? - Requirements

  • 1. Random Access (OLTP):
    SELECT COUNT(*) FROM logs WHERE url = ?;

  • 2. Most Viewed:
    SELECT url, COUNT(*) FROM logs GROUP BY url ORDER BY COUNT(*) DESC LIMIT 10;

  • 3. Full Aggregation:
    SELECT visitor, COUNT(*) FROM logs GROUP BY visitor;

  • 4. JOIN:
    SELECT ... FROM logs INNER JOIN users GROUP BY ... HAVING ...

slide-12
SLIDE 12
Why Druid?

For OLTP:
  • Supports Bitmap Index
  • Fast Random Access
For OLAP:
  • Supports TopN Query (100x faster than a GroupBy query)
  • Supports complex queries (JOIN, HAVING, etc.) with our Spark Druid Connector

Perfect solution for OLTP and OLAP:
  • 1. Random Access: ★★★★☆
  • 2. Most Viewed: ★★★★★
  • 3. Full Aggregation: ★★★★☆
  • 4. JOIN: ★★★★☆
slide-13
SLIDE 13
Comparison – Elasticsearch

Pros:
  • Fast Random Access
  • Terms Aggregation
  • TopN Query
  • Easy to manage
Cons:
  • Slow full scan with es-hadoop
  • Low performance for multi-field terms aggregation (esp. high cardinality)
  • GC overhead

  • 1. Random Access: ★★★★★
  • 2. Most Viewed: ★★★☆☆
  • 3. Full Aggregation: ☆☆☆☆☆
  • 4. JOIN: ☆☆☆☆☆

slide-14
SLIDE 14
Comparison – Kudu + Impala

Pros:
  • Fast Random Access via Primary Key
  • Fast OLAP with Impala
Cons:
  • No Secondary Index
  • No TopN Query

  • 1. Random Access: ★★★★★ (PK) / ★☆☆☆☆ (non-PK)
  • 2. Most Viewed: ☆☆☆☆☆
  • 3. Full Aggregation: ★★★★★
  • 4. JOIN: ★★★★★
slide-15
SLIDE 15

Experimental Results – Response Time (seconds, lower is better)

                            Elasticsearch   Kudu+Impala   Druid
  Random Access                 0.003           0.14       0.03
  Most Viewed (1 field)         0.25            0.35       0.08
  Most Viewed (2 fields)        2.7             2.9        0.78

slide-16
SLIDE 16

Experimental Results – Notes

  • ES: Lucene Index
  • Kudu+Impala: Primary Key
  • Druid: Bitmap Index

Random Access

  • ES: Terms Aggregation
  • Kudu+Impala: Group By
  • Druid: TopN
  • Split-Apply-Combine for Multi Fields

Most Viewed

  • 210 mil. rows
  • same parallelism
  • same number of shards/partitions/segments

Data Sets

slide-17
SLIDE 17

The Architecture of our service

  • Real-time ingestion: logs flow through Kafka into Druid's Kafka Indexing Service
  • Batch ingestion: a daily batch job transforms logs, removes duplicated logs, and writes Parquet files, which are then batch-ingested into Druid
  • Druid cluster: Coordinator, Overlord, MiddleManager, Peon, Broker, Historical
  • Query path: Zeppelin issues queries through Plywood, which generates Druid DSL for the Broker; the API Server and Spark Thrift Server run SparkSQL, whose Spark Executors read segment files directly

slide-18
SLIDE 18

Switching

slide-19
SLIDE 19

Introduction – Who am I?

  • 1. Doo Yong Kim
  • 2. Naver
  • 3. Software engineer
  • 4. Big data
slide-20
SLIDE 20

Contents

  • 1. Druid Storage Model
  • 2. Spark Druid Connector Implementation
  • 3. TopN Query
  • 4. Plywood & Split-Apply-Combine
  • 5. Extending Druid Query
slide-21
SLIDE 21

Druid Storage Model – 4 characteristics

  • Columnar format
  • Explicitly distinguishes between dimensions and metrics
  • Bitmap index
  • Dictionary encoded
slide-22
SLIDE 22

Druid Storage Model - background

Druid treats dimensions and metrics separately:
  • Dimension: has a Bitmap Index; used as GroupBy fields
  • Metric: used as the argument of an aggregate function

Druid Ingestion Spec:

{
  "dimensionsSpec": {
    "dimensions": ["country", "device", ...]
  },
  ...
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "doubleSum", "fieldName": "duration", "name": "duration" }
  ]
}

slide-23
SLIDE 23

Druid Storage Model - Dimension

Country (Dimension) column: Korea, UK, Korea, Korea, Korea, UK

  • Dictionary for country: Korea ↔ 0, UK ↔ 1
  • Dictionary-encoded values: 0, 1, 0, 0, 0, 1
  • Bitmap for Korea: 101110; Bitmap for UK: 010001 (UK appears in the 2nd and 6th rows)
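The encoding above can be sketched in a few lines of Python (illustrative only, not Druid's actual code):

```python
# Sketch of Druid-style dictionary encoding and per-value bitmap indexing
# for the country column above.

def encode_column(values):
    # Dictionary: each distinct value gets an integer id, in order of first appearance.
    dictionary = {}
    encoded = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        encoded.append(dictionary[v])
    # Bitmap per value: bit i is 1 when row i holds that value.
    bitmaps = {v: "".join("1" if x == v else "0" for x in values) for v in dictionary}
    return dictionary, encoded, bitmaps

dictionary, encoded, bitmaps = encode_column(
    ["Korea", "UK", "Korea", "Korea", "Korea", "UK"])
print(dictionary)        # {'Korea': 0, 'UK': 1}
print(encoded)           # [0, 1, 0, 0, 0, 1]
print(bitmaps["Korea"])  # 101110
print(bitmaps["UK"])     # 010001
```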

slide-24
SLIDE 24

Druid Storage Model - Metric

  Country (Dimension)   duration (Metric)
  Korea                 13
  UK                    2
  Korea                 15
  Korea                 29
  Korea                 30
  UK                    14

Metric values are stored as a plain column: 13, 2, 15, 29, 30, 14

slide-25
SLIDE 25

Druid Storage Model

SELECT country, device, duration FROM logs
WHERE country = 'Korea' AND device LIKE 'Iphone%'

  • country = 'Korea': filtered by the country bitmap index
  • device LIKE 'Iphone%': no bitmap for a LIKE pattern, so rows are filtered manually
  • Result row: ('Korea', 'Iphone 6s', 13)
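A minimal Python sketch of the two-stage filter above: the equality predicate is answered by the precomputed bitmap, while the LIKE predicate scans the surviving rows. All rows except ('Korea', 'Iphone 6s', 13) are made up for illustration.

```python
# Sketch: bitmap filter for country = 'Korea', then a manual row filter
# for device LIKE 'Iphone%'. Only the first row comes from the slide;
# the rest are hypothetical.
rows = [("Korea", "Iphone 6s", 13), ("UK", "Galaxy", 2), ("Korea", "Iphone 7", 15),
        ("Korea", "Pixel", 29), ("Korea", "Iphone 6s", 30), ("UK", "Iphone 5", 14)]

# Bitmap for country = 'Korea' (in Druid, built at ingestion time).
korea_bitmap = [1 if r[0] == "Korea" else 0 for r in rows]

result = []
for i, bit in enumerate(korea_bitmap):
    if bit:                                  # cheap bitmap check first
        country, device, duration = rows[i]
        if device.startswith("Iphone"):      # manual row filter for LIKE 'Iphone%'
            result.append((country, device, duration))

print(result)
# [('Korea', 'Iphone 6s', 13), ('Korea', 'Iphone 7', 15), ('Korea', 'Iphone 6s', 30)]
```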

slide-26
SLIDE 26

Spark Druid Connector

slide-27
SLIDE 27

Spark Druid Connector

  • 1. 3 Ways to implement, Our implementation
  • 2. What is needed to implement
  • 3. Sample Codes, Performance Test
  • 4. How to implement
slide-28
SLIDE 28

Spark Druid Connector - 3 Ways to implement

1st way: the Spark Driver rewrites SQL into DSL and sends it to the Druid Broker
  • Good if the SQL is rewritable to DSL
  • But DSL does not support all SQL (e.g., JOIN, sub-query)

2nd way: Spark Executors fetch rows from Historicals via Select DSL (large JSON)
  • Easy to implement; no need to understand the Druid index library
  • But ser/de is expensive, and parallelism is bounded by the number of Historicals

slide-29
SLIDE 29

Spark Druid Connector - 3 Ways to implement

3rd way: Spark Executors read Druid segment files directly using the Druid library, with executors allocated on the Historical nodes
  • Similar to the way Parquet is read
  • Difficult to implement: requires understanding the Druid segment library

We chose this way!

slide-30
SLIDE 30

Spark Druid Connector – How to use

Create table:

spark.read
  .format("com.navercorp.ni.druid.spark.druid")
  .option("coordinator", "host1.com:18081")
  .option("broker", "host2.com:18082")
  .option("datasource", "logs")
  .load()
  .createOrReplaceTempView("logs")

Execute query:

spark.sql("""
  SELECT country, device, duration
  FROM logs
  WHERE country = 'Korea' AND device LIKE 'Iphone%'
""").show(false)

slide-31
SLIDE 31

Spark Druid Connector - Performance (total 4.4B rows; seconds, lower is better)

  Random Access:         Spark Druid 0.21  vs  Spark Parquet 7.5
  Full Scan & GROUP BY:  Spark Druid 24.1  vs  Spark Parquet 7.7

slide-32
SLIDE 32

Spark Druid Connector – How to implement

slide-33
SLIDE 33

Spark Druid Connector – How to implement

  • 1. Druid Rest API
  • 2. Druid Segment Library
  • 3. Spark Data Source API
slide-34
SLIDE 34

Spark Druid Connector – Get table schema

On load(), the Spark Driver asks the Druid Broker for the schema with a segmentMetadata query:

{
  "queryType": "segmentMetadata",
  "dataSource": "logs",
  "merge": "true"
}

The response contains the columns:

{
  "columns": {
    "__time": {...},
    "country": {...},
    "device": {...},
    "duration": {...}
    ...
  }
}
slide-35
SLIDE 35

Spark Druid Connector – Partition pruning

WHERE country = 'Korea' AND __time = CAST('2018-05-23' AS TIMESTAMP)

Segments can be pruned by the interval condition and by single-dimension partitioning:

  • 1. Interval condition: the serverview returns only matched segments
  • 2. Single-dimension partition: compare each shardSpec's start and end with the given filter

The Spark Driver calls the Druid Coordinator:

GET /.../logs/intervals/2018-05-23/serverview
[
  {
    "segment": {
      "shardSpec": { "dimension": "country", "start": null, "end": "b", ... },
      "id": "segmentId"
    },
    "servers": [ { "host": "host1" }, { "host": "host2" } ]
  },
  { "segment": ... },
  ...
]
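The single-dimension pruning check can be sketched in Python: a segment is kept only when the filter value falls inside the shardSpec's [start, end) range, with null meaning unbounded. The segment list below is hypothetical, modeled after the serverview response above.

```python
# Sketch of single-dimension partition pruning against shardSpec ranges.
segments = [
    {"id": "seg_a", "shardSpec": {"dimension": "country", "start": None, "end": "b"}},
    {"id": "seg_b", "shardSpec": {"dimension": "country", "start": "b", "end": "l"}},
    {"id": "seg_c", "shardSpec": {"dimension": "country", "start": "l", "end": None}},
]

def prune(segments, dimension, value):
    kept = []
    for seg in segments:
        spec = seg["shardSpec"]
        if spec["dimension"] != dimension:
            kept.append(seg)  # cannot prune on another dimension
            continue
        start, end = spec["start"], spec["end"]
        # None = unbounded; keep when start <= value < end.
        if (start is None or start <= value) and (end is None or value < end):
            kept.append(seg)
    return kept

print([s["id"] for s in prune(segments, "country", "korea")])  # ['seg_b']
```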

slide-36
SLIDE 36

Spark Druid Connector – Spark filters to Druid filters

WHERE country = 'Korea' AND city = 'Seoul'

buildScan(
  requiredColumns: [country, device, duration],
  filters: [EqualTo(country, Korea), EqualTo(city, Seoul)]
)

Spark's filters are converted into Druid's DimFilter:

private def toDruidDimFilters(sparkFilter: Filter): DimFilter = {
  sparkFilter match {
    ...
    case EqualTo(attribute, value) =>
      new SelectorDimFilter(attribute, value.toString, null)
    case GreaterThan(attribute, value) => ...
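The same conversion can be sketched in Python, with plain dicts standing in for Druid's Java filter classes (a hypothetical mirror of toDruidDimFilters, not the connector's code); the selector and bound filter shapes follow Druid's native filter JSON.

```python
# Sketch: convert a Spark filter tuple into a Druid native-filter dict.
def to_druid_dim_filter(spark_filter):
    kind = spark_filter[0]
    if kind == "EqualTo":
        _, attribute, value = spark_filter
        # Equivalent of new SelectorDimFilter(attribute, value.toString, null)
        return {"type": "selector", "dimension": attribute, "value": str(value)}
    if kind == "GreaterThan":
        _, attribute, value = spark_filter
        # Druid's bound filter with an exclusive lower bound
        return {"type": "bound", "dimension": attribute, "lower": str(value),
                "lowerStrict": True, "ordering": "lexicographic"}
    raise NotImplementedError(kind)

print(to_druid_dim_filter(("EqualTo", "country", "Korea")))
# {'type': 'selector', 'dimension': 'country', 'value': 'Korea'}
```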

slide-37
SLIDE 37

Spark Druid Connector – Attach locality to RACK_LOCAL

  • getPreferredLocations(partition: Partition)
  • Returns Hosts having Druid Segments
  • Caution: Spark does not always guarantee that executors launch on preferred locations
  • Set spark.locality.wait to a very large value
slide-38
SLIDE 38

Spark Druid Connector - How to implement

Done!

Now Spark executors can read records directly from Druid segment files.

slide-39
SLIDE 39

TopN Query

slide-40
SLIDE 40

TopN Query

  • 1. How TopN Query works
  • 2. Performance
  • 3. Limitation
slide-41
SLIDE 41

TopN Query – We heavily use TopN query

TopN Query flow (N=100):

  • 1. The user sends a TopN query to the Broker
  • 2. Each Historical node returns its local top 100 results (served from its segment cache)
  • 3. The Broker merges each node's results and makes the final records
  • 4. The client gets the merged result from the Broker

slide-42
SLIDE 42

TopN Query - Example: Top 3 country ORDER BY SUM(duration)

  Top 3 of Historical a:  korea 114, uk 47, usa 21
  Top 3 of Historical b:  uk 67, korea 24, usa 3
  Top 3 of Historical c:  korea 87, uk 57, china 33

  Broker merge:   korea 225, uk 171, china 33, usa 24
  Top 3 result:   korea 225, uk 171, china 33

slide-43
SLIDE 43

TopN – is an approximate approach

Full per-node aggregates (top 4 shown):

  Historical a:  korea 114, uk 47, usa 21, china 17
  Historical b:  uk 67, korea 24, usa 3, china 1
  Historical c:  korea 87, uk 57, china 33, usa 22

china's values on Historicals a and b fall outside those nodes' local top 3, so they are missing from the merged result: china is reported as 33 instead of the exact 51.
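The under-counting above can be reproduced with a small Python sketch of the merge (illustrative, not Druid's implementation): each Historical contributes only its local top 3, so china's small per-node values never reach the Broker.

```python
# Sketch: approximate TopN merge vs. the exact answer, using the slide's data.
from collections import Counter

historicals = [
    {"korea": 114, "uk": 47, "usa": 21, "china": 17},  # Historical a
    {"uk": 67, "korea": 24, "usa": 3, "china": 1},     # Historical b
    {"korea": 87, "uk": 57, "china": 33, "usa": 22},   # Historical c
]

def topn(counts, n):
    return dict(Counter(counts).most_common(n))

# Broker merges only each node's local top 3.
merged = Counter()
for h in historicals:
    merged.update(topn(h, 3))
print(topn(merged, 3))  # {'korea': 225, 'uk': 171, 'china': 33}

# Exact totals: china is really 51, not 33.
exact = Counter()
for h in historicals:
    exact.update(h)
print(topn(exact, 3))   # {'korea': 225, 'uk': 171, 'china': 51}
```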

slide-44
SLIDE 44

TopN – 100x faster than GroupBy

  GroupBy (few minutes)     TopN (1,536 ms)
  rank    metric            rank    metric
  1       1,948,297         1       1,948,297
  2       1,404,167         2       1,404,167
  3       1,383,538         3       1,383,538
  4       1,141,977         4       1,141,977
  5       1,099,028         5       1,090,277
  6       1,090,277         6       1,079,242
  7       1,051,448         7       1,051,448
  8       996,961           8       996,961
  9       941,284           9       941,284
  10      937,078           10      937,078

100x faster, at a price:

  • 1. rank changed (rank 5 → rank 6)
  • 2. value changed (1,099,028 → 1,079,242)
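The two discrepancies above can be located mechanically; a small Python check comparing the two rankings:

```python
# Sketch: diff the exact GroupBy ranking against the approximate TopN ranking.
groupby = [1948297, 1404167, 1383538, 1141977, 1099028,
           1090277, 1051448, 996961, 941284, 937078]
topn    = [1948297, 1404167, 1383538, 1141977, 1090277,
           1079242, 1051448, 996961, 941284, 937078]

diffs = [(rank, g, t)
         for rank, (g, t) in enumerate(zip(groupby, topn), start=1)
         if g != t]
for rank, g, t in diffs:
    print(f"rank {rank}: GroupBy={g:,} TopN={t:,}")
# rank 5: GroupBy=1,099,028 TopN=1,090,277
# rank 6: GroupBy=1,090,277 TopN=1,079,242
```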

slide-45
SLIDE 45

TopN – Limitations

  • 1. TopN supports only one dimension.
  • 2. Unstable results when the replication factor is 2 or larger.
slide-46
SLIDE 46

Plywood

  • 1. Plywood
  • 2. Split-Apply-Combine
  • 3. Our Improvement
slide-47
SLIDE 47
  • 1. https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf
  • 2. http://plywood.imply.io/index

Split Apply Combine - SAC

The SQL

SELECT country, city, device FROM $TABLE
WHERE ... GROUP BY country, city, device

is expressed in Plywood as nested single-dimension splits:

// Split [country, city, device]
ply()
  .apply(dataSource, $(dataSource).filter(...))  // Filter 1
  .apply(dataSource, $(dataSource).filter(...))  // Filter 2
  .apply(dataSource, $(dataSource).filter(...))  // Filter 3
  .apply('country', $(dataSource).split(...)
    .apply(...)                                  // Filter to Split 1 (country)
    .apply('city', $(dataSource).split(...)
      .apply(...)                                // Filter to Split 2 (city)
      .apply(...)                                // Filter to Split 2 (city)
      .apply('device', $(dataSource).split(...)
        .apply(...)                              // Filter to Split 3 (device)
      )
    )
  )

slide-48
SLIDE 48

Plywood tuning: query plan, before vs. after (chart)

Tuning Results: throughput in qps (higher is better), before vs. after (chart)

slide-50
SLIDE 50

Challenge

slide-51
SLIDE 51

Stable TopN - Motivation

The same query can return different results when the replication factor is 2 or more.

Segments Seg_1, Seg_2, and Seg_3 are replicated on Historical 1 and Historical 2. Depending on which replica serves each segment, one run may compute TopN(Seg_1 + Seg_2) on Historical 1 and TopN(Seg_3) on Historical 2, while another run computes TopN(Seg_1) on Historical 1 and TopN(Seg_2 + Seg_3) on Historical 2. Because TopN is approximate, the first and second results can differ (First Result != Second Result).

slide-52
SLIDE 52

by_segment patch

Bypass the Historical-side TopN merge and merge on the Broker side instead: TopN results are computed for each segment and merged by segment ID. Both runs now compute TopN(Seg_1) + TopN(Seg_2) + TopN(Seg_3), no matter which Historical serves each segment, so the first and second results are always identical (First Result == Second Result).
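A minimal Python sketch of the by_segment idea (an illustration of the patch's behavior, not Druid's code): per-segment TopN results are keyed by segment ID on the Broker, so duplicate replicas collapse and the merged answer no longer depends on replica assignment.

```python
# Sketch: Broker-side merge of per-segment TopN results, keyed by segment ID.
from collections import Counter

segment_data = {
    "seg_1": {"korea": 114, "uk": 47, "usa": 21},
    "seg_2": {"uk": 67, "korea": 24, "usa": 3},
    "seg_3": {"korea": 87, "uk": 57, "china": 33},
}

def topn_by_segment(served, n=3):
    # served: list of (segment_id, data); replicas may appear more than once.
    by_id = {}
    for seg_id, data in served:
        by_id[seg_id] = dict(Counter(data).most_common(n))  # dedupe by segment ID
    merged = Counter()
    for local_top in by_id.values():
        merged.update(local_top)
    return dict(merged.most_common(n))

# First run: Historical 1 serves seg_1 and seg_2, Historical 2 serves seg_3.
run1 = topn_by_segment([("seg_1", segment_data["seg_1"]),
                        ("seg_2", segment_data["seg_2"]),
                        ("seg_3", segment_data["seg_3"])])
# Second run: a different replica assignment; seg_2 is served twice.
run2 = topn_by_segment([("seg_1", segment_data["seg_1"]),
                        ("seg_2", segment_data["seg_2"]),
                        ("seg_2", segment_data["seg_2"]),
                        ("seg_3", segment_data["seg_3"])])
print(run1 == run2)  # True
```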

slide-53
SLIDE 53

Navis @ SK Telecom Ens @ Naver

Special Thanks

slide-54
SLIDE 54

Thank you!

slide-55
SLIDE 55

Appendix

slide-56
SLIDE 56
  • 10 Broker Nodes
  • 40 Historical Nodes
  • 2 MiddleManager & Overlord Nodes
  • 2 Coordinator Nodes
  • 10 Yarn & HDFS Nodes for Batch Ingestion
  • Spark Standalone Cluster runs on Historical Nodes
  • for Locality

Druid Deploy & Configuration (1)

slide-57
SLIDE 57
  • Druid version: 0.11
  • H/W spec for Broker & Historical
  • CPU: 40 cores (w/ hyperthreading)
  • RAM: 128GB
  • HDD: SSD w/ RAID 5
  • Memory configuration:

  Configuration                        Broker   Historical
  -Xmx                                 20GB     12GB
  -XX:MaxDirectMemorySize              30GB     45GB
  druid.processing.numMergeBuffers     10       20
  druid.processing.numThreads          20       30
  druid.processing.buffer.sizeBytes    512MB    800MB
  druid.cache.sizeInBytes              5GB
  druid.server.http.numThreads         40       40

Druid Deploy & Configuration (2)

slide-58
SLIDE 58

Use an external Yarn cluster for Batch Ingestion:

"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "yarn.resourcemanager.hostname": "host1.com",
    "yarn.resourcemanager.address": "host1.com:8032",
    "yarn.resourcemanager.scheduler.address": "host1.com:8030",
    "yarn.resourcemanager.webapp.address": "host1.com:8088",
    "yarn.resourcemanager.resource-tracker.address": "host1.com:8031",
    "yarn.resourcemanager.admin.address": "host1.com:8033"
  }
}

Ingest Spec for External Yarn and HDFS

slide-59
SLIDE 59

Use an external HDFS for intermediate MR output:

"tuningConfig": {
  "type": "hadoop",
  "jobProperties": {
    "fs.defaultFS": "hdfs://DEFAULT_FS:8020",
    "dfs.namenode.http-address": "NAMENODE:50070",
    "dfs.namenode.https-address": "NAMENODE:50470",
    "dfs.namenode.servicerpc-address": "NAMENODE:8022"
  }
}

Ingest Spec for External Yarn and HDFS

slide-60
SLIDE 60

Lambda Architecture with Two Databases

https://en.wikipedia.org/wiki/Lambda_architecture

Lambda Architecture with Druid

https://www.slideshare.net/gianmerlino/druid-at-sf-big-analytics-2015-1201

Why Druid? – Simple Lambda Architecture

slide-61
SLIDE 61

How the Kafka Indexing Service Works

slide-62
SLIDE 62

https://github.com/knoguchi/cm-druid

Druid on CDH

slide-63
SLIDE 63

Extending Druid Query

  • 1. Accumulated Metric in TopN
  • 2. Stable TopN Result
slide-64
SLIDE 64

Extending Druid Query

Normal query flow: Client → Broker → Historical. On the Historical, a Cursor streams rows into the Aggregation; answering a second query means streaming the row set again.

slide-65
SLIDE 65

Extending Druid Query - Motivation

Two queries are needed to make the following table:

  • 1. A TopN query for the top 3 countries
  • 2. An aggregation query for the total duration

  Country   SUM(duration)   Ratio over total duration
  korea     225             20%
  uk        171             15.2%
  usa       33              2.9%

Can we do it at once?

slide-66
SLIDE 66

Extending Druid Query - Background

Yes we can! Just do the TopN operation and the SUM operation simultaneously:

  Segment data:           korea 100, korea 14, uk 40, uk 7, usa 21, china 17
  Aggregated in a map:    korea 114, china 17, usa 21, uk 47
  Final records (top 3):  korea 114, uk 47, usa 21

The total duration equals the sum of all metric values in the map!
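The single-pass idea above can be sketched in Python (illustrative, not Druid's Historical code): one cursor pass feeds both the per-country aggregation map and the accumulated total, so the ratio column needs no second query.

```python
# Sketch: one pass computes the TopN aggregation map and the total at once,
# using the segment data from the slide.
from collections import Counter

segment_rows = [("korea", 100), ("korea", 14), ("uk", 40),
                ("uk", 7), ("usa", 21), ("china", 17)]

agg = Counter()
total = 0
for country, duration in segment_rows:  # single cursor pass
    agg[country] += duration            # TopN aggregation map
    total += duration                   # accumulated metric (like __acc_edits)

top3 = agg.most_common(3)
print(top3)   # [('korea', 114), ('uk', 47), ('usa', 21)]
print(total)  # 199
for country, d in top3:
    print(country, d, f"{100 * d / total:.1f}%")
```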

slide-67
SLIDE 67

Extending Druid Query in TopN

User Request:

{
  "queryType": "topN",
  ...
  "metric": "edits",
  "accMetrics": ["edits"],
  ...
}

Druid Response:

{
  ...
  "edits": 33,
  "__acc_edits": 1234
  ...
}

We customized Druid to calculate the total edits and the TopN metric at once: as the Cursor streams rows on the Historical, each row feeds both the TopN queue and the accumulated count.

slide-68
SLIDE 68

Huge intermediate files with MapReduce

  • Druid's default batch ingestion uses MapReduce
  • To ingest a 1.4GB Parquet file (single-dimension partitioning):
  • Read: 16.6GB
  • Write: 20.5GB
  • Total: 37.1GB

Druid Spark Batch

slide-69
SLIDE 69

We modified the original Druid Spark Batch

  • https://github.com/metamx/druid-spark-batch
  • The original version of Druid Spark Batch is from Metamarkets (creator of Druid)
  • We added some features:
  • Parquet input
  • Single-dimension partitioning
  • Query granularity
  • The same ingest spec as Druid MapReduce batch

slide-70
SLIDE 70

  Disk read & write (GB, lower is better):
    MapReduce 37.1  vs  Spark 7

  Ingest time, single-dim partition (3 segments, 430MB each; seconds, lower is better):
    MapReduce 759  vs  Spark 2,260

  Ingest time, single-dim partition (11 segments, 135MB each; seconds, lower is better):
    MapReduce 333  vs  Spark 376

Druid Spark Batch