Web analytics at scale with Druid at naver.com
Jason Heo (analytic.js.heo@navercorp.com), Doo Yong Kim (dooyong.kim@navercorp.com)


  1. Web analytics at scale with Druid at naver.com Jason Heo (analytic.js.heo@navercorp.com) Doo Yong Kim (dooyong.kim@navercorp.com)

  2. Agenda
     • Part 1
       - About naver.com
       - What is & Why Druid
       - The Architecture of our service
     • Part 2
       - Druid Segment File Structure
       - Spark Druid Connector
       - TopN Query
       - Plywood & Split-Apply-Combine
       - How to fix TopN's unstable results
     • Appendix

  3. About naver.com • naver.com • The biggest website in South Korea • The Google of South Korea • 74.7% of all web searches in South Korea https://en.wikipedia.org/wiki/Naver

  4. About Speakers
     • Jason Heo
       - Developed analytics systems at Naver
       - Working with databases since 2000
       - Author of 3 MySQL books
       - Currently Elasticsearch, Spark, Kudu, and Druid
     • Doo Yong Kim
       - Working on a Spark and Druid-based OLAP platform
       - Implemented search infrastructure at coupang.com
       - Interested in MPP and advanced file formats for big data

  5. Platforms we've tested so far
     • Query Engine: SparkSQL, Hive, Impala, Drill, Presto, ClickHouse, Kylin, Phoenix
     • Storage Format: Parquet, Elasticsearch, ORC, Druid, Carbon Data, Kudu

  6. What is & Why Druid • What is Druid? • Our Requirements • Why Druid? • Experimental Results

  7. What is Druid? From HORTONWORKS
     • Column-oriented distributed datastore
     • Real-time streaming ingestion
     • Scalable to petabytes of data
     • Approximate algorithms (HyperLogLog, theta sketch)
     https://www.slideshare.net/HadoopSummit/scalable-realtime-analytics-using-druid

  8. What is Druid? From my point of view
     • Druid is a cumbersome version of Elasticsearch (w/o the search feature)
     • Similar points
       - Secondary index
       - DSLs for query
       - Flow of query processing: Terms Aggregation ↔ TopN Query, Coordinator ↔ Broker, Data Node ↔ Historical
     • Different points
       - More complicated to operate
       - Better with much more data
       - Better for ultra-high cardinality
       - Less GC overhead
       - Better for Spark connectivity (for full scan)

  9. What is Druid? - Architecture
     [Architecture diagram] Real-time Nodes ingest from Kafka; Historicals download segments from Deep Storage (HDFS, S3), which stores Druid segments for durability; clients send Druid DSL to the Broker (Query Service); the Coordinator handles segment management, the Overlord and Middle Manager make up the Index Service; MySQL stores metadata and Zookeeper handles cluster management.

  10. What is Druid? - Queries
      Clients send Druid DSL to the Broker, which queries Real-time and Historical nodes.

      {
        "queryType": "groupBy",
        "dataSource": "sample_data",
        "dimensions": ["country", "device"],
        "filter": {...},
        "aggregations": [...],
        "limitSpec": {...}
      }

      {
        "queryType": "topN",
        "dataSource": "sample_data",
        "dimension": "sample_dim",
        "filter": {...},
        "aggregations": [...],
        "threshold": 5
      }

      • SQLs (SELECT ... FROM dataSource) can be converted to Druid DSL
      • No JOIN
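      As a rough illustration of how a client submits this DSL, here is a minimal sketch that POSTs a topN query to the Broker over HTTP. The broker address reuses the host shown on slide 30; the datasource, interval, and aggregator names are made-up examples, and /druid/v2 is Druid's standard query endpoint.

      import java.net.{HttpURLConnection, URL}
      import java.nio.charset.StandardCharsets
      import scala.io.Source

      object DruidQueryExample {
        def main(args: Array[String]): Unit = {
          // Broker address (hypothetical, matching slide 30's option("broker", ...)).
          val broker = new URL("http://host2.com:18082/druid/v2")

          // A small topN query with the same shape as the DSL on this slide.
          val dsl =
            """{
              |  "queryType": "topN",
              |  "dataSource": "logs",
              |  "intervals": ["2018-01-01/2018-01-02"],
              |  "granularity": "all",
              |  "dimension": "url",
              |  "metric": "count",
              |  "threshold": 5,
              |  "aggregations": [{ "type": "longSum", "fieldName": "count", "name": "count" }]
              |}""".stripMargin

          val conn = broker.openConnection().asInstanceOf[HttpURLConnection]
          conn.setRequestMethod("POST")
          conn.setRequestProperty("Content-Type", "application/json")
          conn.setDoOutput(true)
          conn.getOutputStream.write(dsl.getBytes(StandardCharsets.UTF_8))

          // The Broker replies with a JSON array of results.
          println(Source.fromInputStream(conn.getInputStream, "UTF-8").mkString)
        }
      }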

  11. Why Druid? - Requirements
      1. Random Access (OLTP): SELECT COUNT(*) FROM logs WHERE url = ?;
      2. Most Viewed: SELECT url, COUNT(*) FROM logs GROUP BY url ORDER BY COUNT(*) DESC LIMIT 10;
      3. Full Aggregation: SELECT visitor, COUNT(*) FROM logs GROUP BY visitor;
      4. JOIN: SELECT ... FROM logs INNER JOIN users GROUP BY ... HAVING ...;

  12. Why Druid? Perfect solution for OLTP and OLAP
      1. Random Access ★★★★☆ (for OLTP)
         • Supports bitmap index, fast random access
      2. Most Viewed ★★★★★ (for OLAP)
         • Supports TopN query, 100x faster than a GroupBy query
      3. Full Aggregation ★★★★☆
      4. JOIN ★★★★☆
         • Supports complex queries (JOIN, HAVING, etc.) with our Spark Druid Connector

  13. Comparison – Elasticsearch
      1. Random Access ★★★★★
      2. Most Viewed ★★★☆☆
      3. Full Aggregation ☆☆☆☆☆
      4. JOIN ☆☆☆☆☆
      Pros
      • Fast random access
      • Terms Aggregation (TopN query)
      • Easy to manage
      Cons
      • Slow full scan with es-hadoop
      • Low performance for multi-field terms aggregation (esp. high cardinality)
      • GC overhead

  14. Comparison – Kudu + Impala
      1. Random Access ★★★★★ (PK) / ★☆☆☆☆ (non-PK)
      2. Most Viewed ☆☆☆☆☆
      3. Full Aggregation ★★★★★
      4. JOIN ★★★★★
      Pros
      • Fast random access via primary key
      • Fast OLAP with Impala
      Cons
      • No secondary index
      • No TopN query

  15. Experimental Results – Response Time
      [Bar charts] Response time in seconds (lower is better) for Random Access and Most Viewed (1 field vs. 2 fields), comparing Elasticsearch, Kudu+Impala, and Druid.

  16. Experimental Results – Notes
      • Random Access: ES uses the Lucene index, Kudu+Impala uses the primary key, Druid uses the bitmap index
      • Most Viewed: ES uses Terms Aggregation, Kudu+Impala uses GROUP BY, Druid uses TopN (Split-Apply-Combine for multiple fields)
      • Data sets: 210 mil. rows, same parallelism, same number of shards/partitions/segments

  17. The Architecture of our service
      [Architecture diagram] Logs flow into Kafka and are transformed into a second Kafka topic for real-time ingestion by the Druid Indexing Service (Overlord, Middle Manager, Peon). A daily Spark batch job removes duplicated logs, writes Parquet files, and performs batch ingestion into Druid segments served by Historicals. Queries reach the cluster either as SparkSQL (Spark Thrift Server and Spark executors reading Parquet) or as Druid DSL generated by Plywood and sent to the Broker; Zeppelin and an API server sit in front, and the Coordinator manages segments.

  18. Switching

  19. Introduction – Who am I? 1. Doo Yong Kim 2. Naver 3. Software engineer 4. Big data

  20. Contents 1. Druid Storage Model 2. Spark Druid Connector Implementation 3. TopN Query 4. Plywood & Split-Apply-Combine 5. Extending Druid Query

  21. Druid Storage Model – 4 characteristics
      • Columnar format
      • Explicitly distinguishes between dimensions and metrics
      • Bitmap index
      • Dictionary encoded

  22. Druid Storage Model - Background
      Druid treats dimensions and metrics separately.
      • Dimension: bitmap index, GROUP BY field
      • Metric: argument of an aggregate function

      Druid ingestion spec:
      {
        "dimensionsSpec": {
          "dimensions": ["country", "device", ...]
        },
        ...
        "metricsSpec": [
          { "type": "count", "name": "count" },
          { "type": "doubleSum", "fieldName": "duration", "name": "duration" }
        ]
      }

  23. Druid Storage Model - Dimension
      Column values for country (dimension): Korea, UK, Korea, Korea, Korea, UK
      • Dictionary for country: Korea ↔ 0, UK ↔ 1
      • Dictionary-encoded values: 0 1 0 0 0 1
      • Bitmap for Korea: 101110
      • Bitmap for UK: 010001 (UK appears in the 2nd and 6th rows)
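      To make the dictionary encoding and bitmaps above concrete, here is a small illustrative sketch (not Druid's actual code) that builds them for the example column using only the Scala standard library.

      object DictionaryBitmapSketch {
        def main(args: Array[String]): Unit = {
          // The example column from this slide.
          val country = Seq("Korea", "UK", "Korea", "Korea", "Korea", "UK")

          // Dictionary: each distinct value gets an integer id (Korea -> 0, UK -> 1).
          val dictionary: Map[String, Int] = country.distinct.zipWithIndex.toMap

          // Dictionary-encoded column: store ids instead of strings.
          val encoded: Seq[Int] = country.map(dictionary)

          // Bitmap index: one bit per row for each distinct value.
          val bitmaps: Map[String, Seq[Int]] =
            dictionary.keys.map(v => v -> country.map(c => if (c == v) 1 else 0)).toMap

          println(dictionary)                    // Map(Korea -> 0, UK -> 1)
          println(encoded.mkString(" "))         // 0 1 0 0 0 1
          println(bitmaps("Korea").mkString("")) // 101110
          println(bitmaps("UK").mkString(""))    // 010001
        }
      }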

  24. Druid Storage Model - Metric
      Rows (country, duration): (Korea, 13), (UK, 2), (Korea, 15), (Korea, 29), (Korea, 30), (UK, 14)
      • The duration (metric) column is stored as plain values: 13, 2, 15, 29, 30, 14

  25. Druid Storage Model
      SELECT country, device, duration FROM logs WHERE country = 'Korea' AND device LIKE 'Iphone%'
      • country = 'Korea': bitmap filtering (filter by bitmap)
      • device LIKE 'Iphone%': filter it manually, row by row
      • Matching row: ('Korea', 'Iphone 6s', 13)
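      The sketch below illustrates this two-step filtering on toy data (again, not Druid internals): the equality predicate is answered from a precomputed bitmap, and the LIKE predicate is then evaluated manually on the surviving rows only.

      object BitmapFilterSketch {
        // Toy rows mirroring the slide: (country, device, duration).
        val rows = Vector(
          ("Korea", "Iphone 6s", 13),
          ("UK",    "Galaxy S8",  2),
          ("Korea", "Iphone 7",  15),
          ("Korea", "Galaxy S7", 29))

        // Precomputed bitmap for country = 'Korea' (one bit per row).
        val koreaBitmap: Vector[Boolean] = rows.map(_._1 == "Korea")

        def main(args: Array[String]): Unit = {
          // Step 1: bitmap filtering selects candidate row ids without reading row data.
          val candidates = koreaBitmap.zipWithIndex.collect { case (true, i) => i }

          // Step 2: the LIKE 'Iphone%' predicate has no bitmap, so it is checked
          // row by row, but only on the candidates.
          val result = candidates.map(rows).filter { case (_, device, _) => device.startsWith("Iphone") }

          result.foreach(println) // (Korea,Iphone 6s,13) and (Korea,Iphone 7,15)
        }
      }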

  26. Spark Druid Connector

  27. Spark Druid Connector 1. 3 Ways to implement, Our implementation 2. What is needed to implement 3. Sample Codes, Performance Test 4. How to implement

  28. Spark Druid Connector - 3 Ways to implement
      • 1st way: the Spark Driver sends SQL to the Druid Broker and receives a large JSON result
        - Easy to implement; no need to understand the Druid index library
        - But ser/de of the large JSON is expensive
      • 2nd way: rewrite the SQL into Druid DSL and send it from Spark Executors to Historicals
        - Good if the SQL is rewritable to DSL
        - But the DSL does not support all of SQL (e.g., JOIN, sub-query)
        - Parallelism is bounded by the number of Historicals

  29. Spark Druid Connector - 3 Ways to implement
      • 3rd way: allocate Spark executors on the Historical nodes and read Druid segment files directly using the Druid segment library
        - Similar to the way Spark reads Parquet
        - Difficult to implement; requires understanding the Druid segment library
        - We chose this way! (a locality sketch follows below)
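      To show what "allocate Spark executors into Historical nodes" can mean in practice, here is a minimal, hypothetical sketch of an RDD with one partition per Druid segment whose preferred location is the Historical currently serving that segment. The class names and the segment-reading logic are placeholders, not the actual connector code.

      import org.apache.spark.{Partition, SparkContext, TaskContext}
      import org.apache.spark.rdd.RDD

      // Hypothetical partition type: one Spark partition per Druid segment,
      // remembering which Historical node currently serves that segment.
      case class DruidSegmentPartition(index: Int, segmentId: String, historicalHost: String)
        extends Partition

      // Minimal sketch of an RDD whose partitions are Druid segments. A real
      // connector would open the segment with the Druid segment library inside
      // compute(); here only the locality plumbing is shown.
      class DruidSegmentRDD(sc: SparkContext, segments: Seq[(String, String)])
        extends RDD[String](sc, Nil) {

        override protected def getPartitions: Array[Partition] =
          segments.zipWithIndex.map { case ((segmentId, host), i) =>
            DruidSegmentPartition(i, segmentId, host): Partition
          }.toArray

        // Reporting the Historical's hostname lets the Spark scheduler place the
        // task on the node that already has the segment on local disk.
        override protected def getPreferredLocations(split: Partition): Seq[String] =
          Seq(split.asInstanceOf[DruidSegmentPartition].historicalHost)

        override def compute(split: Partition, context: TaskContext): Iterator[String] = {
          val p = split.asInstanceOf[DruidSegmentPartition]
          // Placeholder: read rows from the segment file here.
          Iterator(s"would read segment ${p.segmentId} on ${p.historicalHost}")
        }
      }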

  30. Spark Druid Connector – How to use
      Create table:
      spark.read
        .format("com.navercorp.ni.druid.spark.druid")
        .option("coordinator", "host1.com:18081")
        .option("broker", "host2.com:18082")
        .option("datasource", "logs")
        .load()
        .createOrReplaceTempView("logs")

      Execute query:
      spark.sql("""
        SELECT country, device, duration
        FROM logs
        WHERE country = 'Korea'
          AND device LIKE 'Iphone%'
      """).show(false)

  31. Spark Druid Connector - Performance
      [Bar charts] Seconds, lower is better; total 4.4B rows. Random Access and Full Scan & GROUP BY compared for Spark Druid vs. Spark Parquet (measured values: 0.21 s and 7.5 s for Random Access; 7.7 s and 24.1 s for Full Scan & GROUP BY).

  32. Spark Druid Connector – How to implement

  33. Spark Druid Connector – How to implement
      1. Druid REST API
      2. Druid Segment Library
      3. Spark Data Source API
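      The connector's entry point is the Spark Data Source API. Below is a minimal sketch of the (V1) pieces such a connector plugs into: a RelationProvider that Spark resolves from the .format() string, and a BaseRelation with PrunedFilteredScan so column pruning and filters can be pushed toward Druid's indexes. The package, class names, and hard-coded schema are hypothetical; the option names mirror slide 30, and the Druid-specific reading is left as a placeholder.

      package com.example.druid // hypothetical package

      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.{Row, SQLContext}
      import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan, RelationProvider}
      import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

      // Entry point Spark looks up for .format("com.example.druid").
      class DefaultSource extends RelationProvider {
        override def createRelation(sqlContext: SQLContext,
                                    parameters: Map[String, String]): BaseRelation =
          new DruidRelation(sqlContext, parameters("coordinator"),
                            parameters("broker"), parameters("datasource"))
      }

      // One relation per Druid datasource. The schema would normally be discovered
      // through the Coordinator/Broker REST API; here it is hard-coded.
      class DruidRelation(val sqlContext: SQLContext,
                          coordinator: String, broker: String, datasource: String)
        extends BaseRelation with PrunedFilteredScan {

        override def schema: StructType = StructType(Seq(
          StructField("country", StringType),
          StructField("device", StringType),
          StructField("duration", LongType)))

        // Spark pushes down the selected columns and (some) WHERE filters, which a
        // real connector would translate to Druid bitmap/dictionary lookups before
        // reading segment files.
        override def buildScan(requiredColumns: Array[String],
                               filters: Array[Filter]): RDD[Row] = {
          // Placeholder: list segments via the REST API and return an RDD (e.g., the
          // DruidSegmentRDD sketched earlier) that reads them with the segment library.
          sqlContext.sparkContext.emptyRDD[Row]
        }
      }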
