ClickHouse Materialized Views: A Secret Weapon for High Performance Analytics

  1. ClickHouse Materialized Views: A Secret Weapon for High Performance Analytics. Robert Hodges -- Percona Live 2018 Amsterdam

  2. Introduction to Presenter: Robert Hodges, Altinity CEO (www.altinity.com). 30+ years on DBMS plus virtualization and security. ClickHouse is DBMS #20. Altinity is a leading software and services provider for ClickHouse, a major committer, and a community sponsor in the US and Western Europe.

  3. Introduction to ClickHouse
     ● Understands SQL
     ● Runs on bare metal to cloud
     ● Stores data in columns
     ● Parallel and vectorized execution
     ● Scales to many petabytes
     ● Is open source (Apache 2.0)
     ● Is WAY fast!

  4. ClickHouse materialized views are triggers: an INSERT into the source table fires the view, which runs its SELECT over the newly inserted rows and writes the result into a target table (or several target tables).
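The trigger behavior can be sketched in plain Python. This is a conceptual model of the semantics, not ClickHouse code; the names (source_rows, target_rows, mv_transform) are invented for illustration. The key point it shows: the view's query runs only over each newly inserted block, not the whole table.

```python
# Conceptual model of a ClickHouse materialized view as an insert trigger.
# All names here are invented for illustration.

source_rows = []   # plays the role of the source table
target_rows = []   # plays the role of the view's target table

def mv_transform(rows):
    """The view's SELECT, applied only to the newly inserted block."""
    return [{"date": r["date"], "cancelled": r["cancelled"]} for r in rows]

def insert(rows):
    """INSERT fires the 'trigger': the view sees only the new rows."""
    source_rows.extend(rows)
    target_rows.extend(mv_transform(rows))

insert([{"date": "2018-01-01", "cancelled": 1, "extra": "x"}])
insert([{"date": "2018-01-02", "cancelled": 0, "extra": "y"}])
print(len(source_rows), len(target_rows))  # both tables grew in lockstep
```

Because the transform sees only the inserted block, the view adds a small fixed cost to each INSERT instead of recomputing over the full table.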

  5. You can even create recursive views: the target table of one materialized view can serve as the source for another, so a single INSERT cascades through a chain of views and target tables.

  6. Why might a materialized view be useful?

     SELECT toYear(FlightDate) AS year,
         sum(Cancelled) / count(*) AS cancelled,
         sum(DepDel15) / count(*) AS delayed_15
     FROM airline.ontime
     GROUP BY year ORDER BY year ASC LIMIT 10

     ...
     10 rows in set. Elapsed: 0.894 sec. Processed 173.82 million rows,
     1.74 GB (194.52 million rows/s., 1.95 GB/s.)

     Can we make it faster?

  7. Let’s precompute and store aggregates!

     CREATE MATERIALIZED VIEW ontime_daily_cancelled_mv
     ENGINE = SummingMergeTree
     PARTITION BY tuple()
     ORDER BY (FlightDate, Carrier)
     POPULATE
     AS SELECT
         FlightDate, Carrier,
         count(*) AS flights,
         sum(Cancelled) AS cancelled,
         sum(DepDel15) AS delayed_15
     FROM ontime
     GROUP BY FlightDate, Carrier

     (The view stores daily counts rather than ratios: SummingMergeTree sums
     rows with the same key as parts merge, and sums of counts stay correct
     while sums of ratios would not.)
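The mechanics behind this can be simulated in Python: rows sharing the same key are collapsed by summing the numeric columns, so tiny daily aggregates can later be rolled up by year. This is a sketch with made-up data, not the actual ontime dataset; it stores counts per day, which remain correct under summation (per-row ratios would not).

```python
from collections import defaultdict

# Simulated daily aggregate, keyed like the view's ORDER BY:
# key = (FlightDate, Carrier), values = [flights, cancelled].
daily = defaultdict(lambda: [0, 0])

def insert_flight(flight_date, carrier, cancelled):
    # SummingMergeTree behavior: rows with the same key collapse by summing.
    row = daily[(flight_date, carrier)]
    row[0] += 1           # flights
    row[1] += cancelled   # cancelled count (a sum, so merges stay correct)

for fd, car, c in [("2018-01-01", "AA", 1), ("2018-01-01", "AA", 0),
                   ("2018-01-02", "AA", 0), ("2018-01-01", "UA", 1)]:
    insert_flight(fd, car, c)

# The year-level query reads the tiny aggregate, not the raw rows:
year_flights = sum(v[0] for v in daily.values())
year_cancelled = sum(v[1] for v in daily.values())
print(year_flights, year_cancelled, year_cancelled / year_flights)  # 4 2 0.5
```

The ratio is computed at query time from the stored sums, which is why the aggregate stays mergeable.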

  8. Query performance impact is significant

     SELECT toYear(FlightDate) AS year,
         sum(flights) AS flights,
         sum(cancelled) AS cancelled,
         sum(delayed_15) AS delayed_15
     FROM airline.ontime_daily_cancelled_mv
     GROUP BY year ORDER BY year ASC LIMIT 10

     ...
     10 rows in set. Elapsed: 0.007 sec. Processed 148.16 thousand rows,
     3.85 MB (20.37 million rows/s., 529.50 MB/s.)

  9. What’s going on under the covers? An INSERT lands in the source table
     ontime (MergeTree; compressed size ~14.6 GB, uncompressed ~55.4 GB).
     The view’s trigger writes aggregates into the hidden table
     .inner.ontime_daily_cancelled_mv (SummingMergeTree; compressed size
     ~2.31 MB, uncompressed ~4.10 MB). A SELECT on ontime_daily_cancelled_mv
     reads from that inner table. The materialized view is 6471x smaller
     than the source table!

  10. A brief study of ClickHouse table structures

     CREATE TABLE ontime (
         Year UInt16,
         Quarter UInt8,
         Month UInt8,
         ...
     )
     ENGINE = MergeTree()                -- table engine type
     PARTITION BY toYYYYMM(FlightDate)   -- how to break data into parts
     ORDER BY (Carrier, FlightDate)      -- how to index and sort data in each part
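How PARTITION BY and ORDER BY cut down reads can be mimicked in Python. This is a toy model under stated assumptions: real ClickHouse parts use a sparse primary index over sorted data rather than a per-row lookup, and the month key here stands in for toYYYYMM.

```python
import bisect
from collections import defaultdict

# Toy model: parts keyed by the PARTITION BY value (month of the date);
# inside each part, rows are kept sorted on the ORDER BY key (carrier, date).
parts = defaultdict(list)

def insert(date, carrier, value):
    month = date[:7]                     # stand-in for toYYYYMM(date)
    bisect.insort(parts[month], ((carrier, date), value))

for d, c, v in [("2018-01-05", "AA", 10), ("2018-01-07", "UA", 20),
                ("2018-02-01", "AA", 30)]:
    insert(d, c, v)

def query(month, carrier):
    part = parts[month]                  # partition pruning: read one part only
    lo = bisect.bisect_left(part, ((carrier, ""), float("-inf")))
    out = []
    for (car, date), value in part[lo:]:
        if car != carrier:
            break                        # sorted order lets us stop early
        out.append(value)
    return out

print(query("2018-01", "AA"))  # [10]: only the January part, only AA rows
```

Both mechanisms shrink the scan: the partition key eliminates whole parts, and the sort key restricts the row range read within each surviving part.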

  11. Possible ways to transform tables: a materialized view on ontime can
      reduce the data (ontime_agg), and/or change the layout
      (ontime_resorted), and/or change the table type (ontime_kafka).

  12. Exercise: the famous ‘last point problem’. A cpu table holds many CPU
      utilization samples per host (e.g. hosts 7023, 9601, 6522). Problem:
      show the current CPU utilization for each host, i.e. the most recent
      sample per host.
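The problem itself is easy to state in Python: for each host, keep the reading with the greatest timestamp. The cpu column names follow the slides (tags_id, created_at, usage_idle); the sample values are invented.

```python
# Toy samples: (tags_id, created_at, usage_idle)
samples = [
    (9003, "2018-11-01 10:00:00", 40.0),
    (9003, "2018-11-01 10:00:10", 0.0),   # newest sample for host 9003
    (9117, "2018-11-01 10:00:05", 25.0),
]

# Last point per host: the row with max(created_at) for each tags_id.
last = {}
for tags_id, created_at, usage_idle in samples:
    if tags_id not in last or created_at > last[tags_id][0]:
        last[tags_id] = (created_at, usage_idle)

# "Current utilization" is 100 - usage_idle, as in the slides' query.
usage = {t: 100 - idle for t, (_, idle) in last.items()}
print(usage)  # {9003: 100.0, 9117: 75.0}
```

The challenge in the slides is doing this efficiently over tens of millions of rows, where a naive scan of all samples is too slow.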

  13. ClickHouse can solve this using a subquery

     SELECT t.hostname, tags_id, 100 - usage_idle AS usage
     FROM (
         SELECT tags_id, usage_idle            -- table scan!
         FROM cpu
         WHERE (tags_id, created_at) IN
             (SELECT tags_id, max(created_at)  -- uses index
              FROM cpu GROUP BY tags_id)
     ) AS c
     INNER JOIN tags AS t ON c.tags_id = t.id  -- optimized join cost
     ORDER BY usage DESC, t.hostname ASC
     LIMIT 10

  14. SQL queries work but are inefficient. Using a direct query on the table:

     ┌─hostname──┬─tags_id─┬─usage─┐
     │ host_1002 │    9003 │   100 │
     │ host_1116 │    9117 │   100 │
     │ host_1141 │    9142 │   100 │
     │ host_1163 │    9164 │   100 │
     │ host_1210 │    9211 │   100 │
     │ host_1216 │    9217 │   100 │
     │ host_1234 │    9235 │   100 │
     │ host_1308 │    9309 │   100 │
     │ host_1419 │    9420 │   100 │
     │ host_1491 │    9492 │   100 │
     └───────────┴─────────┴───────┘

     10 rows in set. Elapsed: 0.566 sec. Processed 32.87 million rows,
     263.13 MB (53.19 million rows/s., 425.81 MB/s.)

     Can we bring last point performance closer to real-time?

  15. Create a target table for aggregate data

     CREATE TABLE cpu_last_point_idle_agg (
         created_date AggregateFunction(argMax, Date, DateTime),
         max_created_at AggregateFunction(max, DateTime),
         time AggregateFunction(argMax, String, DateTime),
         tags_id UInt32,
         usage_idle AggregateFunction(argMax, Float64, DateTime)
     )
     ENGINE = AggregatingMergeTree()   -- different table type
     PARTITION BY tuple()
     ORDER BY tags_id                  -- different storage layout

     Minimal data: one row of aggregate state per tags_id.

  16. argMaxState links columns with aggregates

     CREATE MATERIALIZED VIEW cpu_last_point_idle_mv
     TO cpu_last_point_idle_agg
     AS SELECT
         argMaxState(created_date, created_at) AS created_date,
         maxState(created_at) AS max_created_at,
         argMaxState(time, created_at) AS time,
         tags_id,
         argMaxState(usage_idle, created_at) AS usage_idle
     FROM cpu
     GROUP BY tags_id

     The TO clause makes the view derive data into an explicit target table.

  17. Digression: how aggregation works. A source value (created_at) is
      turned into a partial aggregate with maxState(created_at); partial
      aggregates from many blocks are later combined into the final merged
      aggregate with maxMerge(max_created_at).
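The State/Merge split can be modeled directly: a -State function reduces one block to a partial aggregate, and the matching -Merge function combines partials into the final answer. This is a Python sketch of the semantics only, not of ClickHouse's serialized aggregate states.

```python
# Partial aggregation for max: each inserted block yields a 'state',
# and states from many blocks/parts are merged later.

def max_state(values):
    """Like maxState(created_at): reduce one block to a partial aggregate."""
    return max(values)

def max_merge(states):
    """Like maxMerge(max_created_at): combine partial aggregates."""
    return max(states)

block1 = [3, 9, 1]          # created_at values from one INSERT
block2 = [7, 2]             # values from a later INSERT
states = [max_state(block1), max_state(block2)]
print(max_merge(states))    # 9: same answer as max over all raw values
```

Because merging states gives the same result as aggregating all raw rows at once, the raw rows never need to be revisited.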

  18. Selecting rows that match the max value. Source values: created_at and
      usage_idle, taken from the same row. Partial aggregates:
      maxState(created_at) and argMaxState(usage_idle, created_at), which
      picks the usage_idle value from any row with the matching created_at.
      Merged aggregates: maxMerge(max_created_at) and argMaxMerge(usage_idle),
      which picks usage_idle from the aggregate with the matching created_at.
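The argMax pairing can be sketched the same way: each partial state carries the companion value alongside the max key, so the merged result returns the usage_idle from the row holding the newest created_at. Values below are invented; ties in this toy break on the companion value, which real argMax does not guarantee.

```python
# argMax partial aggregation: each state is (max_key, value_at_that_key).

def arg_max_state(rows):
    """Like argMaxState(usage_idle, created_at) over one block:
    rows are (created_at, usage_idle) pairs."""
    return max(rows)  # tuple compare: greatest created_at wins

def arg_max_merge(states):
    """Like argMaxMerge: keep the value attached to the newest key."""
    return max(states)[1]

block1 = [(100, 40.0), (110, 35.0)]   # (created_at, usage_idle)
block2 = [(120, 5.0), (90, 60.0)]
states = [arg_max_state(b) for b in (block1, block2)]
print(arg_max_merge(states))  # 5.0: usage_idle from the newest sample (120)
```

This is why the target table keeps usage_idle as AggregateFunction(argMax, Float64, DateTime): the state must remember which timestamp the value belongs to.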

  19. Let’s hide the merge details with a view

     CREATE VIEW cpu_last_point_idle_v AS
     SELECT
         argMaxMerge(created_date) AS created_date,
         maxMerge(max_created_at) AS created_at,
         argMaxMerge(time) AS time,
         tags_id,
         argMaxMerge(usage_idle) AS usage_idle
     FROM cpu_last_point_idle_mv
     GROUP BY tags_id

  20. ...Select again from the covering view

     SELECT t.hostname, tags_id, 100 - usage_idle AS usage
     FROM cpu_last_point_idle_v AS b
     INNER JOIN tags AS t ON b.tags_id = t.id
     ORDER BY usage DESC, t.hostname ASC
     LIMIT 10

     ...
     10 rows in set. Elapsed: 0.005 sec. Processed 14.00 thousand rows,
     391.65 KB (2.97 million rows/s., 82.97 MB/s.)

     The last point view is 113 times faster.

  21. Common uses for materialized views ● Precompute aggregates ● Fetch last point data ● Transform table on-disk indexing and sorting ○ Like a Vertica projection ● Keep aggregates after raw input is dropped ● Create data cleaning pipelines ● Read from Kafka queues

  22. Presenter: rhodges@altinity.com Thank you! ClickHouse: https://github.com/ClickHouse/ClickHouse Altinity: We’re hiring! https://www.altinity.com

  23. ...to reduce the amount of data we read: the PARTITION BY expression
      breaks the table into parts, so only parts whose rows can match are
      touched; within each part, columns are sorted on the ORDER BY clause
      and a sparse index selects which rows to read; skip indexes reduce
      the data read further.
