How Cloudflare analyzes >1m DNS queries per second
Tom Arnfeld (and Marek Vavrusa)


  1. How Cloudflare analyzes >1m DNS queries per second, Tom Arnfeld (and Marek Vavrusa)

  2. Cloudflare in numbers:
     - 3M DNS queries/second
     - >10% of Internet requests every day
     - 100+ data centers globally
     - 2.5B monthly unique visitors
     - 6M+ websites, apps & APIs in 150 countries
     - 5M+ HTTP requests/second

  3. Anatomy of a DNS query (30+ fields)

     $ dig www.cloudflare.com

     ; <<>> DiG 9.8.3-P1 <<>> www.cloudflare.com
     ;; global options: +cmd
     ;; Got answer:
     ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36582
     ;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0

     ;; QUESTION SECTION:
     ;www.cloudflare.com.            IN      A

     ;; ANSWER SECTION:
     www.cloudflare.com.     5       IN      A       198.41.215.162
     www.cloudflare.com.     5       IN      A       198.41.214.162

     ;; Query time: 34 msec
     ;; SERVER: 192.168.1.1#53(192.168.1.1)
     ;; WHEN: Sat Sep 2 10:48:30 2017
     ;; MSG SIZE  rcvd: 68

  4. - Anycast DNS → Cloudflare DNS Server; HTTP & other edge services → Log Forwarder
     - Logs from all edge services and all PoPs are shipped over TLS to be processed
     - Logs are received and de-multiplexed
     - Logs are written into various Kafka topics

  5. - Anycast DNS → Cloudflare DNS Server; HTTP & other edge services → Log Forwarder
     - Logs from all edge services and all PoPs are shipped over TLS to be processed
     - Logs are received and de-multiplexed
     - Logs are written into various Kafka topics
     - Log messages are serialized with Cap'n Proto

  6. What did we want?
     Scale:
     - 3M queries per second
     - 100+ edge Points of Presence
     - 20+ query dimensions
     - 5+ years of stored aggregations
     Requirements:
     - Multidimensional query analytics
     - Complex ad-hoc queries
     - Capable of current and expected future scale
     - Gracefully handle late-arriving log data
     - Roll-ups/aggregations for long-term storage
     - Highly available and replicated architecture

  7. Kafka, Apache Spark and Parquet
     Pipeline:
     - Logs are received and de-multiplexed
     - Logs are written into various Kafka topics
     - Download and filter data from Kafka using Apache Spark
     - Convert into Parquet and write to HDFS
     Trade-offs:
     - Scanning the firehose is slow, and adding filters is time-consuming
     - Offline analysis is difficult with large amounts of data
     - Not a fast or friendly user experience
     - Doesn't work for customers

  8. Let's aggregate everything... with streams

     Raw events:
     Timestamp           | QName              | QType | RCODE
     2017/01/01 01:00:00 | www.cloudflare.com | A     | NODATA
     2017/01/01 01:00:01 | api.cloudflare.com | AAAA  | NOERROR

     Aggregated:
     Time Bucket      | QName              | QType | RCODE   | Count | p50 Response Time
     2017/01/01 01:00 | www.cloudflare.com | A     | NODATA  | 5     | 0.4876ms
     2017/01/01 01:00 | api.cloudflare.com | AAAA  | NOERROR | 10    | 0.5231ms
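
For concreteness, the roll-up from the first table to the second is a plain SQL aggregation. A minimal sketch, written in ClickHouse dialect (the store the deck eventually lands on); the table and column names (dns_logs, ts, qname, qtype, rcode, response_time_ms) are illustrative assumptions, not the production schema:

    SELECT
        toStartOfMinute(ts)              AS time_bucket,  -- 1-minute buckets
        qname,
        qtype,
        rcode,
        count()                          AS query_count,
        quantile(0.5)(response_time_ms)  AS p50_response_time
    FROM dns_logs
    GROUP BY time_bucket, qname, qtype, rcode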

  9. Let's aggregate everything... with streams
     - Counters
       - Total number of queries
       - Query types
       - Response codes
     - Top-n query names
     - Top-n query sources
     - Response time/size quantiles

  10. Aggregating with Spark Streaming
      Pipeline:
      - Logs are received and de-multiplexed
      - Logs are written into various Kafka topics
      - Produce low-cardinality aggregates with Spark Streaming
      Trade-offs:
      - Spark experience in-house, though in Java/Scala
      - Batch-oriented, and still needs a DB to serve online queries
      - Difficult to support ad-hoc analysis
      - Low-resolution aggregates
      - Scanning raw data is slow
      - Late-arriving data


  12. Spark Streaming + CitusDB
      Pipeline:
      - Logs are received and de-multiplexed
      - Logs are written into various Kafka topics
      - Produce low-cardinality aggregates with Spark Streaming
      - Insert aggregate rows into a CitusDB cluster for reads
      Trade-offs:
      - Distributed time-series DB with a SQL API
      - Existing deployments of CitusDB in-house
      - High-cardinality aggregations are tricky due to insert performance
      - Late-arriving data

  13. Apache Flink + (CitusDB?)
      Pipeline:
      - Logs are received and de-multiplexed
      - Logs are written into various Kafka topics
      - Produce low-cardinality aggregates with Flink
      - Insert aggregate rows into a CitusDB cluster for reads
      Trade-offs:
      - Dataflow API and support for stream watermarks
      - SQL API
      - Checkpoint performance issues
      - High-cardinality aggregations are tricky due to insert performance

  14. Druid
      Pipeline:
      - Logs are received and de-multiplexed
      - Logs are written into various Kafka topics
      - Insert into a cluster of Druid nodes
      Trade-offs:
      - Insertion rate couldn't keep up in our initial tests
      - Estimated costs of a suitable cluster were prohibitively high
      - Seemed performant for random reads, but not the best we'd seen
      - Operational complexity seemed high

  15. Let's aggregate everything... with streams
      - Raw data isn't easily queried ad-hoc
      - Backfilling new aggregates is impossible, or very difficult without custom tools
      - A stream can't serve actual queries
      - Can be costly for high-cardinality dimensions

      Raw events:
      Timestamp           | QName              | QType | RCODE
      2017/01/01 01:00:00 | www.cloudflare.com | A     | NODATA
      2017/01/01 01:00:01 | api.cloudflare.com | AAAA  | NOERROR

      Aggregated:
      Time Bucket      | QName              | QType | RCODE   | Count | p50 Response Time
      2017/01/01 01:00 | www.cloudflare.com | A     | NODATA  | 5     | 0.4876ms
      2017/01/01 01:00 | api.cloudflare.com | AAAA  | NOERROR | 10    | 0.5231ms

      *https://clickhouse.yandex/docs/en/introduction/what_is_clickhouse.html
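
The backfilling point is where keeping raw data in a queryable store pays off: a brand-new aggregate can be rebuilt from history with an ordinary INSERT ... SELECT instead of a custom stream-replay tool. A hedged sketch, with assumed table names (dns_logs for raw rows, dnslogs_rollup_1m for the roll-up):

    -- Replay one day of raw history into a newly created roll-up table
    INSERT INTO dnslogs_rollup_1m
    SELECT
        toStartOfMinute(ts) AS time_bucket,
        qname,
        qtype,
        rcode,
        count()             AS query_count
    FROM dns_logs
    WHERE ts >= '2017-01-01 00:00:00'
      AND ts <  '2017-01-02 00:00:00'
    GROUP BY time_bucket, qname, qtype, rcode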

  16. ClickHouse
      - Tabular, column-oriented data store
      - Single binary, clustered architecture
      - Familiar SQL query interface
      - Lots of very useful built-in aggregation functions
      - Raw log data stored for 3 months (~7 trillion rows)
      - Aggregated data stored indefinitely: 1m and 1h aggregations across 3 dimensions
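
A taste of those built-in aggregation functions: topK, uniq and quantiles cover the top-n and percentile needs listed back on slide 9 in a single pass. Column names (qname, client_ip, response_time_ms, date) are assumptions for illustration:

    SELECT
        topK(10)(qname)                              AS top_query_names,   -- approximate top-n
        uniq(client_ip)                              AS unique_sources,    -- approximate distinct count
        quantiles(0.5, 0.9, 0.99)(response_time_ms)  AS latency_quantiles
    FROM dns_logs
    WHERE date = today()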

  17. - Anycast DNS → Cloudflare DNS Server; HTTP & other edge services → Log Forwarder
      - Logs from all edge services and all PoPs are shipped over TLS to be processed
      - Logs are received and de-multiplexed
      - Logs are written into various Kafka topics, serialized with Cap'n Proto
      - Go inserters write the data in parallel
      - A multi-tenant ClickHouse cluster stores the data
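
At the SQL level, what the parallel Go inserters send is just large batched INSERTs; ClickHouse strongly favors a few big batches over many small ones. A minimal sketch of one such batch, with an assumed schema:

    INSERT INTO dnslogs (ts, qname, qtype, rcode, response_time_ms) VALUES
        ('2017-01-01 01:00:00', 'www.cloudflare.com', 'A',    'NODATA',  0.4876),
        ('2017-01-01 01:00:01', 'api.cloudflare.com', 'AAAA', 'NOERROR', 0.5231)
        -- in practice, many thousands of rows per batch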

  18. ClickHouse Cluster: initial table design
      - Raw logs are inserted into sharded tables
      - A sidecar process aggregates data into day/month/year tables
      Table hierarchy:
      - dnslogs_2016_01_01_14_30_pN (TinyLog)
      - dnslogs_2016_01_01 (ReplicatedMergeTree)
      - dnslogs_2016_01 (ReplicatedMergeTree)
      - dnslogs_2016 (ReplicatedMergeTree)
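
A sketch of what one of the per-day raw tables might have looked like, using current ClickHouse DDL syntax. The deck only names the engines and the table naming scheme, so the columns, ZooKeeper path and sort key here are all assumptions:

    CREATE TABLE dnslogs_2016_01_01
    (
        ts                DateTime,
        qname             String,
        qtype             UInt16,
        rcode             UInt8,
        response_time_ms  Float32
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/dnslogs_2016_01_01',
                                 '{replica}')
    ORDER BY (qname, ts)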

  19. ClickHouse Cluster: first attempt in prod
      - Raw logs are inserted into one replicated, sharded table: r{0,2}.dnslogs (ReplicatedMergeTree)
      - Multiple r{0,2} databases to better pack the cluster with shards and replicas
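
In this layout each r{0,2} database holds a local ReplicatedMergeTree shard, and reads typically fan out through a Distributed table. A hedged sketch; the cluster name (dns_cluster), the sharding key and the columns are assumptions:

    -- Local shard in database r0 (likewise r1, r2)
    CREATE TABLE r0.dnslogs
    (
        ts DateTime, qname String, qtype UInt16, rcode UInt8,
        response_time_ms Float32
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/r0.dnslogs', '{replica}')
    ORDER BY (qname, ts);

    -- Query-side fan-out across all shards of the cluster
    CREATE TABLE r0.dnslogs_all AS r0.dnslogs
    ENGINE = Distributed(dns_cluster, 'r0', 'dnslogs', rand());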

  20. Speeding up typical queries
      - SUM() and COUNT() over a few low-cardinality dimensions
      - Global overview (trends and monitoring)
      - Storing intermediate state for non-additive functions
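
The last point matters because functions like quantile aren't additive: you can't combine two p50s after the fact. ClickHouse's -State/-Merge combinators store the intermediate aggregation state instead, so partial results can be merged exactly. A sketch with assumed table and column names (hourly_partials is hypothetical):

    -- Store partial state per hour...
    SELECT
        toStartOfHour(ts)                     AS hour,
        quantileState(0.5)(response_time_ms)  AS p50_state
    FROM dns_logs
    GROUP BY hour;

    -- ...and finalize later by merging states across any time range
    SELECT quantileMerge(0.5)(p50_state)
    FROM hourly_partials
    WHERE hour >= now() - INTERVAL 1 DAY;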

  21. ClickHouse Cluster: today
      - Raw logs are inserted into one replicated, sharded table: r{0,2}.dnslogs (ReplicatedMergeTree)
      - Multiple r{0,2} databases to better pack the cluster with shards and replicas
      - Aggregate tables for long-term storage: dnslogs_rollup_X (ReplicatedAggregatingMergeTree)
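
A sketch of what one dnslogs_rollup_X table, plus a materialized view feeding it, might look like. Everything here beyond the engine name (columns, roll-up granularity, the view itself) is an assumption:

    CREATE TABLE dnslogs_rollup_1h
    (
        hour       DateTime,
        qtype      UInt16,
        rcode      UInt8,
        queries    AggregateFunction(count),
        p50_state  AggregateFunction(quantile(0.5), Float32)
    )
    ENGINE = ReplicatedAggregatingMergeTree(
        '/clickhouse/tables/{shard}/dnslogs_rollup_1h', '{replica}')
    ORDER BY (hour, qtype, rcode);

    -- Populate the roll-up as raw rows arrive
    CREATE MATERIALIZED VIEW dnslogs_rollup_1h_mv TO dnslogs_rollup_1h AS
    SELECT
        toStartOfHour(ts)                     AS hour,
        qtype,
        rcode,
        countState()                          AS queries,
        quantileState(0.5)(response_time_ms)  AS p50_state
    FROM r0.dnslogs
    GROUP BY hour, qtype, rcode;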

  22. Timeline
      - October 2016: began evaluating technologies and architecture; 1 instance in Docker
      - November 2016: prototype ClickHouse cluster with 3 nodes, inserting a sample of data
      - December 2016: finalized schema, deployed a production ClickHouse cluster of 6 nodes
      - Spring 2017: Go native ClickHouse driver, Analytics library, TopN, IP prefix matching, pkey in monotonic functions, visualisations with Superset and Grafana
      - August 2017: migrated to a new cluster with multi-tenancy; growing interest among other Cloudflare engineering teams, worked on standard tooling

  23. Multi-tenant ClickHouse cluster (same timeline as above, plus today's numbers):
      - 33 nodes
      - 8M+ row insertions/second
      - 4GB+ insertion throughput/second
      - 2PB+ of RAID-0 spinning disks

  24. ClickHouse Today... 12 trillion rows

      SELECT table, sum(rows) AS total
      FROM system.cluster_parts
      WHERE database = 'r0'
      GROUP BY table
      ORDER BY total DESC

      ┌─table──────────────────────────────┬─────────────total─┐
      │ ███████████████                    │ 9,051,633,001,267 │
      │ ████████████████████               │ 2,088,851,716,078 │
      │ ███████████████████                │   847,768,860,981 │
      │ ██████████████████████             │   259,486,159,236 │
      │ …                                  │                 … │
