
OpenINTEL: an infrastructure for long-term, large-scale and high-performance active DNS measurements


  1. OpenINTEL: an infrastructure for long-term, large-scale and high-performance active DNS measurements
 DACS (Design and Analysis of Communication Systems)

  2. Why measure DNS?
 • (Almost) every networked service relies on DNS
 • DNS translates human-readable names into machine-readable information
   • e.g. IP addresses, but also mail hosts, certificate information, …
 • Measuring what is in the DNS over time provides information about the evolution of the Internet
   • (we started this because we were interested in the rise of DDoS Protection Services)

  3. Goals and Challenges
 • Send a comprehensive set of DNS queries for every name in a TLD, once per day
 • But can we do this at scale? How does this impact the global DNS?
   • .com + .net + .org: over 150 million names (about 50% of the global DNS namespace)
 • How do we store and analyse this data efficiently?

  4. Data collection stages
 • We distinguish three stages for data collection:
   • Stage 1: Collect zone files for the TLDs to scan and compute daily deltas
   • Stage 2: Main measurement: perform queries for each name, collect metadata, store results
   • Stage 3: Prepare data for analysis
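
 The daily delta in Stage 1 can be thought of as a set difference between two consecutive zone snapshots. The following is a minimal Python sketch, assuming each snapshot has already been reduced to a set of domain names; the function name `daily_delta` and the sample names are hypothetical and not taken from OpenINTEL itself.

```python
def daily_delta(yesterday, today):
    """Return (added, removed) names between two daily zone snapshots."""
    yesterday, today = set(yesterday), set(today)
    added = today - yesterday      # names that appeared since yesterday
    removed = yesterday - today    # names that disappeared since yesterday
    return added, removed


# Hypothetical sample snapshots (names under the reserved .example TLD).
old_names = {"alpha.example", "bravo.example", "charlie.example"}
new_names = {"bravo.example", "charlie.example", "delta.example"}

added, removed = daily_delta(old_names, new_names)
print("added:", sorted(added))
print("removed:", sorted(removed))
```

 Working with deltas means only the changed names need to be added to or dropped from the list of names to measure each day.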

  5. High-level architecture
 [Figure: architecture diagram spanning Stage I, Stage II and Stage III. Labelled components: TLD zone repositories, collection server, cluster manager, worker cloud, metadata server, aggregation server, database, NAS for long-term storage, Hadoop cluster, Internet. Several components are marked "per TLD".]

  6. What do we query and store?
 • We ask for:
   • SOA
   • A, AAAA (apex, ‘www’ and ‘mail’)
   • NS
   • MX
   • TXT
   • SPF
   • DS
   • DNSKEY
   • NSEC(3)
 • We store:
   • All records in the answer section
   • CNAME expansions
   • DNSSEC signatures (RRSIG)
   • Metadata (Geo IP, AS)
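
 One reading of the list of queried record types is that A and AAAA are requested for the apex and for the ‘www’ and ‘mail’ labels, while the other types are requested at the apex only. The Python sketch below enumerates the queries for one domain under that assumption. The function name `build_queries` is hypothetical, and NSEC(3) is left out because the slide does not say how it is requested.

```python
# Record types from the slide. That all of these except A/AAAA are asked
# only at the zone apex is an assumption of this sketch.
APEX_TYPES = ["SOA", "NS", "MX", "TXT", "SPF", "DS", "DNSKEY"]
ADDRESS_TYPES = ["A", "AAAA"]
ADDRESS_LABELS = ["", "www", "mail"]  # "" stands for the apex itself


def build_queries(domain):
    """Return the (qname, qtype) pairs to send for one domain."""
    queries = [(domain, qtype) for qtype in APEX_TYPES]
    for label in ADDRESS_LABELS:
        qname = f"{label}.{domain}" if label else domain
        queries.extend((qname, qtype) for qtype in ADDRESS_TYPES)
    return queries


queries = build_queries("example.org")
for qname, qtype in queries:
    print(qname, qtype)
print(len(queries), "queries per domain")
```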

  7. Impact on the global DNS
 • Our measurement is clearly visible in SURFnet’s traffic flows
 [Figure: traffic in Mbit/s (0 to 300) over one day (00:00 to 22:00). Series: measurement queries, measurement answers, other queries, other answers.]

  8. Impact on the global DNS
 • Deeper analysis shows very few top talkers (fewer than 35 receive more than 100 packets/sec.)
 [Figure: % of hosts (99% to 100%) against packets per second (1 to 400).]

  9. Big data? Yes!
 • Calling your research “big data” is all the rage
 • So would our work qualify as big data?
 • The human genome is about 3 ⋅ 10^9 base pairs
 • We collect around 1.8 ⋅ 10^9 DNS records per day
 • From February 2015 through December 31st, we collected 511 ⋅ 10^9 (511 billion) results

  10. Some numbers
 • Workers: 1 CPU core, 2 GB RAM, 5 GB disk

 TLD     #domains   workers   measure time
 .org    10.9M      10        7h19m
 .net    15.6M      10        14h29m
 .com    123.1M     80        17h10m
 .nl     5.6M       3         3h09m

 • Data collected daily:

 TLD     #domains   (failed)   #results   Avro     Parquet   uncompressed
 .org    10.9M      (1.2%)     125M       2.6GB    3.2GB     18.5GB
 .net    15.6M      (0.9%)     166M       3.5GB    4.3GB     24.4GB
 .com    124.0M     (0.6%)     1419M      30.0GB   36.8GB    213.4GB
 .nl     5.6M       (0.5%)     112M       8.5GB    11.8GB    27.8GB
 total   156.1M     (0.6%)     1.8B       43.3GB   54.7GB    284.1GB

 • We collect almost 16 TB of compressed data per year
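
 The yearly volume can be checked against the daily totals with simple arithmetic. The Python sketch below assumes 365 measurement days per year and decimal units (1 TB = 1000 GB); which daily column the "almost 16TB" figure refers to is not stated on the slide, so all three totals are extrapolated here.

```python
# Daily totals taken from the "total" row of the table.
DAILY_TOTAL_GB = {
    "Avro": 43.3,
    "Parquet": 54.7,
    "uncompressed": 284.1,
}
DAYS_PER_YEAR = 365  # assumption: one full measurement every day


def yearly_tb(daily_gb, days=DAYS_PER_YEAR):
    """Extrapolate a daily volume in GB to a yearly volume in TB."""
    return daily_gb * days / 1000


for fmt, daily_gb in DAILY_TOTAL_GB.items():
    print(f"{fmt}: {yearly_tb(daily_gb):.1f} TB per year")
```

 Under these assumptions, the Avro total is the one that comes out just below 16 TB per year.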

  11. Big data? Use the right tools
 • Three partners (SURFnet, SIDN, UTwente) invested in a Hadoop cluster
 • Use the latest & greatest tools for analysis: Impala, Spark, Flume, …
 • Working on making datasets accessible to other network researchers

  12. Query performance
 • Example query: the top 10 countries that A records in the .com TLD geo-locate to
 • Storage format matters a lot!

 Storage format         Compression   Relative size   Query run-time
 Avro (row oriented)    none          100%            25.1s
 Avro (row oriented)    deflate       17%             15.5s
 Avro (row oriented)    snappy        23%             9.3s
 Parquet (columnar)     none          44%             17.5s
 Parquet (columnar)     gzip          10%             5.7s
 Parquet (columnar)     snappy        17%             4.3s

 (The slide marks a “Sweet spot!” among the Parquet rows.)
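
 The logic of the example query can be illustrated in plain Python. This is only a sketch of what the query computes, not how it is run on the Hadoop cluster; the field names `response_type` and `country` and the sample records are hypothetical.

```python
from collections import Counter


def top_countries(records, n=10):
    """Count geo-located A records per country and return the n largest."""
    counts = Counter(
        rec["country"]
        for rec in records
        if rec["response_type"] == "A" and rec.get("country")
    )
    return counts.most_common(n)


# Hypothetical sample records; real rows would carry many more fields.
sample_records = [
    {"query_name": "alpha.com.", "response_type": "A", "country": "US"},
    {"query_name": "bravo.com.", "response_type": "A", "country": "US"},
    {"query_name": "charlie.com.", "response_type": "A", "country": "US"},
    {"query_name": "delta.com.", "response_type": "A", "country": "NL"},
    {"query_name": "echo.com.", "response_type": "A", "country": "NL"},
    {"query_name": "foxtrot.com.", "response_type": "A", "country": "DE"},
    {"query_name": "alpha.com.", "response_type": "AAAA", "country": "US"},
    {"query_name": "golf.com.", "response_type": "A", "country": None},
]

for country, count in top_countries(sample_records):
    print(country, count)
```

 The query touches only two fields per record, which is the kind of access pattern where a columnar format can skip reading the remaining columns.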

  13. An example: cloud e-mail
 [Figure: relative growth in #domains (-30% to 70%), Mar '15 to Dec '15. Series: Google, Microsoft, Yahoo, Office 365, Windows Live.]
 • Google largest (4.57M)
 • But Microsoft grows much faster!
 • Yahoo in decline
 [Figure: % of domains with an SPF record (30% to 90%), Mar '15 to Dec '15. Series: Google, Microsoft.]
 • SPF protects against e-mail forgery
 • Microsoft users show (near) ubiquitous SPF use
 • Google users at only one third
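
 The share of domains with an SPF record can be derived from stored TXT data. The Python sketch below assumes the TXT values per domain are already available as a dictionary; the function names and the sample data are hypothetical. It relies on the rule from RFC 7208 that an SPF record starts with the version tag `v=spf1`, terminated by a space or the end of the record, and it only looks at TXT values, not at the separate SPF record type.

```python
def is_spf_record(txt_value):
    """True if a TXT value is an SPF version 1 policy (RFC 7208)."""
    # ABNF string literals are case-insensitive, so compare in lower case.
    value = txt_value.strip().strip('"').lower()
    return value == "v=spf1" or value.startswith("v=spf1 ")


def spf_share(txt_by_domain):
    """Fraction of domains that publish at least one SPF policy."""
    if not txt_by_domain:
        return 0.0
    with_spf = sum(
        1
        for values in txt_by_domain.values()
        if any(is_spf_record(v) for v in values)
    )
    return with_spf / len(txt_by_domain)


# Hypothetical sample: domain name -> list of TXT record values.
sample_txt = {
    "alpha.example": ["v=spf1 include:_spf.mail.example ~all"],
    "bravo.example": ["some-verification-token=abc123"],
    "charlie.example": ["v=spf1 -all", "unrelated text"],
}

share = spf_share(sample_txt)
print(f"{share:.0%} of sample domains publish an SPF record")
```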

  14. Data access
 • Working on ways to make this resource accessible to the measurement research community
 • Problem: contracts for zone file access (com/net/org/nl/…) are (very) restrictive
 • Current thinking:
   • Publishing aggregate data sets is OK
   • “Toy” cluster with open data (e.g. Alexa 1M) to allow others to write queries & scripts, then execute “on behalf”
   • Anonymisation of data?

  15. Thank you for your attention! Questions?
 (come see us for a live demo)
 nl.linkedin.com/in/rolandvanrijswijk
 nl.linkedin.com/in/mattijsj
 @reseauxsansfil
 ✉ r.m.vanrijswijk@utwente.nl
 ✉ m.jonker@utwente.nl
