OpenINTEL an infrastructure for long-term, large-scale and - - PowerPoint PPT Presentation



slide-1
SLIDE 1

DACS: Design and Analysis of Communication Systems

OpenINTEL

an infrastructure for long-term, large-scale and high-performance active DNS measurements

slide-2
SLIDE 2

Why measure DNS?

  • (Almost) every networked service relies on DNS
  • DNS translates human readable names into machine readable information
  • e.g. IP addresses, but also: mail hosts, certificate information, …
  • Measuring what is in the DNS over time provides information about the evolution of the Internet
  • (we started this because we were interested in the rise of DDoS Protection Services)

slide-3
SLIDE 3

Goals and Challenges

  • Send a comprehensive set of DNS queries for every name in a TLD, once per day
  • But can we do this at scale? How does this impact the global DNS?

 .com + .net + .org: over 150 million names
 (about 50% of the global DNS namespace)

  • How do we store and analyse this data efficiently?
slide-4
SLIDE 4

Data collection stages

  • We distinguish three stages for data collection:
  • Stage 1: Collect zone files for the TLDs to scan and compute daily deltas
  • Stage 2: Main measurement: perform queries for each name, collect metadata, store results
  • Stage 3: Prepare data for analysis
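
Stage 1 mentions computing daily deltas between zone files. A minimal sketch of that idea, assuming each day's zone has already been reduced to a set of domain names. The function name `zone_delta` and the sample names are illustrative and not taken from OpenINTEL:

```python
def zone_delta(yesterday, today):
    """Return (added, removed) domain names between two daily snapshots."""
    yesterday, today = set(yesterday), set(today)
    added = today - yesterday
    removed = yesterday - today
    return added, removed


# Hypothetical snapshots of a small zone on two consecutive days.
day1 = {"example.org", "old-shop.org", "wiki.org"}
day2 = {"example.org", "wiki.org", "new-blog.org"}

added, removed = zone_delta(day1, day2)
print(sorted(added))    # names that appeared since yesterday
print(sorted(removed))  # names that disappeared since yesterday
```

Set differences keep the comparison simple; for zones with over a hundred million names, a real pipeline would more likely compare sorted files or database tables than in-memory sets.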

slide-5
SLIDE 5

High-level architecture

[Architecture diagram. Labelled stages: Stage I, Stage II, Stage III. Labelled components: TLD zone repositories, collection server, database per TLD, cluster manager per TLD, worker cloud per TLD, metadata server, Internet, aggregation server, NAS for long term storage, Hadoop cluster]

slide-6
SLIDE 6

What do we query and store?

  • We ask for:
  • SOA
  • A, AAAA (apex, ‘www’ and ‘mail’)
  • NS
  • MX
  • TXT
  • SPF
  • DS
  • DNSKEY
  • NSEC(3)
  • We store:
  • All records in the answer section
  • CNAME expansions
  • DNSSEC signatures (RRSIG)
  • Metadata (Geo IP, AS)
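
The query list above can be sketched as a small generator of (name, record type) pairs per domain. This is an illustration under assumptions, not the OpenINTEL implementation: it assumes every type except A and AAAA is asked only at the apex, and it leaves out NSEC(3) because the slide does not say how those records are requested. The names `APEX_TYPES`, `ADDRESS_LABELS` and `build_queries` are hypothetical.

```python
# Record types asked for at the domain apex (taken from the slide; NSEC(3)
# is left out because the slide does not say how it is requested).
APEX_TYPES = ["SOA", "A", "AAAA", "NS", "MX", "TXT", "SPF", "DS", "DNSKEY"]
# Labels that additionally get A and AAAA queries.
ADDRESS_LABELS = ["www", "mail"]


def build_queries(domain):
    """Return the (query name, record type) pairs for one domain."""
    queries = [(domain, rtype) for rtype in APEX_TYPES]
    for label in ADDRESS_LABELS:
        for rtype in ("A", "AAAA"):
            queries.append((f"{label}.{domain}", rtype))
    return queries


for qname, rtype in build_queries("example.com"):
    print(qname, rtype)
```

Under these assumptions each domain yields 13 queries: nine at the apex plus A and AAAA for each of the two extra labels.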
slide-7
SLIDE 7

Impact on the global DNS

  • Our measurement is clearly visible in SURFnet’s traffic flows

[Figure: traffic in Mbit/s over one day (00:00 to 22:00). Series: Other answers, Other queries, Measurement answers, Measurement queries]

slide-8
SLIDE 8

Impact on the global DNS

  • Deeper analysis shows very few top talkers (fewer than 35 hosts receive more than 100 packets/sec.)

[Figure: % of hosts (99% to 100%) against packets per second (1 to 400)]

slide-9
SLIDE 9

Big data? Yes!

  • Calling your research “big data” is all the rage
  • So would our work qualify as big data?
  • The human genome is about 3⋅10⁹ base pairs
  • We collect around 1.8⋅10⁹ DNS records per day
  • From February 2015 through December 31st, we collected 511⋅10⁹ (511 billion) results

slide-10
SLIDE 10

Some numbers

  • Workers: 1 CPU core, 2 GB RAM, 5 GB disk

TLD   #domains  workers  measure time
.org  10.9M     10       7h19m
.net  15.6M     10       14h29m
.com  123.1M    80       17h10m
.nl   5.6M      3        3h09m

  • Data collected daily:

TLD    #domains (failed)  #results  Avro    Parquet  uncompressed
.org   10.9M (1.2%)       125M      2.6GB   3.2GB    18.5GB
.net   15.6M (0.9%)       166M      3.5GB   4.3GB    24.4GB
.com   124.0M (0.6%)      1419M     30.0GB  36.8GB   213.4GB
.nl    5.6M (0.5%)        112M      8.5GB   11.8GB   27.8GB
total  156.1M (0.6%)      1.8B      43.3GB  54.7GB   284.1GB

  • We collect almost 16 TB of compressed data per year
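
The figures on this slide allow two quick back-of-the-envelope checks: how many domains each worker handles per second, and how the daily volume adds up over a year. The sketch below copies the numbers from the tables. It assumes that the "compressed data per year" figure refers to the daily Avro total and that 1 TB = 1000 GB; neither is stated on the slide.

```python
def seconds(hours, minutes):
    """Convert a measurement time such as 7h19m to seconds."""
    return hours * 3600 + minutes * 60


# (domains, workers, measurement time in seconds), copied from the slide.
MEASUREMENTS = {
    ".org": (10.9e6, 10, seconds(7, 19)),
    ".net": (15.6e6, 10, seconds(14, 29)),
    ".com": (123.1e6, 80, seconds(17, 10)),
    ".nl": (5.6e6, 3, seconds(3, 9)),
}


def domains_per_worker_second(domains, workers, duration):
    """Average number of domains one worker completes per second."""
    return domains / workers / duration


for tld, row in MEASUREMENTS.items():
    rate = domains_per_worker_second(*row)
    print(f"{tld}: {rate:.1f} domains/s per worker")

# Assumption: the yearly figure is the daily Avro total stored 365 times.
AVRO_GB_PER_DAY = 43.3
tb_per_year = AVRO_GB_PER_DAY * 365 / 1000
print(f"{tb_per_year:.1f} TB per year")
```

For .com this works out to roughly 25 domains per second per worker, and the yearly estimate of about 15.8 TB is in line with the "almost 16 TB" stated on the slide.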

slide-11
SLIDE 11

Big data? Use the right tools

  • Three partners invested in a Hadoop cluster (SURFnet, SIDN, UTwente)
  • Use latest & greatest tools for analysis: Impala, Spark, Flume, …
  • Working on making datasets accessible to other network researchers

slide-12
SLIDE 12

Query performance

  • Example query: the top 10 countries that A records in the .com TLD geo-locate to
  • Storage format matters a lot!

Storage format       Compression  Relative size  Query run-time
Avro (row oriented)  none         100%           25.1s
Avro (row oriented)  deflate      17%            15.5s
Avro (row oriented)  snappy       23%            9.3s
Parquet (columnar)   none         44%            17.5s
Parquet (columnar)   gzip         10%            5.7s
Parquet (columnar)   snappy       17%            4.3s

Sweet spot!
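
The logic of the example query can be sketched in plain Python over a handful of in-memory rows. The field names (`query_type`, `tld`, `country`) and the sample rows are hypothetical; the slides do not show the actual schema or the query text that runs on the Hadoop cluster.

```python
from collections import Counter

# Hypothetical measurement rows; field names are illustrative only.
ROWS = [
    {"query_type": "A", "tld": "com", "country": "US"},
    {"query_type": "A", "tld": "com", "country": "US"},
    {"query_type": "A", "tld": "com", "country": "DE"},
    {"query_type": "AAAA", "tld": "com", "country": "US"},
    {"query_type": "A", "tld": "net", "country": "NL"},
]


def top_countries(rows, tld="com", limit=10):
    """Count A records per geo-located country within one TLD."""
    counts = Counter(
        row["country"]
        for row in rows
        if row["query_type"] == "A" and row["tld"] == tld
    )
    return counts.most_common(limit)


print(top_countries(ROWS))
```

A query of this shape reads only a few fields of each record. That access pattern generally favours a columnar format such as Parquet over a row-oriented one, which fits the run-times in the table.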

slide-13
SLIDE 13

An example: cloud e-mail

[Figure: relative growth in #domains, Mar '15 to Dec '15. Series: Google, Microsoft, Yahoo, Office 365, Windows Live]

[Figure: #domains with an SPF record (in %), Mar '15 to Dec '15. Series: Google, Microsoft]

  • Google largest (4.57M)
  • But Microsoft grows much faster!
  • Yahoo in decline
  • SPF protects against e-mail forgery
  • Microsoft users show (near) ubiquitous SPF use
  • Google users at only one third
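
The SPF figures rest on deciding whether a domain publishes an SPF policy. The slides do not say how that decision is made. The sketch below relies only on the fact that an SPF policy published in a TXT record begins with the version tag `v=spf1` (RFC 7208). The function name `has_spf` and the sample records are hypothetical.

```python
def has_spf(txt_records):
    """True if any TXT record carries an SPF version 1 policy."""
    for record in txt_records:
        text = record.strip().strip('"').lower()
        # The version tag must be the whole record or be followed by a space.
        if text == "v=spf1" or text.startswith("v=spf1 "):
            return True
    return False


# Hypothetical TXT record sets for two domains.
with_spf = ['"v=spf1 include:_spf.example.net ~all"', "some-verification=abc123"]
without_spf = ["some-verification=abc123"]

print(has_spf(with_spf), has_spf(without_spf))
```

Requiring a space or the end of the record after `v=spf1` keeps look-alike strings such as `v=spf10` from being counted.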
slide-14
SLIDE 14

Data access

  • Working on ways to make this resource accessible to the measurement research community
  • Problem: contracts for zone file access (com/net/org/nl/…) are (very) restrictive
  • Current thinking:
  • Publishing aggregate data sets is OK
  • “Toy” cluster with open data (e.g. Alexa 1M) to allow others to write queries & scripts, then execute “on behalf”
  • Anonymisation of data?
slide-15
SLIDE 15

nl.linkedin.com/in/rolandvanrijswijk
nl.linkedin.com/in/mattijsj
@reseauxsansfil
r.m.vanrijswijk@utwente.nl
m.jonker@utwente.nl

Thank you for your attention! Questions? (come see us for a live demo)