When the Dike Breaks: Dissecting DNS Defenses During DDoS

SLIDE 1

When the Dike Breaks: Dissecting DNS Defenses During DDoS

Giovane C. M. Moura1,2, John Heidemann3, Moritz Müller1,4, Ricardo de O. Schmidt5, Marco Davids1 RIPE 77, Amsterdam, The Netherlands 2018-10-15

1SIDN Labs, 2TU Delft, 3USC/ISI, 4University of Twente, 5University of Passo Fundo

SLIDE 2

Research paper to appear at ACM IMC 2018

  • Joint research work to appear at:

https://conferences.sigcomm.org/imc/2018/

  • Full text (PDF):

https://www.isi.edu/~johnh/PAPERS/Moura18b.pdf

SLIDE 3

DDoS Attacks

  • DDoS attacks are on the rise
  • Getting bigger, more frequent, cheaper, and easier
  • Arbor: 1.7 Tb/s [2] (2018)
  • Github DDoS: 1.35 Tb/s [1] (2018)
  • Dyn DDoS: 1.2 Tb/s (Mirai IoT) [6] (2016)
  • DDoS as a service: a few dollars with booters [8]
  • Many DNS services have been victims of DDoS attacks

SLIDE 4

DDoS and DNS: two examples

  • Root DNS DDoS (Nov 2015): no known reports of errors seen by users [3]
  • Dyn (Oct 2016): some users could not reach popular sites [6]

Two large DDoSes, very different outcomes. Why?

SLIDE 6

DNS Basics

[Figure: a user queries the Internet. Query: example.nl? Answer: 192.168.1.1]

  • That’s what most users (need to) know about DNS
  • Let’s see what really happens

SLIDE 7

Background: the many parts of DNS

  • Stub resolver (e.g., OS/applications)
  • Recursives, 1st level (e.g., modem): R1a, R1b, with caches CR1a, CR1b
  • Recursives, nth level (e.g., ISP resolver): Rna ... Rnn, with caches CRna ... CRnb
  • Authoritative servers (e.g., ns1.example.nl): AT1 ... ATn

Figure 1: Relationship between resolvers, caches, and authoritatives

  • DNS query: where’s example.nl? ($ dig A example.nl)
  • Answer: example.nl. 3600 IN A 94.198.159.35
  • DNS TTL: max time to cache a record
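
As a rough illustration of how the TTL bounds caching, the sketch below parses the answer line from this slide and computes how long a cache could still serve it. The names `Record`, `parse_record`, and `remaining_ttl` are hypothetical helpers, not part of any DNS library, and real resolvers may cap or otherwise adjust TTLs.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    ttl: int      # seconds the record may be cached
    rclass: str
    rtype: str
    rdata: str


def parse_record(line: str) -> Record:
    """Parse one answer line in the presentation format printed by dig."""
    name, ttl, rclass, rtype, rdata = line.split(maxsplit=4)
    return Record(name, int(ttl), rclass, rtype, rdata)


def remaining_ttl(record: Record, cached_at: float, now: float) -> int:
    """Seconds left before a cache that stored the record at cached_at should drop it."""
    return max(0, record.ttl - int(now - cached_at))


rec = parse_record("example.nl. 3600 IN A 94.198.159.35")
# Ten minutes after caching, 50 minutes of the one-hour TTL remain.
print(rec.name, rec.rdata, remaining_ttl(rec, cached_at=0, now=600))
```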

SLIDE 8

Background: the many parts of DNS

[Figure 1 repeated, with a DDoS attack marked at the authoritative servers]

  • How much will resolvers’ built-in defenses help users during a DDoS?

SLIDE 9

OPS expectation during DDoS

[Figure 1 topology repeated, with a DDoS attack marked at the authoritative servers]

Figure 2: TTL = how long your star power will last (answers come from cache)

SLIDE 10

Evaluating DNS Resiliency

  • Part 1: evaluate user experience under “normal” operations
  • Part 2: Verify results of Part 1 in production zones (.nl)
  • Part 3: Emulate DDoSes in the wild to evaluate caching and retries under stress, and to observe user experience

SLIDE 11

Part 1: measuring caching in the wild

Setup

  • 1. Register our new domain (cachetest.nl)
  • 2. Run two unicast IPv4 authoritatives on EC2 Frankfurt
  • 3. Use RIPE Atlas probes and their resolvers as vantage points (∼15k)
  • 4. Each VP sends a unique AAAA query, so there is no interference
  • e.g., 500.cachetest.nl for probeID=500
  • 5. Each AAAA DNS answer encodes a counter that allows us to tell whether it was a cache hit or miss
  • $PREFIX:$SERIAL:$PROBEID:$TTL
  • 6. Probe every 20min, and run scenarios with different TTLs, for 2 to 3 hours (to match various TTLs in the wild)
  • 60, 1800, 3600, and 86400 seconds TTL
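
The answer encoding in step 5 could be sketched as follows. The slides only give the field order ($PREFIX:$SERIAL:$PROBEID:$TTL); the field widths, the prefix value, and the function names here are assumptions made for illustration, as is the rule that an answer carrying an older serial than the zone currently serves must have come from a cache.

```python
import ipaddress

# Assumed layout (not confirmed by the slides): 64-bit prefix, 8-bit serial,
# 24-bit probe ID, 32-bit TTL. The prefix value is a hypothetical placeholder.
PREFIX = 0xFD00_0000_0000_0001


def encode_answer(serial: int, probe_id: int, ttl: int) -> str:
    """Pack serial, probe ID, and TTL into the low 64 bits of an AAAA answer."""
    if not (0 <= serial < 2**8 and 0 <= probe_id < 2**24 and 0 <= ttl < 2**32):
        raise ValueError("field out of range for the assumed layout")
    value = (PREFIX << 64) | (serial << 56) | (probe_id << 32) | ttl
    return str(ipaddress.IPv6Address(value))


def decode_answer(address: str) -> dict:
    """Recover the fields that encode_answer packed into the address."""
    value = int(ipaddress.IPv6Address(address))
    return {
        "serial": (value >> 56) & 0xFF,
        "probe_id": (value >> 32) & 0xFFFFFF,
        "ttl": value & 0xFFFFFFFF,
    }


def is_cache_hit(answer_serial: int, zone_serial: int) -> bool:
    """Assumption: the zone's serial is bumped between probing rounds,
    so an older serial in the answer means it was served from a cache."""
    return answer_serial < zone_serial


answer = encode_answer(serial=3, probe_id=500, ttl=3600)
print(answer, decode_answer(answer))
```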

SLIDE 12

Part 1: measuring caching in the wild

  • We control auth servers and clients (stub resolver)
  • We do not control recursives
  • How efficient is caching in the wild?
  • Remember: the TTL sets the upper limit on how long a record should be cached by recursives

SLIDE 13

Results: how good is caching in the wild?

[Figure: remaining queries per experiment, classified as AA, CC, AC, CA]

Experiment   60s    1800s   3600s   86400s   3600s-10m
Miss         0.0%   32.6%   32.9%   30.9%    28.5%

  • 1. Good news: caching works fine for 70% of all 15,000 VPs
  • Even with our unpopular domain
  • 2. Not so good news: ∼30% of answers are cache misses (AC)

SLIDE 14

Why cache misses (Why AC?)

Possible: capacity limits, cache flushes, complex caches
Mostly: complex caches

  • cache fragmentation with multiple servers
  • (previous work on Google DNS [9])

TTL                    60    1800    3600   86400  3600-10m
AC Answers             37   24645   24091   23202     47262
Public R1               0   12000   11359   10869     21955
  Google Public R1      0    9693    9026    8585     17325
  Other Public R1       0    2307    2333    2284      4630
Non-Public R1          37   12645   12732   12333     25307
  Google Public Rn      0    1196    1091     248      1708
  Other Rn             37   11449   11641   12085     23599

Table 1: AC answers (cache miss) public resolver classification
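
To see why fragmented caches produce extra misses, consider this toy simulation. It assumes a resolver service made of several independent backend caches with queries spread uniformly at random across them; that model and the name `count_misses` are illustrative assumptions, not a description of any particular public resolver.

```python
import random


def count_misses(num_backends: int, num_queries: int, seed: int = 0) -> int:
    """Count authoritative lookups when each query lands on a random backend cache.

    All queries are assumed to arrive within one TTL, so a single shared
    cache would miss exactly once (on the first query).
    """
    rng = random.Random(seed)
    warmed = set()            # backends that already hold the record
    misses = 0
    for _ in range(num_queries):
        backend = rng.randrange(num_backends)
        if backend not in warmed:
            misses += 1       # this backend has to ask the authoritative
            warmed.add(backend)
    return misses


for backends in (1, 4, 16):
    print(backends, "backends:", count_misses(backends, num_queries=10), "misses")
```

With one backend the record is fetched once; with more backends, the same client can see several "misses" inside a single TTL even though every individual cache honors it.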

SLIDE 15

Part 2: caching in production zones

  • OK, in our controlled environment, we show that caching works as expected about 70% of the time
  • Are these experiments representative?
  • We look at .nl production data
  • We compute Δt (time since last query)
  • Compare to TTL of 3600s
  • 485k queries from 7,779 recursives
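
The Δt computation could look roughly like the sketch below. It assumes the query log has already been reduced to (timestamp, recursive address) pairs for a single query name; the log format and both function names are hypothetical.

```python
def inter_query_times(log):
    """Return Δt values (seconds) between consecutive queries from each recursive.

    `log` is an iterable of (timestamp_seconds, recursive_ip) tuples.
    """
    last_seen = {}
    deltas = []
    for ts, resolver in sorted(log):
        if resolver in last_seen:
            deltas.append(ts - last_seen[resolver])
        last_seen[resolver] = ts
    return deltas


def fraction_before_ttl(deltas, ttl=3600):
    """Share of re-queries that arrived before the TTL had expired."""
    if not deltas:
        return 0.0
    return sum(1 for d in deltas if d < ttl) / len(deltas)


sample_log = [(0, "192.0.2.1"), (100, "192.0.2.2"), (1000, "192.0.2.2"),
              (3600, "192.0.2.1"), (7300, "192.0.2.1")]
deltas = inter_query_times(sample_log)
print(deltas, fraction_before_ttl(deltas))
```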

SLIDE 16

Part 2: caching in production zones

  • Most resolvers re-query after about 3600s (the .nl TTL)
  • 28% do not respect the 1h TTL
  • Yes, our experiments behave like a real zone
  • (We also look into the Roots; see paper [4])

[Figure: CDF of Δt]

SLIDE 17

OK, so what do we have so far?

  • We know how caching works in the wild (both RIPE Atlas and .nl)
  • Time to move to Part 3: emulate DDoS
  • Goal: understand client experience under DDoS

SLIDE 18

Part 3: Emulating DDoS

  • Similar setup to the other experiments
  • Emulate DDoS: drop incoming queries at certain rates at the authoritative servers, with iptables
  • Question: (when) do caches protect clients?
  • Or: why do some DDoS attacks seem to have more impact?
  • We show only a few experiments; many more are in the paper
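
One way such a drop rule might be expressed is with the iptables `statistic` match. The slides do not show the actual rules, so the chain, port, and match used here are assumptions; `drop_rule` is a hypothetical helper that only builds the command line and does not run it.

```python
def drop_rule(loss: float, port: int = 53, proto: str = "udp") -> list:
    """Build an iptables command that drops a fraction `loss` of incoming queries."""
    if not 0.0 <= loss <= 1.0:
        raise ValueError("loss must be between 0 and 1")
    rule = ["iptables", "-A", "INPUT", "-p", proto, "--dport", str(port)]
    if loss < 1.0:
        # Drop each matching packet independently with the given probability.
        rule += ["-m", "statistic", "--mode", "random", "--probability", str(loss)]
    return rule + ["-j", "DROP"]


# Complete failure, then 50% and 90% packet loss.
for loss in (1.0, 0.5, 0.9):
    print(" ".join(drop_rule(loss)))
```

A full emulation would presumably need a matching rule for TCP and a way to remove the rules when the emulated attack ends.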

SLIDE 19

Scenario A: all servers DOWN

  • Worst nightmare for a DNS operator
  • Only resolvers’ caches can save clients
  • TTL=3600s (1 hour)
  • We probe every 10 minutes
  • At t = 10min, we drop all packets

SLIDE 20

Complete DDoS: TTL: 60min, 100% failure

[Figure: answers (OK, SERVFAIL, No answer) vs. minutes after start, with cache-only and cache-expired phases marked]

Figure 3: Scenario A: 100% failure after 10min, TTL: 60min

  • DDoS starts after 1st query (fresh cache)
  • During DDoS: 35%-70% of clients are served (from cache)
  • After cache expires: only 0.2% of clients are served (serve stale)
  • draft-ietf-dnsop-serve-stale-00
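
The difference between a plain cache and one that serves stale answers can be sketched as below. This is a simplified illustration of the idea, not the behavior specified in the draft or implemented by any resolver; `StaleCache` and its `upstream` callable are hypothetical.

```python
class StaleCache:
    """Toy resolver cache that can optionally serve expired answers."""

    def __init__(self, upstream, serve_stale=False):
        self.upstream = upstream        # callable(name) -> (value, ttl); raises OSError on failure
        self.serve_stale = serve_stale
        self.entries = {}               # name -> (value, expires_at)

    def resolve(self, name, now):
        entry = self.entries.get(name)
        if entry and now < entry[1]:
            return entry[0]             # fresh cache hit
        try:
            value, ttl = self.upstream(name)
        except OSError:
            if entry and self.serve_stale:
                return entry[0]         # expired, but better than nothing
            return None                 # the client would see SERVFAIL or no answer
        self.entries[name] = (value, now + ttl)
        return value
```

With the authoritatives unreachable, both variants keep answering until the TTL runs out; after that, only the serve-stale variant still returns the old record.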

SLIDE 21

Complete DDoS: changing cache freshness

  • Scenario B: Cache freshness: about to expire
  • How will clients experience the DDoS?

[Figure: answers (OK, SERVFAIL, No answer) vs. minutes after start, with normal, cache-only, and normal phases marked]

Figure 4: Scenario B: 100% failure after 60min, TTL: 60min

  • Cache much less effective (as it times out near the attack)
  • Fragmented caches help some (by filling later)

SLIDE 23

Complete DDoS: TTL record influence

  • Influence of TTL: reducing from 60min to 30min
  • How will clients experience the DDoS?

[Figure: answers (OK, SERVFAIL, No answer) vs. minutes after start, with normal, cache-only, cache-expired, and normal phases marked]

Figure 5: Scenario C: 100% failure after 60min, TTL: 30min

  • User experience worsens with a shorter TTL
  • Ops: choose the TTL of your records wisely when engineering for DDoS

SLIDE 24

Discussion: complete DDoS

  • Caching is partially successful during a complete DDoS
  • Ops: don’t expect protection for clients to last as long as your TTL; it depends on their cache state
  • Serving stale content provides the last resort for a Doomsday scenario
  • some ops (Google, OpenDNS) seem to do it, but it is not widespread yet
  • TTL of records: the shorter you set them, the less you protect users during a complete DDoS

SLIDE 25

Partial DDoS

  • Not all DDoS attacks are strong enough to bring all servers down
  • Some lead to partial failure (Root DNS Nov 2015 [3])
  • Partial failure: some of the available authoritatives fail to answer all queries, or take longer to answer; users then experience longer latencies
  • In this case, how would users experience the attack?

SLIDE 26

Experiment E: 50% success DDoS, TTL: 30min

[Figure: answers (OK, SERVFAIL, No answer) vs. minutes after start, with 50% packet loss on both NSes between normal periods; latency (ms) vs. minutes after start, showing median, mean, 75th-percentile, and 90th-percentile RTT]

Good! Most clients are happy, as they retry (but takes longer)

SLIDE 27

Experiment H: 90% success DDoS, TTL: 30min

[Figure: answers (OK, SERVFAIL, No answer) vs. minutes after start, with 90% packet loss on both NSes between normal periods; latency (ms) vs. minutes after start, showing median, mean, 75th-percentile, and 90th-percentile RTT]

Good! Even at 90% packet loss with TTL 30min, most clients (60%) get an answer!! Good Engineering!
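
A back-of-the-envelope way to see why retries help: if each attempt is lost independently with probability p, the chance that at least one of k attempts gets through is 1 - p^k. The sketch below computes that; the independence assumption and the attempt counts are illustrative and are not the model or the measurements from the paper, where caching also contributes.

```python
def answer_probability(loss: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries is not dropped."""
    if not 0.0 <= loss <= 1.0 or attempts < 0:
        raise ValueError("loss must be in [0, 1] and attempts non-negative")
    return 1.0 - loss ** attempts


for attempts in (1, 3, 7):
    print(attempts, "attempts at 90% loss:", round(answer_probability(0.9, attempts), 3))
```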

SLIDE 28

Experiment I: 90% success DDoS, TTL: 1min

  • What is the TTL’s influence in a partial DDoS?

[Figure: answers (OK, SERVFAIL, No answer) vs. minutes after start, with 90% packet loss on both NSes between normal periods; latency (ms) vs. minutes after start, showing median, mean, 75th-percentile, and 90th-percentile RTT]

Even with no caching (TTL 1min), 27% get an answer: stale + retries

SLIDE 29

Retries cost: hammering Auth servers

  • Part of DNS resilience is that recursives keep on retrying
  • There’s a cost to it, however: 8.1x in case of no caching!
  • Implications: Ops: be ready for friendly fire
  • usually not noticed during DDoS
  • If your overprovisioning level is 10x, know that 8.1x is friendly fire

[Figure: queries received (NS, A-for-NS, AAAA-for-NS, AAAA-for-PID) vs. minutes after start, with 90% packet loss on both NSes between normal periods]

Figure 6: Queries received at Auth Servers. Experiment I: 90% success DDoS, TTL: 1min

SLIDE 30

Implications

  • Caching and retries work really well
  • provided some authoritative stays partially up
  • and caches last longer than DDoS (as in TLDs, not in CDNs)
  • For DNS Ops: make one auth very strong? (careful with load distribution, see [5])

  • Explains prior root DDoS outcomes

SLIDE 31

Implications

  • There is a clear trade-off between TTL and DNS resilience
  • provided caches are filled and not about to expire
  • Many commercial websites have short TTLs
  • explains the pain of Dyn’s customers and users’ perception
  • shorter TTLs give them quicker management options

(Amazon EC2 resolvers cap all answer TTL to 60s [7])

SLIDE 32

Conclusions

  • First study to evaluate DNS resilience to DDoS from the user’s perspective
  • Evaluate design choices of various vendors using measurements
  • Caching and retries: important part of DNS resilience
  • Good engineering: thanks to all IETFers/devs who have built this
  • Experiments show when they help and when they won’t
  • Consistent with recent outcomes
  • DNS community:
  • There’s a clear trade-off between TTL and DDoS robustness, choose wisely
  • Serving stale content is controversial, some deploy it

SLIDE 33

Questions?

  • Paper: https://www.isi.edu/~johnh/PAPERS/Moura18b.pdf
  • Contact: giovane.moura@sidn.nl
  • Thanks RIPE NCC and reviewers of various drafts:
  • Wes Hardaker, Duane Wessels, Warren Kumari, Stephane Bortzmeyer, Maarten Aertsen, Paul Hoffman, our shepherd Mark Allman, and the anonymous IMC reviewers

SLIDE 34

References i

[1] Sam Kottler. February 28th DDoS Incident Report | Github Engineering, March 2018. https://githubengineering.com/ddos-incident-report/.

[2] Carlos Morales. NETSCOUT Arbor Confirms 1.7 Tbps DDoS Attack; The Terabit Attack Era Is Upon Us, March 2018. https://www.arbornetworks.com/blog/asert/netscout-arbor-confirms-1-7-tbps-ddos-attack-terabit-attack-

SLIDE 35

References ii

[3] Giovane C. M. Moura, Ricardo de O. Schmidt, John Heidemann, Wouter B. de Vries, Moritz Müller, Lan Wei, and Christian Hesselman. Anycast vs. DDoS: Evaluating the November 2015 root DNS event. In Proceedings of the ACM Internet Measurement Conference, November 2016.

SLIDE 36

References iii

[4] Giovane C. M. Moura, John Heidemann, Moritz Müller, Ricardo de O. Schmidt, and Marco Davids. When the dike breaks: Dissecting DNS defenses during DDoS (extended). In Proceedings of the ACM Internet Measurement Conference, October 2018.

[5] Moritz Müller, Giovane C. M. Moura, Ricardo de O. Schmidt, and John Heidemann. Recursives in the wild: Engineering authoritative DNS servers. In Proceedings of the ACM Internet Measurement Conference, pages 489–495, London, UK, 2017.

SLIDE 37

References iv

[6] Nicole Perlroth. Hackers used new weapons to disrupt major websites across U.S. New York Times, page A1, Oct. 22 2016.

[7] Alec Peterson. Ec2 resolver changing ttl on dns answers? Post on the DNS-OARC dns-operations mailing list, https://lists.dns-oarc.net/pipermail/dns-operations/2017-November/017043.html, November 2017.

SLIDE 38

References v

[8] José Jair Santanna, Roland van Rijswijk-Deij, Rick Hofstede, Anna Sperotto, Mark Wierbosch, Lisandro Zambenedetti Granville, and Aiko Pras. Booters—an analysis of DDoS-as-a-Service attacks. In Proceedings of the 14th IFIP/IEEE International Symposium on Integrated Network Management, Ottawa, Canada, May 2015. IFIP.

[9] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. On measuring the client-side DNS infrastructure. In Proceedings of the 2013 ACM Conference on Internet Measurement Conference, pages 77–90. ACM, October 2013.
