samuel.marchal@uni.lu 3/12/12
Semantic based DNS Forensics
Samuel Marchal, Jérôme François, Radu State and Thomas Engel

Outline
1 Motivations
2 Semantic analysis
3 Experiments and Results
4 Conclusion
Motivations Semantic analysis Experiments and Results Conclusion
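[Figure: a compromised host sends DNS requests for malwareupdate.com and commandandcontrol.net to its recursive DNS server, which forwards them to the authoritative DNS server for the malicious domains. The DNS replies return IP addresses (123.45.67.8, 56.7.89.10), and the host then requests a malware update, sends a request to the C&C server, or connects to a phishing website. Elements shown: compromised host, bots, DNS recursive server, authoritative DNS server, malware update server, C&C server, phishing web servers; addresses shown: 56.7.89.10, 123.45.67.8, 76.54.32.1.]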
· malware updates
· botnet C&C
· phishing
· backdoor communications
· etc.
◮ find proof of infection (requests for malicious domains)
◮ reduced amount of data to analyse: DNS is a meager [...]
◮ DNS analysis preserves users’ anonymity
◮ User reports + manual checking
◮ DNS packet fields analysis + classification via [...]
◮ domain records removed: data is no longer available
◮ Domain name based analysis:
    ◮ number of domain levels
    ◮ relative position of labels
    ◮ domain length
    ◮ etc.
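The name-based features listed above can be illustrated with a minimal sketch. The function name `lexical_features` and the way label positions are counted (from the TLD upwards) are assumptions made for illustration; the slides do not define these features precisely.

```python
def lexical_features(domain: str) -> dict:
    """Compute simple name-based features for one fully qualified domain name."""
    name = domain.lower().rstrip(".")  # ignore case and a trailing root dot
    labels = name.split(".")
    return {
        "levels": len(labels),  # number of domain levels
        "length": len(name),    # domain length, dots included
        # Position of each label counted from the TLD (0 = TLD). This is only
        # one possible reading of "relative position of labels" (assumption).
        "positions": {pos: label for pos, label in enumerate(reversed(labels))},
    }


features = lexical_features("www.visa-sweden.mastercard.forever4c.com")
print(features["levels"], features["length"])
```

Such features describe only the shape of a name, not its meaning, which is the gap the semantic approach in the following slides addresses.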
◮ Domain names are meant to be meaningful
◮ Observations: malicious domains often use words [...]
    ◮ www.visa-sweden.mastercard.forever4c.com
    ◮ myvodafone.vodafone-security-update78.systemknight.com
    ◮ paypal.com-us.webscr.cmd-homeelocale.gumuspena.com
◮ Issue: single domains are not significant enough
◮ ⇒ [...]
◮ Knowing group of malicious and legitimate domains [...]
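[Figure: word extraction from a domain name]
Input: myvodafone.vodafone-security-update78.systemknights.com
◮ ‘.’ splitting: myvodafone, vodafone-security-update78, systemknights
◮ ‘-’ splitting: vodafone, security, update78
◮ number extraction: update78 → update, 78
◮ word segmentation: myvodafone → my, vodafone; systemknights → system, knights
Extracted words: my, vodafone, vodafone, security, update, system, knights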
◮ distword = {(my, 0.125), (vodafone, 0.25), (security, 0.125), ...}
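The extraction steps on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the toy `VOCABULARY`, the dynamic-programming `segment` helper, dropping the TLD, and counting the numeric token in the distribution are all hypothetical choices. With them, the example domain yields eight tokens, which agrees with the frequencies shown for distword.

```python
import re
from collections import Counter

# Hypothetical toy vocabulary; the slides do not say which word list is used.
VOCABULARY = {"my", "vodafone", "security", "update", "system", "knights"}


def segment(text, vocabulary):
    """Split text into the fewest vocabulary words; return [text] if impossible."""
    best = [None] * (len(text) + 1)  # best[i] = segmentation of text[:i]
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in vocabulary:
                candidate = best[start] + [text[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [text]


def extract_words(domain, vocabulary=VOCABULARY):
    """Apply '.' splitting, '-' splitting, number extraction and segmentation."""
    labels = domain.lower().split(".")[:-1]  # assumption: the TLD is dropped
    tokens = []
    for label in labels:                     # '.' splitting
        for part in label.split("-"):        # '-' splitting
            for piece in re.findall(r"[a-z]+|[0-9]+", part):  # number extraction
                if piece.isdigit():
                    tokens.append(piece)     # assumption: numbers are kept as tokens
                else:
                    tokens.extend(segment(piece, vocabulary))  # word segmentation
    return tokens


def word_distribution(tokens):
    """Relative frequency of each token."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}


tokens = extract_words("myvodafone.vodafone-security-update78.systemknights.com")
dist = word_distribution(tokens)
print(dist)
```

A real system would need a much larger vocabulary and a policy for strings that cannot be segmented; here such strings are simply kept whole.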
◮ calculate a similarity score (semantic relatedness)
◮ give the n most related words to w
◮ based on dictionary (Wikipedia, BNC, PubMed, etc.)
[Equation fragment: the term $I(w_2, r, w)$ indexed over $(r, w) \in T(w_2)$; the rest of the formula is missing.]
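$\mathrm{Sim}_1(A, B) = \ldots$ (indexed over $w_A \in W_A$; the rest of the expression is missing)
$\mathrm{Sim}_2(A, B) = \ldots$ (indexed over $w_A \in W_A$; the rest of the expression is missing)
$\mathrm{Sim}'_3(A, B) = \ldots$ (indexed over $w \in W_A$; the rest of the expression is missing)
$\Longrightarrow \mathrm{Sim}_3(A, B) = \mathrm{Sim}'_3(A, B) + \mathrm{Sim}'_3(B, A)$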
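One consequence of the last relation is worth spelling out: whatever the directed score Sim′3 looks like, swapping A and B only reorders the two terms, so Sim3 is symmetric in its arguments.

```latex
\begin{align*}
\mathrm{Sim}_3(B, A) &= \mathrm{Sim}'_3(B, A) + \mathrm{Sim}'_3(A, B) \\
                     &= \mathrm{Sim}'_3(A, B) + \mathrm{Sim}'_3(B, A) \\
                     &= \mathrm{Sim}_3(A, B).
\end{align*}
```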
◮ 10 sets of around 13,000 domains each
    ◮ 5 legitimate (Alexa + passive DNS)
    ◮ 5 malicious (PhishTank, DNS-BH, MDL)
[Heat map of pairwise scores between the datasets; colour scale from 0.7 to 1.00]

        leg-5  leg-4  leg-3  leg-2  leg-1  mal-5  mal-4  mal-3  mal-2
mal-1   0.776  0.795  0.793  0.789  0.785  0.955  0.962  0.965  0.975
mal-2   0.782  0.800  0.798  0.797  0.797  0.965  0.968  0.973
mal-3   0.772  0.796  0.793  0.788  0.784  0.951  0.962
mal-4   0.783  0.804  0.804  0.800  0.796  0.953
mal-5   0.769  0.785  0.784  0.782  0.772
leg-1   0.946  0.948  0.952  0.938
leg-2   0.915  0.924  0.922
leg-3   0.936  0.934
leg-4   0.935
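To make the pattern in these scores explicit, the sketch below groups the values by pair type (malicious/malicious, legitimate/legitimate, mixed) and summarises each group. The numbers are copied from the slide; the names `COLUMNS`, `ROWS` and `group_scores` are hypothetical helpers introduced for this illustration.

```python
from statistics import mean

COLUMNS = ["leg-5", "leg-4", "leg-3", "leg-2", "leg-1",
           "mal-5", "mal-4", "mal-3", "mal-2"]

# Upper-triangular scores as shown on the slide; each row lines up with
# the first len(row) entries of COLUMNS.
ROWS = {
    "mal-1": [0.776, 0.795, 0.793, 0.789, 0.785, 0.955, 0.962, 0.965, 0.975],
    "mal-2": [0.782, 0.800, 0.798, 0.797, 0.797, 0.965, 0.968, 0.973],
    "mal-3": [0.772, 0.796, 0.793, 0.788, 0.784, 0.951, 0.962],
    "mal-4": [0.783, 0.804, 0.804, 0.800, 0.796, 0.953],
    "mal-5": [0.769, 0.785, 0.784, 0.782, 0.772],
    "leg-1": [0.946, 0.948, 0.952, 0.938],
    "leg-2": [0.915, 0.924, 0.922],
    "leg-3": [0.936, 0.934],
    "leg-4": [0.935],
}


def group_scores(rows, columns):
    """Group scores by pair type: 'leg-leg', 'leg-mal' or 'mal-mal'."""
    groups = {"leg-leg": [], "leg-mal": [], "mal-mal": []}
    for row_name, values in rows.items():
        for col_name, value in zip(columns, values):
            key = "-".join(sorted([row_name[:3], col_name[:3]]))
            groups[key].append(value)
    return groups


groups = group_scores(ROWS, COLUMNS)
for kind, values in groups.items():
    print(kind, len(values), min(values), round(mean(values), 4), max(values))
```

In this table every same-type pair scores at least 0.915, while every mixed pair scores at most 0.804, so the two kinds of dataset are cleanly separated at this set size.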
◮ for large sets (13,000 domains): the approach works
◮ what is the minimum number of domains a set needs in order to be evaluated?
[Figure: value of Sim1 between datasets (y-axis, 0.1 to 0.7) against the number of domains in the dataset (x-axis, 50 to 200); legend: leg, mal; a Sim3 label also appears in the plot.]
◮ semantic similarity scoring
◮ apply to identification of malicious domain set
◮ useful for first step of forensic analysis
◮ able to distinguish malicious from legitimate
◮ ... for sets of at least 10 domains
◮ improve similarity metrics
◮ correlate with IP Flow records