BotGraph: Large Scale Spamming Botnet Detection


BotGraph: Large Scale Spamming Botnet Detection. Yao Zhao, Yinglian Xie*, Fang Yu*, Qifa Ke*, Yuan Yu*, Yan Chen, and Eliot Gillum. EECS Department, Northwestern University; *Microsoft Research Silicon Valley; Microsoft Corporation.


  1. BotGraph: Large Scale Spamming Botnet Detection
     Yao Zhao, Yinglian Xie*, Fang Yu*, Qifa Ke*, Yuan Yu*, Yan Chen, and Eliot Gillum‡
     EECS Department, Northwestern University; *Microsoft Research Silicon Valley; ‡Microsoft Corporation

  2. Web-Account Abuse Attack
     [Figure: attack diagram. Labels: spammer's server, zombie (compromised host), user/password, CAPTCHA solver, sample CAPTCHA text "RDSXXTD3".]

  3. Problems and Challenges
     • Detect Web-account abuse with Hotmail logs
       – Input: user activity traces (signup, login, email-sending records)
       – Goal: stop aggressive account signup, limit outgoing spam
     • Algorithmic challenge:
       – Attack is stealthy: individual account detection is difficult
       – Attack is large scale: finding correlated activities
       – Low false positive and false negative rates
     • Engineering challenge:
       – Large user population: >500 million accounts
       – Large data volume: 300GB-400GB data per month

  4. The BotGraph System
     • A graph-based approach to attack detection
       – A large user-user graph to capture bot-account correlations
       – Identify 26M bot-accounts with a low false positive rate in two months
     • Efficient implementation using Dryad/DryadLINQ
       – Graph construction/analysis is not easily parallelizable
       – Hundreds of millions of nodes, hundreds of billions of edges
       – Process 200GB-300GB data in 1.5 hours with a 240-machine cluster
     The first to provide a systematic solution to the new attack

  5. System Architecture
     1. History-based algorithm to detect aggressive signups
        – Signup data (ID, IP, time) → EWMA-based change detection → aggressive signups → verification & prune → signup botnets
     2. Graph-based algorithm to find correlations
        – Login data (ID, IP, time) → graph generation → login graph → random-graph-based clustering → suspicious clusters → verification & prune → spamming botnets
     3. Parallel algorithm on DryadLINQ clusters
     Additional input shown in the figure: sendmail data (ID, time, # of recipients)

  6. Detect Aggressive Signups
     [Figure: number of signup accounts per day, 1-Jul to 9-Jul (y-axis 5 to 25). Series: signup count, EWMA prediction. Annotations: large prediction error, back to normal.]
     • Simple and efficient
     • Detect 20 million malicious accounts in 2 months
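The EWMA-based change detection named on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `detect_signup_anomalies`, the smoothing factor `alpha`, and the two error thresholds are hypothetical values chosen for the example, and the daily counts are made up.

```python
def detect_signup_anomalies(daily_counts, alpha=0.5, min_error=10, ratio=3.0):
    """Flag days whose signup count far exceeds the EWMA prediction.

    All parameter values are illustrative assumptions, not values from the slides.
    """
    anomalies = []
    prediction = daily_counts[0]  # seed the forecast with the first observation
    for day, count in enumerate(daily_counts[1:], start=1):
        error = count - prediction
        if error > min_error and count > ratio * max(prediction, 1):
            # Large prediction error: report the day and keep the old forecast,
            # so the spike does not contaminate the history (an assumption).
            anomalies.append(day)
        else:
            prediction = alpha * count + (1 - alpha) * prediction
    return anomalies


# Made-up signup counts for one IP over nine days, with a two-day burst.
counts = [2, 3, 2, 3, 22, 21, 3, 2, 3]
print(detect_signup_anomalies(counts))
```

Because each series is processed independently, the same loop could run per IP after partitioning the signup log by IP, which is how the parallel version is described later in the deck.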

  7. System Architecture (figure repeated from slide 5)
     1. History-based algorithm on signup detection
     2. Graph-based algorithm on login detection
     3. Parallel algorithm on DryadLINQ clusters

  9. Detect Stealthy Accounts by Graphs
     • Observation: bot-accounts work collaboratively
       A user-user graph to model behavior similarities
     • Normal users
       – Share IP addresses in one AS with DHCP assignment
     • Bot-users
       – Likely to share different IPs across ASes

  10. User-user Graph
     • Node: Hotmail account
     • Edge weight: # of ASes of the shared IP addresses
       – Consider edges with weight > 1
     • Key observations
       – Bot-users form a giant connected component while normal users do not
       – Interpreted by random graph theory
     [Figure: example graph over User1 to User6 with edges labeled 1 AS, 2 ASes, 3 ASes, 4 ASes, and 5 ASes.]
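The edge-weight definition above can be illustrated for a single pair of accounts. This is a small sketch, not code from the presentation: the helper `edge_weight`, the `ip_to_as` lookup table, the documentation-range IP addresses, and the private-use AS numbers are all assumptions made for the example.

```python
def edge_weight(logins_a, logins_b, ip_to_as):
    """Number of distinct ASes among the IPs that both accounts logged in from."""
    shared_ips = set(logins_a) & set(logins_b)
    return len({ip_to_as[ip] for ip in shared_ips})


# Hypothetical data: documentation-range IPs mapped to private-use AS numbers.
ip_to_as = {
    "192.0.2.1": 64512,
    "192.0.2.2": 64512,
    "198.51.100.7": 64513,
    "203.0.113.9": 64514,
}
user1 = ["192.0.2.1", "198.51.100.7", "203.0.113.9"]
user2 = ["192.0.2.1", "192.0.2.2", "198.51.100.7"]
user3 = ["192.0.2.1", "192.0.2.2"]

w12 = edge_weight(user1, user2, ip_to_as)  # shared IPs span two ASes
w23 = edge_weight(user2, user3, ip_to_as)  # two shared IPs, but one AS

# Only edges with weight > 1 are considered.
keep_12 = w12 > 1
keep_23 = w23 > 1
```

Counting ASes rather than IPs means that two normal users who cycle through DHCP addresses inside one AS still get weight 1 and are dropped by the weight > 1 filter.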

  11. Random Graph Theory
     • Random graph G(n, p)
       – n nodes; each pair of nodes has an edge with probability p; average degree d = (n - 1) · p
     • Theorem
       – If d < 1, then with high probability the largest component in the graph has size O(log n): no large connected subgraph
       – If d > 1, then with high probability the graph contains a giant component with size on the order of O(n): most nodes are in one connected subgraph
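The threshold behavior at d = 1 can be observed with a small simulation. The sketch below samples G(n, p) with p = d / (n - 1) and measures the largest connected component with a union-find structure; the function name, the graph size n = 1000, and the fixed random seed are assumptions chosen only to keep the example small and repeatable.

```python
import random


def largest_component_size(n, d, seed=0):
    """Sample G(n, p) with p = d / (n - 1); return the largest component size."""
    rng = random.Random(seed)
    p = d / (n - 1)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each pair of nodes gets an edge independently with probability p.
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                parent[find(i)] = find(j)

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values())


n = 1000
small = largest_component_size(n, d=0.5)  # below the threshold
giant = largest_component_size(n, d=2.0)  # above the threshold
print(small, giant)
```

With d below 1 the largest component stays a tiny fraction of n, while with d above 1 a single component absorbs most of the nodes, which is the contrast the deck draws between normal users and bot-users.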

  12. Graph-based Bot-user Detection
     • Step 1: detect giant connected components from the user-user graph
     • Step 2: hierarchical algorithm to identify the correct groupings
       – Different bot-user groups may be mixed
       – Difficult to choose a fixed edge threshold
       – Easier validation with correct group statistics
     • Step 3: prune normal-user groups
       – Due to national proxies, cell phone users, Facebook applications, etc.

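  13. Hierarchical Bot-Group Extraction
     [Figure: hierarchy of connected components G, A, B, C, D, E obtained with edge-weight thresholds T=2, T=3, and T=4; three components are marked as the 1st, 2nd, and 3rd groups.]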
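One way to picture the hierarchical extraction is to recompute connected components while raising the edge-weight threshold T, recording the resulting tree. The sketch below does only that; it is a simplified illustration, not the authors' algorithm. The function names, the `max_threshold` and `min_size` parameters, and the toy graph are assumptions, and the step that picks the "correct" groups from the tree using group statistics is deliberately left out.

```python
from collections import defaultdict


def components(nodes, edges, threshold):
    """Connected components over `nodes`, using only edges with weight >= threshold."""
    adj = defaultdict(set)
    for (u, v), w in edges.items():
        if w >= threshold and u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:  # iterative depth-first search
            x = stack.pop()
            comp.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        result.append(comp)
    return result


def build_hierarchy(nodes, edges, threshold=2, max_threshold=4, min_size=2):
    """Return a tree of components; children use the next higher threshold."""
    tree = []
    for comp in components(nodes, edges, threshold):
        if len(comp) < min_size:
            continue  # isolated accounts are not reported as groups
        children = []
        if threshold < max_threshold:
            children = build_hierarchy(comp, edges, threshold + 1,
                                       max_threshold, min_size)
        tree.append({"threshold": threshold,
                     "members": sorted(comp),
                     "children": children})
    return tree


# Toy graph: edge weights are numbers of shared ASes (made-up values).
edges = {("a", "b"): 4, ("b", "c"): 2, ("c", "d"): 3, ("d", "e"): 3}
tree = build_hierarchy({"a", "b", "c", "d", "e"}, edges)
```

In this toy graph the single component at T=2 splits into {a, b} and {c, d, e} at T=3, and only {a, b} survives at T=4, which mirrors how mixed bot-user groups separate as the threshold grows.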

  14. System Architecture (figure repeated from slide 5)
     1. History-based algorithm on signup detection
     2. Graph-based algorithm on login detection
     3. Parallel algorithm on DryadLINQ clusters

  15. Parallel Implementation on DryadLINQ
     • EWMA-based signup abuse detection
       – Partition data by IP
       – Can achieve real-time detection
     • User-user graph construction
       – Two algorithms and optimizations
       – Process 200GB-300GB data in 1.5 hours with 240 machines
     • Connected component extraction
       – Divide and conquer
       – Process a graph of 8.6 billion edges in 7 minutes
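The slide gives no detail on the divide-and-conquer step, so the following is only one plausible reading, written in plain Python rather than DryadLINQ: split the edge list into partitions, reduce each partition locally to a spanning forest (which preserves connectivity with far fewer edges), then merge the reduced forests. The class and function names and the partition count are assumptions.

```python
from collections import defaultdict


class UnionFind:
    """Minimal union-find over hashable node IDs."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def spanning_forest(edges):
    """Keep only the edges that join two previously separate components."""
    uf = UnionFind()
    return [(u, v) for u, v in edges if uf.union(u, v)]


def connected_components(edges, num_partitions=4):
    # Divide: each partition could be reduced on a different machine.
    partitions = [edges[i::num_partitions] for i in range(num_partitions)]
    reduced = [e for part in partitions for e in spanning_forest(part)]
    # Conquer: merge the much smaller reduced edge lists.
    uf = UnionFind()
    for u, v in reduced:
        uf.union(u, v)
    groups = defaultdict(set)
    for node in uf.parent:
        groups[uf.find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


edges = [(1, 2), (2, 3), (1, 3), (4, 5), (6, 7), (5, 6), (3, 1), (8, 9)]
result = connected_components(edges)
```

The local reduction is what makes the merge cheap: a partition with many redundant edges contributes at most one edge per node it touches.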

  16. Graph Construction 1: Simple Data Parallelism
     • Potential edges
       – Select ID group by IP (Map)
       – Generate potential edges (ID_i, ID_j, IP_k) (Reduce)
     • Edge weights
       – Select IP group by ID pair (Map)
       – Calculate edge weight (Reduce)
     • Problem
       – Weight-1 edges are two orders of magnitude more numerous than the others
       – Their computation/communication is unnecessary
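The two map/reduce phases of this method can be mimicked in a single process to show the data flow. This is a sketch under stated assumptions, not the DryadLINQ job from the presentation: `build_user_graph`, the `ip_to_as` table, and the sample login records are hypothetical, and in-memory dictionaries stand in for the distributed group-by operations.

```python
from collections import defaultdict
from itertools import combinations


def build_user_graph(login_records, ip_to_as):
    """login_records: iterable of (user_id, ip). Returns {(id_i, id_j): weight}."""
    # Phase 1, map: group user IDs by IP.
    ids_by_ip = defaultdict(set)
    for user_id, ip in login_records:
        ids_by_ip[ip].add(user_id)

    # Phase 1, reduce: emit a potential edge (ID_i, ID_j, IP_k) per co-login pair.
    potential_edges = [
        (id_i, id_j, ip)
        for ip, ids in ids_by_ip.items()
        for id_i, id_j in combinations(sorted(ids), 2)
    ]

    # Phase 2, map: group shared IPs by ID pair.
    ips_by_pair = defaultdict(set)
    for id_i, id_j, ip in potential_edges:
        ips_by_pair[(id_i, id_j)].add(ip)

    # Phase 2, reduce: weight = number of distinct ASes among the shared IPs.
    return {pair: len({ip_to_as[ip] for ip in ips})
            for pair, ips in ips_by_pair.items()}


# Hypothetical records: documentation-range IPs, private-use AS numbers.
ip_to_as = {"192.0.2.1": 64512, "198.51.100.7": 64513}
records = [
    ("u1", "192.0.2.1"), ("u2", "192.0.2.1"), ("u3", "192.0.2.1"),
    ("u1", "198.51.100.7"), ("u2", "198.51.100.7"),
]
weights = build_user_graph(records, ip_to_as)
strong = {pair: w for pair, w in weights.items() if w > 1}
```

Even in this tiny example two of the three edges have weight 1 and are discarded at the end, which is the waste that motivates the selective-filtering method on the next slides.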

  17. Graph Construction 2: Selective Filtering

  18. Comparison of Two Algorithms
     • Method 1
       – Simple and scalable
     • Method 2
       – Optimized to filter out weight-1 edges
       – Utilizes Join functionality, data compression, and broadcast optimization

  19. Detection Results
     • Data description
       – Two datasets
         • Jun 2007 and Jan 2008
       – Three types of data
         • Signup log (IP, ID, time)
         • Login log (IP, ID, time)
           – 500M users and 200~300GB data per month
         • Sendmail log (ID, time, # of recipients)

  20. Detection of Signup Abuse

  21. Detection by User-user Graph

  22. Validations
     • Manual check
       – Sampled groups verified by the Hotmail team
       – Almost no false positives
     • Comparison with known spamming users
       – Detect 86% of the accounts that users complained about
       – Up to 54% of detected accounts are our new findings
     • Email sending sizes per group
       – Most groups have a sharp peak
       – The remaining groups contain several peaks
     • False positive estimation
       – Naming pattern (0.44%)
       – Signup time (0.13%)

  23. Possible to Evade BotGraph?
     • Evade signup detection: be stealthy
     • Evade graph-based detection
       – Fixed IP/AS binding
         • Low utilization rate
         • Bot-accounts bound to one host are easy to group
       – Be stealthy (send as few emails as a normal user)
     Either option severely limits attackers' spam throughput

  24. Conclusions
     • A graph-based approach to attack detection
       – Identify 26M bot-accounts with a low false positive rate in two months
     • Efficient implementation using Dryad/DryadLINQ
       – Process 200GB-300GB data in 1.5 hours with a 240-machine cluster
     Large-scale data mining for network security is effective and practical

  25. Q & A? Thanks!
