BotGraph: Large Scale Spamming Botnet Detection




slide-1
SLIDE 1

BotGraph: Large Scale Spamming Botnet Detection

Yao Zhao

Yinglian Xie*, Fang Yu*, Qifa Ke*, Yuan Yu*, Yan Chen and Eliot Gillum‡
EECS Department, Northwestern University
Microsoft Research Silicon Valley*
Microsoft Corporation‡

slide-2
SLIDE 2

Web-Account Abuse Attack

[Figure: attack diagram with labels: zombie (compromised host), spammer's server, CAPTCHA solver, sample CAPTCHA text "RDSXXTD3", User/Pwd]

slide-3
SLIDE 3

Problems and Challenges

  • Detect Web-account Abuse with Hotmail Logs

– Input: user activity traces (signup, login, email-sending records)
– Goal: stop aggressive account signup, limit outgoing spam

  • Algorithmic challenge:

– Attack is stealthy: individual account detection is difficult
– Attack is large scale: requires finding correlated activities
– Low false positive and false negative rates

  • Engineering challenge:

– Large user population: >500 million accounts
– Large data volume: 300GB-400GB data per month

slide-4
SLIDE 4

The BotGraph System

  • A graph-based approach to attack detection

– A large user-user graph to capture bot-account correlations
– Identify 26M bot-accounts with a low false positive rate in two months

  • Efficient implementation using Dryad/DryadLINQ

– Graph construction/analysis is not easily parallelizable
– Hundreds of millions of nodes, hundreds of billions of edges
– Process 200GB-300GB data in 1.5 hours with a 240-machine cluster

The first to provide a systematic solution to the new attack

slide-5
SLIDE 5

System Architecture

[Figure: system architecture]

  • Inputs: signup data (ID, IP, time), login data (ID, IP, time), sendmail data (ID, time, # of recipients)
  • Signup path: signup data -> EWMA based change detection -> aggressive signups -> verification & prune -> signup botnets
  • Login path: login data -> graph generation -> login graph -> random graph based clustering -> suspicious clusters -> verification & prune (with sendmail data) -> spamming botnets

  • 1. History based algorithm to detect aggressive signups
  • 2. Graph-based algorithm to find correlations
  • 3. Parallel algorithm on DryadLINQ clusters
slide-6
SLIDE 6

Detect Aggressive Signups

[Figure: number of signup accounts per day, 1-Jul to 9-Jul (y-axis 5 to 25); series: signup count and EWMA prediction; annotations: "Large prediction error" and "Back to normal"]

  • Simple and efficient
  • Detect 20 million malicious accounts in 2 months
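
The idea of flagging a large prediction error against an EWMA forecast can be sketched as follows. This is a minimal illustration, not the deck's implementation: the smoothing factor `alpha`, the absolute `max_error` threshold, and the choice to freeze the forecast on flagged days are all assumptions, since the slides give no parameters.

```python
def ewma_anomalies(counts, alpha=0.5, max_error=10.0):
    """Return indices of days whose count exceeds the EWMA forecast by max_error.

    alpha and max_error are illustrative values, not taken from the slides.
    """
    flagged = []
    prediction = float(counts[0])  # seed the forecast with the first observation
    for day, observed in enumerate(counts[1:], start=1):
        if observed - prediction > max_error:
            # Large prediction error: flag the day and keep the forecast
            # unchanged so the burst does not contaminate the baseline.
            flagged.append(day)
        else:
            prediction = alpha * observed + (1 - alpha) * prediction
    return flagged


# Hypothetical daily signup counts for one IP, with a two-day burst.
daily_signups = [2, 3, 2, 3, 25, 22, 3, 2, 3]
print(ewma_anomalies(daily_signups))  # [4, 5]
```

Freezing the forecast during a burst is one way to let the series read as "back to normal" as soon as the counts drop again.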
slide-7
SLIDE 7

System Architecture

[Figure: system architecture diagram, as on slide 5]

  • 1. History based algorithm on signup detection
  • 2. Graph-based algorithm on login detection
  • 3. Parallel algorithm on DryadLINQ clusters
slide-8
SLIDE 8

Detect Stealthy Accounts by Graphs

  • Observation: bot-accounts work collaboratively
  • Normal users

– Share IP addresses in one AS with DHCP assignment

  • Bot-users

A user-user graph to model behavior similarities

slide-9
SLIDE 9

Detect Stealthy Accounts by Graphs

  • Observation: bot-accounts work collaboratively
  • Normal users

– Share IP addresses in one AS with DHCP assignment

  • Bot-users

– Likely to share different IPs across ASes

A user-user graph to model behavior similarities

slide-10
SLIDE 10

User-user Graph

  • Node: Hotmail account
  • Edge weight: # of ASes of the shared IP addresses

– Consider edges with weight > 1

  • Key observations

– Bot-users form a giant connected component while normal users do not
– Explained by random graph theory

[Figure: example user-user graph over User1 to User6, with edges labeled 1 AS, 2 ASes, 3 ASes, 4 ASes and 5 ASes]
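
As a concrete reading of the edge-weight definition, the sketch below counts the distinct ASes among the IP addresses two accounts share. The `ip_to_as` lookup table and the sample addresses are hypothetical; the slides do not say how IPs are mapped to ASes.

```python
def edge_weight(ips_a, ips_b, ip_to_as):
    """Number of distinct ASes among the IPs used by both accounts."""
    shared_ips = ips_a & ips_b
    return len({ip_to_as[ip] for ip in shared_ips})


# Hypothetical IP-to-AS table and per-account login IPs.
ip_to_as = {"192.0.2.1": "AS1", "192.0.2.2": "AS1", "198.51.100.7": "AS2"}
user1 = {"192.0.2.1", "198.51.100.7"}
user2 = {"192.0.2.1", "198.51.100.7"}
user3 = {"192.0.2.1", "192.0.2.2"}

print(edge_weight(user1, user2, ip_to_as))  # 2: shared IPs span two ASes
print(edge_weight(user1, user3, ip_to_as))  # 1: dropped by the weight > 1 rule
```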

slide-11
SLIDE 11

Random Graph Theory

  • Random graph G(n, p)

– n nodes; each pair of nodes is joined by an edge with probability p
– Average degree d = (n - 1) · p

  • Theorem

– If d < 1, then with high probability the largest component in the graph has size O(log n): no large connected subgraph
– If d > 1, then with high probability the graph contains a giant component of size on the order of n: most nodes are in one connected subgraph
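
The threshold at d = 1 can be observed empirically with a small simulation. The sketch below samples G(n, p) with p = d / (n - 1) and reports the fraction of nodes in the largest component; the function name, the union-find bookkeeping and the seed are illustrative choices, not part of the deck.

```python
import random


def largest_component_fraction(n, d, seed=0):
    """Sample G(n, p) with average degree d; return largest component size / n."""
    rng = random.Random(seed)
    p = d / (n - 1)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Each unordered pair of nodes gets an edge independently with probability p.
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_i] = root_j

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n
```

Calling `largest_component_fraction(1500, 0.5)` and `largest_component_fraction(1500, 2.0)` should give a small fraction in the first case and a majority of nodes in the second, matching the two regimes of the theorem.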

slide-12
SLIDE 12

Graph-based Bot-user Detection

  • Step 1: detect giant connected components from the user-user graph
  • Step 2: hierarchical algorithm to identify the correct groupings

– Different bot-user groups may be mixed
– Difficult to choose a fixed edge threshold
– Easier validation with correct group statistics

  • Step 3: prune normal-user groups

– Due to national proxies, cell phone users, Facebook applications, etc.

slide-13
SLIDE 13

Hierarchical Bot-Group Extraction

[Figure: hierarchy of components G, A, B, C, D and E extracted at increasing edge-weight thresholds T=2, T=3 and T=4, yielding a 1st, 2nd and 3rd group]
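
One way to picture the hierarchical extraction is to recompute connected components while raising the edge-weight threshold T, so that loosely tied groups split apart at higher thresholds. The sketch below is an assumption-laden illustration: the stopping rule (`max_threshold`, `min_size`) and the tree representation are hypothetical, and the deck does not spell out its own criteria.

```python
def components_at(nodes, edges, threshold):
    """Connected components using only edges with weight >= threshold."""
    adjacency = {v: set() for v in nodes}
    for (a, b), weight in edges.items():
        if weight >= threshold and a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # iterative depth-first search
            v = stack.pop()
            component.add(v)
            for u in adjacency[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        components.append(component)
    return components


def extract_groups(nodes, edges, threshold=2, max_threshold=4, min_size=2):
    """Build a tree of groups by re-splitting each component at threshold + 1."""
    tree = []
    for component in components_at(nodes, edges, threshold):
        if len(component) < min_size:
            continue  # isolated accounts are not treated as a group
        children = []
        if threshold < max_threshold:
            children = extract_groups(component, edges, threshold + 1,
                                      max_threshold, min_size)
        tree.append({"T": threshold, "members": sorted(component),
                     "children": children})
    return tree


# Hypothetical weighted user-user edges.
edges = {("a", "b"): 2, ("b", "c"): 3, ("c", "d"): 4, ("d", "e"): 2}
tree = extract_groups({"a", "b", "c", "d", "e"}, edges)
print(tree[0]["members"])                 # ['a', 'b', 'c', 'd', 'e'] at T=2
print(tree[0]["children"][0]["members"])  # ['b', 'c', 'd'] at T=3
```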

slide-14
SLIDE 14

System Architecture

[Figure: system architecture diagram, as on slide 5]

  • 1. History based algorithm on signup detection
  • 2. Graph-based algorithm on login detection
  • 3. Parallel algorithm on DryadLINQ clusters
slide-15
SLIDE 15

Parallel Implementation on DryadLINQ

  • EWMA-based Signup Abuse Detection

– Partition data by IP
– Can achieve real-time detection

  • User-User Graph Construction

– Two algorithms and optimizations
– Process 200GB-300GB data in 1.5 hours with 240 machines

  • Connected Component Extraction

– Divide and conquer
– Process a graph of 8.6 billion edges in 7 minutes
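
A divide-and-conquer approach to connected components can be sketched on a single machine as follows: split the edge list into partitions, reduce each partition to a spanning forest, then merge the much smaller forests. This is only an illustration of the general idea under assumed details; the partitioning scheme and function names are hypothetical, and the deck does not describe its DryadLINQ job at this level.

```python
def spanning_forest(edges):
    """Union-find pass; returns the edges that joined two components, plus find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            forest.append((a, b))
    return forest, find


def components_divide_and_conquer(edges, partitions=4):
    """Reduce each edge partition locally, then merge the reduced forests."""
    chunks = [edges[i::partitions] for i in range(partitions)]
    reduced = []
    for chunk in chunks:  # each chunk could run on a separate worker
        forest, _ = spanning_forest(chunk)
        reduced.extend(forest)
    forest, find = spanning_forest(reduced)  # merge step
    groups = {}
    for a, b in forest:
        for v in (a, b):
            groups.setdefault(find(v), set()).add(v)
    return sorted(sorted(group) for group in groups.values())


sample_edges = [(1, 2), (2, 3), (4, 5), (3, 1), (6, 7), (7, 4)]
print(components_divide_and_conquer(sample_edges))  # [[1, 2, 3], [4, 5, 6, 7]]
```

The local pass discards edges that close cycles within a partition, so the merge step sees at most one edge per node per partition rather than the full edge list.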

slide-16
SLIDE 16

Graph Construction 1: Simple Data Parallelism

  • Potential edges

– Select ID group by IP (Map)
– Generate potential edges (IDi, IDj, IPk) (Reduce)

  • Edge weights

– Select IP group by ID pair (Map)
– Calculate edge weight (Reduce)

  • Problem

– Weight-1 edges outnumber all other edges by two orders of magnitude
– Their computation/communication is unnecessary
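
The two group-by stages can be mimicked in plain Python to show where the wasted work comes from. This is a single-process sketch of the data flow only, not DryadLINQ code; the `ip_to_as` table and sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def potential_edges(logins):
    """Stage 1: group account IDs by IP, then emit (ID_i, ID_j, IP_k) triples."""
    ids_by_ip = defaultdict(set)
    for user_id, ip in logins:  # map: key each login record by IP
        ids_by_ip[ip].add(user_id)
    for ip, user_ids in ids_by_ip.items():  # reduce: one triple per ID pair
        for id_i, id_j in combinations(sorted(user_ids), 2):
            yield (id_i, id_j, ip)


def edge_weights(triples, ip_to_as):
    """Stage 2: group shared IPs by ID pair, then count their distinct ASes."""
    ases_by_pair = defaultdict(set)
    for id_i, id_j, ip in triples:  # map: key each triple by ID pair
        ases_by_pair[(id_i, id_j)].add(ip_to_as[ip])
    return {pair: len(ases) for pair, ases in ases_by_pair.items()}  # reduce


# Hypothetical login records (ID, IP) and IP-to-AS table.
logins = [("u1", "192.0.2.1"), ("u1", "198.51.100.7"),
          ("u2", "192.0.2.1"), ("u2", "198.51.100.7"),
          ("u3", "192.0.2.1"), ("u3", "192.0.2.2")]
ip_to_as = {"192.0.2.1": "AS1", "192.0.2.2": "AS1", "198.51.100.7": "AS2"}

weights = edge_weights(potential_edges(logins), ip_to_as)
kept = {pair: w for pair, w in weights.items() if w > 1}
print(weights)  # two of the three edges have weight 1
print(kept)     # {('u1', 'u2'): 2}
```

Even in this toy input, most of the generated triples belong to weight-1 edges that are computed, shuffled between stages, and then discarded.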

slide-17
SLIDE 17

Graph Construction 2: Selective Filtering

slide-18
SLIDE 18

Comparison of Two Algorithms

  • Method 1

– Simple and scalable

  • Method 2

– Optimized to filter out weight-1 edges
– Utilizes Join functionality, data compression and broadcast optimization

slide-19
SLIDE 19

Detection Results

  • Data description

– Two datasets

    • Jun 2007 and Jan 2008

– Three types of data

    • Signup log (IP, ID, time)
    • Login log (IP, ID, time): 500M users and 200~300GB data per month
    • Sendmail log (ID, time, # of recipients)

slide-20
SLIDE 20

Detection of Signup Abuse

slide-21
SLIDE 21

Detection by User-user Graph

slide-22
SLIDE 22

Validations

  • Manual check

– Sampled groups verified by the Hotmail team
– Almost no false positives

  • Comparison with known spamming users

– Detect 86% of the accounts that drew complaints
– Up to 54% of detected accounts are our new findings

  • Email sending sizes per group

– Most groups have a sharp peak
– The remaining groups contain several peaks

  • False positive estimation

– Naming pattern (0.44%)
– Signup time (0.13%)

slide-23
SLIDE 23

Possible to Evade BotGraph?

  • Evade signup detection: be stealthy
  • Evade graph-based detection

– Fixed IP/AS binding

    • Low utilization rate
    • Bot-accounts bound to one host are easy to group

– Be stealthy (send as few emails as a normal user)

Severely limit attackers' spam throughput

slide-24
SLIDE 24

Conclusions

  • A graph-based approach to attack detection

– Identify 26M bot-accounts with a low false positive rate in two months

  • Efficient implementation using Dryad/DryadLINQ

– Process 200GB-300GB data in 1.5 hours with a 240-machine cluster

Large-scale data mining for network security is effective and practical

slide-25
SLIDE 25

Q & A?

Thanks!

