High-Speed Detection of Unsolicited Bulk Email



SLIDE 1

High-Speed Detection of Unsolicited Bulk Email

Sheng-Ya Lin, Cheng-Chung Tan, Jyh-Charn (Steve) Liu
Computer Science Department, Texas A&M University

Michael Oehler
National Security Agency

Dec. 4, 2007

SLIDE 2

Outline

  • Motivation
  • Progressive Email Classifier (PEC) system architecture
  • Experimental results
SLIDE 3

Email Spamming No Longer Just a Nuisance

  • Some facts:
    – Botnet farms can hit any target (> 10^6)
    – Bandwidth waste (3:1 or higher)
    – Network resource exploitation and information stealing (malware planting)
    – Highly effective hit-and-run strategy (BGP, DNS, domain name, credit card fraud)
  • Existing anti-spam software
    – Large number of software copies and signatures to maintain
    – Comprehensive detection rules, but slow to respond
  • Signature management is a major bottleneck
    – Acquisition and deployment of signatures to numerous machines
    – A small variation in a known signature can easily defeat a signature-based filter
    – Spammers can test their designs against anti-spam software before starting the (hit-and-run) campaign

SLIDE 4

Spamming Behavior at a Glance

  • Spammers do not have full freedom in launching spam campaigns.
    – They must follow the transport protocols to deliver messages
    – Messages must be perceivable and appealing to human users
    – It is expensive to compose and personalize spam messages:
      • interactive (click my URL links) or passive
  • A low yield rate combined with greed leads to high spamming volumes
  • Cheap to launch: millions of zombie machines each send a few copies
    – Any “hit back, interactive” method could cause severe harm to the innocent
  • Summary
    – It is very difficult for spammers to achieve their financial goals without leaving noticeable signatures, i.e. feature instances
    – The challenge is keeping up with their speed, volume, and diversity

SLIDE 5

Our Approach

  • Lossy detection:
    – Focus mainly on the major offenders
    – Avoid false positives
  • Timely acquisition of instances of selected features:
    – Position the detector at the Network Access Points (NAP)
      • Highest concentration of samples for an enterprise network
      • Detect them before the flood enters the network
  • Work at the algorithm and data structure level, rather than on any particular hardware platform
    – Broad spectrum of computing resources/constraints
  • Regular emails are expected to have random distributions of strings that happen to fall into the spamming feature space
    – Moderated delivery of bulk, legitimate email
  • A spamming stream has invariant and variant parts
    – An invariant that also appears in regular emails cannot be used for filtering
    – For the first-cut effort: URLs (over 95% of spam messages have them)

SLIDE 6

Competitive Aging-Scoring Scheme (CASS)

  • A spamming invariant (string) is called its feature instance (FI). The essence of our technique: “Extract FIs of emails and keep track of their occurrences. If the count exceeds a threshold, it is an UNBE stream.”
  • In a naïve approach, it takes O(1) to update the score of an FI, but O(N) to age all other FIs
    – A major computing cost
  • CASS: a constant-time algorithm
    – The time-to-live of an FI is reset each time its score is increased by one (when a new copy arrives)
    – The time-to-live of all other FIs is reduced by one
    – New complexity: O(1) for both scoring and aging
    – Exceeding a threshold: move it to the blacklist
    – No further copies in a time-out period: discard it
      • The time-out may not be a fixed physical time
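
The constant-time scoring and aging described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a circular age table of length M indexed by a virtual clock that ticks once per scored FI, and the names `Cass`, `observe`, `board`, and `age` are hypothetical. Writing a new arrival into the current slot evicts whatever stale FI still occupies it, so no per-FI countdown is needed.

```python
class Cass:
    """Sketch of constant-time scoring and aging (hypothetical names)."""

    def __init__(self, threshold, age_len):
        self.S = threshold            # score threshold S
        self.M = age_len              # age table length M
        self.age = [None] * age_len   # circular age table: slot -> FI
        self.board = {}               # scoreboard: FI -> [score, slot]
        self.blacklist = set()
        self.clock = 0                # virtual clock, one tick per scored FI

    def observe(self, fi):
        # Blacklisted FIs bypass the scoreboard and do not advance the clock.
        if fi in self.blacklist:
            return "blocked"
        slot = self.clock % self.M
        self.clock += 1
        entry = self.board.get(fi)
        if entry is not None:
            self.age[entry[1]] = None   # vacate the FI's previous slot
        stale = self.age[slot]
        if stale is not None:
            # This FI was not refreshed for M ticks: its time-to-live expired.
            del self.board[stale]
            self.age[slot] = None
        score = (entry[0] if entry is not None else 0) + 1
        if score > self.S:              # "exceeding a threshold"
            self.board.pop(fi, None)
            self.blacklist.add(fi)
            return "blacklisted"
        self.board[fi] = [score, slot]  # reset time-to-live at the newest slot
        self.age[slot] = fi
        return "tracked"
```

Each call does a fixed number of dictionary and list operations, which is where the O(1) cost for both scoring and aging comes from. Because the clock advances per arrival rather than per second, the time-out is measured in arrivals, not physical time.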
SLIDE 7

PEC Architecture

[Figure: PEC architecture diagram. Labeled components: email flow; Sendmail; feature instance extraction; 32-bit hash table of known strings; new string identified; aging and scoring of unknown strings; Berkeley DB (hash vs. string; birth and death of strings).]

SLIDE 8

Data Structure of Scoreboard

[Figure: data structure of the scoreboard. Labels: address hash function; H1(URL), 20 bits; H2(URL); HURL = H1(URL).H2(URL), 32 bits; data structure of a cell, 76 bits; index of SMT; ++Score; miss count; update; remove; Δt. Scoreboard table entries for feature instances: (hash_low, score, age_table location). Age table entries for feature instances: (score_table location). Decision points: exceeds UNBE threshold (S)? Exceeds age threshold (M)?]
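
The split of the 32-bit HURL into H1(URL) and H2(URL) can be illustrated with a short sketch. The slide does not name the hash function, so CRC-32 from Python's `zlib` is used here purely as a stand-in, and `hash_url` is a hypothetical helper. The 12-bit width of the low part is inferred as 32 − 20.

```python
import zlib

H1_BITS = 20  # high-order bits used as the table index (from the slide)
H2_BITS = 12  # remaining bits of the 32-bit hash (assumed: 32 - 20)


def hash_url(url):
    """Return (h1, h2): the high 20 bits and low 12 bits of a 32-bit hash."""
    # CRC-32 is only a placeholder for whatever hash the system actually uses.
    h = zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF
    return h >> H2_BITS, h & ((1 << H2_BITS) - 1)
```

Here `h1` would select a scoreboard cell and `h2` would be stored in the cell (the `hash_low` field) to tell apart URLs that share the same index.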

SLIDE 9

A Snapshot

[Figure: snapshot of the scoreboard with S = 10, M = 20, and queue size = 20. Active features are arranged by age (mod N), from newest to oldest in time history, around the current time location (MOD queue placement). Labels: current feature being processed; next feature instance; HashURL: (414738 (20-bit) + 3724 (12-bit)); HashURL: (124489 (20-bit) + 176 (12-bit)); the entry [862 1822] is purged; entry moved to blacklist.]

SLIDE 10

Testbed Environment

Three modules are included:

  1. Email generation
  2. PEC (blacklist and scoreboard)
  3. Control and visualization console
SLIDE 11

Experimental Configuration

  • Email generator: Intel P4-3.0, Windows XP
  • Email server: Xeon 3.0 GHz, two single-core CPUs, Linux, Sendmail 8.14.1
  • Within a batch, the sender sends 2000 emails (uniformly mixed UNBEs and regulars).
    – S: 50
    – M: 2048
    – Average mail size: 1.5 KB
    – One mail per 0.088 seconds on average

SLIDE 12

Workflow of Email Generation

[Figure: workflow of email generation. Labeled components: Windows control console (simulation parameters); message composer (subject generation, “From” generation, spamming keyword selection, density generation with uniform distribution, SMTP protocol); inputs: random text, MIME structures, feature dictionary (URL, image src; bulk and regular); output: a mixed sequence of bulk and regular emails (U R U U … R) sent to the Linux email server (Sendmail).]

SLIDE 13

UNBE Generation

  • Both UNBE and regular copies are injected with URL links or remote image sources
    – The density and the locations of variants and invariants in the body of each copy can be adjusted to generate MIME messages
    – UNBE features are extracted from the 2005 TREC Public Spam Corpus, http://plg.uwaterloo.ca/~gvcormac/treccorpus/about.html
    – Variants: random text taken from web sites
    – Keywords: user defined (not tested in this report)
  • The message composer calls an SMTP library to send the generated emails to Sendmail
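
A batch of uniformly mixed UNBE and regular copies might be composed along the following lines. This is only a sketch of the idea: the function `compose_batch`, the word list, and the `example` addresses and URLs are all hypothetical, and the real generator draws its features from the TREC corpus rather than from fixed strings.

```python
import random
from email.message import EmailMessage

# Placeholder vocabulary for the variant (random text) part of each body.
WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot"]


def compose_batch(n, unbe_count, invariant_url, seed=0):
    """Build n messages, unbe_count of which carry the same invariant URL."""
    rng = random.Random(seed)
    # Choose UNBE positions uniformly at random within the batch.
    unbe_positions = set(rng.sample(range(n), unbe_count))
    batch = []
    for i in range(n):
        filler = " ".join(rng.choice(WORDS) for _ in range(20))
        if i in unbe_positions:
            url = invariant_url  # invariant part shared by all UNBE copies
        else:
            url = "http://regular.example/%d" % rng.randrange(10 ** 9)
        msg = EmailMessage()
        msg["From"] = "sender%d@example.org" % rng.randrange(1000)
        msg["To"] = "rcpt@example.org"
        msg["Subject"] = "message %d" % i
        msg.set_content(filler + "\n" + url + "\n")
        batch.append(msg)
    return batch
```

Each message could then be handed to the mail server with the standard library's `smtplib.SMTP(host).send_message(msg)`, where `host` is the address of the Sendmail machine.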

SLIDE 14

Detection Latency of Single UNBE source

[Figure: detection latency (unit: virtual clock) versus number of messages in a bin (50 to 300), comparing experimental and expected values.]

  • The threshold and age table length are fixed while the density varies.
  • Six UNBE densities are tested (50, 100, 150, 200, …, 300 UNBE messages/bin)

SLIDE 15

Effects of Multiple UNBE Sources

[Figure: detection latency versus number of messages in a bin for each non-A UNBE source (50 to 300). Legend: test 1 to test 6 (other sources).]

  • Given an UNBE source A, six tests were made in which one additional UNBE source is added to the experiment at a time.
    – The six lines are marked test[1-6]
  • The density (instances/batch)
    – A: fixed at 100
    – Other UNBE sources: increased from 50 to 300
  • Result:
    – The detection latency of an UNBE decreases with the number of UNBE sources
  • When a source is captured, it is blocked from the scoreboard, so the density of the other sources, measured in virtual clock (VC) units, increases

SLIDE 16

Throughput of URL Parser

[Figure: URL parser throughput (1000 bodies/sec) versus size of mail body (1.5 KB to 7.5 KB).]

The average email size ranges from 1.5 KB to 7.5 KB, and each email has 2 URLs.
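
The slides do not describe how the URL parser works internally. As a rough illustration of the feature extraction step only, a regular-expression pass over the mail body could look like this; the pattern and the name `extract_urls` are assumptions, not the parser that was measured.

```python
import re

# Match http/https URLs up to the next whitespace, quote, or angle bracket.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""", re.IGNORECASE)


def extract_urls(body):
    """Return all URLs found in a mail body, in order of appearance."""
    return URL_RE.findall(body)
```

For example, a body containing one `href` link and one remote image `src` yields two feature instances, matching the two URLs per email used in this experiment.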

SLIDE 17

Throughput of Scoreboard and Blacklist

[Figure: throughput (K URLs/sec) of the scoreboard and blacklist versus URL length (30 to 150 bytes).]

  • Scoreboard: 1.2M transactions
  • Blacklist: 0.9M URLs (avg. 30 B), not including database access
SLIDE 18

Pointer Table: reduce memory need (at a small cost of delay)

  • In the detection window, only a limited number of hashed values need to be tracked
  • A full table for a 32-bit hash takes too much space
  • The higher-order bits are used as the index, and the remaining bits are maintained in a linked list (one per entry)
  • If the pointer table uses 20 bits for indexing, it has 1M entries; with an age table length of 20K~70K, the maximum depth of a linked list pointed to by the pointer table is 2.
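
A pointer table of this kind can be sketched as below. The class name `PointerTable` and its methods are hypothetical, and a Python list per slot stands in for the linked list; the point is only that the table needs 2^20 pointers instead of 2^32 cells, at the cost of a short chain walk per lookup.

```python
class PointerTable:
    """Index by the high-order hash bits; chain the remaining bits per slot."""

    def __init__(self, index_bits=20, hash_bits=32):
        self.low_bits = hash_bits - index_bits
        self.slots = [None] * (1 << index_bits)  # one pointer per index value

    def _split(self, h):
        # High-order bits select the slot, low-order bits identify the entry.
        return h >> self.low_bits, h & ((1 << self.low_bits) - 1)

    def put(self, h, value):
        idx, low = self._split(h)
        if self.slots[idx] is None:
            self.slots[idx] = []
        chain = self.slots[idx]
        for node in chain:
            if node[0] == low:
                node[1] = value  # update an existing entry in place
                return
        chain.append([low, value])

    def get(self, h):
        idx, low = self._split(h)
        for node in self.slots[idx] or ():
            if node[0] == low:
                return node[1]
        return None

    def remove(self, h):
        idx, low = self._split(h)
        chain = self.slots[idx] or []
        self.slots[idx] = [n for n in chain if n[0] != low] or None
```

With at most tens of thousands of live entries spread over 1M slots, most chains are empty or hold a single node, which is consistent with the small maximum depth reported on the slide.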

SLIDE 19

Threshold Setting

  • Q1: “What is the minimum value of M to detect an UNBE attack (of known density) with a success probability higher than α?”
    – (A smaller M retires entries sooner)
  • Q2: “For a given M, what is the maximum value of S to guarantee that the probability of the detection latency < ς is greater than α?”
    – (A larger S makes false positives less likely, but lets more copies enter the network before detection)

SLIDE 20

Compute M and S

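Get M:

$$\lambda = \frac{\mu_f}{\mu_b + \mu_f}$$

$$\Gamma(2, M\lambda) = 1 - \alpha$$

Get S:

$$\frac{E[H_{S+1}]}{\varsigma} = \lambda$$

$$E(H_S) = \frac{1}{\alpha} + \frac{1}{\alpha^{2}} + \cdots + \frac{1}{\alpha^{S-1}} + \frac{2}{\alpha^{S}} = \frac{1 - 2\alpha^{-S} + \alpha^{-S+1}}{\alpha - 1}$$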
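
As a numerical illustration of the "Get M" step, the sketch below takes the relation Γ(2, Mλ) = 1 − α at face value and searches for the smallest integer M that satisfies it as an inequality. It uses the standard identity Γ(2, x) = (1 + x)e^(−x) for the upper incomplete gamma function. It also checks the closed form of E(H_S) against its explicit sum. The function names are hypothetical, and reading the equality as "Γ(2, Mλ) ≤ 1 − α" is an assumption.

```python
import math


def gamma2_upper(x):
    """Upper incomplete gamma function: Gamma(2, x) = (1 + x) * exp(-x)."""
    return (1.0 + x) * math.exp(-x)


def min_age_length(lam, alpha):
    """Smallest integer M with Gamma(2, M * lam) <= 1 - alpha.

    Assumes 0 < lam <= 1 and 0 < alpha < 1.
    """
    m = 1
    while gamma2_upper(m * lam) > 1.0 - alpha:
        m += 1
    return m


def expected_h_sum(s, alpha):
    """E(H_S) as the explicit sum 1/a + 1/a^2 + ... + 1/a^(S-1) + 2/a^S."""
    return sum(alpha ** -k for k in range(1, s)) + 2.0 * alpha ** -s


def expected_h_closed(s, alpha):
    """E(H_S) in closed form: (1 - 2 a^-S + a^(-S+1)) / (a - 1)."""
    return (1.0 - 2.0 * alpha ** -s + alpha ** (1 - s)) / (alpha - 1.0)
```

For instance, `min_age_length(0.05, 0.95)` would give the age table length needed when the UNBE source makes up 5% of the stream and the target success probability is 0.95.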

SLIDE 21

Detection latency when S=24, M=55

The prediction model is conservative

SLIDE 22

Sensitivity of TR vs. S

SLIDE 23

Sensitivity of TR vs. M (S=24)

SLIDE 24

Summary

  • PEC demonstrates the feasibility of high-speed UNBE filtering at the network vantage points
  • The method is not meant to replace existing solutions, but to defeat major offenders
    – 80-20 rule
  • Expansion of the techniques to handle multiple features (bad words, dirty subnets, blacklists, etc.)
    – Integration/interface with existing tools

SLIDE 25

Thank You!