High-Speed Detection of Unsolicited Bulk Email
Sheng-Ya Lin, Cheng-Chung Tan, Jyh-Charn (Steve) Liu, Computer Science Department, Texas A&M University; Michael Oehler, National Security Agency. Dec. 4, 2007
– Botnet farms can hit any target (> 10^6)
– Bandwidth waste (3:1 or higher)
– Network resource exploitation and information stealing (malware planting)
– Highly effective hit-and-run strategy (BGP, DNS, domain name, credit card fraud)
– Large number of software copies and signatures to maintain
– Comprehensive detection rules, but slow to respond
– Follow the transport protocols to deliver messages
– Messages must be perceivable and appealing to human users
– Expensive to compose and personalize spamming messages:
– Any "hit back, interactive" method could cause severe harm to the innocent
– Very difficult for spammers to achieve financial goals without leaving noticeable signatures, i.e., feature instances (FIs)
– A challenge is how to keep up with their speed, volume, and diversity
– Focus mainly on the major offenders
– Avoid false positives
– Position the detector at the Network Access Points (NAP)
– Broad spectrum of computing resources/constraints
– Moderated delivery of bulk, legitimate email
– An invariant that also appears in regular emails cannot be used for filtering
– For the first-cut effort: URLs (over 95% of spam messages contain them)
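Since URLs are the first-cut feature, a minimal sketch of pulling URL feature instances out of a message body may help. The regular expression, the trailing-punctuation trimming, and the function name `extract_urls` are illustrative assumptions, not the extraction rules used in the presented system.

```python
import re
from typing import List

# Assumed pattern: an http/https URL runs until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_urls(body: str) -> List[str]:
    """Return the distinct URLs found in a message body, in first-seen order."""
    seen = {}
    for match in URL_PATTERN.finditer(body):
        # Trim punctuation that usually belongs to the surrounding sentence.
        url = match.group(0).rstrip(".,;)")
        seen.setdefault(url, None)
    return list(seen)


sample = (
    'Buy now <a href="http://example.com/a">here</a> or visit '
    "http://example.com/a. More at https://example.org/b"
)
urls = extract_urls(sample)
```

Each distinct URL returned here would be treated as one feature instance for the scoring stage.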
– A major computing cost
– The time-to-live of an FI is reset each time its score is increased by one (when a new copy arrives)
– The time-to-live of all other FIs is reduced by one
– New complexity: O(1) for both scoring and aging
– Exceeding a threshold: move it to the blacklist
– No further copies in a time-out period: discard it
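The scoring and aging rules above can be sketched with a circular age table, so that each arrival costs O(1) on average. This is a minimal sketch under assumptions: the class and method names are hypothetical, a Python dict stands in for the hashed score table, and the virtual clock is assumed to advance only for feature instances that are not yet blacklisted.

```python
class Scoreboard:
    """Sketch of O(1) scoring and aging of feature instances (FIs)."""

    def __init__(self, threshold=10, slots=20):
        self.threshold = threshold   # score threshold (S in the slides)
        self.slots = slots           # age threshold / queue size (M in the slides)
        self.scores = {}             # FI -> {"score": int, "slot": int}
        self.age = [None] * slots    # circular age table: slot -> FI
        self.now = 0                 # current time location
        self.blacklist = set()

    def observe(self, fi):
        if fi in self.blacklist:
            return "blocked"         # known FI: never reaches the scoreboard
        entry = self.scores.get(fi)
        if entry is not None:
            self.age[entry["slot"]] = None   # vacate the old slot (TTL reset)
        # Advancing the pointer ages every other FI by one step.
        self.now = (self.now + 1) % self.slots
        stale = self.age[self.now]
        if stale is not None:
            del self.scores[stale]   # no new copy within the time-out: discard
        score = (entry["score"] if entry else 0) + 1
        if score > self.threshold:
            self.scores.pop(fi, None)
            self.age[self.now] = None
            self.blacklist.add(fi)
            return "blacklisted"
        self.scores[fi] = {"score": score, "slot": self.now}
        self.age[self.now] = fi
        return "tracked"


sb = Scoreboard(threshold=2, slots=3)
results = [sb.observe("spam-url") for _ in range(4)]
```

With a threshold of 2, the third copy exceeds the threshold and is moved to the blacklist, and the fourth copy is blocked before it touches the scoreboard. No loop over tracked FIs is needed: moving the pointer one slot is what reduces everyone else's time-to-live.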
[Figure: system architecture. Labels: Sendmail (email flow); feature instance extraction; aging and scoring of unknown strings; 32-bit hash table of known strings; new string identified; Berkeley DB (hash vs. string; birth and death of strings).]
[Figure: scoreboard data structures. Labels: address hash function H1(URL), 20 bits; H2(URL); HURL = H1(URL).H2(URL), 32 bits; index of SMT; data structure of a cell, 76 bits; ++Score on update; miss count; remove after Δt. Scoreboard table entries for feature instances: (hash_low, score, age_table location). Age table entries for feature instances: (score_table location). Decision points: exceeds UNBE threshold (S)? exceeds age threshold (M)?]
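The split of the 32-bit HURL into a 20-bit address and a 12-bit remainder can be sketched as follows. The slides do not name the hash functions, so `zlib.crc32` is only a stand-in, and the helper names are hypothetical. The round-trip check uses the 20-bit and 12-bit values 414738 and 3724 that appear in the slides' worked example.

```python
import zlib

INDEX_BITS = 20   # H1(URL): address into the table
TAG_BITS = 12     # remaining low bits kept inside the entry


def hash_url(url):
    """Stand-in 32-bit hash of a URL (CRC-32 is an assumption, not the slides' hash)."""
    return zlib.crc32(url.encode("utf-8")) & 0xFFFFFFFF


def split32(h):
    """Split a 32-bit hash into (20-bit index, 12-bit tag)."""
    return h >> TAG_BITS, h & ((1 << TAG_BITS) - 1)


def join32(index, tag):
    """Rebuild the 32-bit hash from its two parts."""
    return (index << TAG_BITS) | tag


index, tag = split32(hash_url("http://example.com/offer"))
round_trip = split32(join32(414738, 3724))
```

Storing only the low bits in the entry works because the high bits are already implied by the table position.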
[Figure: worked example with S = 10, M = 20 (queue size = 20). HashURL values shown: 414738 (20-bit) + 3724 (12-bit) and 124489 (20-bit) + 176 (12-bit). Labels: current feature being processed; next feature instance; active features arranged by age (mod N); MOD queue placement; current time location; time history (newest); the entry [862 1822] is purged; entry moved to blacklist.]
Three Modules included:
[Figure: testbed. Labels: Windows control console (simulation parameters); Linux email server (Sendmail); message composer with inputs random text, MIME structures, and feature dictionary (URL, image src); output stream of bulk (U) and regular (R) emails, e.g. U R U U ... R. Composer steps: subject generation, "From" generation, SMTP protocol, density generation (uniform distribution), spamming keyword selection.]
[Figure: detection latency (unit: virtual clock) vs. number of messages in a bin; experimental value and expected value.]
[Figure: detection latency vs. number of messages in a bin for each non-A UNBE; curves test 1 through test 6.]
– The six lines marked test[1-6] were made by adding one additional UNBE source to the experiment at a time.
– A: fixed at 100
– Other UNBE sources: increased from 50 to 300
– The detection latency of an UNBE decreases with the number of UNBE sources
blocked from the scoreboard. The density measure in VC for others increases
[Figure: throughput (1000 bodies/sec) vs. size of mail body (KB).]
The average email size ranges from 1.5 KB to 7.5 KB, and each email has 2 URLs.
[Figure: throughput (K URLs/sec) vs. URL length (bytes).]
tracked
maintained by a linked list (for each entry)
and the age table length is 20K to 70K, the maximum depth of the linked list pointed to by the pointer table is 2.
[Equation: an expression in f, b, and S; the layout was lost in extraction and the terms cannot be reliably reassembled.]
The prediction model is conservative