On Attacking Statistical Spam Filters
Greg Wittel & S. Felix Wu, U.C. Davis, CEAS 2004


  1. On Attacking Statistical Spam Filters
     Greg Wittel & S. Felix Wu, U.C. Davis
     CEAS 2004

  2. Outline
     • Introduction
     • Attack Classes
     • Testing A New Attack
     • Conclusions & Future

  3. Attack Classes
     • Attempted attack methods:
       – Tokenization
         • Works against feature selection by splitting or modifying key message features
         • e.g. splitting up words with spaces, HTML tricks
       – Obfuscation
         • Uses encoding or misdirection to hide contents from the filter
         • e.g. HTML/URL encoding, letter substitution
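The tokenization and obfuscation classes above amount to simple string transforms. A minimal sketch, assuming an example target word and an illustrative substitution map (neither is taken from the slides):

```python
# Illustrative sketches of the two attack classes: tokenization
# (splitting a token so the tokenizer misses it) and obfuscation
# (look-alike character substitution). Examples only.

def tokenization_attack(word):
    """Split a word with spaces so a whole-word tokenizer misses it."""
    return " ".join(word)

def obfuscation_attack(word, subs=None):
    """Substitute look-alike characters to hide a known spam token."""
    subs = subs or {"a": "@", "i": "1", "o": "0"}
    return "".join(subs.get(c, c) for c in word)

print(tokenization_attack("viagra"))  # v i a g r a
print(obfuscation_attack("viagra"))   # v1@gr@
```

Both transforms leave the message readable to a human while changing the tokens the filter extracts.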

  4. Attack Classes cont.
     – Weak Statistical
       • Skew message statistics by adding in random data
       • e.g. add in random words, fake HTML tags, random text excerpts
     – Strong Statistical
       • Differentiated from 'weak' attacks by using more intelligence in the attack
       • Guessing vs. educated guessing
       • e.g. the Graham-Cumming attack

  5. Attack Classes cont.
     – Misc:
       • Sparse data attack
       • Hash-breaking attacks

  6. Testing A New Attack
     • Tested two types of attacks:
       – Dictionary word attack (old)
       – Common word attack (new)
     • Both attacks add n random words to a base message.
     • Tested against two filters:
       – CRM114: sparse binary polynomial + Naïve Bayesian
       – SpamBayes (SB): Naïve Bayesian
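Both filters above score messages with Naïve Bayesian combining, which is why padding a message with extra words can shift the verdict. A minimal Graham-style sketch, not CRM114's or SpamBayes' actual scoring, with invented per-token probabilities:

```python
import math

# Minimal Naïve Bayesian spam score: combine per-token spam
# probabilities under the independence assumption. The token
# probabilities below are invented for illustration.
TOKEN_SPAMMINESS = {
    "erase": 0.9, "spyware": 0.9, "trojan": 0.9,      # spammy tokens
    "meeting": 0.05, "thanks": 0.05, "report": 0.05,  # hammy tokens
}

def spam_probability(tokens, default=0.4):
    """P(spam | tokens), combined in log space for numerical stability."""
    log_spam = log_ham = 0.0
    for t in tokens:
        p = TOKEN_SPAMMINESS.get(t, default)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

base = ["erase", "spyware", "trojan"]
padded = base + ["meeting", "thanks", "report"]
print(spam_probability(base))    # close to 1: classified as spam
print(spam_probability(padded))  # padding with hammy words pulls it below 0.5
```

This is the mechanism the word attacks exploit: each appended "hammy" token multiplies the ham side of the ratio, diluting the spammy evidence.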

  7. Procedure
     • Training data
       – 3000 hams from the SpamAssassin corpus
       – 3000 spams from the SpamArchive-mod corpus
       – CRM114 trained on errors
       – SB using bulk training

  8. Procedure cont.
     • Test data
       – Started with a base 'picospam' not in the training data:
         From: Kelsey Stone <bouhooh@entitlement.com>
         To: submit@spamarchive.org
         Subject: Erase hidden Spies or Trojan Horses from your computer
         Erase E-Spyware from your computer
         http://boozofoof.spywiper.biz

  9. Procedure cont.
     • Test data cont.
       – The base picospam is detectable by the filters
       – Generated 1000 variations with n words added
         • Words selected with and without replacement
         • n = 10, 25, 50, 100, 200, 300, 400
       – Recorded classifications and the effect on score
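The variant-generation step above can be sketched as follows; the word list here is a stand-in for the dictionary and common-word lists actually used in the experiments:

```python
import random

# Sketch of variant generation: append n padding words to the base
# message, sampled with or without replacement. WORDS is an
# illustrative stand-in for the real word lists.
WORDS = ["the", "and", "report", "meeting", "schedule", "thanks",
         "project", "update", "please", "regards"]

def make_variant(base_msg, n, replacement=True, rng=random):
    """Return the base message with n random padding words appended."""
    if replacement:
        padding = [rng.choice(WORDS) for _ in range(n)]
    else:
        padding = rng.sample(WORDS, n)  # n must not exceed len(WORDS)
    return base_msg + "\n" + " ".join(padding)

base = "Erase E-Spyware from your computer"
variants = [make_variant(base, n=10) for _ in range(1000)]
```

Each variant is then fed to the filter and its classification and score recorded, once per value of n.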

  10. Results
      • Using 10,000 variants didn't affect the results
      • Selection with/without replacement had no effect
      • Mixed results

  11. CRM114 Results
      • Both attacks failed; 0 false negatives
      • The spam score was affected...

  12. CRM114 Results cont.
      [Chart: spam probability vs. words added (0–400) for the dictionary and common word attacks, with the base score marked; scores remain roughly between 0.75 and 1.]

  13. SpamBayes Results
      • Baseline dictionary attack: mild success
      • Common word attack...

  14. SpamBayes Results cont.
      [Chart: spam probability vs. words added (0–400) for the dictionary and common word attacks, with the spam and ham thresholds marked.]

  15. SpamBayes Results cont.
      • The common word attack reduces the attack size by up to 4x
      • What happened? Why such poor performance on either attack?
      • Hypothesis: the basis picospam was not in the training data.
      • Added the basis spam to SB's training data...

  16. SpamBayes Results Part 2
      • The retrained filter offered greater resistance to the 'weak' dictionary attack.
      • Small performance gain against the common word attack.
      • Gains not big enough to resist the attack

  17. SpamBayes Results Part 2 cont.
      [Chart: dictionary word attack, spam probability vs. words added (0–400), before vs. after retraining, with the spam and ham thresholds marked.]

  18. SpamBayes Results Part 2 cont.
      [Chart: common word attack, spam probability vs. words added (0–400), before vs. after retraining, with the spam and ham thresholds marked.]

  19. Conclusion & Future...
      • The mixed success of the common word attack shows the need for further study
      • Other filters
        – Bogofilter shows a similar vulnerability
      • Effect of re-training on attack messages vs. the false negative and false positive rates
      • Testing other basis picospams

  20. Future cont.
      • What makes a filter hard to distract?
      • Relevance of the independence assumption
      • More advanced attacks
        – Natural language generation
      • Traditional software flaws
        – Exploitable buffer overflows
        – Remote code execution

  21. Colophon
      • Contact information:
        – Greg Wittel (wittel at cs.ucdavis.edu)
        – S. Felix Wu (wu at cs.ucdavis.edu)
      • Questions?
