New and upcoming features in SpamAssassin v3, ApacheCon 2004




slide-1
SLIDE 1

New and upcoming features in SpamAssassin v3

ApacheCon 2004 November 15, 2004 By: Theo Van Dinter

slide-2
SLIDE 2

2

Project Changes

- Became an ASF Top Level Project
- New logo!
- License change from GPL/PAL to ASL2
  - Took 4+ months for 100+ retroactive CLAs
- Moved from SourceForge to ASF
  - Mailing lists
  - CVS to Subversion

slide-3
SLIDE 3

3

Project Changes

- New version number scheme (x.y.z vs x.yz)
- Minimum perl version increased from 5.005 to 5.6.1
- Major API changes
- Code cleanup
- Message parsing
- Module merging and separation

slide-4
SLIDE 4

4

Project Changes

- 2.60 vs 3.0.0
  - 1712 commits
  - 941k vs 1.0m (gzip release file)
- 1 year exactly between releases
  - 9 months in 2.70 & 3.0.0 development
  - 2 months in pre-release mode (scores and testing)
  - 1 month in release candidate mode (beta testing)

slide-5
SLIDE 5

5

Changes per Type

Code is in multiple pieces:
- Message filter
  - Read in message, parse, rewrite output
- Rule engine
  - Run hundreds of rules over message contents, handle priorities, scoring (weight) per rule, etc.

slide-6
SLIDE 6

6

Filter Changes

slide-7
SLIDE 7

7

Message Parser

- Core of the filter is the message parser
- v2 did OK, but complex MIME didn’t work at all
- Removed Mail::Audit support and the NoMailAudit module, replaced with Message and Message::Node
- Ground-up rewrite of the parser
  - Goal: handle even complex MIME messages
  - Better emulation of common MUA behavior

slide-8
SLIDE 8

8

Message Parser

- Linear vs Recursive
- New internal tree structure
- Just-In-Time (JIT) behavior when possible
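The tree structure and just-in-time behavior can be illustrated with a small sketch. This is not SpamAssassin's Message::Node; `Sketch::Node`, its fields, and its methods are hypothetical names chosen for illustration. Each node keeps its raw body and only decodes it the first time it is asked for, and `find_parts` walks the tree recursively.

```perl
use strict;
use warnings;

{
    package Sketch::Node;    # hypothetical class, for illustration only
    use MIME::Base64 qw(decode_base64);

    sub new {
        my ($class, %args) = @_;
        my $self = {
            type     => $args{type},
            raw      => $args{raw} // '',
            encoding => $args{encoding} // '',
            children => [],
        };
        return bless $self, $class;
    }

    sub add_child {
        my ($self, $child) = @_;
        push @{ $self->{children} }, $child;
        return $child;
    }

    # Just-in-time: decode on first request only, then reuse the cached copy.
    sub decoded {
        my ($self) = @_;
        $self->{decoded} //= $self->{encoding} eq 'base64'
            ? decode_base64($self->{raw})
            : $self->{raw};
        return $self->{decoded};
    }

    # Recursive walk: this node (if its type matches) plus matching descendants.
    sub find_parts {
        my ($self, $re) = @_;
        return ( ($self->{type} =~ $re ? $self : ()),
                 map { $_->find_parts($re) } @{ $self->{children} } );
    }
}

my $root = Sketch::Node->new(type => 'multipart/mixed');
$root->add_child(Sketch::Node->new(type => 'text/plain', raw => "hello\n"));
my $alt = $root->add_child(Sketch::Node->new(type => 'multipart/alternative'));
$alt->add_child(Sketch::Node->new(
    type => 'text/html', raw => 'PGI+aGk8L2I+', encoding => 'base64'));

my @text_parts = $root->find_parts(qr/^text\b/);
print scalar(@text_parts), " text parts\n";
```

A caller that only needs headers or part types never pays for decoding, which is the idea behind the JIT bullet.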

slide-9
SLIDE 9

9

Message Parser

- MUA emulation
  - OE (Outlook Express) HTML heuristic
  - Content-Type boundary handling
  - Non-RFC compliance

slide-10
SLIDE 10

10

Filter Changes

- Configuration parser separated from options
  - Makes handling of option values standardized
- Parsing is much faster
  - Hash lookup, not linear if-then-else logic
- Configuration files can now include other files
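The difference between a linear if-then-else chain and a hash lookup can be sketched with a dispatch table. This is only an illustration of the technique, not the actual SpamAssassin configuration parser; the handler table and `parse_config_line` are hypothetical.

```perl
use strict;
use warnings;

my %conf;

# Hypothetical option table: one handler per option name.
my %handlers = (
    required_score => sub { $conf{required_score} = 0 + $_[0] },
    use_bayes      => sub { $conf{use_bayes} = $_[0] ? 1 : 0 },
);

# Returns 1 if the line named a known option, 0 otherwise.
sub parse_config_line {
    my ($line) = @_;
    my ($key, $value) = $line =~ /^\s*(\S+)\s+(.*?)\s*$/ or return 0;
    my $handler = $handlers{$key} or return 0;    # one hash lookup per line
    $handler->($value);
    return 1;
}

for my $line ('required_score 5.0', 'use_bayes 1', 'no_such_option x') {
    printf "%-20s %s\n", $line, parse_config_line($line) ? 'ok' : 'unknown';
}
```

With a table like this, the cost of finding the handler does not grow with the number of options, and every option value passes through the same code path.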

slide-11
SLIDE 11

11

Filter Changes: ArchiveIterator

- Added support for UW mbx format
  - Now handles: file, mbox, mbx, dir
- spamassassin script now uses ArchiveIterator
  - MUCH faster for batch operations
  - Message parser JIT behavior makes remove markup mode super fast!

slide-12
SLIDE 12

12

ArchiveIterator Batch Mode Example

    $ ls -la 1000spam
    -rw-r--r-- 1 felicity fame 4375748 Oct 10 18:15 1000spam

    $ time formail -s spamassassin-260 -L < 1000spam > 1000spam.spam
    900.550u 71.580s 17:06.06 94.7% 0+0k 0+0io 724863pf+0w
    $ time formail -s spamassassin-260 -d < 1000spam.spam > 1000spam.clean
    706.440u 55.200s 13:21.33 95.0% 0+0k 0+0io 722119pf+0w

diff reported 51 differences, all Subject header whitespace related

    $ time spamassassin-30 -L --mbox 1000spam > 1000spam.spam
    69.700u 0.600s 1:13.44 95.7% 0+0k 0+0io 730pf+0w
    $ time spamassassin-30 -d --mbox 1000spam.spam > 1000spam.clean
    3.360u 0.290s 0:03.66 99.7% 0+0k 0+0io 726pf+0w

diff reported 0 differences

scanning: 14x faster, removing markup: 209x!
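The 14x and 209x figures appear to come from CPU time (user plus system seconds) rather than wall-clock time. The short check below recomputes both ratios from the `time` output above; the hash keys are just labels chosen here.

```perl
use strict;
use warnings;

# CPU seconds (user + system) copied from the `time` output on the slide.
my %cpu = (
    scan_260  => 900.550 + 71.580,
    clean_260 => 706.440 + 55.200,
    scan_30   => 69.700 + 0.600,
    clean_30  => 3.360 + 0.290,
);

my $scan_speedup  = $cpu{scan_260}  / $cpu{scan_30};
my $clean_speedup = $cpu{clean_260} / $cpu{clean_30};

# prints "scanning: 14x, removing markup: 209x"
printf "scanning: %.0fx, removing markup: %.0fx\n",
    $scan_speedup, $clean_speedup;
```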

slide-13
SLIDE 13

13

Changes to spamd

- spamd is the daemonized SpamAssassin scanner
- Previous versions accepted a message, then forked to process it
- 3.0.0 pre-forks children, which “randomly” accept connections and do the processing
  - Causes lots of challenges, such as reverting user configuration, GC, resource usage, etc.

slide-14
SLIDE 14

14

Changes to spamd

Output log includes mass-check compatible output:

    Oct 10 09:34:57 eclectic spamd[14215]: result: Y 14 -
      ADDRESS_IN_SUBJECT,BAYES_99,DNS_FROM_AHBL_RHSBL,EXCUSE_1,
      EXCUSE_3,EXCUSE_7,HTML_90_100,HTML_IMAGE_RATIO_02,
      HTML_MESSAGE,MARKETING_PARTNERS,MIME_QP_LONG_LINE,
      MPART_ALT_DIFF,RAZOR2_CHECK,RCVD_IN_SBL,URIBL_OB_SURBL,
      URIBL_SBL,URIBL_WS_SURBL scantime=2.1,size=6896,
      mid=<61DF7FD0F6D44F26B764B5C2CE4C9ECFA6D1E8@anbok.com>,
      bayes=0.999669884503962,autolearn=no
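A line in this shape is easy to pick apart with a regular expression. The sketch below is a hypothetical helper, not part of SpamAssassin; it assumes the spaces after some commas on the slide are line wrapping and that the real log line has none, and it shortens the rule list for readability.

```perl
use strict;
use warnings;

# Slide log line, rejoined and with the rule list shortened (assumption:
# the real line has no spaces inside the rule list or the key=value list).
my $line = 'Oct 10 09:34:57 eclectic spamd[14215]: result: Y 14 - '
         . 'ADDRESS_IN_SUBJECT,BAYES_99,RAZOR2_CHECK,URIBL_SBL '
         . 'scantime=2.1,size=6896,'
         . 'mid=<61DF7FD0F6D44F26B764B5C2CE4C9ECFA6D1E8@anbok.com>,'
         . 'bayes=0.999669884503962,autolearn=no';

# Hypothetical parser: verdict flag, score, rule names, then key=value fields.
sub parse_result_line {
    my ($text) = @_;
    my ($verdict, $score, $rules, $fields) =
        $text =~ /\bresult: (\S) (-?\d+) - (\S*) (\S+=.*)$/
        or return;
    my %info = map { split /=/, $_, 2 } split /,/, $fields;
    return {
        is_spam => ($verdict eq 'Y' ? 1 : 0),
        score   => $score,
        rules   => [ split /,/, $rules ],
        %info,
    };
}

my $result = parse_result_line($line);
printf "spam=%d score=%d rules=%d scantime=%s\n",
    $result->{is_spam}, $result->{score},
    scalar @{ $result->{rules} }, $result->{scantime};
```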

slide-15
SLIDE 15

15

Engine Changes

slide-16
SLIDE 16

16

Rule Changes

- 2.60 had 872 rules, 3.0.0 has 628
  - 407 kept, 465 removed, 221 added (includes renamed rules)
- 2.60 had 160 sub-rules, 3.0.0 has 227
  - 130 kept, 30 removed, 97 added

slide-17
SLIDE 17

17

New Rules

- RCVD_IN_XBL
  - Spamhaus exploit list
- DRUGS_*
  - Common drug references
- LONGWORDS
  - Lots of 5+ letter words in a row

slide-18
SLIDE 18

18

New Rules

HTML_BACKHAIR_* Catches HTML obfuscation techniques:

up to 8<b></b>0% </strong> by purch<b></b>asing onl<b></b>ine for ac</amount>cess to mi<grab>llions of pr<clergy>ivate, sen<hem>sitive <cab>online re</maxwellian>cords,<br> Free<kkmx7fb1lxwk0p1> O<ku0j5aa3xhln6z1<k9lntxebsm7452>>nl<k8sk2493yb31md1>ine Consult<br> Order pr<kn10yxtomj0e82>escrip<k0602x82qzft>tion onl<kv5mh0x2lq1npz>ine and <k3eh16dp3swwg1e>Cheap<br>

slide-19
SLIDE 19

19

New Rules

MPART_ALT_DIFF
- Looks for multipart/alternative messages with significantly different word lists in text/plain and text/html parts

    ------=_NextPart_000_00AM_08K3791OO_07L.777L91H0
    Content-Type: text/plain

    Get a capable html e-mailer

    ------=_NextPart_000_00AM_08K3791OO_07L.777L91H0
    Content-Type: text/html

    Buy my pills and mortgage!

    ------=_NextPart_000_00AM_08K3791OO_07L.777L91H0--

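
One way to picture the comparison is a word-set difference between the two parts. The sketch below is an assumption about the general idea, not the rule's actual implementation; `word_set` and `word_difference` are hypothetical helpers, and what counts as "significantly different" is left open.

```perl
use strict;
use warnings;

# Lower-cased set of words in a text; HTML tags are stripped crudely first.
sub word_set {
    my ($text) = @_;
    $text =~ s/<[^>]*>/ /g;
    my %seen = map { (lc($_) => 1) } $text =~ /[A-Za-z']+/g;
    return \%seen;
}

# Fraction of distinct words not shared by both parts (0 = same, 1 = disjoint).
sub word_difference {
    my ($plain, $html) = @_;
    my ($p, $h) = (word_set($plain), word_set($html));
    my %union  = (%$p, %$h);
    my $total  = keys %union or return 0;
    my $shared = grep { $h->{$_} } keys %$p;
    return 1 - $shared / $total;
}

# The two parts from the slide share no words at all.
# prints "1.00"
printf "%.2f\n", word_difference(
    'Get a capable html e-mailer',
    'Buy my pills and mortgage!',
);
```
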
slide-20
SLIDE 20

20

New Rules: Spammers make it easy?

MSGID_SPAM_CAPS
- Message-ID header is in /^[A-Z]+@/ format
- Catches 11.3% of spam, no FPs

    Message-ID: <SXEXBAZDNVTGYMYBTRKUWOSQ@finklfan.com>
    Message-ID: <EKGSGWAIBTGZTSHZZBBZ@yahoo.com>
    Message-ID: <CMIVFJJHOPNXVBXUUP@hoardermail.com>
    Message-ID: <HAWZFYXQLDVBHGKSSMVDS@t-online.de>
    Message-ID: <YYIMPKBREVIVSFCLRKKBFI@webtv.com>
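As a rough illustration, the pattern from the slide can be applied to a Message-ID value once the angle brackets are removed. `msgid_is_all_caps` is a hypothetical helper; stripping the brackets first is an assumption about how the slide's regexp is meant to be read.

```perl
use strict;
use warnings;

# True if the local part of the Message-ID is nothing but capital letters.
sub msgid_is_all_caps {
    my ($header_value) = @_;
    (my $id = $header_value) =~ s/^\s*<|>\s*$//g;    # drop surrounding <>
    return $id =~ /^[A-Z]+@/ ? 1 : 0;
}

for my $id ('<SXEXBAZDNVTGYMYBTRKUWOSQ@finklfan.com>',
            '<61DF7FD0F6D44F26B764B5C2CE4C9ECFA6D1E8@anbok.com>') {
    printf "%d %s\n", msgid_is_all_caps($id), $id;
}
```

The second Message-ID, taken from the spamd log example, contains digits and so does not match.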

slide-21
SLIDE 21

21

New Rules: Spammers make it easy?

RCVD_DOUBLE_IP_SPAM
- Received header is fake with two IPs listed
- Catches 12.5% of spam, no FPs

    Received: from [119.227.62.1] by 64.142.3.173 with ESMTP
        id <110617-93232>; Fri, 27 Aug 2004 23:59:33 +0300
    Received: from 110.56.100.200 by 211.190.241.62;
        Sun, 10 Oct 2004 10:03:35 +0600

slide-22
SLIDE 22

22

New Rules: Spammers make it easy?

X_MESSAGE_INFO
- X-Message-Info header exists...
- Catches 18.0% of spam, no FPs

    X-Message-Info: 7wCUko664gJL/isOpbpHZpUXeysrI7Ea
    X-Message-Info: TBEqiuUDX224aiZQ59TCWxBY0AToUL99HSW7V9gnf576J
    X-Message-Info: 5%RNDLCCHAR37%RNDDIGIT15iI/zPMjruQBFrbQUxdR2AManr
    X-Message-Info: %RNDUCCHAR15c%RNDUCCHAR1548fspGLBoaq%RNDUCCHAR16opvCRRkfnGFQoxl3

slide-23
SLIDE 23

23

New Rules: Spammers make it easy?

RCVD_HELO_IP_MISMATCH
- Received header indicates sender used IP for HELO, but it doesn’t match the sender’s IP
- 25.7% of spam, 0.03% FPs, all misconfigured MTAs

    Received: from 65.214.43.12 (unknown [211.222.252.28])
        by bblisa.bblisa.org (Postfix) with SMTP id DD6DE1768DB
        for <felicity@kluge.net>; Sat, 11 Sep 2004 00:38:56 -0400 (EDT)
    Received: from 64.142.3.173 (unknown [219.248.62.167])
        by bugzilla.spamassassin.org (Postfix) with SMTP id CFD6C83899
        for <felicity@kluge.net>; Wed, 13 Oct 2004 22:06:28 -0700 (PDT)
    Received: from 66.92.69.221 (unknown [211.217.181.250])
        by eclectic.kluge.net (Postfix) with SMTP id BECCD444550
        for <felicity@kluge.net>; Thu, 14 Oct 2004 00:53:37 -0400 (EDT)
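The check can be sketched for Postfix-style headers like the ones above. This is a simplified illustration, not the rule's real implementation; `helo_ip_mismatch` is a hypothetical helper, and it only understands the `from HELO (rdns [ip])` layout.

```perl
use strict;
use warnings;

my $ip_re = qr/\d{1,3}(?:\.\d{1,3}){3}/;

# True if the HELO string is an IP address that differs from the connecting IP.
sub helo_ip_mismatch {
    my ($received) = @_;
    my ($helo, $ip) =
        $received =~ /^Received: from ($ip_re) \(\S+ \[($ip_re)\]\)/
        or return 0;
    return $helo ne $ip ? 1 : 0;
}

print helo_ip_mismatch(
    'Received: from 65.214.43.12 (unknown [211.222.252.28]) '
  . 'by bblisa.bblisa.org (Postfix) with SMTP id DD6DE1768DB'
), "\n";
```

A HELO that is a hostname, or an IP that matches the connecting address, returns 0 in this sketch.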

slide-24
SLIDE 24

24

New Rules: Spammers make it easy?

MIME_BOUND_DD_DIGITS
- MIME boundary is simply /^--[0-9]+/
- Catches 36.5% of spam, no FPs

    Content-Type: text/html; boundary="--5050984427071928258"
    Content-Type: text/plain; boundary="--5895368826571874203"
    Content-Type: multipart/alternative; boundary="--2396152152574698241"
    Content-Type: multipart/mixed; boundary="--44188425536568249"
    Content-Type: multipart/related; boundary="--610294112918606"
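A minimal sketch of the test, assuming the boundary parameter is first pulled out of the Content-Type header and then matched against the regexp given on the slide. `boundary_is_dd_digits` is a hypothetical helper.

```perl
use strict;
use warnings;

# True if the boundary parameter starts with "--" followed by digits.
sub boundary_is_dd_digits {
    my ($content_type) = @_;
    my ($boundary) = $content_type =~ /\bboundary="?([^";]+)"?/i
        or return 0;
    return $boundary =~ /^--[0-9]+/ ? 1 : 0;
}

for my $ct ('multipart/mixed; boundary="--44188425536568249"',
            'multipart/alternative; '
          . 'boundary="----=_NextPart_000_00AM_08K3791OO_07L.777L91H0"') {
    printf "%d %s\n", boundary_is_dd_digits($ct), $ct;
}
```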

slide-25
SLIDE 25

25

Rules

- AutoWhiteList (AWL) now on by default
  - Partially due to change from command-line option to configuration parameter
  - Mainly because the idea and code are mature and work fairly well
- AWL tracks From address, sending IP network, and average message scores over time, and moves future mail scores towards the average

    felicity@kluge.net|ip=66.92 => # of messages received
    felicity@kluge.net|ip=66.92|totscore => total score of messages received
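The two stored values are enough to compute a running mean and pull a new score towards it. The sketch below illustrates that arithmetic under stated assumptions: `awl_adjust` is a hypothetical function, the pull factor of 0.5 is an assumed value, and storing the unadjusted score in the running total is also an assumption.

```perl
use strict;
use warnings;

# $db uses the key layout from the slide:
#   "addr|ip=a.b"          => number of messages seen
#   "addr|ip=a.b|totscore" => sum of their scores
sub awl_adjust {
    my ($db, $addr, $ip, $score, $factor) = @_;
    $factor //= 0.5;                           # assumed pull strength
    my ($net) = $ip =~ /^(\d+\.\d+)\./;        # first two octets, as in "ip=66.92"
    my $key   = "$addr|ip=$net";
    my $count = $db->{$key} // 0;

    my $adjusted = $score;
    if ($count > 0) {
        my $mean = $db->{"$key|totscore"} / $count;
        $adjusted = $score + ($mean - $score) * $factor;
    }

    # Record this message (assumption: the unadjusted score is what is stored).
    $db->{$key} = $count + 1;
    $db->{"$key|totscore"} += $score;
    return $adjusted;
}

my %db;
my $first  = awl_adjust(\%db, 'felicity@kluge.net', '66.92.69.221', 2);
my $second = awl_adjust(\%db, 'felicity@kluge.net', '66.92.69.221', 10);
printf "%.1f %.1f\n", $first, $second;    # prints "2.0 6.0"
```

With no history the score is unchanged; on the second message the mean so far is 2, so a raw score of 10 moves halfway to 6.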

slide-26
SLIDE 26

26

Bayes Changes

- Storage backend now has “plugin” capability
  - Berkeley DB (BDB) is default, added SQL in v3
- Added capability to backup & restore
  - Good for backup and recovery, modifying stored values, converting between storage backends, etc.
- Added flock locking option for all SA DB access
- Tokens are now stored as hash values, not raw
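Storing a hash instead of the raw token gives every key the same small size regardless of token length. The slide does not say which hash or length is used; the sketch below assumes SHA-1 truncated to its last 5 bytes, and `token_key` is a hypothetical helper.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1);

# Fixed-size database key for a token (assumed scheme: last 5 bytes of SHA-1).
sub token_key {
    my ($token) = @_;
    return substr(sha1($token), -5);
}

for my $token ('mortgage', 'a-very-long-token-taken-from-a-message-body') {
    printf "%-45s %s\n", $token, unpack('H*', token_key($token));
}
```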

slide-27
SLIDE 27

27

Bayes in SQL

- Supports MySQL and PostgreSQL natively
- Lots of benefits, generally faster overall
  - Scanning: 3-30% faster, depending on # of tokens
  - Learning: 2-3x slower, requires multiple SQL commands per update
  - Expiry: 6-7x faster, BDB does lots of I/O, etc.
- For more information, see Michael Parker’s presentation following this one!

slide-28
SLIDE 28

28

Out with the GA!

- Replaced Genetic Algorithm (GA) for score generation with Perceptron Learner
- No one wanted to deal with the GA code
  - Did anyone understand the code? Not really.
  - Most time spent kluging around glue scripts
- Perceptron is much, much faster
  - GA took 6-24 hours/scoreset for 2.5 and 2.6
  - Perceptron took 8 minutes/scoreset for 3.0.0

slide-29
SLIDE 29

29

General Perceptron

- Per message, input is bit array of rules which were hit
- Multiply input bits by respective rule weights, and sum
- Squash result into 0-1 range (ham vs spam)
- Modify weights so result approaches desired value
- At end, weights become scores via linear transformation

[Figure: perceptron diagram. Input layer (ACCEPT_CREDIT_CARDS, YOU_WON, BAYES_99, HTML_80_90, IMPOTENCE, URIBL_WS_SURBL) feeds through weights (w1, w36, w289, w399, w724, w770) into a sigma (Σ) node, followed by a sigmoid gain function.]
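
The steps on this slide map onto a few lines of code. The following is a minimal single-layer sketch, not the project's actual learner: the update rule (a plain gradient step on the sigmoid output), the learning rate, the epoch count, and the two toy training messages are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Squash a weighted sum into the 0-1 range.
sub sigmoid { return 1 / (1 + exp(-$_[0])) }

# Train one weight per rule. Each example is [ \@rules_hit, $is_spam ].
sub train {
    my ($examples, $epochs, $rate) = @_;
    my %w;
    for (1 .. $epochs) {
        for my $ex (@$examples) {
            my ($hits, $target) = @$ex;

            # Input bits are 1 for rules that hit, so the sum is just
            # the sum of the weights of those rules.
            my $sum = 0;
            $sum += $w{$_} // 0 for @$hits;
            my $out = sigmoid($sum);

            # Nudge the weights of the rules that hit towards the target.
            my $delta = $rate * ($target - $out);
            $w{$_} += $delta for @$hits;
        }
    }
    return \%w;
}

# Toy data only: one spam and one ham message, rule names from the diagram.
my @toy = (
    [ [qw(BAYES_99 YOU_WON)], 1 ],
    [ [qw(HTML_80_90)],       0 ],
);
my $w = train(\@toy, 200, 0.5);
printf "%-12s %+.2f\n", $_, $w->{$_} for sort keys %$w;
```

The final step on the slide, turning weights into scores through a linear transformation, is left out here.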

slide-30
SLIDE 30

30

Plugins!

- Works towards the “SpamAssassin as engine” goal
- Can easily disable unused/optional code
- Started out as a method for third-party EvalTests, became a general code enhancer
- Can modify general behavior and configuration
- Can show extra “internal” debug information
- Easy to test out new ideas without core changes

slide-31
SLIDE 31

Plugin Example

    package Mail::SpamAssassin::Plugin::MSExec;

    use Mail::SpamAssassin::Plugin;
    use strict;
    use bytes;
    use vars qw(@ISA);
    @ISA = qw(Mail::SpamAssassin::Plugin);

    # Constructor: subclass the general Plugin module and register the eval rule.
    sub new {
        my ($class, $mailsaobject) = @_;
        $class = ref($class) || $class;
        my $self = $class->SUPER::new($mailsaobject);
        bless ($self, $class);
        $self->register_eval_rule ("check_microsoft_executable");
        return $self;
    }

- Example based on 3.1 MSExec Plugin
- Small amount of Perl involved to create the Plugin
  - Creates a subclass of the general Plugin module
  - Registers “eval” rule

slide-32
SLIDE 32

Plugin Example

- Look at all “application” or “text” parts
- Return true if part filename has a standard Microsoft executable extension
- Return true if part is base64 encoded and first line matches regexp

    sub check_microsoft_executable {
        my ($self, $permsgstatus) = @_;

        # Look at all "application" or "text" parts.
        foreach my $p ($permsgstatus->{msg}->find_parts(
                           qr/^(application|text)\b/)) {
            my ($ctype, $boundary, $charset, $name) =
                Mail::SpamAssassin::Util::parse_content_type(
                    $p->get_header('content-type'));

            if (lc $ctype eq 'application/octet-stream') {
                # Filename has a standard Microsoft executable extension.
                return 1 if ($name =~ /\.(?:scr|bat|com|pif|exe)$/i);

                # Base64 part whose first line looks like an encoded "MZ" header.
                my $cte = $p->get_header('content-transfer-encoding');
                return 1 if ($cte =~ /base64/i &&
                    $p->raw()->[0] =~ /^TV[opqr].A..[AB].[AQgw][A-H].A/);
            }
        }
        return 0;
    }

slide-33
SLIDE 33

33

Standard 3.0 Plugins

- URIDNSBL
  - Check URI domains against DNSBL
  - SURBL and SBL by default
- SPF (Sender Permitted From)
  - DNS records specify hosts allowed to send mail for domain
  - Is anti-forging method, not anti-spam
- Hashcash

slide-34
SLIDE 34

34

URIDNSBL Support

- SpamAssassin finds definite URIs and URI-looking strings in message body text and HTML markup
- RegistrarBoundaries used to determine domain
  - Understands standard TLD (apache.org), 2 level ccTLD (theregister.co.uk), and 3-4 level domains (mostly RFC 1480: state.ma.us, pja.pvt.k12.or.us)
  - Handles issues with wildcard hostnames
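The domain-boundary idea can be sketched as a longest-suffix lookup: find the longest known registry suffix at the end of the hostname and keep one more label in front of it. This is not the RegistrarBoundaries module itself; `registered_domain` and the three-entry suffix table are hypothetical, and a real table would be far longer.

```perl
use strict;
use warnings;

# Tiny illustrative table of multi-label registry suffixes.
my %suffix = map { $_ => 1 } qw(co.uk ma.us pvt.k12.or.us);

sub registered_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;

    # Try the longest candidate suffix first; a single label always
    # counts as a plain TLD.
    for my $n (reverse 1 .. $#labels) {
        my $candidate = join '.', @labels[-$n .. -1];
        next unless $n == 1 || $suffix{$candidate};
        return join '.', @labels[-($n + 1) .. -1];
    }
    return $host;
}

printf "%-24s => %s\n", $_, registered_domain($_)
    for qw(www.apache.org www.theregister.co.uk
           state.ma.us pja.pvt.k12.or.us ca.rd.yahoo.com);
```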

slide-35
SLIDE 35

35

URIDNSBL Support

Built-in handling of URI redirectors

    http://ca.rd.yahoo.com/*http://www.pkabdfbudb.info/

internally becomes:

    http://ca.rd.yahoo.com/*http://www.pkabdfbudb.info/
    http://www.pkabdfbudb.info/

URIDNSBL is passed the two domains:

    yahoo.com
    pkabdfbudb.info
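A simple way to picture the redirector handling is to treat any later `http://` or `https://` inside a URI as a second URI in its own right. The sketch below does only that; `expand_redirects` is a hypothetical helper and the real logic may well be more careful.

```perl
use strict;
use warnings;

# Return the original URI followed by any URIs embedded inside it.
sub expand_redirects {
    my ($uri) = @_;
    my @found = ($uri);
    while ($uri =~ m{^.+?(https?://.+)}i) {
        $uri = $1;
        push @found, $uri;
    }
    return @found;
}

print "$_\n"
    for expand_redirects('http://ca.rd.yahoo.com/*http://www.pkabdfbudb.info/');
```

For the slide's example this yields the same two URIs listed above.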

slide-36
SLIDE 36

36

URIDNSBL Support

Built-in handling of URI encoding techniques

    http://&#099;%61%2Erd.yahoo.com/*%68%74%74%70://www.pkabdfbudb.info/

internally becomes:

    http://&#099;%61%2Erd.yahoo.com/*%68%74%74%70://www.pkabdfbudb.info/
    http://ca.rd.yahoo.com/*http://www.pkabdfbudb.info/
    http://www.pkabdfbudb.info/

URIDNSBL is still only passed the two domains:

    yahoo.com
    pkabdfbudb.info
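The two encodings in this example, decimal HTML entities and `%XX` URL escapes, can each be undone with one substitution. `decode_uri` is a hypothetical helper covering only these two forms.

```perl
use strict;
use warnings;

sub decode_uri {
    my ($uri) = @_;
    $uri =~ s/&#(\d+);/chr($1)/ge;                  # &#099; becomes "c"
    $uri =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # %61 becomes "a"
    return $uri;
}

# prints "http://ca.rd.yahoo.com/*http://www.pkabdfbudb.info/"
print decode_uri(
    'http://&#099;%61%2Erd.yahoo.com/*%68%74%74%70://www.pkabdfbudb.info/'
), "\n";
```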

slide-37
SLIDE 37

37

URIDNSBL Whitelist

- Local URIDNSBL whitelist
  - Skip checking common non-spam domains
  - google.com, apache.org, ups.com, etc.
- Decreases load on URIDNSBL servers, makes SpamAssassin scanning faster
- Originally planned for 3.1 series, but included in v3.0.1

slide-38
SLIDE 38

38

Work in Progress for SpamAssassin 3.1

slide-39
SLIDE 39

39

“First make it work, then make it work better.”

slide-40
SLIDE 40

40

Upcoming Features

- Focusing attention on speed and accuracy
- Move more EvalTests to Plugins
  - Razor, DCC, Pyzor, AutoWhiteList(?) etc.
- Faster (automatic) updates?
  - Scores, rules, plugins, etc.
  - Requires large, scalable, secure infrastructure
  - Larger sample of message results required

slide-41
SLIDE 41

41

Upcoming Features

- More plugin hooks and integration
- Return of early exit? (aka: short circuit)
  - Stop processing message when result is assured
  - ... or if certain rules hit?
    - Habeas, BondedSender
    - USER_IN_{WHITE,BLACK}LIST

slide-42
SLIDE 42

42

Upcoming Features

- More emulation of MUAs!
- Keep finding ways in which common MUAs don’t follow the relevant RFCs.
  - No blank line between header and body
  - Content-Type parsing
  - etc!

slide-43
SLIDE 43

43

References

- http://www.kluge.net/~felicity/AC2004/
- http://spamassassin.apache.org/
- http://www.surbl.org/
- http://www.spamhaus.org/sbl/
- http://www.hashcash.org/
- http://www.habeas.com/
- http://www.bondedsender.com/