New and upcoming features in SpamAssassin v3
ApacheCon 2004 November 15, 2004 By: Theo Van Dinter
2
Project Changes
- Became an ASF Top Level Project
- New logo!
- License change from GPL/PAL to ASL2
  - Took 4+ months for 100+ retroactive CLAs
- Moved from SourceForge to ASF
  - Mailing lists
  - CVS to Subversion
3
- New version number scheme (x.y.z vs x.yz)
- Minimum perl version increased from 5.005 to 5.6.1
- Major API Changes
- Code cleanup
- Message Parsing
- Module merging and separation
4
2.60 vs 3.0.0
- 1712 commits
- 941k vs 1.0m (gzip release file)
- 1 year exactly between releases
  - 9 months in 2.70 & 3.0.0 development
  - 2 months in pre-release mode (scores and testing)
  - 1 month in release candidate mode (beta testing)
5
Code is in multiple pieces:
- Message filter
  - Read in message, parse, rewrite output
- Rule engine
  - Run hundreds of rules over message contents, handle priorities, scoring (weight) per rule, etc.
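As a rough illustration of the rule engine idea (run rules in priority order, add up a weight per hit, compare to a threshold), here is a minimal sketch. The rule names, patterns, scores, priorities, and the scan_message helper are all hypothetical and are not taken from SpamAssassin itself.

```perl
use strict;
use warnings;

# Hypothetical rules: names, patterns, scores, and priorities are made up
# for illustration and are not part of SpamAssassin's rule set.
my @rules = (
    { name => 'EXAMPLE_PILLS',      priority => 0,   score => 2.5,
      test => sub { $_[0] =~ /\bpills\b/i } },
    { name => 'EXAMPLE_MORTGAGE',   priority => 0,   score => 1.5,
      test => sub { $_[0] =~ /\bmortgage\b/i } },
    { name => 'EXAMPLE_KNOWN_GOOD', priority => -10, score => -3.0,
      test => sub { $_[0] =~ /\bquarterly report\b/i } },
);

# Run every rule in priority order (lowest number first), sum the scores
# of the rules that hit, and compare the total against a threshold.
sub scan_message {
    my ($text, $threshold) = @_;
    my ($total, @hits) = (0);
    for my $rule (sort { $a->{priority} <=> $b->{priority} } @rules) {
        next unless $rule->{test}->($text);
        $total += $rule->{score};
        push @hits, $rule->{name};
    }
    my $is_spam = $total >= $threshold ? 1 : 0;
    return { score => $total, is_spam => $is_spam, hits => \@hits };
}
```

Keeping each rule as data (name, score, test) is what lets scores be regenerated without touching the engine code.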
6
7
- Core of the filter is the message parser
- v2 did OK, but complex MIME didn’t work at all
- Removed Mail::Audit support and NoMailAudit module, replaced with Message and Message::Node
- Ground-up rewrite of parser
  - Goal to handle even complex MIME messages
  - Better emulation of common MUA behavior
8
- Linear vs Recursive
- New internal tree structure
- Just-In-Time (JIT) behavior when possible
9
- MUA emulation
  - OE HTML heuristic
  - Content-Type boundary handling
  - Non-RFC compliance
10
- Configuration parser separated from options
  - Makes handling of option values standardized
- Parsing is much faster
  - Hash lookup, not linear if-then-else logic
- Configuration files can now include other files
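The "hash lookup, not linear if-then-else" point can be sketched with a dispatch table. This is only an illustration of the technique: the %handlers table, the handler bodies, and parse_config are hypothetical, and the real configuration parser is considerably more involved.

```perl
use strict;
use warnings;

# Hypothetical dispatch table: option name => handler. The option names
# look like SpamAssassin settings, but the handlers are illustrative only.
my %handlers = (
    required_score => sub {
        my ($conf, $value) = @_;
        $conf->{required_score} = $value + 0;
    },
    rewrite_header => sub {
        my ($conf, $value) = @_;
        my ($header, $text) = split ' ', $value, 2;
        $conf->{rewrite_header}{ lc $header } = $text;
    },
);

# Parse configuration text line by line. One hash lookup per option
# replaces a long if/elsif chain over every known option name.
sub parse_config {
    my ($conf, $text) = @_;
    my @unknown;
    for my $line (split /\n/, $text) {
        $line =~ s/#.*$//;          # strip comments (simplified)
        $line =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next if $line eq '';
        my ($key, $value) = split ' ', $line, 2;
        $value = '' unless defined $value;
        if (my $handler = $handlers{$key}) {
            $handler->($conf, $value);
        } else {
            push @unknown, $key;    # collect unrecognized options
        }
    }
    return @unknown;
}
```

Because every option goes through the same table, value handling stays uniform and adding an option means adding one entry.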
11
- Added support for UW mbx format
  - Now handles: file, mbox, mbx, dir
- spamassassin script now uses ArchiveIterator
  - MUCH faster for batch operations
- Message parser JIT behavior makes remove markup mode super fast!
12
$ ls -la 1000spam
$ time formail -s spamassassin-260 -L < 1000spam > 1000spam.spam
900.550u 71.580s 17:06.06 94.7% 0+0k 0+0io 724863pf+0w
$ time formail -s spamassassin-260 -d < 1000spam.spam > 1000spam.clean
706.440u 55.200s 13:21.33 95.0% 0+0k 0+0io 722119pf+0w
diff reported 51 differences, all Subject header whitespace related
$ time spamassassin-30 -L --mbox 1000spam > 1000spam.spam
69.700u 0.600s 1:13.44 95.7% 0+0k 0+0io 730pf+0w
$ time spamassassin-30 -d --mbox 1000spam.spam > 1000spam.clean
3.360u 0.290s 0:03.66 99.7% 0+0k 0+0io 726pf+0w
diff reported 0 differences
Scanning: 14x faster; removing markup: 209x!
13
- spamd is daemonized spamassassin scanner
- Previous versions accepted message, then forked to process
- 3.0.0 pre-forks children who “randomly” accept connections and do processing
- Causes lots of challenges, such as reverting user configuration, GC, resource usage, etc.
14
Output log includes a mass-check compatible result line:
Oct 10 09:34:57 eclectic spamd[14215]: result: Y 14 - ADDRESS_IN_SUBJECT,BAYES_99,DNS_FROM_AHBL_RHSBL,EXCUSE_1,EXCUSE_3,EXCUSE_7,HTML_90_100,HTML_IMAGE_RATIO_02,HTML_MESSAGE,MARKETING_PARTNERS,MIME_QP_LONG_LINE,MPART_ALT_DIFF,RAZOR2_CHECK,RCVD_IN_SBL,URIBL_OB_SURBL,URIBL_SBL,URIBL_WS_SURBL scantime=2.1,size=6896,mid=<61DF7FD0F6D44F26B764B5C2CE4C9ECFA6D1E8@anbok.com>,bayes=0.999669884503962,autolearn=no
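A log line in this shape is easy to post-process. The sketch below is a hypothetical parse_result_line helper, not part of SpamAssassin. It assumes the line is unwrapped (no spaces inside the rule list or the key=value list), that "Y" marks spam and "." marks ham, and that at least one rule hit; a Message-ID containing a comma would need more careful handling.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the verdict, score, rule hits, and key=value
# fields out of one spamd "result:" log line.
sub parse_result_line {
    my ($line) = @_;
    return unless $line =~
        /\bresult:\s+([Y.])\s+(-?\d+)\s+-\s+(\S+)\s+(\S+)\s*$/;
    my ($flag, $score, $rules, $fields) = ($1, $2, $3, $4);

    my %info = (
        is_spam => ($flag eq 'Y' ? 1 : 0),
        score   => $score,
        rules   => [ split /,/, $rules ],
    );
    # Trailing fields look like scantime=2.1,size=6896,...
    for my $pair (split /,/, $fields) {
        my ($key, $value) = split /=/, $pair, 2;
        $info{$key} = $value;
    }
    return \%info;
}
```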
15
16
- 2.60 had 872 rules, 3.0.0 has 628
  - 407 kept, 465 removed, 221 added, includes renamed rules
- 2.60 had 160 sub-rules, 3.0.0 has 227
  - 130 kept, 30 removed, 97 added
17
- RCVD_IN_XBL
  - Spamhaus exploit list
- DRUGS_*
  - Common drug references
- LONGWORDS
  - Lots of 5+ letter words in a row
18
HTML_BACKHAIR_* Catches HTML obfuscation techniques:
up to 8<b></b>0% </strong> by purch<b></b>asing onl<b></b>ine for ac</amount>cess to mi<grab>llions of pr<clergy>ivate, sen<hem>sitive <cab>online re</maxwellian>cords,<br> Free<kkmx7fb1lxwk0p1> O<ku0j5aa3xhln6z1<k9lntxebsm7452>>nl<k8sk2493yb31md1>ine Consult<br> Order pr<kn10yxtomj0e82>escrip<k0602x82qzft>tion onl<kv5mh0x2lq1npz>ine and <k3eh16dp3swwg1e>Cheap<br>
19
MPART_ALT_DIFF
- Looks for multipart/alternative messages with significantly different word lists in text/plain and text/html parts

Content-Type: text/plain
Get a capable html e-mailer

Content-Type: text/html
Buy my pills and mortgage!
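One way to picture "significantly different word lists" is the share of words in the HTML part that never appear in the plain-text part. The sketch below takes that approach; the word_set and alt_diff_ratio helpers, the crude tag stripping, and the 3-letter minimum are assumptions for illustration, not the rule's actual algorithm.

```perl
use strict;
use warnings;

# Collect the lowercase words (3+ word characters) found in a text.
sub word_set {
    my ($text) = @_;
    my %seen;
    while ($text =~ /(\w{3,})/g) {
        $seen{ lc $1 } = 1;
    }
    return \%seen;
}

# Fraction of HTML-part words that are missing from the plain-text part:
# 0 means the parts agree, 1 means they share no words at all.
sub alt_diff_ratio {
    my ($plain, $html) = @_;
    $html =~ s/<[^>]*>/ /g;    # crude tag stripping, good enough for a sketch
    my ($plain_words, $html_words) = (word_set($plain), word_set($html));
    my @words = keys %$html_words;
    return 0 unless @words;
    my $missing = grep { !exists $plain_words->{$_} } @words;
    return $missing / @words;
}
```

For the two parts shown above the ratio comes out at its maximum, since no HTML word appears in the plain-text part.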
20
MSGID_SPAM_CAPS
- Message-ID header is in /^[A-Z]+@/ format
- Catches 11.3% of spam, no FPs

Message-ID: <SXEXBAZDNVTGYMYBTRKUWOSQ@finklfan.com>
Message-ID: <EKGSGWAIBTGZTSHZZBBZ@yahoo.com>
Message-ID: <CMIVFJJHOPNXVBXUUP@hoardermail.com>
Message-ID: <HAWZFYXQLDVBHGKSSMVDS@t-online.de>
Message-ID: <YYIMPKBREVIVSFCLRKKBFI@webtv.com>
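The pattern can be tried out in a few lines of Perl. The msgid_all_caps helper below is hypothetical; it strips the leading angle bracket before applying the /^[A-Z]+@/ pattern quoted above, and the shipped rule may carry extra exceptions that are not shown here.

```perl
use strict;
use warnings;

# Hypothetical check: is the local part of the Message-ID made up of
# nothing but uppercase ASCII letters?
sub msgid_all_caps {
    my ($msgid) = @_;
    $msgid =~ s/^\s*<//;    # drop optional whitespace and the opening bracket
    return $msgid =~ /^[A-Z]+\@/ ? 1 : 0;
}
```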
21
RCVD_DOUBLE_IP_SPAM
- Received header is fake with two IPs listed
- Catches 12.5% of spam, no FPs

Received: from [119.227.62.1] by 64.142.3.173 with ESMTP id <110617-93232>; Fri, 27 Aug 2004 23:59:33 +0300
Received: from 110.56.100.200 by 211.190.241.62; Sun, 10 Oct 2004 10:03:35 +0600
22
X_MESSAGE_INFO
- X-Message-Info header exists...
- Catches 18.0% of spam, no FPs

X-Message-Info: 7wCUko664gJL/isOpbpHZpUXeysrI7Ea
X-Message-Info: TBEqiuUDX224aiZQ59TCWxBY0AToUL99HSW7V9gnf576J
X-Message-Info: 5%RNDLCCHAR37%RNDDIGIT15iI/zPMjruQBFrbQUxdR2AManr
X-Message-Info: %RNDUCCHAR15c%RNDUCCHAR1548fspGLBoaq%RNDUCCHAR16opvCRRkfnGFQoxl3
23
RCVD_HELO_IP_MISMATCH
- Received header indicates sender used IP for HELO, but it doesn’t match the sender’s IP
- 25.7% of spam, 0.03% FPs, all misconfigured MTAs

Received: from 65.214.43.12 (unknown [211.222.252.28]) by bblisa.bblisa.org (Postfix) with SMTP id DD6DE1768DB for <felicity@kluge.net>; Sat, 11 Sep 2004 00:38:56 -0400 (EDT)
Received: from 64.142.3.173 (unknown [219.248.62.167]) by bugzilla.spamassassin.org (Postfix) with SMTP id CFD6C83899 for <felicity@kluge.net>; Wed, 13 Oct 2004 22:06:28 -0700 (PDT)
Received: from 66.92.69.221 (unknown [211.217.181.250]) by eclectic.kluge.net (Postfix) with SMTP id BECCD444550 for <felicity@kluge.net>; Thu, 14 Oct 2004 00:53:37 -0400 (EDT)
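For Postfix-style Received headers like these, the comparison can be sketched as below. The helo_ip_mismatch helper is hypothetical and only understands the "from HELO (rdns [ip])" layout shown above; the real rule works on SpamAssassin's parsed Received metadata and may apply a looser comparison.

```perl
use strict;
use warnings;

# Hypothetical check: the HELO string is a bare IPv4 address, and it is
# not the address the connection actually came from (shown in brackets).
sub helo_ip_mismatch {
    my ($received) = @_;
    my $ipv4 = qr/\d{1,3}(?:\.\d{1,3}){3}/;
    return 0 unless $received =~ /\bfrom\s+($ipv4)\s+\(\S*\s*\[($ipv4)\]\)/;
    return $1 ne $2 ? 1 : 0;
}
```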
24
MIME_BOUND_DD_DIGITS
- MIME boundary is simply /^--[0-9]+/
- Catches 36.5% of spam, no FPs

Content-Type: text/html; boundary="--5050984427071928258"
Content-Type: text/plain; boundary="--5895368826571874203"
Content-Type: multipart/alternative; boundary="--2396152152574698241"
Content-Type: multipart/mixed; boundary="--44188425536568249"
Content-Type: multipart/related; boundary="--610294112918606"
25
- AutoWhiteList (AWL) now on by default
  - Partially due to change from commandline
  - Mainly because the idea and code are mature and work fairly well
- AWL tracks From address, sending IP network, and average message scores over time, moves future mail scores towards the average

felicity@kluge.net|ip=66.92 => # of messages received
felicity@kluge.net|ip=66.92|totscore => total score of messages received
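The two keys above are enough to move a new score towards the sender's historical mean. The sketch below uses a plain hash as the store; the awl_adjust helper and the 0.5 blending factor are assumptions for illustration, not the actual AWL code.

```perl
use strict;
use warnings;

# Hypothetical AWL-style adjustment using the two keys shown above:
#   "addr|ip=a.b"          => number of messages seen
#   "addr|ip=a.b|totscore" => sum of their scores
sub awl_adjust {
    my ($db, $from, $ip, $score, $factor) = @_;
    $factor = 0.5 unless defined $factor;      # assumed blending factor

    my ($net) = $ip =~ /^(\d+\.\d+)\./;        # first two octets, e.g. 66.92
    my $key   = lc($from) . '|ip=' . (defined $net ? $net : 'none');
    my $count = $db->{$key} || 0;
    my $total = $db->{"$key|totscore"} || 0;

    my $adjusted = $score;
    if ($count > 0) {
        my $mean = $total / $count;
        # Move the score part of the way towards the historical mean.
        $adjusted = $score + ($mean - $score) * $factor;
    }

    # Store the unadjusted score so future averages are not skewed.
    $db->{$key}            = $count + 1;
    $db->{"$key|totscore"} = $total + $score;
    return $adjusted;
}
```

A sender with a long history of low scores therefore gets some benefit of the doubt on one bad message, and the reverse holds for a habitual spammer.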
26
- Storage backend now has “plugin” capability
  - Berkeley DB (BDB) is default, added SQL in v3
- Added capability to backup & restore
  - Good for backup and recovery, modifying stored values, converting between storage backends, etc.
- Added flock locking option for all SA DB access
- Tokens are now stored as hash values, not raw
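Storing a hash instead of the raw token gives every database key the same small size. A minimal sketch follows, assuming a SHA-1 digest truncated to its last 5 bytes; the digest choice, the truncation length, and the token_key helper name are assumptions rather than details given in the talk.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1);

# Assumption: the key is the last 5 bytes of the binary SHA-1 digest,
# so every token maps to a fixed-size key regardless of its length.
sub token_key {
    my ($token) = @_;
    return substr(sha1($token), -5);
}
```

The trade-off is that stored tokens can no longer be read back as text, which is one reason a backup and restore facility is useful.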
27
- Supports MySQL and PostgreSQL natively
- Lots of benefits, generally faster overall
  - Scanning, 3-30% faster, depending on # of tokens
  - Learning, 2-3x slower, requires multiple SQL commands per update
  - Expiry, 6-7x faster, BDB does lots of I/O, etc.
- For more information, see Michael Parker’s presentation following this one!
28
- Replaced Genetic Algorithm (GA) for score generation with Perceptron Learner
- No one wanted to deal with the GA code
  - Did anyone understand the code? Not really.
  - Most time spent kluging around glue scripts
- Perceptron is much, much faster
  - GA took 6-24 hours/scoreset for 2.5 and 2.6
  - Perceptron took 8 minutes/scoreset for 3.0.0
29
- Per message, input is bit array of rules which were hit
- Multiply input bits by respective rule weights, and sum
- Squash result into 0-1 range (ham vs spam)
- Modify weights so result approaches desired value
- At end, weights become scores via linear transformation
[Figure: perceptron diagram. Input layer (ACCEPT_CREDIT_CARDS, YOU_WON, BAYES_99, HTML_80_90, IMPOTENCE, URIBL_WS_SURBL), weights (w1, w770, w36, w289, w399, w724), sigma node, sigmoid gain function]
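The training loop described above can be sketched as a single-layer perceptron with a sigmoid output. This is a minimal sketch under stated assumptions, not the actual score generator: the update is a plain gradient step on squared error, the learning rate is arbitrary, and the final linear transformation from weights to scores is left out.

```perl
use strict;
use warnings;

# Squash any real number into the 0-1 range.
sub sigmoid { my ($x) = @_; return 1 / (1 + exp(-$x)) }

# Rules that hit are input bits set to 1, so the weighted sum is simply
# the sum of the weights of the rules that hit.
sub predict {
    my ($weights, $hits) = @_;
    my $sum = 0;
    $sum += ($weights->{$_} || 0) for @$hits;
    return sigmoid($sum);
}

# One training step: nudge the weights of the rules that hit so the
# output moves towards the target (1 = spam, 0 = ham).
sub train_step {
    my ($weights, $hits, $target, $rate) = @_;
    my $out   = predict($weights, $hits);
    my $delta = $rate * ($target - $out) * $out * (1 - $out);
    $weights->{$_} = ($weights->{$_} || 0) + $delta for @$hits;
    return $out;    # output before the update
}
```

Repeating train_step over a corpus of labelled messages drives the weights of spam-indicating rules up and those of ham-indicating rules down.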
30
- Works towards SpamAssassin as engine goal
- Can easily disable unused/optional code
- Started out as method for third-party EvalTests, became general code enhancer
- Can modify general behavior and configuration
- Can show extra “internal” debug information
- Easy to test out new ideas without core changes
package Mail::SpamAssassin::Plugin::MSExec;

use Mail::SpamAssassin::Plugin;
use strict;
use bytes;

use vars qw(@ISA);
@ISA = qw(Mail::SpamAssassin::Plugin);

sub new {
  my ($class, $mailsaobject) = @_;
  $class = ref($class) || $class;
  my $self = $class->SUPER::new($mailsaobject);
  bless ($self, $class);
  $self->register_eval_rule("check_microsoft_executable");
  return $self;
}
- Example based on 3.1 MSExec Plugin
- Small amount of Perl involved to create the Plugin
  - Creates a subclass of the general Plugin module
  - Registers “eval” rule

- Look at all “application” or “text” parts
- Return true if part filename has a standard Microsoft executable extension
- Return true if part is base64 encoded and first line matches regexp
sub check_microsoft_executable {
  my ($self, $permsgstatus) = @_;

  foreach my $p ($permsgstatus->{msg}->find_parts(qr/^(application|text)\b/)) {
    my ($ctype, $boundary, $charset, $name) =
      Mail::SpamAssassin::Util::parse_content_type($p->get_header('content-type'));

    if (lc $ctype eq 'application/octet-stream') {
      return 1 if ($name =~ /\.(?:scr|bat|com|pif|exe)$/i);

      my $cte = $p->get_header('content-transfer-encoding');
      return 1 if ($cte =~ /base64/i &&
                   $p->raw()->[0] =~ /^TV[opqr].A..[AB].[AQgw][A-H].A/);
    }
  }

  return 0;
}
33
- URIDNSBL
  - Check URI domains against DNSBL
  - SURBL and SBL by default
- SPF (Sender Permitted From)
  - DNS records specify hosts allowed to send mail for domain
  - Is anti-forging method, not anti-spam
- Hashcash
34
- SpamAssassin finds definite URI, and URI-looking strings in message body text and HTML markup
- RegistrarBoundaries used to determine domain
  - Understands standard TLD (apache.org), 2 level ccTLD (theregister.co.uk), and 3-4 level domains (mostly RFC 1480: state.ma.us, pja.pvt.k12.or.us)
  - Handles issues with wildcard hostnames
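The domain trimming can be pictured as a longest-suffix lookup: find the registrar-controlled suffix, then keep one more label. The sketch below uses a tiny hypothetical %suffix table and a hypothetical registered_domain helper; the real RegistrarBoundaries module carries a far larger list and its own interface.

```perl
use strict;
use warnings;

# Tiny, hypothetical table of multi-label registrar suffixes.
my %suffix = map { $_ => 1 } qw(co.uk ma.us pvt.k12.or.us);

# Return the registered domain: the longest known suffix plus one label,
# falling back to "last label plus one" for ordinary TLDs.
sub registered_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;
    for my $i (1 .. $#labels) {
        my $candidate = join '.', @labels[$i .. $#labels];
        next unless $suffix{$candidate} || $i == $#labels;
        return join '.', @labels[$i - 1 .. $#labels];
    }
    return lc $host;    # single label, nothing to trim
}
```

Trying the longest suffix first is what keeps pja.pvt.k12.or.us intact while www.theregister.co.uk loses only its www label.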
35
Built-in handling of URI redirectors

http://ca.rd.yahoo.com/*http://www.pkabdfbudb.info/
internally becomes:
http://ca.rd.yahoo.com/*http://www.pkabdfbudb.info/
http://www.pkabdfbudb.info/

URIDNSBL is passed the two domains:
yahoo.com
pkabdfbudb.info
36
Built-in handling of URI encoding techniques
http://c%61%2Erd.yahoo.com/*%68%74%74%70://www.pkabdfbudb.info/
internally becomes:
http://c%61%2Erd.yahoo.com/*%68%74%74%70://www.pkabdfbudb.info/
http://ca.rd.yahoo.com/*http://www.pkabdfbudb.info/
http://www.pkabdfbudb.info/

URIDNSBL is still only passed the two domains:
yahoo.com
pkabdfbudb.info
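Both steps, decoding percent-escapes and pulling out a URL embedded behind a redirector, fit in a short sketch. The expand_uri helper is hypothetical and much simpler than SpamAssassin's own URI canonicalization; it only reproduces the expansion shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the original URI, its percent-decoded form
# (when different), and any URL embedded after the first one.
sub expand_uri {
    my ($uri) = @_;
    my @forms = ($uri);

    # Decode %XX escapes such as %61 ("a") and %2E (".").
    (my $decoded = $uri) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    push @forms, $decoded if $decoded ne $uri;

    # Redirector: a second http:// or https:// later in the same URI.
    if ($decoded =~ m{^https?://.+?(https?://.+)\z}i) {
        push @forms, $1;
    }
    return @forms;
}
```

Each returned form can then be reduced to its registered domain before the DNSBL lookup, which is how the example ends up with only yahoo.com and pkabdfbudb.info.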
37
- Local URIDNSBL whitelist
  - Skip checking common non-spam domains
  - google.com, apache.org, ups.com, etc.
- Decreases load on URIDNSBL servers, makes SpamAssassin scanning faster
- Originally planned for 3.1 series, but included in v3.0.1
38
39
40
- Focusing attention on speed and accuracy
- Move more EvalTests to Plugins
  - Razor, DCC, Pyzor, AutoWhiteList(?) etc.
- Faster (automatic) updates?
  - Scores, rules, plugins, etc.
  - Requires large, scalable, secure infrastructure
  - Larger sample of message results required
41
- More plugin hooks and integration
- Return of early exit? (aka: short circuit)
  - Stop processing message when result is assured
  - ... or if certain rules hit?
    - Habeas, BondedSender
    - USER_IN_{WHITE,BLACK}LIST
42
- More emulation of MUAs!
- Keep finding ways in which common MUAs don’t follow the relevant RFCs
  - No blank line between header and body
  - Content-Type parsing
  - etc!
43
http://www.kluge.net/~felicity/AC2004/
http://spamassassin.apache.org/
http://www.surbl.org/
http://www.spamhaus.org/sbl/
http://www.hashcash.org/
http://www.habeas.com/
http://www.bondedsender.com/