New and upcoming features in SpamAssassin v3
ApacheCon 2004, November 15, 2004
By: Theo Van Dinter

Project Changes
- Became an ASF Top Level Project
- New logo!
- License change from GPL/PAL to ASL2
  - Took 4+ months for 100+ retroactive CLAs
- Moved from SourceForge to ASF
  - Mailing lists
  - CVS to Subversion

Project Changes
- New version number scheme (x.y.z vs x.yz)
- Minimum perl version increased from 5.005 to 5.6.1
- Major API Changes
- Code cleanup
- Message Parsing
- Module merging and separation

Project Changes
- 2.60 vs 3.0.0
  - 1712 commits
  - 941k vs 1.0m (gzip release file)
  - 1 year exactly between releases
    - 9 months in 2.70 & 3.0.0 development
    - 2 months in pre-release mode (scores and testing)
    - 1 month in release candidate mode (beta testing)

Changes per Type
- Code is in multiple pieces:
  - Message filter
    - Read in message, parse, rewrite output
  - Rule engine
    - Run hundreds of rules over message contents, handle priorities, scoring (weight) per rule, etc.
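To make the rule engine's job concrete, here is a minimal sketch of running prioritized, weighted rules over a message and summing the scores of the ones that hit. The rule names, patterns, and scores are hypothetical and are not taken from the SpamAssassin rule set.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    name: str          # hypothetical rule name
    pattern: str       # regex applied to the raw message text
    score: float       # weight added to the total when the rule hits
    priority: int = 0  # lower values run first

def run_rules(message, rules):
    """Run every rule in priority order; return (total score, names of hits)."""
    hits = []
    total = 0.0
    for rule in sorted(rules, key=lambda r: r.priority):
        if re.search(rule.pattern, message, re.MULTILINE):
            hits.append(rule.name)
            total += rule.score
    return total, hits

RULES = [
    Rule("DEMO_FREE_OFFER", r"(?i)\bfree offer\b", 1.5),
    Rule("DEMO_ALL_CAPS_SUBJECT", r"^Subject: [A-Z !]+$", 2.0, priority=-1),
]

score, hits = run_rules("Subject: BUY NOW!\n\nA free offer inside.", RULES)
```

A message would then be tagged as spam when the total crosses a configured threshold.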

Filter Changes

Message Parser
- Core of the filter is the message parser
- v2 did OK, but complex MIME didn’t work at all
- Removed Mail::Audit support and NoMailAudit module, replaced with Message and Message::Node
- Ground-up rewrite of parser
  - goal to handle even complex MIME messages
  - better emulation of common MUA behavior

Message Parser
- Linear vs Recursive
- New internal tree structure
- Just-In-Time (JIT) behavior when possible
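The combination of a tree structure and just-in-time behavior can be illustrated with a small sketch: each node keeps its raw body and only decodes it the first time something asks for it. The `Node` class and its attributes below are hypothetical and do not mirror the actual Message::Node API.

```python
import base64
from functools import cached_property

class Node:
    """One MIME part; multipart nodes hold children, leaves hold a raw body."""

    def __init__(self, content_type, raw=b"", encoding="7bit", children=None):
        self.content_type = content_type
        self.raw = raw
        self.encoding = encoding
        self.children = children or []
        self.decode_count = 0  # only here to show when decoding happens

    @cached_property
    def decoded(self):
        # Runs on first access only; the result is cached on the instance.
        self.decode_count += 1
        if self.encoding == "base64":
            return base64.b64decode(self.raw)
        return self.raw

    def find_parts(self, prefix):
        """Walk the tree depth-first, yielding leaves whose type starts with prefix."""
        stack = [self]
        while stack:
            node = stack.pop()
            if node.children:
                stack.extend(reversed(node.children))
            elif node.content_type.startswith(prefix):
                yield node

tree = Node("multipart/mixed", children=[
    Node("text/plain", b"hello"),
    Node("image/png", base64.b64encode(b"\x89PNG"), "base64"),
])
text_parts = list(tree.find_parts("text/"))
```

Rules that only look at text parts never pay for decoding the image attachment, which is the kind of saving that just-in-time decoding is meant to provide.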

Message Parser
- MUA emulation
  - OE HTML heuristic
  - Content-Type boundary handling
  - Non-RFC compliance

Filter Changes
- Configuration parser separated from options
  - Makes handling of option values standardized
  - Parsing is much faster
    - Hash lookup, not linear if-then-else logic
- Configuration files can now include other files
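A table-driven parser of this kind can be sketched as follows: each option name maps to a handler in a dictionary, so a line costs one hash lookup instead of a chain of comparisons, and an include directive simply recurses. The in-memory `FILES` mapping, the handler table, and the nesting limit are assumptions made for illustration; this is not the actual SpamAssassin parser.

```python
def parse_config(name, files, handlers, settings=None, depth=0):
    """Parse the named config text, following include lines recursively."""
    settings = {} if settings is None else settings
    if depth > 10:
        raise RecursionError("include nesting too deep")
    for line in files[name].splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, _, value = line.partition(" ")
        value = value.strip()
        if key == "include":
            parse_config(value, files, handlers, settings, depth + 1)
        elif key in handlers:  # single hash lookup per option
            settings[key] = handlers[key](value)
        else:
            raise ValueError(f"unknown option: {key}")
    return settings

# Option name -> value converter; every option value is handled the same way.
HANDLERS = {"required_score": float, "report_safe": int}

# Hypothetical file contents, kept in memory so the sketch is self-contained.
FILES = {
    "local.cf": "required_score 5.0\ninclude extra.cf\n",
    "extra.cf": "report_safe 0  # keep the original message\n",
}

settings = parse_config("local.cf", FILES, HANDLERS)
```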

Filter Changes: ArchiveIterator
- Added support for UW mbx format
  - Now handles: file, mbox, mbx, dir
- spamassassin script now uses ArchiveIterator
  - MUCH faster for batch operations
  - Message parser JIT behavior makes remove markup mode super fast!

ArchiveIterator Batch Mode Example

    $ ls -la 1000spam
    -rw-r--r-- 1 felicity fame 4375748 Oct 10 18:15 1000spam

    $ time formail -s spamassassin-260 -L < 1000spam > 1000spam.spam
    900.550u 71.580s 17:06.06 94.7% 0+0k 0+0io 724863pf+0w
    $ time formail -s spamassassin-260 -d < 1000spam.spam > 1000spam.clean
    706.440u 55.200s 13:21.33 95.0% 0+0k 0+0io 722119pf+0w

diff reported 51 differences, all Subject header whitespace related

    $ time spamassassin-30 -L --mbox 1000spam > 1000spam.spam
    69.700u 0.600s 1:13.44 95.7% 0+0k 0+0io 730pf+0w
    $ time spamassassin-30 -d --mbox 1000spam.spam > 1000spam.clean
    3.360u 0.290s 0:03.66 99.7% 0+0k 0+0io 726pf+0w

diff reported 0 differences; scanning: 14x faster, removing markup: 209x!
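The quoted speedups appear to be ratios of CPU time (user plus system) rather than wall-clock time. The short calculation below reproduces them from the `time` figures above; the variable names are only for illustration.

```python
# CPU seconds (user + system) taken from the `time` output on the slide.
scan_260 = 900.550 + 71.580   # formail + spamassassin 2.60, scanning (-L)
scan_30 = 69.700 + 0.600      # spamassassin 3.0 --mbox, scanning (-L)
strip_260 = 706.440 + 55.200  # 2.60, removing markup (-d)
strip_30 = 3.360 + 0.290      # 3.0, removing markup (-d)

scan_speedup = scan_260 / scan_30
strip_speedup = strip_260 / strip_30
print(f"scanning: {scan_speedup:.1f}x, removing markup: {strip_speedup:.1f}x")
```

Rounded to whole numbers these give the 14x and 209x on the slide. Using the wall-clock times instead (17:06.06 vs 1:13.44 and 13:21.33 vs 0:03.66) gives roughly 14x and 219x.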

Changes to spamd
- spamd is daemonized spamassassin scanner
- Previous versions accepted message, then forked to process
- 3.0.0 pre-forks children who “randomly” accept connections and do processing
  - Causes lots of challenges, such as reverting user configuration, GC, resource usage, etc.

Changes to spamd
- Output log includes mass-check compatible output:

    Oct 10 09:34:57 eclectic spamd[14215]: result: Y 14 -
      ADDRESS_IN_SUBJECT,BAYES_99,DNS_FROM_AHBL_RHSBL,EXCUSE_1,
      EXCUSE_3,EXCUSE_7,HTML_90_100,HTML_IMAGE_RATIO_02,
      HTML_MESSAGE,MARKETING_PARTNERS,MIME_QP_LONG_LINE,
      MPART_ALT_DIFF,RAZOR2_CHECK,RCVD_IN_SBL,URIBL_OB_SURBL,
      URIBL_SBL,URIBL_WS_SURBL scantime=2.1,size=6896,
      mid=<61DF7FD0F6D44F26B764B5C2CE4C9ECFA6D1E8@anbok.com>,
      bayes=0.999669884503962,autolearn=no
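A result line like this is easy to post-process. The sketch below pulls out the verdict, score, rule list, and trailing key=value fields with a regular expression. The pattern is inferred from the single sample above, and the test line is an abbreviated copy of it (fewer rules, no mid= field), so treat the exact format as an assumption.

```python
import re

# Layout inferred from the sample: "result: <Y or other> <score> - <rules> <fields>"
RESULT_RE = re.compile(
    r"result: (?P<verdict>\S) (?P<score>-?\d+) - (?P<rules>\S*) (?P<fields>\S+)$"
)

def parse_result(line):
    """Return a dict describing one spamd result line, or None if it does not match."""
    m = RESULT_RE.search(line)
    if m is None:
        return None
    fields = dict(item.split("=", 1) for item in m["fields"].split(","))
    return {
        "spam": m["verdict"] == "Y",
        "score": int(m["score"]),
        "rules": m["rules"].split(",") if m["rules"] else [],
        "fields": fields,
    }

# Abbreviated version of the log line shown on the slide.
LINE = (
    "Oct 10 09:34:57 eclectic spamd[14215]: result: Y 14 - "
    "BAYES_99,RAZOR2_CHECK,URIBL_WS_SURBL "
    "scantime=2.1,size=6896,bayes=0.999669884503962,autolearn=no"
)
parsed = parse_result(LINE)
```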

Engine Changes

Rule Changes
- 2.60 had 872 rules, 3.0.0 has 628
  - 407 kept, 465 removed, 221 added, includes renamed rules
- 2.60 had 160 sub-rules, 3.0.0 has 227
  - 130 kept, 30 removed, 97 added
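The counts are internally consistent: kept plus removed gives the 2.60 total, and kept plus added gives the 3.0.0 total. A quick check, with a hypothetical helper name:

```python
def counts_consistent(old_total, new_total, kept, removed, added):
    """True when kept+removed matches the old total and kept+added the new one."""
    return kept + removed == old_total and kept + added == new_total

rules_ok = counts_consistent(872, 628, kept=407, removed=465, added=221)
sub_rules_ok = counts_consistent(160, 227, kept=130, removed=30, added=97)
```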

New Rules
- RCVD_IN_XBL
  - Spamhaus exploit list
- DRUGS_*
  - Common drug references
- LONGWORDS
  - Lots of 5+ letter words in a row

New Rules
- HTML_BACKHAIR_*
  - Catches HTML obfuscation techniques:

    up to 80% by purchasing online
    for ac</amount>cess to mi<grab>llions of pr<clergy>ivate, sen<hem>sitive <cab>online re</maxwellian>cords,
    Free<kkmx7fb1lxwk0p1> O<ku0j5aa3xhln6z1<k9lntxebsm7452>>nl<k8sk2493yb31md1>ine Consult
    Order pr<kn10yxtomj0e82>escrip<k0602x82qzft>tion onl<kv5mh0x2lq1npz>ine and <k3eh16dp3swwg1e>Cheap
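One way to spot this kind of obfuscation is to look for tags that sit in the middle of a word and whose names are not real HTML elements. The sketch below does that with a regular expression. The pattern and the short `KNOWN_TAGS` set are assumptions for illustration, not the definition of the HTML_BACKHAIR_* rules.

```python
import re

# Deliberately tiny list of real element names; a real check would need more.
KNOWN_TAGS = {"a", "b", "br", "div", "font", "i", "p", "span"}

# A tag with a letter immediately before it and immediately after it.
IN_WORD_TAG = re.compile(
    r"(?<=[A-Za-z])</?([A-Za-z][A-Za-z0-9]*)[^<>]*>(?=[A-Za-z])"
)

def bogus_in_word_tags(html):
    """Return names of unknown tags that split a word in two."""
    return [n for n in IN_WORD_TAG.findall(html) if n.lower() not in KNOWN_TAGS]

SAMPLE = (
    "for ac</amount>cess to mi<grab>llions of pr<clergy>ivate, "
    "sen<hem>sitive <cab>online re</maxwellian>cords"
)
found = bogus_in_word_tags(SAMPLE)
```

In the sample, `<cab>` is not reported because it follows a space rather than splitting a word.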

New Rules
- MPART_ALT_DIFF
  - Looks for multipart/alternative messages with significantly different word lists in text/plain and text/html parts

    ------=_NextPart_000_00AM_08K3791OO_07L.777L91H0
    Content-Type: text/plain

    Get a capable html e-mailer
    ------=_NextPart_000_00AM_08K3791OO_07L.777L91H0
    Content-Type: text/html

    Buy my pills and mortgage!
    ------=_NextPart_000_00AM_08K3791OO_07L.777L91H0--
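The comparison can be sketched by extracting the word set of each alternative part and measuring how much they overlap. The overlap measure (shared words divided by all words) and the crude tag stripping are assumptions for illustration; the slide does not say how the real rule measures the difference.

```python
import email
import re

def words(text):
    """Lowercased word set, with anything that looks like an HTML tag removed."""
    text = re.sub(r"<[^>]*>", " ", text)
    return set(re.findall(r"[a-z]+", text.lower()))

def alt_overlap(raw_message):
    """Fraction of words shared by the text/plain and text/html parts (0.0 to 1.0)."""
    msg = email.message_from_string(raw_message)
    parts = {}
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype in ("text/plain", "text/html"):
            parts[ctype] = words(part.get_payload())
    plain = parts.get("text/plain", set())
    html = parts.get("text/html", set())
    union = plain | html
    return len(plain & html) / len(union) if union else 1.0

BOUNDARY = "----=_NextPart_000_00AM_08K3791OO_07L.777L91H0"
RAW = (
    f'Content-Type: multipart/alternative; boundary="{BOUNDARY}"\n'
    "\n"
    f"--{BOUNDARY}\n"
    "Content-Type: text/plain\n"
    "\n"
    "Get a capable html e-mailer\n"
    f"--{BOUNDARY}\n"
    "Content-Type: text/html\n"
    "\n"
    "Buy my pills and mortgage!\n"
    f"--{BOUNDARY}--\n"
)
overlap = alt_overlap(RAW)
```

For the slide's example the two parts share no words at all, so the overlap is zero.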

New Rules: Spammers make it easy?
- MSGID_SPAM_CAPS
  - Message-ID header is in /^[A-Z]+@/ format
  - Catches 11.3% of spam, no FPs

    Message-ID: <SXEXBAZDNVTGYMYBTRKUWOSQ@finklfan.com>
    Message-ID: <EKGSGWAIBTGZTSHZZBBZ@yahoo.com>
    Message-ID: <CMIVFJJHOPNXVBXUUP@hoardermail.com>
    Message-ID: <HAWZFYXQLDVBHGKSSMVDS@t-online.de>
    Message-ID: <YYIMPKBREVIVSFCLRKKBFI@webtv.com>
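The pattern on the slide can be tried directly against the sample headers. This sketch assumes the leading angle bracket is stripped before matching, which the slide does not state; the non-matching Message-ID is a made-up counterexample.

```python
import re

CAPS_RE = re.compile(r"^[A-Z]+@")  # pattern as given on the slide

def msgid_spam_caps(message_id):
    """True when the local part of the Message-ID is only capital letters."""
    return CAPS_RE.match(message_id.strip().lstrip("<")) is not None

SAMPLES = [
    "<SXEXBAZDNVTGYMYBTRKUWOSQ@finklfan.com>",
    "<EKGSGWAIBTGZTSHZZBBZ@yahoo.com>",
    "<CMIVFJJHOPNXVBXUUP@hoardermail.com>",
]
results = [msgid_spam_caps(s) for s in SAMPLES]
```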

New Rules: Spammers make it easy?
- RCVD_DOUBLE_IP_SPAM
  - Received header is fake with two IPs listed
  - Catches 12.5% of spam, no FPs

    Received: from [119.227.62.1] by 64.142.3.173 with ESMTP id <110617-93232>;
      Fri, 27 Aug 2004 23:59:33 +0300
    Received: from 110.56.100.200 by 211.190.241.62;
      Sun, 10 Oct 2004 10:03:35 +0600
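Both sample headers have the shape "from IP by IP", with nothing but bare addresses on either side. A hedged sketch of a check for that shape follows. The regular expression is inferred from the two samples and is not the rule's actual pattern; the non-matching header is a made-up counterexample.

```python
import re

IP = r"\d{1,3}(?:\.\d{1,3}){3}"
# "from" a bare (optionally bracketed) IP, directly followed by "by" a bare IP.
DOUBLE_IP_RE = re.compile(rf"^from \[?({IP})\]? by ({IP})\b")

def rcvd_double_ip(received):
    """True when the Received value names only raw IPs for both hops."""
    return DOUBLE_IP_RE.match(received) is not None

fake_1 = rcvd_double_ip("from [119.227.62.1] by 64.142.3.173 with ESMTP id <110617-93232>")
fake_2 = rcvd_double_ip("from 110.56.100.200 by 211.190.241.62; Sun, 10 Oct 2004 10:03:35 +0600")
```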

New Rules: Spammers make it easy?
- X_MESSAGE_INFO
  - X-Message-Info header exists...
  - Catches 18.0% of spam, no FPs

    X-Message-Info: 7wCUko664gJL/isOpbpHZpUXeysrI7Ea
    X-Message-Info: TBEqiuUDX224aiZQ59TCWxBY0AToUL99HSW7V9gnf576J
    X-Message-Info: 5%RNDLCCHAR37%RNDDIGIT15iI/zPMjruQBFrbQUxdR2AManr
    X-Message-Info: %RNDUCCHAR15c%RNDUCCHAR1548fspGLBoaq%RNDUCCHAR16opvCRRkfnGFQoxl3

New Rules: Spammers make it easy?
- RCVD_HELO_IP_MISMATCH
  - Received header indicates sender used IP for HELO, but it doesn’t match the sender’s IP
  - 25.7% of spam, 0.03% FPs, all misconfigured MTAs

    Received: from 65.214.43.12 (unknown [211.222.252.28])
      by bblisa.bblisa.org (Postfix) with SMTP id DD6DE1768DB
      for <felicity@kluge.net>; Sat, 11 Sep 2004 00:38:56 -0400 (EDT)
    Received: from 64.142.3.173 (unknown [219.248.62.167])
      by bugzilla.spamassassin.org (Postfix) with SMTP id CFD6C83899
      for <felicity@kluge.net>; Wed, 13 Oct 2004 22:06:28 -0700 (PDT)
    Received: from 66.92.69.221 (unknown [211.217.181.250])
      by eclectic.kluge.net (Postfix) with SMTP id BECCD444550
      for <felicity@kluge.net>; Thu, 14 Oct 2004 00:53:37 -0400 (EDT)
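The check amounts to comparing the IP the client claimed in HELO with the IP the receiving MTA recorded in brackets. A minimal sketch follows, assuming the Postfix-style layout seen in the samples; the function name and pattern are illustrative, not the rule's real definition.

```python
import re

IP = r"\d{1,3}(?:\.\d{1,3}){3}"
# "from <HELO as IP> (<rdns> [<connecting IP>])" as written by Postfix above.
HELO_RE = re.compile(rf"^from ({IP}) \(\S+ \[({IP})\]\)")

def helo_ip_mismatch(received):
    """True when the HELO string is an IP that differs from the connecting IP."""
    m = HELO_RE.match(received)
    return bool(m) and m.group(1) != m.group(2)

mismatch = helo_ip_mismatch(
    "from 65.214.43.12 (unknown [211.222.252.28]) by bblisa.bblisa.org (Postfix)"
)
```

A HELO that is a hostname, or an IP that equals the connecting address, does not trigger the check in this sketch.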

New Rules: Spammers make it easy?
- MIME_BOUND_DD_DIGITS
  - MIME boundary is simply /^--[0-9]+/
  - Catches 36.5% of spam, no FPs

    Content-Type: text/html; boundary="--5050984427071928258"
    Content-Type: text/plain; boundary="--5895368826571874203"
    Content-Type: multipart/alternative; boundary="--2396152152574698241"
    Content-Type: multipart/mixed; boundary="--44188425536568249"
    Content-Type: multipart/related; boundary="--610294112918606"
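A small sketch of this check: extract the boundary parameter and apply the pattern from the slide. The helper names are hypothetical, and the quoted-parameter extraction is a simplification that only handles the form shown in the samples.

```python
import re

def boundary_of(content_type):
    """Return the quoted boundary parameter, or None when there is none."""
    m = re.search(r'boundary="([^"]*)"', content_type)
    return m.group(1) if m else None

def mime_bound_dd_digits(content_type):
    """True when the boundary starts with two dashes followed by digits."""
    boundary = boundary_of(content_type)
    return boundary is not None and re.match(r"--[0-9]+", boundary) is not None

spammy = mime_bound_dd_digits('multipart/alternative; boundary="--2396152152574698241"')
normal = mime_bound_dd_digits(
    'multipart/alternative; boundary="----=_NextPart_000_00AM_08K3791OO_07L.777L91H0"'
)
```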

Rules
- AutoWhiteList (AWL) now on by default
  - Partially due to change from commandline option to configuration parameter
  - Mainly because the idea and code are mature and work fairly well
- AWL tracks From address, sending IP network, and average message scores over time, moves future mail scores towards the average

    felicity@kluge.net|ip=66.92 => # of messages received
    felicity@kluge.net|ip=66.92|totscore => total score of messages received
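The two stored values are enough to compute a running average per sender and network. The sketch below shows one way the adjustment could work: the new score is pulled part of the way toward the historical mean. The slide does not give the formula, so the `factor` of 0.5, the choice to record the unadjusted score, and the example IP address are all assumptions.

```python
def awl_key(address, ip):
    """Key in the style shown above: address plus the first two octets of the IP."""
    return f"{address}|ip={'.'.join(ip.split('.')[:2])}"

def awl_adjust(db, key, score, factor=0.5):
    """Move score toward the stored mean for key, then record the raw score."""
    count = db.get(key, 0)
    total = db.get(f"{key}|totscore", 0.0)
    adjusted = score
    if count > 0:
        mean = total / count
        adjusted = score + (mean - score) * factor
    db[key] = count + 1
    db[f"{key}|totscore"] = total + score
    return adjusted

key = awl_key("felicity@kluge.net", "66.92.0.1")  # hypothetical sending IP
db = {key: 4, f"{key}|totscore": 8.0}             # 4 messages, mean score 2.0
adjusted = awl_adjust(db, key, 10.0)
```

With a history averaging 2.0, a new message scoring 10.0 is adjusted to 6.0 under these assumptions, and the stored totals become 5 messages with a total score of 18.0.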

Bayes Changes
- Storage backend now has “plugin” capability
  - Berkeley DB (BDB) is default, added SQL in v3
- Added capability to backup & restore
  - Good for backup and recovery, modifying stored values, converting between storage backends, etc.
- Added flock locking option for all SA DB access
- Tokens are now stored as hash values, not raw
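Storing a hash instead of the raw token gives every key the same small size and keeps message text out of the database. The slide does not say which hash or length is used; this sketch assumes SHA-1 truncated to 5 bytes purely for illustration.

```python
import hashlib

def token_key(token, length=5):
    """Fixed-size database key for a token (assumed: trailing bytes of SHA-1)."""
    return hashlib.sha1(token.encode("utf-8")).digest()[-length:]

key_a = token_key("mortgage")
key_b = token_key("a-very-long-token-taken-from-some-message-header-or-body")
```

Both keys have the same length no matter how long the token is. The original token cannot be read back from the database, which is worth keeping in mind when inspecting stored values.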

Bayes in SQL
- Supports MySQL and PostgreSQL natively
- Lots of benefits, generally faster overall
  - Scanning, 3-30% faster, depending on # of tokens
  - Learning, 2-3x slower, requires multiple SQL commands per update
  - Expiry, 6-7x faster, BDB does lots of I/O, etc.
- For more information, see Michael Parker’s presentation following this one!

Out with the GA!
- Replaced Genetic Algorithm (GA) for score generation with Perceptron Learner
- No one wanted to deal with the GA code
  - Did anyone understand the code? Not really.
  - Most time spent kluging around glue scripts
- Perceptron is much, much, faster
  - GA took 6-24 hours/scoreset for 2.5 and 2.6
  - Perceptron took 8 minutes/scoreset for 3.0.0

General Perceptron
[Diagram: input layer of rules (ACCEPT_CREDIT_CARDS, BAYES_99, HTML_80_90, IMPOTENCE, URIBL_WS_SURBL, YOU_WON) with weights w1, w36, w289, w399, w724, w770, feeding a sigma (Σ) node and then a sigmoid gain function]
- Per message, input is bit array of rules which were hit
- Multiply input bits by respective rule weights, and sum
- Squash result into 0-1 range (ham vs spam)
- Modify weights so result approaches desired value
- At end, weights become scores via linear transformation
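The training loop described above can be sketched in a few lines: sum the weights of the rules that hit, squash with a sigmoid, and nudge the weights toward the desired output. The bias term, learning rate, epoch count, toy data, and the final scaling to a threshold of 5.0 are all assumptions for illustration; this is not the actual SpamAssassin perceptron.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(messages, n_rules, epochs=200, rate=0.5):
    """messages: list of (indices of rules hit, label) with label 1=spam, 0=ham."""
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        for hits, label in messages:
            # Input is a bit array, so the weighted sum is just the sum over hits.
            output = sigmoid(bias + sum(weights[i] for i in hits))
            step = rate * (label - output)  # move output toward the desired value
            for i in hits:
                weights[i] += step
            bias += step
    return weights, bias

def to_scores(weights, bias, threshold=5.0):
    """Linear rescaling so the decision boundary lands at the given threshold."""
    if bias >= 0:
        raise ValueError("expected a negative bias after training")
    scale = threshold / -bias
    return [w * scale for w in weights]

# Toy data: rule 0 and rule 1 indicate spam, rule 2 also fires on ham.
DATA = [([0, 1], 1), ([0], 1), ([1, 2], 1), ([2], 0), ([], 0)]
weights, bias = train(DATA, n_rules=3)
scores = to_scores(weights, bias)
```

After the rescaling, a message is classified as spam exactly when the scores of its hit rules add up to more than the threshold, which is how the learned weights can double as rule scores.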
