[pmmail-list] Spam Filtering was Digest (03/24/2003 09:01) (#2003-477)

John Angelico pmmail-list@blueprintsoftwareworks.com
Tue, 25 Mar 2003 10:08:13 +1100 (EDT)


On Mon, 24 Mar 2003 09:02:21 -0500, brandonk@blueprintsoftwareworks.com
wrote:

>
>Date: Mon, 24 Mar 2003 08:35:33 +0000 (GMT)
>From: "Dave Saville" <dave.saville@ntlworld.com>
>Subject: Re: [pmmail-list] Spam filtering - popfile et al
>
>On Sun, 23 Mar 2003 20:02:02 -0500, Andrew Pitonyak wrote:
>
>>In the last two days I received 124 SPAM messages. Four of the messages
>>were not marked as SPAM. Of these four, two were very simple and plain.
>>The other two messages had a "s p a c e" after each letter so there were
>>no words against which to filter.
>
>I have thought of a way around that. Because PMMail, unlike PMINews,
>cannot ignore RE: (and similar) in a subject line when sorting, I run
>a Rexx script on incoming mail that strips those off the subject
>line. In passing it could easily add an X-stripped-subject or similar
>header holding a processed copy of the subject line: say, change all
>special chars to a space and then, as English at least only has "I"
>and "a" as valid single-letter words, apply a bit of intelligent
>compression. The result would be a header line that had recognisable
>words for the later filters to latch onto.
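
A minimal sketch of the idea Dave describes, written in Python rather than
Rexx so it is easy to try anywhere. Everything here is an assumption made for
illustration: the function names, the list of reply prefixes, and the rule
that a run of three or more single letters counts as one spaced-out word. The
header name X-stripped-subject is the one suggested above.

```python
import re

# Reply/forward prefixes to strip. This list is an assumption.
PREFIX = re.compile(r'^\s*(?:(?:re|fwd?)\s*:\s*)+', re.IGNORECASE)

# Runs of at least this many single letters are joined into one word.
# Shorter runs are left alone so "I" and "a" survive as real words.
MIN_RUN = 3

def strip_prefixes(subject):
    """Remove leading 'Re:', 'Fwd:' and similar from a subject line."""
    return PREFIX.sub('', subject).strip()

def compress(subject):
    """Turn special chars into spaces, then rejoin spaced-out letters."""
    tokens = re.sub(r'[^A-Za-z0-9]+', ' ', subject).split()
    out, run = [], []

    def flush():
        # Join a long run of single letters; keep a short run as-is.
        if len(run) >= MIN_RUN:
            out.append(''.join(run))
        else:
            out.extend(run)
        run.clear()

    for token in tokens:
        if len(token) == 1 and token.isalpha():
            run.append(token)
        else:
            flush()
            out.append(token)
    flush()
    return ' '.join(out)

def stripped_subject_header(subject):
    """Build the extra header line for later filters to match on."""
    return 'X-stripped-subject: ' + compress(strip_prefixes(subject))

print(stripped_subject_header('RE: Fwd: Get r i c h now!!!'))
```

One known weakness of this sketch: when every word in the subject is spaced
out, the word boundaries are lost and the letters are joined into one long
token, so a keyword filter would need to match substrings.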

Weighing in on this stressful subject from Down Under.

I have just whipped up in OS/2 a set of Rexx scripts to do mini-content
filtering on scrambled Subject lines.

Here is a copy of my post to the JunkSpy newsgroup:

>Hi everyone.
>
>As a contribution to my own question about how to handle the growing trend
>to randomised garbage in subject lines, I offer the attached ltrcnt.zip
[not attached here of course - I shouldn't have done it there either]
>file of freeware Rexx scripts to 
>a) count the frequency of letter occurrence in Subject lines and
>b) calculate total occurrences and percentages
>
>You may think this is raw material for a form of Bayesian filtering but no.
>Whilst I have taken the basic idea from the Bayesian discussions, this is
>simply to help identify "junky" subject lines which JunkSpy would otherwise
>miss because there are no significant keywords.
>
>What I did:
>1. Using my (PMMail) archive base of messages, including both incoming and
>outgoing, I calculated proportions/frequencies of letter appearance. It was
>simpler to do the lot than to try being selective.
>My results don't match the theoretical English language average (E T A R S
>... I think) because my messages are skewed towards O, S and 2, plus a lot
>of square brackets and Re: for lists. [See Zip for the result files too]
>
>2. Calculating the percentages, I still found that W (1.85), X (0.34), Y
>(1.04) and Z (0.15) were infrequent and could be used as the litmus test
>of a junky subject line. 
>Even simply summing these percentages yields 3.38% - a very low probability
>in an area which is supposed to be human readable. 
>
>The scripts can be used unchanged on any alphabetic language and the
>concept should still hold - use the least frequent letters of your language
>to ascertain if the Subject line has been scrambled by the junk merchants.
>
>3. Constructed a Rexx script to:
>a) count occurrences of those letters in the Subject line
>b) calculate the percentage
>c) insert text (as JunkSpy does): X-Junk-Percent: and the value
>
>Then a complex filter can:
>a) search for the X-Junk-Percent: value
>b) flag messages which exceed the threshold you decide, e.g. 5%
>c) move them to your JunkSpy folder for a final eyeballing before deletion
>
>I already have an incoming Rexx script to clean up my Subject line (removes
>the "RE, FWD" stuff), and have added this new function to the same script.
>
>Try it, improve it, comment on it, but for heaven's sake let's not take
>these stupid antics lying down!
>
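
For anyone who wants to experiment without the ZIP, here is a rough sketch of
the three steps in Python. It is not the Rexx code from ltrcnt.zip. The
function names are made up for illustration, and one detail is an assumption:
the percentage is taken as rare letters (W, X, Y, Z) over all alphabetic
characters in the Subject line. The header name X-Junk-Percent and the 5%
threshold come from the post above.

```python
from collections import Counter

RARE_LETTERS = 'WXYZ'   # the least frequent letters found in step 2
THRESHOLD = 5.0         # percent; the example threshold from the post

def letter_percentages(subjects):
    """Step 1: percentage of each letter across a set of subject lines."""
    counts = Counter(c for s in subjects for c in s.upper() if c.isalpha())
    total = sum(counts.values())
    if not total:
        return {}
    return {ch: 100.0 * n / total for ch, n in counts.items()}

def junk_percent(subject, rare=RARE_LETTERS):
    """Step 3: share of rare letters among the letters of one subject."""
    letters = [c for c in subject.upper() if c.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for c in letters if c in rare)
    return 100.0 * hits / len(letters)

def junk_header(subject):
    """Build the header line that a later filter can search for."""
    return 'X-Junk-Percent: %.2f' % junk_percent(subject)

def is_junky(subject, threshold=THRESHOLD):
    """True when the subject exceeds the chosen threshold."""
    return junk_percent(subject) > threshold

print(junk_header('Meeting notes'))
print(junk_header('xzqw yyzx'))
```

Under this reading of the percentage, a short legitimate subject with a couple
of W or Y letters can also exceed 5%, so the threshold probably needs tuning
against your own mail before anything is moved automatically.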

I can send the ZIP to anyone who asks me for it - off list presumably<g>




Best regards
John Angelico
OS/2 SIG
talldad@melbpc.org.au or talldad@kepl.com.au
--------------------------------------

PMTagline v1.50 - Copyright, 1996-1997, Stephen Berg and John Angelico
... Don't read everything you believe.
- pmmail-list - The PMMail Discussion List ---------------------------
To POST to the list, send your message to:
pmmail-list@blueprintsoftwareworks.com

To UNSUBSCRIBE, send a message to mdaemon@bmtmicro.com 
with the first line of the message body being...
UNSUBSCRIBE pmmail-list@blueprintsoftwareworks.com
---------------------------------------------------------------------