1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>AGMSBayesianSpam Documentation</TITLE>
5<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
6<META NAME="author" CONTENT="Alexander G. M. Smith">
7<META NAME="description" CONTENT="Documentation for AGMSBayesianSpam, for classifying incoming e-mail messages as spam (junk mail) or genuine.">
8<!--
9; $Log: index.html,v $
10; Revision 1.11  2003/02/08 21:54:14  agmsmith
11; Updated the AGMSBayesianSpamServer documentation to match the current
12; version.  Also removed the Beep options from the spam filter, now they
13; are turned on or off in the system sound preferences.
14;
15; Revision 1.10  2002/12/16 17:32:43  agmsmith
16; Added Alex's settings paragraph.  Added screen shot of dangerous
17; header filter that deletes things on the server.
18;
19; Revision 1.9  2002/12/13 22:45:22  agmsmith
20; More changes for self training and chi-squared scoring.
21;
22; Revision 1.8  2002/12/13 22:20:50  agmsmith
23; Under construction.
24;
25; Revision 1.7  2002/11/29 23:42:26  agmsmith
26; Describe the word display and what you can do with it.
27;
28; Revision 1.6  2002/11/29 22:20:02  agmsmith
29; Updated version numbers in the text
30;
31; Revision 1.5  2002/11/28 21:19:41  agmsmith
32; Updated to explain how to check for spam without downloading the
33; whole message.
34;
35; Revision 1.4  2002/11/10 20:56:39  agmsmith
36; Updated documentation to include MDR installer effects, and added a
37; section on the tokenizing experiments.
38;
39; Revision 1.3  2002/11/06 00:54:47  agmsmith
40; Spam definition corrected, with prodding from Ian G.
41;
42; Revision 1.2  2002/11/05 22:47:30  agmsmith
43; Replace UTF-8 copyright symbol with useable token.
44;
45; Revision 1.1  2002/11/05 22:43:24  agmsmith
46; Starting point for the HTML documentation for AGMSBayesianSpam stuff.
47;
48; Revision 1.5  2002/10/21 21:07:19  agmsmith
49; Added references to the original spam detection papers.
50;
51; Revision 1.4  2002/10/21 20:56:01  agmsmith
52; Added hyperlinks for local files, and an explanation of "spam".
53;
54; Revision 1.3  2002/10/21 20:19:29  agmsmith
55; Finished updating instructions for version 1.60, and adding
56; lots of screen shots.
57;
58; Revision 1.2  2002/10/21 02:03:35  agmsmith
59; Added log in HTML comments area.
60;
61; Revision 1.1  2002/10/21 02:00:54  agmsmith
62; Initial revision
63-->
64</HEAD>
65<BODY BGCOLOR="WHITE" TEXT="BLACK">
66
67<P><FONT COLOR="MAGENTA">Short: Junk E-Mail Classifier.
68<BR>Author: agmsmith@rogers.com (Alexander G. M. Smith)
69<BR>Uploader: agmsmith@rogers.com (Alexander G. M. Smith)
70<BR>Website: <A HREF="http://members.rogers.com/agmsmith/">http://members.rogers.com/agmsmith/</A>
71<BR>Version: 1.77
72<BR>Type: internet &amp; network/e-mail
73<BR>Requires: BeOS 5.0+
74<BR>Related things: <A HREF="http://www.paulgraham.com/spam.html">http://www.paulgraham.com/spam.html</A>, <A HREF="http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html">http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html</A></FONT>
75
76<H1><A NAME="Contents"></A>Table of Contents</H1>
77
78<UL>
79  <LI><A HREF="#Contents">Table of Contents</A>
80  <LI><A HREF="#Introduction">Introduction to AGMSBayesianSpam</A>
81  <LI><A HREF="#Installation">Installation</A>
82  <LI><A HREF="#Usage">Usage</A>
83  <UL>
84    <LI><A HREF="#Reading">Reading E-Mail</A>
85    <LI><A HREF="#Training">Training</A>
86    <LI><A HREF="#HidingServer">Hiding the Server Window</A>
87    <LI><A HREF="#AlexSettings">Alex's Settings</A>
88  </UL>
89  <LI><A HREF="#AdvancedUsage">Advanced Usage</A>
90  <UL>
91    <LI><A HREF="#CommandLine">Command Line Mode and Scripting</A>
92    <LI><A HREF="#Spreadsheet">Using a Spreadsheet to Examine Word Statistics</A>
93    <LI><A HREF="#WordDisplay">Understanding and Using the Word Display</A>
94    <LI><A HREF="#Tokenizing">Tokenizing Modes Compared</A>
95    <LI><A HREF="#HeadersOnly">High Speed and High Danger - Headers Only Trick</A>
96  </UL>
97  <LI><A HREF="#ChangeLog">Change Log</A>
98</UL>
99
100<H1><A NAME="Introduction"></A>Introduction to AGMSBayesianSpam</H1>
101
102<P>AGMSBayesianSpam is a set of BeOS programs for classifying e-mail messages
103and other text as either spam or genuine.  "Spam" is the colloquial name for
unwanted junk messages, usually advertising.  The name comes from a 1970s <A
105HREF="http://www.google.com/search?&q=monty+python+spam">Monty Python comedy
106skit</A> involving lots of unwanted Spam, which is the name for the spicy ham
107in a can made by the <A HREF="http://www.hormel.com/">Hormel Foods</A> company,
108originally from Austin, Minnesota, USA.  The program classifies messages as
109spam or genuine (sometimes called "ham"), based on the words they contain and
110previous messages which have been identified by the user as spam or genuine.
It's implemented as a server program (AGMSBayesianSpamServer), which keeps track
of the word list, and a Mail Daemon Replacement add-on (AGMSBayesianSpamFilter),
which uses the server to classify incoming messages.  Theoretically other
programs, like a news reader, could also use the word database through the
scripting interface.  There's also a command line interface and a graphical
user interface.
117
118<P>If you want to know more about the technique of counting words, have a look
119at Paul Graham's wonderful write-up at <A
120HREF="http://www.paulgraham.com/spam.html">http://www.paulgraham.com/spam.html</A>.
121This program is currently using an improved version of Graham's method, called
122Gary-combining, by Gary Robinson.  See <A
123HREF="http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html"
124>http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html</A> for
Gary's story.  There's also a further improved method called Chi-Squared
126(&chi;&sup2;) combining, which grew from discussions on the <A
127HREF="http://mail.python.org/mailman-21/listinfo/spambayes">Spambayes</A>
128mailing list.
129
130<H1><A NAME="Installation"></A>Installation</H1>
131
132<OL TYPE="1">
133  <LI>Install the BeOS Mail Daemon Replacement (MDR) version 2.0.0 beta 7 or
134    later.  Beta 3 and later include AGMSBayesianSpam so you don't need to
135    worry about incompatible versions, and the MDR will even do some of the
136    installation for you.  You can get MDR from <A
137    HREF="http://www.bebits.com/app/2289">http://www.bebits.com/app/2289</A> or
138    get the latest source code and compile it yourself from <A
139    HREF="http://sourceforge.net/projects/bemaildaemon"
140    >http://sourceforge.net/projects/bemaildaemon</A>.
141  <LI>Move the AGMSBayesianSpamServer program to the
142    <A HREF="file:/boot/home/config/bin/">/boot/home/config/bin/</A> directory
143    (the MDR installer will do this for you).  It's
144    put there to make it useable from the command line.  If you use it
145    frequently, you can also add a symbolic link to it in your desktop
146    applications menu or to the mail menu.
147  <LI>Move the AGMSBayesianSpamFilter mail add-on to the
148    <A HREF="file:/boot/home/config/add-ons/mail_daemon/inbound_filters/"
149    >/boot/home/config/add-ons/mail_daemon/inbound_filters/</A> directory
150    (the MDR installer will do this for you).<BR>
151    <IMG SRC="pictures/HomeConfigAddonsMaildaemonInboundfilters.png"
152    ALT="[/boot/home/config/add-ons/mail_daemon/inbound_filters/ directory]"
153    WIDTH="634" HEIGHT="288">
154  <LI><IMG SRC="pictures/CantFindSettings.png" ALT="[Can't Find Settings]"
155    WIDTH="326" HEIGHT="120" ALIGN="RIGHT">Set up MIME types and indices (the
156    MDR installer will do this step and the next one for you, invisibly).  Run
157    the AGMSBayesianSpamServer program.  It will put up an alert box
158    complaining about not finding the settings file.  Just hit the Acknowledge
159    button to get past it.  Then click the "Install MIME Types &amp; Make
160    Indices on All Drives" button which does what it says plus it also adds a
161    few sound effect names to the system.<BR CLEAR="ALL">
162    <IMG SRC="pictures/TheInstallButton.png" ALT="[The Install Button]"
163    WIDTH="604" HEIGHT="403">
164  <LI>Quit the program (the close box at the top left corner is one way of
165    doing that).  It will make the settings file and settings directory
166    <A HREF="file:/boot/home/config/settings/AGMSBayesianSpam/"
167    >/boot/home/config/settings/AGMSBayesianSpam/</A> when it exits.
  <LI>Use the Sounds preferences (or the installsound command) to associate
    the names with your sound files.  SoundGenuine, SoundUncertain and SoundSpam
    are included as examples with MDR in the <A
    HREF="file:/boot/home/config/settings/AGMSBayesianSpam/"
    >/boot/home/config/settings/AGMSBayesianSpam/</A> directory (and no, I don't
    have the rights to the Monty Python Spam skit).  If you don't want it to
    make sounds, don't do anything (you can also use the Sounds preferences
    later on to disable or remove the sounds if you get tired of them).<BR>
176    <IMG SRC="pictures/StartingSoundPreferences.png" WIDTH="291" HEIGHT="445"
177    ALT="[Starting Sound Preferences]"> <IMG
178    SRC="pictures/SoundPrefChoosingAFile.png" WIDTH="318" HEIGHT="387"
179    ALT="[Sound Preferences Choosing a File]">
180    <P>When you're done, it should look something like this:<BR>
181    <IMG SRC="pictures/SoundPrefFinished.png" WIDTH="314" HEIGHT="245"
182    ALT="[Sound Preferences Finished]">
183  <LI>Add some example messages to the database.
184  <OL TYPE="A">
    <LI>If you don't want to do this, see step B.  You need to add roughly the
      same number of sample spam messages as genuine e-mail messages.  A few
      hundred of each should do, though you can get useful results with a dozen.
188      <P>Run the AGMSBayesianSpamServer program again.  This time it shouldn't
189      complain.  Click the "Create" button to make a new database with the
190      default name of "<A
191      HREF="file:/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam%20Database"
192      >/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam Database</A>".
      <P>Use the "Add Example of Spam/Genuine" button, and select at most
      80 files at a time (otherwise the Tracker/File Requester will lock up and
      you'll have to reboot your computer).  It will ask you to identify each
      file as spam or genuine; you can also identify a whole
      batch of them as all spam or all genuine.<BR>
198      <IMG SRC="pictures/SingleMessageClassificationRequest.png"
199      ALT="[Single Message Classification Request]"
200      WIDTH="349" HEIGHT="104" ALIGN="LEFT">
201      <IMG SRC="pictures/MultipleMessageClassificationRequest.png"
202      ALT="[Multiple Message Classification Request]"
203      WIDTH="349" HEIGHT="104" ALIGN="RIGHT">
204      <BR CLEAR="ALL">
205      <P>You can also drag and drop example messages into the bottom half of
206      the window.  Drop in the left side for genuine, right side for spam, but
207      avoid the middle third of the window.<BR>
208      <IMG SRC="pictures/DropZones.png" ALT="[Drop Zones]"
209      WIDTH="596" HEIGHT="204"><BR>
210      <P>If you have thousands of messages, use the command line mode.<BR>
211      <IMG SRC="pictures/CommandLineSetSpam.png" ALT="[Command Line Set Spam]"
212      WIDTH="634" HEIGHT="205">
    <LI>If you don't have a few hundred spam messages, then instead of doing step
      A, copy the sample database file to "AGMSBayesianSpam Database" in the <A
215      HREF="file:/boot/home/config/settings/AGMSBayesianSpam/"
216      >/boot/home/config/settings/AGMSBayesianSpam/</A> directory (the MDR
217      installer will do this for you).  Due to complaints about the huge file
218      size, the sample spam database that comes with MDR is now very small
219      (10 spam, 10 genuine example messages), so you'll need to train it before
220      it gets accurate (auto-training is your friend).  Or you could get the
221      huge (976KB, 484 spam, 1009 genuine messages) one from version 2.0.0 Beta
222      8, available at: <A
223      HREF="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/bemaildaemon/AGMSBayesianSpamServer/SampleDatabase?rev=release-2-0-0-beta8"
224      >http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/bemaildaemon/AGMSBayesianSpamServer/SampleDatabase?rev=release-2-0-0-beta8</A>
225      <BR>
226      <IMG SRC="pictures/DatabaseFileLocation.png"
227      ALT="[Database File Location]" WIDTH="609" HEIGHT="234"><BR>
228      Run the AGMSBayesianSpamServer program again.  Hit the Purge button
229      (because it doesn't load the database until it has to do something).  If
230      things are working correctly, you should see a list of the words in the
231      sample database in the bottom half of the window.  An alternative method
232      of picking a database file is to double click on it in Tracker, which is
233      useful if you don't want to type in the full name.
234  </OL>
235  <LI>Quit the AGMSBayesianSpamServer program.  Delete all the remaining
236    files you unzipped from the archive (such as the example database, this
237    readme, this documentation, or source code), unless you want to
238    keep them around.  You will have to decide where to store them; I can't
239    tell you everything :-).
240  <LI>Start up the E-mail preferences control panel (part of the Mail Daemon
241    Replacement project).<BR>
242    <IMG SRC="pictures/StartingEMailPreferences.png"
243    ALT="[Starting EMail Preferences]" WIDTH="291" HEIGHT="443"><BR>
244    Choose the e-mail account you wish to have checked for spam.  Then hit the
245    Add Filter button to bring up the menu with the list of filters you can
246    add, and pick AGMSBayesianSpamFilter.<BR>
247    <IMG SRC="pictures/ClickingOnAddFilterShowsList.png"
248    ALT="[Clicking On Add Filter Shows List]" WIDTH="454" HEIGHT="413"><BR>
249    Remember to click on the filter after you have added it to set the settings
250    (though the defaults are useable too).<BR>
251    <IMG SRC="pictures/ClickOnFilterNameToGetSettings.png"
252    ALT="[Click On Filter Name To Get Settings]" WIDTH="454" HEIGHT="413"><BR>
    Then select the settings you wish.  The sound effects aren't set here; they
    are turned on or off in the Sounds preferences.<BR>
255    <IMG SRC="pictures/FilterSettings.png" ALT="[Filter Settings]"
256    WIDTH="454" HEIGHT="413">
257  <LI>Test it.  Send yourself some e-mail and see if it gets rated correctly.
258</OL>
259
260
261<H1><A NAME="Usage"></A>Usage</H1>
262
263<H2><A NAME="Reading"></A>Reading E-Mail</H2>
264
265<P>Check for e-mail as usual.  If you look at the inbox directory in Tracker,
266you can add an extra column with the E-mail attribute "Spam/Genuine Estimate"
267to see how spammy the messages are.  0.0 means the system thinks the message is
268fully genuine, 1.0 fully spam.  But it can be wrong, for things like a friend
269of yours quoting a spam message.  For the Chi-squared method (the default), you
270see numbers close to zero for genuine (like 9.750e-13), close to 1 for spam and
271in-between (0.01 to 0.99) if it can't decide.  With the Robinson scoring
272method, usually if it is over 0.56 (the best cutoff value depends a bit on your
273database quality, but 0.56 is typical) then it is spam, and the closer it is to
2741.0 the more likely it really is spam.
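<P>For the curious, the two scoring methods can be sketched in a few lines of
Python.  This is only an illustration of the formulas as they are commonly
given in Gary Robinson's write-up and on the SpamBayes list, not code from
AGMSBayesianSpamServer.  The function names are invented for the example, and
it assumes you already have a per-word spam probability (strictly between 0
and 1) for each interesting word in the message.

```python
import math


def robinson_score(word_probs):
    """Robinson's 'nth root of the product' combining (assumed formulation)."""
    n = len(word_probs)
    # p is large when the words look spammy, q when they look genuine.
    p = 1.0 - math.prod(1.0 - x for x in word_probs) ** (1.0 / n)
    q = 1.0 - math.prod(word_probs) ** (1.0 / n)
    s = (p - q) / (p + q)
    return (1.0 + s) / 2.0


def chi2q(x2, degrees):
    """Chance that a chi-squared variable with an even number of degrees
    of freedom comes out at least as large as x2."""
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, degrees // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_squared_score(word_probs):
    """Chi-squared combining as described on the SpamBayes list (assumed)."""
    n = len(word_probs)
    ln_spam = sum(math.log(x) for x in word_probs)
    ln_genuine = sum(math.log(1.0 - x) for x in word_probs)
    spam_evidence = 1.0 - chi2q(-2.0 * ln_genuine, 2 * n)
    genuine_evidence = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    return (spam_evidence - genuine_evidence + 1.0) / 2.0
```

<P>With ten words that each look 99% spammy, chi_squared_score() gives a value
within a hair of 1, and with ten words at 1% it gives a tiny number near 0.
robinson_score() on the same inputs stays at about 0.99 and 0.01, which shows
the gentler spread of the Robinson method.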
275
276<P>I sort by spam ratio, and manually throw away the messages that are spammy,
277then I switch the Tracker window back to sorting by thread+date (just a click
278on the appropriate column title does it) and get on with reading the mail.
279
280<P>If you turned on the filter option to modify the subject, you'll see spam
281messages with something like [Spam 95%] in front of the subject (I don't use it
because it looks ugly).  That only changes the Tracker display of the subject: the
actual subject inside the message isn't affected, just the MAIL:subject
attribute, which is what the Tracker shows.
285
286<H2><A NAME="Training"></A>Training</H2>
287
288<P><EM>The accuracy is only as good as your database</EM>, so update it with
289more example spam and genuine messages.  In particular, if it gets the estimate
290wrong, add that message to the database to tell it what it should be doing.  A
291quick way to do that is to right click on the e-mail in Tracker, and pick Open
292With...  AGMSBayesianSpamServer.<BR>
293<IMG SRC="pictures/SortingInboxBySpamEstimate.png"
294ALT="[Sorting Inbox By Spam Estimate]" WIDTH="831" HEIGHT="361"><BR>
295It should start up and ask you if the message is spam or genuine.<BR>
296<IMG SRC="pictures/SingleMessageClassificationRequest.png"
297ALT="[Single Message Classification Request]" WIDTH="349" HEIGHT="104"><BR>
298You can also drag and drop the message into the left third of the word list for
299genuine messages, or right third for spam messages.  Dropping in the middle
300third does something else that's mostly harmless and fun.<BR>
301<IMG SRC="pictures/DropZones.png" ALT="[Drop Zones]" WIDTH="596"
302HEIGHT="204"><BR>
303
304<P>You may also want to train it with all your messages (it gives slightly
305better results in the long run than just training on the mistakes).  To make it
306easier, turn on the self-training option in the mail filter.  It will compute
307the spam ratio of new mail messages, then feed back the same message into the
308database as an example of spam/genuine.  When it gets it wrong, you should
309manually retrain it with the correct classification, otherwise the database
310will get worse and worse and finally turn into mush.
311
312<H2><A NAME="HidingServer"></A>Hiding the Server Window</H2>
313
<P>If you're annoyed by the server window popping up whenever the system checks
for e-mail, you can tell it to hide.  Just click the "Server Mode" checkbox.
Actually, that's now the default since people were complaining about the window
getting in the way.  The disadvantage is that you don't get to see error
messages.  To make it visible again, start up AGMSBayesianSpamServer (possibly
by double clicking on its icon in <A
HREF="file:/boot/home/config/bin/">/boot/home/config/bin/</A>) and bring up the
hidden window by using the deskbar, or use the "Edit Server Settings"
button in the spam filter configuration.<BR>
323<IMG SRC="pictures/MakingTheWindowVisibleFromTheDeskbar.png"
324ALT="MakingTheWindowVisibleFromTheDeskbar" WIDTH="640" HEIGHT="109"><BR>
325
326<H2><A NAME="AlexSettings"></A>Alex's Settings</H2>
327
328<P>I'm currently using it with these settings: Chi-squared scoring,
329AnyTextAndHeader tokenizing, server mode on, ignore previous classification
330off, mark subject with [Spam %] off, spam cutoff 0.95, genuine below 0.05, no
331words found on, self-training on, close AGMSBayesianSpamServer when Finished
332on.  Because of the self training, I always correct it when it gets the
classification wrong (that means I have to delete the messages manually; I can't
use a Match Header filter to do it).  My Tracker window shows the
335Classification Group attribute rather than the Spam/Genuine Estimate number
336(which isn't pretty when using Chi-squared).
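<P>The two cutoffs work as a simple three-way split.  Here is a minimal Python
sketch, assuming the group names Spam, Genuine and Uncertain (guessed from the
sound effect names) and guessing at what happens when a score lands exactly on
a cutoff; it is not taken from the filter's source.

```python
def classification_group(score, spam_cutoff=0.95, genuine_cutoff=0.05):
    """Map a spam estimate between 0 and 1 to a group name.

    The group names and the treatment of scores exactly on a cutoff are
    assumptions made for this example.
    """
    if score >= spam_cutoff:
        return "Spam"
    if score < genuine_cutoff:
        return "Genuine"
    return "Uncertain"
```

<P>With the Robinson method you would pass the same number for both cutoffs,
such as 0.56, so that nothing falls into the uncertain group.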
337
338<H1><A NAME="AdvancedUsage"></A>Advanced Usage</H1>
339
340<H2><A NAME="CommandLine"></A>Command Line Mode and Scripting</H2>
341
342<P>Besides the graphical user interface, there
343is also a command line mode.  Just type "AGMSBayesianSpamServer help"
344in the terminal to get a list of the commands and what they do (the ultimate
345documentation).  It also explains all of the mysterious options you see in the
346graphical user interface.  The same commands can be used in scripting, either
347from some other program or via the "hey" utility which you can get from <A
348HREF="http://www.bebits.com/app/2042">http://www.bebits.com/app/2042</A>.  A
useful command, if you have a lot of genuine messages to add, is
"AGMSBayesianSpamServer set genuine *", which will use all messages in the
current directory as examples of genuine text (use "set spam *" the same way in a
directory of spam).
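<P>When the shell's * would expand to more files than you want on one command
line, a script can build the commands in batches.  The Python sketch below only
builds the argument lists, using the documented "set spam" and "set genuine"
commands with full path names.  The name build_training_commands and its
batch_size default are made up for this example and are not part of
AGMSBayesianSpamServer.

```python
import os


def build_training_commands(paths, kind, batch_size=50):
    """Return argument lists that add message files as training examples.

    kind is "spam" or "genuine".  batch_size is an arbitrary choice made
    here to keep each command line short; it is not a limit of the program.
    """
    if kind not in ("spam", "genuine"):
        raise ValueError("kind must be 'spam' or 'genuine'")
    # The help text suggests full path names to be safe.
    full_paths = [os.path.abspath(p) for p in paths]
    commands = []
    for start in range(0, len(full_paths), batch_size):
        batch = full_paths[start:start + batch_size]
        commands.append(["AGMSBayesianSpamServer", "set", kind] + batch)
    return commands
```

<P>On a machine with the program installed, each list could then be handed to
subprocess.run().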
352
353<PRE>
354Sat Feb  8 16:30:51 274 /tmp>AGMSBayesianSpamServer help
355
356AGMSBayesianSpamServer - A Spam Database Server
357Copyright &copy; 2002 by Alexander G. M. Smith.  Released to the public domain.
358
359Compiled on Feb  8 2003 at 11:13:28.  $Revision: 1.11 $  $Header:
360/cvsroot/bemaildaemon/AGMSBayesianSpamServer/AGMSBayesianSpamServer.cpp,v 1.77
3612003/01/22 03:19:48 agmsmith Exp $
362
363This is a program for classifying e-mail messages as spam (junk mail which
364you don't want to read) and regular genuine messages.  It can learn what's
365spam and what's genuine.  You just give it a bunch of spam messages and a
366bunch of non-spam ones.  It uses them to make a list of the words from the
367messages with the probability that each word is from a spam message or from
368a genuine message.  Later on, it can use those probabilities to classify
369new messages as spam or not spam.  If the classifier stops working well
370(because the spammers have changed their writing style and vocabulary, or
371your regular correspondants are writing like spammers), you can use this
372program to update the list of words to identify the new messages
373correctly.
374
375The original idea was from Paul Graham's algorithm, which has an excellent
376writeup at: http://www.paulgraham.com/spam.html
377
378Gary Robinson came up with the improved algorithm, which you can read about at:
379http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
380
381Then he, Tim Peters and the SpamBayes mailing list developed the Chi-Squared
382test, see http://mail.python.org/pipermail/spambayes/2002-October/001036.html
383for one of the earlier messages leading from the central limit theorem to
384the current chi-squared scoring method.
385
386Thanks go to Isaac Yonemoto for providing a better icon.
387
388Usage: Specify the operation as the first argument followed by more
389information as appropriate.  The program's configuration will affect the
390actual operation (things like the name of the database file to use, or
391whether it should allow non-email messages to be added).  In command line
392mode it will do the operation and exit.  In GUI/server mode a command line
393invocation will just send the command to the running server.  You can also
394use BeOS scripting (see the "Hey" command which you can get from
395http://www.bebits.com/app/2042 ) to control the Spam server.  And finally,
396there's also a GUI interface which shows up if you start it without any
397command line arguments.
398
399Commands:
400
401Quit
402Stop the program.  Useful if it's running as a server.
403
404Get DatabaseFile
405Get the pathname of the current database file.  The default name is something
406like B_USER_SETTINGS_DIRECTORY / AGMSBayesianSpam / AGMSBayesianSpamServer
407Database
408
409Set DatabaseFile NewValue
410Change the pathname of the database file to use.  It will automatically be
411converted to an absolute path name, so make sure the parent directories exist
412before setting it.  If it doesn't exist, you'll have to use the create command
413next.
414
415Create DatabaseFile
416Creates a new empty database, will replace the existing database file too.
417
418Delete DatabaseFile
419Deletes the database file and all backup copies of that file too.  Really only
420of use for uninstallers.
421
422Count DatabaseFile
423Returns the number of words in the database.
424
425Set Spam NewValue
426Adds the spam in the given file (specify full pathname to be safe) to the
427database.  The words in the files will be added to the list of words in the
428database that identify spam messages.  The files processed will also have the
429attribute MAIL:classification added with a value of "Spam" or "Genuine" as
430specified.  They also have their spam ratio attribute updated, as if you had
431also used the Evaluate command on them.  If they already have the
432MAIL:classification attribute and it matches the new classification then they
433won't get processed (and if it is different, they will get removed from the
434statistics for the old class and added to the statistics for the new one).
435You can turn off that behaviour with the IgnorePreviousClassification
436property.  The command line version lets you specify more than one pathname.
437
438Count Spam
439Returns the number of spam messages in the database.
440
441Set SpamString NewValue
442Adds the spam in the given string (assumed to be the text of a whole e-mail
443message, not just a file name) to the database.
444
445Set Genuine NewValue
446Similar to adding spam except that the message file is added to the genuine
447statistics.
448
449Count Genuine
450Returns the number of genuine messages in the database.
451
452Set GenuineString NewValue
453Adds the genuine message in the given string (assumed to be the text of a
454whole e-mail message, not just a file name) to the database.
455
456Set IgnorePreviousClassification NewValue
457If set to true then the previous classification (which was saved as an
458attribute of the e-mail message file) will be ignored, so that you can add the
459message to the database again.  If set to false (the normal case), the
460attribute will be examined, and if the message has already been classified as
461what you claim it is, nothing will be done.  If it was misclassified, then the
462message will be removed from the statistics for the old class and added to the
463stats for the new classification you have requested.
464
465Get IgnorePreviousClassification
466Find out the current setting of the flag for ignoring the previously recorded
467classification.
468
469Set ServerMode NewValue
470If set to true then error messages get printed to the standard error stream
471rather than showing up in an alert box.  It also starts up with the window
472minimized.
473
474Get ServerMode
475Find out the setting of the server mode flag.
476
477Flush
478Writes out the database file to disk, if it has been updated in memory but
479hasn't been saved to disk.  It will automatically get written when the program
480exits, so this command is mostly useful for server mode.
481
482Set PurgeAge NewValue
483Sets the old age limit.  Words which haven't been updated since this many
484message additions to the database may be deleted when you do a purge.  A good
485value is 1000, meaning that if a word hasn't appeared in the last 1000
486spam/genuine messages, it will be forgotten.  Zero will purge all words, 1
487will purge words not in the last message added to the database, 2 will purge
488words not in the last two messages added, and so on.  This is mostly useful
489for removing those one time words which are often hunks of binary garbage, not
490real words.  This acts in combination with the popularity limit; both
491conditions have to be valid before the word gets deleted.
492
493Get PurgeAge
494Gets the old age limit.
495
496Set PurgePopularity NewValue
497Sets the popularity limit.  Words which aren't this popular may be deleted
498when you do a purge.  A good value is 5, which means that the word is safe
499from purging if it has been seen in 6 or more e-mail messages.  If it's only
500in 5 or less, then it may get purged.  The extreme is zero, where only words
501that haven't been seen in any message are deleted (usually means no words).
502This acts in combination with the old age limit; both conditions have to be
503valid before the word gets deleted.
504
505Get PurgePopularity
506Gets the purge popularity limit.
507
508Purge
509Purges the old obsolete words from the database, if they are old enough
510according to the age limit and also unpopular enough according to the
511popularity limit.
512
513Get Oldest
514Gets the age of the oldest message in the database.  It's relative to the
515beginning of time, so you need to do (total messages - age - 1) to see how
516many messages ago it was added.
517
518Set Evaluate NewValue
519Evaluates a given file (by path name) to see if it is spam or not.  Returns
520the ratio of spam probability vs genuine probability, 0.0 meaning completely
521genuine, 1.0 for completely spam.  Normally you should safely be able to
522consider it as spam if it is over 0.56 for the Robinson scoring method.  For
523the ChiSquared method, the numbers are near 0 for genuine, near 1 for spam,
524and anywhere in the middle means it can't decide.  The program attaches a
525MAIL:ratio_spam attribute with the ratio as its float32 value to the file.
526Also returns the top few interesting words in "words" and the associated
527per-word probability ratios in "ratios".
528
529Set EvaluateString NewValue
530Like Evaluate, but rather than a file name, the string argument contains the
531entire text of the message to be evaluated.
532
533ResetToDefaults
534Resets all the configuration options to the default values, including the
535database name.
536
537InstallThings
538Creates indices for the MAIL:classification and MAIL:ratio_spam attributes on
539all volumes which support BeOS queries, identifies them to the system as
540e-mail related attributes (modifies the text/x-email MIME type), and sets up
541the new MIME type (text/x-vnd.agmsmith.spam_probability_database) for the
542database file.  Also registers names for the sound effects used by the
543separate filter program (use the installsound BeOS program or the Sounds
544preferences program to associate sound files with the names).
545
546Set TokenizeMode NewValue
547Sets the method used for breaking up the message into words.  Use "Whole" for
548the whole file (also use it for non-email files).  The file isn't broken into
549parts; the whole thing is converted into words, headers and attachments are
550just more raw data.  Well, not quite raw data since it converts
551quoted-printable codes (equals sign followed by hex digits or end of line) to
552the equivalent single characters.  "PlainText" breaks the file into MIME
553components and only looks at the ones which are of MIME type text/plain.
554"AnyText" will look for words in all text/* things, including text/html
555attachments.  "AllParts" will decode all message components and look for words
556in them, including binary attachments.  "JustHeader" will only look for words
557in the message header.  "AllPartsAndHeader", "PlainTextAndHeader" and
558"AnyTextAndHeader" will also include the words from the message headers.
559
560Get TokenizeMode
561Gets the method used for breaking up the message into words.
562
563Set ScoringMode NewValue
564Sets the method used for combining the probabilities of individual words into
565an overall score.  "Robinson" mode will use Gary Robinson's nth root of the
566product method.  It gives a nice range of values between 0 and 1 so you can
567see shades of spaminess.  The cutoff point between spam and genuine varies
568depending on your database of words (0.56 was one point in some experiments).
569"ChiSquared" mode will use chi-squared statistics to evaluate the difference
570in probabilities that the lists of word ratios are random.  The result is very
571close to 0 for genuine and very close to 1 for spam, and near the middle if it
572is uncertain.
573
574Get ScoringMode
575Gets the method used for combining the individual word ratios into an overall
576score.
577
578ProcessArgs: The property specified isn't known or doesn't support the requested action (usually means it is an unknown command), error code $FFFFFFFF/-1 (General OS error) has occured.
579AGMSBayesianSpamServer shutting down...
580Sat Feb  8 16:30:58 275 /tmp>
581</PRE>
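<P>The way the two purge limits combine can be written out as a short Python
sketch.  It follows the descriptions of PurgeAge, PurgePopularity and Get
Oldest in the help text above, and assumes each word carries the age of the
last message that updated it plus a count of the messages it was seen in.  The
function and argument names are invented here, and the real code may well
differ.

```python
def messages_ago(total_messages, age):
    """Convert a stored age (counted from the beginning of time) into
    'how many messages ago', as the Get Oldest help text describes."""
    return total_messages - age - 1


def should_purge(word_age, word_count, total_messages,
                 purge_age=1000, purge_popularity=5):
    """A word is purged only if it is both old enough and unpopular enough."""
    too_old = messages_ago(total_messages, word_age) >= purge_age
    unpopular = word_count <= purge_popularity
    return too_old and unpopular
```

<P>With the suggested limits of 1000 and 5, a word last updated 4999 messages
ago is purged if it was seen in 5 messages but kept if it was seen in 6.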
582<!-- End the C style comment which makes editing this look bad with BeIDE's syntax colouring. */ -->
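<P>The quoted-printable conversion mentioned under TokenizeMode can be tried
out with Python's standard quopri module.  The sketch below only illustrates
the decoding step of "Whole" mode; splitting on white space is a stand-in
assumption for the server's real word-splitting rules, which the help text
doesn't spell out.

```python
import quopri


def whole_mode_words(raw_message):
    """Rough imitation of "Whole" tokenizing on the raw bytes of a message."""
    # Turn codes like "=49" into "I" and drop "=" + newline soft line breaks.
    decoded = quopri.decodestring(raw_message)
    # Assumption: words are simply runs of non-blank characters.
    return decoded.decode("latin-1").split()
```

<P>For example, the bytes "V=49AGRA for =" followed by a new line and "sale"
come out as the three words VIAGRA, for and sale.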

<H2><A NAME="Spreadsheet"></A>Using a Spreadsheet to Examine Word Statistics</H2>

<P>Another advanced trick is to load the list of words into Gobe Productive's
spreadsheet, so that you can find the most popular word or chart the word
frequencies (unfortunately it can only handle about 16000 words).  To do that,
start up Gobe Productive, pick Open, then from the file requester's "Document
Type" menu, pick "Spreadsheet" and then in the submenu pick "Tab-delimited
text".  Then navigate to the database; the default location is "<A
HREF="file:/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam%20Database"
>/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam Database</A>".
Have fun!


<H2><A NAME="WordDisplay"></A>Understanding and Using the Word Display</H2>

<P><IMG ALIGN="RIGHT" SRC="pictures/WordDisplay.png" WIDTH="285" HEIGHT="645"
ALT="[Narrow Word Display Window]">The word display tells you more than you
need to know about the words in the database.  Those colour bars actually mean
something.

<P>Obviously, words which are more genuine than spamish show up in <FONT
COLOR="BLUE">blue</FONT>, while spammier words are in <FONT
COLOR="RED">red</FONT>.  The colour is based on proportions of the total
message counts, so a word which shows up in 10% of the genuine messages and 9%
of the spam will show up in blue, even if it was in more spam messages than
genuine messages (this compensates a bit for not training on an equal number of
spam and genuine messages).  The length of the bar shows the ratio of the
proportions; further to the left for larger genuine proportions, and similarly
further right for larger spam proportions.
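
<P>The colour rule can be sketched in a few lines of Python.  This is only an
illustration of the proportions idea described above; the function name, the
"neutral" result for a tie and the way the ratio is reported are assumptions,
and the real display maps the ratio onto a bar length in its own way.

<PRE>
```python
def word_bar(genuine_hits, spam_hits, genuine_total, spam_total):
    """Return (colour, ratio) for one word, comparing the share of genuine
    messages containing the word with the share of spam messages."""
    genuine_share = genuine_hits / genuine_total if genuine_total else 0.0
    spam_share = spam_hits / spam_total if spam_total else 0.0
    if genuine_share == spam_share:
        return ("neutral", 1.0)  # Hypothetical tie handling.
    if genuine_share > spam_share:
        colour, larger, smaller = "blue", genuine_share, spam_share
    else:
        colour, larger, smaller = "red", spam_share, genuine_share
    # A word never seen in the other class gets an unbounded ratio.
    ratio = larger / smaller if smaller else float("inf")
    return (colour, ratio)


# 100 genuine and 1000 spam training messages: the word is in 10 genuine
# messages (10%) and 90 spam messages (9%), so it comes out blue even
# though more spam messages contain it.
print(word_bar(10, 90, 100, 1000))
```
</PRE>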

<P>The thickness of the bar shows how many messages the word was found in.
It's kind of a weight, saying how frequently used that word is and thus how
significant it is.

<P>The paleness of the bar shows you how old that word is.  A light colour
means that the word was last added to the database long ago.  A darker, more
saturated colour means that the word was added more recently, when you added
example messages to the database.

<P>Finally, if you click on the word display, the background will change from a
pale blue tint into solid white, to show you that it is the active keyboard
focus.  That means you can type in letters to find a particular word (pause for
one second before typing the letters of a new word).  The arrow keys, page
up/down keys and the mouse scroll wheel also show you different words.  Sorry,
there's no scroll bar since finding the Nth word is a slow operation with a set
of words (they aren't numbered); each twitch of the scroll bar would mean going
through the list of tens or even hundreds of thousands of words and counting to
find the scroll position.

<BR CLEAR="ALL">

<H2><A NAME="Tokenizing"></A>Tokenizing Modes Compared</H2>

<P>I did some tests with tokenizing different parts of mail messages to see
what would work best.

<P>The Database:
<BR>341 training genuine messages, 406 training spam messages (or 398 when
parsing, due to a bug with messages that don't have body text, fixed later on
in 2.0.0b5).
<BR>40 test genuine messages, 40 test spam messages, all more recent than the
training ones.
<BR>Spam threshold is 0.56, using Gary Robinson's combining method.

<P>The results:

<TABLE BORDER="2" SUMMARY="[Table showing results of different tokenizing methods]">
<TR><TH>Tokenizing Method
<TH>Genuine Test Details
<TH>Genuine Accuracy
<TH>Spam Test Details
<TH>Spam Accuracy

<TR><TD>Just headers
<TD>Genuine .181352 to .557881, one false positive (a mailbox full announcement).
<TD>2.5% wrong.
<TD>Spam .450602 to .750511, 21 false negatives.
<TD>52.5% wrong.

<TR><TD>Whole raw message text
<TD>Genuine .163027 to .627022, 3 false positives.
<TD>7.5% wrong.
<TD>Spam .509355 to .993985, 1 false negative.
<TD>2.5% wrong.

<TR><TD>Message parsed into parts plus header
<TD>Genuine .168857 to .609005, 4 false positives.
<TD>10% wrong.
<TD>Spam .614564 to .994364, 0 false negatives.
<TD>0% wrong.

<TR><TD>Message parsed into parts, no header data
<TD>Genuine .220161 to .631161, 5 false positives.
<TD>12.5% wrong.
<TD>Spam .592501 to .994444, 0 false negatives.
<TD>0% wrong.

<TR><TD>Any text parts and header
<TD>Genuine .162697 to .614136, 4 false positives.
<TD>10% wrong.
<TD>Spam .614973 to .994362, 0 false negatives.
<TD>0% wrong.

<TR><TD>Any text parts, no headers
<TD>Genuine .221923 to .635487, 6 false positives.
<TD>15% wrong.
<TD>Spam .594271 to .994441, 0 false negatives.
<TD>0% wrong.

<TR><TD>text/plain parts (including body text)
<TD>Genuine .137869 to .583192, 3 false positives.
<TD>7.5% wrong.
<TD>Spam .448059 to .994119, 17 false negatives.
<TD>42.5% wrong.

<TR><TD>Only text/plain sub-parts, no headers.<BR>
150 spam and 1 genuine training message had no words!
<TD>Genuine .219169 to .696899, 9 false positives.
<TD>22.5% wrong.
<TD>Spam .660755 to .994116, 0 false negatives, 27 had no words.
<TD>0% wrong.
</TABLE>
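
<P>The "% wrong" columns are just the misclassified count divided by the 40
test messages of that kind.  Here is a small Python sketch of that arithmetic
for three rows of the table; the combined figure over all 80 test messages is
not in the table and is computed here only for comparison, and the function
names are made up for the example.

<PRE>
```python
TEST_MESSAGES_PER_CLASS = 40  # 40 genuine and 40 spam test messages.

# (false positives, false negatives) copied from three rows of the table.
RESULTS = {
    "Just headers": (1, 21),
    "Whole raw message text": (3, 1),
    "Any text parts and header": (4, 0),
}


def percent_wrong(misclassified, tested=TEST_MESSAGES_PER_CLASS):
    """Percentage of the tested messages that landed in the wrong class."""
    return 100.0 * misclassified / tested


def overall_percent_wrong(false_positives, false_negatives):
    """Error rate over the genuine and spam test messages taken together."""
    return percent_wrong(false_positives + false_negatives,
                         2 * TEST_MESSAGES_PER_CLASS)


for method, (false_pos, false_neg) in RESULTS.items():
    print(method,
          percent_wrong(false_pos),
          percent_wrong(false_neg),
          overall_percent_wrong(false_pos, false_neg))
```
</PRE>

<P>Counted this way, the whole raw message method and the any text parts and
header method each misclassify 4 of the 80 test messages, against 22 for the
just headers method.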

<P>The results look good for the whole message tokenizing method (which also
works on non-email files) and for the any text parts plus header method.  Since
the text parts method doesn't add lots of garbage words to the database from
trying to find words in binary attachments, it's now the default setting.

<P>The header only method is pretty good too for identifying genuine messages,
and so-so for spam messages.  That may make it useable for pre-download tests
(delete some of the spam on the mail server before downloading it, without
worrying about deleting too many genuine messages).


<H2><A NAME="HeadersOnly"></A>High Speed and High Danger - Headers Only Trick</H2>

<P>If you have a slow dial-up connection, you may wish to classify your mail
quickly by deleting spam messages without downloading the entire junk message.

<P><IMG SRC="pictures/ChoosingJustHeaderTokenizingMode.png" WIDTH="920"
HEIGHT="402" ALT="[Choosing JustHeader Tokenizing Mode]">

<P><IMG SRC="pictures/DangerousMatchFilter.png" WIDTH="276" HEIGHT="243"
ALIGN="RIGHT" ALT="[Dangerous Match Filter]">This can be done with three
settings.  First switch the AGMSBayesianSpamServer into tokenizing just the
headers.  Then go into the E-mail preferences and add an
AGMSBayesianSpamFilter with the "Add [Spam %] in Front of Subject" option
turned on, and the ratio set to a nice safe high level like 0.95 (so that your
genuine mail is less likely to get deleted, but it will still delete the 1% of
your real mail that looks like spam, which is why this is dangerous).  Do not
turn on self-training, since you can't manually correct it.  Finally in the
E-mail preferences, add a "Match Header" filter after the spam filter and set
it so that If <B>Subject</B> is <B>\[Spam*</B> then <B>Delete Message</B>.
That's backslash, left square bracket, Spam with the S capitalised, asterisk.
Now it will download the headers, check them against the spam database, and
then delete the spam ones on the server without downloading the rest of their
contents.
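
<P>Here is a rough Python sketch of how the two filters work together.  The
exact text that the spam filter puts in front of the subject and the way the
"Match Header" filter interprets its pattern are assumptions here: the sketch
treats <B>\[Spam*</B> as a literal left square bracket and "Spam" followed by
anything, and both function names are hypothetical.

<PRE>
```python
import re

# Assumed reading of the "\[Spam*" pattern: a literal "[" and "Spam" at the
# start of the subject, followed by any other characters.
SPAM_SUBJECT = re.compile(r"\[Spam.*")


def tag_subject(subject, spam_ratio, cutoff=0.95):
    """Mimic the "Add [Spam %] in Front of Subject" option (the tag
    format used here is a guess)."""
    if spam_ratio >= cutoff:
        return "[Spam {:.1f}%] {}".format(spam_ratio * 100.0, subject)
    return subject


def should_delete(subject):
    """Mimic a Match Header rule that deletes tagged messages."""
    return SPAM_SUBJECT.match(subject) is not None


for subject, ratio in [("Cheap pills", 0.97), ("Lunch on Friday?", 0.40)]:
    tagged = tag_subject(subject, ratio)
    print(tagged, "DELETE" if should_delete(tagged) else "download")
```
</PRE>

<P>Only messages that score at or above the cutoff get the tag, so only those
are deleted on the server; everything else is downloaded as usual.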

<P>You should also make a new spam database trained in Just Headers tokenizing
mode with roughly equal examples of your genuine messages and spam messages (50
of each should be enough to start).  A full message database may also work, but
headers only training should be more accurate for headers only decisions.  When
testing JustHeader mode, I noticed that the false positive rate (genuine
reported as spam) is nice and low, but the false negative rate (spam reported
as genuine) is high (tested with Robinson scoring, not Chi-Squared scoring).
So this means JustHeader mode will delete maybe half the spam (and download the
rest) and also delete the occasional genuine message.


<H1><A NAME="ChangeLog"></A>Change Log</H1>

<P>These are the various versions released to the public.  Each one is actually
several accumulated minor changes, which you can see by looking at the log at
the top of the source code files.

<UL>
  <LI>Version 1.77 changed the tokenizing to not convert words to lower case,
  since the case is important for spam!  The window is now minimized before
  opening so that it doesn't flash on the screen in server mode.  Also, the
  database is loaded when the window is displayed so that the user can see the
  words.

  <LI>Version 1.73 added self training support and the Chi-Squared scoring
  method.

  <LI>Version 1.68 changed nothing significant.  Just very minor tweaking.

  <LI>Version 1.65 added a time delay for exiting the program.  This is so that
  multiple e-mail accounts can simultaneously download mail, without having the
  server close when one of the accounts finishes downloading.  Scripting
  requests that come in while it is counting down to quitting time will cancel
  the countdown.  In the belt <I>and</I> suspenders department, the filter has
  been enhanced to try starting up the server up to three times.

  <LI>Version 1.60 got rid of the need to use a modified Inbox filter for MDR
  (found out the correct way of setting attributes on a message), added sound
  effects, and added parsing of mail messages (parsing MIME headers, decoding
  base64, quoted-printable and converting character sets to UTF-8 for text, all
  thanks to using the MDR mail kit, which you now need since it uses their
  libmail.so code library).  There are now new options for selecting what kind
  of parsing to do (text/plain or text/* or */* attachments, with or without
  headers, etc).  Plus sound effect options.  The sample database has also been
  updated to use text/* plus headers tokenization, which makes it slightly
  smaller.<!-- End the C style comment which makes editing this look bad with
  BeIDE's syntax colouring. */ -->

  <LI>Version 1.49 switched to Gary Robinson's method for calculating spam
  ratios.  The overall results are about the same but you have fewer false
  positives and the numbers are spread more evenly between 0.0 and 1.0 than
  with Paul Graham's method (change the E-mail preferences filter setting
  cutoff point to 0.56, adjust as needed).  Also, as "jaf" requested, you can
  now drag and drop messages into the word list - drop in the left third to use
  it as an example of genuine messages, right third for spam, and middle third
  to get an evaluation of a message's spaminess.  Also a useless command was
  removed.  Updated files (replace your existing copies): AGMSBayesianSpam
  Database, AGMSBayesianSpamFilter, AGMSBayesianSpamServer.

  <LI>Version 1.47 was the first public (and working) version.  It used Paul
  Graham's algorithm with a few simplifications.
</UL>

<P>Released to the public domain in 2002 by the author, Alexander G. M. Smith.
</BODY>
</HTML>