1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> 2<HTML> 3<HEAD> 4<TITLE>AGMSBayesianSpam Documentation</TITLE> 5<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii"> 6<META NAME="author" CONTENT="Alexander G. M. Smith"> 7<META NAME="description" CONTENT="Documentation for AGMSBayesianSpam, for classifying incoming e-mail messages as spam (junk mail) or genuine."> 8<!-- 9; $Log: index.html,v $ 10; Revision 1.11 2003/02/08 21:54:14 agmsmith 11; Updated the AGMSBayesianSpamServer documentation to match the current 12; version. Also removed the Beep options from the spam filter, now they 13; are turned on or off in the system sound preferences. 14; 15; Revision 1.10 2002/12/16 17:32:43 agmsmith 16; Added Alex's settings paragraph. Added screen shot of dangerous 17; header filter that deletes things on the server. 18; 19; Revision 1.9 2002/12/13 22:45:22 agmsmith 20; More changes for self training and chi-squared scoring. 21; 22; Revision 1.8 2002/12/13 22:20:50 agmsmith 23; Under construction. 24; 25; Revision 1.7 2002/11/29 23:42:26 agmsmith 26; Describe the word display and what you can do with it. 27; 28; Revision 1.6 2002/11/29 22:20:02 agmsmith 29; Updated version numbers in the text 30; 31; Revision 1.5 2002/11/28 21:19:41 agmsmith 32; Updated to explain how to check for spam without downloading the 33; whole message. 34; 35; Revision 1.4 2002/11/10 20:56:39 agmsmith 36; Updated documentation to include MDR installer effects, and added a 37; section on the tokenizing experiments. 38; 39; Revision 1.3 2002/11/06 00:54:47 agmsmith 40; Spam definition corrected, with prodding from Ian G. 41; 42; Revision 1.2 2002/11/05 22:47:30 agmsmith 43; Replace UTF-8 copyright symbol with useable token. 44; 45; Revision 1.1 2002/11/05 22:43:24 agmsmith 46; Starting point for the HTML documentation for AGMSBayesianSpam stuff. 47; 48; Revision 1.5 2002/10/21 21:07:19 agmsmith 49; Added references to the original spam detection papers. 
50; 51; Revision 1.4 2002/10/21 20:56:01 agmsmith 52; Added hyperlinks for local files, and an explanation of "spam". 53; 54; Revision 1.3 2002/10/21 20:19:29 agmsmith 55; Finished updating instructions for version 1.60, and adding 56; lots of screen shots. 57; 58; Revision 1.2 2002/10/21 02:03:35 agmsmith 59; Added log in HTML comments area. 60; 61; Revision 1.1 2002/10/21 02:00:54 agmsmith 62; Initial revision 63--> 64</HEAD> 65<BODY BGCOLOR="WHITE" TEXT="BLACK"> 66 67<P><FONT COLOR="MAGENTA">Short: Junk E-Mail Classifier. 68<BR>Author: agmsmith@rogers.com (Alexander G. M. Smith) 69<BR>Uploader: agmsmith@rogers.com (Alexander G. M. Smith) 70<BR>Website: <A HREF="http://members.rogers.com/agmsmith/">http://members.rogers.com/agmsmith/</A> 71<BR>Version: 1.77 72<BR>Type: internet & network/e-mail 73<BR>Requires: BeOS 5.0+ 74<BR>Related things: <A HREF="http://www.paulgraham.com/spam.html">http://www.paulgraham.com/spam.html</A>, <A HREF="http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html">http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html</A></FONT> 75 76<H1><A NAME="Contents"></A>Table of Contents</H1> 77 78<UL> 79 <LI><A HREF="#Contents">Table of Contents</A> 80 <LI><A HREF="#Introduction">Introduction to AGMSBayesianSpam</A> 81 <LI><A HREF="#Installation">Installation</A> 82 <LI><A HREF="#Usage">Usage</A> 83 <UL> 84 <LI><A HREF="#Reading">Reading E-Mail</A> 85 <LI><A HREF="#Training">Training</A> 86 <LI><A HREF="#HidingServer">Hiding the Server Window</A> 87 <LI><A HREF="#AlexSettings">Alex's Settings</A> 88 </UL> 89 <LI><A HREF="#AdvancedUsage">Advanced Usage</A> 90 <UL> 91 <LI><A HREF="#CommandLine">Command Line Mode and Scripting</A> 92 <LI><A HREF="#Spreadsheet">Using a Spreadsheet to Examine Word Statistics</A> 93 <LI><A HREF="#WordDisplay">Understanding and Using the Word Display</A> 94 <LI><A HREF="#Tokenizing">Tokenizing Modes Compared</A> 95 <LI><A HREF="#HeadersOnly">High Speed and High Danger - Headers Only 
Trick</A>
  </UL>
  <LI><A HREF="#ChangeLog">Change Log</A>
</UL>

<H1><A NAME="Introduction"></A>Introduction to AGMSBayesianSpam</H1>

<P>AGMSBayesianSpam is a set of BeOS programs for classifying e-mail messages
and other text as either spam or genuine.  "Spam" is the colloquial name for
unwanted junk messages, usually advertising.  The name comes from a 1970s <A
HREF="http://www.google.com/search?&q=monty+python+spam">Monty Python comedy
skit</A> involving lots of unwanted Spam, which is the name for the spicy ham
in a can made by the <A HREF="http://www.hormel.com/">Hormel Foods</A> company,
originally from Austin, Minnesota, USA.  The program classifies messages as
spam or genuine (sometimes called "ham"), based on the words they contain and
on previous messages which the user has identified as spam or genuine.
It's implemented as a server program (AGMSBayesianSpamServer), which keeps track
of the word list, and a Mail Daemon Replacement add-on (AGMSBayesianSpamFilter),
which uses the server to classify incoming messages.  Theoretically other
programs, like a news reader, could also use the word database through the
scripting interface.  There's also a command line interface and a graphical
user interface.

<P>If you want to know more about the technique of counting words, have a look
at Paul Graham's wonderful write-up at <A
HREF="http://www.paulgraham.com/spam.html">http://www.paulgraham.com/spam.html</A>.
This program is currently using an improved version of Graham's method, called
Gary-combining, by Gary Robinson.  See <A
HREF="http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html"
>http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html</A> for
Gary's story.
There's also an even more improved method called Chi-Squared 126(χ²) combining, which grew from discussions on the <A 127HREF="http://mail.python.org/mailman-21/listinfo/spambayes">Spambayes</A> 128mailing list. 129 130<H1><A NAME="Installation"></A>Installation</H1> 131 132<OL TYPE="1"> 133 <LI>Install the BeOS Mail Daemon Replacement (MDR) version 2.0.0 beta 7 or 134 later. Beta 3 and later include AGMSBayesianSpam so you don't need to 135 worry about incompatible versions, and the MDR will even do some of the 136 installation for you. You can get MDR from <A 137 HREF="http://www.bebits.com/app/2289">http://www.bebits.com/app/2289</A> or 138 get the latest source code and compile it yourself from <A 139 HREF="http://sourceforge.net/projects/bemaildaemon" 140 >http://sourceforge.net/projects/bemaildaemon</A>. 141 <LI>Move the AGMSBayesianSpamServer program to the 142 <A HREF="file:/boot/home/config/bin/">/boot/home/config/bin/</A> directory 143 (the MDR installer will do this for you). It's 144 put there to make it useable from the command line. If you use it 145 frequently, you can also add a symbolic link to it in your desktop 146 applications menu or to the mail menu. 147 <LI>Move the AGMSBayesianSpamFilter mail add-on to the 148 <A HREF="file:/boot/home/config/add-ons/mail_daemon/inbound_filters/" 149 >/boot/home/config/add-ons/mail_daemon/inbound_filters/</A> directory 150 (the MDR installer will do this for you).<BR> 151 <IMG SRC="pictures/HomeConfigAddonsMaildaemonInboundfilters.png" 152 ALT="[/boot/home/config/add-ons/mail_daemon/inbound_filters/ directory]" 153 WIDTH="634" HEIGHT="288"> 154 <LI><IMG SRC="pictures/CantFindSettings.png" ALT="[Can't Find Settings]" 155 WIDTH="326" HEIGHT="120" ALIGN="RIGHT">Set up MIME types and indices (the 156 MDR installer will do this step and the next one for you, invisibly). Run 157 the AGMSBayesianSpamServer program. It will put up an alert box 158 complaining about not finding the settings file. 
Just hit the Acknowledge
  button to get past it.  Then click the "Install MIME Types &amp; Make
  Indices on All Drives" button, which does what it says and also adds a
  few sound effect names to the system.<BR CLEAR="ALL">
  <IMG SRC="pictures/TheInstallButton.png" ALT="[The Install Button]"
  WIDTH="604" HEIGHT="403">
  <LI>Quit the program (the close box at the top left corner is one way of
  doing that).  It will make the settings file and settings directory
  <A HREF="file:/boot/home/config/settings/AGMSBayesianSpam/"
  >/boot/home/config/settings/AGMSBayesianSpam/</A> when it exits.
  <LI>Use the Sounds preferences (or the installsound command) to associate
  the names with your sound files (SoundGenuine, SoundUncertain and SoundSpam
  are included as examples with MDR in the <A
  HREF="file:/boot/home/config/settings/AGMSBayesianSpam/"
  >/boot/home/config/settings/AGMSBayesianSpam/</A> directory; no, I don't
  have the rights to the Monty Python Spam skit).  If you don't want it to
  make sounds, don't do anything (you can also use the Sounds preferences
  later on to disable or remove the sounds if you get tired of them).<BR>
  <IMG SRC="pictures/StartingSoundPreferences.png" WIDTH="291" HEIGHT="445"
  ALT="[Starting Sound Preferences]"> <IMG
  SRC="pictures/SoundPrefChoosingAFile.png" WIDTH="318" HEIGHT="387"
  ALT="[Sound Preferences Choosing a File]">
  <P>When you're done, it should look something like this:<BR>
  <IMG SRC="pictures/SoundPrefFinished.png" WIDTH="314" HEIGHT="245"
  ALT="[Sound Preferences Finished]">
  <LI>Add some example messages to the database.
  <OL TYPE="A">
    <LI>If you don't want to do this, see step B.  You need to add roughly as
    many sample spam messages as genuine e-mail messages.  A few
    hundred of each should do, though you can get useful results with a dozen.
    <P>Run the AGMSBayesianSpamServer program again.  This time it shouldn't
    complain.
Click the "Create" button to make a new database with the 190 default name of "<A 191 HREF="file:/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam%20Database" 192 >/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam Database</A>". 193 <P>Use the "Add Example of Spam/Genuine" button, and only select at most 194 80 files at a time (otherwise the Tracker/File Requester will lock up and 195 you'll have to reboot your computer). It will ask you to identify each 196 file as spam or genuine, you also have the choice of identifying a whole 197 batch of them as all spam or all genuine.<BR> 198 <IMG SRC="pictures/SingleMessageClassificationRequest.png" 199 ALT="[Single Message Classification Request]" 200 WIDTH="349" HEIGHT="104" ALIGN="LEFT"> 201 <IMG SRC="pictures/MultipleMessageClassificationRequest.png" 202 ALT="[Multiple Message Classification Request]" 203 WIDTH="349" HEIGHT="104" ALIGN="RIGHT"> 204 <BR CLEAR="ALL"> 205 <P>You can also drag and drop example messages into the bottom half of 206 the window. Drop in the left side for genuine, right side for spam, but 207 avoid the middle third of the window.<BR> 208 <IMG SRC="pictures/DropZones.png" ALT="[Drop Zones]" 209 WIDTH="596" HEIGHT="204"><BR> 210 <P>If you have thousands of messages, use the command line mode.<BR> 211 <IMG SRC="pictures/CommandLineSetSpam.png" ALT="[Command Line Set Spam]" 212 WIDTH="634" HEIGHT="205"> 213 <LI>If you don't have a few hundred spam messages, instead of doing step 214 A copy the sample database file to "AGMSBayesianSpam Database" in the <A 215 HREF="file:/boot/home/config/settings/AGMSBayesianSpam/" 216 >/boot/home/config/settings/AGMSBayesianSpam/</A> directory (the MDR 217 installer will do this for you). Due to complaints about the huge file 218 size, the sample spam database that comes with MDR is now very small 219 (10 spam, 10 genuine example messages), so you'll need to train it before 220 it gets accurate (auto-training is your friend). 
Or you could get the 221 huge (976KB, 484 spam, 1009 genuine messages) one from version 2.0.0 Beta 222 8, available at: <A 223 HREF="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/bemaildaemon/AGMSBayesianSpamServer/SampleDatabase?rev=release-2-0-0-beta8" 224 >http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/bemaildaemon/AGMSBayesianSpamServer/SampleDatabase?rev=release-2-0-0-beta8</A> 225 <BR> 226 <IMG SRC="pictures/DatabaseFileLocation.png" 227 ALT="[Database File Location]" WIDTH="609" HEIGHT="234"><BR> 228 Run the AGMSBayesianSpamServer program again. Hit the Purge button 229 (because it doesn't load the database until it has to do something). If 230 things are working correctly, you should see a list of the words in the 231 sample database in the bottom half of the window. An alternative method 232 of picking a database file is to double click on it in Tracker, which is 233 useful if you don't want to type in the full name. 234 </OL> 235 <LI>Quit the AGMSBayesianSpamServer program. Delete all the remaining 236 files you unzipped from the archive (such as the example database, this 237 readme, this documentation, or source code), unless you want to 238 keep them around. You will have to decide where to store them; I can't 239 tell you everything :-). 240 <LI>Start up the E-mail preferences control panel (part of the Mail Daemon 241 Replacement project).<BR> 242 <IMG SRC="pictures/StartingEMailPreferences.png" 243 ALT="[Starting EMail Preferences]" WIDTH="291" HEIGHT="443"><BR> 244 Choose the e-mail account you wish to have checked for spam. 
Then hit the 245 Add Filter button to bring up the menu with the list of filters you can 246 add, and pick AGMSBayesianSpamFilter.<BR> 247 <IMG SRC="pictures/ClickingOnAddFilterShowsList.png" 248 ALT="[Clicking On Add Filter Shows List]" WIDTH="454" HEIGHT="413"><BR> 249 Remember to click on the filter after you have added it to set the settings 250 (though the defaults are useable too).<BR> 251 <IMG SRC="pictures/ClickOnFilterNameToGetSettings.png" 252 ALT="[Click On Filter Name To Get Settings]" WIDTH="454" HEIGHT="413"><BR> 253 Then select the settings you wish. If you installed sound files earlier, 254 you can turn on the sound effects here.<BR> 255 <IMG SRC="pictures/FilterSettings.png" ALT="[Filter Settings]" 256 WIDTH="454" HEIGHT="413"> 257 <LI>Test it. Send yourself some e-mail and see if it gets rated correctly. 258</OL> 259 260 261<H1><A NAME="Usage"></A>Usage</H1> 262 263<H2><A NAME="Reading"></A>Reading E-Mail</H2> 264 265<P>Check for e-mail as usual. If you look at the inbox directory in Tracker, 266you can add an extra column with the E-mail attribute "Spam/Genuine Estimate" 267to see how spammy the messages are. 0.0 means the system thinks the message is 268fully genuine, 1.0 fully spam. But it can be wrong, for things like a friend 269of yours quoting a spam message. For the Chi-squared method (the default), you 270see numbers close to zero for genuine (like 9.750e-13), close to 1 for spam and 271in-between (0.01 to 0.99) if it can't decide. With the Robinson scoring 272method, usually if it is over 0.56 (the best cutoff value depends a bit on your 273database quality, but 0.56 is typical) then it is spam, and the closer it is to 2741.0 the more likely it really is spam. 275 276<P>I sort by spam ratio, and manually throw away the messages that are spammy, 277then I switch the Tracker window back to sorting by thread+date (just a click 278on the appropriate column title does it) and get on with reading the mail. 
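<P>To make the Robinson numbers above a little more concrete, here is a minimal
Python sketch of how per-word spam probabilities could be combined into one
score and compared with the 0.56 cutoff.  It assumes the nth-root-of-the-product
formula from Gary Robinson's write-up; it is not copied from the
AGMSBayesianSpamServer source, and the function names and the sample
probabilities are made up for illustration.

```python
import math


def robinson_score(word_probabilities):
    """Combine per-word spam probabilities (each strictly between 0 and 1)
    into a single score: 0.0 means fully genuine, 1.0 fully spam."""
    n = len(word_probabilities)
    if n == 0:
        return 0.5  # assumption: no words at all means "undecided"
    # The nth roots of the two products, done with logarithms so that long
    # messages don't underflow to zero.
    root_of_spam_product = math.exp(
        sum(math.log(p) for p in word_probabilities) / n)
    root_of_genuine_product = math.exp(
        sum(math.log(1.0 - p) for p in word_probabilities) / n)
    spamminess = 1.0 - root_of_genuine_product
    genuineness = 1.0 - root_of_spam_product
    # Rescale the difference from the -1..+1 range into 0..1.
    s = (spamminess - genuineness) / (spamminess + genuineness)
    return (1.0 + s) / 2.0


def looks_like_spam(score, cutoff=0.56):
    """Apply a cutoff; 0.56 is only the typical value mentioned in the text."""
    return score > cutoff


# Hypothetical per-word probabilities for a fairly spammy message.
example = [0.9, 0.95, 0.8]
print(looks_like_spam(robinson_score(example)))
```

<P>The best cutoff still depends on your own database, so treat 0.56 as a
starting point rather than a constant.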

<P>If you turned on the filter option to modify the subject, you'll see spam
messages with something like [Spam 95%] in front of the subject (I don't use it
because it looks ugly).  The change appears only in the Tracker display of the
Subject; the actual subject inside the message isn't affected, just the
MAIL:subject attribute, which is what the Tracker shows.

<H2><A NAME="Training"></A>Training</H2>

<P><EM>The accuracy is only as good as your database</EM>, so update it with
more example spam and genuine messages.  In particular, if it gets the estimate
wrong, add that message to the database to tell it what it should be doing.  A
quick way to do that is to right click on the e-mail in Tracker, and pick Open
With... AGMSBayesianSpamServer.<BR>
<IMG SRC="pictures/SortingInboxBySpamEstimate.png"
ALT="[Sorting Inbox By Spam Estimate]" WIDTH="831" HEIGHT="361"><BR>
It should start up and ask you if the message is spam or genuine.<BR>
<IMG SRC="pictures/SingleMessageClassificationRequest.png"
ALT="[Single Message Classification Request]" WIDTH="349" HEIGHT="104"><BR>
You can also drag and drop the message into the left third of the word list for
genuine messages, or the right third for spam messages.  Dropping in the middle
third does something else that's mostly harmless and fun.<BR>
<IMG SRC="pictures/DropZones.png" ALT="[Drop Zones]" WIDTH="596"
HEIGHT="204"><BR>

<P>You may also want to train it with all your messages (it gives slightly
better results in the long run than just training on the mistakes).  To make it
easier, turn on the self-training option in the mail filter.  It will compute
the spam ratio of each new mail message, then feed the same message back into
the database as an example of spam or genuine, according to that estimate.
When it gets it wrong, you should manually retrain it with the correct
classification, otherwise the database will get worse and worse and finally
turn into mush.
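<P>The self-training loop and the manual correction can be pictured with a toy
model.  The Python sketch below is only an illustration under simplifying
assumptions: the ToyDatabase class, its crude averaging estimate and the 0.5
cutoff are invented here and are not how AGMSBayesianSpamServer stores or
scores words.  What it does show is the bookkeeping: self-training adds the
message under the guessed class, and a correction removes it from the old class
before adding it to the new one.

```python
from collections import Counter


class ToyDatabase:
    """Hypothetical stand-in for the word database (not the real format)."""

    def __init__(self):
        self.word_counts = {"Spam": Counter(), "Genuine": Counter()}
        self.message_counts = {"Spam": 0, "Genuine": 0}

    def add(self, words, label):
        # Each word counts once per message, however often it appears.
        self.word_counts[label].update(set(words))
        self.message_counts[label] += 1

    def remove(self, words, label):
        self.word_counts[label].subtract(set(words))
        self.message_counts[label] -= 1

    def reclassify(self, words, old_label, new_label):
        """Manual correction: move the message between the two classes."""
        if old_label != new_label:
            self.remove(words, old_label)
            self.add(words, new_label)

    def estimate(self, words):
        """Crude estimate: average spam share of the known words."""
        shares = []
        for word in set(words):
            spam = self._share(word, "Spam")
            genuine = self._share(word, "Genuine")
            if spam + genuine > 0:
                shares.append(spam / (spam + genuine))
        return sum(shares) / len(shares) if shares else 0.5

    def _share(self, word, label):
        total = self.message_counts[label]
        return self.word_counts[label][word] / total if total else 0.0


def self_train(database, words, cutoff=0.5):
    """Score the message, then feed it back in under the guessed class."""
    label = "Spam" if database.estimate(words) > cutoff else "Genuine"
    database.add(words, label)
    return label


db = ToyDatabase()
db.add(["cheap", "pills"], "Spam")
db.add(["lunch", "meeting"], "Genuine")

# A friend quotes a spam message: self-training guesses wrongly.
message = ["cheap", "meeting", "pills"]
guess = self_train(db, message)

# The manual correction described above undoes the wrong guess.
db.reclassify(message, guess, "Genuine")
```

<P>Without the last step the wrong guess stays in the statistics and nudges
later estimates the same wrong way, which is how the database turns into mush.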

<H2><A NAME="HidingServer"></A>Hiding the Server Window</H2>

<P>If you're annoyed by the server window popping up whenever the system checks
for e-mail, you can tell it to hide.  Just click the "Server Mode" checkbox.
Actually, that's now the default since people were complaining about the window
getting in the way.  The disadvantage is that you don't get to see error
messages.  To make it visible again, start up AGMSBayesianSpamServer (possibly
by double clicking on its icon in <A
HREF="file:/boot/home/config/bin/">/boot/home/config/bin/</A>) and bring up the
hidden window by using the deskbar, or use the "Edit Server Settings"
button in the spam filter configuration.<BR>
<IMG SRC="pictures/MakingTheWindowVisibleFromTheDeskbar.png"
ALT="MakingTheWindowVisibleFromTheDeskbar" WIDTH="640" HEIGHT="109"><BR>

<H2><A NAME="AlexSettings"></A>Alex's Settings</H2>

<P>I'm currently using it with these settings: Chi-squared scoring,
AnyTextAndHeader tokenizing, server mode on, ignore previous classification
off, mark subject with [Spam %] off, spam cutoff 0.95, genuine below 0.05, no
words found on, self-training on, close AGMSBayesianSpamServer when Finished
on.  Because of the self training, I always correct it when it gets the
classification wrong (that means I have to delete the messages manually; I
can't use a Match Header filter to do it).  My Tracker window shows the
Classification Group attribute rather than the Spam/Genuine Estimate number
(which isn't pretty when using Chi-squared).

<H1><A NAME="AdvancedUsage"></A>Advanced Usage</H1>

<H2><A NAME="CommandLine"></A>Command Line Mode and Scripting</H2>

<P>Besides the graphical user interface, there
is also a command line mode.  Just type "AGMSBayesianSpamServer help"
in the terminal to get a list of the commands and what they do (the ultimate
documentation).
It also explains all of the mysterious options you see in the
graphical user interface.  The same commands can be used in scripting, either
from some other program or via the "hey" utility which you can get from <A
HREF="http://www.bebits.com/app/2042">http://www.bebits.com/app/2042</A>.  A
useful command, if you have a lot of messages to add, is
"AGMSBayesianSpamServer set genuine *", which will use all messages in the
current directory as examples of genuine text (use "set spam *" the same way
in a directory full of spam).

<PRE>
Sat Feb  8 16:30:51 274 /tmp>AGMSBayesianSpamServer help

AGMSBayesianSpamServer - A Spam Database Server
Copyright &copy; 2002 by Alexander G. M. Smith.  Released to the public domain.

Compiled on Feb  8 2003 at 11:13:28.  $Revision: 1.11 $  $Header:
/cvsroot/bemaildaemon/AGMSBayesianSpamServer/AGMSBayesianSpamServer.cpp,v 1.77
2003/01/22 03:19:48 agmsmith Exp $

This is a program for classifying e-mail messages as spam (junk mail which
you don't want to read) and regular genuine messages.  It can learn what's
spam and what's genuine.  You just give it a bunch of spam messages and a
bunch of non-spam ones.  It uses them to make a list of the words from the
messages with the probability that each word is from a spam message or from
a genuine message.  Later on, it can use those probabilities to classify
new messages as spam or not spam.  If the classifier stops working well
(because the spammers have changed their writing style and vocabulary, or
your regular correspondants are writing like spammers), you can use this
program to update the list of words to identify the new messages
correctly.
374 375The original idea was from Paul Graham's algorithm, which has an excellent 376writeup at: http://www.paulgraham.com/spam.html 377 378Gary Robinson came up with the improved algorithm, which you can read about at: 379http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html 380 381Then he, Tim Peters and the SpamBayes mailing list developed the Chi-Squared 382test, see http://mail.python.org/pipermail/spambayes/2002-October/001036.html 383for one of the earlier messages leading from the central limit theorem to 384the current chi-squared scoring method. 385 386Thanks go to Isaac Yonemoto for providing a better icon. 387 388Usage: Specify the operation as the first argument followed by more 389information as appropriate. The program's configuration will affect the 390actual operation (things like the name of the database file to use, or 391whether it should allow non-email messages to be added). In command line 392mode it will do the operation and exit. In GUI/server mode a command line 393invocation will just send the command to the running server. You can also 394use BeOS scripting (see the "Hey" command which you can get from 395http://www.bebits.com/app/2042 ) to control the Spam server. And finally, 396there's also a GUI interface which shows up if you start it without any 397command line arguments. 398 399Commands: 400 401Quit 402Stop the program. Useful if it's running as a server. 403 404Get DatabaseFile 405Get the pathname of the current database file. The default name is something 406like B_USER_SETTINGS_DIRECTORY / AGMSBayesianSpam / AGMSBayesianSpamServer 407Database 408 409Set DatabaseFile NewValue 410Change the pathname of the database file to use. It will automatically be 411converted to an absolute path name, so make sure the parent directories exist 412before setting it. If it doesn't exist, you'll have to use the create command 413next. 
414 415Create DatabaseFile 416Creates a new empty database, will replace the existing database file too. 417 418Delete DatabaseFile 419Deletes the database file and all backup copies of that file too. Really only 420of use for uninstallers. 421 422Count DatabaseFile 423Returns the number of words in the database. 424 425Set Spam NewValue 426Adds the spam in the given file (specify full pathname to be safe) to the 427database. The words in the files will be added to the list of words in the 428database that identify spam messages. The files processed will also have the 429attribute MAIL:classification added with a value of "Spam" or "Genuine" as 430specified. They also have their spam ratio attribute updated, as if you had 431also used the Evaluate command on them. If they already have the 432MAIL:classification attribute and it matches the new classification then they 433won't get processed (and if it is different, they will get removed from the 434statistics for the old class and added to the statistics for the new one). 435You can turn off that behaviour with the IgnorePreviousClassification 436property. The command line version lets you specify more than one pathname. 437 438Count Spam 439Returns the number of spam messages in the database. 440 441Set SpamString NewValue 442Adds the spam in the given string (assumed to be the text of a whole e-mail 443message, not just a file name) to the database. 444 445Set Genuine NewValue 446Similar to adding spam except that the message file is added to the genuine 447statistics. 448 449Count Genuine 450Returns the number of genuine messages in the database. 451 452Set GenuineString NewValue 453Adds the genuine message in the given string (assumed to be the text of a 454whole e-mail message, not just a file name) to the database. 
455 456Set IgnorePreviousClassification NewValue 457If set to true then the previous classification (which was saved as an 458attribute of the e-mail message file) will be ignored, so that you can add the 459message to the database again. If set to false (the normal case), the 460attribute will be examined, and if the message has already been classified as 461what you claim it is, nothing will be done. If it was misclassified, then the 462message will be removed from the statistics for the old class and added to the 463stats for the new classification you have requested. 464 465Get IgnorePreviousClassification 466Find out the current setting of the flag for ignoring the previously recorded 467classification. 468 469Set ServerMode NewValue 470If set to true then error messages get printed to the standard error stream 471rather than showing up in an alert box. It also starts up with the window 472minimized. 473 474Get ServerMode 475Find out the setting of the server mode flag. 476 477Flush 478Writes out the database file to disk, if it has been updated in memory but 479hasn't been saved to disk. It will automatically get written when the program 480exits, so this command is mostly useful for server mode. 481 482Set PurgeAge NewValue 483Sets the old age limit. Words which haven't been updated since this many 484message additions to the database may be deleted when you do a purge. A good 485value is 1000, meaning that if a word hasn't appeared in the last 1000 486spam/genuine messages, it will be forgotten. Zero will purge all words, 1 487will purge words not in the last message added to the database, 2 will purge 488words not in the last two messages added, and so on. This is mostly useful 489for removing those one time words which are often hunks of binary garbage, not 490real words. This acts in combination with the popularity limit; both 491conditions have to be valid before the word gets deleted. 492 493Get PurgeAge 494Gets the old age limit. 
495 496Set PurgePopularity NewValue 497Sets the popularity limit. Words which aren't this popular may be deleted 498when you do a purge. A good value is 5, which means that the word is safe 499from purging if it has been seen in 6 or more e-mail messages. If it's only 500in 5 or less, then it may get purged. The extreme is zero, where only words 501that haven't been seen in any message are deleted (usually means no words). 502This acts in combination with the old age limit; both conditions have to be 503valid before the word gets deleted. 504 505Get PurgePopularity 506Gets the purge popularity limit. 507 508Purge 509Purges the old obsolete words from the database, if they are old enough 510according to the age limit and also unpopular enough according to the 511popularity limit. 512 513Get Oldest 514Gets the age of the oldest message in the database. It's relative to the 515beginning of time, so you need to do (total messages - age - 1) to see how 516many messages ago it was added. 517 518Set Evaluate NewValue 519Evaluates a given file (by path name) to see if it is spam or not. Returns 520the ratio of spam probability vs genuine probability, 0.0 meaning completely 521genuine, 1.0 for completely spam. Normally you should safely be able to 522consider it as spam if it is over 0.56 for the Robinson scoring method. For 523the ChiSquared method, the numbers are near 0 for genuine, near 1 for spam, 524and anywhere in the middle means it can't decide. The program attaches a 525MAIL:ratio_spam attribute with the ratio as its float32 value to the file. 526Also returns the top few interesting words in "words" and the associated 527per-word probability ratios in "ratios". 528 529Set EvaluateString NewValue 530Like Evaluate, but rather than a file name, the string argument contains the 531entire text of the message to be evaluated. 532 533ResetToDefaults 534Resets all the configuration options to the default values, including the 535database name. 
536 537InstallThings 538Creates indices for the MAIL:classification and MAIL:ratio_spam attributes on 539all volumes which support BeOS queries, identifies them to the system as 540e-mail related attributes (modifies the text/x-email MIME type), and sets up 541the new MIME type (text/x-vnd.agmsmith.spam_probability_database) for the 542database file. Also registers names for the sound effects used by the 543separate filter program (use the installsound BeOS program or the Sounds 544preferences program to associate sound files with the names). 545 546Set TokenizeMode NewValue 547Sets the method used for breaking up the message into words. Use "Whole" for 548the whole file (also use it for non-email files). The file isn't broken into 549parts; the whole thing is converted into words, headers and attachments are 550just more raw data. Well, not quite raw data since it converts 551quoted-printable codes (equals sign followed by hex digits or end of line) to 552the equivalent single characters. "PlainText" breaks the file into MIME 553components and only looks at the ones which are of MIME type text/plain. 554"AnyText" will look for words in all text/* things, including text/html 555attachments. "AllParts" will decode all message components and look for words 556in them, including binary attachments. "JustHeader" will only look for words 557in the message header. "AllPartsAndHeader", "PlainTextAndHeader" and 558"AnyTextAndHeader" will also include the words from the message headers. 559 560Get TokenizeMode 561Gets the method used for breaking up the message into words. 562 563Set ScoringMode NewValue 564Sets the method used for combining the probabilities of individual words into 565an overall score. "Robinson" mode will use Gary Robinson's nth root of the 566product method. It gives a nice range of values between 0 and 1 so you can 567see shades of spaminess. 
The cutoff point between spam and genuine varies 568depending on your database of words (0.56 was one point in some experiments). 569"ChiSquared" mode will use chi-squared statistics to evaluate the difference 570in probabilities that the lists of word ratios are random. The result is very 571close to 0 for genuine and very close to 1 for spam, and near the middle if it 572is uncertain. 573 574Get ScoringMode 575Gets the method used for combining the individual word ratios into an overall 576score. 577 578ProcessArgs: The property specified isn't known or doesn't support the requested action (usually means it is an unknown command), error code $FFFFFFFF/-1 (General OS error) has occured. 579AGMSBayesianSpamServer shutting down... 580Sat Feb 8 16:30:58 275 /tmp> 581</PRE> 582<!-- End the C style comment which makes editing this look bad with BeIDE's syntax colouring. */ --> 583 584<H2><A NAME="Spreadsheet"></A>Using a Spreadsheet to Examine Word Statistics</H2> 585 586<P>Another advanced trick is to load the list of words into Gobe Productive's 587spreadsheet, so that you can find the most popular word or chart the word 588frequencies. Unfortunately it can only handle about 16000 words. To do that, 589start up Gobe Productive, pick Open, then from the file requester's "Document 590Type" menu, pick "Spreadsheet" and then in the submenu pick "Tab-delimited 591text". Then navigate to the database, the default location is "<A 592HREF="file:/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam%20Database" 593>/boot/home/config/settings/AGMSBayesianSpam/AGMSBayesianSpam Database</A>". 594Have fun! 595 596 597<H2><A NAME="WordDisplay"></A>Understanding and Using the Word Display</H2> 598 599<P><IMG ALIGN="RIGHT" SRC="pictures/WordDisplay.png" WIDTH="285" HEIGHT="645" 600ALT="[Narrow Word Display Window]">The word display tells you more than you 601need to know about the words in the database. Those colour bars actually mean 602something. 

<P>Obviously words which are more genuine than spamish show up in <FONT
COLOR="BLUE">blue</FONT>, while spammier words are in <FONT
COLOR="RED">red</FONT>. It's proportionally based on the total message counts
so that a word which shows up in 10% of the genuine messages and 9% of the spam
will show up in blue, even if it was in more spam messages than genuine
messages (this compensates a bit for not training on an equal number of spam
and genuine messages). The length of the bar shows the ratio of the
proportions; further to the left for larger genuine proportions, and similarly
further right for larger spam proportions.

<P>The thickness of the bar shows how many messages the word was found in.
It's kind of a weight, saying how frequently used that word is and thus how
significant it is.

<P>The paleness of the bar shows you how old that word is. A light colour
means that the word was last added to the database long ago. A darker, more
saturated colour means that the word was added more recently, when you added
example messages to the database.

<P>Finally, if you click on the word display, the background will change from a
pale blue tint into solid white, to show you that it is the active keyboard
focus. That means you can type in letters to find a particular word (delay for
one second to start typing the letters for a new word). The arrow keys, page
up/down keys and the mouse scroll wheel also show you different words. Sorry,
there's no scroll bar since finding the Nth word is a slow operation with a set
of words (they aren't numbered); each twitch of the scroll bar would mean going
through the list of tens or even hundreds of thousands of words and counting to
find the scroll position.
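
<P>To make the blue versus red rule concrete, here is a small Python sketch of
the proportion comparison described above. It is only an illustration, not the
program's actual code: the function name <B>word_colour</B>, its arguments and
the choice of showing ties as blue are all made up for this example, and it
assumes both message totals are greater than zero.

<PRE>
```python
def word_colour(genuine_hits, genuine_total, spam_hits, spam_total):
    """Return (colour, genuine_proportion, spam_proportion) for one word.

    Hypothetical helper: compares the fraction of genuine messages that
    contain the word with the fraction of spam messages that contain it.
    """
    genuine_prop = genuine_hits / genuine_total
    spam_prop = spam_hits / spam_total
    # Showing ties as blue is an assumption made for this sketch.
    colour = "red" if spam_prop > genuine_prop else "blue"
    return colour, genuine_prop, spam_prop


# The word is in 10 of 100 genuine messages (10%) and 18 of 200 spam (9%).
colour, genuine_prop, spam_prop = word_colour(10, 100, 18, 200)
print(colour, genuine_prop, spam_prop)  # prints: blue 0.1 0.09
```
</PRE>

<P>The word was found in more spam messages (18) than genuine ones (10), yet it
still comes out blue because the proportions are compared, not the raw counts.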

<BR CLEAR="ALL">

<H2><A NAME="Tokenizing"></A>Tokenizing Modes Compared</H2>

<P>I did some tests with tokenizing different parts of mail messages to see
what would work best.

<P>The Database:
<BR>341 training genuine messages, 406 training spam messages (or 398 when
parsing due to a bug (fixed later on in 2.0.0b5) with messages that don't have
body text).
<BR>40 test genuine messages, 40 test spam messages, all more recent than the
training ones.
<BR>Spam threshold is 0.56, Gary-combining method.

<P>The results:

<TABLE BORDER="2" SUMMARY="[Table showing results of different tokenizing methods]">
<TR><TH>Tokenizing Method
<TH>Genuine Test Details
<TH>Genuine Accuracy
<TH>Spam Test Details
<TH>Spam Accuracy

<TR><TD>Just headers
<TD>Genuine .181352 to .557881, one false positive (a mailbox full announcement).
<TD>2.5% wrong.
<TD>Spam .450602 to .750511, 21 false negatives.
<TD>52.5% wrong.

<TR><TD>Whole raw message text
<TD>Genuine .163027 to .627022, 3 false positives.
<TD>7.5% wrong.
<TD>Spam .509355 to .993985, 1 false negative.
<TD>2.5% wrong.

<TR><TD>Message parsed into parts plus header
<TD>Genuine .168857 to .609005, 4 false positives.
<TD>10% wrong.
<TD>Spam .614564 to .994364, 0 false negatives.
<TD>0% wrong.

<TR><TD>Message parsed into parts, no header data
<TD>Genuine .220161 to .631161, 5 false positives.
<TD>12.5% wrong.
<TD>Spam .592501 to .994444, 0 false negatives.
<TD>0% wrong.

<TR><TD>Any text parts and header
<TD>Genuine .162697 to .614136, 4 false positives.
<TD>10% wrong.
<TD>Spam .614973 to .994362, 0 false negatives.
<TD>0% wrong.

<TR><TD>Any text parts, no headers
<TD>Genuine .221923 to .635487, 6 false positives.
<TD>15% wrong.
<TD>Spam .594271 to .994441, 0 false negatives.
<TD>0% wrong.

<TR><TD>text/plain parts (including body text)
<TD>Genuine .137869 to .583192, 3 false positives.
<TD>7.5% wrong.
<TD>Spam .448059 to .994119, 17 false negatives.
<TD>42.5% wrong.

<TR><TD>Only text/plain sub-parts, no headers.<BR>
150 spam and 1 genuine training message had no words!
<TD>Genuine .219169 to .696899, 9 false positives.
<TD>22.5% wrong.
<TD>Spam .660755 to .994116, 0 false negatives, 27 had no words.
<TD>0% wrong.
</TABLE>

<P>The results look good for the whole message tokenizing method (which also
works on non-email files) and for the all text parts plus header. Since the
text parts method doesn't add lots of garbage words to the database from trying
to find words in binary attachments, it's now the default setting.

<P>The header only method is pretty good too for identifying genuine messages,
and so-so for spam messages. That may make it useable for pre-download tests
(delete some of the spam on the mail server before downloading it, without
worrying about deleting too many genuine messages).


<H2><A NAME="HeadersOnly"></A>High Speed and High Danger - Headers Only Trick</H2>

<P>If you have a slow dial-up connection, you may wish to classify your mail
quickly by deleting spam messages without downloading the entire junk message.

<P><IMG SRC="pictures/ChoosingJustHeaderTokenizingMode.png" WIDTH="920"
HEIGHT="402" ALT="[Choosing JustHeader Tokenizing Mode]">

<P><IMG SRC="pictures/DangerousMatchFilter.png" WIDTH="276" HEIGHT="243"
ALIGN="RIGHT" ALT="[Dangerous Match Filter]">This can be done with three
settings. First switch the AGMSBayesianSpamServer into tokenizing just the
headers.
Then go into the E-mail preferences and add an
AGMSBayesianSpamFilter with the "Add [Spam %] in Front of Subject" option
turned on, and the ratio set to a nice safe high level like 0.95 (so that your
genuine mail is less likely to get deleted, but it will still delete the 1% of
your real mail that looks like spam, which is why this is dangerous). Do not
turn on self-training, since you can't manually correct it. Finally in the
E-mail preferences, add a "Match Header" filter after the spam filter and set
it so that If <B>Subject</B> is <B>\[Spam*</B> then <B>Delete Message</B>.
That's backslash, left square bracket, Spam with the S capitalised, asterisk.
Now it will download the headers, check them against the spam database, and
then delete the spam ones on the server without downloading the rest of their
contents.

<P>You should also make a new spam database trained in Just Headers tokenizing
mode with roughly equal examples of your genuine messages and spam messages (50
of each should be enough to start). A full message database may also work, but
headers only training should be more accurate for headers only decisions. When
testing JustHeader mode, I noticed that the false positive rate (genuine
reported as spam) is nice and low, but the false negative rate (spam reported
as genuine) is high (tested with Robinson scoring, not Chi-Squared scoring).
So this means JustHeader mode will delete maybe half the spam (and download the
rest) and also delete the occasional genuine message.


<H1><A NAME="ChangeLog"></A>Change Log</H1>

<P>The various versions released to the public. These are actually several
accumulated minor changes, which you can see by looking at the log at the top of the
source code files.

<UL>
  <LI>Version 1.77 changed the tokenizing to not convert words to lower case,
  since the case is important for spam!
  Minimize the window before opening it so
  that it doesn't flash on the screen in server mode. Also load the database
  when the window is displayed so that the user can see the words.

  <LI>Version 1.73 added self training support and the Chi-Squared scoring
  method.

  <LI>Version 1.68 nothing significant changed. Just very minor tweaking.

  <LI>Version 1.65 added a time delay for exiting the program. This is so that
  multiple e-mail accounts can simultaneously download mail, without having the
  server close when one of the accounts finishes downloading. Scripting
  requests that come in while it is counting down to quitting time will cancel
  the countdown. In the belt <I>and</I> suspenders department, the filter has
  been enhanced to try starting up the server up to three times.

  <LI>Version 1.60 got rid of the need to use a modified Inbox filter for MDR
  (found out the correct way of setting attributes on a message), added sound
  effects, and added parsing of mail messages (parsing MIME headers, decoding
  base64, quoted-printable and converting character sets to UTF-8 for text, all
  thanks to using the MDR mail kit, which you now need since it uses their
  libmail.so code library). There are now new options for selecting what kind
  of parsing to do (text/plain or text/* or */* attachments, with or without
  headers, etc). Plus sound effect options. The sample database has also been
  updated to use text/* plus headers tokenization, which makes it slightly
  smaller.<!-- End the C style comment which makes editing this look bad with
  BeIDE's syntax colouring. */ -->

  <LI>Version 1.49 switched to Gary Robinson's method for calculating spam
  ratios.
  The overall results are about the same but you have fewer false
  positives and the numbers are spread more evenly between 0.0 and 1.0 than
  with Paul Graham's method (change the E-mail preferences filter setting
  cutoff point to 0.56, adjust as needed). Also, as "jaf" requested, you can
  now drag and drop messages into the word list - drop in the left third to use
  it as an example of genuine messages, right third for spam, and middle third
  to get an evaluation of a message's spaminess. Also a useless command was
  removed. Updated files (replace your existing copies): AGMSBayesianSpam
  Database, AGMSBayesianSpamFilter, AGMSBayesianSpamServer.

  <LI>Version 1.47 was the first public (and working) version. It used Paul
  Graham's algorithm with a few simplifications.
</UL>

<P>Released to the public domain in 2002 by the author, Alexander G. M. Smith.
</BODY>
</HTML>