DSPAM is an open-source, content-based email filtering system. The idea is similar to SpamAssassin, in that it "scores" each message passing through the filtering engine. It adds headers to the message which can be used by other programs (such as maildrop) to control the delivery of the message (e.g. send spam to a dedicated spam folder, or delete spam if the score is high). These headers can also be used by MUAs (Mail User Agents, the programs used to read and write email messages, such as Thunderbird) to trigger client-side rules which might do things like showing spam messages in red, or moving spam messages to a different folder (if the server doesn't do this automatically).
I have a client who wants to build a new machine to replace his existing system (which has been running for about five years.) He has a typical small-company server: qmail with the combined patch, vpopmail, qmailadmin, vqadmin, simscan, and dovecot. He also only has a single domain on the machine, with a couple hundred mailboxes. On the new server he wants to use DSPAM to filter the incoming mail.
Until yesterday, I was vaguely aware of DSPAM, but I had never actually set it up or administered it. I'm writing this page to document how I built and deployed it on my own server. I will probably update this page as time goes by and I learn more about it.
DSPAM can be built to use several different back-end storage mechanisms to physically store the data about the messages it has seen and which tokens (i.e. words, sequences of words) should serve to indicate that a given message is spam, or is "ham" (i.e. not spam.) Some of the storage back-ends require other software to be present and running on the machine.
As of the current version (which is dspam-3.10.2 as I write this) the following storage back-ends are available:
The mysql_drv storage back-end requires a running MySQL server. It is thread-safe, which means it can be used in a client-server setup (which is how I'm running it.)
This was the first storage back-end written, and is the one recommended by the authors of DSPAM. It's also the one I'm using on my own server, which means the instructions below are going to use it.
The pgsql_drv storage back-end requires a running PostgreSQL server, version 8.2 or higher. (Unfortunately, the postgresql packages in the CentOS 5 repositories are only 8.1.) It is thread-safe, which means it can be used in a client-server setup.
The hash_drv storage back-end uses files on the disk to store the data. It is thread-safe, which means it can be used in a client-server setup. Because it has no additional dependencies, this is the default storage back-end if you don't specify one when building the software. And because it doesn't rely on an external database service, it runs more quickly than any of the other storage back-ends.
While researching DSPAM, I came across a few reports from people who had their hash files corrupted by system and software crashes. Apparently, one corrupt hash file will make the entire DSPAM system crash, so I tend to be a bit hesitant about using it. If you do use it, be sure to make regular backups of the files, just in case. This thread from the dspam-users mailing list has a few suggestions.
The sqlite3_drv storage back-end requires SQLite 3.x. This driver is not thread-safe, which means you cannot use it for a client-server setup.
The sqlite_drv storage back-end requires SQLite 2.7.7 or higher (but not SQLite 3.x, see above.) This driver is officially deprecated, and the developers have announced that it will be removed from a future version of DSPAM. In other words, do not use this storage back-end.
Before you start, you should choose which storage back-end you will be using. If it requires a database server (i.e. MySQL or PostgreSQL) you should make sure that the server is installed and running, and that you have "root" access to the database server, because part of the setup process will be to create a database and assign permissions to a userid which will be dedicated to DSPAM.
The directions below will show the actual commands needed to set up the MySQL database. If you are using PostgreSQL or SQLite, the documentation which comes in the DSPAM distribution will explain how to set things up.
In your own mailbox (or a mailbox which you can use for testing, without affecting your users) create two IMAP folders. Copy 10-20 typical non-spam messages into one folder. Copy 10-20 spam messages into the other. We will be using these messages for testing the system later on.
When we use these messages later on, I'm going to assume that the folders are called "test-ham" and "test-spam", and that they are direct children of the INBOX folder. If you choose to call them something else, be prepared to adjust the names when we start testing things (below.)
Visit the SourceForge download link in your browser to download the latest DSPAM package. Then upload the file to the server.
However you end up downloading the file, it needs to be on the server, in the home directory of the non-root user you will be using to configure and compile the software.
Before you configure the software, you need to create a userid to run the software, and the "dspam home" directory where it keeps its log files. Depending on which storage back-end you're using, it may also keep the per-user preferences there, or the spam/ham database files.
I created a dedicated "dspam" user for this, and set the user's home directory to "/var/dspam", which is the "dspam home" directory. The commands looked like this:
# useradd -s /sbin/nologin -d /var/dspam -M dspam
# mkdir -m 0700 /var/dspam
# chown dspam:dspam /var/dspam
Once the user and the dspam home directory existed, I expanded the software. I then wrote a "go" script which contains the configure script, because (1) it's easier to check the command before you run it, (2) unless you delete the script, you can refer back to it later to see how you built the software, and (3) when you upgrade to a new version, you can copy the "go" script from the old version and (if necessary) edit the file in order to build the new version.
You may need to change a few things in the configure command line to match your system. The command line shown below is the one I used on my own server (running CentOS 5.8.)
Here's what the procedure looked like for me:
The last command, "ldconfig", rebuilds the cache used by the run-time linker (ld.so) to find shared libraries. Normally, if a Makefile installs any shared libraries it will run this command, but I've seen cases where the developer forgot to add it. It doesn't hurt anything to run it again, so I've gotten into the habit of always running this command after installing anything which includes shared libraries.
Before we can actually run the software, we will need to create a database with the tables used by DSPAM. My server is using the mysql_drv storage back-end, so I had to create a database to hold the data, and a mysql user to access the data.
$ cd ~/dspam-3.10.2/src/tools.mysql_drv
$ mysql -u root -p
Enter password: (Enter your mysql root password.)
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 11
Server version: 5.0.95 Source distribution
Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> CREATE DATABASE dspam ;
Query OK, 1 row affected (0.00 sec)
mysql> \. mysql_objects-4.1.sql
Query OK, 0 rows affected (0.04 sec)
Query OK, 0 rows affected (0.06 sec)
Records: 0 Duplicates: 0 Warnings: 0
Query OK, 0 rows affected (0.06 sec)
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
Query OK, 0 rows affected (0.05 sec)
Records: 0 Duplicates: 0 Warnings: 0
Query OK, 0 rows affected (0.06 sec)
Query OK, 0 rows affected (0.04 sec)
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> \. virtual_users.sql
Query OK, 0 rows affected (0.05 sec)
Query OK, 0 rows affected (0.04 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> GRANT ALL ON dspam.* TO dspam@localhost IDENTIFIED BY 'p4ssw3rd' ;
Query OK, 0 rows affected (0.00 sec)
mysql> \q
2012-08-02 I got an email from David Wadson on the qmail-patch mailing list, reminding me to convert the tables from MyISAM to InnoDB. I vaguely remember reading something about that while I was figuring out how to set up DSPAM, but it didn't "click" because I don't normally use MySQL. I figured I would just get it running, write the web page, and then come back to it later.
I didn't realize just how quickly the tables would grow - in less than a week, my dspam_token_data table has grown to over a million records. As you can see, it took quite a while to convert - and during that time, I had to shut down qmail entirely, so that incoming messages wouldn't try to use DSPAM and have to wait for the conversion to finish before they could be processed and delivered.
Do yourself a favour - DON'T WAIT to do this conversion. The more data that builds up, the longer the conversion will take, and during the time that the conversion is happening, DSPAM will not work because it won't be able to access whichever table happens to be in the middle of being converted at the time.
In addition, creating a few indexes on the dspam_token_data table can, with a few minor changes to the purge-4.1.sql script, greatly increase the speed of the purge process. It's easier and faster to create these indexes before the database has any data. The "ALTER TABLE ... ADD INDEX" queries shown below will do this.
$ mysql -u dspam -p dspam
Enter password: (Enter the "dspam" mysql user's password.)
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 245
Server version: 5.0.95 Source distribution
Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> ALTER TABLE dspam_preferences ENGINE=innodb ;
Query OK, 1 row affected (0.50 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> ALTER TABLE dspam_signature_data ENGINE=innodb ;
Query OK, 475 rows affected (2.04 sec)
Records: 475 Duplicates: 0 Warnings: 0
mysql> ALTER TABLE dspam_stats ENGINE=innodb ;
Query OK, 15 rows affected (0.31 sec)
Records: 15 Duplicates: 0 Warnings: 0
mysql> ALTER TABLE dspam_token_data ENGINE=innodb ;
Query OK, 1175704 rows affected (1 hour 16 min 35.29 sec)
Records: 1175704 Duplicates: 0 Warnings: 0
mysql> ALTER TABLE dspam_virtual_uids ENGINE=innodb ;
Query OK, 15 rows affected (1.13 sec)
Records: 15 Duplicates: 0 Warnings: 0
mysql> ALTER TABLE dspam_token_data ADD INDEX( spam_hits ) ;
Query OK, 1175704 rows affected (52.53 sec)
Records: 1175704 Duplicates: 0 Warnings: 0
mysql> ALTER TABLE dspam_token_data ADD INDEX( innocent_hits ) ;
Query OK, 1175704 rows affected (1 min 10.16 sec)
Records: 1175704 Duplicates: 0 Warnings: 0
mysql> ALTER TABLE dspam_token_data ADD INDEX( last_hit ) ;
Query OK, 1175704 rows affected (1 min 36.51 sec)
Records: 1175704 Duplicates: 0 Warnings: 0
mysql> \q
David has also written his own page about using DSPAM with qmail, which included the queries shown above.
He also included a link to a page showing how to optimize the DSPAM purge process by creating these indexes and modifying the queries in the purge-4.1.sql script to take advantage of them. However, in dspam-3.10.2.tar.gz it looks like the queries in the purge-4.1.sql script have already been re-structured to take advantage of these indexes (i.e. the WHERE clauses have been re-written so that the indexed fields are not within a function call) even though the mysql_objects-4.1.sql file doesn't create the indexes. Strange but true.
I will be setting up the purge script as a cron job on my own server within the next day or two, and will be updating this web page with that information.
2012-08-23 While doing an install on another client's machine, I discovered (the hard way) that the GRANT ALL query needs to be run AFTER the scripts which create the tables. I have adjusted the order of the queries in the example above.
The next step is to set up the dspam.conf file which configures the software. When you ran "make install" above, it created this file for you, but we need to change several things.
The first step is finding the file. If your configure command line did not include a "--sysconfdir" option, you should find this file in the /usr/local/etc directory.
When you find the file, first make a backup copy of the original file so you have something to refer back to:
# cp -a dspam.conf dspam.conf.dist
Then edit the file, using a text editor. I prefer nano, but feel free to use whatever editor you like.
# nano dspam.conf
You should probably go through the entire file and try to become familiar with what's there. Here's a list of the things I ended up changing:
EnablePlusedDetail on
PlusedCharacter -
PlusedUserLowercase on
QuarantineMailbox -quarantine
These options control DSPAM's "plus character", which is the character which separates a mailbox name from an "extension" name. This is the first character of the "conf-break" file in your qmail source code. Most qmail systems use the default value, which is "-".
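To make the idea of a "plus character" more concrete, here is a rough shell sketch of how an address with an extension maps back to a plain mailbox name. This is only an illustration: the strip_detail function is a name made up for this example, it assumes the extension starts at the first occurrence of the character, and it is not DSPAM's actual code.

```shell
# Hypothetical helper (not part of DSPAM): remove the "extension" from
# the local part of an address and lowercase the mailbox name.
# $1 = address, $2 = plus character (defaults to "-", qmail's usual value)
strip_detail() {
    addr=$1
    sep=${2:--}
    local_part=${addr%@*}          # everything before the last "@"
    domain=${addr##*@}             # everything after the last "@"
    base=${local_part%%"$sep"*}    # cut at the FIRST plus character (assumption)
    printf '%s@%s\n' \
        "$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')" \
        "$domain"
}

# strip_detail 'JSmith-lists@domain.xyz'      prints jsmith@domain.xyz
# strip_detail 'user+tag@domain.xyz' '+'      prints user@domain.xyz
```

Note that with "-" as the plus character, a real mailbox name which itself contains a "-" would be cut short by a naive split like this one, which is worth keeping in mind when choosing mailbox names.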
Trust root
Trust dspam
Trust vpopmail
The userids listed here are the only ones which are allowed to use any advanced functions, such as setting the active user or running the DSPAM tools (other programs which can do things like cleaning the database, showing reports of how many messages each user has processed, etc.) On my server, these three users are the only ones which are "trusted". All of the other users which have "Trust" lines in the default file are commented out.
Tokenizer osb
ImprobabilityDrive on
The "Tokenizer" option controls how DSPAM breaks the messages apart into "tokens". The osb tokenizer is recommended by DSPAM's authors, so that's the one I'm using.
The "ImprobabilityDrive on" option causes DSPAM to add an extra header to each message telling what the odds are of the message being other than what DSPAM judged it to be. For example, if DSPAM says that a given message is not spam, the message will also have a header which looks like "X-DSPAM-Improbability: 1 in 12560 chance of being spam". I think it's a cool thing, and it's certainly easier to understand than "X-DSPAM-Confidence: 0.9921" or "X-DSPAM-Probability: 0.0000" (all three of these headers are from the same message.)
Preference "signatureLocation=headers"
This controls where DSPAM adds a "signature" to each message. The default value will add the signature (which looks like "!DSPAM:1,50156948131953006098420!") to the end of every message, in addition to adding an "X-DSPAM-Signature" header. Setting the value shown above will stop it from adding the signature to the bottom of the message, while still adding it to the headers (where most users won't see it, but where DSPAM will find it in case it needs to re-train a message.)
MySQLServer /var/lib/mysql/mysql.sock
MySQLUser dspam
MySQLPass p4ssw3rd
MySQLDb dspam
MySQLCompress true
MySQLReconnect true
MySQLConnectionCache 10
MySQLUIDInSignature on
These options control how DSPAM connects to the MySQL server.
Because a qmail/vpopmail system doesn't use normal system UIDs for each mailbox, DSPAM needs to create virtual UIDs for each user, which it stores in the MySQL database. These UIDs are generated the first time DSPAM processes a message for each user. The "MySQLUIDInSignature on" option adds this virtual UID to the beginning of each signature it generates (in the signature shown above, the "1," is the virtual UID.)
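If you ever need to see which virtual UID a given signature belongs to, the signature can be picked apart by hand. The following is a small sketch, assuming the signature has the "!DSPAM:uid,rest!" shape shown in the example above. The sig_uid function is a hypothetical helper, not something that ships with DSPAM.

```shell
# Hypothetical helper: print the virtual UID embedded in a DSPAM
# signature of the form "!DSPAM:<uid>,<rest>!".
# Returns non-zero if the signature has no "uid," prefix.
sig_uid() {
    sig=$1
    sig=${sig#'!DSPAM:'}    # drop the leading marker
    sig=${sig%'!'}          # drop the trailing "!"
    case $sig in
        *,*) printf '%s\n' "${sig%%,*}" ;;   # text before the first comma
        *)   return 1 ;;                     # no UID in this signature
    esac
}

# sig_uid '!DSPAM:1,50156948131953006098420!'    prints 1
```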
LocalMX 127.0.0.1
LocalMX (as needed)
This option tells DSPAM which server IPs can be ignored when searching the Received: headers to find the IP address which sent a message. If your server has multiple IPs (and remember that "127.0.0.1" counts as an IP address) or if there are other servers which you inherently trust (i.e. mailhubs, front-end filtering devices, other servers in a mail cluster, etc.) you should have multiple LocalMX lines.
ServerPID /var/service/dspam/var/dspam.pid
When DSPAM starts a server process, it writes its PID (Process ID) to a text file so that other processes can find it. This option controls where the file is written. I've moved it to a "var" directory within the daemontools service (the real directory for the daemontools service will be /var/service/dspam) so that the dspam user will have permissions to create, change, and/or remove this file as needed. (The default value, /var/run, is not writable to the dspam user.)
ServerMode dspam
ServerPass.ClientID "p4ssw3rd"
ServerDomainSocketPath /tmp/dspam.sock
ClientHost /tmp/dspam.sock
ClientIdent "p4ssw3rd@ClientID"
These options control how the dspam server listens for client connections, and how the dspamc client finds and connects to the server.
The "ServerDomainSocketPath" and "ClientHost" options need to match. This is where the server listens, and where the client connects, in order to communicate. Because the server is on the same machine with the clients, I'm using a unix socket.
It is also possible to use a TCP socket, if you're setting up a cluster and want the DSPAM scanning to take place on a different machine. Personally, I don't see any advantage in that kind of setup, since the full email message needs to be sent across the socket before the server can scan it. I'm sure that somewhere, somebody has a good reason for doing this, but I don't know what that reason might be.
The "ServerMode" option tells the server what protocol to support from clients. The "dspam" value tells the server to only support the proprietary protocol used between the dspam server and the dspamc client.
The dspam server can also act as an LMTP server (see RFC 2033), if you are able to configure DSPAM to feed into an appropriate local delivery agent, and if you have any programs which will act as an LMTP client. I haven't looked into this in any great detail.
The "ServerPass.ClientID" and "ClientIdent" options are also related. It is possible for a server using TCP to have several valid clients, and you might want different passwords for each client. The server can be configured with several "ServerPass.ClientID" options, one for each client, each with a different ClientID name and a different password.
Each client will have a "ClientIdent" line whose value contains both its password and the ClientID name it uses on the server, in the form "password@ClientID". Even though we only have a single client, the ClientID value still needs to match the ClientID portion of the ServerPass.ClientID line.
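Since a mismatch between these two lines is an easy mistake to make, a quick sanity check can help. The sketch below assumes the values are written in double quotes exactly as in the example above, and that the ClientID name contains no characters which are special in a regular expression. The check_ident function is a hypothetical helper written for this page, not a DSPAM tool.

```shell
# Hypothetical helper: confirm that the ClientIdent line in a dspam.conf
# file agrees with the matching ServerPass.<ClientID> line.
# $1 = path to the dspam.conf file
check_ident() {
    conf=$1
    # ClientIdent "password@ClientID"
    ident=$(sed -n 's/^ClientIdent[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "$conf" | head -n 1)
    [ -n "$ident" ] || { echo "no ClientIdent line" ; return 1 ; }
    pass=${ident%@*}     # part before the last "@"
    id=${ident##*@}      # part after the last "@"
    # ServerPass.<id> "password"
    want=$(sed -n "s/^ServerPass\.${id}[[:space:]]*\"\(.*\)\"[[:space:]]*\$/\1/p" "$conf" | head -n 1)
    if [ -n "$want" ] && [ "$want" = "$pass" ] ; then
        echo "ok: $id"
    else
        echo "mismatch for $id"
        return 1
    fi
}

# check_ident /usr/local/etc/dspam.conf
```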
Remember at the top of the page, when I asked you to put a few spam and ham messages aside? This is where we're going to use some (but not all) of them. We're going to pick a few spam messages, and a few ham messages, and feed them into DSPAM for classification.
Again, these directions are going to assume that the folders are direct children of INBOX, and that they are called "test-ham" and "test-spam".
Start by cd'ing to the physical directory where the spam messages are stored, and listing the files.
# cd ~vpopmail/domains/domain.xyz/userid/Maildir/.test-spam/cur
# ls -1
1343491113.M126703P3740V000000000000CA01I005900A3_0.server.domain.xyz,S=1511:2,
1343500579.M570776P6229V000000000000CA01I005900A2_0.server.domain.xyz,S=6965:2,
1343506628.M396153P7391V000000000000CA01I005900A4_0.server.domain.xyz,S=8393:2,
1343514125.M510854P9684V000000000000CA01I005900A5_0.server.domain.xyz,S=3953:2,
1343522832.M87742P11325V000000000000CA01I005900A6_0.server.domain.xyz,S=39954:2,
1343523432.M594299P11417V000000000000CA01I005900A7_0.server.domain.xyz,S=4002:2,
1343536180.M515867P14022V000000000000CA01I005900A8_0.server.domain.xyz,S=4127:2,
1343544922.M460636P15863V000000000000CA01I005900A9_0.server.domain.xyz,S=3253:2,
1343560401.M161930P12625V000000000000CA01I0061C048_0.server.domain.xyz,S=4790:2,
1343574392.M805375P17565V000000000000CA01I006180C5_0.server.domain.xyz,S=17318:2,
1343620848.M379166P27347V000000000000CA01I006180D5_0.server.domain.xyz,S=13946:2,
1343634463.M188220P30153V000000000000CA01I0023006E_0.server.domain.xyz,S=3033:2,
1343643453.M680065P4748V000000000000CA01I0023004F_0.server.domain.xyz,S=2038:2,
1343646941.M689919P5313V000000000000CA01I00230083_0.server.domain.xyz,S=10693:2,
1343650420.M122150P6013V000000000000CA01I005900AA_0.server.domain.xyz,S=3818:2,
1343654992.M514765P7204V000000000000CA01I005900AB_0.server.domain.xyz,S=4608:2,
1343661333.M803929P9153V000000000000CA01I005900AC_0.server.domain.xyz,S=2634:2,
1343670657.M593803P12608V000000000000CA01I005900AD_0.server.domain.xyz,S=3865:2,
1343673448.M713678P13671V000000000000CA01I0061C04A_0.server.domain.xyz,S=12435:2,
Highlight the first filename and do a COPY. Then ask DSPAM to process it, using a command like this:
# cat '(do a PASTE, the filename should appear)' \
    | dspam --user userid@domain.xyz --deliver=summary
X-DSPAM-Result: user@domain.xyz; result="Innocent"; class="Innocent";
probability=0.0000; confidence=0.80; signature=N/A
DSPAM has a "--classify" option which will examine a message and decide whether it's ham or spam, without modifying the message or using it to further train the user. However, this will not work if DSPAM has never processed at least one message for that user. Therefore, the first time you run a dspam or dspamc command for a particular user, it cannot be a "--classify" command.
If you see output which looks like this, then DSPAM is working correctly. It may not be accurate yet, but that's because it doesn't have any data on what kinds of messages are spam or ham. The accuracy will improve in time - in my own case, after feeding it a corpus (a collection of messages whose spam/ham status is known) of about a thousand spams and about 350 hams, it has become VERY accurate.
This particular message was spam, but it was incorrectly classified as "Innocent" because the user had no training data, and the message itself was one which was written to look like a legitimate message (i.e. it was advertising for a software company, but it was one I had never heard of, trying to sell custom vertical market programs.) We should probably tell DSPAM that this particular message is indeed spam. (This is known as "training".)
# cat '(PASTE)' | dspam --user userid@domain.xyz --class=spam \
    --source=corpus --deliver=summary
X-DSPAM-Result: user@domain.xyz; result="Spam"; class="Spam";
probability=1.0000; confidence=1.00; signature=N/A
When training a message that you know is not spam, you will use the same command, but instead of using "--class=spam" you will use "--class=innocent".
If you ask DSPAM to classify this message again, it will probably still consider it "Innocent", but now it won't be so sure. Compare the "confidence" number to the output from before training the message:
# cat '(PASTE)' | dspam --user userid@domain.xyz --classify
X-DSPAM-Result: userid@domain.xyz; result="Innocent"; class="Innocent";
probability=0.7127; confidence=0.75; signature=N/A
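If you would rather pull individual numbers out of that summary than read them by eye, something like the following sketch will do it. It assumes the summary has the "name=value;" layout shown above, possibly wrapped over two lines. The dspam_field function is a hypothetical helper, not part of DSPAM.

```shell
# Hypothetical helper: read an X-DSPAM-Result summary on stdin and print
# the value of one field (result, class, probability, confidence, signature).
# $1 = field name
dspam_field() {
    # Join wrapped lines, then capture the text after "name=", ignoring
    # any surrounding double quotes.
    tr '\n' ' ' \
        | sed -n "s/.*[; ]$1=\"\{0,1\}\([^\"; ]*\)\"\{0,1\}.*/\1/p"
}

# Example, using the output shown above:
#   cat '(PASTE)' | dspam --user userid@domain.xyz --classify \
#       | dspam_field confidence
# prints 0.75
```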
Also, now that we have trained this particular message, we should probably delete it, so that it doesn't accidentally get trained again.
# rm '(PASTE)'
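Once the train-then-delete steps work for a single message, the same steps can be wrapped in a loop to handle a whole folder. This is a sketch rather than a finished tool: train_folder is a name invented for this example, and the DSPAM variable is only there so that the command can be swapped out (for dspamc, for instance). It trains and removes every file in the directory, so first copy aside any messages you want to keep for later tests.

```shell
# Hypothetical helper: train every message in a Maildir "cur" directory
# as the given class, deleting each message once it has been trained.
# $1 = directory, $2 = class (spam or innocent), $3 = DSPAM user
# Set DSPAM=dspamc to go through the server instead of running dspam.
train_folder() {
    dir=$1 ; class=$2 ; user=$3
    for msg in "$dir"/* ; do
        [ -f "$msg" ] || continue          # skip if the directory is empty
        if ${DSPAM:-dspam} --user "$user" --class="$class" \
               --source=corpus --deliver=summary < "$msg"
        then
            rm -- "$msg"                   # don't train it twice
        else
            echo "training failed, keeping $msg" >&2
        fi
    done
}

# train_folder ~vpopmail/domains/domain.xyz/userid/Maildir/.test-spam/cur \
#     spam userid@domain.xyz
```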
Setting up the dspam service is just like setting up any other daemontools service. The only tricky part is usually writing the "run" scripts, and in this case the scripts were very simple.
We are doing one slightly unusual thing here... We are creating a "var" directory, which will be owned by the dspam user. This will give the dspam server process somewhere to write out the dspam.pid file, without having to open up permissions on the /var/run directory. (If you used a different value for your ServerPID entry in your dspam.conf file, you may not need to do this here.)
# mkdir -m 1755 /var/service/dspam
# cd /var/service/dspam
# mkdir -m 0750 log var
# wget http://qmail.jms1.net/dspam/service-dspam-run
...
# mv service-dspam-run run
# wget http://qmail.jms1.net/dspam/service-dspam-log-run
..
# mv service-dspam-log-run log/run
# chmod 0750 run log/run
# chown dspam:dspam log var
# chown root:dspam run log/run
Here are the download links, sizes, and checksums for the two files:
Before you start the script, check /var/service/dspam/run. The last line is the actual dspam command which will run the daemon process. If it's not already there, add the "--debug" option to the end of this command line in order to see what the process is doing, for every message it processes. (Without this option, you will only see messages when the service starts and stops - there won't be any output showing that messages are being processed.)
Once the directory is set up and the permissions are set, you should be able to test the service by manually running the "run" script. This is pretty much the same thing that daemontools will be doing, but without daemontools being involved.
# cd /var/service/dspam
# ./run
9561: [07/30/2012 17:17:56] dspam_init_driver: initializing lock 0
9561: [07/30/2012 17:17:56] dspam_init_driver: initializing lock 1
9561: [07/30/2012 17:17:56] dspam_init_driver: initializing lock 2
9561: [07/30/2012 17:17:56] dspam_init_driver: initializing lock 3
9561: [07/30/2012 17:17:56] dspam_init_driver: initializing lock 4
9561: [07/30/2012 17:17:56] dspam_init_driver: initializing lock 5
9561: [07/30/2012 17:17:56] dspam_init_driver: initializing lock 6
9561: [07/30/2012 17:17:56] dspam_init_driver: initializing lock 7
9561: [07/30/2012 17:17:56] dspam_init_driver: initializing lock 8
9561: [07/30/2012 17:17:56] dspam_init_driver: initializing lock 9
9561: [07/30/2012 17:17:56] Spawning daemon listener
9561: [07/30/2012 17:17:56] Creating local domain socket /tmp/dspam.sock
You should see output similar to this, and then nothing... there will be no command prompt returned yet. This is because the dspam server process is running in the foreground. The output you see in this window is what it would normally send to the daemontools service's log file.
We can try using the service by using the same commands shown above, however instead of using the dspam command, we'll be using the dspamc command.
In another window, cd to the IMAP mailbox and use dspamc to classify, and possibly train, a message:
# cd ~vpopmail/domains/domain.xyz/userid/Maildir/.test-spam/cur
# ls -1
... (Choose a file, highlight it, and do a COPY.)
# cat '(do a PASTE, the filename should appear)' \
    | dspamc --user userid@domain.xyz --classify
X-DSPAM-Result: user@domain.xyz; result="Innocent"; class="Innocent";
probability=0.0000; confidence=0.80; signature=N/A
# cat '(PASTE)' | dspamc --user userid@domain.xyz --class=spam \
    --source=corpus --deliver=summary
X-DSPAM-Result: user@domain.xyz; result="Spam"; class="Spam";
probability=1.0000; confidence=1.00; signature=N/A
# cat '(PASTE)' | dspamc --user userid@domain.xyz --classify
X-DSPAM-Result: userid@domain.xyz; result="Innocent"; class="Innocent";
probability=0.7992; confidence=0.77; signature=N/A
# rm '(PASTE)' (make sure we don't end up re-training the same
message again in the future.)
As you run each dspamc command, you should see a burst of log messages scrolling by in the window where the server process is running.
When you are satisfied that dspamc is talking to the server, you can stop the server process (click back into the window where it's running and hit CONTROL-C) and finish activating the service.
You may or may not want to remove the "--debug" option from the dspam command line within the run script. Personally, I left mine in there, since multilog automatically trims the log files within the log/main directory.
9561: [07/30/2012 17:46:26] Burton-Bayesian Probability: 0.000145 Samples: 27
9561: [07/30/2012 17:46:26] using Graham factors
9561: [07/30/2012 17:46:26] Result Confidence: 0.48
9561: [07/30/2012 17:46:26] total processing time: 0.08668s
9561: [07/30/2012 17:46:26] libdspam returned probability of 0.999989
9561: [07/30/2012 17:46:26] message result: SPAM
9561: [07/30/2012 17:46:26] checking trusted user list for dspam(526)
^C
9561: [07/30/2012 17:47:09] dspam_shutdown_driver: destroying lock 0
9561: [07/30/2012 17:47:09] dspam_shutdown_driver: destroying lock 1
9561: [07/30/2012 17:47:09] dspam_shutdown_driver: destroying lock 2
9561: [07/30/2012 17:47:09] dspam_shutdown_driver: destroying lock 3
9561: [07/30/2012 17:47:09] dspam_shutdown_driver: destroying lock 4
9561: [07/30/2012 17:47:09] dspam_shutdown_driver: destroying lock 5
9561: [07/30/2012 17:47:09] dspam_shutdown_driver: destroying lock 6
9561: [07/30/2012 17:47:09] dspam_shutdown_driver: destroying lock 7
9561: [07/30/2012 17:47:09] dspam_shutdown_driver: destroying lock 8
9561: [07/30/2012 17:47:09] dspam_shutdown_driver: destroying lock 9
# ln -s /var/service/dspam /service/
# (Wait about ten seconds.)
# svstat /service/dspam /service/dspam/log
/service/dspam: up (pid 17449) 8 seconds
/service/dspam/log: up (pid 17451) 8 seconds
# tail /service/dspam/log/main/current
@40000000501701523584b4dc 17449: [07/30/2012 17:48:56] dspam_init_driver: initializing lock 2
@40000000501701523584bcac 17449: [07/30/2012 17:48:56] dspam_init_driver: initializing lock 3
@40000000501701523584c094 17449: [07/30/2012 17:48:56] dspam_init_driver: initializing lock 4
@40000000501701523584c864 17449: [07/30/2012 17:48:56] dspam_init_driver: initializing lock 5
@40000000501701523584cc4c 17449: [07/30/2012 17:48:56] dspam_init_driver: initializing lock 6
@40000000501701523584ff14 17449: [07/30/2012 17:48:56] dspam_init_driver: initializing lock 7
@4000000050170152358502fc 17449: [07/30/2012 17:48:56] dspam_init_driver: initializing lock 8
@400000005017015235850acc 17449: [07/30/2012 17:48:56] dspam_init_driver: initializing lock 9
@400000005017015236bea7f4 17449: [07/30/2012 17:48:56] Spawning daemon listener
@400000005017015236beb3ac 17449: [07/30/2012 17:48:56] Creating local domain socket /tmp/dspam.sock
Congratulations, your dspam daemontools service is up and running.
You can test the daemontools service by running the same dspamc commands you used when the service was running without daemontools. (This example shows training a ham (i.e. non-spam) message.)
# cd ~vpopmail/domains/domain.xyz/userid/Maildir/.test-ham/cur
# ls -1
... (Choose a file, highlight it, and do a COPY.)
# cat '(do a PASTE, the filename should appear)' \
    | dspamc --user userid@domain.xyz --classify
X-DSPAM-Result: user@domain.xyz; result="Innocent"; class="Innocent";
probability=0.0000; confidence=0.80; signature=N/A
# cat '(PASTE)' | dspamc --user userid@domain.xyz --class=innocent \
    --source=corpus --deliver=summary
X-DSPAM-Result: user@domain.xyz; result="Innocent"; class="Innocent";
probability=1.0000; confidence=1.00; signature=N/A
# cat '(PASTE)' | dspamc --user userid@domain.xyz --classify
X-DSPAM-Result: user@domain.xyz; result="Innocent"; class="Innocent";
probability=0.0000; confidence=0.89; signature=N/A
# rm '(PASTE)' (don't re-train the same message again in
the future.)
This is probably going to be the most complicated part of the whole process, because we're going to be manually changing the commands in the domains' .qmail-default files (and other .qmail-* files, if they exist.) The changes we make will depend on what kind of processing is happening in each file.
When qmail-local processes a local delivery for a vpopmail domain, the dspamc command which it runs needs to read the dspam.conf file to get the socket path and authentication information necessary to reach the dspam service. The delivery runs as userid vpopmail and group vchkpw, but it does not inherit any supplementary groups that the vpopmail user may have been added to. Therefore, we need to change the ownership or permissions of the dspam.conf file so that it's readable by this userid or this group ID.
There are two ways to do this. On my own server, I made the file owned by the dspam user and the vchkpw group. You could also just make the file world-readable, however this is not recommended unless normal (i.e. non-administrative) users have no access to the machine at all.
# chown dspam:vchkpw /usr/local/etc/dspam.conf
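Changing the group only helps if the file's mode actually lets the group read it. The following is a small sketch of that check, not something from my original notes: group_can_read is a made-up helper name, the "stat -c %a" form assumes GNU stat, and the suggested mode of 640 is an assumption about what you want.

```shell
# Hypothetical helper: succeed if an octal mode string such as 640
# includes the group-read bit (040).
group_can_read() {
    [ $(( 0$1 & 040 )) -ne 0 ]
}

conf=/usr/local/etc/dspam.conf
if [ -e "$conf" ]; then
    mode=$(stat -c %a "$conf")          # GNU stat: permissions in octal
    group_can_read "$mode" ||
        echo "$conf is not group-readable; consider: chmod 640 $conf"
fi
```

A more direct test is to try reading the file the way qmail-local would, without supplementary groups, for example "setuidgid vpopmail cat /usr/local/etc/dspam.conf > /dev/null".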
2012-08-23 I forgot to add this to my notes when I originally got DSPAM working on my own machine, so I forgot to add it to this page until just now. I've been working on a server for a new client, and forgetting this step has caused me to spend about four extra hours trying to figure out the problem again. I knew I had seen it before, but I couldn't remember the exact details of how I fixed it on my own server. Eventually I went back to my own server and checked the permissions and ownership on every related file, and when I saw the "dspam:vchkpw" ownership, it all came back to me. Adding the vpopmail user to the dspam group does not work in this case. Derp.
For each domain, you will need to go to that domain's directory. This will almost always be "~vpopmail/domains/domain.xyz", but to be sure you can use the vdominfo command:
# ~vpopmail/bin/vdominfo -d domain.xyz
/home/vpopmail/domains/domain.xyz
# cd /home/vpopmail/domains/domain.xyz
# ls -1 .qmail*
.qmail-default
As you probably know already, the ".qmail-default" file handles incoming mail for any recipient address where a specific ".qmail-user" file doesn't exist. In many cases, this will be the only .qmail-* file in the domain.
Start by looking at the current contents of each file. We will be adding a "dspamc ... --deliver=stdout" command to the beginning of each line (or to certain lines, if the file has multiple lines); the exact form of that addition depends on what the line currently looks like.
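Before touching anything, it is worth copying each file aside. A minimal sketch follows, assuming you want the copies kept in the same directory; backup_qmail_files is a made-up helper name, and the copies are deliberately named "backup.qmail-*" so that they do not themselves start with ".qmail".

```shell
# Hypothetical helper: copy every .qmail* file in a domain directory to
# a timestamped backup in the same directory, and print each backup's name.
# $1 = the domain directory, e.g. /home/vpopmail/domains/domain.xyz
backup_qmail_files() {
    dir=$1
    stamp=$(date +%Y%m%d%H%M%S)
    for f in "$dir"/.qmail*; do
        [ -f "$f" ] || continue                 # no match, or not a regular file
        copy="$dir/backup${f##*/}.$stamp"       # e.g. backup.qmail-default.<stamp>
        cp -p -- "$f" "$copy" && printf '%s\n' "$copy"
    done
}
```

For example, "backup_qmail_files /home/vpopmail/domains/domain.xyz" would print the name of one backup per .qmail-* file it found.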
As explained in the dot-qmail man page, there are five valid types of lines in a .qmail file. If the line...
starts with "#": it is a comment. We will ignore these lines.
starts with "|" (a vertical bar, or "pipe", character): qmail-local will run the specified command, and send the message to that program's STDIN (standard input.)
This is the simplest case. All we need to do is make qmail-local run dspamc, and make dspamc feed the message into whatever the original command was (in many cases the original command will be "vdelivermail".)
OLD line:
| /home/vpopmail/bin/vdelivermail '' bounce-no-mailbox
NEW line:
| /usr/local/bin/dspamc --user "$EXT@$HOST" --deliver=stdout | /home/vpopmail/bin/vdelivermail '' bounce-no-mailbox
starts with "&", or a letter or a digit: qmail-local will forward the message to the specified email address.
In this case, we run the same dspamc command, but then we have to feed that output into a command which forwards the message to the remote recipient. Fortunately, qmail comes with the /var/qmail/bin/forward command, which does exactly this.
OLD line: (may be either of the following)
&recipient@domain.xyz
recipient@domain.xyz
NEW line:
| /usr/local/bin/dspamc --user "$EXT@$HOST" --deliver=stdout | forward recipient@domain.xyz
starts with "/" or ".", and does not end with a "/" character: qmail-local will deliver the message to an "mbox file".
One of the biggest advantages of qmail is being able to use maildirs instead of mbox files. I don't use mbox files myself, and I don't recommend that anybody else does, especially if you're using qmail. The directions shown below are NOT tested, and I'm pretty sure they are not safe. Nothing is preventing more than one process from trying to add messages to the file at the same time, so it is very possible that messages might be lost.
From what Google tells me, the NetBSD version of cat has a "-l" option which can be used to assert an fcntl() exclusive advisory lock on its STDOUT file descriptor. If you're using NetBSD, you can add a "-l" option to the cat command and gain some protection against multiple processes trying to write to the mbox file at the same time. The GNU version of cat (used in almost every Linux distribution) does not have this option, or anything like it.
This example shows how you should be able to add a message to an mbox file:
OLD line:
./user/filename
NEW line:
| /usr/local/bin/dspamc --user "$EXT@$HOST" --deliver=stdout | cat >> ./user/filename
starts with "/" or ".", and ends with a "/" character: qmail-local will deliver the message to a Maildir.
The best way to handle this case is to use a program called safecat, which is designed to write messages to a Maildir using the safety mechanisms built into the Maildir specification.
The program's web site contains directions on how to compile and install the program, but here's a quick run-through:
# wget http://www.jeenyus.net/linux/software/safecat/safecat-1.13.tar.gz
# tar xzf safecat-1.13.tar.gz
# cd safecat-1.13
# make
# make setup check
Once you have safecat installed, you can use it like so:
OLD line:
./user/Maildir/.foldername/
NEW line:
| /usr/local/bin/dspamc --user "$EXT@$HOST" --deliver=stdout | safecat ./user/Maildir/.foldername/tmp ./user/Maildir/.foldername/new || exit 111
Note that some .qmail-* files may contain commands intended to be run in a specific sequence. Be sure you understand exactly what every command in a .qmail-* is doing before you change it. Also, if it isn't obvious, be sure to save backups of the files so that you can quickly restore the previous contents in case of problems.
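The line-by-line rules above can be summarized as a small shell function. This is a sketch for previewing changes, not a tool I actually use: rewrite_qmail_line is a made-up name, it only handles the simple single-command cases described above, and the mbox branch carries the same "untested and probably unsafe" warning as the mbox example.

```shell
# Prefix used in every rewritten line. Single quotes keep $EXT and $HOST
# literal, so they are expanded at delivery time rather than now.
DSPAMC='| /usr/local/bin/dspamc --user "$EXT@$HOST" --deliver=stdout'

# Hypothetical helper: print the DSPAM-enabled form of one .qmail line.
rewrite_qmail_line() {
    line=$1
    case $line in
        ''|'#'*)                        # blank line or comment: leave alone
            printf '%s\n' "$line" ;;
        '|'*)                           # program delivery
            cmd=${line#'|'}
            cmd=${cmd#"${cmd%%[! ]*}"}  # strip spaces after the "|"
            printf '%s | %s\n' "$DSPAMC" "$cmd" ;;
        [/.]*/)                         # Maildir (ends with "/")
            printf '%s | safecat %stmp %snew || exit 111\n' "$DSPAMC" "$line" "$line" ;;
        [/.]*)                          # mbox file (untested, unsafe)
            printf '%s | cat >> %s\n' "$DSPAMC" "$line" ;;
        '&'*)                           # forward, "&address" form
            printf '%s | forward %s\n' "$DSPAMC" "${line#'&'}" ;;
        [A-Za-z0-9]*)                   # forward, bare address
            printf '%s | forward %s\n' "$DSPAMC" "$line" ;;
        *)
            echo "unrecognized line: $line" >&2
            return 1 ;;
    esac
}
```

To preview a whole file without changing it, something like "while IFS= read -r l; do rewrite_qmail_line "$l"; done < .qmail-default" prints the proposed new contents, which you can then review by hand before installing them.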
This is actually fairly simple - send yourself an email, let it be processed by vpopmail and DSPAM, and check the received messages for the headers that DSPAM adds. They will look something like this:
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Mon Jul 30 12:11:12 2012
X-DSPAM-Confidence: 0.9899
X-DSPAM-Improbability: 1 in 9809 chance of being spam
X-DSPAM-Probability: 0.0000
X-DSPAM-Signature: 1,5016b220131959020616053
In addition, if you forgot to set the line Preference "signatureLocation=headers" in your dspam.conf file, you may see a line at the bottom of the message body which looks like this:
!DSPAM:1,5016b220131959020616053!
The signature code in this line should match the X-DSPAM-Signature: header in the same message.
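If you want to check delivered messages from the command line instead of in your mail client, the headers can be pulled out of a Maildir file with a few lines of awk. A sketch, assuming the message uses plain LF line endings (as Maildir files normally do); dspam_header is a made-up helper name.

```shell
# Hypothetical helper: print the value of one X-DSPAM-* header from a
# message on stdin. Only the header block (up to the first blank line)
# is searched. $1 = the part after "X-DSPAM-", e.g. Result or Signature.
dspam_header() {
    awk -v name="X-DSPAM-$1:" '
        /^$/ { exit }                       # blank line: headers are over
        index($0, name) == 1 {              # line starts with the header name
            sub(/^[^:]*:[ \t]*/, "")        # strip "Name: "
            print
            exit
        }
    '
}
```

For example, "dspam_header Result < messagefile" prints just the classification, and "dspam_header Signature < messagefile" prints the signature to compare against a !DSPAM:...! line in the body.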
As DSPAM runs, it builds up a lot of data about the tokens (words, and/or sequences of words) it sees in the messages it processes for each user. In order to keep the database at a manageable size, you need to clean it out periodically.
There are two different procedures involved in cleaning up the database. One is the "dspam_clean -u" command, and the other is (for MySQL) the purge-4.1.sql script.
The purge-4.1.sql script is not installed as part of the normal "make install" process. You will need to manually copy it somewhere that it will be available. I've installed mine in the dspam daemontools service directory:
# cd ~jms1/dspam-3.10.2/src/tools.mysql_drv
# install -m 0644 purge-4.1.sql /var/service/dspam/
Once this is done, you can write a simple script which can be executed via cron which runs both commands. However, providing the dspam mysql user's password without using the command line is a bit tricky. I got around this by creating /var/dspam/.my.cnf like so:
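Such a file is an ordinary MySQL option file. The following is a minimal sketch of creating one, on the assumption that only the password needs to be supplied (the mysql client defaults to the current login name, here dspam, as the MySQL user). The write_my_cnf name is made up for this example and the password is a placeholder, not real file contents.

```shell
# Hypothetical helper: write a MySQL client option file containing only
# a password, readable by its owner alone.
# $1 = file to create, $2 = password
write_my_cnf() {
    (
        umask 077                                   # a new file gets mode 0600
        printf '[client]\npassword=%s\n' "$2" > "$1"
    ) && chmod 600 "$1"                             # also tighten a pre-existing file
}
```

It could be used as "write_my_cnf /var/dspam/.my.cnf 'placeholder-password'", followed by "chown dspam /var/dspam/.my.cnf" so that the dspam user can read it.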
Then, I wrote /var/service/dspam/cron.cleanup with the following contents:
#!/bin/bash

PATH="/usr/local/bin:/usr/bin:/bin"

cd /var/service/dspam

for n in debug messages
do
    if [ -f /var/dspam/log/dspam.$n ]
    then
        echo "===== Trimming /var/dspam/log/dspam.$n ====="
        mv /var/dspam/log/dspam.$n /var/dspam/log/dspam.$n.$( date +%s )
        /usr/local/bin/delbut -3 /var/dspam/log/dspam.$n*
    fi
done

echo "===== Running \"dspam_clean -u\" ====="
dspam_clean -u

echo "===== Running purge-4.1.sql ====="
cat purge-4.1.sql | ( HOME=~dspam setuidgid dspam mysql dspam )
The mysql program looks for "$HOME/.my.cnf" for its configuration. Because "setuidgid" doesn't set the HOME variable to the userid it's changing to, we need to explicitly set the variable in the script, as shown here.
The last piece is to install the script so that cron will run it. On my system I created /etc/cron.d/clean-dspam with the following contents:
MAILTO="badguy1@jms1.net"
38 1 * * 0 root /var/service/dspam/cron.cleanup
Obviously the MAILTO address shown here is not the one I'm actually using. The "badguy" addresses are used on one of the other pages on this site (I believe it was the example of the validrcptto mechanism disconnecting a spammer after ten invalid RCPT commands.) A few days after I created that page, I found that several spammers had harvested the addresses on that page and were sending spam to them. So I added them to my server, as honeypots, so that these spammers would add themselves to my private blacklist. It's been over a year since I did this, and the spammers are STILL using them, and in the process adding their compromised machines to my blacklist.
This runs the script once a week, at 01:38 local time every Sunday.
2012-09-30 A few weeks ago, I got a panicked phone call from one of my users, saying that all of their incoming messages were arriving as empty messages, with no sender, subject, body, or anything. When I looked at the files in their mailbox, each one only had the first two lines of headers, and then nothing. Because I was in the middle of packing everything up to move, I didn't really have time to dig into the problem, so I pulled dspam out of the processing chain and the messages started arriving, albeit without having been processed by dspam.
After I was finished with the move and had time to look at it, I discovered that any messages processed through the dspam daemontools service were being truncated. I spent several hours trying to debug the problem, and ended up throwing my hands in the air in disgust, because dspam doesn't offer a whole lot in the way of useful diagnostic logging.
This morning, after only one cup of coffee, I decided to have another look at it. I started by wiping out and re-loading the mysql database, but that didn't help much... I was able to do two successful "dspam ... --deliver=summary" commands on the command line, but after that it stopped working. Then I tried adding a "--debug" option to the same command, and got this:
# cat 1344346871.M196488P31171.phineas.jms1.net,S=14126:2,Sabe | dspam --debug --user badguy7@jms1.net --class=spam --source=corpus --deliver=summary
Filesize limit exceeded
Exit 153
I did a search through the source code, and it turns out the text "Filesize limit exceeded" does not appear at all. Then I did some Google searching, and found that this is an error message from glibc, which means one of two things: either it's trying to write a file which is too big to be supported by the underlying filesystem (I'm using ext3, so that shouldn't be an issue) or the filesystem is out of space. One quick "df" command later, it turned out that the filesystem where "/var/dspam" was stored was 95% full. Most filesystems reserve a certain amount of space for the root user, so that normal users can't fill up a filesystem... and this is what happened to me. Because dspam runs as non-root, it wasn't able to extend one of its log files any further, so it crashed with this "Filesize limit exceeded" error, but didn't tell me which file it had been trying to write to.
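The "Exit 153" line is worth a second look as well. By shell convention, an exit status above 128 means the process was killed by a signal, and the signal number is the status minus 128. Here that is 25, which on Linux is SIGXFSZ, the signal sent to a process whose write would push a file past a size limit. A small sketch for decoding such a status; signal_from_status is a made-up helper name, and it relies on the shell's built-in "kill -l".

```shell
# Hypothetical helper: if an exit status indicates death by signal,
# print the signal name; otherwise print nothing.
signal_from_status() {
    if [ "$1" -gt 128 ]; then
        kill -l "$1"        # the shell maps 128+N back to signal N's name
    fi
}

signal_from_status 153
```

On a typical Linux system this prints XFSZ.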
When I wrote the "cron.cleanup" script, I forgot about the log files that dspam writes under "/var/dspam/log", and they had filled up the filesystem. I have since added a block which will cut the files every day, and then use my delbut script to delete all but the three newest files.
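If you don't have a delbut script of your own, the same "keep only the newest few" idea can be sketched in a few lines of shell. This is a stand-in written for this page, not the delbut script itself: keep_newest is a made-up name, and it assumes the file names contain no newlines.

```shell
# Hypothetical stand-in for a "delete all but the N newest" tool.
# $1 = number of files to keep; remaining arguments = candidate files.
# Assumes file names contain no newlines.
keep_newest() {
    keep=$1
    shift
    [ "$#" -gt 0 ] || return 0
    # ls -t sorts newest first; tail skips the first $keep names.
    ls -1t -- "$@" | tail -n +"$(( keep + 1 ))" | while IFS= read -r old; do
        rm -f -- "$old"
    done
}
```

For example, "keep_newest 3 /var/dspam/log/dspam.debug.*" would remove everything except the three most recently modified rotated debug logs.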
Aside from the fact that I wiped my database when I didn't need to, everything seems to be working normally now, messages are being processed by dspam, and my client-side filters are handling the messages as expected.