The qfixq script is something I wrote out of frustration, because there was no way to effectively repair a qmail queue once it becomes corrupted. I grew tired of having to delete the queue and do "make setup" in the qmail source to re-create it; too many users, myself included, were losing messages.
I started looking at how the queue itself is built inside. By studying the queue directories and the source code of the programs which manipulate the queue, I eventually reached the point where I understood its structure well enough to try and write a tool to repair a queue which had sustained damage, saving whatever could be saved and deleting the rest.
The most common cause of queue corruption is when the filesystem (or "disk partition") containing the /var/qmail/queue directory becomes full. I have seen two things cause this:
Many systems have /var as a separate filesystem, and it fills up over time. The most common cause of this is when /var/log becomes full of old log files. I've also seen /var/spool and /var/cache fill up on some systems for various reasons.
A "mail bomb" attack, where somebody or something sends thousands or millions of messages into your server, quickly enough that the server doesn't have time to process the queue or do any deliveries.
I have also seen queue corruption caused by people "playing" in the queue directory and accidentally changing something they shouldn't have... a good rule of thumb is to never even look inside the /var/qmail/queue directory, unless you know exactly what you're doing. Even I don't play around in there- the chances of messing something up are just too great.
And of course there is always the possibility that the physical disk which contains the filesystem may be having hardware issues. If this is happening, you will see other signs of this in your syslogs- things like I/O errors will start to show up. In this case you need to move the data off of the disk as quickly as possible, before it totally dies and your data is gone.
Overall, the process involves stopping new incoming mail, freeing up enough disk space for qmail-send to handle what's already there, letting it run for a bit (in the hopes that it will at least partially fix itself, once it has enough disk space to store things in), and then shutting it down and running the qfixq script.
The first step in fixing the problem, especially if you have mail coming in from the outside world, is to shut down ALL of your incoming services, whether they are running SMTP, QMTP, QMQP, or any other protocol. Note that if the disk is full, you stand a better chance of not losing messages by NOT shutting down qmail-send at the same time, because there may be messages buffered in RAM which would be lost if you were to totally stop qmail-send.
The details of how to shut down individual services will depend on how qmail is run on your system- if you are using daemontools, you will need to run the svc -d command against all of your SMTP services. The three SMTP services on my own mail server are called /service/qmail-smtp, /service/qmail-smtpssl, and /service/qmail-smtplocal, so I am able to use the command svc -d /service/qmail-smtp* to shut them all down at the same time.
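As a minimal sketch of the daemontools case, the loop below takes down every service whose directory name starts with "qmail-smtp". The function name stop_smtp_services and the DRY_RUN variable are made up for this example, and the /service/qmail-smtp* naming is simply the layout described above; adjust the pattern to match your own service names.

```shell
# Sketch: bring down every daemontools service matching <root>/qmail-smtp*.
# stop_smtp_services and DRY_RUN are hypothetical names for this example.
stop_smtp_services() {
    service_root=${1:-/service}
    for svc_dir in "$service_root"/qmail-smtp*; do
        [ -d "$svc_dir" ] || continue        # skip if the pattern matched nothing
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "svc -d $svc_dir"           # only show what would be run
        else
            svc -d "$svc_dir"                # -d: bring the service down and leave it down
        fi
    done
}
```

Running it once with DRY_RUN=1 shows exactly which services would be stopped. Afterwards, svstat /service/qmail-smtp* should report each of them as "down".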
If you're using "sysvinit" (System V-style init) scripts, hopefully you have separate scripts for qmail-send (the queue manager) and the SMTP services. If so, you should shut down the SMTP services only.
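Otherwise, use a command like "ps ax | grep 'tcpserver.*smtpd'" to find the tcpserver commands which are running your SMTP service(s), and kill those tcpserver processes using the "kill" command. (A bare "ps" usually lists only the processes attached to your own terminal, which will not include tcpserver.)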
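If you would rather not pick the process IDs out by eye, a small filter can do it. This is only a sketch: smtpd_tcpserver_pids is a made-up helper name, and it assumes your ps accepts the POSIX -e and -o options. It reads "PID COMMAND" lines on standard input, so it does not match the grep or awk process itself.

```shell
# Sketch: print the PIDs of tcpserver processes whose command line mentions smtpd.
# Input: one "PID COMMAND ARGS..." line per process on stdin.
smtpd_tcpserver_pids() {
    awk '$2 ~ /tcpserver$/ && /smtpd/ { print $1 }'
}

# Typical use:
#   ps -e -o pid= -o args= | smtpd_tcpserver_pids          # look first
#   kill $(ps -e -o pid= -o args= | smtpd_tcpserver_pids)  # then kill them
```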
Once you have stopped the incoming messages from continuing to add to the problem, you need to free up some disk space on the filesystem which contains /var/qmail/queue. This may mean moving or deleting old log files from /var/log, deleting print jobs from /var/spool, removing RPM files from /var/cache, or moving or deleting other files, depending on your system.
The df -k command will show you how much disk space is used and available on your filesystems. Most filesystems will stop allowing non-root users to allocate disk blocks (i.e. create or grow files) when the filesystem reaches 95% utilization. If the filesystem containing /var/qmail/queue is at or above 95%, you need to remove enough files to bring the number back below 95%.
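The check can be scripted along these lines. The function names are invented for this sketch, and the 95% threshold is just the figure mentioned above; your filesystem may reserve a different amount.

```shell
# Sketch: report how full the filesystem holding a given path is.
fs_use_percent() {
    # -P forces one line per filesystem; column 5 is the "Capacity" percentage
    df -kP "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds if the filesystem holding $1 is at or above $2 percent (default 95).
queue_fs_is_full() {
    pct=$(fs_use_percent "${1:-/var/qmail/queue}")
    [ -n "$pct" ] || return 2        # df failed, e.g. the path does not exist
    [ "$pct" -ge "${2:-95}" ]
}

# Typical use:
#   queue_fs_is_full /var/qmail/queue && echo "free up some space first"
```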
Again, the most obvious places to look include /var/log, /var/spool, and /var/cache. The du command can show you how much disk space is being used by a particular directory and its descendants.
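For example, something like this lists the directories you name, largest first. The helper name largest_dirs is made up for this sketch.

```shell
# Sketch: show the total size (in KB) of each directory given, largest first.
largest_dirs() {
    # -s: one total per argument, -k: report sizes in kilobytes
    du -sk "$@" 2>/dev/null | sort -rn
}

# Typical use:
#   largest_dirs /var/log /var/spool /var/cache
```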
As you delete files, qmail-send will immediately start using the new space to store the message files it's been trying to save. Note that it will only keep trying each message for a short time, so the faster you can free up the disk space, the better your chances of not losing any important messages.
Once qmail-send is through writing any partial message fragments to the queue, it should be stopped as well. Again the procedure to do this depends on how your services are running- whether using daemontools, sysvinit scripts, or some other method. The idea is that the "qmail-send" process, as well as any other process on the machine which involves qmail, should be stopped.
Note that this does not necessarily have to include any POP3 or IMAP services, because these services do not directly involve the queue. However, you may find it easier to shut these down as well, so you can just tell users "the server is down" rather than "you can read mail but not send it".
Date: 2009-02-04 01:19:54 +0000
Before you run the script itself, you must make sure that any processes which access the queue are not running. Of course this includes qmail-send, but it also includes any SMTP, QMTP, QMQP, or other services which may add messages to the queue.
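One way to double-check is to look for any running process whose name starts with "qmail-". The sketch below does that, skipping the POP3 programs since they do not touch the queue. The function name is invented for this example. Note also that an idle SMTP listener shows up in the process list as "tcpserver" rather than "qmail-smtpd", so those need to be checked separately.

```shell
# Sketch: from "PID COMMAND" lines on stdin, keep the qmail programs
# (other than the POP3 ones) that are still running.
qmail_queue_processes() {
    awk '$NF ~ /(^|\/)qmail-/ && $NF !~ /qmail-pop/'
}

# Typical use (assumes a ps that accepts the POSIX -e and -o options):
#   if ps -e -o pid= -o comm= | qmail_queue_processes | grep -q . ; then
#       echo "qmail is still running, do not run qfixq yet"
#   fi
```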
The first time you run it, it will scan the queue and find any errors, but it will not actually fix them. I recommend that you run it like this first, just to get an idea of how bad your situation is.
# wget http://qmail.jms1.net/scripts/qfixq
# chmod 700 qfixq
# ./qfixq
After running it the first time to see what it's going to do, you should run it a second time, in "live mode". This will do the same checks, but it will actually correct the errors it finds.
# ./qfixq live
If you know that there's nothing in the queue you want to save, and you would rather just empty the queue, you can use this command instead. This changes the logic to bypass a lot of the checking, and just plain marks every message it finds for deletion- and then it deletes them.
# ./qfixq live empty
After running it once in "live mode", you should run it again (not in live mode) and make sure it doesn't find anything new. If it does, it means that some process on the system is still interacting with the queue, and if that process is "qmail-send" then running the script may have actually done more damage than good. This is why it's so important to make sure that anything relating to qmail is totally shut down before running the script.
After running the script, it should be safe to start your qmail-send and any SMTP services again. However, before you do so, you should check to ensure that you're not about to kill the system by letting it feed back into itself. If the cause of the corruption was a flood of spam, your queue may contain a lot of bogus messages.
The qmail-qstat command will show you how many messages are in the queue, even while qmail-send is not running. If you have a lot of "messages in queue but not yet preprocessed", you should start qmail-send back up (but not any SMTP services yet) and run qmail-qstat every so often, until the "not yet preprocessed" number drops to zero.
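Rather than running qmail-qstat by hand, the waiting can be left to a small loop. This is a sketch only: the function names and the QSTAT variable are made up here, and the default path assumes qmail is installed in /var/qmail.

```shell
# Sketch: pull the "not yet preprocessed" number out of qmail-qstat output.
unprocessed_count() {
    awk -F': ' '/not yet preprocessed/ { print $2 }'
}

# Poll every $1 seconds (default 30) until that number reaches zero.
wait_for_preprocessing() {
    while :; do
        n=$("${QSTAT:-/var/qmail/bin/qmail-qstat}" | unprocessed_count)
        [ -n "$n" ] || return 1      # qmail-qstat gave no usable output
        [ "$n" -eq 0 ] && return 0
        sleep "${1:-30}"
    done
}
```

Once wait_for_preprocessing returns successfully, the queue has been fully preprocessed and you can move on to the next step.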
At that point, you may wish to run qbonkns (or qkillns, if qmail-send is not running) to remove any bounce messages which may be in the queue. Especially in the case of spam, the addresses to which the bounces will be addressed are no good- and removing them from the queue is a good way to save time and get your server back on its feet more quickly.
Of course, if your server is under attack and you can see one or two IP addresses showing up over and over again in your SMTP server logs, you may wish to use something like "iptables" to block those IP addresses, or block them at an upstream router if possible.
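A rough sketch of that idea follows. It assumes your SMTP service is run by tcpserver with the -v option, so that each connection is logged as a line ending in "tcpserver: pid NNN from IP"; if your logs look different, the pattern will need to change. Both function names are invented for this example, and the second one only prints the iptables commands so you can review them before running anything.

```shell
# Sketch: count connections per client IP from tcpserver log lines on stdin,
# busiest first. $1 is how many addresses to show (default 10).
top_smtp_clients() {
    awk '/tcpserver: pid [0-9]+ from / { print $NF }' |
        sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Print (not run) an iptables rule that drops SMTP traffic from each
# address read on stdin.
block_commands() {
    while read -r ip; do
        echo "iptables -I INPUT -s $ip -p tcp --dport 25 -j DROP"
    done
}

# Typical use (the log path is an assumption, use your own):
#   cat /service/qmail-smtp/log/main/current | top_smtp_clients 5
#   echo 192.0.2.7 | block_commands
```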
When you are satisfied that the problem is solved and the issue will not repeat itself, turn the qmail-send service and your SMTP, QMTP, and QMQP services back on. Watch the log files to make sure the attack isn't immediately repeating itself- if it is, you may wish to take other measures to keep the server from becoming overloaded again.
After cleaning up a server with a damaged queue, your next thought should be to determine exactly how and why the queue became damaged in the first place, and preventing it from happening again. Obviously the methods of prevention will depend on what went wrong in the first place. A web page like this cannot cover every possible problem, but I will try to cover the most common causes.
If the damage was caused by the disk simply becoming full, you should take steps to ensure that the disk does not fill up again. This may involve something as simple as moving some files to a different filesystem, it may involve setting up a cron job to delete old log files (the delbut script on my code page may be helpful with this), or it may involve something as drastic as adding a new disk to the machine and moving part or all of "/var/" to the new disk.
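As an illustration of the cron job idea (this is not the delbut script, just a plain "find" sketch with a made-up function name), the following lists files which have not been modified in more than a given number of days. It only prints the names, so you can review the list before deleting anything.

```shell
# Sketch: list regular files under $1 (default /var/log) that were last
# modified more than $2 days ago (default 30). Nothing is deleted.
old_logs() {
    find "${1:-/var/log}" -type f -mtime +"${2:-30}" -print
}

# Typical use:
#   old_logs /var/log 30
# Once you trust the list, the same find expression with "-exec rm {} \;"
# in place of "-print" could be run from a nightly cron job.
```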
On many systems "/var" is a single filesystem, and the most common reason for "/var" to fill up is that "/var/log" fills up with old log files. Another problem is that "/var/tmp" is usually a world-writable temporary directory- you may wish to remove this directory and make "/var/tmp" into a symbolic link to the "/tmp" directory, especially if "/tmp" is on a different filesystem.
Note that if you want to move "/var/qmail" to a different partition, you should only do so when the queue is totally empty, or you should plan on running "qfixq" after moving the files but before starting qmail-send. This is because the message numbers within the queue, the filenames that it uses, are also the inode numbers of the "mess" files. Since you can't manually change a file's inode number, qfixq will rename the files to whatever the new inode number of the "mess" file is. This will allow qmail-send to keep processing the messages from the queue, but the change in message numbers will confuse log utilities (such as mtrack) and may confuse anybody who later examines the headers of such a message.
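To see the relationship for yourself, you can compare a file's name with its inode number. The sketch below does this for a single file; the function names are invented for this example, and it only reads, it does not rename anything (that part is qfixq's job).

```shell
# Sketch: print the inode number of a file.
inode_of() {
    ls -i "$1" | awk '{ print $1 }'
}

# Succeeds if the file's name is the same as its inode number, which is
# what qmail expects of the files under queue/mess.
name_matches_inode() {
    [ "$(basename "$1")" = "$(inode_of "$1")" ]
}

# Typical use, with qmail stopped:
#   find /var/qmail/queue/mess -type f | while read -r f; do
#       name_matches_inode "$f" || echo "mismatch: $f"
#   done
```

After copying the queue to a new partition, every "mess" file would be expected to show up as a mismatch until qfixq has been run.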
If the damage was caused by a specific attack, you should take whatever steps are necessary to prevent a similar attack in the future from being able to cause the damage. This may or may not even be possible, but some steps you may find helpful are:
Adding an appropriate set of patches to your qmail installation which enable you to perform different types of anti-spam checks on incoming mail. A patch which rejects mail sent to non-existent addresses has proven to be very useful in reducing the amount of spam I receive on my own server. My qmail patches page has information about the patches I'm using on my own server.
Using RBL's (realtime blacklists) can prevent your server from accepting messages from IP addresses which are known to be owned by spammers, or known to be zombies or open relays. The ucspi-tcp package includes a program called rblsmtpd, which will intercept connections from IP addresses which are listed on one or more RBL's and present them with a connection that seems to speak SMTP but actually returns nothing but error messages... and then forcibly hangs up on the client after a time (usually one minute) if the client didn't hang up on their own.
Performing SPF and/or domainkeys checks on your incoming mail can detect forged messages, which can then be rejected by the SMTP server.
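To illustrate how the RBL lookups mentioned above work: the client's IP address is reversed, octet by octet, and the RBL's zone name is added to the end. If that name exists in DNS, the address is listed. The sketch below builds the name; rbl_query_name is a made-up helper and rbl.example.org is a placeholder, not a real blacklist.

```shell
# Sketch: build the DNS name that an RBL check looks up for an IPv4 address.
# $1 = client IP address, $2 = RBL zone (placeholder in the examples).
rbl_query_name() {
    echo "$1" | awk -F. -v zone="$2" \
        'NF == 4 { print $4 "." $3 "." $2 "." $1 "." zone }'
}

# Typical use:
#   rbl_query_name 192.0.2.1 rbl.example.org
# rblsmtpd does this lookup for you when it is started with "-r zone"
# in front of qmail-smtpd in the tcpserver command line.
```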
Again, this is not, and should not be considered, a complete list of the steps you can take to prevent your server from being attacked. I am always on the lookout for new ideas, so if you have any suggestions for this page, please email me.
2005-08-30 Thanks to Michael Martinell on the qmailrocks list for finding two minor typos... that'll teach me to trust code I wrote at 2.30am...
2005-11-15 I have added some code which should allow you to just plain empty the queue rather than fixing whatever is there. Note that this will still run the code to check and fix the ownership and permissions of the directories within the queue- it will just leave you with an empty queue.
To use the new stuff, you can run qfixq live empty instead of just "qfixq live". The message which comes up when you run the script without any arguments (i.e. in FIND mode, where no changes are written to the disk at all) has been changed to explain this option as well.
The same code removed the old "default number of buckets" code- basically, if the script can't run "qmail-showctl" and figure out how many buckets are in use, it will not run.
2009-02-03 I got an email from Jussi Nikula telling me about a typo- the word "Removing" was spelled incorrectly in the "Remving zero-byte file" message. I have fixed the typo (which has no real effect on how the script works) and while I was in there, I changed the license statement within the script from "GPLv2 only" to "GPLv2 or v3".