http://qmail.jms1.net/scripts/qfixq.shtml

qfixq

The qfixq script is something I wrote out of frustration, because there was no way to effectively repair a qmail queue once it became corrupted. I grew tired of having to delete the queue and do "make setup" in the qmail source to re-create it; too many users, myself included, were losing messages.

I started looking at how the queue itself is built inside. By studying the queue directories and the source code of the programs which manipulate the queue, I eventually understood its structure well enough to write a tool to repair a damaged queue, saving whatever could be saved and deleting the rest.


How does the queue become corrupted?

The most common cause of queue corruption is when the filesystem (or "disk partition") containing the /var/qmail/queue directory becomes full. I have seen two things cause this:

I have also seen queue corruption caused by people "playing" in the queue directory and accidentally changing something they shouldn't have... a good rule of thumb is to never even look inside the /var/qmail/queue directory, unless you know exactly what you're doing. Even I don't play around in there- the chances of messing something up are just too great.

And of course there is always the possibility that the physical disk which contains the filesystem may be having hardware issues. If this is happening, you will see other signs of this in your syslogs- things like I/O errors will start to show up. In this case you need to move the data off of the disk as quickly as possible, before it totally dies and your data is gone.
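A minimal sketch of how you might check for the two conditions described above, a full filesystem and failing hardware. The helper name "pct_used", the 90 percent threshold, and the syslog path are assumptions for illustration; none of them come from qmail or qfixq.

```shell
#!/bin/sh
# pct_used: print how full the filesystem holding a path is, as a number 0-100.
# Hypothetical helper for illustration; not part of qmail or qfixq.
pct_used() {
    # df -P prints one header line, then one data line for the filesystem;
    # field 5 is the capacity, e.g. "85%"
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage (the 90 percent threshold is an arbitrary example):
#   [ "$(pct_used /var/qmail/queue)" -ge 90 ] && echo "queue filesystem nearly full"
#
# For hardware trouble, search the syslog for I/O errors. The log path
# varies by system; /var/log/messages is an assumption:
#   grep -i 'i/o error' /var/log/messages
```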


How do we fix a queue?

Overall, the process involves stopping new incoming mail, freeing up enough disk space for qmail-send to handle what's already there, letting it run for a bit (in the hopes that it will at least partially fix itself, once it has enough disk space to store things in), and then shutting it down and running the qfixq script.


Download

File: qfixq
Size: 14,808 bytes
Date: 2009-02-04 01:19:54 +0000
MD5: 20fcda5b3d735fbed4519dba7a957b61
SHA-1: 2268934e609744fea182dd36db4247a8f5900de7
RIPEMD-160: d6dc062e16bcfea39b2327ef0a7806c765e98806
PGP Signature: qfixq.asc
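
Before running the downloaded script as root, it is worth comparing it against the checksums listed above. This is a sketch which assumes the GNU "sha1sum" program is installed; "verify_sha1" is a hypothetical helper name, not something shipped with qfixq.

```shell
#!/bin/sh
# verify_sha1: compare a file's SHA-1 digest against an expected value.
# Hypothetical helper for illustration; assumes sha1sum is installed.
verify_sha1() {
    file=$1
    expected=$2
    # sha1sum prints "<digest>  <filename>"; keep only the digest
    actual=$(sha1sum "$file" | awk '{ print $1 }')
    if [ "$actual" = "$expected" ]; then
        echo "OK"
    else
        echo "MISMATCH"
        return 1
    fi
}

# Usage, with the SHA-1 value listed above:
#   verify_sha1 qfixq 2268934e609744fea182dd36db4247a8f5900de7
```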

Running the qfixq script

Before you run the script itself, you must make sure that any processes which access the queue are not running. Of course this includes qmail-send, but it also includes any SMTP, QMTP, QMQP, or other services which may add messages to the queue.
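
One way to confirm that nothing is still touching the queue is to look for the usual qmail programs in the process list. This is only a sketch: "queue_procs" is a hypothetical helper, and the list of program names is my assumption about a typical installation. If you have other services which inject mail, add them to the pattern.

```shell
#!/bin/sh
# queue_procs: read process names on stdin (one per line) and print those
# which normally read or write the qmail queue.
# Hypothetical helper; the name list is an assumption and may be incomplete.
queue_procs() {
    grep -E '^qmail-(send|smtpd|qmtpd|qmqpd|queue|inject)$'
}

# Usage: prints a warning if any of them is still running
#   if ps -e -o comm= | queue_procs; then
#       echo "stop these processes before running qfixq"
#   fi
```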

The first time you run it, it will scan the queue and find any errors, but it will not actually fix them. I recommend that you run it like this first, just to get an idea of how bad your situation is.

# wget http://qmail.jms1.net/scripts/qfixq
# chmod 700 qfixq
# ./qfixq

After running it the first time to see what it's going to do, you should run it a second time, in "live mode". This will do the same checks, but it will actually correct the errors it finds.

# ./qfixq live

If you know that there's nothing in the queue you want to save, and you would rather just empty the queue, you can use this command instead. This changes the logic to bypass a lot of the checking, and just plain marks every message it finds for deletion- and then it deletes them.

# ./qfixq live empty

After running it once in "live mode", you should run it again (not in live mode) and make sure it doesn't find anything new. If it does, it means that some process on the system is still interacting with the queue, and if that process is "qmail-send" then running the script may have actually done more damage than good. This is why it's so important to make sure that anything relating to qmail is totally shut down before running the script.


Finish the clean-up

After running the script, it should be safe to start your qmail-send and any SMTP services again. However, before you do so, you should check to ensure that you're not about to kill the system by letting it feed back into itself. If the cause of the corruption was a flood of spam, your queue may contain a lot of bogus messages.

The qmail-qstat command will show you how many messages are in the queue, even while qmail-send is not running. If you have a lot of "messages in queue but not yet preprocessed", you should start qmail-send back up (but not any SMTP services yet) and run qmail-qstat every so often, until the "not yet preprocessed" number drops to zero.
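
Rather than re-running qmail-qstat by hand, the waiting can be scripted. The sketch below assumes qmail-qstat lives in /var/qmail/bin and prints a line of the form "messages in queue but not yet preprocessed: N"; the function names and the 30-second interval are hypothetical choices of mine.

```shell
#!/bin/sh
# unprocessed_count: read qmail-qstat output on stdin and print the
# "not yet preprocessed" number.
unprocessed_count() {
    awk -F': ' '/not yet preprocessed/ { print $2 }'
}

# wait_for_preprocessing: poll qmail-qstat until the count reaches zero.
# The path and the default 30-second interval are assumptions.
# Note: if qmail-qstat produces no output, the count is treated as zero.
wait_for_preprocessing() {
    while :; do
        n=$(/var/qmail/bin/qmail-qstat | unprocessed_count)
        [ "${n:-0}" -gt 0 ] || break
        echo "$n message(s) not yet preprocessed"
        sleep "${1:-30}"
    done
}

# Usage: check once a minute
#   wait_for_preprocessing 60
```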

At that point, you may wish to run qbonkns (or qkillns, if qmail-send is not running) to remove any bounce messages which may be in the queue. Especially in the case of spam, the addresses to which the bounces will be addressed are no good- and removing them from the queue is a good way to save time and get your server back on its feet more quickly.

Of course, if your server is under attack and you can see one or two IP addresses showing up over and over again in your SMTP server logs, you may wish to use something like "iptables" to block those IPs, or block them at an upstream router if possible.
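
As a rough sketch of finding the repeat offenders: the function below counts client addresses in an SMTP log. It assumes the service runs under tcpserver and that each connection is logged as "tcpserver: pid N from ADDRESS"; if your logs look different, the pattern will need to change. The function name "top_ips", the log path, and the address 192.0.2.1 are placeholders.

```shell
#!/bin/sh
# top_ips: read an SMTP log on stdin and print the N busiest client
# addresses (default 5), most frequent first, as "count address".
# Assumes tcpserver-style lines: "tcpserver: pid 1234 from 192.0.2.1"
top_ips() {
    awk '/tcpserver: pid [0-9]+ from / { print $NF }' |
        sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# Usage (the log path is an assumption about your setup):
#   top_ips 10 < /var/log/qmail/smtpd/current
#
# Then block an offending address on the SMTP port:
#   iptables -I INPUT -s 192.0.2.1 -p tcp --dport 25 -j DROP
```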

When you are satisfied that the problem is solved and the issue will not repeat itself, turn the qmail-send service and your SMTP, QMTP, and QMQP services back on. Watch the log files to make sure the attack isn't immediately repeating itself- if it is, you may wish to take other measures to keep the server from becoming overloaded again.


Preventing it from happening again

After cleaning up a server with a damaged queue, your next thought should be to determine exactly how and why the queue became damaged, and to prevent it from happening again. Obviously the methods of prevention will depend on what went wrong in the first place. A web page like this cannot cover every possible problem, but I will try to cover the most common causes.

If the damage was caused by the disk simply becoming full, you should take steps to ensure that the disk does not fill up again. This may be as simple as moving some files to a different filesystem or setting up a cron job to delete old log files (the delbut script on my code page may be helpful with this), or it may involve something as drastic as adding a new disk to the machine and moving part or all of "/var/" to the new disk.
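
As an illustration of the cron job idea, here is a sketch using plain "find" (this is not how delbut works; I am only showing the general approach). The helper name "old_logs", the "*.gz" pattern, the 30-day cutoff, and the schedule are all assumptions which you should adjust for your own log rotation.

```shell
#!/bin/sh
# old_logs: list compressed log files under a directory which were last
# modified more than N days ago. Hypothetical helper; the "*.gz" pattern
# assumes your rotated logs are gzip-compressed.
old_logs() {
    find "$1" -type f -name '*.gz' -mtime +"$2" -print
}

# Review the list first, then delete:
#   old_logs /var/log 30
#   find /var/log -type f -name '*.gz' -mtime +30 -exec rm -f {} +
#
# Example crontab entry running the deletion daily at 03:15:
#   15 3 * * * find /var/log -type f -name '*.gz' -mtime +30 -exec rm -f {} +
```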

On many systems "/var" is a single filesystem, and the most common reason for "/var" to fill up is that "/var/log" fills up with old log files. Another problem is that "/var/tmp" is usually a world-writable temporary directory- you may wish to remove this directory and make "/var/tmp" into a symbolic link to the "/tmp" directory, especially if "/tmp" is on a different filesystem.

Note that if you want to move "/var/qmail" to a different partition, you should only do so when the queue is totally empty, or you should plan on running "qfixq" after moving the files but before starting qmail-send. This is because the message numbers within the queue (which are also the filenames it uses) are the inode numbers of the "mess" files, and a file copied to a new filesystem gets a new inode number. Since you can't manually change a file's inode number, qfixq will rename the files to whatever the new inode number of the "mess" file is. This will allow qmail-send to keep processing the messages from the queue, but the change in message numbers will confuse log utilities (such as mtrack) and may confuse anybody who later examines the headers of such a message.
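
To see the inode relationship for yourself, the following read-only sketch lists any "mess" file whose name no longer matches its inode number, which is what you would expect to find after copying a queue to a new filesystem. "mismatched_mess" is a hypothetical helper and is not taken from qfixq; it assumes the message files sit in bucket subdirectories under "mess". It changes nothing on disk.

```shell
#!/bin/sh
# mismatched_mess: print every file under <queue>/mess/<bucket>/ whose
# filename differs from its inode number. Read-only; nothing is renamed.
# Hypothetical helper for illustration, not part of qfixq.
mismatched_mess() {
    queue=${1:-/var/qmail/queue}
    for f in "$queue"/mess/*/*; do
        # skip the unexpanded pattern when there are no files at all
        [ -f "$f" ] || continue
        # "ls -i" prints "<inode> <path>"; keep only the inode
        inode=$(ls -i "$f" | awk '{ print $1 }')
        name=${f##*/}
        [ "$inode" = "$name" ] || echo "$f (inode $inode)"
    done
}

# Usage, with qmail-send stopped:
#   mismatched_mess /var/qmail/queue
```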

If the damage was caused by a specific attack, you should take whatever steps are necessary to prevent a similar attack in the future from being able to cause the damage. This may or may not even be possible, but some steps you may find helpful are:

Again, this is not, and should not be considered, a complete list of steps you can take to prevent your server from being attacked. I am always on the lookout for new ideas; if you have any suggestions for this page, please email me.


2005-08-30 Thanks to Michael Martinell on the qmailrocks list for finding two minor typos... that'll teach me to trust code I wrote at 2.30am...


2005-11-15 I have added some code which should allow you to just plain empty the queue rather than fixing whatever is there. Note that this will still run the code to check and fix the ownership and permissions of the directories within the queue- it will just leave you with an empty queue.

To use the new stuff, you can run "qfixq live empty" instead of just "qfixq live". The message which comes up when you run the script without any arguments (i.e. in FIND mode, where no changes are written to the disk at all) has been changed to explain this option as well.

The same code removed the old "default number of buckets" code- basically, if the script can't run "qmail-showctl" and figure out how many buckets are in use, it will not run.


2009-02-03 I got an email from Jussi Nikula telling me about a typo- the word "Removing" was spelled incorrectly in the "Remving zero-byte file" message. I have fixed the typo (which has no real effect on how the script works) and while I was in there, I changed the license statement within the script from "GPLv2 only" to "GPLv2 or v3".