October 25, 2004

There's a lot of it about

It's reassuring in a "we're all victims together" kind of way to see stega having RAID troubles. I've just had 24 hours of "fun" with our (usually rock-solid) server.

Yesterday was, at least in theory, mail server upgrade day.

For some time now we've been limping along with a crippled implementation of Courier, which was most definitely not installed in 'the Debian way'. (Never let a new sysadmin near your kit until you have a dozen written references and oaths signed in blood.) Lately we've seen all sorts of weird problems with it - high server load, deleted emails returning from the dead, emails inconsistently saved in the Sent folder, ridiculously long delays when sending, etc. So the plan was to switch back to the SMTP and IMAP servers we were using before, but to retain some of the features of the new setup that were working satisfactorily (common authentication database, spam and virus filtering).

The first step in this Sunday afternoon adventure was a minor tweak to the authentication database to support connections from the new IMAP server. Yes, I'm being deliberately vague here, and not mentioning specific implementation details. I don't think it'd help our security to tell the whole world how it works ;-)

Anyway, I restarted the authentication database after the change, and found that connections to it were silently failing. I rolled back the change and restarted the database, but connections were still being dropped. I turned up debugging to full, and tried again. No joy. Connections were refused, the daemon died, and there were no error messages to go on.

After quite a lot of time poking and prodding and testing, at a loss for what to do next, I tried rebooting the server.

... and it didn't come back up cleanly.

It looks like, at some point in the past few months, a rogue update of the boot loader fried the disk's boot sector. So first thing this morning, I raced down to London to tend to our sick box, since we couldn't get it back to a workable state via the serial console.

Anyway, the box is up again now, but email and other authentication services are still fried. As soon as I hit Norwich I'll be able to retrieve the required bits and pieces from our backup disk, and hopefully we'll be back in action. Fingers crossed :-/

Posted by savs at October 25, 2004 3:49 PM