On May 29th, 2007, at around 4:00 p.m. (Malaysian time), the web server that hosts all the GIDNetwork sites crashed. Every time the support team at the data centre restarted the dedicated server, it crashed, again and again.
It must have been a hardware problem, although I am still not certain which component was actually failing. The log file contained many lines like this:
May 29 05:28:57 yumie kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
May 29 05:28:57 yumie kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
May 29 05:28:57 yumie kernel: ide: failed opcode was: unknown
May 29 05:28:57 yumie kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
May 29 05:28:57 yumie kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
May 29 05:28:57 yumie kernel: ide: failed opcode was: unknown
May 29 05:28:57 yumie kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
May 29 05:28:57 yumie kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
May 29 05:28:57 yumie kernel: ide: failed opcode was: unknown
A quick search online hinted that this could be a failing Hard Disk Drive (HDD), or something as simple as a faulty IDE cable, but we could not determine for sure which it was. So the kind people at WholesaleInternet decided to replace everything, from the HDDs to the processor and mainboard of this server.
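For anyone facing similar messages, a small script can tally how often each error flag shows up per drive, which helps judge how widespread the problem is. This is only a rough sketch: the function name `summarize_ide_errors` and the `SAMPLE` list are made up for illustration, and the pattern is written against the exact line format quoted above, so other kernels may need a different pattern.

```python
import re
from collections import Counter

# Matches kernel IDE error lines in the format quoted above, e.g.
# "... kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }"
# (assumption: other kernel versions may format these lines differently)
LINE_RE = re.compile(
    r"kernel: (?P<dev>hd[a-z]): dma_intr: "
    r"error=(?P<code>0x[0-9a-fA-F]+) \{ (?P<flags>[^}]*)\}"
)

def summarize_ide_errors(lines):
    """Count each (device, error flag) pair found in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # status lines and unrelated messages are skipped
        for flag in match.group("flags").split():
            counts[(match.group("dev"), flag)] += 1
    return counts

# Hypothetical sample: the three distinct lines from the excerpt above.
SAMPLE = [
    "May 29 05:28:57 yumie kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }",
    "May 29 05:28:57 yumie kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }",
    "May 29 05:28:57 yumie kernel: ide: failed opcode was: unknown",
]

if __name__ == "__main__":
    for (dev, flag), n in sorted(summarize_ide_errors(SAMPLE).items()):
        print(f"{dev} {flag}: {n}")
```

In practice the lines would come from the system log file rather than a hard-coded list; its location varies between distributions, so the path is left out here.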
That took a while, and because the most recent backup I had on my own PC here in KL was over a week old, I didn’t want to restore the server from it. I knew there was a current backup on the old HDD and appealed to the support techs to copy the files onto the new HDD. To make a long story short, they could not mount the old hard disk drive no matter what they tried. Much of Wednesday (May 30) was lost waiting for them to get the files moved.
Later that night, I met Darrin Smith online and he was kind enough to offer his help. Within minutes of logging into the server he got the old HDD partition mounted, and I could finally access the files, move the backups to a safe place on the new HDD, proceed with the restore, and bring the sites back online.
I would like to thank all the good people at WholesaleInternet, especially Brian Vlasenko, who stayed up through the night trying to solve the problem, for eventually replacing nearly all the hardware and reloading the operating system and required software for the server on a new HDD.
A special thank you to Darrin for saving us all from losing over a week’s worth of data.
May nothing like this ever happen again…