Prelude.
Inevitably, computer systems fail. When that happens, it is often valuable to take a step back and evaluate the failure scenario and the recovery or response plan. A critical analysis offers one the opportunity to review what was and was not effective in avoiding, detecting, diagnosing, and recovering from a failed system, be it a single hardware system or a network wide authentication scheme.
When dealing with multiple failures, a challenging aspect in analysis is determining an effective ordering for presentation and discussion. When multiple events occur, elements can be listed in chronological order, by system priority, by failure category, alphabetically, and so forth.
Failure Cases.
I have chosen to list failure cases by severity. Most took place during a tight window from late December 2005 through early February 2006.
Security Compromise — Rebecca, Internet facing router Aug, 05
Disk Failure — Sarah, Backup server Dec, 05
NIC Failure — Sarah, Backup server Feb, 06
Disk Failure — Nebula, File server Feb, 06
Disk Failure — Media, laptop based PV system (no recording) Dec, 05
Network Layout.
My local network is centralized primarily around Nebula, which is responsible for providing mail, DNS, NFS, Samba, NTP, Web, apt-proxy, and performance aggregation services to the local network. Closely coupled with Nebula is Sarah, the backup server, which also provides remote syslogging and secondary DNS.
Additionally, the router, Rebecca, provides routing for three subnets, one of which is the Internet and another of which is a private WiFi network which Media is associated with.
All machines are backed up nightly using Dirvish on a per mount point basis from Sarah to a RAID 0 array. All system logs, save Media’s, are sent to Sarah over UDP. Every hour logcheck is run against the logs of each machine in search of anything noteworthy to report. Anything not explicitly excluded is emailed to me by way of Nebula, which runs an IMAP service. Nebula’s data volume is mapped to a hardware RAID 5 array.
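The "report anything not explicitly excluded" approach can be illustrated with a minimal sketch. This is not how logcheck is implemented; the function name, the sample log lines, and the ignore pattern are all made up for illustration.

```python
import re

def unexplained_lines(log_lines, ignore_patterns):
    """Return the log lines that match none of the ignore patterns."""
    compiled = [re.compile(p) for p in ignore_patterns]
    return [line for line in log_lines
            if not any(rx.search(line) for rx in compiled)]

# Hypothetical sample data, not taken from the real logs.
sample_log = [
    "Feb  1 03:00:01 nebula CRON[1234]: (root) CMD (run-parts /etc/cron.hourly)",
    "Feb  1 03:12:44 rebecca sshd[991]: Failed password for root from 192.0.2.7 port 4022 ssh2",
]
ignore = [r"CRON\[\d+\]: \(root\) CMD"]

# Only lines nobody has declared boring end up in the report.
report = unexplained_lines(sample_log, ignore)
```

The appeal of the default-report stance is that new kinds of messages show up without anyone having to anticipate them; the cost is tuning the ignore list until the hourly mail is quiet.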
Security Compromise — Rebecca, Internet facing router
Preventative Measures Taken:
On the Internet facing side, a software firewall is deployed with a default deny policy for both INPUT and FORWARD traffic using Linux Netfilter. Security updates are applied routinely as they become available. No compiler or development libraries are installed.
Detection:
The compromise was detected by blind luck. The attacker attempted to relay mail through the system detailing information on the compromised host, but due to a misconfiguration the mails bounced back to my inbox. I read the mails with interest.
Diagnosis:
Upon quick investigation of the mail, it was obvious that nothing I run sends mail in a foreign language to accounts at yahoo.com. (I had not yet set up site wide logging to a syslog host or hourly exception analysis of log file contents, making the compromise far more difficult to detect.) The attack vector was an automated SSH dictionary attack against the root user account.
Response:
I pulled the plug. Upon further investigation, I decided to restore the system online from the backup server. Comparing the box in its compromised state against its last known good state made it clear where the rootkits and other fun toys had been installed. However, I missed some running processes, and the box was never fully clean until I brought it down for a proper recovery. The lack of a CD-ROM drive or a rescue floppy with support for the Adaptec 2940UW HBA hampered efforts.
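A comparison of the compromised state against the last known good state can be sketched roughly as below. This is an illustration rather than the procedure actually used; the function names are hypothetical, and it assumes both trees (for example a backup image and the suspect filesystem) are readable from a trusted environment.

```python
import hashlib
import os

def snapshot(root):
    """Map each regular file under root (by relative path) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, device nodes and the like.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as fh:
                rel = os.path.relpath(path, root)
                digests[rel] = hashlib.sha256(fh.read()).hexdigest()
    return digests

def compare(good, suspect):
    """Return (added, removed, changed) relative paths between two snapshots."""
    added = sorted(set(suspect) - set(good))
    removed = sorted(set(good) - set(suspect))
    changed = sorted(p for p in set(good) & set(suspect) if good[p] != suspect[p])
    return added, removed, changed
```

A file level comparison like this says nothing about processes already running in memory, which is exactly what was missed here.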
Improvements:
-Disallow root login
-Choose stronger password
-Log to secure loghost
-Change data retention policy on backup server
-Install CD-ROM, build custom recovery floppy with Adaptec support, or move to ATA disk based configuration
Disk Failure — Sarah, Backup server
Preventative Measures Taken:
Absolutely none. Constrained by a limited budget, an additional disk for a RAID 5 configuration simply is not possible.
Detection:
Failure mails from 3Ware’s 3dmd2 daemon on Sarah indicated a failure on port one each time repeated access attempts led to a port timeout. (The loghost setup was still not deployed.)
Diagnosis:
Logging into Sarah and executing dmesg confirmed that the hard disk on port one had given up the ghost. Death was further confirmed by running Seagate’s diagnostic utility from a floppy after plugging the disk into the mainboard’s ATA controller.
Response:
Initially, Seagate’s utility was used to recover the disk. However, the disk failed again in the same manner. The disk was eventually RMA’d to Seagate.
Improvements:
-Switch disk configuration to RAID 5 from RAID 0 (I mean, really…)
-Switch disk configuration to JBOD and split backups
-Switch disk configuration to RAID 1, reduce retention, add liberal excludes
-Don’t trust the vendor repair utility to Do The Right Thing
-Back up Dirvish vault configuration data to another data drive
NIC Failure — Sarah, Backup server
Preventative Measures Taken:
None. Redundant NICs aren’t an option.
Detection:
Blind luck. The loghost can’t report a NIC failure when the NIC is dead.
Diagnosis:
Obvious commands like ping fail. Link renegotiation fails.
Response:
Swap failed network card with a known good card.
Improvements:
-Add second NIC with software failover
-Log to serial console on another machine
-Initiate a clean system shutdown to get my attention
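Since a dead NIC cannot announce itself over the network, any detection has to run locally on Sarah. The sketch below shows only the decision logic, assuming a series of boolean probe results (for instance, whether a ping to the gateway succeeded); the function name and the threshold of three are arbitrary choices for illustration.

```python
def should_alert(probe_results, threshold=3):
    """Return True once `threshold` consecutive probes have failed."""
    streak = 0
    for ok in probe_results:
        # A successful probe resets the streak; a failure extends it.
        streak = 0 if ok else streak + 1
        if streak >= threshold:
            return True
    return False
```

Requiring several consecutive failures avoids reacting to a single dropped packet. What the alert does, whether writing to a serial console or starting a clean shutdown, is left to the caller.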
Disk Failure — Nebula, File server
Preventative Measures Taken:
Backups only. The storage volume is already RAID 5. The OS drive was once in a RAID 1 configuration, but one disk was discarded and never replaced.
Detection:
Given the catastrophic nature of the failure, the first sign was an audible click signaling the drive’s death. Moments later, KMail lost all mail folders being fed via IMAP. NFS hung. A dead system sends no mails.
Diagnosis:
An attempted console login failed, but the messages confirmed the system drive had suffered a horrible death.
Response:
Restore system from backup to another storage device. Recovery was hampered by the fact that I had not included /boot as a mount to back up. All other systems have /boot as part of the root filesystem, but due to my current scheme of backing up on a per mount basis, I missed this. Moreover, a strange LVM2 root configuration was discarded, requiring /etc/fstab to be modified and the partition table to be recreated from scratch.
Improvements:
-Run the system from a RAID 1 array
-Change Dirvish backup configuration to back up per host, not per host and mount
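A missed mount such as /boot could be caught by comparing what is mounted against what is backed up. The following is a rough sketch that assumes input in the /proc/mounts layout (device, mount point, filesystem type, then options); the function name, the sample device names, and the list of skipped filesystem types are illustrative assumptions.

```python
def uncovered_mounts(mounts_text, backed_up,
                     skip_fstypes=("proc", "sysfs", "tmpfs", "devpts", "usbfs")):
    """List mount points that appear in mounts_text but not in backed_up."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        _device, mount_point, fstype = fields[:3]
        if fstype in skip_fstypes:
            continue  # virtual filesystems hold nothing worth backing up
        if mount_point not in backed_up:
            missing.append(mount_point)
    return missing

# Hypothetical mount table resembling a host with a separate /boot.
sample = """\
/dev/mapper/vg0-root / ext3 rw 0 0
/dev/hda1 /boot ext3 rw 0 0
proc /proc proc rw 0 0
/dev/sda1 /data ext3 rw 0 0
"""
missing = uncovered_mounts(sample, {"/", "/data"})
```

With the sample above, the one real filesystem not in the backup set is /boot.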
Disk Failure — Media, laptop based PV system
Preventative Measures Taken:
Backups only.
Detection:
Dirvish mails indicating that the system could not be connected to.
Diagnosis:
The system had hung. The disk refused to spin up.
Response:
I RMA’d the laptop disk drive. The system was unrestorable due to failure of the backup server’s array, noted above. All possible data was recovered from the backup server before taking the backup array offline. Unfortunately, all I recovered for Media were copies of /var and /etc, forcing a recovery by looking at the files in /var/lib/dpkg/info and guessing about installed packages. Essentially, recovery consisted of a bare metal reinstall.
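The guessing can be partly mechanized, because dpkg keeps a file named after each package with a .list suffix in its info directory. Here is a minimal sketch, assuming a recovered copy of that directory is available at some path; the function name is hypothetical.

```python
import os

def packages_from_info_dir(info_dir):
    """Derive candidate package names from the *.list files in a dpkg info directory."""
    suffix = ".list"
    names = set()
    for entry in os.listdir(info_dir):
        if entry.endswith(suffix):
            names.add(entry[:-len(suffix)])
    return sorted(names)
```

The result is a starting point, not a definitive inventory: it may include packages that were removed but not purged, and it records nothing about versions.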
Improvements:
-Mirror Debian packages for old hardware that may disappear from the Internet
Thoughts.
Given the failures above, clearly the most prevalent is hard disk failure. The degree to which the failure is catastrophic varies, as some disks refused to spin up again (Nebula, Media) and another limped along (Sarah) for a time.
What’s more, disk failures are the easiest to prepare for. Given the frequency with which commodity ATA and SATA disks are sold as loss leaders at popular retail outlets, running every system with a software RAID 1 configuration where practical (difficult in laptops) is relatively inexpensive — especially compared to the cost of data recovery services.
Further, a backup solution isn’t much of a solution if it’s based on RAID 0. A RAID 0 configuration is more likely to suffer a failure when taken as a single unit than any individual member of said unit.
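The arithmetic behind that claim is short. Assuming each disk fails independently with the same probability over some period (a simplification, and the 5% figure below is purely hypothetical), a striped array is lost if any member fails, while a two disk mirror is lost only if both do.

```python
def stripe_failure_probability(p_disk, n_disks):
    """Probability that at least one of n independent disks fails (RAID 0 loss)."""
    return 1.0 - (1.0 - p_disk) ** n_disks

def mirror_failure_probability(p_disk):
    """Probability that both disks of a two disk mirror fail (ignores rebuild time)."""
    return p_disk ** 2

p = 0.05  # hypothetical per-disk failure probability, not a measured value
striped = stripe_failure_probability(p, 2)   # roughly double the single disk risk
mirrored = mirror_failure_probability(p)     # far below the single disk risk
```

Under these assumptions, every disk added to a stripe raises the chance of losing the whole array, whereas mirroring drives it down.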
Action Items.
Issues to address to increase robustness in the face of system failures.
-Migrate Nebula to 2 x 200GB ATA RAID 1 setup with either 3Ware or Linux RAID for OS install
-Migrate Sarah backup array to 3 x 300GB hardware RAID 5
-Change Dirvish backup configuration to per host instead of per mount per host
-Add partition metadata using sfdisk to backup
-Add LVM and md metadata to backup along with mount and df output
-Back up dirvish/ directories for each backup vault
-Test restoring backup images to a VMware Server image and automate the process
-Preemptively move to 2 x 40GB software RAID 1 on Faith (workstation)
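The metadata items above could be gathered by a small script run before each backup. The sketch below is one possible shape, not a finished tool: the output file names and the choice of vgdisplay for the LVM information are assumptions, and running it requires root plus the relevant tools on the host.

```python
import os
import subprocess

def metadata_commands(disks):
    """Map an output file name to the command whose output it should hold."""
    commands = {}
    for disk in disks:
        label = disk.strip("/").replace("/", "_")
        commands[f"sfdisk_{label}.txt"] = ["sfdisk", "-d", disk]  # partition table dump
    commands["vgdisplay.txt"] = ["vgdisplay", "-v"]               # LVM layout
    commands["mdadm.txt"] = ["mdadm", "--detail", "--scan"]       # md arrays
    commands["mount.txt"] = ["mount"]
    commands["df.txt"] = ["df", "-P"]
    return commands

def capture(disks, out_dir):
    """Run each command and save its stdout under out_dir.

    Raises FileNotFoundError if a tool is not installed on the host.
    """
    os.makedirs(out_dir, exist_ok=True)
    for filename, argv in metadata_commands(disks).items():
        result = subprocess.run(argv, capture_output=True, text=True)
        with open(os.path.join(out_dir, filename), "w") as fh:
            fh.write(result.stdout)
```

Writing the output somewhere inside a backed up filesystem means the partition and volume layout travels with the data it describes.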