Recovering data from a failing disk
In the spring of 2001, I had a machine running a low-budget RAID array of 225 GB using four IBM 75GXP disks. One morning I found that within a seven-hour period, two of the disks had failed and been taken offline. My n+1 array was not happy with n-1 working disks.
Normally, to recover from a disk failure in a RAID array, you simply remove the dying disk, replace it with a blank one, and resync the array. That was not going to work this time: as soon as the resync reached the failed area of a disk, that disk would be taken offline again. The best approach seemed to be to manually copy each failed disk in its entirety onto a new one.
Unfortunately, doing this with dd(1) can take a very long time. Every bad sector is reread repeatedly, with a timeout of several seconds per retry. Whole areas of the disk are bad, so the next sector is likely to be reread several times as well. It often takes hours to fail to read a few kilobytes of disk. I wanted something more time-efficient, something that would not leave the disk down for days trying to read blocks that were beyond recovery. Also, not knowing the nature of the problem, I wasn’t sure whether the disk was about to fail completely and leave me without whatever I had not yet copied onto the new disk.
I decided to do a sort of binary search to recover as many of the good blocks as possible. I would start copying from the beginning, but whenever a sector was unreadable, I would immediately jump to the middle of the largest untried section. I didn’t know where any bad section ended, but by approaching it from the far side I could be sure of recovering as much useful information as possible right away.
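The search order can be sketched in a few lines of Python. This is an illustration only, not the original source: the function name recovery_order and the read_ok callback are hypothetical, and I am assuming that the same jump to the middle happens when a run reaches the end of its section without an error.

```python
def recovery_order(n_sectors, read_ok):
    """Return (order, bad): sectors in the order attempted, and those that failed.

    read_ok(sector) -> bool is a hypothetical stand-in for the real read routine.
    """
    untried = []              # half-open (start, end) sections not yet attempted
    order, bad = [], set()
    pos, end = 0, n_sectors   # current run: read pos, pos+1, ... up to end
    while True:
        while pos < end:
            order.append(pos)
            if read_ok(pos):
                pos += 1
                continue
            # Unreadable sector: give up on this run immediately and
            # return the rest of the section to the pool for later.
            bad.add(pos)
            if pos + 1 < end:
                untried.append((pos + 1, end))
            break
        if not untried:
            break
        # Jump to the middle of the largest untried section (assumption:
        # this also happens after a run finishes cleanly).
        untried.sort(key=lambda section: section[1] - section[0])
        start, end = untried.pop()
        pos = start + (end - start) // 2
        if pos > start:
            untried.append((start, pos))   # first half waits its turn
    return order, bad


if __name__ == "__main__":
    order, bad = recovery_order(16, lambda s: s not in {4, 5, 6})
    print(order)
    print(sorted(bad))
```

With 16 sectors of which 4, 5 and 6 are unreadable, the sketch reads 0 to 3, fails on 4, jumps to 10 and reads through 15, then reads 7 to 9, and only at the very end spends time failing on 6 and 5.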
For testing, I created two files of 75 sectors (35 kB) to serve as the input and output, and modified the read routine to pretend, at random, that some sectors could not be read. It also output a visual map of the bad disk. This sample output gives a good illustration of how the algorithm works. In this example you can see that each block was tried up to three times, but no block was tried a second time until every block had had at least one chance.
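The retry policy in that test can be mimicked with a small simulation. Again this is only a sketch under my own assumptions, not the original test harness: the name simulate, the failure rate, the seed and the map characters are all made up for illustration, each attempt is assumed to fail independently at random, and sectors are visited in plain ascending order within a pass rather than in the jump-to-the-middle order.

```python
import random


def simulate(n_sectors=75, fail_rate=0.3, max_tries=3, seed=1):
    """Simulate pass-based retries over a disk with randomly failing reads.

    Returns (disk_map, tries, log): a one-character-per-sector map,
    the number of attempts per sector, and the sectors in attempt order.
    """
    rng = random.Random(seed)          # seeded so a run can be repeated
    tries = [0] * n_sectors
    recovered = [False] * n_sectors
    log = []
    for _ in range(max_tries):
        # A new pass only starts once every sector has been attempted in
        # the previous one, so nothing gets a second try too early.
        pending = [s for s in range(n_sectors) if not recovered[s]]
        for s in pending:
            tries[s] += 1
            log.append(s)
            if rng.random() >= fail_rate:   # pretend this read succeeded
                recovered[s] = True
    # '.' marks a recovered sector, 'X' one that failed every attempt.
    disk_map = "".join("." if ok else "X" for ok in recovered)
    return disk_map, tries, log


if __name__ == "__main__":
    disk_map, tries, _ = simulate()
    print(disk_map)
    print("sectors never read:", disk_map.count("X"))
```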
By running this overnight against the real disk I was ultimately able to recover all of my data.
Source code is here.
During this procedure I also wrote raidsbdump, a utility to dump the contents of RAID superblocks.
Epilogue
Two weeks later another of the IBM drives failed. I replaced all four with Maxtors.
Links
- Steve Friedl has some notes on how to retrieve useful data from a disk that is too corrupted to read with normal filesystem tools.
- There is a class action lawsuit against IBM.
- IBM’s new drives are not recommended for full-time use. This marks the end of IBM’s long history as a credible hard disk manufacturer.
- IBM is exiting the hard drive business.