Recovering data from a failing disk

In the spring of 2001, I had a machine running a low-budget RAID array of 225 GB using four IBM 75GXP disks. One morning I found that within a seven-hour period, two of the disks had failed and been taken offline. My n+1 array was not happy with n-1 working disks.

Normally, to recover from a disk failure in a RAID array, you simply remove the dying disk, replace it with a blank one, and resync the array. That was not going to work this time: when the resync reached the failed area of the disk, it would take that disk offline again. The best approach seemed to be to manually copy the entire failed disk onto a new one.

Unfortunately, doing this with dd(1) can take a very long time. Every bad sector is reread repeatedly after a timeout of several seconds per retry. Whole areas of the disk are bad, so the next sector is likely to be reread several times as well. It often takes hours to fail to read a few kilobytes of disk; I wanted something more time-efficient, something that would not leave the disk down for days trying to read blocks that were beyond recovery. Also, not knowing the nature of the problem, I wasn’t sure whether the disk was about to fail completely and leave me without whatever I had not yet copied onto the new disk.

I decided to do a sort of binary search to recover as many of the good blocks as possible. I would start copying from the beginning, but whenever a sector was unreadable, I would immediately jump to the middle of the largest untried section. I didn’t know where any bad section ended, but by approaching it from the far end I would be sure to recover as much useful information as possible right away.
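The visiting order described above can be sketched roughly as follows. This is an illustration in Python, not the original source: the names scan and read_sector are hypothetical, read_sector is assumed to return None for an unreadable sector, and the choice to also jump to the middle of the largest untried section when a run reaches its end without an error is my assumption.

```python
from typing import Callable, Dict, List, Optional, Tuple


def scan(read_sector: Callable[[int], Optional[bytes]], n_sectors: int):
    """Try every sector once, jumping away from bad areas.

    read_sector(n) is a hypothetical callback that returns the sector's
    bytes, or None if the sector could not be read.
    Returns (recovered, failed, order).
    """
    # Half-open ranges [lo, hi) of sectors that have not been attempted yet.
    untried: List[Tuple[int, int]] = [(0, n_sectors)] if n_sectors > 0 else []
    recovered: Dict[int, bytes] = {}
    failed: List[int] = []
    order: List[int] = []      # sectors in the order they were attempted
    from_start = True          # only the very first run begins at its range start

    while untried:
        # Pick the largest untried section.
        lo, hi = max(untried, key=lambda r: r[1] - r[0])
        untried.remove((lo, hi))

        # Jump to its middle; the lower half stays untried for later.
        pos = lo if from_start else lo + (hi - lo) // 2
        from_start = False
        if pos > lo:
            untried.append((lo, pos))

        # Copy forward until the section ends or a sector fails.
        while pos < hi:
            order.append(pos)
            data = read_sector(pos)
            if data is None:
                failed.append(pos)
                if pos + 1 < hi:
                    # Whatever follows the bad sector goes back into the pool.
                    untried.append((pos + 1, hi))
                break
            recovered[pos] = data
            pos += 1

    return recovered, failed, order
```

With 16 sectors of which 3, 4 and 5 are bad, this sketch reads 0 to 2, hits sector 3, jumps to sector 10 and reads to the end, and only then works its way back toward the bad area. Each sector is attempted exactly once per call; retries would be a separate pass over the failed list.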

For testing, I created two files of 75 sectors (about 37.5 KiB at 512 bytes per sector) to serve as the input and output, and modified the read routine to pretend at random that sectors had failed to read. It also output a visual map of the bad disk. This sample output gives a good illustration of how the algorithm works. In this example you see that each block was retried up to three times, but no block is tried for a second time until every block has had at least one chance.
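A test harness along these lines might look like the sketch below. It is not the original program: SECTOR, make_flaky_reader and copy_with_retries are hypothetical names, the 512-byte sector size is an assumption, and for brevity each pass walks the pending sectors in plain ascending order rather than in the jump-to-the-middle order. What it does show is the retry rule: no sector gets a second attempt until every sector has had a first.

```python
import random
from typing import Callable, Optional, Tuple

SECTOR = 512  # assumed sector size in bytes


def make_flaky_reader(image: bytes, fail_rate: float,
                      seed: int) -> Callable[[int], Optional[bytes]]:
    """Return a reader over `image` that randomly pretends reads fail."""
    rng = random.Random(seed)  # seeded so a test run is repeatable

    def read_sector(n: int) -> Optional[bytes]:
        if rng.random() < fail_rate:
            return None  # simulated read error
        return image[n * SECTOR:(n + 1) * SECTOR]

    return read_sector


def copy_with_retries(read_sector: Callable[[int], Optional[bytes]],
                      n_sectors: int, max_tries: int = 3) -> Tuple[bytes, str]:
    """Copy all sectors, retrying failures in later passes.

    Returns (output_image, status_map). In the map, a digit is the number
    of attempts a recovered sector needed and 'X' marks a sector that was
    never read successfully.
    """
    out = bytearray(n_sectors * SECTOR)
    status = ["X"] * n_sectors
    pending = list(range(n_sectors))

    for attempt in range(1, max_tries + 1):
        still_bad = []
        for n in pending:
            data = read_sector(n)
            if data is None:
                still_bad.append(n)  # wait for the next pass
            else:
                out[n * SECTOR:(n + 1) * SECTOR] = data
                status[n] = str(attempt)
        pending = still_bad
        if not pending:
            break

    return bytes(out), "".join(status)


if __name__ == "__main__":
    image = bytes(i % 251 for i in range(75 * SECTOR))
    reader = make_flaky_reader(image, fail_rate=0.3, seed=1)
    _, status_map = copy_with_retries(reader, 75)
    print(status_map)
```

Because the failures here are independent coin flips, the map will not show the clustered bad areas a real disk has; it is only useful for checking the retry bookkeeping.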

By running this overnight against the real disk I was ultimately able to recover all of my data.

Source code is here.

During this procedure I also wrote raidsbdump, a utility to dump the contents of RAID superblocks.

Epilogue

Two weeks later another of the IBM drives failed. I replaced all four with Maxtors.
