Sharing my experience - I had two drive failures at almost the same time in a parity array.
Key point: I observed virtual disk repair jobs failing when they hit non-recoverable I/O errors on the underlying physical disks. A disk throwing I/O errors remains in the pool, marked as Healthy, but the event logs show the I/O errors, as many
of the replies here have shown. Every five minutes a new repair attempt starts, runs for a short time, then fails and leaves the space as "Degraded".
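For anyone hitting the same loop, here's a quick way to watch it from PowerShell (a sketch; the exact columns available may vary by Windows version):

```powershell
# Show the recurring repair jobs and whether they are failing.
Get-StorageJob

# Health of each space: a failing parity space shows up as Degraded here.
Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus

# The bad disk often still reports Healthy here, despite the event-log errors.
Get-PhysicalDisk | Select-Object DeviceId, FriendlyName, HealthStatus, Usage
```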
Even after marking the disk as Retired, the jobs continue to fail as they try, and fail, to read the data off the faulty blocks. One would think the driver should respond to I/O errors by reading the parity to reconstruct the missing data,
but my observation is that it did not. Possibly in my case it could not, because my array was already degraded from an earlier disk removal, so the parity data was not available.
I tried running chkdsk /r over the 'degraded' virtual disk, but it gave me a dubious error about the disk not having enough space to recover bad clusters. I don't know why I got this error, or whether chkdsk could have succeeded
some other way. chkdsk is supposed to mark bad clusters so the OS stops using them, but perhaps that just doesn't work with Storage Spaces.
In my case I had already lost another disk, so I could not simply drop the bad disk from the pool and let it rebuild from parity.
Here's the approach I took:
1. Marked the bad disk as Retired (I matched the disk number "7" against the DeviceId, and confirmed it was the faulty disk by checking the error counts from Get-StorageReliabilityCounter). This prevents the storage space from writing anything new onto the
bad disk. If you miss this step, the copy at step 4 could end up writing data back onto the same disk at a different location, and the goal here is to get rid of the bad disk completely.
2. Copied everything important off the virtual disk onto a regular non-pooled disk as a safety backup.
3. Created a new storage space virtualdisk.
4. Copied everything from the degraded storage space into the new one. I started out using a simple move via File Explorer, but switched to FreeFileSync to make progress easier to track.
5. Double-checked I had everything by comparing file contents against the backup from step 2.
<I'm here now, next steps will be today>
<edit> 6a. Remove (delete) the degraded virtual disk.
6. Manually start a repair of every other virtual disk (including healthy ones) that uses the pool containing the bad disk. This ensures all virtual disks end up on healthy physical drives only.
This could potentially send you back to step 2 for a different virtual disk.
7. Once all virtual disks are healthy and repaired, remove the bad disk from the pool.
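The PowerShell for steps 1, 6 and 7 looks roughly like this (a sketch, not gospel: disk number "7" and the pool name "Storage pool" are from my setup, so substitute your own):

```powershell
# Step 1: confirm which disk is bad, then retire it so nothing new is written to it.
$bad = Get-PhysicalDisk | Where-Object DeviceId -eq "7"
Get-StorageReliabilityCounter -PhysicalDisk $bad |
    Select-Object ReadErrorsUncorrected, WriteErrorsUncorrected
Set-PhysicalDisk -UniqueId $bad.UniqueId -Usage Retired

# Steps 2-5 (backup, new space, copy, verify) are plain file copies.

# Step 6: manually kick off a repair of every virtual disk in the pool.
Get-VirtualDisk | Repair-VirtualDisk

# Step 7: once everything reports Healthy, remove the bad disk.
Remove-PhysicalDisk -PhysicalDisks $bad -StoragePoolFriendlyName "Storage pool"
```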
If you have a parity space, and only one disk is throwing errors, you could force removal of that disk from the pool, and the repair should then rebuild automatically onto the remaining disks. This is a little risky, in that you're
assuming the degraded status is only because of that one bad disk. If you're degraded because you're still rebuilding from an earlier disk removal, then forcing removal could take out a lot of data.
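For completeness, the forced-removal route is essentially one command (again, only sensible when the space is otherwise healthy and the pool has spare capacity for the rebuild; names here are examples):

```powershell
# Force the failing disk out; the parity space should then rebuild
# automatically onto the remaining disks.
$bad = Get-PhysicalDisk | Where-Object DeviceId -eq "7"
Remove-PhysicalDisk -PhysicalDisks $bad -StoragePoolFriendlyName "Storage pool"
Get-StorageJob   # watch the rebuild progress
```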
Over the three or so days it took to work through all the copying, the disk degraded further, such that a couple of files that copied fine at step 2 had failed by the time I got to step 4. The more time passes and the more I/O hits the bad disk, the sooner
it's going to die completely.
I'll be using dual parity from now on for really important data, but that needs seven drives in the pool and I'm down to six thanks to the double failure. Time to go drive shopping.
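For reference, dual parity is just a parameter at creation time. A sketch with example names and size (PhysicalDiskRedundancy 2 is what makes it dual parity):

```powershell
# Dual parity survives two simultaneous disk failures, but the pool
# needs at least seven physical disks. Pool name and size are examples.
New-VirtualDisk -StoragePoolFriendlyName "Storage pool" `
    -FriendlyName "ImportantData" `
    -ResiliencySettingName Parity `
    -PhysicalDiskRedundancy 2 `
    -Size 4TB
```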
-
Edited by Ben Hatton, 5 hours 32 minutes ago: added step 6a