Storage Spaces Degraded Array But All Physical Disks Show OK

Long story short: during a large file copy something went wrong and the server crashed and rebooted. After it came back up, all physical disks and the storage pool show no problems, but the virtual disk that uses them shows its status as Degraded. Attempting to repair the virtual disk through both the UI and PowerShell does not work, and the virtual disk remains in the degraded state.

As you can see, no storage job is started by Repair-VirtualDisk and nothing appears to be happening.
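For reference, the PowerShell attempt amounts to the following (a sketch; "VDisk" is a stand-in for my actual virtual disk name):

# Attempt the repair, then check whether a storage job was created
Repair-VirtualDisk -FriendlyName "VDisk"
Get-StorageJob    # comes back empty - no repair job is ever started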

This issue sounds very similar to
http://social.technet.microsoft.com/Forums/en-US/winserverfiles/thread/9ae8fd65-046b-452e-8ebf-924d9df98168
but the accepted solution there was to update drivers, which I do not believe to be the problem here, since the virtual disk was fine for an extended period before this occurred.

To me it seems like one of my drives is about to die (the drives are all the same model, and one has died previously, but I was able to repair the virtual disk that time around).

The drives are all plugged into an AMCC 3ware 9500S SATA RAID Controller and set to be JBOD.

Any suggestions on how to fix or debug this would be much, much appreciated. Also let me know if there is any other information that would be useful. I would hate to lose all of my data.

Thanks,

Disabled Monkey
April 24th, 2013 11:19am

You can check the event logs to see whether any events were logged indicating that your disks/volumes have problems.

# From an elevated (admin) PowerShell prompt:

Get-WinEvent -ProviderName *Disk*,*Ntfs*,*Spaces*,*Chk*,*Defrag*
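If that returns too much noise, you can narrow it down; for example, to pull only the disk retry warnings (Event ID 153) and recent Storage Spaces management events:

# Only the "IO operation ... was retried" warnings from the disk driver
Get-WinEvent -FilterHashtable @{ LogName='System'; ProviderName='disk'; Id=153 }

# Recent Storage Spaces management-agent events
Get-WinEvent -ProviderName 'Microsoft-Windows-StorageSpaces-ManagementAgent' -MaxEvents 20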

April 24th, 2013 1:48pm

Alright, so the event logs do show problems, as would be expected, for the night it crashed during the large file transfer.

Here is an error log

The command line showed the same errors but was a bit more descriptive. Here are two entries I have pulled from it, showing the IO errors and then, eventually, the Storage Spaces status being changed to degraded.

TimeCreated  : 4/18/2013 9:09:12 PM
ProviderName : disk
Id           : 153
Message      : The IO operation at logical block address a9ed7c28 for Disk 3 was retried.
    

TimeCreated  : 4/18/2013 9:19:53 PM
ProviderName : Microsoft-Windows-StorageSpaces-ManagementAgent
Id           : 100
Message      : Storage Spaces status has changed

So from these errors, is it safe to assume the crash and degradation were caused by issues with Disk 3?

How can I go about repairing the virtual disk if all disks show as OK and repairing just goes back to degraded? Should I be replacing Disk 3 and trying the repair again?

Let me know if there is anything else I can look into.

I appreciate the help, Thanks,

Disabled Monkey

April 24th, 2013 10:07pm

Hi,

You could try replacing Disk 3 and see if the issue still exists. You may need to add a replacement disk to the pool before removing Disk 3.
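A minimal sketch of that sequence, assuming the pool is named "Pool" and the replacement disk is already attached (all names here are placeholders):

# Add the new, not-yet-pooled disk to the pool
Add-PhysicalDisk -StoragePoolFriendlyName "Pool" -PhysicalDisks (Get-PhysicalDisk -CanPool $true)

# Retire the suspect disk so nothing new is written to it, then repair
Set-PhysicalDisk -FriendlyName "PhysicalDisk3" -Usage Retired
Repair-VirtualDisk -FriendlyName "VDisk"
Get-StorageJob    # the rebuild job should show up here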

April 26th, 2013 6:50am

While I do think the logs indicate that Disk 3 may be having problems, I think it would be a good step to confirm this before just replacing it. Does anybody have suggestions for tools and/or commands I could use to do that? Since Storage Spaces, and Windows in general, seem to think the disk is fine, I am at a loss as to how to determine the true cause of the virtual disk degradation, and I would rather not take any extreme steps like replacing a drive until I am sure that is what is causing the problems.

Thanks,

Disabled Monkey


April 27th, 2013 5:32am

It's a wise idea for a RAID administrator to have spare disks in the toolbox just in case a drive fails. RAID 5 protects against a single disk failure; RAID 6 can tolerate two disks failing, which is more common than imagined.


RAID 6 protects not so much against a double drive failure (that's a rare case; I've never seen an administrator keep running a degraded RAID 5 with a disk failed) as against a failure during the long and painful parity rebuild.
April 28th, 2013 4:03pm

Alright, so I think we got a little off topic here guys.

Storage Spaces is built to abstract the concept of software RAID: you have a bunch of physical disks, you add them to a pool of available disk space, and then you create virtual disks that use that space. So I don't really need any discussion about RAID types or other stuff like that.

I have another 2 TB hard drive I can swap in; that is not the issue. The problem is that all my physical hard drives show as fine under Windows Storage Spaces, but I am unable to repair the virtual disk that has become degraded. I would like ideas on how to determine/confirm whether, and which, one of the drives is causing the virtual disk degradation and the failure to repair. If you have been following the conversation so far, the drive is most likely Disk 3, but I would like to confirm this before taking the step of replacing it. It would also be nice if there were some way to tell exactly what is keeping the repair operation from running, as repairing the disk and getting it out of the degraded state is my highest priority.
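For anyone wanting to see the state I am describing, this is how the health shows up at each layer:

# Health at each layer: pool, virtual disks, physical disks
Get-StoragePool  | Select-Object FriendlyName, HealthStatus, OperationalStatus
Get-VirtualDisk  | Select-Object FriendlyName, HealthStatus, OperationalStatus   # Degraded here
Get-PhysicalDisk | Select-Object FriendlyName, DeviceId, HealthStatus, OperationalStatus   # all OK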

Please try to stay on topic,

Thanks to those who have tried to help so far,

DisabledMonkey


April 29th, 2013 3:17am

Yes, software RAID does have its pros and cons, as does hardware RAID (e.g. less expandable). Please stay ON TOPIC, though: I have Server 2012 with a degraded virtual disk and all physical drives showing up as fine. While your help is appreciated, if you don't have experience with Server 2012 Storage Spaces, suggestions for how to get it back to a healthy state, or ways to determine whether one of the physical drives (probably Disk 3) has a problem, please don't bother posting.

Thanks,

Disabled Monkey

April 29th, 2013 11:23am

I have used Google extensively while trying to figure out this problem and have come up with nothing, which is what brought me here to Microsoft's forum to hopefully get some help. I do have a real RAID card, as you would realize if you read my initial post; it's not spectacular, but it does the job. If you don't have a response to my actual questions (read the entire thread), please don't bother posting.

Apparently my image got removed from the first post, so here it is again, showing the physical drives being fine, the virtual disk being degraded, and the repair failing from both the UI and the command line.

Thanks for your time,

Disabled Monkey

April 30th, 2013 12:04am

It is set up as thin provisioned in Storage Spaces. In essence, this allows you to set a size for the virtual disk that you may or may not reach in the future, and the virtual disk will let you keep using space until the storage pool has no more left. So if in the future I decide I want another 10 TB, I plop a couple more drives in, add them to the pool, and that is all I have to do. One of the nice things that comes from software RAID using Storage Spaces.

http://en.wikipedia.org/wiki/Thin_provisioning
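For illustration, creating a thin-provisioned space is a one-liner; a hypothetical example (the pool name, disk name, and size are all made up):

# A thin-provisioned parity space larger than today's physical capacity
New-VirtualDisk -StoragePoolFriendlyName "Pool" -FriendlyName "VDisk" -Size 20TB -ProvisioningType Thin -ResiliencySettingName Parity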

But that is beside the point of the issue I am running into.

April 30th, 2013 2:13am

SEAGATE ST2000DL003

Cheap Drives.

April 30th, 2013 9:54pm

While not ideal for business use by any means, these drives do just fine for the home environment I am using them in. I have used them for almost a year straight without any issues whatsoever. I know you disagree with how I have set up my file server, but posting about it in no way helps me figure out how to resolve the current issue.
May 1st, 2013 12:08am

Indeed, with the huge capacity of modern hard disks, it can take an inordinate amount of time to rebuild an array.

The fastest disks can manage maybe 300 MB/s, and scanning a 5 TB disk in an array of 16 can take days on end.

Given that all the disks are usually bought at the same time, from the same lot, etc., the probability of a second failure is not so far-fetched.

http://www.hardcore-games.tk/wp/hw/raid.php

May 1st, 2013 3:18am

Hello,

I also dislike software RAID for several reasons, so I am not 100% sure about this S/W RAID... oooh, I am off topic, sorry.

Try running chkdsk without the /r switch, just to check USN journal and file table consistency; it may be that the server crash left open files, caused journal problems, etc. A chkdsk would not hurt and requires no screwdrivers or disk swapping. If it finds errors, back up the volume and then use the /r switch to attempt a repair.
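Something like this, assuming the degraded space is mounted as E: (substitute your own drive letter):

# Read-only consistency check first
chkdsk E:

# Only if errors are found: back up the volume, then attempt the repair
chkdsk E: /f /r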

If that does not find any errors, I would suggest swapping out that #3 drive. It may have errors. After you remove it, run a low-level format (like a mil-spec wipe) to see whether it was a hard or soft error. If it erases/formats fine, you had a soft error and the drive is still usable.

May 1st, 2013 3:27am

Real RAID cards have logic and a BIOS to tell you that the disk on channel x is out of order.


Amen!
May 1st, 2013 3:28am

chkdsk was actually the second thing I tried, after using the Storage Spaces repair command with no luck. It came up with no errors to report, so I don't know what to make of that. If no one else has any ideas, I might just have to bite the bullet and replace Disk 3, even though the only thing suggesting it is bad is that it shows up in the error log; everything else shows it with no issues.

Thanks,

DisabledMonkey

May 1st, 2013 11:20am

You can also try retrieving the SMART data for the device.

Get-PhysicalDisk PhysicalDisk3 | Get-StorageReliabilityCounter
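To compare across all disks at once, something along these lines should work:

# Latency and error counters for every physical disk, sorted by device number
Get-PhysicalDisk | Get-StorageReliabilityCounter |
    Select-Object DeviceId, FlushLatencyMax, ReadLatencyMax, WriteLatencyMax,
                  ReadErrorsUncorrected, WriteErrorsUncorrected |
    Sort-Object { [int]$_.DeviceId } | Format-Table -AutoSize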

The other thing to try might be to see if the disk in question is spinning down too aggressively.

Nandu

May 2nd, 2013 5:01pm

Interesting. I ran it on all my physical disks. While it does not provide a whole lot of useful information, one notable thing is that FlushLatencyMax, ReadLatencyMax, and WriteLatencyMax are significantly lower for Disk 3 than for the rest (even lower than for my hardware-RAIDed solid-state drive, PhysicalDisk0, which I use as my OS drive). Wouldn't you want latency to be lower? Does it being that much lower indicate there is a problem? Any other thoughts on diagnosing this? If I have time this weekend, I will probably try swapping it out and see what happens.

Results are below,

Thanks for the help,

DisabledMonkey

ObjectId                : {e31a9347-aa29-11e2-941f-806e6f6e6963}:reliabilitycounter
PassThroughClass        :
PassThroughIds          :
PassThroughNamespace    :
PassThroughServer       :
UniqueId                : {e31a9347-aa29-11e2-941f-806e6f6e6963}:reliabilitycounter
DeviceId                : 0
FlushLatencyMax         : 958
LoadUnloadCycleCount    :
LoadUnloadCycleCountMax :
ManufactureDate         :
PowerOnHours            :
ReadErrorsCorrected     :
ReadErrorsTotal         :
ReadErrorsUncorrected   :
ReadLatencyMax          : 989
StartStopCycleCount     :
StartStopCycleCountMax  :
Temperature             :
TemperatureMax          :
Wear                    :
WriteErrorsCorrected    :
WriteErrorsTotal        :
WriteErrorsUncorrected  :
WriteLatencyMax         : 1074
PSComputerName          :

ObjectId                : {4cc290a9-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
PassThroughClass        :
PassThroughIds          :
PassThroughNamespace    :
PassThroughServer       :
UniqueId                : {4cc290a9-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
DeviceId                : 1
FlushLatencyMax         : 983
LoadUnloadCycleCount    :
LoadUnloadCycleCountMax :
ManufactureDate         :
PowerOnHours            :
ReadErrorsCorrected     :
ReadErrorsTotal         :
ReadErrorsUncorrected   :
ReadLatencyMax          : 1399
StartStopCycleCount     :
StartStopCycleCountMax  :
Temperature             :
TemperatureMax          :
Wear                    :
WriteErrorsCorrected    :
WriteErrorsTotal        :
WriteErrorsUncorrected  :
WriteLatencyMax         : 1616
PSComputerName          :

ObjectId                : {0926ef6c-7867-11e2-93fc-806e6f6e6963}:reliabilitycounter
PassThroughClass        :
PassThroughIds          :
PassThroughNamespace    :
PassThroughServer       :
UniqueId                : {0926ef6c-7867-11e2-93fc-806e6f6e6963}:reliabilitycounter
DeviceId                : 8
FlushLatencyMax         : 994
LoadUnloadCycleCount    :
LoadUnloadCycleCountMax :
ManufactureDate         :
PowerOnHours            :
ReadErrorsCorrected     :
ReadErrorsTotal         :
ReadErrorsUncorrected   :
ReadLatencyMax          : 1415
StartStopCycleCount     :
StartStopCycleCountMax  :
Temperature             :
TemperatureMax          :
Wear                    :
WriteErrorsCorrected    :
WriteErrorsTotal        :
WriteErrorsUncorrected  :
WriteLatencyMax         : 2110
PSComputerName          :

ObjectId                : {4cc290ab-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
PassThroughClass        :
PassThroughIds          :
PassThroughNamespace    :
PassThroughServer       :
UniqueId                : {4cc290ab-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
DeviceId                : 2
FlushLatencyMax         : 983
LoadUnloadCycleCount    :
LoadUnloadCycleCountMax :
ManufactureDate         :
PowerOnHours            :
ReadErrorsCorrected     :
ReadErrorsTotal         :
ReadErrorsUncorrected   :
ReadLatencyMax          : 1381
StartStopCycleCount     :
StartStopCycleCountMax  :
Temperature             :
TemperatureMax          :
Wear                    :
WriteErrorsCorrected    :
WriteErrorsTotal        :
WriteErrorsUncorrected  :
WriteLatencyMax         : 1628
PSComputerName          :

ObjectId                : {4cc290af-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
PassThroughClass        :
PassThroughIds          :
PassThroughNamespace    :
PassThroughServer       :
UniqueId                : {4cc290af-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
DeviceId                : 4
FlushLatencyMax         : 983
LoadUnloadCycleCount    :
LoadUnloadCycleCountMax :
ManufactureDate         :
PowerOnHours            :
ReadErrorsCorrected     :
ReadErrorsTotal         :
ReadErrorsUncorrected   :
ReadLatencyMax          : 1480
StartStopCycleCount     :
StartStopCycleCountMax  :
Temperature             :
TemperatureMax          :
Wear                    :
WriteErrorsCorrected    :
WriteErrorsTotal        :
WriteErrorsUncorrected  :
WriteLatencyMax         : 2119
PSComputerName          :

ObjectId                : {4cc290b1-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
PassThroughClass        :
PassThroughIds          :
PassThroughNamespace    :
PassThroughServer       :
UniqueId                : {4cc290b1-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
DeviceId                : 5
FlushLatencyMax         : 994
LoadUnloadCycleCount    :
LoadUnloadCycleCountMax :
ManufactureDate         :
PowerOnHours            :
ReadErrorsCorrected     :
ReadErrorsTotal         :
ReadErrorsUncorrected   :
ReadLatencyMax          : 1354
StartStopCycleCount     :
StartStopCycleCountMax  :
Temperature             :
TemperatureMax          :
Wear                    :
WriteErrorsCorrected    :
WriteErrorsTotal        :
WriteErrorsUncorrected  :
WriteLatencyMax         : 2128
PSComputerName          :

ObjectId                : {4cc290b3-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
PassThroughClass        :
PassThroughIds          :
PassThroughNamespace    :
PassThroughServer       :
UniqueId                : {4cc290b3-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
DeviceId                : 6
FlushLatencyMax         : 994
LoadUnloadCycleCount    :
LoadUnloadCycleCountMax :
ManufactureDate         :
PowerOnHours            :
ReadErrorsCorrected     :
ReadErrorsTotal         :
ReadErrorsUncorrected   :
ReadLatencyMax          : 1203
StartStopCycleCount     :
StartStopCycleCountMax  :
Temperature             :
TemperatureMax          :
Wear                    :
WriteErrorsCorrected    :
WriteErrorsTotal        :
WriteErrorsUncorrected  :
WriteLatencyMax         : 2214
PSComputerName          :

ObjectId                : {4cc290ad-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
PassThroughClass        :
PassThroughIds          :
PassThroughNamespace    :
PassThroughServer       :
UniqueId                : {4cc290ad-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
DeviceId                : 3
FlushLatencyMax         : 511
LoadUnloadCycleCount    :
LoadUnloadCycleCountMax :
ManufactureDate         :
PowerOnHours            :
ReadErrorsCorrected     :
ReadErrorsTotal         :
ReadErrorsUncorrected   :
ReadLatencyMax          : 848
StartStopCycleCount     :
StartStopCycleCountMax  :
Temperature             :
TemperatureMax          :
Wear                    :
WriteErrorsCorrected    :
WriteErrorsTotal        :
WriteErrorsUncorrected  :
WriteLatencyMax         : 694
PSComputerName          :

ObjectId                : {4cc290b7-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
PassThroughClass        :
PassThroughIds          :
PassThroughNamespace    :
PassThroughServer       :
UniqueId                : {4cc290b7-6831-11e2-93eb-806e6f6e6963}:reliabilitycounter
DeviceId                : 7
FlushLatencyMax         : 994
LoadUnloadCycleCount    :
LoadUnloadCycleCountMax :
ManufactureDate         :
PowerOnHours            :
ReadErrorsCorrected     :
ReadErrorsTotal         :
ReadErrorsUncorrected   :
ReadLatencyMax          : 1576
StartStopCycleCount     :
StartStopCycleCountMax  :
Temperature             :
TemperatureMax          :
Wear                    :
WriteErrorsCorrected    :
WriteErrorsTotal        :
WriteErrorsUncorrected  :
WriteLatencyMax         : 3124
PSComputerName          :




May 2nd, 2013 9:50pm

Yes, a lot of unhelpful comments here. However, Disabled Monkey, I have almost the same problem: the virtual disk is flagged as "Degraded" but the physical disks show no problems. Also, the Storage Spaces list shows no errors. The event log shows frequent timeouts on Disk 4, so I suspect it, but I'd like to know for sure.
December 15th, 2014 9:40pm

Hi!

Sorry to beat a dead thread, but I just hit the same issue. My storage layout is 7 disks and a 1-way mirror. One disk failed, and I took the normal steps (a rough PowerShell equivalent follows the list):

  1. Replaced physical disk
  2. Added new disk from Primordial to pool
  3. Marked failed disk object as Retired
  4. Repaired virtual disk (PowerShell)
  5. Removed Retired disk
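
Roughly, in PowerShell (the pool, disk, and virtual disk names below are placeholders, not my actual ones):

# 2. Add the replacement disk from Primordial to the pool
Add-PhysicalDisk -StoragePoolFriendlyName "Pool" -PhysicalDisks (Get-PhysicalDisk -CanPool $true)

# 3. Mark the failed disk as Retired
Set-PhysicalDisk -FriendlyName "PhysicalDisk5" -Usage Retired

# 4. Repair the virtual disk and wait for the rebuild job to complete
Repair-VirtualDisk -FriendlyName "VDisk"
Get-StorageJob

# 5. Remove the retired disk from the pool
Remove-PhysicalDisk -StoragePoolFriendlyName "Pool" -PhysicalDisks (Get-PhysicalDisk -FriendlyName "PhysicalDisk5")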

I performed a disk replacement using the exact same steps on the same system roughly a year ago with no issues. Now everything otherwise works as it should (the volume is fine, all disks have OperationalStatus OK), but:

  1. Virtual disk shows as Operational Status = Degraded. OtherOperationalStatusDescription is empty.
  2. Repair operation on the virtual disk finishes quickly with no errors logged (both GUI and PS)

Anybody with any more recent info? In any case, this is a bug that we need to be able to report somewhere, as there should obviously be some log information about what is wrong. But is there a workaround or hotfix to make the virtual disk's redundancy operational again? I am not very interested in moving terabytes of data around to recreate the virtual disk just because a physical disk failed.

May 18th, 2015 7:00am

I am also experiencing the same problem. The System log repeatedly shows Event 153 warnings: "The IO operation at logical block address 0x1037abc00 for Disk 7 (PDO name: \Device\00000043) was retried." However, the Physical Disk panel shows all disks as OK.

One virtual disk is degraded and detaches from service. Attaching it and running a repair completes immediately without any improvement. A second, healthy virtual disk also uses physical Disk 7.

Has anyone gained additional insight into these behaviors?


June 2nd, 2015 10:19pm


Same problem: degraded virtual disk, all physical disks OK. Repair doesn't help. This appeared after replacing a faulty drive.

I am about to rebuild and restore 8TB from backup.

My backup device is 8 USB drives in a parity storage pool. I don't feel confident about that.

Storage Spaces feels flaky.

June 12th, 2015 9:55am

Sharing my experience - I had two drive failures at almost the same time in a parity array.

Key point: I am observing virtual disk repair jobs failing when they encounter non-recoverable IO errors on the underlying physical disks. A disk throwing IO errors remains in the pool, marked as Healthy, but the event logs will show the IO errors, as many of the replies here have shown. Every five minutes there is a new attempt to repair the array, but it runs for a short time before failing and leaving the space as "degraded".

Even after marking the disk as Retired, the jobs continue to fail as they try, and fail, to read the data off faulty blocks. One would think the driver would respond to IO errors by using the parity to reconstruct the missing data, but my observation is that it did not. Possibly in my case it could not, because my array was already degraded from an earlier disk removal and so the parity data was not available.

I tried to run chkdsk /r over the virtual disk that was "degraded", but this gave me some dubious errors about the disk not having enough space to recover bad clusters. I don't know why I got this error, or whether I could have been successful with chkdsk in another way. chkdsk is supposed to mark bad clusters to stop the OS from using them, but perhaps that just doesn't work with Storage Spaces.

In my case I had already lost another disk and so I could not just drop that disk from the pool and let it recover using parity.

Here's the approach I took:

1. Marked the bad disk as Retired (I matched the disk number "7" from the event log against the DeviceId, and confirmed it by looking at the error counts from Get-StorageReliabilityCounter). This prevents the storage space from writing anything new onto the bad disk. If you miss this step, the copy at step 4 could end up copying data back onto the same disk at a different location, and the goal here is to get rid of the bad disk completely. (A PowerShell sketch of this retire step follows the list.)

2. Copied everything important off the virtual disk onto a regular non-pooled disk as a safety backup.

3. Created a new storage space virtualdisk.

4. Copied everything from the degraded storage space into the new one. I started out using a simple 'move' via File Explorer, but switched to FreeFileSync to make it easier to keep track of.

5. Double-checked I had everything by comparing file contents against the backup from step 2.

<I'm here now, next steps will be today>

<edit> 6a. Remove (delete) the degraded virtual disks.

6. Manually initiate a virtual disk repair for any other virtual disks (including healthy ones) that use the pool containing the bad disk. This ensures all virtual disks are using only healthy physical drives. This could potentially send you back to step 2 for a different virtual disk.

7. Once all virtual disks are healthy and repaired, remove the bad disk from the pool.
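
A minimal sketch of the retire operation from step 1, assuming (as in my case) the failing disk is the one with DeviceId 7:

# Find the disk matching the "Disk 7" from the event-log errors
$bad = Get-PhysicalDisk | Where-Object DeviceId -eq '7'

# Sanity-check it against its error counters before retiring it
$bad | Get-StorageReliabilityCounter |
    Select-Object DeviceId, ReadErrorsUncorrected, WriteErrorsUncorrected

# Retire it so nothing new gets written to it
Set-PhysicalDisk -InputObject $bad -Usage Retired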

If you have a parity space, and only one disk is throwing errors, you could force removal of that disk from the pool; the repair should then start rebuilding automatically using the remaining disks. This is a little risky in that you're assuming the degraded status is due only to the one bad disk. If you're in degraded status because you're already rebuilding from an earlier disk removal, then a forced removal could take out a lot of data.
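
The forced removal itself would look something like this (reusing $bad from the sketch above; the pool and virtual disk names are again placeholders, and make sure you understand the risk first):

# Remove the failing disk; the repair should then rebuild from parity
Remove-PhysicalDisk -StoragePoolFriendlyName "Pool" -PhysicalDisks $bad
Repair-VirtualDisk -FriendlyName "VDisk"
Get-StorageJob    # watch the rebuild progress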

Over the three or so days it took to work through all the copying, the disk degraded further, such that a couple of files that copied fine at step 2 failed by the time I got to step 4. The more time passes and the more IO hits the bad disk, the sooner it's going to go.

I'd use dual parity from now on for really important data, but you need 7 drives in the pool and I'm down to 6 thanks to the double failure. Time to go drive shopping.


July 24th, 2015 9:05pm

Same problem: degraded virtual disk, all physical disks OK. Repair doesn't help. This appeared after replacing a faulty drive.

This was exactly me, until I realised that *another* disk was throwing IO errors. I had assumed that just one drive had failed and didn't pay careful attention to the device IDs in the error logs.

July 24th, 2015 9:07pm

This topic is archived. No further replies will be accepted.
