Tiered two-way mirrored storage space crashes when one SSD goes missing (Network Steve Forum)

Hi guys

My greatest nightmare happened yesterday. I lost one SSD in my clustered, tiered, enclosure-aware storage pool, and all virtual disks became unresponsive before RHS rebooted my two head nodes. This is primary storage for a Hyper-V cluster, so imagine the number of angry phone calls...

Key info:

Failover cluster
Two head nodes
Three Quanta SAS JBODs
56 x Seagate ST1000NM0023
8 x Sandisk SDLKAE6M200G5CA1 (Optimus Ascend)
One storage pool
4 virtual disks (storage quorum and 3 disks containing one SOFS share each)

So what happened is one SSD started misbehaving and throwing controller errors (11, LSI_SAS2) and eventually warnings on retried IO operations (153, disk).

Then RHS threw a timeout on my storage pool (1230, FailoverClustering) and followed up with timeouts on all virtual disks, including my storage cluster disk witness.

Eventually one cluster node bugchecked with 0x9E (1001, Bugcheck 0x0000009e (0xffffe0011d522080, 0x00000000000004b0, 0x0000000000000005, 0x0000000000000000)), which I believe is forced by RHS due to continued timeouts. The remaining head node had lost access to the disk witness, decided it had lost quorum and shut the whole cluster down.

When the first head node had rebooted, it had full access to the storage pool and virtual disks, including the storage cluster disk witness, so it started the cluster. The second head node booted and joined the cluster, and everything is peachy. Well, except for angry users and owners of virtual machines stored on SOFS.
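
For anyone reading the bugcheck parameters: Microsoft's documentation for bug check 0x9E (USER_MODE_HEALTH_MONITOR) describes the second parameter as the health monitoring timeout in seconds. Assuming that reading applies here, a quick Python sketch of the conversion looks like this. The variable names are my own, and the interpretation should be checked against the documentation.

```python
# Values copied from the event 1001 text above.
# Assumption: parameter 2 is the health monitoring timeout in seconds,
# as described in Microsoft's bug check 0x9E documentation.
BUGCHECK_CODE = 0x0000009E
PARAMS = (0xFFFFE0011D522080, 0x00000000000004B0, 0x5, 0x0)

timeout_seconds = PARAMS[1]
timeout_minutes = timeout_seconds // 60

print(f"bug check 0x{BUGCHECK_CODE:X}")
print(f"parameter 2 = {timeout_seconds} seconds ({timeout_minutes} minutes)")
```

So 0x4b0 works out to 1200 seconds, i.e. 20 minutes.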

The storage pool reported a degraded state and one SSD was reported missing. I replaced the SSD and everything is healthy again now. But what the heck happened? I thought the storage pool would abstract the physical disk layer, and everything from the virtual disk and up would be oblivious to physical failure.

Where do I even start looking at this one to prevent future meltdowns? I'll be happy to post logs and even my PowerShell setup script for the entire storage, as well as further details as needed. Any help is appreciated :-)

March 25th, 2015 10:34am

Hi,

Do you mean that the storage space works fine now? Please provide the detailed error messages from the Event Log from when the SSD was reported missing.

Best Regards,

Mandy

March 26th, 2015 10:01am

Hi Mandy, thanks for replying! The space is working fine again after the BSOD and the following reboot of my two cluster nodes. I have supplied a chronological excerpt of the logs from one node.

The first two entries are the controller and disk errors that repeat all the way up to the reboot. After the reboot the disk was missing and there were no more LSI or disk events. The disk has been replaced and my pool/space is healthy again. Sorry about the data overload. Please let me know if there is a way to improve the formatting :-)

Log Name: System
Source: LSI_SAS2
Date: 24.03.2015 13:19:59
Event ID: 11
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: NODE1.domain.local
Description:
The driver detected a controller error on \Device\RaidPort1.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="LSI_SAS2" />
<EventID Qualifiers="49156">11</EventID>
<Level>2</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2015-03-24T12:19:59.662050100Z" />
<EventRecordID>94345</EventRecordID>
<Channel>System</Channel>
<Computer>NODE1.domain.local</Computer>
<Security />
</System>
<EventData>
<Data>\Device\RaidPort1</Data>
<Binary>0F00180001000000000000000B0004C01A01123100000000000000000000000000000000000000000000000000000000000000000B0004C00000000000000000</Binary>
</EventData>
</Event>

Log Name: System
Source: disk
Date: 24.03.2015 13:22:02
Event ID: 153
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: NODE1.domain.local
Description:
The IO operation at logical block address 0x1d7690 for Disk 62 (PDO name: \Device\00000096) was retried.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="disk" />
<EventID Qualifiers="32772">153</EventID>
<Level>3</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2015-03-24T12:22:02.057930400Z" />
<EventRecordID>94363</EventRecordID>
<Channel>System</Channel>
<Computer>NODE1.domain.local</Computer>
<Security />
</System>
<EventData>
<Data>\Device\Harddisk62\DR62</Data>
<Data>0x1d7690</Data>
<Data>62</Data>
<Data>\Device\00000096</Data>
<Binary>0F01040004002C0000000000990004800000000000000000000000000000000000000000000000000000122A</Binary>
</EventData>
</Event>

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 24.03.2015 13:27:13
Event ID: 1230
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: NODE1.domain.local
Description:
A component on the server did not respond in a timely fashion. This caused the cluster resource 'Cluster Pool 1' (resource type 'Storage Pool', DLL 'clusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
<EventID>1230</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2015-03-24T12:27:13.582340500Z" />
<EventRecordID>94367</EventRecordID>
<Correlation />
<Execution ProcessID="1384" ThreadID="10660" />
<Channel>System</Channel>
<Computer>NODE1.domain.local</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="ResourceName">Cluster Pool 1</Data>
<Data Name="ResourceType">Storage Pool</Data>
<Data Name="ResTypeDll">clusres.dll</Data>
</EventData>
</Event>

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 24.03.2015 13:28:05
Event ID: 1230
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: NODE1.domain.local
Description:
A component on the server did not respond in a timely fashion. This caused the cluster resource 'HyperQuorum01' (resource type 'Physical Disk', DLL 'clusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
<EventID>1230</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2015-03-24T12:28:05.019505000Z" />
<EventRecordID>94368</EventRecordID>
<Correlation />
<Execution ProcessID="1384" ThreadID="10660" />
<Channel>System</Channel>
<Computer>NODE1.domain.local</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="ResourceName">HyperQuorum01</Data>
<Data Name="ResourceType">Physical Disk</Data>
<Data Name="ResTypeDll">clusres.dll</Data>
</EventData>
</Event>

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 24.03.2015 13:31:13
Event ID: 1146
Task Category: Resource Control Manager
Level: Critical
Keywords:
User: SYSTEM
Computer: NODE1.domain.local
Description:
The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
<EventID>1146</EventID>
<Version>0</Version>
<Level>1</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2015-03-24T12:31:13.010887000Z" />
<EventRecordID>94370</EventRecordID>
<Correlation />
<Execution ProcessID="1384" ThreadID="3564" />
<Channel>System</Channel>
<Computer>NODE1.domain.local</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="NodeName">INFRA-STOR01</Data>
</EventData>
</Event>

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 24.03.2015 13:34:13
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: NODE1.domain.local
Description:
Cluster resource 'StorageQuorum' of type 'Physical Disk' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
<EventID>1069</EventID>
<Version>1</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2015-03-24T12:34:13.084369900Z" />
<EventRecordID>94373</EventRecordID>
<Correlation />
<Execution ProcessID="1384" ThreadID="3564" />
<Channel>System</Channel>
<Computer>NODE1.domain.local</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="ResourceName">StorageQuorum</Data>
<Data Name="ResourceGroup">Cluster Group</Data>
<Data Name="ResTypeDll">Physical Disk</Data>
</EventData>
</Event>

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 24.03.2015 13:36:13
Event ID: 1230
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: NODE1.domain.local
Description:
A component on the server did not respond in a timely fashion. This caused the cluster resource 'ClusterDisk01' (resource type 'Physical Disk', DLL 'clusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
<EventID>1230</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2015-03-24T12:36:13.242993800Z" />
<EventRecordID>94380</EventRecordID>
<Correlation />
<Execution ProcessID="1384" ThreadID="7264" />
<Channel>System</Channel>
<Computer>NODE1.domain.local</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="ResourceName">ClusterDisk01</Data>
<Data Name="ResourceType">Physical Disk</Data>
<Data Name="ResTypeDll">clusres.dll</Data>
</EventData>
</Event>

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 24.03.2015 13:37:13
Event ID: 1069
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: NODE1.domain.local
Description:
Cluster resource 'HyperQuorum01' of type 'Physical Disk' in clustered role '3a6281f0-1b41-4176-8071-0e2aa35d9ee9' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
<EventID>1069</EventID>
<Version>1</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2015-03-24T12:37:13.109674700Z" />
<EventRecordID>94381</EventRecordID>
<Correlation />
<Execution ProcessID="1384" ThreadID="3564" />
<Channel>System</Channel>
<Computer>NODE1.domain.local</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="ResourceName">HyperQuorum01</Data>
<Data Name="ResourceGroup">3a6281f0-1b41-4176-8071-0e2aa35d9ee9</Data>
<Data Name="ResTypeDll">Physical Disk</Data>
</EventData>
</Event>

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 24.03.2015 13:39:13
Event ID: 1230
Task Category: Resource Control Manager
Level: Error
Keywords:
User: SYSTEM
Computer: NODE1.domain.local
Description:
A component on the server did not respond in a timely fashion. This caused the cluster resource 'HyperQuorum01' (resource type 'Physical Disk', DLL 'clusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
<EventID>1230</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>3</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2015-03-24T12:39:13.127408000Z" />
<EventRecordID>94383</EventRecordID>
<Correlation />
<Execution ProcessID="1384" ThreadID="4740" />
<Channel>System</Channel>
<Computer>NODE1.domain.local</Computer>
<Security UserID="S-1-5-18" />
</System>
<EventData>
<Data Name="ResourceName">HyperQuorum01</Data>
<Data Name="ResourceType">Physical Disk</Data>
<Data Name="ResTypeDll">clusres.dll</Data>
</EventData>
</Event>

Log Name: System
Source: Microsoft-Windows-WER-SystemErrorReporting
Date: 24.03.2015 13:52:49
Event ID: 1001
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: INFRA-STOR01
Description:
The computer has rebooted from a bugcheck. The bugcheck was: 0x0000009e (0xffffe0011d522080, 0x00000000000004b0, 0x0000000000000005, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 032415-233875-01.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-WER-SystemErrorReporting" Guid="{ABCE23E7-DE45-4366-8631-84FA6C525952}" EventSourceName="BugCheck" />
<EventID Qualifiers="16384">1001</EventID>
<Version>0</Version>
<Level>2</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2015-03-24T12:52:49.000000000Z" />
<EventRecordID>94394</EventRecordID>
<Correlation />
<Execution ProcessID="0" ThreadID="0" />
<Channel>System</Channel>
<Computer>INFRA-STOR01</Computer>
<Security />
</System>
<EventData>
<Data Name="param1">0x0000009e (0xffffe0011d522080, 0x00000000000004b0, 0x0000000000000005, 0x0000000000000000)</Data>
<Data Name="param2">C:\Windows\MEMORY.DMP</Data>
<Data Name="param3">032415-233875-01</Data>
</EventData>
</Event>
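
To make an excerpt like the one above easier to scan, the Event XML can be reduced to a one-line-per-event timeline. The following is a minimal Python sketch, not a finished tool. The `make_event` helper is hypothetical and only builds trimmed stand-ins for three of the events above (same provider, event ID, SystemTime and record ID); real exports carry more elements, which the parser simply ignores. Note that SystemTime is UTC, which is why it reads one hour earlier than the local "Date" lines.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}


def parse_event(xml_text):
    """Return (utc_time, provider, event_id, record_id) for one <Event>."""
    system = ET.fromstring(xml_text.strip()).find("e:System", NS)
    raw = system.find("e:TimeCreated", NS).get("SystemTime")
    # SystemTime has 9 fractional digits; keep whole seconds only.
    stamp = datetime.strptime(raw.rstrip("Z").split(".")[0], "%Y-%m-%dT%H:%M:%S")
    return (
        stamp.replace(tzinfo=timezone.utc),
        system.find("e:Provider", NS).get("Name"),
        int(system.find("e:EventID", NS).text),
        int(system.find("e:EventRecordID", NS).text),
    )


def timeline(xml_events):
    """Parse every event and sort chronologically."""
    return sorted(parse_event(x) for x in xml_events)


def make_event(provider, event_id, system_time, record_id):
    """Hypothetical helper: build a trimmed <Event> for the demo below."""
    return (
        f'<Event xmlns="{NS["e"]}"><System>'
        f'<Provider Name="{provider}" />'
        f"<EventID>{event_id}</EventID>"
        f'<TimeCreated SystemTime="{system_time}" />'
        f"<EventRecordID>{record_id}</EventRecordID>"
        "</System></Event>"
    )


# Three events from the excerpt, deliberately out of order.
SAMPLE = [
    make_event("Microsoft-Windows-FailoverClustering", 1230,
               "2015-03-24T12:27:13.582340500Z", 94367),
    make_event("LSI_SAS2", 11, "2015-03-24T12:19:59.662050100Z", 94345),
    make_event("disk", 153, "2015-03-24T12:22:02.057930400Z", 94363),
]

rows = timeline(SAMPLE)
for when, provider, event_id, record_id in rows:
    print(when.isoformat(), provider, event_id, record_id)
```

Sorted this way, the sequence reads controller error first, then the retried IO, then the first cluster timeout a little over seven minutes after the controller error.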

Edited by Jorgen Fundingsrud Thursday, March 26, 2015 11:04 AM Clarification after re reading

March 26th, 2015 11:01am

Hi Jorgen,

We also use the Quanta M4600 JBOD in combination with 32 Seagate 1.2 TB drives and 8 Seagate 200 GB SSDs.

Our experiences are mixed. The performance is good, but the stability is a problem.

We also had a crash like the one you described, and we see a lot of IO operation warnings on the SSDs.

We first suspected the SSDs and are now testing with HGST drives.

The Quanta JBOD could also be the problem, and the support for it is not very good.

I think it would be interesting to exchange some information so we can work out the cause of the stability issues.

Regards,

Jeroen Bosman

April 28th, 2015 2:36pm

Wow! We also have Seagate SSDs (ST400FM0053), and there are also lots of I/O warnings with disk numbers pointing to the SSDs. Our SOFS hosts are not crashing, but VMs that are stored on SOFS sometimes hang.

I've contacted Seagate support, and they told me that this issue should be resolved by upgrading the disk firmware to version 0006. We are currently planning this upgrade...

April 28th, 2015 5:54pm

Can you clarify how many physical disks you have in your clustered storage pool? There is a recommended limit of 80 physical disks in a clustered storage pool.

This would be a good place to start in evaluating your storage health: https://gallery.technet.microsoft.com/scriptcenter/Test-StorageHealthps1-66d84fd4

April 28th, 2015 6:18pm

Our config: 3 enclosures connected to 3 servers via dual path connections using LSI 9207-8e cards. 18 HDDs and 4 SSDs in each enclosure (so, total of 66 drives in pool). One tiered Storage Pool and 6 Virtual Disks.
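
As a quick check of the arithmetic against the 80-disk figure mentioned above, here is a small Python sketch. The labels are my own, and the counts are simply the ones quoted in this thread.

```python
RECOMMENDED_MAX = 80  # recommended limit quoted earlier in this thread

pools = {
    "original poster": 56 + 8,    # 56 HDDs + 8 SSDs
    "this config": 3 * (18 + 4),  # 3 enclosures, 18 HDDs + 4 SSDs each
}

for name, total in pools.items():
    verdict = "within limit" if total <= RECOMMENDED_MAX else "over limit"
    print(f"{name}: {total} disks, {verdict}")
```

Both pools, at 64 and 66 disks, are under the quoted limit.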

April 28th, 2015 6:34pm

We already tested firmware 0006 for the Seagate 200 GB SSDs (ST200FM0073), and we don't see any improvement regarding the IO operation warnings (see the warning below).

(The IO operation at logical block address 0x7fe2578 for Disk 25 (PDO name: \Device\MPIODisk24) was retried.)

There are also controller errors in the Event Viewer:

The driver detected a controller error on \Device\RaidPort0. Event ID:11

We disabled TRIM/UNMAP and changed our MPIO policy to failover only.

Last week I updated my LSI SAS 9207-8e adapter to the latest firmware and driver version, P20.

So far there is not much improvement. This week I'm going to run a new test with new HGST SSDs.

Our experience is that we have timeouts on the storage, and we also once had a crash of the complete cluster.

During the timeouts we see warnings in the Event Viewer:

Reset to device, \Device\RaidPort0, was issued. Event ID:129

And Errors:

The driver detected a controller error on \Device\RaidPort0. Event ID:11

April 29th, 2015 11:04am

Hi Demonixed,

What brand of enclosure are you using in your environment?

Regards,

Jeroen Bosman

April 29th, 2015 11:07am

Hi, Jeroen! We use a SuperMicro 847E26-R1400LPB chassis with external cabling. It's not certified, but it looks like an almost exact copy of the 847E26-RJBOD1, which is certified. We upgraded to P20 firmware in January. LB policy: least blocks (we also tried failover only).

I'm getting events with ID 153 (IO operation was retried) multiple times a day at random times. Events with ID 129 and 11 occur far less often (the last was on April 22).

April 29th, 2015 12:28pm

Hi Demonixed,

Same problem here.

Events with ID 129 and 11 occur far less often than 153, but 129 and 11 can cause big problems such as storage timeouts or crashes.

April 29th, 2015 1:36pm

I had a couple of issues earlier, so I had already upgraded the LSI firmware, driver and BIOS to P20 before the last incident. SSDs and disks were all on the most recent firmware. I saw a lot of LSI_SAS2 11 events over a period of days leading up to my crash, so that's something I would take seriously.

I've used LSIutil and tracked the phy error counters to one specific SSD from all SAS paths on both storage nodes. Individually the SSD tests just fine, but my hardware vendor agreed to an RMA. I added a new SSD on a new port in the same enclosure and retired the "defective" one. Right now I have no phy errors. But I'm not sure I trust the storage just yet. Also, I've only seen trouble while both storage nodes were active. I'm currently running all storage on one node with the other one paused (yay redundancy...). Not ideal, but I'm considering starting up the other node again for limited periods to test stability.

April 30th, 2015 11:41am

Hi!

Just wanted to share our current state.

We upgraded the SSD firmware to 0006 five days ago, and everything is fine now. No warnings about IO retries. And the performance of the tiered storage itself is now very good. Before the upgrade, disk idle time (VD on Storage Spaces) jumped from 0% to 70%.

After the upgrade this counter became stable. It is now 80-100% under the same load (or an even heavier one, because I removed IO limits on a bunch of VMs).

Btw, we performed the upgrade the following way: retire 2 disks, repair VDs, remove the disks from the pool, do the upgrade, return the disks to the pool, retire another 2 disks, repair VDs again, etc. And every time we did a repair, one of the hosts rebooted (sometimes two nodes rebooted sequentially). Those were bugchecks 0x0000007e and 0x0000003b, both caused by spaceport.sys, plus reboots with no reason specified.
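
The rolling order described above can be written down as a simple plan. This Python sketch only prints the sequence and does not touch any storage. The SSD names are hypothetical placeholders for the 12 SSDs in our pool (4 per enclosure, 3 enclosures), and the step labels are just the ones from the description.

```python
STEPS = (
    "retire disks",
    "repair virtual disks",
    "remove disks from pool",
    "upgrade firmware",
    "return disks to pool",
)


def batches(disks, size=2):
    """Split the disk list into consecutive groups of `size`."""
    return [disks[i:i + size] for i in range(0, len(disks), size)]


# Hypothetical names for the 12 SSDs.
ssds = [f"SSD{n:02d}" for n in range(1, 13)]

plan = batches(ssds)
for number, group in enumerate(plan, start=1):
    for step in STEPS:
        print(f"batch {number}: {step}: {', '.join(group)}")
```

Working two disks at a time means six passes through the five steps, with a repair in every pass.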

We have a 3-node cluster. Two nodes were last patched at the end of January and the third one in the middle of March. I installed the current patches before retiring the last batch of drives, and the hosts didn't fail.

Hope it is a happy ending... :)

Edited by Demonixed Tuesday, May 05, 2015 11:04 AM typos

May 5th, 2015 10:55am

Update from our situation so far:

Got in contact with a Dutch company that also uses the Seagate SSDs and has similar issues.

They also updated the firmware to 0006.

But the problem came back under higher load.

They advised me to go back to firmware P19 for the LSI 9207-8e SAS adapters, because they had the same controller errors with firmware P20 and LSI advised them to downgrade because of issues with P20.

We are starting a new test period to verify the environment with firmware P19.

Regards,

Jeroen Bosman

May 6th, 2015 10:58am

Update 2.

After the downgrade to firmware and driver P19 for the LSI 9207-8e, with HGST SSDs, the logs are clean so far.

We enabled TRIM again and have no performance issues for now.

We set MPIO to failover only because we saw better performance with that policy.

We are planning to roll out this downgrade to our production environment as well.

May 11th, 2015 7:55am

We are having a similar issue, posted here:

https://social.technet.microsoft.com/Forums/en-US/28fae8ba-83a0-4153-871e-dd1f6afd4df1/scale-out-file-server-event-ids-5142-5120?forum=winserverfiles

Any further status updates?

July 26th, 2015 6:30pm

This topic is archived. No further replies will be accepted.