Server 2012 R2 Cluster Failover issues during catastrophic failure of iSCSI shared storage on Node 1 of 2

I am running an experiment with clustered servers to understand failover timing and failover recovery.  I have a couple of questions I was hoping this forum could help me with.  My setup is:

  • Two clustered 2012 R2 servers (Node 1 and Node 2) - all updates installed - same subnet - shared iSCSI storage
  • One file server role running
  • Node 1 is current host server
  • Node 1 is owner node for file server role
  • Node 1 is owner node for quorum disk and for file server disk
  • Single file share \\MIKEFS

On the file share, I placed a 64 KB file.  On a client PC (2012 R2) with access to the share, I ran a simple executable I wrote that does the following:

  1.        Get a handle to the file (CreateFile -> FILE_FLAG_NO_BUFFERING)
  2.        Set file position to 0 and synchronous ReadFile of 64 KB / Track execution time
  3.        Loop forever on #2 until any failure occurs
  4.        Close the file handle (CloseHandle)
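
For reference, the loop is roughly the following minimal sketch (the UNC path and file name are placeholders, error handling is trimmed, and the buffer is page-aligned via VirtualAlloc to satisfy FILE_FLAG_NO_BUFFERING):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // 1. Open the file on the share with buffering disabled so every read goes to the server.
    HANDLE h = CreateFileA("\\\\MIKEFS\\Share\\test64k.bin",   // placeholder path
                           GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed - Error %lu\n", GetLastError());
        return 1;
    }

    // 64 KB buffer; VirtualAlloc returns page-aligned memory, which satisfies
    // the sector-alignment requirement of FILE_FLAG_NO_BUFFERING.
    void *buf = VirtualAlloc(NULL, 64 * 1024, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (!buf) return 1;

    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);

    for (;;) {
        // 2. Rewind to offset 0 and issue a synchronous 64 KB read, timing the call.
        LARGE_INTEGER zero = {0}, t0, t1;
        SetFilePointerEx(h, zero, NULL, FILE_BEGIN);

        DWORD bytesRead = 0;
        QueryPerformanceCounter(&t0);
        BOOL ok = ReadFile(h, buf, 64 * 1024, &bytesRead, NULL);
        DWORD err = ok ? 0 : GetLastError();
        QueryPerformanceCounter(&t1);

        double secs = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;

        // 3. Loop forever until any failure occurs; report the Win32 error code.
        if (!ok) {
            printf("ReadFile failed after %.6fs - Error %lu\n", secs, err);
            break;
        }
    }

    // 4. Close the file handle.
    CloseHandle(h);
    VirtualFree(buf, 0, MEM_RELEASE);
    return 0;
}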

During the middle of the test, I removed power from Node 1.  The following sequence of events occurred:

  • 8:20:01 - Node 1 Power Loss
  • 8:20:07 - Node 2 records Event ID 1135 (i.e. reports node 1 was removed from the active failover cluster membership)
  • 8:20:39 - 64 KB test application, performing the read test, takes 37.6 seconds to return from ReadFile API
  • Client test application continued to run without any failure (just that one long delay noted above)

Node 2 determined relatively quickly that Node 1 was lost.  The default heartbeat settings help explain the time from the power loss to the time Node 2 reports the loss (SameSubnetDelay=1000 and SameSubnetThreshold=5).

So ~5 seconds of the 37.6 seconds was due to the heartbeat settings.  The remaining ~32.6 seconds was the time to fail over to the surviving node, mount the volumes, etc.  How can I determine what was going on during this time?  What is taking the longest amount of time?  Are there any settings that could adjust these timings?  I looked at cluster.log, but there is a lot of detail in there and it's difficult to determine why it takes a considerable amount of time from the point the node detects the failure.

July 29th, 2015 5:50pm

In this posting, I detailed a simple two-node Server 2012 R2 cluster configured with a single file server role, with both nodes accessing shared iSCSI storage.  A client system performs a simple looped 64 KB read on a file share from the clustered server.  The test applet reports a failure any time the ReadFile() API fails.

When one node experiences a complete power loss, the other node will take over and no failure will occur on the client system (other than an ~30 second delay where the synchronous ReadFile() API does not return).  This was detailed in the other post.  The key point is that failover completed successfully and NO failure was seen on the client PC.

I want to test a different failover scenario, so with Node 1 being the current host and the owner node for the file server role, quorum disk, and shared storage, I simulate a catastrophic failure of Node 1's iSCSI storage.  The OS itself boots from a virtual ATA drive, so the OS drive remains alive.  What I do is go into Device Manager, find the NIC(s) that are configured for the iSCSI network, and disable them.  This causes all iSCSI I/O activity to fail.  When I perform this test, the following is seen from the client PC:

  • 11:20:01 - Node 1 loss of iSCSI shared storage
  • 11:22:01 - Synchronous ReadFile() returns after 120 seconds with ERROR_INVALID_HANDLE
  • 11:22:20 - CloseHandle() completes successfully after 19 seconds

I understand the 120 second delay.  It's a combination of the iSCSI link down timer and the MPIO PDO remove period timer.  What I don't understand is why the request failed.  I expected a failover and successful completion of the ReadFile, like I saw when I cut power to Node 1.

What's interesting is that Windows did move the host server, file server, and disk storage to Node 2, so a failover did occur.  At least this time it did; other times I have seen the file server role left stopped.  It appears that the failover occurred immediately after the client PC received the read failure.  Why not complete the failover before failing the client ReadFile request?  Are there timeout adjustments I could experiment with?
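
For anyone wanting to inspect the MPIO portion of that window, the PDO removal period is a registry value and can be read programmatically.  A minimal sketch, assuming the standard MPIO Parameters key (the iSCSI-side timers such as LinkDownTime and MaxRequestHoldTime live under the initiator's instance key and are not shown here):

#include <windows.h>
#include <stdio.h>
#pragma comment(lib, "advapi32.lib")

int main(void)
{
    HKEY key;
    // MPIO's global timer settings live under this Parameters key.
    LONG rc = RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                            "SYSTEM\\CurrentControlSet\\Services\\mpio\\Parameters",
                            0, KEY_READ, &key);
    if (rc != ERROR_SUCCESS) {
        printf("RegOpenKeyEx failed - Error %ld\n", rc);
        return 1;
    }

    // PDORemovePeriod (seconds): how long MPIO keeps the pseudo-LUN around
    // after its last path has failed before tearing it down.
    DWORD period = 0, size = sizeof(period), type = 0;
    rc = RegQueryValueExA(key, "PDORemovePeriod", NULL, &type, (LPBYTE)&period, &size);
    if (rc == ERROR_SUCCESS && type == REG_DWORD)
        printf("PDORemovePeriod = %lu seconds\n", period);
    else
        printf("PDORemovePeriod not set explicitly (Error %ld) - default applies\n", rc);

    RegCloseKey(key);
    return 0;
}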

Here is Node 1's event log which shows the failover sequence from Node 1's perspective starting just after the catastrophic iSCSI loss on Node 1:

   ProviderName: Microsoft-Windows-FailoverClustering

TimeCreated                     Id LevelDisplayName Message
-----------                     -- ---------------- -------
7/29/2015 11:20:06 AM         1132 Information      Cluster network interface 'TESTCLUSTER1 - Ethernet 2' for node 'TESTCLUSTER1' on network 'Cluster Network 1' was removed.
7/29/2015 11:21:38 AM         1649 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has taken more than one minute to respond to a control code. The control code was 'STORAGE_GET_DISK_INFO_EX'.
7/29/2015 11:22:01 AM         5264 Information      Physical Disk resource 'd47df305-c3a6-4bbe-8475-48e1398bbee6' has been disconnected from this node.
7/29/2015 11:22:01 AM         5264 Information      Physical Disk resource 'b543b5ab-67c9-4836-89a1-ae5636916de5' has been disconnected from this node.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Online to state ProcessingFailure.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 1' is waiting on the following resources: .
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Terminating to state DelayRestartingResource.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Online to state ProcessingFailure.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 2' is waiting on the following resources: File Server (\\DATAFS).
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToTerminate. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: .
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state Terminating to state WaitingToComeOnline. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: Cluster Disk 2.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Terminating to state DelayRestartingResource.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state DelayRestartingResource to state OnlineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state OnlineCallIssued to state ProcessingFailure.
7/29/2015 11:22:02 AM         1633 Information      The Cluster service failed to bring clustered role 'Cluster Group' completely online or offline. One or more resources may be in a failed or an offline state. This may impact the availability of the clustered role.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 1' is waiting on the following resources: .
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Terminating to state CannotComeOnlineOnThisNode.
7/29/2015 11:22:02 AM         1153 Information      The Cluster service is attempting to fail over the clustered role 'Cluster Group' from node 'TESTCLUSTER1' to node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state DelayRestartingResource to state OnlineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state OnlineCallIssued to state ProcessingFailure.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 2' is waiting on the following resources: File Server (\\DATAFS).
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state WaitingToComeOnline to state OfflineDueToProvider. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: Cluster Disk 2.
7/29/2015 11:22:02 AM         1203 Information      The Cluster service is attempting to bring the clustered role 'Cluster Group' offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'Cluster IP Address' is waiting on the following resources: Cluster Name.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'Cluster Name' is waiting on the following resources: .
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Terminating to state CannotComeOnlineOnThisNode.
7/29/2015 11:22:02 AM         1204 Information      The Cluster service successfully brought the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflineCallIssued to state OfflinePending.
7/29/2015 11:22:02 AM         1153 Information      The Cluster service is attempting to fail over the clustered role 'DATAFS' from node 'TESTCLUSTER1' to node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM         1203 Information      The Cluster service is attempting to bring the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'DATAFS' is waiting on the following resources: .
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'IP Address 10.18.236.0' is waiting on the following resources: DATAFS.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflineCallIssued to state OfflinePending.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflinePending to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state OfflineCallIssued to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1204 Information      The Cluster service successfully brought the clustered role 'Cluster Group' offline.
7/29/2015 11:22:02 AM         1641 Information      Clustered role 'Cluster Group' is moving to cluster node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflinePending to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state OfflineCallIssued to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM         1637 Information      Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM         1204 Information      The Cluster service successfully brought the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM         1641 Information      Clustered role 'DATAFS' is moving to cluster node 'TESTCLUSTER2'.

July 29th, 2015 7:03pm

I should note that when I switch to a Scale-Out File Server role with Cluster Shared Volumes, the client PC continues to run without failure.  There is still a 120 second delay, but the failover succeeds and the client keeps operating.  My goal, however, is to understand why the standard file server role didn't recover from this failure and what settings I can tune to make it work.
July 29th, 2015 7:26pm

Hi,

I am trying to involve someone more familiar with this issue. There might be some time delay. Appreciate your patience.


Thanks for your understanding and support.

July 31st, 2015 2:26am

Hi BT Mike,

Is the file you are reading located on a CSV disk? Without CSV in the cluster, failing over the disk's persistent reservation (PR) requires some downtime. For more detail, please refer to the following information:

Cluster Shared Volumes (CSV): Disk Ownership

http://blogs.msdn.com/b/clustering/archive/2009/03/02/9453288.aspx

How to add storage to Clustered Shared Volumes in Windows Server 2012

http://blogs.msdn.com/b/clustering/archive/2012/04/06/10291490.aspx

Use Cluster Shared Volumes in a Failover Cluster

https://technet.microsoft.com/en-us/library/JJ612868.aspx

I'm glad to be of help to you!

August 5th, 2015 7:35am

I should note that when I switch to a Scale-Out File Server role with Cluster Shared Volumes, the client PC continues to run without failure.  There is still a 120 second delay, but the failover succeeds and the client keeps operating.  My goal, however, is to understand why the standard file server role didn't recover from this failure and what settings I can tune to make it work.

Because SOFS uses not "generic" shares but so-called CA, or Continuously Available, SMB3 file shares. These behave differently (for example, the cache is write-through rather than aggressive write-back) and are designed with fast, transparent failover in mind. 

SMB Transparent Failover

http://blogs.technet.com/b/clausjor/archive/2012/06/07/smb-transparent-failover-making-file-shares-continuously-available.aspx
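
If it helps, whether a given share is actually marked Continuously Available can be queried from a client with NetShareGetInfo at info level 1005, which returns the share flags.  A rough sketch (the server and share names below are placeholders; the flag value is guarded in case of an older SDK):

#include <windows.h>
#include <lm.h>
#include <stdio.h>
#pragma comment(lib, "netapi32.lib")

#ifndef SHI1005_FLAGS_ENABLE_CA
#define SHI1005_FLAGS_ENABLE_CA 0x4000   // from lmshare.h; defined here only for older SDKs
#endif

int wmain(void)
{
    SHARE_INFO_1005 *info = NULL;
    // Ask the file server for the share's flags (server/share names are placeholders).
    NET_API_STATUS rc = NetShareGetInfo(L"\\\\DATAFS", L"Share1", 1005, (LPBYTE *)&info);
    if (rc != NERR_Success) {
        printf("NetShareGetInfo failed - Error %lu\n", rc);
        return 1;
    }

    printf("Continuously Available: %s\n",
           (info->shi1005_flags & SHI1005_FLAGS_ENABLE_CA) ? "yes" : "no");

    NetApiBufferFree(info);
    return 0;
}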

Hope this helped :)

August 5th, 2015 9:06am

Because SOFS uses not "generic" shares but so-called CA, or Continuously Available, SMB3 file shares. These behave differently (for example, the cache is write-through rather than aggressive write-back) and are designed with fast, transparent failover in mind. 

Thanks for the link.  I'll have to research that further when I have a moment.  Here is my concern, though: how come, with a standard (not SOFS) clustered file server, SMB3 recovers from a complete node failure (e.g. unexpected power loss)?  Node 2 takes over from Node 1, and the client PC sees no API failures (other than ReadFile() taking ~30 seconds to complete).  Here I'm creating a less severe failure than a complete node loss: Node 1 remains alive but has lost access to the SAN, while Node 2 retains full access.  In this case the ReadFile() API fails with an ERROR_INVALID_HANDLE error code.  Why can't it recover like it does during a full node failure?
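
For what it's worth, without a CA share the SMB handle does not appear to survive the disk moving to the other node, so the only application-level workaround I can think of is to treat the handle-loss error as a signal to reopen the file and retry.  A hypothetical sketch of that (the helper name and retry policy are mine, not anything SMB requires):

#include <windows.h>

// Hypothetical helper: retry a read by reopening the file when the handle is
// invalidated during a failover. A CA (Continuously Available) SMB3 share makes
// this unnecessary because the handle is recovered transparently.
BOOL ReadWithReopen(const char *path, void *buf, DWORD len, HANDLE *ph)
{
    for (int attempt = 0; attempt < 2; attempt++) {
        DWORD bytesRead = 0;
        if (ReadFile(*ph, buf, len, &bytesRead, NULL))
            return TRUE;

        DWORD err = GetLastError();
        if (err != ERROR_INVALID_HANDLE && err != ERROR_NETNAME_DELETED)
            return FALSE;                       // some other failure; give up

        // The old handle is gone: close it and try to reopen the file once the
        // file server role is back online on the surviving node.
        CloseHandle(*ph);
        *ph = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                          OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
        if (*ph == INVALID_HANDLE_VALUE)
            return FALSE;

        LARGE_INTEGER zero = {0};
        SetFilePointerEx(*ph, zero, NULL, FILE_BEGIN);
    }
    return FALSE;
}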
August 7th, 2015 3:25pm

"What I do is go into device manager, find the NIC(s) that are configured for the iSCSI network, and then I disable them."

You are causing a multiple network failure.  Clusters are designed to ride through a single failure.  If you lose a single NIC and you have properly configured MPIO, you should not experience a failure.  But you are injecting a double failure by disabling both NICs.  I've seen this multiple times where people forget that for true HA they have to have two fabrics for their network.  They put both paths to a single network switch.  When that network switch fails, the cluster fails because it has had, in essence, multiple network failures and not just a single failure.  Your experiment is simulating a configuration where both NICs are connected to the same network switch and that switch fails.

Yes, a node failure seems to be the same because both NICs fail when the node goes down, but the overriding event in a node failure is the node failing, not the NICs failing.

August 10th, 2015 2:40pm

Hi,

I would like to confirm the current situation. If there is anything that I can do for you, please do not hesitate to let me know, I will be happy to help.

August 27th, 2015 2:17am

Hey Tim,

Sorry for my delayed reply.  I am creating a single point of failure in this experiment.  I just replicated it.  It is a very basic configuration, not something you would use in the field, but something to prove out this issue.

Cluster Node 1 and 2 have the same configuration:

  • Network Adapter 1 for typical network communication
  • Network Adapter 2 for iSCSI connectivity

I have a standard file server role (not SOFS).  Cluster Node 1 is the current host server and the Owner Node for the File Server role.  I have a client PC performing a looped unbuffered read from a shared folder on the file server.  It will report any failure as well as any delayed execution times.

While everything is running, I go to Cluster Node 1 and disable Network Adapter 2 (i.e. iSCSI connectivity).  This is my single point of failure: one iSCSI adapter going down.  I realize that, in an optimal configuration, you'd have multiple iSCSI network adapters.  I'm just trying to validate that a cluster failover will occur.

When I perform this test, the client app will fail like you see here:


07:09:07 - Longest Delay ReadFile API       = 120.023968s (previous 0.052372s)
Unexpected error reading/writing file - Error 6
07:14:18 - Longest Delay CloseHandle API    = 311.379720s

The 120 second delay is a combined 30 seconds for iSCSI to remove the physical paths and an additional 90 seconds for MPIO to remove the pseudo-LUNs.  (Error 6 is ERROR_INVALID_HANDLE, the same error I described earlier.)  Instead of failing over, the cluster role simply failed.  After that, the test app tries to clean up by calling CloseHandle, which takes 5+ minutes to return (and does eventually succeed).

On Cluster Node 1, the File Server role is now in the "Stopped" state.  The following cluster event is recorded:

Clustered role 'TESTFS' has exceeded its failover threshold.  It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state.  No additional attempts will be made to bring the role online or fail it over to another node in the cluster.  Please check the events associated with the failure.  After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

Everything has been configured using default settings.  If I switch to SOFS/CSV, this experiment passes.

September 4th, 2015 10:30am

My guess is that the validation wizard throws some warnings on this configuration, because I believe you have built a single point of failure (SPoF) into your environment.  As you state, "I realize that, in an optimal configuration, you'd have multiple iSCSI network adapters."  I would not consider a cluster without redundant storage paths.

Then you say "an additional 90 seconds for MPIO to ..."  You have configured MPIO when you have only a single path to storage?  That doesn't make sense. MPIO is configured to ensure the host does not recognize a disk more than once when multiple paths are defined. Have you tried it without the MPIO software?

September 4th, 2015 6:43pm
