Server 2012 R2 Cluster Failover issues during catastrophic failure of iSCSI shared storage on Node 1 of 2 (Network Steve Forum)

Server 2012 R2 Cluster Failover issues during catastrophic failure of iSCSI shared storage on Node 1 of 2

I am running an experiment with clustered servers to understand failover timing and failover recovery. I have a couple of questions I was hoping this forum could help me with. My setup is:

Two clustered 2012 R2 servers (Node 1 and Node 2) - all updates installed - same subnet - shared iSCSI storage
One file server role running
Node 1 is current host server
Node 1 is owner node for file server role
Node 1 is owner node for quorum disk and for file server disk
Single file share \\MIKEFS

On the file share, I placed a 64 KB file. On a client PC (2012 R2) with access to the share, I ran a simple executable I wrote that does the following:

Get a handle to the file (CreateFile -> FILE_FLAG_NO_BUFFERING)
Set file position to 0 and synchronous ReadFile of 64 KB / Track execution time
Loop forever on #2 until any failure occurs
Close the file handle (CloseHandle)

During the middle of the test, I removed power from Node 1. The following sequence of events occurred:

8:20:01 - Node 1 Power Loss
8:20:07 - Node 2 records Event ID 1135 (i.e. reports node 1 was removed from the active failover cluster membership)
8:20:39 - 64 KB test application, performing the read test, takes 37.6 seconds to return from ReadFile API
Client test application continued to run without any failure (just that one long delay noted above)

Node 2 determined relatively quickly that Node 1 was lost. The default heartbeat settings help explain the time from the power loss to the time Node 2 reports the loss (SameSubnetDelay=1000 and SameSubnetThreshold=5).

So ~5 seconds of the 37.6 seconds was due to the heartbeat settings. The remaining ~32.6 seconds was the time to failover the node, mount the volumes, etc. How can I determine what was going on during this time? What is taking the longest amount of time? Are there any settings that could adjust these timings? I looked at cluster.log but there is a lot of detail in there and it's difficult to determine why it takes a considerable amount of time from the point the node detects the failover.

Merged by Alex LvMicrosoft contingent staff, Moderator Friday, July 31, 2015 6:24 AM repeat

July 29th, 2015 5:50pm

On this posting, I detailed a simple two node Server 2012 R2 cluster configured with a single file server role. Both nodes accessing shared iSCSI storage. Client system is performing a simple looped 64 KB read on a file share from the clustered server. This test applet will fail any time the ReadFile() API fails.

When one node experiences a complete power loss, the other node will take over and no failure will occur on the client system (other than an ~30 second delay where the synchronous ReadFile() API does not return). This was detailed in the other post. The key point is that failover completed successfully and NO failure was seen on the client PC.

I want to test out a different failover scenario so with Node 1 being the current host, and the owner node for the file server role, quorum disk, and shared storage, I simulate a catastrophic failure of Node 1s iSCSI storage. The OS itself boots from a virtual ATA drive so the OS drive will remain alive. What I do is go into device manager, find the NIC(s) that are configured for the iSCSI network, and then I disable them. This causes all iSCSI I/O activity to fail. When I perform this test, the following is seen from the client PC:

11:20:01 - Node 1 loss of iSCSI shared storage
11:22:01 - Synchronous ReadFile() returns after 120 seconds with ERROR_INVALID_HANDLE
11:22:20 - CloseHandle() completes successfully after 19 seconds

I understand the 120 second delay. Its a combination of the iSCSI link down timer as well as the MPIO PDO remove period timer. What I dont understand is why the request failed. I expected a failover and successful completion of the ReadFile like I saw when I did a power loss of Node 1.

Whats interesting is that Windows did move the host server, file server, and disk storage, to Node 2. So a failover did occur. At least this time it did. Other times, I have seen the file server role stopped. It appears that the failover occurred immediately after the client PC received the read failure. Why not complete the failover before failing the client ReadFile request? Are there timeout adjustments I could experiment with?

Here is Node 1's event log which shows the failover sequence from Node 1's perspective starting just after the catastrophic iSCSI loss on Node 1:

ProviderName: Microsoft-Windows-FailoverClustering

TimeCreated Id LevelDisplayName Message
----------- -- ---------------- -------
7/29/2015 11:20:06 AM 1132 Information Cluster network interface 'TESTCLUSTER1 - Ethernet 2' for node 'TESTCLUSTER1' on network 'Cluster Network 1' was removed.
7/29/2015 11:21:38 AM 1649 Information Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has taken more than one minute to respond to a control code. The control code was 'STORAGE_GET_DISK_INFO_EX'.
7/29/2015 11:22:01 AM 5264 Information Physical Disk resource 'd47df305-c3a6-4bbe-8475-48e1398bbee6' has been disconnected from this node.
7/29/2015 11:22:01 AM 5264 Information Physical Disk resource 'b543b5ab-67c9-4836-89a1-ae5636916de5' has been disconnected from this node.
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Online to state ProcessingFailure.
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 1' is waiting on the following resources: .
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Terminating to state DelayRestartingResource.
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Online to state ProcessingFailure.
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 2' is waiting on the following resources: File Server (\\DATAFS).
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToTerminate. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: .
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state Terminating to state WaitingToComeOnline. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: Cluster Disk 2.
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:01 AM 1637 Information Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Terminating to state DelayRestartingResource.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state DelayRestartingResource to state OnlineCallIssued.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state OnlineCallIssued to state ProcessingFailure.
7/29/2015 11:22:02 AM 1633 Information The Cluster service failed to bring clustered role 'Cluster Group' completely online or offline. One or more resources may be in a failed or an offline state. This may impact the availability of the clustered role.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 1' is waiting on the following resources: .
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Disk 1' in clustered role 'Cluster Group' has transitioned from state Terminating to state CannotComeOnlineOnThisNode.
7/29/2015 11:22:02 AM 1153 Information The Cluster service is attempting to fail over the clustered role 'Cluster Group' from node 'TESTCLUSTER1' to node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state DelayRestartingResource to state OnlineCallIssued.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state OnlineCallIssued to state ProcessingFailure.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state ProcessingFailure to state WaitingToTerminate. Cluster resource 'Cluster Disk 2' is waiting on the following resources: File Server (\\DATAFS).
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state WaitingToTerminate to state Terminating.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'File Server (\\DATAFS)' in clustered role 'DATAFS' has transitioned from state WaitingToComeOnline to state OfflineDueToProvider. Cluster resource 'File Server (\\DATAFS)' is waiting on the following resources: Cluster Disk 2.
7/29/2015 11:22:02 AM 1203 Information The Cluster service is attempting to bring the clustered role 'Cluster Group' offline.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'Cluster IP Address' is waiting on the following resources: Cluster Name.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'Cluster Name' is waiting on the following resources: .
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Disk 2' in clustered role 'DATAFS' has transitioned from state Terminating to state CannotComeOnlineOnThisNode.
7/29/2015 11:22:02 AM 1204 Information The Cluster service successfully brought the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflineCallIssued to state OfflinePending.
7/29/2015 11:22:02 AM 1153 Information The Cluster service is attempting to fail over the clustered role 'DATAFS' from node 'TESTCLUSTER1' to node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM 1203 Information The Cluster service is attempting to bring the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'DATAFS' is waiting on the following resources: .
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state Online to state WaitingToGoOffline. Cluster resource 'IP Address 10.18.236.0' is waiting on the following resources: DATAFS.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflineCallIssued to state OfflinePending.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflinePending to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster Name' in clustered role 'Cluster Group' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state OfflineCallIssued to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'Cluster IP Address' in clustered role 'Cluster Group' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM 1204 Information The Cluster service successfully brought the clustered role 'Cluster Group' offline.
7/29/2015 11:22:02 AM 1641 Information Clustered role 'Cluster Group' is moving to cluster node 'TESTCLUSTER2'.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflinePending to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'DATAFS' in clustered role 'DATAFS' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state WaitingToGoOffline to state OfflineCallIssued.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state OfflineCallIssued to state OfflineSavingCheckpoints.
7/29/2015 11:22:02 AM 1637 Information Cluster resource 'IP Address 10.18.236.0' in clustered role 'DATAFS' has transitioned from state OfflineSavingCheckpoints to state Offline.
7/29/2015 11:22:02 AM 1204 Information The Cluster service successfully brought the clustered role 'DATAFS' offline.
7/29/2015 11:22:02 AM 1641 Information Clustered role 'DATAFS' is moving to cluster node 'TESTCLUSTER2'.

Free Windows Admin Tool Kit Click here and download it now

July 29th, 2015 7:03pm

I should note that when I switch to a scale-out filter server type, with Cluster Shared Volumes, the client PC continues to run without failure. There is still a 120 second delay but the failover was successful and the client PC continued operating without failure. My goal, however, is to understand why the standard file server role didn't recover from this failure and what settings I can tune to make it work.

July 29th, 2015 7:26pm

Hi,

I am trying to involve someone more familiar with this issue. There might be some time delay. Appreciate your patience.

Thanks for your understanding and support.

Free Windows Admin Tool Kit Click here and download it now

July 31st, 2015 6:20am

Hi BT Mike,

Does your read application on the CSV disk? With CSV in cluster it need some down time to failover the disk PR, more detail please refer the following information:

Cluster Shared Volumes (CSV): Disk Ownership

http://blogs.msdn.com/b/clustering/archive/2009/03/02/9453288.aspx

How to add storage to Clustered Shared Volumes in Windows Server 2012

http://blogs.msdn.com/b/clustering/archive/2012/04/06/10291490.aspx

Use Cluster Shared Volumes in a Failover Cluster

https://technet.microsoft.com/en-us/library/JJ612868.aspx

Im glad to be of help to you!

August 5th, 2015 3:36am

This topic is archived. No further replies will be accepted.