Server 2012 R2 Witness Disk won't fail over

Hi,

I have a 2-node cluster with a shared witness disk for quorum. When I lose the connection to the disk from one node, ownership fails over to the other node, which is what I expect. However, if I test again shortly afterwards, it doesn't fail over; it just goes into the offline state, and I have to move it manually before it comes online on the other node.

It doesn't matter how many times I test this; it simply doesn't fail over after that first attempt. However, several hours later (I slept and tested it the next morning), it again fails over correctly, but subsequent tests bring it offline. I have looked through every tab in the properties for the witness, cluster name and IP address and cannot find anything that could relate to this timeout of several hours. All values are at their defaults, except that I increased the maximum number of failures in the specified period from 1 to 10, but that made no difference.

I get the following two events

ID 1038

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it

ID 1069

Cluster resource 'Cluster Disk 1' of type 'Physical Disk' in clustered role 'Cluster Group' failed.

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

How can I make the witness disk fail over more than once in several hours? I'm wondering if it's a PowerShell-only configuration, but I have no idea what command that would be.

thanks

Steve

May 25th, 2015 9:12am

That is the expected behavior. There are thresholds to prevent resources from ping-ponging between nodes. When you are in your lab yanking cables and chaining multiple failures you can hit them, but the policies are designed for production deployments. I recommend not modifying your production servers based on lab testing scenarios.

If you look in the System event log, there should be an event logged saying that the failure thresholds have been hit and no further attempts to bring the resource online will be made at this time. After 1 hour, the cluster will automatically try to restart the resources.
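
To find those entries quickly, the System log can be filtered from PowerShell. This is a minimal sketch, not a definitive query: IDs 1038 and 1069 are the ones quoted earlier in this thread, while ID 1254 is my assumption for the event that reports an exceeded failover threshold, so adjust the list to whatever your log actually shows.

```powershell
# Sketch only: list recent failover-clustering events from the System log.
# 1038 and 1069 are quoted in this thread; 1254 is an ASSUMED ID for the
# "failover threshold exceeded" event and may need changing.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1038, 1069, 1254
    StartTime    = (Get-Date).AddDays(-1)   # only the last 24 hours
}

# Get-WinEvent raises an error when nothing matches, hence SilentlyContinue.
Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, Message -Wrap
```

Run it on each node, since every node keeps its own System log.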

Thanks!
Elden

May 25th, 2015 4:58pm

This is exactly what I suspected, but I have a few questions:

1. How can I actually see what that time period is? It's not mentioned on the property pages. I notice 6 hours for a VM resource, so I would assume it could be the same, but where can I see it for sure for the witness disk? I suspect it's only visible with PowerShell.

2. How can I reduce it specifically for my testing needs? I'd set it back to the default after I am finished. Surely customers should be able to change it even if it's not advised, since everyone's circumstances are different. My testing scenario becomes a really long, drawn-out process if I have to wait 6 hours between automatic failovers.

"After 1 hour the resource will be restarted automatically": in plain English, does this mean that if I left the witness in an offline state it would actually attempt to fail over to the other node after 1 hour, or will it just try to bring it online on the existing node (which will fail, because under this particular scenario it can't be brought online manually; it must be moved to the other node)?

thanks

Steve

May 25th, 2015 6:59pm

Hi,

You could use a file share witness or a cloud witness instead. Much easier, and no extra disk needed.

Better and faster.

May 27th, 2015 11:11am

"1. how to actually see what that time period is, it's not mentioned on property pages."

Yes, the information is all there in the property page.  Here is the property page from my witness disk.

As for some of your other questions, we would need to see what settings you have on the witness and/or cluster resource to know how to answer them.  If you have left things at their default (the recommendation) it should automatically start properly in the cluster.  If you have made changes, actual results will vary.

May 27th, 2015 12:09pm

Hi, this is the settings page I have looked at, and I thought it would be in here. But if I wait 15 minutes it doesn't make a difference, and if I wait 1 hour it doesn't make a difference; it only fails over after waiting a number of hours since the last failure. I could waste a lot of time waiting 1 hour, then 2 hours, then 3 hours and so on until I find out what that failover reset time is. Are there any PowerShell commands that will list all settings applicable to the witness disk?

Yes, I have left all other settings at their defaults.

Robert, using a file share witness as an alternative isn't the point. I could easily do this, sure, but we don't have a third separate location to place it at the moment. (Cloud witness is only in Server 2016, right? We only have one internet path anyway, which is on the same side as one of the nodes: lose the building, lose the net, lose the cluster.) Besides, this should be a setting we can change somewhere, and it's educational to know how to do this.

thanks for your help

Steve

May 27th, 2015 2:03pm

Is the disk assigned to a Role?  Or is it just sitting in Available Storage?
May 27th, 2015 9:02pm

It's the witness disk.
May 28th, 2015 2:54am

There is no option for moving the witness disk on its own. This disk is bound to the core cluster resources, so setting an option on the disk will do no good.

You can move the cluster group with PowerShell: Move-ClusterGroup

So if you move the cluster CNO (core cluster resources), the witness will move to another node. This can also be found on the right side of the FCM menu under More Actions.
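
As a rough sketch of that move, assuming the core group has the default name 'Cluster Group' and using NODE2 as a purely hypothetical node name:

```powershell
Import-Module FailoverClusters

# 'NODE2' is a hypothetical node name; substitute a real node of your cluster.
$targetNode = 'NODE2'

# Moving the core cluster group takes the witness disk along with it.
Move-ClusterGroup -Name 'Cluster Group' -Node $targetNode

# Confirm which node now owns the core resources, including the witness.
Get-ClusterGroup -Name 'Cluster Group' | Get-ClusterResource |
    Format-Table Name, ResourceType, OwnerNode, State
```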

May 28th, 2015 3:11pm

I know all this; tell me something I don't already know. I am trying to find out why the core resources can only fail over once in x number of hours: how do I find what that exact figure is, and is there a way to change it? I suspect the time period is 6 hours because 1) it was long enough for me to go to sleep, test again the next morning and have it work, and 2) the virtual machines have a period of 6 hours between failures, so I assume other resources including the witness would also have this value. But I need to know how to change it.

What PowerShell options do I have for Cluster Group? This might be going in the right direction. The thing is, though, it is specifically the disk that doesn't fail over until a certain period of time has passed; the group itself I haven't tested. Maybe there is a setting that covers both, but I can't find it in the properties of the group name or the witness disk, so if it's there it must be in the PowerShell settings somewhere.

thanks

Steve

May 28th, 2015 3:31pm

"what powershell options do I have for Cluster Group "

PS C:\Users\administrator> help clustergroup

Name                              Category  Module
----                              --------  ------
Add-ClusterGroup                  Cmdlet    FailoverClusters
Get-ClusterGroup                  Cmdlet    FailoverClusters
Move-ClusterGroup                 Cmdlet    FailoverClusters
Remove-ClusterGroup               Cmdlet    FailoverClusters
Start-ClusterGroup                Cmdlet    FailoverClusters
Stop-ClusterGroup                 Cmdlet    FailoverClusters
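
Of these, Get-ClusterGroup is the one that exposes the group's own failover policy. The FailoverThreshold and FailoverPeriod common properties (the period is expressed in hours) look like the closest match to the "maximum failures in the specified period" setting shown on a VM role's Failover tab. Whether they explain the several-hour wait described in this thread is an assumption, not something confirmed here, and the value 10 below is only a hypothetical lab setting.

```powershell
Import-Module FailoverClusters

# View the failover policy of the core cluster group, which owns the witness.
Get-ClusterGroup -Name 'Cluster Group' |
    Format-List Name, OwnerNode, State, FailoverThreshold, FailoverPeriod

# Lab only: allow more failovers per period, then restore the original value.
$group = Get-ClusterGroup -Name 'Cluster Group'
$originalThreshold = $group.FailoverThreshold
$group.FailoverThreshold = 10        # hypothetical value for test runs

# ... run the failure tests here ...

$group.FailoverThreshold = $originalThreshold
```

Saving the original value first keeps the change reversible, which matters given the earlier advice not to tune production servers around lab scenarios.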

May 28th, 2015 6:26pm

Thanks Tim, it was useful to look through those, but I don't see anything relating to what I'm after.

Just a quick recap of my problem:

1. I cause a failure of the witness disk.

2. The ownership moves to another node (great).

3. I fix the issue and wait several minutes.

4. I cause the same failure of the witness disk.

5. The ownership does not move to another node; it remains in an offline state.

The time between steps 3 and 4 has to be significant in order to see what should happen in step 2. I suspect this time frame is maybe 6 hours or longer, because I tried after 15 minutes, 1 hour and even 2 hours, and that isn't long enough. My problem is how to adjust that 6-hour timeout, or whatever that value is.

I notice that the properties of a VM have a Failover tab, which states:

"Specify the number of times the cluster service will attempt to restart or fail over the clustered role in the specified period.

If the clustered role fails more than the maximum in the specified period, it will be left in the failed state."

This last sentence is exactly the problem I am having. On the VM, the maximum number of failures is 1 and the period is 6 hours.

I am assuming that there is such a setting somewhere for the quorum/witness, but I can't find it. I suspect that I am hitting 1 failure in a 6-hour period, and if I don't wait 6 hours between steps 3 and 4, I won't see what should happen in step 2; it just fails and stays offline.

I don't see a Failover tab in the witness disk properties. I see Policies and Advanced Policies, but not Failover, and no setting that obviously matches what I am after. I can see the "period for restarts" set to 15 minutes, but as I said, waiting 1 hour doesn't seem to work.

I know it might seem a bit picky that I want to change this setting, but when you are testing and trying to understand the weak points in a solution, 6 hours is a long time to wait between tests. If I had 3 tests, each of which could cause the witness to get disconnected, my testing would suddenly take over 18 hours when I could have done all 3 in about 15 minutes and seen and noted the results.

We want to build regular testing into a schedule to ensure our solution and configuration are still valid over time (I don't know, maybe test once every 3-6 months), but if 15 minutes is going to turn into a day, that's just ridiculous.

The thread "Witness Does Not Failover" describes my problem exactly. However, that poster found an answer by reading the two TechNet links, and I don't understand how, because they relate to Server 2003. I'm a decade further on and can't relate them to what he found as the answer, and he never explained on the thread precisely what he did.

I'd appreciate any further insight anyone may have on this.

thanks

Steve

May 29th, 2015 11:18am

I guess it depends upon how daring you are in trying to change things. <grin>

You can readily view all the properties via PowerShell for the disk witness:

PS C:\> get-clustergroup -cluster <clusname> "cluster group" | get-clusterresource <witness> | fl *

Cluster                 : VMHost-MgmtClus
IsCoreResource          : True
IsNetworkClassResource  : False
IsStorageClassResource  : True
OwnerNode               : VMHost-Mgmt03
ResourceType            : Physical Disk
State                   : Online
OwnerGroup              : Cluster Group
Name                    : vmhost-mgmt-witness
MaintenanceMode         : False
MonitorProcessId        : 4988
Characteristics         : Quorum, BroadcastDelete, MonitorReattach
Description             :
SeparateMonitor         : True
StatusInformation       : 0
PersistentState         : 1
LastOperationStatusCode : 0
LooksAlivePollInterval  : 4294967295
IsAlivePollInterval     : 4294967295
RestartAction           : 2
RestartThreshold        : 1
EmbeddedFailureAction   : 2
RestartDelay            : 500
RestartPeriod           : 900000
RetryPeriodOnFailure    : 3600000
PendingTimeout          : 180000
DeadlockTimeout         : 300000
ResourceSpecificData1   : 0
ResourceSpecificData2   : 0
ResourceSpecificStatus  :
Id                      : 7291bccc-f179-4ef3-aa28-f06586bb0000
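
The timing fields in that dump are in milliseconds, which makes them hard to read at a glance. A small sketch that converts the values copied from the output above into time spans (no cluster needed to run it):

```powershell
# Values copied from the property dump above; all are in milliseconds.
$timings = [ordered]@{
    RestartDelay         = 500
    RestartPeriod        = 900000
    RetryPeriodOnFailure = 3600000
    PendingTimeout       = 180000
    DeadlockTimeout      = 300000
}

# Convert each value to a TimeSpan so it displays as hh:mm:ss.
$readable = foreach ($name in $timings.Keys) {
    [pscustomobject]@{
        Property = $name
        Duration = [TimeSpan]::FromMilliseconds($timings[$name])
    }
}

$readable | Format-Table -AutoSize
```

RestartPeriod works out to 15 minutes, matching the "period for restarts" on the Policies tab, and RetryPeriodOnFailure works out to 1 hour, matching the retry interval mentioned earlier in the thread.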

Then you can alter any value:

PS C:\> (get-clusterresource -cluster <clusname> -name <witness>).PersistentState = 0

I would compare the disk witness resource output (first cmdlet) with another resource that is working the way you want it to work, and then alter fields which differ to see if anything makes a difference.  I would definitely perform this in a test environment in case one of the changes completely breaks the cluster.
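
One way to do that comparison is sketched below. Compare-ResourceProperty is a hypothetical helper name, not a FailoverClusters cmdlet, and the second object's values are made up so the sketch runs without a cluster.

```powershell
# Hypothetical helper: report policy properties whose values differ
# between two resource objects.
function Compare-ResourceProperty {
    param(
        [Parameter(Mandatory)] $Reference,
        [Parameter(Mandatory)] $Difference,
        [string[]] $Property = @(
            'RestartAction', 'RestartThreshold', 'RestartDelay',
            'RestartPeriod', 'RetryPeriodOnFailure',
            'PendingTimeout', 'DeadlockTimeout'
        )
    )
    foreach ($p in $Property) {
        if ($Reference.$p -ne $Difference.$p) {
            [pscustomobject]@{
                Property   = $p
                Reference  = $Reference.$p
                Difference = $Difference.$p
            }
        }
    }
}

# Stand-in for the witness, using the values from the dump above.
$witness = [pscustomobject]@{
    RestartAction = 2; RestartThreshold = 1; RestartDelay = 500
    RestartPeriod = 900000; RetryPeriodOnFailure = 3600000
    PendingTimeout = 180000; DeadlockTimeout = 300000
}

# Made-up second resource that differs only in RestartThreshold.
$other = [pscustomobject]@{
    RestartAction = 2; RestartThreshold = 3; RestartDelay = 500
    RestartPeriod = 900000; RetryPeriodOnFailure = 3600000
    PendingTimeout = 180000; DeadlockTimeout = 300000
}

$diff = Compare-ResourceProperty -Reference $witness -Difference $other
$diff | Format-Table -AutoSize
```

On a real cluster, the two stand-in objects would be replaced with the output of get-clusterresource for the witness and for the resource you are comparing it against.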

May 29th, 2015 2:14pm

One of these options must be what I'm looking for, thanks. What does PersistentState do?

I will try to take a look over the weekend if I get a chance, thank you.

Steve

May 30th, 2015 7:05am
