Exchange DAG Node Down

I have a two-node (node1 and node2) Exchange 2013 DAG with an FSW quorum on a non-Exchange server (server), all within the same site.  All databases are mounted on the first node, which is working; the database copies are on the second node, which is 'down'.

A few times I have seen one of the DAG nodes go 'offline' in Failover Cluster Manager. In the past it has resolved itself, or I have taken various steps to resolve it.  This time, however, it will not come online again and I cannot find any way to fix it.

Windows event errors include (repeatedly):

1564

File share witness resource 'File Share Witness (\\server\EX15DAG.domain.net)' failed to arbitrate for the file share '\\server\EX15DAG.domain.net'. Please ensure that file share '\\server\EX15DAG.domain.net' exists and is accessible by the cluster.

1573

Node 'Node2' failed to form a cluster. This was because the witness was not accessible. Please ensure that the witness resource is online and available.

C:\Windows\system32>cluster node (from the 'working node')
Listing status for all available nodes:

Node           Node ID Status
-------------- ------- ---------------------
NODE1                1 Up
NODE2                2 Down

C:\Windows\system32>cluster node (from the not working node)
Listing status for all available nodes:

Node           Node ID Status
-------------- ------- ---------------------
NODE1                1 Down
NODE2                2 Joining

The above shows two different results depending on which Exchange node I run it from. Interesting...
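
For anyone comparing the two views by hand, here is a minimal Python sketch that parses the text output of `cluster node` as captured from each node and lists the nodes whose status differs between the two views. It is not part of cluster.exe; the function names are my own, and it assumes the three-column layout shown above.

```python
import re

def parse_cluster_nodes(output):
    """Return {node name: status} from the text printed by `cluster node`."""
    nodes = {}
    for line in output.splitlines():
        # Data rows look like: NAME <spaces> NUMERIC-ID <spaces> STATUS.
        # Header, separator, and banner lines have no numeric second column.
        m = re.match(r"^\s*(\S+)\s+(\d+)\s+(\S.*?)\s*$", line)
        if m:
            nodes[m.group(1).upper()] = m.group(3)
    return nodes

def disagreements(view_a, view_b):
    """Return {node: (status in view_a, status in view_b)} where they differ."""
    names = sorted(set(view_a) | set(view_b))
    return {n: (view_a.get(n), view_b.get(n))
            for n in names if view_a.get(n) != view_b.get(n)}

# Output pasted from the two nodes in this thread.
FROM_NODE1 = """Listing status for all available nodes:

Node           Node ID Status
-------------- ------- ---------------------
NODE1                1 Up
NODE2                2 Down
"""

FROM_NODE2 = """Listing status for all available nodes:

Node           Node ID Status
-------------- ------- ---------------------
NODE1                1 Down
NODE2                2 Joining
"""

if __name__ == "__main__":
    diff = disagreements(parse_cluster_nodes(FROM_NODE1),
                         parse_cluster_nodes(FROM_NODE2))
    for node, (a, b) in diff.items():
        print(f"{node}: node1 sees {a!r}, node2 sees {b!r}")
```

When each node reports the other as not 'Up', the two sides do not agree on cluster membership, which is consistent with a communication problem between them rather than with a single failed service.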

The time on my DC and Exchange servers is in sync.
I can access the FSW and confirm the share is present and being updated by the working node. The Exchange Trusted Subsystem has Full Control on both the share permissions and the NTFS security.
We do not use a firewall internally, and I can confirm there are no firewall issues.
I do not have any antivirus running, so nothing is being blocked or interfered with there.
I can ping all hosts involved from all machines (both nodes, FSW, DC, DAG DNS name, everything).
I have restarted the failed node and also the working node.

It was all fine until an unexpected host/VM failure and restart.

I figure I can remove all the database copies from the failed node, then evict it from the cluster, and start it again, but if I can just get it joined again properly I would much prefer that.

August 4th, 2014 12:11am

Hi,

Please make sure the witness file share is online, and check that the quorum configuration is Node and File Share Majority.

Here is a related article for your reference.

http://technet.microsoft.com/en-us/library/cc756221(v=ws.10).aspx

Please exclude the following directory from antivirus active scanning:

'\\server\EX15DAG.domain.net'

And please make sure the following have read and write permissions on the file share on the witness server:

Domain Admin
Cluster nodes
Cluster Name

If the issue still persists, you can recreate the FSW for the DAG and check the result.

Best regards,
Belinda

August 4th, 2014 10:12am

Hi,

Even though the time is correct, can you confirm your time zones are correct as well?

If you make any changes you will need to reboot your machines.

August 4th, 2014 10:14am

Please ensure that the Cluster Service account has full permission on the FSW share. I also suggest using the FQDN in the share path: '\\server.FQDN\EX15DAG.domain.net'

Please ensure that the Cluster Service on both nodes is started by an Active Directory service account and not by "SYSTEM".

August 6th, 2014 7:09am

Does the share allow more than one connection at a time?
August 6th, 2014 3:55pm

Yes, it does. However, the file gets locked by one node at a time so that node can obtain its vote. The other node simply accesses the share and updates its information.
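
As a rough illustration of why the witness vote matters in a two-node DAG: with Node and File Share Majority there are three votes (two nodes plus the witness), and a side needs more than half of them to form or keep the cluster. The Python sketch below is a simplified model only. It ignores dynamic quorum, and the function and constant names are hypothetical.

```python
def has_quorum(votes_present, total_votes):
    """True when the votes held are a strict majority of all votes."""
    return votes_present > total_votes // 2

# Assumed layout from this thread: node1 + node2 + file share witness.
TOTAL_VOTES = 3

if __name__ == "__main__":
    # A node that cannot reach its partner or the witness holds 1 of 3 votes.
    print("node alone:", has_quorum(1, TOTAL_VOTES))
    # A node that holds the witness lock holds 2 of 3 votes.
    print("node + witness:", has_quorum(2, TOTAL_VOTES))
```

Under this model a node that can see neither the other node nor the witness cannot form a cluster on its own, which matches the wording of event 1573 in the original post.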

August 6th, 2014 4:21pm

Confirmed. 

Time zones on both exchange mailbox servers are UTC+10 (Sydney), as well as the FSW.

August 6th, 2014 11:58pm

Hi Belinda,

Everything you suggested I had already done, as mentioned in my original post.  I have also recreated the FSW.

August 7th, 2014 1:10am

Currently the Cluster service DOES run under the Local System account.  What account should it be running as, if not this?
August 7th, 2014 4:39am

Some additional information.

I can remove the working node1 from the DAG and then add the second node successfully, but when I then try to add the first node back in, I cannot, with exactly the same issues.

Eventually the process to add the DAG member fails with:

A server-side database availability group administrative operation failed with a transient error. Please try the operation
again. Error: An error occurred while attempting a cluster operation. Error: Cluster API failed: "AddClusterNode()
(MaxPercentage=100) failed with 0x5b4. Error: This operation returned because the timeout period expired" [Server:

August 7th, 2014 5:20am

Case closed.

I moved both Exchange VMs to the same host and it instantly added the second node.

August 7th, 2014 6:43am

Moving two VMs to the same host doesn't buy you anything. You'd be better off with no VMs. But if this is just a test lab for practice, then that doesn't matter.

The error message "AddClusterNode() (MaxPercentage=100) failed with 0x5b4. Error: This operation returned because the timeout period expired" means that the cluster node was able to join the cluster, but when it then tried to heartbeat all of the existing nodes, at least one of them wouldn't heartbeat. The fact that moving to the same host fixed the issue leads me to believe the problem was with network routes.
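
One way to start testing the network-path theory is to check whether each node can reach the other on the cluster port, not just ping it. The Python sketch below is a minimal example under stated assumptions: it assumes the cluster service is listening on its default port 3343, and it only tests a TCP connection, so it says nothing definite about the UDP heartbeat traffic on the same port. The function name is hypothetical.

```python
import socket

# Assumption: default failover cluster port. Heartbeats use UDP on this
# port; this sketch only checks that a TCP connection can be opened.
CLUSTER_PORT = 3343

def tcp_reachable(host, port=CLUSTER_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, calling `tcp_reachable("node1")` from node2, and `tcp_reachable("node2")` from node1, with the VMs on different hosts and then on the same host, would show whether the path between hosts is dropping cluster traffic that ICMP ping does not exercise.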

August 7th, 2014 5:45pm

Hi Jared, yeah, I understand this is not a 'solution'.  In any case this is my production environment, and I have done everything else I can think of and have the expertise to do, so I have no other options, unless you want to remote in and take a look.

It will work for me for now, as I am hoping to move to O365 soon so I never have to bother with on-prem troubles again :)

August 8th, 2014 12:06am

Thanks for the update.

Did you find out why this happened and what it took for node 1 to work?

I am facing the same issue.

Thanks

August 25th, 2014 9:01am

I had the exact same scenario.  Has happened twice now.

When we run from NODE1:

Node1 >cluster node
Listing status for all available nodes:

Node           Node ID Status
-------------- ------- ---------------------
NODE1                 1 Down
NODE2                 2 Up

When we run from NODE2:

Node2 >cluster node
Listing status for all available nodes:

Node           Node ID Status
-------------- ------- ---------------------
NODE1                 1 Joining
NODE2                 2 Up

Anyway, both times it has happened we have opened Failover Cluster Manager and done the following:

1) Right-click on the DAG cluster

2) More Actions

3) Shutdown Cluster

4) When asked if I am sure, click Yes

5) Wait for it to stop

6) Repeat steps 1 & 2

7) Start Cluster

I feel like there is some sort of lock on the FSW from NODE2 that is preventing NODE1 from participating.  I really don't know, though.

I imagine that if we shut down Node2, then shut down Node1, then started Node1 followed by Node2, it should also correct itself, but the downtime is greater.

Our environment is virtualized as well.  Usually this happens when some VMware/Veeam backup oddness occurs.  There is a Veeam KB, KB1744, that talks about setting the failover sensitivity higher.


June 17th, 2015 10:36am

This topic is archived. No further replies will be accepted.
