DAG mailboxes keep randomly failing over

Hi there, 

I've been banging my head against this for a while now. We have an "IP-less" Exchange 2013 SP1 DAG that has been running well for over a year. Recently I've noticed that all of the databases fail over at random times throughout the day. We've reviewed recent changes and nothing stands out.

Here are the results of our CollectOverMetrics.ps1 run. I can see the failover was automatic and caused by a "PeriodicAction". Any idea where I should dig next to find the root cause?

  • We've looked into the AV - disabled as a test
  • We've looked at the backup agent in the server - disabled as a test

Any ideas or suggestions?

Update#1 - Noticed we had several health check mailboxes that were corrupted, similar to the post below. Went through the steps in the linked post to recreate them.

https://social.technet.microsoft.com/Forums/exchange/en-US/b722da4c-eeeb-4afc-adfa-d5359bbb087c/monitor-mailboxes-corrupted?forum=exchangesvrgeneral


Update#2 - I looked into the "IP-less DAG" networking and noticed that it had auto-configured with replication set to $true on the iSCSI network (the 105 network). I disabled replication on that network so the DAG simply uses "MapiDagNetwork" instead. I tested failing a database over and back, and it seems to be OK so far. I somewhat inherited this environment without much Exchange knowledge, so I'm trying my best to work through it. I will have to see whether any more random failovers occur, and I will update this thread if the issue is resolved or if another failover occurs.

Before: [screenshot not preserved]

After: [screenshot not preserved]
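
For reference, the network change described in Update#2 could be made from the Exchange Management Shell roughly as follows. This is only a sketch: the DAG name `DAG1` and the network name `ReplicationDagNetwork01` are placeholders, not values from this thread, and should be replaced with whatever the first command reports in your environment.

```powershell
# Placeholder names (assumptions): DAG1, ReplicationDagNetwork01.

# List every DAG network with its subnets and whether replication is enabled.
Get-DatabaseAvailabilityGroupNetwork -Identity DAG1 |
    Format-List Name, Subnets, ReplicationEnabled, IgnoreNetwork

# With automatic network configuration, manual changes may be overwritten,
# so switch the DAG to manual network configuration first.
Set-DatabaseAvailabilityGroup -Identity DAG1 -ManualDagNetworkConfiguration $true

# Turn off replication on the network that carries iSCSI traffic.
Set-DatabaseAvailabilityGroupNetwork -Identity 'DAG1\ReplicationDagNetwork01' -ReplicationEnabled:$false
```

Re-running the first command afterwards shows whether the setting took effect.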

Update#3 - Arrived this morning to find that the mailbox databases had failed over to the other Exchange server after hours. Looked through the event logs and this event stood out. Could two consecutive missed heartbeats be causing the databases to fail over?



  • Edited by TSGzz Thursday, August 20, 2015 2:29 PM Update#3
August 18th, 2015 9:14pm

My mistake. We're indeed on CU4. I am pushing my TL to approve the CU9 update so we're supported in case I need to call MS support.

I moved the PAM to the main site Exchange server. I assume this could have been causing the failover issues if the PAM was on the secondary which is offsite? I will have to wait and see.
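
A rough sketch of how the PAM holder can be checked and moved is below. The PAM role follows the cluster's core resource group ("Cluster Group"), so moving that group moves the PAM. The names `DAG1` and `EX01` are placeholders, not values from this thread.

```powershell
# Placeholder names (assumptions): DAG1 is the DAG, EX01 is the main-site server.

# Show which server currently holds the Primary Active Manager role
# (run in the Exchange Management Shell).
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, PrimaryActiveManager

# Move the cluster core group, and with it the PAM, to the main-site node.
Import-Module FailoverClusters
Move-ClusterGroup -Name 'Cluster Group' -Node EX01
```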

Thank you for your guidance.



  • Edited by TSGzz Thursday, August 20, 2015 5:02 PM
August 20th, 2015 4:50pm

Agreed - that should not influence activities here.

I'd check to see how many times nodes drop out of the cluster. For example:

http://blogs.technet.com/b/rmilne/archive/2014/11/19/retrieving-cluster-error-1135-from-servers.aspx
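
As a minimal sketch of that check (not the script from the linked post), event 1135 can be pulled from the System log of each DAG member with `Get-WinEvent`. The server names are placeholders.

```powershell
# Placeholder server names (assumption); replace with your DAG members.
$servers = 'EX01', 'EX02'

foreach ($server in $servers) {
    # Event 1135 is logged by FailoverClustering when a node is removed
    # from active cluster membership.
    Get-WinEvent -ComputerName $server -ErrorAction SilentlyContinue -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'
        Id           = 1135
    } | Select-Object MachineName, TimeCreated, Message
}
```

Comparing the timestamps returned here with the failover times reported by CollectOverMetrics.ps1 may show whether the failovers line up with nodes dropping out of the cluster.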

I also look at this:

http://blogs.technet.com/b/rmilne/archive/2014/07/18/retrieving-packets-received-discarded-perfmon-counter-from-multiple-servers.aspx

Which can be caused by this:

http://blogs.technet.com/b/rmilne/archive/2014/07/21/vmware-issues-with-exchange-dag.aspx
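
A minimal sketch for reading the "Packets Received Discarded" counter mentioned above from several servers at once (again, not the script from the linked post; the server names are placeholders):

```powershell
# Placeholder server names (assumption); replace with your DAG members.
$servers = 'EX01', 'EX02'

# Read the counter for every network interface on each server and
# keep only the interfaces that have discarded at least one packet.
Get-Counter -ComputerName $servers -Counter '\Network Interface(*)\Packets Received Discarded' |
    Select-Object -ExpandProperty CounterSamples |
    Where-Object { $_.CookedValue -gt 0 } |
    Select-Object Path, CookedValue
```

The value is cumulative since the last restart, so sampling it twice some time apart gives a better idea of whether packets are still being discarded.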

Are these VMs?

Couple of other points.

Disabling file system AV is not enough. You must ensure it is fully removed, as it leaves low-level network and I/O drivers in place. Disabling proves nothing.

Technically CU4 is supported. We base the support windows off service packs. However, you will not see updates cut for CU5, CU6, etc. Regardless, you need to get to CU9 ASAP.

August 23rd, 2015 1:40pm
