DAG mailboxes keep randomly failing over

Hi there, 

I've been banging my head against this for a while now. We have an "IP-less" Exchange 2013 SP1 DAG that has been running well for over a year. I've noticed that all of the databases randomly fail over at various times throughout the day. We've looked into recent changes and nothing stands out.

Here are the results of our CollectOverMetrics.ps1 run. I can see the failover was automatic and caused by a "PeriodicAction". Any idea where I can dig next to find the root cause?
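For anyone wanting to pull the same kind of report, a rough sketch of the call is below. The DAG name DAG1 and the date range are placeholders for illustration, not values from this environment.

```powershell
# Run from the Exchange Management Shell on a DAG member.
# DAG1 and the dates are placeholders (assumptions for illustration).
cd $exscripts
.\CollectOverMetrics.ps1 -DatabaseAvailabilityGroup DAG1 `
    -StartTime "08/17/2015" -EndTime "08/18/2015" `
    -GenerateHtmlReport -ShowHtmlReport
```

The HTML report lists each switchover or failover in the window along with the reason recorded for it, which is where a value such as "PeriodicAction" shows up.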

  • We've looked into the AV - disabled it as a test
  • We've looked at the backup agent on the server - disabled it as a test

Any ideas or suggestions?

Update#1 - Noticed that several of our health check mailboxes were corrupted, similar to the post below. I went through the steps in the linked post to recreate them.

https://social.technet.microsoft.com/Forums/exchange/en-US/b722da4c-eeeb-4afc-adfa-d5359bbb087c/monitor-mailboxes-corrupted?forum=exchangesvrgeneral


Update#2 - I looked into the "IP less DAG" networking and noticed that it had auto-configured with replication set to $true on the iSCSI network (the 105 network). I disabled replication on that network so it simply uses the "MapiDagNetwork" instead. I tested failing a db over and back, and it seems to be OK so far. I somewhat inherited this without much Exchange knowledge and am trying my best to work through it. I will have to see if any more random failovers occur. I will update this thread with more info if the issue is resolved or if another failover occurs.

Before

After
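For reference, the Update#2 change (turning off replication on the iSCSI network) can be made from the Exchange Management Shell roughly as sketched here. The DAG name DAG1 and the network name ReplicationDagNetwork01 are placeholders, not the real names in this environment.

```powershell
# DAG1 and ReplicationDagNetwork01 are placeholder names (assumptions).
# List the DAG networks and see which ones have replication enabled.
Get-DatabaseAvailabilityGroupNetwork |
    Format-List Identity, ReplicationEnabled, IgnoreNetwork, Subnets

# Auto-configured DAG networks have to be switched to manual before editing.
Set-DatabaseAvailabilityGroup -Identity DAG1 -ManualDagNetworkConfiguration $true

# Turn off replication on the iSCSI network so replication uses the MAPI network.
Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\ReplicationDagNetwork01 -ReplicationEnabled:$false
```

Microsoft's guidance for iSCSI networks also adds -IgnoreNetwork:$true to the last command so the DAG leaves that network alone entirely. The update above only mentions disabling replication, so treat that extra switch as optional.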

Update#3 - Arrived this morning to see that the mailboxes had tripped over to the other Exchange server after hours. Looked through the event logs and this event stood out. Could two consecutive missed heartbeats be causing the databases to fail over?


Update#4 - See DAG threshold post below - issue resolved

  • Edited by TSGzz 17 hours 6 minutes ago
August 18th, 2015 9:14pm

We removed the AV as you suggested. As a few MVPs in this thread recommended, I focused on networking issues. We noticed some latency between our HQ and DR site during off hours. I made the DAG thresholds a little less sensitive using the commands below. So far we've gone a few days without a failover. I also have a change request submitted to apply CU9. I will post my results after a few more days.

Commands I used to adjust DAG thresholds

Default

(get-cluster).SameSubnetDelay=1000

(get-cluster).CrossSubnetDelay=1000

(get-cluster).CrossSubnetThreshold=5

(get-cluster).SameSubnetThreshold=5

New settings

(get-cluster).SameSubnetDelay=2000

(get-cluster).CrossSubnetDelay=4000

(get-cluster).CrossSubnetThreshold=10

(get-cluster).SameSubnetThreshold=10
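To put those numbers in context: the cluster sends a heartbeat every Delay milliseconds and gives up on a node after Threshold consecutive misses, so the tolerated outage is roughly Delay x Threshold. That works out to about 5 seconds with the defaults, and about 20 seconds (same subnet) and 40 seconds (cross subnet) with the new settings. A small sketch to read the live values and do the same arithmetic, assuming the FailoverClusters module is available on the DAG member:

```powershell
# Read the current heartbeat settings from the cluster underlying the DAG.
Import-Module FailoverClusters
$c = Get-Cluster

# Show the raw values (delays are in milliseconds).
$c | Format-List *Subnet*

# Approximate time a node can miss heartbeats before it is treated as down.
$sameSec  = $c.SameSubnetDelay  * $c.SameSubnetThreshold  / 1000
$crossSec = $c.CrossSubnetDelay * $c.CrossSubnetThreshold / 1000
"Same-subnet tolerance:  $sameSec s"
"Cross-subnet tolerance: $crossSec s"
```

Raising these values only hides short network blips from the cluster; it does not fix the latency itself.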


  • Edited by TSGzz Tuesday, August 25, 2015 2:49 PM
  • Marked as answer by TSGzz 17 hours 7 minutes ago
August 25th, 2015 2:49pm

We've been failover-free for almost a week now since making the threshold changes above. My TL is looking into the latency issues between the two sites. All databases are healthy and have not flipped over during the week. CU9 is also being applied this weekend.

  • Marked as answer by TSGzz 17 hours 7 minutes ago
August 28th, 2015 10:24am

This topic is archived. No further replies will be accepted.
