DAG mailboxes keep randomly failing over

Hi there, 

I've been banging my head against this for a while now. We have an "IP-less" Exchange 2013 SP1 DAG that has been running well for over a year. I've noticed that all the databases randomly fail over at various times throughout the day. We've reviewed recent changes and nothing stands out.

Here are the results of our CollectOverMetrics.ps1 run. I can see the failover was automatic and caused by a "PeriodicAction". Any idea where I can dig next to find the root cause?

  • We've looked into the AV - disabled it as a test
  • We've looked at the backup agent on the server - disabled it as a test

Any ideas or suggestions?

Update#1 - Noticed we had several health check mailboxes that were corrupted, similar to the post below. I went through the steps in the linked post to recreate them.

https://social.technet.microsoft.com/Forums/exchange/en-US/b722da4c-eeeb-4afc-adfa-d5359bbb087c/monitor-mailboxes-corrupted?forum=exchangesvrgeneral


Update#2 - I looked into the "IP-less DAG" networking and noticed that it was auto-configured with replication set to $true on the iSCSI network (105 network). I disabled replication on that network so it simply uses the "MapiDagNetwork" instead. I tested failing a database over and back, and it seems OK so far. I more or less inherited this environment without much Exchange knowledge, so I'm doing my best to work through it. I will have to see if any more random failovers occur, and I will update this thread if the issue is resolved or if another failover occurs.

Before

After
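For anyone following along, the network change described in Update#2 can be sketched roughly as below. This is only a sketch under assumptions: the DAG name `DAG1` and the network name `ReplicationDagNetwork01` are placeholders, not the real names from this environment. Microsoft's guidance for iSCSI networks is to also set `IgnoreNetwork`, which goes one step further than what was done in this thread.

```powershell
# Assumption: the DAG is called "DAG1" and the iSCSI network shows up as
# "ReplicationDagNetwork01". Substitute the names from your own environment.

# List the DAG networks and check which ones carry replication traffic.
Get-DatabaseAvailabilityGroupNetwork -Identity 'DAG1' |
    Format-List Name, Subnets, ReplicationEnabled, MapiAccessEnabled, IgnoreNetwork

# DAG networks are auto-configured by default; manual changes require
# switching the DAG to manual network configuration first.
Set-DatabaseAvailabilityGroup -Identity 'DAG1' -ManualDagNetworkConfiguration $true

# Stop replication on the iSCSI network and tell the DAG to ignore it.
Set-DatabaseAvailabilityGroupNetwork -Identity 'DAG1\ReplicationDagNetwork01' `
    -ReplicationEnabled:$false -IgnoreNetwork:$true
```

Re-running the first command afterwards should show whether the change took effect.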

Update#3 - Arrived this morning to see that the mailboxes had tripped over to the other Exchange server after hours. I looked through the event logs and this event stood out. Could two consecutive missed heartbeats be causing the databases to fail over?
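A rough way to pull the cluster and Exchange high-availability events around a failover, in case it helps anyone searching the logs the same way. The 24-hour window is an arbitrary assumption, and no specific event IDs are filtered because the event from this thread was not preserved in text form.

```powershell
# Assumption: look back 24 hours; adjust to the time of the failover.
$since = (Get-Date).AddDays(-1)

# Failover Clustering events from the System log (heartbeat and node membership issues).
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    StartTime    = $since
} -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, LevelDisplayName, Message -AutoSize -Wrap

# Exchange high-availability channel, which records Active Manager decisions.
Get-WinEvent -FilterHashtable @{
    LogName   = 'Microsoft-Exchange-HighAvailability/Operational'
    StartTime = $since
} -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, LevelDisplayName, Message -AutoSize -Wrap
```

Running this on each DAG member and lining up the timestamps can show whether the cluster noticed a communication loss just before the databases moved.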



  • Edited by TSGzz Thursday, August 20, 2015 2:29 PM Update#3
August 18th, 2015 9:14pm

My mistake. We're indeed on CU4. I am pushing my TL to approve the CU9 update so we're supported in case I need to call MS support.

I moved the PAM to the Exchange server at the main site. Could having the PAM on the secondary server, which is offsite, have been causing the failover issues? I will have to wait and see.
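For reference, checking and moving the PAM looks roughly like this. The PAM role follows whichever node owns the core cluster group, so moving that group moves the PAM. The names `DAG1` and `EX01` are placeholders for illustration, not the actual names here.

```powershell
# Assumption: DAG named "DAG1", main-site server named "EX01".
Import-Module FailoverClusters

# Show which DAG member currently holds the Primary Active Manager role.
Get-DatabaseAvailabilityGroup -Identity 'DAG1' -Status |
    Format-List Name, PrimaryActiveManager

# Show the current owner of the core cluster group.
Get-ClusterGroup -Name 'Cluster Group'

# Move the core cluster group (and with it the PAM) to the main-site node.
Move-ClusterGroup -Name 'Cluster Group' -Node 'EX01'
```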

Thank you for your guidance.



  • Edited by TSGzz Thursday, August 20, 2015 5:02 PM
August 20th, 2015 4:50pm

We removed the AV as you suggested. As mentioned by a few MVPs in this thread, I focused on networking issues. We noticed some latency between our HQ and DR sites during off-hours. I made the DAG heartbeat thresholds a little less sensitive using the commands below. So far we've gone a few days without a failover. I also have a change request submitted to apply CU9. I will post my results after a few more days.

Commands I used to adjust DAG thresholds

Default

(Get-Cluster).SameSubnetDelay=1000
(Get-Cluster).CrossSubnetDelay=1000
(Get-Cluster).CrossSubnetThreshold=5
(Get-Cluster).SameSubnetThreshold=5

New settings

(Get-Cluster).SameSubnetDelay=2000
(Get-Cluster).CrossSubnetDelay=4000
(Get-Cluster).CrossSubnetThreshold=10
(Get-Cluster).SameSubnetThreshold=10
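To put those numbers in perspective: the delay is the interval between heartbeats in milliseconds, and the threshold is how many heartbeats can be missed before a node is considered down, so the tolerance is roughly delay x threshold. A small sketch of that arithmetic using the values above (the function name `Get-HeartbeatToleranceSeconds` is made up for illustration and is not a cluster cmdlet):

```powershell
# Hypothetical helper: approximate time a node can go silent before the
# cluster declares it down (heartbeat interval x missed heartbeats allowed).
function Get-HeartbeatToleranceSeconds {
    param(
        [int]$DelayMs,     # heartbeat interval in milliseconds
        [int]$Threshold    # number of missed heartbeats tolerated
    )
    return ($DelayMs * $Threshold) / 1000
}

# Values taken from the default and new settings listed above.
$settings = @(
    [pscustomobject]@{ Name = 'Default same-subnet';  DelayMs = 1000; Threshold = 5 }
    [pscustomobject]@{ Name = 'Default cross-subnet'; DelayMs = 1000; Threshold = 5 }
    [pscustomobject]@{ Name = 'New same-subnet';      DelayMs = 2000; Threshold = 10 }
    [pscustomobject]@{ Name = 'New cross-subnet';     DelayMs = 4000; Threshold = 10 }
)

foreach ($s in $settings) {
    $seconds = Get-HeartbeatToleranceSeconds -DelayMs $s.DelayMs -Threshold $s.Threshold
    '{0}: {1} s' -f $s.Name, $seconds
}
```

So the defaults tolerate about 5 seconds of silence, while the new settings tolerate about 20 seconds on the same subnet and 40 seconds across subnets, which is why brief WAN latency spikes no longer trip a failover.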


  • Edited by TSGzz 16 hours 39 minutes ago
August 25th, 2015 10:53am

