DAG mailboxes keep randomly failing over

Hi there, 

I've been banging my head against this for a while now. We have an "IP-less" Exchange 2013 SP1 DAG that has been running well for over a year. I've noticed that all the databases randomly fail over at various times throughout the day. We've looked into recent changes and nothing stands out.

Here are the results of our CollectOverMetrics.ps1 run. I can see the failover was automatic and caused by a "PeriodicAction". Any idea where I can dig next to find the root cause?
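For anyone wanting to produce the same kind of report, a minimal sketch of a CollectOverMetrics.ps1 run follows. The DAG name "DAG1" and the one-day window are assumptions for illustration, not values from this environment.

```powershell
# Run from the Exchange Management Shell on a DAG member.
# Assumption: the DAG is named "DAG1"; substitute the real name.
Set-Location "$env:ExchangeInstallPath\Scripts"

# Assumption: a one-day reporting window ending now.
$start = (Get-Date).AddDays(-1)
$end   = Get-Date

# Collect switchover/failover metrics for the window and build an HTML report.
.\CollectOverMetrics.ps1 -DatabaseAvailabilityGroup DAG1 `
    -StartTime $start -EndTime $end -GenerateHtmlReport
```

The report lists each database action along with its type and reason, which is where an automatic move with a reason such as "PeriodicAction" would show up.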

  • We've looked into the AV - disabled as a test
  • We've looked at the backup agent in the server - disabled as a test

Any ideas or suggestions?

Update#1 - Noticed we had several health check mailboxes that were corrupted, similar to the post below. Went through the steps in the linked post to recreate them.

https://social.technet.microsoft.com/Forums/exchange/en-US/b722da4c-eeeb-4afc-adfa-d5359bbb087c/monitor-mailboxes-corrupted?forum=exchangesvrgeneral


Update#2 - I looked into the "IP-less DAG" networking and noticed that it had auto-configured the iSCSI network (the 105 network) with replication set to $true. I disabled replication on that network so the DAG simply uses "MapiDagNetwork" instead. I tested failing a database over and back, and it seems to be OK so far. I somewhat inherited this environment without much Exchange knowledge and am trying my best to work through it. I will have to see if any more random failovers occur, and will update this thread if the issue is resolved or if another failover occurs.
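The network change described in Update#2 can be sketched roughly as below. This is not the exact sequence that was run here: the DAG name "DAG1" and the network name "ReplicationDagNetwork01" are assumptions, and the IgnoreNetwork switch is an extra step that is commonly recommended for iSCSI networks rather than something done in this thread.

```powershell
# Assumption: the DAG is named "DAG1"; substitute the real name.
$dag = "DAG1"

# List the DAG networks with their subnets and current replication setting.
Get-DatabaseAvailabilityGroupNetwork -Identity $dag |
    Format-List Name, Subnets, ReplicationEnabled, IgnoreNetwork

# Switch the DAG to manual network configuration so the change is not
# overwritten by automatic network configuration.
Set-DatabaseAvailabilityGroup -Identity $dag -ManualDagNetworkConfiguration $true

# Assumption: the iSCSI network was auto-named "ReplicationDagNetwork01".
# Disable replication on it; -IgnoreNetwork additionally removes it from DAG use.
Set-DatabaseAvailabilityGroupNetwork -Identity "$dag\ReplicationDagNetwork01" `
    -ReplicationEnabled:$false -IgnoreNetwork:$true
```

Re-running the Get-DatabaseAvailabilityGroupNetwork command afterwards should confirm which networks still have ReplicationEnabled set to True.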

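[Screenshot: DAG network configuration before the change]

[Screenshot: DAG network configuration after the change]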

  • Edited by TSGzz 16 hours 15 minutes ago
August 18th, 2015 9:14pm

Hi,

Are there any related event logs on the Exchange side? Please provide more information for further analysis.
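As a starting point, the high availability and clustering event channels can be pulled for the period around a failover. This is only a sketch: the one-day window is an assumption, and the channels listed are the ones usually checked for DAG failovers, not ones confirmed to hold the relevant events in this case.

```powershell
# Assumption: look back one day; adjust to bracket the failover time.
$since = (Get-Date).AddDays(-1)

# Event channels commonly reviewed after an unexpected database failover.
$logs = @(
    'Microsoft-Exchange-HighAvailability/Operational',
    'Microsoft-Exchange-MailboxDatabaseFailureItems/Operational',
    'Microsoft-Windows-FailoverClustering/Operational'
)

foreach ($log in $logs) {
    # SilentlyContinue suppresses the error raised when a channel has no matching events.
    Get-WinEvent -FilterHashtable @{ LogName = $log; StartTime = $since } `
        -ErrorAction SilentlyContinue |
        Sort-Object TimeCreated |
        Format-Table TimeCreated, Id, LevelDisplayName, Message -Wrap
}
```

Run this on each DAG member, since the events are logged locally on the server that lost or gained the active copy.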

Regards,

August 19th, 2015 5:29am

Will do if they fail over today. We've made two changes recently that will hopefully iron out this issue.
August 19th, 2015 11:10am

I looked into that and made a change (Update#2 above).
August 19th, 2015 11:15am

Are you really still on SP1 and haven't gone to a later CU? (The current release is CU9; SP1 is basically CU4.)

If so, I would look at updating the server. I have seen some odd behaviour with random DAG failovers that seemed to go away with either CU5 or CU6.

Is this an inter-site or intra-site failover? I agree with Ed though - this is usually a network issue.

Simon.

August 19th, 2015 1:41pm

We're on Exchange SP1 CU5. The failovers that were occurring were inter-site. So far no failovers today. If they start occurring again I will update to CU9.
August 19th, 2015 2:50pm

This topic is archived. No further replies will be accepted.
