DAG mailboxes keep randomly failing over

Hi there, 

I've been banging my head against this for a while now. We have an "IP-less" Exchange 2013 SP1 DAG that's been going great for over a year. I've noticed that all the databases will randomly fail over at various times throughout the day. We've looked into recent changes and nothing stands out.

Here are the results of our CollectOverMetrics.ps1 run. I can see the failover was automatic and caused by a "PeriodicAction". Any idea where I can dig next to find the root cause?
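For anyone wanting to pull the same kind of report, a minimal sketch of the invocation follows. The DAG name "DAG1" and the one-day time window are placeholders, not values from this thread; the script ships in the Exchange Scripts folder and is run from the Exchange Management Shell.

```powershell
# Sketch only: "DAG1" and the time window are placeholders, not values from this thread.
# Run from the Exchange Management Shell on a DAG member.
Set-Location $exscripts   # the Exchange "Scripts" folder

# Collect switchover/failover events for the last 24 hours and open an HTML summary.
.\CollectOverMetrics.ps1 -DatabaseAvailabilityGroup DAG1 `
    -StartTime (Get-Date).AddDays(-1) `
    -EndTime (Get-Date) `
    -GenerateHtmlReport -ShowHtmlReport
```

Narrowing the window to just before and after one failover makes it easier to line the report up with the event logs.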

  • We've looked into the AV - disabled as a test
  • We've looked at the backup agent on the server - disabled as a test

Any ideas or suggestions?

Update#1 - Noticed we had several health check mailboxes that were corrupted, similar to the post below. Went through the steps mentioned in the linked post to recreate them.

https://social.technet.microsoft.com/Forums/exchange/en-US/b722da4c-eeeb-4afc-adfa-d5359bbb087c/monitor-mailboxes-corrupted?forum=exchangesvrgeneral


Update#2 - I looked into the "IP-less DAG" networking and noticed that it auto-configured with replication $true on the iSCSI network (105 network). I disabled replication on that network so it simply uses the "MapiDagNetwork" instead. Tested failing a db over and back, and it seems to be ok so far. I somewhat inherited this without much Exchange knowledge and am trying my best to work through it. I will have to see if any more random failovers occur. I will update this thread with more info if the issue is resolved or if another failover occurs.

Before

After
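The change described in Update#2 can be sketched in the Exchange Management Shell roughly as below. The DAG name "DAG1" and the network name "DAG1\ReplicationDagNetwork01" are hypothetical placeholders; the real names come from the output of the first command.

```powershell
# Placeholders: "DAG1" and "DAG1\ReplicationDagNetwork01" are hypothetical names.
# List the DAG networks and how each one is currently configured.
Get-DatabaseAvailabilityGroupNetwork |
    Format-List Identity,Subnets,MapiAccessEnabled,ReplicationEnabled,IgnoreNetwork

# Exchange 2013 configures DAG networks automatically by default;
# manual changes require manual network configuration to be enabled first.
Set-DatabaseAvailabilityGroup -Identity DAG1 -ManualDagNetworkConfiguration $true

# Stop the iSCSI network from being used for replication.
# Adding -IgnoreNetwork:$true as well keeps the DAG from using it at all.
Set-DatabaseAvailabilityGroupNetwork -Identity "DAG1\ReplicationDagNetwork01" -ReplicationEnabled:$false
```

Running the first command again afterwards confirms which network still has replication enabled.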

Update#3 - Arrived this morning to see the mailboxes had failed over to the other Exchange server after hours. Looked through the event logs and this event stood out. Could two consecutive missed heartbeats be causing the databases to fail over?
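One read-only way to put the heartbeat question in context is to look at the cluster's own heartbeat settings. This is only a sketch for inspection and assumes the FailoverClusters module is available on a DAG member; it does not show that heartbeats are the cause here.

```powershell
# Read-only check; assumes the FailoverClusters module is installed on this DAG member.
Import-Module FailoverClusters

# *Delay is the interval between heartbeats in milliseconds;
# *Threshold is how many missed heartbeats are tolerated before a node is considered down.
Get-Cluster |
    Format-List Name,SameSubnetDelay,SameSubnetThreshold,CrossSubnetDelay,CrossSubnetThreshold
```

Comparing the number of missed heartbeats in the event with the threshold shown here indicates whether the cluster would have treated the node as down.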



  • Edited by TSGzz 17 hours 1 minutes ago Update#3
August 18th, 2015 9:14pm

I also noticed the PAM is set to the secondary Exchange server. Should I change this to the primary? I am thinking that if there are drops between the two servers in the evening, this could be triggering the failovers.

August 20th, 2015 11:03am

The PAM should be on the main site, so I would move it.

http://blogs.technet.com/b/timmcmic/archive/2014/08/04/exchange-2010-2013-pam-and-the-cluster-core-resources.aspx

No such thing as SP1 CU5.

SP1 is CU4, CU5 is CU5. The cumulative updates count from RTM, not the service packs.

CU5 is no longer supported. It is over 12 months old - Microsoft only support the current and previous CU.

Simon.

August 20th, 2015 11:10am

My mistake. We're indeed on CU4. I am pushing my TL to approve the CU9 update so we're supported in case I need to call MS support.

I moved the PAM to the main site Exchange server. I assume this could have been causing the failover issues if the PAM was on the secondary, which is offsite? I will have to wait and see.
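For reference, checking and moving the PAM can be sketched as follows. "DAG1" and "EX01" are placeholder names for the DAG and the main-site server; the PAM role follows whichever node owns the cluster core resource group.

```powershell
# Placeholders: "DAG1" is the DAG name, "EX01" is the main-site Exchange server.
# Show which server currently holds the Primary Active Manager role.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name,PrimaryActiveManager

# The PAM is the node that owns the cluster core resources ("Cluster Group"),
# so moving that group moves the PAM.
Import-Module FailoverClusters
Move-ClusterGroup -Name "Cluster Group" -Node EX01
```

Re-running the first command afterwards should show the main-site server as PrimaryActiveManager.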

Thank you for your guidance.



  • Edited by TSGzz 14 hours 28 minutes ago
August 20th, 2015 12:52pm

The location of the PAM shouldn't make any difference, actually, but best organizational practice is to keep it active in the primary site.  If it's in the DR site, it's another indicator that you've had a network incident that caused it to fail over.
August 20th, 2015 7:21pm

