Active node in 2-node DAG becomes unresponsive after ~7 days.  Requires failover to passive server.  Nothing in logs, plenty of resources

I have a 2-node Exchange 2013 CU2 DAG, with both nodes running the Mailbox and CAS roles on Windows Server 2012; we'll call them EX1 and EX2.  The servers are physical Dell R620s with 128 GB RAM, plenty of hard drive space and fast processors.  No 3rd party software, nothing special in terms of configuration.

Everything ran great through months of testing, and then we moved all our mailboxes over from Exchange 2007, around 250 users.  After a few weeks of running smoothly, EX1 stopped working correctly.  Clients that were already connected waited 20-30 seconds to switch between emails, and those that weren't failed to connect with a server unavailable message.  OWA seemed a little slow but was usable, as was ECP.  Memory usage was about 50% and processor usage under 10%, but the server was pretty unresponsive.  It would take 4-5 minutes for PowerShell to open.  Even a shutdown took about 15 minutes to go through.  Once the DAG mailboxes and the failover cluster switched over to EX2, Outlook clients worked as expected again.

Whatever got EX1 in a knot was apparently still occurring, though, because it continued to be unresponsive and the database and content index status on the databases switched between healthy and failed (presumably because it couldn't keep up with the replication process).  This continued through reboots as well and then stopped after about 10 days; everything appeared normal again on EX1.

At this point I assumed there was some kind of hardware issue on EX1, so I ran diagnostics, but everything came up clean.  Once EX1 started acting normally again, I tested its performance during a maintenance period with the mailboxes moved back onto it, and it was fine.  Still, I left EX2 as my primary server.

Now the problem is that EX2 failed in the exact same way just a couple of days ago, so it's much less likely to be a hardware issue.  It resolved itself after about 24 hours and started responding as expected while set as the passive node.  EX1 then became very slow again after only 2 days of being the active node and I switched back to EX2.

I have scoured the logs on EX1, EX2 and the domain controller and do not see anything out of the ordinary that correlates with the timeframe when things start going south.

I was curious whether the problem was with the CAS or the Mailbox role, so I tried switching just the failover cluster, which would move the CAS role to the passive server, and it did not seem to help.  That said, switching only the mailboxes to the passive server does not solve the problem either; both need to happen, which I guess is expected considering how unresponsive Windows becomes.

Can anyone offer any suggestions for how to troubleshoot this issue or provide guidance on what could be causing the problems?  I haven't had any success on my own.  Thanks.


  • Edited by cjhaugen Monday, October 14, 2013 3:52 PM spelling
October 14th, 2013 6:51pm

I recommend checking whether there are Server 2012 patches for the processor family used in the R620s.
October 14th, 2013 10:31pm

Hello,

You can try chaen fong's suggestion.

Besides that, I recommend you use the Get-HealthReport cmdlet to retrieve health information for the server you specify.

October 15th, 2013 9:09am

Both servers are updated with the exception of this week's patches.  Those will be applied next week.  I tried running Get-HealthReport a couple of days ago and it didn't display anything.  Today it's reporting as expected and there are items that are unhealthy.  I'll investigate those further and update this thread with my findings.
October 16th, 2013 10:47am

Some critical hotfixes have been released for Windows Server 2012 clustering; they are listed here:

https://social.technet.microsoft.com/wiki/contents/articles/15577.list-of-failover-cluster-hotfixes-for-windows-server-2012.aspx

Also, if you are not seeing anything in the Application or System logs, you should look at this article: http://blogs.msdn.com/b/clustering/archive/2012/05/07/10301709.aspx

The articles above will help you find problems related to the clustering components. However, if DAG switchover is working, I think the issue may not be related to the DAG or failover clustering components. I also suggest looking at the following:

1. The NIC configuration on both servers, including the priority of the network cards in the advanced settings of the network connections control panel applet.

2. Whether you see anything related to an IIS application pool crashing.

October 16th, 2013 11:06am

Thank you for the suggestions.  I'll try them out.  As luck would have it, I just had another failure this morning.  This time at least I got an entry in the event log.  The RPC Client Access service failed and restarted itself, resulting in a short outage for clients.  About 10 minutes later the system again became unresponsive.  I failed over to the passive node, which is currently working but reports that a number of resources are unhealthy.  My question is - if a service state is Offline or NotApplicable, is it OK for it to be unhealthy?  For example, I don't use UM on my server and it's unhealthy.  If you look at this screen capture it's obvious that there's stuff here that should not be unhealthy, but I'm not sure where to start troubleshooting.

October 16th, 2013 12:18pm

Hello Cjhaugen,

If a service state is Offline or NotApplicable, it is OK for it to be unhealthy.

However, from the result, the required services are also unhealthy. I recommend you check whether the required Exchange services are started.
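
The rule above (ignore an unhealthy alert when the state is Offline or NotApplicable, look at everything else) could be sketched as a small filter over exported Get-HealthReport output. This is only an illustration in Python, assuming the report was saved as CSV with columns named HealthSet, State and AlertValue; the sample rows and the function name are made up for the example and are not taken from the screen capture.

```python
import csv
import io

# Hypothetical sample of exported Get-HealthReport output; the rows are made up.
SAMPLE_CSV = """HealthSet,State,AlertValue
UM.Protocol,NotApplicable,Unhealthy
Outlook.Protocol,Online,Unhealthy
OWA,Online,Healthy
ActiveSync,Offline,Unhealthy
RPS,Online,Unhealthy
"""

# States for which an Unhealthy alert can be ignored, per the reply above.
IGNORABLE_STATES = {"offline", "notapplicable"}


def actionable_health_sets(csv_text):
    """Return the health sets that are not healthy and not in an ignorable state."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row["HealthSet"]
        for row in rows
        if row["AlertValue"].strip().lower() != "healthy"
        and row["State"].strip().lower() not in IGNORABLE_STATES
    ]


if __name__ == "__main__":
    # Print only the health sets worth investigating first.
    for name in actionable_health_sets(SAMPLE_CSV):
        print(name)
```

With the sample rows, only the health sets that are Online and not healthy remain, which is a reasonable place to start troubleshooting.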

October 20th, 2013 7:00am

After further investigation I found that the Health Manager service was failing on all sorts of checks, mostly with authentication and 401 errors.  I did have a couple of corrupt health mailboxes, which were deleted and then successfully recreated, but the Health Manager failures continued.  I ended up opening a service call with MS and they were also unable to locate the source of these errors.  We tried running setup /PrepareAD to make sure any AD-related changes were applied, but that didn't help.  For now we've disabled the Exchange Health Manager.  What appears to be happening is that the Health Manager would try to check the health of various services, be unable to verify it, and then take action to resolve the supposed problem.  Since there was no actual problem, the "resolution" would cause a service to restart and knock Outlook clients offline as a result.

I'm not entirely sure whether the Health Manager was the cause of my initial problems or not, but if I find anything else out I'll update this thread.

October 21st, 2013 10:34am

Hello,

Thank you for your update.

If you have any further updates, please feel free to let me know.

October 21st, 2013 10:15pm

This topic is archived. No further replies will be accepted.
