I have a two-node Exchange 2013 CU2 DAG, with both nodes running the Mailbox and CAS roles on Windows Server 2012; we'll call them EX1 and EX2. The servers are physical Dell R620s with 128 GB RAM, plenty of hard drive space, and fast processors. There is no third-party software and nothing special in terms of configuration.
Everything ran great through months of testing, and then we moved all our mailboxes over from Exchange 2007, around 250 users. After a few weeks of running smoothly, EX1 stopped working correctly. Clients that were already connected waited 20-30 seconds to switch between emails, and those that weren't already connected failed to connect with a "server unavailable" message. OWA seemed a little slow but was usable, as was ECP. Memory usage was about 50% and processor usage was under 10%, but the server was pretty much nonresponsive. It would take 4-5 minutes for PowerShell to open. Even a shutdown took about 15 minutes to go through. Once the DAG mailboxes and the failover cluster switched over to EX2, Outlook clients worked as expected again.
Whatever got EX1 in a knot was apparently still occurring, though, because it continued to be unresponsive, and the database and content index status on its databases kept switching between healthy and failed (presumably because it couldn't keep up with the replication process). This continued through reboots as well, and then after about 10 days it stopped and everything appeared normal again on EX1.
At this point I assumed there was some kind of hardware issue on EX1, so I ran diagnostics, but everything came up clean. After EX1 started acting normally again, I tested its performance during a maintenance period with the mailboxes moved back onto it, and it was fine. Still, I left EX2 as my primary server.
Now the problem is that EX2 failed in exactly the same way just a couple of days ago, so a hardware issue is much less likely. It resolved itself after about 24 hours and started responding as expected while set as the passive node. EX1 then became very slow again after only 2 days as the active node, and I switched back to EX2.
I have scoured the logs on EX1, EX2, and the domain controller and do not see anything out of the ordinary that correlates with the timeframe when things start going south.
I was curious whether the problem was with the CAS role or the Mailbox role, so I tried switching just the failover cluster, which would move the CAS role to the passive server, and it did not seem to help. That said, switching just the mailboxes to the passive server does not solve the problem either. Both need to happen, which I guess is expected considering how nonresponsive Windows becomes.
Can anyone offer any suggestions for how to troubleshoot this issue or provide guidance on what could be causing the problems? I haven't had any success on my own. Thanks.
- Edited by cjhaugen Monday, October 14, 2013 3:52 PM spelling