i thought i had high availability

Exchange 2013 CU4

a while back, the blades hosting my PDC, MB02 and CAS02 suddenly went down. the other half of my DAG stayed online, as did several other DC servers. however, the high availability we expected from our Exchange deployment didn't happen: everyone lost their connection to Exchange.

our Exchange services returned only several minutes after the downed servers came back online. what did we do wrong that kept us from achieving high availability?

one question keeps haunting me: does Exchange rely heavily on the Windows failover cluster?

my Exchange block diagram is below. the load balancer had repositioned and was pointing to the remaining live Exchange server.

regards,

Rino

September 7th, 2015 5:59am

Hi

Where is your witness server? Did you lose that too?

BR
Steen


September 7th, 2015 6:58am

hi,

the witness server is on CAS01 which remained up and online together with MB01.

regards,

Rino

September 7th, 2015 7:01am

Hi,

CAS load balancing is done using either an external load balancer (in your case), Windows NLB or round robin DNS. MBX high availability is provided by a DAG which relies heavily on the Windows Failover Cluster. 

In your case, both a CAS and a DAG member failed, but you report that the load balancer was pointing at the remaining CAS server, CAS01, so we can focus our attention on the DAG.

In a two-node DAG, you must have a witness server to provide a third quorum vote. Any other server can fill this role, as the witness is just a file share. For the DAG to stay up, two of the three voters must be online at all times. If your witness server was on the same blade server, you may have lost two of the three voters when it went down. To find out what your witness server is, run the command below in the Exchange Management Shell:

Get-DatabaseAvailabilityGroup -Status | fl name,*witness*

More info here: https://technet.microsoft.com/en-us/library/dd351226(v=exchg.150).aspx

If this shows that the witness server was on the same blade server as the MBX server, then you should either move the VM off that blade or move the witness server using the command:

Set-DatabaseAvailabilityGroup -Identity DAGName -WitnessDirectory D:\witness -WitnessServer Server1

More info here: https://technet.microsoft.com/en-us/library/dd297934(v=exchg.150).aspx
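
The "two of three voters" rule above can be sketched as a small model. This is only a simplified static-majority illustration, not how the cluster service itself is implemented; the voter names are hypothetical labels based on the servers mentioned in this thread.

```python
def has_quorum(voters_online, total_voters):
    """Return True when a strict majority of voters is online."""
    return voters_online > total_voters // 2


def cluster_survives(voters, failed):
    """Check quorum after removing the failed voters (simple majority model)."""
    online = sum(1 for name in voters if name not in failed)
    return has_quorum(online, len(voters))


# Hypothetical voters: two DAG members plus the file share witness.
voters = ["MB01", "MB02", "witness"]

# Only one DAG member fails: two of three voters remain, quorum holds.
print(cluster_survives(voters, {"MB02"}))

# A DAG member and the witness fail together: one of three remains, quorum is lost.
print(cluster_survives(voters, {"MB02", "witness"}))
```

In this model, losing MB02 alone leaves the cluster up, while losing MB02 together with the witness takes it down, which is why the witness should not share a blade with a DAG member.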

Also, on the MBX servers, check the failover cluster logs to find out whether the cluster lost quorum during the failure. Check the application log to find out whether the databases failed over, and check the system log to see whether any of the services (especially the information store) failed during this time.

The other thing to check is that you have multiple mailbox database copies. Setting up a DAG doesn't by itself make the databases highly available. You need at least two copies of each database that must stay up in case of an MBX failure. To check this, run the command below:

Get-MailboxDatabaseCopyStatus -Identity DB1

More info here: https://technet.microsoft.com/en-us/library/dd298044(v=exchg.150).aspx 

If you don't have mailbox database copies set up then you'll need to add them using Add-MailboxDatabaseCopy (https://technet.microsoft.com/en-us/library/dd298105(v=exchg.150).aspx). Once done, mailbox databases will fail over in case of a mailbox server failure. 
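
To illustrate why a DAG without extra copies gives no protection, here is a minimal sketch that lists which databases would have no surviving copy after a server failure. The copy layout is an assumption made up for the example (DB2 is a hypothetical database name), and the model ignores copy health and quorum.

```python
def databases_offline(copies, failed_servers):
    """Return the databases that have no copy left on a surviving server."""
    failed = set(failed_servers)
    return sorted(
        db for db, servers in copies.items()
        if not (set(servers) - failed)
    )


# Hypothetical layout: DB1 has a copy on both DAG members, DB2 only on MB02.
copies = {
    "DB1": ["MB01", "MB02"],
    "DB2": ["MB02"],
}

# If MB02 goes down, DB1 can fail over to MB01 but DB2 has nowhere to go.
print(databases_offline(copies, ["MB02"]))
```

Comparing the real output of Get-MailboxDatabaseCopyStatus against this kind of table shows quickly which databases still have a single copy.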

The last thing you should do is check that the cluster is configured correctly. Run the Validate Cluster wizard from Failover Cluster Manager and check the output. Ignore the alerts about storage, as Exchange doesn't require shared storage for DAGs.

Let me know if this answers your questions.

Thanks.

September 7th, 2015 7:16am

hi,

the witness server remained up and online as it's not on the affected blades.

i will, however, check the failover cluster logs on MBX.

i think it was just bad timing, as i was still resolving index failures on my other mailbox databases when this happened.

yes, your post clarified a lot of things. i will check them and get back with the results.

thanks.

Rino

September 7th, 2015 7:32am

So many things can cause these symptoms. Did you get the same behaviour when you tested prior to placing into production?

Something as simple as not having more than 1 DNS server (or more than 1 valid DNS server) on the relevant NIC will cause Exchange to have issues.

Start with the simple settings, and go through everything.

You must then schedule a defined maintenance window to retest this.

September 7th, 2015 11:37am

On a side note, how exactly is the load balancer set up to do health detection of the CAS?
September 7th, 2015 11:37am

hi,

two health probes are used for Exchange: one checks ICMP ping reachability and the other checks the HTTPS service. if either probe fails on a server, the load balancer detects it and stops forwarding traffic to that server.
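
the probe logic described above can be sketched roughly like this. it is only a model of the decision, not the load balancer's actual configuration; the probe results are made-up values for illustration.

```python
def eligible_servers(probe_results):
    """Return the servers that pass every configured health probe."""
    return sorted(
        server for server, probes in probe_results.items()
        if probes and all(probes.values())
    )


# Made-up probe results for the two CAS servers in this thread.
probe_results = {
    "CAS01": {"icmp": True, "https": True},
    "CAS02": {"icmp": False, "https": False},
}

# Only servers with every probe passing stay in the pool.
print(eligible_servers(probe_results))
```

a server that answers ping but fails the HTTPS probe is dropped from the pool just like one that is fully down.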

September 8th, 2015 2:24am

This topic is archived. No further replies will be accepted.
