Implications of lease & healthcheck timeout in SQL Server 2014 AlwaysOn AG environment

Issue Background

We have recently migrated to a production environment running on SQL Server 2014 Ent. Ed. & have faced an Availability Group (AG) outage issue as of last week.

 

The AG instance is having 1 Primary & 1 Sync Sec replica. The two replicas are on separate servers & hosted within the same datacenter / LAN network. For reasons of easier explanation, will refer to original Primary replica as A & Secondary replica as B.

 

The sequence of events started with an auto failover from A to B. This didnt create any issue as all workloads were running fine till this point. Do note, with this failover, B is now primary & A secondary.

Issue started soon after that, when the AG failed over again within a few minutes of the first failover. At this time, DBs in A (Secondary) were not in sync & thus the failover was not successful. Consequently the AG group went into resolving state on both B (primary) & A (secondary) instances.

To come out of this resolving state, we did a manual fail-over to B, after which A went out of AG. But, it was still part of Windows cluster & was reachable physically.

Diagnosis

Went through SQL Server error logs & saw the following line which triggered sequence of events as mentioned above:

The lease between availability group 'AG_Name' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster.

Searching on this, lead us to multiple possible reasons why this might happen.

In our case, there were a couple of queries which were taking 100% CPU. While these queries were running, the lease check process was not receiving any response from invocation of sp_server_diagnostics, within the timeout period of 30 secs. This resulted in lease timeout & subsequent failover. 

With some more digging & testing, we re-configured MaxDOP (earlier it was set to 0), CTP & also lease timeout from 30 to 100 secs. Its been 6 days since this issue had occurred & it has not happened thereafter. However it raises some queries & doubts.

Queries

  1. Why did A replica went out of AG? We had to manually add it to AG & include the DBs back.
  2. Have understood the working of lease timeout & its frequency of running. However, what has not been clear is the exact difference between lease timeout & health check timeout? Moreover, if both these processes result in AG health monitoring & failover, if required, why are the 2 checks separately needed.
  3. What should be the ideal range of values for lease & health check timeout? Because a higher timeout value will only result in delayed failover when an actual issue happens.

Have searched on the above queries, esp. 2 & 3 but have not found any reasonable clarity.

Will really appreciate if these are adequately answered.

Thanking in advance,

Subho

September 13th, 2015 8:31am

First let me say that I really like the way this post is written. We see a lot of postings here, but rarely any that is so clear and precise in the information provided. I like that a lot...

Regarding your queries:

1) That's an interesting question that I can't answer easily. Truth of the matter is that it shouldn't. Truth of the matter unfortunately also is that I have also seen this behavior in my systems during normal failovers. With the latest service packs and CUs this has decreased significantly, but is still there on machines that have large numbers of Availability Groups. My assumption would be (but I could not prove that so far unfortunately...) that as every AG needs a cluster mutex to sync its last hardened LSN there might be timeouts there too when communicating with the cluster service. But again: That's an educated guess at best.

3) Those two are more or less different sides of the same coin. The health check is a process running within SQL server to determine database/instance problems within SQL Server itself. The lease timeout controls the timeout value for the so called "Looks Alive" check from the windows cluster. Both have the same intention: Confirm that the DB is healthy. But one belongs to SQL, the other belongs to Windows Cluster.

5) There is no one size fits all answer here. Running on a reasonable server with MaxDOP set to other than 0 (I personally run MaxDOP 2 on my systems) normally leaves no reason to change the default setting. I never had a case of overload triggering a failover on my systems so far, even under edge conditions (in my case that being >400.000 batches/sec on a single DB...) So I would say that with the change of your MaxDOP you should be good even with the default setting. If you do increase the value bear one thing in mind: This really only affects an unresponsive SQL server and the failover time in this scenario. If the host itself crashes the value has no bearing whatsoever. (As the Looks Alive check for the node itself will fail and trigger the failover...)

Free Windows Admin Tool Kit Click here and download it now
September 13th, 2015 8:54am

Thank you very much for the prompt & detailed response.

A lot has been answered. Just listing a few additional pointers / doubts against each of the queries:

1) We are running SQL 2014 SP1. I'm not sure whether there are any further updated CUs / patches that needs to be applied but, would check up on this again. Also, in our case there is only 1 AG running per DB instance.

The apprehension is, if the actual reason remains unknown, it can recur & we would not know exactly how to rectify better, than manually adding it back to AG as we did last time. 

2) The difference between lease & health check timeout stands much better clarified now.

Would request anyone to help with links / articles, if there is more content existing on this. Our search on this point has not resulted in much.

3) Your last statement on this point is well taken if the Looks Alive check for the node fails, it will trigger failover. You are also absolutely correct in saying that a change in MaxDOP was required & would probably also be enough, even with the default lease timeout value of 30 secs.

However, while diagnosing this issue, was a bit surprised to not come across any significant article(s) with detailed clarification on best case values for lease & health check tiemouts. Obviously that depends on workloads, server configurations & other factors. But, since these counters are easily available for re-configuration, there is always a high probable chance of setting it incorrectly.

For e.g. in our case, while updating the lease timeout to 100 secs, we have also updated health check timeout to the same value. Im not sure whether that is correct and/or required. So, just hoping to find more literature behind these concepts to take actions with more certainty.

Thanks again for all the help & assistance. Do look forward to response on these pointers.

Warm regards,

Subho

September 13th, 2015 10:55am

1) The latest build for SQL 2014 is SP1 Cumulative Update 2 (12.0.4422). I don't think there is anything specific to AlwaysOn in these two CUs, but I have not crosschecked...

2+3) The truth is that in all my time working with AlwaysOn (since 2011 actually when I first got the pitch in the early stages of the SQL2012 beta) I have never ever had a single case where those values needed to be changed... And I also don't know anyone who has touched those. So that might be why there is little documentation... This might change with SQL2016 though, but I can't tell you details yet.

Free Windows Admin Tool Kit Click here and download it now
September 13th, 2015 11:42am

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics