Should I retry on Sql Azure Timeout?

I've been seeing intermittent SqlException - "Timeout expired" in our application quite regularly.  We have implemented transient retry logic as per

http://social.technet.microsoft.com/wiki/contents/articles/sql-azure-connection-management-in-sql-azure.aspx

and

http://msdn.microsoft.com/en-us/library/4cff491e-9359-4454-bd7c-fb72c4c452ca

but neither of those documents mentioned the -2 timeout error and as a result we don't retry.  But the regularity with which we see these exceptions seems to tell me that I'm doing something wrong and perhaps I should retry.

Additionally, the first document above infers that the second is the definitive resource, yet the second doc does not mention error code 40143 (which we also had not implemented as a retry-able error but appears that we should).

I'd really like to get the definitive list (so that my app would stop sending me SMS alerts in the middle of the night!)

For the record (and to pass on some code) , my app is testing as follows:

 

	public static class SqlExceptionExtensions
	{
		public static bool IsRetryable( this SqlException that )
		{
			//for error descriptions go to:
			//http://social.technet.microsoft.com/wiki/contents/articles/sql-azure-connection-management-in-sql-azure.aspx
			//http://msdn.microsoft.com/en-us/library/4cff491e-9359-4454-bd7c-fb72c4c452ca
			var sqlErrorCodesToRetry = new[]
			{
				-2 /*Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.*/
				, 20 /*The instance of SQL Server you attempted to connect to does not support encryption. (PMcE: amazingly, this is transient)*/
				, 64 /*A connection was successfully established with the server, but then an error occurred during the login process.*/
				, 233 /*The client was unable to establish a connection because of an error during connection initialization process before login*/
				, 10053 /*A transport-level error has occurred when receiving results from the server.*/
				, 10054 /*A transport-level error has occurred when sending the request to the server.*/
				, 10060 /*A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible.*/
				, 40143 /*The service has encountered an error processing your request. Please try again.*/
				, 40197 /*The service has encountered an error processing your request. Please try again.*/
				, 40501 /*The service is currently busy. Retry the request after 10 seconds.*/
				, 40613 /*Database '%.*ls' on server '%.*ls' is not currently available. Please retry the connection later.*/
			};

			return that.Errors.Cast<SqlError>().Any( sqlError => sqlErrorCodesToRetry.Contains( sqlError.Number ) );
		}
	}


 

 

July 27th, 2011 12:50pm

Have you had a look at the Transient Fault Handling Framework (http://blogs.msdn.com/b/appfabriccat/archive/2010/10/28/best-practices-for-handling-transient-conditions-in-sql-azure-client-applications.aspx)?  It does a nice job of providing a reusable, extensible library for such transient faults.

The list of SQL error codes in that framework appears to be the same as what you have though.

Free Windows Admin Tool Kit Click here and download it now
July 27th, 2011 4:08pm

Thanks Michael,

A lot of my inspiration actually came from that article, however it does not list -2 (Timeout) as a valid sql error code for retrying (in my code above, I actually added the Timeout check  after discovering that it was not being tested)

Indeed, the SqlAzureTransientErrorDetectionStrategy class in that download does NOT retry on timeout, so hence my question here....

 

Pete

July 27th, 2011 5:37pm

Hi Pete,

I built a system that highly depends on SQL Azure availability; I spent a lot of time fixing the application for many different types of failures and found that certain types of errors could be happening at various levels in the communication stack. I built the following TryOpen methods to centralize the communication attempts to SQL Azure. Here are the 2 methods; I only call the first one from the parent code. Note that I check for the Exception itself, the InnerException if there is one, and Socket Exceptions as well.  Depending on the exception, the error may be in the Number or the ErrorCode property.

public static bool IsTransientError(Exception ex)
{
    if (ex is SqlException)
      return IsTransientError((SqlException)ex);
    if (ex.InnerException is SqlException)
      return IsTransientError((SqlException)ex.InnerException);
    else if (ex is InvalidOperationException && ex.InnerException != null && ex.InnerException.Message.ToLower().StartsWith("timeout expired"))
      return true;
    else if (ex is System.Net.Sockets.SocketException)
    {
      if (((System.Net.Sockets.SocketException)ex).ErrorCode == 10054)
        return true;
    }
    else if (ex is TimeoutException)
      return true;
    return false;
}

public static bool IsTransientError(SqlException ex)
{
   if (ex.Number == 40197 ||
      ex.Number == 40501 ||
      ex.Number == 10053 ||
      ex.Number == 10054 ||
      ex.Number == 10060 ||
      ex.Number == 40613 ||
      ex.Number == 40143 ||
      ex.Number == 233 ||
      ex.Number == 64 ||
      ex.Number == 20)
      return true;

    return false;
}

 

Free Windows Admin Tool Kit Click here and download it now
July 29th, 2011 8:09pm

Herve,

Thanks for your response - I appreciate you sharing your code.

I had not considered the SocketException and lowlevel TimeoutException - I was only examining SqlExceptions.  I'll need to add similar checks on my next deployment.  Although, I would be a little scared about the way you examine the exception message text - that seems brittle to me.

I notice though that you don't check for SqlError of -2 (Timeout) - would that be because in that case the SqlException would contain an inner TimeoutException as well as the SqlError -2?  

Pete

July 30th, 2011 12:36pm

Hi Peter - you are welcome.

Yes - checking for a text message is always dangerous and something I avoid as the primary method; I consider this more of a heuristic approach and makes the assumption that a timeout is always something to retry.  SqlError -2 is, as you mention, a timeout. I am not sure exactly which condition in the code I posted catches it, but I thought it was accounted for somehow. I remember seeing it come across as well.

Also, I do not believe this code is perfect nor complete; but so far it works... :)  If you find a way to improve it, please do share on this post for my benefit, and others.  I would be curious to know if the error you are experiencing is caught by the routines I posted.

Thank you - I look forward to your response. 

 

Free Windows Admin Tool Kit Click here and download it now
July 30th, 2011 11:14pm

Thanks again, It's such a pity that we are both unsure if our code catches all cases, despite the fact that there is a guidance article (and indeed a framework) that should address the issues we are talking about and that we are seeing.  It all seems a little anecdotal for me...

 

Best of luck

Pete

July 31st, 2011 12:33pm

Pete,

Let me clarify one important thing.

The Transient Fault Handling Framework is not configured to handle the timeout errors by default. We believe that this type of exceptions is NOT something we should encourage the developers to compensate by a retry. Timeouts may have a very distinct nature and can be the result of poor schema design, lack of indexes, suboptimal query plans, blockings, timeout settings misconfiguration, and more. It is unwise to always retry when you get a timeout. Therefore, we have decided not to include the general timeout errors into the framework.
 
Instead, I would recommend looking into the queries being executed and try to determine the reasons behind timeouts. If you are 100% confident that your queries are generating a good execution plan, you have all the right indexes and everything is configured correctly, and you are still seeing timeouts, you have all the source code in your possession. The implementation of the SqlAzureTransientErrorDetectionStrategy class can be easily modified to catch the timeout exceptions.

To re-iterate, this is not a general recommendation. We leave the timeout handling at your own discretion.

Valery

Free Windows Admin Tool Kit Click here and download it now
August 9th, 2011 6:28pm

Thanks for the feedback.  I appreciate that there are cases where poorly designed queries would cause Timeouts and that your framework should not make retry decisions as a rule.  

However, my issue is that I am seeing an unusual amount of Timeouts on simple, optimized queries that lead me to question the guidance and explanations outlined in the "Sql Azure Connection Management" article I pasted in the OP.

That article definitely makes it explicit to the developer that when writing code that connects to sql azure that we need to handle connections in a specific way that is different to the way that we would (normally) write on-premise code.  It does not mention anything inherent in the environment that can cause Timeouts.

Yet since moving to SQL Azure, our application does see a lot more timeouts.  

For example, simple username/password lookups (on a table with indexes and only a few thousand rows) seemed to _occasionally_ exceed the default CommandTimeout (which is 30seconds)!  I stress this is only ocassional.

Since writing my OP, I have reduced my ConnectionTimeout to 5 seconds and my CommandTimeout to 5 sec (and in the password lookup case, 2 seconds - although I makes me sick to think that a request thread could be in the DB for more than 2 seconds).  

The reason I have reduced from the defaults on both, is that I'd rather fail-fast-and-retry while the user is still waiting for the page to render than hang around for 30 seconds because some load-balancer is trying to swap instances

Pete

 


August 9th, 2011 7:40pm

Pete,

Thanks for the additional clarifications.

Note though that timeouts could also be the result of SQL connection pool exhaustion.

I hear what you are saying about ocassional nature of the timeouts. How about this? We can get our best support engineers to assist you with the root cause analysis. Please get in touch with us via our blog at AppFabricCAT.com so that we can start the ball rolling.

Valery

Free Windows Admin Tool Kit Click here and download it now
August 9th, 2011 7:56pm

Pete, Valery,

I'd be interested in what the outcome of this investigation was as we are experiencing what appears to be the same behaviour. an Update to a SQL Azure table with just 51 rows in, times out periodically.

We have an active ticket with Microsoft with regard to the issue, that has shown the actual execution time of the SP is on average 140ms with a top time of 158ms. There is only one other SP that accesses the same table and that also runs equaly fast. Both SPs are run on a scheduled basis, (approx every 5 seconds) with the default timeout for the command, (linq). Of these scheduled calls approximately 2-3 will fail with a tiemout exception in any 24 hour period.

Blocking does not seem likely, rather it appears that the timeout exception could be masking something else, hence my interest.

thanks

Niall 



September 23rd, 2011 12:55pm

I have an open ticket as well and have been working with a chap called Evan.  Investigations are still ongoing, although things have slowed down the last two weeks due to me being committed to other things so I have not been able to get the information Evan needs.

Gathering information is difficult as the app is in production and we have not got a new build ready yet to deploy. 

I hope to get something soon and get it closed out.

Will post here any outcome.

Free Windows Admin Tool Kit Click here and download it now
September 23rd, 2011 3:26pm

We seem to have made some progress. MS Support performed a reconfiguration of the databases which they define as :

Reconfiguration 

The process of replacing failed replicas is called reconfiguration. Reconfiguration can be required due to failed hardware or to an operating system crash, or to a problem with the instance of SQL Server running on the node in the data center. Reconfiguration can also be necessary when an upgrade is performed, whether for the operating system, for SQL Server, or for SQL Azure.

Since then we have not had any of the timeouts, (they were daily previously) but since it has been only a few days, we will continue to keep an eye on it.

September 29th, 2011 11:47am

Well that sure is good news (for you at least!).  We have little progress:  we dont appear to have blocking queries yet still get the timeouts at least daily.  

If reconfiguration fixes for you over the next week, I'd appreciate if you let me know...  What kind of downtime did you experience while they were fixing your DB?  Is there any risk?

Free Windows Admin Tool Kit Click here and download it now
September 29th, 2011 12:59pm

There was a period of db outage, about 20seconds according to MS but approx 60 seconds according to the errors I can see in our logs.

We didn't get any prior notification, which would have been nice! Fortunately at the time of the reconfiguration all the db calls were heartbeat rather than business processes, although the outage can be mitigated if the previously discussed retry strategies are in place. The errors were at least detectable as something other than sql timeouts.

Still no timeouts so far, and no data issues detected with our DBs. The real test will be if it remains healthy.

September 29th, 2011 4:10pm

Niall,

Co-incidentally, we also have not seen any SQL Timeouts since Monday 26th.  Prior to that, over the last two months (August, September) we only had two separate days when no timeouts occurred

I am not aware if MS has "reconfigured" our DB instance and we have taken no corrective action ourselves.  

I will keep an eye on things here and report back.  

Niall, I wonder if perhaps I could contact you directly off-list?  If so, can you email me at: pmcevoy "at" inishtech "dot" com

 

Pete

Free Windows Admin Tool Kit Click here and download it now
September 30th, 2011 1:39pm

We are also still clear, the fact we both are since the 26th doesn't feel like a coincidence to me.

I'll drop you a mail with my contact details.

Niall

October 4th, 2011 1:53pm

I just recently came across this thread and have found it very helpful, especially this link: http://windowsazurecat.com/2010/10/best-practices-for-handling-transient-conditions-in-sql-azure-client-applications/

I am using the Entity Framework 4.1 with ASP.NET 4.0, hosted in Windows Azure with a SQL Azure database.  I have to say, I find it somewhat bizarre how much wrapper code is required to use SQL Azure with the Entity Framework.  It seems to me these are two flagship Microsoft development technologies that naturally should function more elegantly together.  Perhaps, the entity framework connection string could simply support a property such as "IsSqlAzure=true", and all of this retry policy logic would be handled automatically (since we aren't writing custom handlers for each exception in the list anyway).

Does anyone else have any thoughts on this?  Is anyone aware of anything on the Microsoft roadmap which integrates these two technologies better?  Perhaps these technologies are so new that I'm not giving enough credit to how far these technologies have come in the first place.  That said, the goal and promise of the Entity Framework to simplify data access seems to not be fully realized as it pertains to SQL Azure.



Free Windows Admin Tool Kit Click here and download it now
October 5th, 2011 11:34pm

Update - I spoke to the ADO.NET team lead about this.  Implementing retry logic "under the hood" is in the backlog for the ADO.NET team, and as of March 2012, it has a lead time of 6-18 months before they plan on implementing it.  It sounded like the ADO.NET is currently overwhelmed with other issues.  Just thought I'd provide an update to anyone currently realizing how much effort it is to wrap every atomic operation in their data layer with retry logic.  Good luck!


June 22nd, 2012 6:27pm

Pete, Valery,

I'd be interested in what the outcome of this investigation was as we are experiencing what appears to be the same behaviour. an Update to a SQL Azure table with just 51 rows in, times out periodically.

We have an active ticket with Microsoft with regard to the issue, that has shown the actual execution time of the SP is on average 140ms with a top time of 158ms. There is only one other SP that accesses the same table and that also runs equaly fast. Both SPs are run on a scheduled basis, (approx every 5 seconds) with the default timeout for the command, (linq). Of these scheduled calls approximately 2-3 will fail with a tiemout exception in any 24 hour period.

Blocking does not seem likely, rather it appears that the timeout exception could be masking something else, hence my interest.

thanks

Niall 



Hi All,

I'm wondering if someone gets to the bottom of the issue?

It seems that we're experiencing similar problems (periodically, queries which run 50-150 ms produce timeouts 30+ seconds).

Peter, Niall, please share output of your investigations if any.

Do you experience any issues after 'magic Reconfiguration' ?

Regards, Oleksii




Free Windows Admin Tool Kit Click here and download it now
June 3rd, 2013 8:43pm

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics