Substantial increase in SE_REPL_SLOW_SECONDARY_THROTTLE Wait Type to the point we can't perform any write operations

Over the last three weeks we've seen huge increases in the prevalence of SE_REPL_SLOW_SECONDARY_THROTTLE wait type on our databases. Our workload has not changed, but at many times during the day we are "locked out" of our databases. This wait type relates to replication to a secondary database Azure maintains for availability of the three replicas of every SQL Azure database.

It is not uncommon to have a situation where we need to shut down our application that accesses the database for over and hour before we can insert even a single row into the database. Just to clarify, we need to leave our database with no transactions running at all for over an hour before we can update, delete or insert data. If we try to perform a write operation in that time, it simply times out and the "wait_time" exactly equals the "total_elapsed_time" while it is running.

Our application certainly does some ETL type operations but our workload hasn't changed recently and we are regularly crippled by this wait type, but this only started happening in the last three weeks. We've even sharded our database which has reduced the problem but it is still a huge issue. We've tried using more temp tables as SQL Azure doesn't replicate these to replicas, but this isn't compatible with SqlBulkCopy as per this discussion: http://social.msdn.microsoft.com/Forums/en-US/ssdsgetstarted/thread/51b26995-509f-419e-8545-3a3c16a6fc14

Has Microsoft recently changed the threshold for throttling? Are others experiencing the same problem? Any ideas for reducing this wait type?

Cheers

David


Here's the code I'm running to get info on the wait type:

SELECT TOP 10 r.session_id, r.plan_handle,
     r.sql_handle, r.request_id,
     r.start_time, r.status,
     r.command, r.database_id,
     r.user_id, r.wait_type,
     r.wait_time, r.last_wait_type,
     r.wait_resource, r.total_elapsed_time,
     r.cpu_time, r.transaction_isolation_level,
     r.row_count, st.text
 FROM sys.dm_exec_requests r
 CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) as st
 GROUP BY r.session_id, r.plan_handle,
     r.sql_handle, r.request_id,
     r.start_time, r.status,
     r.command, r.database_id,
     r.user_id, r.wait_type,
     r.wait_time, r.last_wait_type,
     r.wait_resource, r.total_elapsed_time,
     r.cpu_time, r.transaction_isolation_level,
     r.row_count, st.text
 ORDER BY r.total_elapsed_time desc

March 21st, 2013 3:24pm

Hi David,

Link to another thread on this topic http://social.msdn.microsoft.com/Forums/en-US/ssdsgetstarted/thread/f8e5e5d0-90da-4150-a05f-b22429fce3ce

I've seen the same problem, and it's getting worse. Today I tried stopping all activity on our database, and running a COPY DATABASE command (which took several hours for a 30GB database) to create a new version of the database. I then renamed the old and new to swap the copy over to live for access by the production apps. The very first query the production app ran, after having been inactive for 4+ hours, locked up with SE_REPL_SLOW_SECONDARY_THROTTLE, and I'm seeing no improvement at all in performance. This is on a freshly copied database, where I would assume the replicas would be rebuilt as well?

Not a lot of help I know, but I'm in the process of spinning up instances on Amazon RDS to effect a migration of this app from SQL Azure as soon as possible. I've evaluated SQL Server on Azure VMs, and whilst the performance is fine (for my workload), the random reboots for host OS updates which last up to an hour (happened the day after I provisioned the VM), and therefore the requirement for 2 instances with replication to be configured, renders it too expensive financially and from a management standpoint. It's also still  in Preview status, so no proper production licensing available (only Evaluation edition).

Regards,

James

Free Windows Admin Tool Kit Click here and download it now
March 22nd, 2013 3:38pm

Hi David,

Open up ticket with SQL Azure support and they will help you out on this.

Looks like machine having your database instance might be in some problem.

Thanks,

Dilkush

March 22nd, 2013 4:11pm

Hi Dilkush

Thanks for getting in touch. Unfortunately this has affected four of our databases in two different datacentres so it doesn't appear to be machine related. We've even tried migrating our data to another database which didn't work either.

Has the threshold or method of throttling on SQL Azure changed recently? It would be great to understand if and how it has changed so we can modify our pattern of usage to mitigate these issues.

Cheers

David

Free Windows Admin Tool Kit Click here and download it now
March 22nd, 2013 4:20pm

Hi David,

Checkout this link,

http://social.technet.microsoft.com/wiki/contents/articles/1541.windows-azure-sql-database-connection-management.aspx#Throttling_Limits

This has new throttling mechanism, we call it resource governance.

It has limit of 180 worker threads per instance.

Thanks,

Dilkush

March 23rd, 2013 1:25pm

Hi Dilkush

Thanks for the article link, very useful for working on our best practice.

The article seems to mostly focus on connection termination thresholds and much less on soft throttling which appears to be what we're experiencing ie throttling were the connection isn't terminated but the execution of sql takes a very long time or times out.

Have there been changes to these soft throttling limits in the last three weeks ie since my problems started?

Are there ways we can prevent or be warned we are approaching a limit? I've described a situation above where an hour can pass with zero transations against a database and yet still be thottled when updating a single row. If we knew we were approaching a limit that meant we were not able to access our database for an hour, we could reduce our activity to prevent the lock out. Recently we've started looking at the table sys.resource_stats to see if there is some kind of correlation with throttling, but it would be great to get some guidance on this.

Cheers

David

 

Free Windows Admin Tool Kit Click here and download it now
March 25th, 2013 1:49pm

I've been given the following answer in my communication with Microsoft (we've seen this issue with 4 of our 15 databases in the EU data center):

Question: Have there been changes to these soft throttling limits in the last three weeks ie since my problems started?
Answer: No, there has not.

Question: Are there ways we can prevent or be warned we are approaching a limit?
Answer: No. The issue may not be caused by your application but can be caused by other tenants relying on the same physical hardware. In other words, your application can have very little load and still run into the problem.

In other words, your own traffic may be a cause of this problem, but it can just as well be caused by other tenants relying on the same physical hardware. There's no way to know beforehand that the issue will soon occur - it can occur at any time without warning.

The SQL Azure operations team does not monitor this type of error, so they won't automatically try to solve the problem for you. So if you run into it you have two opitions:

  1. Create a copy of your db and use that and hope the db is placed on another server with less load.
  2. Contact Windows Azure Support and inform the about the problem and let them do Option 1 for you

Martin

March 25th, 2013 2:36pm

Thanks for the info Martin, much appreciated!

Seems a bit odd that SQL Azure operations wouldn't monitor throttling as there could be bugs in throttling that could lead to an outage.

The problem with moving database is that I've already tried that and the problem follows me around. This makes me think Microsoft have recently changed their throttling limits. I think this because the pattern of usage of my databases hasn't changed recently and I've never been "locked out" of my database (as described above) before.

In order to fix this problem I'm trying to find out how throttling has changed and how I can rewrite my applications to reduce throttling.

I've already started by sharding the data, making greater use of temporary tables (which are not replicated to the replicas), truncating staging tables rather than deleting (less logging) and reducing the sizes of my SqlBulkCopy batches (reduces transaction size).

Any other ideas?

Obviously a response from Microsoft with more details on how soft throttling has changed recently would be great... Dilkush have you any info on this?

Cheers

David

Free Windows Admin Tool Kit Click here and download it now
March 26th, 2013 2:49pm

Hi David,

I think you missed my second point - You can optimize your traffic towards SQL Azure all you want (and that's good), but you can still run into this problem. In fact, even if your application runs only 1 little INSERT per day, this one statement may still hang with the SE_REPL_SLOW_SECONDARY_THROTTLE lock type.  Why? Because other tenants using the same physical hardware may put too much burden on the server, and that affects your application as well. 

In other words, this throttling isn't really throttling in the sense that they limit your database access by putting limits on how much data you can send to SQL Azure. Instead, they are throttling the performance of your SQL Azure database, because the database server is under too high load.

What I'm trying to say is that you may see this issue because you are unlucky and your databases has ended up on physical servers under too high load.  We've seen this issue 5 times during the last 2 weeks, and in all cases it has helped us to move the db to a new server. We've seen this issue in databases with pretty much no load whatsoever.

The recommendations they have given to me can be summed up in:

  • Re-architect application to make it less dependent on SQL Azure. Use queues and alternative storage mechanisms such as Azure Tables to place less load on SQL Azure. Use caching to reduce the traffic towards the database. Of course, this only reduces the risk of the issue occuring, it may still occur when I actually need to access the database. Use SQL Azure only from a backend which can handle the case of the database being unavailable for a long time.
  • Install Microsoft SQL Server on a virtual machine in Windows Azure and take care of this myself. Note that this is non-tricky and as far as I know, only evaluation editions of SQL Server is avaiable as virtual machine images at this point.

I know that none of the above options are really options to most users though. Sorry I can't be of more help.

Martin

March 26th, 2013 11:38pm

Hi Martin

Thanks for the response! Great to know I'm not the only one...

I didn't miss your second point, I just didn't address it because I guess there is not a lot we can do about it. We've coded for failure and hopefully that should cover it. The idea of migrating server every time this happens is a scary prospect and doesn't sound very sustainable.

I'm not sure moving away from SQL Azure is the answer as the application requires lots of joins and other SQL capabilities. Did someone from Microsoft really recommend that? We have certainly removed SQL Azure from web-facing tasks as non-deterministic throttling does seem too dangerous if you need a more consistent connection experience. We've used Azure Storage for that with datacentre failover, which was great until the global outage happened last month!

Through a radical re-architecting of our applications and a move to sharding combined with the steps I outlined in my other comments above, we have been able to reduce the prevalence of thirttling. I'm concerned it may return next time Microsoft tweaks the throttling algo.

I'm also crossing my fingers for the creation of a "dedicated" SQL Azure product where there is no contention. I would be willing to pay a premium for that!

Cheers

David

MICROSOFT:
----------
Could anyone from Microsoft please comment on this? I'm just looking for an answer on whether anything changed recently and what we can do to decrease the chances of soft throttling. The tricks of using temp tables, truncating (rather than deleting) permanent statging tables, etc have been learned through bitter experience, it would be great it there was a paper published on how to reduced secondary throttling like there is for connection termination throttling.

Free Windows Admin Tool Kit Click here and download it now
March 27th, 2013 6:02pm

I am sure the advice given above was given in earnest, but nevertheless, if it were really true, then it would be a policy of madness.  It cannot be the case that if your database is inserting one record per day that it could throttled.  That just could not be the way it is.  Something else must be going on. 

It would be like if someone were speeding on the golden gate bridge, so everyone was given a ticket.  It is an unsustainable model and cannot be true.   To set it up that way would be pure madness, and I understand that when we troubleshoot things for a few days that even madness can seem like the correct answer, but it isnt.  It cant be.  No way. 

If you are not abusing the SQL Database, you are not going to be throttled.  I have been using it for more than a year on multiple datacenters and never been throttled that I (and I mean my stuff personally) did not deserve it.


March 29th, 2013 6:46am

@danielsn, we've started seeing SE_REPL_SLOW_SECONDARY_THROTTLE happening on our database as of a few days ago.

It seems to exacerbated when we try to perform inserts of 10,000 or so fairly chunky rows - even though we use SqlBulkCopy and have lowered the batches down to 500 rows at a time.

We have a support ticket open, but not really getting much joy.

The standard answer seems to be that we should create a new database and migrate to that.
This is crazy for a production setup - particularly ours where we also use Sql Azure Federations.
We run a production cloud service used by a lot of paying customers and having our database effectively turn unusable for any period of time is out of the question.

At this stage we're seriously evaluating moving our entire setup to Amazon - specifically to Sql Server RDS and EC2.

This is mainly because there is no other production option for Sql Server on Azure.  

The Sql Server VM option is still in preview and is not for production use.  Even it was, we'd be reluctant since we don't want to be maintaining a db, worrying about patches, backups etc.

We have an application that depends on a relational database so using Table Storage is not an option - and even so the limitations around inserts (100 entities max per batch) are a killer.

If Microsoft could provide Sql Azure on dedicated VMs - i.e. just like RDS - we would be super happy to pay extra, but this does not appear even remotely on the radar.

At the end of the day, we have customers to answer to and even though we love Azure, if we can't depend on our db then we have to react.

There's just no way we can continue to risk being affected by the actions of other tenants on the same Azure db server.

Pretty ironic that the high availability aspect (i.e. the replication) is causing problems that result in basically no availability!!!

Free Windows Admin Tool Kit Click here and download it now
March 29th, 2013 7:44am

Paul / DavidM51,

I am running to the exact same issues regarding SE_REPL_SLOW_SECONDARY_THROTTLE.  After the last Azure DB outage on March 11, 2013 we have intermittent problems inserting data into the database.  It's hit or miss for us as sometimes a simple one row insert into an empty table will take 1 second and other times 60 seconds.  Is it perhaps isolated to the West US sub-region?

In any case, we used SQL Azure Migration Wizard to move the azure database to a VM and performance has been spectacular.  We lost the redundancy azure provides but our customers are a lot happier as we have much faster query response and far better reliability.

Good luck.

March 29th, 2013 7:43pm

It is my opinon that you are indeed hitting a limit and you will hit the same limit on a different database.

There are probably several approaches possible, but which one would depend on your particular situation. 

How often do you do an insert with 10K rows?

What is the size of your Azure database?  

Does your relational query that is your limiting factor use data from the entire table or could it be sharded?

Also, did someone possibly add an index right before it started failing?



Free Windows Admin Tool Kit Click here and download it now
March 29th, 2013 11:11pm

Wow have you read this thread?

The SE_REPL errors are documented as being related to the overall replication behind the Sql Azure server, not a specific db as such.

This is not something we can control or mitigate on our own.  This has been confirmed by several MS support sources we have been in contact with.

It's on the MS side and they don't seem to have an answer.

Besides, if we were abusing the database, we should be throttled through the normal mechanisms.

Plus our Azure db is already sharded - we use Federations (again read the thread).

Each db is under 10GB in total size.  We've taken to splitting our Federation members more aggressively in an attempt to get past this.

We've even taken to sharding individual tables into multiple tables - effectively a poor mans partitioning - and we still have these problems.

Personally I think MS have over-tenanted the hardware.  

Now people have stopped evaluating/playing around and started using their databases for real life production use cases, there is clearly too much contention on the resources.  If this wasn't the case, the replication would be able to keep up with a simple batch insert of 500 records.


March 30th, 2013 2:31am

Wowwww

So many responses...

guys just to let everyone know this secondary throttling issue is kind of taking heat and it is taken as important.

If you are also hitting this issue where you are not doing much of  work on your DB please open up support ticket so MS will come to know about more affected customers and that will help in correcting them faster.

on recent changes in throttling mechanism, last one was resource governance and we are trying to as many matrix to resource governance model as possible.

ressource governance is essentially to make sure that you do not suffer if you are not hitting any limit, which os obviouslly good change than what Azure had initially.

regarding Amazon vs Azure i shouldn't comment there but you can try out other features like IAAS offerings which will attract lots of people as time goes.

Thanks,

Dilkush

Free Windows Admin Tool Kit Click here and download it now
March 30th, 2013 5:18am

We've just started seeing this error message:

"Resource ID : 1. The request minimum guarantee is 0, maximum limit is 180 and the current usage for the database is 1. However, the server is currently too busy to support request greater than 1 for this database. See 'http://go.microsoft.com/fwlink/?LinkId=267637' for assistance. Otherwise, please try again later."

That sounds like that throttling isn't our fault...

March 30th, 2013 10:38am

This definitely something going wrong on server.

Again i would say please open support ticket with highest priority and this will be taken care of.

Thanks,

Dilkush

Free Windows Admin Tool Kit Click here and download it now
March 30th, 2013 7:21pm

I'm guessing that that is just a side effect of the SE_REPL issue. We've seen similar throttling being caused as a side effect of SE_REPL.

As you know, SE_REPL makes INSERT/UPDATE statements queue up for tenants using same physical hardware as you. Since statements takes so long time to run, the database server will be running a lot of concurrent queries. Or rather it will sit and wait for secondary replicas to catch up. So your server may actually do absolutely nothing else than to wait for replicas to catch up (at least in theory).

For your own database, this means that when there is a total of 180 concurrent INSERT/UPDATE queued up due to SE_REPL, you will no longer be able to SELECT data from our database due to concurrent-request-throttling. When the database is in this state, it's entirely unusable since no queries will succeed - not even SELECT queries trying to list what queries are currently running which makes troubleshooting a bit tricky.

On a server level, I assume that they use something very similar to the 180 concurrent statements, but instead something like 10 000 concurrent (or whatever the threshold is). What I think happens in your case is that lots of statements for lots of different tenants starts to queue up with the SE_REPL error. After a while, this means that the server you are communicating with will be running 10 000 statements and your statements will fail as well.

I obviously can't be sure about this, but considering what I know about throttling in SQL Azure, it seems very likely that this error would occur from time to time.


April 2nd, 2013 6:59pm

We have been experiencing this issue also since Tuesday 26th March last.  It's been today confirmed by Microsoft support as an issue with the Replication System on SQL Azure. In fact, it would appear that there are actually three known issues that are being worked on by the dev team. No resolution times or further details were given. 

Our solution for now is similar to that mentioned by james.currer and Nitramafve above - only we backup the DB and  restore it (to the same server). The effect is the same - the DB gets pushed to new hardware and the issue is resolved, at least temporarily. 

So whilst this issue could be expected to happen occasionally under normal conditions as a server gets busy, it would appear that the current bugs are causing this to happen on a much more constant basis. I have created a posting on the Microsoft connect site at:

  

https://connect.microsoft.com/SQLServer/feedback/details/782837/sql-azure-se-repl-slow-secondary-throttle-issue

If you are affected by this issue, can I request that you open the above link and vote - it will help raise the profile, and urgency of a fix for this issue.

Free Windows Admin Tool Kit Click here and download it now
April 2nd, 2013 7:41pm

paraicnolan,

When you say that it has been confirmed to be an issue, do you mean they have confirmed that it's an actual bug in the replication system, or just that it's a known problem they are working to fix?

To me they have just said that it's a known limitation in SQL Azure.

April 2nd, 2013 8:37pm

We've also experienced a number of "database outages" during March. Interestingly they have;

1. All been in the EU data center (West Europe).
2. SE_REPL_SLOW_SECONDARY_THROTTLE or SE_REPL_COMMIT_ACK
3. Climaxed on Saturday when our master database went offline & we lost connectivity to all 45 databases!

Date Impact (databases) Severity Duration Root Cause
8/03/2013 1 High 4 hours SE_REPL_SLOW_SECONDARY_THROTTLE
8/03/2013 1 High 1 hour SE_REPL_SLOW_SECONDARY_THROTTLE
12/03/2013 1 High 1 hour SE_REPL_SLOW_SECONDARY_THROTTLE
29/03/2013 4 High 1 hour SE_REPL_COMMIT_ACK
30/03/2013 45 High 1 hour SQL Azure UK master database offline.

Free Windows Admin Tool Kit Click here and download it now
April 3rd, 2013 5:31am

Nitramafve

They would not be specific, but said that there were "3 known issues" causing SE-REPL to timeout, and that the Devs are aware and working on a resolution. So this is not the general limitation of SQL Azure as regards replication. There was also a  mention of a specific outage within part of the Data Centre (Northern Europe) that caused the remaining machines to come under heavier than normal load, compounding the issue. 

James - I see that you are in Western Europe - this supports the theory of a general issue in Replication. I think that the  Northern European centre had the added complication of having a heavier load due to some machines going down, making this issue more likely to occur. 

Nitramafve - you mention you are in a European Data Centre - is it the Northern or Western one - would be interesting to know. 

Thanks to all who voted for the issue on the connect site.

April 3rd, 2013 11:10am

paraicnolan,

We've been seeing the issue in the North European data center (last time yesterday afternoon). We run our solution in US and SEA as well but have not experienced it there.

Free Windows Admin Tool Kit Click here and download it now
April 3rd, 2013 11:42am

Nitramafve

Thanks - so working theory then is a general issue, but more severe in Northern Europe. I will update this thread if I get any clearer information from support.

April 3rd, 2013 12:00pm

I've seen this problem from early March on at least ten databases across the south central US, northern Europe and western Europe datacentres.

This does appear to be a bug in SQL Azure and I like Paul Du Bois' theory that "MS have over-tenanted the hardware".

We've taken some action in the last couple of weeks (detailed above) that seem to have mitigated the problem in the short term. But as throttling is non-deterministic knowing my luck it will come back!

Cheers

David

Free Windows Admin Tool Kit Click here and download it now
April 3rd, 2013 1:01pm

We are in the West Coast data center in the United States and have also been experiencing this issue sporadically throughout March, sometimes for hours but the Copy As Database command does fix the issue temporarily for us as well.

Apparently, Microsoft's next SQL Azure upgrade is in May.  Hopefully, a fix for these issues will be included.  Any guidance from Microsoft or ideas from the community on a work around (other than to simply copy the database to a new server) would be greatly appreciated.

Thanks

April 4th, 2013 12:18am

Add me to the list of those encountering this problem.  Time to start investigating Amazon RDS by the sounds of it
April 5th, 2013 2:53pm

Hi,

Soluion to these problem will be somehow change the physical nodes for dbs

Sometimes this is achived by automatic failover of dbs by Azure

or we have to force that,

for that one way is copying database (doesn't sound good right?)

or tell microsoft support that you are having this issue and they will help in triggering failover (preferable as they will come to know how severe is the problem)

Thanks,

Dilkush

Free Windows Admin Tool Kit Click here and download it now
April 6th, 2013 8:25am

It seems that there was a sudden spike of these types of issues in March/April that has since subsided. It is disappointing that Microsoft has never publicly admitted there was a bug in the platform, or provided status updates to users of the platform.

May 31st, 2013 9:14am

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics