AlwaysOn network latency between datacenters

When we open the AlwaysOn dashboard, we see network latency [Estimated Data Loss] on the secondary server. Our secondary server is basically a DR server and sits in a different datacenter. The latency causes the log file to fill up on the primary server. What could be the reason for network latency between the servers in the two datacenters?

October 31st, 2014 1:43pm

Hello,

Have your Networking team give you the baseline latency for that connection. Many things can cause latency - you'll need to partner with Networking to delve deeper into your issue.

-Sean

October 31st, 2014 6:22pm

Hi Rahul,

Based on my research, if there is not enough network bandwidth supporting the communication between the primary and secondary replicas, the secondary replica can fall further behind, leading to higher data latency.

Beyond insufficient network bandwidth, there is also considerable latency involved in processing data through the TCP/IP stack. This takes time (latency) and also requires CPU processing resources that the application running on SQL Server may need.

For more details, please review the following articles.
AlwaysOn: Readable Secondary and data latency
Optimizing MS-SQL AlwaysOn Availability Groups With Server SSD


Thanks,
Lydia Zhang

November 4th, 2014 12:07pm


Hi All - I can always seem to find best practices for certain aspects of setting up a DR solution, for example, but it's difficult to come across actual average values considered optimal, good, acceptable, etc. for things such as network latency. For example, Lydia, you said "if there is not enough network bandwidth", but what is considered enough? Would 10ms be acceptable? How about 20ms, or 50ms? What are the average values administrators deal with and consider "enough network bandwidth"? This is just one of the pieces of information I struggle to find online. Your advice is greatly appreciated!

April 7th, 2015 2:45pm

Leroy,

it's difficult to come across actual average values considered as optimal, good, acceptable, etc, for things such as network latency. For example, Lydia, you said "if there is not enough network bandwidth", but what is considered enough?

The reason is that everyone has different requirements. We can't tell you how much load you're going to put on the network (which would be the bandwidth requirement) or how much latency is acceptable for your business unit (obviously as close to 0 as possible).

What we can tell you is how to tackle the situation. For example, you're asking how much bandwidth you need. We know that Availability Groups work by transmitting log blocks to *each* replica in the cluster that is part of the availability group. Right away we know the amount of bandwidth is going to depend on how much log we need to send in a given amount of time. What's a way we can see how much log we generate over time? One way is to look at our log backup sizes and average the value across the time between backups. For example, if we had a 100MB log backup and the backup runs every 10 minutes, we know we generated roughly 100MB of log in 10 minutes = 100MB/(10m * 60s/m) = 100MB/600s = .17 MB/s = 170KB/s. That's a general ballpark, as all of that log could have been generated in the first minute while the server sat idle for the other 9 minutes.
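The log-backup arithmetic above can be sketched as a small helper. The 100MB / 10-minute figures are just the example numbers from this thread, not a recommendation:

```python
# Rough replication-bandwidth estimate from log backup size, per the
# reasoning above: average log generated per second approximates the
# minimum bandwidth an availability group replica link needs.
def required_bandwidth_mb_per_sec(log_backup_mb: float, interval_minutes: float) -> float:
    """Average log generation rate in MB/s over one backup interval."""
    return log_backup_mb / (interval_minutes * 60.0)

rate = required_bandwidth_mb_per_sec(100, 10)
print(f"~{rate:.2f} MB/s")  # ~0.17 MB/s, the same ballpark as above
```

Remember this is an average; a bursty workload can need far more than this for short periods.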

If we wanted to get a little more refined with our guesstimate, there is the Log Bytes Flushed/sec performance counter under the SQL Server:Databases performance object. We could capture this value every minute or so and look at it over the period of a week, month end, year end, batch load time, etc., to get a good idea of what our *peak* usage is. If we are flushing 10MB to disk every second, we know we're initially going to need about 10 MB/s of bandwidth.
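One wrinkle worth noting: "per second" counters exposed through `sys.dm_os_performance_counters` are cumulative, so the rate over an interval is the difference between two samples divided by the elapsed seconds. A minimal sketch of turning such samples into a peak rate (the sample values below are made up for illustration; in practice they would come from the DMV or from perfmon):

```python
# Convert cumulative "Log Bytes Flushed/sec" samples into per-interval
# rates and pick the peak, as described above.
def peak_flush_rate(samples: list[tuple[float, int]]) -> float:
    """samples: (timestamp_seconds, cumulative_bytes_flushed), in time order.
    Returns the peak flush rate in bytes/sec across consecutive samples."""
    rates = [
        (b2 - b1) / (t2 - t1)
        for (t1, b1), (t2, b2) in zip(samples, samples[1:])
    ]
    return max(rates)

# One hypothetical sample per minute; the last interval is the busy one.
samples = [(0, 0), (60, 120_000_000), (120, 300_000_000), (180, 900_000_000)]
print(peak_flush_rate(samples) / 1_000_000, "MB/s")  # peak of 10.0 MB/s
```

Sizing the link for the peak rather than the average is what keeps the secondary from falling behind during batch loads.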

When it comes to latency, that's pretty easy. If we're using synchronous commit for our replicas, then we *want* latency as close to 0 as possible, as anything else is going to be observable overhead on the transactions and ultimately the end user. If we're running asynchronous, as long as the data is making it there within our defined SLA, we should be OK.

This is just one of the pieces of information I struggle with finding online.

You're not alone. Unfortunately there isn't a one-size-fits-all answer with databases as there is in some other areas of IT. We really have to fiddle around to get a semi-accurate prediction if we can't have a test environment or anything considered "pre-production" to fully test with.

If we had a test environment and could run/simulate/replay a captured/generated workload, then we would be able to tell exactly how much bandwidth is needed for a normal workload. However, with the pace of some businesses this is a luxury that is generally not afforded.

I have worked on many projects, and the first thing that is cut when a project starts to fall behind is testing of the configuration and hardware. Normally UAT has some time cut from it too, but it's mostly on the infrastructure side, which is unfortunate, as there have been quite a few incidents that ended up costing quite a bit of money because something was rushed into production without proper testing.

-Sean

April 7th, 2015 7:22pm

Sean,

Thank you so much for your explanation. It definitely helped me get a clearer picture. Thank you for providing examples of how to estimate the bandwidth "needed" depending on the workload, and for explaining how latency plays a role. I am originally a developer and switched to a DBA role in the past year, and I'm still trying to ingrain these concepts in my head.

April 8th, 2015 9:23pm

This topic is archived. No further replies will be accepted.
