Azure doesn't use ECC memory?

Reading an overview of SQL Azure [1], I saw a comment that SQL Azure does not have ECC memory, with the suggestion that its absence only affects uptime. That might be true for some workloads, but if the RAM mainly holds user data, it is not.

If SQL Azure isn't using ECC RAM, the likelihood of getting corrupted data is non-trivial. It seems like a bad idea to allow memory errors as a routine, uncorrected, undetected mode of operation for server applications.

Does anyone have any insight into this?

1: http://msdn.microsoft.com/en-us/library/windowsazure/jj879332.aspx


May 13th, 2013 11:31pm

SQL Azure (aka Windows Azure SQL Database) has a series of measures to protect user data. Essentially, we have built mechanisms that give the same benefits as ECC without requiring ECC. (More broadly, we check all sorts of things, such as network packet checksums, to detect and correct for bugs in hardware or drivers.) For data writes, we commit to a quorum of machines to guarantee that changes are hardened on disk, and we validate data as it is written so we can detect and correct bit errors on disk. We also have measures, such as page checksums, to detect and correct bit errors in memory. Beyond that, we have a system to detect and track hardware failures, to burn in hardware with tests before user workloads are allowed on a machine (both initially and before a machine is returned from a repair action), and to run statistical analysis on kinds of failures that might predict a future bit error, so we can proactively take the machine offline.
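As a rough sketch of the page-checksum idea mentioned above (this is not Microsoft's actual algorithm — SQL Server's real page checksum is an internal function, and `PAGE_SIZE`, `write_page`, and `read_page` here are hypothetical names), a CRC32 stored alongside the page payload is enough to detect a bit flip that occurs between write and read:

```python
import zlib

PAGE_SIZE = 8192  # 8 KB page, SQL Server-style (illustrative only)

def write_page(data: bytes) -> bytes:
    """Prepend a CRC32 of the payload, standing in for a page checksum."""
    assert len(data) <= PAGE_SIZE - 4
    return zlib.crc32(data).to_bytes(4, "big") + data

def read_page(page: bytes) -> bytes:
    """Verify the checksum on read; raise if the page was corrupted."""
    stored, data = int.from_bytes(page[:4], "big"), page[4:]
    if zlib.crc32(data) != stored:
        raise IOError("page checksum mismatch: corruption detected")
    return data

page = write_page(b"pricing row 42: $19.99")
assert read_page(page) == b"pricing row 42: $19.99"

# Flip a single bit in the payload, as a memory or disk error would.
corrupted = bytearray(page)
corrupted[10] ^= 0x01
try:
    read_page(bytes(corrupted))
except IOError as e:
    print(e)  # the flip is detected (though not corrected by the checksum alone)
```

Note that a checksum by itself only detects the error; correction comes from having another good copy to fall back on, which is where the quorum of replicas comes in.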

Ultimately, we are confident the model is correct (in fact, we have higher confidence in this system than in much of the hardware we see people run on-premises, because we spend so much time formally validating hardware), so you can be confident in the correctness of your data in SQL Azure.

I hope this gives more clarity to your question/concern.

Thanks

Conor

June 25th, 2014 6:37pm

Thank you for the reply! It helps clarify things a bit. But I was under the impression that RAM single-bit error rates were fairly high. Wikipedia's quote of the Google study says

"about 5 single bit errors in 8 Gigabytes of RAM per hour using the top-end error rate"

If I were keeping 1 GB of pricing information in an array, how is that protected? How do page checksums fix things?

Or, perhaps easiest to explain: a single-bit error happens in a memory module. What happens next? With ECC it's corrected, and on a two-bit error the system faults. That's nice and clear, and gives me a fair amount of reliability. What's the story for Azure in this exact scenario?
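For reference, the ECC behavior described here — correct any single-bit error, fault on any double-bit error — is what an extended Hamming (SECDED) code provides. A minimal sketch over a 4-bit word (real ECC DIMMs use a wider code over 64-bit words, and `encode`/`decode` are hypothetical names):

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED (extended Hamming) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]          # d0..d3
    c = [0] * 8                                        # c[0] = overall parity
    c[3], c[5], c[6], c[7] = d                         # data at positions 3,5,6,7
    c[1] = c[3] ^ c[5] ^ c[7]                          # parity over odd positions
    c[2] = c[3] ^ c[6] ^ c[7]                          # parity over positions with bit 2
    c[4] = c[5] ^ c[6] ^ c[7]                          # parity over positions with bit 4
    c[0] = sum(c[1:]) % 2                              # overall parity bit
    return c

def decode(c):
    """Return (data_bits, status). Corrects 1-bit errors, detects 2-bit errors."""
    syndrome = 0
    for p in (1, 2, 4):
        covered = [i for i in range(1, 8) if i & p]
        if sum(c[i] for i in covered) % 2:
            syndrome += p                              # syndrome = error position
    overall = sum(c) % 2                               # 0 when total parity is even
    if syndrome and overall:                           # single-bit error: correct it
        c = c[:]
        c[syndrome] ^= 1
        status = "corrected"
    elif syndrome and not overall:                     # double-bit error: fault
        return None, "double-bit error: fault"
    elif overall:                                      # flip was in the parity bit itself
        status = "corrected"
    else:
        status = "ok"
    return [c[3], c[5], c[6], c[7]], status

cw = encode(0b1011)
cw[6] ^= 1                       # single-bit flip: ECC transparently corrects it
data, status = decode(cw)
print(data, status)
```

The two clean outcomes — silent correction or a hard fault — are exactly the contract the question is asking Azure to match in software.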


June 30th, 2014 11:54pm

We have extended the software to detect bit errors on replicas and throw the affected replica out of the replica set, rebuilding it from the other copies. There are various layers of checks on top of the hardware so that we can catch and correct bit errors when possible, or invalidate the replica (rebuilding it from another copy automatically) when we cannot.

In our environment, we have not seen bit-error rates that align with the quote you mention, but we are also doing somewhat different things than others in our solution.

The answer to your hypothetical varies quite a bit based on what resides in the memory module at the time. If the error corrupts a database page in one replica, we recognize the incorrect page checksum when writing the page out (if it was dirty) and invalidate the replica. If the memory held something that is not data (a memory structure for the engine, for example), the process would likely crash, which would cause us to switch to a different primary and restart the server on the (now) secondary; that recovers from the issue, since memory is re-initialized on process restart. Generally, each of the main scenarios is handled with layers of checks to preserve data and get you to a working replica in case of an issue like this.
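The detect-and-rebuild loop for the data-page case can be sketched as a toy model (all names here — `Replica`, `ReplicaSet`, `scrub` — are hypothetical; the real fabric is far more involved, and this again uses CRC32 as a stand-in for the actual page checksum):

```python
import zlib
from dataclasses import dataclass, field

@dataclass
class Replica:
    pages: dict = field(default_factory=dict)  # page_id -> (checksum, bytes)

    def write(self, page_id, data: bytes):
        self.pages[page_id] = (zlib.crc32(data), data)

    def verify(self) -> bool:
        """Do all stored pages still match their checksums?"""
        return all(zlib.crc32(d) == c for c, d in self.pages.values())

class ReplicaSet:
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, page_id, data: bytes):
        # A commit is durable once a quorum of replicas has hardened it.
        acks = sum(1 for r in self.replicas if r.write(page_id, data) is None)
        assert acks > len(self.replicas) // 2, "no quorum"

    def scrub(self) -> int:
        """Evict replicas with checksum mismatches; rebuild from a healthy copy."""
        healthy = [r for r in self.replicas if r.verify()]
        assert healthy, "all replicas corrupt"
        rebuilt = 0
        for i, r in enumerate(self.replicas):
            if not r.verify():
                fresh = Replica()
                for pid, (_, data) in healthy[0].pages.items():
                    fresh.write(pid, data)
                self.replicas[i] = fresh
                rebuilt += 1
        return rebuilt
```

A quick run-through: write a page to all three replicas, tamper with one replica's stored bytes without updating its checksum (simulating a bit flip), and `scrub` evicts and rebuilds exactly that one copy.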

August 21st, 2014 3:23pm

All servers that run SQL Azure have ECC memory.

July 29th, 2015 1:26pm

I was asked a question about this yesterday, and I asked for more clarification, since what I found when I double-checked was not quite what I had been told when I wrote my original replies on this.

So, to clarify:

SQL Azure uses hardware with ECC memory. It also has various layers of protection against corruption in other layers of the stack (like what Guy mentions).

I am trying to double-check whether all earlier hardware configurations in SQL Azure also had ECC memory (at the time I wrote the post, I believed that some did not). I'm trying to get more data on that, but as some of those classes of hardware are no longer deployed, I may not be able to fully convince myself :).

Thanks,

Conor

July 30th, 2015 10:22am

This topic is archived. No further replies will be accepted.
