Azure doesn't use ECC memory?

Reading an overview of SQL Azure [1], I saw a comment that SQL Azure does not have ECC memory, with the suggestion that its absence only affects uptime. That might be true for some workloads, but if the RAM mainly holds user data, it is not.

If SQL Azure isn't using ECC RAM, the likelihood of getting corrupted data is non-trivial. It seems like a bad idea to allow memory errors as a routine, uncorrected, undetected mode of operation for server applications.

Does anyone have any insight into this?

1: http://msdn.microsoft.com/en-us/library/windowsazure/jj879332.aspx


May 13th, 2013 11:31pm

SQL Azure (aka Windows Azure SQL Database) has a series of measures to protect user data. Essentially, we have built mechanisms that give the same benefits as ECC without requiring ECC. (More broadly, we check all sorts of things, such as network packet checksums, to detect and correct for bugs in hardware or drivers.) For data writes, we commit to a quorum of machines to guarantee that changes are hardened on disk, and we validate data as it is written so we can detect and correct bit errors on disk. We also have measures, such as page checksums, to detect and correct bit errors in memory. Beyond that, we have a system to detect and track hardware failures, to burn in hardware with tests before user workloads are allowed on a machine (both initially and before a machine is returned from a repair action), and to run statistical analysis on kinds of failures that might predict a future bit error, so we can proactively take the machine offline.
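As a rough sketch of the page-checksum idea mentioned above (this is not Microsoft's actual algorithm — SQL Server's real page checksum is an internal function, and `PAGE_SIZE`, `write_page`, and `read_page` here are hypothetical names), a CRC32 stored alongside the page payload is enough to detect a bit flip that occurs between write and read:

```python
import zlib

PAGE_SIZE = 8192  # 8 KB page, SQL Server-style (illustrative only)

def write_page(data: bytes) -> bytes:
    """Prepend a CRC32 of the payload, standing in for a page checksum."""
    assert len(data) <= PAGE_SIZE - 4
    return zlib.crc32(data).to_bytes(4, "big") + data

def read_page(page: bytes) -> bytes:
    """Verify the checksum on read; raise if the page was corrupted."""
    stored, data = int.from_bytes(page[:4], "big"), page[4:]
    if zlib.crc32(data) != stored:
        raise IOError("page checksum mismatch: corruption detected")
    return data

page = write_page(b"pricing row 42: $19.99")
assert read_page(page) == b"pricing row 42: $19.99"

# Flip a single bit in the payload, as a memory or disk error would.
corrupted = bytearray(page)
corrupted[10] ^= 0x01
try:
    read_page(bytes(corrupted))
except IOError as e:
    print(e)  # the flip is detected (though not corrected by the checksum alone)
```

Note that a checksum by itself only detects the error; correction comes from having another good copy to fall back on, which is where the quorum of replicas comes in.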

Ultimately, we are confident the model is correct (in fact, we have higher confidence in this system than in much of the hardware we see people run on-premises, because we spend so much time formally validating hardware), so you can be confident in the correctness of your data in SQL Azure.

I hope this gives more clarity to your question/concern.

Thanks

Conor

June 25th, 2014 6:37pm

Thank you for the reply! It helps clarify things a bit. But I was under the impression that RAM single-bit error rates were fairly high. Wikipedia's quote of the Google study says

"about 5 single bit errors in 8 Gigabytes of RAM per hour using the top-end error rate"

If I were keeping 1 GB of pricing information in an array, how is that protected? How do page checksums fix things?

Or, perhaps easiest to explain: a single-bit error happens in a memory module. What happens next? With ECC it's corrected, and on a two-bit error the system faults. That's nice and clear, and gives me a fair amount of reliability. What's the story for Azure in this exact scenario?
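For reference, the ECC behavior described here — correct any single-bit error, fault on any double-bit error — is what an extended Hamming (SECDED) code provides. A minimal sketch over a 4-bit word (real ECC DIMMs use a wider code over 64-bit words, and `encode`/`decode` are hypothetical names):

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED (extended Hamming) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]          # d0..d3
    c = [0] * 8                                        # c[0] = overall parity
    c[3], c[5], c[6], c[7] = d                         # data at positions 3,5,6,7
    c[1] = c[3] ^ c[5] ^ c[7]                          # parity over odd positions
    c[2] = c[3] ^ c[6] ^ c[7]                          # parity over positions with bit 2
    c[4] = c[5] ^ c[6] ^ c[7]                          # parity over positions with bit 4
    c[0] = sum(c[1:]) % 2                              # overall parity bit
    return c

def decode(c):
    """Return (data_bits, status). Corrects 1-bit errors, detects 2-bit errors."""
    syndrome = 0
    for p in (1, 2, 4):
        covered = [i for i in range(1, 8) if i & p]
        if sum(c[i] for i in covered) % 2:
            syndrome += p                              # syndrome = error position
    overall = sum(c) % 2                               # 0 when total parity is even
    if syndrome and overall:                           # single-bit error: correct it
        c = c[:]
        c[syndrome] ^= 1
        status = "corrected"
    elif syndrome and not overall:                     # double-bit error: fault
        return None, "double-bit error: fault"
    elif overall:                                      # flip was in the parity bit itself
        status = "corrected"
    else:
        status = "ok"
    return [c[3], c[5], c[6], c[7]], status

cw = encode(0b1011)
cw[6] ^= 1                       # single-bit flip: ECC transparently corrects it
data, status = decode(cw)
print(data, status)
```

The two clean outcomes — silent correction or a hard fault — are exactly the contract the question is asking Azure to match in software.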


June 30th, 2014 11:54pm

We have extended the software to detect bit errors on replicas and throw the affected replica out of the replica set, rebuilding it from the other copies. There are various layers of checks on top of the hardware so that we can catch and correct bit errors when possible, or invalidate the replica (rebuilding it from another copy automatically) when we cannot.

In our environment, we have not seen bit-error rates that align with the quote you mention, but we are also doing somewhat different things than others in our solution.

The answer to your hypothetical varies quite a bit based on what resides in the memory module at the time. If the error corrupts a database page in one replica, we recognize the incorrect page checksum when writing the page out (if it was dirty) and invalidate the replica. If the memory held something that is not data (a memory structure for the engine, for example), the process would likely crash, which would cause us to switch to a different primary and restart the server on the (now) secondary; that recovers from the issue, since memory is re-initialized on process restart. Generally, each of the main scenarios is handled with layers of checks to preserve data and get you to a working replica in case of an issue like this.
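The detect-and-rebuild loop for the data-page case can be sketched as a toy model (all names here — `Replica`, `ReplicaSet`, `scrub` — are hypothetical; the real fabric is far more involved, and this again uses CRC32 as a stand-in for the actual page checksum):

```python
import zlib
from dataclasses import dataclass, field

@dataclass
class Replica:
    pages: dict = field(default_factory=dict)  # page_id -> (checksum, bytes)

    def write(self, page_id, data: bytes):
        self.pages[page_id] = (zlib.crc32(data), data)

    def verify(self) -> bool:
        """Do all stored pages still match their checksums?"""
        return all(zlib.crc32(d) == c for c, d in self.pages.values())

class ReplicaSet:
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]

    def write(self, page_id, data: bytes):
        # A commit is durable once a quorum of replicas has hardened it.
        acks = sum(1 for r in self.replicas if r.write(page_id, data) is None)
        assert acks > len(self.replicas) // 2, "no quorum"

    def scrub(self) -> int:
        """Evict replicas with checksum mismatches; rebuild from a healthy copy."""
        healthy = [r for r in self.replicas if r.verify()]
        assert healthy, "all replicas corrupt"
        rebuilt = 0
        for i, r in enumerate(self.replicas):
            if not r.verify():
                fresh = Replica()
                for pid, (_, data) in healthy[0].pages.items():
                    fresh.write(pid, data)
                self.replicas[i] = fresh
                rebuilt += 1
        return rebuilt
```

A quick run-through: write a page to all three replicas, tamper with one replica's stored bytes without updating its checksum (simulating a bit flip), and `scrub` evicts and rebuilds exactly that one copy.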

August 21st, 2014 3:23pm

All servers that run SQL Azure have ECC memory.

July 29th, 2015 1:26pm

I was asked a question about this yesterday, and I asked for more clarification, since what I found when I double-checked was not quite what I had been told when I wrote my original replies on this.

So, to clarify:

SQL Azure uses hardware with ECC memory. It also has various layers of protection against corruption in other layers of the stack (like what Guy mentions).

I am trying to double-check whether all earlier hardware configurations in SQL Azure also had ECC memory (at the time I wrote the post, I believed that some did not). I'm trying to get more data on that, but as some of those classes of hardware are no longer deployed, I may not be able to fully convince myself :).

Thanks,

Conor

July 30th, 2015 10:22am

This topic is archived. No further replies will be accepted.
