EDB Corruption Errors 228, 233, 234, 530

We recently had an incident in which a couple of switches were rebooted unintentionally, which caused our backup process (Veeam, which uses VMware VM snapshots) to go haywire. I thought only the backup job had been affected, but after the last snapshot was removed I started getting errors every 5 minutes, primarily event ID 233 but also 234, 228, and 530. Here is what they state:

228: At '5/15/2015 11:42:26 AM', the copy of database 'DB' on this server encountered an error that couldn't be automatically repaired without running as a passive copy and failover was attempted. The error returned was "There is only one copy of this mailbox database (DB). Automatic recovery is not available.". For more information about the failure, consult the Event log on the server for "ExchangeStoreDb" events.

233: At '5/15/2015 11:46:03 AM', database copy 'DB' on this server encountered an error. For more information, consult the Event log for "ExchangeStoreDb" or "MSExchangeRepl" events.

234: At '5/15/2015 11:42:26 AM', the copy of database 'DB' on this server encountered a serious I/O error that may have affected all copies of the database. For information about the failure, consult the Event log on the server for "ExchangeStoreDb" or "MSExchangeRepl" events. All data should be immediately moved out of this database into a new database.

530: Information Store (3468) DB: The database page read from the file "F:\DB\Database\DB.edb" at offset 238081900544 (0x000000376ec98000) (database page 7265682 (0x6EDD92)) for 32768 (0x00008000) bytes failed verification due to a lost flush detection timestamp mismatch. The read operation will fail with error -1119 (0xfffffba1).  If this condition persists, restore the database from a previous backup.  This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.

So I figured I could just create a new database and migrate the mailboxes over, but many of them fail to migrate. For those mailboxes I've tried running the PowerShell repair command (New-MailboxRepairRequest), but that fails too with the following error:

10049: Online integrity check for request 0ab17d2b-bd15-4161-b4df-0dfcfd16c4d6 failed with error -1119.

Exporting to a PST file fails as well, and users report that archiving through Outlook fails once it reaches a corrupted folder.
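For reference, the attempts were along these lines; the mailbox and database names, the bad-item limit, and the export share below are placeholders rather than our actual values (I'm showing the server-side New-MailboxExportRequest for the PST export):

# Move a mailbox from the corrupted database to the new one
New-MoveRequest -Identity "jdoe" -TargetDatabase "DB-New" -BadItemLimit 10

# See which moves failed and how many bad items they hit
Get-MoveRequest -MoveStatus Failed | Get-MoveRequestStatistics | Format-Table DisplayName, Status, BadItemsEncountered -AutoSize

# Online repair of a single mailbox (this is the request that dies with error -1119)
New-MailboxRepairRequest -Mailbox "jdoe" -CorruptionType SearchFolder,AggregateCounts,ProvisionedFolder,FolderView

# Server-side export to PST, which also fails partway through
New-MailboxExportRequest -Mailbox "jdoe" -FilePath "\\fileserver\pst\jdoe.pst"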

I thought this was only happening to one of the databases, so we figured we'd migrate as many mailboxes as we could to a new drive and then announce data loss for the rest. Right now we're copying the last good backup of the EDB and the log files to the drive to mount in the old database's place, in hopes of getting away from the errors. Unfortunately, due to drive space constraints, we were forced to enable circular logging on this database, but we're okay with the one to two days of data loss for that particular database. The disturbing part is that once we dismounted the corrupted database, we started receiving the same errors for two other databases... Fortunately, those aren't nearly as big and they do not have circular logging enabled, so we might be able to do a full restore, assuming the log files are not corrupted. However, I am worried that there is a bigger problem, such as drive failure.
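For anyone in a similar spot, the rough sequence for the restored copy looks like this; the E00 log prefix and the log-folder path below are placeholders, not necessarily our exact layout:

# Check the restored database header; it has to show "State: Clean Shutdown" before it will mount
eseutil /mh "F:\DB\Database\DB.edb"

# If it shows Dirty Shutdown, replay the copied log files into it (E00 stands in for that database's log prefix)
eseutil /r E00 /l "F:\DB\Logs" /d "F:\DB\Database"

# Circular logging had to be turned on temporarily; verify it and turn it back off once space allows
Get-MailboxDatabase "DB" | Format-List Name, CircularLoggingEnabled
Set-MailboxDatabase "DB" -CircularLoggingEnabled $false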

I am wondering if anyone can offer some advice for this scenario, and I want to make sure I am going down the right path by simply running the restore process for each database that gets these errors until we can move everything to new storage. We are on Exchange 2010 SP1, and we have been working hard over the last few months to get our environment ready for 2013 (we purchased new storage for that deployment). Sorry for the lengthy post, and please let me know if you need any further information from me.

Thank you in advance for your time!


  • Edited by Scott_42 Friday, May 15, 2015 7:49 PM
May 15th, 2015 7:48pm

UPDATE:

We restored the last backup we had of the EDB for the database in question. It replayed the few log files we had, threw a bunch of errors (mostly 228, but also 203 and 474), and then all of a sudden everything went back to normal: no errors, no corrupted mailboxes, and I am even able to migrate the same mailboxes that failed earlier. It's been almost an hour since we mounted the EDB from backup, and the errors for the other EDBs that were reported as corrupted have also ceased. It's almost as if putting that restored EDB back in place brought the drive, and/or all of Exchange, back into consistency. I'll wait to celebrate until the 1 AM maintenance cycle runs and see if it kicks out more errors, but if anyone cares to elaborate or explain, that would be helpful. I am obviously not an Exchange expert nor a storage expert, so I am only making educated guesses at this point. Otherwise, if this remains stable, then perhaps we've bought ourselves enough time to finish our Exchange 2013 deployment...
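In case it helps anyone watching a similar situation, a quick way to check whether these events start coming back after the maintenance cycle (the event IDs are the ones from my original post; the 24-hour window is arbitrary):

# Pull any of the corruption-related events (228/233/234/530) logged in the last 24 hours
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 228,233,234,530; StartTime = (Get-Date).AddHours(-24) } |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, ProviderName, Message -AutoSize -Wrap

# Sanity-check that the databases report healthy and mounted
Get-MailboxDatabaseCopyStatus * | Format-Table Name, Status, ContentIndexState -AutoSize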



  • Edited by Scott_42 Saturday, May 16, 2015 1:13 AM
May 16th, 2015 1:10am

Well, we are certainly not ignoring that advice... In cases like this I'd actually prefer some downtime to make sure that no new data gets lost on top of everything. I like that term "purposeful mistake" :). I guess I can see how that could happen with the IQN, especially if you haven't done a lot of Windows clustering before.

Well, as an update to my issue, I found some evidence that points to the drives. Although I still can't find anything specific in any of the event logs for the LUN, I had a mailbox fail to migrate on one of my EDBs. Since it was one of our smaller EDBs, I decided to try moving that database to a different drive, and after that move the mailbox migration succeeded. I can only assume that parts of the disk are becoming corrupted, or that Exchange has a hard time reading its own EDB when it sits on certain drive sectors.
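For reference, this is roughly how I went looking for disk evidence and what the workaround amounted to; the database name, the G: paths, and the mailbox name below are placeholders:

# Scan the System log for storage-related errors/warnings over the last week
Get-EventLog -LogName System -EntryType Error,Warning -After (Get-Date).AddDays(-7) |
    Where-Object { $_.Source -match 'disk|ntfs|iscsiprt|storport|mpio' } |
    Format-Table TimeGenerated, Source, EventID, Message -AutoSize -Wrap

# Workaround: relocate the smaller database to a different drive, then retry the mailbox move
# (Move-DatabasePath dismounts the database while it copies the files)
Move-DatabasePath -Identity "DB2" -EdbFilePath "G:\DB2\DB2.edb" -LogFolderPath "G:\DB2\Logs"
New-MoveRequest -Identity "jdoe" -TargetDatabase "DB-New"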

May 18th, 2015 10:59am

Yes, that definitely sounds like a disk-related issue. It could be the actual disk, or even the Ethernet controller, driver, or firmware, or perhaps a latency issue caused by the switch. But if you moved it to another drive and the problem resolved without changing anything else, then it's definitely something on that disk/LUN.

Glad you were able to resolve it by moving to another drive.

Let me know if you discover the cause, as it's always good to have that knowledge for the future.

 
May 18th, 2015 11:18am

