EDB Corruption Errors 228, 233, 234, 530

Just recently we had an incident in which a couple of switches were rebooted unintentionally, which caused a backup process (Veeam) that uses VM snapshots (on VMware) to go haywire. I thought only the backup job was affected, but shortly after the last snapshot was removed I started getting errors every 5 minutes, primarily event ID 233 but also 234, 228, and 530. Here is what they state:

228: At '5/15/2015 11:42:26 AM', the copy of database 'DB' on this server encountered an error that couldn't be automatically repaired without running as a passive copy and failover was attempted. The error returned was "There is only one copy of this mailbox database (DB). Automatic recovery is not available.". For more information about the failure, consult the Event log on the server for "ExchangeStoreDb" events.

233: At '5/15/2015 11:46:03 AM', database copy 'DB' on this server encountered an error. For more information, consult the Event log for "ExchangeStoreDb" or "MSExchangeRepl" events.

234: At '5/15/2015 11:42:26 AM', the copy of database 'DB' on this server encountered a serious I/O error that may have affected all copies of the database. For information about the failure, consult the Event log on the server for "ExchangeStoreDb" or "MSExchangeRepl" events. All data should be immediately moved out of this database into a new database.

530: Information Store (3468) DB: The database page read from the file "F:\DB\Database\DB.edb" at offset 238081900544 (0x000000376ec98000) (database page 7265682 (0x6EDD92)) for 32768 (0x00008000) bytes failed verification due to a lost flush detection timestamp mismatch. The read operation will fail with error -1119 (0xfffffba1).  If this condition persists, restore the database from a previous backup.  This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
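For reference, a quick way to see how often these IDs recur and which databases they name is to filter the Application log, something like this (just a sketch, run from PowerShell on the mailbox server):

    # Grab the last 24 hours of the corruption-related event IDs from the Application log
    Get-EventLog -LogName Application -After (Get-Date).AddHours(-24) |
        Where-Object { 228,233,234,530 -contains $_.EventID } |
        Select-Object TimeGenerated, Source, EventID, Message |
        Format-Table -AutoSize -Wrap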

So I figured I could just create a new database and migrate the mailboxes over, but many of them fail to migrate. For those mailboxes, I've tried running the repair cmdlet (New-MailboxRepairRequest), but that fails too with the following error:

10049: Online integrity check for request 0ab17d2b-bd15-4161-b4df-0dfcfd16c4d6 failed with error -1119.

The export to PST fails as well, and users report that archiving through Outlook fails once it reaches a corrupted folder.
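For reference, the cmdlets involved look roughly like this on 2010 SP1; names, paths, and limits below are placeholders rather than our real values:

    # New database on the new drive, then try to move mailboxes into it
    New-MailboxDatabase -Name "DB-New" -Server MBX01 -EdbFilePath "G:\DB-New\DB-New.edb" -LogFolderPath "G:\DB-New\Logs"
    Mount-Database -Identity "DB-New"
    New-MoveRequest -Identity "jdoe@domain.com" -TargetDatabase "DB-New" -BadItemLimit 50

    # Online repair for the mailboxes whose moves fail (this is the request that dies with -1119)
    New-MailboxRepairRequest -Mailbox "jdoe@domain.com" -CorruptionType ProvisionedFolder,SearchFolder,AggregateCounts,FolderView

    # PST export attempt (requires the Mailbox Import Export role); this fails as well
    New-MailboxExportRequest -Mailbox "jdoe@domain.com" -FilePath "\\fileserver\pst\jdoe.pst"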

I thought this was only happening to one of the databases, so we figured we'd migrate as many mailboxes as we could to a new drive and then announce data loss for the rest. Right now, we're copying the last good backup of the .edb and the log files to the drive, to mount in the old database's place, in hopes that we can get away from the errors. Unfortunately, due to drive constraints, we were forced to enable circular logging on this database, but we're okay with the one to two days of data loss there. The disturbing part is that once we dismounted the corrupted database, we started receiving the same errors for two other databases... Fortunately, those aren't nearly as big and they do not have circular logging enabled, so we might be able to do a full restore, assuming the log files are not corrupted. However, I am worried that there is a bigger problem, such as drive failure.
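For the record, the rough sequence we're following for the swap looks like this (paths and database names are placeholders, and eseutil has to run against a dismounted copy):

    # Dismount the corrupted copy before touching files on the drive
    Dismount-Database -Identity "DB" -Confirm:$false

    # Check the header of the restored .edb; it should report "Clean Shutdown" before we mount it
    eseutil /mh "F:\DB\Database\DB.edb"

    # Verify the log files we plan to replay are not themselves damaged
    eseutil /ml "F:\DB\Logs\E00"

    # Circular logging to live within the drive constraints, then mount the restored copy
    Set-MailboxDatabase -Identity "DB" -CircularLoggingEnabled $true
    Mount-Database -Identity "DB"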

I am wondering if anyone can offer some advice for this scenario. I want to make sure I am going down the right path of simply running the restore process for each DB that gets this error until we can move everything to new storage. We are on Exchange 2010 SP1 and have been working hard over the last few months to get our environment ready for 2013 (we purchased new storage for that deployment). Sorry for the lengthy post, and please let me know if you need any further info from me.

Thank you in advance for your time!


May 15th, 2015 3:49pm

UPDATE:

We restored the last backup we had of the EDB for the database in question. It replayed the few log files we had, threw a BUNCH of errors (mostly 228, but also 203 and 474), and then all of a sudden everything went back to normal: no errors, no corrupted mailboxes, and I am even able to migrate the same mailboxes that failed earlier. It's been almost an hour since we mounted the EDB from backup, and the errors for the other EDBs that were reported as corrupted have also ceased. It's almost as if putting that EDB from backup back in place brought the drive, and/or all of Exchange, back into consistency. I'll wait to celebrate until the 1 AM maintenance cycle runs and see if it kicks out more errors, but if anyone cares to elaborate or explain, that would be helpful. I am obviously not an Exchange expert nor a storage expert, so I am only making educated guesses at this point. Otherwise, if this remains stable, perhaps we've bought ourselves enough time to finish our Exchange 2013 deployment...
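If anyone else lands here, one way to rule the disk in or out after a restore like this is a full checksum pass over the dismounted file with eseutil (path is a placeholder, and it takes a while on a large database):

    # Physical page-level checksum verification of the database file (must be dismounted first)
    Dismount-Database -Identity "DB" -Confirm:$false
    eseutil /k "F:\DB\Database\DB.edb"
    Mount-Database -Identity "DB"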



May 15th, 2015 9:11pm

Very odd. I ran into this same issue with a client the other day (they were running Exchange 2010). In summary, here is what took place:

1. The customer had two Exchange servers with numerous DBs. Originally, all DBs were on local storage; however, they were running low on disk space.

2. Since they had an iSCSI array with lots of storage, they created new LUNs on the array for each box and then moved the EDBs and logs to the array on Saturday.

3. All went well until Sunday night, when the stores started going down.

4. They mucked with it until Monday, when I finally got involved. Upon examination, there were numerous very serious errors, all pointing at the iSCSI array.

5. We finally discovered that they were sharing the same IQN between both servers, which means they were stepping all over each other, i.e. both machines were trying to read/write to the same LUN, which was causing all the corruption.

6. Cutting to the chase: after shutting down all the DBs, making new LUNs, copying data off to a private LUN on one of the servers, and letting the second server retain ownership of the original LUN, we had two of the databases start squawking about "the copy of database 'DB' on this server encountered an error that couldn't be automatically repaired without running as a passive copy and failover was attempted". The weird part is that these were just standalone DBs, i.e. not part of a DAG and never had been.

7. We tried the same things you did, to no avail. Finally, the DBs cratered; luckily we had backups of the damaged DBs, so we dial-toned the production DBs to start fresh and then used our DigiScope tool to open the copy of the damaged DB via our forensic mount option and restore all the data to the new, clean stores (a rough native equivalent is sketched below).
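For anyone without a third-party tool, the rough native route on 2010 SP1 is a recovery database plus restore requests into the dial-tone store, something along these lines (server, paths, and names are placeholders):

    # Recovery database pointed at a restored copy of the damaged .edb (the file must be in Clean Shutdown)
    New-MailboxDatabase -Recovery -Name "RDB1" -Server MBX01 -EdbFilePath "E:\Recovery\DB.edb" -LogFolderPath "E:\Recovery\Logs"
    Mount-Database -Identity "RDB1"

    # See which mailboxes the recovery database contains, then pull one into the new dial-tone store
    Get-MailboxStatistics -Database "RDB1" | Format-Table DisplayName, ItemCount
    New-MailboxRestoreRequest -SourceDatabase "RDB1" -SourceStoreMailbox "Jane Doe" -TargetMailbox "jdoe@domain.com"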

So with that said, if your situation has any similar threads, you should at least make offline copies of the DBs that are squawking in case you have to do a recovery, and also check your system event logs to ensure that the IP flip issue you discovered did not cause any disk-related errors.

May 15th, 2015 9:38pm

Yes, we have a very similar setup indeed. Luckily, the system log looks clean so far, but I am still suspicious of our SCSI drives. At least things look stable for the moment; I'll keep monitoring the application log, and we are making it a priority to complete a new Exchange 2013 deployment and get all of the mailboxes migrated as soon as possible. Otherwise, we may consider at least migrating the mailboxes in that database to a new one for the time being. I will also take your advice and try to get a copy of the EDB. I am very curious how your client managed to get duplicate IQNs, though.
May 15th, 2015 10:23pm

Yes, I would make a copy of any EDBs acting odd or giving errors. The customer ignored that advice at first because they were more concerned about keeping users up, and they lost one of their DBs entirely; after that they listened, and we fixed the rest up.

So the IQN issue was a purposeful mistake: the storage guy thought it was fine to share IQNs because he had read an article showing how to do this for Exchange. When I asked him which article, it turned out it was in relation to CLUSTERED servers, which is true (but we were not using clusters): you can do that for the quorum drive because, even though multiple servers have access, it's set up so that no more than one server will talk to that LUN, i.e. the passive nodes just wait to take over if the active node fails. Anyway, in the end, all of it was caused by a mistake...
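If anyone wants to sanity-check their own hosts, each server should report its own unique initiator IQN. On Windows Server 2012 or later the Storage module exposes it directly (older boxes can read it from the iSCSI Initiator control panel instead); identical values across hosts are the red flag:

    # Run on each Exchange host; the NodeAddress (IQN) must be different on every server
    Get-InitiatorPort | Select-Object NodeAddress, ConnectionType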

May 15th, 2015 11:47pm
