Why is DPM 2012 R2 splitting this Co-located Protection Group Set across three different tapes?
I have 10 protection groups that ran Saturday night. I have put information about them in the following table:
I extracted the data in the table using Get-DPMProtectionGroup CmdLet and the job information from the Monitoring page of the DPM admin console. As you can see from the tape barcode, DPM wrote the first four PGs to one tape, then started a second
tape, and finally failed when starting a third because it had run out of free tapes.
All of the protection groups are assigned to the same "Co-Located Protection Group Set" in the "Tape Optimization Setup" dialog box.
Prior to the jobs running, I painstakingly ensured that each of the Protection Group's "Select Long-Term Goals" and "Select Library and Tape Details" screens are identical as follows:
Prior to running the jobs I had already disabled autoclean on the tape library.
I have checked to see if the tape drive is reporting IO error 0x8007045D using the method described in this blog post:
http://blogs.technet.com/b/dpm/archive/2012/05/14/things-you-can-do-to-help-data-protection-manager-utilize-your-tapes-full-capacity.aspx
There seems to be no mention of 0x8007045D or -2147023779 in any of the MSDPM*.Errlog files.
These are LTO6 tapes with a capacity of 2.5 TB each. The total data for all the jobs put together is less than 300GB.
Why is DPM 2012 R2 splitting these protection groups across three different tapes?
What else should I check for that might cause DPM to start a new tape?
-
Edited by
alx9r
Monday, August 18, 2014 8:53 PM
August 18th, 2014 11:52pm
Hi,
Did you check the DPMRA*.Errlog files ? Did you add the BufferQueueSize and BusyRetryCount registry entries ?
August 19th, 2014 3:57am
Hi Mike,
Thanks for the quick response. I did check the DPMRA*.Errlog files for "0x8007045D" and "-2147023779". There are no instances of those strings in the DPMRA*.Errlog files. Your blog post indicates adding BufferQueueSize and BusyRetryCount as potential
cures for IO_DEVICE_ERROR 0x8007045D. Since IO_DEVICE_ERROR 0x8007045D is not mentioned in any logs, I haven't added BufferQueueSize or BusyRetryCount.
Should I add the BufferQueueSize and BusyRetryCount anyway? If so, how do I know if it's helping if there were no "0x8007045D" errors in the first place?
Is there something else I should be looking for in the logs that would indicate why DPM switched tapes?
Thanks for your help,
Alex
August 19th, 2014 7:53pm
Hi,
The registry entries are preventative in nature so they will not cause any harm.
You can see if the tape drive is reporting end of media by performing this search.
Find /i "0x8007044C" dpmra*.errlog
August 19th, 2014 8:09pm
Hi Mike,
I must have made a mistake earlier. I retraced my steps and searched again for "0x8007044C" and "0x8007045D" in each of DPMRA*.errlog and MSDPM*.errlog. There are 46kB worth of lines containing "0x8007045D" in the
DPMRA*.errlog. The other three searches yielded no results. The "0x8007045D" errors all seem to look something like this:
0C84 1804 08/17 06:11:15.676 22 genericthreadpool.cpp(537) [0000000000D87210] WARNING Failed: Hr: = [0x8007045d] Thread 6148 received ERROR
0C84 1C9C 08/17 06:11:15.676 18 bufferedmediareader.cpp(782) [0000000000DA0A30] WARNING Failed: Hr: = [0x8007045d] CBufferedMediaReader - Received unknown error. number of bytes - 0, index - 5.
The strange thing is that all of the occurrences are from either 08/12 or after 06:00 on 08/17 which is after all the backup jobs had finished. None of the "0x8007045D" timestamps seem to match up with 08/17 00:15 or 08/17 02:38
which are the times when DPM switched to new tapes.
I have added BufferQueueSize and BusyRetryCount to the registry. The co-located protection groups in question are scheduled to backup to tape on Saturday night. Is there any way to run Saturday's backup jobs early without manually changing the
schedule of all of the protection groups?
Alex
-
Edited by
alx9r
Tuesday, August 19, 2014 7:01 PM
August 19th, 2014 9:52pm
Hi,
All the times logged in the DPM Errlog files are in GMT time - so based on local time zone, you need to subtract from the logged time. Lets see if the registry entries worked, they should unless you have a hardware problem.
August 19th, 2014 10:48pm
Hi Mike,
Can you confirm the correct path to the BusyRetryCount registry entry? Your blog post mentions three different spellings of the final level of the hierarchy:
- "Storport"
- "StorPort"
- "StorpPort"
I'm still seeing 0x8007045d errors. I'd like to ensure the registry entry I created is being picked up. Is there a way to see that happening in the logs?
Alex
August 25th, 2014 7:03pm
Hi
Gosh - good catch - I don't believe the correct entry is case sensitive - however the correct spelling is StorPort.
The DPMRA logs are only going to detail that the tape drive reported an IO error. The BufferQueueSize is also important in eliminating the errors. If you have multiple tape drives, you can try disabling all but one at a time and see if it's just
a particular drive, or if it's a bus problem.
August 25th, 2014 8:26pm
Hi Mike,
I added another free tape and invoked "Resume Tape Backups..." for each of the failed protection groups. This time there were no 0x8007045d errors. Strangely, DPM completed backup for exactly one PG, produced the warning "Job waiting for tape"
and, most bizarrely, marked the new tape as "offsite ready".
All of the resumed jobs should have been co-located onto a single tape. I have a "Write period" of 1 day for this PG set. Is DPM refusing to co-locate PGs and marking this new tape as "offsite ready" because it is more than 1 day after the scheduled
back up time? If so, how do you ever get resumed backup jobs to co-locate?
Alex
August 26th, 2014 12:44am
Hi,
Yes, most likely that is what happened. For D2D2T - the recovery point creation time is that of the last disk based recovery point at the time the tape backup. If you retry a tape backup, we actually mount the shadow copy for the disk
based backup from the day the tape backup was initiated. If that disk based backup is already a day old, then we cannot write it on the same tape because it would exceed the Write Period. You would need to change the write period so
the retried backup jobs could co-locate - then change it back in time for the new round of backups.
August 26th, 2014 2:18am
Thanks for clarifying that behavior. I just tested with a longer write period and confirm that invoking "Resume tape backups..." on multiple PGs in the same PG set results in them being collocated.
I don't understand the philosophy behind this behavior though. Isn't the point of "Write Period" to limit collocation of shadow copies to those that
were created during a limited period of time? That way you end up with all of the shadow copies from say, last weekend, on a single tape. Instead, DPM seems to be limiting collocation of shadow copies to those that
can be successfully written to tape during a limited period of time. That behavior basically guarantees tape count amplification as soon as a tape backup job fails for any reason. Am I missing something here?
Alex
August 26th, 2014 8:01pm
Hi Mike,
I'm still finding that DPM often mysteriously splits backups across multiple tapes and immediately marks those tapes as "Offsite Ready". Here is what I have confirmed:
- "BufferQueueSize" has value "3" at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent
- The tape drive is enumerated at SCSI\SEQUENTIAL&VEN_IBM&PROD_ULT3580-HH6\1&1C121344&0&000100
- "BusyRetryCount" has value "250" at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\SCSI\SEQUENTIAL&VEN_IBM&PROD_ULT3580-HH6\1&1C121344&0&000100\Device Parameters\StorPort
- All PGs are members of the same PGSet with a Write Period of 6 days and "Don't allow backups of different retention periods to co-locate on same tape" disabled.
- All PGs have identical TapeBackupOptions (I have compared them using Get-DPMTapeBackupOption
- All PGs have identical weekly and monthly schedules (I have compared them using Get-DPMPolicySchedule)
- Find /i "0x8007044C" *.errlog reveals no errors
- Find /i "0x8007045D" *.errlog reveals errors from September but no errors since then.
Is there any other place I should be looking for errors that DPM might be interpreting as "End of Tape"?
Are there reasons other than errant "End of Tape" errors that could cause DPM to abandon writing a tape part-way?
Thanks for your help,
Alex
February 3rd, 2015 1:46am
Hi,
You can look in the DPMRA logs for the code 0x8007044C which is "The physical end of the tape has been reached." See if they match times when tapes are spanned prematurely or marked offsite ready.
February 3rd, 2015 5:24am
Hi Mike,
There are no 0x8007044C errors. I ran the following search:
cd "C:\Program Files\Microsoft System Center 2012 R2\DPM\DPM\Temp"
Find /i "0x8007044C" *.errlog > c:\users\un1a\desktop\0x8007044C.log
The resulting 0x8007044C.log file shows no error lines, only the names of each *.errlog file:
---------- DPMRA0.ERRLOG
...
...
---------- DPMRACURR.ERRLOG
Is there any other place to look for clues to why DPM isn't filling tapes?
Alex
February 3rd, 2015 6:36pm
Hi,
Assuming you are making daily disk based RP's - I see no reason why DPM would not be able to fill tapes when it comes time to doing the weekly tape based backup. Would probably need to enable verbose logging during next scheduled tape backup time and
see why DPM is selecting new tapes instead of appending. Please open a support case to help track this down.
February 4th, 2015 12:43am
Hi Mike,
I opened a case with Microsoft Support Services. It is SR number
115020312355639.
Thanks for your help,
Alex
February 4th, 2015 2:49am
Hi Alex,
Have you been able to resolve this?
I'm having the exact same problem as yourself (except I'm using LTO5 tapes) and have tried all the steps listed here and on that link as well.
What would normally take 4 tapes to complete now takes 7-8 tapes .
Using DPM 2012 R2, Windows Server 2012 R2 and IBM 3573 Tape Library.
I'm desperately looking for a solution to this as I keep running out of backup tapes and having to purchase more and more.
Please let me know if you find a solution.
Thanks,
Dallas
February 6th, 2015 12:43pm
Hi Dallas,
I am in the same boat: What should fit on one tape suddenly spread two, then three, then four tapes. I've bought more tapes, but I'm afraid that is just a stop-gap, since it seems that the more workloads I add, the more tapes DPM spreads the backups
across.
My Microsoft Support Services case has been open for about four days. I do not have any progress to report. I will report back here if there is progress I can share.
Alex
twitter: @alx9r
February 6th, 2015 6:11pm
Hi,
Please make sure each protection group is part of PGSet and that the write period and expiry tolerance are high enough - you can set high then reduce once you confirm those settings are related to the problem.
February 6th, 2015 7:30pm
Hi Mike,
My Write period and Expiry Tolerance are 6 days and 7 days, respectively. Based on the mechanism described in your blog post and here, those values should be more than adequate to permit a tape to fill. Are you suggesting that they should be
set even higher? If so, what values should I set them to?
Alex
February 6th, 2015 7:55pm
Hi,
The co-location settings in this
BLOG does not really apply to DPM 2012 and later because the PGSet feature was introduced and replaced the mechanisms described in the blog. As a troubleshooting step, yes I would increase those values and see if it makes a difference.
February 6th, 2015 8:03pm
OK I have changed both Write Period and Expiry Tolerance to 12 months.
Alex
February 6th, 2015 8:23pm
Hi Mike,
Here's the latest:
- 2015-02-06: changed both Write Period and Expiry Tolerance to 12 months
- 2015-02-07: enabled verbose logging by setting TraceLogLevel to 0x43e as per
this technet article
- 2015-02-08 0300: weekly tape backup began with verbose logging
- 2015-02-08 0926: weekly tape backup completes
having used only a single tape
Alex
February 9th, 2015 7:34pm
Hi Dallas,
Take a look at my latest post below -- I made some changes to my DPM config (increased WritePeriod and ExpiryTolerance, enabled verbose logging) and it collocated everything on one tape for our weekly. You might try doing the same thing to see if it
makes a difference. Not exactly a long-term solution, but it might take some load off your tape tape budget if it works.
Alex
February 9th, 2015 7:42pm
Hi Alex,
Thanks for that, I'll give it a go!
I'm glad it's worked for you. Fingers crossed it works for me too!
Dallas
February 12th, 2015 6:57am
Hi Alex,
I took a look at your verbose logs, unfortunately we don't log the decisions that DPM makes when determining what tape to select for the next backup. There are a lot of decisions based on date of the disk based recovery point that is about to be written
to tape, what the PGset settings are, the expiry of the data set that is about to be written, and what is already written to the tape. So without seeing these decision points in the log, I cannot specifically tell you why your original PGset settings
were problematic.
February 12th, 2015 5:28pm
Hi Mike,
Thanks for taking a look at the logs. I just wanted to point out that there were actually two changes made just before collocation succeeded:
- Write Period and Expiry Tolerance were increase
- Full Verbose logging was enabled
This technet article states that Full Verbose logging affects performance. If the problem with collocation is caused by a race condition then it is
possible that enabling logging actually caused collocation to succeed. At this point there doesnt seem to be any evidence pointing to which of the two changes resulted in collocation succeeding.
I propose I run the following tests to differentiate the two:
Test #1 - Collocation Confirmation
Repeat last weekend's test to confirm that successful collocation wasn't an anomaly. Configuration:
- full verbose logging: Enabled
- Expiry Tolerance: 1 year
- Write Period: 1 year
Test #2 - Logging Only
Test to see whether enabling Full Verbose logging enabled collocation to succeed. Configuration:
- full verbose logging: Enabled
- Expiry Tolerance: 7 days
- Write Period: 6 day
Thanks for your help,
Alex
February 12th, 2015 6:05pm
Hi,
I would like to propose that you just disable verbose logging since as I mentioned earlier, the logs do not contain the desired information and yes, verbose logging will effect performance. That way the only change made were to the expiry tolerance
and write period.
February 12th, 2015 6:12pm
Hi Mike,
Thanks for the guidance. Based on it, here is the test configuration I'll be running on 2015-02-15:
Test #3 - Long Expiry and Write Period, No Verbose Logging
Configuration:
- full verbose logging: Disabled
- Expiry Tolerance: 1 year
- Write Period: 1 year
I'll post the outcome as soon as I have it.
Thanks for your help,
Alex
February 14th, 2015 12:01am
Hi Mike,
It looks like my installation is affected by the
console crash problem after the UR5 update. I was able to set your recommended configuration before the weekly tape backups, and I suspect the test ran, however, it'll be a couple days before I have a chance to dig into the fix and regain console access
to my DPM server. I'm hoping there is a Microsoft-supported fix to the UR5 console crash problem that emerges by then.
Alex
-
Edited by
alx9r
Monday, February 16, 2015 5:53 PM
February 16th, 2015 8:52pm
Hi Mike,
I resolved that console crash problem. It looks like the last test indeed completed with the following configuration:
- full verbose logging: Disabled
- Expiry Tolerance: 12 Months
- Write Period: 12 Months
Result: Collocation succeeded.
This indeed points to the Expiry Tolerance and Write Period settings being at fault. Please advise what test I should run next.
Alex
February 20th, 2015 4:22am
The February 22nd weeklies backups ran with following configuration:
- full verbose logging: Disabled
- Expiry Tolerance: 1 Month
- Write Period: 1 Month
Result: Collocation Succeeded.
February 26th, 2015 12:38am
The March 1st weekly backups ran with the following configuration:
- full verbose logging: Disabled
- Expiry Tolerance: 2 weeks
- Write Period: 2 weeks
Result: Collocation Succeeded
March 2nd, 2015 4:58pm
In case this helps someone else, here is what I learned at the end of all this. There is one factor that can apparently prevent co-location that I have not seen documented anywhere yet: Data sources that have not been synchronized in a while.
Here's how I understand how this works:
For DPM to protect a data source, it must first "sync" that data source with DPM's own copy of that data source on disk. DPM's tape backups are always disk-to-disk-to-tape, not disk-to-tape. When a tape backup of a data source occurs,
DPM uses the latest sync'ed copy of the data source*. If the latest copy of one data source is older enough than another data source, then colocation is prevented because the Expiry Tolerance prevents the two data sources from being eligible to be put
on the same tape.
Exactly how the expiry tolerance logic works is being applied in any particular situation is difficult to determine because it does not seem to turn up in any of the logs. Increasing the Write Period and the Expiry Tolerance until colocation
succeeds, and then reducing them individually in small increments might be the best way to arrive at good values for your situation.
Alex
* I think this might actually be "the latest copy of the data source prior to the scheduled tape backup."
March 10th, 2015 6:58pm