Why is DPM 2012 R2 splitting this Co-located Protection Group Set across three different tapes?

I have 10 protection groups that ran Saturday night.  I have put information about them in the following table:

I extracted the data in the table using Get-DPMProtectionGroup CmdLet and the job information from the Monitoring page of the DPM admin console.  As you can see from the tape barcode, DPM wrote the first four PGs to one tape, then started a second tape, and finally failed when starting a third because it had run out of free tapes.

All of the protection groups are assigned to the same "Co-Located Protection Group Set" in the "Tape Optimization Setup" dialog box.

Prior to the jobs running, I painstakingly ensured that each of the Protection Group's "Select Long-Term Goals" and "Select Library and Tape Details" screens are identical as follows:

Prior to running the jobs I had already disabled autoclean on the tape library.

I have checked to see if the tape drive is reporting IO error 0x8007045D using the method described in this blog post: http://blogs.technet.com/b/dpm/archive/2012/05/14/things-you-can-do-to-help-data-protection-manager-utilize-your-tapes-full-capacity.aspx

There seems to be no mention of 0x8007045D or -2147023779 in any of the MSDPM*.Errlog files.

These are LTO6 tapes with a capacity of 2.5 TB each.  The total data for all the jobs put together is less than 300GB.

Why is DPM 2012 R2 splitting these protection groups across three different tapes? 

What else should I check for that might cause DPM to start a new tape?


  • Edited by alx9r Monday, August 18, 2014 8:53 PM
August 18th, 2014 11:52pm

Hi,

Did you check the DPMRA*.Errlog files ?  Did you add the BufferQueueSize and BusyRetryCount registry entries ?

Free Windows Admin Tool Kit Click here and download it now
August 19th, 2014 3:57am

Hi Mike,

Thanks for the quick response.  I did check the DPMRA*.Errlog files for "0x8007045D" and "-2147023779".  There are no instances of those strings in the DPMRA*.Errlog files. Your blog post indicates adding BufferQueueSize and BusyRetryCount as potential cures for IO_DEVICE_ERROR 0x8007045D.  Since IO_DEVICE_ERROR 0x8007045D is not mentioned in any logs, I haven't added BufferQueueSize or BusyRetryCount. 

Should I add the BufferQueueSize and BusyRetryCount anyway?  If so, how do I know if it's helping if there were no "0x8007045D" errors in the first place?

Is there something else I should be looking for in the logs that would indicate why DPM switched tapes?

Thanks for your help,

Alex

August 19th, 2014 7:53pm

Hi,

The registry entries are preventative in nature so they will not cause any harm.

You can see if the tape drive is reporting end of media by performing this search.

 Find /i "0x8007044C" dpmra*.errlog

Free Windows Admin Tool Kit Click here and download it now
August 19th, 2014 8:09pm

Hi Mike,

I must have made a mistake earlier.  I retraced my steps and searched again for "0x8007044C" and "0x8007045D" in each of DPMRA*.errlog and MSDPM*.errlog.  There are 46kB worth of lines containing "0x8007045D" in the DPMRA*.errlog.  The other three searches yielded no results.  The "0x8007045D" errors all seem to look something like this:

0C84 1804 08/17 06:11:15.676 22 genericthreadpool.cpp(537) [0000000000D87210]  WARNING Failed: Hr: = [0x8007045d] Thread 6148 received ERROR

0C84 1C9C 08/17 06:11:15.676 18 bufferedmediareader.cpp(782) [0000000000DA0A30]  WARNING Failed: Hr: = [0x8007045d] CBufferedMediaReader - Received unknown error. number of bytes - 0, index - 5.

The strange thing is that all of the occurrences are from either 08/12 or after 06:00 on 08/17 which is after all the backup jobs had finished.  None of the "0x8007045D" timestamps seem to match up with 08/17 00:15 or 08/17 02:38 which are the times when DPM switched to new tapes.

I have added BufferQueueSize and BusyRetryCount to the registry.  The co-located protection groups in question are scheduled to backup to tape on Saturday night.  Is there any way to run Saturday's backup jobs early without manually changing the schedule of all of the protection groups?

Alex


  • Edited by alx9r Tuesday, August 19, 2014 7:01 PM
August 19th, 2014 9:52pm

Hi,

All the times logged in the DPM Errlog files are in GMT time - so based on local time zone, you need to subtract from the logged time.   Lets see if the registry entries worked, they should unless you have a hardware problem.

Free Windows Admin Tool Kit Click here and download it now
August 19th, 2014 10:48pm

Hi Mike,

Can you confirm the correct path to the BusyRetryCount registry entry?  Your blog post mentions three different spellings of the final level of the hierarchy:

  • "Storport"
  • "StorPort"
  • "StorpPort"

I'm still seeing 0x8007045d errors.  I'd like to ensure the registry entry I created is being picked up.  Is there a way to see that happening in the logs?

Alex

August 25th, 2014 7:03pm

Hi

Gosh - good catch - I don't believe the correct entry is case sensitive - however the correct spelling is StorPort.

The DPMRA logs are only going to detail that the tape drive reported an IO error.  The BufferQueueSize is also important in eliminating the errors.  If you have multiple tape drives, you can try disabling all but one at a time and see if it's just a particular drive, or if it's a bus problem.

Free Windows Admin Tool Kit Click here and download it now
August 25th, 2014 8:26pm

Hi Mike,

I added another free tape and invoked "Resume Tape Backups..." for each of the failed protection groups.  This time there were no 0x8007045d errors.  Strangely, DPM completed backup for exactly one PG, produced the warning "Job waiting for tape" and, most bizarrely, marked the new tape as "offsite ready". 

All of the resumed jobs should have been co-located onto a single tape.  I have a "Write period" of 1 day for this PG set. Is DPM refusing to co-locate PGs and marking this new tape as "offsite ready" because it is more than 1 day after the scheduled back up time?  If so, how do you ever get resumed backup jobs to co-locate?

Alex

August 26th, 2014 12:44am

Hi,

Yes, most likely that is what happened.  For D2D2T - the recovery point creation time is that of the last disk based recovery point at the time the tape backup.  If you retry a tape backup, we actually mount the shadow copy for the disk based backup from the day the tape backup was initiated.   If that disk based backup is already a day old, then we cannot write it on the same tape because it would exceed the Write Period.   You would need to change the write period so the retried backup jobs could co-locate - then change it back in time for the new round of backups.

Free Windows Admin Tool Kit Click here and download it now
August 26th, 2014 2:18am

Thanks for clarifying that behavior.  I just tested with a longer write period and confirm that invoking "Resume tape backups..." on multiple PGs in the same PG set results in them being collocated.

I don't understand the philosophy behind this behavior though.  Isn't the point of "Write Period" to limit collocation of shadow copies to those that were created during a limited period of time?  That way you end up with all of the shadow copies from say, last weekend, on a single tape.  Instead, DPM seems to be limiting collocation of shadow copies to those that can be successfully written to tape during a limited period of time.  That behavior basically guarantees tape count amplification as soon as a tape backup job fails for any reason.  Am I missing something here?

Alex

August 26th, 2014 8:01pm

Hi Mike,

I'm still finding that DPM often mysteriously splits backups across multiple tapes and immediately marks those tapes as "Offsite Ready".  Here is what I have confirmed:

  • "BufferQueueSize" has value "3" at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent
  • The tape drive is enumerated at SCSI\SEQUENTIAL&VEN_IBM&PROD_ULT3580-HH6\1&1C121344&0&000100
  • "BusyRetryCount" has value "250" at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\SCSI\SEQUENTIAL&VEN_IBM&PROD_ULT3580-HH6\1&1C121344&0&000100\Device Parameters\StorPort
  • All PGs are members of the same PGSet with a Write Period of 6 days and "Don't allow backups of different retention periods to co-locate on same tape" disabled. 
  • All PGs have identical TapeBackupOptions (I have compared them using Get-DPMTapeBackupOption
  • All PGs have identical weekly and monthly schedules (I have compared them using Get-DPMPolicySchedule)
  • Find /i "0x8007044C" *.errlog reveals no errors
  • Find /i "0x8007045D" *.errlog reveals errors from September but no errors since then.

Is there any other place I should be looking for errors that DPM might be interpreting as "End of Tape"?

Are there reasons other than errant "End of Tape" errors that could cause DPM to abandon writing a tape part-way?

Thanks for your help,

Alex

Free Windows Admin Tool Kit Click here and download it now
February 3rd, 2015 1:46am

Hi,

You can look in the DPMRA logs for the code 0x8007044C which is  "The physical end of the tape has been reached."  See if they match times when tapes are spanned prematurely or marked offsite ready.

February 3rd, 2015 5:24am

Hi Mike,

There are no 0x8007044C errors.  I ran the following search:

cd "C:\Program Files\Microsoft System Center 2012 R2\DPM\DPM\Temp"
Find /i "0x8007044C" *.errlog > c:\users\un1a\desktop\0x8007044C.log

The resulting 0x8007044C.log file shows no error lines, only the names of each *.errlog file:

---------- DPMRA0.ERRLOG
...
...
---------- DPMRACURR.ERRLOG

Is there any other place to look for clues to why DPM isn't filling tapes?

Alex

Free Windows Admin Tool Kit Click here and download it now
February 3rd, 2015 6:36pm

Hi,

Assuming you are making daily disk based RP's - I see no reason why DPM would not be able to fill tapes when it comes time to doing the weekly tape based backup.  Would probably need to enable verbose logging during next scheduled tape backup time and see why DPM is selecting new tapes instead of appending.   Please open a support case to help track this down.

February 4th, 2015 12:43am

Hi Mike,

I opened a case with Microsoft Support Services. It is SR number 115020312355639.

Thanks for your help,

Alex

Free Windows Admin Tool Kit Click here and download it now
February 4th, 2015 2:49am

Hi Alex,

Have you been able to resolve this?

I'm having the exact same problem as yourself (except I'm using LTO5 tapes) and have tried all the steps listed here and on that link as well.

What would normally take 4 tapes to complete now takes 7-8 tapes .

Using DPM 2012 R2, Windows Server 2012 R2 and IBM 3573 Tape Library.

I'm desperately looking for a solution to this as I keep running out of backup tapes and having to purchase more and more.

Please let me know if you find a solution.

Thanks,

Dallas

February 6th, 2015 12:43pm

Hi Dallas,

I am in the same boat: What should fit on one tape suddenly spread two, then three, then four tapes.  I've bought more tapes, but I'm afraid that is just a stop-gap, since it seems that the more workloads I add, the more tapes DPM spreads the backups across. 

My Microsoft Support Services case has been open for about four days. I do not have any progress to report. I will report back here if there is progress I can share.

Alex

twitter: @alx9r

Free Windows Admin Tool Kit Click here and download it now
February 6th, 2015 6:11pm

Hi,

Please make sure each protection group is part of PGSet and that the write period and expiry tolerance are high enough - you can set high then reduce once you confirm those settings are related to the problem.

February 6th, 2015 7:30pm

Hi Mike,

My Write period and Expiry Tolerance are 6 days and 7 days, respectively.  Based on the mechanism described in your blog post and here, those values should be more than adequate to permit a tape to fill.  Are you suggesting that they should be set even higher?  If so, what values should I set them to?

Alex

Free Windows Admin Tool Kit Click here and download it now
February 6th, 2015 7:55pm

Hi,

The co-location settings in this BLOG does not really apply to DPM 2012 and later because the PGSet feature was introduced and replaced the mechanisms described in the blog.  As a troubleshooting step, yes I would increase those values and see if it makes a difference.

February 6th, 2015 8:03pm

OK I have changed both Write Period and Expiry Tolerance to 12 months.

Alex

Free Windows Admin Tool Kit Click here and download it now
February 6th, 2015 8:23pm

Hi Mike,

Here's the latest:

  • 2015-02-06: changed both Write Period and Expiry Tolerance to 12 months
  • 2015-02-07: enabled verbose logging by setting TraceLogLevel to 0x43e as per this technet article
  • 2015-02-08 0300: weekly tape backup began with verbose logging
  • 2015-02-08 0926: weekly tape backup completes having used only a single tape

Alex

February 9th, 2015 7:34pm

Hi Dallas,

Take a look at my latest post below -- I made some changes to my DPM config (increased WritePeriod and ExpiryTolerance, enabled verbose logging) and it collocated everything on one tape for our weekly. You might try doing the same thing to see if it makes a difference. Not exactly a long-term solution, but it might take some load off your tape tape budget if it works.

Alex

Free Windows Admin Tool Kit Click here and download it now
February 9th, 2015 7:42pm

Hi Alex,

Thanks for that, I'll give it a go!

I'm glad it's worked for you. Fingers crossed it works for me too!

Dallas


February 12th, 2015 6:57am

Hi Alex,

I took a look at your verbose logs, unfortunately we don't log the decisions that DPM makes when determining what tape to select for the next backup.  There are a lot of decisions based on date of the disk based recovery point that is about to be written to tape, what the PGset settings are, the expiry of the data set that is about to be written, and what is already written to the tape.  So without seeing these decision points in the log, I cannot specifically tell you why your original PGset settings were problematic. 

Free Windows Admin Tool Kit Click here and download it now
February 12th, 2015 5:28pm

Hi Mike,

Thanks for taking a look at the logs.  I just wanted to point out that there were actually two changes made just before collocation succeeded:

  1. Write Period and Expiry Tolerance were increase
  2. Full Verbose logging was enabled

This technet article states that Full Verbose logging affects performance.  If the problem with collocation is caused by a race condition then it is possible that enabling logging actually caused collocation to succeed. At this point there doesnt seem to be any evidence pointing to which of the two changes resulted in collocation succeeding.

I propose I run the following tests to differentiate the two:

Test #1 - Collocation Confirmation

Repeat last weekend's test to confirm that successful collocation wasn't an anomaly.  Configuration:

  • full verbose logging: Enabled
  • Expiry Tolerance: 1 year
  • Write Period: 1 year

Test #2 - Logging Only

Test to see whether enabling Full Verbose logging enabled collocation to succeed.  Configuration:

  • full verbose logging: Enabled
  • Expiry Tolerance: 7 days
  • Write Period: 6 day

Thanks for your help,

Alex

February 12th, 2015 6:05pm

Hi,

I would like to propose that you just disable verbose logging since as I mentioned earlier, the logs do not contain the desired information and yes, verbose logging will effect performance.  That way the only change made were to the expiry tolerance and write period.

Free Windows Admin Tool Kit Click here and download it now
February 12th, 2015 6:12pm

Hi Mike,

Thanks for the guidance.  Based on it, here is the test configuration I'll be running on 2015-02-15:

Test #3 - Long Expiry and Write Period, No Verbose Logging

Configuration:

  • full verbose logging: Disabled
  • Expiry Tolerance: 1 year
  • Write Period: 1 year

I'll post the outcome as soon as I have it.

Thanks for your help,

Alex

February 14th, 2015 12:01am

Hi Mike,

It looks like my installation is affected by the console crash problem after the UR5 update. I was able to set your recommended configuration before the weekly tape backups, and I suspect the test ran, however, it'll be a couple days before I have a chance to dig into the fix and regain console access to my DPM server.  I'm hoping there is a Microsoft-supported fix to the UR5 console crash problem that emerges by then.

Alex


  • Edited by alx9r Monday, February 16, 2015 5:53 PM
Free Windows Admin Tool Kit Click here and download it now
February 16th, 2015 8:52pm

Hi Mike,

I resolved that console crash problem.  It looks like the last test indeed completed with the following configuration:

  • full verbose logging: Disabled
  • Expiry Tolerance: 12 Months
  • Write Period: 12 Months

Result: Collocation succeeded.

This indeed points to the Expiry Tolerance and Write Period settings being at fault.  Please advise what test I should run next.

Alex

February 20th, 2015 4:22am

The February 22nd weeklies backups ran with following configuration:

  • full verbose logging: Disabled
  • Expiry Tolerance: 1 Month
  • Write Period: 1 Month

Result: Collocation Succeeded.

Free Windows Admin Tool Kit Click here and download it now
February 26th, 2015 12:38am

The March 1st weekly backups ran with the following configuration:

  • full verbose logging: Disabled
  • Expiry Tolerance: 2 weeks
  • Write Period: 2 weeks

Result: Collocation Succeeded

March 2nd, 2015 4:58pm

In case this helps someone else, here is what I learned at the end of all this.  There is one factor that can apparently prevent co-location that I have not seen documented anywhere yet: Data sources that have not been synchronized in a while.

Here's how I understand how this works:

For DPM to protect a data source, it must first "sync" that data source with DPM's own copy of that data source on disk.  DPM's tape backups are always disk-to-disk-to-tape, not disk-to-tape.  When a tape backup of a data source occurs, DPM uses the latest sync'ed copy of the data source*.  If the latest copy of one data source is older enough than another data source, then colocation is prevented because the Expiry Tolerance prevents the two data sources from being eligible to be put on the same tape. 

Exactly how the expiry tolerance logic works is being applied in any particular situation is difficult to determine because it does not seem to turn up in any of the logs.  Increasing the Write Period and the Expiry Tolerance until colocation succeeds, and then reducing them individually in small increments might be the best way to arrive at good values for your situation.

Alex

* I think this might actually be "the latest copy of the data source prior to the scheduled tape backup."

Free Windows Admin Tool Kit Click here and download it now
March 10th, 2015 6:58pm

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics