Secondary DPM synchronizing large volumes of data

Hello. We have primary and secondary DPM servers (DPM 2012 SP1). We protect a domain-member file server via file protection. This works fine on the primary: synchronization occurring, recovery points being created, etc.

However, on the secondary, the synchronization and recovery point creation jobs are frequently failing. I suspect this is because, for some reason, it's not synchronizing only the changed data from the primary but always the whole thing. We're talking about two disks with roughly 300 GB and 260 GB of data respectively. The daily data change is usually 500 MB-5 GB, as seen from the synchronization jobs on the primary DPM. However, the jobs on the secondary copy hundreds of gigabytes of data every day and ultimately fail, because the primary and secondary DPM servers are connected over a VPN with limited bandwidth that isn't used exclusively for DPM, so there are bandwidth restrictions set on the secondary DPM for the primary DPM agent.

Why is the secondary always moving those big chunks of data instead of the small changes that it's supposed to copy?

I've tried removing the protection group for this file system on both the primary and the secondary, deleting the backups, and reconfiguring the file protection from scratch; however, after the initial synchronization, the behavior described above returns.

Thanks in advance

February 2nd, 2015 4:23am

Hi,

I don't think this is the problem, but it's worth investigating. See if any of the files on the replica volume for that data source have the sparse file attribute set. DPM will always do a full sync for files that are marked sparse; however, I'm not sure why only secondary protection would be affected.

1) Download the psexec.exe utility from www.sysinternals.com and save it on the DPM server.
2) Copy the script below into Notepad and save it as C:\temp\checksparse.ps1

# ---- Start here ----
# Create a list of all files in the current folder and all its subfolders
$filelist = dir * -Recurse | ? { -not $_.PSIsContainer }

# Loop over every file found in the step above
foreach ($file in $filelist)
{
    # Check whether the file attributes equal 0x220, which means sparse
    if ((fsutil usn readdata $file.FullName | Select-String 'File Attributes  : 0x220'))
    {
        # List the file (path + filename) whose sparse attribute is set (0x220)
        $file.FullName
    }
}
# ---- End here ----

3) Get the path to the replica volume from the details of the data source.
4) Open an administrative command prompt and run:  Psexec.exe -s -I powershell.exe
5) In the new PowerShell window, CD to the root of the replica volume.
6) Run c:\temp\checksparse.ps1 and see whether it lists any sparse files (see the example session below).
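
For steps 5 and 6, the session should look roughly like this. The path below is only a placeholder - substitute the replica volume mount path you noted in step 3 (it normally sits under the DPM install folder, in DPM\Volumes\Replica):

# Inside the SYSTEM PowerShell window that psexec opened in step 4
# Change to the root of the replica volume (placeholder - use the path from step 3)
cd "<replica volume mount path from step 3>"

# Run the sparse-file check; every path it prints is a sparse file on the replica
C:\temp\checksparse.ps1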

February 2nd, 2015 9:02pm

Thanks for the reply, Mike; however, your script isn't working.

First, I had to use a lowercase "-i" parameter with the psexec command to actually be able to start and see the PowerShell window.

Anyway, the problem is that dir (Get-ChildItem) can't handle paths longer than 248 (260) characters. Could you please modify your script to work around that issue?

February 3rd, 2015 4:57am

Hi,

Yeah - I guess psexec.exe is case sensitive. The long path is a limitation I can't fix in the script. Were any sparse files found?

February 3rd, 2015 4:28pm

I've sort of worked around the path issue by going directly to the replica volume path and running the script from there. It hasn't found any sparse files.

Any other ideas? The synchronization job for one of the volumes has now been running for over 61 hours and shows almost 800 GB of data transferred (the volume is 500 GB with 244 GB of free space). Here are the details of the synchronization job. With the jobs failing and lasting this long, the last usable recovery point for this volume on the secondary DPM is now a week old.

Type: Synchronization
Status: In progress
End time: -
Start time: 1. 2. 2015 22:22:33
Time elapsed: 61:02:52
Data transferred: 798 280,13 MB
Cluster node: -
Source details: E:\(fileserver.domain.local)
Protection group: File system secondary

February 4th, 2015 5:31am

Hi,

This is a real head scratcher - I cannot see any way that DPM synchronization would transfer more data than what is being protected.  If you are not already on DPM 2012 SP1 UR8 - please install that on both primary and secondary DPM servers and see if that helps.

February 4th, 2015 11:43am

Mike, they're both already on UR8 (build number 4.1.3465.0).

Couldn't the reason for the transferred data being larger than the volume (or the actual data size) be that the data on the primary DPM is changing (the daily synchronizations continue normally there) and the secondary is "confused" and possibly can't keep up? (That would be strange, though, given how small the changes are (<5 GB).)

Any other ideas? Should we open a ticket with MSFT support?

Just to make sure I'm not misunderstanding something (talking about file system protection): the primary synchronizes changes from the target file server (let's say 5 GB); then, whenever the secondary does a synchronization, it should sync with the primary and transfer only those 5 GB of changes, and the secondary doesn't touch the target file server at all (since the only agent it talks to is the primary), right?

February 4th, 2015 5:17pm

Hi,

Yes, your understanding is correct. The DPMFilter tracks block-level changes for the primary DPM replica volumes that are under secondary protection. When it's time for a sync, the DPM agent on the primary reads the filter bitmap file, determines which files and block offsets have changed since the last successful sync, and transfers only those blocks. It should be a very fast and efficient way to get the secondary replica in sync with the primary. For some reason, something seems to be going haywire in that block-tracking mechanism.
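
Just to illustrate the idea (this is not DPM's actual bitmap format, and the 64 KB block size is only an assumption for the example), the arithmetic of changed-block tracking works out like this:

# Illustration only - not DPM's real bitmap, and 64 KB is an assumed tracking granularity
$blockSize     = 64KB
$protectedData = 300GB      # roughly one of the protected volumes in this thread
$dailyChurn    = 5GB        # the daily change reported by the primary's sync jobs

$totalBlocks   = $protectedData / $blockSize
$changedBlocks = $dailyChurn / $blockSize

"Blocks tracked on the replica : {0:N0}" -f $totalBlocks
"Blocks flagged as changed     : {0:N0}" -f $changedBlocks
"Expected sync transfer        : {0:N0} MB" -f ($changedBlocks * $blockSize / 1MB)

So with ~5 GB of daily churn, the secondary sync should be moving on the order of 5 GB per day, not hundreds of gigabytes.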

February 4th, 2015 6:22pm

So, is there anything else we can try before calling your support? Will they be able to help with this problem if we open a ticket?
February 5th, 2015 8:41am

Hi,

As a test, you can bypass the DPMFilter and see if the amount of data transferred is more in line with expectations.  Just do it as a test, not as a solution.

On the primary DPM server, perform these steps from an administrative command prompt when there are no active jobs.

- Stop the DPMRA:

     NET STOP DPMRA

- Set the ForceFixup registry key:

     reg add "HKLM\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent\2.0" /v ForceFixup /t REG_DWORD /d 1

Let a few normal sync jobs run on the secondary and monitor results, then remove the above value.
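
If PowerShell is more convenient, the same test (and the cleanup afterwards) can be sketched like this - the key path and value name are taken from the reg command above, and DPMRA is the service the NET STOP command targets:

# Run on the primary DPM server in an elevated PowerShell prompt, with no active jobs
$agentKey = 'HKLM:\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent\2.0'

# Stop the DPM replication agent service
Stop-Service -Name DPMRA

# Create (or overwrite) the ForceFixup value, same as the reg add command above
New-ItemProperty -Path $agentKey -Name ForceFixup -PropertyType DWord -Value 1 -Force | Out-Null

# ...after a few secondary sync jobs have run, remove the test value again
Remove-ItemProperty -Path $agentKey -Name ForceFixup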

February 5th, 2015 11:12am

I've added that registry key on the primary DPM, restarted the DPM agent (on the primary) and then monitored the jobs on the secondary for a couple of days, but nothing seems to have changed. Synchronization jobs are still queued, because a job that is now 2 days old is still running and transferring a large amount of data (over 400 GB for one of the volumes).

Overall, the synchronization jobs mostly fail at some point, and as a result there are 5-8 day gaps between successful synchronizations (and recovery points).

February 12th, 2015 5:09pm
