Secondary DPM backup issues

We currently have two DPM servers: one acting as the primary (DPMSERVERNJ) and the other as the secondary (DPMSERVERSC).  On the primary, all backups run fine.  On the secondary, however, backups sometimes succeed and sometimes fail.  Every morning I am presented with the following errors.  As you can see, the job will run for several hours, then fail...

Type: Synchronization
Status: Failed
Description: The DPM service was unable to communicate with the protection agent on DPMSERVERNJ.munzingnj.munzingus.com. (ID 52 Details: The semaphore timeout period has expired (0x80070079))
More information
End time: 6/5/2013 2:06:04 AM
Start time: 6/5/2013 12:00:01 AM
Time elapsed: 02:06:02
Data transferred: 739.00 MB
Cluster node -
Source details: \Backup Using Child Partition Snapshot\Uranus Virtual Machine(Uranus Virtual Machine.MunzingCluster.munzingnj.munzingus.com)
Protection group: Disaster Recovery Group

Type: Synchronization
Status: Failed
Description: The DPM service terminated unexpectedly during completion of the job. The termination may have been caused by a system reboot. (ID 910)
More information
End time: 6/5/2013 10:10:53 AM
Start time: 6/4/2013 6:34:41 PM
Time elapsed: 15:36:11
Data transferred: 6,956.06 MB
Cluster node -
Source details: \Backup Using Child Partition Snapshot\Jupiter Virtual Machine(Jupiter Virtual Machine.MunzingCluster.munzingnj.munzingus.com)
Protection group: Disaster Recovery Group

Can somebody please point me in the right direction for troubleshooting an issue like this?  I will provide any information that is needed.  All help is greatly appreciated!

June 5th, 2013 5:57pm

Please disregard the second error message.  The first error message is the one of most concern.
June 5th, 2013 6:27pm

Hi,

Diagnostic steps when "Semaphore timeout" is hit during network transfer:

1. Check the event logs on both machines to see whether the protected server (sender) or DPM (receiver) was under stress or inaccessible at the time of the failure. A retry should work if the packet loss was caused by either server being inaccessible or under stress for a period.

2. Check whether the network between the protected server (PS) and DPM is flaky. The retransmit count from netstat -s or perfmon counters can give an idea.

3. If the network is expected to be flaky, allowing more TCP/IP retransmissions before the connection times out, as described in
http://support.microsoft.com/kb/170359, might help: increase TcpMaxDataRetransmissions to 10 or more.

4. Otherwise, contact a network support engineer to diagnose the packet loss. Netmon captures from both machines, the packet route, and the network layout/devices will be required to start the investigation.

5. Take some performance monitor logs on both DPM and Protected server side.
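
For step 2, one rough way to quantify flakiness is to take two netstat -s snapshots, one before and one after a sync window, and compare how many TCP segments were retransmitted against how many were sent. The Python sketch below assumes the Windows netstat -s layout, with "Segments Sent = N" and "Segments Retransmitted = N" lines, and reads only the first TCP section in the text (normally IPv4). The function names are made up for illustration.

```python
import re

# Counter labels as they appear in the TCP section of Windows `netstat -s` output.
COUNTER_NAMES = ("Segments Sent", "Segments Retransmitted")


def parse_tcp_counters(netstat_text):
    """Return the first 'Segments Sent' and 'Segments Retransmitted' values found."""
    counters = {}
    for name in COUNTER_NAMES:
        match = re.search(rf"{re.escape(name)}\s*=\s*(\d+)", netstat_text)
        if match is None:
            raise ValueError(f"counter not found in netstat output: {name}")
        counters[name] = int(match.group(1))
    return counters


def retransmit_percent(before, after):
    """Percentage of segments sent between two snapshots that were retransmissions."""
    sent = after["Segments Sent"] - before["Segments Sent"]
    retransmitted = after["Segments Retransmitted"] - before["Segments Retransmitted"]
    if sent <= 0:
        # Nothing was sent in the window (or the counters were reset).
        return 0.0
    return 100.0 * retransmitted / sent
```

Feed it the saved text of each snapshot (for example, from netstat -s > before.txt). What counts as too high depends on the link, so treat the number as a trend to compare between good and bad nights rather than a pass/fail value.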

Some good, basic perfmon counters for checking whether the servers are under stress are listed below.

Logical Disk/Physical Disk
******************
\% Idle Time
100% idle to 50% idle = Healthy
49% idle to 20% idle = Warning or Monitor
19% idle to 0% idle = Critical or Out of Spec
\Avg. Disk sec/Read or \Avg. Disk sec/Write (reported in seconds)
.001 to .015 (1 ms to 15 ms) = Healthy
.015 to .025 (15 ms to 25 ms) = Warning or Monitor
.026 (26 ms) or greater = Critical or Out of Spec
\Current Disk Queue Length (for all instances)
80 requests for more than 6 minutes = Possibly excessive disk queue length
Memory
*******
\Pool Nonpaged Bytes*
Less than 60% of pool consumed = Healthy
61% - 80% of pool consumed = Warning or Monitor
Greater than 80% pool consumed = Critical or Out of Spec
\Pool Paged Bytes*
Less than 60% of pool consumed = Healthy
61% - 80% of pool consumed = Warning or Monitor
Greater than 80% pool consumed = Critical or Out of Spec
\Available MBytes
50% of free memory available or more =Healthy
25% of free memory available = Monitor.
10% of free memory available = Warning
Less than 100MB or 5% of free memory available = Critical or Out of Spec.
Processor
*******
\% Processor Time (all instances)
Less than 60% consumed = Healthy
61% - 90% consumed = Monitor or Caution
91% - 100% consumed = Critical
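
If the perfmon logs are exported (for example, to CSV), the disk and processor thresholds above can be applied mechanically. The Python sketch below is only an illustration: the function names are hypothetical, the three labels are simplified to Healthy/Warning/Critical, and the latency check assumes the value is in seconds, which is how perfmon reports Avg. Disk sec/Read and Avg. Disk sec/Write.

```python
# Labels ordered from least to most severe.
SEVERITY = ["Healthy", "Warning", "Critical"]


def classify_disk_idle(percent_idle):
    """% Idle Time: 50 or more is healthy, 20 to 49 is a warning, below 20 is critical."""
    if percent_idle >= 50:
        return "Healthy"
    if percent_idle >= 20:
        return "Warning"
    return "Critical"


def classify_disk_latency(seconds):
    """Avg. Disk sec/Read or sec/Write, in seconds (0.015 s is 15 ms)."""
    if seconds <= 0.015:
        return "Healthy"
    if seconds <= 0.025:
        return "Warning"
    return "Critical"


def classify_cpu(percent_busy):
    """% Processor Time: below 60 is healthy, up to 90 is a warning, above 90 is critical."""
    if percent_busy < 60:
        return "Healthy"
    if percent_busy <= 90:
        return "Warning"
    return "Critical"


def worst(labels):
    """Return the most severe label in a sequence of classifications."""
    return max(labels, key=SEVERITY.index)
```

For example, worst(classify_disk_latency(s) for s in samples) gives a one-word summary of a whole capture, which makes it easier to line up bad intervals with the times of the failed synchronization jobs.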

June 5th, 2013 9:28pm

As already indicated, most likely it is a network issue.

However, before we implemented Veeam Backup and Replication, we used to have similar issues with DPM in our environment (a general error about the DPM service being unable to communicate with the protection agent). So maybe you'll find the following information useful.

At that time we couldn't find the reason for this behavior until we decided to check the account used by the DPM agent to communicate with untrusted domains. That turned out to be the issue.

In fact, the final resolution was quite easy. Below are the steps we took to get the agents back to a working state:

1. On Protected Server:

  • Open elevated command prompt
  • Go to: C:\Program Files\Microsoft Data Protection Manager\DPM\bin
  • Execute the following command:

SetDpmServer.exe -dpmServerName DPMSERVERNAME.DOMAINNAME.com -isNonDomainServer -userName dpmaccount

2. On DPM server:

  • Open DPM PowerShell. You will be here: PS C:\Program Files\Microsoft DPM\DPM\bin\
  • Run this command:

Attach-NonDomainServer.ps1

  • You will be asked to input the corresponding information (DPMServer;PSName;UserName;Password)

After that we were able to get our agent back to a working state.

Kind regards, Leonardo Muller

June 14th, 2013 3:05pm

I apologize for the late response.  During the last couple of days, all backups have been failing with the semaphore timeout error.  To be completely honest, I am fairly new to the DPM server world, so please bear with me.  We are desperately trying to resolve this issue.  The two servers are in different locations but are part of our domain.  I am not sure whether the above suggestions pertain to my issue.  However, I do believe this is a network issue.  I tried running netdom and noticed a significant number of dropped packets during the failed backup processes.  Can somebody please point me in the right direction and/or shed some light on this issue?
June 27th, 2013 3:28pm

This topic is archived. No further replies will be accepted.