Hyper-V 2012 replica giving timeouts (Network Steve Forum)

We have two Hyper-V clusters in two separate datacenters. The datacenters are connected by a redundant 100 Mbit/s link. Redundancy is achieved with spanning tree, and the two sites share the same private IP range. Each datacenter has two domain controllers, all in the same domain. We use Hyper-V Replica to replicate about 50 VMs to the secondary datacenter. The primary datacenter is our production environment and the secondary datacenter is our failover location, backup location and test environment. When everything was working and all the servers were being replicated, replication used about 10 - 15 Mbit/s of the WAN connection. We use HTTP for replication.

After a failure caused by a Hyper-V host rebooting (misconfigured automatic update settings), several VMs stopped replicating. Since that incident I have not been able to get replication working properly. I get timeout errors and other error messages indicating that there is no connection at that moment. Replication works, but not all the time. The replication status goes to warning when more than 20% of the replication cycles are missed. Eventually we even see a critical status on several VMs, after which we have to resume replication manually.

I have the feeling that this is caused by the limited bandwidth that is available. I had to do a completely new initial replication for some servers because they would no longer resume replication after a manual resume command. Even the resynchronization command did not work, so I had to stop the replication and start again. The problem is that the initial replication uses all the available bandwidth, and I suspect that there is not enough bandwidth left for the regular replication of the other VMs, which causes the timeouts. I am not sure about this theory, however. Is it possible that initial replications consume so much bandwidth that regular replications time out, or is there a mechanism that prevents this? Are there other possible causes of these problems?

Below are the error messages that I frequently get. I get these errors on VMs with a normal replication status as well as on VMs with a warning or critical status.

Hyper-V could not replicate changes for virtual machine 'XXXXX' because the Replica server refused the connection. This may be because there is a pending replication operation in the Replica server for the same virtual machine which is taking longer than expected or has an existing connection. (Virtual machine ID 33F83E0A-843A-4E83-9CD2-92EC7D3E3FEA)

ID: 32552

Hyper-V suspended replication for virtual machine 'VADC01' due to a non-recoverable failure. (Virtual Machine ID 33F83E0A-843A-4E83-9CD2-92EC7D3E3FEA). Resume replication after correcting the failure.

ID: 32086

Hyper-V could not replicate changes for virtual machine 'XXXXX': The device does not recognize the command. (0x80070016). (Virtual Machine ID 33F83E0A-843A-4E83-9CD2-92EC7D3E3FEA)

ID: 32022

Could not replicate changes for virtual machine 'XXXXX' as the Replica server 'cc-hv11.cc.lan' on port '80' is not reachable. The operation timed out (0x00002EE2). (Virtual Machine ID 308792FF-E1E8-4C15-930B-15506C4BF85D)

ID: 29292

Connection to the Replica server 'computer.domain.lan' timed out while waiting to receive a response for virtual machine XXXXX: The operation timed out(0x00002EE2). The total size of replication data being transferred is 65639 KByte(s). (Virtual Machine ID 5A99E295-7E42-4D9E-8814-9469151C7400)

ID: 29312
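
As a rough sanity check on the bandwidth theory, the transfer size reported in event 29312 can be converted into a minimum transfer time. This is only a sketch: it assumes 1 KByte = 1024 bytes, ignores protocol overhead and compression, and the 2 Mbit/s figure is an arbitrary example of what might be left over while an initial replication saturates the link. Get-TransferSeconds is a made-up helper name.

```powershell
# Hypothetical helper: seconds needed to move $SizeKB kilobytes at $Mbps megabits per second.
function Get-TransferSeconds {
    param(
        [Parameter(Mandatory = $true)][double]$SizeKB,
        [Parameter(Mandatory = $true)][double]$Mbps
    )
    $bits = $SizeKB * 1024 * 8          # assumes 1 KByte = 1024 bytes
    return $bits / ($Mbps * 1000000)    # 1 Mbit/s = 1,000,000 bit/s
}

Get-TransferSeconds -SizeKB 65639 -Mbps 100   # full 100 Mbit/s link: about 5.4 seconds
Get-TransferSeconds -SizeKB 65639 -Mbps 2     # 2 Mbit/s left over: about 269 seconds
```

If the connection to the Replica server times out before a transfer of this size can finish at the bandwidth actually available, errors like 29312 would be expected. That is consistent with the theory but does not prove it.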

September 19th, 2013 12:34pm

Hi,

Please disable any rules that block traffic between the IP addresses of your two Hyper-V servers, then monitor for the error again.
Heavy bandwidth use occurs when the VMs have not been synchronized for a long time. In that case you can instead copy the primary VM's disks to removable media and import them on the Hyper-V server at the backup site.
Please also enable compression in your replication settings.
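
For reference, the compression setting can also be checked and changed from PowerShell. This is a minimal sketch, assuming the Hyper-V PowerShell module on the primary host; 'VM01' is a placeholder VM name.

```powershell
# Show the current compression setting for every replicated VM on this host.
Get-VMReplication | Select-Object VMName, CompressionEnabled

# Enable compression for one VM ('VM01' is a placeholder).
Set-VMReplication -VMName 'VM01' -CompressionEnabled $true
```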

The related third party article:

Step-By-Step: Virtual Machine Replication Using Hyper-V Replica

http://blogs.technet.com/b/canitpro/archive/2013/04/08/step-by-step-virtual-machine-replication-using-hyper-v-replica.aspx

Hope this helps.

September 23rd, 2013 8:31am

Okay, I am a few steps further with this problem. Let me explain what I did so far. First I wrote a PS script:

Get-VMReplication * -computername host1, host2, host3, etc | Where {$_.state -ne "Replicating"} | Resume-VMReplication

I created a scheduled task that runs this script every 30 minutes. At least this makes sure that replication continues and that my backups are guaranteed. Because replication gets kicked every 30 minutes for the VMs that pause, I finally have all the machines replicating and there are no initial replications going on anymore. This also allows me to see what happens once this stage is reached. The bandwidth now being used is roughly 15 Mbit/s of the available 100 Mbit/s, with an occasional spike to about 80 Mbit/s when a lot of changes need to be replicated. This would indicate that a bandwidth limitation is no longer causing the problems. I needed to run initial replications before because the replicated VMs had fallen so far behind the production servers that an initial replication was required.
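
A slightly more defensive variant of the scheduled one-liner is sketched below. It is not the script the poster ran; it only separates the filter from the resume call and logs what was resumed. The host names, the log path and the Select-StalledReplication helper are placeholders, and the sketch assumes the Hyper-V PowerShell module is available where the task runs.

```powershell
# Hypothetical helper: keep only replication objects that are not in the 'Replicating' state.
function Select-StalledReplication {
    param([Parameter(ValueFromPipeline = $true)]$Replication)
    process {
        if ("$($Replication.State)" -ne 'Replicating') { $Replication }
    }
}

# Only talk to Hyper-V when the module is present.
if (Get-Command Get-VMReplication -ErrorAction SilentlyContinue) {
    $hvHosts = 'host1', 'host2', 'host3'            # placeholder host names
    $logFile = 'C:\Logs\ResumeReplication.log'      # placeholder path; the folder must already exist

    Get-VMReplication -ComputerName $hvHosts | Select-StalledReplication | ForEach-Object {
        # Record which VM is being resumed and what state it was in.
        "{0:s} resuming {1} (state: {2})" -f (Get-Date), $_.VMName, $_.State |
            Add-Content -Path $logFile
        $_ | Resume-VMReplication
    }
}
```

Keeping a log makes it possible to see afterwards how often, and for which VMs, replication actually stalls.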

Now that everything is running fine, I wonder what this will do to the error messages. If the bandwidth limitation was causing the problem, then I should no longer see error messages indicating communication timeouts. I have therefore reset all the replication statistics and will see tomorrow what the result is. I will keep you posted. Thanks for your help so far.

September 24th, 2013 10:05am

Hi,

I would like to check if you need further assistance.

Thanks.

September 30th, 2013 2:47am

Have you found any solution yet to the two errors that occur one after the other (32086 and then 32022)? We have been experiencing the same problem since we started using Hyper-V 2012 last year. Different servers randomly stop replicating with these same errors, which are not really helpful.

I also thought of scheduling a command to restart replication automatically. Since I never get an error that says to resume replication, the command would have to resume on any condition. We use DPM 2012 for backups, but they do not run at the times this happens, which is really odd.

Thanks
Patrick

Edited by Patrick N. _ Friday, October 18, 2013 10:11 AM

October 18th, 2013 10:07am

We are having the same problem.

Event ID:32086 - Hyper-V suspended replication for virtual machine 'Server1' due to a non-recoverable failure

followed by

Event ID:32022 - Hyper-V could not replicate changes for virtual machine 'Server1': The device does not recognize the command. (0x80070016)

We are replicating between Clusters.

It will work for about a day or so, then we get these errors and the VMs pause. We have lots of bandwidth and our backups are not taking place at the time of these errors.

When I tested in a non-clustered environment, everything worked perfectly. Clustered environment just keeps throwing errors after a day or two of working fine.

I might go the script route as well to avoid these frustrating errors, which there seems to be little info on when searching the net.

Edited by hazmat2012 Friday, October 18, 2013 5:56 PM grammar

October 18th, 2013 5:49pm

Same Issue here. I've been fighting this for a couple of weeks.

Event ID: 32366 - Hyper-V Replica failed to apply the log file onto the VHD for virtual machine 'Server1'. (Virtual machine ID 67945514-5EDE-49FF-A792-CEBE5ACE3305) (Log File C:\ClusterStorage\Volume1\Server1\Server1_734211F7-382B-450C-9095-E9CC4C40924E.hrl) (VHD C:\ClusterStorage\Volume1\Server1\Server1_DISK_1.VHDX) - Error: The device does not recognize the command. (0x80070016)

Event ID: 29012: Could not apply the replicated changes on the Replica virtual machine 'Server1'. (Virtual Machine ID 67945514-5EDE-49FF-A792-CEBE5ACE3305)

Event ID: 32056 - Hyper-V failed to apply replication logs for 'Server1': The device does not recognize the command. (0x80070016). (Virtual Machine ID 67945514-5EDE-49FF-A792-CEBE5ACE3305)

I'm replicating between 2 Failover-Clusters separated via a 1-Gb WAN connection. So, bandwidth is not the issue here.

I can initiate the replication on a VM, it will work for about a day and will then pause. At this point, I have to manually restart the replication and it eventually pauses again with the above errors.

When searching on the Net, this seems to be a common issue, but I've seen no definitive resolution.

Edited by Felix Carballo Wednesday, November 06, 2013 9:30 PM

November 6th, 2013 7:59pm

Question: Are you using any kind of Dell HITKit Software? See this post. - http://social.technet.microsoft.com/Forums/windowsserver/en-US/04eeed3e-df3c-4b79-8f73-bf68c5d5f985/hyperv-2012-replication-fails-to-automatically-resume-after-a-server-reboot?forum=winserverhyperv

November 29th, 2013 10:15pm

Thanks for your reply. No, we use HP and no additional software installed on the root nodes, other than from Microsoft.

Patrick

November 29th, 2013 10:28pm

I have the same problem, but figured out the cause in my case: on my parent server, under Replication Settings-->Recovery Points, "Additional recovery points" is set to 2. This option creates 2 snapshots on the replica server.

Every 5 minutes the parent server pushes changes to the replica server. The replica server then starts a "merge in progress" of the snapshots, which takes a few hours to complete, and it won't accept any replication from the parent server until the merge completes.

I have yet to find a solution without reducing recovery points.
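
For anyone who wants to test the same theory, the number of additional recovery points can be inspected and changed from PowerShell on the primary server. This is a sketch with a placeholder VM name; setting -RecoveryHistory 0 should correspond to keeping only the latest recovery point, which gives up the older restore points.

```powershell
# Show how many additional recovery points each replicated VM keeps.
Get-VMReplication | Select-Object VMName, RecoveryHistory

# Keep only the latest recovery point for one VM ('VM01' is a placeholder).
Set-VMReplication -VMName 'VM01' -RecoveryHistory 0
```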

January 3rd, 2014 3:39pm

My issue is finally resolved. The Dell HIT Kit 4.6 was the root cause. Dell recommended keeping the HIT Kit installed but de-registering the storage provider on each Hyper-V cluster host. Once I did this on each host, our replication issues disappeared and the cluster in general became more stable. The provider had also been causing our Hyper-V backups to succeed only inconsistently. HV FOC backups are way more stable now.

Disabling the Dell EqualLogic storage provider from the HIT Kit (when enabled, it disrupts Hyper-V Replica in failover cluster environments)

The preferred method is as follows.

To unregister the Dell provider (it is listed under vssadmin list providers), run the eqlvss command at a command prompt:

C:\Program Files\EqualLogic\bin>eqlvss /unregserver

If you ever want to re-register the provider:

C:\Program Files\EqualLogic\bin>eqlvss /regserver

Then run vssadmin list providers and you should see the Dell provider registered again:

C:\Program Files\EqualLogic\bin>vssadmin list providers

Edited by hazmat2012 Saturday, January 04, 2014 8:24 AM

January 3rd, 2014 3:45pm

I have this exact issue on Host1/Host2 Windows 2012 failover cluster with an EQL SAN. This did not resolve my issue. Was there anything else you tried? Replication from independent Host3/SAN to Host4/SAN works fine. EQL SAN with hit kit still stalls. I removed the hit kit from Host2 (failover cluster) and it still did not work. Maybe removing kit from both hosts in cluster?

Edited by JamRWil Friday, January 03, 2014 8:29 PM

January 3rd, 2014 8:29pm

I removed the replication on the VMs that were failing from my head office cluster, then went through the enable replication process again on each VM. I didn't have to redo the Replica Brokers on either of my clusters. Are you trying to replicate within just one cluster? If so, that won't work unless they are standalone Hyper-V hosts. Replicating in a cluster requires two separate clusters, i.e. a head office cluster and a DR cluster, then creating a Replica Broker on each cluster and then enabling your replication. I'm just mentioning that because it sounds like you only have one cluster from your description.

January 4th, 2014 8:22am

This is my setup (all windows 2012 standard):

--"Host1"/"Host2" in cluster connected to Equallogic SAN (production)

--"Host3" connected to MD3000i iSCSI SAN w/ SCSI attached MD1000 (onsite replica system)

--"BDR" site to site vpn connected server with internal storage (offsite backup and disaster recovery). Good bandwidth between sites.

I can replicate from cluster to BDR without problems.

I can replicate from Host3 to BDR without problems.

~~This news was just discovered after testing this weekend.

I can replicate from cluster to Host3 for about 1 day before it fails (sometimes a server may go 2-3 days before failing). I am now thinking something is up with the Host3 setup that it is stalling accepting the replica, not that the cluster was stalling sending the replica. I removed the EQL hit kit from both cluster servers and ran the replications over the weekend and they failed. Therefore, I do not believe that the hit kit is the culprit in my situation. It seems the replication utility is quite fragile to hardware variations.

Edited by JamRWil Monday, January 06, 2014 5:23 PM

January 6th, 2014 4:16pm

Under Replication-->Recovery Points, I selected only the "Only the latest point for recovery" option to see whether replication would succeed, and in my case it seems to have been replicating for the past 3-4 days now (based on the Replication Health option).

I still see the errors in the event logs though:

"Hyper-V could not replicate changes for virtual machine 'HyperVServer1' because the Replica server refused the connection. This may be because there is a pending replication operation in the Replica server for the same virtual machine which is taking longer than expected or has an existing"

followed by warning

Hyper-V failed to replicate changes for virtual machine 'HyperVServer1' (Virtual Machine ID 51F6F09F-D15F-4327-AA9E-FC9A488DEC25). Hyper-V will retry replication after 5 minute(s).

January 6th, 2014 8:26pm

I am running only latest recovery point

January 6th, 2014 9:05pm

To me, it is a bug that needs to be addressed. Replication health shows OK, but there are lots of errors and warnings in the Hyper-V event logs. As for specifying one or more additional recovery points, I don't know how that is supposed to work when the replica server can't handle that many replicas and the merges between snapshots still need to take place.

January 6th, 2014 10:46pm

It looks like I can replicate in any direction except replicating to my PowerEdge R710 (Host3) server with an MD3000i iSCSI SAN for my VM storage. Anything going to this breaks after a day or two. No research returns any answers. Very stuck on this issue.

January 9th, 2014 10:14pm

I have the same problem... Dell PowerEdge R620 with Equallogic PS6100X. Replication runs fine for a number of hours, maybe days then pauses. If I babysit this every day and keep resuming I can keep things running. Any help would be greatly appreciated.

For now I think I will run Edwin's PS

Get-VMReplication * -computername host1, host2, host3, etc | Where {$_.state -ne "Replicating"} | Resume-VMReplication

January 10th, 2014 8:28pm

I was using that script for a while with success, I ran it every 30 minutes but there were still lots of pauses later on at night.

Your setup sounds close to mine. We have 6 hosts at HO and 3 at DR, with a different cluster at each location. We have the cluster Replica role on each cluster. We are using Dell M610 blades to Equallogic SAN Pool. I tried every conceivable thing I could think of to fix this.

The only thing that finally worked was de-registering the EqualLogic storage provider on all hosts in both clusters. Then I removed replication on the VMs and re-enabled it, which required a full copy/sync. Since I did that I have had no replication errors, and the replication status is normal when my cluster health PowerShell script runs each morning. Did you try the method of de-registering the storage provider?

To un-register Dell under vssadmin list providers using the eqlvss command at a dos command prompt:

C:\Program Files\EqualLogic\bin>eqlvss /unregserver

January 10th, 2014 9:00pm

I did unregister the Equallogic VSS provider but I did not start over with replications after this step. I will give that a shot tomorrow and report back.

thanks for the tip

January 12th, 2014 10:22pm

This did not help. I even went as far as to create a completely independent VM migration network with additional NIC's. Nothing seems to work.

January 14th, 2014 3:23pm

This could be part of your problem - The MD 3000i isn't officially supported under Server 2012. Could be one of those things where most of it works, but breaks a feature such as replication?

See this post - http://en.community.dell.com/support-forums/storage/f/1216/t/19491719.aspx

January 14th, 2014 5:49pm

I thought of that as well. I moved one of the virtual servers to local storage on the target replication host and I still had problems replicating after one night.

Proposed as answer by JamRWil Thursday, March 06, 2014 5:39 PM
Unproposed as answer by JamRWil Thursday, March 06, 2014 5:39 PM

January 14th, 2014 10:29pm

I hope this helps someone....

I found a workaround for this issue. I created a set of tasks for both of my Hyper-V hosts (Host1 and Host2). The set of PowerShell commands below creates scheduled jobs that run a specific command at a 1-hour interval. For the first time ever I arrived at work with all replications normal. Pretty much all the replication jobs are still stalling, but the task runs and tells the virtual servers to resume without interaction. I set the job to run for the period shown, so you will have to modify this to suit your environment. After it had worked for 24 hours, I edited the task itself and changed the 23 hours to 365 days so that it runs for one year. The initial job worked well, so hopefully this will work around my issue for one year. The Hyper-V replication stall is very frustrating to deal with, because when replication works it is very nice to have.

# Prompt for the account the scheduled jobs will run under.
$cred = Get-Credential

# Trigger: start once at the given time, then repeat every hour for 23 hours.
$dailystart = New-JobTrigger -Once -At "03/05/2014 09:45:00" -RepetitionInterval (New-TimeSpan -Hours 1) -RepetitionDuration (New-TimeSpan -Hours 23)

# One job per host; each run asks every VM on that host to resume replication.
Register-ScheduledJob -Name ResumeReplication.Host1 -ScriptBlock { Resume-VMReplication -VMName * -ComputerName Host1 } -Trigger $dailystart -Credential $cred

Register-ScheduledJob -Name ResumeReplication.Host2 -ScriptBlock { Resume-VMReplication -VMName * -ComputerName Host2 } -Trigger $dailystart -Credential $cred

enjoy!

Proposed as answer by JamRWil Thursday, March 06, 2014 5:55 PM

March 6th, 2014 5:55pm

I agree with the earlier comment that this is a bug that needs to be addressed: replication health shows OK while the Hyper-V event logs fill with errors and warnings, and it is unclear how additional recovery points can work when the replica server cannot keep up with the merges. I feel this is either a bug or bad engine design. I can't find any documented evidence that it is expected.

We have a SQL server, replicated with 2012 R1, with several VHDX files. Initially the issue was thousands of replica files on the target that would never merge, and failovers that would never complete or took very long because of the number of VHDX files.

Initially we kept 15 hourly application-quiesced copies; this has now been reduced to 4. Going below 4 is not an option for me in this case. Even though the rate of change for this system (it is pre-production) is quite low, the hourly merges were taking longer than 5 minutes, in some cases only just over that threshold. The system is adequately designed and I don't see any disk throughput bottlenecks.

I noticed in the VMMS log that the merge kept starting and then ending after 5 minutes, over and over. This correlates with the fixed 5-minute replication of block-level changes performed by the replica engine.

Every time that 5-minute replication happens, it interrupts the merge operation, which then restarts. So if the merge needs longer than 5 minutes, it never completes and you end up with thousands of VHDX files on the replica target.

The only way I have found to resolve this is to pause the replica and let the merge complete, then resume it. We already have a script that checks for various replica failure conditions and attempts to correct them. The only way I can see to avoid manual intervention for this issue is to check in a scheduled script whether this specific target is in a merging state, pause the replica, wait in a loop or similar until the target is no longer merging, and then resume replication.
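
The pause, wait and resume idea described above could be sketched roughly as follows. This is not the poster's script. Wait-Until is a hypothetical polling helper, and 'VM01' and 'replica01' are placeholders. Detecting a merge by matching the text of the Status property returned by Get-VM is an assumption that would need checking in a real environment. The sketch also assumes that it runs on the primary server and that the replica server can be queried remotely.

```powershell
# Hypothetical helper: poll a condition until it returns $true or the timeout expires.
function Wait-Until {
    param(
        [Parameter(Mandatory = $true)][scriptblock]$Condition,
        [int]$TimeoutSeconds = 3600,
        [int]$PollSeconds = 30
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ((Get-Date) -lt $deadline) {
        if (& $Condition) { return $true }
        Start-Sleep -Seconds $PollSeconds
    }
    return $false
}

# Only talk to Hyper-V when the module is present.
if (Get-Command Suspend-VMReplication -ErrorAction SilentlyContinue) {
    $vmName = 'VM01'                 # placeholder VM name
    $replicaServer = 'replica01'     # placeholder replica host
    # Assumption: a merge in progress shows up in the replica VM's Status text.
    $isMerging = { (Get-VM -Name $vmName -ComputerName $replicaServer).Status -like '*Merg*' }

    if (& $isMerging) {
        Suspend-VMReplication -VMName $vmName
        $done = Wait-Until -Condition { -not (& $isMerging) }
        if (-not $done) { Write-Warning "Merge on $vmName did not finish within the timeout." }
        # Resume either way so that replication is not left paused.
        Resume-VMReplication -VMName $vmName
    }
}
```

The timeout keeps a stuck merge from leaving replication paused indefinitely, at the cost of possibly interrupting that merge again.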

January 16th, 2015 1:16pm

I have seen this exact same thing when doing a failback where I discard any changes. Pausing the replica is how I got around it short term. What has changed in my environment is that I am now on a new and much faster SAN.

I haven't tested failover/failback lately but will update after my next test.

May 26th, 2015 3:20pm

My environment is on the new, faster SAN and I still have the same issue. I'm currently on 2012 R1 and assume this issue is still present in 2012 R2, although there you could probably set the replication interval to 15 minutes to give the replica enough time to merge.

I have had a case open with MS for weeks about replicating VMs that randomly stop replicating and require either a manual resume or a scripted resume. While I have a script that works nicely, I need the system to work as designed.

I have seen periods where VMs on cluster A do not experience the random pausing, whereas a month earlier you could bet money that one or more VMs would stop replicating and require a resume.

The error I get differs slightly from the one this thread started with.

Hyper-V failed to open the file 'C:\ClusterStorage\Volume01\VM1\VM1_XXX5-4C3A-AF19-XXA3.hrl' for replication in primary server for virtual machine 'VM1': The process cannot access the file because it is being used by another process. (0x80070020). (Virtual Machine ID DXXD-XX5-XXF-AXXXXX-233F227B6)

Doing an err.exe on 0x80070020 reveals

# for hex 0x80070020 / decimal -2147024864 :
STIERR_SHARING_VIOLATION stierr.h
# 1 matches found for "0x80070020"
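
The same lookup can be done without err.exe, since HRESULTs of the form 0x8007xxxx wrap a Win32 error code in their low 16 bits. A small sketch, with Get-Win32Code as a made-up helper name:

```powershell
# Hypothetical helper: extract the Win32 error code (low 16 bits) from an HRESULT.
function Get-Win32Code {
    param([Parameter(Mandatory = $true)][long]$HResult)
    return [int]($HResult -band 0xFFFF)
}

$code = Get-Win32Code -HResult 0x80070020    # 32, ERROR_SHARING_VIOLATION
# On Windows, this prints the system message text for that code.
(New-Object System.ComponentModel.Win32Exception $code).Message

Get-Win32Code -HResult 0x80070016            # 22, the code behind "The device does not recognize the command"
```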

Also, my clusters are on HP gear.

August 26th, 2015 8:42pm

This topic is archived. No further replies will be accepted.