We have two Hyper-V clusters in two separate datacenters, connected by a redundant 100 Mbit/s link. Redundancy is achieved with spanning tree, and the two sites share the same private IP range. Each datacenter has two domain controllers, all in the same domain. We use Hyper-V Replica to replicate about 50 VMs to the secondary datacenter. The primary datacenter is our production environment; the secondary datacenter serves as our failover location, backup location and test environment. When everything was working and all the servers were being replicated, replication used about 10 - 15 Mbit/s of the WAN connection. We use HTTP for replication.
After a failure caused by a Hyper-V host rebooting (incorrectly configured automatic update settings), several VMs stopped replicating. Since that incident I have not been able to get replication working properly. I get timeout errors and other error messages indicating that there is no connection at that moment. Replication works, but not all the time. The replication status changes to Warning when more than 20% of the replication cycles are missed. Eventually we even see a status of Critical on several VMs, after which we have to resume replication manually.
I suspect this is caused by the limited bandwidth that is available. I had to do a completely new initial replication for some servers, because they would no longer resume replication after a manual resume command. Even the resynchronization command did not work, so I had to remove the replication and set it up again. The problem is that the initial replication uses all the available bandwidth, and I suspect that not enough bandwidth is left for the regular replication of the other VMs, which then causes the timeouts. I am, however, not sure about this theory. Is it possible that initial replications consume so much bandwidth that regular replications time out, or is there a mechanism that prevents this? Are there other possible causes of these problems?
Below are the error messages that I frequently get. I get these errors on VMs with a normal replication status as well as on VMs with a warning or critical status.
Hyper-V could not replicate changes for virtual machine 'XXXXX' because the Replica server refused the connection. This may be because there is a pending replication operation in the Replica server for the same virtual machine which is taking longer than expected or has an existing connection. (Virtual machine ID 33F83E0A-843A-4E83-9CD2-92EC7D3E3FEA)
Hyper-V suspended replication for virtual machine 'VADC01' due to a non-recoverable failure. (Virtual Machine ID 33F83E0A-843A-4E83-9CD2-92EC7D3E3FEA). Resume replication after correcting the failure.
Hyper-V could not replicate changes for virtual machine 'XXXXX': The device does not recognize the command. (0x80070016). (Virtual Machine ID 33F83E0A-843A-4E83-9CD2-92EC7D3E3FEA)
Could not replicate changes for virtual machine 'XXXXX' as the Replica server 'cc-hv11.cc.lan' on port '80' is not reachable. The operation timed out (0x00002EE2). (Virtual Machine ID 308792FF-E1E8-4C15-930B-15506C4BF85D)
Connection to the Replica server 'computer.domain.lan' timed out while waiting to receive a response for virtual machine XXXXX: The operation timed out(0x00002EE2). The total size of replication data being transferred is 65639 KByte(s). (Virtual Machine ID 5A99E295-7E42-4D9E-8814-9469151C7400)
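To put a rough number on my bandwidth theory, here is a small Python sketch that estimates how long the 65639 KByte transfer from the last error message would take over the 100 Mbit/s link when it only gets a fraction of the bandwidth. The function name and the list of shares are made up for illustration, it assumes 1 KByte = 1024 bytes, and it ignores protocol overhead and compression, so treat the results as an order-of-magnitude estimate only.

```python
def transfer_seconds(size_kbyte: float, link_mbit_s: float, share: float) -> float:
    """Estimate the seconds needed to send size_kbyte over a link
    when only the fraction `share` (0 < share <= 1) of it is available."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    size_bits = size_kbyte * 1024 * 8                  # assumption: 1 KByte = 1024 bytes
    usable_bits_per_s = link_mbit_s * 1_000_000 * share
    return size_bits / usable_bits_per_s


DELTA_KBYTE = 65639   # size reported in the timeout error message
LINK_MBIT_S = 100     # nominal WAN link speed

# Hypothetical shares of the link left over while an initial replication runs.
for share in (1.0, 0.5, 0.1, 0.01):
    seconds = transfer_seconds(DELTA_KBYTE, LINK_MBIT_S, share)
    print(f"{share:>5.0%} of the link: about {seconds:.0f} s")
```

If the share left for a regular replication cycle drops to a few percent, a single transfer of this size takes minutes instead of seconds, which would at least be consistent with the timeouts. I do not know the actual timeout value Hyper-V Replica uses, so this does not prove the theory.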