First of all, I know it's a beta and these are the perils of being an early adopter, but I've got a serious problem.
I've upgraded our production Hyper-V cluster to Server 2012. The setup is a 4-node cluster running CSVs on an iSCSI SAN with MPIO via dual gigabit Ethernet networks. The SAN storage is provided by Open-E DSS7 and replicated to another server in a different building.
After the upgrade the cluster seemed stable and everything worked as expected, including live migrations. I then turned my attention to backups and discovered that Server 2012 wasn't supported by DPM. Fortunately there is a beta of DPM 2012 SP1 which adds support for Server 2012; unfortunately there is no upgrade path from the beta to the RTM of SP1. Not wanting to upgrade our production DPM server to a beta, I installed a copy of DPM 2012 SP1 beta on a VM to provide a stopgap for VM-level backups of certain machines that couldn't be backed up in other ways. I realise that running the backup server on the same cluster / SAN as the machines being backed up is an odd thing to do, but it at least provides snapshots, SAN replication provides resilience, and as I say, this is a stopgap.
Then I started noticing problems. The first symptom was that when starting or rebooting VMs, other VMs would sometimes hang for anywhere from 30 seconds to 2 minutes. People would start complaining that SharePoint had gone unresponsive and so on, but everything would come back to life within a minute or two. On a couple of occasions we came in the next morning to find a number of VMs off or paused (backups ran overnight). Both of these problems occurred only when the DPM server was turned on. I thought the issue might be general load on the SAN, with both the backup server and the machines being backed up living on the same CSV / hardware. I moved the DPM server to a different iSCSI box and applied aggressive throttling (200 Mbps) to try to reduce the load, but the problem continues.
The event logs on the Hyper-V cluster suggest I/O timeouts to the SAN at the times of the backups. There are lots of entries with event IDs 1069, 1205, 1146 and 1230 (various cluster resources failed). The interesting one, I think, is 5120: Cluster Shared Volume 'Volume5' ('VOLUME NAME') is no longer available on this node because of 'STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)'. All I/O will temporarily be queued until a path to the volume is reestablished.
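For anyone wanting to check the same correlation on their own cluster, the comparison between cluster errors and backup times can be sketched roughly as below. This is only a minimal Python sketch, not anything taken from DPM or the cluster tooling: the timestamps, the backup windows and the function names are all made up for illustration, and real values would have to be copied from the event log and the DPM job history.

```python
from datetime import datetime

# Hypothetical sample data: (timestamp, event ID) pairs copied by hand
# from the cluster event log. These are NOT real log entries.
events = [
    (datetime(2012, 11, 20, 1, 15), 5120),
    (datetime(2012, 11, 20, 1, 16), 1069),
    (datetime(2012, 11, 20, 14, 30), 1205),
    (datetime(2012, 11, 21, 1, 40), 5120),
]

# Hypothetical backup windows (start, end), e.g. from the DPM job history.
backup_windows = [
    (datetime(2012, 11, 20, 1, 0), datetime(2012, 11, 20, 3, 0)),
    (datetime(2012, 11, 21, 1, 0), datetime(2012, 11, 21, 3, 0)),
]


def in_any_window(ts, windows):
    """Return True if the timestamp falls inside any (start, end) window."""
    return any(start <= ts <= end for start, end in windows)


def split_by_backup_window(events, windows):
    """Split events into those logged during a backup window and the rest."""
    inside = [e for e in events if in_any_window(e[0], windows)]
    outside = [e for e in events if not in_any_window(e[0], windows)]
    return inside, outside


inside, outside = split_by_backup_window(events, backup_windows)
print(f"{len(inside)} events during backups, {len(outside)} outside")
```

If nearly all of the 5120 and resource-failure events land inside the backup windows, that supports the idea that the backups trigger the timeouts, although it would not say whether DPM or the SAN is at fault.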
Is anyone else using SP1 beta to successfully back up a 2012 Hyper-V cluster?
Is anyone seeing the same problem?
Is it likely that this is a problem with SP1 beta, and if so, will it be fixed at RTM?
Any suggestions for a stopgap solution?
I think I might try setting up a test physical DPM server to check that the issue isn't in some way related to the fact that the DPM server sits on the same cluster it's backing up. I'm also happy to consider that the problem could lie elsewhere, i.e. with the SAN storage (this was upgraded from v6 to v7 at the same time as the 2012 upgrade), but as soon as I tell the vendor that the problem relates to running a beta of DPM they will be pointing fingers at that.
- Moved by Mike Jacquet (Microsoft employee, Moderator), Friday, November 23, 2012 4:09 AM (From: Data Protection Manager - General)