Adding a WS2012 Hyper-V Cluster to SCVMM2012 SP1 causes volumes on Dell EqualLogic to go offline

Hi,

I've recently built three WS2012 Hyper-V clusters with Dell EqualLogic storage.  The cluster builds were straightforward and are performing well.

The same issue has happened on all three whenever doing the push install of the VMM agent to the hosts and adding the cluster to VMM.  On each occasion the cluster volumes on the Dell EqualLogic were taken offline for a period, causing the virtual machines on the volumes to pause in a critical state.

After SCVMM was finished adding the hosts, the volumes could be brought back online and the virtual machines started.  Bit heart stopping though the first time it happened!

Whilst it was happening, lots of errors with event IDs 5120, 5142, 1557, 1558, and 1069 appeared in the logs of each host. They basically relate to the volumes going offline, but do not help to point out how or why.
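To line these events up against the time the VMM agent was pushed, a query like the following can be run on each host. This is only a sketch: it assumes the events are written to the System log (where the failover clustering provider normally logs them), and the four-hour window is an arbitrary choice, not something from this thread.

```powershell
# Event IDs reported in this thread.
$ids = 5120, 5142, 1557, 1558, 1069

# Assumption: look back four hours; adjust to cover the time the agent was installed.
$since = (Get-Date).AddHours(-4)

# Query the System log for those IDs and list them in chronological order.
# -ErrorAction SilentlyContinue suppresses the error raised when no events match.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = $ids; StartTime = $since } -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, ProviderName, Message -Wrap
```

Comparing the timestamps with the VMM job history should show whether the volumes dropped at the moment the agent install or the MPIO claim step ran.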

In all of the affected cluster builds, "Do not allow cluster communication for this network" has been selected for the iSCSI network.

The VMM logs had the following "Completed w/ Info" warning for each host after adding the cluster: "Warning (26211) A restart is required to complete claiming of multi-path I/O devices on host <host FQDN>".

I'm wondering if there is something strange happening with the new SMP storage management capabilities of VMM 2012 SP1?  I didn't ask VMM to try to manage the storage whilst adding the Hyper-V hosts, so why it should interfere with the storage I don't know :(

Has anyone run into something similar?  I would like to get to the bottom of it, as Hyper-V with Dell EqualLogic storage is a very common build for us.

Cheers

James


January 25th, 2013 5:57pm

Same problem here, but with DataCore SANsymphony-V iSCSI storage. The problem appeared when I tried to install the agent on a W2012 Hyper-V host. I do not dare to install the agent on our W2008 R2 Hyper-V cluster.

Does anyone have an explanation or a solution for this?

The hotfix 2813630 only applies to W2012 clusters.


  • Edited by AndersP Thursday, June 13, 2013 8:55 AM
June 13th, 2013 11:49am

If you read through this thread you will notice it's NOT only EqualLogic storage that has an issue.

Also, if you read the EqualLogic notes in the HIT Kit v4.6 (EPA), the SCVMM 2012 SP1 support is there to allow SCVMM 2012 SP1 to manage the EqualLogic storage via the Fabric section of SCVMM, under storage and arrays: create volumes (or LUNs, as SCVMM calls them), assign host permissions, etc.  So instead of creating a volume for Hyper-V hosts in EQL Group Manager you could do it in SCVMM.  It might be easier for some, but SCVMM falls way short of what you can do in Group Manager or in VMware with the EqualLogic vCenter VSM plugin.

This ONLY needs to be installed on the SCVMM server, not on the Hyper-V hosts, for this to work.  However, I would update all hosts for bug fixes or other issues.  The documentation states nothing about fixing the issue discussed in this thread.  Exact wording in the document...

This release of Host Integration Tools for Microsoft includes support for System Center Virtual Machine Management (SCVMM, or VMM) 2012 SP1, used by Windows Server 2012 and Windows 8.

You must install the Host Integration Tools to enable SCVMM, which then enables you to configure access to PS Series groups so that they are included in the list of providers visible in the VMM GUI. In the VMM GUI, you can then create new volumes (called Logical Units in the VMM GUI).

SCVMM uses the Dell EqualLogic Storage Management Provider (SMP) to communicate with the PS Series groups. The Dell EqualLogic SMP allows you to manage EqualLogic storage directly through native Windows storage interfaces such as storage PowerShell cmdlets (Storage Module), the File Services UI in the Windows 2012 Server Manager console, or the standard Windows Management Instrumentation API. 

August 1st, 2013 6:43am

I just attempted the install of the SCVMM agents on my Server 2012 Hyper-V cluster with an EqualLogic back end on September 11th, with the same result: all guests went to a failed state within minutes. I have firmware 6.0.2 on my arrays, HIT 4.6 on my hosts as well as on my VMM management server, and I am running SCVMM 2012 SP1 CU3, so I can verify that none of the proposed updates fix this problem. I simply went to each host and manually uninstalled the VMM agent under Programs and Features, and I was able to restart all of my guests within 5 minutes. A side note: a couple of disks took almost a full 5 minutes to become available again, showing under storage on the hosts as present but "Cannot be accessed", so patience is definitely your friend when bringing the guests back online without having to rebuild them.
September 12th, 2013 10:43pm

Hey Tim,

Can you check the Metrics for your Cluster Networks for me?

You can do so by running the following PowerShell commands (from a cluster node):

Import-Module FailoverClusters   # Import the cluster module (IPMO is the alias)
Get-ClusterNetwork | Select-Object Name, Metric, AutoMetric

I had to double-check the metric for my CSV network, as it was automatically set higher than Admin and had auto metric enabled. The recommendation is to set the CSV network to something like 900 (or lower than everything else).

After setting the CSV network metric (this is a new step in 2012) I haven't had any more problems (yet).

(Get-ClusterNetwork -Name "CSV Network").Metric = 900 #Set the Metric to 900

The Order from lowest metric to highest should be CSV, LiveMigration, Admin, iSCSI (iSCSI is always the highest). AutoMetric should get this mostly right except for CSV Network.
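As a rough way to check that ordering, the networks can be listed sorted by metric. This is a sketch only; the network names it prints are whatever your cluster uses ("CSV Network" in my case), and the expected order is the one described above, not something the cmdlet enforces.

```powershell
# Load the failover clustering cmdlets.
Import-Module FailoverClusters

# List cluster networks from lowest to highest metric.
# Per the order described above, the CSV network should appear first and iSCSI last.
Get-ClusterNetwork |
    Sort-Object Metric |
    Format-Table Name, Role, Metric, AutoMetric -AutoSize
```

Note that setting Metric by hand, as in the command above, switches AutoMetric off for that network, so the AutoMetric column also shows which values were assigned manually.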

Found this info in this handy Hyper-V Cluster Setup Check List.

http://blogs.technet.com/b/askpfeplat/archive/2013/03/10/windows-server-2012-hyper-v-best-practices-in-easy-checklist-form.aspx

September 13th, 2013 1:50am


Hello James

Did you ever get this issue resolved?

I just installed a new VMM 2012 R2 and a two-node WS2012 R2 cluster with CSV on an EqualLogic SAN. I first set up the cluster without VMM, through Failover Cluster Manager. Then I went to add it to VMM, and I experienced the same issues as you.

My cluster validates without errors.

I now cannot recreate the problem on my WS2012 R2 Cluster.

But my plan is to add my older WS2008 R2 cluster to be managed by VMM, and I fear this, because it is my production cluster. I want to move VMs to my new cluster, but I'm afraid of causing downtime.

November 25th, 2013 5:56pm

Joy of joys I just had the EXACT same issue.

SCVMM 2012 R2.  Adding a 2-node Dell PE R720 cluster with an EqualLogic 6110 back end.  Production environment.  Adding the cluster to SCVMM knocked most of my VMs offline cold.  The MS CSVs were going online and offline constantly.  Per the earlier comments, my solution was a reboot of one of my hosts.  Surely there must be a concrete answer / fix for this??

Come on M$...if you EVER want to catch VMware this simply cannot happen...

And here was me thinking it was Friday the 13th related...

December 14th, 2013 7:39am


Last week I added our older 2008 R2 cluster to be managed by VMM 2012 R2. My iSCSI connections lost connectivity for a bit and one node needed a restart. I think it is because VMM tries to run MPCLAIM on each node. But the nodes were already set up with everything, so I don't know why this needs to happen. It must be hardcoded into VMM.
December 16th, 2013 4:21pm

Hi there.

Same problem here.

I have three different clusters:

2 w2k8 and DELL MD3000 DAS

2 w2k8 and DELL/EMC AX-4 iSCSI

2 w2k12 and EL6100

No problem at all adding the MD3000 cluster. That makes sense, since the problem is related to iSCSI.

When I added my w2k8 and EMC cluster I lost all CSV volumes and all VMs went down. Guess what? Yes, production, in the middle of the working day. :-(

Then I added one new host to the w2k12 cluster. VMM did not recognize this new host. I removed the cluster and added it back. I didn't have any problem with the two other hosts that already had the VMM agent installed, but I received a message about MPIO on the new host asking me to reboot. This time I did not lose any CSV volume.

It's pretty clear to me that VMM is the problem.

I'm thinking of simply uninstalling VMM from all clusters and using only FCM.

Any progress on that?

Thanks

Thompson

January 14th, 2014 9:11pm

Same problem here when we added our 4-node 2012 R2 cluster to VMM 2012 R2. iSCSI connections were reset, causing everything to go offline.

Has nobody got an answer as to why this happens?

 
April 10th, 2014 11:19am

Hello all,

We had a 4-node 2008 R2 cluster with an Open-E SAN managed by VMM 2008 R2. It worked perfectly. I tried to migrate to VMM 2012 R2, but there were too many problems, so I decided to build a new VMM environment.

New environment: Win 2012 Datacenter, SCVMM 2012 R2

Cluster: Win 2008 R2 SP1, 4 nodes

I added the cluster to VMM and ALL CSV volumes went offline on the production system. Perfect, Microsoft, thank you! Restarting the whole system brought the cluster back online.

Isn't there another solution? Or is it only possible to build a new cluster and configure everything with VMM?

Thank you guys...

best

Steffen

May 19th, 2014 12:10pm

Not only did we see the same issue where everything went offline, our CSVs somehow got flip-flopped.  We have 3 CSVs configured and the 1st and 3rd swapped places.  While all VMs on CSV2 came back up, none of the others did, because they couldn't find their .vhdx files where they expected them.  We had to manually reimport the two thirds of our VMs that were broken through Failover Cluster Manager.

This was on a production 4-node cluster consisting of Dell PowerEdge R620s and an EqualLogic PS6100.

September 26th, 2014 12:52am

We are seeing this same problem with Windows 2012R2 (including update!!) and VMM 2012R2. This is crazy.

The trouble is that our VMM VMs sit on the cluster we're trying to add to VMM. I think the problem lies with the MPCLAIM that VMM is trying to perform against our MPIO. As a result, the CSV goes offline while adding the cluster to VMM, so the VMM VM loses comms to storage and can't add the cluster, as it crashes out in the process.

VMM hosted on the cluster it manages is a supported configuration, yet this is clearly a longstanding problem based on other comments in this thread. Would love to know if anybody actually has a way around this? At the moment my only route is to migrate the VMs off to another cluster whilst I add the management cluster to VMM. Ridiculous...

We are using UCS with NetApp MetroCluster, so this is definitely not specific to Dell EqualLogic.

December 8th, 2014 5:16pm

I've seen the iSCSI CSVs go offline with 2012 and 2012 R2 clusters with EMC, NetApp and EqualLogic SANs, so I agree that it is not specific to just Dell EqualLogic.  I've played with a number of settings within the cluster and have tried even vs. odd cluster configurations.  I've tried all the hotfixes without success.  I've done the agent install enough times to expect offline CSVs, but I need to figure out a solution before I deploy to an important Hyper-V cluster.

Today I stood up a 2-node cluster with an EqualLogic back end to test.  I was able to complete the install successfully.  Once the cluster was running, I moved all roles off Node2 and then evicted Node2.  I then installed the VMM agent on Node2, with a reboot to follow.  I brought Node2 back into the cluster and moved storage and VMs over to Node2.  VMM recognized the VMs.  Then I evicted Node1, installed the agent, rebooted and rejoined the cluster.  My concern was that VMM wouldn't recognize the cluster and would instead see the nodes as individual hypervisors.  I left the office for 30 minutes and returned to find that the cluster was recognized with both nodes.  So it was a success.

I'm going to try again with a smaller production cluster using the same methodology: evict, install, reboot and rejoin the cluster.  Move roles and repeat.
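For anyone who wants to script that, the per-node steps might look roughly like the sketch below. It is untested against the VMM scenario and rests on assumptions: "Node2" is a placeholder name, the drain uses Suspend-ClusterNode -Drain (Windows Server 2012 and later) rather than moving roles by hand, and the agent install itself is left as a manual step because how it gets installed on an evicted node is not covered here.

```powershell
Import-Module FailoverClusters

# Hypothetical node name; replace with the node being worked on.
$node = 'Node2'

# 1. Drain all roles off the node (pauses the node and moves its roles elsewhere).
Suspend-ClusterNode -Name $node -Drain

# 2. Evict the node from the cluster (prompts for confirmation).
Remove-ClusterNode -Name $node

# 3. Install the VMM agent on the evicted node here (manual step, not scripted),
#    then reboot it and wait until PowerShell remoting responds again.
Restart-Computer -ComputerName $node -Wait -For PowerShell -Force

# 4. Bring the node back into the cluster.
Add-ClusterNode -Name $node

# 5. Confirm that every node is back up before repeating with the next one.
Get-ClusterNode | Format-Table Name, State -AutoSize
```

On a two-node cluster, evicting a node leaves the remaining node with no failover partner, so this is something to try in a maintenance window first.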

Chris


January 15th, 2015 11:26pm

I get an email notification when this thread is updated.  I am shocked that Microsoft has not fixed this issue.

This issue and a few others made my company ditch Hyper-V in favor of VMware.

We are 100% virtualized in our data center.  The hypervisor seems like a small layer in the data center stack, but problems like this show you how important that layer is to get right.

I think the biggest weakness of Hyper-V is Microsoft Clustering.  It has been an unreliable layer for a very long time, going way back to Windows 2000.

Standalone Hyper-V servers are pretty solid.  Throw in the clustering part and it's a house of cards.  VMware seems to do clustering so well and so easily in comparison.

January 16th, 2015 12:54am

This topic is archived. No further replies will be accepted.
