Adding a WS2012 Hyper-V Cluster to SCVMM2012 SP1 causes volumes on Dell EqualLogic to go offline

Hi,

I've recently built three WS2012 Hyper-V clusters with Dell EqualLogic storage.  The cluster builds were straightforward and the clusters are performing well.

The same issue has happened on all three when doing the push install of the VMM agent to the hosts and adding the cluster to VMM.  On each occasion the cluster volumes on the Dell EqualLogic were taken offline for a period, causing the virtual machines on the volumes to pause in a critical state.

After SCVMM was finished adding the hosts, the volumes could be brought back online and the virtual machines started.  Bit heart stopping though the first time it happened!

Whilst it happened, lots of errors (event IDs 5120, 5142, 1557, 1558, and 1069) appeared in the logs of each host - basically relating to the volumes going offline, but not helping to point out how or why.
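To pull just these events out of an exported log, something like the following Python sketch could help. It assumes the log was exported to CSV with `Id`, `TimeCreated`, and `Message` columns; those column names and the sample rows are assumptions for illustration, not taken from a real export.

```python
import csv
import io

# Event IDs mentioned in this post.
WATCHED_IDS = {5120, 5142, 1557, 1558, 1069}

def filter_events(csv_text, watched=WATCHED_IDS):
    """Return the rows whose Id column is one of the watched event IDs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["Id"]) in watched]

# Hypothetical sample export; replace with the contents of a real CSV file.
SAMPLE = """Id,TimeCreated,Message
5120,2013-01-25 17:01:00,Cluster Shared Volume no longer available
7036,2013-01-25 17:01:05,Service state change
1069,2013-01-25 17:01:10,Cluster resource failed
"""

matches = filter_events(SAMPLE)
for row in matches:
    print(row["TimeCreated"], row["Id"], row["Message"])
```

Running the same filter over the export from each host would make it easier to line up the timestamps across nodes.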

In all of the cluster builds affected "Do not allow cluster communication for this network" has been selected for the iSCSI network.

The VMM logs had the following "Completed w/ Info" warning for each host after adding the cluster: "Warning (26211) A restart is required to complete claiming of multi-path I/O devices on host <host FQDN>."

I'm wondering if there is something strange happening with the new SMP storage management capabilities of VMM 2012 SP1?  I didn't ask VMM to try to manage the storage whilst adding the Hyper-V hosts, so why it should interfere with the storage I don't know :(

Anyone run into something similar?  Would like to get to the bottom of it as Hyper-V with Dell EqualLogic storage is a very common build for us.

Cheers

James


January 25th, 2013 5:57pm

James,

confirmed - just went through the same thing (I found your post afterwards). As you said, a bit heart stopping.

In my case it seems the CSV went offline somehow (see below), causing all VMs to turn off (not pause). The CSV was still listed as online in Cluster Mgr, though.

Only the VMs on the CSV were affected. I have a couple of VMs on each node running "outside" the cluster on a separate LUN; those continued to run fine.

A reboot of the owner node of the CSV brought everything back to normal.

Not cool ...

Thx&Rgds,
Marcus.

Associated Events
46-mpio: Path 77020000 was removed from \Device\MPIODisk0 due to a PnP event. The dump data contains the current number of paths.
7-iScsiPrt: The initiator could not send an iSCSI PDU. Error status is given in the dump data.
17-mpio: \Device\MPIODisk1 is currently in a degraded state. One or more paths have failed, though the process is now complete.
5120-FailOverClustering: Cluster Shared Volume 'Volume1' ('Cluster Disk 2') is no longer available on this node because of 'STATUS_DEVICE_BUSY(80000011)'. All I/O will temporarily be queued until a path to the volume is reestablished.
And then tons of NTFS warnings:
140-NTFS: The system failed to flush data to the transaction log. Corruption may occur in VolumeId: CSV01, DeviceName: \Device\HarddiskVolume62. ({Device Busy}The device is currently busy.)
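
Event lines in the `ID-source: message` shape used above can be tallied by source with a short Python sketch. The regular expression and the shortened sample messages are assumptions based only on the lines quoted in this post, not on any official log format.

```python
import re
from collections import Counter

# Matches lines like "46-mpio: Path ... was removed".
EVENT_LINE = re.compile(r"^(\d+)-([^:]+):\s*(.*)$")

def parse_events(lines):
    """Return (event_id, source, message) tuples for lines that match."""
    events = []
    for line in lines:
        m = EVENT_LINE.match(line.strip())
        if m:
            events.append((int(m.group(1)), m.group(2), m.group(3)))
    return events

# Shortened versions of the events listed above.
LINES = [
    r"46-mpio: Path 77020000 was removed from \Device\MPIODisk0 due to a PnP event.",
    "7-iScsiPrt: The initiator could not send an iSCSI PDU.",
    r"17-mpio: \Device\MPIODisk1 is currently in a degraded state.",
    "5120-FailOverClustering: Cluster Shared Volume 'Volume1' is no longer available on this node.",
    "And then tons of NTFS warnings:",
    "140-NTFS: The system failed to flush data to the transaction log.",
]

events = parse_events(LINES)
by_source = Counter(source for _, source, _ in events)
for source, count in by_source.most_common():
    print(source, count)
```

Lines that do not start with a numeric ID and a source, such as the "And then tons of NTFS warnings:" remark, are simply skipped.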

January 30th, 2013 2:23am

Hi Marcus,

Glad your VMs were fine.  Was your CSV on an EqualLogic storage array?

Considering opening a case with Dell, as the EQL is a popular deployment for us.  Suspect even if we were to add a host to an existing cluster the same thing would likely happen.  Not cool in production :(

Cheers

James

February 5th, 2013 12:26am

Hey James,

yes, an EQL6100 with three cluster nodes attached to it.
I just spoke to our Dell Rep about this and pointed him to this thread. Let me know if you need add'l info when opening your case. I'd be happy to "assist".

Thx&Rgds,
Marcus.

February 7th, 2013 1:27am

Last week I observed similar issues with a PS 6000 in our lab.

In my case it doesn't seem to be related to VMM. I had errors before installing VMM (Volume Busy, I/O paused and even disconnects).

The errors seem to be related to live migration, adding or removing cluster nodes, and putting hosts in maintenance mode.

Basically everything that redirects IO or switches volume access between hosts.

Since our PS 6000 is out of support, I cannot open a case with Dell.

I will try to do some further diagnostics next week.

One possible source for the trouble might be the integration components (HIT 4.5). In my previous labs I have not used them because they were not available at that time and I haven't had any trouble with IO redirection etc.

I have a second cluster (File and iSCSI) running in virtual machines without any issues. This cluster uses the same EQL, but the nodes do have the HIT inst

February 17th, 2013 11:49am

Thanks Marcus,

Still no answer to this one, although I haven't logged a case with Dell yet.

Have had it happen to three EQL units deployed along with VMM2012 SP1.  All on latest firmware 6.02 and hosts with the latest HIT 4.5.

@ Dark Grant, think I've heard similar cases to yours when HIT4.5 wasn't used.  Related to ODX issues IIRC, which could be worked around via disabling ODX.

Cheers

James

March 5th, 2013 6:57pm

I experienced the same issue with Infortrend iSCSI storage. This issue isn't vendor or device specific. In our case the volumes wouldn't come back and the hosts had to be hard reset to bring everything back up, which was a real pain as we are used to installing agents for VMM 2008 on production hosts during working hours and hadn't anticipated this would knock out all services. The same warning was displayed: "Warning (26211) A restart is required to complete claiming of multi-path I/O devices on host <host FQDN>."

March 8th, 2013 11:22pm

Hello,

I have done some further testing in my lab. The issue seems to be related to the setup workflow.

I have now chosen a different approach to set up my Hyper-V clusters. The basic workflow looks like this:

  1. Install VMM and configure EqualLogic integration
  2. Configure the network objects inside VMM
  3. Set up the cluster nodes and install HIT - but do not configure EQL access etc.
  4. Join the nodes to VMM
  5. Set up host networks (logical switches, teaming, virtual NICs etc.) using VMM
  6. Configure the HIT on the VMM Server to act as a central management system for the Hyper-V nodes
  7. Configure EQL access on the nodes
  8. Configure ASM settings from the central console on the VMM Server
  9. Optional: configure access to cluster quorum on the hosts - CSVs will be created and added later using VMM
  10. Create the cluster using Cluster Manager (It won't work with VMM due to the "2 Adapters in the same subnet" warning)
  11. Use VMM with EQL integration to add cluster shared volumes to the cluster

---

I have set up two clusters (one Server Core and one with GUI) this way and both have been running for two weeks now without any issues. Next I will do some testing with adding and removing nodes to/from the cluster.

---

I will wait for the next patch day and, if this setup turns out to be stable, I will write and maybe publish more detailed documentation (the first version might be in German).

March 26th, 2013 9:27am

This happens on my Server 2008 R2 clusters when deploying the VMM 2012 (RTM) agents. The differences are that I'm using an HP storage array & SAS connections. I could reproduce this on at least 12 different clusters.  I called in a Premier case, and MS pointed to HP as the culprit.

I've just now deployed SP1, and I'm getting another round of cluster disks going offline, and 1 situation caused a BSOD 40 mins after agent deployment on each cluster node.  The weird part is... this only happens 1 time per cluster (installation of the VMM agent).  Removing the agent & re-installing it doesn't reproduce the problems.

Just thought I'd let you know this may not be specific to "Dell" or "iSCSI".   I'll open a new Premier case and see what I can find out.


April 11th, 2013 5:17am

Same thing just happened to me with an EMC array and SCVMM 2012 w/SP1. I don't think this is related to the hardware; it looks like a bug somewhere in either SCVMM or MS cluster.
April 19th, 2013 9:51am

This hotfix may help you.

http://support.microsoft.com/kb/2813630/

May 13th, 2013 5:06pm

Same problem here but with DataCore SANsymphony-V iSCSI storage. The problem appeared when I tried to install the agent on a W2012 Hyper-V host. I do not dare to install the agent on our W2008 R2 Hyper-V cluster.

Does anyone have an explanation or solution for this?

The hotfix 2813630 only applies to W2012 clusters.


June 13th, 2013 11:49am

I just got off the phone with PSS after experiencing similar issues on basically the same hardware. They suggested that I install:

http://support.microsoft.com/kb/2838669/EN-US - this hotfix is apparently the important one

http://support.microsoft.com/kb/2836988 - May Update Rollup through Windows Update

http://www.microsoft.com/en-us/download/details.aspx?id=36916 - March Update Rollup pick "Windows8-RT-KB2811660-x64.msu"

I haven't installed the hotfix yet as I'm waiting for an update window and the other 2 updates were already installed prior to calling PSS.
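
To check which of the KBs mentioned in this thread so far are still missing on a host, a rough Python sketch could compare them against a pasted list of installed updates. The input format and the sample text are assumptions (any text containing `KB` numbers would do), and KB1234567 is a made-up placeholder.

```python
import re

# KB numbers mentioned in this thread up to this post.
WANTED_KBS = {"KB2813630", "KB2838669", "KB2836988", "KB2811660"}

def missing_hotfixes(installed_text, wanted=WANTED_KBS):
    """Return the wanted KB IDs that do not appear in the installed list."""
    found = re.findall(r"KB\d+", installed_text, flags=re.IGNORECASE)
    installed = {kb.upper() for kb in found}
    return sorted(wanted - installed)

# Hypothetical list of installed updates pasted from a host.
SAMPLE_INSTALLED = """
Update           KB2836988
Update           kb2811660
Security Update  KB1234567
"""

missing = missing_hotfixes(SAMPLE_INSTALLED)
print("Missing:", ", ".join(missing) if missing else "none")
```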

June 14th, 2013 3:49am

I had the same issue.  In my case I was migrating all of my 2008 R2 clusters from SCVMM 2008 R2 to SCVMM 2012 SP1 Update 2.

I did not reboot hosts, but this bug cost me and others many hours recovering VMs that crashed, some of them so badly that we had to build a new VM using the .VHD files.

Again this was with 2008 R2 SP1 + many...many hot fixes.  We are using EqualLogic as well, 6.02 HIT KIT 4.5 on all of the hosts. I see other storage vendors in this thread plus links to Windows 2012 hotfixes.  Since we have different storage vendors and a mix of 2008 R2/2012, I think the issue is SCVMM 2012 SP1.

After my first 3 clusters all did the same thing, I opened a case with Premier support.  They advised me to remove my last cluster from SCVMM 2008, then pre-install the SCVMM 2012 agent manually, as in do it on my production cluster one host at a time while in maintenance mode.  This made sense to me as it looked like installing the SCVMM 2012 agent on the host when adding the cluster to SCVMM was causing the issues.

Of course this took almost a full day of draining production hosts (8 nodes), installing the agent and rebooting these boxes.  I applied Windows updates at the same time, which had last been done back in February.

Then I added my prod cluster to SCVMM 2012 after hours and it did the same thing.

Oh how I miss VMware......let me count the ways.  My faith in SCVMM and Windows clustering is not very high at all.  We are starting to upgrade our clusters from 2008 R2 to 2012 in hopes of better stability but that process is a joke compared to VMware. (busting clusters and upgrading nodes, creating new clusters and swinging over LUN's via the cluster Migration Wizard, rolling the nodes over.)

June 27th, 2013 12:49pm

It's really interesting that you mentioned maintenance mode. The problems I experienced appeared seconds after I set one of my nodes to maintenance mode through VMM; if I paused the node through FCM there were no problems.

Your setup (EQ SAN with Latest HIT, ~8 Nodes) and experiences (oh so many hotfixes and oh so much time wasted) sound almost exactly like mine. I'm currently working on migrating to 2012 in hopes of greater stability, which I'm not finding. All of my faith is on the Server 2012 R2 and VMM 2012 R2, both look promising (i.e. join a 2012 R2 Machine to a 2012 Cluster and do a 1 way live migration as well as the VMM team actually working with the Hyper-V Team.)

As for the post above, I haven't experienced the problem again. I will be adding another node to my cluster tonight, so we'll see how smoothly it goes.

Fingers Crossed.

-Jon

June 27th, 2013 9:41pm

I hate to burst your bubble but from everything I have read you CAN'T join a Windows 2012 R2 host to a Windows 2012 cluster.

You must either have extra hardware and build a new cluster, or evict a node in your 2012 cluster, upgrade it (or clean install) then create a 1 node cluster with it, and move some VM's.

The only new thing I really see is that you can live migrate from one version to another as in Live Migrate from 2012 cluster to a 2012 R2 Host/Cluster...which would be a storage migration as well.

It's NOTHING like upgrading VMware, which supports mixed node version clusters so you can do an easy rolling upgrade.

I am not finding 2012 much better either.  A few little things here and there.  I find SCVMM 2012 SP1 slower than SCVMM 2008 but it does support logical networks/teaming finally.

June 28th, 2013 4:17am

As far as adding another node goes, this is the method I used recently that worked for me.

This was a rolling migration of a 2008 R2 cluster to 2012.  I broke the original cluster of three nodes by taking out two nodes and creating a new 2 node 2012 cluster.  (I could do this over the weekend with many VDI/VM's shut down).  The cluster was built NOT with SCVMM but with Failover Cluster Manager; I had issues with SCVMM and skipped it.  The hosts were already in SCVMM (agent installed) but were not clustered.  I did this first for a few reasons: to avoid the problem we are talking about here and to use the logical networks/teaming pushed from SCVMM.

Then I performed a cluster migration using the cluster migration wizard...

http://blogs.msdn.com/b/clustering/archive/2012/06/25/10323434.aspx

Once the migration was finished, I rebuilt the 3rd node, added it to SCVMM, pushed the logical networks, and then added the host to the cluster via Fail Over Cluster manager.  In SCVMM I just refreshed the cluster and the host was part of it now.  I had no issues at all adding the third node using this method.

All that said, I had another issue with the cluster migration.  About 50 or so of my VDI's did not get updated with the new network I chose in the Cluster Migration Wizard.  Those VDI/VM's failed in both SCVMM and Failover Cluster Manager.  After a call with Microsoft we found out that those VM's still had the old network of the old cluster.  The fix was to drop them from the failover cluster, fix the network in the host's local Hyper V manager and then re-add them to the cluster, making them "highly available".  Again my confidence in this platform is not great.

June 28th, 2013 7:25pm

Hi,

I also have a 3-node 2012 cluster + EqualLogic storage (all the latest patches + FW, plus the recommended hotfix list for 2012 Hyper-V clusters).

I am also having trouble pushing the VMM agent onto the cluster. All iSCSI connections are reset during this process, resulting in failure of the CSV and VMs ...

Did anyone find a trick to add the cluster to VMM without getting into a failed state?

I could reproduce this issue on a test cluster. Like Lindy, I tried to manually install the agents and add the cluster to VMM afterwards, which also resulted in the CSV going into a failed state.

Btw @Lindy, I also had trouble during the migration from the 2008 R2 to the 2012 cluster. The 2nd node always tried to load the configuration of the old cluster. From MS support I also got the workaround of removing the VMs from the cluster and making them highly available again...


July 10th, 2013 9:49am

This has happened again 3 times, not specifically when adding a node to VMM, but just randomly in the middle of the day.

I've taken VMM offline as well as any other System Center product that may be causing iSCSI connections to drop. There were a few hotfixes that suggested that DPM was causing an issue, but since applying those hotfixes the problems continue.

There is another cluster at my work that runs on the exact same hardware and same configuration, only without System Center, and they have next to no problems (other than having to apply patches once a month).

I'm currently waiting for Microsoft PSS to call me back. I've been waiting for 5 days and I keep getting "someone will call you back within the next 4 hours". I placed a call with a VMware vendor to get pricing on moving to their products. It shouldn't be hard to find the money considering the downtime caused by Hyper-V in the last 3 weeks was estimated at around $30,000 of lost productivity for developers.

-Jon

July 10th, 2013 5:53pm

An Additional Hotfix from PSS that will hopefully also address this issue.

http://support.microsoft.com/kb/2838043/en-us

This one updates clusres.dll, whereas the others don't touch it.

I'm applying it now and will report back if I continue to experience issues.

July 15th, 2013 5:40pm

Last week we made the decision to move our Production Server cluster (8 nodes) full of production server VM's back to VMware because of all of the issues we are having.  We simply can't have the outages we have had with Hyper V. The mirror cluster at our DR site will roll back as well.

I started over the weekend, and it will probably take 30-60 days because some VM's are large and require a lot of downtime.  The free VMware converter tool is a godsend compared to the V2V methods we used when going from VMware to Hyper V.

We are going to keep our DEV and QA environments (smaller 2 node clusters each) on Hyper V as the smaller clusters seem more stable.  However, our VDI cluster will probably go back to VMware as well since it is considered production too.

I really gave Hyper V my best shot.  I wanted it to work.  At the time we moved, VMware was raising its prices (the "v-tax" was coming into play) and Hyper V was much cheaper.  Now that the "v-tax" is gone and Microsoft is raising the price of SCVMM considerably, I am not sure VMware is more expensive if you are using SCVMM with 2012 version pricing.

I look forward to a more stable platform that allows me more time away from work...after hours.

July 15th, 2013 5:51pm

Hi,

HIT/Microsoft v4.6 (EPA) includes the following features and updates: Support for Microsoft System Center Virtual Machine Manager 2012 SP1

So HIT 4.5 doesn't support SCVMM...

SCVMM SP1 CU3 is available http://support.microsoft.com/kb/2836751 with some fixes.

Regards,

July 24th, 2013 5:08am

This topic is archived. No further replies will be accepted.
