DPM 2012 SP1 Beta - Causing Server 2012 Hyper-V Cluster hang / ISCSI problems (Network Steve Forum)

Hi All,

First of all, I know it's a beta and these are the perils of being an early adopter, but I've got a serious problem.

I've upgraded our production Hyper-V cluster to Server 2012. The setup is a 4-node cluster running CSVs on an iSCSI SAN with MPIO over dual gigabit Ethernet networks. The SAN storage is provided by Open-E DSS7 and replicated to another server in a different building.

After the upgrade, everything about the cluster seemed stable and worked as expected: live migrations etc. all working. I then turned my attention to backups and discovered that Server 2012 wasn't supported by DPM. Fortunately, there is a beta of DPM 2012 SP1 which adds support for Server 2012; unfortunately, there is no upgrade path from the beta to the RTM of SP1. Not wanting to upgrade our production DPM server to a beta, I installed a copy of DPM 2012 SP1 beta on a VM to provide a stopgap for VM-level backups of certain machines that couldn't be backed up in other ways. I realise that running the backup server on the same cluster / SAN as the stuff that's being backed up is an odd thing to do, but this at least provides snapshots, SAN replication provides resilience, and, like I say, this is a stopgap.

Then I started noticing problems. The first symptom was that on starting / rebooting VMs, other VMs would sometimes hang for perhaps 30s - 2m, and people would start complaining that SharePoint had gone unresponsive etc. However, they would come back to life in a minute or two. On a couple of occasions we came in in the morning to find a number of VMs off or paused (backups ran overnight). Both of these problems occurred only when the DPM server was turned on. I thought the issue might be general load on the SAN, with both the backup server and the machines being backed up living on the same CSV / hardware. I moved the DPM server to a different iSCSI box and put on aggressive throttling (200Mbps) to try to reduce load, but the problem continues.

The event logs on the Hyper-V cluster suggest I/O timeouts to the SAN at the times of the backups. Lots of event IDs 1069, 1205, 1146 and 1230 (various cluster resources failed). The interesting one, I think, is 5120: Cluster Shared Volume 'Volume5' ('VOLUME NAME') is no longer available on this node because of 'STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)'. All I/O will temporarily be queued until a path to the volume is reestablished.
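
For anyone wanting to compare notes across nodes, here is a minimal sketch of pulling the volume, friendly name and status code out of the 5120 message text quoted above. It assumes the message has already been exported as plain text (how you export it is up to you), and the function name `parse_csv_5120` is just an illustrative choice, not part of any Microsoft tooling.

```python
import re

# Pattern for the event 5120 message text as quoted in this thread.
CSV_5120 = re.compile(
    r"Cluster Shared Volume '(?P<volume>[^']+)' \('(?P<name>[^']+)'\) "
    r"is no longer available on this node because of "
    r"'(?P<status>[A-Z_]+)\((?P<code>[0-9a-fA-F]+)\)'"
)

def parse_csv_5120(message):
    """Return a dict with volume, name, status and code, or None if no match."""
    match = CSV_5120.search(message)
    if match is None:
        return None
    return match.groupdict()

# The message from the post above, used as sample input.
sample = (
    "Cluster Shared Volume 'Volume5' ('VOLUME NAME') is no longer available "
    "on this node because of 'STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)'. "
    "All I/O will temporarily be queued until a path to the volume is reestablished."
)
print(parse_csv_5120(sample))
```

Grouping the parsed results by volume and by node should make it easier to see whether one CSV or one host is responsible for most of the pauses.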

Is anyone else using SP1 beta to successfully back up a 2012 Hyper-V cluster?

Is anyone seeing the same problem?

Is it likely that this is a problem with SP1 beta, and will it be fixed at RTM?

Any suggestions for a stopgap solution?

I think I might try setting up a test physical DPM server to check the issue isn't in some way related to the fact that the DPM server sits on the same cluster it's backing up. I'm also happy to consider that the problem could lie elsewhere, i.e. with the SAN storage (this was upgraded from v6 to v7 at the same time as the 2012 upgrade), but as soon as I tell the vendor that the problem relates to running a beta of DPM they will be pointing fingers at that.

Thanks,

Tim

Moved by Mike Jacquet (Microsoft employee, Moderator) Friday, November 23, 2012 4:09 AM (From: Data Protection Manager - General)

November 22nd, 2012 1:02pm

That would be most appreciated Mike, thanks very much.

Proposed as answer by cciuleanu Wednesday, February 06, 2013 12:29 PM

November 26th, 2012 2:21pm

I'm replying to this thread because

a) It's one of only two threads on the whole internet that mention "STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR"
b) I'm getting the same symptoms.

I can confirm that one of my nodes in my Hyper-V 2012 cluster recently experienced the following event:

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 5120
Logged: 02/12/2012 18:01:30

Details: Cluster Shared Volume 'Volume2' ('ClusterStorage Volume 2') is no longer available on this node because of 'STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)'. All I/O will temporarily be queued until a path to the volume is reestablished.

I can also confirm that I am using DPM 2012 SP1 Beta to back up this cluster. I have been running this environment for quite some time now, and I can confirm that I've received 15 of these kinds of events (14 of which I was completely oblivious to). What prompted me to do research this time is that I discovered that 3 of my virtual machines were in a paused state and were not available. My other node (two node cluster) has had 5 of these events.

As this is already in the hands of Microsoft I won't log a call but will follow this thread. If there is any further information I can provide please ask.

Oh yes, my primary storage is Fibre Channel, so it's not an iSCSI problem.

Edited by LesterClayton Monday, December 03, 2012 8:11 AM

December 3rd, 2012 8:11am

I've found a workaround for this I thought I'd share. It's a little convoluted but if like me not having your servers backed up was giving you sleepless nights, it might be worth it.

Server 2012 introduces Hyper-V Replica allowing you to push an offline copy of your VMs to a remote server / site for DR purposes. This works from cluster to standalone. It's pretty simple to set up. You need a server with Hyper-V role installed to host the replicas.

Once your replicas are set up you can use DPM to back up the replicas. The replica VMs are normally turned off anyway, so if backups do cause brief disk glitches it isn't going to interrupt any important services. My guess is this is a cluster-related issue anyhow, so having the replicas on a standalone machine removes that issue.

HV Replica does allow hourly snapshots of the replicas, but it seems that it's not possible to change their frequency, so this isn't an efficient way of providing a decent retention time. For some reason, when I tried it, DPM would only see the replicas to back up if snapshots were turned off.

I've only set this up today, so can't comment on the long term reliability, but so far so good.

Tim

Edited by TimBoothby Friday, December 07, 2012 12:38 PM Typo

December 7th, 2012 12:36pm

I'm having the same issue, also on 2 clusters:

cluster1 4x HP ML330 G6, 2x 8 Gbit FC Switch, HP P2000 G3

cluster2 (testing) 2x HP ML110 G6, directly connected via 4 gbit FC to HP P2000 G3

Sometimes a LUN disappears or becomes inaccessible (and I have to switch maintenance mode on and off for that LUN); sometimes VMs on the affected Hyper-V host pause.

Both clusters have problems with backup and I see these events.

Cluster Shared Volume 'Volume6' ('HyperV Data 6') is no longer available on this node because of 'STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Edited by Martin Poisel Thursday, January 03, 2013 11:10 PM

January 3rd, 2013 11:07pm

Thanks Rich - you've put your finger on it. I've done the update and turned off ODX. I've done a number of backups successfully - there seemed to be some brief glitches in the availability of some of the VMs, but nothing crashed.

Then all the VMs on one of the nodes started flashing critical messages, shutting down, rebooting, migrating to other hosts etc. Looking into it, the host seemed to be out of memory, even with most of the guests offline. As with Rich, this node was the storage owner. Cancelling the in-progress backups immediately freed up the RAM.

I agree with Rich's diagnosis - severe memory leak.

Edited by TimBoothby Tuesday, January 15, 2013 4:32 PM

January 15th, 2013 4:23pm

We have seen this, too. Within our monitoring software, SQL Sentry, we noted that the memory ballooning is tied to the file cache - which I'm guessing is related to the shadow copy/vds/vss stuff. We have just installed the released hotfix and we're working to see if the stability issues are resolved, which were a much bigger deal for us...and have left me sleep deprived.

I've pasted a screenshot below from SQL Sentry Performance Advisor. It shows the memory peaks with each VM being backed up on that host.

One other thing - I had the host exhaust its memory when the pagefile was set to 4GB, but have since changed it to allow WS2012 to do whatever (system managed). Not sure if that has helped, but can't say it has hurt either. The host has 96GB and in VMM I had reserved 6GB...but when DPM kicked off, it just mowed right over all of it. Sigh.

Edited by MarkLarma Tuesday, January 15, 2013 11:20 PM

January 15th, 2013 9:07pm

After installing KB2799728, I got this console error (on all servers where I applied the KB). I can manage my clusters only remotely, from a server without KB2799728.

I can also confirm the memory leak when backup runs.

Edited by Martin Poisel Thursday, January 17, 2013 11:31 AM

January 17th, 2013 11:15am

Is it possible to run 2012 VMs in a 2008 R2 Hyper-V cluster? I'm starting to think about reinstalling my Hyper-V servers and configuring a new 2008 R2 cluster.

If your VMs are based on VHD (not VHDX), it's easy to migrate back to 2008 R2. Even if the VM configuration is unreadable, just create a new VM and attach the necessary VHD files to it.

Edited by AndricoRus Tuesday, January 22, 2013 12:11 PM

January 22nd, 2013 12:10pm

I too am having the memory leak issue, to the point that it crashes the host and all the VMs on that host save critical and jump ship. Very frustrating. I have applied KB2799728 and am now waiting on whatever the latest fix to this fiasco will be.

Edited by Seth H. _ Thursday, January 24, 2013 9:01 PM

January 24th, 2013 9:00pm

6 node cluster running 88 VM's with iSCSI Storage on HP Lefthand with production workloads!

Everything fine until we migrated heavier workloads to the cluster.

Then.... we experienced the paused VM issue back in December.
Then.... we applied the patch a few weeks ago and had the CSV IO Timeout issues every other day.
Then.... we Disabled ODX yesterday and now have the memory leak issue.

Server 2012 Hyper-V 3.0 has become a nightmare to administer with these problems.

Come on Microsoft we need this memory leak fixed!!!

Edited by TrevorBaker1979 Wednesday, January 30, 2013 5:14 PM

January 30th, 2013 5:13pm

I should have clarified; I was specifically asking about the question where the Hyper-V instances are running within the CSV, but the destination of the backup is not. Regardless, it sounds like this is a very serious problem and I can only imagine how frustrating it might be.

Hi RJMPhD, sorry for misunderstanding what you were asking. In our environment our DPM server is one of the few standalone physical servers, with a directly attached SAS array where it stores the backups. So in our case yes, the Hyper-V instances are running within the CSV volumes but are being backed up to a destination that is outside of the CSV.

Edited by HorusCG Thursday, February 07, 2013 4:31 PM

February 7th, 2013 4:31pm

I have already exported and converted all VMs, re-installed the Hyper-V cluster with 2008 R2 and re-configured everything. Just imported the last VMs. Now installing DPM again to run backups. Hopefully a lot better than on 2012.

I know it is easy to complain, but I think Windows Server 2012 with Hyper-V will be great once they have fixed all the problems. I also want to say that I will never again be first to try out new MS products. I will wait about 6-12 months before trying.

Br
Patrik

Edited by boje_ Friday, February 08, 2013 11:13 AM

February 8th, 2013 11:12am

http://support.microsoft.com/kb/2813630

Proposed as answer by Aaron M Marks Saturday, February 16, 2013 8:41 PM

February 16th, 2013 11:24am

Also in the same boat with these errors, DPM 2012 SP1 UR2 + Windows 2012, 10 node cluster using CSV.

We are currently migrating from a 2008 R2 cluster to 2012, so this is quite scary. Already had to fix 2 VMs which couldn't start.

Edited by -DeNMaN- Friday, May 03, 2013 1:02 AM

April 23rd, 2013 3:19am

Hi Paul,

Have a look at this article. The hotfix was released today and it seems to solve the problems. I've already installed the patch via CAU and have not received any errors so far.

http://support.microsoft.com/kb/2838669

Let's hope MS has finally got it now.

I'll update you if I receive any errors.

Edited by Hummeldum Wednesday, May 15, 2013 10:12 AM Forget to paste Link ;)

May 15th, 2013 10:12am

I was encountering two of the issues described in KB2838669.

Before this KB, I was getting Failover Clustering timeout errors once a week when my DPM started its snapshots.
Yesterday I installed this KB on one of my nodes, and things went wrong: I got Failover Clustering errors 8 times in only 5 hours, starting from the beginning of my DPM snapshots. Worst of all, every virtual machine hosted on this node crashed (which was not the case with the Failover Clustering errors I had before).

The weirdest thing? All my DPM snapshots were successful anyway!

So the result of the KB? I shouldn't have installed it :/

I'm running my nodes on Win Srv 2012, and my DPM server is running DPM 2012 SP1. The only hotfix I had installed before on my Hyper-V hosts is KB2813630.

Edited by tena6ous Thursday, May 16, 2013 7:28 AM

May 16th, 2013 7:25am

I'm experiencing a memory leak that I think is related to this thread, but I would like some feedback on what others are experiencing. I have a 2 node Hyper-V 2012 cluster (full install) and I'm using DPM 2012 SP1 to back it up. On the node that owns the CSV, there is an increase in memory that seems to coincide with my backups for time and amount of data transferred. For large backups like Exchange, this fills up the server's memory and will crash the cluster if left alone. The memory does not become available after the backups complete. If I change the owner node on the CSV, the memory clears up immediately and I can even move the CSV back without issue.

There may also be a small memory leak that is not related to the backup times, but dissipates when I change the CSV owner.

I've installed all available updates on the two host servers (including those released yesterday) as well as hotfixes KB2813630-v2 and KB2838669. I've also disabled ODX and serialized the backups.

I'm not seeing related errors in Failover Cluster Manager, but I'm watching the servers like a hawk and changing the CSV owner node as needed to clear up the memory leak.
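
Rather than watching by eye, the "memory keeps falling and never comes back" pattern described above could be flagged automatically. The following is only a rough sketch: it assumes you already collect periodic samples of available memory (in MB) on the CSV owner node from whatever monitoring you use, and the function name, window size and drop threshold are made-up values for illustration, not recommendations.

```python
def leak_suspected(available_mb, min_drop_mb=1024, window=6):
    """Return True if available memory fell on every step of the last
    `window` samples and dropped by at least `min_drop_mb` overall."""
    recent = available_mb[-window:]
    if len(recent) < window:
        return False  # not enough samples to judge yet
    # Every sample must be lower than the one before it.
    falling = all(later < earlier for earlier, later in zip(recent, recent[1:]))
    return falling and (recent[0] - recent[-1]) >= min_drop_mb

# Hypothetical samples taken during a backup window.
steady_decline = [9000, 8000, 7000, 6000, 5000, 4000]
normal_noise = [9000, 8000, 8100, 6000, 5000, 4000]
print(leak_suspected(steady_decline))
print(leak_suspected(normal_noise))
```

A check like this only tells you when to look; moving the CSV owner (or whatever else clears the memory in your environment) is still a manual decision.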

My storage device is an EqualLogic PS6100X with the latest HIT Kit (4.5) installed.

Is this what others are experiencing? Any thoughts?

This thread has been very helpful and I've been following it very closely for the past week or so! Thank you all for your input! ^_^

We have almost the exact same setup and problem. Windows 2012 cluster, 7 nodes running primarily SQL VMs. Dell Blade servers and EqualLogic PS6110XV with HIT Kit 4.5. We have installed all the hotfixes including KB2838669, and disabled ODX as well. A backup job triggers the memory leak, but not all the time. Using RAMMap we can see the VMs show up and never release the memory. We will sometimes max out 256GB of memory in hours. When I move the CSV to another node the problem follows the CSV. My only fix is to reboot the node having the problem and then move the CSV back. We have put in 80 hours with MS so far on this.

We are using Veeam instead of DPM.

This morning in Veeam I disabled the Dell EqualLogic VSS HW provider and am now only using the MS CSV Shadow Copy provider. I am going to see if that helps.

Edited by awinstead Friday, May 17, 2013 2:35 PM

May 17th, 2013 2:34pm

Hello all,

I also have lots of issues with CSVs and with DPM.

At first I thought the CSV hangs were caused by the agent, so I removed it and applied all existing patches. Now the CSV is stable (FC LUN zoning was wrong and only half the hosts were able to contact the LUN directly; the others were redirecting over the cluster network, but nothing pointed that out, not even Test-Cluster, which showed fully green results for the cluster disks). I reinstalled the agent and the issue came back with VM backups, but with no more paused CSVs.

I opened a call with Microsoft support, who asked me to apply these patches using the LDR branch (QFE):

http://support.microsoft.com/kb/2838669/EN-US

http://support.microsoft.com/kb/2795944/EN-US

http://support.microsoft.com/kb/2837407/EN-US (?).

For installing the LDR: http://social.technet.microsoft.com/wiki/contents/articles/3323.how-to-forcibly-install-the-ldr-branch-from-a-particular-hotfix-package.aspx

Didn't have time to apply the LDR branch yet (should have been done with CAU hotfix plugin, but actually, the file version is from GDR and not LDR).

Edit: BTW, this is off topic, but do you also get the VMM service crashing when configuring VMM continuous protection in DPM?
(Set-DPMGlobalProperty -KnownVMMServers vmmserver01.sogeti.local + DPM-VMM Helper Service configuration)

Guillaume

Edited by Guigui38 Friday, May 24, 2013 10:47 AM

May 24th, 2013 10:39am

It's a bit early to say, but my testing seems to show that my memory problems may be tied to dynamic volumes. I had major memory leaks every night when my system state backups kicked off (agent installed within VM guest) that corresponded to the amount of data being backed up. I created fixed size volumes on my EqualLogic SAN and moved the biggest offenders over; so far I've not encountered this memory leak again.

I do see other, slower memory leaks throughout the day on different VMs. When I move my two dynamic volumes from one host to the other, the memory frees up immediately. I do not seem to have this issue with VMs on the fixed volumes.

After reading Stefan's post, I decided to read up a bit on TRIM. That's when I got the idea that the problem could be a sort of conflict between TRIM and dynamic volumes. I can't say for sure, but things are starting to look up for me. If I can stabilize everything using fixed volumes, I might even be bold enough to try re-enabling ODX and non-serialized backups.

Here's hoping my luck's changed!

Proposed as answer by JeanLouis Wednesday, May 29, 2013 5:20 PM
Unproposed as answer by JeanLouis Wednesday, May 29, 2013 5:21 PM

May 29th, 2013 1:43am

Dell MD3620i using iSCSI here

Thinking of reinstalling the OS on the host computers without Dell MPIO drivers myself.

Edited by brock_paul Thursday, May 30, 2013 5:17 PM

May 30th, 2013 5:07pm

Hello all,

The latest hotfix in the "backup VM on CSV" saga is http://support.microsoft.com/kb/2870270/en-us - Update that improves cloud service provider resiliency in Windows Server 2012. It supersedes KB2848344 and any previously released hotfixes (KB2838669, KB2813630, KB2790728, etc.) on this issue.

I suggest also http://support.microsoft.com/kb/2869923/en-us - Physical Disk resource move during the backup of a Cluster Shared Volume (CSV) may cause resource outage, strictly related to the same topic.

For anyone who is experiencing 5120 and 5217, have a look at this post: http://social.technet.microsoft.com/Forums/windowsserver/en-US/223eb499-53cd-4590-980a-4078d0b52bd3/statusclustercsvautopauseerror-not-fixed-with-kb2848344. As you can see, the MSFT guy says:

Seeing an Event 5120 with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR may be expected and can be safely ignored in most situations. It basically means that clustering knew of a software snapshot, but the software snapshot was deleted. So now clustering is resynchronizing its state on the view of the snapshots.

So in general, you should only be worried if you see lots of 5120s with an error code of STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR. That is a sign that clustering is in need of constantly resyncing its state for the snapshots.
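
To put a number on "lots of 5120s", one option is to count the events per node per day and only look closer when a count crosses some limit. This is a minimal sketch under stated assumptions: the events are supplied as (node, timestamp) pairs you have already extracted from the System log, the node names are made up, and the threshold of 5 is an arbitrary placeholder rather than a figure from Microsoft.

```python
from collections import Counter
from datetime import datetime

def noisy_days(events, threshold=5):
    """Count event 5120 occurrences per (node, day) and return only the
    combinations that reach `threshold` (an arbitrary, adjustable limit)."""
    counts = Counter((node, when.date()) for node, when in events)
    return {key: n for key, n in counts.items() if n >= threshold}

# Hypothetical sample data: six events on NODE1 and one on NODE2 in one night.
events = [("NODE1", datetime(2013, 7, 15, 1, minute)) for minute in range(6)]
events.append(("NODE2", datetime(2013, 7, 15, 1, 30)))

for (node, day), n in sorted(noisy_days(events).items()):
    print(f"{node} {day}: {n} x event 5120")
```

An occasional single event per backup would stay below the limit and can be ignored as described above, while a node that keeps resyncing would stand out.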

Proposed as answer by arndawg Thursday, August 08, 2013 12:25 PM

July 16th, 2013 8:31am

This problem seems to have flared up again for me over the last few weeks. I can't put my finger on what has changed; it did seem to resolve itself for a while after the May 2013 hotfix.

Anyway - there is another hotfix which might be relevant to some people although it seems to tackle a specific issue where the guest VM crashes at backup time if it has many snapshots.

http://support.microsoft.com/kb/2908415/en-us

Edited by TimBoothby Thursday, December 12, 2013 1:32 PM

December 12th, 2013 1:11pm

This issue has cropped up for me again as well. Thought I had it sorted middle of last year.

I noticed that ODX was enabled again on our HyperV servers. Disabling this had resolved the issue previously, so I have disabled again and installed the latest hotfix.

Not sure how ODX could have been re-enabled - Only things I can think of are either via a Windows Update, or when installing the DPM 2012 R2 agent.

Will need to monitor the backups for a week or so before I am confident that it is resolved again.

February 10th, 2014 1:40am

Are you running FEP on your Hyper-V hosts? We have found that FEP causes Hyper-V issues if the DPMRA.exe agent is not excluded.

June 24th, 2014 8:08pm

Digging up an old thread here, but does anyone know if these issues were fixed with Server 2012 R2 and DPM 2012 R2? I had the same issues with Server 2012 and DPM 2012 R2. Raised a case with Pro Support, installed a load of hotfixes, disabled ODX, removed hardware VSS but still had the issue. Ended up scrapping the VHD backups in the end as it was proving a nightmare to manage, and resulting in a lot of sleepless nights!

January 12th, 2015 5:03pm

Hi Tim,

Did you manage to get any kind of solution for this?

Regards,

Pankaj Singh

June 30th, 2015 12:22pm

Hi Mike,

Did you manage to get any kind of solution for this?

Regards,

Pankaj Singh

June 30th, 2015 12:23pm

Hi Mark,

Did you manage to get any kind of solution for this?

Regards,
Pankaj Singh

June 30th, 2015 12:27pm

This topic is archived. No further replies will be accepted.