Backup of Hyper-V 2012 CSV Intermittently Fails with Error 0x80042301
We have a 5 node cluster. All nodes are running fully patched versions of Windows Server 2012 Datacenter (including hotfixes KB2813630 and KB2796995). Storage is EqualLogic running firmware 6.0.2. All nodes have EqualLogic HIT 4.5 installed
and we are using the hardware provider. We have two 3TB thin provisioned CSVs set up. One is not in use. The other currently contains the first 14 VMs that have been moved from our existing stand-alone Windows Server 2008 R2 SP1 Hyper-V servers.
Only 5 of the 14 VMs are being backed up. Protection was stopped and started for the move and the required consistency check was performed after the move. The DPM server is a physical server running SCDPM 2012 SP1 RU2. All Hyper-V servers
have had their agent updated after RU2. The SCDPM server only has a single protection group set up for all Hyper-V servers (legacy 2008 R2 servers and 2012 cluster). All backups are succeeding on the legacy servers which are running the same EqualLogic
HIT version and are storing their VMs on the same SAN. Overnight, some backups will fail and others will succeed. When I fix them up the next day, they will sometimes fail as well even if I tell it to resume backups on one VM at a time. I
can see the hardware snapshots being created on the SAN. The SAN doesn't report any errors. SCDPM fails and reports the following:
Type: Recovery point
Status: Failed
Description: The VSS application writer or the VSS provider is in a bad state. Either it was already in a bad state or it entered a bad state during the current operation. (ID 30111 Details: VssError:A function call was made when the object was in an incorrect state for that function (0x80042301))
More information
End time: 4/23/2013 3:37:09 PM
Start time: 4/23/2013 3:34:44 PM
Time elapsed: 00:02:25
Data transferred: 0 MB
Cluster node xxxxx.xxxx.xxx
Recovery Point Type Express Full
Source details: \Backup Using Child Partition Snapshot\vm1
Protection group: Hyper-V VMs - Daily
It leaves the Microsoft Hyper-V VSS Writer in a failed state with a Timed Out error. All other VSS writers are fine. I am also intermittently seeing the following in the Application log on some nodes, only when backups fail:
Event: 12363
Source: VSS
An expected hidden volume arrival did not complete because this LUN was not detected.
LUN ID {350f0b61-0244-4708-abab-a413fb710e7b}
Version 0x0000000000000001
Device Type 0x0000000000000000
Device TypeModifier 0x0000000000000000
Command Queueing 0x0000000000000001
Bus Type 0x0000000000000009
Vendor Id EQLOGIC
Product Id 100E-00
Product Revision 6.0
Serial Number 6090A0881074D4686E17059B9F4365CA
Storage Identifiers
Version 16
Identifier Count 2
Identifier 0
CodeSet "VDSStorageIdCodeSetBinary" (1)
Type "VDSStorageIdTypeFCPHName" (3)
Byte Count 16
60 90 A0 88 10 74 D4 68 6E 17 05 9B 9F 43 65 CA `....t.hn....Ce.
Identifier 1
CodeSet "VDSStorageIdCodeSetBinary" (1)
Type "VDSStorageIdTypeVendorSpecific" (0)
Byte Count 16
01 00 00 00 1F BF 0E 6A 00 00 00 3F 00 00 10 54 .......j...?...T
Operation:
Exposing Volumes
Locating shadow-copy LUNs
PostSnapshot Event
Executing Asynchronous Operation
Context:
Execution Context: Provider
Provider Name: Dell EqualLogic VSS HW Provider
Provider Version: 4.5.0
Provider ID: {d4689bdf-7b60-4f6e-9afb-2d13c01b12ea}
Current State: DoSnapshotSet
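One way to read the LUN details in Event 12363: the 16 bytes dumped under "Identifier 0" appear to be the same value as the "Serial Number" field, just printed byte by byte. The short Python sketch below checks that, using only the two values copied from the event text. The helper name `dump_to_hex` is made up for illustration, and whether the array's own management UI shows the snapshot under the same string is an assumption that would need checking.

```python
# Values copied from the Event 12363 text above.
serial_number = "6090A0881074D4686E17059B9F4365CA"
identifier_0_dump = "60 90 A0 88 10 74 D4 68 6E 17 05 9B 9F 43 65 CA"


def dump_to_hex(dump: str) -> str:
    """Join a space-separated hex dump into one uppercase hex string."""
    return bytes.fromhex(dump).hex().upper()


# True when the storage identifier and the serial number are the same 16 bytes.
print(dump_to_hex(identifier_0_dump) == serial_number)
```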
Event: 8194
Source: VSS
Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied. This is often caused by incorrect security settings in either the writer or requestor process.
Operation:
Gathering Writer Data
Context:
Writer Class Id: {e8132975-6f93-4464-a53e-1050253ae220}
Writer Name: System Writer
Writer Instance ID: {d70791b2-f0fe-416e-bbea-e631878ee313}
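For anyone comparing the two error codes in this post, here is a minimal Python sketch that splits an HRESULT into its standard bit fields (failure bit, facility, code). The function name `decode_hresult` is made up for illustration. Facility 4 is the interface-specific facility (where VSS defines its own codes such as the bad-state error), and facility 7 wraps a plain Win32 error, here code 5, access denied.

```python
def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into its standard bit fields."""
    return {
        "failure": bool((value >> 31) & 1),   # bit 31: severity (1 = failure)
        "facility": (value >> 16) & 0x7FF,    # bits 16-26: facility
        "code": value & 0xFFFF,               # bits 0-15: error code
    }


# The two codes that appear in the DPM alert and in Event 8194.
for hr in (0x80042301, 0x80070005):
    print(hex(hr), decode_hresult(hr))
```

Read this way, the two codes come from different layers: the first is a VSS-specific failure, the second is a generic access-denied result passed through from Win32.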
April 24th, 2013 1:11am
The error "A function call was made when the object was in an incorrect state" and the VSS Writer timing out seem to indicate we are having problems accessing the CSV during backups. The following registry settings allow you to make adjustments
to how DPM performs retries to claim the CSV in order to get reliable backups.
CsvMaxRetryAttempt - Adjust the maximum number of times (Default is 1) the DPM agent will attempt to claim the CSV volume. The value 0xC8 = 200 times.
CsvAttemptWaitTime - Adjusts the amount of time in milliseconds to wait between retry attempts. The value 0x2bf20 = 3 minutes.
To change the values for these registry settings follow the steps below.
1) Copy the following into Notepad, then save the file as csvretry.reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent\CSV]
"CsvMaxRetryAttempt"=dword:000000C8
"CsvAttemptWaitTime"=dword:0002bf20
2) Copy the csvretry.reg file to each node in the cluster.
3) Log on to each node in the cluster as an administrator, then right-click the csvretry.reg file and select "Open with", then the "Registry Editor" option, to import the registry settings.
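If you want different retry values, the hex in the .reg file is easy to get wrong. The Python sketch below builds the same file text from plain decimal numbers (200 attempts, 3 minutes in milliseconds). The function names `dword` and `build_csv_retry_reg` are made up for illustration; the key path and value names are the ones given above. It only produces text and does not touch the registry.

```python
def dword(value: int) -> str:
    """Format an integer the way .reg files write DWORD data."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("DWORD must fit in 32 bits")
    return f"dword:{value:08x}"


def build_csv_retry_reg(max_attempts: int, wait_ms: int) -> str:
    """Return the text of a csvretry.reg file for the given settings."""
    key = (r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft"
           r"\Microsoft Data Protection Manager\Agent\CSV")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"CsvMaxRetryAttempt"={dword(max_attempts)}',
        f'"CsvAttemptWaitTime"={dword(wait_ms)}',
        "",
    ]
    return "\r\n".join(lines)


# 200 attempts, 3 minutes between attempts (the values from the steps above).
print(build_csv_retry_reg(200, 3 * 60 * 1000))
```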
April 24th, 2013 1:55am
The values
"CsvMaxRetryAttempt"=dword:000000C8
"CsvAttemptWaitTime"=dword:0002bf20
were already set on all five nodes.
April 24th, 2013 6:25am
I am anxiously awaiting an answer. I almost have the exact same setup with the same problem. I went so far as to completely wipe my DPM Server and reload it from scratch in hopes of a fix.
April 25th, 2013 9:37pm
I did some testing today with disabling the EqualLogic hardware VSS provider and so far it seems to be working. That is not a solution. However, if my overnight backups succeed in one pass it points me in Dell's direction rather than Microsoft's.
Are you using EqualLogic as well Daves? This morning, I emailed a contact I have at Dell that is a System Center/virtualization/EqualLogic specialist that has been very helpful in the past. I asked him if this is a known issue and if there is any
additional configuration required to support DPM using hardware snapshots of CSVs on Server 2012 with EqualLogic storage. We'll see if he has any insights when he replies. Barring that I am going to open a case with entry level Dell support and
see where that takes me.
April 25th, 2013 10:07pm
We are using an EqualLogic SAN with the latest 4.5 HIT. With ASM installed, we have tried both with and without logging into the PS group. We tried disabling the hardware VSS provider and just going with the software provider. That made for a nice morning
the next day as it caused our cluster to crash.
Almost all of our VM backups show critical with the following:
The VSS application writer or the VSS provider is in a bad state. Either it was already in a bad state or it entered a bad state during the current operation. (ID 30111 Details: VssError:A function call was made when the object was in an incorrect state for that function (0x80042301))
Sometimes I can clear this message by manually creating a recovery point or doing a consistency check, one-by-one.
April 26th, 2013 4:07pm
Using the software provider didn't crash my cluster, but it has been randomly crashing some VMs (not just ones that I am backing up) since I turned it on. I am about to switch back to the hardware provider today. My contact at Dell hasn't responded
to my email yet, so I will probably start a support case with Dell today or tomorrow and see if they have any solutions. I just hope that I don't end up in a situation where Dell is pointing the finger at Microsoft and Microsoft is pointing back at Dell.
April 29th, 2013 5:39pm
Just another comment from someone else having what looks to be the same issue watching for an answer.
Windows 2012 Hyper-V Cluster. DPM 2012 SP1. Dell EqualLogic SAN on firmware 6.0.2. HV Hosts running Dell HIT 4.5
Direct DPM backup of VMs works. Backup of the VMs via cluster Child Partitions fails:
The VSS application writer or the VSS provider is in a bad state. Either it was already in a bad state or it entered a bad state during the current operation. (ID 30111 Details: VssError:A function call was made when the object was in an incorrect state for that function (0x80042301))
April 29th, 2013 6:55pm
Steve, what exactly do you mean by direct backup of VMs?
I am trying some things on my end today (somewhat at random). Are either of you integrating SCDPM with SCVMM as per
Protecting Hyper-V machines? Did you create a new protection group from scratch for your cluster, or did you re-use an existing one (perhaps one that was already protecting stand-alone hosts or hosts running a different version of Windows)?
April 29th, 2013 7:18pm
Not using SCVMM at all. My current workaround is just to treat the clustered VM as a standalone machine from DPM's point of view.
I used the same Protection Group I was using to try to do full VM backups through the cluster. But instead I installed the DPM agent on the VM (which I think is needed for granular recovery anyway when backing up through the cluster) and, rather than selecting
under Modify Protection Group
DomainName\Clustername\SCVMM VMName Resources\HyperV
I selected things from
DomainName\VMName\files/etc I want backed up
April 29th, 2013 7:26pm
We had our cluster crash again over the weekend. When the backup job started it ran ok for a couple of minutes then we had several servers go into a consistency check. This time we had all servers (4 hosts) logged in via HIT to the PS group with
all the services logged in. We did not have the DPM server logged in to HIT on purpose. Does DPM need direct access to the CSV in the iSCSI initiator? How are you guys running it? I would hate to install the DPM client on all the VMs. I am about to call
it quits on backing up the whole group and just make an individual job for each VM server. So infuriating.
April 29th, 2013 9:30pm
I think we have our backup situation worked out now. It is not the ideal solution but it will work. We took our single CSV hosting multiple VMs and broke it into 4 separate CSVs. We then applied the serial backup registry mod and XML file.
We now have several backup jobs instead of one, with each job set to back up one node in each CSV. Last night I was able to back up without errors.
April 30th, 2013 6:26pm
I don't want to burst your bubble, but the first night I switched to software VSS backups it worked perfectly. Then the next night it all fell apart. Hopefully you have more success.
April 30th, 2013 6:54pm
I would like to jump in as well.
I have a 6 Node Hyper-V cluster. All servers are running 2012 Standard, they are connected to an Equallogic array running v6.0.2, all servers are running HIT 4.5.0.6492. Currently I have 25 VMs running on my cluster.
I am running DPM 2012 Build 4.1.3408.0.
Some nights most of the machines seem to backup fine, other nights I am left with almost half that did not backup correctly. If they do not backup correctly I can often run the job again and have them work. Other times they won't work until I
move the VMs off that host, reboot it and then move them back.
For errors, I have seen a number of times that after running "vssadmin list writers" the Microsoft Hyper-V VSS Writer has a state of [10] Failed with a Timed out error. I have also seen [7], but I don't have that up right now so I'll have to grab the
details of that one again.
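When checking several nodes, it can help to filter the "vssadmin list writers" output down to the writers that report an error. Below is a minimal Python sketch that does this on captured text. The `SAMPLE` text is hand-written for illustration and only modeled on the usual layout of that command's output; the function name `failed_writers` is also made up. On a real node you would feed it the saved output of the command instead.

```python
import re

# Hand-written sample, modeled on the usual "vssadmin list writers" layout.
SAMPLE = """\
Writer name: 'System Writer'
   Writer Id: {e8132975-6f93-4464-a53e-1050253ae220}
   State: [1] Stable
   Last error: No error

Writer name: 'Microsoft Hyper-V VSS Writer'
   State: [10] Failed
   Last error: Timed out
"""


def failed_writers(text: str) -> list:
    """Return (writer name, state, last error) for writers reporting an error."""
    results = []
    name, state = None, ""
    for raw in text.splitlines():
        line = raw.strip()
        if m := re.match(r"Writer name: '(.+)'", line):
            name, state = m.group(1), ""
        elif m := re.match(r"State: \[\d+\] (.+)", line):
            state = m.group(1)
        elif m := re.match(r"Last error: (.+)", line):
            if name and m.group(1) != "No error":
                results.append((name, state, m.group(1)))
    return results


print(failed_writers(SAMPLE))
```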
Watching on the EqualLogic I see some different results. There are times I can see the snapshot created, set online and logged into. I see no data transferred by DPM. Then I see logout requests received from the initiator. The
snapshot is then deleted. After that I see another login request, which fails since the snapshot has been deleted.
I have also seen it not create the snapshot, and just the job eventually fail on dpm.
I'm not sure what is going on, so any ideas that anyone else finds would be great. If there is any more information that would be helpful please let me know.
I have previously seen machines get in a state where backups would fail, when I would check in the Failover Cluster Manager interface the machine would be listed as Running (Locked), then when I would check in Hyper-V console it would say the machine was
being backed up. When looking in DPM there were no jobs running backing the machine up. I would then have to power down the VM and reboot the host. I couldn't figure out how to unlock the machine so I couldn't migrate it etc. Once I
rebooted I could backup the machine.
I don't see the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent\CSV; the CSV key does not exist. So I'm not sure if I need to create it and then the values below it.
As I try more or learn anything I'll update this with more information.
Thanks for any help.
May 2nd, 2013 11:22pm
Eric, you may need to create that registry key. I don't know for sure as I am setting it via GPP so it is created automatically for me.
I have temporarily suspended my backups. We added a few more VMs and now VMs will sometimes go offline during the backup (even ones that we don't backup at all) which was unacceptable. I have opened a support case with Dell to see if they can
figure it out (Dell support = Free, Microsoft support = pay unless they admit that it is a bug on their end). So far Dell support has just asked me to provide logs, so we'll see where it goes from there.
May 3rd, 2013 3:51pm
I've been seeing the same errors when backing up using DPM 2012 SP1.
My environment:
- 3 Hyper-V Node Cluster - Windows Server 2012 with Equallogic HIT 4.5.0
- 3 CSV Volumes for my VMs
- 23 Virtual Machines (Linux, Windows 2003, Windows 2008 R2, Windows 2012)
- 1 Dell Equallogic PS4000 Array
- DPM 2012 SP1
My initial attempt to back up my virtual machines resulted in one of my Hyper-V nodes getting hung up. It appears that it ran out of memory due to a memory leak. After installing the hotfix from Microsoft (KB 2813630) that addresses known issues with CSV backup,
I was able to have more success in my backups. However I'm still getting Event: 12363 "An expected hidden volume arrival did not complete because this LUN was not detected." from time to time. There also still seems to be a memory leak on the Hyper-V
node that is holding the CSV volume.
I'm also seeing the following errors:
Hyper-V-VMMS - Event ID: 19050
'vm-name' failed to perform the operation. The virtual machine is not in a valid state to perform the operation.
Hyper-V-VMMS - Event ID: 16010
The operation failed.
VSS - Event ID: 8194
Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied. This is often caused by incorrect security settings in either the writer or requestor process.
Operation:
Gathering Writer Data
Context:
Writer Class Id: {e8132975-6f93-4464-a53e-1050253ae220}
Writer Name: System Writer
Writer Instance ID: {830e49ca-131e-499f-b35b-73a6b4b0ded4}
FilterManager - Event ID 3
Filter Manager failed to attach to volume '\Device\HarddiskVolume65'. This volume will be unavailable for filtering until a reboot. The final status was 0xC03A001C.
volsnap - Event ID: 27
The shadow copies of volume \\?\Volume{04ef607f-b7f6-11e2-93fa-de1da79be6cb} were aborted during detection because a critical control file could not be opened.
It seems to me that my problems are a combination of the Dell VSS provider which is causing Event 12363 and Microsoft bugs which are causing memory leaks.
See http://social.technet.microsoft.com/Forums/en-US/dpmhypervbackup/thread/604409df-ada1-47d1-bdfb-3f938cde0b59
http://up2v.nl/2013/03/12/storage-issues-on-windows-server-2012-hyper-v-microsoft-struggling-to-fix/
May 11th, 2013 12:33am
As part of the testing Dell is having me do, they had me disable the EqualLogic hardware VSS provider using the command:
"C:\Program Files\EqualLogic\bin\eqlvss" /unregserver (it can be undone via "C:\Program Files\EqualLogic\bin\eqlvss" /regserver)
Since doing that on Sunday, I haven't had a single SCDPM backup failure. I also changed the max allowed parallel backups from 3 to 1, but I just switched it back to 3 so we will see how it goes tonight. Obviously this isn't a fix, but it may
work as a band-aid for everyone in the short term. If I get anything new from Dell I'll make sure to post it.
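To confirm on each node whether the hardware provider is currently registered, one option is to look for its provider ID in the output of "vssadmin list providers". The Python sketch below does that on captured text. The provider name and ID are the ones shown in the Event 12363 details earlier in this thread; the `SAMPLE` text is hand-written for illustration and only modeled on the usual layout of that command, and the function names are made up.

```python
import re

# Provider ID taken from the Event 12363 details earlier in the thread.
EQL_PROVIDER_ID = "{d4689bdf-7b60-4f6e-9afb-2d13c01b12ea}"

# Hand-written sample, modeled on the usual "vssadmin list providers" layout.
SAMPLE = """\
Provider name: 'Microsoft Software Shadow Copy provider 1.0'
   Provider type: System
   Provider Id: {b5946137-7b9f-4925-af80-51abd60b20d5}
   Version: 1.0.0.7

Provider name: 'Dell EqualLogic VSS HW Provider'
   Provider type: Hardware
   Provider Id: {d4689bdf-7b60-4f6e-9afb-2d13c01b12ea}
   Version: 4.5.0
"""


def provider_ids(text: str) -> set:
    """Collect every provider ID listed in the captured output."""
    return {pid.lower() for pid in re.findall(r"Provider Id: (\{[0-9a-fA-F-]+\})", text)}


def hardware_provider_registered(text: str) -> bool:
    """True when the EqualLogic hardware provider ID appears in the output."""
    return EQL_PROVIDER_ID in provider_ids(text)


print(hardware_provider_registered(SAMPLE))
```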
May 15th, 2013 9:36pm
I can confirm that disabling the EQL VSS provider resolves some issues with CSV backup.
May 16th, 2013 4:06pm
Same here.
May 16th, 2013 4:21pm
Hi Guys,
I'm having exactly the same issue.
I'd like to find a solution that allows me to still use the hardware vss provider. If you disable it won't your backups run in serial and send your SAN into redirected access mode? The backups will work but your guest servers will take a
BIG performance hit.
Anyone found this?
Cheers,
Jon
May 16th, 2013 5:17pm
Windows 2012 CSV doesn't do redirected mode anymore
May 16th, 2013 5:19pm
ah ok... surely using software vss would still cause performance issues and also make backups take a lot longer?
Do you think this is an issue with the dell HW VSS? or a possible configuration error?
May 16th, 2013 5:24pm
Marcus is right, the penalty for using the software VSS provider in 2012 is substantially reduced (no redirected IO mode and parallel backups are supported). However, I agree with you Jon. I still want to use the hardware provider. I'm
not sure what the penalty for using software vs hardware is anymore, but I still want to use hardware. In fact, I just emailed my case manager at Dell and told him the same thing.
As for the cause, it could be a bug or a misconfiguration (or both). If it is a misconfiguration, it must not be something they have documented as all of us are having the issue and no one has found a solution at the present time.
May 16th, 2013 5:28pm
squeee! Fixed it!
Go into Dell ASM on your hosts, Settings, MPIO, Uncheck 'Use MPIO for snapshots'
Done. Hardware vss provider working. :D
May 16th, 2013 5:37pm
Thanks Jon! That is definitely progress, and I have forwarded that info on to Dell. I would still say that it isn't fully fixed until you don't have to uncheck that checkbox, but it is definitely a huge step in the right direction. I
won't be able to test it until this evening, but I will let you know if I can reproduce your results.
May 16th, 2013 6:12pm
squeee! Fixed it!
Go into Dell ASM on your hosts, Settings, MPIO, Uncheck 'Use MPIO for snapshots'
Done. Hardware vss provider working. :D
I will give it a try tomorrow.
May 16th, 2013 7:03pm
Did this work for anyone besides Jon? It didn't fix it for me. Now my backups start, do nothing, and eventually time out.
May 24th, 2013 3:20pm
Disabling MPIO for snapshots didn't fix the problem for me.
I have since disabled the Equallogic hardware VSS provider. After rebooting the cluster my memory leaks have gone away and my backups are happening without any failures. Waiting for a fix so I can use the Equallogic hardware VSS provider.
May 24th, 2013 4:38pm
Disabling MPIO for snapshots had no effect for me either.
It still looks random to me which VM will be backed up and which one will not.
May 27th, 2013 9:15pm
I just tried a different approach and it looks promising, or at least it seems to get closer to the core of the issue.
I did a repair install of the HIT on all my hosts and reconfigured group access - I might have messed something up during my testing ...
- the hardware vss provider is enabled
- a domain account / local admin on the hosts is used as a service account
- snapshots are stored in a shared directory, MPIO is used for snapshots, etc.
- On the DPM server I changed the registry key 'MaxAllowedParallelBackups' from 3 to 1
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\2.0\Configuration\MaxAllowedParallelBackups]
"Microsoft Hyper-V"=dword:00000001
---
Now one host can create only one hardware snapshot at a time. The backup is working ... for now
There are several downsides to this setup.
- Since the location of a VM in a cluster is unpredictable, you might have all VMs from a protection group queued up on one host waiting to be backed up.
- If a queued VM is migrated to another host in the cluster, the backup will fail (even with VMM integration set up).
- Backups will take more time to finish, since one host can only back up one VM at a time.
---
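The queueing cost in the list above can be roughed out with a ceiling division. This is a deliberately crude Python sketch: it assumes the limit applies per host (as observed above), that every VM backup takes about the same time, and that no VM migrates while queued. The function name `backup_waves` and the VM count of 12 are made up for illustration.

```python
import math


def backup_waves(vms_on_host: int, max_parallel: int) -> int:
    """Number of back-to-back backup rounds one host needs for its VMs."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    return math.ceil(vms_on_host / max_parallel)


# Hypothetical worst case: 12 protected VMs all sitting on the same host.
for limit in (3, 1):
    print(f"limit={limit}: {backup_waves(12, limit)} rounds")
```

With these assumptions, dropping the limit from 3 to 1 triples the number of rounds on a fully loaded host, which matches the longer backup windows described above.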
From what I have observed and found in my EQL and host logs, it looks like there is a problem when one host is accessing multiple snapshots at the same time.
Multiple snapshots on the same CSV from different hosts don't seem to cause any problems.
May 28th, 2013 5:41pm
I have almost the exact same scenario as gregg79 and I'm seeing the same errors/issues. Please post when you find the solution.
May 30th, 2013 6:30pm
"I have almost the exact same scenario as gregg79 and I'm seeing the same errors/issues. Please post when you find the solution." +1
I opened a support case with Dell/EqualLogic for the error: Event: 12363 / Source: VSS
I opened a support case with Microsoft for the error:
Event: 8194 / Source: VSS. I think it's a DCOM security problem...
EqualLogic also asked me to disable the EqualLogic hardware VSS provider using the command:
"C:\Program Files\EqualLogic\bin\eqlvss" /unregserver
No backup failures since...
May 31st, 2013 2:29pm
Hello,
the issue might not be on the EQL side alone.
I have got two 2012 clusters in my lab, one with EQL HIT and the other one without. Both clusters are having problems with DPM backup. The only workaround I found so far is to set MaxAllowedParallelBackups to 1. This seems to get the backup running halfway
stable - at least I have had a 'success rate' of 85% completed backups over the weekend. Still annoying, but I haven't found a better solution yet.
You might also have a look at this thread:
http://social.technet.microsoft.com/Forums/de-DE/dpmhypervbackup/thread/604409df-ada1-47d1-bdfb-3f938cde0b59
June 4th, 2013 8:56am
Hi Dark Grant,
I'm in a French environment. And you? German or English?
"the issue might not be on the EQL side alone." -> I agree, so I called Dell/EqualLogic and Microsoft Support last week.
My EqualLogic support case is at level 2 for the moment.
I'm waiting for Microsoft to call me back...
About your thread, I never get the STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR (c0130021).
For me, http://support.microsoft.com/kb/2838669 contains every fix:
Csvflt.sys 6.2.9200.20682
Clussvc.exe 6.2.9200.20686
Csvfs.sys 6.2.9200.20686
Fssagent.dll 6.2.9200.20682
Kernelbase.dll 6.2.9200.20682
NTFS.sys 6.2.9200.20684
Rdbss.sys 6.2.9200.20685
Srv2.sys 6.2.9200.20685
Kernelbase.dll 6.2.9200.20685
I did not disable ODX.
I installed DPM 2012 SP1 UR2:
http://blogs.technet.com/b/dpm/archive/2013/04/11/update-rollup-2-for-system-center-2012-service-pack-1-dpm-updates.aspx
I'm using a 10Gb network.
I disabled the EqualLogic hardware provider on only 1 node and backup works on this one!
Except for the 8194 error, it is perhaps only an EQL error...
I'm using firmware 6.0.4. There is a fix for multi-host access and snapshots: "A snapshot of a volume with multi-host access enabled displayed the following error if multiple hosts attempted to simultaneously access the snapshot: Initiator cannot access this target
because an iSCSI session from another initiator already exists and multihost access is not enabled for this target. [Tracking #: 635738]". But it's not related...
In your lab:
- Can you upgrade the EqualLogic firmware?
- What's your DPM agent version?
June 4th, 2013 12:36pm
Microsoft called me back about the 8194 error.
It's a Windows 8 bug in the System Writer with DHCP, WINS and CSV???
They are investigating...
If you try to configure Snapshot on your hypervisor, you will see only your system drive.
June 4th, 2013 4:20pm
Hello Frederic,
I updated my EQL to 6.0.4 yesterday. Still got the same problems with the EQL VSS provider.
Disabling ODX didn't help either.
I have got RU2 for SC 2012 SP1 and KB2838669 installed. I double-checked the file versions yesterday.
I have got DPM agent version 4.1.3408.0 (DPMRA.exe)
For now I will stick with software VSS. I just uninstalled all HIT/ASM components, except the PowerShell module, MPIO and SMI
----
btw: I am from Germany
June 5th, 2013 4:28pm
I have a case open with Dell EqualLogic Support and a case open with Microsoft Product Support. My Dell support representative is telling me that all of the virtual machines need to have the v4.5 HIT Kit installed. There are no disks that are directly
connected to the guest OS using an iSCSI initiator. Has anyone else heard this from Dell support? The Dell documentation seems to indicate this is not the case.
Regards,
Kevin
-- Dell EqualLogic Documentation: --
According to the PDF documents you sent me, the HIT Kit is only required for guest virtual machines running on host servers running Server 2008 R2.
The following is from the HIT_Install_User_Guide_V4.5.pdf you sent me earlier.
Installing HIT in Each VM in a Windows Server 2008 R2 CSV Configuration
For Smart Copies of Virtual Machines to work correctly in a Windows Server 2008 R2 configuration using Cluster Shared Volumes (CSVs), the Host Integration Tools must be installed in each VM. This applies
to all such configurations, including Enterprise, Datacenter, Core or any other Microsoft Server release configuration, but only for Windows Server 2008 R2. This does not apply to CSVs with Windows Server 2012.
June 5th, 2013 5:37pm
Hi Kevin,
Yes, "This does not apply to CSVs with Windows Server 2012". You can read the HIT 4.5 release notes if you need more information :
Microsoft Windows 8 and Server 2012 Support
HIT/ME now supports Microsoft Windows 8 and Windows Server 2012. With Windows Server 2012, Smart Copies of the cluster shared volume (CSV) are now application consistent, not just file system consistent. You no longer need to install ASM/ME in the Hyper-V
VMs; you only need to install it on the cluster nodes.
VMs and volumes are now manageable from every cluster node (the icons are blue, not gray). You do not have to change the coordination node or move volumes. Therefore, the Move CSV action no longer applies to volumes in Windows Server 2012.
The following restrictions apply:
- You cannot create replica Smart Copies on either Windows 8 or Windows Server 2012.
- To perform a selective restore operation on Windows Server 2012, you must start the operation on the cluster node that owns the VM (but you do not need to move volumes).
June 5th, 2013 5:58pm
Hello Kevin,
I found a thread on the Dell forum recommending the same.
I will not even think about installing integration components in every VM. There are enough other vendors waiting for new customers.
June 5th, 2013 6:03pm
Hi Dark,
Can you send us the link to the Dell forum thread?
I read something about the XML for VSS on 2003 and 2008 not being the same...
Do you back up 2003 VMs?
Yesterday I got a 5142 cluster error: ERROR_TIMEOUT(1460) on the CSV!
This morning:
- I disabled the TRIM feature:
fsutil behavior set disabledeletenotify 1
- I disabled the ODX feature by modifying the registry value HKLM\System\CurrentControlSet\Control\FileSystem\FilterSupportedFeaturesMode.
- I unregistered the hardware VSS provider on all hosts
- I rebooted all machines
June 6th, 2013 11:40am
Hello,
I just got some promising results from my tests.
Since I have the ASM component removed from my servers, I haven't had any backup or cluster errors.
---
Cluster one
- first ran completely without the EQL tools, which gave me cluster errors 5120 and 5217
- has had no errors since I installed PS, MPIO and SMI from the HIT
- 18 hours / 23 VMs / >200 recovery points without errors. DPM is set to back up the VMs every hour between 8:00 and 22:00 - just to see how far I can push it and when it will break the cluster
---
Cluster two
- first ran with a full EQL HIT installation, which gave me failed backups on the DPM side - no errors in cluster manager
- has had no errors since I removed the ASM component from the HIT
- 24 hours / 60 VMs / >100 recovery points without errors using a regular backup schedule
---
The only HIT components I have installed are:
- PowerShell Tools
- MPIO DSM
- SMP
---
The next thing will be:
- activating ODX again - I disabled it yesterday
- Increase the 'MaxAllowedParallelBackups' - I have set it to 1 at the moment
- Setup a new cluster to make sure the positive trend is not a side effect of my testing and that the results are reproducible on a fresh installation
---
I will keep you updated on the progress of my testing
June 6th, 2013 11:51am
Some news:
Setting 'MaxAllowedParallelBackups' back to the default value of 3 immediately results in cluster errors 5120 and 5217 during the following backup cycle. Turning it back to 1 resolves the cluster errors.
ODX doesn't seem to play a role here. I reactivated ODX and the backups are still running fine.
June 6th, 2013 2:03pm
Some news from support:
Event: 12363 Source: VSS : An expected hidden volume arrival did not complete because this LUN was not detected.
It's a communication problem between the different services when the account used by ASM is different from the local System account.
The account used must be added to the Microsoft VSS provider.
The same error occurs on Windows 2003 with ASM 4.0: ASM could not launch a Smart Copy because VSS could not be used with the modified account in the ASM configuration.
For the moment the only "workaround" is to change the registry to add the account and give it rights to the writer and the requester.
Dell/EqualLogic support is setting up a DPM environment in order to pass as much information as possible back to the developers.
Event: 8194 Source: VSS : IVssWriterCallback interface. hr = 0x80070005, Access is denied
We reproduced the error this morning and sent every log to Microsoft Support (application log, iDNA traces of VSS and Cryptsvc, ProcMon). No news for the moment, but I think it's a DCOM security problem on non-English operating systems...
Event: 5120/5142/5217 Source : Cluster
No news for the moment...
June 6th, 2013 4:34pm
Some news from support:
Event: 12363 Source: VSS : An expected hidden volume arrival did not complete because this LUN was not detected.
No news for the moment, but we detected another problem on the PSM4110 passive controller... it will be corrected in the 6.0.5 firmware planned for the 15th.
Event: 8194 Source: VSS : IVssWriterCallback interface. hr = 0x80070005, Access is denied
This problem appears only in a cluster. The event is visible on nodes other than the one that initiated the call to VSS.
During a VSS call, the Cluster service sends requests to all nodes through the GUM (Global Update Manager).
Because the "System Writer" is hosted by the Cryptographic Services service (cryptsvc), which runs in the "Network Service" context instead of "System", the COM callbacks return Access Denied because of the different impersonation on the other cluster nodes.
The problem will not be fixed as it has no functional impact.
The events can be ignored.
Event: 5120/5142/5217 Source : Cluster
No news for the moment...
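To see how often these events actually occur, something like the following could be run on each cluster node. It is only a sketch: the one-day window is an arbitrary choice, and the provider names are the ones these events are normally logged under, so adjust them if your logs differ.

```powershell
$since = (Get-Date).AddDays(-1)

# VSS events 8194 and 12363 are written to the Application log.
Get-WinEvent -FilterHashtable @{
    LogName = 'Application'; ProviderName = 'VSS'; Id = 8194, 12363; StartTime = $since
} -ErrorAction SilentlyContinue |
    Group-Object Id | Select-Object Name, Count

# Cluster events 5120, 5142 and 5217 are written to the System log.
Get-WinEvent -FilterHashtable @{
    LogName = 'System'; ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id = 5120, 5142, 5217; StartTime = $since
} -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, MachineName, Message
```

Comparing the timestamps of the cluster events with the DPM job times shows whether they line up with backup windows.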
June 13th, 2013 3:35pm
New hotfix:
http://support.microsoft.com/kb/2848344
File           | Version
Csvflt.sys     | 6.2.9200.20712
Clussvc.exe    | 6.2.9200.20712
Csvfs.sys      | 6.2.9200.20712
Fssagent.dll   | 6.2.9200.20712
Kernelbase.dll | 6.2.9200.20712
NTFS.sys       | 6.2.9200.20712
Rdbss.sys      | 6.2.9200.20712
Srv2.sys       | 6.2.9200.20712
Witness.dll    | 6.2.9200.20712
June 17th, 2013 5:46pm
New EqualLogic firmware 6.0.5:
An issue that may have caused a passive controller to reboot spontaneously has been corrected to resolve the temporary effect on array redundancy. [Tracking #: 749774]
In rare occasions when using OffloadDataTransfers (ODX) with Windows 2012 initiators, a specific WriteUsingToken command could have generated an inappropriate response at the target that may have resulted in a controller failover. (See T10 specifications re: WriteUsingToken command) [Tracking #: 762035]
June 18th, 2013 12:57pm
DPM has run for several days without issue with 'MaxAllowedParallelBackups' set to 1. ("Without issue" means that the DPM backups are successful. There are still VSS EQL HW provider errors in the application event logs.) I have noticed that DPM works better if there is some available memory on the host. During the last round of tests I distributed the VMs across all of the hosts to allow for at least 5 GB of available memory on each host server.
Kevin
June 21st, 2013 1:18am
Hi
My environment is:
4 Hyper-V cluster nodes, Server 2012 Datacenter, without the hotfix from 6/14/2013
2 EqualLogic PS6010, firmware 6.0.2
HIT Kit 4.5
DPM 2012 SP1 Rollup 2
I have the same problems. The new hotfix (http://support.microsoft.com/kb/2848344) crashed my servers, so I removed it again. I found out that if you move the failed machine, the replication comes back to a successful state! And guys, check all your backup points. About 2 weeks ago I wanted to restore a machine; for over 3 months the replica had said everything was OK, but no recovery points had been created!!
Now I have enabled the option (Run a daily consistency check ...) for all protection groups!
And now I have a lot more replication errors. But by moving some machines I can get all replications to succeed. I think we have timeout problems here.
Another problem is the backup itself. In Hyper-V Manager you can see whether the backup is running, and some machines cannot stop this process. This causes an additional problem: you need to stop this machine, move all machines off this cluster server and restart the server, then it works again! You cannot see this in Cluster Manager.
All my SQL, Exchange and SharePoint backups are running fine. These are virtual machines. The problems are only with VMs backed up with the HIT Kit from Dell.
I am waiting for a hotfix from Dell and/or MS, and I am in contact with a Dell engineer.
Ren
June 24th, 2013 10:43am
Hello,
I applied EQL FW 6.0.5 and KB2848344 on Friday.
KB2848344 has resolved the issue with cluster error 5120 "STATUS_CLUSTER_CSV_AUTO_PAUSE_ERROR(c0130021)" on my system.
I still get cluster error 5217 when I set 'MaxAllowedParallelBackups' to anything other than 1, but it doesn't seem to have a negative impact on my environment.
---
KB2848344 has resolved some issues on the MS side. The EQL VSS provider still causes failed backups on my lab cluster.
I will stick with the Hyper-V VSS provider for now.
June 24th, 2013 11:36am
Hi
My MS consultant gave me a number of additional registry keys. I tried them and as of today it works!
If anyone wants to try this, here are the changes:
1. Increase the timeout period:
- Under "HKLM\Software\Microsoft\Microsoft Data Protection Manager\Agent"
- Add a DWORD value with name ConnectionNoActivityTimeoutForNonCCJobs
- Set it to 7200 decimal.
- Under "HKLM\Software\Microsoft\Microsoft Data Protection Manager\Agent"
- Add a DWORD value with name ConnectionNoActivityTimeout
- Set it to 7200 decimal.
2. Registry changes to increase the paged pool memory:
- Under HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\Memory Management
- Add the following registry values:
- Value name: PoolUsageMaximum
- Data type: REG_DWORD
- Radix: Decimal
- Value data: 60
- Setting the value to 60 tells the Memory Manager to start the trimming process at 60 percent of PagedPoolMax rather than the default setting of 80 percent. If a threshold of 60 percent is not enough to handle spikes in activity, reduce this setting to 50 percent or 40 percent.
- Value name: PagedPoolSize
- Data type: REG_DWORD
- Radix: Hex
- Value data: 0xFFFFFFFF
- Setting PagedPoolSize to 0xFFFFFFFF allocates the maximum paged pool in lieu of other resources to the computer.
3. Restart the dpmra service
Maybe it helps
Best Regards
Rndi
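Steps 1 and 3 above could be scripted roughly as follows. This is only a sketch of the listed changes, not something the consultant provided; it assumes the DPM agent service is registered under the name DPMRA and leaves the paged pool values from step 2 to be set by hand.

```powershell
# Step 1: raise both DPM agent inactivity timeouts to 7200 (decimal).
$agentKey = 'HKLM:\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent'
if (-not (Test-Path $agentKey)) { New-Item -Path $agentKey -Force | Out-Null }

foreach ($name in 'ConnectionNoActivityTimeoutForNonCCJobs', 'ConnectionNoActivityTimeout') {
    New-ItemProperty -Path $agentKey -Name $name -PropertyType DWord -Value 7200 -Force | Out-Null
}

# Confirm what was written.
Get-ItemProperty -Path $agentKey |
    Select-Object ConnectionNoActivityTimeoutForNonCCJobs, ConnectionNoActivityTimeout

# Step 3: restart the DPM agent service so the new values are picked up.
Restart-Service -Name DPMRA
```

Exporting the key with regedit before changing it makes the change easy to roll back.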
Edited by Roendi, Tuesday, June 25, 2013 8:58 AM (Change)
June 25th, 2013 8:52am
I thought I'd mention that I am experiencing the same intermittent errors as well:
The VSS application writer or the VSS provider is in a bad state. Either it was already in a bad state or it entered a bad state during the current operation. (ID 30111 Details: VssError:A function call was made when the object was in an incorrect state for that function (0x80042301))
Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied. This is often caused by incorrect security settings in either the writer or requestor process.
Operation:
Gathering Writer Data
Context:
Writer Class Id: {e8132975-6f93-4464-a53e-1050253ae220}
Writer Name: System Writer
Writer Instance ID: {ad25ea3e-ce36-4a0a-9500-9f19f989fef3}
My setup is three 2-node Hyper-V clusters running Server 2012 Core with 2 Dell PS4100 SANs and approximately 20 VMs. The SANs are running 6.0.5, and HIT is 4.5 on the hosts. All of the hosts are fully patched (and fresh installs as well... these clusters have all been recently migrated from 2008 R2). I've even upgraded all firmware and drivers on the hosts and the switches to the latest versions.
I am having a hell of a time getting backups to work with the EqualLogic HW provider in DPM 2012 SP1 UR2. In fact, on one of my clusters I can't get any replicas of any of my VMs created at all! Going through this thread, it seems like there are a myriad of fixes that work sometimes. But before I go and apply a bunch of registry entries to my DPM server or my cluster hosts: is the general consensus that the only real fix at the moment is to rely on the Hyper-V software provider by disabling the EqualLogic provider?
June 25th, 2013 7:02pm
"it seems like the general consensus is that the only real fix at the moment to rely on the Hyper-V Software provider by disabling the EqualLogic provider?"
No. You'll enter a whole new world of pain relying on the MS VSS writer.
http://up2v.nl/2013/06/19/another-update-of-another-update-improves-cluster-resiliency-in-windows-server-2012/
Incidentally, I have an HP 3PAR StoreServ 7200, and using the VSS writer from 3PAR I get the same errors detailed in this thread, so the problem isn't specific to Dell EqualLogic. It seems neither the storage vendors nor Microsoft have worked out how to use VSS robustly with Windows 2012 CSV.
Until they get it sorted, using agents in the guests is the only surefire way to consistently back up my VMs.
Edited by slinkoff, Tuesday, June 25, 2013 9:03 PM (spelling)
June 25th, 2013 9:01pm
"I am having a hell of a time getting backups to work with the EqualLogic HW provider in DPM 2012 SP1 UR2. In fact on one of my clusters I can't even get any replicas of any of my VMs created at all! Going through this thread it seems like there are a myriad of fixes that work sometimes, but before I go and apply a bunch of registry entries to my DPM server, or my cluster hosts, it seems like the general consensus is that the only real fix at the moment to rely on the Hyper-V Software provider by disabling the EqualLogic provider?"
Hello,
from my personal experience, it's best to uninstall the EQL VSS provider. I have been through various combinations on my clusters, and my final solution for now is:
Only the following HIT components are installed:
- PowerShell module
- DSM
- SMI Provider
MPIO is set up via the registry (it's explained at the bottom of this page):
http://en.community.dell.com/techcenter/storage/w/wiki/2678.dell-equallogic-hit-kit-auto-install-script.aspx
MS Hotfixes installed:
On my DPM server I have set 'MaxAllowedParallelBackups' to 5
---
One cluster, my small two-node cluster, occasionally reports cluster error 5217. But this may be related to my testing: I have set up DPM to back up all 23 VMs every hour.
My 4-node cluster hasn't had any errors since I applied KB2848344 and uninstalled the EQL VSS provider on Friday.
---
Since my backups are running reliably at the moment, I will drop this topic and wait for a new release of EQL HIT.
June 26th, 2013 9:35am
Hi,
New hotfix: http://support.microsoft.com/kb/2870270 (replaces KB2848344?)
File           | Version
Csvflt.sys     | 6.2.9200.20712
Clussvc.exe    | 6.2.9200.20712
Csvfs.sys      | 6.2.9200.20712
Fssagent.dll   | 6.2.9200.20712
Kernelbase.dll | 6.2.9200.20712
Ntfs.sys       | 6.2.9200.20736
Rdbss.sys      | 6.2.9200.20712
Srv2.sys       | 6.2.9200.20712
Witness.dll    | 6.2.9200.20712
Kernelbase.dll | 6.2.9200.20712
http://support.microsoft.com/kb/2869923
File           | Version
Clussvc.exe    | 6.2.9200.20767
Csvfs.sys      | 6.2.9200.20767
Vhdmp.sys      | 6.2.9200.20767
July 17th, 2013 11:56am
If (like me) you installed KB2848344, you need KB2870270 and, after that, KB2869923.
File           | KB2796995      | KB2813630      | KB2838669      | KB2848344      | KB2870270      | KB2869923
Csvflt.sys     | 6.2.9200.20596 | 6.2.9200.20626 | 6.2.9200.20682 | 6.2.9200.20712 | 6.2.9200.20712 | -
Clussvc.exe    | -              | 6.2.9200.20623 | 6.2.9200.20686 | 6.2.9200.20712 | 6.2.9200.20712 | 6.2.9200.20767
Csvfs.sys      | -              | -              | 6.2.9200.20686 | 6.2.9200.20712 | 6.2.9200.20712 | 6.2.9200.20767
Fssagent.dll   | -              | -              | 6.2.9200.20682 | 6.2.9200.20712 | 6.2.9200.20712 | -
Kernelbase.dll | 6.2.9200.20596 | -              | 6.2.9200.20682 | 6.2.9200.20712 | 6.2.9200.20712 | -
Ntfs.sys       | -              | 6.2.9200.20623 | 6.2.9200.20684 | 6.2.9200.20712 | 6.2.9200.20736 | -
Rdbss.sys      | -              | -              | 6.2.9200.20685 | 6.2.9200.20712 | 6.2.9200.20712 | -
Srv2.sys       | -              | -              | 6.2.9200.20685 | 6.2.9200.20712 | 6.2.9200.20712 | -
Witness.dll    | -              | -              | 6.2.9200.20685 | 6.2.9200.20712 | 6.2.9200.20712 | -
Kernelbase.dll | -              | -              | -              | -              | 6.2.9200.20712 | -
Vhdmp.sys      | -              | -              | -              | -              | -              | 6.2.9200.20767
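To check which of the hotfixes in the table above are present on a node, a short sketch along these lines may help. The two file paths are assumptions about where the binaries live on a default installation; the KB numbers are the ones listed in the table.

```powershell
# KB numbers taken from the table above.
$kbs = 'KB2796995', 'KB2813630', 'KB2838669', 'KB2848344', 'KB2870270', 'KB2869923'

$report = foreach ($kb in $kbs) {
    $hotfix = Get-HotFix -Id $kb -ErrorAction SilentlyContinue
    [pscustomobject]@{ HotFix = $kb; Installed = [bool]$hotfix }
}
$report | Format-Table -AutoSize

# Assumed file locations; adjust if your installation differs.
$files = "$env:SystemRoot\System32\drivers\ntfs.sys", "$env:SystemRoot\Cluster\clussvc.exe"
foreach ($file in $files) {
    if (Test-Path $file) {
        Get-Item $file |
            Select-Object Name, @{ n = 'FileVersion'; e = { $_.VersionInfo.FileVersion } } |
            Format-Table -AutoSize
    }
}
```

The reported file versions can then be compared against the matching column of the table.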
July 17th, 2013 12:19pm
None of these fixes worked for me (even this latest one) until I disabled ODX (which I didn't want to do). I turned it off and all backups now pass consistently, so that seems to be the definite cause in my environment.
Now I have to see who will take responsibility for this and get it fixed. I have a 3PAR 7200 which supports ODX (and it was working nicely), so is it HP's problem with their ODX implementation or Microsoft's with theirs?!
July 22nd, 2013 6:13pm
I vote for Microsoft, because I don't think this is the last patch...
I will turn on ODX and TRIM this week...
July 22nd, 2013 6:18pm
Hi,
Yesterday I turned on ODX:
Set-ItemProperty HKLM:\System\CurrentControlSet\Control\FileSystem -Name "FilterSupportedFeaturesMode" -Value 0
Last night I got a 5142 error on the CSV (ERROR_TIMEOUT(1460)), which crashed my servers...
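The same registry value used in the command above can be read back and set to 1 to switch ODX off again. A minimal sketch, to be run on each node:

```powershell
$fsKey = 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem'

# Read the current setting: 0 = ODX enabled, 1 = ODX disabled.
(Get-ItemProperty -Path $fsKey -Name 'FilterSupportedFeaturesMode' -ErrorAction SilentlyContinue).FilterSupportedFeaturesMode

# Disable ODX again on this node.
Set-ItemProperty -Path $fsKey -Name 'FilterSupportedFeaturesMode' -Value 1
```

If the read returns nothing, the value has not been created on that node.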
July 23rd, 2013 12:14pm
So it appears that enabling ODX on clustered Hyper-V servers when using DPM 2012 SP1 to back up guest VMs with the software VSS writer causes timeout errors on the storage, resulting in VM crashes.
This is happening on Dell and HP SAN hardware which supports ODX, and with the very latest MS hotfixes.
I'm opening a support case. I want a resolution so I can get ODX back; it was great for live migrations.
July 23rd, 2013 12:49pm
"So it appears enabling ODX on clustered Hyper-V servers when using DPM 2012 SP1 to backup guest VMs with the software VSS writer causes timeout errors on the storage, resulting in VM crashes. This is happening on Dell and HP SAN hardware which supports ODX and with the very latest MS hotfixes. I'm opening a support case, I want a resolution so I can get ODX back, it was great for live migrations."
My support case is 113060610494265...
I'm using Dell blades and an EqualLogic PS SAN, which supports ODX.
July 23rd, 2013 12:54pm
I installed hotfixes KB2870270 and KB2869923 on the 18th last week and haven't had a single cluster error since. I'm still using the MS software provider at the moment.
Dell just released the HIT 4.6 EPA. I will give it a try tomorrow on one of my clusters.
July 23rd, 2013 6:13pm
Is ODX enabled or disabled? Do you care about ODX?
July 23rd, 2013 6:17pm
I have ODX enabled... but right now I don't care about it. Doing performance metering and other testing with and without ODX is not very high on my agenda.
July 23rd, 2013 6:46pm
I don't use it.
Almost everything is deployed out of the box, except for jumbo frames and protocol bindings on the iSCSI network.
July 23rd, 2013 9:11pm
Looks like I still have no luck with the EQL VSS provider.
I installed the HIT 4.6 EPA yesterday evening.
... failed backups, one backup job stuck for almost 6 hours now, "IVssWriterCallback" error 8194 spamming my event log every 30 seconds...
Back to the Microsoft software provider again.
July 24th, 2013 4:36pm
Same here. HIT 4.6 does not resolve the issue. The two clusters to which I've added the EQL VSS provider now have repeatedly failing backups. The third cluster, with the software provider, has no errors at all.
July 24th, 2013 5:42pm
Same here. HIT 4.6 does not resolve the issue. Our cluster has the same problem again.
August 2nd, 2013 9:04am
This error is normal:
Event: 8194 Source: VSS : IVssWriterCallback interface. hr = 0x80070005, Access is denied
This problem appears only in a cluster. The event is visible on nodes other than the one that initiated the call to VSS.
During a VSS call, the Cluster service sends requests to all nodes through the GUM (Global Update Manager).
Because the "System Writer" is hosted by the Cryptographic Services service (cryptsvc), which runs in the "Network Service" context instead of "System", the COM callbacks are met with Access Denied because of the different impersonation contexts on the other cluster nodes.
The problem will not be fixed, as it has no functional impact.
The events can be ignored.
August 9th, 2013 10:58am
Time to jump in on this thread, as we have a similar setup to some others here:
1x Svr2012 with DPM 2012 SP1 (4.1.3408.0)
4x Svr2012 nodes in a Failover Cluster with HIT Kit 4.5 (ASM 4.5.0.6492)
1x EQL PS4110X
All servers connected by 4x 10Gb on 2x Juniper EX4550
We use ODX and the EQL hardware provider.
We regularly have error 8194, which doesn't seem to harm the cluster.
I finally pinpointed how to avoid error 12393, which I thought I would share with you.
We don't use the DPM schedule for backups, since it keeps making errors that trigger multiple consistency checks at a time. When this occurs, error 12393 happens on at least one of the nodes. And after a 12393 error, we need to reboot every node to get the cluster stable again.
Instead we use a PowerShell script which runs all VM backups one by one, four times a day. This works well. Occasionally we have an error in a backup, which gets solved the next time DPM runs. This solution is not how it should be, but it works for us for now.
The biggest problem we still have is when we have to reboot our nodes (Patch Tuesday or other important updates). After rebooting the four nodes we need to run a consistency check by hand on every VM to prevent DPM from doing it by itself, because when DPM does it, it handles multiple VMs at a time, which triggers a 12393 error. When I run the consistency checks one by one after rebooting the nodes, everything stays fine. But this is rather time consuming (29 VMs in total, ~4.2 TB), so I hope a decent solution will come sooner rather than later.
So please keep posting your findings, as it will help a lot of people when a good solution becomes available.
Regards, Bert
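Bert's script is not shown in the thread. The following is only a sketch of the one-at-a-time approach he describes, meant to be run in the DPM Management Shell. The server and protection group names are hypothetical, and the job polling relies on the HasCompleted and Status properties that DPM job objects normally expose.

```powershell
# Hypothetical names: replace with your own DPM server and protection group.
$dpmServer = 'DPM01'
$groupName = 'Hyper-V VMs'

$group = Get-DPMProtectionGroup -DPMServerName $dpmServer |
    Where-Object { $_.FriendlyName -eq $groupName }

foreach ($ds in Get-DPMDatasource -ProtectionGroup $group) {
    # Start one express full backup and wait for it before starting the next.
    $job = New-DPMRecoveryPoint -Datasource $ds -Disk -BackupType ExpressFull
    while (-not $job.HasCompleted) { Start-Sleep -Seconds 30 }
    "{0}: {1}" -f $ds.Name, $job.Status
}
```

Scheduling such a script with Task Scheduler four times a day would match the cadence described above.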
Edited by Bert Oris, Sunday, August 11, 2013 12:08 PM
August 11th, 2013 12:08pm
Translated by Google (Part 1/3):
Hello,
Here is a summary of the failures on our Hyper-V 2012 cluster.
When we set up backups with DPM 2012 and the EqualLogic VSS hardware provider, the backups did not work correctly.
We opened a case with Microsoft (REG: 113053110480759) on 31/05/2013 for VSS error 8194: IVssWriterCallback interface. hr = 0x80070005, Access is denied
Microsoft concluded that this error (8194) is normal and should be ignored; our backup problem did not come from there.
Another case was opened with EqualLogic (Case # 877369257) on 03/06/2013 for VSS error 12363: An expected hidden volume arrival did not complete because this LUN was not detected.
EqualLogic asked us to disable the hardware provider and use the software provider to see whether the problem really came from EqualLogic.
At that point we had problems with our cluster, with several outages, and on 06/06/2013 we opened a case with Microsoft Cluster support for errors 5120/5142/5217.
The configuration of our Hyper-V nodes was as follows:

We immediately disabled the ODX and TRIM/UNMAP functions.
CSV volumes without backup coverage had no problems.
After several weeks of testing and installing numerous fixes, the problem is still there.
Regarding CSV, here is a list of Microsoft KBs with the affected files:
File           | KB2796995      | KB2813630      | KB2838669      | KB2848344      | KB2870270      | KB2869923
Csvflt.sys     | 6.2.9200.20596 | 6.2.9200.20626 | 6.2.9200.20682 | 6.2.9200.20712 | 6.2.9200.20712 | -
Clussvc.exe    | -              | 6.2.9200.20623 | 6.2.9200.20686 | 6.2.9200.20712 | 6.2.9200.20712 | 6.2.9200.20767
Csvfs.sys      | -              | -              | 6.2.9200.20686 | 6.2.9200.20712 | 6.2.9200.20712 | 6.2.9200.20767
Fssagent.dll   | -              | -              | 6.2.9200.20682 | 6.2.9200.20712 | 6.2.9200.20712 | -
Kernelbase.dll | 6.2.9200.20596 | -              | 6.2.9200.20682 | 6.2.9200.20712 | 6.2.9200.20712 | -
Ntfs.sys       | -              | 6.2.9200.20623 | 6.2.9200.20684 | 6.2.9200.20712 | 6.2.9200.20736 | -
Rdbss.sys      | -              | -              | 6.2.9200.20685 | 6.2.9200.20712 | 6.2.9200.20712 | -
Srv2.sys       | -              | -              | 6.2.9200.20685 | 6.2.9200.20712 | 6.2.9200.20712 | -
Witness.dll    | -              | -              | 6.2.9200.20685 | 6.2.9200.20712 | 6.2.9200.20712 | -
Kernelbase.dll | -              | -              | -              | -              | 6.2.9200.20712 | -
Vhdmp.sys      | -              | -              | -              | -              | -              | 6.2.9200.20767
On 01/08/2013, the Microsoft developer suggested a problem on the network:
"Since the pattern is that the IO is failing with timeout over SMB, I would suggest looking at the network capabilities of this cluster. There were three instances with 5120 IO timeout.
Two of them were caused when CSVFS was in redirected IO mode for snapshot operations.
Once it was in Direct IO mode (although all metadata ops will still go over SMB).
Out of those three, one was very quick in recovering to Active state.
One resulted in the 5142 event as explained above.
And another took ~2 minutes, which is still not good because SMB scale-out clients' IO will fail if this cluster is used as a scale-out file server."
We first checked the storage network by:
- Updating the M8024-k switches
- Updating the Broadcom cards and enabling iSCSI offload (replacing NDIS; see the EqualLogic white paper of July 2013)
Our configuration is now as follows:

August 12th, 2013 3:12pm
Translated by Google (Part 2/3):
On 05/08/2013, Microsoft support leaned more towards a problem with redirected I/O on the CSV.
On 06/08/2013, Microsoft support asked us to run a network trace between the nodes and advised us to enable QoS.
This QoS must be managed at the "Hyper-V Extensible Switch."
But the cluster traffic (CSV + heartbeat) and Live Migration traffic did not pass through the switch!
We modified our configuration to send all traffic through the "Hyper-V Extensible Switch."
At this point we realized that the MAC addresses of the cards were not correct!
A case was opened with Dell (No. 880538152) for a problem with the FlexAddress addresses.
The conclusion was that the drivers had to be completely uninstalled and reinstalled.
This also means the configuration is completely lost.
On one of the nodes (HYPERV4), uninstalling the drivers went wrong and they could not be reinstalled.
We were forced to reinstall the node completely.
On another node (HYPERV3), uninstalling the drivers went well, but the MAC addresses were still wrong.
New call to Dell support.
This time the MAC addresses were also wrong in the BIOS... The FlexAddress was remapped through the CMC.
We finally created the aggregate (NIC team) with the technician.
We then had to redo the configuration completely.
Shortly afterwards, when we tried to recreate the aggregate, we were told that it already existed!

It was impossible to remove the ghost adapter.
It seems necessary to remove the Hyper-V switch before installing or updating the Broadcom drivers.
Our problem is not an isolated one: http://mikefrobbins.com/2011/04/21/enabling-jumbo-frames-for-iscsi-on-server-core/
Warning:
There is an issue with the Broadcom drivers version 14.4.8.4 that could cause the network cards to become inoperable if a virtual switch already exists on your Hyper-V host server and it is running the core install of Windows Server 2008 R2.
I have only experienced this issue on Dell PowerEdge R710 servers.
I have run the same process on Dell PowerEdge 2950 servers with the same network cards and drivers without issue.
If you have a Dell PE R710, consider removing the virtual switches before installing this driver, or be prepared to reload the Hyper-V host server if you experience this problem.
We were forced to reinstall this node completely.
Our configuration is as follows:

August 12th, 2013 3:15pm
Translated by Google (Part 3/3):
But, our mistake, we forgot to set the "MinimumBandwidthMode Weight" parameter when creating the Hyper-V switch. Without it the switch cannot manage bandwidth, and we were forced to redo the entire configuration with this parameter!
Once the setup was complete, we realized that SR-IOV was no longer available.
This seems logical, since this technology bypasses the Hyper-V switch:

We rebuilt our configuration with SR-IOV (-EnableIov $true) and without bandwidth management (MinimumBandwidthMode Weight).
We tried to change the MTU to 9000 on the cluster and Live Migration networks, but the Hyper-V switch did not pass the packets.
So we went back to an MTU of 1500.
We also realized that installing the Dell KACE agent and OpenManage 7.3 blocks cluster communication!
Optimizing the cluster network (CSV + heartbeat) therefore seems impossible!
We decided to use redirected I/O, and therefore this network, as little as possible, and for this we reinstalled the EqualLogic VSS hardware provider (HIT 4.6 EPA).
We limited the bandwidth of the agents in DPM, limited the number of simultaneous backups to one, and increased the timeout of the DPM agents.

Restarting the VMMS service brings the writer back to an operational state.
Restart the DPMRA service (DPM agent) for the change to take effect.
But some backups still do not work, with the same message as originally!
Follow-up call to EqualLogic support.
Phone call with Stéphane F.
Our problem is not isolated.
Other similar cases have been under investigation since 21/06/2013.
We are due to review the situation again on 01/09/2013.
Meanwhile, we must choose between the hardware provider with backup errors and the software provider with downtime...
Sincerely,
Frédéric OGUER
SID - IT Manager
August 12th, 2013 3:21pm
The final script used :
# DELETE ALL
Remove-VMNetworkAdapter ManagementOS Name "MANAGEMENT"
Remove-VMNetworkAdapter ManagementOS Name "LAN"
Remove-VMNetworkAdapter ManagementOS Name "CSV"
Remove-VMNetworkAdapter ManagementOS Name "LM"
Remove-VMNetworkAdapter ManagementOS Name "HEARTBEAT"
Get-VMNetworkAdapter ManagementOS
Remove-VMSwitch "20GbE switch" -force
Get-VMSwitch
Remove-NetLbfoTeam "2x10GbE Team" -confirm:$false
Remove-NetLbfoTeam "20GbE switch" -confirm:$false
get-NetLbfoTeam
#NIC MTU
Get-NetAdapterAdvancedProperty -Name NIC1,NIC2 -DisplayName "Jumbo Packet" | Set-NetAdapterAdvancedProperty -RegistryValue "9014"
#Teaming creation
New-NetLbfoTeam "2x10GbE Team" -TeamMembers NIC1,NIC2 -TeamingMode Lacp -LoadBalancingAlgorithm TransportPorts -TeamNicName "2x10GbE" -confirm:$false
# !!! wait while the multiplexor driver and the team adapter are installed
Sleep 60
Get-NetLbfoTeam
#VMswitch with SR-IOV !! MinimumBandwidthMode Weight
New-VMSwitch "20GbE switch" -NetAdapterName "2x10GbE" -EnableIov $true -AllowManagementOS $false
#Set-VMSwitch "20GbE switch" -DefaultFlowMinimumBandwidthWeight 30
#VMNetworkAdapter
# The networks will appear in the reverse of their creation order: LAN, LM, CSV, HEARTBEAT
#Add-VMNetworkAdapter -ManagementOS -Name "HEARTBEAT" -SwitchName "20GbE switch"
Add-VMNetworkAdapter -ManagementOS -Name "CLUSTER" -SwitchName "20GbE switch"
Add-VMNetworkAdapter -ManagementOS -Name "LM" -SwitchName "20GbE switch"
Add-VMNetworkAdapter -ManagementOS -Name "LAN" -SwitchName "20GbE switch"
#VLAN on VMNetworkadapter
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "LAN" -Access -VlanId 11
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "CLUSTER" -Access -VlanId 12
Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "LM" -Access -VlanId 13
#Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "HEARTBEAT" -Access -VlanId 14
#BandwidthWeight: 30(default)+20+30+10+10 = 100
#Set-VMNetworkAdapter -ManagementOS -Name "LAN" -MinimumBandwidthWeight 10
#Set-VMNetworkAdapter -ManagementOS -Name "CLUSTER" -MinimumBandwidthWeight 20
#Set-VMNetworkAdapter -ManagementOS -Name "LM" -MinimumBandwidthWeight 30
#Set-VMNetworkAdapter -ManagementOS -Name "HEARTBEAT" -MinimumBandwidthWeight 10
#IeeePriority
Set-VMNetworkAdapter -ManagementOS -Name "LAN" -IeeePriorityTag On
Set-VMNetworkAdapter -ManagementOS -Name "CLUSTER" -IeeePriorityTag On
Set-VMNetworkAdapter -ManagementOS -Name "LM" -IeeePriorityTag On
#Set-VMNetworkAdapter -ManagementOS -Name "HEARTBEAT" -IeeePriorityTag On
New-NetQosPolicy "LM" -LiveMigration -Priority 5
New-NetQosPolicy "CSV" -SMB -Priority 3
New-NetQosPolicy "HEARTBEAT" -IPDstPort 3343 -Priority 6
##IP configuration
$IP=36
#MANAGEMENT
Get-NetAdapter -name *LAN* | New-NetIPAddress -IPAddress 192.168.0.$IP -PrefixLength 24 -DefaultGateway 192.168.0.1
Get-NetAdapter -name *LAN* | Set-DnsClientServerAddress -ServerAddresses 192.168.0.5,192.168.0.6,192.168.3.3
Get-NetAdapter -name *LAN* | Set-DnsClient -ConnectionSpecificSuffix "home.sid.tm.fr"
#PowerShell 3.0 does not have any new cmdlet for configuring WINS server settings.
$WINS = Get-WmiObject win32_networkadapterconfiguration | Where IPAddress -eq 192.168.0.$IP
$WINS.SetWINSServer("192.168.0.5","192.168.0.6")
$WINS.SetTcpipNetbios("2")
#CLUSTER
Get-NetAdapter -name *CLUSTER* | New-NetIPAddress -IPAddress 192.168.12.$IP -PrefixLength 24
Get-NetAdapter -name *CLUSTER* | Set-DnsClient -RegisterThisConnectionsAddress $false
#PowerShell 3.0 does not have any new cmdlet for configuring WINS server settings.
$WINS = Get-WmiObject win32_networkadapterconfiguration | Where IPAddress -eq 192.168.12.$IP
$WINS.SetTcpipNetbios("2")
#Disable IPv6, the Link-Layer Topology Discovery mapper I/O driver, and the LLTD responder
Disable-NetAdapterBinding -Name "vEthernet (CLUSTER)" -ComponentID ms_tcpip6
Disable-NetAdapterBinding -Name "vEthernet (CLUSTER)" -ComponentID ms_lltdio
Disable-NetAdapterBinding -Name "vEthernet (CLUSTER)" -ComponentID ms_rspndr
#LM
Get-NetAdapter -name *LM* | New-NetIPAddress -IPAddress 192.168.13.$IP -PrefixLength 24
Get-NetAdapter -name *LM* | Set-DnsClient -RegisterThisConnectionsAddress $false
#PowerShell 3.0 does not have any new cmdlet for configuring WINS server settings.
$WINS = Get-WmiObject win32_networkadapterconfiguration | Where IPAddress -eq 192.168.13.$IP
$WINS.SetTcpipNetbios("2")
#Disable IPv6, the Link-Layer Topology Discovery mapper I/O driver, and the LLTD responder
Disable-NetAdapterBinding -Name "vEthernet (LM)" -ComponentID ms_tcpip6
Disable-NetAdapterBinding -Name "vEthernet (LM)" -ComponentID ms_lltdio
Disable-NetAdapterBinding -Name "vEthernet (LM)" -ComponentID ms_rspndr
#HEARTBEAT
#Get-NetAdapter -name *HEARTBEAT* | New-NetIPAddress -IPAddress 192.168.14.$IP -PrefixLength 24
#Get-NetAdapter -name *HEARTBEAT* | Set-DnsClient -RegisterThisConnectionsAddress $false
#PowerShell 3.0 does not have any new cmdlet for configuring WINS server settings.
#$WINS = Get-WmiObject win32_networkadapterconfiguration | Where IPAddress -eq 192.168.14.$IP
#$WINS.SetTcpipNetbios("2")
#Disable IPv6, the Link-Layer Topology Discovery mapper I/O driver, and the LLTD responder
#Disable-NetAdapterBinding -Name "vEthernet (HEARTBEAT)" -ComponentID ms_tcpip6
#Disable-NetAdapterBinding -Name "vEthernet (HEARTBEAT)" -ComponentID ms_lltdio
#Disable-NetAdapterBinding -Name "vEthernet (HEARTBEAT)" -ComponentID ms_rspndr
#VMSwitch MTU
$RegKey ="HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}"
Get-ChildItem -Path $RegKey -ErrorAction SilentlyContinue | % {
    $path = $_.PSPath
    Get-ItemProperty $path | where {$_.driverdesc -eq "Hyper-V Virtual Ethernet Adapter" -and $_.Characteristics -eq "41"} | % {
        Set-ItemProperty $path -Name "*JumboPacket" -Value "9014"
    }
}
#MTU
#Get-NetAdapterAdvancedProperty -Name "vEthernet (LM)", "vEthernet (CLUSTER)" -DisplayName "Paquet Jumbo" | Set-NetAdapterAdvancedProperty -RegistryValue "9014"
August 12th, 2013 3:21pm
sounds like a lot of work for no gain!
I've given up on using DPM for child partition snapshot backups. I'm relying on native SAN snapshots to back up VMs. It's a more manual restore process, but it's fine for the rare occasion I need to restore an entire VM. I'm using DPM just
for agent backups of SQL, Exchange and files.
This issue was taking too much troubleshooting time and the loss of performance from turning off ODX as a workaround was not acceptable.
August 14th, 2013 12:49pm
Hi
I received this message from an Enterprise Technical Sr LVL3 Consultant on the Dell EqualLogic team:
This is an ongoing issue we are investigating at this time. We are currently aware of the issue and we are working with Microsoft to address the problem. Unfortunately at this time there is no workaround that I know of.
So I think we can be glad to hear this, and we need to wait.
I will give you more information when it is available.
August 19th, 2013 10:24am
Hi,
I've installed on every node of my cluster:
- KB2838669
- KB2870270
- DPM 2012 RU2
- HIT 4.6
Like Dave Grant, I have the "IVssWriterCallback" error 8194 spamming my event log every 30 seconds...
August 19th, 2013 2:00pm
Translated by Google:
Symptom:
On your three-node Hyper-V 2012 cluster, you get VSS event ID 8194 indicating an error:
Log Name: Application
Source: VSS
Date: 06/06/2013 12:44:26
Event ID: 8194
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: Computer
Description:
Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied. This is often caused by incorrect security settings in either the writer or requestor process.
Operation:
Gathering Writer Data
Context:
Writer Class ID: {e8132975-6f93-4464-a53e-1050253ae220}
Writer Name: System Writer
Writer Instance ID: {4e342a4b-5cc5-4a42-ab69-fdc843778325}
This event is visible on nodes other than the one that initiated the VSS call.
Cause:
Known problem with no functional impact.
This problem appears only in a cluster. During a VSS call, the Cluster service sends requests to all nodes through the GUM (Global Update Manager). Because the System Writer is hosted by the Cryptographic Services service (cryptsvc), which runs in the "NetworkService" context instead of "System", the returning COM calls are met with Access Denied because of the different impersonation on the other cluster nodes.
Resolution:
The problem will not be fixed, as it has no functional impact. The events can be ignored.
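To see how often each of the two event IDs discussed in this thread turns up on a node, a minimal sketch follows. `Get-VssEventSummary` is a hypothetical helper name, and the commented usage assumes both 8194 and 12363 are written to the Application log, which the thread only shows for 8194.

```powershell
# Hypothetical helper: count events per ID and flag the ones this thread
# treats as harmless noise (8194) versus the ones worth chasing (12363).
function Get-VssEventSummary {
    param(
        [object[]] $Events = @(),
        # Event IDs considered ignorable, per the explanation above
        [int[]] $Ignorable = @(8194)
    )
    @($Events | Where-Object { $_ }) | Group-Object -Property Id | ForEach-Object {
        [pscustomobject]@{
            Id        = [int] $_.Name
            Count     = $_.Count
            Ignorable = $Ignorable -contains [int] $_.Name
        }
    }
}

# Usage on a cluster node (last 24 hours of the Application log):
# $events = Get-WinEvent -FilterHashtable @{
#     LogName   = 'Application'
#     Id        = 8194, 12363
#     StartTime = (Get-Date).AddHours(-24)
# } -ErrorAction SilentlyContinue
# Get-VssEventSummary -Events $events | Format-Table
```

Running the same query on every node makes it easier to tell whether 12363 only appears on nodes that handled two jobs at once.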
August 19th, 2013 2:16pm
Event 8194 is no problem, but event 12363 is.
Any news when there will be a solution for that one ?
August 19th, 2013 2:19pm
Hi Bert,
I got the same answer as Roendi:
"This is an ongoing issue we are investigating at this time. We are currently aware of the issue and we are working with Microsoft to address the problem. Unfortunately at this time there is no workaround that I know of."
I must call EqualLogic back on September 1.
Best regards,
August 19th, 2013 2:25pm
funny, I had a similar reply from Microsoft:
Hope you are aware that this is a known issue and we hardly have any workaround available for this, so we have only option collect traces and analyze it. There
is no point in changing any configuration as this is already a known issue.
This will surely be credited against the BUG so will not be a charged incident.
Hope this helps
August 19th, 2013 2:33pm
We have a working solution using PowerShell. Only after every reboot of the cluster nodes do we have to run a consistency check manually (also using PS). If we do that, we avoid error 12363, and our cluster and DPM work as they should (using ASM and ODX).
But after a cluster node reboot, we still have to stop the PowerShell backup schedules, start the consistency check script, and start the backup schedules again when the consistency check is done. We could also automate this using PS, but since it is only once or twice a month, we do it manually for now.
If someone wants the scripts and procedures, feel free to reply.
A solution from Microsoft/Dell would still be very welcome.
Regards, Bert
August 19th, 2013 2:49pm
Consistency checks use VSS.
Sometimes I get the 12363 error during a consistency check.
It's random, like the backups...
I will install DPM agents inside the VMs.
August 19th, 2013 3:04pm
We also thought the errors were random, but I think we found a pattern in when they occur. When we have two actions on one node (2 backups, 2 consistency checks, or a backup and a consistency check) we always get a 12363. When we have 2 actions on different nodes, we rarely get the 12363 error. To be on the safe side, we make sure only one action (backup or consistency check) is running at a time. Then we never get the 12363 error.
We also considered installing the agent in every VM, but we're not looking forward to doing this in a production environment. Therefore we chose to go the PowerShell route.
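The "one action at a time" rule can be sketched as a small queue that starts the next action only after the previous one reports completion. `Invoke-Serialized` and its parameters are hypothetical names; the only assumption about a job object is the `HasCompleted` property that the DPM scripts in this thread poll.

```powershell
function Invoke-Serialized {
    param(
        # Each script block starts one action (backup or consistency check)
        # and returns a job object exposing a HasCompleted property.
        [scriptblock[]] $Actions,
        [int] $PollSeconds = 30
    )
    foreach ($action in $Actions) {
        $job = & $action
        # Do not start the next action until this one has finished
        while (-not $job.HasCompleted) {
            Start-Sleep -Seconds $PollSeconds
        }
        $job   # emit the finished job so the caller can inspect its status
    }
}

# Usage idea on the DPM server ($ds1 and $ds2 are datasources fetched beforehand):
# Invoke-Serialized -Actions @(
#     { New-DPMRecoveryPoint -Datasource $ds1 -Disk -BackupType ExpressFull },
#     { Start-DPMDatasourceConsistencyCheck -Datasource $ds2 }
# )
```

This serializes everything, which is stricter than the observation above requires (two actions on different nodes rarely failed), but it avoids having to know which node owns which VM.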
August 19th, 2013 3:15pm
I will try your route without PowerShell:
- I disabled auto-CC (consistency check) on my PG (protection group)
- I scheduled a daily CC on my PG after the backup
- On the DPM server, I changed the registry key 'MaxAllowedParallelBackups' (per node) from 3 to 1
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\2.0\Configuration\MaxAllowedParallelBackups]
"Microsoft Hyper-V"=dword:00000001
More information :
http://blogs.technet.com/b/dpm/archive/2011/06/06/how-to-use-and-troubleshoot-the-auto-heal-features-in-dpm-2010.aspx
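For anyone who prefers to script the change instead of importing a .reg file, here is a minimal PowerShell sketch of the same setting. The key path and value name are taken from the fragment above; the function name and its parameters are hypothetical, and `-KeyPath` exists only so the function can be tried against a scratch key first.

```powershell
# Hypothetical helper; KeyPath defaults to the key from the .reg fragment above.
function Set-HyperVParallelBackupLimit {
    param(
        [int] $Limit = 1,
        [string] $KeyPath = 'HKLM:\SOFTWARE\Microsoft\Microsoft Data Protection Manager\2.0\Configuration\MaxAllowedParallelBackups'
    )
    # Create the key only if it is missing, so existing values are left alone
    if (-not (Test-Path -Path $KeyPath)) {
        New-Item -Path $KeyPath -Force | Out-Null
    }
    # -Force overwrites the value if it already exists
    New-ItemProperty -Path $KeyPath -Name 'Microsoft Hyper-V' `
        -PropertyType DWord -Value $Limit -Force | Out-Null
    # Return the value that is now stored
    (Get-ItemProperty -Path $KeyPath).'Microsoft Hyper-V'
}

# On the DPM server, from an elevated session:
# Set-HyperVParallelBackupLimit -Limit 1
```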
August 19th, 2013 4:12pm
Will you post the PowerShell scripts? I'm interested in seeing what you are doing.
Thanks.
August 20th, 2013 11:45pm
I'll try to explain how we work with our PS scripts to avoid the 12363 error. As said before, we still use ODX and the HIT Kit 4.5 on our 4-node cluster with one EQL PS4110X.
First we have to make sure DPM isn't doing any backups by itself. So we modified the protection group and set the retention range to 31 days, but run only one backup a week, on Saturday morning at 6 AM. The first two scripts alter this schedule.
On Friday night we change the DPM backup to Sunday, and on Saturday night it changes back to Saturday. This way DPM isn't doing any backups itself. If there is an easier way, please let me know.
Change_Backup_Schedule_Day_Sa-Su.ps1
Import-Module DataProtectionManager
$pg = Get-DPMProtectionGroup -dpmservername "DPMServerName"
$SC0 = Get-DPMPolicySchedule $pg[0] -shortterm
$mpg = Get-DPMModifiableProtectionGroup -protectiongroup $pg[0]
Set-DPMPolicySchedule $mpg -Schedule $sc0[0] -TimesOfDay 06:00 -DaysOfWeek Su
Set-DPMProtectionGroup $mpg
$SC1 = Get-DPMPolicySchedule $pg[1] -shortterm
$mpg = Get-DPMModifiableProtectionGroup -protectiongroup $pg[1]
Set-DPMPolicySchedule $mpg -Schedule $sc1[0] -TimesOfDay 06:00 -DaysOfWeek Su
Set-DPMProtectionGroup $mpg
Change_Backup_Schedule_Day_Su-Sa.ps1
Import-Module DataProtectionManager
$pg = Get-DPMProtectionGroup -dpmservername "DPMServerName"
$SC0 = Get-DPMPolicySchedule $pg[0] -shortterm
$mpg = Get-DPMModifiableProtectionGroup -protectiongroup $pg[0]
Set-DPMPolicySchedule $mpg -Schedule $sc0[0] -TimesOfDay 06:00 -DaysOfWeek Sa
Set-DPMProtectionGroup $mpg
$SC1 = Get-DPMPolicySchedule $pg[1] -shortterm
$mpg = Get-DPMModifiableProtectionGroup -protectiongroup $pg[1]
Set-DPMPolicySchedule $mpg -Schedule $sc1[0] -TimesOfDay 06:00 -DaysOfWeek Sa
Set-DPMProtectionGroup $mpg
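The two schedule scripts differ only in the day, so they could be folded into one parameterized script. This is a sketch under assumptions: it uses only the cmdlets that appear in the scripts above, the `$Day` parameter and the script name are hypothetical, and unlike the originals (which touch `$pg[0]` and `$pg[1]` only) it loops over every protection group on the server.

```powershell
# Set-BackupScheduleDay.ps1 (hypothetical name)
param(
    [string] $DpmServerName = "DPMServerName",   # placeholder, as in the scripts above
    [ValidateSet("Su", "Mo", "Tu", "We", "Th", "Fr", "Sa")]
    [string] $Day = "Sa"
)
Import-Module DataProtectionManager

$groups = @(Get-DPMProtectionGroup -DPMServerName $DpmServerName)
foreach ($pg in $groups) {
    # First short-term schedule of the group, as in the original scripts
    $schedule = @(Get-DPMPolicySchedule -ProtectionGroup $pg -ShortTerm)
    $mpg = Get-DPMModifiableProtectionGroup -ProtectionGroup $pg
    Set-DPMPolicySchedule -ProtectionGroup $mpg -Schedule $schedule[0] -TimesOfDay "06:00" -DaysOfWeek $Day
    # Commit the modified group back to DPM
    Set-DPMProtectionGroup -ProtectionGroup $mpg
}
```

The Friday-night task would then call it with `-Day Su` and the Saturday-night task with `-Day Sa`.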
Next is the script that actually runs the backups. For this example we reduced the script to only one PG and just a couple of VMs. The actual script contains two PGs and 29 VMs. We schedule this script four times a day (3:30 AM, 10:30 AM, 2:00 PM and 5:15 PM). The schedule might seem odd, but this is to avoid conflicts with other running tasks.
New-DPMRecoveryPoint_PG1.ps1
param([string] $dpmname, [string] $pgname, [string] $backupoption)
if(!$dpmname)
{ $dpmname = "DPMServerName"}
if(!$pgname)
{ $pgname = "ProtectionGroupName"}
if(!$backupoption)
{ $backupoption = "Expressfull"}
trap{"Error in execution... $_";break}
&{
    $clipg = Get-ProtectionGroup $dpmname | where { $_.FriendlyName -eq $pgname}
    if($null -eq $clipg)
    { Throw "No PG found" }
    $backupds = @(Get-Datasource $clipg)
    foreach ($ds in $backupds[0]) # change to 1, 2 etc for each VS in the ProtectionGroup
    {
        $j = New-RecoveryPoint -Datasource $ds -Disk
        $jobtype = $j.jobtype
        Write-Output "$jobtype Job has been triggered..."
    }
    # Poll until the job finishes; sleep between checks instead of spinning
    while (! $j.hasCompleted )
    {
        Write-Host "Waiting for $jobtype job to complete..."; Start-Sleep 30
    }
    Write-Host
    if($j.Status -eq "Succeeded")
    {
        Write-Host "$jobtype job completed..."
    }
}
Start-Sleep -s 15
# From here on copy as many as needed
trap{"Error in execution... $_";break}
&{
    $clipg = Get-ProtectionGroup $dpmname | where { $_.FriendlyName -eq $pgname}
    if($null -eq $clipg)
    { Throw "No PG found" }
    $backupds = @(Get-Datasource $clipg)
    foreach ($ds in $backupds[1]) # change to 1, 2 etc for each VS in the ProtectionGroup
    {
        $j = New-RecoveryPoint -Datasource $ds -Disk
        $jobtype = $j.jobtype
        Write-Output "$jobtype Job has been triggered..."
    }
    # Poll until the job finishes; sleep between checks instead of spinning
    while (! $j.hasCompleted )
    {
        Write-Host "Waiting for $jobtype job to complete..."; Start-Sleep 30
    }
    Write-Host
    if($j.Status -eq "Succeeded")
    {
        Write-Host "$jobtype job completed..."
    }
}
Sometimes (a couple of times a week) DPM fails to back up one VM. Then we get an email saying:
Computer: MyVM.HyperCluster.domain.com
Description: Last 1 recovery points not created.
DPM encountered a retryable VSS error.
When this happens there is an 8194 error, but that can be ignored. The next backup will be OK again.
Once or twice a month we have to reboot the nodes after updates. We don't use Cluster-Aware Updating yet, but do the updates and reboots manually. After rebooting every node (and pausing and resuming the cluster role per node) we disable the "New-DPMRecoveryPoint_PG1.ps1" schedule on the DPM server. Then we have to run a consistency check on every node to avoid the 12363 error. Here's the script we use (once again, this is a shorter version than the one in production):
ConsistencyCheck_PG1.ps1
param([string] $dpmname, [string] $pgname, [string] $dsname, [string] $isheavyweight)
if(!$dpmname)
{ $dpmname = "DPMServerName"}
if(!$pgname)
{ $pgname = "ProtectionGroupName"}
if(!$dsname)
{ $dsname = "Hyper-V Name"} # e.g. "Backup Using Saved State\VSName" or "Backup Using Child Partition Snapshot\VSName"
if(!$isheavyweight)
{ $isheavyweight = "true"}
write-host "Start consistency check on $dsname "
trap{"Error in execution... $_";break}
&{
write-host "Getting protection group $pgname in $dpmname..."
$clipg = DataProtectionManager\Get-DPMProtectionGroup -DPMServerName $dpmname | where { $_.FriendlyName -eq $pgname }
if($clipg -eq $abc)
{
Throw "No PG found"
}
write-host "Getting $dsname from PG $pgname..."
$ds = DataProtectionManager\Get-DPMDatasource $clipg | where { $_.name -eq $dsname }
if($ds -eq $abc)
{
Throw "No Data Source found"
}
if( $isheavyweight -ne "true")
{
write-host "Starting light weight consistency check..."
$j = DataProtectionManager\Start-DPMDatasourceConsistencyCheck -Datasource $ds
$jobtype = $j.jobtype
if(("Validation") -notcontains $jobtype)
{
Throw "Shadow Copy job not triggered"
}
while (! $j.hascompleted ){ write-host "Waiting for $jobtype job to complete..."; start-sleep 30}
if($j.Status -ne "Succeeded") {write-host "Job $jobtype failed..." }
Write-host "$jobtype job completed..."
}
else
{
write-host "Starting Heavy weight consistency check..."
$j = DataProtectionManager\Start-DPMDatasourceConsistencyCheck -Datasource $ds -HeavyWeight
$jobtype = $j.jobtype
if(("Validation") -notcontains $jobtype)
{
Throw "Shadow Copy job not triggered"
}
while (! $j.hascompleted ){ write-host "Waiting for $jobtype job to complete..."; start-sleep 30}
if($j.Status -ne "Succeeded") {write-host "Job $jobtype failed..." }
Write-host "$jobtype job completed..."
}
}
Start-Sleep -s 15
# From here on copy as many as needed. Use the same or other ProtectionGroups.
$dpmname = "DPMServerName"
$pgname = "ProtectionGroupName2"
$dsname = "Hyper-V Name2" # e.g. "Backup Using Saved State\VSName" or "Backup Using Child Partition Snapshot\VSName"
$isheavyweight = "true"
write-host "Start consistency check on $dsname "
trap{"Error in execution... $_";break}
&{
write-host "Getting protection group $pgname in $dpmname..."
$clipg = DataProtectionManager\Get-DPMProtectionGroup -DPMServerName $dpmname | where { $_.FriendlyName -eq $pgname }
if($clipg -eq $abc)
{
Throw "No PG found"
}
write-host "Getting $dsname from PG $pgname..."
$ds = DataProtectionManager\Get-DPMDatasource $clipg | where { $_.name -eq $dsname }
if($ds -eq $abc)
{
Throw "No Data Source found"
}
if( $isheavyweight -ne "true")
{
write-host "Starting light weight consistency check..."
$j = DataProtectionManager\Start-DPMDatasourceConsistencyCheck -Datasource $ds
$jobtype = $j.jobtype
if(("Validation") -notcontains $jobtype)
{
Throw "Shadow Copy job not triggered"
}
while (! $j.hascompleted ){ write-host "Waiting for $jobtype job to complete..."; start-sleep 30}
if($j.Status -ne "Succeeded") {write-host "Job $jobtype failed..." }
Write-host "$jobtype job completed..."
}
else
{
write-host "Starting Heavy weight consistency check..."
$j = DataProtectionManager\Start-DPMDatasourceConsistencyCheck -Datasource $ds -HeavyWeight
$jobtype = $j.jobtype
if(("Validation") -notcontains $jobtype)
{
Throw "Shadow Copy job not triggered"
}
while (! $j.hascompleted ){ write-host "Waiting for $jobtype job to complete..."; start-sleep 30}
if($j.Status -ne "Succeeded") {write-host "Job $jobtype failed..." }
Write-host "$jobtype job completed..."
}
}
When this is completed (usually the next morning) we start the "New-DPMRecoveryPoint_PG1.ps1" schedule again.
With these scripts we are able to keep using DPM without the 12363 error. I still don't know exactly what the 12363 error does, but I notice that after this error the iSCSI traffic goes through only one node (the LUN owner) and performance drops badly.
I think redirected I/O doesn't exist anymore, but the behavior we get looks very much like it. The only cure I have found is to reboot every node again and run a consistency check again on every VM.
When reading this reply, you will definitely notice that we are still novice PS users. So if you see anything in our scripts that could be done more elegantly, please reply.
To end this reply, I can't keep myself from saying that I think it is a shame that we have to come up with this bunch of workarounds to get a fairly simple cluster backup running. That MS and Dell are leaving us in the dark doesn't help build my confidence in either company. That said, I hope our workaround can help some of you.
Best Regards, Bert Oris
Edited by Bert Oris, Wednesday, August 21, 2013 11:55 AM
August 21st, 2013 11:50am
yeah, disabling ODX also fixes my issue when using the software VSS, too bad that we want the performance improvement on this cluster. We shouldn't have to disable a much trumpeted feature of 2012 and new SANs just to get a backup to work.
Rollup3 makes no difference by the way.
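For readers who want to check whether ODX is currently on or off on a node, here is a short sketch. It uses the `FilterSupportedFeaturesMode` registry value that Microsoft documents for Windows Server 2012; the thread does not say how each poster disabled ODX, so treating this as their method is an assumption.

```powershell
$fsKey = 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem'

# Read the current setting: 0 = ODX enabled (default), 1 = ODX disabled
Get-ItemProperty -Path $fsKey -Name 'FilterSupportedFeaturesMode'

# To disable ODX on this node (elevated session):
# Set-ItemProperty -Path $fsKey -Name 'FilterSupportedFeaturesMode' -Value 1

# To turn it back on once a fix is available:
# Set-ItemProperty -Path $fsKey -Name 'FilterSupportedFeaturesMode' -Value 0
```

The setting is per node, so it would have to be applied on every cluster node to have a consistent effect.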
August 21st, 2013 12:32pm
[Bert],
I'm impressed by the scripts you have designed, but in my view it's too big a workaround.
I can't afford to change the PGs' retention range.
In the end, wouldn't we all be satisfied if we could simply let DPM do its job, without having to care about the CSV situation?
As [Slinkoff] mentioned, ODX is a cool feature when you move data around your SAN, and the Windows 2012 Cluster service (and CSV 2.0) no longer requires us to serialize VM backups in order to protect them.
It should be straightforward!
August 21st, 2013 7:37pm
Since installing HIT 4.6 + DPM 2012 RU2 + KB2838669 and KB2870270,
the DPM backup of the Hyper-V 2012 CSV always fails with error 0x80042301 on many VMs (0 bytes transferred).
Backing up 20 VMs now takes 6 hours instead of 1 hour before (I use the VSS hardware provider).
On every node there are many new VSS events with ID 4003: the Hyper-V VSS writer receives a Freeze event and waits for an Abort or Thaw event.
The "IVssWriterCallback" error 8194 is spamming my node event logs every 30 seconds (I know we can ignore this, but it's frustrating).
However, manually starting a replica works and seems to give good performance.
I'll try changing MaxAllowedParallelBackups to 1 and see tomorrow.
August 27th, 2013 3:16pm
Last week of August : "Dell and Microsoft teams were trying to check if the MPIO driver was not involved. My colleague [...] just informed me that the problem occurs even without MPIO. The research problem therefore continues."
September 4th, 2013 4:07pm
I've linked my Dell support case to the case opened by my "compatriot" F. Oguer. (Thanks to Frédéric for all the information in this thread.) This incident is at level four, so we will probably have to wait for new EqualLogic firmware, a new HIT, or a new Microsoft KB or rollup...
September 10th, 2013 2:40pm
Hi everybody,
No news from DELL.
More information about my configuration (translated by Google):
MTU 9000 on the VMSwitch:
On the issue of the MTU on the virtual switch, I changed the VMSwitch MTU via the registry.
Here is the PowerShell script used:
#Change the MTU on the virtual switch
$RegKey ="HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}"
Get-ChildItem -Path $RegKey -ErrorAction SilentlyContinue | % {
    $path = $_.PSPath
    Get-ItemProperty $path | where {$_.driverdesc -eq "Hyper-V Virtual Ethernet Adapter" -and $_.Characteristics -eq "41"} | % {
        Set-ItemProperty $path -Name "*JumboPacket" -Value "9014"
    }
}
The ping works.
Dell KACE and OpenManage malfunction:
The Dell KACE agent was reinstalled without any problem. The problem seems to come from OpenManage 7.3.
Broadcom 17.6 drivers:
We updated the Broadcom drivers (17.4 to 17.6) and firmware (7.4 to 7.6).
According to the release notes, VMQueue and SR-IOV are not supported with version 17.4:
Enhancements:
===============
- Added support for VMQueue on NetXtreme II 1G and 10G devices.
- Added SR-IOV support features for 57712 and 578xx
SR-IOV and VMSwitch:
SR-IOV is incompatible with Windows NIC Teaming.
(Source: http://technet.microsoft.com/en-us/library/hh997031.aspx
Incompatibilities. The NIC Teaming feature is compatible with the networking capabilities in Windows Server 2012 with three exceptions: SR-IOV, Remote Direct Memory Access (RDMA), and TCP Chimney.)
Our current configuration is not supported! I plan to reinstall my nodes with bandwidth management (MinimumBandwidthMode Weight).
September 19th, 2013 12:15pm
Hi Frédéric,
We upgraded to Broadcom drivers 17.6 and firmware 7.6.
No change.
Backups with the hardware VSS provider are very slow.
Backups with the software VSS provider sometimes fail and hang (the CSV volume is momentarily lost on a node).
I would like to open a case with Microsoft.
Have you opened one? Could you give me your Microsoft case number for reference?
Thanks a lot
October 1st, 2013 12:31pm
Fonznip,
Do you install ?
Send me an email (foguer-at-sid.tm.fr) or call me (direct line: 01 45 17 43 32); it'll be easier in French!
I'm in Créteil, near Sèvres...
Microsoft case number for Error 5142: CSV TIMEOUT (open since 06/06/2013)...
113060610494265
Regarding "Backup with VSS Hardware are very slow.": what counts as very slow?
Yesterday I did a 2 TB replica in 4 hours; that's 500 GB/hour, 140 MB/s (disk write speed...).
October 1st, 2013 1:43pm
New fix for DPM (UR 3.6):
http://www.microsoft.com/en-sa/download/details.aspx?id=40318
Issue #1: DPM has express full technology, where DPM tracks changes via the DPM filter driver; the changed block information is tracked as a bitmap and stored in bitmap files. In some scenarios, DPM bitmap files become very big, leading to higher CSV volume consumption. This issue is fixed in the DPM filter and affects only VM protection scenarios. The fix is made in the DPM filter driver running on the production server.
October 1st, 2013 6:14pm
I saw this as well, I plan on attempting to install this tonight and will update the thread unless anyone else has done it already.
October 1st, 2013 9:09pm
I installed it last night and it did not resolve our issue; we are still getting the 12363 event.
October 2nd, 2013 5:32pm
It did not resolve our issue, but backups look faster (x2)...
October 2nd, 2013 6:07pm
Wow, this DPM hotfix really saved my day. I had constantly growing CSV volumes, and when I listed all the files, the total didn't add up to the disk usage I could see. Our SAN guy could only see 50% disk usage, while on my side I had less than a few gigabytes left. After applying this Hotfix for System Center 2012 SP1 Data Protection Manager (KB2886362), I have recovered terabytes of disk space.
Regarding your other issues, I had lots of problems before the summer's hotfixes, but since then I haven't run into any. My setup is a 12-node cluster and a 3PAR 7200 array. We are using the software VSS provider with serialization (I haven't seen any hardware VSS provider from HP that officially supports Hyper-V 2012, only one that supports it using HP Recovery Manager).
/Regards Jorgen
October 3rd, 2013 8:43am
Has anyone gotten any updates from Dell or Microsoft on their open tickets?
I have had success by creating smaller protection groups and having all the VMs that are defined in a protection group running on a single node in the cluster. It also helps to stagger the times at which the protection groups run. The issue appears to occur when VMs within a protection group get spread across multiple nodes in the Hyper-V cluster. Running with the VMs spread out that way causes Event ID 12363, which Microsoft/Dell need to be working on a fix for, but I haven't gotten a word out of either of them on whether they are or not.
October 11th, 2013 12:36am
I was told earlier this week by my current Dell case manager that Dell has been able to show the problem to Microsoft, Microsoft has acknowledged the problem, and that Microsoft told Dell that the problem affects other vendors besides Dell. Dell and Microsoft are apparently running tests/diagnostics every day to try and figure out how to fix it. I don't know if that means that they are close to a solution or not, but at least it is something.
October 11th, 2013 1:06am
I also had a similar problem, with a 3PAR 7400 and HP's backup software and a hardware VSS provider.
A backup of the CSV volumes would take down random nodes in the cluster after between 1 and 5 backups.
Turns out this was a jumbo frames issue. It was causing excessive pause frames on the FlexFabric and the nodes were crashing. We could have disabled flow control, but chose to leave jumbo frames off and have no problems now.
Edited by DFarlam, Monday, November 11, 2013 10:32 PM (my previous entry was nonsense)
October 11th, 2013 6:20am
I also installed the latest DPM update, but noticed no difference at all. Even though there is still no solution for the 12363 error, I was looking forward to the speed advantage, but I couldn't see any.
We also figured out that making PGs per node does indeed avoid error 12363. But in the end this wasn't flexible enough for us, as VMs sometimes need to move to another node to balance the load across our cluster. Therefore we chose to use some PS scripting; you can find the scripts above.
A last question, a bit off topic: when we run a consistency check (which we need to after a node reboot), it takes around 9 hours for 4.3 TB. This is only 139 MB/s on average. Is this a normal figure? We are using a PS4110X and a PS4110E.
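The average figure can be reproduced with a quick calculation. This is only an illustrative sketch: the function name Get-AverageThroughput is made up for this example, and it assumes binary units (1 TB = 1024 x 1024 MB), which is the convention that yields roughly 139 MB/s for 4.3 TB in 9 hours.

```powershell
# Hypothetical helper: average throughput in MB/s for a data size and a duration.
# Assumption: binary units, 1 TB = 1024 * 1024 MB.
function Get-AverageThroughput {
    param(
        [double]$Terabytes,
        [double]$Hours
    )
    $megabytes = $Terabytes * 1024 * 1024
    $seconds   = $Hours * 3600
    # Round to one decimal place for readability
    [math]::Round($megabytes / $seconds, 1)
}

# Consistency check from the post: 4.3 TB in 9 hours
Get-AverageThroughput -Terabytes 4.3 -Hours 9
```

With decimal units (1 TB = 1,000,000 MB) the same check comes out a little lower, at about 133 MB/s, so the choice of unit convention matters when comparing figures between posts.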
October 11th, 2013 10:14am
Has anyone tried the DPM 2012 R2 and Server 2012 R2 combination to see if the problems are fixed?
October 24th, 2013 12:37am
I tried yesterday and have now broken my backup. I restored back to Rollup 3 and lost most of the backup volumes.
And I am not alone with this problem!
http://social.technet.microsoft.com/Forums/en-US/245e0473-a882-488c-addf-598267145187/an-unexpected-error-occurred-during-the-installation-id-4387-dpm2012-r2-upgrade?forum=dpmsetup
October 25th, 2013 10:05am
Hi guys,
Some good news. It works much better with 2012 R2, but!!!!!!
It was very hard for me to upgrade! I will post all my problems at the link above.
And 2012 R2 still supports Server 2003!!!! You need to know that.
But none of my servers have problems syncing the volumes and services any more.
I only have a lot of errors, the same as before:
Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface. hr = 0x80070005, Access is denied.
Best Regards
Röndi
October 28th, 2013 10:50am
Hi guys,
My backup server crashed last night and now the problems are back. So DPM 2012 R2 does not resolve the problem.
October 29th, 2013 10:21am
Hi,
We changed from Broadcom to Intel network cards: no change.
We tried DPM 2012 R2 / Hyper-V 2012 R2 on an 8-node cluster: exactly the same problem.
Warning with HIT 4.6: when the VM has only an IDE controller, the VM is shown as Stopped!!! You need to add a SCSI controller to get a correct snapshot.
We tried firmware 7.0 EPA: it crashes with ODX, but the disk distribution is very good.
For me, the best workaround for backup is CSV serialization (works with DPM 2012 / DPM 2012 R2):
http://technet.microsoft.com/en-us/library/ff634192.aspx
Best regards,
Frédéric OGUER
November 15th, 2013 12:01pm
I can't believe this problem (DPM-CSV-Error 0x80042301 -> ODX) has been known for several months and MS support does not seem to be aware of it.
I still have an open case (November 2013) for this (already the second one) with no progress whatsoever, so I was glad to find this post. It is obvious that this problem is not specific to one hardware vendor, as there are people using different hardware (Dell, IBM, HP) with the exact same problem, mostly with an ODX-capable SAN. We ourselves are using the latest HP blades (BL460c Gen8) and a 3PAR 7200 as SAN.
All servers (hosts, VMs, DPM, ...) are running Windows 2012 and are fully patched, including hotfixes, firmware, drivers, etc.
Upgrading to DPM 2012 R2 did not solve the problem either, but disabling ODX did the magic trick :-). And yes, it has a (small) negative impact on SAN performance, but rather this than unreliable backups. I hope/expect Microsoft will come up with a fix soon; otherwise we will consider upgrading the Hyper-V hosts to Windows 2012 R2 (which seems not to have this problem). But HP is not 100% ready for this (2012 R2 drivers). Greetings from Belgium -
Ivan Henderix
November 20th, 2013 5:26pm
I just heard from a colleague that Veeam doesn't have this problem in a similar setup. We will check it out next week, and I'll post again when I know more.
Regards, Bert Oris
December 6th, 2013 9:37pm
We were using Veeam when we first set up our infrastructure; however, we had similar issues and are still waiting for a fix to get this working.
We have several VMs in production and simply cannot assume that reinstalling and reconfiguring Veeam will work this time. We haven't tried disabling ODX, as we still need this feature to be active.
Hopefully, a fix will be released soon.
- Luc
December 12th, 2013 8:21pm
Thx.. we will try this hotfix out in a week or two.
Anyone tried this one already ?
January 13th, 2014 7:51am
As part of the testing Dell is having me do, they had me disable the EqualLogic hardware VSS provider using the command:
"C:\Program Files\EqualLogic\bin\eqlvss" /unregserver (it can be undone via "C:\Program Files\EqualLogic\bin\eqlvss" /regserver)
Since doing that on Sunday, I haven't had a single SCDPM backup failure. I also changed the max allowed parallel backups from 3 to 1, but I just switched it back to 3 so we will see how it goes tonight. Obviously this isn't a fix, but it may
work as a band-aid for everyone in the short term. If I get anything new from Dell I'll make sure to post it.
I had the same issue with an EqualLogic, and after disabling the hardware VSS provider, DPM started to back up Hyper-V child partitions correctly. It's quite frustrating that after a whole year Microsoft and Dell haven't come back with a solution. However, I must say the Server 2012 software provider is fairly fast; I'm a bit impressed. Even so, using the hardware one would be desirable.
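To confirm which provider a node will use after the unregister step, the built-in vssadmin tool can list the registered providers and the writer states. This is a generic check offered as a sketch, not something prescribed by Dell in this thread; it assumes an elevated prompt on each cluster node.

```powershell
# List the VSS providers registered on this node.
# After "eqlvss /unregserver", the EqualLogic hardware provider should no longer
# appear, leaving the Microsoft software shadow copy provider.
vssadmin.exe list providers

# List the VSS writers and their state; the Hyper-V VSS writer
# should report a stable state with no last error before a backup is retried.
vssadmin.exe list writers
```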
January 14th, 2014 12:31pm
Upgrading to 2012 R2 in DPM and Server 2012 R2 for our failover clusters hasn't made a difference.
Both 2012 R2 clusters I've built had to have built-in ODX disabled via PowerShell, and the Dell HIT toolkit has had to be installed without the ODX provider in order for DPM 2012 R2 to take proper backups. I still get the odd VSS failure in DPM, but at least backups are working without taking down the VMs or offlining the CSV (which happened a few times using the built-in 2012 R2 ODX provider).
The clusters I built were also built completely from scratch with the latest patches, NIC firmware, NIC drivers, and BIOS for our Dell servers. We use the MS iSCSI software provider, and our EqualLogic SAN runs 7.0.5 firmware.
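For readers wondering what disabling ODX via PowerShell looks like, here is a minimal sketch based on the registry switch Microsoft documents for Windows Server 2012 and later. The poster did not share the exact commands used, so treat this as an assumption about the method rather than a record of it. It has to be run elevated on each Hyper-V host.

```powershell
# Registry location of the ODX switch (FilterSupportedFeaturesMode)
$fsKey = "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem"

# Show the current setting: 0 = ODX enabled, 1 = ODX disabled
Get-ItemProperty -Path $fsKey -Name "FilterSupportedFeaturesMode"

# Disable ODX on this host
Set-ItemProperty -Path $fsKey -Name "FilterSupportedFeaturesMode" -Value 1
```

Setting the value back to 0 re-enables ODX, which makes it easy to retest once a fixed HIT or hotfix is installed.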
January 29th, 2014 7:17pm
Not surprised on this, Dell's HIT kit has not been updated to support 2012 R2 yet so I wouldn't expect it to work without making special changes.
On another note, I had a call today with the Dell Equallogic engineering team, they informed me that they are actively working with Microsoft on this issue and that it is indeed a Microsoft issue. The quickest way for this to gain traction and have Microsoft
resolve this is for everyone to open a ticket and/or if you have a ticket open to notify Dell about your Microsoft ticket so that they can track them and put more heat on Microsoft to resolve this issue. The more tickets Microsoft gets about this issue the
more resources they will devote to resolving it.
January 29th, 2014 11:04pm
Firmware 7.0.2 is in beta and will be released next week...
February 11th, 2014 4:34pm
I opened a case in May! Microsoft closed it today...
It's not a DPM bug but a Hyper-V VSS bug.
February 11th, 2014 4:35pm
Hello everyone,
The timeout error on CSV seems to be normal, by design.
I have a great backup server with a 2 x 10 Gbit/s network.
The only bottleneck is the EqualLogic disks.
There is no way (with software VSS) to manage priorities between volumes and snapshots.
Production volumes see high latencies exceeding 40 ms.
At that point, the CSV_TIMEOUT / 5120 errors appear.
This was reduced with the latest CSV patches (the replica size is smaller).
The only workaround I found is to enable bandwidth throttling on the DPM agents so that network throughput does not exceed the performance of the disk arrays...
Support for Windows Server 2012 R2 is expected with HIT 4.7, announced for March.
On firmware 7.0.1 there is a snapshot deletion bug.
It has also been identified with Veeam.
http://forums.veeam.com/microsoft-hyper-v-f25/equallogic-on-latest-firmware-volume-snapshots-not-deleting-t20088.html
February 11th, 2014 4:41pm
Not really sure what you mean by "ODX provider". Dell's HIT Kit only has the VSS provider, while ODX is an OS function. Did you guys install the VSS provider? Also, how many backups per CSV/server do you have DPM configured for (it's a regkey on the DPM server)?
We haven't seen any issues since switching to the software provider (since Dell's hardware provider (HIT Kit v4.6) doesn't currently support VSS functions in 2012/2012 R2) and installing the latest hotfixes - http://support.microsoft.com/kb/2920151/en-us.
March 20th, 2014 3:52pm
So we are all still left with the same two options as at the beginning:
a) Use software VSS and hope that it will not crash the OS inside VMs (which has happened to us)
b) Use hardware VSS and manually fix broken backups.
We have had this issue since the beginning of this thread, and after trying an OS reinstall, software VSS, and patching with the various proposed hotfixes, it is still present after almost a year now! Right now we use option b, with manual backup fixes, but this is not an error-free solution either: on three occasions since January we had to restart every node in the cluster because of errors like 1230, 1205, 1069, 5120 ... all of which were triggered when a DPM backup started (twice during scheduled backups and once during a manual fix). We have a three-node 2012 cluster + DPM 2012 SP1.
Has anyone tried HIT 4.7 EPA + W2012R2 + DPM2012R2?
-
Ognjen
April 28th, 2014 12:19pm
Hi,
Yes, we tried HIT 4.7 EPA + W2012R2 (new install with the April update) + DPM 2012 R2.
Some news:
We tested ODX: it worked without problems! Yes, we can reactivate it!
HIT 4.7 is available. The following issues have been fixed since Dell EqualLogic Host Integration Tools for Microsoft (HIT/Microsoft) version 4.7 Early Production Access (EPA), released February 2014:
- Previously, Hyper-V vNICs on a Windows 2012 R2 host could cause the included and excluded subnets to not display properly in Auto-Snapshot Manager's MPIO Settings view. The issue has been resolved.
- Selective restore of Hyper-V objects was disabled in HIT/Microsoft Version 4.7 EPA and has been re-enabled in the final production version of HIT/Microsoft Version 4.7.
- Previously, two problems caused the EqlReqService to intermittently crash. These problems have been fixed.
The v7.0.4 firmware is available. It patches the OpenSSL (Heartbleed) vulnerability. This bug affected OpenSSL versions 1.0.1 (including 1.0.1f) and 1.0.2-beta1 releases.
DPM 2012 SP1 UR6 is available: http://support.microsoft.com/kb/2958098
DPM 2012 R2 UR2 is available: http://support.microsoft.com/kb/2958098
We tested it for the 2 bugs, but for us this one is not corrected: SetBackupComplete called prematurely causes SetBackupSucceeded to be called and 0x80042301 in VSS.
The only workaround is XML serialization (or one protection group per host, each with a different backup schedule).
But there is a new problem: is anyone else getting the CSV 5120 error during BITS transfers with SCVMM?
May 6th, 2014 9:23am
"a) Use software VSS and hope that it will not crash OS inside VM's (which had happened to us)"
Did you upgrade the integration services? Do you have enough disk space inside the VM? Do you have an error message?
We have had no crashes inside VMs with software VSS.
May 6th, 2014 9:26am
Hi,
My backup problem persists. Every week I have different problems. At the moment I want to change this value, but in my registry I have a REG_BINARY value there. Is this a proper value or a misconfiguration?
[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\2.0\Configuration\MaxAllowedParallelBackups] "Microsoft Hyper-V"=dword:00000001
On my Hyper-V 2012 hosts, Configuration is a REG_BINARY value.
My DPM is 2012 R2, fully patched.
Thanks for any feedback
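A quick way to see what is actually stored under that key, and with which registry type, is sketched below. According to an earlier reply in this thread the key lives on the DPM server, so the sketch assumes it is run there; the variable names are only illustrative.

```powershell
# Key and value name as quoted in the post
$key       = "HKLM:\SOFTWARE\Microsoft\Microsoft Data Protection Manager\2.0\Configuration\MaxAllowedParallelBackups"
$valueName = "Microsoft Hyper-V"

if (Test-Path -Path $key) {
    $regKey = Get-Item -Path $key
    if ($regKey.GetValueNames() -contains $valueName) {
        # GetValueKind reports the registry type (DWord, Binary, String, ...)
        "{0} = {1} ({2})" -f $valueName, $regKey.GetValue($valueName), $regKey.GetValueKind($valueName)
    } else {
        "Value '$valueName' not found under $key"
    }
} else {
    "Key not found: $key"
}

# To create or overwrite the value as a DWORD of 1 (uncomment to apply):
# New-ItemProperty -Path $key -Name $valueName -PropertyType DWord -Value 1 -Force
```

If the output shows a kind other than DWord for "Microsoft Hyper-V", the value does not match the dword form quoted above.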
July 24th, 2014 6:23am
Hi,
It has been a while since I reported anything, because we kept using our PS script to run DPM instead of the DPM schedules.
A case with Dell and Microsoft helped us solve our problem, thanks to the perseverance of both support engineers.
In the end, all we did was stop the EqlASMAgent [EqualLogic Auto-Snapshot Manager Agent] service on all four cluster nodes.
We have been running three VM backups at a time using the hardware VSS writer without any problems for over a week now.
Best Regards, Bert
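Stopping the service can be scripted per node along the following lines. This is a sketch only: the service name is taken from the post above, while disabling the startup type is an extra step that the post does not mention and is included here as an assumption, so that the service stays stopped across reboots.

```powershell
# Service name as given in the post; check that it matches your HIT installation
$serviceName = "EqlASMAgent"

# Stop the agent on this node
Stop-Service -Name $serviceName

# Assumption, not from the post: keep the service from starting again at boot
Set-Service -Name $serviceName -StartupType Disabled

# Confirm the result
Get-Service -Name $serviceName | Format-List Name, DisplayName, Status
```

To undo the change, set the startup type back to Automatic with Set-Service and start the service again with Start-Service.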
August 15th, 2014 9:26am
Hi all,
Today I received this from Dell ;-)
"Once the hardware VSS is correctly configured and works with ASM/ME then it can be really considered an MS issue and customers should open a case with Microsoft.
There have been a number of issues uncovered by MS with their DPM application which has recently released DPM service pack 3 that has resolved a number of their problems with Hyper-V and DPM using hardware VSS providers."
Dell case 00915355 was closed on 30.6.14. The fix came with DPM 2012 R2 Rollup 2.
We have now migrated to Hyper-V 2012 R2. Backup works fine now.
Best Regards
Röndi
August 21st, 2014 9:04am
Roendi,
I'm about to install a brand new DPM R2 RU 3. Currently, my cluster is W2k12 (not R2). Did you solve this problem only when you migrated your cluster to R2?
Thompson
September 3rd, 2014 3:57pm
Hi,
We resolved our issues with DPM 2012 R2 and a host upgrade to 2012 R2, along with the latest HIT tools from Dell.
July 21st, 2015 1:50pm