Cluster Shared Volume errors after servers were not shut down properly

Hi,
We have two IBM X240 servers (we call them server A and server B) connected to an IBM V3700 disk system via Fibre Channel HBAs.

Both servers run Windows Server 2012 R2.

We have implemented a VM cluster, and everything was working well.

Last week both servers went down because of a power outage in our server room.

After turning on server A, the following error appeared:

Windows failed to start. A recent hardware or software change might be the cause.
File: \windows\system32\drivers\msdsm.sys
Status: 0xc0000017
Info: The operating system couldn't be loaded because a critical system driver is missing or contains errors.

After booting with Last Known Good Configuration, we could log in to the system and start the clustered virtual machines.

Everything seemed fine at that point.

So I started server B and logged in using the same method as for server A.

I then found that all the VMs were shutting down or reporting errors because of Cluster Shared Volume errors.

Below are some of the errors captured from the system logs.

* Event 5142, Cluster Shared Volume 'Volume7' ('Cluster Disk 10') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.

* Event 5120, Cluster Shared Volume 'Volume3' ('Cluster Disk 4') has entered a paused state because of '(c00000be)'. All I/O will temporarily be queued until a path to the volume is reestablished.

Now we can only run one server at a time, with the other one shut down. If I turn on both servers, the errors come back and the server goes down.

Any suggestions? Please let me know if you need more information.

Thanks.

May 14th, 2015 9:34am

Hi,

Any advice on this?

May 15th, 2015 2:58pm

Hi GONGPEILIN,

It seems that MSDSM.sys was corrupted by the power outage, and there may be mismatched versions of MSDSM.sys between the two cluster nodes.

If you are using the vendor DSM, try reinstalling the DSM to replace the old MSDSM and see whether that resolves the situation. If that does not work, please run sfc /scannow to check whether any system files are corrupted.
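
One way to check for a mismatch is to compare the driver file on both nodes before reinstalling anything. The following is only a minimal sketch: it assumes PowerShell remoting is enabled, that it is run from an elevated session with admin rights on both nodes, and that the node names are HYPER-V04 and HYPER-V05 (the names that appear in the validation output later in this thread).

```powershell
# Sketch: compare msdsm.sys on both cluster nodes (node names are assumptions).
$nodes = 'HYPER-V04', 'HYPER-V05'

Invoke-Command -ComputerName $nodes -ScriptBlock {
    $path = Join-Path $env:SystemRoot 'System32\drivers\msdsm.sys'
    $file = Get-Item -Path $path

    # Emit one record per node so the values can be compared side by side.
    [pscustomobject]@{
        Node        = $env:COMPUTERNAME
        FileVersion = $file.VersionInfo.FileVersion
        Length      = $file.Length
        Sha256      = (Get-FileHash -Path $path -Algorithm SHA256).Hash
    }
} | Format-Table Node, FileVersion, Length, Sha256 -AutoSize

# Confirm that the in-box Multipath I/O feature is installed on each node.
Invoke-Command -ComputerName $nodes -ScriptBlock {
    Get-WindowsFeature -Name Multipath-IO
} | Format-Table PSComputerName, Name, InstallState -AutoSize
```

If the version or hash differs between the nodes while both are at the same patch level, that would support the corruption theory. Identical values would suggest looking elsewhere.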

More information:

Understanding MPIO Features and Components

https://technet.microsoft.com/en-us/library/ee619734%28v=ws.10%29.aspx?f=255&MSPPError=-2147217396

I'm glad to be of help to you!

May 20th, 2015 2:33am

Hi Alex,

1) I tried reinstalling the DSM drivers on both nodes, but the result is the same.

According to the excerpt below from the DSM documentation, I think reinstalling the DSM will not replace the old MSDSM:

*************************

SDDDSM installation package, which includes:
MPIO drivers. MPIO is not shipped with the Windows Server 2003
operating system but is shipped with the Windows Server 2008 or
Windows Server 2012 operating system. On Windows Server 2003, the
MPIO drivers that are shipped with the SDDDSM package are used,
while for Windows Server 2008 and Windows Server 2012, the MPIO
drivers that are shipped with the operating system are used.

********************

2) Below is the result of running the command sfc /scannow:

Windows Resource Protection did not find any integrity violations.

Any advice?

Thanks.

May 25th, 2015 1:18pm

Hi,

Can anyone help me with this, or do I need to provide more information?

Thanks.

May 27th, 2015 6:14am

Hi,

So you have one good node. Boot that node, keep node B off, and make sure everything is running fine on A and all the CSVs are up.

Also make sure no resource can move to B.

Then pause/disable node B in FCM (Failover Cluster Manager), start node B, and check that the server can ping node A and the DC/DNS.

Keep the Cluster service stopped on B, then start the service without quorum:

Force a WSFC Cluster to Start Without a Quorum. https://msdn.microsoft.com/en-us/library/hh270275.aspx

To force a cluster to start without a quorum
  1. Start an elevated Windows PowerShell via Run as Administrator.

  2. Import the FailoverClusters module to enable cluster commandlets.

  3. Use Stop-ClusterNode to make sure that the cluster service is stopped.

  4. Use Start-ClusterNode with FixQuorum to force the cluster service to start.

  5. Use Get-ClusterNode with the property NodeWeight = 1 to set the value that guarantees that the node is a voting member of the quorum.

  6. Output the cluster node properties in a readable format.
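
The six steps above can be sketched roughly as follows. This is only an outline under stated assumptions: `NODE-B` is a placeholder for the real name of node B, and the session is elevated on that node. Forcing quorum overrides the cluster's normal safety checks, so it should be treated as a last resort.

```powershell
# Sketch of the documented "start without quorum" sequence.
# 'NODE-B' is a placeholder; replace it with the actual node name.
Import-Module FailoverClusters

$node = 'NODE-B'

# Step 3: make sure the cluster service is stopped on the node.
Stop-ClusterNode -Name $node

# Step 4: force the cluster service to start without quorum.
Start-ClusterNode -Name $node -FixQuorum

# Step 5: make the node a voting member of the quorum.
(Get-ClusterNode -Name $node).NodeWeight = 1

# Step 6: show the node properties in a readable format.
Get-ClusterNode -Cluster $node |
    Format-Table -Property Name, State, NodeWeight -AutoSize
```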

May 27th, 2015 10:57am

Hi,
I think there is a misunderstanding.

Both nodes show the state Up when I run the PowerShell cmdlet Get-ClusterNode.

The issue is that a CSV volume cannot be accessed from a passive (non-coordinator) node.
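
For anyone reproducing this, the coordinator node and the per-node access state of each CSV can be listed as text along these lines. It is a minimal sketch that assumes the FailoverClusters module on Windows Server 2012 R2 and a session on one of the cluster nodes.

```powershell
# Sketch: show which node coordinates each CSV and how every node sees it.
Import-Module FailoverClusters

# OwnerNode is the coordinator node for the volume.
Get-ClusterSharedVolume |
    Format-Table -Property Name, OwnerNode, State -AutoSize

# One row per volume per node; StateInfo shows Direct, FileSystemRedirected,
# BlockRedirected, or Unavailable, and the reason columns explain redirection.
Get-ClusterSharedVolumeState |
    Sort-Object -Property Name, Node |
    Format-Table -Property Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason -AutoSize
```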

I also followed the document below, but it did not solve my issue:

https://support.microsoft.com/en-us/kb/2008795?wa=wsignin1.0

Below are the results of the cmdlet Get-ClusterSharedVolumeState with both nodes turned on and with only one node turned on.

When turning on both nodes:

When turning on any one node:

May 27th, 2015 11:44pm

You should check the event logs for more information on the errors. As described, this leaves too many options open to try, but checking the events and the cluster.log could point you to the problem.

When you add a fresh new disk, does that disk have the same problem?

A quick wild shot is to evict the node and rejoin it after rebooting. Remember that this is at your own risk, and I can't see for now whether there are any other problems.

Step 1: look through the events for more information on this, and run the cluster validation report.
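
A rough sketch of how that information could be collected from PowerShell is shown below. The node names HYPER-V04 and HYPER-V05 come from later in this thread. The output folder `C:\Temp` and the 60-minute window are assumptions for illustration only.

```powershell
# Sketch: gather CSV-related events, the cluster log, and a validation report.
Import-Module FailoverClusters

# Recent failover clustering events with the IDs seen in this thread.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 5120, 5142, 1069
} -MaxEvents 50 -ErrorAction SilentlyContinue |
    Format-List -Property TimeCreated, Id, MachineName, Message

# Generate cluster.log for every node, limited to the last 60 minutes.
# C:\Temp is an assumed folder and must already exist.
Get-ClusterLog -Destination 'C:\Temp' -TimeSpan 60 -UseLocalTime

# Run cluster validation against both nodes and produce the report.
Test-Cluster -Node 'HYPER-V04', 'HYPER-V05'
```

Running the event query on each node right after the second node is started should show which volume fails first and with which error code.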

May 28th, 2015 7:24pm

Hi,

1) There are quite a lot of errors with Event IDs 5120 and 5142, and some with 1069, when both nodes are turned on.
Some examples are below:

*Event ID:5120 Cluster Shared Volume 'Volume14' ('Cluster Disk 16') has entered a paused state because of '(c00000be)'. All I/O will temporarily be queued until a path to the volume is reestablished.

*Event ID:5142 Cluster Shared Volume 'Volume14' ('Cluster Disk 16') is no longer accessible from this cluster node because of error '(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.

*Event ID:1069:Cluster resource 'Cluster Disk 16' of type 'Physical Disk' in clustered role 'e9bb39a9-178a-4235-8c0b-428b732ae777' failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

2) I tried adding a fresh new disk, but the problem is the same.

3) I tried evicting one node, restarting it, and joining it again, but the problem is the same.

4) I ran the validation report, and it produced the warning messages below:

    Validating access using Server Message Block (SMB) protocol from node HYPER-V04.internal.com to a share on node HYPER-V05.internal.com.
    Failed to validate Server Message Block (SMB) share access through the IP address of the fault tolerant network driver for failover clustering (NetFT). The connection was attempted with the Cluster Shared Volumes test user account, from node HYPER-V04.internal.com to the share on node HYPER-V05.internal.com. The network path was not found.
    Validating access using Server Message Block (SMB) protocol from node HYPER-V05.internal.com to a share on node HYPER-V04.internal.com.
    Failed to validate Server Message Block (SMB) share access through the IP address of the fault tolerant network driver for failover clustering (NetFT). The connection was attempted with the Cluster Shared Volumes test user account, from node HYPER-V05.internal.com to the share on node HYPER-V04.internal.com. The network path was not found.
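
The warning above concerns SMB access between the nodes over the cluster's NetFT adapter. A minimal sketch of some checks around that path follows. It assumes it is run on HYPER-V04 against HYPER-V05 (swap the names for the other direction), and it only gathers information without changing anything.

```powershell
# Sketch: inspect cluster networks and SMB reachability between the nodes.
Import-Module FailoverClusters

# Cluster networks and their roles and states.
Get-ClusterNetwork |
    Format-Table -Property Name, State, Role, Address -AutoSize

# Per-node interfaces on those networks.
Get-ClusterNetworkInterface |
    Format-Table -Property Node, Network, Adapter, Address, State -AutoSize

# The NetFT virtual adapter is hidden by default.
Get-NetAdapter -IncludeHidden |
    Where-Object { $_.InterfaceDescription -like '*Failover Cluster Virtual Adapter*' } |
    Format-Table -Property Name, InterfaceDescription, Status -AutoSize

# Check that the SMB port (TCP 445) on the other node is reachable.
Test-NetConnection -ComputerName 'HYPER-V05.internal.com' -Port 445
```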

Thanks.

  • Edited by GONGPEILIN Friday, May 29, 2015 3:21 PM Updating format
May 29th, 2015 3:16pm

Hi,

What I would do now is fully patch both nodes and update the servers with the latest drivers. They may already have the latest drivers, but do it anyway: if there is a corrupt driver on the system, it will be replaced.

Or force-remove the network drivers and check the hardware list to see if there are any errors.

Below are some useful links.

http://blogs.technet.com/b/askcore/archive/2010/12/16/troubleshooting-redirected-access-on-a-cluster-shared-volume-csv.aspx

Understanding the state of your Cluster Shared Volumes in Windows Server 2012 R2

http://blogs.msdn.com/b/clustering/archive/2013/12/05/10474312.aspx

June 2nd, 2015 7:54am

Hi,

After fully patching both nodes and updating the network adapter drivers, the issue is still not solved.

Is the issue related to the error below, which appears when running the cluster validation?

    Validating access using Server Message Block (SMB) protocol from node HYPER-V04.internal.com to a share on node HYPER-V05.internal.com.
    Failed to validate Server Message Block (SMB) share access through the IP address of the fault tolerant network driver for failover clustering (NetFT). The connection was attempted with the Cluster Shared Volumes test user account, from node HYPER-V04.internal.com to the share on node HYPER-V05.internal.com. The network path was not found.
    Validating access using Server Message Block (SMB) protocol from node HYPER-V05.internal.com to a share on node HYPER-V04.internal.com.
    Failed to validate Server Message Block (SMB) share access through the IP address of the fault tolerant network driver for failover clustering (NetFT). The connection was attempted with the Cluster Shared Volumes test user account, from node HYPER-V05.internal.com to the share on node HYPER-V04.internal.com. The network path was not found.

Any idea how to solve this issue?


  • Edited by GONGPEILIN 4 hours 45 minutes ago Typing error
June 7th, 2015 10:16pm

Check whether the registry value below is set to disabled (0). If it is, enable it and see if that works.

Hive: HKEY_LOCAL_MACHINE
Key path: SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters
Value name: smb2
Value type: REG_DWORD
Value data: 0x0 (0)
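
The same check can be done from PowerShell on each node. This is a minimal read-only sketch. It assumes Windows Server 2012 or later, where the SmbShare cmdlets are available.

```powershell
# Sketch: check whether the SMB2 server protocol is disabled on this node.
$key = 'HKLM:\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters'

# The value is normally absent, in which case SMB2 is enabled by default.
$value = Get-ItemProperty -Path $key -Name 'SMB2' -ErrorAction SilentlyContinue
if ($null -eq $value) {
    'SMB2 registry value not present (default: enabled)'
} else {
    "SMB2 registry value = $($value.SMB2)"
}

# The effective server configuration as seen by the SMB server itself.
Get-SmbServerConfiguration |
    Select-Object -Property EnableSMB1Protocol, EnableSMB2Protocol
```

If `EnableSMB2Protocol` turns out to be False, `Set-SmbServerConfiguration -EnableSMB2Protocol $true` is the supported way to turn it back on.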

Check this link as well

https://technet.microsoft.com/en-us/magazine/hh289314.aspx

Thanks,

Umesh.S.K

June 7th, 2015 10:31pm

Hi,

I didn't find the value name smb2 under the registry key path:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters

I think it is enabled by default on both nodes.

Below are the results of the checks from the link you provided.

* SMB is working, checked with the command: NET VIEW
* WMI is working, checked with the PowerShell command: get-wmiobject mscluster_resourcegroup -computer HYPER-V04 -namespace "ROOT\MSCluster"

Any other suggestions?


June 8th, 2015 1:47am


Hi,

Did you run cluster validation? And did you check the event logs?

Also check the Cluster blog for troubleshooting:

http://blogs.msdn.com/b/clustering/archive/2012/05/07/10301709.aspx

June 10th, 2015 2:43am

This topic is archived. No further replies will be accepted.
