CSV unavailable to one node, VMs don't fail over

Hi All,

I had this situation:

Environment:

- 2-node Hyper-V cluster

- 2 CSVs

- NODE1 is running 2 VMs on CSV1 and 2 VMs on CSV2

Issue:

- NODE1 loses connection to CSV1

- NODE1 logs Event ID 4096 at 00:01 in Microsoft-Windows-Hyper-V-Config/Admin about VM111 and VM112, both stored on CSV1

- NODE2 logs two Event ID 1167 entries in the System log at 00:03 (VM111 degraded and VM112 degraded)

- at 00:05 VM111 and VM112 are off (dirty shutdown)

Question:

Since NODE2 didn't lose its connection to CSV1 and it knows that the VMs on NODE1 are degraded, why didn't it initiate a failover?

May 27th, 2015 9:48am

This is because CSV offers a redirection feature (redirected mode).

What does that mean?

- A node normally communicates directly with a CSV volume

- If a node loses communication with a CSV, the storage traffic continues, but through the CSV owner node. You will notice this in the storage view of the failover cluster console, where the CSV is flagged as 'Redirected'

- Normally, when communication with the storage is restored, the node resumes communicating directly with the CSV

---> So the VMs will not fail over to another cluster node

But this behavior has a drawback: if the redirection is not fast enough, the VMs will lose communication with the storage, even if only for a few seconds. The guest operating system may then crash, hang, shut down or reboot unexpectedly because of I/O errors.
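
The redirection behaviour described above can be pictured with a small toy model. This is only a sketch of the decision as explained in this reply, not the real cluster algorithm or any Microsoft API; the names `IoPath` and `choose_io_path` are made up for illustration.

```python
from enum import Enum


class IoPath(Enum):
    DIRECT = "direct"          # node talks to the CSV volume itself
    REDIRECTED = "redirected"  # I/O is sent through the CSV owner node
    FAILED = "failed"          # no path at all: the VMs see I/O errors


def choose_io_path(has_direct_storage_path: bool,
                   can_reach_owner_node: bool) -> IoPath:
    """Toy decision rule (hypothetical, for illustration only)."""
    if has_direct_storage_path:
        return IoPath.DIRECT
    if can_reach_owner_node:
        return IoPath.REDIRECTED
    return IoPath.FAILED


if __name__ == "__main__":
    # Walk through the three situations described in the reply.
    for direct, owner in [(True, True), (False, True), (False, False)]:
        print(direct, owner, choose_io_path(direct, owner).value)
```

Note that in none of the three cases does the model move the VM to another node, which is the point being made: redirection changes the I/O path, not the VM's owner.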

More on CSV redirected mode HERE

In Windows Server 2016, a new enhancement will be added that pauses the VMs when they lose communication with storage. This will let you check the issue, repair it and resume the VMs.

May 27th, 2015 10:05am

Hi Samir,

I'm sorry, but this doesn't answer my question. CSV redirection didn't happen: no event 5136 was logged.

Even if redirection were the cause, that is not good enough. When a component fails in a cluster I expect a failover, so even pausing the VM wouldn't be acceptable. The failover should occur for HA, and a paused machine does not fit the meaning of HA.

Let's say it all happened because CSV redirection did not happen fast enough. I would still expect a failover, like:

NODE1 "cannot access CSV1, redirecting through NODE2, wow still cannot access, I'll tell this to NODE2"

or

NODE2 "uh VM111 and VM112 are degraded, let me ask NODE1 and eventually fail them over"

Would this make sense?

My question is:

Was that just bad luck, or is it kind of "by design"?

May 27th, 2015 10:28am

More info:

The event I found about the VMs being degraded comes from the HP SAN.

This post says that it happens when the VM fails over, but that is not my case.

May 27th, 2015 10:43am

Normally, if a virtual machine enters a failed state, the cluster will try to fail it over to another node.

Can you look through your logs to see whether the cluster tried to fail over the VMs or not?

May 27th, 2015 11:27am

I have been looking through all the logs (System event log, all event logs starting with Hyper-V), including the cluster log, and there is no sign of a failover attempt.

Around the time of the failure, this is what I have:

INFO  [RES] Physical Disk: PNP: Update volume exit, status 1168 <-- that means SERVICE_CLUSRTL_BAD_PATH 
Resource type Volume Manager Disk Group not found.
Resource type GeoCluster Replicated Disk not found.

But there's no "ProcessingFailover" as I would have expected.
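
For anyone repeating this kind of search, here is a minimal sketch of scanning cluster log text for marker strings. It assumes the log has already been exported to plain text; the function name `find_log_lines` is hypothetical, and "ProcessingFailover" is simply the string the poster expected to see, not a verified log keyword. The sample lines are the ones quoted above.

```python
from typing import Dict, Iterable, List


def find_log_lines(lines: Iterable[str],
                   markers: List[str]) -> Dict[str, List[str]]:
    """Return, for each marker, the log lines containing it (case-insensitive)."""
    hits: Dict[str, List[str]] = {marker: [] for marker in markers}
    for line in lines:
        lowered = line.lower()
        for marker in markers:
            if marker.lower() in lowered:
                hits[marker].append(line.rstrip("\n"))
    return hits


# Sample lines copied from the post above.
SAMPLE = [
    "INFO  [RES] Physical Disk: PNP: Update volume exit, status 1168",
    "Resource type Volume Manager Disk Group not found.",
    "Resource type GeoCluster Replicated Disk not found.",
]

if __name__ == "__main__":
    result = find_log_lines(SAMPLE, ["ProcessingFailover", "status 1168"])
    for marker, matched in result.items():
        print(f"{marker}: {len(matched)} line(s)")
```

To scan a real exported log, pass an open file object instead of `SAMPLE`; an empty list for a marker only means the string was not found in that file, not that no failover logic ran.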

May 27th, 2015 11:37am

"Resource type GeoCluster Replicated Disk not found."

Looks like you have a geographically distributed cluster?  If that's the case, you should talk with your storage vendor to ensure you have everything configured the way they require it to be configured.  In a geographically distributed cluster, Microsoft relies on the storage vendors to provide their own unique capabilities.  The fact that the error is reporting a problem with the storage configuration means the storage vendor needs to help you out.

May 27th, 2015 4:14pm

That's the point: there is no GeoCluster there. Even if I had very lousy storage (which it is not), I would expect the cluster to try to bring up the resources on another node. The other node didn't lose the connection to the
May 27th, 2015 4:49pm

Hi aperelli,

If you are using a 2008 R2-based cluster, please install the following hotfixes:

- Clussvc: https://support.microsoft.com/kb/2779069

- Clusres.dll: http://support.microsoft.com/kb/2798093

- Vmclusres.dll: https://support.microsoft.com/kb/2705759

Then refer to the following article to confirm that your VM file storage location is correct.

Hyper-V Resolving Event ID 4096

http://blogs.technet.com/b/jhoward/archive/2008/12/28/hyper-v-resolving-event-id-4096.aspx

If you are using a 2012 R2-based cluster, please install the following update first.

Recommended hotfixes and updates for Windows Server 2012 R2-based failover clusters

https://support.microsoft.com/en-us/kb/2920151

I'm glad to be of help to you!

May 28th, 2015 11:27pm

Alex,

I don't know how to say it anymore, so I'm not going to say it again. I have written my question twice already. I do not expect to always get an answer on a forum, but at least do not pretend the question has been answered.

Thanks.

May 29th, 2015 3:31am

You do not mention what the results are of running the cluster validation wizard.  It would be helpful to know what warnings are generated when you run a full validation of your environment.
May 29th, 2015 10:22am

Tim,

I don't want to be unpleasant, but my question wasn't about why NODE1 lost its connection to the CSV but about why a failover didn't start. I understand that we all agree it should have started; why it didn't, we don't know. If I had some storage errors, that would explain why NODE1 lost the connection, but that is not my concern. I hope it's clear at this

May 29th, 2015 10:41am

Running the cluster validation wizard and looking at the warnings may help us understand why things are happening the way they are in your environment.  Generally if something is not working in a cluster one of the first things you should do is run the cluster validation wizard.  It is a great tool to see if something is amiss.
May 29th, 2015 2:32pm

I know, Tim, I used to work for MS support. I could find flaws in my system, but I would have expected some error or warning like "hey, I'm trying to fail over these resources but I can't".
May 29th, 2015 3:12pm

I'm sorry, because I feel like I'm being rude. Sometimes we all experience unexpected behavior from our systems. I opened this thread to understand whether a failover is expected in that situation, and in my understanding we all agree it should have happened. I'll look into the cause of the lost connection and update the hypervisors, hoping that next time it will at least try to fail over the resources.

My take:

The CSV went offline for NODE1, the VMs had a dirty shutdown because their system volumes went missing, and the connection to the CSV came back before any failover could occur. NODE2 had no knowledge of the failure because it happened too fast.

Thank you all for your support, but please don't mark any of the posts as the answer. This probably wouldn't have been answered by MS support either.

May 29th, 2015 3:22pm

This topic is archived. No further replies will be accepted.
