CSV unavailable to one node, VMs don't fail over

Hi All,

I had this situation:

Environment:

- 2-node Hyper-V cluster

- 2 CSVs

- NODE1 is running 2 VMs on CSV1 and 2 VMs on CSV2

Issue:

- NODE1 loses connection to CSV1

- NODE1 logs Event ID 4096 at 00:01 in Microsoft-Windows-Hyper-V-Config/Admin about VM111 and VM112, both stored on CSV1

- NODE2 logs 2 events ID 1167 in System Log at 00:03 (VM111 degraded and VM112 degraded)

- at 00:05 VM111 and VM112 are off (dirty shutdown)

Question:

Since NODE2 didn't lose its connection to CSV1 and it knows that the VMs on NODE1 are degraded, why didn't it initiate a failover?

May 27th, 2015 9:48am

This is because CSV supports redirected access (CSV redirection).

What does that mean?

- A node communicates directly with a CSV volume

- If a node loses communication with a CSV, the storage traffic continues, but through the CSV owner node. You will notice this in the storage view of the failover cluster console, where the CSV is flagged as 'Redirected'

- Normally, when the communication to the storage resumes, the node will resume communication directly with the CSV

---> So, the VMs will not fail over to another cluster node

But this behavior has a drawback: if the redirection is not fast enough, the VMs will lose communication with the storage (even if only for a few seconds). The guest operating system may then crash, hang, shut down, or reboot unexpectedly (due to I/O errors)

More on CSV redirected mode HERE

In Windows Server 2016, a new enhancement will be added that pauses the VMs when they lose communication with storage. This will let you check the issue, repair it, and resume the VMs.

May 27th, 2015 10:05am

Hi Samir,

I'm sorry, but this doesn't answer my question. CSV redirection didn't happen: no event 5136 was logged.

Even if redirection were the cause, this is not good enough. When a component fails in a cluster I expect a failover, so even pausing the VM wouldn't be acceptable. The failover should occur for HA; a paused machine does not fit the meaning of HA.

Let's say it all happened because CSV redirection didn't happen fast enough. I would still expect a failover, like:

NODE1 "cannot access CSV1, redirecting through NODE2, wow still cannot access, I'll tell this to NODE2"

or

NODE2 "uh VM111 and VM112 are degraded, let me ask NODE1 and eventually fail them over"

Would this make sense?

My question is:

Was that just bad luck, or is it kind of "by design"?

May 27th, 2015 10:28am

More info:

The event I found about the VMs being degraded comes from the HP SAN.

In this post it's said that it happens when the VM fails over, but that is not my case.

May 27th, 2015 10:43am

Normally, if a virtual machine enters a failed state, the cluster will try to fail it over to another node.

Can you look at your logs to see whether the cluster tried to fail over the VMs?

May 27th, 2015 11:27am

I have been looking through all the logs (the System event log and every event log starting with Hyper-V), including the cluster log, and there is no trace of a failover attempt.

Around the time of the failure, this is what I have:

INFO  [RES] Physical Disk: PNP: Update volume exit, status 1168 <-- that means SERVICE_CLUSRTL_BAD_PATH 
Resource type Volume Manager Disk Group not found.
Resource type GeoCluster Replicated Disk not found.

But there's no "ProcessingFailover" as I would have expected.
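For anyone who wants to repeat that check, here is a minimal sketch of how the search could be scripted. It only looks for the two strings quoted in this thread ("ProcessingFailover" and "status 1168"); the helper name and the sample lines are hypothetical, and a real cluster log would be read from whatever file you exported it to.

```python
# Hypothetical helper: the keywords come from this thread, everything else is illustrative.
KEYWORDS = ("ProcessingFailover", "status 1168")


def find_failover_hints(lines, keywords=KEYWORDS):
    """Return (line_number, keyword, text) for every line that mentions a keyword."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                hits.append((number, keyword, line.strip()))
    return hits


# Sample input: the three log lines quoted above, not a full cluster log.
sample = [
    "INFO  [RES] Physical Disk: PNP: Update volume exit, status 1168",
    "Resource type Volume Manager Disk Group not found.",
    "Resource type GeoCluster Replicated Disk not found.",
]

hits = find_failover_hints(sample)
failover_seen = any(keyword == "ProcessingFailover" for _, keyword, _ in hits)

for number, keyword, text in hits:
    print(f"line {number}: [{keyword}] {text}")
print("failover attempt logged:", failover_seen)
```

To use it on a real log, pass an open text file instead of `sample`; a file object yields lines in the same way.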

May 27th, 2015 11:37am

"Resource type GeoCluster Replicated Disk not found."

Looks like you have a geographically distributed cluster?  If that's the case, you should talk with your storage vendor to ensure you have everything configured the way they require it to be configured.  In a geographically distributed cluster, Microsoft relies on the storage vendors to provide their own unique capabilities.  The fact that the error is reporting a problem with the storage configuration means the storage vendor needs to help you out.

May 27th, 2015 4:14pm

That's the point: there is no GeoCluster there. Even if I had very lousy storage (which I don't), I would expect the cluster to try to bring up the resources on another node. The other node didn't lose the connection to the
May 27th, 2015 4:49pm

Hi aperelli,

If you are using a 2008 R2-based cluster, please install the following hotfixes:

- Clussvc: https://support.microsoft.com/kb/2779069

- Clusres.dll: http://support.microsoft.com/kb/2798093

- Vmclusres.dll: https://support.microsoft.com/kb/2705759

Then refer to the following article to confirm that your VM file storage location is correct.

Hyper-V Resolving Event ID 4096

http://blogs.technet.com/b/jhoward/archive/2008/12/28/hyper-v-resolving-event-id-4096.aspx

If you are using a 2012 R2-based cluster, please install the following update first.

Recommended hotfixes and updates for Windows Server 2012 R2-based failover clusters

https://support.microsoft.com/en-us/kb/2920151

I'm glad to be of help to you!

May 28th, 2015 11:24pm

Alex,

I don't know how to say it anymore, so I'm not going to say it again. I have written my question twice already. I don't expect to always get an answer on a forum, but at least do not pretend the question has been answered.

Thanks.

May 29th, 2015 3:29am

This topic is archived. No further replies will be accepted.
