On failover, when the DC NIC is disconnected: No authority could be contacted for authentication (0x80090311)

Hi All,

Failover in my Hyper-V cluster doesn't work when a specific NIC (Office), the network over which the DC is reachable, is disconnected.

My cluster:

  • 3 nodes running Hyper-V 2012 R2 Core (Dell R630)
  • Direct-attached shared storage to all 3 nodes through MPIO (Dell MD3200)
  • 5 networks:
  1. Office network (192.168.20.x; via this network the nodes are joined to the domain and reach the domain controller)
  2. Factory 1 network (192.168.0.x; no DC, no gateway, no DNS; the factory PLCs use this network)
  3. Factory 2 network (192.168.1.x; again no DC, no gateway, no DNS; also used for PLCs)
  4. Migration network (10.0.0.x; dedicated switch, no DC, no gateway, no DNS; only used for migration)
  5. Cluster heartbeat network (172.16.0.x; dedicated switch, no DC, no gateway, no DNS; used for the cluster)
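
For reference, the role the cluster assigned to each of these networks can be checked from any node with the FailoverClusters module; just a diagnostic sketch:

  # List the cluster networks and the role the cluster gave them
  # (Role 3 = cluster and client, 1 = cluster only, 0 = not used by the cluster)
  Import-Module FailoverClusters
  Get-ClusterNetwork | Format-Table Name, Role, Address, AddressMask -AutoSize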

The cluster is up, running and validated. When testing failover, a couple of scenarios work all right, like:

  • Killing power to one of the nodes: the cluster senses this and restarts the VMs on a different node.
  • Removing all network connections from one node: the VMs are restarted on a different node.
  • A couple of VMs use 2 networks, Factory 2 and Office. When the Factory 2 network is disconnected, the cluster senses this and the VMs are live migrated to a different node which still has both network connections alive.

So far so good, but if I:

  • Disconnect the Office network on a node that hosts the VMs which use 2 networks (Factory 2 and Office)

I can see that the cluster has sensed this and wants to live migrate the VMs to a different host.
After a few seconds the VMs have the status 'Migration Queued' instead of migrating directly.
After a while there's an error that the VMs were not migrated; they are still running on the original node, which no longer has all of its network connections.

After a lot of digging through logs I found this: No authority could be contacted for authentication (0x80090311).
I understand why: that node no longer has a connection to the Office network, on which the DC resides, so it can no longer perform the Kerberos constrained delegation.
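
For completeness, the authentication protocol live migration is configured to use can be checked per node with the in-box Hyper-V module (just a sketch; as far as I understand, both Kerberos and CredSSP still need to reach a DC, so switching the setting wouldn't remove the dependency):

  # Show how live migration is configured on this host
  Get-VMHost | Format-List VirtualMachineMigrationEnabled,
                           VirtualMachineMigrationAuthenticationType,
                           MaximumVirtualMachineMigrations

  # Switching between Kerberos and CredSSP is possible, but both still need
  # a reachable domain controller at migration time
  # Set-VMHost -VirtualMachineMigrationAuthenticationType Kerberos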

But what I don't get is how to resolve this issue.
How can I have all the nodes in the cluster trust each other and accept live migration even when the DC cannot be reached at all?

Looking forward to your responses!

BR,

Mark

August 28th, 2015 12:34pm

Because failover clustering is based on redundancy, the question I have is: is the VM still accessible after the network failed because you unplugged it? Failover clusters will have multiple network adapters; when one fails, it isn't unexpected for the server to still be accessible from one of the other configured adapters.

In my experience, if I'm having network adapter problems I don't think I would want machines migrating, especially if the problem is switch-related; this may be what you are seeing.

I would typically have two network cards, team them, and rely on this redundancy to handle a physical network or card failure, leaving the machines where they are.
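
A minimal sketch of what that looks like with the in-box teaming on 2012 R2 (the adapter names are placeholders, check yours with Get-NetAdapter):

  # Team two physical NICs into one logical 'Office' adapter
  New-NetLbfoTeam -Name "OfficeTeam" -TeamMembers "Office 1","Office 2" `
                  -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic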

August 28th, 2015 12:54pm

Hi Darren,

Thanks for your reply.

Yes, the VM is still accessible from the Factory 2 network.
I agree with your opinion on teaming network cards, but for now that is not yet the case in our situation. In the near future we'll be doing this, but other infrastructure (a ring) needs to be prepared first. I currently see no use in connecting 2 teamed network cards to the same switch.

But I'd still like to know how all the nodes in my cluster can trust each other so that migration can be done without a DC.
We also have VMs which are only connected to the Factory 1 or Factory 2 network; they don't have anything to do with the Office network, yet the nodes on which they run need a DC from the Office network to authenticate in case of a migration. This seems a bit odd to me...

I've tried taking the nodes out of the domain (hoping that the authentication issue would disappear), but this only caused the cluster to no longer be able to connect to the nodes, so I rejoined them to the domain.

What am I missing? It doesn't seem like such a strange situation I'm in, or am I wrong? Am I overlooking something?

Thanks!

Mark

August 28th, 2015 1:16pm

So,

as stated here, technet.microsoft.com/nl-nl/library/dn265970.aspx, I can't use live migration without a DC.
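
(For reference, this is roughly what the constrained delegation from that article looks like on each node's computer account; node and domain names below are placeholders, and it still needs a reachable DC at the moment the migration runs:)

  # On NODE1's computer account, allow delegation of the CIFS and
  # VM migration services to NODE2 (repeat for every node pair)
  Import-Module ActiveDirectory
  Get-ADComputer "NODE1" | Set-ADObject -Add @{
      'msDS-AllowedToDelegateTo' = @(
          'cifs/NODE2.contoso.local',
          'Microsoft Virtual System Migration Service/NODE2.contoso.local')
  }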

So what are my options?

- Add a DC to the factory networks; I really don't want to do that, and I don't know if the PLCs will be happy with it.
- Add a virtual DC to every node so that it can still authenticate for the live migration.

Any second thoughts?

TIA

Mark

August 30th, 2015 1:36pm

The DC is critical. It is generally on a routed network that is reachable. It is also highly recommended to have at least two DCs so that, should one fail, there is another to take on the authentication/authorization duties. It's overkill to add another DC VM to every node, because that would mean all the DCs have to be able to communicate with each other. And if they can communicate with each other, then the other nodes' VMs that are talking to the 'local DC VM' should be able to communicate with the reachable DC anyway.
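
If you want to confirm whether a node can still locate a DC over its remaining networks, something along these lines can be run on the node (purely a diagnostic sketch; the domain name is a placeholder):

  # Ask the DC locator which domain controller, if any, this node can reach
  nltest /dsgetdc:contoso.local

  # Verify the secure channel between this node and the domain
  Test-ComputerSecureChannel -Verbose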
August 30th, 2015 9:50pm
