Can anyone explain this one?
I have the following setup:
- Switches:
3 x Cisco 3750X-24TS and 1 x 3750G-24TS-1U in a stack
s/w: 15.0(2)SE on all
EtherChannel groups created as LACP/Active with 4 to 6 links per group
- Servers #1
4 x Dell servers with Broadcom NICs in a Hyper-V 2012 R2 cluster
NIC teaming configured using MS Hyper-V Load Balancing (i.e. not using the Broadcom utility to aggregate the NICs).
There are 6 NICs per LACP group on each server connected to the 3750(x) switch stack.
The NICs are connected across all 4 members of the stack. LAG created as LACP/Active
- Servers #2
2 x Dell servers with Intel NICs running Windows Server 2012 R2 in a cluster (used as an HA file cluster).
NIC teaming created using Win2012 load balancing. LAG created as LACP/Active
There are 4 NICs per LACP group on each server connected to the 3750(x) switch stack
The NICs are connected across all 4 members of the stack. LAG created as LACP/Active
Issue:
On Thursday a helpful telecoms engineer knocked out power to switch #2 of the 3750(x) core switch stack, causing that switch to reboot. When this happened, it appears that rather than just losing connectivity on the LACP links connected to that single switch, the entire LACP interface went down. As a result, both the Hyper-V cluster and the HA file cluster started throwing roles around and denying the existence of the other nodes. Chaos ensued.
Once switch #2 had rebooted, it rejoined the stack correctly and normal service, in terms of network connectivity, resumed. The file cluster started operating again without assistance, but all of the VMs had rebooted and were in various states of chaos.
So the question is: what could have happened to the LACP LAG interface on the switches, or to the load balancing on Hyper-V, to cause the entire group to fail when only one or two links went down?
Thanks in advance - any help very gratefully received!
Hi Martin,
As far as I know, the failure of one link would not cause the group to fail.
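To make that expectation concrete, here is a rough Python sketch. It is only a toy model, not how the switch stack or Windows teaming actually implements LACP. The names `Link`, `lag_is_up` and `min_links` are made up for illustration, and the threshold of one surviving link is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    nic: str           # hypothetical NIC label on the server
    stack_member: int  # stack member (1-4) the NIC is cabled to


def lag_is_up(links, failed_members, min_links=1):
    """Return (up, surviving) for a LAG after the given stack members fail.

    Assumption: the LAG stays up while at least `min_links` member
    links are still connected to a working stack member.
    """
    surviving = [l for l in links if l.stack_member not in failed_members]
    return len(surviving) >= min_links, surviving


# Four NICs spread across all four stack members, as on the file cluster nodes.
team = [Link(f"NIC{i}", i) for i in range(1, 5)]

# Stack member 2 loses power: only the link cabled to it should drop.
up, surviving = lag_is_up(team, failed_members={2})
print(up, [l.nic for l in surviving])  # True ['NIC1', 'NIC3', 'NIC4']
```

In this model the team only goes down if every stack member it is cabled to fails, so losing all links at once points to something beyond the single switch reboot.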
Are there any related events on the server?
Best Regards,
Leo
Hi
Thanks for the reply.
On one of the file cluster servers' event logs I can see:
1) the NIC connected to the switch that went down disconnecting first (LAN#2. Event 16949 Source MsLbfoSysEvtProvider and Event 4 Source b57nd60a).
2) then (inexplicably) I can see the other LAN NIC in the LACP team also disconnect with the same events.
3) One of the Cluster NICs (there is a separate team for the "Cluster" network) reports as disconnected (again - this interface isn't connected to the switch that lost power) and then there is an event saying that the cluster has lost all network connectivity.
After that, everything pretty much breaks down, as it can no longer see the quorum or the other node in the cluster.
So it does appear that after one NIC went down (expected when the 2nd switch in the stack went down) the other NICs also dropped out.
What other information would be useful?
Thanks
Martin
Hi Martin,
Based on those events, I found several posts describing similar errors.
The Broadcom NICs may be the cause. It may be worth contacting Broadcom or Dell support, as they may have a fix for this issue.
Also, have you tried the Switch Independent teaming mode? Some posts report that it helped.
Best Regards,
Leo