Loss of a single LACP link results in LACP group traffic failing
Hi


Can anyone explain this one?

I have the following setup:

- Switches:

    3 x Cisco 3750X-24TS and 1 x 3750G-24TS-1U in a stack
    s/w: 15.0(2)SE on all
    Etherchannel groups created as LACP/Active with 4 to 6 links per group


- Servers #1

    4 x Dell servers with Broadcom NICs in a Hyper-V 2012 R2 cluster
    NIC teaming configured using MS Hyper-V Load Balancing (i.e. not using the Broadcom utility to aggregate the NICs).
    There are 6 NICs per LACP group on each server connected to the 3750(x) switch stack.
    The NICs are connected across all 4 members of the stack. LAG created as LACP/Active

- Servers #2

    2 x Dell servers with Intel NICs running Windows 2012 R2 in a cluster (used as an HA file cluster).
    NIC teaming created using Win2012 load balancing.
    There are 4 NICs per LACP group on each server connected to the 3750(x) switch stack.
    The NICs are connected across all 4 members of the stack. LAG created as LACP/Active


Issue:

On Thursday a helpful telecoms engineer knocked out power to switch #2 of the 3750(x) core switch stack, causing that switch to reboot. When this happened it appears that, rather than just losing connectivity on the LACP links connected to that single switch, the entire LACP interface went down, causing both the Hyper-V cluster and the HA file cluster to start throwing roles around and denying the existence of the other nodes. Chaos ensued.

Once switch #2 had rebooted it rejoined the stack correctly and normal service, in terms of network connectivity, resumed. The file cluster started operating again without assistance, but all of the VMs had rebooted and were in various states of chaos.

So the question is: what could have happened to the LACP LAG interface on the switches, or to the load balancing on Hyper-V, to cause the entire group to fail when only one or two links went down?

Thanks in advance - any help very gratefully received!
 
August 28th, 2015 8:51am

Hi Martin,

As far as I know, the failure of one link would not cause the group to fail.
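
To illustrate the expected behaviour, here is a minimal Python sketch. It is a toy model, not a real LACP implementation: the `Lag` class, the `min_links` threshold, and the port names are all hypothetical and chosen only for illustration. It assumes the group stays up as long as at least `min_links` member ports are still up.

```python
from dataclasses import dataclass, field


@dataclass
class Lag:
    """Toy model of a link aggregation group (not a real LACP implementation)."""
    members: dict = field(default_factory=dict)  # port name -> True if link is up
    min_links: int = 1  # assumed threshold; hypothetical parameter for this sketch

    def set_link(self, port: str, up: bool) -> None:
        self.members[port] = up

    def active_ports(self) -> list:
        # Ports that are still up, in a stable (sorted) order.
        return sorted(p for p, up in self.members.items() if up)

    def is_up(self) -> bool:
        # The group is considered up while enough members remain active.
        return len(self.active_ports()) >= self.min_links


def fail_switch(lag: Lag, switch_prefix: str) -> None:
    """Mark every member port on one stack member as down."""
    for port in list(lag.members):
        if port.startswith(switch_prefix):
            lag.set_link(port, False)


# One member port on each of four stack members (hypothetical names).
lag = Lag()
for port in ("sw1/p1", "sw2/p1", "sw3/p1", "sw4/p1"):
    lag.set_link(port, True)

fail_switch(lag, "sw2/")  # simulate stack member 2 losing power
print(lag.is_up(), lag.active_ports())
```

In this model, losing the ports on one stack member only shrinks the set of active ports. If the real team drops entirely, something beyond the single link failure is likely involved.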

Are there any related events on the server?

Best Regards,

Leo

September 1st, 2015 9:47am

Hi

Thanks for the reply.

On one of the file cluster servers' event logs I can see:

1) the NIC connected to the switch that went down disconnecting first (LAN#2: Event 16949, Source MsLbfoSysEvtProvider, and Event 4, Source b57nd60a).

2) then (inexplicably) I can see the other LAN NIC in the LACP team also disconnect with the same events.

3) One of the Cluster NICs (there is a separate team for the "Cluster" network) reports as disconnected (again - this interface isn't connected to the switch that lost power) and then there is an event saying that the cluster has lost all network connectivity.

After that everything pretty much breaks down as it can no longer see the Quorum or the other node in the cluster.

So it does appear that after one NIC went down (expected when the 2nd switch in the stack went down) the other NICs also dropped out.

What other information would be useful?

Thanks

Martin

September 1st, 2015 11:04am

Hi Martin,

Based on those events, I found some posts describing similar errors.

The Broadcom NICs may be the cause. You could contact Broadcom or Dell support, as they may have a solution for this kind of issue.

Also, have you tried Switch Independent teaming mode? Some posts report that it helped.

Best Regards,

Leo 

September 2nd, 2015 11:45pm

