All team NICs become disconnected (MsLbfoSysEvtProvider event 16949)

I have a cluster where all the team members become disconnected, causing the cluster to fail over.

Yes, this is a Hyper-V cluster, BUT the failing team is not the one attached to the Hyper-V switch; it is a second team that has 3 interfaces (CSV, live migration, management) with nothing configured on the default interface. It primarily seems to be the management interface that initially disconnects.

I get a ton of event 16949 from MsLbfoSysEvtProvider ("Team NIC <GUID> has disconnected"), and then finally the cluster fails. Within five or so minutes all the members reconnect and the cluster is back to normal. This is happening a couple of times a day, and the VMs terminate abnormally when the failover occurs, so I need to find a solution. I have a case open with MS and Dell (R720s); so far all they do is ask silly questions. This has been ongoing for two weeks now. Hoping this happens to Office 365 or MS Azure servers... then maybe an intermittent issue will get some attention...

I am using the latest updates on 2012, and the latest drivers and firmware for the Broadcom BCM5719 A1 NIC cards. The team uses LACP switch-dependent mode with address-hash load balancing to a Cisco 3750, and the port channel looks clean, showing a single interface for the team.
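For anyone else chasing this, the in-box LBFO cmdlets and the event log can show the team state and the disconnect events directly. A minimal sketch, assuming a team named "Team1" (a placeholder; substitute your own team name), run from an elevated PowerShell prompt on the host:

```powershell
# Show all teams with their teaming mode and load-balancing algorithm
Get-NetLbfoTeam

# Per-member status for the team; a flapping member shows up here
Get-NetLbfoTeamMember -Team "Team1" |
    Format-Table Name, OperationalStatus, FailureReason

# Pull the 16949 disconnect events from the System log
Get-WinEvent -FilterHashtable @{ LogName = 'System'; Id = 16949 } |
    Where-Object { $_.ProviderName -eq 'MsLbfoSysEvtProvider' } |
    Format-Table TimeCreated, Message -AutoSize
```

Correlating the TimeCreated values across hosts (and against the switch logs) can help show whether the drops originate host-side or switch-side.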

Maybe I should just switch to VMware. Good gosh, MS... frustrated.

May 29th, 2013 10:45pm

Hi,

Thank you for the post.

How did you configure the NIC team in Windows Server 2012? Do you use third-party teaming software? Since you have opened a case with Microsoft, you should wait; a dedicated Support Professional can assist with this request.

Regards,

June 3rd, 2013 5:07am

Nick,

This is all in-box 2012 teaming; no third-party software is involved.

Update:

I tracked down an MS cluster person at TechEd, in addition to the still-open tickets with MS PSS and Dell. So far there is no new information, except that the MS person has indicated it is most likely a NIC driver bug; he hasn't revealed any details as to which hardware/driver combinations are affected...

Right now we have worked around the problem by changing the configuration to use Switch Independent teaming instead of switch-dependent LACP. That kind of defeats the purpose of link aggregation, as the VMs are now restricted to the bandwidth available on a single NIC (in my case 1 Gb).
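For reference, that mode change does not require rebuilding the team; Set-NetLbfoTeam can change an existing team in place. A sketch, again assuming a team named "Team1" (placeholder), run from an elevated PowerShell prompt:

```powershell
# Confirm the current mode (expected: Lacp before the change)
Get-NetLbfoTeam -Name "Team1" |
    Format-List Name, TeamingMode, LoadBalancingAlgorithm

# Change the existing team from switch-dependent LACP to Switch Independent
Set-NetLbfoTeam -Name "Team1" -TeamingMode SwitchIndependent
```

Expect a brief connectivity blip on the team while the mode changes, so do this in a maintenance window if you can.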

For what it's worth, MS PSS has been no help whatsoever; all they say is that the issue is intermittent, so they can't help. Nice that my production environment goes completely offline, but no one can help... :(

Please do not mark this as an answer; the issue is not resolved from my perspective, and I don't believe it would help anyone else with this issue.



  • Edited by SteveLith Wednesday, June 12, 2013 2:08 PM
June 12th, 2013 2:01pm

Hey there,

I experienced the same problem today:

one NIC of the combined host-management/VM switch team (two NICs) went down, and then the complete cluster failed.

Any update on this problem?

Config: IBM System x server with Intel onboard network card.

Thank you!

June 27th, 2013 10:10am

Hello everyone,

I would like to revive this thread, as I am running into the same problem as you, namely:

a 2012 Hyper-V cluster (fully updated)
a Broadcom team, 2x 10 GbE (LACP, Hyper-V port load balancing)
vNICs created over the vSwitch: one each for live migration, management, and CSV

When I do a live migration, I get error 16949 and I lose the teaming; sometimes I see a BSOD (DPC_WATCHDOG_VIOLATION).

@steve: Have you had any updates from MS?

Thank you very much in advance



October 29th, 2013 1:29pm

Hi Frederic,

Nope... no info from MS on this, and I have not heard back at all from my contact at TechEd. He had an engineer contact me to try to reproduce the issue, but I never heard back. The only thing I could do was use switch-independent mode. Obviously you give up some link aggregation when a VM requires more than the bandwidth of one adapter, but at least it works. :(

Sorry, I have no other information...

October 29th, 2013 1:44pm

Hi Steve,

Thank you for your quick answer.

I opened a ticket with Microsoft, but for the moment we have not found a solution, and it is starting to drag on.

Unfortunately I cannot move to switch-independent mode; it would mean changing the configuration of the switches and the core network. :/

Thanks
----------------------------------------------------------

Frédéric Stefani

October 29th, 2013 2:09pm

I can fully appreciate the reluctance to change the switches... BUT in case you haven't already tried this: "Switch Independent" mode will work even if the switch ports are still configured for EtherChannel LACP.

I know this because this issue was affecting our production virtualization environment and I didn't have time to wait for port configuration changes, so I just changed the teaming configuration in 2012 to Switch Independent to see if it fixed the issue. It turns out EtherChannel doesn't complain, and the teaming works as expected. I ran this way for nearly two weeks before the ports were reconfigured, with no ill effects. AND best of all, my cluster stayed up! :)

I will try to find the contact info for the person at MS who was looking into the issue... stand by...

October 29th, 2013 6:05pm

Frederic,

I found my old case. I forgot I had opened cases with both Dell and MS for this issue; the case was finally closed due to my changing to Switch Independent mode...

For your reference:  SR 11305101043028

I must not have kept the emails between myself and the MS representative I talked to at TechEd. :( I am sorry for that; I know that I sent him all the case material for his reference as well...

October 29th, 2013 6:12pm

Steve,

Thank you very much.

I will try to move my teaming to switch-independent mode without redoing the switch setup, which is LACP.

I had not tested that. ;)

I'll let you know the result

Frédéric Stefani


October 29th, 2013 8:34pm

I'm having the same problem.  

My configuration is Server 2012 R2 on an HP DL360p Gen8 with 4x GbE, running as a storage server presenting iSCSI to a VMware cluster.

The problem we had was that the team reset itself over about a one-second duration, and the iSCSI hosts dropped their connections, sending 20 VMs into a tailspin. We had installed the latest HP drivers and firmware a week ago, and this has only happened once so far, but I am very fearful of it recurring. We have another identically configured server that, thus far, has experienced no issues.

The servers are running Broadcom NICs. It seems to me like it's less of a NIC issue and more of an MS bug. The reason I say that is that we now have IBM, Dell, and HP hardware, with both Intel and Broadcom NICs.

Jason.


February 17th, 2014 9:42am

I am having the same problem as well.


I am running an HP DL380 Gen8 server with 4 Broadcom NICs. The NICs are configured in two LACP teams.

The server is a stand-alone Windows Server 2012 R2 Hyper-V host with internal storage.

The server had been working perfectly for 10 days when suddenly the NIC team connected to the VM switch disconnected. All the VMs lost network connectivity permanently. I was able to shut down the VMs and tried to reboot the host, but the server did not go down until I used the power switch!

The problem started approximately one minute after our Veeam VSS-based backup started working with the VMs on the server, but I don't know if that is related.

We have two other identical servers running Windows Server 2012 (not R2). They have been working fine for 8 months now.

Has anyone got any new information on this problem?

Magz

May 26th, 2014 8:07am

Hi Magz,

We haven't had any further issues, thankfully. As our storage servers are not backed up by Veeam (although the VMs are) and the issue occurred during business hours, our Veeam infrastructure wouldn't have been the cause.

Since the firmware updates we have not had any issues. We also have Windows Update turned off on these servers, so they are running baseline 2012 R2.

Hope this info helps.

Jason.

May 26th, 2014 10:25am

I'm having the same issue here.

Dell R720xd running Server 2012 R2 with Hyper-V

3-port LACP team dedicated to VMs

Connecting to Dell 5548 switches.

September 25th, 2014 3:32pm

Did anyone get anywhere with this, please? We have the exact same issue and the same setup too: Dell servers and switches, with Dell/Broadcom NICs...

Thanks

February 2nd, 2015 2:53pm

I have the same problem and the same setup too: two identical Dell R720 servers with Broadcom NICs.

Strangely, the NIC teams are dropping at exactly the same time on both servers, down to the second. I'm wondering if the switch is the problem.

I'm going to change one of them from LACP to Static to see if that helps.

EDIT: I forgot to add that the switch logged 100% CPU just before one of the instances when the teams dropped (but only in 1 of the 3 instances).
March 12th, 2015 10:31am

The only way we fixed it was to get rid of Broadcom and swap to Intel NICs!
March 12th, 2015 10:33am

After 9 weeks of struggling with the integrated Broadcom NICs in our HP DL380 Gen8s, I bought new Intel-based NICs and installed them in the servers.

I have not had any problems since.

It is very sad that the large server manufacturers integrate Broadcom NICs on their motherboards when they clearly have these weaknesses...

March 13th, 2015 3:18am

This topic is archived. No further replies will be accepted.
