NLB in VMWare troubleshooting (Network Steve Forum)

Windows Server 2012 R2
NLB for multiple IIS servers
each IIS server runs in a separate VM in VMWare vCenter; I do not know whether both VMs are on the same hardware
1 NIC per VM
additional VMs for the database and the AD DC are in the same subnet, and the IIS servers access them
VMs connected via vSwitch
it is just a test environment
pinging between the NLB nodes works

I tried unicast and multicast mode; in both cases it works only intermittently. Right after setting up NLB in either mode it works fine, but after a restart the problems start.

In both modes, I get the following warning in the Event log:

"NLB cluster [172.26.101.21]: NLB detected duplicate cluster subnets. This may be due to network partitioning, which prevents NLB heartbeats of one or more hosts from reaching the other cluster hosts. Although NLB operations have resumed properly, please investigate the cause of the network partitioning."

Unicast

often only 1 of the VMs converges and the other one stays in the converging state indefinitely
1 of the NLB VMs sometimes has only limited internet access

Multicast

1 VM converges fast, the other one needs >= 10 minutes
after restarting one of the VMs the virtual cluster IP is not accessible at all (neither website nor ping)
I captured the network traffic on the client (same subnet) using Wireshark. The client does not even send an HTTP request, but it does send ARP requests for my virtual cluster IP, and they are answered appropriately.

I only tested access from a client VM in the same subnet, so router limitations concerning ARP should not be a problem.

I read these articles, but they did not solve my problem (I did not follow the instructions for changing the router configuration, because even the test with a client inside the same subnet did not work):

http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1006558

http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=1006778

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1556

Here is my configuration. If you need further information, I will provide it:

Interface Ethernet0 Parameters
----------------------------------------------
IfLuid                             : ethernet_10
IfIndex                            : 12
State                              : connected
Metric                             : 10
Link MTU                           : 1500 bytes
Reachable Time                     : 38000 ms
Base Reachable Time                : 30000 ms
Retransmission Interval            : 1000 ms
DAD Transmits                      : 3
Site Prefix Length                 : 64
Site Id                            : 1
Forwarding                         : disabled
Advertising                        : disabled
Neighbor Discovery                 : enabled
Neighbor Unreachability Detection  : enabled
Router Discovery                   : dhcp
Managed Address Configuration      : enabled
Other Stateful Configuration       : enabled
Weak Host Sends                    : disabled
Weak Host Receives                 : disabled
Use Automatic Metric               : enabled
Ignore Default Routes              : disabled
Advertised Router Lifetime         : 1800 seconds
Advertise Default Route            : disabled
Current Hop Limit                  : 0
Force ARPND Wake up patterns       : disabled
Directed MAC Wake up patterns      : disabled
ECN capability                     : application

NLB Cluster Control Utility V2.6

Cluster 172.26.101.21



=== Configuration: ===



Current time                = 24.08.2015 10:55:25
ParametersVersion           = 6
CurrentVersion              = V2.6
EffectiveVersion            = 00000201
InstallDate                 = 0x55D58FB7
HostPriority                = 1
ClusterName                 = www.example.com
ClusterIPAddress            = 172.26.101.21
ClusterNetworkMask          = 255.255.0.0
DedicatedIPAddresses/       = 172.26.101.22/255.255.0.0
DedicatedNetworkMasks       
McastIPAddress              = 239.255.101.21
ClusterNetworkAddress       = 03-bf-ac-1a-65-15
IPToMACEnable               = ENABLED
MulticastSupportEnable      = ENABLED
IGMPSupport                 = DISABLED
MulticastARPEnable          = ENABLED
MaskSourceMAC               = ENABLED
AliveMsgPeriod              = 1000
AliveMsgTolerance           = 5
MaxConnectionDescriptors    = 262144
FilterICMP                  = DISABLED
ClusterModeOnStart          = STARTED
PersistedStates             = NONE
NBTSupportEnable            = ENABLED
UnicastInterHostCommSupport = ENABLED
BDATeaming                  = NO
TeamID                      = 
Master                      = NO
ReverseHash                 = NO
IdentityHeartbeatPeriod     = 10000

NumberOfRules (2):

      VIP       Start  End  Prot   Mode   Pri Load Affinity
--------------- ----- ----- ---- -------- --- ---- --------
ALL                80    80 TCP  Multiple      Eql None

ALL               443   443 TCP  Multiple      Eql None




=== Event messages: ===



Could not open event log due to:

The operation completed successfully.

Edited by chipper12 Tuesday, August 25, 2015 6:16 PM

August 24th, 2015 9:01am

It has now worked for some time in multicast mode. After I restarted cluster host1, convergence took some time, but according to Nlbmgr.exe it succeeded. However, the virtual IP was not reachable at all at that point.

event log on host1:

Error: Iphlpsvc, Unable to update the ip address on isatap interface {...}
DNS Client Events: Failed to register PTR records for {...}

event log on host2:

NLB detected duplicate cluster subnets. (see first post)

After another restart of host1 it works fine again, but obviously there are problems.

August 28th, 2015 3:00am

Hi Chipper,

Check the following:

Upstream routers might require a static Address Resolution Protocol (ARP) entry. This is because routers might not accept an ARP response that resolves unicast IP addresses to multicast MAC addresses.

Without IGMP, switches might require additional configuration to tell the switch which ports to use for the multicast traffic.

Upstream routers might not support mapping a unicast IP address (the cluster IP address) with a multicast MAC address. In these situations, you must upgrade or replace the router. Otherwise, the multicast method is unusable.
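To illustrate why such an ARP reply can be rejected: a MAC address is a multicast (group) address when the least significant bit of its first octet is set, and the cluster MAC from the configuration above (03-bf-ac-1a-65-15) has that bit set. A minimal Python sketch follows; `is_multicast_mac` is a hypothetical helper name, and the 02-bf-... address is only shown as the form the cluster MAC would take in unicast mode.

```python
def is_multicast_mac(mac: str) -> bool:
    """Return True if the I/G bit (lowest bit of the first octet) is set."""
    first_octet = int(mac.replace(":", "-").split("-")[0], 16)
    return bool(first_octet & 0x01)


# ClusterNetworkAddress reported in multicast mode in this thread
print(is_multicast_mac("03-bf-ac-1a-65-15"))  # True
# Form of the cluster MAC in unicast mode (assumed here for comparison)
print(is_multicast_mac("02-bf-ac-1a-65-15"))  # False
```

A device that refuses to learn a multicast MAC for a unicast IP would drop the first kind of ARP reply, which is the case where a static ARP entry is needed.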

Best Regards,

Leo

August 31st, 2015 10:53pm

>> Upstream routers might require a static Address Resolution Protocol (ARP) entry. This is because routers might not accept an ARP response that resolves unicast IP addresses to multicast MAC addresses.

As I said, the problem also appears with a client in the same subnet as the servers, so that cannot be the cause.

September 1st, 2015 1:58am

Hi Chipper,

I ran a test in Hyper-V, and the virtual IP was reachable all the time.

To rule out the effect of the physical devices, I suggest a test: on a client connected to the same virtual switch as the NLB cluster, try to access the virtual IP and check the result.
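Such a test could be scripted roughly as follows. This is only a sketch, assuming Python is available on the client; `probe` and `check_ports` are hypothetical helper names, and ports 80 and 443 are taken from the port rules in the posted configuration.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


def check_ports(host: str, ports=(80, 443), timeout: float = 2.0) -> dict:
    """Probe each port once and return a mapping of port -> reachable."""
    return {port: probe(host, port, timeout) for port in ports}
```

Calling `check_ports("172.26.101.21")` on the client before and after restarting one of the cluster hosts would show whether the virtual IP accepts connections at that moment.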

Best Regards,

Leo

September 1st, 2015 10:08am

>> To rule out the effect of the physical devices, I suggest a test: on a client connected to the same virtual switch as the NLB cluster, try to access the virtual IP and check the result.

Thanks for your effort, but all my VMs were connected to the same virtual switch all the time.

September 1st, 2015 10:47am

Hi Chipper,

I compared the configuration of my NLB cluster with yours. The only difference I found is:
McastIPAddress = 239.255.101.21

But I'm afraid I could not find any official document about it.

I tested again and captured the network traffic on both NLB nodes and on the client. It seems that the McastIPAddress is not used during connection setup in my test.

>>event log on host1:

Error: Iphlpsvc, Unable to update the ip address on isatap interface {...}
DNS Client Events: Failed to register PTR records for {...}

event log on host2:

NLB detected duplicate cluster subnets. (see first post)

Considering that we are using different hypervisors, and given these events, there is probably a problem with the NIC of the VM or with the virtual switch. We may contact VMware support to see if they have seen a similar issue.

https://www.vmware.com/support/contacts

As it is working now, we may monitor the NLB cluster to see if there are any errors later.

Best Regards,

Leo

September 2nd, 2015 5:43am

>> But I'm afraid I could not find any official document about it.

How the multicast IP is built is described here: https://support.microsoft.com/en-us/kb/283028
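As a rough illustration of that scheme, the values in my configuration can be reproduced from the cluster IP. This is only a sketch of my understanding of the article: for a cluster IP w.x.y.z the cluster MAC is 02-bf-w-x-y-z in unicast mode and 03-bf-w-x-y-z in multicast mode, and with IGMP the MAC is 01-00-5e-7f-y-z with the multicast IP 239.255.y.z. The function and key names are my own.

```python
import ipaddress


def fmt_mac(octets: bytes) -> str:
    """Format raw bytes as a dash-separated lowercase MAC address."""
    return "-".join(f"{b:02x}" for b in octets)


def nlb_addresses(cluster_ip: str) -> dict:
    """Derive the NLB cluster addresses from the cluster IP w.x.y.z."""
    w, x, y, z = ipaddress.IPv4Address(cluster_ip).packed
    return {
        "unicast_mac": fmt_mac(bytes([0x02, 0xBF, w, x, y, z])),
        "multicast_mac": fmt_mac(bytes([0x03, 0xBF, w, x, y, z])),
        "igmp_multicast_mac": fmt_mac(bytes([0x01, 0x00, 0x5E, 0x7F, y, z])),
        "igmp_multicast_ip": f"239.255.{y}.{z}",
    }


addrs = nlb_addresses("172.26.101.21")
print(addrs["multicast_mac"])      # 03-bf-ac-1a-65-15
print(addrs["igmp_multicast_ip"])  # 239.255.101.21
```

Both printed values match ClusterNetworkAddress and McastIPAddress in the configuration from my first post.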

>> As it is working now, we may monitor the NLB cluster to see if there are any errors later.

In the last few days I still had the same problems at times. They appear especially when I restart just 1 of the cluster hosts, but not every time I reboot it, and usually it works again after another reboot.

>> We may contact VMware support to see if they have seen a similar issue.

You said "We": Should I do that?

September 2nd, 2015 7:19am

Hi Chipper,

I know how the address is built. In some other posts I have seen the address set to 0.0.0.0, the same as mine. But I could not find out whether it affects the NLB cluster.

Yes, it may be helpful to contact VMware.

Best Regards,

Leo

September 2nd, 2015 11:01pm
