Windows 2008 R2/SQL 2008 SP1 CU5 active/active cluster keeps failing

I have set up many clusters like this one in the last few months, but this particular cluster has been having a lot of problems lately.

OS: Windows Server 2008 R2 latest patches

SQL: SQL Server 2008 SP1 CU5 (version: 10.0.2746)

Disks: VMAX SAN

In the very beginning when I set up the cluster, cluster validation kept failing when I tried to add a node. It turned out that we had to remove the servers from the domain and rejoin them. Right after that, cluster validation succeeded, the second node joined the cluster, and the quorum was changed to "Node and Disk Majority". SQL was installed and set up as active/active (one SQL instance on each node). There were no issues until SQL Server was actually put to use.

Symptoms:

The SQL Server instances run heavy ETL processing and are also subscribers in a SQL replication setup. The following errors first started on 4/24 when I set up the SQL replication (these servers are replication subscribers):

EventID: 1127, Source: FailoverClustering, Task Category: Network Manager

Cluster network interface 'man1fscl01a - Hartbeat to man1fscl91b' for cluster node 'man1fscl01a' on network 'ClusterHeartBeat' failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

EventID: 1130, Source: FailoverClustering, Task Category: Network Manager

Cluster network 'ClusterHeartBeat' is down. None of the available nodes can communicate using this network. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Then, starting 5/14 after I started the ETL processing job, we began getting more errors on the cluster. The following entries have been consistently present in the Windows System log, sometimes multiple times a day. The SQL Server instances have been failing over constantly (I can see multiple SQL error logs throughout the day in the last week).

EventID: 1592 Source FailoverClustering, Task Category: Node-to-Node Communications

Cluster node 'man1fscl01a' lost communication with cluster node 'man1fscl01b'. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

EventID: 4201, Source: Iphlpsvc

Isatap interface isatap.{640E853E-3232-4A2D-8095-076A798D85AE} is no longer active.

EventID: 1135, Source: FailoverClustering, Task Category: Node Mgr

Cluster node 'man1fscl01b' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

EventID: 1069, Source: FailoverClustering, Task Category: Resource Control Manager (would happen a few times in a row)

Cluster resource 'Drive Q:' in clustered service or application 'Cluster Group' failed.

EventID: 1177, Source: FailoverClustering, Task Category: Quorum Manager

The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.

Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

EventID: 7036, Source: Service Control Manager

The Cluster Service service entered the stopped state.

EventID: 7024, Source: Service Control Manager

The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster..

I've asked the hardware team to check out the NICs, and they said that the drivers are all up to date.  I've asked the network team to check for network issues, but they reported that there's no issue; however, since these servers are not anywhere near us (they're on the opposite side of the continent), I am not sure they really did check.  I have asked the SAN team to check for disk issues, because it seems like SQL/the cluster fails only when SQL Server is busy doing ETL processing, with high I/O usage.  The SAN team reported that they found no issues.  I am at a loss here.  The hardware team insists that these servers need to be re-imaged, which would mean another 2 weeks before they're back to where they are now.  I just cannot believe that is the only solution.  Please help!

G

May 25th, 2010 6:09pm

Do these happen to be blade servers going through a common I/O backplane?

I have seen heavy disk I/O drag down network performance since they are really "virtual" connections on a common I/O board.

Another possibility is a SAN configured for file services and not for SQL database applications.  Too large a RAID set and too many LUNs can lead to command tag queue lengths that are incompatible with cluster timeouts.

Hard to troubleshoot without some kind of hands-on monitoring.  You might want to set up some Performance Monitor traces that record key disk and network metrics on each node.
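For example, a minimal logman sketch (the counter selection, 15-second interval, and output path are just my assumptions; swap in whichever disk and network metrics matter to you, and run it on each node):

logman create counter ClusterPerf -f csv -si 15 -o C:\PerfLogs\ClusterPerf -c "\LogicalDisk(*)\Avg. Disk sec/Read" "\LogicalDisk(*)\Avg. Disk sec/Write" "\LogicalDisk(*)\Current Disk Queue Length" "\Network Interface(*)\Output Queue Length" "\Network Interface(*)\Packets Received Errors"

logman start ClusterPerf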

May 25th, 2010 9:36pm

These are not blade servers.  They are HP DL580 servers with EMC VMAX SAN space.  The ETL job in SQL uses the data within its own databases on the same server.  However, the data does get fed into these databases via SQL replication.

The servers have been up since I disabled all SQL jobs (the only thing running is SQL replication), and I noticed that SQL has failed only a couple of times since then, but a failover did not happen.

Setting up PerfMon to collect some data now.  I will provide more info once I have the trace data. 

May 27th, 2010 4:05pm

I couldn't get any data collected: the cluster stayed alive for a week and then finally gave up last weekend.  I'm not able to open the cluster at all from Server Manager, but oddly enough, the two SQL instances have been up and running on one node (the B node, which now has all the EMC VMAX drives).  The A node has no disks attached and is showing as down from the cluster (viewed from the B node).  The last critical error on the cluster was:

Event ID: 1146

Source: Microsoft-Windows-FailoverClustering

Description: The cluster resource host subsystem (RHS) stopped unexpectedly. An attempt will be made to restart it. This is usually due to a problem in a resource DLL. Please determine which resource DLL is causing the issue and report the problem to the resource vendor.

Upon researching on the Internet I came across these two articles:

http://support.microsoft.com/kb/978476

http://support.microsoft.com/kb/978527

Yes, the MS DTC resource failed (there was an error in the event log as well).  I am beginning to wonder whether the cluster issues are related to the EMC VMAX, which is new in our environment.  My clusters happen to be the first servers on EMC VMAX...

I have opened a case with EMC to look into this.  What they're saying is that if there are no entries in the Windows log for "EmcpMpx event ID 100s", then it's not related to the disks.  I am going to ask them to look deeper into the issue and will post what I find here later, if the cluster can be saved...

June 1st, 2010 6:38pm

After another week of working on the cluster, it's still failing.  I've installed the hotfix (978476) mentioned above, but that didn't help (the cluster failed over the weekend, starting on 6/5).  I turned off all SQL jobs and left only SQL replication on (the cluster is a replication subscriber).  Then on Tuesday we updated all firmware and NIC drivers using the HP management software (PSP version 8.3).  The servers were already on PSP 8.3, so it was a refresh.  On the same day as the firmware/driver refresh, the cluster failed again multiple times, with the following errors (in sequence):

This time it didn't take long for the cluster to fail.  Node A was rebooted at 12:49:32pm server time, and node B was rebooted at 4:00:52pm server time.  Node A had the DTC and quorum drives.  Starting at 4:02:52am on 6/9, node B started to display warnings, and eventually at 6:32:39am there was a critical error on the cluster.  The following are the log entries:

Time: 6/9 4:02:54am
Event ID: 47, Source: Time-Service
Description:
Time Provider NtpClient: No valid response has been received from manually configured peer time.windows.com,0x9 after 8 attempts to contact it. This peer will be discarded as a time source and NtpClient will attempt to discover a new peer with this DNS name. The error was: The peer is unreachable.

Time: 6/9 6:31:42am
Event ID: 1592, Source: FailoverClustering
Description:
Cluster node 'man1fscl01b' lost communication with cluster node 'man1fscl01a'. Network communication was reestablished. This could be due to communication temporarily being blocked by a firewall or connection security policy update. If the problem persists and network communication are not reestablished, the cluster service on one or more nodes will stop. If that happens, run the Validate a Configuration wizard to check your network configuration. Additionally, check for hardware or software errors related to the network adapters on this node, and check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Time: 6/9 6:34:38am
Event ID: 4201, Source: Iphlpsvc
Description:
Isatap interface isatap.{C384E115-73CA-45A9-B700-0384C394F0D4} is no longer active.

Time: 6/9 6:34:39am
Event ID: 1135, Source: FailoverClustering
Description:
Cluster node 'man1fscl01a' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Then at 7:15:15am there were these errors:

Event ID: 1557, Source: FailoverClustering
Description:
Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.

Event ID: 1558, Source: FailoverClustering
Description:
The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data.

Event ID: 1069, Source: FailoverClustering
Description:
Cluster resource 'Drive Q:' in clustered service or application 'Cluster Group' failed.

Event ID: 1, Source: Kernel-Tm
Description:
The Transaction (UOW=%1, Description='%3') was unable to be committed, and instead rolled back; this was due to an error message returned by CLFS while attempting to write a Prepare or Commit record for the Transaction. The CLFS error returned was: %4.

Event ID: 6, Source: Kernel-General
Description:
An I/O operation initiated by the Registry failed unrecoverably. The Registry could not flush hive (file): '\??\C:\Windows\Cluster\CLUSDB'.

Event ID: 7024, Source: Service Control Manager
Description:
The Cluster Service service terminated with service-specific error An I/O operation initiated by the registry failed unrecoverably. The registry could not read in, or write out, or flush, one of the files that contain the system's image of the registry..

Then SQL Server on that node would fail and fail over to the other node...

EMC did not find anything related to SAN space or drivers.  And I'm still trying to figure out how to set up traces to collect data right around the time the cluster fails...

June 10th, 2010 12:13am

This is definitely a bigger problem than we can fix on the forums.  You already have cases open with EMC and HP.  I would open one with Microsoft CSS.  See if you can get your hardware vendor to get everyone together and come up with an action plan.  It looks like communications to the storage system are failing, even on the boot drive (are you booting from SAN?).  Lose storage and the whole cluster goes unstable.

 

June 10th, 2010 2:35am

We are having the same problem, and we're using the same hardware platforms (HP, EMC, MS 2008 clustering).  We have had MS, EMC and HP engaged for about a month now but have yet to make much headway.

Did anybody ever find a root cause for this, or how to fix it?  If so, can you please share?

 

thanks

 

August 19th, 2010 2:19pm

Any update on this issue?  We are experiencing cluster failures in an Exchange 2010 environment.  The System event log starts with:


Event ID: 4201, Source: Iphlpsvc
Description:
Isatap interface isatap.{C384E115-73CA-45A9-B700-0384C394F0D4} is no longer active.

And then we get a cluster failure and a DAG failover.

Running VMware 3.5 VMs using the VMXNET Enhanced vNIC.  IPv6 is disabled.

The above event is all over the place on both my MBX servers at two locations, but other VMs on the same ESX host configured with the same OS (Windows Server Datacenter x64 with IPv6 disabled), including my CAS/HUB servers, are not getting these errors.

The failures just started happening within the past two days; the test environment had been stable for two weeks.

August 20th, 2010 7:44pm

I'm in the same boat with the Event 4201 (ISATAP is no longer active)

Running a SQL 2008 DB on a Windows 2008 R2 failover cluster. This sits on VMware vSphere 4.1. All the other nodes on the virtual infrastructure are working fine.

Exactly every 30 mins I experience this outage:

  • There is a successful NTP time sync with our DC (the time is correct on the DC)
  • The 4201 error happens

Note the order of these events can vary, but there is always only a second between them.

  • A second after the previous events, the cluster fails with Event ID 1135

Cluster node 'NODE2' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Michael

September 27th, 2010 9:57am

David-JFC, I also had a stable environment for a number of months. The issue first occurred for me at 12:35:00 on 16th Sept 2010. That would not have been far off when yours occurred.

At a hardware level we are on IBM xSeries 3650s and an EMC SAN.

September 27th, 2010 10:01am

We upgraded to vSphere 4.x and removed the cluster so we can move forward with our Exchange 2010 deployment.  We have not heard of any resolution to this issue.
September 27th, 2010 3:19pm

Just out of curiosity... What is the order of your connections in Advanced Settings? Private should be last on the list and Public first. I remember that warning popped up during one of the installations I've completed...

Regards,

Akim

 

September 27th, 2010 9:48pm

Akim,

We've got Public first on both cluster nodes.

I've disabled IPv6 on each NIC too.

The error "Isatap interface isatap.{XXXXXX} is no longer active" appears just prior to each cluster failure.

This interface is related to IPv6 tunneling. I have three on each node of my cluster, but only one repeatedly shows up in the error logs. When I run ipconfig /all I can see all three ISATAP adapters are in a 'media disconnected' state.
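One thing I'm considering is disabling ISATAP itself on each node to rule it out; a minimal sketch, assuming the in-box netsh isatap context (check the result with ipconfig /all afterwards):

netsh interface isatap set state disabled

netsh interface isatap show state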

Michael

September 28th, 2010 3:08pm

We're having the same problem in one of our production environments (Windows 2008 R2 Enterprise Edition with a Hyper-V active/active cluster on a Fujitsu BX600S3 blade).  Every three days we get the following events:

Event ID: 4201

Logged: 13/10/2010 12:17:51

Isatap interface isatap.{61E72180-4FC2-4FA7-8F6E-37F86834EABD} is no longer active.

Then:

Event ID: 37

Logged: 13/10/2010 12:17:51

The time provider NtpClient is currently receiving valid time data from xxxxx.xxxx.xxxxx (ntp.d|0.0.0.0:123->192.168.1.36:123).

Then the cluster fails (the event log message is about the other node):

Event ID: 1135

Logged: 13/10/2010 12:17:52

Cluster node 'xxxxx' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Event ID: 4200

Logged: 13/10/2010 12:17:57

Isatap interface isatap.{61E72180-4FC2-4FA7-8F6E-37F86834EABD} with address fe80::5efe:169.254.2.253 has been brought up.

Event ID: 1069

Logged: 13/10/2010 12:17:59

Cluster resource 'Cluster Disk 2' in clustered service or application 'Cluster Group' failed.

Event ID: 1177

Logged: 13/10/2010 12:18:12

The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Event ID: 7036

Logged: 13/10/2010 12:18:12

The Cluster Service service entered the stopped state.

Event ID: 7024

Logged: 13/10/2010 12:18:12

The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster..

Event ID: 7031

Logged: 13/10/2010 12:18:12

The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.

Again and again, the same thing.  We have already upgraded the Fujitsu BX600S3 firmware to the latest version, and also the SAN's firmware (Nexsan SASBoy).  Any help would be greatly appreciated.

Regards,

Andrea D'Orio

 

October 15th, 2010 10:26pm

Did anybody find a solution here?

I am having the same problem and it happens every other hour.

 

Thanks,

Domenico

October 27th, 2010 2:14pm

Has the fix from the users above experiencing this specific problem come to any resolution?  This is very specific and occurs even after completely rebuilding the cluster.
November 11th, 2010 7:22pm

Did anyone get anywhere with this?

We are seeing this in our Exchange 2010 DAG servers, 2008 R2, DAG is stretched across 2 sites.

All of the mailbox servers are identical HP DL360 G7s with HP NC382i DP NICs.

This error appears to come out of the blue, up to a month apart, but it disconnects the mailbox servers from the FSW, which leads to loss of quorum and failover of the databases in Site 2 to Site 1.

Any ideas would be welcome!

Thanks,

Karl

April 7th, 2011 2:19pm

I'm seeing the issue too, in an Exchange 2010 DAG environment. Two 2008 R2 servers in a VMware environment report event ID 4201 at the same moment. As soon as this happens, all network connectivity seems to fail and the two nodes stop working.

Did anybody solve this?

April 11th, 2011 1:31pm

Hey Jetze, 

I'm seeing this issue as well; I just posted my own issue related to my Exchange 2010 DAG.  My databases don't ever fail over; it just appears the servers lose communication with each other.  They report various FailoverClustering errors, from Event 1135 to 1177, and complain that the witness server cannot be contacted, etc.  I'll open up Failover Cluster Manager and see that the host server has failed over to another server in our second site.

Like I said, the databases don't fail over to another node; it's just the cluster node errors, which eventually break OWA when the host server fails over.

I'd just love to hear what you're experiencing, to know that I'm not alone!

 

April 15th, 2011 6:13pm

Hi guys,

I'm also experiencing this issue with clustering on Windows Server 2008 R2 with an Exchange DAG stretched across multiple sites and regions. This is very disruptive to the production environment, as the database fails over to another site within seconds whenever the 4201 event occurs.

I hope there is a resolution to this problem from Microsoft asap :p

regards,

Vince

April 17th, 2011 9:38am

Hi,

I am having the same issue. 

Events from the IP Helper service seem to precede the cluster failing.  If I remove the DAG, I don't see any event messages from the IP Helper service.

I have disabled the IP Helper service to see if it resolves the issue.  I will post again if this works.  If anyone else would like to try the same thing, please let me know if it helps.
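For anyone wanting to try the same thing, a minimal sketch of what I ran on each node (assuming the in-box service name Iphlpsvc; note the required space after start=):

sc config iphlpsvc start= disabled

net stop iphlpsvc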

Regards
GazD

April 27th, 2011 2:31pm

That seemed aggressive, but then again, it makes sense: it always seems to have been the shutdown of the ISATAP interface that led to the failovers, so it looks like it is worth a try.

I have disabled the IP Helper service on two of my servers; I'll post back in a little while (well, 3 weeks; about to go on vacation, so it will be a while).

Thanks for the idea,

Karl


May 1st, 2011 2:53am

Karl,

 

Did disabling the IP Helper service help with the failovers? We are experiencing similar failover events that are preceded by the "Isatap interface isatap. is no longer active" error.

May 10th, 2011 6:37pm

Hi All,

Did anyone figure out a solution for this issue? We are also experiencing the same on Hyper-V failover clusters with Windows 2008 R2 hosts, on three different clusters. Does disabling the IP Helper service solve the issue or not?

Thank you

May 18th, 2011 11:56am

Hi Jody,

just got back from vacation, and since May 1 we have had only 1 occurrence, so that one may have actually been a network issue. As far as I can tell from watching this for the past 4 weeks, disabling the IP Helper service does seem to have resolved it for us.

Thanks,

Karl

May 25th, 2011 3:07pm

Hi All,

I have this issue at two customers, both Exchange 2010 SP1 DAG configurations. Disabling the IP Helper service seems to resolve the problem. Does anyone have more information about this issue, and is there a Microsoft KB article that describes it?

Thanks

Geert

August 4th, 2011 12:19pm

Hi

I have the same issue with a 2008 R2 SQL cluster. Physical nodes: HP BL460c G7 with HDS USP V SAN. I'll give disabling Iphlpsvc a try and report back in a few weeks.

 

Matt

August 12th, 2011 8:56pm

I've been having the same issues; cluster validation shows the nodes are healthy and there are no issues with MPIO.  We've replaced the SAN (planned upgrade) and all networking gear, and even added a virtual node that also had this issue.  It crept up seemingly overnight on a cluster that had run for 18 months with no issues.

 

I have disabled Iphlpsvc this morning on the inactive node, and will do the same on the active node tomorrow and we'll see.

August 29th, 2011 1:19pm

How is everyone going with this now?
September 5th, 2011 4:31am

Our setup is working much better with the IP Helper service disabled, though I am sure this will come back to haunt us sometime in the future :(
September 6th, 2011 5:26pm

Hello, I'm having similar issues; has anyone found a "solution" to this matter? Regards, M
September 13th, 2011 10:23am

Hi,

We are still in the testing phase, but increasing the VMware service console memory (in our case from 400 MB to 800 MB) probably helps, if you have VMware and disabling IP Helper doesn't solve the issue ;)
September 13th, 2011 11:20am

Hello, I'm having similar issues, has anyone found a "solution" to this matter? Regards M

As far as I know, many people fixed the problem without side effects by disabling the IP Helper service. I think this is the best solution; personally, I don't expect a better fix.
September 13th, 2011 12:24pm

Hello,

We solved the cluster failovers via the following:

Disabled IP Helper

Disabled LSO on the Broadcom NICs and configured teaming to Standby with Smart Load Balancing

Anyway, we still have informational events logged (1592, node-to-node communication).

Fortunately, the cluster is stable.

The events are logged only during higher network utilization.

We are considering replacing the NICs with Intel.

 

UPDATE:

Well, the cluster failures are back even after all these steps.

We replaced Broadcom with Intel; same result.

 

It must definitely be something in the failover cluster itself.

 

Please update the thread with any results from MS...

September 13th, 2011 7:37pm

I disabled the IPHelper service on all our nodes last week and the issue still occurs for us, so no luck there.  I have an open case with MS at the moment... will update this thread if we manage to resolve it.
September 13th, 2011 10:36pm

Same for us. We're using Dell blades with one physical A node and a VM B node with an Equallogic iSCSI SAN. No problems pre-production, and once loaded we started getting the errors reported above. Disabling IPHelper looked like a fix (well, for a fortnight anyway), but it's come back to haunt me just as I'm onboarding a customer. This time it took out one of the other cluster drives, which was a dependency of SQL. Down went my cluster... Nothing in the logs other than the Isatap interface message, and no clues on the SAN... Just raising a case with MS; will let you know how I get on... Thank goodness for this forum entry. At least I'm not on my own.

Stuart


September 15th, 2011 9:48pm

Hi,

Can someone please share the result of the MS investigation?

Please!

peter

September 20th, 2011 6:37am

Still working on it currently; doing some more work Thursday night, so I can update after that.

September 20th, 2011 11:58pm

I am having the same issue; in addition, while failover is happening the active node does not release the IP address from the previous active node, so failover does not complete and the cluster is down. :(

The MS recommendation is to change the "Power Management" settings of the network adapters so that Windows cannot shut them down, for all NICs. Anyway, this is another round of trial and error. All network adapters on the cluster nodes are always in use for monitoring/production/heartbeat, so this shouldn't be the case. But since it comes as a solution from MS, we are configuring it.
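For reference, the same "Allow the computer to turn off this device to save power" checkbox can also be cleared through the registry; this is only a sketch, and the class-key index (0001 here) is an assumption that must be looked up per adapter first (a reboot is needed afterwards):

reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}\0001" /v PnPCapabilities /t REG_DWORD /d 24 /f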

Thanks.

September 23rd, 2011 4:18am

OK, after hours of premium support from MS and Dell, I have now disabled IPHelper, upgraded drivers, switched away from MS bridging of the LAN adapters, and used Broadcom teaming. Still have failovers, still have blue screens, and still the cluster goes completely offline.

It's currently looking like LAN latency related to our switches (HP ProCurve 2910al) and the NIC teaming... I'll let you know how I get on.

 

Stuart

September 23rd, 2011 7:34am

Hi,

We had Broadcom teaming and broke it apart to go with simple (Intel) NICs.

Anyway, we are still facing the same issues.

 

Regarding network latency, we removed the switch from the replication network (1 Gbit) and replaced it with a simple crossover cable.

We minimized the blue screens by decreasing the network load (we have Ex2010 SP1 RU5) and observed that mailbox moves were the trigger for the cluster failures.

We have IPv4 as well as IPv6; we are now considering disabling IPv6 and building an additional cluster node without any 3rd-party NIC drivers.

 

With respect to MS, from my point of view it seems to be some kind of bug in a specific combination of hardware and drivers, triggered only during high network utilization.

 

Thanks for your feedback; hopefully, by sharing results and investigations, we will be successful in the end...

peter

September 25th, 2011 7:58pm

Guys, just wanted to give you an update on my specific case: I managed to fix ours!  Note the cluster-specific hotfix for 2008 R2 SP1 below.

Summary of the items changed on all nodes of our DAG:

 

TCP Chimney Settings disabled

  

netsh int tcp set global chimney=disabled

netsh int tcp set global rss=disabled

netsh int tcp set global netdma=disabled
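To confirm the chimney/RSS/NetDMA state actually changed, you can list the global TCP settings afterwards:

netsh int tcp show global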

 

Cluster Hotfix applied - http://support.microsoft.com/kb/2552040

 

Broadcom NIC Drivers and teaming application suite updated to latest version from IBM downloads site

 

Exchange 2010 SP1 Roll Up Pack 5 applied

 

Symantec Endpoint Protection 11 updated to RU6 MP2 and Exchange specific exclusions re-applied with a new policy

 

Replication NIC settings configured as per MS recommendations here - http://technet.microsoft.com/en-us/library/dd638104.aspx#NR

 

Cluster level: cluster heartbeat settings changed as recommended, and the cluster service restarted on all nodes to apply this change

cluster . /prop SameSubnetDelay=2000                (range 250-2000)

cluster . /prop SameSubnetThreshold=10            (range 3-10)

cluster . /prop CrossSubnetDelay=4000                 (range 250-4000)

cluster . /prop CrossSubnetThreshold=10            (range 3-10)
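If you want to sanity-check these values before and after the change, the current settings are listed among the cluster common properties:

cluster . /prop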

 

September 26th, 2011 1:17am

Hi,

many thanks for your post Stew.

I've marked it as the answer, and would like to confirm that we implemented the same and it fixed our troubles!

peter

September 27th, 2011 6:30pm

That's great to hear, PeBe :)
September 27th, 2011 10:03pm

Hi,

How many nodes are you running in your cluster? IHAC with the same problem, but they are running a two-node cluster.

Can KB2552040 be applied to a two-node cluster? I ask because I found the following in KB2552040:

  •  You create a Windows Server 2008 R2 failover cluster that has three or more nodes
  • An asymmetric communication failure occurs in the cluster. For example, two nodes cannot communicate with one another. However, the two nodes may be able to communicate with other nodes in the cluster.

Thanks.
Jeff

September 28th, 2011 2:05am

We have 5 nodes in our cluster.  Not sure about applying it to two-node clusters; you would need to check with MS.
September 28th, 2011 5:00am

OK, we may have a solution for our particular issue...

It looks to be quite simple...

We're running 2 NICs into 2 switches for each server. The NIC drivers only support link aggregation or smart load balancing. The switches unfortunately don't support spanning LACP (LACP across both switches), so as soon as the traffic goes up, spanning tree kicks in and starts blocking the switch. Packets are lost and the cluster fails over... All relatively simple really.

 

I'll know for sure tomorrow if it fixes everything. Currently disabling one NIC on each server at the switch-port level (disabling at the NIC doesn't seem to work)... We will look at buying switches that can cope with this.

 

Stuart

September 29th, 2011 9:25pm

Hi,

 

We have a two-node cluster and successfully implemented hotfix KB2552040.

Anyway, this hotfix alone could not solve it; we also had to implement the chimney disabling as well as the cluster property changes.

Finally, all things together solved it. Hoping this info helps you.

p

September 30th, 2011 10:27am

So we just had the same thing happen in our environment. Ours is set up as two VMs running SQL in a cluster. They both experienced the cluster failure issue. What I was wondering from any of you, though: did you also disable the IP Helper service in conjunction with KB2552040 and the chimney stack change?

Any information would be gratefully received. We want to make sure our environment is as stable as it can be.

 

Thanks in advance for any information.

Chiloco.

October 7th, 2011 4:45pm

Yes Chiloco, the patch and the chimney settings.  Note my original post above showing all the steps we did to fix it; you might want to include the cluster timeout and driver checks too.
October 10th, 2011 3:35am

Hi,  I have been following this in hopes of a solution.  So I have 2 DAGs; I applied KB2552040 and then did the netsh changes and the cluster delay and threshold changes (for one of the DAGs).  But after doing this I still see cluster problems.  I have Exchange 2010 SP1 Update Rollup 2 and I see you suggest RU5; I wonder if this is the problem.  But I am hesitant to do the update: from what I have been reading there are problems with RU5, and RU6 is supposedly due out this month.

My servers are Hyper-V VMs; should I do the netsh changes on the Hyper-V host?   I have multiple NICs on the servers; does the netsh apply to all interfaces? 

October 14th, 2011 11:58pm

There wasn't anything specific in RU5 that fixed the issue; I just did the patch at the same time as all the other updates to ensure we were bang up to date.  We have been running RU5 now for several weeks without issue.

Can't comment on the Hyper-V settings, sorry, as ours are physical.

October 17th, 2011 12:56am

We had the problem with a SQL 2008 R1 cluster on a VMware ESXi 4.1 update 1 platform. In our case we experienced two different problems.

1) The OS backup is made with SnapManager for VI. During the OS backup, the SQL cluster service failed. The solution for this one was to reinstall the VMware Tools (custom install) without the "Volume Shadow Copy Service" feature.

2) The cluster service also seemed to fail during a McAfee update (OS scanner) of the DAT files. This also seemed related to the fact that we have a virtual environment and all the servers tried to update at the same time, which caused a CPU peak on the VMs and ESX hosts and could also be seen on the storage platform as a peak in disk latency. The solution for this one was to update all the servers in a one-hour window and enable the randomize option for updating all the client systems on the ePolicy server.

November 6th, 2011 11:41am

One of our clients has been having a similar issue with an Exchange 2010 SP1 DAG; the hotfix below may assist anyone else in the same boat. We are going to schedule an outage and install the hotfix, and I'll report back as to whether it helps or not.

 

http://support.microsoft.com/kb/2550886

 

November 22nd, 2011 2:20am

Scott Schnoll has recently blogged about this and the other relevant cluster-related hotfixes here:

http://blogs.technet.com/b/exchange/archive/2011/11/20/recommended-windows-hotfix-for-database-availability-groups-running-windows-server-2008-r2.aspx

This includes the 2550886 and the 2552040 that I mention above.

November 22nd, 2011 11:59pm

Here's a list of all the hotfixes regarding MS Failover cluster for 2008 R2...

http://social.technet.microsoft.com/wiki/contents/articles/list-of-cluster-hotfixes-for-windows-server-2008-r2.aspx

 

Hope this helps...

November 30th, 2011 1:52pm

I had this issue with a two-node, mailbox-only-role Exchange 2010 SP2 solution (even pre-SP2). I had installed all the mentioned cluster hotfixes and disabled the IP Helper service, but nothing helped until I changed the chimney settings and cluster heartbeat settings as specified by Stew (while keeping the other changes intact).

So thank you very much Stew!

Note: My DAG failed over during backup using Veeam v. The two Exchange mailbox servers are virtualized on vSphere 5 and are on different vSphere nodes, both in the same site.

March 6th, 2012 10:09am

Good to hear Dave.  Happy to help :)
March 6th, 2012 10:14pm

Hi Stew,

Thought I'd let everyone know that this issue cropped up again after I did a bunch of Windows updates in April.

It was occurring on a SQL Server 2008 R2 cluster as well as our Exchange 2010 DAG.

A combination of the chimney settings, the heartbeat settings and hotfix 2552040 fixed the issue for us.

Cheers

Anthony

June 7th, 2015 7:04pm

This topic is archived. No further replies will be accepted.
