multiple heartbeat failures for 1-2 minutes!!!
Hello, I just got 25 servers sending alerts for heartbeat failures! I checked http://scomskills.com/blog/?p=107, but none of them sent the "Failed to connect to computer" alert, so it does not look IP related. There are at most 2 affected servers per subnet, and they are spread across three different datacenters. Where should I look for a root cause, given that this happens several times a month? The settings are Agent 180 seconds / Server 3. The main issue is that it almost always happens on the same servers! Thanks, Dom
System Center Operations Manager 2007 / System Center Configuration Manager 2007 R2 / Forefront Client Security / Forefront Identity Manager
July 18th, 2012 4:20pm
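
A note on the settings quoted above: assuming "Agent 180 seconds / Server 3" means a heartbeat interval of 180 seconds and 3 missed heartbeats allowed on the management server (an interpretation, the post does not spell it out), the rough silence needed before a failure is flagged can be worked out as below. The function name is hypothetical and this is plain arithmetic, not an OpsMgr API call.

```python
def heartbeat_failure_window(interval_seconds: int, missed_allowed: int) -> int:
    """Approximate silence, in seconds, before a heartbeat failure is flagged.

    Assumes the server flags a failure once `missed_allowed` consecutive
    heartbeats, each `interval_seconds` apart, have not arrived.
    """
    return interval_seconds * missed_allowed


# Values taken from the post: Agent 180 seconds / Server 3.
window = heartbeat_failure_window(180, 3)
print(f"{window} s (~{window / 60:.0f} min)")
```

Under that reading, an agent has to be silent for several minutes before the alert fires, which is worth keeping in mind when lining the alert time up against CPU or network graphs.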

Hi Dom,
> it is happening most likely always for the same servers
Which servers do you mean: agents or management servers? You need to check:
- CPU spikes at the time of the failed heartbeats. When a server hits 100% CPU, the agent on that server can't send anything until it gets some CPU resources to do so.
- If you mean the same management server, you also need to check the OpsMgr SQL database connectivity/latency at the time the heartbeats were lost.
http://OpsMgr.ru/
July 18th, 2012 11:54pm

Hi Alexey, it is almost always the same agents that fail the heartbeat. I will review CPU spikes, but 30 servers having CPU spikes at the same time in three different datacenters, on different hardware, OS and applications, would be weird.
- No spike
- No event in the logs on the agent
Thanks, Dom
July 19th, 2012 1:02am

> 30 servers having CPU spikes at the same time in three different datacenters using different HW, OS and Application it is weird....
Agreed. You may want to think about what is common to them all. The network link to the management server? Some scheduled task, like a backup, running at the same time? Something else?
http://OpsMgr.ru/
July 19th, 2012 1:07am

Hello, Network link: only the last part, between the switch and the management server, is common, so why only 30 out of 900 servers? Different switches, subnets, etc., so there is not much in common on this side. As there are SQL, TSM (physical servers) and Veeam Backup for VMs, the backups are really disparate: the timing is different, there is no backup during the day, and the heartbeat failures were at noon. The servers are a mix of VMs, blades and standalone machines. Let's see if there is any other task, maybe on the MS! But again, why only 30 and not all servers? I think so far all these servers are on the same MS; I need to verify. Thanks, Dom
July 19th, 2012 10:02am

Hello, again today at about noon 24 machines raised heartbeat failures. I checked whether any were already on yesterday's list: exactly the same machines! I noticed that the Local Health Service Availability is also red, and what sits under Availability is grayed out. Other machines have the Local Health Service Availability gray but are not on the list of heartbeat failures. Thanks, Dom
July 19th, 2012 3:12pm
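
The "same machines as yesterday" check described above can be done mechanically. A minimal sketch, assuming the alerted agent names for each incident have been copied out by hand into lists; the host names below are made up and the function name is hypothetical:

```python
def recurring_agents(incidents):
    """Return the agents that appear in every incident.

    `incidents` maps an incident label to the agent names alerted that day.
    """
    name_sets = [set(names) for names in incidents.values()]
    return set.intersection(*name_sets) if name_sets else set()


# Hypothetical data; real names would come from the heartbeat failure alerts.
incidents = {
    "day 1": ["srv-a", "srv-b", "srv-c"],
    "day 2": ["srv-b", "srv-c", "srv-d"],
}
print(sorted(recurring_agents(incidents)))
```

Agents that survive the intersection across several incidents are the ones worth comparing for a shared management server, role or scheduled workflow.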

As Alexey said earlier, the health service has low priority compared to the other services running on the agent side, so it will fail to heartbeat when there are no cycles available to send the information and/or run additional workflows. Do you receive the unreachable alerts? If not, there should not be any network issue, unless a firewall is blocking the data flow in the agent to management server direction. Network saturation? Find the common link between those agents and rule out other factors, such as other agents reporting to the same MS. Are they all SQL servers, or are they mixed roles? Perhaps you have a discovery/rule/monitor that is syncing up at noon and is causing the health service to choke for a bit.
July 23rd, 2012 11:32pm
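
One way to look for the "common link" suggested above is to intersect the properties of the failing agents. A rough sketch, assuming a hand-built inventory of agent properties; the attribute names, values and function name are all hypothetical and nothing here queries OpsMgr:

```python
def shared_attributes(failing, inventory):
    """Return the attribute/value pairs that every failing agent has in common.

    `inventory` maps an agent name to a dict of string attributes.
    """
    common = None
    for name in failing:
        pairs = set(inventory[name].items())
        common = pairs if common is None else common & pairs
    return dict(common) if common else {}


# Hypothetical inventory; fill it in from whatever asset data is at hand.
inventory = {
    "srv-b": {"management_server": "MS1", "role": "SQL", "platform": "VM"},
    "srv-c": {"management_server": "MS1", "role": "Web", "platform": "Blade"},
}
print(shared_attributes(["srv-b", "srv-c"], inventory))
```

Anything the failing agents share still has to be checked against the healthy agents, since a property common to the whole fleet explains nothing.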

Hello, No other alert was received for this batch of 25 machines. The machines report fine for a while, and then this heartbeat failure occurs, seemingly always at about the same time and for the same machines. Network saturation is possible, but as the subnet range is so wide for these machines, I do not think that is the cause. I will check the applications on all these machines again; they seem to cover a wide range (VMs, blades, standalone), and the first check showed mixed roles, but I need to confirm. Yes, they are all on the same management server, with 700+ other machines. Let me check the discoveries/monitors/rules for the synchronization, but as the batch always contains the same machines, there might be some grouping that causes this. Thanks, Dom
July 24th, 2012 9:32am

Hi Dom, Is there any new information on this issue? As this thread has been quiet for a while, we assume that the issue has been resolved. At this time, we will mark it as "Answered", as the previous steps should be helpful for many similar scenarios. In addition, we'd love to hear your feedback about the solution. By sharing your experience you can help other community members facing similar problems. Thanks, Yog Li, TechNet Community Support
July 25th, 2012 11:13pm

Hello, The local Service Availability is grayed out on all these 25 servers! Any idea? Thanks, Dom
July 26th, 2012 6:16pm

This topic is archived. No further replies will be accepted.
