Failed to connect to computer vs heartbeat failure (Network Steve Forum)

We had a SCOM-monitored server go down last night. I received a "Health Service Heartbeat Failure" alert, but not a "failed to connect to computer" alert, the one that says the computer was not pingable. Unfortunately, we are relying on the latter alert as the true indicator that a machine is down, because if I rely on both I end up getting duplicate alerts. The machine was completely down; it wasn't a case where it was hung but still pingable and otherwise unresponsive. So why would the "failed to connect" alert not fire when the heartbeat failure alert does? This is very disconcerting because it has happened before, and I've got some SCOM detractors who are stating that SCOM can't even reliably monitor basic machine up/down state.

July 16th, 2009 3:18pm

Could you shut down a server and see if you get the correct alert? If you do, we could investigate where the alert went last night. Otherwise it must be a configuration problem.
Anders Bengtsson | Microsoft MVP - Operations Manager | http://www.contoso.se

July 16th, 2009 3:29pm

You can also check the state of the monitors in the health explorer to see if both got into a critical state when the machine was down.

July 16th, 2009 3:42pm

Hi Scott.

By "alerts", are you talking about the alerts shown in the console, or also about the alerts going out as notifications (e-mail, SMS)?

For a customer I solved a similar problem with notifications. They got two alerts in the console when a server went down: one about the Health Service and another about failing to connect to the computer. But no notification went out. After some investigation it turned out that the customer had a custom-built group for the notifications WITHOUT the Health Service Watcher objects. After adding these (one for every server object) to the group, the alerts did go out as notifications.

Kevin Holman has some very good postings about it:
http://blogs.technet.com/kevinholman/archive/2008/02/01/configuring-notifications-to-include-specific-alerts-from-specific-groups-and-classes.aspx
http://blogs.technet.com/kevinholman/archive/2008/06/26/using-opsmgr-notifications-in-the-real-world-part-1.aspx
http://blogs.technet.com/kevinholman/archive/2008/10/12/creating-granular-alert-notifications-rule-by-rule-monitor-by-monitor.aspx

Hope this helps. By the way, in OpsMgr R2 notifications have hugely improved.

Best regards,
Marnix Wolf (Thoughts on OpsMgr)

July 16th, 2009 4:43pm

Hi Scott

All the above are valid, but you might want to clarify whether it is the alert in the console that was missed or the notification. There is no point troubleshooting notifications if the alert didn't happen, as the notification comes from the alert.

The process is that if the Management Server misses 3 consecutive heartbeats from an agent, then it generates the heartbeat failure alert and then runs a diagnostic ping. If there is no response to the ping, then it generates the Failed to Connect to Computer alert.

So .... this suggests:
- very unlucky in that the machine shutdown was very slow .... it is possible but very unlikely that the 3 heartbeats could be missed while the machine was still shutting down and responding to the ping. Have you changed the heartbeat settings or consecutive missed heartbeat settings (Administration \ Settings)?
- the diagnostic ping IP address can be changed, e.g. to the default gateway. Has this been done? If an override has been set, then it would mean that on a heartbeat failure the diagnostic ping would ping the default gateway, which I assume would be up.

Cheers
Graham
View OpsMgr tips and tricks at http://systemcentersolutions.wordpress.com/
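
The sequence described above can be sketched as a small decision function. This is only an illustrative Python model, not SCOM code: the function name `evaluate_agent`, the `ping` callback, and the use of alert names as return values are assumptions made for the example. The threshold of 3 missed heartbeats is the one given in the post.

```python
from typing import Callable, List

HEARTBEAT_ALERT = "Health Service Heartbeat Failure"
CONNECT_ALERT = "Failed to Connect to Computer"


def evaluate_agent(missed_heartbeats: int,
                   ping: Callable[[], bool],
                   missed_allowed: int = 3) -> List[str]:
    """Return the alerts raised for one agent (hypothetical model).

    The ping only runs after the heartbeat threshold is reached,
    so the connect alert can never appear without the heartbeat alert.
    """
    alerts: List[str] = []
    if missed_heartbeats < missed_allowed:
        return alerts  # still within tolerance, nothing is raised
    alerts.append(HEARTBEAT_ALERT)
    if not ping():  # diagnostic ping got no reply
        alerts.append(CONNECT_ALERT)
    return alerts


if __name__ == "__main__":
    # A hung machine that still answers ping raises only the heartbeat alert.
    print(evaluate_agent(3, lambda: True))
    # A machine that is fully down raises both alerts.
    print(evaluate_agent(3, lambda: False))
```

In this model a machine that stops heartbeating but still answers the ping produces only the heartbeat alert, which is the pattern reported at the start of the thread.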

July 17th, 2009 10:20am

Graham,

Thanks for the information about how heartbeating works in conjunction with the ping test. I wasn't aware that they were linked like that and thought that the ping test ran independently of heartbeating. I have not modified any of the default heartbeat or ping settings. However, after digging a little deeper with the responsible parties, here was the scenario:
- The machine basically "locked up" in a shutdown state. So the Health Service heartbeat failed, but the ping was successful. (I've seen this with Windows quite frequently: the machine is locked but still responds to ping.)
- I had not been sending notifications on heartbeat failure because it's generally too chatty and leads to too many false positives. (Very big WAN environment at this customer.) So I was relying on the failed to connect alert to indicate a true machine down state. The caveat to this setup is the situation above, where the machine is locked but still pingable.

How best to get around this? I can increase agent heartbeating from the default 60 second interval, but based on your previous posting this will delay the fall back to the ping test, and in theory a quick, unplanned reboot would never even be detected.
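
As a rough illustration of that trade-off, the time before a heartbeat failure is raised can be modelled as the heartbeat interval multiplied by the number of missed heartbeats tolerated. The Python below is a simplified sketch with hypothetical function names; the defaults of 60 seconds and 3 missed heartbeats are the figures mentioned in this thread, and real timing also depends on when the last heartbeat arrived.

```python
def detection_delay_seconds(heartbeat_interval_s: int = 60,
                            missed_allowed: int = 3) -> int:
    """Approximate time without heartbeats before the failure alert is raised."""
    return heartbeat_interval_s * missed_allowed


def outage_is_noticed(outage_s: int,
                      heartbeat_interval_s: int = 60,
                      missed_allowed: int = 3) -> bool:
    """True if the outage lasts at least as long as the detection delay.

    Simplified model: shorter outages (for example a quick reboot)
    never reach the missed-heartbeat threshold.
    """
    return outage_s >= detection_delay_seconds(heartbeat_interval_s, missed_allowed)


if __name__ == "__main__":
    print(detection_delay_seconds())        # 180
    print(detection_delay_seconds(120, 3))  # 360
    print(outage_is_noticed(90))            # False
```

Raising the interval makes heartbeat alerts less chatty but widens the window in which a short outage passes unnoticed.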

July 21st, 2009 3:29pm

Hi Scott

There is no "correct" answer to this. You've summed up the options well, and ultimately you've got to make the choice that is best for you and live with it. I don't mean that in a flippant way, but as you identify, you can't cover all bases all the time. OpsMgr has its limitations ...

In general .... I prefer to miss a reboot than get too many alerts. This is from experience, where I find too many alerts can mean they all get ignored, so you miss the important ones anyway. In too many environments where I am called in to troubleshoot, I see a couple of hundred alerts each morning (if not more), and the biggest sender of email in the organisation is OpsMgr. That isn't useful to anyone.

Sadly the heartbeat interval is global, so you can't have different heartbeats for different servers, which would sort of help. Sorry I can't be of more help, but ultimately the choice is too many false positives versus potentially missing reboots. If you want to be alerted on a reboot then you can create a rule to alert on the start ... but it is very much retrospective (this server was rebooted) rather than proactive.

Have fun
Graham
View OpsMgr tips and tricks at http://systemcentersolutions.wordpress.com/

July 21st, 2009 5:18pm

Hi Scott,

I also came across this with a client. The following solution helped us:

Create a rule looking for Event ID 513. This will alert you that the server is shutting down. The good news is that this is sent BEFORE the server shuts down, so Ops Mgr can get the alert out before the server closes or hangs.

Also create a rule for Event ID 512. This will alert you that the server is starting back up. We could alert on this with a timed event, so if we see Event ID 513 and not Event ID 512 within a certain time, we throw an alert.

Regarding the heartbeats: if we say we want a heartbeat every 2 minutes and check 3 times, then the alert is delayed by 6 minutes for the 3 checks. In the case of a virtual server we can reboot and not really notice, because it can do so in less than the heartbeat time!

Ops Mgr may have limitations, but this is overcome by the fact that we, the technical people, 'bend light' and change the rules, just like Scotty on the Enterprise, to fill those gaps.

Simon Skinner
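
The "513 without a following 512" idea above can be sketched as a small correlation routine. This is a plain Python model of the logic only, not an OpsMgr rule definition: the function `missing_startups`, the `(timestamp, event_id)` tuple layout, and the timeout value are assumptions made for the example, while the two event IDs are the ones named in the post.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

SHUTDOWN_EVENT_ID = 513  # shutdown event, as named in the post
STARTUP_EVENT_ID = 512   # startup event, as named in the post

Event = Tuple[datetime, int]  # hypothetical layout: (timestamp, event_id)


def missing_startups(events: Iterable[Event],
                     timeout: timedelta,
                     now: datetime) -> List[datetime]:
    """Return shutdown times that had no startup within `timeout`."""
    ordered = sorted(events)
    overdue: List[datetime] = []
    for i, (ts, event_id) in enumerate(ordered):
        if event_id != SHUTDOWN_EVENT_ID:
            continue
        deadline = ts + timeout
        restarted = any(
            eid == STARTUP_EVENT_ID and ts < later <= deadline
            for later, eid in ordered[i + 1:]
        )
        # Only flag the shutdown once the waiting period has fully elapsed.
        if not restarted and now > deadline:
            overdue.append(ts)
    return overdue


if __name__ == "__main__":
    t0 = datetime(2009, 7, 16, 2, 0, 0)
    hung = [(t0, SHUTDOWN_EVENT_ID)]
    print(missing_startups(hung, timedelta(minutes=10), t0 + timedelta(minutes=11)))
```

A server that logs the shutdown event and then hangs never logs the startup event, so it is flagged once the timeout passes even though it may still answer ping.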

July 23rd, 2009 3:03am

Hi Marnix, SCOM is not able to alert us when a server is hung, as the server is still pingable. How can we set up a SCOM alert for a hung server?

April 2nd, 2012 12:28pm

Hi Javed, that is where your "Health Service Heartbeat Failure" alert is useful. It indicates an unresponsive machine.
Cheers, John Bradshaw

April 2nd, 2012 4:06pm

I'm sorry to tell you, but that is right. I can't believe that SCOM is not capable of alerting on this BASIC event.

June 14th, 2012 3:35pm
