Alert storm for Heartbeat failure
Hi all,
After failing to nail down why an alert storm for Health Service Heartbeat Failure occurs about once a week, it's time for a different approach.
Question: For a particular alert, is it possible to place a DELAY on that alert, so that if the issue is resolved within a set time (say 2 mins) no notification is sent out?
(Under subscriptions there is a delay that can be entered, but for this particular alert that would mean changing almost every subscription, so I want to avoid that route if possible.)
Thx,
John Bradshaw
March 16th, 2011 11:15pm

Wouldn't it be better to figure out why, once a week at a particular time, you are getting massive heartbeat failures?
Microsoft Corporation
March 17th, 2011 12:25am

Hi John,
Does this happen for all agents or just agents with a particular Management Server as their Primary?
Are there any errors in the Operations Manager event log on the Management Servers prior to this mass of heartbeat failures?
Is there a weekly backup of the Management Server(s) that might be causing this?
How often do you see event 6022 being logged on the Management Server or one of the agents around the time of the mass heartbeat failures? It should be logged every 15 minutes.
Sadly, as you have pointed out, alert aging has to be set on every subscription. There isn't a global setting for this.
Cheers,
Graham
View OpsMgr tips and tricks at http://systemcentersolutions.wordpress.com/
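The 6022 check above can be scripted once the event timestamps have been exported from the event log. The following is only a minimal sketch: the function name `find_gaps`, the 2-minute tolerance, and the sample timestamps are all made up for illustration, and it assumes you already have the event times as Python `datetime` values.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected=timedelta(minutes=15),
              tolerance=timedelta(minutes=2)):
    """Return (previous, next) pairs of consecutive events that are
    further apart than the expected interval plus a small tolerance."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if nxt - prev > expected + tolerance:
            gaps.append((prev, nxt))
    return gaps

# Hypothetical sample data: one event every 15 minutes, with two missing.
start = datetime(2011, 3, 16, 22, 0)
events = [start + timedelta(minutes=15 * i) for i in range(8) if i not in (4, 5)]

for prev, nxt in find_gaps(events):
    print(f"No event 6022 between {prev:%H:%M} and {nxt:%H:%M}")
```

If the gaps line up with the times of the heartbeat failure storm, that points at the Health Service on that machine being stalled rather than at the agents themselves.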
March 17th, 2011 1:09am

Hi,
Please check this article about how to troubleshoot alert storms: http://blogs.technet.com/b/operationsmgr/archive/2010/02/25/troubleshooting-alert-storms-in-opsmgr-2007.aspx
March 17th, 2011 9:35am

Hi Graham,
"Does this happen for all agents or just agents with a particular Management Server as their Primary?"
Just happens on the secondary MS, which also happens to be a virtual box.
"Are there any errors in the operationsmanager event log on the Management Servers prior to this mass of heartbeat failures?"
As usual, no events that seem out of character are apparent.
"Is there a weekly backup of the Management Server(s) that might be causing this?"
Investigating the backup regime... Could well be the culprit! I've heard rumblings in the corridor that backup is causing issues on quite a few servers.
"How often do you see event 6022 being logged on the Management server or one of the agents around the time of the mass heartbeat failures."
Event is logged faithfully every 15 mins.
"Sadly as you have pointed out - alert aging has to be set on every subscription. There isn't a global setting for this."
Ah well... what can I say. Thx for the info on this.
Cheers,
JB
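For anyone reading later, the alert aging behaviour being discussed boils down to a simple rule: hold the notification until the alert has stayed open for the delay period. The sketch below only illustrates that rule. It is not the OpsMgr API, and the function name `should_notify` and the sample times are made up.

```python
from datetime import datetime, timedelta
from typing import Optional

def should_notify(raised: datetime, resolved: Optional[datetime],
                  now: datetime,
                  delay: timedelta = timedelta(minutes=2)) -> bool:
    """Notify only if the alert is still unresolved once it has been
    open for at least `delay` (hypothetical helper, illustration only)."""
    if resolved is not None:
        # The alert closed on its own, so nothing is sent.
        return False
    return now - raised >= delay

# Hypothetical example: an alert raised at 02:00.
raised = datetime(2011, 3, 17, 2, 0)
print(should_notify(raised, None, raised + timedelta(minutes=1)))  # still inside the delay
print(should_notify(raised, None, raised + timedelta(minutes=3)))  # open past the delay
print(should_notify(raised, raised + timedelta(seconds=90),
                    raised + timedelta(minutes=3)))                # resolved in time
```

With a 2 minute delay, a heartbeat failure that clears itself within that window never reaches the subscribers, which is the effect asked for in the original question.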
March 18th, 2011 12:12am

Perhaps pull up a chart or report of disk I/O and CPU on the Management Server for the time that you get the heartbeat failures, and see if it is suffering resource contention. Also, what else is on the host? It might be other workloads on the host that are starving the Management Server on a weekly basis.
Good luck,
Graham
View OpsMgr tips and tricks at http://systemcentersolutions.wordpress.com/
March 18th, 2011 1:48am

I've seen our weekly AV scan cause an alert storm: it acts as if the RMS/MS is online but everything else has been taken offline. Could this be your issue?
March 18th, 2011 5:39pm

