Flood of 20020 heartbeat failure events
Hello all,

I was wondering whether anyone else has experienced the following and found the cause or a solution in their environment.

Over the last three weeks we have been getting floods of 20020 heartbeat failures from multiple agents (hundreds of them) between 3am and 4am, which send out hundreds of notification alerts. My first thought was that either these servers were down (unlikely) or a network glitch was causing a loss of communication. I have checked both and neither is the case (the network team tell me there are no issues to report).

Next I checked the logs (Application, System, Operations Manager) on the RMS, the MS, the database server, the servers reporting heartbeat failures, and the AD domain controllers. As stated, the RMS shows hundreds of 20022 events in the OpsMgr log. Interestingly, at around the same time I see hundreds of 2115 warnings on the MS (not the RMS, but its partner MS) showing the following:

A Bind Data Source in Management Group XX has posted items to the workflow, but has not received a response in 2082 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
Instance : ManagementServer.Domain.com
Instance Id : {1F0644BA-10AD-DA50-1BD1-43B0B98749B3}

The database backups also run around this time, and I see errors stating that the backups completed with errors, i.e. they do not complete normally. We store the database on NetApp iSCSI LUNs and use SnapManager for SQL for backups.

I do not see anything in particular on the monitored servers or the domain controllers suggesting a network or communication issue. At the moment it therefore looks like the issue is between the MS, the RMS and the database server.

I am wondering whether the backups are conflicting with the Data Warehouse maintenance schedule and this is causing the backup to fail (although I believe SCOM maintenance tasks normally run around 2am), possibly holding the database locked for a period of time (although it is not supposed to, as it uses a more or less instant snapshot) and thereby stopping data from being uploaded to the database by the MS. However, on the surface of it I cannot see why this type of issue would cause hundreds of 20022 alerts. The 20022 and 2115 events appear around the same time, in their hundreds, so logic suggests the events may be related.

This does not happen every day. It has occurred about four times over a three-week period, always in the early morning between 3am and 4am.

Any ideas or suggestions are most welcome.

Thanks
Ernie
July 9th, 2011 9:51am
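
One way to test the suspicion that the 20022 and 2115 events line up with the backup window is to export the events and count how many fall inside it. The following is only a minimal sketch: the sample data, the timestamp format and the function names are assumptions made for illustration, not output from the environment described above.

```python
from collections import Counter
from datetime import datetime, time

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed export format

# Hypothetical (timestamp, event id) pairs, standing in for an export of
# the Operations Manager event log. The values are illustrative only.
SAMPLE_EVENTS = [
    ("2011-07-09 03:05:12", 20022),
    ("2011-07-09 03:07:40", 2115),
    ("2011-07-09 03:41:03", 20022),
    ("2011-07-09 14:15:00", 2115),
]


def count_by_hour(events, event_ids):
    """Count events per (date, hour, event id) for the given event ids."""
    counts = Counter()
    for raw_ts, event_id in events:
        if event_id in event_ids:
            ts = datetime.strptime(raw_ts, TS_FORMAT)
            counts[(ts.date(), ts.hour, event_id)] += 1
    return counts


def share_in_window(events, event_ids, start, end):
    """Return the fraction of matching events whose time of day is in [start, end)."""
    stamps = [
        datetime.strptime(raw_ts, TS_FORMAT)
        for raw_ts, event_id in events
        if event_id in event_ids
    ]
    if not stamps:
        return 0.0
    inside = sum(1 for ts in stamps if start <= ts.time() < end)
    return inside / len(stamps)


if __name__ == "__main__":
    ids = {20022, 2115}
    for key, n in sorted(count_by_hour(SAMPLE_EVENTS, ids).items()):
        print(key, n)
    print(share_in_window(SAMPLE_EVENTS, ids, time(3, 0), time(4, 0)))
```

If nearly all of the events fall inside the backup window on the affected days and almost none on the other days, that supports the idea that the backup (or something else scheduled at that time) is involved, although it does not prove the cause.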

This seems very similar to your issue; hopefully it's helpful: Flood of 20020 heartbeat failure events

You could try stopping the backups for a few days and see whether these errors still occur.

Certifications: MCSA 2003 | MCSE 2003 | MCTS(5*) | MCITP:SA
July 10th, 2011 9:36am

Hi Shadow man, thanks for the input. That is me too; I tend to post in more than one forum.

Thanks
Ernie
July 10th, 2011 10:50am

Yes, it looks like Ernie moves around :-)

In any case, it is possible that this is caused by locks or timeouts because maintenance and backup run at the same time while the database also has to handle inserts from live monitoring. As the install is SQL Standard edition, there is a bigger chance of locks and timeouts from this than with Enterprise edition. So either try shifting the backup job to another time, use DPM (we are System Center dudes!!), or consider Enterprise edition (if price is not an issue for you). Otherwise you would need to use maintenance mode, if you know the fixed time periods when it happens. Of course that is not ideal.

Bob Cornelissen - BICTT (My BICTT Blog) - Microsoft Community Contributor 2011 Recipient
July 10th, 2011 12:32pm

By the way, at SCC you have the nice picture next to your name and here it's just a normal thingy. :-)

Bob Cornelissen - BICTT (My BICTT Blog) - Microsoft Community Contributor 2011 Recipient
July 10th, 2011 12:34pm

Hi Bob, that is me in the pic. My, I was a baby :)
July 11th, 2011 1:45am

"I am wondering if the backups are conflicting with the Data Warehouse maintenance schedule and this is causing the backup to fail (although I believe SCOM maintenance tasks are normally around 2am), possibly holding the database locked for a period of time (although it is not supposed to as it uses a more or less instant snapshot) thereby stopping data being uploaded to the database by the MS."

You can check this in the database error logs. They will tell you how long the snapshot holds back the actual SQL processing. But that covers only the snapshot being taken; it does not mean that no data needs to be transferred after this period (it most certainly will).

My first guess would be that the database backup simply takes too much performance from your iSCSI SAN (is it a shared SAN, and could other backups also be interfering with the performance?). Try scheduling the backup at a later time and see what happens.

Regards,
Marc Klaver
http://jama00.wordpress.com/
July 11th, 2011 4:40am
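
To put a number on how long a snapshot holds back SQL processing, the error log can be scanned for the freeze and resume messages that SQL Server writes during a snapshot backup. This is a rough sketch only: the line layout (timestamp, source, then "I/O is frozen on database ..." or "I/O was resumed on database ...") and the sample lines are assumptions based on the usual error log format, so check them against your own log before relying on the result. The function name freeze_durations is made up for this example.

```python
import re
from datetime import datetime

# Assumed error log layout: "<date> <time> <source>  <message>".
# Database names containing a dot are not handled by this simple pattern.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{2})\s+\S+\s+"
    r"I/O (?P<state>is frozen|was resumed) on database (?P<db>[^.]+)\."
)


def freeze_durations(lines):
    """Pair each 'is frozen' line with the next 'was resumed' line for the
    same database and return (database, seconds frozen) tuples."""
    frozen_at = {}
    results = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a freeze/resume message
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        db = match.group("db")
        if match.group("state") == "is frozen":
            frozen_at[db] = ts
        elif db in frozen_at:
            results.append((db, (ts - frozen_at.pop(db)).total_seconds()))
    return results


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real log.
    sample = [
        "2011-07-09 03:00:01.25 spid55      I/O is frozen on database OperationsManager. No user action is required.",
        "2011-07-09 03:00:02.00 Backup      Database backed up.",
        "2011-07-09 03:00:03.75 spid55      I/O was resumed on database OperationsManager. No user action is required.",
    ]
    print(freeze_durations(sample))
```

A freeze of a few seconds would point away from the snapshot itself and more towards the load that follows it, such as the data transfer on the SAN.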

