System Center Management Health Service Unloaded System Rule(s)

Hey there,

I'm running Operations Manager 2012 R2 and have 4 agents that consistently show this alert despite my best efforts to resolve it.  On the agents I'll see an event 4000 followed immediately by an event 1103:

Log Name:      Operations Manager
Source:        HealthService
Date:          1/24/2014 11:47:05 AM
Event ID:      4000
Task Category: Health Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      HOSTNAME
Description:
A monitoring host is unresponsive or has crashed.  The status code for the host failure was 2164195371.

Log Name:      Operations Manager
Source:        HealthService
Date:          1/24/2014 11:47:05 AM
Event ID:      1103
Task Category: Health Service
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      HOSTNAME
Description:
Summary: 273 rule(s)/monitor(s) failed and got unloaded, 273 of them reached the failure limit that prevents automatic reload. Management group "mymanagementgroup". This is summary only event, please see other events with descriptions of unloaded rule(s)/monitor(s).

There's also a corresponding event 1000 in the Application log:

Log Name:      Application
Source:        Application Error
Date:          1/24/2014 11:28:14 AM
Event ID:      1000
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      HOSTNAME
Description:
Faulting application name: MonitoringHost.exe, version: 7.1.10184.0, time stamp: 0x522a23d5
Faulting module name: MSVCR100.dll, version: 10.0.40219.325, time stamp: 0x4df2bcac
Exception code: 0xc0000417
Fault offset: 0x0000000000070468
Faulting process id: 0x1e38
Faulting application start time: 0x01cf1932066335fb
Faulting application path: C:\Program Files\Microsoft Monitoring Agent\Agent\MonitoringHost.exe
Faulting module path: C:\Windows\system32\MSVCR100.dll
Report Id: 48dc1e0c-8525-11e3-ae8c-005056a25687


Things I've already tried to resolve the issue:

  • Tried to repair the agent through the Operations Manager console
  • Uninstalled and reinstalled the agent, both by pushing it and by installing it manually
  • Restarted the HealthService on the agents

What other things can I do to try to diagnose what's broken?   I have 265 other agents that are working just fine and I can't figure out what's different about these 4 agents.

The agents are running Server 2008 R2.
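
One quick check on the events above is to convert the decimal status code from event 4000 into hex and split it into its fields, which makes it easier to match against other logs. The sketch below is only an illustration: `split_hresult` is a made-up helper name, and it assumes the status follows the standard 32-bit HRESULT layout (severity in bit 31, an 11-bit facility, a 16-bit code).

```python
def split_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT-style status into its standard fields."""
    value &= 0xFFFFFFFF  # normalise to an unsigned 32-bit value
    return {
        "hex": f"0x{value:08X}",
        "failure": bool(value >> 31),       # bit 31: severity (1 = failure)
        "facility": (value >> 16) & 0x7FF,  # bits 16-26: facility
        "code": value & 0xFFFF,             # bits 0-15: code
    }


# Status code reported in event 4000 above.
status = split_hresult(2164195371)
print(status)
# {'hex': '0x80FF002B', 'failure': True, 'facility': 255, 'code': 43}
```

In hex the event 4000 status is 0x80FF002B, which is a far easier string to search for in trace output than the decimal form.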





January 24th, 2014 10:29pm

This is probably caused by a bad rule. If you have any event-related rules, delete them and see if the issue persists.

Also, take a look at http://support.microsoft.com/kb/982501/en-gb

January 27th, 2014 12:56pm

Is there a way to identify specifically what rules are bad?

Is there a log that the agent writes somewhere that I can throw into verbose mode to see which rule or rules failed to load?

EDIT:

So I found a way to do some level of tracing, and this is what I've turned up so far:

[0]4144.1596::01/27/2014-15:30:23.547 [ExecutionManager] [] [Error] :CMonitoringHostClient::CHostStatusChecker::DoCallback{MonitoringHostClient_cpp3308}CheckIsHostAlive failed with code 0x800706ba(RPC_S_SERVER_UNAVAILABLE).
[0]4144.1596::01/27/2014-15:30:23.547 [ExecutionManager] [] [Error] :CMonitoringHostClient::CHostStatusChecker::DoCallback{MonitoringHostClient_cpp3310}Shutting down host 1 due to host failure.
[0]4144.1596::01/27/2014-15:30:23.547 [ExecutionManager] [] [Error] :CMonitoringHostClient::ShutdownHostInternal{MonitoringHostClient_cpp3784}IMonitoringHost::Shutdown failed with code 0x800706ba(RPC_S_SERVER_UNAVAILABLE), will continue.
[0]4144.1596::01/27/2014-15:30:23.564 [ExecutionManager] [] [Information] :CMonitoringHostClient::UnloadModule{MonitoringHostClient_cpp1100}UnloadModules called on shutdown host.
[0]4144.1596::01/27/2014-15:30:23.564 [ExecutionManager] [] [Information] :CMonitoringHostClient::UnloadModule{MonitoringHostClient_cpp1100}UnloadModules called on shutdown host.
[0]4144.1596::01/27/2014-15:30:23.564 [HealthServiceRuntime] [] [Verbose] :CModuleHost::TerminateModuleNoLock{ModuleHost_cpp4555}TerminateModuleNoLock called on module 1501.
[0]4144.1596::01/27/2014-15:30:23.564 [HealthServiceRuntime] [] [Verbose] :CModuleHost::TerminateModuleNoLock{ModuleHost_cpp4555}TerminateModuleNoLock called on module 1503.
[0]4144.1596::01/27/2014-15:30:23.564 [ExecutionManager] [] [Verbose] :CWorkflowTracker::UnloadWorkflow{WorkflowTracker_cpp528}Workflow tracker 0000000020C51700 for workflow 72057594037928330 completed unloading.
[0]4144.1596::01/27/2014-15:30:23.564 [ExecutionManager] [] [Warning] :CModuleTracker::ModuleFailed{ExecutionManager_cpp2789}Reporting failure of workflow 72057594037928330 with failure code WINERROR=80FF002B.

So it looks like an RPC issue is crashing one of the modules? I guess the next question is which module it is, heh.
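
For a longer trace, the failing workflows can be pulled out with a small script rather than by eye. The following is a rough Python sketch; the regular expressions are inferred only from the lines pasted above, not from any documented trace format, and `failed_workflows` is a hypothetical helper name.

```python
import re

# Pattern guessed from the pasted trace lines:
# [cpu]pid.tid::timestamp [component] [] [level] :function{source}message
TRACE_LINE = re.compile(
    r"^\[\d+\](?P<pid>\d+)\.(?P<tid>\d+)::(?P<time>\S+) "
    r"\[(?P<component>\w+)\] \[[^\]]*\] \[(?P<level>\w+)\] "
    r":(?P<function>.*?)\{(?P<source>\w+)\}(?P<message>.*)$"
)
WORKFLOW_FAILURE = re.compile(
    r"workflow (?P<workflow>\d+) with failure code WINERROR=(?P<code>[0-9A-Fa-f]+)"
)


def failed_workflows(lines):
    """Map each failed workflow id to its failure code (as an int)."""
    failures = {}
    for line in lines:
        parsed = TRACE_LINE.match(line.strip())
        if parsed is None:
            continue  # not a trace line in the assumed format
        failure = WORKFLOW_FAILURE.search(parsed["message"])
        if failure:
            failures[failure["workflow"]] = int(failure["code"], 16)
    return failures


# Two lines copied from the trace above.
sample = [
    "[0]4144.1596::01/27/2014-15:30:23.547 [ExecutionManager] [] [Error] "
    ":CMonitoringHostClient::CHostStatusChecker::DoCallback{MonitoringHostClient_cpp3310}"
    "Shutting down host 1 due to host failure.",
    "[0]4144.1596::01/27/2014-15:30:23.564 [ExecutionManager] [] [Warning] "
    ":CModuleTracker::ModuleFailed{ExecutionManager_cpp2789}"
    "Reporting failure of workflow 72057594037928330 with failure code WINERROR=80FF002B.",
]

for workflow, code in failed_workflows(sample).items():
    print(workflow, hex(code), code)
# 72057594037928330 0x80ff002b 2164195371
```

The decimal form of WINERROR=80FF002B is 2164195371, the same status code reported in event 4000, so the trace and the event log are describing the same host failure. As far as this excerpt shows, the workflow number is only an internal identifier, so this narrows down when workflows fail rather than naming the rule.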

January 27th, 2014 2:17pm

January 27th, 2014 10:15pm

Every 15 seconds we ping MonitoringHost with a COM+ call that only returns one thing (S_OK). If this call fails, we assume MH is dead and terminate it. That is what is happening here.

Either something is wrong with COM+ on this system, or the box is too overloaded to answer this request.
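
To make the mechanism concrete, here is a minimal Python sketch of that kind of liveness check. It is not the agent's actual code: `watch_host`, `ping`, and `terminate` are hypothetical names, and a plain callable stands in for the COM+ call. The only details taken from the description above are the 15-second interval and the rule that a single failed ping gets the host terminated.

```python
import time


def watch_host(ping, terminate, interval=15.0, max_checks=None, sleep=time.sleep):
    """Ping the host every `interval` seconds; terminate it on the first failure.

    Returns the number of successful pings before the failure (or before
    `max_checks` was reached).
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        try:
            alive = bool(ping())
        except Exception:
            # Any error from the call (for example an RPC failure) counts as dead.
            alive = False
        if not alive:
            terminate()
            return checks
        checks += 1
        sleep(interval)
    return checks


# Demo: a fake host that answers twice and then stops responding.
calls = {"n": 0}


def fake_ping():
    calls["n"] += 1
    if calls["n"] > 2:
        raise ConnectionError("host did not answer")
    return True


terminated = []
result = watch_host(fake_ping, lambda: terminated.append("host 1"),
                    sleep=lambda seconds: None)  # skip real waiting in the demo
print(result, terminated)
# 2 ['host 1']
```

The point of the sketch is that the watcher cannot tell a crashed host from one that is merely too slow to answer; both look like a failed ping, which is why an overloaded box and a broken COM+ setup produce the same symptom.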

Finding the root cause will probably involve debugging tools. I suggest you open a support case with Microsoft CTS for help with this.

January 29th, 2014 9:22am

