Healthservice CPU usage (Network Steve Forum)

Healthservice CPU usage

Greetings, I currently have an issue with the HealthService on my RMS: it runs the CPU up to 100% every second day or so. I have created a rule to monitor CPU usage for healthservice.exe; when it stays at 100% for more than 10 minutes, an alert is generated. The operator then has a work instruction to stop and start the service. This is currently a manual process. I have tried to automate the restart of the service through the recovery task tab, using "run a command line". I tried the net stop and net start commands for healthservice.exe, but I get an error that basically says it is not a Windows Server service. Is there any other way I can restart the service as a recovery task?
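One likely cause of the error is that `net stop` and `net start` expect a service name, not an executable name such as healthservice.exe. The following is a minimal Python sketch of that idea, not a tested recovery task. It assumes the service's short name is `HealthService`; the function names `restart_commands` and `restart_service` are hypothetical and exist only for this illustration. Note that the replies further down advise against restarting the service at all.

```python
import subprocess

# Assumption: the service's short name is "HealthService". The executable
# name "healthservice.exe" is not a service name, so `net` rejects it.
SERVICE_NAME = "HealthService"


def restart_commands(service_name=SERVICE_NAME):
    """Return the command lines that stop and then start a service."""
    return [
        ["net", "stop", service_name],
        ["net", "start", service_name],
    ]


def restart_service(service_name=SERVICE_NAME):
    """Run stop, then start. Raises CalledProcessError if either step fails.

    Must be run on Windows from an elevated prompt.
    """
    for command in restart_commands(service_name):
        subprocess.run(command, check=True)
```

Calling `restart_service()` would run the two commands in order; `restart_commands()` on its own only builds them, which makes it easy to check what would be executed before wiring anything into a recovery task.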

July 14th, 2011 10:46am

I have checked that the anti-virus has an exclusion for the health state folder. Also, I know that a restart-HealthService recovery task does exist, because the agent performance rules have that recovery task set to run.


July 14th, 2011 10:51am

I really think the approach you are taking is broken, in a shoot-yourself-in-the-foot way. The health service using 100% CPU is an indication of load. The health service on the RMS is running workflows, and these are using CPU. If you are having your RMS do URL monitoring or network monitoring, or have installed MPs that use the RMS to do work (Exchange 2010), you should think about giving the machine an upgrade or moving the work to a separate management server. Restarting the health service only slows it down: it will still have just as much work when it starts up again, and the restart itself adds work, because the service has to sync to the config service. This adds to the server workload rather than reducing it. Think about looking at which workflows are running, and if there are workflows that go into the "stoopid-zone", such as an infinite loop, delete the associated MP or disable that work via an override. The AD topology discovery runs on the RMS and can make it use a lot of CPU. That particular discovery also causes thousands of extra workflows to be attempted that will fail to run (it's a bug in the way AD topology discovery works, and unfortunately it cannot be changed at this point). If you are running the AD topology discovery, just disable it and then run the PowerShell that removes discovered items with disabled discoveries. This will kill off the added thousands of workflows that are the side effect of this. Microsoft Corporation

July 14th, 2011 12:27pm

Hi Dan, thanks for the reply. When you mention workload, that definitely is the case. Currently I have an RMS, 12 management servers, and 21000 agents. The problem is that all the MPs are vanilla, with no custom stuff at all. You mentioned checking the workflows. How do I go about doing that?


July 14th, 2011 2:22pm

Kevin, a few things:
1. How many workflows are running on your RMS? You can find this using the Workflow Count view in the Monitoring space (Monitoring\Operations Manager\Management Server Performance\Workflow Count).
2. Are there any agents reporting to your RMS? You can find this using the Agent Managed view in the Administration space (Administration\Device Management\Agent Managed).
3. 21000 agents? Is that a typo? We only officially support 10000 agents in a Management Group. Part of your issue here might be that your Management Group is too large, and that is why you are running into these CPU issues.
4. How many CPUs do you have in your RMS, and of what type / model? Maybe you can just scale up the hardware to get around this issue temporarily, until you can tune the environment further to reduce the amount of work the RMS must do.
Also, like Dan says, restarting the HealthService on the RMS will only make things worse, due to all the work that must be done at HealthService start-up. I would avoid that if possible. Michael Pearson OpsMgr Performance Test Team http://blogs.technet.com/michaelpearson/ This posting is provided "AS IS" with no warranties, and confers no rights. Use of attachments are subject to the terms specified at http://www.microsoft.com/info/cpyright.htm
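To put the agent count in perspective, here is a small Python sketch of the arithmetic, using only the figures mentioned in this thread (21000 agents, 12 management servers, a supported limit of 10000 agents). The function names are hypothetical, and the per-server figure assumes agents are spread evenly across the management servers, which may not be true in practice.

```python
# Supported agents per Management Group, as quoted in the reply above.
SUPPORTED_AGENT_LIMIT = 10_000


def agent_load_ratio(agent_count, limit=SUPPORTED_AGENT_LIMIT):
    """How many times over the supported limit the group is."""
    return agent_count / limit


def agents_per_management_server(agent_count, management_servers):
    """Average agents per server, assuming an even spread (an assumption)."""
    return agent_count / management_servers


ratio = agent_load_ratio(21_000)
per_server = agents_per_management_server(21_000, 12)

print(f"{ratio:.1f}x the supported agent limit")
print(f"{per_server:.0f} agents per management server on average")
```

With the numbers from this thread, the group is a little over twice the supported size, which is consistent with the suggestion that size alone may explain part of the CPU load.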

July 14th, 2011 3:21pm

I think you should start with the 21000 agents. The engineering limit is around 10K agents. When you say vanilla, do you mean MPs you downloaded from Microsoft? Which ones? MPs that create a lot of load: Exchange 2010, DNS, AD with topology discovery enabled, DHCP, Lync/OCS. Microsoft Corporation


July 14th, 2011 3:23pm

If you really have 21000 agents, you can start by reading this thread: http://social.technet.microsoft.com/Forums/en-US/operationsmanagergeneral/thread/dc5f78fe-7426-40c9-99c2-39aad0d3bfa5 If the MPs are vanilla, with no custom stuff at all, I assume you have not really tuned the management packs for your environment. I think you will have:
1) Config churn. Too many discoveries are running too frequently. You must tune the discoveries.
2) Monitor state changes. You will find too many monitors changing state too often. You must tune these monitors (disable them, or change the threshold).
3) Event collection. You will have too many events being collected. These events are mostly used (if used at all) in views and/or reports. I am convinced you don't use 90+% of these events. You should disable these event collections.
4) Performance collections. You will have too many performance counters being collected. I advise turning off all performance counters except the ones you really use (you would be surprised how much data is collected but never used).
All this will reduce the load on your SQL servers and will reduce the amount of work for the RMS server. Regards, Marc Klaver http://jama00.wordpress.com/
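One way to decide which collections to tune first is to rank them by how much data they produce. The Python sketch below illustrates only the ranking step. It assumes you have already exported a list with one entry per collected sample, labelled with the rule that produced it; the rule names and counts shown are made up for the example and do not come from any real management pack.

```python
from collections import Counter


def noisiest_rules(samples, top=3):
    """Count samples per rule name and return the most frequent ones first."""
    return Counter(samples).most_common(top)


# Hypothetical export: one entry per collected sample, labelled by rule name.
collected = (
    ["Perf: Processor Time"] * 5
    + ["Event: Service Restarted"] * 3
    + ["Perf: Free Disk Space"] * 1
)

for rule, count in noisiest_rules(collected, top=2):
    print(f"{count:>4}  {rule}")
```

The rules at the top of such a list are the first candidates for an override, whether that means disabling the collection or lowering its frequency.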

July 15th, 2011 3:13am


July 15th, 2011 6:44am

Dan, by vanilla I mean downloaded from Microsoft: DNS, AD (with topology discovery active), DHCP, SQL, OS. All standard ones. No Exchange.

July 15th, 2011 6:46am


July 15th, 2011 6:48am
