RMS Server not working
Hi,

I've been running SCOM for the past 6 months and everything was working fine until 2 days ago. The SCOM setup is below.

1 RMS server: SCOM 2007 SP1, W2K8 32-bit, over 100 agents installed in a SINGLE DOMAIN.
1 DB server: SQL 2005 SP2, W2K8 32-bit.

I came in 2 days ago and found that there had been no alerts in the ops console. I dug a bit further and checked the logs on the agents (servers) and on the RMS, and I'm getting a lot of the following events:

Event ID 21016
OpsMgr was unable to set up a communications channel to (RMS FQDN) and there are no failover hosts. Communication will resume when (RMS FQDN) is both available and allows communication from this computer.

Event ID 20070
The OpsMgr Connector connected to (RMS FQDN), but the connection was closed immediately after authentication occurred. The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration. Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.

Event ID 21023
OpsMgr has no configuration for management group (managementgroupname) and is requesting new configuration from the Configuration Service.

Event ID 21042
Operations Manager has discarded 1 items in management group (managementgroupname), which came from $$ROOT$$. These items have been discarded because no valid route exists at this time. This can happen when new devices are added to the topology but the complete topology has not been distributed yet. The discarded items will be regenerated.

Event ID 29106
The request to synchronize state for OpsMgr Health Service identified by "f9bc56f5-d69b-fb52-0788-792a86aec09d" failed due to the following exception "Microsoft.EnterpriseManagement.Common.DataAccessLayerException: Invalid column name SizeNumeric_486ADDDB_2EB8_819A_FA24_8F6AB3E29543 for query MTV_SelectProperty_5de7b548-657d-7794-52b4-2a828da0cfd1.
   at Microsoft.EnterpriseManagement.Mom.DataAccess.QueryDefinition.GetColumnDefinitionBySourceColumnName(String sourceColumnName, Int32 resultSetIndex)
   at Microsoft.EnterpriseManagement.Mom.DataAccess.QueryDefinition.GetColumnDefinitionBySourceColumnName(String sourceColumnName)
   at Microsoft.Mom.ConfigService.DataAccess.DatabaseAccessor.QueryInstanceProperties(ReadOnlyCollection`1 instances)
   at Microsoft.Mom.ConfigService.Engine.ConfigurationEngine.CommunicationHelper.StateSyncRequestTask.ConfigurationItems.Instances.CollectPublicProperties(ReadOnlyCollection`1 identities, IConfigurationDataAccessor dataAccessor)
   at Microsoft.Mom.ConfigService.Engine.ConfigurationEngine.CommunicationHelper.StateSyncRequestTask.ConfigurationItems.ConfigurationItemCollection`2.CollectPublicProperties(IConfigurationDataAccessor dataAccessor)
   at Microsoft.Mom.ConfigService.Engine.ConfigurationEngine.CommunicationHelper.StateSyncRequestTask.ConfigurationItems..ctor(StateContext stateContext, IConfigurationDataAccessor dataAccessor)
   at Microsoft.Mom.ConfigService.Engine.ConfigurationEngine.CommunicationHelper.StateSyncRequestTask.CreateResponse(Managers managers)
   at Microsoft.Mom.ConfigService.Engine.ConfigurationEngine.Managers.Synchronize(OnDoSynchronizedWork onDoSynchronizedWork)
   at Microsoft.Mom.ConfigService.Engine.ConfigurationEngine.CommunicationHelper.StateSyncRequestTask.Execute(Managers managers)
   at Microsoft.Mom.ConfigService.Engine.ConfigurationEngine.CommunicationHelper.StateSyncRequestTask.Run(Guid source, String cookie, Managers managers, IConfigurationDataAccessor dataAccessor, Stream stream, IConnection connection)".

All the errors seem to indicate that the servers/agents can't connect to the RMS, and all the agents are in the Pending Management folder. When I try to 'Approve' them, it starts a task that just sits there for hours, with no report of success. I've checked a few blogs/forums and checked the SPNs for the RMS health service and SDK service.
The RMS health service runs as Local System on the RMS server, the SDK service runs as the SDK domain account, and the SPNs match those configurations. I can telnet to port 5723 on the RMS from the agents, and I can also ping the FQDN of the RMS from the agents. The Windows firewall is off on the RMS and on all the agents. I ran netstat on the RMS and the agents are connected, but the connection state is TIME_WAIT, not ESTABLISHED!

I checked the SCOM installation directory, C:\Program Files\System Center Operations Manager 2007\ . The Config Service State folder is empty, and this is not supposed to be the case. I also checked the Health Service State folder on the RMS and, again, the management packs folder is empty, though the health service store folder is not!

I am totally at a loss on what to do, and starting afresh with a clean install is not an option. Any help would be much appreciated!
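For anyone repeating the telnet and netstat checks described above on several machines, they can be scripted. This is only a minimal Python sketch and not part of OpsMgr: can_connect and summarize_states are hypothetical helper names, and the parser assumes the usual Windows "netstat -an" column layout (protocol, local address, foreign address, state).

```python
import socket
from collections import Counter

RMS_PORT = 5723  # port the agents in this thread use to reach the RMS


def can_connect(host, port=RMS_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port opens (same idea as the telnet test)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize_states(netstat_output, port=RMS_PORT):
    """Count TCP states for lines whose local or foreign address ends in :port."""
    counts = Counter()
    suffix = ":%d" % port
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: TCP <local address> <foreign address> <state>
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local, foreign, state = fields[1], fields[2], fields[3]
        if local.endswith(suffix) or foreign.endswith(suffix):
            counts[state] += 1
    return counts
```

Feeding the text output of "netstat -an" from the RMS into summarize_states gives a quick count of TIME_WAIT versus ESTABLISHED connections on port 5723, which makes it easier to see whether the picture changes after a fix attempt.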
October 29th, 2009 5:04pm
Looks like trouble in the database world. Did you change anything 2 days ago? Import a new or updated MP? Did you already try to restart the SDK, Config and Health services?

Greetz, Arie de Haan
MVP SCOM
This posting is provided "AS IS" with no guarantees, warranties, rights etc.
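The restart suggested above can be scripted as a dry run first. This is a minimal Python sketch under assumptions: the short service names HealthService, OMCFG and OMSDK are what I would expect on an OpsMgr 2007 RMS, but confirm them with "sc query" before running anything, and the stop/start order is just one reasonable choice. restart_commands and restart_services are hypothetical helper names.

```python
import subprocess

# Assumed short service names for OpsMgr 2007; verify with `sc query` on the RMS.
OPSMGR_SERVICES = ["HealthService", "OMCFG", "OMSDK"]


def restart_commands(services=OPSMGR_SERVICES):
    """Build `net stop` for each service, then `net start` in reverse order."""
    stops = [["net", "stop", name] for name in services]
    starts = [["net", "start", name] for name in reversed(services)]
    return stops + starts


def restart_services(services=OPSMGR_SERVICES, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs an elevated prompt)."""
    for cmd in restart_commands(services):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling restart_services() with the default dry_run=True only prints the six commands, so the list can be checked before anything is stopped.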
October 30th, 2009 11:55pm
The only things that changed in the past 2-5 days were importing the new Exchange MP and creating new maintenance plans for the DB! Nothing more than that, and it seems it's all messed up. Is it possible to restore the OpsMgr DB to the last full backup, with SCOM/RMS picking up the restored DB rather than the existing one? I've restarted all the mentioned services and even checked the SPN registrations as well. Nothing seems to fix the problem! Many thanks for the pointers.
October 31st, 2009 2:10am
Yes, putting the full backup in place of the current DB is absolutely the way to go. For how to do this, and how the backup should have been taken, see this: http://technet.microsoft.com/en-us/library/cc540383.aspx
It is the backup and restore chapter of the operations guide.

Greetz, Arie de Haan
MVP SCOM
This posting is provided "AS IS" with no guarantees, warranties, rights etc.
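As a rough illustration of the restore step itself, here is a minimal Python sketch that only builds the T-SQL statement so it can be reviewed before being run in SQL Server Management Studio. build_restore_sql is a hypothetical helper and the backup path in the usage note is a placeholder; the linked guide remains the authority on the full procedure. A restore needs exclusive access to the database, so the SDK, Config and Health services on the RMS would normally be stopped first.

```python
def build_restore_sql(database, backup_file):
    """Return a T-SQL RESTORE statement that overwrites `database` from a full backup file."""
    db = database.replace("]", "]]")        # escape for a [bracketed] identifier
    path = backup_file.replace("'", "''")   # escape for an N'...' string literal
    return (
        "RESTORE DATABASE [%s]\n"
        "FROM DISK = N'%s'\n"
        "WITH REPLACE, RECOVERY;" % (db, path)
    )
```

For example, build_restore_sql("OperationsManager", r"D:\Backups\OperationsManager_full.bak") produces a three-line statement. WITH REPLACE overwrites the existing database, so the current (broken) one should be backed up first if there is any chance it is needed later.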
November 1st, 2009 1:29am
Hi Arie,

I restored the OperationsManager DB and the problem still exists. It seems the RMS does not have any config files for the management group, and I don't know how to force it to either create one or get it from the database (assuming it is stored in there)! I was also contemplating promoting a new management server to RMS, but I've read on TechNet that the management server would need to have existed before the RMS became faulty or stopped working, so that option wouldn't make any difference!
November 2nd, 2009 4:06pm
Two things I have seen cause this:
1. Importing the base OS MP. Usually removing these and then reimporting resolves it.
2. Installing a SCOM agent on the RMS. Remove it.
November 3rd, 2009 5:22am
I had the exact same error, because we had apparently only partially updated to the new Operating System MPs. We used the online catalog to list all updateable MPs and imported them. Error gone.
November 10th, 2009 6:22pm
Hi Trondhindenes,

I doubt that was the problem, but I've managed to 'fix' it. I restored a 1-month-old DB backup we had. Initially I was getting a huge number of errors on the RMS regarding stale data; this has now stopped and everything is working perfectly again. I guess it underscores the importance of regularly maintaining the SQL databases that SCOM uses!

Thanks to Kevin and Arie for helping me with their troubleshooting tips and taking time out to reply to my post!

Muchas gracias
November 12th, 2009 11:57am
Hi All,

I had the same problem. At first I tried updating all MPs, but it didn't make any difference. I restored the SCOM database from an earlier backup and I was getting lots of errors, but after an hour all those errors stopped and it is working fine now.

Regards, Tim

Kind Regards
Tim (Canberra)
July 13th, 2011 5:46am