Failed to failover SQL instance from one node to another
Hi,
I am experiencing an issue with SQL Server 2012 clustering. In short, when I move the SQL Server instance from node SQ1 to node SQ2, the process fails.
At the start, the cluster tries to change the owner to SQ2, but then it fails back automatically to SQ1 and the following errors are logged in the cluster events in Event Viewer:
Event ID: 1205
The Cluster service failed to bring clustered role 'SQL Server (MSSQLSERVER)' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.
Event ID: 1069
Cluster resource 'SQL Server' of type 'SQL Server' in clustered role 'SQL Server (MSSQLSERVER)' failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
Any ideas that could help resolve this issue?
Thanks,
TF
September 12th, 2015 10:16am
Here is what I see in the cluster log on Node 2:
00000c84.0000101c::2015/09/12-09:18:30.982 ERR [RES] SQL Server <SQL Server>: [sqsrvres] Failed to start service with error 1069. Please try again
Error 1069 is normally a logon failure... Is it possible that the password of the service account was changed after installation? (Or was it perhaps mistyped during setup?)
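Note that "error 1069" in this cluster log line is the Win32 code returned when the service was started (ERROR_SERVICE_LOGON_FAILED, "The service did not start due to a logon failure"); it is unrelated to cluster Event ID 1069 in the original post, which only reports that the resource failed. As a minimal sketch, assuming you have the cluster log as plain text (for example generated with the Get-ClusterLog cmdlet), the following Python scans a line for the sqsrvres start failure and maps the code to its winerror.h name. The regex and the three-entry lookup table are illustrative assumptions, and `explain_cluster_log_line` is a hypothetical helper, not part of any Windows tooling.

```python
import re
from typing import Optional

# Illustrative subset of Win32 service-start error codes (winerror.h).
# Not a complete list; unknown codes fall through to a generic message.
SERVICE_START_ERRORS = {
    1053: "ERROR_SERVICE_REQUEST_TIMEOUT: the service did not respond to the start or control request in a timely fashion",
    1068: "ERROR_SERVICE_DEPENDENCY_FAIL: the dependency service or group failed to start",
    1069: "ERROR_SERVICE_LOGON_FAILED: the service did not start due to a logon failure",
}

# Assumption: matches the sqsrvres message wording quoted in this thread;
# other SQL Server versions may phrase the message differently.
SQSRVRES_START_FAILURE = re.compile(r"\[sqsrvres\] Failed to start service with error (\d+)")


def explain_cluster_log_line(line: str) -> Optional[str]:
    """Describe the Win32 error in a sqsrvres start-failure line, or return None."""
    match = SQSRVRES_START_FAILURE.search(line)
    if match is None:
        return None  # not a sqsrvres start-failure line
    code = int(match.group(1))
    return SERVICE_START_ERRORS.get(code, f"Win32 error {code} (not in this table)")


line = ("00000c84.0000101c::2015/09/12-09:18:30.982 ERR [RES] SQL Server <SQL Server>: "
        "[sqsrvres] Failed to start service with error 1069. Please try again")
print(explain_cluster_log_line(line))
```

Run against the line above, this reports ERROR_SERVICE_LOGON_FAILED, which is why the first thing to suspect is the credentials the Service Control Manager uses for the SQL Server service on SQ2.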
September 12th, 2015 2:07pm
No, that's not possible: the password is set to never expire, and the same user is used on Node1.
Also, the SQL Server service is running on Node2.
September 12th, 2015 2:20pm
I tried restarting the SQL Server service on Node2 and it works fine with the password already set.
If the password had been entered incorrectly, I wouldn't be able to restart the SQL Server/Agent services. Am I right?
I can confirm we haven't changed the SQL Server service password.
September 12th, 2015 3:27pm
No, it's still not working.
I tried to fail over to Node2; it fails back automatically to Node1 and generates errors 1205 and 1069.
Any suggestions? This is really weird; it was working fine before.
September 12th, 2015 3:33pm
In the System event log I can see many error and warning entries:
Event ID 10028 (DistributedCOM): DCOM was unable to communicate with the computer LEOMSSQL.leoxchange.local using any of the configured protocols; requested by PID b1c (C:\Windows\system32\ServerManager.exe).
Event ID 1014: Name resolution for the name 10.in-addr.arpa timed out after none of the configured DNS servers responded.
In the Security event log I can't see any related errors.
The SQL Server instance is working fine on Node1 with the same SQL Server service account credentials. If I run Get-ClusterResource on both nodes, everything looks fine (Online).
This is the first time I have run into this issue in my cluster.
September 13th, 2015 4:46am
Sorry, but something doesn't add up here... The cluster log you provided clearly shows that the service user account is having trouble authenticating. This has to have a corresponding Audit Failure event in the domain controller's security log, unless of course the node couldn't connect to the domain controller in the first place.
I don't doubt that Node1 works fine; the cluster log reflects that clearly. I still doubt that the password in the service configuration on Node2 is correct, though. Sorry for bringing this up again and again, but apart from a really weird bug in either SQL Server setup or Active Directory, that's the only logical explanation for the messages I see in the log...
September 13th, 2015 4:52am
Hi, thank you again for your reply.
You are correct: I double-checked the Security log and found three logon failures for the SQL Server user:
An account failed to log on.
Subject:
Security ID: SYSTEM
Account Name: SQ2$
Account Domain: XCHANGE
Logon ID: 0x3E7
Logon Type: 5
I changed the password of the logon account used by the SQL Server service, then tried the failover again, but unfortunately it did not work.
Any thoughts? What should I check next?
September 13th, 2015 8:51am
One thing to check quickly: does the account you run SQL Server under have the "Log on as a service" privilege? (You can see this in "Local Security Policy" -> Local Policies -> User Rights Assignment -> Log on as a service.)
One more thing: the failure message above shows a logon failure for the SQ2$ account, which is the machine account that is normally used only when the logon account is "Network Service". So just to double-check: you are not running SQL Server under "Network Service", right?
September 13th, 2015 9:01am
Yes. I can clearly see that the account I run SQL Server under has the "Log on as a service" privilege.
This is the full message I received:
An account failed to log on.
Subject:
Security ID: SYSTEM
Account Name: -SQ2$
Account Domain: XCHANGE
Logon ID: 0x3E7
Logon Type: 5
Account For Which Logon Failed:
Security ID: NULL SID
Account Name: sql_user
Account Domain: XCHANGE
Failure Information:
Failure Reason: Unknown user name or bad password.
Status: 0xC000006D
Sub Status: 0xC000006A
Process Information:
Caller Process ID: 0x258
Caller Process Name: C:\Windows\System32\services.exe
Network Information:
Workstation Name: -SQ2
Source Network Address: -
Source Port: -
Detailed Authentication Information:
Logon Process: Advapi
Authentication Package: Negotiate
Transited Services: -
Package Name (NTLM only): -
Key Length: 0
This event is generated when a logon request fails. It is generated on the computer where access was attempted.
The Subject fields indicate the account on the local system which requested the logon. This is most commonly a service such as the Server service, or a local process such as Winlogon.exe or Services.exe.
The Logon Type field indicates the kind of logon that was requested. The most common types are 2 (interactive) and 3 (network).
The Process Information fields indicate which account and process on the system requested the logon.
The Network Information fields indicate where a remote logon request originated. Workstation name is not always available and may be left blank in some cases.
The authentication information fields provide detailed information about this specific logon request.
- Transited services indicate which intermediate services have participated in this logon request.
- Package name indicates which sub-protocol was used among the NTLM protocols.
- Key length indicates the length of the generated session key. This will be 0 if no session key was requested.
September 13th, 2015 9:14am
Unfortunately the full message confirms what I said so far:
Failure Reason: Unknown user name or bad password.
So, for whatever reason, the password in the Service Control Manager is still wrong...
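To read the full event field by field: the Subject block (SQ2$ running as SYSTEM, Logon Type 5 = Service) only identifies who asked for the logon, namely services.exe trying to start a service. The account that was actually rejected is the one under "Account For Which Logon Failed" (sql_user), and Sub Status 0xC000006A is STATUS_WRONG_PASSWORD, meaning the user name was found but the password did not match. A minimal Python sketch, assuming you paste the plain-text body of the 4625 event, pulls out exactly these fields. The NTSTATUS and logon-type tables are a small illustrative subset, and `summarize_4625` is a hypothetical helper rather than any Windows API.

```python
from typing import Dict

# Illustrative subset of NTSTATUS codes seen in the Status / Sub Status
# fields of "An account failed to log on" (Security event 4625).
NTSTATUS_LOGON = {
    0xC000006D: "STATUS_LOGON_FAILURE (unknown user name or bad password)",
    0xC000006A: "STATUS_WRONG_PASSWORD (user name is valid, password is wrong)",
    0xC0000064: "STATUS_NO_SUCH_USER (user name does not exist)",
    0xC0000072: "STATUS_ACCOUNT_DISABLED",
    0xC0000234: "STATUS_ACCOUNT_LOCKED_OUT",
    0xC000015B: "STATUS_LOGON_TYPE_NOT_GRANTED (logon right missing)",
}

LOGON_TYPES = {2: "Interactive", 3: "Network", 4: "Batch", 5: "Service", 10: "RemoteInteractive"}


def summarize_4625(text: str) -> Dict[str, str]:
    """Extract the requesting account, failed account, logon type and status codes."""
    summary: Dict[str, str] = {}
    section = ""
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # free-text line without "Field: value"
        key, value = key.strip(), value.strip()
        if not value:
            section = key  # section header, e.g. "Subject" or "Account For Which Logon Failed"
        elif key == "Account Name" and section == "Subject":
            summary["requested_by"] = value  # account that requested the logon
        elif key == "Account Name" and section == "Account For Which Logon Failed":
            summary["failed_account"] = value  # account whose credentials were rejected
        elif key == "Logon Type":
            summary["logon_type"] = LOGON_TYPES.get(int(value), f"type {value}")
        elif key in ("Status", "Sub Status"):
            code = int(value, 16)  # int() accepts the "0x" prefix when base=16
            summary[key.lower().replace(" ", "_")] = NTSTATUS_LOGON.get(code, hex(code))
    return summary


# Fields copied from the event posted in this thread.
EVENT = """An account failed to log on.
Subject:
Security ID: SYSTEM
Account Name: -SQ2$
Account Domain: XCHANGE
Logon ID: 0x3E7
Logon Type: 5
Account For Which Logon Failed:
Security ID: NULL SID
Account Name: sql_user
Account Domain: XCHANGE
Failure Information:
Failure Reason: Unknown user name or bad password.
Status: 0xC000006D
Sub Status: 0xC000006A
"""

for field, meaning in summarize_4625(EVENT).items():
    print(f"{field}: {meaning}")
```

The Status field alone (0xC000006D) is the generic "unknown user name or bad password"; the Sub Status is what narrows it down to a wrong password for an existing account.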
September 13th, 2015 9:37am
I was able to log on to SQ2 using the same user and password.
Should I also be able to log on to SQL Server from Visual Studio?
September 13th, 2015 9:42am
It still seems that the password configured in the service is wrong... Have you tried retyping the password directly in the Service Control Manager (Services.msc)?
September 13th, 2015 9:46am
I did, but the failover still doesn't work.
And I am still getting logon failures for the same user and password.
Note that logging on to Windows on SQ2 with this account works fine.
September 13th, 2015 10:16am
It means that you still have a wrong password cached somewhere... Have you tried changing the SQL Server logon account in Services.msc to Local System, committing that change, and then changing it back to your service account?
We are running out of sane options, unfortunately... It all points to a wrong password in the SCM database...
My next option would be to replace sqlservr.exe with a dummy program and see if that one launches... But that is pretty crazy already...
September 13th, 2015 11:38am
This is really driving me crazy!
I tried changing the SQL Server logon account in Services.msc to Local System and committed the change, then tried to fail over to Node2, but in vain! Even after I changed the logon account, it still reports a wrong password for the same service account.
How come? Where is it stored or cached?
I looked for any other service using this user account but did not find any. Only the SQL Server service uses that account.
Also, when I change it back to the original service account, the problem remains the same.
September 13th, 2015 6:39pm
I would suggest resetting the password on the domain controller and then updating it on both nodes.
September 13th, 2015 9:32pm
I have to agree with Balmukund here... And if that doesn't work either, I would suggest we get in direct contact and you let me have a look at your system... We are missing something here...
September 13th, 2015 11:51pm
I can't do that because it is a production environment, and I can't afford much downtime.
I am wondering whether any other service is using the same service account.
September 14th, 2015 2:22am
Well, the cluster log you showed us before clearly stated that it was the SQL Server service having that problem... So I highly doubt that. (Additionally: every other service would start after SQL Server, so if it were e.g. the Agent, you should already have a SQL Server error log on Node2...) But you can of course do another failover attempt and send me the cluster log afterwards, so I can confirm that...
September 14th, 2015 2:25am
I think you need live help because we are not moving forward. Maybe we are not explaining it correctly. Which part of the world are you located in? Ping me via Facebook/Twitter (details in my MSDN profile) and let's close this out.
September 14th, 2015 2:48am
Finally, the problem has been resolved. I re-entered the password in SQL Server Configuration Manager.
Before, I had been changing it from Windows Services (Services.msc).
Now everything looks good.
Thank you all for your help, it's really appreciated.
September 14th, 2015 9:33am
Finally, the problem has been resolved. I re-entered the password from SQL Server Configuration Manager
What a relief! PrinceLucifer would also agree :)
September 14th, 2015 11:03am
For sure. Happy that you have it under control now Tarek.
September 14th, 2015 11:05am
Yeah, feeling relaxed.
PrinceLucifer, Balmukund, thank you guys.
I have one more question, please: is there any way to find out, in SQL Server or in the Security event log, who changed the service password in SQL Server Configuration Manager?
September 15th, 2015 3:01am
I don't think that this is logged anywhere, no. Sorry!
September 15th, 2015 3:03am
I don't think that this is logged anywhere, no. Sorry!
Agree.
September 15th, 2015 3:04am