Failed to failover SQL instance from one node to another

Hi,

I am experiencing an issue with SQL Server 2012 clustering. Briefly, when I move the SQL instance from node SQ1 to node SQ2, the process fails.

At the beginning of the process, the cluster tries to change the owner to SQ2, but it automatically fails back to SQ1 and the following errors are logged in the cluster event viewer:

Event ID:1205

The Cluster service failed to bring clustered role 'SQL Server (MSSQLSERVER)' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

Event ID:1069

Cluster resource 'SQL Server' of type 'SQL Server' in clustered role 'SQL Server (MSSQLSERVER)' failed.
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Any idea that could help resolve this issue would be appreciated.

Thanks,

TF



  • Edited by TarekF Saturday, September 12, 2015 10:45 AM correction
September 12th, 2015 10:16am

Here is what I see in the cluster log on Node 2:

00000c84.0000101c::2015/09/12-09:18:30.982 ERR   [RES] SQL Server <SQL Server>: [sqsrvres] Failed to start service with error 1069. Please try again

Error 1069 is normally a logon failure... Is it possible that the password of the service account was changed after installation? (Or maybe it was mistyped during setup?)
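
For anyone scanning a long cluster log for lines like the one above, here is a minimal Python sketch that pulls out the level and the error code. The field layout (pid.tid::timestamp LEVEL message) is an assumption based on the single line quoted here, and the function and table names are made up for illustration. Win32 error 1069 is ERROR_SERVICE_LOGON_FAILED.

```python
import re

# Assumed layout, inferred from the one line quoted in this thread:
#   <pid>.<tid>::<timestamp> <LEVEL> <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-f]+)\.(?P<tid>[0-9a-f]+)::"
    r"(?P<timestamp>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)$"
)
ERROR_RE = re.compile(r"with error (\d+)")

# Only the one Win32 code discussed in this thread; extend as needed.
WIN32_ERRORS = {1069: "ERROR_SERVICE_LOGON_FAILED"}


def parse_cluster_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    error = ERROR_RE.search(fields["message"])
    fields["error_code"] = int(error.group(1)) if error else None
    fields["error_name"] = WIN32_ERRORS.get(fields["error_code"])
    return fields


sample = (
    "00000c84.0000101c::2015/09/12-09:18:30.982 ERR   [RES] SQL Server "
    "<SQL Server>: [sqsrvres] Failed to start service with error 1069. "
    "Please try again"
)
parsed = parse_cluster_log_line(sample)
print(parsed["level"], parsed["error_code"], parsed["error_name"])
```

Filtering a whole log for lines where the level is ERR and the error code is set should narrow things down to the resource that actually failed to start.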

  • Marked as answer by TarekF 17 hours 32 minutes ago
September 12th, 2015 2:07pm

No, that is not possible. The password was set to never expire, and the same user is used on Node1.

Also, the SQL service is running on Node2.

  • Edited by TarekF Saturday, September 12, 2015 2:21 PM
September 12th, 2015 2:20pm

I tried restarting the SQL service on node 2, and it works fine with the password already set.

If it had been entered incorrectly, I would not be able to restart the SQL/Agent services. Am I right?

I can confirm we haven't changed the SQL service password.



  • Edited by TarekF Saturday, September 12, 2015 3:30 PM
September 12th, 2015 3:27pm

No, it is not working.

I tried to fail over to node2; it fails back automatically to node1 and generates event IDs 1205 and 1069.

Any suggestions? This is really weird. It was working fine before.


  • Edited by TarekF Saturday, September 12, 2015 3:33 PM
September 12th, 2015 3:33pm

In the System event log I can see a lot of error and warning entries for:

Event ID:10028 (DistributedCOM): DCOM was unable to communicate with the computer LEOMSSQL.leoxchange.local using any of the configured protocols; requested by PID b1c (C:\Windows\system32\ServerManager.exe).

Event ID:1014: Name resolution for the name 10.in-addr.arpa timed out after none of the configured DNS servers responded.

In the Security event log I can't see any related errors.

The SQL instance is working fine on node 1 with the same SQL service user's credentials. If I run Get-ClusterResource on the two nodes, everything looks fine (online).

I am facing this issue for the first time in my cluster.

September 13th, 2015 4:46am

Sorry, but something doesn't add up here... The cluster log you provided clearly shows that the service user account is having trouble authenticating. This has to have a corresponding Audit Failure event in the domain controller's security log, unless of course the node couldn't connect to the domain controller in the first place.

I don't doubt that Node1 works fine, the cluster log reflects that clearly. I still doubt that the password in the service config on Node2 is correct, though. Sorry for bringing this up time and again, but besides a really weird bug in either the SQL Server setup or Active Directory, that's the only logical explanation for the messages I see in the log...

  • Marked as answer by TarekF 17 hours 31 minutes ago
September 13th, 2015 8:51am

Hi, thank you again for your reply.

You are correct. I double-checked the security log and found 3 logon failures for the SQL Server user:

An account failed to log on.

Subject:
    Security ID:        SYSTEM
    Account Name:        SQ2$
    Account Domain:        XCHANGE
    Logon ID:        0x3E7
Logon Type:            5

I managed to change the password for the logon account used by the SQL Server service, then tried the failover again, but unfortunately it did not work.

Any thoughts? What should I check next?

September 13th, 2015 8:51am

One thing to check quickly: Does the account you run your SQL Server under have the "Log on as a service" privilege? (You can see this in "Local Security Policy" -> Local Policies -> User Rights Assignment -> Log on as a service)

One more thing: The failure message above shows a logon failure for the SQ2$ account, which is the machine's own account and is normally used only when the logon account is "Network Service". So just to double check: You are not running your SQL Server under "Network Service", right?

September 13th, 2015 9:01am

Yes, I can clearly see that the account I run SQL Server under has the "Log on as a service" privilege.

This is the full message I received:

An account failed to log on.

Subject:
    Security ID:        SYSTEM
    Account Name:        -SQ2$
    Account Domain:        XCHANGE
    Logon ID:        0x3E7

Logon Type:            5

Account For Which Logon Failed:
    Security ID:        NULL SID
    Account Name:        sql_user
    Account Domain:        XCHANGE

Failure Information:
    Failure Reason:        Unknown user name or bad password.
    Status:            0xC000006D
    Sub Status:        0xC000006A

Process Information:
    Caller Process ID:    0x258
    Caller Process Name:    C:\Windows\System32\services.exe

Network Information:
    Workstation Name:    -SQ2
    Source Network Address:    -
    Source Port:        -

Detailed Authentication Information:
    Logon Process:        Advapi  
    Authentication Package:    Negotiate
    Transited Services:    -
    Package Name (NTLM only):    -
    Key Length:        0

This event is generated when a logon request fails. It is generated on the computer where access was attempted.

The Subject fields indicate the account on the local system which requested the logon. This is most commonly a service such as the Server service, or a local process such as Winlogon.exe or Services.exe.

The Logon Type field indicates the kind of logon that was requested. The most common types are 2 (interactive) and 3 (network).

The Process Information fields indicate which account and process on the system requested the logon.

The Network Information fields indicate where a remote logon request originated. Workstation name is not always available and may be left blank in some cases.

The authentication information fields provide detailed information about this specific logon request.
    - Transited services indicate which intermediate services have participated in this logon request.
    - Package name indicates which sub-protocol was used among the NTLM protocols.
    - Key length indicates the length of the generated session key. This will be 0 if no session key was requested.
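
As a reading aid for the Status, Sub Status and Logon Type fields in the event above, here is a small Python lookup sketch. The tables hold only a handful of well-known NTSTATUS values and logon types and are not complete. The function names are made up for illustration.

```python
# A few well-known NTSTATUS values seen in logon failure events.
# This table is deliberately partial.
NTSTATUS_NAMES = {
    0xC000006D: "STATUS_LOGON_FAILURE",
    0xC000006A: "STATUS_WRONG_PASSWORD",
    0xC0000064: "STATUS_NO_SUCH_USER",
    0xC0000234: "STATUS_ACCOUNT_LOCKED_OUT",
    0xC0000072: "STATUS_ACCOUNT_DISABLED",
    0xC0000071: "STATUS_PASSWORD_EXPIRED",
}

# Common Windows logon types; type 5 is what services.exe requests.
LOGON_TYPES = {2: "Interactive", 3: "Network", 4: "Batch", 5: "Service"}


def decode_status(hex_text):
    """Translate a hex status string such as '0xC000006A' into its name."""
    code = int(hex_text, 16)
    return NTSTATUS_NAMES.get(code, f"unknown (0x{code:08X})")


def decode_logon_type(number):
    """Translate a numeric logon type into a short label."""
    return LOGON_TYPES.get(number, f"unknown ({number})")


# Values copied from the event quoted above.
print("Status:    ", decode_status("0xC000006D"))
print("Sub Status:", decode_status("0xC000006A"))
print("Logon Type:", decode_logon_type(5))
```

Read this way, the sub status says the user name was recognised but the supplied password was wrong, and the logon type says the attempt came from a service start rather than an interactive logon.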

September 13th, 2015 9:14am

Unfortunately the full message confirms what I said so far:

Failure Reason:        Unknown user name or bad password.

So, for whatever reason, the password in the Service Control Manager is still wrong...

September 13th, 2015 9:37am

I was able to log on to SQ2 using the same user and password.

Should I be able to log on to SQLServer visual Studio?
  • Edited by TarekF 17 hours 21 minutes ago
September 13th, 2015 9:42am

It still seems that the password configured in the service is wrong... Have you tried retyping the password directly in Service Control Manager (Services.msc)?
September 13th, 2015 9:46am

I did, but failover is still not working.

And I am still getting a logon failure for the same user and password.

Note that I tried to log in to Windows on SQ2 and it works fine.


September 13th, 2015 10:16am

It means that you still have a wrong password cached somewhere... Have you tried changing the SQL Server user in Services.msc to Local System, committing that change and then changing it back to your service account?

We are running out of sane options unfortunately... It all points to a wrong password in the SCM database...

My next option would be to replace sqlservr.exe with a dummy program and see if that one launches... But that is pretty crazy already...

September 13th, 2015 11:38am

This is really driving me crazy!

I tried changing the SQL Server user in Services.msc to Local System and committed the change. Then I tried to fail over to node2, but in vain! Even after I changed the logon account, it still reports a wrong password for the same service account.

How come? Where is it stored or cached?

I checked whether any other service is using this user account but did not find anything. Only the SQLSERVER service is using that account.

Also, when I change it back to the original service account, the problem remains the same.

September 13th, 2015 6:39pm

I would suggest resetting the password on the domain controller and then changing it on both nodes.

September 13th, 2015 9:32pm

I have to agree with Balmukund here... And if that doesn't work either I would suggest we try to get in direct contact and you let me have a look at your system... We are missing something here...
September 13th, 2015 11:51pm

I can't do that because it is a production environment, and I can't afford too much downtime.

I am wondering if there is any other service using the same service account.

September 14th, 2015 2:22am

Well, the cluster log you showed us before clearly stated that it was the SQL Server service having that problem... So I highly doubt that. (Additionally: Every other service would start after SQL Server, so if it was e.g. the Agent then you should already have a SQL Error Log on Node2...) But you can of course do another failover attempt and send me the cluster log afterwards, so I can confirm that...
September 14th, 2015 2:25am

I think you need live help because we are not moving forward. Maybe we are not explaining things clearly. Which part of the world are you located in? Ping me via Facebook/Twitter (details in my MSDN profile) and let's close it out.
September 14th, 2015 2:48am

Finally, the problem has been resolved. I re-entered the password from SQL Server Configuration Manager.
Before, I was changing it from Windows Services.

Now everything looks good.

Thank you all for your help, really appreciated.


  • Edited by TarekF 17 hours 30 minutes ago cc
September 14th, 2015 9:33am

Finally, the problem has been resolved. I re-entered the password from SQL Server Configuration Manager


What a relief! PrinceLucifer would also agree :)

September 14th, 2015 11:03am

For sure. Happy that you have it under control now Tarek.
September 14th, 2015 11:05am

Yeah, feeling relaxed.

PrinceLucifer, Balmukund, thank you guys.

I have a question, please: is there any way to find out, in SQL Server or in the Security event log, who changed the password of the service in SQL Server Configuration Manager?

September 15th, 2015 3:01am

I don't think that this is logged anywhere, no. Sorry!
September 15th, 2015 3:03am

I don't think that this is logged anywhere, no. Sorry!

Agree.
September 15th, 2015 3:04am

This topic is archived. No further replies will be accepted.
