While working with a colleague this past week, I ran into a weird scenario with this error that you might find interesting. We had a two-server active/passive cluster, and whenever it was on node b, the SQL Cluster Resource would fail to come online. We could bring the resource online outside of the cluster, but connections to it would still fail with this error, which we found in the SQL Server error log:
SSPI handshake failed with error code 0x8009030c, state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The Windows error code indicates the cause of failure. The logon attempt failed
Login failed. The login is from an untrusted domain and cannot be used with Windows authentication.
Now, AcceptSecurityContext is the Windows API that the server side calls to authenticate an incoming connection, and here it is returning an error code indicating that the logon was denied. Furthermore, the second message says that the login is from an untrusted domain and cannot be used with Windows Authentication. We knew that the login was not from another domain, and we were able to connect using the same user on node a. The service accounts were also the same on both servers.
We ran through the following blog and added the user and the service account explicitly to the “Access this computer from the network” permissions, and made sure that neither account was listed in the corresponding deny permissions. Still, we saw the same failure whenever the resource was failed over to node b.
We thought that it might have to do with NTLM connections failing, so we added SPNs to make the connection go through Kerberos instead. When I took an SSPI client trace afterward, though, I saw that the connection was still going through NTLM. But why?
Then something stood out to me. The FQDN in the SPN that was requested was misspelled. So what could lead to this? Either their DNS resolution was incorrect, or something specific to node b was causing the name to resolve incorrectly. We knew that DNS was working, since we didn't have the same issue on node a, so the likely culprit was a hosts file entry.
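The comparison that exposed the problem, checking the host portion of the requested SPN against the name you expect, can be sketched in a few lines of Python. This is only an illustration: the SPN strings and the name sqlvn.contoso.com are hypothetical placeholders, not values from the actual trace, and the helper names are my own.

```python
def parse_spn(spn):
    """Split an SPN of the form service/host:port_or_instance into its parts."""
    service, _, rest = spn.partition("/")
    host, _, port_or_instance = rest.partition(":")
    return service, host.lower(), port_or_instance


def spn_host_matches(requested_spn, expected_fqdn):
    """Return True if the host in the requested SPN equals the expected FQDN.

    The comparison is case-insensitive and ignores a trailing dot on the
    expected name.
    """
    _, host, _ = parse_spn(requested_spn)
    return host == expected_fqdn.lower().rstrip(".")


# Hypothetical values for illustration only.
expected = "sqlvn.contoso.com"
good_spn = "MSSQLSvc/sqlvn.contoso.com:1433"
bad_spn = "MSSQLSvc/sqlvnn.contoso.com:1433"  # note the extra "n"

print(spn_host_matches(good_spn, expected))
print(spn_host_matches(bad_spn, expected))
```

If the host in the requested SPN does not match a registered SPN, Kerberos cannot find a matching service, which is consistent with the connection falling back to NTLM as we saw in the trace.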
Sure enough, someone had edited the hosts file on that server and misspelled the FQDN for the SQL Cluster Virtual Name. Once that entry was deleted and the DNS cache was flushed, their SQL Server Resource came online.
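If you want to check for this kind of typo without reading the file by eye, here is a minimal sketch that scans hosts-file text for names that are close to, but not exactly, the name you expect. The sample entries, the expected FQDN, the function names, and the 0.85 similarity threshold are all assumptions chosen for illustration.

```python
import difflib


def parse_hosts(text):
    """Yield (ip, hostname) pairs from hosts-file text, skipping comments and blanks."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        parts = line.split()
        ip, names = parts[0], parts[1:]
        for name in names:
            yield ip, name.lower()


def suspicious_entries(text, expected_fqdn, threshold=0.85):
    """Return entries whose name resembles expected_fqdn without matching it exactly."""
    expected = expected_fqdn.lower()
    hits = []
    for ip, name in parse_hosts(text):
        if name == expected:
            continue  # exact matches are not typos
        ratio = difflib.SequenceMatcher(None, name, expected).ratio()
        if ratio >= threshold:
            hits.append((ip, name))
    return hits


# Hypothetical hosts-file content for illustration only.
sample = """
# sample hosts file
127.0.0.1   localhost
10.0.0.5    sqlvnn.contoso.com   # misspelled virtual name
"""

print(suspicious_entries(sample, "sqlvn.contoso.com"))
```

On Windows the file lives at C:\Windows\System32\drivers\etc\hosts, and after removing a bad entry you can clear the resolver cache with ipconfig /flushdns. Note that this sketch only catches near-miss spellings; a correctly spelled name mapped to the wrong IP address would need a separate check.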