How a Database Connection Issue Killed My Sitecore VM

An administrator's error left the SQL login for Sitecore's collection database disabled. Normally you'd expect the dependent services to simply stop working, but in this case resource usage on the VM spiraled out of control.


What Went Wrong

Due to human error, the collectionuser login was disabled. Luckily this was a lower environment, though it still had an internal audience. The website kept up with its low demand, but I did see some degradation in performance.

OK, let's get cracking. I looked at the logs for the CD (content delivery) instance and saw an all-too-familiar entry:

Exception: Sitecore.Analytics.DataAccess.XdbUnavailableException
Message: xDB unavailable
Source: Sitecore.Analytics.XConnect
   at Sitecore.Analytics.XConnect.Extensions.XdbRuntimeContextExtensions.ExecuteWithExceptionHandling[T](IXdbRuntimeContext runtimeContext, Func`2 func)
   at Sitecore.Analytics.XConnect.Diagnostics.PerformanceCounters.OperationPerformanceMonitorExtensions.Monitor[T](OperationPerformanceMonitorBase monitor, Func`1 operation)
   at Sitecore.Analytics.XConnect.DataAccess.XConnectDeviceRepository.Load(Guid deviceId)
   at Sitecore.Analytics.Pipelines.EnsureSessionContext.EnsureDevice.LoadDevice(Guid deviceId)
Nested Exception
Exception: Sitecore.XConnect.OperationTimeoutException
Message: Operation was cancelled by timeout
Source: Sitecore.Xdb.Common.Web
   at Sitecore.Xdb.Common.Web.Synchronous.SynchronousExtensions.SuspendContextLock[TResult](Func`1 taskFactory)
   at Sitecore.XConnect.Client.XConnectSynchronousExtensions.SuspendContextLock[TResult](Func`1 taskFactory)
   at Sitecore.XConnect.Client.XConnectSynchronousExtensions.Get[TEntity](IXdbContext context, IEntityReference`1 reference, ExecutionOptions`1 executionOptions, TimeSpan timeout)
   at Sitecore.Analytics.XConnect.DataAccess.XConnectDataAdapterProvider.<>c__DisplayClass25_0.<GetDevice>b__1(IXdbContext xdbContext)
   at Sitecore.Analytics.XConnect.Extensions.XdbRuntimeContextExtensions.ExecuteWithExceptionHandling[T](IXdbRuntimeContext runtimeContext, Func`2 func)
Nested Exception
Exception: Sitecore.Xdb.Common.Web.ConnectionTimeoutException
Message: A task was canceled.
Source: Sitecore.Xdb.Common.Web
   at Sitecore.Xdb.Common.Web.CommonWebApiClient`1.<SendAsync>d__51.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Polly.CircuitBreaker.CircuitBreakerEngine.<ImplementationAsync>d__1`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Polly.Policy`1.<ExecuteAsync>d__8.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Sitecore.Xdb.Common.Web.CommonWebApiClient`1.<ExecuteAsync>d__48.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Sitecore.Xdb.Common.HttpTransientFaultHandling.RetryPolicy.<ExecuteAsync>d__11.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Sitecore.Xdb.Common.HttpTransientFaultHandling.RetryPolicyRetryer.<ExecuteAsync>d__5`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Sitecore.XConnect.Client.WebApi.CollectionBatchWebApiClient.<ExecuteBatch>d__12.MoveNext()
Nested Exception
Exception: System.Threading.Tasks.TaskCanceledException
Message: A task was canceled.
Source: mscorlib
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Sitecore.Xdb.Common.Web.CommonWebApiClient`1.<SendAsync>d__51.MoveNext()

The timeout entry was interesting, because a certificate problem is my usual suspect at times like this. The next step was to check the xConnect logs.


Unhandled Cyclical Error

By this point I was having a really hard time navigating the VM hosting this app. I opened Task Manager and saw:

  • CPU: 99%
  • RAM: 100%

Did IT downscale one of my VMs again? No, they hadn't: all 64 GB of RAM was in use. OK, time to check the logs.

The app had only been running a short time, yet the log file was already 2 GB, which is never good. It was flooded with over 8 million lines of exceptions. The hard drive was quickly approaching capacity, and a full disk would have taken down the entire OS.

Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement.ShardManagementException: Store Error: 
Login failed for user 'collectionuser'.. The error occurred while attempting to perform the underlying storage operation during 'Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement.StoreException: Error occurred while performing store operation. See the inner SqlException for details. ---> System.Data.SqlClient.SqlException: Login failed for user 'collectionuser'.
   at System.Data.SqlClient.SqlInternalConnectionTds..ctor(DbConnectionPoolIdentity identity, SqlConnectionString connectionOptions, SqlCredential credential, Object providerInfo, String newPassword, SecureString newSecurePassword, Boolean redirectedUserInstance, SqlConnectionString userConnectionOptions, SessionData reconnectSessionData, DbConnectionPool pool, String accessToken, Boolean applyTransientFaultHandling, SqlAuthenticationProviderManager sqlAuthProviderManager)
   at System.Data.SqlClient.SqlConnectionFactory.CreateConnection(DbConnectionOptions options, DbConnectionPoolKey poolKey, Object poolGroupProviderInfo, DbConnectionPool pool, DbConnection owningConnection, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionFactory.CreatePooledConnection(DbConnectionPool pool, DbConnection owningObject, DbConnectionOptions options, DbConnectionPoolKey poolKey, DbConnectionOptions userOptions)
   at System.Data.ProviderBase.DbConnectionPool.CreateObject(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at System.Data.ProviderBase.DbConnectionPool.UserCreateRequest(DbConnection owningObject, DbConnectionOptions userOptions, DbConnectionInternal oldConnection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpenInner(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.Open()
   at Microsoft.Azure.SqlDatabase.ElasticScale.ShardManagement.SqlUtils.WithSqlExceptionHandling(Action operation)
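
A 2 GB log is painful to open in an editor, so a streaming tally of exception types can help find the dominant error quickly. The following is a minimal Python sketch, not part of Sitecore or its tooling; the function names and the log path in the usage note are hypothetical, and the pattern simply assumes exception names are fully qualified .NET type names ending in "Exception".

```python
import re
from collections import Counter

# Matches fully qualified .NET type names that end in "Exception",
# e.g. System.Data.SqlClient.SqlException (assumed naming convention).
EXCEPTION_NAME = re.compile(r"(?:[A-Za-z_]\w*\.)+\w*Exception\b")


def count_exception_types(lines):
    """Tally exception type names across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        # Skip stack frames ("   at Namespace.Type.Method(...)") so only
        # the lines that announce an exception are counted.
        if line.lstrip().startswith("at "):
            continue
        counts.update(EXCEPTION_NAME.findall(line))
    return counts


def summarize_log(path, top=10):
    """Stream a log file line by line and return its most common exceptions."""
    # errors="replace" keeps the scan going if the file has odd bytes.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_exception_types(handle).most_common(top)
```

Calling summarize_log with the path to the oversized log (for example a hypothetical "xconnect-log.txt") returns the most frequent exception types without loading the whole file into memory. In a case like this one, the SqlException for the failed login should dominate the list.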


Stabilizing the Environment

The fix should be self-evident: I stopped IIS, corrected the login, and restarted all affected roles. I've also submitted a feature request for better handling of this type of exception.
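
To illustrate what better handling could look like, here is a minimal Python sketch of a circuit breaker that treats a failed login as non-transient and stops retrying right away. It is my own illustration, not Sitecore's or Polly's implementation; every class name, parameter, and default here is hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class LoginFailedError(Exception):
    """Hypothetical stand-in for a non-transient failure such as a disabled login."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        trial = self._opened_at is not None
        if trial and self._clock() - self._opened_at < self.reset_timeout:
            # Fail fast: no connection attempt and no new stack trace in the log.
            raise CircuitOpenError("dependency unavailable; not retrying yet")
        try:
            result = operation()
        except LoginFailedError:
            # Retrying cannot fix a disabled login, so open immediately.
            self._opened_at = self._clock()
            raise
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures in a row, opens the circuit.
            if trial or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each database call in breaker.call(...) means a disabled login produces one real connection attempt per cool-down period instead of one per request. That is the kind of behavior that would have kept the log, CPU, and RAM in check here.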