What's Burning My Shard Databases at 100%?

This morning I was alerted to both shard databases burning at 100%, which should never happen in a right-sized configuration. After some checking around, the cause turned out to be bots indexing this site. That's great, we like bots! But Sitecore wasn't identifying these visits as bot traffic, so it processed them as real users and collected them into the xDB. Here's a quick guide to keeping at bay the bots that slip past detection.


Identifying the Issue

Checking the collection chain of analytics data is part of routine maintenance, so I ran this all-too-familiar query in both shards. It returns the most active contacts in each shard, ranked by their number of interactions (visits):

SELECT TOP (100) ContactId, COUNT(ContactId) AS InteractionCount
FROM [xdb_collection].[Interactions]
GROUP BY ContactId
ORDER BY InteractionCount DESC

In the results, the first contact had a far higher count than any other.

I've covered this in the past, but anything that comes back with a high number like the first record is going to be an issue. I can use the Analytics Database Manager to drop these if I need to, but first I want to find out why they're there, and I can get that information from the Interactions table.

I'm going to check the User Agent for this Contact:

SELECT DISTINCT UserAgent
FROM [xdb_collection].[Interactions]
WHERE ContactId = 'DAF93EB4-DB66-0000-0000-06111A5A0899'

The query returns the following User Agent, which isn't part of Sitecore's out-of-the-box excludedUserAgents value:

Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
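
The two lookups above can also be folded into a single pass. This is only a sketch, assuming nothing beyond the ContactId and UserAgent columns already used in this post; the InteractionCount alias is my own naming.

```sql
-- Top contact / user agent pairs by interaction count, so the noisy
-- contact and the agent behind it appear in one result set.
SELECT TOP (100)
    ContactId,
    UserAgent,
    COUNT(*) AS InteractionCount   -- illustrative alias, not a real column
FROM [xdb_collection].[Interactions]
GROUP BY ContactId, UserAgent
ORDER BY InteractionCount DESC;
```

Keep in mind that a contact seen with more than one User Agent is split across several rows here, so the per-contact totals from the first query are still the better ranking.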


Preventing Future Issues

You can patch the Sitecore.Analytics.ExcludeRobots.config file with the additional User Agent found above, then run the database cleanup using the Analytics Database Manager. Once the offending data is gone, the expensive shard operations should drop straight away.

<configuration>
  <sitecore>
    <analyticsExcludeRobots>
      <excludedUserAgents>
        Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)       
      </excludedUserAgents>
    </analyticsExcludeRobots>
  </sitecore>
</configuration>

Before closing this off, let's see what else slipped through this time around. Check the Interactions table for further bots that just haven't caused a big stink yet, and add them to your patched config:

SELECT DISTINCT UserAgent
FROM [xdb_collection].[Interactions]
WHERE UserAgent LIKE '%bot%'
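
To decide which of those matches deserve attention first, a variation can rank them by volume. Treat it as a rough sketch: the extra '%crawl%' and '%spider%' patterns are my own guesses at common crawler naming, and the two aliases are illustrative. Whether LIKE matches case-insensitively depends on the database collation.

```sql
-- Rank suspected bot user agents by how much data they have generated.
SELECT
    UserAgent,
    COUNT(*) AS InteractionCount,              -- total interactions recorded
    COUNT(DISTINCT ContactId) AS ContactCount  -- contacts created for this agent
FROM [xdb_collection].[Interactions]
WHERE UserAgent LIKE '%bot%'
   OR UserAgent LIKE '%crawl%'    -- assumed pattern, adjust to taste
   OR UserAgent LIKE '%spider%'   -- assumed pattern, adjust to taste
GROUP BY UserAgent
ORDER BY InteractionCount DESC;
```

Substring patterns like these can also catch legitimate agents that happen to contain the same letters, so review each row before adding it to the exclusion list.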

The '%bot%' search turned up a couple more, so I added them to the patch in my next release and sent the shards off on their merry way to collect real user data.