Introducing the Exclude Robots by Agent Module

In my past post “Xdb SQL Injections Attempts Can't Do Any Harm, but They'll Still Fill Up Your Databases With Useless Records”, we went over how an unknown agent or script crawling your Sitecore installation can put massive demand on the xDB shard databases, easily driving them to 100% usage and causing performance issues with your site. We'll revisit that situation and look at how my new Exclude Robots by Agent module prevents it from happening.


First of All, You're Safe

During my initial investigation I had several conversations with Sitecore engineers, shared my findings, and they confirmed the requests are completely benign. What we want to address is the load on your SQL Server as the processing tasks try to make it through all this useless data. This module blocks the requests going forward and prevents the need for ongoing manual maintenance; cleaning up the data that's already there is covered at the end of this article.


The Module in Action

OK, let's get into the module and the lifecycle of a request. You can get the code and pre-compiled packages on GitHub to follow along, and I've made it as readable as possible.

First, the CheckUserAgentUsingWildcard processor runs in the pipeline immediately after CheckUserAgent. The first thing it does is validate the ExclusionValues setting, a comma-delimited string that serves as the reference list of what should be blocked. If the setting is empty, the module logs a warning once and takes no further action. It also checks whether args.IsInExcludeList is already true, which would be the case if CheckUserAgent (which ran before it) found a match.

public override void Process(ExcludeRobotsArgs args)
{
    // A missing config value has already been logged; don't check again.
    if (_missingConfigValues)
        return;

    // An earlier processor (e.g. CheckUserAgent) already flagged this request.
    if (args.IsInExcludeList)
        return;

    var valueSettingName = "SitecoreFundamentals.ExludeRobotsByAgent.ExclusionValues";
    var exclusionValuesSetting = Settings.GetSetting(valueSettingName);

    // Warn once if the setting is missing, then short-circuit future requests.
    if (string.IsNullOrWhiteSpace(exclusionValuesSetting) && !_missingConfigValues)
    {
        Log.Warn($"{LogPrefix} No config value found in {valueSettingName}", this);
        _missingConfigValues = true;
        return;
    }

Next, it checks whether the request's User Agent matches any term in the ExclusionValues setting.

// Normalize the configured terms, then compare against the lowercased User Agent.
var exclusionValues = exclusionValuesSetting.ToLower().Split(',').Select(x => x.Trim()).ToList();
var context = HttpContext.Current;
if (string.IsNullOrWhiteSpace(context?.Request?.UserAgent))
    return;
var userAgent = context.Request.UserAgent.ToLower();
var ip = GetIP(context);
var exclusionValue = exclusionValues.FirstOrDefault(x => userAgent.Contains(x));
if (exclusionValue != null)
{
    // Flag the request so Sitecore treats it as a robot, record a sample for
    // the email report, and remember the IP for the block list.
    args.IsInExcludeList = true;
    Log.Debug($"{LogPrefix} User Agent contains the value {exclusionValue.Trim()}", this);
    StoreHitForLogging(context, ip, false);
    if (!_blockedIps.Any(x => x.Equals(ip)))
        _blockedIps.Add(ip);
    return;
}

You can see that if there's a match, the StoreHitForLogging method adds a new record to the blocked User Agents list, which an email report task references later. It also caps the list at 60 records, a limit you can change with the SampleRecordsPerLogDump setting, because how many samples do you really need?

private void StoreHitForLogging(HttpContext context, string ip, bool ipBlocked)
{
    // Only keep a sample of hits per report, capped by SampleRecordsPerLogDump.
    if (_blockedUserAgents.Count() < Settings.GetIntSetting("SitecoreFundamentals.ExludeRobotsByAgent.SampleRecordsPerLogDump", 60))
    {
        var userAgent = context.Request.UserAgent;

        // Note when a request was blocked by IP rather than by its User Agent.
        if (ipBlocked)
            userAgent = $"{context.Request.UserAgent} - IP address {ip} was previously blocked due to its User Agent";

        _blockedUserAgents.Add(new Models.BlockedUserAgent()
        {
            Ip = ip,
            UserAgent = userAgent,
            Url = context.Request.Url.AbsoluteUri,
            DateTime = DateTime.Now
        });
    }

    // Start the reporting window on the first recorded hit.
    if (_hitsLoggingTimer == DateTime.MinValue)
        _hitsLoggingTimer = DateTime.Now;
}

Back in the Process method, the offending IP is also stored in a static list, which is covered in the next section.
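For reference, here's a minimal sketch of how those static fields might be declared, inferred from the names used in the snippets above; the actual module may declare them differently, and since plain lists aren't thread-safe, a concurrent collection could be worth considering:

// Assumed field declarations, inferred from the snippets in this post.
private static bool _missingConfigValues = false;

// User Agent samples collected for the next email report.
private static List<Models.BlockedUserAgent> _blockedUserAgents = new List<Models.BlockedUserAgent>();

// IPs that have already sent a request with a blocked User Agent.
private static List<string> _blockedIps = new List<string>();

// Marks when the current reporting window started.
private static DateTime _hitsLoggingTimer = DateTime.MinValue;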

Finally, an email is sent to your desired list of recipients with a sample of the blocked requests. The frequency and the format of each record in the list are configurable, and the email content is authorable in the Exclude Robots by Agent Settings item.
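The report task itself isn't shown in this post, but conceptually it just formats the collected samples into an email body on a schedule. A minimal sketch, assuming a hypothetical BuildReportBody helper and the record model from above:

// Illustrative only: BuildReportBody is hypothetical, not part of the module's API.
private static string BuildReportBody(IEnumerable<Models.BlockedUserAgent> hits)
{
    var sb = new StringBuilder();
    foreach (var hit in hits)
        sb.AppendLine($"{hit.DateTime:yyyy-MM-dd HH:mm:ss} | {hit.Ip} | {hit.UserAgent} | {hit.Url}");
    return sb.ToString();
}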


But My Site Is Behind a Web Application Firewall

True, a well-configured WAF is critical for your site, but it won't be smart enough to block everything, since User Agent values can be randomized. That's where the list of blocked IPs comes into play. It's a simple mechanism, but reliable enough for our needs. Here's how it works:

  1. An attack on your site starts with a variety of User Agents, at least one of them containing a term in your ExclusionValues setting.
  2. The IP of the offending request is added to the list of blocked IPs.
  3. Several thousand more requests come from this IP without a term in the ExclusionValues setting, but they are also blocked since the IP has been identified as suspicious.

This portion of the Process method references the blocked IPs list:

// If IP blocking is enabled, block any request from an IP that previously
// sent a User Agent matching the exclusion list.
else if (Settings.GetBoolSetting("SitecoreFundamentals.ExludeRobotsByAgent.BlockWithIp", true) && _blockedIps.Any(x => x.Equals(ip)))
{
    args.IsInExcludeList = true;
    Log.Debug($"{LogPrefix} IP address {ip} was previously blocked due to its User Agent", this);
    StoreHitForLogging(context, ip, true);
    return;
}

It's simple and it works, and of course you can disable this portion of the pipeline with the BlockWithIp Boolean setting in the same config file.
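One helper worth calling out: the snippets above use a GetIP method that isn't shown in this post. Here's a minimal sketch of what such a helper typically looks like, assuming the site sits behind a load balancer or WAF that sets the X-Forwarded-For header; the module's actual implementation may differ:

// Hypothetical sketch of GetIP; the module's actual implementation may differ.
private string GetIP(HttpContext context)
{
    // Behind a load balancer or WAF, the original client IP usually arrives in
    // X-Forwarded-For as a comma-separated chain; the first entry is the client.
    var forwardedFor = context.Request.Headers["X-Forwarded-For"];
    if (!string.IsNullOrWhiteSpace(forwardedFor))
        return forwardedFor.Split(',')[0].Trim();

    // Otherwise fall back to the address of the direct connection.
    return context.Request.UserHostAddress;
}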


Remember, It's More Than Just Cybersecurity

If you're familiar with configuring robot detection in Sitecore, you'll know you must have an exact match on User Agent values to keep them out of xDB. In my previous post I illustrated how this module can also make robot detection easier to maintain by letting you add keywords to its configuration instead. This is a huge time saver.
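To see why keywords matter, consider what happens when a bot bumps its version string. This quick example is purely illustrative; the "SomeBot" agent and the values are made up:

// Hypothetical User Agent; "SomeBot" is made up for illustration.
var userAgent = "Mozilla/5.0 (compatible; SomeBot/2.1; +http://example.com/bot)".ToLower();

// Exact matching breaks the moment the version bumps from 2.0 to 2.1.
var exactValues = new List<string> { "mozilla/5.0 (compatible; somebot/2.0; +http://example.com/bot)" };
var exactMatch = exactValues.Contains(userAgent); // false

// This module's keyword matching still catches it.
var exclusionValues = new List<string> { "somebot" };
var keywordMatch = exclusionValues.Any(x => userAgent.Contains(x)); // true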


I'm Done Installing, but What Do I Do With All This Data?

See my post “Site Failures and What Else Can Happen When You're Not Monitoring the Health of Your Sitecore Implementation”, which covers identifying the offending contacts already in your xDB and using the Analytics Database Manager to clear them. I also just posted about a potential bug with this module that can leave some orphaned jobs; see “Understanding 100% Database Usage Caused by Your Sitecore GenericProcessingPool Table”.


A Happy Ending

Once you've installed this module and cleaned up any offending data, you should see your database usage return to normal.


This module has also been submitted as a feature request and is being considered as new functionality in the OOTB robot detection process.