When Automation Backfires: Guide to Safe Practices

Igor K
October 18, 2024

How certain are you that all those background processes in your technology stack are doing what they should do? When did you last run an operational check on your automation systems and, more specifically, on their features — separately on each?

These questions resurfaced in our case after the recent incident where a software feature responsible for collecting and organising metadata went rogue and sent corrupt data to search engines. 

Jason, our CTO, was the first to pick up the anomaly thanks to a weekly performance report (yet another automated process). Everything was in red. The graph showed a steep and sudden drop across all the metrics.  

As some of you know, hardly anything has the power to wake you up like a message saying, “What the hell is going on with our traffic?”

Unfortunately, by that time, our losses were substantial. 

However, at CTO Academy, we use incidents and mistakes as learning tools, not as a trigger for the blame game. So we jumped right into solving the issue and making sure it doesn’t happen again.

Diagnosing the Problem

The first thing we did to diagnose the problem was to run a VPN check on all high-priority, high-volume pages that, until then, ranked #1 to #3 on all major SERPs. All of a sudden, they dropped beyond Page 10. We are talking about more than fifty pages that generated the majority of organic traffic.

Was it a result of the most recent Google Core Update? Did we get penalised? 

Given our content creation practices, that couldn’t be the case, and even if it somehow was, it wouldn’t affect rankings on other search engines, would it? So we quickly eliminated the update as a possible cause.

But as we were digging deeper into the search results and finally found a few of our pages, we noticed that the URL in the snippet was incorrect. 

Jason immediately checked the database and soon enough identified the culprit. It was a single feature of a much larger system that had worked like a Swiss watch for three years. Little did we know that, in rare instances, tuning up security settings like WAP can cause it to malfunction.

(By the way, it just goes to show how involved a CTO must be in daily operations. When Jason says, “…donning several hats”, he really means that.)

Ultimately, this was a multi-layered problem:

  1. Metadata was incorrectly changed by an automation plugin.
  2. The caching engine rolled out the error over four weeks as it slowly refreshed its cache.
  3. Security protocols had been tightened, which meant some of our monitoring tools were blocked because they had not been whitelisted.
  4. Google updated their search algorithm.

The root cause was the metadata being changed, but the slow rollout and lack of visibility meant that the problem was not identified on time.

Decision-Making Process

At CTO Academy, security is the top priority. In other words, we don’t compromise to get “cleaner”, “faster” and “easily accessible” data. To give you an example, our marketing team has to manually attribute each hit through hardcore detective work because firewalls and other top-tier rules block them from seeing a visitor’s IP. Instead, they get the node’s IP. You can imagine what it takes to identify and backtrace a lead, especially when you are dealing with an audience that switches between several devices and several physical environments a few times a day.

So it wasn’t even a hard decision or a topic for discussion – the feature goes off, period. Purge, test, resend the sitemap, hope for the best, update safety protocols and start working on Plan B just in case. Only, in our case, we had to switch to an alternative automation software altogether because we couldn’t permanently turn off the feature; it kept popping back and continued sending corrupt data. 

Automation Safety Checklist

The first thing we did in the aftermath was reevaluate our protocols. Something in those policies didn’t work as it should. We ended up adding prevention measures to our automation protocols, specific to this type of incident. Here is the new addition to our subset of automation rules:

Plugin Configuration:

  • Pay close attention to the plugin’s settings, especially those related to automation and background processes.
  • When possible, configure the plugin to suggest changes for review instead of directly modifying data.
  • When such a configuration isn’t possible and there is no viable alternative, run a manual check immediately after publishing new content and/or editing metadata.
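
To illustrate the "suggest changes for review" idea, here is a minimal Python sketch. It is not how any particular plugin works; the `ProposedChange` and `ReviewQueue` names and the shape of the `metadata` dictionary are assumptions made up for this example. The point is only that the automation records what it wants to change, and nothing touches live data until a person approves it.

```python
from dataclasses import dataclass, field


@dataclass
class ProposedChange:
    """One metadata edit the automation wants to make (hypothetical structure)."""
    page_id: str
    field_name: str
    old_value: str
    new_value: str


@dataclass
class ReviewQueue:
    """Holds proposed edits until a person approves or rejects them."""
    pending: list = field(default_factory=list)

    def propose(self, metadata: dict, page_id: str, field_name: str, new_value: str) -> None:
        # Record the suggestion only; the live metadata is NOT modified here.
        old_value = metadata[page_id].get(field_name, "")
        if old_value != new_value:
            self.pending.append(ProposedChange(page_id, field_name, old_value, new_value))

    def approve_all(self, metadata: dict) -> int:
        # Apply every pending change after explicit human sign-off.
        applied = len(self.pending)
        for change in self.pending:
            metadata[change.page_id][change.field_name] = change.new_value
        self.pending.clear()
        return applied

    def reject_all(self) -> int:
        # Discard every pending change without touching the metadata.
        rejected = len(self.pending)
        self.pending.clear()
        return rejected


# Example data (assumed): one page with a correct canonical URL.
site_metadata = {"pricing": {"canonical": "https://www.example.com/pricing"}}
queue = ReviewQueue()
queue.propose(site_metadata, "pricing", "canonical", "http://staging.example.com/pricing")
for change in queue.pending:
    print(f"{change.page_id}.{change.field_name}: {change.old_value} -> {change.new_value}")
```

A reviewer who sees the staging URL in that printout would call `queue.reject_all()`, and the live record stays as it was.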

Validation Rules:

  • Validate that the plugin generates correct data. For example, check if the canonical URLs:
    • Start with HTTPS
    • Match our domain
    • Don’t contain any invalid characters
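
The three canonical URL checks above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production code: `EXPECTED_HOST` is a placeholder domain, the function name is invented for this example, and "invalid characters" is interpreted here as anything outside the character set that RFC 3986 allows in a URL.

```python
import re
from urllib.parse import urlsplit

EXPECTED_HOST = "www.example.com"  # assumption: replace with your own domain

# Characters RFC 3986 permits in a URL (unreserved, reserved, and "%").
_ALLOWED_CHARS = re.compile(r"^[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+$")


def canonical_url_problems(url: str, expected_host: str = EXPECTED_HOST) -> list:
    """Return a list of human-readable problems; an empty list means the URL passed."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError for some malformed hosts (e.g. broken brackets).
        return ["cannot be parsed as a URL"]

    problems = []
    if parts.scheme != "https":
        problems.append("does not start with https")
    if (parts.hostname or "") != expected_host:
        problems.append("host does not match our domain")
    if not _ALLOWED_CHARS.match(url):
        problems.append("contains invalid characters")
    return problems


if __name__ == "__main__":
    for candidate in (
        "https://www.example.com/blog/post",
        "http://www.example.com/blog/post",
        "https://www.example.com/blog post",
    ):
        print(candidate, "->", canonical_url_problems(candidate) or "ok")
```

Returning a list of problems rather than a single true/false makes the result usable in a report: one URL can fail several checks at once, and whoever does the manual review sees all of them.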

This subset is part of our global automation safety checklist:

Automation Safety Checklist

Let’s break this down a bit to show you what each item means. 

Before Automation

  • Define Clear Objectives:
    • What exactly do you want to achieve with automation? (eg, improve site speed by 15% by optimising image metadata, improve member engagement by 10%)
    • What are the Key Performance Indicators (KPIs) to measure success? (eg, page load time, bounce rate, search ranking, dwell time, read time, response time)
  • Thorough Risk Assessment:
    • Identify potential failure points in the automation process. (eg, what if the plugin misinterprets the content, the database connection fails or the system assigns a wrong label to a lead?)
    • Estimate the potential impact of each failure. (eg, incorrect metadata could lead to lower search ranking, marketing team could waste resources on bad leads due to the incorrect labels)
    • Develop mitigation strategies for each identified risk. (eg, implement data validation checks to ensure metadata accuracy)
  • Data Backup and Recovery:
    • Ensure you have a recent backup of the website/platform and database before implementing any automation.
    • Test your backup restoration process to ensure you can quickly recover in case of failure.
  • Staging Environment:
    • Essential! Always test automation on a staging environment that mirrors the live site. This allows you to identify and fix issues without affecting the live website.
  • Gradual Rollout:
    • When implementing a major automation solution, don’t automate everything at once if you can avoid it. Start with a small subset of items or limited functionality, then gradually expand after confirming it works correctly.
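
A gradual rollout can be as simple as processing items in growing batches and stopping the moment a batch fails verification. The sketch below is one possible shape for that loop; the function name, the `apply` and `verify` callables and the batch sizes are all assumptions chosen for illustration.

```python
from typing import Callable, Iterable


def gradual_rollout(
    items: list,
    apply: Callable[[str], None],
    verify: Callable[[str], bool],
    batch_sizes: Iterable[int] = (5, 25, 100),
) -> list:
    """Run `apply` over `items` in growing batches.

    Stops after the first batch in which any item fails `verify`.
    Returns the items that were processed (including the failing batch).
    """
    sizes = list(batch_sizes)
    if not sizes or min(sizes) < 1:
        raise ValueError("batch_sizes must contain positive integers")

    processed = []
    remaining = list(items)
    step = 0
    while remaining:
        # Use the next batch size; once the list is exhausted, keep the last one.
        size = sizes[min(step, len(sizes) - 1)]
        batch, remaining = remaining[:size], remaining[size:]
        for item in batch:
            apply(item)
        processed.extend(batch)
        if not all(verify(item) for item in batch):
            break  # halt so the damage stays confined to a small subset
        step += 1
    return processed


if __name__ == "__main__":
    pages = [f"page-{i}" for i in range(10)]
    touched = gradual_rollout(pages, apply=print, verify=lambda p: True, batch_sizes=(2, 4))
    print(len(touched), "pages processed")
```

In practice, `verify` is where a check such as canonical URL validation would go, and the rollout would pause for a human decision rather than simply stopping.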

During Automation

  • Real-time Monitoring:
    • Set up monitoring tools to track the automation process in real time. Look for unusual patterns, errors or warnings. (eg, monitor the number of canonical URLs changed per hour, do the VPN check on markup data, analyse labelling)
  • Alerting System:
    • Configure alerts to receive immediate notifications of critical errors or anomalies during automation. (eg, get an email alert if page hits start dropping or the segment’s read ratio decreases)
  • Manual Spot Checks:
    • Periodically perform manual spot checks to verify the accuracy of the automated process.
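
As a rough illustration of the monitoring and alerting items above, the following Python sketch flags an hour in which the number of changed canonical URLs is far above the recent norm. The three-sigma threshold, the function names and the `notify` callback are assumptions for this example; a real setup would plug into whatever monitoring and alerting tools you already use.

```python
from statistics import mean, pstdev
from typing import Callable


def is_anomalous(history: list, current: int, sigmas: float = 3.0) -> bool:
    """Return True if `current` is more than `sigmas` standard deviations
    above the mean of `history` (e.g. canonical URLs changed per hour)."""
    if len(history) < 2:
        return False  # not enough data to judge
    baseline = mean(history)
    spread = pstdev(history)
    # A floor of 1.0 stops a perfectly flat history from flagging tiny bumps.
    return current > baseline + sigmas * max(spread, 1.0)


def check_and_alert(history: list, current: int, notify: Callable[[str], None]) -> bool:
    """Call `notify` with a short message when the current count looks abnormal."""
    if is_anomalous(history, current):
        notify(f"Unusual activity: {current} canonical URLs changed in the last hour")
        return True
    return False


if __name__ == "__main__":
    changes_per_hour = [2, 3, 1, 2, 4, 3]  # made-up sample data
    check_and_alert(changes_per_hour, 57, notify=print)
```

An hourly check like this is aimed at a slow, cache-driven rollout of bad data, the kind that is easy to miss when each individual change looks harmless.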

After Automation

  • Post-Automation Review:
    • After the automation is complete, conduct a thorough review to assess its impact on your KPIs. (eg, check Google Search Console for any crawl errors or ranking changes, check CRM system for possible discrepancies)
  • Documentation:
    • Document the entire automation process, including the objectives, configuration, potential risks and mitigation strategies to simplify maintenance and troubleshooting.
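
For the post-automation review, a simple before/after comparison of KPIs is often enough to surface a regression. The sketch below assumes you can export KPI values as plain numbers; the KPI names, the 10% tolerance and the function name are illustrative, not a prescribed method.

```python
def kpi_regressions(
    before: dict,
    after: dict,
    higher_is_better: dict,
    tolerance: float = 0.10,
) -> dict:
    """Return the relative change of every KPI that worsened by more than `tolerance`.

    `higher_is_better` maps a KPI name to True (e.g. sessions) or False
    (e.g. page load time); KPIs missing from it are treated as higher-is-better.
    """
    regressions = {}
    for name, old in before.items():
        if name not in after or old == 0:
            continue  # cannot compute a relative change
        change = (after[name] - old) / abs(old)
        worsened_by = -change if higher_is_better.get(name, True) else change
        if worsened_by > tolerance:
            regressions[name] = change
    return regressions


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    before = {"organic_sessions": 1000, "page_load_seconds": 2.0}
    after = {"organic_sessions": 600, "page_load_seconds": 1.9}
    direction = {"organic_sessions": True, "page_load_seconds": False}
    for kpi, change in kpi_regressions(before, after, direction).items():
        print(f"{kpi} changed by {change:.0%}")
```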

Conclusion

The bottom line is that we shouldn’t a) automate just about anything for the sake of speeding up processes or b) rely too heavily on automation in general. It is appealing, but it doesn’t come without risks.

As you can see from our example, something as simple as organising metadata into a single table, so it can be served to search engines faster and speed up page load time, can cause real reputational and financial damage before you are even aware of an ongoing incident.

Granted, not even the best curated and executed security protocols could’ve prevented this, but that doesn’t mean we should steer away from manual work even when everything screams that we should automate. At the very least, we need to establish checkup routines.
