How certain are you that all those background processes in your technology stack are doing what they should do? When did you last run an operational check on your automation systems and, more specifically, on their features — separately on each?
These questions resurfaced in our case after the recent incident where a software feature responsible for collecting and organising metadata went rogue and sent corrupt data to search engines.
Jason, our CTO, was the first to pick up the anomaly thanks to a weekly performance report (yet another automated process). Everything was in red. The graph showed a steep and sudden drop across all the metrics.
As some of you know, hardly anything has the power to wake you up like a message saying, “What the hell is going on with our traffic?”
Unfortunately, by that time, our losses were substantial.
However, at CTO Academy, we use incidents and mistakes as learning tools, not as triggers for the blame game. So we jumped straight into solving the issue and making sure it doesn’t happen again.
The first thing we did to diagnose the problem was to run a VPN check on all high-priority, high-volume pages that, until then, ranked #1 to #3 on all major SERPs. All of a sudden, they dropped beyond Page 10. We are talking about more than fifty pages that generated the majority of organic traffic.
Was it a result of the most recent Google Core Update? Did we get penalised?
Given our content creation practices, that couldn’t be the case, and if it somehow was, a Google update wouldn’t affect our rankings on other search engines, would it? So we quickly eliminated the update as a possible cause.
But as we were digging deeper into the search results and finally found a few of our pages, we noticed that the URL in the snippet was incorrect.
Jason immediately checked the database and soon enough identified the culprit. It was a single feature of a much larger system that had worked like a Swiss watch for three years. Little did we know that, in rare instances, tightening security settings such as the WAF can cause it to malfunction.
(By the way, it just goes to show how involved a CTO must be in daily operations. When Jason says, “…donning several hats”, he really means that.)
Ultimately, this was a multi-layered problem:
The root cause was the metadata being changed, but the slow rollout and lack of visibility meant that the problem was not identified in time.
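To illustrate the visibility gap, here is a minimal sketch of a daily check that would flag a sudden drop sooner than a weekly report. It assumes daily organic session counts are available as a plain list of numbers. The function name `traffic_drop_alert`, the 7-day window and the 40% threshold are hypothetical choices for illustration, not a description of the reporting setup mentioned above.

```python
from statistics import mean


def traffic_drop_alert(daily_sessions, window=7, threshold=0.4):
    """Return True if the latest day falls more than `threshold`
    (as a fraction) below the average of the preceding `window` days.

    All names and default values here are illustrative assumptions.
    """
    if len(daily_sessions) < window + 1:
        return False  # not enough history to form a baseline

    # Baseline: the `window` days immediately before the latest day.
    baseline = mean(daily_sessions[-(window + 1):-1])
    if baseline == 0:
        return False  # nothing to compare against

    drop = (baseline - daily_sessions[-1]) / baseline
    return drop > threshold
```

A check like this could run once a day against whatever analytics export is available, and send a message to the team whenever it returns `True`.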
At CTO Academy, security is the top priority. In other words, we don’t compromise to get “cleaner”, “faster” and “easily accessible” data. To give you an example, our marketing team has to manually attribute each hit through hardcore detective work because firewalls and other top-tier rules block them from seeing a visitor’s IP. Instead, they get the node’s IP. You can imagine what it takes to identify and backtrace a lead, especially when you are dealing with an audience that switches between several devices and several physical environments a few times a day.
So it wasn’t even a hard decision or a topic for discussion – the feature goes off, period. Purge, test, resend the sitemap, hope for the best, update safety protocols and start working on Plan B just in case. Only, in our case, we had to switch to an alternative automation software altogether because we couldn’t permanently turn off the feature; it kept popping back and continued sending corrupt data.
The first thing we did in the aftermath was reevaluate our protocols. Something in those policies didn’t work as it should. We ended up adding prevention measures to our automation protocols, specific to this type of incident. Here is the new addition to our subset of automation rules:
Plugin Configuration:
Validation Rules:
This subset is part of our global automation safety checklist:
Let’s break this down a bit to show you what each item means.
The bottom line is that we shouldn’t a) automate just about anything for the sake of speeding up processes or b) rely too heavily on automation in general. It is appealing, but it doesn’t come without risks.
As you can see from our example, something as simple as organising metadata into a single table, so that it can be served to search engine algorithms faster and speed up page load time, can cause real reputational and financial damage before you are even aware that an incident is under way.
Granted, not even the best curated and executed security protocols could’ve prevented this but that doesn’t mean we should steer away from manual work even when everything screams that we should automate. At the very least, we need to establish checkup routines.
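As one possible shape for such a checkup routine, the sketch below inspects a page’s HTML and reports metadata problems. It assumes that the field worth checking is the canonical link tag, which is our guess at a reasonable proxy for the incorrect URL seen in the search snippets. The names `MetaExtractor` and `check_page_metadata` are hypothetical, and the code uses only the Python standard library.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text and the canonical link from a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def check_page_metadata(html, expected_url):
    """Return a list of problems found in the page metadata (empty if none)."""
    parser = MetaExtractor()
    parser.feed(html)

    problems = []
    if not "".join(parser.title_parts).strip():
        problems.append("missing or empty <title>")
    if parser.canonical is None:
        problems.append("missing canonical link")
    elif parser.canonical.rstrip("/") != expected_url.rstrip("/"):
        # Trailing slashes are ignored; any other difference is reported.
        problems.append(f"canonical mismatch: {parser.canonical}")
    return problems
```

Run against the rendered HTML of the highest-traffic pages on a schedule, a non-empty result is a signal for a human to take a look before the search engines do.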
Copyright © 2024 - CTO Academy Ltd