Fact: Every Tech-Related Disaster Is Man-Made

Yesterday at work I was finishing off an extremely productive week. After spending the previous week on a spike, everything was really starting to come together for the feature I’m building. It was a good day. But then during a team demo somebody made a comment that turned things a bit sour for me. Later on I ranted about this during sprint planning. While the end result wasn’t a big deal, I mulled it over in an attempt to determine why any of this ticked me off to begin with.

The basic tale is this: In our main load testing environment, a change was made. Somebody openly wondered what effect that change would have on our ability to accurately predict performance in our prod environment. Normally this question would be perfectly fine and wouldn’t incur any wrath at all. After all, I work in performance and we tend to get these kinds of questions a lot. However, in this case the situation isn’t quite so cut and dried.

While I can’t go into the details of the change that was made, I can tell you three things: it was part of a set of changes, it will never be made in our prod environment, and it will never be rolled back in our testing environment. Due to that combination of factors, we will ultimately be unable to determine what the specific effects of this change actually are. We can measure the combined effects of the change set as a whole, as we have data from before and after. But measuring the effects of this specific change is virtually impossible. Of course none of this should come as a shock. It all should be obvious. Yet that clearly isn’t the case. Why is that?

My initial thought was, “Because we as humans do stupid things like this all the time.” Well yeah, we do, and it is quite annoying. Instead of changing just one thing and measuring the effects empirically, we have a tendency to make a multitude of changes all at once. This approach is chaotic and clearly births more chaos down the line. Scientifically this is the wrong way to go about things, as such an approach inherently makes it harder to understand the effects of changes we make to complex systems.
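To make that concrete, here is a toy sketch in Python. Nothing in it comes from our actual system: the function `mean_latency_ms`, the two toggles, and every number are invented purely for illustration. It just shows why before and after data for a change set gives you one combined number and nothing more.

```python
# Toy model only. The function, the toggles, and all numbers are hypothetical;
# this is not the real system or the real change set.

def mean_latency_ms(change_a: bool, change_b: bool) -> float:
    """Stand-in for a load test run: mean latency for a given configuration."""
    latency = 100.0
    if change_a:
        latency -= 30.0  # assumed: change A helps on its own
    if change_b:
        latency += 20.0  # assumed: change B hurts on its own
    if change_a and change_b:
        latency += 5.0   # assumed: the two changes also interact
    return latency


baseline = mean_latency_ms(False, False)

# Scenario 1: both changes land together. One "before" run, one "after" run.
after_both = mean_latency_ms(True, True)
combined_effect = after_both - baseline
print(f"combined effect: {combined_effect:+.1f} ms")

# Scenario 2: one change at a time, each measured against the same baseline.
effect_a = mean_latency_ms(True, False) - baseline
effect_b = mean_latency_ms(False, True) - baseline
print(f"change A alone:  {effect_a:+.1f} ms")
print(f"change B alone:  {effect_b:+.1f} ms")
```

In this made-up model the combined effect is a modest 5 ms improvement, which hides a 30 ms gain and a 20 ms loss. If all you have is the first scenario, any split that adds up to the same total is equally consistent with your data.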

But after thinking about it more, I realized that the real issue here is the very existence of a “complex system” and how people tend to react to such a thing. In my general experience the urge of people to treat a system as a black box increases proportionally with the complexity of said system. In other words, as something becomes more complex, people tend to stop trying to understand it. But of course when it comes to a complex Enterprise SaaS (Software as a Service) offering, the people in performance don’t have this option. It is literally our job to try and understand the system and impart our conclusions to the technical professionals we partner with. Well, to be fair and clear, I just write the tools that help other, smarter people do that.

This results in a paradox of sorts when faced with a scenario in which the natural urge of people is to give up on understanding the system. Ultimately all of these systems accept one or more streams of input and produce one or more streams of output. But if the people tasked with maintaining the system and enhancing its functionality ignore that basic premise, that fundamentally changes their relationship to the system. Now instead of making a change and measuring the effects, they are free to make a multitude of changes while only occasionally, if ever, considering the effects.

The basic truth here is that once you give up on understanding a thing, you inherently lose control of that thing. I’ve actually been fond of saying that for years now, but it wasn’t until now that I really found an opportunity to think it through. So ultimately what does that mean? Well, just like the title of this post, it means that every single tech-related disaster is man-made. We are the ones creating these complex systems. We are the ones who are supposed to be managing them. But if we abdicate our responsibility to understand the tech and the systems we erect with said tech, we will no longer be able to control any of it.

Sadly this is the default position of many so-called tech professionals in my experience. They don’t understand the tech, nor do they understand the resulting system. Logically this means they are nothing more than passengers on a train that has no engineer. Being one of the guys called in to determine why that train crashed, yet again, tends to get really old really fast.

How do we fix this? I have no earthly idea. But if I think of something, I’ll be sure to share it with you all. You can count on that.