HTTP Codes: Platform vs Software
My name is Angus, and I work as a DevOps engineer at a large Australian company. In mid-2024 I made the decision to switch from Software Engineering to DevOps/Cloud/Platform Engineering. Since then, I've been digging into what makes a powerful platform. The experience is a contrast; rather than wrangling other developers' code, my 'colleagues' are now the open source community. Pulling apart poorly documented Terraform, or modifying software to suit specific needs, is a very different game from building from scratch. I've heard the analogy that DevOps Engineers are like Draftsmen, Architects are like Civil Engineers, and Software Engineers are the builders. We don't put the bolt into the ground, nor design with brushstrokes, but we can tell you the location of each bolt on the entire project, why it's there, and what the bolt is made of.
To start this story off, let's discuss a conversation I had with some Software Engineers recently. We were walking to lunch, and the Software Engineers were discussing returning a 5xx series error. I quickly went on the offensive, and said 5xx is for DevOps and 4xx is for Software. Is this true? Not really. Of course there are edge cases with all these things. However, to justify my concerns, let's examine the MDN HTTP response status codes documentation. We have two error families: 4xx client errors, and 5xx server errors. One can rightly argue that Software Engineers are 'the server' here. But in our modern world of cloud and high availability (HA) cluster-based compute, where is the 6xx platform error? We have 'the server', aka the application, and the 'server of the server': think Kubernetes, eventing systems, and gateways.
Given these components, debugging a classic 500 Internal Server Error can be a hair-pulling experience for a DevOps engineer. Is this the server talking, or is it a network stack problem? Traces alleviate this somewhat, though they are costly to implement and challenging to maintain end-to-end. Cloudflare recognised this issue and extended the status code family into the 52x range. If a Software Engineer can argue the case for returning a 502 Bad Gateway error, such as when an upstream returns an error, how do we tell that apart from a 502 raised by a gateway downstream of the application? When Cloudflare is the originator, it's obvious and unambiguous. Say we receive a 523: now we know our edge is functioning, but our ingress is not. The classic counter-argument would be to include a message in the response. Yet I posit that platform errors are typically unplanned. It's not always practical, or even possible, for a platform to return a meaningful error message. A 502 from an application, on the other hand, is likely either framework related or intentional. Should code reviews now gate 502s to include additional logging? Our 5xx status codes have lost their meaning! The function of a status code is to state the status. If we cannot disambiguate, then we naturally need new codes.
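To make the ambiguity concrete, here is a minimal Python sketch of how a log triage script might bucket status codes by who could have produced them. The names `CLOUDFLARE_EDGE_CODES` and `classify` are hypothetical, the descriptions are paraphrased from Cloudflare's documentation as I understand it, and the 'ambiguous' label is my own shorthand for the problem described above, not an established convention.

```python
# Cloudflare-specific codes. These are not part of the HTTP standard;
# descriptions are paraphrased and worth checking against Cloudflare's docs.
CLOUDFLARE_EDGE_CODES = {
    520: "Web server returned an unknown error",
    521: "Web server is down",
    522: "Connection timed out",
    523: "Origin is unreachable",
    524: "A timeout occurred",
    525: "SSL handshake failed",
    526: "Invalid SSL certificate",
}


def classify(status: int) -> str:
    """Return a rough guess at which layer produced an HTTP status code."""
    if status in CLOUDFLARE_EDGE_CODES:
        # The edge itself is telling us something behind it failed.
        return "edge"
    if 400 <= status <= 499:
        return "client"
    if 500 <= status <= 599:
        # Could be the application, a framework, a gateway, or the platform.
        return "ambiguous"
    return "not-an-error"


if __name__ == "__main__":
    for code in (404, 502, 523):
        print(code, classify(code))
```

Only the 52x codes land in a bucket that names a layer; every standard 5xx falls through to 'ambiguous', which is exactly the complaint.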