Keep your infrastructure boring

I’ve recently become quite grumpy that software development (and by extension, DevOps) has for some considerable time been fashion-driven. Talented infrastructure engineers, DevOps, sysadmins and other wranglers of the infrastructural arts are now at a point where they ignore their experience and instincts to embrace trendy tech that keeps them employed. Fear of missing out (FOMO) and C.V.-driven development are ensuring that the adage of ‘Right tool for the job’ is being quickly replaced with ‘Ooooh, shiny’. This attraction to the new is naturally leading to some excitement in my field. Note that this isn’t exciting like a roller coaster, but more exciting like a plane ploughing into the ground. Good infrastructure should be dull. Dull is good - dull is routine, easy to understand, and if it goes wrong, easy to fix. Exciting means new and novel ways for things to fail, leaving you scratching your head, wondering what part has fallen off now and never quite knowing what is going on. Dull is sleep-inducing; exciting is being wide awake and terror-stricken at odd hours of the morning. In short, exciting is a terrible state for infrastructure.

The buzzwords du jour are scale and elasticity, the two things that most organisations are not challenged by, but seemingly think they are. The lucky 1% have these challenges, and it almost certainly didn’t happen overnight; it was a gradual accretion of won business, retained customers and new product lines. A vanishingly small number of companies have had an overnight influx of business that has forced their existing infrastructure to cry for mercy, and much of that pain is usually due to poor software architecture rather than limits on the existing infrastructure. However, this trend of focusing on scale and elasticity first has left a large number of companies, both startups and larger firms, with infrastructure that is unfit for purpose, hard to manage, hard to develop against and, ironically, unable to scale.

Let’s take an example from two teams I’ve worked with to illustrate this point. They make an excellent comparison, as both are essentially CRUD (Create, Read, Update, Delete) apps which have low usage (around five to ten users per second) but high-value transactions. Both have complex rules engines and allow users to upload files.

Team A has written their application as a ‘monolith’ using Java with Tomcat as their web server and a Postgres database. They host on DigitalOcean using Droplets built with Packer and deployed as a pair of application servers behind a load balancer using Terraform. Developers develop locally using the same version of the JDK and a local DB, and push to Git with a Jenkins CI/CD pipeline. Team B has written their app in a mix of languages with Postgres, Redis and DynamoDB backends and is moving from a semi-monolith to a microservices approach. They originally hosted on Elastic Beanstalk, but then moved to Kubernetes clusters built using Kops hosted on AWS. Applications are packaged using Helm, and CI is done in part by Jenkins and in part by a cloud-based application. Development team A is especially productive and spends the majority of their time pushing out features to users. Infrastructure is not something that occupies them daily and has been largely reliable. Team B struggles against their infrastructure and spends an inordinate amount of time and resources trying to either fix, or indeed even locate, problems on their platform. Local development is messy, with an inability to replicate either the production or staging platforms (they differ significantly due to a lack of standardisation between the K8s clusters).

The irony here is that development team A almost certainly has a clearer path to scalability, and can add additional load-balanced servers to scale the platform. This approach will hit a ceiling at some point, but at a point where commercially they will be a completely different business, with a leadership team that would allow them to create a strategic approach to meet demand. Team B has a poorly implemented, cutting-edge system that they don’t need and that, in places, doesn’t even scale to meet their existing load. It has also become a boat anchor on productivity: the infrastructure, and the way its moving parts relate to one another, creates enormous challenges for developers. Team A isn’t much interested in new shiny infrastructure things. Instead, they are focused on getting paid, and to do that, they need a reliable platform they can release features to. The team at B has adopted technology without thinking about which problem they are trying to solve, instead erring towards whatever has captured the imagination of the crowd at a given time. As a result, they have spent a lot of effort fighting their own systems rather than releasing fixes or features.
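To put rough numbers on how far "just add another server" can go, here is a minimal back-of-the-envelope sketch. It is an illustration under stated assumptions, not a measurement from either team: the function name, the per-server throughput of 50 requests per second, the 50% utilisation target and the single spare server are all hypothetical, and it treats the "five to ten users per second" figure as requests per second.

```python
import math


def servers_needed(peak_rps: float, per_server_rps: float,
                   target_utilisation: float = 0.5, spares: int = 1) -> int:
    """Estimate how many identical load-balanced servers cover a peak load.

    per_server_rps, target_utilisation and spares are assumptions you
    would replace with figures from your own load tests.
    """
    if peak_rps < 0 or per_server_rps <= 0 or not 0 < target_utilisation <= 1:
        raise ValueError("inputs out of range")
    # Only plan to use a fraction of each server, leaving headroom for spikes.
    usable_rps = per_server_rps * target_utilisation
    # Always run at least one server, then add spares for redundancy.
    return max(1, math.ceil(peak_rps / usable_rps)) + spares


# Today's load: 10 requests/second against an assumed 50 req/s per server.
print(servers_needed(10, 50))
# A hundredfold increase in traffic under the same assumptions.
print(servers_needed(1000, 50))
```

Under these made-up figures, today's load fits on a pair of servers, and a hundredfold increase is still a matter of adding more of the same boxes behind the load balancer. The real ceiling is more likely to come from the database or the application's architecture, which this sketch deliberately ignores.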

I’ve worked with reasonably high-scale platforms. During my tenure at BSkyB I was the lead DevOps on the content discovery platform (primarily around search, but also on recommendations and some other systems). This platform saw serious use, as every internet-connected set-top box used the search platform, and there were a few million of those. I was also in charge of the internet search platform, the Sky version of Google, which again saw reasonable traffic. Both of these platforms did clever stuff in the background with result ranking, predictive text etc., and I designed them around what would be considered a ‘traditional’ deployment model of non-elastic load-balanced servers which we kept an eye on and manually scaled if we saw the need. Partly this is because the tooling we have now didn’t exist, but if I was approached to design it today with the same software stack and challenges? I’d probably arrive at the same conclusion, albeit perhaps with an autoscaling group. That being said, I’m currently working with a client where I’m enthusiastically recommending Kubernetes as the target platform. This client needs to be able to deploy to all the major cloud platforms, and at that point the abstraction makes sense. Even so, if they were going to be in, say, AWS alone, I’d recommend they use an ALB-fronted set of servers with maybe an autoscaling group. Why? Because it’s a YACRUD (Yet Another CRUD) application which is not going to see high traffic except in the rarest of instances.

Developers and CTOs are sold on the lie that moving to a cloud and embracing elasticity will be cheaper and less complex. It won’t. Virtually every client I’ve worked with lately would have seen better cost savings by abandoning this wild hunt for elastic nirvana and using reserved instances. The same goes for Kubernetes: they are being sold the lie that it will make local development easier. It won’t, and in many cases it will lead to one or two vocal proponents causing havoc in a dev team as they force other developers to embrace technology before they understand it or are comfortable developing with it. The result is massive lost productivity and opportunity cost and, in some spectacular cases, loss of knowledge as otherwise talented developers leave to go and work in a sane environment. I’ll caveat this by saying that I have seen teams adopt these things and have it work marvellously, but what sets them apart is that they planned up front, either by starting with a blank slate and recruiting for the desired tech stack, or by carefully migrating with training and the recognition that productivity would hit the floor during the transition. In every case where I’ve seen adoption by fashion, it has ended badly.

Choose the right strategy for now. Don’t worry about what might happen; it probably won’t, and if it does, you’ll have time to see it coming. Concentrate on the fundamentals - making infrastructure that is easy to support, scales to support your current load and, most importantly, doesn’t drag developers into a quagmire of needing a deep understanding of how the infrastructure works in order to be productive. If you do choose a new approach, be that microservices, Kubernetes or Serverless, then do it with commitment, and do it only when you are certain that the pain of transition will offer real gains on the other side. Ask the hard questions:

  • Why is the current infrastructure so broken it can’t support the new endeavour?
  • Do we understand this new system and its failure modes deeply enough to fix it at 4am when we are sleep-deprived?
  • Is the whole team (developers, QAs and DevOps) comfortable using the new infrastructure to develop with and manage?
  • Does this really offer a positive benefit?

Make your infrastructure boring. Get sleep. Be productive.