Bringing IT into balance
This is the latest blog in our series from G-Research’s CTO, Chris Goddard. Catch up with the first one here.
Coming into balance can be a challenge. You don’t need to be a tightrope walker to understand this either. Everyone – from racing drivers, to power plant operators, to cooks – relies on carefully calibrated mixtures and delicately balanced formulations. Arriving at the “right” balance takes time, patience and sustained effort. It’s certainly not a “move fast and break things” mentality. If you come into it with the wrong mindset, your car crashes, your power plant melts down, or – worst of all – your soufflé doesn’t rise.
Cloud computing empowered everyone to begin provisioning and equipping their own infrastructure to do complicated computing tasks. And many people enthusiastically began doing so. The cloud took what had been a complicated, multi-week process involving delivery vans and cardboard boxes (gasp!) and installations and tests, and turned it into as little as a couple of clicks and the swipe of a credit card. This is almost always touted as a good thing – but the cloud conversation can miss that element of balance.
I recall when we built G-Research’s first compute farm racks (more like a smallholding actually). I was unloading the blades out of boxes and thinking to myself what a headache it was going to be to get each one provisioned with the software we were using at the time. Provisioning was the second problem – the first and more painful would be fitting the many hundreds of individual memory sticks that had been delivered into the motherboards. It was both invigorating and enervating, and I recall coming home exhausted.
As we modernise our approach to technology, we are still heavy users of on-premises hardware alongside cloud-hosted services. For us, cloud isn’t just about where the actual tin is; it is as much about approaches to provisioning and how to think about and manage the entire lifecycle of our technology estate. We refer to this as using ‘cloud-style’ approaches, rather than necessarily being all-in on public cloud. Our issue has been working out what we can usefully learn from the cloud computing revolution. Finding the balance means understanding and embracing the trade-offs.
DevOps, Infrastructure-as-Code, self-service, GitOps and their kin have all brought significant benefits, and can help people in the cloud and in their own datacentres – but often enthusiasm outpaces expertise. Changing ways of doing things takes time and effort, and while we believe wholeheartedly that this is the right way to go, it sometimes also makes sense to take a step back.
G-Research was arguably pretty late to start on its own ‘cloud’ journey (as I talked about in my last article). That gave us the opportunity to learn a lot from the experiences of others. It seems that an adage I often turn to holds true with how you approach handing out the power to build everything yourself from infrastructure building blocks – “just because you can, doesn’t mean you should.”
As we began working out how to break up our monoliths, and planning how we wanted to construct our new infrastructure platforms, we realised that many larger engineering organisations were dealing with a whole new set of problems. Manual provisioning, waiting for procurement processes, and limited choice had been replaced by ballooning cloud spend, accelerating IT sprawl and confusing complexity.
So, we took some inspiration from those who had gone before, and decided we needed to work towards a more balanced approach. Most developers don’t need to be building virtual machines, configuring iptables rules, figuring out why the load balancer isn’t working or thinking about when the database will fill up. Most developers just need to be able to write code, test it, and get it into production with as few interruptions and hurdles as possible.
Often, you don’t hear about the downsides of self-service and cloud. Why would you? Here’s a brief list, in no particular order, of the headaches the G-Research team is hoping to avoid.
Wheels and their reinvention
So much of modern IT has already been done before. Although Netflix may have been among the first to openly discuss the fact that no developer should have to re-pave the roads before they can drive on them, it’s hardly a new idea. I wouldn’t want to re-derive the Pythagorean theorem each time I needed to find the length of the hypotenuse of a triangle (an everyday occurrence, I assure you). It’s much more useful to plug the numbers into the existing formula.
We’re looking to harness the learnings and best practices of the industry, leapfrogging the whole model which relied on developers creating and deploying their own VMs from scratch. There will, of course, be times when a developer will want or need a totally clean slate to work from, but most coders prefer some layer of abstraction between themselves and the bare-bones bits. Our devs, right now, prefer working in Kubernetes (K8s), which conveniently abstracts away many things so that they can focus on writing code that actually performs useful functions for us.
We too have decided to pursue a Paved Road strategy, where a central team provides many of the building blocks for building, testing, deploying and operating software. Why have every team discover the same problems and build the same tooling time and time again when most teams can work productively with some centrally curated building blocks? Even Kubernetes itself can be pretty complicated – maybe there’s another useful abstraction above it that we would benefit from? This topic was even a feature of recent keynotes at KubeCon 2019 and Microsoft Ignite.
Just because you can, doesn’t mean you should
Why build something in the public cloud when you have Kubernetes in production on premise already? Why use a new tool or framework when other people in your organisation have already solved a similar (or even the same) problem? Why provision a large server when a small one will do?
Team members all over our organisation increasingly have to think about questions like these for themselves. Removing the central team that controlled the stock of servers and hard disks has brought huge benefits in terms of velocity, but it doesn’t guarantee a coherent overall outcome, or the most cost-effective use of the company’s overall technology spend.
We need to switch to a different mode of working: providing guidance, offering patterns and outlining decision criteria, so that the autonomous teams we strive for can make good decisions with the necessary context.
Most people try to do the right thing most of the time. Communicating about overarching goals is just as important as ensuring all the ‘t’s have been crossed and ‘i’s dotted. If someone understands the “why” behind a given decision or policy, they are much more likely to comply with it or find creative new ways of doing what they want to do while doing what needs to be done.
It’s good to have control
The cloud makes it so easy to do things that it is just as easy to create chaos. Cloud computing vendors focus on selling functionality, and often tout what their systems can do and how easy it is to get things done on them. What they don’t really want to tell you is to slow down and consider how you will manage the costs, deal with lifecycles and avoid duplication.
There are huge benefits to the cloud working this way when you’re innovating and trying things. The ability for engineers to have their ideas manifest quickly and easily has been game-changing, in the same way rapid prototyping enabled machinists to build new parts for industrial processes in the 80s. But the difficulty comes when these prototypes need to move into Day 2 operations. Then, the benefits of standardisation and some kind of inventory management become clear. Snowflake configurations too often melt under the pressure of real data, or actual operations. Is the system able to handle inter-departmental chargebacks? Did anyone think about how to back up and restore the self-service database in an emergency? Who owns this set of 50 VMs with GPUs attached that were left running for four months?
When it comes to G-Research’s technology modernisation efforts, we are looking to capture as many of the benefits of cloud-style approaches as we can. Doing so means avoiding the worst of the pitfalls and learning from the experience of others. We will get some things wrong, of course – but it is good to ask why we are doing, or not doing, something that we see others doing. As a data-driven company, we want to see the evidence, and as such are trying lots of things both in our datacentres and in the cloud.
The key to making a soufflé that rises is separating the egg yolks from the egg whites. Miss this step and your dish comes out a hot mess. Similarly, if we veer too much in one direction or another, it’ll quash innovation and value creation. It takes balance and I’ll admit that I’m still learning.
What the cloud has done for our wider industry’s ability to deliver at pace is incredible – now our job is working out, for G-Research’s use case, how much of this is down to the cloud-style approaches rather than where the tin is.
Chris Goddard, CTO