Your Pet Kubernetes Cluster Is Not a Metaphor Failure. It Is a Moral One.

Sometime around 2006, a Microsoft engineer called Bill Baker stood in front of a slide deck about scaling SQL Server and drew a line down the middle. On the left he wrote "pets." On the right he wrote "cattle." The distinction was simple, almost laughably so. Pets have names. You nurse them back to health when they get sick. You cry when they die. Cattle have numbers. When one goes down, you replace it on the line. Randy Bias later popularised the metaphor through the cloud computing community, and Tim Bell at CERN carried it further still, until "pets vs cattle" became one of those phrases that every architect, every platform engineer, every CTO with a conference lanyard could recite like a catechism.

That was nearly twenty years ago.

I want you to think about that for a moment. Twenty years of an industry telling itself that it has moved from pets to cattle, that it has embraced disposability, that it has learned to love immutability. Twenty years of conference talks and blog posts and architectural decision records all solemnly declaring that we do not name our servers any more. And here we are in 2026, and I have never seen more pets in my life.

The problem is not that we failed to understand the metaphor. The problem is that we understood it perfectly, applied it to one layer, and then built an identical pet shop on the layer above. This is the pattern I keep coming back to — not a failure of comprehension but a failure of honesty. We did not solve the pet problem. We relocated it. And then we gave the new pets different names so we could pretend they were cattle.

The Relocation Programme

I gave a talk about this, and I have been banging on about it ever since, because the more I look at it, the angrier I get. The argument is straightforward. Kubernetes abstracts your nodes. It abstracts your workloads. It does a genuinely brilliant job of treating the things running inside a cluster as cattle. Containers crash, they get rescheduled, nobody cries. So far, so good.

But what is the first thing you do with a brand new Kubernetes cluster? You install a load of things. Certificate issuers. Log aggregation. Monitoring. Security policies. Service mesh. Ingress controllers. Policy engines. You install so many prerequisite components that by the time your actual application workloads arrive, the cluster is already bespoke. It has a personality. It has quirks. It has that one Helm value override that Dave set up eighteen months ago and nobody has touched since because nobody knows what it does and everyone is afraid to find out.

I know this because I have done it myself. More than once. I have stood up clusters with the best of intentions, automated their creation beautifully, and then watched as they accumulated state, configuration drift, and tribal knowledge until they were as precious and irreplaceable as any server I ever named in the 2000s. The shame is not that it happens. The shame is that we keep pretending it does not.

The grey layer — that sandwich of operational tooling between your applications and your infrastructure — delivers absolutely zero business value unless you are in the business of selling operational tooling. It is the fastest-growing, least-examined, most lovingly tended collection of pets in the history of computing. And it is held together, as I said in the talk, by sticky tape, chewing gum, pipe cleaners, thoughts, prayers, and Helm.

Helm. A string-based templating engine where every community chart must eventually expose every parameter of the thing it abstracts through a glorified string replace. I cannot think of a technology that better represents the gap between what we say we believe and what we actually do. We say we believe in infrastructure as code. We say we believe in declarative configuration. And then we build the most critical layer of our platform on top of a tool that operates on the same principle as a mail merge.

Then Like Now

Fred Brooks observed in 1986 that there is no silver bullet — that every new tool eliminates accidental complexity but leaves essential complexity untouched. He was right. But Brooks gives us something more useful than despair. He gives us a diagnostic question: which of the things in your grey layer are essential complexity that must be managed, and which are accidental complexity you have normalised? Most organisations have never done that audit. They cannot tell you which parts of their platform exist because the problem demands it and which parts exist because someone installed Istio in 2021 and nobody has had the courage to ask whether it is still earning its keep.

The pattern predates Kubernetes by decades. Virtualisation was supposed to kill the pet server. Instead it created pet VMs — hand-configured, lovingly maintained, sprawling across environments until administrators could no longer keep track. Lift-and-shift was supposed to liberate us from the data centre. Instead, organisations moved their pets to a new address without changing their behaviour. The server got a new postcode. The relationship did not.

Then containers arrived and we did it again. The container images were cattle, beautifully immutable, pulled from registries, destroyed without sentiment. But the clusters running those containers became pets within weeks. The Dockerfiles were cattle but the CI/CD pipelines building them became pets — sprawling, undocumented, untouchable Jenkins instances that everyone feared and nobody understood. Each abstraction layer faithfully reproduced the pet dynamic one level up, like a Russian doll of operational attachment.

The CNCF landscape tells the story in a single image. In 2018, roughly 25 projects. Today, north of 200 hosted projects across the sandbox, incubating, and graduated tiers, and over a thousand cards on the landscape. A thousand solutions to a problem that was supposed to be solved by treating infrastructure as disposable. If the problem were actually solved, the landscape would be shrinking, not metastasising.

How to Know If You Are Performing

I keep three tests in my head. If you fail any of them, you have pets. It does not matter what your architecture decision records say.

The reproduction test. Can you destroy this component right now and bring it back immutably from code, all without anyone noticing? Not in theory. Not in a disaster recovery document that was last updated when the Queen was alive. Right now, this afternoon, could you kill it with fire and have it back before the tea goes cold? If the answer involves a runbook, a specific engineer, or the phrase "we should really document that," you have a pet.

The knowledge coupling test. How many people in your organisation could reproduce this component if the person who built it left tomorrow? If the answer is fewer than three, the component's complexity lives in human memory rather than in code. That is what makes something a pet — not whether it has a name, but whether it has an irreplaceable owner.

The grey layer ratio. What proportion of your engineering effort goes to maintaining the substrate between your infrastructure and your applications? If the grey layer consumes more time than the business workloads it supports, you have inverted the value proposition. The infrastructure exists to serve the application. When the application exists to justify the infrastructure, you are running a pet hotel.
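For anyone who wants to run the three tests as an actual audit rather than keep them in their head, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Component` record, its field names, and the idea of measuring the grey layer ratio in engineering hours are hypothetical, and the answers still have to come from honestly destroying things and counting people, not from a spreadsheet.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One audited piece of the platform (hypothetical record for illustration)."""
    name: str
    # Reproduction test: was it destroyed and restored from code alone,
    # with no runbook and no specific engineer involved?
    rebuilt_from_code_unaided: bool
    # Knowledge coupling test: how many people could rebuild it tomorrow?
    people_able_to_rebuild: int


def failed_tests(component: Component,
                 grey_layer_hours: float,
                 workload_hours: float) -> list[str]:
    """Return the names of the tests this component fails."""
    failures = []
    if not component.rebuilt_from_code_unaided:
        failures.append("reproduction")
    if component.people_able_to_rebuild < 3:
        failures.append("knowledge coupling")
    # Grey layer ratio: the substrate costs more effort than the workloads.
    if grey_layer_hours > workload_hours:
        failures.append("grey layer ratio")
    return failures


def is_pet(component: Component,
           grey_layer_hours: float,
           workload_hours: float) -> bool:
    """Failing any single test is enough."""
    return bool(failed_tests(component, grey_layer_hours, workload_hours))


if __name__ == "__main__":
    mesh = Component("service-mesh", rebuilt_from_code_unaided=False,
                     people_able_to_rebuild=1)
    print(failed_tests(mesh, grey_layer_hours=30, workload_hours=20))
```

The thresholds are the ones stated above: fewer than three people, and more grey layer effort than workload effort. The hour figures in the example call are made up.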

I described the recruitment angle in the talk as hunting unicorns: you need someone with Kubernetes experience, maybe a CKA, who also knows Linkerd or Istio, who can write Terraform and Helm, who understands your CI/CD pipeline, who has opinions about OPA versus Kyverno, who can debug a CNI plugin at three in the morning. That is not a job description for operating cattle. That is a job description for a veterinarian.

What Genuine Cattle-Thinking Actually Looks Like

Only 10% of organisations use ephemeral environments as their primary development and testing approach. Ten percent. Despite the evidence that ephemeral environments deliver order-of-magnitude improvements in feedback speed and millions in annual infrastructure savings. The reason adoption is so low is not technical. It is emotional. People do not want to let go of the cluster they have spent months configuring. They have a relationship with it. It is a pet.

When I designed NDX:Try — the sandbox environment within the National Digital Exchange for local government — the expiry was the feature, not a limitation. You request an environment. You get one. It has a timer on it. When the timer runs out, the environment ceases to exist. If you want another one, you request another one. There is no button to extend. There is no mechanism to preserve. The environment dies, and the only thing that survives is your code.
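The shape of that design can be sketched in a few lines of Python. This is not the NDX:Try implementation; the `Sandbox` class, `request_sandbox`, and `reap` are hypothetical names invented for illustration. The point it tries to show is that the expiry is fixed at creation and the model simply has no operation for extending it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)  # frozen: expires_at cannot be reassigned after creation
class Sandbox:
    """A hypothetical expiring environment. There is deliberately no extend()."""
    owner: str
    expires_at: datetime

    def is_alive(self, now: datetime) -> bool:
        return now < self.expires_at


def request_sandbox(owner: str, ttl: timedelta, now: datetime) -> Sandbox:
    """Every request yields a fresh environment with its own timer."""
    return Sandbox(owner=owner, expires_at=now + ttl)


def reap(sandboxes: list[Sandbox], now: datetime) -> list[Sandbox]:
    """Keep only environments whose timer has not run out."""
    return [s for s in sandboxes if s.is_alive(now)]


if __name__ == "__main__":
    start = datetime.now(timezone.utc)
    env = request_sandbox("example-team", timedelta(hours=2), start)
    print(env.is_alive(start + timedelta(hours=1)))  # still inside its lifetime
    print(reap([env], start + timedelta(hours=3)))   # gone once the timer expires
```

Wanting more time means calling `request_sandbox` again and redeploying from code, which is the only path the sketch leaves open.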

This was a deliberate design decision, and I will tell you exactly why. If the environment is permanent, people will configure it by hand. They will SSH into it and install things. They will make it theirs. They will name it. Within a week it will be a pet, and within a month it will be load-bearing infrastructure that nobody can reproduce. I have seen this happen so many times that I have lost the capacity to be surprised by it. The only way to guarantee cattle behaviour is to make the cattle mortal — to build the system so that doing it right is easier than doing it wrong.

That is the entire philosophy in one sentence. Make doing it right easier than doing it wrong.

The Moral Dimension

I used the word "moral" in the title and I meant it. This is not an academic argument about architectural patterns. The grey layer — that vast expanse of operational tooling that delivers zero business value — has a cost, and the cost is human.

I am not saying the people who built this complexity were stupid. Every individual decision in the grey layer was locally rational. Someone chose Istio because they had a genuine mTLS requirement. Someone adopted OPA because an auditor asked about policy enforcement. Someone installed Prometheus because the existing monitoring was inadequate. Each decision made sense in its moment. The problem is that nobody ever stepped back and asked what the aggregate looked like — and by the time they did, the aggregate was load-bearing and untouchable. That is the trap. Rational decisions, compounding into irrational outcomes, defended by the sunk cost of having made them.

Every hour an engineer spends debugging a Helm chart that wraps an operator that configures a CRD that manages a certificate that enables a service mesh that secures a connection between two microservices that used to be one monolith is an hour not spent on the thing the organisation actually exists to do. Every unicorn job advert that goes unfilled for six months is a team carrying an unsustainable workload. Every cluster that cannot be reproduced from code is a single point of failure waiting to become a 3am phone call that burns someone out.

We built this. All of it. We built the complexity, we built the fragility, and we built the recruitment crisis. And we did it whilst telling ourselves, with straight faces, that we had moved from pets to cattle.

The metaphor is not the problem. Bill Baker's insight from 2006 was correct — genuinely, profoundly correct. The problem is that we took a revolutionary idea about disposability and applied it to the one layer where it was easy, then stopped. We made the containers disposable. We left everything else precious. And then we wrapped the precious things in so many layers of abstraction that we could no longer see them, which meant we could no longer see that they were pets.

That is not an engineering failure. It is a failure of honesty. And the only fix — the one I keep coming back to, the one I built NDX:Try around, the one I argued for in the talk — is to stop managing pets better and start making pets impossible. Build systems where the default path is the reproducible one. Make permanence require effort, not the other way around. Make the cattle mortal so that the only thing that survives is the code that can bring them back.

The technology is there. It has been there for years. What is missing is the willingness to stop pretending.

(Views in this article are my own.)