<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Cloudy with a chance of freefall</title>
    <link>https://blog.cns.me</link>
    <description>Semi-coherent thoughts from a technologist finding himself cosplaying other roles in an organisation to achieve outcomes.</description>
    <language>en</language>
    <atom:link href="https://blog.cns.me/feed.xml" rel="self" type="application/rss+xml" />
    <lastBuildDate>Thu, 23 Apr 2026 05:00:02 GMT</lastBuildDate>
    
    <item>
      <title>The Instruction Manual</title>
      <link>https://blog.cns.me/posts/instruction-manual-chris-nesbitt-smith-yswwe/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/instruction-manual-chris-nesbitt-smith-yswwe/</guid>
      <pubDate>Thu, 23 Apr 2026 05:00:02 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHUE90bGhPMGRtancvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWjJCVXl0d0c0QUktLzAvMTc3NTk5MTIwNTE3OT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9blV4UzU1ZG5jejI4enM4cnJMMms3T3pZLUJyc2UzRVVnaGR3MTRBOEhsUQ" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>I rewatched Night at the Museum recently with my young children. In the </span><span><a href="https://en.wikipedia.org/wiki/Night_at_the_Museum" target="_blank">first film</a></span><span>, three retiring night guards hand Larry Daley a numbered instruction booklet. Cecil, the oldest, delivers the briefing with the confidence of a man who has been doing this for decades: </span><span><a href="https://en.wikiquote.org/wiki/Night_at_the_Museum" target="_blank">"Do 'em in order, do 'em all and do 'em quick."</a></span><span> No explanations. No rationale. Just a list of commands and a warning not to let anything in or out.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I think that might be the most honest depiction of a Standard Operating Procedure (SOP) I have ever seen on film. And I must admit, I recognised the handover. I have been Larry. I once inherited a deployment procedure for a banking system that included the step "wait 90 seconds before proceeding." No one could tell me why. I followed it for months before discovering it was the time a long-retired server had needed to flush its cache. The server was gone. The wait remained.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHMmhJMld4NXRpN0EvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjNFRjlET0prQVktLzAvMTc3NzExMTM4OTI1OT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9VjdSdnc2MnZxNnBCeDlEU3hSa1dQLXAzNFA4WVVydEhTVE4tR053RHlvOA">
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>We have all inherited instruction manuals like Cecil's. They arrive with a new role, a new system, a new team. They are numbered, ordered, and stripped of context. They tell you what to do. They do not tell you why.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>And how often do we ask whether the omission is accidental?</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>For automation, codified procedures are genuinely indispensable. Larry's first instruction is "throw the bone." It sounds absurd until you discover that the T-Rex skeleton just wants to play fetch. The rule, followed on faith, works. </span><span><a href="https://engineering.grab.com/introducing-the-sop-drive-llm-agent-framework" target="_blank">Grab's engineering team found that structuring LLM agents around explicit SOPs achieved over 99.8% accuracy</a></span><span>, precisely because the machine does not need to understand why. You cannot automate what you cannot codify, and codification means writing down the steps, however strange they may look to the newcomer holding the bone.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>However, I think there is a difference between codifying a procedure and understanding one. And that difference may matter most when your organisation is not trying to repeat the past but trying to move beyond it. I should be clear: in safety-critical settings, rigid SOPs save lives, and </span><span><a href="https://atulgawande.com/book/the-checklist-manifesto/" target="_blank">Gawande's checklist research</a></span><span> makes a compelling case for not inviting reasoning under pressure. What I am talking about is the procedures that govern how organisations change, not how they operate at the sharp end.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFIQWhlLXo5RExEYWcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjJCVlc4b0pFQVktLzAvMTc3NTk5MTM1Mzk4Nz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9NC1jTXhPcXFrSUo3X2VXYmY3eTQ1Z1pOckVkX0oxeHU4bVZMWVpFUzFzVQ">
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>In the third film, </span><span><a href="https://en.wikipedia.org/wiki/Night_at_the_Museum:_Secret_of_the_Tomb" target="_blank">Secret of the Tomb</a></span><span>, the golden tablet that animates the museum exhibits begins to corrode. It has been away from its source (the moonlight of the Temple of Khonsu) for too long. The exhibits start glitching, becoming aggressive, losing themselves. The rules keep firing.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>The rules now hurt.</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>How many of our procedures are running on a corroding tablet? I suspect more than we think. The procedure still runs, but the context has quietly moved. Consider the </span><span><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4953332/" target="_blank">WHO Surgical Safety Checklist</a></span><span>: at eight pilot hospitals where teams were trained in what each item meant and why it mattered, deaths fell by 47% and complications by 36%. When </span><span><a href="https://www.nejm.org/doi/full/10.1056/NEJMsa1308261" target="_blank">Ontario rolled the same checklist out to 101 hospitals</a></span><span>, with 215,000 procedures, there was no significant reduction in either. Same form. Opposite outcomes. The reasons are probably more complex than any single explanation (implementation quality, institutional culture, training investment all played a role), but I think the direction is clear: the form without the understanding is the tablet without the moonlight.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>You may have heard the story about five monkeys, a banana, and a cold water spray (new monkeys, never sprayed, still attack anyone who reaches for the banana). It is a vivid parable about inherited behaviour. It is also </span><span><a href="https://www.psychologytoday.com/us/blog/games-primates-play/201203/what-monkeys-can-teach-us-about-human-behavior-facts-fiction" target="_blank">entirely made up</a></span><span>. And yet the fact that every manager has heard it is perhaps itself the phenomenon it purports to describe. We pass around a story about unquestioned rules without ever questioning where the story came from. (I am building my own argument on a Ben Stiller film, but I hope the difference is that I am using it as metaphor, not as evidence.)</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFaW0tX2VSMVk5ancvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjJCV01HNEgwQVktLzAvMTc3NTk5MTU3MjkwNj9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9cGF4Z1FlN1o3N09hY1NVMkFVdGkxRzd0dGJNS1VNVEhlTnphY1NjaFNaSQ">
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>G.K. Chesterton offered </span><span><a href="https://www.chesterton.org/taking-a-fence-down/" target="_blank">the opposite warning in 1929</a></span><span>: do not remove a fence until you know why it was put there. I think both positions may be true simultaneously. A procedure without its rationale is vulnerable to what Richard Feynman called </span><span><a href="https://calteches.library.caltech.edu/51/2/CargoCult.pdf" target="_blank">cargo-cult behaviour</a></span><span>, ritual that perfectly imitates the form of the original while missing everything that made it work. A procedure whose rationale you never bother to recover is vulnerable to reckless removal. Either way, the why is what makes the difference. We cannot afford to treat it as optional documentation.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Taiichi Ohno, the architect of the Toyota Production System, understood this. His standardised work was never the end state; it was the starting line. </span><span><a href="https://leansmarts.com/lean-101/standard-work/" target="_blank">"Without standards, there can be no kaizen"</a></span><span>, he argued, meaning that you begin by following the written procedure faithfully, measure what happens, and then improve it. The SOP is a hypothesis taped to the workstation, not a commandment carved in stone. But you must follow it first.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>In Night at the Museum, we eventually discover that Cecil, Gus, and Reginald </span><span><a href="https://villains.fandom.com/wiki/Cecil_Fredericks" target="_blank">were the villains all along</a></span><span>. They had been stealing the tablet's magic to keep themselves young. The manual was not neutral. Its omissions were deliberate. They did not forget to explain why. They chose not to.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I suspect more of our organisational SOPs carry this kind of silent self-interest than we might like to admit. The approval gate that justifies the approver's role. The weekly meeting whose purpose no one can articulate but whose cancellation would threaten someone's calendar. Not conspiracy, exactly. Just the quiet accumulation of procedures that serve their authors more faithfully than they serve the organisation.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFIYWE2ak9qNFdrR3cvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaMkJXMHFrS0VBUS0vMC8xNzc1OTkxNzQ0Mjg5P2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1Tb0tzc0oxeHNGVEcyUGZQSXNXeEJ6M2RKdmxUWk8zQXFFU0gyMTNVYnBr">
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>If you are automating, then by all means codify. Throw the bone. But if you are transforming, perhaps the first question is not "what does the manual say" but "who wrote it, and what did they gain from what they left out." As David Marquet, the submarine commander who </span><span><a href="https://davidmarquet.com/books/turn-the-ship-around-book/" target="_blank">turned USS Santa Fe around</a></span><span>, put it: "Instructions require obedience; intent requires thought."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The tablet can be restored, but only if we take it back to the moonlight. Write the why down while someone still knows it. And if no one knows it, perhaps that tells you something about who wrote the manual and what they wanted you not to ask.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>After all: </span><span><a href="https://www.quotes.net/mquote/122789" target="_blank">"I'm made of wax, Larry. What are you made of?"</a></span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>Falling Without a Checklist: The Only Migration That Matters</title>
      <link>https://blog.cns.me/posts/falling-without-checklist-only-migration-matters-chris-nesbitt-smith-qmtne/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/falling-without-checklist-only-migration-matters-chris-nesbitt-smith-qmtne/</guid>
      <pubDate>Mon, 20 Apr 2026 06:56:19 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHUGdKLUlYZmdWUXcvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWno4NEVQMUpBQUktLzAvMTc3Mzc2OTA4NjM3OD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9WGNxSmNVem9YRFNzaUVpYlI5bUl3SkNNNjJlWnB6SlF1TEc1RWQwMjJyQQ" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>On 30 October 1935, a prototype called the </span><span><a href="https://en.wikipedia.org/wiki/Boeing_B-17_Flying_Fortress#Development" target="_blank">Boeing Model 299</a></span><span> taxied onto the runway at Wright Field, Ohio, and crashed on takeoff. The aircraft was not faulty. The pilot, Major Ployer Peter Hill, was not incompetent. The Model 299 was simply too complex for a single human being to operate from memory. It had four engines where previous bombers had two. It had more flaps, more fuel mixtures, more trim tabs, more ways to kill you if you forgot a step.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The US Army Air Corps nearly cancelled the programme.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Instead, a group of test pilots did something that no amount of individual heroism could have accomplished: they wrote a checklist. Pre-takeoff. Pre-landing. Pre-everything. The checklist was not a training aid for beginners. It was a systemic intervention that acknowledged a simple, uncomfortable truth — the aircraft had exceeded the capacity of human memory, and no amount of skill or courage could substitute for a system.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The Model 299 went on to become the </span><span><a href="https://en.wikipedia.org/wiki/Boeing_B-17_Flying_Fortress" target="_blank">B-17 Flying Fortress</a></span><span>. It helped win a war. And it did so not because the pilots got braver, but because they got disciplined.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I keep thinking about that checklist. I keep thinking about it because </span><span><a href="https://kansoftware.com/cloud-migration-failures-2026-roadmap/" target="_blank">62% of organisations that attempt cloud migration</a></span><span> report significant unplanned cost overruns, delays, or outright failure. Sixty-two percent. That is not a teething problem. That is a systematic absence of pre-flight procedure. These organisations are climbing into the cockpit of a four-engine bomber and trying to fly it from memory.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGdW1mNXM4Z0ZOZXcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzE1MDBfMjIzMi9CNEVaejg4QnhySlVBVS0vMC8xNzczNzcwMTIwODA2P2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1KT1NCQlRNRW04Ti14bHZpR2VMbWJhNkVkSmZFbVpoS1FEYW5uWW01OWE0">
          <figcaption>
            <span>Ground crew preparing the 299 for the takeoff that would change aviation safety</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>The Sport That Industrialised Courage</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>The Model 299 was not the first time humans confronted the gap between individual bravery and systemic safety. Skydiving did it first — and did it better than almost any industry I have encountered.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>In 1797, </span><span><a href="https://en.wikipedia.org/wiki/Andr%C3%A9-Jacques_Garnerin" target="_blank">André-Jacques Garnerin</a></span><span> jumped from a balloon over Paris with a silk canopy and no backup. He survived. In 1919, </span><span><a href="https://en.wikipedia.org/wiki/Leslie_Irvin_(parachutist)" target="_blank">Leslie Irvin</a></span><span> made the first deliberate free-fall jump with a manually deployed parachute. He survived too. What happened next is the part nobody in technology talks about. The skydiving community did not celebrate these individual acts of courage and move on. It did something far more radical. It built a safety stack — a layered system in which each component exists because the one above it might fail.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>That stack looks like this. First, training — rigorous, standardised, non-negotiable. You do not get to skip ground school because you are clever. Second, the main parachute — packed according to a procedure, inspected according to a schedule. Third, the reserve parachute — packed by a certified rigger (regulated by the </span><span><a href="https://www.caa.co.uk/" target="_blank">UK Civil Aviation Authority</a></span><span> in Britain and the </span><span><a href="https://www.faa.gov/mechanics/become/sport_parachute_rigger" target="_blank">Federal Aviation Administration</a></span><span> in the US), not by you, because the person most likely to make an error with your reserve is you. And fourth, the </span><span><a href="https://en.wikipedia.org/wiki/Automatic_activation_device" target="_blank">Automatic Activation Device</a></span><span>, or AAD — a small computer strapped to the rig that measures altitude and velocity and deploys the reserve if the jumper has not done so by a predetermined altitude.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The AAD does not care about your experience. It does not care about your confidence. It fires when the numbers say fire. It is the final backstop in a system designed around the assumption that every human layer above it might fail.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>This is not cowardice. This is the opposite of cowardice. This is what courage looks like when it has been industrialised.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Then Like Now</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Here is the part that should make every CTO reading this put down their coffee.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The cloud migration industry in 2026 has no equivalent safety stack. Most organisations have a main parachute — the migration plan itself, the architecture diagrams, the Jira tickets. Some have a reserve — a rollback strategy, tested occasionally, understood by a handful of engineers. Almost none have an AAD. Almost none have a systemic, automated, threshold-triggered mechanism that fires independently of human judgement when the numbers say the migration is going wrong.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>And the training? The ground school? The term "cargo cult" came from anthropology, but it entered engineering's vocabulary through </span><span><a href="https://calteches.library.caltech.edu/51/2/CargoCult.htm" target="_blank">Richard Feynman's 1974 Caltech commencement address</a></span><span>, in which he described "Cargo Cult Science" — research that has the form of science but is missing something essential. The islanders in Melanesia built bamboo control towers and carved coconut-shell headphones for the operator. They had replicated every visible artefact of an airfield. No planes landed. The ritual was perfect. The understanding was absent.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I have watched organisations send their infrastructure teams on a two-day cloud certification course and declare them ready for a migration that will take eighteen months and cost millions. That is not training. That is coconut headphones — the cargo-cult imitation of preparation, where the ritual replaces the substance. The planes do not come. They never do.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The </span><span><a href="https://www.uspa.org/a-milestone-in-safetythe-2024-fatality-summary" target="_blank">United States Parachute Association's (USPA) 2024 fatality summary</a></span><span> shows that the skydiving fatality rate has improved by a factor of 48 since 1961. The picture is similar in Britain, where </span><span><a href="https://britishskydiving.org/" target="_blank">British Skydiving</a></span><span> (the sport's national governing body) reports comparable safety gains driven by the same systemic improvements. The US data is the most comprehensive: from 11.1 deaths per 100,000 jumps to 0.23. The </span><span><a href="https://www.cypres.aero/the-story-of-cypres-2026/" target="_blank">CYPRES AAD alone has saved more than 5,400 lives</a></span><span> since its introduction. Five thousand four hundred human beings who would be dead without an automated safety system that does not ask permission, does not wait for consensus, and does not care about your feelings.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If skydiving had a 62% failure rate, no sane person would jump. The </span><span><a href="https://www.caa.co.uk/" target="_blank">UK Civil Aviation Authority</a></span><span> would ground every drop zone in Britain. The </span><span><a href="https://www.faa.gov/" target="_blank">Federal Aviation Administration</a></span><span> would do the same in the United States. And yet in enterprise technology, a 62% failure rate is treated as normal. As the cost of doing business.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>That is not risk management. That is negligence dressed in a suit.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFINjNLWDZuSXB2MEEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzE1MDBfMjIzMi9CNEVaejg3Ul9JSm9BVS0vMC8xNzczNzY5OTI1MzgyP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1ibEVGM3NKQzNPblNyUTR2VWVfNjJjc3F0WnZLUm40bkJ2THI4NUJvZzU4">
          <figcaption>
            <span>A real photo of my feet (pre AI) some 6,000ft in the air above the English countryside</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>The Counterargument I Owe You</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>"OK," you might say, "but skydiving is a bounded physical system. Gravity is constant. Terminal velocity is known. The failure modes are finite. Cloud migration is an unbounded sociotechnical system where the failure modes mutate, where the vendors change the pricing model mid-flight, where a junior engineer can misconfigure an IAM policy and expose the entire customer database."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>You are right. Cloud is more complex than skydiving. The variables are less predictable. The blast radius is wider. And that is precisely the argument for checklists, not against them.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://en.wikipedia.org/wiki/Seneca_the_Younger" target="_blank">Seneca</a></span><span> taught his students to rehearse catastrophe — </span><span><em>premeditatio malorum</em></span><span> — not because the rehearsal would prevent the catastrophe, but because it would prevent the paralysis. When the system is infinite, the checklist is the finite thing you control. The skydiver cannot control the weather, the turbulence, or the moment of panic. But the skydiver can control the pre-jump check, the altimeter reading, the pull altitude. The checklist does not pretend to eliminate uncertainty. It creates a floor beneath which the uncertainty cannot drag you.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Cloud migration needs that floor. Right now, most organisations are free-falling without one.</span>
        </p>
    </div>
  
                    
    

    
  
                  

    <div>
        <h3><span>The Cloud Migration Safety Stack</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Every safe cloud migration depends on four layers, stacked in order. Remove one and the layers above it collapse.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Layer 1: Understanding (Ground School)</span><span> First-principles knowledge of distributed systems, failure modes, and organisational dynamics. The foundation everything rests on. Not a certification course. Not a vendor webinar. A structured, multi-month programme in which the team migrates a non-critical workload end-to-end before touching production. You learn to pack the parachute before you jump out of the aircraft. If your team cannot deploy, monitor, roll back, and explain a migrated service in a test environment, they are not ready for production. Full stop.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>Diagnostic: If you removed all your tooling tomorrow, could your team explain what the tools were doing and why?</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>The islanders with their coconut headphones were missing Layer 1 entirely. They had perfect practices — the runway, the fires, the hand signals — built on no understanding whatsoever. This is the cargo cult failure mode, and it is precisely what Feynman warned against: the form without the foundation.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Layer 2: Culture &amp; Incentives (The Human Environment)</span><span> Psychological safety, blameless postmortems, learning loops. The human environment that lets good architecture survive contact with reality. If your engineers are afraid to admit a migration is failing because they will be blamed, your reserve parachute might as well not exist — nobody will pull the handle.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>Diagnostic: When did your team last work through a production incident with no blame, no punishment, and a published timeline?</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>Layer 3: Architecture (The Main Chute)</span><span> The migration plan, the service boundaries, the dependency maps. This is what most organisations think migration is. It is necessary and insufficient. The main chute works most of the time. Most of the time is not a safety standard. Critically, this layer must be inspected by someone who did not design it — in skydiving, the reserve is packed by a certified rigger under </span><span><a href="https://www.caa.co.uk/" target="_blank">CAA</a></span><span> or </span><span><a href="https://www.faa.gov/mechanics/become/sport_parachute_rigger" target="_blank">FAA</a></span><span> regulation, not by you. In cloud migration, that is an independent architecture review function whose incentive is not to prove the migration will work but to prove it might not.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>Diagnostic: Can you draw your system's failure domains on a whiteboard right now — and has someone outside your team stress-tested them?</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>Layer 4: Automated Kill Switches (The AAD)</span><span> This is the layer almost nobody builds. Automated, threshold-triggered mechanisms that fire without human approval when predefined conditions are met:</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>
      </span></p><ul>
        
    <li><span>Cost ceiling</span><span>: Monthly spend exceeds 130% of forecast? The system freezes new deployments and alerts the CFO. Not the engineering team. The CFO.</span></li>
    <li><span>Error-rate trigger</span><span>: P99 latency exceeds SLA for more than fifteen minutes? Traffic automatically routes back to the on-premises system.</span></li>
    <li><span>Security tripwire</span><span>: Publicly exposed storage bucket detected? Automated lockdown. No human in the loop.</span></li>

      </ul>
  
        <p></p>
    </div>
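
    <div>
        <p>
          <span>The three tripwires above can be condensed into a single evaluation function. What follows is a minimal illustrative sketch, not a real tool: the metric fields and action names are hypothetical, and the thresholds simply mirror the bullets above. The point is the shape — a pure function of measured numbers, with no human judgement on the path between reading and firing.</span>
        </p>
    </div>

```python
# Minimal sketch of a threshold-triggered kill switch. All field and
# action names are hypothetical; thresholds mirror the article's examples.
from dataclasses import dataclass


@dataclass
class MigrationMetrics:
    monthly_spend: float     # actual spend so far this month
    forecast_spend: float    # budgeted spend for the month
    p99_latency_ms: float    # current P99 latency
    sla_latency_ms: float    # contractual latency ceiling
    minutes_over_sla: int    # how long latency has breached the SLA
    public_buckets: int      # count of publicly exposed storage buckets


def kill_switch_actions(m: MigrationMetrics) -> list[str]:
    """Return the automated actions that fire, with no human approval."""
    actions = []
    # Cost ceiling: spend exceeds 130% of forecast.
    if m.monthly_spend > 1.30 * m.forecast_spend:
        actions.append("freeze_deployments_and_alert_cfo")
    # Error-rate trigger: P99 over SLA for more than fifteen minutes.
    if m.p99_latency_ms > m.sla_latency_ms and m.minutes_over_sla > 15:
        actions.append("route_traffic_to_on_prem")
    # Security tripwire: any publicly exposed bucket.
    if m.public_buckets > 0:
        actions.append("lock_down_storage")
    return actions
```

    <div>
        <p>
          <span>Note that the function only decides; the wiring that executes each action is where the real engineering lives. But the decision logic being this small is the point — an AAD is deliberately too simple to argue with.</span>
        </p>
    </div>
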
  
                  

    <div>
        <p>
          <span>The AAD exists because the moment you most need to pull the reserve is the moment you are least capable of deciding to pull it. In skydiving, that moment is unconsciousness. In cloud migration, it is the sunk-cost fallacy — the organisational inability to admit that the migration is failing when you have already spent two million pounds on it.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>Diagnostic: If your migration went catastrophically wrong at 3 a.m. on a Saturday, would any automated system catch it before a human noticed?</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>A note on ownership</span><span>: In most organisations, these layers belong to different people. Platform teams own Architecture. Leadership owns Culture. And Understanding — critically — is everyone's responsibility, which in practice means it is often nobody's. If you cannot name the owner of each layer in your organisation, you have found your first vulnerability.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Two Kinds of Repatriation</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Not all retreats are failures. This is the distinction the industry refuses to make.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://world.hey.com/dhh/we-have-left-the-cloud-251760fb" target="_blank">37signals moved off the cloud</a></span><span> in 2023. That was strategic repatriation — the reserve deployed exactly as designed. They ran the numbers. The economics no longer justified cloud hosting for their specific workload profile. They had the on-premises capability to return to. They planned it, executed it, and saved millions. That is a skydiver deploying the reserve at the correct altitude, calmly, with full situational awareness. The reserve worked because it was packed.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://www.puppet.com/blog/cloud-repatriation" target="_blank">GEICO's cloud migration troubles</a></span><span> were the opposite. That was panicked repatriation — fumbling for the ripcord at five hundred feet because nobody had checked the gear. No clear rollback plan. No tested repatriation path. No AAD firing to force the decision before it was too late. They did not choose to come back. They were forced back, and the cost — financial, operational, reputational — was catastrophic.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The difference is not whether you come back. The difference is whether you packed the reserve before you jumped.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>The Only Migration</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>The B-17 pilots were not less brave for using a checklist. They were more effective. The checklist freed them to focus their courage on the parts of the mission that actually required courage — the flak, the fighters, the weather, the decisions that no system could make for them. The checklist handled the parts that courage could not.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I have watched organisations spend millions on migration programmes that had no kill switch, no tested rollback, no automated threshold, no independent reserve inspection. I have watched CTOs bet their careers on plans that had less systemic safety than a first-time skydiver's rig. And I have watched them crash, not because they lacked talent or ambition, but because they lacked a checklist.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The Boeing Model 299 taught us this in 1935. The skydiving community teaches it every single day. The lesson is there for anyone willing to learn it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The only migration that matters is not from on-prem to cloud. It is from courage to systems.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHMkk0ck5UU0ZWOUEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzE1MDBfMjIzMi9CNEVaejg4NS5pSjRBVS0vMC8xNzczNzcwMzU1MjIzP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1xUmpYQlBlQlFFUk5oQVM2eUYzYnNxQjJFRjU5RVVscjdCUHBOYnZIdXlF">
          <figcaption>
            <span>Bonus: skydivers can be nerds too!</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>Bonus photo ^ is </span><span>
      <a href="https://uk.linkedin.com/in/phil-hartree-aa022218" target="_blank">
        Phil Hartree
      </a>
</span><span> ~2011 on a weather hold teaching me landing patterns and also subnet masks and CIDR blocks 🪂.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>Why Pink</title>
      <link>https://blog.cns.me/posts/why-pink-chris-nesbitt-smith-nuxfe/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/why-pink-chris-nesbitt-smith-nuxfe/</guid>
      <pubDate>Fri, 17 Apr 2026 05:15:03 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFUlFvU1dCWnExbVEvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWjJOUmVyN0prQUktLzAvMTc3NjE5MTY2Nzg1Mz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9eWlzNFpQZktLajNOcXhyQ0xFaldMYXhUdHRva2F4Yk5jTEpNYnRucG9Vcw" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>I counted them once. Not because I'd set out to count them, but because I was standing at the back of a hotel function room in Reading, holding a coffee that was too hot to drink and too sad to throw away, and I had to do something with my eyes.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Forty-eight men. White. Middle-aged. Blue shirt, or black shirt, or the shirt that is trying to be blue but has given up halfway and become grey. A few navy jumpers. One very brave beige. That was it. That was the room.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>And then there was me, in a pink shirt.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFId3RYVE5MVlRQcVEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjJOUEtKakhjQVktLzAvMTc3NjE5MTA1NTQ4Mj9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9azJaazZaQkRjclA4TGZTelNfNXVxSmNyN1JjVHZaVzBKWTJIU1VlSTVNNA">
          <figcaption>
            <span>AI Generated: A lone pink shirt in a sea of blue ones at a tech conference</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>People ask me, with genuine sincerity, why I wear pink. They ask it at conferences. They ask it in meetings. A woman called Janet once asked me in the queue at a Pret. I tend to mumble something and change the subject, because the real answer is long, and it involves cyanobacteria, and nobody wants cyanobacteria before lunch.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But I've been asked enough times now that I think I owe the world a proper reply. So here it is. The canonical one. Please link to it, and leave me be.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Reason one. I would like to be remembered.</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>I am a white middle-aged bearded man in technology. There are, at a conservative estimate, four hundred million of us, most of us are called Chris, and most of us have the same beard. If I stand in a room with forty-seven other blokes in blue shirts, I am, statistically speaking, interchangeable. The blue shirt is not a choice. It is a uniform. Tech has decided, quite without asking anyone, that </span><span><a href="https://www.desantisbreindel.com/thinking/b2b-tech-brand-colors/" target="_blank">more than half the top-100 tech brands should be medium-to-dark blue or black</a></span><span>, and a lot of the men in the room have taken the hint and dressed to match the logos.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I looked up the proper term for what the shirt is doing. It turns out there isn't one that quite fits. Economists would call it a Schelling point, which is a nice phrase for a coordination signal. A thing that is cheap on day one and valuable because of a decade of consistency. You're not the pink one because pink is expensive. You're the pink one because you were the pink one last year, and the year before that, and the year before that, and now when two people in a foyer are trying to find each other they say "look for Chris, he'll be the pink one" and it saves everybody a phone call.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I should mention, while we're here, that it isn't just the shirt. I have bright pink trousers as well. I have been known to pair the two. On a bold day, in full daylight, I have gone full pink from collar to ankle, which is a look that commits. If the shirt says "you'll remember me", the trousers say "you'll remember me, and you'll tell your spouse about it when you get home."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The socks are also pink, and this is the bit I am most proud of, because the socks are pink for a logistical reason rather than an aesthetic one. If every sock you own is the same colour, you never have to pair them. You just grab two. And when one of them develops a hole, you throw out one sock, not two. I worked this out some years ago and have not been pairing socks since. I consider it one of the great small victories of my adult life.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The complication, and there is always a complication, is that I wear a size thirteen. Pink socks in size thirteen, in a cut that does not cut off the circulation, do not exist at scale. So I bought several dozen pairs of good-quality white ones and dyed them myself. The same applied, later, to the jeans. There is, it turns out, not a mass market for pink jeans in a gentleman's fit, and so the gentleman has to take a pair of white Levi's and a bucket of dye and work it out on a Saturday afternoon. I will not pretend this was planned. I will pretend, to my wife, that it was.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFQ3dJbllfMUp4OUEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjJOUHVJR0l3QWMtLzAvMTc3NjE5MTIwMjg0NT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9WTBkb0ZrT0tON1ZBSV81OWJrZDJud2ltMjEzWVZNWE1SNjk1aUV2MFc5aw">
          <figcaption>
            <span>AI Generated: An open drawer stuffed with dozens of identical loose pink socks</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>If you meet me once, in a pink shirt, you will remember me. If you meet me twice, in a pink shirt, you will think, right, that's the pink one. And that, in a career built largely on people remembering to email me back, is worth the price of a slightly startled conversation with a woman called Janet in a Pret.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I should say, while I'm here, that this only works because the room is mostly blue. If the room ever goes pink I'll go beige, and I'll be sorry to see it.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Reason two. Pink is the oldest colour anyone has dug out of a rock.</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>This is the one nobody believes, which is why I enjoy it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>In 2018, </span><span><a href="https://www.sciencedaily.com/releases/2018/07/180709152717.htm" target="_blank">a team led by Dr Nur Gueneli at the Australian National University extracted 1.1-billion-year-old pigment molecules from marine black shales in the Taoudeni Basin in Mauritania</a></span><span>. A billion, with a B. These molecules are the </span><span><a href="https://www.eurekalert.org/news-releases/698483" target="_blank">molecular fossils of chlorophyll, produced by tiny cyanobacteria that dominated the base of the food chain before animals had even been invented</a></span><span>. Concentrated, they run blood-red to purple. Diluted, </span><span><a href="https://edition.cnn.com/2018/07/10/health/oldest-color-pink-trnd/index.html" target="_blank">they fluoresce a bright, joyful pink</a></span><span>.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I find this quite moving, to be fair.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Half a billion years before there were animals, half a billion years before there was anything that could be said to look at anything, there was pink. Pink was doing its work in the oceans while the rest of the colour wheel was still a rumour. If pink is good enough to predate animals, it's good enough for a conference lanyard in Reading.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHR3pQd3dmbU90WVEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjJOUUNRV0drQVktLzAvMTc3NjE5MTI4NTIyMD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9dUlCWVVGWjRIUzhNV3NaUWxvX3V6dUViZTd2MXJqX2ZIWlBZbkF6cVloSQ">
          <figcaption>
            <span>AI Generated: A vial of 1.1-billion-year-old pink pigment beside a mug of tea on a lab bench</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                    
    

    
  
                  

    <div>
          <h2>
            <span>Reason three. Pink isn't actually there.</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>I'll explain what I mean. The rainbow goes red, orange, yellow, green, blue, violet. You've seen one. You know the order. What you may not have clocked, because nobody sits you down and tells you, is that </span><span><a href="https://en.wikipedia.org/wiki/Magenta" target="_blank">there is no wavelength of light that produces magenta or pink</a></span><span>. There's a gap. Red is at one end. Violet is at the other. They don't meet. Nothing lives in the space between them.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But your brain doesn't like the gap. So when your </span><span><a href="https://www.ncbi.nlm.nih.gov/books/NBK11059/" target="_blank">long-wavelength cones (the red ones) and your short-wavelength cones (the blue/violet ones) fire at the same time, with nothing coming from the green ones in the middle</a></span><span>, your visual system just invents a colour. It says, right, I'll bridge those two ends by hallucinating something that isn't in the spectrum. And the something it invents is pink.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Pink is a colour the brain paints into a gap in the rainbow because the rainbow is embarrassing.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I walk around wearing a hue that doesn't technically exist. It is a hallucination everyone agrees to have at the same time. I find that rather lovely. And I find it even lovelier that nobody at the conference knows. They're just thinking, there's Chris in that shirt again.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Reason four. I have been lying to my wife for many years.</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>This is the bit I've been putting off.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The truth is, when we first met, she was the one who liked pink. I was the one going round in navy, like everyone else. At some point, and I cannot tell you exactly when, I started wearing a pink shirt. Then another one. Then, gradually, over about a decade, I somehow convinced her, with the quiet complicity of several friends and at least two members of her own family, that I had always been the pink one. That pink was, in fact, my thing. That she'd got into it because of me.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I would like to clear up, right now, in writing, that this is not true.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I want to be careful here. Real gaslighting is a pattern of behaviour, a form of domestic abuse, and I am not for a single second making light of what it does to people. What I am describing is a long-running affectionate family joke, conducted in full daylight, about a shirt. I would not want anyone to confuse the two.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Anyway. She knows. She's always known. She's letting me have it because she's a kind person and because, at this stage, the mythology has become load-bearing. If I ever admitted it out loud, a load of dinner parties would collapse.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>So: I am confessing in print. She liked pink first. I am a fraud. The pink is hers. I am merely wearing it, badly, in her honour.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I am also, if I am being completely honest, relying quite heavily on the fact that she doesn't read my blog. If you know her, please don't send her a link. If you are her, hello, I was going to tell you over dinner, probably, at some point, once I'd worked out the running order.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHSm0yV0Q5aXVXbWcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjJOUVREZ0lrQVktLzAvMTc3NjE5MTM1Mzk4OD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9Ul9wcFo2TjB0eEdHUEFVYVV6cW04eG1DZVFIWVh3Tm9tX0tXcU13RmJwTQ">
          <figcaption>
            <span>AI Generated: A diagram of the visible spectrum with 'here be magenta' labelled in the gap the brain paints in</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>Reason five. Pink was for boys first.</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>This is the one that gets people in the pub.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Up until about the 1950s, pink was considered the proper colour for a boy. It was a strong colour, a little version of red, suitable for a small lad. Blue was for girls. Something to do with the Virgin Mary, apparently. </span><span><a href="https://www.smithsonianmag.com/history/unraveling-the-colorful-history-of-why-girls-wear-pink-and-boys-wear-blue-1370097/" target="_blank">Professor Jo Paoletti, who wrote the book on this, puts it very plainly</a></span><span>: pink-blue gender coding was known in the late 1860s but didn't become dominant until the 1950s, and wasn't universal until a generation after that. In June 1918, the Ladies' Home Journal told American mothers, in print, that "pink being a more decided and stronger colour, is more suitable for the boy, while blue, which is more delicate and dainty, is prettier for the girl." That is the actual sentence. Someone wrote it. Someone agreed with it. Someone's granny cut it out and kept it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Which means the idea that pink is girly is about seventy years old. It is younger than my mother. It is younger than the Queen Mother was when she opened most of the bridges in the north-east. It is, in the grand sweep of human history, a bit of a rumour that got out of hand.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>So when a man wears pink, he isn't being subversive. He's being restorative. He's putting the shirt back where it belongs.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Most things people treat as permanent are about seventy years old and made up.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>One last thing, before you go. Please do not all start wearing pink. I am saying this with love. Pink is my thing. I have put the hours in. I have a decade of photographic evidence and a drawer full of trousers that match nothing else in the house. If all of you turn up at the next conference in pink, the whole system breaks, and we are back to forty-nine interchangeable men, just pinker. Get your own colour. Teal is available. Mustard is having a moment. Someone needs to be brave about green.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Right. That's the answer. Please stop asking. I've got a conference in Reading next week and I need to iron the shirt. And the trousers.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>Too Busy to Ride the Bike</title>
      <link>https://blog.cns.me/posts/too-busy-ride-bike-chris-nesbitt-smith-8zddc/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/too-busy-ride-bike-chris-nesbitt-smith-8zddc/</guid>
      <pubDate>Wed, 15 Apr 2026 05:15:03 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHQjBmaFloWGRKN1EvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWnouTVhuYUd3QU0tLzAvMTc3Mzc5MTE4NTU0Nz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9MlY0YW9rQjFaWURRTGRIVkstWmRWTmN3dVo3cEMxa3dkYzdtYk9jMDVYbw" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>My grandmother told a story about a woman in her village. The woman was rushing to church, practically running, pushing her bicycle along beside her. My grandmother stopped her and asked, "Why don't you get on your bike?" The woman said, "I'm in too much of a hurry to get on the bike."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I do not know whether the story was true, or a parable, or whether my grandmother was talking about herself. She is gone now, so I cannot ask. What I do know is that I have quoted that line in more meetings, more architecture reviews, and more standups than I can count, and every time I say it, at least one person in the room does the recognition nod — the slow, slightly ashamed nod of someone who knows exactly what I am describing because they did it this morning.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>Festina Lente</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>The woman with the bicycle would have understood Augustus Caesar, though I doubt they moved in the same circles.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Around 19 BCE, Augustus adopted a personal motto: </span><span><a href="https://en.wikipedia.org/wiki/Festina_lente" target="_blank">festina lente</a></span><span> — make haste slowly. The phrase became an emblem. Literally. The Aldine Press of Venice took the dolphin-and-anchor as its printer's mark — the dolphin for speed, the anchor for deliberation. Erasmus included it in his </span><span><a href="https://en.wikipedia.org/wiki/Adagia" target="_blank">Adagia</a></span><span> in 1508. Two thousand years of intellectual inheritance, from a Roman emperor to a Venetian printing house to a Dutch humanist, and the lesson is always the same: the fastest way forward is not always forward. Every culture arrived at this independently — </span><span><a href="https://www.biblegateway.com/passage/?search=Ecclesiastes+10%3A10&amp;version=ESV" target="_blank">Ecclesiastes</a></span><span>, the Muromachi poets, the Chinese woodcutters, the British since the fourteenth century. Technologies change. Human nature does not.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Then there is the one everybody attributes to Abraham Lincoln: "Give me six hours to chop down a tree and I will spend the first four sharpening the axe." Lincoln </span><span><a href="https://quoteinvestigator.com/2014/03/29/sharp-axe/" target="_blank">never said it</a></span><span>. The earliest known source is from 1945, eighty years after his death. The misattribution tells you everything. We do not just ignore the principle — we fabricate historical endorsement for it, which is exactly what the woman pushing her bicycle was doing: performing urgency as proof of commitment. Engineers have their own version: "Weeks of coding can save you hours of planning." The joke works because the pattern is universal and universally violated.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>Then Like Now</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>I watch this in engineering teams every week. Someone sitting at a machine that can generate, test, and refactor code — a machine that is, in the most literal sense, a bicycle for the mind — and they are pushing it along beside them. Writing the same boilerplate they wrote last year. Manually testing what could be automated in an afternoon. Copying and pasting between terminal windows because setting up the script would take twenty minutes and they do not have twenty minutes because they are too busy copying and pasting between terminal windows.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The research calls this </span><span><a href="https://thedecisionlab.com/biases/action-bias" target="_blank">action bias</a></span><span>. Bar-Eli and colleagues studied penalty kicks and found that goalkeepers who stayed in the centre saved more often — yet almost all of them dive left or right. Doing something, anything, feels better than doing the right thing if the right thing looks like standing still. </span><span><a href="https://academic.oup.com/jcr/article/44/1/118/2736404" target="_blank">Bellezza showed in 2017</a></span><span> that in Anglo-Saxon corporate culture, being busy is a status signal. The woman pushing her bicycle is not failing. She is performing.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>It is both criminal and deeply human.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>And it is not only individual psychology. In many organisations, the incentives are structural. People push the bicycle because the system rewards visible motion — sprint velocity, tickets closed, lines committed. A contractor who stops to build the automation looks idle; a contractor who types furiously looks billable. The dashboard measures the pedalling, not the arriving. Until organisations learn to measure outcomes over output, the bicycle will keep being pushed uphill by people who know perfectly well how to ride it.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFTlJYX2V0dm9ndGcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWnouTXp2ZklFQVktLzAvMTc3Mzc5MTI5OTI5Nz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9ZUlTejkzLWJ3WUYzVUdmYXFzNUthSFQ4T2JIM3Jqd3F1YVl2X0Vqb2dqQQ">
          <figcaption>
            <span>AI Generated: a bicycle leaning against a stone wall at dawn, no rider, morning mist over a churchyard - the rider chose to walk</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>The Distinction</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Now, you might reasonably point out that this is all very well — get on the bike, sharpen the axe, automate the toil. And most of the time, that instinct is right.</span>
        </p>
    </div>
  
                    
    

    
  
                  

    <div>
        <blockquote><span>But the story is more complicated than "just ride."</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>Because there is a difference — a sharp one — between the woman who pushes the bicycle out of irrational urgency and the person who chooses to walk because the walk itself serves a purpose. The woman in my grandmother's story </span><span>could</span><span> ride. She had the skill, the bicycle, the road. She pushed it anyway, out of a compulsion that mistook motion for progress. That is one mode. The other is the engineer who builds the thing by hand not because they are afraid of the tool but because the problem is not yet stable enough to automate, or because the manual repetition is teaching them something they have not yet articulated.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The </span><span><a href="https://neuroscience.stanford.edu/news/why-do-our-minds-wander-what-brains-default-mode-tells-us-about-our-humanity" target="_blank">Default Mode Network</a></span><span> — the brain's resting-state network — is where creativity lives. A Stanford study found that </span><span><a href="https://www.apa.org/pubs/journals/releases/xlm-a0036577.pdf" target="_blank">walking improved creative output for 81% of participants</a></span><span> compared to sitting. Walking, not cycling. The bicycle would have got you there faster. The walk would have got you there with an idea.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>So the diagnostic is not "are you riding or pushing?" It is finer than that.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If the task is repetitive and the automation would teach you nothing new, get on the bike — you are wasting daylight.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If the task is not yet stable enough to automate, if the requirements are still shifting under your feet, then wait — building the machine now means rebuilding it next week.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If the slow execution is building understanding, if pushing the bike is how you learn the gradient of the hill, then push deliberately — that is craft, not cowardice.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But if you are pushing out of habit, out of fear of the twenty minutes it takes to learn the tool, out of a need to look busy for a system that rewards visible effort over invisible thought — that is the problem the grandmother saw. That is inertia dressed in urgency.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Here is a test I have found useful: if someone automated this task for you tonight, would you feel relieved or robbed? Relieved means you are in the fourth mode and should have got on the bike weeks ago. Robbed means you are in the third, and the walk is doing work your calendar cannot see.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>Still Learning</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Some things never change. The woman in my grandmother's village is every engineer who will not write the script, every manager who will not block out thinking time, every organisation that holds the meeting about the meeting instead of cancelling both. That is the crime.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But the wisdom — the quiet, counterintuitive, deeply human wisdom — is knowing the difference between refusing to ride and choosing to walk.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>My grandmother, I suspect, knew the difference. I am still learning it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own... and my grandmother's.)</span>
        </p>
    </div>
  
                  

  
              ]]></description>
    </item>
    
    <item>
      <title>The Thing I Didn&#39;t Know About the Thing I Thought I Knew</title>
      <link>https://blog.cns.me/posts/thing-i-didnt-know-thought-knew-chris-nesbitt-smith-quxee/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/thing-i-didnt-know-thought-knew-chris-nesbitt-smith-quxee/</guid>
      <pubDate>Mon, 13 Apr 2026 06:00:05 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFTkM1TXdBSEJEZ1EvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWno5QVZmY0pVQUktLzAvMTc3Mzc3MTI1NjcxMD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9ZTNkTWNoUV95ZEFsOGlyNFZWRkVqaTBkRVljdUZuU2tVM3JnNEh4emhOYw" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
          <h2>
            <span>The Thing I Didn't Know About the Thing I Thought I Knew</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>I was going to write this article because I thought I knew all there was to know about the Dunning-Kruger effect. I sat down, cracked my knuckles, and prepared to hold forth on the subject of people who don't know what they don't know. Without a shred of irony. Without a flicker of self-awareness that I was about to become the very thing I was writing about.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Let me tell you what happened next.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The first thing I discovered is that the graph is fake. You know the one. Mount Stupid, the Valley of Despair, the Slope of Enlightenment -- that satisfying curve that gets wheeled out in every conference talk, every LinkedIn post, every smug conversation about why other people are wrong about things. That graph does not appear in the </span><span><a href="https://pubmed.ncbi.nlm.nih.gov/10626367/" target="_blank">original 1999 Kruger and Dunning paper</a></span><span>. Not once. The paper contains quartile bar charts comparing perceived ability to actual test scores among Cornell undergraduates sitting logic and grammar exams. No mountain. No valley. No slope. The curve that half the internet attributes to two Cornell psychologists was drawn by Zach Weinersmith in a </span><span><a href="https://www.smbc-comics.com/comic/2011-12-28" target="_blank">webcomic in 2011</a></span><span>, probably influenced by the Gartner Hype Cycle, which itself is a marketing framework dressed up as science.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I did not know this. I had cited that curve. I had used it in presentations. I had nodded along knowingly when others used it, as though I were intimately familiar with the underlying research.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I was not.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHalI0Wm53WmV5UGcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaejlCb2VnSThBUS0vMC8xNzczNzcxNTkxNjU5P2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1FN2V2R0duN0ZRSy00ODhFTFJkazdTYlV0djZZVDkzTnRsVVBqVnNOSmhJ">
          <figcaption>
            <span>A Dunning-Kruger curve being erased from a whiteboard, revealing nothing beneath</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>The second discovery was worse. In 2022, a researcher called Blair Fix published an analysis showing that </span><span><a href="https://economicsfromthetopdown.com/2022/04/08/the-dunning-kruger-effect-is-autocorrelation/" target="_blank">the Dunning-Kruger effect can be reproduced using entirely random data</a></span><span>. The curve is an artefact of autocorrelation. When you plot people's self-assessment error against their actual performance, you are plotting a variable against a component of itself. The maths produces the curve whether or not the psychology exists. Gignac and Zajenkowski, in a </span><span><a href="https://www.sciencedirect.com/science/article/abs/pii/S0160289620300271" target="_blank">2020 study</a></span><span>, used proper statistical methods and found "much less evidence" for the effect than the original paper claimed. </span><span><a href="https://www.mcgill.ca/oss/article/critical-thinking/dunning-kruger-effect-probably-not-real" target="_blank">McGill University's Office for Science and Society</a></span><span> now states flatly that the Dunning-Kruger effect is "probably not real."</span>
        </p>
    </div>
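
    <div>
        <p>
          <span>Fix's argument is easy to check for yourself. The sketch below is my own illustration, not code from his analysis: it draws "actual" skill and "perceived" skill as two entirely independent random percentiles — no psychology, pure noise — then groups people by actual-score quartile the way the 1999 paper's bar charts do.</span>
        </p>
    </div>

```python
import random

random.seed(0)
n = 10_000

# Skill and self-assessment as INDEPENDENT random percentiles:
# by construction there is no link between competence and confidence.
actual = [random.uniform(0, 100) for _ in range(n)]
perceived = [random.uniform(0, 100) for _ in range(n)]

# Group by actual-score quartile and compare mean perceived ability
# to mean actual ability, as the 1999 paper's bar charts do.
results = []
for q in range(4):
    members = [i for i in range(n) if q * 25 <= actual[i] < (q + 1) * 25]
    mean_actual = sum(actual[i] for i in members) / len(members)
    mean_perceived = sum(perceived[i] for i in members) / len(members)
    results.append((mean_actual, mean_perceived))
    print(f"Q{q + 1}: actual ~{mean_actual:.0f}, perceived ~{mean_perceived:.0f}")
```

    <div>
        <p>
          <span>Because perceived is independent of actual, every quartile's mean self-assessment sits near 50: the bottom quartile appears to overestimate and the top quartile to underestimate. That is the Dunning-Kruger bar chart, generated by arithmetic alone.</span>
        </p>
    </div>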
  
                  

    <div>
        <p>
          <span>The original 1999 paper has roughly 7,900 citations. The debunking papers have about 88 between them.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Here is what that means: the most famous psychological concept about people who don't understand things they claim to understand may itself be a thing that people don't understand while claiming to understand it. The </span><span><a href="https://www.bps.org.uk/psychologist/persistent-irony-dunning-kruger-effect" target="_blank">persistent irony</a></span><span> is not incidental. It is structural.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Every one of us has been complicit in it, and not one of us bothered to check.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>What the Paper Actually Tested</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>The third thing I learnt is that virtually everyone misuses the concept. The original study tested Cornell undergraduates on logical reasoning and grammar. Not the general public. Not "dumb people versus smart people." Not experts versus novices in any real-world domain. The dramatic effect that Kruger and Dunning found was in </span><span>relative self-placement</span><span> -- how students ranked themselves compared to their peers. When tested with direct methods, about 80% of the bottom-quartile students could accurately assess their own absolute competence. They knew roughly how well they had done. They just couldn't estimate where they sat relative to everyone else.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>This matters because the way Dunning-Kruger gets deployed in the wild -- "that person is too stupid to know they're stupid" -- is not what the paper found. </span><span><a href="https://www.scientificamerican.com/article/the-dunning-kruger-effect-isnt-what-you-think-it-is/" target="_blank">Scientific American</a></span><span> pointed this out years ago. Nobody listened. The meme was too satisfying. The feeling of superiority it grants -- "I, unlike those people, can see my own limitations" -- is too intoxicating to surrender to mere evidence.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>We have all cited that curve. We have all nodded along. And we have all, at some point, used it to feel cleverer than someone else in the room.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Which brings me to skydiving.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>153 Jumps</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>I have done 153 solo skydives. Never a tandem. I started jumping out of aeroplanes because I wanted to, not because I was strapped to someone who knew what they were doing. To a layperson, 153 is a lot. At a dinner party it sounds reckless, impressive, slightly unhinged. To anyone in the sport, 153 makes me a relative noob. The serious skydivers -- the ones doing formation work, wingsuit proximity flying, canopy piloting -- have thousands of jumps. I am, in the language of drop zones, a "low-timer."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I have always used this to self-deprecate:</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>"I'm best case aspiring to be either climbing towards Mount Stupid, or on the decline back to the Valley of Despair." </span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>It is a line I have delivered many times. It gets a laugh. It makes me sound humble and self-aware. It signals that I understand the Dunning-Kruger effect and have inoculated myself against it through the hard-won wisdom of knowing my place on the curve.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Except the curve is fake. And the self-deprecation is doing something I did not intend it to do.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=272593" target="_blank">Feltovich, Harbaugh and To</a></span><span> formalised counter-signalling theory in 2002. The insight is devastatingly simple: high-ability agents can afford </span><span>not</span><span> to signal their competence. When a genuinely skilled person says "I'm not that good," it does not read as humility. It reads as proof that they are good enough not to need to say so. The self-deprecation is a power move. It is the intellectual equivalent of a billionaire wearing a hoodie. The modesty is the flex.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>When I say "153 jumps, that's nothing in the sport," what you hear is: this person has done 153 solo skydives and is so comfortable with that fact that he can dismiss it. The self-deprecation does not reduce my status. It amplifies it. </span><span><a href="https://pubmed.ncbi.nlm.nih.gov/28922000/" target="_blank">Harvard researchers Sezer, Gino and Norton</a></span><span> showed in 2018 that humblebragging backfires worse than straightforward bragging. People see through it. They like you less for it than if you had simply said "I've done 153 skydives and I loved every one of them."</span>
        </p>
    </div>
  
                    
    

    
  
                  

    <div>
        <p>
          <span>I did not know this either. The self-deprecation I thought was my most honest move was, it turns out, my most dishonest one.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHVzRUbzVtMW1ITlEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzE1MDBfMjIzMi9CNEVaejlESjFfRzhBVS0vMC8xNzczNzcxOTkwNzUwP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD05Zk96U3ViTTNtcHQwQlpwbFRTX0xrVzBzb0oxYU1DbFdOTDZYMl95NGNr">
          <figcaption>
            <span>AI Generated: an empty skydiving rig hanging in a sunlit hangar, evoking courage and self-deception</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>Then Like Now</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Bertrand Russell, writing in 1933 in an essay called </span><span><a href="https://russell-j.com/0583TS.HTM" target="_blank">The Triumph of Stupidity</a></span><span>, put it with a precision that has not been bettered in the ninety-three years since: "The fundamental cause of the trouble is that in the modern world the stupid are cocksure while the intelligent are full of doubt." He was writing about the rise of the Nazis. The context matters. Russell was not making a dinner-party observation about overconfidence. He was watching a continent slide towards catastrophe and diagnosing -- with the clarity of a logician who had co-written </span><span><a href="https://plato.stanford.edu/entries/principia-mathematica/" target="_blank">Principia Mathematica</a></span><span> -- the asymmetry of conviction that made it possible. The stupid were cocksure. The intelligent doubted themselves. And the stupid won, because certainty is a weapon and doubt is not.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Russell saw it in 1933. Socrates described it twenty-four centuries before that. Kruger and Dunning did not discover anything. They gave a 2,500-year-old observation empirical clothing -- and even the empirical clothing now appears to be made of autocorrelation.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The instinct to feel superior to the overconfident is ancient. What Dunning-Kruger did was make that instinct seem scientific. It gave us a graph -- a fake graph, from a webcomic -- and a citation, and we could point at people we disagreed with and say "Dunning-Kruger" instead of saying "I think you're wrong." It became the intellectual's "because I said so."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I was doing exactly this. In my head, I had sorted the world into people who knew about Dunning-Kruger (enlightened, like me) and people who didn't (the poor sods still stuck on Mount Stupid). The categorisation was itself a demonstration of the very bias I thought I was immune to.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>The Bias Blind Spot</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Here is the final knife, and it cuts all of us. </span><span><a href="https://doi.org/10.1177/0146167202286008" target="_blank">Pronin, Lin and Ross</a></span><span> demonstrated in 2002 that knowing about cognitive biases makes you think you are </span><span>less</span><span> susceptible to them. Not more. Less. This is the bias blind spot: the meta-bias, the one that weaponises your own knowledge against you. The more you know about Dunning-Kruger, the more confident you become that it applies to other people and not to you. Cognitive debiasing research confirms it -- awareness does not debias. It adds a layer of false security.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>We have all done this. We have all learned a concept, felt the warm glow of understanding, and then immediately deployed it as a weapon against someone we thought understood it less well. The bias blind spot is not an edge case. It is the default human response to learning about bias.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I sat down to write an article about people who don't know what they don't know. I was going to explain it to you. I was going to cite the research, draw the curve, reference the original paper, and demonstrate my sophisticated understanding of a concept that turns out to be a statistical artefact wrapped in a webcomic illustration popularised by people who never read the study they were citing.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>People like me.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFczZ0UE9UVGFKSVEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaejlEZmtPSkFBUS0vMC8xNzczNzcyMDc5NTE1P2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1iaklHUjEwYXZlbEdaekZoRnFmY0xLaXBvR2NQUlotemQ3ZmE5aFR6YUVn">
          <figcaption>
            <span>AI generated: An empty lecture theatre with a blank projection screen, the architecture of authority without content</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>The Hard Part</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>The </span><span><a href="https://thedecisionlab.com/biases/hard-easy-effect" target="_blank">hard-easy effect</a></span><span> tells us that self-assessment is contextual, not fixed. There is no stable "place on the curve" because there is no stable curve. My 153 jumps make me an expert to my mother and a beginner to anyone at Skydive Perris. The expertise is not a property of me. It is a property of the room I am standing in.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>So the next time someone drops "Dunning-Kruger" into a conversation, ask yourself one question: is this person using the concept to understand something, or to win an argument? That is your diagnostic. If the answer is "to win," you are not witnessing insight. You are witnessing the bias blind spot in real time -- a thought-terminating cliche dressed up as psychology, deployed to shut down a person rather than engage with what they are saying. Stop nodding along. Name what is actually happening.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>There is something genuinely uncomfortable about realising that the tool you use to demonstrate your self-awareness is itself a demonstration of your lack of self-awareness. I cannot resolve this. I cannot now pivot to a new, corrected understanding of Dunning-Kruger and deploy it with the same confidence I had before, because the whole point of the last 2,000 words is that the confidence was the problem.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>What I can tell you is this. I thought I was writing an article about other people. I was writing an article about myself. The research process for this piece -- which I began as a victory lap through well-understood territory -- became instead a series of discoveries that each, in turn, made me feel more foolish than the last. The fake graph. The autocorrelation. The counter-signalling. The bias blind spot.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Every layer peeled back revealed another layer of my own overconfidence about a concept that describes overconfidence.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Socrates was right. </span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>Knowing that you know nothing is not the end of wisdom. It is the beginning of the realisation that even </span><span>that</span><span> knowledge -- the knowledge of your own ignorance -- is something you can be wrong about.</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>I have done 153 skydives. I don't know what that means about me. And for the first time, I am not going to pretend that not knowing is the same as being wise.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But I will tell you this: the next time someone invokes Dunning-Kruger to dismiss a colleague in a meeting, ask them if they have read the paper. Ask them where the graph comes from. Watch what happens to their confidence.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>In 1900, Every Serious Manufacturer Had a Coal Strategy</title>
      <link>https://blog.cns.me/posts/1900-every-serious-manufacturer-had-coal-strategy-chris-nesbitt-smith-l2a2e/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/1900-every-serious-manufacturer-had-coal-strategy-chris-nesbitt-smith-l2a2e/</guid>
      <pubDate>Sat, 11 Apr 2026 16:02:49 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFITktJRHhhVHhXSkEvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNDIzXzc1Mi9CNEVaMTlPX0FWSU1BVS0vMC8xNzc1OTIyNTc2MDk4P2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1KVnpmVWFtd2JpSmJ5MWRfRElkT1JaeVg4MlUycF9UeUIyWjVmV1ZNUWVB" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>In the spring of 1913, inside a brick cathedral on the edge of Detroit, </span><span><a href="https://en.wikipedia.org/wiki/Highland_Park_Ford_Plant" target="_blank">William B. Mayo</a></span><span> — Henry Ford's chief power engineer, a Massachusetts man who had cut his teeth on marine steam plants — walked the gantry of the new powerhouse and signed off the load tests. Twin turbines. Eight boilers. A forest of brass dials. </span><span><a href="https://en.wikipedia.org/wiki/Highland_Park_Ford_Plant" target="_blank">Fifty-three thousand horsepower generated on site</a></span><span>, piped out as alternating current through the newly redesigned Highland Park plant. It was the largest privately owned electrical station in America. The trade press called it a marvel. The management consultants of the day — there were such creatures — called it inevitable. Every serious manufacturer, they explained, needed an energy strategy before an electrification strategy. Mayo had spent three years specifying this beast. He believed, as every serious power engineer of his generation believed, that a factory worthy of the name generated its own current, and that renting electrons from somebody else's wire was a confession of weakness dressed up as accounting. He was, by all accounts, quietly pleased.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Thirty years later, nobody who built a factory did this. They plugged it into the grid and got on with the work. Mayo's turbines were scrap.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Here is the thesis, and I am going to put it in the third paragraph so you cannot miss it. </span><span>The size of your data work must match the blast radius of what you let the AI do — and whether you can undo the damage when it gets it wrong.</span><span> Blast radius is what the AI is allowed to touch. Reversibility is whether a wrong answer is an embarrassing paragraph or a regulator on the phone. Put those two axes together and you get four quadrants, and the work looks different in each. So: read-only retrieval, write-enabled agency, foundation-model training, and classical machine learning — four quadrants, four disciplines, four separate answers to the question the orthodoxy insists on answering only once. The industry is selling you one answer, priced as though every situation were the hardest of the four, and calling it "data strategy first." It is not a strategy. It is a prerequisite mis-specified in every quadrant by people whose revenue depends on the mis-specification.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I run into this argument every week. Here is why I no longer buy it.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFR0JsZ2tXeF9DUVEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaMTlQTG1NSFFBUS0vMC8xNzc1OTIyNjM0OTAzP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1kTFlheG11SGtlRjBWSDJLVGM0cUFWTE90U2tlT013TWlLZGpmUzhaRzRr">
          <figcaption>
            <span>AI Generated: William B. Mayo's 1913 Highland Park powerhouse - 53,000 horsepower generated on site, and scrap within thirty years</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>The prerequisite that was always a sales pitch</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Start with a fact that ought to be more embarrassing to its promoters than it is. The phrase "AI-ready data" — deployed as if it were handed down on stone tablets — was </span><span><a href="https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk" target="_blank">coined by Gartner in 2024</a></span><span>, not 2014. The doctrine that you must have a data strategy before you touch a large language model hardened as received wisdom </span><span>after</span><span> ChatGPT landed. It is younger than the wave it claims to precede. The orthodoxy was retrofitted to the moment it was meant to predict.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>And who is selling it? The data-platform vendors whose product categories were looking orthogonal to the action, the consultancies whose billable ETL years were at risk, and the CDO profession itself. The canonical survey figure — </span><span><a href="https://sloanreview.mit.edu/article/five-trends-in-ai-and-data-science-for-2026/" target="_blank">93% of Chief Data Officers telling pollsters that an effective data strategy is essential</a></span><span> — is not a finding. It is a confession of self-interest, rendered in the third person.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Look at what the evidence actually says. </span><span><a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/" target="_blank">MIT's NANDA initiative</a></span><span>, looking at over three hundred enterprise generative AI initiatives in August 2025, found that 95% of pilots deliver zero measurable profit-and-loss impact. A brutal number, and one the vendors love to quote because it sounds like it proves their case. But read the next paragraph — the one they tend not to quote — where MIT names the root cause. It is not data. It is what they call the </span><span>learning gap</span><span>: most generative AI systems do not retain feedback, do not adapt to context, and do not improve over time. The successful initiatives bought capability from vendors and wired it into a workflow. Internally built systems succeeded a third of the time. Vendor-bought systems succeeded two thirds of the time. The bottleneck is showing up in the feedback layer, not the data layer — and that is a problem that cannot be solved by buying more of what the incumbents are selling.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Why the four quadrants are the right unit of analysis</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Back to Detroit. Back to Mayo on his gantry. Hold the four-quadrant frame in your head while I walk you through the history, because the history is where the framework earns its authority.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The story of factory electrification is the cleanest historical analogue we have for what is happening now, and it has been sitting in plain sight since </span><span><a href="https://www.jstor.org/stable/2006600" target="_blank">Paul David published "The Dynamo and the Computer" in May 1990</a></span><span>. Alternating current was </span><span><a href="https://en.wikipedia.org/wiki/Adams_Power_Plant_Transformer_House" target="_blank">commercially available from Niagara Falls in 1896</a></span><span>. American manufacturing productivity then did what productivity statistics do when a general-purpose technology arrives and nobody knows how to use it yet: absolutely nothing, for thirty years.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Why? Because the factories of 1900 had been built around the line shaft. A single prime mover spun a long iron axle running the length of the shop floor, and every tool in the building hung off it with leather belts. The architecture was the constraint. When you electrified by swapping the steam engine for a central electric motor (what </span><span><a href="https://www.jstor.org/stable/2120466" target="_blank">Warren Devine called "group drive"</a></span><span>), you changed the energy source but kept the architecture. You got the same factory, slightly cheaper to run, and none of the productivity gains — because none of the productivity was hiding in the energy source. It was hiding in the architecture.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Unit drive — giving every machine its own motor, freeing the layout from the tyranny of the shaft, letting materials flow in straight lines — is what made Highland Park work. And yet, in 1913, Ford still had Mayo build his own fifty-three-thousand-horsepower generating station, because the orthodoxy of the day said you needed to own your own power. By 1930 nobody built their own generating station. The grid did the job and did it better. The prerequisite had dissolved into a utility bill. The real transformation — the ten-fold productivity jump — came from the rearrangement, not the resource.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Draw the AI landscape on the same evolution axis and the picture sharpens. Foundation models are sliding visibly towards commodity — you rent them from a shrinking number of providers at prices that halve every eighteen months. Retrieval-augmented generation and vector search have just commoditised: both are now </span><span><a href="https://www.postgresql.org/about/news/pgvector-060-released-2774/" target="_blank">native column types in Postgres, Oracle, Snowflake and Databricks</a></span><span>, which means the thing the vendors were calling a moat last year is a feature of the database this year. The semantic layer, by contrast, is still stubbornly custom-built and always will be, because it encodes the specific meaning of </span><span>your</span><span> business; ETL and the warehouse beneath it have been commodity for fifteen years. Four components, four evolution stages, four different answers — and the orthodoxy sells you the same three-year programme for all of them.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGZzRpVTRLT2k5aWcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaMTlQZktCSWdBUS0vMC8xNzc1OTIyNzExNzExP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1uZ3hpSHNpZW9IUllOcG12Z3BkYzVxWXRMc01tQzNobDFTdk43UFhXZ2Vj">
          <figcaption>
            <span>AI Generated: Group drive versus unit drive - the ten-fold productivity gain was hiding in the architecture, not in the energy source</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>The category error in 2026</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Coal was consumed. Data is not consumed — it is referenced. The 2015-vintage definition of data strategy — master data management, enterprise data warehouse, governance-first, build-the-lake-then-stock-it-with-fish — is not the foundation of modern AI. It is orthogonal to it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://sloanreview.mit.edu/article/five-trends-in-ai-and-data-science-for-2026/" target="_blank">Wavestone's Data and AI Leadership Executive Survey</a></span><span> reports that "data-driven culture" inside large enterprises jumped from 21% in 2023 to 43% in 2024, and the researchers were explicit that the cause of the jump was generative AI. The </span><span>cause</span><span>. For ten years the CDO profession had been trying to get executives to care about data and getting nowhere. Then ChatGPT arrived and within eighteen months the board was asking the questions the CDO had been begging them to ask for a decade. The AI pulled the data along behind it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>And the profession knows. </span><span><a href="https://www.ibm.com/thought-leadership/institute-business-value/report/chief-ai-officer" target="_blank">IBM's 2025 CAIO Global Study</a></span><span> found that 26% of organisations now have a Chief AI Officer, up from 11% two years earlier — 48% of the FTSE 100. </span><span><a href="https://www.cnbc.com/2024/05/20/jpmorgan-ceo-jamie-dimon-says-ai-will-be-used-in-every-single-business-process.html" target="_blank">Jamie Dimon told an investor call in 2024</a></span><span> what he had done inside JPMorgan: "We took AI, slash data, out of technology. It's too important." The CDO role is being quietly folded into the CAIO role, because the market has decided the thing you optimise for is the outcome, not the upstream dependency.</span>
        </p>
    </div>
  
                    
    

    
  
                  

    <div>
        <p>
          <span>I am not arguing the CDOs were frauds. Many of them are excellent. The thoughtful ones have spent a decade doing necessary, unglamorous work inside a category the market had wrongly framed — and the invitation in this piece is for them to walk out of the old category and into the new one, because their craft is needed there and the title is not what matters.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>The steelman, and why it doesn't bury the argument</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>I owe you a fair hearing of the other side. Here is the best version of it, and then where I think it holds and where it crumbles.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The strongest objection comes from agentic compounding. The maths is brutal. A twenty-step agent running at 95% per-step reliability delivers a 36% end-to-end success rate. Drop each step to 85% — which is what you get when the context is noisy, schemas drift, and records are duplicated across three systems — and you are down to 4%. </span><span><a href="https://www.bain.com/insights/why-ai-stumbles-without-a-solid-data-strategy/" target="_blank">Bain's 2025 report makes exactly this point</a></span><span>, and eight out of ten executives tell Bain that data limitations are the number-one blocker to scaling agentic AI. This is the single argument the orthodoxy gets right. No foundation model is clever enough to paper over a duplicate customer record spread across Salesforce and NetSuite with conflicting addresses. Notice the quadrant, though. It lives in the high-blast-radius, low-reversibility corner, and only there.</span>
        </p>
    </div>
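
    <div>
        <p>
          <span>The compounding arithmetic is easy to check for yourself. A minimal sketch, assuming each step succeeds or fails independently (the independence assumption is mine; the 95% and 85% per-step figures are from the paragraph above):</span>
        </p>
    </div>

```python
# Per-step reliability compounds multiplicatively across an agent's run.
def end_to_end_success(per_step: float, steps: int = 20) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step ** steps

# 95% per step: clean, well-modelled context.
# 85% per step: noisy context, drifting schemas, duplicated records.
print(f"{end_to_end_success(0.95):.0%}")  # 36%
print(f"{end_to_end_success(0.85):.0%}")  # 4%
```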
  
                  

    <div>
        <p>
          <span>The second objection is the </span><span><a href="https://themarkup.org/news/2024/03/29/nycs-ai-chatbot-tells-businesses-to-break-the-law" target="_blank">NYC MyCity chatbot</a></span><span>. Built on Azure AI, grounded on two thousand curated dot-gov pages, and for eighteen months it cheerfully told small business owners they could steal tips from their staff and refuse cash payments. State-of-the-art stack. Authoritative sources. And it still lied with confidence because nobody had modelled the entity relationships or tagged which regulations superseded which. RAG on chaos is still chaos. That is a read-only case where reversibility was badly under-estimated — small business owners taking legal advice from a chatbot and then getting sued is not damage you can undo with a product update.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The third objection is the </span><span><a href="https://artificialintelligenceact.eu/article/10/" target="_blank">EU AI Act, Article 10</a></span><span>, binding from August 2026, which requires training and reference datasets to be "relevant, representative, free of errors and complete" for high-risk systems, with fines up to €15 million or 3% of global turnover. The </span><span><a href="https://www.garanteprivacy.it/home/docweb/-/docweb-display/docweb/10085455" target="_blank">Italian Garante has already fined OpenAI €15 million</a></span><span> on lawful-basis grounds. If you operate inside any regulated sector, "no data strategy" now translates as "I will write one under enforcement pressure, next year, at four times the cost." That is regulation choosing the quadrant for you.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The fourth objection is the semantic layer, and this is the one I find most honest. A language model cannot tell you the Q3 margin on EMEA enterprise customers unless somebody has reconciled </span><span>Q3</span><span>, </span><span>margin</span><span>, </span><span>enterprise customer</span><span>, and </span><span>EMEA</span><span> across Salesforce, NetSuite, the warehouse, and the product database. BI was forgiving of missing semantics — an analyst would simply ask the human who knew. Language models are not forgiving. They will give you a confident, beautifully formatted, completely wrong answer. The uncomfortable version, put by people who are not trying to sell you a platform, is this: </span><span>AI does not require a data strategy so much as it exposes the fact that you never really had one.</span><span> The </span><span><a href="https://opensemanticinterchange.org/" target="_blank">Open Semantic Interchange spec</a></span><span>, pushed out by Snowflake, dbt, Salesforce and Mistral in January 2026, is the industry's belated admission that the semantic layer is what matters, not the warehouse beneath it.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHOGladkVPUjBXRVEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzE1MDBfMjIzMi9CNEVaMTlQcWFISkVBVS0vMC8xNzc1OTIyNzU4OTA4P2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1VUC05RnpmUHRNN192WGYtMVZhY01DUF9IckFXcDZDWlpOcmgxU1VJc0hr">
          <figcaption>
            <span>AI Generated: A rusted line shaft and a modern GPU rack - every era has its prerequisite, sold by the people whose revenue depends on the wait</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>I concede the maths on agentic compounding. I concede NYC MyCity. I concede the EU AI Act. I concede that the semantic layer is irreducible. And yet none of that rebuilds the orthodoxy. Because the orthodoxy is not "match the data work to the blast radius and the reversibility of the damage." The orthodoxy is "build a three-year, eight-figure enterprise data programme before you are allowed to touch a language model." Each of the four objections picks out a specific quadrant where the data work is irreducible. None justifies pricing every situation as if it were the hardest quadrant. The 2026 version of a data strategy, sized per quadrant, is roughly a tenth the size of the 2019 version, and it runs </span><span>in parallel</span><span> with the AI work, not in front of it.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Who benefits from the prerequisite? And who owns the utility?</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>If the prerequisite has dissolved into a utility bill, who owns the utility? The answer, in 2026, is that three companies own the power stations — </span><span><a href="https://www.crn.com/news/cloud/2025/gartner-aws-azure-and-google-now-command-70-percent-of-global-cloud-infrastructure-spend" target="_blank">AWS, Azure and GCP</a></span><span> — and a handful of foundation-model landlords rent you the turbines on top. Anthropic. OpenAI. Google DeepMind. The 1930 grid was contested, regulated, in many places publicly owned; cheap power was treated as a political question because everyone understood that whoever controlled it controlled industry. The 2026 equivalent is none of those things. It is a private, unregulated, four-provider oligopoly that the "rent, do not build" recommendation I made two sections ago quietly accelerates. I am not resolving that question here. I am not equipped to. But I refuse to pretend it is not sitting there, because "renting" and "being enclosed" may turn out to be synonyms on a long enough timeline, and the managerial argument does not get to duck the structural one underneath it. I would rather leave you holding the question honestly than palm you off with closure I have not earned.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Management is moral work. When a consultant or a vendor tells you that you must do X before you are permitted to do Y, the first question is not whether the claim is true. The first question is: who is paying the cost of the wait? Not the consultant. Not the vendor. The customers whose problems could have been solved in month three but will now be solved in year three. The colleagues who spend eighteen months in a governance committee re-litigating what a "customer" is across four systems, producing a unified model that will be obsolete before it ships. I have seen a hundred data strategy programmes in my career. Most were a polite way of saying "we are not ready to change anything yet, and this gives us eighteen months to avoid admitting it." When I ask the people pushing hardest for data-strategy-first whether they would personally bet their own money on a three-year programme delivering more value than a ninety-day vendor-led pilot wired to a narrow workflow, they go quiet. Every single time.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>What I would do on Monday morning</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Here is the test I would apply, if I were sitting where you are sitting, reading this on a Sunday evening because you have a board meeting on Tuesday.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Pick a single workflow where the outcome is measurable in pounds, dollars or euros within ninety days. Not strategic. Not transformational. Measurable. Name the quadrant it lives in — read-only, write-enabled agent, foundation-model training, or classical ML — and name, out loud, how reversible a wrong answer would be. Size the data work to match the quadrant and not a gram more. Build the evaluation harness before you build the system, because MIT is right about the learning gap and you are going to need it. Rent the capability. Do not build it. Wire it to the workflow. Measure. Learn. Move to the next workflow. Do this six times in a year, across the quadrants that actually matter to you, and then tell me whether you need a three-year data strategy to precede your AI strategy, or whether you have already accidentally built the only data strategy that matters.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I guarantee, with history on my side, the answer is the second one.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The factory owners who won the electrification transition in the 1920s did not win because they built the biggest powerhouses. They won because they rearranged their shop floors around what unit drive made possible. The ones who lost spent the 1900s building ever larger on-site dynamos, because they treated the energy source as the strategy and the architecture as an afterthought. The technology was not the bottleneck. The imagination was. Mayo, for what it is worth, retired a respected man. His turbines still got scrapped.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Technologies change. Human nature does not.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>Policy as [Versioned] Code: A Mea Culpa, a Technical Argument, and a Lonely Experiment</title>
      <link>https://blog.cns.me/posts/policy-versioned-code-mea-culpa-technical-argument-nesbitt-smith-pedef/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/policy-versioned-code-mea-culpa-technical-argument-nesbitt-smith-pedef/</guid>
      <pubDate>Wed, 08 Apr 2026 06:45:05 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGZlVxc0hod09xbkEvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWjBORzM1bUc0QUktLzAvMTc3NDA0MTM5OTY4Mj9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9UjJmZ3J1aTFrV19JSFdITTFaMDRuNkM3S1pLSHRoTHFURkFKNmh5U1NfMA" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>I owe someone a credit. I owe him more than a credit, actually — I owe him a lineage. For six years I carried an idea around, refined it, engineered it into working code, and presented it at twenty-one conferences across three continents without once citing where it came from. Not out of malice. Out of something that anyone who has ever absorbed a good idea will recognise: I had been so thoroughly convinced by his argument that I forgot it was his.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The man is </span><span><a href="https://www.linkedin.com/preload/#" target="_blank">Michael Brunton-Spall</a></span><span>, and in 2016, at </span><span><a href="https://www.youtube.com/watch?v=txEWO4uyVnY" target="_blank">GOTO Amsterdam, he gave a talk</a></span><span> called "Rugged: Being Secure &amp; Agile" that planted a seed I didn't recognise as someone else's until it had already grown into a tree with my name on the trunk. But ideas have a lineage longer than any one person. Michael would be the first to say — and has said — that he stood on shoulders of his own: James Abley and Gareth Rushgrove at GDS, conversations at ScaleCamp and ScaleSummit, the DevOpsLondon community. What Michael did — and it was substantial — was take those scattered conversations and forge them into a narrative that could travel. He spent two or three years refining that talk, making the ideas communicable, pragmatic, and sticky. That is its own form of creation.</span>
        </p>
    </div>
  
                  
    

    
      <a href="https://www.youtube.com/watch?v=txEWO4uyVnY" target="_blank" rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFWnZhZ1lQZ2FjeGcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzE1MDBfMjIzMi9CNEVaME5KQnNwSVFBVS0vMC8xNzc0MDQxOTYzODIxP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1LS1NhcTRjblZROGNNcGk1ajdvdmMzb01NNW5ESF91Sk1nYVBjSWlXVEs0">
          <figcaption>
            <span>Michael Brunton-Spall's impressive talk, as true in 2026 as it was in 2016. Watch it, now!</span>
          </figcaption>
      </figure>
    
      </a>
  
  
                  

    <div>
        <p>
          <span>Michael's argument was elegant and, in retrospect, obvious. It was also early. 2016 was too soon for the industry to be ready, and Michael knew it. Security principles should not be bolted onto agile delivery like armour plating on a Ford Fiesta. They should be woven into the fabric of how teams build software. He was working at the Government Digital Service at the time, helping create </span><span><a href="https://technology.blog.gov.uk/2016/08/26/securing-development-in-an-agile-environment/" target="_blank">security design principles</a></span><span> that would later be adopted and maintained by the </span><span><a href="https://www.ncsc.gov.uk/collection/cyber-security-design-principles" target="_blank">National Cyber Security Centre</a></span><span>. He went on to co-author </span><span><a href="https://www.oreilly.com/library/view/agile-application-security/9781491938836/" target="_blank">Agile Application Security</a></span><span> with Laura Bell, Rich Smith, and Jim Bird. He is now Deputy Director of Cyber Policy and Capabilities at the </span><span>
      <a href="https://uk.linkedin.com/company/cabinet-office" target="_blank">Cabinet Office</a>
  </span><span>. Someone who earned their credentials.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I watched that talk and something happened that I have since learnt is the highest compliment a speaker can receive and the most dangerous trap a listener can fall into: I absorbed his principles so completely that I began to believe they were mine. When I sat down in 2022 to build what became </span><span><a href="https://github.com/policy-as-versioned-code" target="_blank">Policy as Versioned Code</a></span><span>, I was not consciously drawing on Michael's work. I was drawing on what I thought was my own conviction that policy should be testable, versionable, and consumable as a dependency. The philosophy was shaped by Michael and the community around him. The engineering was mine. And I failed to cite my sources — not because I chose not to, but because the ideas had become so thoroughly part of how I think that I forgot where they came from.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Michael was kind enough to give this article his blessing, and added the philosophical commentary that knowledge is a communal product: </span><span>"We take a bit of this, a pinch of that, and if it is good, it grows and spreads"</span><span>. Michael took conversations from </span><span>
      <a href="https://ca.linkedin.com/company/scalecamp" target="_blank">ScaleCamp</a>
  </span><span>  and </span><span>
      <a href="https://uk.linkedin.com/company/government-digital-service" target="_blank">Government Digital Service</a>
  </span><span> and turned them into a conference talk that changed how I think. I took that talk and turned it into working code and YAML and </span><span>
      <a href="https://www.linkedin.com/company/mend-io" target="_blank">Mend.io</a>
  </span><span> Renovate configs. Someone will take this article and turn it into something neither of us has imagined yet. That is how ideas are supposed to work. The best measure of a teacher's influence is when the student genuinely believes the ideas are his own — and the best response is not guilt, but gratitude and a proper citation.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>What I did not understand then — what I am only beginning to understand now, twenty-one conferences and a lot of solitary airports later — is that carrying an idea forward is its own form of loneliness. It does not matter whether the idea started as yours or someone else's. You are alone with a conviction in rooms full of people who will forget your name by the next session. That loneliness is the reason this piece exists — as a test of whether the technical argument I built from this shared philosophy can stand on its own in writing the way it never quite could from a stage.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>The Lift</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Here is the central image. You are in a lift — an elevator, if you must — in a large organisation. Four people ride with you.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The </span><span>CIO</span><span> says: "I have no idea what our teams are actually doing. I set policy, but I cannot tell you whether anyone follows it, bends it, or ignores it entirely."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The </span><span>Product Manager</span><span> says: "Bureaucracy is killing us. Every policy change means weeks of back-and-forth before we can ship."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The </span><span>Developer</span><span> says: "I just want to know what the rules are. Which ones I have to follow, which ones I can bend, and which ones lose me my job."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>And the </span><span>Cleaner</span><span> — the person everyone else in the lift has stopped noticing — says: "We get a memo. We compile it into our operational manual. Sometimes a new memo arrives and we miss it. Last week I wiped the war room whiteboard because nobody told me it mattered."</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGcFdjLTM4eDFGZkEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjBOSElVeUhZQVktLzAvMTc3NDA0MTQ2NzIwMj9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9NEduZVcxMjQ0Zk5uV2VfMG1MXzVPbzNhdlUxemloUEpDcmVhZE9ua3JIOA">
          <figcaption>
            <span>AI Generated: Four workers in a cramped office lift, hierarchies visible in their postures and uniforms</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>The Cleaner is the emotional heart of this story. Not because cleaning is trivial — it is not — but because the Cleaner is doing </span><span>exactly the same job</span><span> as everyone else in that lift. Receiving policy. Compiling it into a working manual. Trying to stay current. Dealing with version conflicts. Missing updates. And nobody recognises it, because nobody has ever framed policy management as a universal problem rather than a domain-specific one.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I think about the Cleaner more than I should. I think about the Cleaner at 2am when Renovate opens a pull request in a repository nobody has looked at in four months. I think about the Cleaner when an enterprise CIO shows me a compliance dashboard that tracks everything except whether anyone actually changed their behaviour. The Cleaner is working in good faith with a broken system, and nobody — not the vendor, not the consultancy, not the framework — is solving the Cleaner's problem.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Every person in that lift is a consumer of policy. The CIO produces it. The Product Manager resents it. The Developer needs it. The Cleaner follows it — or tries to. They are all doing the same thing. They are all failing at it for the same reason.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Policy, in most organisations, is a Word document in a SharePoint folder that nobody can find, nobody has versioned, and nobody can prove they have read.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>The Industrialisation Nobody Noticed</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>The industrialisation of policy is happening whether we notice or not. </span><span><a href="https://www.cncf.io/blog/2026/02/18/announcing-kyverno-1-17/" target="_blank">Kyverno achieved CNCF Graduated status</a></span><span> this year — the same level of maturity as Kubernetes itself. The </span><span><a href="https://www.cncf.io/blog/2025/07/29/introduction-to-policy-as-code/" target="_blank">CNCF published an introduction to Policy as Code</a></span><span> that reads like mainstream recognition of something a small number of people have been banging on about for years. </span><span><a href="https://nirmata.com/2025/04/24/kubecon-london-2025-recap-platform-engineering-is-growing-up-and-policy-is-leading-the-way/" target="_blank">KubeCon London 2025 put policy at the centre of platform engineering</a></span><span>. Policy is evolving from custom-built, hand-cranked enforcement into something that looks increasingly like commodity infrastructure. Most people have not noticed.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But here is what frustrates me. The consensus position — "policy-as-code is good; use OPA, use Kyverno, use Checkov; enforce at the gate" — is wrong. Not wrong in principle. Wrong in architecture.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFOHNUaWdJbVBLancvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjBOSGJUY0hnQVktLzAvMTc3NDA0MTU0NTExOD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9SnZZampFZldXRjlXUW1POHdKQndMemJUZmtLdjVlYWwteHFQNEF2QmV1QQ">
          <figcaption>
            <span>AI Generated: Airport security checkpoint reimagined as a software admission control system</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>Most policy-as-code implementations treat policy as a gatekeeper. An admission controller that blocks your deployment. A pre-commit hook that rejects your code. A compliance scanner that tells you what you got wrong </span><span>after</span><span> you have already built it. This is guardrails thinking. You hit the guardrail, your car is totalled, and technically the guardrail did its job because you did not go off the cliff. Congratulations. Your car is still wrecked.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://aws.amazon.com/blogs/enterprise-strategy/strategy-is-a-winding-road-mechanisms-keep-you-on-track/" target="_blank">Gregor Hohpe makes a useful distinction</a></span><span> in his work on strategy mechanisms: the difference between a guardrail and lane keeping assist. A guardrail stops you at the point of failure. Lane keeping assist nudges you continuously, correcting in real time before you ever reach the edge. I should be honest that the dependency model I am proposing is not quite lane keeping assist — a versioned policy arriving as a pull request is a discrete event, not a continuous correction. But it is far closer to lane assist than to a guardrail. It nudges teams toward compliance before deployment rather than blocking them at the point of it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Now — I need to be honest about the limits of this argument. Some policies belong at the gate. Access control. Data protection. Cryptographic key management. There are policies where the cost of a violation is so catastrophic and so irreversible that admission control is not just appropriate — it is the only responsible architecture. If a workload is about to deploy with an unencrypted database connection to a production environment containing personal data, I do not want lane keeping assist. I want a locked door.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The distinction is this: policies that govern </span><span>how</span><span> teams label, tag, configure, and structure their work are candidates for the dependency model. Policies that govern </span><span>whether</span><span> a workload is permitted to exist at all — security boundaries, data classification, access control — those belong at the gate. Confusing the two categories is how organisations end up blocking deployments over missing metadata while waving through genuine security risks, because the admission controller treats a missing department label and an exposed secret with equal severity.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Policy should be lane keeping assist where lane keeping assist is appropriate. And that turns out to be most of the policy surface area that enterprises actually struggle with. The labelling. The tagging. The configuration standards. The operational metadata. The stuff that makes the Cleaner's manual either current or dangerously stale.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Policy — for that surface area — should be something you pull towards you, not something that blocks you. Policy should be a </span><span>dependency</span><span> — versioned, tested, consumed, and updated automatically — not a gate.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>The Dependency Model</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Here is the technical argument, and I am going to show my working.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If policy is a dependency, then it should behave like one. It should have a version number. It should follow </span><span><a href="https://semver.org/" target="_blank">semantic versioning</a></span><span>. It should live in a Git repository. It should have unit tests. And it should be consumed by the teams that need it the way they consume any other dependency — pulled in, pinned to a version, and updated via automated pull requests.</span>
        </p>
    </div>
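
    <div>
        <p>
          <span>What "pinned to a version" means mechanically can be sketched in a few lines. This is an illustration of the caret-range convention popularised by package managers, not Renovate's actual matching logic:</span>
        </p>
    </div>

```python
# A caret pin such as "^2.1.0" accepts any later version with the same
# major number: minor and patch updates flow through as automated pull
# requests, while a major bump falls outside the range and forces a
# deliberate upgrade decision.
def parse(version: str) -> tuple[int, int, int]:
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)

def satisfies_caret_pin(pin: str, candidate: str) -> bool:
    pinned = parse(pin.lstrip("^"))
    return parse(candidate)[0] == pinned[0] and parse(candidate) >= pinned

print(satisfies_caret_pin("^2.1.0", "2.2.0"))  # True: auto-mergeable update
print(satisfies_caret_pin("^2.1.0", "3.0.0"))  # False: breaking change
```

The point of the pin is that the consuming team, not the policy author, decides when a breaking change lands.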
  
                  

    <div>
        <p>
          <span>I built this. The </span><span><a href="https://github.com/policy-as-versioned-code" target="_blank">policy-as-versioned-code GitHub organisation</a></span><span> has eleven repositories and is still active. </span><span><a href="https://docs.renovatebot.com/" target="_blank">Renovate</a></span><span> — the dependency update tool — has generated 1,222 automated pull requests across those repositories. Each PR is a measurable signal: did the team accept the policy update? Did it break their build? Did they pin to the old version and ignore it?</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Let me show you what v1.0.0 looks like. A simple policy: every Kubernetes resource must have a department label.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>In Kyverno:</span>
        </p>
    </div>
  
                  

    <div>
        
    <pre><span>apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-department-label
  annotations:
    policies.kyverno.io/title: Require Department Label
    policies.kyverno.io/category: Example Org Policy
    policies.kyverno.io/description: &gt;-
      It is important we know the department that resources
      belong to, so you need to define a 'mycompany.com/department'
      label on all your resources.
    pod-policies.kyverno.io/autogen-controllers: none
spec:
  validationFailureAction: enforce
  background: false
  rules:
    - name: require-department-label
      validate:
        message: &gt;-
          The label `mycompany.com/department` is required.
        pattern:
          metadata:
            labels:
              "mycompany.com/department": "?*"
        </span></pre>
  
          </div>
  
                  

    <div>
        <p>
          <span>In Checkov, for Terraform:</span>
        </p>
    </div>
  
                  

    <div>
        
    <pre><span>metadata:
  name: &gt;-
    Check that all resources are tagged with the key - department
  id: "CUSTOM_AWS_1"
  category: "CONVENTION"
scope:
  provider: aws
definition:
  and:
    - cond_type: "attribute"
      resource_types: "all"
      attribute: "tags.mycompany.com.department"
      operator: "exists"
        </span></pre>
  
          </div>
  
                  

    <div>
        <p>
          <span>And it has tests — because policy without tests is just an opinion:</span>
        </p>
    </div>
  
                  

    <div>
        
    <pre><span># fail0.yaml — should be rejected
apiVersion: v1
kind: Pod
metadata:
  name: require-department-label-fail0
spec: ...
---
# pass0.yaml — should be accepted
apiVersion: v1
kind: Pod
metadata:
  name: require-department-label-pass0
  labels:
    mycompany.com/department: finance
spec: ...
        </span></pre>
  
          </div>
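  
                

    <div>
        <p>
          <span>Those fixtures are exercised by the Kyverno CLI. Here is a sketch of the accompanying test manifest; the file names and exact schema fields are assumptions, and they vary between CLI versions:</span>
        </p>
    </div>
  
                

    <div>
        
    <pre><span># kyverno-test.yaml (sketch) — wires the pass/fail fixtures to the rule
apiVersion: cli.kyverno.io/v1alpha1
kind: Test
metadata:
  name: require-department-label
policies:
  - require-department-label.yaml
resources:
  - fail0.yaml
  - pass0.yaml
results:
  - policy: require-department-label
    rule: require-department-label
    resources:
      - require-department-label-fail0
    result: fail
  - policy: require-department-label
    rule: require-department-label
    resources:
      - require-department-label-pass0
    result: pass
        </span></pre>
  
          </div>
  
                

    <div>
        <p>
          <span>Run kyverno test in CI and a policy change fails its own build before it can fail anyone else's.</span>
        </p>
    </div>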
  
                  

    <div>
        <p>
          <span>That v1.0.0 becomes v2.0.0 when the organisation decides the department label must come from a constrained list rather than freetext — a breaking change, so a major version bump. Then v2.0.1 when someone spots a spelling mistake in the validation message: a patch. Then v2.1.0 when a new department is added to the allowed list: a backwards-compatible addition, so a minor bump. The version number communicates the nature of the change. Major means you must act. Minor means you should look. Patch means the system handles it.</span>
        </p>
    </div>
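  
                

    <div>
        <p>
          <span>For illustration, the v2.0.0 rule might swap the freetext wildcard for an allowed list. The department names below are invented, and Kyverno patterns treat | as a logical OR:</span>
        </p>
    </div>
  
                

    <div>
        
    <pre><span># v2.0.0 (sketch): the label must now match an approved department
spec:
  validationFailureAction: enforce
  background: false
  rules:
    - name: require-department-label
      validate:
        message: &gt;-
          The label `mycompany.com/department` is required and
          must be an approved department.
        pattern:
          metadata:
            labels:
              "mycompany.com/department": "finance | hr | engineering"
        </span></pre>
  
          </div>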
  
                    
    

    
  
                  

    <div>
        <p>
          <span>This is where it matters for the Cleaner. The Cleaner's operational manual is a dependency too. When the policy bumps from v1.0.0 to v2.0.0, the Cleaner does not need to wait for a memo. The Cleaner does not need to check the SharePoint folder. The update arrives as a pull request. It either passes or fails. It is testable. It is visible. And the version number tells the Cleaner exactly how much attention to pay: a major bump means the rules have changed and the manual needs rewriting; a minor bump means a correction has been made; a patch means carry on. That is more information than any memo has ever carried, and it arrives automatically rather than three weeks late.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHTTVYQ1R4ZVh1M2cvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjBOSGxidEpjQVktLzAvMTc3NDA0MTU4NjY2Nz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9NWtJdUx3ODBjN0x3RlhJOFlEV2xJVENmVmNGZkRHRjkxcVh5b0NqTFVOUQ">
          <figcaption>
            <span>AI Generated: Victorian telegraph office reimagined as a modern policy operations centre</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>The Renovate configuration that makes this work:</span>
        </p>
    </div>
  
                  

    <div>
        
    <pre><span>{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "labels": ["policy"],
  "regexManagers": [{
    "fileMatch": ["kustomization.yaml"],
    "matchStrings": ["mycompany.com/policy-version: \"(?&lt;currentValue&gt;.*)\"\\s+"],
    "datasourceTemplate": "github-tags",
    "depNameTemplate": "policy",
    "packageNameTemplate": "policy-as-versioned-code/policy",
    "versioningTemplate": "semver"
  },{
    "fileMatch": [".*tf$"],
    "matchStrings": ["#\\s*renovate:\\s*policy\\s*default = \"(?&lt;currentValue&gt;.*)\"\\s"],
    "datasourceTemplate": "github-tags",
    "depNameTemplate": "policy",
    "packageNameTemplate": "policy-as-versioned-code/policy",
    "versioningTemplate": "semver"
  }]
}
        </span></pre>
  
          </div>
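  
                

    <div>
        <p>
          <span>On the consuming side, the pin that the first regex manager matches could live in a kustomization.yaml annotation. This is a sketch; the repository layout and annotation placement are assumptions:</span>
        </p>
    </div>
  
                

    <div>
        
    <pre><span>apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
commonAnnotations:
  # Renovate matches this line and opens a PR when a newer tag exists
  mycompany.com/policy-version: "1.0.0"
resources:
  - github.com/policy-as-versioned-code/policy?ref=1.0.0
        </span></pre>
  
          </div>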
  
                  

    <div>
        <p>
          <span>When a team's build fails because they have not adopted the new policy version, the feedback is immediate and clear. When the CIO wants to know how many teams are compliant, the answer is a GitHub PR search away.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Go back to the lift. The CIO now has visibility — not a dashboard that measures deployment counts, but a PR acceptance rate that measures actual adoption. The Product Manager has automation instead of bureaucracy — policy updates arrive as pull requests, not as emails requiring three meetings. The Developer has explicit, testable rules with version numbers that distinguish "you must act" from "carry on." And the Cleaner — the Cleaner gets a notification that the policy version has bumped from v1.0.0 to v2.0.0, opens the pull request, sees exactly what changed (freetext labels are now a constrained list), and updates the operational manual in the same morning. Not three weeks later. Not after chasing someone in procurement for the latest memo. The same morning, using the same mechanism as the Developer three floors up. Policy-as-a-dependency does not care about your job title. It cares about whether you are consuming the current version.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>The Code for Humans</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>There is a missing layer in this argument that I did not see until Michael pointed it out, and it is characteristic of him that the thing I missed was the human part.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The dependency model solves the distribution problem. Policies get versioned, tested, consumed, updated automatically. But who writes the policy? Who decides when it is stale? Who clears out the dead wood?</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Michael created something at GDS called </span><span><a href="https://gds-way.digital.cabinet-office.gov.uk/" target="_blank">The GDS Way</a></span><span> that answers these questions, and the answers are deceptively simple. There is no policy writing committee. A team notices a practice that works — how they run incident reviews, say, or how they structure service healthchecks — and they submit it as a proposal. Other teams review it, challenge it, adopt it or push back. Every accepted practice carries a date. Every practice must be regularly reviewed. And here is the part that matters most: if nobody can argue that a practice is still good — if nobody will stand up in a review and say "yes, this is still right, and here is why" — it gets removed. Not archived. Not deprecated. Removed. The dead wood gets cleared because the system demands that someone actively defend every piece of guidance that remains.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The Kyverno YAML, the Renovate PRs, the semver tags — that machinery serves teams who live in Git, who understand pull request workflows, who can read a version number and know what it means. The actual Cleaner — the person with the mop and the operational manual — does not have a GitHub account. The dependency model provides the single source of truth: the policy is versioned, tested, and current. But the last mile to non-technical consumers — how the Cleaner's manual stays in sync — is a different problem that the versioning alone does not solve. The GDS Way model starts to bridge that gap, because its governance is human-readable. A dated practice with a mandatory review cycle is something any consumer can understand, whether they read YAML or not.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>This is the complement to the versioning model. Semantic versioning tells the Developer </span><span>what</span><span> changed and </span><span>how much</span><span> attention to pay. The review cycle tells everyone — including the Cleaner — something equally important: this policy is still alive. Someone still believes in it. It has not been left to rot in a SharePoint folder where nobody remembers who wrote it or why.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The combination is powerful. Version the policy so it can be distributed as a dependency. But also date the policy, review the policy, and — crucially — be willing to delete the policy when it no longer serves. </span><span>Make explicit what is otherwise implicit.</span><span> That phrase is Michael's, and it is the human half of the technical argument I have been making for six years without realising I was only telling half the story.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>The Enterprise Temptation</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>I need to name something that the industry does not want named.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Enterprises default to buying policy solutions rather than changing culture. This is not stupidity. It is structural. Procurement is measurable — a purchase order has a date, a cost, a vendor, a contract. Culture change has none of those properties. You cannot put "transformed how 3,000 engineers think about compliance" on a quarterly report. You can put "deployed Kyverno across 47 clusters" on one.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>In the 1830s and 1840s, Victorian factory owners faced the same structural incentive. Parliament passed the </span><span><a href="https://www.parliament.uk/about/living-heritage/transformingsociety/livinglearning/19thcentury/overview/factoryact/" target="_blank">Factories Act of 1833</a></span><span> — the first law to require factory inspectors. The owners responded rationally: they bought safety equipment, posted notices, filed the paperwork. What they did not do was redesign the work itself. A posted notice about machinery guards satisfies an inspector. Redesigning a production line to eliminate the hazard satisfies nobody's quarterly report but saves the worker's hand.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGcndmTlNrWUl0UEEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjBOSDNYaklzQVktLzAvMTc3NDA0MTY1OTkzMz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9V3lGY2U1WEJyUzBpYUZ1TW9Ba0ZxYm9fRzRfc3JzRW9uUGk0S1R4cEE2QQ">
          <figcaption>
            <span>AI Generated: Victorian factory inspector ticking clipboard while worker operates dangerous machinery</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>This is precisely what happens when an enterprise buys a policy engine and bolts it onto the CI pipeline. The admission controller satisfies the auditor. The dashboard satisfies the board. The YAML file in the repository satisfies the compliance team. But the Cleaner — the person who actually needs to know what changed and why — is still working from a memo that arrived three weeks late. The purchase order closed. The problem did not.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The </span><span><a href="https://digital-strategy.ec.europa.eu/en/policies/cyber-resilience-act" target="_blank">EU Cyber Resilience Act</a></span><span> begins enforcement in September 2026 for vulnerability reporting and December 2027 for full SBOM requirements. When </span><span><a href="https://nvd.nist.gov/vuln/detail/CVE-2021-44228" target="_blank">Log4Shell hit in December 2021</a></span><span>, most organisations could not even answer the question "are we running Log4j, and if so, which version?" The </span><span><a href="https://en.wikipedia.org/wiki/XZ_Utils_backdoor" target="_blank">xz Utils backdoor in March 2024</a></span><span> demonstrated that the supply chain threat had not gone away. Policy that cannot move as fast as the risk landscape is already obsolete.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>Purposeless policy is potentially practically pointless policy.</span></blockquote>
    </div>
  
                  

    <div>
        <h3><span>The Lonely Experiment</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>From June 2022 to December 2023, I took this talk to </span><span><a href="https://github.com/chrisns/talks/blob/main/schedule.md" target="_blank">twenty-one conferences</a></span><span> across three continents. Cloud Native London and Wales. Open Source Summit in Brazil and Dublin. DevSecCon in Germany and the Netherlands. The UK Government Cyber Security Conference. Detroit. Virtual stages I cannot remember the names of any more.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I did not sell a product. I shared principles. There was no sales pipeline, no lead generation, no ROI spreadsheet. I gave conference organisers a rare thing — an independent voice not hawking wares — and in return I got audiences who told me I was "very clever" without ever engaging with the substance.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The Detroit trip crystallised something. I flew from Heathrow, landed, slept one night, gave the talk, flew back to Heathrow, went straight to the Crown Prosecution Service office in Westminster for a senior stakeholder workshop, had drinks after, and went home. Still immune to jet lag, touch wood. But the physical endurance was not the point. The point was this: I had carried Michael's philosophy six thousand miles, transmuted it into YAML and Renovate configs and semver tags, presented it to a room of strangers, and flown home to present it to a room of civil servants — and in neither room could I tell whether a single person would do anything differently on Monday morning.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I did it anyway. I would do it again. That is the defiant part, and I want to be clear about it: the loneliness was real, but so was the conviction. I believed — I still believe — that policy-as-a-dependency is architecturally right. Not knowing whether anyone acted on it does not make it wrong. It makes it unproven. There is a difference, and I have spent eighteen months learning to sit with that difference.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I have to believe there was value, because the alternative is that I spent eighteen months talking to myself. And maybe that is all writing is — talking to yourself in public and hoping someone overhears. But the </span><span><a href="https://github.com/policy-as-versioned-code" target="_blank">policy-as-versioned-code GitHub organisation</a></span><span> sits there with its eleven repositories and its 1,222 automated pull requests, and the engineering works whether or not anyone is watching.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I practised that alliteration line — "purposeless policy is potentially practically pointless policy" — because when you give the same talk twenty-one times, you need to find ways to keep it alive for yourself. To introduce jeopardy, to risk tripping over your own tongue, to create a moment of human engagement in a room full of people staring at their phones. The speaking circuit is made of small things like that — small moments of connection in large rooms full of polite indifference.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFS2N6RlRLQ3cwV1EvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjBOSURIZkpBQVktLzAvMTc3NDA0MTcwODA5NT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9SkdJdW5qaV9HTEFIUS1FU3QxM0dhVGxjUTJOdzNHczVFb3RoQUZ1MEJ1TQ">
          <figcaption>
            <span>AI Generated: Solitary figure walking docks at twilight, containers awaiting unknown destinations</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>The Irony and the Aspiration</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>There is something I have been turning over. I took a philosophy shaped by Michael and others and turned it into engineering. That is what engineers do — and it is also what Michael did when he took scattered community conversations and turned them into a communicable narrative. Ideas move through people. The question is not who owns them but whether they are growing. I have a </span><span><a href="https://github.com/policy-as-versioned-code" target="_blank">working demo</a></span><span>, a </span><span><a href="https://www.youtube.com/watch?v=YWQG_E7vgiQ" target="_blank">recorded talk</a></span><span>, and — finally — a proper citation.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The aspiration is simpler and, I suspect, more lasting. Even if someone else commercialises this idea — and someone will, because the land grab has already begun — the world is marginally better for the idea having been shared openly. </span><span><a href="https://www.appvia.io/blog/policy-as-versioned-code" target="_blank">Appvia wrote about it</a></span><span>. The CNCF is moving in this direction. The principles are out there. Influencing the influencers is its own reward, even when you cannot measure the influence.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If you want to evaluate whether this model fits your organisation, here is a practical exercise you can do on Monday morning in fifteen minutes. Take a blank page. Draw two columns. Label the left column "Gate" and the right column "Dependency." Now list every policy your organisation enforces on engineering teams. Policies where a violation is catastrophic and irreversible — access control, data protection, cryptographic boundaries — go in the left column. Policies where a violation is correctable and the real cost is inconsistency — labelling, tagging, configuration standards, operational metadata — go in the right column. The left column belongs at the gate. The right column — and it will be the longer column — is where the dependency model applies.</span>
        </p>
    </div>
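  
                

    <div>
        <p>
          <span>The finished map can itself be captured as data and committed alongside the policies, so the classification is versioned too. The entries below are invented examples:</span>
        </p>
    </div>
  
                

    <div>
        
    <pre><span># Gate: a violation is catastrophic and irreversible; enforce at admission
gate:
  - access-control-boundaries
  - data-protection-rules
  - cryptographic-configuration
# Dependency: a violation is correctable; version it, pin it, renovate it
dependency:
  - require-department-label
  - resource-tagging-standards
  - operational-metadata
        </span></pre>
  
          </div>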
  
                  

    <div>
        <p>
          <span>Then go to </span><span><a href="https://github.com/policy-as-versioned-code/policy" target="_blank">github.com/policy-as-versioned-code/policy</a></span><span>. Open the require-department-label directory. Read the Kyverno YAML. Read the tests — pass and fail. Then look at the </span><span><a href="https://github.com/policy-as-versioned-code/policy/blob/main/renovate.json" target="_blank">Renovate config</a></span><span> and imagine that automation running across every repository in your estate, opening pull requests whenever the policy version bumps. If that pattern fits the right column of your two-column map — the correctable, consistency-focused policies — you have a candidate for the dependency model. If it does not fit, you have learnt something about your organisation's policy surface that most CIOs never discover.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The Cleaner is all of us. We are all compiling memos into manuals, trying to stay current, dealing with version conflicts that nobody designed a system to manage. The CIO wants visibility but cannot get it. The Product Manager wants speed but cannot have it. The Developer wants clarity but cannot find it. And the Cleaner — the Cleaner has been solving this problem with a highlighter and a ring binder for longer than any of us have been writing YAML.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The question is whether we are brave enough to version the manual. And whether, when the automated pull request arrives, we will merge it or let it rot.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I wrote this article partly because I believe the technical argument is right, partly because ideas deserve their lineage traced, and partly because — after twenty-one conferences and eighteen months of speaking to rooms that politely applauded and moved on — I wanted to know whether anyone out there is listening.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If you have tried this approach, or argued against it, or built something better — drop a comment below and help me feel less lonely.</span>
        </p>
    </div>
  
                  
    

    
      <a href="https://www.youtube.com/watch?v=xRgo9HDV_2I" target="_blank" rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHdkFOb0VJaGxQTlEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaME5LUFpqSXNBUS0vMC8xNzc0MDQyMjgxOTUxP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1PWFZSakdCb3dhUzc0Rm9WVVVFaE5xcUw0VTZ2WHFWUlNLMmY4NW4zd01n">
          <figcaption>
            <span>My talk on YouTube: Policy as [Versioned] Code, elevator pitch included</span>
          </figcaption>
      </figure>
    
      </a>
  
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>I Jumped Out of a Plane to Have Something Interesting to Say at Parties. The Work Was the Interesting Thing All Along.</title>
      <link>https://blog.cns.me/posts/i-jumped-out-plane-have-something-interesting-say-all-nesbitt-smith-sg1xe/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/i-jumped-out-plane-have-something-interesting-say-all-nesbitt-smith-sg1xe/</guid>
      <pubDate>Thu, 02 Apr 2026 05:45:07 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGaXlFR3ZxcFRmbHcvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWno5ZV9nU0hFQUktLzAvMTc3Mzc3OTI5MDg1Mz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9dXZoVVZFeXNIM21fZG4zRXlsdmtUVm4xeC1qbUdJZkZVZUdfbVItOU04WQ" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>I got into skydiving to impress girls at parties. I am not going to dress that up. I was a technologist in my twenties, I worked on things that I believed — correctly, as it turned out — were genuinely important, and I could not for the life of me work out how to make any of it sound interesting to someone holding a glass of wine and looking for a reason not to walk away. So I signed up for an AFF course — accelerated freefall, where you jump solo from day one with two instructors holding on to you — paid up front because I am not a person who does things by halves, and threw myself out of a perfectly good aeroplane.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I hated it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Not violently. Not with any great conviction. I did not hate it the way you hate something that frightens you. I hated it the way you hate something that disappoints you — the gap between the story you expected to tell and the experience you actually had. The first jump was fine. The second was fine. The third and fourth were fine. Everything was fine, which is the most damning word in the English language when you have paid for a course of jumps expecting to feel transformed.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Then came the fifth jump, and everything went wrong. I say wrong — it was line twists, which any experienced skydiver will tell you are </span><span><a href="https://www.uspa.org/skydiving-then-and-now50-years-of-change" target="_blank">a fairly routine malfunction</a></span><span> that you kick out of. You look up, you see the lines are twisted, you kick, you spin, the canopy inflates properly, and you carry on. It is not dramatic. It is not cinematic. But it was the first time anything had gone wrong, the first time the script deviated, the first time I had to solve a problem in real time with the ground approaching at a rate that concentrated the mind wonderfully.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I loved it. From that moment, I genuinely loved it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The adversity was the catalyst. Not the freefall, not the view, not the adrenaline — the problem-solving. The moment when something did not go to plan and I had to think, act, adapt. </span><span><a href="https://www.sciencedaily.com/releases/2017/05/170509093619.htm" target="_blank">Researchers at Queensland University of Technology found exactly this</a></span><span>: extreme sports participants are not thrill-seekers but self-knowledge-seekers. The value is not in the danger. The value is in discovering what you are capable of when the stakes are real.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Funny thing, though. I only made it to that fifth jump because I had bought the course up front. If it had been pay-per-jump, I would have walked away after the third. The sunk cost — that most derided of cognitive biases — was the thing that kept me in long enough to discover genuine love for the sport.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGZXdfZk9LczR0clEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjBRT291RUlRQWMtLzAvMTc3NDA5Mzc2Njk0Mz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9b2czc19WYURoUkU1cWlZQlMzclNiNDZhVlA5bzZfd1RLTy1MaDhodk16Yw">
          <figcaption>
            <span>AI Generated: industrial mundanity meets the vastness of freefall </span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>The Signal That Collapsed</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Here is the thing I did not understand at twenty-five. The reason I thought I needed skydiving was not that my work was boring. It was that I did not know how to talk about it. I had internalised, without ever examining it, the technology industry's catastrophic inability to tell its own story.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I was building things that mattered. The work I do now — things like </span><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span>, giving local government organisations free cloud sandboxes to experiment with AI before committing a penny of public money — is objectively more exciting than a skydive. It affects millions of people. It changes how public services work. It is, by any honest measure, a more compelling story than "I fell out of a plane and threw a pilot chute."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But I could not tell that story. And the reason I could not tell it is the same reason the technology industry haemorrhages the very people it most needs.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>There is a concept in evolutionary psychology called </span><span><a href="https://en.wikipedia.org/wiki/Costly_signaling_theory_in_evolutionary_psychology" target="_blank">costly signalling theory</a></span><span>. The idea is straightforward: a signal's value is proportional to its cost. A peacock's tail is expensive to grow and maintain, which is precisely what makes it a reliable indicator of fitness. Skydiving, when I started, was a costly signal. It was unusual, it was mildly dangerous, and it was the kind of thing that made people lean in at parties.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But skydiving had been democratising for decades. </span><span><a href="https://www.uspa.org/skydiving-then-and-now50-years-of-change" target="_blank">Tandem jumping arrived in 1983</a></span><span>, turning what was once a military skill into a stag-do gift experience. By the time I was doing my AFF, skydiving was no longer unusual. The cost had collapsed. The signal had collapsed with it. I was paying for an increasingly commoditised experience whilst ignoring the genuinely rare and valuable thing I already had — work that was interesting, complex, and consequential.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I should thank </span><span>
      <a href="https://uk.linkedin.com/in/adamcrvn" target="_blank">
        Adam Craven
      </a>
  </span><span> here, because he was the person who first showed me that skydiving was accessible. He revealed to me that this outrageous-sounding thing was actually quite safe, quite reachable, quite normal. And he was right. But in doing so, he also inadvertently demonstrated the mechanism that undermined the entire exercise. The moment something extreme becomes accessible, it stops functioning as a signal of distinction. The moment everyone can do it, nobody is impressed by it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>What I should have been doing was learning to tell the story of the work. The same mechanism that devalued my skydiving story — signal collapses when cost collapses — is exactly what happened to the technology industry's narrative about itself.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I have watched this play out in three distinct modes, and I have been guilty of all of them. The first is the </span><span>jargon fortress</span><span> — describing mechanism instead of consequence, burying the human impact under layers of technical vocabulary that function less as communication and more as a drawbridge. The second is </span><span>borrowed excitement</span><span> — grafting someone else's story onto yours because you do not trust that what you actually built is worth talking about. That was me with the skydiving. The third is </span><span>audience of one</span><span> — telling the story exclusively to people who already understand it, who already care, who are already inside the walls. Every organisation I have seen that cannot recruit diverse talent is doing at least two of these simultaneously.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>You want to see what these look like in the wild? The jargon fortress is a job advert that says "seeking expertise in Kubernetes orchestration, Terraform IaC, and GitOps pipelines" when what it means is "we need someone who can build systems that a million people rely on without thinking about." The borrowed excitement was me — literally me, standing at parties talking about freefall when I should have been talking about the work. And the audience of one is the conference talk packed with architecture diagrams that gets a standing ovation from the two hundred people in the room who already agree with every word, and reaches precisely zero of the people you actually need to walk through the door.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>Then Like Now</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>The technology industry has a storytelling problem, and it is not an aesthetic failure. It is a structural one with measurable consequences.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://www.bcs.org/policy-and-influence/equity-diversity-and-inclusion/bcs-diversity-report-2024-addressing-the-under-representation-of-women-in-technology/" target="_blank">The BCS Diversity Report 2024</a></span><span> found that at current rates, gender parity in UK technology will take 283 years. Two hundred and eighty-three years. Women hold </span><span><a href="https://electroiq.com/stats/diversity-in-tech-statistics/" target="_blank">21-22% of software development roles</a></span><span>. </span><span><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC8994771/" target="_blank">Seventy per cent of computer scientists do not match the stereotypical interest profile</a></span><span> that the industry projects to the outside world — and that mismatch is not random. It is the direct result of a story the industry tells about itself that is narrower, duller, and more exclusionary than the reality.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Storytelling did not create those numbers, and storytelling alone will not fix them. But storytelling determines who even considers showing up.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>"OK," you concede, "but surely the exciting industries do better?" They do not. Gaming is </span><span><a href="https://www.cnbc.com/2020/08/14/video-game-industry-grapples-with-murky-track-record-on-diversity.html" target="_blank">76% male with 2% Black developers</a></span><span>. The space industry is 80% male. Excitement does not fix diversity. If anything, excitement that is marketed to a narrow demographic entrenches it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The problem is not that technology is boring. The problem is that the people who tell technology's story — people like me, for most of my career — told it in a way that resonated with people who were already like us. We described the work in jargon that excluded. We celebrated the wrong things: the all-nighter, the hackathon, the hero deploy. We built a culture that signalled "this is for a specific kind of person" and then wondered, with apparently genuine bewilderment, why only that specific kind of person showed up.</span>
        </p>
    </div>
  
                    
    

    
  
                  

    <div>
        <p>
          <span><a href="https://techcrunch.com/2021/02/14/examining-the-pipeline-problem/" target="_blank">Universities graduate diverse computer science students at twice the rate that companies hire them</a></span><span>. The pipeline is not the problem. The pipeline was never the problem. The problem is what happens at the other end — the job adverts, the interview culture, the </span><span><a href="https://leakytechpipeline.com/barrier/tech-workforce-barriers/" target="_blank">50% fewer callbacks for African-American-sounding names</a></span><span>, the mythology that this work requires a particular personality rather than a particular capability.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://spectrum.ieee.org/time-to-update-the-software-engineer-stereotype" target="_blank">IEEE found that 66% of engineers do not match the public stereotypes of what an engineer looks or acts like</a></span><span>. The majority of the people already doing this work do not fit the image the industry uses to recruit more of them. That is not a diversity problem. That is a marketing hallucination.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGbFVEVzVZZU1lWUEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWno5aFRqbEpvQVktLzAvMTc3Mzc3OTg5NDQ0MT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9NUZ1YklaZ2diMW1SQlRVSU9iWFJWN0x5ZVlmTUlwR0Q5c1VsNjNZb2FwYw">
          <figcaption>
            <span>AI Generated: 1950s recruitment poster versus a modern diverse engineering team, the mythology versus the reality of who actually does the work</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>The Confidence Problem</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Here is what I actually learnt from the skydiving-at-parties experiment: </span><span><a href="https://spsp.org/news-center/character-context-blog/attractiveness-confidence" target="_blank">confidence predicts attractiveness more reliably than any specific hobby or achievement</a></span><span>. It was never the skydiving. It was the way I talked about it — the energy, the conviction, the willingness to be animated about something. The specific thing was almost irrelevant.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Which means the technology industry does not need to become more exciting. It needs to become more confident about what it already is. The work is extraordinary. Building systems that serve millions of people. Solving problems that governments and corporations and communities cannot solve without you. The engineer who builds an </span><span><a href="https://ndx.digital.cabinet-office.gov.uk/catalogue/tags/try-before-you-buy/" target="_blank">AI-powered translation service for council residents who do not speak English</a></span><span> is doing something more meaningful than anyone who has ever jumped out of a plane. But that engineer has been taught — by the industry, by the culture, by two decades of hoodie-wearing founder mythology — that the work is not the story. That you need something else, something outside, something extreme, to be interesting.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>That is a lie. And it is a lie with consequences. Every person who does not apply because they looked at the industry and thought "that is not for me" is a perspective we lose. Every team that is less diverse than it could be is a team that will build less robust, less creative, less representative technology. The diversity deficit is not a moral decoration. It is an engineering failure. I am aware of the risk in that framing — reducing people to engineering inputs is exactly the kind of dehumanisation I am arguing against. But the engineering frame is what this industry responds to, and I would rather use a language that gets heard than a language that gets ignored.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I must be honest about the limits of this argument, though. I am not saying that if we just told better stories, the diversity problem would vanish. Structural barriers are real. Pay gaps are real. Hostile cultures are real. </span><span><a href="https://nationalhealthfoundation.org/breaking-down-lack-diversity-outdoor-spaces/" target="_blank">The outdoor recreation industry is 72% white</a></span><span> despite decades of campaigns to make it more accessible, which tells you that storytelling alone is insufficient. But storytelling is where it starts. You cannot recruit someone who never considered applying. You cannot change a culture that does not believe it needs changing. The story is the first domino, not the last.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>The Wind Tunnel and the Work</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>I have not jumped out of a plane in ten years. Life intervened. My wife, the inimitable </span><span>
      <a href="https://uk.linkedin.com/in/hannah-nesbitt-smith-98813034" target="_blank">
        Hannah Nesbitt-Smith
      </a>
  </span><span>, who, it must be said, has absolutely zero interest in skydiving whatsoever, and our children have restructured my relationship with risk in ways that a twenty-five-year-old paying for an AFF course could not have predicted.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Nowadays, all I get is indoor skydiving. I recently introduced my friend </span><span>
      <a href="https://uk.linkedin.com/in/patrickcrompton" target="_blank">
        Patrick Crompton
      </a>
  </span><span> to the wind tunnel, and he has found the same deep joy in it that I did. There is something about the sport — even the indoor version, even the sanitised, controlled, nobody-is-going-to-die version — that teaches you things about yourself that you cannot learn any other way. Body position. Awareness. The way small adjustments produce outsized effects.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I will be honest, though: in the wind tunnel, I more closely resemble a daddy longlegs bashing off the walls than I do any of the incredible tunnel flyers you might see on YouTube. And if anyone ever asks what I do, I always show them someone else's videos. The gap between aspiration and reality is something I have learnt to find funny rather than embarrassing, which may be the most useful thing skydiving has taught me.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But here is the thing. When someone asks what I do for a living — not what I do for fun, but what I do for work — and I tell them, with conviction and energy, that I build platforms that let local government experiment with AI at zero cost, or that I work on systems that </span><span><a href="https://uk-x-gov-software-community.github.io/xgov-opensource-repo-scraper/" target="_blank">catalogue twenty-four thousand open source repositories across the entire UK government estate</a></span><span>, or that I am trying to make it so that a council officer with an idea can go from "what if" to </span><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/" target="_blank">a working prototype in fifteen minutes</a></span><span> — the reaction is the same as it ever was with the skydiving. Better, actually. Because the story is real, it is consequential, and it does not require me to have jumped out of anything.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>The work was the interesting thing all along. I just needed to learn how to say so.</span></blockquote>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFQ1NBVFFMUkRDdWcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWno5aXBLaklRQVktLzAvMTc3Mzc4MDI0NDk4Mj9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9OU9SZExkeE9qRjJaVTRiWEVlSUhfUmY5QXlIb21YR2JNODAtT0dPQ1NxOA">
          <figcaption>
            <span>AI Generated: a lone ungainly figure in a neon-lit wind tunnel while elegant flyers watch - aspiration versus reality</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>What Must Change</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>I am not going to end this with a ten-point plan. The problem is too structural for that and I am too honest to pretend otherwise. But I will say three things.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>First</span><span>: if you work in technology, learn to tell the story of what you do with the same energy you would use to describe jumping out of a plane. Not the jargon. Not the stack. The impact. The why. The human consequence. If you cannot make someone lean in when you describe your work, the problem is not the work. It is you.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Second</span><span>: if you hire in technology, look at your job adverts, your careers pages, your conference sponsorships, and ask who they are speaking to. Pull up your three most recent job adverts. Read them aloud. If they sound like they were written by and for the same person, they were. </span><span><a href="https://shortyawards.com/12th/goldman-sachs-day-in-the-life-campaign" target="_blank">Goldman Sachs rebuilt its entire employer brand</a></span><span> around showing what the work actually looked like, not what the mythology said it looked like. The technology industry — an industry that employs some of the most creative people on earth — has somehow produced the least imaginative recruitment marketing in the history of professional services. That is fixable.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Third</span><span>: recognise that storytelling is necessary but not sufficient. Better stories will widen the top of the funnel. They will not, on their own, fix the cultures that push people out. The 283-year figure from BCS is not just a recruitment problem. It is a retention problem, a promotion problem, a whose-voice-gets-heard-in-the-room problem. Storytelling opens the door. What happens after the door opens is a different fight, and an older one.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I am aware of the irony. This entire piece argues that signals collapse when they become accessible, and then prescribes making our storytelling more accessible. The mechanism I diagnosed is the mechanism I am invoking. But here is the difference: a collapsing skydiving signal costs you a party anecdote. A collapsing recruitment signal costs you 283 years.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I started jumping because I thought the work was not enough. I was wrong. The work was always enough. We just need to get better at saying so — and then we need to build the kind of workplaces that prove it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
                  

  
              ]]></description>
    </item>
    
    <item>
      <title>We&#39;ve Commoditised Innovation (And Most of You Haven&#39;t Noticed)</title>
      <link>https://blog.cns.me/posts/weve-commoditised-innovation-most-you-havent-noticed-nesbitt-smith-jzzhe/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/weve-commoditised-innovation-most-you-havent-noticed-nesbitt-smith-jzzhe/</guid>
      <pubDate>Tue, 31 Mar 2026 05:45:07 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFIclY2aGJrYlc0d2cvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWjBRMGFKMUhnQUktLzAvMTc3NDEwMzY3NDU3NT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9a1pLdjhsZUVUMU9uYlM4cUZwSDNxX2NzdGRjWXVucWtRQU12VUkxd29CSQ" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>We've commoditised innovation.</span><span> </span><span>Not the ideas — you can't commoditise human creativity. But the ability to experiment? The ability to go from "I wonder if this works" to "let me try it"? That used to be expensive, slow, and bespoke. Now it isn't. Except where it is.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>We've now commoditised innovation.</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>For cloud services in local government, it still is. And that gap — between the components that have commoditised and the access that hasn't — is where most of your strategy is quietly dying. I should explain what I mean, because "commoditised innovation" sounds like something a consultant would say on a stage before selling you a three-year transformation programme. It isn't. It's an observation about evolution — specifically, about what happens when the </span><span><a href="https://blog.gardeviance.org/2016/04/whats-in-wardley-map-and-why-do-i-need.html" target="_blank">evolution axis</a></span><span> shows you two things that should be moving together but aren't.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Cloud compute has commoditised. Storage is a utility. Even sophisticated AI services — natural language processing, computer vision, translation — are moving rapidly from product to commodity. Available, standardised, cheap. But the </span><span>ability to experiment</span><span> with those commoditised services in local government? That's stuck in the custom-built phase. </span><span><a href="https://www.gov.uk/government/news/one-stop-shop-for-tech-could-save-taxpayers-12-billion-and-overhaul-how-government-buys-digital-tools" target="_blank">209 NHS organisations and 320 local councils</a></span><span> each independently navigating procurement for substantially similar tools. Every council negotiates its own path. Every team writes its own business case. Every experiment requires bespoke approval.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The components have commoditised. The access hasn't. And the strategy for each is completely different. If you can't see that gap, you're playing chess without a board.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGdjFfVlFvYlVGWFEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjBRemd3eklzQVktLzAvMTc3NDEwMzQzMzc1NT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9Z0trRnBCRWNDbFc4VW9wQXFvMUpQUmFNUDlIcDEzV2RNbzVIbk93V2JfOA">
          <figcaption>
            <span>AI Generated: A river with stepping stones blocked by a concrete wall, people queuing with paperwork</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>Both sides are right (and both are wrong)</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Here's the debate I keep hearing. On one side: "We need sandboxes! Let people experiment! Remove the barriers!" On the other: "Sandboxes are theatre. </span><span><a href="https://astrafy.io/the-hub/blog/technical/scaling-ai-from-pilot-purgatory-why-only-33-reach-production-and-how-to-beat-the-odds" target="_blank">88% of AI pilots never reach production</a></span><span>. You're just generating experiments that go nowhere."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Both sides are right. Both sides are wrong. Because they're talking about different things at different stages of evolution.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The sceptics are right that a sandbox without a path to production is theatre — expensive, demoralising, innovation-flavoured prattle that changes nothing. But the advocates are right that without a sandbox, you never discover what the path to production </span><span>should</span><span> look like. You can't write a business case for something you haven't been able to try. And you can't know what's worth scaling until you've seen it work at the smallest possible scale.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The resolution, as usual, sits on the evolution axis. Experimentation is a component. The sandbox is a component. The path from sandbox to production is a </span><span>different</span><span> component. And they are not independent — experimentation access enables sandboxes, sandboxes generate learning, and the path to production converts that learning into operational value. Break the chain at any point and the whole thing stalls. Most organisations haven't mapped any of this, let alone worked out what evolution stage each one is at. That's the problem. Not sandboxes. Not the absence of sandboxes. The absence of a map.</span>
        </p>
    </div>
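
    <div>
        <p>
          <span>That chain fits in a few lines of data. Here is a toy sketch of what "map it and check the chain" might look like; the components and the stages assigned to them are my own illustrative guesses, not an assessment of any real council:</span>
        </p>
    </div>

```python
# Toy illustration: model the experimentation chain as components with
# rough Wardley evolution stages, then flag where a downstream component
# lags far behind the commoditised thing it depends on.
# The stages assigned here are illustrative guesses, not an assessment.
STAGES = ["genesis", "custom-built", "product", "commodity"]

chain = [
    ("cloud compute", "commodity"),
    ("experimentation access", "custom-built"),
    ("sandbox", "custom-built"),
    ("path to production", "genesis"),
]

def evolution_gaps(chain):
    """Yield (upstream, downstream, gap) where the downstream component
    sits more than one stage behind the component it builds on."""
    for (up, up_stage), (down, down_stage) in zip(chain, chain[1:]):
        gap = STAGES.index(up_stage) - STAGES.index(down_stage)
        if gap > 1:
            yield up, down, gap

for up, down, gap in evolution_gaps(chain):
    print(f"{down} lags {up} by {gap} stages")
```

    <div>
        <p>
          <span>Even this crude version makes the mismatch visible: the break in the chain is not in the commoditised services, it is in the access component sitting two stages behind them.</span>
        </p>
    </div>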
  
                  

    <div>
          <h2>
            <span>The silent majority</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Let me tell you about someone I'll call Sarah. She's a data analyst at a district council — one of the smaller ones, the kind where you're expected to do three jobs and be grateful for two. She's been there since 2014, when she applied for something in planning and ended up in data by accident. Last year she had an idea: use an AI language model to draft initial responses to Freedom of Information requests. Not the decision — the drafting. She reckoned it could save her team fifteen hours a week.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Sarah looked into it. She'd need access to a cloud environment. That meant a business case. The business case meant her line manager, who agreed it was a good idea and, with genuine sympathy, warned it would take about a year. Then the digital board, who meet quarterly. If approved, procurement would take </span><span><a href="https://www.govnet.co.uk/blog/uk-public-sector-procurement-journey-how-technology-suppliers-can-achieve-long-term-roi" target="_blank">twelve to twenty-four months</a></span><span>. By which point the AI services she wanted to test would have evolved twice over. Sarah felt the energy drain out of the whole thing somewhere between the second email and the third form.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Sarah didn't fail. Sarah didn't even start. She went back to her day job, because people have day jobs, and I don't blame them for doing the thing they're actually paid to do. Her idea exists only as an unexpressed hypothesis. We have no idea how many Sarahs there are, because their innovations are invisible. They don't show up in any metric. The </span><span><a href="https://mhclgdigital.blog.gov.uk/2024/04/02/future-councils-pilot-insights-common-challenges-to-digital-transformation-in-local-government/" target="_blank">MHCLG Future Councils pilot</a></span><span> found that the number one blocker to innovation in local government was "no way to de-risk innovation." Not lack of ideas. Not lack of ambition. No safe way to try.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>That's a landscape problem, not a people problem.</span></h3>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFNGlIYzlUZUFOYncvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjBRMDBfU0hnQVktLzAvMTc3NDEwMzc3ODM3NT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9Vmh2dVg1Y2ZBSUhMdmkxdlpMc3FCVV94T3NSc2ZTMTMxSUR0VVZLV1FiUQ">
          <figcaption>
            <span>AI Generated: Organisation chart mirrored as cloud architecture through Conway's Law</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>Conway's Law is eating your cloud migration</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Melvin Conway observed in 1968 that </span><span><a href="http://www.melconway.com/Home/Committees_Paper.html" target="_blank">organisations produce designs that mirror their communication structures</a></span><span>. Give an organisation a cloud platform and what do they build? The organisation they already have. On the cloud. </span><span><a href="https://duplocloud.com/blog/cloud-migration-statistics/" target="_blank">52% of cloud migrations are lift-and-shift</a></span><span>. McKinsey estimates that </span><span><a href="https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/clouds-trillion-dollar-prize-is-up-for-grabs" target="_blank">lift-and-shift wastes roughly two-thirds of the potential value</a></span><span>. Two-thirds. That's not a migration strategy. That's an expensive relocation.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If your procurement process takes eighteen months, your innovation cycle takes eighteen months. If your governance requires a business case before anyone touches a cloud console, you've built a structure that guarantees nobody will discover the things that business cases can't predict. And here's what most people miss: procurement itself is a component on the evolution axis. It's sitting in the custom-built phase — every council designing its own process — while everything it governs has moved to commodity. That mismatch is the real architectural problem.</span>
        </p>
    </div>
  
                    
    

    
  
                  

    <div>
        <p>
          <span>And what happens when the rational choice is not to try officially? You already know. </span><span><a href="https://ukstories.microsoft.com/features/rise-in-shadow-ai-tools-raising-security-concerns-for-uk/" target="_blank">71% of UK employees are using unapproved AI tools at work</a></span><span>. Shadow IT isn't a mystery. It's Conway's Law applied to experimentation: if the official structure won't support the work people need to do, they'll create an unofficial structure that will. The NCSC understands this — their guidance explicitly says to </span><span><a href="https://www.ncsc.gov.uk/guidance/shadow-it" target="_blank">"avoid unnecessary IT lockdowns"</a></span><span> because the alternative is worse.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>You lock down environments to prevent risk. People experiment unofficially, outside your governance, outside your visibility. You've created exactly the risk you were trying to prevent. Brilliant.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFNnJOY1pZTDYwOFEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjBRMU43Qkc0QVktLzAvMTc3NDEwMzg4MDUwNz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9aXgxMmhhemJrcGI3eXZvWEZPckVfT1kwQmI1eTdSUVBFWExTUUFrdWNfZw">
          <figcaption>
            <span>AI Generated: Reinforcing feedback loop: lockdown creates shadow IT, creates risk, creates more lockdown</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>What commoditised experimentation actually looks like</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>So what if you could push the ability to experiment from custom-built towards commodity? Not the ideas. Not the cloud services. The </span><span>access</span><span>.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>That's what </span><span><a href="https://ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span> appears to be doing. It's part of the </span><span><a href="https://www.gov.uk/government/news/one-stop-shop-for-tech-could-save-taxpayers-12-billion-and-overhaul-how-government-buys-digital-tools" target="_blank">National Digital Exchange</a></span><span>, and the proposition — if it works, and I'm genuinely not sure yet — is to remove the procurement wall between people like Sarah and the cloud services that have already commoditised. Free sandboxes for local government. You do a quiz, pick a scenario, get a working environment. Whether that means anything useful is the question I can't answer yet. And there's a landscape question worth asking: lowering the barrier to experimentation on a hyperscaler's platform is not the same thing as building public digital infrastructure. Whose sand are these sandcastles built on? That matters, and I'd want to see it mapped.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Now, I've seen enough "innovation platforms" to be deeply sceptical. Most of them are rubbish — consultant-designed, workshop-delivered, context-free. But let me be honest about what makes this different on the evolution axis: it's not trying to make innovation happen. It's trying to make the </span><span>cost of trying</span><span> low enough that people try without needing permission, funding, or irrational conviction. That's an infrastructure play, not an innovation theatre play. And the distinction matters.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>The bit I'm less sure about</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Here's where I hedge, because this is gameplay, not doctrine. I don't know whether commoditised experimentation leads to better outcomes or just more experiments. That 88% pilot purgatory number deserves more than a passing glance. If all you're doing is making it easier to create things that go nowhere, you've commoditised innovation theatre. Well done. Slow clap.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But I don't think that's the whole picture. Let me be precise about what I think is doctrine and what's gameplay here.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Doctrine: reduce the cost of experimentation and you increase the rate of learning. That's always true. It doesn't depend on context.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Gameplay: whether </span><span>this particular platform</span><span> delivers on that doctrine for </span><span>this particular set of organisations</span><span> — that's context-specific. I might be completely wrong. </span><span><a href="https://ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span> might gather dust in six months. Who knows.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>What I </span><span>do</span><span> know is that the sandbox-to-production gap is real, and it's a separate component that needs its own strategy. The people building this need to map that gap as carefully as they've mapped the sandbox itself. If there's no path from "I tried something interesting" to "this is now running in production," then the sceptics are right and this is sandcastles. The path from sandbox to production is where the hard work lives — and where most innovation platforms quietly die. I'd want to see that map before I'd call this a success. But the fact that someone is addressing the </span><span>experimentation access</span><span> component at all? That's worth paying attention to, because almost nobody else is.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFIZHhPODF0N0gybHcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWjBRMW5Rc0c4QVktLzAvMTc3NDEwMzk4NDM5NT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9R1NwbXJEYmJoWG5jNElrUWNfaW1rcm01ZjZhWlhsTlRuXzVPVTRYdXA5UQ">
          <figcaption>
            <span>AI Generated: Evolution curve showing experimentation access moving from custom-built toward commodity</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>Where's your map?</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>If you're in local government technology, here's my challenge. Map your experimentation landscape. Not your cloud landscape — your </span><span>experimentation</span><span> landscape. How long does it take to go from idea to prototype? What are the dependencies? Where's the friction? Is it there for a reason that still makes sense, or is it one of those institutional habits where nobody remembers why it started but everybody assumes it must be important?</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Map it. You might be surprised by what you find.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The mapping might reveal something I haven't considered. That's rather the point. At least with a map, you can see where you're wrong. Without one, the Sarahs in your organisation will keep having ideas and keep not trying them, and you'll never even know what you lost. And if you are a Sarah — if you can see the wall but can't move it — then at least this map gives you the language to name what's being lost. Making the invisible visible is sometimes the first act of change.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The </span><span><a href="https://www.gov.uk/government/news/one-stop-shop-for-tech-could-save-taxpayers-12-billion-and-overhaul-how-government-buys-digital-tools" target="_blank">400+ councils</a></span><span> standing at that procurement wall aren't short of ideas. They're short of a way to test them. The ideas that would have saved time, reduced cost, and actually improved services for residents — those ideas are sitting in people's heads, unexpressed, untested, and decaying. Every month that wall stays up, the gap between what's possible and what's attempted gets wider. That's not a technology problem. That's a situational awareness problem. And you won't solve it until you can see it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>The Paperclip Maximiser Is You</title>
      <link>https://blog.cns.me/posts/paperclip-maximiser-you-chris-nesbitt-smith-jwcne/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/paperclip-maximiser-you-chris-nesbitt-smith-jwcne/</guid>
      <pubDate>Fri, 27 Mar 2026 06:30:08 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHSEdva25SeTlHSUEvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWnouQ0FPM0o0QUktLzAvMTc3Mzc4ODQ2ODkxNj9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9cEJ0c0JtbDd0YzR1OVNnLVFKVXJEWGV0SjlEdVVOMXNGMm84bUstQVRLRQ" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>In 1858, the </span><span>New York Times</span><span> ran a piece about the transatlantic telegraph that should be tattooed on the forehead of every technologist alive today. The telegraph, </span><span><a href="https://bigthink.com/pessimists-archive/twitter-telegrams/" target="_blank">the paper warned</a></span><span>, was "too fast for the truth." Messages now crossed the Atlantic in minutes instead of weeks, and the editors worried that speed without verification would unleash a torrent of rumour, misinformation, and panic onto an unprepared public. They were right. They were also, in a way that matters enormously right now, describing a pattern that is precisely 168 years old and still accelerating.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I grew up in Brighton in the 1980s. We had four television channels. </span><span><a href="https://en.wikipedia.org/wiki/Channel_4" target="_blank">Channel 4 didn't even broadcast twenty-four hours a day until 1996</a></span><span>. My mother told me that if I stared at the television too long, my eyes would go square. The scarcity was the point. Four channels meant somebody — a commissioning editor, a scheduler, a regulator — had decided what was worth broadcasting and when. The information came in a trickle. You could drink from it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Now it comes from a fire hose attached to a sewage main.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFcFBfZkt1c0FIbVEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWnouQ1l2b0dVQVktLzAvMTc3Mzc4ODU2NzY5Nj9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9MWlQYVY4ZF9CaDl0YUVrQzhVbUViVmQ5RTN4S0NjT0JkYWZyWTFRV0RmVQ">
          <figcaption>
            <span>AI Generated: A child of the eighties, before the flood</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>Every generation panics about the new medium. In 1492, the Benedictine abbot </span><span><a href="https://www.purplemotes.net/2012/12/23/trithemius-printing-scribes-reason/" target="_blank">Johannes Trithemius wrote</a></span><span> </span><span><a href="https://www.purplemotes.net/2012/12/23/trithemius-printing-scribes-reason/" target="_blank">De Laude Scriptorum Manualium</a></span><span> — </span><span>In Praise of Scribes</span><span> — arguing that the printing press would corrupt knowledge and destroy monastic discipline. Scribes, he insisted, engaged with the divine through the physical act of copying; the press severed the link between comprehension and effort. He was mocked for it. He published his defence of hand-copying as a printed book — the irony was not lost on his critics, though Trithemius argued the press was acceptable for </span><span>distributing</span><span> his argument while still inferior for </span><span>forming</span><span> one. Five hundred years later, he sounds less like a Luddite and more like a prophet. The trade-off he described — effort removed, comprehension diminished — is exactly what is happening when you ask an AI to summarise a report you should have read yourself.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Neil Postman picked up the thread in 1985 with </span><span><a href="https://en.wikipedia.org/wiki/Amusing_Ourselves_to_Death" target="_blank">Amusing Ourselves to Death</a></span><span>, arguing that Aldous Huxley had beaten George Orwell — that we would not be destroyed by what we fear but by what we love, drowning in an ocean of entertainment we chose for ourselves. Postman's insight was structural: each medium does not merely carry content but reshapes the act of thinking. Television turned political argument into entertainment. The internet turned entertainment into a feedback loop. And AI is turning the feedback loop into something that thinks — or pretends to think — on your behalf.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>From Trithemius's printing press to Postman's television to the LLM on your laptop, the pattern is the same: a technology arrives that makes information cheaper, and the thing it makes cheaper is not just production but cognition itself. </span><span><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC7477771/" target="_blank">Amy Orben called this the "Sisyphean Cycle of Technology Panics"</a></span><span>, and she is right that the pattern exists. I am not interested in relitigating the pattern. I am interested in whether this time the pattern breaks.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>The Feedback Loop That Changed Everything</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Here is what is different. Every previous communication technology operated on human timescales. A newspaper editor wrote an article, printed it, distributed it. The feedback loop was measured in days or weeks.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Then the loop tightened. Then it got so tight that the human stopped being the author and became the product.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://www.moma.org/collection/works/118185" target="_blank">Richard Serra said it in 1973</a></span><span>: "You are the product." </span><span><a href="https://www.axios.com/2017/12/15/sean-parker-unloads-on-facebook-god-only-knows-what-its-doing-to-our-childrens-brains-1513306792" target="_blank">Sean Parker confirmed it in 2017</a></span><span>: Facebook was designed to exploit "a vulnerability in human psychology." </span><span><a href="https://fortune.com/2017/12/12/chamath-palihapitiya-facebook-society/" target="_blank">Chamath Palihapitiya went further</a></span><span>: "I think we have created tools that are ripping apart the social fabric of how society works." These are not critics. These are the architects, confessing.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC10504808/" target="_blank">B.F. Skinner identified variable-ratio reinforcement</a></span><span> as the most addictive schedule of reward in the 1950s. Social media bolted his rat lever onto a global communication network and called it a platform. </span><span><a href="https://www.buzzfeednews.com/article/ryanmac/growth-at-any-cost-top-facebook-executive-defended-data" target="_blank">Andrew Bosworth's internal memo</a></span><span> made growth "de facto good." </span><span><a href="https://www.npr.org/2021/10/05/1043377310/facebook-whistleblower-frances-haugen-congress" target="_blank">Frances Haugen</a></span><span> confirmed the platform knew it was causing harm and chose profit over safety.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Growth as the justification for everything. Sound familiar?</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHTW9iX3dKTUVBWXcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWnouQ3h1UUhBQVktLzAvMTc3Mzc4ODY2OTY0MD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9Y3NwODNkOTltYTctYVlCUGZMano1SGlrbHJ2dTh3c0NUeGNMTXVFdVQxaw">
          <figcaption>
            <span>AI Generated: The telegraph office drowning in its own output</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>Three Threads, One Ratchet</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>This argument has three threads and they reinforce each other viciously. The first is supply-side: </span><span><a href="https://graphite.io/five-percent/more-articles-are-now-created-by-ai-than-humans" target="_blank">more than 52% of long-form web articles are now AI-generated</a></span><span>, and </span><span><a href="https://hbr.org/2025/09/ai-generated-workslop-is-destroying-productivity" target="_blank">HBR estimates the resulting "workslop" costs organisations $9 million a year</a></span><span>. The second is demand-side: when AI summarises your emails and filters your feeds, you do not become better informed — you become a person who has stopped processing information altogether. The third is structural: the engagement engine was built to maximise attention captured, not understanding, and AI has simply given it a new production line. The flood creates the need for AI filters. The filters degrade your ability to evaluate what the flood contains. And the business model profits from both.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I use these tools. I am writing about the ratchet and I can feel its teeth in my own workflow — the pull of the summary, the relief of the shortcut, the slight hollowing-out when I accept an answer I did not earn. I am not writing from above this problem. I am writing from inside it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://www.oxfordreference.com/display/10.1093/acref/9780191843730.001.0001/q-oro-ed5-00019845" target="_blank">Herbert Simon saw the core constraint in 1971</a></span><span>: "A wealth of information creates a poverty of attention." </span><span><a href="https://www.caltech.edu/about/news/thinking-slowly-the-paradoxical-slowness-of-human-behavior" target="_blank">Researchers at Caltech have shown</a></span><span> that human thought operates at roughly 10 bits per second. Your sensory systems take in billions. Your conscious mind processes ten. Every communication technology in history has increased the volume arriving at that bottleneck. Not one has widened the bottleneck itself.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>Then Like Now</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>'OK,' you might reasonably argue, 'but at least AI helps us manage the overload. It summarises. It filters. It prioritises. Isn't that the whole point?'</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The reasonable version of this objection is that tools are neutral and usage is what matters. I used to believe that. But neutral tools do not redesign your information diet without asking. Neutral tools do not create feedback loops that amplify your existing biases. The ratchet does not care about your intentions.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://www.nature.com/articles/s41562-024-02077-2" target="_blank">A 2024 study in</a></span><span> </span><span><a href="https://www.nature.com/articles/s41562-024-02077-2" target="_blank">Nature Human Behaviour</a></span><span> showed that AI-human feedback loops amplify existing biases rather than correcting them. </span><span><a href="https://academic.oup.com/pnasnexus/article/4/10/pgaf316/8303888" target="_blank">A study in</a></span><span> </span><span><a href="https://academic.oup.com/pnasnexus/article/4/10/pgaf316/8303888" target="_blank">PNAS Nexus</a></span><span> </span><span><a href="https://academic.oup.com/pnasnexus/article/4/10/pgaf316/8303888" target="_blank">this year</a></span><span> found that people who relied on LLM-generated summaries developed shallower knowledge structures than those who read the source material. They felt more confident. They knew less.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://www.mdpi.com/2075-4698/15/1/6" target="_blank">A 2025 study in</a></span><span> </span><span><a href="https://www.mdpi.com/2075-4698/15/1/6" target="_blank">Societies</a></span><span> found a correlation of -0.68 between AI tool usage and critical thinking ability. The correlation does not tell us which way the arrow points — whether AI usage degrades thinking or whether weaker thinkers reach for AI more readily — but either direction is troubling.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>And here is the part that should genuinely frighten you. </span><span><a href="https://fortune.com/2026/03/13/ai-isnt-reducing-workloads-its-straining-employees-time-spent-emailing-doubled-deep-focus-work-fell/" target="_blank">ActivTrak's latest research</a></span><span> found that AI tools are increasing task completion time by 346%. Not reducing it. Increasing it. Deep focus time is falling. Email time has doubled. </span><span><a href="https://hbr.org/2026/03/when-using-ai-leads-to-brain-fry" target="_blank">BCG calls it "AI brain fry"</a></span><span> — the cognitive exhaustion of constantly supervising a system that is supposed to be supervising you. The 346% figure likely captures the chaos of early adoption and may moderate, but the structural pattern — the supervisor needing a supervisor — is not a teething problem. It is the architecture.</span>
        </p>
    </div>
  
                    
    

    
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFaHZ1cktnR0x5N0EvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWnouRENYaklVQVktLzAvMTc3Mzc4ODczODIxOD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9MERWX0pMRkI1M0pIWm93QVRMa2d2UHNtcmNZMkF2TkRVd3l6NjVMRkxBSQ">
          <figcaption>
            <span>AI Generated: Take the pill, the label says it helps</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <h3><span>The Paperclip Maximiser Is Not a Metaphor for AI</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>In 2003, </span><span><a href="https://nickbostrom.com/ethics/ai" target="_blank">Nick Bostrom introduced the paperclip maximiser</a></span><span> — a thought experiment about an AI given the simple goal of making paperclips, which converts all available matter in the universe into paperclips, including the humans who built it.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>But the paperclip maximiser is not a metaphor for AI. It is a metaphor for us.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>We built the engagement engines. We optimised for clicks, views, time-on-site, and share counts. When those systems produced polarisation, addiction, and the erosion of shared reality, we did not shut them down. We scaled them up. We called it growth. AI slop is not a bug. It is the system working as designed.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Here is the ratchet. I am building it deliberately, because I want you to see how each step makes the next feel necessary and the alternative — doing the cognitive work yourself — feel impossible:</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Step 1 — The Flood</span><span>: AI generates content at a scale no human team can match. </span><span><a href="https://graphite.io/five-percent/more-articles-are-now-created-by-ai-than-humans" target="_blank">More than half of long-form articles are now AI-generated</a></span><span>, and the number is climbing.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Step 2 — The Filter</span><span>: You cannot process the flood, so you use AI to filter and summarise it. The filter selects based on what you have engaged with before. Your information diet narrows. </span><span>This is where a thoughtful leader stops and asks: what is the filter removing? What am I no longer seeing?</span><span> If you cannot answer that, you have already ceded the decision about what matters.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Step 3 — The Delegation</span><span>: The AI-generated summaries replace your engagement with source material. Your comprehension shallows. You feel informed. You are not. </span><span><a href="https://academic.oup.com/pnasnexus/article/4/10/pgaf316/8303888" target="_blank">Shallower knowledge structures, greater confidence, less actual understanding</a></span><span>. </span><span>This is the second off-ramp. If you are a leader making decisions on summaries of summaries, mandate that your team — and you — read the source material for any decision above a given threshold. Name the threshold. Write it down.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Step 4 — The Atrophy</span><span>: Your reduced attention creates a gap. The AI generates more content to fill it. The noise increases. The signal degrades. You lean harder on the AI. The </span><span><a href="https://www.mdpi.com/2075-4698/15/1/6" target="_blank">-0.68 correlation between AI usage and critical thinking</a></span><span> is the ratchet in motion.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Step 5 — The Business Model</span><span>: None of this is accidental. The engagement engine that drives Step 1 profits from every iteration of Steps 2 through 4. The company selling you the flood is selling you the filter. The company selling you the filter has no incentive to restore your ability to drink from the stream yourself. </span><span><a href="https://www.buzzfeednews.com/article/ryanmac/growth-at-any-cost-top-facebook-executive-defended-data" target="_blank">Growth is de facto good</a></span><span>. The number goes up. The harm is acceptable because the number is still going up.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>That is not a decision framework. That is a ratchet. You are not making decisions. You are being funnelled.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://aeon.co/ideas/what-did-hannah-arendt-really-mean-by-the-banality-of-evil" target="_blank">Hannah Arendt wrote about what she called "the banality of evil"</a></span><span> — the idea that great moral failures are not caused by monsters but by ordinary people who stop thinking. Not people who think evil thoughts, but people who delegate their thinking to systems, to procedures, to authorities, and in doing so become incapable of moral judgement. Arendt's word for it was </span><span>thoughtlessness</span><span>, and she did not mean stupidity. She meant the abdication of the individual's responsibility to comprehend.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>The Ratchet and the Responsibility</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>I need to be honest about something. I drafted sections of this argument with an AI assistant. I asked it to find the Simon quote. I asked it to check the Bostrom citation. I asked it to scan six studies I did not have time to read in full. Each time, I felt the pull — the relief of outsourcing a cognitive task, the slight diminishment of my engagement with the material. I caught myself accepting a summary instead of reading the source. I am describing a ratchet, and I can feel its teeth on my own cognition.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>
      <a href="https://uk.linkedin.com/in/simonwardley" target="_blank">
        Simon Wardley
      </a>
  </span><span> and I have been putting the world to rights on this over afternoon coffees and Greek Raki — the kind of argument that goes in circles because neither of us can find the exit. </span><span><a href="https://www.swarm.work/blog/ai-without-strategy-is-just-hype--simon-wardley-on-mapping-ai-adoption" target="_blank">His position</a></span><span> is that the real danger is not that AI produces rubbish but that it degrades comprehension. </span><span><a href="https://gotopia.tech/articles/357/maps-ai-future-reasoning-a-conversation-with-simon-wardley" target="_blank">His deeper worry</a></span><span>: once AI controls the reasoning layer, you have handed the keys to your cognition to a system whose incentives are not your own. I keep trying to find the flaw. I have not found it yet.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The technology industry will tell you this is just another moral panic. Trithemius worried about printing. Postman worried about television. And look — civilisation survived.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>True. But the printing press did reduce our reliance on monastic memory — Trithemius was right about that. Television did turn political discourse into entertainment — Postman was right about that. The panickers were not wrong about the mechanism. They were wrong about the scale. Civilisation survived, but as something different — something that had traded one cognitive capacity for another without ever consciously choosing to.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>This time, the thing we are trading away is comprehension itself. The ability to take in information, process it, weigh it against what you already know, and form a judgement. And </span><span><a href="https://www.nature.com/articles/s41562-024-02077-2" target="_blank">the feedback loop is no longer measured in days or weeks but in milliseconds</a></span><span>.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGUDJSMUdoRThjMXcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWnouRDFzTUhRQVktLzAvMTc3Mzc4ODk0ODc3MD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9ektuQ0xBRXh4aVNGeWhTa2ZWak9tQ2ZxS3JiZ2gxeFVyQkNUcWRlZUZvdw">
          <figcaption>
            <span>AI Generated: The arcades are full, but nobody chose the game</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>So here is my challenge. Two things. One is personal, one is organisational.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The personal one: </span><span>this week, read one thing in full that you would normally have asked an AI to summarise. Sit with it. Let it be slow. Let it be boring. Let your ten bits per second do their work.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The organisational one:</span><span> audit where your company has inserted AI into the comprehension layer — the places where a human used to read, evaluate, and decide, and a system now does it for them. Map it against the ratchet. At which step are you? Ask whether the people downstream can still do the work if the system is switched off. If the answer is no, you do not have an AI strategy. You have a dependency. And dependencies, left unexamined, become vulnerabilities.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The paperclip maximiser is not coming for you. It is already here. And it is not a machine. It is every decision you made to let something else do your thinking.</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>When Your CMS Gets a Brain: Building AI into LocalGov Drupal on AWS</title>
      <link>https://blog.cns.me/posts/when-your-cms-gets-brain-building-ai-localgov-drupal-nesbitt-smith-vzqxe/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/when-your-cms-gets-brain-building-ai-localgov-drupal-nesbitt-smith-vzqxe/</guid>
      <pubDate>Tue, 24 Mar 2026 06:05:54 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFIa0ZmMWxpX0pCLVEvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWnozVUtjS0hRQUktLzAvMTc3MzY3NTc4OTExMT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9WEVVSEJVZ0xUdDRocHRGa2h2QW11QjhTNl95NTdIVmY0NDMxTXRRSzJxRQ" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>I learnt something unexpected last year while watching a local authority content editor at work. She was creating a service page about waste collection — a task she'd done perhaps three hundred times before — and she paused, mid-sentence, to manually check the </span><span><a href="https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests" target="_blank">Flesch reading score</a></span><span> of her draft in a separate browser tab. The irony struck me: here was a skilled professional, using a computer to write content for other humans, while manually performing a task that the computer beside her could do rather better. It felt a bit like watching someone use a calculator to check whether their spreadsheet was correct.</span>
        </p>
    </div>
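For the curious, the check she was doing by hand is a short formula: Flesch reading ease scores a text from word length, sentence length, and syllable count. Here is a minimal Python sketch — the vowel-group syllable counter is a rough heuristic of my own, not the exact algorithm her tool used, and real readability libraries handle many edge cases this one misses:

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, discount a trailing silent 'e'.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    # Flesch formula: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(round(flesch_reading_ease(
    "Bins are collected every Tuesday. Leave them out by seven."), 1))  # prints 66.4
```

Higher scores mean easier reading; GOV.UK-style service content generally aims for short sentences and common words, which this formula rewards. The point is not the heuristic — it is that the CMS could run it on every save, rather than the editor pasting drafts into a separate tab.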
  
                  

    <div>
        <p>
          <span>That moment shaped what became the </span><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/scenarios/localgov-drupal/" target="_blank">LocalGov Drupal AI scenario</a></span><span> — one of a growing collection of "try before you buy" demonstrations we have been building through the </span><span><a href="https://ndx.digital.cabinet-office.gov.uk/" target="_blank">National Digital Exchange (NDX)</a></span><span>. It is our attempt to bridge the gap between what local government organisations need and what cloud services can already provide, if only someone would wire them together in a way that makes sense.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>The problem is not technology. It is confidence.</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>UK local government is stuck in a curious paradox. Research shows the vast majority of local authorities are actively exploring AI, yet cloud spending growth has flatlined. G-Cloud has existed for over twelve years, and many local government organisations have never used it. The barrier is not lack of interest — it is lack of confidence to evaluate and act.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://ndx.digital.cabinet-office.gov.uk/catalogue/tags/try-before-you-buy/" target="_blank">NDX:Try</a></span><span> exists to address precisely this. It provides UK local government organisations — from county councils and metropolitan boroughs to district authorities and parish councils — with free, time-limited AWS sandbox environments pre-loaded with realistic scenarios drawn from genuine local government needs. No procurement. No commitment. No risk. Just guided curiosity in an isolated environment that cleans itself up afterwards.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>However, an empty sandbox is rather like giving someone a fully equipped kitchen and no recipes. What we needed were scenarios — complete, deployable demonstrations that let a service manager or a technology lead experience the possibilities firsthand.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>A recipe, not a restaurant</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Perhaps the most instructive analogy comes from cooking (a domain I think about more than is strictly professional). When you follow a recipe for the first time, you are not just making dinner — you are building a mental model of how ingredients combine, what heat does to protein, why timing matters. A good recipe teaches you principles whilst producing something edible.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Our approach to the </span><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/scenarios/localgov-drupal/" target="_blank">LocalGov Drupal scenario</a></span><span> follows this pattern. We did not build a bespoke application. We took the standard </span><span>
      <a href="https://uk.linkedin.com/company/localgovdrupal" target="_blank">LocalGov Drupal</a>
  </span><span> — the same open-source CMS already used across local government by organisations like </span><span>
      <a href="https://uk.linkedin.com/company/croydon-council" target="_blank">Croydon Council</a>
  </span><span>, </span><span>
      <a href="https://uk.linkedin.com/company/brighton-&amp;-hove-city-council" target="_blank">Brighton &amp; Hove City Council</a>
  </span><span>, and </span><span>
      <a href="https://uk.linkedin.com/company/bracknell-forest-borough-council" target="_blank">Bracknell Forest Council</a>
  </span><span> — and layered seven AI capabilities onto it using </span><span>
      <a href="https://www.linkedin.com/company/amazon-web-services" target="_blank">Amazon Web Services (AWS)</a>
  </span><span> managed services. The architecture is deliberately transparent:</span>
        </p>
    </div>
  
                  

    <div>
        
    <pre><span>Users → CloudFront (HTTPS) → ALB → ECS Fargate → LocalGov Drupal
                                        ↓
                                   Aurora Serverless v2 (MySQL)
                                   EFS (persistent files)
                                        ↓
                              Bedrock · Polly · Translate
                              Textract · Rekognition
        </span></pre>
  
          </div>
  
                  

    <div>
        <p>
          <span>Every component maps to a service you could provision independently. Five </span><span><a href="https://aws.amazon.com/cdk/" target="_blank">CDK</a></span><span> constructs — networking, database, storage, compute, CDN — each doing one thing well. I must admit I find the simplicity slightly deceptive; a fair amount of hard-won learning is buried in those clean interfaces.</span>
        </p>
    </div>
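    <div>
        <p>
          <span>The composition itself is easy to sketch. The classes below are illustrative only: plain TypeScript stands in for the real aws-cdk-lib constructs, and none of the names are taken from the actual NDX codebase. The wiring pattern is the point, though: five small units, each owning one concern, assembled in dependency order.</span>
        </p>
    </div>

```typescript
// Hypothetical sketch of the five-construct composition described above.
// Plain classes are used so the wiring is visible without the CDK dependency;
// real code would extend Construct from aws-cdk-lib.

class Networking {
  readonly vpcId = "vpc-sketch"; // stands in for an ec2.Vpc
}

class Database {
  // Aurora Serverless v2 lives inside the VPC
  constructor(readonly net: Networking) {}
}

class Storage {
  // EFS for Drupal's persistent files, also inside the VPC
  constructor(readonly net: Networking) {}
}

class Compute {
  // ECS Fargate task, wired to both the database and the file system
  constructor(readonly net: Networking, readonly db: Database, readonly fs: Storage) {}
}

class Cdn {
  // CloudFront distribution in front of the ALB
  constructor(readonly app: Compute) {}
}

// The stack is nothing more than the five constructs in dependency order.
function buildStack() {
  const net = new Networking();
  const db = new Database(net);
  const fs = new Storage(net);
  const app = new Compute(net, db, fs);
  const cdn = new Cdn(app);
  return { net, db, fs, app, cdn };
}
```

    <div>
        <p>
          <span>In real CDK code each class would also receive a scope and an id, but the dependency-passing shape is identical: nothing reaches across the stack except through an explicit constructor argument.</span>
        </p>
    </div>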
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="The deployed LocalGov Drupal homepage showing AI-generated council identity and services" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFQ0FCWTByU0pQS2cvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaejNNOTNBSUVBWS0vMC8xNzczNjczODk4MDMzP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1rSXo5RkR3TFVLZ0V5MVFLeGNKblNlSVphYjFQMkQ5R2szTDJmdFBhUUVv">
          <figcaption>
            <span>The deployed LocalGov Drupal homepage showing AI-generated council identity and services</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>Seven features that meet real workflows</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>The seven AI capabilities were chosen not for technical impressiveness (though I think some of them are genuinely impressive) but for their proximity to what local government teams actually do every day:</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>
      </span></p><ol>
        
    <li><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/walkthroughs/localgov-drupal/step-4/" target="_blank">AI Content Editing</a></span><span> — Amazon Bedrock suggests improvements to draft content directly in the Drupal editor. No copy-pasting into a separate tool.</span></li>
    <li><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/walkthroughs/localgov-drupal/step-6/" target="_blank">Readability Simplification</a></span><span> — One click transforms complex policy language into plain English, targeting an appropriate reading age for public-facing content.</span></li>
    <li><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/walkthroughs/localgov-drupal/step-7/" target="_blank">Auto Alt-Text</a></span><span> — Upload an image, receive a description. This might sound modest until you realise most local government websites have thousands of images with empty alt attributes.</span></li>
    <li><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/walkthroughs/localgov-drupal/step-10/" target="_blank">Listen to Page</a></span><span> — Amazon Polly provides neural text-to-speech in seven languages, with speed controls and language selection. Accessibility compliance as a feature, not an afterthought.</span></li>
    <li><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/walkthroughs/localgov-drupal/step-11/" target="_blank">Content Translation</a></span><span> — Amazon Translate delivers 75+ languages instantly. For any authority serving diverse communities, this moves translation from a procurement exercise to an immediate capability.</span></li>
    <li><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/walkthroughs/localgov-drupal/step-8/" target="_blank">PDF-to-Web Conversion</a></span><span> — Amazon Textract extracts content from uploaded PDFs. Given the volume of PDF-only content on local government websites (and the accessibility problems that creates), this addresses a genuine gap.</span></li>
    <li><span>AI-Enhanced Search</span><span> — Bedrock-powered semantic search that understands what you meant, not just what you typed.</span></li>

      </ol>
  
        <p></p>
    </div>
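    <div>
        <p>
          <span>The "appropriate reading age" target in feature 2 is conventionally measured with a readability formula such as Flesch reading ease. A rough sketch of the idea follows; the syllable counter is a crude vowel-group heuristic for illustration, not what the scenario actually ships:</span>
        </p>
    </div>

```typescript
// Rough Flesch reading-ease scorer. Higher scores are easier to read.
// The syllable counter below is a simple vowel-group heuristic: good
// enough to show the principle, not production-grade.

function countSyllables(word: string): number {
  const groups = word.toLowerCase().match(/[aeiouy]+/g);
  return Math.max(1, groups ? groups.length : 1);
}

function fleschReadingEase(text: string): number {
  const sentences = Math.max(1, (text.match(/[.!?]+/g) || []).length);
  const words = text.split(/\s+/).filter(Boolean);
  const wordCount = Math.max(1, words.length);
  const syllables = words.reduce((n, w) => n + countSyllables(w), 0);
  // Standard Flesch reading-ease formula
  return 206.835 - 1.015 * (wordCount / sentences) - 84.6 * (syllables / wordCount);
}
```

    <div>
        <p>
          <span>Scores of 60 and above are conventionally treated as plain English; dense policy language routinely scores far lower, which is exactly the gap the one-click simplification feature exists to close.</span>
        </p>
    </div>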
  
                    
    

    
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="The AI writing assistant bar integrated into the Drupal content editor" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFIRVYycTluZ25xVVEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWnozTlVaWkpjQTQtLzAvMTc3MzY3Mzk5MDc5MD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9bThsUERjUzdmLVRhbmd6Wm5sT295WFBOYTI5UUs0enFkeDhRQTRCVXJ2TQ">
          <figcaption>
            <span>The AI writing assistant bar integrated into the Drupal content editor</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>What makes this interesting is the orchestration</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>However, I believe the most technically interesting aspect is not the AI integration itself but what happens when someone clicks deploy. A single CloudFormation template provisions roughly twenty AWS resources, then a container starts, installs Drupal, enables forty-odd LocalGov modules, configures the AI features, and then — the part that still makes me slightly nervous — invokes Amazon Bedrock to generate an entire fictional council identity.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Each deployment creates a unique fictional authority: a plausible name, a heraldic logo generated by Nova Canvas, service pages, news articles, and step-by-step guides. You are not exploring a generic demo with placeholder text. You are exploring what feels like a real local government website, complete with locally-flavoured content about bins, planning applications, and library services. The user sees a live progress page while this happens, which I think strikes the right balance between transparency and not overwhelming people with details they did not ask for.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="CloudFormation stack outputs showing the Drupal URL and credentials" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGSXZtRmdGU0E2MGcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWnozTmw1Zkd3QWMtLzAvMTc3MzY3NDA2MjI4Nz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9Q0dfNlRrcTdjNEIxckJrRnNELXpueDJHdFhRZjl2aXBzdVdibC1iWkpERQ">
          <figcaption>
            <span>CloudFormation stack outputs showing the Drupal URL and credentials (no longer current)</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>Three things we learnt</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Three lessons from building this that I think generalise beyond our specific context:</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>First, </span><span>the gap between "working" and "useful to someone else" is wider than you expect</span><span>. The infrastructure code is perhaps 200 lines of TypeScript. The initialisation script that makes it usable by a non-technical person is over 1,000 lines. The ratio tells you where the real complexity lives: not in the cloud services, but in the human experience of using them.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Second, </span><span>AI features are only as good as their integration points</span><span>. We initially tried building a sophisticated CKEditor plugin for the AI writing assistant. It needed webpack, which complicated the container build. The simpler approach — injecting toolbar buttons via Drupal's hook system — delivered identical functionality with perhaps a tenth of the complexity. Sometimes the less technically sophisticated path is simply the correct one.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Third, </span><span>people need to experience things to believe them</span><span>. We can write about AI-generated alt-text or one-click translation until we are blue in the face. It does not land until someone uploads their own image, sees the description appear, and thinks: "oh, that is actually quite good." NDX:Try exists because of this insight. Reading about cloud services and using cloud services are fundamentally different experiences, and the gap between them is where institutional inertia lives.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Returning to the kitchen</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>It seems to me likely that what we have built is less a product and more a recipe card. The ingredients are all managed services; the method is infrastructure as code; the result is something a council can taste, understand, and — critically — decide whether to cook themselves. Perhaps the content editor I watched struggling with her Flesch score tab will never use our specific implementation. But if she deploys this scenario and sees her CMS suggest clearer phrasing, generate alt-text automatically, and offer page content in Urdu, Polish, or Welsh with a single click — perhaps that changes what she expects her tools to do for her.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>We should not pretend that deploying AI into a CMS solves the structural challenges facing local government. It does not. But it does something arguably more valuable at this stage: it makes the abstract concrete, the theoretical practical, and the unfamiliar safe to explore.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If you work in UK local government</span><span>, you can try this scenario (and several others) for free through </span><span><a href="https://ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span>. No procurement, no cost, no commitment — just a sandbox and 24 hours.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If you work elsewhere in the public sector</span><span> (or are simply curious about what we are building), drop us a line at </span><span><a href="mailto:ndx@dsit.gov.uk" target="_blank">ndx@dsit.gov.uk</a></span><span>. We would genuinely like to hear from you.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>Asking 24,500 Repositories a Question</title>
      <link>https://blog.cns.me/posts/asking-24500-repositories-question-chris-nesbitt-smith-fo4we/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/asking-24500-repositories-question-chris-nesbitt-smith-fo4we/</guid>
      <pubDate>Mon, 23 Mar 2026 08:15:11 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGaW51MnIwOXNWWkEvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWno0Y1dDcklzQUktLzAvMTc3MzY5NDcxMTU1MD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9TjRlckZGdzFPRllQeGdrN2FIeTlhUXpVY01OZWh3NG9JN216VXpxQ1Jucw" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>In 1945, </span><span><a href="https://en.wikipedia.org/wiki/Vannevar_Bush" target="_blank">Vannevar Bush</a></span><span> published an essay called </span><span><a href="https://en.wikipedia.org/wiki/As_We_May_Think" target="_blank">As We May Think</a></span><span> in The Atlantic. In it, he described a hypothetical device called the </span><span><a href="https://en.wikipedia.org/wiki/Memex" target="_blank">Memex</a></span><span> — a desk-sized machine that could store an entire library and, crucially, allow its user to create trails of association between documents. The problem Bush identified was not that knowledge didn't exist, but that we had no good way of finding the specific piece we needed when we needed it. "The summation of human experience is being expanded at a prodigious rate," he wrote, "and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships."</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Portrait of Vannevar Bush" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHSnRhR1BIaWp2YlEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaejRkTHNBSkFBUS0vMC8xNzczNjk0OTI2OTAzP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1yTWN0ZmwyUDM0ZE5JME8yR2QteUx0bmtaenhjUUU5clpocnpOcXFzUW9z">
          <figcaption>
            <span>Vannevar Bush during his time in the Office for Emergency Management (part of the US Government during World War II) https://commons.wikimedia.org/w/index.php?curid=1633052</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>I was reminded of Bush's Memex last week, after writing about the </span><span><a href="https://uk-x-gov-software-community.github.io/xgov-opensource-repo-scraper/" target="_blank">X-UK-Gov Public Repository Leaderboard</a></span><span> and how we now have over 24,500 public repositories across UK government organisations, complete with Software Bills of Materials. The response was gratifying — a number of people got in touch to say they found the data useful, or that they hadn't realised the scale of what was out there. But one question kept coming up, in various forms: "This is interesting, but how do we actually </span><span>find</span><span> the thing we need in 24,500 repositories?"</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFIRy1hNUJoRmJFM1EvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaejRkeFVHSkFBUS0vMC8xNzczNjk1MDgwNjk5P2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD10QnhJS0VxRjVKZGw5ZU5mZ3d1eG5qTnRRb2ZTUEV5QkpJdkQ4dVdMckFZ">
          <figcaption>
            <span>Screenshot of the X-UK-Gov Public Repository Leaderboard</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>It's a fair question. A sorted table is a fine thing (I should know, I've been running one since 2018), but it's not particularly useful if you're a developer in a council trying to work out whether anyone else in government has already solved the problem you're about to spend three months building. We have, in effect, built an enormous library and forgotten to hire a librarian.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I think that's the real gap we've been living with. The code is in the open. The data is available. But discoverability has been, frankly, terrible.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Searching With Intent</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Which is why we now have </span><span><a href="https://github.com/chrisns/govreposcrape" target="_blank">govreposcrape</a></span><span> — a semantic search layer over the entire UK government open source estate. It uses </span><span><a href="https://cloud.google.com/enterprise-search" target="_blank">Google's Vertex AI Search</a></span><span> to index all 24,500+ repositories (and growing), and exposes the results through an API and, more interestingly, through a </span><span><a href="https://modelcontextprotocol.io/specification" target="_blank">Model Context Protocol</a></span><span> (MCP) server.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>For those not yet familiar with MCP, it's an </span><span><a href="https://github.com/modelcontextprotocol/specification" target="_blank">open standard</a></span><span> (developed by </span><span><a href="https://www.anthropic.com/" target="_blank">Anthropic</a></span><span>, but designed to be universal) that allows AI assistants to connect to external data sources and tools. In practical terms, it means you can add a single configuration block to </span><span><a href="https://claude.ai/download" target="_blank">Claude Desktop</a></span><span>, </span><span><a href="https://github.com/features/copilot" target="_blank">GitHub Copilot</a></span><span>, or any </span><span><a href="https://modelcontextprotocol.io/clients" target="_blank">MCP-compatible tool</a></span><span>, and your AI assistant suddenly has the ability to search across every public UK government repository as part of its normal workflow.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The setup takes about two minutes. You add a few lines of JSON to your configuration, restart the application, and from that point on you can ask natural language questions about what exists across government code. No API keys to manage, no authentication to configure — the service is freely available.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>And yet.</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>The interesting thing is not the technology. It's what happens when we actually use it.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>"Who Else Has Done This?"</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>I think the most powerful question a developer in government can ask is not "how do I build this?" but "has someone already built this?" So I tried it, asking the kind of questions that come up regularly from teams across the public sector. The results were genuinely illuminating.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The first scenario comes up constantly: </span><span>"I'm thinking of building a case management system that will OCR text and help the user process casework documents faster. Who else in government has done that? What components can we reuse?"</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The response surfaced projects that none of us would have found through keyword searching. Bath &amp; North East Somerset's </span><span><a href="https://github.com/BathnesDevelopment/processmaker-3.1.2.b2-community" target="_blank">ProcessMaker</a></span><span> implementation — a complete workflow and case management system. Dorset Council's </span><span><a href="https://github.com/doraboracz/FloodOnlineReportingTool.Public" target="_blank">Flood Online Reporting Tool</a></span><span>, which is essentially a case intake and processing system. East Sussex County Council's </span><span><a href="https://github.com/east-sussex-county-council/Escc.FormControls.WebForms" target="_blank">form controls</a></span><span> and </span><span><a href="https://github.com/east-sussex-county-council/Escc.DatabaseFileControls" target="_blank">database file handling</a></span><span> libraries. York's </span><span><a href="https://github.com/YorkDevelopmentServices/Lucene-Entity-Search-Tools" target="_blank">Lucene Entity Search Tools</a></span><span> for text processing.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>None of these are a drop-in solution (they rarely are), but together they represent a considerable body of work that teams could learn from, adapt, or build upon. The alternative — starting from scratch in ignorance of what already exists — is the kind of waste that we should find genuinely frustrating when public money is involved.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>"How Should We Store an Address?"</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>The second question was more specific, and perhaps more revealing: </span><span>"How should we store an address for portability with other UK government organisations?"</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>This is one of those problems that every government service encounters, and which (I must admit) I assumed would have been solved definitively years ago. The search surfaced East Sussex County Council's </span><span><a href="https://github.com/east-sussex-county-council/Escc.AddressAndPersonalDetails" target="_blank">address and personal details library</a></span><span>, which implements the </span><span><a href="https://www.geoplace.co.uk/addresses-streets/addresses/bs7666" target="_blank">BS7666</a></span><span> British Standard for addresses and uses </span><span><a href="https://www.geoplace.co.uk/addresses-streets/addresses/the-uprn" target="_blank">UPRNs</a></span><span> (Unique Property Reference Numbers) as the canonical identifier. Dorset Council's </span><span><a href="https://github.com/nicktimmermans/GdsBlazorComponents" target="_blank">GDS Blazor Components</a></span><span> showed </span><span><a href="http://GOV.UK" target="_blank">GOV.UK</a></span><span> Design System-compliant address entry patterns.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The emerging consensus from the code is clear: use the BS7666 standard, always include the UPRN, split addresses into structured fields rather than storing them as free text, validate postcodes properly, and design for integration with </span><span><a href="https://www.ordnancesurvey.co.uk/products/addressbase" target="_blank">Ordnance Survey AddressBase</a></span><span>. It seems to me that this is exactly the kind of practical, hard-won knowledge that should be easy to find — and until now, it really hasn't been.</span>
        </p>
    </div>
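    <div>
        <p>
          <span>That consensus maps naturally onto a data shape. The sketch below is illustrative: the field names are simplified stand-ins rather than the exact BS7666 element names, and the postcode check is the common loose pattern, not the full rules:</span>
        </p>
    </div>

```typescript
// Illustrative structured address record following the consensus above:
// UPRN as the canonical identifier, structured fields rather than free
// text, shaped to line up with Ordnance Survey AddressBase. Field names
// are a sketch, not the verbatim BS7666 schema.

interface UkAddress {
  uprn: string;      // Unique Property Reference Number (canonical key)
  saon?: string;     // secondary addressable object, e.g. "Flat 2"
  paon: string;      // primary addressable object, e.g. "10" or a building name
  street: string;
  locality?: string;
  town: string;
  postcode: string;
}

// Simplified UK postcode shape check. The full rules (and the edge cases
// in the Royal Mail address file) are more involved than one regex.
function isValidPostcode(pc: string): boolean {
  return /^[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}$/i.test(pc.trim());
}
```

    <div>
        <p>
          <span>Keeping the UPRN as the canonical key is what makes records portable: two organisations can format the human-readable lines differently and still join on the same property.</span>
        </p>
    </div>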
  
                    
    

    
  
                  

    <div>
          <h2>
            <span>From Catalogue to Conversation</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>It seems to me likely that this is what "reuse" actually looks like in practice. Not a central repository of blessed components (we've tried that approach before, and it tends to go stale), but a searchable, AI-augmented view of what teams are actually building and shipping. The catalogue provides the data. The SBOM collection provides the dependency graph. And the semantic search layer makes it conversational — we can ask questions in the language of the problem we're trying to solve, rather than needing to know which repository to look in.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>However, I believe there's something more significant happening here than just better search. When an AI assistant can draw on the entire UK government open source estate as context, it changes the nature of the conversation between developers and their tools. Instead of "write me a function that validates a postcode," the question becomes "show me how other government services validate postcodes, and what patterns have they settled on." The answer comes with provenance, with links to real implementations running in production today, maintained by teams who face the same constraints and compliance requirements we do.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(I should note that the search is only as good as what's in the open. If your department hasn't published its code, it can't be found, it can't be reused, and the rest of government can't benefit from it. Yet another reason to code in the open.)</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Getting Started</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>If you'd like to try this yourself, the </span><span><a href="https://github.com/chrisns/govreposcrape" target="_blank">govreposcrape repository</a></span><span> has full setup instructions. For Claude Desktop, the configuration is minimal:</span>
        </p>
    </div>
  
                  

    <div>
        
    <pre><span>{
  "mcpServers": {
    "govscraperepo": {
      "url": "https://govreposcrape-api-1060386346356.us-central1.run.app/mcp",
      "description": "UK Government code discovery - semantic search over 24k government repositories"
    }
  }
}
        </span></pre>
  
          </div>
  
                  

    <div>
        <p>
          <span>The API is also available directly at the </span><span><a href="https://govreposcrape-api-1060386346356.us-central1.run.app/" target="_blank">production endpoint</a></span><span> with an </span><span><a href="https://govreposcrape-api-1060386346356.us-central1.run.app/openapi.json" target="_blank">OpenAPI specification</a></span><span> for building your own integrations. The service is free, requires no authentication, and the </span><span><a href="https://github.com/chrisns/govreposcrape" target="_blank">source code is open</a></span><span> (naturally).</span>
        </p>
    </div>
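    <div>
        <p>
          <span>For anyone building an integration, the starting point is a small client. To be clear about what is and is not real here: the base URL below is the production endpoint linked above, but the /search path and the q parameter are hypothetical, invented for illustration; the openapi.json is the authority on the actual routes and parameters.</span>
        </p>
    </div>

```typescript
// Hypothetical client sketch. The base URL is the real production endpoint;
// the "/search" path and "q" parameter are assumptions for illustration
// only. Check the service's openapi.json for the actual contract.

const BASE = "https://govreposcrape-api-1060386346356.us-central1.run.app";

function buildSearchUrl(question: string): string {
  const u = new URL("/search", BASE);  // hypothetical path
  u.searchParams.set("q", question);   // hypothetical parameter
  return u.toString();
}
```

    <div>
        <p>
          <span>From there it is a single fetch call away from any script, spreadsheet plugin, or internal tool that wants to ask "has someone already built this?" before a team starts building.</span>
        </p>
    </div>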
  
                  

    <div>
        <p>
          <span>I should be upfront: this is running on </span><span><a href="https://cloud.google.com/run" target="_blank">Google Cloud Run</a></span><span> backed by </span><span><a href="https://cloud.google.com/enterprise-search" target="_blank">Vertex AI Search</a></span><span>, and I have genuinely no idea what the operating costs will settle at. This is, after all, a side quest of a side quest — built in spare time alongside spare time. I intend to keep it running for as long as I can, but if it starts costing more than a few quid a month I may need to scale it back or take it down. The source code will always be there if someone with a more generous cloud budget wants to pick it up.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Reuse in Practice: The National Digital Exchange</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>This is, in many ways, the same problem I'm trying to solve in my day job. I work on the </span><span><a href="https://www.gov.uk/government/news/one-stop-shop-for-tech-could-save-taxpayers-12-billion-and-overhaul-how-government-buys-digital-tools" target="_blank">National Digital Exchange</a></span><span> (NDX), and one of the things we've built is </span><span><a href="https://ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span> — a platform that gives local government organisations free cloud sandboxes to experiment with, no procurement required. The idea is simple: lower the barrier to trying things out. Over 50 organisations are currently using it to explore everything from council chatbots to planning application AI to FOI redaction tools, and every use case is shared openly for others to learn from.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I think the thread connecting all of this — the leaderboard, the SBOMs, the semantic search, NDX:Try — is the same one that's run through my career in public sector technology: we should make it easier for government teams to find, try, and reuse what already exists, rather than building everything from scratch behind closed doors. NDX:Try lowers the barrier to experimentation. The leaderboard and govreposcrape lower the barrier to discovery. They're different tools solving different parts of the same problem.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If you're in local government (or any department with an interesting use case you're willing to share), I'd genuinely encourage you to </span><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/" target="_blank">have a look at NDX:Try</a></span><span>. There's a two-minute quiz to find relevant scenarios, or you can browse the full catalogue. It's free, the environments are completely isolated from production, and everything is cleaned up automatically afterwards.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>What Comes Next</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>I think we're at the beginning of something genuinely useful. With over 24,500 repositories catalogued, SBOMs mapping out the dependency landscape, semantic search making the whole thing queryable in natural language, and platforms like NDX:Try making it possible to experiment without a procurement exercise, the infrastructure for meaningful cross-government reuse is starting to exist in a way that it simply hasn't before.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Perhaps the most important shift is a cultural one. We've spent years making the case for coding in the open — and that case has been won, at least in principle. The next challenge is making sure that openness translates into actual reuse, actual collaboration, actual reduction in duplicated effort. That requires making our code not just available but </span><span>findable</span><span>, not just findable but </span><span>understandable</span><span> in context, and not just understandable but </span><span>tryable</span><span> without a six-month procurement cycle.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>We're not there yet. But we're closer than we've ever been, and I'm cautiously optimistic that the combination of open data, SBOMs, AI-powered search, and free experimentation platforms might be the thing that finally bridges the gap between "we publish our code" and "we build on each other's work."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Bush imagined his Memex as a desk for one person. What we're building is something rather more collaborative — a shared memory for everyone building public services in the open. Perhaps we should keep going.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Links</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>
      </span></p><ul>
        
    <li><span><a href="https://github.com/chrisns/govreposcrape" target="_blank">govreposcrape on GitHub</a></span><span> — source code and setup instructions</span></li>
    <li><span><a href="https://govreposcrape-api-1060386346356.us-central1.run.app/" target="_blank">Production API</a></span><span> — free, no auth required</span></li>
    <li><span><a href="https://uk-x-gov-software-community.github.io/xgov-opensource-repo-scraper/" target="_blank">X-UK-Gov Public Repository Leaderboard</a></span><span> — the underlying catalogue</span></li>
    <li><span><a href="https://ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span> — free cloud sandboxes for local government</span></li>
    <li><span>Previous article: 24,500 Repositories Later — the story behind the leaderboard</span></li>

      </ul>
  
        <p></p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>Twenty-four Thousand Reasons to Code in the Open</title>
      <link>https://blog.cns.me/posts/twenty-four-thousand-reasons-code-open-chris-nesbitt-smith-tn5me/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/twenty-four-thousand-reasons-code-open-chris-nesbitt-smith-tn5me/</guid>
      <pubDate>Thu, 19 Mar 2026 08:30:05 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGSHlva2FKMHZtSncvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWno2eEY2bEd3QUktLzAvMTc3MzczMzcwMzIzMT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9OG5yQVR5S0dscUdzMzE3emE5Q2VQNUN3TjE4NFNMSEYtcm9OYWpFSl9jUQ" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>I learnt this week that the very first commit to a side project of mine was made at twenty to one in the morning on a Saturday in October 2018. I have no real recollection of what prompted that particular late-night coding session (the git log, as ever, is more reliable than my memory), but I do know the context. I was a tech lead in the office of the CTO at the Home Office, and I was looking for some hard numbers to support a case I kept having to make: that open source in UK public sector was not only happening, but thriving.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The tool I built that night was a scraper. It pulled the official list of UK government organisations from GitHub, fetched all their public repositories, and rendered the results in a simple table. AngularJS 1.5, Bootstrap 3, a bit of inline CSS. It worked. And — rather remarkably — it continued to work for the best part of eight years with almost no maintenance.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I think that fact alone tells you something about the nature of side projects built in the small hours.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>From Hundreds to Twenty-Four Thousand</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>When I first assembled that leaderboard, the number of public repositories across UK government organisations was perhaps in the low thousands. It seemed to me at the time that this was already a success story worth telling, particularly given the resistance I often encountered when encouraging departments to code in the open. Today, that number stands at over 24,000 repositories across 186 organisations. The growth has been extraordinary, and I think it reflects a genuine shift in culture across UK public sector technology.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I should be honest: I have been an enthusiastic (perhaps occasionally tiresome) advocate for open source throughout my career in and around government. I have walked into departments and personally evangelised the benefits of coding in the open, sometimes to receptive audiences and sometimes to rooms full of politely sceptical faces. The typical objections tend to follow familiar patterns: that the work is somehow exempt from the Service Standard; that what they're building is so unique and specialised that sharing it would be meaningless; or — the most persistent myth — that publishing source code creates unacceptable security risks.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Fortunately, there are people considerably smarter than I who have written at length about why these objections don't hold up. The government's own </span><span><a href="https://www.gov.uk/government/publications/open-source-guidance/security-considerations-when-coding-in-the-open" target="_blank">security considerations guidance</a></span><span> is unequivocal: open code can be "just as secure or more secure than closed code," and security through obscurity "is considered insufficient by security experts." The </span><span><a href="https://www.gov.uk/government/publications/open-source-guidance/when-code-should-be-open-or-closed" target="_blank">guidance on when code should be open or closed</a></span><span> limits the exceptions to three narrow categories: keys and credentials, fraud detection algorithms, and unreleased policy. Everything else should be open. They use a rather good padlock analogy: everyone knows how a padlock works, but it's still secure because you cannot open it without the key.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>And when my own arguments have fallen short, friends at the National Cyber Security Centre have always been generous in supporting the case. The NCSC's guidance on </span><span><a href="https://www.ncsc.gov.uk/collection/developers-collection/principles/protect-your-code-repository" target="_blank">protecting code repositories</a></span><span> takes a pragmatic approach — acknowledging that coding in the open requires good security practices (automated testing, peer reviews), while their position on </span><span><a href="https://www.ncsc.gov.uk/information/secure-default" target="_blank">secure by default</a></span><span> design explicitly states that security through obscurity should be avoided. At a </span><span><a href="https://technology.blog.gov.uk/2017/10/10/open-source-security-meetup-7-things-we-learned-from-the-cross-government-event/" target="_blank">cross-government open source security meetup</a></span><span> back in 2017, an NCSC panel agreed that "open code is not more or less secure than closed code" — what matters is writing clean code, employing peer reviews, and developing a team culture that thinks like an attacker.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Opening First Repositories</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Perhaps the thing I'm most proud of in this space is that several organisations opened their very first public GitHub repositories during my tenure working with them. The </span><span><a href="https://github.com/bank-of-england" target="_blank">Bank of England</a></span><span> (which now has 11 public repositories) and the </span><span><a href="https://github.com/CPS-Innovation" target="_blank">Crown Prosecution Service</a></span><span> (with 52 public repositories and counting) are amongst those that took what can feel like a significant step. The conversation that leads to that first commit is always interesting — there is genuine nervousness, a sense that publishing code is somehow irreversible and dangerous. But once that first repository is public, something shifts. Teams realise that the sky has not fallen in, and the benefits (accountability, collaboration, reduced duplication) start to become visible.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Coding in the Open vs Truly Open Sourcing</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>I do think it's important to acknowledge a distinction that often gets lost in these conversations. There is a meaningful difference between coding in the open — making your source code publicly visible — and truly open sourcing, which means actively accepting contributions from outside your organisation.</span>
        </p>
    </div>
  
                  

    <div>
        <h3><span>And yet.</span></h3>
    </div>
  
                  

    <div>
        <p>
          <span>Truly open sourcing remains inherently challenging in UK public sector. In my experience, external contributions to government repositories are vanishingly rare. I have only received outside pull requests on a couple of occasions across all the public sector repos I've been involved with. So the risk that departments worry about — being overwhelmed by external contributions, or having to manage a community — is largely theoretical.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>However, I believe that coding in the open is an enormous step in the right direction regardless. As </span><span>
      <a href="https://uk.linkedin.com/in/annashipman" target="_blank">
        Anna Shipman
      </a>
  </span><span>  wrote in her excellent GDS blog post on </span><span><a href="https://gds.blog.gov.uk/2017/09/04/the-benefits-of-coding-in-the-open/" target="_blank">the benefits of coding in the open</a></span><span>, the Skills Funding Agency once </span><span><a href="https://sfadigital.blog.gov.uk/2016/11/17/when-build-a-thing-really-works/" target="_blank">built a tool in a week</a></span><span> instead of two months by reusing GDS code they found on GitHub. That kind of serendipitous reuse only happens when code is visible. It sets organisations up for the possibility of genuine collaboration when the time is right. It makes reuse possible even when active contribution isn't happening. And it creates a culture of transparency that has knock-on effects throughout a team's way of working. ( </span><span>
      <a href="https://uk.linkedin.com/in/jystewart" target="_blank">
        James Stewart
      </a>
  </span><span>'s </span><span><a href="https://gds.blog.gov.uk/2012/10/12/coding-in-the-open/" target="_blank">original 2012 GDS post on coding in the open</a></span><span> remains a remarkably good articulation of why this matters, and I find myself returning to it regularly.)</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>What GDS Assessments Taught Me</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>As a </span><span><a href="https://www.gov.uk/service-manual/service-assessments" target="_blank">GDS Service Standard assessor</a></span><span>, I have sat through a good number (53) of service assessments, and I have noticed a consistent pattern: teams that meet </span><span><a href="https://www.gov.uk/service-manual/service-standard/point-12-make-new-source-code-open" target="_blank">point 12 of the Service Standard</a></span><span> — "make new source code open" — tend to be teams that are getting other things right as well. Point 12 requires teams to make all new source code open and reusable, published under appropriate licences. The </span><span><a href="https://www.gov.uk/service-manual/technology/making-source-code-open-and-reusable" target="_blank">detailed guidance on making source code open and reusable</a></span><span> goes further, recommending that teams start open from day one rather than trying to retroactively open existing code. The </span><span><a href="https://www.gov.uk/guidance/be-open-and-use-open-source" target="_blank">Technology Code of Practice</a></span><span> reinforces this at point 3: "be open and use open source."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>It's not just about the code itself — it's a cultural signal. These teams are typically aligned to the Service Standard more broadly. They tend to be progressive and forward-thinking, working with an appropriate understanding of security and privacy rather than imagining that they are somehow uniquely risk-averse.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>It seems to me that openness in code reflects an openness in mindset, and that correlation is strong enough that I've come to view it as a reliable leading indicator during assessments.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>The Upgrade: From AngularJS to Something That Actually Scales</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Which brings me to the reason for writing this piece. After nearly eight years of that original AngularJS 1.5 frontend faithfully rendering its table (a testament, perhaps, to the durability of simple things), I have finally given the leaderboard a significant upgrade.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The original version was written entirely without AI coding assistance — in what now feels like a distant era, though it was really only 2018. The irony is that it's precisely because of AI coding assistance that I've been able to make these improvements as a sideline activity alongside my day job on the </span><span><a href="https://www.gov.uk/government/news/one-stop-shop-for-tech-could-save-taxpayers-12-billion-and-overhaul-how-government-buys-digital-tools" target="_blank">National Digital Exchange</a></span><span> (NDX). </span><span><a href="https://ndx.digital.cabinet-office.gov.uk/catalogue/tags/try-before-you-buy/" target="_blank">NDX:Try</a></span><span> is currently providing free cloud sandboxes to local government and other departments with interesting use cases they're willing to share openly — essentially saying "</span><span>here's an AWS environment, go experiment</span><span>" with zero procurement overhead. Over 50 organisations are currently evaluating cloud services through the platform, working on everything from </span><span><a href="https://ndx.digital.cabinet-office.gov.uk/catalogue/aws/council-chatbot/" target="_blank">council chatbots</a></span><span> to </span><span><a href="https://ndx.digital.cabinet-office.gov.uk/catalogue/aws/planning-ai/" target="_blank">planning application AI</a></span><span> to </span><span><a href="https://ndx.digital.cabinet-office.gov.uk/catalogue/aws/foi-redaction/" target="_blank">FOI redaction tools</a></span><span>. It's the sort of practical, unglamorous infrastructure that I think genuinely accelerates digital transformation.</span>
        </p>
    </div>
  
                    
    

    
  
                  

    <div>
        <p>
          <span>But I digress (a privilege of the side project blog post).</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>The old interface looked like this:</span>
          </h2>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHckZFaFN6bVlteHcvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaejZ4RjdaSUVBUS0vMC8xNzczNzMzNjk5ODQzP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1PV3M1RjZVRHFGaTFqYmpDUTBDRDRDZ2xFd1hscDcwUmlPdGdaQ1J6OE04">
          <figcaption>
            <span>Old Angular 1 table view of leaderboard</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>And the new one looks like this:</span>
          </h2>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFITXFIM2VXU1BraVEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaejZ4RjdSR2tBUS0vMC8xNzczNzMzNjk5Njg5P2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1xTmdyT0FGQkVSeWJnd2RnYmp6cmVQeEdQTjFwLUJfYnJ5M3J3b1V3TmZR">
          <figcaption>
            <span>New version with graphs and charts, and faster loading</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>The new frontend streams 24,000 repositories through a custom JSON parser, renders them in a virtual-scrolling table (so the browser doesn't choke trying to create 24,000 DOM nodes, which is essentially what the old Angular version did), and includes a collapsible dashboard with live statistics: repository counts, star distributions, language breakdowns, top organisations, license analysis, and activity trends over time. No frameworks, no dependencies — just vanilla JavaScript, CSS, and inline SVG.</span>
        </p>
    </div>
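The windowing arithmetic behind that virtual-scrolling table is simple to sketch. The real frontend is vanilla JavaScript, but the core calculation — deciding which slice of the 24,000 rows actually needs DOM nodes at any moment — looks roughly like this (the function name, row height, and overscan figure are all illustrative, not from the actual code):

```python
# Hypothetical sketch of virtual-scrolling windowing: only the rows
# intersecting the viewport (plus a small overscan buffer) get DOM nodes;
# the rest of the scroll height is empty space.

def visible_window(scroll_top, viewport_height, row_height, total_rows, overscan=5):
    """Return (first_index, last_index, offset_px) for the rows to render."""
    first = max(0, scroll_top // row_height - overscan)
    last = min(total_rows, (scroll_top + viewport_height) // row_height + 1 + overscan)
    offset = first * row_height  # translateY applied to the rendered slice
    return first, last, offset

# 24,000 rows of 32px, a 600px viewport, scrolled 100,000px down:
first, last, offset = visible_window(100_000, 600, 32, 24_000)
# Only last - first rows (a few dozen) exist in the DOM at any moment.
```

However far you scroll, the browser only ever holds a viewport's worth of rows — which is why the new version copes where the render-everything Angular table choked.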
  
                  

    <div>
          <h2>
            <span>The Real Change: Software Bills of Materials</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>However, the more significant addition is not the user interface. We are now collecting </span><span><a href="https://www.ncsc.gov.uk/blog-post/sboms-and-the-importance-of-inventory" target="_blank">Software Bills of Materials</a></span><span> (SBOMs) for every repository in the dataset.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>For those unfamiliar with the concept, an SBOM is essentially an ingredients list for software. Just as food packaging tells you what's in your sandwich, an SBOM tells you what dependencies a piece of software relies upon — every library, every framework, every transitive dependency, with version numbers and licence information. The concept has gained considerable traction in recent years (particularly following various high-profile supply chain incidents), and GitHub now generates them automatically for any public repository through their dependency graph API.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>We are collecting these SBOMs incrementally (it takes some time to process over 24,000 repositories at a pace that respects GitHub's API rate limits) and publishing them alongside the existing repository data. You can browse and download </span><span><a href="https://www.uk-x-gov-software-community.org.uk/xgov-opensource-repo-scraper/sbom/gchq/CyberChef.json.gz" target="_blank">individual SBOMs</a></span><span> from the </span><span><a href="https://uk-x-gov-software-community.github.io/xgov-opensource-repo-scraper/" target="_blank">leaderboard site</a></span><span>, and the complete dataset is available as compressed SPDX JSON files.</span>
        </p>
    </div>
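For anyone who wants to poke at the files programmatically, unpacking one is a few lines. This is a minimal sketch assuming the shape of GitHub's dependency-graph export — a gzipped SPDX JSON body whose `packages` entries carry `name` and `versionInfo` — and the sample data below is invented, not drawn from the real dataset:

```python
import gzip
import json

# Minimal sketch: decompress a per-repository SBOM file and list its
# packages. Field names assume the SPDX JSON shape GitHub exports.

def read_sbom(gz_bytes: bytes) -> list[tuple[str, str]]:
    doc = json.loads(gzip.decompress(gz_bytes))
    return [(p.get("name", "?"), p.get("versionInfo", "?"))
            for p in doc.get("packages", [])]

# A toy two-package SPDX fragment standing in for a downloaded .json.gz:
sample = gzip.compress(json.dumps({
    "spdxVersion": "SPDX-2.3",
    "packages": [
        {"name": "npm:lodash", "versionInfo": "4.17.21"},
        {"name": "pip:requests", "versionInfo": "2.31.0"},
    ],
}).encode())

print(read_sbom(sample))
# [('npm:lodash', '4.17.21'), ('pip:requests', '2.31.0')]
```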
  
                  

    <div>
        <p>
          <span>I think this is where things get genuinely interesting. With SBOMs for thousands of public sector repositories, we can start to explore some real questions about commonality and reuse across UK government technology. Which libraries are most widely shared? Where are there clusters of organisations solving the same problems independently? What does the dependency landscape actually look like at scale? And yes — there are security implications too, in terms of understanding exposure to specific vulnerable dependencies across the estate.</span>
        </p>
    </div>
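The first of those questions — which packages recur across the most repositories — reduces to a simple counting exercise once the SBOMs are boiled down to per-repository dependency lists. A rough sketch (all repository and package names below are invented for illustration):

```python
from collections import Counter

# Hypothetical sketch of cross-repository dependency analysis: for each
# package, count how many distinct repositories depend on it.

def shared_dependencies(sboms: dict[str, list[str]]) -> Counter:
    counts = Counter()
    for deps in sboms.values():
        counts.update(set(deps))  # de-duplicate within a single repo
    return counts

sboms = {
    "dept-a/service-one": ["express", "lodash", "jest"],
    "dept-b/service-two": ["express", "axios", "jest", "jest"],
    "dept-c/service-three": ["express", "lodash"],
}
print(shared_dependencies(sboms)["express"])
# → 3, i.e. every one of the three example repos depends on it
```

Run over thousands of real SBOMs rather than three toy ones, the same tally starts to show where the genuinely shared foundations of government software lie.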
  
                  

    <div>
        <p>
          <span>I should note that I am not revealing anything that a motivated adversary couldn't discover independently. Everything here is based on publicly available information, and GitHub's own dependency graph is accessible to anyone. I'm simply aggregating what's already in the open. (The breadth of my ignorance about what might constitute a genuine security concern expands every day, but I'm reasonably confident on this point. The NCSC has </span><span><a href="https://www.ncsc.gov.uk/blog-post/sboms-and-the-importance-of-inventory" target="_blank">written specifically about SBOMs and the importance of inventory</a></span><span>, advocating for exactly this kind of transparency in software supply chains.)</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>What the Numbers Tell Us</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>The dashboard tells some interesting stories even at a glance. JavaScript, Python, and HTML dominate the language landscape. The Ministry of Justice leads with over 2,400 public repositories, followed by HMCTS, HMRC, and DEFRA. Nearly 46% of repositories use the MIT licence, but a concerning 32% have no licence at all — which technically means they're published but not actually open source in any meaningful legal sense. (Perhaps we should do something about that.) Around 38% of repositories show recent activity, while roughly 35% are archived. GCHQ's CyberChef remains the standout star with over 34,000 GitHub stars, which I think is a wonderful example of a government-produced tool that has found genuine utility far beyond its original context.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The growth curve is telling too. Repository creation accelerated sharply from around 2014, peaked in 2018-2019, and has maintained a steady pace since. Push activity, interestingly, continues to climb — 2025 and 2026 are showing the highest activity levels yet, suggesting that the sector is not just creating repositories but actively maintaining them.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Grace Hopper Would Approve</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://en.wikipedia.org/wiki/Grace_Hopper" target="_blank">Grace Hopper</a></span><span> — who popularised the term "</span><span><a href="https://en.wikipedia.org/wiki/Debugging" target="_blank">debugging</a></span><span>" after her team found an actual moth in a relay of the </span><span><a href="https://en.wikipedia.org/wiki/Harvard_Mark_II" target="_blank">Mark II</a></span><span> computer — was a fierce advocate for sharing and reuse long before the term "</span><span><a href="https://en.wikipedia.org/wiki/Open_source" target="_blank">open source</a></span><span>" existed. She famously said that the most dangerous phrase in the language was "we've always done it this way." I think she would have appreciated the quiet revolution happening across UK public sector: thousands of teams, in departments from the </span><span><a href="https://github.com/ukhomeoffice" target="_blank">Home Office</a></span><span> to the </span><span><a href="https://github.com/bank-of-england" target="_blank">Bank of England</a></span><span> and even </span><span><a href="https://github.com/gchq" target="_blank">GCHQ</a></span><span>, choosing transparency over secrecy, collaboration over duplication.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>We haven't solved open source in government. We probably never will, entirely — it's an ongoing negotiation between openness and pragmatism, between the ideal and the achievable. But 24,000 public repositories is not nothing. It's a foundation. And with SBOMs now providing a window into what those repositories actually contain, we have an opportunity to move beyond simply counting repositories and start understanding what UK public sector technology really looks like at scale.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Perhaps that's worth a late-night coding session or two.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Links</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>
      </span></p><ul>
        
    <li><span><a href="https://www.uk-x-gov-software-community.org.uk/xgov-opensource-repo-scraper/" target="_blank">X-UK-Gov Public Repository Leaderboard</a></span><span> — the live dashboard</span></li>
    <li><span><a href="https://github.com/uk-x-gov-software-community/xgov-opensource-repo-scraper" target="_blank">Source code on GitHub</a></span><span> — pull requests welcome</span></li>
    <li><span><a href="https://www.uk-x-gov-software-community.org.uk/xgov-opensource-repo-scraper/repos.json" target="_blank">repos.json</a></span><span> — the raw repository data (prefer linking over downloading)</span></li>
    <li><span><a href="https://www.uk-x-gov-software-community.org.uk/xgov-opensource-repo-scraper/sbom.json" target="_blank">sbom.json</a></span><span> — consolidated SPDX SBOM for all catalogued repositories</span></li>
    <li><span>Individual SBOMs at /sbom/{owner}/{repo}.json.gz — e.g. </span><span><a href="https://www.uk-x-gov-software-community.org.uk/xgov-opensource-repo-scraper/sbom/gchq/CyberChef.json.gz" target="_blank">https://www.uk-x-gov-software-community.org.uk/xgov-opensource-repo-scraper/sbom/gchq/CyberChef.json.gz</a></span></li>

      </ul>
  
        <p></p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>The Most Expensive Game You&#39;ll Ever Play</title>
      <link>https://blog.cns.me/posts/most-expensive-game-youll-ever-play-chris-nesbitt-smith-whtae/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/most-expensive-game-youll-ever-play-chris-nesbitt-smith-whtae/</guid>
      <pubDate>Tue, 17 Mar 2026 08:00:17 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGUkVCdE1BaDR4ZEEvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWno1T21HU0hZQVEtLzAvMTc3MzcwNzg3OTU3Mz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9REhFeEgwLWVLd3UtWmhaRnI0U0txcThmbklCZEgwRXZTS1h0NGt5SGFWMA" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>There is a moment in every organisation's cloud journey where someone asks a question that sounds simple and turns out to be anything but: "How much will this cost?"</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFYU53MW96VVRjU1EvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWno2MVY4RElFQWMtLzAvMTc3MzczNDgxNDQzOT9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9clNJNmZ1eVBjdFNCeGdLRHJFVm43aENZRVJMYUQ3bTN0RTFfOWtQb0JHbw">
          <figcaption>
            <span>Black Friday game start screen</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>I have watched senior leaders' faces when they receive their first unforecasted cloud bill. There is a particular expression — somewhere between confusion and quiet horror — that I think anyone who has worked in cloud infrastructure will recognise. The bill is never what they expected. It is frequently not even in the right order of magnitude. And the explanation for why involves a level of complexity that makes the person asking wish they'd never raised the subject.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I built a </span><span><a href="https://blackfriday.cns.me/" target="_blank">game about this</a></span><span>.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFHSFNkaUF1QUVsWFEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWno2MWJvUEc4QVktLzAvMTc3MzczNDgzODE4Mj9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9cEJRZXhXOFkyVVpkSG9ONnZzdWNibjdxYnN0VjFCMk5nb0hVRjlLZ0ZwNA">
          <figcaption>
            <span>In game explanation screen</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>A few years ago, before AI was quite such a thing, I put together a simulation called the </span><span><a href="https://github.com/chrisns/cloudspendgame" target="_blank">Black Friday Game</a></span><span>. The premise is straightforward: you're running an online retailer, Black Friday is coming, and you need to configure your cloud infrastructure to handle the traffic spike without either falling over or spending a fortune on capacity you don't use. You adjust the scaling thresholds, the over-provisioning levels, the node startup times — and then you watch your scenario play out, seeing in near real-time how many requests you serve, how many you drop, and how much it all costs. There's a scoreboard. (I am told it is surprisingly competitive for what is essentially a spreadsheet with better graphics.)</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I think the reason it resonates is that it captures something genuinely difficult about cloud economics: the relationship between cost, capacity, and failure is non-linear, counter-intuitive, and extremely hard to reason about in the abstract.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGdWtlZGVkMGdyTVEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWno2MVFRRUhRQVktLzAvMTc3MzczNDc5MTM3Mj9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9MlBsbTZVbW9xNUlMTnpnVm9vZUk1eXhoaXlIWHNkY2V0dnViZnFGVmtqWQ">
          <figcaption>
            <span>Gameplay starting position</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>The Spaghetti Problem</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>It seems to me that cloud spend is a bit like cooking spaghetti for a dinner party. You know roughly how much pasta one person eats. You think you can simply multiply that by the number of guests. But then you have to account for the fact that the pot takes ten minutes to boil (your node startup time), that some guests arrive early and some arrive late (your traffic profile), that everyone takes slightly different amounts (your per-request resource consumption), and that any pasta you cook but don't serve goes in the bin (your wasted capacity). Oh, and every strand of uneaten spaghetti costs you money.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The temptation is to cook far too much — the cloud equivalent of setting your auto-scaling threshold to 20% instead of 80%. That works, in the sense that nobody goes hungry. But as I noted in a </span><span><a href="https://www.youtube.com/watch?v=Ij7IKrSFqas" target="_blank">talk I gave on this topic</a></span><span>, you could flip that statement around and say that you're targeting to waste 80% of your money, all the time. That might be proportionate if the cost of dropping a request is high enough. But knowing whether it's proportionate requires understanding each component of your system and how it scales — which, in a microservice architecture with dozens of components, is a genuinely non-trivial exercise.</span>
        </p>
    </div>
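
    <div>
        <p>
          <span>The first-order arithmetic behind that claim is worth sketching. The numbers below are illustrative only, not from any real bill: at steady state, a scaler that holds average utilisation at a target threshold leaves everything above that threshold as paid-for idle capacity.</span>
        </p>
    </div>

```python
# Illustrative sketch only: the steady-state waste implied by an
# auto-scaling utilisation target. Real bills also depend on
# instance granularity, node startup time, and traffic shape.

def wasted_fraction(target_utilisation: float) -> float:
    """If the scaler holds average utilisation at `target_utilisation`,
    the remainder of the fleet is capacity you pay for but rarely use."""
    if not 0 < target_utilisation <= 1:
        raise ValueError("target must be in (0, 1]")
    return 1 - target_utilisation

# Scaling out at 80% utilisation wastes roughly 20% of spend;
# scaling out at 20% wastes roughly 80%, all the time.
print(f"{wasted_fraction(0.80):.0%}")  # 20%
print(f"{wasted_fraction(0.20):.0%}")  # 80%
```

    <div>
        <p>
          <span>Whether 20% headroom or 80% headroom is the right trade depends entirely on what a dropped request costs you, which is the point of the paragraph above.</span>
        </p>
    </div>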
  
                  

    <div>
          <h2>
            <span>Then AI Arrived</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>The game I built modelled a relatively simple three-tier architecture: a front end, a back end, and a database, scaling roughly linearly. Cloud spend was already complex enough to warrant a simulation. But the cost models we're now dealing with in the age of AI make that original game look almost quaint.</span>
        </p>
    </div>
  
                    
    

    
  
                  

    <div>
        <p>
          <span>AI pricing is not linear. It's not even consistently measured. We have costs per token (input tokens priced differently from output tokens), costs per inference, costs per GPU-hour, costs that vary depending on the model size, the context window, the batch size, and whether you're using spot instances or reserved capacity. We have services where the cost of a single API call can vary by a factor of ten depending on how long the response is. I must admit that my own ability to forecast AI costs with any confidence is roughly zero, and I suspect I'm not alone in that.</span>
        </p>
    </div>
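
    <div>
        <p>
          <span>To make the forecasting problem concrete, here is a toy cost model. The per-token prices are hypothetical placeholders (not any vendor's actual rate card), but the shape of the problem is real: output length alone can swing the cost of one call by an order of magnitude.</span>
        </p>
    </div>

```python
# A sketch of why per-token pricing resists forecasting.
# The prices below are hypothetical placeholders.

IN_PRICE = 3.00 / 1_000_000    # $ per input token (hypothetical)
OUT_PRICE = 15.00 / 1_000_000  # $ per output token (hypothetical)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single API call under simple per-token pricing."""
    return input_tokens * IN_PRICE + output_tokens * OUT_PRICE

# Same 500-token prompt, different response lengths: the cost
# varies by roughly 10x before you even account for model size,
# context window, batching, or spot vs reserved capacity.
terse = call_cost(500, 100)
verbose = call_cost(500, 2_000)
print(f"${terse:.4f} vs ${verbose:.4f}")
```

    <div>
        <p>
          <span>And that is the easy case: a single model with fixed prices. Add GPU-hours, reserved capacity, and model switching, and the forecast error compounds.</span>
        </p>
    </div>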
  
                  

    <div>
        <p>
          <span>This is, I think, one of the most underappreciated challenges facing organisations — particularly in the public sector — as they explore AI adoption. The question is not just "can we build this?" but "can we afford to run it at scale, and do we even know how to estimate what that means?"</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Trying Before Buying</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Which is where I think tools like </span><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span> become genuinely valuable. NDX:Try gives local government organisations free, isolated cloud sandboxes to experiment with — and crucially, those experiments include real cloud services with real cost structures. You can deploy a </span><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/scenarios/council-chatbot/" target="_blank">council chatbot</a></span><span> or a </span><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/scenarios/simply-readable/" target="_blank">document translation service</a></span><span> in fifteen minutes, put some realistic load through it, and start to understand what the cost profile actually looks like — all without touching production infrastructure or spending any of your own budget.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>However, I believe the real value goes beyond the individual experiment. When we can see how costs behave in a safe environment — how they scale with usage, where the non-linear jumps are, what the baseline overhead looks like — we start to build the kind of institutional understanding that protects organisations from bill shock later. The game taught people about cloud scaling through simulation. NDX:Try teaches people about cloud costs through direct, no-risk experience.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I think that distinction matters. Reading a pricing page is not the same as watching a meter tick up as your chatbot handles its hundredth concurrent conversation. Just as my Black Friday simulation taught people more about auto-scaling in ten minutes than a whiteboard session ever could, deploying an actual AI service — even in a sandbox — teaches you things about cost that no spreadsheet forecast can capture.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Back to the Dinner Party</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>The spaghetti problem never really goes away. Whether we're scaling Kubernetes pods or provisioning GPU instances for large language models, the fundamental tension is the same: too little capacity and we fail our users, too much and we waste public money. The complexity has increased enormously — AI has added an entirely new set of variables to an already difficult equation — but the approach should remain the same. Model it. Simulate it. Try it in a safe space. Learn before you commit.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Perhaps we should cook a small batch first.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Links</span>
          </h2>
    </div>
  
                  

    <div>
        <ul>
        
    <li><span><a href="https://blackfriday.cns.me/" target="_blank">Black Friday Game</a></span><span> — try the cloud spend simulation</span></li>
    <li><span><a href="https://github.com/chrisns/cloudspendgame" target="_blank">Source code on GitHub</a></span><span> — pull requests welcome</span></li>
    <li><span><a href="https://www.youtube.com/watch?v=Ij7IKrSFqas" target="_blank">Video walkthrough</a></span><span> — the original talk explaining the game</span></li>
    <li><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span> — free cloud sandboxes for local government</span></li>
    <li><span><a href="https://www.gov.uk/government/news/one-stop-shop-for-tech-could-save-taxpayers-12-billion-and-overhaul-how-government-buys-digital-tools" target="_blank">National Digital Exchange</a></span></li>

      </ul>
  
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
    <item>
      <title>Teaching a computer to understand British Sign Language</title>
      <link>https://blog.cns.me/posts/teaching-computer-understand-british-sign-language-nesbitt-smith-rhs8e/</link>
      <guid isPermaLink="true">https://blog.cns.me/posts/teaching-computer-understand-british-sign-language-nesbitt-smith-rhs8e/</guid>
      <pubDate>Mon, 16 Mar 2026 08:51:16 GMT</pubDate>
      
      <enclosure url="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFWmxkOTJUWlpwZUEvYXJ0aWNsZS1jb3Zlcl9pbWFnZS1zaHJpbmtfNzIwXzEyODAvQjRFWnoxMUpySEhZQUktLzAvMTc3MzY1MDg4MDcwMz9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9MkRLVmhiZUNNRXVjelZSdGVrdXcwd09HNWNFN2RXVHNaT1MyM185aEVjaw" type="image/jpeg" length="0" />
      <description><![CDATA[
                  

    <div>
        <p>
          <span>I spent four days last week trying to teach a computer to recognise British Sign Language. I use the word "trying" deliberately. The computer learnt to classify 18,871 distinct signs with 77.8% accuracy, which sounds impressive until you try to have an actual conversation with it, at which point it becomes clear that recognising isolated dictionary entries is to understanding a language what recognising individual ingredients is to cooking a meal. You can identify flour, eggs, and butter with perfect accuracy and still have absolutely no idea how to make a cake.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>This is a story about that experiment. It is also, as it turns out, a story about what happens when you give an AI coding assistant access to cloud infrastructure and a vague brief, and then stay up until one in the morning asking it "now?" every eight minutes.</span>
        </p>
    </div>
  
                  
      <figure>
        
    
    <div>
      <video data-sources="[{&quot;type&quot;:&quot;video/mp4&quot;,&quot;src&quot;:&quot;https://dms.licdn.com/playlist/vid/v2/D4E12AQE6mWMFddYxaw/mp4-720p-30fp-crf28/B4EZzs2SYmJEAs-/0/1773500254588?e=2147483647&amp;v=beta&amp;t=LQKVaC1YE71QK5hoRsUsvGe2Ol0kTccfNrrTDdmMWP0&quot;,&quot;bitrate&quot;:587600},{&quot;type&quot;:&quot;video/mp4&quot;,&quot;src&quot;:&quot;https://dms.licdn.com/playlist/vid/v2/D4E12AQE6mWMFddYxaw/mp4-640p-30fp-crf28/B4EZzs2SYmJEAo-/0/1773500253493?e=2147483647&amp;v=beta&amp;t=p-G8A9lS23Afu-kwKNPMgnLYLLeQxFGWs1HbBv2U0vY&quot;,&quot;bitrate&quot;:464764}]" data-poster-url="https://media.licdn.com/dms/image/v2/D4E12AQE6mWMFddYxaw/videocover-high/B4EZzs2SYmJEBc-/0/1773500251701?e=2147483647&amp;v=beta&amp;t=uL8ad_q2s09rmy4A0OEvg3C_FDP4TuQ2FVfjv8Q9QsA" data-digitalmedia-asset-urn="urn:li:digitalmediaAsset:D4E12AQE6mWMFddYxaw" playsinline="">
      </video>
    </div>
  
        <figcaption>
          Video of my first attempt interacting with the model not knowing how to even tell if it was working or not
        </figcaption>
      </figure>
  
                  

    <div>
          <h2>
            <span>The idea</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Around 87,000 people in the UK use British Sign Language as their first or preferred language. For many of them, interacting with public services means navigating systems built entirely around English. There are interpreters and relay services, but they are not always available, and they are expensive. What if a browser could recognise sign language in real time?</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>That was the question. Not "can we build a production service" (that is a much bigger ask with years of work behind it). The question was simpler: is this even feasible? Could someone explore this idea without needing to procure GPU servers, negotiate data agreements, or stand up ML infrastructure from scratch?</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>What NDX:Try gives you</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span> is a free platform that provides UK public sector organisations with temporary AWS environments for experimentation. You get a sandbox account: an isolated, time-limited AWS environment with guardrails that you can use to try things out. The key thing is that it is safe. The sandbox accounts are isolated from production systems. They auto-clean when your session expires. There is no risk of accidentally exposing real data or racking up unexpected bills.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>It is designed for exactly this kind of thing: "I have an idea, I want to see if it works."</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Day one: the spec</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>On Monday morning, I wrote a technical specification describing what I wanted: a bidirectional BSL translation system. Camera input for sign recognition on one side, text-to-BSL translation using Amazon Bedrock's Claude on the other. Three interface modes: live translation, a practice mode for learning signs, and a kiosk mode for reception desks. SageMaker GPU inference, Lambda functions, the full stack.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I should clarify what "I wrote" means. I described the desired outcome. </span><span><a href="https://docs.anthropic.com/en/docs/claude-code" target="_blank">Claude Code</a></span><span>, Anthropic's AI coding assistant, generated the specification, the architecture, and then the code. The methodology is called </span><span><a href="https://github.com/bmad-code-org/BMAD-METHOD" target="_blank">BMAD</a></span><span>: spec-driven development where the human provides direction and the AI provides implementation.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>By mid-morning, the application existed. It had a three-mode frontend, CloudFormation templates, Lambda functions, and all the plumbing.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFVUZkQUdScDJPS3cvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaenMwaDBKSk1BVS0vMC8xNzczNDk5NzE5NDcxP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1KVnh3V3ZydlBsVXFubkpURU8tWUpTMDJOM04xTXZieTBDQ1NHZFZJLWdn">
          <figcaption>
            <span>The main interface: camera feed on the left for sign recognition, text-to-BSL translation panel on the right. Three modes let you explore different interaction patterns.</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>The text-to-BSL direction uses Bedrock's Claude to translate English sentences into </span><span><a href="https://bsl.surrey.ac.uk/" target="_blank">BSL gloss notation</a></span><span>, the written representation that captures BSL's distinct grammar. "Hello, how are you today?" becomes TODAY IX-2P HOW, because BSL front-loads time references.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGSU1IamRfLUJKc3cvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzE1MDBfMjIzMi9CNEVaenMwMkJpSk1BVS0vMC8xNzczNDk5ODAyMzU2P2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1Fb21qd2JlTWJqZjl0Z21VR1ltU1IzVFMwejY4N2tzSzB2Yl96cWI3MXpj">
          <figcaption>
            <span>English to BSL gloss. The AI translates natural English into BSL gloss notation, which follows BSL grammar rather than English word order</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGdTFTV1N5MVFaRFEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzE1MDBfMjIzMi9CNEVaenMwXy42SUlBWS0vMC8xNzczNDk5ODQzMjU0P2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD0tZld1V1JMNVBGc3FCODl2TDhVTk53THJlVFdMVnlpbklGVUFEVVFIQjdR">
          <figcaption>
            <span>BSL has its own grammar. Hello, how are you today? becomes TODAY, IX-2P (a pronoun pointing sign), HOW.</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>It is worth noting that the prototype included a practice mode with sign categories, reference videos, and a star rating system. I did not ask for this. The AI decided, unprompted, that a gamified learning mode would be useful and built one. It is a curious example of an AI assistant making product decisions: the practice mode is a reasonable idea, and I was interested to see where it went. It did not, as it happens, actually work. The recognition was not accurate enough to score anyone's signing meaningfully, so the stars were essentially decorative. But the fact that it appeared at all, unbidden, is worth reflecting on.</span>
        </p>
    </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFGTm1SZ3FwaGtaUncvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzEwMDBfMTQ4OC9CNEVaenMxTjQzSmNBUS0vMC8xNzczNDk5OTAwMDMzP2U9MjE0NzQ4MzY0NyZ2PWJldGEmdD1obW51VWFXUXZ3S2NtUkphV0ZVT1RRMFRFZHQtU3c0bzh5R2pqQkJUSk5v">
          <figcaption>
            <span>Practice mode: the AI's unsolicited contribution. Pick a category, watch the reference video, try the sign yourself, and get rated. The rating did not work.</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
          <h2>
            <span>Day one: the deployment disaster tour</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Generating code from a spec is the easy part. Getting it to actually run in a sandbox environment is where the educational content begins.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The first deployment attempt failed because the CloudFormation template exceeded the 51KB inline limit. Then the seed Lambda function exceeded the ZipFile size limit. Then SAM's Tags format did not match CloudFormation's Tags format. Then orphaned CloudWatch Log Groups from a failed rollback blocked the next attempt. Then Lambda Function URLs were blocked by the Innovation Sandbox's Service Control Policy. Each failure took between ten and thirty minutes to diagnose and fix.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>By late morning, the stack was deployed. The frontend was running but the recogniser was not recognising anything. Just "waiting for signs."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>This is the part where things got properly embarrassing.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Day one: the wrong language</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>The first model was trained on </span><span><a href="https://dxli94.github.io/WLASL/" target="_blank">WLASL</a></span><span>, the Word-Level American Sign Language dataset. Not British Sign Language. American. The AI assistant, when tasked with building a BSL recogniser, had reached for the most readily available dataset it could find, and that happened to be the wrong sign language entirely.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>It gets worse. The model loading code included model.load_state_dict(state_dict, strict=False). That strict=False flag silently drops any parameters that do not match. In this case, it had dropped all 344 parameters. The model was running on random weights. It had approximately 0% real accuracy.</span>
        </p>
    </div>
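
    <div>
        <p>
          <span>The semantics of that flag are worth spelling out, because the failure is completely silent. This is a plain-Python mimic of how strict/non-strict checkpoint loading behaves (not PyTorch itself), using the classic mismatch where every checkpoint key carries a wrapper prefix:</span>
        </p>
    </div>

```python
# Plain-Python mimic of load_state_dict semantics: with strict=False,
# keys that don't match are silently skipped, and a model can end up
# running on its random initialisation without any error being raised.

def load_state_dict(params: dict, checkpoint: dict, strict: bool = True):
    missing = [k for k in params if k not in checkpoint]
    unexpected = [k for k in checkpoint if k not in params]
    if strict and (missing or unexpected):
        raise RuntimeError(f"missing={missing}, unexpected={unexpected}")
    for key in checkpoint:
        if key in params:
            params[key] = checkpoint[key]
    return missing, unexpected

# Common failure mode: the checkpoint was saved from a wrapped model,
# so every key is prefixed with "module." and nothing matches.
model = {"conv.weight": "random-init", "fc.weight": "random-init"}
ckpt = {"module.conv.weight": "trained", "module.fc.weight": "trained"}

missing, unexpected = load_state_dict(model, ckpt, strict=False)
# Every parameter was dropped; the model still holds random weights.
assert model["conv.weight"] == "random-init"
assert len(missing) == 2 and len(unexpected) == 2
```

    <div>
        <p>
          <span>The defensive habit, which PyTorch's real API supports because it returns the incompatible keys, is to check that the missing-keys list is empty before trusting the model at all.</span>
        </p>
    </div>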
  
                  

    <div>
        <p>
          <span>Fair feedback was given: "this is not acceptable, this is british, only BSL and maybe Makaton."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>By the afternoon, we had switched to </span><span><a href="https://www.robots.ox.ac.uk/~vgg/research/bsl1k/" target="_blank">BSL-1K</a></span><span>, a dataset from Oxford's Visual Geometry Group containing 1,064 BSL signs. The model loaded properly this time. But the results were still poor. Signing even "hello" was not recognised.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>That evening was spent on a series of increasingly desperate attempts to make a hand-crafted scoring approach work. Version 9 scored 6 out of 119 signs correctly: 5.0%. Continuous scoring made things worse. A power transform made things worse. Dynamic Time Warping produced marginal improvements. The ceiling was low and we were hitting it repeatedly.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I think it is worth pausing on why 5% accuracy represents a dead end. Hand-crafted scoring requires you to define, precisely and mathematically, what each sign looks like. "Is the dominant hand above the non-dominant hand? Is the palm facing inward?" These binary questions throw away enormous amounts of information. Real signing is fluid, continuous, and highly variable between signers. It is rather like trying to learn a language from a phrasebook: you can memorise the pronunciation of "where is the train station?" but the moment a real person answers you in their natural accent, at their natural speed, with their natural word choices, the phrasebook is useless.</span>
        </p>
    </div>
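
    <div>
        <p>
          <span>A toy version of that scoring approach shows where the ceiling comes from. All the landmark values and thresholds here are hypothetical, but the structure is the point: every comparison gets squashed to a yes/no, so natural variation in a real signer flips bits that the template cannot tolerate.</span>
        </p>
    </div>

```python
# Toy hand-crafted scorer (hypothetical thresholds) illustrating the
# information loss: continuous, fluid motion is reduced to a handful
# of binary features matched against a single reference template.

def binary_features(right_hand_y, left_hand_y, palm_angle_deg):
    return (
        right_hand_y < left_hand_y,     # dominant hand above the other?
        abs(palm_angle_deg) < 30.0,     # palm roughly facing inward?
    )

# Template built from one careful reference performance.
TEMPLATE = binary_features(right_hand_y=0.30, left_hand_y=0.55,
                           palm_angle_deg=10.0)

def score(sample) -> bool:
    """A sign 'matches' only if every binary feature agrees exactly."""
    return binary_features(*sample) == TEMPLATE

# A textbook performance matches...
assert score((0.32, 0.50, 5.0))
# ...but a natural signer whose palm rotates 35 degrees mid-sign
# fails, even though a human would read the sign instantly.
assert not score((0.32, 0.50, 35.0))
```

    <div>
        <p>
          <span>Add more rules and you add more bits to flip; the 5% ceiling was structural, not a tuning problem.</span>
        </p>
    </div>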
  
                  

    <div>
          <h2>
            <span>Day two: the ML pivot</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>At quarter to seven on Tuesday morning: "plan a ML based classification development journey then."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>This was the decisive moment. Stop trying to encode human knowledge of what signs look like. Instead, show the computer thousands of examples and let it learn. The sandbox had the compute. The academic world had the data.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The ideal dataset would be </span><span><a href="https://www.robots.ox.ac.uk/~vgg/data/bobsl/" target="_blank">BOBSL</a></span><span>, the BBC-Oxford British Sign Language dataset: 1,400 hours of interpreted BBC content with 2,281 sign classes. But BOBSL access is restricted to academic research institutions under </span><span><a href="https://www.robots.ox.ac.uk/~vgg/data/bobsl/#access" target="_blank">BBC Terms of Use</a></span><span>. Independent researchers, students, and commercial organisations are explicitly excluded. A government sandbox experiment does not qualify.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>So we worked with what was openly available.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Day two: the multi-signer breakthrough</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>The first ML model was trained on synthetic data: one reference video per sign, with computer-generated variations. It scored 14 out of 119 signs correctly (11.8%). Better than hand-crafted, but still terrible.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Then we trained on </span><span><a href="https://www.robots.ox.ac.uk/~vgg/data/bsldict/" target="_blank">BSLDict</a></span><span>, an academic dataset from Oxford containing over 14,000 video clips of BSL signs performed by 124 different signers. Even with just 4-5 videos per sign from different people, accuracy jumped to 103 out of 119: 86.6%.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>That was the breakthrough. Different people sign differently. Their hands are different sizes. They move at different speeds. A model trained on one person's signing cannot recognise another person's signing. A model trained on many people's signing can recognise almost anyone's signing. To return to the language-learning analogy: listening to one native speaker repeat phrases in a recording studio is nothing like being dropped into a crowded market in a foreign city. The market is terrifying, but it is also where you actually learn.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Real human variance, it turns out, is something you cannot synthesise.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Day three: the cloud pivot</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>On Wednesday morning: "this is still rubbish." The browser demo with 119 signs was not impressive enough. "Abandon running locally and boot a big vm to download quickly and run there."</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>We spun up a c5.4xlarge EC2 instance (16 vCPU cores, 32GB of RAM) and started downloading data from every BSL-related source we could find: </span><span><a href="https://www.robots.ox.ac.uk/~vgg/data/bsldict/" target="_blank">BSLDict</a></span><span> from Oxford VGG (sourced from </span><span><a href="http://signbsl.com" target="_blank">signbsl.com</a></span><span> contributors), </span><span><a href="https://bslsignbank.ucl.ac.uk/" target="_blank">BSL SignBank</a></span><span> from UCL, </span><span><a href="https://www.auslan.org.au/" target="_blank">Auslan Signbank</a></span><span> and </span><span><a href="https://www.nzsl.nz/" target="_blank">NZSL</a></span><span> (both part of the same </span><span><a href="https://en.wikipedia.org/wiki/BANZSL" target="_blank">BANZSL</a></span><span> language family as BSL, sharing roughly 82% of their vocabulary), </span><span><a href="https://www.sign-lang.uni-hamburg.de/dicta-sign/" target="_blank">Dicta-Sign</a></span><span> from an EU research project, </span><span><a href="https://www.ssc.education.ed.ac.uk/BSL/" target="_blank">SSC STEM</a></span><span> from the Scottish Sensory Centre, </span><span><a href="https://www.christian-bsl.com/" target="_blank">Christian-BSL</a></span><span>, and BKS.</span>
        </p>
    </div>
  
                  

    <div>
        
    <pre><span>Source                   Videos
---------------------------------------
BSLDict (Oxford VGG)     13,090
BSL SignBank (UCL)        3,586
Auslan Signbank           8,561
NZSL                      4,805
Dicta-Sign                1,019
SSC STEM                  2,682
Christian-BSL               580
BKS                       2,072
                         ------
Total                    36,395
        </span></pre>
  
          </div>
  
                  

    <div>
        <p>
          <span>Including Auslan and NZSL was a bet that shared hand movements would help generalisation, even where specific signs differ.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The v18 mega-training run started with 306,174 samples across 14,948 sign classes. CPU utilisation climbed from 579% to 672% across the 16 cores. The SSH daemon became unreachable as the operating system had nothing left to give it. RAM usage grew from 6GB to 9.5GB.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>At 02:24 on Thursday morning, after nearly 24 hours of continuous training, fold 1 came back: 89.6% top-1 accuracy, 98.6% top-5, across 14,948 signs.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>The messy reality</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>Blog posts about ML projects tend to present a clean narrative: we had an idea, we tried it, it worked. The reality was considerably messier, and that is worth talking about because it is the reality of experimentation.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The sandbox session was ticking down. "Download any intermediary data so that we can resume if our sandbox acc expires." The response was sobering: "sandbox expire will also delete s3 data, the whole aws acc will go away." Twenty-four gigabytes of processed training data, extracted features, and partially trained models would vanish. We downloaded everything. It took hours over the SSH connection that kept dropping (SSH tends to struggle when you are running a CPU-intensive training job at 675% utilisation and the operating system has very little headroom left for anything else).</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>When we tried to speed up training by launching a GPU instance, we hit a GPU vCPU quota of zero. This is standard for new AWS accounts, not a sandbox restriction. The first quota increase request was denied. A second attempt was approved within a couple of hours. It is the kind of thing you only learn by trying.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The data downloading was its own adventure. Cloudflare blocked video downloads from EC2 IP addresses. The </span><span><a href="https://www.auslan.org.au/" target="_blank">Auslan Signbank</a></span><span> download hit connection resets and slowed to a crawl. The </span><span><a href="https://www.ssc.education.ed.ac.uk/BSL/" target="_blank">SSC STEM</a></span><span> extraction died at 57% completion. Academic video servers had inconsistent availability and aggressive rate limits.</span>
        </p>
    </div>
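
    <div>
        <p>
          <span>Under those conditions a bulk download becomes mostly a retry exercise. This is a minimal sketch of the pattern rather than the actual script used: exponential backoff with a cap, wrapped around a fetch that treats resets and timeouts as retryable.</span>
        </p>
    </div>

```python
# Sketch of retrying flaky video downloads with capped exponential
# backoff. Illustrative, not the actual script from the experiment.
import time
import urllib.request

def backoff_delays(retries: int, base: float = 2.0, cap: float = 60.0):
    """Delays of base, base*2, base*4, ... capped at `cap` seconds."""
    return [min(base * (2 ** i), cap) for i in range(retries)]

def fetch_with_retries(url: str, retries: int = 5) -> bytes:
    last_error = None
    # First attempt immediately, then back off between retries.
    for delay in [0.0, *backoff_delays(retries - 1)]:
        time.sleep(delay)
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except OSError as err:  # connection resets, timeouts, etc.
            last_error = err
    raise last_error
```

    <div>
        <p>
          <span>Backoff helps with rate limits and transient resets; it does nothing for a Cloudflare block on your IP range, which is a policy problem rather than a reliability one.</span>
        </p>
    </div>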
  
                  

    <div>
        <p>
          <span>And then there is licensing. For a research experiment, downloading publicly available sign language videos and training a model is reasonable. But the licensing landscape is a patchwork. Only </span><span><a href="https://www.nzsl.nz/" target="_blank">NZSL</a></span><span> has a clearly permissive licence (</span><span><a href="https://creativecommons.org/licenses/by/4.0/" target="_blank">CC BY 4.0</a></span><span>). </span><span><a href="https://www.auslan.org.au/" target="_blank">Auslan Signbank</a></span><span> is </span><span><a href="https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank">CC BY-NC-ND 4.0</a></span><span> (non-commercial, no derivatives). </span><span><a href="https://www.ssc.education.ed.ac.uk/BSL/" target="_blank">SSC STEM</a></span><span> is University of Edinburgh IP requiring explicit permission. </span><span><a href="https://www.robots.ox.ac.uk/~vgg/data/bsldict/" target="_blank">BSLDict</a></span><span>, </span><span><a href="https://bslsignbank.ucl.ac.uk/" target="_blank">BSL SignBank</a></span><span>, </span><span><a href="https://www.sign-lang.uni-hamburg.de/dicta-sign/" target="_blank">Dicta-Sign</a></span><span>, and </span><span><a href="https://www.christian-bsl.com/" target="_blank">Christian-BSL</a></span><span> all have unclear or unstated terms.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The most notable omission is </span><span><a href="https://www.robots.ox.ac.uk/~vgg/data/bobsl/" target="_blank">BOBSL</a></span><span>, whose 1,400 hours of interpreted BBC content would be transformative training data. But, as noted earlier, access is restricted to academic research institutions under </span><span><a href="https://www.robots.ox.ac.uk/~vgg/data/bobsl/#access" target="_blank">BBC Terms of Use</a></span><span>. For a public sector innovation experiment, that door is closed. It is an area where more openly-licensed BSL data would make a significant difference.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>Day four: the 1am impatience</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>The v19 training run was on a GPU instance, a g4dn.xlarge with an NVIDIA Tesla T4. What had taken 60+ hours on CPU was projected to take around 2.5 hours. The GPU sat at 100% utilisation, using 14GB of its 15GB VRAM.</span>
        </p>
    </div>
  
                  

    <div>
        <blockquote><span>At 01:15: "now?"</span></blockquote>
    </div>
  
                  

    <div>
        <blockquote><span>At 01:23: "now?"</span></blockquote>
    </div>
  
                  

    <div>
        <blockquote><span>At 01:28: "now?"</span></blockquote>
    </div>
  
                  

    <div>
        <p>
          <span>There is something both absurd and perfectly human about checking on a machine learning training run at one in the morning, every eight minutes, like a child asking "are we there yet?" from the back seat. The experiment had started as a professional curiosity on Monday morning. By Thursday night, it had become a compulsion.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>At 10:10 on Friday morning: "sorry machine crashed, check in, hows it going?" The local machine had crashed overnight. The training, running on EC2, was fine. By 10:47, all data was downloaded locally. "Everything is off AWS. Safe to terminate."</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>The training architecture</span>
          </h2>
    </div>
  
                  

    <div>
        
    <pre><span>+--------------------------------------------------------------+
|                   Training Pipeline (EC2)                    |
|                                                              |
|  +----------+   +----------+   +---------+   +------------+  |
|  | Video    |--&gt;| MediaPipe|--&gt;| Feature |--&gt;|   Train    |  |
|  | Sources  |   | Holistic |   | Extract |   |  PyTorch   |  |
|  | (27,000+)|   | Landmarks|   | 142-dim |   |    MLP     |  |
|  +----------+   +----------+   +---------+   +-----+------+  |
|                                                    |         |
|                                                    v         |
|                                             +--------------+ |
|                                             | ONNX Export  | |
|                                             |  (30MB)      | |
|                                             +------+-------+ |
+----------------------------------------------------+---------+
                                                     |
                          +--------------------------+
                          v
+-----------------------------------------------+
|          Browser (no server needed)           |
|                                               |
|  +--------+   +----------+   +-------------+  |
|  | Webcam |--&gt;| MediaPipe|--&gt;| ONNX Runtime|  |
|  |        |   | (browser)|   |  Web (30MB) |  |
|  +--------+   +----------+   +------+------+  |
|                                     |         |
|                                     v         |
|                               +------------+  |
|                               | Recognised |  |
|                               |   Sign     |  |
|                               +------------+  |
+-----------------------------------------------+
        </span></pre>
  
          </div>
  
                  

    <div>
        <p>
          <span>The pipeline processes 27,000+ videos from seven data sources. </span><span><a href="https://ai.google.dev/edge/mediapipe/solutions/vision/holistic_landmarker" target="_blank">MediaPipe Holistic</a></span><span> extracts 142-dimensional feature vectors from each video frame. A PyTorch MLP classifier trains on the extracted features. The trained model exports to </span><span><a href="https://onnxruntime.ai/" target="_blank">ONNX</a></span><span> format and runs entirely in the browser. No server calls needed for inference.</span>
        </p>
    </div>
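  
                  

    <div>
        <p>
          <span>To make the shape of that classifier stage concrete, here is a minimal sketch in plain Python. The 142-dimensional input matches the pipeline above; the hidden width, class count, and random weights are illustrative placeholders, not the real model (the actual classifier is a PyTorch MLP trained on the extracted features, with an output per sign class).</span>
        </p>
    </div>
  
                  

    <div>
        
    <pre><span>import random

# 142-dim input matches the pipeline; the hidden width and class
# count below are illustrative placeholders, not the real model's.
INPUT_DIM = 142
HIDDEN_DIM = 32
NUM_CLASSES = 10

def linear(x, weights, bias):
    """One dense layer, y = W.x + b, written out in plain Python."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

random.seed(0)
W1 = [[random.gauss(0.0, 0.1) for _ in range(INPUT_DIM)]
      for _ in range(HIDDEN_DIM)]
b1 = [0.0] * HIDDEN_DIM
W2 = [[random.gauss(0.0, 0.1) for _ in range(HIDDEN_DIM)]
      for _ in range(NUM_CLASSES)]
b2 = [0.0] * NUM_CLASSES

# One frame's landmark features in, one score per sign class out.
features = [0.0] * INPUT_DIM
logits = linear(relu(linear(features, W1, b1)), W2, b2)
assert len(logits) == NUM_CLASSES
        </span></pre>
  
          </div>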
  
                  

    <div>
          <h2>
            <span>Where we are now</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>The current model (version 19) recognises 18,871 distinct signs. Here is where honesty matters. Version 19 is actually less accurate than version 18: 77.8% top-1 versus 89.6%, and 97.2% top-5 versus 98.6%. More signs, worse per-sign accuracy. This is the entirely predictable consequence of scaling a classifier to nearly 19,000 classes, many of which are visually similar to one another.</span>
        </p>
    </div>
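  
                  

    <div>
        <p>
          <span>Top-1 and top-5 measure different things, and a small sketch makes the distinction concrete. This is illustrative plain Python, not the project's evaluation code: top-1 counts a prediction as correct only when the highest-scoring class is the right sign, while top-5 counts it as correct when the right sign appears anywhere in the five highest-scoring classes.</span>
        </p>
    </div>
  
                  

    <div>
        
    <pre><span>def top_k_hit(logits, true_label, k):
    """True when the correct class is among the k highest-scoring classes."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return true_label in ranked[:k]

def accuracy(batch, k):
    hits = sum(1 for logits, label in batch if top_k_hit(logits, label, k))
    return hits / len(batch)

# Toy evaluation set over 6 classes: (logits, true_label) pairs.
batch = [
    ([0.1, 0.9, 0.2, 0.0, 0.3, 0.1], 1),  # correct class scores highest
    ([0.8, 0.1, 0.7, 0.2, 0.6, 0.0], 2),  # correct class ranks 3rd: top-5 only
    ([0.0, 0.1, 0.2, 0.3, 0.4, 0.5], 0),  # correct class scores lowest: miss
]

assert accuracy(batch, 1) == 1 / 3   # only the first example is a top-1 hit
assert accuracy(batch, 5) == 2 / 3   # the second is recovered by top-5
        </span></pre>
  
          </div>
  
                  

    <div>
        <p>
          <span>That gap is why a model can look poor on top-1 while remaining usable: if the right sign is almost always in the top five candidates, the interface can offer choices rather than a single guess.</span>
        </p>
    </div>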
  
                  

    <div>
        
    <pre><span>Version  Signs   Top-1   Top-5  What changed
---------------------------------------------------
v9         119    5.0%      --  Hand-crafted scoring
v14        119   11.8%      --  ML, synthetic data
v15        119   86.6%      --  Multi-signer BSLDict
v16        944   85.7%      --  8x vocab expansion
v18     14,948   89.6%   98.6%  7 sources, 60h CPU
v19     18,871   77.8%   97.2%  GPU, expanded Auslan
        </span></pre>
  
          </div>
  
                  
    

    
      <div rel="ugc">
        
      <figure>
        <img alt="Article content" src="https://linkedinrss.cns.me/img/aHR0cHM6Ly9tZWRpYS5saWNkbi5jb20vZG1zL2ltYWdlL3YyL0Q0RTEyQVFFMjFMNTRFaGcyWFEvYXJ0aWNsZS1pbmxpbmVfaW1hZ2Utc2hyaW5rXzQwMF83NDQvQjRFWnpzMWxkQkc4QVktLzAvMTc3MzUwMDAwMTkxMD9lPTIxNDc0ODM2NDcmdj1iZXRhJnQ9NmZVY0t6TEdsUy1OcVNoakxYTlNSQVRPWjRZY1lGNHJGOHJKTDkxTGhRMA">
          <figcaption>
            <span>Version 19: 18,871 signs. The accuracy dropped from v18, but the vocabulary expanded significantly. Whether that trade-off is worthwhile depends entirely on what you are trying to do.</span>
          </figcaption>
      </figure>
    
      </div>
  
  
                  

    <div>
        <p>
          <span>The model runs entirely in the browser using </span><span><a href="https://onnxruntime.ai/" target="_blank">ONNX Runtime Web</a></span><span>: a single 30MB file that loads once and then classifies signs in milliseconds, with no server round-trips.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>About that "we"</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>I should come clean about something. Throughout this post, I have said "we" in the way that blog posts about technical work tend to say "we." In this case, "we" means Claude and I. Claude the AI, specifically </span><span><a href="https://docs.anthropic.com/en/docs/claude-code" target="_blank">Claude Code</a></span><span>, and I the human sitting in front of a laptop providing direction.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I should also come clean about the "four days." The experiment ran across four calendar days, but it was not four days of dedicated work. It was done in the margins of my actual day job: building NDX:Try, running the platform, promoting it across public sector, and the various other bits and bobs that fill a week at GDS. The four days of wall-clock time were largely Claude's. My contribution was more like a series of interruptions: checking in between meetings, giving direction over lunch, asking "now?" at one in the morning when I should have been asleep. The 60+ hours of compute ran regardless of whether I was paying attention to it, which is rather the point.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Nobody wrote any code for this project. Not the </span><span><a href="https://ai.google.dev/edge/mediapipe/solutions/vision/holistic_landmarker" target="_blank">MediaPipe</a></span><span> integration, not the PyTorch training pipeline, not the ONNX export, not the feature extraction scripts, not the browser-based classifier, not the download scrapers for seven different academic data sources, not the CloudFormation templates, not the Lambda functions, not the EC2 bootstrap scripts.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Not even this blog post.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The human contribution was direction and judgement. Which ideas to pursue. When to pivot. Whether the accuracy was good enough. Whether the experiment was worth continuing. When to say "this is not acceptable, this is british." When to say "plan a ML based classification development journey then." When to say "abandon running locally and boot a big vm." When to ask "now?" at one in the morning.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The AI handled the research, the implementation, the infrastructure, and the iteration. It also, as I mentioned, decided unprompted to build a practice mode with star ratings (which did not work, but was a reasonable idea).</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>I think this matters because it changes the profile of who can do this kind of work. You do not need to be a machine learning engineer. You do not need to know PyTorch, or MediaPipe, or how to configure EC2 instances, or how to export ONNX models. You need curiosity, a clear idea of what you are trying to achieve, and the judgement to evaluate whether it is working.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>What this does not prove</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>It is still rubbish for real BSL translation. I am going to say that plainly because it is true.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Recognising isolated signs is to understanding BSL what recognising individual words is to understanding spoken English: necessary but nowhere near sufficient. BSL has its own grammar, which is fundamentally different from English. It uses space, facial expressions, body movement, and timing as grammatical structures. A raised eyebrow is not decoration, it is grammar. None of this is captured by a model that classifies isolated signs.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>This was built by someone who does not know BSL (the process of doing it taught me an enormous amount about how much I did not know). It has not been tested with deaf BSL users. The accuracy numbers come from reference videos, not real-world signing. More signs made the model worse, not better. BSL has somewhere between 20,000 and 100,000 signs in active use, with dialects and regional variations that our training data has no awareness of.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>However, I think it proves something more fundamental than any particular accuracy number.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>What this does prove</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>A single person with a laptop and a sandbox can explore ideas that would previously have required a dedicated ML team. The entire experiment, from "I wonder if this is possible" to a working prototype with nearly 19,000 signs, was done in four days, without writing a line of code manually. No ML engineers, no frontend developers, no DevOps team. The compute for this project would have cost tens of thousands of pounds a decade ago.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Experiments are supposed to be messy. We hit GPU quota limits, SSH timeouts, expired sandbox sessions, flaky data downloads, the wrong sign language entirely, training runs that took four days instead of four hours, and a model that got worse as we added more data. None of that meant the experiment failed. It meant we were learning.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>Perhaps the most important thing is this: public sector innovation does not need to start with a business case. This experiment might lead somewhere useful, or it might not. The point is that someone was able to try, to ask "what if?" and actually explore the answer, without procurement, without a project board, without a budget. That is what sandbox environments are for.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>The phrasebook approach (hand-crafted rules, 5% accuracy) was never going to get us to fluency. The language school approach (synthetic training data, 11.8%) was better but still inadequate. The immersion approach (real data from real signers, 86.6%) was the breakthrough. And yet even immersion does not make you fluent. It just proves that fluency is possible, given enough time and the right environment.</span>
        </p>
    </div>
  
                  

    <div>
          <h2>
            <span>The repo is published</span>
          </h2>
    </div>
  
                  

    <div>
        <p>
          <span>The entire codebase (the frontend, the training pipeline, the data extraction scripts, the CloudFormation templates, the trained models, the documentation, and all 19 versions of increasingly questionable accuracy) is published at </span><span><a href="http://github.com/chrisns/bsl-experiment" target="_blank">github.com/chrisns/bsl-experiment</a></span><span>. For posterity and as a warning to others.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>If you are in UK public sector and you have an idea you would like to explore with AWS, </span><span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span> is there for exactly that. The worst that can happen is that your experiment does not work.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>And that is fine. That is what experiments are for.</span>
        </p>
    </div>
  
                  
    <hr>
  
                  

    <div>
        <p>
          <span><a href="https://aws.try.ndx.digital.cabinet-office.gov.uk/" target="_blank">NDX:Try</a></span><span> is available to UK public sector organisations. The BSL sign language recognition experiment, including all training code and models, is open source at </span><span><a href="http://github.com/chrisns/bsl-experiment" target="_blank">github.com/chrisns/bsl-experiment</a></span><span>.</span>
        </p>
    </div>
  
                  

    <div>
        <p>
          <span>(Views in this article are my own.)</span>
        </p>
    </div>
  
              ]]></description>
    </item>
    
  </channel>
</rss>
