I spent four days last week trying to teach a computer to recognise British Sign Language. I use the word "trying" deliberately. The computer learnt to classify 18,871 distinct signs with 77.8% accuracy, which sounds impressive until you try to have an actual conversation with it, at which point it becomes clear that recognising isolated dictionary entries is to understanding a language what recognising individual ingredients is to cooking a meal. You can identify flour, eggs, and butter with perfect accuracy and still have absolutely no idea how to make a cake.
This is a story about that experiment. It is also, as it turns out, a story about what happens when you give an AI coding assistant access to cloud infrastructure and a vague brief, and then stay up until one in the morning asking it "now?" every eight minutes.
The idea
Around 87,000 people in the UK use British Sign Language as their first or preferred language. For many of them, interacting with public services means navigating systems built entirely around English. There are interpreters and relay services, but they are not always available, and they are expensive. What if a browser could recognise sign language in real time?
That was the question. Not "can we build a production service" (that is a much bigger ask with years of work behind it). The question was simpler: is this even feasible? Could someone explore this idea without needing to procure GPU servers, negotiate data agreements, or stand up ML infrastructure from scratch?
What NDX:Try gives you
NDX:Try is a free platform that provides UK public sector organisations with temporary AWS environments for experimentation. You get a sandbox account, an isolated, time-limited AWS environment with guardrails, and you can use it to try things out. The key thing is that it is safe. The sandbox accounts are isolated from production systems. They auto-clean when your session expires. There is no risk of accidentally exposing real data or racking up unexpected bills.
It is designed for exactly this kind of thing: "I have an idea, I want to see if it works."
Day one: the spec
On Monday morning, I wrote a technical specification describing what I wanted: a bidirectional BSL translation system. Camera input for sign recognition on one side, text-to-BSL translation using Amazon Bedrock's Claude on the other. Three interface modes: live translation, a practice mode for learning signs, and a kiosk mode for reception desks. SageMaker GPU inference, Lambda functions, the full stack.
I should clarify what "I wrote" means. I described the desired outcome. Claude Code, Anthropic's AI coding assistant, generated the specification, the architecture, and then the code. The methodology is called BMAD: spec-driven development where the human provides direction and the AI provides implementation.
By mid-morning, the application existed. It had a three-mode frontend, CloudFormation templates, Lambda functions, and all the plumbing.
The text-to-BSL direction uses Bedrock's Claude to translate English sentences into BSL gloss notation, the written representation that captures BSL's distinct grammar. "Hello, how are you today?" becomes TODAY IX-2P HOW, because BSL front-loads time references.
It is worth noting that the prototype included a practice mode with sign categories, reference videos, and a star rating system. I did not ask for this. The AI decided, unprompted, that a gamified learning mode would be useful and built one. It is a curious example of an AI assistant making product decisions: the practice mode is a reasonable idea, and I was interested to see where it went. It did not, as it happens, actually work. The recognition was not accurate enough to score anyone's signing meaningfully, so the stars were essentially decorative. But the fact that it appeared at all, unbidden, is worth reflecting on.
Day one: the deployment disaster tour
Generating code from a spec is the easy part. Getting it to actually run in a sandbox environment is where the educational content begins.
The first deployment attempt failed because the CloudFormation template exceeded the 51KB inline limit. Then the seed Lambda function exceeded the ZipFile size limit. Then SAM's Tags format did not match CloudFormation's Tags format. Then orphaned CloudWatch Log Groups from a failed rollback blocked the next attempt. Then Lambda Function URLs were blocked by the Innovation Sandbox's Service Control Policy. Each failure took between ten and thirty minutes to diagnose and fix.
By late morning, the stack was deployed. The frontend was running but the recogniser was not recognising anything. Just "waiting for signs."
This is the part where things got properly embarrassing.
Day one: the wrong language
The first model was trained on WLASL, the Word-Level American Sign Language dataset. Not British Sign Language. American. The AI assistant, when tasked with building a BSL recogniser, had reached for the most readily available dataset it could find, and that happened to be the wrong sign language entirely.
It gets worse. The model loading code included model.load_state_dict(state_dict, strict=False). That strict=False flag silently drops any parameters that do not match. In this case, it had dropped all 344 parameters. The model was running on random weights. It had approximately 0% real accuracy.
Fair feedback was given: "this is not acceptable, this is british, only BSL and maybe Makaton."
By the afternoon, we had switched to BSL-1K, a dataset from Oxford's Visual Geometry Group containing 1,064 BSL signs. The model loaded properly this time. But the results were still poor. Signing even "hello" was not recognised.
That evening was spent on a series of increasingly desperate attempts to make a hand-crafted scoring approach work. Version 9 scored 6 out of 119 signs correctly: 5.0%. Continuous scoring made things worse. A power transform made things worse. Dynamic Time Warping produced marginal improvements. The ceiling was low and we were hitting it repeatedly.
I think it is worth pausing on why 5% accuracy represents a dead end. Hand-crafted scoring requires you to define, precisely and mathematically, what each sign looks like. "Is the dominant hand above the non-dominant hand? Is the palm facing inward?" These binary questions throw away enormous amounts of information. Real signing is fluid, continuous, and highly variable between signers. It is rather like trying to learn a language from a phrasebook: you can memorise the pronunciation of "where is the train station?" but the moment a real person answers you in their natural accent, at their natural speed, with their natural word choices, the phrasebook is useless.
Day two: the ML pivot
At quarter to seven on Tuesday morning: "plan a ML based classification development journey then."
This was the decisive moment. Stop trying to encode human knowledge of what signs look like. Instead, show the computer thousands of examples and let it learn. The sandbox had the compute. The academic world had the data.
The ideal dataset would be BOBSL, the BBC-Oxford British Sign Language dataset: 1,400 hours of interpreted BBC content with 2,281 sign classes. But BOBSL access is restricted to academic research institutions under BBC Terms of Use. Independent researchers, students, and commercial organisations are explicitly excluded. A government sandbox experiment does not qualify.
So we worked with what was openly available.
Day two: the multi-signer breakthrough
The first ML model was trained on synthetic data: one reference video per sign, with computer-generated variations. It scored 14 out of 119 signs correctly (11.8%). Better than hand-crafted, but still terrible.
Then we trained on BSLDict, an academic dataset from Oxford containing over 14,000 video clips of BSL signs performed by 124 different signers. Even with just 4-5 videos per sign from different people, accuracy jumped to 103 out of 119: 86.6%.
That was the breakthrough. Different people sign differently. Their hands are different sizes. They move at different speeds. A model trained on one person's signing cannot recognise another person's signing. A model trained on many people's signing can recognise almost anyone's signing. To return to the language-learning analogy: listening to one native speaker repeat phrases in a recording studio is nothing like being dropped into a crowded market in a foreign city. The market is terrifying, but it is also where you actually learn.
Real human variance, it turns out, is something you cannot synthesise.
Day three: the cloud pivot
On Wednesday morning: "this is still rubbish." The browser demo with 119 signs was not impressive enough. "Abandon running locally and boot a big vm to download quickly and run there."
We spun up a c5.4xlarge EC2 instance (16 vCPU cores, 32GB of RAM) and started downloading data from every BSL-related source we could find: BSLDict from Oxford VGG (sourced from signbsl.com contributors), BSL SignBank from UCL, Auslan Signbank and NZSL (both part of the same BANZSL language family as BSL, sharing roughly 82% of their vocabulary), Dicta-Sign from an EU research project, SSC STEM from the Scottish Sensory Centre, Christian-BSL, and BKS.
Source Videos
---------------------------------------
BSLDict (Oxford VGG) 13,090
BSL SignBank (UCL) 3,586
Auslan Signbank 8,561
NZSL 4,805
Dicta-Sign 1,019
SSC STEM 2,682
Christian-BSL 580
BKS 2,072
------
Total 36,395
Including Auslan and NZSL was a bet that shared hand movements would help generalisation, even where specific signs differ.
The v18 mega-training run started with 306,174 samples across 14,948 sign classes. CPU utilisation climbed from 579% to 672% across the 16 cores. The SSH daemon became unreachable as the operating system had nothing left to give it. RAM usage grew from 6GB to 9.5GB.
At 02:24 on Thursday morning, after nearly 24 hours of continuous training, fold 1 came back: 89.6% top-1 accuracy, 98.6% top-5, across 14,948 signs.
The messy reality
Blog posts about ML projects tend to present a clean narrative: we had an idea, we tried it, it worked. The reality was considerably messier, and that is worth talking about because it is the reality of experimentation.
The sandbox session was ticking down. "Download any intermediary data so that we can resume if our sandbox acc expires." The response was sobering: "sandbox expire will also delete s3 data, the whole aws acc will go away." Twenty-four gigabytes of processed training data, extracted features, and partially trained models would vanish. We downloaded everything. It took hours over the SSH connection that kept dropping (SSH tends to struggle when you are running a CPU-intensive training job at 675% utilisation and the operating system has very little headroom left for anything else).
When we tried to speed up training by launching a GPU instance, we hit a GPU vCPU quota of zero. This is standard for new AWS accounts, not a sandbox restriction. The first quota increase request was denied. A second attempt was approved within a couple of hours. It is the kind of thing you only learn by trying.
The data downloading was its own adventure. Cloudflare blocked video downloads from EC2 IP addresses. The Auslan Signbank download hit connection resets and slowed to a crawl. The SSC STEM extraction died at 57% completion. Academic video servers had inconsistent availability and aggressive rate limits.
And then there is licensing. For a research experiment, downloading publicly available sign language videos and training a model is reasonable. But the licensing landscape is a patchwork. Only NZSL has a clearly permissive licence (CC BY 4.0). Auslan Signbank is CC BY-NC-ND 4.0 (non-commercial, no derivatives). SSC STEM is University of Edinburgh IP requiring explicit permission. BSLDict, BSL SignBank, Dicta-Sign, and Christian-BSL all have unclear or unstated terms.
The most notable omission is BOBSL. It contains 1,400 hours of interpreted BBC content with 2,281 sign classes and would be transformative training data. But access is restricted to academic research institutions under BBC Terms of Use. For a public sector innovation experiment, that door is closed. It is an area where more openly-licensed BSL data would make a significant difference.
Day four: the 1am impatience
The v19 training run was on a GPU instance, a g4dn.xlarge with an NVIDIA Tesla T4. What had taken 60+ hours on CPU was projected to take around 2.5 hours. The GPU sat at 100% utilisation, using 14GB of its 15GB VRAM.
At 01:15: "now?"
At 01:23: "now?"
At 01:28: "now?"
There is something both absurd and perfectly human about checking on a machine learning training run at one in the morning, every eight minutes, like a child asking "are we there yet?" from the back seat. The experiment had started as a professional curiosity on Monday morning. By Thursday night, it had become a compulsion.
At 10:10 on Friday morning: "sorry machine crashed, check in, hows it going?" The local machine had crashed overnight. The training, running on EC2, was fine. By 10:47, all data was downloaded locally. "Everything is off AWS. Safe to terminate."
The training architecture
+-------------------------------------------------------------------+
| Training Pipeline (EC2) |
| |
| +----------+ +----------+ +---------+ +------------+ |
| | Video |-->| MediaPipe|-->| Feature |-->| Train | |
| | Sources | | Holistic | | Extract | | PyTorch | |
| | (27,000+)| | Landmarks| | 142-dim | | MLP | |
| +----------+ +----------+ +---------+ +-----+------+ |
| | |
| v |
| +--------------+ |
| | ONNX Export | |
| | (30MB) | |
| +------+-------+ |
+--------------------------------------------+------++--------------+
|
+-------------------------+
v
+-----------------------------------------------+
| Browser (no server needed) |
| |
| +--------+ +----------+ +-------------+ |
| | Webcam |-->| MediaPipe|-->| ONNX Runtime| |
| | | | (browser)| | Web (30MB) | |
| +--------+ +----------+ +------+------+ |
| | |
| v |
| +------------+ |
| | Recognised | |
| | Sign | |
| +------------+ |
+------------------------------------------------+
The pipeline processes 27,000+ videos from seven data sources. MediaPipe Holistic extracts 142-dimensional feature vectors from each video frame. A PyTorch MLP classifier trains on the extracted features. The trained model exports to ONNX format and runs entirely in the browser. No server calls needed for inference.
Where we are now
The current model (version 19) recognises 18,871 distinct signs. Here is where honesty matters. Version 19 is actually less accurate than version 18: 77.8% top-1 versus 89.6%, and 97.2% top-5 versus 98.6%. More signs, worse per-sign accuracy. This is the entirely predictable consequence of scaling a classifier to nearly 19,000 classes, many of which are visually similar.
Version Signs Top-1 Top-5 What changed
---------------------------------------------------
v9 119 5.0% -- Hand-crafted scoring
v14 119 11.8% -- ML, synthetic data
v15 119 86.6% -- Multi-signer BSLDict
v16 944 85.7% -- 8x vocab expansion
v18 14,948 89.6% 98.6% 7 sources, 60h CPU
v19 18,871 77.8% 97.2% GPU, expanded Auslan
The model runs entirely in the browser using ONNX Runtime Web. No server calls needed for inference. It is a 30MB file that loads once and then classifies signs in milliseconds.
About that "we"
I should come clean about something. Throughout this post, I have said "we" in the way that blog posts about technical work tend to say "we." In this case, "we" means Claude and I. Claude the AI, specifically Claude Code, and I the human sitting in front of a laptop providing direction.
I should also come clean about the "four days." The experiment ran across four calendar days, but it was not four days of dedicated work. It was done in the margins of my actual day job: building NDX:Try, running the platform, promoting it across public sector, and the various other bits and bobs that fill a week at GDS. The four days of wall-clock time were largely Claude's. My contribution was more like a series of interruptions: checking in between meetings, giving direction over lunch, asking "now?" at one in the morning when I should have been asleep. The 60+ hours of compute ran regardless of whether I was paying attention to it, which is rather the point.
Nobody wrote any code for this project. Not the MediaPipe integration, not the PyTorch training pipeline, not the ONNX export, not the feature extraction scripts, not the browser-based classifier, not the download scrapers for seven different academic data sources, not the CloudFormation templates, not the Lambda functions, not the EC2 bootstrap scripts.
Not even this blog post.
The human contribution was direction and judgement. Which ideas to pursue. When to pivot. Whether the accuracy was good enough. Whether the experiment was worth continuing. When to say "this is not acceptable, this is british." When to say "plan a ML based classification development journey then." When to say "abandon running locally and boot a big vm." When to ask "now?" at one in the morning.
The AI handled the research, the implementation, the infrastructure, and the iteration. It also, as I mentioned, decided unprompted to build a practice mode with star ratings (which did not work, but was a reasonable idea).
I think this matters because it changes the profile of who can do this kind of work. You do not need to be a machine learning engineer. You do not need to know PyTorch, or MediaPipe, or how to configure EC2 instances, or how to export ONNX models. You need curiosity, a clear idea of what you are trying to achieve, and the judgement to evaluate whether it is working.
What this does not prove
It is still rubbish for real BSL translation. I am going to say that plainly because it is true.
Recognising isolated signs is to understanding BSL what recognising individual words is to understanding spoken English: necessary but nowhere near sufficient. BSL has its own grammar, which is fundamentally different from English. It uses space, facial expressions, body movement, and timing as grammatical structures. A raised eyebrow is not decoration, it is grammar. None of this is captured by a model that classifies isolated signs.
This was built by someone who does not know BSL (the process of doing it taught me an enormous amount about how much I did not know). It has not been tested with deaf BSL users. The accuracy numbers come from reference videos, not real-world signing. More signs made the model worse, not better. BSL has somewhere between 20,000 and 100,000 signs in active use, with dialects and regional variations that our training data has no awareness of.
However, I think it proves something more fundamental than any particular accuracy number.
What this does prove
A single person with a laptop and a sandbox can explore ideas that would previously have required a dedicated ML team. The entire experiment, from "I wonder if this is possible" to a working prototype with nearly 19,000 signs, was done in four days, without writing a line of code manually. No ML engineers, no frontend developers, no DevOps team. The compute for this project would have cost tens of thousands of pounds a decade ago.
Experiments are supposed to be messy. We hit GPU quota limits, SSH timeouts, expired sandbox sessions, flaky data downloads, the wrong sign language entirely, training runs that took four days instead of four hours, and a model that got worse as we added more data. None of that meant the experiment failed. It meant we were learning.
Perhaps the most important thing is this: public sector innovation does not need to start with a business case. This experiment might lead somewhere useful, or it might not. The point is that someone was able to try, to ask "what if?" and actually explore the answer, without procurement, without a project board, without a budget. That is what sandbox environments are for.
The phrasebook approach (hand-crafted rules, 5% accuracy) was never going to get us to fluency. The language school approach (synthetic training data, 11.8%) was better but still inadequate. The immersion approach (real data from real signers, 86.6%) was the breakthrough. And yet even immersion does not make you fluent. It just proves that fluency is possible, given enough time and the right environment.
The repo is published
The entire codebase, the frontend, the training pipeline, the data extraction scripts, the CloudFormation templates, the trained models, the documentation, and all 19 versions of increasingly questionable accuracy, is published at github.com/chrisns/bsl-experiment. For posterity and as a warning to others.
If you are in UK public sector and you have an idea you would like to explore with AWS, NDX:Try is there for exactly that. The worst that can happen is that your experiment does not work.
And that is fine. That is what experiments are for.
NDX:Try is available to UK public sector organisations. The BSL sign language recognition experiment, including all training code and models, is open source at github.com/chrisns/bsl-experiment.
(Views in this article are my own.)