AI model evaluation
Why I shipped plain code, not an AI model
Problem
I built a pipeline to transcribe made-up bedtime stories and label who said what. I imposed one constraint: nothing leaves my local machine, for family privacy. Analyzing the transcriptions, I found the most common error was mistranscribed proper names. That matters because I plan to chart these improvised characters across sessions, which only works if their names stay stable. The pipeline even misspelled my daughter's own name.
So before fixing it, I ran an evaluation: what's the most reliable way to catch proper-name errors against a set roster?
Action
First I went through the session transcripts and flagged every place the tool mis-rendered a name. With the family roster as the correct spellings, those flags became the ground truth I scored against. Then I ran a range of detectors over the same transcripts: plain code that matches words by phonetic sound, a size-ladder of local AI models (each prompted with the roster), and a frontier cloud model as a comparison.
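To make "plain code that matches words by phonetic sound" concrete, here is a minimal sketch of one way such a detector could work. It assumes classic Soundex as the phonetic key, which may not be the algorithm the shipped detector uses, and the roster names and sentence are hypothetical placeholders, not the family's real data.

```python
import re

# Soundex consonant groups: letters in the same group sound alike.
_CODES = {}
for _letters, _digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character American Soundex key for a word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of same-coded consonants
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (key + "000")[:4]


def flag_names(text: str, roster: list[str]) -> list[tuple[str, str]]:
    """Flag words that sound like a roster name but are spelled differently."""
    by_key = {soundex(name): name for name in roster}
    exact = {name.lower() for name in roster}
    flags = []
    for token in re.findall(r"[A-Za-z']+", text):
        if token.lower() in exact:
            continue  # already the correct spelling
        match = by_key.get(soundex(token))
        if match:
            flags.append((token, match))
    return flags


# Hypothetical roster and sentence, for illustration only.
print(flag_names("Then Mia told Rupert about the moon.", ["Maya", "Robert"]))
```

Soundex keeps the first letter and codes only consonants, so a sketch like this would miss a name whose first letter changes and could flag short ordinary words that share a key with a roster name. A real detector would need extra filtering on top of it.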
One honest limit on the comparison: every model got the same prompt and I changed only the model. That's a fair head-to-head, not proof a sharper prompt couldn't push one further.
Result
Scored against the roughly 100 errors I'd flagged across 5 sessions (1,382 segments of transcript), the detectors split sharply by what they could actually do on the device.
| Detector | Runs on device? | Result |
|---|---|---|
| Phonetic matcher (plain code) | yes | Caught all but one name error [1], and 96% of its flags were right (99% recall, 96% precision) |
| Gemma, ~2B | yes | Crashed on the longer stories |
| Gemma, ~4B | yes | Stable, but over-flagged ordinary words |
| Qwen, ~8B | yes | Crashed on the long, name-dense story |
| Gemma, ~12B | yes | First to reason through a hard case, but buried it in false flags |
| Gemma, ~26B | no (out of memory) | Won't run on the device |
| Opus (cloud) | no (privacy) | Reasoned out every case except one [3], but breaks the local-only constraint |
So the phonetic matcher is what ships. As a final check, I ran it on 2 new sessions it had never seen: 745 segments of fresh transcript. It caught all 10 name errors a careful listen could find, with a single false positive [2] (100% recall, 91% precision). A very different name might not match as cleanly, but for a tool only I use, that's reliable enough.
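The held-out figures follow directly from the counts in the paragraph above: 10 errors caught, 1 false positive, none missed. A small sketch of that arithmetic, where the function name is illustrative and not taken from the project's scorer:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = share of flags that were right; recall = share of errors caught."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Held-out check: 10 true positives, 1 false positive, 0 missed errors.
precision, recall = precision_recall(tp=10, fp=1, fn=0)
print(f"precision={precision:.0%} recall={recall:.0%}")  # precision=91% recall=100%
```

With only 11 flags in total, one false positive moves precision by about 9 points, so the held-out number is a sanity check more than a tight estimate.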
The choice has honest costs, and they live in 3 outliers:
That last case is the one nothing text-based will ever catch: once the name comes out as ordinary words that fit in context, the transcript holds no trace of the error.
The next step is to give the model the audio and let it listen, not just read.
The real payoff isn't the detector; it's the foundation the evaluation leaves. Once you've found the failure modes, sorted and measured them, and built a way to detect them, you can fix the problem on solid ground and keep catching it as the system grows.
The code behind this (the detector, its scorer, and the audio checks) is on GitHub. Written with AI assistance, reviewed and edited by me.