AI model evaluation

Why I shipped plain code, not an AI model

Problem

I built a pipeline to transcribe made-up bedtime stories and label who said what. I imposed one constraint: nothing leaves my local machine, for family privacy. Analyzing the transcriptions, I found the most common error was mistranscribed proper names. That matters because I plan to chart these improvised characters across sessions, which only works if their names stay stable. The transcriber even misspelled my daughter's own name:

Her name: Arti

transcribed as → Artie, Arthie, Earthy, Arty

So before fixing it, I ran an evaluation: what's the most reliable way to catch proper-name errors against a set roster?

Action

First I went through the session transcripts and flagged every place the tool mis-rendered a name. With the family roster as the correct spellings, those flags became the ground truth I scored against. Then I ran a range of detectors over the same transcripts: plain code that matches words by phonetic sound, a size-ladder of local AI models (each prompted with the roster), and a frontier cloud model as a comparison.
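To make "matches words by phonetic sound" concrete, here is a minimal sketch of the idea, not the shipped detector. It assumes a Soundex-style code built only from consonant classes (vowels, h, w, and y are dropped, so the first letter gets no special treatment). The names `phonetic_code` and `flag_name_errors` are hypothetical. A code this loose will also match ordinary words such as "word" or "art", so a real matcher likely needs more screening than is shown here.

```python
import re

# Soundex-style consonant classes. Vowels, h, w, y, and punctuation get no code.
# Assumption: this particular code is illustrative, not the author's actual one.
_CLASSES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        _CLASSES[letter] = digit

_TOKEN = re.compile(r"[A-Za-z']+")


def phonetic_code(word):
    """Return the word's consonant classes in order, collapsing adjacent repeats."""
    digits = []
    previous = ""
    for ch in word.lower():
        digit = _CLASSES.get(ch, "")
        if digit and digit != previous:
            digits.append(digit)
        previous = digit
    return "".join(digits)


def flag_name_errors(text, roster):
    """Flag tokens that sound like a roster name but are not spelled like one."""
    by_code = {phonetic_code(name): name for name in roster}
    flags = []
    for match in _TOKEN.finditer(text):
        token = match.group()
        code = phonetic_code(token)
        if code and code in by_code and token not in roster:
            flags.append((token, by_code[code]))
    return flags


roster = ["Arti"]
print(flag_name_errors("Artie asked where'd Ricky go", roster))
# [('Artie', 'Arti'), ("where'd", 'Arti')]
```

Under this code "Arti", "Artie", "Arthie", "Earthy", and "Arty" all reduce to the same class sequence, which is why the misspellings are catchable. "where'd" lands on the same sequence too, which is the kind of sound-alike false alarm a sound-only match cannot avoid.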

One honest limit on the comparison: every model got the same prompt and I changed only the model. That's a fair head-to-head, not proof a sharper prompt couldn't push one further.

Result

Scored against the roughly 100 errors I'd flagged across 5 sessions (1,382 segments of transcript), the detectors split sharply by what they could actually do on the device.

Detector	Runs on device?	Result
Phonetic matcher (plain code)	yes	Caught all but one name error¹, and 96% of its flags were right (99% recall, 96% precision)
Gemma, ~2B	yes	Crashed on the longer stories
Gemma, ~4B	yes	Stable, but over-flagged ordinary words
Qwen, ~8B	yes	Crashed on the long, name-dense story
Gemma, ~12B	yes	First to reason through a hard case, but buried it in false flags
Gemma, ~26B	no (out of memory)	Won't run on the device
Opus (cloud)	no (privacy)	Reasoned out every case except one³, but breaks the local-only constraint

Gemma, Qwen, and Opus are AI models; the numbers (~2B to ~26B) are their sizes in billions of parameters. Bigger models are generally more capable but need more memory, and the ~26B wouldn't fit on the 16GB M1 MacBook Pro this ran on. Recall is the share of real errors the detector caught; precision is the share of its flags that were real errors.

So the phonetic matcher is what ships. As a final check, I ran it on 2 new sessions it had never seen: 745 segments of fresh transcript. It caught all 10 name errors a careful listen could find, with a single false positive² (100% recall, 91% precision). A very different name might not match as cleanly, but for a tool only I use, that's reliable enough.
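The arithmetic behind those two percentages is small enough to sketch. This assumes each flag and each ground-truth error is identified by a comparable key such as a segment id; the integer ids below are placeholders, not real data, and `score` is a hypothetical name rather than the scorer in the repo.

```python
def score(flagged, truth):
    """Return (precision, recall) for a set of flagged items against ground truth."""
    true_positives = len(flagged & truth)
    false_positives = len(flagged - truth)
    false_negatives = len(truth - flagged)

    flagged_total = true_positives + false_positives
    truth_total = true_positives + false_negatives
    precision = true_positives / flagged_total if flagged_total else 0.0
    recall = true_positives / truth_total if truth_total else 0.0
    return precision, recall


# Held-out check from the text: 10 real errors, all caught, plus 1 false positive.
truth = set(range(10))   # placeholder ids for the 10 real name errors
flagged = truth | {99}   # the detector's 11 flags, one of them wrong

precision, recall = score(flagged, truth)
print(round(precision, 2), round(recall, 2))
# 0.91 1.0
```

Ten correct flags out of eleven is 10/11, which rounds to the 91% precision quoted above, and catching all ten real errors gives 100% recall.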

The choice has honest costs, and they live in 3 outliers:

Swapped for another name¹: Arti became Ricky. No sounds in common, so matching by sound is blind to it. Only the cloud model caught it, by reasoning from context that a parent was praising her.

Sound-alike false alarms²: "weird" and "where'd" flagged as her name. Correctly transcribed words that share her name's phonetic code. They're lowercase mid-sentence where real names are capitalized, so a one-line capitalization check, added after this test, now screens them out.

Dissolved into ordinary words³: Arti became "are these". Nothing name-shaped is left to flag. Every method missed it, cloud included.

That last case is the one nothing text-based will ever catch: once the name comes out as ordinary words that fit in context, the transcript holds no trace of the error.
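For the sound-alike false alarms, the capitalization screen can be as small as described. A minimal sketch, assuming flags arrive as (token, roster name) pairs; that shape and the name `keep_capitalized` are illustrative, not taken from the repo.

```python
def keep_capitalized(flags):
    """Drop flags whose token is lowercase, since real names are capitalized."""
    return [(token, name) for token, name in flags if token[:1].isupper()]


flags = [("Artie", "Arti"), ("weird", "Arti"), ("where'd", "Arti")]
print(keep_capitalized(flags))
# [('Artie', 'Arti')]
```

One limit worth noting: a sound-alike word that opens a sentence ("Weird, isn't it?") is capitalized too, so this screen would let it through.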

The next step is to give a model the audio and let it listen, not just read.

The real payoff isn't the detector; it's the foundation the evaluation leaves. Once you've found the failure modes, sorted and measured them, and built a way to detect them, you can fix the problem on solid ground and keep catching it as the system grows.

The code behind this (the detector, its scorer, and the audio checks) is on GitHub. Written with AI assistance, reviewed and edited by me.