Tell Me A Story
A local-first system for capturing bedtime stories. Built for my daughter and me to understand our storytelling voices.
“Tell me a story. Tell me a story.
Tell me a storrryyyyyy!”
Origin
My daughter, Arti, shouts this sentence at me 1,000 times a day. Then at bedtime, I'm commanded to tell three more before she'll go to sleep. Folks, it's the end of the night. My brain's melted. I love my daughter but telling another story feels like a type of niche torture.
But when it's done, I'm always glad we share this ritual. And sometimes, I'm impressed by what we create together.
Pappy, the boy who looks like whatever food he wants to eat, usually a quesadilla. Too good, Arti! A perfectly executed Hero's Journey arc about a talking fork? Who knew it could be done?
Those bedtime stories are told and gone. I decided I wanted to keep them.
Why and for what, I'm not sure yet. One side of my brain complains, "Do we really have to digitize everything?" But a little voice insists, "Build the thing. Do it your way. Keep these stories. See what comes next when we get there." Ok...
The System
Guiding Principles
Immediate Connection
Removing friction opens up creative choices you couldn't have imagined before.
Hard Fun
The best learning happens when you're building something you actually want.
Calm Technology
Technology can live on the periphery: there when you need it, invisible when you don't.
The Arti Test
I only work on things for kids that I'd give my child.
Learning
Part of why I am excited about this project: I get to learn things I actually want to learn.
- AI-assisted development (Claude Code)
- Audio ML pipelines (Whisper, pyannote)
- IoT capture devices (ESP32)
- Edge ML deployment (Jetson, CUDA)
- Local-first architecture
When I realized this bedtime story project gave me a reason to explore NVIDIA's CUDA/Jetson stack, I was unreasonably happy. Robot-brain tech! Positronic! That's exciting. And I do want a voice in the rooms where the things that will interact with us, our families, and our kids get built.
AI & Kids
Once I have transcribed stories, the generative AI applications seem easy and obvious. Extract recurring themes and characters using a sprinkle of local model intelligence? (Seems fine.) Generate Nano Banana illustrations of characters in the style of famed Pixar illustrator Sanjay Patel? (Um... not cool.) Build an ElevenLabs-powered penguin companion stuffy that tells stories in a voice that sounds exactly like Daddy? (OH GOD. WHAT HAVE I DONE.)
Where do I go once the system is built? Not sure. I know the point isn't to outsource storytelling, creativity, or imagination to machines. It's to understand our voices as storytellers. And have fun.
I do believe this project will help me clarify my own red lines about AI/technology and kids. I've started talking to experts, artists, and writers in my children's media, animation, and writing networks to help sharpen my thinking.
The Mahabharata
Right as I started the project, Arti became interested in the Mahabharata. My absolute favorite stories from when I was a kid, but not the easiest to tell to a four-year-old.
The Mahabharata isn't the Ramayana. It's dense. It's twisty. It's murky. The good guys aren't totally good. The bad guys aren't totally bad. How do you tell your child about the part when Krishna cuts off Shishupal's head? When Bhima vows to drink Dushasana's blood? Or the simple fact that the five Pandava brothers share one wife? (Tho maybe this is very Gen Alpha and I am an old, old man)
I don't know. But I hope this project helps me figure it out.
Where I'm Coming From
I wrote for Mo Willems. I earned an Emmy nomination for Daniel Tiger's Neighborhood. Most recently, I built tools for the content team at Kibeam Learning, a company that makes interactive products for kids.
Build Log
Two Weeks In, Rearchitecture
After two weeks of building, I paused. I noticed I was writing code to reshape data into other shapes of data. I had debug files and intermediate files everywhere.
I decided I had two fundamental artifacts: transcript and diarization. I can derive everything else at query time.
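To make that concrete: a speaker-labeled transcript is just a join of the two artifacts on timestamps. A minimal sketch, assuming simple `Word` and `Turn` shapes that stand in for my actual schema:

```python
from dataclasses import dataclass

@dataclass
class Word:           # from the transcript artifact (Whisper)
    text: str
    start: float      # seconds
    end: float

@dataclass
class Turn:           # from the diarization artifact (pyannote)
    speaker: str      # e.g. "SPEAKER_00"
    start: float
    end: float

def speaker_for(word: Word, turns: list[Turn]) -> str:
    """Attribute a word to the diarization turn it overlaps most."""
    best, best_overlap = "UNKNOWN", 0.0
    for turn in turns:
        overlap = min(word.end, turn.end) - max(word.start, turn.start)
        if overlap > best_overlap:
            best, best_overlap = turn.speaker, overlap
    return best

def labeled_transcript(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
    """A speaker-labeled transcript, derived on demand -- never stored."""
    return [(speaker_for(w, turns), w.text) for w in words]
```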
I'm still figuring out what's straight noise versus what's incorrect but worth keeping in my source-of-truth artifacts. Whisper sometimes hallucinates "empty" segments where the start time equals the end time. I don't need those. But hallucinated words where there should be silence are harder: I'm still wondering how to catch those without side effects. Do I remove them completely? Flag them and keep them? I built a validation tool that helped me understand the model's outputs. These are the questions I'm sitting with now.
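The clear-cut case, at least, is a one-line filter. A sketch over Whisper's raw segment dicts; the `no_speech_prob` threshold for flagging is an unvalidated guess on my part:

```python
def clean_segments(segments: list[dict]) -> list[dict]:
    kept = []
    for seg in segments:
        # Zero-duration segments (start == end) are hallucinated noise: drop.
        if seg["end"] <= seg["start"]:
            continue
        # Words where there should be silence are murkier. For now, flag
        # rather than delete, using Whisper's own no-speech estimate.
        # The 0.8 threshold is a guess I haven't validated.
        seg["flagged_for_review"] = seg.get("no_speech_prob", 0.0) > 0.8
        kept.append(seg)
    return kept
```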
Validation Player, Take Two
I scrapped the existing validation player and rebuilt it.
The first version tried to do everything: verify transcription accuracy AND speaker labels.
The new version focuses on one question: what did Whisper hear? Waveform, word highlighting, playback controls, and more robust note-taking. Diarization validation can come later, once I trust the transcription.
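The sync at the heart of word highlighting is tiny. A sketch, in Python for illustration (the player's real code may look nothing like this): given Whisper's word start times, the word under the playhead is one binary search away.

```python
import bisect

def active_word_index(word_starts: list[float], playback_time: float) -> int:
    """Index of the word to highlight, given sorted word start times.

    Returns the last word whose start time is at or before the playhead,
    clamped to the first word before any speech starts.
    """
    return max(bisect.bisect_right(word_starts, playback_time) - 1, 0)
```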
Process note: I built the first version by having Claude Code write specs and implementation together. For the rebuild, I developed the spec in Claude Chat first—thinking through the workflow, what I actually needed to see—then handed a clean spec to Claude Code for implementation. Better results. Separating "what should this do" from "how do I build it" helped me think more clearly about both.
Whisper Needs Context
Short clips produced different transcriptions than the full recording. Arti's opening question:
| Clip Length | Transcription |
|---|---|
| 38s – 210s | "Dad, why do the Fondos and Goros lose?" |
| Full (5min) | "Dad, why do the Fondos and Goros want to be king?" ✓ |
Whisper uses future context to disambiguate unclear audio. The full conversation — all the later discussion about Yudhishthir and Duryodhan wanting the throne — helps the model correctly interpret Arti's slightly unclear speech.
Implication: Process full recordings; don't chunk before transcription. For bedtime audio where a child's voice can be soft or unclear, the surrounding context is what makes accurate transcription possible.
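In practice that's just one `transcribe()` call over the whole file. A minimal sketch with the openai-whisper Python API; the model size and filename are placeholders, not necessarily what I run:

```python
import whisper

model = whisper.load_model("medium")

# One call over the full recording, so the model can use the whole
# conversation as context. word_timestamps gives per-word timing for
# the validation player.
result = model.transcribe("bedtime_full_recording.wav", word_timestamps=True)

for seg in result["segments"]:
    print(f"[{seg['start']:7.2f}-{seg['end']:7.2f}] {seg['text'].strip()}")
```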