Teaching an App to Talk Back: A New Language Tutor From Empty Repo to Live Voice

2026-06-22

Some weeks are maintenance. This one was the opposite — I started a brand-new project from an empty repo and pushed it all the way to live, spoken AI conversations on a phone. The bulk of my time went into a language-learning app, and most of the lessons came not from building features but from getting audio to actually come out of the speaker.

From Nothing to a Real App

The week started with the most basic thing possible: an Xcode project with a SwiftUI scaffold and a web backend restructured into a monorepo so the iOS app and the API could share a roof. From there it moved fast. Auth landed first — Supabase sessions with a Bearer-token API client, the same pattern I lean on across all my apps because once you've wired it once, you stop thinking about it.

Then the actual product showed up in layers. Phase 1 was the foundation. Phase 2 was where it started to feel like a language tutor: a streaming conversation practice loop, a Learn tab with lesson listening and quizzes, and a language switcher so it isn't locked to one target language. I seeded intermediate Spanish and Japanese content — real lessons and conversation scenarios — so I'd have something honest to test against instead of "hello world" placeholders.

The features that made it feel genuinely useful were the ones tied to speaking. Read-along got pronunciation scoring with word-synced playback, so you can read a sentence aloud and get told which words you fumbled. I went through a couple of providers for that — started with SpeechSuper, then moved to Azure AI Speech for the read-along assessment, mostly down to scoring quality and voice naturalness. Conversation practice got a voice layer too: the app speaks its replies and transcribes yours from the mic. And for Japanese specifically, I added furigana ruby rendering so kanji shows its reading inline, in both chat and lessons — a small touch that makes the difference between "I can read this" and "I'm guessing." A Vocab tab rounded it out with a notebook and SRS flashcard review, plus AI-assisted adding of new words.
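For a sense of what SRS review scheduling involves, here is a minimal Leitner-style sketch in Python. It is an illustration, not the app's actual scheduler: the box intervals and the names Card and review are assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed review intervals per box, in days. Real apps tune these.
INTERVAL_DAYS = [1, 2, 4, 8, 16]


@dataclass(frozen=True)
class Card:
    word: str
    box: int   # index into INTERVAL_DAYS
    due: date  # next day the card should be shown


def review(card: Card, correct: bool, today: date) -> Card:
    """Return the card's next state after one review.

    A correct answer promotes the card one box (capped at the last box);
    a miss sends it back to box 0 so it comes up again soon.
    """
    if correct:
        box = min(card.box + 1, len(INTERVAL_DAYS) - 1)
    else:
        box = 0
    return Card(card.word, box, today + timedelta(days=INTERVAL_DAYS[box]))
```

The point of the structure is that words you keep getting right drift out to longer gaps, while a single miss pulls a word back into the short-interval box.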

That's a lot of surface area for one app in a few days. But the part that ate the most hours wasn't any of it.

The Silence

I wanted real-time spoken conversation — not record-then-send, but an actual back-and-forth voice call with an AI tutor. I locked the hosting decision to LiveKit Cloud, wrote up an implementation plan, and built a Phase A spike: a LiveKit agent, a token route, and an iOS call screen wired to the LiveKit Swift SDK.

It connected. It just didn't make a sound.

What followed was the kind of debugging that doesn't show up in a feature demo but defines whether the feature exists at all. The agent's greeting was going through the wrong path: generating it via the chat completion flow tripped a 400 from Claude on empty tools, so I moved the opener to a direct say() and routed around the system-only message constraint with an OpenAI-compatible endpoint. That fixed the greeting but not the silence underneath it. The real culprit turned out to be a transcription-sync layer that was wedging the text-to-speech calls entirely, plus ElevenLabs sync_alignment quietly producing nothing. Disabling that sync path is what finally made audio come out.

On the iOS side it was a parallel fight: the mic needed to be enabled at connect rather than after, the simulator's audio engine kept throwing a -4010 error I had to make non-fatal, and the call screen was latching to an "ended" state while the call was very much still connected. There was a memorable detour where a connect() guard short-circuited the first connection attempt, and another where the call was opening two rooms at once until I made connect idempotent and presented the call as a proper full-screen cover.
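The "two rooms at once" bug is a general shape, so here is a rough sketch of an idempotent connect in Python with asyncio. This is not the app's Swift code or the LiveKit SDK: CallSession and open_room are hypothetical names, and open_room stands in for whatever actually joins the room.

```python
import asyncio


class CallSession:
    """Wraps a connect operation so repeated calls share one attempt."""

    def __init__(self, open_room):
        # open_room: hypothetical coroutine function that joins a room.
        self._open_room = open_room
        self._task = None

    async def connect(self):
        # First caller starts the attempt; later callers await the same task,
        # whether it is still in flight or already finished.
        if self._task is None:
            self._task = asyncio.create_task(self._open_room())
        try:
            return await self._task
        except Exception:
            # Forget a failed attempt so a later connect() can retry.
            self._task = None
            raise


async def demo_double_connect():
    opened = []

    async def open_room():
        await asyncio.sleep(0.01)  # pretend to do network work
        opened.append("room")
        return len(opened)

    session = CallSession(open_room)
    # Two overlapping calls plus one late call: only one room should open.
    results = await asyncio.gather(session.connect(), session.connect())
    await session.connect()
    return opened, results
```

The design choice is that the guard remembers the attempt itself rather than a boolean flag, which avoids the case where a flag short-circuits the very first connection.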

To stop debugging blind, I built tooling: a headless audio-probe that measures actual audio energy rather than just confirming bytes are flowing, a doctor script that preflights every provider's health before a call, and a playground token helper so I could test the agent from a browser instead of rebuilding the app every time. Those three things turned "why is it silent" from a guessing game into a checklist. By the end the agent speaks only the target language — translations and furigana stripped from what it says aloud — and the call pre-warms LiveKit with a loading state so the first connection doesn't feel broken.
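To make "measures actual audio energy" concrete, here is a small sketch of the idea in Python. It assumes the probe receives frames of 16-bit little-endian mono PCM. The -50 dBFS threshold, the minimum frame count, and every function name are assumptions for illustration, not the real tool's values.

```python
import math
import struct


def rms_dbfs(pcm16: bytes) -> float:
    """RMS level of 16-bit little-endian mono PCM, in dB relative to full scale."""
    n = len(pcm16) // 2
    if n == 0:
        return float("-inf")
    samples = struct.unpack(f"<{n}h", pcm16[: n * 2])
    mean_square = sum(s * s for s in samples) / n
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(math.sqrt(mean_square) / 32768.0)


def has_audible_audio(frames, threshold_dbfs=-50.0, min_loud_frames=5):
    """True if enough frames carry energy above the (assumed) threshold."""
    loud = sum(1 for frame in frames if rms_dbfs(frame) > threshold_dbfs)
    return loud >= min_loud_frames


def sine_frame(freq_hz=440.0, amplitude=10000, sample_rate=48000, samples=480):
    """Build one test frame containing a sine tone."""
    values = [
        int(amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(samples)
    ]
    return struct.pack(f"<{samples}h", *values)
```

A stream of zero-filled frames still counts as "bytes are flowing", which is exactly why a byte counter says the call is healthy while the speaker stays quiet. An energy check fails that case.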

The takeaway I keep relearning: with real-time audio, "it connected" tells you almost nothing. Build the probe before you need it.

Billing and Scheduling Elsewhere

The new app took most of the oxygen, but two other projects moved.

On faceless-vid, I added Stripe subscription billing on a credits model and wrote up a consistency design proposal for the video output. The integration came with the usual deploy-time papercuts — the Stripe client had to be constructed lazily so it wouldn't blow up the Vercel build, and the Remotion Lambda deploy needed the @/ path alias applied through a deploySite webpack override before it would render. I also fixed slide-transition timing and dropped some unused Instagram tables that were cluttering the schema.
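The lazy-construction fix follows a pattern that is easy to show in miniature. The sketch below is Python, not the project's actual Stripe setup: PaymentsClient is a hypothetical stand-in for an SDK client that fails when its key is missing, and PAYMENTS_SECRET_KEY is an invented variable name.

```python
import os
from functools import lru_cache


class PaymentsClient:
    """Hypothetical stand-in for an SDK client that requires a secret key."""

    def __init__(self, api_key: str):
        if not api_key:
            raise ValueError("missing API key")
        self.api_key = api_key


# Eager version (the failure mode): constructing at import time raises
# wherever the secret is absent, such as a build step.
#   client = PaymentsClient(os.environ.get("PAYMENTS_SECRET_KEY", ""))


@lru_cache(maxsize=1)
def get_payments_client() -> PaymentsClient:
    """Construct the client on first use and reuse it afterwards."""
    return PaymentsClient(os.environ.get("PAYMENTS_SECRET_KEY", ""))
```

Importing this module never touches the environment, so a build that lacks the secret still succeeds. The error only surfaces when a request handler actually asks for the client.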

On Content Forge, I shipped a bulk drip-scheduler that generates per-row AI captions — so instead of scheduling posts one at a time, you can queue a batch and let it write the captions as it goes. Small feature, but it's the kind of automation that compounds.
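The core of a drip scheduler fits in a few lines. Here is a Python sketch under stated assumptions: rows are plain values, posts go out at a fixed interval, and write_caption is a hypothetical callback standing in for the AI caption call. None of these names come from Content Forge itself.

```python
from datetime import datetime, timedelta


def drip_schedule(rows, start: datetime, every: timedelta, write_caption):
    """Assign each row a publish time and a generated caption.

    Row i is scheduled at start + i * every, so the batch drips out
    at a steady cadence instead of landing all at once.
    """
    scheduled = []
    for i, row in enumerate(rows):
        scheduled.append(
            {
                "row": row,
                "publish_at": start + i * every,
                "caption": write_caption(row),  # one caption per row
            }
        )
    return scheduled
```

For example, three rows starting at 09:00 with a six-hour gap would be scheduled for 09:00, 15:00, and 21:00 the same day.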

What's Next

The language app is real now — it talks, it listens, it scores you. The next stretch is making the live conversation robust enough to trust on a flaky network, not just a clean simulator. Audio taught me its lesson this week. I suspect it has a few more.

#ios #swift #ai #voice #livekit