
Lighter week on Content Forge — seven commits, but each one addressed something that was either broken or missing from the Studio experience. The theme: making things reliable and measurable.
Knowing What Each Video Costs
The most useful addition this week was cost tracking. Every AI generation step in Studio — script writing, image generation, TTS audio, sound effects — now logs its cost. The total accumulates per session and shows in the chat UI, so you can see exactly how much a video cost to produce as you're building it.
This matters because Studio chains together multiple AI services in a single flow. A five-slide video might hit Claude for the script, Gemini for images, ElevenLabs for narration and sound effects — each with different pricing models. Without tracking, you're flying blind on unit economics. Now each generation call records its cost, and the running total updates in real time.
The implementation logs costs at the API call level rather than estimating from token counts. That means the numbers reflect actual charges, including any retries or regenerations. If you regenerate a slide's image three times, all three calls show up in the total.
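The shape of this is simple enough to sketch. Here's a minimal, hypothetical version of a per-session tracker — the type names and step labels are my own illustration, not the actual Studio code; the key property from the post is that every call is recorded, so retries and regenerations each add an entry:

```typescript
// Hypothetical sketch of per-session cost accumulation. Assumes each
// provider call reports its actual charge (not a token-count estimate).
type CostEntry = {
  step: "script" | "image" | "tts" | "sfx"; // illustrative step labels
  provider: string;
  usd: number; // actual charge reported for this one API call
};

class SessionCostTracker {
  private entries: CostEntry[] = [];

  // Record one API call. Every call is logged, so regenerating a
  // slide's image three times produces three entries.
  record(entry: CostEntry): void {
    this.entries.push(entry);
  }

  // Running total shown in the chat UI.
  total(): number {
    return this.entries.reduce((sum, e) => sum + e.usd, 0);
  }

  // Breakdown by generation step, for deciding which services to keep.
  byStep(): Record<string, number> {
    const out: Record<string, number> = {};
    for (const e of this.entries) {
      out[e.step] = (out[e.step] ?? 0) + e.usd;
    }
    return out;
  }
}
```

Logging at the call level rather than estimating from tokens is the design choice that makes the breakdown trustworthy: the tracker never has to know each provider's pricing model, only what each call actually cost.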
Fixing the Chat Persistence Crash
There was a client-side crash when loading persisted chat messages — the kind of bug that only shows up when you revisit a conversation after the session data has been serialised and restored from Postgres. The issue was in how the UI message format was being reconstructed from stored data. Parts that were tool calls or reasoning blocks weren't being handled correctly during deserialisation, so the renderer would hit an unexpected shape and throw.
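The fix amounts to handling every stored part shape explicitly instead of assuming the renderer's expectations hold. A hypothetical sketch — the part-type names here are assumptions, not the actual persisted schema — of what a defensive reconstruction looks like:

```typescript
// Hypothetical sketch of rebuilding UI messages from persisted rows.
// The part-type names ("text", "tool-call", "reasoning") are assumed;
// the point is that every stored shape maps to something the renderer
// knows how to draw, and nothing unexpected reaches it.
type StoredPart = { type: string; [key: string]: unknown };

type UiPart =
  | { kind: "text"; text: string }
  | { kind: "tool"; label: string }
  | { kind: "reasoning"; text: string };

function toUiParts(stored: StoredPart[]): UiPart[] {
  const parts: UiPart[] = [];
  for (const p of stored) {
    switch (p.type) {
      case "text":
        parts.push({ kind: "text", text: String(p.text ?? "") });
        break;
      case "tool-call":
        parts.push({ kind: "tool", label: String(p.toolName ?? "tool") });
        break;
      case "reasoning":
        parts.push({ kind: "reasoning", text: String(p.text ?? "") });
        break;
      default:
        // Unknown shapes are skipped instead of crashing the renderer.
        break;
    }
  }
  return parts;
}
```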
This was related to but different from the chat persistence fixes from two weeks ago. The earlier fix addressed saving messages correctly; this one addressed loading them back. Both sides of the round trip now work reliably.
Caption Sync and Sound Effects
Captions were drifting on some slides — specifically, slides where the audio had been regenerated but the captions hadn't been updated to match. Added a detection step that identifies segments with missing or stale captions and regenerates their audio to bring everything back in sync. It runs automatically during the export preparation phase, so you don't need to manually check each slide.
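The detection half of this can be sketched with a version counter. This is an assumed mechanism — the post doesn't describe how staleness is tracked — but one simple approach is to stamp captions with the audio version they were generated against:

```typescript
// Hypothetical staleness check. Assumes each segment stores a version
// number that is bumped on every audio regeneration, and that captions
// record which audio version they were generated from.
type Segment = {
  id: string;
  audioVersion: number; // bumped whenever the audio is regenerated
  captions?: { forAudioVersion: number; lines: string[] };
};

// A segment needs attention if captions are missing, or were generated
// against an older version of the audio than the one currently in place.
function findStaleSegments(segments: Segment[]): string[] {
  return segments
    .filter(s => !s.captions || s.captions.forAudioVersion !== s.audioVersion)
    .map(s => s.id);
}
```

Running a check like this during export preparation is what lets the sync repair happen automatically rather than relying on a manual per-slide review.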
Per-slide sound effects via ElevenLabs also landed this week. Each slide can have its own ambient sound or effect layered underneath the narration. The API integration was straightforward; the harder part was mixing the audio levels so effects don't compete with the voice track.
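The mixing problem reduces to attenuating the effect track before summing it with the voice. A minimal sketch, assuming mono float samples and an illustrative -12 dB duck (the post doesn't state the actual level used):

```typescript
// Convert a decibel value to a linear gain multiplier.
function dbToGain(db: number): number {
  return Math.pow(10, db / 20);
}

// Hypothetical mix: voice at unity gain, effect attenuated underneath,
// with a hard clamp to keep the summed signal inside [-1, 1].
// The -12 dB default is illustrative, not the value Studio settled on.
function mixVoiceAndEffect(
  voice: Float32Array,
  effect: Float32Array,
  effectDb = -12,
): Float32Array {
  const g = dbToGain(effectDb);
  const n = Math.max(voice.length, effect.length);
  const out = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    // Out-of-range typed-array reads yield undefined, so pad with 0.
    const sample = (voice[i] ?? 0) + (effect[i] ?? 0) * g;
    out[i] = Math.min(1, Math.max(-1, sample));
  }
  return out;
}
```

A fixed attenuation like this is the simplest approach; production mixers often go further with sidechain ducking, lowering the effect only while the voice is active.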
Export Reliability
Two fixes in the export path. First, the Chromium binary URL used for server-side PNG rendering had gone stale — updated it so exports work again on both Vercel and Railway. Second, export URLs are now persisted so you can retrieve a previously rendered video without triggering a full re-render. That saves both time and compute cost on longer videos.
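The URL persistence is essentially a cache keyed on project state. A hypothetical sketch — the storage interface here is an in-memory stand-in for what would actually live in Postgres, and the hash-based invalidation is my assumption about how a stored render would be known to still be valid:

```typescript
// Hypothetical export-URL cache: reuse a previously rendered video when
// the project hasn't changed since that render, skipping the re-render.
type ExportRecord = { projectHash: string; url: string };

class ExportCache {
  // In-memory stand-in; the real app persists this alongside the session.
  private store = new Map<string, ExportRecord>();

  // Return the stored URL only if it was rendered from this exact
  // project state; a changed hash means a re-render is required.
  getUrl(projectId: string, projectHash: string): string | null {
    const rec = this.store.get(projectId);
    return rec && rec.projectHash === projectHash ? rec.url : null;
  }

  // Persist the URL after a successful render.
  save(projectId: string, projectHash: string, url: string): void {
    this.store.set(projectId, { projectHash, url });
  }
}
```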
Design Quality
Integrated a structured set of design principles into the AI generation pipeline. These are layout rules — spacing, visual hierarchy, element placement — that the model follows when composing slides. It's not a dramatic change in the prompt, but the output compositions are noticeably more intentional. Less random element placement, more deliberate use of whitespace and alignment.
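Mechanically, this kind of integration can be as simple as folding a fixed rule list into the composition prompt. The rules and wording below are illustrative, not the actual principles shipped:

```typescript
// Hypothetical sketch of injecting structured layout rules into the
// slide-generation prompt. The rule text is illustrative only.
const DESIGN_PRINCIPLES = [
  "Use a consistent spacing scale; never cram elements together.",
  "Establish clear visual hierarchy: one dominant element per slide.",
  "Align elements to a grid; avoid arbitrary placement.",
  "Leave deliberate whitespace around text blocks.",
];

function buildSlidePrompt(topic: string): string {
  return [
    `Compose a slide about: ${topic}`,
    "",
    "Follow these layout rules when composing:",
    ...DESIGN_PRINCIPLES.map((rule, i) => `${i + 1}. ${rule}`),
  ].join("\n");
}
```

Keeping the rules in a structured list rather than prose makes them easy to version and tune independently of the rest of the prompt.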
What's Next
Studio is approaching the point where the end-to-end flow — topic to finished video — works without manual intervention. The remaining gaps are mostly around consistency: making sure the style holds across all slides in a multi-slide video, and making the voice-to-caption sync bulletproof. Cost tracking gives me the data I need to make decisions about which services to keep and which to replace.