Making Videos Sound Right: Music, Sound Effects, and Voice Selection

Last week was Studio's big structural push. This week was about making the output actually feel like a finished video. Twenty-three commits on Content Forge, almost all focused on audio, voice, and export polish.

Background Music and Sound Effects

A faceless video without audio is a slideshow. I added background music support — you can select a track that plays underneath the narration, properly mixed so the voice stays clear. But the more interesting addition was per-slide sound effects via the ElevenLabs Sound Effects API. Each slide can have its own ambient sound or effect layered in, which makes a surprising difference in perceived quality. A slide about cooking with a subtle sizzle underneath just hits different.
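To make "mixed so the voice stays clear" concrete, here is a minimal sketch of one common approach: ducking the music while narration is audible and fading it back up in the gaps. The type, the function names, and the volume and fade values are all assumptions for illustration, not Content Forge's actual mixing code.

```typescript
// Hypothetical shape: a frame range during which narration is audible.
interface NarrationRange {
  startFrame: number; // inclusive
  endFrame: number; // exclusive
}

// Assumed levels and fade length; tune by ear.
const DUCKED_VOLUME = 0.15; // music level while the voice is speaking
const FULL_VOLUME = 0.6; // music level when nobody is speaking
const FADE_FRAMES = 10; // frames taken to move between the two levels

// Distance in frames from `frame` to the nearest narration range (0 if inside one).
function framesFromNarration(frame: number, ranges: NarrationRange[]): number {
  let nearest = Infinity;
  for (const r of ranges) {
    if (frame >= r.startFrame && frame < r.endFrame) return 0;
    const gap =
      frame < r.startFrame ? r.startFrame - frame : frame - r.endFrame + 1;
    nearest = Math.min(nearest, gap);
  }
  return nearest;
}

// Music volume for a frame: ducked under narration, linearly faded up in the gaps.
function musicVolumeAt(frame: number, ranges: NarrationRange[]): number {
  const distance = framesFromNarration(frame, ranges);
  const t = Math.min(distance / FADE_FRAMES, 1);
  return DUCKED_VOLUME + (FULL_VOLUME - DUCKED_VOLUME) * t;
}
```

If the renderer accepts a per-frame volume callback for an audio track, a function like `musicVolumeAt` could be passed to it directly; otherwise the same curve can be baked into the track before export.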

Getting the audio timing right was fiddly. When you regenerate audio for a segment, the captions can drift out of sync. I built a fix that detects segments with missing captions and regenerates their audio, which keeps everything aligned. The audio buffer between segments also got tuned: too short and the narration clips, too long and the video feels disjointed.
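The two pieces described above can be sketched roughly like this: one function that picks out segments whose captions are missing, and one that lays segments end to end with a fixed buffer between them. The `Segment` shape, the function names, and the 300 ms default are assumptions for illustration only.

```typescript
// Hypothetical caption and segment shapes.
interface Caption {
  text: string;
  startMs: number;
  endMs: number;
}

interface Segment {
  id: string;
  audioDurationMs: number;
  captions: Caption[];
}

// Assumed default gap between segments; the real value was tuned by ear.
const SEGMENT_BUFFER_MS = 300;

// IDs of segments that have no captions and so need their audio regenerated.
function segmentsMissingCaptions(segments: Segment[]): string[] {
  return segments.filter((s) => s.captions.length === 0).map((s) => s.id);
}

// Start time of each segment when played back to back with a buffer in between.
function layoutSegments(
  segments: Segment[],
  bufferMs: number = SEGMENT_BUFFER_MS
): Record<string, number> {
  const starts: Record<string, number> = {};
  let cursorMs = 0;
  for (const s of segments) {
    starts[s.id] = cursorMs;
    cursorMs += s.audioDurationMs + bufferMs;
  }
  return starts;
}
```

Recomputing the layout after any segment is regenerated keeps later segments from inheriting a stale offset, which is one way drift can creep in.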

Voice Picker

The voice selection UI used to be a plain dropdown: usable, but not much help when you have dozens of options across multiple languages. I replaced it with a proper modal that groups voices by language, shows use-case filters (narration, conversational, dramatic), and lets you preview each voice with a sample before committing. I also deduplicated the voice list, since several voices appeared in multiple language groups.
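As a rough sketch of the grouping and deduplication, assuming the voice list arrives as flat entries with a voice ID, a language, and a use case: the `VoiceEntry` shape and the "first occurrence wins" rule are my assumptions here, not necessarily how the real list is structured.

```typescript
type UseCase = "narration" | "conversational" | "dramatic";

// Hypothetical flat entry as it might arrive from a voice list.
interface VoiceEntry {
  voiceId: string;
  label: string;
  language: string;
  useCase: UseCase;
}

// Keep only the first entry seen for each voiceId.
function dedupeVoices(entries: VoiceEntry[]): VoiceEntry[] {
  const seen: Record<string, boolean> = {};
  const unique: VoiceEntry[] = [];
  for (const entry of entries) {
    if (seen[entry.voiceId]) continue;
    seen[entry.voiceId] = true;
    unique.push(entry);
  }
  return unique;
}

// Group deduplicated voices by language, optionally filtered to one use case.
function groupVoicesByLanguage(
  entries: VoiceEntry[],
  useCase?: UseCase
): Record<string, VoiceEntry[]> {
  const groups: Record<string, VoiceEntry[]> = {};
  for (const entry of dedupeVoices(entries)) {
    if (useCase !== undefined && entry.useCase !== useCase) continue;
    if (!groups[entry.language]) groups[entry.language] = [];
    groups[entry.language].push(entry);
  }
  return groups;
}
```

Deduplicating before grouping means a voice shows up once, under the language of its first entry, so the ordering of the incoming list decides where a multilingual voice lands.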

When you change voice, the system now properly clears existing audio and lets you regenerate fresh. Before, switching voices would leave stale audio attached to slides, which produced jarring cuts between old and new voice segments.
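The voice-change behavior boils down to invalidating every slide's audio whenever the project voice changes. A minimal sketch, with hypothetical `Slide` and `Project` shapes and a hypothetical `changeVoice` helper:

```typescript
// Hypothetical shapes; the real project model will have more fields.
interface Slide {
  id: string;
  narration: string;
  audioUrl: string | null; // null means audio still needs to be generated
}

interface Project {
  voiceId: string;
  slides: Slide[];
}

// Returns a project with the new voice and all slide audio cleared.
// If the voice is unchanged, the project is returned untouched so existing audio is kept.
function changeVoice(project: Project, newVoiceId: string): Project {
  if (project.voiceId === newVoiceId) return project;
  return {
    voiceId: newVoiceId,
    slides: project.slides.map((slide) => ({ ...slide, audioUrl: null })),
  };
}
```

Clearing everything in one step is blunt, but it guarantees a video never mixes segments from two different voices.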

Export and Download Flow

The export pipeline had a few rough edges. The Download button wasn't showing when a video was ready — fixed that visibility logic. Music export had a separate issue where the audio track wasn't being included in the final render. Both required tracing through the Remotion composition to find where the streams were being dropped.

Also added persistence for export URLs so you can retrieve a previously rendered video without re-rendering. That's important for longer videos where a render might take a few minutes.
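One way to think about the lookup: before starting a render, check whether a stored export already exists for this exact version of the project. The sketch below assumes each project carries a `revision` counter that is bumped on every edit; that field, the `ExportRecord` shape, and the function name are hypothetical.

```typescript
// Hypothetical record saved after each successful render.
interface ExportRecord {
  projectId: string;
  revision: number; // assumed counter that increases whenever the project is edited
  url: string;
  renderedAt: string; // ISO 8601 UTC timestamp
}

// Returns the URL of the newest export matching this project and revision,
// or null if the video has to be rendered again.
function findReusableExport(
  records: ExportRecord[],
  projectId: string,
  revision: number
): string | null {
  let newest: ExportRecord | null = null;
  for (const record of records) {
    if (record.projectId !== projectId || record.revision !== revision) continue;
    // ISO 8601 UTC strings in the same format sort chronologically as plain strings.
    if (newest === null || record.renderedAt > newest.renderedAt) newest = record;
  }
  return newest === null ? null : newest.url;
}
```

Matching on revision as well as project ID avoids handing back a video that predates the latest edits.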

Design Generation Improvements

Alongside the Studio work, I integrated what I'm calling "impeccable design principles" into the AI generation pipeline: a structured set of layout rules that the model follows when composing slides. Early results show noticeably better compositions, with more intentional spacing and visual hierarchy.

The PNG export path also got a fix — the Chromium binary URL for serverless rendering on Vercel had gone stale, so exports were silently failing. Updated the binary reference and confirmed exports work again on both Vercel and Railway.

Looking Ahead

Content Forge is getting close to the point where you can go from a topic to a polished faceless video in one chat session. The audio layer was the biggest missing piece, and it's now in place. Next up is refining the end-to-end flow and making cost tracking visible so I know exactly what each generation costs.