Why I Added Pre-Generated Audio to Interview Aloud
One of the core ideas behind interview-aloud is simple:
knowing an answer isn’t the same as being able to say it out loud.
Reading silently gives you a false sense of confidence. Speaking forces clarity. That’s why audio matters in this project.
The problem with built-in speech synthesis
At first, Interview Aloud relied entirely on the browser’s built-in speech synthesis API (part of the Web Speech API). It’s convenient and widely available, but in practice it comes with some real issues:
- The voice often sounds robotic and unnatural
- Quality varies a lot between browsers and operating systems
- The same content can sound completely different on a Mac, Windows machine, or mobile device
- Some voices are fast, some are flat, some are hard to follow
This inconsistency breaks the learning experience. When you’re practicing interview answers, your brain is already busy structuring thoughts — fighting against poor audio quality doesn’t help.
What I wanted instead
I wanted audio that is:
- More human-like
- Consistent across devices
- Predictable and controlled
- Available instantly without runtime surprises
At the same time, I didn’t want to introduce runtime complexity, latency, or ongoing infrastructure costs just to play audio.
Choosing Google Cloud Text-to-Speech
I decided to use Google Cloud Text-to-Speech for generating the audio.
Two reasons made it a good fit for this project:
- The voices sound noticeably more natural than most built-in browser options
- The free tier is very generous, especially for a static-content use case like this
Since Interview Aloud is mostly text-based and changes infrequently, this approach keeps costs low while still improving quality significantly.
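To give a sense of what generation involves, here is a minimal sketch of a synthesis call using the official @google-cloud/text-to-speech Node.js client. The voice name and MP3 encoding are illustrative choices, not necessarily the exact settings the project uses.

```ts
import { TextToSpeechClient } from "@google-cloud/text-to-speech";

const client = new TextToSpeechClient();

// Turn one piece of interview text into MP3 bytes.
// The voice chosen here is only an example; any supported voice works.
async function synthesize(text: string): Promise<Uint8Array> {
  const [response] = await client.synthesizeSpeech({
    input: { text },
    voice: { languageCode: "en-US", name: "en-US-Neural2-D" },
    audioConfig: { audioEncoding: "MP3" },
  });
  return response.audioContent as Uint8Array;
}
```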
The solution: pre-generated audio at build time
Instead of generating audio on demand, everything happens during the build:
- Audio is generated at build time
- Each interview topic maps to a deterministic audio file name
- Files are stored in the public directory and served statically
- If an audio file already exists, generation is skipped
- Playback always prefers the pre-generated audio
- Browser speech synthesis is kept as a fallback when audio files are missing
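A build step along these lines ties those pieces together. This is a sketch under assumptions: the Topic shape, the public/audio location, and the slug-plus-hash naming scheme are hypothetical stand-ins for whatever the project actually uses, and synthesize is the helper from the sketch above.

```ts
import { createHash } from "node:crypto";
import { existsSync } from "node:fs";
import { mkdir, writeFile } from "node:fs/promises";
import path from "node:path";

// Hypothetical shape of an interview topic; the real data model may differ.
interface Topic {
  slug: string;
  answerText: string;
}

const AUDIO_DIR = path.join("public", "audio");

// Deterministic file name: the topic slug plus a short content hash,
// so the name only changes when the text changes.
function audioFileName(topic: Topic): string {
  const hash = createHash("sha256")
    .update(topic.answerText)
    .digest("hex")
    .slice(0, 8);
  return `${topic.slug}-${hash}.mp3`;
}

// Runs during the build: generate only the files that don't exist yet.
async function generateAll(topics: Topic[]): Promise<void> {
  await mkdir(AUDIO_DIR, { recursive: true });
  for (const topic of topics) {
    const filePath = path.join(AUDIO_DIR, audioFileName(topic));
    if (existsSync(filePath)) continue; // already generated: skip the TTS call
    const audio = await synthesize(topic.answerText); // Google Cloud TTS helper sketched earlier
    await writeFile(filePath, audio);
  }
}
```

Deriving the name from a content hash is one simple way to make “regenerate only when the text changes” automatic.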
This makes the system very cost-effective:
- No runtime API calls
- No repeated TTS requests for the same content
- No surprise bills from traffic spikes
Once an audio file exists, it’s reused forever unless the content changes.
Refactoring without breaking things
Under the hood, the speech logic was refactored so that:
- Pre-generated audio is always checked first
- Existing speech synthesis behavior remains intact
- The rest of the app doesn’t need to know which path is used
This keeps backward compatibility and makes the feature additive rather than disruptive.
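As a rough sketch of what that playback path can look like in the browser (the URL convention and helper name are assumptions, not the project’s actual code):

```ts
// Prefer the pre-generated file; fall back to the Web Speech API if it's missing.
async function speak(audioUrl: string, text: string): Promise<void> {
  const head = await fetch(audioUrl, { method: "HEAD" }).catch(() => null);
  if (head?.ok) {
    await new Audio(audioUrl).play(); // pre-generated audio wins when present
    return;
  }
  // Fallback: the browser's built-in speech synthesis, unchanged from before.
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}
```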
Why this matters for interview practice
Interviews are stressful partly because they’re spoken, not written.
Hearing a clear, natural explanation — then responding out loud — helps train the right mental pathways.
This change isn’t flashy, but it improves the core experience in a way that aligns closely with why Interview Aloud exists in the first place.
More consistency. Less friction. Better practice.
That’s the goal.