Why I Added Pre-Generated Audio to Interview Aloud
One of the core ideas behind interview-aloud is simple:
knowing an answer isn’t the same as being able to say it out loud.
Reading silently gives you a false sense of confidence. Speaking forces clarity. That’s why audio matters in this project.
The problem with built-in speech synthesis
At first, Interview Aloud relied entirely on the browser’s built-in speech synthesis API (part of the Web Speech API). It’s convenient and widely available, but in practice it comes with some real issues:
- The voice often sounds robotic and unnatural
- Quality varies a lot between browsers and operating systems
- The same content can sound completely different on a Mac, Windows machine, or mobile device
- Some voices are fast, some are flat, some are hard to follow
This inconsistency breaks the learning experience. When you’re practicing interview answers, your brain is already busy structuring thoughts — fighting against poor audio quality doesn’t help.
What I wanted instead
I wanted audio that is:
- More human-like
- Consistent across devices
- Predictable and controlled
- Available instantly without runtime surprises
At the same time, I didn’t want to introduce runtime complexity, latency, or ongoing infrastructure costs just to play audio.
Choosing Google Cloud Text-to-Speech
I decided to use Google Cloud Text-to-Speech for generating the audio.
Two reasons made it a good fit for this project:
- The voices sound noticeably more natural than most built-in browser options
- The free tier is very generous, especially for a static-content use case like this
Since Interview Aloud is mostly text-based and changes infrequently, this approach keeps costs low while still improving quality significantly.
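To give a sense of what generation involves, here is a minimal sketch of a synthesis call using the official @google-cloud/text-to-speech Node.js client. The voice name and MP3 encoding are illustrative choices, not necessarily the exact settings the project uses.

```ts
import { TextToSpeechClient } from "@google-cloud/text-to-speech";

const client = new TextToSpeechClient();

// Turn one piece of interview text into MP3 bytes.
// The voice chosen here is only an example; any supported voice works.
async function synthesize(text: string): Promise<Uint8Array> {
  const [response] = await client.synthesizeSpeech({
    input: { text },
    voice: { languageCode: "en-US", name: "en-US-Neural2-D" },
    audioConfig: { audioEncoding: "MP3" },
  });
  return response.audioContent as Uint8Array;
}
```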
The solution: pre-generated audio at build time
Instead of generating audio on demand, everything happens during the build:
- Audio is generated at build time
- Each interview topic maps to a deterministic audio file name
- Files are stored in the public directory and served statically
- If an audio file already exists, generation is skipped
- Playback always prefers the pre-generated audio
- Browser speech synthesis is kept as a fallback when audio files are missing
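A build step along these lines ties those pieces together. This is a sketch under assumptions: the Topic shape, the public/audio location, and the slug-plus-hash naming scheme are hypothetical stand-ins for whatever the project actually uses, and synthesize is the helper from the sketch above.

```ts
import { createHash } from "node:crypto";
import { existsSync } from "node:fs";
import { mkdir, writeFile } from "node:fs/promises";
import path from "node:path";

// Hypothetical shape of an interview topic; the real data model may differ.
interface Topic {
  slug: string;
  answerText: string;
}

const AUDIO_DIR = path.join("public", "audio");

// Deterministic file name: the topic slug plus a short content hash,
// so the name only changes when the text changes.
function audioFileName(topic: Topic): string {
  const hash = createHash("sha256")
    .update(topic.answerText)
    .digest("hex")
    .slice(0, 8);
  return `${topic.slug}-${hash}.mp3`;
}

// Runs during the build: generate only the files that don't exist yet.
async function generateAll(topics: Topic[]): Promise<void> {
  await mkdir(AUDIO_DIR, { recursive: true });
  for (const topic of topics) {
    const filePath = path.join(AUDIO_DIR, audioFileName(topic));
    if (existsSync(filePath)) continue; // already generated: skip the TTS call
    const audio = await synthesize(topic.answerText); // Google Cloud TTS helper sketched earlier
    await writeFile(filePath, audio);
  }
}
```

Deriving the name from a content hash is one simple way to make “regenerate only when the text changes” automatic.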
This makes the system very cost-effective:
- No runtime API calls
- No repeated TTS requests for the same content
- No surprise bills from traffic spikes
Once an audio file exists, it’s reused forever unless the content changes.
Refactoring without breaking things
Under the hood, the speech logic was refactored so that:
- Pre-generated audio is always checked first
- Existing speech synthesis behavior remains intact
- The rest of the app doesn’t need to know which path is used
This keeps backward compatibility and makes the feature additive rather than disruptive.
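As a rough sketch of what that playback path can look like in the browser (the URL convention and helper name are assumptions, not the project’s actual code):

```ts
// Prefer the pre-generated file; fall back to the Web Speech API if it's missing.
async function speak(audioUrl: string, text: string): Promise<void> {
  const head = await fetch(audioUrl, { method: "HEAD" }).catch(() => null);
  if (head?.ok) {
    await new Audio(audioUrl).play(); // pre-generated audio wins when present
    return;
  }
  // Fallback: the browser's built-in speech synthesis, unchanged from before.
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(text));
}
```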
Why this matters for interview practice
Interviews are stressful partly because they’re spoken, not written.
Hearing a clear, natural explanation — then responding out loud — helps train the right mental pathways.
This change isn’t flashy, but it improves the core experience in a way that aligns closely with why Interview Aloud exists in the first place.
More consistency. Less friction. Better practice.
That’s the goal.