The Complete AI Dubbing Pipeline

Six modular stages, each running as an independent microservice. Swap any component, scale what you need.
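As a rough illustration of the "swap any component" idea, the sketch below models each stage as a named callable in an ordered registry. The `Pipeline` class, the stage names, and the dict-shaped job state are assumptions made for this example. In a real deployment each callable would wrap a network call to the corresponding microservice.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical contract: a stage takes the shared job state and returns the updated state.
Stage = Callable[[dict], dict]


@dataclass
class Pipeline:
    stages: Dict[str, Stage] = field(default_factory=dict)
    order: List[str] = field(default_factory=list)

    def register(self, name: str, stage: Stage) -> None:
        """Add a stage, or replace an existing one while keeping its position."""
        if name not in self.stages:
            self.order.append(name)
        self.stages[name] = stage

    def run(self, job: dict) -> dict:
        """Run every stage in registration order, threading the job state through."""
        for name in self.order:
            job = self.stages[name](job)
        return job


pipeline = Pipeline()
pipeline.register("asr", lambda job: {**job, "text": "hello"})
pipeline.register("translation", lambda job: {**job, "translated": job["text"].upper()})

# Swap the translation stage without touching the others.
pipeline.register("translation", lambda job: {**job, "translated": job["text"][::-1]})

result = pipeline.run({"video": "input.mp4"})
print(result["translated"])
```

Because a replaced stage keeps its slot in the order, swapping one backend for another does not disturb the stages around it.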

ASR

WhisperX Speech Recognition

Industry-leading automatic speech recognition with word-level timestamps and speaker diarization. WhisperX identifies who is speaking and when, supporting over 50 source languages with automatic language detection. Word-level alignment ensures precise subtitle placement and accurate translation boundaries.

Word-level timestamps · Speaker diarization · 50+ languages · Auto language detection
01
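To show why word-level timestamps and speaker labels matter downstream, here is a minimal sketch that merges timed words into per-speaker segments, which then become the units for translation and dubbing. The input shape (dicts with `word`, `start`, `end`, and `speaker` keys) and the `max_gap` threshold are assumptions for illustration, not a guaranteed WhisperX output format.

```python
def group_by_speaker(words, max_gap=0.8):
    """Merge consecutive words into segments.

    A new segment starts when the speaker changes or when the silence
    between two words exceeds max_gap seconds (hypothetical default).
    """
    segments = []
    for w in words:
        last = segments[-1] if segments else None
        same_speaker = last is not None and last["speaker"] == w["speaker"]
        if same_speaker and w["start"] - last["end"] <= max_gap:
            # Extend the current segment with this word.
            last["text"] += " " + w["word"]
            last["end"] = w["end"]
        else:
            segments.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["word"],
            })
    return segments


words = [
    {"word": "Hello", "start": 0.0, "end": 0.4, "speaker": "SPEAKER_00"},
    {"word": "there", "start": 0.5, "end": 0.9, "speaker": "SPEAKER_00"},
    {"word": "Hi", "start": 1.2, "end": 1.5, "speaker": "SPEAKER_01"},
    {"word": "again", "start": 3.0, "end": 3.4, "speaker": "SPEAKER_01"},
]
segments = group_by_speaker(words)
for seg in segments:
    print(seg["speaker"], seg["start"], seg["end"], seg["text"])
```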
Translation

Neural Machine Translation

Context-preserving translation powered by M2M-100, a many-to-many multilingual model that translates directly between any pair of 100 languages without pivoting through English. For speed-optimized workflows, deep-translator provides cloud-based translation. Both backends maintain sentence structure, handle idioms, and respect the original meaning.

M2M-100 multilingual model · 100 languages · Context preservation · Cloud fallback option
02
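One way the two backends could be combined is a simple ordered fallback: try the first backend and move to the next on failure. The sketch below is an assumption about how such a fallback might look. The `Translator` signature and the two stand-in backends are hypothetical and do not call M2M-100 or deep-translator.

```python
from typing import Callable, Sequence

# Hypothetical backend signature: (text, source_lang, target_lang) -> translated text
Translator = Callable[[str, str, str], str]


def translate_with_fallback(text: str, src: str, tgt: str,
                            backends: Sequence[Translator]) -> str:
    """Return the result of the first backend that succeeds."""
    errors = []
    for backend in backends:
        try:
            return backend(text, src, tgt)
        except Exception as exc:  # collect the failure and try the next backend
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} translation backends failed: {errors}")


# Stand-ins for the real backends, for illustration only.
def local_backend(text, src, tgt):
    raise ConnectionError("model not loaded")  # simulate an unavailable backend


def cloud_backend(text, src, tgt):
    return f"[{src}->{tgt}] {text}"


translated = translate_with_fallback("Hello", "en", "fr", [local_backend, cloud_backend])
print(translated)
```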
TTS

Voice Synthesis & Cloning

Two synthesis engines to match your needs. Chatterbox voice cloning analyzes the original speaker's voice and generates speech that preserves their pitch, tone, and cadence in the target language. Edge TTS offers hundreds of natural-sounding neural voices for fast, reliable standard dubbing.

Chatterbox voice cloning · Edge TTS neural voices · Per-speaker voice profiles · Natural prosody
03
Separation

Audio Source Separation

MelBand RoFormer cleanly isolates vocals from background music, ambient sounds, and environmental noise. The separated background track is preserved and mixed back into the final output, maintaining the original audio atmosphere while replacing only the speech.

MelBand RoFormer · Clean vocal isolation · Background preservation · Ambient sound mixing
04
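The remix step described above can be sketched as a sample-wise sum of the dubbed speech and the preserved background track. This is a simplified illustration on plain Python lists of float samples in the range -1.0 to 1.0. The function name and the `background_gain` default are assumptions, and a production mixer would more likely operate on arrays with proper loudness handling.

```python
def mix_tracks(speech, background, background_gain=0.8):
    """Sum dubbed speech and background sample by sample.

    The shorter track is padded with silence, and the result is
    hard-clipped to [-1.0, 1.0] to avoid overflow.
    """
    length = max(len(speech), len(background))
    mixed = []
    for i in range(length):
        s = speech[i] if i < len(speech) else 0.0
        b = background[i] if i < len(background) else 0.0
        mixed.append(max(-1.0, min(1.0, s + background_gain * b)))
    return mixed


mixed = mix_tracks([0.5, 0.9], [0.5, 0.5, 0.5], background_gain=0.5)
print(mixed)
```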
Sync

Smart Duration Alignment

VAD-based duration analysis ensures dubbed speech matches the original timing. When synthesized audio is longer or shorter than the original, pyrubberband applies transparent time-stretching to fit perfectly. No awkward pauses, no overlapping segments.

VAD-based alignment · Pyrubberband stretching · No timing gaps · Seamless pacing
05
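The duration-matching step comes down to one ratio. The sketch below computes the stretch rate that would make a synthesized clip fit the original slot. The clamp bounds are hypothetical defaults chosen for illustration, since extreme stretching tends to sound unnatural. The resulting value is intended to be passed as the `rate` argument of `pyrubberband.time_stretch(y, sr, rate)`, where a rate above 1.0 shortens the audio.

```python
def stretch_rate(synth_duration, target_duration, min_rate=0.75, max_rate=1.5):
    """Return the speed factor that fits synthesized audio into the target slot.

    rate > 1.0 speeds the clip up (shorter), rate < 1.0 slows it down (longer).
    The result is clamped to [min_rate, max_rate] (hypothetical limits).
    """
    if synth_duration <= 0 or target_duration <= 0:
        raise ValueError("durations must be positive")
    rate = synth_duration / target_duration
    return max(min_rate, min(max_rate, rate))


# A 3.0 s dubbed line must fit a 2.0 s original segment: speed up by 1.5x.
rate = stretch_rate(3.0, 2.0)
print(rate)
```

When the clamp is hit, the clip still does not fit exactly, so a pipeline would need another strategy for those cases, such as shortening the translation or borrowing time from an adjacent pause.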
Subtitles

Netflix-Style Burned-In Subtitles

Professional subtitle rendering in multiple styles: Netflix standard, bold desktop, and mobile-optimized. Subtitles are burned directly into the video with pixel-perfect typography, proper line breaks, and configurable positioning. Perfect for social media and broadcast.

Multiple subtitle styles · Burned-in rendering · Pixel-perfect typography · Mobile optimized
06
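"Proper line breaks" usually means keeping each cue to a small number of short lines. The sketch below wraps subtitle text with the standard library and splits it into cues of at most two lines. The defaults of 42 characters per line and two lines per cue are assumptions in the spirit of common streaming style guides, and `layout_subtitle` is a hypothetical helper, not part of the product.

```python
import textwrap


def layout_subtitle(text, max_chars=42, max_lines=2):
    """Wrap text into lines of at most max_chars and group them into cues.

    Each returned string is one cue, with its lines joined by a newline.
    """
    lines = textwrap.wrap(text, width=max_chars)
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]


text = ("The separated background track is preserved and mixed back "
        "into the final output while only the speech is replaced")
cues = layout_subtitle(text)
for cue in cues:
    print(cue)
    print("---")
```

Timing for the split cues is out of scope here. A renderer would still need to divide the original cue duration among them, for example in proportion to their word-level timestamps.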