The Dubbing Rubric 2.0: A Human-Evaluated Benchmark of 14 AI Dubbing Providers
A comprehensive human evaluation of 14 leading AI dubbing providers across eight languages and seven key dimensions
by Ahmed Hanzala

A few weeks ago, we released The Dubbing Rubric: a framework and human evaluation process for benchmarking AI dubbing systems across languages.

Today, we're releasing an expanded version of the Dubbing Rubric: an evaluation covering 14 of the top dubbing providers, including HeyGen, ElevenLabs, and Kapwing, all backed by third-party native-speaker reviews across eight languages. Each provider was evaluated on accent and speech quality, multi-speaker handling, and translation accuracy.

How the Evaluation Was Conducted

We generated 28 dubbed outputs by running four core videos through eight languages, with each video designed to test both typical and edge-case scenarios. Some clips featured multi-speaker turns; others focused on emotional delivery, technical terminology, or complex timing.

Native speakers evaluated each provider's output in all eight languages and scored it across seven dimensions (see the sketch after this list):

  • Translation Accuracy
  • Grammar, Syntax & Terminology
  • Voice Cloning & Speaker Identity
  • Naturalness & Accent Matching
  • Timing, Sync & Speed Adjustments
  • Clarity & Noise Robustness
  • Multi-speaker Handling
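
To make the scoring concrete, here's a minimal sketch of how per-dimension rubric scores can roll up into a provider-level score. The dimension names mirror the list above; the 1–5 scale, equal weights, and simple averaging are illustrative assumptions, not our exact formula.

```python
from statistics import mean

# The seven rubric dimensions listed above.
DIMENSIONS = [
    "translation_accuracy",
    "grammar_syntax_terminology",
    "voice_cloning_speaker_identity",
    "naturalness_accent_matching",
    "timing_sync_speed_adjustments",
    "clarity_noise_robustness",
    "multi_speaker_handling",
]

def provider_score(ratings: list[dict[str, float]]) -> float:
    """Average each evaluator's 1-5 dimension scores,
    then average across evaluators (equal weights assumed)."""
    return mean(mean(r[d] for d in DIMENSIONS) for r in ratings)

# Example: two native-speaker ratings for one dubbed clip.
ratings = [
    {d: 4.0 for d in DIMENSIONS},
    {d: 3.0 for d in DIMENSIONS} | {"multi_speaker_handling": 2.0},
]
print(round(provider_score(ratings), 2))  # 3.43
```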

You can read more about our evaluation methodology here. Scoring was done blindly and independently. Evaluators were given access to both the original and dubbed versions, with no indication of which tool generated which output.

What We Found

  • Speaker diarization remains the biggest failure point. Most tools struggled to maintain distinct speaker identities in multi-speaker videos, often blending voices, misattributing lines, or botching turn transitions. Only HeyGen, Sieve, Dubly, and Panjaya handled speaker transitions reliably.
  • Sieve and HeyGen lead on audio clarity. Both produced clean, artifact-free speech that blended naturally with the original video. Many other tools suffered from muffled audio, inconsistent volume, or synthetic artifacts that made the dubbing feel unnatural.
  • Rask AI and others showed issues with pacing and sync. These tools frequently stretched or compressed speech unnaturally to match video timing, creating a robotic or rushed feel that breaks immersion, as shown below.
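
To see why forced pacing sounds off: fitting a dub into the original segment requires a speed factor equal to the ratio of the two durations. Here's a minimal sketch with made-up segment lengths; the 0.85–1.15 tolerance band is an assumption about what listeners tolerate, not a measured threshold.

```python
# (original_segment_seconds, translated_tts_seconds); numbers are invented.
segments = [(3.0, 3.2), (2.5, 3.6), (4.0, 2.4)]

for orig, dubbed in segments:
    speed = dubbed / orig  # factor the dub must be sped up (or slowed) by
    ok = 0.85 <= speed <= 1.15  # assumed comfort band around natural pace
    print(f"{speed:.2f}x -> {'fine' if ok else 'audibly rushed or dragged'}")
```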

Note: VEED's overall score matches Sieve's in this evaluation; that's because VEED uses Sieve's dubbing pipeline under the hood.

Provider comparison showing translation quality scores across different AI dubbing providers

Why Sieve Ranked Highest

Sieve's performance was driven by a few key architectural choices.

  • Prosody-Aware Timing and Sync Control. A duration-aware TTS system explicitly models phoneme timing, avoiding the unnatural speed fluctuations seen in other systems.
  • Adaptive Voice Synthesis. Post-synthesis signal processing helps maintain consistent audio quality, preventing robotic tone and reducing noise artifacts.
  • Multi-modal Diarization System. Most critically, Sieve uses a multimodal diarization system that combines visual and audio cues to accurately track speaker turns, a foundational advantage in multi-speaker scenes (a simplified sketch of this kind of fusion appears after this list).
  • Semantics-Preserving Translation Adaptation. These diarized segments are passed through context-aware translation models that adapt phrasing, tone, and grammar to each speaker's identity and the surrounding context.
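
To illustrate the fusion idea behind audio-visual diarization, here's a minimal sketch. This is not Sieve's actual pipeline (which isn't public); the per-window inputs, the 0.7/0.3 weighting, and the assign_speaker helper are all hypothetical.

```python
# Fuse audio speaker probabilities with lip-motion scores from face tracks.
# audio_probs[s]: P(speaker s | audio) for the current time window.
# lip_activity[s]: normalized mouth-motion score for speaker s's face track.

def assign_speaker(audio_probs: dict[str, float],
                   lip_activity: dict[str, float],
                   audio_weight: float = 0.7) -> str:  # 0.7 is an assumption
    fused = {
        s: audio_weight * p + (1 - audio_weight) * lip_activity.get(s, 0.0)
        for s, p in audio_probs.items()
    }
    return max(fused, key=fused.get)

# Two windows where audio alone is ambiguous but lip motion disambiguates.
windows = [
    ({"A": 0.52, "B": 0.48}, {"A": 0.1, "B": 0.9}),
    ({"A": 0.55, "B": 0.45}, {"A": 0.8, "B": 0.2}),
]
for audio, lips in windows:
    print(assign_speaker(audio, lips))  # B, then A
```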

The result is a more natural, coherent output, especially in complex scenes where many systems fall apart.

Conclusion

We're excited to continue refining this rubric and releasing updated evaluations over time. If you're a researcher or provider working on AI dubbing, we'd love to collaborate on more open, rigorous standards that the entire industry can rely on.

Explore the full results here and reach out if you'd like to get added to the evaluation or collaborate on further improving the rubric.