Artificial Analysis Launches AA-WER Streaming Benchmark for Real-Time Voice Agents

Artificial Analysis

May 31, 2026 · Updated Jun 20, 2026

Artificial Analysis released AA-WER Streaming, a benchmark evaluating real-time Speech-to-Text models on accuracy and latency. The framework identifies the best-performing models for voice agents, where fast transcription is critical for natural dialogue and downstream reasoning.

Artificial Analysis introduced AA-WER Streaming, a benchmark measuring streaming Speech-to-Text (STT) models on Word Error Rate (WER) and latency. Kiriill Butler, Member of Technical Staff, detailed the framework's focus on First Partial Transcription for speed and First Final Transcription for standalone accuracy. Unlike offline (batch) models, which process an entire file at once, streaming models transcribe audio continuously as it is fed in.
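
For readers unfamiliar with the metric, Word Error Rate is conventionally computed as the number of word-level substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of reference words. The Python sketch below illustrates only that textbook definition. The function name, the lowercasing, and the whitespace tokenization are simplifying assumptions, and the article does not describe how Artificial Analysis normalizes text or scores AA-WER Streaming.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: word-level edit distance divided by reference length.

    Assumption: text is lowercased and split on whitespace. Real benchmarks
    typically apply more careful normalization (punctuation, numerals, etc.).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("book a table for two", "book the table for two"))
```

Under this definition, a reported WER of 3.59% corresponds to roughly 3.6 word errors per 100 reference words.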

Highest Accuracy (Final): Cartesia Ink-2 (3.59% WER at 210ms)
Fastest Transcription (Final): Deepgram Flux (~20ms, 7.36% WER)
ElevenLabs Performance: Scribe v2 Realtime (3.64% WER at 140ms)
Price Range: $2.00 - $17.00 per 1,000 minutes
Test Data Volume: ~8 hours across 3 datasets

Voice agents require sub-second response times to maintain conversational flow. Fast transcripts preserve the latency budget for reasoning and tool calls. Accuracy matters too: transcription errors compound when they are passed to models like gpt-realtime-1.5 or Gemini 3.1 Flash Live, and can break downstream logic or execution.

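The analysis places models from Cartesia, ElevenLabs, and Deepgram on the accuracy-latency Pareto frontier, meaning no other benchmarked model is both more accurate and faster. Model choice comes down to the constraint that matters most: the ~20ms latency of Deepgram Flux (7.36% WER), the 3.64% WER of ElevenLabs Scribe v2 Realtime at 140ms, or the 3.59% WER of Cartesia Ink-2 at 210ms. Pricing ranges from $2 to $17 per 1,000 minutes.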
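
To make the Pareto idea concrete, the sketch below filters a list of results down to those that no other entry dominates, where "dominates" means at least as good on both WER and latency and strictly better on one. It uses the three results reported in this article plus one clearly hypothetical entry (`hypothetical-model`, with made-up numbers) that exists only to show a dominated point being dropped. The `Result` type and `pareto_frontier` function are illustrative names, not part of any Artificial Analysis tooling.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    wer_pct: float      # final-transcript WER, lower is better
    latency_ms: float   # final-transcript latency, lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.wer_pct <= b.wer_pct and a.latency_ms <= b.latency_ms
    strictly_better = a.wer_pct < b.wer_pct or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_frontier(results: list[Result]) -> list[Result]:
    """Keep only results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


results = [
    Result("Cartesia Ink-2", 3.59, 210),
    Result("ElevenLabs Scribe v2 Realtime", 3.64, 140),
    Result("Deepgram Flux", 7.36, 20),        # latency reported as ~20ms
    Result("hypothetical-model", 8.00, 300),  # made-up numbers, for illustration only
]

frontier = pareto_frontier(results)
for r in frontier:
    print(f"{r.name}: {r.wer_pct}% WER at {r.latency_ms}ms")
```

The three reported models all survive the filter because each one is better than the others on at least one axis, while the hypothetical entry is dropped because it is both slower and less accurate than every reported model.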

View the full update on artificialanalysis.ai

Artificial Analysis

@ArtificialAnlys · Jun 1

Overview of our recently launched AA-WER Streaming benchmark, measuring streaming Speech to Text models on accuracy and latency for voice agent use cases

Streaming Speech to Text (STT) powers real-time transcription in voice agents and live captioning, where models must balance accuracy against speed. Fast transcripts keep responses feeling natural and free up the response-time budget for reasoning and tool calls. Accuracy matters too, since errors can compound downstream.

Streaming STT models transcribe audio as it is fed in, sharing outputs continuously, unlike offline (batch) models that process the entire file at once and are typically slower.

Models from Cartesia, ElevenLabs, and Deepgram sit on the accuracy-latency Pareto frontier. Cartesia Ink-2 leads on final transcript accuracy at 3.59% WER (210ms), closely followed by ElevenLabs Scribe v2 Realtime at 3.64% WER (140ms). Deepgram Flux is fastest at ~20ms on final transcript latency (7.36% WER).

In this video, Kiriill Butler, Member of Technical Staff at Artificial Analysis, walks through the benchmark and key results.

