Announcing AA-WER Streaming, our new benchmark measuring streaming Speech to Text models on accuracy and latency for voice agent use cases. Pareto optimal models on this new benchmark include those from Cartesia, ElevenLabs, and Deepgram Streaming Speech to Text (STT) powers real-time transcription in voice agents and live captioning, where models must balance accuracy against speed. Fast transcripts are especially important for keeping responses feeling natural and leaves more of the response-time budget for reasoning and tool calls. Accuracy also matters since transcription errors compound in downstream reasoning and speech generation. Streaming STT models transcribe audio as it is fed in, sharing outputs continuously, unlike offline (batch) models that process the entire file at once and are typically slower. What we measure: AA-WER Streaming reports Word Error Rate and latency together, measured from the moment end of speech is detected, with a Pareto line of increasing accuracy as time to transcript received increases. For direct comparability to offline models on accuracy, we test these streaming models on the same ~8 hours of audio as our offline benchmark, AA-WER v2.0: AA-AgentTalk, Earnings22-Cleaned-AA, VoxPopuli-Cleaned-AA. We measure WER and latency as paired metrics at two points after Silero VAD-detected end of speech: First Final Transcription: WER is measured on the first final-denoted transcript returned after end of speech is detected. Latency is the time in seconds from end of speech to that final-denoted transcript. This is more useful for understanding performance as a standalone streaming transcription model, and for higher accuracy. First Partial Transcription: WER is measured on the first transcript-bearing event (partial or final) returned after end of speech is detected. Latency is the time in seconds from end of speech to that first transcript event. This is more useful for near instantaneous transcription for lower-accuracy tasks like responding to "yes" or "no" questions, or for speculative decoding. Key results: ➤ Highest accuracy on Final after End of Speech: @Cartesia Ink-2 (semantic endpoints) at 3.59% WER, 0.21s latency, followed by ElevenLabs Scribe v2 Realtime (3.64%, 0.14s) and Cartesia Ink-2 (external endpoints) (3.66%, 0.09s) ➤ Highest accuracy on First Partial after End of Speech: @ElevenLabs Scribe v2 Realtime at 3.65% WER, 0.13s latency, followed by Cartesia Ink-2 (external endpoints) (4.33%, 0.07s) and @AssemblyAI U3 Realtime Pro (4.46%, 0.47s) ➤ Fastest transcription: @DeepgramAI Flux leads both Final and Partial at 0.020s and 0.019s respectively (both 7.36% WER). On Final, it's followed by @soniox_ai Realtime and Deepgram Nova-3 Realtime (both 0.06s); on First Partial, it’s followed by @NVIDIA Nemotron 3 ASR 80ms (0.04s) and Soniox Realtime (0.05s) Charts below include a Pareto frontier of accuracy vs. speed, so you can shortlist the models that best fit your latency constraints while still achieving high accuracy. See below for further detail ⬇️
Artificial Analysis Launches AA-WER Streaming Benchmark for Real Time Voice Agents
Artificial Analysis introduced AA-WER Streaming, a benchmark measuring streaming Speech-to-Text (STT) models on Word Error Rate and latency. Unlike batch processing, these models transcribe audio continuously. The evaluation tracks First Partial Transcription for speed and First Final Transcription for standalone accuracy.
- Highest Accuracy (Final)
- Cartesia Ink-2 (3.59% WER)
- Fastest Transcription
- Deepgram Flux (0.020s)
- Price Range
- $2.00 - $17.00 per 1k minutes
- Test Data Volume
- ~8 hours across 3 datasets
- Top Price-Performance
- Cartesia Ink-2 and ElevenLabs Scribe v2
Voice agents require sub-second response times to maintain conversational flow. Fast transcripts preserve the latency budget for reasoning. High accuracy is vital, as errors compound when passed to models like gpt-realtime-1.5 or Gemini 3.1 Flash Live, which can break downstream logic or tool execution.
The analysis identifies Cartesia Ink-2, ElevenLabs Scribe v2, and Deepgram Flux as Pareto optimal. Developers can select models based on constraints, such as the 0.02s speed of Deepgram or the 3.6% accuracy of ElevenLabs. Pricing spans $2 to $17 per 1,000 minutes.
Artificial Analysis
@ArtificialAnlys
14retweets146likes
View on X


