HeadsUpAI

ElevenLabs Launches Speech Engine for Plug-and-Play Voice Agent Upgrades

ElevenLabs, an AI voice platform, launched Speech Engine to provide a unified audio orchestration layer for existing applications. The service bundles speech-to-text, text-to-speech, and turn-taking logic into a single WebSocket-based pipeline. It sits on top of any text-based agent, handling verbal interaction while leaving core logic untouched.
Pricing: 8 cents per minute
Language support (TTS): 70+ languages
Language support (STT): 90+ languages
Availability: ElevenAPI (Node.js and Python SDKs)
Core components: STT, TTS, turn detection, and interruption handling

Building reliable voice agents typically requires stitching together separate providers for transcription, synthesis, and turn-taking, which introduces latency. This release follows an industry shift toward unified voice stacks, mirroring the Together AI unified voice agent cloud launch and the OpenAI Realtime API launch. By managing the full voice lifecycle, Speech Engine aims to remove the need for custom orchestration code.

You can integrate the engine through the Node.js or Python SDK to convert chat workflows into voice-first experiences. The system supports over 70 languages for speech synthesis and over 90 for transcription, and provides pre-built UI components for web and mobile apps. Speech Engine is available now via the ElevenAPI at 8 cents per minute, with a path to the ElevenAgents platform.

ElevenLabs (@ElevenLabs) on X:

Introducing Speech Engine. Developers can now turn their existing chat agent into a full voice agent with one prompt. Speech Engine combines our leading speech, transcription, and voice orchestration models into a single pipeline - all custom built to work best together. https://t.co/WSWM7nppwd

Still wondering? A few quick answers below.

Speech Engine is a unified voice orchestration pipeline that allows developers to add a conversational audio layer to existing text-based agents. It combines transcription, speech synthesis, and turn-taking logic into a single integration, handling the complexities of real-time verbal communication while allowing the underlying model and business logic to remain unchanged.

ElevenAgents is a fully-managed platform where ElevenLabs provides the language model, knowledge base, and tools in an all-in-one solution. In contrast, Speech Engine is designed for developers who want to bring their own language model and maintain full control over their conversation logic and server architecture while using ElevenLabs for the voice layer.

Speech Engine is available through the ElevenAPI, with pricing that starts at 8 cents per minute and decreases as usage scales. That makes it a flexible option for developers who manage their own infrastructure and want to add high-fidelity voice capabilities to their applications without a total re-architecture.
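
As a rough budgeting aid, the starting rate can be turned into a quick estimate. This is a minimal sketch that assumes a flat 8 cents per minute with no volume discount and no rounding of partial minutes; the article does not state the discount tiers or the billing granularity, so treat the result as an upper bound at the base rate. The function name is illustrative only.

```python
from decimal import Decimal

# Starting rate quoted in the article: 8 cents per minute.
BASE_RATE_USD_PER_MINUTE = Decimal("0.08")


def estimate_cost_usd(minutes, rate=BASE_RATE_USD_PER_MINUTE):
    """Estimate the cost in USD at a flat per-minute rate.

    Assumption: no volume discount and no rounding up of partial minutes.
    """
    total_minutes = Decimal(str(minutes))
    if total_minutes < 0:
        raise ValueError("minutes must be non-negative")
    # Round the result to whole cents.
    return (total_minutes * rate).quantize(Decimal("0.01"))


print(estimate_cost_usd(1000))  # prints 80.00
```

At the base rate, 1,000 minutes of conversation comes to about 80 dollars; actual bills may be lower once usage-based pricing applies.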

Speech Engine supports any language model that produces text. The developer kit includes built-in stream extraction for major providers like OpenAI, Anthropic, and Google Gemini. For other models, developers can pass plain strings or an asynchronous iterable of string chunks to the engine to generate the corresponding human-like voice responses.
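
The two input shapes mentioned above, a plain string or an asynchronous iterable of string chunks, can be illustrated in plain Python. This is a sketch of the pattern only: `fake_llm_stream` and `collect_text` are hypothetical stand-ins written for this example and are not part of the ElevenLabs SDK, whose actual function names are not given in the article.

```python
import asyncio
from typing import AsyncIterable, AsyncIterator, Union


async def fake_llm_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming model: yields the reply in chunks."""
    reply = f"You said: {prompt}"
    for word in reply.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop between chunks
        yield word + " "


async def collect_text(response: Union[str, AsyncIterable[str]]) -> str:
    """Hypothetical consumer that accepts either input shape.

    A real voice layer would synthesize each chunk as it arrives; here the
    chunks are simply joined so the example stays self-contained.
    """
    if isinstance(response, str):
        return response
    parts = []
    async for chunk in response:
        parts.append(chunk)
    return "".join(parts).strip()


if __name__ == "__main__":
    print(asyncio.run(collect_text("A plain string reply.")))
    print(asyncio.run(collect_text(fake_llm_stream("hello there"))))
```

Streaming chunks rather than waiting for the full reply is what lets a voice layer begin speaking before the model has finished generating.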

Speech Engine includes dedicated models for interruption handling and turn detection. It monitors for user speech while the agent is talking and can instantly stop audio playback and loop back when a user cuts in. This removes the need for developers to write custom logic to manage overlapping speech or background noise.
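
To make the interruption behavior concrete, here is a minimal sketch of the kind of barge-in logic a developer would otherwise write by hand. It is an illustration of the general technique, not ElevenLabs's implementation: playback and speech detection are simulated with `asyncio`, and every name in it is hypothetical.

```python
import asyncio
from typing import List, Tuple


async def play_audio(chunks: List[str], played: List[str]) -> None:
    """Simulated playback: 'plays' one chunk every 50 ms."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)


async def speak_with_barge_in(
    chunks: List[str], user_spoke: asyncio.Event
) -> Tuple[List[str], bool]:
    """Play chunks until done, or stop as soon as the user starts speaking."""
    played: List[str] = []
    playback = asyncio.create_task(play_audio(chunks, played))
    interrupt = asyncio.create_task(user_spoke.wait())
    done, pending = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = interrupt in done and playback not in done
    for task in pending:
        task.cancel()  # stop playback (or stop waiting for an interruption)
    await asyncio.gather(*pending, return_exceptions=True)
    return played, interrupted


async def demo() -> Tuple[List[str], bool]:
    user_spoke = asyncio.Event()

    async def cut_in() -> None:
        await asyncio.sleep(0.12)  # simulated user interrupting mid-sentence
        user_spoke.set()

    cutter = asyncio.create_task(cut_in())
    chunks = ["Sure,", "your", "order", "ships", "on", "Friday."]
    result = await speak_with_barge_in(chunks, user_spoke)
    await cutter
    return result


if __name__ == "__main__":
    played, interrupted = asyncio.run(demo())
    print(f"interrupted={interrupted}, chunks played before stop: {played}")
```

In the demo the agent is cut off partway through its sentence. A production system would also need to decide what counts as speech versus background noise, which is the part the article says the dedicated models handle.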