HeadsUpAI

OpenRouter Launches Unified Audio Endpoints to Simplify Multi-Provider Voice Agents

OpenRouter, a unified API platform for accessing hundreds of language models, has launched two dedicated audio endpoints: one for text-to-speech (TTS) and one for speech-to-text (STT) transcription. Developers can process audio with the same API keys and billing infrastructure they already use for OpenRouter's text and video generation APIs.

TTS endpoint: /api/v1/audio/speech
STT endpoint: /api/v1/audio/transcriptions
Providers supported: OpenAI, Google, Mistral, and others
API compatibility: OpenAI Audio Speech API (for TTS)
Input format: Base64-encoded audio (for STT)
Availability: Live for all OpenRouter users

Building reliable voice agents currently means managing a separate SDK for each provider, such as Google and Groq. This update applies OpenRouter's aggregation model to the audio stack, offering a single interface that handles model routing and automatic fallbacks. It follows the platform's recent Audio Input leaderboard.

You can now integrate these endpoints and swap between audio providers without rewriting your integration code. The text-to-speech endpoint is compatible with the OpenAI Audio Speech API, while the transcription endpoint accepts base64-encoded audio. Both are live today, and audio usage is reported in the same consolidated view as the platform's standard metrics.
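Because the text-to-speech endpoint is described as compatible with the OpenAI Audio Speech API, a request can be sketched with the field names that API uses (`model`, `input`, `voice`). The snippet below is a minimal illustration rather than official OpenRouter sample code: the host `https://openrouter.ai`, the model ID, and the voice name are assumptions, and the request is only built, not sent.

```python
import json
import urllib.request

OPENROUTER_BASE = "https://openrouter.ai"  # assumed host; the article gives only the path
SPEECH_PATH = "/api/v1/audio/speech"       # path as stated in the article


def build_speech_request(api_key: str, text: str, model: str, voice: str) -> urllib.request.Request:
    """Build (but do not send) a TTS request in the OpenAI Audio Speech shape."""
    body = {
        "model": model,  # hypothetical model ID supplied by the caller
        "input": text,   # text to synthesize (OpenAI field name)
        "voice": voice,  # voice name (OpenAI field name)
    }
    return urllib.request.Request(
        OPENROUTER_BASE + SPEECH_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder key, model ID, and voice, for illustration only.
    req = build_speech_request("sk-example", "Hello from a voice agent.", "example/tts-model", "alloy")
    print(req.get_method(), req.full_url)
    # With a real key, sending would look roughly like:
    # with urllib.request.urlopen(req) as resp:
    #     audio_bytes = resp.read()
```

If the compatibility claim holds, switching providers should come down to changing the `model` string while the rest of the request stays the same.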

OpenRouter (@OpenRouter) on X:

1/ Audio is now first-class on OpenRouter. Two new endpoints live today:
📢 /api/v1/audio/speech — text-to-speech (TTS)
🎤 /api/v1/audio/transcriptions — speech-to-text (SST)
Same routing, billing, and keys you already use for text, image, and video. https://t.co/6uHeEUuDl5

Still wondering? A few quick answers below.

OpenRouter has introduced two dedicated endpoints for audio processing. The /api/v1/audio/speech endpoint handles text-to-speech, while the /api/v1/audio/transcriptions endpoint handles speech-to-text transcription. Developers can add audio capabilities to their applications through the same unified API structure they already use for text and video models.

These endpoints function as a unified interface for multiple audio providers. The text-to-speech endpoint is designed to be compatible with the OpenAI Audio Speech API, making it easier for developers to switch providers. The transcription endpoint accepts base64-encoded audio and returns a JSON response containing the transcribed text, which gives applications a standardized way to handle speech data.
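The base64-in, JSON-out flow can be sketched as follows. The article does not give the request or response schema, so the field names here (`model`, `audio`, `format` in the request, `text` in the response) are hypothetical placeholders chosen for illustration. Only the base64 encoding step and the JSON parsing reflect what the article actually states.

```python
import base64
import json


def build_transcription_body(audio_bytes: bytes, model: str, audio_format: str) -> str:
    """Return a JSON request body that carries the audio as base64 text."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    payload = {
        "model": model,          # hypothetical model ID
        "audio": encoded,        # hypothetical field name for the base64 audio
        "format": audio_format,  # hypothetical field name, e.g. "wav"
    }
    return json.dumps(payload)


def extract_transcript(response_json: str) -> str:
    """Read the transcript from a JSON response; assumes a top-level "text" key."""
    return json.loads(response_json)["text"]


if __name__ == "__main__":
    fake_audio = b"\x00\x01\x02\x03"  # stand-in bytes, not a real recording
    print(build_transcription_body(fake_audio, "example/stt-model", "wav"))
    print(extract_transcript('{"text": "hello world"}'))
```

Check the field names against OpenRouter's API reference before relying on them.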

OpenRouter provides access to a growing range of audio models from major providers, including OpenAI, Google, Mistral, and Groq. With a single API, developers can reach these models without integrating a separate SDK for each vendor. The same setup enables automatic fallbacks and simpler observability across the speech-to-text and text-to-speech services available on the platform.
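To make the idea of automatic fallback concrete, here is a conceptual sketch of a fallback chain: try each provider in order and return the first successful result. It is not OpenRouter's implementation, which runs on the platform side and is not described in the article. The provider names and stub functions are hypothetical.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def transcribe_with_fallback(
    audio: bytes,
    providers: Sequence[tuple[str, Callable[[bytes], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (provider name, transcript)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(audio)
        except Exception as exc:  # any provider error triggers the next attempt
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def flaky_provider(audio: bytes) -> str:
        raise TimeoutError("upstream timed out")  # simulated outage

    def backup_provider(audio: bytes) -> str:
        return "hello world"  # simulated successful transcript

    chain = [("primary", flaky_provider), ("backup", backup_provider)]
    print(transcribe_with_fallback(b"audio-bytes", chain))
```

An aggregator that handles this step server-side spares each client from writing and maintaining this loop, along with the per-provider SDK calls behind it.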

Audio services on OpenRouter are integrated into the platform's existing unified billing system. Developers use the same API keys, and receive the same single bill, for audio tasks as for text, image, and video generation. This consolidation simplifies cost tracking for teams building multimodal applications that call several types of AI models across different providers.

Text-to-speech, or TTS, uses the /api/v1/audio/speech endpoint to convert written text into spoken audio. Speech-to-text, or STT, uses the /api/v1/audio/transcriptions endpoint to convert audio recordings into written text. Both are treated as first-class features on the platform, meaning they receive the same routing, observability, and infrastructure support as standard language models.
