HeadsUpAI

OpenRouter adds Microsoft MAI models for high speed multimodal generation

OpenRouter has added three specialized models from the newly launched Microsoft AI (MAI) family. This release includes MAI-Image-2.5 for generation and editing, MAI-Transcribe-1.5 for speech-to-text, and MAI-Voice-2 for expressive text-to-speech. These models are built without distillation—training a smaller model to mimic a larger one—to ensure original performance.
Voice Language Support
15 languages
Transcription Performance
1 hour of audio in under 15 seconds
Voice Control Options
Excited, embarrassed, and whispered
Alternative Hosting
Fireworks and Baseten

This integration follows the recent MAI-Image-2.5 debut on the Arena image leaderboard. By hosting these on OpenRouter, Microsoft makes its frontier-grade media capabilities accessible outside of Azure. This move follows how OpenRouter previously added the xAI creative stack to provide a single endpoint for multimodal workflows.

Developers can now use the OpenRouter API for production-grade audio and visual tasks. MAI-Transcribe-1.5 transcribes an hour of audio in under 15 seconds, while MAI-Voice-2 offers emotional controls like whispering. These models are available now alongside hundreds of others via a unified API.

OpenRouter
OpenRouter
@OpenRouter
X

Three new @MicrosoftAI models now live on OpenRouter! Launching together: MAI-Image-2.5, MAI-Transcribe-1.5, and MAI-Voice-2. More on each below 🧵 https://t.co/KD5JlX6DT6

3retweets40likes
View on X

Still wondering? A few quick answers below.

OpenRouter has added three models from the Microsoft AI (MAI) family: MAI-Image-2.5 for image generation and editing, MAI-Transcribe-1.5 for high-speed speech-to-text, and MAI-Voice-2 for expressive text-to-speech. These models are designed to work together as a multimodal stack for developers building complex AI applications.

MAI-Transcribe-1.5 is designed for production-grade speed and accuracy. It can transcribe one hour of audio in less than 15 seconds. The model supports 43 languages and includes built-in support for domain-specific terminology, ranking #1 on the Artificial Analysis Accuracy x Speed Pareto frontier.

MAI-Voice-2 provides expressive text-to-speech capabilities across 15 languages. It allows for specific emotional control, enabling the generated voice to sound excited, embarrassed, or whispered. The model is also designed to maintain a stable speaker identity even when generating long-form content or adapting from a short voice sample.

No. Microsoft AI trains the MAI model family from the ground up on clean, licensed data. They do not use distillation, which is the process of training a smaller model to replicate the behavior of a larger teacher model. This approach is intended to ensure original performance and long-term self-sufficiency.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

Share this update