OpenRouter adds Microsoft MAI models for high speed multimodal generation

OpenRouter

Jun 2, 2026 · Updated Jun 20, 2026

OpenRouter has integrated Microsoft’s new in-house MAI-Image-2.5, MAI-Transcribe-1.5, and MAI-Voice-2 models into its unified API. These models provide a high-performance stack for image, speech, and voice tasks built entirely without third-party distillation.

OpenRouter has added three specialized models from the newly launched Microsoft AI (MAI) family. This release includes MAI-Image-2.5 for generation and editing, MAI-Transcribe-1.5 for speech-to-text, and MAI-Voice-2 for expressive text-to-speech. These models are built without distillation—training a smaller model to mimic a larger one—to ensure original performance.

Voice Language Support: 15 languages
Transcription Performance: 1 hour of audio in under 15 seconds
Voice Control Options: Excited, embarrassed, and whispered
Alternative Hosting: Fireworks and Baseten

This integration follows the recent MAI-Image-2.5 debut on the Arena image leaderboard. By hosting these on OpenRouter, Microsoft makes its frontier-grade media capabilities accessible outside of Azure. This move follows how OpenRouter previously added the xAI creative stack to provide a single endpoint for multimodal workflows.

Developers can now use the OpenRouter API for production-grade audio and visual tasks. MAI-Transcribe-1.5 transcribes an hour of audio in under 15 seconds, while MAI-Voice-2 offers emotional controls like whispering. These models are available now alongside hundreds of others via a unified API.

View the full update on microsoft.ai

OpenRouter

@OpenRouterJun 2

Three new @MicrosoftAI models now live on OpenRouter! Launching together: MAI-Image-2.5, MAI-Transcribe-1.5, and MAI-Voice-2. More on each below 🧵 https://t.co/KD5JlX6DT6

340

View on X

Still wondering? A few quick answers below.

OpenRouter has added three models from the Microsoft AI (MAI) family: MAI-Image-2.5 for image generation and editing, MAI-Transcribe-1.5 for high-speed speech-to-text, and MAI-Voice-2 for expressive text-to-speech. These models are designed to work together as a multimodal stack for developers building complex AI applications.

MAI-Transcribe-1.5 is designed for production-grade speed and accuracy. It can transcribe one hour of audio in less than 15 seconds. The model supports 43 languages and includes built-in support for domain-specific terminology, ranking #1 on the Artificial Analysis Accuracy x Speed Pareto frontier.

MAI-Voice-2 provides expressive text-to-speech capabilities across 15 languages. It allows for specific emotional control, enabling the generated voice to sound excited, embarrassed, or whispered. The model is also designed to maintain a stable speaker identity even when generating long-form content or adapting from a short voice sample.

No. Microsoft AI trains the MAI model family from the ground up on clean, licensed data. They do not use distillation, which is the process of training a smaller model to replicate the behavior of a larger teacher model. This approach is intended to ensure original performance and long-term self-sufficiency.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

See all AI news & updates from OpenRouter →

Keep reading

OpenRouter Adds xAI Creative Stack for Unified Video and Voice Generation

OpenRouter integrated xAI's multimodal suite, enabling developers to generate photorealistic images, short video clips, and natural speech through a single API. The update allows for complex creative workflows that combine xAI's generative models with existing reasoning and coding tools on the platform.

Microsoft MAI-Transcribe-1.5 delivers top tier accuracy at 276x real time speed

Artificial AnalysisJun 2

Microsoft MAI-Transcribe-1.5 delivers top tier accuracy at 276x real time speed

Microsoft has released MAI-Transcribe-1.5, a speech-to-text model that ranks third for accuracy while processing audio at 276x real-time speed. The model leads the accuracy-speed Pareto frontier, offering a high-performance alternative for high-volume enterprise audio workloads.

WarpMay 26

Warp Terminal Integrates OpenRouter to Provide Direct Access to Multi-Model Inference

Warp added native support for OpenRouter, allowing developers to connect their own API keys to access a wide range of frontier and open-weight models directly from the terminal. This integration enables users to bypass platform credits by routing agent requests through their personal OpenRouter accounts for better pricing and uptime.

What are the new Microsoft MAI models on OpenRouter?

How fast is MAI-Transcribe-1.5?

What emotional controls does MAI-Voice-2 support?

Are the Microsoft MAI models distilled from other LLMs?

Keep reading

OpenRouter Adds xAI Creative Stack for Unified Video and Voice Generation

OpenRouter Adds xAI Creative Stack for Unified Video and Voice Generation

Microsoft MAI-Transcribe-1.5 delivers top tier accuracy at 276x real time speed

Microsoft MAI-Transcribe-1.5 delivers top tier accuracy at 276x real time speed

Warp Terminal Integrates OpenRouter to Provide Direct Access to Multi-Model Inference

Warp Terminal Integrates OpenRouter to Provide Direct Access to Multi-Model Inference

Keep reading

OpenRouter Adds xAI Creative Stack for Unified Video and Voice Generation

OpenRouter Adds xAI Creative Stack for Unified Video and Voice Generation

Microsoft MAI-Transcribe-1.5 delivers top tier accuracy at 276x real time speed

Microsoft MAI-Transcribe-1.5 delivers top tier accuracy at 276x real time speed

Warp Terminal Integrates OpenRouter to Provide Direct Access to Multi-Model Inference

Warp Terminal Integrates OpenRouter to Provide Direct Access to Multi-Model Inference