Microsoft has released MAI-Transcribe-1.5: an exceptionally fast speech transcription model with a speed factor of ~276x that still achieves 2.4% on AA-WER (#3), leading the accuracy-speed Pareto frontier.

MAI-Transcribe-1.5 is Microsoft AI (MAI)’s latest speech transcription model. It ranks 3rd overall on the Artificial Analysis Word Error Rate (AA-WER) leaderboard, behind Alibaba’s Fun-Realtime-ASR-preview (1.7% WER) and ElevenLabs Scribe v2 (2.2% WER). The model stands out as the fastest speech-to-text (STT) model in the top 10 for accuracy, processing audio at ~276x real-time, which is more than double the speed of the second-fastest model in that group. The new model supports keyword biasing (improved recognition of rarer vocabulary such as names and medical terminology) and 43 languages, including English, French, Arabic, Japanese, and Chinese. See more details below ⬇️
Microsoft MAI-Transcribe-1.5 delivers top-tier accuracy at 276x real-time speed
- Model: MAI-Transcribe-1.5
- Word Error Rate: 2.4%
- Speed Factor: 276x real-time
- Price: $6 per 1,000 minutes
- Language Support: 43 languages
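For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below implements that textbook definition only. The function name is hypothetical, and the lowercase-and-split normalization is an assumption; the text normalization used for AA-WER is not described here and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: word-level edit distance / number of reference words.

    Assumption: simple lowercase + whitespace tokenization. Real benchmarks
    usually apply more careful normalization (punctuation, numerals, etc.).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

On this definition, a 2.4% WER corresponds to roughly 24 word-level errors per 1,000 reference words.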
The model leads the accuracy-speed Pareto frontier, offering a high-throughput alternative to Cohere Transcribe. Its release follows a run of high-ranking media models from Microsoft, including MAI-Image-2.5, which recently secured a top-three spot for image quality. For production environments, MAI-Transcribe-1.5 narrows the usual gap between transcription accuracy and processing speed.
Available for $6 per 1,000 minutes via Microsoft Foundry, the model supports 43 languages and keyword biasing for specialized terms. It recorded a 1.6% error rate on the VoxPopuli dataset. That combination makes it a viable choice for high-volume enterprise audio workloads that need both speed and accuracy.
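To put the headline figures in practical terms, here is a minimal back-of-the-envelope sketch that turns the quoted speed factor and price into per-job estimates. The function and constant names are hypothetical, and it assumes the ~276x benchmark figure and the $6 per 1,000 minutes list price apply uniformly; real throughput will also depend on network overhead, concurrency, and audio characteristics.

```python
SPEED_FACTOR = 276.0            # quoted speed: ~276x real-time
PRICE_PER_1000_MIN_USD = 6.0    # quoted price: $6 per 1,000 audio minutes


def estimate_job(audio_minutes: float) -> tuple[float, float]:
    """Return (estimated processing seconds, estimated cost in USD).

    Assumption: both quoted figures scale linearly with audio duration.
    """
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    processing_seconds = audio_minutes * 60.0 / SPEED_FACTOR
    cost_usd = audio_minutes * PRICE_PER_1000_MIN_USD / 1000.0
    return processing_seconds, cost_usd


if __name__ == "__main__":
    seconds, cost = estimate_job(60)  # one hour of audio
    print(f"{seconds:.1f} s of processing, ${cost:.2f}")
```

Under these assumptions, one hour of audio takes roughly 13 seconds to process and costs about $0.36, and 1,000 hours of audio costs about $360.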