On April 2, 2026, Microsoft announced the full commercial launch of three in-house AI models—MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2—developed by its AI Superintelligence division. The move is intended to reduce the company's long-standing dependence on its partner OpenAI and to demonstrate Microsoft's self-sufficiency in AI. MAI-Transcribe-1 is a speech transcription model with an average word error rate of 3.9% on the FLEURS benchmark, which the company presents as the most accurate result of any model worldwide. It supports 25 languages, transcribes 2.5 times faster than Azure Fast services, and is priced at $0.36 per hour of audio.
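For readers unfamiliar with the metric, word error rate (WER) is the standard measure behind figures like the 3.9% above: the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the reference length. A minimal sketch of the standard computation (the function name is ours, not part of any Microsoft API):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown box"))  # 1 error / 4 words = 0.25
```

A 3.9% WER thus means roughly 4 word-level errors per 100 reference words, averaged across the benchmark's test utterances.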
MAI-Voice-1, a speech synthesis model, can generate 60 seconds of audio in under one second on a single GPU and supports both single-narrator and multi-speaker output. It has already been integrated into features such as Copilot Daily and Podcasts. MAI-Image-2, an image generation model, entered the top three on the Arena.ai leaderboard at launch and generates images twice as fast as its predecessor. Integrated into Bing search and presentation tools, it renders lighting effects and skin tones realistically and improves on its predecessor's text rendering.
