Ant Group Open-Sources Ming-Flash-Omni 2.0, a Multimodal AI Model Rivalling Gemini 2.5 Pro
February 16, 2026
Chinese fintech giant Ant Group has open-sourced Ming-Flash-Omni 2.0, a multimodal AI model built on a 100-billion-parameter Mixture-of-Experts architecture. The model is the first to unify speech, sound-effect, and music generation in a single audio track, and outperforms Google's Gemini 2.5 Pro on several benchmark tests.
Ant Group Unleashes a New Open-Source Multimodal Powerhouse
Ant Group, the Chinese fintech giant behind Alipay, has released Ming-Flash-Omni 2.0, an open-source multimodal large model that introduces what the company calls the industry's first unified audio generation system. The model can simultaneously produce speech, ambient sound effects, and music within a single audio track, a capability that no other open-source model currently offers.
How It Works
Built on the Ling 2.0 architecture, Ming-Flash-Omni 2.0 uses a Mixture-of-Experts design with 100 billion total parameters but activates only 6.1 billion per token. This sparse architecture gives developers access to visual, speech, and generation capabilities within a single framework, dramatically reducing the engineering complexity of traditional multi-model setups.
The model achieves an inference frame rate of 3.1 Hz, enabling real-time, high-fidelity generation of minute-long audio content. Users can control voice parameters, including timbre, speaking speed, intonation, volume, emotion, and dialect, through simple natural-language instructions. It also supports zero-shot voice cloning and customisation.
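The sparse-activation idea behind this design can be illustrated with a minimal top-k expert-routing sketch. This is a generic Mixture-of-Experts toy example, not Ant Group's actual implementation; all names, sizes, and the routing scheme here are illustrative assumptions.

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Toy sparse Mixture-of-Experts layer: each token is routed to
    its top-k experts only, so most expert parameters stay inactive.

    x: (tokens, d) activations
    experts: list of (d, d) expert weight matrices
    gate_w: (d, n_experts) router weights
    """
    logits = x @ gate_w                            # router scores, (tokens, n_experts)
    topk = np.argsort(logits, axis=1)[:, -k:]      # top-k expert indices per token
    sel = np.take_along_axis(logits, topk, axis=1)
    # softmax over only the selected experts' scores
    weights = np.exp(sel - sel.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += weights[t, j] * (x[t] @ experts[topk[t, j]])
    return out, topk

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 16, 4
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=(tokens, d))
y, routed = moe_forward(x, experts, gate_w, k=2)
# Each token used only 2 of 16 experts; at Ming-Flash-Omni's scale the
# same principle keeps roughly 6.1B of 100B parameters active per token (~6%).
```

The payoff is that per-token compute scales with the active parameters (here 2 of 16 experts, roughly 6% at Ming's reported scale), not the total parameter count.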
Benchmark Performance
Ant Group claims the model outperforms Google's Gemini 2.5 Pro on certain benchmarks spanning visual-language understanding, speech-controlled generation, and image generation and editing. Specific results include a score of 0.90 on GenEval, surpassing all non-reinforcement-learning methods; 74.6 on MVBench for video comprehension; and record-setting scores across all twelve contextual speech recognition benchmarks.
Part of a Broader Push
The release of Ming-Flash-Omni 2.0 is part of a wider upgrade to Ant Group's open-source model family. Just days later, the company also released Ling 2.5 1T, a trillion-parameter language model, and Ring 2.5 1T, the world's first hybrid linear-architecture thinking model, which achieved gold-medal-tier results on International Mathematical Olympiad benchmarks. Together, these models represent Ant Group's accelerating push toward artificial general intelligence through open-source development.
The model weights and inference code are available now on Hugging Face and through Ant's Ling Studio platform.
Published February 16, 2026 at 1:47pm