Ant Group Open-Sources Ming-Flash-Omni 2.0, a Multimodal AI Model Rivalling Gemini 2.5 Pro
February 16, 2026
Chinese fintech giant Ant Group has open-sourced Ming-Flash-Omni 2.0, a multimodal AI model built on a 100-billion-parameter Mixture-of-Experts architecture. The model is the first to unify speech, sound-effect, and music generation in a single audio track, and outperforms Google's Gemini 2.5 Pro on several benchmark tests.
Ant Group Unleashes a New Open-Source Multimodal Powerhouse
Ant Group, the Chinese fintech giant behind Alipay, has released Ming-Flash-Omni 2.0, an open-source multimodal large model that introduces what the company calls the industry's first unified audio generation system. The model can simultaneously produce speech, ambient sound effects, and music within a single audio track, a capability that no other open-source model currently offers.
How It Works
Built on the Ling 2.0 architecture, Ming-Flash-Omni 2.0 uses a Mixture-of-Experts design with 100 billion total parameters but activates only 6.1 billion per token. This sparse architecture gives developers access to visual, speech, and generation capabilities within a single framework, dramatically reducing the engineering complexity of traditional multi-model setups.
The model achieves an inference frame rate of 3.1 Hz, enabling real-time, high-fidelity generation of minute-long audio content. Users can control voice parameters, including timbre, speaking speed, intonation, volume, emotion, and dialect, through simple natural-language instructions. It also supports zero-shot voice cloning and customisation.
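The sparse-activation idea behind this design can be illustrated with a minimal top-k expert-routing sketch. This is a generic Mixture-of-Experts toy example, not Ant Group's actual implementation; all names, sizes, and the routing scheme here are illustrative assumptions.

```python
import numpy as np

def moe_forward(x, experts, gate_w, k=2):
    """Toy sparse Mixture-of-Experts layer: each token is routed to
    its top-k experts only, so most expert parameters stay inactive.

    x: (tokens, d) activations
    experts: list of (d, d) expert weight matrices
    gate_w: (d, n_experts) router weights
    """
    logits = x @ gate_w                            # router scores, (tokens, n_experts)
    topk = np.argsort(logits, axis=1)[:, -k:]      # top-k expert indices per token
    sel = np.take_along_axis(logits, topk, axis=1)
    # softmax over only the selected experts' scores
    weights = np.exp(sel - sel.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            out[t] += weights[t, j] * (x[t] @ experts[topk[t, j]])
    return out, topk

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 16, 4
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=(tokens, d))
y, routed = moe_forward(x, experts, gate_w, k=2)
# Each token used only 2 of 16 experts; at Ming-Flash-Omni's scale the
# same principle keeps roughly 6.1B of 100B parameters active per token (~6%).
```

The payoff is that per-token compute scales with the active parameters (here 2 of 16 experts, roughly 6% at Ming's reported scale), not the total parameter count.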
Benchmark Performance
Ant Group claims the model outperforms Google's Gemini 2.5 Pro on certain benchmarks spanning visual-language understanding, speech-controlled generation, and image generation and editing. Specific results include a score of 0.90 on GenEval, surpassing all non-reinforcement-learning methods; 74.6 on MVBench for video comprehension; and record-setting scores across all twelve contextual speech recognition benchmarks.
Part of a Broader Push
The release of Ming-Flash-Omni 2.0 is part of a wider upgrade to Ant Group's open-source model family. Just days later, the company also released Ling 2.5 1T, a trillion-parameter language model, and Ring 2.5 1T, the world's first hybrid linear-architecture thinking model, which achieved gold-medal-tier results on International Mathematical Olympiad benchmarks. Together, these models represent Ant Group's accelerating push toward artificial general intelligence through open-source development.
The model weights and inference code are available now on Hugging Face and through Ant's Ling Studio platform.
Published February 16, 2026 at 1:47pm