According to the official Tongyi announcement, on January 8, 2026 the Alibaba Cloud Tongyi Qianwen (Qwen) team open-sourced two multimodal retrieval models: Qwen3-VL-Embedding and Qwen3-VL-Reranker.
Both models are built on Qwen3-VL and are designed for multimodal information retrieval and cross-modal understanding. They accept text, images, visual documents, and video within a single unified framework, and they report strong results across tasks such as image-text retrieval, video-text matching, visual question answering, and multimodal content clustering.
Qwen3-VL-Embedding uses a dual-tower architecture: queries and documents from different modalities are encoded independently into a shared vector space, so cross-modal similarity can be computed with simple vector operations and used for efficient large-scale retrieval.
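A minimal sketch of the dual-tower retrieval pattern is shown below. The announcement does not specify the model's API, so the `embed` function here is a hypothetical stand-in that returns a random unit vector; only the encode-independently-then-compare structure reflects what the dual-tower design implies.

```python
import numpy as np

def embed(item: dict) -> np.ndarray:
    """Hypothetical stand-in for Qwen3-VL-Embedding (real API not specified).
    A real call would encode text, an image, a document page, or a video clip
    into a fixed-size vector; here we return a random unit vector so the
    example runs end to end."""
    rng = np.random.default_rng(abs(hash(str(item))) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

# Dual-tower pattern: query and documents are encoded independently,
# so document vectors can be precomputed and indexed offline.
query_vec = embed({"text": "a cat sleeping on a laptop"})
doc_vecs = np.stack([
    embed({"image": "cat_on_laptop.jpg"}),
    embed({"image": "mountain_sunrise.png"}),
    embed({"video": "office_tour.mp4"}),
])

# On unit-normalized vectors, cosine similarity reduces to a dot product.
scores = doc_vecs @ query_vec
print(scores.argsort()[::-1])  # candidate documents, best match first
```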
Qwen3-VL-Reranker, by contrast, uses a single-tower architecture: the query and a candidate document are processed jointly, and cross-attention captures fine-grained semantic interactions between them to produce a precise relevance score.
In practice the two models are typically combined in a two-stage retrieval pipeline, sketched below: the embedding model retrieves a broad candidate set cheaply, and the reranker rescores that shortlist to improve final accuracy.
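The following sketch shows how such a pipeline could be wired together. It reuses the hypothetical `embed` stand-in from the previous example, and `rerank_score` is likewise a placeholder for the reranker's joint query-document scoring; both names and signatures are assumptions, not the released API.

```python
def rerank_score(query: dict, doc: dict) -> float:
    """Hypothetical stand-in for Qwen3-VL-Reranker (single-tower, cross-attention).
    A real call scores the query and one candidate jointly; here we return a
    deterministic pseudo-score so the pipeline runs."""
    return (abs(hash((str(query), str(doc)))) % 1000) / 1000.0

def two_stage_search(query, docs, doc_vecs, embed, top_k=20, final_k=5):
    # Stage 1: fast candidate retrieval with the embedding model
    # (dot product over precomputed, unit-normalized document vectors).
    query_vec = embed(query)
    candidate_idx = (doc_vecs @ query_vec).argsort()[::-1][:top_k]

    # Stage 2: precise rescoring of the shortlist with the reranker,
    # which examines each query-candidate pair jointly.
    rescored = sorted(
        candidate_idx,
        key=lambda i: rerank_score(query, docs[i]),
        reverse=True,
    )
    return rescored[:final_k]
```

The split keeps the expensive joint scoring off the full corpus: the dual-tower stage narrows millions of documents to a small shortlist, and the reranker only runs on that shortlist.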
Both models also inherit the multilingual capability of Qwen3-VL, supporting more than 30 languages, which makes them suitable for multilingual and global deployments.
