Meituan Unveils Its Native Multimodal Marvel: LongCat-Next
Author: Editor

On March 27, Meituan made a significant move in the AI landscape by releasing and fully open-sourcing its native multimodal large model, LongCat-Next, along with its pivotal component, the Discrete Native Resolution Visual Tokenizer (dNaViT).
This model breaks from the conventional language-dominated architecture of large models: it uniformly converts images, speech, and text into discrete tokens of the same form, creating a unified representation across modalities.
Built on the "Next Token Prediction" (NTP) paradigm, LongCat-Next lets vision and speech serve as native input modalities alongside text, opening new avenues for AI applications to process and understand multimodal data more effectively.
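To make the unified-token idea concrete, here is a minimal sketch of how text, image, and audio tokens might share one vocabulary and be trained under a single next-token objective. All names, offsets, and vocabulary sizes below are illustrative assumptions, not LongCat-Next's actual configuration:

```python
import torch
import torch.nn.functional as F

# Illustrative (assumed) vocabulary layout: text, image, and audio tokens
# share one ID space, offset into disjoint ranges. These sizes are
# hypothetical, not LongCat-Next's actual configuration.
TEXT_VOCAB = 32_000
IMAGE_CODES = 8_192   # e.g. entries in a visual codebook ("dictionary")
AUDIO_CODES = 4_096
IMAGE_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + IMAGE_CODES
TOTAL_VOCAB = TEXT_VOCAB + IMAGE_CODES + AUDIO_CODES

def unify(text_ids, image_ids, audio_ids):
    """Map per-modality discrete IDs into one shared token stream."""
    return torch.cat([
        text_ids,
        image_ids + IMAGE_OFFSET,  # shift visual codes past the text range
        audio_ids + AUDIO_OFFSET,  # shift audio codes past the visual range
    ])

# Toy sequence: 5 text tokens, 4 image codes, 3 audio codes.
seq = unify(torch.randint(0, TEXT_VOCAB, (5,)),
            torch.randint(0, IMAGE_CODES, (4,)),
            torch.randint(0, AUDIO_CODES, (3,)))

# Next-token prediction: an autoregressive model over TOTAL_VOCAB predicts
# position t+1 from positions <= t. Random logits stand in for the model.
logits = torch.randn(len(seq) - 1, TOTAL_VOCAB)
loss = F.cross_entropy(logits, seq[1:])
print(f"sequence length: {len(seq)}, NTP loss: {loss.item():.3f}")
```

The key point is that once every modality is an integer ID in the same space, one autoregressive objective covers all of them.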
LongCat-Next rests on three pivotal technological advances. First, the Discrete Native Autoregressive Architecture (DiNA) dismantles the barriers between modalities, allowing multimodal information to be processed in a more fluid and integrated way. Second, the Discrete Native Resolution Visual Tokenizer (dNaViT) constructs a visual "dictionary" that enhances the model's ability to interpret and generate visual content. Third, the Semantically Aligned Complete Encoder tackles the information loss inherent in discretization, ensuring that the model retains the nuances and subtleties of the original data.
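The article does not describe dNaViT's internals, but the "visual dictionary" framing matches standard vector quantization, in which each image-patch embedding is replaced by the index of its nearest codebook entry. The sketch below assumes that generic mechanism purely for illustration; the residual error it prints is the kind of discretization loss the Semantically Aligned Complete Encoder is said to address:

```python
import torch

def quantize(patches: torch.Tensor, codebook: torch.Tensor):
    """Nearest-neighbor lookup: replace each patch embedding with the
    index of its closest codebook entry, i.e. its discrete visual token."""
    dists = torch.cdist(patches, codebook)  # (num_patches, num_codes) Euclidean distances
    indices = dists.argmin(dim=1)           # discrete visual token IDs
    quantized = codebook[indices]           # reconstruction from the "dictionary"
    return indices, quantized

# Toy example: 16 patch embeddings quantized against a 512-entry codebook
# (both randomly initialized here; a real tokenizer learns them).
codebook = torch.randn(512, 64)
patches = torch.randn(16, 64)
ids, recon = quantize(patches, codebook)
print(ids.tolist()[:8])  # first few visual token IDs
# The gap between patches and their reconstructions is the discretization
# loss that a semantically aligned encoder would aim to minimize.
print(f"quantization error: {(patches - recon).pow(2).mean().item():.3f}")
```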
Across visual understanding, image generation, and audio processing, LongCat-Next delivers performance on par with, or even exceeding, that of specialized models. This achievement underscores Meituan's commitment to pushing the boundaries of AI technology and the model's potential to change how we interact with and understand multimodal data.