On January 28th, a paper describing 'Wujie·Emu3' (Emu3), a multimodal large model led by the Beijing Academy of Artificial Intelligence (BAAI), a prominent Chinese research institution, was published online in the international journal Nature; the print edition is scheduled for February 12th. The study is the first to demonstrate that unified multimodal learning across text, images, video, and other modalities can be achieved within a single autoregressive framework based on predicting the next token, and it uses this approach to train the native multimodal large model Emu3. Experiments show that Emu3's performance on both generation and perception tasks rivals that of specialized models, pointing to a practical path toward scalable, unified multimodal intelligent systems.
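
To make the "predict the next token" idea concrete, below is a minimal, hypothetical PyTorch sketch of unified next-token training: text tokens and discrete vision tokens share one vocabulary and one decoder, and the training signal is ordinary next-token cross-entropy over the whole interleaved sequence. All names and sizes here (TEXT_VOCAB, VISION_VOCAB, TinyAutoregressiveLM, the toy data) are illustrative assumptions, not Emu3's actual tokenizer, vocabulary, or architecture as described in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only; Emu3's real vocabulary,
# vision tokenizer, and model scale differ.
TEXT_VOCAB = 1000                   # placeholder text-token ids [0, TEXT_VOCAB)
VISION_VOCAB = 512                  # placeholder discrete vision-token ids
VOCAB = TEXT_VOCAB + VISION_VOCAB   # one shared vocabulary for all modalities


class TinyAutoregressiveLM(nn.Module):
    """Minimal decoder-only transformer: every input, regardless of
    modality, is just a token id, and the model predicts the next one."""

    def __init__(self, vocab=VOCAB, dim=128, heads=4, layers=2, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        b, t = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        # Causal mask so each position only attends to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(ids.device)
        return self.head(self.blocks(x, mask=mask))


# Toy training step: a "multimodal" sequence is just text tokens followed
# by vision tokens (offset into the shared vocabulary); the loss is plain
# next-token cross-entropy over the entire sequence.
model = TinyAutoregressiveLM()
text = torch.randint(0, TEXT_VOCAB, (2, 16))        # fake text tokens
vision = torch.randint(TEXT_VOCAB, VOCAB, (2, 32))  # fake image tokens
seq = torch.cat([text, vision], dim=1)

logits = model(seq[:, :-1])                         # predict token t+1 from prefix
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()
print(f"next-token loss: {loss.item():.3f}")
```

The design point this sketch illustrates is the one the paper makes: once images and video are converted to discrete tokens in the same vocabulary as text, a single autoregressive objective can serve both generation and perception, with no modality-specific heads or losses.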
