What Was Actually Revealed in DeepSeek's New Paper—Before It Vanished Overnight?
Author: Editor

Last night, Chen Xiaokang, a researcher specializing in multimodal technologies at DeepSeek, shared and then promptly removed a tweet on X about a new paper titled "Thinking with Visual Primitives." The paper proposes an approach to multimodal reasoning that aims to close the so-called "reference gap": the persistent difficulty AI models have in pinpointing exactly which visual objects they are referring to during reasoning. Its proposed solution is to reason with visual primitives, fundamental visual elements such as points and bounding boxes.
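The paper itself is no longer available, but the core idea of grounding references in primitives like points and boxes can be sketched. Everything below (the `Point` and `Box` types and the counting example) is an illustrative assumption about what such primitives might look like, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    """A visual primitive: a single (x, y) location in image coordinates (assumed representation)."""
    x: float
    y: float

@dataclass(frozen=True)
class Box:
    """A visual primitive: an axis-aligned bounding box (x0, y0, x1, y1) (assumed representation)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, p: Point) -> bool:
        """Check whether a point falls inside the box (boundaries inclusive)."""
        return self.x0 <= p.x <= self.x1 and self.y0 <= p.y <= self.y1

# A reasoning step might ground a phrase like "the objects on the left" as a
# list of boxes, so that counting becomes an operation on explicit primitives
# rather than on free-form text.
detections = [Box(10, 20, 50, 60), Box(70, 20, 110, 60), Box(300, 20, 340, 60)]
left_half = [b for b in detections if b.x1 < 200]
count = len(left_half)  # 2
```

The appeal of this style is that a reference such as "the second object from the left" resolves to a concrete primitive the model can manipulate, rather than an ambiguous span of text.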

The paper details the model architecture, including the visual compression techniques used to streamline data processing, the methods for constructing training datasets suited to this paradigm, and the post-training optimization strategies used to enhance performance. Experimental results reportedly show the approach outperforming leading models such as GPT-5.4 on tasks requiring precise counting and spatial reasoning, a notable advance for the field.