DeepSeek Unveils Technical Report on Multimodal Model: Outperforming GPT-5.4
Author: Editorial staff

On April 30, 2026, DeepSeek released on GitHub a technical report titled Thinking with Visual Primitives, offering an in-depth look at the technical foundations of its newly launched image recognition mode. The model is built on the DeepSeek V4-Flash architecture, a Mixture of Experts (MoE) design with 284 billion total parameters, of which 13 billion are activated during inference. It introduces a novel multimodal reasoning approach, extending the traditional linguistic chain of reasoning into a dual-track thinking process that integrates 'linguistic logic' with 'spatial coordinates'. Throughout the reasoning process, the model directly outputs bounding boxes or points, effectively 'pointing out' the objects of interest within an image, and then continually refers back to these visual anchors in subsequent judgments, significantly improving the accuracy of its visual reasoning.
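To make the dual-track idea concrete, here is a minimal sketch of how a reasoning trace that interleaves text with spatial anchors could be parsed so that later steps can look those anchors up again. The <box>/<point> tag format, the VisualAnchor class, and extract_anchors are illustrative assumptions for this sketch, not the interface described in the report.

```python
# Minimal sketch of "dual-track" reasoning output: the trace mixes prose
# with explicit spatial anchors that later steps can refer back to.
# Tag format and helper names are assumptions, not DeepSeek's actual API.
import re
from dataclasses import dataclass

@dataclass
class VisualAnchor:
    kind: str      # "box" or "point"
    coords: tuple  # (x1, y1, x2, y2) or (x, y), normalized to [0, 1]

# A single pattern so anchors are recovered in the order they were emitted.
ANCHOR_RE = re.compile(
    r"<(box)>\(([\d.]+),([\d.]+),([\d.]+),([\d.]+)\)</box>"
    r"|<(point)>\(([\d.]+),([\d.]+)\)</point>"
)

def extract_anchors(trace: str) -> list[VisualAnchor]:
    """Collect every spatial reference emitted during reasoning."""
    anchors = []
    for m in ANCHOR_RE.finditer(trace):
        if m.group(1):  # box alternative matched: four coordinates
            anchors.append(VisualAnchor("box", tuple(map(float, m.groups()[1:5]))))
        else:           # point alternative matched: two coordinates
            anchors.append(VisualAnchor("point", tuple(map(float, m.groups()[6:8]))))
    return anchors

trace = (
    "The traffic light is here <box>(0.62,0.10,0.71,0.28)</box>; "
    "the lit lamp sits at <point>(0.665,0.14)</point>, the topmost "
    "position, so the light is red."
)
for anchor in extract_anchors(trace):
    print(anchor.kind, anchor.coords)
```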

Through a visual compression strategy, the model retains only around 90 visual entries in the KV cache, even for high-resolution images, achieving more than 7,000-fold compression and making the thinking process markedly more 'lightweight'. Across a series of challenging visual question-answering tasks, the model outperformed competitors including GPT-5.4, Claude-Sonnet-4.6, Gemini-3-Flash, and Qwen3-VL.
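As an illustration of what such a fixed budget could look like in practice, the sketch below pools a large grid of image patch embeddings down to roughly 90 entries before they would enter the KV cache. The pooling scheme, tensor shapes, and 90-token budget here are assumptions for the sketch, not the mechanism the report describes, and the report's 7,000-fold figure presumably measures compression against a much larger uncompressed representation.

```python
# Sketch: hold the visual side of the KV cache to a fixed token budget by
# average-pooling the patch-embedding grid. Shapes and budget are assumed.
import torch
import torch.nn.functional as F

def compress_visual_tokens(patch_grid: torch.Tensor, budget: int = 90) -> torch.Tensor:
    """patch_grid: (H, W, D) patch embeddings -> (~budget, D) cache entries."""
    H, W, D = patch_grid.shape
    # Pick an output grid whose area is close to the token budget (9x10 = 90).
    side = max(1, int(budget ** 0.5))
    out_h, out_w = side, max(1, budget // side)
    pooled = F.adaptive_avg_pool2d(
        patch_grid.permute(2, 0, 1).unsqueeze(0),  # -> (1, D, H, W)
        (out_h, out_w),
    )
    return pooled.squeeze(0).permute(1, 2, 0).reshape(out_h * out_w, D)

# A 3840x2160 image with 14x14 patches yields a grid of about 154x274
# (~42k patch tokens); pooling keeps only ~90 entries in the cache.
grid = torch.randn(154, 274, 1024)
print(compress_visual_tokens(grid).shape)  # torch.Size([90, 1024])
```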