On January 27th, the DeepSeek team released a research paper titled "DeepSeek-OCR 2: Visual Causal Flow" and open-sourced the DeepSeek-OCR 2 model. The model introduces the DeepEncoder V2 approach, which lets the encoder dynamically reorder image regions according to their semantic content, bringing the visual encoding process closer to the way humans read a page.

On the implementation side, the model is built on Qwen2-0.5B. By introducing learnable "causal flow queries," it rearranges visual information during the encoding stage, forming a two-tier cascading one-dimensional causal reasoning framework. This design allows DeepSeek-OCR 2 to outperform conventional vision-language models, particularly on images with complex layouts, and to achieve a more causally coherent form of visual understanding.

On the OmniDocBench v1.5 benchmark, the model scored 91.09%, a 3.73% improvement over its predecessor. It also keeps computational cost under control, limiting the number of visual tokens to between 256 and 1120. In production settings, when processing online user logs and PDF pretraining data, it reduced duplicate rates by 2.08% and 0.81%, respectively, underscoring its practical readiness.
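The causal-flow-query mechanism described above can be pictured as a bank of learnable queries that cross-attend over the encoder's patch tokens and emit a fixed-length, semantically ordered one-dimensional sequence for the causal decoder. The sketch below illustrates that idea in PyTorch; the module name, dimensions, and single-stage design are illustrative assumptions, not the released DeepSeek-OCR 2 implementation.

```python
# Hypothetical sketch of "causal flow queries": learnable query vectors
# cross-attend over unordered image-patch embeddings and produce a
# fixed-length, reordered 1D token sequence for a causal decoder.
# All names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn


class CausalFlowReorder(nn.Module):
    def __init__(self, d_model: int = 896, num_queries: int = 256, n_heads: int = 8):
        super().__init__()
        # Learnable queries; each one "pulls" the patch content that should
        # occupy its position in the reordered causal sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, d_model) in raw raster order.
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        reordered, _ = self.cross_attn(q, patch_tokens, patch_tokens)
        # Output: (batch, num_queries, d_model); positions now follow a
        # learned reading order rather than the original raster order.
        return self.norm(reordered + q)


if __name__ == "__main__":
    encoder_out = torch.randn(2, 1120, 896)    # e.g. up to 1120 visual tokens per page
    reorder = CausalFlowReorder(num_queries=256)
    causal_seq = reorder(encoder_out)          # (2, 256, 896), passed to the decoder
    print(causal_seq.shape)
```

In this reading, the number of queries is what bounds the visual token budget (consistent with the 256 to 1120 range cited above), and the second tier of the cascade would apply the same reordering principle again on top of the first stage's output before autoregressive decoding.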
