On April 28, 2026, SenseTime officially open-sourced the SenseNova U1 series, a family of native models that unify understanding and generation. Built on SenseTime's proprietary NEO-Unify architecture, the model dispenses with the conventional visual encoder (VE) and variational autoencoder (VAE), instead integrating multimodal comprehension, reasoning, and generation natively within a single model. The release marks a shift in multimodal AI from modality integration toward native unification.
The SenseNova U1 series models linguistic and visual information directly as a single representation, letting the two modalities work together efficiently: understanding and generation reinforce each other, while the semantic richness of language and the pixel-level fidelity of visuals are both preserved. In logical reasoning and spatial intelligence, the model shows a strong grasp of the layouts and relationships of the physical world.
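To make the idea of "native unification" concrete, here is a minimal, purely illustrative PyTorch sketch. It is not SenseTime's code and not the NEO-Unify architecture: it only shows a single backbone consuming text tokens and raw image patches in one sequence, with no separate visual encoder or VAE, and predicting both text tokens and pixel patches from the shared hidden states. All names, sizes, and the simplified (non-causal) toy backbone are assumptions made for brevity.

```python
# Toy sketch of a "native" unified model: one backbone, one mixed sequence,
# no separate visual encoder (VE) or VAE. Purely illustrative; a real model
# would use causal/autoregressive attention and far larger dimensions.
import torch
import torch.nn as nn

class ToyUnifiedDecoder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, patch_dim=16 * 16 * 3, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> shared space
        self.patch_embed = nn.Linear(patch_dim, d_model)       # raw pixel patches -> shared space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.text_head = nn.Linear(d_model, vocab_size)        # token prediction (understanding)
        self.pixel_head = nn.Linear(d_model, patch_dim)        # patch prediction (generation)

    def forward(self, text_ids, image_patches):
        # Both modalities are embedded into the same space and processed as one sequence.
        seq = torch.cat([self.text_embed(text_ids), self.patch_embed(image_patches)], dim=1)
        hidden = self.backbone(seq)
        n_text = text_ids.shape[1]
        return self.text_head(hidden[:, :n_text]), self.pixel_head(hidden[:, n_text:])

# One forward pass over a mixed text+image sequence.
model = ToyUnifiedDecoder()
text_ids = torch.randint(0, 32000, (1, 10))        # 10 text tokens
patches = torch.rand(1, 64, 16 * 16 * 3)           # 64 raw 16x16 RGB patches
text_logits, pixel_preds = model(text_ids, patches)
print(text_logits.shape, pixel_preds.shape)         # (1, 10, 32000) (1, 64, 768)
```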
Looking ahead, the SenseNova U1 series could serve as the cognitive core of robots, providing environmental perception, logical reasoning, and precise task execution within a single closed-loop model. The open-sourced version, SenseNova U1 Lite, is a lightweight variant of the series; it comprises two models of different specifications and is now available on GitHub and Hugging Face for developers and researchers worldwide.
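Since the article does not give the exact repository names, the snippet below uses a hypothetical Hugging Face model ID ("SenseTime/SenseNova-U1-Lite") and only illustrates the standard transformers loading pattern; it is not a confirmed SenseNova U1 Lite API.

```python
# Hypothetical loading example; the model ID is a placeholder, not a confirmed repository name.
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "SenseTime/SenseNova-U1-Lite"  # placeholder; check the official release for the real ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, device_map="auto")

# A simple text prompt; a unified model would also accept images through the processor.
inputs = processor(text="Describe the scene in the image.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```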
