
Ornith-1.0 deep-reinforce.com
DeepReinforce today released Ornith-1.0, a family of open-source coding models built around a mechanism most RL-trained agents avoid: the model itself writes the training harness that guides its own improvement. Released free under the MIT license on Hugging Face, the lineup spans four sizes — a 9B dense model for edge devices, a 31B dense variant, a 35B mixture-of-experts build, and a 397B MoE flagship — all available for immediate download. The MIT license removes the legal friction that has made some permissive-but-restricted open-weight releases difficult for commercial teams to adopt.
For developers evaluating open-source alternatives to closed-source coding agents like Claude Code or OpenAI Codex, the release matters for two reasons: the benchmark scores are competitive with frontier-class models, and the architecture that produced them is structurally different from anything currently available under an open license.
Most agentic coding systems pair a model with a fixed, human-designed harness — a static framework that specifies how the model generates candidate solutions and evaluates them. The harness is engineered once, validated against a target task class, and then held constant while the model trains against it. The problem is brittleness: a harness optimized for one category of coding task degrades on others, and updating it requires manual engineering effort.
Ornith-1.0 replaces that static layer with a learnable one. DeepReinforce describes the mechanism as autonomous scaffolding: each reinforcement learning step runs in two stages. Conditioned on a task and the scaffold the model used most recently, the model first proposes a refined scaffold for that specific task. It then generates a solution conditioned on the updated scaffold. Reward from the resulting solution is propagated back to both stages — so the model learns not only to produce better code but to author better orchestration logic. Over many iterations, both the harness and the outputs improve together.
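The two-stage step described above can be sketched as a toy rollout. This is a minimal illustration of the data flow only, not DeepReinforce's implementation: `two_stage_rollout`, `StageRecord`, and the three `toy_*` functions are hypothetical names, and crediting one identical scalar reward to both stages is an assumption about how "propagated back to both stages" might work.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StageRecord:
    """One decision made by the policy, plus the reward credited to it."""
    stage: str  # "scaffold" or "solution"
    prompt: str
    output: str
    reward: float = 0.0


def two_stage_rollout(
    task: str,
    previous_scaffold: str,
    propose_scaffold: Callable[[str, str], str],
    solve: Callable[[str, str], str],
    verify: Callable[[str, str], float],
) -> List[StageRecord]:
    """Refine the scaffold, solve under it, then credit both stages."""
    # Stage 1: refine the scaffold, conditioned on the task and the last scaffold.
    scaffold = propose_scaffold(task, previous_scaffold)
    # Stage 2: generate a solution conditioned on the refined scaffold.
    solution = solve(task, scaffold)
    # The verifier sits outside the policy and scores only the final solution.
    reward = verify(task, solution)
    # Assumption: the same scalar reward is attached to both stages.
    return [
        StageRecord("scaffold", f"{task}\n{previous_scaffold}", scaffold, reward),
        StageRecord("solution", f"{task}\n{scaffold}", solution, reward),
    ]


# Hypothetical stand-ins for the policy and the verifier, for illustration only.
def toy_propose(task: str, prev: str) -> str:
    return prev + " -> run tests before submitting"


def toy_solve(task: str, scaffold: str) -> str:
    return "patch-with-tests" if "run tests" in scaffold else "patch"


def toy_verify(task: str, solution: str) -> float:
    return 1.0 if solution == "patch-with-tests" else 0.0


records = two_stage_rollout(
    "fix failing date parser", "read the issue", toy_propose, toy_solve, toy_verify
)
for record in records:
    print(record.stage, record.reward)
```

In a real training loop, each record would feed a policy-gradient update; the sketch stops at the point where both stages carry a reward.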
The practical implication is that per-task-category strategies emerge automatically without hand-engineering. If a class of tasks rewards a particular memory-management pattern, the scaffold evolves toward that pattern. If a different class rewards aggressive test generation, the scaffold shifts accordingly.
Giving a model influence over its own training signal introduces a well-known problem in reinforcement learning: a sufficiently capable model can satisfy the evaluator without actually solving the task, a failure mode known as reward hacking. The classic form in coding environments is reading withheld test files and hardcoding the expected output, or copying an oracle solution present in the evaluation environment.
DeepReinforce reports addressing this through a three-layer defense. The outermost layer is a fixed trust boundary: the environment, tool surface, and test isolation are immutable, outside the model's reach. The model can evolve its memory, error-handling, and orchestration logic — but it cannot touch the verification infrastructure. The second layer is a deterministic monitor that flags any attempt to read withheld paths, modify verification scripts, or invoke out-of-bounds tools, assigning such trajectories zero reward and excluding them from the training update. The third layer is a frozen LLM judge that acts as a veto on top of the verifier, catching intent-level gaming that occurs entirely within the permitted tool surface but does not constitute genuine problem-solving.
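One way to picture how the monitor and the judge gate the reward is the sketch below. Every name and value in it is hypothetical (the path prefixes, the tool names, `gated_reward`), and the treatment of a judge veto as zero reward that still enters the update is an assumption; the release only states that monitor-flagged trajectories receive zero reward and are excluded from training.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical policy values, chosen only for illustration.
WITHHELD_PREFIXES = ("tests/hidden/", "oracle/")
VERIFIER_PREFIXES = ("verify/",)
ALLOWED_TOOLS = {"read_file", "write_file", "run_tests"}


@dataclass
class ToolCall:
    tool: str
    path: str = ""


@dataclass
class Trajectory:
    calls: List[ToolCall] = field(default_factory=list)
    verifier_passed: bool = False


def monitor_flags(traj: Trajectory) -> bool:
    """Deterministic monitor: True if any call leaves the permitted surface."""
    for call in traj.calls:
        if call.tool not in ALLOWED_TOOLS:
            return True  # out-of-bounds tool
        if call.path.startswith(WITHHELD_PREFIXES):
            return True  # touched a withheld path
        if call.tool == "write_file" and call.path.startswith(VERIFIER_PREFIXES):
            return True  # tried to modify a verification script
    return False


def gated_reward(
    traj: Trajectory, judge_vetoes: Callable[[Trajectory], bool]
) -> Tuple[float, bool]:
    """Return (reward, include_in_update) after applying both checks."""
    if monitor_flags(traj):
        return 0.0, False  # zero reward and dropped from the update
    if not traj.verifier_passed:
        return 0.0, True   # honest failure still counts as training signal
    if judge_vetoes(traj):
        return 0.0, True   # assumption: a veto zeroes the reward but is kept
    return 1.0, True


clean = Trajectory([ToolCall("read_file", "src/app.py"), ToolCall("run_tests")], True)
cheat = Trajectory([ToolCall("read_file", "tests/hidden/expected.json")], True)
print(gated_reward(clean, lambda t: False))
print(gated_reward(cheat, lambda t: False))
```

The first layer, the fixed trust boundary, does not appear as code here because it is a property of the environment rather than a check the training loop performs.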
Whether this three-layer stack fully solves the reward-hacking problem in production deployments — rather than in the controlled evaluation harness where it was designed — remains to be tested independently. The theoretical literature on reward hacking notes that the risk grows with agent capability and is not fully eliminable by architectural means alone.
Ornith-1.0 ships in four configurations. The 9B dense model targets edge and resource-constrained deployments: at approximately 19GB in BF16 precision, its weights fit on a single 80GB GPU with room to spare. The 31B dense model serves as the general-purpose mid-tier build. The 35B mixture-of-experts variant uses sparse activation, so only a subset of its total parameters processes each token, enabling stronger performance at lower inference cost. The 397B MoE flagship is designed for maximum capability, with FP8 and GGUF quantized builds available alongside the base weights for teams that need faster local serving.
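The size figures can be sanity-checked with back-of-the-envelope arithmetic: BF16 stores each parameter in 2 bytes and FP8 in 1 byte. The sketch below uses the nominal parameter counts from the model names, which is an assumption, and counts weights only; KV cache, activations, and runtime overhead come on top. The helper name `weight_footprint_gb` is hypothetical.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_footprint_gb(params_billions: float, dtype: str) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    # (params_billions * 1e9 params) * bytes / 1e9 bytes per GB
    return params_billions * BYTES_PER_PARAM[dtype]


# Nominal sizes taken from the model names, not exact parameter counts.
for name, size in [("9B", 9), ("31B", 31), ("35B", 35), ("397B", 397)]:
    print(f"{name}: ~{weight_footprint_gb(size, 'bf16'):.0f} GB in BF16")

print(f"397B: ~{weight_footprint_gb(397, 'fp8'):.0f} GB in FP8")
```

A nominal 9B parameters works out to 18GB in BF16, in line with the roughly 19GB reported; the small difference would be consistent with an exact parameter count slightly above 9B, which the release coverage does not state.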
All four models are post-trained on top of pretrained foundations from the Gemma 4 and Qwen 3.5 families. Each variant is a reasoning model by default: the assistant turn opens with a chain-of-thought block before the final answer, and the serving infrastructure returns the reasoning in a separate field so downstream systems can inspect it independently of the solution output.
The stated target use cases are practical agentic coding tasks: multi-file refactors, bug localization, and test-driven patches — workloads that require sustained tool use across an extended session rather than single-turn code completion.
DeepReinforce reports the 397B flagship at 82.4 on SWE-Bench Verified, the most widely cited evaluation for software engineering agents, and 77.5 on Terminal-Bench 2.1, a newer benchmark focused on autonomous terminal-native coding tasks. According to DeepReinforce's own comparison table, those numbers put Ornith-1.0-397B ahead of Claude Opus 4.7 (which scores 80.8 on SWE-Bench Verified and 70.3 on Terminal-Bench 2.1) and above open-source models of comparable total parameter count, including MiniMax M3 and DeepSeek-V4-Pro.
The comparison table also shows where the flagship does not lead: Claude Opus 4.8 posts 87.6 on SWE-Bench Verified and 85 on Terminal-Bench 2.1, both above Ornith. GLM-5.2, a larger 744B model, scores 81.0 on Terminal-Bench 2.1, also above Ornith's 77.5. The "state-of-the-art" designation in DeepReinforce's release applies specifically to open-source models of comparable parameter count, not to the overall leaderboard.
SWE-Bench Verified scores warrant context independent of which model posts them. Independent research published in March 2026 found that approximately 19.78% of patches labeled as resolved by the top-30 leaderboard agents are semantically incorrect when evaluated against strengthened test suites, with the top-ranked agent's score dropping from 78.80% to 62.20% as a result. A separate analysis documented solution leakage in more than 32% of the benchmark's instances — cases where the expected fix is described in the issue report itself, allowing a model to copy rather than generate the solution. These are structural limitations of the benchmark rather than of any specific model, but they mean that SWE-Bench Verified scores should not be read as a direct measure of real-world software engineering capability.
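The size of that correction is easier to judge as both an absolute and a relative drop. The short calculation below uses only the two figures quoted above for the top-ranked agent; the helper name `score_drop` is hypothetical.

```python
from typing import Tuple


def score_drop(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


# Figures reported for the top-ranked agent in the March 2026 study.
absolute, relative = score_drop(78.80, 62.20)
print(f"absolute drop: {absolute:.2f} points")
print(f"relative drop: {relative:.2f}% of the original score")
```

The top agent loses 16.6 points, or about 21% of its original score, which is close to but not the same quantity as the 19.78% average share of incorrect patches across the top-30 agents.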
The harder SWE-Bench Pro benchmark, which uses contamination-resistant tasks from proprietary codebases, provides a different signal: the best frontier models score roughly 23% there. In DeepReinforce's own table, Ornith-1.0-397B posts 62.2 on SWE-Bench Pro, behind both Claude Opus 4.8 at 69.2 and Claude Opus 4.7 at 64.3, though within about two points of the latter.
The smaller models carry the efficiency argument. The 35B MoE scores 64.2 on Terminal-Bench 2.1, above Qwen 3.5-397B's 53.5 — a model with over ten times the total parameter count. The 9B dense model reaches 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified, exceeding Gemma 4-31B on both benchmarks despite its smaller size.
One persistent criticism of RL-trained coding agents is harness-specific brittleness: a model that trains well under one harness configuration may degrade when the harness changes or when the task distribution shifts. The autonomous scaffolding mechanism is designed to address that by making the harness adaptive rather than fixed. Whether it does so in practice outside the evaluation suite DeepReinforce used is an open question — one that the research community and production users will need to answer independently.
Initial community response to the release has been mixed, leaning negative. Of the sentiment-coded social media responses tracked on launch day, 55.4% were negative. The most common criticisms were concerns about benchmark inflation and skepticism about whether SWE-Bench Verified scores reflect real-world software engineering capability. One ML researcher with a substantial following noted that engagement patterns appeared unusual and recommended hands-on evaluation by practitioners over acceptance of the headline numbers. DeepReinforce's prior published work includes CUDA-L1 and the IterX optimization loop for code agents, both open-source efforts, which gives the team a track record in the field, though Ornith-1.0 is a significantly more ambitious release.
All model weights, evaluation details, and deployment recipes are available on Hugging Face.
SWE-Bench Verified was created to evaluate whether language models can resolve real GitHub issues. It measures whether a model-generated patch passes the repository's existing test suite, not whether the patch is semantically correct, well-structured, or generalizable to code patterns the model has not seen. Independent analyses have documented that the benchmark over-represents bug fixes in a small set of Python repositories, that more than 30% of its instances contain solution leakage, and that scores can improve substantially through harness engineering without any change in the underlying model's reasoning capability. On the harder, contamination-resistant SWE-Bench Pro, top frontier models score around 23%, roughly one-quarter of their SWE-Bench Verified scores. That gap is the most useful calibration tool when reading any model's Verified score.
What is Ornith-1.0, and who made it?
Ornith-1.0 is a family of four open-source coding-focused language models released on June 25, 2026 by DeepReinforce, an AI research team with prior open-source work including CUDA-L1 and the IterX code-agent optimization loop. The models are built on top of pretrained Gemma 4 and Qwen 3.5 foundations and are available under the MIT license on Hugging Face. Their defining feature is a self-improving training framework: rather than training against a fixed, human-designed harness, each model learns to generate the scaffold that guides its own solution search during reinforcement learning.
How reliable are SWE-Bench Verified scores as a measure of coding agent quality?
They are the most widely cited metric but carry documented limitations. Research published in March 2026 found that roughly one in five patches labeled as resolved by top agents are semantically incorrect when tested against stronger test suites. More than 30% of the benchmark's problems have solutions described in the issue text itself, which lets a model copy the fix rather than work it out. On the harder SWE-Bench Pro benchmark, which uses contamination-resistant tasks from proprietary codebases, even the best models score around 23%. SWE-Bench Verified scores are a useful comparative signal but should not be read as a direct proxy for production software engineering capability.
Can the 9B model run on a single consumer GPU?
The 9B dense model weighs approximately 19GB in BF16 precision and is designed to run on a single 80GB GPU, which is data-center-class hardware rather than a consumer card. GGUF quantized builds are also available, which would allow deployment on hardware with less VRAM, though the specific requirements at different quantization levels should be confirmed against the model's Hugging Face card before deployment.
What is the self-improving training approach, and how does it differ from standard RL for coding?
Standard RL coding agents train against a fixed harness — a human-engineered framework that defines how the model searches for and evaluates solutions. Ornith-1.0 replaces that with a two-stage loop: the model first proposes a refined scaffold for the current task, then generates a solution conditioned on that scaffold. Reward is propagated to both stages, so over many iterations the model learns to improve both its harness design and its solution quality simultaneously. The risk this introduces — that the model might learn to satisfy the training evaluator without genuinely solving tasks — is addressed by a three-layer defense combining a fixed trust boundary, a deterministic monitor, and a frozen LLM judge.
