Visual State Cards in AI Agent Skills More Than Double Small Model Success Rates on Real Desktop Tasks
Source: TechTimes


Reliable desktop automation has long come with a hidden tax: the more complex the software environment, the larger (and more expensive) the model required to run it. A new research paper published May 13, 2026, argues that this assumption is wrong, and that the missing ingredient is not a bigger model but a better format for packaging procedural knowledge.

The paper, by Kangning Zhang and ten co-authors at Shanghai Jiao Tong University, Xiaohongshu Inc., and Southeast University, introduces MMSkills, a framework that extends the text-only Agent Skills standard to bundle visual evidence alongside written procedure. Across four real-environment benchmarks, the headline result is that the small Qwen3-VL-8B-Instruct model more than doubled its task success rate on the OSWorld desktop benchmark, from 10.78 percent to 25.40 percent. On the Minecraft visual-agent benchmark, the same model's success rate climbed from 23.28 percent to 38.79 percent.

Those gains are commercially important because they suggest that a well-built skill package can partially substitute for raw model scale — making reliable automation meaningfully cheaper.

The backdrop is the rapid standardization of how AI agents acquire reusable capabilities. Anthropic published the Agent Skills specification as an open standard on December 18, 2025. Within 48 hours, Microsoft integrated it into VS Code and OpenAI added support to the Codex CLI; GitHub and Cursor followed immediately. By March 2026, 32 tools — including Google's Gemini CLI, JetBrains Junie, AWS Kiro, and Block's Goose — supported the same format, making Agent Skills the closest thing the agent ecosystem has to a universal plug-in standard.

A "skill" in this system is deliberately simple: a folder containing a SKILL.md file with plain-English instructions, plus optional scripts and reference documents. The Agent Skills specification uses progressive disclosure: at startup, an agent loads only a skill's name and a short description — a few dozen tokens — and pulls in full instructions only when a relevant task arises. The design keeps the agent's limited working memory lean while giving it awareness of hundreds of capabilities.
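The two-stage loading described above can be sketched in a few lines of Python. This is a minimal illustration, not the specification's reference loader: it assumes each SKILL.md opens with a frontmatter block delimited by `---` lines that carries `name` and `description` fields, and the names `SkillStub`, `discover_skills`, and `load_instructions` are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SkillStub:
    """What stays in context at startup: name and description only."""
    name: str
    description: str
    path: Path  # location of the full SKILL.md, read later on demand


def read_frontmatter(skill_md: Path) -> dict:
    """Parse simple `key: value` lines between the leading `---` markers.

    Assumption: flat frontmatter; a real loader would use a YAML parser.
    """
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    meta = {}
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def discover_skills(root: Path) -> list[SkillStub]:
    """Stage 1: scan skill folders and keep only the short metadata."""
    stubs = []
    for skill_md in sorted(root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md)
        stubs.append(SkillStub(
            name=meta.get("name", skill_md.parent.name),
            description=meta.get("description", ""),
            path=skill_md,
        ))
    return stubs


def load_instructions(stub: SkillStub) -> str:
    """Stage 2: read the full instruction body only when the skill is needed."""
    text = stub.path.read_text(encoding="utf-8")
    parts = text.split("---", 2)
    if text.startswith("---") and len(parts) == 3:
        return parts[2].strip()
    return text
```

An agent built this way would call `discover_skills` once at startup and `load_instructions` only for the skill whose description matches the task at hand, which is what keeps the startup cost to a few dozen tokens per skill.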

Text-Only Skills Break Down for Agents That Must Look at a Screen

Every mainstream skill format until now shares one underlying assumption: that reusable procedural knowledge can be expressed entirely in text or code. For an agent that drafts a document or queries a database through a clean application programming interface, that assumption holds.

It breaks down for visual agents — AI systems that operate desktop software, web browsers, or games by reading a live screen. The MMSkills paper states the problem precisely: a desktop agent may know the correct operation but fail to recognize that a dialog box is not yet ready. A game agent may know its goal but still need a visual cue to distinguish progress from completion. Text alone cannot reliably carry that situational awareness.

What a Multimodal Skill Package Actually Contains

MMSkills addresses this by extending the standard skill package with two additions: runtime state cards and multi-view keyframes.

A state card is not an image caption. It is a structured decision node linked to a specific point in a procedure. For each relevant state, a state card records four things: when to apply the skill, an explicit guard condition for when not to apply it, which visual cues on screen to inspect, and how to verify that the action worked. The "when-not-to-use" guard is a first-class element of the package, a deliberate design choice that the paper shows sharply reduces wrong triggers in ways that text procedures alone cannot achieve.
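To make the four fields concrete, here is a hedged Python sketch of what a state card might look like as a data structure. The schema is an assumption for illustration only: the field names, the idea of representing screen evidence as string cue labels, and the helpers `should_apply` and `is_verified` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateCard:
    """Hypothetical state card; cue labels stand in for visual judgments."""
    step: str
    use_when: frozenset   # cues that must all be visible to apply the skill
    not_when: frozenset   # guard: any one of these cues blocks the skill
    inspect: tuple        # screen regions the agent should look at
    verify: frozenset     # cues that confirm the action worked


def should_apply(card: StateCard, visible: set) -> bool:
    """Apply only if every trigger cue is present and no guard cue is."""
    return card.use_when <= visible and not (card.not_when & visible)


def is_verified(card: StateCard, visible: set) -> bool:
    """Check the post-action screen against the card's success cues."""
    return card.verify <= visible


# Illustrative card for a "save file" step (all labels are made up).
save_card = StateCard(
    step="confirm save dialog",
    use_when=frozenset({"save_dialog_open"}),
    not_when=frozenset({"loading_spinner"}),
    inspect=("filename field", "Save button"),
    verify=frozenset({"title_bar_shows_filename"}),
)
```

In this sketch, a dialog that is open but still showing a spinner fails `should_apply`, which mirrors the kind of "not yet ready" state that a text-only procedure cannot express.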

The keyframes are reference screenshots bundled in up to four views: a full frame for spatial context, a tight crop of the relevant interface element, and optional before-and-after pairs showing what a state transition should look like. The paper is explicit that these images are reference evidence, not coordinates to copy. The agent is expected to use them to interpret the live screen in front of it, not to mimic them pixel by pixel.

To avoid flooding the agent's context with images, MMSkills uses a mechanism called branch loading. When the agent's current state suggests a skill might apply, it opens a temporary side branch that selects the relevant state cards and keyframes, aligns them against the live screen, and returns a compact structured summary — an applicability judgment, a local subgoal, and a step-by-step plan — back to the main reasoning thread. The main agent then acts on that summary while keeping its own context lean. This is the visual extension of the same progressive disclosure principle that Anthropic built into the text-only Agent Skills format.
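The control flow of branch loading can be sketched as follows. This is a simplified stand-in under stated assumptions: in the paper the alignment is performed against the live screen using images, whereas here set matching over hypothetical cue labels takes its place, and the names `SkillState`, `BranchSummary`, `run_branch`, and `main_step` are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SkillState:
    """One state of a skill, with the evidence used only inside the branch."""
    name: str
    required_cues: set
    subgoal: str
    plan: list
    keyframes: list = field(default_factory=list)  # image paths, branch-only


@dataclass
class BranchSummary:
    """The compact result handed back to the main reasoning thread."""
    applicable: bool
    subgoal: str = ""
    plan: list = field(default_factory=list)


def run_branch(states: list, live_cues: set) -> BranchSummary:
    """Side branch: match states to the live screen, return text only."""
    matches = [s for s in states if s.required_cues <= live_cues]
    if not matches:
        return BranchSummary(applicable=False)
    # Prefer the most specific state, i.e. the one requiring the most cues.
    best = max(matches, key=lambda s: len(s.required_cues))
    return BranchSummary(True, best.subgoal, list(best.plan))


def main_step(context: list, states: list, live_cues: set) -> list:
    """Main thread: add only the summary to context, never the keyframes."""
    summary = run_branch(states, live_cues)
    if not summary.applicable:
        return context
    note = f"subgoal: {summary.subgoal}; plan: {' -> '.join(summary.plan)}"
    return context + [note]
```

The design point the sketch tries to capture is that keyframes are consulted inside `run_branch` and discarded, so the main context grows by one short text line per skill use rather than by a set of screenshots.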

Cheaper Models Gain the Most — the Key Commercial Finding

The benchmark results, run across OSWorld (real Ubuntu desktop tasks), macOSWorld (macOS tasks), VAB-Minecraft from VisualAgentBench, and Super Mario Bros from LMGame-Bench, consistently favor MMSkills over both no-skill and text-only-skill conditions. Frontier models also benefit: Gemini 3.1 Pro's OSWorld success rate rose from 44.08 percent to 50.11 percent, and Gemini 3 Flash's rose from 36.65 percent to 47.97 percent.

The most striking finding, however, is the gap closed by smaller models. Qwen3-VL-8B-Instruct, an open model running at a fraction of the inference cost of frontier systems, more than doubled its OSWorld success rate. The behavioral data reveals why: MMSkills reduced that model's rate of exact repeated actions from 21.8 percent of steps to 6.2 percent, and increased the frequency with which it correctly recognized a task as complete. The agents were not just scoring higher — they were behaving more like agents that understand what they are doing.
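The claim that smaller models gain the most can be checked with simple arithmetic on the success rates quoted in this article. The short Python sketch below computes the absolute gain in percentage points and the relative ratio for each reported pair; the helper name `gains` and the dictionary labels are illustrative, and the figures are only those stated above.

```python
# (baseline %, with MMSkills %) as quoted in this article
results = {
    "Qwen3-VL-8B-Instruct / OSWorld": (10.78, 25.40),
    "Qwen3-VL-8B-Instruct / VAB-Minecraft": (23.28, 38.79),
    "Gemini 3.1 Pro / OSWorld": (44.08, 50.11),
    "Gemini 3 Flash / OSWorld": (36.65, 47.97),
}


def gains(before: float, after: float) -> tuple:
    """Return (absolute gain in percentage points, relative ratio)."""
    return after - before, after / before


for label, (before, after) in results.items():
    points, ratio = gains(before, after)
    print(f"{label}: +{points:.2f} points, {ratio:.2f}x")
```

On OSWorld the 8B model's ratio works out to about 2.36x, against roughly 1.31x for Gemini 3 Flash and 1.14x for Gemini 3.1 Pro, which is the pattern the paragraph above describes.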

MMSkills also shortened average trajectory length. Text-only skills sometimes added overhead without grounding, but full MMSkills reduced the average number of steps in every tested setting, with the largest reductions for the smaller models.

Enterprise Automation, Skill Factories, and the Supply Chain Risk

The broader commercial trajectory is a move away from "scale the model" and toward "wrap reliable, inspectable procedure around the model." Several enterprise applications follow directly.

For desktop automation, the implications are immediate. Anthropic has said Agent Skills are already in production use for legal, finance, accounting, and data-science workflows. Multimodal skills extend this reach to legacy desktop and browser applications with no clean application programming interface: the long tail of enterprise software that traditional robotic process automation handles inconsistently. The ability to visually confirm whether an upload has finished or a spinner is still running is what separates a working deployment from a brittle demo.

The MMSkills paper also introduces a trajectory-to-skill Generator: an automated pipeline that converts ordinary interaction recordings — screen captures of a human completing a task — into audited multimodal skill packages. That capability points directly toward a services business: enterprises upload their workflows and receive governed, reusable skill packages that can be deployed across any Agent Skills-compatible tool.

The EU AI Act, most of whose provisions apply from August 2, 2026, introduces transparency and governance obligations for AI systems across the European Union, including requirements for audit trails in high-risk deployments. For regulated industries deploying autonomous agents in those settings, skill provenance, version control, and activity logs therefore become part of compliance rather than optional features.

That governance need is driven partly by a documented supply chain risk in the broader Agent Skills ecosystem. A Snyk study published in February 2026 confirmed 76 malicious payloads in a sample of 3,984 skills from the ClawHub community marketplace, with techniques ranging from credential theft and data exfiltration to obfuscated remote code execution. A separate large-scale study covering 98,380 skills across two registries confirmed 157 behaviorally verified malicious packages. Anthropic itself advises that skills should be loaded only from trusted sources. The pattern mirrors the early history of open-source package registries such as npm and PyPI, where rapid ecosystem growth preceded systematic security tooling. Skill verification, private tiered marketplaces, and provenance filtering are becoming a category in their own right.

An Academic Collaboration With Chinese Institutional Ties

The MMSkills framework is open-source and publicly available on GitHub. The research was conducted partly during an internship at Xiaohongshu Inc., a Chinese social media company whose consumer platform stores user data in China and is subject to Chinese law requiring companies to provide data to authorities on request. This institutional affiliation is a factual matter of record; the research paper itself processes no consumer data, and the MMSkills code is published under an open license with no operational connection to Xiaohongshu's data infrastructure.

What Comes Next

The paper identifies three limitations: the quality of MMSkills depends on the coverage of the source trajectories used to generate them; skill generation and visual grounding can introduce errors; and branch loading adds inference cost. Extending the framework to safety-critical or embodied settings will require stronger verification and the ability to repair skills online when they fail.

The deeper shift the paper represents is a maturing of how the field thinks about reliable AI action. The Agent Skills standardization moment in late 2025 moved the industry's center of gravity from what an AI knows to what it can reliably do. MMSkills makes the argument that for any agent that must look before it acts, "reliably" requires letting it see the instructions — not just read them. For enterprise buyers who have been told that frontier model scale is the price of admission for visual automation, that argument now has benchmark numbers behind it.