The number the AI research paper left out
A new AI system from University of Maryland researchers can master six video games in a row, and in the process gets slightly worse at math and general knowledge. The framework, called COSPLAY, posts a 25.1 percent benchmark improvement that sounds like a clear win. The fine print lives in the GitHub repository, not the paper's abstract.
The catch: standard reasoning benchmarks that the researchers tested alongside the game environments show measurable drops. On MMLU-Pro, a common general-knowledge test, COSPLAY scores 61.15 percent against a baseline of 61.99 percent. On Math-500, a math problem set, it drops to 44.60 percent from 46.40 percent. The paper's abstract mentions the game gains. It does not mention these regressions.
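For scale, the size of each regression can be worked out directly from the four scores quoted above. The snippet below is only a back-of-the-envelope calculation; the dictionary layout and the `regression` helper are illustrative names, not anything taken from the COSPLAY repository.

```python
# Scores as quoted in this article (percent accuracy).
scores = {
    "MMLU-Pro": {"base": 61.99, "cosplay": 61.15},
    "Math-500": {"base": 46.40, "cosplay": 44.60},
}

def regression(base, tuned):
    """Return (absolute drop in percentage points, relative drop in percent)."""
    drop = base - tuned
    return round(drop, 2), round(100 * drop / base, 2)

# Hypothetical helper output: one (points, relative) pair per benchmark.
deltas = {name: regression(s["base"], s["cosplay"]) for name, s in scores.items()}

for name, (points, relative) in deltas.items():
    print(f"{name}: down {points:.2f} points ({relative:.2f}% relative to base)")
```

That works out to a drop of 0.84 percentage points on MMLU-Pro and 1.80 points on Math-500: small next to the headline game gains, but in the wrong direction on both tests.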
COSPLAY (Co-Evolving LLM Decision and Skill Bank) is a two-agent system described in an arXiv preprint submitted April 22, 2026 by eight researchers at UMD's GAMMA lab. One agent makes decisions inside a task environment, while a separate skill-bank agent watches those decisions and extracts reusable tricks into a shared library. The two agents train together in a loop, each getting better at its role over time. That mutual improvement is the architectural novelty. Unlike hierarchical skill-banking systems such as SkillRL or PolicyBank, where a manager dispatches tasks down a chain, COSPLAY's agents teach each other without a central boss.
In comments to type0, lead author Xiyang Wu said the surprise was that the two agents changed the data distribution for each other. As the decision agent became more skill-aware, its rollouts became cleaner and easier for the skill-bank agent to segment; as the skill bank improved, the decision agent got better temporal abstractions and stopped constantly re-planning. Wu described the loop as "less like 'learn a skill library, then use it' and more like a mutual curriculum."
The approach uses Group Relative Policy Optimization (GRPO), a reinforcement learning technique that samples a group of responses to the same prompt and rewards each one according to how it scores relative to the group average, plus five small LoRA adapters (lightweight fine-tuning modules that let the system specialize without retraining the whole model). Training requires eight A100-80GB GPUs, split four and four between the decision and skill-bank agents, as documented in the project's GitHub repository. Built on Qwen3-8B, an 8-billion-parameter model, COSPLAY shows an average 25.1 percent improvement over four frontier large language model baselines on single-player games (2048, Candy Crush, Tetris, and Super Mario Bros). On multi-player social reasoning games (Avalon and Diplomacy), the paper says the system remains competitive but does not provide a comparable quantitative figure.
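The "group relative" part of GRPO can be shown in a few lines. What follows is a minimal sketch of the commonly published formulation, in which each reward is normalized against the mean and standard deviation of its own group. It is not COSPLAY's training code, and whether the authors use exactly this normalization is an assumption; the function name and the sample rewards are made up for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout against the other rollouts in the same group.

    Assumed formulation: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four rollouts sampled from the same game state.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} -> advantage={a:+.3f}")
```

Rollouts that beat their group's average get a positive advantage and are reinforced; those below it get a negative one. No separate value model is needed to supply the baseline, which is part of the technique's appeal for training on limited hardware.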
Catastrophic forgetting, the tendency of a model to lose capabilities in one area while improving in another, is a known hazard of reinforcement learning training. COSPLAY's authors acknowledge it in the paper. What the paper's abstract omits is that MMLU-Pro and Math-500 scores fell below the base model's during training, not just relative to some hypothetical ceiling but in absolute terms against the same unmodified Qwen3-8B. Those numbers are listed in the GitHub repository, not in the paper's main results section.
The work, from the GAMMA lab team led by Xiyang Wu and Dinesh Manocha, has not been peer-reviewed, and no independent lab has reproduced the results. The compute requirements, eight A100-80GB GPUs (roughly $30,000 to $40,000 in cloud hardware at current rates), put independent replication beyond most academic groups. Wu told type0 that the eight-GPU setup is partly a speed choice, not a hard conceptual requirement: it can be lowered if researchers accept slower training and inference. But he also cautioned that simply extracting the text skill library and attaching it to a vanilla base model may not work; a few adaptation steps may still be needed.
What to watch: whether the approach generalizes beyond game environments to real-world agent tasks, and whether the skill-bank component can be decoupled from the full eight-GPU training loop for broader use. The forgetting problem, if it persists in production settings, would matter for any deployment where a model must retain its general reasoning while also using newly learned skills. The next independent benchmark on this architecture should tell us whether the game gains survive contact with tasks that actually matter.