AI Models Know What a Hammer Is. They Just Cannot Figure Out Which Part to Use.
When AI agents are supposed to book your flights, drive your robots, and write your code, researchers want to know one thing: can they figure out what to do with a tool they've never seen before? A team at the University of Illinois Urbana-Champaign has an answer, and it is not reassuring.
The researchers built a benchmark called CreativityBench that tests how well AI models can repurpose objects for tasks they were never designed for. The test is simple in concept: here is a hammer, a rubber band, and a pencil. Use two of them to move water across a room without spilling it. The answer requires identifying not just which objects work, but which part of each object does the job. The paper found that even the most capable models collapse on this task.
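To make the setup concrete, a task in this style might be represented like the following Python sketch. The field names and the example answer are illustrative assumptions, not the paper's actual schema.

```python
# A minimal sketch of what one CreativityBench-style task might look like.
# Field names and values are illustrative, not taken from the paper.
from dataclasses import dataclass

@dataclass
class RepurposingTask:
    goal: str                      # what the agent must accomplish
    available_objects: list[str]   # objects present in the scene
    correct_object: str            # which object actually works
    correct_part: str              # which part of that object does the job

task = RepurposingTask(
    goal="move water across the room without spilling it",
    available_objects=["hammer", "rubber band", "pencil"],
    correct_object="hammer",
    correct_part="claw",  # hypothetical: the claw could hook a container's handle
)
```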
Models can usually pick a plausible object, the researchers found, but they consistently fail to identify the specific part or attribute that makes the object useful. That distinction matters. Selecting a knife to cut something is table stakes. Selecting the knife's handle to grip hot cookware, or the flat of its blade to spread paste, requires understanding the physical structure of the tool itself. When the benchmark isolated this part-level reasoning, performance dropped by more than 60 percent across all tested models. The researchers evaluated ten state-of-the-art systems on 14,000 distinct tasks grounded in real-world physical constraints.
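That two-level gap is easy to state as a scoring rule. Here is a minimal sketch, assuming answers arrive as (object, part) pairs; the function and data are hypothetical stand-ins, not the paper's evaluation code.

```python
# Sketch of two-level scoring: credit the object pick and the part pick
# separately. Names and structure are assumptions for illustration only.
def score(predictions, gold):
    """predictions, gold: lists of (object, part) tuples."""
    object_hits = sum(p[0] == g[0] for p, g in zip(predictions, gold))
    part_hits = sum(p == g for p, g in zip(predictions, gold))  # object AND part
    n = len(gold)
    return object_hits / n, part_hits / n

# A model can ace the first number and collapse on the second:
preds = [("knife", "blade"), ("knife", "blade"), ("hammer", "head")]
gold  = [("knife", "handle"), ("knife", "blade"), ("hammer", "claw")]
print(score(preds, gold))  # (1.0, 0.33...): right objects, mostly wrong parts
```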
The result was not close. Larger models did not reliably outperform smaller ones. Adding Chain-of-Thought, the reasoning scaffolding that has become standard for squeezing more capability out of existing models, produced minimal gains. Scaling up compute and prompting strategy simultaneously failed to close the gap, according to the paper. This finding aligns with independent research: the MacGyver benchmark from 2024 had already established that LLMs lag on similar creative problem-solving tasks, and the SocialGrid paper from April 2026 showed the same pattern across separate research groups, suggesting the problem is not a one-off evaluation artifact.
The hierarchy of model capabilities also surprised the researchers. Models that scored highest on logical reasoning benchmarks, including the GPT-5 family, were consistently outperformed by models like Qwen3-32B on creative tool discovery tasks. The authors interpret this as evidence that reasoning and creativity are dissociable capabilities, not different expressions of the same underlying skill. Strong planners are not necessarily good improvisers.
The implication is uncomfortable for anyone betting on AI agents to operate in the physical world or handle open-ended coding tasks. Current foundation models appear to have a structural gap in how they represent the relationship between objects and their uses, a gap that does not respond to the techniques that have reliably improved performance on other tasks. The researchers also built an affordance knowledge base containing over 4,000 entities and 150,000 annotations linking physical objects to their actionable properties, a dataset the authors say is the largest of its kind and designed to be reusable by other researchers.
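The paper's schema for that knowledge base is not spelled out here, but entities linked to actionable properties at the part level map naturally onto a nested lookup. A hypothetical sketch, with made-up entries:

```python
# Hypothetical shape for an affordance knowledge base: each entity maps
# its parts to the actions that part supports. Entries are invented.
affordances = {
    "knife": {
        "blade": ["cut", "scrape", "spread"],
        "handle": ["grip", "press"],
        "spine": ["tap", "crack"],
    },
    "hammer": {
        "head": ["strike", "flatten"],
        "claw": ["pry", "hook", "pull"],
        "handle": ["grip", "lever"],
    },
}

def parts_supporting(action: str) -> list[tuple[str, str]]:
    """Find every (entity, part) pair whose annotations include the action."""
    return [
        (entity, part)
        for entity, parts in affordances.items()
        for part, actions in parts.items()
        if action in actions
    ]

print(parts_supporting("pry"))  # [('hammer', 'claw')]
```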
What to watch next is whether this limitation survives contact with production environments. The benchmark is controlled; real tasks are messier. Labs working on physical AI and coding agents will have to decide whether to treat part-level affordance reasoning as an engineering problem solvable with more data and scaling, or as a more fundamental representational gap. The answer shapes which agents get built and which remain research curiosities.