When AI Knows It in Pixels but Forgets It in Words


When a user shows an AI assistant a photo of a specific dog and tells it, in a paired text-and-image prompt, "this is Max, a golden retriever," the model can learn that fact. But ask the same model "what kind of dog is Max?" in text only, with no photo attached, and it may revert to a generic answer or a stale label. The edit stuck in one input mode and slipped in the other.

That asymmetry is the subject of a new arXiv preprint, "Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs", which names the phenomenon "editing decoupling failure." The authors report that today's standard methods of knowledge editing, the practice of rewriting facts inside a deployed neural network, routinely succeed on multimodal (text-plus-image) queries and silently fail on unimodal (text-only or image-only) ones. The same fact lives in two partially independent internal pathways, and only one of them receives the patch.

This matters for any organization that ships a multimodal AI product and issues corrections to it. A safety team can remove a harmful association from a model's text pathway, watch it pass a text-only eval, and still see the same harm emerge when a user phrases the prompt with an image attached. The "fix" was half-applied, and the half is not the one a typical red team is set up to test.
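One way to picture the testing gap is a check that probes the same edited fact once per input mode. The sketch below is illustrative only and is not the paper's evaluation code: `ModelFn`, `Probe`, `check_edit`, `decoupled`, the file name `max.jpg`, and the toy model are all hypothetical names made up for this example, and the substring match is a deliberately crude stand-in for a real scoring rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical model interface (an assumption for this sketch):
# takes optional text and an optional image path, returns an answer string.
ModelFn = Callable[[Optional[str], Optional[str]], str]


@dataclass
class Probe:
    mode: str             # "paired", "text_only", or "image_only"
    text: Optional[str]   # prompt text, or None for image-only probes
    image: Optional[str]  # image path, or None for text-only probes


def check_edit(model: ModelFn, probes: List[Probe], expected: str) -> Dict[str, bool]:
    """Return, per input mode, whether the answer contains the edited fact."""
    results = {}
    for probe in probes:
        answer = model(probe.text, probe.image)
        # Crude scoring: case-insensitive substring match.
        results[probe.mode] = expected.lower() in answer.lower()
    return results


def decoupled(results: Dict[str, bool]) -> bool:
    """True when the edit holds in some input modes but not in all of them."""
    return any(results.values()) and not all(results.values())


def toy_model(text: Optional[str], image: Optional[str]) -> str:
    """Toy stand-in for an edited model: the fact only surfaces on paired input."""
    if text is not None and image is not None:
        return "This is Max, a golden retriever."
    return "A dog."


probes = [
    Probe("paired", "What breed is this dog?", "max.jpg"),
    Probe("text_only", "What breed is Max?", None),
    Probe("image_only", None, "max.jpg"),
]

results = check_edit(toy_model, probes, "golden retriever")
print(results)             # per-mode pass/fail
print(decoupled(results))  # flags a half-applied edit
```

An eval that ran only the paired probe would report this toy edit as a success. Running all three probes shows the split.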

The paper's central claim is that entity knowledge inside a multimodal large language model (MLLM) is not stored as a single unified representation. It is distributed across modality-specific circuits, so updates biased toward one input mode, say a paired text-image edit, fail to propagate into the other. The authors' proposed remedy, a method called DECODE, tries to disentangle and localize those modality-specific neuron groups and edit them in parallel. In their experiments on open-weight MLLMs, the method is described as consistently producing knowledge updates that hold under different input triggers.

That is one research group's result, on open-weight models, and the paper has not yet been peer reviewed. The findings may not transfer to closed-weight commercial systems like GPT-4V-class or Gemini-class models, and the authors do not claim they do. Whether the same modality-specific storage pattern shows up there, and whether DECODE-style edits land the same way, are open questions. There is also a long-running critique that targeted knowledge editing is the wrong frame entirely: that honest fixes come from retraining, grounding, or architectural change, not from surgical rewrites of stored facts.

What the paper does sharpen is the usually unstated assumption behind most public discussion of AI updates, which is that a model's "knowledge" is one thing in one place, and that an "edit" is a single atomic operation. If that assumption is wrong, then the field's usual language of "the model now knows X" or "we have corrected the model" is doing more work than it should. A model can know a fact in pixels, forget it in words, and still pass the eval the safety team designed.

The watch item from here is whether independent groups replicate the decoupling finding, and whether commercial multimodal vendors disclose how they test edits across input modes. The simplest version of the question a reader can hold: when an AI "knows" a fact, is that one fact or several?
