The last-mile problem in agent vision is not getting models to see. It is getting models that see differently to talk to each other.
When a robot fleet mixes V-JEPA 2, DINOv2, and CLIP, the raw pixels are mutually legible; the latent representations are not. Each encoder compresses visual reality into a geometry shaped by its training objective: video prediction, self-distillation, image-text contrast. Those geometries do not line up, and linear mappings between them collapse: in controlled experiments, linear adapters achieve an R² of just 0.068 when mapping V-JEPA 2 features onto DINOv2 features from the same video. The information survives translation only in a statistical sense.
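A toy sketch makes the failure mode concrete. Here two synthetic "encoders" embed the same latent scene variable through different nonlinear geometries, and an ordinary least-squares adapter between them recovers little variance. Everything here is illustrative (the encoders, data, and the resulting R² are synthetic, not the paper's):

```python
import math
import random

# Two toy "encoders" embed the same latent variable z into different geometries.
random.seed(1)
z = [random.uniform(-2, 2) for _ in range(500)]
feat_a = [math.sin(3 * zi) for zi in z]  # encoder A: oscillating embedding
feat_b = list(z)                         # encoder B: identity embedding

# Closed-form 1-D least-squares adapter: feat_b ≈ w * feat_a + c
n = len(z)
ma, mb = sum(feat_a) / n, sum(feat_b) / n
cov = sum((a - ma) * (b - mb) for a, b in zip(feat_a, feat_b)) / n
var = sum((a - ma) ** 2 for a in feat_a) / n
w = cov / var
c = mb - w * ma

# R² of the linear adapter: low, because the geometries are nonlinearly related
sse = sum((b - (w * a + c)) ** 2 for a, b in zip(feat_a, feat_b))
sst = sum((b - mb) ** 2 for b in feat_b)
r2 = 1 - sse / sst
```

Real encoder features are high-dimensional, but the mechanism is the same: a linear map can only shear and scale one geometry into another, and these geometries differ nonlinearly.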
Tomasz Kaszyński has published what he calls a solution. His paper, posted to arXiv on March 18, introduces WMCP v0.1, the World Model Communication Protocol. The mechanism is a discrete Gumbel-Softmax bottleneck: instead of letting agents negotiate over raw representations, he forces all communication through a small vocabulary of discrete tokens. Iterated across training generations, this bottleneck drives agents to converge on a shared compositional language for physical properties without anyone defining what those properties are or how they should be named.
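The Gumbel-Softmax trick itself is standard: add Gumbel noise to the logits over a token vocabulary, then take a temperature-scaled softmax, so sampling stays differentiable while low temperatures push the output toward a one-hot token choice. A minimal sketch of the generic mechanism (not Kaszyński's implementation; the logits and function are illustrative):

```python
import math
import random

def gumbel_softmax(logits, tau):
    """Differentiable relaxed sample from a categorical distribution."""
    # Gumbel(0, 1) noise: g = -log(-log(U)), U ~ Uniform(0, 1)
    noise = [-math.log(-math.log(random.random())) for _ in logits]
    scaled = [(l + g) / tau for l, g in zip(logits, noise)]
    # Numerically stable softmax over the perturbed, temperature-scaled logits
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
logits = [2.0, 0.5, -1.0, 0.1]          # an agent's preference over 4 tokens
soft = gumbel_softmax(logits, tau=5.0)  # high temperature: diffuse mixture
hard = gumbel_softmax(logits, tau=0.1)  # low temperature: near one-hot token
```

Annealing the temperature during training is what turns a continuous negotiation into a discrete, shareable vocabulary.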
The empirical results are solid. Four-agent systems converge in 100 percent of 80 random seeds, reaching positional disentanglement of 0.999 and 98.3 percent holdout accuracy. This holds across encoder heterogeneity: V-JEPA 2, DINOv2, and CLIP agents match the performance of homogeneous groups on Physics 101 real-video tasks. The discrete bottleneck does the translation work that linear adapters cannot.
What separates WMCP v0.1 from most emergent communication papers is that Kaszyński wrote down the protocol. The GitHub repository includes a formal specification, not just a trained model, along with deployment numbers that researchers rarely publish. Single-sample inference runs at 1.19 milliseconds on an Apple M3 Pro CPU, with a 95th-percentile latency of 1.35 milliseconds. Feature compression is 5,200x compared to raw video features. New vision encoders onboard in 50 training steps; in Kaszyński's experiments, 10 out of 10 random seeds reached 90 percent of base accuracy within that window.
The causal intervention result is where the claim earns scrutiny. On real Physics 101 video, zeroing the mass-relevant communication channel reduces accuracy by 7.8 percentage points, while zeroing any other channel shifts it by at most 2.1 percentage points. The p-value is 0.022, and the effect size (Cohen's d of 1.87) is large. This is the strongest evidence that the protocol represents physical properties rather than accidentally correlating with them.
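The intervention logic is simple to state in code: zero one channel of every message, re-run the downstream readout, and compare accuracy drops across channels. A toy illustration of that logic (the decoder, messages, and labels here are hypothetical, not the paper's experiment):

```python
def ablate(messages, channel):
    """Zero out one communication channel across all messages."""
    return [[0.0 if i == channel else v for i, v in enumerate(m)]
            for m in messages]

def decode_mass(msg):
    # Toy readout: this hypothetical decoder relies only on channel 0.
    return 1 if msg[0] > 0.5 else 0

messages = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.7, 0.1, 0.9]]
labels = [1, 0, 1]

def accuracy(msgs):
    return sum(decode_mass(m) == y for m, y in zip(msgs, labels)) / len(labels)

base = accuracy(messages)
drop_mass = base - accuracy(ablate(messages, 0))   # large: channel 0 is load-bearing
drop_other = base - accuracy(ablate(messages, 1))  # zero: channel 1 is irrelevant
```

A channel whose ablation selectively destroys one prediction, while the others barely move, is carrying that property rather than merely correlating with it.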
The limitation is also the headline: a single-author preprint with no industry adoption and no peer-reviewed validation. The protocol is a working artifact, not deployed infrastructure. What Kaszyński has built is more rigorous than most emergent communication papers, with a spec, real-video benchmark, and deployment numbers all on GitHub, but it is not community infrastructure until someone else builds on it.
That is where the interesting question sits. WMCP v0.1, if it generalizes, becomes the shared protocol that flattens the encoder interoperability problem. Instead of training N×M per-encoder adapters for a system with N agents and M vision backbones, you implement the protocol once per encoder. New vision models onboard by learning to speak it, not by training a custom translator for every existing peer.
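The scaling argument comes down to a back-of-the-envelope count (illustrative, not from the paper): pairwise translators between encoders grow quadratically with the number of distinct backbones, while a shared protocol grows linearly.

```python
def pairwise_adapters(num_encoders: int) -> int:
    # One trained translator per ordered pair of distinct encoders
    return num_encoders * (num_encoders - 1)

def protocol_heads(num_encoders: int) -> int:
    # One protocol implementation per encoder, all speaking the same tokens
    return num_encoders

# At 3 backbones the difference is small; at 10 it is 90 adapters vs 10 heads.
costs = {m: (pairwise_adapters(m), protocol_heads(m)) for m in (3, 5, 10)}
```

This is the same economics that made TCP/IP win over per-pair gateways: the marginal cost of a new participant stops depending on how many participants already exist.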
Whether that matters depends on whether the protocol generalizes beyond Kaszyński's controlled setup. His minimum viable model has 886,000 parameters and a hidden dimension of 8. The protocol is tiny, which is good for deployment but may put a low ceiling on expressiveness. Real-world physical scenes with many simultaneous properties, occlusions, and novel object categories are absent from the Physics 101 benchmark.
The frame that matters for builders: WMCP v0.1 is an infrastructure proposal, not a product claim. It solves a documented problem with a reproducible mechanism. Whether it solves that problem for production multi-agent systems is a question the open-source community can now answer, because the code and the spec are both on GitHub.