AI Agents Are Talking to Each Other the Slow Way. Columbia Has a Fix.

When two AI agents need to work together, the standard approach is to have one generate text for the other to read, and it leaves an open question: what exactly did the receiving agent learn? Columbia University researchers posted a paper to arXiv on May 19, 2026 that takes aim at that approach. Their method, called Latent Cache Flow (LCF), replaces text generation with a compressed bundle of internal model state that the receiving model can use directly. According to the paper, it needs roughly 24 times fewer adapter parameters than a prior cache-sharing method and runs 8.5 times faster than text-based communication. The catch: the paper never fully explains what that bundle contains or discards.

The work comes from Columbia's Data Agents Process Lab (DAPLab), where Maximillian Rossi, Prajwal Raghunath, and Eugene Wu are researchers. Two related papers on the same problem were accepted at ICLR 2026, Cache-to-Cache (C2C) communication and KVComm, which suggests the research community regards this as an open problem with a live race to solve it. The Columbia approach replaces text generation entirely with a shared pipeline and a compression step that shrinks what gets transferred to 13 megabytes. C2C, the prior method, required 956 megabytes for the same model pairing. The Columbia paper also reports a 24.6x reduction in the adapter parameters needed for one model to condition on another's internal state.

The core technical move forces the system to compress everything worth transferring into a small, fixed-size packet. The authors call the cross-context variant LCF-X. Rather than passing raw text, LCF-X uses attention pooling to compress the full key-value cache — the internal memory structure a model uses while processing context — into a fixed-size tensor the receiver can condition on. The paper does not specify what information the tensor preserves or discards during this compression.
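To make the idea of a fixed-size packet concrete, here is a minimal sketch of generic attention pooling in plain Python. It is not the paper's implementation, which has not been released: the function name `attention_pool`, the toy vectors, and the hand-written query vectors (which would be learned in a real system) are all illustrative assumptions. The point it shows is only that a set of K queries turns a cache of any length into exactly K vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(cache, queries):
    """Compress T cache vectors into len(queries) pooled vectors.

    cache:   list of T vectors (lists of d floats); T varies with context length
    queries: list of K vectors of d floats (hypothetical; learned in practice)
    returns: list of K vectors of d floats, regardless of T
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    pooled = []
    for q in queries:
        # Score every cache position against this query.
        scores = [scale * sum(qi * ci for qi, ci in zip(q, c)) for c in cache]
        weights = softmax(scores)
        # Weighted average of the cache vectors for this query.
        pooled.append(
            [sum(w * c[j] for w, c in zip(weights, cache)) for j in range(d)]
        )
    return pooled

# Toy data, invented for illustration only.
queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
short_cache = [[0.5, 0.1, 0.2], [0.3, 0.9, 0.4]]
long_cache = short_cache + [[0.2, 0.2, 0.8], [0.7, 0.1, 0.1], [0.0, 0.6, 0.3]]

short_packet = attention_pool(short_cache, queries)
long_packet = attention_pool(long_cache, queries)
print(len(short_packet), len(long_packet))
```

Both packets have the same shape even though one cache is more than twice as long as the other. Each pooled vector is a weighted average of the cache entries, which is also why detail can be lost: whatever the weights downplay does not reach the receiver.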

The Columbia paper reports that LCF outperforms text-based communication by 23 percent in accuracy and 8.5 times in speed in cross-context settings, where the two models have different information to work from. In same-context settings, where both models see identical input, the compression approach is also more accurate than C2C while requiring 24.6x fewer adapter parameters: 19.4 million versus 477.8 million at a bottleneck dimension of 128 across 28 layers. Training LCF-128 for 300 optimizer steps takes about 4.5 hours on a single Colab A100, at roughly 52 seconds per step compared with 66 seconds per step for C2C.
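The headline ratios can be checked with a few lines of arithmetic. The snippet below uses only figures quoted in this article; the variable names are illustrative, and the 73.5x size ratio and the C2C training time are derived here rather than stated in the source.

```python
# Adapter parameters, in millions, as quoted above.
c2c_params_m = 477.8
lcf_params_m = 19.4
param_ratio = c2c_params_m / lcf_params_m

# Megabytes quoted for the same model pairing.
c2c_mb = 956
lcf_mb = 13
size_ratio = c2c_mb / lcf_mb

# Training cost: 300 optimizer steps at the quoted seconds per step.
steps = 300
lcf_hours = steps * 52 / 3600
c2c_hours = steps * 66 / 3600

print(f"parameter ratio: {param_ratio:.1f}x")   # about 24.6x
print(f"size ratio:      {size_ratio:.1f}x")    # about 73.5x
print(f"LCF training:    {lcf_hours:.2f} h")    # about 4.33 h
print(f"C2C training:    {c2c_hours:.2f} h")    # 5.50 h
```

The parameter ratio matches the reported 24.6x. The per-step figure gives a little under the quoted "about 4.5 hours" for LCF; the article does not say what accounts for the difference.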

The efficiency difference matters because the dominant approach to multi-agent communication, text generation, is inherently sequential. Each token depends on the ones generated before it, so even fast hardware cannot parallelize decoding within a single response. LCF sidesteps this by routing model activations through a shared pipeline instead of generating language. C2C already showed this was possible but required both models to process identical context in the same positions, a requirement that breaks down the moment the models see different inputs. LCF addresses that with the latent bottleneck; LCF-X addresses it with the attention-pooling compression.

The paper's experiments use small models: Qwen2.5-0.5B-Instruct as the sharing model and Qwen3-0.6B as the receiver. The leap to production-scale systems like GPT-4 or Claude is untested. The cross-context compression that makes LCF-X useful is the least proven part: summarizing a full key-value cache into a fixed-size tensor risks discarding exactly the information that matters for a given task. Layer pruning, which removes receiver layers whose learned gates are near zero, is tested as an optimization but is not a core claim.
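Gate-based layer pruning of the kind described can be sketched as a simple threshold filter. This is an illustration under assumptions, not the authors' procedure: the function name `layers_to_keep`, the threshold of 0.05, and the gate values are all hypothetical, since the article does not say how "near zero" is defined.

```python
def layers_to_keep(gates, threshold=0.05):
    """Return indices of receiver layers whose learned gate is not near zero.

    gates:     one scalar gate per layer (hypothetical values)
    threshold: cutoff for "near zero" (an assumption, not from the paper)
    """
    return [i for i, g in enumerate(gates) if abs(g) >= threshold]

# Invented gate values for a 6-layer toy model.
gates = [0.81, 0.02, 0.47, 0.00, -0.33, 0.01]
kept = layers_to_keep(gates)
print(kept)  # [0, 2, 4]
```

Layers 1, 3, and 5 fall under the cutoff, so their cross-model connections would be skipped, saving compute wherever the gate contributes almost nothing.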

KVComm, accepted as a poster at ICLR 2026, independently validates that cache-level sharing is a live research direction. The two papers arrive at similar ideas through different routes, which is the mark of a field converging on a real problem. C2C, also accepted at ICLR 2026, provides the baseline the Columbia work improves on.

The honest limit is that this is a fresh preprint on arXiv, not a product or an independently verified benchmark. No code has been released. The gains are measured on small models in controlled settings. Whether the 24x parameter reduction and 8.5x speedup survive at 7 billion parameters or larger is the question that determines whether this is a genuine architectural shift or a lab curiosity.

What to watch next is whether the approach generalizes to larger models — the authors have not yet demonstrated it — and whether any agent framework picks up the implementation. If LCF or something like it moves from arXiv to a GitHub demo, the pressure on text-mediated agent pipelines becomes concrete. Until then, the result is a signal about where multi-agent communication is heading, not proof it has arrived.
