When RAG systems retrieve evidence for a question, they pick chunks one at a time, choosing whatever looks most similar to the query. It is a greedy strategy that works until it does not. Pick the most similar chunk, then the next most similar, then the next, and you end up with a cluster of near-duplicates that waste the context window saying the same thing three times.
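To make that failure mode concrete, here is a toy sketch with made-up two-dimensional embeddings (none of this is from the paper): three near-duplicate chunks about one topic crowd out a distinct chunk under point-wise similarity ranking.

```python
import numpy as np

# Toy corpus: three near-duplicate chunks (A1-A3) and one distinct chunk (B).
# Embeddings are invented for illustration; a real system would use a model.
chunks = {
    "A1": np.array([1.00, 0.05]),
    "A2": np.array([0.98, 0.10]),
    "A3": np.array([0.97, 0.08]),
    "B":  np.array([0.30, 0.95]),
}
query = np.array([0.90, 0.40])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Point-wise ranking: score each chunk independently, keep the top 3.
ranked = sorted(chunks, key=lambda k: cosine(query, chunks[k]), reverse=True)
print(ranked[:3])  # the three near-duplicates win; chunk B never makes the cut
```

Each duplicate scores high on its own, so the greedy ranking has no mechanism to notice that the second and third picks add almost no new information.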
That is the core argument of ScalDPP, a preprint posted to arXiv on February 4, 2026 by researchers Xun Sun, Baiheng Xie, Li Huang, and Qiang Gao, submitted to the International Conference on Machine Learning (ICML). The paper proposes that RAG retrieval should be treated as a set-level problem rather than a point-wise one. Instead of maximizing similarity to the query, the goal should be maximizing the diversity and density of the evidence set as a whole. The authors call their formalization Diverse Margin Loss (DML), and they have built a small parameter-efficient adapter called a P-Adapter to make it work with existing language models.
The technical intuition draws from determinantal point processes (DPPs), a class of probabilistic models that originated in statistical physics. DPPs have a useful property: when you sample a subset from a larger pool, they naturally favor items that are both individually relevant and collectively diverse. The determinant, the mathematical object at the core of DPPs, captures both criteria at once. A DPP is more likely to pick one relevant item from a cluster of near-duplicates than to pick all of them, which is exactly what standard similarity-ranking fails to do.
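The determinant intuition can be checked numerically. In a standard DPP kernel, entry L[i][j] combines each item's quality with the pairwise similarity between items; the determinant of a subset's submatrix shrinks toward zero as the items become more similar. The quality values and embeddings below are illustrative, not from the paper.

```python
import numpy as np

# DPP kernel entry: L[i][j] = quality_i * similarity(i, j) * quality_j.
def kernel(items):
    n = len(items)
    L = np.empty((n, n))
    for i, (qi, vi) in enumerate(items):
        for j, (qj, vj) in enumerate(items):
            sim = vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj))
            L[i, j] = qi * sim * qj
    return L

a1 = (0.9, np.array([1.00, 0.05]))   # high-quality chunk, topic A
a2 = (0.9, np.array([0.98, 0.10]))   # near-duplicate of a1
b  = (0.7, np.array([0.30, 0.95]))   # lower-quality chunk, distinct topic B

det_dups    = np.linalg.det(kernel([a1, a2]))  # two near-duplicates
det_diverse = np.linalg.det(kernel([a1, b]))   # one chunk from each topic
print(det_dups, det_diverse)  # diverse pair has the far larger determinant
```

Because a DPP's subset probabilities are proportional to these determinants, it prefers the diverse pair even though the second duplicate has higher individual quality than chunk B.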
Sun and colleagues are not the first to apply DPPs to retrieval, but earlier work struggled with scalability. Finding the most probable subset under a DPP (MAP inference) is NP-hard, so researchers turned to greedy approximations. ScalDPP combines a fast greedy algorithm with the P-Adapter, a lightweight two-layer network with a bottleneck architecture that maps from a high-dimensional space down to a compact representation and back. The adapter adds relatively few parameters compared to full fine-tuning, which the authors argue makes their approach practical for deployment.
The ablation numbers are the paper's most notable empirical finding. Removing the P-Adapter causes NDCG@10 to fall from 0.4895 to 0.2265 on the Qwen3-4B model, a 53.7 percent drop. At k=4 the effect is larger: NDCG@4 falls by 65.5 percent. Those are not marginal differences, and they are the kind of numbers that make an engineer ask whether their production system has the same problem.
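For readers unfamiliar with the metric, NDCG@k compares the discounted gain of the top k results against the gain of an ideally ordered list, so a score near 1.0 means relevant items sit at the top. The sketch below uses the common linear-gain formulation; the paper does not specify its exact variant, and some implementations use an exponential gain instead.

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """NDCG@k with linear gain: DCG of the list as ranked, divided by the DCG
    of the same labels sorted into the ideal order. `relevances` are graded
    relevance labels in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))  # positions 1..k
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# A ranking that swaps a few relevant items out of place scores below 1.0.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=6))
```

Because the discount decays logarithmically with rank, a retriever that buries distinct evidence below redundant chunks pays for it directly in this metric.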
The comparison baselines include BM25, DPR, ANCE, AR2, RR, SentenceBERT, Dragon, and BM25 with a reranker. ScalDPP reports gains across all three evaluation datasets: Natural Questions, TriviaQA, and PopQA. The exact magnitude depends on the model and the dataset, but the direction is consistent: treating retrieval as a set-level problem produces better recall and better coverage of distinct facts than point-wise ranking.
The implicit critique is what matters for anyone building or buying RAG systems. The standard approach measures how well each individual retrieved chunk matches the query. That is intuitive, but it does not capture whether the retrieved set as a whole gives the language model the evidence it needs. A set can be individually relevant and collectively redundant. The authors are arguing that the objective function is wrong, and that fixing it requires changing how retrieval systems are trained, not just how they are ranked at inference time.
There are reasons for caution. The paper lists no institutional affiliation for its authors, which makes independent verification of their credentials impossible from the document alone. No code has been released, so the results cannot be reproduced by external researchers. It is a preprint submitted to a major conference, not a published peer-reviewed paper. The benchmarks are the authors' own evaluation, run on datasets they selected. Large ablation effects are easier to produce in a controlled setup than to replicate across the variety of document collections and query distributions that appear in production.
Those caveats do not make the core argument incoherent. The concern about near-duplicate retrieval is not new; practitioners have described the problem for years. Formalizing it as a set-level optimization problem with a concrete loss function is a genuine contribution. Whether DML specifically is the right formalization is a question for the community to work through. The more durable insight may be the reframing: the field has been measuring the wrong thing.
For anyone running a RAG pipeline today, the practical question is whether the retrieval system is silently optimizing for the wrong objective. If chunks look relevant but answers feel thin, the problem might not be the chunking strategy or the embedding model. It might be that the system keeps picking the same kind of evidence and calling it done.
Sources: ScalDPP on arXiv