A Pay-As-You-Go Recipe for Cleaning Giant Messy Databases

A Pay-As-You-Go Recipe for Cleaning Giant Messy Databases — type0 | type0

PREVIEWA Pay-As-You-Go Recipe for Cleaning Giant Messy Databases · MD

When a giant database is full of messy records that might or might not refer to the same customer, transaction, or product, somebody has to clean it up. In a growing number of organizations, that work is farmed out to a paid "oracle," any service that, given a record, returns a list of other records that probably match it, and the bill is per lookup. A new preprint on arXiv asks how to get the most matches out of such a service without burning the budget, and answers with a batching strategy that is provably optimal under one natural condition.

The condition matters. The work, titled Entity Resolution via Batched Oracle Queries, frames entity resolution, the data-engineering term for matching and deduplicating records that refer to the same real-world entity, as a sequential game in which an algorithm picks a fixed-size batch of records per call, learns from the oracle which pairs match, and pays a cost for each query. The goal is to maximize recall, the share of true matches eventually surfaced, while controlling that cost. Picking the next best batch turns out to be NP-hard, the formal computer-science label for a class of problems with no known efficient general solution.

In other words, there is no shortcut that always finds the next best batch quickly. But the authors show that, if the true entity sizes, meaning how many records in the database actually point to each distinct person or item, satisfy a structural property they call a natural condition, the optimal batch has a clean closed-form description, and a simple greedy algorithm recovers it. The result is part theoretical (a proof of NP-hardness plus a conditional optimum) and part practical (a strategy for the pay-as-you-go setting where most real teams actually live).

The empirical case for the method comes from six benchmark datasets on which the authors report improvements over state-of-the-art baselines. The details live in the paper's experimental section; the headline is that the batching strategy surfaces more matches per query than the prior best across the suite. As with any preprint, those gains are the authors' own measurements until independent reproduction catches up, and the work is framed as a research contribution rather than a vendor study. There are no latency numbers in milliseconds, no dollar figures, and no production deployment. Cost is reported in oracle-query counts.

For a data team paying a third-party service to enrich, match, or dedupe records, the practical takeaway is straightforward: instead of asking the oracle to compare records one at a time or in arbitrary groups, batch strategically, pick the batches that the paper's analysis says are optimal under the natural condition, and re-check whether your data actually satisfies that condition before betting on the guarantee. The paper's full method and proof are available in the PDF.

What is still open: how often the natural condition holds in messy production datasets; whether the gains survive head-to-head tests against commercial identity-resolution tools rather than academic baselines; and how the cost trade-off maps to actual dollars at production scale. The theory is now in place; the field tests are the next step.

A Pay-As-You-Go Recipe for Cleaning Giant Messy Databases

Sources