A graph-free retrieval system that ends the nightly rebuild tax at commodity-API cost
MOTHRAG is an Apache 2.0 framework for AI search over a company's changing documents.
Teams running AI search over constantly changing documents have been paying a quiet tax. Every time prices, filings, support tickets, or a news feed refreshes, the knowledge graph that powers many graph-based retrieval systems has to be rebuilt. That rebuild cycle, not the per-query retrieval, is where the compute bills pile up.
A new open-source system called MOTHRAG skips the graph construction entirely and still ties or beats the graph-based incumbents on the benchmarks those systems are usually judged on. The release, announced on r/MachineLearning by the project's author, lands as a pip-installable Apache-2.0 package with reproducibility instructions shipped in the repository's REPRODUCE.md.
MOTHRAG belongs to retrieval-augmented generation (RAG), the family of AI systems that fetch and reason over a company's or domain's documents to answer a question, rather than relying only on what a model memorized in training. Most high-performing RAG systems today build a knowledge graph, a structured map of entities and their relationships, at index time, then traverse that graph at query time to chain facts across documents. That is the expensive step MOTHRAG refuses to do.
What it does instead is closer to a commodity pipeline. The architecture, as described in the MOTHRAG README and the project site, is a dense vector index with no graph layer, plus an orchestration loop at query time that breaks a question into sub-questions, retrieves supporting passages for each, and reassembles an answer. Every component sits behind an ordinary LLM API call, which is why no GPU is required on the deployment side. The code itself lives at github.com/juliangeymonat-jpg/mothrag.
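The query-time loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MOTHRAG's actual code: the names answer, toy_decompose, toy_retrieve, and toy_synthesize and the three-sentence corpus are hypothetical, and the stubs stand in for the LLM and vector-index calls a real deployment would make.

```python
# Hypothetical sketch of a decompose / retrieve / reassemble loop.
# None of these names come from the MOTHRAG codebase.

CORPUS = [
    "Acme Corp was founded by Jane Doe.",
    "Jane Doe was born in Lisbon.",
    "Lisbon is the capital of Portugal.",
]


def toy_decompose(question):
    """Stand-in for an LLM call that splits a question into sub-questions."""
    return ["Who founded Acme Corp?", "Where was Jane Doe born?"]


def toy_retrieve(query, k):
    """Stand-in for a vector search: rank passages by shared-word count."""
    q_tokens = set(query.lower().strip("?").split())

    def overlap(passage):
        return len(q_tokens & set(passage.lower().strip(".").split()))

    return sorted(CORPUS, key=overlap, reverse=True)[:k]


def toy_synthesize(question, evidence):
    """Stand-in for the reader LLM: here it just joins the evidence."""
    return " ".join(evidence)


def answer(question, decompose, retrieve, synthesize, k=1):
    """Break the question up, gather passages per sub-question, reassemble."""
    evidence = []
    for sub_question in decompose(question):
        for passage in retrieve(sub_question, k):
            if passage not in evidence:  # avoid repeating a passage
                evidence.append(passage)
    return synthesize(question, evidence)


result = answer(
    "Where was the founder of Acme Corp born?",
    toy_decompose,
    toy_retrieve,
    toy_synthesize,
)
print(result)
```

Passing the three stages in as plain callables is a design choice of this sketch only; it makes the point that each stage could be swapped for a hosted API call without touching the loop.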
The benchmarks the author uses are standard multi-hop question-answering tests, datasets designed to measure whether a system can chain facts across multiple documents to answer a single question. On HotpotQA, 2WikiMultiHopQA, and MuSiQue, three of the most cited multi-hop sets, the self-reported numbers for MOTHRAG with a Llama-3.3-70B reader and 1,000 evaluation samples each are 78.1, 76.3, and 50.5. The author compares those against three well-known graph-based baselines on the same splits: GraphRAG, HippoRAG, and RAPTOR. Against HippoRAG, the closest competitor of the three on every split, MOTHRAG reports wins on HotpotQA (78.1 versus 75.5) and 2WikiMultiHopQA (76.3 versus 71.0), and a narrower edge on MuSiQue (50.5 versus 48.6).
Those numbers are the project's own claim, not an adjudicated result. No independent third-party reproduction of the comparison appears in the materials cited in the release, and the author handle and any affiliation are not established from what is publicly available. The headline result is best read as one open-source release's evaluation, alongside a mechanism that anyone can re-run on their own data.
The honest weakness is on MuSiQue, and the author does not bury it. MOTHRAG scores 50.5 against NeocorRAG's 52.6 on the same benchmark, a roughly two-point gap that the author calls the system's unsolved weak spot in the release thread. That self-disclosed limit is the credibility anchor: a graph-free stack that wins where it wins and loses where it loses, instead of dressing up a uniform state-of-the-art sweep.
The operational pitch is what the benchmark table is in service of. Updates are embed-and-append: new or changed documents get a fresh embedding, get appended to the vector index, and the system stays current. There is no LLM re-indexing pass and no graph rebuild cycle, so a price feed refresh, a filings drop, or a support ticket backlog no longer triggers an overnight job. Per-query cost, on commodity LLM APIs, is reported at roughly $0.03 per query in the project materials. For teams whose data changes daily, that is a different cost curve than the nightly-graph-rebuild one.
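The embed-and-append update path can be illustrated with a toy index. The sketch below is an assumption-laden stand-in, not the project's implementation: AppendOnlyIndex, toy_embed, and the document IDs are hypothetical, and a normalized bag-of-words vector takes the place of a dense embedding returned by an API. The point it shows is that refreshing a document costs one embedding and one append, with superseded versions filtered out at search time.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding API call: a normalized bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {token: c / norm for token, c in counts.items()}


def cosine(a, b):
    """Dot product of two already-normalized sparse vectors."""
    return sum(weight * b.get(token, 0.0) for token, weight in a.items())


class AppendOnlyIndex:
    """Hypothetical index: rows are only ever appended, never rebuilt."""

    def __init__(self):
        self.rows = []    # (doc_id, version, vector, text)
        self.latest = {}  # doc_id -> newest version number

    def upsert(self, doc_id, text):
        """Embed the new or changed document and append it."""
        version = self.latest.get(doc_id, 0) + 1
        self.latest[doc_id] = version
        self.rows.append((doc_id, version, toy_embed(text), text))

    def search(self, query, k=3):
        """Rank only the newest version of each document."""
        q = toy_embed(query)
        live = [r for r in self.rows if r[1] == self.latest[r[0]]]
        live.sort(key=lambda r: cosine(q, r[2]), reverse=True)
        return [(doc_id, text) for doc_id, _, _, text in live[:k]]


index = AppendOnlyIndex()
index.upsert("price-widget", "widget price is 10 dollars")
index.upsert("faq-returns", "returns accepted within 30 days")
index.upsert("price-widget", "widget price is 12 dollars")  # refresh: one append
top = index.search("widget price", k=1)
print(top)
```

In this toy version the stale row stays in storage and is skipped at query time; a production index would presumably compact or delete old versions, which the source does not describe.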
Laid out side by side, in HotpotQA, 2WikiMultiHopQA, MuSiQue order, the reported baseline scores are as follows. GraphRAG, the Microsoft-derived baseline most teams compare against, posts 68.6, 58.6, and 38.5. HippoRAG comes in at 75.5, 71.0, and 48.6. RAPTOR lands at 69.5, 52.1, and 28.9. MOTHRAG beats all three on every split, by clear margins on the first two and by 1.9 points over HippoRAG on MuSiQue, while still trailing NeocorRAG there. The pattern is consistent: the graph-free stack trades the rebuild tax for a slightly smaller margin on the hardest multi-hop set.
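For readers who want the margins spelled out, the arithmetic can be reproduced from the self-reported scores quoted in this article. The snippet is only a convenience sketch: the SCORES table copies the figures above, and the margins helper is a hypothetical name, not part of any released tooling.

```python
# Self-reported scores as quoted in this article (not independently verified).
SCORES = {
    "MOTHRAG":  {"HotpotQA": 78.1, "2WikiMultiHopQA": 76.3, "MuSiQue": 50.5},
    "GraphRAG": {"HotpotQA": 68.6, "2WikiMultiHopQA": 58.6, "MuSiQue": 38.5},
    "HippoRAG": {"HotpotQA": 75.5, "2WikiMultiHopQA": 71.0, "MuSiQue": 48.6},
    "RAPTOR":   {"HotpotQA": 69.5, "2WikiMultiHopQA": 52.1, "MuSiQue": 28.9},
}


def margins(system="MOTHRAG"):
    """For each benchmark, find the best other system and the point gap to it."""
    result = {}
    for bench, score in SCORES[system].items():
        rivals = [(name, s[bench]) for name, s in SCORES.items() if name != system]
        best_name, best_score = max(rivals, key=lambda pair: pair[1])
        result[bench] = (best_name, round(score - best_score, 1))
    return result


for bench, (rival, gap) in margins().items():
    print(f"{bench}: +{gap} over {rival}")
```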
What to watch next is a paper. The project site says a paper has been published on Zenodo, but a DOI or working URL was not recoverable from the materials, and the comparison baselines are best sanity-checked against each baseline paper or repository before being repeated. Until that paper surfaces and an independent reproduction lands, the strongest version of this story is the operational one: a lean retrieval stack whose update path is embed-and-append, whose reported per-query cost is roughly three cents, and whose self-reported evaluation lines up with a mechanism a reader can test against their own changing corpus.