When Closed Models Own the Leaderboard, Papers With Code Adjusts the Question

Papers with Code, the crowdsourced SOTA leaderboard that became a fixture for tracking reproducible research, has quietly rebuilt itself around a world where the top results have no code to speak of.

The change, announced this week on r/MachineLearning by u/NielsRogge, is more than a UI update. According to the post, the relaunched paperswithcode.co now treats closed-weight model evaluations as first-class entries. The site automatically parses research papers from arXiv and Hugging Face to populate leaderboards across domains including 3D generation and AI agents, and the submission flow accepts sources beyond arXiv, including blog posts and system cards, as a "paper."

The mechanic that matters for the field: a UI toggle, plus a setting in the site's PwC preferences, lets users hide closed-source rows and save an open-only default view. BrowseComp, a benchmark shown in the post with a scatter plot and a table view, shows the new data model in action. Every closed-source eval carries a visible "closed" tag so a reader can see at a glance which rows are auditable and which are not.
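To make the mechanic concrete, here is a minimal Python sketch of how a leaderboard row with a "closed" flag and an open-only filter might be modeled. It is an illustration only, not the site's actual implementation: the names `EvalRow`, `visible_rows`, and `label`, the `source_kind` values, and the placeholder model names and scores are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalRow:
    """One leaderboard entry (hypothetical schema, not the site's own)."""
    model: str
    score: float
    closed: bool       # True when the result cannot be audited from public code or weights
    source_kind: str   # assumed values: "arxiv", "blog_post", "system_card"


def visible_rows(rows: List[EvalRow], hide_closed: bool) -> List[EvalRow]:
    """Return rows ranked by score, dropping closed rows when the toggle is on."""
    kept = [r for r in rows if not (hide_closed and r.closed)]
    return sorted(kept, key=lambda r: r.score, reverse=True)


def label(row: EvalRow) -> str:
    """Render a row with a visible tag so closed entries are never unmarked."""
    tag = " [closed]" if row.closed else ""
    return f"{row.model}{tag}: {row.score:.1f}"


if __name__ == "__main__":
    # Placeholder rows with made-up scores, for illustration only.
    rows = [
        EvalRow("closed-model-a", 70.0, True, "blog_post"),
        EvalRow("open-model-b", 55.0, False, "arxiv"),
        EvalRow("closed-model-c", 62.5, True, "system_card"),
    ]
    for hide in (False, True):
        print(f"hide_closed={hide}")
        for row in visible_rows(rows, hide_closed=hide):
            print("  " + label(row))
```

In this sketch the tag is applied at render time rather than stored as display text, so a closed row cannot appear without its marker whichever view the reader has saved.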

The framing is partly tongue-in-cheek. The post carries Reddit's [P] project flair and uses the title "Papers Without Code" to underline the inversion the relaunch is responding to. The paperswithcode.co site is live as of this writing and currently returns an HTTP 200 response with the page title "Papers with Code." It is a community project running on a .co domain, not a continuation of the original Papers with Code site by Meta.

The actual story is what the data model change implies. A SOTA leaderboard built around the assumption that a result comes with code is now structurally agnostic about both the "paper" and the "code." The post cites GPT-5.5 and Mythos 5 as example closed-source models that dominate current benchmarks, presented in the new submission flow as blog-post-style papers. Whether those models are in fact market leaders is not independently verified here; the relevant point is that the project's own example of what a SOTA entry now looks like has no code to read.

This is a beat about how the machinery of SOTA claims gets built. When the top row of a leaderboard is unreadable, the leaderboard stops being a tool for reproducibility. It becomes a tracking surface, and the "closed" tag is the visible seam. The toggle exists for readers who would rather skip rows they cannot inspect.

Showing closed rows is more accurate than hiding them. The visible tag is the concession to reproducibility. Without it, a closed-source row would look like any other SOTA entry. The site is documenting the new ground, not endorsing closed releases.

As of this writing, no major benchmark author or reproducibility researcher has publicly responded to the relaunch. The visible "closed" tag may be treated as sufficient, or the field may push for something stronger, such as public eval code or a signed provenance claim, even when the model itself is gated.

The new data model is live. The question is what gets counted next, and on whose terms.
