Bias research on large language models has spent years producing answers that contradict each other. The same model can look scrupulously fair in one study and openly discriminatory in another, and the field has rarely agreed on which result to trust. A new methodological audit argues the contradiction comes less from the models than from how the field has been asking the question.
The paper, "To Compare, or Not to Compare: On Methodological Practices in Evaluating Social Bias," hosted by the INSAIT Institute and posted to arXiv as 2606.24596, introduces a unified framework that lets researchers run bias tests under two controlled conditions. An "isolated" prompt asks about a single demographic group. A "forced-choice" prompt asks the model to pick between groups. Across multiple model families, the gap between the two was consistent and large enough that the authors call it a "massive, systematic paradigm gap." The headline finding is structural: a forced-choice framing does not just reveal latent bias, it actively amplifies it.
The mechanism the authors identify is underspecified context. When a model is asked a single, neutral question about a demographic group, it tends to default to a safe, equality-leaning answer. When the same model is asked to choose between two demographic options, the prompt removes that easy neutral fallback and forces the model to commit. The paper's framework holds everything else constant and varies only the prompt structure, so the difference in answers can be attributed to the test design rather than to how the model behaves in the wild.
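To make the two conditions concrete, here is a minimal Python sketch of the idea. It is not the authors' released harness. The names (`PromptPair`, `build_pair`, `disparity`, `paradigm_gap`), the prompt wording, and the disparity measure are all illustrative assumptions, and the paper may define its prompts and metrics differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """Both test conditions built from one scenario (hypothetical structure)."""
    isolated: tuple[str, str]  # one single-group prompt per group
    forced_choice: str         # one prompt naming both groups


def build_pair(scenario: str, group_a: str, group_b: str) -> PromptPair:
    # Same scenario text in both conditions, so only the prompt structure varies.
    isolated = tuple(
        f"{scenario} The applicant is {group}. Should they be approved? Answer yes or no."
        for group in (group_a, group_b)
    )
    forced_choice = (
        f"{scenario} Applicant A is {group_a}. Applicant B is {group_b}. "
        "Which applicant should be approved? Answer A or B."
    )
    return PromptPair(isolated=isolated, forced_choice=forced_choice)


def disparity(count_a: int, count_b: int, trials: int) -> float:
    """Absolute difference in how often each group gets the favorable outcome.

    Illustrative measure only; not taken from the paper.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    return abs(count_a - count_b) / trials


def paradigm_gap(isolated_disparity: float, forced_disparity: float) -> float:
    # Positive values mean the forced-choice condition shows more disparity.
    return forced_disparity - isolated_disparity
```

In use, a model's answers to each prompt would be tallied over repeated trials, `disparity` computed once per condition, and the two values compared with `paradigm_gap`.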
Two robustness checks make the finding hard to dismiss. The bias gap persists under Chain-of-Thought prompting, the technique of asking a model to reason step by step before answering, and it persists when the prompt offers a neutral "no preference" fallback option. Chain-of-Thought has become the default recipe for making models more careful, and the common assumption is that step-by-step reasoning de-biases outputs. The audit reports the opposite: under comparative settings, Chain-of-Thought reasoning appears to amplify social biases rather than dampen them. That is a direct challenge to a working assumption across the field.
The framework also controls for prompt artifacts that other studies have either ignored or treated as features. Neutral fallback options, demographic ordering, and reasoning scaffolding are held fixed in the framework's controlled conditions. That matters because contradictory results in prior work often traced back to exactly these confounders, even when authors believed they were measuring the same thing. The paper's code and evaluation harness are released openly on GitHub, letting other teams rerun the comparison on their own models and prompts.
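One common way to keep demographic ordering from leaking into a result is to counterbalance it: pose every forced-choice question in each possible order, then map the answer letters back to groups. The sketch below illustrates that general technique under assumed names (`counterbalanced_prompts`, `include_fallback`) and assumed prompt wording. It is not drawn from the paper's harness, whose controls may work differently.

```python
from itertools import permutations
from string import ascii_uppercase


def counterbalanced_prompts(
    scenario: str,
    groups: list[str],
    include_fallback: bool = True,
) -> list[tuple[str, dict[str, str]]]:
    """Return one (prompt, letter-to-option map) per ordering of the groups.

    Hypothetical helper: averaging results over all orderings is meant to
    cancel out any preference the model has for a particular answer position.
    """
    variants = []
    for order in permutations(groups):
        # Assign letters A, B, ... to the groups in this ordering.
        labels = dict(zip(ascii_uppercase, order))
        if include_fallback:
            # The neutral option always takes the next free letter.
            labels[ascii_uppercase[len(order)]] = "no preference"
        options = "\n".join(f"{letter}. {text}" for letter, text in labels.items())
        prompt = f"{scenario}\n{options}\nAnswer with a single letter."
        variants.append((prompt, labels))
    return variants
```

Because each variant carries its own letter-to-option map, a response such as "A" can be resolved to the right group regardless of which ordering produced it.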
The audit's constructive bottom line is straightforward. Comparative bias tests are valuable audit instruments, useful for surfacing latent discrimination that an isolated prompt would smooth over. They should not be treated as deployment verdicts, because the same amplification that makes them useful for auditing also makes them poor proxies for how a model will behave in real, ambiguous, single-decision contexts. In the audit's reading, treating the two test types as interchangeable is what has driven much of the field's recent disagreement.
What to watch next is whether other groups reproduce the gap. The framework is the authors' own, the headline language ("massive, systematic paradigm gap") is their framing, and the paper is an arXiv preprint rather than a peer-reviewed publication. If independent teams replicate the comparative-versus-isolated split on their own model suites and their own benchmarks, the operational rule starts to look durable. If they do not, the finding will need to be treated as a specific artifact of the framework's design rather than a general property of bias measurement.