Five leading AI models systematically preferred female candidates when scoring Japanese-format résumés in a controlled hiring simulation, and a deliberately worded gender-neutrality instruction failed to remove the gap. The bias, identified by a new arXiv preprint (June 2026), ran in the opposite direction of the more familiar Western finding of discrimination against women, and it survived an attempt to neutralize it through prompting alone.
The result, which the authors describe as a clean negative for prompt-level mitigation, matters less as a moral headline than as a finding about mechanism. A formal name-reliance analysis in the same paper isolates the candidate's name as the primary channel through which gender reaches the model's decision. Strip the name from the prompt, and the bias retreats to a much smaller residual. Leave it in, and the model's reading of the résumé is gendered before it weighs a single qualification.
For builders, the takeaway is structural, not rhetorical. Telling a model to be fair is not the same as removing the input that makes the model act unfairly. The paper's three conditions — baseline, prompt instruction, and a privacy filter that strips names — produced that lesson directly. Prompt instructions did not move the result. Name removal did.
The study tested five state-of-the-art models: Claude Sonnet 4.6, GPT-4o, DeepSeek-V3, Gemini 2.5 Flash, and Llama 3.3 70B. Researchers built 60 résumés in the rirekisho format used in Japanese hiring, then varied them across 12 name pairs selected on linguistically grounded gender-signal criteria. Across baseline, prompt-instruction, and privacy-filter conditions, the team issued 43,200 API calls, then fitted a crossed random-effects linear mixed model to the resulting scores. The pro-female effect was significant in every model tested.
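For readers unfamiliar with the term, a crossed random-effects linear mixed model for this kind of design can be written generically as below. This is a textbook form offered as an illustration, not the paper's stated specification: the random intercepts for résumé and name pair, the single gender indicator, and all symbols are assumptions, and the authors' model may include further terms such as fixed effects for model or condition.

```latex
% Generic sketch (assumed form, not taken from the paper):
% y_{ijk}  : score for résumé i, name pair j, API call k
% female   : 1 if the call used the female-coded name, else 0
% u_i, v_j : crossed random intercepts for résumé and name pair
y_{ijk} = \beta_0 + \beta_1\,\mathrm{female}_{ijk} + u_i + v_j + \varepsilon_{ijk},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_j \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

Under this reading, a positive and significant $\beta_1$ corresponds to the pro-female effect the article describes, while the random intercepts absorb the fact that some résumés and some name pairs score higher overall.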
The counterintuitive direction is the story's most attention-grabbing feature, but the paper's real contribution is the causal lever it exposes. Most prior literature on AI hiring bias has documented pro-male patterns in English-language, Western-format résumés. By reproducing a clear bias in a non-Western corporate format and then formally testing the name channel, this study narrows the field's options. If the name is doing the work, then the leverage sits in name-level interventions: anonymization, post-hoc name masking, and counterfactual auditing pipelines that score the same résumé under swapped names. Prompt engineering, by contrast, is a lever that has now been tested and found insufficient under this design.
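A counterfactual audit of the kind mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pipeline: the `{NAME}` slot, the `counterfactual_gap` helper, the example names, and both stub scorers are hypothetical, and a real audit would replace the stubs with calls to the model under test.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

NAME_SLOT = "{NAME}"  # hypothetical placeholder marking where the name goes


def counterfactual_gap(
    templates: Sequence[str],
    name_pairs: Sequence[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Mean of (female-name score minus male-name score) over every
    template and name pair. Positive values indicate a pro-female gap."""
    diffs = []
    for template in templates:
        for female_name, male_name in name_pairs:
            female_score = score(template.replace(NAME_SLOT, female_name))
            male_score = score(template.replace(NAME_SLOT, male_name))
            diffs.append(female_score - male_score)
    return mean(diffs)


# Toy stand-ins so the sketch runs without any API access (assumptions).
FEMALE_NAMES = {"Sakura", "Yuko"}


def biased_stub(resume_text: str) -> float:
    """Pretend scorer that adds half a point for a female-coded name."""
    bonus = 0.5 if any(name in resume_text for name in FEMALE_NAMES) else 0.0
    return 3.0 + bonus


def name_blind_stub(resume_text: str) -> float:
    """Pretend scorer that ignores the name entirely."""
    return 3.0


templates = [
    "Name: {NAME}\nExperience: 5 years in sales",
    "Name: {NAME}\nExperience: 2 years in accounting",
]
pairs = [("Sakura", "Takeshi"), ("Yuko", "Hiroshi")]

biased_gap = counterfactual_gap(templates, pairs, biased_stub)
blind_gap = counterfactual_gap(templates, pairs, name_blind_stub)
print(biased_gap, blind_gap)
```

Because everything except the name is held fixed within each comparison, any nonzero gap is attributable to the name. That is the same logic that lets a name-reliance analysis isolate the name as the channel.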
The deployment lesson is concrete. In the privacy-filter condition, which the authors describe as a name-anonymization layer, GPT-4o refused to score 42% of the résumés it was given. The anonymization layer collided with the model's content-safety filter, turning a privacy tool into a reliability problem. A practitioner shipping a name-stripping pipeline cannot assume the model beneath it will quietly continue scoring candidates; the model may simply say no. The paper presents this as a deployment trap, not a model failure, but the practical effect is the same: a mitigation that does not score is a mitigation that does not hire.
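One way to guard against that trap is to treat refusals as a first-class outcome and measure them alongside scores. The sketch below shows the idea under stated assumptions: the `[CANDIDATE]` mask token, the `Score: N` reply format, the helper names, and the stub model with its invented refusal trigger are all hypothetical, and plain string replacement is a simplification that would miss a name written in more than one form.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

PLACEHOLDER = "[CANDIDATE]"  # hypothetical mask token


def mask_name(resume_text: str, candidate_name: str) -> str:
    """Replace every literal occurrence of the name with the mask token."""
    return resume_text.replace(candidate_name, PLACEHOLDER)


def parse_score(reply: str) -> Optional[float]:
    """Return the number after 'Score:' or None if the reply has no score.
    A reply without a score is counted as a refusal (an assumption)."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    return float(match.group(1)) if match else None


def score_masked_batch(
    resumes: Sequence[Tuple[str, str]],
    call_model: Callable[[str], str],
) -> Tuple[List[float], float]:
    """Score (text, name) pairs after masking; return scores and refusal rate."""
    if not resumes:
        raise ValueError("resumes must not be empty")
    scores: List[float] = []
    refused = 0
    for text, name in resumes:
        reply = call_model(mask_name(text, name))
        value = parse_score(reply)
        if value is None:
            refused += 1
        else:
            scores.append(value)
    return scores, refused / len(resumes)


def stub_model(prompt: str) -> str:
    """Toy model with an invented trigger: refuses masked text with a birth date."""
    if PLACEHOLDER in prompt and "Born:" in prompt:
        return "I can't help with evaluating this document."
    return "Score: 4"


batch = [
    ("Name: Sakura\nBorn: 1995\nExperience: sales", "Sakura"),
    ("Name: Takeshi\nExperience: accounting", "Takeshi"),
]
scores, refusal_rate = score_masked_batch(batch, stub_model)
print(scores, refusal_rate)
```

Reporting the refusal rate next to the scores makes a failure like the one described above visible before deployment, where it would otherwise show up as silently missing candidates.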
The findings come with the usual preprint caveats. The paper has not yet been peer-reviewed, and the sample — 60 résumés, 12 name pairs, and a single national format — is narrow. The scope claim is Japan-specific by design, and the authors do not argue that pro-female bias is universal or that the name-channel finding transfers cleanly to every other résumé convention. A reader who treats this as a global verdict is reading past the paper.
The watch items are simple. The first is replication: does a similar name-reliance analysis on a different format, language, or candidate pool recover the same mechanism, or does the bias reattach itself to a different cue when the name is removed? The second is whether the field treats the prompt-instruction negative as the end of the line for prompt-level mitigations, or as a reason to design better prompts. The third is whether the model providers respond to the privacy-filter collision, since refusing to score 42% of résumés is a product defect as well as a research finding.