The Hidden Instability Inside AI Safety Labels
Safety datasets used to train AI models to avoid harm contain a hidden instability: the labels depend on who does the annotating. A new paper from Apple researchers shows what that looks like in practice, and the numbers are not small.
The work, posted to arXiv on May 6, 2026 and accepted at FAccT 2026 in Montreal, introduces Annotator Policy Models (APMs): interpretable models that reverse-engineer each annotator's internal safety rulebook from their labeling behavior. Instead of asking annotators to explain their decisions, which is costly and unreliable, the researchers train a lightweight model to predict what an annotator will flag based on which safety-relevant concepts appear in a given text. APMs achieve over 80 percent accuracy in modeling individual annotator behavior across multiple LLM families, including GPT-4o and the Qwen3 series, and generalize across datasets without significant performance loss.
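To make the input representation concrete, here is a minimal sketch of the concept-presence featurization. The concept list and keyword matcher are invented for illustration; the paper extracts concepts with an LLM rather than by keyword lookup.

```python
# Hypothetical featurizer: map a text to a binary vector over a tiny
# concept vocabulary. Keyword matching stands in for the paper's
# LLM-based concept extraction.
CONCEPTS = ["mentions of weapons", "drug references", "explicit language"]
KEYWORDS = {
    "mentions of weapons": ["gun", "knife", "rifle"],
    "drug references": ["heroin", "fentanyl", "meth"],
    "explicit language": ["damn", "hell"],
}

def concept_vector(text: str) -> list[int]:
    """Return 1 per concept that appears in the text, else 0."""
    lower = text.lower()
    return [int(any(kw in lower for kw in KEYWORDS[c])) for c in CONCEPTS]

print(concept_vector("Where can I buy a rifle?"))  # -> [1, 0, 0]
```

An APM is then just an interpretable classifier over vectors like these, one per annotator.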
The researchers measured annotation disagreement on BeaverTails, a widely used safety dataset created by PKU-Alignment at Peking University containing 333,963 question-answer pairs across 14 harm categories. Across five different LLM annotators, every pair disagreed on 10 to 20 percent of samples, a structural feature of how safety annotation works rather than a noise floor. To be concrete: extrapolating the measured LLM disagreement rates to human annotators, relabeling BeaverTails with a different pool of human annotators could change between 3,000 and 6,000 labels, roughly 1 to 2 percent of the entire dataset. For a resource that shapes how companies train and evaluate safety-critical models, that is not a rounding error.
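The percentages are easy to verify against the dataset size; the flip counts themselves are the extrapolated estimate, not a measured result.

```python
# Sanity check on the extrapolation. 3,000-6,000 is the estimated number
# of labels that could flip under a different annotator pool.
total_pairs = 333_963            # question-answer pairs in BeaverTails
low, high = 3_000, 6_000
print(f"{low / total_pairs:.1%} to {high / total_pairs:.1%}")  # 0.9% to 1.8%
```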
Annotator disagreement has three distinct sources, and each calls for a different response. Operational failures occur when annotators misunderstand the task. Policy ambiguity occurs when the safety guidelines are genuinely unclear. Value pluralism occurs when annotators have different underlying values and there is no single right answer. Current practice resolves disagreement through majority vote, which conflates all three. APMs separate them.
To build the models, the researchers first constructed a concept space from BeaverTails data by prompting GPT-4o to list every phrase in each text relevant to safety, then clustering and deduplicating. After pruning, they arrived at 483 safety-relevant concepts: mentions of weapons, drug references, explicit language, and hundreds of others. For each annotator, they trained one of two simple interpretable models over this concept space. Non-Negative Logistic Regression (NNLR) restricts all weights to be non-negative, so an unsafe label can only be triggered by the presence of safety-relevant concepts, never by their absence. Disjunctive Normal Form (DNF) models learn Boolean rules in OR-of-ANDs form. Both produce human-readable policy statements: rules that say what makes text unsafe.
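The non-negativity constraint is straightforward to implement. The sketch below fits an NNLR-style model on synthetic data using scipy's bounded L-BFGS-B solver; the data, penalty strength, and solver choice are assumptions for illustration, since the article does not spell out the paper's training setup.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic stand-in for the concept matrix: X[i, j] = 1 if concept j
# appears in text i, y[i] = 1 if this annotator labeled text i unsafe.
n_texts, n_concepts = 2_000, 483
X = (rng.random((n_texts, n_concepts)) < 0.02).astype(float)
true_w = np.zeros(n_concepts)
true_w[:5] = 4.0                 # a handful of concepts drive "unsafe"
y = (rng.random(n_texts) < 1 / (1 + np.exp(-(X @ true_w - 2.0)))).astype(float)

def loss_and_grad(params, X, y, l1=0.01):
    """Logistic loss plus an L1 penalty that keeps the policy sparse."""
    w, b = params[:-1], params[-1]
    z = X @ w + b
    loss = np.mean(np.logaddexp(0.0, z) - y * z) + l1 * w.sum()
    g = (1 / (1 + np.exp(-z)) - y) / len(y)      # d(loss)/dz
    return loss, np.concatenate([X.T @ g + l1, [g.sum()]])

# Bounds force every concept weight to be non-negative, so "unsafe" can
# only be triggered by the presence of a concept; the bias stays free.
x0 = np.zeros(n_concepts + 1)
bounds = [(0.0, None)] * n_concepts + [(None, None)]
res = minimize(loss_and_grad, x0, args=(X, y), jac=True,
               method="L-BFGS-B", bounds=bounds)
w = res.x[:-1]
top = np.argsort(w)[::-1][:5]
print("highest-weight concepts:", top, np.round(w[top], 2))
```

scikit-learn's logistic regression does not expose a sign constraint on its coefficients, which is why the sketch drops down to scipy; a DNF learner would instead search for OR-of-AND rules over the same binary features.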
The researchers also applied APMs to the DICES dataset, which includes demographic attributes for annotators, and found that APMs can identify systematic differences in how different demographic groups prioritize safety concerns. This is the politically sensitive application. Safety annotation is often treated as a technical filtering problem. The researchers are suggesting it is also a values problem: different annotators bring different normative commitments to the task, and those commitments shape what they see as safe or unsafe. APMs make those commitments legible in a way that majority vote obscures.
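A minimal version of that group-level comparison, with invented weights and groups, might look like this:

```python
import numpy as np

# Invented numbers for illustration: each row holds one annotator's
# learned APM weights over three concepts, grouped by a demographic
# attribute from a DICES-like dataset.
concepts = ["profanity", "medical advice", "political speech"]
group_a = np.array([[2.1, 0.3, 0.2],
                    [1.8, 0.4, 0.1]])
group_b = np.array([[0.6, 0.2, 0.3],
                    [0.4, 0.3, 0.5]])

# Rank concepts by the gap between group-mean weights: large gaps mark
# where the groups' safety policies systematically diverge.
gap = group_a.mean(axis=0) - group_b.mean(axis=0)
for i in np.argsort(-np.abs(gap)):
    print(f"{concepts[i]:16s} mean-weight gap: {gap[i]:+.2f}")
```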
The authors — Alex Oesterling, Donghao Ren, Yannick Assogba, Dominik Moritz, Sunnie S.Y. Kim, Leon Gatys, and Fred Hohman — are all affiliated with Apple ML research. Apple has been building its AI safety and interpretability research profile quietly but consistently; this work sits at the intersection of interpretability tools and data quality for safety-critical systems.
The paper does not claim APMs solve annotation disagreement. They do not. What they do is make the disagreement legible in a way that lets teams target their responses. If an APM shows an annotator consistently treating weapon-related content as safe when it should be unsafe, that is an operational failure requiring retraining. If annotators systematically disagree on whether quoting offensive language for educational purposes violates policy, that is policy ambiguity requiring guideline revision. And if demographic groups have genuinely different safety priorities, and the task is to reflect those differences rather than resolve them, APMs can surface that too.
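One way to operationalize that triage, as a hypothetical heuristic layered on top of per-annotator APM weights (this is not from the paper):

```python
import numpy as np

def triage(weights, tol=0.5):
    """Guess the source of disagreement from each annotator's APM weight
    on a single concept. Thresholds and logic are illustrative."""
    w = np.asarray(weights, dtype=float)
    if w.max() - w.min() < tol:
        return "consensus"
    outliers = np.abs(w - np.median(w)) > tol
    if outliers.sum() == 1:
        # One annotator far from a tight cluster: likely operational
        # failure, so retrain that annotator.
        return f"operational failure: annotator {int(np.argmax(outliers))}"
    # A broad or bimodal spread points to policy ambiguity or value
    # pluralism: revise the guidelines, or preserve the split on purpose.
    return "policy ambiguity or value pluralism"

print(triage([2.1, 2.3, 2.2, 0.0, 2.4]))  # one outlier -> retraining
print(triage([2.2, 0.1, 2.0, 0.3, 1.9]))  # split pool  -> guideline review
```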
The practical implication for AI companies is direct: a model's safety posture depends partly on which annotators happened to label its training data. That is a known problem in the field. This paper puts a number on it.