A preprint exposes a 140-turn blind spot in AI companion safety tests

Short AI safety tests are structurally blind to the risks that matter most about companion chatbots for children and teenagers. A preprint on AI companion safety makes that case quantitatively: across six large language models and roughly thirteen thousand simulated person-days of interaction with synthetic young users, a stable estimate of cognitive and emotional risk does not appear until the 140th turn of a sustained relationship.

That number is the paper's central contribution. It reframes the question from 'is this companion chatbot safe?' to 'is this companion chatbot safe after dozens or hundreds of conversations with the same user?' Most current safety evaluations are single-turn or short-session. None of them can see what accumulates.

The authors propose a framework they call TSJ, or Theater-Stage-Judge, a three-module system: simulated users that act out persona-driven behavior, dynamic updates to the user's psychological state, and a retrospective judgment pass on the full interaction history. They used it to test six mainstream LLMs across four developmental stages, twenty-four risk dimensions, and three psychological-vulnerability personas. The full scale: 12,960 simulated person-days of interaction. The goal was not to rank models, but to find out how many turns of interaction an evaluation needs before its estimate of a companion's risk stabilizes.
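The three-module loop described above can be sketched in a few lines. This is a toy illustration, not the preprint's implementation: every name here (`Persona`, `UserState`, `simulate_user_turn`, `update_state`, `judge_history`, `run_relationship`) is hypothetical, the "only friend" phrase check is a stand-in for a real risk judgment, and the single `dependency` scalar is an assumed simplification of the user's psychological state.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (simulated user message, companion reply)


@dataclass(frozen=True)
class Persona:
    stage: str          # developmental stage label (illustrative)
    vulnerability: str  # vulnerability profile label (illustrative)


@dataclass(frozen=True)
class UserState:
    dependency: float = 0.0  # toy scalar in [0, 1], assumed for this sketch


def simulate_user_turn(persona: Persona, state: UserState, t: int) -> str:
    """Module 1 (toy): produce a persona-driven user message."""
    return f"turn {t}: {persona.stage} user, dependency={state.dependency:.2f}"


def update_state(state: UserState, reply: str) -> UserState:
    """Module 2 (toy): nudge the state when a reply encourages reliance."""
    step = 0.05 if "only friend" in reply.lower() else 0.0
    return UserState(dependency=min(1.0, state.dependency + step))


def judge_history(history: List[Turn]) -> float:
    """Module 3 (toy): score the full transcript after the fact."""
    if not history:
        return 0.0
    flagged = sum("only friend" in reply.lower() for _, reply in history)
    return flagged / len(history)


def run_relationship(
    companion: Callable[[str], str], persona: Persona, turns: int
) -> Tuple[UserState, float]:
    """Run one simulated relationship and judge it retrospectively."""
    state = UserState()
    history: List[Turn] = []
    for t in range(1, turns + 1):
        message = simulate_user_turn(persona, state, t)
        reply = companion(message)
        history.append((message, reply))
        state = update_state(state, reply)  # state carries over between turns
    return state, judge_history(history)
```

The point of the structure is that the user state persists from turn to turn and the judgment runs over the whole history, so effects that build slowly have somewhere to accumulate.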

The answer, 140 turns, has a sharp practical implication. A product that passes a short red-team probe, or a standard alignment eval, has not been measured for the harm it could cause a child who treats the chatbot as a confidant for months. The risk that matters most for cognitive development and emotional attachment is exactly the risk these tests cannot see.
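One way to make "the turn at which a risk estimate stabilizes" concrete is sketched below. The preprint's actual stabilization criterion is not described here, so this is an assumed definition: the first turn from which the running mean of per-turn risk scores stays within a tolerance of its final value. The function name `stabilization_turn` and the `tol` parameter are hypothetical.

```python
from typing import Optional, Sequence


def stabilization_turn(scores: Sequence[float], tol: float = 0.02) -> Optional[int]:
    """Return the first 1-indexed turn from which the running mean of
    `scores` stays within `tol` of its final value (assumed criterion)."""
    if not scores:
        return None

    # Running mean of the per-turn risk scores.
    running = []
    total = 0.0
    for i, score in enumerate(scores, start=1):
        total += score
        running.append(total / i)

    final = running[-1]
    turn = len(running)
    # Walk backwards from the last turn while the estimate is still in tolerance.
    for i in range(len(running) - 1, -1, -1):
        if abs(running[i] - final) > tol:
            break
        turn = i + 1
    return turn
```

Under this definition, an evaluation that stops before the returned turn reports a running estimate that has not yet settled. The criterion is also retrospective: it can only be computed from a run long enough to contain the stable tail, which is why a short probe cannot tell whether it stopped too early.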

The paper identifies the most exposed developmental stages as early childhood and emerging adulthood. The weakest risk domains are cognitive trust (the willingness to treat the model as a reliable epistemic authority) and emotional dependency (preference for the chatbot's company over human contact). Both of these are slow-building. Neither would surface in a single session.

For the industry, the structural conclusion is uncomfortable. Existing safety certifications, including the kind used for school deployments and parental-controls gates, are written around a short-horizon mental model. The preprint does not claim those certifications are useless. It claims they are measuring the wrong thing, and that the gap is not closeable with longer prompts or harder jailbreak attempts. It is closeable only with evaluation frameworks that simulate a relationship, not a session.

The caveats are real and the authors are clear about them. The work is a preprint, not peer-reviewed. The evaluation is simulated, not field-tested, and synthetic young users do not behave exactly like real children and adolescents. The six models are not named in the available abstract, which is a reason to check the full PDF before any specific vendor is named in coverage. Replication, real-world validation, and broader demographic coverage all remain open.

What the paper does offer, even with those limits, is a falsifiable claim the rest of the field can act on. If 140 turns is roughly right, then any safety evaluation shorter than that is an underpowered measurement: it stops before the risk estimate has settled. If it is wrong, the AI companion safety community now has a concrete number to test and refute. Either way, the temporal resolution of companion safety testing is no longer a methodological footnote. It is the test.

