An AI flags abuse years before disclosure. The deployment standard does not exist yet.


Most patients who experience intimate partner violence (IPV) are never asked about it, and most of those who are asked say no. The clinical record that accumulates regardless (a prior fracture, an anxiety prescription, a pain clinic referral, a return visit a year later) often holds the only signal available. A team led by Dr. Bharti Khurana, an emergency radiologist at Brigham and Women's Hospital and founding director of the Trauma Imaging Research and Innovation Center at Harvard Medical School, has built a model that reads that record and flags patients years before they can say the words themselves, per a March 2026 paper in npj Women's Health.

AIRS, the Automated IPV Risk Support tool, was trained on 841 female patients enrolled in a Mass General Brigham domestic abuse intervention program between 2017 and 2022, plus 5,212 matched controls. It combines structured electronic medical record data (diagnoses, medications, radiology timing, emergency visit frequency, vitals, a social deprivation index tied to zip code) with unstructured clinical notes processed by Clinical-Longformer, a long-context language model tuned for medical text. The two streams are kept independent and fused at the prediction stage, a design choice that lets the model handle missing-data variation across institutions, according to the paper.
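The fusion design described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `late_fusion`, the `notes_weight` parameter, and the weighted-average rule are all assumptions chosen for illustration. The point is only that when each stream produces its own score, a missing stream can be skipped without breaking the prediction.

```python
from typing import Optional


def late_fusion(structured_prob: Optional[float],
                notes_prob: Optional[float],
                notes_weight: float = 0.5) -> Optional[float]:
    """Combine two independently produced risk probabilities.

    Hypothetical illustration: each argument is the output of a separate
    model (structured record data vs. clinical notes). None means that
    stream was unavailable for this patient. The weighted average is an
    assumed fusion rule, not the one used by AIRS.
    """
    if structured_prob is None and notes_prob is None:
        return None  # nothing to score
    if structured_prob is None:
        return notes_prob  # fall back to the notes stream alone
    if notes_prob is None:
        return structured_prob  # fall back to the structured stream alone
    return (1.0 - notes_weight) * structured_prob + notes_weight * notes_prob


# Both streams present, then one stream missing (toy values, not real data).
print(late_fusion(0.2, 0.8))
print(late_fusion(0.2, None))
```

Under this kind of design, a hospital that lacks one data source could in principle still produce a score from the other, which is one way to read the paper's claim about missing-data variation.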

In the primary test cohort, AIRS reached an AUC of 0.88, and it held above 0.80 across three validation cohorts at two hospitals in the Mass General Brigham network. The fusion version flagged 80.6 percent of IPV cases before the patient self-reported, with an average lead time of 3.68 years; some patients were flagged five years or more before disclosure, per the same paper. The two headline figures measure different things. The AUC describes the model's ability to rank cases above non-cases. The 80.6 percent is the share of true cases in the validation cohort that the model flagged before the patient disclosed. It is not a generic accuracy figure.
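The difference between the two kinds of figure can be made concrete with a small sketch. The numbers below are invented toy data, not values from the paper, and the helper names `auc` and `early_flag_summary` are hypothetical. Averaging lead time only over cases flagged before disclosure is also an assumption about how such a figure might be computed.

```python
def auc(scores_pos, scores_neg):
    """Share of (case, non-case) pairs where the case gets the higher score.

    Ties count as half a win. This is the pairwise-ranking definition of AUC.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def early_flag_summary(cases):
    """Summarize early detection among true cases only.

    cases: list of (first_flag_year or None, disclosure_year).
    Returns (share flagged before disclosure, mean lead time in years
    among those flagged early).
    """
    leads = [d - f for f, d in cases if f is not None and f < d]
    share = len(leads) / len(cases)
    mean_lead = sum(leads) / len(leads) if leads else 0.0
    return share, mean_lead


# Toy risk scores: three true cases, four non-cases.
print(auc([0.9, 0.7, 0.4], [0.8, 0.3, 0.2, 0.1]))  # 10 of 12 pairs ranked correctly

# Toy timelines: two flagged early, one never flagged, one flagged after disclosure.
print(early_flag_summary([(2015, 2019), (2018, 2020), (None, 2021), (2022, 2021)]))
```

The first function uses scores from cases and non-cases alike. The second looks only at true cases and at timing, which is why the two figures cannot be read as interchangeable measures of accuracy.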

AIRS is not a diagnosis. It does not speak to the patient, file a report, or trigger an automatic intervention. It surfaces a risk score to the clinician only, intended to prompt a trauma-informed "Caring Conversations" discussion with a trained provider. The tool is in active IRB-approved pilot testing at several Mass General Brigham clinical settings, not in routine clinical use, per the institutional release.

AIRS performs well on the metrics the team reported. The harder question is what the clinician does with the flag, and what is logged when they do it. Khurana's team has named survivor agency as the north-star criterion, and is engaging with an IEEE Industry Connections Activity on user-centered principles for AI in family-violence evaluation as a parallel governance track.

ZDNET's coverage of the paper frames AIRS through a more familiar question, asking whether a tool that can do this should. That framing treats safety as a brake on the technology. A more useful frame treats safety as the design specification, and asks whether the deployment infrastructure in the United States is currently built to honor it.

The cautionary precedent is not theoretical. Spain's VioGén system, an algorithmic risk-assessment tool deployed by the Interior Ministry in 2007 and used by police and some judges, has been associated with at least 247 women killed by partners after assessment, according to a 2024 New York Times investigation referenced in ZDNET's reporting. A review of 98 of those homicides found that 55 had been scored as negligible or low risk, and reporting in multiple outlets documented that officers frequently deferred to the algorithm's score rather than overriding it on the basis of context. VioGén is what high metric accuracy looks like when the downstream infrastructure is not built to act on the score, or to override it.

The structural problem AIRS would face in deployment is similar, and the design choices that distinguish it from VioGén are specific. The model is silent to the patient, so disclosure remains the patient's to make. The flag goes to a clinician, not to a case file or a mandated reporter. The intervention, when warranted, is a trained conversation, not a referral triggered by a threshold. None of those choices is technical. They are governance choices, and ones on which clinical AI builders, hospital counsel, and state policymakers are likely to disagree.

The other contested boundary is the one Alexia Maddox, a senior lecturer at La Trobe University and co-chair of an IEEE Industry Connections Activity on user-centered principles for AI in family-violence evaluation, drew in her commentary on the paper, as reported by ZDNET. AIRS, as currently built, detects patterns of physical injury, repeat emergency presentations, and associated clinical notes. It is largely blind to coercive control, financial abuse, and technology-facilitated abuse, categories of intimate partner violence that frequently leave no radiology trace and may not generate emergency visits at all. The model's training cohort was female patients, and the team has flagged expansion to transgender, non-binary, and male patients as future work. Maddox's critique is that the tool's narrow definition of IPV will produce false negatives for patients whose abuse does not look like the cases the model was trained on, and may produce false positives for patients whose clinical history matches the pattern for reasons unrelated to violence.

The consent question is the one Khurana's team has tried to answer. The records used to train AIRS were drawn from a patient population enrolled in a domestic abuse program and covered by the relevant institutional review. Future patients scored by a deployed version would not have a comparable enrollment moment, and would not necessarily be notified that a model had flagged their record. That asymmetry, between the data used to build the system and the data the system will be run on, mirrors the asymmetry that has driven most of the high-profile clinical-AI controversies of the last several years, and is the question the IEEE governance activity is explicitly trying to standardize an answer to.

The U.S. public-health backdrop explains the urgency. The CDC's National Intimate Partner and Sexual Violence Survey reports that more than one in three U.S. women experience rape, physical violence, or stalking by an intimate partner in their lifetime, and the agency has documented that current self-report screening tools capture only a fraction of affected patients. Mass General Brigham's institutional release cites lifetime figures of one in four women and one in seven men. The U.S. Preventive Services Task Force recommends routine IPV screening for women of childbearing age, a recommendation that has been hard to operationalize in time-pressured clinical settings. AIRS is being developed in a domain where the gap between prevalence and detection is large, the existing tools underperform, and the harm of missed cases is severe. The question is whether the deployment infrastructure can be built to match the model's accuracy with a comparable standard for consent, override, and survivor control.

What to watch next: the IRB-approved Mass General Brigham pilot's first published results on clinician behavior, false-positive rates, and patient outcomes; the IEEE Industry Connections Activity's first published standard; the team's separately NIH-funded work to extend AIRS to transgender, non-binary, and male patients; and the first state or federal action that defines what a hospital is required to do, or prohibited from doing, when a model surfaces a risk score on a patient who has not asked to be screened.

