Wearable motion data predicts two heart-risk blood markers, but not for everyone
On a CDC survey of 1,381 adults, the model predicts inflammation and blood-sugar markers from hip motion, but its 90% safety net is too thin for Mexican American men.
A 90% prediction interval is the kind of safety net a wearable-derived risk score would be marketed on. New benchmark results show that safety net holds in aggregate, then quietly misses its 90% target for some of the patients a screening program is most often built to catch.
The benchmark behind the claim sits on the U.S. Centers for Disease Control and Prevention's National Health and Nutrition Examination Survey (NHANES), a public dataset that already tracks what Americans eat and drink and how their blood tests come back. Researchers pulled a week of hip-worn motion-sensor data for each of 1,381 NHANES participants surveyed between 2003 and 2006 and asked a machine-learning model to predict two of the most common warning signs for heart disease and diabetes: C-reactive protein (CRP), an inflammation marker, and HbA1c, a three-month blood-sugar average. A third target, fasting triglycerides, was included as a stress test.
The model's predictions come with a conformal prediction interval: a statistical envelope around each forecast that is supposed to contain the true value 90% of the time, no matter how the underlying data is distributed. Across the full sample, the envelope holds for both CRP and HbA1c, so the 90% promise is met in aggregate. What the interval does not do is explain why a patient's number is high. A well-calibrated conformal envelope is not a causal claim. It does not say the model knows what to do about the risk, only that its forecast is honest about its own uncertainty.
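For readers who want to see where a 90% envelope comes from, here is a minimal sketch of split conformal prediction, one common way to build such an interval. It is an illustration under stated assumptions, not the benchmark's code: the function names and the toy numbers are invented for this example, and the benchmark may use a different conformal variant.

```python
import math

def conformal_halfwidth(calib_true, calib_pred, alpha=0.10):
    """Half-width of a split-conformal interval from held-out residuals.

    Hypothetical helper for illustration; alpha=0.10 targets 90% coverage.
    """
    # Nonconformity score: absolute error on a held-out calibration set.
    scores = sorted(abs(y - p) for y, p in zip(calib_true, calib_pred))
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to promise the target coverage.
        return math.inf
    return scores[rank - 1]

def prediction_interval(pred, halfwidth):
    """Symmetric envelope around a single point forecast."""
    return (pred - halfwidth, pred + halfwidth)

# Toy calibration set (invented numbers, not NHANES data).
calib_true = [float(i) for i in range(1, 20)]
calib_pred = [0.0] * 19
q = conformal_halfwidth(calib_true, calib_pred)
low, high = prediction_interval(5.5, q)
```

The half-width is the same for every person in this simple version, which is one reason an envelope can hit 90% overall while running too narrow for a subgroup whose errors are larger than average.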
That aggregate number is where most clinical-machine-learning stories end. This one should not.
Look at the same intervals broken out by race, sex, and age, and the envelope cracks. The benchmark authors, working under an ICML 2026 workshop track on machine learning for health called SD4H, found that conditional coverage, the share of times the interval actually catches the right value for a specific subgroup, falls short for some populations. Mexican American men in the dataset, in particular, see their HbA1c intervals miss the 90% mark, even when the overall sample hits it. For a tool pitched as a screening aid, that gap is the difference between a working risk flag and a quiet miss in the patients most likely to be flagged.
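The subgroup check described above is simple to express. The sketch below is a hypothetical illustration, assuming each person has a measured value, an interval, and a group label; the function names, the group labels, and the numbers are invented and are not taken from the benchmark.

```python
from collections import defaultdict

def marginal_coverage(y_true, lower, upper):
    """Share of all intervals that contain the measured value."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)

def coverage_by_group(y_true, lower, upper, groups):
    """Same share, computed separately inside each subgroup."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for y, lo, hi, g in zip(y_true, lower, upper, groups):
        counts[g] += 1
        hits[g] += int(lo <= y <= hi)
    return {g: hits[g] / counts[g] for g in counts}

# Toy values (invented, not NHANES data): group "A" is fully covered,
# group "B" is missed entirely, and the overall figure hides the split.
y_true = [5.4, 5.6, 6.1, 7.2]
lower = [5.0, 5.0, 5.0, 5.0]
upper = [6.0, 6.0, 6.0, 6.0]
groups = ["A", "A", "B", "B"]

overall = marginal_coverage(y_true, lower, upper)
per_group = coverage_by_group(y_true, lower, upper, groups)
```

In practice a per-group figure from a small subgroup is noisy, so a shortfall is more convincing when it persists across resampled splits.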
The benchmark also makes an honest admission about what motion data cannot do. Try to predict fasting triglycerides, the blood fat linked to cardiovascular risk, and every model in the benchmark collapses to near-zero explanatory power, with an R² below 0.05. The authors treat this as the data telling them triglyceride variation in this cohort is dominated by genetics rather than behavior. The open-source release frames it as a feature: a wearable-derived risk score should know when to abstain rather than guess.
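For scale, R² is one minus the ratio of the model's squared error to the squared error of always guessing the average. A model that does no better than the average scores zero. The sketch below uses the standard definition with invented numbers; the function name is ours, and this is not the benchmark's evaluation code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true has no variation")
    return 1 - ss_res / ss_tot

# Toy values (invented): guessing the mean for everyone explains nothing.
y = [1.0, 2.0, 3.0, 4.0]
always_mean = [2.5] * 4
baseline_r2 = r_squared(y, always_mean)
perfect_r2 = r_squared(y, y)
```

Read against that baseline, an R² below 0.05 means the motion features add almost nothing beyond quoting the cohort average.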
The model that does best is the one with the smallest public track record in clinical care. TabPFN v2, a tabular foundation model from the Berlin-based startup Prior Labs, leads on both HbA1c (R² = 0.156) and CRP (R² = 0.383), beating a standard ridge regression baseline and a popular gradient-boosted tree model called XGBoost. TabPFN v2's Hugging Face model card and the underlying methodology, published in Nature in 2025 by the Prior Labs team, position it as a small-data generalist that does not need retraining for every new tabular task. For a benchmark with 1,381 rows, that profile matters.
Two cautions sit on the table. First, the paper is a preprint, not yet peer reviewed, and its venue is an ICML 2026 workshop, not a clinical journal. Second, an R² of 0.156 on HbA1c is statistically real but practically modest: a useful signal in aggregate, not a personalized risk score a doctor should act on without a real blood test.
That modesty is exactly why the coverage gap matters more than the headline accuracy. A model that meets its 90% interval on average but misses it for the patients a screening program is most often built to catch is, in plain terms, a tool that works in the abstract and breaks on real people. The benchmark's constructive framing, that conditional coverage is the next research agenda item rather than a footnote, is the right one for the field. A benchmark that names the gap on public, reproducible data is a more useful starting point than another near-perfect accuracy claim.
The watch item for any reader whose future risk score might come from a wearable is whether this benchmark becomes the floor or the ceiling. If TabPFN v2 and its peers reach a clinic wearing only a marginal-coverage badge, one that certifies the 90% target on average across all patients, the same coverage gap that lives in this paper will travel into a doctor's office. If the deployment bar is "meet the 90% interval inside every subgroup that matters," the path is slower and the screening tools that arrive will be quieter and more honest about who they miss.
The full HTML preprint and the companion code release are public. Anyone with a 2003–2006 NHANES pull can rerun it. That part of this paper is the part the wearable-AI story usually skips. The failure modes are visible, the data is open, and the next round of fixes is something a critic, a clinician, or a regulator can run themselves.