When a behavioral inference model gets audited, the question is rarely whether it was accurate. The question is whether anyone can trace a specific prediction back to a specific input signal and defend it in front of a regulator, a platform counterparty, or an internal review board. That gap between raw performance and defensible decisions is the problem a new framework called SemantiClean tries to close, and its approach is unusual: it trades a measurable amount of predictive accuracy for a fully traceable element-level architecture, with the entire behavioral vocabulary stored in a single JSON file as the source of truth.
The paper, "From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference" by Hung Ming Liu, is a 20-page preprint posted to arXiv in late April. It introduces SemantiClean as a modular, JSON-driven library for extracting structured semantic signals from e-commerce session data and routing them into pluggable inference targets: purchase intent, customer segment, and product affinity. A fourth target, gender, sits in the schema but is non-functional and excluded from all reported numbers, a limitation the paper discloses directly rather than burying.
The structural anchor is a fixed library of 24 behavioral elements, organized into four layers that run in a specific order: Functional, Interaction, Systemic, and Contextual. Eleven core elements (E01 to E11) cover the standard session signals. Thirteen expanded elements (E12 to E24) fill gaps specific to the Online Shoppers Purchasing Intention (OSPI) benchmark, such as query and category supplementation, returning-visitor loyalty bonuses, and promotional-context sensitivity. Every element definition lives in one behavior_elements.json file, so a reviewer who wants to understand what drove a given prediction can read the file, find the elements that fired, and audit each one against its definition. There is no second source of behavioral truth.
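To make the audit path concrete, here is a minimal Python sketch of how a reviewer's lookup against a single element file could work. The field names (`id`, `layer`, `description`), the placeholder definitions, and the assignment of these four element ids to layers are all assumptions made for illustration; the preprint's actual schema is not reproduced here. Only the layer order comes from the description above.

```python
import json

# Hypothetical excerpt of a behavior_elements.json file. Field names, layer
# assignments, and descriptions are illustrative assumptions, not the
# paper's schema.
LIBRARY_JSON = """
[
  {"id": "E01", "layer": "Functional",  "description": "placeholder definition A"},
  {"id": "E05", "layer": "Interaction", "description": "placeholder definition B"},
  {"id": "E09", "layer": "Systemic",    "description": "placeholder definition C"},
  {"id": "E12", "layer": "Contextual",  "description": "placeholder definition D"}
]
"""

# Layer order as described in the article.
LAYER_ORDER = ["Functional", "Interaction", "Systemic", "Contextual"]


def load_library(raw):
    """Parse the library and index element definitions by id."""
    return {element["id"]: element for element in json.loads(raw)}


def audit_trail(library, fired_ids):
    """Return definitions of the fired elements, sorted by layer order, then id.

    A fired id with no definition raises KeyError, so a prediction can never
    cite a signal that is missing from the single source of truth.
    """
    fired = [library[element_id] for element_id in fired_ids]
    return sorted(fired, key=lambda e: (LAYER_ORDER.index(e["layer"]), e["id"]))


library = load_library(LIBRARY_JSON)
trail = audit_trail(library, ["E12", "E01", "E09"])
for entry in trail:
    print(entry["id"], entry["layer"], entry["description"])
```

The design point is that the prediction carries only element ids, and every id resolves against one file, so the audit amounts to a lookup and nothing has to be reconstructed.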
Three anti-inflation mechanisms keep the library honest. RedundancyGroup contribution caps prevent a single underlying signal from being double-counted across correlated elements. TieredPenaltyCalculator applies bias penalties, for example flagging high promotional-context sensitivity as a warning that purchasing behavior may be driven by promotional pressure rather than genuine preference. AdaptiveConstraintMode protects cold-start sessions, where a user has too little history for the element library to produce stable signals, by defaulting to conservative abstention rather than guessing. The framework is explicit about the philosophy behind these mechanisms: a defensible "I don't know" is worth more than a confident wrong answer, and the system is designed to produce one.
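The three mechanisms can be sketched together in a few lines of Python. This is an illustration of the general ideas (group caps, tiered penalties, abstention on sparse input), not the framework's implementation: the function names, element names, group name, cap, tier thresholds, and decision threshold below are all hypothetical.

```python
from collections import defaultdict


def capped_score(contributions, groups, caps):
    """Sum contributions, limiting each redundancy group to its cap."""
    group_totals = defaultdict(float)
    ungrouped_total = 0.0
    for element_id, value in contributions.items():
        group = groups.get(element_id)
        if group is None:
            ungrouped_total += value
        else:
            group_totals[group] += value
    capped_total = sum(min(total, caps[name]) for name, total in group_totals.items())
    return ungrouped_total + capped_total


def tiered_penalty(sensitivity, tiers):
    """Return the penalty of the highest tier whose threshold is reached."""
    penalty = 0.0
    for threshold, tier_penalty in sorted(tiers):
        if sensitivity >= threshold:
            penalty = tier_penalty
    return penalty


def infer(contributions, groups, caps, sensitivity, tiers,
          min_elements=2, decision_threshold=0.5):
    """Abstain on sparse sessions; otherwise score with caps and penalties."""
    if len(contributions) < min_elements:
        return "abstain", None
    score = capped_score(contributions, groups, caps) - tiered_penalty(sensitivity, tiers)
    label = "positive" if score >= decision_threshold else "negative"
    return label, score


# Hypothetical configuration: two correlated elements share one capped group.
GROUPS = {"elem_a": "dwell", "elem_b": "dwell"}
CAPS = {"dwell": 0.75}
TIERS = [(0.5, 0.125), (0.75, 0.25)]  # (sensitivity threshold, penalty)

full_session = {"elem_a": 0.5, "elem_b": 0.5, "elem_c": 0.25}
result_full = infer(full_session, GROUPS, CAPS, sensitivity=0.8, tiers=TIERS)
result_cold = infer({"elem_a": 0.5}, GROUPS, CAPS, sensitivity=0.0, tiers=TIERS)
print(result_full)  # ('positive', 0.75)
print(result_cold)  # ('abstain', None)
```

In the full session the two correlated elements would contribute 1.0 uncapped but are held to 0.75, and the high-sensitivity tier subtracts a further 0.25. The one-element session never reaches scoring at all.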
Underneath the element library sit two inference engines. The deterministic core engine returns fully reproducible results: the same session, same JSON, same prediction, every time, a property the paper labels σ=0, meaning zero run-to-run variation. The newer LLM-Integrated Semantic Inference Engine adds a two-phase LLM-driven layer that loads complete element metadata on demand and handles the elements that deterministic rules can't reach, including the query-text and category-detail elements (E8) and the context-influenced element (E10). All quantitative results in the preprint are produced by the LLM engine. This is where the auditability story gets complicated: the deterministic engine is bit-for-bit reproducible, but the LLM-dependent elements carry controlled output variability even under fixed provider, model, and temperature settings. The paper flags this honestly in the HTML version rather than presenting the whole framework as σ=0.
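One way to read the σ=0 label is as a testable claim: score the same session repeatedly and measure the spread. The Python sketch below does that for two stand-in engines. Both are hypothetical. The deterministic rule is invented, and the "variable" engine simulates LLM output variability with seeded random jitter rather than calling any model.

```python
import random
import statistics


def deterministic_engine(session):
    """Hypothetical rule: a pure function of the session, with no randomness."""
    return min(1.0, session["page_views"] / 20)


def make_variable_engine(seed):
    """Stand-in for an LLM-backed engine: same rule plus small bounded jitter."""
    rng = random.Random(seed)

    def engine(session):
        return deterministic_engine(session) + rng.uniform(-0.05, 0.05)

    return engine


def run_to_run_sigma(engine, session, runs=5):
    """Population standard deviation of repeated predictions on one session."""
    outputs = [engine(session) for _ in range(runs)]
    return statistics.pstdev(outputs)


session = {"page_views": 8}
sigma_deterministic = run_to_run_sigma(deterministic_engine, session)
sigma_variable = run_to_run_sigma(make_variable_engine(seed=0), session)
print(sigma_deterministic, sigma_variable)
```

The first value is exactly zero and the second is small but positive. That is the distinction the paper draws between its two engines: a nonzero spread is a number that can be reported and bounded, which is a different thing from claiming reproducibility.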
The benchmark anchor is the Online Shoppers Purchasing Intention (OSPI) dataset, a decade-old UCI repository benchmark of 12,330 sessions and 18 features with a roughly 15 percent positive revenue rate. Prior work on this dataset has produced strong numbers. Sakar and colleagues used a real-time LSTM approach. Gupta, Alamuri, and Bondalapu combined SMOTE with an SVC for a ROC-AUC around 0.886 and F1 of 0.633. Wang and colleagues reported XGBoost accuracy around 0.9761 and F1 around 0.9763. Gorman's multiple logistic regression hit AUC near 0.8969. SemantiClean's OSPI performance is competitive but does not top these baselines, and that gap is the point of the design choice. The framework explicitly prefers slightly lower headline metrics in exchange for the traceable, auditable structure that the release notes describe as the core deliverable.
The remaining limitations deserve their own paragraph. Several load-bearing parameters, including the 10-second total-duration threshold for Fallback Rule 1, the s_depth weighting coefficients (0.40, 0.30, 0.30), the s_commit traffic penalty, the s_research thresholds, the E3 imputation value, the E6 returning-visitor loyalty bonus, and the designated peak months, are flagged by the author as design constants without ablation support. The bias flags on E4 and E10 are the only ones surfaced in the preprint, so claims that the framework handles a broad fairness surface would overstate what the source documents. The single-author, v1-only status means no peer review or independent replication is available yet, and no public code repository is linked from the abstract, so reproducibility of the LLM engine depends on author cooperation or a future release. Nothing in the source as written supports extending the results to modern, mobile-heavy, or privacy-regulated e-commerce traffic.
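For a sense of what the missing ablation support would look like, the Python sketch below perturbs the reported s_depth coefficients and records how far the score moves. Only the three coefficient values come from the article. The weighted-sum form, the three component scores, and the perturbation radius are assumptions, since the preprint's formula is not reproduced here. Weights are held as integer tenths so the arithmetic stays exact.

```python
from itertools import product

BASE_TENTHS = (4, 3, 3)  # the reported 0.40, 0.30, 0.30, in integer tenths


def s_depth(components, weight_tenths):
    """Assumed form: weighted sum of three component scores in [0, 1]."""
    return sum(w * c for w, c in zip(weight_tenths, components)) / 10


def neighbouring_weights(base=BASE_TENTHS, radius=1):
    """Weight triples within `radius` tenths of base that still sum to 1.0."""
    offsets = range(-radius, radius + 1)
    for deltas in product(offsets, repeat=3):
        candidate = tuple(b + d for b, d in zip(base, deltas))
        if sum(candidate) == 10 and min(candidate) >= 0:
            yield candidate


def score_range(components):
    """Lowest and highest s_depth over the neighbouring weightings."""
    scores = [s_depth(components, w) for w in neighbouring_weights()]
    return min(scores), max(scores)


components = (1.0, 0.5, 0.0)  # hypothetical component scores for one session
base_score = s_depth(components, BASE_TENTHS)
low, high = score_range(components)
print(base_score, low, high)  # 0.55 0.45 0.65
```

For this made-up session, shifting a single tenth of weight between coefficients moves the score between 0.45 and 0.65. If a decision threshold sat anywhere in that band, the choice of constants would decide the prediction, which is the kind of dependence an ablation table would expose.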
What to watch next: whether the framework publishes a usable element-library reference implementation, whether the LLM engine's variability gets formally characterized rather than just acknowledged, and whether the gender target becomes a real, bias-tested component or stays a non-functional placeholder. An auditability-first design is a bet, and the bet only pays off if the audit trail is concrete enough for someone other than the author to actually walk it.