When Jimin Mun's team at Carnegie Mellon needed to know whether feedback on a scientific paper was actually useful, they found an elegant proxy: whether the paper's authors did what the reviewer suggested. If the author accepted the critique and revised the paper, the feedback was good. If they rebutted it or ignored it, it was not. This logic, obvious in retrospect, is the backbone of GoodPoint, a new framework from researchers at CMU, NVIDIA, and an independent contributor that trains a small language model to give feedback that authors find genuinely actionable.
The framework, posted to arXiv on Monday, fine-tunes Qwen3-8B on 19,534 ICLR papers paired with reviewer feedback and author responses from 2020 through 2026. The training signal is not whether a review sounds authoritative; it is whether the author agreed the critique was valid and committed to act on it. The result, according to the paper, is an 83.7% improvement in predicted feedback success rate over the base Qwen3-8B model, measured on a held-out set of 1,198 ICLR papers. In matching feedback against multi-reviewer consensus, GoodPoint-SFT surpasses Gemini-3-flash and GPT-5.2 on precision, suggesting the model has learned to be selective: to flag only the critiques that genuinely warrant a response.
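If the 83.7% figure is read as a relative improvement (the paper reports the improvement, not the underlying rates, so this reading is an assumption), the arithmetic is simple to reconstruct. The rates below are hypothetical placeholders chosen only to illustrate the ratio:

```python
def relative_improvement(base_rate: float, new_rate: float) -> float:
    """Relative gain of new_rate over base_rate."""
    return (new_rate - base_rate) / base_rate

# Hypothetical success rates, NOT numbers from the paper: a base model
# at 30.0% and a fine-tuned model at 55.1% would yield the reported gain.
print(round(relative_improvement(0.300, 0.551), 3))  # 0.837, i.e. an 83.7% relative improvement
```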
The work matters because the existing alternatives are either too generic or too credulous. Earlier attempts to use LLMs in peer review tend to over-praise papers, reflect reviewer biases rather than paper quality, or deliver feedback so broad it could apply to any submission. GoodPoint's dataset construction sidesteps this by using author responses as a natural labeler: if the author accepted a critique and revised, that feedback unit is a positive example. If they rebutted it, it is not. The team then applies two training stages: supervised fine-tuning on successful feedback, and direct preference optimization that teaches the model to distinguish valid from invalid critiques. For the preference data, successful feedback is deliberately corrupted along five dimensions (specificity, clarity, accuracy, prioritization, and supportive tone), so the model learns what good feedback actually looks like when it breaks down.
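The labeling and preference-pair construction can be sketched roughly as follows. Everything here is illustrative: the function names, the keyword heuristic (a stand-in for the GPT-4.1 response parsing the paper describes), and the corruption placeholder (a stand-in for an LLM rewrite) are assumptions, not the authors' actual code.

```python
# Hypothetical sketch of GoodPoint-style data construction: label each
# feedback unit by the author's response, then build DPO preference pairs
# by corrupting successful feedback along one of five dimensions.
import random

CORRUPTION_DIMENSIONS = [
    "specificity", "clarity", "accuracy", "prioritization", "supportive_tone",
]

def author_accepted(author_response: str) -> bool:
    """Crude keyword stand-in for the paper's LLM-based response parsing:
    positive if the author agreed and committed to revise."""
    text = author_response.lower()
    return "we agree" in text or "we have revised" in text

def corrupt(feedback: str, dimension: str) -> str:
    """Placeholder for an LLM rewrite that degrades feedback along one
    dimension, e.g. stripping specifics or softening it into a platitude."""
    return f"[{dimension}-degraded] {feedback}"

def build_dpo_pairs(units):
    """units: list of (feedback_text, author_response) tuples.
    Only feedback the author acted on becomes a 'chosen' example."""
    pairs = []
    for feedback, response in units:
        if author_accepted(response):
            dim = random.choice(CORRUPTION_DIMENSIONS)
            pairs.append({"chosen": feedback, "rejected": corrupt(feedback, dim)})
    return pairs

pairs = build_dpo_pairs([
    ("Section 4.2 lacks an ablation over learning rates.",
     "We agree and have revised Section 4.2."),
    ("The paper could be improved.",
     "We respectfully disagree; the claim is unsupported."),
])
print(len(pairs))  # 1: only the accepted critique yields a preference pair
```

The design choice the sketch illustrates is that the rebutted critique contributes no preference pair at all; it serves instead as a negative signal for teaching the model to withhold invalid critiques, which is where the precision gains over larger models come from.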
The human evaluation reinforces the numbers. When expert authors were asked to rate feedback from GoodPoint-DPO versus the base Qwen3-8B model, GoodPoint-DPO won on validity, actionability, specificity, and helpfulness. The gap to Gemini-3-flash narrowed meaningfully. "These findings highlight the promise of training and evaluating LLMs grounded in an author-centric definition of constructive feedback," the authors write, "underscoring the potential of LLMs to augment rather than replace human researchers."
That last clause is doing real work. The paper is explicitly arguing against a certain vision of AI in science: the fully automated reviewer that writes critiques without domain expertise or human oversight. Instead, the authors position GoodPoint as a tool for researchers who lack access to timely expert feedback, calling out junior researchers and non-native English speakers in particular as the primary beneficiaries. Whether that positioning holds up in deployment, or whether GoodPoint simply automates the preferences of reviewers at top-tier institutions, is the question the paper itself does not fully answer.
There is a structural risk the authors acknowledge only obliquely. The training data is drawn from ICLR, a top venue with papers that by definition survived at least one round of review. According to the ICLR blog, ICLR 2026 received 19,525 submissions reviewed by 18,054 reviewers, with a 27.4% acceptance rate and documented problems of LLM-generated reviews and hallucinated references. Feedback on accepted papers may not generalize to feedback on rejected papers, or to the longer cycle of revision and resubmission. If the model learns what successful feedback looks like at ICLR, it may be optimizing for a very specific genre of scientific argument rather than correctness or rigor.
The 83.7% figure is also a predicted success rate, the team's own metric, derived from author response parsing rather than direct human judgment of feedback quality. The authors validated the parsing with human annotators (0.936 accuracy for validity, 0.941 for actionability), which is rigorous by academic standards, but the underlying claim rests on how well GPT-4.1 labeled the author responses in the first place. This is a chain of inference, not a direct measurement.
What is unambiguously new is the specificity of the approach. Rather than benchmarking existing models against each other on synthetic reviews, the GoodPoint team built a training recipe and a dataset designed to be reproduced and extended. Code, dataset, and trained models are promised upon acceptance. If that holds, it is the kind of work the field can actually build on, not a leaderboard entry but a method.
The broader context is a field actively debating what role AI should play in scientific publishing. ICLR deployed an official AI feedback tool at ICLR 2025. Stanford researchers have published on AI's ability to spot errors and judge research significance. In April 2025, Japanese startup Sakana AI used a system called The AI Scientist-v2 to generate end-to-end papers and submit them to the ICLR 2025 ICBINB workshop; one was accepted with an average score of 6.33, surpassing the average human acceptance threshold and scoring higher than 55% of human-authored papers submitted to the same track. Sakana AI withdrew the paper before publication and later published the work in Nature on March 26, 2026. GoodPoint sits in a different part of that space: it is not trying to replace reviewers, but to make the feedback authors get before submission more useful. Whether that is a genuinely new capability or an incremental improvement on existing tools will depend on what happens when researchers outside ICLR start using it.
The paper is arXiv:2604.11924.