A new benchmark called ChartDiff is trying to answer a question that existing chart benchmarks do not ask: what happens when you ask a model to compare two charts, not just describe one?
ChartDiff, posted to arXiv on Monday by researchers including Rongtian Ye, is presented by its authors as the first large-scale benchmark for cross-chart comparative summarization. It contains 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. The paper lists no lab affiliation and has not been peer-reviewed, so it carries no institutional weight beyond the work itself.
The results show a split in how current vision-language models handle comparative reasoning. Frontier general-purpose models achieved the highest GPT-based quality scores when evaluated holistically. Specialized and pipeline-based methods, which combine multiple tools rather than running end-to-end, scored higher on ROUGE, a standard lexical overlap metric, but lower on human-aligned evaluation. That gap between what ROUGE measures and what humans actually find useful in a chart comparison is the paper's most consequential finding.
Multi-series charts — charts with multiple data series plotted together — remain challenging across all model families. Strong end-to-end models proved relatively robust to differences in plotting libraries. The authors position ChartDiff as a benchmark for advancing multi-chart understanding research, which is a reasonable claim given the absence of prior large-scale cross-chart benchmarks in the field.
The ROUGE-versus-quality mismatch is the finding with the most practical consequence for anyone building or evaluating chart tools. ROUGE has long been criticized in NLP for rewarding word overlap over semantic accuracy. ChartDiff suggests the problem extends to multimodal chart reasoning: optimizing for a metric that is easy to compute does not produce summaries that humans recognize as correct. A developer who evaluates a chart comparison tool by ROUGE alone may see scores that look healthy. A user who reads the output will notice the gap immediately.
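To make the word-overlap problem concrete, here is a minimal sketch in Python. It uses a simplified ROUGE-1 F1 (unigram overlap, no stemming) written from scratch, not the official ROUGE implementation and not the scoring setup used in the ChartDiff paper. The three sentences are invented for illustration and do not come from the dataset.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, invented for this example.
reference = "chart A rises steadily while chart B falls after 2015"
# Same words, but the two trends are swapped: factually wrong.
wrong = "chart A falls steadily while chart B rises after 2015"
# Different words, same meaning as the reference.
right = ("the first series climbs throughout whereas "
         "the second declines from 2015 onward")

score_wrong = rouge1_f1(reference, wrong)
score_right = rouge1_f1(reference, right)

print(f"wrong summary: {score_wrong:.2f}")
print(f"right summary: {score_right:.2f}")
```

Under these assumptions the summary that reverses both trends scores a perfect 1.00, because it uses exactly the same words as the reference, while the accurate paraphrase scores about 0.09. Real ROUGE variants add stemming and longer n-grams, which narrows the effect in some cases but does not remove it.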
For developers building data analysis tools, the practical takeaway is not that end-to-end models are clearly better. It is that the evaluation metric determines what a system ends up being optimized for. A benchmark that rewards lexical overlap will produce systems optimized for lexical overlap, not for the actual reasoning task that a human needs. The ChartDiff authors went out of their way to include human-verified annotations for exactly this reason, and the fact that the gap persists across model families suggests the problem is structural, not accidental.
The paper is 21 pages with 17 figures and is posted on arXiv without a published conference affiliation. arXiv papers at this scale appear regularly and rarely change the field on their own. What ChartDiff offers is a shared measuring stick: a consistent dataset and evaluation framework that other researchers can use to compare their own methods. Whether that produces actual progress depends on whether the benchmark outlives its initial novelty, and on whether the models evaluated against it are held to human-aligned standards rather than just ROUGE scores.
Sources: arXiv