Baidu Open-Sourced an OCR Model That Treats Long Documents Like a Human Reading a Thick Book

Baidu has open-sourced Unlimited OCR, a long-document parser that holds the page image permanently in attention while letting intermediate decoded text fade from active memory. That trade is the actual innovation: the model treats long documents the way a human transcriber treats a thick book, glancing back at the original rather than re-reading every line it just wrote. The result, by the company's own measurements, is that a 40-page file runs at the same speed as a single page, with no accuracy loss and a flat memory footprint.

The architectural lever is Reference Sliding Window Attention, or R-SWA. Reference tokens, meaning the visual tokens of the source page plus any user prompt, are always attended in full. Of the text the model has already generated, only the most recent 128 output tokens stay in attention; older output drops out. The model's key-value cache, the working memory an attention layer uses to score new tokens against old ones, therefore stays a fixed-length queue rather than growing with document length, and compute and memory stay flat as pages pile up. By contrast, traditional OCR pipelines reset at each page boundary and stitch results back together, losing continuity across page turns. DeepSeek-OCR, the most prominent open-weight competitor, reportedly scales more gracefully than those pipelines but still attends to its full decoded history.
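The rule described above can be illustrated with a toy sketch. This is not Baidu's implementation: the function and class names are hypothetical, the 1,000-token reference length is invented for the example, and only the 128-token window and the 6,000-token generation length come from the figures quoted in this article. It shows which positions a new token could see, and why the cache stops growing.

```python
from collections import deque

WINDOW = 128  # recent output tokens kept in attention, per the article


def visible_positions(num_reference: int, num_generated: int, window: int = WINDOW) -> list[int]:
    """Positions the next token may attend to under the assumed R-SWA rule.

    Positions 0..num_reference-1 are reference tokens (page image plus prompt);
    generated tokens are numbered after them.
    """
    reference = list(range(num_reference))
    first_kept = num_reference + max(0, num_generated - window)
    recent = list(range(first_kept, num_reference + num_generated))
    return reference + recent


class FixedSizeCache:
    """Toy KV cache: reference entries are pinned, output entries sit in a bounded queue."""

    def __init__(self, reference_entries, window: int = WINDOW):
        self.reference = list(reference_entries)
        # deque(maxlen=...) evicts the oldest output entry once the window is full
        self.recent = deque(maxlen=window)

    def append(self, entry) -> None:
        self.recent.append(entry)

    def __len__(self) -> int:
        return len(self.reference) + len(self.recent)


# Hypothetical sizes: 1,000 reference tokens, 6,000 generated tokens.
cache = FixedSizeCache(reference_entries=range(1000))
for step in range(6000):
    cache.append(step)
print(len(cache))  # 1128: 1,000 pinned reference entries + 128 recent outputs
```

A cache that kept the full decoded history would hold 7,000 entries at the same point in this example, and would keep growing with every further token.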

The mechanism matters beyond OCR. Any task that mixes generation with a fixed reference source, from legal-document review to long-form video captioning to code reading, faces the same memory wall: either keep everything in attention and pay quadratic cost, or forget the source and lose accuracy. R-SWA picks a third path, keeping the reference fully in view while letting generation history fade. If that pattern generalizes, it is a reusable design choice for long-context AI rather than a Baidu-specific trick.

Baidu backs the architecture with benchmark numbers, but they come from the company itself and have not yet been independently reproduced. On OmniDocBench, a CVPR 2025 long-document parsing benchmark from OpenDataLab, Baidu reports 93.23% on v1.5, a 6.22-point gain over DeepSeek-OCR, and 93.92% on v1.6, described as the current state of the art. On a Baidu-internal long-doc set spanning 2 to more than 40 pages, the company reports a Distinct-35 score of 96.90% and an Edit Distance below 0.1069. Throughput is reportedly about 35% higher than DeepSeek-OCR at 6,000 generated tokens, and latency stays flat across document length. Model size, per the Hugging Face model card, is around 3 billion parameters.
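For readers unfamiliar with the Edit Distance figure, the sketch below shows one common way such a score is computed: Levenshtein distance between the model's output and the reference text, divided by the length of the longer string, so that 0 means a perfect match and lower is better. The article does not say how Baidu normalizes its score, so the normalization here is an assumption, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed normalization)."""
    longest = max(len(predicted), len(reference))
    return 0.0 if longest == 0 else levenshtein(predicted, reference) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

Under this assumed definition, a score below 0.1069 would mean that fewer than roughly 11 characters in every 100 need correcting.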

The release itself is real and easy to verify. Baidu posted weights on Hugging Face under baidu/Unlimited-OCR, mirrored on ModelScope under PaddlePaddle/Unlimited-OCR, with code on GitHub at baidu/Unlimited-OCR and a working demo on Hugging Face Spaces. Tech Times, MarkTechPost, and AI-Bot independently covered the release. That is multi-outlet corroboration that the model exists and runs, even if the benchmark deltas are not yet third-party reproduced.

The more colorful framing comes from QbitAI, the Chinese tech outlet that first surfaced the story. Its headline speculates that the lead author is a former DeepSeek researcher. The claim is hedged in the original Chinese and is not confirmed in the public GitHub or Hugging Face attribution blocks. For industry-watchers tracking where DeepSeek alumni land, that is a thread worth pulling. As evidence for this story, it is gossip rather than a load-bearing fact.

Two open questions will decide whether Unlimited OCR shifts the long-context field or stays a curiosity. First, can the R-SWA pattern be re-implemented cleanly for non-OCR tasks, where the reference is text or video rather than a page image, without re-tuning the sliding window size? Second, will an independent lab reproduce the OmniDocBench numbers on neutral infrastructure, the standard checkpoint before any benchmark delta is taken seriously? Baidu has put the weights and code out in the open, so the answer to both questions is now in the hands of the community rather than the company.
