A 25-Year-Old PDF Feature, Repurposed for the LLM Era

A single .pdf file can render as a formatted document to a person reading on screen, and extract as clean markdown to a large language model parsing it for a summarization pipeline. No new file extension, no separate copy, no companion text file. The same file produces two different texts depending on who is reading it, and it does so using a property that has been part of the PDF specification since 2001.

Developer Sarthak Gaud describes the mechanism in "Adaptive PDFs," an essay dated March 22, 2026, on sgaud.com. The trick is PDF 1.4's marked-content replacement-text property, which the specification calls ActualText. It was originally added to handle ligatures and other cases where a glyph does not have a one-to-one Unicode mapping. The property attaches an alternate string to a marked run of content. Renderers, which care about appearance, ignore the alternate and draw the visual glyphs. Text extractors, which care about content, return the alternate instead of the visual glyph sequence. Gaud's contribution is to repurpose the property at the document level: not for a single awkward ligature, but as a layer of structured text that lives inside the visual PDF and surfaces only when a machine asks for it.
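To make the mechanism concrete, here is a minimal Python sketch of what one marked run could look like inside a page's content stream. It assumes the property is written as an `/ActualText` entry on a `/Span` marked-content sequence, which is the spelling the PDF specification uses; it is not Gaud's generator, and the helper names `pdf_hex_text`, `pdf_literal`, and `marked_run` are hypothetical. The sketch emits only the operators for a single run, not a complete PDF: font selection, positioning, and file structure are left out.

```python
def pdf_hex_text(s: str) -> str:
    """Encode a string as a PDF hex text string: UTF-16BE with a leading byte-order mark."""
    return "<" + ("\ufeff" + s).encode("utf-16-be").hex().upper() + ">"


def pdf_literal(s: str) -> str:
    """Escape a string as a PDF literal string.

    Assumption: the visible text uses a simple font whose encoding covers
    these characters, so only backslashes and parentheses need escaping.
    """
    escaped = s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "(" + escaped + ")"


def marked_run(visible: str, replacement: str) -> str:
    """Wrap one visible text run in a marked-content sequence carrying replacement text."""
    return "\n".join([
        # BDC opens the marked-content sequence; the dictionary holds the alternate string
        f"/Span << /ActualText {pdf_hex_text(replacement)} >> BDC",
        # Tj draws the glyphs a human sees
        f"{pdf_literal(visible)} Tj",
        # EMC closes the sequence
        "EMC",
    ])


# A heading that renders as plain "Project Alpha" but carries a markdown alternate
print(marked_run("Project Alpha", "## Project Alpha"))
```

The visible string and the replacement string are independent, which is the whole point: the viewer draws the first, and an extractor that honors the property returns the second.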

The need is real, and it is not just a developer quirk. The essay frames the problem plainly: PDFs are a visual format. They store instructions for where to draw glyphs on a page, not the document structure underneath. Tagged PDF, a structure tree marking headings, paragraphs, and lists, has existed in the spec for years and shows up in some government accessibility and enterprise publishing pipelines. Most PDFs in the wild remain untagged, because the tools that produce them, including LaTeX, Chrome's print-to-PDF, and most export buttons in word processors, do not emit tags. When a text extractor opens one of these untagged PDFs, it reads the draw commands in left-to-right, top-to-bottom order and "hopes for the best," in Gaud's words. Structure is lost in the process.

That used to matter mostly to screen readers. It matters more now because so many PDFs end up in front of an LLM. They get uploaded to ChatGPT, summarized by Claude, parsed in retrieval pipelines, and fed to agents that need to reason about the document. A human reader can usually infer that "Project Alpha" is a heading and that "Led a team of 5 engineers to deliver the" is the start of a sentence, even when the PDF does not say so. An LLM reading only the extracted text has no visual cues to go on. Gaud's illustrative failure case in the essay is exactly this kind of joined-up text, where the model has to guess where the heading ends and the body begins.
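The failure is easy to reproduce in miniature. The toy model below is not how any particular extractor is implemented; it assumes a page is just a list of positioned text runs (the `DrawOp` type and `naive_extract` function are hypothetical) and reads them top to bottom, left to right, joining them with spaces.

```python
from typing import NamedTuple


class DrawOp(NamedTuple):
    x: float   # left edge of the text run, in points
    y: float   # baseline height, measured up from the bottom of the page
    text: str  # the glyphs drawn at that position


def naive_extract(ops: list[DrawOp]) -> str:
    """Order runs top-to-bottom, then left-to-right, and join them with spaces."""
    ordered = sorted(ops, key=lambda op: (-op.y, op.x))
    return " ".join(op.text for op in ordered)


# The heading and the body line are just two runs at different heights;
# nothing in the data says which one is a heading.
page = [
    DrawOp(72, 680, "Led a team of 5 engineers to deliver the"),
    DrawOp(72, 700, "Project Alpha"),
]

print(naive_extract(page))
# prints: Project Alpha Led a team of 5 engineers to deliver the
```

The reading order comes out right, but the output is one undifferentiated string. Font size and weight, the cues a human uses to spot the heading, never reach the text.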

The proposed solution is small and principled. A single .pdf file contains both the visual rendering and the replacement-text layer. A human opens it in any PDF viewer and sees a formatted document. PyMuPDF and Poppler, two widely used PDF libraries, honor the marked-content replacement-text property in testing: their extractors return the structured alternate string, which reads as clean markdown. Other tools and older versions may not, because an extractor is free to ignore the property entirely. When an extractor ignores it, the human still sees the formatted document, and the machine gets whatever the extractor can recover from the visible glyphs. The file is honest about which audience it serves: the human gets the visual, the model gets the structure, and the divergence only appears on extraction.
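The difference between an extractor that honors the property and one that ignores it can be sketched with a toy parser. This is an illustration under stated assumptions, not the logic inside PyMuPDF or Poppler: it assumes the replacement text is an `/ActualText` entry on a `/Span` sequence, that all strings are simple literals without escapes, and that each run holds exactly one `Tj`. The `STREAM` sample and the `extract` function are hypothetical.

```python
import re

# Toy content-stream fragment: two marked runs, each pairing replacement
# text (/ActualText) with the visible text drawn by Tj.
STREAM = """
/Span << /ActualText (## Project Alpha) >> BDC
(Project Alpha) Tj
EMC
/Span << /ActualText (Led a team of 5 engineers to deliver the) >> BDC
(Led a team of 5 engineers to deliver the) Tj
EMC
"""

# Group 1 captures the replacement text, group 2 the visible text.
RUN = re.compile(
    r"/Span\s*<<\s*/ActualText\s*\(([^)]*)\)\s*>>\s*BDC\s*\(([^)]*)\)\s*Tj\s*EMC"
)


def extract(stream: str, honor_replacement: bool) -> str:
    """Return one line per marked run, using replacement or visible text."""
    group = 1 if honor_replacement else 2
    return "\n".join(match.group(group) for match in RUN.finditer(stream))


print(extract(STREAM, honor_replacement=True))
# prints:
# ## Project Alpha
# Led a team of 5 engineers to deliver the

print(extract(STREAM, honor_replacement=False))
# prints:
# Project Alpha
# Led a team of 5 engineers to deliver the
```

Either way the extractor returns usable text. The only thing lost when the property is ignored is the markdown structure, which is why the approach degrades gracefully.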

Gaud is explicit that this is one approach, not a universal fix. The essay frames the work as a small format experiment rather than a standards proposal or a product launch. Tagged PDF and existing PDF/UA accessibility work remain the heavier, more comprehensive paths, and the property does not solve layout inference, table structure, or anything the marked runs do not cover. What it offers is an opt-in path that uses existing spec infrastructure rather than a new file format, and that is where its appeal lies: a creator with control over the source file can ship a document that already speaks two languages, without waiting for extractors to get smarter or for the untagged-PDF norm to change.

The deeper frame, the one Gaud is reaching for, is that the PDF specification already carried the seed of machine-readable structure in 2001, long before LLMs became primary PDF consumers. The marked-content replacement-text property was sitting in the spec, waiting for a use case. The LLM era is that use case. The story of adaptive PDFs is not really that PDFs are broken for AI. It is that the spec already had a solution, and it just needed someone to repurpose it at the document level.

The thing to watch is adoption. A format experiment only matters if the tools that creators use start to emit the marked runs by default, and if extraction tools start to honor them reliably across versions. Gaud names the support question directly: PyMuPDF and Poppler work, support elsewhere varies, and the rest is open. That is the honest version of the story, and it is the version worth reading.
