Most AI tools that claim to watch video are overstating what they do. ChatGPT falls back to reading a transcript. Claude refuses video files outright. Gemini samples frames at a fixed rate of about one per second and uploads the footage to Google. A single-developer open-source project called claude-real-video exposes the gap and offers a concrete workaround: a local pipeline that uses scene-change detection, frame deduplication, and on-device transcription so any large language model can actually use a video.
The framing matters here. "AI watches video" today usually means one of three shortcuts. Reading a transcript misses anything visual. Sampling frames on a timer wastes tokens on static scenes and still misses fast cuts. Refusing the file sidesteps the question. None of these is what a reasonable reader means by "watch."
Claude-real-video takes a different approach. Instead of pulling one frame every second, it runs scene-change detection on the user's own machine via ffmpeg (a widely used open-source command-line tool for handling video files) and only extracts frames when the visual content actually shifts. A density floor prevents long static stretches from going unrepresented. A sliding-window deduplication step keeps rapid A-B-A cuts from being sent three times.
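The selection logic described above can be sketched in a few lines of Python. This is an illustration only, not claude-real-video's code: it works on toy grayscale pixel vectors instead of calling ffmpeg, and the function names, the thresholds, and the 10-second floor are assumed values chosen for the example.

```python
from collections import deque
from typing import List, Sequence


def frame_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute pixel difference between two equal-length frames, scaled to 0..1."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (255.0 * len(a))


def select_frames(
    frames: List[Sequence[int]],
    fps: float,
    scene_threshold: float = 0.3,  # assumed: how large a jump counts as a scene change
    max_gap_s: float = 10.0,       # assumed density floor: keep at least one frame per 10 s
    window: int = 3,               # assumed: how many recently kept frames to compare against
    dup_threshold: float = 0.05,   # assumed: below this distance a frame is a repeat
) -> List[int]:
    """Return the indices of the frames worth sending to a model."""
    kept: List[int] = []
    recent: deque = deque(maxlen=window)  # sliding window of recently kept frames
    for i, frame in enumerate(frames):
        # Scene change: the first frame, or a large jump from the previous frame.
        is_cut = i == 0 or frame_distance(frames[i - 1], frame) >= scene_threshold
        # Density floor: too much time has passed since the last kept frame.
        floor_due = bool(kept) and (i - kept[-1]) / fps >= max_gap_s
        # Deduplication: the frame closely matches one that was kept recently.
        is_repeat = any(frame_distance(frame, r) < dup_threshold for r in recent)
        if floor_due or (is_cut and not is_repeat):
            kept.append(i)
            recent.append(frame)
    return kept
```

In this sketch a frame forced by the density floor skips the duplicate check on purpose, since a long static shot would otherwise be dropped as a repeat of itself. With an A-B-A-B sequence of cuts, only the first A and the first B are kept.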
Audio gets the same local treatment. The pipeline runs Whisper, OpenAI's open-source speech-to-text model, on the same machine with language detection, so the transcript arrives in the same folder as the frames. Nothing leaves the laptop during extraction and transcription. No cloud upload, no per-minute API bill for frame extraction, no third-party video host in the loop.
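For readers curious what local transcription looks like in code, here is a minimal sketch using the open-source openai-whisper Python package. It is not taken from claude-real-video: the helper names, the "base" model choice, and the timestamped line format are assumptions made for illustration. Whisper itself relies on ffmpeg being installed to decode audio.

```python
from pathlib import Path
from typing import Dict, Iterable


def format_segments(segments: Iterable[Dict]) -> str:
    """Render Whisper-style segments as '[MM:SS] text' lines (an assumed format)."""
    lines = []
    for seg in segments:
        start = int(seg["start"])  # segment start time in whole seconds
        stamp = f"{start // 60:02d}:{start % 60:02d}"
        lines.append(f"[{stamp}] {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe_to_file(media_path: str, out_path: str, model_name: str = "base") -> str:
    """Transcribe locally with Whisper, write the transcript, return the detected language."""
    import whisper  # imported here so format_segments works without the package installed

    model = whisper.load_model(model_name)
    # With no language argument, Whisper detects the spoken language itself.
    result = model.transcribe(str(media_path))
    Path(out_path).write_text(format_segments(result["segments"]) + "\n", encoding="utf-8")
    return result["language"]
```

Calling transcribe_to_file("talk.mp4", "transcript.txt") would download the model weights on first use and then run entirely on the local machine; both file names here are hypothetical.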
The output is a folder any LLM can ingest. By default, claude-real-video writes to crv-out/, with the extracted frames at crv-out/frames/*.jpg, the transcript at crv-out/transcript.txt, and a human-readable manifest at crv-out/MANIFEST.txt describing what was pulled and why. That folder drops straight into Claude, ChatGPT, or Gemini as input.
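Because the output is just files on disk, gathering it for a model takes very little code. The sketch below assumes only the default layout described above (crv-out/frames/*.jpg, crv-out/transcript.txt, crv-out/MANIFEST.txt); the VideoBundle class and load_bundle function are hypothetical names, and sorting frames by file name is an assumption about how they are ordered.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Union


@dataclass
class VideoBundle:
    frames: List[Path]  # extracted frame images, sorted by file name
    transcript: str     # contents of transcript.txt, or "" if absent
    manifest: str       # contents of MANIFEST.txt, or "" if absent


def _read_text_or_empty(path: Path) -> str:
    """Read a UTF-8 text file, returning an empty string when it does not exist."""
    return path.read_text(encoding="utf-8") if path.is_file() else ""


def load_bundle(out_dir: Union[str, Path] = "crv-out") -> VideoBundle:
    """Collect frames, transcript, and manifest from an output folder."""
    root = Path(out_dir)
    return VideoBundle(
        frames=sorted((root / "frames").glob("*.jpg")),
        transcript=_read_text_or_empty(root / "transcript.txt"),
        manifest=_read_text_or_empty(root / "MANIFEST.txt"),
    )
```

From there, the frames and transcript can be attached to a prompt in whatever way a given assistant or API accepts images and text.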
Independent academic work has been circling the same problem. A benchmark paper titled "Frame Sampling Strategies Matter" tests how different sampling strategies affect vision-language model performance and finds the choice is far from settled. InfoShot proposes shot-aware sampling that weights informative shots more heavily. F2C ("Frames to Clips") selects key clips rather than isolated frames. Claude-real-video is essentially one open-source implementation of that research direction, applied to a workflow question rather than a benchmark question.
A practitioner write-up on dev.to reports 13 to 45 percent reductions in vision-LLM token costs by combining frame deduplication with scene detection, the same core idea. Those numbers come from one developer's pipeline and are not peer-reviewed, but they line up with what the academic literature is pointing at.
Running it locally takes about ten minutes. Install the Python package with pip install claude-real-video (or pip install 'claude-real-video[whisper]' for bundled transcription; the quotes keep shells such as zsh from treating the square brackets as a glob pattern). Install ffmpeg once via brew install ffmpeg, apt install ffmpeg, or winget install ffmpeg, depending on the operating system. Then run crv against a YouTube URL or a local file path: crv 'https://www.youtube.com/watch?v=...' or crv path/to/video.mp4. The README on GitHub documents the full set of flags.
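Anyone scripting these steps may want a preflight check before launching a run. The following is a small sketch, not part of the project: it assumes only that the crv and ffmpeg executables are on the PATH and that crv takes the video source as its single argument, as in the commands above. The function names are hypothetical and no other crv flags are assumed.

```python
import shutil
import subprocess
from typing import List, Sequence


def missing_tools(tools: Sequence[str] = ("ffmpeg", "crv")) -> List[str]:
    """Return the names of required executables that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def build_crv_command(source: str) -> List[str]:
    """Build the argument list for crv given a URL or a local file path."""
    return ["crv", source]


def run_crv(source: str) -> int:
    """Run crv on one source after checking that its dependencies are installed."""
    missing = missing_tools()
    if missing:
        raise RuntimeError(f"Install these tools first: {', '.join(missing)}")
    # Passing a list (not a shell string) avoids quoting problems with URLs.
    return subprocess.run(build_crv_command(source), check=True).returncode
```

Calling run_crv("path/to/video.mp4") then behaves like typing the command by hand, but fails early with a readable message if ffmpeg or crv is missing.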
The caveats are real. Claude-real-video is a single-developer release with no independent benchmark against Gemini's native video pipeline or any other assistant's video handling, and the Hacker News discussion of the project has collected only around five points as of early July 2026, so community validation is thin. The academic work cited above is preprint-level, not peer-reviewed. The tool is best read as one accessible implementation of an active research direction, not a definitive answer to how LLMs should handle video.
What is worth watching is whether the mainstream assistants start borrowing the same idea. If scene-aware sampling and on-device transcription show up in Claude, ChatGPT, or Gemini as default behavior, the gap this open-source project is filling closes by absorption. Until then, the workflow for anyone who actually wants an LLM to engage with a video still means running a local pipeline such as this one.