Pixels Over Selectors: Vision-Based UI Automation, and the Arms Race It Doesn't Win

When Instagram scrambled its DOM into a maze of randomly generated class names, most developers who wanted to automate likes, follows, or scrolling gave up. Selectors broke. Workarounds broke faster. The platform kept shipping new builds that rearranged the markup, and the bots kept dying with each one.

Florian Herrengt took a different route. He stopped reading Instagram's HTML and started reading its pixels. In a blog post on building a computer-vision loop for Instagram, he describes treating the rendered screen as the only interface that matters. Take a screenshot. Find the heart icon by what it looks like, not by where it lives in the document tree. Move the cursor to those coordinates. Click.

The shift matters because Instagram's anti-scraping posture is structural, not just policy. Class names are randomly generated. The DOM is deeply nested and full of decoy divs that exist only to confuse anything relying on selector paths. Scripts that match the markup directly last about as long as a deployment cycle, which on a platform shipping every few weeks means a working life of weeks at best. The visual surface, by contrast, is what real users see. It is also what gets shipped last and broken least, because Instagram still has to render something a person can use.

Herrengt's loop leans on that asymmetry. The naive version of "click the heart" with computer vision is brittle in its own way: a full-screen template match across millions of pixels, hunting for a small icon that may appear at different positions on every post because captions, location tags, and carousels push the action bar around. Hardcoded coordinates are worse; they fail the moment the layout shifts.

His fix is the part that generalizes. Before searching for the heart, the script searches for something easier to find: a stable visual landmark, like the triple-dots menu that appears on every post. The landmark shrinks the search space from the whole screenshot to a small region, and only then does the template matcher hunt for the target icon inside that window. False positives drop. The loop becomes robust to the layout shifts that killed DOM-based scripts. Per Herrengt, the same pattern works on "anything that renders to pixels. Web apps, native apps, games, terminals."
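A minimal sketch of that anchor-then-match idea, in plain Python so it runs without any imaging library. Everything in it is an assumption made for illustration, not Herrengt's code: the screen is a small grid of integers standing in for grayscale pixels, `find_template` does exact matching where a real script would use a tolerant matcher such as OpenCV's `cv2.matchTemplate`, and the names `find_target_near_anchor`, `window_height`, `HEART`, `DOTS`, and `demo_screen` are hypothetical. It also assumes the target sits somewhere below the landmark.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]            # grayscale grid: rows of pixel values
Point = Tuple[int, int]            # (row, col) of a match's top-left corner
Region = Tuple[int, int, int, int]  # (top, left, bottom, right), half-open

HEART: Image = [[7, 7], [7, 7]]    # hypothetical target icon
DOTS: Image = [[9, 9, 9]]          # hypothetical landmark (triple-dots menu)


def find_template(screen: Image, template: Image,
                  region: Optional[Region] = None) -> Optional[Point]:
    """Return the first exact match that fits entirely inside region, or None."""
    th, tw = len(template), len(template[0])
    top, left, bottom, right = region or (0, 0, len(screen), len(screen[0]))
    # Clamp the region to the screen so callers can pass generous windows.
    top, left = max(top, 0), max(left, 0)
    bottom, right = min(bottom, len(screen)), min(right, len(screen[0]))
    for r in range(top, bottom - th + 1):
        for c in range(left, right - tw + 1):
            if all(screen[r + i][c:c + tw] == template[i] for i in range(th)):
                return (r, c)
    return None


def find_target_near_anchor(screen: Image, anchor: Image, target: Image,
                            window_height: int) -> Optional[Point]:
    """Find the anchor first, then search for the target only in the
    window_height rows directly below it (an assumed layout)."""
    hit = find_template(screen, anchor)
    if hit is None:
        return None
    top = hit[0] + len(anchor)
    region = (top, 0, top + window_height, len(screen[0]))
    return find_template(screen, target, region)


def paste(screen: Image, patch: Image, row: int, col: int) -> None:
    """Copy patch into screen with its top-left corner at (row, col)."""
    for i, line in enumerate(patch):
        screen[row + i][col:col + len(line)] = line


def demo_screen() -> Image:
    """A 12x10 blank screen with a decoy heart, a landmark, and the real heart."""
    screen = [[0] * 10 for _ in range(12)]
    paste(screen, HEART, 0, 0)   # decoy: a look-alike above the post
    paste(screen, DOTS, 4, 6)    # landmark on the post header
    paste(screen, HEART, 8, 1)   # the heart we actually want
    return screen


if __name__ == "__main__":
    screen = demo_screen()
    print("full-screen match:", find_template(screen, HEART))
    print("anchored match:   ", find_target_near_anchor(screen, DOTS, HEART, 6))
```

On the demo screen, the unrestricted search stops at the decoy at (0, 0), while the anchored search skips it and returns (8, 1). The window size is the main tuning knob in a sketch like this: too small and the target falls outside it when a long caption pushes the action bar down, too large and the decoys come back.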

That portability is the reason the technique is worth naming as a pattern, not just a stunt. Any platform that obfuscates its markup, ships A/B variants, or otherwise makes its DOM a moving target is a candidate for the same anchor-then-match approach. The DOM is a contract the platform controls. The pixel surface is a contract the user controls, in the sense that the platform still has to render something legible to humans. Vision-based automation reads the second contract, and that contract is hard for a platform to break without breaking its own product.

The catch is the one the post's title is blunt about: Herrengt's technique works, and Instagram still bans accounts that use it. The arms race between behavioral automation and platform detection is asymmetric: detection gets cheaper as platforms instrument more signals, while automation has to keep paying the cost of looking human. The ban is not a bug in the technique. It is a feature of the system the technique is being used against.

That distinction matters for anyone considering this approach. The reusable part is the pattern: shrink the search space with a stable visual anchor, template-match the actual target, run detection and targeting in a loop so the system recovers when layouts shift. The non-reusable part is the assumption that any of this will keep working at scale against a platform with a financial interest in shutting it down.
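The loop half of the pattern can be sketched in a few lines. This is an illustration under stated assumptions, not Herrengt's implementation: `run_loop` and its parameters are hypothetical names, and the screenshot, detection, click, and scroll steps are passed in as plain callables so the control flow can be shown without committing to any particular screen-capture or input library. The point it tries to make is that coordinates are re-detected on every pass and never cached, so a layout shift costs one missed frame instead of the whole run.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]  # (row, col) screen coordinates


def run_loop(capture: Callable[[], object],
             locate: Callable[[object], Optional[Point]],
             click: Callable[[Point], None],
             scroll: Callable[[], None],
             max_steps: int,
             max_misses: int = 3) -> int:
    """Detect-then-act loop. Returns the number of clicks performed.

    capture:    returns the current frame (hypothetical screenshot hook)
    locate:     returns the target's coordinates in that frame, or None
    click:      acts on the detected coordinates
    scroll:     advances to the next item
    max_misses: consecutive failed detections tolerated before giving up
    """
    clicks = 0
    misses = 0
    for _ in range(max_steps):
        frame = capture()
        point = locate(frame)  # fresh detection every pass, nothing cached
        if point is None:
            misses += 1
            if misses > max_misses:
                break          # the layout is probably unrecognizable; stop
            scroll()           # skip this frame and try the next one
            continue
        misses = 0             # a successful detection resets the counter
        click(point)
        clicks += 1
        scroll()
    return clicks


if __name__ == "__main__":
    # Simulated frames: the target moves between frames and is missing once.
    frames = iter([{"heart": (8, 1)}, {}, {"heart": (5, 3)}, {"heart": (9, 2)}])
    clicked = []
    total = run_loop(capture=lambda: next(frames),
                     locate=lambda frame: frame.get("heart"),
                     click=clicked.append,
                     scroll=lambda: None,
                     max_steps=4)
    print(total, clicked)
```

In the simulated run, the second frame has no target, so the loop skips it and carries on, ending with three clicks at three different positions. In a real script the `locate` callable is where an anchor-then-match search would plug in.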

For builders, the takeaway is not "use computer vision to bot Instagram." It is that when the DOM becomes a wall, the pixel surface is a door, and the door is easier to walk through if you stop trying to map the whole building first. The walls, on the other hand, are the part the platform will keep rebuilding.
