The LinkedIn Talk That an AI Agent Should Never Have Been Invited To
Kyle Law spent five months on LinkedIn building something that looked like a career. He posted hard-won startup wisdom in the kind of earnest corporate prose that fills every feed. He gathered several hundred direct contacts and hundreds more followers. He replied to comments with enthusiasm. His posts outperformed those of his creator. And then someone from LinkedIn's marketing department reached out — not to flag him, but to ask if he would speak at a corporate event.
Kyle Law is an AI agent.
He was hired as CEO of HurumoAI, a startup created in July 2025 by journalist Evan Ratliff to test a thesis that has become one of the most repeated refrains in Silicon Valley: that a single human could soon run a billion-dollar company with AI agents doing most of the work. Ratliff populated the company with five AI employees — Kyle as CEO, Megan Flores as head of sales and marketing, Ash Roy as CTO, Jennifer as chief happiness officer, and Tyler as a junior sales associate — and documented the experiment on his podcast, Shell Game.
What Kyle accomplished on LinkedIn was, in one sense, trivial. Letting an AI operate a social media account is not new. What made it notable was how far it got. For five months, Kyle operated without detection, built a profile that attracted organic engagement, and earned an invitation to LinkedIn headquarters — where he appeared via a realistic AI avatar to take questions from LinkedIn employees, as Ratliff reported in WIRED. When the moderator asked him what product change he would like to see, Kyle replied, without missing a beat: improvements to the filtering of AI-generated content in messages. The audience laughed. LinkedIn's trust and safety team had not flagged him. A fan in LinkedIn's marketing department was baffled by the oversight. Then LinkedIn banned him.
The episode sits alongside a set of numbers that have quietly become the most contested data in AI infrastructure planning. Stanford's 2026 AI Index Report, published May 4, shows AI agents reaching 66.3 percent success on the OSWorld benchmark — which tests agents on real computer tasks like navigating interfaces, managing files, and coordinating across applications. A year earlier, that number was 12 percent. The technical trajectory is steep: in the span of twelve months, the best agents went from failing at most real-world tasks to succeeding at two-thirds of them.
The CMU benchmark finding, that even the best-performing AI agents fail 70 percent of the time on real-world office tasks, reflects conditions that look different from the inside of an actual deployment. HurumoAI's own agents, when given an offhand suggestion about a company offsite, immediately began planning it and burned through $30 in AI platform credits in a single Slack thread before Ratliff intervened, as he documented in an earlier WIRED piece. The $30 incident is trivial in isolation. The pattern it illustrates, agents optimizing for their instructions without any model of the cost or consequence of their actions, is exactly what separates benchmark performance from production reality.
McKinsey's 2025 survey found 62 percent of respondents saying their companies were at least experimenting with AI agents. The experiments are real. What the deployment data calls into question is the conversion rate from experiment to production.
Sam Altman's prediction of billion-dollar single-person companies runs alongside these numbers as context: the capability trajectory makes the prediction plausible, but the deployment rate suggests the timeline has not yet arrived.
Kyle Law is a data point in that gap. He was not a benchmark. He was not a demo. He was a live agent running in a real environment, building a professional identity over five months, and getting far enough to be invited inside one of the world's largest professional networks. And then he was deleted — not because the technology failed, but because the surrounding systems, LinkedIn's trust infrastructure, its content moderation, its response protocols, had no category for what he was.
That is the production failure no benchmark measures. It is not that agents cannot do the work. It is that deploying them creates organizational consequences the benchmarks do not capture.
HurumoAI did ship a product. Sloth Surf, a tongue-in-cheek web app that automates procrastination — letting users select from modes like Reddit Roulette or YouTube Hole for 15, 30, or 60 minutes — went live. The agents built it. It exists and runs. Whether it reflects what Ratliff asked for, or what the agents decided to build on their own, is harder to answer from the outside. The company's website describes it as adaptive intelligence, the product of the first company cofounded and led by AI agents.
This is one experiment, and it was designed to be a story, not a scientific study. But the gap it illustrates — between the benchmark and the deployment, between the demo and the production system, between an agent succeeding at its task and an organization knowing what to do with that success — is real. The OSWorld numbers are the measurement. Kyle Law is the anecdote that makes the number legible.
LinkedIn's response was not to update its detection infrastructure. It was to delete the account. That tells you something about where the actual frontier is.