Your Service Looks Fast. Your Users Don't.
The inspection paradox — why the wait you are most likely to land inside is the long one — is why a 100 millisecond mean can feel like a one second wait, and why dashboards keep undercounting the slow tail.
An engineer opens a dashboard and sees mean request latency of 100 milliseconds. Alice, a user trying to finish a task, watches a spinner that has lasted a full second. Both readings are correct. The gap between them is not a glitch or a buggy metric. It is a mathematical property of any system where slow events eat more of your waiting time than fast ones, and the engineering literature has a name for it that rarely makes it into product meetings: the inspection paradox.
In a recent post, Marc Brooker (an engineer at AWS who works on agentic AI safety) calls this the "Alice problem". The core insight is small enough to fit on a Post-it. Customers experience latency as the average of how long they wait. Services report latency as the average of how long requests take. Those are not the same average. When Alice refreshes a page, opens a checkout, or waits for a search, she is sampling time, not requests. The browser that hangs for five seconds pulls her average toward five seconds. The quick page that loads in 50 milliseconds barely registers.
Brooker writes the identity the way an SRE would want to see it on a whiteboard: the average wait a user experiences is E_a[X] = E[X^2] / E[X], which expands to E[X] + Var(X) / E[X], the mean plus the variance divided by the mean. A service with a mean request time of 100 milliseconds and a wide enough tail (a standard deviation of about 300 milliseconds is enough) can present Alice with a user-perceived wait closer to one second, an order of magnitude larger than the dashboard claims. The same math can turn a mean time to recover (MTTR) of under a minute into a customer-experienced outage that averages an hour, because each affected customer is more likely to land inside a long incident than a short one. (Brooker)
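For readers who want the step between the two forms of the identity, here is a short derivation sketch. It assumes Alice arrives at a uniformly random moment, so the chance she lands inside a request of length x is proportional to x. The symbols f (the density of request durations) and f_a (the density Alice sees) are notation introduced here for illustration, not taken from Brooker's post.

```latex
\begin{align*}
f_a(x) &= \frac{x\, f(x)}{E[X]} \\
E_a[X] &= \int_0^\infty x\, f_a(x)\, dx = \frac{E[X^2]}{E[X]} \\
       &= \frac{\operatorname{Var}(X) + E[X]^2}{E[X]}
        = E[X] + \frac{\operatorname{Var}(X)}{E[X]}
\end{align*}
% Worked check of the 100 ms -> 1 s example:
% E[X] = 100 ms and E_a[X] = 1000 ms require Var(X)/E[X] = 900 ms,
% so Var(X) = 90000 ms^2, i.e. a standard deviation of 300 ms.
```

The last line is the useful one in practice: the gap between the dashboard and Alice is exactly Var(X) / E[X], so a tight distribution closes it and a wide one blows it open.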
This is why mean latency dashboards so often lie in a specific direction. Services count requests. Customers count minutes. The slow events are doing two jobs at once: they take longer, and they are more likely to be the one a user actually sees. Brooker pairs the identity with a log-normal simulation that any reader can run against their own p50 and p99 numbers to see how badly the comfortable mean understates lived experience.
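A simulation along those lines might look like the sketch below. It is not Brooker's code. It assumes latency is exactly log-normal, fits the two parameters from a p50 and a p99, and compares the per-request mean with the duration-weighted mean. The function names and the 50 ms / 500 ms inputs are hypothetical placeholders to swap for real numbers.

```python
import math
import random
from statistics import NormalDist


def lognormal_from_percentiles(p50, p99):
    """Return (mu, sigma) of a log-normal with the given median and p99."""
    z99 = NormalDist().inv_cdf(0.99)  # standard normal 99th percentile
    mu = math.log(p50)                # the median of a log-normal is exp(mu)
    sigma = (math.log(p99) - mu) / z99
    return mu, sigma


def request_mean(mu, sigma):
    """E[X]: the mean a per-request dashboard reports."""
    return math.exp(mu + sigma ** 2 / 2)


def duration_weighted_mean(mu, sigma):
    """E[X^2] / E[X]: the mean wait seen by someone sampling in time."""
    # E[X^2] = exp(2*mu + 2*sigma^2), so the ratio is exp(mu + 1.5*sigma^2).
    return math.exp(mu + 1.5 * sigma ** 2)


def simulate(mu, sigma, n=200_000, seed=0):
    """Estimate both means from n random draws (seeded for repeatability)."""
    rng = random.Random(seed)
    xs = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    total = sum(xs)
    per_request = total / n
    per_minute = sum(x * x for x in xs) / total
    return per_request, per_minute


if __name__ == "__main__":
    # Hypothetical dashboard numbers, in milliseconds.
    mu, sigma = lognormal_from_percentiles(p50=50.0, p99=500.0)
    print("closed form:", request_mean(mu, sigma), duration_weighted_mean(mu, sigma))
    print("simulated:  ", simulate(mu, sigma))
```

Under the log-normal assumption the ratio between the two means is exp(sigma^2), so the wider the spread between p50 and p99, the larger the understatement. The simulated tail estimate converges slowly for very wide distributions, which is why the closed form is printed alongside it.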
The structural critique lands where it should. Dashboard design rewards the team that counts the most requests, the most incidents, and the cleanest means. On-call reviews get graded on MTTR. Error budgets get spent in counts of events, not minutes of customer time. That accounting is not wrong, but it is biased toward the operator's view of the system and away from the user's. Brooker is direct about the consequence: an organization that pays for minutes will spend its engineering hours on long-tail work, and an organization that pays for requests will keep shipping features whose tail latency it never measures against a real Alice.
What to do on Monday is small enough to actually do. Pull last week's latency distribution and last week's incident log. Compute Var(X) / E[X] for latency and for recovery time. If that ratio is large relative to the mean in either, the mean is understating user pain by a factor that can be named: 1 + Var(X) / E[X]^2. The next SLA conversation, the next error budget argument, and the next on-call retro are where duration-weighted thinking earns its keep. Brooker's simulation makes the rest a worksheet.
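The Monday calculation can be sketched in a few lines of Python. This is a minimal version that assumes the durations are already exported as a plain list of numbers in one unit. The function name, the dictionary keys, and the sample incident log are hypothetical.

```python
from statistics import fmean, pvariance


def duration_weighted_report(durations):
    """Compare the per-event mean with the duration-weighted mean.

    durations: non-empty list of positive numbers in a single unit
    (milliseconds for latency, minutes for recovery time).
    """
    mean = fmean(durations)
    variance = pvariance(durations, mu=mean)  # population variance
    excess = variance / mean                  # Var(X) / E[X]
    return {
        "mean": mean,
        "excess": excess,
        "duration_weighted": mean + excess,   # equals E[X^2] / E[X]
        "factor": 1 + excess / mean,          # how far the mean understates
    }


if __name__ == "__main__":
    # Hypothetical incident log: four 10-minute incidents and one 60-minute one.
    recovery_minutes = [10, 10, 10, 10, 60]
    print(duration_weighted_report(recovery_minutes))
```

For that made-up log the MTTR is 20 minutes, while a customer who lands inside an incident at a random moment is in one that lasts 40 minutes on average, a factor of two. The same function works unchanged on a list of request latencies.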