Why Engineers Are Obsessed With P99

If you only watch the average, you are watching the wrong number. P99 is where the money leaks, where the outages start, and where your users quietly decide to leave.

Rickvian Aldi·Software engineer·April 20, 2026·12 min read

The first time a senior engineer stopped me mid-dashboard and said "I don't care about the average, show me P99," I thought he was being fussy. The average looked great. The system was handling traffic. What else did he want?

What he wanted, I figured out later, was the shape of the distribution - not its center of gravity. Averages lie in a very specific way for distributed systems. P99 is how engineers force the lie to show itself.

This is my attempt to write down why P99 is the number that ends up carved on every serious latency dashboard I have ever seen, even though most engineers encounter it as an unexamined default.

The Median Is A Comforting Fiction

Here is the thing the P50 will not tell you.

Imagine a service where 99 out of 100 requests take 50ms, and one request takes 10 seconds. The median is 50ms. The average is around 150ms. Both numbers tell you the service is "fast." Neither of them tells you that 1% of your users just watched a progress spinner for 10 seconds.
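The arithmetic is easy to check. Here is a minimal sketch in Python using the same 99 fast requests and one slow one; the one-second cutoff for "slow" is my own assumption, picked only for illustration:

```python
import statistics

# 99 requests at 50 ms and one at 10 s, as in the example above.
latencies_ms = [50] * 99 + [10_000]

median_ms = statistics.median(latencies_ms)
mean_ms = statistics.fmean(latencies_ms)

# Assumed cutoff for "slow": anything over one second.
SLOW_THRESHOLD_MS = 1_000
slow_fraction = sum(ms > SLOW_THRESHOLD_MS for ms in latencies_ms) / len(latencies_ms)

print(f"median={median_ms} ms, mean={mean_ms} ms, slow={slow_fraction:.0%}")
```

The mean works out to 149.5ms. Nothing in either summary number hints that one request in a hundred took two hundred times longer than the rest.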

At a hundred requests per second, that is one angry user every second. At a thousand requests per second, it is ten. At the scale of an actual product, the users at the tail are not a rounding error - they are a population large enough to notice, complain, and leave.

A healthy-looking distribution with a long right tail. The P50 lives on the left; P99 lives in the territory on the right that averages smooth over.

This is the shape almost every real-world latency distribution has: a fat body on the left, a long tail on the right. The tail is where the interesting stories live. The body is where metrics go to look fine.

The Tail At Scale

The academic argument for P99 - and the one that reframed how I think about system performance - comes from a 2013 paper by Jeff Dean and Luiz Barroso called The Tail at Scale. The paper is short, cited thousands of times, and worth reading on its own. Here is the part that matters.

Modern systems are not monoliths. A single user request, say a search or a product page, fans out into dozens or hundreds of internal sub-requests. The gateway calls auth. Auth calls a user service. The user service calls a feature flag service. Everything calls caches. Everything calls databases. The page cannot render until all of those calls return.

A single user request fans out across a dozen services, caches, and databases. The user only sees the slowest path.

Now do the math. Suppose every service on that path has a healthy P99 of one second - meaning 99% of calls are faster than a second. You might reasonably say each service is performing well.

If the user-facing request depends on 100 independent sub-requests, the probability that none of them land in their slow tail is $0.99^{100}$, which is roughly 0.37. In other words, there is a 63% chance that the user waits at least a second on a sub-request, even though every single service individually meets its SLO.

That number is the entire argument for P99.

Even when every individual service meets its P99 SLO, the probability of a tail hit climbs steeply with fan-out depth. At 100 sub-requests the composite response is more likely than not to catch a slow one.

The individual P99s look fine. The composition of them is catastrophic. As soon as you fan out, rare events stop being rare - they become the dominant user experience. A tail that you could ignore at the component level becomes the median experience at the system level.
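The fan-out math fits in a few lines. This is a rough sketch that assumes the sub-requests are independent, which real systems only approximate; the function names are mine, not from the paper:

```python
import math

def p_any_tail_hit(fan_out: int, per_call_quantile: float = 0.99) -> float:
    """Probability that at least one of `fan_out` independent calls
    lands above the given per-call quantile (for example, above its P99)."""
    return 1.0 - per_call_quantile ** fan_out

def fan_out_for_even_odds(per_call_quantile: float = 0.99) -> int:
    """Smallest fan-out at which a tail hit is more likely than not."""
    return math.ceil(math.log(0.5) / math.log(per_call_quantile))

for n in (1, 10, 100):
    print(n, round(p_any_tail_hit(n), 3))
```

Under that independence assumption, a tail hit becomes more likely than not at a fan-out of 69, well before the 100 sub-requests in the example.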

This is the phenomenon Dean and Barroso called "the tail at scale." It is the uncomfortable reason averages lie. Averages smooth; composition amplifies.

The Money Follows The Tail

The other reason engineers fixate on P99 is that the people paying for the system care about it even when they do not know the acronym.

The numbers from industry have been cited so often they are almost folklore at this point, but the folklore happens to be true:

  • Amazon's internal studies have reported that every additional 100ms of latency costs about 1% of sales.
  • Google has reported that a 400ms increase in search latency produces about a 0.59% drop in searches per user.
  • User research has consistently found that once page load times exceed a few seconds, abandonment climbs sharply.

Notice what these numbers are actually measuring. They are not measuring the average response time. They are measuring what happens when a user's request lands in the slow part of the distribution. The users who feel "the internet is slow today" are the users whose requests took 2 seconds instead of 200ms. That is a P99 problem, not a P50 problem.

When the business tells you "the site feels slow," they are telling you about the tail, even if they are looking at a graph of averages. If you chase the average down from 300ms to 250ms but leave the P99 at 4 seconds, your users will not notice the improvement. The users who cared were the ones in the tail, and they are still there.

P99 Is An Early Warning

There is a second reason to watch P99 that has nothing to do with user experience: it is often the first place failures announce themselves.

A healthy system runs with spare capacity. When capacity is comfortable, a slow GC pause here or a head-of-line blocking moment there gets absorbed. The median does not move. The P99 twitches, briefly, and settles.

As the system creeps closer to its saturation point, the twitches get bigger and more frequent. Queues build up behind transient slowdowns. Threads that would have freed quickly now stay held. The body of the distribution still looks fine - most requests are completing - but the tail is getting longer. P99 starts drifting up while P50 stays flat.

The path from a benign P99 twitch to a full outage. You see it on the P99 graph minutes or hours before it shows up on the P50.

P50 holds flat for most of the degradation window while P99 climbs steadily. The widening gap is the early warning signal. By the time P50 moves, the incident has already started.

By the time the median moves, you are past warning and into incident. By the time the average moves, you are in the post-mortem.
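One way to turn the widening gap into a signal is to track the ratio of P99 to P50 per time window and compare it against a healthy baseline. The sketch below is an illustration, not a production alerting rule: the nearest-rank percentile, the function names, and the growth factor of 2 are all assumptions of mine.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample that has at least
    q percent of the data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def tail_gap(samples):
    """Ratio of P99 to P50 for one window of latency samples."""
    return percentile(samples, 99) / percentile(samples, 50)

def is_tail_widening(baseline_window, current_window, max_growth=2.0):
    """True when the P99/P50 ratio has grown by more than `max_growth`
    relative to the baseline window. The factor is an assumed threshold."""
    return tail_gap(current_window) > max_growth * tail_gap(baseline_window)
```

The point of using a ratio is that it fires while the median is still flat, which is exactly the window where you still have time to act.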

This is the operational case for watching P99 even when the business case is not obvious. The 99th percentile is the canary. The median is the corpse.

Why Not P95? Why Not P99.9?

A fair question at this point is: if we are chasing the tail, why stop at P99? Why not P99.9, or P99.99?

The honest answer is: it depends on scale, and on what you can afford to care about.

At low traffic, P99 is already measuring something rare. At a thousand requests per second, P99 is ten requests a second in the slow bucket. P99.9 is one. P99.99 is one every ten seconds. Drop to a few requests per second and P99.99 becomes roughly one request per hour, and you cannot meaningfully alert on a percentile that fires once per hour.

At high traffic - the Googles, Metas, Cloudflares of the world - the math flips. For them, a thousand requests per second is a quiet moment. Their user-facing systems operate at millions of requests per second, and P99.9 or P99.99 becomes the right level of resolution, because "one in a thousand" is still thousands of requests per second.

P95 exists for the opposite reason. At small scales, or for systems where 5% of slow requests is acceptable, P95 is a cheaper number to keep green. It is forgiving in a way P99 is not.

My rule of thumb is: the percentile you should target is the one where the count of "slow requests per unit time" is high enough to notice and low enough to fix. For most services I have worked on, that has landed on P99. For systems operating at web scale, it is usually P99.9. The principle is the same either way: you are targeting the tail, not the center.
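To apply the rule of thumb to your own traffic, the only arithmetic needed is "requests per second times the fraction above the percentile." A small sketch follows; the function name is hypothetical, and percentiles are passed as strings only so that `Fraction` keeps the arithmetic exact:

```python
from fractions import Fraction

def tail_requests_per_second(rps: int, percentile: str) -> Fraction:
    """Requests per second that fall above the given percentile.
    `percentile` is a string such as "99.9" so the arithmetic stays exact."""
    return rps * (1 - Fraction(percentile) / 100)

for p in ("95", "99", "99.9", "99.99"):
    per_second = tail_requests_per_second(1_000, p)
    print(f"P{p}: {float(per_second):g} slow requests/s")
```

Run it with your own request rate and look for the percentile whose count is high enough to notice and low enough to fix.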

At 10,000 requests per second, P95 fires 500 times a second - too noisy to act on. P99.99 fires about once a second - too sparse to alert on reliably. P99, at 100 a second, sits in the actionable middle.

What Watching P99 Actually Changes

Once you take P99 seriously, several engineering practices stop looking optional.

Timeouts become budgets, not upper bounds. If the user-facing request budget is 1 second and you fan out to five sub-requests, each sub-request gets a fraction of that budget - not the full second. Sub-request timeouts that equal the full user budget are a way of guaranteeing that P99 at the leaf becomes P99 at the root.
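A common way to implement this is to carry one deadline through the whole request and derive each sub-request timeout from whatever is left. The sketch below is one possible shape under that assumption; the `Deadline` class and its method names are hypothetical, not taken from any particular framework:

```python
import time

class Deadline:
    """A request-wide time budget that every sub-request draws from."""

    def __init__(self, budget_s: float, clock=time.monotonic):
        # `clock` is injectable so the class can be tested without sleeping.
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        """Seconds left in the overall budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for(self, per_call_cap_s: float) -> float:
        """Timeout for the next sub-request: its own cap, or whatever
        is left of the overall budget, whichever is smaller."""
        return min(per_call_cap_s, self.remaining())
```

A handler would create `Deadline(1.0)` once at the edge and pass `deadline.timeout_for(0.25)` to each downstream call, so a slow leaf can never consume more than what the root still has to spend.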

Hedged requests and request duplication become real tools. The Dean and Barroso paper's most practical advice is to send a duplicate request if the first one has not returned by, say, P95. The second request gives you another independent chance to avoid the slow tail. The cost is a small amount of extra load (hedging at P95 means only about 5% of requests ever get a duplicate); the benefit is that your user-visible P99 becomes close to your server-side P95.
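Here is roughly what a hedged request looks like with `asyncio`. Treat it as a sketch under assumptions: `make_request` is a hypothetical zero-argument callable that returns a fresh coroutine each time, the request must be safe to send twice, and `hedge_after_s` would come from your own observed P95.

```python
import asyncio

async def hedged_call(make_request, hedge_after_s: float):
    """Return the first result from a primary request or its hedge.

    make_request: hypothetical zero-argument callable returning a new coroutine.
    hedge_after_s: how long to wait before sending the backup (e.g. observed P95).
    """
    primary = asyncio.create_task(make_request())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()  # fast path: no extra load was generated

    # The primary is in its slow tail; race it against a second attempt.
    backup = asyncio.create_task(make_request())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # client-side only; the server may still finish the work
    return next(iter(done)).result()
```

This version returns whichever attempt finishes first, including a failure. A fuller implementation would probably wait for the other attempt when the first one raises, and would cancel the losing request on the server side as well.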

Load shedding starts to make sense. If slow requests are the enemy, a service that rejects 1% of requests cleanly under overload is better than a service that queues them and serves all of them slowly. The first service has a controlled error rate and a healthy P99. The second has a beautiful success rate and a terrifying P99.
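The simplest form of load shedding I know is a cap on concurrent requests: admit up to a limit, reject everything else immediately instead of queueing it. A minimal sketch follows; the `LoadShedder` name and the idea of a fixed limit are assumptions for illustration, and real systems often adapt the limit dynamically.

```python
import threading

class LoadShedder:
    """Admit at most `max_in_flight` concurrent requests; reject the rest at once."""

    def __init__(self, max_in_flight: int):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if the request may proceed, False if it should be shed."""
        with self._lock:
            if self._in_flight >= self._max:
                return False  # caller should fail fast, e.g. with an HTTP 503
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Call exactly once when an admitted request finishes."""
        with self._lock:
            self._in_flight -= 1
```

A handler would call `try_acquire()` on entry, return an error right away if it gets `False`, and call `release()` in a `finally` block otherwise. The rejected requests show up as errors, not as ten-second responses.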

Observability shifts from averages to histograms. An average cannot be decomposed. A histogram can. Once you store full latency histograms - not just means and maxes - you can compute any percentile you want, slice by endpoint, and compare tail behavior across deploys. Most reliability work I have done in the last few years started with "can we actually see the distribution?" and the answer was uncomfortably often "no."
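To make "a histogram can be decomposed" concrete, here is a small bucketed histogram that can be merged across hosts or time windows and still answer percentile questions. It is a sketch: the bucket bounds are assumed values, and the percentile it reports is only the upper bound of the bucket the true value falls in.

```python
import bisect

# Assumed bucket upper bounds in milliseconds; one extra bucket catches overflow.
BOUNDS_MS = [10, 25, 50, 100, 250, 500, 1_000, 2_500, 5_000, 10_000]

class LatencyHistogram:
    def __init__(self):
        self.counts = [0] * (len(BOUNDS_MS) + 1)

    def record(self, latency_ms: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= the sample.
        self.counts[bisect.bisect_left(BOUNDS_MS, latency_ms)] += 1

    def merge(self, other: "LatencyHistogram") -> "LatencyHistogram":
        """Combine two histograms by adding bucket counts."""
        merged = LatencyHistogram()
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged

    def percentile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-th percentile."""
        target = q * sum(self.counts) / 100
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if count and seen >= target:
                return BOUNDS_MS[i] if i < len(BOUNDS_MS) else float("inf")
        raise ValueError("histogram is empty")
```

Two averages from two hosts cannot be combined into a P99, but two of these histograms can: add the counts and ask again. The price is resolution, which is set entirely by how you choose the bucket bounds.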

The Mental Model That Stuck

The shift that took me a long time to make is this: I used to think of a service as having a single "speed." The average, the typical response, the thing I would quote if someone asked. A service was "fast" or "slow."

A service does not have a single speed. It has a distribution. Every single request samples from that distribution independently. At the scale of one user, the sample is the experience. At the scale of many users fanning out across many services, the joint probability of every sample avoiding the tail collapses fast.

P99 is a shorthand for reasoning about the distribution's right edge - the place where the business loses money, where the operator loses sleep, and where the system shows you what is about to break before it actually breaks.

The Short Version

Averages are a story about a system that does not exist. P50 is a story about the user who did not have a problem. P99 is a story about the users who did.

Once you fan out, the tail eats you. Once you are at scale, the tail is the product. Once you are in an incident, the tail told you first.

That is why every serious latency dashboard, at every company I have ever respected the engineering of, has P99 on it in large type. Not because it is a best practice. Because everything else, sooner or later, lies.

References

  • Dean, J., & Barroso, L. A. (2013). The Tail at Scale. Communications of the ACM, 56(2), 74–80.
  • Puzhavakath Narayanan, S., et al. (2016). Reducing latency through page-aware management of web objects by content delivery networks. ACM SIGMETRICS Performance Evaluation Review, 44(1), 89–100.
  • Google Research. (2015). User Preference and Search Engine Latency.
  • Zhao, Z., et al. (2025). SLOpt: Serving real-time inference pipeline with strict latency constraint. IEEE Transactions on Computers.
