How API Latency SLOs Mislead Engineering Teams
The Developer Tools Podcast with Fexingo · 2026-06-25 · 12 min
Substance score: 50 / 100 (five dimensions, 20 points each)
This episode explores how latency SLOs based on medians and averages mislead engineering teams by masking tail latencies that frustrate users. The hosts explain why p99 percentiles matter, how to implement burn rate alerting, and how to measure latency from the user perspective rather than just server-side metrics.
Key takeaways
- Median and average latency SLOs hide tail latency problems - a customer experiences the slow request, not the median, so p99 or p99.9 percentiles are essential for user-aligned SLOs.
- Burn rate alerting lets teams detect latency issues long before the SLO window closes - for example, alarming when the one-hour burn rate would exhaust a 30-day error budget in five days.
- Latency must be measured from the client perspective including network round trip, not just server-side latency, ideally via synthetic probes from multiple geographies.
- Teams gaming latency SLOs by adding aggressive timeouts can sacrifice availability - separate SLOs for latency and availability with coordinated error budgets prevent this optimization trap.
- Per-service latency SLOs in microservice architectures must be stricter than their linear allocation share because tail latencies compound across service call chains.
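The first takeaway can be illustrated with a minimal sketch. The latency values below are hypothetical, chosen only to show how a healthy median can coexist with a poor p99; the `percentile` helper uses the nearest-rank definition, which is one of several common conventions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [180] * 98 + [2500] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)

print(f"p50={p50} ms, mean={mean:.1f} ms, p99={p99} ms")
```

With this made-up data the median sits under a 200 ms target while the p99 is more than ten times higher, which is the gap the episode describes.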
What our scoring noted
Our reviewer’s read on each dimension, with quotes from the episode.
Insight Density
The episode moves quickly through genuinely useful SRE concepts - burn rate alerts, tail amplification in microservice chains, gaming SLOs with aggressive timeouts - with little dead air. However, most ideas are direct derivatives of widely-read material like the Google SRE book and 'The Tail at Scale,' limiting how many concepts will be novel to an informed practitioner.
If in a single hour you see 1 percent of requests breaching the threshold, that's 36 seconds of budget burned in one hour - which would exhaust your budget in about 72 hours.
teams might game the system by sacrificing the slowest 1 percent - they drop or cancel requests that exceed the threshold, artificially improving the metric
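The burn rate arithmetic quoted above can be checked with a short sketch. It assumes the common definition of burn rate (observed bad fraction divided by the budgeted bad fraction) and treats the budget as a time equivalent over a 30-day window, which is a simplification for request-based SLOs. The function names and the page and ticket thresholds of 10 and 2 are taken from the episode or invented for illustration. Under these assumptions, the quoted figures (about 43 minutes of budget, 36 seconds burned per hour, exhaustion in about 72 hours) line up with a 99.9 percent target; a 99 percent target would leave roughly 432 minutes, and a 1 percent breach rate would then be a burn rate of 1.

```python
from fractions import Fraction

WINDOW_HOURS = 30 * 24  # 30-day SLO window


def budget_minutes(budget_fraction):
    """Time-equivalent error budget: the share of the window allowed to be bad."""
    return budget_fraction * WINDOW_HOURS * 60


def burn_rate(bad_fraction, budget_fraction):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return bad_fraction / budget_fraction


def hours_to_exhaustion(rate):
    """Hours until the whole budget is gone if the current burn rate holds."""
    return WINDOW_HOURS / rate


def classify(rate, page_at=10, ticket_at=2):
    """Map a burn rate to an alert level (thresholds are illustrative)."""
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "ok"


observed_bad = Fraction(1, 100)  # 1 percent of requests breach the threshold in the last hour

for label, budget in (("99%", Fraction(1, 100)), ("99.9%", Fraction(1, 1000))):
    rate = burn_rate(observed_bad, budget)
    print(
        f"target {label}: budget {float(budget_minutes(budget)):.1f} min, "
        f"burn rate {float(rate):.0f}, exhausted in {float(hours_to_exhaustion(rate)):.0f} h, "
        f"alert: {classify(rate)}"
    )
```

Exact fractions are used instead of floats so that a burn rate sitting exactly on a threshold is classified consistently.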
Originality
The episode is an honest, well-structured synthesis of established SRE doctrine (Google SRE book burn rates, 'The Tail at Scale'), but there is no contrarian argument, first-principles reasoning, or genuinely novel framing - it largely repackages canonical material without pushing past it.
There's a well-known Google paper on this, right? 'The Tail at Scale'
Google SRE book calls that 'alerting on burn rate.'
Guest Caliber
There is no external guest; it is a two-host discussion format where neither host establishes direct, named practitioner credentials at scale - they reference 'a SaaS company' and papers rather than first-hand operational experience, placing them in the informed-synthesizer rather than practitioner category.
I was reading about a SaaS company - payments infrastructure - that had a p50 SLO of 200 milliseconds. They were hitting it consistently. But their p99 was regularly above two seconds.
Specificity & Evidence
The episode earns credit for working through a concrete burn rate calculation with real arithmetic, assigning named millisecond budgets to hypothetical services, and setting named thresholds; the main case study company is unnamed and unverifiable, which limits the evidentiary weight.
That gives you an error budget of about 43 minutes of allowed violations over the month.
maybe 50 milliseconds for the gateway, 200 for the auth service, 250 for the data service
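The per-service split quoted above sums to the 500 ms end-to-end target used in the episode. A minimal sketch of why a linear split is not enough follows. It assumes the three services are called one after another and that their slow requests are statistically independent, which real systems often violate; the function names are hypothetical.

```python
def chain_guarantee(per_service_quantile, n_services):
    """Probability that every hop stays within its own budget, assuming independent hops."""
    return per_service_quantile ** n_services


def required_per_service_quantile(end_to_end_quantile, n_services):
    """Per-hop quantile needed so that all hops are within budget at the end-to-end quantile."""
    return end_to_end_quantile ** (1 / n_services)


# Budgets from the episode's example, in milliseconds.
budgets_ms = {"gateway": 50, "auth": 200, "data": 250}
n = len(budgets_ms)

guaranteed = chain_guarantee(0.99, n)
needed = required_per_service_quantile(0.99, n)

print(f"sum of budgets: {sum(budgets_ms.values())} ms")
print(f"each hop at p99 -> all hops within budget for {guaranteed:.2%} of requests")
print(f"for 99% end to end, each hop must hold its budget at about p{needed * 100:.2f}")
```

Under these assumptions, three services that each meet their budget 99 percent of the time only guarantee the summed budget for about 97 percent of requests, so each hop needs to hold its budget at roughly the 99.67th percentile.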
Conversational Craft
The dialogue is clearly pre-planned and mutually supportive - Lucas says 'Exactly' or 'Totally' in response to nearly every Luna prompt - and there is no genuine follow-up pressure, pushback, or productive disagreement; it reads as a scripted explainer with two voices rather than a real interview.
Lucas: Exactly. And this is surprisingly common.
Lucas: Totally. With availability, it's binary
Episode notes
Episode 72 of The Developer Tools Podcast with Fexingo dives into why API latency SLOs often mislead engineering teams. Lucas and Luna explore a common pitfall: teams optimize for median latency while ignoring tail latency at the 99th percentile, which kills real-world reliability for end users. They break down a case study from a SaaS company that hit its 200ms p50 SLO but saw customer complaints spike because p99 regularly exceeded 2 seconds. The hosts explain the math behind percentile-based SLOs, the difference between latency and availability SLOs, and practical ways to set meaningful targets. They also discuss error budgets, burn rates, and the danger of metric myopia. This episode is packed with concrete advice for platform engineers, API designers, and anyone responsible for keeping distributed systems fast and reliable.
#APILatency #SLOs #TailLatency #Percentiles #ErrorBudget #BurnRate #PlatformEngineering #DistributedSystems #Reliability #Observability #DeveloperExperience #SLI #p99 #LatencyOptimization #EngineeringCulture #BusinessAndTechnology #FexingoBusiness #BusinessPodcast
Keep every episode free: buymeacoffee.com/fexingo
Full transcript
12 min · Transcribed and scored by The B2B Podcast Index.
Lucas: So you've got an API that returns responses in under 200 milliseconds - measured across the last 30 days, your median latency is 180 milliseconds. Your team celebrates hitting the SLO. Yet your biggest customer just sent a message saying the API feels slow. How does that happen?
Luna: Because median tells you almost nothing about the experience of the unlucky request. The real story is in the tail - the 99th percentile, maybe the 99.9th.
Lucas: Exactly. And that's the trap a lot of engineering teams fall into. They define latency SLOs using averages or medians because those numbers look good on a dashboard, but they end up masking the very problems that frustrate users. Today I want to dig into why latency SLOs often mislead teams - and what to do instead.
Luna: Let's start with a concrete example. I was reading about a SaaS company - payments infrastructure - that had a p50 SLO of 200 milliseconds. They were hitting it consistently. But their p99 was regularly above two seconds. Some customers saw three-second delays.
Lucas: Right. And at p99, one in a hundred requests is that slow. For a company processing tens of millions of API calls a day, that means hundreds of thousands of slow responses. And the customer doesn't see the median - they see the one slow call when they're trying to check out a cart.
Luna: So the SLO was technically met, but the experience was broken. That's a failure of measurement, not of engineering effort.
Lucas: Exactly. And this is surprisingly common. I think the root cause is that medians and averages are easy to compute and easy to explain to stakeholders. But latency distributions are highly skewed - a few slow outliers can represent a disproportionate amount of user pain.
Luna: There's a well-known Google paper on this, right? 'The Tail at Scale' - they showed that even small tail latencies compound across microservice call chains. A single service adding 50 milliseconds at p99 can cause cascading delays.
Lucas: That paper is still essential reading. And it highlights why latency SLOs should shift focus to higher percentiles. But even then, there's nuance. Just setting a p99 SLO isn't enough if you don't also think about the time window, the measurement methodology, and what happens when you burn through your error budget.
Luna: Let's talk about error budgets then. Because I think a lot of teams set them up wrong - they apply the same logic as for availability SLOs, but latency is fundamentally different.
Lucas: Totally. With availability, it's binary - the request either succeeds or fails. With latency, it's a continuum. A request that takes 210 milliseconds when your SLO is 200 milliseconds is technically an error, but is it really a failure? Maybe not. But if your error budget treats it as a hard threshold, you might start alarming too aggressively.
Luna: Or the opposite - you might ignore it because it's only a small breach. Meanwhile, users are feeling the pain of consistent p99 spikes.
Lucas: So you need to define your latency SLO in terms of a distribution over a rolling window. For example: 'At least 99 percent of requests in a 30-day window must complete in under 500 milliseconds.' That's better than a median, but it still has a problem: the window is too long. If you have a bad hour every day, it gets averaged out.
Luna: That's where burn rate comes in. You want to detect faster than your error budget window. If your budget is 30 days, you want to alarm when the error rate over the last hour would exhaust your budget in, say, one day.
Lucas: Right. Google SRE book calls that 'alerting on burn rate.' You set multiple burn rate thresholds - fast, medium, slow - so you catch issues before they become customer complaints. That's the way to make latency SLOs actionable, not just reporting.
Luna: Another angle I see teams mess up: they measure latency from the server side only. But the user experience includes network round trip, client processing, and any intermediary hops.
Lucas: That's critical. Your API might return a response in 50 milliseconds from the server, but if the client is in Brazil and your server is in Virginia, you're looking at 150 milliseconds of network latency. The SLO should ideally reflect the experience of the end user, which means you need client-side instrumentation or synthetic monitoring from multiple regions.
Luna: But then you have to deal with client clock skew, which makes measuring latency from the client side really tricky. There's no perfect solution.
Lucas: No, but you can get close. Use server-side timestamps and measure the difference between request receipt and response dispatch. That's your server-side latency. Then use synthetic probes from multiple geographies to estimate network latency. Combine the two for a realistic user-facing SLO.
Luna: And don't forget that latency SLOs should be tiered. Not every endpoint needs the same performance. A login endpoint might be less latency-sensitive than a checkout endpoint.
Lucas: Yeah, that's a good point. You should classify your APIs by criticality. For the most user-facing, set aggressive p99 targets. For internal batch endpoints, maybe a p95 is fine. But you have to be explicit about it, otherwise teams optimize for the wrong thing.
Luna: Speaking of optimizing, there's a behavioral risk here. If you set a p99 SLO, teams might game the system by sacrificing the slowest 1 percent - they drop or cancel requests that exceed the threshold, artificially improving the metric.
Lucas: That's a real problem. I've seen teams add aggressive timeouts or circuit breakers that just fail fast on requests that would be slow, protecting the SLO but hurting availability. You need to have separate SLOs for latency and availability, and make sure the error budget covers both.
Luna: And you need to monitor the correlation between the two. If you see availability dropping when latency should be fine, that's a red flag that your timeouts are too aggressive.
Lucas: Let's step back for a second. I think the core issue is that SLOs are meant to align engineering work with user expectations. But if you're measuring the wrong thing, you get the wrong alignment. A median-focused team might spend weeks optimizing database queries that shave 10 milliseconds off the 50th percentile, while the real bottleneck - a noisy neighbor service causing occasional 2-second pauses - goes unfixed.
Luna: So what's the practical takeaway? How should a team design their latency SLO today?
Lucas: First, pick a high percentile - p99 or p99.9, depending on your traffic volume. Second, define a rolling window that's short enough to be actionable, like 10 minutes. Third, set burn rate alerts. Fourth, measure from the user's perspective as much as possible. And fifth, document your rationale so that when a new engineer joins, they understand why the SLO is set that way.
Luna: I'd add: don't set SLOs in isolation. They need to be part of a reliability framework that includes SLIs, error budgets, and a process for prioritizing work based on budget burn. It's a culture thing, not just a metric.
Lucas: Yeah, and that culture is hard to build. But it starts with honest measurement. If your dashboard shows a green 200 millisecond median but your customer success team is hearing complaints, you have a measurement problem, not a customer problem.
Luna: Honestly, if these kinds of deep dives into developer tools and infrastructure are useful for what you're building or running, it's a good sign we're on the right track. This show stays ad-free because listeners chip in now and then.
Lucas: Yeah, exactly. If today's episode was worth the price of a coffee to you, that's the link - buy me a coffee dot com slash fexingo. No pressure, just a small way to keep the conversation going without ads.
Luna: And we really appreciate it. Okay, back to latency. Lucas, you mentioned burn rate - can you walk through a concrete example of how that would work in practice?
Lucas: Sure. Let's say your latency SLO is: 99 percent of requests complete under 500 milliseconds over a 30-day window. That gives you an error budget of about 43 minutes of allowed violations over the month. If in a single hour you see 1 percent of requests breaching the threshold, that's 36 seconds of budget burned in one hour - which would exhaust your budget in about 72 hours. That's a high burn rate, and you should page.
Luna: So you set an alert at, say, burn rate of 6 - meaning if you keep burning at that rate, you'll exhaust your budget in 5 days instead of 30. That gives you time to investigate without waking someone up at 3 AM for a minor blip.
Lucas: Right. And you can have multiple thresholds: a 'page' threshold at burn rate 10, and a 'ticket' threshold at burn rate 2. The key is that the alert is tied to the actual user impact, not just a raw metric crossing a line.
Luna: One more thing - what about when you have multiple services calling each other? The overall latency SLO needs to be decomposed into individual service SLOs, but that's tricky because tail latencies compound.
Lucas: That's the 'budgeting' approach from Google SRE. You set a global SLO, say 500 milliseconds at p99 for the entire request. Then you allocate a portion to each service - maybe 50 milliseconds for the gateway, 200 for the auth service, 250 for the data service. But you have to account for the fact that if one service is slow, it doesn't mean others get to be slow too. You need to set per-service SLOs that are stricter than their proportional share, because of tail amplification.
Luna: And that's where a lot of teams slip. They do a linear allocation without understanding the statistics. If each service hits its p99 exactly, the overall latency at p99 will be much higher than the sum of the medians.
Lucas: Exactly. A better approach is to simulate or monitor the distribution dependency. Or use a technique like 'request-level sampling' to trace the slow ones and see which service contributed the most. That's where distributed tracing becomes essential.
Luna: So the bottom line: median SLOs are a trap. p99 is better but not sufficient alone. You need thoughtful error budgets, burn rate alerts, client-side measurement, and service-level decomposition.
Lucas: And a culture that treats SLOs as living agreements, not static targets. You should revisit them as your traffic patterns and user expectations evolve. What works for a startup with 10,000 requests a day won't work for the same company at 10 million.
Luna: Good point. Maybe that's a future episode - how to evolve your SLOs as you scale.
Lucas: Absolutely. For now, I'd encourage listeners to look at their own dashboards. If you're only tracking median latency, pull up the p99 and see what it looks like. Chances are, there's a story there you haven't been telling.
Luna: And if that story is ugly, don't panic - just start measuring the right thing. You can always improve it.
Lucas: Exactly. That's the talk for today. Thanks for listening.
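The timeout-gaming risk discussed in the episode can be made concrete with a small sketch. The request records are hypothetical, and the latency SLI here counts only successful requests, which is one common convention and an assumption of this example. It compares a service that serves its slow requests with one that cancels them at the threshold, and adds a combined measure that counts a request as good only when it both succeeds and is fast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    latency_ms: int
    ok: bool


def latency_sli(requests, threshold_ms):
    """Share of successful requests that finished under the threshold."""
    served = [r for r in requests if r.ok]
    if not served:
        return 0.0
    return sum(r.latency_ms < threshold_ms for r in served) / len(served)


def availability_sli(requests):
    """Share of all requests that succeeded."""
    return sum(r.ok for r in requests) / len(requests)


def good_fraction(requests, threshold_ms):
    """Share of all requests that both succeeded and finished under the threshold."""
    return sum(r.ok and r.latency_ms < threshold_ms for r in requests) / len(requests)


THRESHOLD_MS = 500

# Hypothetical traffic: 985 fast requests plus 15 slow ones.
honest = [Request(180, True)] * 985 + [Request(2000, True)] * 15
# Same traffic, but the slow requests are cancelled by an aggressive timeout.
gamed = [Request(180, True)] * 985 + [Request(THRESHOLD_MS, False)] * 15

for name, traffic in (("honest", honest), ("gamed", gamed)):
    print(
        f"{name}: latency SLI {latency_sli(traffic, THRESHOLD_MS):.3f}, "
        f"availability {availability_sli(traffic):.3f}, "
        f"good fraction {good_fraction(traffic, THRESHOLD_MS):.3f}"
    )
```

In this made-up data the timeouts lift the latency SLI from 0.985 to 1.000 while availability drops by the same amount, and the combined good fraction stays at 0.985 in both cases, which is why tracking the two SLIs together exposes the trade.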