Sampling sounds like a reasonable engineering trade-off. You can't store every trace, so you keep a representative subset. Simple enough. Until the thing you needed to see was in the data you threw away.
This guide explains how observability sampling works, why it exists, the two dominant approaches and their failure modes, and why a growing number of engineering teams are rethinking whether sampling should be a default at all.
Why Sampling Exists in the First Place
Modern distributed systems generate an enormous volume of telemetry. A single user request touching ten microservices can produce hundreds of spans, thousands of log lines, and dozens of metric data points. At any meaningful scale, storing all of it naively is expensive.
Sampling is the practice of capturing only a fraction of that data (typically traces) and discarding the rest. The bet is that the data you keep is statistically representative enough to be useful.
For a long time, this was the only viable approach. Observability tools were built on the assumption that you'd sample, and they optimized around that constraint: query engines, storage systems, pricing models. Sampling became invisible infrastructure, a default most teams accepted without examining.
The Two Types of Sampling
Head-Based Sampling
In head-based sampling, the decision to record or discard a trace is made at the start of the request, before any of the trace's data has been collected. (If you need a primer on how traces are structured before going further, our guide to distributed tracing covers the mechanics in detail.)
The most common implementation is probabilistic: keep 1% (or 5%, or 10%) of all traces, chosen at random. Some implementations add rules on top of this, always keeping traces that match certain criteria (error status, specific endpoints) and sampling everything else.
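To make that concrete, here is a minimal sketch of probabilistic head-based sampling using the OpenTelemetry Python SDK. The 1% ratio is illustrative, and real setups often layer rule-based samplers on top of this:

```python
# Minimal head-based (probabilistic) sampling sketch with the OpenTelemetry
# Python SDK. The 1% ratio is illustrative, not a recommendation.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# The keep/discard decision is made at the root of each trace, based only on
# the trace ID. Child spans inherit the parent's decision so traces stay whole.
sampler = ParentBased(root=TraceIdRatioBased(0.01))

trace.set_tracer_provider(TracerProvider(sampler=sampler))
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("handle_request"):
    ...  # by the time this work runs, the sampling decision has already been made
```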
The problem: You make the keep/discard decision before you know what happened. A request that starts normally and then produces a subtle latency spike, an unusual downstream call, or a low-frequency error will be discarded at the same rate as a completely routine request. The traces most likely to contain diagnostic signal are not distinguishable from noise at the moment the sampling decision is made.
Head-based sampling is cheap to implement and has low overhead, which is why it became the default. But it's structurally blind to the data it most needs to keep.
Tail-Based Sampling
Tail-based sampling moves the decision point to the end of a request, after the full trace has been assembled. This lets you make an informed decision: keep traces that contain errors, keep traces with latency above a threshold, keep traces that touch a specific service, discard the rest.
This is meaningfully better than head-based sampling for diagnostic purposes. You're keeping data based on what it actually contains.
The problem: Tail-based sampling is operationally complex. To make a decision about a complete trace, you need to buffer all spans from all services until the full trace is assembled. That requires a stateful aggregation layer: a trace collector sitting in the critical path of your telemetry pipeline. This collector needs to be sized for your peak traffic, managed and scaled, and becomes a failure point. If it falls behind or crashes, you lose data.
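As a rough illustration of what that aggregation layer has to do, here is a simplified Python sketch of a tail sampler: buffer every span by trace ID, wait out a decision window, then keep only the traces that contain an error or a slow span. The span shape, the ten-second window, and the 500 ms threshold are assumptions for the example; production collectors (the OpenTelemetry Collector's tail-sampling processor, for instance) do the same job with bounded memory, timeouts, and horizontal scaling.

```python
# Simplified sketch of the stateful buffering a tail sampler requires.
# Span shape, window, and thresholds are illustrative assumptions.
import time
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool

DECISION_WAIT_S = 10        # buffer spans this long before judging the trace
LATENCY_THRESHOLD_MS = 500  # "slow" cutoff for the keep decision

buffer: dict[str, list[Span]] = defaultdict(list)  # every in-flight trace lives here
first_seen: dict[str, float] = {}

def ingest(span: Span) -> None:
    """Buffer every span from every service until its trace can be judged."""
    first_seen.setdefault(span.trace_id, time.monotonic())
    buffer[span.trace_id].append(span)

def flush_decisions() -> list[list[Span]]:
    """Keep traces containing an error or a slow span; discard everything else."""
    kept = []
    now = time.monotonic()
    ready = [t for t, ts in first_seen.items() if now - ts > DECISION_WAIT_S]
    for trace_id in ready:
        spans = buffer.pop(trace_id)
        del first_seen[trace_id]
        if any(s.is_error or s.duration_ms > LATENCY_THRESHOLD_MS for s in spans):
            kept.append(spans)
    return kept
```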
The other problem is that tail-based sampling still discards data. Even a sophisticated tail sampler can only evaluate the criteria you've defined in advance. Unknown unknowns, the failure modes you haven't seen before and therefore haven't written rules for, still get sampled away.
The Core Problem with Both Approaches
Sampling is a lossy process. Once data is discarded, it's gone. You can't go back and reconstruct what happened.
This creates a specific category of observability failure: the incident you can't debug because the relevant trace was in the discarded fraction. At 1% sampling, 99 out of every 100 requests leave no trace. If a low-frequency error pattern affects 0.5% of requests, there's a reasonable chance that pattern never appears in your sampled data at all.
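To put a number on that "reasonable chance," here is a back-of-the-envelope calculation; the traffic volume is an assumption for illustration, not a figure from any particular system.

```python
# Back-of-the-envelope: how likely is a rare pattern to vanish entirely?
# Assumed numbers: 10,000 requests, a pattern affecting 0.5% of them,
# head-based sampling keeping 1% of traces uniformly at random.
requests = 10_000
pattern_rate = 0.005   # 0.5% of requests exhibit the pattern
sample_rate = 0.01     # 1% head-based sampling

affected = requests * pattern_rate               # ~50 affected requests
expected_captured = affected * sample_rate       # ~0.5 captured traces
p_none_captured = (1 - sample_rate) ** affected  # ~0.60

print(f"{affected:.0f} affected requests, {expected_captured:.1f} expected in the sample, "
      f"{p_none_captured:.0%} chance none survive")
```

With these assumed numbers, there is roughly a 60% chance the pattern never shows up in your sampled data at all, even though it hit about fifty requests.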
There's also a subtler problem: sampled data distorts aggregations. If you're sampling non-uniformly (keeping errors at 100%, keeping normal requests at 1%), your service-level metrics derived from trace data will be wrong unless you carefully account for the sampling rate. Most teams don't. Most dashboards built on sampled trace data contain errors they don't know about.
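Here is a small sketch of what "accounting for the sampling rate" actually requires, with made-up traces and rates: each kept trace has to be weighted by the inverse of the rate it was sampled at, or the error rate on the dashboard is the error rate of your sample, not of your traffic.

```python
# Why non-uniform sampling distorts aggregates, and the weighting that
# corrects it. The traces and rates below are illustrative assumptions.
traces = [
    # (is_error, sample_rate_at_which_it_was_kept)
    (True, 1.00),   # errors kept at 100%
    (True, 1.00),
    (False, 0.01),  # normal requests kept at 1%
    (False, 0.01),
    (False, 0.01),
]

# Naive error rate straight off the sampled data: 2 errors / 5 traces = 40%.
naive_error_rate = sum(e for e, _ in traces) / len(traces)

# Weighted estimate: each kept trace stands in for 1/rate real requests.
weights = [1 / rate for _, rate in traces]
weighted_errors = sum(w for (e, _), w in zip(traces, weights) if e)
estimated_error_rate = weighted_errors / sum(weights)  # ~0.66% of real traffic

print(f"naive: {naive_error_rate:.0%}, weighted: {estimated_error_rate:.2%}")
```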
Finally, sampling creates blind spots at the edges of the system. Head-based sampling at 1% treats rare but important requests (the ones with unusual payloads, unusual timing, unusual service combinations) no differently from routine ones: they're kept at the same 1% rate, and because they were rare to begin with, only a handful survive into your data, if any do at all. Exactly the requests you'd most want to investigate are the least likely to be in your corpus.
Why Teams Accept the Trade-Off Anyway
The honest answer is cost. Storing all traces from a high-throughput system at full fidelity is expensive, and most observability vendors price on data volume. Sampling reduces the bill. The tools are built around the assumption of sampling, the pricing models assume it, and the default configurations ship with it enabled. This dynamic plays out in a predictable way as companies scale. Our post on why Datadog costs spike documents the pattern well, including what they call "sampling roulette": the point at which teams start turning down sampling rates to control bills and lose meaningful visibility as a result.
Teams accept sampling not because they've evaluated it and concluded it's the right trade-off, but because it's the default and the alternative seems prohibitively expensive.
This is worth questioning, because the cost of sampling isn't just storage. It's the debugging time spent on incidents where the right trace wasn't captured. It's the false confidence in metrics that are silently wrong. It's the unknown failure modes that never surface because they're always in the discarded fraction.
Why Sampling Breaks Down Completely for LLM Applications
Everything above applies to traditional microservices. With LLM applications, the problems compound in ways that make sampling not just imperfect but actively misleading.
LLM requests are not interchangeable. The core assumption behind sampling is that the requests you discard are roughly similar to the ones you keep. That assumption holds reasonably well for a payment processing endpoint or an authentication service. It does not hold for LLM calls. Two requests with identical-looking inputs can produce wildly different outputs: one correct, one hallucinated, one a policy refusal. Sampling 1% of LLM traces does not give you a representative picture of model behavior. It gives you 1% of the picture with no reliable way to know what's missing.
You need the output, not just the metadata. In traditional observability, a trace tells you latency, error status, and which services were called. That's enough to diagnose most problems. With LLM applications, the response content is itself a signal. You need to know whether the model hallucinated, whether it refused a valid request, whether it violated a policy, whether a prompt change caused a regression in output quality. None of this is visible from trace metadata alone, and none of it is recoverable if the trace was discarded.
Evals and regression detection require complete records. Teams running prompt evaluations or monitoring for model degradation compare outputs over time. A sampled dataset makes this unreliable. If your eval set only covers 5% of production calls, you can't distinguish a genuine regression from a gap in what was captured. Full fidelity isn't a nice-to-have for LLM quality monitoring; it's a requirement.
Cost visibility depends on every call. LLM inference costs are per-call and highly variable: token counts, model selection, and prompt length all affect the bill. Sampling away 99% of calls means you're extrapolating your cost breakdown from a small, potentially unrepresentative subset. At scale, that extrapolation can be significantly wrong, and the errors compound as you add models, providers, and prompt variants.
Latency distributions are harder to characterize. LLM response times are more variable than most microservice calls. A p99 latency spike in a background job is an operational annoyance. A p99 latency spike in a user-facing LLM call is often the difference between a product that feels usable and one that doesn't. Understanding the real latency distribution requires every data point, not a sampled approximation.
The practical consequence is that teams building LLM applications need an observability approach that captures every call by default, not one that samples and hopes the important cases make it through. This is an area where the architectural limitations of traditional tools show up most clearly, and where the no-sampling model has the most direct impact on what teams can actually know about their systems. Our LLM observability is built on this principle.
The Alternative: Architecture That Doesn't Require Sampling
Why traditional tools create the sampling problem in the first place
The reason sampling became necessary is that traditional observability tools are built around explicit instrumentation: agents, SDKs, or code changes that generate telemetry by intercepting application code. At scale, this produces more data than the architecture can store or query efficiently, so sampling is used as a pressure valve.
But collection overhead is only half the problem. Even if you could collect everything cheaply, sending it all to a SaaS vendor creates a second pressure: cost scales with data volume, and the vendor captures the margin. The result is that teams face the same incentive to sample regardless of how efficient their collection layer is. Reducing sampling rates means a bigger bill. So the default stays low, the blind spots stay large, and the problem persists.
Solving the sampling problem properly requires addressing both pressures, not just one.
How groundcover removes both constraints
A different approach builds the telemetry collection layer at a lower level of the stack, where it can capture all system activity with low overhead, without requiring the application to instrument itself.
groundcover does this using eBPF to collect telemetry at the kernel level across all services in a Kubernetes cluster. Because collection happens in the kernel rather than in application code, the overhead is low enough that full-fidelity data collection is practical. Benchmarks from our Flora sensor show 73% less CPU consumption than Datadog and up to 96% less memory than comparable agents. That handles the collection side.
The storage side is handled by groundcover's BYOC architecture. Rather than sending your telemetry to a vendor's infrastructure and paying on data volume, the data stays in your own cloud. You control the storage, you control the retention, and there is no per-GB incentive to sample anything away. The two constraints that made sampling feel necessary, collection overhead and vendor storage costs, are both removed at the architecture level.
No sampling decisions. No discarded traces. Every request is visible.
If you're evaluating observability tools and sampling is currently a constraint you're working around, it's worth asking whether the constraint is necessary, or whether it's a product of the architecture you're running today.
Summary
Sampling exists because traditional observability architectures can't afford to keep everything. Head-based sampling is cheap but blind; tail-based sampling is smarter but complex and still lossy. Both approaches share the same fundamental problem: the data most likely to contain diagnostic signal is the data most likely to be discarded.
The trade-off is real, but it's not universal. Whether you're designing a new observability pipeline or re-evaluating an existing one, sampling is a constraint worth examining rather than accepting as a given.