If you've ever had the displeasure of undergoing a barium swallow test, you know a little about what distributed tracing is like. In a barium swallow test, patients ingest barium, which doctors can trace using X-ray machines as it flows down the digestive tract. Being able to watch the movement of the barium helps doctors diagnose issues inside the esophagus and stomach without operating.
Distributed tracing is similar in that it's a way of tracing how requests flow within a distributed software system, like a microservices application. This visibility helps IT engineers to pinpoint the source of performance issues and errors.
That, at least, is a high-level summary of distributed tracing and its role in modern application performance management. For a deeper dive, keep reading as we unpack what distributed tracing means, how it works, its benefits and challenges, the main types of tracing, and how to get started with it as part of your microservices observability strategy.
What is distributed tracing?
Distributed tracing is the tracking of application requests as they flow between the various services or components of an application. Tracing makes it possible to identify problems and study critical data elements within the context of microservice interactions.
Many modern applications are deployed using a microservices architecture. This means that instead of running as a single service, the application is a set of loosely coupled services. Frontend services receive requests from clients, then pass them to backend services for processing. In turn, the backend services send the result back to the frontend, which passes it on to the client.
When you create a distributed trace, you study how a request flows between the multiple services in an application. Distributed tracing allows you to determine which services are involved in fulfilling the request, as well as to measure how long it takes for each service to handle its part of the process.
Distributed tracing vs. logging
If you're thinking, "Can't I get this visibility through logs?" the answer is "not usually."
Typically, logs record events that occur as an application operates. A log might tell you which clients sent a request and when the application issued a response. But logs don't usually record details about how requests flow between individual services within an application. You need distributed tracing to gain that level of visibility.
So, while logs provide an overall view of the numbers and types of requests an application is receiving, distributed tracing goes deeper by tracking interactions between services during the processing of requests.
How does distributed tracing work in a microservices architecture?
In a microservices application, distributed tracing usually works as follows:
- A tracing tool identifies an incoming request and assigns a unique ID to it.
- The tracing tool follows the request as it flows to other services within the application.
- Whenever the request moves to a new service, the tool measures how long it takes before that service passes the request on to another service. In this way, the tool collects data about how much time it takes to process each part of the request.
- Teams review the data – often with help from visualizations like a flame graph, which illustrates how long each part of the request took to process. Based on this insight, engineers can determine which parts of the request take the longest.
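The steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular tracing tool is implemented: the `Span` class, the `handle` and `trace_request` functions, the service names and the sleep durations are all hypothetical.

```python
import time
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One service's share of the work for a single request (hypothetical shape)."""
    trace_id: str
    service: str
    parent: Optional[str]
    duration_ms: float = 0.0


def handle(service: str, trace_id: str, parent: Optional[str],
           work_s: float, downstream: list, spans: List[Span]) -> None:
    """Time this service's work and forward the same trace ID downstream."""
    span = Span(trace_id, service, parent)
    spans.append(span)  # spans are recorded in call order
    start = time.perf_counter()
    time.sleep(work_s)  # stand-in for real processing
    for child_service, child_work, child_downstream in downstream:
        handle(child_service, trace_id, service, child_work, child_downstream, spans)
    # The duration includes time spent waiting on downstream services.
    span.duration_ms = (time.perf_counter() - start) * 1000


def trace_request() -> List[Span]:
    trace_id = uuid.uuid4().hex  # a unique ID assigned to the incoming request
    spans: List[Span] = []
    # Hypothetical call graph: frontend -> orders -> database
    handle("frontend", trace_id, None, 0.01,
           [("orders", 0.02, [("database", 0.03, [])])], spans)
    return spans


for s in trace_request():
    print(f"{s.trace_id[:8]} {s.service:<9} parent={s.parent} {s.duration_ms:.1f} ms")
```

Every span carries the same trace ID, which is what lets a tracing tool stitch the separate timings back together into one picture of the request.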
In many cases, some parts of a request naturally take longer than others to process. For example, if a microservice has to pull data from a remote database to respond to a request, it might take several hundred milliseconds for the data to move over the network, even if the microservice itself is operating normally.
But in other cases, unusually long processing times are a sign of a problem. If a microservice that normally takes just a few milliseconds to process its part of a request before forwarding the data to a different service suddenly starts taking one or two seconds, it could be a sign of an issue like buggy code in the microservice or lack of adequate CPU and memory resources.
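To make the difference between "naturally slow" and "suddenly slow" concrete, here is a minimal sketch that compares a service's latest processing time against its own history. The function name, the five-times threshold and the sample timings are assumptions chosen for illustration; real tools typically use more sophisticated baselines.

```python
from statistics import median


def is_unusually_slow(history_ms, latest_ms, factor=5.0):
    """Return True when the latest duration exceeds `factor` times the
    median of the service's own past durations."""
    return latest_ms > factor * median(history_ms)


# A service that normally answers in a few milliseconds (made-up numbers).
fast_service_history = [4.0, 5.0, 6.0, 5.0, 4.0]
print(is_unusually_slow(fast_service_history, 1500.0))  # a jump to 1.5 seconds
print(is_unusually_slow(fast_service_history, 7.0))     # ordinary jitter

# A service that always waits on a remote database (made-up numbers).
db_service_history = [300.0, 320.0, 310.0]
print(is_unusually_slow(db_service_history, 330.0))     # slow, but normal for it
```

The point of comparing each service against its own baseline is that a 330-millisecond span is unremarkable for the database-bound service, while a far smaller absolute change could be a red flag for the fast one.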
Types of distributed tracing
There are several types of tracing, each with pros and cons.
Code tracing
Code tracing is the practice of manually tracking how different parts of an application process data. Typically, code tracing involves simulating the results of code on a line-by-line basis and writing the results out on paper, although this method can also be used to estimate how requests will flow between microservices.
Because code tracing is usually manual, it's not a great way to gain visibility into request flows quickly. But it can be useful when designing an application as a way of predicting how the application will process requests and where bottlenecks may occur.
Program tracing
Program tracing is the monitoring and measurement of an application at various stages of execution. To create a program trace, developers usually start an application, then collect data about the application's status. They can also inject certain requests into the application to trace how it responds.
Program tracing focuses on tracing the behavior of an application as a whole. As such, it doesn't usually provide visibility into interactions between microservices (and it's not technically a type of distributed tracing because it doesn't focus on distributed systems), but it can be useful for gaining context on the overall execution of an app.
End-to-end tracing
End-to-end tracing is the tracking of requests as they flow within a distributed system. End-to-end traces monitor the status of a request from the time it is initiated through the point where the application completes its processing of the request.
End-to-end tracing is usually what people mean when they talk about distributed tracing. End-to-end traces provide the greatest degree of visibility into request processing within complex, distributed systems.
Benefits of distributed tracing
Because they offer insights that aren't available through logs or other data sources, distributed traces deliver a range of benefits.
Accelerate software troubleshooting
Microservices monitoring and alerting tools often tell teams that something is wrong – that an application is experiencing a high rate of errors or taking unusually long to process requests, for example. But they often don't tell engineers exactly why the problem is occurring.
Distributed traces provide this information, which speeds troubleshooting. The ability to see exactly what is happening inside an application as it responds to a request can help developers pinpoint the microservice that is slowing down processing or triggering errors.
Measure specific user actions
With distributed tracing, engineers can pick and choose which requests to trace. This means that they can monitor certain types of requests – like ones associated with a specific user who has been experiencing an application performance issue.
In this way, tracing provides visibility into performance problems that appear only under certain conditions, such as when users send requests from devices running a specific type of operating system, or when they attempt to access a certain type of functionality.
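As a rough sketch of what "picking and choosing" can look like, the snippet below decides whether to trace a request based on its attributes. The `should_trace` function, the rule format and the field names (`user_id`, `os`, `path`) are hypothetical; actual tracing tools expose their own configuration for this.

```python
def should_trace(request: dict, rules: list) -> bool:
    """Return True when the request matches every key/value pair
    in at least one rule."""
    return any(
        all(request.get(key) == value for key, value in rule.items())
        for rule in rules
    )


# Hypothetical rules: trace one specific user, plus Android checkout requests.
rules = [
    {"user_id": "u-1042"},
    {"os": "android", "path": "/checkout"},
]

print(should_trace({"user_id": "u-1042", "os": "ios", "path": "/home"}, rules))
print(should_trace({"user_id": "u-7", "os": "android", "path": "/checkout"}, rules))
print(should_trace({"user_id": "u-7", "os": "android", "path": "/home"}, rules))
```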
Understand service relationships
Distributed traces provide visibility into the way services interact within complex, distributed applications. In theory, application developers should have a strong sense of how application services interact. However, bugs and configuration variables may lead to unexpected interactions – such as a service that passes data to another service in a way that the receiving service was not designed to support. This makes it important to be able to map service relationships using distributed tracing.
In addition, the original developers of an application are not always available to explain how services work. Here again, distributed traces provide visibility into service relationships, which in turn assists the teams who need to support an app.
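Here is one way a service map could be derived from trace data, sketched under the assumption that each span records its own ID, its parent's ID and the service that produced it. The span dictionaries and service names are invented for the example.

```python
from collections import defaultdict


def service_map(spans):
    """Count caller -> callee edges between services, based on
    parent/child links between spans."""
    by_id = {span["span_id"]: span for span in spans}
    edges = defaultdict(int)
    for span in spans:
        parent = by_id.get(span["parent_id"])
        # Only count calls that cross a service boundary.
        if parent is not None and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return dict(edges)


# Hypothetical spans from a single trace.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "orders"},
    {"span_id": "c", "parent_id": "b", "service": "payments"},
    {"span_id": "d", "parent_id": "b", "service": "database"},
    {"span_id": "e", "parent_id": "c", "service": "database"},
]

for (caller, callee), count in sorted(service_map(spans).items()):
    print(f"{caller} -> {callee}: {count}")
```

An edge that nobody expected, such as a payments service talking directly to the database, is exactly the kind of surprise a map like this can surface.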
Improve collaboration and productivity
The ability to gain a deeper understanding of application behavior through distributed tracing can help stakeholders collaborate more effectively and productively. Rather than having to study an application's source code to understand how it processes requests, a team that needs to support the application can create distributed traces, then use the insights it gains to work more effectively with other teams on managing and improving the app's performance.
Reduce MTTR
By helping teams to troubleshoot issues faster, distributed tracing reduces Mean Time to Repair (MTTR), which measures how long it takes teams, on average, to resolve performance issues. When engineers can quickly run a distributed trace to figure out the root cause of a problem, they can plan and implement a resolution faster.
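MTTR is simply an average: total repair time divided by the number of incidents. The sketch below shows the arithmetic. The incident durations are made-up numbers for illustration only, not measurements of what tracing achieves.

```python
def mean_time_to_repair(repair_minutes):
    """Average time to resolve an incident, in minutes."""
    if not repair_minutes:
        raise ValueError("need at least one incident")
    return sum(repair_minutes) / len(repair_minutes)


# Illustrative incident durations in minutes (invented for this example).
slower_resolutions = [120, 90, 150]
faster_resolutions = [30, 45, 15]

print(mean_time_to_repair(slower_resolutions))
print(mean_time_to_repair(faster_resolutions))
```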
Maintain Service Level Agreements (SLAs)
Along similar lines, distributed tracing can help teams meet Service Level Agreements (SLAs). SLAs are commitments to maintain a certain level of uptime or performance. The faster you can troubleshoot an issue, the easier it is to uphold SLA guarantees.
Distributed tracing challenges
Distributed tracing is a powerful source of insight into application operations and performance, but it also presents some challenges.
Implementation difficulty
To ensure that each microservice in an application can expose the data necessary to build traces, you typically need to implement tracing logic within every microservice during the software development process. Libraries like those from the OpenTelemetry project (which superseded the earlier OpenTracing project) can help with this task by eliminating the need for developers to write their own tracing code from scratch. But developers still have to add the instrumentation calls to their code and ensure that the libraries are compatible with whichever tracing tool their teams choose to use.
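To give a feel for what manual instrumentation involves, here is a hand-rolled sketch that uses only the Python standard library. It is not the API of any real tracing library; the `traced` decorator, the `FINISHED` list and the `checkout` and `load_cart` functions are all hypothetical stand-ins.

```python
import contextvars
import functools
import time

# Tracks which span is currently active so nested calls can record a parent.
_current = contextvars.ContextVar("current_span", default=None)

# Stand-in for an exporter that would ship spans to a tracing backend.
FINISHED = []


def traced(name):
    """Wrap a function so each call is recorded as a timed span."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            parent = _current.get()
            token = _current.set(name)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                _current.reset(token)
                FINISHED.append({"name": name, "parent": parent, "ms": elapsed_ms})
        return wrapper
    return decorator


@traced("load_cart")
def load_cart(user_id):
    return ["book", "pen"]  # stand-in for a real lookup


@traced("checkout")
def checkout(user_id):
    return len(load_cart(user_id))


checkout("u-1042")
for span in FINISHED:
    print(span["name"], "parent:", span["parent"])
```

Even in this toy form, the cost is visible: every function you care about needs a decorator, and every service needs the same treatment before a complete trace can be assembled.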
In some cases, it's possible to collect tracing data using automatic instrumentation, which eliminates the need to modify application logic. Auto instrumentation tools work by monitoring microservices from the outside. The limitation of this approach is that you may not be able to collect as much detail about traces as you could with manual instrumentation.
To get the best of both worlds – deep tracing visibility without complex instrumentation – you can use eBPF to collect trace data. This is the unique approach that groundcover supports. More on that below.
Limited frontend coverage
Some tracing tools focus only on tracking how requests are processed on the backend. They provide little visibility into the frontend, meaning the part of the application that accepts requests from clients and sends the results back to the clients.
This makes sense to the extent that, in general, the backend is where most of the heavy lifting takes place when processing a request, so the backend is usually where you'll find the source of errors. But frontends can have issues, too, like a microservice that can't register incoming requests quickly enough because it's starved of CPU.
So, without full coverage of the frontend, tracing doesn't always provide the visibility necessary to troubleshoot issues quickly.
Random sampling
Although most distributed tracing systems allow you to select specific requests to trace, the default approach to tracing typically involves random sampling. This means the tools randomly choose which requests to focus on.
This is a problem if only certain types of requests result in performance degradations because those requests may not be represented accurately in your sample. Instead, you may see a series of requests that look normal, while missing the requests that are associated with performance issues.
You can mitigate this risk by increasing the number of requests you sample. However, more samples lead to more traces, which increases the load placed on your observability tools and could deprive your actual microservices of the CPU and memory resources they need to operate. Plus, you end up with more trace data to store, leading to higher storage costs.
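A little arithmetic shows the trade-off. The sketch below assumes each request is sampled independently with the same probability, which is a simplification of how real samplers behave. The function names and the traffic figures are hypothetical.

```python
def chance_of_capture(sample_rate: float, matching_requests: int) -> float:
    """Probability that at least one of the matching requests is sampled,
    assuming each request is sampled independently."""
    return 1 - (1 - sample_rate) ** matching_requests


def traces_per_day(sample_rate: float, requests_per_day: int) -> float:
    """Expected number of traces collected per day at a given sample rate."""
    return sample_rate * requests_per_day


# Hypothetical scenario: 20 problematic requests hidden in 1,000,000 per day.
for rate in (0.01, 0.10):
    print(
        f"sample rate {rate:.0%}: "
        f"{chance_of_capture(rate, 20):.0%} chance of catching a bad request, "
        f"about {traces_per_day(rate, 1_000_000):,.0f} traces to store"
    )
```

Under these assumptions, sampling 1% of traffic gives roughly an 18% chance of capturing even one of the 20 problematic requests. Raising the rate to 10% lifts that to roughly 88%, but at ten times the trace volume.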
But there's a better way, as we explain below. Sampling is another area where groundcover takes a different approach to tracing by making it possible to capture every trace without overloading your tracing tools.
Distributed tracing tools
Implementing distributed tracing typically involves deploying tools that cover three needs: instrumentation, data collection, and analysis and visualization.
Instrumentation
As noted above, distributed tracing requires instrumenting the application you want to observe. You'll want to make sure that your instrumentation is compatible with the tool you plan to use to collect traces. Most modern tools support the OpenTelemetry project (the successor to OpenTracing), which provides instrumentation libraries in a variety of languages.
Data collection
To collect trace data, you deploy a tool that integrates with your application using the instrumentation you implemented and monitors traces as they flow between microservices.
Analysis and visualization
To make sense of trace data, you can deploy analysis and visualization tools. These tools can identify anomalies, such as trace spans that are unusually long. They can also generate visualizations that summarize tracing trends.
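As a very rough idea of what a trace visualization does, the sketch below renders spans as a text waterfall, with each bar positioned by start time and sized by duration. The `render_waterfall` function and the span timings are hypothetical, and real visualization tools are far richer than this.

```python
def render_waterfall(spans, width=40):
    """Render spans as rows of text bars scaled to the full trace length."""
    total_ms = max(s["start_ms"] + s["duration_ms"] for s in spans)
    lines = []
    for s in sorted(spans, key=lambda s: s["start_ms"]):
        offset = round(s["start_ms"] / total_ms * width)
        length = max(1, round(s["duration_ms"] / total_ms * width))
        lines.append(f'{s["service"]:<10}|{" " * offset}{"#" * length}')
    return lines


# Hypothetical timings for a single request, in milliseconds.
spans = [
    {"service": "frontend", "start_ms": 0, "duration_ms": 200},
    {"service": "orders", "start_ms": 20, "duration_ms": 160},
    {"service": "database", "start_ms": 40, "duration_ms": 120},
]

print("\n".join(render_waterfall(spans)))
```

Laid out this way, it is easy to see at a glance that most of the frontend's 200 milliseconds is spent waiting on the services beneath it.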
How groundcover can help with distributed tracing
At groundcover, we have a unique tracing philosophy. We don't think you should have to implement complex instrumentation to collect tracing data. Nor do we think you should have to settle for sampled traces that may or may not capture what you actually want to see.
Instead, we use the extended Berkeley Packet Filter (eBPF) to collect tracing data from every service in your stack. Because eBPF is hyper-efficient, it's able to trace every request with minimal CPU and memory overhead – so there's no need to sample. And because eBPF runs directly in the operating system kernel, it can attach to any service running on your servers without requiring custom instrumentation.
Coupled with groundcover's sophisticated data analysis and visualization features, this unique tracing strategy makes it possible to achieve deep visibility without extensive effort.
By the way, if you want to collect traces using external tools, we support that, too. You can easily import third-party tracing data into the groundcover platform. But for a simpler experience, you can take advantage of eBPF tracing support, which is available out-of-the-box.
Distributed tracing without the downside
Logs and metrics are great. But on their own, they often do not provide the deep visibility necessary to optimize the performance of microservices apps. Distributed tracing fills this gap by allowing teams to analyze requests in a granular fashion.
And while implementing tracing the traditional way can be challenging, new approaches – like the eBPF tracing features at the heart of groundcover – make it easier than ever to trace requests in a distributed system with minimal overhead and hassle.