Kubernetes Tracing: Best Practices for Deep Insights
Kubernetes tracing offers insights into performance. Discover expert strategies to understand and optimize your Kubernetes deployments.
In some respects, trying to achieve performance optimization for Kubernetes is like trying to feed a picky toddler. One day you feed her macaroni and cheese and she eagerly eats it up. So you give it to her again the next day, at which point she promptly dumps it all over the floor, refuses to eat lunch and, two hours later, complains that she's hungry because you "didn't give her lunch." At that point you offer her pizza, which she also throws on the floor, demanding macaroni and cheese. So you make her a new batch, but she violently melts down because one of the macaronis is broken.
Exhausted, you give up and let her stuff herself with Skittles while streaming Blippi. It's a dirty hack of a fix, to be sure, but at least you're putting some food in the kid.
What do Kubernetes workloads have in common with capricious toddlers, you ask? The answer is that both have a tendency to go off the rails in unpredictable ways – and unless you understand exactly what's going on, you resort to ugly fixes because you can't identify the root-cause problem. A bug that shows up in a Kubernetes workload one moment may disappear the next. Or, you might run into a situation where you know something's off, but you can't figure out exactly what's causing the issue amid the sprawling conglomeration of services, protocols and infrastructure components that is your Kubernetes cluster.
Now, we can't tell you how to troubleshoot the behavior of a toddler who acts one way one day, then in a totally opposite way the next day, all the while acting as if her parents are the crazy ones. There are, alas, no tracing tools for toddlers.
Fortunately, though, tracing tools and application performance monitoring solutions do exist for Kubernetes. When used in the right way, tracing tools make it possible to get to the root of even the most complex performance issues in microservices architectures. This article demonstrates this by explaining how tracing in Kubernetes works, as well as highlighting best practices for optimizing the performance of the Kubernetes container orchestration platform.
What is tracing?
Before diving into the specifics of tracing in Kubernetes, let's talk about tracing in general.
Tracing is a software monitoring and troubleshooting technique that allows engineers to track the execution of a program step-by-step. In this way, tracing can reveal exactly where a program fails or where a performance problem appears.
As an example of tracing at work, imagine you have a Web application that stores data in Redis. The application is responding slowly to user requests, but you don't know exactly what's going wrong. So you use tracing tools to monitor requests as they flow from HTTP clients, to the application server, to the Redis data store and back again to the client. By monitoring how long each step of the journey takes, you can identify where the delay occurs and which component it's associated with. That's a lot better than simply knowing that requests are slow but having no idea where the issue originates.
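To make the idea concrete, here is a minimal sketch of what "timing each step of the journey" could look like. It is a hand-rolled illustration rather than a real tracing library: the `Tracer` and `Span` classes, the span names and the simulated delays are all assumptions made for the example, and the Redis call is faked with a sleep.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed step of a request (hypothetical structure for this sketch)."""
    trace_id: str
    name: str
    parent: Optional[str]
    duration_ms: float = 0.0


class Tracer:
    """Toy tracer: records nested spans that share a single trace ID."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []
        self._stack = []

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        record = Span(self.trace_id, name, parent)
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the elapsed time even if the wrapped step raises.
            record.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)


def handle_request(tracer):
    """Simulated request: parse input, read from the data store, render."""
    with tracer.span("http.request"):
        with tracer.span("app.parse_input"):
            time.sleep(0.002)
        with tracer.span("redis.get"):
            time.sleep(0.05)  # stand-in for a slow data store call
        with tracer.span("app.render"):
            time.sleep(0.002)


tracer = Tracer()
handle_request(tracer)

# Compare the steps directly under the request to see where the time went.
steps = [s for s in tracer.spans if s.parent == "http.request"]
slowest = max(steps, key=lambda s: s.duration_ms)
print(f"slowest step: {slowest.name} ({slowest.duration_ms:.1f} ms)")
```

Because every span carries the same trace ID and knows its parent, the steps of one request can be lined up and compared. That is the piece that plain, per-component logs don't give you.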
Importantly, tracing differs from logging and other forms of monitoring because with tracing, you follow individual requests as they flow through multiple software components. In contrast, most monitoring and logging tools record only isolated events. For example, a log for the Web app described above might record that a client request was processed and tell you how long it took to complete it. But a log wouldn't typically offer granular, step-by-step details on exactly how the request proceeded.
Why tracing matters
In many cases, tracing is the only way – or, at a minimum, the most efficient way – to get to the root of a complex performance issue.
There are several reasons why. One is that in modern, complex environments, you're often dealing with a jungle of different components, services and protocols. That makes it very tricky in many cases to figure out exactly which piece of the puzzle is the problematic one.
Another reason why tracing is critical is that sometimes, issues only manifest themselves irregularly. Just as our picky toddler is totally happy eating mac and cheese one day but becomes unreasonably (by adult standards, at least) angry about the same meal the next day, an application that is perfectly responsive one moment might start failing the next, only to go back to normal shortly thereafter. By tracking program execution in a highly granular, context-aware way, tracing helps troubleshoot failures that don't occur systematically or consistently.
A third benefit of tracing is that it's a whole lot easier to use traces to understand how different parts of your stack fit together than it is to rely just on logs and metrics for that purpose. If you want to take the long way, you could sit down and meticulously compare multiple log files – such as logs from your Web server and your Redis server – in an effort to figure out where the wonkiness happens when the application experiences a performance issue. But correlating log files like this is a lot of work. You'll be poring over timestamps, trying to figure out which events in one log potentially relate to events in the other log. There is also no guarantee that you'll actually be able to troubleshoot the issue, because the logs may not record enough information to allow you to figure out what's going wrong. And as for metrics, they are an even blunter measure of overall performance.
So, tracing provides critical context that you just can't obtain using other application performance monitoring methods.
Tracing components in Kubernetes
Tracing is valuable in any environment that includes multiple interdependent components. But it's especially important in Kubernetes, which takes the cake when it comes to complexity and having a large number of moving pieces.
After all, a Web server that depends on a Redis data store would be complex enough to troubleshoot if the app ran directly on a bare-metal host server. But if you containerize the app and deploy it to Kubernetes, you introduce a whole slew of other dependencies. You've got the containers and Pods that host the app. You have the worker node or nodes that host those containers and Pods. The nodes may or may not be VMs running on top of bare-metal servers. And alongside all of this, there is the Kubernetes API server, control plane nodes, etcd and a bunch of other things that, if they don't work the way they're supposed to, may cause issues for your Web app.
So, if you need to trace an app in Kubernetes, you need to figure out how all of these components fit together to make your app work. Some of them, like the application's containers, relate directly to the app. Others are cluster-wide components that provide auxiliary services and are not linked to the app specifically. Maybe your tracing data will reveal that some nodes that the app depends on are low on memory, which is why the app intermittently becomes slow to respond. Maybe you'll discover that the cluster is running too many workloads overall, so the scheduler occasionally struggles to find nodes with enough capacity for the app. Maybe a service discovery issue is causing disconnects between the app and Redis, which runs in its own Pods.
We could go on, but you get the point: Tracing in Kubernetes means collecting data from a whole host of different components and workloads in order to understand which one is the culprit when something strange happens with an app.
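One way to connect traces to those components is to tag each span with the Pod, namespace and node it ran on, then group latency by node. The sketch below assumes the Pod spec exposes `POD_NAME`, `POD_NAMESPACE` and `NODE_NAME` as environment variables (for example through the Kubernetes Downward API); those variable names, the span dictionaries and the latency numbers are all hypothetical. The attribute keys follow the OpenTelemetry naming for Kubernetes resources.

```python
import os
from collections import defaultdict
from statistics import mean


def k8s_attributes(env=None):
    """Build span attributes from environment variables.

    Assumes the Pod spec injects POD_NAME, POD_NAMESPACE and NODE_NAME;
    these variable names are a convention chosen for this sketch.
    """
    env = os.environ if env is None else env
    return {
        "k8s.pod.name": env.get("POD_NAME", "unknown"),
        "k8s.namespace.name": env.get("POD_NAMESPACE", "unknown"),
        "k8s.node.name": env.get("NODE_NAME", "unknown"),
    }


def mean_latency_by_node(spans):
    """Average span duration per node, to spot nodes that slow requests down."""
    by_node = defaultdict(list)
    for span in spans:
        by_node[span["attributes"]["k8s.node.name"]].append(span["duration_ms"])
    return {node: mean(values) for node, values in by_node.items()}


def make_span(duration_ms, pod, node):
    """Create an illustrative span as if it had been recorded inside `pod`."""
    env = {"POD_NAME": pod, "POD_NAMESPACE": "shop", "NODE_NAME": node}
    return {
        "name": "GET /cart",
        "duration_ms": duration_ms,
        "attributes": k8s_attributes(env),
    }


# Made-up data: the same operation served by Pods on two different nodes.
spans = [
    make_span(12.0, "web-1", "node-a"),
    make_span(14.0, "web-1", "node-a"),
    make_span(180.0, "web-2", "node-b"),
    make_span(220.0, "web-2", "node-b"),
]

latency = mean_latency_by_node(spans)
for node, avg in sorted(latency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {avg:.1f} ms average")
```

If one node stands out like `node-b` does in this invented data, that is a hint to look at the node itself (memory pressure, noisy neighbors) rather than at the application code.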
Examples and use cases for tracing in Kubernetes
To drill down even further into what tracing for Kubernetes might involve, let's discuss two specific examples where you could use tracing to troubleshoot the Web app described above.
Scenario 1: Debugging microservice architectures
In the first example, imagine that the root cause of the application performance issue is that the HTTP server occasionally sends a malformed request to the Redis server, and the request fails as a result. Of course, you almost certainly wouldn't know this just by monitoring application metrics or logs. Those would tell you that something is slow or failing, but not much, if anything, about the root cause.
But if you run traces on multiple application requests, you'll see that some requests fail at the point where the HTTP server calls the Redis server. From there, you can examine the specific requests, determine that they are improperly formatted or incomplete, and conclude that the root cause of the problem is bad requests – in which case it's likely that you need to update the application logic to ensure it sends proper requests.
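As a rough sketch of that analysis, the snippet below takes spans from several traces and finds where each failure originated: a failed span whose own children did not fail. The span records, IDs, operation names and the error text are invented for illustration, and a real tracing backend would expose this data through its own query interface.

```python
from collections import Counter

# Illustrative spans from four requests; every field here is made up.
SPANS = [
    {"span_id": "t1-a", "parent_id": None, "service": "http-server", "op": "GET /cart", "ok": True},
    {"span_id": "t1-b", "parent_id": "t1-a", "service": "redis", "op": "HGETALL", "ok": True},
    {"span_id": "t2-a", "parent_id": None, "service": "http-server", "op": "GET /cart", "ok": False},
    {"span_id": "t2-b", "parent_id": "t2-a", "service": "redis", "op": "HGETALL", "ok": False,
     "error": "ERR wrong number of arguments"},
    {"span_id": "t3-a", "parent_id": None, "service": "http-server", "op": "GET /cart", "ok": True},
    {"span_id": "t3-b", "parent_id": "t3-a", "service": "redis", "op": "HGETALL", "ok": True},
    {"span_id": "t4-a", "parent_id": None, "service": "http-server", "op": "GET /cart", "ok": False},
    {"span_id": "t4-b", "parent_id": "t4-a", "service": "redis", "op": "HGETALL", "ok": False,
     "error": "ERR wrong number of arguments"},
]


def failure_origins(spans):
    """Return failed spans that have no failed child.

    A parent span usually fails only because a child did, so the failed
    span with no failed children is the best guess at where things broke.
    """
    parents_of_failures = {s["parent_id"] for s in spans if not s["ok"]}
    return [s for s in spans if not s["ok"] and s["span_id"] not in parents_of_failures]


origins = failure_origins(SPANS)
counts = Counter((s["service"], s["op"]) for s in origins)

for (service, op), n in counts.most_common():
    print(f"{n} failure(s) originate at {service} {op}")
```

In this made-up data, both failing requests break at the call into Redis rather than in the HTTP handler itself. That points you at the requests the application is sending.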
Scenario 2: Identifying performance bottlenecks
Alternatively, imagine that the Redis server occasionally responds slowly to queries from the HTTP server. Again, you likely wouldn't know this just by looking at logs or metrics – unless you meticulously compared Redis server logs against the HTTP server logs, in which case there's a chance you'd notice that the HTTP server makes requests that the Redis server takes a long time to fulfill. But again, that would be a lot of effort, and there's no guarantee the logs would be detailed enough to surface the issue.
A better approach is to use traces to see what happens as requests flow between the HTTP server and the Redis server. Tracing data would quickly reveal slow responses from the latter, allowing you to investigate further and determine that the root cause of the problem lies with the way Redis is configured.
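Here is one hedged way to surface that kind of occasional slowness from span durations: compare each operation's worst case against its typical (median) duration and flag the ones with a large gap. The operation names, the numbers and the `ratio` threshold of 5 are assumptions chosen for the example, not recommended values.

```python
from collections import defaultdict
from statistics import median

# (trace_id, operation, duration_ms): illustrative measurements only.
DURATIONS_MS = [
    ("t1", "http.handle", 4.0), ("t1", "redis.get", 1.2),
    ("t2", "http.handle", 4.2), ("t2", "redis.get", 1.1),
    ("t3", "http.handle", 3.9), ("t3", "redis.get", 240.0),
    ("t4", "http.handle", 4.1), ("t4", "redis.get", 1.3),
    ("t5", "http.handle", 4.0), ("t5", "redis.get", 1.2),
]


def intermittent_slow_ops(records, ratio=5.0):
    """Flag operations whose worst duration far exceeds their median.

    Returns {operation: (median_ms, worst_ms)} for every operation where
    worst > ratio * median. The default ratio is an arbitrary choice.
    """
    by_op = defaultdict(list)
    for _trace_id, op, ms in records:
        by_op[op].append(ms)

    flagged = {}
    for op, values in by_op.items():
        typical = median(values)
        worst = max(values)
        if worst > ratio * typical:
            flagged[op] = (typical, worst)
    return flagged


flagged = intermittent_slow_ops(DURATIONS_MS)
for op, (typical, worst) in flagged.items():
    print(f"{op}: typically {typical} ms, but up to {worst} ms")
```

An average would blur a single 240 ms outlier into a mildly elevated number. Comparing the median with the worst case keeps the intermittent spike visible and ties it to a specific operation.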
Best practices for tracing in Kubernetes
To get the most out of Kubernetes tracing, consider the following best practices:
• Use different, complementary observability sources: Tracing is not a replacement for metrics and logs; on the contrary, metrics and logs can provide additional context that helps you interpret traces. Use all three data sources – metrics, logs and traces – together for maximum effect.
• Keep the Kubernetes context in mind: Tracing data by itself is not complete, so always check the state of your Kubernetes cluster as well. For example, a service that hits its CPU limit gets throttled, which can show up as high-latency traces.
• Store trace data efficiently: Like the rest of your Kubernetes monitoring data, tracing data can be costly to store – but it's also worth keeping on hand in case you want to analyze a trace again in the future. So, be strategic about which traces you retain and how long you retain them for. For instance, traces associated with issues that occur only intermittently are more valuable to retain (because you might not be able to reproduce the errors again on demand) whereas trace data is less valuable if you can go and run a new trace for the same issue whenever you want.
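The retention advice in the last bullet can be expressed as a simple rule. The sketch below is only one possible policy: the `TraceSummary` fields, the 500 ms "slow" threshold and the 90/30/3-day retention periods are all hypothetical values that you would tune to your own storage budget.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Minimal facts about a finished trace (hypothetical fields)."""
    trace_id: str
    had_error: bool
    duration_ms: float
    reproducible: bool  # can the issue be triggered again on demand?


def retention_days(trace, slow_ms=500.0):
    """Decide how long to keep a trace. All numbers are illustrative."""
    if trace.had_error and not trace.reproducible:
        return 90  # intermittent failures are hard to capture again
    if trace.had_error or trace.duration_ms >= slow_ms:
        return 30  # still interesting, but can be re-traced if needed
    return 3       # healthy, fast traces: keep only briefly


examples = [
    TraceSummary("t1", had_error=True, duration_ms=120.0, reproducible=False),
    TraceSummary("t2", had_error=True, duration_ms=120.0, reproducible=True),
    TraceSummary("t3", had_error=False, duration_ms=900.0, reproducible=True),
    TraceSummary("t4", had_error=False, duration_ms=40.0, reproducible=True),
]

for t in examples:
    print(f"{t.trace_id}: keep for {retention_days(t)} days")
```

The point of the rule is the ordering, not the specific numbers: traces you can't recreate get the longest retention, and routine traces get the shortest.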
Why we love tracing
While the solution to a persnickety toddler may be shipping that kid off to pre-K boarding school (which is a thing, it turns out), you don't need to take such drastic measures to fix performance issues in Kubernetes. Before moving your microservices workloads back to a less complex hosting environment, use tracing tools – alongside metrics and logs – to pinpoint what's going wrong with Kubernetes workloads and take action to remediate the root of the problem.