Aviv Zohari, Founding Engineer
10 minute read, May 30th, 2023

In some respects, trying to achieve performance optimization for Kubernetes is like trying to feed a picky toddler. One day you feed her macaroni and cheese and she eagerly eats it up. So you give it to her again the next day, at which point she promptly dumps it all over the floor, refuses to eat lunch and, two hours later, complains that she's hungry because you "didn't give her lunch." At that point you offer her pizza, which she also throws on the floor, demanding macaroni and cheese. So you make her a new batch, but she violently melts down because one of the macaronis is broken.

Exhausted, you give up and let her stuff herself with Skittles while streaming Blippi. It's a dirty hack of a fix, to be sure, but at least you're putting some food in the kid.

What do Kubernetes workloads have in common with capricious toddlers, you ask? The answer is that both have a tendency to go off the rails in unpredictable ways – and unless you understand exactly what's going on, you resort to ugly fixes because you can't identify the root-cause problem. A bug that shows up in a Kubernetes workload one moment may disappear the next. Or, you might run into a situation where you know something's off, but you can't figure out exactly what's causing the issue amid the sprawling conglomeration of services, protocols, and infrastructure components that is your Kubernetes cluster.

Now, we can't tell you how to troubleshoot the behavior of a toddler who acts one way one day, then in a totally opposite way the next day, all the while acting as if her parents are the crazy ones. There are, alas, no tracing tools for toddlers. Fortunately, though, tracing tools and application performance monitoring solutions do exist for Kubernetes. When used in the right way, tracing tools make it possible to get to the root of even the most complex performance issues in microservices architectures.

This article shows how by explaining how tracing in Kubernetes works and highlighting best practices for using traces to optimize the performance of workloads running on the Kubernetes container orchestration platform.

What is Kubernetes tracing?

Kubernetes tracing is the practice of monitoring the execution of Kubernetes-based applications and services on a step-by-step basis.

Typically, Kubernetes tracing involves distributed tracing, which is a specific type of tracing, but other tracing methods are possible. We'll discuss different types of tracing in a moment.

Understanding tracing

First, though, let's talk about tracing in general and how it works.

Tracing is a software monitoring and troubleshooting technique that allows engineers to track the execution of a program step-by-step. In this way, tracing can reveal exactly where a program fails or where a performance problem appears.

As an example of tracing at work, imagine you have a web application that stores data in Redis. The application is responding slowly to user requests, but you don't know exactly what's going wrong. So you use tracing tools to monitor requests as they flow from HTTP clients, to the application server, to the Redis data store and back again to the client. By monitoring how long each step of the journey takes, you can identify where the delay occurs and which component it's associated with. That's a lot better than simply knowing that requests are slow but having no idea where the issue originates.

Tracing vs. logging

Importantly, tracing is different from logging and other forms of monitoring because with tracing, you follow requests as they flow through multiple software components. In contrast, most monitoring and logging tools only record discrete events or collect metrics.

For example, a log for the web app described above might record that a client request was processed and tell you how long it took to complete it. But a log wouldn't typically offer granular, step-by-step details on exactly how the request proceeded. Likewise, a monitoring tool may tell you how much CPU and memory an application was consuming at a given point in time. But it won't show you how requests were flowing within the application at any given moment.

Distributed tracing vs. system call tracing and application tracing

It's worth noting, too, that there are multiple types of tracing – most notably, distributed tracing, system call tracing, and application tracing.

In the context of Kubernetes, the most common tracing technique is distributed tracing. This is the practice of monitoring application requests as they flow within a distributed system – meaning one that consists of multiple components. A microservices app is a classic example of a distributed system. You can perform distributed tracing in this type of system by tracking requests as they flow between microservices.

In contrast, system call tracing involves monitoring the way applications interact with the operating system kernel. Its main purpose is debugging how a program interacts with the OS, and it works with almost any type of application or operating system, not just those with a distributed architecture. System call tracing is not typically used for Kubernetes monitoring or troubleshooting because Kubernetes is not an operating system, and OS-level debugging is only loosely related to Kubernetes performance management.

As for application tracing, this is a broad term that refers to any type of tracing technique that focuses on monitoring application execution. Distributed tracing is one type of application tracing; however, you can also perform application tracing on non-distributed, monolithic applications. Typically, you'd do this by closely monitoring how the application's state changes during its course of execution – as opposed to monitoring the flow of requests between microservices (which don't exist in a monolithic app). Because applications deployed on Kubernetes are usually distributed apps, distributed tracing is the only kind of application tracing that you'll commonly encounter in Kubernetes environments.

Why tracing matters in Kubernetes environments

In many cases, distributed tracing is the only way – or, at a minimum, the most efficient way – to get to the root of a complex performance issue. There are several reasons why.

1. Increased visibility

One is that in modern, complex environments, you're often dealing with a jungle of different components, services, and protocols. That makes it very tricky in many cases to figure out exactly which piece of the puzzle is the problematic one.

2. Monitoring periodic failures

Another reason why tracing is critical is that sometimes, issues only manifest themselves irregularly. Just as our picky toddler is happy eating mac and cheese one day but becomes unreasonably (by adult standards, at least) angry about the same meal the next day, an application that is perfectly responsive one moment might start failing the next, only to go back to normal shortly thereafter. By tracking program execution in a highly granular, context-aware way, tracing helps troubleshoot failures that don't occur systematically or consistently.

3. Understanding relationships between components

A third benefit of tracing is that it's a whole lot easier to use traces to understand how different parts of your stack fit together than it is to rely just on logs and metrics for that purpose. If you want to take the long way, you could sit down and meticulously compare multiple log files – such as logs from your Web server and your Redis server – in an effort to figure out where the wonkiness happens when the application experiences a performance issue.

But correlating log files like this is a lot of work. You'll be poring over timestamps, trying to figure out which events in one log potentially relate to events in the other log. There is also no guarantee that you'll actually be able to troubleshoot the issue, because the logs may not record enough information to allow you to figure out what's going wrong. And as for metrics, they are an even blunter measure of overall performance.

So, tracing provides critical context that you just can't obtain using other application performance monitoring methods.

Tracing components in Kubernetes

Sample set of tracing components in a simple Kubernetes-based web application:

| Component | Description |
|---|---|
| Redis data store | Supports the web server by providing caching, etc. |
| Kubernetes control plane | Manages the cluster. Includes multiple sub-components (API server, Scheduler, Etcd, etc.). |
| Nodes | Servers that form the Kubernetes cluster. |
| Hypervisor | Operates nodes as virtual machines. |
| Pods | Host the web server, Redis, and possibly other software used for monitoring or security purposes. |
| Services | Expose the web server to the network. |

Distributed tracing is valuable in any environment that includes multiple interdependent components. But it's especially important in Kubernetes, which takes the cake when it comes to complexity and having a large number of moving pieces.

After all, a web server that depends on a Redis data store would be complex enough to troubleshoot if the app ran directly on a bare-metal host server. But if you containerize the app and deploy it to Kubernetes, you introduce a whole slew of other dependencies. In all, your list of major components would look like:

  • A Redis data store, which can support tasks like caching web content.
  • A Kubernetes control plane – including an API server, Scheduler, Etcd key-value store, and other software that Kubernetes uses to manage Kubernetes clusters.
  • A set of nodes, which form the Kubernetes cluster.
  • A hypervisor, if your Kubernetes nodes are virtual machines.
  • Pods, which host the Web server, Redis and possibly additional software (like monitoring agents) that you use to help operate the server.
  • One or more Kubernetes Services to make your web server accessible over the Internet.

This means that if you need to trace an app in Kubernetes, you need to figure out how all of these various components – some of which (like the application's containers) relate directly to the app, and others of which are cluster-wide components that provide auxiliary services and aren't tied to the app specifically – fit together to make your app work. Maybe your tracing data will reveal that some nodes the app depends on are low on memory, which is why the app intermittently becomes slow to respond. Maybe you'll discover that there are too many workloads in the cluster, causing the Scheduler to struggle to find nodes with enough capacity for the app on occasion. Maybe a service discovery issue is causing disconnects between the app and Redis, which runs in its own Pods.

We could go on, but you get the point: Tracing in Kubernetes means collecting data from a whole host of different components and workloads in order to understand which one is the culprit when something strange happens with an app.

Tracing in Kubernetes: Examples and use cases

To drill down even further into what tracing for Kubernetes might involve, let's discuss two specific examples where you could use tracing to troubleshoot the web app described above.

Scenario 1: Debugging microservice architectures

In the first example, imagine that the root cause of the application performance issue is that the HTTP server occasionally sends a malformed request to the Redis server, which causes the API call to fail. Of course, you almost certainly wouldn't know this just by monitoring application metrics or logs. Those would tell you that something is slow or failing, but not much, if anything, about the root cause.

But if you run traces on multiple application requests, you'll see that some requests fail at the point where the HTTP server calls the Redis server. From there, you can examine the specific requests, determine that they are improperly formatted or incomplete, and conclude that the root cause of the problem is bad requests – in which case it's likely that you need to update the application logic to ensure it sends proper requests.

Scenario 2: Identifying performance bottlenecks

Alternatively, imagine that the Redis server occasionally responds slowly to queries from the HTTP server. Again, you likely wouldn't know this just by looking at logs or metrics – unless you meticulously compared Redis server logs against the HTTP server logs, in which case there's a chance you'd notice that the HTTP server makes requests that the Redis server takes a long time to fulfill. But again, that would be a lot of effort, and there's no guarantee the logs would be detailed enough to surface the issue.

A better approach is to use traces to see what happens as requests flow between the HTTP server and the Redis server. Tracing data would quickly reveal slow responses from the latter, allowing you to investigate further and determine that the root cause of the problem lies with the way Redis is configured.

Best practices for tracing in Kubernetes

To get the most out of Kubernetes tracing, consider the following best practices:

  • Standardize tracing data: The more consistent your traces are, the easier it is to draw accurate comparisons between them and identify anomalies. For this reason, you should strive to compare traces of the same type of request. Otherwise, it's difficult to know whether an issue like one trace taking longer than another is a sign of a performance problem or simply a reflection of one trace requiring more resources to process than the other.
  • Instrument efficiently using tracing standards: To collect tracing data, you need to instrument tracing within your applications. Instrumentation means implementing logic that makes it possible to monitor requests within each microservice or other traced component. This can be a lot of work if you write the logic from scratch. To gain efficiency, consider leveraging standardized tracing libraries and APIs, like those available from OpenTelemetry or OpenTracing.
  • Use multiple, complementary observability sources: Tracing is not a replacement for metrics and logs; on the contrary, metrics and logs provide additional context that helps you interpret traces. Use all three data sources – metrics, logs, and traces – together for maximum effect.
  • Keep the Kubernetes context in mind: Tracing application requests by itself doesn't provide a complete picture. You should also look at the state of your Kubernetes system. For example, a service reaching its CPU limits might cause high-latency traces.
  • Store trace data efficiently: Like the rest of your Kubernetes monitoring data, tracing data can be costly to store, but it's also important to keep on hand in case you want to analyze a trace again in the future. Be strategic about which traces you retain and for how long. For example, traces associated with issues that occur only intermittently are more valuable to retain (because you might not be able to reproduce the errors on demand), whereas trace data is less valuable if you can run a new trace for the same issue whenever you want. One way to automate that kind of selection is sketched after this list.
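
On that last point, one way to be selective about retention (a sketch, not a requirement of any particular tool) is the tail_sampling processor from the OpenTelemetry Collector's contrib distribution. It waits until a trace is complete and then keeps only the traces that match your policies, such as traces that contain errors or exceed a latency threshold. The policy names and thresholds below are illustrative assumptions:

processors:
  tail_sampling:
    decision_wait: 10s          # wait for a trace's spans to arrive before deciding
    policies:
    - name: keep-errors         # always retain traces that contain an error
      type: status_code
      status_code:
        status_codes: [ERROR]
    - name: keep-slow           # retain traces slower than 500 ms
      type: latency
      latency:
        threshold_ms: 500

Traces that match neither policy are dropped, which keeps storage costs down while preserving the intermittent failures that are hardest to reproduce on demand. Remember that the processor only takes effect once it's added to your traces pipeline.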

Popular Kubernetes tracing tools

Kubernetes itself doesn't offer built-in tracing tools or functionality beyond support for exporting traces from Kubernetes system components, which is a beta feature as of Kubernetes version 1.27. This feature must be turned on explicitly by providing the API server with a tracing configuration file, which specifies, among other things, the endpoint that traces are exported to (roughly like the sketch below). This capability is useful for implementing basic Kubernetes tracing, but it falls short of being a full-fledged Kubernetes tracing tool because it only covers Kubernetes system components. It also only generates traces; it doesn't do anything to help you analyze them, which requires third-party tools.
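
For reference, the tracing configuration file is a small YAML document. A minimal sketch might look like this (the exact apiVersion depends on your Kubernetes release, and the endpoint value is an assumption about where your OTLP-compatible collector listens):

# TracingConfiguration passed to kube-apiserver via the --tracing-config-file flag
apiVersion: apiserver.config.k8s.io/v1beta1
kind: TracingConfiguration
# OTLP gRPC endpoint of your collector (assumed value)
endpoint: localhost:4317
# Sample 100 out of every 1,000,000 requests
samplingRatePerMillion: 100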

However, a variety of third-party tracing solutions exist that offer more complete support for Kubernetes tracing. Here's a look at some of the most popular open source offerings:

  • Jaeger: Jaeger is an open source distributed tracing tool originally developed by Uber. It's a popular choice for tracing microservices running on Kubernetes and other distributed systems, and it's easy to spin up on a cluster for experimentation (see the sketch after this list).
  • Zipkin: Another open source distributed tracing system that helps collect and visualize timing data. Developed by Twitter, Zipkin offers an easy setup and integrates well with Kubernetes and other monitoring tools. Its functionality is a bit more limited than that of Jaeger, but it's easier to use.
  • SigNoz: SigNoz is an open source observability platform that supports tracing in addition to logging and metrics collection – which means it's more of an end-to-end solution than Kubernetes tracing software like Jaeger and Zipkin. The latter focus on tracing alone, which is a disadvantage if you need a generic observability solution, but an advantage if you want a tool that is designed specifically for tracing.
  • OpenTelemetry: OpenTelemetry isn't a tracing tool per se; it's a set of instrumentation libraries and APIs that can be used to help collect tracing data. Still, OpenTelemetry is an important solution for adding efficiency to tracing instrumentation.
  • OpenTracing: Like OpenTelemetry, OpenTracing provides instrumentation libraries and APIs to help collect tracing data. The main difference is that OpenTracing focuses just on tracing, whereas OpenTelemetry can also instrument logs and metrics collection. Note that OpenTracing has since been archived and folded into OpenTelemetry, so new instrumentation work generally standardizes on OpenTelemetry.
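
As an aside on the Jaeger item above, a quick way to experiment with it on a cluster is the Jaeger "all-in-one" image, which bundles the collector, query service, UI, and in-memory storage in a single container. The manifest below is a minimal sketch for testing only; the namespace, labels, and image tag are assumptions, and the in-memory storage is not suitable for production:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: observability
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.47
        env:
        - name: COLLECTOR_OTLP_ENABLED    # accept OTLP spans on ports 4317/4318
          value: "true"
        ports:
        - containerPort: 16686            # Jaeger UI
        - containerPort: 4317             # OTLP gRPC ingest

Expose port 16686 with a Service (or a port-forward) to reach the Jaeger UI, and point your OTLP exporters at port 4317.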

How to implement tracing in Kubernetes with groundcover

The open source Kubernetes tracing tools we just mentioned are useful for generating traces and collecting tracing data. But on their own, they don't provide the deep context that you need to correlate traces with other observability insights or drill down into performance issues. That's where groundcover comes in. By providing robust support for Kubernetes tracing in conjunction with other core Kubernetes observability capabilities, groundcover handles not just traces, but the full picture.

But wait – there's more! One of the other standout features of groundcover when it comes to tracing is support for automatically generating traces using eBPF – a hyper-efficient approach that avoids the CPU and memory overhead associated with traditional tracing techniques.

At the same time, however, groundcover also fully supports ingesting third-party traces – so no matter how you want to generate trace spans, you can analyze that data and correlate it with other observability insights using groundcover.

Tracing implementation steps on groundcover

There are two ways to go about setting up Kubernetes tracing in groundcover.

One is simply to use eBPF-based traces, which groundcover generates automatically for every supported protocol and service. You don't need to do anything special to implement traces in this way; we do it for you, making the tracing data automatically available for analysis through your groundcover dashboards.

What if you want to ingest your own tracing data, you ask? Well, you can do that, too. The basic steps are as follows.

1. Identify your groundcover OpenTelemetry Collector Endpoint

You'll need to know your groundcover OpenTelemetry Collector Endpoint so you can direct traces to it. The location varies depending on your groundcover installation type, which is either Managed inCloud or Standard inCluster.

2. Define exporter pipelines

Next, add a new exporter:

exporters:
  otlp/groundcover:
    endpoint: {GROUNDCOVER_OPENTELEMETRY_COLLECTOR_ENDPOINT}:4317

Replace {GROUNDCOVER_OPENTELEMETRY_COLLECTOR_ENDPOINT} with the value of your actual endpoint.

3. Add traces to your pipeline

Finally, add the following traces pipeline:

pipelines:
  traces:
    exporters:
    - otlp/groundcover
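
Note that in a standalone OpenTelemetry Collector configuration, pipelines sit under the service section, and a pipeline also needs a receiver to accept spans from your instrumented applications. Putting the pieces together, a minimal sketch might look like this (the otlp receiver is an assumption about how your traces arrive):

receivers:
  otlp:                 # receive spans from instrumented apps over OTLP
    protocols:
      grpc:
      http:

exporters:
  otlp/groundcover:
    endpoint: {GROUNDCOVER_OPENTELEMETRY_COLLECTOR_ENDPOINT}:4317

service:
  pipelines:
    traces:
      receivers:
      - otlp
      exporters:
      - otlp/groundcover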

That's it! Your traces will now be forwarded from your OpenTelemetry-compatible tracing tool to groundcover, where you can work with them alongside the rest of your Kubernetes observability data.

Why we love tracing

While the solution to a persnickety toddler may be shipping that kid off to pre-K boarding school (which is a thing, it turns out), you don't need to take such drastic measures to fix performance issues in Kubernetes. Before moving your microservices workloads back to a less complex hosting environment, use tracing tools – alongside metrics and logs – to pinpoint what's going wrong with your Kubernetes workloads and take action to remediate the root of the problem.
