Kubernetes Observability: Lean on the Pillars for Laser-Sharp Cluster Insights
By wisely leveraging your logs, metrics and traces as part of a cloud-native observability approach, you can gain a comprehensive understanding of the behavior and performance of your Kubernetes cluster. You can then use the information gathered through Kubernetes observability to troubleshoot issues, optimize performance and improve the overall reliability of your system.
If you work in software development or operations, you've probably heard by now all about the so-called "pillars of observability." There are three of them, and when you put them together, you theoretically learn everything you need to know to keep your cloud-native applications healthy, wealthy and wise beyond their years – or at least performing adequately.
It's one thing to talk about the pillars of observability. It's quite another, however, to collect the data that powers them. No matter how many pillars you choose to believe in, actually getting at the metrics, logs, traces and whichever other data sources you require for observability can be hard work – especially when you're dealing with complex cloud-native environments, like Kubernetes.
What is Kubernetes Observability?
Let's start by defining what Kubernetes observability actually means.
Kubernetes is a so-called cloud-native platform because it's designed to orchestrate cloud-native applications, meaning those that use loosely coupled, scalable and flexible architectures. (Confusingly, cloud-native doesn't necessarily mean your apps have to run in a cloud environment, but we'll save that discussion for another day.) So, if you want to monitor your Kubernetes clusters, you need a cloud-native observability strategy that's tailored specifically for Kubernetes and can handle the complex and dynamic nature of cloud-based systems.
Cloud-native Kubernetes observability refers to the ability to monitor and analyze the behavior and performance of Kubernetes clusters in a cloud-native environment, while also being able to understand the Kubernetes context of each event in the system. Armed with that context, you can troubleshoot issues, optimize performance and improve the overall reliability of your system.
Kubernetes observability vs. monitoring
For the sake of clarity, let's make clear that when we talk about Kubernetes observability, we're referring to something distinct from Kubernetes monitoring.
There’s no shortage of blog posts, conference presentations and videos that dive into long discussions of how monitoring is different from observability. But suffice it to say that the main difference is that monitoring mostly boils down to collecting data that tells you what's happening with a system. Through monitoring, for example, you can determine whether an application has stopped responding or monitor how long requests take to process.
Observability, on the other hand, is all about understanding why something is happening in a system. When you observe a system, you get insights that help you understand what's causing an application to crash, or which specific microservice within a distributed application is bottlenecking application requests. These insights enable performance tuning, help with cluster performance optimization and maximize the scalability of your cluster.
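The monitoring/observability distinction above can be sketched in a few lines of code. In this hypothetical example (the services, durations and span layout are invented for illustration, not output of any real tracing tool), the "monitoring" view reports a single symptom – total request latency – while the "observability" view drills into the trace to find which service is responsible:

```python
# Spans from one distributed request (service name, duration in ms).
# Treated as a flat list of siblings under the root span for simplicity.
spans = [
    {"service": "frontend", "duration_ms": 420},   # root span
    {"service": "auth", "duration_ms": 15},
    {"service": "checkout", "duration_ms": 380},
    {"service": "payments", "duration_ms": 350},
]

# The "monitoring" view: one number describing WHAT is happening.
total_latency = spans[0]["duration_ms"]
print(f"request took {total_latency} ms")

# The "observability" view: inspect the trace to see WHY it is slow.
# Skip the root span, whose duration includes all of its children.
children = spans[1:]
bottleneck = max(children, key=lambda s: s["duration_ms"])
print(f"bottleneck: {bottleneck['service']}")
```

The same idea scales up in real systems: the aggregate metric raises the alarm, and the trace data answers the follow-up question.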
So, if you were to monitor Kubernetes, you'd merely be collecting data that shows you what the state of your cluster is at any given point in time. When you observe Kubernetes, however, you go deeper. You can understand the context necessary to know how containers, pods, nodes, key-value stores, API servers and so on are interacting with each other, and how their interactions influence the overall health and performance of the environment.
The basics of Kubernetes observability: Data sources
So, how do you go about observing Kubernetes? Part of the answer involves data sources that reveal what is happening deep inside Kubernetes. There are three key data sources in this regard:
- Logs: Logs are records of events that occur within a system. In Kubernetes, logs are generated primarily by containers, as well as by system components such as the kubelet and the API server. By analyzing log data, you can gain insights into the behavior of your Kubernetes cluster and troubleshoot issues as they arise.
- Metrics: Metrics are measurements of system behavior over time. In Kubernetes, metrics can be collected from various sources, such as pods and nodes. By analyzing metrics, you can gain insights into the performance and resource utilization of your Kubernetes cluster.
- Traces: Traces are records of the flow of requests through a system. In Kubernetes, traces can be generated by applications running in pods. By analyzing trace data, you can gain insights into the performance of individual requests and identify bottlenecks and issues in your system.
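To make the three data sources concrete, here is a minimal sketch of the shape each one takes. The field names and values are illustrative, not any particular tool's schema:

```python
from dataclasses import dataclass

@dataclass
class LogRecord:       # an event: something that happened, once
    timestamp: str
    pod: str
    message: str

@dataclass
class MetricSample:    # a measurement: a value observed over time
    timestamp: str
    name: str
    labels: dict
    value: float

@dataclass
class TraceSpan:       # one hop in a request's path through the system
    trace_id: str
    service: str
    operation: str
    duration_ms: float

# Invented example records for each data source:
log = LogRecord("2024-01-01T12:00:00Z", "checkout-7d4f",
                "payment failed: product id not found")
metric = MetricSample("2024-01-01T12:00:00Z", "container_cpu_usage",
                      {"pod": "checkout-7d4f"}, 0.73)
span = TraceSpan("abc123", "checkout", "POST /pay", 412.0)
```

Note how all three records can share Kubernetes metadata (the pod name, here): that shared metadata is what later lets you correlate them.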
The point is that one key step toward Kubernetes observability is collecting the data that allows you to understand what is happening in your clusters. The way you choose to define that data – and whether or not you think it can all fit neatly within the three-pillar model, or you opt for a more nuanced take – isn't really important. What matters is simply understanding that you can't observe Kubernetes if you don't have observability data.
Getting started with Kubernetes observability
Now, it would be awesome if all of the logs, metrics, traces and other data you need to observe Kubernetes environments were centralized in the same place. Sadly, that's not the case. Observing Kubernetes requires a means of collecting observability data from all components of your cluster. There are a few different ways to go about this.
Agent-based observability (pre-eBPF era)
One approach to Kubernetes observability is to go out and deploy monitoring agents on each node and/or pod in your cluster. The agents can collect metrics, logs and potentially other data from the components that they have access to.
This method – the traditional approach to Kubernetes observability, and the one you'll still find at the heart of conventional Kubernetes monitoring and observability tools – will get you most of the data you need to understand what's happening in your clusters. The catch is that deploying and maintaining each agent (typically as a DaemonSet, so one copy runs on every node) is a lot of work. Worse, those agents consume a lot of resources when they run, so they can end up starving your actual workloads of the resources they need to operate optimally.
The metrics API
The metrics API is a native Kubernetes feature that exposes data about resource usage for pods and nodes. At first glance, the metrics API probably sounds great. Why wouldn't you simply collect your observability data from across the cluster using a centralized API, without having to deal with monitoring agents?
Well, mainly because the metrics API only exposes a fraction of the data you need to observe Kubernetes effectively. As its name implies, the metrics API generates metrics, not logs, traces or other data. And the metrics it produces reflect the health of only certain Kubernetes components.
So, while the metrics API is convenient, it doesn't provide all of the data you need for complete Kubernetes observability. You either have to settle for limited insights, or you have to combine the metrics API with other Kubernetes observability methods.
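One practical detail worth knowing: the metrics API reports usage as Kubernetes resource quantities – CPU in units like millicores ("250m") or nanocores ("421337n"), memory in binary units like "128Mi" or plain bytes. Here's a minimal parser for the common suffixes (a sketch, not the full Kubernetes quantity grammar):

```python
def parse_cpu(q: str) -> float:
    """Convert a CPU quantity string to cores."""
    if q.endswith("n"):          # nanocores
        return int(q[:-1]) / 1e9
    if q.endswith("u"):          # microcores
        return int(q[:-1]) / 1e6
    if q.endswith("m"):          # millicores
        return int(q[:-1]) / 1e3
    return float(q)              # plain cores

def parse_memory(q: str) -> int:
    """Convert a memory quantity string to bytes."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return int(q[:-2]) * factor
    return int(q)                # plain bytes

# A usage entry shaped like a metrics API response (values invented):
usage = {"cpu": "250m", "memory": "128Mi"}
print(parse_cpu(usage["cpu"]))        # CPU in cores
print(parse_memory(usage["memory"]))  # memory in bytes
```

Client libraries handle this conversion for you, but it's useful to recognize the format when reading raw metrics API output.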
Agent-based Kubernetes observability with eBPF
A third method for getting Kubernetes observability data – and one that is not subject to the drawbacks of the other two – involves using the extended Berkeley Packet Filter, or eBPF, to collect traces.
eBPF is a framework that enables you to run programs in the kernel space of Linux-based servers. Because software that runs in the kernel is hyper-efficient, eBPF lets you deploy monitoring and observability software (in addition to other kinds of tools) that consumes minimal resources, so you avoid the resource overhead issues associated with traditional software agents. In addition, eBPF programs run in sandboxed environments inside the kernel, so they are very secure.
Typically, you'll want to pair eBPF with some other data sources and tools. You'll want to monitor Kubernetes metadata and log data, for example. But by using eBPF as the foundation of your Kubernetes observability and container monitoring strategy, you get as much visibility and context as possible, with as little waste as possible.
One observability platform to rule them all: Or, why you need groundcover
If you've read this far, you're probably thinking, "I'm super stoked about eBPF! But how do I actually use it?"
That's a good question, because eBPF is a very complex tool. Creating eBPF monitoring tools from scratch requires a lot of coding, probably followed by a lot of troubleshooting when your code doesn't work as expected. And then you have to figure out how to interpret the data that eBPF collects, which is a whole gargantuan task unto itself because it requires choosing a data visualization and/or analytics tool and then setting up a pipeline to move eBPF-generated data into it.
That's why the real-world way to leverage eBPF for Kubernetes observability is to take advantage of Kubernetes observability tools, like groundcover, that are powered by eBPF under the hood. At groundcover, we use eBPF to help collect traces. We also tag that data automatically with cloud native metadata – such as container names, pod names and nodes – so that it's immediately actionable. And we expose it through an interface that makes it easy for you to glean actionable insights based on the data.
For example, in the following video, we use groundcover to troubleshoot an HTTP trace with a 500 status code in the frontend service. groundcover identifies error logs emitted at the same time as the trace, and we can see that the required product ID was not found.
In this example, we use Kubernetes events and metrics to catch a memory leak in one of our workloads.
Best practices for Kubernetes observability
Now that we've walked through the essentials of Kubernetes observability – and highlighted how next-generation tools like eBPF make it possible to achieve Kubernetes observability insights in ways that once seemed unimaginable – let's talk about best practices for getting the most out of Kubernetes observability.
Choose the right Kubernetes observability tools and frameworks
Arguably the most important Kubernetes observability best practice is selecting the right tools and frameworks. Above, we made the case – and we think we made it well – that eBPF-based tooling is the best solution in most cases. But in addition to evaluating whether your tools use eBPF or not, you'll also want to think about ease of implementation. Is your observability software fast and simple to deploy, or does it require complex setup? Do you have to configure data visualization and analytics software separately, or is it all built into your tooling?
Think about pricing as well – including not just the direct purchase or licensing costs of observability tools themselves, but also about secondary costs like data ingestion and storage fees. Kubernetes observability tools that seem affordable on the surface may turn out to cost more than you counted on due to add-on costs like these.
Strive for performance
As we mentioned, Kubernetes observability software can place a hefty burden on your clusters. Monitoring software may suck up considerable memory and CPU, depriving your actual workloads of those resources. That's another reason to opt for eBPF-based solutions, which take advantage of kernel space to collect observability data with minimal resource overhead and performance impact.
Focus on context
Even by the standards of cloud-native software, which is always more complex than earlier software architectures, Kubernetes is especially complex. It consists of a plethora of distinct components, and it's full of complicated abstractions.
To sort through all of this complexity, context is key. By context, we mean collecting observability data that allows you to understand how different components relate to each other, as well as to see through abstractions in order to get at the underlying causes of Kubernetes performance issues.
The point here is that the more Kubernetes observability data you collect, and the greater your ability to interrelate different types of data, the more effective your observability strategy will be.
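In practice, "context" usually means joining different data types on shared Kubernetes metadata such as pod or node names. A minimal sketch of the idea (all records are invented for illustration): correlate error logs with CPU metrics for the same pod, turning "an error happened" into "an error happened on a pod that was nearly CPU-saturated."

```python
logs = [
    {"pod": "checkout-7d4f", "level": "error",
     "message": "timeout talking to payments"},
    {"pod": "frontend-9k2b", "level": "info",
     "message": "request served"},
]
metrics = [
    {"pod": "checkout-7d4f", "cpu_cores": 0.97},
    {"pod": "frontend-9k2b", "cpu_cores": 0.12},
]

# Index metrics by the shared Kubernetes metadata (pod name).
cpu_by_pod = {m["pod"]: m["cpu_cores"] for m in metrics}

# Enrich each error log with the metric context for its pod.
correlated = [
    {**log, "cpu_cores": cpu_by_pod.get(log["pod"])}
    for log in logs
    if log["level"] == "error"
]
print(correlated)
```

Real observability platforms do this tagging and joining automatically and at scale, but the underlying principle is the same: shared metadata is what makes data interrelatable.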
Kubernetes observability so simple, your grandpa can do it
In closing, let us point out that until just a few years ago, devising a Kubernetes observability strategy was a lot of work. At that time, eBPF wasn't yet mature enough for production usage, and collecting observability data from Kubernetes required making a lot of compromises – such as accepting the inefficiency of agent-based monitoring, or settling for sampled data that didn't always provide accurate insights.
Fortunately, those limitations are a thing of the past. Today, even your grandpa – whom we assume is not a seasoned Kubernetes admin, although we could be wrong – can use observability solutions like groundcover to achieve simple, efficient, secure and effective Kubernetes observability powered by eBPF.