If you've ever cooked a filet of fish, you know something about what it's like to monitor Kubernetes applications. That’s right, because frying a filet to perfection – just enough that you eliminate the risk of food poisoning, but not so much that you end up with a tough and dry piece of fish – is akin to the challenge of optimizing Kubernetes application performance. On the surface, it's hard to tell how cooked your fish is, just as collecting surface-level monitoring metrics from Kubernetes provides relatively little insight into how well your apps are actually performing.

Fish cooking tips aside, what we're actually here for is to help you conquer the challenges of monitoring Kubernetes application performance to optimize efficiency and reliability. Below, we walk through the metrics you should track for Kubernetes apps, as well as the challenges that make Kubernetes app performance monitoring difficult. We then discuss tools and practices that help streamline Kubernetes application performance monitoring, regardless of which Kubernetes distribution you use or which types of apps you deploy.

Why Kubernetes performance monitoring matters

Let's start by discussing why you should care about Kubernetes app performance monitoring.

The answer might seem simple – if you don't monitor your apps, you risk running into performance issues like slow response rates or an inability to maintain service level agreements (SLAs) with your customers. But actually, the importance of Kubernetes performance monitoring involves more than just ensuring that your apps keep operating. It's also about ensuring that you can optimize your Kubernetes environment for efficiency.

After all, Kubernetes can mitigate some performance issues on its own. If an application starts sucking up a large amount of memory and risks exhausting the resources available on its host node, for example, Kubernetes can evict the app's Pod and reschedule it on a different node that has more memory available. In this way, Kubernetes can help prevent the app (and host node) from dropping requests or crashing due to high resource utilization.

But just because your Kubernetes app isn't failing doesn't mean you should not be monitoring it. In the example above, you probably want to know why your app uses a lot of memory. It could be a memory leak bug in the application code. It could be due to poor memory management. It could be a backlog of requests that the app is struggling to catch up on. It could be any number of other things, too, but you won't be able to identify and resolve the root cause of the problem unless you’re monitoring the app.

By extension, you can't optimize the overall performance and efficiency of Kubernetes unless you monitor your apps. If you have applications that are consuming resources inefficiently or not handling requests as quickly as they should, your cluster is probably wasting resources. That leads to higher infrastructure costs, as well as an increased risk that Kubernetes will eventually reach the point where it can no longer keep shifting workloads between nodes, and your entire cluster will come crashing down.

Kubernetes app monitoring helps you stay ahead of issues like these. It allows you to identify application performance issues early, so you can address them before they lead to wasted money or place the stability of your cluster at risk.

The challenges of monitoring Kubernetes applications

Unfortunately, detecting performance issues for Kubernetes apps is often complicated, for a variety of reasons:

 • Distributed applications: The apps you deploy on Kubernetes are probably microservices-based and include multiple containers (each hosting its own microservice). As a result, you need to monitor the performance of each container, while simultaneously tracking how performance issues of one container impact other containers.

 • Complex root causes: The complex dependencies between Kubernetes applications and host infrastructure mean that an application component where you detect an issue may be different from the component associated with the issue's root cause. An app could be experiencing errors due to a hardware problem on a host node, for example, but you probably wouldn't know that just by tracking the application's error rate.  

 • Dynamic scaling: Kubernetes constantly scales applications up and down in response to fluctuations in demand. That's one of its main jobs, and it helps ensure optimal use of available resources. But it also makes it impossible to establish a static baseline of "normal" activity and measure deviations against it. As a result, you can't detect app performance problems with simplistic strategies like checking whether an application exceeds a preset level of resource consumption. You need a more dynamic monitoring strategy.

 • Varying application needs: Along similar lines, there’s no one-size-fits-all guide for which performance metrics are appropriate for a Kubernetes app. You need to look at the unique requirements of each application, not conclude that an app is under-performing just because its latency rate surpasses a certain number, for instance.
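
One way to implement the more dynamic strategy the last two points call for is to compare each new measurement against a rolling baseline rather than a fixed threshold. The sketch below is a minimal, illustrative Python example (not tied to any particular monitoring tool) that flags samples deviating sharply from the recent window:

```python
from collections import deque
from statistics import mean, stdev

class RollingBaseline:
    """Flags samples that deviate sharply from a sliding window of history.

    A static threshold breaks down under Kubernetes autoscaling, so we
    compare each new sample against the recent window instead. The window
    size and z-score threshold here are illustrative defaults.
    """

    def __init__(self, window=60, z_threshold=3.0):
        self.samples = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value):
        """Record a sample; return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= 10:  # need enough history to be meaningful
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                anomalous = True
        self.samples.append(value)
        return anomalous
```

Feeding the detector steady memory readings yields no alerts, while a sudden spike well outside the recent distribution is flagged immediately, without ever hard-coding an absolute limit.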

The bottom line is: Kubernetes is a complex and dynamic platform, which makes it quite difficult in many cases to figure out what's causing application performance problems.

Key performance metrics in Kubernetes

Just because Kubernetes application monitoring is hard doesn't mean it's impossible. What you need is the right strategy, and that starts with knowing which performance metrics to track. Although the exact metrics you monitor may vary depending on which applications you're running, key Kubernetes performance metrics fall into four main categories.

Resource utilization monitoring

First, you have resource utilization metrics, such as CPU utilization, memory consumption and storage consumption.

Tracking resource consumption metrics helps you identify situations where an app suddenly begins consuming significantly more or fewer resources than it typically does, which could indicate a performance issue like a memory leak.

In addition, resource utilization monitoring is critical for ensuring that you know when an app is maxing out available resources. By default, Kubernetes should reschedule apps in this case onto nodes with more resources, assuming such nodes are available. But it won't do that if your workload is pinned to its node – because it's managed by a DaemonSet, for instance – or if a resource limit caps what the app is allowed to consume. In these cases, you'd need to be monitoring your application to determine that a change in your configuration is needed.
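
When you work with resource metrics programmatically, you'll run into Kubernetes' quantity notation – millicores like `250m` for CPU and binary suffixes like `512Mi` for memory. A simplified parser for the common suffixes might look like this (the full Kubernetes quantity grammar is broader, allowing scientific notation among other forms):

```python
# Simplified parser for common Kubernetes resource quantities, as seen
# in pod specs and `kubectl top` output. Illustrative, not exhaustive.

_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def parse_cpu(quantity: str) -> float:
    """Return CPU cores: '250m' -> 0.25, '2' -> 2.0."""
    if quantity.endswith("m"):           # millicores
        return int(quantity[:-1]) / 1000
    return float(quantity)

def parse_memory(quantity: str) -> int:
    """Return bytes: '512Mi' -> 536870912, '1k' -> 1000."""
    # Binary suffixes are checked first so 'Mi' never matches plain 'M'.
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)                 # plain bytes
```

With quantities normalized to plain numbers, comparing an app's actual usage against its configured requests and limits becomes simple arithmetic.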

As an example, this is a snapshot from a groundcover dashboard showing nodes' CPU, memory and disk usage (along with other cluster resources):

source: https://app.groundcover.com

Network performance monitoring

Kubernetes is a distributed environment where applications and various other components of the cluster rely on networks to communicate with each other. Thus, tracking network performance metrics such as latency, throughput and packet loss is essential for identifying network performance problems that could render your applications or microservices unable to exchange data efficiently with each other.

Network monitoring is also important because it can help narrow down the root cause of complex performance problems. For example, if you notice an app that is slow to respond to requests, you'll want to know whether the problem is caused by a spike in latency on the network, or by an internal problem with the app. Tracking network latency gives you the information you'll need to tell the difference.

Kubernetes-native components such as kube-proxy and container network interfaces (CNIs) expose some network performance data, while third-party solutions like groundcover provide deeper visibility into network performance metrics and make it easier to troubleshoot network-related issues in Kubernetes environments.

App response time and latency monitoring

Speaking of application response speed, you'll want to monitor how long it takes for applications to process and respond to user requests (which is response time), as well as how long it takes to transmit data between the components involved in processing a request (which is application latency).

These metrics allow you to determine when your application is not meeting the requirements necessary to deliver a great user experience. They also help you to zero in on performance bottlenecks by determining which specific part of an application or stack is causing slow performance.

You can monitor application response time and latency by leveraging eBPF-based tools such as groundcover, or by using distributed tracing frameworks like OpenTelemetry. Both types of solutions provide insight into the timing and duration of various components' operations within the application stack.
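
Whatever tool collects the data, averages tend to hide the tail latency that users actually feel, so response-time dashboards usually report percentiles. As a minimal illustration, a nearest-rank percentile over a batch of latency samples can be computed like this:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of latency samples (e.g. in milliseconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))  # 1-based rank
    return ordered[rank - 1]

def latency_report(samples):
    """Summarize response-time samples the way a dashboard panel might."""
    return {p: percentile(samples, p) for p in (50, 95, 99)}
```

The gap between p50 and p99 is often the interesting signal: a healthy median with a ballooning p99 typically points at a bottleneck affecting a small but real slice of requests.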

Error rate and failure analysis

By tracking error rates and performing failure analysis, you can detect anomalies, bugs and other problems that, if left unaddressed, will quickly degrade the user experience.

Sometimes, errors happen because applications are simply overwhelmed with requests, in which case scaling the application up will probably improve performance. Other times, you might have an issue with your application itself and the way it handles requests. By analyzing error patterns, you can investigate the root cause of errors and implement appropriate remediation strategies.
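
A simple way to put error-rate tracking into practice is to compute the rate over a sliding window of recent requests and compare it against a threshold. The sketch below is illustrative – the window size and the 1% default threshold are assumptions, not recommendations:

```python
from collections import deque

class ErrorRateMonitor:
    """Tracks recent request outcomes and reports the windowed error rate."""

    def __init__(self, window=1000):
        self.outcomes = deque(maxlen=window)  # True = request errored

    def record(self, is_error: bool):
        self.outcomes.append(is_error)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def breaching(self, threshold=0.01) -> bool:
        """True when the windowed error rate exceeds the threshold."""
        return self.error_rate() > threshold
```

Using a window rather than an all-time counter means a burst of errors surfaces quickly instead of being diluted by hours of healthy traffic.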

Solutions for centralized logging or error monitoring like groundcover can aid in tracking and analyzing error rates in Kubernetes. Contextualizing error data alongside other types of information will help you get to the root cause of high error rates by allowing you to determine whether and how they correlate with other issues, such as unusual changes in resource utilization or poor network performance.

For example, this snapshot from groundcover shows the rate of requests and errors, along with the latency measured for a specific service on the cluster.


Kubernetes monitoring tools and techniques

Now that we know what you should monitor for in Kubernetes, let's look at the tools and techniques available for doing it.

Logging and log analysis

There are plenty of logging tools out there that provide centralized log collection, aggregation and analysis for Kubernetes. Elasticsearch, Fluentd and Kibana (which are known as the EFK stack when used collectively) are a popular option, as are solutions like Loki by Grafana and ClickHouse.

If you run Kubernetes in the cloud using a managed service like GKE or EKS, you may also be able to take advantage of cloud-based logging solutions, such as Google Cloud Logging and AWS CloudWatch, that integrate with your Kubernetes distribution.

No matter which logging solutions you use, your goal should be to ensure that you collect and aggregate all available log data from all application components, then analyze it centrally so that you have full context into how issues revealed by one log correlate with other logs.

This example from groundcover's traces screen shows the benefits of a centralized logging solution correlated with traces:


Custom metrics and instrumentation with OpenTelemetry  

Custom metrics and instrumentation in Kubernetes can be achieved using tools like OpenTelemetry, an open-source observability framework. OpenTelemetry provides libraries and components for instrumenting applications, allowing organizations to capture fine-grained performance data and custom metrics without having to bake extensive custom logging and monitoring logic into each app. Instead, they can expose those metrics through OpenTelemetry libraries.
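
At its core, the pattern OpenTelemetry standardizes comes down to recording into instruments such as counters and histograms from inside your request handlers. The sketch below mimics that pattern using only the standard library – `Meter`, `handle_checkout` and the metric names are illustrative stand-ins, not the real OpenTelemetry API (which you would get from the OpenTelemetry SDK, along with an exporter to ship the data out):

```python
import time
from collections import defaultdict

class Meter:
    """Toy stand-in for a metrics meter: counters plus histograms.

    In a real app the OpenTelemetry SDK provides these instruments and an
    exporter ships their values off; this only shows the instrumentation
    pattern the framework formalizes.
    """

    def __init__(self):
        self.counters = defaultdict(int)
        self.histograms = defaultdict(list)

    def add(self, name, value=1):
        self.counters[name] += value

    def record(self, name, value):
        self.histograms[name].append(value)

def handle_checkout(meter, cart_items):
    """Hypothetical request handler instrumented with custom metrics."""
    start = time.perf_counter()
    total = sum(cart_items)                       # the actual business logic
    meter.add("checkout.requests")                # custom counter
    meter.add("checkout.items", len(cart_items))
    meter.record("checkout.duration_s", time.perf_counter() - start)
    return total
```

The point of the pattern is that the handler stays focused on business logic; the three one-line instrument calls are the only monitoring code the application carries.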

OpenTelemetry also integrates well with groundcover and other monitoring and data visualization tools, making it a versatile choice for capturing and analyzing custom metrics in a standardized manner.

Leveraging Kubernetes APIs for performance monitoring

You can also gain insight into Kubernetes application performance using the APIs provided by Kubernetes itself. The Kubernetes APIs are particularly useful for tracking resource utilization metrics. In addition, you can monitor the status of Pods and containers, which is helpful for understanding whether unusual performance is linked to an issue like a failed container or one that is taking longer than expected to start.

API-based monitoring solutions, such as Prometheus Operator or Kubernetes Metrics Server, enable organizations to collect and visualize Kubernetes-specific metrics directly from the API server.

Although the Kubernetes APIs aren't designed solely for application monitoring and can't provide visibility into internal application issues (such as buggy code), they do provide some useful metrics. More generally, they offer critical contextual information that helps you make informed decisions and identify the root cause of performance problems.
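
For instance, the Metrics Server exposes pod-level usage through the `metrics.k8s.io` API. A sketch of extracting per-pod CPU usage from such a response might look like the following – the sample payload and pod names are illustrative, and only the common CPU suffixes are handled:

```python
import json

def pod_cpu_millicores(payload: str) -> dict:
    """Sum per-pod CPU usage (millicores) from a pod-metrics style
    JSON response. Illustrative: handles 'n', 'm' and plain-core values."""
    usage = {}
    for item in json.loads(payload)["items"]:
        pod = item["metadata"]["name"]
        total = 0
        for container in item["containers"]:
            cpu = container["usage"]["cpu"]        # e.g. "12m", "4327246n", "1"
            if cpu.endswith("n"):                  # nanocores
                total += int(cpu[:-1]) // 1_000_000
            elif cpu.endswith("m"):                # millicores
                total += int(cpu[:-1])
            else:                                  # whole cores
                total += int(cpu) * 1000
        usage[pod] = total
    return usage

# Illustrative sample payload in the PodMetricsList shape.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "web-7d4f"}, "containers": [
        {"name": "app", "usage": {"cpu": "12m", "memory": "340Mi"}},
        {"name": "sidecar", "usage": {"cpu": "3m", "memory": "20Mi"}},
    ]},
]})
```

Summing across containers matters here: a pod's footprint includes its sidecars, and a "slow app" is sometimes a sidecar quietly eating the pod's CPU budget.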

Monitoring strategies for K8s apps

Collecting the right metrics with the right tools is only part of the battle when it comes to Kubernetes application performance monitoring. You'll also want to deploy effective monitoring strategies and techniques.

Define clear monitoring objectives and KPIs

To align monitoring efforts with business goals, it's important to set clear objectives and KPIs that your Kubernetes apps need to hit. For example, you should determine which level of latency is tolerable for a given application based on what you're using it for.

A useful set of principles for establishing performance monitoring objectives is to remember that they should be specific, measurable, achievable, relevant and time-bound (or SMART, as people who like acronyms like to put it). When your metrics have these qualities and are tied to business outcomes, monitoring drives actionable insights.
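
As a worked example, a SMART latency objective might read: "99% of requests in the measurement window complete within 300 ms." Checking compliance against a batch of samples is then a few lines – the target and objective below are illustrative numbers, not recommendations:

```python
def meets_latency_objective(latencies_ms, target_ms=300, objective=0.99):
    """Check a SMART-style objective: at least `objective` fraction of
    requests must complete within `target_ms`. Defaults are illustrative."""
    if not latencies_ms:
        return True  # no traffic in the window, nothing violated
    within = sum(1 for latency in latencies_ms if latency <= target_ms)
    return within / len(latencies_ms) >= objective
```

Framing the objective as a fraction of requests, rather than a hard cap on the worst request, keeps one pathological outlier from paging the on-call engineer.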

Leverage service meshes

Service meshes, which help to manage interactions between microservices within a Kubernetes cluster or other distributed environment, can help to optimize monitoring workflows by providing highly granular, service-by-service level visibility into application performance. You can also use service meshes to perform distributed tracing, which is a useful technique for pinpointing the root cause of performance issues in Kubernetes.

Build monitoring pipelines with Operators

Kubernetes Operators, which provide packaging, deployment and management functionality for Kubernetes apps, can simplify the deployment of monitoring tools (among other types of software) in Kubernetes. For example, you can use the Prometheus Operator to simplify the setup and management of Prometheus as an application and network monitoring tool.

By leveraging Operators, you can deploy the multiple components needed to create a monitoring pipeline more quickly than you could if you set up each tool independently.

Monitor continuously

Rather than pulling performance metrics from Kubernetes applications periodically, strive to monitor continuously wherever possible. Continuous monitoring and performance optimization help to ensure the ongoing health, stability, and optimal performance of Kubernetes applications.

Continuous monitoring is especially important if you also automate response operations. By detecting issues as soon as they occur and then automating the action required to fix them, you can resolve many Kubernetes app performance problems within a fraction of the time it would take a human to recognize and remediate the issue.

For example, imagine that you have an app that experiences a sudden spike in CPU utilization and is maxing out the CPU limit assigned to it. If you wait even just five minutes to detect that issue due to periodic rather than continuous monitoring, your app may well have crashed by the time you can respond. But if you detect the CPU utilization spike in real time and automatically change the CPU limits assigned to the app, you’re more likely to prevent a user-impacting failure.
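
A remediation policy like the one just described can be sketched as a simple rule: if usage has stayed near the limit for several consecutive samples, propose a higher limit. This is a hedged illustration – the thresholds are assumptions, and a real implementation would apply the change through the Kubernetes API with appropriate guardrails (maximum limits, rate limiting, human approval for large jumps):

```python
def propose_cpu_limit(samples_mcores, limit_mcores,
                      high_water=0.9, sustained=3, bump=1.5):
    """Return a proposed new CPU limit (millicores) if usage stayed above
    `high_water` of the limit for the last `sustained` samples, else None.
    All thresholds are illustrative defaults."""
    recent = samples_mcores[-sustained:]
    if len(recent) == sustained and all(
        s >= high_water * limit_mcores for s in recent
    ):
        return int(limit_mcores * bump)
    return None
```

Requiring several consecutive hot samples is the key design choice: it distinguishes a sustained spike worth acting on from a single noisy reading, which matters when the response is automated.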

Keeping Kubernetes apps running swimmingly

In short, monitoring your Kubernetes apps continuously is the only way to ensure that your apps deliver the experience your end-users require while also using resources efficiently. But given the complexity of Kubernetes and microservices apps, the only way to monitor effectively is to collect a variety of data points, then correlate them so that you can get to the root cause of multi-layered performance problems.

Fortunately, a variety of tools are available to help you do this. Solutions like OpenTelemetry simplify the collection of custom metrics from applications, while eBPF-based monitoring tools such as groundcover provide deep visibility into Kubernetes applications, nodes and networks so that you can figure out where the sources of performance issues lie.
