Aviv Zohari, Founding Engineer
10 minutes read, May 30th, 2023

If you've ever supervised a group of rambunctious young children, you know a bit about what it's like to manage alerts on Kubernetes. Like young kids, Kubernetes clusters are constantly in motion and making lots of noise. Also, like kids, not every action or change of state that occurs in Kubernetes requires a response from a responsible adult. Whether or not a given event within your cluster actually warrants an alert depends heavily on context.

For example, a node that is running out of CPU could be a big deal. But it might also not be, depending on factors like how many other nodes you have running and what their CPU utilization metrics look like.

Likewise, a toddler who has changed from a happy state to a crying state because his popsicle is melting probably doesn't warrant your immediate, undivided attention. But if the popsicle juice is spreading across the floor and becoming a tripping hazard for other kids, you probably will want to get ahead of that problem before another kid slips and breaks a bone.

The point here is that for Kubernetes alerts and small children alike, context is everything. If you generate alerts about events that aren't actually problems, you'll end up with too many notifications and may subject your team to "alert fatigue." On the other hand, if you fail to catch critical issues because you didn't configure the proper alerting policies, you'll face even worse problems in the form of a workload or infrastructure failure.

With these challenges in mind, this article breaks down alerting best practices for Kubernetes. We'll go over how alerting in Kubernetes works, which types of events you should alert on, and best practices for navigating the narrow strait between producing too many alerts and missing alerts for important events.

What Is Kubernetes Alerting and Why Is It Important?

Kubernetes alerting is the practice of generating notifications for events or trends in Kubernetes that require admins' attention. Examples of such events and trends include:

  • A node that has failed.
  • A Pod that is stuck in the pending state.
  • A container or Pod that is consuming a high level of resources relative to normal consumption trends.
  • High latency rates for communication between Kubernetes cluster components (such as between kubelet and control plane nodes).

This is only a small set of examples of events that you might want to generate alerts for in Kubernetes. Also, we're not saying that each of these events always requires an alert. As a best practice (and we'll say a lot more about this below), you should factor in context when setting up alerts to ensure that the events you alert on are actually relevant.

Ensuring that you generate alerts when necessary – while also avoiding irrelevant alerts – is important because alerts are your main way of detecting problems within Kubernetes. A single Kubernetes cluster could include dozens, hundreds or even thousands of nodes, Pods, containers and other objects. Realistically speaking, there is no feasible way of keeping track of all of them and identifying potential issues by hand. Although you can use commands like kubectl describe to check the state of various resources manually, that's not a practical means of catching issues when you have a high volume of resources to manage.

So, instead, you can configure alerts to appear on a Kubernetes dashboard, in email, Slack or virtually any other communication channel whenever something noteworthy happens inside your Kubernetes clusters. That way, the team responsible for supporting the cluster will know right away when a potential issue arises.
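If your monitoring stack is Prometheus with Alertmanager (one common choice; other tools expose similar options), routing fired alerts to a channel like Slack is handled in the Alertmanager configuration. Below is a minimal sketch; the webhook URL, channel name and grouping labels are placeholder assumptions, not values from this article.

```yaml
# alertmanager.yml -- minimal routing sketch; webhook URL and channel are placeholders
route:
  receiver: team-slack              # default receiver for every alert
  group_by: ['alertname', 'namespace']
  group_wait: 30s                   # wait before sending the first notification for a new group
  repeat_interval: 4h               # re-notify while the alert keeps firing

receivers:
  - name: team-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder incoming-webhook URL
        channel: '#k8s-alerts'
        send_resolved: true         # also notify when the alert clears
```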

Deploying Alerts from a Kubernetes Monitoring System

Kubernetes itself offers no built-in alerting system or framework. To generate alerts, you'll typically need to collect monitoring data from your Kubernetes cluster, then feed it into an external K8s metrics monitoring tool or platform that can generate alerts based on the data.

There are several key types of metrics that you can check for Kubernetes monitoring and alerting purposes:

 • Node resource metrics: You can monitor CPU, memory and disk utilization at the node level to detect issues such as nodes that are running out of resources.
 • Container resource metrics: Likewise, container metrics like CPU and memory utilization are important for detecting unusual consumption patterns, which could accompany container failures.
 • Application metrics: Tracking application metrics helps detect application performance problems, such as high rates of latency when responding to requests.
 • Control plane metrics: Monitoring metrics for control plane components, such as control plane nodes and etcd, allows you to identify unusual behavior that could impact your cluster as a whole.

Metric type | How to use for alerts
Node resource metrics | Detect anomalous resource utilization trends or exhaustion of node resources.
Container resource metrics | Track utilization across the container lifecycle. Correlate with container status to gain context on status changes.
Application metrics | Detect application performance problems.
Control plane metrics | Detect control plane component failures.

As we explain below, metric data alone shouldn't be the basis for Kubernetes alerts. You should correlate metrics with events to control when an alert actually fires. In most cases, it's only when a certain event is accompanied by a certain metrics condition (such as unusual CPU or memory utilization that has lasted for a given period of time) that you have a problem worthy of an alert.
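For example, here is a minimal Prometheus alerting-rule sketch that fires only when a metric condition has persisted, rather than on a momentary spike. It assumes node-exporter metric names (node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes); adapt it to whatever your monitoring tool collects.

```yaml
groups:
  - name: resource-pressure
    rules:
      - alert: NodeMemoryPressureSustained
        # Assumes node-exporter metrics; fires only if available memory
        # stays below 10% of total for 15 consecutive minutes.
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} has been low on memory for 15 minutes"
```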

Alert Configuration

Collecting data is only the first step in Kubernetes alert management. You also have to configure alerting rules in your Kubernetes monitoring tool, which is where the actual alerts originate. From there, they are pushed to contacts or channels of your choosing.

The exact method for configuring alerts will vary depending on the tool, but in general, alert configuration involves writing policies that define the conditions under which your alerting tool should fire an alert. For example, you could write policies to the effect of:

 • "If a node's CPU utilization surpasses 90 percent for at least 60 seconds, generate an alert".

 • "If a Pod is stuck in the pending state and the Pod's label includes the value 'production' and the Pod has already restarted at least once, generate an alert".

Here again, these are just a couple of basic examples. Most alerting tools that you can use in conjunction with Kubernetes are flexible enough to generate alerts for virtually any condition you can imagine. They also support highly contextual alerting, allowing you to base alerting rules on multiple conditions.
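As an illustration, the first policy above could be expressed as a Prometheus alerting rule roughly like the following sketch. It assumes node-exporter's node_cpu_seconds_total metric; the exact expression depends on your setup.

```yaml
groups:
  - name: example-policies
    rules:
      # "If a node's CPU utilization surpasses 90 percent for at least 60 seconds, generate an alert"
      - alert: NodeCpuSaturated
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CPU on {{ $labels.instance }} has been above 90% for over a minute"
```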

What Should Kubernetes Alerts Cover?

Type of resource | What to alert for | Data sources to drive alerting
Control plane | Failure or performance degradation of control plane components. | Cluster metrics; control plane node logs and metrics.
Nodes | Hosts going down; very high resource utilization rates. | Node logs; node utilization metrics.
Deployments, Pods & Applications | Workload unavailability or performance degradation; problematic Pod and container states or exit codes. | Pod and container status; application metrics; application logs.

Kubernetes alerts should extend to every component of your cluster, with the possible exception of testing workloads whose failure you may not care about. (We say "possible" here because in many cases, admins will want to know about problems related to testing workloads, although they may not prioritize them as much as they do problems with production workloads.)

At a high level, the types of resources you'd want to generate alerts for fall into three categories. Let's explore each one.

Kubernetes Control Plane

When problems occur with any component of your Kubernetes control plane – such as the etcd key-value store or the API server – you'll want an alert so that your team can take action. These components are the foundation of Kubernetes, and a failure or performance degradation at the cluster level can cause problems for everything running on the cluster.

You can use cluster-level metric data to track the status of control plane resources. Log and metric data from control plane nodes can help, too.
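As a minimal sketch, a rule like the following would flag an unreachable API server target in Prometheus. The job label value ("apiserver") is an assumption that depends on your scrape configuration.

```yaml
groups:
  - name: control-plane
    rules:
      - alert: KubeApiServerDown
        # The job label depends on your scrape config; "apiserver" is a common default.
        expr: up{job="apiserver"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "A Kubernetes API server target has been unreachable for 2 minutes"
```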

Nodes

Nodes are also a foundational resource for your cluster. An individual node failure may not be a critical problem, but if enough nodes crash, your cluster may lack the infrastructure resources necessary to support workloads. In addition, unexpected node behavior could be an early sign of a larger problem, like hardware issues with your servers, that you'll want to address before it brings down your entire cluster.

To detect problems with nodes, monitor node logs, as well as node resource utilization metrics.

Deployments, Pods and Applications

Deployments, Pods and applications are the actual workloads that you run on Kubernetes. As we mentioned above, some of these may be testing workloads while others are mission-critical production workloads.

Either way, you'll typically at least want to know when something goes wrong – when a Pod fails to start properly, for example, or when an application's CPU or memory utilization spikes unexpectedly. Although problems with deployments, Pods and applications are usually limited to the specific workload where they originate, you still want to keep those workloads running smoothly, and the only way to do that is to alert your team to problems with the workloads.

Sources of insight into the status of deployments, Pods and applications vary depending on the exact nature of your workloads. But in general, you'll want to monitor Pod and container status, as well as any available application logs and metrics.
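For instance, a sketch like the one below (assuming kube-state-metrics is installed) would flag Pods whose containers keep restarting, which is often the first visible symptom of a crashing workload:

```yaml
groups:
  - name: workloads
    rules:
      - alert: PodRestartingFrequently
        # Assumes kube-state-metrics; flags containers that restarted
        # more than 3 times over the last 15 minutes.
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```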

Kubernetes Alert Management Best Practices Overview

Now that we've gone over the essentials of how Kubernetes alerts work, let's talk about best practices for getting the most value out of them.

As we've indicated, your core goal when it comes to K8s alerts should be to ensure that your tools generate alerts for events that require attention, while at the same time avoiding unnecessary or redundant alerts. To strike this balance, the following general best practices are crucial.

Symptoms-Based Alerts

You should configure alerting rules such that your tools generate notifications when a certain "symptom" occurs with a given resource. A symptom is a status that is undesirable – such as a node failure or a Pod that has failed to launch.

Symptom-based alerting is different from generating alerts based on metrics alone. You can use metrics to contextualize symptom-based alerting; for example, you could configure a rule to generate an alert when a Pod is stuck in the pending state and the Pod's CPU utilization is below 20 percent (which probably means the Pod is hanging, as opposed to taking a long time to start due to a CPU-intensive startup process). However, if you alert based on simple metrics, you’re likely to end up with lots of alerts that aren't important, because not every change in CPU or memory utilization actually matters.
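To make this concrete, here is a minimal symptom-based rule sketch, assuming kube-state-metrics: it fires on the undesirable state itself (a Pod stuck in Pending), and the duration clause keeps brief scheduling delays from triggering it.

```yaml
groups:
  - name: symptoms
    rules:
      - alert: PodStuckPending
        # Symptom: the Pod has stayed in the Pending phase for 10 minutes.
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} has been Pending for 10 minutes"
```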

Actionable Alerts

Your alerts should also be actionable, meaning that K8s admins can take action in response. You don't want to generate alerts for events like a container successfully starting, because there is nothing that anyone can do in response to that event.

You'd probably want to log the event somewhere just so you have a record of when your container started. But you don't need to alert anyone about it – and if you do, you add to the meaningless noise of your alerting system, making it harder for your team to identify issues that actually matter.

Kubernetes Alerts Best Practices

Now, let's dive into best practices for managing specific types of alerts in Kubernetes. This isn't a comprehensive guide – we can't cover every imaginable alert in a single article – but the following sections cover the core alerts that most teams will need to manage in a Kubernetes environment.

Disk Usage

Knowing when your disks are running out of space is important. However, that doesn't mean you should necessarily fire an alert every time disk usage exceeds a fixed threshold. In some cases, disk usage might increase temporarily before going back down to a normal rate; for example, maybe your log files are taking up more disk than usual because they haven't been rotated, but once the rotation happens, things go back to normal.

To generate effective alerts about Kubernetes disk usage, then, your alerting rules should factor in how long disk usage has exceeded a certain threshold, as well as the usage level itself.

Consider as well contextualizing disk usage alerting rules based on how many stateful applications you have, since more stateful apps generally means that preserving spare disk space is more of a priority.
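A sketch of such a rule, assuming node-exporter filesystem metrics, might look like this: it only fires if usage stays above 85 percent for half an hour, so a temporary spike before log rotation won't page anyone.

```yaml
groups:
  - name: disk
    rules:
      - alert: NodeDiskFillingUp
        # Assumes node-exporter metrics; skips tmpfs/overlay mounts and
        # requires the condition to hold for 30 minutes before firing.
        expr: |
          (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.85
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is over 85% full"
```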

Problematic Container or Pod Status

Containers and Pods can exist in a variety of states in Kubernetes. Many of them are routine, but certain states can be signs of an issue. If a Pod gets stuck in Pending, for example, it means that it's not starting properly for some reason. Likewise, CrashLoopBackOff, ImagePullBackOff and OOMKilled are all states or codes that typically signal that something has gone wrong.

Most Kubernetes monitoring tools can detect these statuses or exit codes automatically, then generate alerts based on them. Just keep in mind, however, that some nuance is necessary for avoiding unnecessary alerts here.

For example, exit code 143 is typically not an issue, but it can be under certain conditions, such as a container that is consuming too much memory. For this reason, you'd want to configure alerting rules that generate notifications when exit code 143 is observed and a container's memory usage has spiked.
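As one sketch of this kind of correlation, the rule below (assuming a recent kube-state-metrics version that exposes kube_pod_container_status_last_terminated_reason) fires when a container's most recent termination was an out-of-memory kill and the container has restarted recently; you could build a similar expression around other termination reasons or exit codes your tooling exposes.

```yaml
groups:
  - name: container-status
    rules:
      - alert: ContainerRecentlyOomKilled
        # Assumes kube-state-metrics; correlates the last termination reason
        # with a recent restart so stale terminations don't keep alerting.
        expr: |
          kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
          and on (namespace, pod, container)
          increase(kube_pod_container_status_restarts_total[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container }} was OOM-killed recently"
```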

Network Connectivity

The main types of network events that you should alert for in Kubernetes are failure to communicate and high latency. But to avoid generating alerts every time a slight networking blip occurs, you should factor the duration and scope of network problems into your alerting rules.

For example, you may want to wait ten or twenty seconds to allow a node to come back online before you generate an alert. Likewise, you may only want to alert about a network latency problem if the problem persists for more than several seconds and/or if it impacts more than just one resource. Otherwise, you may end up drowning in notifications about small dips in network performance that have resolved themselves by the time your team can investigate.
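For application-level latency, a sketch like the following works if the application exposes a Prometheus request-duration histogram; the metric name http_request_duration_seconds, the service label and the 500 ms threshold are placeholder assumptions.

```yaml
groups:
  - name: network
    rules:
      - alert: HighRequestLatency
        # Fires only if p95 latency stays above 500ms for 5 minutes,
        # so short blips resolve themselves without paging anyone.
        expr: |
          histogram_quantile(0.95,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency for {{ $labels.service }} has exceeded 500ms for 5 minutes"
```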

Node Failure

In general, a node failure – meaning the node goes down and can no longer communicate with the control plane – is something you'll want to be alerted about. But there are exceptions:

 1. If the node is down for a short period but then reconnects, it could be that the node didn't actually fail, but that you experienced a network connectivity problem instead.

 2. If your cluster has plenty of spare nodes, the failure of a single node may not require action.

 3. If the node is a control-plane node but other control-plane nodes remain up, the failure may not require action.

 4. If an admin shut the node down manually, the failure will not require action.

Thus, your node failure alerting rules should factor in conditions like overall node count and cluster resource utilization to ensure that you don't generate alerts for node failures that aren't actionable or relevant.
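As a sketch (assuming kube-state-metrics), the pair of rules below reflects those exceptions: the first waits several minutes so brief connectivity blips don't fire, and the second escalates only when a meaningful share of nodes is down at once.

```yaml
groups:
  - name: nodes
    rules:
      - alert: NodeNotReady
        # Waits 5 minutes so a brief network blip (exception 1) doesn't page anyone.
        expr: kube_node_status_condition{condition="Ready", status="true"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} has been NotReady for 5 minutes"

      - alert: ManyNodesNotReady
        # Escalates only when more than 20% of nodes are NotReady at the same time,
        # addressing the "plenty of spare nodes" exception.
        expr: |
          count(kube_node_status_condition{condition="Ready", status="true"} == 0)
            / count(kube_node_status_condition{condition="Ready", status="true"}) > 0.2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 20% of cluster nodes are NotReady"
```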

Getting More from Kubernetes Alerts

Kubernetes alerts are a good thing. But it's possible to have too much of a good thing, and alerts in K8s are no exception. To strike a healthy balance – which means generating alerts when it matters but avoiding extraneous notifications that distract your team – you should configure alerting policies that correlate data sources to provide nuanced, context-aware notifications.
