Kubernetes has done to software monitoring and management what modern automotive technologies did to car repair: It has made it much harder to troubleshoot problems.
Just as DIY mechanics often complain that modern cars are too complicated to maintain and fix, the complexity of Kubernetes often makes workload performance problems much more difficult to troubleshoot.
Now, that doesn't mean you should avoid Kubernetes and stick with legacy application architectures and hosting technologies. (If you want to keep your 1986 Volvo so you can avoid the challenges of maintaining a modern car, though, we’re all for it.) But it does mean that anyone who deploys applications on Kubernetes must understand the intricacies of Kubernetes troubleshooting and Kubernetes cluster management.
Below, we break down the essentials of Kubernetes troubleshooting, as well as how to respond to common Kubernetes error codes. We'll also discuss troubleshooting strategies for different components of Kubernetes, such as control plane nodes, worker nodes, Pods and beyond.
What Is Kubernetes Troubleshooting?
Kubernetes troubleshooting is the process of detecting and remediating any type of performance issue that arises within a Kubernetes environment. Common performance problems that you might encounter on Kubernetes include:
- Containers or Pods that fail to start.
- Containers or Pods that take a long time to start.
- Applications that are slow to respond to requests.
- Applications that can't interface with the network properly.
- The unexpected crash of a container or Pod.
- Pods being placed on the wrong nodes, leading to insufficient availability of resources.
- Slow application performance due to poor choice of resource limits.
This is only a partial list. In a production Kubernetes environment, you could run into any number of potential performance issues that you'll need to identify and troubleshoot to avoid disruptions to your end-users.
Kubernetes Troubleshooting Challenges
Kubernetes troubleshooting would be easy if Kubernetes were a straightforward, uncomplicated system.
Alas, it's not. Kubernetes is a very complex platform that includes a variety of distinct components – an API server, an etcd key-value store, control plane nodes, worker nodes, Pods, various network resources and more. These components interact with each other in complicated ways, with the result that it's often not obvious what the root cause of a performance issue is based simply on the surface-level manifestation of the issue.
For example, imagine you're troubleshooting an application hosted on Kubernetes that is experiencing high latency. You know the latency of the app, but that information alone tells you little about the root cause of the problem, which could be any of the following:
- Congestion on the network.
- A problematic configuration with the networking plugin you're using.
- Buggy code in the application that’s causing it not to respond to requests quickly.
- Insufficient resources for the Pod that hosts the app, causing slow performance.
We could go on, but the point is this: Kubernetes troubleshooting is tricky because there are so many potential root causes to sort through, and so many individual resources that you need to monitor and observe to maintain the visibility necessary to trace problems to their root cause.
Common Kubernetes Errors and Their Fixes
Fortunately, Kubernetes doesn't leave you totally in the dark when you encounter a performance issue. It provides various error codes, which in many cases are the best piece of information available for figuring out what triggered a problem.
Here's a list of the most common error codes on Kubernetes.
Exit code 1
Exit code 1 in Kubernetes means that a container terminated due to an application error. This typically indicates a problem with the container image, or with code inside the image.
If you encounter exit code 1, try running the container directly from the command line, instead of deploying it in Kubernetes, to verify that it starts properly. If that works, ensure that the image you're pointing to in your Kubernetes deployment is not corrupted.
Exit code 125
Exit code 125 means that a container failed to run because the command that Kubernetes tried to use to run it didn't execute successfully.
If you see this error code, check the commands inside your container image for typos or undefined arguments or flags. You should also ensure that permissions settings are configured properly.
Exit code 143
Exit code 143 happens when a container receives the SIGTERM signal. This is a signal sent by the operating system that tells a container to shut down.
Exit code 143 often does not indicate a problem at all; in many cases, it simply means that the orchestration engine asked the container to shut down for a legitimate reason. But if your containers keep shutting down with code 143 when they shouldn't, look at the kubelet logs to see what the source of the SIGTERM request was.
Exit code 139
Exit code 139 means that a container was terminated by the SIGSEGV signal, which is sent by the operating system on the container's host node.
On Linux and other Unix-like operating systems, SIGSEGV is sent when a process tries to access memory that either does not exist or that the process lacks permission to access. This is known as a segmentation fault, or "segfault" for short.
When a container receives SIGSEGV, it generally terminates. That is less than ideal, since you normally want containers to keep running unless you deliberately shut them down. But the alternative would be worse: a process that reads or writes memory it doesn't own could corrupt other processes or bring down the entire server. Picture all the dogs in a neighborhood rushing into a single yard to brawl. The result is utter chaos, because no container could safely rely on its memory.
Hence, SIGSEGV serves as a preventive measure, intended to avert a much larger crisis.
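Exit codes above 128 follow a common convention: the value is 128 plus the number of the signal that ended the process. That is why SIGTERM (signal 15) shows up as 143 and SIGSEGV (signal 11) as 139. Here is a minimal sketch of that arithmetic as a shell helper. The function name explain_exit_code is made up for illustration and is not part of kubectl.

```shell
# Hypothetical helper (not part of kubectl): explain a container exit code.
# Assumes the common convention that codes above 128 mean "128 + signal number".
explain_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name="signal $sig" ;;
    esac
    echo "exit code $code: terminated by $name (128 + $sig)"
  else
    echo "exit code $code: not a signal; set by the application or the container runtime"
  fi
}
```

To find a code to decode, kubectl get pod pod-name -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}' prints the exit code from each container's previous run, if there was one.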
CrashLoopBackOff
CrashLoopBackOff, a common but solvable problem in Kubernetes, occurs when a container repeatedly crashes or fails to start. Kubernetes automatically keeps trying to restart the container, with increasingly long intervals between restart attempts: the delay doubles after each failure until it reaches a cap of five minutes. Kubernetes does not give up on the container. The "back-off" in the name refers to this growing delay, and the Pod stays in the CrashLoopBackOff state until the container starts and keeps running.
CrashLoopBackOff events can occur for a variety of reasons, and they are therefore one of the more difficult problems to troubleshoot in Kubernetes. But to resolve them, start by checking for the most obvious causes of CrashLoopBackOff, which include lack of sufficient resources, broken deployment configurations and problems with the application or image you’re trying to run.
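To get a feel for how the restart intervals grow, here is a rough sketch of the schedule described in the Kubernetes documentation: the delay starts at 10 seconds, doubles after each failed attempt and is capped at 300 seconds. The function name crashloop_delays is our own, and the sketch only reproduces the arithmetic, not the kubelet's actual implementation.

```shell
# Sketch of the kubelet's restart back-off: 10s, doubling, capped at 300s.
# crashloop_delays is a hypothetical name; it prints one delay (in seconds) per attempt.
crashloop_delays() {
  attempts=$1
  delay=10
  i=0
  while [ "$i" -lt "$attempts" ]; do
    echo "$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$((i + 1))
  done
}
```

While a Pod is backing off, kubectl logs pod-name --previous shows the output of the most recent crashed container, which is often the quickest route to the cause.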
ImagePullBackOff
An ImagePullBackOff error means that Kubernetes couldn't pull the image for a container. As with CrashLoopBackOff, Kubernetes keeps retrying the pull if it fails on the first attempt, waiting longer between attempts each time.
ImagePullBackOff usually happens because your deployment configuration doesn't point to the right image registry, path or tag, because the Pod lacks the credentials needed to pull from a private registry, or because there is an issue (like lack of network connectivity) with your registry.
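A quick first check is to print exactly which image each container is configured to pull, so you can compare it against what exists in your registry. The sketch below wraps a kubectl jsonpath query in a helper. The name pod_images and the default namespace fallback are assumptions made for the example.

```shell
# Hypothetical helper: print the image each container in a Pod is configured to pull.
# Usage: pod_images <pod-name> [namespace]
pod_images() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
}
```

If the image reference looks right, the Events section at the bottom of the kubectl describe pods output usually records why the pull failed.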
Node Not Ready
The Node Not Ready error appears when a node in your Kubernetes cluster fails to reach the "ready" state, which is the state it needs to be in to host workloads. This typically happens because of insufficient resources on the node or an issue starting the kubelet agent on the node.
The best way to troubleshoot this type of node status issue is to check the kubelet logs of affected worker nodes, as well as any operating system logs on the node, for information about why the node is failing to achieve a ready state.
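To find the affected nodes in the first place, you can filter the output of kubectl get nodes on its STATUS column. This is a small sketch; not_ready_nodes is a made-up helper name, and the filter assumes the default column layout where STATUS is the second column.

```shell
# Hypothetical helper: list nodes whose STATUS column does not start with "Ready".
# A cordoned but healthy node shows "Ready,SchedulingDisabled" and is not listed.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}
```

From there, kubectl describe node followed by the node name shows the node's conditions and recent events. On the node itself, journalctl -u kubelet shows the kubelet logs, assuming the kubelet runs as a systemd service.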
CreateContainerConfigError
CreateContainerConfigError means that Kubernetes could not generate the configuration for a container, so the container never moved from a pending state to a running state. This is usually due to missing information in the deployment configuration for the container, such as a reference to a Secret or ConfigMap that doesn't exist.
OOMKilled
A Kubernetes OOMKilled error indicates that a container was shut down because it was using more memory than allowed. To troubleshoot this error, check whether any memory limits are in place for the container and whether they are appropriate for the container's requirements.
You should also check which Quality of Service (QoS) class the Pod in question belongs to. Kubernetes assigns each Pod one of three QoS classes (Guaranteed, Burstable or BestEffort) based on the resource requests and limits of its containers. If a node runs short on memory, Kubernetes generally terminates BestEffort Pods first, then Burstable Pods, and Guaranteed Pods last. To place a critical Pod in the Guaranteed class, give every container in it CPU and memory limits that are equal to its requests.
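Both pieces of information, the reason a container last terminated and the Pod's QoS class, are recorded in the Pod's status. The sketch below reads them with two kubectl jsonpath queries. The helper name pod_memory_report is an assumption for illustration.

```shell
# Hypothetical helper: show why each container last terminated, plus the Pod's QoS class.
# Usage: pod_memory_report <pod-name> [namespace]
pod_memory_report() {
  pod=$1
  ns=${2:-default}
  # Last termination reason per container; "OOMKilled" points to a memory problem.
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
  # QoS class that Kubernetes derived from the Pod's requests and limits.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.status.qosClass}{"\n"}'
}
```

A container that has never restarted has no previous termination, so its reason column is simply empty.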
Troubleshooting Kubernetes Clusters
Now that we've discussed how to troubleshoot specific Kubernetes errors, let's talk about how to troubleshoot issues that affect different components of Kubernetes, starting with the cluster as a whole.
If you're experiencing a performance issue across your entire cluster, as opposed to an individual node or Pod, the likeliest cause is a problem with your control plane. Check the logs on the control plane node or nodes for any unexpected events, such as a network connectivity issue.
You should also ensure that the size of your cluster and the overall resource availability is sufficient for your workloads. If your workloads have experienced a surge in requests, or if the number of existing Pods exceeds what your nodes can support, you might need to allocate more nodes, or change the resource allocations of existing nodes (if they are VMs with variable allocations or configurations), to deliver the resources your cluster needs to run reliably.
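As a starting point for a cluster-wide check, the commands below give a first look at API server readiness, node status, control plane Pods, recent events and resource usage. Treat this as a sketch under some assumptions: cluster_overview is a made-up name, control plane Pods are only visible in kube-system on self-managed clusters such as those built with kubeadm, and kubectl top requires metrics-server to be installed.

```shell
# Hypothetical helper: a first look at cluster-wide health.
cluster_overview() {
  # API server readiness checks, one line per check.
  kubectl get --raw='/readyz?verbose'
  # Node status at a glance.
  kubectl get nodes -o wide
  # Control plane Pods (assumes they run in kube-system, as on kubeadm clusters).
  kubectl get pods -n kube-system
  # Events across all namespaces, sorted so the newest come last.
  kubectl get events --all-namespaces --sort-by=.lastTimestamp
  # Per-node CPU and memory usage (assumes metrics-server is installed).
  kubectl top nodes
}
```

On managed Kubernetes services, the control plane is typically operated by the provider, so control plane logs have to come from the provider's own tooling instead.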
Troubleshooting Kubernetes Pods
To troubleshoot issues that affect specific Pods, such as frequent crashes, start by running the kubectl describe pod command, which uses this syntax:
kubectl describe pods pod-name
The output will include information about the status of the Pod, including its conditions. Ideally, the Pod's Ready condition will be True, which means it's operating normally. But if the PodHasNetwork condition (renamed PodReadyToStartContainers in newer Kubernetes releases) is True while Ready remains False, the Pod is connected to the network but has not yet started all of its containers. That's an indication that there is probably an issue getting one or more containers in the Pod up and running.
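To see these status values without the rest of the describe output, you can print them directly from the Pod's status. A minimal sketch, where pod_conditions is a hypothetical helper name and the namespace defaults to default:

```shell
# Hypothetical helper: print each Pod condition as TYPE=STATUS, one per line.
# Usage: pod_conditions <pod-name> [namespace]
pod_conditions() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}
```

Which condition types appear depends on your Kubernetes version, so compare the list against the documentation for the release you run.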
Any logs or metrics that you can collect from the Pod are also valuable for troubleshooting. Kubernetes doesn't aggregate or retain logs for the long term, but the kubectl logs command retrieves whatever a Pod's containers have written to standard output and standard error. Containers in the Pod may also be configured to write log files or export them to a logging tool, and you can use an observability tool to monitor the resource consumption of Pods.
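For containers that write to standard output or standard error, kubectl logs can fetch that output directly, including the output of the previous run if the container has restarted. The helper below is a sketch; pod_logs is a made-up name and the 100-line tail is an arbitrary choice.

```shell
# Hypothetical helper: recent logs from one container, current and previous run.
# Usage: pod_logs <pod-name> <container-name> [namespace]
pod_logs() {
  pod=$1
  container=$2
  ns=${3:-default}
  printf '%s\n' '--- current run ---'
  kubectl logs "$pod" -c "$container" -n "$ns" --tail=100
  printf '%s\n' '--- previous run (only if the container restarted) ---'
  # kubectl returns an error when there is no previous run; ignore it here.
  kubectl logs "$pod" -c "$container" -n "$ns" --previous --tail=100 || true
}
```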
When troubleshooting Pods, it may also help to try starting the containers inside the Pod directly from the command line. If they all start successfully, this rules out problems with the container images themselves.
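In some cases, simply deleting and redeploying a Pod may fix unusual issues. You can delete a Pod with the command kubectl delete pods, followed by the name of the Pod. If the Pod is managed by a controller such as a Deployment, Kubernetes creates a replacement automatically.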
If you've configured a Deployment, ReplicaSet or replication controller to run many copies of a Pod, that could create issues in situations where there aren't enough nodes to support the defined number of replicas. In that case, either lower the replica count or add nodes to the cluster.
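Finally, if you discover that Pods running on a certain node keep crashing, you should take that node out of service. Drain it first with the command kubectl drain so that its Pods are rescheduled onto other nodes, then remove it from your cluster with the command kubectl delete node, followed by the name of the node.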
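Deleting a node outright leaves its Pods to be cleaned up after the fact, so it is usually gentler to cordon and drain the node first. Here is one way that sequence might look. The helper name retire_node is our own, and the drain flags shown are a common choice rather than the only valid one.

```shell
# Hypothetical helper: take a misbehaving node out of service step by step.
# Usage: retire_node <node-name>
retire_node() {
  node=$1
  # Stop new Pods from being scheduled onto the node.
  kubectl cordon "$node" || return 1
  # Evict running Pods so their controllers recreate them elsewhere.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
  # Remove the node object from the cluster.
  kubectl delete node "$node"
}
```

The function stops at the first failing step, so a drain that cannot complete does not lead to the node being deleted anyway.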
Strategies for Troubleshooting Kubernetes Issues Effectively
We wish we could reveal the "one simple trick" that will make it super-easy to solve all of your Kubernetes troubleshooting woes. But the complexity of Kubernetes means that fixing issues is never that simple. Every problem is different, and you often need to be creative about your approach to Kubernetes troubleshooting to solve strange issues.
That said, you can streamline your approach to Kubernetes troubleshooting by following these guidelines:
- Define the scope of the problem: Before you do anything else, figure out how many resources are affected by the issue. Is it just a single Pod or node, or are you seeing unusual activity across large parts of your cluster? The scope of the problem helps you determine whether the root cause is linked to a component that affects the entire cluster, or just a specific workload or node.
- Look for error codes: Again, any error codes you can pull from Kubernetes itself are often your best starting-point for identifying likely root causes of problems.
- Identify data sources: The logs and metrics available for troubleshooting Kubernetes can vary widely depending on which observability software you've deployed and which logging options you've configured. Determine which data is available to you, since that information will play a central role in shaping your troubleshooting options.
- Take advantage of generic mitigations: A generic mitigation is an action like redeploying a Pod or allocating more CPU or memory to a container. It doesn't resolve any underlying problems with your workloads or Kubernetes clusters, but it sometimes gets things working again. Although you should still endeavor to figure out what the root cause of a performance problem in Kubernetes is so it doesn't keep coming back, generic mitigations can at least get applications back up and running for your users.
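For the first guideline, defining the scope, a quick count of unhealthy Pods per namespace shows whether a problem is isolated or widespread. The sketch below filters on the STATUS column of kubectl get pods. The helper name unhealthy_pods_by_namespace is made up, and the filter assumes the default column layout where STATUS is the fourth column.

```shell
# Hypothetical helper: count Pods whose STATUS is neither Running nor Completed,
# grouped by namespace. Prints "<namespace> <count>" lines in sorted order.
unhealthy_pods_by_namespace() {
  kubectl get pods --all-namespaces --no-headers \
    | awk '$4 != "Running" && $4 != "Completed" { count[$1]++ }
           END { for (ns in count) print ns, count[ns] }' \
    | sort
}
```

One namespace with a handful of failing Pods suggests a workload-level cause, while failures spread across many namespaces point toward a node or cluster-level component.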
Tools and Techniques for Efficient Kubernetes Troubleshooting
The massive popularity of Kubernetes means that there is no shortage of troubleshooting tools and resources available. In general, the tools fall into two main categories:
- Troubleshooting tools that help monitor and fix performance issues with specific components of Kubernetes. For example, netshoot helps fix network-related problems. The tooling built into kubectl for describing Pods, nodes and so on also fits in this category, since it provides basic data about specific types of Kubernetes components that is useful when troubleshooting.
- End-to-end observability and troubleshooting platforms, like groundcover, that continuously monitor all components of your Kubernetes cluster and provide the data necessary to contextualize complex problems.
The first set of tools is useful if you experience an issue with a narrow scope and need to trace its root source. But for complex issues whose scope and root cause are not at all obvious from the surface, a holistic observability solution is usually your best bet for getting to the root of the problem.
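As an example of the first category, netshoot can be attached to a running Pod as an ephemeral debug container, which gives you its networking tools inside the Pod's network namespace. This is a sketch under assumptions: net_debug is a made-up helper name, nicolaka/netshoot is the image published by the netshoot project, and your cluster must support ephemeral containers.

```shell
# Hypothetical helper: attach a netshoot ephemeral container to a running Pod.
# Usage: net_debug <pod-name> <target-container> [namespace]
net_debug() {
  kubectl debug -it "$1" -n "${3:-default}" \
    --image=nicolaka/netshoot --target="$2"
}
```

The ephemeral container shares the Pod's network, so the connectivity you test from it matches what the application container sees.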
By the same token, complex issues typically require a broad troubleshooting technique that draws on as much of the available data as possible. The more data you have about each element in Kubernetes, the better positioned you are to associate unusual performance in one component with anomalies from other components, and to rule out different potential root causes of an issue.
Getting to the root cause isn’t always easy
Kubernetes is a very complex system, which often makes it difficult to get to the root cause of performance problems. However, Kubernetes error codes offer a good starting point for investigating many types of problems. You should also draw on logs, metrics and any other observability data sources available to you to help pinpoint the main cause of an issue.