Context is everything: Using K8s events to boost APM insights

Tracking Kubernetes events can help you make better decisions based on the symptoms your application monitoring detects.

“The hardest part of solving a difficult problem is knowing where to start.”

That simple sentence has been on my mind (without my knowing it) throughout my years as a developer, and I’m sure it’s on the minds of many other developers too.

You have probably heard it in many different forms if you’ve ever shopped for a monitoring or troubleshooting tool. Phrases like “Find the root cause!” and “Reduce time to resolution!” are especially common, and they capture its exact essence.

Knowing where to start really is the hardest part of troubleshooting a difficult problem in production. Difficult problems tend to combine an invisible race to resolve them quickly with endless possibilities for where they could originate.

Is it a new code version I (or someone else on my team) recently deployed? Is it some crash or exception in my application’s code that I failed to track? Or perhaps it’s even a resource issue, like Out Of Memory (OOM) or a CPU quota being hit?

Kubernetes: the good, the bad and the ugly

So many good things to be said about Kubernetes. Yes, Kubernetes is a container orchestrator, but it really is so much more. It’s inherently infrastructure agnostic, so you can use Kubernetes to abstract the infrastructure away from higher-level services and applications. Not only does this make your applications a lot more portable, but it can also boost the velocity of your team, which can now deploy at scale without worrying about all the “boring” details.

Ops teams can provide developers with a platform to run their stacks and applications in an infrastructure-independent manner, while dev teams, on the other hand, can focus on what they do best: writing software.

The abstraction introduced by Kubernetes is truly amazing. Any developer can run a simple application at varying scales without worrying about resource allocation, auto-scaling, self-healing, and so much more.

So where's the catch? Abstraction can also be a source of problems and confusion if mishandled. The simplest example is Kubernetes’ self-healing mechanism, one of its greatest benefits: if a containerized application goes down, Kubernetes will instantly redeploy it, taking it back to its desired state.

So if your application crashes mysteriously every 20 hours, you might never know about it. Ever. Kubernetes will make sure it is up and running 99.9% of the time, which might be enough for you to never notice anything ever went wrong.

However, there might be cases where you would notice the collateral effects of behaviors Kubernetes abstracts away so elegantly. Assume that your crashing application is a customer-facing web server. You might care about the one customer out of a thousand who experiences issues while your application is down, even though on average things look great.
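If you want to check whether self-healing is quietly hiding crashes from you, restart counts are the quickest tell. Here’s a minimal sketch using the official Kubernetes Python client; the kubeconfig setup and the “production” namespace are assumptions for illustration:

```python
# Minimal sketch: surface pods whose containers Kubernetes has silently
# restarted. Assumes the official client (`pip install kubernetes`)
# and a reachable cluster via a local kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

# "production" is an example namespace, not from the original article.
for pod in v1.list_namespaced_pod(namespace="production").items:
    for cs in pod.status.container_statuses or []:
        if cs.restart_count > 0:
            # Every restart is a crash (or failed liveness probe) that
            # Kubernetes quietly healed behind the abstraction.
            print(f"{pod.metadata.name}/{cs.name}: {cs.restart_count} restart(s)")
```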

So, suppose there is a symptom you care about, like a measured error rate on one of your applications. How would you know where to start? There could be many possible problems hidden behind Kubernetes’ abstraction layers that could explain what is happening.

Cut through the abstraction by tracking Kubernetes events

APM does a great job of producing metrics you can use to detect symptoms that may indicate something is happening which you’re not fully aware of. Among these metrics, the four golden signals (latency, traffic, errors, and saturation) are highly dominant.

A rise in error rate on a specific service, or a traffic volume drop on another would trouble any experienced engineer.

Yet application monitoring is dominated by detecting symptoms that might indicate a problem but are not highly actionable in solving it quickly. A rise in service latency can indicate a problem, but it’s only a symptom. The underlying problem could be high CPU utilization at the service or even node level, a specific API experiencing issues under a new, untested use case, misbehaving auto-scaling, or even another service down the line experiencing trouble that causes latency to trickle upstream.

Tracking Kubernetes events can help you make better decisions based on the symptoms your application monitoring detects. Context is everything. The same symptom can lead you down many different paths of investigation.
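One way to gather that context is to stream events directly from the Kubernetes API so they can live next to your APM data. A rough sketch with the official Python client (where you ship the events is deliberately left open):

```python
# Sketch: stream cluster events (deployments scaling, containers
# crashing, scheduling failures...) so they can sit next to your APM
# metrics. Assumes the official Python client and a local kubeconfig.
from kubernetes import client, config, watch

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for item in w.stream(v1.list_event_for_all_namespaces):
    ev = item["object"]
    # ev.type is "Normal" or "Warning"; Warning events (BackOff,
    # FailedScheduling, Unhealthy...) are the interesting context.
    if ev.type == "Warning":
        print(f"[{ev.last_timestamp}] {ev.reason}: "
              f"{ev.involved_object.kind}/{ev.involved_object.name} - {ev.message}")
```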

Tracking deployments

One of the most common root causes of any problem in a complex system is that something changed. Even a small, local change can have major effects in a highly distributed environment like a microservices architecture.

And the most common change? A new version has been deployed for a workload, service, job, or any other Kubernetes component that plays a role in your production environment. It might sound unintuitive. Tracking deployments of new code? Why wouldn’t a dev or ops team be aware of new code being deployed? The reason is that organizations get very complicated very fast as they grow. A team of 20 developers usually maintains automated CI processes, so even tracking what exactly changed in each version is not that easy. Add continuous deployment mechanisms and multiple microservices under their watch, and you quickly reach a point where not everyone knows what is currently running in production.

Tracking an event where the “recipe” of a deployment changed (its base image, configuration, dependencies) is a great place to start. That is invaluable context for any emerging bad symptom in your system: you know IF something changed right before things started to break, and if so, WHAT changed. Imagine how many hours of wandering around that can save you.
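To make this concrete, here is a sketch of what tracking a deployment’s “recipe” could look like: watching Deployment objects and reporting when their container images change. The namespace and the image-only diff are illustrative assumptions; a real tracker would diff configuration and dependencies too:

```python
# Sketch: watch Deployments and report when their "recipe" (here, just
# the container images) changes. Namespace and the image-only diff are
# illustrative; env vars, resource limits, etc. matter just as much.
from kubernetes import client, config, watch

config.load_kube_config()
apps = client.AppsV1Api()

last_seen = {}  # deployment name -> list of container images
w = watch.Watch()
for item in w.stream(apps.list_namespaced_deployment, namespace="production"):
    dep = item["object"]
    images = [c.image for c in dep.spec.template.spec.containers]
    name = dep.metadata.name
    if name in last_seen and last_seen[name] != images:
        # This is the "WHAT changed, and WHEN" context you want next
        # to an error-rate or latency chart.
        print(f"Deployment {name} changed: {last_seen[name]} -> {images}")
    last_seen[name] = images
```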

Update in the deployment of service ‘cartservice’ used to explain a decrease in errors

Tracking container crashes

As mentioned before, Kubernetes’ self-healing mechanisms are one of the most common and confusing forms of abstraction. In the good old days, a developer could log into a server to see what was going on; now that server is a Kubernetes pod, and it might be a different instance than the one that was running 5 minutes ago.

Tracking container crashes can help pinpoint application-level problems that end in an exception you might have missed, but it can also surface resource problems like Out Of Memory (OOM). A container crashing from OOM is usually the combination of an application that consumes more than expected at peak or rush hours and a memory limit that is set too strictly.

Once a container crashes, for any reason, it can create a cascade of events inside the cluster, manifesting in higher latencies, higher error rates, and even downtime.
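An OOM kill, in particular, leaves a fingerprint you can look for: the container’s last terminated state carries the reason OOMKilled. A small sketch, again with the official Python client and an example namespace:

```python
# Sketch: find containers whose last termination was an OOM kill.
# "production" is an example namespace; in a real setup you'd feed
# these findings into the same place as your APM insights.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="production").items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated if cs.last_state else None
        if term and term.reason == "OOMKilled":
            print(f"{pod.metadata.name}/{cs.name} was OOMKilled at "
                  f"{term.finished_at} (exit code {term.exit_code})")
```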

A crash in the workload ‘usermanagerservice’ used to explain a sudden spike in error rate

Final words

Application monitoring insights are the standard way to detect problematic symptoms in the complex Kubernetes production environments teams use today. However, translating these symptoms back to their root cause can be really difficult, especially since Kubernetes is such an effective abstraction tool.

Work to get Kubernetes events and APM insights in one place. You’ll enjoy the synergy of these two domains and how they interact to provide explanations for some of the most elusive problems you face.

