Going for the gold: How K8s warriors can monitor the four golden signals
Find out how the four golden signals work, and how to apply them in the context of modern application monitoring
To monitor apps running in a Kubernetes cluster is to experience what you might call an "embarrassment of riches." On the one hand, Kubernetes consists of many entities that all communicate with one another via API calls, which gives you more potential monitoring surface to work with and finer granularity for pinpointing problems.
On the other hand, it means there is so much more data to collect and analyze in a K8s environment that you can't feasibly monitor it all. If you try, you're likely to end up with too much noise, and too little actionable visibility into what is happening to your apps.
That's part of the reason why the so-called "four golden signals" have become so popular. In the API-driven, Kubernetes-centric world in which we live, the four golden signals concept is a great way to be strategic about which data you monitor – and to ensure that you get the actionable information you need, but without becoming distracted or overwhelmed by data that doesn't translate to meaningful insights.
Keep reading for a look at how the four golden signals work, and how to apply them in the context of modern application monitoring.
What is monitoring, and why do we do it?
Before diving into a discussion of the four golden signals, let's start with the basics: A look at what monitoring means and why it's important.
Monitoring is the practice of tracking what is happening in an application environment in order to detect trends or changes that may require action.
For example, when you track the memory consumption of an app as your customer base grows, you can determine when it's time to increase the memory allocation to the infrastructure hosting the app, in order to ensure it doesn't run out of memory.
Monitoring can also help you validate whether a change you made produced the desired result. If you want to know whether a new application release triggered an increase in API latency, for instance, monitoring is your friend.
And when something demonstrably bad happens – like an application that stops responding – monitoring (and alerts generated by your monitoring tools) is how you detect the problem before hearing about it from your customers.
For monitoring, less (data) is more
Given how central monitoring is for managing the health of your applications, you might think that the more monitoring data you can collect and analyze, the better.
In reality, though, trying to monitor every single data point in every layer of your application stack can be a bad thing, for a few reasons:
- Setup difficulty: To monitor applications and infrastructure, you need to configure them to expose the right metrics in the right way. This takes work, so trying to monitor everything may mean that you spend forever setting up your monitoring tools (and setting them up again whenever your hosting stack changes), which is a huge investment of time and effort.
- Risk of distraction: More data means more information to sort through as you investigate issues, and too many signals can create noise that distracts you from identifying the root cause of problems. You want enough data that you can make informed decisions, but not so much that you miss the forest for the trees, as they say.
- Storage limitations: Monitoring data can add up over time if you store it persistently. Storage costs money and requires effort to manage, so keeping your monitoring scope reasonable helps to streamline your storage needs.
When it comes to K8s monitoring, think like a doctor
If you can't monitor everything, then, how do you decide what you should monitor?
The answer is to think like a doctor by identifying which "symptoms" are most relevant for achieving the insights you need. Just as a good doctor knows which data points to collect to figure out what is ailing you, a good IT team knows which specific signals it requires to achieve application and infrastructure performance requirements (and meet SLA commitments).
For example, if a doctor wants to diagnose the root cause of a fever, they are probably not going to look at your toes or ask what you ate last week. Instead, they’ll take your temperature, listen to your lungs and maybe look in your ears and mouth.
Likewise, an IT engineer probably doesn't need to run a SMART analysis of a disk drive to determine why an application is experiencing high latency. Instead, they want to look at metrics related to application load and network traffic to diagnose the cause of the issue.
The bottom line: Choosing the right symptoms to monitor is critical. You can't monitor everything, but you also can't diagnose problems effectively if you fail to monitor the right things.
The four golden signals: The key to Kubernetes monitoring
Although there's no universal law about which data to focus on when monitoring applications, a popular approach today is to study what Google calls the "four golden signals": latency, traffic, errors and saturation.
The four golden signals approach works well for a couple of reasons. One is that, if your application makes heavy use of internal and external APIs – as most applications orchestrated by Kubernetes do – you can collect latency, traffic, errors and saturation data by monitoring APIs. That's a big deal, because in the past you would have had to collect those signals directly from within your application, which would have required custom instrumentation (a ton of work). When you instead focus on monitoring the "black box" of API interactions, collecting data from complex, microservices-based apps becomes a simple process that you can perform systematically, without worrying about the specific business logic inside your code.
The four signals are also handy because they are simple and easy to work with. They yield actionable insights for virtually any type of application, and you don't need complex metrics to generate them. You just need quantifiable information that you can easily collect by watching API interactions, and you can then analyze that information to determine whether your app is doing what users expect it to do – like responding to their requests in a reasonable timeframe.
You can also define Service Level Objectives (SLOs – specific, measurable service targets you commit to for your users) based on the four golden signals, which ensures a consistent approach to setting SLOs and tracking adherence to them.
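To make this concrete, here is a minimal sketch of how the four signals could be derived from a window of observed API calls. The `Request` records, the `capacity_rps` figure used to estimate saturation, and the 1% error-rate SLO are all hypothetical, illustrative choices, not part of any particular tool:

```python
from dataclasses import dataclass
from statistics import quantiles

# Hypothetical record of a single API call, as an API-level monitor might capture it.
@dataclass
class Request:
    latency_ms: float
    status_code: int

def golden_signals(requests, window_seconds, capacity_rps):
    """Derive the four golden signals from a window of observed API calls.

    `capacity_rps` (the throughput the service can sustain) is an assumed,
    service-specific figure used to approximate saturation.
    """
    traffic = len(requests) / window_seconds                          # requests per second
    error_rate = sum(r.status_code >= 500 for r in requests) / len(requests)
    p99_latency = quantiles([r.latency_ms for r in requests], n=100)[98]
    saturation = traffic / capacity_rps                               # fraction of capacity in use
    return {"traffic_rps": traffic, "error_rate": error_rate,
            "p99_latency_ms": p99_latency, "saturation": saturation}

# 97 fast successes and 3 slow server errors over a one-minute window.
window = [Request(12.0, 200)] * 97 + [Request(250.0, 500)] * 3
signals = golden_signals(window, window_seconds=60, capacity_rps=10)

# An example SLO check: "no more than 1% of requests fail."
slo_met = signals["error_rate"] <= 0.01
```

In practice these numbers come from a metrics pipeline rather than an in-process list, but the arithmetic behind the signals is exactly this simple.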
Collecting your golden signals in Kubernetes
There's an important caveat to what we said above about the golden signals being easy to collect: It's only easy when you have an easy way of collecting metrics based on API interactions, then exposing those metrics seamlessly to your favorite monitoring tools.
This is not necessarily easy to do (and Kubernetes certainly doesn't provide any tools to help you do it). Conventionally, collecting the signals required a "white-box" approach wherein you exposed metrics from directly within an application using custom instrumentation, then fed them into a tool like Prometheus. That was really time-consuming to set up, and it tied your monitoring to your code, making it one more thing your developers had to maintain.
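To illustrate why this "white-box" approach couples monitoring to your code, here is a stdlib-only sketch of an app that counts its own requests and serves them in the Prometheus text exposition format on `/metrics`. The metric names are illustrative; a real application would typically use the official `prometheus_client` library rather than hand-rolling this:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading, time, urllib.request

# The instrumentation state lives inside the application itself –
# this is exactly the coupling the "white-box" approach implies.
REQUESTS = 0
LATENCY_SUM = 0.0

class App(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS, LATENCY_SUM
        if self.path == "/metrics":
            # Prometheus scrapes this plain-text endpoint periodically.
            body = (
                f"http_requests_total {REQUESTS}\n"
                f"http_request_latency_seconds_sum {LATENCY_SUM}\n"
            ).encode()
        else:
            start = time.monotonic()
            body = b"hello\n"                     # stand-in for real business logic
            REQUESTS += 1
            LATENCY_SUM += time.monotonic() - start
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):                 # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), App)        # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

urllib.request.urlopen(f"http://127.0.0.1:{port}/hello").read()
metrics = urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics").read().decode()
server.shutdown()
```

Every new endpoint, service, and sidecar needs the same treatment, which is why this approach scales so poorly across a cluster.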
An alternative was to deploy full-blown APM tools like Datadog or New Relic, which would collect the data for you and provide analysis and visualization features. That was a little easier than instrumenting monitoring within your own code. But it still required lots of preplanning and a more complex monitoring stack – especially in an environment like Kubernetes, where you need to figure out how to get metrics out of your containers, since the containers don't send metrics data to any central location automatically.
Moreover, you'd still be left blind to specific parts of your environment, like sidecars and proxies that you cannot instrument. Getting hold of these precious golden signals anywhere across your production environment is therefore by no means an easy task; it's one that many teams spend weeks or months on.
eBPF for the gold
eBPF has proven advantages in many domains, including K8s observability. In essence, it lets a programmer run sandboxed code very efficiently inside the kernel and observe any and all programs running in user space – something that is hard to do with observability tools that themselves operate in user space.
These "Linux superpowers," as Brendan Gregg calls them, help turn the complex "white-box" approach of exposing golden-signal metrics from within your application into a simple "black-box" implementation. You run your eBPF monitoring inside the kernel and, done right, you can monitor any API call going in and out of any container on the K8s node you're running on. No code changes, no manually exposing collectable metrics. Nothing.
groundcover is a working example of what that giant leap forward can look like. Once deployed on any K8s cluster, groundcover's eBPF agent instruments all the API calls, DB queries, and so on going in and out of every container running in the cluster. Clearly, we're talking about massive, very high-throughput data, and turning it into golden signals doesn't scale by itself if you're not careful. That's where groundcover's distributed architecture kicks in: as data flows past each eBPF agent on each node in the K8s cluster, it is turned into metrics without ever being stored or even moved off the node. Among these metrics are the golden signals for any resource found across the cluster. This out-of-the-box experience is exactly that leap forward in getting hold of the gold: no R&D effort in the process, yet full coverage of what matters most – the metrics that can drive your monitoring stack home.
To make things even easier, groundcover labels each metric in a K8s-native way, so it's easy to query and explore K8s "questions" like: Which workload is rising in latency? Which node is experiencing the highest throughput?
The golden signal metrics are accessible as a Prometheus-compatible datasource which can be added directly into Grafana.
For example, to explore the error rate of all your HTTP APIs (at resource-path granularity), you can run a query like this:
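A hypothetical PromQL sketch of such a query is below. The metric name `http_requests_total` and the labels `status_code` and `resource_path` are illustrative assumptions; the actual metric and label names exposed by your datasource may differ:

```promql
# Share of 5xx responses per resource path over the last 5 minutes
# (metric and label names are hypothetical)
sum by (resource_path) (rate(http_requests_total{status_code=~"5.."}[5m]))
  /
sum by (resource_path) (rate(http_requests_total[5m]))
```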