Observability has gotten plenty of attention since the IT industry began tossing around the term several years ago. Yet, while many explain what observability means and why it's important, fewer actually get into how to "do" observability, such as which observability indicators you need to track and analyze to make your systems observable.
That's the topic I'd like to address in this post. Below, I walk through the key observability metrics that you should be collecting today if you want to observe complex systems. I'll then take things further by explaining what you should actually do with those metrics and how to interpret them efficiently.
Importance of observability in complex systems
Before diving into a discussion of observability metrics, let's go over the fundamentals of observability in the IT industry.
As you may be aware if you work in IT, observability is the evaluation of the internal state of a system based on its external outputs. Put another way, observability means using data that you can collect from the "surface" of a system – such as infrastructure and application metrics, if the system is an IT environment – to infer what's happening deep within the system.
The observability concept isn't all that new. The electrical engineer Rudolf Kálmán introduced observability in 1960 as part of his effort to define a "system" and ways of interacting with it. But it wasn't until the advent of complex, distributed software systems (like Kubernetes, to name a popular example) that observability became important in the IT ecosystem.
When you have microservices applications deployed across clusters of servers, measuring the health and performance of those applications is a lot more complicated than simply monitoring predefined metrics and analyzing them in isolation. With observability, you pair complex sets of metrics together and analyze them in a dynamic, correlative way in order to identify and debug problems that you didn't anticipate in advance.
Observability metrics: An analogy
To understand the difference between observability and monitoring, take the Trojan Horse, the infamous giant wooden horse in which Greek soldiers hid in order to sneak into the city of Troy. On the surface, the Trojan Horse just looked like a random giant wooden horse, which is why the Trojans dragged it inside their city after the Greek army besieging Troy had departed. The Trojans were monitoring the horse from the outside only, and they weren't correlating surface-level data with other information available to them. As a result, they knew little about the internal state of the Trojan Horse – which was a problem for them, of course, because the internal state of the Trojan Horse was that it was filled with angry Greek warriors.
If the Trojans had instead observed the Trojan Horse – if they had, for example, correlated its sudden appearance with the sudden departure of the Greek army (which might have suggested to them that the Greeks were trying to pull off some kind of trick), or if they had heeded the warnings of Cassandra (the prophetess, not the open source database), who foresaw the Greeks storming the city with a horse – they might have inferred more about the internal state of the Trojan Horse. But they didn't, and Troy burned.
We can't blame the Trojans, of course. They were dealing with this stuff long before Kálmán came along, so they didn't know about observability. They only knew monitoring.
Key observability metrics
Now that we know what observability means, let's talk about the data points and data collection techniques that enable observability. Broadly speaking, observability metrics fall into two main categories: infrastructure metrics and application performance metrics.
Infrastructure metrics
You can collect metrics from infrastructure to track utilization of the CPU, memory, network and persistent storage. By analyzing these metrics for pods, containers, nodes or other types of infrastructure resources in your system, you lay the foundation for understanding what's happening within the system.
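To make this concrete, here's a minimal sketch of infrastructure metric collection using only Python's standard library. Real agents (Prometheus's node_exporter, cAdvisor and the like) gather far more, and the `infrastructure_snapshot` name and metric keys are my own illustrative choices, not a real API:

```python
import os
import shutil

def infrastructure_snapshot(path="/"):
    """Sample a few basic host-level metrics using only the standard library.

    A minimal sketch -- production collectors expose far richer data, but the
    idea is the same: sample resource utilization at the system's "surface".
    """
    metrics = {"cpu_count": os.cpu_count()}
    if hasattr(os, "getloadavg"):        # load averages are Unix-only
        load1, load5, load15 = os.getloadavg()
        metrics["cpu_load_1m"] = load1
    disk = shutil.disk_usage(path)       # persistent storage utilization
    metrics["disk_used_pct"] = 100 * disk.used / disk.total
    return metrics

print(infrastructure_snapshot())
```

An agent would take a snapshot like this on an interval and ship it to a time-series store, where it can later be correlated with application metrics.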
Application performance metrics
The application performance metrics you collect for observability purposes may vary based on the type of application you’re managing, but they typically include:
• Request rate, or the total rate of requests that the application handles.
• Latency, which tracks the time between when an application receives a request and when it finishes responding to it.
• Error rate, or the proportion of requests that result in an error.
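The three metrics above can all be derived from the same raw request records. Here's a hedged sketch of that calculation; the `(latency, status_code)` record shape and the `summarize_requests` helper are illustrative, not taken from any particular framework:

```python
from statistics import mean

def summarize_requests(requests, window_seconds):
    """Compute request rate, latency and error rate over a window of requests.

    Each record is an illustrative (latency_seconds, status_code) tuple.
    """
    total = len(requests)
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "request_rate": total / window_seconds,           # requests per second
        "avg_latency": mean(lat for lat, _ in requests),  # mean response time
        "error_rate": errors / total,                     # fraction of 5xx responses
    }

# Four requests observed in a 60-second window, one of which errored
window = [(0.12, 200), (0.30, 200), (0.25, 500), (0.18, 200)]
print(summarize_requests(window, window_seconds=60))
```

In practice you'd track percentile latencies (p95, p99) alongside the mean, since averages hide the slow tail that users actually notice.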
Changes in these metrics may or may not result from problems with your application itself. For example, high latency could be caused by the application failing to process requests efficiently due to poorly designed code, but it could also be a networking problem. Or, it could result from Kubernetes having assigned a Pod to a node with insufficient memory resources for the application to achieve low latency.
But parsing problems with multiple potential causes is the very point of observability: By collecting and correlating many different metrics from across your system, you can figure out what's actually happening.
Leveraging observability metrics
Obviously, simply collecting observability metrics doesn't make complex systems observable. You also need to know what to do with the metrics.
There are several things you can do with observability data. Let's look at common use cases for observability metrics.
Managing system performance
You can use collected metrics to identify the root cause of system performance issues, then make changes to optimize performance.
For example, if your application has a high error rate, you could look at metrics from each microservice in the app to determine which ones are acting in an anomalous way. It's likely that the odd microservice out (which is to say, the microservice behaving strangely) is where the root cause of your errors lies.
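Finding the "odd microservice out" can be as simple as comparing each service's error rate against the fleet. Below is a sketch using z-scores; the service names are made up and real anomaly detection would look at many metrics over time, not a single snapshot:

```python
from statistics import mean, stdev

def odd_one_out(error_rates):
    """Flag the service whose error rate deviates most from the fleet average.

    error_rates maps a (hypothetical) service name to its recent error rate.
    """
    mu = mean(error_rates.values())
    sigma = stdev(error_rates.values())
    # z-score: how many standard deviations each service sits from the mean
    scores = {svc: abs(rate - mu) / sigma for svc, rate in error_rates.items()}
    return max(scores, key=scores.get)

rates = {"auth": 0.01, "catalog": 0.02, "checkout": 0.31, "search": 0.01}
print(odd_one_out(rates))
```

Here the checkout service's 31% error rate stands far outside the fleet's norm, which makes it the first place to look for the root cause.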
Likewise, you can continuously profile the performance of apps by tracking various observability metrics over time. This is useful if, for example, you want to know how adding or removing cluster nodes impacts performance, or how performance varies between different times of the day.
Alerting on emerging problems
Observability metrics are a great source of insight for creating early alerts about emerging problems.
For example, if you notice that the latency of a service is rising over time, you can configure your alerting tools to fire off alerts about the issue. In turn, your engineers can take steps to remediate the root cause of the latency before your end-users start seeing multi-second response times (which, in most contexts, would lead to a pretty suboptimal user experience).
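A trend-based alert of this kind can be sketched in a few lines. This is a deliberately naive check, and the threshold values are arbitrary assumptions; real alerting tools use smoothed trends and duration conditions to avoid flapping:

```python
def latency_rising(samples, threshold_seconds=1.0, trend_window=5):
    """Return True if recent latency samples trend upward toward a threshold.

    Fires when the last `trend_window` samples are strictly increasing and the
    newest exceeds half the alerting threshold, giving engineers lead time
    before users see multi-second responses. Thresholds here are illustrative.
    """
    recent = samples[-trend_window:]
    increasing = all(a < b for a, b in zip(recent, recent[1:]))
    return increasing and recent[-1] > threshold_seconds / 2

print(latency_rising([0.2, 0.3, 0.4, 0.55, 0.7]))  # rising and past 0.5s: True
print(latency_rising([0.2, 0.2, 0.3, 0.2, 0.3]))   # no clear trend: False
```

The point of alerting on the trend rather than the absolute threshold is exactly the early warning described above: the alert fires while latency is still tolerable.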
Decreased request rate is another example of a type of event that you could alert on early. A reduction in requests could simply mean that application demand is decreasing, of course. But if it happens at a time of day or day of the week when that trend is unexpected, a decreasing request rate could signal an issue like a network failure that's preventing users from connecting to your app.
Increased error rate metrics serve a similar purpose. If you see an increase in errors over time, you'll want to investigate and remediate the problem before it leads to severe degradation of your user experience. Increased error rate could stem from issues like lack of available infrastructure resources, which means you should scale your infrastructure up.
Minimizing MTTR
Observability metrics play an absolutely vital role in minimizing Mean Time to Remediate, or MTTR.
MTTR is a measure of how long it takes your team to fix application performance problems after discovering them. In complex systems, tracing problems to their root cause can be quite difficult, given the many components and variables involved. Even if you can trace an issue like an error to an individual microservice, you might not know whether the root cause is logic in the microservice, a configuration issue on the node hosting it, a poor resource request or limit setting for its Pod or container, and so on.
The longer it takes you to evaluate the various possibilities, the higher your MTTR will be – which is a bad thing from the perspective of application performance. But by collecting and comparing observability indicators from across the various components of your system, you can more quickly rule out different potential causes and trace the issue to its source.
Monitoring and visualization tools for observability metrics
There's a whole fleet of tools on the market for monitoring and visualizing observability metrics. If you asked me to pick my favorite ones, though, I'd confidently choose Prometheus and Grafana. Prometheus is an open source monitoring and alerting tool that can track a wide range of metrics and fire custom alerts based on them. No matter which types of apps you need to support or which metrics you have to monitor, Prometheus gives you a flexible way to collect and analyze your data.
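Much of Prometheus's power comes from turning raw counters into rates at query time with its `rate()` function. As a rough illustration of the idea (not Prometheus's actual algorithm, which also handles counter resets and extrapolates to the window boundaries), here's the per-second-increase calculation in Python:

```python
def simple_rate(samples):
    """Approximate a per-second rate from (timestamp, counter_value) samples.

    Mimics the spirit of Prometheus's rate(): the per-second increase of a
    monotonically growing counter over the sampled window. The real rate()
    also corrects for counter resets and extrapolation; this sketch doesn't.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# A request counter sampled every 15 seconds, growing by 30 per interval
samples = [(0, 100), (15, 130), (30, 160), (45, 190)]
print(simple_rate(samples))  # 2.0 requests per second
```

Exposing counters and letting the monitoring system derive rates is what makes the request-rate and error-rate metrics discussed earlier cheap to collect and flexible to analyze.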
As for visualization, I'm partial to Grafana. It's also open source, it integrates well with Prometheus (among other monitoring tools) and it allows you to configure visualizations to your heart's content, so that you can make more sense of your observability metrics.
Fear not the gifts of observability
Modern application observability might seem complex, but it's actually simple enough. An effective observability strategy boils down to collecting the right metrics with the right tools, then leveraging those metrics in an effective way. When you can do that, you can make sense of even the most complex systems – and, unlike the Greeks capturing Troy, it won't take you ten years to accomplish your goals.