Microservices Monitoring: How to Choose the Right Tool

Aviv Zohari

Founding Engineer

May 30th, 2023

When you're responsible for more than one kid, you face a steeper set of challenges. It's not just that you have multiple kids to keep track of. You also have to monitor how those kids interact and how the actions of one child affect the others. If one kid starts crying, for example, you need to determine whether it's because of a problem of his own making, or because another little monster stole his chocolate chip cookie.

Likewise, to monitor microservices, you have to do more than just keep track of which microservices you have running. You must also compare monitoring data from multiple microservices to gain insight into the state of each individual microservice, as well as the health of your app as a whole.

While we’re hardly qualified to give parenting advice, we do know a thing or two about monitoring microservices, and we'd like to share our thoughts. Keep reading for a guide to microservices monitoring, including how it works, why it can be challenging, and best practices for monitoring a microservices app efficiently, effectively, and scalably.

What is Microservices Monitoring?

Microservices monitoring is the process of tracking the status and health of microservices within a distributed application.

Today, deploying an application as a set of microservices has become a popular approach. The alternative is a monolith, in which everything runs as a single process or service. But to keep track of the overall health of a microservices app, you must do more than simply monitor the app as a whole. You need to collect metrics and log events from each microservice within the app. You then analyze, compare, and correlate that data to manage the health of the entire app.

Microservices monitoring makes this possible by collecting the monitoring data that each microservice produces.

Monolith vs. Microservice Architecture Monitoring

Aspect	Monolith	Microservices
Data sources	Few (one set of data types for the app as a whole).	Many (one set of data types for each microservice).
Traces	Not applicable in most cases.	Critical for monitoring.
Service interactions	Not applicable.	Critical for monitoring.

In some respects, monitoring microservices isn’t radically different from monolithic app monitoring. Both types of monitoring rely in part on the same types of data sources – namely, logs and metrics – and both involve looking for trends and anomalies to identify potential problems with an application.

However, there are some key differences between monitoring a microservices app and monitoring a monolith:

More data sources: Most monoliths produce just one log file and one set of metrics, meaning you have relatively few data sources to collect. With microservices, you need to collect logs and metrics from each microservice.
Traces: When monitoring distributed systems, you may collect trace data in addition to metrics and logs. Tracing matters less in monolithic apps because, when the app runs as a single service, requests never cross service boundaries.
Service interactions: To monitor a microservices app, you can't just monitor each microservice in isolation. You also need to correlate monitoring data across microservices to understand how an issue in one microservice may affect the others.

Thus, while the same basic processes apply to monitoring both monoliths and microservices, monitoring microservices is more complex and requires a more sophisticated approach.

Why is Monitoring Microservices Important?

Suppose you only monitored your application as a whole, collecting data such as total CPU and memory utilization for the entire app. You would then lack the granular visibility necessary to home in on the root cause of performance issues. For example, knowing that the app as a whole has a sudden spike in CPU usage won't help much if the problem stems from a bug in one particular microservice. You'd need to identify the specific microservice at fault, then fix it.

By monitoring individual microservices, you gain granular visibility into your application, allowing you to identify and remediate application performance problems more quickly.
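The sketch below makes this concrete. It assumes you already collect average CPU usage per service, and the service names and numbers are hypothetical. It compares two time windows and reports which service accounts for most of an app-wide increase.

```python
def top_contributor(before, after):
    """Return (service, delta) for the service whose CPU usage grew the most.

    `before` and `after` map service name -> average CPU cores used in a window.
    """
    services = set(before) | set(after)
    deltas = {svc: after.get(svc, 0.0) - before.get(svc, 0.0) for svc in services}
    service = max(deltas, key=deltas.get)
    return service, deltas[service]

before = {"frontend": 0.4, "cart": 0.3, "search": 0.5}
after = {"frontend": 0.5, "cart": 2.1, "search": 0.6}

service, delta = top_contributor(before, after)
print(f"app CPU went from {sum(before.values()):.1f} to {sum(after.values()):.1f} cores; "
      f"{service} accounts for +{delta:.1f} of the increase")
```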

Microservices Monitoring vs. Observability

Aspect	Monitoring	Observability
Purpose	Identify trends and anomalies.	Investigate complex problems.
Data sources	Typically, application and infrastructure metrics, logs, and traces.	Data from across all layers of a distributed system.
Used for	Monoliths and microservices apps.	Primarily microservices apps.

Monitoring for microservices also helps you gain observability into distributed applications. However, monitoring and observability are distinct processes, and it's important not to conflate the two.

Monitoring refers primarily to data collection and display, whereas observability involves deep analysis of that data. Monitoring can help you identify basic trends and detect potential problems, such as a microservice that has suddenly become unresponsive. Observability, by contrast, allows you to investigate the issue and gain context on why it's happening, based on all available data outputs from across your environment. Through observability, for instance, you might discover that the Pod hosting the unresponsive microservice has experienced an error, which explains the failure.

Microservices Monitoring Metrics

The specific metrics that you monitor for microservices can vary depending on factors like what your application does and where it's hosted. In general, however, there are two basic types of metrics to monitor for microservices apps.

Resource Metrics

The first is resource metrics. These track how many infrastructure resources – such as CPU, memory, and disk space – your microservices are consuming. You can typically track this information through your hosting platform (although in some cases, such as if you’re deploying microservices via serverless functions, you may not have full visibility into resource metrics because you lack direct access to the underlying host infrastructure).
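Linux hosts that use cgroup v2 expose a container's memory usage and memory limit as files under its cgroup directory. From inside a container, that directory is commonly `/sys/fs/cgroup`. The sketch below assumes cgroup v2. It reads `memory.current` and `memory.max` to express memory usage as a fraction of the limit. The directory is a parameter so the function can be pointed elsewhere. A real agent would also need to handle cgroup v1 and CPU metrics.

```python
from pathlib import Path
from typing import Optional

def memory_saturation(cgroup_dir: str = "/sys/fs/cgroup") -> Optional[float]:
    """Return memory usage as a fraction of the cgroup's limit (cgroup v2 only).

    Returns None when no limit is set, which cgroup v2 reports as "max".
    """
    base = Path(cgroup_dir)
    current = int((base / "memory.current").read_text().strip())
    limit_raw = (base / "memory.max").read_text().strip()
    if limit_raw == "max":
        return None  # unlimited: saturation relative to a limit is undefined
    return current / int(limit_raw)
```

Inside a container you would call `memory_saturation()` with the default path. A result of 0.9 would correspond to the 90 percent example used for saturation in the Golden Signals below.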

Golden Signals

The second type is the Golden Signals, a set of monitoring metrics popularized by Google's Site Reliability Engineering (SRE) book. They include four specific data points:

Latency: How long it takes a service to fulfill a request. Latency is typically measured in milliseconds.
Traffic: How many requests your services are receiving. For instance, a microservice that receives 100 HTTP requests per minute has a request rate of 100 requests per minute.
Errors: How many requests result in errors. What counts as an error can vary depending on the type of app you’re monitoring. For HTTP services, the error rate typically focuses on 5xx server-error status codes such as 500.
Saturation: Resource consumption as a percentage of total available resources. For example, a microservice consuming 90 percent of the memory allocated to it has a memory saturation rate of 90 percent.

By collecting these metrics in addition to generic resource utilization metrics, you can track how patterns such as a spike in traffic correspond to an uptick in error rates. You can also see how a microservice's latency changes as it approaches 100 percent CPU utilization.
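The sketch below shows roughly how the four signals relate to raw request data by computing them for one time window. It makes several assumptions:

- It uses a hypothetical `Request` record with a duration and an HTTP status.
- It treats 5xx responses as errors.
- It reports latency as a nearest-rank 95th percentile.
- It takes saturation from a resource reading such as memory used versus the memory limit.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned

def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def golden_signals(requests, window_s, used, limit):
    """Compute the four Golden Signals for the requests seen in one window."""
    latencies = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return {
        "latency_p95_ms": percentile(latencies, 95) if latencies else None,
        "traffic_rpm": len(requests) * 60 / window_s,  # requests per minute
        "error_rate": errors / len(requests) if requests else 0.0,
        "saturation": used / limit,  # e.g. memory used / memory limit
    }

window = [Request(12, 200), Request(15, 200), Request(230, 500), Request(18, 200)]
print(golden_signals(window, window_s=60, used=450, limit=500))
```

Percentiles are used for latency rather than an average so that a few slow requests, like the 230 ms error above, remain visible.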

Microservices Monitoring Challenges

While monitoring of microservices is critical for ensuring the availability and performance of distributed applications, it can be challenging, for a variety of reasons.

Cloud Native Architecture Based on Containers and Kubernetes

For starters, microservices are typically deployed in containers, which are orchestrated using a platform like Kubernetes. This approach complicates monitoring for two basic reasons:

You have more layers of abstraction between monitoring tools and microservices, which can make it harder to access monitoring data. For example, a monitoring agent that runs outside a Kubernetes cluster would typically not be able to collect monitoring data from microservices hosted inside the cluster, unless each microservice included special code that told it how to send monitoring data to the agent.
Containers and Kubernetes create highly dynamic environments. Kubernetes might migrate microservices automatically from one server to another, for instance, making it impossible to know ahead of time which server will host a microservice. Tools for monitoring of microservices must therefore be dynamic, too, so they can adapt to constant changes.
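Pods come and go and move between nodes, so monitoring tools generally discover their targets from the orchestrator rather than from a fixed list of hosts. The sketch below is a minimal illustration of that idea. It applies Kubernetes-style watch events to an in-memory target list. `ADDED`, `MODIFIED`, and `DELETED` are event types the Kubernetes watch API emits. The event dictionaries are simplified, hypothetical stand-ins for real Pod objects. A production tool would use an official client library and handle reconnects.

```python
class TargetRegistry:
    """Keeps the current set of monitoring targets in sync with Pod events."""

    def __init__(self):
        self.targets = {}  # pod name -> "ip:port"

    def apply(self, event):
        pod = event["object"]
        name = pod["name"]
        if event["type"] == "DELETED" or not pod.get("ip"):
            # Pod is gone, or has no IP yet: stop monitoring it.
            self.targets.pop(name, None)
        else:
            # ADDED or MODIFIED: a rescheduled Pod may come back with a new IP.
            self.targets[name] = f'{pod["ip"]}:{pod["port"]}'

registry = TargetRegistry()
for ev in [
    {"type": "ADDED", "object": {"name": "cart-7f9c", "ip": "10.0.1.12", "port": 8080}},
    {"type": "ADDED", "object": {"name": "search-5d2a", "ip": "10.0.2.8", "port": 8080}},
    {"type": "DELETED", "object": {"name": "cart-7f9c"}},
    {"type": "ADDED", "object": {"name": "cart-9b1e", "ip": "10.0.3.4", "port": 8080}},
]:
    registry.apply(ev)
print(registry.targets)
```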

Troubleshooting Log Loss on Container Termination

Another challenge that arises if you host microservices using containers is that log data stored in containers disappears permanently when the containers terminate. The best way to work around this limitation is to collect monitoring data from running containers in real time rather than relying on log files. That way, if your container suddenly shuts down, you already have monitoring data for the microservice that ran inside it.

An alternative approach is to configure microservices to write log files to persistent storage that is external to containers, but this requires extra logic inside the microservices. Real-time collection of monitoring data is usually a better solution.
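A common way to get real-time collection in containers is to write logs to standard output rather than to files inside the container. The container runtime captures that stream, and on Kubernetes, node-level log collectors commonly ship it as it is written. The minimal sketch below uses only Python's standard `logging` module to emit one JSON object per line. The service name and field names are illustrative choices, not a required schema.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        })

def make_logger(service, stream=sys.stdout):
    """Log to a stream (stdout by default) instead of a file in the container."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers[:] = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger

log = make_logger("checkout")
log.info("order %s placed", "A-1001")
```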

Coexistence of Containerized Applications on a Single Host

Monitoring for microservices apps would be relatively easy if each server hosted only one microservice. In that case, you could collect monitoring data from each server and make reliable assumptions about how the data correlates to individual microservices.

Unfortunately, this is not how microservices apps usually work. Typically, a single server hosts multiple microservices. To monitor each one, you need a monitoring method that is independent of the server.
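One way to make monitoring data independent of the server is to label every sample with the identity of the workload that produced it, then aggregate by that label. Many Kubernetes deployments use a label such as `app` or `app.kubernetes.io/name` for this, though conventions vary. The sketch below assumes each sample already carries hypothetical `node`, `pod`, and `service` fields. It shows that grouping by service gives a per-microservice view no matter which nodes the Pods landed on.

```python
from collections import defaultdict

samples = [
    # Three services share node-a; cart also runs a second replica on node-b.
    {"node": "node-a", "pod": "cart-1", "service": "cart", "cpu_cores": 0.6},
    {"node": "node-a", "pod": "search-1", "service": "search", "cpu_cores": 0.2},
    {"node": "node-a", "pod": "auth-1", "service": "auth", "cpu_cores": 0.1},
    {"node": "node-b", "pod": "cart-2", "service": "cart", "cpu_cores": 0.5},
]

def cpu_by(samples, label):
    """Sum CPU usage grouped by the given label ("node" or "service")."""
    totals = defaultdict(float)
    for s in samples:
        totals[s[label]] += s["cpu_cores"]
    return dict(totals)

print(cpu_by(samples, "node"))     # per-host view mixes services together
print(cpu_by(samples, "service"))  # per-service view, independent of placement
```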

Integration with Third-Party Platforms

In some cases, you might deploy microservices on a specific platform, such as Amazon Elastic Kubernetes Service (EKS) or Azure Kubernetes Service (AKS). If you do, you’ll find that the platform providers make certain monitoring tools – such as CloudWatch in the case of EKS – available to help monitor microservices. But the functionality of those tools is limited – you can only monitor the data types that the tools support, and the tools offer limited analysis and visualization features.

This means that an effective monitoring strategy often must extend beyond the built-in monitoring tooling that comes with certain platforms. Native monitoring tools may be useful for collecting basic data, but they are rarely sufficient on their own for meeting complex monitoring needs.

How to Monitor Your Microservices the Right Way

Effective monitoring for microservices starts with ensuring that you collect the basic types of data – such as resource utilization metrics and the Golden Signals, which we described above. However, to get the very most out of your monitoring strategy, consider taking additional steps.

Monitor APIs

In addition to monitoring individual microservices, perform API monitoring. Sometimes, problematic behavior results from issues with APIs rather than with microservices; for example, a malformed API response could cause an error.

By pairing API monitoring with monitoring of microservices, you'll gain critical context that can help you when troubleshooting microservices performance issues.
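A basic synthetic API check should confirm that the response is well-formed, not just that the endpoint is reachable, since a malformed response can break its consumers. The sketch below is minimal and rests on stated assumptions. The required fields (`id` and `status`) and any URL you pass in are hypothetical. It uses only the standard library's `urllib.request`. Validation is kept in a separate pure function so it can be checked without a network.

```python
import json
import time
import urllib.error
import urllib.request

REQUIRED_FIELDS = {"id", "status"}  # hypothetical contract for this endpoint

def validate_response(status, body):
    """Return a list of problems found in an API response (empty if healthy)."""
    if status != 200:
        return [f"unexpected status {status}"]
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return ["body is not a JSON object"]
    missing = REQUIRED_FIELDS - set(payload)
    return [f"missing fields: {sorted(missing)}"] if missing else []

def check_endpoint(url, timeout=5.0):
    """Call the endpoint once; return (latency_ms or None, problems)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, body = resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        status, body = err.code, ""  # 4xx/5xx responses are raised as HTTPError
    except (urllib.error.URLError, TimeoutError) as err:
        return None, [f"request failed: {err}"]
    latency_ms = (time.monotonic() - start) * 1000
    return latency_ms, validate_response(status, body)
```

Recording latency alongside the validation result lets the same check feed both API monitoring and the Golden Signals.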

Monitor Containers and Pods

For similar reasons, you should monitor the containers and Pods that host microservices. You may find, for example, that a Pod never started because a network issue prevented Kubernetes from pulling its container image. As a result, the microservice that was supposed to run in the Pod never came up.

In this case, Pod monitoring would provide important insight into why your microservice failed. You'd miss that insight if you monitored the microservice alone.
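The sketch below shows the kind of check Pod monitoring makes possible. It assumes you already have a Pod object as JSON, for example from `kubectl get pod <name> -o json`. It inspects the Pod's `status.containerStatuses[].state.waiting.reason` and `lastState.terminated.reason` fields. The set of reasons treated as problems is an illustrative choice, not an exhaustive list.

```python
import json

PROBLEM_REASONS = {"ErrImagePull", "ImagePullBackOff", "CrashLoopBackOff", "OOMKilled"}

def pod_problems(pod):
    """Return human-readable problems for each container in a Pod object."""
    problems = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        name = cs.get("name", "?")
        waiting = cs.get("state", {}).get("waiting", {})
        last_term = cs.get("lastState", {}).get("terminated", {})
        if waiting.get("reason") in PROBLEM_REASONS:
            problems.append(f"{name}: waiting ({waiting['reason']})")
        if last_term.get("reason") in PROBLEM_REASONS:
            problems.append(f"{name}: last terminated ({last_term['reason']})")
    return problems

pod_json = """{"status": {"phase": "Pending", "containerStatuses": [
  {"name": "cart", "restartCount": 0,
   "state": {"waiting": {"reason": "ImagePullBackOff",
                         "message": "Back-off pulling image"}}}]}}"""
print(pod_problems(json.loads(pod_json)))
```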

Implement Crash Reporting

Crash reporting, which provides information about what was happening inside a microservice prior to a crash, can provide valuable insight when a microservice fails. Just remember that if your microservices run in containers, any data stored in them may disappear when the containers shut down, so you'll ideally export crash reports to a location that can persist beyond the container.
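In Python, a minimal crash-reporting hook can be built on `sys.excepthook`, which the interpreter calls when an exception goes uncaught. The sketch below writes a JSON report with the traceback before the process exits. It assumes a hypothetical `CRASH_DIR` backed by persistent storage, such as a volume mounted into the container. In practice you might send the report to an external service instead.

```python
import json
import os
import sys
import time
import traceback

# Hypothetical mount point backed by storage that outlives the container.
CRASH_DIR = os.environ.get("CRASH_DIR", "/var/crash-reports")

def write_crash_report(exc_type, exc, tb, crash_dir=None):
    """Persist details of an uncaught exception and return the report path."""
    crash_dir = crash_dir or CRASH_DIR
    os.makedirs(crash_dir, exist_ok=True)
    report = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "pid": os.getpid(),
        "type": exc_type.__name__,
        "message": str(exc),
        "traceback": "".join(traceback.format_exception(exc_type, exc, tb)),
    }
    path = os.path.join(crash_dir, f"crash-{os.getpid()}-{int(time.time())}.json")
    with open(path, "w") as f:
        json.dump(report, f)
    return path

def crash_hook(exc_type, exc, tb):
    try:
        write_crash_report(exc_type, exc, tb)
    except OSError as err:
        print(f"could not write crash report: {err}", file=sys.stderr)
    sys.__excepthook__(exc_type, exc, tb)  # still print the usual traceback

sys.excepthook = crash_hook
```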

Set Dynamic Baselines

When you're monitoring a microservices app that constantly scales up and down, there is no such thing as "normal." For example, a microservice that handles 100 requests per minute at one point in time may suddenly begin handling 10,000 due to an increase in the number of users connected to the app.

You therefore can't assume that a specific number indicates that your microservice is performing well or not. Instead, you should establish dynamic baselines, meaning baselines that change over time as your application scales up and down. Rather than alerting based on fixed monitoring thresholds, alert based on unexpected patterns or anomalies in overall microservice behavior.
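One simple way to implement a dynamic baseline is an exponentially weighted moving average (EWMA) of a metric and of its variance. An observation is flagged when it falls too many standard deviations from the current baseline. The sketch below is one possible technique, not the only approach. The `alpha`, `threshold`, and `warmup` defaults are illustrative assumptions you would tune per metric.

```python
import math

class EwmaBaseline:
    """Exponentially weighted baseline that adapts as normal behavior shifts.

    alpha: how quickly the baseline follows new data (0 < alpha <= 1).
    threshold: how many standard deviations from the baseline count as anomalous.
    warmup: observations to collect before any value can be flagged.
    """

    def __init__(self, alpha=0.1, threshold=3.0, warmup=10):
        self.alpha, self.threshold, self.warmup = alpha, threshold, warmup
        self.mean = None
        self.var = 0.0
        self.count = 0

    def update(self, value):
        """Score a new observation against the baseline, then fold it in."""
        self.count += 1
        if self.mean is None:
            self.mean = value
            return False
        deviation = value - self.mean
        std = math.sqrt(self.var)
        anomalous = (self.count > self.warmup and std > 0
                     and abs(deviation) > self.threshold * std)
        # Incremental EWMA updates for the mean and variance.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous
```

Every value, including flagged ones, is folded into the baseline. A sudden spike therefore triggers an alert, but if traffic settles at a new level, the baseline moves with it instead of alerting forever.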

Choosing the Right Microservices Monitoring Tools

Most monitoring tools available today are capable of supporting microservices, in the sense that they can collect data from them. However, some tools offer more advanced monitoring capabilities than others. When evaluating microservice monitoring tools, consider the following factors.

Distributed Tracing

Distributed tracing is the practice of tracing how different microservices within the same app handle a request. In other words, tracing monitors how a request flows through a distributed system. This is valuable because tracing can help pinpoint which microservice is causing a delay in request processing or is the source of an error.

Look for monitoring tools that offer full support for distributed tracing so that you can run traces whenever necessary to investigate an issue.
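Distributed tracing works by passing a trace identifier along with each request, so that work recorded by different services can be stitched into one trace. The sketch below illustrates that propagation step using the W3C Trace Context `traceparent` header, whose format is `version-traceid-parentid-flags`. It handles only version `00`. In practice you would rely on a tracing library such as OpenTelemetry rather than hand-rolling this.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"

def child_traceparent(incoming):
    """Continue a trace: keep the trace ID and flags, mint a new span ID."""
    match = TRACEPARENT_RE.match((incoming or "").strip())
    if match:
        trace_id, parent_id, flags = match.groups()
        # All-zero IDs are invalid in the Trace Context format.
        if trace_id != "0" * 32 and parent_id != "0" * 16:
            return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
    return new_traceparent()  # missing or invalid header: start a fresh trace
```

For example, an API gateway would call `new_traceparent()` for a request that arrives without the header. Each downstream service would then call `child_traceparent()` on the header it receives before forwarding the request, so every hop shares the same trace ID.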

Scalability

Your monitoring tools should be able to scale as seamlessly as your microservices. However, tools that have a cumbersome deployment process, or that consume significant resources, may scale poorly. Look for low-overhead tools that can scale up and down rapidly.

Thorough Collection and Analysis of Data

Monitoring tools should do more than simply collect data. They should also help you visualize and interpret it. And they should allow you to collect all types of relevant data. The more monitoring data available to you, the greater your ability to troubleshoot complex microservices problems.

Ease of Use

Monitoring solutions that require complex configuration or installation processes can become a burden on IT teams. Ideally, monitoring tools for microservices will require no special skills to deploy or operate.

Monitoring Microservices with groundcover

Groundcover is a new breed of monitoring tool for distributed systems and apps. Unlike traditional solutions, which rely on resource-hungry software agents to collect monitoring data, groundcover uses the eBPF framework to collect monitoring information in a hyper-efficient and (from the user's perspective) hyper-simple way. Groundcover saves admins from the tedium of cumbersome monitoring tool deployment processes, and it provides access to a virtually unlimited set of data types.

Groundcover also offers a rich set of analytics and data visualization tools, making it easy to make sense of complex monitoring data. With groundcover, admins never have to worry that monitoring tool complexity or feature limitations will prevent them from discovering and remediating microservices performance problems.

Monitoring each microservice continuously is a pillar of effective application performance management

It's only by tracking data such as resource utilization, latency, and error rates for each microservice that you'll know when you have a problem, and that you'll be prepared to investigate it. And while there's no denying that microservices monitoring is more challenging than monitoring a monolith, we like to think solutions like groundcover make it a lot easier.
