Microservices inside a cloud-native app are like those two family members who often can’t seem to get along. When things are all nice and peachy, they’re civil to one another. But when a conversation between them starts to go sideways, it's a sign that things are about to go downhill fast.
Luckily for those who prefer less drama in their lives, things are much more clear-cut in the world of microservices. When your microservices stop communicating, what you need to do is troubleshoot the issue by looking at the communication framework that the microservices depend on.
In many cases, that framework is gRPC, an open source solution for defining and implementing remote procedure calls between services within a distributed cloud-native architecture. ("Remote procedure calls" is a fancy way of saying what you might otherwise label "data exchanges"). When your microservices stop communicating as they should, monitoring gRPC is an essential step toward figuring out what's wrong.
Keep reading for a breakdown of how gRPC monitoring works, the different approaches available and how to get the most out of gRPC monitoring solutions for powering high-performance, distributed systems.
Microservices and gRPC: Better together
Before we look at how to monitor gRPC, let's first make clear that gRPC is not the only communication framework that you can employ for microservices. You could also use approaches and protocols like REST, GraphQL or Kafka, to name just a few alternatives.
However, gRPC has become popular in recent years as a microservices communication framework because it works particularly well with microservices that transfer a lot of data. It also supports real-time streaming with very low latency. Put together, these strengths mean that you can use gRPC to enable high-capacity, high-speed, bi-directional API calls.
The fact that gRPC supports multiple channels of communication as well as compression is another advantage, especially in large-scale environments where you have hundreds or thousands of microservices instances talking to one another at the same time. It works with multiple, simultaneous connections between the same microservices, too, and it supports microservices written in different programming languages. Last but not least, gRPC works over HTTP/2 and supports end-to-end encryption using TLS. Many other protocols don't offer the same level of built-in security.
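To make the bi-directional streaming idea concrete, here's a sketch of what a streaming RPC looks like in gRPC's interface definition language. The service and message names are illustrative, not from any real system:

```proto
syntax = "proto3";

package telemetry;

service MetricsExchange {
  // Bidirectional streaming: client and server each send a
  // sequence of messages, interleaved over one HTTP/2 connection.
  rpc StreamMetrics (stream MetricSample) returns (stream MetricAck);
}

message MetricSample {
  string name = 1;
  double value = 2;
  int64 timestamp_unix_ms = 3;
}

message MetricAck {
  string name = 1;
  bool accepted = 2;
}
```

The `stream` keyword on both the request and response is what enables the simultaneous, two-way data flow described above.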
To be sure, you won't find gRPC in every microservices app or cloud-native architecture out there. Legacy apps, in particular, are less likely to use gRPC (although some may be migrated to take advantage of the framework).
But for most modern cloud-native apps and distributed systems, gRPC is the foundation for communication. If you want to find and fix problems in those apps and systems, you need to be able to monitor gRPC.
The many facets of gRPC monitoring
So far, we've been using the phrase "monitor gRPC" as if gRPC monitoring were a singular process. But actually, it's more complicated than that. gRPC monitoring involves collecting multiple types of data from diverse sources.
Let's walk through them, one by one.
For starters, you need to be able to identify all of the different communications taking place within your microservices app. Just as the members of your family probably have a habit of talking over each other sometimes or carrying on multiple conversations at once, your microservices may use many different protocols and send data in tons of different directions at once.
The first step in gRPC monitoring, then, is simply getting a handle on where communications are happening and which protocols are behind them.
The metadata of the communications, including gRPC headers, is a key part of gRPC monitoring. It contains the most interesting information about the communications, such as:
• The names of the resources being transferred over gRPC.
• The sizes and frequencies of communications.
• Whether communications succeed and, if not, the reason gRPC reports for the failure (also known as the status code).
• Communication latency.
• User-agent data, which tells you which engines are powering communication "under the hood" on both the sender and the recipient sides of a communication.
This metadata provides critical context for understanding gRPC performance issues.
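As a minimal sketch of what you can do with this metadata once collected, here's a plain-Python aggregation over hypothetical call records. The field names and sample values are assumptions for illustration, not a real gRPC API:

```python
from collections import Counter
from statistics import median

# Hypothetical call records, as a collector might emit them.
calls = [
    {"method": "/inventory.Inventory/GetItem", "status": "OK",
     "latency_ms": 12.4, "bytes": 2048, "user_agent": "grpc-go/1.60.0"},
    {"method": "/inventory.Inventory/GetItem", "status": "DEADLINE_EXCEEDED",
     "latency_ms": 5000.0, "bytes": 0, "user_agent": "grpc-go/1.60.0"},
    {"method": "/orders.Orders/Create", "status": "OK",
     "latency_ms": 48.1, "bytes": 512, "user_agent": "grpc-python/1.62.0"},
]

def summarize(calls):
    """Aggregate status codes and latency per gRPC method."""
    summary = {}
    for call in calls:
        entry = summary.setdefault(call["method"],
                                   {"statuses": Counter(), "latencies": []})
        entry["statuses"][call["status"]] += 1
        entry["latencies"].append(call["latency_ms"])
    return {method: {"statuses": dict(e["statuses"]),
                     "median_latency_ms": median(e["latencies"])}
            for method, e in summary.items()}

report = summarize(calls)
```

Even this toy summary surfaces the kind of signal that matters: per-method failure reasons and latency, broken down by the resources being transferred.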
Once you know what your microservices communications channels look like, you'll want to figure out which data is actually moving across them. In other words, you need to be able to identify which resources are being passed between microservices. This way, you can compare the actual data transferred to the expected data, and find mismatches.
Beyond metadata stored in gRPC headers, there's other contextual information that you will typically want to track, such as:
• The CPU and memory utilization of nodes communicating via gRPC.
• Logs and log management data related to gRPC nodes and microservices.
• Which microservice is the bottleneck in cases where gRPC calls are experiencing high latency.
This data doesn't come from gRPC itself, but rather from your infrastructure and applications.
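One way to approach the bottleneck question, assuming you already have per-service latency samples from tracing or probes, is a simple tail-latency comparison. This sketch uses made-up service names and numbers:

```python
from statistics import quantiles

# Hypothetical latency samples per service along one gRPC call path.
latency_samples_ms = {
    "gateway":  [5, 6, 5, 7, 6],
    "orders":   [40, 42, 39, 300, 41],   # occasional slow outlier
    "payments": [15, 14, 16, 15, 14],
}

def p95(samples):
    """Approximate 95th percentile of a list of latency samples."""
    return quantiles(samples, n=20)[-1]

def likely_bottleneck(samples_by_service):
    """Return the service with the highest tail latency."""
    return max(samples_by_service, key=lambda s: p95(samples_by_service[s]))
```

Comparing tail latency rather than the average matters here, because an occasionally slow service can drag down an entire call chain while still looking healthy on average.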
A final key element of gRPC monitoring is understanding different communication channels. This is important because one of the major features of gRPC, which we mentioned above, is its ability to support simultaneous data streams between the same microservices.
Streams are great for enabling redundant communication routes. But when you're troubleshooting communications, it's often pretty important to know whether a problem is occurring across all communication channels, or just specific ones.
How gRPC monitoring works
Just as there are multiple facets of gRPC monitoring, there are multiple ways of going about gRPC monitoring. In other words, there are different methods you can use for collecting the types of data we just described.
The boring way: Code instrumentation
The conventional approach – and, to be blunt, the more boring and less efficient one – is to use code instrumentation frameworks. Under this method, you instrument observability inside your application at the code level via frameworks like OpenTelemetry.
This approach works, but it requires you to spend time setting up observability within your code. The instrumentation library itself also consumes memory and CPU in your environment. Beyond that, it limits you to whichever types of observability data you instrument for, and it may be infeasible for third-party or legacy code. If you want to collect a new type of data that you didn’t originally plan on, or simply forgot, you have to go back and modify your instrumentation.
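To give a feel for what manual instrumentation involves, here's a stdlib-only Python sketch of a timing wrapper you'd have to apply to every handler yourself. This mimics the pattern, not the OpenTelemetry API:

```python
import functools
import time

METRICS = []  # In a real setup this would be an exporter, not a list.

def instrumented(func):
    """Record the duration and outcome of each call: the kind of
    boilerplate you must add (and maintain) per function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            METRICS.append((func.__name__, "OK",
                            time.perf_counter() - start))
            return result
        except Exception:
            METRICS.append((func.__name__, "ERROR",
                            time.perf_counter() - start))
            raise
    return wrapper

@instrumented
def get_item(item_id):
    # Stand-in for a real gRPC handler.
    return {"id": item_id}
```

Every handler you forget to decorate is a blind spot, which is exactly the maintenance burden described above.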
Automatic instrumentation addresses some of these downsides by doing the instrumentation for you. However, it doesn’t always cover every use case in your environment, and the memory and CPU it consumes are no better. Instrumentation may add significant latency to your communications as well. In extreme cases, automatic instrumentation, which under the hood injects extra logic into your code, can introduce bugs and even crash your application.
The fun (and profitable) way: eBPF
A more modern – indeed, we'd even call it state-of-the-art – methodology for gRPC monitoring is to take advantage of eBPF, a framework that lets you run sandboxed programs inside the Linux kernel.
If you're wondering what sandboxed programs inside the kernel have to do with gRPC monitoring, here's the simple explanation: With eBPF, you can run programs at the kernel level on each node in your microservices cluster to collect the data you need to understand what's happening with gRPC. Because the programs run inside the kernel rather than relying on code inside your application, there's no need to instrument observability at the code level. Plus, because kernel-level programs are hyper-efficient, eBPF-based monitoring consumes fewer resources in most cases than code instrumentation.
To use eBPF for gRPC monitoring, you must first figure out where the gRPC connections exist in your environment, so that you know what to monitor. Tools like Caretta, which maps connections inside microservices environments, make this easy. From there, you can deploy eBPF on each node to probe gRPC libraries inside your environment. This is an easy, resource-efficient way to get all gRPC data and metadata.
But wait, there's more! Because eBPF programs running inside the kernel can see everything else happening on your nodes – not just the gRPC libraries – they also allow you to collect the additional contextual information you need to monitor and troubleshoot efficiently, such as other communications and database operations. Combined with CPU, memory utilization and logs from the relevant timeframe, this allows you to fully observe gRPC communications and contexts.
Did we mention that eBPF can do all of these things in a highly efficient, low-overhead manner? OK, we did, but we're mentioning it again because we want to make the point that eBPF-based gRPC monitoring is highly scalable. Even if you have thousands of communications to track, you can do it without taking a major performance hit, thanks to the efficiency of eBPF.
Making sense of gRPC monitoring data
However you choose to collect gRPC monitoring data, the final step in the monitoring process is to analyze it. You're likely to end up with a lot of information, and going through it manually is tough.
Instead, feed the data to an APM tool that can identify anomalies, as well as map different types of data to each other so that you can quickly understand, for example, which events coincide with high latency or failed data transmissions.
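For instance, one of the simplest things such a tool does under the hood is flag statistical outliers. A hand-rolled sketch of that idea, with a made-up threshold and sample data:

```python
from statistics import mean, stdev

def flag_anomalies(latencies_ms, threshold=2.0):
    """Flag samples whose z-score exceeds the threshold,
    i.e. values far above the mean relative to the spread."""
    mu = mean(latencies_ms)
    sigma = stdev(latencies_ms)
    if sigma == 0:
        return []
    return [x for x in latencies_ms if (x - mu) / sigma > threshold]
```

A real APM tool does far more (baselining over time, correlating anomalies across signals), but the point is the same: let software surface the handful of suspicious data points instead of scanning thousands of records yourself.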
Best practices for gRPC monitoring
To get the very most out of gRPC monitoring, take advantage of best practices like the following:
• Choose the right monitoring tools and frameworks: As we explained, there are multiple approaches to collecting gRPC monitoring data, and some (like eBPF) are more efficient than others. That said, there may be situations where a less efficient approach makes sense. For example, if you've already instrumented observability into your code, you might stick with that methodology until some data turns out to be missing or your codebase changes enough to justify a switch.
• Optimize for performance and scalability: Collecting monitoring data can consume a lot of resources. You shouldn't let monitoring processes suck up so much memory and CPU that your applications don't have the resources they need to perform well. Look for monitoring methods that minimize the performance impact on your apps, especially when you're monitoring at scale.
• Make the most of your data: Again, raw gRPC monitoring data isn't very useful if you can't analyze it efficiently. Rather than trying to parse it manually, take advantage of APM tools that can pull out relevant insights automatically.
• Keep things secure: The data collected by gRPC monitoring tools can be sensitive in some cases; after all, it's the same data that passes through your app. Make sure to secure it properly so that your monitoring operations don't become the vector for an attack or data breach.
When you choose wisely in these areas, you get a gRPC monitoring strategy that is highly efficient, highly scalable and highly secure.
Take gRPC monitoring to the next level
You can't operate a modern microservices app very effectively unless you can monitor gRPC – which is, more likely than not, the framework that powers communications between your microservices. And you can't monitor gRPC very effectively unless you can collect and analyze monitoring data in an efficient, scalable way.
In case it's not clear, we think that using eBPF to collect monitoring data, then feeding it into an APM tool that can make sense of that data, is the best approach for supercharging gRPC monitoring.