Kafka Monitoring with eBPF: It’s a Whole New Perspective
Find out why comprehensive monitoring of your Kafka performance is critical and why standard approaches like server-side monitoring can undercut visibility into Kafka performance – even if you don't realize it until it's too late. Get the tips you need to improve your Kafka monitoring strategy by leveraging tools like eBPF.
Monitoring Kafka Metrics: Everything You Need to Know
At first glance, monitoring Kafka might seem simple enough. After all, Kafka produces event streams, and streaming data is one of the simplest types of data sources out there. If you simply monitor basic Kafka metrics based on the data stream, you can quickly and reliably identify issues with applications that rely on the data stream, right?
Well, not necessarily. The problem that many teams run into with Kafka monitoring is that the status of a data stream from the perspective of the Kafka Producer isn't necessarily the same as its status from the perspective of the Consumers. In other words, to actually monitor the data stream from within the Kafka cluster, you can't overlook important Kafka metrics like Producer-Consumer latency.
This article explains why a comprehensive monitoring of your Kafka performance is critical and why standard approaches like server-side monitoring, along with other Kafka monitoring misconceptions, can undercut visibility into Kafka performance – even if you don't realize it until it's too late. It also discusses tips for improving your Kafka monitoring strategy by leveraging tools like eBPF.
What is Kafka?
Let's start with the basics by briefly discussing what Kafka is and why it's used.
Kafka is an open source event streaming platform. Its purpose is to carry data feeds between applications so that they can share data efficiently and in as close to real time as possible.
Born inside LinkedIn circa 2010, Kafka became an Apache Software Foundation open source project in 2011. Since then, the platform has become massively popular as a solution for building real-time data pipelines. Kafka's adoption probably reflects, in part, the fact that it's open source, and is therefore free to use and easy to adapt. But the platform's distributed architecture, which makes it possible to use multiple "brokers" to share data with multiple applications, is also a selling point, because it helps ensure that data streams remain available even if part of the Kafka cluster fails.
Kafka architecture overview
Kafka's distributed architecture involves several discrete components:
• Producers: Producers are the source of data that is streamed using Kafka. Typically, a Kafka Producer is an application that generates data on a continuous basis.
• Consumers: Consumers are clients that receive the data streamed through Kafka. A Consumer might be an application that needs to react to events that take place within the producer application, for example.
• Topics: Topics are a mechanism that separates Kafka data streams into individual categories or feeds. If your producers generate multiple types of data, you might choose to create a Topic for each one.
• Partitions: Partitions make it possible to distribute Kafka Topics data across multiple servers within the Kafka cluster. That way, the work of processing and streaming Topics can be shared by several nodes.
Again, this distributed architecture is part of what makes Kafka so appealing. Kafka is not the only event-streaming platform out there, but it stands out among the major event-streaming solutions for the way it uses a distributed architecture to make data processing more efficient and reliable.
Understanding Kafka Producer-Consumer latency
To do its job well, Kafka needs to move data between Producers and Consumers in as close to real time as possible. If it doesn't, the data that Consumers receive may no longer be relevant by the time they receive and react to it.
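That's why it's critical to monitor what's known as Producer-Consumer latency (otherwise known as lag). Producer-Consumer latency is the time between when a Producer pushes a message to a Kafka Topic and when the Consumer receives it. In practice, it is often approximated for each Partition by counting the messages that have been produced but not yet consumed, which is known as offset lag: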
(last produced offset) - (last consumed offset) = (offset lag)
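As a minimal sketch, the formula above can be applied per Partition and then summed. The topic name and offset values below are made up for illustration, and treating a never-read Partition as consumed up to offset 0 is an assumption of this sketch, not a Kafka rule.

```python
from typing import Dict, Tuple

# A partition is identified here by (topic, partition number).
TopicPartition = Tuple[str, int]


def offset_lag(
    last_produced: Dict[TopicPartition, int],
    last_consumed: Dict[TopicPartition, int],
) -> Dict[TopicPartition, int]:
    """Return per-partition offset lag: last produced offset minus last consumed offset."""
    lag = {}
    for tp, produced in last_produced.items():
        # Assumption: a partition the consumer has never read counts as offset 0.
        consumed = last_consumed.get(tp, 0)
        # Clamp at zero: the two offsets are sampled at slightly different
        # moments, so a small negative difference is a sampling artifact.
        lag[tp] = max(produced - consumed, 0)
    return lag


# Made-up sample values for illustration only.
produced = {("payments", 0): 1_500, ("payments", 1): 1_480}
consumed = {("payments", 0): 1_500, ("payments", 1): 1_230}

per_partition = offset_lag(produced, consumed)
total_lag = sum(per_partition.values())
print(per_partition)
print(total_lag)
```

Note that the result is a message count, not a duration. Partition 1 is 250 messages behind, but how long that takes to clear depends on how fast the Consumer processes messages.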
Now, you may be thinking: "Nothing happens in true real time, and some amount of Producer-Consumer latency is always inevitable." That's true. Even the most operationally-efficient Kafka clusters will always be subject to some degree of lag.
However, when you're dealing with data streams that are theoretically supposed to move information in real time, your threshold of tolerance for lag is likely to be very, very small. For example, if you're leveraging a Kafka data stream for use cases like fraud detection during payment processing or tracking the movement of IoT-connected vehicles, delays of even just a few seconds could make Kafka basically useless. An identity-thief might already have walked out of the store, or your vehicle might already have crashed, by the time you detect the issue.
Even for Kafka use cases that might not seem to require almost real-time performance, the ability to detect lag is crucial. Take performance monitoring, for example. If you have APM tools that are continuously processing Kafka streams of data to detect application availability or performance issues, high Consumer lag would mean that you can't detect problems before they impact your end-users.
Keep in mind, too, that lag has a tendency to explode. In other words, once Producer-Consumer lag starts to appear within your data stream, it's likely to grow quickly because catching up becomes harder and harder the longer the latency issue persists. So, a 50-millisecond lag could snowball within a few minutes into multi-second latency, causing applications and APIs to begin timing out.
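To see why lag snowballs, consider a deliberately simplified model in which the Producer writes at a constant rate and the Consumer reads at a slightly lower constant rate. The rates below (1,000 and 950 messages per second) are invented for illustration; real workloads are burstier.

```python
def estimated_latency_seconds(
    initial_latency_s: float,
    produce_rate: float,
    consume_rate: float,
    elapsed_s: float,
) -> float:
    """Estimate Producer-Consumer latency after elapsed_s seconds.

    Simplified model: both rates (messages per second) are constant, and
    latency is approximated as backlog divided by the consume rate.
    """
    if consume_rate <= 0:
        raise ValueError("consume_rate must be positive")
    # Convert the starting latency into an equivalent backlog of messages.
    initial_backlog = initial_latency_s * consume_rate
    # The backlog grows (or shrinks) by the rate difference every second.
    backlog = initial_backlog + (produce_rate - consume_rate) * elapsed_s
    return max(backlog, 0.0) / consume_rate


for elapsed in (0, 60, 180):
    latency = estimated_latency_seconds(0.05, 1000.0, 950.0, elapsed)
    print(elapsed, round(latency, 2))
```

Under these assumptions, a Consumer that is only 5% slower than the Producer turns a 50-millisecond lag into roughly 3.2 seconds after one minute and roughly 9.5 seconds after three.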
All of the above is to say that if you lack visibility into Producer-Consumer Kafka lag, you lack visibility into the operational effectiveness of your Kafka data streams as a whole – and some pretty bad things could happen as a result.
Key Kafka metrics for monitoring lag: The theory
Identifying lag issues within Kafka data streams is easy enough – at least in theory.
One simple solution is to measure, for each Partition of each Topic, the difference between the Consumer's offset and the head of the Partition (its latest offset). This data, which you can push to Prometheus using a tool like Kafka Exporter and translate to actual seconds (or milliseconds) using Kafka Lag Exporter, provides an idea of how many messages have been ingested by brokers within the Kafka cluster but have not yet been received and processed by Consumers. It provides a baseline for identifying lag issues.
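One way to translate an offset-based measurement into time is to keep a short history of (timestamp, latest offset) samples for a Partition and interpolate when the Consumer's current offset was produced. The sketch below illustrates that idea only; it is not taken from Kafka Lag Exporter's source code, and the function name and sample values are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple


def estimate_time_lag_s(
    samples: List[Tuple[float, int]],
    committed_offset: int,
    now_s: float,
) -> float:
    """Estimate how many seconds behind a consumer is on one partition.

    samples: (timestamp_s, latest_offset) pairs, sorted by timestamp, with
    non-decreasing offsets. committed_offset: the consumer's current offset.
    """
    if not samples:
        raise ValueError("need at least one sample")
    offsets = [offset for _, offset in samples]
    # The consumer has caught up with the newest sample: no measurable lag.
    if committed_offset >= offsets[-1]:
        return 0.0
    # The consumer is behind the oldest sample: only a lower bound is known.
    if committed_offset <= offsets[0]:
        return now_s - samples[0][0]
    # Find the two samples that bracket the consumer's offset.
    i = bisect_left(offsets, committed_offset)
    (t0, o0), (t1, o1) = samples[i - 1], samples[i]
    # Linear interpolation: when was committed_offset the partition head?
    produced_at = t0 + (t1 - t0) * (committed_offset - o0) / (o1 - o0)
    return now_s - produced_at


# Made-up history: the partition head advanced 1,000 offsets every 10 seconds.
history = [(100.0, 1000), (110.0, 2000), (120.0, 3000)]
print(estimate_time_lag_s(history, committed_offset=2500, now_s=120.0))
```

In this made-up history, offset 2500 was the head of the Partition at roughly t = 115, so a Consumer sitting at that offset at t = 120 is about 5 seconds behind.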
A slightly more sophisticated approach is to use tools like Zipkin, which can measure the actual time latency between data ingestion and consumption. This provides a more accurate and direct metric for gauging how lag impacts end-user experience.
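The underlying idea of time-based measurement can be sketched without any tracing library: record when each message was produced and when it was consumed, then summarize the differences. This is an illustration of the principle, not Zipkin's API. The timestamps are made up, and the approach assumes Producer and Consumer clocks are reasonably synchronized.

```python
import math
from typing import List, Tuple


def latencies_ms(observations: List[Tuple[int, int]]) -> List[int]:
    """Turn (produced_at_ms, consumed_at_ms) pairs into per-message latencies."""
    return [consumed - produced for produced, consumed in observations]


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile, with pct in the range (0, 100]."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Made-up observations: three fast messages and one slow one.
observed = [(1000, 1012), (1005, 1020), (1010, 1500), (1020, 1031)]
lat = latencies_ms(observed)
print(percentile(lat, 50), percentile(lat, 99))
```

Looking at a high percentile rather than the median matters here: the median of these samples is 12 milliseconds, while the slowest message took 490 milliseconds to reach the Consumer.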
The problem with standard Kafka lag monitoring
Unfortunately, in practice, monitoring Consumer metrics and fetch requests as the basis for tracking the effectiveness of your Kafka deployment doesn't always allow you to get ahead of latency issues. There are several shortcomings to this approach:
• Lack of granularity: Monitoring metrics that are aggregated across all the Consumers of a specific Topic leaves you with a monolithic view of Consumer behavior. In effect, you view all the Consumers of that Topic as one.
• Lack of root cause focus: Relatedly, it's difficult to identify the root cause of Kafka performance issues under this approach because you can't pinpoint which specific Consumers are experiencing problems.
• Lack of actionable data: Most monitoring tools identify Consumers using Kafka client-ID notations. The client-ID is internal to Kafka, and it's not always obvious how client-IDs map onto actual services. As a result, if you're dealing with many applications or services that rely on your Kafka data streams, you'll struggle to figure out which service is lagging based on client-ID data alone.
For all of these reasons, traditional server-side Kafka monitoring strategies make it difficult to generate the insights necessary to trace performance issues back to their root cause and take action to remediate them.
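The granularity problem is easy to illustrate with made-up numbers. In the sketch below, the client-IDs and lag values are hypothetical; it compares a Topic-level average with a per-Consumer breakdown.

```python
from typing import Dict, List, Tuple

# Hypothetical offset lag per consumer of one topic, keyed by Kafka client-ID.
lag_by_client: Dict[str, int] = {
    "consumer-1": 12,
    "consumer-2": 9,
    "consumer-3": 4_800,
    "consumer-4": 15,
}


def topic_average(lag: Dict[str, int]) -> float:
    """The monolithic view: one number for every consumer of the topic."""
    return sum(lag.values()) / len(lag)


def lagging_clients(lag: Dict[str, int], threshold: int) -> List[Tuple[str, int]]:
    """The granular view: clients above the threshold, worst first."""
    offenders = [(client, value) for client, value in lag.items() if value > threshold]
    return sorted(offenders, key=lambda item: item[1], reverse=True)


print(topic_average(lag_by_client))
print(lagging_clients(lag_by_client, threshold=1_000))
```

The average suggests that every Consumer is moderately behind, when in fact three are healthy and one is badly lagging. Even the granular view only yields a client-ID, which still has to be mapped to a real service by some other means.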
Here comes eBPF: A better approach to Kafka performance monitoring
We told you in the introduction that we have a better take on Kafka performance monitoring, and we do. It hinges on eBPF, the game-changing technology that makes it possible to collect performance data from deep inside any environment that is hosted on Linux servers – as Kafka clusters are.
With eBPF, you can monitor Kafka from the client side. You can see what's happening from the perspective of both Producer and Consumer applications, and you can measure Producer-Consumer latency from each individual client.
At the same time, eBPF means that you can integrate Kafka monitoring more elegantly into your broader monitoring strategy. The reason why is that eBPF serves as the foundation for monitoring anything that runs within a Linux-based stack. So, using the same tooling, you can monitor not just Kafka data streams, but also (for example) the Kubernetes clusters that host the applications that depend on those data streams. In turn, you can more easily understand which applications or services are impacted by Kafka latency issues. You get more context than just Kafka client-IDs.
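To make the idea of added context concrete, here is a minimal sketch of how client-side latency observations might be joined with Kubernetes metadata. The event shape, IP addresses, and workload names are all invented for illustration; they are not the output format of any particular eBPF tool.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class FetchEvent(NamedTuple):
    """Hypothetical client-side observation of one Kafka fetch."""
    pod_ip: str
    topic: str
    latency_ms: float


def worst_latency_by_workload(
    events: List[FetchEvent],
    workload_by_ip: Dict[str, str],
) -> Dict[str, float]:
    """Group observations by the workload that owns each pod IP."""
    worst: Dict[str, float] = defaultdict(float)
    for event in events:
        # Pod IPs missing from the metadata are grouped under "unknown".
        workload = workload_by_ip.get(event.pod_ip, "unknown")
        worst[workload] = max(worst[workload], event.latency_ms)
    return dict(worst)


# Made-up pod metadata, as it might be read from the Kubernetes API.
workloads = {"10.0.0.5": "shop/checkout", "10.0.0.9": "shop/fraud-detector"}
events = [
    FetchEvent("10.0.0.5", "orders", 20.0),
    FetchEvent("10.0.0.5", "orders", 35.0),
    FetchEvent("10.0.0.9", "orders", 900.0),
]
print(worst_latency_by_workload(events, workloads))
```

The result is keyed by namespace and workload name rather than by client-ID, so a slow Consumer shows up as a named service that someone owns.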
A healthier way to get Kafka metrics
In summary, traditional approaches to collecting Kafka latency metrics don't work well. Because Kafka isn't aware of your application and microservices stack, the data these approaches produce is sometimes difficult to interpret for troubleshooting purposes, and the resulting lag readings may not point to a clear root cause or a clear path to action. When it comes to performance monitoring, few things are worse than data that is confusing or doesn't take you far enough to solve the issue. That is a DevOps team's nightmare.
Fortunately, eBPF has opened up a radically new approach to Kafka metrics collection and monitoring. With eBPF, you can see what's truly happening latency-wise from the perspective of your Consumers, not just the Kafka server. When you put that together with the immediate time-to-value eBPF offers, you get accurate, actionable visibility, which you can use to ensure that data moves very quickly – if not in true real time, then as close to it as reasonably possible.