Is your APM slowing you down?
Application Performance Monitoring (APM) tools are something every developer should have accessible in their toolbox. They promise (and deliver) end-to-end visibility into how your applications interact. Even if you’re not a big tracing fan, just having the golden signals, like error rates and latency, is a huge step forward in gaining actionable insights into your production systems. (Here's a good resource on the idea of golden signals for distributed monitoring - https://sre.google/sre-book/monitoring-distributed-systems/.)
APM offers auto-instrumentation (in most languages), which means it establishes one standard for metrics like API golden signals across many teams, each with their own conventions and practices.
A world without APM would mean teams would have to rely heavily on application-specific metrics (or custom metrics, as they’re more commonly called) to get these insights. While many teams do that today, partly because APM is out of reach for some due to cost and integration complexity, it requires discipline to be effective. Maintaining these custom metrics means teams need to make sure they cover every new service or API, and most importantly - that they’re consistent, so that DevOps and SRE teams can utilize them in full.
So that all sounds awesome. Let’s instrument APM everywhere!
Mmm… however, most APMs require code instrumentation and inject third-party code, which means the code you run in production is no longer the clean code you wrote. When you’re starting small this shouldn’t concern you. The value introduced by APM far outweighs any downside of these new, unknown code snippets instrumenting your original application.
But, at scale, this is known to potentially introduce higher infrastructure costs, slower response times and additional difficulty in diagnosing performance issues in your system.
And the biggest concern? APM overhead and footprint are very hard to measure. APM relies on code instrumentation of your application, which makes it really difficult to distinguish a momentary peak in overhead caused by your original code from one caused by the APM agent instrumented into it.
Better understanding of APM overhead
Performance measurement comes at a cost, and that cost varies depending on the methods APM tools use to instrument and monitor your applications. For example, measuring the execution time of a method, and then processing and storing the data, cannot be done without creating some overhead. This additional overhead manifests as longer response times, increased CPU utilization, increased memory usage, or additional network load.
There's also another inevitable risk, even if small: that the overhead of measuring performance will not only degrade the application's performance, but also affect the application's logical runtime behavior. So what kinds of overhead should you worry about?
Response time overhead
Increasing an application’s response times is one of the most measurable and visible types of overhead you can encounter. It has the most direct impact on your users and can have the most critical effect on your business.
An instrumentation-based approach to monitoring essentially means that if we’re trying to measure the execution time of a method and post-process some data once it finishes, we need to “wrap” that method with external code segments that run the monitoring tool’s logic.
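To make the “wrapping” concrete, here’s a minimal sketch in Python of what an instrumentation agent effectively does. The `traced` decorator and `report` sink are hypothetical illustrations, not any real APM’s API:

```python
import functools
import time

RECORDED = []  # stand-in for the agent's reporting queue

def report(name, duration_ms):
    # A real agent would process, aggregate and ship this measurement
    # somewhere; here we just remember it.
    RECORDED.append((name, duration_ms))

def traced(fn):
    """Wrap a method the way an instrumentation agent would:
    time the call, then hand the measurement to the monitoring logic."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            report(fn.__name__, (time.perf_counter() - start) * 1000)
    return wrapper

@traced
def handle_request():
    time.sleep(0.01)  # pretend business logic (~10ms)
```

Note that every call to `handle_request` now pays for two clock reads plus whatever `report` does - that extra work *is* the response time overhead.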
Generally speaking, the more you measure, the larger the overall impact of the generated overhead. This can become painfully visible if an instrumentation-based monitoring tool instruments specific methods that are called very frequently and have very short execution times. The added overhead in these cases will be the biggest (proportionally) and might be very noticeable.
To get specific, let’s imagine a few numbers. Say the additional code from the monitoring tool executes in only one millisecond, and the execution time of the target method is 100ms. In this case the response time overhead is 1%.
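The same arithmetic also shows why very short, very hot methods are the worst case. A quick sketch (the `overhead_pct` helper is just for illustration):

```python
def overhead_pct(agent_ms: float, method_ms: float) -> float:
    """Relative response-time overhead of instrumentation."""
    return agent_ms / method_ms * 100

# 1 ms of agent code around a 100 ms method: barely noticeable.
print(overhead_pct(1, 100))  # 1.0 (%)

# The same 1 ms around a 2 ms hot-path method: dramatic.
print(overhead_pct(1, 2))    # 50.0 (%)
```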
Response time overhead is often described in percentages, for example “5% average response time overhead in production”. However, in many cases relative percentages can be hard to grasp, so you’ll find some developers referring to nominal response times instead: “my application is 50ms slower when I instrument my monitoring tool”.
So what is the response time overhead of a typical APM? This is still somewhat of a mystery. You’ll hear 3%-5% in many benchmarks conducted by the APM vendors themselves, usually framed as marginal and attributed to specific rare use cases. But, for example, in this blog post by Scout APM, New Relic was measured to add over 44% (!) in a Ruby benchmark scenario.
Clearly, not every APM and not every use case will cause dramatic response time overhead. However, it’s up to us as developers, DevOps or SRE teams to be mindful of that overhead, since once an APM tool is instrumented into the code, its overhead can become invisible, or at least really hard to measure.
Memory and CPU overheads
So we already know an instrumentation-based monitoring tool adds a response time overhead, but it also affects CPU and memory usage.
The more measurement logic is executed, the more CPU time is used for monitoring purposes rather than the original business logic. A monitoring tool also utilizes memory to maintain complex state machines that help it propagate context across the application, relate requests to their responses, collect and store relevant contextual information, and more. Additionally, if the monitoring logic does not manage memory efficiently, it can trigger additional garbage collection, which is yet more overhead.
But why should you care? Unlike response time overheads, memory and CPU overheads don’t directly impact your users’ experience.
The first reason lies at the heart of the strategy APM tools take to reduce CPU overhead - sampling only a percentage of transactions. This has the obvious drawback that you’ll lose full application visibility, potentially missing out on rare events that you’d care about.
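Head-based sampling can be sketched in a few lines. The `SAMPLE_RATE` setting and `should_trace` function are hypothetical, but the shape is the same in most agents:

```python
import random

SAMPLE_RATE = 0.1  # trace only 10% of transactions (hypothetical setting)

def should_trace() -> bool:
    """Head-based sampling: decide up front whether this transaction
    will be measured at all. Cheap on CPU, but a rare event landing in
    the unsampled 90% is simply never seen."""
    return random.random() < SAMPLE_RATE
```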
The second reason is that unexpected resource overheads can cause unpredictable production impacts. For example, in Kubernetes, your application might be running under strict memory limits as a safeguard against over-utilization. If your monitoring tool’s footprint suddenly spikes under load, this can cause sudden OOM crashes or CPU throttling.
The third, and perhaps most important reason, is that if your APM tool is consuming an unknown amount of resources, it becomes extremely difficult to manage and reduce your application’s footprint. Many teams invest effort in making their applications more efficient and more resilient to sudden spikes and jittery behavior. With an APM instrumented into your code, things get harder: if memory peaked, is your application to blame, or the APM?
Network overhead
Most application monitoring tools rely on a centralized architecture in which data about the monitored applications is shipped over the network to a centralized server. In high-throughput environments, with high volumes of measured data points about an application’s performance, the volume of data transferred over the network grows accordingly.
For metrics and time series, the base measurement interval also plays an important role in the amount of data being transferred.
Let’s try playing with some numbers to make this point clearer. Assume we are monitoring an application where each user triggers 100 actions per second that we’re trying to measure. Now let’s assume there are 1,000 users using the application at any given moment, resulting in 100,000 measured actions per second. Even if we only sent the execution duration (a 64-bit float) with no additional metadata, we’d still end up sending almost 1 MB/sec to a remote server.
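The back-of-the-envelope arithmetic, assuming each duration is an 8-byte double with no metadata attached:

```python
users = 1_000
actions_per_user_per_sec = 100
bytes_per_sample = 8  # one 64-bit float duration, no metadata

actions_per_sec = users * actions_per_user_per_sec  # 100,000
bytes_per_sec = actions_per_sec * bytes_per_sample  # 800,000

print(f"{bytes_per_sec / 1_000_000:.1f} MB/sec")  # 0.8 MB/sec
```

In practice each sample carries trace IDs, span names, tags and transport framing, so real agents send far more than 8 bytes per measurement.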
Though this might sound minor, at high scale, in real production environments hosting hundreds of microservices, the amount of sent data can be dramatically high. To reduce network traffic, some tools aggregate data within the application before sending it to a centralized server. Unfortunately, this approach has the drawback of consuming additional memory, which might create new limitations.
APM overheads can have a critical impact on the health and performance of the monitored application. Since most tools rely on code instrumentation, there’s no easy way to measure these impacts other than some good old A/B testing. So take a crack at your APM. Try to measure your application’s overhead under the same load test twice - once with no APM instrumentation, and once with your full-blown APM enabled. You might be surprised by what you find.
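A minimal sketch of that A/B experiment, scaled down to a single function instead of a full load test (the workload and the hand-rolled “instrumented” variant are stand-ins, not a real agent):

```python
import statistics
import time

def business_logic():
    sum(i * i for i in range(1_000))  # stand-in workload

def instrumented():
    # The same logic, wrapped the way an agent would wrap it.
    start = time.perf_counter()
    business_logic()
    _ = time.perf_counter() - start  # an agent would process/ship this

def median_runtime(fn, runs=1_000):
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

baseline = median_runtime(business_logic)  # A: no instrumentation
with_apm = median_runtime(instrumented)    # B: instrumented

print(f"overhead: {(with_apm / baseline - 1) * 100:.1f}%")
```

The real version of this experiment is the same idea at system scale: identical load, identical environment, APM toggled off and on, and a comparison of the latency distributions.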