One of the hardest challenges for any modern observability tool is keeping up with the ever-growing scale of data: delivering a comprehensive, accurate picture of a system while leaving a minimal footprint. Without a performance-first mindset, the once-hidden overhead of observing a system can suddenly become very visible, and painful, at high scale.
Many leading observability solutions today significantly increase the resource consumption of the applications they monitor, eventually limiting performance or causing cost surges. As a result, engineers are increasingly aware of the hidden overhead inflicted by their observability stack.
Enter eBPF - the up-and-coming observability sensor
eBPF has been around for a while, but in the last couple of years it has secured its place as one of the most promising, revolutionary technologies in IT.
When it comes to observing cloud-native environments, eBPF is particularly exciting because it lets you gather clean data to trace virtually any type of event. The data comes straight from the Linux kernel, with minimal performance overhead. That means you can deploy an eBPF program that watches each packet as it enters or exits the host, and then map each packet to the process or container running on that host. The result is highly granular visibility into what’s happening with your network traffic.
Along with that, eBPF programs are very efficient and secure, unlocking a plethora of possibilities for observability in a cloud-native world.
While eBPF is destined to become the next standard for observability platforms enabling low-friction integration and deep visibility coverage, existing eBPF-based approaches still have a long way to go when it comes to operating at high scale.
Introducing Flora: the future of eBPF observability is here
We set out to create an eBPF-based observability agent that is unprecedented in both efficiency and cost-effectiveness.
Flora was built with a strict performance mindset, promising full observability at high scale at close to zero overhead. It is based on two key ideas that allow it to operate at massive scale. The first is the use of new, unorthodox eBPF concepts that unlock extremely low-overhead mechanisms for transferring data from the kernel to user space. The second is a nearly zero-copy, memory-efficient pipeline that converts observations flowing from the kernel into meaningful outputs.
These two concepts allow Flora to relieve two of the most painful inherent problems of application observability in high-volume cloud-native environments. First, using eBPF allows it to operate out-of-band from the applications it monitors, which minimizes the impact on their resource utilization and ensures they can continue to run smoothly within the limited containerized environment they were given. Second, the total resources Flora consumes remain low even at massive scale, making it extremely cost-effective overall and reducing the ongoing cost burden that observability teams face today.
Flora is the driving engine behind our cloud observability solution. It is now generally available and can be experienced as part of our free tier offering.
Putting Flora to the test
All theory aside, we rolled up our sleeves and started looking into what kind of value Flora can truly provide our users. In our benchmark, Flora dramatically outperformed the leading observability platforms we tested: Datadog, OpenTelemetry and New Relic’s eBPF-based solution, Pixie.
The benchmark simulated a high-volume environment and tracked the CPU and memory consumption of a simple HTTP server application before and after integrating each of the observability platforms in question.
Setting up the test bench
Our test application was a basic HTTP server written in Go (v1.19). For each request it receives, it performs a configurable amount of CPU-intensive work, serves a configurable number of random JSON objects, and returns its response either as plain text or Gzip-compressed. All of these settings are controlled through URL parameters on the request.
The test application was then run in several scenarios: as is (for the baseline), instrumented according to the relevant Datadog and OpenTelemetry documentation, and running on a Kubernetes node alongside either New Relic’s Pixie agent or the Flora agent. Prometheus-based CPU and memory utilization metrics were generated for all test cases, then scraped and stored in a VictoriaMetrics database instance.
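To make the setup concrete, here is a minimal sketch of what such a test server could look like. It is not the actual benchmark application: the parameter names (`objects`, `work`, `gzip`), the JSON object shape, and the use of chained SHA-256 hashing as the CPU-intensive task are all assumptions made for illustration. The handler is exercised in-process so the example terminates; a real server would call `http.ListenAndServe` instead.

```go
package main

import (
	"compress/gzip"
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"log"
	"math/rand"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// item is a hypothetical JSON object shape; the post does not specify one.
type item struct {
	ID    int     `json:"id"`
	Value float64 `json:"value"`
}

// buildPayload returns n pseudo-random items generated from the given seed.
func buildPayload(n int, seed int64) []item {
	rng := rand.New(rand.NewSource(seed))
	items := make([]item, n)
	for i := range items {
		items[i] = item{ID: rng.Intn(1_000_000), Value: rng.Float64()}
	}
	return items
}

// burnCPU runs `rounds` chained SHA-256 hashes as a stand-in for the
// CPU-intensive task (an assumption; the real task is not described).
func burnCPU(rounds int) [32]byte {
	var sum [32]byte
	for i := 0; i < rounds; i++ {
		sum = sha256.Sum256(sum[:])
	}
	return sum
}

// intParam reads a non-negative integer URL parameter, falling back to def.
func intParam(r *http.Request, name string, def int) int {
	v, err := strconv.Atoi(r.URL.Query().Get(name))
	if err != nil || v < 0 {
		return def
	}
	return v
}

// handler does the configured CPU work, then writes the JSON objects,
// Gzip-compressed when the (hypothetical) gzip=true parameter is set.
func handler(w http.ResponseWriter, r *http.Request) {
	burnCPU(intParam(r, "work", 0))
	items := buildPayload(intParam(r, "objects", 1), rand.Int63())

	w.Header().Set("Content-Type", "application/json")
	if r.URL.Query().Get("gzip") == "true" {
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		defer gz.Close() // flushes the compressed stream when the handler returns
		if err := json.NewEncoder(gz).Encode(items); err != nil {
			log.Printf("encode: %v", err)
		}
		return
	}
	if err := json.NewEncoder(w).Encode(items); err != nil {
		log.Printf("encode: %v", err)
	}
}

func main() {
	// Exercise the handler in-process so the sketch terminates. To serve real
	// traffic, use: log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(handler)))
	req := httptest.NewRequest(http.MethodGet, "/?objects=5&work=1000&gzip=true", nil)
	rec := httptest.NewRecorder()
	handler(rec, req)
	fmt.Println(rec.Code, rec.Header().Get("Content-Encoding"), rec.Body.Len(), "bytes")
}
```

Driving every knob from the URL means a load generator can vary payload size, CPU cost and compression per request without redeploying the server.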
The infrastructure was a Kubernetes cluster with node taints that allowed us to isolate each deployment test case from the others. Every tested application flavor ran alongside the bare minimum components required for monitoring according to the relevant test case.
To generate the test load we used the k6 operator, with k6 test objects that executed from each of the separate node groups. We used a custom-built k6 image that also exposes Prometheus metrics, so we could collect metrics from the client side as well, as a sanity check.
We analyzed the results in Grafana, through a Prometheus data source integration that queried the deployed VictoriaMetrics instance.
Flora lives up to its promise
First, we generated a constant load of 3000 req/s hitting our HTTP server application across the different setups. Flora demonstrated minimal to zero overhead to the application’s CPU (+9%) and memory (+0%), while Datadog, OpenTelemetry and the Pixie agent inflicted dramatic overhead of 249%, 59% and 32% above the CPU baseline, respectively, and 227%, 27% and 9% above the memory baseline.
Every solution except Flora raised the application’s resource consumption dramatically and unpredictably. In a resource-limited environment, this could lead to CPU throttling that degrades the application’s performance, or even to an out-of-memory (OOM) crash.
Additionally, with the monitored application’s CPU limited to a maximum of 1000 mCPU (one core) and under a constant load, the overhead added by Datadog, OpenTelemetry and the Pixie agent significantly limited the throughput of the HTTP server, reducing the volume of handled requests by 71%, 19% and 12%, respectively, compared to the measured baseline.
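Percentages like these are easier to compare once converted into multiples of the baseline. The short sketch below does that conversion using only the figures quoted above; the helper names are ours and purely illustrative. An overhead of N% above baseline means the application uses (1 + N/100) times its baseline resources, and a reduction of N% means it retains (1 - N/100) of its baseline throughput.

```go
package main

import "fmt"

// multiplierFromOverhead converts "N% above baseline" into a multiple of the baseline.
func multiplierFromOverhead(pctAbove float64) float64 {
	return 1 + pctAbove/100
}

// remainingFromReduction converts "reduced by N%" into the fraction of baseline retained.
func remainingFromReduction(pctReduced float64) float64 {
	return 1 - pctReduced/100
}

func main() {
	// CPU and memory overhead percentages as quoted in the benchmark text.
	overhead := []struct {
		name     string
		cpu, mem float64
	}{
		{"Flora", 9, 0},
		{"Datadog", 249, 227},
		{"OpenTelemetry", 59, 27},
		{"Pixie", 32, 9},
	}
	for _, o := range overhead {
		fmt.Printf("%-14s CPU %.2fx baseline, memory %.2fx baseline\n",
			o.name, multiplierFromOverhead(o.cpu), multiplierFromOverhead(o.mem))
	}

	// Request-volume reductions under the 1000 mCPU limit, as quoted in the text.
	reduction := []struct {
		name string
		pct  float64
	}{
		{"Datadog", 71},
		{"OpenTelemetry", 19},
		{"Pixie", 12},
	}
	for _, r := range reduction {
		fmt.Printf("%-14s handles %.0f%% of baseline requests\n",
			r.name, 100*remainingFromReduction(r.pct))
	}
}
```

Read this way, a 249% CPU overhead means the instrumented application uses roughly three and a half times the CPU of the uninstrumented one, and a 71% drop in handled requests leaves under a third of the baseline throughput.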
Outperforming Datadog By Over 3X
Flora also proved to be highly efficient in the total resources it consumed, making it the most cost-effective of the tested solutions at high scale. Combining the resources consumed by each agent with the overhead measured on the monitored application, Flora’s total CPU consumption was similar to that of OpenTelemetry and the Pixie agent, and 73% lower than Datadog’s. Flora also consumed 74%, 77% and 96% less memory than Datadog, OpenTelemetry and the Pixie agent, respectively.
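The "over 3X" figure follows directly from these percentages. If one solution consumes N% less than another, the other consumes 1 / (1 - N/100) times as much. The sketch below applies that arithmetic to the numbers quoted above; the function name is illustrative only.

```go
package main

import "fmt"

// usageRatio answers: if A consumes pctLess% less than B, how many times A's
// consumption does B use? Returns +Inf for pctLess == 100.
func usageRatio(pctLess float64) float64 {
	return 1 / (1 - pctLess/100)
}

func main() {
	// "Percent less" figures as quoted in the benchmark text.
	fmt.Printf("Datadog total CPU:    %.1fx Flora\n", usageRatio(73))
	fmt.Printf("Datadog memory:       %.1fx Flora\n", usageRatio(74))
	fmt.Printf("OpenTelemetry memory: %.1fx Flora\n", usageRatio(77))
	fmt.Printf("Pixie memory:         %.1fx Flora\n", usageRatio(96))
}
```

A 73% reduction therefore corresponds to Datadog consuming about 3.7 times Flora's total CPU, which is where the headline comparison comes from.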
Flora shines the light on the groundcover APM revolution
With its use of new, cutting-edge eBPF concepts and its memory-efficient data pipeline, Flora delivers full observability while incurring near-zero overhead on the resources of the application it monitors, which is particularly important in cloud-native environments where resource consumption is a critical concern. Flora significantly outperformed leading observability platforms in a head-to-head benchmark test, demonstrating its fit for modern, high-scale cloud-native environments.
groundcover is on the fast track to completely redefine cloud-native application performance monitoring and the introduction of Flora into the market is just one more stable, promising stepping stone in this direction.
Disclaimer: benchmarks are an important way to demonstrate the strengths and weaknesses of different solutions, and they allow meaningful conclusions to be drawn. However, no benchmark can cover every aspect of a complex system, and we encourage teams to run their own benchmarks for their own specific needs, hardware and data. It’s the best way to choose the solution that fits your environment.