How Humans Ship Code

This one is for my fellow developers. Consider the last feature you shipped to your platform — let's call it toast. It likely started with a feature specification and an initial design. You started writing the code, determined to create the best toast you could. You opened a pull request — "introducing toast" — and implemented files like "knife.go" and "butter.py", which work together to create the perfect breaded delight.

Then someone started reviewing your code and pointed out that "butter.py" contains a potential butter overflow (ha-ha). You fix that issue and add some logs around the area. While doing so, you start thinking about the ways spreading the butter with the knife could go wrong, so you instrument that area with traces to get end-to-end visibility. Finally, you decide it might be interesting to know how many toasts were made, so you add metrics to build beautiful dashboards and alerts. The pull request is approved and toast is a go!
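In code, that round of review feedback might translate into something like the following hypothetical butter.py, sketched here with the standard logging module, the OpenTelemetry tracing API, and the Prometheus Python client (the function and metric names are illustrative, not taken from any real service):

```python
import logging

from opentelemetry import trace
from prometheus_client import Counter

logger = logging.getLogger("toast")
tracer = trace.get_tracer("toast")

# Metric added last: count how many toasts were made, for dashboards and alerts.
TOASTS_MADE = Counter("toasts_made_total", "Number of toasts made")


def spread_butter(knife, butter, bread):
    # Trace added after wondering how spreading the butter could go wrong end to end.
    with tracer.start_as_current_span("spread_butter"):
        if butter.amount > bread.capacity:
            # Log added in response to the "butter overflow" review comment.
            logger.warning("butter overflow: %s exceeds %s", butter.amount, bread.capacity)
        knife.spread(butter, bread)
        TOASTS_MADE.inc()
```

Notice that every one of those signals landed exactly where the review happened to look.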

Do you see the problem?

The Irony of Monitoring

The parts of the code which are now monitored — the knife and butter — are also the parts of the code that were reviewed more thoroughly. Both developers and reviewers tend to focus on areas they deem important — their bread and butter (pun intended!). But there’s more to toast than just the knife and butter components, and here's the kicker — the parts that fly under the radar during code review are the same ones that end up flying under the radar when it comes to monitoring.

Issues never come from where we expect them to — and this is not because we don't try to find them. In a world of fast-moving code and human bias, it's simply impossible to think of every possible failure scenario. If the title of the pull request is "toast", it's very difficult to focus on anything else, and that's why issues are so often discovered in hindsight, during outages — which is when the penny drops that we've been monitoring the wrong things, or only some of the right things.

Like many familiar cognitive biases — survivorship bias, anchoring bias — this one is self-reinforcing: the areas of code that receive more attention and visibility are less likely to break, precisely because they received the attention and investment needed to make them robust and resilient. In contrast, the parts of the system that are overlooked often lack both testing scrutiny and monitoring controls, making them more likely to fail silently.

It's the same fallacy that leads doctors to recognize only the conditions they've encountered before, while the outliers get overlooked or misdiagnosed (which is why the "second opinion" is so popular in medicine: it's meant to diversify perspectives).

This is the classic problem of unknown unknowns. We know what we know, and at times we even know what we don't know. But we can't account for the edge cases we can't predict — the ones we don't know that we don't know. It's always the system behavior you didn't plan for that surprises you: the toast that burns not because of faulty butter spreading, but because someone left the toaster plugged in during a power surge, or because the bread supplier changed their recipe and now it toasts differently.

The only way to make sure we have all the visibility we need for all potential incidents is, well, to monitor everything all the time. But there’s a catch.

See Everything, Store Everything

To monitor everything all the time, you need to start by collecting everything. And I mean everything: every service, API call, and third-party app. Not just what you think is important; you will only figure out you were wrong later.

eBPF was built to provide kernel-level observability with negligible performance overhead, and when applied to monitoring, it makes comprehensive system visibility possible at scale. With eBPF, you can capture every function call, network packet, and system event across your entire stack without the instrumentation burden or the prohibitive costs that made this approach impractical before.
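As an illustration of the general technique (not groundcover's implementation), here is a minimal sketch using the open-source BCC toolkit's Python bindings; it assumes bcc is installed and the script runs with root privileges:

```python
from bcc import BPF

# A tiny eBPF program (written in C) that fires every time the clone() syscall is entered.
prog = """
int trace_clone(void *ctx) {
    bpf_trace_printk("clone() called\\n");
    return 0;
}
"""

b = BPF(text=prog)                                     # compile and load the program into the kernel
syscall = b.get_syscall_fnname("clone")                # resolve the kernel symbol for clone()
b.attach_kprobe(event=syscall, fn_name="trace_clone")  # attach a kprobe; no application changes needed
b.trace_print()                                        # stream the events as they arrive
```

One short script, and every process creation on the host becomes visible, without touching a single line of application code.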

But this introduces new challenges. Now that you have everything, you need a sensible way to store it. Teams have tried DIY approaches to storing it all, but the economics don't work when you're paying for massive storage and data transfer costs.

Shipping that much data across the internet is becoming a financially untenable burden for engineering teams. This is the fundamental challenge the Bring Your Own Cloud (BYOC) approach was introduced to solve, and it is radically changing the monitoring domain.

The BYOC model tackles the core cost problem: data gravity. Instead of shipping terabytes of monitoring data to expensive third-party storage, you store everything in your own cloud infrastructure. No egress costs, no storage markup, no vendor lock-in on your most valuable asset.

Why let anyone else keep your data? 

The groundcover Effect

I'll humbly admit to shipping many bugs to production over my career. Not once did I find myself saying, "yes, I thought that might be a problem, so I added a log message there, just in case".

We can't overcome our human nature, but luckily, monitoring everything means we don't have to. You can't think of every single monitoring permutation, and you can't predict how your systems might fail in unexpected ways. We built groundcover with the mindset that the best contingency plan for unpredictability and unknown unknowns is wide monitoring coverage, which is now possible with eBPF. That coverage enables anomaly-detection-based monitoring that doesn't require clairvoyance; you don't have to know what to look for ahead of time.

groundcover addresses all of these challenges through its unique approach: using eBPF to capture comprehensive, system-wide visibility without requiring developers to predict failure points, combined with a BYOC model that keeps your data close and your costs manageable. Instead of playing whack-a-mole with monitoring after each outage, you get the complete picture from day one.

The next time you ship your toast feature, you won't need to worry about whether you monitored the right components. Because with complete visibility, every component is the right component to monitor.
