Omer Mayer
March 16, 2026

The notification layer is one of the least glamorous parts of an observability platform. It is also one of the most critical and load-bearing. When it breaks, engineers miss alerts. When it is hard to configure, teams route around it and build their own wrappers. When it is a black box, nobody knows why notifications stopped arriving.

For groundcover, that load-bearing layer was Keep, an open-source notification engine we adopted early because it got the job done. Then Keep was acquired. Then it was effectively abandoned. And by the time we hit enterprise scale, the cracks were impossible to ignore.

This is the story of what we built to replace it, and why we decided to build it ourselves rather than replace one dependency with another.

The Problem With Keep

Keep worked in the early days. It covered the standard notification use cases like routing alerts to Slack, PagerDuty, and other webhook-based destinations, so we could stay focused on building the core platform while it handled the notification layer.

But Keep had structural problems that compounded over time. Configuration required complex YAML workflows just to route an alert. That might be fine for a DevOps team that writes YAML all day; it is a serious friction point for everyone else. At scale, the cracks were harder to ignore: notification latency ballooned between when an issue fired and when the alert actually arrived, rate-limit hits on destinations like Slack went unretried and were silently dropped, and Keep's workloads consumed resources inefficiently. When something broke, like a notification that never arrived or a route that stopped matching, debugging was extremely difficult, and the Keep team's support was often unresponsive.

The feedback was consistent: the notification layer was the biggest operational pain point in the platform. SEs were fielding these issues daily. On-call teams felt them hourly.

What We Built Instead

Once it was clear Keep would not meet our needs at scale, we had two options: adopt another notification engine or build one ourselves. We built our own. The Dispatch Center is 100% groundcover-owned logic: we use the open-source Temporal workflow engine to manage execution and retries, but every bit of the decision-making, routing, and notification logic is written and owned by us.

The core design principle was that notification routing should work like the rest of groundcover. That meant an integration setup per destination, routing rules written in gcQL rather than YAML workflows, and every configuration decision visible and testable before a real alert fires.

In practice: you configure Slack channels in Connected Apps and route them freely from there. Notification Routes let you match monitor issues using a gcQL query, select the statuses you care about, and choose your destination. Before saving, a live preview shows exactly which monitors the route will match. You can also simulate an issue end-to-end on the monitor itself to verify the right routes fire and the right apps receive the alert. Previously there was no way to test this short of waiting for a real incident.
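To make that concrete, a routing rule in this model is just a gcQL filter over monitor-issue attributes. The field names and syntax below are illustrative assumptions for the sake of example, not taken verbatim from groundcover's documentation:

```
cluster:prod-us-east namespace:payments severity:critical
```

Any issue matching all of these attributes would be routed to the destinations configured on that route, and the live preview shows which monitors match before the route is ever saved.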

The other thing Keep couldn't do: tell you when something went wrong. The Dispatch Center surfaces a full delivery history per route, including diagnosable failures. If Slack rate-limited a notification, you can see it in groundcover without opening a support ticket. Both Notification Routes and Connected Apps are also fully Terraform-manageable, so your routing configuration lives in version control alongside everything else.
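Keep's silent drops on rate limits come down to a concrete retry problem. Here is a minimal, illustrative sketch of rate-limit-aware delivery with a recorded attempt history; this is not groundcover's actual implementation (which runs on Temporal), just the semantics it needs to provide: honor `Retry-After` when the destination supplies one, back off exponentially otherwise, and keep every attempt diagnosable.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Attempt:
    status: int                 # HTTP status returned by the destination
    delay_before_retry: float   # how long we waited after this attempt

@dataclass
class DeliveryRecord:
    attempts: list = field(default_factory=list)
    delivered: bool = False

def deliver(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry on HTTP 429, honoring Retry-After when present,
    and record every attempt so failures stay diagnosable."""
    record = DeliveryRecord()
    for attempt in range(max_attempts):
        status, retry_after = send()  # send() returns (status, Retry-After or None)
        if status == 200:
            record.attempts.append(Attempt(status, 0.0))
            record.delivered = True
            return record
        # Rate limited: wait for Retry-After if given, else back off exponentially.
        delay = retry_after if retry_after is not None else base_delay * (2 ** attempt)
        record.attempts.append(Attempt(status, delay))
        if attempt < max_attempts - 1:
            sleep(delay)
    return record
```

A delivery history like `record.attempts` is what lets a UI answer "why did this notification arrive late?" without a support ticket: every 429 and every backoff interval is right there.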

Using groundcover Query Language (gcQL) for routing rules was a deliberate choice. An engineer who already writes log queries in gcQL doesn't need to learn a separate routing DSL. The same syntax, the same mental model. It also means the AI agent can generate and modify notification routes using the same language it uses for everything else; routing becomes part of the same surface the agent can reason about, not a separate configuration system it can't touch.

The Transition

The Dispatch Center runs side-by-side with Keep during transition. Existing customers do not face a forced migration since the new engine handles all use cases. New deployments and trials use the Dispatch Center exclusively.

What This Unlocks

The more important story is what the Dispatch Center makes possible going forward. Notification routes are building blocks, and they open the door to automation.

An AI agent that generates routes, modifies them based on what it detects in your infrastructure, and triggers automated workflows requires a notification layer the platform actually owns. Keep was a ceiling. The Dispatch Center is a foundation.
