groundcover
Max Levin • Apr 26, 2026

From Dashboards to Drivers: Why the Future of Observability is Agent-First

Why groundcover built Agent Mode as a native operator of the observability platform, using gcQL and backend execution instead of generic chat over telemetry.

We started with the wrong question

A few months ago, I asked our designer for an icon for our agent.

It sounded like a normal product request. We were building a major new AI capability, so naturally I thought about where it should live in the UI, how users would recognize it, and how it should appear as a new product surface.

But that framing was already wrong.

If the agent has its own icon, its own corner, its own isolated identity, then it becomes a place users visit. A sidecar. A helper. A chat window next to the real product.

That is not the future we believe in.

The agent should not be another surface inside an observability platform. It should become a new way to operate the platform.

That distinction changes everything: the UX, the APIs, the query language, the data model, and the way investigations are represented.

This release is not about adding AI to observability.

It is about moving observability from dashboards you navigate to systems you can drive.

The agent is not an assistant. It is a power user

We do not think the right mental model for Agent Mode is “assistant.” An assistant answers from the edge of the product. Agent Mode needs to work from inside the product: understanding its objects, using its language, and operating its workflows.

That is the bar we set for Agent Mode. It should know groundcover deeply enough to use it well: navigate the platform, write gcQL, understand how telemetry types relate to each other, analyze existing assets, create new ones, and manipulate the objects that already live in the customer environment.

This matters because observability work is not only about finding data. It is about knowing what to do with it. A strong user knows when to pivot from traces to logs, and when the outcome of an investigation should become a monitor rather than stay as a one-time query. Agent Mode should follow the same pattern: operate the platform from the inside, not summarize it from the outside.

This also gives the product a new kind of flexibility. Some workflows are important, but too specific to deserve a dedicated page. Others are valuable, but complex enough that turning them into generic UX would create more friction than value. Batch asset analysis is a good example: it is clearly useful, but usually too contextual to justify a rigid product flow.

Agent Mode gives us a way to support that messy middle without forcing every use case into a fixed interface. The app should still make the common paths simple and explicit, while Agent Mode handles the work that requires context, judgment, cross-signal analysis, and action across multiple parts of the platform.

Why external observability agents run into hard limits

There is a real appeal to external agents that sit on top of existing observability tools. Keep your current stack, add an AI layer, and let it fetch data, summarize results, and suggest next steps.

We think that model has a hard ceiling.

An external agent is always a guest in someone else’s system. It does not control the data model, query language, indexing strategy, correlation primitives, or execution path. It can only use the APIs and abstractions the platform exposes.

That gets even harder when the agent tries to correlate signals across multiple platforms. At that point, it has to fetch data from different systems, normalize it outside the data layer, and merge results in its own context. Time windows, entity names, labels, sampling, and aggregation semantics rarely line up perfectly. The correlation logic becomes fragile, hard to validate, and easy to break as the underlying tools change.

That matters because observability investigations rarely live inside one clean object. The interesting questions are usually between signals: errors and deploys, traces and logs, metrics and events. If the platform cannot express those relationships natively, the agent has to reconstruct them from the outside.

That is where the architecture starts to drift. The agent fetches chunks, samples, summaries, or separate API responses, then asks the model to reason across them in context. It can look impressive in a demo, but it is a weak foundation for production investigation.

The failure mode is predictable: low precision, high token usage. The model ends up doing work the data layer should have done, which makes the system more expensive, less deterministic, and harder to trust.

A native agent should not work around the platform. It should operate through it.

The key architectural decision: move the LLM up, push the analysis down

A lot of the engineering work behind Agent Mode came from one core decision: the LLM should not be the execution engine.

The model should understand intent, plan the investigation, choose the right operations, and interpret the results. The backend should do the analytical work.

This is the same pattern that made coding agents useful. The strong ones do not try to keep the whole codebase inside the model. They write code, run commands, inspect outputs, and iterate over deterministic systems. The model orchestrates; the system executes.

Instead of asking the model to ingest raw telemetry and infer relationships directly, Agent Mode translates intent into gcQL, executes it through the backend, and receives compact result sets that already contain the relevant structure.

The flow is simple: intent becomes a plan, the plan becomes gcQL, gcQL compiles into optimized ClickHouse execution, and the backend returns results the model can interpret and act on.
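
Conceptually, that loop looks something like the Python sketch below. The function names (plan_step, execute_gcql) and the placeholder return values are illustrative only, not groundcover APIs; the point is the division of labor, with the model choosing queries and the backend doing the work.

from dataclasses import dataclass

@dataclass
class StepResult:
    query: str          # the gcQL the agent chose to run
    rows: list[dict]    # compact, pre-aggregated rows returned by the backend

def plan_step(intent: str, history: list[StepResult]) -> str | None:
    # Placeholder for an LLM call: pick the next gcQL query,
    # or return None when the evidence gathered so far is enough.
    if history:
        return None
    return "span_type:http status_code>=500 | stats by (workload) count() as errors"

def execute_gcql(query: str) -> list[dict]:
    # Placeholder for the backend: gcQL compiles to ClickHouse and runs there.
    # The model never ingests raw telemetry, only compact results like these.
    return [{"workload": "frontend", "errors": 1274}]

def investigate(intent: str) -> list[StepResult]:
    history: list[StepResult] = []
    while (query := plan_step(intent, history)) is not None:
        history.append(StepResult(query=query, rows=execute_gcql(query)))
    return history

for step in investigate("Why is the frontend returning 500s?"):
    print(step.query, "->", step.rows)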

That is not just an implementation detail. It is a product architecture decision. It defines what the model is responsible for, what the backend is responsible for, and how reliable the system can become.

Why gcQL became central to the product

Once Agent Mode became an operator of the platform, it needed a language to operate with.

That is where gcQL became central.

gcQL is not only a user-facing query language. It is also the operational language of Agent Mode. It gives the agent a structured way to express investigations across logs, traces, metrics, and events, using primitives that are native to groundcover.

This is a major advantage of building the agent into the product. We are not limited to whatever API a third-party platform exposes. If Agent Mode needs richer joins, better aggregations, or stronger cross-signal primitives, we can build them into the language and execution layer.

That creates a compounding loop: better language, more precise execution; more precise execution, less model improvisation; less improvisation, more reliable investigations.

One example: correlating HTTP errors with Kubernetes events

Take a question like:

“Are the 500 errors in the frontend service related to recent infrastructure changes?”

A weak architecture answers this by fetching application data, fetching Kubernetes events, maybe fetching traces, and then asking the LLM to infer whether there is a meaningful relationship.

That is exactly the path we wanted to avoid.

The better approach is to let Agent Mode translate the question into a query plan and push the correlation down into the backend:

span_type:http status_code>=500
| stats by (namespace, workload) count() as http_errors
| join by (namespace) (
    category:k8s_events type:Warning
    | stats by (namespace, reason) count() as event_count
)
| sort by (http_errors desc)
| limit 20

The important part is not the syntax. It is where the work happens.

The backend computes the relationship between application errors and infrastructure events. Agent Mode receives a compact result set that already contains the correlation, instead of trying to manufacture one inside the model from raw samples.

In a real environment, this can surface something concrete: a workload with a spike in HTTP 500s, joined against BackOff or Unhealthy Kubernetes events in the same namespace and time window. The model can then explain the result, prioritize it, and decide what to investigate next.
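
To make that concrete, the rows the model interprets after the join might look something like the sketch below. Field names and values are hypothetical, not an actual groundcover payload; what matters is that the correlation already exists in the data the model receives.

# Hypothetical shape of the compact, pre-correlated result set.
correlated_rows = [
    {"namespace": "prod", "workload": "frontend",
     "http_errors": 932, "reason": "BackOff", "event_count": 41},
    {"namespace": "prod", "workload": "frontend",
     "http_errors": 932, "reason": "Unhealthy", "event_count": 17},
]

# The model's job is interpretation, not computation: rank the rows,
# explain the likely relationship, and choose the next query to run.
top = max(correlated_rows, key=lambda row: row["event_count"])
print(f"{top['workload']}: {top['http_errors']} HTTP 500s alongside "
      f"{top['event_count']} {top['reason']} events")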

What changes at the product level

This is why we see Agent Mode as a product shift, not only a technical one.

Dashboards, monitors, and saved queries are still important, but they stop being the only entry points into the system. In an agent-first model, users can begin with intent instead of navigation.

Agent Mode can run the investigation, query the relevant signals, form a result, and then materialize the right artifact based on what it found. That might be a dashboard. It might be a monitor. It might be a saved query. It might just be an answer with the next recommended action.

That is the shift: observability artifacts become outputs of investigation, not only prerequisites for starting one.
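
As a rough, purely hypothetical sketch of that shift, the decision about what to materialize can follow directly from the finding itself; the field names and rules below are illustrative, not how Agent Mode is implemented.

def materialize(finding: dict) -> str:
    if finding.get("recurring"):
        # A recurring, threshold-like finding is worth watching continuously.
        return "monitor: alert when " + finding["query"] + " exceeds its baseline"
    if finding.get("worth_revisiting"):
        # An exploratory breakdown is worth keeping, but not alerting on.
        return "saved query: " + finding["query"]
    # Otherwise the right artifact is just the answer and a recommended next step.
    return "answer: " + finding["summary"]

print(materialize({
    "recurring": True,
    "query": "span_type:http status_code>=500 | stats by (workload) count()",
    "summary": "frontend 500s correlate with BackOff events",
}))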

This is more interesting than “chat with your telemetry.” It is a different way to express operational work inside the product.

Max Levin
Founding Engineer
