The dashboarding journey every observability startup takes
Learn how groundcover evolved from off-the-shelf charting to a custom canvas-powered dashboard engine, overcoming scaling, performance, and correctness challenges to support full-fidelity observability at scale.
There's a version of this post that's just a list of performance bugs we fixed. It would be accurate and it would put you to sleep. The thing worth telling is how a frontend team spent a year and a half growing into problems it wasn't equipped to solve at the start, and how each stage of that growth was the thing that made the next one possible.
Every team that renders serious volumes of time-series data walks some version of this path. Datadog built an SVG charting solution and outgrew it. So did we. The interesting part is the shape of the walk: pain and capability feeding each other until you can attempt something that would have looked reckless a year earlier.
groundcover's promise is full-fidelity observability across RUM, APM, Infra, and AI Observability, with no sampling and no rate limiting. Our eBPF sensor and OTel normalizer pull in enormous throughput. That's the selling point, and it's also what kept breaking our dashboards. When you refuse to drop data on the way in, you've signed up to render all of it on the way out.
Act one: building something we had no business building
A year and a half ago our charts ran on ECharts. It was enterprise-grade and proven, but also rigid, dated, and a fight to customize past its defaults. We wanted to build real dashboarding infrastructure on top of it, and the experience we had in mind wasn't reachable.
Suggesting we write our own charting library would have sounded crazy at the time. “Just use the thing that's stood the test of time.” Nobody seriously floated building it by hand. Then we started hitting the walls you hit with any off-the-shelf library. We wanted a tooltip that resolved to a single point instead of dumping every series at once. We wanted a custom legend, and a hover experience where the rest of the chart fades back around the series you're looking at, the kind of polished interaction Datadog had that we didn't. The list of things we wanted and couldn't get kept growing, and the frustration accumulated with it.
Frustration alone wasn't enough to act on. You also need the confidence that you can pull off the thing you're frustrated into wanting. The most annoyed team in the world won't attempt a rewrite it doesn't believe it can finish. We'd built up the expertise, the appetite was there, and the pain was waiting. So we built our own.
We built it on SVG, by hand, on top of D3. There are two ways to render charts in a browser. Canvas gives you raw throughput. SVG gives you crisp output and easy, rich interaction like hover and tooltips. SVG looked great, scaled well enough for the dashboards we had, and didn't force us down to rendering primitives we couldn't yet maintain. We were proud of it, and we'd earned that. It unlocked the custom experiences we'd been chasing for a year. In hindsight the choice was naive, even though it was the right one at the time.
Act two: the house of cards
About a month ago, the SVG solution hit a wall we couldn't optimize our way past. We'd taken it as far as it could go, and then it fell over in a single move. We stood up a proof of concept on a heavy dashboard and nothing worked. The performance problems were structural, not a slow corner here and there, and we were close to losing customers over them. The dashboards had become unusable to clients.
That collapse forced the lesson this whole post is built around. "Slow" was never one bug. We'd been chasing it like it was. A customer says a dashboard is slow, an engineer opens a profiler, finds the worst offender, fixes it, ships. Still slow. Finds the next one. Still slow. After enough rounds the conclusion becomes "rendering a lot of data in a browser is just hard," and the team starts sampling or quietly dropping data to make the number go down. We refuse to drop data, so that exit was closed to us.
What we actually had was five different problems wearing the same costume. Each one freezes the tab in the identical way, which is exactly why treating them as one keeps you stuck. They are: how much a series count is allowed to cost, how much point density is allowed to cost, what each surface of the app promises, which interactions get their own performance budget, and what a single downsampled point is allowed to mean. Different causes, different fixes, one symptom.
Naming the five is what made the rewrite tractable. Here's what each turned out to be.
A series cap is a promise about memory, not a display preference
The first crack looked trivial. Customers kept hitting the 40-series limit and asking for more, so we raised it to 100. Within days it went badly on real dashboards. Memory climbed sharply even on modest ones, some became unusable, and the app could crash outright. We reverted.
The revert is where the decomposition started, because it forced us to say out loud what the cap had been quietly protecting. It governs how many series objects live in memory, how many DOM and SVG elements we generate, how many legend rows and tooltip candidates exist, how much per-series work runs on load, and all of that multiplies across every widget on a page. A constant can't carry that much weight. It had to become a contract: a safe default of 40, opt-in expansion with a clear warning about what it costs, and a hard ceiling of 100 so no single chart can take down the page. "Show more" turned into a deliberate, priced choice instead of a default that invited a crash.
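As a rough illustration of a cap expressed as a contract rather than a bare constant, here is a minimal sketch. It is not groundcover's implementation: `resolveSeriesLimit`, its request shape, and the warning text are hypothetical, and only the numbers 40 and 100 come from the paragraph above.

```typescript
// Hypothetical sketch: only the values 40 and 100 come from the text.
const DEFAULT_SERIES_LIMIT = 40;
const HARD_SERIES_CEILING = 100;

interface SeriesLimitRequest {
  expanded: boolean;   // true only when the user explicitly opted in to "show more"
  requested?: number;  // how many series the user asked for, if they said
}

interface SeriesLimitDecision {
  limit: number;
  warning: string | null; // non-null whenever the expensive path is taken
}

function resolveSeriesLimit(req: SeriesLimitRequest): SeriesLimitDecision {
  // Without an explicit opt-in, the safe default always wins.
  if (!req.expanded) {
    return { limit: DEFAULT_SERIES_LIMIT, warning: null };
  }
  // Expansion never goes below the default and never above the hard ceiling.
  const wanted = req.requested ?? HARD_SERIES_CEILING;
  const limit = Math.min(Math.max(wanted, DEFAULT_SERIES_LIMIT), HARD_SERIES_CEILING);
  return {
    limit,
    warning: `Showing up to ${limit} series uses more memory and may slow this page.`,
  };
}
```

The point of the shape is that expansion cannot happen silently: every path that exceeds the default also produces a warning the UI has to show.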
Pulling that thread surfaced a question we'd been ignoring. Expanding the view sometimes means re-running the query, and a fresh query can return slightly different data than what's already on screen. Honest expansion accounts for that rather than papering over it, which is the kind of detail you only notice once you've stopped treating the cap as a number.
Point density is a separate budget from series count
For a long time we assumed a slow dashboard meant too many series. One customer dashboard corrected us. It had gone unusable on a widget carrying about eight series. The culprit was resolution: a 10-second step over a one-day range, which is 8,640 datapoints per series. Eight series at nearly nine thousand points each, and the browser fell over.
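The arithmetic behind that number is worth writing down, because it depends only on range and step, not on series count. The helper below is an illustrative sketch with hypothetical names, and it ignores whether the query includes both endpoints of the range.

```typescript
// Points a single series carries for a given time range and step.
function pointsPerSeries(rangeSeconds: number, stepSeconds: number): number {
  if (stepSeconds <= 0) {
    throw new RangeError("stepSeconds must be positive");
  }
  return Math.floor(rangeSeconds / stepSeconds);
}

const ONE_DAY_SECONDS = 24 * 60 * 60;              // 86,400
const perSeries = pointsPerSeries(ONE_DAY_SECONDS, 10); // 8,640 points per series
const totalPoints = perSeries * 8;                 // 69,120 points for eight series
```

Widening the step from 10 seconds to 60 would cut the same widget to 1,440 points per series without touching the series count at all.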
This breaks rendering through entirely separate machinery. Series count stresses legends, color assignment, tooltip matching, per-series transforms. Point density stresses path generation, downsampling, scale mapping, hover hit-testing, and paint. From the user's chair they're indistinguishable; both just freeze the tab. That's the whole trap. With only one mental model, you'll debug the eight-series dashboard hunting for cardinality that isn't there.
There's no magic number here either. Render safety has to weigh time range, step size, series count, and widget count together, and we're still building that combined guardrail. We'd rather say so than pretend a dashboard can judge what's safe from series count alone.
A chart that renders fast can still lie
Performance bugs announce themselves. Correctness bugs hide, because the chart renders cleanly and looks plausible while being wrong. People forget this contract exists, and it's the most expensive one to get wrong.
When a query returns more raw points than a chart can draw, we downsample. That sounds like a rendering shortcut, but the aggregation rule carries meaning in observability data. Averaging is fine for CPU utilization. For counts, error totals, or request volumes, averaging understates the real number. For a stacked bar, the tooltip has to reflect the true bucket total or the per-segment contributions, not whichever shape your cursor happened to land on. Get it wrong and you ship a chart that renders smoothly, reads cleanly, and quietly lies during an incident, the worst possible moment to be confidently misled.
So downsampling lives inside the chart's correctness contract. The chart has to know what kind of value it's reducing and what a single drawn point stands for. We're also surfacing step and resolution directly in the tooltip so you can always tell what window a point represents.
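To make the aggregation point concrete, here is a minimal sketch of a downsampler that is told what kind of value it is reducing. The `ValueKind` names and the fixed-size bucketing are assumptions for illustration, not the engine's actual rules, and the sketch ignores nulls and gaps.

```typescript
// "gauge": a level such as CPU utilization, safe to average.
// "count": an event total such as errors or requests, which must be summed.
type ValueKind = "gauge" | "count";

interface Point {
  t: number; // timestamp in seconds
  v: number; // value
}

function downsample(points: Point[], bucketSize: number, kind: ValueKind): Point[] {
  if (!Number.isInteger(bucketSize) || bucketSize < 1) {
    throw new RangeError("bucketSize must be a positive integer");
  }
  const out: Point[] = [];
  for (let start = 0; start < points.length; start += bucketSize) {
    const bucket = points.slice(start, start + bucketSize); // never empty here
    const sum = bucket.reduce((acc, p) => acc + p.v, 0);
    // Label the reduced point with the earliest timestamp in its bucket.
    const t = bucket.reduce((min, p) => Math.min(min, p.t), Infinity);
    out.push({ t, v: kind === "count" ? sum : sum / bucket.length });
  }
  return out;
}

// Four 10-second buckets of error counts collapsed into one drawn point.
const errorsPer10s: Point[] = [
  { t: 0, v: 2 }, { t: 10, v: 0 }, { t: 20, v: 5 }, { t: 30, v: 1 },
];
const asCount = downsample(errorsPer10s, 4, "count"); // one point, v = 8
const asGauge = downsample(errorsPer10s, 4, "gauge"); // one point, v = 2
```

The same four raw points yield 8 or 2 depending on the rule. Averaged, a 40-second window with eight errors would read as two.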
Interaction gets its own performance budget
The most satisfying wins came from profiling what happens when someone uses a chart, not how it first loads. Three separate problems hid here, and the initial-render benchmark catches none of them.
One investigation began with a 35-second CPU trace of a single dashboard interaction. The data had already arrived; the query wasn't the problem. Hover was. Every mouse movement asked the browser to recompute layout, hundreds of times per sweep across a dense chart, and we were re-rendering on every resize observation during ordinary paint churn. We changed it to read that geometry once per hover instead of once per mouse event, and collapsed resize bursts into a single update on the next frame.
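Collapsing a burst of events into one update per frame is a small, general pattern. The sketch below shows one way to do it under stated assumptions: `coalesce` is a hypothetical helper, and the scheduler is passed in so the logic does not depend on a browser. In a page it would typically be `(cb) => requestAnimationFrame(() => cb())`.

```typescript
type Schedule = (cb: () => void) => void;

// Wraps `apply` so that any number of calls before the next scheduled slot
// results in exactly one `apply`, invoked with the most recent value.
function coalesce<T>(apply: (latest: T) => void, schedule: Schedule): (value: T) => void {
  let pending: { value: T } | null = null;
  return (value: T) => {
    if (pending) {
      pending.value = value; // a flush is already queued; just remember the newest value
      return;
    }
    const slot = { value };
    pending = slot;
    schedule(() => {
      pending = null;        // allow the next burst to schedule its own flush
      apply(slot.value);
    });
  };
}

// Browser usage (not executed here):
//   const onResize = coalesce(redrawChart, (cb) => requestAnimationFrame(() => cb()));
//   new ResizeObserver((entries) => onResize(entries)).observe(chartElement);
```

Here `redrawChart` and `chartElement` are placeholders. The same wrapper works for pointer events, although caching geometry for the duration of a hover is a separate fix from coalescing.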
The second showed up with crosshair sync turned on. Every pointer movement wrote shared application state, which forced a wide set of components to re-render, restyle, and repaint inside the few milliseconds a frame is supposed to take. We moved synced-crosshair behavior down into the charting library where it belongs, so moving the mouse no longer taxes the whole dashboard.
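One common way to keep pointer movement out of shared application state is a small publish/subscribe channel that charts update from imperatively. The sketch below illustrates that general idea only; `CrosshairBus` is a hypothetical name and not necessarily how groundcover's charting library does it.

```typescript
// A timestamp in milliseconds, or null when the pointer leaves every chart.
type CrosshairListener = (timestampMs: number | null) => void;

class CrosshairBus {
  private listeners: CrosshairListener[] = [];

  // Returns an unsubscribe function so a chart can detach when it unmounts.
  subscribe(listener: CrosshairListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Called by whichever chart the pointer is over. Only subscribed charts react;
  // nothing outside the bus is notified.
  publish(timestampMs: number | null): void {
    this.listeners.forEach((l) => l(timestampMs));
  }
}
```

Each chart would subscribe once and move its own crosshair line in the listener, so a pointer move touches only the charts and not the component tree around them.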
The third needed no user action at all. Switch to another browser tab, come back, and the dashboard would freeze for a few seconds. When a hidden tab wakes up, a dozen subsystems resume at once: focus and visibility events, viewport tracking, query enablement, grid layout, chart redraw. Each is correct on its own, and together they stampede. We changed the dashboard to keep showing what was already on screen and resume work gradually instead of treating a tab switch like a full reload.
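Resuming gradually can be as simple as running the wake-up work one step per scheduling slot instead of all at once. This is a sketch of that idea with hypothetical names, not the dashboard's actual resume logic. The `defer` function is injected; in a browser it might wrap `requestAnimationFrame` or `setTimeout`, triggered from a `visibilitychange` listener.

```typescript
type Task = () => void;
type Defer = (cb: () => void) => void;

// Runs tasks in order, one per deferred slot, so no single slot carries
// the whole resume workload.
function resumeGradually(tasks: Task[], defer: Defer): void {
  const remaining = tasks.slice(); // copy so the caller's array is untouched
  const step = () => {
    const next = remaining.shift();
    if (!next) return;
    next();
    if (remaining.length > 0) defer(step);
  };
  if (remaining.length > 0) defer(step);
}

// Browser usage (not executed here); the task names are placeholders:
//   document.addEventListener("visibilitychange", () => {
//     if (document.visibilityState === "visible") {
//       resumeGradually([enableQueries, relayoutGrid, redrawCharts],
//                       (cb) => requestAnimationFrame(() => cb()));
//     }
//   });
```

Until the first task runs, whatever was already painted stays on screen, which is the behavior the paragraph above describes.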
The thread tying these together is that initial render is the easy benchmark and a dishonest one. Dashboards get judged by how they feel the moment you start investigating, and that's a budget you fund separately.
One engine, different promises per surface
The rewrite forced a distinction we'd been blurring. The same chart component can't behave identically everywhere it shows up. In an exploration or query-builder view you're actively inspecting data, so offering "show all series" makes sense. In a dashboard card that same widget sits in a crowded grid with limited space and neighbors competing for the browser's budget. Put "show all" in every dashboard widget and a monitoring board becomes a minefield where any card can trip the expensive path.
So dashboard cards stay capped and show a clear indicator that results are limited, while the full expansion controls live in the exploration view. The engine is shared; the contract per surface is not. Naming that explicitly is what kept a single shared component from quietly importing exploration-grade costs into every tile.
Act three: moving to canvas
Decomposing the problem told us what to fix. It didn't make the fixes cheap. Incremental work had already stopped paying off. Viewport-only rendering, lighter table widgets, smarter DOM updates were each a drop in the bucket, because the bottleneck was the rendering model itself.
That bottleneck is the DOM. SVG is vector-based, and every element is a node the browser has to walk and lay out. Deep, wide trees are expensive, and the cost compounds: more elements and deeper nesting mean more work on every update, so the penalty grows much faster than the series count. Canvas behaves like an image instead. Drawing a chart on canvas is like painting a picture by hand: however much you draw, it stays a single element and never deepens the DOM tree. The penalty stays roughly flat whether you're drawing one series or many. The cost is that canvas is lower-level, and its API is awkward where SVG's is declarative and pleasant.
So we split the work. Canvas renders the high-cardinality series data, where throughput is the only thing that matters and it's night-and-day faster. SVG handles everything else, including the hover and tooltip layer, where crispness and rich interaction earn their keep. That division of labor only became obvious after we'd stopped asking "how do we make charts fast" and started asking the five narrower questions, each with its own answer at its own layer.
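A minimal sketch of the canvas half of that split follows. It is illustrative only, with hypothetical names and deliberately naive scaling. `PathContext` lists the four `CanvasRenderingContext2D` methods the function needs, so a real 2D context can be passed in a browser while a stub works elsewhere.

```typescript
// The subset of CanvasRenderingContext2D this sketch relies on.
interface PathContext {
  beginPath(): void;
  moveTo(x: number, y: number): void;
  lineTo(x: number, y: number): void;
  stroke(): void;
}

// Draws every series as one stroked path. No DOM nodes are created per series
// or per point; the only element involved is the canvas itself.
function drawSeries(
  ctx: PathContext,
  series: number[][],
  width: number,
  height: number,
  yMax: number,
): void {
  for (const values of series) {
    if (values.length === 0) continue;
    ctx.beginPath();
    values.forEach((v, i) => {
      // Spread points evenly across the width; map values so yMax sits at the top.
      const x = values.length === 1 ? 0 : (i / (values.length - 1)) * width;
      const y = height - (v / yMax) * height;
      if (i === 0) ctx.moveTo(x, y);
      else ctx.lineTo(x, y);
    });
    ctx.stroke();
  }
}

// Browser usage (not executed here):
//   const ctx = canvasElement.getContext("2d");
//   if (ctx) drawSeries(ctx, allSeries, canvasElement.width, canvasElement.height, yMax);
```

An SVG renderer would add at least one `<path>` node per series to the document for the same input. Here the node count stays at one regardless of how many series are drawn, which is why hover and tooltips can remain in a thin SVG layer on top.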
The rewrite itself was a port more than a clean-room rebuild, and that was deliberate. What made it survivable was groundwork laid long before. We'd spent roughly a year maturing the public API of our chart components, so the contract the new engine had to honor was already explicit and testable. We were swapping the engine, not redefining behavior. We had a large automated test suite, hard-won answers to the unglamorous questions like whether a null renders as a null or a zero and how gaps on the x-axis should behave, and D3's render-agnostic core to port existing logic from instead of reinventing it. The maturity of the codebase, and our understanding of what was common, what was weird, and how each piece was supposed to behave, is what let a week-and-a-half rewrite land smoothly instead of becoming a multi-month rebuild.
The payoff: we lifted the practical ceiling from 40 series to 1,000, with headroom into several thousand points on a single chart, and dashboards that used to lock up became smooth enough to investigate during an incident, which is the only test that counts.
The shape of the whole thing
None of the three acts could have come earlier than it did. We couldn't have written the SVG library before we had both the frustration and the confidence to justify it. We couldn't have decomposed "slow" into five contracts before the SVG solution collapsed and forced us to. We couldn't have run the canvas rewrite quickly and safely before that decomposition gave us named contracts and a mature API to port against. Each stage needed the perfect storm of pain and capability that the stage before it produced. That progression is the part worth telling, more than any individual bug: the way a team's reach grows to match the problems in front of it. It's a journey most observability startups end up taking, whether or not they name it as one.
We're not done, and we won't pretend to be
Some of this is open work, and the decomposition tells us exactly which contract each gap belongs to. A single overloaded widget can still drag down a dashboard in the worst cases, and we're designing that class out rather than patching around it. Very long time ranges with heavy filtering can still hit query timeouts on the backend, and we're improving both the query path and how a widget communicates a partial failure instead of going dark. We want dashboards explicitly built to show all series to remember that intent across reloads. And we want every edge case, down to a pie chart with too many slices, to fail with a clear message rather than silently.
That's the kind of work we think observability tooling should be built on: full data, rendered honestly, fast enough to actually investigate, inside your own cloud. If that's the platform your team wants to run on, try it with our playground or launch groundcover's eBPF sensor on your cluster for free.




