
From SaaS Chaos to BYOC Freedom: Redefining Observability for the AI Era

Thursday, 06 Nov 2025 | 1:00 PM (EST)

Join this session to explore how observability must evolve for the AI era. Led by Adam Hicks, Senior Solutions Architect at groundcover, it covers how teams can debug, optimize, and control AI systems with complete visibility, zero friction, and full data sovereignty.

Transcript

00:00:02 - 00:01:15
All right. Hello everybody, thanks for joining. We're going to give it just a few seconds to make sure that everybody has a chance to join. Okay, let's go ahead and get started. So thank you for joining the session today. My name is Adam Hicks, I'm a senior solutions architect with groundcover, and I want to talk to you today about going from SaaS chaos to bring-your-own-cloud freedom, and especially what that means when it comes to LLM observability, or AI


00:01:02 - 00:02:05
observability. Now that we're living in the AI era, it presents some unique challenges, and I'd like to talk to you about how groundcover can help you manage them. So without further ado, let's jump into it. I wanted to start with a little bit of ground setting about the state of observability up to now, some perspective for you. Legacy observability is what I might call it. It's easy for us to cast aspersions on the past as we move into


00:01:33 - 00:02:55
the modern era. But to be honest, when we think about this, what we've seen is that SaaS platforms have evolved from legacy monitoring into this modern observability. They were built for a different era of IT. While they certainly played their part in how observability has transformed over time, they're truly struggling to keep up with today's cloud-native world. Modern applications have shifted. We see microservice architectures. We see AI agents being developed that interoperate with each


00:02:15 - 00:03:33
other: agent-to-agent communications and our interactions with LLMs. We're also seeing mountains of data from this complexity in these distributed architectures, more than we used to see. Terabytes of telemetry data frequently overwhelm traditional architectures, certainly a lot of on-prem capacity management, but even the SaaS platforms, and that can create processing delays as well as some monumental costs. We've seen a shift to a per-gigabyte pricing model that


00:02:54 - 00:04:04
has become prohibitively expensive as data scales. Sensitive telemetry data crossing organizational boundaries also raises compliance issues and, with that, regulatory challenges. So teams unfortunately are faced with unpredictable costs, siloed product lines, and data budget trade-offs and rate limiting, forcing them to make difficult trade-offs between visibility and budget, employing things like sampling. And while there can be


00:03:29 - 00:04:51
good cause for sampling, you know, maintaining fidelity without having to overburden systems, it shouldn't have to be a choice. And we've seen a return from this to self-hosted open-source solutions. While I'm a big fan of open source, self-hosting oftentimes leads to a lot of operational burden. So it raises the question: how do we do observability? If we can talk about the problems, what is the solution? When we approached this here at


00:04:10 - 00:05:20
groundcover, we realized that this is fundamentally a data problem. A lot of what I was just talking about on, I'll call it the problem slide before this one, was really about managing massive volume: the cost for ingest, how you store it and optimize it, as well as how you secure it. So at the end of the day, the way we started thinking about this is that the problem with the SaaS model is that it fragments your stack. Logs end up living in one tool, traces


00:04:45 - 00:05:56
in another, metrics in a third. And while we've certainly seen some attempts, or I should say some solutions, from other vendors to unify them into one interface, inevitably when you dig underneath the hood you find out that these solutions are themselves siloed too. I'll use the old sources-versus-sinks terminology: your sources being your applications and your infrastructure, and the instrumentation to get the relevant telemetry out; the sink


00:05:20 - 00:06:35
being the database that you can then do analytics on, on top of all of these telemetry signals. These sinks are purpose-built for the telemetry types being ingested, so they end up being siloed. While there are ways to unify through metadata inside of a pane of glass, the sinks themselves are not unified. So what ends up happening is that you either spend a lot of time upfront unifying them, bringing in institutional knowledge about how they should be unified, whether it's


00:05:57 - 00:07:01
at the source or at the analytics layer, or you handle it through silos, which means that every time something breaks, engineers waste hours switching dashboards trying to correlate symptoms across tools that don't talk to each other. And then, during incidents, rate limits kick in right when visibility matters most. And the kicker: every gigabyte costs you more. You want to send more in during an incident, you pay for it. Observability shouldn't be like this. It


00:06:30 - 00:07:46
should make debugging easier, not feel like you have to constantly debug your own tools. So at groundcover we rethought things, and we realized that we needed to think cloud-native first. This means infrastructure as a service, platform as a service, application as a service, infrastructure as code. We also needed to think about bring-your-own-cloud native, and we'll talk a little bit about what that is and means in a second, but it was really important for us that we could allow


00:07:09 - 00:08:32
our customers to keep data inside of their own four walls. And we also recognized that we needed to be truly cloud-native in terms of being able to meet our customers where they were. They get accustomed to tools and architectures and services from their cloud provider, so we need to be able to work with those. We thought we could change the pricing model as well; since ingestion is no longer a concern of ours, we can come up with a different pricing model. We now have predictable pricing based on


00:07:50 - 00:08:54
what you monitor, what systems or hosts you are actually employing to do observability on. And that frees us: we don't have to impose any limits on you in terms of what data you're keeping and storing, whether it's by type, shape, cardinality or volume. And we wanted to make it really easy for you to deploy, whether it's getting started with us in your own cloud or getting started instrumenting.
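
As a rough, back-of-the-envelope illustration of the pricing difference being described, here is a small sketch comparing per-gigabyte ingest pricing with flat per-host pricing. Every number in it is a hypothetical placeholder, not groundcover's or any other vendor's actual rate.

```python
# Hypothetical illustration of ingest-based vs. host-based pricing.
# All prices and volumes are made-up placeholders.
INGESTED_GB_PER_NODE_PER_MONTH = 500   # assumed telemetry volume per host
PRICE_PER_GB = 0.10                    # assumed per-gigabyte ingest price (USD)
PRICE_PER_NODE = 30.00                 # assumed flat per-host price (USD)

def monthly_cost_per_gb(nodes: int, gb_per_node: float) -> float:
    """Ingest-based pricing: cost grows with every gigabyte you keep."""
    return nodes * gb_per_node * PRICE_PER_GB

def monthly_cost_per_node(nodes: int) -> float:
    """Host-based pricing: cost depends only on how many hosts you monitor."""
    return nodes * PRICE_PER_NODE

for nodes in (10, 100, 1000):
    per_gb = monthly_cost_per_gb(nodes, INGESTED_GB_PER_NODE_PER_MONTH)
    per_node = monthly_cost_per_node(nodes)
    print(f"{nodes:>5} nodes | per-GB: ${per_gb:>10,.0f} | per-node: ${per_node:>10,.0f}")
```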


00:08:23 - 00:09:27
You know, I talked a little bit about this before, but there's the burden of troubleshooting your own stack and having to bring institutional knowledge from your organization so that you can do correlation across signals. What if you didn't have to do that? What if it was easy to deploy instrumentation agents or sensors that were super performant? And of course we wanted to make sure that we could still keep up with where you're at, and where all of our customers are at, with all the modern collection methods. So this could be OpenTelemetry, or, and we'll


00:08:54 - 00:10:06
talk about this as well, extended Berkeley Packet Filter, or eBPF, and then also help you do observability on truly modern types of telemetry signals, which could be things like LLM experiences, which are a little bit different, and we'll talk about that too. So let's talk about what this BYOC, bring your own cloud, really means. What it means at the end of the day is that you get to keep your data inside of your VPC, inside of the four walls of


00:09:30 - 00:10:38
your infrastructure, but it's fully managed by us. You get the operational ease of software as a service, but you get the privacy and control of on-premises, even if it's in a hyperscaler. And in a hyperscaler we get all of the capabilities that allow us to scale this like it's a SaaS-native service for you, but you can put it anywhere. So we see customers deploying this in everything from large financial institutions with strict compliance requirements to fast-moving startups


00:10:04 - 00:11:17
in EKS. Same performance, same cost advantage no matter what; it's very flat. And because we don't require you to sample at all, because it's not ingestion-based anymore, you can actually afford to store whatever it is that you want to keep in terms of telemetry. And that's what makes modern debugging possible. You don't have to sample out; you can maintain the fidelity you need to troubleshoot your applications and your infrastructure and get to what you need very quickly. So yes, to


00:10:41 - 00:11:51
answer questions that may be on the tips of your tongues, and we're going to do questions at the end, but this gets deployed into your environment. The next thing that we like to talk about, and I mentioned this a couple of slides back, is deployment ease. We deploy the backend for you, so even though it's running in your environment, there is no management of it that becomes an operational burden on you, and the same goes for whatever it is you want to monitor inside of your environment. We make it


00:11:17 - 00:12:39
very easy. The frontend is accessed from our cloud; that is a SaaS-like experience. It is a single pane of glass that handles authentication and our reverse proxy to run queries across backends that you may have deployed inside of your environment. But we also offer a completely on-premises model, so it could be air-gapped, just inside of your own infrastructure if you want it that way. We do have free and Teams versions as well that can be managed by you inside


00:11:58 - 00:13:08
of your own environment if desired. But the next thing I like to talk about is the simplicity of instrumentation. I think the world is rife with telemetry agents, but it's also rife now, especially as we get into how we get context across distributed cloud-native applications, with code-level instrumentation tracing libraries, and there's a lot of value in them. They allow us to get granular with data in our business logic in a customized way that you simply can't


00:12:33 - 00:13:42
do any other way. But there's a huge maturity curve that you need to go through in order to get to where you can actually do that and find value from it. And one of the reasons we see teams hesitating to add observability, as opposed to traditional monitoring, is the pain of that instrumentation: endless SDKs, agents and agent configurations even if you're not touching code, and on the other side maybe you are making code changes, which becomes more of a challenge internally in the organization.


00:13:09 - 00:14:15
You've got competing priorities between observability and resiliency and feature velocity, and how do they feed each other? So with groundcover, you just install one thing once. We have an eBPF sensor, extended Berkeley Packet Filter, and when you deploy this into Kubernetes clusters or Linux hosts it starts streaming enriched telemetry instantly. And when I say enriched, I mean that we are contextualizing for you all the way at the source; again, I'll hearken back to that term I used before,


00:13:42 - 00:14:45
institutional knowledge. The institutional knowledge that you need in order to do correlation is unnecessary with us; we understand how the metrics from a system and an application correlate to a trace that's been generated by a request, as well as a log output. If you're unfamiliar with extended Berkeley Packet Filter, it hooks into the kernel layer, meaning that it actually sees every single request, every trace, every event, and you didn't have to touch application code. We don't do any injection. It is passive probing of the


00:14:13 - 00:15:28
kernel entirely. So in practice, this means full visibility from day one: every microservice, every system, and every call. And of course, if you want to mature beyond that and enrich it, you certainly can. This makes us pretty unique. You get a full-stack experience, a single pane of glass that unifies infrastructure, application performance, real user monitoring, and now LLM observability. No add-ons, no enterprise tiers from us, no AI monitoring SKU. It's all built in.
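
To make the eBPF idea a bit more concrete, here is a minimal, standalone sketch of passive kernel-level observation using the bcc Python bindings. This is not groundcover's sensor and captures far less (it only counts write() syscalls per process), but it shows the property described above: telemetry gathered from kernel tracepoints without touching or injecting into application code.

```python
# Minimal sketch of passive, kernel-level observation with bcc (not groundcover's
# sensor): count write() syscalls per PID without modifying any application.
# Requires root privileges and the bcc package installed.
import time
from bcc import BPF

PROGRAM = r"""
BPF_HASH(counts, u32, u64);

TRACEPOINT_PROBE(syscalls, sys_enter_write) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;   // upper 32 bits hold the PID
    u64 zero = 0, *val;
    val = counts.lookup_or_try_init(&pid, &zero);
    if (val) {
        (*val)++;
    }
    return 0;
}
"""

b = BPF(text=PROGRAM)
print("Counting write() syscalls per PID for 10 seconds...")
time.sleep(10)

# Print the ten busiest processes observed, entirely from the kernel side.
for pid, count in sorted(b["counts"].items(), key=lambda kv: -kv[1].value)[:10]:
    print(f"pid={pid.value:<8} writes={count.value}")
```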


00:14:51 - 00:16:00
All we're going to be pricing on, and all we're going to be thinking about, is how many hosts you have decided to deploy our sensor to and monitor with. External data that you may send in, additional telemetry logs, metrics or traces from other systems, comes at no additional cost. So when we talk about 10x more data for a fraction of the cost, this is what enables it: one unified system instead of five disconnected ones. But let's shift gears and talk about


00:15:29 - 00:16:31
AI. That's really what we're here for. Of course, bring your own cloud is a really powerful component for us, but I want to talk about why it's so special in this era of AI. LLM applications are fundamentally different from traditional software. We may say, oh, they follow distributed architecture, twelve-factor principles. At the end of the day, they do and they don't. They're different


00:16:00 - 00:17:10
there, especially if we're doing true agent-to-agent integration, with MCP calls, with our agents and bots; they shift. They're probabilistic, multi-step, and dynamic. You can't just log a single request and call it a day. It doesn't work like that anymore. Latency balloons across agent chains; we've seen that time and time again. Prompt drift and hallucinations silently degrade quality over and over again. We're all familiar with hallucinations,


00:16:35 - 00:17:46
especially when we're seeking fact-based responses from LLMs, and we know that we need to employ significant RAG and fine-tuning in order to mitigate that. And of course, one of the biggest problems of all is that sensitive data flows through prompts every second. We can't always gatekeep what people are going to put into an agentic system or an AI application. And oftentimes it's in ways that compliance teams can't even see, and there are unique architectures


00:17:11 - 00:18:23
that have emerged to deal with this, whether it's pre-filtering, pre-scrubbing, or otherwise. On the observability side, you really need specialized observability for this. You need to be able to see where token usage and risks are going over time, what the trends and directions are, and you need to be able to account for them. There's a whole different set of, I'll say, data points that you need to be concerned with and that you need to be able


00:17:47 - 00:19:06
to manage. This is the blind spot that this shift in the AI universe has introduced, and where groundcover steps in. So what is LLM observability? What is it and how does it differ? At the end of the day, it's like observability of anything: we're looking to understand what's going on. The shift is that now we have to employ a practice of monitoring and analyzing how these large language models behave in production, just like you'd observe an application, but


00:18:26 - 00:19:51
with new dimensions. I mentioned this before: new data points, new dimensions, whether we think about them in terms of trace attributes, different metrics that we may be looking at, or different features in logs. So at groundcover what we have done is configure a way to actually collect specific LLM telemetry like prompts, completions, model latency and token usage. These are really important. This allows you to find patterns, and for us to surface things like


00:19:08 - 00:20:17
which models perform best, which prompts degrade, and where costs or errors spike. One of the other hidden things that we haven't talked about as much is token usage and the expense of interacting with LLMs, especially when we're using ones that are not our own, that aren't deployed inside of our own agentic systems. These are public models that we're consuming; we have a contractual relationship with them, and there's an expense for every single interaction.
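
As a rough illustration of why tracking token usage matters, here is a small sketch that turns captured token counts into an estimated spend per model. The per-token prices, model names, and usage records are hypothetical placeholders; substitute whatever your provider contract and your own telemetry actually say.

```python
# Hypothetical illustration: estimate LLM spend from captured token counts.
# Prices, model names, and usage records are made-up placeholders, not real rates.
from collections import defaultdict

PRICE_PER_1K_TOKENS = {          # (input, output) USD per 1,000 tokens, assumed
    "model-a": (0.003, 0.015),
    "model-b": (0.0005, 0.0015),
}

# Each record mimics what an LLM-aware sensor might capture per request.
captured_requests = [
    {"model": "model-a", "input_tokens": 1200, "output_tokens": 350},
    {"model": "model-b", "input_tokens": 800,  "output_tokens": 150},
    {"model": "model-a", "input_tokens": 4000, "output_tokens": 900},
]

spend = defaultdict(float)
for req in captured_requests:
    in_price, out_price = PRICE_PER_1K_TOKENS[req["model"]]
    spend[req["model"]] += (req["input_tokens"] / 1000) * in_price \
                         + (req["output_tokens"] / 1000) * out_price

for model, cost in spend.items():
    print(f"{model}: ${cost:.4f}")
```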


00:19:43 - 00:21:00
So the goal at the end of the day isn't just visualization. And this is how groundcover thinks about maturity with observability in general. It's not just about being able to look and say, "Oh, we see the system and we see the interoperability." It's the ability to ask questions and derive insights from them that lead to continuous improvement; that is the next generation of observability. I like to think of this as a feedback loop. It is


00:20:21 - 00:21:36
collect, analyze, optimize in its most simple form, but we want to democratize and flatten access to this observability data. We don't want to gatekeep any of it behind things like esoteric query languages. We make it very simple for you to discover exactly what you need and what features are inside of your systems, so that you can iterate and incorporate what's going on, say, in production into the development of your products. That loop is also how AI applications


00:20:59 - 00:22:23
become more reliable, just like any others, but also cost-efficient and production-ready at the end of the day. So we have actually taken the classic pillars of observability, metrics, logs, traces and events, and we've extended them for the AI era. This is a groundcover view of the four pillars of LLM observability. Number one: prompts and feedback. These are at the heart of LLM applications, and logging them lets you identify drift, template failure and quality degradation at the end of the


00:21:40 - 00:23:04
day. Number two: tracing is still an important signal. We want to follow every step from prompt assembly to retrieval to output parsing, so you can find out exactly where latency or errors are appearing. Third: latency and usage. Really important, especially the usage metric. LLMs are expensive and they're slow. Tracking token usage, as well as model versions and response time, gives you direct cost control as well as valuable feedback for improvement and brokering.


00:22:22 - 00:23:38
Finally: retrieval analysis, for things like RAG, retrieval-augmented generation. We want to be able to measure speed, cost and, soon, relevance, which we think is the next frontier of AI observability. So we want to give you the ability to monitor every single LLM interaction. Now, go back to what I was talking about with bring your own cloud and the freedom that it offers. You don't have to sample; you can keep everything. It's inside of your own infrastructure, and we're not charging you an ingestion cost whatsoever.
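
To make those four pillars a bit more tangible, here is a hedged sketch of how a single LLM interaction could be modeled as a trace span, using the standard OpenTelemetry Python SDK and attribute names borrowed from the incubating OpenTelemetry GenAI semantic conventions. This illustrates the data shape (prompt metadata, token usage, latency, retrieval timing); it is not groundcover's internal schema, and the exact attribute names may differ between convention versions.

```python
# Illustrative only: one LLM call modeled as a span with GenAI-style attributes.
# gen_ai.* names follow the incubating OpenTelemetry GenAI semantic conventions;
# the llm.prompt_template.* and retrieval.* attributes are custom examples.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-demo")

with tracer.start_as_current_span("chat example-model") as span:        # pillar 2: tracing
    span.set_attribute("gen_ai.system", "anthropic")                    # assumed provider
    span.set_attribute("gen_ai.request.model", "example-model")         # placeholder model name
    span.set_attribute("gen_ai.usage.input_tokens", 1245)               # pillar 3: usage
    span.set_attribute("gen_ai.usage.output_tokens", 310)
    span.set_attribute("llm.prompt_template.version", "checkout-v3")    # pillar 1: prompts/feedback
    span.set_attribute("retrieval.duration_ms", 42)                     # pillar 4: retrieval analysis
```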


00:23:03 - 00:24:17
So now you can bring every single interaction back and use it analytically to answer questions and improve your product life cycle. We capture every single LLM request and response. We capture the full payload, all the token counts, latency, and even error patterns. And we do it with our eBPF sensor automatically. There's no proxy, no SDK, no wrapping your code. It all happens transparently at the kernel level. So this means that you can debug a


00:23:41 - 00:24:55
failing LLM output with the same depth that you debug a 500 error: full context, full fidelity and zero friction. Once you have this visibility, you immediately gain control over one of the hardest parts of AI operations, which is cost. I mentioned this before, but this comes up time and time again. What we have seen with token usage from LLM interactions is that it varies wildly across services, routes, even model versions. When you can actually see that data, you can optimize intelligently;


00:24:20 - 00:25:39
maybe you don't need GPT-4 or Sonnet for a classification task. I laugh a little bit, thinking back to the days when machine learning was still considered AI, but it's certainly true: we see people employing AI and LLMs to do tasks that could easily be served by statistical methods instead of large language models. On top of that, we help teams monitor things like latency trends, error patterns and model performance, which is vitally important. So when engineers can


00:25:00 - 00:26:11
correlate performance, and when they can understand the spend directly, they make better design decisions. So when we go back to the mention I made of high maturity, of democratizing access to data, this is where that comes in. We want this capability available to all aspects of the organization, not just operations, not just the response to what went wrong. We want people to be thinking more in terms of experimentation.


00:25:35 - 00:26:51
How does this change affect these different signals? And what we've seen is that when developers can make insightful decisions about how they're employing this, they can do things like reduce AI bills, we've actually seen it as high as 30% or more, and coming from the lens of performance rather than cost. Truly impressive indeed. So the magic moment for most teams is when they open a single LLM trace inside of groundcover. And I'll show you this in a live demo in a second; I


00:26:14 - 00:27:35
have some screenshots here in this slide. But it really does come together, because you get to see the entire journey: which model was called, what retrieval happened, how long the tool call took, and of course where a misconfiguration may have occurred. In this example, a model mismatch error ("GPT-2 large" does not exist) is immediately traceable to a specific service. No guesswork, no digging through logs, no SDK stack trace. This is how debugging AI becomes as


00:26:53 - 00:27:57
easy as debugging a web request. The other thing I want to talk about is security, because I think this is a really important area for a lot of organizations, and it's what has prevented so many from really getting into this. So many customers that we call on, when we ask "are you developing with AI yet?", say no, it's on the roadmap, we want to, we have budget for it, but we haven't because of security and compliance concerns. One of the chief drivers for


00:27:27 - 00:28:43
this is that they don't have any way of seeing the data, but also that they're worried about where spillover of this data may go. Even if they could observe it, does that then open up additional risk? To put it bluntly, what we've seen is that this is non-negotiable for our customers and for the world, and we believe that as well. With us, we understand that every LLM prompt could contain sensitive data, customer information, PII or even intellectual


00:28:05 - 00:29:15
property. And the last thing that you need is a vendor copying that data into their infrastructure. So with bring your own cloud, all that telemetry stays within your VPC, within your infrastructure environment. None of it is ever copied to our infrastructure or externally to us. That's how you get observability and compliance in the same sentence. Contrast this with, say, an SDK-based tool that proxies your LLM traffic or ships the full payload to their SaaS platform.


00:28:41 - 00:29:56
Bring your own cloud means you get observability without creating new security vulnerabilities. And what we say is that the architecture difference isn't subtle; honestly, it's everything. Where others require SDKs, we're able to use eBPF. Where they ship your data to their cloud, we stay inside of yours. We don't gate features or throttle visibility. You get 100% capture, flat pricing, and a single platform that handles infra, apps and AI altogether. So this really isn't about feature


00:29:20 - 00:30:27
parity. It's a structural advantage, and it's why we can say with confidence that no one else can deliver this level of visibility without trade-offs. When something goes wrong in an LLM workflow, latency spikes or errors spike, what's the cause? Is it the model provider, a database bottleneck, or network congestion? With siloed tools, you're guessing. With groundcover, you see it instantly. We correlate LLM metrics with infrastructure and application data automatically. This means faster mean time to resolution,


00:29:55 - 00:31:09
clear ownership, no finger-pointing between ML, DevOps or your platform teams. At the end of the day, the reason teams adopt groundcover isn't just performance, it's truly peace of mind. They get complete visibility, deploy in minutes, and never have to worry about hidden quotas or rising bills. So, as workloads grow, especially AI workloads, they know exactly what's happening, where it's going, and how much it's costing. And now I'd like to shift gears a little bit and drop into


00:30:34 - 00:31:38
groundcover and show you a demonstration of the product directly, so you can understand how LLM observability functions in the real world. Okay. So what I've gone ahead and done is pulled up my groundcover demonstration environment. If you're new to groundcover, a few things I'll explain to you. This is what I'll call the homepage; it is the standard default page most individuals see when they come into groundcover.


00:31:09 - 00:32:25
They get a view of workloads, and of course a view of any clusters that are wired up and any environments that may be tied to those. We talked a little bit about that eBPF sensor; let me go into this. I'll pick a single cluster as an example. Let's take production, and in namespaces we have a groundcover namespace. Now, there's obviously more than one thing running. Our sensor is the most important; this is our eBPF intellectual property. These other elements are


00:31:47 - 00:32:59
supporting. A few of them are optional; a few come out of the box. I call them the supporting cast, or if you will, the supporting band behind the lead singer. Some of these, like the VictoriaMetrics agent, allow us to do things like custom-metric scraping inside of a Kubernetes cluster. The K8s watcher gives us insight into the Kubernetes topology, and Vector and the metrics ingestor help us with pipelines to bring data back in. This is not the most important feature


00:32:23 - 00:33:34
here, but it is good to understand architecturally. All together in concert, using the band analogy again, they work together to do that contextualization and enrichment of data all the way from the source back to the sink, and give us the ability to quickly and easily build analytical heuristics that are meaningful for you to derive context from. Next: there's a lot of other really interesting rich telemetry you see out of the box here too, things like


00:32:58 - 00:34:21
request errors and latency across these workloads. And if I go into any one of these workloads, let's just pick one like shop service as an example, you get other rich telemetry too, whether it's a timeline that can include changes, pod status, warning events, and crashes, if there are live issues, as well as resources: pod, node, noisy neighbors. We also do API discovery for services using eBPF. Then we can build maps, contextual views of how these are connected, and whether there are


00:33:40 - 00:34:51
any issues with shop service communicating with order service and how that may be affecting things upstream. In this case, this is actually being affected by a downstream issue on order service. We also give you network telemetry: throughput, workload partners, cross-availability-zone connections and more. The thing that I will say about all of these views is that a savvy observability user may look at this and say these are awesome dashboards. And that's exactly right. That's what they


00:34:14 - 00:35:41
are. These are, as I put it before, analytical heuristics on top of this observability data that we give you out of the box, because we know that these are very relevant. But where we can really take things is allowing you to explore this data yourself. And this is everything from APIs, logs, traces, events, or, in our data explorer, even metrics. I'm going to start with traces. Our trace area is a query interface where you can drop in key-value-based filtering to search across


00:34:58 - 00:36:10
relevant spans and/or traces and trace data. Oh, I actually have one here. Since we're talking about AI and that's such a focus for our times, let's zero in on span type Anthropic and see what we can see. So I found I do have at least one service that we see multiple spans coming in from, and I can zoom in and out on time. I can also expand this to include searches across other clusters. But eBPF, very nicely, as you can see from the facets that are respecting my filter on the


00:35:33 - 00:36:43
left, discovers the protocol, as many trace libraries would. Ours does this out of the box, and that includes an AI protocol, in this case Anthropic. Where this really comes together is when I click into shop service. Now there's some interesting detail. This is a very beautiful view, but I want to slide over here, I apologize, and zoom in on just this aspect of things. So this is, and I'm going to use this term, I'm going to


00:36:10 - 00:37:24
overuse it, hopefully you get tired of it by the end of this presentation, but this is beautiful context for us. And this is an out-of-the-box experience. What you have captured here is the Anthropic request details. This is, again, another overused term, a built-in heuristic that we have developed for you, where we can understand what the request model was, what the response model was, the input, output and total tokens used for this interaction, and then also, most importantly, the chat experience itself.


00:36:46 - 00:38:19
We can also give you the full request payload, and give you visibility into the headers used for the request and the response. What's really important to understand here is that these responses are structured, and that structure gives us the ability to identify things like PII. With groundcover you have a field here, is PII, and if we say is PII true, we'll find out that yes, there are a number of services that we've detected that have PII, not necessarily


00:37:33 - 00:38:35
interacting with AI. So juxtapose that with what we saw before with the Anthropic one. This one's a little bit more of a simple span view, but what we do see inside of it is that in our response we have a user ID, which could be considered PII, as well as the email and/or name. We give you mechanisms to obfuscate that data. We can do this at ingestion; we can do this at the sensor layer as well, in the sensor ecosystem, as part of that orchestra. So it's very powerful, giving you all the security controls that you need, and this extends into the payloads that may be observed interacting with an LLM.
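
As a rough illustration of what that kind of obfuscation can look like (this is a generic sketch, not groundcover's sensor-level mechanism), here is a small redaction pass that scrubs email-shaped strings and a few hypothetical sensitive field names from a captured payload before it is stored or displayed.

```python
# Generic illustration of payload scrubbing before storage/display.
# Field names ("user_id", "email", "name") and patterns are hypothetical
# examples, not groundcover's actual redaction rules.
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SENSITIVE_KEYS = {"user_id", "email", "name"}

def scrub(value):
    """Recursively mask sensitive keys and email-shaped strings."""
    if isinstance(value, dict):
        return {k: "[REDACTED]" if k in SENSITIVE_KEYS else scrub(v)
                for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    return value

captured_payload = {
    "user_id": "u-1842",
    "message": "Order confirmation sent to jane.doe@example.com",
    "items": [{"sku": "A12", "qty": 2}],
}
print(json.dumps(scrub(captured_payload), indent=2))
```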


00:38:04 - 00:39:20
So I want to go back to LLMs again for a second, because while we give you this beautiful view inside of traces, what's really important is that we have all of the raw fundamental data at your fingertips as well. This comes together when we want to build dashboards or just do data exploration in general, since our dashboards are built on top of our query


00:38:41 - 00:39:54
explorer. Let's jump into a purpose-built LLM observability dashboard that we've created. In this case we have managed to build lots of different visualizations and charts that I think are really important when it comes to LLMs and AI, something like total LLM calls within a specific time period. Our dashboards also give you the ability to build variables, so you can use those and they'll drive the widgets as well.


00:39:18 - 00:40:32
So you can focus on LLM observability as a whole across your entire estate, or focus on LLM observability for specific clusters, namespaces, applications, whatever it is that you want to feed into this. For example, if I go out six hours, I see a whole lot more total LLM calls than I did in that five-minute period. And if I open this up and edit it, I get a view of what my query interface looks like in order to build these. Because we're doing tracing at the eBPF layer, the


00:39:54 - 00:41:15
fundamental units of a trace are spans. We categorize spans by type: span type OpenAI, span type Anthropic, etc. And that allows us to do simple things. This one is a very simple visualization; we're simply counting across all of these by span type. Juxtapose that with something that has a little bit more complexity: average output tokens per second. This is a really important, fundamental metric, I would say, for understanding model performance. There have been


00:40:36 - 00:41:54
papers written on this, but the two, time to first token and output tokens per second, tend to be the ones you find to be the most critical to understanding. Well, wonderful: we have a metric that our eBPF sensor generates, groundcover GenAI response usage output tokens. We also have another one, groundcover workload latency seconds. And using these, all I did was query them and average by GenAI request model. And for this one I wanted to look at the 50th percentile; that's what I wanted


00:41:15 - 00:42:30
to look at. So the 50th quantile, if you will, and then I'm just dividing them; I'm doing a over b. That is an awesome formula that you can build yourself to find out what the average output tokens per second is by model, and that's what we have here. We only have one model being consumed in this demo environment. In the past, if I went back, I could show you several different models being used. In fact, nah, I won't go back just yet, but you would actually see different groupings.
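
For readers who want to reproduce that "a over b" idea outside the UI, here is a rough sketch against a Prometheus-compatible HTTP API. The metric and label names (groundcover_genai_response_usage_output_tokens, groundcover_workload_latency_seconds, genai_request_model) are assumptions inferred from what is described in the demo, so adjust them to whatever your backend actually exposes.

```python
# Rough sketch (not groundcover's query language): median output tokens divided
# by median request latency, per model, via a Prometheus-compatible HTTP API.
# Metric/label names below are assumptions inferred from the demo.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # hypothetical endpoint

def instant_query(promql: str) -> dict:
    """Run a PromQL instant query and return {model: value}."""
    resp = requests.get(PROM_URL, params={"query": promql})
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return {r["metric"].get("genai_request_model", "unknown"): float(r["value"][1])
            for r in results}

# 50th quantile of output tokens and of request latency, grouped per model.
tokens = instant_query(
    'quantile by (genai_request_model) '
    '(0.5, groundcover_genai_response_usage_output_tokens)')
latency = instant_query(
    'quantile by (genai_request_model) '
    '(0.5, groundcover_workload_latency_seconds)')

# "a over b": average output tokens per second, per model.
for model, tok in tokens.items():
    if model in latency and latency[model] > 0:
        print(f"{model}: {tok / latency[model]:.1f} output tokens/sec")
```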


00:41:53 - 00:43:01
This is the second one that we see most popularized by those in the space telling us how we need to do analytics for LLM performance: average time to first token. So, if I edit here, again we're just using a metric. In this case we easily could have used a trace-derived filter as well, but we do generate this workload latency seconds metric. So I'm simply filtering to make sure that I'm interacting with the right systems, and


00:42:27 - 00:43:36
again I'm using the 50th quantile, and I'm grouping by GenAI request model. So I get awesome, powerful visibility. All of our visualizations you can build in whatever way is meaningful for you. Time series gives us the ability to build these as lines or even stacked bars if we wanted to shift that, with easy updates. So anything that you need to do to understand things analytically, you can do. Another feature of groundcover: there are very popular observability


00:43:02 - 00:44:04
purpose-built analytics tools like Grafana out there, and we have an embedded Grafana dashboard. So if there are questions that you want to ask and build visualizations for that we don't yet support, and we're iterating at a very rapid clip in order to reach complete feature parity, you can certainly get to them this way. So again, I want to take you back to some of those fundamental components that I talked about, which are that we want to give you access, at the end of the day,


00:43:33 - 00:44:58
to all of the relevant metrics and interaction details that you may have. If I just look here and type AI, I've got GenAI metrics that are being generated by my sensor. In fact, I'll simplify it this way: response usage. I've got response usage total tokens, output tokens, input tokens, anything I want to slice and dice. This applies to traces as well; this could be looking for a span type where


00:44:15 - 00:45:26
I can look and find these, and then I can find other specific features. We include these metrics inside of our traces as well. So I can build the visualizations and ask the questions I need to ask about the things that are going on. Fundamentally, that is one of the most important features: we can get visibility into what's going on. The second is that this lives inside of your own cloud. This gives you the security of knowing that while we're able to help


00:44:51 - 00:45:54
you get visibility into every single transaction, you're not going to be penalized by the cost of ingesting them. And you're also not going to be penalized in terms of creating security vulnerabilities, because this data is still yours and it's still living inside your environment. So you've got security there. And then we have other built-in features, like that ability to scrub PII or other parts of the chat interaction that you may want to obfuscate from use, yet


00:45:23 - 00:46:49
still provide the ability to do holistic performance analysis and other insightful analysis about how people are interacting with and using the chat experience, all at your fingertips. So LLM observability should not be thought of as a frightening, foreboding thing anymore. That about sums up our demo. What I'd like to do next is answer any questions. I'll take just a quick second and see if there are any outstanding questions that folks may have out there from the field.


00:46:06 - 00:47:21
So I do have a question. Looks like this is not necessarily an AI or LLM one, but somebody is asking: "So you mentioned that you do ingest OpenTelemetry data. How does that work with eBPF?" That's a great question. This is a little outside LLMs, but if we go back to traces, this is the easiest way for me to showcase it. We have multiple different source types. We take in span data from OTel, eBPF, RUM; we can also take it in from various


00:46:44 - 00:47:40
different others and create visualizations for this. So if I just say source OpenTelemetry, what I find is that I've ingested these. Now, these could be coming in from lots of different systems. These could be workloads running in a Kubernetes cluster; they could additionally be workloads that are not running in a Kubernetes cluster, maybe Lambdas, or virtual machines, or other serverless architectures, really anywhere they could be running. This particular one,


00:47:12 - 00:48:30
I'll pick on catalog service, and it happens to be running on a Kubernetes cluster in my demo environment. Now, because I came in through here, I'll go back really quick just to highlight something: there is a difference between trace views and span views; we find span analytics to be more important. So I'm coming in at the span layer, and this takes me to the span in question, but it puts it in context of the full trace. These spans are all connected by trace ID. But how I know this is running in


00:47:51 - 00:48:58
Kubernetes is, well, there are a few different things, other than that it told me it was running on a cluster, respected my filters, etc. But I also get this enriched by eBPF, because our eBPF sensor is able to run on your Kubernetes cluster as a DaemonSet. You can actually send your OTel data directly through it, and we have a great better-together story: we enrich the OTel data. So you still get the full payload, request and response, and you get the headers as well.
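
For context, here is a minimal sketch of what sending OpenTelemetry data "through it" could look like from the application's side, using the standard OpenTelemetry Python SDK with an OTLP exporter pointed at a collector endpoint on the local node. The endpoint address and service name are placeholders, not a documented groundcover configuration; point the exporter at whatever OTLP receiver actually runs in your environment.

```python
# Minimal OpenTelemetry SDK setup: export spans over OTLP/gRPC to a local
# collector endpoint (placeholder address; substitute the OTLP receiver that
# actually runs on your node, e.g. a DaemonSet-hosted sensor or collector).
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "catalog-service"})  # example name
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("catalog-service")
with tracer.start_as_current_span("list-products") as span:
    span.set_attribute("catalog.page", 1)  # ordinary span attribute
```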


00:48:25 - 00:49:40
If there are any span attributes, you get an additional field over here. You still get all your context: the ability to understand where this trace lives in the timeline context of your system, metrics and logs. Many organizations, we find, are on a journey to mature toward either standardizing on OTel APIs and SDKs or embedding trace IDs into their logs, and maybe they're not there yet. We can still correlate even without trace-ID-based correlation. So it's pretty powerful. You get, of course,


00:49:02 - 00:50:12
visibility from a flame-chart perspective into where the bottlenecks were. You still get a full service map with OTel, and of course you get a holistic view of the spans. You can build any analytics you want across this too, and any of the features you get in terms of creating views of sets of spans over timelines, anomaly detection or error detection insights, attribute drill-down, are all here. You can build these filters and create monitors, or even open in the data explorer to do additional analytics and build


00:49:36 - 00:50:38
dashboards from. So, wonderful question, I love answering that one. And I apologize, I would love to answer more questions, but it looks like we're running out of time. So with that, I do want to close and say thank you very much. I appreciate you spending all this time with me. I'm sorry we ran out of time. If there are additional questions, please let us know and we'll follow up with you individually. Hopefully this gives you a much better understanding of how bring your own cloud can unlock you,


00:50:08 - 00:50:35
give you freedom from SaaS, and give you all the value that you need as we enter this AI era. Again, Adam Hicks with groundcover. I look forward to talking to you again, and have a wonderful day.




Trusted by teams who demand more

Real teams, real workloads, real results with groundcover.

“We cut our costs in half and now have full coverage in prod, dev, and testing environments where we previously had to limit it due to cost concerns.”

Sushant Gulati

Sr Engineering Mgr, BigBasket

“Observability used to be scattered and unreliable. With groundcover, we finally have one consolidated, no-touch solution we can rely on.”

ShemTov Fisher

DevOps team lead
Solidus Labs

“We went from limited visibility to a full-cluster view in no time. groundcover’s eBPF tracing gave us deep Kubernetes insights with zero months spent on instrumentation.”

Kristian Lee

Global DevOps Lead, Tracr

“The POC took only a day and suddenly we had trace-level insight. groundcover was the snappiest, easiest observability platform we’ve touched.”

Adam Ceresia

Software Engineering Mgr, Posh

“All vendors charge on data ingest, some even on users, which doesn’t fit a growing company. One of the first things that we liked about groundcover is the fact that pricing is based on nodes, not data volumes, not number of users. That seemed like a perfect fit for our rapid growth”

Elihai Blomberg,

DevOps Team Lead, Riskified

“We got a bill from Datadog that was more than double the cost of the entire EC2 instance.”

Said Sinai Rijcov,

DevOps Engineer at EX.CO.

“We ditched Datadog’s integration overhead and embraced groundcover’s eBPF approach. Now we get full-stack Kubernetes visibility, auto-enriched logs, and reliable alerts across clusters with zero code changes.”

Eli Yaacov

Prod Eng Team Lead, Similarweb

Make observability yours

Stop renting visibility. With groundcover, you get full fidelity, flat cost, and total control — all inside your cloud.