A practical guide for developers and platform engineers who want signal, not noise. Includes cross-links to the groundcover Learn Center for operational deep-dives.
Kubernetes generates a staggering volume of metrics. Nodes, pods, containers, namespaces, control-plane components, the scheduler, etcd: all of them emit data continuously. The challenge isn't collecting metrics. It's knowing which ones actually matter, what they're telling you, and what to do when they cross a threshold.
This guide is for developers and platform engineers who are building or maturing their Kubernetes observability strategy. We'll cover the metrics that matter most, how they relate to each other, and how to build a monitoring setup that surfaces real problems without burying you in alerts.
groundcover monitors all of these metrics automatically, without instrumentation, configuration files, or SDK integrations. But we'll get to that. First, the fundamentals.
Why Kubernetes metrics are harder than they look
Traditional application monitoring is relatively straightforward: you watch CPU, memory, and request latency on a fixed set of servers. Kubernetes changes all of that.
Resources are ephemeral. Pods come and go. Nodes scale up and down. A container that was healthy thirty seconds ago might not exist anymore. This dynamism is a feature, but it means your metrics need to capture the behavior of a constantly changing system, not a static one.
There are also multiple layers to monitor:
- The infrastructure layer: nodes, CPU, memory, disk, network
- The workload layer: pods, containers, deployments, DaemonSets
- The control-plane layer: the API server, scheduler, etcd, controller manager
- The application layer: request rates, error rates, latency (the signals your users actually feel)
Each layer tells a different part of the story. A spike in CPU at the node level might be causing pod throttling at the container level, which is causing latency spikes at the application level. You need visibility into all three layers to diagnose the problem quickly.
The metrics that actually matter
There are hundreds of Kubernetes metrics. Here are the ones that experienced platform engineers monitor first, organized by layer.
Node-level metrics
Nodes are the physical (or virtual) machines that run your workloads. If a node is unhealthy, every pod on it is at risk.
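As an illustration, assuming the standard Node Exporter metric names (the same ones listed in the metric-sources note at the end of this guide), the two node signals most teams query first might look like:

```promql
# Fraction of memory still available on each node; alert when this falls low
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Per-node CPU utilization: 1 minus the idle rate, averaged across cores
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Exact label names (`instance`, `mode`) depend on your scrape and relabeling configuration.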
Pod and container metrics
Pods are the atomic unit of Kubernetes workloads. Container-level metrics tell you what's actually consuming resources and what's at risk of failure.
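As a sketch, using the cAdvisor metrics this guide references, the two container-level queries to start with might be:

```promql
# CPU actually consumed by each container, in cores
rate(container_cpu_usage_seconds_total{container!=""}[5m])

# Seconds per second spent throttled by the CFS quota;
# sustained non-zero values surface as application latency
rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])
```

The `container!=""` filter (an assumption about your cAdvisor labels) excludes the pod-level aggregate series.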
⚠️ The OOMKill and CrashLoopBackOff connection
High container_memory_working_set_bytes approaching the memory limit is the leading indicator for OOMKilled pods. If you're responding to CrashLoopBackOff events reactively, memory metrics are where you should be looking proactively. For a step-by-step debugging guide when these failures occur, see the groundcover Learn Center.
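A sketch of the proactive check, assuming cAdvisor and kube-state-metrics share `namespace`, `pod`, and `container` labels in your setup:

```promql
# Working set as a fraction of the memory limit;
# values approaching 1.0 are the leading indicator for OOMKills
container_memory_working_set_bytes{container!=""}
  / on (namespace, pod, container)
    kube_pod_container_resource_limits{resource="memory"}
```

Alerting on this ratio at, say, 0.9 gives you warning before the kernel's OOM killer fires.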
Deployment and workload metrics
These metrics tell you whether your desired state matches your actual state, which is the central question Kubernetes is always trying to answer.
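The desired-versus-actual comparison can be expressed directly with kube-state-metrics. A minimal sketch:

```promql
# Deployments where available replicas lag the desired count;
# any non-zero result means actual state has drifted from desired state
kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0
```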
Resource quota and limits metrics
Resource limits and requests are how Kubernetes schedules work and prevents resource contention. If you haven't set them, or set them incorrectly, you're flying blind on capacity. For a deeper look at how to configure these, see Understanding Kubernetes resource limits and requests.
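For namespaces governed by ResourceQuotas, consumption against the quota can be sketched with the `kube_resourcequota` metric mentioned later in this guide:

```promql
# Quota consumption per namespace and resource: used over hard limit;
# values near 1.0 mean new pods will be rejected
kube_resourcequota{type="used"}
  / ignoring (type) kube_resourcequota{type="hard"}
```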
Control-plane metrics
The control plane is the brain of your cluster. Most teams don't monitor it until something breaks. By then, diagnosing the issue is much harder.
How these metrics connect to each other
The most useful insight you can get from Kubernetes metrics isn't a single number. It's the correlation between layers. Here are the diagnostic chains that experienced SREs follow first:
Latency spike → Check container_cpu_cfs_throttled_seconds_total (CPU throttling), then container_memory_working_set_bytes (memory pressure), then node_cpu_seconds_total (node saturation). Latency spikes are often a container resource story.
Pods stuck in Pending → Check kube_pod_container_resource_requests against available node capacity. Pending pods usually mean the scheduler can't find a node that meets the resource request: either the request is too high or nodes are full.
Unexpected cost increase → Compare kube_pod_container_resource_requests to container_cpu_usage_seconds_total and container_memory_working_set_bytes. If your requests significantly exceed actual usage, you're over-provisioning and paying for unused capacity.
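Two of these chains can be sketched as queries. The Pending check uses `kube_pod_status_phase`, a kube-state-metrics series not named above but part of the same `kube_pod_*` family:

```promql
# Pods stuck in Pending: a sustained non-zero count means
# the scheduler can't find a node that satisfies the requests
sum(kube_pod_status_phase{phase="Pending"})

# Over-provisioning check: requested CPU vs. CPU actually used, per namespace;
# a large positive gap is capacity you're paying for but not using
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
  - sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
```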
For a strategic guide to controlling resource costs, see Kubernetes cost monitoring: tracking and reducing spend. For the operational mechanics of how autoscaling responds to resource metrics, the Learn Center article on Kubernetes Cluster Autoscaler covers how the autoscaler reads these signals to make scaling decisions.
What most teams get wrong about Kubernetes metrics
Looking at how teams typically set up monitoring, a few mistakes come up consistently:
Alerting on metrics without context
CPU at 80% isn't inherently bad. It depends on whether that's normal for this workload at this time of day, whether it's trending upward, and whether there's throttling. Metrics without baselines generate noise. Alerts should fire on deviation from expected behavior, not absolute thresholds.
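One way to alert on deviation rather than an absolute threshold is to compare a workload against its own recent baseline. A sketch, using a Prometheus subquery and an illustrative 1.5x multiplier:

```promql
# Fire when a container's CPU usage exceeds 1.5x its own
# trailing-24h average, rather than a fixed percentage
rate(container_cpu_usage_seconds_total{container!=""}[5m])
  > 1.5 * avg_over_time(
      rate(container_cpu_usage_seconds_total{container!=""}[5m])[1d:5m]
    )
```

The multiplier and lookback window are assumptions to tune per workload; the point is that the threshold is relative, not absolute.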
Watching nodes, not containers
Node-level averages hide container-level problems. A node might show 60% memory utilization while one container is at 95% of its limit and about to be OOMKilled. Always monitor at the container level in addition to the node level.
Missing request/limit parity
Many teams set resource requests and limits once and forget them. As workloads grow and change, those numbers become wrong, leading to either wasteful over-provisioning or dangerous under-provisioning. Comparing container_cpu_usage_seconds_total against kube_pod_container_resource_limits is how you catch this drift.
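A drift check can be sketched as the ratio of observed peak usage to the configured limit (the 7-day window is illustrative, and long subqueries like this are cheaper as recording rules):

```promql
# Peak 5m-average CPU usage over the past week vs. the configured limit;
# ratios well below 1.0 suggest the limit no longer matches reality
max_over_time(
  rate(container_cpu_usage_seconds_total{container!=""}[5m])[7d:5m]
)
  / on (namespace, pod, container)
    kube_pod_container_resource_limits{resource="cpu"}
```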
No control-plane visibility
The API server and etcd are invisible to most monitoring setups because they require access to metrics endpoints inside the cluster control plane. This leaves a major blind spot: control-plane degradation often manifests as mysterious pod scheduling delays and kubectl timeouts before the root cause is clear.
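If you do have access to the control-plane metrics endpoints, two starting signals might look like the following (the etcd query assumes the API server's `etcd_request_duration_seconds` histogram is being scraped):

```promql
# API server error rate: share of requests returning 5xx
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
  / sum(rate(apiserver_request_total[5m]))

# 99th-percentile latency of API server requests to etcd;
# rising values here precede slow scheduling and kubectl timeouts
histogram_quantile(0.99,
  sum by (le) (rate(etcd_request_duration_seconds_bucket[5m])))
```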
How groundcover approaches Kubernetes metrics
groundcover collects all of these metrics automatically using an eBPF-based sensor that runs at the node level, with no instrumentation, no SDK integrations, and no configuration files for each new service. Every pod and container in your cluster is visible from the moment it starts, including control-plane components.
The platform surfaces the diagnostic chains described above natively: when a latency spike occurs, you can immediately see the correlated resource metrics from the same container, on the same timeline, in the same view. You don't have to jump between dashboards or write PromQL to connect the dots.
groundcover also monitors resource requests and limits relative to actual consumption across your entire cluster, giving you a continuous right-sizing view. Over-provisioning becomes visible before it becomes a surprise at billing time. For a full technical breakdown of how this works, see How groundcover monitors Kubernetes without instrumentation.
A note on metric sources
The metrics in this guide come from three separate systems that most teams run together under Prometheus. It helps to know which is which:
Node Exporter supplies the node-level metrics (node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_disk_io_time_seconds_total). It runs as a DaemonSet and must be deployed separately.
cAdvisor supplies the container-level metrics (container_cpu_usage_seconds_total, container_memory_working_set_bytes, container_cpu_cfs_throttled_seconds_total). It runs embedded inside the kubelet, so no separate deployment is needed.
kube-state-metrics supplies the object-state metrics (kube_pod_*, kube_deployment_*, kube_node_status_condition, kube_resourcequota, etc.). It watches the Kubernetes API and translates object state into metrics. Like Node Exporter, it runs as a separate deployment.
The API server and scheduler metrics (apiserver_request_total, scheduler_scheduling_duration_seconds) are emitted directly by those components and documented in the Kubernetes Metrics Reference.