A practical guide for developers and platform engineers who want signal, not noise. Includes cross-links to the groundcover Learn Center for operational deep-dives.
Kubernetes generates a staggering volume of metrics. Nodes, pods, containers, namespaces, control-plane components, the scheduler, etcd: all of them emit data continuously. The challenge isn't collecting metrics. It's knowing which ones actually matter, what they're telling you, and what to do when they cross a threshold.
This guide is for developers and platform engineers who are building or maturing their Kubernetes observability strategy. We'll cover the metrics that matter most, how they relate to each other, and how to build a monitoring setup that surfaces real problems without burying you in alerts.
groundcover monitors all of these metrics automatically, without instrumentation, configuration files, or SDK integrations. But we'll get to that. First, the fundamentals.
Why Kubernetes metrics are harder than they look
Traditional application monitoring is relatively straightforward: you watch CPU, memory, and request latency on a fixed set of servers. Kubernetes changes all of that.
Resources are ephemeral. Pods come and go. Nodes scale up and down. A container that was healthy thirty seconds ago might not exist anymore. This dynamism is a feature, but it means your metrics need to capture the behavior of a constantly changing system, not a static one.
There are also multiple layers to monitor:
- The infrastructure layer: nodes, CPU, memory, disk, network
- The workload layer: pods, containers, deployments, DaemonSets
- The control-plane layer: the API server, scheduler, etcd, controller manager
- The application layer: request rates, error rates, latency (the signals your users actually feel)
Each layer tells a different part of the story. A spike in CPU at the node level might be causing pod throttling at the container level, which is causing latency spikes at the application level. You need visibility into all three layers to diagnose the problem quickly.
The metrics that actually matter
There are hundreds of Kubernetes metrics. Here are the ones that experienced platform engineers monitor first, organized by layer.
Node-level metrics
Nodes are the physical (or virtual) machines that run your workloads. If a node is unhealthy, every pod on it is at risk.
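As an illustration, assuming the standard Node Exporter metric names (the same ones listed in the metric-sources note at the end of this guide), the two node signals most teams query first might look like:

```promql
# Fraction of memory still available on each node; alert when this falls low
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Per-node CPU utilization: 1 minus the idle rate, averaged across cores
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Exact label names (`instance`, `mode`) depend on your scrape and relabeling configuration.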
Pod and container metrics
Pods are the atomic unit of Kubernetes workloads. Container-level metrics tell you what's actually consuming resources and what's at risk of failure.
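As a sketch, using the cAdvisor metrics this guide references, the two container-level queries to start with might be:

```promql
# CPU actually consumed by each container, in cores
rate(container_cpu_usage_seconds_total{container!=""}[5m])

# Seconds per second spent throttled by the CFS quota;
# sustained non-zero values surface as application latency
rate(container_cpu_cfs_throttled_seconds_total{container!=""}[5m])
```

The `container!=""` filter (an assumption about your cAdvisor labels) excludes the pod-level aggregate series.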
⚠️ The OOMKill and CrashLoopBackOff connection
High container_memory_working_set_bytes approaching the memory limit is the leading indicator for OOMKilled pods. If you're responding to CrashLoopBackOff events reactively, memory metrics are where you should be looking proactively. For a step-by-step debugging guide when these failures occur, see the groundcover Learn Center.
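A sketch of the proactive check, assuming cAdvisor and kube-state-metrics share `namespace`, `pod`, and `container` labels in your setup:

```promql
# Working set as a fraction of the memory limit;
# values approaching 1.0 are the leading indicator for OOMKills
container_memory_working_set_bytes{container!=""}
  / on (namespace, pod, container)
    kube_pod_container_resource_limits{resource="memory"}
```

Alerting on this ratio at, say, 0.9 gives you warning before the kernel's OOM killer fires.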
Deployment and workload metrics
These metrics tell you whether your desired state matches your actual state, which is the central question Kubernetes is always trying to answer.
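The desired-versus-actual comparison can be expressed directly with kube-state-metrics. A minimal sketch:

```promql
# Deployments where available replicas lag the desired count;
# any non-zero result means actual state has drifted from desired state
kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0
```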
Resource quota and limits metrics
Resource limits and requests are how Kubernetes schedules work and prevents resource contention. If you haven't set them, or set them incorrectly, you're flying blind on capacity. For a deeper look at how to configure these, see Understanding Kubernetes resource limits and requests.
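For namespaces governed by ResourceQuotas, consumption against the quota can be sketched with the `kube_resourcequota` metric mentioned later in this guide:

```promql
# Quota consumption per namespace and resource: used over hard limit;
# values near 1.0 mean new pods will be rejected
kube_resourcequota{type="used"}
  / ignoring (type) kube_resourcequota{type="hard"}
```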
Control-plane metrics
The control plane is the brain of your cluster. Most teams don't monitor it until something breaks. By then, diagnosing the issue is much harder.
How these metrics connect to each other
The most useful insight you can get from Kubernetes metrics isn't a single number. It's the correlation between layers. Here are the diagnostic chains that experienced SREs follow first:
Latency spike → Check container_cpu_cfs_throttled_seconds_total (CPU throttling), then container_memory_working_set_bytes (memory pressure), then node_cpu_seconds_total (node saturation). Latency spikes are often a container resource story.
Pods stuck in Pending → Check kube_pod_container_resource_requests against available node capacity. Pending pods usually mean the scheduler can't find a node that meets the resource request: either the request is too high or nodes are full.
Unexpected cost increase → Compare kube_pod_container_resource_requests to container_cpu_usage_seconds_total and container_memory_working_set_bytes. If your requests significantly exceed actual usage, you're over-provisioning and paying for unused capacity.
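Two of these chains can be sketched as queries. The Pending check uses `kube_pod_status_phase`, a kube-state-metrics series not named above but part of the same `kube_pod_*` family:

```promql
# Pods stuck in Pending: a sustained non-zero count means
# the scheduler can't find a node that satisfies the requests
sum(kube_pod_status_phase{phase="Pending"})

# Over-provisioning check: requested CPU vs. CPU actually used, per namespace;
# a large positive gap is capacity you're paying for but not using
sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
  - sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
```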
For a strategic guide to controlling resource costs, see Kubernetes cost monitoring: tracking and reducing spend. For the operational mechanics of how autoscaling responds to resource metrics, the Learn Center article on Kubernetes Cluster Autoscaler covers how the autoscaler reads these signals to make scaling decisions.
What most teams get wrong about Kubernetes metrics
Looking at how teams typically set up monitoring, a few mistakes come up consistently:
Alerting on metrics without context
CPU at 80% isn't inherently bad. It depends on whether that's normal for this workload at this time of day, whether it's trending upward, and whether there's throttling. Metrics without baselines generate noise. Alerts should fire on deviation from expected behavior, not absolute thresholds.
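One way to alert on deviation rather than an absolute threshold is to compare a workload against its own recent baseline. A sketch, using a Prometheus subquery and an illustrative 1.5x multiplier:

```promql
# Fire when a container's CPU usage exceeds 1.5x its own
# trailing-24h average, rather than a fixed percentage
rate(container_cpu_usage_seconds_total{container!=""}[5m])
  > 1.5 * avg_over_time(
      rate(container_cpu_usage_seconds_total{container!=""}[5m])[1d:5m]
    )
```

The multiplier and lookback window are assumptions to tune per workload; the point is that the threshold is relative, not absolute.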
Watching nodes, not containers
Node-level averages hide container-level problems. A node might show 60% memory utilization while one container is at 95% of its limit and about to be OOMKilled. Always monitor at the container level in addition to the node level.
Missing request/limit parity
Many teams set resource requests and limits once and forget them. As workloads grow and change, those numbers become wrong, leading to either wasteful over-provisioning or dangerous under-provisioning. Comparing container_cpu_usage_seconds_total against kube_pod_container_resource_limits is how you catch this drift.
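A drift check can be sketched as the ratio of observed peak usage to the configured limit (the 7-day window is illustrative, and long subqueries like this are cheaper as recording rules):

```promql
# Peak 5m-average CPU usage over the past week vs. the configured limit;
# ratios well below 1.0 suggest the limit no longer matches reality
max_over_time(
  rate(container_cpu_usage_seconds_total{container!=""}[5m])[7d:5m]
)
  / on (namespace, pod, container)
    kube_pod_container_resource_limits{resource="cpu"}
```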
No control-plane visibility
The API server and etcd are invisible to most monitoring setups because they require access to metrics endpoints inside the cluster control plane. This leaves a major blind spot: control-plane degradation often manifests as mysterious pod scheduling delays and kubectl timeouts before the root cause is clear.
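If you do have access to the control-plane metrics endpoints, two starting signals might look like the following (the etcd query assumes the API server's `etcd_request_duration_seconds` histogram is being scraped):

```promql
# API server error rate: share of requests returning 5xx
sum(rate(apiserver_request_total{code=~"5.."}[5m]))
  / sum(rate(apiserver_request_total[5m]))

# 99th-percentile latency of API server requests to etcd;
# rising values here precede slow scheduling and kubectl timeouts
histogram_quantile(0.99,
  sum by (le) (rate(etcd_request_duration_seconds_bucket[5m])))
```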
How groundcover approaches Kubernetes metrics
groundcover collects all of these metrics automatically using an eBPF-based sensor that runs at the node level, with no instrumentation, no SDK integrations, and no configuration files for each new service. Every pod and container in your cluster is visible from the moment it starts, including control-plane components.
The platform surfaces the diagnostic chains described above natively: when a latency spike occurs, you can immediately see the correlated resource metrics from the same container, on the same timeline, in the same view. You don't have to jump between dashboards or write PromQL to connect the dots.
groundcover also monitors resource requests and limits relative to actual consumption across your entire cluster, giving you a continuous right-sizing view. Over-provisioning becomes visible before it becomes a surprise at billing time. For a full technical breakdown of how this works, see How groundcover monitors Kubernetes without instrumentation.
A note on metric sources
The metrics in this guide come from three separate systems that most teams run together under Prometheus. It helps to know which is which:
Node Exporter supplies the node-level metrics (node_cpu_seconds_total, node_memory_MemAvailable_bytes, node_disk_io_time_seconds_total). It runs as a DaemonSet and must be deployed separately.
cAdvisor supplies the container-level metrics (container_cpu_usage_seconds_total, container_memory_working_set_bytes, container_cpu_cfs_throttled_seconds_total). It runs embedded inside the kubelet, so no separate deployment is needed.
kube-state-metrics supplies the object-state metrics (kube_pod_*, kube_deployment_*, kube_node_status_condition, kube_resourcequota, etc.). It watches the Kubernetes API and translates object state into metrics. Like Node Exporter, it runs as a separate deployment.
The API server and scheduler metrics (apiserver_request_total, scheduler_scheduling_duration_seconds) are emitted directly by those components and documented in the Kubernetes Metrics Reference.