Kubernetes
Noam Levy • Jun 15, 2022

Kubernetes Limits: Mastering CPU and Memory Constraints

Understand Kubernetes limits and their role in optimizing CPU and memory usage, while effectively managing dynamic workloads

May 10, 2026

You might think that limits are a Kubernetes admin’s best friend. By restricting the CPU and memory that Pods can consume, you can avoid “noisy neighbor” issues, streamline resource allocation and keep your clusters humming smoothly, right?

Well, not necessarily. Limits have their limitations. Although it’s all well and good to set limits for workloads that are predictable enough to benefit from resource consumption restrictions, limits can do more harm than good when workloads are very “spiky.”

That’s why, on the whole, we think the use of limits should be limited. Working with CPU limits requires a deep understanding of their implications, which is why some approaches recommend avoiding CPU limits altogether. It’s critical to get to know your workloads well before deciding whether to set limits and which kinds of limits to apply. With that in mind, we would like to shed some light on the use cases where CPU limits can really come in handy.

Let’s dive in…

How Kubernetes limits work

Limits in Kubernetes work in a pretty straightforward way: they restrict how much CPU and/or memory the containers in a Pod can consume. CPU and memory are the main compute resources managed in Kubernetes, and both limits and requests are set per container to control resource allocation and scheduling.

For example, to set a CPU limit of 500m for a container inside your Pod, you’d set resources.limits.cpu in that container’s spec. Container requests specify the resources the scheduler reserves for the container, while container limits define the maximum the container can use. CPU is measured in CPU units, commonly written in millicores (500m equals 0.5 of a core), and memory is measured in bytes, usually written with suffixes such as Mi (mebibytes) or Gi (gibibytes).
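As a minimal sketch of the 500m CPU limit described above, the manifest below sets it on a single container. The Pod name, container name, image, and the request and memory values are illustrative assumptions, not values from the original article.

```yaml
# Hypothetical Pod: names, image, and the request/memory values are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: cpu-limit-demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: 250m        # what the scheduler reserves for this container
          memory: 128Mi
        limits:
          cpu: 500m        # 0.5 of a core; usage above this is throttled
          memory: 256Mi    # exceeding this gets the container OOM-killed
```

Because the requests here are lower than the limits, the container can burst up to its limits when the node has spare capacity.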

If a limit is not specified for a container, it can use all available resources on the node. The container runtime, working with the kubelet, enforces these container limits and container requests using Linux cgroups to prevent runaway pods from destabilizing the node.

Using limits to reach resource consumption predictability 

The benefit of limits is obvious enough: by limiting the CPU or memory that a container or Pod can consume, you avoid situations where a workload sucks up resources, and you achieve resource consumption predictability, which is something your SRE will definitely love to see. The trade-off is that you may miss your SLOs when a workload needs more resources than its limits allow in order to do its job. Setting limits and requests for all the containers in a namespace helps ensure fair and efficient resource allocation across the cluster.

In more extreme scenarios, memory and CPU limits also help to stop an application from triggering a chain reaction that could cause your entire environment to crash. Want to make sure that the buggy development version of the app that you deployed into your dev namespace doesn’t suck up CPU to the point that your other production workloads are deprived? Limits, together with well-designed Kubernetes resource quotas at the namespace level, will help to do that by restricting how many resources the hungry app can consume. ResourceQuota can be set for the entire namespace to control the total resource usage across all the containers, preventing any single tenant or application from overusing resources. Additionally, LimitRange can enforce default limits, default requests, and default values for containers that do not specify them, ensuring consistent resource allocation policies are applied automatically.

CPU and memory limits are both hard limits, but they are enforced differently. CPU is a compressible resource: when a container hits its CPU limit, the kernel throttles its CPU usage and the container keeps running, only more slowly. Memory is a non-compressible resource: a memory limit defines the maximum RAM a container can use, and a container that exceeds it is killed with an OOM (Out Of Memory) error rather than throttled.

Why CPU limits don’t come easy

The fact that limits can help workloads share resources appropriately, and prevent performance issues in one workload from bleeding over into others, doesn’t mean that every container should have a limit. Aggressive limits can trigger Kubernetes CPU throttling, which often degrades the performance of the affected Pods.

On the contrary, limits are a bad idea for certain types of Kubernetes workloads – namely, those that have spiky, unpredictable resource consumption patterns due to fluctuations in user requests. 

Now, you may be thinking: “OK, but if I can benchmark my spiky workloads accurately to determine their maximum resource needs, then I can just cap the limits to the maximum and call it a day. Right?”

Probably not. Although in theory accurate benchmarking would be a way to configure safe limits, in practice – as the wise Tim Hockin of Google observed recently on Hacker News – “accurate benchmarking is DAMN hard… Even within Google we only have a few apps that we REALLY trust the benchmarks on.”

In other words, even if you think your benchmark tests have determined the maximum amount of CPU or memory that your workloads may require, they probably haven’t. And if you set limits based on those tests, things are probably going to happen that you don’t want to happen – namely, your limits are likely to end up being insufficient for real-world instances of peak demand. Meanwhile, if you set limits too high, you risk depriving your other workloads of the CPU and memory they require. Instead, focus on setting CPU requests according to the actual needs of your applications, and regularly evaluate and adjust these requests and limits—a process known as rightsizing—to ensure optimal performance and cost-efficiency.

The bottom line here is that CPU limits tend to be less useful than a lot of people think, unless you truly know your workload requirements (which, again, you probably don’t, even if you think you do because you did a bunch of benchmarking). In most cases, it's best to avoid using CPU limits, as they can throttle application performance, and instead define CPU requests to allow applications to access necessary resources under normal load and additional resources when available.

Squaring the circle: Balancing requests and limits

You may also be thinking: “OK, since limits are really hard to get right, I’ll instead use requests to manage resource allocation for Pods.”

That’s not a bad thought. Requests are a helpful feature for controlling resource allocation.

A request tells the Kubernetes scheduler the minimum amount of memory or CPU that should be available to your workload. For example, to request at least 64 megabytes of memory for a container, you would set resources.requests.memory in that container’s spec.
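A memory request like the one described above might look as follows. This is only a sketch: the Pod name, container name, and image are assumptions made for illustration.

```yaml
# Hypothetical Pod: only the memory request reflects the example in the text.
apiVersion: v1
kind: Pod
metadata:
  name: memory-request-demo
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          memory: 64M   # 64 megabytes; write 64Mi if you mean mebibytes
```

No limit is set here, so the container may use more than 64 megabytes if the node has memory to spare.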

The scheduler will then try to ensure that this minimum amount is available by placing the Pod on a node with enough unreserved capacity. The requests of the Pods already on a node, plus the new Pod’s requests, must fit within the node’s allocatable resources, as reported in the node’s .status.allocatable field. If no node has enough room for the Pod’s requests, the Pod remains in a Pending state until resources become available.

Resource settings (requests and limits) directly influence scheduling decisions and node utilization. Monitoring node utilization is important for efficient resource management, and tools like the metrics server can provide real-time resource usage data to inform resource allocation decisions. Kubernetes also allows for resource overcommitment, where requests can be set lower than actual usage to maximize node utilization, but this must be managed carefully to avoid scheduling issues.

The problem with requests, though, is that they’re not always enough on their own to manage resource allocations properly. Whether requests solve your resource-management woes depends, as does everything, on your workload requirements and on continuously tracking key Kubernetes metrics for visibility and control.

To illustrate the point, let’s look at two common ways of leveraging requests and/or limits to improve resource allocation in Kubernetes, ideally guided by robust Kubernetes application performance monitoring practices.

Playing it safe: setting high requests

As a rule of thumb, memory limits are encouraged. Defining high requests usually safeguards the node without sacrificing app performance. Setting appropriate limits and requests is essential for balancing safety and efficiency, because they control CPU and memory allocation, affect scheduling, and help prevent resource overuse. Monitoring memory usage is crucial to confirm that requests and limits are set appropriately and to avoid resource exhaustion.

The caveat of this approach is that applications that don’t actually use the memory they request can leave other workloads under-provisioned, because there aren’t enough unreserved resources left over for them. The result is an underutilized node, which can increase expenses, especially when node autoscaling is in place, so strong Kubernetes observability practices and tooling are critical for detecting waste. Additionally, when troubleshooting memory issues or OOM kills, always check for memory leaks in your application, as they can cause persistent memory problems and hurt stability.

Set low requests, and set limits… only for spiky applications

In the case of a spiky workload, with a varying resource consumption, it’s possible to create a margin between the requests and the limits, while relying on Kubernetes health checks and probes to ensure Pods remain healthy as they scale resource usage up and down. Containers can use more CPU than their request if additional resources are available on the same node, but they cannot exceed their CPU limit due to kernel enforcement. Similarly, ephemeral storage is another resource that can be limited and should be monitored for spiky workloads to prevent unexpected evictions or failures.
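The margin between requests and limits described above could be expressed roughly like this. Everything in the manifest is a hypothetical illustration: the Deployment name, image, port, probe path, and all resource values are assumptions, and the right numbers depend on what you observe for your own workload.

```yaml
# Hypothetical Deployment for a spiky service; all names and values are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spiky-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spiky-api
  template:
    metadata:
      labels:
        app: spiky-api
    spec:
      containers:
        - name: api
          image: example.com/spiky-api:1.0
          resources:
            requests:              # low baseline used for scheduling
              cpu: 200m
              memory: 256Mi
              ephemeral-storage: 1Gi
            limits:                # ceiling that protects neighbors during spikes
              cpu: "1"
              memory: 512Mi
              ephemeral-storage: 2Gi
          readinessProbe:          # keeps traffic away from Pods that are not ready
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
```

The gap between 200m and 1 CPU is the burst headroom; the wider it is, the more the node can end up overcommitted.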

The result is a cluster where nodes are less likely to go underutilized and Pods are less likely to have to wait to be assigned because your requests are not too high. At the same time, however, the limits serve as a protection against those workloads that may become especially disruptive to other containers.

Kubernetes supports three Quality of Service (QoS) classes: Guaranteed, Burstable, and BestEffort. Burstable is the most common class, allowing pods to burst above their requests if additional resources are available, but they will be evicted before Guaranteed pods under resource pressure. BestEffort pods have no resource guarantees and are the first to be evicted, making them suitable only for non-critical batch jobs that can be interrupted.

However, creating a large margin between requests and limits can lead to a very nasty problem called memory overcommit. In this case, workloads lie to the node about their minimum requirements, but once assigned to it, they completely drain it of resources, quickly leading to unpredictable instability and crashes. This is the classic scenario where a kid at Six Flags wants to go on the rollercoaster but doesn’t reach the minimum height requirement, so he stands on his parent’s feet to fake the extra two inches and gets in.

Quality of Service (QoS) Classes: How Kubernetes Prioritizes Pods

Kubernetes uses Quality of Service (QoS) classes to determine how pods are prioritized for resource allocation, especially when nodes experience resource pressure. These classes—Guaranteed, Burstable, and BestEffort—are assigned based on the resource requests and limits specified in your pod definitions.

Guaranteed pods are given the highest priority. To achieve this class, every container in the pod must have both CPU and memory requests and limits, and each request must equal the corresponding limit. This ensures that the pod receives exactly the amount of CPU and memory it requests, making it ideal for critical workloads that require consistent resource utilization and cannot tolerate performance fluctuations.

A pod is classed as Burstable when it does not meet the criteria for Guaranteed but at least one of its containers has a CPU or memory request or limit, for example a request that is lower than its limit. This allows the pod to use more resources than requested if they are available, but it may be throttled or evicted if the node runs low on resources. Burstable pods are a good fit for applications with variable resource usage patterns, where occasional spikes above the CPU and memory requests are expected but the extra capacity is not guaranteed.

BestEffort pods have no resource requests or limits set. These pods are the lowest priority and are the first to be evicted when the node runs out of resources. BestEffort is suitable only for non-critical workloads where resource guarantees are not necessary.
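To make the three classes concrete, here is a sketch of one Pod per class. The names, image, and values are assumptions chosen for illustration; the class Kubernetes assigns is reported in each Pod's status.qosClass field.

```yaml
# Guaranteed: every container sets CPU and memory, with requests equal to limits.
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: 500m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 256Mi
---
# Burstable: at least one request or limit is set, but not the Guaranteed pattern.
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable
spec:
  containers:
    - name: app
      image: nginx:1.25
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          memory: 256Mi
---
# BestEffort: no requests or limits on any container.
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort
spec:
  containers:
    - name: app
      image: nginx:1.25
```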

Understanding and leveraging QoS classes is essential for effective resource management in Kubernetes. By carefully setting resource requests and limits, you can ensure that your most important workloads are protected and that overall resource utilization across your cluster remains balanced.

Namespace ResourceQuota and LimitRange: Enforcing Boundaries in Multi-Tenant Clusters

In multi-tenant Kubernetes environments, it’s crucial to prevent any single team or application from monopolizing cluster resources. This is where Namespace ResourceQuota and LimitRange come into play, providing guardrails for resource allocation and usage.

ResourceQuota allows administrators to set hard resource limits, such as total CPU, memory, or the number of pods, within a namespace. This ensures that the sum of all resource requests and limits in that namespace cannot exceed the defined quota, preventing resource hogging and promoting fair access to shared resources. For example, if a namespace has a quota of 8Gi on memory requests, the memory requests of all pods in that namespace, added together, must stay within 8Gi; a separate quota entry can cap the sum of memory limits.
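A quota along those lines might be declared as in the sketch below. The namespace name, the object name, and every value other than the 8Gi memory figure are assumptions for illustration.

```yaml
# Hypothetical ResourceQuota for a namespace called team-a.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "4"        # sum of CPU requests across all Pods
    requests.memory: 8Gi     # sum of memory requests across all Pods
    limits.cpu: "8"          # sum of CPU limits across all Pods
    limits.memory: 16Gi      # sum of memory limits across all Pods
    pods: "20"               # maximum number of Pods in the namespace
```

Once a quota covers a resource, Pods that do not specify a request or limit for it are rejected unless a LimitRange supplies defaults.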

LimitRange complements ResourceQuota by defining default and maximum resource requests and limits for containers within a namespace. This means that if a developer forgets to specify resource requests or limits in their container specification, Kubernetes will automatically apply the defaults set by the LimitRange. It also prevents users from requesting excessive resources for a single container, which could otherwise lead to performance degradation for other workloads.
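The defaults and maximums described above could be sketched as follows. The namespace, object name, and all values are assumptions; pick numbers that match what your workloads actually use.

```yaml
# Hypothetical LimitRange for a namespace called team-a.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:        # applied when a container omits requests
        cpu: 100m
        memory: 128Mi
      default:               # applied when a container omits limits
        cpu: 500m
        memory: 256Mi
      max:                   # largest limit a single container may ask for
        cpu: "2"
        memory: 1Gi
```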

By combining ResourceQuota and LimitRange, cluster administrators can enforce consistent resource requests and limits, reduce the risk of overcommitment, and maintain stable performance across all tenants. This approach is essential for maintaining a healthy, multi-tenant Kubernetes cluster where resources are shared equitably and efficiently.

Avoiding OOM Kills: Protecting Your Workloads from Memory Surprises

One of the most disruptive events in Kubernetes is an Out-of-Memory (OOM) kill, which occurs when a container exceeds its memory limit and the Linux kernel is forced to terminate it. OOM kills can lead to application downtime, data loss, and poor performance, especially if they affect critical workloads.

To minimize the risk of OOM kills, it’s vital to set accurate memory requests and memory limits for each container. By aligning memory requests with memory limits, you ensure that the Kubernetes scheduler only places pods on nodes with enough available memory, and that containers cannot consume more memory than allocated. This approach helps maintain predictable memory usage and prevents a single container from causing memory pressure that could impact other workloads.

Regularly monitoring memory usage with Kubernetes-native monitoring tools or third-party solutions allows you to spot trends and adjust resource requests and limits as your application’s actual usage patterns evolve. If you notice that a container frequently approaches its memory limit, consider increasing its memory requests and limits to provide more headroom and avoid unexpected OOM kills.

Effective memory management is about balancing the need for more memory with the risk of overcommitting resources. By setting appropriate resource limits and requests, and by keeping a close eye on memory usage, you can protect your workloads from memory surprises and ensure reliable, stable operation across your Kubernetes cluster.

What about autoscaling?

Another creative way to try to leverage limits while minimizing the risk of running out of resources is to configure horizontal or vertical pod autoscaling for your workloads, or even to use kubectl scale deployment to adjust replicas manually. The two types of autoscaling work differently: horizontal autoscaling adds or removes Pod replicas, while vertical autoscaling adjusts the resource requests of your Pods. Both serve the purpose of automatically making more resources available to your application when its current allocation runs low.
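For horizontal autoscaling, a minimal sketch might look like this. The target Deployment name, replica bounds, and the 70 percent threshold are assumptions, and CPU-based scaling presumes that a metrics source such as the metrics server is running in the cluster.

```yaml
# Hypothetical HorizontalPodAutoscaler targeting a Deployment named spiky-api.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: spiky-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spiky-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request, averaged across Pods
```

Utilization is measured against the CPU request, not the limit, so the request you choose also determines when scaling kicks in.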

In addition, the vertical pod autoscaler (VPA) can analyze historical resource usage and provide recommendations for resource requests, which you can review and apply manually for safe sizing guidance without automatically changing your workloads unless configured to do so.
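A recommendation-only setup could be sketched as below. It assumes the Vertical Pod Autoscaler components are installed in the cluster, since VPA is an add-on rather than part of core Kubernetes, and the target Deployment name is hypothetical.

```yaml
# Hypothetical VerticalPodAutoscaler in recommendation-only mode.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: spiky-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: spiky-api
  updatePolicy:
    updateMode: "Off"   # compute recommendations only; never change running Pods
```

With updateMode set to "Off", the recommendations appear in the object's status, where you can review them before editing requests by hand.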

In theory, then, autoscaling can act as a safety net against inadequate limit settings. The problem with this idea in practice, though, is that horizontal and vertical pod autoscalers “don't act instantly - certainly not on the timescale of seconds.” Your app may already be dropping requests by the time autoscaling rides in to save the day.

This doesn’t mean you shouldn’t use autoscaling. You certainly should in situations where it makes sense. But it does mean that you shouldn’t assume autoscaling has your back in the event that limit settings turn out to be insufficient; you still need sound Kubernetes deployment strategies for safe rollouts.

The bottom line: when it comes to K8s limits, know your app

Whichever approach you take, what matters most is knowing your application. You can’t determine which requests or limits to set unless you know what the consumption patterns of your application are likely to be. This requires tracking workload behavior over time, learning which resources the workload consumes during normal and peak activity, and then rightsizing requests and/or limits based on the actual needs of your applications. Profiling each application individually ensures that resource allocation is efficient and cost-effective. Additionally, Kubernetes allows for the overcommitment of resources, meaning you can set requests lower than actual usage to maximize node utilization, while limits provide a safety mechanism to prevent any single application from monopolizing node resources. Regularly evaluating and adjusting these settings is key, as well as configuring resilient Kubernetes liveness probes to restart unhealthy containers.

At the end of the day, Kubernetes is only as effective as the configurations you apply to it. It’s not an observability tool, so it can’t automatically determine which resources your apps need to run reliably. That task is on you. It requires a robust observability platform that helps you understand the resource consumption behavior of workloads in Kubernetes and beyond, including complex Kubernetes on-premises environments with unique scaling constraints, so that you can make informed decisions about limit and request settings.

Noam Levy
Founding Engineer
