Optimizing Kubernetes Workloads with Spot Instances

December 2, 2025
groundcover Team

Running Kubernetes clusters in the cloud can get expensive fast. According to a March 2024 InfoQ report on the CNCF FinOps microsurvey, 49% of organizations saw their cloud costs increase after adopting Kubernetes. When you're paying for compute capacity 24/7, those costs add up whether your workloads are actually using that capacity or not.

Spot instances offer a way out of this problem. By using spare cloud capacity at steep discounts, commonly 50–90% off depending on region and instance family, you can run the same workloads for just a fraction of the cost. Cloud providers can reclaim that capacity with just a few minutes' notice when they need it for paying customers.

For many teams, that trade-off sounds risky. But when you pair spot instances with Kubernetes' built-in resilience and self-healing capabilities, you get a system that can handle interruptions without service disruption. This article breaks down how spot instances work in Kubernetes, the architectural patterns that make them reliable, and the tools that help teams manage interruptions with confidence.

What are Spot Instances in Kubernetes?

Spot instances are unused compute capacity that cloud providers sell at a discount. A familiar analogy is standby airline tickets. You get a great price, but you might lose your seat if a full-fare passenger needs it. Every major cloud provider offers some version of this model, though they call it different things and implement it slightly differently.

How Spot Instances Work in Cloud Environments

AWS calls them EC2 Spot Instances and offers discounts of 70-90% compared to on-demand prices. When AWS needs that capacity back, you get a two-minute warning before your instance terminates. Google Cloud has two options: Preemptible VMs (up to ~80% discount, limited to a 24-hour maximum runtime, with roughly 30 seconds' notice) and the newer Spot VMs, which behave more like AWS Spot Instances without the fixed 24-hour limit. Azure Spot VMs also offer up to 90% savings, with a 30-second eviction notice.

The pricing and availability fluctuate based on supply and demand. When a particular instance type is unpopular, you'll find plenty of spot capacity at rock-bottom prices. When demand spikes, that capacity dries up and prices rise. This dynamic nature means you can't treat spot instances like regular infrastructure. You need to be ready for interruptions and have a plan to handle them.

Why Kubernetes and Spot Instances Are a Natural Fit

Kubernetes was designed to run workloads across a constantly changing pool of nodes. It doesn't care if a node disappears; it just reschedules the pods that were running on it to other available nodes. This matches perfectly with how spot instances work. When a spot instance gets reclaimed, Kubernetes treats it like any other node failure and moves the workloads elsewhere.

The key difference is that with spot instances, you know failures are coming. You're not trying to prevent them; you're designing your system to handle them gracefully. This forces you to build more resilient applications, and you get a massive cost reduction in the process.

Why Use Spot Instances in Kubernetes for Workload Efficiency: Key Benefits

The case for spot instances goes beyond just saving money, though that's obviously the main draw. Here's what you actually get:

  • Massive Cost Savings: The 70-90% discount isn't marketing hype. A cluster running 100 m5.large instances on-demand costs about $6,912 per month (100 × $0.096/hour × 720 hours). With typical spot discounts of 70–90%, the same capacity would cost roughly $691–$2,074 per month. Scale that across your entire infrastructure, and you're talking about hundreds of thousands in annual savings.
  • Forced Resilience: Running on spot instances means you can't rely on any single instance sticking around. This pushes you to design stateless applications, use proper health checks, and implement graceful shutdown handling. You end up with more fault-tolerant applications by default, which helps even when spot instances aren't involved.
  • Access to Diverse Instance Types: Spot pricing varies by instance type, so you're incentivized to be flexible about what you use. Instead of standardizing on a single instance family, you can spread workloads across multiple types (e.g., m5, m6i, c5, c6i), whatever's available and cheap. Kubernetes handles the complexity of managing this diversity.
  • Capacity Flexibility During Demand Spikes: When you need to scale up quickly, spot instances let you grab capacity fast without waiting for procurement or worrying about long-term commitments. Spin up 50 nodes for a batch job, run it, then shut them down. You only pay for what you use.

How Kubernetes Manages Spot Instances

Kubernetes treats spot nodes like any other node in your cluster, but with some extra metadata and handling to deal with interruptions. The orchestration layer doesn't distinguish between spot and on-demand at a fundamental level; it just schedules pods wherever resources are available.

Interruption Handling and Graceful Termination

When a cloud provider needs to reclaim a spot instance, it sends a termination notice. AWS gives you two minutes; GCP and Azure give you 30 seconds. This isn't much time, but it's enough for Kubernetes to react. When a node receives a termination notice, it should immediately be marked unschedulable (cordoned) so no new pods land on it. Then it starts evicting pods gracefully, giving them time to finish in-flight requests and shut down cleanly.
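Kubernetes doesn't watch for these provider notices on its own; a node-level agent has to translate them into a cordon and drain. On AWS, the open-source aws-node-termination-handler runs as a DaemonSet and does exactly that (EKS managed node groups and Karpenter also handle interruptions natively). A minimal sketch of the handler's Helm chart values; the key names follow the upstream chart and may differ across versions:

# values.yaml sketch for the aws-node-termination-handler Helm chart (IMDS mode)
enableSpotInterruptionDraining: true    # cordon and drain on the 2-minute spot notice
enableRebalanceMonitoring: true         # also react to EC2 rebalance recommendations
enableScheduledEventDraining: true      # drain ahead of scheduled maintenance events
podTerminationGracePeriod: 90           # seconds pods get to shut down cleanly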

Kubernetes will reschedule those evicted pods onto other nodes that have capacity. If you've configured pod disruption budgets correctly, Kubernetes ensures it doesn't take down too many replicas of the same service at once. This prevents a mass spot instance reclaim event from taking your entire application offline.

The critical piece here is that your applications need to handle SIGTERM signals properly. When Kubernetes tells a pod to shut down, the application should stop accepting new connections, finish processing current requests, and exit cleanly. Most modern web frameworks support this, but you need to verify your code actually implements it.
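To make this concrete, here's a minimal Deployment sketch showing where the termination grace period and a preStop hook fit; the image name and timings are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      # total shutdown time must fit inside the provider's notice window
      terminationGracePeriodSeconds: 60
      containers:
      - name: web-app
        image: registry.example.com/web-app:1.0   # hypothetical image
        lifecycle:
          preStop:
            exec:
              # pause briefly so load balancers stop routing here before SIGTERM
              command: ["/bin/sh", "-c", "sleep 10"]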

Key Kubernetes Components for Spot Management

Several Kubernetes features work together to make spot instances viable:

  1. Node Labels and Taints let you mark spot nodes as such. You can add a label like `capacity-type=spot` (managed offerings apply their own, such as `eks.amazonaws.com/capacityType=SPOT` on EKS) and a taint like `spot=true:NoSchedule`. The taint prevents pods from landing on spot nodes unless they specifically tolerate it, giving you fine-grained control over what runs where.
  2. Node Affinity and Anti-Affinity rules let you express preferences for pod placement. You can say "prefer spot nodes for this deployment, but fall back to on-demand if needed." Or you can enforce hard requirements like "never run this database on a spot node." This flexibility lets you optimize costs while protecting critical workloads.
  3. Pod Disruption Budgets (PDBs) set limits on how many pods from a deployment can be unavailable at once. If you have 10 replicas and a PDB that says "keep at least 8 available," Kubernetes won't evict more than two pods simultaneously, even during spot interruptions. This maintains availability during disruptions.
  4. Cluster Autoscaler handles scaling the number of nodes up and down based on pending pods and resource utilization. It works with both spot and on-demand node pools, and you can configure it to prefer spot capacity when available. More recent tools like Karpenter improve on this by being faster and more flexible.
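For that last item, one concrete mechanism is the Cluster Autoscaler's priority expander (enabled with the `--expander=priority` flag), which tries node groups in priority order by name regex. A sketch, assuming node group names that contain "spot" or "on-demand":

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander   # name required by the autoscaler
  namespace: kube-system
data:
  priorities: |-
    20:
      - .*spot.*        # higher priority: scale spot node groups first
    10:
      - .*on-demand.*   # fall back to on-demand when spot can't satisfy demand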

Here's what a basic pod configuration with a spot toleration and node affinity looks like:

apiVersion: v1
kind: Pod
metadata:
  name: web-app
spec:
  # Allow this pod to run on nodes carrying the spot taint
  tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  # Prefer spot nodes (matching the label from above), but
  # fall back to on-demand if none have capacity
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: capacity-type
            operator: In
            values:
            - spot

And a pod disruption budget example:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app

How Spot Instances in Kubernetes Impact Cluster Architecture

Adding spot instances to your cluster isn't just about switching instance types. It changes how you think about node management and workload placement.

Designing a Hybrid Spot/On-Demand Architecture

Most production clusters use a hybrid approach with separate node pools for different purposes. You'll typically have an on-demand node pool for critical infrastructure components, things like your ingress controllers, monitoring agents, and cluster management tools. These need to stay up regardless of spot availability.

Then you have spot node pools for application workloads. You might create multiple spot pools with different instance types to increase your chances of getting capacity. When one instance type runs out of spot availability, Kubernetes can use a different pool.

The ratio between spot and on-demand capacity depends on your risk tolerance and workload characteristics. Some teams run 80-90% spot for cost-sensitive batch workloads. Others stick to 30-40% spot for production services where they need more predictability. You can adjust this ratio over time as you gain confidence in handling interruptions.
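As one concrete shape this can take, here's a hedged sketch in eksctl's ClusterConfig format; the names and sizes are illustrative, and GKE and AKS node pools express the same split with their own fields:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster            # hypothetical cluster
  region: us-east-1
managedNodeGroups:
- name: on-demand-system           # critical infrastructure stays on-demand
  instanceTypes: ["m6i.large"]
  minSize: 2
  maxSize: 4
- name: spot-apps                  # application workloads ride spot capacity
  spot: true
  instanceTypes: ["m5.large", "m5a.large", "m6i.large", "c5.large"]
  minSize: 0
  maxSize: 20
  labels:
    capacity-type: spot
  taints:
  - key: spot
    value: "true"
    effect: NoSchedule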

Multi-AZ and Instance Diversity Strategies

Spreading spot instances across multiple availability zones reduces the risk that a single reclaim event impacts too much of your capacity at once. Different AZs have independent spot pricing and availability, so what's scarce in us-east-1a might be plentiful in us-east-1b.

The challenge with multi-AZ is that some resources, like EBS volumes in AWS, are zonal. If you need persistent storage, you have to think carefully about pod scheduling and volume placement. Kubernetes' topology-aware scheduling helps, but it adds complexity.
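Topology spread constraints are the standard tool here. This pod-spec fragment keeps replicas balanced across zones without blocking scheduling when a zone runs dry (the `app: web-app` selector matches the earlier examples):

topologySpreadConstraints:
- maxSkew: 1                                 # zones may differ by at most one replica
  topologyKey: topology.kubernetes.io/zone   # well-known zone label
  whenUnsatisfiable: ScheduleAnyway          # prefer balance over blocking pods
  labelSelector:
    matchLabels:
      app: web-app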

Instance type diversity is equally important. Spot pools for a single instance type can dry up quickly during high demand. If you configure multiple instance types of similar capacity (m5.large, m5a.large, m6i.large, c5.large), you spread your risk. AWS, for example, offers a "capacity-optimized" allocation strategy that automatically picks the spot pools with the deepest available capacity.

Here's a node affinity rule that accepts multiple instance types:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
          - m5.large
          - m5a.large
          - m6i.large
          - c5.large

Key Challenges with Spot Instances in Kubernetes

Spot instances solve the cost problem, but create new operational challenges you need to manage. Here's what you're signing up for:

| Challenge | Impact | Mitigation Strategy |
| --- | --- | --- |
| Unexpected interruptions | Pods get evicted with minimal notice, potentially disrupting active requests | Use pod disruption budgets, implement graceful shutdown handling, and maintain sufficient replicas |
| Variable capacity availability | Spot instance pools can run dry during high-demand periods, preventing scale-up | Configure multiple instance types, maintain fallback to on-demand capacity, and use capacity-optimized allocation |
| Configuration complexity | More node pools, taints, tolerations, and affinity rules to manage | Standardize configurations across workloads, use Infrastructure-as-Code, and document patterns clearly |
| Stateful workload challenges | Databases and other stateful apps don't handle interruptions well | Reserve on-demand capacity for stateful workloads, use external storage, implement proper backup strategies |
| Observability gaps | Harder to trace issues across ephemeral nodes that disappear | Centralize [logging](/kubernetes-monitoring/kubernetes-logging), implement [distributed tracing](/blog/distributed-tracing), use [observability tools](/blog/observability-tools) that handle dynamic infrastructure |
| Cost tracking difficulty | Spot pricing varies, making budgeting and cost attribution more complex | Use cost monitoring tools, track savings metrics, implement proper tagging and labeling |

The biggest mistake teams make is underestimating the observability challenge. When nodes come and go frequently, you lose the ability to SSH into a box and check logs after the fact. Everything needs to be centralized and readily available, which requires better instrumentation from the start.

Best Practices for Running Spot Instances in Kubernetes

Getting spot instances right requires following some established patterns. Here's what actually works in production:

  • Classify Workloads by Spot-Readiness: Not everything should run on spot. Stateless web services, batch jobs, CI/CD runners, and dev/test environments are perfect candidates. Databases, single-replica critical services, and long-running jobs without checkpointing should stay on-demand. Build a simple framework: Can it handle interruptions? Can you run multiple replicas? Does it support graceful shutdown? If yes to all three, it's spot-ready.
  • Use Diversified Instance Types: Configure each spot node pool with 4-5 different instance types of similar specifications. The cloud provider will automatically select from the pool with the most available capacity. Stick to current-generation instances (m6i instead of m5, c6i instead of c5) as older generations usually have less spot availability.
  • Set Proper Pod Disruption Budgets: Every deployment that runs on spot instances needs a PDB. For web services, set `minAvailable` to maintain enough capacity to handle traffic. For batch jobs, you might set `maxUnavailable` to limit how many workers can be disrupted at once. Test your PDBs by simulating node failures to make sure they work as expected.
  • Configure Node Affinity and Tolerations Correctly: Use `preferredDuringSchedulingIgnoredDuringExecution` for workloads that should prefer spot but can fall back to on-demand. Use `requiredDuringSchedulingIgnoredDuringExecution` for workloads that must run on spot (like cost-sensitive batch jobs). Always include on-demand fallback capacity so pods don't stay pending when spot capacity runs out.
  • Implement Graceful Shutdown Handling: Your applications must respond to SIGTERM by stopping new work, finishing current requests, and exiting within the termination grace period (default 30 seconds). Use preStop hooks if you need to do cleanup before shutdown. Configure connection draining in your load balancers to stop sending traffic to terminating pods.
lifecycle:
  preStop:
    exec:
      # brief pause so load balancers stop routing traffic before SIGTERM
      command: ["/bin/sh", "-c", "sleep 15"]
  • Monitor Spot Instance Health and Interruptions: Track how often your spot instances get reclaimed, which instance types have the best stability, and whether interruptions are causing availability issues. Alert when the spot capacity in a pool drops below safe thresholds. Most importantly, monitor your actual cost savings to ensure spot instances are delivering the expected value.
  • Maintain Fallback to On-Demand Capacity: Configure your cluster autoscaler or Karpenter to automatically provision on-demand instances when spot capacity isn't available. This prevents workloads from staying in a pending state. You can set this up with priority-based scaling where spot pools are tried first, then on-demand pools if needed.
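With Karpenter, for example, spot-first scaling with on-demand fallback is a single requirement on the NodePool. A sketch following the v1 API; the field layout differs in older releases:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: apps
spec:
  template:
    spec:
      nodeClassRef:                # references a separately defined EC2NodeClass
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
      # Karpenter prefers spot when both capacity types are allowed,
      # and provisions on-demand when spot pools run dry
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m5.large", "m5a.large", "m6i.large", "c5.large"]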

Tools and Integrations for Spot Instance Management

Managing spot instances gets easier with the right tooling. Each cloud provider offers native solutions, and several third-party options fill gaps.

1. groundcover

groundcover provides real-time observability specifically designed for dynamic Kubernetes environments running on spot instances. Using eBPF technology, it collects telemetry data with minimal overhead, which is critical when you're trying to maximize the cost efficiency of spot capacity. The platform tracks pod rescheduling events, correlates performance issues with node interruptions, and helps you understand the actual impact of running on spot instances versus on-demand.

Cloud-Native Tools

AWS provides several options for spot management. EKS managed node groups support spot instances natively, handling the infrastructure side automatically. Karpenter is a newer autoscaling solution built specifically for Kubernetes on AWS. It's faster and more flexible than the traditional Cluster Autoscaler, with better support for diversified instance types and automatic fallback to on-demand. AWS also offers the Spot Instance Advisor tool that shows interruption rates by instance type so you can make informed decisions.

Google Cloud integrates spot instances through GKE node pools. You can mix spot and standard VMs in the same pool and configure the autoscaler to prefer spot capacity when available. GKE's bin-packing strategies help maximize resource utilization across spot nodes. The platform also provides detailed cost breakdowns showing savings from spot usage.
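GKE labels spot nodes with `cloud.google.com/gke-spot`, so steering a tolerant workload onto them can be as simple as a node selector in the pod spec:

nodeSelector:
  cloud.google.com/gke-spot: "true"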

Azure implements spot instances through AKS spot node pools. The integration is straightforward: create a node pool with spot VMs enabled and configure your workloads to tolerate the spot taint. Azure also provides eviction policies and a max-price option to give you some control over when your spot VMs get evicted.
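AKS taints spot node pools with `kubernetes.azure.com/scalesetpriority=spot:NoSchedule`, so opting a workload in looks like this:

tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"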

Third-Party Solutions

Several commercial platforms abstract away much of the complexity. Spot.io (now part of NetApp) automatically manages spot instance allocation and handles interruptions transparently. CAST AI uses AI to optimize instance selection and provides one-click spot integration. Kubecost focuses on cost visibility and optimization, showing you exactly where your Kubernetes spend is going and how much you're saving with spot instances.

The traditional Cluster Autoscaler still works for spot instances, though it's slower and less sophisticated than newer options. It's a solid choice if you want something proven and cloud-agnostic, but expect more manual configuration to get spot diversity right.

How groundcover Enhances Spot Instance Resilience

Running workloads on spot instances creates specific observability challenges that traditional monitoring tools weren't designed to handle. groundcover addresses these gaps:

  • Real-Time Visibility into Pod Rescheduling: When spot instances get reclaimed, groundcover tracks which pods were affected, how long rescheduling took, and whether it caused any service degradation. You can see patterns like certain instance types getting reclaimed more frequently or specific workloads struggling with interruptions.
  • Performance Impact Correlation: The platform correlates spot interruptions with application performance metrics. If your API latency spikes every time spot nodes get reclaimed, you'll see that connection clearly. This helps you understand whether your spot strategy is actually working or causing hidden problems.
  • Low-Overhead Monitoring with eBPF: Since spot instances are all about cost efficiency, you don't want your monitoring solution eating up resources. groundcover uses eBPF to collect telemetry data directly from the kernel with minimal CPU and memory overhead. This matters even more on spot nodes, where you want every bit of capacity going to your workloads.
  • Cost Attribution and Tracking: groundcover helps you prove the ROI of your spot instance strategy by tracking actual savings versus potential issues. You can see how much you're saving per cluster, namespace, or workload, and whether those savings are worth any stability trade-offs you're making.
  • Faster Troubleshooting in Ephemeral Environments: When a spot node disappears, traditional monitoring approaches lose visibility into what was happening on that node. groundcover's centralized data collection means you have complete visibility even after nodes terminate, making post-mortem analysis possible.

Conclusion

Spot instances can cut your Kubernetes infrastructure costs by 70-90% when you implement them correctly. The key is accepting that interruptions will happen and designing your system to handle them gracefully. With Kubernetes' built-in orchestration, pod disruption budgets, and proper workload design, spot interruptions become a routine operational event rather than a crisis.

Start with non-critical workloads like batch jobs and dev environments. Measure the impact on your bill and your team's operational burden. As you gain confidence, expand spot usage to more workloads. Most teams find they can run 50-70% of their non-critical Kubernetes capacity on spot instances once proper patterns and tooling are in place, though the safe ratio varies by workload.

The cloud cost problem isn't going away. As Kubernetes adoption grows, so do the bills. Spot instances give you a proven path to significant savings without sacrificing the reliability and scalability that made you choose Kubernetes in the first place. The tools and practices are mature enough that there's no reason to keep paying full price for capacity you're prepared to give back anyway when the next interruption hits.

FAQs

1. How do spot instance interruptions affect Kubernetes workloads and what strategies minimize disruption?

  • Interruption Behavior: Spot interruptions trigger pod evictions with ~30–120 seconds’ notice depending on provider.
  • Impact: Kubernetes reschedules pods automatically, but short-term capacity dips may affect availability.
  • Replica Safety: Use Pod Disruption Budgets (PDBs) to maintain minimum replica availability.
  • Workload Resilience: Run enough replicas so losing a few doesn’t cause outages.
  • Graceful Shutdown: Implement preStop hooks or graceful termination logic to finish in-flight work.
  • Diversification: Use multiple instance types and multiple AZs to reduce correlated interruptions.
  • Testing: Regularly simulate interruptions by cordoning/draining nodes to validate resilience.

2. Which types of Kubernetes workloads are (and aren't) a good fit for spot instances?

Good Fits:

  • Stateless web apps with multiple replicas
  • Batch jobs that checkpoint state
  • CI/CD workers and ephemeral workloads
  • Dev/test environments
  • Distributed data pipelines tolerant of retries

Bad Fits:

  • Stateful databases without replication
  • Single-replica workloads requiring 100% uptime
  • Long-running jobs without checkpointing
  • Real-time or latency-sensitive systems
  • Apps with irreplaceable local node state

The general rule: if losing the instance means losing work or causing an outage, don't run it on spot. If you can rebuild the state quickly or have enough redundancy that one instance doesn't matter, spot is fine.

3. How does groundcover integrate with Kubernetes clusters using spot instances to provide observability and control?

  • Cluster-Wide Coverage: Deploys as a DaemonSet, running an agent on every node including spot instances.
  • eBPF Telemetry: Uses eBPF to capture metrics, traces, and events without instrumentation or code changes.
  • No Data Loss: Observability continues even after a spot node is reclaimed because data streams to a central backend.
  • Spot Awareness: Auto-detects spot nodes via labels/taints and surfaces spot-specific dashboards and alerts.
  • Operational Insights: Tracks interruption events, pod rescheduling behavior, and node stability patterns.
  • Optimization: Helps identify which workloads tolerate spot interruptions best and whether cost savings outweigh stability trade-offs.
