December 5, 2024

Cutting Our Kubernetes Bill by 60% Without Touching the App

A practical walkthrough of the resource requests, node sizing, cluster autoscaler tuning, and spot instance strategies that slashed our monthly EKS spend from $14k to $5.6k.

Kubernetes · AWS · EKS · Cost · DevOps

We were spending $14,000/month on EKS. After a focused two-week optimization sprint, that dropped to $5,600 — with zero application changes and no degradation in performance or reliability. Here’s exactly what we did.

The Audit: Where Is the Money Going?

Before optimizing, instrument. We used Kubecost to get per-namespace, per-deployment cost attribution. What we found:

  • 42% of spend on nodes that were idle >60% of the time
  • 23% on over-provisioned memory requests that were never consumed
  • 18% on on-demand instances that could be spot
  • 17% legitimately necessary

Fix the first three categories and you fix 83% of the bill.

Fix 1: Resource Requests That Reflect Reality

Most teams set resource requests once at deployment time and never revisit them. After six months of running, actual usage rarely matches the original estimates.

We used the Vertical Pod Autoscaler (VPA) in recommendation mode — it doesn’t change anything, just tells you what it would set:

kubectl apply -f vpa-recommender.yaml
# Wait 24 hours, then:
kubectl describe vpa my-deployment
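
The recommender-only setup is just a VPA object with `updateMode: "Off"`. Here's a minimal sketch of what `vpa-recommender.yaml` might contain (the article doesn't show the manifest; the deployment name is illustrative):

```yaml
# Illustrative vpa-recommender.yaml: VPA in recommendation-only mode.
# updateMode "Off" means the VPA computes recommendations but never
# evicts or mutates pods, so it's safe to run against production.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-deployment
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  updatePolicy:
    updateMode: "Off"
```

The recommendations show up under `status.recommendation` in the `kubectl describe vpa` output above.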

What we found: almost every deployment had memory requests 3-4x higher than actual p99 usage. After adjusting:

| Deployment   | Before | After | Savings |
| ------------ | ------ | ----- | ------- |
| API service  | 2Gi    | 512Mi | 75%     |
| Worker pool  | 4Gi    | 1.5Gi | 63%     |
| ML inference | 8Gi    | 6Gi   | 25%     |

Tighter requests → the cluster scheduler packs pods more efficiently → fewer nodes needed.

Fix 2: Right-Size Your Node Groups

We were running m5.2xlarge (8 vCPU, 32Gi) across the board because “it’s what we started with.” After analyzing our actual workload shapes:

  • Stateless API pods: CPU-bound, small memory → switched to c5.xlarge (4 vCPU, 8Gi)
  • ML workers: Memory-heavy, burstable CPU → switched to r5.large (2 vCPU, 16Gi)
  • Batch jobs: Interruptible, bursty → moved to spot m5.xlarge

Using purpose-built node groups cut unit costs and improved bin-packing efficiency simultaneously.
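
As a sketch, the three node groups might look like this in an eksctl config — the article doesn't say how the cluster is provisioned, so eksctl, the cluster name, and the region are all assumptions:

```yaml
# Hypothetical eksctl config: purpose-built node groups matching the
# workload shapes above. Cluster name and region are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-east-1
managedNodeGroups:
  - name: api                 # stateless, CPU-bound API pods
    instanceType: c5.xlarge
    labels: { workload: api }
  - name: ml-workers          # memory-heavy ML workers
    instanceType: r5.large
    labels: { workload: ml }
  - name: batch-spot          # interruptible batch jobs on spot
    instanceTypes: [m5.xlarge]
    spot: true
    labels: { workload: batch }
```

Labels on each group let deployments steer onto the right hardware via node selectors or affinity.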

Fix 3: Cluster Autoscaler Tuning

Default CA settings are conservative — they wait too long to scale down and scale up too aggressively. Three flags made the biggest difference:

--scale-down-delay-after-add=5m      # default: 10m
--scale-down-unneeded-time=3m        # default: 10m
--skip-nodes-with-system-pods=false  # allow scale-down of nodes running only kube-system pods

Faster scale-down means idle capacity doesn’t linger. We also enabled overprovisioning with a low-priority “placeholder” deployment so scale-up events don’t cause request queuing.
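
The placeholder pattern works by giving pause pods a negative priority: they reserve headroom, and any real pod preempts them, which triggers scale-up before user workloads queue. A minimal sketch — the names and resource sizes are illustrative, not the article's actual config:

```yaml
# Negative priority so any real pod preempts the placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10
globalDefault: false
description: "Placeholder pods that reserve spare capacity"
---
# Pause pods holding headroom; replica count and requests are
# illustrative -- size them to roughly one node of spare capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-placeholder
spec:
  replicas: 2
  selector:
    matchLabels: { app: overprovisioning }
  template:
    metadata:
      labels: { app: overprovisioning }
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
```

When a real pod arrives and no capacity is free, the scheduler evicts a placeholder immediately; the evicted placeholder then goes pending, which is what prompts the autoscaler to add a node.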

Fix 4: Spot Instances for Non-Critical Workloads

Any workload that can tolerate interruption should run on spot. For us, that was:

  • Background job processors (idempotent, can restart)
  • ML training jobs (checkpointed)
  • Dev/staging environments (obviously)

We used node affinity with a fallback to on-demand:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80
      preference:
        matchExpressions:
        - key: node.kubernetes.io/capacity-type
          operator: In
          values: ["spot"]

The preferred (not required) affinity means if no spot capacity is available, pods schedule on on-demand. No manual intervention needed during spot interruptions.
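
If the spot node groups carry a taint — a common setup, though the article doesn't mention one — the same pods also need a matching toleration, or the preferred affinity never gets a chance to apply. An illustrative pairing, assuming a hypothetical `capacity-type=spot:NoSchedule` taint:

```yaml
# Assumes spot nodes are tainted capacity-type=spot:NoSchedule
# (hypothetical -- the article doesn't describe a taint).
tolerations:
- key: capacity-type
  operator: Equal
  value: spot
  effect: NoSchedule
```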

The Result

| Category         | Before   | After     |
| ---------------- | -------- | --------- |
| Node count (avg) | 24       | 11        |
| Monthly spend    | $14,200  | $5,600    |
| P99 API latency  | 180ms    | 165ms     |
| Incident rate    | baseline | no change |

The latency actually improved slightly — better bin-packing reduced noisy-neighbor effects between pods.


The biggest unlock was the audit step. You can’t optimize what you haven’t measured. Spend the first day on instrumentation; the rest of the sprint becomes obvious.

Found this useful? I write about AI engineering, distributed systems, and cloud infrastructure.