Cutting Our Kubernetes Bill by 60% Without Touching the App
A practical walkthrough of the resource requests, node sizing, cluster autoscaler tuning, and spot instance strategies that slashed our monthly EKS spend from $14k to $5.6k.
We were spending $14,000/month on EKS. After a focused two-week optimization sprint, that dropped to $5,600 — with zero application changes and no degradation in performance or reliability. Here’s exactly what we did.
The Audit: Where Is the Money Going?
Before optimizing, instrument. We used Kubecost to get per-namespace, per-deployment cost attribution. What we found:
- 42% of spend on nodes that were idle >60% of the time
- 23% on over-provisioned memory requests that were never consumed
- 18% on on-demand instances that could be spot
- 17% legitimately necessary
Fix the first three categories and you fix 83% of the bill.
Fix 1: Resource Requests That Reflect Reality
Most teams set resource requests once at deployment time and never revisit them. After six months of running, actual usage rarely matches the original estimates.
We used the Vertical Pod Autoscaler (VPA) in recommendation mode — it doesn’t change anything, just tells you what it would set:
kubectl apply -f vpa-recommender.yaml
# Wait 24 hours, then:
kubectl describe vpa my-deployment
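The VPA object behind those commands is a few lines of YAML. A sketch in recommendation-only mode (the target name my-deployment is a placeholder for your own workload):

```yaml
# VerticalPodAutoscaler in recommendation-only mode: updateMode "Off"
# means VPA records suggested requests but never evicts or resizes pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-deployment
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  updatePolicy:
    updateMode: "Off"   # observe only; we apply the numbers ourselves
```

Running it in "Off" mode first means you can sanity-check the recommendations against spiky workloads before committing to them.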
What we found: almost every deployment had memory requests 3-4x higher than actual p99 usage. After adjusting:
| Deployment | Before | After | Savings |
|---|---|---|---|
| API service | 2Gi | 512Mi | 75% |
| Worker pool | 4Gi | 1.5Gi | 63% |
| ML inference | 8Gi | 6Gi | 25% |
Tighter requests → the cluster scheduler packs pods more efficiently → fewer nodes needed.
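Applying the table's numbers, the API service's container spec ends up roughly like this (the CPU value and the 1Gi limit are illustrative from our cluster, not universal defaults):

```yaml
# Tightened requests for the API service: request near p99 usage,
# keep a higher memory limit as headroom for rare spikes.
resources:
  requests:
    cpu: 500m
    memory: 512Mi   # was 2Gi; p99 usage sat well under this
  limits:
    memory: 1Gi     # headroom above the request, still far below the old 2Gi
```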
Fix 2: Right-Size Your Node Groups
We were running m5.2xlarge (8 vCPU, 32Gi) across the board because “it’s what we started with.” After analyzing our actual workload shapes:
- Stateless API pods: CPU-bound, small memory → switched to c5.xlarge (4 vCPU, 8Gi)
- ML workers: memory-heavy, burstable CPU → switched to r5.large (2 vCPU, 16Gi)
- Batch jobs: interruptible, bursty → moved to spot m5.xlarge
Using purpose-built node groups cut unit costs and improved bin-packing efficiency simultaneously.
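With eksctl, the purpose-built groups can be declared in one ClusterConfig. A sketch under our assumptions (group names, sizes, and region are illustrative):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod
  region: us-east-1
managedNodeGroups:
- name: api              # CPU-bound stateless services
  instanceType: c5.xlarge
  minSize: 2
  maxSize: 10
- name: ml-workers       # memory-heavy workloads
  instanceType: r5.large
  minSize: 1
  maxSize: 8
- name: batch-spot       # interruptible batch jobs on spot capacity
  instanceTypes: [m5.xlarge, m5a.xlarge]   # multiple types improve spot availability
  spot: true
  minSize: 0
  maxSize: 20
```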
Fix 3: Cluster Autoscaler Tuning
Default CA settings are conservative — they wait too long to scale down and scale up too aggressively. Two parameters made the biggest difference:
--scale-down-delay-after-add=5m # default: 10m
--scale-down-unneeded-time=3m # default: 10m
--skip-nodes-with-system-pods=false # reclaim daemon-set-only nodes
Faster scale-down means idle capacity doesn’t linger. We also enabled overprovisioning with a low-priority “placeholder” deployment so scale-up events don’t cause request queuing.
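The placeholder pattern pairs a negative-priority PriorityClass with a deployment of pause containers: the autoscaler keeps nodes warm for them, and any real pod preempts them instantly. A sketch (replica count and request sizes are whatever headroom you want pre-warmed):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -10                # lower than any real workload's priority
globalDefault: false
description: "Placeholder pods that real workloads preempt instantly."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2             # tune to the amount of spare capacity you want
  selector:
    matchLabels: {app: overprovisioning}
  template:
    metadata:
      labels: {app: overprovisioning}
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"      # each replica reserves this much schedulable space
            memory: 2Gi
```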
Fix 4: Spot Instances for Non-Critical Workloads
Any workload that can tolerate interruption should run on spot. For us, that was:
- Background job processors (idempotent, can restart)
- ML training jobs (checkpointed)
- Dev/staging environments (obviously)
We used node affinity with a fallback to on-demand:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 80
      preference:
        matchExpressions:
        - key: node.kubernetes.io/capacity-type
          operator: In
          values: ["spot"]
The preferred (not required) affinity means if no spot capacity is available, pods schedule on on-demand. No manual intervention needed during spot interruptions.
The Result
| Category | Before | After |
|---|---|---|
| Node count (avg) | 24 | 11 |
| Monthly spend | $14,200 | $5,600 |
| P99 API latency | 180ms | 165ms |
| Incident rate | baseline | no change |
The latency actually improved slightly — better bin-packing reduced noisy-neighbor effects between pods.
The biggest unlock was the audit step. You can’t optimize what you haven’t measured. Spend the first day on instrumentation; the rest of the sprint becomes obvious.
Found this useful? I write about AI engineering, distributed systems, and cloud infrastructure.