
Kubernetes Cluster Autoscaler Fails Under Spot Instance Interruptions

DevOps Daily with Fexingo: CI/CD, Kubernetes, and Modern Software Operations · 2026-06-10 · 10 min

Episode notes

Episode 43 of DevOps Daily with Fexingo digs into a little-known failure mode of the Kubernetes Cluster Autoscaler: after a spot instance interruption, it often fails to add replacement nodes quickly enough. Lucas explains how the default unready-node-taint strategy can delay scale-up by several minutes, leaving pods pending and potentially triggering cascading failures. He walks through a real incident at a mid-size SaaS company where a 40% spot interruption rate led to 15-minute scale-up latencies and broken SLOs. Luna questions whether unscheduled pods on a tainted node really need to be recreated from scratch, and the pair discuss workarounds: increasing the scale-up ratio of the CA's expansion options, using the new PodTopologySpread constraints, and pre-provisioning a small buffer of on-demand nodes. The episode also covers upcoming Kubernetes 1.32 changes to how the Cluster Autoscaler handles instance interruptions.

No ad breaks, ever. If today's tech conversation gave you something usable, consider supporting the show at buy me a coffee dot com slash fexingo.
