Fail Small, IaC Control Planes, and Automated RCA
Ship It Weekly · 2026-01-03 · 18 min
Episode notes
This week on Ship It Weekly, Brian kicks off the new year with one theme: automation is getting faster, and that makes blast radius and oversight matter more than ever.
We start with Cloudflare’s “fail small” mindset. The core idea is simple: big outages usually come from correlated failure, not one box dying. If a bad change lands everywhere at once, you’re toast. “Fail small” is about forcing problems to stay local so you can stop the bleeding before it becomes global.
Next is Pulumi’s push to be the control plane for all your IaC, including Terraform and HCL. The interesting part isn’t syntax wars. It’s the workflow layer: approvals, policy enforcement, audit trails, drift, and how teams standardize without signing up for a multi-year rewrite.
Third is Meta’s DrP, a root cause analysis platform that turns repeated incident investigation steps into software. Even if you’re not Meta, the pattern is worth stealing: automate the first 10 to 15 minutes of your most common incident types so on-call is consistent no matter who’s holding the pager.
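To make the “fail small” idea from these notes concrete, here is a minimal sketch of a staged rollout that halts at the first unhealthy stage, so a bad change never reaches later stages. It is an illustration only, not Cloudflare’s actual tooling: `staged_rollout`, `apply_change`, `is_healthy`, and the stage names are all hypothetical.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Apply a change one stage at a time and stop at the first unhealthy stage.

    Returns every target that received the change, in order.
    All names here are hypothetical; real systems would also roll back.
    """
    touched: list[str] = []
    for stage in stages:
        # Push the change to every target in the current (small) stage.
        for target in stage:
            apply_change(target)
            touched.append(target)
        # Gate: if any target in this stage is unhealthy, stop here so the
        # failure stays local instead of spreading to later stages.
        if not all(is_healthy(t) for t in stage):
            break
    return touched


# Hypothetical stages: one canary, then two regions, then everything else.
stages = [["canary-1"], ["region-a", "region-b"], ["region-c", "region-d", "region-e"]]
applied: list[str] = []
result = staged_rollout(stages, applied.append, lambda t: t != "region-b")
print(result)  # ['canary-1', 'region-a', 'region-b']
```

In this run the simulated failure in `region-b` stops the rollout after the second stage, so the three targets in the final stage are never touched. The smaller the early stages, the smaller the blast radius when the health gate trips.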
More from Ship It Weekly
- containerd CRI Vulnerabilities, Datadog PostgreSQL HA on Kubernetes, AWS DevOps Agent with Datadog MCP Server, EKS Control Plane Egress, and Why Users Feel the Wait
- Ship It Conversations: Guardsquare’s Joel DeStefano on Mobile App Security, Runtime Protection, App Hardening, and Why Scanning Isn’t Enough
- PeopleSoft Zero-Day Exploited, npm v12 Install Script Changes, GitHub Agentic Tokens, Anthropic Model Risk, and Default Trust Breaking
- Ship It Conversations: Meta’s Francois Richard on AI Incident Response, SLOs, and Reliability at Scale
- Coinbase Outage, Meta AI Account Recovery, AWS AgentCore Code Injection, Apigee Tenant Isolation, and the Glue That Breaks Production