Ship It Conversations: Meta’s Francois Richard on AI Incident Response, SLOs, and Reliability at Scale
Ship It Weekly · 2026-06-16 · 43 min
Episode notes
This is a guest conversation episode of Ship It Weekly , separate from the weekly news recaps. In this Ship It: Conversations episode, I talk with Francois Richard, Engineering Director at Meta, about reliability at scale, how AI is changing production risk, what teams actually learn from incidents, and why recovery practice matters just as much as prevention. We talk about the proactive and reactive sides of reliability, why SLOs should represent a promise to users instead of just another dashboard number, how incident reviews should drive real system improvements, and how teams can practice recovery before production forces the lesson on them. The bigger theme here is that reliability is not just about avoiding failure. It is about knowing what happens when prevention fails. That means practicing regional failure, understanding overload behavior, improving incident response, using AI carefully during investigation, and making reliability targets match the actual lifecycle and importance of the system.
More from Ship It Weekly
All episodes →- containerd CRI Vulnerabilities, Datadog PostgreSQL HA on Kubernetes, AWS DevOps Agent with Datadog MCP Server, EKS Control Plane Egress, and Why Users Feel the Wait50 / 100
- Ship It Conversations: Guardsquare’s Joel DeStefano on Mobile App Security, Runtime Protection, App Hardening, and Why Scanning Isn’t Enough35 / 100
- PeopleSoft Zero-Day Exploited, npm v12 Install Script Changes, GitHub Agentic Tokens, Anthropic Model Risk, and Default Trust Breaking28 / 100
- Coinbase Outage, Meta AI Account Recovery, AWS AgentCore Code Injection, Apigee Tenant Isolation, and the Glue That Breaks Production
- Kiro CLI Approval Bypass, Amazon Braket Pickle Risk, AWS Org Logging, KEDA Upgrades, and Automation’s Hidden Boundaries