containerd CRI Vulnerabilities, Datadog PostgreSQL HA on Kubernetes, AWS DevOps Agent with Datadog MCP Server, EKS Control Plane Egress, and Why Users Feel the Wait

Ship It Weekly · 2026-06-26 · 19 min

Substance score

50 / 100

Five dimensions, 20 points each

Insight Density: 13 / 20

Originality: 11 / 20

Guest Caliber: 6 / 20

Specificity & Evidence: 12 / 20

Conversational Craft: 8 / 20

This episode covers five major infrastructure stories: containerd CRI plugin vulnerabilities affecting Kubernetes runtimes; Datadog's gameday finding that its PostgreSQL failover on Kubernetes worked but could not safely promote a standby under replication lag; AWS DevOps Agent and Datadog MCP Server reaching GA for AI-assisted incident response; EKS customer-routed control plane egress expanding the blast radius; and Marc Brooker's analysis of why users experience latency differently than dashboards report it.

Key takeaways

Patch containerd across all node groups and managed runtimes immediately, as vulnerabilities affect the trust boundary beneath Kubernetes security controls and can enable host command execution and file reads from seemingly innocuous metadata like labels and log paths.
Test database HA failover under real constraints like replication lag and zone failures to ensure your automation prioritizes safety over speed, since unsafe failover trades an outage for a corruption event.
Treat AI incident response agents like production automation: grant least privilege, make the human approval boundary explicit, separate recommendation authority from execution authority, and grade them on safety and auditability rather than speed alone.
Owning EKS control plane egress in customer-routed mode means mapping traffic dependencies, testing webhook and OIDC paths, and writing runbooks before deployment, because a wrong route table, security group, or NACL can quietly break admission webhooks and OIDC authentication.
Measure reliability from the user's side, not just from server metrics: users are more likely to land in the long waits than averages suggest, and what they experience is the waiting, not the mean.
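
The EKS takeaway asks what actually breaks when the API server cannot reach an admission webhook. A minimal sketch of that reasoning follows. It relies on the Kubernetes `failurePolicy` field (`Fail` rejects matching requests when the webhook cannot be called, `Ignore` lets them through). The webhook names, the `reachable` flag, and the helper functions are hypothetical and stand in for your own inventory and reachability tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Webhook:
    name: str            # hypothetical webhook name
    failure_policy: str  # "Fail" or "Ignore", as set in the webhook configuration
    reachable: bool      # result of your own test of the egress path (assumed input)


def impact(webhook: Webhook) -> str:
    """Describe what happens to matching API requests for one webhook."""
    if webhook.reachable:
        return "ok"
    if webhook.failure_policy == "Fail":
        return "matching requests rejected"
    if webhook.failure_policy == "Ignore":
        return "matching requests admitted unchecked"
    raise ValueError(f"unknown failurePolicy: {webhook.failure_policy!r}")


def blast_radius(webhooks: list[Webhook]) -> dict[str, str]:
    """Map each unreachable webhook to its impact on the cluster."""
    return {w.name: impact(w) for w in webhooks if not w.reachable}


# Hypothetical inventory after a bad route-table change.
inventory = [
    Webhook("policy-enforcer", "Fail", reachable=False),
    Webhook("sidecar-injector", "Ignore", reachable=False),
    Webhook("image-verifier", "Fail", reachable=True),
]
report = blast_radius(inventory)
for name, effect in report.items():
    print(f"{name}: {effect}")
```

Both outcomes are failure modes worth a runbook entry: `Fail` turns a network problem into blocked deploys, while `Ignore` turns it into policy that silently stops being enforced.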

Guests

Marc Brooker (author of the post discussed in the closing segment; not an on-air guest)

Topics in this episode

containerd CRI plugin vulnerabilities
Kubernetes admission webhooks
PostgreSQL HA on Kubernetes
Patroni
Datadog MCP Server
AWS DevOps Agent
EKS customer-routed control plane egress
GitHub credential revocation
AWS Management Console private access
Vercel Connect

What our scoring noted

Our reviewer’s read on each dimension, with quotes from the episode.

Insight Density

13 / 20

The episode synthesises multiple stories with genuine operational framing - explaining *why* each development matters for on-call practitioners rather than just announcing news. However, many of the 'takeaways' (least privilege, write runbooks, test failure modes) are standard SRE doctrine that experienced operators already know, keeping it out of the top tier.

Your node runtime is the trust boundary under the trust boundary. Stop treating it like invisible plumbing.

The problem wasn't that the database couldn't fail over. It was that it couldn't fail over safely.

Originality

11 / 20

There are some genuinely sharp framings - containerd as the 'trust boundary under the trust boundary,' the distinction between 'available' and 'correct' in HA, and applying the inspection paradox to MTTR - but these are largely repackagings of existing ideas rather than novel first-principles arguments.

Because by the time a workload reaches the runtime, the rest of the system has already decided this thing is allowed to exist.

unsafe failover can be a lot worse

Guest Caliber

6 / 20

This is a solo-host news commentary format with no guests at all; there is no practitioner expertise to evaluate beyond the host's commentary, and the transcript provides no evidence of Brian Teller's own operational experience at scale.

I'm Brian Teller from Teller's Tech, and this is Ship It Weekly.

Specificity & Evidence

12 / 20

The episode names specific CVE categories, containerd version ranges, and concrete technologies (Patroni, Bottlerocket, EKS, Datadog MCP Server), which is solid for a news roundup. However, there are no hard metrics from real incidents, no recovery-time figures from the Datadog gameday, and the illustrative numbers are hypothetical rather than empirical.

AWS published a security bulletin spanning containerd branches 1.7 through 2.3

Image cache poisoning through checkpoint image references. Host command execution through unsanitized image labels. CDI annotation handling that can inject devices and host mounts

Conversational Craft

8 / 20

The solo commentary is well-structured, uses pointed rhetorical questions to sharpen each segment, and avoids pure PR repetition of announcements. But there is no actual interview, no guest to push back on, and no moments of genuine productive friction - the format caps the ceiling on this dimension by design.

Can you fail over without losing writes? Can you prove which standby is safe to promote? Can your automation tell the difference between available and correct?
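
The questions quoted above can be read as a promotion guard: automation should be able to answer "nobody is safe to promote" and wait. The following is a simplified sketch of that idea, not Datadog's or Patroni's implementation. The `Standby` fields, the `max_lag_bytes` threshold, and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Standby:
    name: str
    replay_lag_bytes: int  # distance behind the last known primary position (assumed input)
    synchronous: bool      # True if commits on the primary waited for this standby


def pick_promotion_candidate(
    standbys: list[Standby], max_lag_bytes: int = 0
) -> Optional[Standby]:
    """Return a standby that is safe to promote, or None if the safe move is to wait."""
    safe = [
        s for s in standbys
        if s.synchronous or s.replay_lag_bytes <= max_lag_bytes
    ]
    if not safe:
        return None  # prefer "correct" over "available": do not promote a stale standby
    return min(safe, key=lambda s: s.replay_lag_bytes)


# Zonal failure with every asynchronous standby behind: the guard refuses to promote.
lagging = [Standby("pg-b", 5000, False), Standby("pg-c", 900, False)]
print(pick_promotion_candidate(lagging))

# With a synchronous standby available, promotion is safe.
mixed = [Standby("pg-b", 5000, False), Standby("pg-d", 0, True)]
print(pick_promotion_candidate(mixed).name)
```

The design choice to agree on before an incident is the value of `max_lag_bytes`: zero prefers safety, anything larger is an explicit decision about how much data loss is acceptable in exchange for availability.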

Can it only read? Can it write? Can it open tickets? Trigger automation? Roll back a deploy? Restart a service? Change config? Page a human at 4 a.m.?
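
One way to make those authority questions concrete is a small policy check that separates what an agent may recommend from what it may execute, and logs every decision. This is an illustrative sketch only. The action catalogue, authority levels, and `authorize` function are hypothetical and do not describe the AWS DevOps Agent or Datadog MCP Server permission models.

```python
from enum import Enum


class Authority(Enum):
    READ = 1       # query telemetry only
    RECOMMEND = 2  # propose an action to a human
    EXECUTE = 3    # change production state


# Hypothetical catalogue: the authority each action requires.
REQUIRED = {
    "query_logs": Authority.READ,
    "open_ticket": Authority.RECOMMEND,
    "rollback_deploy": Authority.EXECUTE,
    "restart_service": Authority.EXECUTE,
}

audit_log = []  # every decision is recorded, allowed or not


def authorize(action: str, granted: Authority, human_approved: bool = False) -> str:
    """Return 'allow', 'recommend_only', or 'deny' for one requested action."""
    required = REQUIRED.get(action)
    if required is None:
        decision = "deny"  # unknown actions are denied by default
    elif required.value <= granted.value and (
        required is not Authority.EXECUTE or human_approved
    ):
        decision = "allow"
    elif required is Authority.EXECUTE and granted.value >= Authority.RECOMMEND.value:
        decision = "recommend_only"  # the agent may propose it, a human must run it
    else:
        decision = "deny"
    audit_log.append((action, granted.name, human_approved, decision))
    return decision


print(authorize("query_logs", Authority.READ))
print(authorize("rollback_deploy", Authority.RECOMMEND))
print(authorize("rollback_deploy", Authority.EXECUTE, human_approved=True))
```

In this sketch even an agent granted `EXECUTE` falls back to `recommend_only` without explicit human approval, which keeps the approval boundary visible in the audit log.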

Conversation analysis

Computed from the transcript - who did the talking, and the verbal tics along the way.

Filler words

so 8 · like 5 · actually 5 · right 4 · you know 1 · kind of 1 · honestly 1

Episode notes

This week on Ship It Weekly: containerd disclosed a batch of CRI plugin vulnerabilities, Datadog tested PostgreSQL high availability on Kubernetes and found that failover is not useful if it cannot happen safely, AWS DevOps Agent and Datadog MCP Server moved AI incident response closer to real production workflows, and Amazon EKS added customer-routed control-plane egress. The bigger theme: the control plane keeps getting wider. Runtimes, databases, incident agents, API-server egress, credentials, the cloud console, and object metadata are all becoming part of the production blast radius. And when something breaks, users do not experience your architecture diagram. They experience waiting. In the lightning round, Brian covers GitHub self-service credential revocation for incident response, AWS Management Console Private Access without internet connectivity, Vercel Connect and short-lived agent credentials, and Amazon S3 annotations.

Full transcript

19 min

Transcribed and scored by The B2B Podcast Index.

This week, containerd disclosed a stack of CRI plugin vulnerabilities in the runtime layer a huge number of Kubernetes nodes trust to start your containers. Datadog ran a PostgreSQL gameday and learned their database could fail over just fine. It just couldn't do it safely. AWS DevOps Agent and Datadog's MCP Server are both now generally available. And the new AWS integration means AI incident response just graduated from demo to on-call rotation. And EKS will now route your Kubernetes control plane's outbound traffic through your own VPC, which is great, right up until a stale route table quietly kills your admission webhooks.

Put those together and the shape of the episode is pretty clear. The control plane keeps getting wider. Runtimes. Databases. Incident agents. API-server egress. Credentials. Even the cloud console. One by one, they are all sliding into your production blast radius. And here's the part that matters. Your users don't care which control plane failed. They just feel the wait.

I'm Brian Teller from Teller's Tech, and this is Ship It Weekly.

Welcome back to Ship It Weekly, the show about the DevOps, SRE, cloud, platform, and security stories that actually matter when you are the person who has to keep the thing running at 3 a.m. If you are new here, follow or subscribe wherever you are watching or listening. And if you want the weekly story list and source links, check out OnCallBrief.com. For past episodes, full show notes, and more from the show, head over to ShipItWeekly.fm.

We open with the containerd CRI plugin vulnerabilities, because your node runtime is the trust boundary underneath the trust boundary. Then, Datadog's PostgreSQL HA gameday, where the scary discovery wasn't that failover was hard, it was that failover was unsafe. After that, AWS DevOps Agent and Datadog MCP Server going GA, and what it means when an AI agent gets a seat near your control plane. Then, EKS customer-routed control-plane egress.
Because your API server is now part of your network perimeter, whether you plan for it or not. In the lightning round, GitHub credential revocation, AWS Console Private Access, Vercel Connect, and S3 annotations. And we close with Marc Brooker on waiting, on why your customers live in the tail of your latency distribution, even when your dashboards swear everything's fine. Let's get into it.

First up, containerd has a batch of CRI plugin vulnerabilities. And if you run Kubernetes, this one's yours. AWS published a security bulletin spanning containerd branches 1.7 through 2.3. And the list is not a fun read. Image cache poisoning through checkpoint image references. Host command execution through unsanitized image labels. CDI annotation handling that can inject devices and host mounts. Host file reads through symlinked container log paths during checkpoint restore. And a denial of service from crafted images that exhaust memory. So not exactly a relaxing Patch Tuesday.

Here's why it matters. containerd sits underneath an enormous number of clusters, and we spend almost all of our security attention on the layers above it. Pod specs, admission control, image scanning, RBAC, network policy, runtime classes, all the familiar Kubernetes machinery. But eventually, something has to actually pull the image, unpack it, restore it, wire up devices, handle the logs, and start the container. That layer is a trust boundary too. And in some ways, it's the more dangerous one. Because by the time a workload reaches the runtime, the rest of the system has already decided this thing is allowed to exist.

That's why the boring fields turn out to matter. Labels, annotations, checkpoint and restore paths, CDI, log paths. Every field that feels like plumbing can become an input to privileged behavior on the node. A malicious image isn't just application code. It's metadata, build-time weirdness, and a set of assumptions the runtime makes about what it can trust.
The takeaway is direct. Patch containerd. Check your managed node groups, your self-managed nodes, your AMIs, your Bottlerocket versions, your distro packages, anything that controls the runtime. If you lean on checkpoint restore, CDI devices, or GPU workloads, look harder. And if you don't use any of that, don't relax. At least one of these issues doesn't need checkpoint and restore turned on at all. Your node runtime is the trust boundary under the trust boundary. Stop treating it like invisible plumbing.

Second story. Datadog published a genuinely good engineering write-up on running high-availability PostgreSQL on Kubernetes. And it's one of those pieces that sounds boring until the real problem comes into focus. The problem wasn't that the database couldn't fail over. It was that it couldn't fail over safely.

During a gameday, Datadog simulated a zonal failure. That added network latency, replication lag grew, and when the cluster needed a new primary, Patroni couldn't safely promote a standby without risking data loss. So the system got stuck in the worst possible spot. The old primary was unhealthy. The standbys weren't safe to promote, and the only correct move was to wait.

That's the kind of failure mode that ages every SRE in the room about three years. Because on paper, you have everything. Multiple nodes, standbys, Kubernetes, automation, failover machinery. And then the actual failure arrives and the system says, yes, but not safely. Which, honestly, is the right answer. Promoting a stale standby might hand you a writable primary faster. But if it costs you data loss, split brain, or a broken consistency guarantee, you haven't fixed the outage. You've traded it for a corruption event. That's not an improvement. It's just a different postmortem.

The real lesson is that HA isn't only about whether the service comes back. It's about whether the recovery path itself is safe. Can you fail over without losing writes?
Can you prove which standby is safe to promote? Can your automation tell the difference between available and correct? And does your whole team agree on which one it should prefer before the incident call is on fire?

Datadog's answer was to move toward synchronous replication and stronger Patroni guardrails, so a promoted standby is guaranteed to have the writes it needs. And that's the part that's worth copying. They didn't just ask how to recover faster. They asked how to recover safely.

So test your database HA against real constraints, not the easy ones. Ask what happens under replication lag. Ask what happens during a zone failure. Ask what happens when the network is slow instead of cleanly dead. Ask what happens when every standby is behind. And ask whether your automation prefers safety or availability, and whether everyone actually agrees with that choice. Because failover is useless if the only safe option is waiting. But unsafe failover can be a lot worse.

Third story. AWS DevOps Agent is now generally available, and Datadog's MCP Server is GA as a standard way for AI agents to reach Datadog monitoring data. This is one of those announcements where the slide says autonomous incident resolution and the operator says, cool, but what exactly is it allowed to touch?

The idea is solid. AWS DevOps Agent can work through Datadog MCP Server to investigate an incident across logs, metrics, traces, deployment events, and AWS infrastructure context. Instead of one engineer bouncing between CloudWatch, Datadog, deploy history, traces, dashboards, and Slack, the agent correlates the signals and helps push the incident forward. And nobody wants to spend the first 30 minutes of an outage doing browser-tab archaeology. If an agent can gather context, summarize what changed, flag a suspicious deploy, and propose likely causes, that's real time saved. But this is also the moment AI incident response stops being a chatbot and becomes a production workflow.
It's an agent reading operational telemetry, interpreting signals, recommending fixes, and potentially wired into Slack, PagerDuty, ServiceNow, your code, your deploys, and your runbooks. That puts it right next to the control plane. And once something sits next to the control plane, the question stops being "is it smart?" and becomes "what authority does it have?" Can it only read? Can it write? Can it open tickets? Trigger automation? Roll back a deploy? Restart a service? Change config? Page a human at 4 a.m.? Can it make things worse quickly and very confidently? That last one is the whole game. Incident response isn't about speed. It's about safe speed.

So treat AI incident tooling like any other production automation. Give it the least privilege that still leaves it useful. Log what it sees and what it does. Make the human approval boundary impossible to miss. And draw a hard line between what it can recommend and what it can execute. Have rollback rules. Know what happens when it's wrong. And don't grade it only on time to answer. Grade it on whether the answer was safe, auditable, and actually useful under pressure. AI incident response is moving from demo to production. That's exciting. Production just needs guardrails.

Fourth story. Amazon EKS now supports customer-routed control-plane egress. That's a very AWS phrase. So here's the human version. The Kubernetes API server sometimes needs to call outward to admission webhooks, OIDC providers, aggregated API servers, other endpoints that you control. Historically, that outbound traffic took AWS-managed egress paths. Now you can route it through your own VPC, which hands platform teams control over routing, inspection, firewalls, NAT, private connectivity, and compliance boundaries. For regulated environments, that's a real win. It also makes the control plane feel a lot more like part of your network, which of course it always was.
The difference is that now you own the outbound path, and AWS is blunt about what that ownership means. In customer-routed mode, you are responsible for making sure the control plane can reach the endpoints it needs. Wrong route table, too-tight security group, a NACL that blocks the wrong thing, a broken firewall hop, and control plane operations start failing. That includes admission webhook calls and OIDC authentication.

So yes, great feature. But it isn't a checkbox. It's a failure mode change. If your API server can't reach an admission webhook, do pod creates fail? Do deploys hang? Does authentication break? Does your incident response now depend on a firewall path some other team owns? And do you have a metric, a test, and a name on the pager for when it breaks?

This is a feature you bring to a design review. Not because it's risky, but because it's powerful. Map the traffic. Map the dependencies. Test the webhooks. Test OIDC. Test the failure modes. Make the routing visible. And write the runbook before the control plane starts failing in creative ways. The Kubernetes control plane is becoming part of your network perimeter. Treat it like one.

Quick lightning round. First, GitHub added self-service credential revocation for incident response. Enterprise owners now get a break-glass capability to revoke a compromised user's credentials in one move. This matters because credential cleanup should never be a scavenger hunt. You do not want to be hand-hunting through SSO authorizations, personal access tokens, SSH keys, and OAuth grants while everyone argues in Slack. Revocation is incident response infrastructure. Know who can trigger it, know what it kills, know what it logs, and put it in the compromised-account runbook.

Second, AWS Management Console private access now works without internet connectivity. Console traffic for supported services can flow over VPC endpoints instead of the public internet. It's a strong story for regulated environments.
Even the console is getting pulled behind private network boundaries. The lesson? Console access is part of your control plane too. And PrivateLink, endpoint policies, and known-account restrictions are becoming cloud operations, not just app networking.

Third, Vercel shipped Vercel Connect. And the idea worth catching is runtime credential exchange. Instead of stashing a long-lived provider token for an agent, the app proves its identity and gets a short-lived task-scoped credential. That's the pattern that we've been tracking for weeks. Agent credentials moving from "store this token forever" to "prove who you are and get scoped access when you need it." Short-lived credentials don't solve every agent security problem, but they beat long-lived secrets sitting around waiting to become next quarter's incident.

Fourth, Amazon S3 annotations are here: mutable, queryable context attached directly to S3 objects. Sounds dull, but object metadata has driven a lot of awkward platform design over the years. Side tables, DynamoDB metadata stores, Lambda sync jobs, custom catalogs, and constant drift between the object and whatever's describing it. If annotations shrink that glue layer, that's worth watching. Object metadata is quietly becoming a first-class platform layer, especially for data, AI, search, and agent workflows that need to know what an object is, not just where it lives.

The human closer this week comes from a Marc Brooker post about waiting, latency, MTTR, and why averages can lie. The point is that your users don't experience your averages the way your dashboards report them. You measure mean latency, mean time to recovery, average outage duration. But people are far more likely to land in the long waits simply because long waits take up more of the time. That's the inspection paradox. A 10-minute outage catches a few users. A 10-hour outage catches a lot of them. Your incident tracker counts both as one outage. Your dashboard says MTTR looks fine.
Your users say they spent all morning waiting. Both are true.

And that's the whole episode, really. When the system breaks, nobody experiences your architecture diagram. They experience waiting. Waiting for a request. Waiting for recovery. Waiting for a credential to get revoked. Waiting for a deploy to stop failing. Waiting for the control plane to come back. Waiting for someone to find the right context.

So here's the takeaway. Don't only measure the system from the server side. Measure it from the waiting side. Because your users don't live in your average. They live in the tail. And the tail is usually where the real reliability story is hiding.

That's it for this week of Ship It Weekly. We covered containerd runtime risk, Postgres failover safety, AI incident response, EKS control-plane egress, and why your users feel the wait more than your dashboards show. If this episode was useful, follow or subscribe wherever you are watching or listening. If you're on YouTube, hit subscribe. If you're in a podcast app, follow the show there. And if you know someone wrestling with Kubernetes runtime security, database failover, AI incident response, or platform control planes, send them this one. It genuinely helps the show grow, and it helps me keep making this for people who actually live with these systems.

You can find the weekly brief at OnCallBrief.com and the full show notes, links, and past episodes at ShipItWeekly.fm. I'm Brian Teller from Teller's Tech. Thanks for listening. And remember, your dashboards measure the average. Your users feel the wait.
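
The inspection paradox in the closing segment can be worked through with the two outages the transcript mentions. The sketch below assumes users arrive at a steady rate, so the number of users an outage catches is proportional to its length. That assumption and the function names are illustrative and are not taken from Marc Brooker's post.

```python
def mean_duration(outages_min: list[float]) -> float:
    """Per-incident average: what an MTTR dashboard reports."""
    return sum(outages_min) / len(outages_min)


def experienced_mean_duration(outages_min: list[float]) -> float:
    """Average length of the outage a randomly chosen affected user is caught in.

    Each outage is weighted by its own duration, because a longer outage
    catches proportionally more users (steady-arrival assumption).
    """
    return sum(d * d for d in outages_min) / sum(outages_min)


outages = [10, 600]  # a 10-minute outage and a 10-hour outage, in minutes

dashboard_view = mean_duration(outages)
user_view = experienced_mean_duration(outages)
share_in_long_outage = max(outages) / sum(outages)

print(f"per-incident mean: {dashboard_view:.1f} min")         # 305.0 min
print(f"user-experienced mean: {user_view:.1f} min")          # 590.3 min
print(f"affected users in the 10-hour outage: {share_in_long_outage:.1%}")  # 98.4%
```

The incident tracker sees two outages averaging about five hours, while roughly 98% of affected users were caught in the ten-hour one. Both numbers are correct, and only the second describes the wait.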
