How Datadog Monitors Its Own Infrastructure

The CTO Podcast with Fexingo · 2026-06-18 · 8 min

Episode notes

Episode 58 of The CTO Podcast goes inside Datadog's engineering org to explore how the company monitors its own 100-terabyte infrastructure. Lucas and Luna walk through Datadog's dogfooding culture, the architectural challenges of running a monitoring platform that has to monitor itself, and how the team handles alert fatigue, distributed tracing, and log ingestion at massive scale. They discuss specific tools, including the Datadog Agent, the trace-agent, and the custom time-series database built in-house. The episode includes concrete numbers: 30 trillion time-series points ingested daily, a 99.99 percent uptime target, and 8,000 hosts managed by the SRE team across multiple cloud providers. Tune in for a rare look at how the watcher watches itself.
