How Datadog Monitors Its Own Infrastructure
The CTO Podcast with Fexingo · 2026-06-18 · 8 min
Episode notes
Episode 58 of The CTO Podcast goes inside Datadog's engineering org to explore how the company monitors its own 100-terabyte infrastructure. Lucas and Luna walk through Datadog's dogfooding culture, the architectural challenges of running a monitoring platform on itself, and how the team handles alert fatigue, distributed tracing, and log ingestion at massive scale. They discuss specific tools, including the Datadog Agent, the trace-agent, and the custom time-series database built in-house. The episode includes concrete numbers: 30 trillion time-series points ingested daily, a 99.99 percent uptime target, and an SRE team managing 8,000 hosts across multiple cloud providers. Tune in for a rare look at how the watcher watches itself.