How GitLab Built a Single Codebase for One Million CI Pipelines

The CTO Podcast with Fexingo · 2026-06-25 · 8 min

Substance score

45 / 100

Five dimensions, 20 points each

Insight Density: 13 / 20

Originality: 9 / 20

Guest Caliber: 2 / 20

Specificity & Evidence: 14 / 20

Conversational Craft: 7 / 20

GitLab scaled its CI system to handle over one million pipelines per day by optimizing a single Rails monolith rather than breaking into microservices, focusing engineering effort on database partitioning, Redis Cluster queue management, and lock-free job scheduling mechanisms. The episode covers their architectural decisions, the elimination of global locks through atomic Redis operations, safe deployment practices using feature flags, and database migration strategies that enable zero-downtime changes at scale.

Key takeaways

PostgreSQL table partitioning by project ID and creation date significantly reduces query scans on high-volume tables like `ci_pipelines` in a monolithic architecture.
Redis Cluster cross-slot operation problems can be solved by using hash tags in key naming (e.g., `{pipeline_id}`) to colocate related jobs on the same node.
Lock-free job scheduling using atomic Redis operations like `WATCH`, `MULTI`, and Lua scripts eliminates global mutex bottlenecks that degrade performance at scale.
Feature flag-based rollouts allow safe gradual deployment of architectural changes, catching edge cases before full fleet exposure.
Zero-downtime database migrations require breaking changes into reversible, batched steps and establishing a database review process for every migration-related merge request.
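
The gradual rollout in the fourth takeaway can be sketched as a deterministic percentage gate. This is a minimal illustration in Python, not GitLab's feature flag implementation: the flag name `new_scheduler`, the function names, and the choice of SHA-256 for bucketing are all assumptions made for the example.

```python
import hashlib


def rollout_bucket(flag_name: str, project_id: int) -> int:
    """Map a (flag, project) pair to a stable bucket in the range 0..99.

    Hashing the flag name together with the project ID means each flag
    gets its own independent ordering of projects.
    """
    digest = hashlib.sha256(f"{flag_name}:{project_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, project_id: int, percentage: int) -> bool:
    """Return True if the project falls inside the current rollout percentage."""
    return rollout_bucket(flag_name, project_id) < percentage


if __name__ == "__main__":
    # Hypothetical ramp: 1% of projects, then 10%, then everyone.
    for pct in (1, 10, 100):
        enabled = sum(is_enabled("new_scheduler", p, pct) for p in range(10_000))
        print(f"{pct:>3}% target: {enabled} of 10000 projects enabled")
```

Because a project's bucket never changes, raising the percentage only adds projects. A project that saw the new code path at 1 percent keeps seeing it at 10 percent, which makes regressions easier to attribute during the ramp.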

Topics in this episode

GitLab · Rails · PostgreSQL partitioning · Redis Cluster · Redis Lua scripts · Sidekiq · Online schema changes (pt-osc, gh-ost) · Lock-free algorithms · Feature flags · Database migration strategy

What our scoring noted

Our reviewer’s read on each dimension, with quotes from the episode.

Insight Density

13 / 20

For an 8-minute episode the content is genuinely dense with actionable technical specifics - Redis hash-tag key design, lock-free scheduling via WATCH/MULTI/Lua, PostgreSQL partitioning strategy, and zero-downtime migration steps. However, some filler commentary and a clichéd closing drag the density down.

they partition the `ci_pipelines` table by project ID and by creation date, so queries for recent pipelines on a busy project don't scan the whole table

they moved to a lock-free design using atomic Redis operations - like `WATCH` and `MULTI` for optimistic locking, and Lua scripts for atomic updates
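
The optimistic-locking idea in the quote above can be sketched without a Redis server. The Python below is an in-memory stand-in, not redis-py and not GitLab's scheduler: `VersionedStore`, `claim_job`, and the key format `job:<id>:owner` are hypothetical. The version check plays the role of `WATCH`, and `commit_if_unchanged` plays the role of `MULTI`/`EXEC` aborting when a watched key has changed.

```python
import threading


class VersionedStore:
    """In-memory stand-in for a Redis key guarded by WATCH/MULTI/EXEC."""

    def __init__(self):
        # This lock only models Redis executing each command atomically.
        # It is never held across a caller's read-decide-write cycle.
        self._lock = threading.Lock()
        self._data = {}
        self._versions = {}

    def read(self, key):
        """Return (value, version); a missing key reads as (None, 0)."""
        with self._lock:
            return self._data.get(key), self._versions.get(key, 0)

    def commit_if_unchanged(self, key, expected_version, new_value):
        """Write only if nobody else wrote since `expected_version` was read."""
        with self._lock:
            if self._versions.get(key, 0) != expected_version:
                return False  # conflict: the caller must re-read and retry
            self._data[key] = new_value
            self._versions[key] = expected_version + 1
            return True


def claim_job(store, job_id, runner_id, max_attempts=5):
    """Try to become the owner of a job without holding a global mutex."""
    key = f"job:{job_id}:owner"
    for _ in range(max_attempts):
        owner, version = store.read(key)
        if owner is not None:
            return False  # another runner already owns the job
        if store.commit_if_unchanged(key, version, runner_id):
            return True
    return False


def race_for_job(n_runners, job_id=7):
    """Let n_runners threads compete for one job; return each claim result."""
    store = VersionedStore()
    results = []
    threads = [
        threading.Thread(
            target=lambda r=r: results.append(claim_job(store, job_id, f"runner-{r}"))
        )
        for r in range(n_runners)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The trade-off matches what the hosts describe: no runner waits on a shared lock while deciding, but the calling code has to handle a failed commit by re-reading and retrying.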

Originality

9 / 20

The 'monolith beats microservices at scale' argument is now a well-worn contrarian take rather than a genuinely fresh one, and the closing lesson ('the best architecture is the one you already have') is close to a platitude. The GitLab-specific technical details add genuine value, but the framing is recycled.

Everyone talks about microservices for scale, but GitLab doubled down on the monolith

the best architecture is the one you already have, if you're willing to optimize it relentlessly

Guest Caliber

2 / 20

There is no guest at all - this is two co-hosts synthesising GitLab's publicly available engineering blog posts. No practitioner who actually built or operated the system is present, which is a fundamental weakness for a show claiming to cover real engineering stories.

That's all for this episode of The CTO Podcast. We'll be back next time with another deep dive.

This show exists because a group of listeners chip in monthly

Specificity & Evidence

14 / 20

The episode names specific tables (ci_pipelines, ci_job_artifacts), specific Redis primitives (WATCH, MULTI, Lua scripts, hash tags), specific tools (pt-osc, gh-ost, Sidekiq), a concrete batch size (1000 rows), and a hard p99 metric - strong specificity for the duration.

they caught a migration that would have locked the `ci_job_artifacts` table for hours on GitLab.com

GitLab's CI system handles over a million pipelines daily with p99 job scheduling latency under one second

Conversational Craft

7 / 20

Luna asks a few genuine follow-up questions (on Redis Cluster usage, on the microservice debate) but the dialogue is clearly scripted, both hosts agree on everything throughout, and there is zero pushback or productive tension - closer to a structured explainer than a real interview.

did they ever consider rewriting CI as a separate service? Like, a dedicated CI microservice?

Hash tags - that's the `{pipeline_id}` pattern, right? So all keys for a pipeline go to the same node.
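
To make the hash-tag quote concrete: Redis Cluster assigns a key to one of 16384 slots using CRC16 of the key, and when the key contains a non-empty `{...}` section, only that section is hashed. The sketch below reimplements that rule in plain Python for illustration. The key names such as `{pipeline:42}:pending` are hypothetical, not GitLab's actual keys.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (the XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: str) -> int:
    """Compute the cluster slot for a key, honoring the hash-tag rule."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


if __name__ == "__main__":
    for key in ("{pipeline:42}:pending", "{pipeline:42}:running", "pipeline:42:pending"):
        print(key, "->", hash_slot(key))
```

The first two keys share the tag `pipeline:42` and therefore land in the same slot, so a multi-key command or Lua script touching both is allowed in cluster mode. The third key has no tag, is hashed as a whole, and may land elsewhere.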

Conversation analysis

Computed from the transcript - who did the talking, and the verbal tics along the way.

Filler words

so 9 · right 4 · like 3

Episode notes

In this episode, Lucas and Luna dive into the specific technical decisions GitLab made to scale its CI/CD platform from tens of thousands of pipelines to over one million per day. They focus on the choice of a monolithic application architecture, the use of PostgreSQL partitioning and Redis Cluster, and the trade-offs around job scheduling and queue management. The discussion highlights how GitLab avoided microservice complexity early on and instead optimized a single codebase to handle massive parallel workloads. Listeners will learn about the concept of 'fleet-wide coordination' and how GitLab's engineering team used feature flags and gradual rollout to maintain reliability during rapid growth. This is a concrete case study in scaling developer tools without rewriting everything.

Full transcript

8 min

Transcribed and scored by The B2B Podcast Index.

Lucas: So it's mid-2026, and GitLab is running north of a million CI pipelines a day. That's not just a flex, it's a genuine infrastructure challenge - how do you build a system that can schedule, queue, and execute those jobs reliably without collapsing under its own weight?

Luna: Especially when each pipeline can have hundreds of jobs, and users expect results in minutes, not hours.

Lucas: Right. And what's interesting about GitLab's approach is that they've largely stuck with a monolithic application architecture. They didn't break out CI into a thousand microservices. Instead, they optimized a single Rails codebase to handle that scale. The engineering team wrote a lot about this around 2023 and 2024, and it's still the backbone today.

Luna: That feels almost contrarian. Everyone talks about microservices for scale, but GitLab doubled down on the monolith.

Lucas: Exactly. And the reasoning is pragmatic. They realized early that the bottleneck wasn't the application code per se - it was the database and the job scheduler. So they poured effort into PostgreSQL partitioning and Redis Cluster for queue management. For example, they partition the `ci_pipelines` table by project ID and by creation date, so queries for recent pipelines on a busy project don't scan the whole table.

Luna: That makes sense. And Redis Cluster - are they using it for the job queue itself?

Lucas: Yes. The job queue is essentially a set of Redis lists and sorted sets. But they ran into issues with Redis Cluster's cross-slot operations. So they had to design their queue keys to hash to the same slot for related jobs. That meant careful key naming - using the same hash tag for all jobs belonging to the same pipeline.

Luna: Hash tags - that's the `{pipeline_id}` pattern, right? So all keys for a pipeline go to the same node.

Lucas: Exactly. It's a small detail, but it avoids multi-key operations that fail in cluster mode. They also moved from a single global queue to a 'fleet-wide coordination' model where each runner picks up jobs based on tags and capacity. That's more complex but more scalable.

Luna: And they have to handle retries, cancellations, and dynamic scaling of runners. It's not just queuing - it's state management.

Lucas: Right. And one of the hardest parts was eliminating global locks. In early versions, they had a mutex around job scheduling to prevent duplicate assignments. But at a million pipelines, that lock became a bottleneck. So they moved to a lock-free design using atomic Redis operations - like `WATCH` and `MULTI` for optimistic locking, and Lua scripts for atomic updates.

Luna: That's clever. But it also means more complexity in the application code to handle retries on conflicts.

Lucas: Exactly. And they embraced that. The thing I respect is how they used feature flags to roll out these changes gradually. They'd enable the new scheduler for 1 percent of projects, monitor for regressions, then ramp up. That let them catch edge cases without blowing up the whole fleet.

Luna: It's a good reminder that scaling isn't just about new architecture - it's about safe deployment. One thing I'm curious about: did they ever consider rewriting CI as a separate service? Like, a dedicated CI microservice?

Lucas: They did. There was a lot of internal debate around 2021. But the conclusion was that the cost of splitting - especially the data consistency challenges - outweighed the benefits. You'd have to replicate pipeline data between services, deal with eventual consistency, and manage two deployment pipelines. It's not trivial. So they kept the monolith but made it modular internally.

Luna: Modular monolith - that's the term. And they used Sidekiq for background jobs, which is another layer of queuing. Did they run into issues there?

Lucas: Yeah, Sidekiq was fine for a while, but as job volumes grew, they had to shard Sidekiq queues by priority and by job type. They also moved some critical jobs to a separate Redis instance to avoid noisy neighbor effects. For example, pipeline creation jobs went to a dedicated queue with its own Redis, so a burst of low-priority jobs wouldn't delay a new pipeline from starting.

Luna: That's a classic pattern - separate infrastructure for critical vs. non-critical workloads.

Lucas: And the results speak for themselves. GitLab's CI system handles over a million pipelines daily with p99 job scheduling latency under one second. That's from a single Rails app, heavily optimized. It's a great case study for anyone building developer tools.

Luna: Speaking of building and running things - Lucas, I know we usually just dive into the technical details, but I want to pause for a second. This show exists because a group of listeners chip in monthly, and that's what keeps it ad-free and focused on real engineering stories.

Lucas: Yeah, it's a small group, but it makes a huge difference. If these conversations are useful for what you're building or running, you can join them at buy me a coffee dot com slash fexingo. No pressure, just mentioning it because it's the only reason we can keep doing deep dives like this.

Luna: And we really mean that - no ads, no sponsor reads, just the content. So thanks to everyone who already supports it.

Lucas: Alright, back to GitLab. One more angle I want to touch on: how they handle database migrations at scale. When you have a million pipelines a day, even a simple schema change can cause downtime.

Luna: Right. They use a process called 'zero-downtime migrations' from Day One. All migrations must be reversible and batched.

Lucas: Exactly. They break large changes into multiple steps. For example, adding a column with a default value: first they add the column as nullable, then backfill default values in batches of 1000 rows, then make it not null. This avoids long table locks. They also use `pt-osc` or `gh-ost` for online schema changes, but built their own tooling around it.

Luna: And they also have a policy of 'database review' for every merge request that touches the data layer. That is a huge culture investment.

Lucas: Yeah, they have a team of database reviewers who are assigned to every MR that includes a migration or a performance-sensitive query. That catches issues early. For example, they caught a migration that would have locked the `ci_job_artifacts` table for hours on GitLab.com. They redesigned it to use incremental cleanup instead.

Luna: It's that combination of technical rigor and process that makes it work. The monolith itself isn't magical - it's the discipline around managing it.

Lucas: Exactly. And that's the takeaway for me. GitLab didn't need to rewrite everything. They invested in understanding where the real bottlenecks were - database, queueing, deployment - and solved those specifically. The result is a system that's still growing.

Luna: And it's a reminder that sometimes the best architecture is the one you already have, if you're willing to optimize it relentlessly.

Lucas: Well said. That's all for this episode of The CTO Podcast. We'll be back next time with another deep dive.
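
The three-step column addition Lucas describes (add nullable, backfill in batches of 1000, then enforce not null) can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not GitLab's migration tooling: the column name `priority_hint` and the helper names are hypothetical, and SQLite stands in for PostgreSQL. SQLite cannot add a NOT NULL constraint to an existing column, so the final step here only verifies that no NULLs remain, which is the precondition for the PostgreSQL `ALTER TABLE ... ALTER COLUMN ... SET NOT NULL` step.

```python
import sqlite3

BATCH_SIZE = 1000  # batch size mentioned in the episode


def backfill_in_batches(conn, table, column, default, batch_size=BATCH_SIZE):
    """Fill NULLs in small committed batches; return the number of batches run.

    Table and column names are interpolated directly, so they must come
    from trusted migration code, never from user input.
    """
    batches = 0
    while True:
        cur = conn.execute(
            f"UPDATE {table} SET {column} = ? "
            f"WHERE id IN (SELECT id FROM {table} WHERE {column} IS NULL LIMIT ?)",
            (default, batch_size),
        )
        conn.commit()  # short transactions keep lock hold times small
        if cur.rowcount == 0:
            return batches
        batches += 1


def demo(row_count=2500):
    """Run the three steps on a toy table; return (batches, remaining NULLs)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ci_pipelines (id INTEGER PRIMARY KEY, project_id INTEGER)")
    conn.executemany(
        "INSERT INTO ci_pipelines (project_id) VALUES (?)",
        [(i % 7,) for i in range(row_count)],
    )
    conn.commit()

    # Step 1: add the column as nullable, with no default to rewrite rows.
    conn.execute("ALTER TABLE ci_pipelines ADD COLUMN priority_hint TEXT")
    conn.commit()

    # Step 2: backfill existing rows in batches.
    batches = backfill_in_batches(conn, "ci_pipelines", "priority_hint", "normal")

    # Step 3 precondition: no NULLs left before the constraint is enforced.
    remaining = conn.execute(
        "SELECT COUNT(*) FROM ci_pipelines WHERE priority_hint IS NULL"
    ).fetchone()[0]
    conn.close()
    return batches, remaining


if __name__ == "__main__":
    print(demo())
```

Each batch commits on its own, so no single statement holds a lock on the whole table for long, and an interrupted backfill can simply be restarted because it only touches rows that are still NULL.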
